PDF rendering at scale — architecture for 1M PDFs/day

Generating one PDF from HTML is a five-minute tutorial. Generating a million a day, 24×7, with 99.9% availability and without leaking customer data, is a multi-year engineering investment. This post is the architecture I’d build if I were starting that journey today — based on the experience of building 21pdf and watching competitors succeed and fail at the same problem.

It’s production-focused. If you’re evaluating whether to self-host (see the library-vs-API post) or just curious how a managed API works under the hood, this fills in the operational layer.

TL;DR

  • Shape the system as an async job queue, not a synchronous HTTP service. POST /convert → job_id → poll or webhook → download. This lets Chromium take 200ms-30s without holding HTTP connections open.
  • Chromium pool, not single browsers. Long-lived processes (recycled every 500-2000 requests), fresh tab per request, 3-5 concurrent tabs per process.
  • Dual rate limits: monthly quota (402) and per-customer concurrency (429). Different error codes for different classes of back-pressure.
  • SSRF defence in depth: HTTP-boundary private-IP block + in-browser request interceptor. Both layers, not one.
  • Monitoring the four golden signals: latency, queue depth, success rate, worker memory. Alert on trends, not single spikes.
  • Expect and plan for: Chromium crashes, memory leaks, font fallbacks, poison-pill jobs, noisy-neighbour workloads. Each is tractable individually; collectively they’re the service’s operational cost.

The architecture at a glance

 ┌──────────────┐   ┌───────────────┐   ┌─────────────────┐   ┌──────────────┐
 │   Client     │──▶│  API server   │──▶│  Redis / SQS    │──▶│  Worker      │
 │  (your app)  │   │  (auth, quota │   │  job queue      │   │  (Chromium   │
 │              │   │   HMAC, SSRF  │   │                 │   │   pool)      │
 │              │   │   boundary)   │   │                 │   │              │
 └──────▲───────┘   └───────────────┘   └─────────────────┘   └──────┬───────┘
        │                   │                                        │
        │                   │ status?                                 │ PDF bytes
        │                   ▼                                        ▼
        │           ┌───────────────┐                          ┌──────────────┐
        └───────────│  Jobs table   │◀─────────────────────────│  Object      │
              GET   │  (Postgres)   │     record + pointer     │  storage     │
              job   │               │                          │  (S3/R2/GCS) │
              PDF   └───────────────┘                          └──────────────┘

Five components, each with its own concerns:

  1. API server — accepts convert requests, authenticates, enforces quotas/concurrency, SSRF-checks URLs, inserts a job row, enqueues.
  2. Job queue — durable, at-least-once delivery. Redis with BullMQ is the common starting point.
  3. Worker pool — dequeues jobs, routes to a Chromium process, captures the PDF, uploads to storage, updates the job row.
  4. Chromium pool — long-lived browser processes, lifecycle-managed by the worker.
  5. Object storage — PDF bytes, signed URLs for download, lifecycle policies for retention.

The async shape is critical. Synchronous POST /convert returning PDF bytes works for the first 1,000 users and falls over at the 10,000th — HTTP connections are expensive, Chromium is unpredictable, and blocking the API server on rendering eats your capacity. Async is the only shape that scales.

Sizing the Chromium pool

Chromium is the bottleneck at scale. Everything else (API, queue, storage) scales linearly with cheap hardware. Chromium scales with RAM and CPU, both more expensive.

Throughput per process

A single Chromium process handles:

  • 3-5 concurrent tabs comfortably (baseline render on simple pages: ~200-400ms)
  • Up to 10 tabs under pressure (p95 degrades significantly)
  • About 1 render per second per tab as a sustained baseline (a ~500ms render plus per-job overhead such as tab setup and upload)
  • ~500MB-1GB RSS idle, 2-4GB under load

Back-of-envelope for 1M PDFs/day:

  • 1M / 86400s ≈ 12 PDFs/sec average
  • Peak ≈ 30-50/sec (2.5-4× average is a reasonable peak for business-hours workloads)
  • At 3 concurrent tabs/process and ~1 render/sec per tab, each process sustains ~3 PDFs/sec, so a 45/sec peak requires ~15 processes
  • Add 50% headroom for autoscaling → 20-25 worker processes
  • At 4GB RSS each → 80-100GB total RAM across the fleet
  • CPU: 2 vCPUs per worker → 40-50 vCPUs total

This is 10-15 medium-sized cloud VMs or Kubernetes nodes. Not a lot. Chromium-at-scale is expensive because each unit is expensive, not because you need an army.
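
The arithmetic above is easy to rerun for your own numbers. Here is a minimal Python sketch of it; the function name, parameter names, and the 45/sec peak are my own illustrative choices, and the defaults simply restate the assumptions in the list (3 tabs per process, about 1 render/sec per tab, 50% headroom, 4GB and 2 vCPUs per process).

```python
import math

def size_fleet(peak_pdfs_per_sec, tabs_per_process=3, renders_per_tab_per_sec=1.0,
               headroom=0.5, rss_gb_per_process=4, vcpus_per_process=2):
    """Back-of-envelope fleet sizing from peak throughput (hypothetical helper)."""
    per_process = tabs_per_process * renders_per_tab_per_sec   # PDFs/sec one process sustains
    at_peak = math.ceil(peak_pdfs_per_sec / per_process)
    with_headroom = math.ceil(at_peak * (1 + headroom))
    return {
        "processes_at_peak": at_peak,
        "processes_with_headroom": with_headroom,
        "ram_gb": with_headroom * rss_gb_per_process,
        "vcpus": with_headroom * vcpus_per_process,
    }

average_per_sec = 1_000_000 / 86_400        # roughly 11.6 PDFs/sec
plan = size_fleet(peak_pdfs_per_sec=45)     # 45/sec is an assumed point inside the 30-50 range
print(plan)
```

With a 45/sec peak this lands on 15 processes at peak and 23 with headroom, inside the 20-25 range quoted above. Change the peak or the per-tab rate and the rest follows.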

Autoscaling

Scale out on queue depth or p95 latency, not on CPU. Chromium is bursty — CPU goes 100% for 500ms, then idles — and CPU-based autoscaling thrashes. Queue depth is the honest signal: if jobs are waiting, you need more workers.

A typical autoscaling policy:

  • Queue depth < 10 jobs → scale down to minimum (e.g. 2 workers)
  • Queue depth 10-100 jobs → scale to 5-10 workers
  • Queue depth 100-1000 jobs → scale to 20-50 workers
  • Queue depth > 1000 jobs → page somebody, something is wrong

Set scale_down_cooldown high (15-30 min) so bursts of traffic don’t cause flapping.
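
As a rough illustration, the policy and the cooldown fit in two small functions. This is a sketch, not an autoscaler: the function names are hypothetical, and picking the upper end of each band is my assumption (the bands above leave the exact target open).

```python
def desired_workers(queue_depth, min_workers=2):
    """Map queue depth to (target worker count, page_on_call) using the bands above."""
    if queue_depth < 10:
        return min_workers, False
    if queue_depth <= 100:
        return 10, False      # upper end of the 5-10 band (assumption)
    if queue_depth <= 1000:
        return 50, False      # upper end of the 20-50 band (assumption)
    return 50, True           # beyond 1000 queued jobs: stay at max and page somebody

def next_worker_count(current, target, seconds_since_last_scale_down, cooldown_s=15 * 60):
    """Scale up immediately; scale down only after the cooldown has elapsed."""
    if target >= current:
        return target
    if seconds_since_last_scale_down < cooldown_s:
        return current        # still cooling down, hold the current size
    return target
```

Scaling up is never delayed; only scale-down waits for the cooldown, which is what stops a burst-idle-burst pattern from flapping the fleet.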

Process recycling

Every Chromium process must be recycled. Memory leaks over thousands of renders are the norm, not a bug. Recycle triggers:

  • Request count — kill after N=500-2000 requests
  • Memory watermark — kill when RSS > threshold (e.g. 2GB)
  • Age — kill if older than 24 hours
  • After crash — kill the whole process if one tab crashed (the renderer might have corrupted state)

Recycling is graceful: drain in-flight tabs, browser.close(), wait for exit, spawn a replacement, re-add to pool. The worker becomes unavailable for ~2-5 seconds during a recycle — size your pool to absorb this without queue depth climbing.
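
The four triggers reduce to one decision function that the pool can call after every request. A minimal sketch, assuming the pool tracks these counters per browser process; the class, field names, and default thresholds are illustrative picks from the ranges above.

```python
from dataclasses import dataclass

@dataclass
class BrowserStats:
    """Per-process counters the pool is assumed to track (hypothetical structure)."""
    requests_served: int
    rss_bytes: int
    age_seconds: float
    tab_crashed: bool = False

def recycle_reason(stats, max_requests=1000, max_rss_bytes=2 * 1024**3,
                   max_age_seconds=24 * 3600):
    """Return the name of the first trigger that fired, or None to keep the process."""
    if stats.tab_crashed:
        return "crash"
    if stats.requests_served >= max_requests:
        return "request_count"
    if stats.rss_bytes > max_rss_bytes:
        return "memory"
    if stats.age_seconds > max_age_seconds:
        return "age"
    return None
```

Returning the reason rather than a boolean makes it cheap to export a per-reason recycle counter, which tells you whether N or the memory watermark is doing the real work.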

The job queue

The queue’s job is to decouple the API server from the rendering pool. Common options:

Redis with BullMQ / Sidekiq / asynq

Default choice for most teams. BullMQ (Node) and Sidekiq (Ruby) both have:

  • At-least-once delivery
  • Retry with exponential backoff
  • Priority queues
  • Delayed jobs
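  • Dead-letter handling (Sidekiq’s dead set; BullMQ’s failed-job set)
  • Dashboards (Sidekiq’s built-in Web UI; Bull Board or similar for BullMQ)
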
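
// BullMQ producer in the API server
import { Queue } from 'bullmq';

// The queue name must match the name the worker listens on.
const pdfQueue = new Queue('pdf-render', { connection: redis });

await pdfQueue.add('render', {
  jobId: job.id,
  html: body.html,
  options: body.options,
  userId: ctx.userId,
}, {
  attempts: 3,                                   // total tries, including the first
  backoff: { type: 'exponential', delay: 1000 }, // 1s before the first retry, then 2s
  removeOnComplete: 100,                         // keep only the last 100 completed jobs
  removeOnFail: 1000,                            // keep only the last 1000 failed jobs
});
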
// BullMQ worker; the queue name must match the one the producer writes to
import { Worker } from 'bullmq';

new Worker('pdf-render', async (job) => {
  const { jobId, html, options } = job.data;
  const browser = await pool.acquire();
  let page;
  try {
    // newPage() lives inside the try so the browser is released even if it throws.
    page = await browser.newPage();
    await page.setContent(html);
    if (options.wait_for_network_idle) await page.waitForNetworkIdle();
    const pdf = await page.pdf(mapOptions(options));
    await storage.put(`jobs/${jobId}.pdf`, pdf);
    await db.update('jobs', { id: jobId }, { status: 'succeeded' });
  } finally {
    // Always close the tab and hand the browser back to the pool.
    if (page) await page.close().catch(() => {});
    pool.release(browser);
  }
}, { connection: redis, concurrency: 3 });

Typical throughput: 10,000+ jobs/sec on a single Redis instance. Well above PDF-rendering needs.

SQS

AWS-managed, durable, zero ops. A fit if you’re already on AWS and don’t want to run Redis. Tradeoffs:

  • Slightly higher per-job latency (~50ms vs ~5ms for Redis)
  • Visibility timeouts can be tricky if jobs take longer than expected
  • No built-in dead-letter queue dashboards (you build those)

Fine choice. Redis is faster to iterate on; SQS is faster to operate.

What NOT to use

  • Postgres LISTEN/NOTIFY — works up to ~1k jobs/min, gets noisy above that
  • Kafka — overkill; the Kafka operational overhead dwarfs the PDF rendering work
  • In-memory queues (just an array in Node) — you’ll lose jobs on process restart

Start with Redis + BullMQ. Migrate to something fancier only when you have a specific reason.

Rate limiting and quotas

At scale you have two separate problems: how much each customer can do per billing cycle (quota) and how much at once (concurrency). They have different solutions and different error codes.

Quota: 402 Payment Required

Implemented in the API server’s middleware, in front of the convert endpoint:

// Go example, matching the 21pdf pattern in internal/billing/middleware.go.
import (
    "net/http"
    "strconv"
)

// RequireQuota rejects requests with 402 once the plan's per-cycle quota is used up.
// userFromCtx, fetchSubscription, and fetchPlan are application helpers.
func RequireQuota(next http.Handler) http.Handler {
    return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
        userID := userFromCtx(r.Context())
        sub := fetchSubscription(userID)
        plan := fetchPlan(sub.PlanID)

        // A negative QuotaPerCycle skips the check (treated as unlimited here).
        if plan.QuotaPerCycle >= 0 && sub.UsageThisCycle >= plan.QuotaPerCycle {
            w.Header().Set("X-Quota-Used", strconv.Itoa(sub.UsageThisCycle))
            w.Header().Set("X-Quota-Limit", strconv.Itoa(plan.QuotaPerCycle))
            http.Error(w, "monthly quota exhausted; upgrade to continue", http.StatusPaymentRequired)
            return
        }
        next.ServeHTTP(w, r)
    })
}

Increment on success, not on submission. Failed renders don’t consume the customer’s quota — they didn’t get value. The increment happens in the worker after successful upload:

await db.query(
  "UPDATE subscriptions SET usage_this_cycle = usage_this_cycle + 1 WHERE user_id = $1",
  [job.data.userId],
);

Use a single UPDATE ... SET usage_this_cycle = usage_this_cycle + 1 statement (atomic). Don’t read-then-write; that races.

Quota reset: align with the billing cycle, not calendar month. When your payment provider confirms the renewal (Stripe’s invoice.paid, Razorpay’s subscription.charged, or the equivalent), reset the counter.

Concurrency: 429 Too Many Requests

Implemented at queue-submission time, checking the number of in-flight jobs per customer:

SELECT COUNT(*) FROM jobs
WHERE user_id = $1 AND status IN ('queued', 'processing');

If the count has reached the plan’s concurrency_limit, reject the new request with 429 and Retry-After: 5. The client retries in 5s; by then, in-flight jobs have usually drained enough to accept.

Why 429 and not 402? Different semantics. 402 says “you’ve hit your monthly limit — upgrade”. 429 says “you’re going too fast — slow down”. Upgrading doesn’t help with 429; slowing down doesn’t help with 402. Conflating them confuses the client’s retry logic.
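
Putting the two gates side by side makes the ordering explicit: quota first, concurrency second. A minimal sketch in Python; the function name and the 202 returned for an accepted job are my assumptions, while the headers and the Retry-After value come from the examples above.

```python
def admit(usage_this_cycle, quota_per_cycle, in_flight, concurrency_limit):
    """Return (status, headers) for a new convert request (hypothetical helper)."""
    # Quota gate: a negative quota is treated as unlimited, as in the Go middleware.
    if quota_per_cycle >= 0 and usage_this_cycle >= quota_per_cycle:
        return 402, {
            "X-Quota-Used": str(usage_this_cycle),
            "X-Quota-Limit": str(quota_per_cycle),
        }
    # Concurrency gate: the customer already has as many jobs in flight as the plan allows.
    if in_flight >= concurrency_limit:
        return 429, {"Retry-After": "5"}
    # 202 Accepted is an assumed success status for an async job submission.
    return 202, {}
```

A client that sees 402 should stop retrying and surface an upgrade prompt; a client that sees 429 should honour Retry-After and try again.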

Global rate limits

In addition to per-customer limits, you want a global limit to protect the service from bursts. Typical shape: 1000 jobs/sec ingested across the whole API. Beyond that, return 429 to the next customer regardless of who they are. This prevents a whale customer’s spike from DoS’ing everyone else.
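
One common way to implement a global ingest limit is a token bucket. The sketch below is a single-process, in-memory version for illustration only; the class name and the injectable clock are my own choices, and a multi-instance API would need the bucket state in a shared store.

```python
import time

class TokenBucket:
    """In-memory token bucket (illustrative; not thread-safe, not shared across instances)."""

    def __init__(self, rate_per_sec, burst, clock=time.monotonic):
        self.rate = rate_per_sec
        self.capacity = burst
        self.tokens = float(burst)
        self.clock = clock
        self.last = clock()

    def allow(self):
        """Refill based on elapsed time, then spend one token if one is available."""
        now = self.clock()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False          # caller responds with 429

global_limit = TokenBucket(rate_per_sec=1000, burst=1000)
```

The burst size controls how much of a spike is absorbed before 429s start; setting it equal to the rate allows roughly one second of saved-up capacity.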

SSRF at scale

If your API accepts url inputs (and most do), SSRF is your single biggest security concern. I’ve covered the pattern in the HTML-to-PDF API guide; at scale you additionally need:

Per-request audit log

Log every URL fetch with: requester ID, URL, resolved IP, timestamp, allow/deny decision, reason. 90-day retention minimum for incident investigation.
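
A sketch of what one audit record might look like, using Python's standard ipaddress module for the allow/deny decision. The function names, JSON field names, and reason strings are illustrative assumptions, not a fixed schema.

```python
import ipaddress
import json
from datetime import datetime, timezone

def classify_ip(resolved_ip):
    """Return (allowed, reason) for an already-resolved IP address."""
    ip = ipaddress.ip_address(resolved_ip)
    # Unwrap IPv4-mapped IPv6 (::ffff:10.0.0.1) so it is judged as the IPv4 address it carries.
    if ip.version == 6 and ip.ipv4_mapped is not None:
        ip = ip.ipv4_mapped
    if ip.is_loopback:
        return False, "loopback"
    if ip.is_link_local:          # includes the 169.254.169.254 metadata address
        return False, "link_local"
    if ip.is_private:
        return False, "private"
    if ip.is_reserved or ip.is_multicast or ip.is_unspecified:
        return False, "reserved"
    return True, "public"

def audit_record(requester_id, url, resolved_ip, now=None):
    """Build one structured log line covering the fields listed above."""
    allowed, reason = classify_ip(resolved_ip)
    return json.dumps({
        "ts": (now or datetime.now(timezone.utc)).isoformat(),
        "requester_id": requester_id,
        "url": url,
        "resolved_ip": resolved_ip,
        "decision": "allow" if allowed else "deny",
        "reason": reason,
    })
```

Logging the resolved IP, not just the hostname, is what lets you reconstruct a DNS-rebinding attempt after the fact.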

Anomaly detection

A customer suddenly fetching 100 different *.internal.* hostnames is almost certainly a probe. Alert on:

  • New allowed hostnames per customer exceeding baseline by >10×
  • Blocked SSRF attempts exceeding threshold (e.g. 50/hour per customer)
  • First-time fetches of hostnames resembling cloud metadata endpoints (metadata, 169.254.169.254, aws, gcp)

SSRF test harness

Run a continuous test from outside your infrastructure that:

  • Submits a URL resolving to a private IP — should be blocked at boundary
  • Submits a URL with DNS rebinding (initially public IP, then flips to private) — should be blocked by Chromium interceptor
  • Submits a URL with an HTTP redirect to a private IP — should be blocked at follow-up

Failing tests page on-call immediately. SSRF regressions are the class of bug that sits silent for months and then costs you a CVE.

Storage

PDF bytes are large relative to JSON API payloads. A million PDFs/day × average 200KB = 200GB/day = 6TB/month.

Where to store

  • S3 / R2 / GCS — cheap, reliable, signed URL support
  • MinIO — if you’re self-hosting object storage. Viable at modest scale; operational work past 10TB
  • Your database — never. Postgres handles blobs badly at scale; disk is expensive

Lifecycle policies

Rarely do customers need a PDF older than a week. Typical retention:

  • Hot tier (S3 Standard, R2, GCS Standard): 30 days — easy customer access
  • Warm tier (S3 Infrequent Access): 30-90 days. Access latency matches the hot tier and storage costs roughly half as much, but reads incur a per-GB retrieval fee
  • Cold tier (Glacier / Archive): 90 days - 10 years for customers that pay for extended retention; 3-6 hour retrieval

21pdf’s PDF_RETENTION_DAYS default is 30 — after that, the orphan sweeper deletes the PDF and marks the job row accordingly. Customers who need longer retention store the PDFs themselves.
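
The retention rules can be expressed as a function of object age that a nightly sweeper evaluates. This is a sketch under assumptions: the function name, the returned labels, and the extended_retention flag are hypothetical; only the 30-day default and the 30/90-day tier boundaries come from the text above.

```python
def storage_action(age_days, extended_retention=False, retention_days=30):
    """Decide what the sweeper should do with a PDF of the given age."""
    if not extended_retention:
        # Default plan: keep in the hot tier, then delete at the retention limit.
        return "delete" if age_days >= retention_days else "hot"
    # Customers paying for extended retention move through the tiers instead.
    if age_days < 30:
        return "hot"
    if age_days < 90:
        return "warm"
    return "cold"
```

In practice the tier moves are better left to the object store's own lifecycle rules; the function is mostly useful for the delete decision and for explaining the policy in one place.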

Signed URLs

For downloads, issue a signed URL with short expiry (5-15 minutes). Don’t serve PDFs through your API — stream them from S3 directly to the client. Your API server avoids holding large HTTP bodies.

import boto3

s3 = boto3.client('s3')

# job_id comes from the authenticated job lookup; the bucket name is illustrative.
url = s3.generate_presigned_url(
    'get_object',
    Params={'Bucket': 'pdfs', 'Key': f'jobs/{job_id}.pdf'},
    ExpiresIn=600,  # 10 minutes
)

Failure modes

A production PDF rendering service will encounter all of these. Each has a known mitigation.

1. Chromium crash mid-render

Detection: page.pdf() throws Protocol error, or times out.

Response: Kill the browser process, let the pool spawn a replacement, mark the job failed and requeue if attempt count < 3. Don’t retry forever — poison-pill jobs exist (malformed HTML that reliably crashes Chromium). Fail to DLQ after 3 attempts.

2. Memory leak

Detection: worker RSS grows over time, eventually OOM-killed.

Response: Recycle the browser process proactively. If OOMs are happening, your N (requests-per-process-lifetime) is too high, so lower it. Track OOM kills as a Prometheus metric, with an alerting rule that fires when the OOM rate > 0.

3. Font fallback

Detection: customer reports rendered PDF uses different font than their screen.

Response:

  • Confirm wait_for_network_idle was set on the request
  • Confirm the font URL is accessible from your infra (CDN blocked, CORS issue)
  • Pre-bundle common fonts in your Docker image
  • Document workaround (inline @font-face as base64)

4. Queue back-pressure

Detection: queue depth climbs, p95 latency rises.

Response:

  • Autoscale workers up
  • If already at max capacity, reject new jobs with 429 + Retry-After proportional to queue depth
  • Alert on-call if depth > 5-minute backlog
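
One way to make Retry-After proportional to queue depth is to estimate how long the current backlog takes to drain. A minimal sketch; the function name and the 1-300 second clamp are my assumptions.

```python
import math

def retry_after_seconds(queue_depth, throughput_per_sec, min_s=1, max_s=300):
    """Estimate drain time for the current backlog, clamped to a sane range."""
    if throughput_per_sec <= 0:
        return max_s                      # nothing is draining; ask clients to back off fully
    drain = math.ceil(queue_depth / throughput_per_sec)
    return max(min_s, min(max_s, drain))
```

The upper clamp matters: without it, a stalled fleet would tell clients to come back in hours, and most HTTP clients will simply give up.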

5. Poison-pill jobs

Detection: the same job_id fails with the same error on every attempt.

Response: cap retries at 3. Move to dead-letter queue. Alert on DLQ growth rate, not absolute count — some poison jobs are normal. Investigate weekly.
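
If you are rolling your own retry handling rather than relying on the queue library, the decision after each failure is small. A sketch, assuming the 3-attempt cap above and a 1-second exponential backoff base; the function name and return shape are hypothetical.

```python
MAX_ATTEMPTS = 3

def next_step(attempts_made, max_attempts=MAX_ATTEMPTS, base_delay_s=1.0):
    """Return ("retry", delay_seconds) or ("dead_letter", None) after a failed attempt."""
    if attempts_made >= max_attempts:
        return ("dead_letter", None)
    # Exponential backoff: 1s after the first failure, 2s after the second.
    return ("retry", base_delay_s * 2 ** (attempts_made - 1))
```

Keeping the cap in one constant makes it harder for the API, the worker, and the docs to drift apart on how many attempts a job really gets.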

6. Noisy-neighbour customer

Detection: one customer accounts for >50% of queue depth, other customers see degraded latency.

Response: enforce per-customer concurrency limits aggressively. Consider per-customer queues (one queue per customer, or a grouping feature such as BullMQ Pro’s groups) at very high scale. Charge the noisy customer for a higher tier.

7. Third-party dependencies flaking

Detection: all fetches to a specific third-party hostname (e.g. fonts.googleapis.com) start timing out.

Response:

  • Raise wait_for_network_idle timeout temporarily
  • Cache the font file locally if it’s stable
  • Add a circuit breaker for that hostname (if you’ve seen 50 failed fetches to it in 60s, fail fast instead of waiting 30s each)
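
The circuit breaker in the last bullet can be a sliding window of failure timestamps per hostname. A minimal in-memory sketch; the class name and the injectable clock are illustrative, and the defaults mirror the 50-failures-in-60s example.

```python
import time
from collections import defaultdict, deque

class HostCircuitBreaker:
    """Per-hostname sliding-window breaker (illustrative; single process, not thread-safe)."""

    def __init__(self, max_failures=50, window_s=60.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.window_s = window_s
        self.clock = clock
        self.failures = defaultdict(deque)    # hostname -> timestamps of recent failures

    def _prune(self, host):
        """Drop failures that have aged out of the window and return the remaining queue."""
        cutoff = self.clock() - self.window_s
        q = self.failures[host]
        while q and q[0] <= cutoff:
            q.popleft()
        return q

    def record_failure(self, host):
        self._prune(host).append(self.clock())

    def is_open(self, host):
        """True means fail fast instead of attempting the fetch."""
        return len(self._prune(host)) >= self.max_failures
```

Because old failures age out of the window on their own, the breaker closes again without a separate half-open state; that is simpler than the textbook pattern and usually good enough for font and asset hosts.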

8. Storage upload failure

Detection: worker successfully rendered the PDF but s3.put threw.

Response: retry the upload with exponential backoff (3-5 attempts). If still failing, mark the job failed with a specific error code (storage_error, not render_error — the customer’s HTML rendered fine). Page on-call if storage-error rate > 1%.
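
A sketch of the upload retry, assuming a put(key, data) callable that raises on failure. StorageError, the function name, and the injectable sleep are hypothetical; the point is that the final failure is surfaced as a storage error, distinct from a render error.

```python
import time

class StorageError(Exception):
    """Raised when the upload still fails after all attempts (hypothetical error type)."""

def upload_with_retry(put, key, data, attempts=4, base_delay_s=0.5, sleep=time.sleep):
    """Call put(key, data) with exponential backoff; return the attempt number that succeeded."""
    last_error = None
    for attempt in range(attempts):
        try:
            put(key, data)
            return attempt + 1
        except Exception as exc:              # in real code, catch the storage client's errors
            last_error = exc
            if attempt < attempts - 1:
                sleep(base_delay_s * 2 ** attempt)   # 0.5s, 1s, 2s, ...
    raise StorageError(f"upload of {key} failed after {attempts} attempts") from last_error
```

The worker can catch StorageError and record storage_error on the job row, which keeps the failure-reason breakdown in your success-rate metric honest.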

9. Database-side race

Detection: two workers process the same job simultaneously.

Response: SELECT ... FOR UPDATE SKIP LOCKED when the worker claims a job from the jobs table, so a row already locked by one worker is skipped by the others. BullMQ and Sidekiq handle this for you; rolling your own queue means you implement it.

Monitoring

The four signals I’d never operate a PDF service without:

1. Render latency (p50/p95/p99 histogram)

Per-endpoint and per-customer. If p95 doubles, you have a problem — usually font-fetch latency spike, or a customer submitting massive HTML.

2. Queue depth

Absolute number of queued jobs. Alert threshold: more than 5-minute backlog at current throughput. This catches scaling issues earlier than latency (latency only rises after the queue’s backed up).
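
The backlog expressed in minutes is just depth divided by current throughput. A small sketch with hypothetical function names; the 5-minute threshold is the one quoted above.

```python
def backlog_minutes(queue_depth, completed_per_sec):
    """How long the current queue would take to drain at the current completion rate."""
    if completed_per_sec <= 0:
        return float("inf")               # nothing is completing, so the backlog never drains
    return queue_depth / completed_per_sec / 60

def backlog_alert(queue_depth, completed_per_sec, threshold_minutes=5):
    return backlog_minutes(queue_depth, completed_per_sec) > threshold_minutes
```

Alerting on minutes of backlog rather than raw depth means the threshold stays meaningful as the fleet grows: 3,600 queued jobs is a 5-minute backlog at 12/sec but under 2 minutes at 40/sec.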

3. Success rate

(succeeded / (succeeded + failed)) over 5-minute windows. Alert threshold: < 99%. Break down by failure reason — a 1% failure rate that’s 100% storage_error is different from 1% that’s render_timeout.
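
Computing the rate and the per-reason breakdown together is only a few lines. A sketch, assuming each finished job in the window is represented as None for success or a failure-reason string; the function name and return shape are illustrative.

```python
from collections import Counter

def summarize(outcomes, alert_below=0.99):
    """outcomes: one entry per finished job, None for success or a failure-reason string."""
    outcomes = list(outcomes)
    failures = Counter(o for o in outcomes if o is not None)
    total = len(outcomes)
    failed = sum(failures.values())
    rate = (total - failed) / total if total else 1.0   # an empty window counts as healthy
    return {"success_rate": rate, "alert": rate < alert_below, "by_reason": dict(failures)}

window = [None] * 980 + ["storage_error"] * 15 + ["render_timeout"] * 5
print(summarize(window))
```

The by_reason map is the part worth putting on the dashboard: it shows at a glance whether the failures are your storage, your renderer, or your customers' HTML.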

4. Worker memory

Per-worker RSS over time. Alert threshold: RSS > 80% of container limit. This catches leaks before they OOM; a leak that would take 6 hours to reach the limit trips the alert well before the kill, leaving time to recycle the worker with no customer impact and investigate afterwards.

Plus: SSRF audit counter

Per-customer SSRF-block count per day. Alert on anomalies (>10× baseline) — this is your detection for probe activity.

Plus: Chromium version pinned

Export the running Chromium version as a Prometheus gauge. When you update it (weekly or as CVEs warrant), you can correlate version changes with latency/failure regressions.

Observability tooling

Reasonable default stack:

  • Metrics: Prometheus + Grafana, or Datadog if you want managed
  • Logs: structured JSON logs, shipped to Loki / CloudWatch Logs / Elastic
  • Traces: OpenTelemetry; helpful for tracing a slow job through API → queue → worker → storage
  • Alerting: PagerDuty / OpsGenie / Linear-based. Route different severity to different channels

At 1M PDFs/day, your observability bill will be $500-$2000/month. Budget for it.

Capacity planning

Every few months, check these numbers:

  • Current daily peak volume — not average; what was the busiest hour?
  • Render latency distribution — p99 should be within 3× of p50. If the long tail is wider, investigate (usually a single customer with huge HTML).
  • Worker pool utilisation at peak — target 60-70%. Above 80%, you’re one bad burst away from queue backlog.
  • Cost per 1000 PDFs — should be falling over time as volumes grow and fixed costs amortise. If it’s rising, something is wrong (inefficient queries, overprovisioned pool).

Incident response

When things break at 3am, you want:

Runbooks

Written-down response to the 5 most common incidents:

  1. “Queue depth climbing” → check autoscaler, verify Chromium pool healthy, check for noisy-neighbour customer
  2. “All renders failing” → check Chromium version (recent update?), check font CDN availability, check storage reachability
  3. “Single customer reporting errors” → check their quota/concurrency, check recent log entries for their user_id, verify auth not expired
  4. “SSRF alert fired” → identify customer, review their recent URL fetches, decide whether to freeze their API key
  5. “Memory ramping toward OOM” → lower N (requests-per-process), recycle workers manually, investigate which customer’s jobs are heavy

Rollback strategy

Deploy changes with the ability to roll back in under 5 minutes. For PDF rendering, this usually means:

  • Keep the previous Chromium version pinned in your image registry
  • Feature-flag new render options; default off
  • Roll out new worker versions canary-style (5% → 25% → 100% over an hour)

Status page

Public status page (Statuspage.io, Instatus, or self-hosted). Incident postmortems posted within 48 hours. Customers forgive downtime; they don’t forgive silence.

Let 21pdf operate the fleet

100 PDFs / month free, $9-$69 paid plans. We run the Chromium pool, SSRF layers, queue, storage, and monitoring — you POST HTML and get back PDFs.

Closing

PDF rendering at scale is a well-understood engineering problem with no exotic answers. Async job queue, Chromium pool with lifecycle management, dual rate limits, SSRF defence in depth, monitoring the right four signals.

The cost at 1M PDFs/day is roughly:

  • Compute: $1,500-$3,000/month (20-25 worker processes on 10-15 nodes)
  • Storage: $500-$1,000/month (6TB hot + lifecycle)
  • Observability: $500-$2,000/month
  • Engineering time: 10-20 hours/month steady-state
  • Total: ~$2,500-$6,000/month + 10-20 hours of an engineer’s attention

Against that, a managed API at 30M PDFs/month runs $3,000-$10,000/month with zero engineering time. The economics roughly break even at this scale; the choice becomes about control and strategic fit.

If you’re building toward 1M/day, the architecture above is the one that gets you there. If you’re at 10k/day and growing, stay on a managed API and come back to this post when you’re 3-6 months from the crossover.

— 21pdf Engineering

Frequently asked questions

How many Chromium processes do I need for 1M PDFs per day?

Roughly: 1M/day ≈ 12 PDFs/second average, with peaks of 30-50/sec during business hours. Each Chromium process comfortably handles 3-5 concurrent renders at sub-second p50. So ~15 worker processes at peak, or 20-25 with autoscaling headroom. Budget ~4GB RAM and 2 vCPUs per worker. Total: 10-15 nodes.

Should I use SQS, RabbitMQ, or Redis for the PDF render queue?

Redis (via BullMQ, Sidekiq, asynq) is the fastest to set up and handles 99% of use cases below 10k PDFs/minute. SQS is better if you're AWS-native and want managed durability without operating Redis. RabbitMQ is overkill for single-tenant PDF rendering but makes sense if you're already running it. Don't use Postgres as a queue at this scale — the LISTEN/NOTIFY pattern works but gets noisy over millions of jobs.

How do I prevent a single customer from starving others' queue time?

Per-customer concurrency limits enforced at queue-submission time. Reject with 429 + Retry-After when a customer has N in-flight jobs. Pair with quota gates (reject with 402 when monthly limit hit). 21pdf's Pro tier is 3 concurrent + 10k/month; this shape works well at all volumes.

What's the Chromium memory leak and how do I handle it?

Chromium's renderer leaks ~5-50MB per complex page over thousands of renders. Not a bug — expected GC behaviour. Mitigation: recycle the browser process after N requests (typical N=500-2000) or after M minutes. Create fresh tabs per request; close tabs immediately after. Memory grows regardless — the process recycle is what resets it.

How should I handle a Chromium crash mid-render?

Wrap page.pdf() in a hard timeout (e.g. 30s). On timeout or crash, mark the job failed, kill the browser process, let the pool spawn a replacement. Requeue the job while it has attempts left; give up after 3 attempts to prevent a poison-pill job from bouncing around forever.

Webhook delivery or polling — which is better for async PDFs?

Polling is simpler and correct for most workloads. Clients already have retry logic for HTTP; adding webhooks adds a delivery-reliability problem you now have to solve. Webhooks make sense when customers are building event-driven architectures and want push updates — offer it as an option, not the default.

What's the right SLA to publish for a PDF rendering service?

99.9% uptime for most B2B SaaS (8.76 hours/year allowed downtime). 99.95% or 99.99% only if you have 3+ years of operational data to back it up. Publishing an SLA you can't measurably meet is a credibility loss — customers will audit the status page and find you out.
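
The downtime budget behind each SLA figure is a one-line calculation, shown here as a sketch with a hypothetical function name and a 365-day year.

```python
def allowed_downtime_hours(sla_percent, period_hours=365 * 24):
    """Downtime budget implied by an availability target over the period."""
    return period_hours * (100 - sla_percent) / 100

for sla in (99.9, 99.95, 99.99):
    print(f"{sla}% -> {allowed_downtime_hours(sla):.2f} hours/year")
```

Running it for 99.99% gives under an hour per year, which is why that number should wait until you have the operational history to back it.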

How do I monitor a PDF rendering service?

Four golden signals: render latency (p50/p95/p99 histogram), queue depth, success rate, worker memory. Plus: Chromium version pinned value, SSRF-block count per day, per-customer quota exhaustion events. Alerting thresholds: p95 latency > 5s, queue depth > 5min backlog, success rate < 99%, worker RSS > 80% of container limit.