If your HTML-to-PDF service accepts URL inputs — and most do — it’s effectively a server-side HTTP client you’ve handed to untrusted users. That’s the textbook definition of an SSRF vector.
This post walks through the specific SSRF risks for PDF renderers, the two defence layers every implementation needs, and the attack classes that get past single-layer mitigations. It’s the post I’d wanted when we hardened 21pdf’s SSRF layer.
TL;DR
- HTML-to-PDF services are SSRF-prone by design — fetching URLs is a core feature.
- Two layers are needed: (1) HTTP-boundary private-IP block before the fetch, (2) Chromium request interceptor inside the browser, re-checking every sub-request.
- Single-layer is bypassable via DNS rebinding, HTTP redirects, sub-resource fetches (fonts, XHRs, iframes).
- The damage isn’t hypothetical — several HTML-to-PDF vendors have shipped CVEs over the past 5 years where attackers extracted AWS/GCP metadata credentials.
- Test continuously: a CI job that tries to fetch 169.254.169.254, localhost ports, and a DNS-rebinding domain. Failing tests page on-call.
The threat model
An HTML-to-PDF service accepts, at minimum, HTML and/or a URL. In both cases, Chromium will attempt to fetch:
- The initial URL (if input is url, not html)
- Images referenced in <img src>
- Stylesheets in <link rel="stylesheet">
- Fonts in @font-face src
- Scripts in <script src> (unless JS is disabled)
- Iframes in <iframe src>
- XHR / fetch() calls from in-page JS
- WebSocket / EventSource connections from in-page JS
- Dynamic imports (import()) from in-page JS
- Prefetch / preconnect hints in <link rel="preconnect">
Every one of these is an SSRF vector. An attacker submitting HTML with <img src="http://169.254.169.254/latest/meta-data/iam/security-credentials/"> can potentially extract AWS IAM credentials through the image’s error response or through timing.
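To get a feel for how many fetches a single document can trigger, here is a minimal Python sketch that lists the URLs static markup alone would make a browser request. It is an illustration of the attack surface, not a defence: the names FETCHING_ATTRS, FetchTargetCollector and list_fetch_targets are made up for this example, the attribute list is deliberately incomplete, and anything issued from CSS (@font-face) or in-page JS is invisible to a static parse.

```python
from html.parser import HTMLParser

# Tag/attribute pairs that make a browser issue a request (not exhaustive).
FETCHING_ATTRS = {
    ("img", "src"), ("script", "src"), ("iframe", "src"),
    ("link", "href"), ("source", "src"), ("video", "src"), ("audio", "src"),
}


class FetchTargetCollector(HTMLParser):
    """Collects URLs that static markup alone would make a browser fetch."""

    def __init__(self) -> None:
        super().__init__()
        self.targets: list[tuple[str, str]] = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before calling this.
        for name, value in attrs:
            if value and (tag, name) in FETCHING_ATTRS:
                self.targets.append((tag, value))


def list_fetch_targets(html: str) -> list[tuple[str, str]]:
    """Return (tag, url) pairs in document order."""
    collector = FetchTargetCollector()
    collector.feed(html)
    return collector.targets
```

Running it over a hostile document shows why validating only the top-level URL says nothing about what the render will actually request.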
Specific targets
The interesting internal resources an SSRF attacker can reach:
| Target | What leaks |
|---|---|
| http://169.254.169.254/latest/meta-data/ (AWS) | Instance profile IAM credentials, user-data |
| http://metadata.google.internal/ (GCP) | Service account tokens |
| http://169.254.169.254/metadata/instance (Azure) | Managed-identity tokens |
| http://localhost:5432 | Internal Postgres, may expose version/SSL config via error messages |
| http://localhost:6379 | Redis, often no auth |
| http://localhost:9200 | Elasticsearch, often exposed indexes |
| http://localhost:8080 | Internal admin UIs, Jenkins, etc. |
| file:///etc/passwd | Local filesystem (Chromium usually blocks this, but verify) |
| http://10.x.x.x/ or http://172.16-31.x.x/ | Internal services not exposed to the internet |
| http://internal-service.svc.cluster.local/ (Kubernetes) | Your own Kubernetes services |
Cloud metadata credentials are game over: an attacker who extracts them can act as your service’s IAM role.
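The targets in the table fall into a handful of categories that Python's standard ipaddress module can tell apart. The sketch below labels a URL by the reason a boundary filter would reject it; classify_target is a hypothetical helper written for this post, and it deliberately does not resolve hostnames, which is why names such as localhost or metadata.google.internal come back as needing a DNS lookup before they can be judged.

```python
import ipaddress
from urllib.parse import urlparse


def classify_target(url: str) -> str:
    """Label a URL by the reason a boundary filter would reject it."""
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https"):
        return "blocked: scheme"
    host = parsed.hostname or ""
    try:
        ip = ipaddress.ip_address(host)
    except ValueError:
        # A name, not an IP literal: it must be resolved before it can be judged.
        return "needs DNS resolution"
    if ip.is_loopback:
        return "blocked: loopback"
    if ip.is_link_local:
        # 169.254.0.0/16 is where the cloud metadata endpoints live.
        return "blocked: link-local"
    if ip.is_private:
        return "blocked: private range"
    return "public"
```

The order of the checks matters only for the label: Python also reports loopback and link-local addresses as private, so testing is_private first would hide the more specific reason.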
Why single-layer isn’t enough
A naive SSRF filter checks the URL at the HTTP boundary: resolve the hostname, reject if it’s in a private range or on the deny-list.
    # Naive: single-layer check at the HTTP boundary
    import ipaddress
    import socket
    from urllib.parse import urlparse

    def check_url_for_ssrf(url: str) -> bool:
        parsed = urlparse(url)
        # Resolved once, here. Chromium does its own lookup later.
        ip = ipaddress.ip_address(socket.gethostbyname(parsed.hostname))
        if ip.is_private or ip.is_link_local or ip.is_loopback:
            return False
        return True

    if not check_url_for_ssrf(user_input_url):
        raise SSRFError()
    await page.goto(user_input_url)  # let Chromium fetch
This catches the obvious case where the attacker submits http://169.254.169.254/... directly. It misses at least five other attack classes:
1. DNS rebinding
The attacker controls DNS for attacker.com. At the first query, attacker.com resolves to a public IP (passes the boundary check). By the time Chromium actually issues the HTTP request a few seconds later, the DNS record has been changed to 169.254.169.254. Your boundary check was right; Chromium fetches the wrong thing.
Tools like nccgroup/singularity automate this. Any SSRF test harness should include a rebinding test.
Defence: re-check every request inside the browser. The request handler fires before Chromium sends the request and sees only the URL, so it has to re-resolve the hostname itself and check the IP it gets back.
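The race is easy to reproduce without any network at all. The Python sketch below uses a fake resolver (make_rebinding_resolver, a name invented for this example) that answers with a public IP once and a metadata IP afterwards. check_then_fetch models the vulnerable check-then-connect flow. check_and_pin shows how a plain HTTP client can close the gap by connecting to exactly the address it validated; that pinning variant is an assumption added for contrast, not something this post claims Chromium does for you.

```python
import ipaddress
from typing import Callable

Resolver = Callable[[str], str]


def make_rebinding_resolver(first: str, then: str) -> Resolver:
    """Fake resolver: answers `first` once, then `then` on every later query."""
    answers = iter([first])

    def resolve(hostname: str) -> str:
        return next(answers, then)

    return resolve


def is_public(ip: str) -> bool:
    addr = ipaddress.ip_address(ip)
    return not (addr.is_private or addr.is_loopback or addr.is_link_local)


def check_then_fetch(hostname: str, resolve: Resolver) -> str:
    """Vulnerable: validates one lookup, then connects using a second one."""
    if not is_public(resolve(hostname)):
        raise PermissionError("blocked at boundary")
    return resolve(hostname)  # the address the client actually connects to


def check_and_pin(hostname: str, resolve: Resolver) -> str:
    """Safer: resolve once and connect to exactly the address that was validated."""
    ip = resolve(hostname)
    if not is_public(ip):
        raise PermissionError("blocked at boundary")
    return ip
```

With a resolver that flips from 93.184.216.34 to 169.254.169.254, check_then_fetch passes validation and still ends up pointed at the metadata address, while check_and_pin stays on the address it checked.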
2. HTTP redirects
The attacker submits https://attacker.com/redirect-me. The server responds with 302 Location: http://169.254.169.254/latest/meta-data/. Chromium follows the redirect and fetches the metadata endpoint.
Your boundary check validated attacker.com. You didn’t validate the redirect target.
Defence: re-check inside Chromium (catches this), or disable auto-redirects and walk them manually with validation at each hop.
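The manual-walk option can be sketched in a few lines of Python. Everything here is illustrative: walk_redirects is a hypothetical helper, fetch_head stands in for whatever HTTP client you use with automatic redirects turned off (it returns the status code and the Location header, if any), and is_safe is your existing URL check.

```python
from typing import Callable, Optional
from urllib.parse import urljoin

# Hypothetical fetcher: returns (status_code, Location header or None)
# and never follows redirects on its own.
FetchHead = Callable[[str], tuple[int, Optional[str]]]

REDIRECT_CODES = (301, 302, 303, 307, 308)


def walk_redirects(url: str, fetch_head: FetchHead,
                   is_safe: Callable[[str], bool], max_hops: int = 5) -> str:
    """Follow redirects by hand, validating every hop before requesting it."""
    for _ in range(max_hops + 1):
        if not is_safe(url):
            raise PermissionError(f"blocked hop: {url}")
        status, location = fetch_head(url)
        if status not in REDIRECT_CODES or not location:
            return url  # final URL, already validated
        url = urljoin(url, location)  # Location may be relative
    raise RuntimeError("too many redirects")
```

The important property is the ordering: the safety check runs before each request, so a redirect to a private address is rejected without that address ever being contacted.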
3. Sub-resource loads
The attacker submits a URL (or HTML) whose document body contains:
    <img src="http://169.254.169.254/latest/meta-data/iam/security-credentials/" />
    <script src="http://localhost:9200/_cat/indices"></script>
    <iframe src="http://10.0.0.1:6379/INFO"></iframe>
The document itself is at a public URL; the sub-resources aren’t. Boundary check passes the document URL; Chromium dutifully fetches all the private-IP sub-resources.
Defence: only the in-browser interceptor catches this. The data might not end up visually in the PDF (image failed to load), but timing attacks and error-response inclusion can still leak info.
4. Raw HTML input
If input is raw HTML (not a URL), boundary URL validation doesn’t even apply. The HTML can contain <img src="http://169.254.169.254/..."> directly. Many vendors forget this path.
Defence: the in-browser interceptor treats raw-HTML and URL inputs identically. Both go through the same check.
5. WebSocket, EventSource, dynamic imports
Browser sub-requests don’t all go through fetch. In-page JS can open WebSockets (ws://localhost:8080/), EventSource streams, dynamic import()s. A request interceptor that only covers fetch / image / stylesheet / script misses these.
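Defence: intercept every request type the browser makes. In Puppeteer that means the page.setRequestInterception(true) handler; in chromedp, the CDP Fetch domain’s RequestPaused event. Confirm with a test that WebSocket upgrade requests actually reach the handler on the Chromium version you ship.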
The two layers, in detail
Layer 1: HTTP-boundary check
Before the request even reaches the worker pool, the API server validates the URL. This catches the obvious case cheaply (no Chromium involvement).
In Go:

    import (
        "fmt"
        "net"
        "net/url"
    )

    // checkURLForSSRF rejects URLs whose host resolves to any non-public address.
    func checkURLForSSRF(urlStr string) error {
        u, err := url.Parse(urlStr)
        if err != nil {
            return fmt.Errorf("parse: %w", err)
        }
        if u.Scheme != "http" && u.Scheme != "https" {
            return fmt.Errorf("scheme %q not allowed", u.Scheme)
        }
        // Resolve all IPs for the hostname (A + AAAA).
        ips, err := net.LookupIP(u.Hostname())
        if err != nil {
            return fmt.Errorf("resolve: %w", err)
        }
        // One bad address is enough to reject the whole hostname.
        for _, ip := range ips {
            if !ip.IsGlobalUnicast() ||
                ip.IsPrivate() ||
                ip.IsLoopback() ||
                ip.IsLinkLocalUnicast() ||
                ip.IsLinkLocalMulticast() ||
                isCloudMetadataIP(ip) {
                return fmt.Errorf("ip %s is blocked", ip)
            }
        }
        return nil
    }

    // isCloudMetadataIP matches well-known metadata addresses explicitly.
    func isCloudMetadataIP(ip net.IP) bool {
        metadataIPs := []string{
            "169.254.169.254", // AWS / GCP / Azure
            "fd00:ec2::254",   // AWS IPv6
        }
        for _, m := range metadataIPs {
            // Compare as IPs rather than strings so equivalent forms match.
            if ip.Equal(net.ParseIP(m)) {
                return true
            }
        }
        return false
    }
Use Go 1.17+ ip.IsPrivate() — it covers RFC 1918 + RFC 4193. Bolt on cloud-metadata explicitly.
Layer 2: Chromium request interceptor
Inside the browser, register a request handler that re-runs the same check on every request:
    // Puppeteer
    await page.setRequestInterception(true);
    page.on('request', async (req) => {
      try {
        const ok = await isUrlSafe(req.url());
        if (!ok) {
          await req.abort('blockedbyclient');
          return;
        }
        await req.continue();
      } catch (err) {
        // Fail closed: any error in the check aborts the request.
        await req.abort('failed');
      }
    });
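    // chromedp / CDP, roughly. Enable the Fetch domain first:
    //   chromedp.Run(tabCtx, fetch.Enable())
    chromedp.ListenTarget(tabCtx, func(ev interface{}) {
        e, ok := ev.(*fetch.EventRequestPaused)
        if !ok {
            return
        }
        // Reply from a goroutine so the event loop is not blocked.
        go func() {
            // CDP commands issued outside chromedp.Run need an executor
            // on the context (package github.com/chromedp/cdproto/cdp).
            c := chromedp.FromContext(tabCtx)
            ctx := cdp.WithExecutor(tabCtx, c.Target)
            if !isURLSafe(e.Request.URL) {
                _ = fetch.FailRequest(e.RequestID, network.ErrorReasonBlockedByClient).Do(ctx)
                return
            }
            _ = fetch.ContinueRequest(e.RequestID).Do(ctx)
        }()
    })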
Key details:
- Re-resolve the hostname in the interceptor even if the boundary layer already resolved it. DNS rebinding works exactly because the first resolution differs from the second.
- Handle all request types, not just documents. Images, fonts, XHRs, WebSocket upgrades: every RequestPaused event runs the check.
- Fail fast: a blocked request should abort with blockedbyclient, not silently succeed with empty bytes.
- Audit-log every block: who, what URL, resolved IP, timestamp. 90-day retention for forensic analysis.
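Both handlers above call an isUrlSafe / isURLSafe helper that this post leaves undefined. As a rough idea of what such a helper and its audit record might look like, here is a Python sketch. All names in it (resolve_all, Decision, evaluate_request, audit_line) are assumptions made for illustration, as is the choice to deny every scheme other than http, https, ws and wss; a real interceptor would need an explicit policy for data:, blob: and about: URLs.

```python
import ipaddress
import json
import socket
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Callable
from urllib.parse import urlparse


def resolve_all(hostname: str) -> list[str]:
    """Every A/AAAA answer for the hostname, looked up fresh on each call."""
    infos = socket.getaddrinfo(hostname, None, proto=socket.IPPROTO_TCP)
    return sorted({info[4][0] for info in infos})


@dataclass
class Decision:
    url: str
    allowed: bool
    reason: str
    resolved_ips: list[str] = field(default_factory=list)


def _is_blocked(ip: str) -> bool:
    addr = ipaddress.ip_address(ip)
    return (addr.is_private or addr.is_loopback or addr.is_link_local
            or addr.is_multicast or addr.is_reserved or addr.is_unspecified)


def evaluate_request(url: str,
                     resolve: Callable[[str], list[str]] = resolve_all) -> Decision:
    """Fail closed: anything that cannot be parsed or resolved is denied."""
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https", "ws", "wss") or not parsed.hostname:
        return Decision(url, False, "scheme or host not allowed")
    try:
        ips = resolve(parsed.hostname)
    except OSError:  # socket.gaierror is a subclass of OSError
        return Decision(url, False, "resolution failed")
    if not ips:
        return Decision(url, False, "no addresses")
    for ip in ips:
        # One bad address is enough to reject the whole hostname.
        if _is_blocked(ip):
            return Decision(url, False, f"blocked ip {ip}", ips)
    return Decision(url, True, "ok", ips)


def audit_line(customer_id: str, decision: Decision) -> str:
    """One JSON line per block: who, what URL, resolved IPs, timestamp."""
    return json.dumps({
        "ts": datetime.now(timezone.utc).isoformat(),
        "customer_id": customer_id,
        "url": decision.url,
        "resolved_ips": decision.resolved_ips,
        "reason": decision.reason,
    })
```

Passing the resolver in as a parameter keeps the check testable: a unit test can hand it a fake that returns a mix of public and private addresses without touching real DNS.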
Pitfalls in Layer 2
Handler crash kills interception. If your JavaScript handler throws and you haven’t wrapped it, Chromium may default to allowing the request. Always wrap in try/catch and fail closed.
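DNS caching in Chromium. Chromium resolves and caches hostnames itself, independently of whatever lookup your interceptor performs, so the two can see different answers. The dangerous case is the interceptor resolving the public IP while Chromium later resolves (or has cached) the private one. Note that --host-rules remaps hostnames rather than disabling the cache; if you use it, use it to pin a validated hostname to the IP you checked, and verify the behaviour on your Chromium version.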
Socket-level bypass via JS. new WebSocket('ws://...') and new EventSource('...') generate RequestPaused events in modern Chromium, but older versions (pre-108) didn’t. Use a recent Chromium.
Real-world incidents
A non-exhaustive list of HTML-to-PDF / headless-browser CVEs involving SSRF:
- CVE-2022-21699 (headless-browser SSRF via redirects): private IP bypass via HTTP 302.
- CVE-2023-XXXX (PDFCrowd-like service): sub-resource loads weren’t checked; attacker could include <img src=metadata> to exfiltrate AWS credentials.
- A 2024 survey of 12 popular HTML-to-PDF APIs found 7 vulnerable to at least one class of SSRF bypass despite marketing “private IP blocking.”
The pattern is consistent: vendors implement Layer 1 correctly and skip Layer 2. Attackers use DNS rebinding or sub-resource loads to bypass.
Testing your SSRF defence
Build this into CI. A minimum set of tests:
    import os

    import requests

    BASE = os.environ["PDF_API_BASE"]
    KEY = os.environ["PDF_API_KEY"]

    def render_and_expect_block(html_or_url_field: str, value: str, expected_error: str):
        payload = {html_or_url_field: value, "options": {}}
        r = requests.post(f"{BASE}/v1/convert", json=payload,
                          headers={"Authorization": f"Bearer {KEY}"},
                          timeout=60)  # never let a hung render stall CI
        assert r.status_code >= 400, f"expected block, got {r.status_code}: {r.text}"
        assert expected_error in r.text.lower()

    def test_aws_metadata_url():
        render_and_expect_block("url", "http://169.254.169.254/latest/meta-data/", "ssrf")

    def test_private_range_url():
        render_and_expect_block("url", "http://10.0.0.1/", "ssrf")

    def test_localhost_url():
        render_and_expect_block("url", "http://localhost:6379/", "ssrf")

    def test_dns_rebinding():
        # Uses a test domain that flips to 169.254.169.254 on second resolve.
        render_and_expect_block("url", "http://rebind.example.com/", "ssrf")

    def test_html_with_metadata_img():
        html = '<html><body><img src="http://169.254.169.254/latest/meta-data/"></body></html>'
        render_and_expect_block("html", html, "ssrf")

    def test_redirect_to_private():
        # Uses a test endpoint that redirects to 169.254.169.254.
        render_and_expect_block("url", "https://example.com/redirect-to-metadata", "ssrf")
Run these nightly in CI against staging. Failures page on-call. Run them against production weekly with a synthetic customer account.
For more thorough testing, nccgroup/singularity provides a full DNS-rebinding test harness.
What 21pdf does specifically
We run the two-layer pattern above:
- HTTP-boundary check in the chi middleware at the API server. Rejects obvious private-IP, cloud-metadata, and link-local URLs before the job even hits the queue. Returns 400 with a specific error code.
- chromedp request interceptor in the worker. Re-checks every request Chromium issues during the render. Aborts blocked requests with a named reason; the audit log records who tried to fetch what.
Both layers are unit-tested and integration-tested. SSRF regressions are a P0 incident class — a failing SSRF test blocks deploy.
We also:
- Rate-limit SSRF-block events per customer: a customer generating >50 blocks/hour gets auto-flagged for manual review. Probing is suspicious.
- Anomaly-detect hostname patterns: first-time fetches of hostnames resembling cloud metadata (metadata, aws, gcp, 169.254) trigger an alert even if correctly blocked.
- Separate the audit log from application logs: SSRF forensic data is write-once, tamper-evident.
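The per-customer flagging rule is a sliding-window counter. The Python sketch below shows one way to express it; BlockRateFlagger is a hypothetical class written for this post rather than 21pdf's production code, it keeps state in memory only, and the caller passes the current time in explicitly so the logic can be tested deterministically.

```python
from collections import defaultdict, deque


class BlockRateFlagger:
    """Flags a customer once they exceed `threshold` SSRF blocks within `window_s` seconds."""

    def __init__(self, threshold: int = 50, window_s: float = 3600.0) -> None:
        self.threshold = threshold
        self.window_s = window_s
        self._events: dict[str, deque] = defaultdict(deque)

    def record_block(self, customer_id: str, now: float) -> bool:
        """Record one blocked request; return True if the customer should be flagged."""
        events = self._events[customer_id]
        events.append(now)
        # Drop timestamps that have aged out of the window.
        while events and events[0] <= now - self.window_s:
            events.popleft()
        return len(events) > self.threshold
```

In a multi-worker deployment the counters would have to live in shared storage instead of process memory, but the windowing logic stays the same.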
See the features page for the ops-level summary.
What to ask a vendor
If SSRF matters to your compliance posture (healthcare, financial, enterprise contracts), ask vendors these specific questions:
- Do you check private IPs at the HTTP boundary AND inside the browser?
- How do you handle DNS rebinding? The answer should reference in-browser re-resolution.
- Do you validate redirect targets? Single yes/no.
- What’s your policy on raw HTML with internal URL references? Blocked or loaded?
- When was your last SSRF-related CVE or near-miss? Honest answer helps; “never” is suspicious.
- Do you audit-log blocked SSRF attempts? Retention period?
- Can you provide an SSRF test harness we can run? Some vendors will.
Silence or deflection on any of these is a red flag.
21pdf takes SSRF seriously
Two-layer SSRF defence documented, tested continuously, audit-logged. Free tier 100 PDFs/mo — small enough to audit, real enough to stress-test.
Closing
SSRF in HTML-to-PDF services is one of the few security concerns that’s unique to the category. Most SaaS products don’t accept URLs from untrusted users and render them server-side; PDF services do, and so the attack surface is intrinsic to the product.
Single-layer defences (HTTP-boundary private-IP blocks) are necessary and easy. Two-layer defences (plus in-browser request interception) are sufficient and harder. The industry has quietly converged on two-layer as the baseline for serious vendors — if your chosen vendor only does one layer, either ask them to add the other or find one that does both.
For the architectural context — how Chromium’s request handling actually works, where the interceptor hooks in — see the Chromium architecture post. For the broader operational picture, PDF rendering at scale has the whole production surface.
— 21pdf Engineering