PDFs explained: what's actually inside the format

PDF is the file format everyone has opinions about and almost nobody understands. We send them, print them, sign them, and curse at them. Yet the format is more than 30 years old and the basic structure has barely changed. This post is a friendly tour of what's actually inside a PDF — enough to understand why some PDFs are tiny and some are huge, why text sometimes can't be selected, and why "edit a PDF" is harder than it sounds.

Where PDF came from

Adobe's John Warnock published the spec in 1993, as a static document format you could send to anyone and trust to look the same. The breakthrough was bundling fonts and layout instructions so the receiver didn't need anything but a PDF reader. Before PDF, the polite way to send a formatted document was a PostScript file plus an apology that the recipient might not have the right fonts.

PostScript is the ancestor: a stack-based page description language. PDF inherited PostScript's drawing model (moveto, lineto, fill, show text at coordinates) but dropped the programmability. A PDF declares what to draw; a PostScript file runs a program that produces what to draw. Static beats dynamic when you want predictability.

Anatomy of a PDF file

If you open a PDF in a text editor, the first line reads like:

%PDF-1.7

Then a comment line of high-bit bytes (so file utilities treat the file as binary), then a long sequence of objects. PDF is fundamentally a database of objects bolted onto a flat file. Every object has a number, a generation, and a body. The body is a dictionary (<<…>>), a stream of bytes, a number, a string, a name, an array, a boolean, null, or a reference to another object.
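
To make the "database of objects" idea concrete, here is a minimal Python sketch that pulls numbered objects out of a hand-written fragment. The SAMPLE bytes and the list_objects helper are made up for illustration. A regex scan like this is only a teaching aid: it would trip over binary streams that happen to contain "endobj" and over the compressed object streams newer PDFs use.

```python
import re

# A tiny hand-written fragment in PDF object syntax (illustrative, not a full file).
SAMPLE = b"""1 0 obj
<< /Type /Catalog /Pages 2 0 R >>
endobj
2 0 obj
<< /Type /Pages /Kids [3 0 R] /Count 1 >>
endobj
"""

# "<number> <generation> obj ... endobj"; non-greedy so each object matches on its own.
OBJECT_RE = re.compile(rb"(\d+)\s+(\d+)\s+obj\s*(.*?)\s*endobj", re.DOTALL)


def list_objects(data: bytes) -> dict:
    """Map (object number, generation) to the raw bytes of the object body."""
    return {(int(n), int(g)): body for n, g, body in OBJECT_RE.findall(data)}


objects = list_objects(SAMPLE)
for (number, generation), body in objects.items():
    print(number, generation, body.decode("ascii"))
```

Note how "2 0 R" inside the catalog is a reference, not an object: it names object 2, generation 0, which the reader then has to look up.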

The structure:

Header — the %PDF-1.7 line.
Body — the objects: pages, fonts, images, content streams, metadata.
Cross-reference table — a table of file offsets so a reader can jump to object N without scanning the whole file.
Trailer — points at the root object and the cross-reference table.

That's the entire format. The interesting work happens in the body.
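
The four parts are easier to see when you assemble them by hand. The sketch below, in Python, writes a header, three objects, a cross-reference table and a trailer for one blank page, then looks an object up the way a reader would: from the end of the file backwards. The function names are invented for this example, and it assumes the simplest possible layout (one classic xref section starting at object 0, no incremental updates, no xref streams).

```python
def build_minimal_pdf() -> bytes:
    """Assemble header, body, cross-reference table and trailer for one blank page."""
    bodies = [
        b"<< /Type /Catalog /Pages 2 0 R >>",
        b"<< /Type /Pages /Kids [3 0 R] /Count 1 >>",
        b"<< /Type /Page /Parent 2 0 R /MediaBox [0 0 612 792] /Resources << >> >>",
    ]
    out = bytearray(b"%PDF-1.7\n")  # header
    offsets = []
    for number, body in enumerate(bodies, start=1):
        offsets.append(len(out))  # byte offset where this object starts
        out += b"%d 0 obj\n" % number + body + b"\nendobj\n"
    xref_offset = len(out)
    out += b"xref\n0 %d\n" % (len(bodies) + 1)
    out += b"0000000000 65535 f \n"  # entry 0 is the head of the free list
    for offset in offsets:
        out += b"%010d 00000 n \n" % offset  # every entry is exactly 20 bytes
    out += b"trailer\n<< /Size %d /Root 1 0 R >>\n" % (len(bodies) + 1)
    out += b"startxref\n%d\n%%%%EOF\n" % xref_offset
    return bytes(out)


def find_xref_offset(pdf: bytes) -> int:
    """Read the startxref value from the tail of the file."""
    tail = pdf[-1024:]
    marker = tail.rfind(b"startxref")
    return int(tail[marker + len(b"startxref"):].split()[0])


def object_offset(pdf: bytes, number: int) -> int:
    """Look up where object `number` starts, using the fixed-width xref entries."""
    xref = find_xref_offset(pdf)
    # Skip the "xref" line and the "0 4" subsection line to reach the first entry.
    first_entry = pdf.index(b"\n", pdf.index(b"\n", xref) + 1) + 1
    entry = pdf[first_entry + 20 * number:first_entry + 20 * number + 20]
    return int(entry[:10])


pdf = build_minimal_pdf()
start = object_offset(pdf, 3)
print(pdf[start:start + 7])  # the first bytes of the page object
```

The fixed 20-byte entries are the design choice worth noticing: finding object N is a multiplication and a seek, not a scan.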

Pages, content streams, and resources

A reader starts at the catalog object (the root that the trailer points at), which has a /Pages reference. That points at a tree of page objects. Each page object has:

A /MediaBox — the page size in points (a point is 1/72 inch).
A /Resources dictionary — the fonts, images, and colour spaces the page can use.
A /Contents stream — the actual drawing instructions.
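
Since a point is 1/72 inch, turning a /MediaBox into a physical page size is simple arithmetic. A small Python sketch, with a helper name invented for this post:

```python
POINTS_PER_INCH = 72
MM_PER_INCH = 25.4


def media_box_size_mm(media_box):
    """Convert a /MediaBox [llx lly urx ury] in points to (width, height) in mm."""
    llx, lly, urx, ury = media_box
    to_mm = MM_PER_INCH / POINTS_PER_INCH
    return round((urx - llx) * to_mm, 1), round((ury - lly) * to_mm, 1)


# US Letter is 8.5 x 11 inches, i.e. 612 x 792 points.
print(media_box_size_mm([0, 0, 612, 792]))
```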

The content stream looks a bit like assembly:

BT
/F1 12 Tf
100 700 Td
(Hello, world) Tj
ET

BT begins a text block, /F1 12 Tf sets font F1 at 12 points, 100 700 Td moves the text position to x=100, y=700 (measured from the bottom-left corner of the page by default), (Hello, world) Tj shows the literal text, and ET ends the block. Multiply by every glyph on every page and you have the picture.
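
The operands-then-operator shape of those lines is easy to parse. Here is a rough Python sketch of a tokenizer for the snippet above. It is deliberately incomplete: it assumes uncompressed content, and it ignores nested parentheses in strings, hex strings, arrays, inline images, and the ' and " text operators. The names are invented for illustration.

```python
import re

CONTENT = b"BT\n/F1 12 Tf\n100 700 Td\n(Hello, world) Tj\nET\n"

# Alternatives, in order: string literal, name, number, operator keyword.
TOKEN_RE = re.compile(
    rb"\((?:[^()\\]|\\.)*\)"      # (string), escapes allowed, no nesting
    rb"|/[^\s/()<>\[\]]+"         # /Name
    rb"|[-+]?\d*\.?\d+"           # number
    rb"|[A-Za-z][A-Za-z0-9*]*"    # operator such as BT, Tf, Td, Tj, T*
)


def parse_operations(stream: bytes):
    """Group tokens into (operator, operands): operands come first, postfix style."""
    operations, operands = [], []
    for raw in TOKEN_RE.findall(stream):
        token = raw.decode("latin-1")
        if token[0] == "(":
            operands.append(token[1:-1])   # string contents without parentheses
        elif token[0] == "/":
            operands.append(token)         # keep names as "/F1"
        elif token[0] in "+-.0123456789":
            operands.append(float(token))
        else:
            operations.append((token, operands))
            operands = []
    return operations


def shown_text(stream: bytes):
    """Collect the string operand of every Tj operator."""
    return [ops[0] for op, ops in parse_operations(stream) if op == "Tj"]


for operator, operands in parse_operations(CONTENT):
    print(operator, operands)
```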

That's why text in a PDF doesn't always copy cleanly. The content stream cares about visual positioning, not reading order. A PDF generator that places each glyph individually can produce text that reads "Hello, world" to your eyes but H e l l o , w o r l d (with extra spaces) to a copy-paste operation. Or the other way around: glyphs visually adjacent but stored in arbitrary order, so copy-paste returns gibberish. PDFs from accounting software are notorious for this.
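
A toy Python model of the second failure: the glyphs below are stored in arbitrary order, each with its own position. Reading them in stream order gives gibberish; grouping by baseline and sorting by x recovers the text. The coordinates, the tolerance value and the function names are all made up, and real extractors have to cope with much more (columns, rotated text, varying glyph widths).

```python
# (x, y, character) in the order a careless generator might have written them.
GLYPHS = [
    (120.0, 700.0, "l"), (100.0, 700.0, "H"), (100.0, 680.0, "P"),
    (140.0, 700.0, "o"), (110.0, 700.0, "e"), (120.0, 680.0, "F"),
    (130.0, 700.0, "l"), (110.0, 680.0, "D"),
]


def stream_order(glyphs):
    """What a naive copy-paste sees: glyphs in the order they were stored."""
    return "".join(ch for _, _, ch in glyphs)


def reading_order(glyphs, line_tolerance=2.0):
    """Group glyphs into lines by y, then sort each line left to right."""
    lines = {}
    for x, y, ch in glyphs:
        lines.setdefault(round(y / line_tolerance), []).append((x, ch))
    # The PDF y axis points up, so the top line has the largest y.
    ordered = sorted(lines.items(), reverse=True)
    return "\n".join("".join(ch for _, ch in sorted(line)) for _, line in ordered)


print(stream_order(GLYPHS))
print(reading_order(GLYPHS))
```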

Streams and compression

Most of what makes a PDF big is streams: binary blobs inside objects. A content stream is plain text, but it is usually stored compressed, inside a stream object with a filter:

8 0 obj
<< /Length 487 /Filter /FlateDecode >>
stream
[compressed bytes]
endstream
endobj

/FlateDecode is zlib (deflate) compression. Other common filters: /DCTDecode (JPEG), /CCITTFaxDecode (CCITT Group 3 or Group 4 fax encoding, used for scanned black-and-white pages), /JBIG2Decode (a more efficient bitonal codec), /LZWDecode (older). Multiple filters can stack: an image could be JPEG-encoded data inside a Flate-compressed stream.
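
Because /FlateDecode is plain zlib, Python's standard library can read and write it. The sketch below wraps a payload in a stream object shaped like the one above and then decodes it again. The helper names are invented, and it assumes /Length is a direct number (in real files it can be a reference to another object) and that no predictor parameters are in play.

```python
import re
import zlib


def make_stream_object(number: int, payload: bytes) -> bytes:
    """Wrap payload in a Flate-compressed stream object."""
    compressed = zlib.compress(payload)
    header = b"%d 0 obj\n<< /Length %d /Filter /FlateDecode >>\nstream\n" % (
        number, len(compressed))
    return header + compressed + b"\nendstream\nendobj\n"


def decode_flate_stream(obj: bytes) -> bytes:
    """Read /Length bytes after the 'stream' keyword and inflate them."""
    length = int(re.search(rb"/Length\s+(\d+)", obj).group(1))
    start = obj.index(b"stream\n") + len(b"stream\n")
    return zlib.decompress(obj[start:start + length])


content = b"BT\n/F1 12 Tf\n100 700 Td\n(Hello, world) Tj\nET\n"
obj = make_stream_object(8, content)
print(decode_flate_stream(obj).decode("ascii"))
```

Using /Length rather than searching for "endstream" matters: compressed bytes can contain anything, including that keyword.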

That's why PDFs from a phone scanner are often huge: each page is a high-resolution JPEG of the scan, often around 1 MB per page. A PDF assembled from already-compressed images won't shrink meaningfully by re-compressing the wrapper, because the JPEGs are already close to as small as they can get without quality loss. Knowing this is the difference between a useful compression tool and a useless one. (We're working on a real compression tool that re-encodes embedded images at lower quality, which is the only honest way to shrink a scan-based PDF.)
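
You can see the effect without any PDF at all. In this Python sketch, repetitive content-stream text stands in for a typical text page, and random bytes stand in for JPEG data. That stand-in is an assumption: real JPEG data is not truly random, but it is similarly high-entropy, so Flate has almost nothing left to squeeze.

```python
import os
import zlib


def flate_ratio(data: bytes) -> float:
    """Compressed size divided by original size, at maximum zlib effort."""
    return len(zlib.compress(data, 9)) / len(data)


text_like = b"100 700 Td (Hello, world) Tj\n" * 2000
jpeg_like = os.urandom(len(text_like))  # stand-in for already-compressed image data

print(f"text-like stream: {flate_ratio(text_like):.3f}")
print(f"jpeg-like stream: {flate_ratio(jpeg_like):.3f}")
```

The first ratio comes out as a small fraction; the second sits at roughly 1.0 or slightly above, because the wrapper adds a few bytes of overhead.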

Fonts: the rabbit hole

Fonts are why PDFs are reliable across machines. The PDF spec allows fonts to be:

Standard 14: Helvetica, Times, and Courier (four styles each), plus Symbol and ZapfDingbats. Readers have traditionally been expected to supply these, so they don't need to be embedded.
Embedded: the font file is bundled inside the PDF, in TrueType (/TrueType), Type 1, or CFF format. Most PDFs do this.
Subset: an embedded font trimmed to only the glyphs actually used in the document, marked by a six-letter prefix in the font name (XYZABC+Helvetica). Saves space at the cost of breaking editability.

Subsetting is why "I just want to fix this typo" is hard. The PDF contains glyphs for "Hello, world" but not for the letter "Z" that you want to insert. To edit, the editor needs the original font (or a near-enough match) to grab the missing glyph and re-embed it.
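
A rough Python sketch of both halves of that problem: spotting a subset font by its name prefix, and working out which characters an edit would need that the subset cannot supply. The function names are invented, and the second helper assumes one glyph per character, which ligatures and complex scripts break.

```python
import re

# A subset tag is six uppercase letters followed by "+".
SUBSET_TAG_RE = re.compile(r"^[A-Z]{6}\+")


def is_subset_font(base_font: str) -> bool:
    """True if the font name carries a subset prefix such as XYZABC+."""
    return bool(SUBSET_TAG_RE.match(base_font))


def missing_glyphs(original_text: str, edited_text: str):
    """Characters the edit needs that the original text never used."""
    return sorted(set(edited_text) - set(original_text))


print(is_subset_font("XYZABC+Helvetica"))
print(missing_glyphs("Hello, world", "Hello, Zorld"))
```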

The PDF also stores per-glyph positioning and kerning. When a reader copies text from a PDF, it has to map the character codes in the content stream back to Unicode. That works reliably when the font includes a /ToUnicode map (a table from character codes back to Unicode characters). It works badly when the map is missing, which is depressingly common.
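
In practice a /ToUnicode map is a small text program with lookup entries in it. The Python sketch below reads only the simplest kind of entry, a bfchar pair of hex values, from a made-up fragment. Real maps also use bfrange entries and codespace declarations, which this ignores.

```python
import re

# Made-up fragment: code 0x0001 means "H", code 0x0002 means "i".
TO_UNICODE = """
2 beginbfchar
<0001> <0048>
<0002> <0069>
endbfchar
"""

BFCHAR_RE = re.compile(r"<([0-9A-Fa-f]+)>\s+<([0-9A-Fa-f]+)>")


def parse_bfchar(cmap: str) -> dict:
    """Map each character code to the Unicode text it stands for."""
    mapping = {}
    for code, target in BFCHAR_RE.findall(cmap):
        # Targets are UTF-16BE, so one code can map to several characters.
        mapping[int(code, 16)] = bytes.fromhex(target).decode("utf-16-be")
    return mapping


def extract(codes, mapping) -> str:
    """Translate codes to text, marking unknown codes with U+FFFD."""
    return "".join(mapping.get(code, "\ufffd") for code in codes)


print(extract([1, 2], parse_bfchar(TO_UNICODE)))
```

With the map, codes 1 and 2 become readable text. With an empty mapping, the same codes come out as replacement characters, which is roughly what copy-paste from a map-less PDF feels like.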

Images, vectors, and transparency

A PDF can contain raster images (JPEG, PNG-ish, JBIG2 for bitonal scans) and vector content (the content-stream drawing commands). Most modern PDFs are a mix: vector for text and shapes, raster for photos and icons.

The same image can appear multiple times via XObjects: stored once as an object, then referenced wherever needed. That's how a 50-page PDF with the same logo on every page can still be small. The logo is one XObject, referenced 50 times.
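
The saving is easy to estimate. In this back-of-the-envelope Python sketch, the 40,000-byte logo and the 30 bytes per reference are hypothetical numbers chosen for illustration (a reference is a short resource entry plus a draw command, so a few dozen bytes is the right order of magnitude).

```python
def size_without_reuse(pages: int, image_bytes: int) -> int:
    """Every page carries its own copy of the image."""
    return pages * image_bytes


def size_with_reuse(pages: int, image_bytes: int, reference_bytes: int = 30) -> int:
    """One shared image object plus a small reference on each page."""
    return image_bytes + pages * reference_bytes


pages, logo = 50, 40_000
print(size_without_reuse(pages, logo))  # 2000000
print(size_with_reuse(pages, logo))     # 41500
```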

PDF 1.4 added transparency, which is what made design tools take the format seriously. PDF 2.0 added smarter colour management. Each spec bump added features that the vast majority of PDFs never use. Many PDFs in the wild still declare version 1.4 or 1.6.

Why every PDF is different

The format is permissive. Four PDFs that produce identical visible output can be wildly different inside:

Word → Save as PDF typically embeds the fonts it uses (usually as subsets), keeps the text searchable, and produces tidy content streams. Small file, copy-paste works.
Scanner → PDF is one giant JPEG per page. Often no text layer at all. Large file. Copy-paste returns nothing because there's no text to copy.
Designer → PDF (InDesign, Illustrator) is often exported with text converted to vector outlines for safe printing. The page looks identical, but each character becomes a path: uncopyable, unsearchable, hostile to screen readers.
Web page → PDF uses CSS print rules. Text is real, but it's often laid out one absolute-positioned glyph at a time, making copy-paste fragile.

When you split a PDF, merge two PDFs, or convert pages to images, the source PDF matters more than the tool. A clean Word-generated PDF behaves predictably. A flattened InDesign export behaves predictably in a different way. A phone scan misbehaves consistently.

What our tools do (and don't)

The PDF Merge tool concatenates PDFs by reading each as a tree of pages and copying the page objects into a new file. The output is a new PDF; the original objects are preserved.
The Split PDF tool does the inverse — copies a range of pages into a new document. The non-selected pages are dropped along with any resources they referenced (a saving for big multi-section PDFs).
The PDF to JPG tool renders each page to a canvas via pdf.js (the engine behind Firefox's built-in PDF viewer), then saves the canvas as a JPG. The result is a raster snapshot; it loses text-layer information.
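
Why does splitting drop resources? One way to picture it is as a reachability walk over the object graph: copy the selected pages plus everything they reference, and whatever is never reached stays behind. The Python sketch below is a toy model with made-up object numbers. It is not how pdf-lib is implemented, only an illustration of the idea.

```python
# Toy object graph: object number -> object numbers it references.
OBJECTS = {
    3: [10, 20],  # page 1 uses font 10 and image 20
    4: [10],      # page 2 uses font 10 only
    5: [10, 21],  # page 3 uses font 10 and image 21
    10: [11],     # font dictionary -> embedded font file 11
    11: [],
    20: [],
    21: [],
}


def reachable(objects: dict, roots) -> set:
    """Every object that must be copied for the given pages to keep working."""
    seen, stack = set(), list(roots)
    while stack:
        number = stack.pop()
        if number in seen:
            continue  # shared objects (like the font) are copied only once
        seen.add(number)
        stack.extend(objects[number])
    return seen


kept = reachable(OBJECTS, [4])  # split out page 2 only
print(sorted(kept))
print(sorted(set(OBJECTS) - kept))  # what the split leaves behind
```

Keeping only page 2 pulls in the shared font and its font file, while both images are left out of the new document.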

What we explicitly don't do: re-flow content, edit text, embed fonts you don't have, or recover deleted text. Those operations need a parser and editor an order of magnitude more complex than what runs in a static-export browser tool. They're also where most commercial PDF tools earn their licence fees.

A short reading list, if you want to go deeper

ISO 32000-2 is the current PDF spec. It's free to read; search "PDF 2.0 specification ISO 32000-2".
pdf-lib (TypeScript) — the library we use here. Its source code is approachable; reading the PDFDocument.load path will teach you more about the format than any spec.
pdf.js (Mozilla) — the renderer behind every Firefox PDF view. Heavier read but excellent if you want to write PDF rendering yourself.

PDFs are stranger than they look. Once you know they're a database of objects with a drawing-command stream per page, every weird behaviour starts to make sense.