PDF metadata: what your file is telling people

A PDF is two documents: the one you see and the one you didn't mean to send. The second one is metadata: the document's title, the author of the source document, the software that made it, sometimes a full edit history, sometimes the original Word file embedded whole. This post walks through what's hiding inside and how to strip it before the file reaches someone who shouldn't see it.

Why metadata exists in PDFs

PDF inherited metadata from print culture. A printed document had a header, footer, maybe a stamp. A digital document added structured fields: title, author, subject, keywords, creation tool. Useful for libraries, search engines, document management systems. Less useful when the file leaks into a context the author didn't intend.

PDF supports two metadata systems:

Info dictionary: older, a set of key/value pairs stored in an object that the file's trailer points to. Title, Author, Subject, Keywords, Creator, Producer, CreationDate, ModDate.
XMP metadata: an embedded XML document (Adobe's Extensible Metadata Platform). More structured, supports custom schemas, includes things like a document UUID and revision history.

Most modern PDFs have both. The two can disagree, which is a fun source of bugs in metadata-aware tooling.

What's in there, with examples

A typical Word-generated PDF carries:

/Title       (Q3 2025 Sales Report — DRAFT)
/Author      (Sarah Mitchell)
/Creator     (Microsoft® Word for Microsoft 365)
/Producer    (Microsoft® Word for Microsoft 365)
/CreationDate (D:20251104093142+05'30')
/ModDate     (D:20251104101755+05'30')

What this tells someone who gets the file:

The author by name (probably their work username; often their full name).
Their organisation's editing software and version.
Their likely timezone (+05'30' matches Indian Standard Time, so it points to India or Sri Lanka).
The actual creation and last-edit times.
That this is or was a draft.
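
The date fields are easy to read by machine as well as by eye. Here is a minimal sketch, assuming the full-precision form shown above (the PDF format also allows shorter dates, which this does not handle). The function name parse_pdf_date is made up for illustration.

```python
import re
from datetime import datetime, timedelta, timezone

# D:YYYYMMDDHHmmSS followed by an optional offset such as +05'30' or Z.
_PDF_DATE = re.compile(
    r"^D:(\d{4})(\d{2})(\d{2})(\d{2})(\d{2})(\d{2})"
    r"(?:(Z)|([+-])(\d{2})'?(\d{2})?'?)?$"
)


def parse_pdf_date(value: str) -> datetime:
    """Parse a full-precision PDF date string into a datetime."""
    m = _PDF_DATE.match(value.strip())
    if m is None:
        raise ValueError(f"unsupported PDF date: {value!r}")
    year, month, day, hour, minute, second = (int(g) for g in m.groups()[:6])
    zulu, sign, off_h, off_m = m.groups()[6:]
    if zulu:
        tz = timezone.utc
    elif sign:
        delta = timedelta(hours=int(off_h), minutes=int(off_m or 0))
        tz = timezone(delta if sign == "+" else -delta)
    else:
        tz = None  # no offset recorded, so the zone is unknown
    return datetime(year, month, day, hour, minute, second, tzinfo=tz)


created = parse_pdf_date("D:20251104093142+05'30'")
modified = parse_pdf_date("D:20251104101755+05'30'")
print(created.isoformat())  # 2025-11-04T09:31:42+05:30
print(modified - created)   # 0:46:13
```

The second line of output is the point: anyone holding the file can work out that the author spent about 46 minutes between creating and last saving it.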

If you exported from Adobe InDesign, the trail is longer. The Creator might say "Adobe InDesign 2025.4 (Macintosh)" and the Producer might say "Adobe PDF Library 17.0". Now the reader knows your OS and your design suite versions too.

XMP can carry much more

The XMP block is XML. It often contains:

xmpMM:DocumentID and xmpMM:InstanceID: unique IDs. The DocumentID persists across edits; the InstanceID is meant to change each time the file is saved.
xmpMM:History: a list of edit events, each with timestamp, software, action.
pdf:Keywords, dc:subject: author-supplied terms.
xmp:CreatorTool, xmp:ModifyDate.

The history list is the spicy one. A PDF that has been saved five times by a tool that records history can contain five entries, each describing what was done, when, and with which software version. For a redacted document this can betray the order in which redactions were applied, which sometimes reveals what was redacted.
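
To see what a history list gives away, here is a rough sketch that pulls XMP packets out of a file's raw bytes and lists the recorded events. It assumes the XMP stream is stored uncompressed, which is usual but not guaranteed, and the sample bytes are invented for the demo rather than taken from a real file. The function names are hypothetical.

```python
import re
import xml.etree.ElementTree as ET

NS = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "xmpMM": "http://ns.adobe.com/xap/1.0/mm/",
    "stEvt": "http://ns.adobe.com/xap/1.0/sType/ResourceEvent#",
}


def find_xmp_packets(pdf_bytes: bytes) -> list[bytes]:
    """Return every uncompressed <x:xmpmeta> element found in the raw file."""
    return re.findall(rb"<x:xmpmeta\b.*?</x:xmpmeta>", pdf_bytes, re.DOTALL)


def history_events(packet: bytes) -> list[dict[str, str]]:
    """List xmpMM:History entries as dicts keyed by stEvt field name."""
    prefix = "{" + NS["stEvt"] + "}"
    root = ET.fromstring(packet)
    events = []
    for li in root.findall(".//xmpMM:History/rdf:Seq/rdf:li", NS):
        # Fields may be written as attributes or as child elements.
        event = {k[len(prefix):]: v for k, v in li.attrib.items()
                 if k.startswith(prefix)}
        for child in li:
            if child.tag.startswith(prefix):
                event[child.tag[len(prefix):]] = (child.text or "").strip()
        events.append(event)
    return events


# Invented sample: not a valid PDF, just enough bytes to carry a packet.
sample = b"""%PDF-1.7
<?xpacket begin="" id="W5M0MpCehiHzreSzNTczkc9d"?>
<x:xmpmeta xmlns:x="adobe:ns:meta/">
 <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
  <rdf:Description rdf:about=""
    xmlns:xmpMM="http://ns.adobe.com/xap/1.0/mm/"
    xmlns:stEvt="http://ns.adobe.com/xap/1.0/sType/ResourceEvent#"
    xmpMM:DocumentID="uuid:example-document">
   <xmpMM:History>
    <rdf:Seq>
     <rdf:li stEvt:action="created" stEvt:when="2025-11-04T09:31:42+05:30"/>
     <rdf:li stEvt:action="saved" stEvt:when="2025-11-04T10:17:55+05:30"/>
    </rdf:Seq>
   </xmpMM:History>
  </rdf:Description>
 </rdf:RDF>
</x:xmpmeta>
<?xpacket end="w"?>
%%EOF"""

for packet in find_xmp_packets(sample):
    for event in history_events(packet):
        print(event.get("action"), event.get("when"))
```

On a real file you would read the bytes with open(path, "rb").read(). An empty result only means no uncompressed packet was found, not that the file has no XMP.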

Hidden text under redactions

The best-known category of incident in this area is the bad redaction. Common pattern:

Lawyer draws a black box over sensitive text in Acrobat.
They save and send.
Recipient opens the PDF, selects all text, copies. The redacted text comes through.

The black rectangle is an annotation laid on top of the page. The underlying text is still there in the content stream. Anyone can copy it or extract it with pdftotext. A proper redaction removes the text from the content stream, then optionally draws a black box for visual confirmation.
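
A check along those lines can be scripted. The sketch below assumes pdftotext (from Poppler) is installed for the extraction step; the matching helper itself is plain Python, and the sample text and phrases are invented. The helper names are hypothetical.

```python
import re
import subprocess


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so line breaks don't hide a match."""
    return re.sub(r"\s+", " ", text).strip().lower()


def find_leaks(extracted_text: str, redacted_phrases: list[str]) -> list[str]:
    """Return the phrases that still appear in the extracted text."""
    haystack = normalize(extracted_text)
    return [p for p in redacted_phrases if normalize(p) in haystack]


def extract_text(pdf_path: str) -> str:
    """Run pdftotext and capture its output; '-' sends the text to stdout."""
    result = subprocess.run(
        ["pdftotext", pdf_path, "-"],
        check=True,
        capture_output=True,
    )
    return result.stdout.decode("utf-8", errors="replace")


# Stand-in for extract_text("filing.pdf"); the text below is invented.
text = "The settlement amount of\n$4.2 million was paid to Jane Doe."
print(find_leaks(text, ["$4.2 million", "Jane Doe", "Acme Corp"]))
# ['$4.2 million', 'Jane Doe']
```

A hit is conclusive: the text is still in the file. An empty list is weaker evidence, since ligatures, hyphenation, or odd font encodings can make extracted text differ from what you typed.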

Intelligence agencies, law firms, and the defence lawyers in at least one high-profile US political prosecution have leaked sensitive material this way. It's not metadata strictly speaking, but it's in the same "stuff you didn't mean to send" bucket.

Embedded files

PDFs can contain entire other files. Adobe Acrobat's "Attach a File" feature embeds the source document, a spreadsheet, an image, anything. File attachments are part of the PDF specification, not a vendor extension. They appear in the sidebar of full Acrobat, but many lighter PDF readers never show them.

People often attach the original .docx or .xlsx to the PDF as a "convenience", forgetting that the PDF was supposed to be the safer, fixed version. Now the recipient can extract the editable source — including comments, change tracking, hidden columns, the formulas in your "exported" spreadsheet.

Hidden text and bookmarks

A PDF can contain text that isn't visible: white-on-white text from the source document, layer-hidden text, text positioned outside the page's media box. None of it shows on screen; all of it copies and extracts.

Bookmarks (the sidebar table of contents) often expose the document's outline structure including draft chapter titles that were renamed before publication. Comments and form annotations live in the file even when they don't display.

Forms and form data

If your PDF is a fillable form, the form fields and their values are stored in the AcroForm dictionary. When you "Save filled form", the PDF carries every value you typed. If you printed-then-scanned the form instead, the values are baked into the page image and there's nothing to extract — but that's much heavier than just sending the filled PDF.

A trap: editing a form value, then saving, doesn't always remove the previous value. PDF supports incremental saves, where new content is appended without rewriting the original. The old form data may sit in the file untouched. Tools sometimes detect and clean this; default save in most PDF apps doesn't.
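
One rough way to spot this from the outside: each appended revision normally ends with its own %%EOF marker, so counting them hints at how many times the file was saved by appending. The sketch below is a heuristic under that assumption, with made-up function names and a synthetic byte string that is not a valid PDF. Linearized files typically contain two markers without any incremental save, and a marker can in principle appear inside stream data, so treat the count as a prompt to look closer.

```python
import re


def count_eof_markers(pdf_bytes: bytes) -> int:
    """Count %%EOF markers; each save-by-append normally adds one."""
    return len(re.findall(rb"%%EOF", pdf_bytes))


def looks_incrementally_saved(pdf_bytes: bytes) -> bool:
    """Heuristic only: more than one %%EOF suggests appended revisions."""
    return count_eof_markers(pdf_bytes) > 1


def earlier_revisions(pdf_bytes: bytes) -> list[bytes]:
    """Return the file cut off after each %%EOF except the last, oldest first."""
    ends = [m.end() for m in re.finditer(rb"%%EOF", pdf_bytes)]
    return [pdf_bytes[:end] for end in ends[:-1]]


# Synthetic bytes that mimic an original save plus one appended update.
sample = (
    b"%PDF-1.7\n1 0 obj\n<< /V (old value) >>\nendobj\n"
    b"trailer\n<< >>\nstartxref\n9\n%%EOF\n"
    b"1 0 obj\n<< /V (new value) >>\nendobj\n"
    b"trailer\n<< >>\nstartxref\n99\n%%EOF\n"
)

print(count_eof_markers(sample))                     # 2
print(b"old value" in earlier_revisions(sample)[0])  # True
```

Cutting the file off at an earlier marker is also how the old form values get recovered, which is why a full rewrite matters before sending.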

What to strip before sending

A practical "send-ready" checklist:

Flatten annotations. Highlights, sticky notes, drawn boxes — bake them into the page so a reader can't toggle them off.
Remove form fields, or at least flatten them so the values become fixed page content and can't be changed.
Strip the Info dictionary down to whatever you want public. Some tools have a "Sanitize Document" option that clears metadata. Size-reduction options such as "Reduce File Size" may not, so check the result rather than trusting the menu name.
Strip the XMP block or rewrite it with only fields you want.
Remove embedded files unless you intend them.
Verify redactions by running pdftotext on the PDF and grepping for the text you redacted.
Linearize/re-save to collapse incremental saves into a single clean copy.
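
Parts of this checklist can be roughly pre-screened with a byte scan for the names that introduce each feature. This is a sketch under a big assumption: objects packed into compressed object streams are invisible to it, so a zero count proves nothing, while a non-zero count tells you where to look. The marker list and the audit function are illustrative, not exhaustive.

```python
import re

# PDF name -> what its presence usually indicates.
MARKERS = {
    b"/EmbeddedFiles": "embedded file attachments",
    b"/AcroForm": "interactive form fields",
    b"/Annots": "annotations (comments, highlights, drawn boxes)",
    b"/Outlines": "bookmarks",
    b"/Metadata": "XMP metadata stream",
    b"/Info": "Info dictionary reference",
}


def audit(pdf_bytes: bytes) -> dict[str, int]:
    """Count raw occurrences of each marker name in the file."""
    report = {}
    for name, label in MARKERS.items():
        # The lookahead stops /Info from matching a longer name.
        pattern = re.escape(name) + rb"(?![A-Za-z0-9])"
        report[label] = len(re.findall(pattern, pdf_bytes))
    return report


# Synthetic bytes for the demo; not a valid PDF.
sample = (
    b"%PDF-1.7\n"
    b"1 0 obj\n<< /Type /Catalog /AcroForm 5 0 R /Metadata 6 0 R >>\nendobj\n"
    b"trailer\n<< /Root 1 0 R /Info 7 0 R >>\n%%EOF\n"
)

for label, hits in audit(sample).items():
    if hits:
        print(f"{hits} x {label}")
```

Running it before and after cleaning gives a quick sanity check that the cleaning step removed what you expected.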

If you've redacted something sensitive, treat the PDF as if a curious reporter is going to run every extraction tool on it. Because somebody will.

What our tools touch

The PDF tools here operate at the pdf-lib layer. When we save a PDF (after merging, splitting, rotating, watermarking, or page-numbering), the output is a re-serialized PDF. The Info dictionary and XMP block are preserved; we don't silently delete metadata, because some users need it.

A few specific behaviours:

The PDF Watermark tool adds a new content-stream block on each page. The original content stream is untouched. The watermark is not a separate annotation, so users can't toggle it off in their PDF reader.
The PDF Page Numbers tool also draws into the content stream — numbers become part of the page, not annotations.
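We don't currently surface a "strip metadata" toggle. It's on the roadmap; until then, if you need a sanitised file, the simplest path is "Sanitize Document" in Adobe Acrobat Pro, which is a paid product. A free route is to remove the fields with exiftool -all= file.pdf and then rewrite the result with qpdf --object-streams=generate --linearize in.pdf out.pdf. You need both steps: qpdf on its own leaves the Info dictionary and XMP in place, and ExifTool on its own appends an update that leaves the old metadata recoverable until the file is rewritten.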

Hashing a clean PDF

Once you've cleaned the file, take a moment to record its hash. The Hash Generator computes SHA-256 in your browser. Keep the hash somewhere — a paper trail of "this was the version I sent on 2026-05-27" is useful if the document ever becomes evidence in a dispute. Adversarial parties can't substitute a different file for the original without the hash changing, and you don't need a blockchain for that property; you just need a hash.
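
If you would rather record the hash from a script, here is a minimal offline sketch using Python's standard library. The file name and contents are placeholders written to a temporary folder for the demo.

```python
import hashlib
import os
import tempfile


def sha256_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large PDFs don't have to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "sent.pdf")  # placeholder file name
    with open(path, "wb") as handle:
        handle.write(b"%PDF-1.7\n%%EOF\n")
    sent = sha256_of_file(path)

    # Changing a single byte produces a different hash.
    with open(path, "ab") as handle:
        handle.write(b" ")
    print(sent)
    print(sha256_of_file(path) == sent)  # False
```

Store the hex string alongside the date and recipient. The hash only helps later if it was recorded somewhere the other party can't dispute.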

A note on Word

Microsoft Word's "Document Inspector" (File → Info → Check for Issues → Inspect Document) flags hidden text, comments, revision history, author names, properties, and embedded objects. Run it on the .docx before exporting to PDF. Everything you don't catch in Word ends up in the PDF as either visible metadata or, worse, copyable content.

The same applies to Pages, Google Docs (its export-to-PDF strips most things, but worth verifying), and design tools. Cleaning the source is more effective than cleaning the export.

PDF metadata is one of those things you only think about after something leaks. The good news: most of it is easy to strip once you know it's there.