PDF accessibility: tags, reading order, OCR, and the screen-reader gap

PDFs are the format every organisation uses for "official documents" — and the format that excludes screen-reader users most often. Not because PDFs can't be accessible. Because most of them aren't, and the gap is invisible to the people creating them. This post is a practical guide to what makes a PDF accessible, why so many fail, and what to check before you send one.

The two PDF worlds

There are two kinds of PDFs:

Untagged PDFs — most of what exists. The file contains drawing instructions ("show this glyph at this coordinate") but no structural information about which glyphs form a heading, a paragraph, or a table cell.
Tagged PDFs — a structure tree alongside the visual content. Headings are marked as H1, paragraphs as P, tables as Table with TR and TD cells, lists as L with LI items, images with alt text.

A screen reader walks the tag tree. With no tags, all it has is the raw content stream: glyph positions, in whatever order the PDF generator wrote them. That order follows how the generator placed content on the page, which can be column-by-column, footnote-then-body, or effectively random. What the listener hears is often gibberish.
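Which of the two worlds a file belongs to can be estimated crudely from its raw bytes. The sketch below is a heuristic, not a validator: the function name is made up for illustration, and it assumes the document catalog is stored uncompressed. PDFs that keep the catalog inside a compressed object stream will come back False even when they are tagged, so treat False as "unknown, open the Tags panel".

```python
import re


def looks_tagged(pdf_bytes: bytes) -> bool:
    """Heuristic: True if the raw bytes mention a structure tree and the Marked flag.

    Assumption: the catalog is not inside a compressed object stream.
    A False result therefore means "not proven tagged", not "untagged".
    """
    # A tagged PDF's catalog points at the root of the structure tree.
    has_struct_tree = b"/StructTreeRoot" in pdf_bytes
    # It also carries /MarkInfo << /Marked true >>.
    is_marked = re.search(rb"/Marked\s+true", pdf_bytes) is not None
    return has_struct_tree and is_marked
```

Usage would be something like looks_tagged(open("report.pdf", "rb").read()), where report.pdf is a placeholder filename. A True result only says that tags exist, not that they are good.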

A useful test: have a screen reader read the PDF aloud and follow along on the page. If the order makes sense, the structure is probably good. If it jumps mid-sentence or reads sidebars between paragraphs, the structure is broken.

Where untagged PDFs come from

Almost everywhere:

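"Save as PDF" from a browser print dialog. Often produces no tags, depending on the browser: recent Chromium-based browsers can emit tagged PDFs, others generally do not. Without tags, headings, lists, and tables have no structure a screen reader can use.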
Designer PDFs (Adobe InDesign export). Tagged only if the designer enabled it explicitly. Default is off.
PDFs assembled from scans. Pure image, no structure. We'll talk about OCR below.
PDFs generated by scripts (wkhtmltopdf, headless Chrome, pdf-lib, most server-side generators). Untagged unless you wire up tagging explicitly.
PDFs concatenated from multiple sources. Even if the inputs were tagged, naïve merging may break the tag tree.

What actually tags PDFs well:

Microsoft Word's "Save as PDF" with "Document structure tags for accessibility" checked. Default is on; many people uncheck it by accident. Best out-of-the-box tagging available.
Adobe Acrobat Pro's "Make Accessible" wizard. Adds tags to an untagged PDF interactively. Slow, manual, expensive (Acrobat Pro licence). But it works.
LaTeX with accsupp or tagpdf. Possible, but requires understanding both LaTeX and the PDF tag model.

If you regularly produce PDFs and they need to be accessible, Word remains the path of least resistance.

Scanned PDFs: the worst case

A PDF made of scanned pages is, structurally, a sequence of images. There's nothing for a screen reader to read because there's no text — only a picture of text. Selection returns nothing, search finds nothing, screen readers find nothing.

The fix is OCR (Optical Character Recognition): a tool reads the image, recognises characters, and writes a text layer into the PDF alongside the image. The image stays visible; the text becomes selectable, searchable, screen-readable.

OCR quality varies widely:

Tesseract (open source) — excellent for clean English type, reasonable for many other languages, poor for handwriting and weird layouts.
Adobe Acrobat's OCR — strong, decades of training data, expensive.
Google Cloud Vision, Azure AI Document Intelligence, AWS Textract — best in 2026 for hard cases (handwriting, multi-column, mixed languages), but they involve uploading the PDF to a cloud service.

For privacy-sensitive documents, Tesseract in the browser (via tesseract.js, a WebAssembly port) is an OCR path that keeps the file local without installing anything. It's slower than cloud OCR and accuracy is lower on complex layouts, but the privacy story is correct. We're scoping an in-browser OCR tool; it'll appear on Utilo when it's solid.

For now, the practical workflow for a scanned-but-needs-to-be-accessible PDF:

Run it through OCR somewhere you trust (your laptop with Tesseract, Acrobat if you have it).
Verify the text layer by selecting text in the PDF and pasting into a notepad.
If the OCR missed sections (common at the top/bottom of pages), re-OCR with a different engine.
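The verification step in this workflow (select, paste, inspect) can be roughed out in code. The sketch below is an illustration rather than a real quality metric: the function name, the 50-character minimum, and the 0.7 ratio are all assumptions chosen for the example. It takes the pasted text as a string and asks whether it looks like prose or like debris.

```python
def text_layer_looks_usable(pasted: str, min_chars: int = 50,
                            min_ratio: float = 0.7) -> bool:
    """Return True if pasted text looks like prose rather than debris.

    Thresholds are illustrative assumptions, not calibrated values.
    """
    stripped = "".join(pasted.split())  # drop all whitespace
    if len(stripped) < min_chars:
        # Empty or near-empty: likely a scan with no text layer at all.
        return False
    # Letters and digits in any script, plus common punctuation, count as readable.
    readable = sum(ch.isalnum() or ch in ".,;:!?'\"()-%$" for ch in stripped)
    return readable / len(stripped) >= min_ratio
```

A False result is a prompt to look at the page yourself, and a True result does not prove the OCR is accurate, only that it produced something text-shaped.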

Alt text for images

Images inside a tagged PDF should have an alternate text description in the tag. Without it, a screen reader announces "image" or "graphic" — useless context.

The default behaviour:

Word — picks up "Alt Text" from the inserted image's properties. Right-click the image → Alt Text. Many people skip this.
Designer PDFs — Adobe InDesign has an "Object Export Options" pane where alt text can be set per image. Almost nobody does.
Programmatically generated — the generator usually has an API. pdf-lib does not currently expose alt text writing in a friendly way; deeper PDF libraries do.

Bad alt text is worse than no alt text. "Photo of a graph" tells the listener nothing. "Sales grew from $1.2M in Q1 to $1.8M in Q4" is what the chart is actually trying to convey, so that is what the alt text should say.
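If alt text strings are collected from a document (how depends on the authoring tool), the obvious failures can be flagged before a human reviews the rest. This is a minimal sketch with an assumed, non-exhaustive pattern list and a made-up function name. It flags candidates for review and cannot judge whether a description is actually good.

```python
import re

# Assumed, non-exhaustive patterns for alt text that names the medium
# instead of describing the content.
_WEAK_PREFIX = re.compile(
    r"^(photo|image|picture|graphic|screenshot)\s+of\b", re.IGNORECASE
)
_FILENAME = re.compile(r"^\S+\.(png|jpe?g|gif|svg|tiff?)$", re.IGNORECASE)


def alt_text_problems(alt: str) -> list[str]:
    """Return a list of likely problems; an empty list means nothing obvious."""
    problems = []
    text = alt.strip()
    if not text:
        problems.append("missing")
    elif _FILENAME.match(text):
        problems.append("looks like a filename")
    elif _WEAK_PREFIX.match(text):
        problems.append("describes the medium, not the content")
    return problems
```

With these patterns, "Photo of a graph" is flagged and "Sales grew from $1.2M in Q1 to $1.8M in Q4" passes. A "photo of" description can occasionally be legitimate, which is why the result is a hint and not a verdict.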

Tables: the hardest case

Tables in PDFs are notoriously bad for accessibility. A correct table tag tree looks like:

Table
├ TR (row)
│ ├ TH (header cell) "Quarter"
│ └ TH "Revenue"
├ TR
│ ├ TD "Q1"
│ └ TD "$1.2M"
...

What a screen reader does with this: announces "Table with 2 columns, 5 rows" then reads cells with their headers, so "Q1, Revenue: $1.2M" instead of just "$1.2M".
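To make the header pairing concrete, here is a small sketch that models the tag tree as plain Python objects and produces announcements in the style described. The Cell and Row classes and the announce function are hypothetical, and real screen readers differ in wording. It assumes a simple table: the first row holds the headers, the first column labels each row, and there are no spans or empty rows.

```python
from dataclasses import dataclass, field


@dataclass
class Cell:
    tag: str   # "TH" or "TD"
    text: str


@dataclass
class Row:
    cells: list[Cell] = field(default_factory=list)


def announce(rows: list[Row]) -> list[str]:
    """Pair each data cell with its column header, roughly as a screen reader might."""
    header_row, *body = rows
    headers = [c.text for c in header_row.cells if c.tag == "TH"]
    lines = [f"Table with {len(headers)} columns, {len(rows)} rows"]
    for row in body:
        # First cell labels the row; the rest are read with their column header.
        row_label, *values = [c.text for c in row.cells]
        for header, value in zip(headers[1:], values):
            lines.append(f"{row_label}, {header}: {value}")
    return lines
```

Without the TH cells the headers list is empty and every data cell is dropped from the pairing, which mirrors what the listener experiences with an untagged table: values with no context.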

A common failure: tables drawn visually with lines and text, with no Table tag and no header cells. The listener hears a row of numbers with no context.

Even when tagged, tables get tricky when cells span multiple rows or columns, or when there are header rows above other header rows. Most automated taggers handle simple tables and stumble on complex ones. Test before shipping.

Reading order and page numbers

When a PDF was generated with multiple columns or sidebars, the reading order in the underlying content stream may not match the visual reading order. A screen reader reads the underlying order. The result: sentences from column 2 interleave with column 1.

The fix is a tagged PDF that explicitly declares the reading order via the structure tree. Acrobat Pro's Reading Order tool lets you re-order. Without that, the only fix is regenerating the PDF from a source that knows the structure.
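The mismatch between stream order and visual order is easy to reproduce with toy data. In the sketch below, the Run class, the coordinates, and the column_split_x parameter are all invented for illustration, and the column boundary is assumed to be known in advance. Real pages are much messier, which is why an explicit structure tree is the proper fix and geometric guessing is not.

```python
from dataclasses import dataclass


@dataclass
class Run:
    text: str
    x: float  # left edge, in points
    y: float  # baseline; PDF y grows upward from the bottom of the page


def visual_order(runs: list[Run], column_split_x: float) -> list[str]:
    """Left column top to bottom, then right column top to bottom."""
    def key(run: Run) -> tuple[int, float]:
        column = 0 if run.x < column_split_x else 1
        return (column, -run.y)  # higher y is nearer the top of the page
    return [r.text for r in sorted(runs, key=key)]


# Emission order a generator might produce: line by line across both columns.
stream = [
    Run("Column one, line one.", 72, 700),
    Run("Column two, line one.", 320, 700),
    Run("Column one, line two.", 72, 686),
    Run("Column two, line two.", 320, 686),
]
```

Reading stream in list order interleaves the columns, which is what an untagged file gives the listener. Calling visual_order(stream, column_split_x=300) returns both lines of column one followed by both lines of column two.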

A small accessibility win: page numbers. A blind reader can't see "where am I in this document"; an announced page number gives orientation. The PDF Page Numbers tool adds numbers to every page. Useful for both sighted and screen-reader users — and one of the few accessibility wins that doesn't require restructuring the file.

Forms

Fillable PDF forms are sometimes accessible, sometimes not. Best case:

Each field has a /TU entry (the alternate field name, shown as the tooltip) that describes what it asks for ("First name", not just "Field 1"). The /T entry is only the internal field name.
Each field is in the tag tree at the right reading-order position.
Required fields have the Required bit set in their /Ff field flags.

Most PDF forms produced by web-to-PDF converters fail all three. They render visually because the form widgets float on top of the page, but the screen-reader user gets a sequence of "Edit text" prompts with no context.
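The three best-case checks can be expressed against a simplified view of a form field. The sketch below treats each field as a plain dict keyed by PDF entry names, which is a hypothetical representation and not any particular library's API. It assumes the descriptive label lives in /TU (the alternate field name) and that the Required flag is bit position 2 of /Ff, as in the PDF field-flags table.

```python
REQUIRED_BIT = 1 << 1  # /Ff bit position 2 (value 2) is the Required flag


def field_problems(field: dict, tagged_names: set[str]) -> list[str]:
    """Check one field dict; tagged_names holds the /T names found in the tag tree."""
    problems = []
    label = field.get("/TU", "").strip()
    if not label:
        problems.append("no descriptive label")
    if field.get("/T", "") not in tagged_names:
        problems.append("not in tag tree")
    return problems


def is_required(field: dict) -> bool:
    """True if the Required bit is set in the field flags."""
    return bool(field.get("/Ff", 0) & REQUIRED_BIT)
```

For example, {"/T": "first_name", "/TU": "First name", "/Ff": 2} passes both checks and reports as required, while a bare {"/T": "Field 1"} fails on label and tag tree. Whether a field sits at the right reading-order position still needs a human or a proper validator.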

If form accessibility matters, an HTML form is a dramatically better choice than a PDF form. Web forms have decades of accessibility tooling; PDF forms have a fraction of that.

What to check before sending

A 60-second accessibility audit:

Open the PDF, select all text (Ctrl+A, or Cmd+A on macOS), copy, and paste into a notepad. If you get gibberish or whitespace, the text layer is broken (likely a scan without OCR, or a designer PDF with text flattened to outlines).
Open the Tags panel (Acrobat: View → Show/Hide → Navigation Panes → Tags). If empty, the PDF is untagged. Add tags or regenerate.
Listen to the first page with your OS's built-in screen reader (macOS VoiceOver: Cmd+F5; Windows Narrator: Ctrl+Win+Enter). Ten seconds of listening teaches you more than ten minutes of guessing.
Run Acrobat's accessibility checker (Tools → Accessibility → Full Check), if you have it. Generates a punch list.
Add a meaningful title in document properties (File → Properties → Description → Title). Screen readers announce the title before reading the document; "Untitled" or the filename is jarring.
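The title check in the last step is simple enough to automate once the title string has been read from document properties (reading it is left out here, since that depends on the PDF library in use). A minimal sketch, with a made-up function name and an assumed list of placeholder titles:

```python
from pathlib import PurePath
from typing import Optional


def title_problem(title: Optional[str], filename: str) -> Optional[str]:
    """Return a short reason if the title is unhelpful, else None."""
    text = (title or "").strip()
    if not text:
        return "missing"
    # Assumed placeholder list; extend it for your own tooling.
    if text.lower() in {"untitled", "untitled document"}:
        return "placeholder"
    stem = PurePath(filename).stem
    if text.lower() in {filename.lower(), stem.lower()}:
        return "same as filename"
    return None
```

A None result only means the title is not one of the obvious failures. Whether it is meaningful to a listener is still a judgment call.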

If any of these are easy wins, fix them. If accessibility actually matters (legal compliance, public-facing document, audience known to include disabled users), budget proper tagging time — it's not a 5-minute job.

Tools and standards worth knowing

PDF/UA (Universal Accessibility) — the formal standard for accessible PDFs (ISO 14289). A PDF either complies or doesn't. Government procurement often requires PDF/UA compliance.
WCAG 2.2 AA — the broader web accessibility standard. PDFs delivered via the web should meet it. The PDF/UA spec aligns with WCAG.
PAC (PDF Accessibility Checker) — a free Windows tool that validates against PDF/UA. More rigorous than Acrobat's built-in checker.
NVDA (open-source Windows screen reader) — free and widely used; test your PDFs against it.

The privacy + accessibility tension

There's a real trade-off between in-browser-only tools (privacy-preserving) and accessibility features (often involving heavy ML for OCR and structure recognition). The cloud OCR services are more accurate; they require uploading the file. Our position is that we'll keep building the local versions even when they're slightly weaker, and we'll be honest about the gap. If you need cloud-grade OCR on a sensitive document, that's a real trade-off you have to make consciously — not one we'll quietly make for you.

For documents that aren't sensitive, the bigger accessibility win is upstream: produce the document in a tool that tags well (Word, modern LibreOffice), don't downgrade it by printing and re-scanning, and check the result before sending.

A short call to action

If you regularly produce PDFs:

Turn on "Document structure tags for accessibility" in Word (or your equivalent). It's a checkbox.
Add alt text to every image. It's a right-click menu.
Open the resulting PDF in a screen reader and listen to a page. It's 30 seconds.

That's 90% of the accessibility wins for 5% of the effort. The remaining 10% is genuinely hard and may require a tagging specialist for legal-grade compliance. But the first 90% will lift your output above the median, and the median is below acceptable for most public-facing documents.

PDF accessibility is a quiet topic until it's an expensive lawsuit. Investing a small amount up front is dramatically cheaper than retrofitting after a complaint.