Workflows · 6 min read

How to Prepare PDFs for ChatGPT, Claude, and Gemini

A practical guide to extracting clean, AI-ready text from PDFs — born-digital and scanned — so ChatGPT, Claude, and Gemini answer accurately and don't waste tokens on headers, footers, and page numbers.

PDFs were designed for printers, not language models. When you drop a 200-page report into ChatGPT, Claude, or Gemini, the parsed text usually carries a long tail of headers, footers, page numbers, footnote markers, and broken hyphenated words. The model has to spend tokens — and attention — on all of it.

This guide walks through how to turn any PDF into the kind of clean, paragraph-structured text an AI can actually reason over. The steps work the same whether you're feeding Claude's 200K context window, GPT-4's 128K, or Gemini's 2M.

Why raw PDF text fails

Most extraction libraries pull text in reading order, but they don't know which lines repeat on every page. A typical academic PDF carries the journal name in the header, the page number in the footer, and the article title every few pages. Across a long document those repeated strings can total thousands of tokens.
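A frequency count is usually enough to catch these repeats. Here's a minimal sketch (pure Python, assuming you already have one text string per page; the 50% threshold and digit-stripping normalization are illustrative choices, not a standard):

```python
from collections import Counter

def strip_repeated_lines(pages, threshold=0.5):
    """Remove lines that appear on more than `threshold` of pages.

    `pages` is a list of per-page text strings. Lines are compared after
    stripping whitespace and digits, so "Page 12" and "Page 13" count as
    the same repeating footer.
    """
    def normalize(line):
        return "".join(ch for ch in line.strip() if not ch.isdigit())

    counts = Counter()
    for page in pages:
        # Count each distinct normalized line once per page.
        for line in {normalize(l) for l in page.splitlines() if l.strip()}:
            counts[line] += 1

    cutoff = threshold * len(pages)
    return [
        "\n".join(
            l for l in page.splitlines()
            if not l.strip() or counts[normalize(l)] <= cutoff
        )
        for page in pages
    ]
```

Because page numbers are normalized away before counting, "Page 1" through "Page 200" collapse into one repeating line and get stripped together.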

Hyphenation is the second silent killer. PDFs break long words across line endings ("informa-tion") and most parsers preserve the hyphen. Suddenly your document has hundreds of words the model treats as misspellings.
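One regex handles most of these. A small sketch — merging only when a lowercase letter follows the break, so genuinely hyphenated compounds at line ends are less likely to be mangled:

```python
import re

def dehyphenate(text):
    """Re-join words split across line breaks: "informa-\\ntion" -> "information".

    Only merges when the fragment after the break starts with a lowercase
    letter, a cheap guard against joining real hyphenated terms.
    """
    return re.sub(r"(\w)-\n([a-z])", r"\1\2", text)
```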

The clean-text checklist

  • Detect and remove repeated headers and footers across pages
  • Strip standalone page-number lines
  • Re-join hyphenated line breaks ("informa-\ntion" → "information")
  • Collapse runs of spaces, tabs, and stray blank lines
  • Rebuild paragraphs based on line spacing, not raw line breaks
  • Run OCR if no embedded text is found (scanned PDFs)
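The last two text-shaping items on the list can be handled together. A simplified sketch that treats blank lines as paragraph boundaries and re-flows everything else (real spacing-based detection needs layout coordinates, which plain extracted text no longer has):

```python
import re

def rebuild_paragraphs(text):
    """Collapse intra-paragraph line breaks; keep blank-line paragraph breaks."""
    paragraphs = re.split(r"\n\s*\n", text)   # blank line = paragraph boundary
    cleaned = []
    for para in paragraphs:
        flowed = " ".join(para.split())       # re-flow lines, collapse whitespace
        if flowed:
            cleaned.append(flowed)
    return "\n\n".join(cleaned)
```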

Born-digital vs scanned: which path?

A born-digital PDF (one exported from Word, Google Docs, or LaTeX) has embedded text — extraction is fast and lossless. A scanned PDF is just images of pages; you need OCR. PDF to Clean Text automatically detects which kind you have and falls back to in-browser Tesseract OCR when needed. Nothing uploads.
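The detection itself is a simple heuristic you can reproduce: extract whatever embedded text exists, then check the yield per page. A sketch (the 50-characters-per-page cutoff is an assumption for illustration, not the tool's actual threshold):

```python
def needs_ocr(extracted_text, page_count, min_chars_per_page=50):
    """Guess whether a PDF is scanned: almost no embedded text per page."""
    if page_count <= 0:
        return True
    chars_per_page = len(extracted_text.strip()) / page_count
    return chars_per_page < min_chars_per_page
```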

How much does cleaning actually save?

On a typical 80-page consulting report we tested:

  • Raw extraction: 64,200 tokens
  • Header/footer stripped: 58,100 tokens (−9.5%)
  • Page numbers + hyphenation fixed: 56,400 tokens (−12.1%)
  • Paragraphs rebuilt: 54,900 tokens (−14.5%) and noticeably better summaries

Fourteen percent doesn't sound dramatic until you multiply it by every API call across a year. For RAG pipelines, cleaner chunks also mean better embeddings — your retriever finds the right paragraph more often.
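The percentages above are just (raw − cleaned) / raw, if you want to run the same check on your own documents:

```python
def savings_pct(raw_tokens, cleaned_tokens):
    """Percent of tokens saved relative to the raw extraction."""
    return round(100 * (raw_tokens - cleaned_tokens) / raw_tokens, 1)
```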

Handling whole folders

If you're preparing a knowledge base or a multi-document brief, run the whole batch through Batch Document Extractor. It handles PDFs, Word docs, images, and plain text together and returns a ZIP with one clean file per input — perfect for feeding into Claude Projects, ChatGPT Custom GPTs, or any RAG ingestion pipeline.
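If you'd rather script the batch step yourself, the same shape is a few lines of Python. A sketch under stated assumptions: `clean_fn` stands in for whatever cleaning pipeline you use, the glob only picks up `.txt` files, and real PDFs would need an extraction step first:

```python
import zipfile
from pathlib import Path

def batch_clean(input_dir, output_zip, clean_fn):
    """Run `clean_fn` over every .txt file in a folder; zip the results."""
    with zipfile.ZipFile(output_zip, "w", zipfile.ZIP_DEFLATED) as zf:
        for path in sorted(Path(input_dir).glob("*.txt")):
            # One clean output file per input, named <stem>.clean.txt
            zf.writestr(path.stem + ".clean.txt", clean_fn(path.read_text()))
```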

Putting it in your prompt

Once you have clean text, structure your prompt clearly:

  1. Lead with the question or task
  2. Add the document under a clear delimiter ("---DOCUMENT START---")
  3. Repeat the question after the document — models give noticeably better answers when the ask brackets the context
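The three steps above reduce to a tiny template (the delimiter strings are just a convention; any unambiguous marker works):

```python
def build_prompt(question, document):
    """Sandwich the document between two copies of the ask."""
    return (
        f"{question}\n\n"
        "---DOCUMENT START---\n"
        f"{document}\n"
        "---DOCUMENT END---\n\n"
        f"To repeat the task: {question}"
    )
```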

For very long inputs, also consider compressing first with Context Compressor before sending. The combination of clean + compressed often cuts your prompt in half without losing information.

Frequently asked

Can I just upload the PDF directly to ChatGPT?

You can, but ChatGPT's built-in PDF parser keeps headers, footers, and page numbers, which dilutes the model's attention and burns tokens. Pre-cleaning the text usually produces tighter, more accurate answers.

What if my PDF is scanned?

Run OCR first. Most browser-based tools (including ours) detect missing embedded text and fall back to OCR automatically.

Does Claude handle long PDFs better than ChatGPT?

Claude's larger context window helps, but it still benefits from cleaned input. Noise costs the same number of tokens whether the model can hold them or not.
