Models · 7 min read

The Best Way to Feed Long Documents to Claude (and Other Long-Context Models)

Claude's 200K-token context is generous, but you'll still want to clean, compress, and structure long documents before sending them. Here's a step-by-step playbook.

Claude 3.5 Sonnet's 200,000-token context is the kind of number that makes you want to dump entire bookshelves into a chat. Resist that urge. Even with all that room, how you prepare the input still determines how good the answer is — and how much you pay per call.

Position matters more than people think

Long-context models exhibit a "lost in the middle" effect: information placed in the very middle of a long prompt is recalled less reliably than information near the start or end. The fix is mechanical:

  1. Put the question first
  2. Order documents from most-relevant to least-relevant
  3. Repeat the question at the very end

Those three adjustments together have been shown to improve recall on multi-document QA by double-digit percentages.
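Here's a minimal sketch of that ordering in Python, assuming you already have a rough relevance score per document (from keyword overlap, embedding similarity, or a manual ranking):

def build_prompt(question: str, docs: list[tuple[float, str]]) -> str:
    # Question first, documents from most- to least-relevant,
    # question repeated at the very end.
    ordered = [text for _, text in sorted(docs, key=lambda d: d[0], reverse=True)]
    parts = [f"Question: {question}", *ordered, f"Question (repeated): {question}"]
    return "\n\n".join(parts)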

Step 1: Clean every document

Use PDF to Clean Text for PDFs and Batch Document Extractor for mixed folders. Strip headers, footers, page numbers, and broken hyphenation. A 200K-token context filled with clean prose comfortably outperforms a 200K-token context that's 30% noise.
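If you'd rather script the basics yourself, a crude version of that cleanup is a few regular expressions (a sketch of the idea, not what those tools do internally):

import re

def clean_page_text(text: str) -> str:
    # Rejoin words broken by end-of-line hyphenation ("infor-\nmation")
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Drop lines that are only a page number
    text = re.sub(r"^\s*\d{1,4}\s*$", "", text, flags=re.MULTILINE)
    # Collapse the blank-line runs left behind by removed headers and footers
    return re.sub(r"\n{3,}", "\n\n", text).strip()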

Step 2: Compress aggressively

For very long inputs, Context Compressor can shrink prompts 30–50% without meaningful information loss by collapsing whitespace, removing low-signal phrases, and condensing repetition. Two practical wins:

  • Lower per-call cost (the API bills per input token)
  • More context room left over for the model's response
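For readers who want to script it, here's a rough approximation of the first two techniques (Context Compressor's actual pipeline is more sophisticated; this only shows the shape of the idea):

import re

def compress(text: str) -> str:
    # Collapse runs of spaces and tabs to a single space
    text = re.sub(r"[ \t]+", " ", text)
    # Drop exact-duplicate lines, e.g. boilerplate repeated on every page
    seen, kept = set(), []
    for line in text.splitlines():
        key = line.strip().lower()
        if key and key in seen:
            continue
        seen.add(key)
        kept.append(line)
    return "\n".join(kept)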

Step 3: Use clear delimiters

Claude responds well to structured prompts. Wrap each document in XML-like tags so the model can reference them precisely:

<documents>
  <document index="1">
    <source>Q3-report.pdf</source>
    <content>...cleaned text...</content>
  </document>
  <document index="2">
    <source>competitor-analysis.pdf</source>
    <content>...cleaned text...</content>
  </document>
</documents>

Question: Which competitor grew fastest in Q3, and why?
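Assembled in Python and sent through the official anthropic SDK, that looks roughly like this (the model alias and max_tokens value are placeholders; use whatever fits your account):

import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def wrap_documents(docs: list[tuple[str, str]]) -> str:
    # docs is a list of (source_name, cleaned_text) pairs
    blocks = [
        f'  <document index="{i}">\n    <source>{source}</source>\n'
        f'    <content>{content}</content>\n  </document>'
        for i, (source, content) in enumerate(docs, start=1)
    ]
    return "<documents>\n" + "\n".join(blocks) + "\n</documents>"

docs = [("Q3-report.pdf", "...cleaned text..."), ("competitor-analysis.pdf", "...cleaned text...")]
prompt = wrap_documents(docs) + "\n\nQuestion: Which competitor grew fastest in Q3, and why?"

message = client.messages.create(
    model="claude-3-5-sonnet-latest",  # placeholder alias; pick your model
    max_tokens=1024,
    messages=[{"role": "user", "content": prompt}],
)
print(message.content[0].text)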

Step 4: Ask once, ask well

With a long context, treat each call as your one shot: every follow-up re-sends the entire context, so each extra question costs the full input price again. Front-load any sub-questions, requested formats, and constraints in the first message.
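For example, a single front-loaded message might end like this, after the documents block above (the wording is illustrative):

Answer using only the documents above.
1. Which competitor grew fastest in Q3?
2. What drove that growth?
3. Cite the source document for each claim.
Format: one summary paragraph, then a bulleted list of evidence.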

What about Gemini's 2M context?

Everything above applies, just more so. Gemini 1.5 Pro and 2.0 happily accept hour-long videos and 1,500-page books, but the lost-in-the-middle effect is more pronounced at extreme lengths. Cleaning and ordering matter more, not less.

When to chunk instead

If your real corpus is bigger than even a 200K window, switch to retrieval. Build a small RAG system: embed cleaned chunks and feed only the top matches into the prompt. Batch Document Extractor's ZIP output (one file per source) drops directly into LangChain or LlamaIndex.
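A minimal version of that retrieval step, assuming the sentence-transformers library and an off-the-shelf embedding model (real deployments usually keep the vectors in a store, which is what LangChain and LlamaIndex wrap for you):

import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

def top_chunks(question: str, chunks: list[str], k: int = 5) -> list[str]:
    # Embed the question and every cleaned chunk as unit vectors
    q = model.encode([question], normalize_embeddings=True)[0]
    c = model.encode(chunks, normalize_embeddings=True)
    # Cosine similarity is a dot product on normalized vectors; keep the top k
    best = np.argsort(c @ q)[::-1][:k]
    return [chunks[i] for i in best]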


Frequently asked

Does Claude really use all 200K tokens equally?

No. In practice, long-context models recall information near the start and end of the prompt more reliably than material buried in the middle. Putting the question at both ends and the most important document first measurably improves accuracy.

Should I split a long document into chunks?

If the document is genuinely longer than the context window, yes. Otherwise, sending it as one continuous block lets the model build a coherent mental model of the whole.

Is compression always worth it?

For documents over ~30K tokens, almost always. Below that, the savings rarely justify the extra step.
