Building a RAG Knowledge Base from Mixed Document Types
Most real knowledge lives across PDFs, scanned images, Word docs, and web exports. Here's how to turn them into a clean, embeddable corpus, entirely in your browser.
Retrieval-Augmented Generation pipelines live or die on input quality. The fanciest embedding model and re-ranker can't recover from a corpus where every chunk carries page numbers, headers, and broken hyphens. The pre-processing step often matters more than the choice of model.
The realistic input
Most knowledge bases have to ingest a mix:
- Born-digital PDFs (reports, papers)
- Scanned PDFs (contracts, invoices, archive material)
- Word docs (.docx) and the occasional .doc
- Images of pages (phone photos, screenshots)
- Plain-text and Markdown notes
- HTML exported from internal wikis
One pipeline, every type
Batch Document Extractor handles all of these in one drop. Each file is routed to the right engine (pdf.js for PDFs, Tesseract OCR for images, mammoth for Word, native parsing for text/HTML) and the output is a ZIP with one cleaned text file per input — exactly the shape an embedding pipeline expects.
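If you want to wire up the same routing yourself, the branching is simple. Here's a minimal sketch, assuming pdf.js (pdfjsLib), Tesseract.js, and mammoth are already loaded on the page; it's illustrative, not the batch extractor's actual internals:

```js
// Route each dropped file to the right extraction engine by extension.
// Assumes pdf.js (pdfjsLib), Tesseract.js, and mammoth are loaded on the page.
async function extractText(file) {
  const ext = file.name.split('.').pop().toLowerCase();

  if (ext === 'pdf') {
    const pdf = await pdfjsLib.getDocument({ data: await file.arrayBuffer() }).promise;
    let text = '';
    for (let i = 1; i <= pdf.numPages; i++) {
      const content = await (await pdf.getPage(i)).getTextContent();
      text += content.items.map((item) => item.str).join(' ') + '\n\n';
    }
    return text;
  }
  if (['png', 'jpg', 'jpeg'].includes(ext)) {
    const { data } = await Tesseract.recognize(file, 'eng'); // OCR for page images
    return data.text;
  }
  if (ext === 'docx') {
    const { value } = await mammoth.extractRawText({ arrayBuffer: await file.arrayBuffer() });
    return value;
  }
  // Plain text, Markdown, HTML: read natively (strip tags for HTML).
  const raw = await file.text();
  return ext === 'html'
    ? new DOMParser().parseFromString(raw, 'text/html').body.textContent
    : raw;
}
```

Note that a scanned PDF will come back from pdf.js nearly empty; detecting that and re-running its pages through OCR is exactly the kind of branching the batch extractor handles for you.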
Chunking strategy that works
- Split on paragraph boundaries first
- Then merge paragraphs until you reach 500–1,000 tokens
- Add 100 tokens of overlap between chunks (helps when answers straddle a boundary)
- Carry source metadata on every chunk (filename, page if known); a sketch of the whole strategy follows below
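Here's a minimal sketch of that strategy in plain JavaScript. It approximates token counts at roughly four characters per token rather than running a real tokenizer:

```js
// Split on paragraphs, merge to a token budget, add overlap, keep metadata.
// Token counts are approximated as ~4 characters per token (a rough heuristic).
const approxTokens = (s) => Math.ceil(s.length / 4);

function chunkDocument(text, source, { maxTokens = 800, overlapTokens = 100 } = {}) {
  const paragraphs = text.split(/\n\s*\n/).map((p) => p.trim()).filter(Boolean);
  const chunks = [];
  let current = [];
  let budget = 0;

  for (const para of paragraphs) {
    if (budget + approxTokens(para) > maxTokens && current.length) {
      chunks.push({ source, text: current.join('\n\n') });
      // Start the next chunk with the tail of this one as overlap.
      const tail = [];
      let tailTokens = 0;
      for (let i = current.length - 1; i >= 0 && tailTokens < overlapTokens; i--) {
        tail.unshift(current[i]);
        tailTokens += approxTokens(current[i]);
      }
      current = tail;
      budget = tailTokens;
    }
    current.push(para);
    budget += approxTokens(para);
  }
  if (current.length) chunks.push({ source, text: current.join('\n\n') });
  return chunks.map((c, i) => ({ id: `${source}#${i}`, ...c }));
}
```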
Why metadata matters
When the LLM cites a fact at answer time, you want it to point back to "Q3-strategy.docx, p. 12" — not just "the corpus". Carry the source filename through the entire pipeline. Good extraction tools (and the batch extractor's per-file ZIP) make this trivial.
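Concretely, one workable chunk record shape (the field names are just a convention, not something any particular tool requires):

```js
// A chunk record that carries enough metadata to cite at answer time.
const chunk = {
  id: 'Q3-strategy.docx#7',
  source: 'Q3-strategy.docx',
  page: 12,                  // null if the source format has no pages
  text: 'chunk text here',
};

const cite = (c) => (c.page ? `${c.source}, p. ${c.page}` : c.source);
// cite(chunk) returns "Q3-strategy.docx, p. 12"
```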
Embedding models in 2026
- OpenAI text-embedding-3-small: cheap, fast, 1536-dim
- Voyage-3: outstanding quality for retrieval, 1024-dim
- Cohere embed-multilingual-v3: best when your corpus crosses languages
- BGE-M3: strong open-source option you can host yourself
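Whichever model you choose, the call pattern is the same: send a batch of chunk texts, get one vector back per chunk. Here's a sketch against OpenAI's embeddings endpoint with text-embedding-3-small; other providers have near-identical request shapes:

```js
// Embed a batch of chunk texts with one API call.
// Assumes an OpenAI API key; swap the endpoint and model for your provider.
async function embedChunks(chunks, apiKey) {
  const res = await fetch('https://api.openai.com/v1/embeddings', {
    method: 'POST',
    headers: {
      'Authorization': `Bearer ${apiKey}`,
      'Content-Type': 'application/json',
    },
    body: JSON.stringify({
      model: 'text-embedding-3-small',
      input: chunks.map((c) => c.text),
    }),
  });
  const { data } = await res.json();
  // The API returns embeddings in input order; attach each vector to its chunk.
  return chunks.map((c, i) => ({ ...c, embedding: data[i].embedding }));
}
```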
Serve with citations, not just answers
Modern frontends expect inline citations. Have your retriever return chunks with their metadata, then prompt the model to use a structured answer format that references the source IDs. This is where clean source files (one per logical document) pay off — citations stay coherent.
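One simple way to do that is to number the retrieved chunks and ask the model to cite the numbers. A sketch, with prompt wording that is just one option among many:

```js
// Build a prompt that asks the model to cite retrieved chunks by number.
function buildPrompt(question, retrievedChunks) {
  const sources = retrievedChunks
    .map((c, i) => `[${i + 1}] (${c.source}${c.page ? `, p. ${c.page}` : ''})\n${c.text}`)
    .join('\n\n');

  return [
    'Answer the question using only the sources below.',
    'Cite sources inline as [1], [2], etc. after each claim.',
    '',
    sources,
    '',
    `Question: ${question}`,
  ].join('\n');
}
```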
The minimum viable RAG, all browser-side
For very small corpora (under a few thousand chunks), you can build the entire pipeline client-side: extract with the batch tool, chunk in JavaScript, embed via an API call, store in IndexedDB, and serve. Total cost: pennies. Markdown Converter is useful here for normalizing any HTML pages you scrape into clean Markdown chunks.
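Here's a sketch of the storage step with the raw IndexedDB API; the database and store names are arbitrary, and a small wrapper library such as idb would shorten this considerably:

```js
// Persist embedded chunks in IndexedDB so the corpus survives page reloads.
// 'rag-store' and 'chunks' are arbitrary names chosen for this sketch.
function openDb() {
  return new Promise((resolve, reject) => {
    const req = indexedDB.open('rag-store', 1);
    req.onupgradeneeded = () => req.result.createObjectStore('chunks', { keyPath: 'id' });
    req.onsuccess = () => resolve(req.result);
    req.onerror = () => reject(req.error);
  });
}

async function saveChunks(embeddedChunks) {
  const db = await openDb();
  const tx = db.transaction('chunks', 'readwrite');
  const store = tx.objectStore('chunks');
  embeddedChunks.forEach((c) => store.put(c));
  return new Promise((resolve, reject) => {
    tx.oncomplete = resolve;
    tx.onerror = () => reject(tx.error);
  });
}
```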
Frequently asked
Should I chunk before or after cleaning?
Always after. Chunking noisy text creates noisy embeddings, which hurts retrieval recall.
What chunk size is best?
500–1,000 tokens with 100 tokens of overlap is a strong default for most embedding models.
Do I need a vector database for a small corpus?
Under ~10,000 chunks, a flat in-memory index (FAISS-CPU or even cosine similarity on a numpy array) is fine and fast.
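The browser-side equivalent of that flat index is a plain array scan, something like:

```js
// Brute-force retrieval: cosine similarity against every stored chunk.
// Fast enough for a few thousand chunks; no vector database needed.
function cosine(a, b) {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

function topK(queryEmbedding, chunks, k = 5) {
  return chunks
    .map((c) => ({ ...c, score: cosine(queryEmbedding, c.embedding) }))
    .sort((a, b) => b.score - a.score)
    .slice(0, k);
}
```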