
Building a RAG Knowledge Base from Mixed Document Types

Most real knowledge lives across PDFs, scanned images, Word docs, and web exports. Here's how to turn all of them into a clean, embeddable corpus — entirely in your browser.

Retrieval-Augmented Generation pipelines live or die on input quality. The fanciest embedding model and re-ranker can't recover from a corpus where every chunk carries page numbers, headers, and broken hyphens. The pre-processing step often matters more than the choice of model.

The realistic input

Most knowledge bases have to ingest a mix:

  • Born-digital PDFs (reports, papers)
  • Scanned PDFs (contracts, invoices, archive material)
  • Word docs (.docx) and the occasional .doc
  • Images of pages (phone photos, screenshots)
  • Plain-text and Markdown notes
  • HTML exported from internal wikis

One pipeline, every type

Batch Document Extractor handles all of these in one drop. Each file is routed to the right engine (pdf.js for PDFs, Tesseract OCR for images, mammoth for Word, native parsing for text/HTML) and the output is a ZIP with one cleaned text file per input — exactly the shape an embedding pipeline expects.
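The extractor's internals aren't shown here, but the routing idea is easy to sketch. The snippet below assumes the browser builds of pdfjs-dist, tesseract.js, and mammoth and sticks to their documented entry points; it is an illustration of the dispatch pattern, not the tool's actual code.

```
import * as pdfjsLib from "pdfjs-dist";
import Tesseract from "tesseract.js";
import * as mammoth from "mammoth";

// Route one uploaded File to the extraction engine that matches its type.
async function extractText(file: File): Promise<string> {
  const buffer = await file.arrayBuffer();

  // Born-digital PDFs: pull the text layer page by page with pdf.js.
  // (Scanned PDFs with no text layer would need each page rendered to a
  // canvas and run through OCR; that step is omitted in this sketch.)
  if (file.type === "application/pdf") {
    const pdf = await pdfjsLib.getDocument({ data: new Uint8Array(buffer) }).promise;
    const pages: string[] = [];
    for (let i = 1; i <= pdf.numPages; i++) {
      const page = await pdf.getPage(i);
      const content = await page.getTextContent();
      pages.push(content.items.map((item: any) => item.str ?? "").join(" "));
    }
    return pages.join("\n\n");
  }

  // Page photos and screenshots: OCR with Tesseract.
  if (file.type.startsWith("image/")) {
    const { data } = await Tesseract.recognize(file, "eng");
    return data.text;
  }

  // .docx: mammoth extracts the raw text from the document XML.
  if (file.name.toLowerCase().endsWith(".docx")) {
    const { value } = await mammoth.extractRawText({ arrayBuffer: buffer });
    return value;
  }

  // Plain text, Markdown, HTML: read directly (strip HTML tags downstream).
  return await file.text();
}
```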

Chunking strategy that works

  1. Split on paragraph boundaries first
  2. Then merge paragraphs until you reach 500–1,000 tokens
  3. Add 100 tokens of overlap between chunks (helps when answers straddle a boundary)
  4. Carry source metadata on every chunk (filename, page if known); all four steps are sketched below
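A minimal TypeScript sketch of those four steps. Token counts are approximated as characters divided by four; swap in a real tokenizer for your embedding model if you need precision, and treat the names here as placeholders.

```
interface Chunk {
  text: string;
  source: string;   // filename carried on every chunk
  page?: number;     // page number, if the extractor preserved it
}

// Rough token estimate: ~4 characters per token.
const approxTokens = (s: string) => Math.ceil(s.length / 4);

function chunkDocument(
  text: string,
  source: string,
  maxTokens = 800,
  overlapTokens = 100
): Chunk[] {
  // 1. Split on paragraph boundaries first.
  const paragraphs = text.split(/\n{2,}/).map(p => p.trim()).filter(Boolean);

  const chunks: Chunk[] = [];
  let current: string[] = [];
  let currentTokens = 0;

  for (const para of paragraphs) {
    const t = approxTokens(para);
    if (currentTokens + t > maxTokens && current.length > 0) {
      chunks.push({ text: current.join("\n\n"), source });
      // 3. Start the next chunk with ~100 tokens of trailing overlap.
      const tail = current.join("\n\n").slice(-overlapTokens * 4);
      current = [tail];
      currentTokens = approxTokens(tail);
    }
    // 2. Merge paragraphs until the target size is reached.
    current.push(para);
    currentTokens += t;
  }
  if (current.length > 0) chunks.push({ text: current.join("\n\n"), source });

  // 4. Source metadata travels with every chunk (see the Chunk interface).
  return chunks;
}
```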

Why metadata matters

When the LLM cites a fact at answer time, you want it to point back to "Q3-strategy.docx, p. 12" — not just "the corpus". Carry the source filename through the entire pipeline. Good extraction tools (and the batch extractor's per-file ZIP) make this trivial.

Embedding models in 2026

  • OpenAI text-embedding-3-small: cheap, fast, 1536-dim
  • Voyage-3: outstanding quality for retrieval, 1024-dim
  • Cohere embed-multilingual-v3: best when your corpus crosses languages
  • BGE-M3: strong open-source option you can host yourself
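Whichever model you choose, the request shape is similar. Here is a sketch against OpenAI's documented /v1/embeddings endpoint for text-embedding-3-small; the key handling is a placeholder, and in a browser-only setup you would normally route the call through a small proxy so the key is never exposed to the client.

```
// Embed a batch of chunk texts with text-embedding-3-small.
async function embedChunks(texts: string[], apiKey: string): Promise<number[][]> {
  const res = await fetch("https://api.openai.com/v1/embeddings", {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      Authorization: `Bearer ${apiKey}`,
    },
    body: JSON.stringify({
      model: "text-embedding-3-small",
      input: texts, // the endpoint accepts an array of strings
    }),
  });
  const json = await res.json();
  // Each item carries an `embedding` array (1536 dims for this model).
  return json.data.map((d: { embedding: number[] }) => d.embedding);
}
```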

Serve with citations, not just answers

Modern frontends expect inline citations. Have your retriever return chunks with their metadata, then prompt the model to use a structured answer format that references the source IDs. This is where clean source files (one per logical document) pay off — citations stay coherent.
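One way to wire that up is sketched below. The [S1]-style IDs and the prompt wording are illustrative, not a fixed format; the chunk shape matches the metadata carried earlier in the pipeline.

```
interface RetrievedChunk {
  text: string;
  source: string;   // e.g. "Q3-strategy.docx"
  page?: number;
}

// Build a prompt that asks the model to cite retrieved chunks by source ID.
function buildPrompt(question: string, chunks: RetrievedChunk[]): string {
  const context = chunks
    .map((c, i) => `[S${i + 1}] (${c.source}${c.page ? `, p. ${c.page}` : ""})\n${c.text}`)
    .join("\n\n");

  return [
    "Answer the question using only the sources below.",
    "Cite sources inline with their IDs, e.g. [S2].",
    "",
    context,
    "",
    `Question: ${question}`,
  ].join("\n");
}
```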

The minimum viable RAG, all browser-side

For very small corpora (under a few thousand chunks), you can build the entire pipeline client-side: extract with the batch tool, chunk in JavaScript, embed via an API call, store in IndexedDB, and serve. Total cost: pennies. Markdown Converter is useful here for normalizing any HTML pages you scrape into clean Markdown chunks.
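A sketch of the storage and retrieval half: chunks and their embeddings persisted in IndexedDB, then scored with plain cosine similarity. The database and store names are arbitrary placeholders.

```
interface StoredChunk {
  id: number;
  text: string;
  source: string;
  embedding: number[];
}

function openDb(): Promise<IDBDatabase> {
  return new Promise((resolve, reject) => {
    const req = indexedDB.open("rag-db", 1);
    req.onupgradeneeded = () =>
      req.result.createObjectStore("chunks", { keyPath: "id" });
    req.onsuccess = () => resolve(req.result);
    req.onerror = () => reject(req.error);
  });
}

function loadAllChunks(db: IDBDatabase): Promise<StoredChunk[]> {
  return new Promise((resolve, reject) => {
    const req = db.transaction("chunks").objectStore("chunks").getAll();
    req.onsuccess = () => resolve(req.result as StoredChunk[]);
    req.onerror = () => reject(req.error);
  });
}

function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

// Flat scan over every stored chunk, then take the top k by similarity.
async function topK(queryEmbedding: number[], k = 5): Promise<StoredChunk[]> {
  const chunks = await loadAllChunks(await openDb());
  return chunks
    .map(c => ({ c, score: cosine(queryEmbedding, c.embedding) }))
    .sort((a, b) => b.score - a.score)
    .slice(0, k)
    .map(x => x.c);
}
```

A flat scan like this stays fast well into the thousands of chunks, which is exactly the regime the FAQ below describes.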

Frequently asked

Should I chunk before or after cleaning?

Always after. Chunking noisy text creates noisy embeddings, which hurts retrieval recall.

What chunk size is best?

500–1,000 tokens with 100 tokens of overlap is a strong default for most embedding models.

Do I need a vector database for a small corpus?

Under ~10,000 chunks, a flat in-memory index (FAISS-CPU or even cosine similarity on a numpy array) is fine and fast.
