From a Folder of Word Docs to a RAG-Ready Knowledge Base in 10 Minutes
You inherited a folder of .docx files — policies, runbooks, onboarding guides — and someone wants them queryable by an internal AI. Here's the fastest path from raw documents to a working RAG corpus.
The pattern is everywhere: a team has 50–500 Word documents that hold the institutional knowledge — HR policies, deployment runbooks, customer playbooks — and now leadership wants an AI assistant that can answer questions over them. The tutorials all jump straight to "embed and store in a vector DB," skipping the unglamorous part: getting the documents into a state where embedding actually works well.
That preparation is 80% of the project. Get it right and your retrieval is sharp. Get it wrong and the AI hallucinates in confident, well-formatted Markdown.
The full pipeline
- Drop all .docx files into Batch Document Extractor.
- Choose the per-file ZIP output (one .txt per document).
- Pass each text file through the Markdown Converter to add structure.
- Compress optional sections (long preambles, signature blocks).
- Chunk by Markdown heading.
- Embed and store in your vector DB.
Step 1: Bulk extract with the right tool
Batch Document Extractor handles up to 20 files at a time, reading .docx with mammoth (the same library LangChain uses internally). Drop them in, click extract, choose "ZIP per file." You now have one .txt per source document, ready for the next step.
For sets larger than 20, run multiple batches. Or, if you're comfortable with code, mammoth itself is a five-line Node.js script.
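If you go the code route, mammoth also ships as a Python package (`pip install mammoth`), which removes the batch limit entirely. A minimal sketch (the `docs` folder name is a placeholder):

```python
from pathlib import Path
import mammoth

# Extract raw text from every .docx in the folder -- no 20-file limit.
for docx in Path("docs").glob("*.docx"):
    with open(docx, "rb") as f:
        result = mammoth.extract_raw_text(f)
    docx.with_suffix(".txt").write_text(result.value, encoding="utf-8")
```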
Step 2: Add structure with Markdown
Plain text from Word loses headings — they become paragraphs that look like every other paragraph. Run the text through Markdown Converter to detect headings (typically by font weight and size in the original DOCX) and emit `#` and `##` markers. This is the single biggest lever for retrieval quality.
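If you're scripting the extraction anyway, you can collapse steps 1 and 2: mammoth can emit Markdown directly from the DOCX, mapping Word's own heading styles to `#` markers instead of inferring them from plain text. A sketch:

```python
import mammoth

# Word's Heading 1 / Heading 2 styles become # / ## directly,
# so no heuristic heading detection is needed.
with open("docs/onboarding-guide.docx", "rb") as f:  # placeholder path
    markdown = mammoth.convert_to_markdown(f).value
```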
Step 3: Compress what doesn't matter
Most policy documents have boilerplate: revision history tables, "this document supersedes..." paragraphs, page-footer disclaimers. They embed and retrieve like any other text but never answer a useful question. Run Context Compressor on each file to strip them.
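If you prefer to script this step, a regex pass covers the predictable cases. A minimal sketch; the patterns below are illustrative examples, so tune them to your own corpus:

```python
import re

# Illustrative boilerplate patterns -- extend these for your corpus.
BOILERPLATE_PATTERNS = [
    r"(?im)^this document supersedes.*$",
    r"(?im)^confidential: internal use only\.?$",
    # Drop a "Revision history" section up to the next heading.
    r"(?ims)^#{1,3} revision history.*?(?=^#{1,3} |\Z)",
]

def strip_boilerplate(markdown: str) -> str:
    for pattern in BOILERPLATE_PATTERNS:
        markdown = re.sub(pattern, "", markdown)
    return markdown.strip() + "\n"
```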
Step 4: Chunk by heading, not by character
The lazy default is "split every 1,000 characters." That cuts paragraphs in half and produces chunks like "...the employee shall // be entitled to..." that embed terribly. Instead (a sketch follows the list):
- Split on `##` (or `#`) so each chunk is one section.
- If a section exceeds ~1,500 tokens, sub-split on paragraph boundaries.
- Add 100 tokens of overlap between adjacent chunks for context.
- Tag each chunk with its document filename and section heading as metadata.
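Here's a minimal sketch of such a chunker. It uses a crude four-characters-per-token estimate in place of a real tokenizer; swap in tiktoken if you need accurate counts:

```python
import re

def chunk_markdown(markdown: str, filename: str,
                   max_tokens: int = 1500, overlap_tokens: int = 100):
    """One chunk per #/## section, tagged with filename and heading."""
    def est(s):  # crude estimate: ~4 characters per token
        return len(s) // 4

    chunks = []
    # Split before every level-1/2 heading, keeping the heading line.
    for section in re.split(r"(?m)^(?=#{1,2} )", markdown):
        if not section.strip():
            continue
        heading = section.splitlines()[0].lstrip("# ").strip()
        meta = {"source": filename, "heading": heading}
        if est(section) <= max_tokens:
            chunks.append({"text": section.strip(), **meta})
            continue
        # Oversized section: sub-split on paragraph boundaries,
        # carrying ~overlap_tokens of trailing context forward.
        buf = ""
        for para in section.split("\n\n"):
            if buf and est(buf) + est(para) > max_tokens:
                chunks.append({"text": buf.strip(), **meta})
                buf = buf[-overlap_tokens * 4:]
            buf += para + "\n\n"
        if buf.strip():
            chunks.append({"text": buf.strip(), **meta})
    return chunks
```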
Step 5: Embed and retrieve
Use OpenAI's `text-embedding-3-small` for a good cost/quality balance — about $0.02 per million tokens. For 500 documents averaging 5,000 tokens each, that's roughly $0.05 to embed the entire corpus. Store in Pinecone, Weaviate, Qdrant, or even a single Postgres table with pgvector.
At query time, embed the user's question, retrieve the top 5–10 chunks, and stuff them into the model's context with the question. That's RAG.
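A sketch of that loop with the OpenAI Python SDK and a plain in-memory index, reusing the `chunks` list from the chunker above; in production you'd swap the numpy search for one of the vector stores:

```python
import numpy as np
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def embed(texts: list[str]) -> np.ndarray:
    resp = client.embeddings.create(
        model="text-embedding-3-small", input=texts
    )
    return np.array([item.embedding for item in resp.data])

# Index once: one vector per chunk.
vectors = embed([c["text"] for c in chunks])

def retrieve(question: str, k: int = 5) -> list[dict]:
    q = embed([question])[0]
    # Cosine similarity between the question and every chunk.
    scores = vectors @ q / (np.linalg.norm(vectors, axis=1) * np.linalg.norm(q))
    return [chunks[i] for i in np.argsort(scores)[::-1][:k]]
```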
What good retrieval looks like
- User asks a specific question; system returns 3–5 relevant chunks, each from a different section.
- Chunks include the source filename and heading in metadata so the AI can cite.
- AI's answer references specific clauses, not vague summaries.
- When a question has no answer in the corpus, the AI says so instead of hallucinating.
Common failure modes (and the fix)
Retrieval returns the same chunk for every question. Your chunks are too long or too few. Re-chunk smaller.
AI cites a section that doesn't say what it claims. The chunk's heading didn't match its content (a common artifact when the converter mistags). Re-run conversion with stricter heading detection.
AI invents document names. Your metadata isn't being passed into the prompt. Verify the chunk's filename is included in the system prompt.
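That fix is just prompt assembly. A sketch that labels each retrieved chunk with its source before it reaches the model, using the chunk dicts from earlier:

```python
def build_prompt(question: str, retrieved: list[dict]) -> str:
    context = "\n\n".join(
        f'[{c["source"]} :: {c["heading"]}]\n{c["text"]}' for c in retrieved
    )
    return (
        "Answer using only the context below. Cite the source file and "
        "section heading for every claim. If the answer is not in the "
        "context, say you don't know.\n\n"
        f"{context}\n\nQuestion: {question}"
    )
```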
Maintenance
When a document changes, re-run conversion for that file only and replace its chunks in the vector DB. A small Python script with a file-modified-date check makes this near-instant. The corpus stays current; the AI stays accurate.
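A sketch of that check; `reindex()` is a placeholder for whatever re-runs your extract, convert, compress, and chunk pipeline for a single file:

```python
from pathlib import Path

SOURCE_DIR = Path("docs")       # placeholder folder of .docx sources
STAMP = Path(".last_indexed")   # empty file marking the previous sync

cutoff = STAMP.stat().st_mtime if STAMP.exists() else 0.0
for docx in SOURCE_DIR.glob("*.docx"):
    if docx.stat().st_mtime > cutoff:
        # reindex() stands in for: extract -> markdown -> compress ->
        # chunk -> delete the file's old vectors -> upsert new chunks.
        reindex(docx)
STAMP.touch()
```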
Frequently asked
Why Markdown for RAG instead of plain text?
Headings let the chunker split semantically (one chunk per section) instead of arbitrarily by character count. Retrieval quality jumps.
How big should each chunk be?
500–1,000 tokens for generic Q&A, 1,500–2,000 if your questions span multiple paragraphs. Always include 100 tokens of overlap between chunks.
Can I do this without writing code?
Conversion: yes, fully no-code. Embedding and retrieval: you'll need either LangChain/LlamaIndex (Python) or a no-code platform like Pinecone Assistant or Vercel's RAG starter.