When you have a folder of mixed documents — PDFs, scanned images, Word manuscripts, plain-text notes — extracting them one at a time is tedious. Batch Document Extractor processes up to 20 files of any supported type in a single pass and gives you one combined output or a ZIP with one clean text file per input.
How to use it
1. Drop your files
   Mix PDFs, images, Word .docx files, and plain text, up to 20 files at 50 MB each. Reorder or remove individual entries before processing.
2. Click Extract all
   Each file is routed to the right engine: pdf.js for PDFs, Tesseract for images, mammoth for Word documents.
3. Download combined or zipped
   Grab the merged .txt or .md from the output panel, or download a ZIP with one file per input, which is useful for RAG ingestion.
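The limits and routing described in these steps can be sketched roughly as follows. This is an illustration only, not the tool's actual code: the function names, the extension lists, the pass-through handling of plain text, and the reading of 50 MB as 50 × 1024 × 1024 bytes are all assumptions. The engine names are used as plain labels; no extraction library is called.

```python
from pathlib import PurePath

MAX_FILES = 20                  # batch limit from the steps above
MAX_BYTES = 50 * 1024 * 1024    # 50 MB per file (assumed to mean binary megabytes)

# Hypothetical extension-to-engine table; the exact extension lists are assumptions.
ENGINES = {
    ".pdf": "pdf.js",
    ".png": "tesseract",
    ".jpg": "tesseract",
    ".jpeg": "tesseract",
    ".docx": "mammoth",
    ".txt": "plain-text",   # assumption: plain text is passed through unchanged
    ".md": "plain-text",
}


def pick_engine(filename: str) -> str:
    """Return the engine label for a file, based on its lowercased extension."""
    ext = PurePath(filename).suffix.lower()
    if ext not in ENGINES:
        raise ValueError(f"unsupported file type: {filename}")
    return ENGINES[ext]


def plan_batch(files: list[tuple[str, int]]) -> list[tuple[str, str]]:
    """Check the batch limits, then pair each file name with its engine label.

    `files` is a list of (name, size_in_bytes) tuples.
    """
    if len(files) > MAX_FILES:
        raise ValueError(f"too many files: {len(files)} > {MAX_FILES}")
    plan = []
    for name, size in files:
        if size > MAX_BYTES:
            raise ValueError(f"{name} exceeds the 50 MB limit")
        plan.append((name, pick_engine(name)))
    return plan
```

Calling `plan_batch([("invoice.pdf", 120_000), ("receipt.JPG", 900_000)])` pairs the PDF with the `pdf.js` label and the image with the `tesseract` label. A batch of 21 files, or any file over the size limit, raises `ValueError` before any extraction would start.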
Building a knowledge base for RAG
Retrieval-augmented generation pipelines work best with clean text, one document per file. The ZIP export here gives you exactly that: unzip it and point the directory loader in LangChain, LlamaIndex, or any embeddings pipeline at the folder.
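If you would rather not depend on a framework, the unzipped folder can also be read with a few lines of standard-library Python. The sketch below is one possible approach, not part of the tool. The helper names, the `.txt`/`.md` filter, and the chunk sizes are assumptions chosen for illustration. It loads each file as one document and splits the text into overlapping character windows, a common preparation step before embedding.

```python
from pathlib import Path


def load_documents(folder: str) -> list[dict]:
    """Read every .txt/.md file in `folder` as one document, keeping its file name."""
    docs = []
    for path in sorted(Path(folder).iterdir()):
        if path.is_file() and path.suffix.lower() in {".txt", ".md"}:
            text = path.read_text(encoding="utf-8", errors="replace")
            docs.append({"source": path.name, "text": text})
    return docs


def chunk(text: str, size: int = 1000, overlap: int = 200) -> list[str]:
    """Split text into character windows of `size` that overlap by `overlap`."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    if not text:
        return []
    step = size - overlap
    # Stop once the remaining tail is already covered by the previous window.
    last_start = max(len(text) - overlap, 1)
    return [text[i:i + size] for i in range(0, last_start, step)]
```

For example, `chunk("abcdefghij", size=4, overlap=1)` returns `["abcd", "defg", "ghij"]`. Keeping the `source` file name next to each document lets you trace a retrieved chunk back to the input it came from.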
Best for
- Bulk-converting scanned receipts and invoices
- Preparing a folder of research papers for AI
- Migrating Word manuscripts to plain text
- Building a small RAG knowledge base