When you have a folder of mixed documents — PDFs, scanned images, Word manuscripts, plain-text notes — extracting them one at a time is tedious. Batch Document Extractor processes up to 20 files of any supported type in a single pass and gives you one combined output or a ZIP with one clean text file per input.

How to use it

1. Drop your files
Mix PDFs, images, Word .docx, and plain text: up to 20 files, 50 MB each. Reorder or remove individual entries before processing.
2. Click Extract all
Each file is routed to the right engine: pdf.js for PDFs, Tesseract for images, mammoth for Word documents.
3. Download combined or zipped
Grab the merged .txt or .md from the output panel, or download a ZIP with one file per input, which is useful for RAG ingestion.
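
The limits in step 1 and the routing in step 2 can be pictured as a size check followed by a lookup from file extension to engine. The following is a minimal sketch in Python, not the tool's actual code (pdf.js and mammoth are JavaScript libraries). The function names, the exact extension list, the "passthrough" label for plain text, and the reading of 50 MB as binary megabytes are all assumptions made for illustration.

```python
from pathlib import PurePath

MAX_FILES = 20                 # batch limit stated in step 1
MAX_BYTES = 50 * 1024 * 1024   # "50 MB each"; binary megabytes are an assumption

# The text names only format families. The concrete extensions below,
# and the "passthrough" label for plain text, are assumptions.
ENGINE_BY_EXTENSION = {
    ".pdf": "pdf.js",
    ".png": "tesseract",
    ".jpg": "tesseract",
    ".jpeg": "tesseract",
    ".docx": "mammoth",
    ".txt": "passthrough",
    ".md": "passthrough",
}


def route(filename: str) -> str:
    """Return the engine label for a file, based on its extension."""
    ext = PurePath(filename).suffix.lower()
    if ext not in ENGINE_BY_EXTENSION:
        raise ValueError(f"unsupported file type: {filename}")
    return ENGINE_BY_EXTENSION[ext]


def plan_batch(files: list[tuple[str, int]]) -> list[tuple[str, str]]:
    """Validate (name, size_in_bytes) pairs and pair each name with an engine."""
    if len(files) > MAX_FILES:
        raise ValueError(f"too many files: {len(files)} > {MAX_FILES}")
    plan = []
    for name, size in files:
        if size > MAX_BYTES:
            raise ValueError(f"file too large: {name}")
        plan.append((name, route(name)))
    return plan
```

Calling `plan_batch([("paper.pdf", 1200000), ("scan.JPG", 800000)])` pairs the first file with `"pdf.js"` and the second with `"tesseract"`; the extension match is case-insensitive.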

Building a knowledge base for RAG

Retrieval-augmented generation pipelines work best with clean text, one document per file. The ZIP export gives you exactly that: unzip it and point LangChain, LlamaIndex, or any other embeddings pipeline at the folder.
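
As a rough illustration of what happens to the unzipped folder downstream, the sketch below reads every text file and splits it into overlapping chunks ready for embedding. It uses only the Python standard library instead of a specific framework. The function names, the UTF-8 encoding, the .txt/.md filter, and the chunk sizes are assumptions, not settings taken from the tool.

```python
from pathlib import Path


def load_documents(folder: str) -> dict[str, str]:
    """Read every .txt and .md file in the folder, keyed by file name.

    Assumes the extracted files are UTF-8 encoded.
    """
    docs = {}
    for path in sorted(Path(folder).iterdir()):
        if path.is_file() and path.suffix.lower() in {".txt", ".md"}:
            docs[path.name] = path.read_text(encoding="utf-8")
    return docs


def chunk(text: str, size: int = 1000, overlap: int = 200) -> list[str]:
    """Split text into character windows of `size` that overlap by `overlap`."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    if not text:
        return []
    step = size - overlap
    # Stop once the previous window has already reached the end of the text.
    last_start = max(len(text) - overlap, 1)
    return [text[i:i + size] for i in range(0, last_start, step)]


def build_corpus(folder: str) -> list[tuple[str, str]]:
    """Return (source_file_name, chunk_text) pairs for the whole folder."""
    return [
        (name, piece)
        for name, text in load_documents(folder).items()
        for piece in chunk(text)
    ]
```

Keeping the source file name next to each chunk makes it possible to cite the original document when a chunk is retrieved. A framework directory loader would typically replace `load_documents` in a real pipeline.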

Best for

Bulk-converting scanned receipts and invoices
Preparing a folder of research papers for AI ingestion
Migrating Word manuscripts to plain text
Building a small RAG knowledge base

Extract text from many docs at once.

One queue, every format.

Frequently asked

Which formats are supported?

How many files at once?

Does anything upload?

Why is image extraction slow?