Extract text from many docs at once.

Drop PDFs, images, Word docs, and text files together. Get clean text out: combined, as Markdown, or zipped one-per-file. Everything runs in your browser; nothing is uploaded.

When you have a folder of mixed documents — PDFs, scanned images, Word manuscripts, plain-text notes — extracting them one at a time is tedious. Batch Document Extractor processes up to 20 files of any supported type in a single pass and gives you one combined output or a ZIP with one clean text file per input.

How to use it

  1. Drop your files

    Mix PDFs, images, Word .docx files, and plain text: up to 20 files, 50 MB each. Reorder or remove individual entries before processing.

  2. Click Extract all

    Each file is routed to the right engine: pdf.js for PDFs, Tesseract for images, mammoth for Word documents.

  3. Download combined or zipped

    Grab the merged .txt or .md from the output panel, or download a ZIP with one file per input — useful for RAG ingestion.
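The limits and routing described in the steps above can be sketched in a few lines. This is an illustration only, not the tool's actual code, which runs in the browser: the extension map, the engine labels, and the function names `pick_engine` and `plan_batch` are all assumptions, and 50 MB is taken to mean 50 × 1024 × 1024 bytes.

```python
from pathlib import Path

MAX_FILES = 20                  # batch limit stated in step 1
MAX_BYTES = 50 * 1024 * 1024    # 50 MB per file (assumes binary megabytes)

# Hypothetical extension map; the tool's real list of accepted types may differ.
ENGINE_BY_EXT = {
    ".pdf": "pdf.js",
    ".png": "tesseract",
    ".jpg": "tesseract",
    ".jpeg": "tesseract",
    ".docx": "mammoth",
    ".txt": "plain-text",
}


def pick_engine(filename: str) -> str:
    """Return the engine label for a file, based on its extension."""
    ext = Path(filename).suffix.lower()
    if ext not in ENGINE_BY_EXT:
        raise ValueError(f"unsupported file type: {filename}")
    return ENGINE_BY_EXT[ext]


def plan_batch(files: list[tuple[str, int]]) -> list[tuple[str, str]]:
    """Validate (name, size_in_bytes) pairs and pair each name with an engine."""
    if len(files) > MAX_FILES:
        raise ValueError(f"too many files: {len(files)} > {MAX_FILES}")
    plan = []
    for name, size in files:
        if size > MAX_BYTES:
            raise ValueError(f"{name} exceeds the 50 MB limit")
        plan.append((name, pick_engine(name)))
    return plan
```

For example, `plan_batch([("receipt.jpg", 120_000), ("paper.pdf", 2_000_000)])` pairs the image with the OCR engine and the PDF with pdf.js, while a 21-file batch is rejected before any extraction starts.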

Building a knowledge base for RAG

Retrieval-augmented generation pipelines work best with clean text, one document per file. The ZIP export gives you exactly that: unzip it and point LangChain, LlamaIndex, or any embeddings pipeline at the folder.
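As a rough, framework-free sketch of what ingestion can look like, the snippet below reads every text file from an unzipped export and splits it into overlapping chunks ready for embedding. The function names, the `.txt`/`.md` filter, and the chunk sizes are assumptions for illustration; LangChain and LlamaIndex ship their own loaders and splitters that would replace this.

```python
from pathlib import Path


def load_documents(folder: str) -> list[dict]:
    """Read every .txt or .md file under folder into {"source", "text"} records."""
    docs = []
    for path in sorted(Path(folder).rglob("*")):
        if path.is_file() and path.suffix.lower() in {".txt", ".md"}:
            text = path.read_text(encoding="utf-8", errors="replace").strip()
            if text:  # skip files where extraction produced nothing
                docs.append({"source": path.name, "text": text})
    return docs


def chunk(text: str, size: int = 1000, overlap: int = 200) -> list[str]:
    """Split text into character windows of `size` that overlap by `overlap`."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than size")
    step = size - overlap
    return [text[i:i + size] for i in range(0, len(text), step)]
```

Keeping the source file name on each record lets you trace a retrieved chunk back to the original document.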

Best for

  • Bulk-converting scanned receipts and invoices
  • Preparing a folder of research papers for AI
  • Migrating Word manuscripts to plain text
  • Building a small RAG knowledge base

One queue, every format.

Mix scanned receipts (OCR), research PDFs, Word manuscripts, and plain text in a single batch. Each file is extracted with the right tool — pdf.js for PDFs, Tesseract for images, mammoth for .docx — then merged into one tidy output for your LLM.
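The merge step can be pictured as joining each file's text under its own heading. The sketch below assumes a layout (a `##` heading per file name, sections separated by a horizontal rule) that is purely hypothetical; the tool's actual combined output may be formatted differently.

```python
def combine_as_markdown(results: list[tuple[str, str]]) -> str:
    """Merge (filename, extracted_text) pairs into one Markdown string.

    The heading-per-file layout is an assumption for illustration.
    """
    sections = [f"## {name}\n\n{text.strip()}" for name, text in results]
    return "\n\n---\n\n".join(sections) + "\n"
```

Labeling each section with its file name keeps the combined output traceable when you paste it into an LLM prompt.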
