PDF to clean Markdown.

Convert any PDF to tidy Markdown — headings, paragraphs, and structure preserved. Perfect for Obsidian, GitHub, RAG pipelines, and feeding clean context to ChatGPT, Claude, and Gemini.

Markdown is the lingua franca of LLMs and modern note-taking. PDF to Markdown takes any PDF — research paper, ebook, contract, scanned report — and gives you tidy .md with headings, paragraphs, and lists preserved. Drop the result into Obsidian, Notion, GitHub, an MDX docs site, or straight into ChatGPT, Claude, or Gemini.

Conversion runs locally with pdf.js. Scanned PDFs trigger an automatic OCR fallback via Tesseract. Files never leave your browser.

How to use it

  1. Drop the PDF

    Born-digital and scanned PDFs both work. Up to ~50 MB recommended.

  2. Wait for conversion

    Pages stream in via a progress bar. Scanned PDFs take longer because OCR runs per page.

  3. Copy or download .md

    Switch between the Markdown, Cleaned (no syntax), and Raw views. Copy the result, or download it as .md to use in Obsidian, GitHub, or your RAG pipeline.

Why Markdown beats raw PDF text for AI

LLMs are trained on enormous amounts of GitHub-flavored Markdown, so they recognize ## headings, lists, and code blocks readily. That helps with chunking for RAG and with summaries. Because Markdown carries far less markup than HTML, the same content also costs roughly half the tokens. Markdown survives the round-trip into and out of a model too, so structured edits stay structured.
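To make the chunking point concrete, here is a minimal TypeScript sketch that splits converted Markdown into one chunk per ## heading. The names `Chunk` and `chunkByHeading` are illustrative and not part of this tool, and the sketch ignores edge cases such as a line starting with `## ` inside a fenced code block.

```typescript
// Hypothetical shape for a RAG chunk: one heading plus the text under it.
interface Chunk {
  heading: string; // empty string for text that appears before the first heading
  body: string;
}

// Split Markdown into chunks at every "## " heading line.
function chunkByHeading(markdown: string): Chunk[] {
  const chunks: Chunk[] = [];
  let heading = "";
  let bodyLines: string[] = [];

  // Store the chunk collected so far, skipping completely empty ones.
  const flush = (): void => {
    const body = bodyLines.join("\n").trim();
    if (heading !== "" || body !== "") {
      chunks.push({ heading, body });
    }
  };

  for (const line of markdown.split("\n")) {
    const match = /^##\s+(.*)$/.exec(line);
    if (match) {
      flush();
      heading = (match[1] ?? "").trim();
      bodyLines = [];
    } else {
      bodyLines.push(line);
    }
  }
  flush();
  return chunks;
}
```

Each chunk keeps its heading next to its text, so a retriever can embed or display the two together. Raw PDF text has no such boundaries to split on.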

How heading detection works

The converter walks every paragraph and promotes short, capitalized lines that don't end in sentence punctuation to ## (H2) headings. It's a heuristic: it works well for most reports and ebooks, and is occasionally over-eager on tables of contents. The Cleaned tab is always available if you want the same content with no Markdown syntax at all.
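A heuristic of this kind can be sketched in a few lines of TypeScript. This is an illustration under stated assumptions, not the converter's actual code: the 80-character cutoff, the reading of "capitalized" as "starts with an ASCII capital letter", and the names `looksLikeHeading` and `promoteHeadings` are all hypothetical.

```typescript
// Assumed cutoff for "short"; the real threshold is not documented.
const MAX_HEADING_LENGTH = 80;

// Decide whether a paragraph looks like a heading.
function looksLikeHeading(paragraph: string): boolean {
  const text = paragraph.trim();
  // Must be a single, non-empty, short line.
  if (text === "" || text.includes("\n")) return false;
  if (text.length > MAX_HEADING_LENGTH) return false;
  // Must not end in sentence punctuation.
  if (/[.!?,;:]$/.test(text)) return false;
  // Assumption: "capitalized" means the first character is A-Z.
  return /^[A-Z]/.test(text);
}

// Prefix heading-like paragraphs with "## " and join everything with blank lines.
function promoteHeadings(paragraphs: string[]): string {
  return paragraphs
    .map((p) => (looksLikeHeading(p) ? "## " + p.trim() : p.trim()))
    .join("\n\n");
}
```

A table-of-contents entry such as "Chapter 3 Results" passes every check in this sketch, which shows why a rule like this can be over-eager on contents pages.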

Best for

  • Research papers headed for Obsidian or Notion
  • Building a Markdown-based RAG knowledge base
  • Feeding clean structured context to ChatGPT or Claude
  • Migrating PDF reports into MDX docs sites
  • Long ebooks where heading structure matters

Why Markdown?

Markdown is the lingua franca of LLMs. ChatGPT, Claude, and Gemini are all trained on enormous amounts of GitHub-flavored Markdown, so they recognize headings, lists, and code blocks instantly. Compared to raw HTML or DOCX, Markdown costs roughly half the tokens for the same content while preserving the structure that helps the model reason.

This tool extracts text with pdf.js, then heuristically promotes short, capitalized lines to ## headings, joins broken hyphenated words, removes repeated headers/footers and page numbers, and rebuilds clean paragraphs. Scanned PDFs fall back to in-browser OCR via Tesseract — no upload, no account, no API key.
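The cleanup steps can be sketched as follows, assuming the text has already been extracted into one array of lines per page. The type `Page`, the function names, the page-number pattern, and the 60% repetition threshold are all hypothetical choices for illustration; the tool's own rules may differ.

```typescript
type Page = string[]; // lines of one page, top to bottom

// Matches bare page numbers such as "7", "Page 7", or "7 of 20".
function isPageNumber(line: string): boolean {
  return /^(page\s+)?\d+(\s+of\s+\d+)?$/i.test(line.trim());
}

// Find first/last lines that recur on at least `minShare` of all pages.
function findRepeatedEdgeLines(pages: Page[], minShare = 0.6): Set<string> {
  const counts = new Map<string, number>();
  for (const page of pages) {
    if (page.length === 0) continue;
    const first = (page[0] ?? "").trim();
    const last = (page[page.length - 1] ?? "").trim();
    const edges = first === last ? [first] : [first, last];
    for (const line of edges) {
      if (line !== "") counts.set(line, (counts.get(line) ?? 0) + 1);
    }
  }
  const repeated = new Set<string>();
  if (pages.length < 2) return repeated; // nothing can repeat on a single page
  counts.forEach((n, line) => {
    if (n / pages.length >= minShare) repeated.add(line);
  });
  return repeated;
}

// Drop headers, footers, and page numbers, then rebuild paragraphs.
function cleanPages(pages: Page[]): string {
  const repeated = findRepeatedEdgeLines(pages);
  const paragraphs: string[] = [];
  let current = "";
  for (const page of pages) {
    for (const raw of page) {
      const line = raw.trim();
      if (isPageNumber(line) || repeated.has(line)) continue;
      if (line === "") {
        // A blank line closes the current paragraph.
        if (current !== "") paragraphs.push(current);
        current = "";
      } else if (/[A-Za-z]-$/.test(current) && /^[a-z]/.test(line)) {
        // Join a word hyphenated across a line break ("conver-" + "sion").
        // Caveat: this also merges genuine compounds such as "well-" + "known".
        current = current.slice(0, -1) + line;
      } else {
        current = current === "" ? line : current + " " + line;
      }
    }
  }
  if (current !== "") paragraphs.push(current);
  return paragraphs.join("\n\n");
}
```

Removing headers and footers before joining lines matters: it lets a word or paragraph that is split across a page break be stitched back together.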

The output drops cleanly into Obsidian, Notion, GitHub README files, MDX-based docs sites, and RAG pipelines using LangChain or LlamaIndex.
