Techniques · 7 min read

How to Convert HTML to Markdown for LLMs (and Cut Tokens by 60%)

Webpage HTML is bloated with classes, inline styles, and wrapper divs the model doesn't need. Convert to Markdown first to slash token cost and dramatically improve summarization quality.

Webpages are mostly markup. Strip the visible text out of any modern site and you'll typically find that the HTML is 2–3× the size of the actual content. Wrapper divs, utility classes, inline styles, embedded scripts, tracking attributes — all of it ships to the model when you paste raw HTML, and all of it costs tokens.

The token math

On a typical news article we tested:

  • Raw HTML source: 8,200 tokens
  • Visible text only: 3,400 tokens
  • Markdown (with structure preserved): 3,100 tokens

That's a 62% reduction with structure intact. Across an API workload, that translates to roughly 60% lower input cost without changing what the model can answer. Run your own pages through Token Counter to see the difference for yourself.
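
Numbers like these are easy to reproduce in a script. A minimal sketch, assuming OpenAI's tiktoken tokenizer plus the beautifulsoup4 and markdownify packages (illustrative choices on my part; the article's tools run in the browser):

```python
import tiktoken                      # OpenAI's open-source tokenizer
from bs4 import BeautifulSoup        # visible-text extraction
from markdownify import markdownify  # one of several HTML->Markdown libraries

enc = tiktoken.get_encoding("cl100k_base")  # GPT-4-era encoding

def tokens(text: str) -> int:
    return len(enc.encode(text))

html = open("article.html").read()  # hypothetical saved page
visible = BeautifulSoup(html, "html.parser").get_text(" ", strip=True)
md = markdownify(html, heading_style="atx")

print(f"raw HTML:     {tokens(html):>6,}")
print(f"visible text: {tokens(visible):>6,}")
print(f"markdown:     {tokens(md):>6,}")
```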

Why models prefer Markdown

GPT, Claude, and Gemini have all read enormous amounts of Markdown during training (every README, every doc site, every Stack Overflow answer). They parse it natively. HTML works too, but the model has to spend attention deciding which tags carry meaning and which are styling — attention you'd rather it spent on your actual question.

The conversion step

Drop any HTML — copied source, scraped page, exported docs — into Markdown Converter. It strips wrapper divs, classes, and inline styles while preserving headings, lists, code blocks, tables, and links. Everything runs in your browser.
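
The same step is scriptable if you'd rather not use a browser tool. A sketch with the markdownify package (an assumption for illustration; it's one of several libraries that do this, not what the tool above uses):

```python
from markdownify import markdownify as md

html = """
<div class="post__body css-1x2y3z">
  <h2 id="results" style="margin:0">Results</h2>
  <p>We saw a <strong>62%</strong> cut. See the
  <a href="https://example.com/report">full report</a>.</p>
  <ul><li>Raw HTML: 8,200 tokens</li><li>Markdown: 3,100 tokens</li></ul>
</div>
"""

print(md(html, heading_style="atx").strip())
# Output (roughly):
# ## Results
#
# We saw a **62%** cut. See the
# [full report](https://example.com/report).
#
# * Raw HTML: 8,200 tokens
# * Markdown: 3,100 tokens
```

The wrapper div, the utility classes, the id, and the inline style all vanish; the heading, bold text, link, and list come through.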

What gets preserved

  • Headings (h1–h6 → # through ######)
  • Ordered and unordered lists
  • Inline emphasis (bold, italic, strikethrough)
  • Links with their href targets
  • Code blocks (fenced, with language hint when available)
  • Tables (using pipe syntax)
  • Blockquotes
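
Of that list, tables are the conversion people doubt most, so here's one concretely. Same assumed markdownify library as in the sketch above; exact spacing in its output may differ:

```python
from markdownify import markdownify as md

table = """
<table>
  <tr><th>Input</th><th>Tokens</th></tr>
  <tr><td>Raw HTML</td><td>8,200</td></tr>
  <tr><td>Markdown</td><td>3,100</td></tr>
</table>
"""

print(md(table).strip())
# Output (roughly):
# | Input | Tokens |
# | --- | --- |
# | Raw HTML | 8,200 |
# | Markdown | 3,100 |
```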

What gets dropped

  • <script> and <style> blocks
  • All class and id attributes
  • Inline styles
  • Comments
  • Tracking attributes (data-*, and aria-* attributes that don't carry content)
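
If you need this drop-list in a script rather than the browser tool, a minimal sketch with beautifulsoup4 (my choice of parser for illustration, not the tool's implementation):

```python
from bs4 import BeautifulSoup, Comment

def strip_noise(html: str) -> str:
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup(["script", "style"]):  # <script> and <style> blocks
        tag.decompose()
    for c in soup.find_all(string=lambda s: isinstance(s, Comment)):
        c.extract()                        # HTML comments
    for el in soup.find_all(True):         # every remaining tag
        el.attrs = {
            k: v for k, v in el.attrs.items()
            if k not in ("class", "id", "style")
            and not k.startswith("data-")
            # keep aria-* that carries content, e.g. aria-label
            and not (k.startswith("aria-") and k != "aria-label")
        }
    return str(soup)
```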

For very long inputs: compress next

After conversion, a long article might still be 5K–15K tokens. Context Compressor chops another 30–60% with minimal information loss. The combined HTML → Markdown → compression pipeline regularly cuts a 20K-token webpage to under 6K, which fits comfortably in any model's window with room left for a multi-turn conversation.
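
Context Compressor runs in the browser, so I can't show its internals. As a rough illustration of the idea only, here is a deliberately naive pass that trims plain paragraphs and leaves structural blocks alone; real compressors are much smarter about what they keep:

```python
import re

def naive_compress(md_text: str, keep: int = 2) -> str:
    # Keep headings, lists, tables, quotes, and fenced code whole;
    # trim ordinary paragraphs to their first `keep` sentences.
    # (Simplification: assumes fenced code contains no blank lines.)
    kept = []
    for block in md_text.split("\n\n"):
        if block.lstrip().startswith(("#", "-", "*", "|", ">", "```")):
            kept.append(block)
        else:
            sentences = re.split(r"(?<=[.!?])\s+", block.strip())
            kept.append(" ".join(sentences[:keep]))
    return "\n\n".join(kept)
```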

Edge cases

A few situations where keeping HTML is the right call:

  • You want the model to fix or analyze the markup itself (e.g., “why isn't this nav rendering?”)
  • You're extracting structured data from microdata or schema.org markup
  • Tables are deeply nested with merged cells — Markdown's pipe syntax can't represent that
  • You're asking the model to produce HTML output and want it to mirror your input style

For everything else — summarization, Q&A, translation, rewriting — Markdown is the right default. See the deeper analysis in Markdown vs HTML tokens and the broader trade-offs in Markdown vs HTML for LLMs.

Frequently asked

Can I just paste raw HTML into ChatGPT?

You can, but you'll burn 2–3× the tokens for the same content, and the model wastes attention parsing markup. Convert first.

What about <a> tag attributes?

The converted links keep the URL but drop title and rel attributes (Markdown can express a title as [text](url "title"), but rel has no equivalent). If those attributes matter for your task (e.g., SEO analysis), keep HTML. Otherwise convert.

Will the converter break code blocks?

No — <pre> and <code> tags become fenced Markdown code blocks with language hints preserved when the original used class='language-X'.
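
That mapping is simple to reproduce in a script. A sketch with beautifulsoup4 (again my assumption for illustration, not the converter's actual code):

```python
from bs4 import BeautifulSoup

FENCE = "`" * 3  # spelled this way to avoid a literal backtick run here

def convert_code_block(html: str) -> str:
    # <pre><code class="language-X">...</code></pre> -> fenced block
    soup = BeautifulSoup(html, "html.parser")
    code = soup.find("code")
    classes = code.get("class", []) if code else []
    lang = next((c[len("language-"):] for c in classes
                 if c.startswith("language-")), "")
    body = code.get_text().rstrip() if code else ""
    return f"{FENCE}{lang}\n{body}\n{FENCE}"
```

Feeding it <pre><code class='language-python'>print('hi')</code></pre> returns a three-line fenced block tagged python.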
