Token Optimization: Cut Your AI API Costs in Half
Practical, browser-side techniques to reduce input tokens without losing meaning. Compress prompts, clean documents, and prune system messages — the savings compound across every call.
Token costs scale linearly with usage, so a 40% cut in prompt tokens is a 40% cut in input costs, on every call, every day. The same compression also leaves more context room for the model to actually answer, and reduces latency. Here's the playbook.
Audit the system prompt first
The system prompt is sent on every single call. A 1,200-token system prompt that runs a million times a month is 1.2 billion tokens. Things to cut:
- Repeated rules said three different ways
- Examples that never trigger
- Polite framing ("Please be careful to…" → "Be careful to…")
- Verbose role descriptions when one sentence will do
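The arithmetic above is worth scripting once so you can plug in your own numbers. Here is a rough sketch in TypeScript; the prompt size, call volume, and per-million-token price are illustrative placeholders, not real pricing.

```ts
// Rough monthly cost of a system prompt that is resent on every call.
// All numbers in the example calls are placeholders, not real pricing.
function monthlySystemPromptCost(
  promptTokens: number,
  callsPerMonth: number,
  pricePerMillionInputTokens: number,
): number {
  const tokensPerMonth = promptTokens * callsPerMonth;
  return (tokensPerMonth / 1_000_000) * pricePerMillionInputTokens;
}

// 1,200-token prompt, 1M calls/month, $3 per million input tokens (example rate):
// 1.2B tokens/month, about $3,600/month. Trimming to 700 tokens saves ~$1,500/month.
console.log(monthlySystemPromptCost(1_200, 1_000_000, 3)); // 3600
console.log(monthlySystemPromptCost(700, 1_000_000, 3));   // 2100
```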
Compress user-supplied content
Use Context Compressor on long pasted text. The "Light" setting is safe for any input — it removes whitespace and obvious redundancy without changing meaning. "Aggressive" goes further; spot-check the output before relying on it.
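For a feel of what a light pass does, here is a minimal sketch in TypeScript, assuming "light" means collapsing runs of whitespace and dropping exact duplicate lines; it illustrates the idea, not the tool's actual implementation.

```ts
// Minimal "light" compression: collapse internal whitespace and drop exact
// duplicate lines, leaving paragraph breaks intact. No meaning-changing rewrites.
function lightCompress(text: string): string {
  const seen = new Set<string>();
  return text
    .split("\n")
    .map((line) => line.replace(/[ \t]+/g, " ").trim()) // collapse runs of spaces/tabs
    .filter((line) => {
      if (line === "") return true;     // keep blank lines as paragraph breaks
      if (seen.has(line)) return false; // drop exact repeats
      seen.add(line);
      return true;
    })
    .join("\n")
    .replace(/\n{3,}/g, "\n\n");        // at most one blank line in a row
}
```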
Pre-process documents
For PDFs, the single biggest token saver is removing repeating headers and footers. PDF to Clean Text does this automatically and typically cuts 10–15% on long documents — more on academic PDFs with running headers.
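The heuristic behind that is easy to sketch: a line that appears at the top or bottom of most pages is a running header, footer, or page number. The sketch below assumes you already have one text string per page; it shows the idea, not how the tool itself is implemented.

```ts
// Drop lines that repeat at the page edges (headers/footers) and bare page numbers.
// `pages` holds the extracted text of each page, one string per page.
function stripRepeatingEdges(pages: string[], threshold = 0.6): string[] {
  const edgeCounts = new Map<string, number>();
  for (const page of pages) {
    const lines = page.split("\n").map((l) => l.trim()).filter(Boolean);
    // The first and last two lines of each page are header/footer candidates.
    for (const line of new Set([...lines.slice(0, 2), ...lines.slice(-2)])) {
      edgeCounts.set(line, (edgeCounts.get(line) ?? 0) + 1);
    }
  }
  // Anything seen on most pages is treated as a running header or footer.
  const repeated = new Set(
    [...edgeCounts].filter(([, n]) => n >= pages.length * threshold).map(([l]) => l),
  );
  return pages.map((page) =>
    page
      .split("\n")
      .filter((l) => !repeated.has(l.trim()) && !/^\s*\d+\s*$/.test(l)) // drop bare page numbers too
      .join("\n"),
  );
}
```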
Strip transcripts
Auto-generated transcripts are 30–50% noise. Transcript Cleaner strips timestamps, filler words, and duplicate speaker labels with toggles for each. A 60-minute meeting commonly drops from 14,000 to 8,500 tokens.
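If you'd rather script it, here is a rough TypeScript sketch of the same cleanup: remove timestamps, strip common filler words, and collapse consecutive identical speaker labels. The regexes and filler list are illustrative and worth tuning to your transcript source.

```ts
// Illustrative transcript cleanup: timestamps, filler words, repeated speaker labels.
const FILLERS = /\b(um+|uh+|you know|sort of|kind of)\s*/gi;

function cleanTranscript(raw: string): string {
  const out: string[] = [];
  let lastSpeaker = "";
  for (let line of raw.split("\n")) {
    line = line.replace(/\[?\d{1,2}:\d{2}(:\d{2})?\]?/g, "").trim(); // e.g. [00:14:03]
    if (line === "") continue;
    const m = line.match(/^([A-Z][\w .]+):\s*(.*)$/);
    if (m) {
      const speaker = m[1];
      const rest = m[2];
      line = speaker === lastSpeaker ? rest : `${speaker}: ${rest}`; // drop repeated labels
      lastSpeaker = speaker;
    }
    out.push(line.replace(FILLERS, ""));
  }
  return out.join("\n");
}
```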
Prefer Markdown to HTML
For copied web content, Markdown is roughly half the tokens of equivalent HTML and structurally closer to the model's training data. Convert with the markdown utilities before pasting.
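If you want to script the conversion yourself, Turndown is a widely used open-source library that runs in the browser; the options below are a reasonable starting point, not necessarily the settings the markdown utilities mentioned above use.

```ts
// Convert an HTML string to Markdown in the browser with Turndown.
import TurndownService from "turndown";

const turndown = new TurndownService({
  headingStyle: "atx",      // "# Heading" instead of underlined headings
  codeBlockStyle: "fenced", // fenced code blocks instead of indented ones
});

function htmlToMarkdown(html: string): string {
  return turndown.turndown(html);
}

// Rough size comparison of what you'd actually paste into the prompt:
const html = document.body.innerHTML;
console.log("HTML chars:", html.length, "Markdown chars:", htmlToMarkdown(html).length);
```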
Write tighter prompts
- Use "Output JSON with keys: title, body" instead of three sentences describing it
- Number your requirements (1, 2, 3) — this often replaces "Also remember to…" elsewhere in the prompt
- Drop "Thank you" and "Please". The model isn't offended.
Cache what you can
Both Anthropic and OpenAI now offer prompt caching. If you send the same long prefix (a knowledge base, a system prompt) on many calls, caching can cut the price of that prefix by up to 90%. Combine caching with cleaned, compressed input for the biggest wins.
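Here is a sketch of explicit caching with Anthropic's SDK, which marks the static prefix with a cache_control block. The model name and knowledge-base contents are placeholders, and caching rules and discounts vary by provider and model, so check the current docs. OpenAI's caching applies automatically to repeated prompt prefixes, so there the main job is keeping the static part of your prompt first.

```ts
// Mark a long, static prefix as cacheable so repeat calls read it from cache
// instead of paying full input price. Model name and KNOWLEDGE_BASE are placeholders.
import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic();
const KNOWLEDGE_BASE = "...your long, cleaned, compressed reference text...";

async function ask(question: string) {
  return client.messages.create({
    model: "claude-sonnet-4-5", // placeholder; use whatever model you run
    max_tokens: 1024,
    system: [
      {
        type: "text",
        text: KNOWLEDGE_BASE,
        cache_control: { type: "ephemeral" }, // cache this prefix across calls
      },
    ],
    messages: [{ role: "user", content: question }],
  });
}
```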
Measure, don't guess
Pick a representative call and tokenize the prompt before and after. If you're not measuring, you can't tell whether a "shorter" rewrite actually reduced tokens or just felt shorter.
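A minimal way to do that with the 4-characters-per-token approximation from the FAQ below; swap in a real tokenizer library when you need exact counts.

```ts
// Rough before/after comparison using ~4 characters per token (±10% for English prose).
const approxTokens = (text: string): number => Math.ceil(text.length / 4);

function reportSavings(before: string, after: string): void {
  const b = approxTokens(before);
  const a = approxTokens(after);
  console.log(`before: ~${b} tokens, after: ~${a} tokens, saved ~${Math.round((1 - a / b) * 100)}%`);
}
```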
Frequently asked
Does compression hurt response quality?
Light compression (whitespace, redundancy) doesn't. Aggressive compression can, so spot-check the output. For most prompts the sweet spot is a 30–40% reduction.
Should I optimize the system prompt?
Yes — it's sent on every call, so savings compound the most. Audit it line by line.
Is there a tokenizer for browsers?
Yes. Tokenizer libraries can run entirely client-side, though most cost calculators use a roughly 4-characters-per-token approximation, which is accurate to within about ±10% for English prose.
Keep reading
How to Prepare PDFs for ChatGPT, Claude, and Gemini
A practical guide to extracting clean, AI-ready text from PDFs — born-digital and scanned — so ChatGPT, Claude, and Gemini answer accurately and don't waste tokens on headers, footers, and page numbers.
The Best Way to Feed Long Documents to Claude (and Other Long-Context Models)
Claude's 200K-token context is generous, but you'll still want to clean, compress, and structure long documents before sending them. Here's a step-by-step playbook.
Cleaning Whisper Transcripts for AI Summaries
OpenAI Whisper, Otter, and YouTube transcripts are full of timestamps, filler words, and speaker noise. Here's how to strip them before sending to ChatGPT or Claude — and why it matters.