Cleaning Whisper Transcripts for AI Summaries
OpenAI Whisper, Otter, and YouTube transcripts are full of timestamps, filler words, and speaker noise. Here's how to strip them before sending to ChatGPT or Claude — and why it matters.
Auto-generated transcripts are a goldmine for AI workflows: meeting summaries, podcast show notes, lecture study guides, sales-call analysis. But the raw output from Whisper, Otter, Fireflies, or YouTube's auto-captions is consistently 30–50% noise. Sending that noise to a paid API is a tax you don't have to pay.
What "noise" actually means
- Timestamps on every line ("[00:14:22] So I was thinking…")
- Filler words: um, uh, like, you know, sort of, I mean
- Repeated speaker labels every line, even when the speaker hasn't changed
- Hesitation markers and false starts ("the — the — the proposal")
- VTT/SRT cue numbers and arrow timestamps
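Each item in that list maps onto a simple pattern. Here is a minimal, stdlib-only Python sketch; the regexes and filler list are illustrative rather than exhaustive, and false-start repair ("the — the — the proposal") is trickier and left out. Note that "like" is deliberately excluded from the filler list, since it sometimes carries meaning (see the FAQ below).

```python
import re

# Illustrative patterns; real transcripts vary, so treat this as a sketch.
TIMESTAMP = re.compile(r"\[\d{2}:\d{2}(?::\d{2})?\]")  # [00:14:22] style
ARROW_TS = re.compile(
    r"\d{2}:\d{2}:\d{2}[,.]\d{3}\s*-->\s*\d{2}:\d{2}:\d{2}[,.]\d{3}"
)                                                      # SRT/VTT cue timings
SRT_CUE = re.compile(r"^\d+$")                         # bare cue numbers
# "like" is intentionally omitted: it can carry meaning.
FILLERS = re.compile(
    r",?\s*\b(?:um+|uh+|you know|sort of|I mean)\b,?", re.IGNORECASE
)

def clean_transcript(text: str) -> str:
    kept = []
    for line in text.splitlines():
        stripped = line.strip()
        # Drop SRT/VTT cue numbers and arrow-timestamp lines entirely.
        if SRT_CUE.fullmatch(stripped) or ARROW_TS.search(stripped):
            continue
        stripped = TIMESTAMP.sub("", stripped)
        stripped = FILLERS.sub("", stripped)
        stripped = re.sub(r"\s{2,}", " ", stripped).strip()  # tidy gaps
        if stripped:
            kept.append(stripped)
    return "\n".join(kept)
```

For example, an SRT fragment like `1` / `00:00:01,000 --> 00:00:04,200` / `[00:14:22] So, um, I was thinking` comes out as the single line `So I was thinking`.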
How much do you actually save?
A 60-minute meeting transcript from Otter typically lands around 14,000 tokens. Cleaning usually drops it to 8,000–9,500 — a 30–45% reduction, or roughly 4,500–6,000 tokens per meeting. Across a year of weekly summaries, that is a quarter-million-plus input tokens you never pay Claude Sonnet to read.
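The arithmetic behind those figures, using the typical token counts above:

```python
raw = 14_000                    # typical raw Otter transcript, per above
for cleaned in (9_500, 8_000):  # typical post-cleaning range
    saved = raw - cleaned
    pct = saved / raw * 100
    yearly = saved * 52         # one meeting summarized per week
    print(f"{cleaned:,} tokens: {pct:.0f}% smaller, ~{yearly:,} tokens/year saved")
```

That works out to a 32–43% reduction, or roughly 234,000–312,000 input tokens saved per year at one meeting a week.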
The cleaning pipeline
Transcript Cleaner handles all of the above with toggles you can flip individually. Drop in your Whisper output, choose what to remove, and copy out clean prose. It works on plain text, SRT, and VTT.
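One of those toggles, collapsing repeated speaker labels, can be sketched in a few lines. The `Name:` prefix pattern here is an assumption about how your transcription tool formats labels:

```python
import re

# Assumes "Name: text" labels; adjust the pattern for your tool's format.
SPEAKER = re.compile(r"^([A-Z][\w .'-]*):\s*(.*)$")

def collapse_speakers(lines):
    """Keep a speaker label only when the speaker actually changes."""
    out, current = [], None
    for line in lines:
        m = SPEAKER.match(line)
        if m and m.group(1) == current:
            out.append(m.group(2))  # same speaker: drop the redundant label
        else:
            out.append(line)
            if m:
                current = m.group(1)
    return out
```

So three consecutive lines labeled `John:` keep the label on the first line only, which is exactly the behavior the FAQ below recommends.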
Prompt patterns that work after cleaning
A cleaned transcript pairs beautifully with a structured prompt. Some patterns we've found reliable:
Executive summary
Summarize the following meeting transcript in 5 bullet points. For each bullet, name the speaker who proposed it. End with a "Decisions" section and an "Open questions" section. Transcript: [paste cleaned transcript]
Action items
Extract every action item from this transcript. Format as a Markdown table: Owner | Action | Mentioned timestamp | Confidence (high/medium/low). Transcript: [paste cleaned transcript]
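If you're scripting this against an API rather than pasting into a chat window, the two patterns above are just string templates. This helper is hypothetical — the template names and wiring are ours, not part of any tool mentioned here:

```python
# Hypothetical helper: wrap a cleaned transcript in one of the prompt
# patterns above. Keys and wording are illustrative.
TEMPLATES = {
    "summary": (
        "Summarize the following meeting transcript in 5 bullet points. "
        "For each bullet, name the speaker who proposed it. End with a "
        '"Decisions" section and an "Open questions" section.'
    ),
    "actions": (
        "Extract every action item from this transcript. Format as a "
        "Markdown table: Owner | Action | Mentioned timestamp | "
        "Confidence (high/medium/low)."
    ),
}

def build_prompt(kind: str, transcript: str) -> str:
    return f"{TEMPLATES[kind]}\n\nTranscript:\n{transcript}"
```

Run your cleaned transcript through `build_prompt("summary", cleaned)` and send the result as the user message.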
When the transcript is still too long
For multi-hour calls, run the cleaned transcript through Context Compressor as a second pass. You'll lose nothing important — just the verbal scaffolding that doesn't survive the trip into a summary anyway.
Frequently asked
Will ChatGPT understand a raw VTT or SRT transcript?
Yes, but it spends tokens parsing timestamps and produces flatter summaries. Strip them first.
What about speaker names?
Keep them on speaker changes only. Repeating "John:" before every John line is pure noise.
Is filler-word removal lossy?
Slightly — sometimes "like" carries meaning. For summarization it almost never matters; for direct-quote pulling, leave fillers in.