
Cleaning Whisper Transcripts for AI Summaries

OpenAI Whisper, Otter, and YouTube transcripts are full of timestamps, filler words, and speaker noise. Here's how to strip them before sending to ChatGPT or Claude — and why it matters.

Auto-generated transcripts are a goldmine for AI workflows: meeting summaries, podcast show notes, lecture study guides, sales-call analysis. But the raw output from Whisper, Otter, Fireflies, or YouTube's auto-captions is consistently 30–50% noise. Sending that noise to a paid API is a tax you don't have to pay.

What "noise" actually means

  • Timestamps on every line ("[00:14:22] So I was thinking…")
  • Filler words: um, uh, like, you know, sort of, I mean
  • Repeated speaker labels every line, even when the speaker hasn't changed
  • Hesitation markers and false starts ("the — the — the proposal")
  • VTT/SRT cue numbers and arrow timestamps
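
If you'd rather script the cleanup than use a tool, each category above maps to a simple regular expression. Here's a minimal sketch in Python; the patterns and the filler-word list are illustrative rather than exhaustive, so expect to tune them for your own transcripts:

import re

TIMESTAMP = re.compile(r"\[?\d{1,2}:\d{2}(?::\d{2})?(?:[.,]\d{3})?\]?")   # [00:14:22]
SRT_CUE   = re.compile(r"^\d+$")                                          # bare cue numbers
ARROW     = re.compile(r"-->")                                            # VTT/SRT timing lines
FILLERS   = re.compile(r"\b(?:um+|uh+|you know|sort of|i mean|like)\b,?\s*", re.IGNORECASE)
SPEAKER   = re.compile(r"^([A-Z][\w .'-]{0,30}):\s*")                     # "John:" at line start
STUTTER   = re.compile(r"\b(\w+)(?:\s*[—-]\s*\1\b)+", re.IGNORECASE)      # false starts

def clean(transcript, strip_timestamps=True, strip_fillers=True,
          collapse_speakers=True, strip_cues=True):
    out, last_speaker = [], None
    for line in transcript.splitlines():
        line = line.strip()
        if not line:
            continue
        # Drop SRT cue numbers and "00:00:01,000 --> 00:00:04,000" timing lines entirely
        if strip_cues and (SRT_CUE.match(line) or ARROW.search(line)):
            continue
        if strip_timestamps:
            line = TIMESTAMP.sub("", line).strip()
        if collapse_speakers:
            m = SPEAKER.match(line)
            if m:
                if m.group(1) == last_speaker:
                    line = line[m.end():]          # same speaker as previous line: drop the label
                else:
                    last_speaker = m.group(1)
        if strip_fillers:
            line = FILLERS.sub("", line)
        line = STUTTER.sub(r"\1", line)            # collapse "the - the - the proposal"
        line = re.sub(r"\s{2,}", " ", line).strip()
        if line:
            out.append(line)
    return "\n".join(out)

Each toggle mirrors one noise category from the list above, so you can keep timestamps for action-item extraction while still dropping fillers and repeated speaker labels.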

How much do you actually save?

A 60-minute meeting transcript from Otter typically lands around 14,000 tokens. Cleaning it usually drops it to 8,000–9,500 — a 30–45% reduction. At Claude Sonnet's input-token pricing, that difference adds up over a year of weekly summaries.
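
You can measure the saving on your own transcripts before changing anything about your workflow. A quick sketch using OpenAI's tiktoken library as a rough proxy (Claude tokenizes differently, so treat the counts as approximate; the file names are placeholders):

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")   # rough proxy; Claude's tokenizer differs

raw = open("meeting_raw.vtt", encoding="utf-8").read()          # hypothetical file names
cleaned = open("meeting_cleaned.txt", encoding="utf-8").read()

raw_tokens = len(enc.encode(raw))
clean_tokens = len(enc.encode(cleaned))
print(f"raw: {raw_tokens:,}  cleaned: {clean_tokens:,}  "
      f"saved: {1 - clean_tokens / raw_tokens:.0%}")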

The cleaning pipeline

Transcript Cleaner handles all of the above with toggles you can flip individually. Drop in your Whisper output, choose what to remove, and copy out clean prose. It works on plain text, SRT, and VTT.

Prompt patterns that work after cleaning

A cleaned transcript pairs beautifully with a structured prompt. Some patterns we've found reliable:

Executive summary

Summarize the following meeting transcript in 5 bullet points.
For each bullet, name the speaker who proposed it.
End with a "Decisions" section and an "Open questions" section.

Transcript:
[paste cleaned transcript]

Action items

Extract every action item from this transcript.
Format as a Markdown table: Owner | Action | Mentioned timestamp | Confidence (high/medium/low).

Transcript:
[paste cleaned transcript]
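
Either prompt is a single API call once the transcript is clean. A minimal sketch with the Anthropic Python SDK, using the action-items prompt above; the model name and file name are placeholders:

import anthropic

PROMPT = """Extract every action item from this transcript.
Format as a Markdown table: Owner | Action | Mentioned timestamp | Confidence (high/medium/low).

Transcript:
{transcript}"""

client = anthropic.Anthropic()   # reads ANTHROPIC_API_KEY from the environment

with open("meeting_cleaned.txt", encoding="utf-8") as f:   # hypothetical file name
    transcript = f.read()

response = client.messages.create(
    model="claude-sonnet-4-20250514",   # placeholder; use whichever Sonnet model you're on
    max_tokens=1024,
    messages=[{"role": "user", "content": PROMPT.format(transcript=transcript)}],
)
print(response.content[0].text)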

When the transcript is still too long

For multi-hour calls, run the cleaned transcript through Context Compressor as a second pass. You'll lose nothing important — just the verbal scaffolding that doesn't survive the trip into a summary anyway.

Tools mentioned

  • Transcript Cleaner
  • Context Compressor

Frequently asked

Will ChatGPT understand a raw VTT or SRT transcript?

Yes, but it spends tokens parsing timestamps and produces flatter summaries. Strip them first.

What about speaker names?

Keep them on speaker changes only. Repeating "John:" before every John line is pure noise.

Is filler-word removal lossy?

Slightly — sometimes "like" carries meaning. For summarization it almost never matters; for direct-quote pulling, leave fillers in.
