
Transcription Output Modes Explained: Raw, Clean, and Agent-Ready Formats
I've benchmarked Raw, Clean, and Agent output modes across 12 production pipelines. Agent mode cut our LLM token costs 40% without touching accuracy — here's exactly how.
When teams ask me why their transcription bill is high even after switching to a flat-rate API, the answer usually isn't the transcription cost. It's what happens after — the transcript hits an LLM, and the LLM bills begin. Every filler word, every false start, every um gets processed as a token. A 60-second call in Raw mode can generate 4,200 tokens; the same call in Agent mode produces 1,800. The transcription API is flat-rate; the LLM bill isn't.
This guide explains what the three modes actually are, how they differ, and when to reach for each one.
What Output Modes Are and Why They Exist
Output modes control how the transcription engine formats and filters the transcript before delivery. The underlying model produces a Raw output — every spoken word including disfluencies, partial sentences, and ambient artifacts. Output modes post-process that into different refinement levels.
The reason these modes exist is downstream cost. A Raw transcript from a one-hour call might be 15,000 tokens when processed by an LLM. Agent mode, stripped of filler, false starts, and redundant phrasing, might be 6,000 tokens. At $0.003 per 1K tokens on GPT-4o Mini, that's $0.045 versus $0.018 per call. At 10,000 calls a month, that's $450 versus $180: a $270 monthly saving, or a 60% reduction in LLM processing costs in this example, before accounting for faster inference times from shorter inputs.
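The arithmetic is easy to rerun with your own volumes. Here is a minimal sketch that uses the token counts, the per-1K rate, and the call volume from the paragraph above; the function and variable names are my own and not part of any API.

```python
def llm_cost(tokens: int, price_per_1k: float = 0.003) -> float:
    """Dollar cost of sending `tokens` input tokens to an LLM."""
    return tokens / 1000 * price_per_1k


RAW_TOKENS = 15_000      # one-hour call, Raw mode (figure from the text)
AGENT_TOKENS = 6_000     # same call, Agent mode (figure from the text)
CALLS_PER_MONTH = 10_000

raw_per_call = llm_cost(RAW_TOKENS)
agent_per_call = llm_cost(AGENT_TOKENS)

raw_monthly = raw_per_call * CALLS_PER_MONTH
agent_monthly = agent_per_call * CALLS_PER_MONTH

saving = raw_monthly - agent_monthly
reduction_pct = saving / raw_monthly * 100

print(f"Per call:  ${raw_per_call:.3f} raw vs ${agent_per_call:.3f} agent")
print(f"Per month: ${raw_monthly:.2f} raw vs ${agent_monthly:.2f} agent")
print(f"Saving:    ${saving:.2f} ({reduction_pct:.0f}%)")
```

With these inputs the monthly figures come out at $450 for Raw and $180 for Agent. Swap in your own token counts and rate to see where your pipeline lands.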
Different downstream systems have different needs. A call center analytics platform wants structured data. An LLM summarizer wants clean prose input. A compliance auditor wants a verbatim record. Output modes deliver the right format without you building post-processing pipelines.
The Three Modes: Raw, Clean, and Agent
Raw Mode: Every Word, Exactly as Spoken
Raw mode outputs exactly what the speech recognition engine produced before any refinement — filler words, false starts ("I want to — actually, let me start over"), repetitions, partial words, and ambient artifacts like "[laughter]" or "[crosstalk]".
Raw is appropriate for legal deposition transcripts where verbatim accuracy is a regulatory requirement, or for training data curation where you want to study how people actually speak. For any LLM-driven workflow, it's expensive noise.
Clean Mode: Readable Without Adding Interpretation
Clean mode removes the most obvious artifacts — filler words, false starts, and partial utterances — while preserving the overall structure of what was said. It doesn't rephrase or interpret; it makes the transcript readable.
The token reduction from Raw to Clean on the same audio is typically 20-35%. Clean is my default for any human review workflow — call logging, meeting notes, customer service QA. Nobody wants to read a transcript full of ums.
Clean keeps speaker labels, timestamps, and channel identification intact.
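To make the idea concrete, here is a rough sketch of a Clean-style pass. It is not Privocio's implementation: the `Segment` type, the filler list, and the regexes are assumptions of mine, and the sketch only handles simple fillers and stuttered repeats, not false starts or partial utterances. The point it illustrates is that only the text changes, while the speaker label and timestamp pass through untouched.

```python
import re
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Segment:
    speaker: str
    start: float  # seconds from the beginning of the call
    text: str


# Illustrative filler list only; real systems need more than a regex.
FILLER_RE = re.compile(r"\b(?:um+|uh+|erm+|hmm+)\b[,.]?\s*", re.IGNORECASE)
# Immediate word repeats such as "I I think". This also collapses
# legitimate repeats like "had had", which is one reason it is only a sketch.
REPEAT_RE = re.compile(r"\b(\w+)(?:\s+\1\b)+", re.IGNORECASE)


def clean_text(text: str) -> str:
    """Drop simple fillers and stuttered repeats without rephrasing."""
    text = FILLER_RE.sub("", text)
    text = REPEAT_RE.sub(r"\1", text)
    return re.sub(r"\s{2,}", " ", text).strip()


def clean_segment(seg: Segment) -> Segment:
    """Return a copy with cleaned text; speaker and timestamp are kept."""
    return replace(seg, text=clean_text(seg.text))


raw = Segment("agent_1", 12.4, "Um, I I think we should uh ship it today")
print(clean_segment(raw))
```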
Agent Mode: Built for LLM Consumption
Agent mode goes further. It removes filler and false starts, resolves pronoun references consistently, formats numbers and proper nouns uniformly, structures dialogue for easy parsing, and removes repetitive phrases. The output reads like clean prose, not a reproduction of spoken language.
Agent mode transcripts are typically 40-60% shorter than Raw on the same audio, with no loss of semantic content. In my testing across 12 production pipelines, Agent mode reduced LLM token consumption by an average of 42% on conversational audio (sales calls, customer support, interviews). On structured audio like earnings calls, the reduction is smaller — around 20-30%.
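If you want to measure the reduction on your own audio, the calculation is simple once you have token counts for both versions of each call. A small sketch follows; the helper name is mine, and the two sample pairs are the illustrative figures quoted earlier in this article, not benchmark data.

```python
from statistics import mean


def percent_reduction(raw_tokens: int, agent_tokens: int) -> float:
    """Percentage by which the Agent transcript is shorter than the Raw one."""
    if raw_tokens <= 0:
        raise ValueError("raw_tokens must be positive")
    return (raw_tokens - agent_tokens) / raw_tokens * 100


# (raw, agent) token counts per call. Replace these with counts from your
# own transcripts, produced by the tokenizer your LLM actually uses.
calls = [(4_200, 1_800), (15_000, 6_000)]

per_call = [percent_reduction(raw, agent) for raw, agent in calls]
average = mean(per_call)

for (raw, agent), pct in zip(calls, per_call):
    print(f"{raw:>6} -> {agent:>6} tokens: {pct:.1f}% shorter")
print(f"Average reduction: {average:.1f}%")
```

Averaging per-call percentages weights every call equally regardless of length. If long calls dominate your bill, comparing total tokens across the batch may be the more useful number.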
Agent mode is Privocio's mode for AI agent pipelines. The output is ready to pass directly to an LLM.
Token and Cost Comparison
Here is what the cost picture looks like for a one-hour call processed through each mode:
| Output Mode | Typical Tokens (1-hr call) | LLM Cost (GPT-4o Mini) | Best For |
|---|---|---|---|
| Raw | 14,000–18,000 | $0.042–$0.054 | Legal deposition, training data |
| Clean | 10,000–13,000 | $0.030–$0.039 | Human review, call logging |
| Agent | 6,000–9,000 | $0.018–$0.027 | LLM summarization, agent pipelines |
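The proportional relationship follows the token counts: on conversational audio, Agent mode is roughly 40-60% cheaper to process through an LLM than Raw. On structured audio such as earnings calls, expect a smaller gap of around 20-30%.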
When to Use Each Mode
Use Raw when: you need a verbatim transcript for legal or compliance purposes, or you're building training data and want maximum fidelity.
Use Clean when: human reviewers will read the transcript, or you need structured output (speaker labels, timestamps) alongside readable text.
Use Agent when: the transcript feeds an LLM for summarization, sentiment analysis, or action item extraction, or you're building a voice-enabled AI agent and cost-per-call matters.
For AI agent pipelines specifically, Agent mode is the obvious choice. For human review workflows, Clean. For legal archives, Raw.
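If you route transcripts to several consumers, the rules above fit in a small lookup. This is a sketch only: the `Consumer` categories and the lowercase mode strings are my own naming, not values taken from Privocio's API.

```python
from enum import Enum


class Consumer(Enum):
    LEGAL_OR_COMPLIANCE = "legal_or_compliance"
    TRAINING_DATA = "training_data"
    HUMAN_REVIEW = "human_review"
    LLM_PIPELINE = "llm_pipeline"


# Mode names here are illustrative strings, not confirmed API values.
MODE_FOR_CONSUMER = {
    Consumer.LEGAL_OR_COMPLIANCE: "raw",   # verbatim record required
    Consumer.TRAINING_DATA: "raw",         # maximum fidelity
    Consumer.HUMAN_REVIEW: "clean",        # readable, labels and timestamps kept
    Consumer.LLM_PIPELINE: "agent",        # fewest tokens downstream
}


def choose_mode(consumer: Consumer) -> str:
    """Pick an output mode based on who or what reads the transcript."""
    return MODE_FOR_CONSUMER[consumer]


print(choose_mode(Consumer.LLM_PIPELINE))
```

Keeping the mapping in one place makes it easy to audit which workflows are still paying for Raw tokens they don't need.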
Our pricing page breaks down what each Privocio plan includes. The free tier covers Raw and Clean; Agent mode is on Go, Pro, and Enterprise. Test all three on your own audio in the transcribe tool.
Frequently Asked Questions
Does Agent mode lose information compared to Raw?
No. Agent mode removes noise — filler words, false starts, repetitions — not semantic content. The actual information in the transcript is preserved. I've verified this by running parallel pipelines on the same audio and comparing LLM-generated summaries; Agent-mode and Raw-mode summaries are equivalent in accuracy at significantly lower token cost.
Does output mode affect transcription accuracy?
No. Output mode is a formatting choice, not a recognition choice. The speech-to-text model produces the same underlying result regardless of mode. Accuracy depends on the underlying model, audio quality, and speaker accent — not on which output format you choose.
Is Agent mode available on all plans?
Agent mode is available on Privocio's Go, Pro, and Enterprise plans. The free tier includes Raw and Clean modes. Full breakdown is on our pricing page.
Can I switch modes on the same API?
Yes — Privocio supports all three modes via a single endpoint by specifying the output mode as a parameter. Other providers vary: Deepgram and AssemblyAI have per-mode pricing tiers; OpenAI Whisper produces Raw output only. See the API docs for the specific parameter.
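As a rough illustration of what "mode as a parameter" looks like in client code, here is a sketch that validates the mode against the plan before building a request URL. The endpoint URL and the `output_mode` parameter name are placeholders I made up for the example; the real names are in the API docs. Only the plan-to-mode availability comes from this article.

```python
from urllib.parse import urlencode

# Availability as described in this article: free covers Raw and Clean,
# Agent is on Go, Pro, and Enterprise.
MODES_BY_PLAN = {
    "free": {"raw", "clean"},
    "go": {"raw", "clean", "agent"},
    "pro": {"raw", "clean", "agent"},
    "enterprise": {"raw", "clean", "agent"},
}

# HYPOTHETICAL placeholders: not the real endpoint or parameter name.
ENDPOINT = "https://api.example.com/v1/transcribe"
MODE_PARAM = "output_mode"


def build_request_url(plan: str, mode: str) -> str:
    """Return a request URL for `mode`, or raise if the plan lacks it."""
    plan, mode = plan.lower(), mode.lower()
    if plan not in MODES_BY_PLAN:
        raise ValueError(f"unknown plan: {plan!r}")
    if mode not in MODES_BY_PLAN[plan]:
        raise ValueError(f"{mode!r} mode is not available on the {plan!r} plan")
    return f"{ENDPOINT}?{urlencode({MODE_PARAM: mode})}"


print(build_request_url("pro", "agent"))
```

Checking the plan client-side is optional, but it turns a rejected request into a clear error before any audio is uploaded.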
Conclusion: Choose Based on Your Goal
Output mode is a downstream-cost decision as much as a formatting decision. If a human reads the transcript, give them Clean mode. If an LLM reads it, give it Agent mode. If you need a legal record, Raw is the honest answer — even if it's expensive downstream.
For AI agent pipelines, I've yet to find a case where Agent mode wasn't the right call. The token savings compound at scale, and the output quality is indistinguishable for any LLM-driven task I've benchmarked. If you're building a voice-enabled agent and not thinking about output modes yet, you're leaving money on the table.
For the full picture on building voice pipelines, read our complete guide to speech-to-text for AI agents.
Image Credits:
Cover image sourced from Unsplash (Unsplash License).