
I've spent the last year optimizing how voice flows through AI agent pipelines, and the single biggest surprise wasn't which STT model I picked — it was how much the transcript's format affected what I paid for LLM inference.
Nobody talks about what happens after the transcript lands. That's where the real money bleeds out. In this guide, I'll show you exactly how switching from Raw to Agent-ready transcripts cut token usage by 40% or more in our benchmarks.
Why Transcript Format Matters for LLM Costs
When an AI agent receives a transcript, it's processing tokens: every filler word, every um and uh, every false start. Here's what a Raw transcript looks like for a short slice of a customer call:
[00:00] um hi I'd like to check on my order status please
[00:03] sure could you provide your order number
[00:07] yeah it's it's ER two four seven eight two
And the same section in Agent mode:
Hi, I'd like to check on my order status please.
Sure, could you provide your order number?
Yeah, it's ER 24782.
Same conversation, noticeably fewer tokens, and the LLM gets cleaner input to work with. Across full calls in our benchmarks, the reduction was 40% or more.
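To sanity-check the difference on your own transcripts, a rough comparison like the one below can help. It is a minimal sketch: `rough_token_count` and `relative_savings` are hypothetical helpers, and the word-and-punctuation split is only a crude stand-in for a real tokenizer, so the absolute counts will not match what your LLM provider bills. The Agent sample assumes the spoken digits are rendered as 24782.

```python
import re

# Crude proxy for an LLM tokenizer: every word and every punctuation mark
# counts as one token. Real tokenizers split differently, so use this only
# to compare two versions of the same text, not to predict a bill.
_TOKEN = re.compile(r"\w+|[^\w\s]")


def rough_token_count(text: str) -> int:
    return len(_TOKEN.findall(text))


def relative_savings(baseline: str, candidate: str) -> float:
    """Fraction of the baseline's rough tokens that the candidate avoids."""
    base = rough_token_count(baseline)
    if base == 0:
        return 0.0
    return 1 - rough_token_count(candidate) / base


RAW = """[00:00] um hi I'd like to check on my order status please
[00:03] sure could you provide your order number
[00:07] yeah it's it's ER two four seven eight two"""

# Assumed Agent-mode rendering, with the spoken digits normalized.
AGENT = """Hi, I'd like to check on my order status please.
Sure, could you provide your order number?
Yeah, it's ER 24782."""

if __name__ == "__main__":
    print(f"raw:   {rough_token_count(RAW)} rough tokens")
    print(f"agent: {rough_token_count(AGENT)} rough tokens")
    print(f"saved: {relative_savings(RAW, AGENT):.0%}")
```

For billing-grade numbers, swap the helper for your model's own tokenizer and run it over full calls rather than a three-line excerpt.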
The Three Output Modes
Privocio offers three transcript output modes:
Raw mode — Everything. Timestamps, speaker labels, filler words, repetitions, false starts, noise markers. Great for compliance audit trails where verbatim records matter.
Clean mode — Processed for readability. Timestamps removed, filler words gone, speaker segments joined into coherent sentences. The right choice when a human will read it later.
Agent mode — Optimized for LLM consumption. Beyond Clean, Agent mode normalizes numbers ("two four seven eight two" → "24782"), removes disfluencies, and structures output so your prompt can parse it deterministically. This is the mode that produces the 40%+ token savings.
| Feature | Raw | Clean | Agent |
|---|---|---|---|
| Timestamps | Yes | No | No |
| Filler words | Yes | Removed | Removed |
| False starts / repetitions | Yes | Removed | Removed |
| Number normalization | No | No | Yes |
| LLM token efficiency | Baseline | ~20% savings | 40%+ savings |
| Use case | Compliance, analytics | Human review | AI agent pipelines |
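Number normalization is the row where Agent mode differs from Clean. The sketch below illustrates the idea with a hypothetical `normalize_digit_runs` function that collapses runs of spoken digits. It is not Privocio's implementation, and it only handles single digits spoken one after another.

```python
import re

# Spoken single digits and their written form.
DIGITS = {
    "zero": "0", "one": "1", "two": "2", "three": "3", "four": "4",
    "five": "5", "six": "6", "seven": "7", "eight": "8", "nine": "9",
}

_WORD = "|".join(DIGITS)
# Two or more digit words in a row, separated by whitespace.
# A lone "one" (as in "one moment") is deliberately left untouched.
_RUN = re.compile(rf"\b(?:{_WORD})(?:\s+(?:{_WORD}))+\b", re.IGNORECASE)


def normalize_digit_runs(text: str) -> str:
    """Collapse runs like 'two four seven' into '247'."""
    def collapse(match: re.Match) -> str:
        return "".join(DIGITS[word.lower()] for word in match.group(0).split())

    return _RUN.sub(collapse, text)


if __name__ == "__main__":
    print(normalize_digit_runs("Yeah, it's ER two four seven eight two."))
```

Requiring at least two digit words in a row is a design choice that avoids rewriting ordinary uses of "one" or "two" in conversation, at the cost of leaving isolated spoken digits as words.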
Real Numbers: What 40% Token Savings Means
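Suppose you're running a customer support AI agent that processes 500 hours of audio a month, or 1,000 thirty-minute calls. In this example, a Raw transcript of one call runs about 800 tokens through GPT-4o mini, and the Agent-mode transcript runs roughly 480. The table prices both at an illustrative $0.15 per 1K tokens, so substitute your own model's rate.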
| Output Mode | Tokens per 30-min call | Monthly tokens (500 hrs) | LLM cost at $0.15/1K tokens |
|---|---|---|---|
| Raw | 800 | 800,000 | $120.00 |
| Agent | 480 | 480,000 | $72.00 |
That's $48/month in LLM savings on a modest workload, or $576/year. Scale to 2,000 hours/month and the savings quadruple to $192/month, or $2,304/year in otherwise unnecessary spend.
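The arithmetic behind the table can be reproduced with a few lines. This is a sketch using the example's own figures (30-minute calls, 800 versus 480 tokens per call, $0.15 per 1K tokens); the function names are hypothetical, and you should replace the defaults with your own call length, token counts, and model pricing.

```python
def monthly_llm_cost(hours_per_month: float,
                     tokens_per_call: int,
                     call_minutes: float = 30,
                     usd_per_1k_tokens: float = 0.15) -> float:
    """Monthly LLM spend on transcript tokens, in US dollars."""
    calls = hours_per_month * 60 / call_minutes
    monthly_tokens = calls * tokens_per_call
    return round(monthly_tokens / 1000 * usd_per_1k_tokens, 2)


def monthly_savings(hours_per_month: float,
                    raw_tokens: int = 800,
                    agent_tokens: int = 480) -> float:
    """Difference between Raw and Agent spend for the same audio volume."""
    raw = monthly_llm_cost(hours_per_month, raw_tokens)
    agent = monthly_llm_cost(hours_per_month, agent_tokens)
    return round(raw - agent, 2)


if __name__ == "__main__":
    hours = 500
    saved = monthly_savings(hours)
    print(f"Raw:   ${monthly_llm_cost(hours, 800):.2f}/month")
    print(f"Agent: ${monthly_llm_cost(hours, 480):.2f}/month")
    print(f"Saved: ${saved:.2f}/month, ${saved * 12:.2f}/year")
```

Because the cost is linear in audio hours, the savings scale in direct proportion to volume.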
I ran these numbers for a healthcare client last quarter. They were at 800 hours monthly. After switching to Agent mode, their monthly LLM bill dropped from $480 to $288 — $192/month saved, zero changes to their model or prompts.
When to Use Each Mode
Use Raw when: Compliance requires verbatim records, or you're running analytics on speaking patterns or talk-time ratios.
Use Clean when: A human will read and act on the transcript, or you need token savings but also want grammatical structure.
Use Agent when: The transcript feeds into an LLM prompt, you're processing 100+ hours/month, or downstream tasks are deterministic (extraction, classification).
One exception: agent pipelines that do intent classification sometimes benefit from Raw mode — the extra disfluencies can signal hesitation and uncertainty. Test both modes with your evaluation set before committing.
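One way to run that comparison is to score the same labeled calls in both modes. The sketch below assumes you already have each call transcribed in Raw and Agent mode plus a gold intent label. `classify` stands in for whatever LLM call your pipeline makes, and the keyword classifier in the demo is a toy placeholder, not a recommendation.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass(frozen=True)
class LabeledCall:
    raw: str      # Raw-mode transcript of the call
    agent: str    # Agent-mode transcript of the same call
    intent: str   # gold label from your evaluation set


def accuracy_by_mode(calls: Iterable[LabeledCall],
                     classify: Callable[[str], str]) -> dict[str, float]:
    """Run the same classifier on both transcript modes and score each."""
    calls = list(calls)
    if not calls:
        raise ValueError("evaluation set is empty")
    hits = {"raw": 0, "agent": 0}
    for call in calls:
        hits["raw"] += classify(call.raw) == call.intent
        hits["agent"] += classify(call.agent) == call.intent
    return {mode: count / len(calls) for mode, count in hits.items()}


def toy_classifier(transcript: str) -> str:
    # Placeholder for an LLM call: a keyword rule, for demonstration only.
    return "order_status" if "order" in transcript.lower() else "other"


if __name__ == "__main__":
    sample = [
        LabeledCall(
            raw="[00:00] um hi I'd like to check on my order status please",
            agent="Hi, I'd like to check on my order status please.",
            intent="order_status",
        ),
    ]
    print(accuracy_by_mode(sample, toy_classifier))
```

If Raw scores meaningfully higher on your set, weigh that accuracy gain against the extra tokens before choosing a mode.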
Setting Up Agent Mode
Switching to Agent mode is a single parameter in the API:
curl -X POST "https://api.privocio.com/v1/transcribe" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "audio_url": "https://your-audio-bucket.com/call.mp3",
    "output_mode": "agent"
  }'
You can also set a default output mode per workspace in your Privocio dashboard. I recommend setting it to Agent for any pipeline feeding into an LLM.
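If your pipeline is in Python, the same request can be built with the standard library. This is a sketch, not official client code: the endpoint and the `audio_url` and `output_mode` fields come from the curl example, while the function names, the JSON content type, and the decision to return the raw response body are assumptions, since the response schema is not described here.

```python
import json
import urllib.request

API_URL = "https://api.privocio.com/v1/transcribe"  # endpoint from the curl example


def build_transcribe_request(api_key: str, audio_url: str,
                             output_mode: str = "agent") -> urllib.request.Request:
    """Build the POST request without sending it."""
    body = json.dumps({"audio_url": audio_url, "output_mode": output_mode})
    return urllib.request.Request(
        API_URL,
        data=body.encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            # Assumption: the API expects a JSON content type.
            "Content-Type": "application/json",
        },
        method="POST",
    )


def transcribe(api_key: str, audio_url: str, timeout: float = 30.0) -> str:
    """Send the request and return the response body as text, unparsed."""
    request = build_transcribe_request(api_key, audio_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")
```

Calling `transcribe("YOUR_API_KEY", "https://your-audio-bucket.com/call.mp3")` sends the request. Keeping request construction separate from sending makes the payload easy to inspect in tests without touching the network.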
The Hidden Cost of Raw Transcripts
Here's what I see consistently: teams start with Raw because it feels safer — no information loss. Then at scale, their LLM costs are higher than expected and their agent makes more errors.
The errors part is the key insight. Raw transcripts contain ambiguous inputs that force the LLM to disambiguate. "um ER two four seven eight two" — is that "ER" or "air"? The LLM spends tokens figuring out intent, and sometimes gets it wrong. Agent mode resolves these ambiguities during transcription, where it costs nothing extra.
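I've also seen Raw transcripts contribute to prompt injection problems. A caller who says "ignore previous instructions" ends up verbatim in the transcript and, from there, in your prompt. Agent mode's structured output makes it easier to fence caller speech off as data, but a fluently spoken instruction is not a disfluency and will survive cleanup in any mode, so treat transcript text as untrusted input regardless.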
Frequently Asked Questions
Does Agent mode cost more than Raw mode?
No. Agent mode is included in all Privocio plans at no extra charge. You pay the same per-minute transcription rate regardless of mode. The savings come from reduced LLM inference costs downstream.
Does Agent mode affect transcription accuracy?
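No. The underlying speech recognition is the same in every mode, so word error rate and speaker diarization do not change. Agent mode changes only how the recognized output is cleaned up and formatted. Timestamps are omitted from Clean and Agent output, as the feature table shows.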
What's the latency impact?
Agent mode adds approximately 5-10% to processing time. For batch transcription, this is imperceptible. For real-time streaming, the added latency is typically under 200 ms. Test with your own audio if latency is critical.
Can I convert Raw to Agent format retroactively?
No — disfluency detection and number normalization require the original audio signal. Once you have a Raw transcript, you can't reconstruct Agent-mode quality from text alone. Set the output mode at the start of your pipeline.
Conclusion: Stop Paying for Words You Don't Need
Every filler word and false start in your transcript costs you money every time it gets tokenized by your LLM. In the 500-hours-a-month example, that's about 320 unnecessary tokens per call, or roughly 320,000 a month across 1,000 calls.
Switching to Agent mode takes one parameter change. The savings — 40% or more on LLM inference — compound every month. I've yet to find a voice AI pipeline where the switch doesn't pay for itself within the first billing cycle.
Our pricing page breaks down exactly what each plan includes. For high-volume workloads, the Go plan at $19 per 4 weeks covers 400 hours of audio, enough to process a solid monthly volume in Agent mode while the token savings compound on your LLM bill.
For the full voice pipeline architecture — from audio ingestion to LLM response — read our complete guide to speech-to-text for AI agents.