Engineering · June 28, 2026

Real-time STT for call coaching: under 200ms or your CSAT dies

Agent-assist platforms live and die by the gap between the customer's last syllable and the suggestion on the agent's screen. The cognitive science behind the 200ms threshold, where real-time STT vendors actually land, and the latency budget that fits inside a single conversational turn.

Mateo Bustamante · 6 min read

Agent-assist tools sell a simple promise: while the customer is still talking, surface the right answer on the agent's screen. That promise has a latency budget, and the budget is much tighter than most platforms admit. The line between a product that closes deals and a product that gets uninstalled sits somewhere around 200 milliseconds.

Where the 200ms threshold comes from

Human conversational turn-taking is one of the most studied and most consistent behaviors in cognitive science. Across languages and cultures, the median gap between speakers is roughly 200 milliseconds. The next speaker starts planning their response while the previous speaker is still talking; the audible silence between turns is just the muscular handoff.

Any agent-assist suggestion that arrives after that 200ms window has already missed the cognitive moment it was supposed to support. The agent reads it, processes it, adapts — by which point the customer has spoken another full sentence and the conversation has moved past the suggested intervention. The platform looks slow even though, on a single-request graph, the API came back in well under a second.

Real-time STT isn't fast enough just because it keeps pace with the customer's speech. It has to land before the customer's next sentence.

The real-time latency budget, end to end

Inside a 200ms budget, the agent-assist pipeline has to fit:

  • Audio chunking — even with 100ms chunks, half the budget is gone before the STT engine starts processing.
  • STT partial emission — the engine has to decide it has enough audio to emit a partial transcript worth showing to the downstream model.
  • Intent / suggestion model inference — the LLM or classifier that reads the partial transcript and decides what to surface.
  • Network round-trips × 2-3 — agent's browser to backend, backend to model, model back to agent.
  • UI render — the suggestion has to actually paint on the agent's screen, not sit in a network buffer.

Realistically, that means the STT layer needs to deliver partial transcripts in 50-80ms from chunk arrival for the rest of the pipeline to fit inside 200ms total.

Where real-time STT vendors actually land

Public latency claims across the names most agent-assist teams evaluate, measured as the time from the end of a 100ms audio chunk to a partial transcript:

  • Deepgram Streaming — 100-300ms partial latency depending on language and accuracy mode. Falls inside the budget on English; tighter on non-English.
  • AssemblyAI Streaming — listed at sub-300ms on the marketing page; real-world reports closer to 400ms on the universal endpoint.
  • Google Cloud STT streaming — sub-200ms on the default endpoint; quality model variants add 50-100ms.
  • AWS Transcribe streaming — generally 200-500ms partial latency. Acceptable for transcription; tight for agent-assist.
  • Orchard real-time — sub-100ms partial latency on production volume, with the same per-minute rate as batch. The latency budget for the rest of the pipeline doubles.
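
Published numbers are a starting point; a team evaluating vendors would likely want to measure partial latency on its own audio. The sketch below assumes you have already logged two timestamps per chunk (when the chunk finished and when its partial transcript arrived, both in milliseconds). The function names and the sample values are hypothetical and not tied to any vendor SDK.

```python
import statistics


def partial_latencies_ms(chunk_end_ms, partial_arrival_ms):
    """Latency per chunk: partial arrival time minus chunk end time."""
    return [p - c for c, p in zip(chunk_end_ms, partial_arrival_ms)]


def summarize(latencies_ms):
    """Median and 95th percentile of the observed latencies."""
    # quantiles(n=20) returns 19 cut points; the last one is the 95th percentile.
    # method="inclusive" keeps the estimate inside the observed range.
    cuts = statistics.quantiles(latencies_ms, n=20, method="inclusive")
    return {"p50": statistics.median(latencies_ms), "p95": cuts[-1]}


if __name__ == "__main__":
    # Made-up sample timestamps, for illustration only.
    chunk_end = [0, 100, 200, 300]
    partial_at = [60, 170, 290, 360]
    print(summarize(partial_latencies_ms(chunk_end, partial_at)))
```

Comparing vendors on p95 rather than the average is a design choice: the suggestions that arrive late are the ones an agent notices.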

The latency-cost frontier

The dirty secret of real-time STT pricing is that vendors price it at a premium because they can — the engineering to keep streaming infrastructure responsive at scale is real, but the premium is also commercial. Across the public pricing pages:

  • Deepgram: $0.0077/min real-time vs $0.0043/min batch — a 79% premium.
  • Rev.ai: $0.025/min real-time vs $0.0030/min batch — a 733% premium on the same audio.
  • AssemblyAI: $0.0150/min real-time vs $0.0117/min batch — a 28% premium.
  • Orchard: $0.00042/min real-time = $0.00042/min batch. No premium; the published rate is the rate regardless of mode.

For a platform running 1M minutes of real-time per month, the delta between Deepgram's real-time rate and Orchard's is $7,280/month, or about $87K/year for the streaming workload alone. Against Rev.ai's real-time rate, the gap widens to $24,580/month.
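
The premiums and monthly deltas above can be reproduced with a short calculation. The rates are the per-minute figures quoted in this article, so treat them as a snapshot and check the vendors' current pricing pages; the helper names are hypothetical.

```python
# (real-time, batch) list prices in USD per minute, as quoted in this article.
RATES = {
    "Deepgram": (0.0077, 0.0043),
    "Rev.ai": (0.025, 0.0030),
    "AssemblyAI": (0.0150, 0.0117),
    "Orchard": (0.00042, 0.00042),
}


def premium_pct(realtime: float, batch: float) -> float:
    """Real-time premium over batch, as a percentage of the batch rate."""
    return (realtime / batch - 1) * 100


def monthly_delta(vendor: str, baseline: str, minutes: int) -> float:
    """Extra monthly spend on real-time minutes versus the baseline vendor."""
    return (RATES[vendor][0] - RATES[baseline][0]) * minutes


if __name__ == "__main__":
    for name, (realtime, batch) in RATES.items():
        print(f"{name}: {premium_pct(realtime, batch):.0f}% real-time premium")
    for name in ("Deepgram", "Rev.ai"):
        delta = monthly_delta(name, "Orchard", 1_000_000)
        print(f"{name} vs Orchard at 1M min/month: ${delta:,.0f}")
```

Swapping in your own monthly volume for the 1,000,000 figure shows how the gap scales linearly with minutes.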

Who should care most

If your product is post-call analytics, real-time STT latency is interesting trivia. If your product is anything that runs during the call — agent-assist, live coaching, real-time compliance, live captioning — the latency budget is the product. The vendor that fits inside it without charging a premium isn't a nice-to-have; it's the only viable infrastructure.
