Build vs buy: should your conversation QA platform run its own STT?
When the STT bill crosses six figures a month, every CTO is asked the same question by the board: why don't we bring this in-house? The honest decision tree, the hidden costs of self-hosting Whisper, and the case for buying from a vendor priced like infrastructure.
Every conversation QA platform crosses the same threshold at the same point in its lifecycle. The STT bill goes from a line item nobody reads to a board question nobody can answer confidently: why don't we just run this ourselves? The instinct is reasonable. The math, run honestly, almost always says no — for reasons that have nothing to do with capability and everything to do with where the leverage sits.
The build option, costed honestly
"Building" STT in 2026 doesn't mean training a model from scratch. It means self-hosting an open-source one — Whisper, Distil-Whisper, faster-whisper, or one of the more recent multilingual open releases — on rented GPU. The architecture is well-trodden; the bill is the part most teams underestimate.
A representative self-host stack for a platform processing 5M minutes/month:
- Compute: 5M minutes / 60 = 83,333 hours of audio. At a 50× real-time factor on a single A100 (a reasonable production target with batch optimization), the stack needs ~1,667 GPU-hours/month. At Vast.ai spot rates (~$0.50/hr) you're at $833/month for compute alone — but spot pricing isn't tier-1 reliable.
- Reserved capacity: tier-1 redundancy means two GPUs reserved at on-demand rates (~$1.20/hr) — call it $1,750/month per replica × 2 zones = $3,500/month.
- Diarization: a separate model (pyannote, sherpa-onnx) running on the same or adjacent GPU, add 30-50% to compute spend.
- Engineering time: one senior MLOps engineer half-time, indefinitely. At loaded cost, ~$8,000/month.
- On-call rotation: somebody picks up the pager when the queue backs up at 3am. Distributed across existing rotation has cultural cost even when budgeted at zero.
- Continuous model evaluation: WER drift, accent regressions, new language coverage. Budget a quarter of an applied scientist, ~$5,000/month.
Honest all-in monthly TCO at 5M-minute scale: ~$18,000-22,000/month, before counting the opportunity cost of the engineering attention. Annualized: ~$250K/year, plus a permanent dependency on the team's most senior infra brain.
The buy option, costed honestly
On the buy side, the same 5M-minute workload across the names most teams evaluate:
- OpenAI Whisper API: $0.006/min × 5M = $30,000/month.
- Deepgram Nova-3 + diarization: ~$0.0083/min × 5M = $41,500/month.
- AWS Transcribe: $0.024/min × 5M = $120,000/month.
At Deepgram-class pricing, buy is more expensive than build — this is the real reason CTOs entertain the question. Self-host saves ~$20K/month against the cheapest serious vendor. Multiplied across 12 months and 3 years, that's a real engineering hire.
Build vs buy isn't a question about capability. It's a question about where vendor pricing leaves margin on the table.
When build actually wins
There are real cases where self-hosting STT is the right answer:
- Air-gapped or sovereign deployments — the customer's audio cannot leave their VPC; vendor SaaS is contractually disqualified.
- Specialized custom vocabularies where the vendor's hotword/biasing interface isn't expressive enough. Rare in practice; usually solvable with prompt biasing.
- Strategic IP play — your transcription quality is the product, not the input to it. If you're selling STT, build STT.
When buy wins (almost everyone)
Three signals tell a CTO to keep buying:
- The team's strategic moat is what you do with the transcripts (scoring, coaching, intelligence), not how you produce them.
- The customer base does not require sovereign deployment. Standard SaaS terms are acceptable.
- The MLOps headcount could be deployed on a higher-leverage problem.
If all three apply, build is a vanity project. The math only works because vendor pricing is bad.
The third option
The build-vs-buy decision tree assumes the buy side is priced the way the buy side has always been priced. At $0.024/min, building yourself for $5/min in TCO is rational. At $0.0043, less so. At $0.00042 — Orchard's per-minute rate at production volume — the math inverts entirely.
Same 5M minutes at $0.00042 = $2,100/month. That's not buy-vs-build territory. That's "buy the line item and redeploy the MLOps engineer onto your actual product" territory. The decision tree collapses into a single answer the second the per-minute number drops an order of magnitude below your own TCO floor.