EngineeringJune 28, 2026

Build vs buy: should your conversation QA platform run its own STT?

When the STT bill crosses six figures a month, every CTO is asked the same question by the board: why don't we bring this in-house? The honest decision tree, the hidden costs of self-hosting Whisper, and the case for buying from a vendor priced like infrastructure.

Mateo Bustamante7 min read

Every conversation QA platform crosses the same threshold at the same point in its lifecycle. The STT bill goes from a line item nobody reads to a board question nobody can answer confidently: why don't we just run this ourselves? The instinct is reasonable. The math, run honestly, almost always says no — for reasons that have nothing to do with capability and everything to do with where the leverage sits.

The build option, costed honestly

"Building" STT in 2026 doesn't mean training a model from scratch. It means self-hosting an open-source one — Whisper, Distil-Whisper, faster-whisper, or one of the more recent multilingual open releases — on rented GPU. The architecture is well-trodden; the bill is the part most teams underestimate.

A representative self-host stack for a platform processing 5M minutes/month:

Compute: 5M minutes / 60 = 83,333 hours of audio. At a 50× real-time factor on a single A100 (a reasonable production target with batch optimization), the stack needs ~1,667 GPU-hours/month. At Vast.ai spot rates (~$0.50/hr) you're at $833/month for compute alone — but spot pricing isn't tier-1 reliable.
Reserved capacity: tier-1 redundancy means two GPUs reserved at on-demand rates (~$1.20/hr) — call it $1,750/month per replica × 2 zones = $3,500/month.
Diarization: a separate model (pyannote, sherpa-onnx) running on the same or adjacent GPU, add 30-50% to compute spend.
Engineering time: one senior MLOps engineer half-time, indefinitely. At loaded cost, ~$8,000/month.
On-call rotation: somebody picks up the pager when the queue backs up at 3am. Distributed across existing rotation has cultural cost even when budgeted at zero.
Continuous model evaluation: WER drift, accent regressions, new language coverage. Budget a quarter of an applied scientist, ~$5,000/month.

Honest all-in monthly TCO at 5M-minute scale: ~$18,000-22,000/month, before counting the opportunity cost of the engineering attention. Annualized: ~$250K/year, plus a permanent dependency on the team's most senior infra brain.

The buy option, costed honestly

On the buy side, the same 5M-minute workload across the names most teams evaluate:

OpenAI Whisper API: $0.006/min × 5M = $30,000/month.
Deepgram Nova-3 + diarization: ~$0.0083/min × 5M = $41,500/month.
AWS Transcribe: $0.024/min × 5M = $120,000/month.

At Deepgram-class pricing, buy is more expensive than build — this is the real reason CTOs entertain the question. Self-host saves ~$20K/month against the cheapest serious vendor. Multiplied across 12 months and 3 years, that's a real engineering hire.

Build vs buy isn't a question about capability. It's a question about where vendor pricing leaves margin on the table.

When build actually wins

There are real cases where self-hosting STT is the right answer:

Air-gapped or sovereign deployments — the customer's audio cannot leave their VPC; vendor SaaS is contractually disqualified.
Specialized custom vocabularies where the vendor's hotword/biasing interface isn't expressive enough. Rare in practice; usually solvable with prompt biasing.
Strategic IP play — your transcription quality is the product, not the input to it. If you're selling STT, build STT.

When buy wins (almost everyone)

Three signals tell a CTO to keep buying:

The team's strategic moat is what you do with the transcripts (scoring, coaching, intelligence), not how you produce them.
The customer base does not require sovereign deployment. Standard SaaS terms are acceptable.
The MLOps headcount could be deployed on a higher-leverage problem.

If all three apply, build is a vanity project. The math only works because vendor pricing is bad.

The third option

The build-vs-buy decision tree assumes the buy side is priced the way the buy side has always been priced. At $0.024/min, building yourself for $5/min in TCO is rational. At $0.0043, less so. At $0.00042 — Orchard's per-minute rate at production volume — the math inverts entirely.

Same 5M minutes at $0.00042 = $2,100/month. That's not buy-vs-build territory. That's "buy the line item and redeploy the MLOps engineer onto your actual product" territory. The decision tree collapses into a single answer the second the per-minute number drops an order of magnitude below your own TCO floor.

The build option, costed honestly

The buy option, costed honestly

When build actually wins

When buy wins (almost everyone)

The third option

The cheapest minute on the market. 500 minutes free at signup, no card.