Engineering · June 28, 2026

Build vs buy: should your conversation QA platform run its own STT?

When the STT bill crosses six figures a month, every CTO is asked the same question by the board: why don't we bring this in-house? The honest decision tree, the hidden costs of self-hosting Whisper, and the case for buying from a vendor priced like infrastructure.

Mateo Bustamante · 7 min read

Every conversation QA platform crosses the same threshold at the same point in its lifecycle. The STT bill goes from a line item nobody reads to a board question nobody can answer confidently: why don't we just run this ourselves? The instinct is reasonable. The math, run honestly, almost always says no — for reasons that have nothing to do with capability and everything to do with where the leverage sits.

The build option, costed honestly

"Building" STT in 2026 doesn't mean training a model from scratch. It means self-hosting an open-source one — Whisper, Distil-Whisper, faster-whisper, or one of the more recent multilingual open releases — on rented GPU. The architecture is well-trodden; the bill is the part most teams underestimate.

A representative self-host stack for a platform processing 5M minutes/month:

  • Compute: 5M minutes / 60 = 83,333 hours of audio. At a 50× real-time factor on a single A100 (a reasonable production target with batch optimization), the stack needs ~1,667 GPU-hours/month. At Vast.ai spot rates (~$0.50/hr) you're at $833/month for compute alone — but spot pricing isn't tier-1 reliable.
  • Reserved capacity: tier-1 redundancy means reserving GPUs at on-demand rates (~$1.20/hr, or roughly $876 per GPU over a 730-hour month). With two GPUs per replica, call it $1,750/month per replica × 2 zones = $3,500/month.
  • Diarization: a separate model (pyannote, sherpa-onnx) running on the same or an adjacent GPU adds 30-50% to compute spend.
  • Engineering time: one senior MLOps engineer half-time, indefinitely. At loaded cost, ~$8,000/month.
  • On-call rotation: somebody picks up the pager when the queue backs up at 3am. Spreading that load across the existing rotation has a cultural cost even when it is budgeted at zero.
  • Continuous model evaluation: WER drift, accent regressions, new language coverage. Budget a quarter of an applied scientist, ~$5,000/month.

Honest all-in monthly TCO at 5M-minute scale: ~$18,000-22,000/month, before counting the opportunity cost of the engineering attention. Annualized: ~$250K/year, plus a permanent dependency on the team's most senior infra brain.
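The line items above can be added up in a few lines. This is a minimal sketch, not a pricing tool: the 730-hour month, the two-GPUs-per-replica reading of the reserved-capacity bullet, and the 40% diarization midpoint are assumptions, and every name in the block is hypothetical. Spot compute is left out because the reserved replicas replace it, and on-call stays at zero as in the list.

```python
from dataclasses import dataclass

HOURS_PER_MONTH = 730  # assumption: average hours in a calendar month


@dataclass
class SelfHostAssumptions:
    minutes_per_month: float = 5_000_000
    realtime_factor: float = 50.0        # audio hours transcribed per GPU-hour
    on_demand_rate: float = 1.20         # USD per GPU-hour
    gpus_per_replica: int = 2            # assumption: one reading of the bullet
    zones: int = 2
    diarization_overhead: float = 0.40   # assumption: midpoint of 30-50%
    mlops_monthly: float = 8_000.0       # half-time senior MLOps engineer
    evaluation_monthly: float = 5_000.0  # quarter of an applied scientist


def gpu_hours_needed(a: SelfHostAssumptions) -> float:
    """GPU-hours of pure transcription work per month."""
    return a.minutes_per_month / 60 / a.realtime_factor


def monthly_tco(a: SelfHostAssumptions) -> dict:
    """Itemized monthly cost in USD; on-call is budgeted at zero."""
    reserved = a.on_demand_rate * HOURS_PER_MONTH * a.gpus_per_replica * a.zones
    items = {
        "reserved_gpu": reserved,
        "diarization": reserved * a.diarization_overhead,
        "mlops": a.mlops_monthly,
        "evaluation": a.evaluation_monthly,
    }
    items["total"] = sum(items.values())
    return items


if __name__ == "__main__":
    a = SelfHostAssumptions()
    print(f"GPU-hours needed: {gpu_hours_needed(a):,.0f}")
    for name, usd in monthly_tco(a).items():
        print(f"{name:>14}: ${usd:,.0f}")
```

With these defaults the total lands a little under $18,000/month, at the low end of the range quoted above. People cost, not GPU cost, dominates: the two salary lines account for $13,000 of it.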

The buy option, costed honestly

On the buy side, the same 5M-minute workload across the names most teams evaluate:

  • OpenAI Whisper API: $0.006/min × 5M = $30,000/month.
  • Deepgram Nova-3 + diarization: ~$0.0083/min × 5M = $41,500/month.
  • AWS Transcribe: $0.024/min × 5M = $120,000/month.

At Deepgram-class pricing, buy is more expensive than build, and this is the real reason CTOs entertain the question. Against the $18,000-22,000 build TCO, self-hosting saves roughly $20K/month versus Deepgram and roughly $8-12K/month versus the OpenAI Whisper API, the cheapest vendor on the list. Multiplied across 12 months and 3 years, that's a real engineering hire.
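The vendor arithmetic is simple enough to script, which makes it easy to rerun when a quote changes. A minimal sketch using only the per-minute rates listed above and the $18,000-22,000 build TCO from the previous section; the function and constant names are hypothetical.

```python
MINUTES_PER_MONTH = 5_000_000
BUILD_TCO_RANGE = (18_000, 22_000)  # USD/month, from the build section

VENDOR_RATES = {  # USD per minute, as quoted in the list above
    "OpenAI Whisper API": 0.006,
    "Deepgram Nova-3 + diarization": 0.0083,
    "AWS Transcribe": 0.024,
}


def monthly_cost(rate_per_minute, minutes=MINUTES_PER_MONTH):
    """Vendor bill in USD for one month of audio."""
    return rate_per_minute * minutes


def self_host_savings(rate_per_minute, build_range=BUILD_TCO_RANGE):
    """(min, max) monthly savings from self-hosting versus one vendor."""
    low, high = build_range
    cost = monthly_cost(rate_per_minute)
    return cost - high, cost - low


if __name__ == "__main__":
    for vendor, rate in VENDOR_RATES.items():
        lo, hi = self_host_savings(rate)
        print(f"{vendor}: ${monthly_cost(rate):,.0f}/month, "
              f"self-host saves ${lo:,.0f} to ${hi:,.0f}")
```

The savings figure depends heavily on which vendor is the baseline, so it is worth stating the comparison vendor whenever a single number is quoted to a board.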

Build vs buy isn't a question about capability. It's a question about where vendor pricing leaves margin on the table.

When build actually wins

There are real cases where self-hosting STT is the right answer:

  • Air-gapped or sovereign deployments — the customer's audio cannot leave their VPC; vendor SaaS is contractually disqualified.
  • Specialized custom vocabularies where the vendor's hotword/biasing interface isn't expressive enough. Rare in practice; usually solvable with prompt biasing.
  • Strategic IP play — your transcription quality is the product, not the input to it. If you're selling STT, build STT.

When buy wins (almost everyone)

Three signals tell a CTO to keep buying:

  • The team's strategic moat is what you do with the transcripts (scoring, coaching, intelligence), not how you produce them.
  • The customer base does not require sovereign deployment. Standard SaaS terms are acceptable.
  • The MLOps headcount could be deployed on a higher-leverage problem.

If all three apply, build is a vanity project. The math only works because vendor pricing is bad.
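The two lists above reduce to a short decision function. This is a sketch of one way to encode them, not a formal rule from the article: the field names are hypothetical, and the "cost it out" fallback for platforms that match neither list is an assumption.

```python
from dataclasses import dataclass


@dataclass
class Platform:
    audio_must_stay_in_customer_vpc: bool  # air-gapped or sovereign deployment
    vendor_biasing_insufficient: bool      # custom vocabulary beyond hotwords
    transcription_is_the_product: bool     # you sell STT itself
    mlops_has_higher_leverage_work: bool   # headcount is better spent elsewhere


def recommend(p: Platform) -> str:
    """Return 'build', 'buy', or 'cost it out' for a given platform."""
    # Any single build case is enough to justify self-hosting.
    if (p.audio_must_stay_in_customer_vpc
            or p.vendor_biasing_insufficient
            or p.transcription_is_the_product):
        return "build"
    # Reaching this point means the moat is downstream of the transcript and
    # SaaS terms are acceptable, so only the headcount signal remains.
    if p.mlops_has_higher_leverage_work:
        return "buy"
    return "cost it out"


if __name__ == "__main__":
    qa_platform = Platform(False, False, False, True)
    print(recommend(qa_platform))
```

Ordering the build checks first reflects that they are hard constraints (contractual or strategic), while the buy signals are judgments about leverage.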

The third option

The build-vs-buy decision tree assumes the buy side is priced the way the buy side has always been priced. At $0.024/min, building yourself for roughly $0.004/min in TCO ($18,000-22,000 spread over 5M minutes) is rational. At $0.0043, less so. At $0.00042, Orchard's per-minute rate at production volume, the math inverts entirely.

Same 5M minutes at $0.00042 = $2,100/month. That's not buy-vs-build territory. That's "buy the line item and redeploy the MLOps engineer onto your actual product" territory. The decision tree collapses into a single answer the second the per-minute number drops an order of magnitude below your own TCO floor.
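The comparison comes down to one number: the self-host TCO divided by monthly minutes, which is the per-minute floor a vendor has to beat. A minimal sketch using the article's $18,000-22,000 TCO range and 5M minutes; the function names are hypothetical, and it compares cost only, ignoring the non-cost build cases listed earlier.

```python
def build_cost_per_minute(monthly_tco_usd, minutes_per_month):
    """Effective self-host cost per audio minute in USD."""
    return monthly_tco_usd / minutes_per_month


def cheaper_option(vendor_rate, monthly_tco_usd, minutes_per_month):
    """'buy' if the vendor rate undercuts the self-host floor, else 'build'."""
    floor = build_cost_per_minute(monthly_tco_usd, minutes_per_month)
    return "buy" if vendor_rate < floor else "build"


if __name__ == "__main__":
    minutes = 5_000_000
    for tco in (18_000, 22_000):
        floor = build_cost_per_minute(tco, minutes)
        print(f"TCO ${tco:,}: floor ${floor:.4f}/min")
        for rate in (0.024, 0.0043, 0.00042):
            print(f"  vendor at ${rate}/min -> {cheaper_option(rate, tco, minutes)}")
```

At $0.0043/min the answer flips depending on which end of the TCO range applies, which is why that price point reads as "less so" rather than a clear call. At $0.00042/min the vendor is cheaper across the whole range.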

Try Orchard

The cheapest minute on the market. 500 minutes free at signup, no card.