Pricing · June 14, 2026

The cheapest speech-to-text API in 2026: a 12-provider breakdown

Orchard's batch tier runs at $0.00042/min: about 10× cheaper than Deepgram, 14× cheaper than AssemblyAI, and 57× cheaper than AWS Transcribe. Source-linked comparison across 12 providers.

Mateo Bustamante · Ramiro Alvarez · 8 min read

Orchard's Pro plan transcribes audio at $0.00042 per minute. That makes it the cheapest speech-to-text API in this comparison: more than 7× below the next-cheapest batch rate, and 57× below AWS Transcribe at the same volume. The pricing pages of the other eleven providers we benchmarked are easy to read at a glance, but the per-minute cost they actually imply varies by a factor of over 50× once you normalize for batch vs real-time, included credits, and the free-tier traps.

We pulled the public pricing pages of twelve speech-to-text APIs and laid them side by side. The numbers in this post are normalized to USD per minute, sourced to each provider's live pricing page, and dated. This is the first in a series we'll re-publish quarterly — if you're running production-scale transcription (podcast networks, call-center analytics, batch dubbing, legal discovery), the gap between "the price on the homepage" and "the cheque you actually sign" is wider than most people realize. The table below is meant to close that gap.

How we read the pricing pages

A quick note on how we normalized the numbers, because the providers don't make this easy:

Pulled from each provider's public pricing page on June 14, 2026. Source links are in the table; click any provider to verify.
Where a batch tier is broken out, we use that. Where it isn't, we use the standard/real-time tier.
Pay-as-you-go where possible. Most early teams don't have the leverage to negotiate committed-volume discounts, so committed pricing isn't a fair basis for comparison.
Per-minute prices are rounded to the nearest hundredth of a cent (Orchard's rate is shown to one more decimal place) and converted from hourly rates where the provider only quotes by the hour.
Currency normalized to USD using the spot rate on the same day for providers (Gladia, Speechmatics) that quote in EUR or GBP.
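The hourly-to-per-minute conversion described above can be sketched in a few lines. This is a minimal illustration, not the script used for the table: the function name and the `fx_to_usd` parameter are assumptions, and the post does not state the exchange rates it applied, so none is filled in here.

```python
def usd_per_minute(hourly_price: float, fx_to_usd: float = 1.0) -> float:
    """Convert an hourly price to USD per minute, rounded to 4 decimal places.

    fx_to_usd is a hypothetical multiplier for prices quoted in EUR or GBP;
    leave it at 1.0 for prices already in USD.
    """
    return round(hourly_price * fx_to_usd / 60, 4)


# Hourly prices quoted in the table's notes column.
print(usd_per_minute(0.37))  # AssemblyAI: $0.37/hr
print(usd_per_minute(1.00))  # Azure AI Speech: $1 per audio hour
```

Both results match the batch figures in the table ($0.0062 and $0.0167 per minute).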

The numbers

Twelve providers, normalized to USD per minute. Orchard's row is highlighted because we're writing the post — but the ordering is purely by price, ascending, so the cheapest tier ends up at the top regardless of who we are.

Prices as of June 14, 2026 · public pricing pages, normalized to USD/min

Provider	Batch ($/min)	Real-time ($/min)	Notes
Orchard (Pro)	$0.00042	$0.00042	$25/mo for 60,000 min, shared STT + TTS + Clone balance, no throttling.
Voicegain	$0.0033	$0.0058	Whisper-Pro batch tier, pay-as-you-go.
Rev.ai	$0.0035	$0.020	Reverb async model. Human transcription separately at $1.99/min.
Deepgram (Nova-3)	$0.0043	$0.0077	Pay-as-you-go. Free $200 credit at signup.
Soniox	$0.0050	$0.083 /hr	Speech AI batch. Real-time billed per hour, not per minute.
OpenAI Whisper API	$0.0060	—	Single-tier price. No native batch discount, no real-time API.
AssemblyAI (Universal)	$0.0062	$0.0123	Best model batch at $0.37/hr. Summarization is extra.
Gladia	$0.0102	$0.0124	Solo plan €0.612/hr. EU-priced; USD here is approximate.
Speechmatics	$0.0117	$0.0117	Standard plan; Enhanced model billed at premium tier.
Azure AI Speech	$0.0167	$0.0167	Standard tier. $1/audio hour. Higher tiers for custom models.
AWS Transcribe	$0.024	$0.024	First 250K min/mo at $0.024. Tier-2 drops to $0.015.
Google Cloud STT	$0.024	$0.024	v1 standard. Long-form model billed at $0.016/min.

The 50× gap nobody talks about

AWS Transcribe sits at $0.024/min on the standard tier. Orchard's Pro plan works out to $0.00042/min. That's a 57× spread for the same input — sixty seconds of audio in, a transcript out. The audio doesn't change. The transcript is recognizably the same artifact. The price differs by almost two orders of magnitude.
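For anyone who wants to recompute the multiples, here is a short sketch that divides each batch rate by Orchard's. The rates are copied from the table above; the dictionary and function names are illustrative only.

```python
ORCHARD_RATE = 0.00042  # USD per minute, Orchard Pro, from the table

# Batch rates (USD per minute) copied from the table.
batch_rates = {
    "Voicegain": 0.0033,
    "Deepgram (Nova-3)": 0.0043,
    "AssemblyAI (Universal)": 0.0062,
    "AWS Transcribe": 0.024,
}


def price_multiple(rate: float, baseline: float = ORCHARD_RATE) -> float:
    """How many times more expensive `rate` is than `baseline`."""
    return rate / baseline


for provider, rate in batch_rates.items():
    print(f"{provider}: {price_multiple(rate):.1f}x")
```

With these inputs the multiples come out to roughly 7.9×, 10.2×, 14.8×, and 57.1×.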

How does that happen? A few reasons, in roughly the order they matter:

Legacy hyperscaler pricing. AWS, Google and Azure quote STT at the price the model cost them to run in 2020. Inference cost has dropped 8–10× since then; their pricing has not. They have no incentive to drop it until enterprise procurement starts asking — and procurement asks slowly.
Branding tax. Providers in the $0.005–0.012 range are charging for the brand recognition and the polish of the dev portal, not the cost of compute. There's nothing wrong with that — but if your use case doesn't need the brand, you're paying for someone else's pitch deck.
Real-time anchor. Many providers price batch as a small discount off real-time, even though real-time costs dramatically more to run (low-latency GPUs sit idle between streams). Batch should be priced from cost, not from real-time. We do.

Translation: most STT pricing is a margin choice, not a cost floor.

Why batch is where the wins are

Batch jobs are the unglamorous backbone of the speech industry. Nobody tweets about transcribing 20,000 hours of podcast backlog. But that workload — long files, no latency budget, massive concurrency — is where pricing differentiation compounds.

At $0.024/min on AWS Transcribe, 100,000 hours of audio costs $144,000. At Deepgram Nova-3 batch ($0.0043/min), it's $25,800. At Orchard Pro it's $2,520 — a 57× swing on the same job. For a podcast network, a legal discovery firm, or anyone doing historical audio digitization, that's the difference between "the budget covers it" and "the project gets shelved."
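The arithmetic behind those three figures is simple enough to check yourself. The sketch below assumes flat per-minute billing with no volume tiers or free credits, which is a simplification; the function name is illustrative.

```python
def job_cost_usd(audio_hours: float, rate_per_min: float) -> float:
    """Cost of transcribing `audio_hours` of audio at a flat per-minute rate."""
    return round(audio_hours * 60 * rate_per_min, 2)


AUDIO_HOURS = 100_000  # the workload used in the paragraph above

for provider, rate in [
    ("AWS Transcribe", 0.024),
    ("Deepgram Nova-3 batch", 0.0043),
    ("Orchard Pro", 0.00042),
]:
    print(f"{provider}: ${job_cost_usd(AUDIO_HOURS, rate):,.2f}")
```

One caveat on the Orchard figure: $2,520 uses the rounded rate of $0.00042/min. If the Pro plan scales as straight multiples of $25 per 60,000 minutes (an assumption, since the post does not say how usage beyond one block is billed), the same job would come to 100 blocks, or $2,500.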

What "cheapest" usually costs you

Three traps that the bottom-of-the-table providers usually share — and how we sidestep each one:

1. Throttling at the worst possible moment

Most cheap STT APIs throttle aggressively on requests-per-minute and concurrent streams. The price page advertises "unlimited usage," but the rate-limit page tells you you can run six concurrent jobs. For a batch workload, that's a cap on throughput, not on cost — and it usually shows up at 3 AM when the job is supposed to finish.

Orchard's paid plans don't throttle on concurrency. You send 200 simultaneous batch jobs, we run 200 simultaneous batch jobs. The architecture is sized for this from day one because batch is our anchor workload, not an afterthought to a real-time product.
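To see why a concurrency cap matters more than the price for a backlog, here is a rough back-of-the-envelope model. It assumes jobs parallelize perfectly and that every job runs at the same real-time factor; the 0.15 value is borrowed from the turnaround target quoted later in this post and is applied to the capped case purely for illustration, not as a measurement of any provider.

```python
def backlog_wall_clock_hours(audio_hours: float, rtf: float, concurrency: int) -> float:
    """Wall-clock hours to clear a backlog, assuming perfectly parallel jobs."""
    if concurrency < 1:
        raise ValueError("concurrency must be at least 1")
    compute_hours = audio_hours * rtf  # total processing time across all jobs
    return compute_hours / concurrency


BACKLOG_HOURS = 20_000  # the podcast backlog mentioned earlier
ASSUMED_RTF = 0.15      # hypothetical: same speed for both scenarios

for cap in (6, 200):
    hours = backlog_wall_clock_hours(BACKLOG_HOURS, ASSUMED_RTF, cap)
    print(f"{cap:>3} concurrent jobs: {hours:,.0f} wall-clock hours")
```

Under these assumptions the same backlog takes about 500 hours at six concurrent jobs and about 15 hours at 200.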

2. Hidden quality compromises

Cheap STT often means a smaller model with materially worse WER on accented speech, code-switched content (especially LATAM Spanish ↔ English), or noisy environments. The price is real, but so is the cost of re-transcribing with a better provider when the first pass is unusable.

We benchmark publicly against the leading models on Spanish, Portuguese, and English. Our WER stays inside the top quartile of the providers in this table, at roughly 0.04 cents a minute.

3. Batch turnaround in "hours, maybe"

Some batch tiers are cheap because the provider doesn't commit to a turnaround SLA. The job lands when it lands. For a production pipeline triggering downstream steps, that unpredictability is more expensive than the cents you saved.

We commit to a target turnaround under 0.15× real-time on paid plans — a 60-minute audio file finishes in under 9 minutes, measured. If we miss it on a given job, the metric shows in the dashboard. No fine print.
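If you want to check a turnaround target like this against your own job logs, the calculation is just processing time divided by audio duration. The sketch below is a generic check, not Orchard's dashboard logic; the function names are illustrative and the 0.15 default is the target stated above.

```python
def real_time_factor(processing_minutes: float, audio_minutes: float) -> float:
    """Processing time as a fraction of the audio's duration."""
    if audio_minutes <= 0:
        raise ValueError("audio_minutes must be positive")
    return processing_minutes / audio_minutes


def meets_target(processing_minutes: float, audio_minutes: float,
                 target_rtf: float = 0.15) -> bool:
    """True if the job finished strictly under the target real-time factor."""
    return real_time_factor(processing_minutes, audio_minutes) < target_rtf


# A 60-minute file: the target implies a budget of 60 * 0.15 = 9 minutes.
print(meets_target(8.5, 60))   # finished in 8.5 minutes
print(meets_target(10.0, 60))  # finished in 10 minutes
```

The first call is within the target and the second is not.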

Why price alone isn't the pitch

We could end the post here on the price headline. But the thing we've seen too often in the API economy is that price-leader providers churn fast: the moment a competitor cuts price by 20%, half the customers move. The only way to beat that gravity is to make sure the price advantage sits on top of a real product advantage.

For STT, that's the combination of:

Price — the topic of this post.
Quality (WER) — competitive with the top of this table on the languages we serve.
Throughput (RTF + concurrency) — paid plans don't throttle.
Coverage — 90+ languages, with active fine-tuning per LATAM province / state.

Any single one of these is a commodity. The combination is what we sell.

Price is the headline. The product is the combination.

How to use this table

If you're building or evaluating an STT vendor right now:

Estimate your monthly minutes honestly. Most teams undercount; round up by 30%.
Multiply by the per-minute rate from the table, using the source link to verify the current published price.
Add 15% for the throttling you don't see — jobs that need to be retried because of rate limits, or pushed to a slower tier.
Run a 1-hour pilot on three providers. Compare WER on your actual audio, not theirs. Compare turnaround time. Compare what the support response looks like at 1 AM on a Saturday.
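The first three steps reduce to one formula. Here is a minimal sketch, with the 30% and 15% margins from the list as defaults; the 100,000-minute estimate in the example is a made-up figure for illustration, and the function name is an assumption.

```python
def monthly_budget_usd(estimated_minutes: float, rate_per_min: float,
                       undercount_margin: float = 0.30,
                       retry_overhead: float = 0.15) -> float:
    """Estimated monthly spend after padding for undercounting and retries."""
    padded_minutes = estimated_minutes * (1 + undercount_margin)
    base_cost = padded_minutes * rate_per_min
    return round(base_cost * (1 + retry_overhead), 2)


# Hypothetical team estimating 100,000 min/month, priced at the
# Deepgram Nova-3 batch rate from the table ($0.0043/min).
print(monthly_budget_usd(100_000, 0.0043))
```

For that example the padded budget is $642.85, against $430 for the naive estimate. Swap in any rate from the table, after verifying it against the provider's current pricing page.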

We'll publish the next update in September 2026. Subscribe to the RSS feed if you want it in your reader the day it lands.