Pricing · June 26, 2026

The cheapest speech analytics API in 2026: WER, RTF and the full per-minute breakdown

Speech analytics — transcription + speaker diarization + intelligence — should cost a fraction of a cent per minute in 2026. Orchard runs the full pipeline at $0.0030/min at production volume. Source-linked comparison across 12 providers, plus a frank read on WER and RTF.

Mateo Bustamante · Ramiro Alvarez · 10 min read

Speech analytics — the full pipeline of transcription, speaker diarization, and downstream intelligence (sentiment, topics, entities) — should cost a fraction of a cent per minute of audio in 2026. The contact-center and conversation-intelligence vendors are still quoting it as if it were 2021. Orchard runs the full pipeline at $0.0030 per minute at production volume: about 3× cheaper than Deepgram with intelligence add-ons, 5× cheaper than AssemblyAI's equivalent bundle, and 20× cheaper than Google Contact Center AI for the same audio in, same artifacts out.

This is the post you wanted when the analytics vendor your team evaluated quoted you an annual contract starting at six figures. It's also the post you wanted when the "cheaper" vendor advertised one number on the homepage and then charged you four times that on the actual invoice because every feature you needed sat behind a per-minute add-on. The numbers in this breakdown are normalized to USD per minute of the full pipeline, sourced to each provider's live pricing page, and dated.

What we mean by "speech analytics"

Speech-to-text alone is a commodity. The pricing fights there have already happened (we covered them in the cheapest STT API post). Speech analytics is the next layer up — what teams running call centers, podcast networks, legal discovery and compliance pipelines actually buy. For this comparison, we define the pipeline as three artifacts delivered for every minute of audio:

Transcription — a written record of what was said, time-aligned to the audio.
Speaker diarization — "who spoke when", labeled per turn.
Intelligence — sentiment per turn, topic tagging, entity extraction. The minimum surface area a compliance, sales coaching or QA team can't live without.

Vendors price these three artifacts as separate line items so the "starting from" price on the homepage is the cheapest possible subset. The numbers we publish below sum the three line items, because that's the price your finance team sees.

How we read the pricing pages

The normalization rules — the same shape as our other quarterly tables:

Pulled from each provider's public pricing page on June 26, 2026. Source links live in the table.
Batch tier where the provider exposes one; otherwise standard. Stacked with the speaker-diarization add-on and the cheapest analytics/intelligence bundle that includes sentiment.
Pay-as-you-go where possible. Committed-volume contracts get materially cheaper on every vendor in the table, but early teams don't have leverage to negotiate them.
Per-minute prices rounded to the nearest hundredth of a cent and converted from hourly rates where the provider only quotes hourly.
Currency normalized to USD at the spot rate on the date stamp.
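The rules above reduce to a small amount of arithmetic. Here is a minimal sketch of that normalization. The function names, the line-item tuple layout and the example quote are illustrative assumptions, not any vendor's actual price list, and rounding the summed total rather than each line item is our reading of the rounding rule.

```python
from decimal import Decimal, ROUND_HALF_UP

# A hundredth of a cent, expressed in dollars.
HUNDREDTH_OF_A_CENT = Decimal("0.0001")

def to_usd_per_minute(amount, unit="minute", fx_to_usd="1"):
    """Convert one quoted line item to unrounded USD per minute."""
    value = Decimal(str(amount)) * Decimal(str(fx_to_usd))
    if unit == "hour":
        value /= 60  # hourly quote -> per-minute
    elif unit != "minute":
        raise ValueError(f"unsupported unit: {unit}")
    return value

def pipeline_rate(line_items):
    """Sum (amount, unit, fx_to_usd) line items and round the total.

    Assumption: the total is rounded once, after stacking.
    """
    total = sum((to_usd_per_minute(*item) for item in line_items), Decimal("0"))
    return total.quantize(HUNDREDTH_OF_A_CENT, rounding=ROUND_HALF_UP)

# Hypothetical quote: STT at $0.36/hour plus two per-minute add-ons.
example = pipeline_rate([
    ("0.36", "hour", "1"),     # base transcription
    ("0.002", "minute", "1"),  # diarization add-on
    ("0.003", "minute", "1"),  # sentiment bundle
])
print(example)
```

Decimal arithmetic is used so that a hundredth of a cent does not drift under binary floating point.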

The numbers

Twelve providers, normalized to USD per minute of the full transcription + diarization + intelligence pipeline. Orchard is highlighted because we're writing the post, but the ordering is purely by price, ascending.

Prices as of June 26, 2026 · public pricing pages, normalized to USD/min

Provider	Batch ($/min)	Real-time ($/min)	Notes
Orchard (production volume)	$0.0030	$0.0030	Bundled: STT + speaker diarization + intelligence. One number for the pipeline, no per-feature add-ons. Drops further on dedicated contracts >50K hr/mo.
Rev.ai (Reverb + speakers)	$0.0085	$0.0250	Reverb async + speaker labels. Sentiment is a separate API call.
Deepgram (Nova-3 + Intelligence)	$0.0089	$0.0123	Nova-3 batch + diarization + sentiment add-ons stacked. Pay-as-you-go.
Speechmatics (Standard + speakers)	$0.0117	$0.0117	Standard plan with speaker change & diarization enabled.
Gladia (with diarization)	$0.0136	$0.0166	Solo plan + diarization add-on. EUR-priced; USD here is approximate.
AssemblyAI (Universal + Intelligence)	$0.0162	$0.0245	Universal STT + speaker labels + sentiment + entity detection stacked.
Symbl.ai	$0.0180	$0.0220	Conversation intelligence bundle. Trackers, action items, topics included.
Azure AI Speech (with diarization)	$0.0250	$0.0250	Standard tier + Conversation Transcription with speaker separation.
AWS Transcribe Call Analytics	$0.0365	$0.0420	Transcribe + Contact Lens analytics. First 250K min/mo at standard rate.
Otter.ai Business	$0.0420	$0.0420	$20/user/mo, ~6,000 transcription min/user. Normalized to per-minute.
Google CCAI (Insights)	$0.0600	$0.0600	Conversation Insights plus standard STT. Volume tiers at 1M+ minutes.
Verbit	$0.2500	n/a	Human-in-the-loop analytics. Listed for context; not directly comparable on price.
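To check the multiples quoted in this post against the table, divide each batch price by the Orchard batch price. The sketch below copies the batch column as published above. The dictionary and function names are ours, for illustration only.

```python
# Batch prices in USD per minute, copied from the table above.
BATCH_USD_PER_MIN = {
    "Orchard": 0.0030,
    "Rev.ai": 0.0085,
    "Deepgram": 0.0089,
    "Speechmatics": 0.0117,
    "Gladia": 0.0136,
    "AssemblyAI": 0.0162,
    "Symbl.ai": 0.0180,
    "Azure AI Speech": 0.0250,
    "AWS Transcribe Call Analytics": 0.0365,
    "Otter.ai Business": 0.0420,
    "Google CCAI": 0.0600,
    "Verbit": 0.2500,
}

def multiples_of(prices, baseline="Orchard"):
    """Return each provider's price as a multiple of the baseline, to one decimal."""
    base = prices[baseline]
    return {
        name: round(price / base, 1)
        for name, price in prices.items()
        if name != baseline
    }

multiples = multiples_of(BATCH_USD_PER_MIN)
for name, multiple in sorted(multiples.items(), key=lambda kv: kv[1]):
    print(f"{name}: {multiple}x")
```

Swap in the real-time column to get the equivalent multiples for streaming workloads.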

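Why the gap is 3× to 83×

Three structural reasons compound to produce a spread of nearly two orders of magnitude (about 20× among the fully automated providers, 83× once Verbit's human-in-the-loop tier is counted) for what is, technically, the same artifact: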

Add-on stacking is the business model. The mid-tier vendors (AssemblyAI, Speechmatics, Symbl) bill every intelligence feature as a separate line item. The headline price is the floor; the real price is the floor times three. We bill one number that covers the pipeline because the intelligence layer rides on a downstream LLM call that doesn't cost us anything proportional to the audio minute.
The hyperscalers price for procurement, not for cost. AWS Contact Lens and Google CCAI Insights sit at $0.04–0.06/min because the buyer is a contact-center procurement team with a year-over-year budget, not a developer comparing API prices. Their cost to run the inference is the same as anyone else's in 2026, but procurement teams are slow to question the price.
Latency-first architectures pay for idle GPUs. Most analytics vendors run real-time as the primary product and treat batch as a small discount off real-time. Real-time requires GPUs sized for peak, sitting idle between streams. Batch should be priced from cost, not from real-time. We do.

The same audio in. The same transcript, diarization, sentiment out. The price differs by almost two orders of magnitude.
Translation: most analytics pricing is a margin choice, not a cost floor.

WER: cheap that isn't worse

Price alone is a trap. Every speech analytics buyer has been burned by a vendor whose price page was attractive and whose Word Error Rate (WER) on their actual audio was 20–30% worse than the incumbent's. When the first pass produces unusable transcripts, re-running the analytics with a real provider often costs more than the price gap that drew you there.

We benchmark on the use cases that actually matter to our customers, not on the leaderboard datasets that are easy to score on:

English calls — sales calls, support calls, podcast interviews. Cross-talk, brand names, code-switching to short Spanish phrases.
Spanish (Latin American + Iberian) — Argentine, Mexican, Colombian, Chilean, Iberian. Heavily accented content where most providers degrade noticeably.
Portuguese (BR) — adjacency to Spanish, where the model has to pick a language and stick with it inside a conversation.
Code-switched LATAM Spanish ↔ English — the single biggest WER vulnerability across the industry, and the dominant pattern in LATAM SaaS sales calls.

Our WER on these workloads sits inside the top quartile of the providers in the table above — not at parity with the most expensive incumbent on every clip, but consistently competitive with Deepgram Nova-3, AssemblyAI Universal, and Speechmatics on the language coverage we serve. We'll be publishing the full benchmark page in Q3 2026 with clip-by-clip results and the methodology open-sourced.
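If you want to score a pilot on your own clips before any benchmark page exists, WER is the standard word-level edit distance (substitutions + deletions + insertions) divided by the number of words in the reference transcript. The sketch below is a generic implementation, not our benchmark harness. It assumes whitespace tokenization and lowercasing only, so punctuation and number formatting will count as errors unless you normalize them first.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level edit distance / reference word count."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript is empty")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion (reference word dropped)
                curr[j - 1] + 1,     # insertion (extra hypothesis word)
                prev[j - 1] + cost,  # substitution or exact match
            ))
        prev = curr
    return prev[-1] / len(ref)

# One substitution ("was" -> "is") and one deletion ("today") over 5 words.
score = wer("the call was recorded today", "the call is recorded")
print(f"WER: {score:.2f}")
```

Score the same clips against every provider in the pilot with the same normalization, otherwise the numbers are not comparable.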

RTF: how fast the pipeline returns

Real-Time Factor (RTF) is the ratio of wall-clock processing time to audio duration, so lower is faster. An RTF of 0.30× means a 60-minute file finishes in 18 wall-clock minutes. An RTF of 2.00× means the same file takes 2 hours. For batch analytics pipelines that trigger downstream steps (billing, QA flagging, CRM updates), the difference between 0.30 and 2.00 is the difference between "tonight's report" and "tomorrow afternoon's report."
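Following the convention in the examples above (0.30× turns 60 minutes of audio into 18 minutes of processing), RTF is processing time divided by audio duration. A minimal sketch, with function names of our own choosing:

```python
def rtf(processing_minutes: float, audio_minutes: float) -> float:
    """Real-Time Factor: wall-clock processing time / audio duration."""
    if audio_minutes <= 0:
        raise ValueError("audio duration must be positive")
    return processing_minutes / audio_minutes

def wall_clock_minutes(audio_minutes: float, rtf_value: float) -> float:
    """Expected processing time for a single job at a given RTF."""
    return audio_minutes * rtf_value

fast = wall_clock_minutes(60, 0.30)  # the 0.30x example
slow = wall_clock_minutes(60, 2.00)  # the 2.00x example
print(f"{fast:.0f} min vs {slow:.0f} min")
```

Measure RTF per job, from submission to the final artifact, so that queue time is included rather than hidden.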

Most providers don't publish RTF. They publish a turnaround SLA ("under 4 hours") that's loose enough to cover the queue depth at peak hours, and they hope you don't ask harder. We commit to a target RTF of under 0.30× for the full analytics pipeline on paid plans — measured per job, surfaced in the dashboard. If we miss the target, the metric shows red. No fine print.

What 0.30× RTF buys at scale

Concretely:

A 60-minute call → analytics complete in under 18 minutes.
1,000 hours of backlog audio → finishes inside a single business day on the default paid concurrency.
50,000 hours/month of sustained throughput → handled on a fixed, dedicated worker pool we size against your contract.

Paid plans don't throttle on concurrency. You send 200 simultaneous batch jobs, we run 200. The architecture is sized for this from day one because batch analytics is our anchor workload, not an afterthought to a real-time product.
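Backlog turnaround depends on both RTF and how many jobs run at once. The sketch below is an idealized estimate: it assumes evenly sized jobs and fully busy worker slots. The 200-job figure is taken from the paragraph above, and the 8-hour business day is an assumption for illustration.

```python
import math

def backlog_wall_clock_hours(audio_hours: float, rtf_value: float,
                             concurrent_jobs: int) -> float:
    """Idealized time to clear a backlog with all job slots kept busy."""
    if concurrent_jobs < 1:
        raise ValueError("need at least one concurrent job")
    return audio_hours * rtf_value / concurrent_jobs

def min_concurrency(audio_hours: float, rtf_value: float,
                    deadline_hours: float) -> int:
    """Smallest number of parallel jobs that meets the deadline."""
    return math.ceil(audio_hours * rtf_value / deadline_hours)

# 1,000 hours of backlog at 0.30x RTF across 200 simultaneous jobs.
hours_needed = backlog_wall_clock_hours(1000, 0.30, 200)
# Parallel jobs needed to clear the same backlog in an assumed 8-hour day.
jobs_needed = min_concurrency(1000, 0.30, 8)
print(f"{hours_needed:.1f} h at 200 jobs; {jobs_needed} jobs for an 8 h day")
```

Real queues add upload time and uneven file lengths, so treat the result as a lower bound.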

What "cheapest" usually costs you

Three traps that the bottom of every vendor table tends to share — the same three our first STT post called out, restated for the analytics pipeline:

1. Throttling at the worst possible moment

Most cheap analytics APIs throttle aggressively on requests-per-minute and concurrent jobs. The price page says "unlimited"; the rate-limit page says six concurrent streams. For a contact-center backlog, that's a cap on throughput, not on cost — and it usually surfaces at 3 AM when the report is supposed to be on the GM's desk by 8.

2. Feature-stacking on the invoice

Speaker labels: $0.005/min. Sentiment: $0.005/min. Entity detection: $0.003/min. Topic modelling: $0.005/min. Each one is an honest line item; the four together quietly triple the headline price. We bill one number for the pipeline because we'd rather argue with you about whether the price is too high than about whether the invoice matches the quote.
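The stacking arithmetic is easy to check. The four add-on rates below are the ones listed above. The $0.009/min headline price is a hypothetical value chosen for illustration, not any vendor's actual quote.

```python
from decimal import Decimal

# Per-minute add-on rates from the paragraph above, in USD.
ADDONS = {
    "speaker_labels": Decimal("0.005"),
    "sentiment": Decimal("0.005"),
    "entity_detection": Decimal("0.003"),
    "topic_modelling": Decimal("0.005"),
}

def invoiced_rate(headline: Decimal, addons: dict) -> Decimal:
    """Headline per-minute price plus every stacked add-on."""
    return headline + sum(addons.values(), Decimal("0"))

def invoice_multiple(headline: Decimal, addons: dict) -> Decimal:
    """How many times the headline price the invoice actually charges."""
    return invoiced_rate(headline, addons) / headline

HYPOTHETICAL_HEADLINE = Decimal("0.009")  # assumed for illustration
addon_total = sum(ADDONS.values(), Decimal("0"))
multiple = invoice_multiple(HYPOTHETICAL_HEADLINE, ADDONS)
print(f"add-ons: ${addon_total}/min, invoice = {multiple}x headline")
```

The lower the headline price, the larger the multiple, because the add-on total stays fixed.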

3. Batch turnaround in "hours, maybe"

Some batch tiers are cheap because the provider doesn't commit to a turnaround SLA. The job lands when it lands. For a production pipeline triggering downstream steps, that unpredictability is more expensive than the cents you saved on the per-minute line.

For enterprise volumes

The numbers above are the published self-serve price. For workloads above 50,000 hours/month, the price drops further on a dedicated pool — we quote that case-by-case because the right architecture depends on whether the audio is bursty (peak hours of US business day) or batch-fed (overnight backlog ingestion).

Two reference points we're comfortable putting in writing:

A dedicated analytics pipeline at 60,000 hours/month lands well under $0.003/min all-in. That's already a fraction of what AWS Contact Lens quotes at one-tenth the volume.
At 500,000 hours/month, the unit economics shift again. The architecture moves to a reserved worker pool and per-minute pricing settles in the low-tenths-of-a-cent range. Quoted on contract.

Why price alone isn't the pitch

We'll repeat what we wrote in the STT post because it matters more on analytics, not less: price-leader providers churn fast. The moment a competitor cuts price by 20%, half the customers move. The only way to beat that gravity is to make sure the price advantage sits on top of a real product advantage. For speech analytics, that's the combination of:

Price — the topic of this post.
Quality (WER) — top-quartile on the languages we serve.
Throughput (RTF + concurrency) — paid plans don't throttle.
One bill, not a stack — transcription, diarization, and intelligence inside one plan.
Coverage — 90+ languages with consistent quality on the most common LATAM and European accents.

Any single one is a commodity. The combination is what we sell.

Price is the headline. The pipeline is the product.

How to use this table

If you're building or evaluating a speech analytics vendor:

Estimate your monthly minutes honestly. Most teams undercount peak; round up by 30%.
Multiply by the bundled per-minute rate in the column for your usage pattern (batch vs real-time). Click through to verify the published number.
Add the intelligence add-ons your team actually uses if the vendor unbundles them. Skip this step if the vendor bills one number for the pipeline (we do).
Add 15% for throttling you don't see — retries against rate limits, downgrades to slower tiers.
Run a 1-hour pilot on three providers on your audio. Compare WER, diarization accuracy, RTF, and what support looks like at 1 AM on a Saturday.
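The first four steps can be sketched as a single estimate. The function and parameter names are ours. The 30% peak buffer and 15% throttling allowance come from the steps above, and the 100,000-minute workload is an assumed example.

```python
def estimated_monthly_cost(measured_minutes: float,
                           rate_per_min: float,
                           addon_rates=(),
                           peak_buffer: float = 0.30,
                           throttling_overhead: float = 0.15) -> float:
    """Monthly cost in USD following steps 1 to 4 above."""
    minutes = measured_minutes * (1 + peak_buffer)         # step 1: round up for peak
    per_minute = rate_per_min + sum(addon_rates)           # steps 2 and 3: rate plus add-ons
    return minutes * per_minute * (1 + throttling_overhead)  # step 4: hidden throttling

# Assumed workload: 100,000 measured minutes per month, batch rates from the table.
bundled = estimated_monthly_cost(100_000, 0.0030)   # one number, no add-ons
stacked = estimated_monthly_cost(100_000, 0.0089)   # table rate already includes add-ons
print(f"${bundled:,.2f} vs ${stacked:,.2f} per month")
```

Pass `addon_rates` only for features a vendor bills on top of the rate you took from the table, and set `throttling_overhead=0` for a provider whose rate limits you have verified in the pilot.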

The next quarterly update lands in September 2026. The RSS feed at /blog/rss.xml pushes it to your reader the day it goes live.