Comparison · STT API

Orchard vs AssemblyAI.

AssemblyAI is the closest scope match in the STT vendor landscape: async-first, native diarization, webhooks, and an audio intelligence layer (LeMUR for summaries, entities, sentiment). Orchard covers the same core STT scope (async, diarization, webhooks) and adds TTS and voice cloning on the same balance, at $0.00042 per minute, the lowest per-minute STT rate among production-grade vendors publishing public pricing. Below are the numbers and an honest look at the tradeoffs.

Last updated 2026-06-22 · AssemblyAI prices from assemblyai.com/pricing

Cheapest minute on the market
$0.00042 / min

Lower per-minute STT rate than any other production-grade vendor publishing public pricing in 2026.

| Field | Orchard | AssemblyAI |
|---|---|---|
| STT price per minute (PAYG) | $0.00042 ⭐ | $0.0020 (Universal · async) |
| Real-time / streaming rate | async batch only today | $0.0025 (Streaming) |
| Plan with 30,000 min / mo | $25 (Pro, all products) | $60 metered (Universal only) |
| Speaker diarization | ✓ (pyannote 3.1, included) | ✓ (included) |
| Async + webhooks | ✓ | ✓ |
| LLM-powered summaries (LeMUR-equivalent) | ✗ (use your own LLM on the transcript) | ✓ (per-token billed) |
| Audio intelligence add-ons | ✗ (planned: sentiment, chapters) | ✓ ($0.06/h each) |
| TTS on same balance | ✓ (17 languages) | ✗ |
| Voice cloning on same balance | ✓ (F5 + XTTS) | ✗ |
| Spanish (LATAM) tier-1 quality | ✓ (rioplatense fine-tune) | generic multilingual |
| OpenAI Whisper API compatibility | ✓ (drop-in) | proprietary |
| Free credit / tier | 500 min / mo (no card) | $50 one-time credit |
| Data used for training | never | never |

~5× cheaper per minute · TTS + clone bundled · same async + diar + webhook scope

The cheapest minute on the market

Pricing

AssemblyAI's Universal model sits at $0.0020 per minute for async transcription, already one of the more aggressive per-minute rates among full-feature STT vendors. Orchard ships at $0.00042 per minute, roughly 5× cheaper than AssemblyAI Universal (4.8× to be exact) and, as of 2026, the lowest per-minute STT rate of any production-grade API publishing public pricing.

  • AssemblyAI Universal: 30,000 min × $0.0020 = $60 / month
  • Orchard PAYG: 30,000 min × $0.00042 = $12.60 / month
  • Orchard Pro plan: flat $25 / month (30k min + diar + TTS + clone)

Where the gap widens further: audio intelligence add-ons. AssemblyAI charges $0.06/hour each for auto chapters, sentiment analysis, and entity detection. A 30k-min/month workload is 500 hours of audio, so stacking all three adds 3 × $0.06 × 500 = $90 to the $60 transcription bill: $150/month on AssemblyAI vs $25 flat on Orchard Pro. The Pro plan also covers your TTS and voice cloning workload, but it does not include packaged equivalents of those add-ons, so budget for your own LLM calls if you need them.
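The arithmetic behind these figures can be reproduced with a short script. This is a sketch that uses only the list prices quoted on this page; the function and constant names are illustrative and not part of either vendor's API.

```typescript
// List prices quoted on this page (USD). Names are illustrative, not a vendor API.
const ORCHARD_PER_MIN = 0.00042;
const ASSEMBLYAI_PER_MIN = 0.002;
const ASSEMBLYAI_ADDON_PER_HOUR = 0.06;

// Metered monthly cost: per-minute STT plus N add-ons billed per hour of audio.
function meteredCost(
  minutes: number,
  perMinute: number,
  addOns: number = 0,
  addOnPerHour: number = 0,
): number {
  const stt = minutes * perMinute;
  const extras = (minutes / 60) * addOnPerHour * addOns;
  return Math.round((stt + extras) * 100) / 100; // round to cents
}

const monthlyMinutes = 30_000;
console.log(meteredCost(monthlyMinutes, ASSEMBLYAI_PER_MIN)); // 60
console.log(meteredCost(monthlyMinutes, ORCHARD_PER_MIN)); // 12.6
console.log(meteredCost(monthlyMinutes, ASSEMBLYAI_PER_MIN, 3, ASSEMBLYAI_ADDON_PER_HOUR)); // 150
console.log((ASSEMBLYAI_PER_MIN / ORCHARD_PER_MIN).toFixed(2)); // "4.76"
```

Swap in your own monthly volume to see where the flat Pro plan and PAYG cross over for your workload.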

Honest gap

LeMUR and audio intelligence

AssemblyAI's strongest product differentiation is LeMUR — their LLM layer that runs on top of the transcript for summaries, Q&A, entity extraction, action items. It's well-engineered and deeply integrated. Orchard doesn't ship a direct equivalent; we keep the surface minimal (transcript in, transcript out) and expect callers to pipe the result into their own LLM (Claude, GPT-4, Llama, whatever) for downstream intelligence.

The tradeoff:

  • LeMUR convenience: one API call, structured output, no separate LLM key to manage. Costs per-token on top of the transcription bill.
  • DIY approach: you control the LLM, prompt, output schema, fallback strategy. Cheaper at scale (most teams already pay an OpenAI/Anthropic bill); locks in nothing. Adds ~10 lines of code.

Audio intelligence add-ons (auto chapters, sentiment, entity detection, PII redaction) are on our roadmap; today we don't ship them as packaged features. If you need them as a one-API experience right now, AssemblyAI is the right pick.
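For reference, the DIY pattern might look like the sketch below. It is deliberately provider-agnostic: `CompleteFn` is a hypothetical adapter you would implement around whichever LLM client you already use, and the `CallSummary` schema is an assumption chosen for illustration, not an Orchard or AssemblyAI format.

```typescript
// Hypothetical output schema; adjust to your product.
interface CallSummary {
  summary: string;
  actionItems: string[];
}

// Hypothetical adapter around your own LLM client (Claude, GPT-4, Llama, ...).
type CompleteFn = (prompt: string) => Promise<string>;

// Build a prompt that asks for strict JSON so the reply can be parsed.
function buildSummaryPrompt(transcript: string): string {
  return [
    "Summarize the call transcript below.",
    'Reply with JSON only: {"summary": string, "actionItems": string[]}.',
    "",
    transcript.trim(),
  ].join("\n");
}

// Validate the model's reply instead of trusting it blindly.
function parseSummary(raw: string): CallSummary {
  const data = JSON.parse(raw);
  if (typeof data.summary !== "string" || !Array.isArray(data.actionItems)) {
    throw new Error("LLM reply did not match the expected schema");
  }
  return { summary: data.summary, actionItems: data.actionItems.map(String) };
}

// Transcript in, structured summary out.
async function summarizeTranscript(
  transcript: string,
  complete: CompleteFn,
): Promise<CallSummary> {
  return parseSummary(await complete(buildSummaryPrompt(transcript)));
}
```

Keeping the prompt builder and the parser as separate functions means the schema check can be unit-tested without calling any LLM, and the provider can be swapped by replacing only the adapter.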

Universal vs Whisper-class

Accuracy

AssemblyAI Universal is a proprietary model trained from scratch; it benchmarks well on English (typically top-3 on public WER leaderboards). Orchard runs a tuned Whisper-large derivative (whisper.cpp + Core ML) — same model family as OpenAI's original Whisper API, with our optimizations for throughput and Spanish.

  • English (clean): Universal wins by ~1-2% absolute WER on US-accent corpora.
  • English (accented): Roughly parity. Whisper's broader pre-training corpus helps with accents.
  • Spanish (neutral): Parity within ±1.5% WER.
  • Spanish (rioplatense): Orchard wins by ~4% absolute — we ship a fine-tune for porteño speech.
  • Multi-speaker diarization: Both ship native diarization. Quality is comparable on standard podcasts.
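
To check these figures against your own audio, word error rate can be computed as a word-level edit distance between a reference transcript and the API output. The sketch below is a minimal implementation; the normalization (lowercasing and stripping a fixed set of punctuation marks) is an assumption, and public leaderboards often normalize differently, so absolute numbers may not match theirs.

```typescript
// Normalize text into words: lowercase and replace common punctuation with spaces.
function toWords(text: string): string[] {
  return text
    .toLowerCase()
    .replace(/[.,;:!?¿¡"()]/g, " ")
    .split(/\s+/)
    .filter(Boolean);
}

// WER = (substitutions + deletions + insertions) / reference word count,
// computed as Levenshtein distance over words with a rolling row.
function wordErrorRate(reference: string, hypothesis: string): number {
  const ref = toWords(reference);
  const hyp = toWords(hypothesis);
  // Convention for an empty reference: 0 if both are empty, otherwise 1.
  if (ref.length === 0) return hyp.length === 0 ? 0 : 1;

  let prev: number[] = Array.from({ length: hyp.length + 1 }, (_, j) => j);
  for (let i = 1; i <= ref.length; i++) {
    const curr: number[] = [i];
    for (let j = 1; j <= hyp.length; j++) {
      const cost = ref[i - 1] === hyp[j - 1] ? 0 : 1;
      curr[j] = Math.min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost);
    }
    prev = curr;
  }
  return prev[hyp.length] / ref.length;
}

console.log(wordErrorRate("che, ¿venís mañana?", "che venís mañana")); // 0
console.log(wordErrorRate("the quick brown fox", "the quick brown box")); // 0.25
```

Run both vendors' transcripts through the same function and the same reference so that normalization differences cancel out.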

Async vs streaming

Latency & throughput

Both vendors are async-first. AssemblyAI also ships a real-time WebSocket today; Orchard's streaming equivalent is still on the roadmap, and in the meantime its sync HTTP endpoint covers many voice-agent use cases at ~150 ms p50.

  • Real-time partial transcripts: AssemblyAI wins via WebSocket streaming.
  • Sync HTTP (short utterance): Roughly parity at ~150 ms.
  • Batch (long-form): Orchard shards across the cluster — 60 min audio in ~90 s wall (40× real-time). AssemblyAI batch is comparable.
  • Webhooks on completion: Both support callback URLs.
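
The "40× real-time" batch figure is simply audio duration divided by wall-clock time. A small helper, with illustrative names, computes the same ratio from your own benchmark runs and projects wall time for other file lengths, assuming throughput scales linearly.

```typescript
// Speed-up over real time: seconds of audio processed per second of wall-clock time.
function speedFactor(audioSeconds: number, wallSeconds: number): number {
  if (wallSeconds <= 0) throw new RangeError("wallSeconds must be positive");
  return audioSeconds / wallSeconds;
}

// Projected wall time for a file at a measured speed factor (linear scaling assumed).
function estimateWallSeconds(audioSeconds: number, factor: number): number {
  if (factor <= 0) throw new RangeError("factor must be positive");
  return audioSeconds / factor;
}

console.log(speedFactor(60 * 60, 90)); // 40
console.log(estimateWallSeconds(30 * 60, 40)); // 45
```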

One balance, three products

What's included beyond STT

AssemblyAI is STT-first and doesn't ship a TTS or voice cloning product. Orchard includes the following on the same balance as STT:

Text-to-Speech

12 voices, 17 languages, Piper engine. Sub-2 s synth latency on CPU. Same per-minute rate as STT, drawn from the same balance — no separate bill.

Voice cloning

F5-AR for Spanish (rioplatense fine-tune), XTTS for the other 16 languages. 6-60 s reference, unlimited synth thereafter.

Speaker diarization

pyannote.audio 3.1 on GPU. 30-min audio diarized in 4 s. Billed at 1.5× the per-minute STT rate, with an included quota on every plan.

OpenAI SDK swap

Migration

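AssemblyAI's SDK is well-documented but proprietary. Orchard mirrors the OpenAI Whisper API shape, so code already written against the OpenAI SDK migrates by changing the base URL and API key (an env var change). Coming from AssemblyAI's SDK, the call itself has to be rewritten, which stays a small change if you wrap STT behind your own service layer. Side by side: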

import fs from "node:fs";
import { AssemblyAI } from "assemblyai";
import OpenAI from "openai";

// AssemblyAI: transcribe a remote file with speaker labels
const aai = new AssemblyAI({ apiKey: process.env.AAI_KEY });
const transcript = await aai.transcripts.transcribe({
  audio: "https://example.com/podcast.mp3",
  speaker_labels: true,
});

// Orchard (OpenAI SDK, drop-in): only baseURL and apiKey differ from a stock OpenAI setup
const orchard = new OpenAI({
  baseURL: "https://api.orchardrun.com/v1",
  apiKey:  process.env.ORCHARD_API_KEY,
});
const transcription = await orchard.audio.transcriptions.create({
  file:  fs.createReadStream("podcast.mp3"),
  model: "whisper-1",
});

// For diar + async: POST /v1/transcriptions/upload
//   form fields: file, diarize=true, webhook_url=https://...

The async + diar + webhook combo is documented at /docs#async and /docs#diarization.
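
As a sketch of the async path, the code below builds the multipart form for POST /v1/transcriptions/upload with the three fields named in the snippet above (file, diarize, webhook_url). The base URL is the one shown above; the Bearer authorization header and the shape of the JSON response are assumptions, so check /docs#async before relying on them.

```typescript
const ORCHARD_BASE_URL = "https://api.orchardrun.com/v1";

// Build the multipart body with the documented fields: file, diarize, webhook_url.
function buildUploadForm(audio: Blob, filename: string, webhookUrl: string): FormData {
  const form = new FormData();
  form.append("file", audio, filename);
  form.append("diarize", "true");
  form.append("webhook_url", webhookUrl);
  return form;
}

// Submit the job. Assumption: the endpoint accepts the same Bearer token
// that the OpenAI SDK sends; the response shape is not specified on this page.
async function submitAsyncJob(form: FormData, apiKey: string): Promise<unknown> {
  const res = await fetch(`${ORCHARD_BASE_URL}/transcriptions/upload`, {
    method: "POST",
    headers: { Authorization: `Bearer ${apiKey}` },
    body: form,
  });
  if (!res.ok) throw new Error(`Upload failed with HTTP ${res.status}`);
  return res.json();
}
```

The Content-Type header is intentionally left unset so that fetch can add the multipart boundary itself.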

Honest tradeoff

When AssemblyAI is the right call

Three scenarios where AssemblyAI is the better pick today:

  • LeMUR is core to your product. If you ship a feature that depends on the LLM-on-transcript workflow as a single API call (meeting summarizer, voice assistant memory, call analytics with action items), the cognitive cost of swapping to "transcript + your own LLM call" can outweigh the price delta.
  • You need audio intelligence add-ons as packaged features. Auto chapters, sentiment analysis, entity detection, PII redaction — these are first-class on AssemblyAI today. They're on our roadmap but not shipped.
  • Real-time streaming voice agents. AssemblyAI's WebSocket streaming handles partial transcripts at sub-300 ms latency. Until our streaming endpoint ships, that's a real gap.

For everything else — high-volume batch transcription, multilingual workloads, anything Spanish-heavy, anything that benefits from TTS or voice cloning on the same bill, anyone minimizing infrastructure cost per minute — Orchard is the economically obvious choice at $0.00042/min, the cheapest production-grade STT rate on the market.

Common pre-migration questions

FAQ

How can you offer $0.00042 / min when AssemblyAI is at $0.0020?
We run our own cluster of Apple Silicon nodes optimized for whisper.cpp batch throughput, not rented hyperscale GPU time. The infrastructure cost per minute is a fraction of what cloud GPU vendors charge. Bundling helps too — one customer using STT + TTS + clone is three product line items on one bill, so margins stay healthy even at this rate.
Is the accuracy comparable to Universal?
On clean English, Universal wins by 1-2% absolute WER. On accented English, Spanish, Portuguese and other languages, Whisper-class typically matches or beats Universal because of its broader pre-training corpus. The free tier (500 min/mo) lets you benchmark both on your real audio before committing.
What's the LeMUR equivalent on Orchard?
We deliberately don't ship one. The recommended pattern is to take the transcript Orchard returns and pipe it into your own LLM (Claude, GPT-4, Llama). Adds ~10 lines of code, costs less per token at scale, and gives you full control over the prompt and output schema. If "summarize this call" is a feature you ship a lot of, this approach is cheaper than LeMUR + STT combined.
Do you support real-time streaming today?
Not yet. The sync HTTP endpoint handles short utterances (under 60 s) with ~150 ms p50 latency — good enough for many voice-agent use cases. True WebSocket streaming with partial transcripts is on the roadmap. If streaming is your hard requirement today, AssemblyAI is the right pick.
Is there a free tier I can test against?
Yes: 500 minutes / month, no credit card, includes STT + TTS + Clone. Run your own benchmark on the same audio you'd send to AssemblyAI before committing.

The cheapest minute on the market. $0.00042.

5× cheaper than AssemblyAI Universal. OpenAI-compatible SDK. Diar, TTS and voice cloning on the same balance. Free 500 min a month to benchmark on your real audio before paying.