Orchard vs AssemblyAI.
AssemblyAI is the cleanest scope-match in the STT vendor landscape: async-first, native diarization, webhooks, and an audio intelligence layer (LeMUR for summaries, entities, sentiment). Orchard covers the same core STT scope (async, diarization, webhooks), minus the audio intelligence layer, and adds TTS and voice cloning on the same balance, at $0.00042 per minute, the cheapest production-grade STT rate on the market. Below, the numbers and an honest tradeoff.
Last updated 2026-06-22 · AssemblyAI prices from assemblyai.com/pricing
Lower per-minute STT rate than any other production-grade vendor publishing public pricing in 2026.
| Field | Orchard | AssemblyAI |
|---|---|---|
| STT price per minute (PAYG) | $0.00042 ⭐ | $0.0020 (Universal · async) |
| Real-time / streaming rate | async batch only today | $0.0025 (Streaming) |
| Plan with 30,000 min / mo | $25 (Pro, all products) | $60 metered (Universal only) |
| Speaker diarization | ✓ (pyannote 3.1, included) | ✓ (included) |
| Async + webhooks | ✓ | ✓ |
| LLM-powered summaries (LeMUR-equivalent) | ✗ (use your own LLM on the transcript) | ✓ (per-token billed) |
| Audio intelligence add-ons | ✗ (planned: sentiment, chapters) | ✓ ($0.06/h each) |
| TTS on same balance | ✓ (17 languages) | ✗ |
| Voice cloning on same balance | ✓ (F5 + XTTS) | ✗ |
| Spanish (LATAM) tier-1 quality | ✓ (rioplatense fine-tune) | generic multilingual |
| OpenAI Whisper API compatibility | ✓ (drop-in) | proprietary |
| Free credit / tier | 500 min / mo (no card) | $50 one-time credit |
| Data used for training | never | never |
~5× cheaper per minute · TTS + clone bundled · same async + diar + webhook scope
Pricing
AssemblyAI's Universal model sits at $0.0020 per minute for async transcription, already one of the more aggressive per-minute rates among full-feature STT vendors. Orchard ships at $0.00042 per minute, roughly 5× cheaper than AssemblyAI Universal (0.0020 / 0.00042 ≈ 4.8) and, as of 2026, the lowest per-minute STT rate of any production-grade API publishing public pricing.
Where the gap widens further: audio intelligence add-ons. AssemblyAI charges $0.06/hour each for auto chapters, sentiment analysis, and entity detection. Stack all three onto a 30k-min/month (500-hour) workload and the AssemblyAI bill comes to $150/month ($60 for transcription plus $90 for add-ons) vs $25 flat on Orchard Pro, and the Pro plan also covers your TTS + voice cloning workload.
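For readers who want to check the arithmetic, here is a minimal sketch that reproduces the $150 figure and the per-minute ratio from the prices quoted above. The constant and function names are illustrative only; swap in your own volume and add-on count.

```javascript
// Prices copied from the comparison above; names are illustrative.
const MINUTES_PER_MONTH = 30000;
const AAI_UNIVERSAL_PER_MIN = 0.002;   // $ per minute, async
const AAI_ADDON_PER_HOUR = 0.06;       // $ per audio hour, per add-on
const ORCHARD_PRO_FLAT = 25;           // $ per month
const ORCHARD_PAYG_PER_MIN = 0.00042;  // $ per minute

// Monthly AssemblyAI bill: metered transcription plus hourly add-ons.
function assemblyAiMonthly(minutes, addOnCount) {
  const transcription = minutes * AAI_UNIVERSAL_PER_MIN;
  const addOns = (minutes / 60) * AAI_ADDON_PER_HOUR * addOnCount;
  return transcription + addOns;
}

const aaiTotal = assemblyAiMonthly(MINUTES_PER_MONTH, 3);
const perMinuteRatio = AAI_UNIVERSAL_PER_MIN / ORCHARD_PAYG_PER_MIN;

console.log(aaiTotal.toFixed(2));         // 150.00
console.log(ORCHARD_PRO_FLAT.toFixed(2)); // 25.00
console.log(perMinuteRatio.toFixed(2));   // 4.76
```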
LeMUR and audio intelligence
AssemblyAI's strongest product differentiation is LeMUR — their LLM layer that runs on top of the transcript for summaries, Q&A, entity extraction, action items. It's well-engineered and deeply integrated. Orchard doesn't ship a direct equivalent; we keep the surface minimal (transcript in, transcript out) and expect callers to pipe the result into their own LLM (Claude, GPT-4, Llama, whatever) for downstream intelligence.
The tradeoff:
- LeMUR convenience: one API call, structured output, no separate LLM key to manage. Costs per-token on top of the transcription bill.
- DIY approach: you control the LLM, prompt, output schema, fallback strategy. Cheaper at scale (most teams already pay an OpenAI/Anthropic bill); locks in nothing. Adds ~10 lines of code.
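As a rough illustration of the DIY approach, the sketch below turns a transcript into a structured summary using a caller-supplied LLM function. Everything here is an assumption for illustration: the prompt wording, the `summary` / `action_items` keys, and the `complete` callback are not part of any Orchard API.

```javascript
// Build the chat messages for a summary request. The prompt wording and the
// JSON keys (`summary`, `action_items`) are illustrative choices.
function buildSummaryMessages(transcriptText) {
  return [
    {
      role: "system",
      content:
        "Summarize the transcript in three sentences and list action items. " +
        'Reply as JSON: {"summary": string, "action_items": string[]}.',
    },
    { role: "user", content: transcriptText },
  ];
}

// Parse the model reply, falling back to the raw text if it is not valid JSON.
function parseSummaryReply(reply) {
  try {
    const parsed = JSON.parse(reply);
    return {
      summary: String(parsed.summary ?? ""),
      action_items: Array.isArray(parsed.action_items) ? parsed.action_items : [],
    };
  } catch {
    return { summary: reply, action_items: [] };
  }
}

// `complete` is a hypothetical caller-supplied async function that sends the
// messages to your LLM of choice and resolves to the reply text.
async function summarizeTranscript(transcriptText, complete) {
  const reply = await complete(buildSummaryMessages(transcriptText));
  return parseSummaryReply(reply);
}
```

With the OpenAI Node SDK, for example, `complete` could wrap `client.chat.completions.create({ model, messages })` and return `choices[0].message.content`. Keeping the LLM call behind a callback is what lets you swap providers or add a fallback without touching the transcription side.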
Audio intelligence add-ons (auto chapters, sentiment, entity detection, PII redaction) are on our roadmap; today we don't ship them as packaged features. If you need them as a one-API experience right now, AssemblyAI is the right pick.
Accuracy
AssemblyAI Universal is a proprietary model trained from scratch; it benchmarks well on English (typically top-3 on public WER leaderboards). Orchard runs a tuned Whisper-large derivative (whisper.cpp + Core ML) — same model family as OpenAI's original Whisper API, with our optimizations for throughput and Spanish.
- English (clean): Universal wins by ~1-2% absolute WER on US-accent corpora.
- English (accented): Roughly parity. Whisper's broader pre-training corpus helps with accents.
- Spanish (neutral): Parity within ±1.5% WER.
- Spanish (rioplatense): Orchard wins by ~4% absolute — we ship a fine-tune for porteño speech.
- Multispeaker diar: Both ship native diarization. Quality is comparable on standard podcasts.
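The WER figures above are absolute differences in percentage points (for example, 6% vs 4% WER is a 2% absolute gap). If you want to reproduce the comparison on your own audio, a minimal word error rate sketch follows. It assumes whitespace tokenization and lowercasing only; published benchmarks usually apply heavier text normalization (punctuation, numerals), so expect your numbers to differ somewhat.

```javascript
// Word error rate: (substitutions + deletions + insertions) / reference words,
// computed with a word-level Levenshtein distance.
function wordErrorRate(reference, hypothesis) {
  const ref = reference.toLowerCase().split(/\s+/).filter(Boolean);
  const hyp = hypothesis.toLowerCase().split(/\s+/).filter(Boolean);
  if (ref.length === 0) throw new Error("reference must not be empty");

  // prev[j] = edit distance between the first i-1 ref words and the first j hyp words
  let prev = Array.from({ length: hyp.length + 1 }, (_, j) => j);
  for (let i = 1; i <= ref.length; i++) {
    const curr = [i];
    for (let j = 1; j <= hyp.length; j++) {
      const substitution = prev[j - 1] + (ref[i - 1] === hyp[j - 1] ? 0 : 1);
      curr[j] = Math.min(substitution, prev[j] + 1, curr[j - 1] + 1);
    }
    prev = curr;
  }
  return prev[hyp.length] / ref.length;
}

// Identical strings give 0; one substitution plus one deletion over 5 words gives 0.4.
const werExact = wordErrorRate("the quick brown fox jumps", "the quick brown fox jumps");
const werTwoErrors = wordErrorRate("the quick brown fox jumps", "the quick crown fox");
```

Run the same reference transcripts through both vendors and compare the two WER values; the free tiers listed in the table are enough for a small sample.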
Latency & throughput
Both vendors are async-first. AssemblyAI also ships a real-time WebSocket today; Orchard's streaming equivalent is still on the roadmap, with the sync HTTP endpoint covering most voice-agent use cases at ~150 ms p50.
- Real-time partial transcripts: AssemblyAI wins via WebSocket streaming.
- Sync HTTP (short utterance): Roughly parity at ~150 ms.
- Batch (long-form): Orchard shards across the cluster — 60 min audio in ~90 s wall (40× real-time). AssemblyAI batch is comparable.
- Webhooks on completion: both support callback URLs.
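The "40× real-time" batch figure is simply audio duration divided by wall-clock time. A small sketch of that calculation, plus a naive wall-time estimate for longer files, is below. The linear-scaling assumption is ours: it ignores upload time, queueing, and other fixed overheads.

```javascript
// Real-time factor: seconds of audio processed per second of wall-clock time.
function realTimeFactor(audioSeconds, wallSeconds) {
  if (wallSeconds <= 0) throw new Error("wallSeconds must be positive");
  return audioSeconds / wallSeconds;
}

// Naive estimate that assumes throughput scales linearly with duration
// (an assumption; fixed overheads such as upload and queueing are ignored).
function estimateWallSeconds(audioSeconds, factor) {
  return audioSeconds / factor;
}

const batchFactor = realTimeFactor(60 * 60, 90);                        // 40
const threeHourEstimate = estimateWallSeconds(3 * 60 * 60, batchFactor); // 270
```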
What's included beyond STT
AssemblyAI is STT-first and doesn't ship a TTS or voice cloning product. Orchard ships three additional capabilities on the same balance:
Text-to-Speech
12 voices, 17 languages, Piper engine. Sub-2 s synth latency on CPU. Same per-minute rate as STT, drawn from the same balance — no separate bill.
Voice cloning
F5-AR for Spanish (rioplatense fine-tune), XTTS for the other 16 languages. 6-60 s reference, unlimited synth thereafter.
Speaker diarization
pyannote.audio 3.1 on GPU. 30-min audio diarized in 4 s. 1.5× the per-minute cost, included quota on every plan.
Migration
AssemblyAI's SDK is well-documented but proprietary. Orchard mirrors the OpenAI Whisper API shape, so if you wrap STT behind your own service layer, the migration stays inside that layer: swap the client and point it at Orchard's base URL and API key via env vars. Side by side:
```javascript
// AssemblyAI
import { AssemblyAI } from "assemblyai";

const client = new AssemblyAI({ apiKey: process.env.AAI_KEY });
const transcript = await client.transcripts.transcribe({
  audio: "https://example.com/podcast.mp3",
  speaker_labels: true, // enables diarization
});
```

```javascript
// Orchard (OpenAI SDK, drop-in)
import fs from "node:fs";
import OpenAI from "openai";

const client = new OpenAI({
  baseURL: "https://api.orchardrun.com/v1",
  apiKey: process.env.ORCHARD_API_KEY,
});
const transcription = await client.audio.transcriptions.create({
  file: fs.createReadStream("podcast.mp3"),
  model: "whisper-1",
});
// For diar + async: POST /v1/transcriptions/upload
// form fields: file, diarize=true, webhook_url=https://...
```

The async + diar + webhook combo is documented at /docs#async and /docs#diarization.
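For the async + diarization + webhook path, a minimal sketch using Node's built-in `fetch` and `FormData` (Node 18+) might look like this. The endpoint path and form field names come from the snippet above; the Bearer auth header and the JSON response are assumptions on our part (check /docs#async for the actual contract), and the helper names are hypothetical.

```javascript
// Build the multipart form for POST /v1/transcriptions/upload.
// Field names (file, diarize, webhook_url) are taken from the snippet above.
function buildUploadForm(audioBlob, filename, webhookUrl) {
  const form = new FormData();
  form.append("file", audioBlob, filename);
  form.append("diarize", "true");
  form.append("webhook_url", webhookUrl);
  return form;
}

// Assumptions: Bearer auth (as the OpenAI SDK sends) and a JSON response body,
// which is returned as-is because its shape is not specified here.
async function submitAsyncTranscription(audioBlob, filename, webhookUrl) {
  const response = await fetch("https://api.orchardrun.com/v1/transcriptions/upload", {
    method: "POST",
    headers: { Authorization: `Bearer ${process.env.ORCHARD_API_KEY}` },
    body: buildUploadForm(audioBlob, filename, webhookUrl),
  });
  if (!response.ok) throw new Error(`upload failed: HTTP ${response.status}`);
  return response.json();
}
```

To call it from a file on disk, read the file into a `Blob` first, for example `new Blob([await fs.promises.readFile("podcast.mp3")])`. The result arrives later at your webhook URL rather than in this response.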
When AssemblyAI is the right call
Three scenarios where AssemblyAI is the better pick today:
- LeMUR is core to your product. If you ship a feature that depends on the LLM-on-transcript workflow as a single API call (meeting summarizer, voice assistant memory, call analytics with action items), the cognitive cost of swapping to "transcript + your own LLM call" can outweigh the price delta.
- You need audio intelligence add-ons as packaged features. Auto chapters, sentiment analysis, entity detection, PII redaction — these are first-class on AssemblyAI today. They're on our roadmap but not shipped.
- Real-time streaming voice agents. AssemblyAI's WebSocket streaming handles partial transcripts at sub-300 ms latency. Until our streaming endpoint ships, that's a real gap.
For everything else — high-volume batch transcription, multilingual workloads, anything Spanish-heavy, anything that benefits from TTS or voice cloning on the same bill, anyone minimizing infrastructure cost per minute — Orchard is the economically obvious choice at $0.00042/min, the cheapest production-grade STT rate on the market.
FAQ
How can you offer $0.00042 / min when AssemblyAI is at $0.0020?
Is the accuracy comparable to Universal?
What's the LeMUR equivalent on Orchard?
Do you support real-time streaming today?
Is there a free tier I can test against?
The cheapest minute on the market.
$0.00042.
Roughly 5× cheaper than AssemblyAI Universal. OpenAI-compatible SDK. Diar, TTS and voice cloning on the same balance. Free 500 min a month to benchmark on your real audio before paying.