Comparison · STT API

Orchard vs OpenAI Whisper API.

Same model family under the hood, very different surface. Orchard is API-compatible with the OpenAI Whisper endpoint — swap the base URL and key, the SDK keeps working. Below, the deltas in plain numbers: pricing, latency, what's included, where Whisper still wins, and a 30-second migration snippet.

Last updated 2026-06-22 · Whisper prices from openai.com/api/pricing

Field	Orchard	OpenAI Whisper API
Price per minute (pay-as-you-go)	$0.00042	$0.006
Plan with 30,000 min / mo	$25 (Pro, includes diar + TTS + clone)	$180 metered
Real-time factor (cluster, batch)	50-80× RT	~25× RT (single-stream)
Max upload size	500 MB	25 MB
Async + webhooks	✓	✗
Speaker diarization	✓ (pyannote 3.1, all plans)	✗
Text-to-Speech on same balance	✓ (17 languages)	✗ (separate product)
Voice cloning on same balance	✓ (F5 ES + XTTS multilingual)	✗
Word-level timestamps	✓	✓
OpenAI Whisper API compatibility	✓ (drop-in)	native
Spanish (LATAM) tier-1 quality	✓ (rioplatense fine-tune)	generic
Free tier	500 min / mo	none
Data used for training	never	not by default (opt-in only since March 2023)

14× cheaper per minute · same OpenAI SDK calls · diar + TTS + clone bundled

The 14× delta is real

Pricing

OpenAI charges a flat $0.006 per minute on the Whisper API, with no tier discount and no included quota. Orchard's pay-as-you-go rate is $0.00042 per minute: same minute, same Whisper-class accuracy, roughly 14× less ($0.006 ÷ $0.00042 ≈ 14.3). For a single team transcribing 30,000 minutes a month (a typical podcast post-production or call-analytics workload):

OpenAI Whisper API

30,000 min × $0.006

$180 / month

Orchard PAYG

30,000 min × $0.00042

$12.60 / month

Orchard Pro plan

Flat

$25 / month (30k min included + diar + TTS + clone)
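If you want to rerun the comparison for your own volume, the arithmetic is a one-liner per vendor. A minimal sketch, using only the rates quoted in the table above; the constant and function names are illustrative and not part of any SDK:

```typescript
// Rates quoted in the comparison table above (USD).
const OPENAI_WHISPER_PER_MIN = 0.006;
const ORCHARD_PAYG_PER_MIN = 0.00042;
const ORCHARD_PRO_FLAT = 25; // Pro plan, 30,000 minutes included

// Round to whole cents so floating-point noise does not leak into the totals.
function toCents(usd: number): number {
  return Math.round(usd * 100) / 100;
}

// Metered cost for one month at a flat per-minute rate.
function meteredCost(minutes: number, ratePerMin: number): number {
  return toCents(minutes * ratePerMin);
}

const minutes = 30000;
const openai = meteredCost(minutes, OPENAI_WHISPER_PER_MIN);
const orchardPayg = meteredCost(minutes, ORCHARD_PAYG_PER_MIN);
const ratio = OPENAI_WHISPER_PER_MIN / ORCHARD_PAYG_PER_MIN;

console.log({
  openai,
  orchardPayg,
  orchardPro: ORCHARD_PRO_FLAT,
  perMinuteRatio: ratio.toFixed(1),
});
```

Change `minutes` to your own monthly volume to see where each option lands.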

Orchard is meaningfully cheaper than Whisper-direct for anything past a few hundred minutes a month. Even on the $1/yr Hobby plan, the crossover point where Orchard beats OpenAI is roughly 17 minutes a month. Past that, a subscription saves you money versus Whisper pay-as-you-go.

Same model lineage

Accuracy

Orchard runs on whisper.cpp with Core ML acceleration, served from our own cluster. The default model is orchard-stt-v1, a tuned Whisper-large derivative. On the same English/Spanish test sets we use internally, WER lands within 1 to 1.5 percentage points of the OpenAI hosted endpoint, with a tilt in our favour on rioplatense Spanish, where we ship a fine-tuned variant.

English (LibriSpeech clean): parity, within ±1 percentage point of WER.
Spanish (CommonVoice ES): parity, within ±1.5 percentage points of WER.
Spanish (rioplatense, internal): 4 percentage points better than hosted Whisper (we fine-tuned on porteño speech).
Multispeaker podcasts: with diarization on, Orchard returns speaker labels, which the Whisper API does not provide.
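To check these numbers against your own audio, you can score both endpoints with the standard WER definition: word-level edit distance divided by the number of reference words. The sketch below is a generic implementation of that definition, not our internal benchmark harness; the function names are illustrative, and a real evaluation would also normalize punctuation and numerals before scoring.

```typescript
// Lowercase and split on whitespace. Real benchmarks usually also strip
// punctuation and normalize numerals; that step is omitted here.
function tokenize(s: string): string[] {
  return s.toLowerCase().split(/\s+/).filter((w) => w.length > 0);
}

// WER = (substitutions + deletions + insertions) / reference word count,
// computed as Levenshtein distance over words with a rolling row.
function wordErrorRate(reference: string, hypothesis: string): number {
  const ref = tokenize(reference);
  const hyp = tokenize(hypothesis);
  if (ref.length === 0) {
    throw new Error("reference must contain at least one word");
  }

  // prev[j] = distance between the reference prefix seen so far and hyp[0..j)
  let prev: number[] = Array.from({ length: hyp.length + 1 }, (_, j) => j);
  for (let i = 1; i <= ref.length; i++) {
    const curr: number[] = [i];
    for (let j = 1; j <= hyp.length; j++) {
      const substitution = prev[j - 1] + (ref[i - 1] === hyp[j - 1] ? 0 : 1);
      const deletion = prev[j] + 1;
      const insertion = curr[j - 1] + 1;
      curr.push(Math.min(substitution, deletion, insertion));
    }
    prev = curr;
  }
  return prev[hyp.length] / ref.length;
}

// One substituted word out of six reference words.
console.log(wordErrorRate("the cat sat on the mat", "the cat sat on a mat"));
```

Run the same reference transcripts through both endpoints and compare the two scores; the difference in percentage points is the figure quoted in the list above.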

Cluster vs single GPU

Latency & throughput

OpenAI's Whisper API is single-stream per request. Orchard shards long audio across the cluster, so a 60-minute podcast lands in under 90 seconds wall time — that's ~40× real-time end-to-end, including upload and post-processing. The sync endpoint targets short utterances (under 60 s, voice agent territory), and the async endpoint with webhooks handles everything else.

Sync POST /v1/audio/transcriptions: ~150 ms p50 on a 5 s clip.
Async POST /v1/transcriptions/upload: 60 min audio → ~90 s wall.
Webhooks: we POST the result to your URL when ready (Whisper has no async).
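The sync/async split can be encoded as a small routing helper on the client side. This is a sketch under stated assumptions: the two paths and the 60 s guideline come from this section, the 25 MB sync cap and 500 MB async cap come from the table and FAQ on this page, and treating 1 MB as 1024 × 1024 bytes is our guess. The helper names are hypothetical.

```typescript
const MB = 1024 * 1024; // assumption: binary megabytes
const SYNC_MAX_BYTES = 25 * MB; // sync cap quoted in the FAQ
const ASYNC_MAX_BYTES = 500 * MB; // max upload size quoted in the table
const SYNC_MAX_SECONDS = 60; // sync targets short utterances

type Route = "/v1/audio/transcriptions" | "/v1/transcriptions/upload";

// Pick the endpoint for a clip of known duration and size.
function chooseEndpoint(durationSec: number, sizeBytes: number): Route {
  if (sizeBytes > ASYNC_MAX_BYTES) {
    throw new RangeError("file exceeds the 500 MB upload limit");
  }
  if (durationSec < SYNC_MAX_SECONDS && sizeBytes <= SYNC_MAX_BYTES) {
    return "/v1/audio/transcriptions";
  }
  return "/v1/transcriptions/upload";
}

// Real-time factor: seconds of audio processed per second of wall time.
function realTimeFactor(audioSec: number, wallSec: number): number {
  return audioSec / wallSec;
}

console.log(chooseEndpoint(5, 200000)); // short voice-agent clip
console.log(chooseEndpoint(3600, 60 * MB)); // hour-long podcast
console.log(realTimeFactor(3600, 90)); // the 60 min in 90 s case above
```

Routing on the client keeps short voice-agent requests on the low-latency path and sends everything else to the cluster, where the result arrives by webhook.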

One balance, one key

What's included beyond STT

Whisper API is a single endpoint: audio in, text out. Anything else — diarization, synthesis, voice cloning — is a separate provider, separate billing, separate code path. Orchard ships three products on the same balance:

Speaker diarization

pyannote.audio 3.1 on GPU. About 4 seconds of turnaround on 30 minutes of audio. Billed at 1.5× the STT per-minute rate, with included quota on every plan.

Text-to-Speech

12 voices, 17 languages, Piper engine. Sub-2 s synth latency. Same per-minute price as STT, drawn from the same balance.

Voice cloning

F5 for Spanish (rioplatense fine-tune), XTTS for 16 other languages. Needs a 6-60 s reference clip; unlimited synthesis thereafter.

Two lines of diff

Migration

Because we mirror the OpenAI Whisper request/response shape, the official SDK works against Orchard with a base-URL swap. No fork, no shim, no waiting for an SDK update:

import fs from "fs";
import OpenAI from "openai";

const client = new OpenAI({
- baseURL: "https://api.openai.com/v1",
- apiKey:  process.env.OPENAI_API_KEY,
+ baseURL: "https://api.orchardrun.com/v1",
+ apiKey:  process.env.ORCHARD_API_KEY,
});

const transcription = await client.audio.transcriptions.create({
  file:  fs.createReadStream("podcast.mp3"),
  model: "whisper-1",   // alias accepted, routed to orchard-stt-v1
});

Same call, same response shape ({ text, language, duration, segments[] }), same SDK ergonomics. If you were already using OpenAI for STT, the migration comes down to two configuration values, the base URL and the API key, changed in your CI/CD pipeline. Median migration time across customers who've done it: under an hour.
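One way to keep the switch out of application code entirely is to read both values from the environment and fall back to OpenAI when no override is set. A minimal sketch: the variable names STT_BASE_URL and STT_API_KEY and the helper sttConfigFromEnv are hypothetical, chosen for illustration, and are not names that either SDK or API defines.

```typescript
interface SttConfig {
  baseURL: string;
  apiKey: string;
}

const OPENAI_BASE_URL = "https://api.openai.com/v1";

// Build the client options from environment variables.
// STT_BASE_URL and STT_API_KEY are illustrative names, not an official convention.
function sttConfigFromEnv(env: Record<string, string | undefined>): SttConfig {
  const apiKey = env.STT_API_KEY;
  if (!apiKey) {
    throw new Error("STT_API_KEY is not set");
  }
  return {
    baseURL: env.STT_BASE_URL ?? OPENAI_BASE_URL,
    apiKey,
  };
}

// Example: the values a deploy pointing at Orchard would resolve to.
const example = sttConfigFromEnv({
  STT_BASE_URL: "https://api.orchardrun.com/v1",
  STT_API_KEY: "sk-example",
});
console.log(example.baseURL);

// In the application: new OpenAI(sttConfigFromEnv(process.env))
```

With this in place, moving between vendors or rolling back is a change to the deployment environment rather than a code change.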

Honest tradeoff

When OpenAI Whisper is the right call

If you're already deep in the OpenAI ecosystem (Assistants API, GPT-4 vision, the rest of the stack billed on one key) and your STT volume is genuinely tiny — under ~50 minutes a month — staying on Whisper saves you the cognitive load of a second vendor. The $0.006/min Whisper rate works out to under a dollar a month at that volume, and the lack of diarization or TTS may not matter to your use case.

Past that point — high-volume workloads, anything multilingual or Spanish-heavy, anything that needs diarization or TTS bundled, or anything that benefits from async + webhooks — Orchard is the economically obvious choice.

Common pre-migration questions

FAQ

Is Orchard literally running Whisper under the hood?

We run a tuned Whisper-large derivative (whisper.cpp + Core ML) on our own cluster. Same model family OpenAI started from, with our optimizations for batch throughput and Spanish accuracy.

What about whisper-large-v3?

We're benchmarking the v3 turbo variant in parallel; the public endpoint will move when WER improvements justify the latency cost. Until then, v2-class is what serves production.

Can I send a 2-hour audio file in one request?

Yes, via the async endpoint. Sync caps at 25 MB to keep TTFB predictable; async accepts up to 500 MB and processes via the cluster with webhooks for the result.

What happens to my data?

Audio is processed on our own cluster and dropped from RAM the moment the response goes out. We never train on customer data. Transcripts cache in Redis for 48 h so webhook retries and customer-support replays work; after that they auto-evict.

Is there a free tier I can test against?

Yes: 500 min / month, no credit card, includes all three products (STT + TTS + Clone). The diar quota at Free tier is 10 minutes a month — enough to validate the speaker-detection quality against your real audio before committing.

Stop paying $0.006/min.
Start at $0.00042.

Same SDK calls. Same Whisper-class accuracy. Diar, TTS and voice cloning on the same balance. Free 500 min a month to run your own benchmark before paying.

Get an API key · Read the docs