
We benchmarked six speech-to-text providers across English, code-switched English/Hindi, and noisy conference audio. The shortlist was easier to make than we expected.
What WER doesn't capture
Word error rate (WER) gets quoted constantly, but on its own it doesn't predict how a model will perform inside a voice agent. The metrics that mattered more for us:
- First-partial latency: How quickly do we get any transcript at all?
- Stability of partials: How often does the partial change as more audio arrives?
- Endpointing accuracy: How well does the model decide a turn is done?
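
To make these three metrics concrete, here is a minimal sketch of how they could be computed from a log of streaming results. It is not the harness we ran and does not reflect any provider's API: the `TranscriptEvent` shape, the function names, and the choice to count a "change" as any update that rewrites already-emitted words are all assumptions made for illustration. The endpointing measure also assumes you have a hand-labeled time at which the speaker actually finished.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class TranscriptEvent:
    """One streaming result. Hypothetical shape, not any provider's API."""
    t: float        # seconds since audio started streaming
    text: str       # transcript so far for the current turn
    is_final: bool  # True when the model declares the turn finished


def first_partial_latency(events: List[TranscriptEvent]) -> Optional[float]:
    """Time of the first non-empty transcript, or None if none arrived."""
    for e in events:
        if e.text.strip():
            return e.t
    return None


def partial_revision_rate(events: List[TranscriptEvent]) -> float:
    """Fraction of updates that rewrote already-emitted words.

    Assumption: an update is a revision when the previous word
    sequence is not a prefix of the new one.
    """
    seqs = [e.text.split() for e in events if e.text.strip()]
    pairs = list(zip(seqs, seqs[1:]))
    if not pairs:
        return 0.0
    revised = sum(1 for prev, cur in pairs if cur[:len(prev)] != prev)
    return revised / len(pairs)


def endpoint_delay(events: List[TranscriptEvent],
                   true_end: float) -> Optional[float]:
    """Seconds between the labeled end of speech and the first final event.

    Positive means the model waited after the speaker stopped;
    negative means it cut the speaker off.
    """
    for e in events:
        if e.is_final:
            return e.t - true_end
    return None


# Made-up sample turn, for illustration only.
sample = [
    TranscriptEvent(0.25, "book", False),
    TranscriptEvent(0.5, "book a", False),
    TranscriptEvent(0.75, "look at", False),          # rewrites earlier words
    TranscriptEvent(1.0, "look at flights", False),
    TranscriptEvent(2.0, "look at flights", True),
]

if __name__ == "__main__":
    print(first_partial_latency(sample))         # 0.25
    print(partial_revision_rate(sample))         # 0.25
    print(endpoint_delay(sample, true_end=1.5))  # 0.5
```

Reporting endpointing as a signed delay rather than a single accuracy number keeps the two failure modes apart: a late endpoint makes the agent feel slow, while an early one cuts the speaker off.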
Surprises
The cheapest provider in our test wasn't the worst. The most expensive wasn't the best. Pricing tiers and quality tiers don't track each other as cleanly as you'd hope.

