Blog - Syllable

Ibrahim Cotran, Christopher Zheng

Sep 23, 2025

Speech-to-text (STT) services often struggle with short audio snippets (e.g., numbers, "yes", "no") and with the long tail of names. In these cases the STT service lacks the additional context that would otherwise improve transcription accuracy. We ran a targeted benchmark of three STT engines on 97 short audio clips to compare accuracy. Here are the results.

How we tested

Content: 97 single-utterance clips with a diverse set of names and numbers (phone numbers, ZIP codes, and dates)
Scoring: Measured edit distance between each transcript and its reference text, with leading, trailing, and repeated spaces removed
Each clip was labeled Exact (edit distance = 0), Close (0 < edit distance < 2, i.e., an edit distance of 1), or Neither (edit distance >= 2)
Models tested: Google STT V2, Google STT V2 Chirp 2, and Deepgram Nova-3
All were run with default settings; no phrase lists or language hints were provided
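
The scoring rule above can be sketched in a few lines of Python. This is a minimal illustration, not the benchmark's actual code: it assumes a character-level Levenshtein distance and a case-sensitive comparison, neither of which the post specifies, and the names `normalize`, `edit_distance`, and `score` are hypothetical.

```python
import re


def normalize(text: str) -> str:
    """Trim leading/trailing whitespace and collapse repeated whitespace."""
    return re.sub(r"\s+", " ", text.strip())


def edit_distance(a: str, b: str) -> int:
    """Character-level Levenshtein distance (assumed metric)."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute (or match)
            ))
        prev = curr
    return prev[-1]


def score(reference: str, hypothesis: str) -> str:
    """Label a transcript as Exact, Close, or Neither."""
    d = edit_distance(normalize(reference), normalize(hypothesis))
    if d == 0:
        return "Exact"
    if d < 2:
        return "Close"
    return "Neither"


print(score("80001", "8001"))            # one dropped digit
print(score("Mai Pham", " Mai  Pham "))  # differs only in spacing
```

Under these assumptions, a single dropped digit counts as Close rather than Exact, which is why short numeric clips are sensitive to even one-character errors.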

Key patterns observed

Names are still hard: While many models have improved with names like "Siobhan", names such as "Min-seo Kim" or "Mai Pham" remain consistently challenging for STT services to transcribe accurately.
Numbers are much improved: Across phone numbers, ZIP codes, and dates, engines were solid, but we still saw occasional issues with repeated digits (e.g., 80001 was transcribed as 8001).

Lessons learned

Deepgram Nova-3 was consistently the best STT engine across names and numbers, but both Google options were dependable for numbers. Either way, it's probably valuable to implement validation with simple rules (e.g., regex for phone number formats, ZIP code length, sensible date ranges).

From a cost perspective, Deepgram Nova-3 is priced at $0.0077 per minute, while Google STT V2 costs $0.024 per minute and Google STT V2 Chirp 2 costs $0.016 per minute.
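
As a rough sketch of what such validation rules might look like, the Python below checks phone numbers, ZIP codes, and dates. It assumes US formats (10-digit phone numbers with an optional leading 1, and 5-digit or ZIP+4 codes) and an arbitrary earliest date of 1900-01-01; the function names and thresholds are illustrative, not part of the benchmark.

```python
import re
from datetime import date
from typing import Optional

# Assumption: US ZIP codes, either 5 digits or ZIP+4 (12345-6789).
ZIP_RE = re.compile(r"\d{5}(-\d{4})?")


def is_valid_phone(text: str) -> bool:
    """Accept 10-digit US numbers, ignoring punctuation and a leading 1."""
    digits = re.sub(r"\D", "", text)
    if len(digits) == 11 and digits.startswith("1"):
        digits = digits[1:]
    return len(digits) == 10


def is_valid_zip(text: str) -> bool:
    """Check ZIP code length and format."""
    return ZIP_RE.fullmatch(text.strip()) is not None


def is_sensible_date(
    d: date,
    earliest: date = date(1900, 1, 1),  # assumed lower bound
    latest: Optional[date] = None,      # defaults to today
) -> bool:
    """Check that a date falls inside a plausible range."""
    if latest is None:
        latest = date.today()
    return earliest <= d <= latest


print(is_valid_zip("80001"))  # well-formed 5-digit ZIP
print(is_valid_zip("8001"))   # the dropped-digit case from the benchmark
```

A length check like this would flag the 80001 to 8001 error noted above, so the application can re-prompt the caller instead of accepting a malformed value.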

The Best speech-to-text (STT) models (STT Benchmarks)
