Summary
There are two very different ways to turn speech into text, and the difference decides how accurate the resulting transcript is.
Traditional ASR recognizes sound acoustically, one piece at a time. When two words sound identical, it’s basically a coin flip. LLM-based transcription — the approach behind OpenAI’s Whisper and behind Atter AI — adds a language model that reads the whole sentence and corrects those ambiguous words from context.
We benchmarked Atter AI against OpenAI Whisper large-v3 on public datasets, running both on the exact same audio and scoring both the same way. Results across 9 languages:
| Language | Metric | Dataset | Atter | Whisper large-v3 |
|---|---|---|---|---|
| English | WER | LibriSpeech test-clean | 1.30% | 2.40% |
| English | WER | FLEURS en_us | 1.45% | 2.80% |
| Spanish | WER | FLEURS es_419 | 3.80% | 4.90% |
| French | WER | FLEURS fr_fr | 3.95% | 5.10% |
| German | WER | FLEURS de_de | 4.20% | 5.60% |
| Portuguese | WER | FLEURS pt_br | 3.75% | 5.00% |
| Mandarin | CER | FLEURS cmn_hans_cn | 3.60% | 4.80% |
| Japanese | CER | FLEURS ja_jp | 4.10% | 5.40% |
| Korean | CER | FLEURS ko_kr | 4.25% | 5.70% |
| Cantonese | CER | FLEURS yue_hant_hk | 2.40% | 4.20% |
Lower is better (these are error rates). Accuracy = 100% − error rate. Across the 9 FLEURS languages, Atter averaged 3.50% error (96.50% accuracy) vs Whisper large-v3’s 4.83% (95.17%).
Two kinds of transcription, and why it matters
Most “speech-to-text” tools fall into one of two camps.
Traditional ASR (acoustic, word-by-word). It listens to the sound and picks the most acoustically likely word. The problem: human language is full of words that sound the same. Say “I’ll meet you at the pier” and a purely acoustic system has no way to know whether you meant pier or peer — it guesses. “Their / there / they’re,” “to / two / too,” “write / right” — every one is a gamble.
LLM-based transcription (acoustic + context). It still listens, but then a language model reads the surrounding words and fixes the ones that don’t fit. “I’ll meet you at the ___” followed by “to watch the boats” resolves to pier, not peer. The acoustic guess gets corrected by meaning.
This matters even more in languages with heavy homophony. In Mandarin, the syllable jiā could be 家 (home), 加 (add), or 佳 (good). A traditional system picks one and hopes. An LLM-based system reads the sentence — “guǎn lǐ … jiā shuǐ” → it’s 加 (add), not 家 (home) — and corrects it. Same mechanism, bigger payoff.
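To make the mechanism concrete, here is a toy sketch of context-based correction. It is not how Atter AI or Whisper work internally: a real system uses a neural language model, while this example uses a small hand-written lookup table. The names `HOMOPHONES`, `ASSOCIATION`, and `resolve`, and every score in the table, are invented for illustration.

```python
# Toy illustration only: the candidate lists and association scores below are
# made up for this example and do not come from any real model.
HOMOPHONES = {
    "pier": ["pier", "peer"],
    "peer": ["pier", "peer"],
}

# Hypothetical word-association scores (higher = more likely to co-occur).
ASSOCIATION = {
    ("pier", "boats"): 0.9,
    ("peer", "boats"): 0.1,
    ("pier", "review"): 0.05,
    ("peer", "review"): 0.95,
}


def resolve(acoustic_guess, context_words):
    """Pick the homophone candidate that best fits the surrounding words."""
    candidates = HOMOPHONES.get(acoustic_guess, [acoustic_guess])

    def score(word):
        # Sum the association between this candidate and each context word.
        return sum(ASSOCIATION.get((word, c), 0.0) for c in context_words)

    # Put the acoustic guess first: max() returns the first best item,
    # so the original guess is kept when the context gives no evidence.
    ordered = [acoustic_guess] + [c for c in candidates if c != acoustic_guess]
    return max(ordered, key=score)


print(resolve("peer", ["to", "watch", "the", "boats"]))
```

With no helpful context words, the function returns the acoustic guess unchanged, which mirrors the point above: without context, the acoustic choice is all there is.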
Atter AI is built on the LLM-based approach. The benchmark below measures how far that gets it against the best-known model in the same camp, Whisper.
How we benchmarked
The goal was a comparison anyone can reproduce, not a number you have to take on faith.
| Item | Setup |
|---|---|
| Systems compared | Atter AI vs OpenAI Whisper large-v3 (local inference) |
| Datasets | LibriSpeech test-clean; FLEURS test split (per language) |
| Audio consistency | The same audio fed to both systems |
| Reference | Each dataset’s official reference transcript |
| Metric (en + European) | WER (Word Error Rate) |
| Metric (CJK + Cantonese) | CER (Character Error Rate) |
| Normalization | Lowercasing, punctuation removal, whitespace collapse; CJK spaces removed; Traditional→Simplified unified — identical rules for both systems |
| Scoring tool | jiwer |
| Atter version | 2026 build |
| Test date | April 2026 |
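The normalization row in the table can be sketched roughly as follows. This is an illustrative version, not the exact code used for the benchmark: the function name `normalize` and the `cjk` flag are assumptions, punctuation is identified by Unicode category, and the Traditional→Simplified step is left out because it needs a dedicated conversion library.

```python
import re
import unicodedata


def normalize(text, cjk=False):
    """Lowercase, remove punctuation, collapse whitespace; drop spaces for CJK."""
    text = text.lower()
    # Delete every character whose Unicode category starts with "P"
    # (punctuation). Symbols such as "$" or "+" are left untouched.
    text = "".join(ch for ch in text if not unicodedata.category(ch).startswith("P"))
    # Collapse runs of whitespace into single spaces and trim the ends.
    text = re.sub(r"\s+", " ", text).strip()
    if cjk:
        # CER is computed per character, so spaces carry no information here.
        text = text.replace(" ", "")
    # Traditional-to-Simplified unification is NOT implemented in this sketch.
    return text


print(normalize("Hello,   World!"))
print(normalize("你好, 世界。", cjk=True))
```

Whatever the exact rules are, the important property is the one the table states: the same function is applied to the reference and to both systems’ output.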
Why FLEURS
FLEURS is a public, multilingual benchmark with a consistent format across 100+ languages — the same dataset Whisper’s own evaluations use. That makes it a fair, reproducible yardstick for a cross-language comparison, instead of cherry-picked clips.
Evaluation steps
- Take the official audio + reference transcript for each language.
- Transcribe the same audio with both Atter AI and Whisper large-v3.
- Normalize all transcripts with identical rules (see above).
- Compute WER (English and the European languages) or CER (Mandarin, Japanese, Korean, Cantonese) with jiwer.
- Accuracy = 100% − error rate.
No manual correction was applied to either system’s output before scoring.
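For readers who want to see what the scoring step computes, here is a dependency-free sketch of WER and CER. The benchmark itself used jiwer; this is only an illustration of the same definition (edit distance divided by reference length), and the function names are made up. Aggregating at corpus level, as done here, is an assumption about how the per-language figures were pooled.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference token missing
                curr[j - 1] + 1,     # insertion: extra hypothesis token
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1]


def error_rate(refs, hyps, unit="word"):
    """Corpus-level error rate in percent: total edits / total reference tokens."""
    if len(refs) != len(hyps):
        raise ValueError("refs and hyps must have the same number of segments")
    # Words are split on whitespace (WER); characters are taken one by one (CER).
    split = str.split if unit == "word" else list
    edits = total = 0
    for ref, hyp in zip(refs, hyps):
        r, h = split(ref), split(hyp)
        edits += edit_distance(r, h)
        total += len(r)
    return 100.0 * edits / total


print(error_rate(["meet you at the pier"], ["meet you at the peer"]))  # WER
print(error_rate(["加水"], ["家水"], unit="char"))                      # CER
```

One wrong word out of five gives a WER of 20%, and one wrong character out of two gives a CER of 50%; accuracy is then 100% minus that figure.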
Results
| Language | Metric | Dataset | Segments | Atter error | Atter accuracy | Whisper error | Whisper accuracy |
|---|---|---|---|---|---|---|---|
| English | WER | LibriSpeech test-clean | 2,620 | 1.30% | 98.70% | 2.40% | 97.60% |
| English | WER | FLEURS en_us | 647 | 1.45% | 98.55% | 2.80% | 97.20% |
| Spanish | WER | FLEURS es_419 | 908 | 3.80% | 96.20% | 4.90% | 95.10% |
| French | WER | FLEURS fr_fr | 676 | 3.95% | 96.05% | 5.10% | 94.90% |
| German | WER | FLEURS de_de | 862 | 4.20% | 95.80% | 5.60% | 94.40% |
| Portuguese | WER | FLEURS pt_br | 919 | 3.75% | 96.25% | 5.00% | 95.00% |
| Mandarin | CER | FLEURS cmn_hans_cn | 945 | 3.60% | 96.40% | 4.80% | 95.20% |
| Japanese | CER | FLEURS ja_jp | 650 | 4.10% | 95.90% | 5.40% | 94.60% |
| Korean | CER | FLEURS ko_kr | 382 | 4.25% | 95.75% | 5.70% | 94.30% |
| Cantonese | CER | FLEURS yue_hant_hk | 819 | 2.40% | 97.60% | 4.20% | 95.80% |
| FLEURS macro-average (9 languages) | — | — | — | 3.50% | 96.50% | 4.83% | 95.17% |
A few honest reads of this table:
- Atter led in every language tested, but the margin varies — roughly 1.1 to 1.8 points of error rate. It’s a consistent edge, not a blowout.
- Korean and German were the hardest for both systems; Cantonese and English were the cleanest. That spread is normal — some languages are simply harder to transcribe.
- On LibriSpeech test-clean, Atter’s 1.30% WER matches our standing accuracy report; Whisper large-v3 came in at 2.40% on the same audio.
- The gap tends to come from exactly the failure mode described above: context-dependent words — homophones, names, code-switching — where the language model resolves what a purely acoustic guess gets wrong.
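The margin range and the FLEURS macro-average can be checked directly from the table. The short script below is just arithmetic on the published figures; the variable names are arbitrary, and "macro-average" is taken to mean an unweighted mean over languages, ignoring segment counts.

```python
# (Atter error %, Whisper error %) copied from the results table above.
results = {
    "LibriSpeech test-clean": (1.30, 2.40),
    "FLEURS en_us": (1.45, 2.80),
    "FLEURS es_419": (3.80, 4.90),
    "FLEURS fr_fr": (3.95, 5.10),
    "FLEURS de_de": (4.20, 5.60),
    "FLEURS pt_br": (3.75, 5.00),
    "FLEURS cmn_hans_cn": (3.60, 4.80),
    "FLEURS ja_jp": (4.10, 5.40),
    "FLEURS ko_kr": (4.25, 5.70),
    "FLEURS yue_hant_hk": (2.40, 4.20),
}

# Absolute margin in percentage points of error rate, per row.
margins = {name: round(w - a, 2) for name, (a, w) in results.items()}

# Macro-average over the FLEURS rows only: an unweighted mean per language.
fleurs = [pair for name, pair in results.items() if name.startswith("FLEURS")]
atter_avg = sum(a for a, _ in fleurs) / len(fleurs)
whisper_avg = sum(w for _, w in fleurs) / len(fleurs)

print(f"margin range: {min(margins.values()):.2f} to {max(margins.values()):.2f} points")
print(f"FLEURS macro-average: Atter {atter_avg:.2f}%, Whisper {whisper_avg:.2f}%")
```

The output reproduces the 1.10 to 1.80 point range and the 3.50% and 4.83% averages reported in the table.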
How to read this: these are benchmark results on clean public audio. Real-world recordings — noisy rooms, overlapping speakers, far-field mics, heavy accents — push error rates higher for every system. The point of a same-audio comparison is the relative gap, not a promise that any single recording will hit the exact number.
How to reproduce this
You don’t have to trust the numbers — you can rerun them.
- Download the same FLEURS / LibriSpeech test splits.
- Transcribe the audio with Whisper large-v3 and with Atter AI.
- Score both with the same normalization. We used a small jiwer-based script (scripts/asr_benchmark.py in this site’s repository): drop each language’s ref.txt, atter.txt, and whisper.txt into bench/<lang>/, run it, and it prints WER/CER for both side by side.
Same audio, same reference, same normalization — that’s the whole trick to an honest comparison.
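If you want to build your own scorer around the file layout described above, loading the inputs might look something like this. It is a hypothetical sketch, not the contents of the repository script: the function name `load_bench` is invented, and it assumes each file holds one segment per line in the same order.

```python
from pathlib import Path


def load_bench(root):
    """Read ref.txt, atter.txt and whisper.txt from each <root>/<lang>/ folder."""
    data = {}
    for lang_dir in sorted(Path(root).iterdir()):
        if not lang_dir.is_dir():
            continue
        files = {}
        for name in ("ref", "atter", "whisper"):
            path = lang_dir / f"{name}.txt"
            # Assumption: one segment per line, UTF-8 encoded.
            files[name] = path.read_text(encoding="utf-8").splitlines()
        # Same audio for both systems means the three files must line up.
        if not (len(files["ref"]) == len(files["atter"]) == len(files["whisper"])):
            raise ValueError(f"{lang_dir.name}: line counts differ between files")
        data[lang_dir.name] = files
    return data
```

The line-count check is a cheap guard for the "same audio, same reference" rule: if one system's file is missing a segment, every later line would be scored against the wrong reference.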
Limitations
- Benchmark audio is cleaner than real meetings; treat these as a controlled baseline.
- FLEURS is read speech, not spontaneous multi-speaker conversation; conversational error rates run higher for all systems.
- Whisper has many variants; we used large-v3. Smaller/faster Whisper builds score worse, so comparing against those would overstate the gap.
- Results reflect the Atter version and test date listed above; both systems improve over time.
Frequently asked questions
What’s the difference between traditional ASR and LLM-based transcription? Traditional ASR maps sound to text acoustically, so homophones (words that sound identical) are essentially a guess. LLM-based transcription adds a language model that reads the whole sentence and corrects those words from context, which is where the gap in the results table tends to come from.
Which Whisper model did you compare against? OpenAI Whisper large-v3, the strongest open-source Whisper model, run locally. We deliberately compared against the strongest baseline rather than the older whisper-1 API so the comparison is meaningful.
Is this an independent, reproducible benchmark? It uses public datasets (FLEURS and LibriSpeech test-clean), runs both systems on the exact same audio against the same reference transcripts, and scores both with the same normalization using jiwer. Anyone can reproduce it with the published method.
Does Atter beat Whisper in every language? In every language tested here, yes, by roughly 1.1 to 1.8 points of error rate; see the results table for the per-language numbers. We report each language honestly, including where the gap is smaller. Real-world accuracy also varies with audio quality, accents, noise, and overlapping speakers.
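What are WER and CER? WER (Word Error Rate) measures word-level errors for space-delimited languages like English. CER (Character Error Rate) measures character-level errors and is used where words are not a reliable scoring unit: Chinese and Japanese are written without spaces, and Korean, although it uses spaces, has spacing conventions variable enough that character-level scoring gives a more stable comparison. Accuracy = 100% − error rate.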
Bottom line
The reason Atter AI is accurate isn’t a marketing number — it’s the approach. LLM-based transcription corrects the homophones and context errors that traditional ASR can only guess at, and the benchmark above shows how that holds up against the best-known model in the same class, in the open, on audio you can re-run yourself.