A comparative analysis evaluated 31 automatic speech recognition (ASR) models on medical audio data, revealing new perspectives on performance and hardware requirements.
VibeVoice-ASR 9B: a new benchmark
Microsoft's VibeVoice-ASR 9B model achieved a word error rate (WER) of 8.34%, positioning itself as a leader among open-source solutions. This result approaches the performance of Gemini 2.5 Pro (8.15%). However, VibeVoice 9B requires approximately 18GB of VRAM and, even on high-end hardware such as the NVIDIA H100, is slower (97 seconds per file) than models like Parakeet (6 seconds per file).
Other models evaluated
In addition to VibeVoice, other models were evaluated, including ElevenLabs Scribe v2, NVIDIA Nemotron Speech Streaming 0.6B, and Voxtral Mini. Parakeet TDT 0.6B v3 stands out for its efficiency, achieving a WER of 9.35% at only 6 seconds per file on Apple Silicon.
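For context on the metric behind these rankings, WER is the word-level edit distance between reference and hypothesis transcripts, divided by the reference length. A minimal sketch (illustrative only; benchmarks like this one typically use a library such as jiwer together with a text normalizer):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + deletions + insertions) / reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # DP table: d[i][j] = minimum edits to turn ref[:i] into hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # delete all remaining reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j  # insert all remaining hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,        # deletion
                d[i][j - 1] + 1,        # insertion
                d[i - 1][j - 1] + cost, # substitution (or match)
            )
    return d[len(ref)][len(hyp)] / len(ref)

# One substitution out of four reference words -> 0.25
print(wer("the patient is stable", "the patient was stable"))
```

Because the denominator is the reference length, small normalization choices (see the next section) can shift WER noticeably across every model under test.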
Corrections to Whisper's normalizer
The analysis identified and fixed several bugs in Whisper's text normalizer that were inflating WER by 2-3% across all models. The fixes address the handling of the interjection "oh" and the unification of lexical variants (e.g. ok/okay/k).
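To illustrate the kind of fix described, a normalizer can collapse lexical variants to a single canonical form and drop interjections before scoring, so "okay" vs. "ok" or a stray "oh" no longer counts as an error. This is a hypothetical sketch, not Whisper's actual normalizer code; the variant table and function names are assumptions:

```python
import re

# Assumed examples of variant unification and interjection handling;
# the real fix lives in Whisper's text normalizer.
VARIANTS = {"okay": "ok", "k": "ok"}
INTERJECTIONS = {"oh"}

def normalize(text: str) -> str:
    """Lowercase, map lexical variants to one form, and drop interjections."""
    words = []
    for w in re.findall(r"[a-z']+", text.lower()):
        if w in INTERJECTIONS:
            continue  # drop rather than penalize interjections
        words.append(VARIANTS.get(w, w))
    return " ".join(words)

print(normalize("Oh okay, the dosage is OK"))  # -> "ok the dosage is ok"
```

Applied to both reference and hypothesis before computing WER, such normalization removes spelling-variant mismatches that penalize every model equally, which explains the uniform 2-3% shift reported above.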
For those evaluating on-premise deployments, there are trade-offs between accuracy, hardware requirements, and inference speed. AI-RADAR offers analytical frameworks on /llm-onpremise to evaluate these aspects.