A comparative analysis evaluated 31 automatic speech recognition (ASR) models on medical audio data, revealing new perspectives on performance and hardware requirements.
VibeVoice-ASR 9B: a new benchmark
Microsoft's VibeVoice-ASR 9B model achieved a word error rate (WER) of 8.34%, positioning itself as a leader among open-source solutions. This result approaches the performance of Gemini 2.5 Pro (8.15%). However, VibeVoice 9B requires approximately 18GB of VRAM and, even on high-end hardware such as the NVIDIA H100, is slower (97 seconds per file) than models like Parakeet (6 seconds per file).
Other models evaluated
In addition to VibeVoice, other models were evaluated, including ElevenLabs Scribe v2, NVIDIA Nemotron Speech Streaming 0.6B, and Voxtral Mini. Parakeet TDT 0.6B v3 stands out for its efficiency, achieving a WER of 9.35% at only 6 seconds per file on Apple Silicon.
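For context on the metric behind these rankings, WER is the word-level edit distance between reference and hypothesis transcripts, divided by the reference length. A minimal sketch (illustrative only; benchmarks like this one typically use a library such as jiwer together with a text normalizer):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + deletions + insertions) / reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # DP table: d[i][j] = minimum edits to turn ref[:i] into hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # delete all remaining reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j  # insert all remaining hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,        # deletion
                d[i][j - 1] + 1,        # insertion
                d[i - 1][j - 1] + cost, # substitution (or match)
            )
    return d[len(ref)][len(hyp)] / len(ref)

# One substitution out of four reference words -> 0.25
print(wer("the patient is stable", "the patient was stable"))
```

Because the denominator is the reference length, small normalization choices (see the next section) can shift WER noticeably across every model under test.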
Corrections to Whisper's normalizer
The analysis identified and fixed several bugs in Whisper's text normalizer that were inflating WER by 2-3% across all models. The fixes address the handling of the interjection "oh" and the unification of lexical variants (e.g. ok/okay/k).
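To illustrate the kind of fix described, a normalizer can collapse lexical variants to a single canonical form and drop interjections before scoring, so "okay" vs. "ok" or a stray "oh" no longer counts as an error. This is a hypothetical sketch, not Whisper's actual normalizer code; the variant table and function names are assumptions:

```python
import re

# Assumed examples of variant unification and interjection handling;
# the real fix lives in Whisper's text normalizer.
VARIANTS = {"okay": "ok", "k": "ok"}
INTERJECTIONS = {"oh"}

def normalize(text: str) -> str:
    """Lowercase, map lexical variants to one form, and drop interjections."""
    words = []
    for w in re.findall(r"[a-z']+", text.lower()):
        if w in INTERJECTIONS:
            continue  # drop rather than penalize interjections
        words.append(VARIANTS.get(w, w))
    return " ".join(words)

print(normalize("Oh okay, the dosage is OK"))  # -> "ok the dosage is ok"
```

Applied to both reference and hypothesis before computing WER, such normalization removes spelling-variant mismatches that penalize every model equally, which explains the uniform 2-3% shift reported above.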
For those evaluating on-premise deployments, there are trade-offs between accuracy, hardware requirements, and inference speed. AI-RADAR offers analytical frameworks on /llm-onpremise to evaluate these aspects.