A team of developers ran into unexpected behavior while using Whisper for meeting transcription: given silent audio, the model does not stay silent but generates fluent, plausible-sounding phrases that have no basis in the input.
The problem of hallucinations
These "hallucinations" are not random noise but well-formed, often recurring phrases: generic thank-yous, references to subtitles, or, worse, repetitive loops that run on for entire paragraphs. The cause lies in Whisper's training on a vast dataset of YouTube audio, which leads it to "complete" silence with the most probable phrases, such as the closing thanks typical of videos.
Proposed solutions
The team has implemented several strategies to mitigate the problem:
- Silero VAD as a pre-gate: Use a Voice Activity Detection (VAD) model to avoid submitting audio segments without voice to Whisper.
- condition_on_previous_text=False: Disable this option, which otherwise feeds the bad output into the next window's prompt and triggers a cascade of hallucinations.
- Exact-string blocklist: Maintain a list of phrases Whisper typically generates over silence and discard matching segments.
- Repeated-output detection: Stop the transcription if the same text is generated consecutively for a certain number of times.
- beam_size=1: Use a small beam size for faster decoding that is also less prone to loops.
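The blocklist and repeated-output checks above can be sketched as simple post-filters. This is a minimal illustration, not the team's actual code: the constants and helper names (`KNOWN_HALLUCINATIONS`, `filter_transcript`, `drop_repeats`) are assumptions made for the example.

```python
# Illustrative sketch of two of the mitigations described above.
# The decoding options themselves would be passed to the model, e.g.
# with faster-whisper:
#   model.transcribe(audio, beam_size=1, condition_on_previous_text=False)

# Exact-string blocklist: phrases Whisper tends to emit over silence
# (hypothetical entries; a real list is built from observed output).
KNOWN_HALLUCINATIONS = {
    "thank you for watching",
    "thanks for watching!",
    "subtitles by the amara.org community",
}


def is_blocklisted(text: str) -> bool:
    """Discard a segment whose text exactly matches a known hallucination."""
    return text.strip().lower() in KNOWN_HALLUCINATIONS


def drop_repeats(segments: list[str], max_repeats: int = 2) -> list[str]:
    """Drop segments once the same text repeats max_repeats times in a row."""
    cleaned: list[str] = []
    last, count = None, 0
    for text in segments:
        if text == last:
            count += 1
            if count >= max_repeats:
                continue  # looping output: discard the extra copies
        else:
            last, count = text, 1
        cleaned.append(text)
    return cleaned


def filter_transcript(segments: list[str]) -> list[str]:
    """Apply both post-filters to a list of segment texts."""
    kept = [s for s in segments if not is_blocklisted(s)]
    return drop_repeats(kept)
```

In production the same checks would run incrementally per segment rather than over a finished list, but the logic is identical.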
These techniques have proven effective at significantly reducing Whisper's hallucinations in production environments. Unlike CTC/transducer models, which emit blank tokens during silence, Whisper's decoder must always generate text, which is what makes these countermeasures necessary.
It is important to note that some hallucinations may contain violent or harmful content, which makes it crucial to implement mitigation mechanisms, especially in sensitive contexts such as medical transcription.