A team of developers ran into unexpected behavior while using Whisper for meeting transcription: given silent audio, the model does not stay silent but generates fluent, plausible-sounding phrases that have no basis in the input.
The problem of hallucinations
These "hallucinations" are not random noise but well-formed, often recurring phrases: generic thank-yous, references to subtitles, or, worse, repetitive loops that run on for entire paragraphs. The cause lies in Whisper's training on a vast dataset of YouTube audio, which leads it to "complete" silence with the most probable phrases, such as the closing thanks typical of videos.
Proposed solutions
The team has implemented several strategies to mitigate the problem:
- Silero VAD as a pre-gate: Use a Voice Activity Detection (VAD) model to avoid submitting audio segments without voice to Whisper.
- condition_on_previous_text=False: Disable this option, which otherwise feeds the bad output into the next window's prompt and triggers a cascade of hallucinations.
- Exact-string blocklist: Maintain a list of phrases Whisper typically generates over silence and discard matching segments.
- Repeated-output detection: Stop the transcription if the same text is generated consecutively for a certain number of times.
- beam_size=1: Use a small beam size for faster decoding that is also less prone to loops.
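The blocklist and repeated-output checks above can be sketched as simple post-filters. This is a minimal illustration, not the team's actual code: the constants and helper names (`KNOWN_HALLUCINATIONS`, `filter_transcript`, `drop_repeats`) are assumptions made for the example.

```python
# Illustrative sketch of two of the mitigations described above.
# The decoding options themselves would be passed to the model, e.g.
# with faster-whisper:
#   model.transcribe(audio, beam_size=1, condition_on_previous_text=False)

# Exact-string blocklist: phrases Whisper tends to emit over silence
# (hypothetical entries; a real list is built from observed output).
KNOWN_HALLUCINATIONS = {
    "thank you for watching",
    "thanks for watching!",
    "subtitles by the amara.org community",
}


def is_blocklisted(text: str) -> bool:
    """Discard a segment whose text exactly matches a known hallucination."""
    return text.strip().lower() in KNOWN_HALLUCINATIONS


def drop_repeats(segments: list[str], max_repeats: int = 2) -> list[str]:
    """Drop segments once the same text repeats max_repeats times in a row."""
    cleaned: list[str] = []
    last, count = None, 0
    for text in segments:
        if text == last:
            count += 1
            if count >= max_repeats:
                continue  # looping output: discard the extra copies
        else:
            last, count = text, 1
        cleaned.append(text)
    return cleaned


def filter_transcript(segments: list[str]) -> list[str]:
    """Apply both post-filters to a list of segment texts."""
    kept = [s for s in segments if not is_blocklisted(s)]
    return drop_repeats(kept)
```

In production the same checks would run incrementally per segment rather than over a finished list, but the logic is identical.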
These techniques have proven effective at significantly reducing Whisper's hallucinations in production environments. Unlike CTC/transducer models, which emit blank tokens during silence, Whisper's decoder must always generate text, which is what makes these countermeasures necessary.
It is important to note that some hallucinations may contain violent or harmful content, which makes it crucial to implement mitigation mechanisms, especially in sensitive contexts such as medical transcription.