ASR Accent Adaptation with Multimodal Data Selection

Automatic speech recognition (ASR) systems often degrade when processing accents different from those they were trained on. Adapting to a new accent typically requires large amounts of labeled data, which are costly and time-consuming to collect.

A new study proposes a reference-free data selection pipeline, guided by multimodal consistency, for accent adaptation in ASR systems. This approach aims to overcome the limitations of text-based selection heuristics, which may favor fluent but acoustically mismatched hypotheses, leading to error amplification during fine-tuning.

The pipeline starts with a target-aware preselection step based on submodular mutual information, which improves query relevance and reduces the computational load of the later stages. It then generates multiple transcriptions per utterance via perturbation-based decoding and scores each hypothesis with two reference-free signals: speech-to-text alignment in a shared embedding space and a predicted word error rate (WER). A simple percentile-based selection rule then retains reliable pseudo-labels for fine-tuning and discards noisy utterances.
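The percentile-based filtering step can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name, the two score arrays, and the specific percentile cutoffs are all assumptions; the study combines an alignment score (higher is better) with a predicted WER (lower is better) and keeps only utterances that pass both thresholds.

```python
import numpy as np

def select_pseudo_labels(align_scores, pred_wers, align_pct=70, wer_pct=30):
    """Hypothetical percentile-based filter over two reference-free signals.

    align_scores: speech-to-text alignment scores (higher = better match)
    pred_wers:    predicted word error rates (lower = more reliable)
    Keeps utterances above the align_pct percentile of alignment AND
    below the wer_pct percentile of predicted WER.
    Returns the indices of the retained utterances.
    """
    align_scores = np.asarray(align_scores, dtype=float)
    pred_wers = np.asarray(pred_wers, dtype=float)

    align_thr = np.percentile(align_scores, align_pct)  # keep high-alignment tail
    wer_thr = np.percentile(pred_wers, wer_pct)         # keep low-predicted-WER tail

    keep = (align_scores >= align_thr) & (pred_wers <= wer_thr)
    return np.flatnonzero(keep)

# Toy example with 5 candidate utterances: only utterances that score well
# on both signals survive the double threshold.
align = [0.90, 0.50, 0.80, 0.20, 0.95]
wer = [0.05, 0.40, 0.10, 0.60, 0.08]
print(select_pseudo_labels(align, wer))  # → [0 4]
```

Requiring agreement between both signals is what guards against the failure mode described above: a fluent but acoustically mismatched hypothesis may get a low predicted WER yet still be rejected for poor speech-to-text alignment.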

The results show that, in an in-domain setting, selecting approximately 1,500 utterances from a pool of 30,000 achieves a WER of 10.91%, close to the 10.45% obtained by training on all 30,000 supervised labels. In a cross-domain setting with a mismatched candidate pool, consistency-filtered subsets avoid the degradation that unfiltered pseudo-labels cause under a strong accent shift. Experiments on a stronger ASR backbone further confirm the advantages over random sampling and recent selection baselines.