Benchmarking Commercial ASR Systems on Code-Switching Speech: New Multilingual Benchmarks

The Code-Switching Challenge for Commercial ASR Systems

Code-switching, the natural alternation between two or more languages within a single utterance, is one of the most complex and under-studied conditions for automatic speech recognition (ASR). Existing benchmarks for commercial ASR providers tend to focus on clean, monolingual audio and report a single Word Error Rate (WER) value. That figure offers little insight into real-world multilingual performance, where speakers routinely disregard rigid linguistic boundaries.

For companies operating globally or serving multilingual communities, an ASR system's ability to handle code-switching is fundamental. Accurate transcription in these contexts is not just a matter of efficiency; it can have significant implications for compliance, customer service quality, and voice data analysis, especially in regulated sectors. The lack of reliable metrics for these conditions has, until now, made it difficult for CTOs and infrastructure architects to make informed decisions regarding ASR solution deployments.

A New Benchmark and Innovative Methodology

To address this challenge, a new benchmark evaluates five commercial ASR providers across four language pairs: Egyptian Arabic-English, Saudi Arabic (Najdi/Hijazi)-English, Persian (Farsi)-English, and German-English. Each dataset comprises 300 samples, selected through a two-stage pipeline. In the first stage, a heuristic filter scores transcripts on five structural code-switching signals. In the second, an ensemble of GPT-4o and Gemini 1.5 Pro scores the remaining candidates across six linguistic dimensions.
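The two-stage idea can be sketched as follows. This is a minimal illustration, not the benchmark's actual pipeline: the article does not name the five structural signals, so the single script-change signal below, the function names (`switch_score`, `select_candidates`), and the `threshold` value are all assumptions. A script-based signal also only applies to pairs written in different scripts (Arabic or Persian with English); it would not detect German-English switches.

```python
import unicodedata


def script_of(token):
    """Classify a token as 'arabic', 'latin', or 'other' by its first letter."""
    for ch in token:
        if ch.isalpha():
            name = unicodedata.name(ch, "")
            if name.startswith("ARABIC"):  # also covers Persian letters
                return "arabic"
            if name.startswith("LATIN"):
                return "latin"
            return "other"
    return "other"


def switch_score(transcript):
    """Hypothetical stage-1 signal: share of adjacent token pairs that change script."""
    scripts = [s for s in map(script_of, transcript.split()) if s != "other"]
    if len(scripts) < 2:
        return 0.0
    switches = sum(a != b for a, b in zip(scripts, scripts[1:]))
    return switches / (len(scripts) - 1)


def select_candidates(transcripts, llm_score, threshold=0.2, top_k=300):
    """Stage 1 filters cheaply; stage 2 calls the expensive scorer only on survivors."""
    survivors = [t for t in transcripts if switch_score(t) >= threshold]
    ranked = sorted(survivors, key=llm_score, reverse=True)
    return ranked[:top_k]
```

Here `llm_score` stands in for whatever callable wraps the LLM ensemble; because it is invoked only on transcripts that pass the filter, the number of paid scoring calls shrinks with the filter's pass rate.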

This methodology improves the quality and relevance of the test data while also cutting cost. Because only candidates that pass the heuristic filter reach the Large Language Model (LLM) ensemble, the pipeline reduced LLM scoring costs by approximately 91% compared with scoring every candidate exhaustively. Systems were evaluated with two metrics: the traditional Word Error Rate (WER) and BERTScore. The latter proved more reliable for the Arabic and Persian language pairs, where transliteration variance can penalize semantically correct transcriptions when they are evaluated with WER alone.
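To make the transliteration issue concrete, the sketch below computes WER as word-level edit distance divided by reference length, using the standard definition rather than any code from the benchmark. The example sentences are invented for illustration. An embedding-based metric such as BERTScore is not shown, since it requires a pretrained model.

```python
def wer(reference, hypothesis):
    """Word Error Rate: word-level Levenshtein distance / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed reference prefix and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


# Invented example: the hypothesis writes the borrowed word in Latin script,
# the reference in Arabic script. The meaning is the same, yet WER counts an error.
reference = "please send the ايميل today"
hypothesis = "please send the email today"
print(wer(reference, hypothesis))  # 0.2 (one substitution over five words)
```

A single script mismatch on a five-word utterance already costs 20 points of WER, which is why a semantic metric can be the fairer judge for these language pairs.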

Key Results and Deployment Relevance

ElevenLabs Scribe v2 achieved the lowest WER across all four language pairs, with an overall value of 13.2% and a notable 13.1% on Egyptian Arabic. The same system also led on BERTScore, reaching an overall value of 0.936. A difficulty-stratified analysis revealed performance gaps that aggregate averages would have masked, providing a more granular view of system capabilities. Furthermore, BERT embedding projections confirmed semantic proximity between reference and hypothesis transcriptions, despite surface-level script differences.
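The effect of stratification can be illustrated with a small sketch. The difficulty labels and the numbers in the usage example are made up for illustration and are not results from the benchmark; the article does not state how difficulty buckets were defined.

```python
from collections import defaultdict


def stratified_wer(samples):
    """Compute overall and per-bucket WER.

    samples: iterable of (difficulty, word_errors, reference_words) tuples.
    Returns (overall_wer, {difficulty: wer}).
    """
    totals = defaultdict(lambda: [0, 0])  # difficulty -> [errors, reference words]
    for difficulty, errors, ref_words in samples:
        totals[difficulty][0] += errors
        totals[difficulty][1] += ref_words
    per_bucket = {d: e / n for d, (e, n) in totals.items()}
    all_errors = sum(e for e, _ in totals.values())
    all_words = sum(n for _, n in totals.values())
    return all_errors / all_words, per_bucket


# Toy data (hypothetical): two easy utterances and one hard one.
toy = [("easy", 2, 100), ("easy", 3, 100), ("hard", 30, 100)]
overall, by_bucket = stratified_wer(toy)
print(round(overall, 3), by_bucket)
```

In this toy case the aggregate WER is under 12% even though the hard bucket sits at 30%, which is the kind of gap a single average hides.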

The benchmark dataset is publicly available on Hugging Face, giving the community a tool for future research and evaluations. For technical decision-makers, the results underscore the importance of code-switching-specific benchmarks when evaluating ASR solutions for multilingual environments. A system's ability to manage these complexities directly affects accuracy and usability, and is therefore a critical factor for successful deployment.

Perspectives for Infrastructure and Data Sovereignty

The adoption of robust, multilingual ASR systems has direct implications for infrastructure strategy, particularly for organizations prioritizing on-premise or hybrid deployments. Sensitive voice data, which may include personal or proprietary information, often requires solutions that ensure data sovereignty and compliance with regulations such as GDPR. In these scenarios, the ability to process code-switched speech accurately and locally, without relying on external cloud services for every stage of the process, becomes a fundamental requirement.

The data preparation methodology, which employs LLMs for scoring, also shows how these models can be integrated into data pre-processing pipelines, even in on-premise contexts. Although the specific LLMs used in this study are cloud services, the principle of using advanced language models to improve data quality or to support evaluation can be replicated with self-hosted LLMs. This gives CTOs and architects the flexibility to build AI infrastructures that balance performance, cost, and security requirements, and supports a deeper analysis of the trade-offs between different ASR solutions.