When Language Mixing Trips Up LLMs: The Indi-RomCoM Benchmark

Anyone working with artificial intelligence knows the lab is not the real world. And in the real world, billions of people communicate daily by mixing different languages in the same sentence, typing them in Latin script for convenience. This phenomenon, called Romanized Code Mixing (RCM), is the norm for multilingual communities from India to Europe, but it remains largely unexplored territory for Large Language Models.

A research team has just shone a spotlight on this gap with Indi-RomCoM, a benchmark designed to measure how well LLMs can follow instructions and reason when the input is a fluid blend of English and an Indian language written in Roman characters. The verdict is uncomfortable: models stumble, often badly.

Seven tasks, four languages, and a sobering lesson

The core of Indi-RomCoM is a set of seven tasks ranging from toxicity detection to summarization, with reasoning questions in between. The tests cover four widely spoken Indian languages and three controlled levels of code-mixing intensity, from light to extreme. The team evaluated a suite of models including proprietary giants, open-weight variants, and LLMs specifically trained on Indian languages, under both zero-shot and few-shot settings.
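To get a feel for the size of that evaluation matrix, the sketch below enumerates every combination of task, language, mixing level, and prompting setting for a single model. The labels are placeholders: the article does not list all seven tasks or name the four languages, and the middle intensity level ("medium") is an assumed name. This is an illustration, not the benchmark's own tooling.

```python
from itertools import product

# Hypothetical placeholder labels; only the counts come from the article.
TASKS = [f"task_{i}" for i in range(1, 8)]       # seven tasks
LANGUAGES = [f"lang_{i}" for i in range(1, 5)]   # four Indian languages
MIX_LEVELS = ["light", "medium", "extreme"]      # "medium" is an assumed name
SETTINGS = ["zero-shot", "few-shot"]


def evaluation_grid():
    """Return every (task, language, mix level, setting) cell for one model."""
    return list(product(TASKS, LANGUAGES, MIX_LEVELS, SETTINGS))


grid = evaluation_grid()
print(len(grid))  # 7 * 4 * 3 * 2 = 168 cells per model
```

Even under these simplified assumptions, each model has to be scored on 168 distinct cells, which is why a drop that appears across the whole grid is hard to dismiss as noise.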

The clearest result is a widespread drop in performance when the text contains RCM. The denser the mixing, the more models struggle. However, not all tasks suffer equally: reasoning tasks, where the model produces an articulated explanation, show less degradation than detection tasks such as flagging toxic content. In effect, spelling out the reasoning appears to offset part of the linguistic confusion.
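One simple way to quantify such a drop is the relative loss against a clean-text baseline, checked for whether it grows with mixing intensity. The sketch below assumes that metric; the article does not say how the authors measure degradation, and the scores used here are invented placeholders, not results from Indi-RomCoM.

```python
def relative_drop(clean_score: float, mixed_score: float) -> float:
    """Fraction of the clean-text score lost on code-mixed input."""
    if clean_score <= 0:
        raise ValueError("clean_score must be positive")
    return (clean_score - mixed_score) / clean_score


def drops_by_level(clean_score: float, scores_by_level: dict) -> dict:
    """Relative drop for each mixing level, keyed by level name."""
    return {
        level: relative_drop(clean_score, score)
        for level, score in scores_by_level.items()
    }


def is_monotonic_degradation(scores_in_order: list) -> bool:
    """True if scores never increase as mixing gets denser."""
    return all(a >= b for a, b in zip(scores_in_order, scores_in_order[1:]))


# Placeholder numbers for illustration only, NOT benchmark results.
clean = 0.80
mixed = {"light": 0.72, "medium": 0.64, "extreme": 0.48}

drops = drops_by_level(clean, mixed)
print({level: round(value, 2) for level, value in drops.items()})
print(is_monotonic_degradation([clean, *mixed.values()]))
```

Running the same calculation separately for a detection task and a reasoning task would show whether the gap described above holds for a given model.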

What it means for sovereign and on-premise AI choices

For an organization evaluating on-premise deployment of an LLM, Indi-RomCoM raises a concrete consideration. Public benchmarks often focus on clean English or well-formed text in a single language, but real communication from employees, customers, or citizens is a tangle of mixed codes. A corporate assistant or support chatbot that fails on Hindi-English input in Latin script is not a niche product failure. It could be a daily breakdown in markets with hundreds of millions of speakers.

Those running local infrastructure for data sovereignty reasons must ask whether the models they plan to put into production have been tested against such scenarios. The snapshot provided by Indi-RomCoM suggests that even the most renowned models may need specific fine-tuning or text pre-processing strategies before they can reliably operate in a genuinely multilingual context.

A broader perspective: we need inclusive benchmarks

The work highlights a wider issue: the AI industry has so far prioritized evaluation on clean, monolingual data, while everyday usage is messy, fluid, and multilingual. The emergence of benchmarks like Indi-RomCoM is not just an academic exercise but a tool to drive the development of more inclusive systems. For the enterprise world, this means the Total Cost of Ownership of an on-premise LLM might include hidden adaptation costs if the chosen model was not designed for the real linguistic landscape it will face.

The benchmark reports no hardware-specific data, but that does not dilute the message: before powering up GPUs and servers, check honestly whether the model can understand people who write the way they speak, mixing languages without formality. An RCM test could become a standard checklist item for anyone designing an AI system that is meant to converse with the world.