ChiEngMixBench: A New Benchmark for Code-Mixing

Code-mixing, the practice of blending languages within a single sentence or conversation, is increasingly common in interactions between humans and large language models (LLMs). However, existing evaluation approaches often reduce the phenomenon to a translation problem, which makes it difficult to assess whether a model's language-switching behavior is context-appropriate and consistent with human conventions.

To address this gap, the authors introduce ChiEngMixBench, the first benchmark specifically designed to evaluate code-mixing ability in authentic community contexts. The benchmark is built on a general construction pipeline that scales dataset development across domains and bilingual pairs.
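The pipeline itself is not detailed in this summary. As a minimal sketch, assuming it filters raw community text into code-mixed candidates parameterized by domain and language pair, it might look like the following; every name and heuristic here is an illustrative assumption, not the paper's implementation:

```python
import re
from dataclasses import dataclass

# Illustrative sketch only: the heuristics and names below are assumptions,
# not ChiEngMixBench's actual pipeline.

HAN = re.compile(r"[\u4e00-\u9fff]")   # CJK Unified Ideographs
LATIN = re.compile(r"[A-Za-z]")

@dataclass
class MixedSample:
    domain: str
    text: str

def is_code_mixed(text: str) -> bool:
    """Naive candidate signal: the text contains both Han and Latin script."""
    return bool(HAN.search(text)) and bool(LATIN.search(text))

def build_split(posts: list[str], domain: str) -> list[MixedSample]:
    """Filter raw community posts down to code-mixed candidates for one domain."""
    return [MixedSample(domain, p) for p in posts if is_code_mixed(p)]

if __name__ == "__main__":
    raw = ["我今天在 debug 一个 memory leak", "今天天气很好", "a purely English post"]
    print(build_split(raw, domain="software"))
```

Because the filter is parameterized only by a script-pair heuristic and a domain label, the same skeleton could in principle be re-instantiated for other bilingual pairs, which is the scalability property the summary attributes to the pipeline.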

Spontaneity and Naturalness as Key Metrics

ChiEngMixBench frames code-mixing as a cognitive alignment problem characterized by two complementary signals: spontaneity and naturalness. Empirical evaluations show that these metrics systematically distinguish code-mixing performance across models.
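The summary does not define how the two signals are computed. As a rough sketch under stated assumptions, spontaneity could be operationalized as the rate of script switches within a response, and naturalness as the closeness of that rate to a human reference; the function names, the Han-character heuristic, and the reference rate below are all placeholders, not the benchmark's metrics:

```python
import re

HAN = re.compile(r"[\u4e00-\u9fff]")

def script_of(token: str) -> str:
    """Coarse per-token script tag: 'zh' if the token contains Han characters."""
    return "zh" if HAN.search(token) else "en"

def spontaneity(tokens: list[str]) -> float:
    """Placeholder metric: fraction of adjacent token pairs whose script switches."""
    if len(tokens) < 2:
        return 0.0
    switches = sum(script_of(a) != script_of(b) for a, b in zip(tokens, tokens[1:]))
    return switches / (len(tokens) - 1)

def naturalness(tokens: list[str], human_rate: float = 0.15) -> float:
    """Placeholder metric: 1.0 when the switch rate matches a human reference rate."""
    return 1.0 - min(abs(spontaneity(tokens) - human_rate), 1.0)

tokens = ["我", "先", "跑", "一遍", "unit", "test", "再", "提交"]
print(spontaneity(tokens), naturalness(tokens))
```

The point of the pairing is that the two scores can disagree: a model that never switches scores low on spontaneity, while a model that switches far more often than humans do scores low on naturalness.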

Beyond benchmarking, the research uncovers an implicitly emergent Terminology Layering Strategy, a phenomenon consistent with Matrix Language Frame (MLF) theory, in which one language supplies the grammatical frame while the other contributes embedded content such as technical terms. This points to a structured cognitive alignment between multilingual language models and human communication.
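To make the MLF reading concrete, a toy analysis might tag each token's script, take the majority script as the matrix language, and treat minority-script tokens as embedded terminology islands. This is an illustrative assumption for exposition, not the paper's analysis method:

```python
import re

HAN = re.compile(r"[\u4e00-\u9fff]")

def matrix_and_islands(tokens: list[str]) -> tuple[str, list[str]]:
    """Toy MLF reading: the majority script is the matrix language; the
    minority-script tokens are the embedded 'terminology islands'."""
    labels = ["zh" if HAN.search(t) else "en" for t in tokens]
    matrix = max(set(labels), key=labels.count)
    islands = [t for t, lab in zip(tokens, labels) if lab != matrix]
    return matrix, islands

# A Chinese matrix sentence embedding an English technical term:
print(matrix_and_islands(["我", "先", "把", "这个", "function", "重构", "一下"]))
# -> ('zh', ['function'])
```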
