Indic-TunedLens: A New Approach for Interpreting Multilingual LLMs in Indian Languages

Multilingual large language models (LLMs) are increasingly deployed in linguistically diverse regions such as India. However, most interpretability tools are designed primarily for English, which makes them unreliable when applied to models operating in other languages.

To address this gap, the authors developed Indic-TunedLens, an interpretability framework designed specifically for Indian languages. Unlike the standard Logit Lens, which decodes intermediate activations directly through the model's unembedding matrix, Indic-TunedLens adjusts each layer's hidden states per target language so that they align with the target output distribution, enabling a more faithful decoding of the model's intermediate representations.
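The core idea can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes, following the general tuned-lens approach, that the per-language adjustment is a learned affine map `(A_l, b_l)` applied to a layer's hidden state before projecting through the frozen unembedding matrix `W_U`; all matrix values below are toy random numbers.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, vocab_size = 8, 20

# Frozen unembedding matrix of the base model (toy values).
W_U = rng.normal(size=(d_model, vocab_size))

# Hypothetical learned affine translator for one layer and one target
# language; in practice these parameters would be trained, not random.
A_l = np.eye(d_model) + 0.1 * rng.normal(size=(d_model, d_model))
b_l = 0.01 * rng.normal(size=d_model)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def logit_lens(h):
    """Standard Logit Lens: decode the hidden state directly."""
    return softmax(h @ W_U)

def tuned_lens(h):
    """Tuned-lens-style decoding: apply the learned affine map first."""
    return softmax((h @ A_l + b_l) @ W_U)

h = rng.normal(size=d_model)   # an intermediate hidden state
p = tuned_lens(h)              # per-token distribution at this layer
print(p.shape, round(float(p.sum()), 6))
```

The difference between the two functions is exactly the affine correction: the Logit Lens assumes intermediate states already live in the output embedding space, while the tuned variant learns how to translate them there.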

The framework was evaluated on 10 Indian languages using the MMLU benchmark and showed significant improvements over existing interpretability methods, particularly for morphologically rich, low-resource languages. The results offer insight into how multilingual transformers encode semantics layer by layer.

The model is available on Hugging Face Spaces, and the source code is accessible on GitHub.