Script Invariance in Language Models
A recent study posted on arXiv investigates whether the features learned by large language models (LLMs) represent abstract meaning or are tied to the specific written form of the text. The research focuses on Serbian digraphia: Serbian can be written in either the Latin or the Cyrillic alphabet, with a near one-to-one mapping between characters.
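To make the setup concrete, here is a minimal sketch of the Latin-to-Cyrillic transliteration that underlies Serbian digraphia. This is not code from the paper; the character table and digraph handling follow standard Serbian orthography, and the function name is purely illustrative.

```python
# Serbian Latin -> Cyrillic transliteration sketch.
# Digraphs (lj, nj, dž) must be handled before single characters;
# everything else maps one-to-one, which is what makes the two
# scripts near-perfectly interchangeable.

DIGRAPHS = {"lj": "љ", "nj": "њ", "dž": "џ", "Lj": "Љ", "Nj": "Њ", "Dž": "Џ"}
SINGLES = {
    "a": "а", "b": "б", "c": "ц", "č": "ч", "ć": "ћ", "d": "д", "đ": "ђ",
    "e": "е", "f": "ф", "g": "г", "h": "х", "i": "и", "j": "ј", "k": "к",
    "l": "л", "m": "м", "n": "н", "o": "о", "p": "п", "r": "р", "s": "с",
    "š": "ш", "t": "т", "u": "у", "v": "в", "z": "з", "ž": "ж",
}

def latin_to_cyrillic(text: str) -> str:
    out, i = [], 0
    while i < len(text):
        pair = text[i:i + 2]
        if pair in DIGRAPHS:  # digraphs take priority over single letters
            out.append(DIGRAPHS[pair])
            i += 2
        else:
            ch = text[i]
            mapped = SINGLES.get(ch.lower(), ch)
            out.append(mapped.upper() if ch.isupper() else mapped)
            i += 1
    return "".join(out)

print(latin_to_cyrillic("Mačka sedi na krovu."))  # -> Мачка седи на крову.
```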
Methodology and Results
The researchers analyzed the feature activations of sparse autoencoders (SAEs) across the Gemma model family (270M-27B parameters). They found that identical sentences written in the two Serbian scripts activate highly overlapping sets of features, far exceeding random baselines. Notably, changing script causes less representational divergence than paraphrasing within the same script, suggesting that SAE features prioritize meaning over orthographic form. Cross-script, cross-paraphrase comparisons argue against simple memorization: such combinations rarely co-occur in training data, yet they still show substantial feature overlap. This script invariance strengthens with model scale.
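The paper's exact pipeline is not reproduced here, but the kind of overlap measurement it describes can be sketched as follows: take the set of SAE features that fire for a sentence in each script and compare their Jaccard overlap to a random-pair baseline. The helper names, the SAE width, and the placeholder activations below are all assumptions for illustration; real activations would come from running both script variants of a sentence through a Gemma model and its SAE.

```python
import numpy as np

def active_features(sae_acts: np.ndarray, threshold: float = 0.0) -> set[int]:
    """Indices of SAE features firing above a threshold for one input."""
    return set(np.flatnonzero(sae_acts > threshold))

def jaccard_overlap(acts_a: np.ndarray, acts_b: np.ndarray) -> float:
    """Overlap of the active-feature sets of two inputs (1.0 = identical)."""
    a, b = active_features(acts_a), active_features(acts_b)
    return len(a & b) / max(len(a | b), 1)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n_features = 16_384  # illustrative SAE width, not the paper's

    # Placeholder sparse activations standing in for real SAE outputs.
    latin_acts = rng.random(n_features) * (rng.random(n_features) < 0.01)

    # A script-invariant representation would share most active features,
    # so we perturb only a small fraction for the Cyrillic variant.
    cyrillic_acts = latin_acts.copy()
    flip = rng.random(n_features) < 0.001
    cyrillic_acts[flip] = rng.random(flip.sum()) * (rng.random(flip.sum()) < 0.01)

    # Unrelated sentence as a random baseline.
    unrelated_acts = rng.random(n_features) * (rng.random(n_features) < 0.01)

    print("same sentence, two scripts:", jaccard_overlap(latin_acts, cyrillic_acts))
    print("random sentence pair:      ", jaccard_overlap(latin_acts, unrelated_acts))
```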
Implications
The findings suggest that SAE features can capture semantics at a level of abstraction above surface tokenization. The study proposes Serbian digraphia as a general evaluation paradigm for probing the abstractness of learned representations. For readers evaluating on-premise LLM deployments, AI-RADAR offers analytical frameworks at /llm-onpremise for weighing the relevant trade-offs.