Imagine having to trust a collaborator who always answers correctly, but you have no idea how they reach their conclusions. Something similar happens with large language models: benchmarks tell us that a model works, but not what happens inside its layers of computation. A research team has tried to answer that question by formalizing four axioms (causality, minimality, separability, and stability) that measure the quality of latent thought representations independently of final accuracy.

Beyond the black box of benchmarks

The paper, Formalizing Latent Thoughts: Four Axioms of Thought Representation in LLMs, starts from an observation: standard evaluations conflate the model's overall capacity with the quality of its internal representations. When an LLM fails a task, we don't know whether the problem lies in the encoding of information or in subsequent processing. To separate the two, the researchers defined quantitative metrics based on four functional principles that a “well-formed” representation should respect.

The four axioms are:

  • Causality: the representation must contain information actually relevant to the answer.
  • Minimality: it must be compact, without unnecessary redundancy.
  • Separability: it must distinguish not only between different tasks but also between different questions within the same task.
  • Stability: small variations in the input should not significantly alter the representation.

Because these indicators are calculated directly on internal vectors, without going through the final output, the framework makes it possible to diagnose flaws that would otherwise remain invisible.
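To make the idea of scoring internal vectors concrete, here is a minimal sketch of two such indicators, one for stability and one for minimality. It is not the paper's implementation: the function names, the use of cosine similarity as a stability proxy, and the participation ratio as a compactness proxy are all assumptions chosen for illustration, and the data is synthetic rather than taken from a real model.

```python
import numpy as np


def stability_score(original, perturbed):
    """Mean cosine similarity between paired rows.

    Row i of `original` is the hidden vector for input i; row i of
    `perturbed` is the vector for a slightly altered version of that input.
    Values near 1 mean the representation barely moved.
    (Hypothetical proxy for stability, not the paper's metric.)
    """
    original = np.asarray(original, dtype=float)
    perturbed = np.asarray(perturbed, dtype=float)
    dots = np.sum(original * perturbed, axis=1)
    norms = np.linalg.norm(original, axis=1) * np.linalg.norm(perturbed, axis=1)
    return float(np.mean(dots / norms))


def participation_ratio(vectors):
    """Effective number of dimensions the vectors actually use.

    Computed as (sum of eigenvalues)^2 / (sum of squared eigenvalues) of the
    covariance matrix. A low value relative to the vector width indicates a
    compact representation.
    (Hypothetical proxy for minimality, not the paper's metric.)
    """
    vectors = np.asarray(vectors, dtype=float)
    eigenvalues = np.linalg.eigvalsh(np.cov(vectors, rowvar=False))
    eigenvalues = np.clip(eigenvalues, 0.0, None)  # remove tiny negative round-off
    return float(eigenvalues.sum() ** 2 / np.sum(eigenvalues ** 2))


# Synthetic stand-ins for hidden states: 50 inputs, 16-dimensional vectors.
rng = np.random.default_rng(0)
hidden = rng.normal(size=(50, 16))
hidden_after_small_edit = hidden + rng.normal(scale=0.05, size=hidden.shape)

print("stability:", stability_score(hidden, hidden_after_small_edit))
print("effective dimensions:", participation_ratio(hidden))
```

In a real audit the arrays would come from a chosen layer of the model under test. Which layer and which token position to read are further choices this sketch leaves open.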

The audit across 23 reasoning tasks

The audit covered a range of open-weight (and therefore publicly inspectable) models, including dense architectures, reasoning-distilled models, and models trained with reinforcement learning. The 23 tasks range from spatial reasoning to factual QA. The results are stark: no model satisfies all four axioms simultaneously. The representations reliably distinguish the type of task (e.g., math vs. geography), but not two different questions within the same category. Moreover, they contain very little information beyond what is already present in the initial input embedding: latent thought adds surprisingly little.
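The gap between task-level and question-level separability can be illustrated with a small sketch. It assumes a simple proxy, the fraction of vectors whose nearest neighbor shares their label, and applies it twice to the same vectors: once with task labels and once with question labels. The function name, the proxy, and the synthetic data (built so that tasks sit far apart while questions within a task nearly overlap) are assumptions for illustration, not the paper's metric or its measurements.

```python
import numpy as np


def nearest_neighbor_agreement(vectors, labels):
    """Fraction of vectors whose nearest other vector carries the same label.

    (Hypothetical separability proxy, not the metric used in the paper.)
    """
    vectors = np.asarray(vectors, dtype=float)
    labels = np.asarray(labels)
    squared_norms = np.sum(vectors ** 2, axis=1)
    # Pairwise squared Euclidean distances.
    distances = squared_norms[:, None] + squared_norms[None, :] - 2.0 * vectors @ vectors.T
    np.fill_diagonal(distances, np.inf)  # a vector is not its own neighbor
    nearest = np.argmin(distances, axis=1)
    return float(np.mean(labels[nearest] == labels))


# Synthetic data: 2 tasks, 5 questions per task, 4 paraphrases per question.
# Task centers are far apart; question centers within a task differ only
# slightly compared with the paraphrase noise.
rng = np.random.default_rng(0)
n_tasks, n_questions, n_paraphrases, dim = 2, 5, 4, 8
vectors, task_ids, question_ids = [], [], []
for t in range(n_tasks):
    task_center = np.full(dim, 50.0 * t)
    for q in range(n_questions):
        question_center = task_center + rng.normal(scale=0.1, size=dim)
        for _ in range(n_paraphrases):
            vectors.append(question_center + rng.normal(scale=1.0, size=dim))
            task_ids.append(t)
            question_ids.append(t * n_questions + q)

task_score = nearest_neighbor_agreement(vectors, task_ids)
question_score = nearest_neighbor_agreement(vectors, question_ids)
print("task-level agreement:", task_score)
print("question-level agreement:", question_score)
```

On data shaped like this, the task-level score is high while the question-level score stays much lower. That is the pattern the audit describes, here reproduced by construction rather than measured on a model.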

The flaw appears to be structural: it shows up regardless of model scale or training recipe. This uniformity suggests that current transformer architectures, as designed, struggle to build rich and truly distinctive internal representations.

Why this research matters for on-premise deployment

For an organization that chooses to keep its LLMs on local infrastructure—whether for data sovereignty, regulatory compliance, or operational control—model transparency becomes critical. In a self-hosted scenario, being able to trust not only accuracy but also the internal robustness of the model reduces the risk of erratic behavior in production. Tools like this axiomatic framework offer a deeper audit lens, complementary to traditional benchmarks.

Of course, the work is pure research, not a ready-made product. But it signals an important direction: model evaluation is moving beyond the monoculture of success rates to look inside the mechanics of reasoning. Teams evaluating models for on-premise deployment today often compare dozens of variants that differ in quantization, fine-tuning, and context window. Adding structural representation analysis as a further dimension can make a difference when reliability is non-negotiable.

A broader perspective

The finding that models struggle to separate similar questions and do not enrich the initial input much has implications beyond research. If confirmed on a larger scale, it could steer development toward architectures that devote more resources to building a genuinely informative latent space. In the meantime, for those operating on-premise, the message is clear: superficial evaluations are not enough, and structural audits like the one proposed here can become part of a more mature selection process.