When a language model reuses its hidden states as runtime memory, each iteration both produces a prediction and feeds the next iteration. That’s architecturally elegant, but it raises an uncomfortable question: does the per-step loss function actually control all the variables involved? Recent research shows it does not: dense (per-step) cross-entropy only governs the variables that the readout exposes, leaving a large blind spot in the recurrent dynamics.

The hidden scale that escapes the loss

The mechanism is as simple as it is dangerous. In many looped LLMs, the readout is scale-invariant thanks to normalizations such as RMSNorm or LayerNorm, which strip away the radial magnitude of the hidden vector before projecting it onto the vocabulary. Multiplying the hidden state by any positive constant leaves the logits unchanged, so the immediate loss is blind to its Euclidean norm. Meanwhile, the pre-norm residual recurrence keeps accumulating and propagating that very same scale, with no direct restraint.
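The mechanism can be reproduced in a few lines. The sketch below is a toy illustration, not the architecture from the research: the matrix `W`, the three-dimensional state, and the function `loop_step` (whose residual branch simply returns the normalized state) are all assumptions chosen to keep the example small. It shows a normalized readout returning the same logits for a state and a 1000x copy of it, while a pre-norm residual loop lets the norm grow unchecked.

```python
import math

def l2_norm(h):
    """Euclidean norm of a vector given as a list of floats."""
    return math.sqrt(sum(x * x for x in h))

def rms_norm(h, eps=1e-8):
    """RMSNorm without a learned gain: divide by the root-mean-square."""
    rms = math.sqrt(sum(x * x for x in h) / len(h))
    return [x / (rms + eps) for x in h]

def readout(h, W):
    """Scale-invariant readout: normalize first, then project onto the vocabulary."""
    n = rms_norm(h)
    return [sum(w * x for w, x in zip(row, n)) for row in W]

def loop_step(h):
    """Toy pre-norm residual update (hypothetical block).

    The branch only sees the normalized state, but its output is added
    back onto the un-normalized state, so amplitude accumulates.
    """
    branch = rms_norm(h)  # stand-in for a real block f(norm(h))
    return [x + b for x, b in zip(h, branch)]

W = [[0.5, -1.0, 0.25], [1.0, 0.0, -0.5]]  # toy 2-token vocabulary (assumption)
h = [1.0, 2.0, -1.0]

# The readout cannot tell h apart from 1000 * h.
logits_small = readout(h, W)
logits_large = readout([1000.0 * x for x in h], W)

# The recurrence, meanwhile, keeps growing the norm at every iteration.
norms = [l2_norm(h)]
for _ in range(50):
    h = loop_step(h)
    norms.append(l2_norm(h))
```

In this toy the growth is only linear, because the branch output has a fixed norm; it is meant to show that nothing in the readout pushes back, not to reproduce the magnitudes reported for real models.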

The result, documented on 44M- and 129M-parameter looped transformers, is striking: without architectural intervention, hidden-state norms blow up into the thousands or tens of thousands. This loss of control is not caused by insufficient supervision – every step has its own cross-entropy – but by a structural visibility gap.

Two solutions for one general rule

The authors point to two complementary paths, converging on a clean design rule. The first is to make scale visible to the loss, for example by using readouts that preserve radial information or by adding explicit norm penalties. The second is to remove scale entirely from the loop through recurrence that does not carry amplitude information. Both approaches bring norms back to modest values (in the tens) and, consistent with the rule, scale-controlled variants achieve lower perplexity at matched inference depth in variable-depth benchmarks.
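Both paths can be sketched in plain Python. This is a minimal illustration under stated assumptions rather than the authors' implementation: `norm_penalty`, `normalized_loop_step`, the target norm, the penalty weight, and the fixed `update` vector are all hypothetical names and values introduced here.

```python
import math

def l2_norm(h):
    """Euclidean norm of a vector given as a list of floats."""
    return math.sqrt(sum(x * x for x in h))

def rms_norm(h, eps=1e-8):
    """RMSNorm without a learned gain: divide by the root-mean-square."""
    rms = math.sqrt(sum(x * x for x in h) / len(h))
    return [x / (rms + eps) for x in h]

def norm_penalty(h, target, weight):
    """Path 1: make scale visible with a loss term that grows as ||h|| leaves `target`."""
    return weight * (l2_norm(h) - target) ** 2

def normalized_loop_step(h, update):
    """Path 2: remove scale by re-normalizing the state carried to the next iteration."""
    return rms_norm([x + u for x, u in zip(h, update)])

h = [1.0, 2.0, -1.0]
target = math.sqrt(len(h))  # assumed target: the norm an RMS-normalized vector has

# Path 1: the penalty distinguishes a modest state from a blown-up one.
penalty_small = norm_penalty(h, target, weight=0.01)
penalty_large = norm_penalty([1000.0 * x for x in h], target, weight=0.01)

# Path 2: the carried state stays at a fixed norm however long the loop runs.
state = h
for _ in range(50):
    state = normalized_loop_step(state, update=[0.3, -0.2, 0.1])
final_norm = l2_norm(state)
```

In a training loop, the penalty from the first path would be added to the per-step cross-entropy. The second path needs no extra loss term, because no amplitude is left for the recurrence to accumulate.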

The resulting principle is as direct as it is operational: dense supervision trains intermediate exits, but controlling recurrent scale demands an explicit intervention, either one that exposes scale to the objective or one that removes it from the loop. Per-step cross-entropy alone offers no shortcut.

Why on-premise training teams should care

For teams running LLM training or fine-tuning on local infrastructure, this dynamic has concrete implications. Training looped models without the right safeguards can hide scale drift that degrades performance and prolongs debugging cycles. Compute wasted on unproductive experiments translates directly into a higher total cost of ownership (TCO) for GPUs and storage systems.

The rule that emerged – make scale visible or remove it – becomes a near-zero-cost architectural validation criterion at design time. Those working on self-hosted stacks and aiming to retain data sovereignty can incorporate these checks directly into their experimentation pipeline, without waiting for external benchmarks. The effort is minimal: replace the scale-invariant readout, add a regularization term, or normalize the state carried through the loop, all of which can be evaluated on proprietary workloads and datasets.
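One way such a check might look is a small probe that rescales a hidden state and watches whether the readout reacts. The sketch below is an assumption about how a team could implement it, not a procedure from the research: `is_scale_blind`, the scaling factors, the tolerance, and the two toy readouts are hypothetical.

```python
import math

def is_scale_blind(readout, h, factors=(0.1, 10.0, 1000.0), tol=1e-6):
    """Return True if `readout` gives (nearly) the same output for h and c * h."""
    base = readout(h)
    for c in factors:
        scaled = readout([c * x for x in h])
        if any(abs(a - b) > tol for a, b in zip(base, scaled)):
            return False
    return True

def normalized_readout(h):
    """Toy readout that divides by the root-mean-square before output."""
    rms = math.sqrt(sum(x * x for x in h) / len(h))
    return [x / rms for x in h]

def linear_readout(h):
    """Toy readout that keeps radial information."""
    return [2.0 * x for x in h]

h = [1.0, 2.0, -1.0]
blind = is_scale_blind(normalized_readout, h)   # scale is hidden from the loss
visible = not is_scale_blind(linear_readout, h)  # scale reaches the loss
```

A readout flagged as blind is not a problem on its own. It becomes one when the recurrence also carries un-normalized state from one iteration to the next, which is the combination worth catching at design time.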

A reminder for recurrent architectures

This finding highlights an often-neglected point: not everything that is technically ‘trainable’ is automatically under control. The normalization mechanisms that ease convergence in static transformers can create fresh obstacles when the network loops back onto itself. It’s a classic trade-off: an ingredient that stabilizes a single forward pass can destabilize the recurrence built on top of it.

The message to the community is clear: when designing cyclical architectures, it is essential to cross-check the cost function against the variables it actually constrains. A readout that is too ‘blind’ is not an optimization problem, but a model definition issue. Fixing it doesn’t add complexity – it adds coherence.