Transformers, despite their effectiveness, often behave like black boxes that resist targeted surgical interventions. Ablating a seemingly crucial attention head can produce minimal changes in output, because computation is redundantly distributed across many components.

Modularity Unveiled

A new paper proposes an architectural approach that combines three ingredients: dual-stream processing (separating token and contextual representations), per-layer supervision (an independent gradient signal at each layer), and gated attention (regularization toward discrete on/off activation patterns). Together, these unveil a latent modularity.
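The gated-attention and per-layer-supervision ideas can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: it assumes a sigmoid gate per attention head and a simple MSE loss at every layer; all names and shapes are illustrative.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def gated_attention(q, k, v, gate_logits):
    """One attention layer with a learnable sigmoid gate per head.

    q, k, v: arrays of shape (heads, seq, dim); gate_logits: (heads,).
    A gate driven toward 0 silences its head entirely, which is what
    pushes the model toward discrete on/off activation patterns.
    """
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(q.shape[-1])
    attn = softmax(scores, axis=-1)            # (heads, seq, seq)
    g = 1.0 / (1.0 + np.exp(-gate_logits))     # sigmoid gate per head
    out = g[:, None, None] * (attn @ v)        # gated head outputs
    return out.sum(axis=0)                     # merge heads -> (seq, dim)

def per_layer_loss(layer_outputs, target):
    """Per-layer supervision: every layer gets its own loss against the
    target, instead of a single loss at the final layer only."""
    return sum(float(np.mean((h - target) ** 2)) for h in layer_outputs)
```

With a large negative gate logit, a head's contribution vanishes from the merged output, so ablation during training becomes a smooth, differentiable operation rather than a post-hoc surgical one.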

Ablation and Control

Models trained with per-layer supervision show ablation effects 5 to 23 times larger than those of controls trained with standard objectives, and afford roughly 4 times finer control over target behaviors, with smooth, predictable variation in model output. Per-layer supervision also sharply increases the variance of ablation effects, exposing which predictions depend on which circuits.
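The intuition behind larger ablation effects can be made concrete with a toy model. Below, a "modular" model concentrates a behavior in one head while a "redundant" control spreads it evenly; zeroing the same head then changes the modular model's output far more. The features, weights, and the resulting 4x ratio are purely illustrative, not the paper's measurements.

```python
import math

def head_features(n_heads, x):
    """Hypothetical per-head features at inputs x (stand-in for head outputs)."""
    return [[math.sin((h + 1) * xi) for xi in x] for h in range(n_heads)]

def model_output(weights, mask, x):
    """Toy model: a mask-gated, weighted sum of per-head features."""
    feats = head_features(len(weights), x)
    return [sum(w * m * f[i] for w, m, f in zip(weights, mask, feats))
            for i in range(len(x))]

def ablation_effect(weights, head, x):
    """Mean absolute change in output when one head is zeroed out."""
    full = model_output(weights, [1.0] * len(weights), x)
    mask = [1.0] * len(weights)
    mask[head] = 0.0
    ablated = model_output(weights, mask, x)
    return sum(abs(a - b) for a, b in zip(full, ablated)) / len(x)

x = [i / 63 for i in range(64)]
modular   = [1.0, 0.0, 0.0, 0.0]      # behavior lives in head 0
redundant = [0.25, 0.25, 0.25, 0.25]  # behavior spread across heads
ratio = ablation_effect(modular, 0, x) / ablation_effect(redundant, 0, x)
```

Here `ratio` comes out to exactly 4.0 by construction; the point is only that concentration of function is what turns ablation into an informative probe.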

Validation

The approach is validated through three lines of evidence: engineered features that capture computational dynamics; an architecture that serves as a positive control for modularity; and causal experiments demonstrating functional reorganization, in which different tasks are routed through different attention heads. Together, these results shift interpretability from passive observation toward active control.
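The causal routing check can be sketched as follows: if training has made the model modular, ablating the head serving one task should collapse that task's performance while leaving other tasks nearly intact. The task names, head assignments, and score values below are hypothetical placeholders, not results from the paper.

```python
def task_score(head_mask, routing):
    """Score each task as the masked sum of per-head contributions.

    routing[task] gives each head's (hypothetical) contribution to that
    task; head_mask sets an entry to 0 to ablate the corresponding head.
    """
    return {task: sum(m * w for m, w in zip(head_mask, ws))
            for task, ws in routing.items()}

routing = {
    "copy":    [0.90, 0.05, 0.05],  # routed mostly through head 0
    "reverse": [0.05, 0.90, 0.05],  # routed mostly through head 1
}
intact = task_score([1, 1, 1], routing)
no_h0  = task_score([0, 1, 1], routing)
# Ablating head 0 collapses "copy" but barely moves "reverse" —
# the double dissociation that a causal routing experiment looks for.
```

A double dissociation of this kind (each ablation hurting exactly one task) is the signature that distinguishes genuine functional routing from diffuse, redundant computation.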