Moonshot AI (the lab behind Kimi) has published a paper introducing a new architecture for Transformer models based on 'attention residuals'. This mechanism replaces the traditional residual connections that have been standard since ResNet introduced them in 2015.
The Dilution Problem
Standard residual connections accumulate the outputs of all previous layers into a single running sum. By the time the signal reaches a deep layer, say layer 40, the outputs of layers 1 through 39 have all been added together. According to Kimi, this dilutes the information contributed by the early layers, since each one becomes an ever-smaller fraction of the total.
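The dilution can be seen with a toy calculation. The sketch below is illustrative only (the function name and scalar setup are made up, not from the paper): with plain additive residuals, layer 1's contribution shrinks as more layer outputs pile onto the stream.

```python
# Toy illustration of the dilution problem with standard residuals.
# h_n = h_{n-1} + f_n(h_{n-1}): every layer output is summed into one stream.
def standard_residual_stream(x, layer_outputs):
    """Return the residual stream value after each layer."""
    h = x
    stream = []
    for out in layer_outputs:
        h = h + out          # plain additive residual connection
        stream.append(h)
    return stream

# Scalar stand-in: embedding contributes 1.0, each of 40 layers adds 1.0.
stream = standard_residual_stream(1.0, [1.0] * 40)

# Layer 1's share of the stream at depth 40 is tiny: 1 part in 41.
share_of_layer_1 = 1.0 / stream[-1]
```

The arithmetic is deliberately trivial, but it mirrors the claim: nothing in the plain sum lets a deep layer re-emphasize one specific early output.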
The Solution: Attention Residuals
The proposed solution is to allow each layer to selectively 'attend' to the outputs of previous layers, rather than simply summing them. In practice, each layer can choose which earlier layers are most important for the current input, using learned attention weights.
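In miniature, that selective mixing looks like a softmax-weighted sum over earlier outputs. This is a hypothetical sketch of the idea, not Kimi's implementation; the function name and the fixed scores stand in for what would be learned weights in the real model.

```python
import math

def attention_residual(prev_outputs, scores):
    """Mix previous layer outputs using softmax(scores) as attention weights.

    In the real architecture the scores would be produced by learned
    parameters conditioned on the current input; here they are given.
    """
    exps = [math.exp(s) for s in scores]
    z = sum(exps)
    weights = [e / z for e in exps]
    mixed = sum(w * o for w, o in zip(weights, prev_outputs))
    return mixed, weights

# A layer that needs early information can put a high score on layer 1,
# so its output dominates the mix instead of being diluted.
mixed, w = attention_residual([10.0, 1.0, 1.0], [2.0, 0.0, 0.0])
```

The key contrast with a plain residual sum is that the weights are input-dependent: a different token can route attention to different earlier layers.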
Results
Moonshot AI's benchmarks show:
- Gains of 3 to 7.5 points on math reasoning, code generation, and long-context benchmarks.
- A compute saving of roughly 1.25x with the 'block attention residual' variant.
- Training overhead of less than 4% and inference latency increase of less than 2%.
- Scalability: larger models benefit more from this architecture.
A 'block attention residual' variant has also been developed, in which layers are grouped into blocks. Within a block, normal residual connections are used, while attention is used between blocks. This approach maintains much of the benefit while reducing execution costs.
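A rough way to see why the block variant is cheaper is to count cross-layer mixes. The accounting below is a back-of-the-envelope sketch (the function and the block size of 8 are assumptions for illustration, not figures from the paper): per-layer attention scales with the square of the layer count, while the block variant scales with the square of the much smaller block count.

```python
def cross_mix_count(num_layers, block_size=None):
    """Count pairwise 'attend to an earlier output' operations.

    Per-layer attention residuals: layer n attends to all n-1 earlier
    outputs.  Block variant: only blocks attend to earlier blocks;
    inside a block, plain residual sums are used (no extra mixes).
    """
    if block_size is None:
        return num_layers * (num_layers - 1) // 2
    num_blocks = num_layers // block_size
    return num_blocks * (num_blocks - 1) // 2

full = cross_mix_count(40)       # all 40 layers attend to each other
block = cross_mix_count(40, 8)   # 5 blocks of 8 layers each
```

Under these toy assumptions, 40 layers need 780 cross-layer mixes, while 5 blocks need only 10, which is the intuition behind keeping most of the benefit at a fraction of the cost.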
Comparison with DeepSeek
DeepSeek recently proposed its own improvement to residual connections (mHC), but with a completely different approach: where DeepSeek adds parallel streams, Kimi introduces selective attention. According to some comparisons, Kimi's approach needs about one-sixth of the memory bandwidth of DeepSeek's mHC while achieving similar or better results.
Practical Implications
Kimi's version is potentially 'drop-in replaceable': you replace the residual module, keep everything else unchanged, retrain, and get improvements. DeepSeek mHC, on the other hand, requires a complete restructuring of the model architecture.
Final Thoughts
Karpathy commented that attention could be applied in more places in the Transformer than previously thought. For those who develop models locally, this innovation could lead to significant improvements in quality without the need for larger models: same number of parameters, better information flow, better results. The interaction with quantization remains to be evaluated, as the attention weights between layers may be sensitive to precision.