Moonshot AI Introduces Attention Residuals to Optimize Transformer Scaling

Moonshot AI Releases Attention Residuals to Replace Fixed Residual Mixing with Depth-Wise Attention for Better Scaling in Transformers

Moonshot AI has developed Attention Residuals (AttnRes) to replace the fixed residual accumulation used in standard modern Transformers. The company reports that the new architecture reaches validation losses comparable to those of standard models trained with 25% more compute.

Why This Matters

Standard Transformer architectures suffer from PreNorm dilution: residual connections add every layer's output with a fixed weight of one, so hidden-state magnitudes grow with depth and each individual layer's contribution shrinks relative to the accumulated stream. Ideally, all layers would contribute equally. In practice, the single residual stream compresses earlier outputs irreversibly and gives later layers no selective access to them. This bottleneck limits scaling efficiency and forces deeper layers to produce larger outputs to remain influential.
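The dilution effect can be illustrated with a toy simulation. The sketch below is not Moonshot AI's code. It assumes, purely for illustration, that each layer output is a random unit-norm vector added to the residual stream with a fixed weight of one. The function name `simulate_residual_growth` and its parameters are hypothetical.

```python
import math
import random


def simulate_residual_growth(num_layers=64, dim=256, seed=0):
    """Accumulate toy unit-norm layer outputs with a fixed residual weight of 1.

    Returns one (stream_norm, relative_contribution) pair per layer, where
    relative_contribution is the layer output norm (1.0) divided by the norm
    of the accumulated residual stream.
    """
    rng = random.Random(seed)
    stream = [0.0] * dim
    stats = []
    for _ in range(num_layers):
        # Assumption: a layer output is a random direction with unit norm.
        out = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        out_norm = math.sqrt(sum(x * x for x in out))
        out = [x / out_norm for x in out]
        # Fixed unit-weight residual accumulation.
        stream = [a + b for a, b in zip(stream, out)]
        stream_norm = math.sqrt(sum(x * x for x in stream))
        stats.append((stream_norm, 1.0 / stream_norm))
    return stats


stats = simulate_residual_growth()
print("layer 1:  stream norm %.2f, relative contribution %.3f" % stats[0])
print("layer 64: stream norm %.2f, relative contribution %.3f" % stats[-1])
```

Under these assumptions the stream norm keeps growing while the share of any single layer falls, which is the behavior the paragraph above describes. Real layer outputs are correlated and not unit-norm, so the exact growth rate in a trained model will differ.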

Key Insights

Moonshot AI’s scaling-law experiments (2026) show that Block AttnRes achieves lower validation loss than PreNorm baselines across all compute ranges.
Selective access lets each layer aggregate specific earlier representations through softmax attention over depth, rather than reading a single compressed residual stream.
Block AttnRes, used in Moonshot’s Kimi Linear model (48B parameters), partitions the layers into blocks and reduces depth-wise memory overhead from O(Ld) to O(Nd), where L is the number of layers, N the number of blocks, and d the hidden dimension.
MMLU performance improved from 73.5 to 74.6 when AttnRes was integrated into MoE architectures with 3B activated parameters.
Initializing the pseudo-query vectors to zero makes every depth-wise attention score equal, so AttnRes behaves like equal-weight averaging at the start of training, which prevents early instability.
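The depth-wise attention and the zero-initialization behavior described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the published implementation: the scoring rule (dot product of a per-layer pseudo-query with RMS-normalized earlier outputs) and the names `attn_res`, `rms_norm`, and `softmax` are chosen here for clarity.

```python
import math


def rms_norm(v, eps=1e-6):
    """Scale a vector to unit root-mean-square (no learned gain)."""
    rms = math.sqrt(sum(x * x for x in v) / len(v) + eps)
    return [x / rms for x in v]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attn_res(prev_outputs, pseudo_query):
    """Mix earlier representations with softmax attention over depth.

    prev_outputs: earlier layer (or block) outputs, each a list of length d.
    pseudo_query: the current layer's pseudo-query, a list of length d.
    Returns (mixed_vector, weights).
    """
    # Assumption: keys are the RMS-normalized earlier outputs.
    scores = [
        sum(q * k for q, k in zip(pseudo_query, rms_norm(v)))
        for v in prev_outputs
    ]
    weights = softmax(scores)
    d = len(pseudo_query)
    mixed = [
        sum(w * v[j] for w, v in zip(weights, prev_outputs)) for j in range(d)
    ]
    return mixed, weights


outputs = [[1.0, 0.0, 0.0, 0.0], [0.0, 2.0, 0.0, 0.0], [0.0, 0.0, 3.0, 0.0]]

# Zero pseudo-query: all scores are 0, so the weights are uniform.
_, uniform = attn_res(outputs, [0.0, 0.0, 0.0, 0.0])
print(uniform)

# A non-zero pseudo-query shifts weight toward the matching earlier output.
_, selective = attn_res(outputs, [0.0, 5.0, 0.0, 0.0])
print(selective)
```

With a zero pseudo-query the mixture is a plain average of the earlier outputs. As the pseudo-query moves away from zero during training, the layer can concentrate weight on particular earlier representations. In the block variant, `prev_outputs` would hold N block-level representations instead of L layer outputs.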

Practical Applications

Large-scale MoE training (Kimi Linear, 1.4T tokens): Block AttnRes maintains training stability by keeping output magnitudes bounded. Attending over every individual layer output instead of block-level representations incurs significant O(Ld) memory overhead under pipeline parallelism.
Reasoning-heavy tasks (Math and HumanEval evaluation): AttnRes improved Math scores from 53.5 to 57.1. Layer outputs should pass through RMSNorm before the depth-wise attention; without it, large-magnitude layers can dominate the depth-wise weights.
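The RMSNorm caveat can be made concrete with a small comparison. The sketch below is illustrative only: it assumes dot-product scoring against a pseudo-query, and the function `depth_weights`, the toy vectors, and the query values are hypothetical. Two earlier outputs point in the same direction but differ in magnitude by a factor of 50.

```python
import math


def rms_norm(v, eps=1e-6):
    """Scale a vector to unit root-mean-square (no learned gain)."""
    rms = math.sqrt(sum(x * x for x in v) / len(v) + eps)
    return [x / rms for x in v]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def depth_weights(outputs, query, normalize):
    """Depth-wise attention weights, with or without RMSNorm on the keys."""
    keys = [rms_norm(v) if normalize else v for v in outputs]
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    return softmax(scores)


small = [1.0, 1.0, 1.0, 1.0]
large = [50.0, 50.0, 50.0, 50.0]  # same direction, 50x the magnitude
query = [0.1, 0.1, 0.1, 0.1]

raw = depth_weights([small, large], query, normalize=False)
normed = depth_weights([small, large], query, normalize=True)
print("without RMSNorm:", raw)
print("with RMSNorm:   ", normed)
```

Without normalization, nearly all of the weight goes to the large-magnitude output even though both outputs carry the same direction. With RMSNorm on the keys, scale no longer influences the scores and the two outputs receive almost equal weight.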

References:

https://www.marktechpost.com/2026/03/15/moonshot-ai-releases-𝑨𝒕𝒕𝒆𝒏𝒕𝒊𝒐𝒏-𝑹𝒆𝒔𝒊𝒅/
