Skip to main content

On This Page

Moonshot AI Introduces Attention Residuals to Optimize Transformer Scaling

β€’ 2 min read
Share

These articles are AI-generated summaries. Please check the original sources for full details.

Moonshot AI Releases π‘¨π’•π’•π’†π’π’•π’Šπ’π’ π‘Ήπ’†π’”π’Šπ’…π’–π’‚π’π’” to Replace Fixed Residual Mixing with Depth-Wise Attention for Better Scaling in Transformers

Moonshot AI has developed Attention Residuals (AttnRes) to replace the standard fixed residual accumulation found in modern Transformers. The new architecture achieves validation losses comparable to standard models trained with 25% more compute.

Why This Matters

Standard Transformer architectures suffer from PreNorm dilution, where fixed unit weights in residual connections cause hidden-state magnitudes to grow with depth, weakening individual layer contributions. While ideal models assume all layers contribute equally, the technical reality is that irreversible information loss and lack of selective access create a bottleneck that limits scaling efficiency and forces deeper layers to produce larger outputs to remain influential.

Key Insights

  • Moonshot AI’s scaling laws (2026) show Block AttnRes achieves lower validation loss across all compute ranges compared to PreNorm baselines.
  • The concept of selective access allows layers to aggregate specific earlier representations using softmax attention rather than a single compressed residual stream.
  • Block AttnRes, used in Moonshot’s Kimi Linear model (48B parameters), reduces depth-wise memory overhead from O(Ld) to O(Nd) by partitioning layers into blocks.
  • Performance on the MMLU benchmark improved from 73.5 to 74.6 when integrating AttnRes into MoE architectures with 3B activated parameters.
  • Initializing pseudo-query vectors to zero allows AttnRes to behave like equal-weight averaging at the start of training, preventing early instability.

Practical Applications

  • Large-scale MoE training (Kimi Linear + 1.4T tokens): Using Block AttnRes maintains training stability by keeping output magnitudes bounded, but failing to use block-level representations can lead to significant O(Ld) memory overhead in pipeline parallelism.
  • High-reasoning tasks (Math/HumanEval evaluation): AttnRes improved Math scores from 53.5 to 57.1, though neglecting RMSNorm on layer outputs before attention can allow large-magnitude layers to dominate depth-wise weights.

References:

Continue reading

Next article

AI News Weekly Summary: Mar 07 - Mar 15, 2026

Related Content