Moonshot AI Introduces Attention Residuals to Optimize Transformer Scaling
These articles are AI-generated summaries. Please check the original sources for full details.
Moonshot AI Releases Attention Residuals to Replace Fixed Residual Mixing with Depth-Wise Attention for Better Scaling in Transformers
Moonshot AI has developed Attention Residuals (AttnRes) to replace the standard fixed residual accumulation found in modern Transformers. The new architecture achieves validation losses comparable to standard models trained with 25% more compute.
Why This Matters
Standard Transformer architectures suffer from PreNorm dilution: fixed unit weights in residual connections cause hidden-state magnitudes to grow with depth, which weakens each individual layer's contribution. Because every layer's output is folded into a single residual stream, earlier information is lost irreversibly and later layers have no selective access to specific earlier representations. The result is a bottleneck that limits scaling efficiency and forces deeper layers to produce larger outputs to remain influential.
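The dilution effect described above can be illustrated with a toy simulation. This is a minimal sketch, not Moonshot AI's setup: it assumes each layer output is a random unit-norm vector (a hypothetical stand-in for a real sublayer) and adds it to the residual stream with a fixed weight of 1. The function name and parameters are illustrative only.

```python
import math
import random


def simulate_residual_stream(num_layers=64, dim=256, seed=0):
    """Accumulate unit-norm layer outputs with fixed unit residual weights.

    Returns the hidden-state norm after each layer and the relative share
    (layer-output norm / hidden-state norm) of the most recent layer.
    """
    rng = random.Random(seed)
    hidden = [0.0] * dim
    norms, shares = [], []
    for _ in range(num_layers):
        # Assumption: a layer output is a random direction rescaled to norm 1.
        out = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        out_norm = math.sqrt(sum(x * x for x in out))
        out = [x / out_norm for x in out]

        # Standard residual accumulation: h_l = h_{l-1} + f_l (weight fixed at 1).
        hidden = [h + o for h, o in zip(hidden, out)]

        hidden_norm = math.sqrt(sum(x * x for x in hidden))
        norms.append(hidden_norm)
        shares.append(1.0 / hidden_norm)  # the layer output has norm 1
    return norms, shares


norms, shares = simulate_residual_stream()
print(f"hidden norm after first layer: {norms[0]:.2f}, after last: {norms[-1]:.2f}")
print(f"relative share of first layer: {shares[0]:.2f}, of last: {shares[-1]:.2f}")
```

Under these assumptions the hidden-state norm grows roughly with the square root of depth, so a late layer that wants the same relative influence as an early one has to emit a proportionally larger output.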
Key Insights
- Moonshot AI's scaling laws (2026) show Block AttnRes achieves lower validation loss across all compute ranges compared to PreNorm baselines.
- Selective access lets each layer aggregate specific earlier representations using softmax attention, rather than reading a single compressed residual stream.
- Block AttnRes, used in Moonshot's Kimi Linear model (48B parameters), reduces depth-wise memory overhead from O(Ld) to O(Nd) by partitioning layers into blocks.
- Performance on the MMLU benchmark improved from 73.5 to 74.6 when integrating AttnRes into MoE architectures with 3B activated parameters.
- Initializing pseudo-query vectors to zero allows AttnRes to behave like equal-weight averaging at the start of training, preventing early instability.
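The selective-access and zero-initialization points above can be sketched in a few lines. This is a simplified illustration under stated assumptions, not Moonshot AI's implementation: each layer is assumed to own one learned pseudo-query vector, keys are the RMS-normalized earlier outputs, and the values are the unnormalized earlier outputs. The names `rms_norm`, `softmax`, and `attn_residual` are hypothetical.

```python
import math


def rms_norm(vec, eps=1e-6):
    """Scale a vector to (approximately) unit root-mean-square."""
    rms = math.sqrt(sum(x * x for x in vec) / len(vec) + eps)
    return [x / rms for x in vec]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def attn_residual(pseudo_query, earlier_outputs):
    """Mix earlier layer outputs with depth-wise softmax attention.

    Assumption: logits are dot products between the layer's pseudo-query
    and the RMS-normalized earlier outputs; the mixture is taken over the
    raw (unnormalized) outputs.
    """
    keys = [rms_norm(v) for v in earlier_outputs]
    logits = [sum(q * k for q, k in zip(pseudo_query, key)) for key in keys]
    weights = softmax(logits)
    dim = len(pseudo_query)
    mixed = [
        sum(w * v[i] for w, v in zip(weights, earlier_outputs))
        for i in range(dim)
    ]
    return mixed, weights


outputs = [[1.0, 0.0, 0.0, 0.0], [0.0, 2.0, 0.0, 0.0], [0.0, 0.0, 3.0, 0.0]]

# A zero pseudo-query gives identical logits, hence equal weights.
_, uniform = attn_residual([0.0, 0.0, 0.0, 0.0], outputs)
print("weights at zero init:", uniform)

# A non-zero pseudo-query shifts weight toward the aligned earlier output.
_, selective = attn_residual([2.0, 0.0, 0.0, 0.0], outputs)
print("weights after the query moves:", selective)
```

With a zero pseudo-query every logit is 0, so the softmax reduces to equal-weight averaging, which matches the initialization behavior described in the last bullet. Because the keys are normalized in this sketch, multiplying one earlier output by a large constant leaves the attention weights essentially unchanged.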
Practical Applications
- Large-scale MoE training (Kimi Linear + 1.4T tokens): Using Block AttnRes maintains training stability by keeping output magnitudes bounded, but failing to use block-level representations can lead to significant O(Ld) memory overhead in pipeline parallelism.
- High-reasoning tasks (Math/HumanEval evaluation): AttnRes improved Math scores from 53.5 to 57.1, though neglecting RMSNorm on layer outputs before attention can allow large-magnitude layers to dominate depth-wise weights.
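The memory trade-off between per-layer and block-level representations can be made concrete with a small counting sketch. It assumes equal-sized blocks and that each block is summarized by the sum of its layers' outputs; both are simplifications chosen for illustration, and `block_summaries` and `stored_floats` are hypothetical helper names rather than part of any released code.

```python
def block_summaries(layer_outputs, num_blocks):
    """Collapse L layer outputs into N block-level vectors.

    Assumption: blocks are equal-sized and each block is represented by
    the element-wise sum of the outputs of its layers.
    """
    num_layers = len(layer_outputs)
    if num_layers % num_blocks != 0:
        raise ValueError("this sketch assumes equal-sized blocks")
    block_size = num_layers // num_blocks
    dim = len(layer_outputs[0])
    summaries = []
    for start in range(0, num_layers, block_size):
        block = layer_outputs[start:start + block_size]
        summaries.append([sum(v[i] for v in block) for i in range(dim)])
    return summaries


def stored_floats(num_vectors, dim):
    """Number of floats kept per token for depth-wise attention."""
    return num_vectors * dim


# Illustrative sizes only; not taken from any specific model.
layer_outputs = [[float(layer)] * 8 for layer in range(12)]  # L = 12, d = 8
summaries = block_summaries(layer_outputs, num_blocks=3)     # N = 3

print("per-layer storage, O(Ld):", stored_floats(len(layer_outputs), 8))
print("block-level storage, O(Nd):", stored_floats(len(summaries), 8))
```

In this sketch, depth-wise attention would then run over the N block vectors instead of all L layer outputs, which is where the reduction from O(Ld) to O(Nd) comes from.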