DeepSeek Applies 1967 Matrix Normalization to Stabilize Hyper Connections
Manifold Constrained Hyper Connections Stabilize LLM Training
DeepSeek researchers have addressed an instability that hyper connections introduce into large language model training by applying the Sinkhorn-Knopp algorithm, a matrix normalization technique from 1967. Hyper connections increase model expressivity, but they also introduce amplification factors that can cause training to diverge at scale.
Why This Matters
Traditional residual connections keep signal propagation stable because each layer adds its output to an unchanged copy of its input. Hyper connections widen this pathway into several parallel residual streams mixed by learned matrices, and the product of those matrices across many layers can make activations grow exponentially. This instability limits the effective depth and scale of models, raises training costs, and can prevent convergence. In a 27B-parameter model, amplification peaks reached 3000, which hindered stable training.
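The compounding effect can be illustrated with a toy calculation. This is a minimal sketch, not DeepSeek's setup: the stream count, the depth, and the choice of mixing matrices (identity plus a small positive random perturbation) are all assumptions made for illustration, and the gain is measured here as the largest absolute row sum of the composite matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
n_streams, depth = 4, 60  # hypothetical sizes, chosen only for illustration


def max_abs_row_sum(M):
    """Worst-case gain proxy: the largest absolute row sum of M."""
    return np.abs(M).sum(axis=1).max()


composite_plain = np.eye(n_streams)  # plain residual path: identity at every layer
composite_mixed = np.eye(n_streams)  # unconstrained mixing at every layer

for _ in range(depth):
    composite_plain = np.eye(n_streams) @ composite_plain
    # Assumed toy mixing matrix: identity plus a small positive perturbation.
    H = np.eye(n_streams) + 0.1 * rng.random((n_streams, n_streams))
    composite_mixed = H @ composite_mixed

gain_plain = max_abs_row_sum(composite_plain)
gain_mixed = max_abs_row_sum(composite_mixed)
print("Identity path gain:", gain_plain)
print("Unconstrained mixing gain:", gain_mixed)
```

Each toy mixing matrix has row sums only slightly above 1, yet the product over many layers grows by orders of magnitude, while the identity path stays at exactly 1.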
Key Insights
- Amax Gain Magnitude: DeepSeek defined this metric to measure worst-case amplification in signal paths, revealing excessive growth in hyper connections.
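- Doubly Stochastic Matrices: A doubly stochastic matrix has non-negative entries, and every row and every column sums to 1. Constraining the residual mixing matrices to this manifold makes each output stream a convex combination of the input streams, which preserves feature mass and keeps norms from growing.
- Sinkhorn-Knopp Algorithm (1967): This classical algorithm alternately rescales the rows and columns of a positive matrix until both sum to 1, which gives an efficient approximate projection onto the doubly stochastic manifold and makes the constraint practical to implement.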
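The properties listed above can be checked numerically on a small example. This is a sketch under stated assumptions: the 3-stream matrix `P` is a hypothetical mixing matrix built as a convex combination of two permutation matrices (so it is doubly stochastic by construction), and the row-sum and column-sum gains are one plausible reading of a worst-case amplification measure, not necessarily the exact definition of Amax Gain Magnitude.

```python
import numpy as np

# Hypothetical 3-stream mixing matrix: a convex combination of two
# permutation matrices, which is doubly stochastic by construction.
identity = np.eye(3)
cyclic_shift = np.roll(np.eye(3), 1, axis=0)
P = 0.7 * identity + 0.3 * cyclic_shift

rng = np.random.default_rng(0)
X = rng.normal(size=(3, 8))  # 3 residual streams, 8 features each (toy sizes)
Y = P @ X                    # each output stream is a convex mix of inputs

# Columns of P sum to 1, so the per-feature total across streams is preserved.
mass_before = X.sum(axis=0)
mass_after = Y.sum(axis=0)

# Assumed gain proxies: largest absolute row sum and largest absolute column sum.
forward_gain = np.abs(P).sum(axis=1).max()
backward_gain = np.abs(P).sum(axis=0).max()

# The largest singular value bounds how much P can stretch any input.
spectral_norm = np.linalg.norm(P, 2)

print("Mass preserved:", np.allclose(mass_before, mass_after))
print("Forward gain:", forward_gain, "Backward gain:", backward_gain)
print("Spectral norm:", spectral_norm)
```

Because both gains equal 1 and the spectral norm does not exceed 1, mixing with `P` can redistribute features across streams but cannot amplify them.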
Working Example
import numpy as np


def sinkhorn_knopp(A, num_iter=20):
    """
    Approximates a doubly stochastic matrix using the Sinkhorn-Knopp algorithm.

    Args:
        A: Square input matrix with strictly positive entries.
        num_iter: Number of Sinkhorn-Knopp iterations.

    Returns:
        Doubly stochastic approximation of A.
    """
    A = np.asarray(A, dtype=float)
    for _ in range(num_iter):
        # Rescale each column so that it sums to 1.
        A = np.divide(A, np.sum(A, axis=0, keepdims=True))
        # Rescale each row so that it sums to 1.
        A = np.divide(A, np.sum(A, axis=1, keepdims=True))
    return A


# Example usage:
A = np.random.rand(4, 4)
doubly_stochastic_A = sinkhorn_knopp(A)
print("Original Matrix A:\n", A)
print("\nDoubly Stochastic Approximation:\n", doubly_stochastic_A)
print("\nRow Sums:", np.sum(doubly_stochastic_A, axis=1))
print("\nColumn Sums:", np.sum(doubly_stochastic_A, axis=0))
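Learned parameters can take any sign, while Sinkhorn-Knopp expects positive entries. One common way to bridge that gap is to exponentiate the raw parameters before normalizing. The sketch below assumes that approach and stacks the resulting matrices across layers. The names `logits`, `n_streams`, and `depth` are hypothetical, and the random values stand in for learned parameters.

```python
import numpy as np


def sinkhorn_knopp(A, num_iter=20):
    """Alternately normalize the columns and rows of a positive matrix."""
    for _ in range(num_iter):
        A = A / A.sum(axis=0, keepdims=True)
        A = A / A.sum(axis=1, keepdims=True)
    return A


rng = np.random.default_rng(0)
n_streams, depth = 4, 60  # hypothetical sizes, chosen only for illustration

composite = np.eye(n_streams)
for _ in range(depth):
    # Stand-in for unconstrained learned parameters, which may be negative.
    logits = 0.5 * rng.normal(size=(n_streams, n_streams))
    # exp() makes every entry strictly positive before normalization.
    H = sinkhorn_knopp(np.exp(logits))
    composite = H @ composite

forward_gain = np.abs(composite).sum(axis=1).max()   # largest row sum
backward_gain = np.abs(composite).sum(axis=0).max()  # largest column sum
print("Forward gain after", depth, "layers:", forward_gain)
print("Backward gain after", depth, "layers:", backward_gain)
```

The product of doubly stochastic matrices is itself doubly stochastic, so the composite mapping keeps row and column sums near 1 regardless of depth, which is the property that keeps the widened residual path from amplifying signals.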
Practical Applications
- DeepSeek MoE Models: DeepSeek applied manifold-constrained hyper connections (mHC) to 3B, 9B, and 27B mixture-of-experts models and reported improved performance on language benchmarks.
- Pitfall: Naively increasing the width of residual connections (as in hyper connections) without regularization can lead to exploding gradients and unstable training, negating potential benefits.