DeepSeek Applies 1967 Matrix Normalization to Stabilize Hyper Connections

Manifold Constrained Hyper Connections Stabilize LLM Training

DeepSeek researchers have addressed a source of instability in large language model training caused by hyper connections by applying the Sinkhorn-Knopp algorithm, a matrix normalization technique from 1967. Hyper connections increase model expressivity, but they also introduce amplification factors that can cause training to diverge at scale.

Why This Matters

Traditional residual connections keep signal propagation stable because each layer passes its input through an identity path. Hyper connections widen this pathway into several residual streams that are mixed by learned matrices, and without constraints that mixing can compound across layers so that activations grow exponentially with depth. This instability limits the effective depth and scale of models, increases training costs, and can prevent convergence. In a 27B-parameter model, amplification peaked at about 3000, which hindered stable training.
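The compounding effect can be illustrated with a toy calculation. The sketch below is not DeepSeek's code: it assumes four residual streams, a hypothetical helper `worst_case_gain` that uses the maximum absolute row sum as a stand-in for worst-case amplification, and nonnegative random mixing matrices that sit slightly above the identity. Under those assumptions, the gain of the composite mapping grows roughly geometrically with depth, while a plain identity residual path stays at exactly 1.

```python
import numpy as np

def worst_case_gain(M):
    """Maximum absolute row sum (infinity norm) of a mixing matrix.

    Hypothetical stand-in for a worst-case amplification metric.
    """
    return float(np.max(np.sum(np.abs(M), axis=1)))

rng = np.random.default_rng(0)
n_streams, depth = 4, 60  # illustrative sizes, not taken from the article

identity_path = np.eye(n_streams)
unconstrained_path = np.eye(n_streams)
gains = []
for _ in range(depth):
    # Unconstrained nonnegative mixing: identity plus a small random term,
    # so every row sums to slightly more than 1.
    layer = np.eye(n_streams) + 0.1 * rng.uniform(0.0, 1.0, (n_streams, n_streams))
    unconstrained_path = layer @ unconstrained_path
    identity_path = np.eye(n_streams) @ identity_path
    gains.append(worst_case_gain(unconstrained_path))

print("Identity residual gain:", worst_case_gain(identity_path))
print("Unconstrained gain after 1 layer:", gains[0])
print(f"Unconstrained gain after {depth} layers:", gains[-1])
```

Because every layer here has row sums above 1 and no negative entries, the worst-case gain can only increase from one layer to the next. Real learned matrices are not restricted this way, so this is a simplified picture of the mechanism rather than a reproduction of the reported numbers.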

Key Insights

Amax Gain Magnitude: DeepSeek defined this metric to measure worst-case amplification in signal paths, revealing excessive growth in hyper connections.
Doubly Stochastic Matrices: Constraining residual mixing matrices to this manifold ensures a convex combination of residual streams, preserving feature mass and regularizing norms.
Sinkhorn-Knopp Algorithm (1967): This classical algorithm efficiently projects matrices onto the doubly stochastic manifold, enabling practical implementation of the constraint.
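
A small numerical check can make these three points concrete. The sketch below is illustrative only: `random_doubly_stochastic` and `worst_case_gain` are hypothetical helpers, the doubly stochastic matrices are built as convex combinations of permutation matrices rather than by a Sinkhorn-Knopp projection, and the maximum absolute row sum is assumed as a proxy for worst-case gain. It suggests that a deep product of such matrices stays doubly stochastic, keeps the proxy gain at 1, and preserves the sum across residual streams.

```python
import numpy as np

def random_doubly_stochastic(n, rng, num_perms=6):
    """Convex combination of random permutation matrices (hypothetical helper)."""
    weights = rng.dirichlet(np.ones(num_perms))  # nonnegative, sums to 1
    M = np.zeros((n, n))
    for w in weights:
        # Row-permuted identity is a permutation matrix.
        M += w * np.eye(n)[rng.permutation(n)]
    return M

def worst_case_gain(M):
    """Maximum absolute row sum, assumed here as a worst-case gain proxy."""
    return float(np.max(np.sum(np.abs(M), axis=1)))

rng = np.random.default_rng(0)
n_streams, depth = 4, 60  # illustrative sizes, not taken from the article

# Composite mapping across all layers.
composite = np.eye(n_streams)
for _ in range(depth):
    composite = random_doubly_stochastic(n_streams, rng) @ composite

# Mix a toy residual state: 4 streams, 8 features each.
x = rng.standard_normal((n_streams, 8))
mixed = composite @ x

print("Row sums:", composite.sum(axis=1))
print("Column sums:", composite.sum(axis=0))
print("Worst-case gain proxy:", worst_case_gain(composite))
print("Feature mass preserved:", np.allclose(mixed.sum(axis=0), x.sum(axis=0)))
print("Largest magnitude before/after:", np.abs(x).max(), np.abs(mixed).max())
```

Each output stream is a convex combination of the input streams, so no entry of the mixed state can exceed the largest input magnitude, and the column-sum constraint keeps the total across streams unchanged.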

Working Example

import numpy as np

def sinkhorn_knopp(A, num_iter=20):
    """
    Approximates a doubly stochastic matrix using the Sinkhorn-Knopp algorithm.

    Args:
        A: Square input matrix with strictly positive entries.
        num_iter: Number of Sinkhorn-Knopp iterations.

    Returns:
        Approximately doubly stochastic scaling of A (the input is not modified).
    """
    A = np.asarray(A, dtype=float)
    for _ in range(num_iter):
        # Normalize columns so each sums to 1, then rows so each sums to 1.
        A = A / np.sum(A, axis=0, keepdims=True)
        A = A / np.sum(A, axis=1, keepdims=True)
    return A

# Example usage:
A = np.random.rand(4, 4)
doubly_stochastic_A = sinkhorn_knopp(A)

print("Original Matrix A:\n", A)
print("\nDoubly Stochastic Approximation:\n", doubly_stochastic_A)
print("\nRow Sums:", np.sum(doubly_stochastic_A, axis=1))
print("\nColumn Sums:", np.sum(doubly_stochastic_A, axis=0))
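
The example above starts from a matrix whose entries are already positive. Learned mixing parameters are generally unconstrained real values, so some positivity step is needed before the normalization. The sketch below assumes an elementwise exponential for that step; the helper name `sinkhorn_from_logits` and the example values are hypothetical, and it is not presented as DeepSeek's implementation.

```python
import numpy as np

def sinkhorn_from_logits(logits, num_iter=20):
    """Map real-valued logits to an approximately doubly stochastic matrix."""
    logits = np.asarray(logits, dtype=float)
    # Subtracting the max only rescales the matrix by a global constant,
    # which the first normalization removes, and it avoids overflow in exp.
    M = np.exp(logits - logits.max())
    for _ in range(num_iter):
        M = M / M.sum(axis=0, keepdims=True)  # columns sum to 1
        M = M / M.sum(axis=1, keepdims=True)  # rows sum to 1
    return M

# Hypothetical unconstrained parameters, including negative values.
logits = np.array([
    [0.5, -1.0, 0.2, 0.0],
    [-0.3, 0.8, -0.6, 0.1],
    [1.0, -0.2, 0.4, -0.9],
    [0.0, 0.3, -0.7, 0.6],
])
H = sinkhorn_from_logits(logits, num_iter=50)

print("Mixing matrix:\n", H)
print("Row sums:", H.sum(axis=1))
print("Column sums:", H.sum(axis=0))
```

Because the loop ends on a row normalization, the row sums are exact up to floating-point error, while the column sums are only approximately 1 and tighten as `num_iter` grows.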

Practical Applications

DeepSeek MoE Models: DeepSeek applied mHC (Manifold Constrained Hyper Connections) to 3B, 9B, and 27B mixture-of-experts models and reported improved performance on language benchmarks.
Pitfall: Naively increasing the width of residual connections (as in hyper connections) without regularization can lead to exploding gradients and unstable training, negating potential benefits.

References:

https://www.marktechpost.com/2026/01/03/deepseek-researchers-apply-a-1967-matrix-normalization-algorithm-to-fix-instability-in-hyper-connections/
