DeepSeek Applies 1967 Matrix Normalization to Stabilize Hyper Connections

These articles are AI-generated summaries. Please check the original sources for full details.

Manifold Constrained Hyper Connections Stabilize LLM Training

DeepSeek researchers have addressed a training instability in large language models caused by hyper connections by applying the Sinkhorn-Knopp algorithm, a matrix normalization technique from 1967. Hyper connections increase model expressivity, but they also introduce amplification factors that can cause training to diverge at scale.

Why This Matters

Traditional residual connections keep signal propagation stable. Hyper connections widen this pathway into several residual streams that are mixed at every layer, and because the per-layer mixing factors multiply across depth, activations can grow exponentially. This instability limits the effective depth and scale of models, increases training costs, and can prevent convergence. In a 27B-parameter model, amplification peaked at about 3000, which hindered stable training.
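
The compounding effect can be illustrated with a small NumPy sketch. Everything in it is an assumption made for illustration rather than a detail from the source: the four streams, the 40 layers, the hand-picked mixing matrix H, and the helper worst_case_gain, which uses the largest absolute row sum as a simple stand-in for worst-case amplification.

```python
import numpy as np

def worst_case_gain(M):
    """Largest absolute row sum of M (the induced infinity norm): the most
    that M can scale the largest-magnitude entry of an input vector."""
    return float(np.abs(M).sum(axis=1).max())

n_streams, depth = 4, 40

# Hypothetical per-layer mixing matrix: identity plus a little cross-stream
# leakage. Every row sums to 1.2, so a single layer amplifies by at most 1.2x.
H = np.eye(n_streams) + 0.05 * np.ones((n_streams, n_streams))

composite = np.eye(n_streams)
gains = []
for _ in range(depth):
    composite = H @ composite  # signal path through one more layer
    gains.append(worst_case_gain(composite))

# Same matrix rescaled so that every row and column sums to 1.
constrained_gain = worst_case_gain(np.linalg.matrix_power(H / 1.2, depth))

print(f"unconstrained, 1 layer: {gains[0]:.2f}")
print(f"unconstrained, {depth} layers: {gains[-1]:.2f}")
print(f"rows/columns summing to 1, {depth} layers: {constrained_gain:.2f}")
```

A per-layer factor of only 1.2 compounds to 1.2**40, roughly 1470, after 40 layers, while the rescaled matrix keeps the gain at 1 regardless of depth.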

Key Insights

  • Amax Gain Magnitude: DeepSeek defined this metric to measure worst-case amplification in signal paths, revealing excessive growth in hyper connections.
  • Doubly Stochastic Matrices: A doubly stochastic matrix has nonnegative entries, and each of its rows and columns sums to 1. Constraining residual mixing matrices to this manifold makes every mixed stream a convex combination of the residual streams, preserving feature mass and regularizing norms.
  • Sinkhorn-Knopp Algorithm (1967): This classical algorithm alternately normalizes the rows and columns of a positive matrix until it is approximately doubly stochastic, which makes the constraint practical to implement.
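
The properties listed above can be checked numerically. The sketch below is illustrative only: the matrices P and Q, the stream values, and the helper is_doubly_stochastic are made up for this example and do not come from DeepSeek's code.

```python
import numpy as np

def is_doubly_stochastic(M, tol=1e-9):
    """True if M is nonnegative and all row and column sums equal 1."""
    return bool(
        np.all(M >= -tol)
        and np.allclose(M.sum(axis=0), 1.0, atol=tol)
        and np.allclose(M.sum(axis=1), 1.0, atol=tol)
    )

# Two hand-written doubly stochastic mixing matrices (hypothetical values).
P = np.array([[0.7, 0.2, 0.1],
              [0.2, 0.5, 0.3],
              [0.1, 0.3, 0.6]])
Q = np.array([[0.1, 0.6, 0.3],
              [0.6, 0.3, 0.1],
              [0.3, 0.1, 0.6]])

# Three residual streams (rows), each holding a 2-dimensional feature vector.
streams = np.array([[4.0, -2.0],
                    [0.0, 6.0],
                    [-1.0, 2.0]])
mixed = P @ streams

# Feature mass: columns of P sum to 1, so the total over streams is unchanged.
print("sum over streams before:", streams.sum(axis=0))
print("sum over streams after: ", mixed.sum(axis=0))

# Norm regularization: each mixed stream is a convex combination of the
# inputs, so the largest magnitude cannot increase.
print("max |entry| before:", np.abs(streams).max())
print("max |entry| after: ", np.abs(mixed).max())

# Closure: stacking layers multiplies mixing matrices, and the product of
# doubly stochastic matrices is again doubly stochastic.
print("P @ Q doubly stochastic:", is_doubly_stochastic(P @ Q))
```

Closure under multiplication is what keeps the composite mapping across many layers bounded, rather than just each layer on its own.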

Working Example

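import numpy as np

def sinkhorn_knopp(A, num_iter=20):
    """
    Approximates a doubly stochastic matrix using the Sinkhorn-Knopp algorithm.

    Args:
        A: Square input matrix with strictly positive entries.
        num_iter: Number of Sinkhorn-Knopp iterations.

    Returns:
        Approximately doubly stochastic scaling of A. Rows sum to 1 up to
        floating-point error; column sums approach 1 as num_iter grows.
    """
    A = np.asarray(A, dtype=float)
    if A.ndim != 2 or A.shape[0] != A.shape[1]:
        raise ValueError("A must be a square matrix")
    if np.any(A <= 0):
        # Positive entries are a simple sufficient condition that also
        # rules out division by a zero row or column sum.
        raise ValueError("A must have strictly positive entries")
    for _ in range(num_iter):
        A = A / A.sum(axis=0, keepdims=True)  # make every column sum to 1
        A = A / A.sum(axis=1, keepdims=True)  # make every row sum to 1
    return A

# Example usage:
rng = np.random.default_rng(0)  # fixed seed so the output is reproducible
A = rng.uniform(0.1, 1.0, size=(4, 4))  # strictly positive entries
doubly_stochastic_A = sinkhorn_knopp(A)

print("Original Matrix A:\n", A)
print("\nDoubly Stochastic Approximation:\n", doubly_stochastic_A)
print("\nRow Sums:", np.sum(doubly_stochastic_A, axis=1))
print("\nColumn Sums:", np.sum(doubly_stochastic_A, axis=0))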
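
Learned mixing parameters are generally unconstrained real numbers, so they may be negative, whereas Sinkhorn-Knopp expects positive entries. One common way to bridge that gap, assumed here for illustration and not taken from this article, is to exponentiate the parameters before normalizing. The function name project_to_doubly_stochastic and the random logits are hypothetical.

```python
import numpy as np

def project_to_doubly_stochastic(logits, num_iter=20):
    """Exponentiate real-valued logits, then run Sinkhorn-Knopp iterations."""
    logits = np.asarray(logits, dtype=float)
    # Subtracting the max only guards against overflow in exp; the global
    # scale factor it introduces is removed by the first normalization.
    M = np.exp(logits - logits.max())
    for _ in range(num_iter):
        M = M / M.sum(axis=0, keepdims=True)  # make every column sum to 1
        M = M / M.sum(axis=1, keepdims=True)  # make every row sum to 1
    return M

rng = np.random.default_rng(0)
logits = rng.uniform(-1.0, 1.0, size=(4, 4))  # arbitrary values, some negative

M = project_to_doubly_stochastic(logits)
row_err = float(np.abs(M.sum(axis=1) - 1.0).max())
col_err = float(np.abs(M.sum(axis=0) - 1.0).max())

print("all entries positive:", bool(np.all(M > 0)))
print("max row-sum error:", row_err)
print("max column-sum error:", col_err)
```

Because the loop ends with a row normalization, the row sums are exact up to floating-point error, while the column-sum error measures how far the iteration still is from convergence. Raising num_iter shrinks that error at the cost of more computation.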

Practical Applications

  • DeepSeek MoE Models: DeepSeek applied manifold constrained hyper connections (mHC) to 3B, 9B, and 27B mixture-of-experts models and reported improved performance on language benchmarks.
  • Pitfall: Naively increasing the width of residual connections (as in hyper connections) without regularization can lead to exploding gradients and unstable training, negating potential benefits.
