Why Gradient Descent Zigzags and How Momentum Fixes It




Vanilla gradient descent is inefficient on loss surfaces with uneven curvature and often needs more iterations to converge than momentum-based updates. In a controlled simulation, vanilla gradient descent took 185 steps to reach the minimum, whereas momentum converged in 159 steps.

Why This Matters

In real-world neural network training, loss surfaces are rarely symmetric and often have high condition numbers, meaning the curvature is far steeper in one direction than in another (100x in the example used here). Vanilla gradient descent then faces a trade-off: the learning rate must stay low enough to avoid divergence along the steep direction, which leaves progress along the flat direction, where it is most needed, close to stagnant.
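The trade-off can be made concrete with a little arithmetic. The sketch below assumes a toy quadratic bowl, f(x, y) = 0.5 * (LAM_STEEP * x**2 + LAM_FLAT * y**2), with curvatures 100 and 1; the names and values are illustrative assumptions, not taken from the original simulation. On such a bowl, each vanilla gradient descent step multiplies a coordinate by (1 - lr * curvature), so one learning rate has to serve both axes.

```python
# Assumed toy problem: f(x, y) = 0.5 * (LAM_STEEP * x**2 + LAM_FLAT * y**2).
LAM_STEEP = 100.0  # curvature along the steep axis (assumed value)
LAM_FLAT = 1.0     # curvature along the flat axis (assumed value)


def contraction_factors(lr):
    """Per-step multipliers that vanilla gradient descent applies to each axis."""
    return 1.0 - lr * LAM_STEEP, 1.0 - lr * LAM_FLAT


stability_limit = 2.0 / LAM_STEEP  # 2 / lambda_max

for lr in (0.005, 0.019, 0.021):
    steep, flat = contraction_factors(lr)
    # A multiplier with magnitude >= 1 means that coordinate never shrinks.
    status = "stable" if abs(steep) < 1.0 else "diverges"
    print(f"lr={lr}: steep factor={steep:+.3f}, flat factor={flat:+.3f} ({status})")
```

Under these assumptions the limit is 0.02. Just below it, the steep coordinate flips sign every step (a negative multiplier, which is the zigzag) while the flat coordinate shrinks by only about 2% per step.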

Key Insights

  • Anisotropic surfaces with high condition numbers (e.g., 100) force vanilla gradient descent into inefficient zigzagging patterns.
  • Momentum introduces a velocity term that acts as an exponential moving average of past gradients to smooth parameter updates.
  • In steep directions, alternating gradient signs cancel out in the velocity update, effectively dampening oscillations.
  • Consistent gradients in flatter directions accumulate over time, allowing the optimizer to accelerate across plateaus.
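  • The stability limit for the vanilla gradient descent learning rate is 2/lambda_max; a learning rate above it makes the iterates diverge along the steepest direction. This limit concerns the learning rate, not beta: a beta that is too high (e.g., 0.99) fails differently, by overshooting the minimum and not settling.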
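The cancellation and accumulation effects in the list above can be seen by feeding the velocity update two idealized gradient streams. This is a minimal sketch: the unit-magnitude gradients and the helper name `velocity_after` are assumptions made for illustration, and real gradients change size as the optimizer moves.

```python
def velocity_after(gradients, beta=0.9):
    """Run the exponential-moving-average velocity update over a gradient sequence."""
    v = 0.0
    for g in gradients:
        v = beta * v + (1 - beta) * g
    return v


steps = 50
# Steep axis (idealized): the gradient flips sign on every step.
alternating = [1.0 if i % 2 == 0 else -1.0 for i in range(steps)]
# Flat axis (idealized): the gradient keeps the same sign.
consistent = [1.0] * steps

v_alt = velocity_after(alternating)
v_con = velocity_after(consistent)
print(f"alternating gradients -> velocity {v_alt:+.4f}")
print(f"consistent gradients  -> velocity {v_con:+.4f}")
```

With beta=0.9, the alternating stream leaves a velocity of only about 0.05 in magnitude, while the consistent stream builds up to nearly the full gradient value of 1.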

Working Examples

Comparison of Vanilla Gradient Descent and Momentum update logic.

import numpy as np

# grad(x, y) is assumed to be defined elsewhere and to return the gradient
# of the loss at (x, y) as a length-2 NumPy array.

def gradient_descent(start, lr, steps=300):
    path = [np.array(start, dtype=float)]
    pos = np.array(start, dtype=float)
    for _ in range(steps):
        # Step directly against the current gradient.
        pos = pos - lr * grad(*pos)
        path.append(pos.copy())
    return np.array(path)

def momentum_gd(start, lr, beta, steps=300):
    path = [np.array(start, dtype=float)]
    pos = np.array(start, dtype=float)
    v = np.zeros(2)  # velocity starts at rest
    for _ in range(steps):
        g = grad(*pos)
        # Velocity is an exponential moving average of past gradients.
        v = beta * v + (1 - beta) * g
        pos = pos - lr * v
        path.append(pos.copy())
    return np.array(path)
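To run the two functions end to end, they need a loss gradient and a convergence criterion, neither of which is given in this summary. The sketch below fills both in with assumptions: a bowl f(x, y) = 0.5 * (x**2 + 100 * y**2), a start point of (4, 1), a learning rate of 0.019, and a hypothetical helper `steps_to_converge` that reports the first iterate within 0.1 of the minimum.

```python
import numpy as np


def grad(x, y):
    # Assumed bowl f(x, y) = 0.5 * (x**2 + 100 * y**2): the y axis is 100x steeper.
    return np.array([x, 100.0 * y])


def gradient_descent(start, lr, steps=300):
    path = [np.array(start, dtype=float)]
    pos = np.array(start, dtype=float)
    for _ in range(steps):
        pos = pos - lr * grad(*pos)
        path.append(pos.copy())
    return np.array(path)


def momentum_gd(start, lr, beta, steps=300):
    path = [np.array(start, dtype=float)]
    pos = np.array(start, dtype=float)
    v = np.zeros(2)
    for _ in range(steps):
        g = grad(*pos)
        v = beta * v + (1 - beta) * g
        pos = pos - lr * v
        path.append(pos.copy())
    return np.array(path)


def steps_to_converge(path, tol=0.1):
    """Index of the first iterate within `tol` of the minimum at the origin, or None."""
    dist = np.linalg.norm(path, axis=1)
    hits = np.nonzero(dist < tol)[0]
    return int(hits[0]) if hits.size else None


start = (4.0, 1.0)  # assumed start point
lr = 0.019          # assumed; just under 2 / lambda_max = 0.02 for this bowl

vanilla_steps = steps_to_converge(gradient_descent(start, lr))
momentum_steps = steps_to_converge(momentum_gd(start, lr, beta=0.9))
print("vanilla gradient descent:", vanilla_steps, "steps")
print("momentum (beta=0.9):     ", momentum_steps, "steps")
```

Under these assumed settings the momentum run reaches the tolerance in fewer steps than the vanilla run. The exact counts depend on the start point, learning rate, and tolerance, so they should not be expected to match the 185 and 159 reported for the original simulation.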

Practical Applications

  • Use Case: Training deep neural networks on complex loss surfaces where beta=0.9 serves as the typical sweet spot for stabilizing updates. Pitfall: Setting beta too high (e.g., 0.99) causes the optimizer to overshoot the minimum and fail to stabilize.
  • Use Case: Navigating anisotropic bowls where one axis is 100x steeper than the other; in the simulation, momentum reduced the steps to convergence from 185 to 159. Pitfall: A learning rate above the stability limit (2 / lambda_max) causes the optimizer to diverge outright.
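The beta pitfall above can be illustrated on a single axis. This is a rough sketch under assumptions: it uses a 1-D quadratic f(x) = 0.5 * curvature * x**2 with curvature 1 (standing in for the flat direction), a start of 4.0, and a learning rate of 0.019, none of which come from the original simulation. The helper `momentum_1d` is hypothetical and applies the same velocity update as `momentum_gd`.

```python
def momentum_1d(x0, lr, beta, curvature=1.0, steps=300):
    """Momentum on the assumed 1-D quadratic f(x) = 0.5 * curvature * x**2."""
    x, v = float(x0), 0.0
    xs = [x]
    for _ in range(steps):
        g = curvature * x              # gradient of the quadratic at x
        v = beta * v + (1 - beta) * g  # same velocity update as momentum_gd
        x = x - lr * v
        xs.append(x)
    return xs


for beta in (0.9, 0.99):
    xs = momentum_1d(x0=4.0, lr=0.019, beta=beta)
    # The minimum is at x = 0, so a sign change means the iterate overshot it.
    overshoots = min(xs) < 0.0
    print(f"beta={beta}: final x={xs[-1]:+.4f}, overshoots minimum: {overshoots}")
```

In this setup, beta=0.9 approaches the minimum from one side, while beta=0.99 carries enough accumulated velocity to swing past it and is still oscillating when the 300 steps run out.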
