
Momentum & Nesterov Acceleration

Giving gradient descent a sense of velocity and foresight. The physics behind faster convergence.

Introduction

Vanilla gradient descent is memoryless. Each step, it looks at the current gradient, takes a step, and immediately forgets everything. It doesn't know if it's been heading in the same direction for 1000 iterations or if it just turned around.

This causes two major problems:

Oscillation

In narrow valleys (ravines), SGD bounces back and forth between steep walls, wasting computation.

Slow Progress

Along flat directions, gradients are tiny. SGD crawls, never building up speed.

The Solution

Momentum solves both by giving the optimizer memory. Nesterov improves it further by adding foresight.

The SGD Problem: Ravines

Real loss surfaces aren't perfect spherical bowls. They're often ravines: elongated valleys where the surface curves sharply in one direction and gently in another.

The Ravine Scenario

Consider f(x, y) = x^2 + 10y^2. The y-direction has 10× steeper curvature than the x-direction.

Y-Direction (Steep Walls)

Gradient = 20y (HUGE)

SGD takes big steps, overshoots, bounces back

X-Direction (Gentle Floor)

Gradient = 2x (small)

SGD takes tiny steps, makes slow progress

The Result

SGD zigzags wildly across the ravine (y-direction) while creeping along the floor (x-direction). Most computation is wasted on oscillations, not progress toward the minimum.
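
The effect is easy to reproduce. Below is a minimal NumPy sketch of vanilla gradient descent on this ravine; the starting point and learning rate are illustrative choices, not tuned values.

import numpy as np

def grad(p):
    """Gradient of f(x, y) = x^2 + 10y^2."""
    x, y = p
    return np.array([2 * x, 20 * y])

p = np.array([-5.0, 1.0])  # start partway up a ravine wall
lr = 0.09                  # large enough that the steep y-direction overshoots

for t in range(10):
    p = p - lr * grad(p)
    print(f"step {t}: x = {p[0]:+.4f}, y = {p[1]:+.4f}")

# y flips sign almost every step (bouncing between the walls),
# while x shrinks by only 18% per step (creeping along the floor).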

Physics of Momentum

The fix comes from physics. Standard SGD treats the optimization variable like a massless particle: apply a force (gradient), it moves instantly; remove the force, it stops instantly.

Momentum treats it like a heavy ball rolling down a hill. The ball has inertia. Once it's moving, it tends to keep moving in the same direction, even if the local slope changes.

Velocity v

The ball doesn't move based on current gradient alone; it moves based on accumulated velocity. The gradient is just a force that changes that velocity.

Friction β

Without friction, the ball would oscillate forever in a bowl. We need a decay factor (like air resistance) so the ball eventually stops at the bottom.

Classical Momentum

Instead of updating weights directly with the gradient, we update a velocity vector:

Velocity Update

v_{t+1} = \beta v_t + \eta \nabla L(\theta_t)

Parameter Update

\theta_{t+1} = \theta_t - v_{t+1}

β

Momentum coefficient (typically 0.9). Controls how much "memory" we keep.

η

Learning rate.

v_t

Velocity at step t. An exponential moving average of past gradients.

Exponential Moving Average

With β = 0.9, the velocity roughly averages the last 1/(1 − 0.9) = 10 gradients. Recent gradients matter most; ancient history fades.
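
Translated directly into code, the two updates look like this (a minimal sketch; grad_fn, the starting point, and the hyperparameters are placeholders):

import numpy as np

def momentum_sgd(grad_fn, theta0, lr=0.01, beta=0.9, steps=500):
    """Classical (heavy-ball) momentum."""
    theta = np.asarray(theta0, dtype=float)
    v = np.zeros_like(theta)                # velocity starts at rest
    for _ in range(steps):
        v = beta * v + lr * grad_fn(theta)  # v_{t+1} = beta * v_t + eta * grad
        theta = theta - v                   # theta_{t+1} = theta_t - v_{t+1}
    return theta

def ravine_grad(p):
    return np.array([2 * p[0], 20 * p[1]])

print(momentum_sgd(ravine_grad, [-5.0, 1.0]))  # converges toward (0, 0)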

Interactive: SGD vs Momentum

Watch the optimizers navigate a ravine. Notice how SGD oscillates wildly while momentum methods smooth out the path.

[Interactive widget: an optimizer race across the ravine, with live loss readouts for SGD, Momentum, and Nesterov.]

How Momentum Kills Oscillations

In a ravine, gradients alternate: left wall → right wall → left wall. Without momentum, SGD follows each gradient blindly, bouncing forever.

With Momentum

When gradients point left-right-left-right, they cancel out in the velocity. The y-component of velocity shrinks to near zero.

Meanwhile, the x-component (along the valley floor) doesn't cancel. It accumulates! The optimizer builds up speed in the consistent direction and ignores the oscillations.

This is exactly like a car's shock absorbers: they damp out high-frequency bumps while preserving the smooth, long-term trajectory.
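
A few lines of NumPy make the cancellation concrete; the gradient stream below is hand-made for illustration:

import numpy as np

beta, lr = 0.9, 0.1
v = np.zeros(2)
for t in range(50):
    # x-gradient is consistent (valley floor); y-gradient alternates (ravine walls)
    g = np.array([1.0, (-1.0) ** t])
    v = beta * v + lr * g

print(v)
# v[0] accumulates toward lr / (1 - beta) = 1.0,
# while v[1] stays pinned near +/- lr / (1 + beta) ~= 0.05.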

Nesterov Acceleration

Classical momentum has a subtle problem: it's reactive. It computes the gradient at the current position, then applies momentum. But we're about to move due to momentum anyway. Why not look ahead first?

Lookahead Position

\tilde{\theta}_t = \theta_t - \beta v_t

Velocity Update (at lookahead)

v_{t+1} = \beta v_t + \eta \nabla L(\tilde{\theta}_t)

Parameter Update

\theta_{t+1} = \theta_t - v_{t+1}

We compute the gradient not at \theta_t, but at the "lookahead" position \tilde{\theta}_t where momentum would take us. This gives Nesterov its predictive power.
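
In code, the only change from classical momentum is where the gradient is evaluated (again a sketch; grad_fn and the hyperparameters are placeholders):

import numpy as np

def nesterov_sgd(grad_fn, theta0, lr=0.01, beta=0.9, steps=500):
    """Nesterov momentum: the gradient is evaluated at the lookahead point."""
    theta = np.asarray(theta0, dtype=float)
    v = np.zeros_like(theta)
    for _ in range(steps):
        lookahead = theta - beta * v            # where momentum would take us
        v = beta * v + lr * grad_fn(lookahead)  # gradient evaluated *there*
        theta = theta - v
    return theta

print(nesterov_sgd(lambda p: np.array([2 * p[0], 20 * p[1]]), [-5.0, 1.0]))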

Nesterov's Lookahead

Classical Momentum

1. Look at current gradient
2. Update velocity
3. Move

Reactive: responds after the fact

Nesterov

1. Jump ahead (momentum)
2. Look at gradient there
3. Correct the jump

Proactive: anticipates the future

Why This Helps

If momentum is about to overshoot, the gradient at the lookahead position will point backward, "correcting" the momentum before it's too late. Nesterov catches mistakes earlier, leading to faster convergence.

Tuning Beta

Beta (β) controls the momentum strength. Higher β = more memory, slower decay.

β = 0

No momentum. Just vanilla SGD. Forgets everything instantly.

β = 0.9

Standard choice. Averages ~10 past gradients. Good for most problems.

β = 0.99

High momentum. Averages ~100 gradients. Very smooth, can overshoot.
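
These rules of thumb follow from the decay rule: with no new gradient input, the velocity retains β^k of its magnitude after k steps. A quick sketch to verify the numbers (including those in the widget below):

import math

for beta in (0.5, 0.9, 0.99):
    half_life = math.log(0.5) / math.log(beta)  # steps until the velocity halves
    window = 1 / (1 - beta)                     # effective number of averaged gradients
    print(f"beta={beta}: half-life = {half_life:.1f} steps, "
          f"window = {window:.0f} gradients")

# beta=0.5:  half-life = 1.0 steps,  window = 2 gradients
# beta=0.9:  half-life = 6.6 steps,  window = 10 gradients
# beta=0.99: half-life = 69.0 steps, window = 100 gradients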

Interactive: Beta Tuning

Explore how β affects velocity decay and memory. Adjust the slider to see the trade-offs.

[Interactive widget: momentum coefficient (β) tuning, with a slider from β = 0 (no momentum) to β = 0.99 (very strong). At the default β = 0.90, velocity decays by 10% per step: half-life ≈ 6.6 steps, effective window ≈ 10 gradients, and about 12.2% of the initial velocity remains after 20 steps.]

Convergence Rates

For smooth convex functions, these are the provable convergence rates (how fast the loss decreases):

SGD (no momentum): O(1/t)
Momentum: O(1/t)
Nesterov: O(1/t^2)
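
For concreteness, these rates correspond to the standard bounds for a convex loss whose gradient is M-Lipschitz, as usually stated with step size η = 1/M (heavy-ball momentum attains the same O(1/t) order with a different constant):

L(\theta_t) - L(\theta^*) \le \frac{M \, \|\theta_0 - \theta^*\|^2}{2t} \quad \text{(gradient descent)}

L(\theta_t) - L(\theta^*) \le \frac{2M \, \|\theta_0 - \theta^*\|^2}{(t+1)^2} \quad \text{(Nesterov)}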

Nesterov is Provably Optimal

For smooth convex functions, O(1/t^2) is the best possible rate using only gradient information, and Nesterov's 1983 method achieves it. No first-order method can do better.

Practical Guidance

When to Use Momentum

  • Almost always. There's rarely a reason to use vanilla SGD.
  • Computer Vision: SGD + Momentum often generalizes better than Adam.
  • If training is too noisy, try lower beta (0.5-0.8).

Nesterov vs Vanilla Momentum

  • Nesterov is usually better or equal. Use it by default.
  • In PyTorch: optimizer = SGD(..., nesterov=True), with momentum set > 0 (see the sketch after this list)
  • The improvement is more noticeable in convex/near-convex problems.
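
A minimal PyTorch sketch of that flag in context (the model and batch are placeholders):

import torch
from torch import nn

model = nn.Linear(10, 1)  # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=0.01,
                            momentum=0.9, nesterov=True)  # nesterov needs momentum > 0

x, y = torch.randn(32, 10), torch.randn(32, 1)  # placeholder batch
loss = nn.functional.mse_loss(model(x), y)
optimizer.zero_grad()
loss.backward()
optimizer.step()

Note that PyTorch keeps the learning rate outside the velocity (v ← βv + g rather than v ← βv + ηg), which is equivalent to the equations above up to a rescaling of v.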

Momentum vs Adam

This is a hot debate. Adam converges faster early in training, but SGD+Momentum often finds flatter minima that generalize better. Many vision papers use SGD+Momentum for final results, even if they prototype with Adam.