Introduction
Vanilla gradient descent is memoryless. At each step it looks at the current gradient, takes a step, and immediately forgets everything. It doesn't know whether it has been heading in the same direction for 1,000 iterations or whether it just turned around.
This causes two major problems:
Oscillation
In narrow valleys (ravines), SGD bounces back and forth between steep walls, wasting computation.
Slow Progress
Along flat directions, gradients are tiny. SGD crawls, never building up speed.
The Solution
Momentum solves both by giving the optimizer memory. Nesterov improves it further by adding foresight.
The SGD Problem: Ravines
Real loss surfaces aren't perfect spherical bowls. They're often ravines: elongated valleys where the surface curves sharply in one direction and gently in another.
The Ravine Scenario
Consider $f(x, y) = x^2 + 10y^2$. The y-direction has 10× steeper curvature than x.
Y-Direction (Steep Walls)
Gradient = 20y (HUGE)
SGD takes big steps, overshoots, bounces back
X-Direction (Gentle Floor)
Gradient = 2x (small)
SGD takes tiny steps, makes slow progress
The Result
SGD zigzags wildly across the ravine (y-direction) while creeping along the floor (x-direction). Most computation is wasted on oscillations, not progress toward the minimum.
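To make the zigzag concrete, here's a minimal sketch of vanilla SGD on this ravine; the learning rate and starting point are illustrative choices, not values from the text.

```python
import numpy as np

# Vanilla SGD on the ravine f(x, y) = x^2 + 10*y^2.
def grad(p):
    x, y = p
    return np.array([2 * x, 20 * y])  # gradient: (2x, 20y)

p = np.array([-5.0, 1.0])  # start partway up a ravine wall
lr = 0.09                  # close to the largest stable step for the steep direction

for t in range(10):
    p = p - lr * grad(p)
    print(f"step {t}: x = {p[0]:+.3f}, y = {p[1]:+.3f}")

# x shrinks by only a factor of 0.82 per step, while y flips sign every step
# (factor of -0.8): the iterate bounces between the walls instead of progressing.
```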
Physics of Momentum
The fix comes from physics. Standard SGD treats the optimization variable like a massless particle: apply a force (gradient), it moves instantly; remove the force, it stops instantly.
Momentum treats it like a heavy ball rolling down a hill. The ball has inertia. Once it's moving, it tends to keep moving in the same direction, even if the local slope changes.
Velocity
The ball doesn't move based on current gradient alone; it moves based on accumulated velocity. The gradient is just a force that changes that velocity.
Friction
Without friction, the ball would oscillate forever in a bowl. We need a decay factor (like air resistance) so the ball eventually stops at the bottom.
Classical Momentum
Instead of updating weights directly with the gradient, we update a velocity vector:
Velocity Update

$$v_t = \beta\, v_{t-1} + \nabla_\theta L(\theta_t)$$

Parameter Update

$$\theta_{t+1} = \theta_t - \eta\, v_t$$

$\beta$: Momentum coefficient (typically 0.9). Controls how much "memory" we keep.
$\eta$: Learning rate.
$v_t$: Velocity at step $t$. An exponential moving average of past gradients.
Exponential Moving Average
With $\beta = 0.9$, the velocity roughly averages the last $1/(1-\beta) = 10$ gradients. Recent gradients matter most; ancient history fades.
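Here's a minimal sketch of these two update rules on the same ravine, compared against vanilla SGD; the learning rate, β, starting point, and step count are illustrative choices.

```python
import numpy as np

# f(x, y) = x^2 + 10*y^2, as in the ravine example above.
def grad(p):
    x, y = p
    return np.array([2 * x, 20 * y])

lr, beta, steps = 0.01, 0.9, 100
start = np.array([-5.0, 1.0])

# Vanilla SGD: step directly along the current gradient.
p_sgd = start.copy()
for _ in range(steps):
    p_sgd = p_sgd - lr * grad(p_sgd)

# Classical momentum: accumulate gradients into a velocity, step along it.
p_mom = start.copy()
v = np.zeros_like(p_mom)
for _ in range(steps):
    v = beta * v + grad(p_mom)   # velocity update: EMA of past gradients
    p_mom = p_mom - lr * v       # parameter update: move along the velocity

print("SGD      distance to minimum:", np.linalg.norm(p_sgd))   # ~0.66
print("Momentum distance to minimum:", np.linalg.norm(p_mom))   # much smaller
```

With the learning rate capped by the steep direction, the momentum run ends much closer to the minimum at the origin than plain SGD after the same number of steps.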
Interactive: SGD vs Momentum
Watch the optimizers navigate a ravine. Notice how SGD oscillates wildly while momentum methods smooth out the path.
How Momentum Kills Oscillations
In a ravine, gradients alternate: left wall → right wall → left wall. Without momentum, SGD follows each gradient blindly, bouncing forever.
With Momentum
When gradients point left-right-left-right, they cancel out in the velocity. The y-component of velocity shrinks to near zero.
Meanwhile, the x-component (along the valley floor) doesn't cancel. It accumulates! The optimizer builds up speed in the consistent direction and ignores the oscillations.
This is exactly like a car's shock absorbers: they damp out high-frequency bumps while preserving the smooth, long-term trajectory.
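A tiny numeric check of this cancellation argument; β = 0.9 and the ±1 gradients are illustrative stand-ins for the wall and floor directions.

```python
beta = 0.9

# Alternating gradients, as on the ravine walls: +1, -1, +1, -1, ...
v = 0.0
for t in range(50):
    g = 1.0 if t % 2 == 0 else -1.0
    v = beta * v + g
print("velocity from alternating gradients:", round(v, 3))  # settles near +/- 0.53

# A consistent gradient, as along the valley floor: always +1.
v = 0.0
for _ in range(50):
    v = beta * v + 1.0
print("velocity from consistent gradients:", round(v, 3))   # approaches 1/(1-beta) = 10
```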
Nesterov Acceleration
Classical momentum has a subtle problem: it's reactive. It computes the gradient at the current position, then applies momentum. But we're about to move due to momentum anyway. Why not look ahead first?
Lookahead Position

$$\tilde{\theta}_t = \theta_t - \eta\, \beta\, v_{t-1}$$

Velocity Update (at lookahead)

$$v_t = \beta\, v_{t-1} + \nabla_\theta L(\tilde{\theta}_t)$$

Parameter Update

$$\theta_{t+1} = \theta_t - \eta\, v_t$$

We compute the gradient not at $\theta_t$, but at the "lookahead" position $\tilde{\theta}_t$ where momentum would take us. This gives Nesterov its predictive power.
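A minimal sketch of the Nesterov variant, using the same conventions and illustrative hyperparameters as the momentum sketch above.

```python
import numpy as np

# f(x, y) = x^2 + 10*y^2, as before.
def grad(p):
    x, y = p
    return np.array([2 * x, 20 * y])

p = np.array([-5.0, 1.0])
v = np.zeros_like(p)
lr, beta = 0.01, 0.9

for _ in range(100):
    lookahead = p - lr * beta * v   # where momentum alone would carry us
    v = beta * v + grad(lookahead)  # evaluate the gradient at the lookahead point
    p = p - lr * v                  # then take the corrected step

print("Nesterov distance to minimum:", np.linalg.norm(p))
```

The only change from classical momentum is where the gradient is evaluated.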
Nesterov's Lookahead
Classical Momentum
1. Look at current gradient
2. Update velocity
3. Move
Reactive: responds after the fact
Nesterov
1. Jump ahead (momentum)
2. Look at gradient there
3. Correct the jump
Proactive: anticipates the future
Why This Helps
If momentum is about to overshoot, the gradient at the lookahead position will point backward, "correcting" the momentum before it's too late. Nesterov catches mistakes earlier, leading to faster convergence.
Tuning Beta
Beta ($\beta$) controls the momentum strength. Higher β = more memory, slower decay (see the quick calculation after the examples below).
β = 0
No momentum. Just vanilla SGD. Forgets everything instantly.
β = 0.9
Standard choice. Averages ~10 past gradients. Good for most problems.
β = 0.99
High momentum. Averages ~100 gradients. Very smooth, can overshoot.
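A quick calculation of the rule of thumb above: the velocity retains a fraction β per step, so it forgets (1 − β) of its memory each step and effectively averages the last ~1/(1 − β) gradients. The β values below are the ones discussed above.

```python
# Effective memory window ~ 1 / (1 - beta) and per-step decay for common betas.
for beta in (0.0, 0.9, 0.99):
    window = 1.0 / (1.0 - beta)
    print(f"beta = {beta:<4}  averages ~{window:.0f} past gradients, "
          f"velocity decays {100 * (1 - beta):.0f}% per step")
```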
Interactive: Beta Tuning
Explore how β affects velocity decay and memory. Adjust the slider to see the trade-offs.
Convergence Rates
For smooth convex functions, the provable convergence rates (how fast the loss decreases) are:
- Vanilla gradient descent: $O(1/t)$
- Classical momentum: $O(1/t)$ in the worst case (no asymptotic improvement over gradient descent)
- Nesterov accelerated gradient: $O(1/t^2)$
Nesterov is Provably Optimal
For smooth convex functions, $O(1/t^2)$ is the best possible rate using only gradient information. Nesterov proved this in 1983 and showed his method achieves it. No first-order method can do better.
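In symbols (constants omitted), the textbook bounds for an $L$-smooth convex $f$ with minimizer $x^*$ are:

```latex
% Gradient descent:
f(x_t) - f(x^*) \le O\!\left(\frac{L\,\lVert x_0 - x^* \rVert^2}{t}\right)

% Nesterov's accelerated gradient:
f(x_t) - f(x^*) \le O\!\left(\frac{L\,\lVert x_0 - x^* \rVert^2}{t^2}\right)
```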
Practical Guidance
When to Use Momentum
Nesterov vs Vanilla Momentum
- Nesterov is usually better or equal. Use it by default.
- In PyTorch: `optimizer = SGD(..., nesterov=True)` (see the usage sketch below).
- The improvement is more noticeable in convex/near-convex problems.
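A minimal usage sketch with PyTorch's built-in optimizer; the model, data, and hyperparameters are illustrative. Note that PyTorch requires momentum > 0 (and dampening = 0) when nesterov=True.

```python
import torch
from torch import nn

model = nn.Linear(10, 1)                        # toy model, illustrative only
optimizer = torch.optim.SGD(
    model.parameters(),
    lr=0.01,
    momentum=0.9,      # nesterov=True requires a nonzero momentum
    nesterov=True,
)

x, y = torch.randn(32, 10), torch.randn(32, 1)  # dummy batch
loss = nn.functional.mse_loss(model(x), y)

optimizer.zero_grad()
loss.backward()
optimizer.step()
```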
Momentum vs Adam
This is a hot debate. Adam converges faster early in training, but SGD+Momentum often finds flatter minima that generalize better. Many vision papers use SGD+Momentum for final results, even if they prototype with Adam.