Introduction
In momentum-based methods, we use a single learning rate for all parameters. But neural networks have millions of parameters, and they're not created equal:
- Parameters connected to frequent features (common words, bright pixels) get big, stable gradients.
- Parameters connected to rare features (unusual words, edge cases) get small, noisy gradients.
The Core Insight
Adaptive optimizers give each parameter its own learning rate, automatically tuned based on gradient history. This normalizes the optimization landscape and lets rare features catch up.
The Problem with Global Learning Rate
Consider training a word embedding model. The word "the" appears millions of times; its gradient is massive and stable. The word "serendipity" appears twice; its gradient is tiny and unreliable.
If LR is Large
"the" overshoots and oscillates. "serendipity" finally learns something.
If LR is Small
"the" converges nicely. "serendipity" barely moves in a lifetime of training.
The Solution
Divide the learning rate by the "magnitude" of recent gradients. Big gradients get small effective LR. Small gradients get large effective LR. The playing field is leveled.
AdaGrad (Adaptive Gradient, 2011)
AdaGrad (Duchi et al., 2011) was the first breakthrough for sparse data problems. It maintains a sum of squared gradients for each parameter:
Accumulator
$$G_t = G_{t-1} + g_t^2 \quad\text{(elementwise, one entry per parameter)}$$
Parameter Update
$$\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{G_t} + \epsilon}\, g_t$$
How It Works
- Frequent features: $G_t$ grows large, so the effective LR $\eta / (\sqrt{G_t} + \epsilon)$ becomes small.
- Rare features: $G_t$ stays small, so the effective LR stays large.
- $\epsilon$ (typically $10^{-8}$) prevents division by zero.
AdaGrad's Strength
Excellent for sparse data (NLP, click-through prediction). Rare events get aggressive updates when they finally appear, compensating for their infrequency.
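A minimal NumPy sketch of the per-parameter AdaGrad update described above (the function and variable names are illustrative, not taken from any library):

```python
import numpy as np

def adagrad_step(theta, grad, accum, lr=0.1, eps=1e-8):
    """One AdaGrad step; accum holds the running sum of squared gradients."""
    accum = accum + grad ** 2                            # G_t = G_{t-1} + g_t^2 (per parameter)
    theta = theta - lr * grad / (np.sqrt(accum) + eps)   # effective LR shrinks as G_t grows
    return theta, accum

# Toy run: parameter 0 sees a gradient every step, parameter 1 only every 50th step.
theta, accum = np.zeros(2), np.zeros(2)
for step in range(200):
    grad = np.array([1.0, 1.0 if step % 50 == 0 else 0.0])
    theta, accum = adagrad_step(theta, grad, accum)
# accum[0] >> accum[1], so the rare parameter keeps a much larger effective learning rate.
```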
Interactive: Learning Rate Adaptation
See how AdaGrad adapts the learning rate differently for frequent vs rare features. Watch the accumulator grow and the effective learning rate change.
(Chart: the adaptive learning rate, which decays quickly to prevent oscillation.)
AdaGrad's Fatal Flaw
AdaGrad has a critical problem: the accumulator is a sum of positive numbers. It only grows, never shrinks.
The Learning Rate Death Spiral
As training progresses, the accumulator $G_t = \sum_{\tau=1}^{t} g_\tau^2$ only grows, so
$$\frac{\eta}{\sqrt{G_t} + \epsilon} \longrightarrow 0.$$
The effective learning rate decays to zero. The model freezes before reaching the optimum. Training simply stops making progress.
This is acceptable for convex problems (you're near the optimum anyway). But for deep learning's non-convex landscapes, you need to keep exploring. AdaGrad gives up too early.
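A tiny numerical illustration of the death spiral, assuming a constant unit gradient purely for simplicity:

```python
lr, eps = 0.1, 1e-8
accum = 0.0
for t in range(10_000):
    g = 1.0                            # constant unit gradient, purely for illustration
    accum += g ** 2                    # the sum only ever grows
    eff_lr = lr / (accum ** 0.5 + eps)

print(eff_lr)  # ~0.001: after 10,000 steps the update is 100x smaller, and still shrinking
```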
RMSprop (Root Mean Square Propagation)
RMSprop was proposed by Geoff Hinton in a Coursera lecture (Lecture 6e). It's never been formally published, yet it powers much of modern AI.
The fix is simple: instead of a cumulative sum, use an exponential moving average (EMA) of squared gradients. The accumulator "forgets" ancient history.
Leaky Accumulator
$$E[g^2]_t = \gamma\, E[g^2]_{t-1} + (1 - \gamma)\, g_t^2$$
Parameter Update
$$\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{E[g^2]_t} + \epsilon}\, g_t$$
Why EMA Works
With $\gamma = 0.9$, the accumulator averages roughly the last 10 squared gradients. It doesn't grow to infinity; it stabilizes around the recent average magnitude, and learning continues indefinitely.
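A minimal sketch of the RMSprop step, mirroring the AdaGrad code above but with the leaky accumulator (names are illustrative):

```python
import numpy as np

def rmsprop_step(theta, grad, avg_sq, lr=0.001, decay=0.9, eps=1e-8):
    """One RMSprop step; avg_sq is an exponential moving average of squared gradients."""
    avg_sq = decay * avg_sq + (1 - decay) * grad ** 2      # leaky accumulator: old history fades
    theta = theta - lr * grad / (np.sqrt(avg_sq) + eps)    # effective LR tracks recent magnitude
    return theta, avg_sq
```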
Adam (Adaptive Moment Estimation)
Adam (Kingma & Ba, 2015) combines the best of Momentum and RMSprop:
First Moment (Mean)
$$m_t = \beta_1 m_{t-1} + (1 - \beta_1)\, g_t$$
Like Momentum: smooths the gradient direction.
Second Moment (Variance)
$$v_t = \beta_2 v_{t-1} + (1 - \beta_2)\, g_t^2$$
Like RMSprop: scales the learning rate.
Adam Update
$$\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon}\, \hat{m}_t$$
where $\hat{m}_t = m_t / (1 - \beta_1^t)$ and $\hat{v}_t = v_t / (1 - \beta_2^t)$ are the bias-corrected moments.
Default Hyperparameters
- Learning Rate: $\eta = 0.001$
- First Moment Decay: $\beta_1 = 0.9$
- Second Moment Decay: $\beta_2 = 0.999$
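Putting the two moments together, a minimal Adam step could look like the sketch below (illustrative, not a library implementation):

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam step (t is the step count, starting at 1)."""
    m = beta1 * m + (1 - beta1) * grad        # first moment: smoothed direction (momentum-like)
    v = beta2 * v + (1 - beta2) * grad ** 2   # second moment: smoothed magnitude (RMSprop-like)
    m_hat = m / (1 - beta1 ** t)              # bias correction, explained in the next section
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v
```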
Interactive: Optimizer Race
Watch all four optimizers race through a ravine. Notice how AdaGrad slows down over time while RMSprop and Adam maintain speed.
Bias Correction Deep-Dive
Adam initializes $m_0 = 0$ and $v_0 = 0$. This creates a problem in early training.
The Initialization Bias
At step 1, with $\beta_2 = 0.999$:
$$v_1 = \beta_2 \cdot 0 + (1 - \beta_2)\, g_1^2 = 0.001\, g_1^2$$
The estimate is 1000× smaller than the true squared gradient! Without correction, the learning rate would explode.
The Fix
Divide by $1 - \beta_2^t$ to scale up early estimates:
$$\hat{v}_t = \frac{v_t}{1 - \beta_2^t}$$
At $t = 1$: $1 - \beta_2^1 = 0.001$. Dividing by 0.001 multiplies the estimate by 1000, exactly compensating for the bias.
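A quick numerical check of the correction at $t = 1$, assuming an arbitrary true squared gradient of 4.0:

```python
beta2 = 0.999
g1_sq = 4.0                             # assume the true squared gradient at step 1 is 4.0
v1 = beta2 * 0.0 + (1 - beta2) * g1_sq  # raw estimate ~0.004: 1000x too small
v1_hat = v1 / (1 - beta2 ** 1)          # divide by 0.001 -> back to ~4.0
print(v1, v1_hat)                       # ~0.004  ~4.0
```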
Interactive: Bias Correction
See the bias correction in action. Toggle it on/off to understand why it's necessary in early training.
AdamW: Decoupled Weight Decay
For years, researchers noticed that Adam generalized worse than SGD+Momentum on vision tasks. Loshchilov & Hutter (2017) identified the culprit: the standard way L2 regularization is implemented inside Adam.
The Problem: L2 Regularization vs Weight Decay
In SGD, L2 regularization and weight decay are mathematically equivalent. Adding $\frac{\lambda}{2}\lVert\theta\rVert^2$ to the loss produces the gradient $g_t + \lambda\theta_t$, which after the update gives:
$$\theta_{t+1} = \theta_t - \eta\,(g_t + \lambda\theta_t) = (1 - \eta\lambda)\,\theta_t - \eta\, g_t$$
The $(1 - \eta\lambda)\,\theta_t$ term is weight decay. For SGD, adding L2 to the loss and applying weight decay directly produce the same result. But for Adam, they diverge.
Standard Adam + L2
Adds $\lambda\theta_t$ to the gradient before the adaptive scaling:
$$g_t \leftarrow g_t + \lambda\theta_t$$
Problem: the weight decay term then gets scaled by $\frac{1}{\sqrt{\hat{v}_t} + \epsilon}$ along with the rest of the gradient. Parameters with large gradients receive less regularization.
AdamW (Decoupled)
Applies weight decay after the adaptive step, not through the gradient:
$$\theta_{t+1} = \theta_t - \eta\left(\frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon} + \lambda\,\theta_t\right)$$
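The contrast is easiest to see side by side. A minimal NumPy sketch (illustrative, not a reference implementation): Adam+L2 folds the decay into the gradient before the adaptive scaling, while AdamW subtracts it directly from the weights.

```python
import numpy as np

def adam_l2_step(theta, grad, m, v, t, lr=0.001, wd=0.01,
                 beta1=0.9, beta2=0.999, eps=1e-8):
    """Adam with L2: the decay enters the gradient and gets rescaled by sqrt(v_hat)."""
    grad = grad + wd * theta                               # L2 folded into the gradient
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat, v_hat = m / (1 - beta1 ** t), v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)    # decay is rescaled per parameter
    return theta, m, v

def adamw_step(theta, grad, m, v, t, lr=0.001, wd=0.01,
               beta1=0.9, beta2=0.999, eps=1e-8):
    """AdamW: the decay is applied directly to the weights, outside the adaptive scaling."""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat, v_hat = m / (1 - beta1 ** t), v / (1 - beta2 ** t)
    theta = theta - lr * (m_hat / (np.sqrt(v_hat) + eps) + wd * theta)  # uniform decay
    return theta, m, v
```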
Why Decoupling Matters
Uniform Regularization
In AdamW, every parameter receives the same relative weight decay $\eta\lambda$ per step, regardless of gradient magnitude. This matches the intended behavior of L2 regularization.
Hyperparameter Independence
The optimal weight decay $\lambda$ becomes independent of the learning rate $\eta$. In standard Adam+L2, you need to retune $\lambda$ whenever you change $\eta$.
Better Generalization
AdamW achieves generalization comparable to SGD+Momentum while retaining Adam's fast convergence. This closed the gap that made practitioners prefer SGD for vision tasks.
AdamW for Transformers
AdamW is the default optimizer for BERT, GPT, ViT, and virtually all modern transformers. Typical settings: learning rate $10^{-4}$ to $5\times10^{-4}$, weight decay $0.01$ to $0.1$, $\beta_1 = 0.9$, $\beta_2 = 0.999$. If you're training a transformer, use AdamW.
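For reference, a minimal PyTorch usage sketch (the model and the specific values here are placeholders, not a recommended recipe):

```python
import torch

model = torch.nn.Linear(768, 768)        # stand-in for a transformer block
optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=1e-4,                             # placeholder value; tune per model and schedule
    betas=(0.9, 0.999),
    weight_decay=0.01,
)
```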
Adam vs AdamW
The x-direction has 20× steeper gradients. Watch how the regularization differs.
The Problem
Adam+L2 divides the weight decay by $\sqrt{\hat{v}_t} + \epsilon$. Steep directions (large $\hat{v}_t$) get weaker regularization. AdamW keeps the decay uniform.
The Adam Controversy
Adam is not universally loved. There's an ongoing debate about when to use it.
Adam Wins: Fast Convergence
Adam converges faster in early training. Great for prototyping, NLP, and when compute is limited. It's forgiving of learning rate choice.
SGD+Momentum Wins: Better Generalization
Many vision papers report that SGD+Momentum finds flatter minima that generalize better to test data. The noise in SGD acts as implicit regularization.
The Practical Compromise
Many practitioners use Adam for early training (to reach a good region fast), then switch to SGD+Momentum for fine-tuning (to find a flat minimum). Some add learning-rate warmup to tame Adam's early instability.
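A sketch of that hand-off in PyTorch (the switch epoch, learning rates, and toy model are arbitrary placeholders):

```python
import torch

model = torch.nn.Linear(10, 1)
x, y = torch.randn(64, 10), torch.randn(64, 1)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

for epoch in range(30):
    if epoch == 20:  # arbitrary switch point: hand off to SGD+Momentum for the final phase
        optimizer = torch.optim.SGD(model.parameters(), lr=1e-2, momentum=0.9)
    optimizer.zero_grad()
    loss = torch.nn.functional.mse_loss(model(x), y)
    loss.backward()
    optimizer.step()
```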
Comparison Table
| Optimizer | Key Feature | Best For |
|---|---|---|
| SGD | No memory | Simple convex problems |
| Momentum | Velocity accumulation | Vision (often beats Adam) |
| AdaGrad | Sum of squared gradients | Sparse NLP data |
| RMSprop | Leaky average of squares | RNNs, RL |
| Adam | Momentum + RMSprop + Bias Correction | Default choice for most tasks |
| AdamW | Adam + Correct Weight Decay | Transformers (BERT, GPT, ViT) |