Adaptive Learning Rates

From AdaGrad to AdamW: the quest to automate the most sensitive hyperparameter in deep learning.

Introduction

In momentum-based methods, we use a single learning rate $\eta$ for all parameters. But neural networks have millions of parameters, and they're not created equal:

  • Parameters connected to frequent features (common words, bright pixels) get big, stable gradients.
  • Parameters connected to rare features (unusual words, edge cases) get small, noisy gradients.

The Core Insight

Adaptive optimizers give each parameter its own learning rate, automatically tuned based on gradient history. This normalizes the optimization landscape and lets rare features catch up.

The Problem with Global Learning Rate

Consider training a word embedding model. The word "the" appears millions of times; its gradient is massive and stable. The word "serendipity" appears twice; its gradient is tiny and unreliable.

If LR is Large

"the" overshoots and oscillates. "serendipity" finally learns something.

If LR is Small

"the" converges nicely. "serendipity" barely moves in a lifetime of training.

The Solution

Divide the learning rate by the "magnitude" of recent gradients. Big gradients get small effective LR. Small gradients get large effective LR. The playing field is leveled.

AdaGrad (Adaptive Gradient, 2011)

AdaGrad (Duchi et al., 2011) was the first breakthrough for sparse data problems. It maintains a sum of squared gradients for each parameter:

Accumulator

$$G_t = G_{t-1} + g_t^2$$

Parameter Update

$$\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{G_t + \epsilon}} \cdot g_t$$

How It Works

  • Frequent features: Large $G_t$, so the effective LR $\eta / \sqrt{G_t}$ is small.
  • Rare features: Small $G_t$, so the effective LR stays large.
  • $\epsilon$ (typically $10^{-8}$) prevents division by zero.
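A minimal NumPy sketch of one AdaGrad step, assuming a per-parameter accumulator `G` carried between calls (the two-parameter toy gradients below are made up to contrast a frequent and a rare feature):

```python
import numpy as np

def adagrad_step(theta, g, G, lr=0.01, eps=1e-8):
    """One AdaGrad update: accumulate squared gradients, scale the LR per parameter."""
    G = G + g**2                                # accumulator only ever grows
    theta = theta - lr / np.sqrt(G + eps) * g   # large G -> small effective LR
    return theta, G

# Toy usage: parameter 0 sees large "frequent-feature" gradients, parameter 1 tiny ones.
theta, G = np.zeros(2), np.zeros(2)
for _ in range(100):
    g = np.array([5.0, 0.01])
    theta, G = adagrad_step(theta, g, G)
print(0.01 / np.sqrt(G + 1e-8))                 # effective LRs: tiny for 0, still large for 1
```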

AdaGrad's Strength

Excellent for sparse data (NLP, click-through prediction). Rare events get aggressive updates when they finally appear, compensating for their infrequency.

Interactive: Learning Rate Adaptation

See how AdaGrad adapts the learning rate differently for frequent vs rare features. Watch the accumulator grow and the effective learning rate change.

[Interactive demo: plots the accumulated gradient $G_t$ (the sum of squared gradients, which grows rapidly for frequently updated parameters) alongside the effective learning rate $\eta_{\text{eff}} = \eta / \sqrt{G_t}$, which decays quickly to prevent oscillation. The word "the" appears constantly, so AdaGrad brakes hard to stop it from exploding.]

AdaGrad's Fatal Flaw

AdaGrad has a critical problem: the accumulator $G_t$ is a sum of positive numbers. It only grows, never shrinks.

The Learning Rate Death Spiral

As training progresses:

$$G_t \to \infty \implies \frac{\eta}{\sqrt{G_t}} \to 0$$

The effective learning rate decays to zero. The model freezes before reaching the optimum. Training simply stops making progress.

This is acceptable for convex problems (you're near the optimum anyway). But for deep learning's non-convex landscapes, you need to keep exploring. AdaGrad gives up too early.

RMSprop (Root Mean Square Propagation)

RMSprop was proposed by Geoff Hinton in a Coursera lecture (Lecture 6e). It's never been formally published, yet it powers much of modern AI.

The fix is simple: instead of a cumulative sum, use an exponential moving average (EMA) of squared gradients. The accumulator "forgets" ancient history.

Leaky Accumulator

$$E[g^2]_t = \beta\, E[g^2]_{t-1} + (1-\beta)\, g_t^2$$

Parameter Update

$$\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{E[g^2]_t + \epsilon}} \cdot g_t$$

Why EMA Works

With $\beta = 0.9$, the accumulator averages roughly the last 10 squared gradients. It doesn't grow to infinity; it stabilizes around the recent average magnitude. Learning continues indefinitely.
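The same sketch with RMSprop's leaky accumulator; compared with the AdaGrad version above, only the accumulator line changes (again an illustrative NumPy sketch, not a library implementation):

```python
import numpy as np

def rmsprop_step(theta, g, Eg2, lr=0.001, beta=0.9, eps=1e-8):
    """One RMSprop update: an EMA of squared gradients replaces the running sum."""
    Eg2 = beta * Eg2 + (1 - beta) * g**2         # "forgets" old gradients
    theta = theta - lr / np.sqrt(Eg2 + eps) * g  # effective LR no longer decays to zero
    return theta, Eg2
```

Because $E[g^2]_t$ tracks only recent magnitudes, the effective learning rate settles around a steady value instead of collapsing.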

Adam (Adaptive Moment Estimation)

Adam (Kingma & Ba, 2015) combines the best of Momentum and RMSprop:

First Moment (Mean)

Like Momentum: smooths the gradient direction.

$$m_t = \beta_1 m_{t-1} + (1-\beta_1)\, g_t$$

Second Moment (Variance)

Like RMSprop: scales the learning rate.

$$v_t = \beta_2 v_{t-1} + (1-\beta_2)\, g_t^2$$

Adam Update

$$\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon} \cdot \hat{m}_t$$

where $\hat{m}_t$ and $\hat{v}_t$ are the bias-corrected moments.

Default Hyperparameters

  • Learning rate: $\eta = 0.001$
  • First moment decay: $\beta_1 = 0.9$
  • Second moment decay: $\beta_2 = 0.999$
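Putting the pieces together, a minimal NumPy sketch of one Adam step with bias correction, using the defaults above (illustrative only; `t` counts steps starting at 1):

```python
import numpy as np

def adam_step(theta, g, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update: momentum-style first moment, RMSprop-style second moment."""
    m = beta1 * m + (1 - beta1) * g          # first moment (smoothed direction)
    v = beta2 * v + (1 - beta2) * g**2       # second moment (smoothed magnitude)
    m_hat = m / (1 - beta1**t)               # bias correction for the zero init
    v_hat = v / (1 - beta2**t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v
```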

Interactive: Optimizer Race

Watch all four optimizers race through a ravine. Notice how AdaGrad slows down over time while RMSprop and Adam maintain speed.

[Interactive demo: SGD, AdaGrad, RMSprop, and Adam start from the same point on the same loss surface, with live loss readouts per optimizer. AdaGrad stops early as its learning rate decays toward zero; RMSprop's leaky accumulator and Adam's adaptive scaling plus momentum keep them moving.]

Bias Correction Deep-Dive

Adam initializes $m_0 = 0$ and $v_0 = 0$. This creates a problem in early training.

The Initialization Bias

At step 1, with $\beta_2 = 0.999$:

$$v_1 = 0.999 \cdot 0 + 0.001 \cdot g_1^2 = 0.001\, g_1^2$$

The estimate is 1000× smaller than the true squared gradient! Without correction, the learning rate would explode.

The Fix

Divide by $(1 - \beta^t)$ to scale up early estimates:

$$\hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \qquad \hat{v}_t = \frac{v_t}{1 - \beta_2^t}$$

At $t = 1$: $1 - 0.999^1 = 0.001$. Dividing by 0.001 multiplies the estimate by 1000, exactly compensating for the bias.
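A quick numeric check of that claim (the squared gradient of 1.0 is just an example value):

```python
beta2 = 0.999
g1_sq = 1.0                               # pretend the true squared gradient is 1.0
v1 = beta2 * 0.0 + (1 - beta2) * g1_sq    # 0.001 -- biased 1000x too small
v1_hat = v1 / (1 - beta2**1)              # 1.0   -- bias removed
print(v1, v1_hat)
```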

Interactive: Bias Correction

See the bias correction in action. Toggle it on/off to understand why it's necessary in early training.

[Interactive demo: plots the correction factor $\frac{1}{1 - \beta^t}$ (with $\beta = 0.9$) over training steps, comparing the uncorrected estimate (biased toward 0) with the corrected estimate, which matches the true value of 1.0. Without correction, Adam starts too slow; the factor boosts early estimates to match reality.]

AdamW: Decoupled Weight Decay

For years, researchers noticed that Adam generalized worse than SGD+Momentum on vision tasks. Loshchilov & Hutter (2017) identified the culprit: the standard way of implementing L2 regularization inside Adam is not equivalent to true weight decay.

The Problem: L2 Regularization vs Weight Decay

In SGD, L2 regularization and weight decay are mathematically equivalent. Adding $\frac{\lambda}{2}\|w\|^2$ to the loss produces the gradient $\nabla L + \lambda w$, which after the update gives:

$$w_{t+1} = w_t - \eta(\nabla L + \lambda w_t) = (1 - \eta\lambda)w_t - \eta\nabla L$$

The term $(1 - \eta\lambda)w_t$ is weight decay. For SGD, adding L2 to the loss and applying weight decay directly produce the same result. But for Adam, they diverge.
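A tiny NumPy check of that equivalence for a single SGD step (the weights, gradient, and hyperparameter values are arbitrary examples):

```python
import numpy as np

lr, lam = 0.1, 0.01
w = np.array([1.0, -2.0])
grad = np.array([0.5, 0.3])         # gradient of the data loss alone

# L2 regularization: fold lambda*w into the gradient
w_l2 = w - lr * (grad + lam * w)

# Weight decay: shrink the weights, then take the plain gradient step
w_wd = (1 - lr * lam) * w - lr * grad

print(np.allclose(w_l2, w_wd))      # True -- identical for plain SGD
```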

Standard Adam + L2

Adds $\lambda w$ to the gradient before adaptive scaling:

$$m_t = \beta_1 m_{t-1} + (1-\beta_1)(g_t + \lambda w_t)$$
$$v_t = \beta_2 v_{t-1} + (1-\beta_2)(g_t + \lambda w_t)^2$$

Problem: The weight decay term $\lambda w$ gets scaled by $1/\sqrt{v_t}$. Parameters with large gradients receive less regularization.

AdamW (Decoupled)

Applies weight decay after the Adam update, not through the gradient:

$$m_t = \beta_1 m_{t-1} + (1-\beta_1)\, g_t$$
$$v_t = \beta_2 v_{t-1} + (1-\beta_2)\, g_t^2$$
$$w_{t+1} = w_t - \eta\,\frac{\hat{m}_t}{\sqrt{\hat{v}_t}+\epsilon} - \eta\lambda w_t$$
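Side by side, the difference is where the decay term enters (a hedged NumPy sketch extending the `adam_step` idea above, not a library implementation):

```python
import numpy as np

def adam_l2_step(w, g, m, v, t, lr, lam, b1=0.9, b2=0.999, eps=1e-8):
    """Adam + L2: the decay term rides along in the gradient, so it gets
    rescaled by 1/sqrt(v_hat) like everything else."""
    g = g + lam * w
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g**2
    m_hat, v_hat = m / (1 - b1**t), v / (1 - b2**t)
    return w - lr * m_hat / (np.sqrt(v_hat) + eps), m, v

def adamw_step(w, g, m, v, t, lr, lam, b1=0.9, b2=0.999, eps=1e-8):
    """AdamW: weight decay is applied directly to w, outside the adaptive scaling."""
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g**2
    m_hat, v_hat = m / (1 - b1**t), v / (1 - b2**t)
    return w - lr * m_hat / (np.sqrt(v_hat) + eps) - lr * lam * w, m, v
```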

Why Decoupling Matters

Uniform Regularization

In AdamW, every parameter receives the same relative weight decay $\eta\lambda$, regardless of gradient magnitude. This matches the intended behavior of L2 regularization.

Hyperparameter Independence

The optimal weight decay $\lambda$ becomes independent of the learning rate $\eta$. In standard Adam+L2, you need to retune $\lambda$ whenever you change $\eta$.

Better Generalization

AdamW achieves generalization comparable to SGD+Momentum while retaining Adam's fast convergence. This closed the gap that made practitioners prefer SGD for vision tasks.

AdamW for Transformers

AdamW is the default optimizer for BERT, GPT, ViT, and virtually all modern transformers. Typical settings: $\eta = 10^{-4}$ to $3 \times 10^{-4}$, $\lambda = 0.01$ to $0.1$, $\beta_1 = 0.9$, $\beta_2 = 0.999$. If you're training a transformer, use AdamW.
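In PyTorch those settings might look like the sketch below (the `Linear` layer is just a stand-in for a real transformer model):

```python
import torch

model = torch.nn.Linear(768, 768)       # stand-in for a real model
optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=3e-4,                            # typical range: 1e-4 to 3e-4
    betas=(0.9, 0.999),
    weight_decay=0.01,                  # typical range: 0.01 to 0.1
)
```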

Adam vs AdamW

The x-direction has 20× steeper gradients than y. Watch how the two optimizers regularize it differently.

[Interactive demo: on a loss surface where x is the steep direction, Adam+L2 applies a smaller effective decay to it ($\lambda_x < \lambda_y$), while AdamW applies the same decay to both directions ($\lambda_x = \lambda_y$), giving uniform regularization and better generalization.]

The Problem

Adam+L2 divides the weight decay by $\sqrt{v}$. Steep directions (large $v$) get weaker regularization. AdamW keeps the decay uniform.

The Adam Controversy

Adam is not universally loved. There's an ongoing debate about when to use it.

Adam Wins: Fast Convergence

Adam converges faster in early training. Great for prototyping, NLP, and when compute is limited. It's forgiving of learning rate choice.

SGD+Momentum Wins: Better Generalization

Many vision papers report that SGD+Momentum finds flatter minima that generalize better to test data. The noise in SGD acts as implicit regularization.

The Practical Compromise

Many practitioners use Adam for early training (reach a good region fast), then switch to SGD for fine-tuning (find a flat minimum). Some use learning rate warmup to help Adam's early instability.
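One way that hand-off might look in PyTorch (a sketch; the learning rates and switch point are placeholders, not a recommendation):

```python
import torch

model = torch.nn.Linear(128, 10)                     # placeholder model
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
# ... train with Adam until validation loss plateaus ...

opt = torch.optim.SGD(model.parameters(), lr=1e-2, momentum=0.9)
# ... continue fine-tuning with SGD+Momentum to settle into a flatter minimum ...
```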

Comparison Table

| Optimizer | Key Feature | Best For |
| --- | --- | --- |
| SGD | No memory | Simple convex problems |
| Momentum | Velocity accumulation | Vision (often beats Adam) |
| AdaGrad | Sum of squared gradients | Sparse NLP data |
| RMSprop | Leaky average of squares | RNNs, RL |
| Adam | Momentum + RMSprop + bias correction | Default choice for most tasks |
| AdamW | Adam + decoupled weight decay | Transformers (BERT, GPT, ViT) |