Adaptive Learning Rates

From AdaGrad to AdamW: the quest to automate the most sensitive hyperparameter in deep learning.

Introduction

In momentum-based methods, we use a single learning rate $\eta$ for all parameters. But neural networks have millions of parameters, and they're not created equal:

  • Parameters connected to frequent features (common words, bright pixels) get big, stable gradients.
  • Parameters connected to rare features (unusual words, edge cases) get small, noisy gradients.

The Core Insight

Adaptive optimizers give each parameter its own learning rate, automatically tuned based on gradient history. This normalizes the optimization landscape and lets rare features catch up.

The Problem with Global Learning Rate

Consider training a word embedding model. The word "the" appears millions of times; its gradient is massive and stable. The word "serendipity" appears twice; its gradient is tiny and unreliable.

If LR is Large

"the" overshoots and oscillates. "serendipity" finally learns something.

If LR is Small

"the" converges nicely. "serendipity" barely moves in a lifetime of training.

The Solution

Divide the learning rate by the "magnitude" of recent gradients. Big gradients get small effective LR. Small gradients get large effective LR. The playing field is leveled.

AdaGrad (Adaptive Gradient, 2011)

AdaGrad (Duchi et al., 2011) was the first breakthrough for sparse data problems. It maintains a sum of squared gradients for each parameter:

Accumulator

$$G_t = G_{t-1} + g_t^2$$

Parameter Update

$$\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{G_t + \epsilon}} \cdot g_t$$

How It Works

  • Frequent features: Large $G_t$, so the effective LR $\eta / \sqrt{G_t}$ is small.
  • Rare features: Small $G_t$, so the effective LR stays large.
  • $\epsilon$ (typically $10^{-8}$) prevents division by zero.
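A minimal NumPy sketch of one AdaGrad step, assuming a per-parameter accumulator `G` carried between calls (the two-parameter toy gradients below are made up to contrast a frequent and a rare feature):

```python
import numpy as np

def adagrad_step(theta, g, G, lr=0.01, eps=1e-8):
    """One AdaGrad update: accumulate squared gradients, scale the LR per parameter."""
    G = G + g**2                                # accumulator only ever grows
    theta = theta - lr / np.sqrt(G + eps) * g   # large G -> small effective LR
    return theta, G

# Toy usage: parameter 0 sees large "frequent-feature" gradients, parameter 1 tiny ones.
theta, G = np.zeros(2), np.zeros(2)
for _ in range(100):
    g = np.array([5.0, 0.01])
    theta, G = adagrad_step(theta, g, G)
print(0.01 / np.sqrt(G + 1e-8))                 # effective LRs: tiny for 0, still large for 1
```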

AdaGrad's Strength

Excellent for sparse data (NLP, click-through prediction). Rare events get aggressive updates when they finally appear, compensating for their infrequency.

Interactive: Learning Rate Adaptation

See how AdaGrad adapts the learning rate differently for frequent vs rare features. Watch the accumulator grow and the effective learning rate change.

[Interactive demo: plots the accumulated gradient $G_t$ (the sum of squared gradients, which grows rapidly for frequently updated parameters) alongside the effective learning rate $\eta_{\text{eff}} = \eta / \sqrt{G_t}$, which decays quickly to prevent oscillation. The word "the" appears constantly, so AdaGrad brakes hard to stop it from exploding.]

AdaGrad's Fatal Flaw

AdaGrad has a critical problem: the accumulator $G_t$ is a sum of positive numbers. It only grows, never shrinks.

The Learning Rate Death Spiral

As training progresses:

$$G_t \to \infty \implies \frac{\eta}{\sqrt{G_t}} \to 0$$

The effective learning rate decays to zero. The model freezes before reaching the optimum. Training simply stops making progress.

This is acceptable for convex problems (you're near the optimum anyway). But for deep learning's non-convex landscapes, you need to keep exploring. AdaGrad gives up too early.

RMSprop (Root Mean Square Propagation)

RMSprop was proposed by Geoff Hinton in a Coursera lecture (Lecture 6e). It's never been formally published, yet it powers much of modern AI.

The fix is simple: instead of a cumulative sum, use an exponential moving average (EMA) of squared gradients. The accumulator "forgets" ancient history.

Leaky Accumulator

$$E[g^2]_t = \beta\, E[g^2]_{t-1} + (1-\beta)\, g_t^2$$

Parameter Update

$$\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{E[g^2]_t + \epsilon}} \cdot g_t$$

Why EMA Works

With $\beta = 0.9$, the accumulator averages roughly the last 10 squared gradients. It doesn't grow to infinity; it stabilizes around the recent average magnitude. Learning continues indefinitely.
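The same sketch with RMSprop's leaky accumulator; compared with the AdaGrad version above, only the accumulator line changes (again an illustrative NumPy sketch, not a library implementation):

```python
import numpy as np

def rmsprop_step(theta, g, Eg2, lr=0.001, beta=0.9, eps=1e-8):
    """One RMSprop update: an EMA of squared gradients replaces the running sum."""
    Eg2 = beta * Eg2 + (1 - beta) * g**2         # "forgets" old gradients
    theta = theta - lr / np.sqrt(Eg2 + eps) * g  # effective LR no longer decays to zero
    return theta, Eg2
```

Because $E[g^2]_t$ tracks only recent magnitudes, the effective learning rate settles around a steady value instead of collapsing.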

Adam (Adaptive Moment Estimation)

Adam (Kingma & Ba, 2015) combines the best of Momentum and RMSprop:

First Moment (Mean)

Like Momentum: smooths the gradient direction.

$$m_t = \beta_1 m_{t-1} + (1-\beta_1)\, g_t$$

Second Moment (Variance)

Like RMSprop: scales the learning rate.

$$v_t = \beta_2 v_{t-1} + (1-\beta_2)\, g_t^2$$

Adam Update

$$\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon} \cdot \hat{m}_t$$

where $\hat{m}_t$ and $\hat{v}_t$ are the bias-corrected moments.

Default Hyperparameters

  • Learning rate: $\eta = 0.001$
  • First moment decay: $\beta_1 = 0.9$
  • Second moment decay: $\beta_2 = 0.999$
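Putting the pieces together, a minimal NumPy sketch of one Adam step with bias correction, using the defaults above (illustrative only; `t` counts steps starting at 1):

```python
import numpy as np

def adam_step(theta, g, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update: momentum-style first moment, RMSprop-style second moment."""
    m = beta1 * m + (1 - beta1) * g          # first moment (smoothed direction)
    v = beta2 * v + (1 - beta2) * g**2       # second moment (smoothed magnitude)
    m_hat = m / (1 - beta1**t)               # bias correction for the zero init
    v_hat = v / (1 - beta2**t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v
```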

Interactive: Optimizer Race

Watch all four optimizers race through a ravine. Notice how AdaGrad slows down over time while RMSprop and Adam maintain speed.

[Interactive demo: SGD, AdaGrad, RMSprop, and Adam start from the same point on the same loss surface, with live loss readouts per optimizer. AdaGrad stops early as its learning rate decays toward zero; RMSprop's leaky accumulator and Adam's adaptive scaling plus momentum keep them moving.]

Bias Correction Deep-Dive

Adam initializes $m_0 = 0$ and $v_0 = 0$. This creates a problem in early training.

The Initialization Bias

At step 1, with $\beta_2 = 0.999$:

$$v_1 = 0.999 \cdot 0 + 0.001 \cdot g_1^2 = 0.001\, g_1^2$$

The estimate is 1000× smaller than the true squared gradient! Without correction, the learning rate would explode.

The Fix

Divide by $(1 - \beta^t)$ to scale up early estimates:

$$\hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \qquad \hat{v}_t = \frac{v_t}{1 - \beta_2^t}$$

At $t = 1$: $1 - 0.999^1 = 0.001$. Dividing by 0.001 multiplies the estimate by 1000, exactly compensating for the bias.
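A quick numeric check of that claim (the squared gradient of 1.0 is just an example value):

```python
beta2 = 0.999
g1_sq = 1.0                               # pretend the true squared gradient is 1.0
v1 = beta2 * 0.0 + (1 - beta2) * g1_sq    # 0.001 -- biased 1000x too small
v1_hat = v1 / (1 - beta2**1)              # 1.0   -- bias removed
print(v1, v1_hat)
```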

Interactive: Bias Correction

See the bias correction in action. Toggle it on/off to understand why it's necessary in early training.

[Interactive demo: plots the correction factor $\frac{1}{1 - \beta^t}$ (with $\beta = 0.9$) over training steps, comparing the uncorrected estimate (biased toward 0) with the corrected estimate, which matches the true value of 1.0. Without correction, Adam starts too slow; the factor boosts early estimates to match reality.]

AdamW: Decoupled Weight Decay

For years, researchers noticed that Adam generalized worse than SGD+Momentum on vision tasks. Loshchilov & Hutter (2017) identified the culprit: the standard way of implementing L2 regularization inside Adam is not equivalent to true weight decay.

The Problem: L2 Regularization vs Weight Decay

In SGD, L2 regularization and weight decay are mathematically equivalent. Adding $\frac{\lambda}{2}\|w\|^2$ to the loss produces the gradient $\nabla L + \lambda w$, which after the update gives:

$$w_{t+1} = w_t - \eta(\nabla L + \lambda w_t) = (1 - \eta\lambda)w_t - \eta\nabla L$$

The term $(1 - \eta\lambda)w_t$ is weight decay. For SGD, adding L2 to the loss and applying weight decay directly produce the same result. But for Adam, they diverge.
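A tiny NumPy check of that equivalence for a single SGD step (the weights, gradient, and hyperparameter values are arbitrary examples):

```python
import numpy as np

lr, lam = 0.1, 0.01
w = np.array([1.0, -2.0])
grad = np.array([0.5, 0.3])         # gradient of the data loss alone

# L2 regularization: fold lambda*w into the gradient
w_l2 = w - lr * (grad + lam * w)

# Weight decay: shrink the weights, then take the plain gradient step
w_wd = (1 - lr * lam) * w - lr * grad

print(np.allclose(w_l2, w_wd))      # True -- identical for plain SGD
```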

Standard Adam + L2

Adds $\lambda w$ to the gradient before adaptive scaling:

$$m_t = \beta_1 m_{t-1} + (1-\beta_1)(g_t + \lambda w_t)$$
$$v_t = \beta_2 v_{t-1} + (1-\beta_2)(g_t + \lambda w_t)^2$$

Problem: The weight decay term $\lambda w$ gets scaled by $1/\sqrt{v_t}$. Parameters with large gradients receive less regularization.

AdamW (Decoupled)

Applies weight decay after the Adam update, not through the gradient:

$$m_t = \beta_1 m_{t-1} + (1-\beta_1)\, g_t$$
$$v_t = \beta_2 v_{t-1} + (1-\beta_2)\, g_t^2$$
$$w_{t+1} = w_t - \eta\,\frac{\hat{m}_t}{\sqrt{\hat{v}_t}+\epsilon} - \eta\lambda w_t$$
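Side by side, the difference is where the decay term enters (a hedged NumPy sketch extending the `adam_step` idea above, not a library implementation):

```python
import numpy as np

def adam_l2_step(w, g, m, v, t, lr, lam, b1=0.9, b2=0.999, eps=1e-8):
    """Adam + L2: the decay term rides along in the gradient, so it gets
    rescaled by 1/sqrt(v_hat) like everything else."""
    g = g + lam * w
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g**2
    m_hat, v_hat = m / (1 - b1**t), v / (1 - b2**t)
    return w - lr * m_hat / (np.sqrt(v_hat) + eps), m, v

def adamw_step(w, g, m, v, t, lr, lam, b1=0.9, b2=0.999, eps=1e-8):
    """AdamW: weight decay is applied directly to w, outside the adaptive scaling."""
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g**2
    m_hat, v_hat = m / (1 - b1**t), v / (1 - b2**t)
    return w - lr * m_hat / (np.sqrt(v_hat) + eps) - lr * lam * w, m, v
```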

Why Decoupling Matters

Uniform Regularization

In AdamW, every parameter receives the same relative weight decay $\eta\lambda$, regardless of gradient magnitude. This matches the intended behavior of L2 regularization.

Hyperparameter Independence

The optimal weight decay $\lambda$ becomes independent of the learning rate $\eta$. In standard Adam+L2, you need to retune $\lambda$ whenever you change $\eta$.

Better Generalization

AdamW achieves generalization comparable to SGD+Momentum while retaining Adam's fast convergence. This closed the gap that made practitioners prefer SGD for vision tasks.

AdamW for Transformers

AdamW is the default optimizer for BERT, GPT, ViT, and virtually all modern transformers. Typical settings: $\eta = 10^{-4}$ to $3 \times 10^{-4}$, $\lambda = 0.01$ to $0.1$, $\beta_1 = 0.9$, $\beta_2 = 0.999$. If you're training a transformer, use AdamW.
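In PyTorch those settings might look like the sketch below (the `Linear` layer is just a stand-in for a real transformer model):

```python
import torch

model = torch.nn.Linear(768, 768)       # stand-in for a real model
optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=3e-4,                            # typical range: 1e-4 to 3e-4
    betas=(0.9, 0.999),
    weight_decay=0.01,                  # typical range: 0.01 to 0.1
)
```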

Adam vs AdamW

The x-direction has 20× steeper gradients than y. Watch how the two optimizers regularize it differently.

[Interactive demo: on a loss surface where x is the steep direction, Adam+L2 applies a smaller effective decay to it ($\lambda_x < \lambda_y$), while AdamW applies the same decay to both directions ($\lambda_x = \lambda_y$), giving uniform regularization and better generalization.]

The Problem

Adam+L2 divides the weight decay by $\sqrt{v}$. Steep directions (large $v$) get weaker regularization. AdamW keeps the decay uniform.

The Adam Controversy

Adam is not universally loved. There's an ongoing debate about when to use it.

Adam Wins: Fast Convergence

Adam converges faster in early training. Great for prototyping, NLP, and when compute is limited. It's forgiving of learning rate choice.

SGD+Momentum Wins: Better Generalization

Many vision papers report that SGD+Momentum finds flatter minima that generalize better to test data. The noise in SGD acts as implicit regularization.

The Practical Compromise

Many practitioners use Adam for early training (reach a good region fast), then switch to SGD for fine-tuning (find a flat minimum). Some use learning rate warmup to help Adam's early instability.
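One way that hand-off might look in PyTorch (a sketch; the learning rates and switch point are placeholders, not a recommendation):

```python
import torch

model = torch.nn.Linear(128, 10)                     # placeholder model
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
# ... train with Adam until validation loss plateaus ...

opt = torch.optim.SGD(model.parameters(), lr=1e-2, momentum=0.9)
# ... continue fine-tuning with SGD+Momentum to settle into a flatter minimum ...
```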

Comparison Table

| Optimizer | Key Feature | Best For |
| --- | --- | --- |
| SGD | No memory | Simple convex problems |
| Momentum | Velocity accumulation | Vision (often beats Adam) |
| AdaGrad | Sum of squared gradients | Sparse NLP data |
| RMSprop | Leaky average of squares | RNNs, RL |
| Adam | Momentum + RMSprop + bias correction | Default choice for most tasks |
| AdamW | Adam + decoupled weight decay | Transformers (BERT, GPT, ViT) |