Introduction
When training neural networks, the classic fear is getting stuck in a local minimum, a small valley that isn't the lowest point. However, groundbreaking research (Dauphin et al., 2014) revealed a surprising truth:
The Key Insight
In deep learning, local minima are rare. The vast majority of "stuck" points are saddle points.
Understanding this distinction explains why algorithms like momentum and Adam are necessary.
Definitions
Local Minimum
A point where the function value is lower than at every neighboring point.
Imagine standing at the bottom of a bowl. Every direction you step goes UP.
Saddle Point
A point where gradient = 0, but it's a minimum in some directions and a maximum in others.
Imagine a horse saddle. Front-to-back goes UP. Side-to-side goes DOWN.
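To make the two pictures concrete, here is a minimal numerical sketch in Python. It assumes two toy functions not mentioned above: $x^2 + y^2$ as the bowl and $x^2 - y^2$ as the saddle, which are the standard textbook examples.

```python
def bowl(x, y):      # x^2 + y^2: a true local minimum at the origin
    return x**2 + y**2

def saddle(x, y):    # x^2 - y^2: a saddle point at the origin
    return x**2 - y**2

step = 0.1
directions = {"+x": (step, 0), "-x": (-step, 0), "+y": (0, step), "-y": (0, -step)}

for name, f in [("bowl", bowl), ("saddle", saddle)]:
    changes = {d: f(dx, dy) - f(0.0, 0.0) for d, (dx, dy) in directions.items()}
    print(name, changes)

# bowl:   every step raises the value (all changes positive)   -> local minimum
# saddle: steps along x raise it, steps along y lower it       -> saddle point
```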
Critical Points
Both local minima and saddle points are "critical points" where the gradient equals zero. The difference lies in the curvature (second derivatives).
The Hessian Test
At a critical point (where $\nabla f(\theta) = 0$), we examine the eigenvalues of the Hessian matrix to classify the point:
| Eigenvalues | Curvature | Classification |
|---|---|---|
| All $\lambda_i > 0$ | Curves UP everywhere | Local Minimum |
| All $\lambda_i < 0$ | Curves DOWN everywhere | Local Maximum |
| Mixed signs | UP in some, DOWN in others | Saddle Point |
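As a small sketch of this test, the snippet below (assuming NumPy and a hypothetical `classify_critical_point` helper) classifies the Hessians of the toy functions $x^2 + y^2$, $-(x^2 + y^2)$, and $x^2 - y^2$ at the origin.

```python
import numpy as np

def classify_critical_point(hessian, tol=1e-8):
    """Classify a critical point from the eigenvalues of its Hessian."""
    eigs = np.linalg.eigvalsh(hessian)  # symmetric Hessian -> real eigenvalues
    if np.all(eigs > tol):
        return "local minimum"
    if np.all(eigs < -tol):
        return "local maximum"
    if np.any(eigs > tol) and np.any(eigs < -tol):
        return "saddle point"
    return "degenerate (needs higher-order analysis)"

# Hessians at the origin for three toy functions:
H_min    = np.array([[ 2.0, 0.0], [0.0,  2.0]])  # f = x^2 + y^2
H_max    = np.array([[-2.0, 0.0], [0.0, -2.0]])  # f = -(x^2 + y^2)
H_saddle = np.array([[ 2.0, 0.0], [0.0, -2.0]])  # f = x^2 - y^2

for H in (H_min, H_max, H_saddle):
    print(np.linalg.eigvalsh(H), "->", classify_critical_point(H))
```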
Interactive: Saddle Point
Visualize a saddle surface. The function curves up in one direction (positive eigenvalue) and down in another (negative eigenvalue).
Cross Sections
Along X: Convex (Minimum). Positive curvature. Stable.
Along Y: Concave (Maximum). Negative curvature. Unstable.
Hessian Analysis
Because the eigenvalues have mixed signs, the Hessian is indefinite. This mathematically defines a saddle point.
The High-Dimensional Curse
Why are saddle points more common than local minima in deep learning? Pure probability.
The Probability Argument
To be a local minimum in n-dimensional space, the curvature must be positive in all n directions.
If each eigenvalue independently has roughly a 50% chance of being positive or negative, the chance that all of them are positive is

$$P(\text{local minimum}) \approx \left(\tfrac{1}{2}\right)^n,$$

which shrinks exponentially as $n$ grows.
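A quick back-of-the-envelope check of this argument; the values of $n$ below are illustrative choices, not figures from the cited research.

```python
# Probability that all n eigenvalues are positive, assuming each is
# independently positive with probability 1/2 (the rough argument above).
for n in (2, 10, 100, 1000, 1_000_000):
    p = 0.5 ** n
    print(f"n = {n:>9}: P(all positive) = {p:.3e}")

# n = 2         -> 2.500e-01
# n = 10        -> ~9.8e-04
# n = 100       -> ~7.9e-31
# n = 1000      -> ~9.3e-302
# n = 1,000,000 -> underflows to 0.0 in float64; the exact value has
#                  roughly 300,000 zeros after the decimal point
```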
The Reality
Modern neural networks have millions or billions of parameters. The probability of ALL eigenvalues being positive (true minimum) is essentially zero. Almost every critical point is a saddle point with some positive and some negative eigenvalues.
Why Saddle Points Cause Problems
At a saddle point, the gradient is zero (or very small). Standard gradient descent relies on the gradient to update weights:

$$\theta \leftarrow \theta - \eta \, \nabla_\theta L(\theta)$$

If the gradient is zero, the parameters don't move, and the model appears to have converged.
Plateaus
Near saddle points, gradients are very small (not exactly zero), creating "plateaus" where learning stagnates. The loss appears flat even though there are directions that would decrease it.
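A minimal sketch of this stagnation: plain gradient descent on the toy saddle $f(x, y) = x^2 - y^2$, started with no component along the escape direction, slides straight into the saddle and stops.

```python
import numpy as np

def grad(theta):
    x, y = theta
    return np.array([2 * x, -2 * y])  # gradient of f(x, y) = x^2 - y^2

theta = np.array([1.0, 0.0])  # start exactly on the ridge y = 0
lr = 0.1
for step in range(100):
    theta = theta - lr * grad(theta)

# theta is ~[0, 0] and the gradient norm is ~0: vanilla gradient descent
# has settled on the saddle point and "thinks" it has converged.
print(theta, np.linalg.norm(grad(theta)))
```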
Saddle Point Stagnation
Analysis
Near the saddle, the gradient shrinks toward zero.
With momentum, the velocity stays high because it "remembers" the previous speed, allowing the optimizer to coast through the flat region.
Escaping Saddle Points
Since vanilla gradient descent gets stuck, we use techniques that add "momentum" or "noise":
Stochastic Gradient Descent (SGD)
The noise from random mini-batches "kicks" the parameters. Even if the average gradient is zero, individual batch gradients are likely non-zero in the escape direction.
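A rough simulation of this kick, using Gaussian noise as a simplified stand-in for mini-batch noise on the same toy saddle (real mini-batch noise is not Gaussian in general; this is only for illustration).

```python
import numpy as np

rng = np.random.default_rng(0)

def grad(theta):
    x, y = theta
    return np.array([2 * x, -2 * y])  # gradient of f(x, y) = x^2 - y^2

theta = np.array([0.0, 0.0])  # start exactly at the saddle point
lr = 0.1
for step in range(50):
    noisy_grad = grad(theta) + rng.normal(scale=0.01, size=2)  # simulated batch noise
    theta = theta - lr * noisy_grad

# |y| is now much larger than |x|: the noise kicked the parameters off the
# ridge, and the negative curvature along y carried them away from the saddle.
# (On this toy surface f is unbounded below along y; on a real loss surface
# the trajectory would instead descend into a lower basin.)
print(theta)
```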
Momentum
Like a ball rolling downhill. If it enters a flat saddle region, its existing velocity carries it across the plateau.
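A minimal sketch of the classic (heavy-ball) momentum update, fed a hand-made gradient sequence that suddenly goes flat to mimic a plateau.

```python
def momentum_step(theta, velocity, grad, lr=0.1, beta=0.9):
    """Heavy-ball momentum: velocity accumulates past gradients,
    so motion continues even where the current gradient is nearly zero."""
    velocity = beta * velocity - lr * grad
    return theta + velocity, velocity

theta, velocity = 0.0, 0.0
for g in [1.0, 1.0, 1.0, 0.0, 0.0, 0.0]:  # three real gradients, then a plateau
    theta, velocity = momentum_step(theta, velocity, g)
    print(f"grad={g:.1f}  velocity={velocity:+.3f}  theta={theta:+.3f}")

# Once the gradient hits zero, velocity decays slowly (times 0.9 per step)
# instead of vanishing instantly, so theta keeps moving across the plateau.
```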
Adam Optimizer
Combines momentum with adaptive learning rates, using both the first moment (a running mean of gradients, acting like velocity) and the second moment (a running mean of squared gradients) to scale each parameter's step. The default choice for most deep learning.
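A from-scratch sketch of the standard Adam update (first- and second-moment estimates with bias correction); in practice you would use a framework's built-in optimizer rather than this toy version.

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update: momentum (m) plus per-parameter scaling by sqrt(v)."""
    m = beta1 * m + (1 - beta1) * grad        # first moment: running mean of gradients
    v = beta2 * v + (1 - beta2) * grad ** 2   # second moment: running mean of squared gradients
    m_hat = m / (1 - beta1 ** t)              # bias correction (t counts from 1)
    v_hat = v / (1 - beta2 ** t)
    return theta - lr * m_hat / (np.sqrt(v_hat) + eps), m, v

# Minimize the 1-D quadratic f(x) = x^2 (gradient 2x) starting from x = 1.
theta, m, v = 1.0, 0.0, 0.0
for t in range(1, 2001):
    theta, m, v = adam_step(theta, 2 * theta, m, v, t)
print(theta)  # approaches 0
```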
Learning Rate Warmup
Start with a small learning rate and gradually increase it. This helps the optimizer navigate the initial chaotic region of the landscape before settling into meaningful optimization.
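A minimal sketch of a linear warmup schedule; the `base_lr` and `warmup_steps` values are arbitrary placeholders.

```python
def warmup_lr(step, base_lr=1e-3, warmup_steps=1000):
    """Linear warmup: ramp the learning rate from ~0 up to base_lr, then hold
    it constant (real schedules usually decay it again afterward)."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    return base_lr

for step in (0, 250, 500, 999, 1000, 5000):
    print(step, warmup_lr(step))
```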
Implications for Deep Learning
Good News: Local Minima Are Often Good
Research shows that in over-parameterized networks, most local minima have loss values very close to the global minimum. Getting "stuck" in a local minimum isn't as catastrophic as feared.
Flat Minima Generalize Better
Minima with small Hessian eigenvalues ("flat" regions) tend to generalize better than "sharp" minima. This is related to PAC-Bayesian theory and explains why SGD noise helps.
Over-Parameterization Helps
Having more parameters than needed creates many equivalent good solutions, making optimization easier. This is one reason why bigger models are often easier to train.
Skip Connections
ResNets with skip connections create a smoother loss landscape with fewer saddle points. This is why very deep networks became trainable with residual connections.
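For concreteness, a minimal residual block sketch assuming PyTorch, with matching input and output channel counts so the identity skip needs no projection.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Minimal residual block: output = x + F(x).
    The identity path gives gradients a direct route around the block,
    which is what smooths the loss landscape in practice."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + x)  # skip connection: add the input back in

x = torch.randn(1, 16, 8, 8)
print(ResidualBlock(16)(x).shape)  # torch.Size([1, 16, 8, 8])
```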