Introduction
A loss landscape is the surface formed by plotting the loss function over the parameter space. For a neural network with millions of parameters, this is a surface in million-dimensional space. We study 2D slices to build intuition about the geometry of optimization.
Training a neural network is equivalent to navigating this landscape to find a low point (minimum). The shape of the landscape determines whether gradient descent succeeds, how fast it converges, and whether the solution generalizes to new data.
Convex Landscapes
A convex loss function has no bad local minima: every local minimum is also a global minimum (and for a strictly convex function that minimum is unique). This is the ideal case: gradient descent is guaranteed to find an optimal solution, given a small enough learning rate.
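A minimal sketch in NumPy (using a made-up two-parameter quadratic bowl, not any particular model) shows this behavior: from an arbitrary starting point, plain gradient descent walks straight down to the unique minimum.

```python
import numpy as np

def loss(w):
    # A convex quadratic bowl: L(w) = w0^2 + 3 * w1^2 (toy example).
    return w[0]**2 + 3 * w[1]**2

def grad(w):
    return np.array([2 * w[0], 6 * w[1]])

w = np.array([4.0, -3.0])   # arbitrary starting point
lr = 0.1                    # small enough for this curvature
for _ in range(200):
    w = w - lr * grad(w)

print(w, loss(w))           # w ends up essentially at the unique minimum (0, 0)
```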
Convex Problems ✓
- Linear/Ridge regression
- Logistic regression
- Support Vector Machines
- Lasso (L1 regularization)
Non-Convex Problems ✗
- Neural networks (all depths)
- Matrix factorization
- Most deep learning models
- Many real-world problems
The Neural Network Challenge
Even a single hidden layer with two neurons creates a non-convex landscape: swapping the two hidden neurons produces a different parameter vector with exactly the same loss, so the surface has multiple separated global minima, which a convex function cannot have. More generally, composing nonlinear activation functions destroys convexity. Yet, mysteriously, SGD often finds good solutions anyway!
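This permutation symmetry is easy to check numerically. Below is a small NumPy sketch (with made-up random weights and data, purely for illustration) that evaluates a two-neuron network before and after swapping its hidden units: different parameters, identical loss.

```python
import numpy as np

def mse_loss(W1, b1, w2, x, y):
    # One hidden layer with 2 tanh neurons, then a linear output.
    h = np.tanh(W1 @ x + b1[:, None])
    pred = w2 @ h
    return np.mean((pred - y) ** 2)

rng = np.random.default_rng(0)
x, y = rng.normal(size=(3, 8)), rng.normal(size=8)   # tiny made-up dataset
W1, b1, w2 = rng.normal(size=(2, 3)), rng.normal(size=2), rng.normal(size=2)

# Swap the two hidden neurons (rows of W1, entries of b1 and w2).
perm = [1, 0]
print(mse_loss(W1, b1, w2, x, y))                    # original parameters
print(mse_loss(W1[perm], b1[perm], w2[perm], x, y))  # same loss, different parameters
```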
Interactive: Loss Landscapes
Explore different loss landscape shapes and watch gradient descent navigate them. Try the different landscape types to see how topology affects optimization.
Loss Landscape Explorer: Convex Bowl. A perfect bowl; gradient descent always finds the global minimum at (0, 0).
Saddle Points
In high dimensions, saddle points are far more common than local minima. A saddle point is a critical point (zero gradient) that is a minimum in some directions and a maximum in others, like the center of a horse saddle.
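To make this concrete, here is a small NumPy sketch that classifies a critical point by the signs of its Hessian eigenvalues, using the textbook saddle f(x, y) = x^2 - y^2 (an assumed toy function, not any particular network):

```python
import numpy as np

# Hessian of f(x, y) = x^2 - y^2 at its critical point (0, 0).
H = np.array([[2.0,  0.0],
              [0.0, -2.0]])

eigvals = np.linalg.eigvalsh(H)
if np.all(eigvals > 0):
    kind = "local minimum"
elif np.all(eigvals < 0):
    kind = "local maximum"
else:
    kind = "saddle point"

print(eigvals, kind)   # [-2.  2.] saddle point
```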
Why Saddles Dominate High Dimensions
At a critical point, the Hessian has n eigenvalues (one per dimension). For a local minimum, ALL must be positive. For a saddle, some are positive, some negative.
In a million-dimensional space, the probability that all million eigenvalues happen to be positive is astronomically small. Almost all critical points are saddles.
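Under the crude, purely illustrative assumption that each eigenvalue's sign is an independent coin flip, the chance that all n are positive is 0.5^n, which collapses quickly:

```python
import math

# P(all n eigenvalues positive) = 0.5**n under the coin-flip assumption.
for n in (2, 10, 100, 1_000_000):
    log10_p = n * math.log10(0.5)     # work in log-space to avoid underflow
    print(f"n = {n:>9}: P(all positive) ~ 10^{log10_p:.1f}")
```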
SGD Noise
Stochastic gradients provide random perturbations that help escape flat regions around saddles.
Momentum
Accumulated velocity pushes through saddle regions even when gradients are small.
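A minimal sketch (NumPy, reusing the same toy saddle f(x, y) = x^2 - y^2 as above, an assumption of this example) comparing plain gradient descent with heavy-ball momentum from a start just off the saddle's ridge: the momentum run leaves the flat region in far fewer steps.

```python
import numpy as np

def grad(p):
    # Gradient of the toy saddle f(x, y) = x^2 - y^2.
    x, y = p
    return np.array([2 * x, -2 * y])

def steps_to_escape(lr=0.05, beta=0.0, max_steps=10_000):
    p = np.array([1.0, 1e-6])        # start almost exactly on the saddle's ridge
    v = np.zeros(2)
    for t in range(max_steps):
        v = beta * v - lr * grad(p)  # heavy-ball update (beta = 0 is plain GD)
        p = p + v
        if abs(p[1]) > 1.0:          # escaped along the negative-curvature direction
            return t
    return max_steps

print("plain GD steps to escape:", steps_to_escape(beta=0.0))
print("momentum steps to escape:", steps_to_escape(beta=0.9))
```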
Second-Order Methods
Cubic regularization and trust-region methods explicitly detect and escape saddles.
Interactive: Escaping Saddles
Watch how SGD navigates a saddle point. The algorithm starts near the saddle (gradient ≈ 0) and must escape to make progress.
Escaping Saddle Points: random noise will eventually perturb y off zero and trigger the escape.
Key Insight
Stochastic noise randomly perturbs the particle. Once y becomes even slightly non-zero, the negative curvature (-2) in y amplifies this, causing rapid escape.
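Here is a sketch of that mechanism (NumPy, assuming the canonical saddle f(x, y) = x^2 - y^2, whose curvature along y is the -2 quoted above): the iterate starts exactly on y = 0, a small Gaussian kick knocks it off, and the negative curvature amplifies the offset exponentially.

```python
import numpy as np

rng = np.random.default_rng(42)

def grad(p):
    x, y = p
    return np.array([2 * x, -2 * y])    # gradient of f(x, y) = x^2 - y^2

p = np.array([1.0, 0.0])                # exactly on the saddle's stable manifold
lr, noise_std = 0.05, 1e-3

for step in range(121):
    noisy_grad = grad(p) + noise_std * rng.normal(size=2)   # SGD-style noise
    p = p - lr * noisy_grad
    if step % 30 == 0:
        print(f"step {step:3d}: x = {p[0]: .4f}, y = {p[1]: .4f}")

# |y| stays tiny at first, then grows exponentially once noise breaks the symmetry.
```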
Interactive: Minima Comparison
Compare sharp and flat minima. Notice how parameter perturbations affect loss differently.
Sharp vs Flat Minima
Error Sensitivity
- Sharp minimum: high sensitivity; small shifts cause large errors.
- Flat minimum: low sensitivity; robust to distribution shifts.
Generalization Gap
The test set is never exactly the same as the training set; the mismatch acts like a small shift of the parameters relative to the loss surface. At a flat minimum this shift barely changes the loss; at a sharp minimum it can cause a huge loss spike.
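A toy illustration (plain Python, with two made-up one-dimensional quadratic losses): both minima sit at the same point with zero training loss, but the same small parameter shift costs a hundred times more at the sharp one.

```python
def sharp_loss(w):
    return 50.0 * (w - 1.0) ** 2    # high curvature: a narrow, sharp valley

def flat_loss(w):
    return 0.5 * (w - 1.0) ** 2     # low curvature: a wide, flat valley

w_star = 1.0     # both losses are minimized at the same point, with loss 0
shift = 0.2      # stands in for the train/test mismatch perturbing the surface

print("sharp minimum after shift:", sharp_loss(w_star + shift))  # ~2.0
print("flat minimum after shift: ", flat_loss(w_star + shift))   # ~0.02
```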
Case Study: Understanding Training Dynamics
The Observation
You're training a neural network to classify images. The loss plateaus for 50 epochs, then suddenly drops by 30%. What's happening in the loss landscape?
The Landscape Explanation
The optimizer is likely stuck near a saddle point: the gradient is nearly zero, so progress is slow and the iterates "wander" in a flat region. Eventually:
- SGD noise accumulates in an unstable direction
- Momentum builds up enough velocity to escape
- The optimizer finds a descent direction and the loss drops rapidly (see the toy example after this list)
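A toy reproduction of this plateau-then-drop pattern (NumPy, using an assumed saddle-plus-bowl landscape rather than a real network): the printed loss barely moves while the iterate sits near the saddle, then falls sharply once the escape direction takes over.

```python
import numpy as np

rng = np.random.default_rng(7)

def loss(p):
    x, y = p
    # Toy landscape: a saddle at the origin, bounded below by a quartic in y.
    return x**2 - y**2 + 0.02 * y**4

def grad(p):
    x, y = p
    return np.array([2 * x, -2 * y + 0.08 * y**3])

p = np.array([1.0, 0.0])     # start on the saddle's flat direction
lr, noise_std = 0.05, 1e-3

for step in range(301):
    p = p - lr * (grad(p) + noise_std * rng.normal(size=2))
    if step % 50 == 0:
        print(f"step {step:3d}: loss = {loss(p):8.4f}")

# The loss hovers near zero for many steps, then drops sharply after the escape.
```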
The Solution
Several strategies can help escape saddle regions faster:
- Cyclical learning rates: Increase LR during plateaus to add energy (see the schedule sketch after this list)
- Use momentum: Accumulates velocity to push through flat regions
- Add gradient noise: Helps escape local flat spots
- Decrease batch size: More gradient noise = more exploration
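As a sketch of the first strategy, here is a hand-rolled triangular cyclical schedule in plain Python (the base/max learning rates and cycle length are arbitrary placeholders, not tuned recommendations):

```python
def triangular_lr(step, base_lr=1e-3, max_lr=1e-1, cycle_len=200):
    """Triangular cyclical learning rate: ramp up for half a cycle, then back down."""
    pos = step % cycle_len
    half = cycle_len / 2
    frac = pos / half if pos < half else (cycle_len - pos) / half
    return base_lr + (max_lr - base_lr) * frac

# The learning rate periodically spikes, injecting energy during plateaus.
for step in (0, 50, 100, 150, 200):
    print(step, round(triangular_lr(step), 4))
```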
ML Applications & Insights
Mode Connectivity
Research shows different minima found by SGD are often connected by paths of low loss. The landscape has "tunnels" connecting good solutions. It's less rugged than once thought.
Lottery Ticket Hypothesis
Sparse subnetworks, trained in isolation from their original initialization, can match the full network's performance. This suggests the loss landscape contains special "winning tickets". Initialization matters more than we thought.
Skip Connections (ResNets)
ResNets make the loss landscape dramatically smoother. Skip connections create "highways" through the landscape, making optimization easier. This is why ResNets train better than plain deep networks.
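A minimal sketch of the idea in PyTorch (a toy fully-connected block for illustration, not the convolutional blocks of the original ResNet): the output is the input plus a learned correction, so the identity path gives gradients a direct route through the network.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(dim, dim),
            nn.ReLU(),
            nn.Linear(dim, dim),
        )

    def forward(self, x):
        # Skip connection: the identity path keeps gradients flowing even when
        # the learned branch is poorly conditioned early in training.
        return x + self.body(x)

x = torch.randn(4, 16)
print(ResidualBlock(16)(x).shape)   # torch.Size([4, 16])
```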
Learning Rate Schedules
Warmup, cosine annealing, and cyclical LR are all strategies informed by understanding loss landscape geometry. They help navigate complex topologies.
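For example, a warmup-plus-cosine-annealing schedule can be written in a few lines (plain Python; the peak learning rate and step counts are illustrative placeholders):

```python
import math

def warmup_cosine_lr(step, peak_lr=3e-4, warmup_steps=1_000, total_steps=10_000):
    """Linear warmup to peak_lr, then cosine decay to zero."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return peak_lr * 0.5 * (1 + math.cos(math.pi * progress))

for step in (0, 500, 1_000, 5_000, 10_000):
    print(step, f"{warmup_cosine_lr(step):.2e}")
```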
Batch Size Effects
Large batches tend to converge to sharp minima, which often generalize worse; small batches add gradient noise that helps find flat minima, which often generalize better. This is one explanation for the "generalization gap" that appears as batch size increases.
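A quick numerical sketch of the mechanism (NumPy, on a made-up linear-regression problem): the standard deviation of the minibatch gradient shrinks roughly like 1/sqrt(batch size), so smaller batches take noisier, more exploratory steps.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 5))                    # made-up inputs
true_w = rng.normal(size=5)
y = X @ true_w + 0.5 * rng.normal(size=10_000)      # noisy targets

w = np.zeros(5)                                     # evaluate gradient noise at a fixed point

def minibatch_grad(batch_size):
    idx = rng.choice(len(X), size=batch_size, replace=False)
    Xb, yb = X[idx], y[idx]
    return 2 * Xb.T @ (Xb @ w - yb) / batch_size    # MSE gradient on the minibatch

for b in (8, 64, 512):
    grads = np.stack([minibatch_grad(b) for _ in range(2_000)])
    print(f"batch {b:4d}: gradient std ~ {grads.std(axis=0).mean():.3f}")
```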
Neural Architecture Search
Some architectures have inherently smoother loss landscapes. NAS methods can discover architectures that are easier to optimize, not just more accurate.