Optimization

Loss Landscapes

The geometry of neural network optimization. Understanding the shape of the loss surface explains why some models train easily and others don't.

Introduction

A loss landscape is the surface formed by plotting the loss function over the parameter space. For a neural network with millions of parameters, this is a surface in million-dimensional space. We study 2D slices to build intuition about the geometry of optimization.

Training a neural network is equivalent to navigating this landscape to find a low point (minimum). The shape of the landscape determines whether gradient descent succeeds, how fast it converges, and whether the solution generalizes to new data.

Why It Matters

The shape of the loss landscape determines: (1) whether gradient descent finds a good solution, (2) how fast it converges, and (3) whether the solution generalizes. Understanding landscapes informs optimizer choices (momentum, Adam) and architecture design (ResNets, skip connections).

Convex Landscapes

A convex loss function has no spurious local minima: any local minimum is a global minimum. This is the ideal case: gradient descent is guaranteed to converge to an optimal solution (given a sufficiently small learning rate).

f(\theta) \text{ is convex if } f(\alpha\theta_1 + (1-\alpha)\theta_2) \leq \alpha f(\theta_1) + (1-\alpha) f(\theta_2) \quad \text{for all } \alpha \in [0, 1]
The graph of the function lies on or below the chord connecting any two of its points.
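The inequality is easy to probe numerically. Below is a minimal sketch (the sample functions and ranges are illustrative choices, not from this module) that tests random triples (θ1, θ2, α) for a convex quadratic and for a non-convex function:

```python
import numpy as np

# Numerically probing the convexity inequality on random points
# (sample functions and ranges are illustrative).
def violates_convexity(f, n_trials=10_000, seed=0):
    rng = np.random.default_rng(seed)
    for _ in range(n_trials):
        t1, t2 = rng.uniform(-3, 3, size=2)
        a = rng.uniform(0, 1)
        if f(a * t1 + (1 - a) * t2) > a * f(t1) + (1 - a) * f(t2) + 1e-12:
            return True                   # found a violating triple
    return False                          # no violation found (consistent with convexity)

quadratic = lambda t: t ** 2                      # convex
wavy = lambda t: t ** 2 + 3 * np.sin(3 * t)       # non-convex

print(violates_convexity(quadratic))   # False
print(violates_convexity(wavy))        # True
```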

Convex Problems ✓

  • Linear/Ridge regression
  • Logistic regression
  • Support Vector Machines
  • Lasso (L1 regularization)

Non-Convex Problems ✗

  • Neural networks (all depths)
  • Matrix factorization
  • Most deep learning models
  • Many real-world problems

The Neural Network Challenge

Even a single hidden layer with 2 neurons creates a non-convex landscape. The composition of nonlinear activation functions guarantees non-convexity. Yet, mysteriously, SGD often finds good solutions anyway!
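One concrete way to see this is the permutation symmetry of hidden neurons: swapping the two hidden units gives a second parameter vector with exactly the same loss, yet the midpoint between the two parameter vectors implements a different function, so the convexity inequality above can fail. A minimal sketch (the toy data, targets, and weights are arbitrary choices):

```python
import numpy as np

# Non-convexity of a 1-hidden-layer, 2-neuron network via permutation symmetry
# (toy data, targets, and weights are arbitrary illustrative choices).
rng = np.random.default_rng(0)
X = rng.standard_normal((32, 1))
y = np.tanh(2.0 * X) - np.tanh(-1.0 * X)             # target the net can fit exactly

def loss(params):
    w1, w2, v1, v2 = params                           # two hidden tanh units
    pred = v1 * np.tanh(w1 * X) + v2 * np.tanh(w2 * X)
    return np.mean((pred - y) ** 2)

theta_a = np.array([2.0, -1.0, 1.0, -1.0])            # fits the target perfectly
theta_b = theta_a[[1, 0, 3, 2]]                       # same network, neurons swapped
theta_mid = 0.5 * (theta_a + theta_b)                 # midpoint in parameter space

print(loss(theta_a), loss(theta_b))   # identical (both zero)
print(loss(theta_mid))                # strictly larger => convexity inequality fails
```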

Interactive: Loss Landscapes

Explore different loss landscape shapes and watch gradient descent navigate them. Try the different landscape types to see how topology affects optimization.

Loss Landscape Explorer

Example landscape: Convex Bowl

f(x, y) = x^2 + 0.5y^2

A perfect bowl. Gradient descent (the demo uses SGD with momentum 0.9) always finds the global minimum at (0, 0).
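For reference, here is a minimal gradient-descent-with-momentum loop on this bowl (momentum 0.9 as in the demo; the start point and learning rate are illustrative):

```python
import numpy as np

# Gradient descent with momentum on f(x, y) = x^2 + 0.5*y^2
# (start point and learning rate are illustrative).
def grad(p):
    x, y = p
    return np.array([2.0 * x, 1.0 * y])   # df/dx = 2x, df/dy = y

p = np.array([2.0, -2.5])                 # arbitrary starting point
v = np.zeros(2)                           # velocity
lr, momentum = 0.1, 0.9

for _ in range(300):
    v = momentum * v - lr * grad(p)
    p = p + v

print(p)   # approaches the global minimum at (0, 0)
```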

Saddle Points

In high dimensions, saddle points are far more common than local minima. A saddle point is a critical point (zero gradient) that is a minimum in some directions and a maximum in others, like the center of a horse saddle.

Why Saddles Dominate High Dimensions

At a critical point, the Hessian has n eigenvalues (one per dimension). For a local minimum, ALL must be positive. For a saddle, some are positive, some negative.

In a million-dimensional space, the probability that all million eigenvalues happen to be positive is astronomically small. Almost all critical points are saddles.
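A toy experiment illustrates the argument. Assuming Hessians at critical points behave like random symmetric matrices (a simplifying random-matrix model, not a statement about any particular network), the chance that every eigenvalue is positive collapses as the dimension grows:

```python
import numpy as np

# Toy random-matrix model of Hessians at critical points: how often are
# ALL eigenvalues positive (i.e. the point looks like a local minimum)?
rng = np.random.default_rng(0)

for n in [1, 2, 5, 10, 20]:
    trials = 2000
    all_positive = 0
    for _ in range(trials):
        a = rng.standard_normal((n, n))
        hessian = (a + a.T) / 2                    # random symmetric matrix
        all_positive += np.all(np.linalg.eigvalsh(hessian) > 0)
    print(f"dim={n:3d}  P(all eigenvalues > 0) ~ {all_positive / trials:.4f}")
```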

SGD Noise

Stochastic gradients provide random perturbations that help escape flat regions around saddles.

Momentum

Accumulated velocity pushes through saddle regions even when gradients are small.

Second-Order Methods

Cubic regularization and trust-region methods explicitly detect and escape saddles.

Interactive: Escaping Saddles

Watch how SGD navigates a saddle point. The algorithm starts near the saddle (gradient ≈ 0) and must escape to make progress.

Escaping Saddle Points

The demo uses a saddle with curvature +2 along x (the stable direction) and -2 along y (the unstable direction), corresponding to f(x, y) = x^2 - y^2. The optimizer starts at (x, y) = (0.8, 0), where the gradient \nabla f = (1.60, 0.00) points only along the stable direction, so plain gradient descent slides toward the saddle and stalls until random noise perturbs y off zero.

Key Insight

Stochastic noise randomly perturbs the particle. Once y becomes even slightly non-zero, the negative curvature (-2) in y amplifies the perturbation, causing rapid escape.
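A minimal sketch of this escape dynamic on the same saddle f(x, y) = x^2 - y^2, starting from (0.8, 0) as in the demo (the learning rate and noise scale are illustrative):

```python
import numpy as np

# Noisy gradient descent on the saddle f(x, y) = x^2 - y^2, starting at (0.8, 0)
# as in the demo (learning rate and noise scale are illustrative).
rng = np.random.default_rng(1)

def grad(x, y):
    return np.array([2.0 * x, -2.0 * y])

theta = np.array([0.8, 0.0])
lr, noise_scale = 0.05, 0.01

for step in range(120):
    g = grad(*theta) + noise_scale * rng.standard_normal(2)   # SGD-style noise
    theta = theta - lr * g
    if step % 30 == 0:
        print(f"step {step:3d}  x = {theta[0]:+.4f}  y = {theta[1]:+.4f}")
# x decays toward 0 (stable direction); once noise nudges y off zero, the
# negative curvature amplifies it and |y| grows rapidly (the escape). On this
# unbounded toy function |y| keeps growing; in a real landscape the optimizer
# would then descend into a nearby basin.
```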

Sharp vs Flat Minima

Not all minima are equal. Flat minima (wide valleys) tend to generalize better than sharp minima (narrow ravines). This is one of the most important discoveries in understanding deep learning.

Flat Minimum

  • Small Hessian eigenvalues - gentle curvature
  • Robust to perturbations - parameter noise doesn't hurt
  • Better generalization - typically lower test error
  • Found by: SGD with small batches, high learning rates, SAM optimizer

Sharp Minimum

  • Large Hessian eigenvalues - steep curvature
  • Sensitive to perturbations - small changes cause large loss increase
  • May overfit - memorizes training data
  • Found by: Full-batch GD, very small learning rates
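The contrast is easy to quantify on a one-dimensional toy model (the curvatures 10 and 0.1 below are arbitrary illustrative values):

```python
import numpy as np

# Loss increase under parameter noise for a sharp vs a flat 1-D minimum
# (curvatures 10 and 0.1 are arbitrary illustrative choices).
rng = np.random.default_rng(0)

sharp = lambda t: 10.0 * t ** 2      # large curvature at the minimum t = 0
flat  = lambda t: 0.1 * t ** 2       # small curvature at the same minimum

noise = rng.normal(0.0, 0.2, size=10_000)   # perturbations of the parameter
print("sharp:", sharp(noise).mean())        # ~10 * 0.04 = 0.4
print("flat :", flat(noise).mean())         # ~0.1 * 0.04 = 0.004
```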

SAM: Sharpness-Aware Minimization

SAM (proposed by Google Research, 2020) explicitly seeks flat minima by minimizing the loss in the "worst case" neighborhood around the current parameters:

\min_\theta \max_{\|\epsilon\| \leq \rho} L(\theta + \epsilon)

This ensures the solution is robust to small perturbations, dramatically improving generalization.
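A minimal sketch of one SAM update on a toy quadratic loss (plain numpy, not the reference implementation; the values of rho and the learning rate are illustrative):

```python
import numpy as np

# One SAM update on a toy quadratic loss (illustrative rho and learning rate).
def loss(theta):
    return 0.5 * np.sum(theta ** 2)

def grad(theta):
    return theta                      # gradient of the toy quadratic

def sam_step(theta, lr=0.1, rho=0.05):
    g = grad(theta)
    # 1. Move to the (approximate) worst point in the rho-ball around theta.
    eps = rho * g / (np.linalg.norm(g) + 1e-12)
    # 2. Take the descent step from theta using the gradient at theta + eps.
    return theta - lr * grad(theta + eps)

theta = np.array([1.0, -2.0])
for _ in range(5):
    theta = sam_step(theta)
    print(theta, round(loss(theta), 4))
```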

Interactive: Minima Comparison

Compare sharp and flat minima. Notice how parameter perturbations affect loss differently.

Sharp vs Flat Minima (demo)

With parameter noise of 0.20, the loss at the sharp minimum rises to roughly 0.32 (high sensitivity: small shifts cause large errors), while the loss at the flat minimum rises only to about 0.03 (low sensitivity: robust to distribution shifts).

Generalization Gap

The test set is never exactly the same as the training set; the effect is similar to shifting the parameters slightly. A flat minimum makes it unlikely that this shift causes a catastrophe (a huge loss spike).

Case Study: Understanding Training Dynamics

The Observation

You're training a neural network to classify images. The loss plateaus for 50 epochs, then suddenly drops by 30%. What's happening in the loss landscape?

The Landscape Explanation

The optimizer is likely stuck near a saddle point. The gradient is small, so progress is slow. The optimizer is "wandering" in a flat region. Eventually:

  • SGD noise accumulates in an unstable direction
  • Momentum builds up enough velocity to escape
  • The optimizer finds a descent direction and the loss drops rapidly

The Solution

Several strategies can help escape saddle regions faster:

  • Cyclical learning rates: Increase LR during plateaus to add energy (see the schedule sketch after this list)
  • Use momentum: Accumulates velocity to push through flat regions
  • Add gradient noise: Helps escape local flat spots
  • Decrease batch size: More stochastic gradients = more exploration
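A minimal sketch of a triangular cyclical learning-rate schedule (the bounds and cycle length are illustrative, not taken from the case study):

```python
# Triangular cyclical learning-rate schedule (bounds and cycle length are illustrative).
def cyclical_lr(step, base_lr=0.001, max_lr=0.1, cycle_len=200):
    cycle_pos = (step % cycle_len) / cycle_len        # position within the cycle
    tri = 1.0 - abs(2.0 * cycle_pos - 1.0)            # triangle wave: 0 -> 1 -> 0
    return base_lr + (max_lr - base_lr) * tri

for s in [0, 50, 100, 150, 200]:
    print(s, round(cyclical_lr(s), 4))                # lr rises, peaks, then falls
```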

ML Applications & Insights

Mode Connectivity

Research shows different minima found by SGD are often connected by paths of low loss. The landscape has "tunnels" connecting good solutions. It's less rugged than once thought.

Lottery Ticket Hypothesis

Sparse subnetworks, identified at initialization and trained in isolation, can match the full network's performance. This suggests the loss landscape contains special "winning ticket" subnetworks. Initialization matters more than we thought.

Skip Connections (ResNets)

ResNets make the loss landscape dramatically smoother. Skip connections create "highways" through the landscape, making optimization easier. This is why ResNets train better than plain deep networks.
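As a reminder of what a skip connection is, here is a minimal sketch of y = x + F(x) in plain numpy (a toy two-layer block, not an actual ResNet layer):

```python
import numpy as np

# A skip-connection block, y = x + F(x) (toy two-layer block, not a real ResNet layer).
def relu(z):
    return np.maximum(0.0, z)

def residual_block(x, w1, w2):
    # The block only has to learn a small correction F(x); gradients also
    # flow directly through the "+ x" identity path.
    return x + relu(x @ w1) @ w2

rng = np.random.default_rng(0)
d = 8
x = rng.standard_normal((1, d))
w1 = 0.1 * rng.standard_normal((d, d))
w2 = 0.1 * rng.standard_normal((d, d))
print(residual_block(x, w1, w2).shape)   # (1, 8): same shape as the input
```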

Learning Rate Schedules

Warmup, cosine annealing, and cyclical LR are all strategies informed by understanding loss landscape geometry. They help navigate complex topologies.
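A minimal sketch of linear warmup followed by cosine annealing (all hyperparameters are illustrative and not tied to any specific library):

```python
import math

# Linear warmup followed by cosine annealing (illustrative hyperparameters).
def lr_at(step, total_steps=1000, warmup_steps=100, base_lr=0.1, min_lr=1e-4):
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps        # linear warmup
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))   # decays 1 -> 0
    return min_lr + (base_lr - min_lr) * cosine

for s in [0, 50, 100, 500, 999]:
    print(s, round(lr_at(s), 5))
```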

Batch Size Effects

Large batches converge to sharp minima (poor generalization). Small batches add noise that helps find flat minima (better generalization). This explains the "generalization gap" as batch size increases.

Neural Architecture Search

Some architectures have inherently smoother loss landscapes. NAS methods can discover architectures that are easier to optimize, not just more accurate.