Optimization


Newton's Method

Second-order optimization for faster convergence. Understanding curvature for smarter steps.

Beyond Gradient Descent

Gradient descent uses only first derivatives. It knows the direction of steepest descent but not the curvature. Newton's Method uses second derivatives (the Hessian) to take smarter steps.

Gradient Descent

Uses ∇f only. Linear approximation. Takes many small steps.

Newton's Method

Uses ∇f and ∇²f. Quadratic approximation. Fewer but bigger steps.

The Trade-off

Newton's Method converges faster (quadratically rather than linearly), but each iteration is more expensive. It shines on small-to-medium-sized problems where computing the Hessian is feasible.

The Algorithm

For root-finding (solving f(x) = 0):

x_{n+1} = x_n - \frac{f(x_n)}{f'(x_n)}
Follow the tangent line to where it crosses zero.

For optimization (finding minimum of g(x)), we find roots of g'(x) = 0:

x_{n+1} = x_n - \frac{g'(x_n)}{g''(x_n)}
Step size is inversely proportional to curvature.
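
As a minimal sketch (plain Python, not part of the original page), the same update handles both cases; the optimization version simply applies the root-finding step to g'(x):

```python
def newton_root(f, df, x0, tol=1e-12, max_iter=50):
    """Solve f(x) = 0 by repeatedly following the tangent line to zero."""
    x = x0
    for _ in range(max_iter):
        step = f(x) / df(x)        # assumes df(x) != 0
        x -= step
        if abs(step) < tol:
            break
    return x

def newton_minimize_1d(dg, d2g, x0, **kwargs):
    """Minimize g(x) by applying the root-finding update to g'(x)."""
    return newton_root(dg, d2g, x0, **kwargs)

# Example: sqrt(2) is the positive root of f(x) = x^2 - 2
print(newton_root(lambda x: x**2 - 2, lambda x: 2 * x, x0=3.0))   # 1.41421356...
```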

Geometric Intuition

At each point, fit a quadratic (Taylor approximation) to the function using the first and second derivatives. The minimum of this parabola is the next guess. For actual quadratics, Newton finds the minimum in one step!
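
To see the one-step property concretely, take the made-up quadratic g(x) = 3(x - 2)² + 1: the parabola Newton fits is the function itself, so a single step from any starting point lands on the minimizer x = 2.

```python
# g(x) = 3*(x - 2)**2 + 1, so g'(x) = 6*(x - 2) and g''(x) = 6
x0 = 10.0
x1 = x0 - (6 * (x0 - 2)) / 6.0   # one Newton step
print(x1)                        # 2.0 exactly, regardless of x0
```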

Interactive: Root Finding

Watch Newton's Method find √2 by solving x² - 2 = 0. Notice how quickly the error decreases.

[Interactive visualization: Newton iterations for f(x) = x² - 2 starting from x₀ = 3.0 (f(x₀) = 7), showing the current estimate, the error, and the number of correct digits at each step.]
Quadratic Convergence

Notice how the number of correct digits roughly doubles with every step. Gradient descent would take thousands of steps to match this precision.

Interactive: Hessian Curvature

Explore how the Hessian (curvature) determines Newton's step size. Compare with gradient descent using a fixed learning rate.

[Interactive visualization: adjust the curvature f''(x) from flat to steep and compare the resulting step sizes.]

Newton step: \Delta x = -f'/f''. Adapts to curvature: if the curve is steep (large f''), the step is scaled down.

Gradient descent step: \Delta x = -\eta f'. Fixed learning rate; ignores curvature.

Key Insight

The Hessian H = \nabla^2 f acts as a "smart scaling matrix".

In steep directions (high curvature), H^{-1} shrinks the gradient to prevent overshooting. In flat directions, it expands the step to speed up progress.
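
A small NumPy sketch (using the illustrative function f(x, y) = 50x² + 0.5y², not from the original) makes the scaling concrete: the Newton step divides each gradient component by its curvature, while a fixed learning rate treats every direction the same.

```python
import numpy as np

# f(x, y) = 50*x**2 + 0.5*y**2: steep in x, flat in y
H = np.diag([100.0, 1.0])                # Hessian: d²f/dx² = 100, d²f/dy² = 1
theta = np.array([1.0, 1.0])
grad = H @ theta                         # for this quadratic, ∇f = H θ

gd_step = -0.01 * grad                   # fixed learning rate, same scale everywhere
newton_step = -np.linalg.solve(H, grad)  # H^{-1} rescales each direction by curvature

print(gd_step)       # [-1.   -0.01]: lr tuned for the steep x direction, y barely moves
print(newton_step)   # [-1.   -1.  ]: lands exactly on the minimum at (0, 0)
```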

Quadratic Convergence

Newton's Method has quadratic convergence. Near the solution, the error squares each iteration.

|x_{n+1} - x^*| \approx C \cdot |x_n - x^*|^2
Each iteration roughly doubles the number of correct digits.

Example: Computing √2

Iteration (n)   x_n         Error       Correct Digits
0               2.000000    0.585786    0
1               1.500000    0.085786    1
2               1.416667    0.002453    3
3               1.414216    0.000002    6

Compare this to gradient descent's linear convergence, where each iteration reduces error by a constant factor.

Multivariate Newton's Method

For functions of multiple variables, the Hessian matrix replaces the second derivative.

\theta_{n+1} = \theta_n - H^{-1} \nabla f(\theta_n)
H = Hessian matrix (second derivatives), ∇f = gradient
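
A minimal NumPy sketch of the update, using a linear solve instead of an explicit inverse (the usual practice); the quadratic test function here is an illustrative choice:

```python
import numpy as np

def newton_multivariate(grad, hess, theta0, tol=1e-10, max_iter=100):
    """Newton's Method: solve H(theta) @ step = grad(theta) each iteration."""
    theta = np.asarray(theta0, dtype=float)
    for _ in range(max_iter):
        g = grad(theta)
        if np.linalg.norm(g) < tol:
            break
        theta = theta - np.linalg.solve(hess(theta), g)  # never form H^{-1} explicitly
    return theta

# Example: minimize the (made-up) quadratic f(θ) = 0.5 θᵀAθ - bᵀθ
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, 1.0])
theta_star = newton_multivariate(lambda t: A @ t - b, lambda t: A, [5.0, -5.0])
print(theta_star)   # equals np.linalg.solve(A, b); reached in a single Newton step
```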

The Cost: O(n³)

The Hessian has O(n²) entries to compute and store, and inverting it (or solving the Newton system) costs O(n³). For neural nets with millions of parameters, this is infeasible. See quasi-Newton methods (L-BFGS) below for practical alternatives.

Case Study: Logistic Regression

The Problem

Binary classification with logistic regression. Minimize the cross-entropy loss, which is convex. With n samples and d features, the Hessian is d×d.

Newton vs Gradient Descent

For logistic regression with a modest number of features, the Newton step is cheap: forming the d×d Hessian costs O(nd²) and solving the Newton system costs O(d³) (sketched below).

  • Newton: 5-10 iterations to convergence
  • GD: Thousands of iterations needed
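
A sketch of the Newton update for logistic regression (often called iteratively reweighted least squares), assuming a design matrix X of shape (n, d) and 0/1 labels y; the synthetic data at the end is purely illustrative:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_newton(X, y, n_iter=10):
    """Newton's Method on the (convex) cross-entropy loss; w starts at 0."""
    d = X.shape[1]
    w = np.zeros(d)
    for _ in range(n_iter):
        p = sigmoid(X @ w)                 # predicted probabilities, shape (n,)
        grad = X.T @ (p - y)               # gradient: O(nd)
        s = p * (1 - p)                    # per-sample curvature weights
        H = X.T @ (X * s[:, None])         # Hessian XᵀSX: O(nd²)
        w -= np.linalg.solve(H, grad)      # Newton step: O(d³) solve
    return w

# Tiny synthetic check (illustrative data, not from the text)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (X @ np.array([1.5, -2.0, 0.5]) + rng.normal(size=200) > 0).astype(float)
print(logistic_newton(X, y))   # converges in a handful of Newton iterations
```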

In Practice

sklearn.linear_model.LogisticRegression uses the quasi-Newton L-BFGS solver by default. This is part of why scikit-learn's logistic regression trains so fast!
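
For example, using scikit-learn's public API (the exact iteration count depends on the data):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# Default solver is the quasi-Newton 'lbfgs'; 'newton-cg' (and 'newton-cholesky'
# in recent versions) use curvature information even more directly.
clf = LogisticRegression(solver="lbfgs", max_iter=100).fit(X, y)
print(clf.n_iter_)   # typically a few dozen iterations at most
```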

[Interactive visualization: loss vs. iterations for Newton's Method (O(nd² + d³) per iteration) and gradient descent (O(nd) per iteration) on the same problem.]

Why Newton Wins

Newton uses curvature (Hessian) to take optimal steps. For convex problems like logistic regression, it converges in 5-10 iterations. GD needs thousands.


Limitations & Failure Modes

1. Computational Cost

O(n²) Hessian computation + O(n³) matrix inversion. Infeasible for deep learning's millions of parameters.

2. Singular Hessian

At saddle points, the Hessian may be singular or have negative eigenvalues. The method can fail or take steps in the wrong direction.

3. Divergence from Poor Initialization

Unlike gradient descent, Newton's Method can diverge if started too far from the solution; quadratic convergence only holds near the optimum (see the sketch at the end of this section).

4. Non-Convex Landscapes

In deep learning's non-convex loss landscapes, Newton can get attracted to saddle points instead of minima. SGD noise actually helps escape these!
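
A classic illustration of failure mode 3 is root-finding on f(x) = arctan(x) (an example not from the original page): start close to the root and Newton converges rapidly; start too far out and each tangent-line step overshoots by more than the last.

```python
import math

def newton_arctan(x0, n_steps=6):
    """Newton iterations for f(x) = atan(x), whose only root is x = 0."""
    x = x0
    for _ in range(n_steps):
        x = x - math.atan(x) * (1 + x**2)   # f/f' with f = atan, f' = 1/(1+x^2)
    return x

print(newton_arctan(1.0))   # converges toward the root at 0
print(newton_arctan(2.0))   # overshoots further on every step and diverges
```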

ML Applications & Variants

L-BFGS

Limited-memory BFGS approximates the Hessian inverse using past gradients. O(n) storage instead of O(n²). Used in scikit-learn and scipy.optimize.
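
For example, SciPy exposes L-BFGS through scipy.optimize.minimize; the Rosenbrock test function used here is a standard benchmark, not tied to the text:

```python
import numpy as np
from scipy.optimize import minimize, rosen, rosen_der

x0 = np.zeros(5)
result = minimize(rosen, x0, jac=rosen_der, method="L-BFGS-B")
print(result.x)    # close to the known minimizer [1, 1, 1, 1, 1]
print(result.nit)  # iteration count: far fewer than plain gradient descent needs
```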

Natural Gradient

Uses the Fisher Information Matrix (expected Hessian) instead of the true Hessian. Better for probability distributions and policy gradients in RL.

Hessian-Free Optimization

Computes Hessian-vector products without forming the full Hessian. Uses conjugate gradient for the linear solve. More feasible for deep learning.
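
A rough sketch of the idea, using a finite-difference Hessian-vector product and SciPy's conjugate-gradient solver (real implementations typically get exact Hessian-vector products from automatic differentiation instead):

```python
import numpy as np
from scipy.sparse.linalg import LinearOperator, cg

def newton_cg_step(grad_fn, x, eps=1e-6):
    """Approximately solve H @ step = -grad without ever forming H."""
    g = grad_fn(x)
    d = x.size

    def hvp(v):
        # finite-difference Hessian-vector product: H v ≈ (∇f(x + εv) - ∇f(x)) / ε
        return (grad_fn(x + eps * v) - g) / eps

    H_op = LinearOperator((d, d), matvec=hvp)
    step, _info = cg(H_op, -g, maxiter=50)     # conjugate-gradient linear solve
    return x + step

# Check on a simple quadratic with gradient A @ x (A is an illustrative choice)
A = np.diag([1.0, 10.0, 100.0])
x_new = newton_cg_step(lambda x: A @ x, np.array([1.0, 1.0, 1.0]))
print(x_new)   # ≈ [0, 0, 0]: the exact Newton step, found without building A
```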

Trust Region Methods

TRPO (Trust Region Policy Optimization) uses a quadratic approximation with curvature from the Fisher matrix. Critical for stable policy learning in RL.