Beyond Gradient Descent
Gradient descent uses only first derivatives. It knows the direction of steepest descent but not the curvature. Newton's Method uses second derivatives (the Hessian) to take smarter steps.
Gradient Descent
Uses ∇f only. Linear approximation. Takes many small steps.
Newton's Method
Uses ∇f and ∇²f. Quadratic approximation. Fewer but bigger steps.
The Trade-off
Newton's Method converges faster (quadratic vs. linear convergence), but each iteration is more expensive. It shines on small-to-medium-sized problems where computing the Hessian is feasible.
The Algorithm
For root-finding (solving f(x) = 0), iterate x_{n+1} = x_n − f(x_n) / f'(x_n).
For optimization (finding a minimum of g(x)), apply the same iteration to g'(x) = 0: x_{n+1} = x_n − g'(x_n) / g''(x_n).
Geometric Intuition
At each point, fit a quadratic (Taylor approximation) to the function using the first and second derivatives. The minimum of this parabola is the next guess. For actual quadratics, Newton finds the minimum in one step!
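As a quick check of that one-step claim, here is a minimal sketch (the quadratic g below is an illustrative choice, not one from this page):

```python
# Newton's update for minimization: x_new = x - g'(x) / g''(x).
# For a true quadratic, g' is linear and g'' is constant, so a single step
# lands exactly on the minimum no matter where we start.
def newton_step(x, dg, d2g):
    return x - dg(x) / d2g(x)

# g(x) = 3x^2 - 12x + 5, minimized at x = 2
dg = lambda x: 6 * x - 12      # g'(x)
d2g = lambda x: 6.0            # g''(x)

for x0 in (-10.0, 0.0, 7.5):
    print(x0, "->", newton_step(x0, dg, d2g))   # prints 2.0 every time
```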
Interactive: Root Finding
Watch Newton's Method find √2 by solving x² - 2 = 0. Notice how quickly the error decreases.
Newton's Method Visualization (interactive): the iteration table starts at x_0 = 3.0000 with f(x_0) = 7.0000 and adds a row per step.
Notice how the number of correct digits roughly doubles with every step. Gradient descent would take thousands of steps to match this precision.
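Here is a short sketch that reproduces the trace the visualization shows, starting from the same initial guess x_0 = 3:

```python
import math

# Newton's root-finding iteration for f(x) = x^2 - 2, starting at x0 = 3.0.
f = lambda x: x**2 - 2
df = lambda x: 2 * x

x = 3.0
print(f"{'iter':>4} {'x_n':>12} {'f(x_n)':>12} {'error':>10}")
for n in range(6):
    err = abs(x - math.sqrt(2))
    print(f"{n:>4} {x:>12.8f} {f(x):>12.8f} {err:>10.2e}")
    x = x - f(x) / df(x)   # Newton step: x_{n+1} = x_n - f(x_n) / f'(x_n)
```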
Interactive: Hessian Curvature
Explore how the Hessian (curvature) determines Newton's step size. Compare with gradient descent using a fixed learning rate.
Key Insight
The Hessian acts as a "smart" scaling matrix: in steep directions (high curvature) it shrinks the step to prevent overshooting, and in flat directions (low curvature) it lengthens the step to speed up progress.
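The effect is easy to see on an ill-conditioned quadratic. A minimal sketch (the matrix and learning rate below are illustrative choices, not taken from the demo):

```python
import numpy as np

# f(x) = 0.5 * x^T H x, steep along x[0] (curvature 100) and flat along x[1] (curvature 1).
H = np.diag([100.0, 1.0])
grad = lambda x: H @ x          # gradient of the quadratic

x = np.array([1.0, 1.0])

gd_step = 0.01 * grad(x)                    # fixed learning rate: [1.0, 0.01], barely moves in the flat direction
newton_step = np.linalg.solve(H, grad(x))   # H^{-1} grad: [1.0, 1.0], rescaled per direction

print("gradient descent step:", gd_step)
print("Newton step:          ", newton_step)   # x - newton_step lands exactly on the minimum at the origin
```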
Quadratic Convergence
Newton's Method has quadratic convergence: near the solution, the error is roughly squared at each iteration, |e_{n+1}| ≈ C·|e_n|².
Example: Computing √2
| Iteration (n) | x_n | Error | Correct Digits |
|---|---|---|---|
| 0 | 2.000000 | 0.585786 | 0 |
| 1 | 1.500000 | 0.085786 | 1 |
| 2 | 1.416667 | 0.002453 | 3 |
| 3 | 1.414216 | 0.000002 | 6 |
Compare this to gradient descent's linear convergence, where each iteration reduces error by a constant factor.
Multivariate Newton's Method
For functions of several variables, the Hessian matrix replaces the second derivative and the update becomes x_{k+1} = x_k − [∇²f(x_k)]⁻¹ ∇f(x_k). In practice, solve the linear system ∇²f(x_k) s = ∇f(x_k) instead of forming the inverse.
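A minimal multivariate sketch, using a small strictly convex function chosen purely for illustration:

```python
import numpy as np

# f(x, y) = exp(x + y) + x^2 + y^2 is smooth and strictly convex.
def grad(v):
    e = np.exp(v[0] + v[1])
    return np.array([e + 2 * v[0], e + 2 * v[1]])

def hess(v):
    e = np.exp(v[0] + v[1])
    return np.array([[e + 2.0, e], [e, e + 2.0]])

x = np.array([1.0, 2.0])
for k in range(8):
    step = np.linalg.solve(hess(x), grad(x))   # solve H s = grad rather than forming H^{-1}
    x = x - step
    print(k, x, np.linalg.norm(grad(x)))       # the gradient norm collapses after a few steps
```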
The Cost: O(n³)
Storing the Hessian takes O(n²) memory, and inverting it (or solving the Newton system) costs O(n³) time. For neural nets with millions of parameters this is infeasible; see the quasi-Newton methods (L-BFGS) below for practical alternatives.
Case Study: Logistic Regression
The Problem
Binary classification with logistic regression: minimize the cross-entropy loss, which is convex. With n samples and d features, the Hessian is the d×d matrix XᵀSX, where S = diag(pᵢ(1 − pᵢ)) is built from the predicted probabilities pᵢ.
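A minimal Newton sketch for this loss (the synthetic data is illustrative, and a tiny ridge term is added to the Hessian for numerical stability):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 1000, 5
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = (rng.random(n) < 1 / (1 + np.exp(-X @ w_true))).astype(float)   # synthetic labels

w = np.zeros(d)
for it in range(10):
    p = 1 / (1 + np.exp(-X @ w))                   # predicted probabilities
    g = X.T @ (p - y)                              # gradient of the cross-entropy loss
    S = p * (1 - p)                                # diagonal of S
    H = X.T @ (X * S[:, None]) + 1e-6 * np.eye(d)  # Hessian X^T S X, plus a tiny ridge
    w -= np.linalg.solve(H, g)                     # Newton step
    print(it, np.linalg.norm(g))                   # the gradient norm shrinks very quickly
```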
Newton vs Gradient Descent
For logistic regression, each Newton step is affordable: forming the Hessian XᵀSX costs O(nd²) and solving the d×d system costs O(d³), which is cheap when d is modest.
- Newton: 5-10 iterations to convergence
- GD: Thousands of iterations needed
In Practice
sklearn.linear_model.LogisticRegression uses Newton-like methods (L-BFGS) by default. This is why scikit-learn's logistic regression trains so fast!
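For reference, a minimal usage sketch ("lbfgs" has been the default solver in recent scikit-learn releases, and newer versions also offer a "newton-cholesky" solver):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

clf = LogisticRegression(solver="lbfgs", max_iter=200)   # quasi-Newton solver
clf.fit(X, y)
print(clf.n_iter_)   # typically a few dozen iterations at most
```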
Why Newton Wins
Newton uses curvature (Hessian) to take optimal steps. For convex problems like logistic regression, it converges in 5-10 iterations. GD needs thousands.
Limitations & Failure Modes
1. Computational Cost
O(n²) Hessian computation + O(n³) matrix inversion. Infeasible for deep learning's millions of parameters.
2. Singular Hessian
At saddle points, the Hessian may be singular or have negative eigenvalues. The method can fail or take steps in the wrong direction.
3. Divergence from Poor Initialization
Unlike gradient descent, Newton can diverge if starting too far from the solution. Quadratic convergence only holds near the optimum.
4. Non-Convex Landscapes
In deep learning's non-convex loss landscapes, Newton can get attracted to saddle points instead of minima. SGD noise actually helps escape these!
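A standard mitigation for failure modes 2 and 3 (not discussed above, so treat this as an illustrative sketch) is to damp the Hessian with a multiple of the identity, which pushes the step back toward plain gradient descent when the curvature is untrustworthy:

```python
import numpy as np

def damped_newton_step(grad, hess, lam=1e-3):
    """Solve (H + lam*I) s = grad, increasing lam until the system is positive definite."""
    d = grad.shape[0]
    H = hess + lam * np.eye(d)
    while True:
        try:
            np.linalg.cholesky(H)     # raises LinAlgError if H is not positive definite
            break
        except np.linalg.LinAlgError:
            lam *= 10.0
            H = hess + lam * np.eye(d)
    return np.linalg.solve(H, grad)

# Example: an indefinite Hessian; an undamped Newton step would jump straight to the saddle point.
hess = np.array([[2.0, 0.0], [0.0, -5.0]])
grad = np.array([1.0, 1.0])
print(damped_newton_step(grad, hess))   # a finite, descent-friendly step
```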
ML Applications & Variants
L-BFGS
Limited-memory BFGS approximates the Hessian inverse using past gradients. O(n) storage instead of O(n²). Used in scikit-learn and scipy.optimize.
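A typical usage sketch with scipy.optimize (the Rosenbrock test function and its gradient ship with SciPy):

```python
import numpy as np
from scipy.optimize import minimize, rosen, rosen_der

x0 = np.zeros(5)
res = minimize(rosen, x0, jac=rosen_der, method="L-BFGS-B")
print(res.x)     # close to the true minimizer [1, 1, 1, 1, 1]
print(res.nit)   # tens of iterations, not thousands
```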
Natural Gradient
Uses the Fisher Information Matrix (expected Hessian) instead of the true Hessian. Better for probability distributions and policy gradients in RL.
Hessian-Free Optimization
Computes Hessian-vector products without forming the full Hessian. Uses conjugate gradient for the linear solve. More feasible for deep learning.
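The core trick in a minimal sketch, here approximated with a finite difference of gradients (automatic differentiation gives an exact version of the same product):

```python
import numpy as np

def hvp(grad_f, x, v, eps=1e-6):
    """Approximate H(x) @ v without forming H: Hv ~ (grad_f(x + eps*v) - grad_f(x)) / eps."""
    return (grad_f(x + eps * v) - grad_f(x)) / eps

# Sanity check on f(x) = 0.5 x^T A x, whose Hessian is exactly A.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
grad_f = lambda x: A @ x
x = np.array([1.0, -1.0])
v = np.array([0.5, 2.0])

print(hvp(grad_f, x, v))   # ~ [3.5, 4.5]
print(A @ v)               # [3.5, 4.5]
```

These products are exactly what a conjugate-gradient solver needs to approximately solve the Newton system without ever materializing the Hessian.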
Trust Region Methods
TRPO (Trust Region Policy Optimization) uses a quadratic approximation with curvature from the Fisher matrix. Critical for stable policy learning in RL.