Beyond Gradient Descent
Gradient descent uses only first derivatives. It knows the direction of steepest descent but not the curvature. Newton's Method uses second derivatives (the Hessian) to take smarter steps.
Gradient Descent
Uses ∇f only. Linear approximation. Takes many small steps.
Newton's Method
Uses ∇f and ∇²f. Quadratic approximation. Fewer but bigger steps.
The Trade-off
Newton's Method converges faster (quadratic vs. linear convergence), but each iteration is more expensive. It shines on small-to-medium-sized problems where computing the Hessian is feasible.
The Algorithm
For root-finding (solving f(x) = 0), iterate x_{n+1} = x_n − f(x_n) / f'(x_n).
For optimization (finding a minimum of g(x)), apply the same iteration to g'(x) = 0: x_{n+1} = x_n − g'(x_n) / g''(x_n).
Geometric Intuition
At each point, fit a quadratic (Taylor approximation) to the function using the first and second derivatives. The minimum of this parabola is the next guess. For actual quadratics, Newton finds the minimum in one step!
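As a quick check of that one-step claim, here is a minimal sketch (the quadratic g below is an illustrative choice, not one from this page):

```python
# Newton's update for minimization: x_new = x - g'(x) / g''(x).
# For a true quadratic, g' is linear and g'' is constant, so a single step
# lands exactly on the minimum no matter where we start.
def newton_step(x, dg, d2g):
    return x - dg(x) / d2g(x)

# g(x) = 3x^2 - 12x + 5, minimized at x = 2
dg = lambda x: 6 * x - 12      # g'(x)
d2g = lambda x: 6.0            # g''(x)

for x0 in (-10.0, 0.0, 7.5):
    print(x0, "->", newton_step(x0, dg, d2g))   # prints 2.0 every time
```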
Interactive: Root Finding
Watch Newton's Method find √2 by solving x² - 2 = 0. Notice how quickly the error decreases.
Newton's Method Visualization (interactive): the iteration table starts at x_0 = 3.0000 with f(x_0) = 7.0000 and adds a row per step.
Notice how the number of correct digits roughly doubles with every step. Gradient descent would take thousands of steps to match this precision.
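Here is a short sketch that reproduces the trace the visualization shows, starting from the same initial guess x_0 = 3:

```python
import math

# Newton's root-finding iteration for f(x) = x^2 - 2, starting at x0 = 3.0.
f = lambda x: x**2 - 2
df = lambda x: 2 * x

x = 3.0
print(f"{'iter':>4} {'x_n':>12} {'f(x_n)':>12} {'error':>10}")
for n in range(6):
    err = abs(x - math.sqrt(2))
    print(f"{n:>4} {x:>12.8f} {f(x):>12.8f} {err:>10.2e}")
    x = x - f(x) / df(x)   # Newton step: x_{n+1} = x_n - f(x_n) / f'(x_n)
```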
Interactive: Hessian Curvature
Explore how the Hessian (curvature) determines Newton's step size. Compare with gradient descent using a fixed learning rate.
Key Insight
The Hessian acts as a "smart" scaling matrix: in steep directions (high curvature) it shrinks the step to prevent overshooting, and in flat directions (low curvature) it lengthens the step to speed up progress.
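The effect is easy to see on an ill-conditioned quadratic. A minimal sketch (the matrix and learning rate below are illustrative choices, not taken from the demo):

```python
import numpy as np

# f(x) = 0.5 * x^T H x, steep along x[0] (curvature 100) and flat along x[1] (curvature 1).
H = np.diag([100.0, 1.0])
grad = lambda x: H @ x          # gradient of the quadratic

x = np.array([1.0, 1.0])

gd_step = 0.01 * grad(x)                    # fixed learning rate: [1.0, 0.01], barely moves in the flat direction
newton_step = np.linalg.solve(H, grad(x))   # H^{-1} grad: [1.0, 1.0], rescaled per direction

print("gradient descent step:", gd_step)
print("Newton step:          ", newton_step)   # x - newton_step lands exactly on the minimum at the origin
```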
Quadratic Convergence
Newton's Method has quadratic convergence: near the solution, the error is roughly squared at each iteration, |e_{n+1}| ≈ C·|e_n|².
Example: Computing √2
| Iteration (n) | x_n | Error | Correct Digits |
|---|---|---|---|
| 0 | 2.000000 | 0.585786 | 0 |
| 1 | 1.500000 | 0.085786 | 1 |
| 2 | 1.416667 | 0.002453 | 3 |
| 3 | 1.414216 | 0.000002 | 6 |
Compare this to gradient descent's linear convergence, where each iteration reduces error by a constant factor.
Multivariate Newton's Method
For functions of several variables, the Hessian matrix replaces the second derivative and the update becomes x_{k+1} = x_k − [∇²f(x_k)]⁻¹ ∇f(x_k). In practice, solve the linear system ∇²f(x_k) s = ∇f(x_k) instead of forming the inverse.
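A minimal multivariate sketch, using a small strictly convex function chosen purely for illustration:

```python
import numpy as np

# f(x, y) = exp(x + y) + x^2 + y^2 is smooth and strictly convex.
def grad(v):
    e = np.exp(v[0] + v[1])
    return np.array([e + 2 * v[0], e + 2 * v[1]])

def hess(v):
    e = np.exp(v[0] + v[1])
    return np.array([[e + 2.0, e], [e, e + 2.0]])

x = np.array([1.0, 2.0])
for k in range(8):
    step = np.linalg.solve(hess(x), grad(x))   # solve H s = grad rather than forming H^{-1}
    x = x - step
    print(k, x, np.linalg.norm(grad(x)))       # the gradient norm collapses after a few steps
```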
The Cost: O(n³)
Storing the Hessian takes O(n²) memory, and inverting it (or solving the Newton system) costs O(n³) time. For neural nets with millions of parameters this is infeasible; see the quasi-Newton methods (L-BFGS) below for practical alternatives.
Case Study: Logistic Regression
The Problem
Binary classification with logistic regression: minimize the cross-entropy loss, which is convex. With n samples and d features, the Hessian is the d×d matrix XᵀSX, where S = diag(pᵢ(1 − pᵢ)) is built from the predicted probabilities pᵢ.
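A minimal Newton sketch for this loss (the synthetic data is illustrative, and a tiny ridge term is added to the Hessian for numerical stability):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 1000, 5
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = (rng.random(n) < 1 / (1 + np.exp(-X @ w_true))).astype(float)   # synthetic labels

w = np.zeros(d)
for it in range(10):
    p = 1 / (1 + np.exp(-X @ w))                   # predicted probabilities
    g = X.T @ (p - y)                              # gradient of the cross-entropy loss
    S = p * (1 - p)                                # diagonal of S
    H = X.T @ (X * S[:, None]) + 1e-6 * np.eye(d)  # Hessian X^T S X, plus a tiny ridge
    w -= np.linalg.solve(H, g)                     # Newton step
    print(it, np.linalg.norm(g))                   # the gradient norm shrinks very quickly
```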
Newton vs Gradient Descent
For logistic regression, each Newton step is affordable: forming the Hessian XᵀSX costs O(nd²) and solving the d×d system costs O(d³), which is cheap when d is modest.
- Newton: 5-10 iterations to convergence
- GD: Thousands of iterations needed
In Practice
sklearn.linear_model.LogisticRegression uses Newton-like methods (L-BFGS) by default. This is why scikit-learn's logistic regression trains so fast!
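For reference, a minimal usage sketch ("lbfgs" has been the default solver in recent scikit-learn releases, and newer versions also offer a "newton-cholesky" solver):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

clf = LogisticRegression(solver="lbfgs", max_iter=200)   # quasi-Newton solver
clf.fit(X, y)
print(clf.n_iter_)   # typically a few dozen iterations at most
```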
Why Newton Wins
Newton uses curvature (Hessian) to take optimal steps. For convex problems like logistic regression, it converges in 5-10 iterations. GD needs thousands.
Limitations & Failure Modes
1. Computational Cost
O(n²) Hessian computation + O(n³) matrix inversion. Infeasible for deep learning's millions of parameters.
2. Singular Hessian
At saddle points, the Hessian may be singular or have negative eigenvalues. The method can fail or take steps in the wrong direction.
3. Divergence from Poor Initialization
Unlike gradient descent, Newton can diverge if starting too far from the solution. Quadratic convergence only holds near the optimum.
4. Non-Convex Landscapes
In deep learning's non-convex loss landscapes, Newton can get attracted to saddle points instead of minima. SGD noise actually helps escape these!
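A standard mitigation for failure modes 2 and 3 (not discussed above, so treat this as an illustrative sketch) is to damp the Hessian with a multiple of the identity, which pushes the step back toward plain gradient descent when the curvature is untrustworthy:

```python
import numpy as np

def damped_newton_step(grad, hess, lam=1e-3):
    """Solve (H + lam*I) s = grad, increasing lam until the system is positive definite."""
    d = grad.shape[0]
    H = hess + lam * np.eye(d)
    while True:
        try:
            np.linalg.cholesky(H)     # raises LinAlgError if H is not positive definite
            break
        except np.linalg.LinAlgError:
            lam *= 10.0
            H = hess + lam * np.eye(d)
    return np.linalg.solve(H, grad)

# Example: an indefinite Hessian; an undamped Newton step would jump straight to the saddle point.
hess = np.array([[2.0, 0.0], [0.0, -5.0]])
grad = np.array([1.0, 1.0])
print(damped_newton_step(grad, hess))   # a finite, descent-friendly step
```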
ML Applications & Variants
L-BFGS
Limited-memory BFGS approximates the Hessian inverse using past gradients. O(n) storage instead of O(n²). Used in scikit-learn and scipy.optimize.
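A typical usage sketch with scipy.optimize (the Rosenbrock test function and its gradient ship with SciPy):

```python
import numpy as np
from scipy.optimize import minimize, rosen, rosen_der

x0 = np.zeros(5)
res = minimize(rosen, x0, jac=rosen_der, method="L-BFGS-B")
print(res.x)     # close to the true minimizer [1, 1, 1, 1, 1]
print(res.nit)   # tens of iterations, not thousands
```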
Natural Gradient
Uses the Fisher Information Matrix (expected Hessian) instead of the true Hessian. Better for probability distributions and policy gradients in RL.
Hessian-Free Optimization
Computes Hessian-vector products without forming the full Hessian. Uses conjugate gradient for the linear solve. More feasible for deep learning.
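The core trick in a minimal sketch, here approximated with a finite difference of gradients (automatic differentiation gives an exact version of the same product):

```python
import numpy as np

def hvp(grad_f, x, v, eps=1e-6):
    """Approximate H(x) @ v without forming H: Hv ~ (grad_f(x + eps*v) - grad_f(x)) / eps."""
    return (grad_f(x + eps * v) - grad_f(x)) / eps

# Sanity check on f(x) = 0.5 x^T A x, whose Hessian is exactly A.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
grad_f = lambda x: A @ x
x = np.array([1.0, -1.0])
v = np.array([0.5, 2.0])

print(hvp(grad_f, x, v))   # ~ [3.5, 4.5]
print(A @ v)               # [3.5, 4.5]
```

These products are exactly what a conjugate-gradient solver needs to approximately solve the Newton system without ever materializing the Hessian.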
Trust Region Methods
TRPO (Trust Region Policy Optimization) uses a quadratic approximation with curvature from the Fisher matrix. Critical for stable policy learning in RL.