Calculus

Jacobian & Hessian Matrices

First and second derivatives generalized to high dimensions. Essential for understanding backpropagation and optimization landscapes.

Introduction

In single-variable calculus, we have f'(x) (first derivative/slope) and f''(x) (second derivative/curvature). But neural networks operate on vectors with millions of dimensions.

When we generalize derivatives to vectors:

Jacobian (First Order)

Matrix of all first partial derivatives. Generalizes gradient to vector-valued functions.

Hessian (Second Order)

Matrix of all second partial derivatives. Captures curvature information.

The Derivative Hierarchy

The type of derivative depends on the input and output dimensions:

Input        Output       Derivative      Shape
Scalar (x)   Scalar (y)   Derivative      1 x 1
Vector (x)   Scalar (y)   Gradient        n x 1
Vector (x)   Vector (y)   Jacobian        m x n
Vector (x)   Scalar (y)   Hessian (2nd)   n x n
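
To make these shapes concrete, here is a minimal NumPy sketch; the toy functions f(x) = x·x and f(x) = Ax are illustrative assumptions, not part of the table above:

```python
import numpy as np

n, m = 3, 2                        # input and output dimensions
x = np.array([1.0, 2.0, 3.0])
A = np.arange(6.0).reshape(m, n)

grad = 2 * x                       # gradient of f(x) = x.x        -> shape (n,)
jac  = A                           # Jacobian of f(x) = A @ x is A -> shape (m, n)
hess = 2 * np.eye(n)               # Hessian of f(x) = x.x         -> shape (n, n)

print(grad.shape, jac.shape, hess.shape)   # (3,) (2, 3) (3, 3)
```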

The Jacobian Matrix

For a function \mathbf{f}: \mathbb{R}^n \to \mathbb{R}^m (n inputs, m outputs), the Jacobian is an m x n matrix:

J = \begin{bmatrix} \frac{\partial f_1}{\partial x_1} & \frac{\partial f_1}{\partial x_2} & \cdots & \frac{\partial f_1}{\partial x_n} \\ \frac{\partial f_2}{\partial x_1} & \frac{\partial f_2}{\partial x_2} & \cdots & \frac{\partial f_2}{\partial x_n} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial f_m}{\partial x_1} & \frac{\partial f_m}{\partial x_2} & \cdots & \frac{\partial f_m}{\partial x_n} \end{bmatrix}

Row i = how output i changes with each input. Column j = how each output changes with input j.

Geometric Meaning

The Jacobian represents the best linear approximation to \mathbf{f} near a point. It tells us how a small change in the input \delta\mathbf{x} affects the output:

\mathbf{f}(\mathbf{x} + \delta\mathbf{x}) \approx \mathbf{f}(\mathbf{x}) + J \cdot \delta\mathbf{x}
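
A sketch of this first-order approximation, assuming NumPy; the function f and the finite-difference helper numerical_jacobian below are illustrative, not part of the text:

```python
import numpy as np

def numerical_jacobian(f, x, eps=1e-6):
    """Approximate the m x n Jacobian of f at x with forward differences."""
    fx = f(x)
    J = np.zeros((fx.size, x.size))
    for j in range(x.size):
        x_step = x.copy()
        x_step[j] += eps
        J[:, j] = (f(x_step) - fx) / eps      # column j: sensitivity to input j
    return J

# Any smooth vector-valued function works; this one is just an example.
f = lambda x: np.array([np.sin(x[0]) * x[1], x[0] ** 2 + x[1] ** 2])

x0 = np.array([1.0, 2.0])
dx = np.array([0.01, -0.02])
J = numerical_jacobian(f, x0)

print(f(x0 + dx))            # true output after the perturbation
print(f(x0) + J @ dx)        # linear prediction f(x) + J dx, very close
```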

Jacobian: Worked Example

Consider a function from \mathbb{R}^2 \to \mathbb{R}^2:

\mathbf{f}(x, y) = \begin{bmatrix} x^2 + y \\ xy \end{bmatrix}

Step 1: Compute all partial derivatives

\frac{\partial f_1}{\partial x} = 2x
\frac{\partial f_1}{\partial y} = 1
\frac{\partial f_2}{\partial x} = y
\frac{\partial f_2}{\partial y} = x

Step 2: Assemble the Jacobian

J = \begin{bmatrix} 2x & 1 \\ y & x \end{bmatrix}

Step 3: Evaluate at a point (x=2, y=3)

J|_{(2,3)} = \begin{bmatrix} 4 & 1 \\ 3 & 2 \end{bmatrix}
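
A quick numerical cross-check of this result, assuming NumPy:

```python
import numpy as np

# The example function f(x, y) = (x^2 + y, x*y) and its analytic Jacobian
f = lambda p: np.array([p[0] ** 2 + p[1], p[0] * p[1]])
J = lambda p: np.array([[2 * p[0], 1.0],
                        [p[1],     p[0]]])

p = np.array([2.0, 3.0])
print(J(p))                                         # [[4. 1.]
                                                    #  [3. 2.]]

# Finite-difference check of the first column (sensitivity to x)
eps = 1e-6
print((f(p + np.array([eps, 0.0])) - f(p)) / eps)   # approx [4., 3.]
```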

Interactive: Jacobian in Action

See how the Jacobian provides a linear approximation to how outputs change with inputs. Move the base point and perturbation to explore.

[Interactive demo: Jacobian linearization. Move the base point and the perturbation \delta in the input space (x, y) and compare the true change in the output space (u, v) with the linear prediction J \cdot \delta. For example, at base (1.50, 1.00) with \delta = (0.50, 0.30), the Jacobian at the base is \begin{bmatrix} 3.00 & 1.00 \\ 1.00 & 1.50 \end{bmatrix} and the approximation error is about 0.29.]

The Hessian Matrix

For a scalar-valued function f:RnRf: \mathbb{R}^n \to \mathbb{R} (like a loss function), the Hessian is the n x n matrix of second-order partial derivatives:

H = \nabla^2 f = \begin{bmatrix} \frac{\partial^2 f}{\partial x_1^2} & \frac{\partial^2 f}{\partial x_1 \partial x_2} & \cdots \\ \frac{\partial^2 f}{\partial x_2 \partial x_1} & \frac{\partial^2 f}{\partial x_2^2} & \cdots \\ \vdots & \vdots & \ddots \end{bmatrix}

Symmetric

For continuous second partials:

\frac{\partial^2 f}{\partial x_i \partial x_j} = \frac{\partial^2 f}{\partial x_j \partial x_i}

So H = H^T.
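
A small numerical check of this symmetry, assuming NumPy; the scalar function and its hand-coded gradient are illustrative assumptions. Each column of H is obtained by differencing the gradient, which also previews the "gradient of the gradient" view below:

```python
import numpy as np

# Illustrative scalar function and its analytic gradient
f    = lambda p: np.sin(p[0]) * p[1] + p[0] * p[1] ** 2
grad = lambda p: np.array([np.cos(p[0]) * p[1] + p[1] ** 2,
                           np.sin(p[0]) + 2 * p[0] * p[1]])

def numerical_hessian(grad, x, eps=1e-6):
    """Column j of H is the change in the gradient per unit change in x_j."""
    H = np.zeros((x.size, x.size))
    for j in range(x.size):
        step = np.zeros(x.size)
        step[j] = eps
        H[:, j] = (grad(x + step) - grad(x)) / eps
    return H

H = numerical_hessian(grad, np.array([0.5, 1.5]))
print(np.round(H, 3))
print(np.allclose(H, H.T, atol=1e-4))    # True: the mixed partials agree
```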

Curvature

The Hessian captures how the gradient itself changes, describing the local curvature of the function (the "bowl shape" near a minimum).

Hessian: Worked Example

Consider a loss function f(x, y) = x^2 + 3xy + y^2:

Step 1: First partial derivatives (gradient)

\nabla f = \begin{bmatrix} 2x + 3y \\ 3x + 2y \end{bmatrix}

Step 2: Second partial derivatives

\frac{\partial^2 f}{\partial x^2} = 2
\frac{\partial^2 f}{\partial x \partial y} = 3
\frac{\partial^2 f}{\partial y \partial x} = 3
\frac{\partial^2 f}{\partial y^2} = 2

Step 3: Assemble Hessian

H = \begin{bmatrix} 2 & 3 \\ 3 & 2 \end{bmatrix}

Note: the Hessian is constant because f is quadratic.
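
A quick check, assuming NumPy: differentiating the gradient numerically recovers the same constant matrix at any point.

```python
import numpy as np

# Gradient of f(x, y) = x^2 + 3xy + y^2
grad = lambda p: np.array([2 * p[0] + 3 * p[1],
                           3 * p[0] + 2 * p[1]])

eps = 1e-6
p = np.array([0.7, -1.2])                  # any point: the Hessian is constant
H = np.column_stack([(grad(p + eps * e) - grad(p)) / eps for e in np.eye(2)])
print(np.round(H, 3))                      # [[2. 3.]
                                           #  [3. 2.]]
```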

Interactive: Curvature & Critical Points

Adjust the Hessian eigenvalues to see how they determine the shape of the loss surface and classify critical points.

[Interactive demo: Hessian curvature & step size. Vary the curvature f''(x) from flat to steep and compare a Newton step with a gradient-descent step at a fixed learning rate.]

Newton step: \Delta x = -f'/f''. Adapts to curvature: if the curve is steep (high f''), the step is scaled down.
Gradient descent step: \Delta x = -\eta f'. Uses a fixed learning rate and ignores curvature, so the same step can be too slow on flat stretches and overshoot on steep ones.

Key Insight

The Hessian H = \nabla^2 f acts as a "smart scaling matrix".

In steep directions (high curvature), H^{-1} shrinks the gradient to prevent overshooting. In flat directions, it expands the step to speed up progress.
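
A sketch of this rescaling on an ill-conditioned quadratic, assuming NumPy; the specific curvatures (100 and 1) and learning rate are illustrative assumptions:

```python
import numpy as np

# f(p) = 0.5 * p^T H p: steep along x (curvature 100), flat along y (curvature 1)
H = np.diag([100.0, 1.0])
grad = lambda p: H @ p

p = np.array([1.0, 1.0])

# Gradient descent: one fixed learning rate for every direction.
# eta must stay below 2/100 to avoid diverging along x, so y barely moves.
eta = 0.01
print(-eta * grad(p))                  # [-1.   -0.01]

# Newton: H^{-1} rescales each direction by its curvature
print(-np.linalg.solve(H, grad(p)))    # [-1. -1.], straight to the minimum
```

On a quadratic like this, a single Newton step lands exactly at the minimum, while the fixed-rate step crawls along the flat direction.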

Critical Points: The Eigenvalue Test

At a critical point (where gradient = 0), the Hessian's eigenvalues tell us the nature of that point:

All positive
Function curves UP in all directions. Local minimum.
All negative
Function curves DOWN in all directions. Local maximum.
Mixed signs
Curves up in some directions, down in others. Saddle point.
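
A minimal sketch of this test, assuming NumPy, applied to the Hessian from the worked example above (the helper name classify_critical_point is hypothetical):

```python
import numpy as np

def classify_critical_point(H, tol=1e-8):
    """Classify a critical point from the eigenvalue signs of its symmetric Hessian."""
    eigs = np.linalg.eigvalsh(H)
    if np.all(eigs > tol):
        return "local minimum"
    if np.all(eigs < -tol):
        return "local maximum"
    if np.any(eigs > tol) and np.any(eigs < -tol):
        return "saddle point"
    return "degenerate: the test is inconclusive"

# Hessian of f(x, y) = x^2 + 3xy + y^2 from the worked example
H = np.array([[2.0, 3.0],
              [3.0, 2.0]])
print(np.linalg.eigvalsh(H))        # [-1.  5.]: mixed signs
print(classify_critical_point(H))   # saddle point
```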

Why This Matters

In high-dimensional neural network loss landscapes, saddle points are far more common than local minima. Understanding the Hessian helps explain why optimization can stall and why momentum-based methods help.

ML Applications

Backpropagation = Jacobian-Vector Products

When computing gradients through a neural network, each layer contributes a Jacobian. The chain rule becomes: \nabla_x L = J_1^T J_2^T \cdots J_n^T \nabla_y L.
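
A toy sketch of this, assuming NumPy: a two-layer composition small enough that the per-layer Jacobians can be written explicitly. In a real framework each J_i^T term is applied as a vector-Jacobian product without ever materializing J_i.

```python
import numpy as np

rng = np.random.default_rng(0)
W1, W2 = rng.normal(size=(4, 3)), rng.normal(size=(2, 4))

# Forward pass: x -> h = W1 x -> y = tanh(W2 h), loss L = 0.5 * ||y||^2
x = rng.normal(size=3)
h = W1 @ x
y = np.tanh(W2 @ h)

# Per-layer Jacobians
J1 = W1                            # dh/dx, shape (4, 3)
J2 = np.diag(1 - y ** 2) @ W2      # dy/dh, shape (2, 4)

# Backward pass: push the output gradient through transposed Jacobians
grad_y = y                         # dL/dy for L = 0.5 * ||y||^2
grad_x = J1.T @ (J2.T @ grad_y)    # dL/dx = J1^T J2^T dL/dy
print(grad_x)
```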

Newton's Method

Second-order optimization: \theta_{new} = \theta - H^{-1} \nabla f. Uses curvature to take smarter steps, but solving with the full Hessian costs O(n³), which is impractical for models with millions of parameters.
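
A sketch of the update, assuming NumPy; the convex objective f(x, y) = x^4 + y^2 is an illustrative choice, and np.linalg.solve is used instead of forming H^{-1} explicitly (solving the linear system is still O(n³)):

```python
import numpy as np

# Illustrative convex objective f(x, y) = x^4 + y^2
grad = lambda t: np.array([4 * t[0] ** 3, 2 * t[1]])
hess = lambda t: np.array([[12 * t[0] ** 2, 0.0],
                           [0.0,            2.0]])

theta = np.array([2.0, 5.0])
for _ in range(8):
    step = np.linalg.solve(hess(theta), grad(theta))   # solves H step = grad(theta)
    theta = theta - step                               # theta_new = theta - H^{-1} grad
print(theta)                                           # close to the minimum at (0, 0)
```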

Hessian-Free Optimization

Clever algorithms that use Hessian information without computing the full matrix. Conjugate-gradient methods need only Hessian-vector products, which can be computed efficiently without forming H.
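
One common trick is to approximate a Hessian-vector product by differencing the gradient, as sketched below (assuming NumPy and reusing the gradient of the worked example f(x, y) = x^2 + 3xy + y^2; the helper name hvp is hypothetical):

```python
import numpy as np

def hvp(grad, theta, v, eps=1e-6):
    """Approximate H @ v without forming H: finite difference of the gradient along v."""
    return (grad(theta + eps * v) - grad(theta)) / eps

grad = lambda p: np.array([2 * p[0] + 3 * p[1], 3 * p[0] + 2 * p[1]])
H = np.array([[2.0, 3.0], [3.0, 2.0]])      # explicit Hessian, only for the check

theta, v = np.array([0.4, -0.9]), np.array([1.0, 2.0])
print(hvp(grad, theta, v))                  # approx [8., 7.]
print(H @ v)                                # [8. 7.]
```

Conjugate gradient only ever needs products like this, so it can approximately solve the Newton system H d = -\nabla f without storing the full n x n Hessian.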

Loss Landscape Analysis

Researchers study Hessian eigenvalue distributions. Sharp minima (large eigenvalues) tend to generalize worse than flat minima (small eigenvalues).