Calculus

Jacobian & Hessian Matrices

First and second derivatives generalized to high dimensions. Essential for understanding backpropagation and optimization landscapes.

Introduction

In single-variable calculus, we have f'(x) (first derivative/slope) and f''(x) (second derivative/curvature). But neural networks operate on vectors with millions of dimensions.

When we generalize derivatives to vectors:

Jacobian (First Order)

Matrix of all first partial derivatives. Generalizes gradient to vector-valued functions.

Hessian (Second Order)

Matrix of all second partial derivatives. Captures curvature information.

The Derivative Hierarchy

The type of derivative depends on the input and output dimensions:

Input        Output       Derivative      Shape
Scalar (x)   Scalar (y)   Derivative      1 x 1
Vector (x)   Scalar (y)   Gradient        n x 1
Vector (x)   Vector (y)   Jacobian        m x n
Vector (x)   Scalar (y)   Hessian (2nd)   n x n
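
To make these shapes concrete, here is a minimal NumPy sketch; the toy functions f(x) = x·x and f(x) = Ax are illustrative assumptions, not part of the table above:

```python
import numpy as np

n, m = 3, 2                        # input and output dimensions
x = np.array([1.0, 2.0, 3.0])
A = np.arange(6.0).reshape(m, n)

grad = 2 * x                       # gradient of f(x) = x.x        -> shape (n,)
jac  = A                           # Jacobian of f(x) = A @ x is A -> shape (m, n)
hess = 2 * np.eye(n)               # Hessian of f(x) = x.x         -> shape (n, n)

print(grad.shape, jac.shape, hess.shape)   # (3,) (2, 3) (3, 3)
```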

The Jacobian Matrix

For a function \mathbf{f}: \mathbb{R}^n \to \mathbb{R}^m (n inputs, m outputs), the Jacobian is an m x n matrix:

J = \begin{bmatrix} \frac{\partial f_1}{\partial x_1} & \frac{\partial f_1}{\partial x_2} & \cdots & \frac{\partial f_1}{\partial x_n} \\ \frac{\partial f_2}{\partial x_1} & \frac{\partial f_2}{\partial x_2} & \cdots & \frac{\partial f_2}{\partial x_n} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial f_m}{\partial x_1} & \frac{\partial f_m}{\partial x_2} & \cdots & \frac{\partial f_m}{\partial x_n} \end{bmatrix}

Row i = how output i changes with each input. Column j = how each output changes with input j.

Geometric Meaning

The Jacobian represents the best linear approximation to \mathbf{f} near a point. It tells us how a small change in the input \delta\mathbf{x} affects the output:

\mathbf{f}(\mathbf{x} + \delta\mathbf{x}) \approx \mathbf{f}(\mathbf{x}) + J \cdot \delta\mathbf{x}
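
A sketch of this first-order approximation, assuming NumPy; the function f and the finite-difference helper numerical_jacobian below are illustrative, not part of the text:

```python
import numpy as np

def numerical_jacobian(f, x, eps=1e-6):
    """Approximate the m x n Jacobian of f at x with forward differences."""
    fx = f(x)
    J = np.zeros((fx.size, x.size))
    for j in range(x.size):
        x_step = x.copy()
        x_step[j] += eps
        J[:, j] = (f(x_step) - fx) / eps      # column j: sensitivity to input j
    return J

# Any smooth vector-valued function works; this one is just an example.
f = lambda x: np.array([np.sin(x[0]) * x[1], x[0] ** 2 + x[1] ** 2])

x0 = np.array([1.0, 2.0])
dx = np.array([0.01, -0.02])
J = numerical_jacobian(f, x0)

print(f(x0 + dx))            # true output after the perturbation
print(f(x0) + J @ dx)        # linear prediction f(x) + J dx, very close
```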

Jacobian: Worked Example

Consider a function from \mathbb{R}^2 \to \mathbb{R}^2:

\mathbf{f}(x, y) = \begin{bmatrix} x^2 + y \\ xy \end{bmatrix}

Step 1: Compute all partial derivatives

\frac{\partial f_1}{\partial x} = 2x
\frac{\partial f_1}{\partial y} = 1
\frac{\partial f_2}{\partial x} = y
\frac{\partial f_2}{\partial y} = x

Step 2: Assemble the Jacobian

J = \begin{bmatrix} 2x & 1 \\ y & x \end{bmatrix}

Step 3: Evaluate at a point (x=2, y=3)

J|_{(2,3)} = \begin{bmatrix} 4 & 1 \\ 3 & 2 \end{bmatrix}
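
A quick numerical cross-check of this result, assuming NumPy:

```python
import numpy as np

# The example function f(x, y) = (x^2 + y, x*y) and its analytic Jacobian
f = lambda p: np.array([p[0] ** 2 + p[1], p[0] * p[1]])
J = lambda p: np.array([[2 * p[0], 1.0],
                        [p[1],     p[0]]])

p = np.array([2.0, 3.0])
print(J(p))                                         # [[4. 1.]
                                                    #  [3. 2.]]

# Finite-difference check of the first column (sensitivity to x)
eps = 1e-6
print((f(p + np.array([eps, 0.0])) - f(p)) / eps)   # approx [4., 3.]
```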

Interactive: Jacobian in Action

See how the Jacobian provides a linear approximation to how outputs change with inputs. Move the base point and perturbation to explore.

[Interactive demo: Jacobian linearization. Move the base point and the perturbation \delta in the input space (x, y) and compare the true change in the output space (u, v) with the linear prediction J \cdot \delta. For example, at base (1.50, 1.00) with \delta = (0.50, 0.30), the Jacobian at the base is \begin{bmatrix} 3.00 & 1.00 \\ 1.00 & 1.50 \end{bmatrix} and the approximation error is about 0.29.]

The Hessian Matrix

For a scalar-valued function f:RnRf: \mathbb{R}^n \to \mathbb{R} (like a loss function), the Hessian is the n x n matrix of second-order partial derivatives:

H = \nabla^2 f = \begin{bmatrix} \frac{\partial^2 f}{\partial x_1^2} & \frac{\partial^2 f}{\partial x_1 \partial x_2} & \cdots \\ \frac{\partial^2 f}{\partial x_2 \partial x_1} & \frac{\partial^2 f}{\partial x_2^2} & \cdots \\ \vdots & \vdots & \ddots \end{bmatrix}

Symmetric

For continuous second partials:

\frac{\partial^2 f}{\partial x_i \partial x_j} = \frac{\partial^2 f}{\partial x_j \partial x_i}

So H = H^T.
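
A small numerical check of this symmetry, assuming NumPy; the scalar function and its hand-coded gradient are illustrative assumptions. Each column of H is obtained by differencing the gradient, which also previews the "gradient of the gradient" view below:

```python
import numpy as np

# Illustrative scalar function and its analytic gradient
f    = lambda p: np.sin(p[0]) * p[1] + p[0] * p[1] ** 2
grad = lambda p: np.array([np.cos(p[0]) * p[1] + p[1] ** 2,
                           np.sin(p[0]) + 2 * p[0] * p[1]])

def numerical_hessian(grad, x, eps=1e-6):
    """Column j of H is the change in the gradient per unit change in x_j."""
    H = np.zeros((x.size, x.size))
    for j in range(x.size):
        step = np.zeros(x.size)
        step[j] = eps
        H[:, j] = (grad(x + step) - grad(x)) / eps
    return H

H = numerical_hessian(grad, np.array([0.5, 1.5]))
print(np.round(H, 3))
print(np.allclose(H, H.T, atol=1e-4))    # True: the mixed partials agree
```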

Curvature

The Hessian captures how the gradient itself changes, describing the local curvature of the function (the "bowl shape" near a minimum).

Hessian: Worked Example

Consider a loss function f(x, y) = x^2 + 3xy + y^2:

Step 1: First partial derivatives (gradient)

\nabla f = \begin{bmatrix} 2x + 3y \\ 3x + 2y \end{bmatrix}

Step 2: Second partial derivatives

\frac{\partial^2 f}{\partial x^2} = 2
\frac{\partial^2 f}{\partial x \partial y} = 3
\frac{\partial^2 f}{\partial y \partial x} = 3
\frac{\partial^2 f}{\partial y^2} = 2

Step 3: Assemble Hessian

H = \begin{bmatrix} 2 & 3 \\ 3 & 2 \end{bmatrix}

Note: the Hessian is constant because f is quadratic.
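
A quick check, assuming NumPy: differentiating the gradient numerically recovers the same constant matrix at any point.

```python
import numpy as np

# Gradient of f(x, y) = x^2 + 3xy + y^2
grad = lambda p: np.array([2 * p[0] + 3 * p[1],
                           3 * p[0] + 2 * p[1]])

eps = 1e-6
p = np.array([0.7, -1.2])                  # any point: the Hessian is constant
H = np.column_stack([(grad(p + eps * e) - grad(p)) / eps for e in np.eye(2)])
print(np.round(H, 3))                      # [[2. 3.]
                                           #  [3. 2.]]
```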

Interactive: Curvature & Critical Points

Adjust the Hessian eigenvalues to see how they determine the shape of the loss surface and classify critical points.

[Interactive demo: Hessian curvature & step size. Vary the curvature f''(x) from flat to steep and compare a Newton step with a gradient-descent step at a fixed learning rate.]

Newton step: \Delta x = -f'/f''. Adapts to curvature: if the curve is steep (high f''), the step is scaled down.
Gradient descent step: \Delta x = -\eta f'. Uses a fixed learning rate and ignores curvature, so the same step can be too slow on flat stretches and overshoot on steep ones.

Key Insight

The Hessian H = \nabla^2 f acts as a "smart scaling matrix".

In steep directions (high curvature), H^{-1} shrinks the gradient to prevent overshooting. In flat directions, it expands the step to speed up progress.
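
A sketch of this rescaling on an ill-conditioned quadratic, assuming NumPy; the specific curvatures (100 and 1) and learning rate are illustrative assumptions:

```python
import numpy as np

# f(p) = 0.5 * p^T H p: steep along x (curvature 100), flat along y (curvature 1)
H = np.diag([100.0, 1.0])
grad = lambda p: H @ p

p = np.array([1.0, 1.0])

# Gradient descent: one fixed learning rate for every direction.
# eta must stay below 2/100 to avoid diverging along x, so y barely moves.
eta = 0.01
print(-eta * grad(p))                  # [-1.   -0.01]

# Newton: H^{-1} rescales each direction by its curvature
print(-np.linalg.solve(H, grad(p)))    # [-1. -1.], straight to the minimum
```

On a quadratic like this, a single Newton step lands exactly at the minimum, while the fixed-rate step crawls along the flat direction.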

Critical Points: The Eigenvalue Test

At a critical point (where gradient = 0), the Hessian's eigenvalues tell us the nature of that point:

All positive
Function curves UP in all directions. Local minimum.
All negative
Function curves DOWN in all directions. Local maximum.
Mixed signs
Curves up in some directions, down in others. Saddle point.
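
A minimal sketch of this test, assuming NumPy, applied to the Hessian from the worked example above (the helper name classify_critical_point is hypothetical):

```python
import numpy as np

def classify_critical_point(H, tol=1e-8):
    """Classify a critical point from the eigenvalue signs of its symmetric Hessian."""
    eigs = np.linalg.eigvalsh(H)
    if np.all(eigs > tol):
        return "local minimum"
    if np.all(eigs < -tol):
        return "local maximum"
    if np.any(eigs > tol) and np.any(eigs < -tol):
        return "saddle point"
    return "degenerate: the test is inconclusive"

# Hessian of f(x, y) = x^2 + 3xy + y^2 from the worked example
H = np.array([[2.0, 3.0],
              [3.0, 2.0]])
print(np.linalg.eigvalsh(H))        # [-1.  5.]: mixed signs
print(classify_critical_point(H))   # saddle point
```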

Why This Matters

In high-dimensional neural network loss landscapes, saddle points are far more common than local minima. Understanding the Hessian helps explain why optimization can stall and why momentum-based methods help.

ML Applications

Backpropagation = Jacobian-Vector Products

When computing gradients through a neural network, each layer contributes a Jacobian. The chain rule becomes: \nabla_x L = J_1^T J_2^T \cdots J_n^T \nabla_y L.
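
A toy sketch of this, assuming NumPy: a two-layer composition small enough that the per-layer Jacobians can be written explicitly. In a real framework each J_i^T term is applied as a vector-Jacobian product without ever materializing J_i.

```python
import numpy as np

rng = np.random.default_rng(0)
W1, W2 = rng.normal(size=(4, 3)), rng.normal(size=(2, 4))

# Forward pass: x -> h = W1 x -> y = tanh(W2 h), loss L = 0.5 * ||y||^2
x = rng.normal(size=3)
h = W1 @ x
y = np.tanh(W2 @ h)

# Per-layer Jacobians
J1 = W1                            # dh/dx, shape (4, 3)
J2 = np.diag(1 - y ** 2) @ W2      # dy/dh, shape (2, 4)

# Backward pass: push the output gradient through transposed Jacobians
grad_y = y                         # dL/dy for L = 0.5 * ||y||^2
grad_x = J1.T @ (J2.T @ grad_y)    # dL/dx = J1^T J2^T dL/dy
print(grad_x)
```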

Newton's Method

Second-order optimization: \theta_{new} = \theta - H^{-1} \nabla f. Uses curvature to take smarter steps, but solving with the full Hessian costs O(n³), which is impractical for models with millions of parameters.
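
A sketch of the update, assuming NumPy; the convex objective f(x, y) = x^4 + y^2 is an illustrative choice, and np.linalg.solve is used instead of forming H^{-1} explicitly (solving the linear system is still O(n³)):

```python
import numpy as np

# Illustrative convex objective f(x, y) = x^4 + y^2
grad = lambda t: np.array([4 * t[0] ** 3, 2 * t[1]])
hess = lambda t: np.array([[12 * t[0] ** 2, 0.0],
                           [0.0,            2.0]])

theta = np.array([2.0, 5.0])
for _ in range(8):
    step = np.linalg.solve(hess(theta), grad(theta))   # solves H step = grad(theta)
    theta = theta - step                               # theta_new = theta - H^{-1} grad
print(theta)                                           # close to the minimum at (0, 0)
```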

Hessian-Free Optimization

Clever algorithms that use Hessian information without computing the full matrix. Conjugate-gradient methods need only Hessian-vector products, which can be computed efficiently without forming H.
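
One common trick is to approximate a Hessian-vector product by differencing the gradient, as sketched below (assuming NumPy and reusing the gradient of the worked example f(x, y) = x^2 + 3xy + y^2; the helper name hvp is hypothetical):

```python
import numpy as np

def hvp(grad, theta, v, eps=1e-6):
    """Approximate H @ v without forming H: finite difference of the gradient along v."""
    return (grad(theta + eps * v) - grad(theta)) / eps

grad = lambda p: np.array([2 * p[0] + 3 * p[1], 3 * p[0] + 2 * p[1]])
H = np.array([[2.0, 3.0], [3.0, 2.0]])      # explicit Hessian, only for the check

theta, v = np.array([0.4, -0.9]), np.array([1.0, 2.0])
print(hvp(grad, theta, v))                  # approx [8., 7.]
print(H @ v)                                # [8. 7.]
```

Conjugate gradient only ever needs products like this, so it can approximately solve the Newton system H d = -\nabla f without storing the full n x n Hessian.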

Loss Landscape Analysis

Researchers study Hessian eigenvalue distributions. Sharp minima (large eigenvalues) tend to generalize worse than flat minima (small eigenvalues).