Introduction
In single-variable calculus, we have f'(x) (first derivative/slope) and f''(x) (second derivative/curvature). But neural networks operate on vectors with millions of dimensions.
When we generalize derivatives to vectors:
Jacobian (First Order)
Matrix of all first partial derivatives. Generalizes the gradient to vector-valued functions.
Hessian (Second Order)
Matrix of all second partial derivatives. Captures curvature information.
The Derivative Hierarchy
The type of derivative depends on the input and output dimensions:
| Input | Output | Derivative | Shape |
|---|---|---|---|
| Scalar (x) | Scalar (y) | Derivative | 1 x 1 |
| Vector (x) | Scalar (y) | Gradient | n x 1 |
| Vector (x) | Vector (y) | Jacobian | m x n |
| Vector (x) | Scalar (y) | Hessian (2nd) | n x n |
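To make these shapes concrete, here is a minimal NumPy sketch (the functions are illustrative choices of ours, not taken from the table):

```python
import numpy as np

n, m = 3, 2  # input and output dimensions for the examples below

x = np.array([1.0, 2.0, 3.0])          # vector input, shape (n,)

# Gradient of a scalar function f(x) = x . x  ->  grad f = 2x, a length-n vector
grad = 2 * x
assert grad.shape == (n,)

# Jacobian of a vector function g(x) = (x1 + x2, x2 * x3)  ->  shape (m, n)
jacobian = np.array([
    [1.0, 1.0, 0.0],      # d g1 / d x
    [0.0, x[2], x[1]],    # d g2 / d x
])
assert jacobian.shape == (m, n)

# Hessian of f(x) = x . x  ->  2 * I, shape (n, n)
hessian = 2 * np.eye(n)
assert hessian.shape == (n, n)
```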
The Jacobian Matrix
For a function $f: \mathbb{R}^n \to \mathbb{R}^m$ (n inputs, m outputs), the Jacobian is the $m \times n$ matrix

$$J = \begin{bmatrix} \dfrac{\partial f_1}{\partial x_1} & \cdots & \dfrac{\partial f_1}{\partial x_n} \\ \vdots & \ddots & \vdots \\ \dfrac{\partial f_m}{\partial x_1} & \cdots & \dfrac{\partial f_m}{\partial x_n} \end{bmatrix}, \qquad J_{ij} = \frac{\partial f_i}{\partial x_j}.$$
Row i = how output i changes with each input. Column j = how each output changes with input j.
Geometric Meaning
The Jacobian represents the best linear approximation to f near a point. It tells us how a small change in input affects the output:

$$f(\mathbf{x} + \Delta\mathbf{x}) \approx f(\mathbf{x}) + J\,\Delta\mathbf{x}$$
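A small sketch of this idea, using an illustrative map from $\mathbb{R}^2$ to $\mathbb{R}^2$ of our own choosing (the names f and numerical_jacobian are ours): estimate the Jacobian by finite differences and check the linear approximation for a small perturbation.

```python
import numpy as np

def f(p):
    """Illustrative map from R^2 to R^2 (chosen for demonstration)."""
    x, y = p
    return np.array([x**2 * y, x + np.sin(y)])

def numerical_jacobian(func, p, eps=1e-6):
    """Estimate the Jacobian of func at p by central finite differences."""
    p = np.asarray(p, dtype=float)
    out_dim = func(p).shape[0]
    J = np.zeros((out_dim, p.size))
    for j in range(p.size):
        step = np.zeros_like(p)
        step[j] = eps
        J[:, j] = (func(p + step) - func(p - step)) / (2 * eps)
    return J

p0 = np.array([2.0, 3.0])
dp = np.array([0.01, -0.02])          # small input perturbation
J = numerical_jacobian(f, p0)

exact_change = f(p0 + dp) - f(p0)
linear_change = J @ dp                # the Jacobian predicts the output change
print(J)
print(exact_change, linear_change)    # the two should be close
```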
Jacobian: Worked Example
Consider a function from $\mathbb{R}^2$ to $\mathbb{R}^2$, with outputs $u(x, y)$ and $v(x, y)$:
Step 1: Compute all partial derivatives
Step 2: Assemble the Jacobian
Step 3: Evaluate at a point (x=2, y=3)
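The same three steps in code, using an illustrative map $u = x^2 y$, $v = x + \sin y$ as a stand-in (chosen here for demonstration, not necessarily the function intended above):

```python
import numpy as np

def analytic_jacobian(x, y):
    """Jacobian of the illustrative map u = x^2 * y, v = x + sin(y)."""
    # Step 1: partial derivatives
    du_dx, du_dy = 2 * x * y, x**2
    dv_dx, dv_dy = 1.0, np.cos(y)
    # Step 2: assemble; rows = outputs, columns = inputs
    return np.array([[du_dx, du_dy],
                     [dv_dx, dv_dy]])

# Step 3: evaluate at the point (x=2, y=3)
print(analytic_jacobian(2.0, 3.0))
# [[12.    4.  ]
#  [ 1.   -0.99]]   (cos(3) is roughly -0.99)
```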
Interactive: Jacobian in Action
See how the Jacobian provides a linear approximation to how outputs change with inputs. Move the base point and perturbation to explore.
[Interactive widget: Jacobian linearization, mapping the input space (x, y) to the output space (u, v).]
The Hessian Matrix
For a scalar-valued function $f: \mathbb{R}^n \to \mathbb{R}$ (like a loss function), the Hessian is the $n \times n$ matrix of second-order partial derivatives:

$$H_{ij} = \frac{\partial^2 f}{\partial x_i \, \partial x_j}$$
Symmetric
For continuous second partials, $\dfrac{\partial^2 f}{\partial x_i \, \partial x_j} = \dfrac{\partial^2 f}{\partial x_j \, \partial x_i}$, so $H = H^\top$.
Curvature
The Hessian captures how the gradient itself changes. It describes the "bowl shape" of the function.
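A minimal sketch, assuming an illustrative two-variable function of our own choosing, that estimates the Hessian with finite differences and checks the symmetry of the mixed partials:

```python
import numpy as np

def loss(p):
    """Illustrative scalar function of two variables (our choice)."""
    x, y = p
    return x**2 * y + np.sin(x * y)

def numerical_hessian(func, p, eps=1e-4):
    """Estimate the Hessian by central finite differences of the function."""
    p = np.asarray(p, dtype=float)
    n = p.size
    H = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            e_i, e_j = np.zeros(n), np.zeros(n)
            e_i[i], e_j[j] = eps, eps
            H[i, j] = (func(p + e_i + e_j) - func(p + e_i - e_j)
                       - func(p - e_i + e_j) + func(p - e_i - e_j)) / (4 * eps**2)
    return H

H = numerical_hessian(loss, [2.0, 3.0])
print(H)

# Symmetry check: both off-diagonal entries match the analytic mixed partial
x, y = 2.0, 3.0
analytic_mixed = 2 * x + np.cos(x * y) - x * y * np.sin(x * y)
print(H[0, 1], H[1, 0], analytic_mixed)
```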
Hessian: Worked Example
Consider a quadratic loss function $f$:
Step 1: First partial derivatives (gradient)
Step 2: Second partial derivatives
Step 3: Assemble Hessian
Note: the Hessian is constant because f is quadratic.
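A hedged sketch with an illustrative quadratic, $f(x, y) = 3x^2 + 2xy + y^2$ (our stand-in, not necessarily the loss above), showing that its Hessian is the same matrix at every point:

```python
import numpy as np

# Illustrative quadratic loss: f(x, y) = 3x^2 + 2xy + y^2
def gradient(x, y):
    # Step 1: first partial derivatives
    return np.array([6 * x + 2 * y, 2 * x + 2 * y])

def hessian(x, y):
    # Steps 2-3: second partials assembled into a matrix;
    # every entry is a constant, so H does not depend on (x, y)
    return np.array([[6.0, 2.0],
                     [2.0, 2.0]])

print(hessian(0.0, 0.0))
print(hessian(5.0, -7.0))   # same matrix at any point: f is quadratic
```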
Interactive: Curvature & Critical Points
Adjust the Hessian eigenvalues to see how they determine the shape of the loss surface and classify critical points.
[Interactive widget: Hessian curvature and step-size analysis.]
Key Insight
The inverse Hessian acts as a "smart scaling matrix" for the gradient.
In steep directions (high curvature), it shrinks the step to prevent overshooting; in flat directions (low curvature), it expands the step to speed up progress.
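A tiny demonstration of this scaling on an illustrative quadratic bowl $f(\mathbf{x}) = \tfrac{1}{2}\mathbf{x}^\top H \mathbf{x}$ with one steep and one flat direction:

```python
import numpy as np

# Quadratic bowl with very different curvatures (an illustrative setup)
H = np.diag([100.0, 1.0])      # steep along axis 0, flat along axis 1
x = np.array([1.0, 1.0])
grad = H @ x                   # gradient of the quadratic at x

newton_step = np.linalg.solve(H, grad)   # H^{-1} grad, without forming the inverse
print(grad)          # [100.   1.]  -> raw gradient is dominated by the steep direction
print(newton_step)   # [  1.   1.]  -> inverse Hessian rescales each direction by 1/curvature
```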
Critical Points: The Eigenvalue Test
At a critical point (where the gradient is zero), the Hessian's eigenvalues tell us the nature of that point:

| Hessian eigenvalues | Type of critical point |
|---|---|
| All positive | Local minimum (curves up in every direction) |
| All negative | Local maximum (curves down in every direction) |
| Mixed signs | Saddle point (up in some directions, down in others) |
| Some zero | Degenerate: the test is inconclusive |
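A small sketch of the eigenvalue test (the helper name classify_critical_point is ours):

```python
import numpy as np

def classify_critical_point(H, tol=1e-8):
    """Classify a critical point from the eigenvalues of its symmetric Hessian."""
    eigvals = np.linalg.eigvalsh(H)
    if np.any(np.abs(eigvals) < tol):
        return "degenerate (test inconclusive)"
    if np.all(eigvals > 0):
        return "local minimum"
    if np.all(eigvals < 0):
        return "local maximum"
    return "saddle point"

print(classify_critical_point(np.array([[2.0, 0.0], [0.0, 3.0]])))   # local minimum
print(classify_critical_point(np.array([[2.0, 0.0], [0.0, -3.0]])))  # saddle point
```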
Why This Matters
In high-dimensional neural network loss landscapes, saddle points are far more common than local minima. Understanding the Hessian helps explain why optimization can stall and why momentum-based methods help.
ML Applications
Backpropagation = Jacobian-Vector Products
When computing gradients through a neural network, each layer contributes a Jacobian. The chain rule becomes a product of Jacobians, which backpropagation applies as vector-Jacobian products: $\nabla_{\mathbf{x}} L = J_1^\top J_2^\top \cdots J_k^\top \, \nabla_{\mathbf{y}} L$.
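A sketch of why the order of operations matters: with illustrative random layer Jacobians, applying transposed Jacobians to a vector right-to-left gives the same gradient as forming the full Jacobian product, while only ever storing vectors.

```python
import numpy as np

# Illustrative layer Jacobians for a toy 3-layer network (random, for shapes only)
rng = np.random.default_rng(0)
J1 = rng.normal(size=(8, 10))   # layer 1: R^10 -> R^8
J2 = rng.normal(size=(6, 8))    # layer 2: R^8  -> R^6
J3 = rng.normal(size=(1, 6))    # layer 3: R^6  -> scalar loss

grad_loss = np.array([1.0])     # dL/dL

# Backpropagation: apply transposed Jacobians right-to-left,
# so we only ever form vectors, never the full product matrix.
v = J3.T @ grad_loss
v = J2.T @ v
grad_input = J1.T @ v           # dL/dx, shape (10,)

# Same result as multiplying out the full Jacobian chain first (much costlier in general)
full_jacobian = J3 @ J2 @ J1    # shape (1, 10)
print(np.allclose(grad_input, full_jacobian.ravel()))
```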
Newton's Method
Second-order optimization: $\mathbf{x}_{t+1} = \mathbf{x}_t - H^{-1} \nabla f(\mathbf{x}_t)$. Uses curvature to take smarter steps, but solving with the full Hessian costs O(n³).
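A minimal Newton iteration, assuming an illustrative smooth convex function $f(x, y) = e^{x+y} + x^2 + y^2$ (our choice) with hand-coded gradient and Hessian:

```python
import numpy as np

# Newton's method on f(x, y) = exp(x + y) + x^2 + y^2 (illustrative function)
def grad(p):
    x, y = p
    e = np.exp(x + y)
    return np.array([e + 2 * x, e + 2 * y])

def hess(p):
    x, y = p
    e = np.exp(x + y)
    return np.array([[e + 2, e],
                     [e, e + 2]])

p = np.zeros(2)
for _ in range(6):
    # x_{t+1} = x_t - H^{-1} grad f(x_t); solve the linear system instead of inverting
    p = p - np.linalg.solve(hess(p), grad(p))

print(p)                        # close to the minimizer, roughly (-0.2836, -0.2836)
print(np.linalg.norm(grad(p)))  # gradient norm near zero
```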
Hessian-Free Optimization
Algorithms that use Hessian information without ever forming the full matrix. Conjugate gradient methods need only Hessian-vector products, which can be computed efficiently.
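A sketch of one common approach: approximating a Hessian-vector product with a finite difference of gradients, so the full Hessian is never formed (the loss and helper names here are illustrative).

```python
import numpy as np

def grad(p):
    """Gradient of an illustrative loss f(x, y) = x^2 y^2 + x^2 + y^2 (our choice)."""
    x, y = p
    return np.array([2 * x * y**2 + 2 * x, 2 * x**2 * y + 2 * y])

def hessian_vector_product(grad_fn, p, v, eps=1e-5):
    """Approximate H @ v via a finite difference of gradients: no full Hessian needed."""
    p = np.asarray(p, dtype=float)
    v = np.asarray(v, dtype=float)
    return (grad_fn(p + eps * v) - grad_fn(p - eps * v)) / (2 * eps)

p = np.array([1.0, 2.0])
v = np.array([0.5, -1.0])
print(hessian_vector_product(grad, p, v))

# Check against the explicit Hessian at p (only feasible here because n is tiny)
H = np.array([[2 * p[1]**2 + 2, 4 * p[0] * p[1]],
              [4 * p[0] * p[1], 2 * p[0]**2 + 2]])
print(H @ v)
```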
Loss Landscape Analysis
Researchers study Hessian eigenvalue distributions. Sharp minima (large eigenvalues) tend to generalize worse than flat minima (small eigenvalues).
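A toy sketch of measuring sharpness as the largest Hessian eigenvalue, using synthetic symmetric matrices as stand-ins for the Hessians at two minima (the eigenvalue choices are ours):

```python
import numpy as np

rng = np.random.default_rng(1)

def random_spd(eigenvalues):
    """Build a random symmetric matrix with the prescribed eigenvalues."""
    Q, _ = np.linalg.qr(rng.normal(size=(len(eigenvalues),) * 2))
    return Q @ np.diag(eigenvalues) @ Q.T

H_sharp = random_spd([250.0, 40.0, 5.0, 1.0])   # stand-in for a sharp minimum
H_flat = random_spd([3.0, 1.0, 0.5, 0.1])       # stand-in for a flat minimum

for name, H in [("sharp", H_sharp), ("flat", H_flat)]:
    eigs = np.linalg.eigvalsh(H)                # full spectrum (cheap only for tiny n)
    print(name, "max eigenvalue:", round(float(eigs.max()), 2))
```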