Beyond Scalar Calculus
In standard calculus, you differentiate a scalar function $f: \mathbb{R} \to \mathbb{R}$. The derivative $f'(x)$ is a single number.
In Machine Learning, we deal with functions of many variables (neural networks have millions of parameters) that output vectors, matrices, or scalars. Matrix Calculus provides the notation and rules to compute these derivatives efficiently.
Why It Matters for ML
Every time you call loss.backward() in PyTorch, you are computing matrix derivatives. Understanding matrix calculus lets you derive gradients by hand, debug backpropagation, and design custom layers.
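As a small illustration, here is a minimal PyTorch check that the gradient autograd computes for a squared-error loss matches the one derived by hand (the data here is random and purely illustrative):

```python
import torch

# Squared-error loss L(w) = ||X w - y||^2 on random data
X = torch.randn(5, 3)
y = torch.randn(5)
w = torch.randn(3, requires_grad=True)

loss = ((X @ w - y) ** 2).sum()
loss.backward()                            # autograd fills w.grad with dL/dw

manual = 2 * X.T @ (X @ w.detach() - y)    # gradient derived by hand
print(torch.allclose(w.grad, manual))      # True
```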
Layout Conventions
There are two conventions for arranging derivatives: Numerator Layout and Denominator Layout. ML typically uses Numerator Layout (also called Jacobian layout).
Numerator Layout
The derivative is laid out according to the numerator. If $y \in \mathbb{R}^m$ and $x \in \mathbb{R}^n$, then $\frac{\partial y}{\partial x}$ is the $m \times n$ Jacobian.
Denominator Layout
The derivative is laid out according to the denominator. For the same $y$ and $x$, $\frac{\partial y}{\partial x}$ is $n \times m$, the transpose of the numerator-layout result. Used in some optimization textbooks.
Warning: Always check which convention a paper or library uses. Transposing the wrong matrix leads to dimension mismatch bugs in backprop.
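A quick way to see which convention an autograd library uses is to inspect the shape it returns. PyTorch's `torch.autograd.functional.jacobian`, for instance, returns an $m \times n$ (numerator-layout) matrix:

```python
import torch
from torch.autograd.functional import jacobian

def f(x):                        # f: R^3 -> R^2
    return torch.stack([x[0] * x[1], x[1] + x[2] ** 2])

J = jacobian(f, torch.randn(3))
print(J.shape)                   # torch.Size([2, 3]): numerator layout, m x n
```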
Gradients
The gradient of a scalar function $f: \mathbb{R}^n \to \mathbb{R}$ is the vector of partial derivatives, $\nabla f(x) = \left[\tfrac{\partial f}{\partial x_1}, \dots, \tfrac{\partial f}{\partial x_n}\right]^\top$. In numerator layout, $\partial f / \partial x$ is technically a row vector, but we often treat the gradient as a column vector for convenience in update rules.
Key Gradient Identities
- $\nabla_x\,(a^\top x) = a$ (linear function)
- $\nabla_x\,\|x\|_2^2 = \nabla_x\,(x^\top x) = 2x$ (squared norm; recall norms)
- $\nabla_x\,(x^\top A x) = (A + A^\top)x$, which reduces to $2Ax$ for symmetric $A$ (quadratic form)
- $\nabla_w\,\|Xw - y\|_2^2 = 2X^\top(Xw - y)$ (least squares)
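These identities are easy to verify numerically with autograd; a short sketch using random $a$, $A$, and $x$:

```python
import torch

n = 4
a = torch.randn(n)
A = torch.randn(n, n)
x = torch.randn(n, requires_grad=True)

g, = torch.autograd.grad(a @ x, x)                 # linear function
print(torch.allclose(g, a))                        # True

g, = torch.autograd.grad(x @ x, x)                 # squared norm
print(torch.allclose(g, 2 * x.detach()))           # True

g, = torch.autograd.grad(x @ A @ x, x)             # quadratic form
print(torch.allclose(g, (A + A.T) @ x.detach()))   # True
```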
Jacobian Matrix
When the output is a vector, $f: \mathbb{R}^n \to \mathbb{R}^m$, we need an $m \times n$ matrix to capture all partial derivatives, $J_{ij} = \frac{\partial f_i}{\partial x_j}$. This is the Jacobian.
Linear Transformation View
The Jacobian tells you how the output space locally stretches and rotates. If $f(x) = Ax$, then $J = A$. The derivative of a linear map is the map itself!
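A small check of this fact with PyTorch's Jacobian utility (random $A$ and $x$):

```python
import torch
from torch.autograd.functional import jacobian

A = torch.randn(2, 3)
J = jacobian(lambda x: A @ x, torch.randn(3))
print(torch.allclose(J, A))      # True: the Jacobian of x -> Ax is A itself
```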
Interactive: Jacobian as Linearization
Explore how the Jacobian matrix acts as a local linear approximation for a non-linear transformation. Notice how the grid lines locally look like the Jacobian vectors.
[Interactive demo: an input space $(x, y)$, the output space $f(x, y)$, and the Jacobian matrix evaluated at a chosen point. Column 1 shows how the output changes when $x$ moves; column 2 shows how the output changes when $y$ moves.]
Hessian Matrix
The Hessian is the $n \times n$ matrix of second derivatives, $H_{ij} = \frac{\partial^2 f}{\partial x_i \partial x_j}$. It tells you about the curvature of a scalar function.
- H positive definite: local minimum. The surface curves up in all directions (convex bowl).
- H negative definite: local maximum. The surface curves down in all directions (concave).
- H indefinite: saddle point. Curves up in some directions, down in others.
- H singular: flat direction. Higher-order derivatives are needed to classify the point.
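A brief example that classifies a critical point by the eigenvalues of the Hessian, using the textbook saddle $f(x) = x_1^2 - x_2^2$:

```python
import torch
from torch.autograd.functional import hessian

def f(x):
    return x[0] ** 2 - x[1] ** 2          # classic saddle

H = hessian(f, torch.zeros(2))            # Hessian at the critical point (0, 0)
print(torch.linalg.eigvalsh(H))           # tensor([-2., 2.]): indefinite -> saddle
```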
Interactive Rules
Explore the three fundamental rules of matrix calculus: scalar-by-vector, vector-by-vector (Jacobian), and the chain rule.
[Interactive guide to common derivative identities, e.g. the gradient of the dot product $w^\top x$ with respect to $x$ is simply $w$.]
Quick Reference Cheat Sheet
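The identities covered in this section at a glance (gradients written as column vectors):

| Expression | Derivative |
|---|---|
| $a^\top x$ | $\nabla_x = a$ |
| $x^\top x$ | $\nabla_x = 2x$ |
| $x^\top A x$ | $\nabla_x = (A + A^\top)x$ |
| $\lVert Xw - y\rVert_2^2$ | $\nabla_w = 2X^\top(Xw - y)$ |
| $Ax$ | $\partial(Ax)/\partial x = A$ |
| $g(f(x))$ | $J_{g \circ f}(x) = J_g(f(x))\,J_f(x)$ |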
Matrix Chain Rule
The chain rule in matrix calculus is about multiplying Jacobians in the right order: if $h(x) = g(f(x))$, then $J_h(x) = J_g(f(x))\,J_f(x)$.
Backpropagation is Chain Rule
A neural network is a composition of functions: $f(x) = f_L(f_{L-1}(\cdots f_1(x)\cdots))$.
Backprop computes gradients by starting from the loss and multiplying Jacobians backwards: $\frac{\partial \mathcal{L}}{\partial x} = \frac{\partial \mathcal{L}}{\partial f_L}\,\frac{\partial f_L}{\partial f_{L-1}} \cdots \frac{\partial f_1}{\partial x}$.
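A minimal sketch of this Jacobian product for a two-layer tanh composition, checked against PyTorch autograd (the shapes and random weights here are purely illustrative):

```python
import torch

# Composition: loss = sum(tanh(W2 @ tanh(W1 @ x)))
W1, W2 = torch.randn(4, 3), torch.randn(2, 4)
x = torch.randn(3, requires_grad=True)

h = torch.tanh(W1 @ x)
out = torch.tanh(W2 @ h)
out.sum().backward()

# Chain rule by hand: dL/dx = dL/dout @ dout/dh @ dh/dx
dL_dout = torch.ones(2)                              # derivative of sum
dout_dh = torch.diag(1 - out.detach() ** 2) @ W2     # Jacobian of tanh(W2 h)
dh_dx = torch.diag(1 - h.detach() ** 2) @ W1         # Jacobian of tanh(W1 x)
manual = dL_dout @ dout_dh @ dh_dx
print(torch.allclose(x.grad, manual))                # True
```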
Case Study: Linear Regression
The Problem
Predict bulb lifespan ($y$) from features (voltage, temperature, filament thickness). We have a data matrix $X \in \mathbb{R}^{n \times d}$ and targets $y \in \mathbb{R}^n$. Find the optimal weights $w \in \mathbb{R}^d$.
The Loss Function
The squared-error loss is $L(w) = \|Xw - y\|_2^2$.
The Gradient
Using the least-squares identity from the rules above: $\nabla_w L(w) = 2X^\top(Xw - y)$.
Setting the gradient to zero gives the Normal Equations: $X^\top X\, w = X^\top y$, i.e. $w^* = (X^\top X)^{-1} X^\top y$. This is an exact, closed-form solution!
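A short NumPy sketch on synthetic data (the features and targets are random placeholders, not real bulb measurements):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))           # 100 bulbs, 3 features (synthetic)
y = rng.normal(size=100)

# Normal equations: solve X^T X w = X^T y (avoid forming the explicit inverse)
w = np.linalg.solve(X.T @ X, X.T @ y)

grad = 2 * X.T @ (X @ w - y)            # the gradient vanishes at the solution
print(np.allclose(grad, 0))             # True
```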
ML Applications
Custom Autograd
When implementing a custom layer (e.g., in PyTorch or JAX), you must define the backward pass. This requires analytically deriving the Jacobian of your operation.
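A minimal illustrative example: a custom square operation in PyTorch whose backward pass applies the hand-derived derivative $2x$:

```python
import torch

class Square(torch.autograd.Function):
    """Custom op y = x^2 with a hand-derived backward pass."""

    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)
        return x ** 2

    @staticmethod
    def backward(ctx, grad_output):
        (x,) = ctx.saved_tensors
        return grad_output * 2 * x      # chain rule: dL/dx = dL/dy * dy/dx

x = torch.randn(3, requires_grad=True)
Square.apply(x).sum().backward()
print(torch.allclose(x.grad, 2 * x.detach()))   # matches the analytic gradient
```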
Natural Gradient Descent
Uses the Fisher Information Matrix (expected Hessian of log-likelihood) to precondition gradients, correcting for the geometry of the parameter space.
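A rough sketch of one natural-gradient step for logistic regression, using the empirical Fisher (average outer product of per-example gradients) as a stand-in for the expected Fisher information; the data, damping, and step size are illustrative assumptions:

```python
import torch

torch.manual_seed(0)
X = torch.randn(200, 3)
true_w = torch.tensor([1.0, -2.0, 0.5])
y = (torch.rand(200) < torch.sigmoid(X @ true_w)).float()
w = torch.zeros(3, requires_grad=True)

loss = torch.nn.functional.binary_cross_entropy_with_logits(X @ w, y)
grad, = torch.autograd.grad(loss, w)

# Per-example gradients of the negative log-likelihood: (sigmoid(x_i^T w) - y_i) x_i
residual = (torch.sigmoid(X @ w.detach()) - y).unsqueeze(1)
per_example = residual * X                       # shape (200, 3)
F = per_example.T @ per_example / len(X)         # empirical Fisher, 3 x 3

# Precondition the gradient by the (damped) Fisher and take one step
natural_grad = torch.linalg.solve(F + 1e-4 * torch.eye(3), grad)
w_new = w.detach() - 0.1 * natural_grad
print(w_new)
```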
Variational Inference
Optimizing the ELBO in VAEs requires computing derivatives of log-determinants and trace operations, which are pure matrix calculus.
Gaussian Processes
Optimizing kernel hyperparameters requires gradients of the marginal likelihood, involving derivatives of matrix inverses and determinants, for example $\frac{\partial}{\partial \theta} \log\lvert K\rvert = \operatorname{tr}\!\left(K^{-1}\frac{\partial K}{\partial \theta}\right)$.
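A small NumPy check of this identity for an RBF kernel matrix, comparing the trace formula against a finite-difference derivative of $\log\lvert K\rvert$ (the kernel, jitter, and lengthscale here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(20, 2))
d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)   # squared distances

def K(l):                                             # RBF kernel + jitter
    return np.exp(-0.5 * d2 / l ** 2) + 1e-4 * np.eye(len(X))

l = 1.5
dK_dl = np.exp(-0.5 * d2 / l ** 2) * d2 / l ** 3      # analytic dK/dl

trace_formula = np.trace(np.linalg.solve(K(l), dK_dl))
finite_diff = (np.linalg.slogdet(K(l + 1e-4))[1]
               - np.linalg.slogdet(K(l - 1e-4))[1]) / 2e-4
print(trace_formula, finite_diff)                     # the two values agree
```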