Linear Algebra


Matrix Calculus

The language of gradients in high dimensions. Essential for understanding backpropagation and optimization.

Beyond Scalar Calculus

In standard calculus, you differentiate a scalar function $f: \mathbb{R} \to \mathbb{R}$. The derivative is a single number.

In Machine Learning, we deal with functions of many variables (neural networks have millions of parameters) that output vectors, matrices, or scalars. Matrix Calculus provides the notation and rules to compute these derivatives efficiently.

Why It Matters for ML

Every time you call loss.backward() in PyTorch, you are computing matrix derivatives. Understanding matrix calculus lets you derive gradients by hand, debug backpropagation, and design custom layers.

Layout Conventions

There are two conventions for arranging derivatives: Numerator Layout and Denominator Layout. ML typically uses Numerator Layout (also called Jacobian layout).

Numerator Layout

The derivative has the same shape as the numerator. If $y \in \mathbb{R}^m$ and $x \in \mathbb{R}^n$, then $\frac{\partial y}{\partial x}$ is $m \times n$.

$$\text{Shape}\left(\frac{\partial y}{\partial x}\right) = \text{Shape}(y) \times \text{Shape}(x)^T$$

Denominator Layout

The derivative is arranged according to the denominator: the result is the transpose of the numerator-layout derivative, so for $y \in \mathbb{R}^m$ and $x \in \mathbb{R}^n$, $\frac{\partial y}{\partial x}$ is $n \times m$. Used in some optimization textbooks.

$$\text{Shape}\left(\frac{\partial y}{\partial x}\right) = \text{Shape}(x) \times \text{Shape}(y)^T$$

Warning: Always check which convention a paper or library uses. Transposing the wrong matrix leads to dimension mismatch bugs in backprop.

Gradients

The gradient of a scalar function $f: \mathbb{R}^n \to \mathbb{R}$ is a vector of partial derivatives. In numerator layout it is technically a row vector, but we often treat it as a column vector for convenience in update rules.

$$\nabla f = \begin{bmatrix} \frac{\partial f}{\partial x_1} \\ \frac{\partial f}{\partial x_2} \\ \vdots \\ \frac{\partial f}{\partial x_n} \end{bmatrix}$$

The gradient points in the direction of steepest ascent of the function.

Key Gradient Identities

  • $\nabla_x (a^T x) = a$ (linear function)
  • $\nabla_x (x^T x) = 2x$ (squared norm; recall the section on norms)
  • $\nabla_x (x^T A x) = (A + A^T)x$ (quadratic form)
  • $\nabla_x \|Ax - b\|^2 = 2A^T(Ax - b)$ (least squares)
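
These identities are easy to sanity-check with finite differences. A minimal sketch in NumPy (the matrix $A$, vector $b$, and point $x$ below are arbitrary test values, not from the text):

```python
import numpy as np

def numerical_grad(f, x, eps=1e-6):
    """Central-difference estimate of the gradient of a scalar function f at x."""
    g = np.zeros_like(x)
    for i in range(x.size):
        e = np.zeros_like(x)
        e[i] = eps
        g[i] = (f(x + e) - f(x - e)) / (2 * eps)
    return g

rng = np.random.default_rng(0)
n = 4
A = rng.standard_normal((n, n))
b = rng.standard_normal(n)
x = rng.standard_normal(n)

# Quadratic form: grad of x^T A x is (A + A^T) x
assert np.allclose(numerical_grad(lambda v: v @ A @ v, x), (A + A.T) @ x, atol=1e-4)
# Least squares: grad of ||Ax - b||^2 is 2 A^T (Ax - b)
assert np.allclose(numerical_grad(lambda v: np.sum((A @ v - b)**2), x),
                   2 * A.T @ (A @ x - b), atol=1e-4)
```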

Jacobian Matrix

When the output is a vector, $f: \mathbb{R}^n \to \mathbb{R}^m$, we need an $m \times n$ matrix to capture all partial derivatives. This is the Jacobian.

$$J = \frac{\partial f}{\partial x} = \begin{bmatrix} \frac{\partial f_1}{\partial x_1} & \cdots & \frac{\partial f_1}{\partial x_n} \\ \vdots & \ddots & \vdots \\ \frac{\partial f_m}{\partial x_1} & \cdots & \frac{\partial f_m}{\partial x_n} \end{bmatrix}$$

Linear Transformation View

The Jacobian tells you how the output space locally stretches and rotates. If $f(x) = Ax$, then $J = A$. The derivative of a linear map is the map itself!
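
This is easy to confirm numerically: a finite-difference Jacobian of $f(x) = Ax$ recovers $A$ itself. A minimal NumPy sketch with a hand-picked $2 \times 2$ matrix:

```python
import numpy as np

def numerical_jacobian(f, x, eps=1e-6):
    """Central-difference Jacobian: entry (i, j) approximates df_i / dx_j."""
    fx = f(x)
    J = np.zeros((fx.size, x.size))
    for j in range(x.size):
        e = np.zeros_like(x)
        e[j] = eps
        J[:, j] = (f(x + e) - f(x - e)) / (2 * eps)
    return J

A = np.array([[2.0, -1.0],
              [0.5,  3.0]])
J = numerical_jacobian(lambda v: A @ v, np.array([1.0, 1.0]))
assert np.allclose(J, A, atol=1e-4)   # the Jacobian of a linear map is the map itself
```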

Interactive: Jacobian as Linearization

Explore how the Jacobian matrix acts as a local linear approximation of a non-linear transformation $f(x, y) = (u, v)$. As you move a point in the input space, the grid around it deforms according to $J$ evaluated at that point: column 1 of $J$ shows how the output changes when $x$ moves, column 2 shows how it changes when $y$ moves.

Hessian Matrix

The Hessian is the matrix of second derivatives. It tells you about the curvature of a scalar function.

$$H = \nabla^2 f = \begin{bmatrix} \frac{\partial^2 f}{\partial x_1^2} & \frac{\partial^2 f}{\partial x_1 \partial x_2} & \cdots \\ \frac{\partial^2 f}{\partial x_2 \partial x_1} & \frac{\partial^2 f}{\partial x_2^2} & \cdots \\ \vdots & \vdots & \ddots \end{bmatrix}$$

  • $H$ positive definite: local minimum. The function curves up in all directions (convex).
  • $H$ negative definite: local maximum. The function curves down in all directions (concave).
  • $H$ indefinite: saddle point. Curves up in some directions, down in others.
  • $H$ singular: flat direction. Higher-order derivatives are needed.
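
Since $H$ is symmetric (for twice continuously differentiable $f$), these four cases can be read off its eigenvalues. A small sketch using hand-picked Hessians of $x^2 + y^2$ and $x^2 - y^2$:

```python
import numpy as np

def classify_critical_point(H, tol=1e-8):
    """Classify a critical point from the eigenvalues of the symmetric Hessian H."""
    eig = np.linalg.eigvalsh(H)
    if np.all(eig > tol):
        return "local minimum (positive definite)"
    if np.all(eig < -tol):
        return "local maximum (negative definite)"
    if np.any(eig > tol) and np.any(eig < -tol):
        return "saddle point (indefinite)"
    return "degenerate (singular): need higher-order derivatives"

print(classify_critical_point(np.diag([2.0, 2.0])))    # Hessian of x^2 + y^2 -> minimum
print(classify_critical_point(np.diag([2.0, -2.0])))   # Hessian of x^2 - y^2 -> saddle
```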

Interactive Rules

Explore the three fundamental rules of matrix calculus: scalar-by-vector, vector-by-vector (Jacobian), and the chain rule.

For example, the rule ∂(wᵀx)/∂x = w: the gradient of the dot product wᵀx with respect to x is simply w. Concretely, if f(x) = 3x₁ + 2x₂, then ∇f = [3, 2]ᵀ.

Quick Reference Cheat Sheet

  • ∂(xᵀx)/∂x = 2x
  • ∂(xᵀAx)/∂x = (A + Aᵀ)x
  • ∂(aᵀXb)/∂X = abᵀ
  • ∂log|X|/∂X = X⁻ᵀ
  • ∂tr(AX)/∂X = Aᵀ
  • ∂||Ax−b||²/∂x = 2Aᵀ(Ax−b)

Matrix Chain Rule

The chain rule in matrix calculus is about multiplying Jacobians in the right order.

$$\frac{\partial L}{\partial x} = \frac{\partial L}{\partial y} \cdot \frac{\partial y}{\partial x}$$

For $L = f(g(x))$ with intermediate $y = g(x)$, chain the Jacobians. Be careful with dimensions!

Backpropagation is Chain Rule

A neural network is a composition of functions: $L = f_n(f_{n-1}(\dots f_1(x)))$.

Backprop computes gradients by starting from the loss and multiplying Jacobians backwards:

$$\frac{\partial L}{\partial x} = \frac{\partial L}{\partial f_n} \cdot \frac{\partial f_n}{\partial f_{n-1}} \cdots \frac{\partial f_1}{\partial x}$$
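
A minimal sketch of this idea, assuming a toy two-layer map $L = \tfrac{1}{2}\|W_2 \tanh(W_1 x)\|^2$ with random weights: the backward pass multiplies local Jacobians from the loss toward the input, and the result is checked against finite differences.

```python
import numpy as np

rng = np.random.default_rng(1)
W1, W2 = rng.standard_normal((3, 2)), rng.standard_normal((2, 3))
x = rng.standard_normal(2)

# Forward pass: h = W1 x, a = tanh(h), y = W2 a, L = 0.5 * ||y||^2
h = W1 @ x
a = np.tanh(h)
y = W2 @ a
L = 0.5 * y @ y

# Backward pass: multiply Jacobians from the loss backwards (each dL_d* is a row vector).
dL_dy = y                      # d(0.5 y^T y)/dy = y^T
dL_da = dL_dy @ W2             # Jacobian of y = W2 a is W2
dL_dh = dL_da * (1 - a**2)     # Jacobian of tanh is diagonal -> elementwise product
dL_dx = dL_dh @ W1             # Jacobian of h = W1 x is W1

# Finite-difference check of dL/dx
def loss(v):
    return 0.5 * np.sum((W2 @ np.tanh(W1 @ v))**2)

eps = 1e-6
numeric = np.array([(loss(x + eps * e) - loss(x - eps * e)) / (2 * eps) for e in np.eye(2)])
assert np.allclose(dL_dx, numeric, atol=1e-4)
```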

Case Study: Linear Regression

The Problem

Predict bulb lifespan ($y$) from features (voltage, temperature, filament thickness). We have a data matrix $X \in \mathbb{R}^{n \times d}$ and targets $y \in \mathbb{R}^n$. Find the optimal weights $w$.

The Loss Function

$$L(w) = \|Xw - y\|^2 = (Xw - y)^T(Xw - y)$$

The Gradient

Using matrix calculus rules:

$$\nabla_w L = 2X^T(Xw - y)$$

Setting the gradient to zero gives the Normal Equations: $X^T X w = X^T y$. This is an exact, closed-form solution!
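
A sketch in NumPy on a small synthetic dataset (the true weights and noise level below are made up for illustration): solve the normal equations and confirm the gradient vanishes at the solution. In practice, prefer np.linalg.lstsq or a QR factorization over forming $X^T X$ explicitly, since the latter squares the condition number.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 50, 3
X = rng.standard_normal((n, d))
w_true = np.array([1.5, -2.0, 0.5])
y = X @ w_true + 0.1 * rng.standard_normal(n)

# Normal equations: X^T X w = X^T y
w = np.linalg.solve(X.T @ X, X.T @ y)

grad = 2 * X.T @ (X @ w - y)           # gradient of ||Xw - y||^2 at the solution
assert np.allclose(grad, 0, atol=1e-8)
print(w)                               # close to w_true
```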

ML Applications

Custom Autograd

When implementing a custom layer (e.g., in PyTorch or JAX), you must define the backward pass. This requires analytically deriving the Jacobian of your operation.
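
As a minimal illustration (a toy squared-norm layer, not a layer from this module), here is a hand-written PyTorch autograd Function whose backward uses the identity $\nabla_x(x^T x) = 2x$ from the cheat sheet:

```python
import torch

class SquaredNorm(torch.autograd.Function):
    """y = x^T x with an analytically derived backward pass."""

    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)
        return (x * x).sum()

    @staticmethod
    def backward(ctx, grad_output):
        (x,) = ctx.saved_tensors
        return grad_output * 2 * x     # chain rule: dL/dx = dL/dy * dy/dx

x = torch.randn(5, requires_grad=True, dtype=torch.float64)
y = SquaredNorm.apply(x)
y.backward()
print(torch.allclose(x.grad, 2 * x.detach()))             # matches the analytic gradient
print(torch.autograd.gradcheck(SquaredNorm.apply, (x,)))  # numerical check
```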

Natural Gradient Descent

Uses the Fisher Information Matrix (expected Hessian of log-likelihood) to precondition gradients, correcting for the geometry of the parameter space.
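
A rough sketch of one update step, assuming the common empirical-Fisher approximation (mean outer product of per-example gradients) plus a damping term for numerical stability:

```python
import numpy as np

def natural_gradient_step(theta, per_example_grads, lr=0.1, damping=1e-3):
    """per_example_grads: (N, d) array of per-example gradients of the loss.
    Preconditions the mean gradient by the inverse (damped) empirical Fisher matrix."""
    g = per_example_grads.mean(axis=0)
    F = per_example_grads.T @ per_example_grads / per_example_grads.shape[0]
    step = np.linalg.solve(F + damping * np.eye(theta.size), g)
    return theta - lr * step
```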

Variational Inference

Optimizing the ELBO in VAEs requires computing derivatives of log-determinants and trace operations, which are pure matrix calculus.
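
As a taste of what such derivatives look like, the log-determinant identity from the cheat sheet, $\partial \log|X| / \partial X = X^{-T}$, can be checked numerically on a random positive definite matrix:

```python
import numpy as np

rng = np.random.default_rng(3)
B = rng.standard_normal((3, 3))
X = B @ B.T + 3 * np.eye(3)            # positive definite, so log|X| is well defined

eps = 1e-6
grad = np.zeros_like(X)
for i in range(3):
    for j in range(3):
        E = np.zeros_like(X)
        E[i, j] = eps
        grad[i, j] = (np.linalg.slogdet(X + E)[1] - np.linalg.slogdet(X - E)[1]) / (2 * eps)

assert np.allclose(grad, np.linalg.inv(X).T, atol=1e-4)
```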

Gaussian Processes

Optimizing kernel hyperparameters requires gradients of the marginal likelihood, involving derivatives of matrix inverses and determinants: $\frac{\partial K^{-1}}{\partial \theta} = -K^{-1} \frac{\partial K}{\partial \theta} K^{-1}$.
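
A quick numerical check of that identity, assuming a toy RBF-style kernel $K_{ij}(\theta) = \exp(-\theta (x_i - x_j)^2)$ on a few 1-D inputs with a small jitter for invertibility:

```python
import numpy as np

x = np.array([0.0, 0.5, 1.2, 2.0])

def K(theta):
    d2 = (x[:, None] - x[None, :]) ** 2
    return np.exp(-theta * d2) + 1e-6 * np.eye(x.size)   # jitter keeps K invertible

theta, eps = 1.0, 1e-6
dK_dtheta = (K(theta + eps) - K(theta - eps)) / (2 * eps)
dKinv_numeric = (np.linalg.inv(K(theta + eps)) - np.linalg.inv(K(theta - eps))) / (2 * eps)

Kinv = np.linalg.inv(K(theta))
dKinv_analytic = -Kinv @ dK_dtheta @ Kinv                 # identity from the text

assert np.allclose(dKinv_numeric, dKinv_analytic, atol=1e-4)
```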