Introduction
Every layer in a neural network, from simple feed-forward networks to GPT's attention blocks, is fundamentally a matrix multiplication followed by a non-linearity. Understanding matrix multiplication is essential to understanding deep learning.
Matrix multiplication is not just "multiplying numbers." It is combining information and transforming space.
Prerequisites
Before diving in, make sure you understand Dot Product and Norms. Matrix multiplication is essentially many dot products computed in parallel.
The Dimension Rule
You cannot multiply any two matrices. Their dimensions must align in a specific way.
Inner dimensions must match. Result takes outer dimensions.
Valid ✓
(3 × 4) × (4 × 2) = (3 × 2)
Invalid ✗
(3 × 4) × (3 × 2) = ???
Inner dimensions do not match (4 ≠ 3).
Practical Tip
Dimension mismatch errors are one of the most common bugs in ML code. Always trace dimensions through your network architecture before implementing.
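A quick way to catch these bugs is to check shapes explicitly before multiplying. A minimal sketch in PyTorch; the helper check_matmul_shapes is illustrative, not a library function:

```python
import torch

def check_matmul_shapes(a: torch.Tensor, b: torch.Tensor) -> tuple[int, int]:
    """Illustrative helper: verify inner dimensions match, return the result shape."""
    if a.shape[1] != b.shape[0]:
        raise ValueError(f"Inner dimensions do not match: {a.shape[1]} != {b.shape[0]}")
    return (a.shape[0], b.shape[1])

A = torch.randn(3, 4)
B = torch.randn(4, 2)
print(check_matmul_shapes(A, B))   # (3, 2) -- inner dims 4 and 4 match
print((A @ B).shape)               # torch.Size([3, 2])

C = torch.randn(3, 2)
# A @ C would raise a RuntimeError: inner dimensions 4 and 3 do not match.
```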
The Dot Product View
Each element C[i][j] in the result is the dot product of row i from A and column j from B.
Walk along row i of A and column j of B simultaneously, multiplying and summing.
Intuition
Think of it as asking: "How much does row i of A align with column j of B?" High alignment means a large value in C[i][j].
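A short PyTorch sketch confirming this on a small example:

```python
import torch

A = torch.randn(2, 3)
B = torch.randn(3, 2)
C = A @ B

# C[i][j] is the dot product of row i of A with column j of B.
i, j = 1, 0
manual = torch.dot(A[i, :], B[:, j])
print(torch.allclose(manual, C[i, j]))  # True
```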
Row and Column Interpretations
There are multiple ways to think about matrix multiplication beyond the element-by-element view.
Column View
Each column of C is a linear combination of columns of A, with coefficients from the corresponding column of B.
Row View
Each row of C is a linear combination of rows of B, with coefficients from the corresponding row of A.
Outer Product View
C is the sum of outer products of columns of A with rows of B.
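All three views produce the same product; a sketch verifying them numerically in PyTorch:

```python
import torch

A = torch.randn(3, 4)
B = torch.randn(4, 2)
C = A @ B

# Column view: column j of C is a linear combination of A's columns,
# with coefficients taken from B[:, j].
col_view = torch.stack([A @ B[:, j] for j in range(B.shape[1])], dim=1)

# Row view: row i of C is a linear combination of B's rows,
# with coefficients taken from A[i, :].
row_view = torch.stack([A[i, :] @ B for i in range(A.shape[0])], dim=0)

# Outer product view: C is the sum over k of outer(A[:, k], B[k, :]).
outer_view = sum(torch.outer(A[:, k], B[k, :]) for k in range(A.shape[1]))

print(torch.allclose(col_view, C),
      torch.allclose(row_view, C),
      torch.allclose(outer_view, C))  # True True True
```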
Interactive: Step-by-Step
Step through the multiplication of a 2×3 matrix by a 3×2 matrix. Watch how each result cell is computed as a dot product.
Transformation View
Think of a matrix as a function that transforms vectors. This geometric perspective is crucial for understanding deep learning.
If x is a point in space, Ax moves that point. Depending on the matrix, the transformation can stretch, rotate, shear, or reflect space.
AB is a single transformation: apply B first, then A.
Rotation
Orthogonal matrices rotate vectors without changing their length.
Scaling
Diagonal matrices scale each axis independently.
Projection
Some matrices project onto lower-dimensional subspaces.
Shearing
Off-diagonal elements create shearing effects.
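A sketch of these transformation types as 2×2 matrices acting on a point; the angle and scale factors are arbitrary choices:

```python
import math
import torch

x = torch.tensor([1.0, 0.0])

theta = math.pi / 4
rotation = torch.tensor([[math.cos(theta), -math.sin(theta)],
                         [math.sin(theta),  math.cos(theta)]])  # orthogonal: preserves length
scaling  = torch.tensor([[2.0, 0.0],
                         [0.0, 0.5]])                           # diagonal: scales each axis
shear    = torch.tensor([[1.0, 1.0],
                         [0.0, 1.0]])                           # off-diagonal term shears
project  = torch.tensor([[1.0, 0.0],
                         [0.0, 0.0]])                           # projects onto the x-axis

print(rotation @ x)           # rotated 45 degrees, length unchanged
print(scaling @ x)            # stretched along x
print(project @ x)            # flattened onto the x-axis
print(shear @ (scaling @ x))  # composition: scaling first, then shear
```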
Key Properties
Non-Commutative
Order matters! AB and BA often have different dimensions or values. Rotating then scaling is different from scaling then rotating.
Associative
(AB)C = A(BC). The grouping never changes the result, so you can choose the multiplication order that minimizes cost.
Distributive
A(B + C) = AB + AC, and likewise (A + B)C = AC + BC.
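A sketch checking these properties numerically; the shapes in the cost comment are illustrative:

```python
import torch

A = torch.randn(3, 3)
B = torch.randn(3, 3)
C = torch.randn(3, 3)

# Non-commutative: AB and BA generally differ.
print(torch.allclose(A @ B, B @ A))              # almost surely False

# Associative: grouping does not change the result (up to float error)...
print(torch.allclose((A @ B) @ C, A @ (B @ C)))  # True

# ...but it changes the cost. For A: (1000, 2), B: (2, 1000), C: (1000, 1),
# (A @ B) @ C needs ~3,000,000 multiplies, while A @ (B @ C) needs ~4,000.

# Distributive over addition.
print(torch.allclose(A @ (B + C), A @ B + A @ C))  # True
```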
Transpose and Products
The transpose operation (swapping rows and columns) interacts with multiplication in important ways.
The Transpose Rule
(AB)ᵀ = BᵀAᵀ: the order reverses. Critical in the backpropagation derivation.
Gram Matrix
The Gram matrix AᵀA is always symmetric and positive semi-definite. Used in style transfer.
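A sketch verifying the transpose rule and the Gram matrix properties in PyTorch:

```python
import torch

A = torch.randn(3, 4)
B = torch.randn(4, 2)

# Transpose rule: (AB)^T = B^T A^T -- the order reverses.
print(torch.allclose((A @ B).T, B.T @ A.T))   # True

# Gram matrix G = A^T A is symmetric positive semi-definite.
G = A.T @ A
print(torch.allclose(G, G.T))                 # symmetric
eigvals = torch.linalg.eigvalsh(G)
print((eigvals >= -1e-6).all())               # eigenvalues non-negative (up to float error)
```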
Special Matrices
Diagonal
Only the diagonal elements are non-zero; multiplying a vector costs O(n) instead of O(n²).
Orthogonal
Preserves lengths and angles (rotation/reflection).
Symmetric
Has real eigenvalues. Hessian matrices are symmetric.
Sparse
Most elements are zero. Efficient storage/compute.
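A sketch of why this structure pays off, assuming the diagonal matrix is stored as a plain vector d:

```python
import torch

n = 4
d = torch.randn(n)           # store only the diagonal: O(n) memory
x = torch.randn(n)

# Multiplying by diag(d) is just elementwise scaling: O(n) instead of O(n^2).
fast = d * x
dense = torch.diag(d) @ x
print(torch.allclose(fast, dense))   # True

# Orthogonal matrices preserve length: ||Qx|| = ||x||.
Q, _ = torch.linalg.qr(torch.randn(n, n))
print(torch.allclose(torch.linalg.norm(Q @ x), torch.linalg.norm(x)))  # True
```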
Batched Operations
In deep learning, we rarely multiply single matrices. We work with batches of data processed in parallel.
Batch Matrix Multiplication
B independent matrix multiplications happen in parallel. This is how a batch of inputs flows through a neural network layer simultaneously.
GPU Parallelism
All the matrix multiplications in a batch execute simultaneously on the GPU. This is why batch size matters for training throughput.
Neural Network Inference
A batch of inputs flows through each layer in parallel; torch.bmm() does exactly this.
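A minimal torch.bmm sketch; the batch size and matrix dimensions are arbitrary:

```python
import torch

batch, n, m, p = 4, 3, 5, 2        # batch of 4 independent (3x5) @ (5x2) products
A = torch.randn(batch, n, m)
X = torch.randn(batch, m, p)

out = torch.bmm(A, X)              # shape (4, 3, 2); each slice is an independent matmul
print(out.shape)

# Equivalent loop, for clarity (much slower than the batched call on GPU):
for b in range(batch):
    assert torch.allclose(out[b], A[b] @ X[b])
```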
ML Applications
Neural Network Layer
A layer computes Wx + b for each input: the core building block of MLP, RNN, and Transformer networks.
Attention Mechanism
Self-attention is built almost entirely from matrix multiplications: QKᵀ produces the attention scores, and the softmax-weighted scores multiply V.
Convolution (im2col)
CNNs use im2col to turn convolution into matrix multiplication, which lets them leverage fast GPU GEMM kernels.
Embedding Lookup
Mathematically equivalent to multiplying a one-hot vector with an embedding matrix.
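A sketch connecting two of these applications back to plain matrix multiplication; the dimensions are illustrative, and the attention example is a single head with no masking:

```python
import torch
import torch.nn.functional as F

# Neural network layer: y = xW^T + b is a (batched) matrix multiplication.
batch, d_in, d_out = 8, 16, 32
x = torch.randn(batch, d_in)
W = torch.randn(d_out, d_in)
b = torch.randn(d_out)
y = x @ W.T + b                    # shape (8, 32)

# Self-attention: softmax(QK^T / sqrt(d)) V is two matmuls around a softmax.
seq, d_k = 10, 16
Q = torch.randn(seq, d_k)
K = torch.randn(seq, d_k)
V = torch.randn(seq, d_k)
scores = Q @ K.T / d_k**0.5        # (10, 10) attention scores
out = F.softmax(scores, dim=-1) @ V
print(y.shape, out.shape)
```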
Computational Cost
For two n × n matrices, naive multiplication requires O(n³) operations. This cubic scaling is a fundamental constraint in deep learning.
Naive
O(n³) scalar multiplications.
Strassen
O(n^2.807) via recursive block decomposition.
Theoretical
The best known bound is roughly O(n^2.37), though it is not practical.
Implication
Doubling layer width (512 to 1024) increases the multiplication compute by 8x (2³ = 8). This is why model scaling is expensive. GPUs maximize throughput through parallelism, but the O(n³) complexity remains.
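A back-of-the-envelope sketch, counting roughly 2mnp floating-point operations for an (m × n) × (n × p) product:

```python
def matmul_flops(m: int, n: int, p: int) -> int:
    """Approximate FLOPs for an (m x n) @ (n x p) product: one multiply and one add per term."""
    return 2 * m * n * p

narrow = matmul_flops(512, 512, 512)
wide = matmul_flops(1024, 1024, 1024)
print(wide / narrow)   # 8.0 -- doubling every dimension costs 8x the compute
```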