Introduction
Every layer in a neural network, from simple feed-forward networks to GPT's attention blocks, is fundamentally a matrix multiplication followed by a non-linearity. Understanding matrix multiplication is essential to understanding deep learning.
Matrix multiplication is not just "multiplying numbers." It is combining information and transforming space.
Prerequisites
Before diving in, make sure you understand Dot Product and Norms. Matrix multiplication is essentially many dot products computed in parallel.
The Dimension Rule
You cannot multiply any two matrices. Their dimensions must align in a specific way.
Inner dimensions must match. Result takes outer dimensions.
Valid ✓
(3 × 4) × (4 × 2) = (3 × 2)
Invalid ✗
(3 × 4) × (3 × 2) = ???
Inner dimensions do not match (4 ≠ 3).
Practical Tip
Dimension mismatch errors are one of the most common bugs in ML code. Always trace dimensions through your network architecture before implementing.
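A quick way to catch these bugs is to check shapes explicitly before multiplying. A minimal sketch in PyTorch; the helper check_matmul_shapes is illustrative, not a library function:

```python
import torch

def check_matmul_shapes(a: torch.Tensor, b: torch.Tensor) -> tuple[int, int]:
    """Illustrative helper: verify inner dimensions match, return the result shape."""
    if a.shape[1] != b.shape[0]:
        raise ValueError(f"Inner dimensions do not match: {a.shape[1]} != {b.shape[0]}")
    return (a.shape[0], b.shape[1])

A = torch.randn(3, 4)
B = torch.randn(4, 2)
print(check_matmul_shapes(A, B))   # (3, 2) -- inner dims 4 and 4 match
print((A @ B).shape)               # torch.Size([3, 2])

C = torch.randn(3, 2)
# A @ C would raise a RuntimeError: inner dimensions 4 and 3 do not match.
```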
The Dot Product View
Each element C[i][j] in the result is the dot product of row i from A and column j from B.
Walk along row i of A and column j of B simultaneously, multiplying and summing.
Intuition
Think of it as asking: "How much does row i of A align with column j of B?" High alignment means a large value in C[i][j].
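A short PyTorch sketch confirming this on a small example:

```python
import torch

A = torch.randn(2, 3)
B = torch.randn(3, 2)
C = A @ B

# C[i][j] is the dot product of row i of A with column j of B.
i, j = 1, 0
manual = torch.dot(A[i, :], B[:, j])
print(torch.allclose(manual, C[i, j]))  # True
```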
Row and Column Interpretations
There are multiple ways to think about matrix multiplication beyond the element-by-element view.
Column View
Each column of C is a linear combination of columns of A, with coefficients from the corresponding column of B.
Row View
Each row of C is a linear combination of rows of B, with coefficients from the corresponding row of A.
Outer Product View
C is the sum of outer products of columns of A with rows of B.
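All three views produce the same product; a sketch verifying them numerically in PyTorch:

```python
import torch

A = torch.randn(3, 4)
B = torch.randn(4, 2)
C = A @ B

# Column view: column j of C is a linear combination of A's columns,
# with coefficients taken from B[:, j].
col_view = torch.stack([A @ B[:, j] for j in range(B.shape[1])], dim=1)

# Row view: row i of C is a linear combination of B's rows,
# with coefficients taken from A[i, :].
row_view = torch.stack([A[i, :] @ B for i in range(A.shape[0])], dim=0)

# Outer product view: C is the sum over k of outer(A[:, k], B[k, :]).
outer_view = sum(torch.outer(A[:, k], B[k, :]) for k in range(A.shape[1]))

print(torch.allclose(col_view, C),
      torch.allclose(row_view, C),
      torch.allclose(outer_view, C))  # True True True
```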
Interactive: Step-by-Step
Step through the multiplication of a 2×3 matrix by a 3×2 matrix. Watch how each result cell is computed as a dot product.
Transformation View
Think of a matrix as a function that transforms vectors. This geometric perspective is crucial for understanding deep learning.
If x is a point in space, Ax moves that point. Depending on the matrix, the transformation can stretch, rotate, shear, or reflect space.
AB is a single transformation: apply B first, then A.
Rotation
Orthogonal matrices rotate vectors without changing their length.
Scaling
Diagonal matrices scale each axis independently.
Projection
Some matrices project onto lower-dimensional subspaces.
Shearing
Off-diagonal elements create shearing effects.
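A sketch of these transformation types as 2×2 matrices acting on a point; the angle and scale factors are arbitrary choices:

```python
import math
import torch

x = torch.tensor([1.0, 0.0])

theta = math.pi / 4
rotation = torch.tensor([[math.cos(theta), -math.sin(theta)],
                         [math.sin(theta),  math.cos(theta)]])  # orthogonal: preserves length
scaling  = torch.tensor([[2.0, 0.0],
                         [0.0, 0.5]])                           # diagonal: scales each axis
shear    = torch.tensor([[1.0, 1.0],
                         [0.0, 1.0]])                           # off-diagonal term shears
project  = torch.tensor([[1.0, 0.0],
                         [0.0, 0.0]])                           # projects onto the x-axis

print(rotation @ x)           # rotated 45 degrees, length unchanged
print(scaling @ x)            # stretched along x
print(project @ x)            # flattened onto the x-axis
print(shear @ (scaling @ x))  # composition: scaling first, then shear
```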
Key Properties
Non-Commutative
Order matters! AB and BA often have different dimensions or values. Rotating then scaling is different from scaling then rotating.
Associative
(AB)C = A(BC). The grouping never changes the result, so you can choose the multiplication order that minimizes cost.
Distributive
A(B + C) = AB + AC, and likewise (A + B)C = AC + BC.
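A sketch checking these properties numerically; the shapes in the cost comment are illustrative:

```python
import torch

A = torch.randn(3, 3)
B = torch.randn(3, 3)
C = torch.randn(3, 3)

# Non-commutative: AB and BA generally differ.
print(torch.allclose(A @ B, B @ A))              # almost surely False

# Associative: grouping does not change the result (up to float error)...
print(torch.allclose((A @ B) @ C, A @ (B @ C)))  # True

# ...but it changes the cost. For A: (1000, 2), B: (2, 1000), C: (1000, 1),
# (A @ B) @ C needs ~3,000,000 multiplies, while A @ (B @ C) needs ~4,000.

# Distributive over addition.
print(torch.allclose(A @ (B + C), A @ B + A @ C))  # True
```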
Transpose and Products
The transpose operation (swapping rows and columns) interacts with multiplication in important ways.
The Transpose Rule
(AB)ᵀ = BᵀAᵀ: the order reverses. Critical in the backpropagation derivation.
Gram Matrix
The Gram matrix AᵀA is always symmetric and positive semi-definite. Used in style transfer.
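A sketch verifying the transpose rule and the Gram matrix properties in PyTorch:

```python
import torch

A = torch.randn(3, 4)
B = torch.randn(4, 2)

# Transpose rule: (AB)^T = B^T A^T -- the order reverses.
print(torch.allclose((A @ B).T, B.T @ A.T))   # True

# Gram matrix G = A^T A is symmetric positive semi-definite.
G = A.T @ A
print(torch.allclose(G, G.T))                 # symmetric
eigvals = torch.linalg.eigvalsh(G)
print((eigvals >= -1e-6).all())               # eigenvalues non-negative (up to float error)
```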
Special Matrices
Diagonal
Only the diagonal elements are non-zero; multiplying a vector costs O(n) instead of O(n²).
Orthogonal
Preserves lengths and angles (rotation/reflection).
Symmetric
Has real eigenvalues. Hessian matrices are symmetric.
Sparse
Most elements are zero. Efficient storage/compute.
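A sketch of why this structure pays off, assuming the diagonal matrix is stored as a plain vector d:

```python
import torch

n = 4
d = torch.randn(n)           # store only the diagonal: O(n) memory
x = torch.randn(n)

# Multiplying by diag(d) is just elementwise scaling: O(n) instead of O(n^2).
fast = d * x
dense = torch.diag(d) @ x
print(torch.allclose(fast, dense))   # True

# Orthogonal matrices preserve length: ||Qx|| = ||x||.
Q, _ = torch.linalg.qr(torch.randn(n, n))
print(torch.allclose(torch.linalg.norm(Q @ x), torch.linalg.norm(x)))  # True
```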
Batched Operations
In deep learning, we rarely multiply single matrices. We work with batches of data processed in parallel.
Batch Matrix Multiplication
B independent matrix multiplications happen in parallel. This is how a batch of inputs flows through a neural network layer simultaneously.
GPU Parallelism
All the matrix multiplications in a batch execute simultaneously on the GPU. This is why batch size matters for training throughput.
Neural Network Inference
A batch of inputs flows through each layer in parallel; torch.bmm() does exactly this.
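A minimal torch.bmm sketch; the batch size and matrix dimensions are arbitrary:

```python
import torch

batch, n, m, p = 4, 3, 5, 2        # batch of 4 independent (3x5) @ (5x2) products
A = torch.randn(batch, n, m)
X = torch.randn(batch, m, p)

out = torch.bmm(A, X)              # shape (4, 3, 2); each slice is an independent matmul
print(out.shape)

# Equivalent loop, for clarity (much slower than the batched call on GPU):
for b in range(batch):
    assert torch.allclose(out[b], A[b] @ X[b])
```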
ML Applications
Neural Network Layer
A layer computes Wx + b for each input: the core building block of MLP, RNN, and Transformer networks.
Attention Mechanism
Self-attention is built almost entirely from matrix multiplications: QKᵀ produces the attention scores, and the softmax-weighted scores multiply V.
Convolution (im2col)
CNNs use im2col to turn convolution into matrix multiplication, which lets them leverage fast GPU GEMM kernels.
Embedding Lookup
Mathematically equivalent to multiplying a one-hot vector with an embedding matrix.
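A sketch connecting two of these applications back to plain matrix multiplication; the dimensions are illustrative, and the attention example is a single head with no masking:

```python
import torch
import torch.nn.functional as F

# Neural network layer: y = xW^T + b is a (batched) matrix multiplication.
batch, d_in, d_out = 8, 16, 32
x = torch.randn(batch, d_in)
W = torch.randn(d_out, d_in)
b = torch.randn(d_out)
y = x @ W.T + b                    # shape (8, 32)

# Self-attention: softmax(QK^T / sqrt(d)) V is two matmuls around a softmax.
seq, d_k = 10, 16
Q = torch.randn(seq, d_k)
K = torch.randn(seq, d_k)
V = torch.randn(seq, d_k)
scores = Q @ K.T / d_k**0.5        # (10, 10) attention scores
out = F.softmax(scores, dim=-1) @ V
print(y.shape, out.shape)
```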
Computational Cost
For two n × n matrices, naive multiplication requires O(n³) operations. This cubic scaling is a fundamental constraint in deep learning.
Naive
O(n³) scalar multiplications.
Strassen
O(n^2.807) via recursive block decomposition.
Theoretical
The best known bound is roughly O(n^2.37), though it is not practical.
Implication
Doubling layer width (512 to 1024) increases the multiplication compute by 8x (2³ = 8). This is why model scaling is expensive. GPUs maximize throughput through parallelism, but the O(n³) complexity remains.
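A back-of-the-envelope sketch, counting roughly 2mnp floating-point operations for an (m × n) × (n × p) product:

```python
def matmul_flops(m: int, n: int, p: int) -> int:
    """Approximate FLOPs for an (m x n) @ (n x p) product: one multiply and one add per term."""
    return 2 * m * n * p

narrow = matmul_flops(512, 512, 512)
wide = matmul_flops(1024, 1024, 1024)
print(wide / narrow)   # 8.0 -- doubling every dimension costs 8x the compute
```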