Linear Algebra

Orthogonality

Independence, non-overlapping information, and stability in Deep Learning.

Introduction

"Orthogonal" comes from Greek: orthos (right, correct) + gonia (angle). In everyday language, it means "perpendicular" or "at a right angle." In mathematics and machine learning, orthogonality represents a deeper concept: complete independence.

Two orthogonal vectors share absolutely no common direction. In the context of data, if two features are orthogonal, knowing one tells you nothing about the other. They provide completely unique, non-redundant information. This is why orthogonality is sometimes called the "Holy Grail" of feature engineering.

Why Orthogonality Matters in ML

  • Feature Independence: Orthogonal features maximize information content per feature.
  • Numerical Stability: Orthogonal matrices are perfectly conditioned (condition number = 1).
  • Gradient Flow: Orthogonal weight matrices preserve gradient magnitudes in deep networks.
  • Efficient Computation: Inverting orthogonal matrices is trivial (just transpose).

Orthogonality is the mathematical engine behind algorithms like PCA, SVD, QR decomposition, and orthogonal weight initialization in neural networks.

Orthogonal Vectors

Two vectors $\mathbf{u}$ and $\mathbf{v}$ in $\mathbb{R}^n$ are orthogonal (written $\mathbf{u} \perp \mathbf{v}$) if their inner product (dot product) is zero:

$$\mathbf{u} \cdot \mathbf{v} = \mathbf{u}^T \mathbf{v} = \sum_{i=1}^{n} u_i v_i = 0$$

This definition extends to any finite dimension n, not just 2D or 3D.

Example: Verifying Orthogonality

Let $\mathbf{u} = [1, 2, 3]$ and $\mathbf{v} = [1, 1, -1]$. Are they orthogonal?

$$\mathbf{u} \cdot \mathbf{v} = (1)(1) + (2)(1) + (3)(-1) = 1 + 2 - 3 = 0$$

Yes, u and v are orthogonal!
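
As a quick sanity check, the same computation in NumPy (a minimal sketch; the helper name is_orthogonal is just for illustration):

```python
import numpy as np

def is_orthogonal(u, v, tol=1e-10):
    """Two vectors are orthogonal when their dot product is (numerically) zero."""
    return abs(np.dot(u, v)) < tol

u = np.array([1.0, 2.0, 3.0])
v = np.array([1.0, 1.0, -1.0])

print(np.dot(u, v))         # 0.0
print(is_orthogonal(u, v))  # True
```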

The Zero Vector

The zero vector $\mathbf{0}$ is orthogonal to every vector (including itself), since $\mathbf{0} \cdot \mathbf{v} = 0$ for any $\mathbf{v}$. However, when we talk about orthogonal sets or bases, we typically exclude the zero vector.

Orthogonal Sets

A set of vectors $\{\mathbf{v}_1, \mathbf{v}_2, \ldots, \mathbf{v}_k\}$ is an orthogonal set if every pair is orthogonal:

$$\mathbf{v}_i \cdot \mathbf{v}_j = 0 \quad \text{for all } i \neq j$$

Key Theorem: An orthogonal set of non-zero vectors is always linearly independent. This makes orthogonal vectors extremely useful as basis vectors.
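
A small numerical illustration of the theorem, assuming NumPy: the pairwise dot products of an orthogonal set vanish, and stacking the vectors gives a full-rank (hence linearly independent) matrix.

```python
import numpy as np

# An orthogonal set in R^3 (each pair has dot product 0)
vectors = [np.array([1.0, 2.0, 3.0]),
           np.array([1.0, 1.0, -1.0]),
           np.array([-5.0, 4.0, -1.0])]

for i in range(len(vectors)):
    for j in range(i + 1, len(vectors)):
        print(i, j, np.dot(vectors[i], vectors[j]))  # all 0.0

# Full rank => linearly independent
print(np.linalg.matrix_rank(np.vstack(vectors)))     # 3
```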

Geometric Intuition

The dot product has a beautiful geometric interpretation that explains why orthogonal vectors have dot product zero:

$$\mathbf{u} \cdot \mathbf{v} = ||\mathbf{u}|| \cdot ||\mathbf{v}|| \cdot \cos(\theta)$$

When $\theta = 90°$, $\cos(90°) = 0$, so the dot product is zero.
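
A sketch of recovering the angle from the dot product using this formula (the helper angle_deg is illustrative):

```python
import numpy as np

def angle_deg(u, v):
    """Angle between u and v in degrees, via cos(theta) = u.v / (||u|| ||v||)."""
    cos_theta = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    return np.degrees(np.arccos(np.clip(cos_theta, -1.0, 1.0)))

print(angle_deg(np.array([1.0, 2.0, 3.0]), np.array([1.0, 1.0, -1.0])))  # 90.0
print(angle_deg(np.array([1.0, 0.0]), np.array([1.0, 1.0])))             # ~45.0
```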

Projection Interpretation

The dot product measures "how much of u lies in the direction of v."

$$\text{proj}_{\mathbf{v}}(\mathbf{u}) = \frac{\mathbf{u} \cdot \mathbf{v}}{||\mathbf{v}||^2} \mathbf{v}$$

If the dot product is zero, the projection is the zero vector: the vectors share no common direction.
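
The projection formula translates directly into code; a minimal sketch (proj_onto is an illustrative name):

```python
import numpy as np

def proj_onto(u, v):
    """Projection of u onto the direction of v: (u.v / ||v||^2) v."""
    return (np.dot(u, v) / np.dot(v, v)) * v

u = np.array([1.0, 2.0, 3.0])
v = np.array([1.0, 1.0, -1.0])

print(proj_onto(u, v))                          # [0. 0. 0.] -- orthogonal, no shared direction
print(proj_onto(u, np.array([1.0, 0.0, 0.0])))  # [1. 0. 0.]
```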

Pythagorean Theorem

For orthogonal vectors, the Pythagorean theorem holds:

$$||\mathbf{u} + \mathbf{v}||^2 = ||\mathbf{u}||^2 + ||\mathbf{v}||^2$$

This is because in the expansion $||\mathbf{u} + \mathbf{v}||^2 = ||\mathbf{u}||^2 + 2(\mathbf{u} \cdot \mathbf{v}) + ||\mathbf{v}||^2$, the cross-term $2(\mathbf{u} \cdot \mathbf{v})$ vanishes.
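
A quick numerical check of the identity for the orthogonal pair used earlier:

```python
import numpy as np

u = np.array([1.0, 2.0, 3.0])
v = np.array([1.0, 1.0, -1.0])

lhs = np.linalg.norm(u + v) ** 2
rhs = np.linalg.norm(u) ** 2 + np.linalg.norm(v) ** 2
print(lhs, rhs, np.isclose(lhs, rhs))   # 17.0 17.0 True
```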

Interactive: Projection & Orthogonality

Adjust the angle of vector u to see its projection onto v. The vector u splits into a parallel component (the projection onto v) and a perpendicular component (the rejection, u minus the projection). When the angle is 90° (or 270°), the dot product is zero, the projection vanishes, and the vectors are orthogonal.

Orthonormal Bases

An orthonormal set is an orthogonal set where every vector also has unit length (norm = 1). This is the "gold standard" for coordinate systems.

1. Orthogonal

$\mathbf{e}_i \cdot \mathbf{e}_j = 0$ (if $i \neq j$)

2. Normalized

$||\mathbf{e}_i|| = 1$ (unit length)

Combined using the Kronecker delta:

$$\mathbf{e}_i \cdot \mathbf{e}_j = \delta_{ij}$$

Why Orthonormal Bases are Powerful

In an orthonormal basis, finding coordinates is trivial. You don't need to solve a system of equations; you just take dot products!

$$\mathbf{x} = \sum_{i=1}^{n} (\mathbf{x} \cdot \mathbf{e}_i) \mathbf{e}_i$$

The coordinate for $\mathbf{e}_i$ is just $\mathbf{x} \cdot \mathbf{e}_i$. This is why Fourier series and wavelet transforms (which use orthonormal bases) are computationally feasible.
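
A sketch of this in NumPy, using a rotated 2D basis: the coordinates are plain dot products, and the sum reconstructs the original vector.

```python
import numpy as np

theta = np.pi / 6
# Orthonormal basis: the columns of a 2D rotation matrix
e1 = np.array([np.cos(theta), np.sin(theta)])
e2 = np.array([-np.sin(theta), np.cos(theta)])

x = np.array([2.0, 1.0])

# Coordinates are just dot products -- no linear system to solve
c1, c2 = np.dot(x, e1), np.dot(x, e2)

# Reconstruction: x = sum_i (x . e_i) e_i
print(c1 * e1 + c2 * e2)   # [2. 1.]
```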

Orthogonal Matrices

An orthogonal matrix is a square matrix Q whose columns form an orthonormal set. (The name is confusing; "orthonormal matrix" would be more accurate.)

Defining Property

$$Q^T Q = Q Q^T = I$$

which implies $Q^{-1} = Q^T$.

Examples

Rotation Matrix

$$\begin{bmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{bmatrix}$$

Reflection Matrix

$$\begin{bmatrix} 1 & 0 \\ 0 & -1 \end{bmatrix}$$

Permutation Matrix

$$\begin{bmatrix} 0 & 1 \\ 1 & 0 \end{bmatrix}$$

Identity Matrix

$$\begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}$$
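
Each of these examples can be checked against the defining property $Q^T Q = I$; a minimal sketch:

```python
import numpy as np

theta = 0.7
examples = {
    "rotation":    np.array([[np.cos(theta), -np.sin(theta)],
                             [np.sin(theta),  np.cos(theta)]]),
    "reflection":  np.array([[1.0, 0.0], [0.0, -1.0]]),
    "permutation": np.array([[0.0, 1.0], [1.0, 0.0]]),
    "identity":    np.eye(2),
}

for name, Q in examples.items():
    print(name, np.allclose(Q.T @ Q, np.eye(2)))   # True for every example
```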

Properties & Proofs

1. Isometry (Length Preserving)

Multiplying by Q does not change length.

$$||Q\mathbf{x}|| = ||\mathbf{x}||$$

2. Angle Preserving

Dot products are preserved.

$$(Q\mathbf{x}) \cdot (Q\mathbf{y}) = \mathbf{x} \cdot \mathbf{y}$$

3. Determinant

Determinant is always ±1.

$$\det(Q) = \pm 1$$

4. Eigenvalues

All eigenvalues lie on the complex unit circle.

$$|\lambda| = 1$$

Computational Advantage

Inverting a general matrix is O(n³). Inverting an orthogonal matrix is O(n²) (just transpose!). Also, condition number = 1 means perfect numerical stability.
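
These properties are easy to verify numerically on a random orthogonal matrix, here obtained from the QR decomposition of a random Gaussian matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.normal(size=(5, 5)))   # Q has orthonormal columns

x = rng.normal(size=5)

print(np.isclose(np.linalg.norm(Q @ x), np.linalg.norm(x)))   # isometry
print(np.isclose(abs(np.linalg.det(Q)), 1.0))                 # det = +/- 1
print(np.allclose(np.abs(np.linalg.eigvals(Q)), 1.0))         # |lambda| = 1
print(np.isclose(np.linalg.cond(Q), 1.0))                     # condition number 1
print(np.allclose(np.linalg.inv(Q), Q.T))                     # inverse is the transpose
```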

Gram-Schmidt Process

The Gram-Schmidt process transforms any linearly independent vectors into an orthonormal basis for the same space. It works by iteratively subtracting the projection onto previous vectors.

The Algorithm

Step 1: Normalize first vector

$$\mathbf{e}_1 = \frac{\mathbf{v}_1}{||\mathbf{v}_1||}$$

Step 2: Subtract projection on e1

$$\mathbf{u}_2 = \mathbf{v}_2 - (\mathbf{v}_2 \cdot \mathbf{e}_1)\mathbf{e}_1$$
$$\mathbf{e}_2 = \frac{\mathbf{u}_2}{||\mathbf{u}_2||}$$

Step k: Subtract all previous projections

$$\mathbf{u}_k = \mathbf{v}_k - \sum_{j=1}^{k-1} (\mathbf{v}_k \cdot \mathbf{e}_j)\mathbf{e}_j$$
$$\mathbf{e}_k = \frac{\mathbf{u}_k}{||\mathbf{u}_k||}$$
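
A minimal implementation of the algorithm as written above (classical Gram-Schmidt; production code usually prefers the modified variant or a library QR routine for numerical stability):

```python
import numpy as np

def gram_schmidt(vectors):
    """Turn linearly independent vectors into an orthonormal basis."""
    basis = []
    for v in vectors:
        u = v.astype(float)
        # Step k: subtract the projections onto all previous basis vectors
        for e in basis:
            u = u - np.dot(v, e) * e
        basis.append(u / np.linalg.norm(u))   # normalize
    return basis

vs = [np.array([1.0, 1.0, 0.0]),
      np.array([1.0, 0.0, 1.0]),
      np.array([0.0, 1.0, 1.0])]
E = np.vstack(gram_schmidt(vs))
print(np.allclose(E @ E.T, np.eye(3)))   # True: the rows are orthonormal
```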

Interactive: Gram-Schmidt

Watch step-by-step how orthogonalization happens.

QR Decomposition

Matrix form of Gram-Schmidt: A = QR.

$$A = QR$$
  • Q: Orthogonal matrix (its columns are the orthonormal vectors $\mathbf{e}_i$).
  • R: Upper triangular matrix (its entries are the dot products $\mathbf{v}_k \cdot \mathbf{e}_j$, with the norms $||\mathbf{u}_k||$ on the diagonal).

Used for solving least squares ($R\mathbf{x} = Q^T \mathbf{b}$) and for finding eigenvalues (the QR algorithm).
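
A sketch of this in NumPy on a synthetic regression problem: np.linalg.qr supplies Q and R, and the least-squares solution comes from solving $R\mathbf{x} = Q^T \mathbf{b}$.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(100, 3))                               # tall data matrix
b = A @ np.array([2.0, -1.0, 0.5]) + 0.01 * rng.normal(size=100)

# Reduced QR: A = Q R with orthonormal columns in Q and upper-triangular R
Q, R = np.linalg.qr(A)

# Least squares: solve R x = Q^T b (a dedicated triangular solver could exploit R's structure)
x = np.linalg.solve(R, Q.T @ b)

print(x)                                                     # close to [2, -1, 0.5]
print(np.allclose(x, np.linalg.lstsq(A, b, rcond=None)[0]))  # True
```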

ML Applications

Orthogonal Weight Initialization

Initializing RNN/LSTM recurrent weights as orthogonal matrices helps prevent vanishing/exploding gradients because $|\lambda| = 1$, preserving signal magnitude over time.
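
A sketch of such an initializer using the QR decomposition of a random Gaussian matrix (deep-learning frameworks ship built-in orthogonal initializers; the helper orthogonal_init below is just illustrative):

```python
import numpy as np

def orthogonal_init(n, gain=1.0, rng=None):
    """Random n x n orthogonal matrix, e.g. for a recurrent weight matrix."""
    rng = np.random.default_rng() if rng is None else rng
    Q, R = np.linalg.qr(rng.normal(size=(n, n)))
    Q = Q * np.sign(np.diag(R))      # sign correction so the distribution is uniform (Haar)
    return gain * Q

W = orthogonal_init(256, rng=np.random.default_rng(0))
print(np.allclose(W.T @ W, np.eye(256)))          # True
print(np.abs(np.linalg.eigvals(W)).min(),
      np.abs(np.linalg.eigvals(W)).max())         # both ~ 1.0
```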

PCA & Decorrelation

PCA finds orthogonal directions of maximum variance. Projecting onto them decorrelates the data (and, after rescaling, whitens it), making downstream learning easier for models.
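
A sketch of the decorrelation effect using SVD-based PCA on synthetic correlated data:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(500, 2)) @ np.array([[2.0, 1.2],
                                          [0.0, 0.8]])   # correlated features
X = X - X.mean(axis=0)                                    # center

# The rows of Vt (columns of V) are the orthogonal principal directions
U, S, Vt = np.linalg.svd(X, full_matrices=False)
Z = X @ Vt.T                                              # project onto them

print(np.round(np.cov(X, rowvar=False), 2))   # off-diagonal entries are non-zero
print(np.round(np.cov(Z, rowvar=False), 2))   # ~diagonal: decorrelated
```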

Orthogonal Regularization

Adding a loss term $||W^T W - I||^2$ encourages weights to remain orthogonal during training, improving stability in GANs.
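
A sketch of the penalty itself in NumPy (in an actual training loop the same expression would be written with framework tensors so gradients can flow through it):

```python
import numpy as np

def orthogonal_penalty(W):
    """Soft orthogonality regularizer: ||W^T W - I||^2 (squared Frobenius norm)."""
    k = W.shape[1]
    diff = W.T @ W - np.eye(k)
    return np.sum(diff ** 2)

rng = np.random.default_rng(3)
W_random = rng.normal(size=(64, 32)) / np.sqrt(64)       # ordinary random init
W_orth, _ = np.linalg.qr(rng.normal(size=(64, 32)))      # orthonormal columns

print(orthogonal_penalty(W_random))   # noticeably greater than 0
print(orthogonal_penalty(W_orth))     # ~ 0
```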