Linear Algebra

Orthogonality

Independence, non-overlapping information, and stability in Deep Learning.

Introduction

"Orthogonal" comes from Greek: orthos (right, correct) + gonia (angle). In everyday language, it means "perpendicular" or "at a right angle." In mathematics and machine learning, orthogonality represents a deeper concept: complete independence.

Two orthogonal vectors share absolutely no common direction. In the context of data, if two features are orthogonal, knowing one tells you nothing about the other. They provide completely unique, non-redundant information. This is why orthogonality is sometimes called the "Holy Grail" of feature engineering.

Why Orthogonality Matters in ML

  • Feature Independence: Orthogonal features maximize information content per feature.
  • Numerical Stability: Orthogonal matrices are perfectly conditioned (condition number = 1).
  • Gradient Flow: Orthogonal weight matrices preserve gradient magnitudes in deep networks.
  • Efficient Computation: Inverting orthogonal matrices is trivial (just transpose).

Orthogonality is the mathematical engine behind algorithms like PCA, SVD, QR decomposition, and orthogonal weight initialization in neural networks.

Orthogonal Vectors

Two vectors $\mathbf{u}$ and $\mathbf{v}$ in $\mathbb{R}^n$ are orthogonal (written $\mathbf{u} \perp \mathbf{v}$) if their inner product (dot product) is zero:

$$\mathbf{u} \cdot \mathbf{v} = \mathbf{u}^T \mathbf{v} = \sum_{i=1}^{n} u_i v_i = 0$$

This definition extends to any finite dimension n, not just 2D or 3D.

Example: Verifying Orthogonality

Let $\mathbf{u} = [1, 2, 3]$ and $\mathbf{v} = [1, 1, -1]$. Are they orthogonal?

$$\mathbf{u} \cdot \mathbf{v} = (1)(1) + (2)(1) + (3)(-1) = 1 + 2 - 3 = 0$$

Yes, u and v are orthogonal!
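
As a quick sanity check, the same computation in NumPy (a minimal sketch; the helper name is_orthogonal is just for illustration):

```python
import numpy as np

def is_orthogonal(u, v, tol=1e-10):
    """Two vectors are orthogonal when their dot product is (numerically) zero."""
    return abs(np.dot(u, v)) < tol

u = np.array([1.0, 2.0, 3.0])
v = np.array([1.0, 1.0, -1.0])

print(np.dot(u, v))         # 0.0
print(is_orthogonal(u, v))  # True
```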

The Zero Vector

The zero vector $\mathbf{0}$ is orthogonal to every vector (including itself), since $\mathbf{0} \cdot \mathbf{v} = 0$ for any $\mathbf{v}$. However, when we talk about orthogonal sets or bases, we typically exclude the zero vector.

Orthogonal Sets

A set of vectors $\{\mathbf{v}_1, \mathbf{v}_2, \ldots, \mathbf{v}_k\}$ is an orthogonal set if every pair is orthogonal:

$$\mathbf{v}_i \cdot \mathbf{v}_j = 0 \quad \text{for all } i \neq j$$

Key Theorem: An orthogonal set of non-zero vectors is always linearly independent. This makes orthogonal vectors extremely useful as basis vectors.
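
A small numerical illustration of the theorem, assuming NumPy: the pairwise dot products of an orthogonal set vanish, and stacking the vectors gives a full-rank (hence linearly independent) matrix.

```python
import numpy as np

# An orthogonal set in R^3 (each pair has dot product 0)
vectors = [np.array([1.0, 2.0, 3.0]),
           np.array([1.0, 1.0, -1.0]),
           np.array([-5.0, 4.0, -1.0])]

for i in range(len(vectors)):
    for j in range(i + 1, len(vectors)):
        print(i, j, np.dot(vectors[i], vectors[j]))  # all 0.0

# Full rank => linearly independent
print(np.linalg.matrix_rank(np.vstack(vectors)))     # 3
```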

Geometric Intuition

The dot product has a beautiful geometric interpretation that explains why orthogonal vectors have dot product zero:

$$\mathbf{u} \cdot \mathbf{v} = ||\mathbf{u}|| \cdot ||\mathbf{v}|| \cdot \cos(\theta)$$

When $\theta = 90°$, $\cos(90°) = 0$, so the dot product is zero.
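
A sketch of recovering the angle from the dot product using this formula (the helper angle_deg is illustrative):

```python
import numpy as np

def angle_deg(u, v):
    """Angle between u and v in degrees, via cos(theta) = u.v / (||u|| ||v||)."""
    cos_theta = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    return np.degrees(np.arccos(np.clip(cos_theta, -1.0, 1.0)))

print(angle_deg(np.array([1.0, 2.0, 3.0]), np.array([1.0, 1.0, -1.0])))  # 90.0
print(angle_deg(np.array([1.0, 0.0]), np.array([1.0, 1.0])))             # ~45.0
```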

Projection Interpretation

The dot product measures "how much of u lies in the direction of v."

$$\text{proj}_{\mathbf{v}}(\mathbf{u}) = \frac{\mathbf{u} \cdot \mathbf{v}}{||\mathbf{v}||^2} \mathbf{v}$$

If the dot product is zero, the projection is the zero vector: the vectors share no common direction.
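
The projection formula translates directly into code; a minimal sketch (proj_onto is an illustrative name):

```python
import numpy as np

def proj_onto(u, v):
    """Projection of u onto the direction of v: (u.v / ||v||^2) v."""
    return (np.dot(u, v) / np.dot(v, v)) * v

u = np.array([1.0, 2.0, 3.0])
v = np.array([1.0, 1.0, -1.0])

print(proj_onto(u, v))                          # [0. 0. 0.] -- orthogonal, no shared direction
print(proj_onto(u, np.array([1.0, 0.0, 0.0])))  # [1. 0. 0.]
```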

Pythagorean Theorem

For orthogonal vectors, the Pythagorean theorem holds:

$$||\mathbf{u} + \mathbf{v}||^2 = ||\mathbf{u}||^2 + ||\mathbf{v}||^2$$

This is because in the expansion $||\mathbf{u} + \mathbf{v}||^2 = ||\mathbf{u}||^2 + 2(\mathbf{u} \cdot \mathbf{v}) + ||\mathbf{v}||^2$, the cross-term $2(\mathbf{u} \cdot \mathbf{v})$ vanishes.
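
A quick numerical check of the identity for the orthogonal pair used earlier:

```python
import numpy as np

u = np.array([1.0, 2.0, 3.0])
v = np.array([1.0, 1.0, -1.0])

lhs = np.linalg.norm(u + v) ** 2
rhs = np.linalg.norm(u) ** 2 + np.linalg.norm(v) ** 2
print(lhs, rhs, np.isclose(lhs, rhs))   # 17.0 17.0 True
```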

Interactive: Projection & Orthogonality

Adjust the angle of vector u to see its projection onto v. The vector u splits into a parallel component (the projection onto v) and a perpendicular component (the rejection, u minus the projection). When the angle is 90° (or 270°), the dot product is zero, the projection vanishes, and the vectors are orthogonal.

Orthonormal Bases

An orthonormal set is an orthogonal set where every vector also has unit length (norm = 1). This is the "gold standard" for coordinate systems.

1. Orthogonal

$\mathbf{e}_i \cdot \mathbf{e}_j = 0$ (if $i \neq j$)

2. Normalized

$||\mathbf{e}_i|| = 1$ (unit length)

Combined using the Kronecker delta:

$$\mathbf{e}_i \cdot \mathbf{e}_j = \delta_{ij}$$

Why Orthonormal Bases are Powerful

In an orthonormal basis, finding coordinates is trivial. You don't need to solve a system of equations; you just take dot products!

$$\mathbf{x} = \sum_{i=1}^{n} (\mathbf{x} \cdot \mathbf{e}_i) \mathbf{e}_i$$

The coordinate for $\mathbf{e}_i$ is just $\mathbf{x} \cdot \mathbf{e}_i$. This is why Fourier series and wavelet transforms (which use orthonormal bases) are computationally feasible.
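
A sketch of this in NumPy, using a rotated 2D basis: the coordinates are plain dot products, and the sum reconstructs the original vector.

```python
import numpy as np

theta = np.pi / 6
# Orthonormal basis: the columns of a 2D rotation matrix
e1 = np.array([np.cos(theta), np.sin(theta)])
e2 = np.array([-np.sin(theta), np.cos(theta)])

x = np.array([2.0, 1.0])

# Coordinates are just dot products -- no linear system to solve
c1, c2 = np.dot(x, e1), np.dot(x, e2)

# Reconstruction: x = sum_i (x . e_i) e_i
print(c1 * e1 + c2 * e2)   # [2. 1.]
```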

Orthogonal Matrices

An orthogonal matrix is a square matrix Q whose columns form an orthonormal set. (The name is confusing; "orthonormal matrix" would be more accurate.)

Defining Property

$$Q^T Q = Q Q^T = I$$

which implies $Q^{-1} = Q^T$.

Examples

Rotation Matrix

$$\begin{bmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{bmatrix}$$

Reflection Matrix

$$\begin{bmatrix} 1 & 0 \\ 0 & -1 \end{bmatrix}$$

Permutation Matrix

$$\begin{bmatrix} 0 & 1 \\ 1 & 0 \end{bmatrix}$$

Identity Matrix

$$\begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}$$
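
Each of these examples can be checked against the defining property $Q^T Q = I$; a minimal sketch:

```python
import numpy as np

theta = 0.7
examples = {
    "rotation":    np.array([[np.cos(theta), -np.sin(theta)],
                             [np.sin(theta),  np.cos(theta)]]),
    "reflection":  np.array([[1.0, 0.0], [0.0, -1.0]]),
    "permutation": np.array([[0.0, 1.0], [1.0, 0.0]]),
    "identity":    np.eye(2),
}

for name, Q in examples.items():
    print(name, np.allclose(Q.T @ Q, np.eye(2)))   # True for every example
```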

Properties & Proofs

1. Isometry (Length Preserving)

Multiplying by Q does not change length.

$$||Q\mathbf{x}|| = ||\mathbf{x}||$$

2. Angle Preserving

Dot products are preserved.

$$(Q\mathbf{x}) \cdot (Q\mathbf{y}) = \mathbf{x} \cdot \mathbf{y}$$

3. Determinant

Determinant is always ±1.

$$\det(Q) = \pm 1$$

4. Eigenvalues

All eigenvalues lie on the complex unit circle.

$$|\lambda| = 1$$

Computational Advantage

Inverting a general matrix is O(n³). Inverting an orthogonal matrix is O(n²) (just transpose!). Also, condition number = 1 means perfect numerical stability.
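
These properties are easy to verify numerically on a random orthogonal matrix, here obtained from the QR decomposition of a random Gaussian matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.normal(size=(5, 5)))   # Q has orthonormal columns

x = rng.normal(size=5)

print(np.isclose(np.linalg.norm(Q @ x), np.linalg.norm(x)))   # isometry
print(np.isclose(abs(np.linalg.det(Q)), 1.0))                 # det = +/- 1
print(np.allclose(np.abs(np.linalg.eigvals(Q)), 1.0))         # |lambda| = 1
print(np.isclose(np.linalg.cond(Q), 1.0))                     # condition number 1
print(np.allclose(np.linalg.inv(Q), Q.T))                     # inverse is the transpose
```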

Gram-Schmidt Process

The Gram-Schmidt process transforms any linearly independent vectors into an orthonormal basis for the same space. It works by iteratively subtracting the projection onto previous vectors.

The Algorithm

Step 1: Normalize first vector

$$\mathbf{e}_1 = \frac{\mathbf{v}_1}{||\mathbf{v}_1||}$$

Step 2: Subtract projection on e1

$$\mathbf{u}_2 = \mathbf{v}_2 - (\mathbf{v}_2 \cdot \mathbf{e}_1)\mathbf{e}_1$$
$$\mathbf{e}_2 = \frac{\mathbf{u}_2}{||\mathbf{u}_2||}$$

Step k: Subtract all previous projections

$$\mathbf{u}_k = \mathbf{v}_k - \sum_{j=1}^{k-1} (\mathbf{v}_k \cdot \mathbf{e}_j)\mathbf{e}_j$$
$$\mathbf{e}_k = \frac{\mathbf{u}_k}{||\mathbf{u}_k||}$$
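
A minimal implementation of the algorithm as written above (classical Gram-Schmidt; production code usually prefers the modified variant or a library QR routine for numerical stability):

```python
import numpy as np

def gram_schmidt(vectors):
    """Turn linearly independent vectors into an orthonormal basis."""
    basis = []
    for v in vectors:
        u = v.astype(float)
        # Step k: subtract the projections onto all previous basis vectors
        for e in basis:
            u = u - np.dot(v, e) * e
        basis.append(u / np.linalg.norm(u))   # normalize
    return basis

vs = [np.array([1.0, 1.0, 0.0]),
      np.array([1.0, 0.0, 1.0]),
      np.array([0.0, 1.0, 1.0])]
E = np.vstack(gram_schmidt(vs))
print(np.allclose(E @ E.T, np.eye(3)))   # True: the rows are orthonormal
```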

Interactive: Gram-Schmidt

Watch step-by-step how orthogonalization happens.

QR Decomposition

Matrix form of Gram-Schmidt: A = QR.

$$A = QR$$
  • Q: Orthogonal matrix (its columns are the orthonormal vectors $\mathbf{e}_i$).
  • R: Upper triangular matrix (its entries are the dot products $\mathbf{v}_k \cdot \mathbf{e}_j$, with the norms $||\mathbf{u}_k||$ on the diagonal).

Used for solving least squares ($R\mathbf{x} = Q^T \mathbf{b}$) and for finding eigenvalues (the QR algorithm).
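
A sketch of this in NumPy on a synthetic regression problem: np.linalg.qr supplies Q and R, and the least-squares solution comes from solving $R\mathbf{x} = Q^T \mathbf{b}$.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(100, 3))                               # tall data matrix
b = A @ np.array([2.0, -1.0, 0.5]) + 0.01 * rng.normal(size=100)

# Reduced QR: A = Q R with orthonormal columns in Q and upper-triangular R
Q, R = np.linalg.qr(A)

# Least squares: solve R x = Q^T b (a dedicated triangular solver could exploit R's structure)
x = np.linalg.solve(R, Q.T @ b)

print(x)                                                     # close to [2, -1, 0.5]
print(np.allclose(x, np.linalg.lstsq(A, b, rcond=None)[0]))  # True
```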

ML Applications

Orthogonal Weight Initialization

Initializing RNN/LSTM recurrent weights as orthogonal matrices helps prevent vanishing/exploding gradients because $|\lambda| = 1$, preserving signal magnitude over time.
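
A sketch of such an initializer using the QR decomposition of a random Gaussian matrix (deep-learning frameworks ship built-in orthogonal initializers; the helper orthogonal_init below is just illustrative):

```python
import numpy as np

def orthogonal_init(n, gain=1.0, rng=None):
    """Random n x n orthogonal matrix, e.g. for a recurrent weight matrix."""
    rng = np.random.default_rng() if rng is None else rng
    Q, R = np.linalg.qr(rng.normal(size=(n, n)))
    Q = Q * np.sign(np.diag(R))      # sign correction so the distribution is uniform (Haar)
    return gain * Q

W = orthogonal_init(256, rng=np.random.default_rng(0))
print(np.allclose(W.T @ W, np.eye(256)))          # True
print(np.abs(np.linalg.eigvals(W)).min(),
      np.abs(np.linalg.eigvals(W)).max())         # both ~ 1.0
```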

PCA & Decorrelation

PCA finds orthogonal directions of maximum variance. Projecting onto them decorrelates the data (and, after rescaling, whitens it), making downstream learning easier for models.
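
A sketch of the decorrelation effect using SVD-based PCA on synthetic correlated data:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(500, 2)) @ np.array([[2.0, 1.2],
                                          [0.0, 0.8]])   # correlated features
X = X - X.mean(axis=0)                                    # center

# The rows of Vt (columns of V) are the orthogonal principal directions
U, S, Vt = np.linalg.svd(X, full_matrices=False)
Z = X @ Vt.T                                              # project onto them

print(np.round(np.cov(X, rowvar=False), 2))   # off-diagonal entries are non-zero
print(np.round(np.cov(Z, rowvar=False), 2))   # ~diagonal: decorrelated
```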

Orthogonal Regularization

Adding a loss term $||W^T W - I||^2$ encourages weights to remain orthogonal during training, improving stability in GANs.
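
A sketch of the penalty itself in NumPy (in an actual training loop the same expression would be written with framework tensors so gradients can flow through it):

```python
import numpy as np

def orthogonal_penalty(W):
    """Soft orthogonality regularizer: ||W^T W - I||^2 (squared Frobenius norm)."""
    k = W.shape[1]
    diff = W.T @ W - np.eye(k)
    return np.sum(diff ** 2)

rng = np.random.default_rng(3)
W_random = rng.normal(size=(64, 32)) / np.sqrt(64)       # ordinary random init
W_orth, _ = np.linalg.qr(rng.normal(size=(64, 32)))      # orthonormal columns

print(orthogonal_penalty(W_random))   # noticeably greater than 0
print(orthogonal_penalty(W_orth))     # ~ 0
```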