Linear Algebra

Projections & Least Squares

Finding the best approximation when exact solutions do not exist.

The Geometry of Fitting

Most real-world systems are overdetermined: we have more equations than unknowns, so the system $Ax = b$ has no exact solution. The data is too noisy, the model is too simple, or both.

Least Squares asks: "If we cannot hit $b$ exactly, what is the closest we can get?" The answer is the projection of $b$ onto the column space of $A$.

The Core Insight

Linear regression, polynomial fitting, and even deep learning (in parts) all reduce to projections. The error is minimized when it is perpendicular to the subspace of possible outputs.

Vector Projection

Before matrices, let us project a single vector $a$ onto another vector $b$.

$$\text{proj}_b(a) = \frac{a \cdot b}{b \cdot b}\, b = \frac{a \cdot b}{\|b\|^2}\, b$$

The scalar $\frac{a \cdot b}{b \cdot b}$ tells us how far along $b$ the projection lands.
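A minimal NumPy sketch of this formula, with arbitrarily chosen example vectors:

```python
import numpy as np

# Project a onto b using the formula above (example vectors are arbitrary).
a = np.array([3.0, 4.0])
b = np.array([2.0, 0.0])

coeff = np.dot(a, b) / np.dot(b, b)   # scalar (a.b)/(b.b): how far along b we land
proj = coeff * b                      # the projection of a onto b
error = a - proj                      # the residual

print(proj)                           # [3. 0.]
print(np.dot(error, b))               # ~0: the error is perpendicular to b
```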

[Interactive figure: Vector Projection — visualizing $\text{proj}_b(a)$ and the orthogonal error vector, with the target vector $a$, the base vector $b$, and readouts for the scalar projection, $\|a\|$, $\|\text{proj}\|$, and $\|\text{error}\|$.]

The Projection

The component of $a$ that lies along $b$. This is what we keep.

The Error

$e = a - \text{proj}_b(a)$. The residual, always perpendicular to $b$.

Interactive: Fitting Lines

In coordinate space, "perpendicular error" translates to minimizing the sum of squared vertical distances. Drag the points to see how the optimal line shifts to keep the residuals orthogonal to the feature space.

[Interactive plot: draggable data points with the fitted line (here $\hat{y} = 0.56x + 1.39$), an Orthogonality Check panel, and a Total Squared Error readout.]

The orthogonality check confirms that the error vector $e$ is orthogonal to the feature space (the column space of $X$): at the optimum, both $e \cdot \mathbf{1}$ (bias) and $e \cdot x$ (feature) are zero, and the readout shows the total squared error.

Visualize this as the total area of all the squares in the plot. Least Squares finds the minimum possible area.
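Here is a quick numerical version of the same picture, using a handful of made-up points (the values are illustrative, not taken from the widget):

```python
import numpy as np

# Fit a line by least squares and verify the orthogonality conditions:
# e . 1 = 0 (bias column) and e . x = 0 (feature column) at the optimum.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.2, 2.3, 2.1, 3.4, 3.9])

X = np.column_stack([np.ones_like(x), x])      # design matrix: [bias, feature]
coef, *_ = np.linalg.lstsq(X, y, rcond=None)   # [intercept, slope]
e = y - X @ coef                               # residual vector

print(coef)            # fitted intercept and slope
print(X.T @ e)         # both entries ~0: e is orthogonal to the column space
print(np.sum(e**2))    # total squared error (the "area of the squares")
```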

Column Space View

For a matrix $A$, the column space $C(A)$ is the set of all possible outputs $Ax$. If $b$ is not in $C(A)$, we project it.

  1. The target $b$ lives in $\mathbb{R}^m$ ($m$ data points).
  2. The column space $C(A)$ is a subspace of $\mathbb{R}^m$ (spanned by the $n$ feature columns).
  3. We find the point $\hat{b} \in C(A)$ closest to $b$.
  4. The error $e = b - \hat{b}$ is perpendicular to $C(A)$.

Key Geometric Fact

The shortest distance from a point to a subspace is measured along the perpendicular. This is why $A^T e = 0$ (the error is orthogonal to every column of $A$).
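A small sketch of this fact, using a random example system: project $b$ onto $C(A)$ and check that the error is orthogonal to every column.

```python
import numpy as np

# Project b onto the column space of A and verify A^T e = 0 (random example data).
rng = np.random.default_rng(4)
A = rng.normal(size=(6, 2))     # C(A) is a 2D plane inside R^6
b = rng.normal(size=6)

x_hat, *_ = np.linalg.lstsq(A, b, rcond=None)
b_hat = A @ x_hat               # the closest point to b inside C(A)
e = b - b_hat                   # the perpendicular error

print(A.T @ e)                  # ~[0, 0]: e is orthogonal to every column of A
```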

The Normal Equations

From the perpendicularity condition $A^T(b - Ax) = 0$, we simplify to find the best weights $\hat{x}$:

$$A^T A \hat{x} = A^T b$$

Solution: $\hat{x} = (A^T A)^{-1} A^T b$
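A minimal sketch of forming and solving the normal equations on a small random overdetermined system; numpy.linalg.lstsq agrees with the result:

```python
import numpy as np

# Solve A^T A x = A^T b directly for a small random overdetermined system.
rng = np.random.default_rng(0)
A = rng.normal(size=(10, 3))
b = rng.normal(size=10)

x_hat = np.linalg.solve(A.T @ A, A.T @ b)                        # normal equations
print(np.allclose(x_hat, np.linalg.lstsq(A, b, rcond=None)[0]))  # True
```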

The Pseudoinverse

The matrix $A^+ = (A^T A)^{-1} A^T$ is called the Moore–Penrose pseudoinverse (this formula applies when $A$ has full column rank). It gives the least squares solution even when $A$ is not square.
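A quick check of this identity, assuming a random tall matrix with full column rank:

```python
import numpy as np

# The explicit formula (A^T A)^{-1} A^T matches np.linalg.pinv when A has
# full column rank (a random tall Gaussian matrix has this almost surely).
rng = np.random.default_rng(1)
A = rng.normal(size=(8, 3))
b = rng.normal(size=8)

A_plus = np.linalg.inv(A.T @ A) @ A.T            # explicit formula from the text
print(np.allclose(A_plus, np.linalg.pinv(A)))    # True
print(A_plus @ b)                                # the least squares solution x_hat
```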

Case Study: Bulb Lifespan Prediction

The Problem

You have 100 bulbs with features (voltage, temperature) and lifespan measurements. You want to fit a linear model: Lifespan = w₁×Voltage + w₂×Temperature + w₀.

The Setup

Design matrix $A$ is 100×3 (100 samples, 3 features including bias). Target $b$ is 100×1. We solve for $w$ (3×1).

$$\hat{w} = (A^T A)^{-1} A^T b$$

The Geometric View

The column space of $A$ is a 3D subspace in 100D space. The vector $b$ (lifespans) is projected onto this subspace. The residuals $e = b - A\hat{w}$ are the prediction errors, perpendicular to all features.
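A sketch of this setup with synthetic data; the generating weights and noise level below are invented purely for illustration:

```python
import numpy as np

# Synthetic bulb data: lifespan generated from made-up weights plus noise.
rng = np.random.default_rng(42)
voltage = rng.uniform(110, 130, size=100)
temperature = rng.uniform(20, 60, size=100)
lifespan = 2000 - 5.0 * voltage - 8.0 * temperature + rng.normal(0, 25, size=100)

# Design matrix A: 100x3 = [Voltage, Temperature, bias column]
A = np.column_stack([voltage, temperature, np.ones(100)])
w_hat, *_ = np.linalg.lstsq(A, lifespan, rcond=None)   # [w1, w2, w0]

e = lifespan - A @ w_hat
print(w_hat)      # recovered weights, close to the ones used to generate the data
print(A.T @ e)    # ~0: residuals are perpendicular to all three feature columns
```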

QR Decomposition Solution

Directly computing $(A^T A)^{-1}$ is numerically unstable. In practice, we use QR decomposition.

  1. Decompose $A = QR$, where $Q$ has orthonormal columns and $R$ is upper triangular.
  2. Substitute: $(QR)^T(QR)\hat{x} = (QR)^T b$
  3. Simplify: $R^T Q^T Q R \hat{x} = R^T Q^T b$
  4. Since $Q^T Q = I$: $R^T R \hat{x} = R^T Q^T b$
  5. Result: $R \hat{x} = Q^T b$ (cancel the invertible $R^T$, then solve by back substitution)

Numerical Advantage: QR avoids squaring the condition number of $A$. This is why library routines such as numpy.linalg.lstsq solve least squares through orthogonal factorizations (QR or SVD) rather than forming $A^T A$ explicitly.
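A sketch of the QR route on random data. scipy.linalg.solve_triangular is used here for the back-substitution step; np.linalg.solve would also work but ignores the triangular structure.

```python
import numpy as np
from scipy.linalg import solve_triangular

# Factor A = QR, then back-substitute R x = Q^T b (random example system).
rng = np.random.default_rng(7)
A = rng.normal(size=(100, 3))
b = rng.normal(size=100)

Q, R = np.linalg.qr(A)                 # reduced QR: Q is 100x3, R is 3x3
x_hat = solve_triangular(R, Q.T @ b)   # back substitution on upper-triangular R

print(np.allclose(x_hat, np.linalg.lstsq(A, b, rcond=None)[0]))  # True
```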

ML Applications

Linear Regression

The closed-form solution $w = (X^TX)^{-1}X^Ty$ is exactly the normal equations. Gradient descent converges to the same point.
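A small demonstration on synthetic data, with a hand-picked step size and iteration count, that plain gradient descent on the squared error reaches the closed-form solution:

```python
import numpy as np

# Gradient descent on the mean squared error vs. the normal-equations solution.
rng = np.random.default_rng(3)
X = rng.normal(size=(50, 2))
y = X @ np.array([1.5, -2.0]) + rng.normal(0, 0.1, size=50)

w_closed = np.linalg.solve(X.T @ X, X.T @ y)     # closed form (normal equations)

w = np.zeros(2)
lr = 0.1
for _ in range(2000):
    grad = 2 * X.T @ (X @ w - y) / len(y)        # gradient of mean(||Xw - y||^2)
    w -= lr * grad

print(np.allclose(w, w_closed))                  # True: same minimizer
```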

Ridge Regression

Add regularization: $w = (X^TX + \lambda I)^{-1}X^Ty$. The $\lambda I$ term makes the matrix invertible and shrinks coefficients.
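A sketch comparing the ridge and ordinary least squares solutions on random data, illustrating the shrinkage:

```python
import numpy as np

# Ridge regression's closed form: add lambda*I to X^T X before solving.
rng = np.random.default_rng(5)
X = rng.normal(size=(30, 4))
y = rng.normal(size=30)
lam = 1.0

w_ols = np.linalg.solve(X.T @ X, X.T @ y)
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

print(np.linalg.norm(w_ridge) < np.linalg.norm(w_ols))   # True: coefficients shrink
```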

PCA via SVD

PCA finds the subspace that best approximates the data (least reconstruction error). This is a projection problem solved via SVD.
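A minimal sketch of PCA as a projection: center the data, take the top-$k$ right singular vectors from the SVD, project, and measure the reconstruction error.

```python
import numpy as np

# PCA via SVD: project centered data onto its top-k principal directions.
rng = np.random.default_rng(11)
X = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))   # correlated features
Xc = X - X.mean(axis=0)                                   # center the data

U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
k = 2
V_k = Vt[:k].T                      # top-k principal directions (5 x k)
X_proj = Xc @ V_k @ V_k.T           # orthogonal projection onto the k-dim subspace

print(np.linalg.norm(Xc - X_proj) ** 2)   # minimal reconstruction error for rank k
print(S**2 / (len(Xc) - 1))               # variance captured by each component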

Kernel Methods

Kernel Ridge Regression projects data into a high dimensional feature space and applies least squares there (via the kernel trick).
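A compact sketch of kernel ridge regression with an RBF kernel; the kernel function and hyperparameters below are illustrative choices, not a specific library API. Least squares in the implicit feature space reduces to solving $(K + \lambda I)\alpha = y$ in the dual.

```python
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    # Pairwise squared distances, then the Gaussian (RBF) kernel.
    d2 = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2 * A @ B.T
    return np.exp(-gamma * d2)

# Toy 1D regression problem: noisy sine curve.
rng = np.random.default_rng(2)
X = rng.uniform(-3, 3, size=(40, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.1, size=40)

lam = 0.1
K = rbf_kernel(X, X)
alpha = np.linalg.solve(K + lam * np.eye(len(X)), y)   # dual weights

X_test = np.linspace(-3, 3, 5).reshape(-1, 1)
y_pred = rbf_kernel(X_test, X) @ alpha                 # predictions at test points
print(y_pred)
```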