The Geometry of Fitting
Most real-world systems are overdetermined: we have more equations than unknowns, so the system Xw = y has no exact solution. The data is too noisy, the model is too simple, or both.
Least Squares asks: "If we cannot hit y exactly, what is the closest we can get?" The answer is the projection of y onto the column space of X.
The Core Insight
Linear regression, polynomial fitting, and even parts of deep learning all reduce to projections. The error is minimized when it is perpendicular to the subspace of possible outputs.
Vector Projection
Before matrices, let us project a single vector a onto another vector b.
Figure: visualizing proj_b(a), the projection of the target vector a onto the base vector b, together with the orthogonal error vector.
The Projection
The component of a that lies along b: proj_b(a) = (a·b / b·b) b. This is what we keep.
The Error
e = a − proj_b(a). The residual, always perpendicular to b.
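As a minimal NumPy sketch (the two vectors below are made-up examples), the projection and its orthogonal error can be computed directly:

```python
import numpy as np

a = np.array([3.0, 4.0])   # target vector a
b = np.array([5.0, 0.0])   # base vector b

# proj_b(a) = (a.b / b.b) * b : the component of a along b
proj = (a @ b) / (b @ b) * b

# The error is what is left over; it is perpendicular to b
e = a - proj

print(proj)    # [3. 0.]
print(e)       # [0. 4.]
print(e @ b)   # 0.0: the residual is orthogonal to b
```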
Interactive: Fitting Lines
In coordinate space, "perpendicular error" translates to minimizing the sum of squared vertical distances. Drag the points to see how the optimal line shifts to keep the residuals orthogonal to the feature space.
Orthogonality Check
The error vector e = y − Xŵ must be orthogonal to the feature space (the column space of X).
Total Squared Error
Visualize this as the total area of all the squares in the plot. Least Squares finds the minimum possible area.
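Here is a small sketch of that fit with NumPy; the five data points are invented for illustration:

```python
import numpy as np

# Invented data points (x, y)
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 1.9, 3.2, 3.8, 5.1])

# Design matrix with a bias column: each row is [1, x_i]
X = np.column_stack([np.ones_like(x), x])

# Least squares line: minimizes the sum of squared vertical distances
w, _, _, _ = np.linalg.lstsq(X, y, rcond=None)

residuals = y - X @ w
print(w)                      # [intercept, slope]
print(X.T @ residuals)        # ~[0, 0]: residuals orthogonal to the column space of X
print(np.sum(residuals**2))   # total squared error (the total "area of the squares")
```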
Column Space View
For a matrix X, the column space C(X) is the set of all possible outputs Xw. If y is not in C(X), we project it.
- The target y lives in ℝᵐ (m data points).
- The column space C(X) is a subspace of ℝᵐ (spanned by the n feature columns).
- We find the point ŷ = Xŵ in C(X) closest to y.
- The error e = y − ŷ is perpendicular to C(X).
Key Geometric Fact
The shortest distance from a point to a subspace is measured along the perpendicular. This is why Xᵀ(y − Xŵ) = 0 (the error is orthogonal to every column of X).
The Normal Equations
From the perpendicularity condition Xᵀ(y − Xŵ) = 0, we rearrange to find the best weights ŵ: XᵀX ŵ = Xᵀy.
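A sketch of solving the normal equations on made-up data (np.linalg.solve stands in for explicitly inverting XᵀX):

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up overdetermined system: 50 samples, 2 features
X = rng.normal(size=(50, 2))
y = X @ np.array([2.0, -1.0]) + 0.1 * rng.normal(size=50)

# Normal equations: (X^T X) w = X^T y
w = np.linalg.solve(X.T @ X, X.T @ y)

# The defining property: the error is orthogonal to every column of X
e = y - X @ w
print(w)         # close to [2, -1]
print(X.T @ e)   # ~[0, 0]
```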
The Pseudoinverse
The matrix X⁺ = (XᵀX)⁻¹Xᵀ is called the Moore-Penrose pseudoinverse (for X with full column rank). It gives the least squares solution ŵ = X⁺y even when X is not square.
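NumPy exposes this as np.linalg.pinv; a quick check on made-up data that it matches the least squares solver:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(20, 3))   # tall, non-square matrix
y = rng.normal(size=20)

w_pinv  = np.linalg.pinv(X) @ y                  # pseudoinverse solution
w_lstsq = np.linalg.lstsq(X, y, rcond=None)[0]   # least squares solver

print(np.allclose(w_pinv, w_lstsq))   # True: both give the least squares weights
```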
Case Study: Bulb Lifespan Prediction
The Problem
You have 100 bulbs with features (voltage, temperature) and lifespan measurements. You want to fit a linear model: Lifespan = w₁×Voltage + w₂×Temperature + w₀.
The Setup
The design matrix X is 100×3 (100 samples, 3 features including bias). The target y is 100×1. We solve for w (3×1).
The Geometric View
The column space of X is a 3D subspace in 100D space. The vector y (lifespans) is projected onto this subspace. The residuals are the prediction errors, perpendicular to all features.
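A sketch of this setup on simulated bulbs (every number below is invented purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(42)
n = 100

# Simulated features for 100 bulbs
voltage = rng.normal(120, 5, size=n)
temperature = rng.normal(25, 3, size=n)

# An invented "true" relationship plus noise, just to have something to fit
lifespan = 2000 - 8.0 * voltage - 12.0 * temperature + rng.normal(0, 30, size=n)

# Design matrix: bias, voltage, temperature  ->  shape (100, 3)
X = np.column_stack([np.ones(n), voltage, temperature])

# Fit Lifespan = w0 + w1*Voltage + w2*Temperature by least squares
w, *_ = np.linalg.lstsq(X, lifespan, rcond=None)

residuals = lifespan - X @ w
print(w)                 # [w0, w1, w2]
print(X.T @ residuals)   # ~[0, 0, 0]: errors perpendicular to every feature column
```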
QR Decomposition Solution
Directly computing (XᵀX)⁻¹Xᵀy is numerically unstable when X is ill-conditioned. In practice, we use a QR decomposition.
- Decompose X = QR, where Q is orthogonal (QᵀQ = I) and R is upper triangular.
- Substitute into the normal equations: (QR)ᵀ(QR)ŵ = (QR)ᵀy
- Simplify: RᵀQᵀQRŵ = RᵀQᵀy
- Since QᵀQ = I: RᵀRŵ = RᵀQᵀy
- Result: Rŵ = Qᵀy (cancel the invertible Rᵀ, then solve by back substitution)
Numerical Advantage: QR avoids squaring the condition number of X. Library solvers such as numpy.linalg.lstsq work on X directly for the same reason (NumPy's implementation uses an SVD-based LAPACK routine rather than forming the normal equations).
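A sketch of the QR route on made-up data, assuming SciPy is available for the triangular solve:

```python
import numpy as np
from scipy.linalg import solve_triangular

rng = np.random.default_rng(7)
X = rng.normal(size=(100, 3))
y = rng.normal(size=100)

# Reduced QR: Q is 100x3 with orthonormal columns, R is 3x3 upper triangular
Q, R = np.linalg.qr(X)

# Solve R w = Q^T y by back substitution
w_qr = solve_triangular(R, Q.T @ y)

# Matches the library least squares solver
w_ref = np.linalg.lstsq(X, y, rcond=None)[0]
print(np.allclose(w_qr, w_ref))   # True
```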
ML Applications
Linear Regression
The closed-form solution ŵ = (XᵀX)⁻¹Xᵀy is exactly the normal equations. Gradient descent converges to the same point.
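A small sketch comparing the two on made-up data; the step size and iteration count are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 2))
y = X @ np.array([1.5, -0.5]) + 0.1 * rng.normal(size=200)

# Closed form: normal equations
w_closed = np.linalg.solve(X.T @ X, X.T @ y)

# Plain gradient descent on the mean squared error
w = np.zeros(2)
lr = 0.01
for _ in range(5000):
    grad = X.T @ (X @ w - y) / len(y)
    w -= lr * grad

print(np.allclose(w, w_closed, atol=1e-4))   # True: both reach the same minimum
```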
Ridge Regression
Add regularization: ŵ = (XᵀX + λI)⁻¹Xᵀy. The λI term makes the matrix invertible and shrinks the coefficients.
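A sketch of the ridge closed form on made-up data; lam is an arbitrary regularization strength:

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.normal(size=(100, 4))
y = rng.normal(size=100)

lam = 1.0              # regularization strength (arbitrary)
d = X.shape[1]

# Ridge closed form: (X^T X + lambda I) w = X^T y
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# Ridge shrinks the weights relative to plain least squares
w_ols = np.linalg.lstsq(X, y, rcond=None)[0]
print(np.linalg.norm(w_ridge) < np.linalg.norm(w_ols))   # True
```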
PCA via SVD
PCA finds the subspace that best approximates the data (least reconstruction error). This is a projection problem solved via SVD.
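A sketch of the idea with plain NumPy on made-up data: truncate the SVD, project, and check the reconstruction error:

```python
import numpy as np

rng = np.random.default_rng(9)
X = rng.normal(size=(200, 5))   # 200 samples, 5 features (made-up data)
Xc = X - X.mean(axis=0)         # center the data

# SVD of the centered data: rows of Vt are the principal directions
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

k = 2                  # keep the top-2 principal subspace
Z = Xc @ Vt[:k].T      # project the data onto that subspace
X_hat = Z @ Vt[:k]     # reconstruct: the best rank-k approximation

# Reconstruction error equals the energy in the discarded singular values
print(np.allclose(np.sum((Xc - X_hat) ** 2), np.sum(S[k:] ** 2)))   # True
```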
Kernel Methods
Kernel Ridge Regression projects data into a high dimensional feature space and applies least squares there (via the kernel trick).
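A sketch of kernel ridge regression with an RBF kernel (all data and hyperparameters below are made up):

```python
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    """RBF (Gaussian) kernel matrix between the rows of A and the rows of B."""
    sq = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2 * A @ B.T
    return np.exp(-gamma * sq)

rng = np.random.default_rng(11)
X = rng.uniform(-3, 3, size=(80, 1))              # made-up 1D inputs
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=80)   # noisy targets

lam = 0.1
K = rbf_kernel(X, X)

# Least squares in feature space reduces to solving (K + lambda I) alpha = y
alpha = np.linalg.solve(K + lam * np.eye(len(y)), y)

# Predictions need only kernel evaluations (the kernel trick)
X_new = np.linspace(-3, 3, 5).reshape(-1, 1)
y_pred = rbf_kernel(X_new, X) @ alpha
print(y_pred)   # roughly follows sin(x) at the query points
```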