Introduction
Eigendecomposition is powerful, but it has a fatal flaw: it only works on square matrices. Real-world data matrices are rarely square. We might have m users and n movies, or m documents and n words.
Singular Value Decomposition (SVD) is the generalization that works on any matrix, regardless of shape. It is arguably the most important matrix factorization in applied linear algebra and machine learning.
What SVD Reveals
- Rank: Number of non-zero singular values = rank of matrix.
- Range: Column space is spanned by left singular vectors (U).
- Null Space: Right singular vectors (V) with zero singular values span null space.
- Condition Number: Ratio of largest to smallest singular value measures numerical stability.
Prerequisites
SVD builds on eigendecomposition and orthogonality. Understanding PCA also helps, as PCA is often computed using SVD.
The Movie Rating Intuition
Consider a matrix A where rows are Users and columns are Movies. Entry A[i,j] is user i's rating of movie j. SVD factorizes this into three parts that reveal hidden structure:
U (User-Concept)
Each row describes a user in terms of hidden "concepts" (latent factors).
"Alice: 0.9 Sci-Fi lover, 0.1 Romance fan, 0.3 Action enthusiast"
Σ (Strength)
Diagonal values show how "strong" each concept is in the dataset.
"Sci-Fi concept explains 40% of variance, Romance 25%, Action 15%..."
V^T (Concept-Movie)
Each column describes a movie in terms of the same concepts.
"The Matrix: 0.95 Sci-Fi, 0.05 Romance, 0.8 Action"
To predict Alice's rating for The Matrix: multiply Alice's concept profile by concept strengths by Matrix's concept profile. The "concepts" are learned automatically from the data, not pre-defined!
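As a concrete illustration, here is a minimal NumPy sketch. The 4×3 ratings matrix, the user/movie labels, and the choice k = 2 are all invented for the example, not taken from any real dataset:

```python
import numpy as np

# Made-up ratings: rows = users (Alice, Bob, Carol, Dave), columns = movies.
ratings = np.array([
    [5.0, 1.0, 4.0],   # Alice
    [4.0, 1.0, 5.0],   # Bob
    [1.0, 5.0, 2.0],   # Carol
    [2.0, 4.0, 1.0],   # Dave
])

# Factorize: ratings = U @ np.diag(s) @ Vt
U, s, Vt = np.linalg.svd(ratings, full_matrices=False)

# Keep the top-k "concepts" (latent factors).
k = 2
alice_concepts = U[0, :k] * s[:k]   # Alice's concept profile scaled by concept strengths
movie_concepts = Vt[:k, 0]          # first movie's profile over the same concepts

# Predicted rating = user profile · concept strengths · movie profile
predicted = alice_concepts @ movie_concepts
print(f"Alice's reconstructed rating for movie 0: {predicted:.2f}")
```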
Mathematical Formulation
SVD Theorem: For any real matrix A of size m × n, there exist orthogonal matrices U and V and a diagonal matrix Σ such that:
A = U Σ V^T, where U is m × m, Σ is m × n, and V is n × n.
Singular Values
- Σ is diagonal with entries σ_1, σ_2, …, σ_r (the singular values)
- Sorted in decreasing order: σ_1 ≥ σ_2 ≥ … ≥ σ_r > 0
Relationship to Eigendecomposition
The singular values and singular vectors come from eigendecompositions of symmetric matrices:
Left Singular Vectors
Columns of U are eigenvectors of A A^T.
Right Singular Vectors
Columns of V are eigenvectors of A^T A.
Singular values are the square roots of the eigenvalues: σ_i = √λ_i.
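A quick numerical check of this relationship (a sketch; the random 5×3 matrix is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 3))

# Singular values from SVD (returned in decreasing order)
s = np.linalg.svd(A, compute_uv=False)

# Eigenvalues of the symmetric matrix A^T A, sorted in decreasing order
lam = np.linalg.eigvalsh(A.T @ A)[::-1]

# sigma_i = sqrt(lambda_i)
print(np.allclose(s, np.sqrt(lam)))   # True
```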
Interactive: SVD Steps
SVD decomposes any linear transformation into three operations: rotate, scale, then rotate. Watch how a unit circle transforms through each step.
[Interactive: SVD Transformer — starts from a unit circle with orthonormal basis vectors (i, j) aligned with the axes, then steps through each operation of the decomposition.]
Geometric Interpretation
SVD tells us that every linear transformation, no matter how complex, is equivalent to three simple operations in sequence:
V^T (Rotation in Input Space)
Rotates the coordinate system to align with the "principal axes" of the transformation. After this rotation, the transformation will only stretch along coordinate axes.
Σ (Scaling)
Stretches or compresses along the principal axes by the singular values. This is where the transformation does its "work." A unit circle becomes an ellipse whose semi-axes have lengths σ_1, σ_2, etc.
U (Rotation in Output Space)
Final rotation that orients the scaled result to its output position. If the matrix changes dimension (m ≠ n), this rotation happens in the output space, which may have a different dimension than the input.
Unit Circle to Ellipse
When you apply any matrix A to a unit circle (or sphere in higher dimensions), the result is an ellipse (or ellipsoid). The singular values are the lengths of the semi-axes of this ellipse, and the singular vectors are the directions of these axes.
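A small sketch of this picture, mapping sampled points on the unit circle through an arbitrary 2×2 matrix and checking both the rotate-scale-rotate factorization and the semi-axis lengths:

```python
import numpy as np

A = np.array([[3.0, 1.0],
              [1.0, 2.0]])

# Points on the unit circle, as columns of a 2 x 400 array
theta = np.linspace(0, 2 * np.pi, 400)
circle = np.vstack([np.cos(theta), np.sin(theta)])

U, s, Vt = np.linalg.svd(A)

# Applying A is the same as rotate (V^T), then scale (Sigma), then rotate (U)
ellipse = A @ circle
same = U @ np.diag(s) @ Vt @ circle
print(np.allclose(ellipse, same))     # True

# The semi-axes of the resulting ellipse have lengths sigma_1 and sigma_2
radii = np.linalg.norm(ellipse, axis=0)
print(radii.max(), s[0])              # approximately equal
print(radii.min(), s[1])              # approximately equal
```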
Derivation
How do we prove SVD exists and find it? The key insight is to use A^T A and A A^T.
Step 1: Form Symmetric Matrices
For any m × n matrix A, both A^T A (n × n) and A A^T (m × m) are symmetric positive semi-definite. Therefore they have:
- Real, non-negative eigenvalues
- Orthogonal eigenvectors
Step 2: Key Identity
If A^T A v = λv, then A A^T (Av) = A(A^T A v) = λ(Av).
So if v is an eigenvector of A^T A with eigenvalue λ, then Av is an eigenvector of A A^T with the same eigenvalue λ.
Step 3: Define Components
v_i (Right Vectors)
Eigenvectors of A^T A, normalized.
σ_i (Singular Values)
σ_i = √λ_i, the square roots of the eigenvalues of A^T A.
u_i (Left Vectors)
u_i = A v_i / σ_i (for σ_i > 0).
Verification
With these definitions, A v_i = σ_i u_i for each i. In matrix form: AV = UΣ, which rearranges to A = U Σ V^T (since V is orthogonal, V⁻¹ = V^T).
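The derivation translates directly into code. Below is a sketch that builds U, Σ, V from the eigendecomposition of A^T A and verifies the factorization; it assumes A has full column rank (so no σ_i is zero) and is meant for illustration only, since production libraries use more stable algorithms:

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((4, 3))           # generically full column rank

# Steps 1-3: eigendecomposition of A^T A gives V and the squared singular values
lam, V = np.linalg.eigh(A.T @ A)          # eigh returns ascending order
order = np.argsort(lam)[::-1]             # re-sort descending
lam, V = lam[order], V[:, order]

sigma = np.sqrt(np.clip(lam, 0, None))    # sigma_i = sqrt(lambda_i)
U = A @ V / sigma                         # columns u_i = A v_i / sigma_i

# Verification: A = U Sigma V^T
print(np.allclose(A, U @ np.diag(sigma) @ V.T))   # True
```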
Truncated SVD (Low-Rank Approximation)
The real power of SVD is not in exact factorization but in approximation. By keeping only the top k singular values, we get the best rank-k approximation of A:
A_k = σ_1 u_1 v_1^T + σ_2 u_2 v_2^T + … + σ_k u_k v_k^T
This is the sum of k rank-1 matrices, each scaled by its singular value.
Eckart-Young Theorem (Optimality)
Among all rank-k matrices B, A_k minimizes the Frobenius norm of the error:
‖A − A_k‖_F^2 = σ_{k+1}^2 + σ_{k+2}^2 + … + σ_r^2 ≤ ‖A − B‖_F^2
The squared error equals the sum of the squared discarded singular values! This is why keeping the largest singular values is optimal.
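A short sketch of truncation and the Eckart-Young error formula on an arbitrary random matrix:

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((8, 6))

U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Best rank-k approximation: keep the top-k rank-1 terms
k = 3
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# Eckart-Young: squared Frobenius error = sum of squared discarded singular values
err = np.linalg.norm(A - A_k, 'fro') ** 2
print(np.isclose(err, np.sum(s[k:] ** 2)))   # True
```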
Compressing a 1000×1000 Image
Original: 1,000,000 pixels.
SVD (k=50): 50 left vectors + 50 singular values + 50 right vectors = 50 × (1000 + 1 + 1000) = 100,050 numbers.
Result: 90% space reduction with minimal visual loss!
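The storage arithmetic, spelled out for the 1000×1000 example with k = 50:

```python
m = n = 1000
k = 50

original = m * n                     # 1,000,000 numbers
compressed = k * (m + 1 + n)         # k left vectors + k singular values + k right vectors
print(compressed)                    # 100050
print(1 - compressed / original)     # ~0.9, i.e. about 90% space reduction
```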
Interactive: Low-Rank Approximation
Drag the slider to see how much "energy" (information) is captured by keeping different numbers of singular values.
[Interactive: Low-Rank Approximation — panels show the singular value spectrum, the original matrix (the target), the rank-k reconstruction A_k built from the top k patterns, and the residuals (the error, i.e. what's left behind as noise).]
Key Properties
Rank
Number of non-zero singular values.
Frobenius Norm
The squared Frobenius norm is the sum of squared entries, which equals the sum of squared singular values: ‖A‖_F = √(σ_1^2 + ⋯ + σ_r^2).
Spectral Norm (2-Norm)
Maximum stretching factor = largest singular value.
Condition Number
Ratio of largest to smallest singular value, κ(A) = σ_max / σ_min. Large values mean the matrix is close to singular and numerically unstable to invert.
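All four properties can be read directly off the singular values. A sketch on an arbitrary random matrix, cross-checked against NumPy's built-in routines:

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.standard_normal((5, 4))

s = np.linalg.svd(A, compute_uv=False)

rank = np.sum(s > 1e-12)                  # number of non-zero singular values
frob = np.sqrt(np.sum(s ** 2))            # Frobenius norm
spectral = s[0]                           # largest singular value
cond = s[0] / s[-1]                       # condition number (full-rank case)

print(rank, np.linalg.matrix_rank(A))     # agree
print(frob, np.linalg.norm(A, 'fro'))     # agree
print(spectral, np.linalg.norm(A, 2))     # agree
print(cond, np.linalg.cond(A))            # agree
```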
Pseudo-Inverse (Moore-Penrose)
How do you invert a non-square matrix? SVD gives the answer:
A⁺ = V Σ⁺ U^T
where Σ⁺ is formed by taking reciprocals of the non-zero singular values (1/σ_i) and transposing. This solves least squares problems: x̂ = A⁺ b minimizes ‖Ax − b‖.
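A sketch building the pseudo-inverse from the SVD and checking it against NumPy's pinv and a least-squares solve (the tall random system is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(4)
A = rng.standard_normal((6, 3))   # tall matrix: more equations than unknowns
b = rng.standard_normal(6)

U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Sigma^+: reciprocals of the non-zero singular values, zeros elsewhere
s_inv = np.where(s > 1e-12, 1.0 / s, 0.0)
A_pinv = Vt.T @ np.diag(s_inv) @ U.T

print(np.allclose(A_pinv, np.linalg.pinv(A)))                          # True
# Least-squares solution x = A^+ b
print(np.allclose(A_pinv @ b, np.linalg.lstsq(A, b, rcond=None)[0]))   # True
```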
Applications
Image Compression
An image is a matrix of pixel values. Keep top k singular values, discard the rest. The image remains recognizable even with k much smaller than rank.
Recommendation Systems
User-item rating matrix is sparse. SVD finds latent factors. Predicted rating for user i and item j = (U_k Σ_k V_k^T)[i,j], the corresponding entry of the rank-k reconstruction. This fills in missing entries. (A key component of Netflix Prize solutions.)
Latent Semantic Analysis (LSA)
Document-term matrix: rows = documents, columns = words. SVD reveals "topics" as latent dimensions. Documents and words are embedded in the same semantic space.
Noise Reduction
Noise typically appears in small singular values (random fluctuations don't create strong patterns). Truncating small singular values removes noise while keeping signal.
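A small sketch of this idea, using a synthetic rank-3 signal plus Gaussian noise (sizes, rank, and noise level are all invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(5)

# Low-rank "signal" buried in noise
signal = rng.standard_normal((100, 3)) @ rng.standard_normal((3, 80))
noisy = signal + 0.1 * rng.standard_normal((100, 80))

U, s, Vt = np.linalg.svd(noisy, full_matrices=False)

# Keep only the dominant singular values; the discarded tail is mostly noise
k = 3
denoised = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

print(np.linalg.norm(noisy - signal))      # error before truncation
print(np.linalg.norm(denoised - signal))   # noticeably smaller error after truncation
```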
SVD vs PCA vs Eigendecomposition
| Method | Matrix Type | What It Finds | Main Use |
|---|---|---|---|
| Eigendecomposition | Square (n × n) | Eigenvalues & eigenvectors | Theoretical analysis, dynamics |
| SVD | Any (m × n) | Singular values & vectors | Data compression, inversion |
| PCA | Data matrix | Principal components | Dimensionality reduction |
PCA via SVD
PCA is often computed using SVD rather than eigendecomposition of the covariance matrix:
- Center data matrix X
- Compute SVD: X = U Σ V^T
- Principal components = columns of V
- Projected data = U Σ
This is more numerically stable than forming X^T X explicitly.
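A sketch of the recipe above, cross-checked against the covariance eigenvalues (the 200×5 data matrix is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(6)
X = rng.standard_normal((200, 5))        # 200 samples, 5 features

# 1. Center the data
Xc = X - X.mean(axis=0)

# 2. SVD of the centered data
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

# 3. Principal components = columns of V (rows of V^T)
components = Vt

# 4. Projected data (scores) = U Sigma
scores = U * s

# Cross-check: eigenvalues of the covariance matrix = s^2 / (n - 1)
cov = Xc.T @ Xc / (len(X) - 1)
eigvals = np.linalg.eigvalsh(cov)[::-1]
print(np.allclose(eigvals, s ** 2 / (len(X) - 1)))   # explained variances agree
```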