
Singular Value Decomposition

The Swiss Army Knife of linear algebra: works on any matrix, any shape. The foundation of modern data compression.

Introduction

Eigendecomposition is powerful, but it has a fundamental limitation: it only applies to square matrices (and even then only to diagonalizable ones). Real-world data matrices are rarely square. We might have m users and n movies, or m documents and n words.

Singular Value Decomposition (SVD) is the generalization that works on any matrix, regardless of shape. It is arguably the most important matrix factorization in applied linear algebra and machine learning.

What SVD Reveals

  • Rank: Number of non-zero singular values = rank of matrix.
  • Range: The column space is spanned by the left singular vectors (columns of U) associated with non-zero singular values.
  • Null Space: The right singular vectors (columns of V) associated with zero singular values span the null space.
  • Condition Number: Ratio of largest to smallest singular value measures numerical stability.
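
As a concrete illustration, here is a minimal NumPy sketch (the matrix A below is arbitrary and chosen only for demonstration) of reading these quantities directly off the SVD:

```python
import numpy as np

# An arbitrary rank-deficient 4x3 matrix, chosen only for illustration.
A = np.array([[1., 2., 3.],
              [2., 4., 6.],
              [1., 0., 1.],
              [0., 1., 1.]])

U, s, Vt = np.linalg.svd(A)                  # s holds singular values, descending
tol = max(A.shape) * np.finfo(float).eps * s[0]

rank = int(np.sum(s > tol))                  # rank = number of non-zero singular values
col_space = U[:, :rank]                      # orthonormal basis for the range (column space)
null_space = Vt[rank:].T                     # remaining right singular vectors span the null space
cond = s[0] / s[rank - 1]                    # ratio of largest to smallest non-zero singular value

print(rank)               # 2
print(null_space.shape)   # (3, 1): one null-space direction
print(cond)
```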

Prerequisites

SVD builds on eigendecomposition and orthogonality. Understanding PCA also helps, as PCA is often computed using SVD.

The Movie Rating Intuition

Consider a matrix A where rows are Users and columns are Movies. Entry A[i,j] is user i's rating of movie j. SVD factorizes this into three parts that reveal hidden structure:

U (User-Concept)

Each row describes a user in terms of hidden "concepts" (latent factors).

"Alice: 0.9 Sci-Fi lover, 0.1 Romance fan, 0.3 Action enthusiast"

Σ (Strength)

Diagonal values show how "strong" each concept is in the dataset.

"Sci-Fi concept explains 40% of variance, Romance 25%, Action 15%..."

V^T (Concept-Movie)

Each column describes a movie in terms of the same concepts.

"The Matrix: 0.95 Sci-Fi, 0.05 Romance, 0.8 Action"

To predict Alice's rating for The Matrix: multiply Alice's concept profile by concept strengths by Matrix's concept profile. The "concepts" are learned automatically from the data, not pre-defined!
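
A tiny numerical sketch of that prediction, using the made-up concept numbers from this example (they are illustrative only, not learned from real data):

```python
import numpy as np

# Hypothetical latent-factor profiles (made up for this example, not learned from data).
alice      = np.array([0.9, 0.1, 0.3])     # Alice's row of U: Sci-Fi, Romance, Action
strengths  = np.array([40., 25., 15.])     # stand-in for the diagonal of Sigma
the_matrix = np.array([0.95, 0.05, 0.8])   # The Matrix's column of V^T

# Predicted affinity: (row of U) x Sigma x (column of V^T), i.e. a weighted dot product.
predicted = alice @ np.diag(strengths) @ the_matrix
print(predicted)    # a single (unnormalized) score; higher means a better match
```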

Mathematical Formulation

SVD Theorem: For any real matrix A of size m × n, there exist orthogonal matrices U and V and a rectangular diagonal m × n matrix Σ such that:

A = U Σ V^T

U (m × m)

Left Singular Vectors

  • Orthogonal: U^T U = I
  • Columns are eigenvectors of AA^T

Σ (m × n)

Singular Values

  • Diagonal with entries σ_i ≥ 0
  • Sorted: σ_1 ≥ σ_2 ≥ … ≥ 0

V^T (n × n)

Right Singular Vectors

  • Orthogonal: V^T V = I
  • Rows are eigenvectors of A^T A
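
A quick NumPy check of the theorem's claims: the factors reassemble A, U and V are orthogonal, and the singular values come out sorted. (The matrix here is random and purely illustrative.)

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(4, 3))                  # m = 4, n = 3, purely illustrative

U, s, Vt = np.linalg.svd(A)                  # full SVD: U is 4x4, Vt is 3x3

Sigma = np.zeros(A.shape)                    # rectangular diagonal m x n matrix
np.fill_diagonal(Sigma, s)

print(np.allclose(U @ Sigma @ Vt, A))        # A = U Sigma V^T
print(np.allclose(U.T @ U, np.eye(4)))       # U^T U = I
print(np.allclose(Vt @ Vt.T, np.eye(3)))     # V^T V = I
print(np.all(np.diff(s) <= 0))               # sigma_1 >= sigma_2 >= ... >= 0
```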

Relationship to Eigendecomposition

The singular values and singular vectors come from eigendecompositions of symmetric matrices:

Left Singular Vectors

AA^T = U Λ U^T

The columns of U are eigenvectors of AA^T.

Right Singular Vectors

A^T A = V Λ V^T

The columns of V are eigenvectors of A^T A.

Singular values are the square roots of the eigenvalues: σ_i = √λ_i
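
A short sketch confirming this relationship numerically for an arbitrary random matrix: the eigenvalues of A^T A equal the squared singular values of A.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(5, 3))                  # any rectangular matrix

s = np.linalg.svd(A, compute_uv=False)       # singular values, descending
eigvals = np.linalg.eigvalsh(A.T @ A)        # eigenvalues of A^T A, ascending

# Squared singular values equal the eigenvalues of A^T A (up to ordering).
print(np.allclose(np.sort(s**2), eigvals))   # True
```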

Interactive: SVD Steps

SVD decomposes any linear transformation into three operations: rotate, scale, then rotate. Watch how a unit circle transforms through each step.

The matrix constructor lets you set the input rotation V^T (e.g., −45°), the scaling factors σ_1 and σ_2 (e.g., 2 and 1), and the output rotation U (e.g., 30°). Starting from a unit circle with the orthonormal basis vectors (i, j) aligned to the axes, step through the three operations to see the circle rotated, stretched into an ellipse, and rotated again.

Geometric Interpretation

SVD tells us that every linear transformation, no matter how complex, is equivalent to three simple operations in sequence:

1. V^T (Rotation in Input Space)

Rotates the coordinate system to align with the "principal axes" of the transformation. After this rotation, the transformation will only stretch along coordinate axes.

2. Σ (Scaling)

Stretches or compresses along the principal axes by the singular values. This is where the transformation does its "work." A unit circle becomes an ellipse whose semi-axes have lengths σ_1, σ_2, etc.

3. U (Rotation in Output Space)

Final rotation that orients the scaled result to its output position. If the matrix is rectangular (m ≠ n), this rotation happens in the output space, which may have a different dimension than the input space.

Unit Circle to Ellipse

When you apply any matrix A to a unit circle (or sphere in higher dimensions), the result is an ellipse (or ellipsoid). The singular values are the lengths of the semi-axes of this ellipse, and the singular vectors are the directions of these axes.
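
The sketch below (with an arbitrary 2×2 matrix) maps points on the unit circle through A and checks that the longest and shortest radii of the resulting ellipse are σ_1 and σ_2:

```python
import numpy as np

A = np.array([[3., 1.],
              [1., 2.]])                     # an arbitrary 2x2 example

U, s, Vt = np.linalg.svd(A)

theta = np.linspace(0, 2 * np.pi, 2000)
circle = np.vstack([np.cos(theta), np.sin(theta)])   # unit circle, shape (2, N)
ellipse = A @ circle                                 # image of the circle under A

radii = np.linalg.norm(ellipse, axis=0)
print(radii.max(), s[0])   # longest semi-axis  ~ sigma_1
print(radii.min(), s[1])   # shortest semi-axis ~ sigma_2
```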

Derivation

How do we prove SVD exists and find it? The key insight is using A^T A and AA^T.

Step 1: Form Symmetric Matrices

For any A, both A^T A (n × n) and AA^T (m × m) are symmetric positive semi-definite. Therefore they have:

  • Real, non-negative eigenvalues
  • Orthogonal eigenvectors

Step 2: Key Identity

If A^T A v = λv, then

AA^T (Av) = A(A^T A v) = A(λv) = λ(Av)

So if v is an eigenvector of A^T A with eigenvalue λ (and Av ≠ 0), then Av is an eigenvector of AA^T with the same eigenvalue λ.

Step 3: Define Components

V (Right Vectors)

Eigenvectors of A^T A, normalized.

Σ (Singular Values)

σ_i = √λ_i

U (Left Vectors)

u_i = (1/σ_i) A v_i  (for each σ_i > 0)

Verification

With these definitions, Av_i = σ_i u_i for each i. In matrix form: AV = UΣ, which rearranges to A = UΣV^T (since V is orthogonal, V^{-1} = V^T).
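
The whole derivation can be replayed in a few lines of NumPy: take eigenvectors of A^T A as V, set σ_i = √λ_i, build U column by column as (1/σ_i) A v_i, and confirm the product reassembles A. (The example assumes a full-rank matrix, so every σ_i > 0.)

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(4, 3))                  # full-rank tall matrix (assumption)

# V: eigenvectors of the symmetric matrix A^T A, sorted by descending eigenvalue.
lam, V = np.linalg.eigh(A.T @ A)             # eigh returns ascending eigenvalues
order = np.argsort(lam)[::-1]
lam, V = lam[order], V[:, order]

sigma = np.sqrt(lam)                         # sigma_i = sqrt(lambda_i)
U = (A @ V) / sigma                          # u_i = (1/sigma_i) A v_i, column by column

# A V = U Sigma  rearranges to  A = U Sigma V^T (thin form: U is 4x3 here).
print(np.allclose(A @ V, U * sigma))                 # True
print(np.allclose(U @ np.diag(sigma) @ V.T, A))      # True
```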

Truncated SVD (Low-Rank Approximation)

The real power of SVD is not in exact factorization but in approximation. By keeping only the top k singular values, we get the best rank-k approximation of A.

A_k = σ_1 u_1 v_1^T + σ_2 u_2 v_2^T + … + σ_k u_k v_k^T

This is the sum of k rank-1 matrices, each scaled by its singular value.

Eckart-Young Theorem (Optimality)

Among all rank-k matrices, A_k minimizes the Frobenius norm of the error:

||A - A_k||_F = √(σ_{k+1}² + σ_{k+2}² + … + σ_r²)

The squared error equals the sum of the squared discarded singular values! This is why keeping the largest singular values is optimal.
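
A minimal check of both the construction of A_k and the Eckart-Young error formula, on a random matrix:

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.normal(size=(6, 5))
U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 2
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]          # best rank-k approximation

# Eckart-Young: the Frobenius error is the root of the sum of discarded sigma_i^2.
err = np.linalg.norm(A - A_k, 'fro')
print(np.isclose(err, np.sqrt(np.sum(s[k:]**2))))    # True
```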

Compressing a 1000×1000 Image

Original: 1,000,000 pixels.
SVD (k=50): 50 left vectors + 50 singular values + 50 right vectors = 50 × (1000 + 1 + 1000) = 100,050 numbers.
Result: 90% space reduction with minimal visual loss!

Interactive: Low-Rank Approximation

Drag the slider to see how much "energy" (information) is captured by keeping different numbers of singular values.

Low-Rank Approximation demo: reconstructing a matrix using only its most important patterns. A rank slider runs from k = 1 (abstract) to k = 8 (exact); the example's singular value spectrum is σ_1 = 50.0, σ_2 = 25.0, σ_3 = 12.0, σ_4 = 8.0, σ_5 = 2.7, σ_6 = 2.0, σ_7 = 1.6, σ_8 = 1.3, so k = 1 alone already keeps about 74.7% of the information. Three panels show the original matrix (the target), the reconstruction A_k built from the top k patterns, and the residuals (what's left behind as noise).

Notice how Rank 1 captures the main gradient, Rank 2 splits it, and Rank 3 adds texture. Rank 4 adds the central bump.

Key Properties

Rank

Number of non-zero singular values.

Frobenius Norm

Sum of squared entries = sum of squared singular values.
||A||_F = √(σ_1² + σ_2² + … + σ_r²)

Spectral Norm (2-Norm)

Maximum stretching factor = largest singular value.
||A||_2 = σ_1

Condition Number

Measures numerical stability.
κ(A) = σ_1 / σ_r
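
All four properties can be verified in a few lines with NumPy (the matrix is random and purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.normal(size=(4, 4))                  # random square matrix, illustrative only
s = np.linalg.svd(A, compute_uv=False)       # singular values, descending

print(np.linalg.matrix_rank(A) == np.sum(s > 1e-10))                 # rank
print(np.isclose(np.linalg.norm(A, 'fro'), np.sqrt(np.sum(s**2))))   # Frobenius norm
print(np.isclose(np.linalg.norm(A, 2), s[0]))                        # spectral norm
print(np.isclose(np.linalg.cond(A), s[0] / s[-1]))                   # condition number
```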

Pseudo-Inverse (Moore-Penrose)

How do you invert a non-square matrix? SVD gives the answer:

A^+ = V Σ^+ U^T

Where Σ^+ is formed by taking the reciprocals of the non-zero singular values (1/σ_i) and transposing. This solves least squares problems: x = A^+ b.
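
A sketch building A^+ from the SVD and comparing it against NumPy's built-in least-squares and pseudo-inverse routines (the example assumes a full-column-rank A, so all σ_i > 0):

```python
import numpy as np

rng = np.random.default_rng(4)
A = rng.normal(size=(6, 3))                  # overdetermined system (full column rank)
b = rng.normal(size=6)

U, s, Vt = np.linalg.svd(A, full_matrices=False)
A_pinv = Vt.T @ np.diag(1.0 / s) @ U.T       # A^+ = V Sigma^+ U^T (all sigma_i > 0 here)

x = A_pinv @ b                               # least-squares solution to Ax ~ b
print(np.allclose(x, np.linalg.lstsq(A, b, rcond=None)[0]))   # True
print(np.allclose(A_pinv, np.linalg.pinv(A)))                 # True
```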

Applications

Image Compression

An image is a matrix of pixel values. Keep top k singular values, discard the rest. The image remains recognizable even with k much smaller than rank.

Recommendation Systems

The user-item rating matrix is sparse. SVD finds latent factors, and the low-rank reconstruction U_k Σ_k V_k^T predicts ratings, filling in the missing entries (a key component of Netflix Prize solutions).

Latent Semantic Analysis (LSA)

Document-term matrix: rows = documents, columns = words. SVD reveals "topics" as latent dimensions. Documents and words are embedded in the same semantic space.

Noise Reduction

Noise typically appears in small singular values (random fluctuations don't create strong patterns). Truncating small singular values removes noise while keeping signal.
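
A small sketch of this idea, assuming we know (or estimate) the rank of the underlying signal: build a rank-3 signal, add noise, truncate the SVD at k = 3, and compare errors.

```python
import numpy as np

rng = np.random.default_rng(5)

# A genuinely low-rank "signal" plus small random noise.
signal = rng.normal(size=(50, 3)) @ rng.normal(size=(3, 40))    # rank 3
noisy = signal + 0.1 * rng.normal(size=signal.shape)

U, s, Vt = np.linalg.svd(noisy, full_matrices=False)
k = 3                                                           # assumed/estimated rank
denoised = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

print(np.linalg.norm(noisy - signal, 'fro'))      # error before truncation
print(np.linalg.norm(denoised - signal, 'fro'))   # noticeably smaller after truncation
```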

SVD vs PCA vs Eigendecomposition

Method | Matrix Type | What It Finds | Main Use
Eigendecomposition | Square (n × n) | Eigenvalues & eigenvectors | Theoretical analysis, dynamics
SVD | Any (m × n) | Singular values & vectors | Data compression, inversion
PCA | Data matrix | Principal components | Dimensionality reduction

PCA via SVD

PCA is often computed using SVD rather than eigendecomposition of the covariance matrix:

  1. Center data matrix X
  2. Compute SVD: X = U Σ V^T
  3. Principal components = columns of V
  4. Projected data = U Σ

This is more numerically stable than forming X^T X explicitly.
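
A compact sketch of PCA via SVD following these four steps (the data here is random; in practice X would be your own feature matrix):

```python
import numpy as np

rng = np.random.default_rng(6)
X = rng.normal(size=(100, 5))                # 100 samples, 5 features (placeholder data)

# 1. Center the data.
Xc = X - X.mean(axis=0)

# 2. SVD of the centered data matrix.
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

# 3. Principal components are the columns of V (rows of V^T).
components = Vt.T

# 4. Projected data (scores) = U Sigma, identical to Xc @ components.
scores = U * s
print(np.allclose(scores, Xc @ components))  # True

# Variance explained by each component: sigma_i^2 / (n - 1).
explained_variance = s**2 / (X.shape[0] - 1)
print(explained_variance)
```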