Multidimensional Variance
In 1D, variance tells you how spread out data is along a single axis. But real data lives in many dimensions. Light bulb quality might be characterized by filament thickness, gas pressure, and glass clarity simultaneously.
In 2D (or n-D), data can be spread out in different directions. The Covariance Matrix captures this shape. It tells us not just how much each variable varies, but how they vary together.
The Covariance Matrix ($\Sigma$)
For $n$ variables, it's an $n \times n$ symmetric matrix.
Diagonal Elements
The variances of each variable, $\Sigma_{ii} = \mathrm{Var}(X_i)$: how spread out each variable is on its own.
Off-Diagonal Elements
The covariances between pairs of variables, $\Sigma_{ij} = \mathrm{Cov}(X_i, X_j)$: how they move together.
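A minimal numpy sketch using made-up measurements (the variable names are illustrative, not a real dataset), showing how the diagonal and off-diagonal entries are read off the matrix returned by `np.cov`:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins for three measured properties (illustrative values only).
n_samples = 1000
thickness = rng.normal(2.0, 0.1, n_samples)                   # filament thickness
pressure  = 0.5 * thickness + rng.normal(0, 0.05, n_samples)  # correlated with thickness
clarity   = rng.normal(0.9, 0.02, n_samples)                  # roughly independent

X = np.column_stack([thickness, pressure, clarity])  # shape (n_samples, 3)

# rowvar=False: each column is a variable, each row an observation.
Sigma = np.cov(X, rowvar=False)

print(Sigma)
print("Variances (diagonal):    ", np.diag(Sigma))
print("Cov(thickness, pressure):", Sigma[0, 1])      # an off-diagonal entry
```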
Interactive Simulator
Manipulate the covariance matrix for 3 variables (X, Y, Z). See how changing the covariances changes the 2D projections of the data cloud.
Key Properties
1. Symmetry
$\mathrm{Cov}(X_i, X_j) = \mathrm{Cov}(X_j, X_i)$. The matrix is symmetric, so $\Sigma = \Sigma^T$.
2. Positive Semi-Definite (PSD)
$v^T \Sigma v \ge 0$ for any vector $v$. Why? Because variance cannot be negative: the quadratic form $v^T \Sigma v$ is the variance of the projection $v^T X$.
3. Linear Transformation
If $Y = AX$, then $\Sigma_Y = A \Sigma_X A^T$. This is how covariance propagates through linear operations.
4. Correlation Matrix
Normalize by dividing each entry by $\sigma_i \sigma_j$: the correlation matrix $R_{ij} = \Sigma_{ij} / (\sigma_i \sigma_j)$ has 1s on the diagonal and values between $-1$ and $1$ off the diagonal.
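A quick numerical check of the four properties above, assuming numpy and an arbitrary made-up covariance matrix:

```python
import numpy as np

# An arbitrary symmetric PSD matrix, used only for illustration.
Sigma = np.array([[2.0, 0.8, 0.3],
                  [0.8, 1.5, 0.6],
                  [0.3, 0.6, 1.0]])

# 1. Symmetry.
assert np.allclose(Sigma, Sigma.T)

# 2. Positive semi-definite: all eigenvalues >= 0,
#    so v.T @ Sigma @ v >= 0 for any v.
assert np.all(np.linalg.eigvalsh(Sigma) >= -1e-12)

# 3. Linear transformation: if Y = A X, then Cov(Y) = A Sigma A^T.
A = np.array([[1.0, 2.0, 0.0],
              [0.0, 1.0, -1.0]])
print("Cov(AX):\n", A @ Sigma @ A.T)

# 4. Correlation matrix: divide each entry by sigma_i * sigma_j.
std = np.sqrt(np.diag(Sigma))
R = Sigma / np.outer(std, std)
print("Correlation matrix:\n", R)   # 1s on the diagonal, values in [-1, 1] elsewhere
```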
Geometric Intuition
Think of the Covariance Matrix as defining an ellipsoid (a stretched sphere) in $n$-dimensional space. The equation $(x - \mu)^T \Sigma^{-1} (x - \mu) = c$ defines an elliptical contour of constant probability.
Spherical (Identity Covariance)
Variables are uncorrelated with equal variance. Data cloud is a perfect sphere.
Ellipsoidal (General Covariance)
Variables have different variances and correlations. Data cloud is a tilted ellipsoid.
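One way to see the two shapes without a plot: sample from a multivariate normal with each kind of covariance and compare the sample covariance matrices (a sketch assuming numpy; the ellipsoidal covariance values are made up):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5000

# Spherical: identity covariance -> equal spread in every direction, no tilt.
spherical = rng.multivariate_normal(mean=[0, 0], cov=np.eye(2), size=n)

# Ellipsoidal: unequal variances plus correlation -> stretched, tilted cloud.
Sigma = np.array([[3.0, 1.2],
                  [1.2, 1.0]])
ellipsoidal = rng.multivariate_normal(mean=[0, 0], cov=Sigma, size=n)

print("Sample cov (spherical):\n", np.cov(spherical, rowvar=False))
print("Sample cov (ellipsoidal):\n", np.cov(ellipsoidal, rowvar=False))
```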
Eigen-Decomposition (Spectral Theorem)
This is the bridge between Linear Algebra and Statistics. Since $\Sigma$ is symmetric PSD, it has a special decomposition:
$$\Sigma = Q \Lambda Q^T$$
The columns of $Q$ point in the directions of the principal axes of the ellipsoid. These are the "natural" coordinate axes of the data, where the variables are uncorrelated.
The diagonal matrix $\Lambda = \mathrm{diag}(\lambda_1, \dots, \lambda_n)$, where $\lambda_i$ is the variance along the $i$-th principal axis. Larger eigenvalue = more spread in that direction.
Why This Matters
By rotating data into the eigenvector basis, you decorrelate the variables. The covariance matrix in the new basis is diagonal. This is exactly what PCA does.
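A minimal sketch of that rotation (assuming numpy; the covariance values are made up): project the samples onto the eigenvector basis and check that the covariance in the new basis is diagonal.

```python
import numpy as np

rng = np.random.default_rng(2)

# Correlated 2D data (illustrative covariance).
Sigma = np.array([[3.0, 1.2],
                  [1.2, 1.0]])
X = rng.multivariate_normal(mean=[0, 0], cov=Sigma, size=10_000)

# Spectral decomposition of the sample covariance: Sigma_hat = Q diag(lam) Q^T.
Sigma_hat = np.cov(X, rowvar=False)
lam, Q = np.linalg.eigh(Sigma_hat)   # eigenvalues ascending, Q orthogonal

# Rotate into the eigenvector basis.
X_rot = X @ Q

# The covariance in the new basis is diagonal, with the eigenvalues on the diagonal.
print(np.cov(X_rot, rowvar=False))
print("eigenvalues:", lam)
```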
Case Study: Bulb Quality Control
The Scenario
You measure 3 properties of each bulb: Filament Resistance (Ω), Luminosity (Lumens), and Heat Output (°C). You have data from 1,000 bulbs. How do you detect anomalies?
The Covariance Matrix
The off-diagonal values (1.8, 0.9, 1.4) are positive, meaning all three properties are positively correlated. A brighter bulb is also hotter.
Mahalanobis Distance (Anomaly Detection)
Unlike Euclidean distance (which ignores correlations), the Mahalanobis distance $d_M(x) = \sqrt{(x - \mu)^T \Sigma^{-1} (x - \mu)}$ accounts for the covariance structure. A bulb that's hot but not bright is an outlier, even if each value alone seems normal.
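A minimal anomaly-flagging sketch, assuming numpy; the measurements, the luminosity-heat correlation, the suspect bulb's values, and the rule of thumb at the end are all illustrative rather than taken from the dataset above:

```python
import numpy as np

rng = np.random.default_rng(3)

# Illustrative bulb measurements: resistance, luminosity, heat (in that order).
# Luminosity and heat are strongly correlated in this toy data.
n = 1000
resistance = rng.normal(100, 5, n)
luminosity = rng.normal(800, 30, n)
heat = 0.08 * luminosity + rng.normal(0, 1.5, n)
X = np.column_stack([resistance, luminosity, heat])

mu = X.mean(axis=0)
Sigma_inv = np.linalg.inv(np.cov(X, rowvar=False))

def mahalanobis(x):
    d = x - mu
    return np.sqrt(d @ Sigma_inv @ d)

# A bulb that is hot but not bright: each value alone looks plausible,
# but the combination violates the correlation structure.
suspect = np.array([100.0, 760.0, 69.0])
print("Mahalanobis distance:", mahalanobis(suspect))

# Rough rule of thumb for 3D roughly-Gaussian data: distances above ~3-4 are rare.
```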
PCA (Principal Component Analysis)
PCA is literally just finding the eigenvectors of the Covariance Matrix and projecting data onto the top ones.
The Algorithm
- Center the data (subtract mean from each column).
- Compute the Covariance Matrix $\Sigma$.
- Find the Eigenvectors and Eigenvalues of $\Sigma$.
- Sort by Eigenvalues (Largest = Most Variance = PC1).
- Project the data: $Z = XW$, where $W$ is the matrix of top eigenvectors.
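Putting the steps above together in a minimal numpy sketch (the random input data and the choice of $k = 2$ are illustrative):

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.multivariate_normal(mean=[0, 0, 0],
                            cov=[[3.0, 1.2, 0.5],
                                 [1.2, 2.0, 0.8],
                                 [0.5, 0.8, 1.0]],
                            size=1000)

# 1. Center the data.
Xc = X - X.mean(axis=0)

# 2. Covariance matrix.
Sigma = np.cov(Xc, rowvar=False)

# 3. Eigenvectors / eigenvalues (eigh is for symmetric matrices, ascending order).
eigvals, eigvecs = np.linalg.eigh(Sigma)

# 4. Sort descending: largest eigenvalue first (PC1).
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# 5. Project onto the top k components: Z = Xc W.
k = 2
W = eigvecs[:, :k]
Z = Xc @ W
print("Projected shape:", Z.shape)   # (1000, 2)
```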
Variance Explained
The proportion of variance explained by PC $i$ is $\lambda_i / \sum_j \lambda_j$. If the first 2 PCs explain 95% of the variance, you can visualize 100D data in 2D with minimal information loss.
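For instance, with a hypothetical set of sorted eigenvalues, the ratio and its cumulative sum are one line each:

```python
import numpy as np

# Hypothetical eigenvalues of a covariance matrix, sorted descending.
eigvals = np.array([4.2, 1.1, 0.4, 0.2, 0.1])

ratio = eigvals / eigvals.sum()
cumulative = np.cumsum(ratio)

print("Variance explained per PC:", np.round(ratio, 3))
print("Cumulative:               ", np.round(cumulative, 3))
# Keep enough PCs to reach, e.g., 95% cumulative variance.
```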
Whitening / Sphering
After PCA, scale each PC by $1 / \sqrt{\lambda_i}$. This transforms the ellipsoid back into a sphere (uncorrelated, unit variance). Common preprocessing for neural networks.
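A sketch of PCA whitening under the same assumptions (numpy, made-up covariance); the small `eps` is a numerical guard against near-zero eigenvalues, not part of the definition:

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.multivariate_normal(mean=[0, 0], cov=[[3.0, 1.2], [1.2, 1.0]], size=5000)

Xc = X - X.mean(axis=0)
Sigma = np.cov(Xc, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(Sigma)

# Rotate into the PC basis, then scale each axis by 1/sqrt(lambda_i).
eps = 1e-8
X_white = (Xc @ eigvecs) / np.sqrt(eigvals + eps)

# The whitened covariance is (numerically) the identity: a sphere.
print(np.cov(X_white, rowvar=False))
```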