Multidimensional Variance
In 1D, variance tells you how spread out data is along a single axis. But real data lives in many dimensions. Light bulb quality might be characterized by filament thickness, gas pressure, and glass clarity simultaneously.
In 2D (or n-D), data can be spread out in different directions. The Covariance Matrix captures this shape. It tells us not just how much each variable varies, but how they vary together.
The Covariance Matrix ($\Sigma$)
For $n$ variables, it's an $n \times n$ symmetric matrix.
Diagonal Elements
The variances of each variable, $\Sigma_{ii} = \mathrm{Var}(X_i)$: how spread out each variable is on its own.
Off-Diagonal Elements
The covariances between pairs of variables, $\Sigma_{ij} = \mathrm{Cov}(X_i, X_j)$: how they move together.
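A minimal numpy sketch using made-up measurements (the variable names are illustrative, not a real dataset), showing how the diagonal and off-diagonal entries are read off the matrix returned by `np.cov`:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins for three measured properties (illustrative values only).
n_samples = 1000
thickness = rng.normal(2.0, 0.1, n_samples)                   # filament thickness
pressure  = 0.5 * thickness + rng.normal(0, 0.05, n_samples)  # correlated with thickness
clarity   = rng.normal(0.9, 0.02, n_samples)                  # roughly independent

X = np.column_stack([thickness, pressure, clarity])  # shape (n_samples, 3)

# rowvar=False: each column is a variable, each row an observation.
Sigma = np.cov(X, rowvar=False)

print(Sigma)
print("Variances (diagonal):    ", np.diag(Sigma))
print("Cov(thickness, pressure):", Sigma[0, 1])      # an off-diagonal entry
```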
Interactive Simulator
Manipulate the covariance matrix for 3 variables (X, Y, Z). See how changing the covariances changes the 2D projections of the data cloud.
Key Properties
1. Symmetry
$\mathrm{Cov}(X_i, X_j) = \mathrm{Cov}(X_j, X_i)$. The matrix is symmetric, so $\Sigma = \Sigma^T$.
2. Positive Semi-Definite (PSD)
$v^T \Sigma v \ge 0$ for any vector $v$. Why? Because variance cannot be negative: the quadratic form $v^T \Sigma v$ is the variance of the projection $v^T X$.
3. Linear Transformation
If $Y = AX$, then $\Sigma_Y = A \Sigma_X A^T$. This is how covariance propagates through linear operations.
4. Correlation Matrix
Normalize by dividing each entry by $\sigma_i \sigma_j$: the correlation matrix $R_{ij} = \Sigma_{ij} / (\sigma_i \sigma_j)$ has 1s on the diagonal and values between $-1$ and $1$ off the diagonal.
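A quick numerical check of the four properties above, assuming numpy and an arbitrary made-up covariance matrix:

```python
import numpy as np

# An arbitrary symmetric PSD matrix, used only for illustration.
Sigma = np.array([[2.0, 0.8, 0.3],
                  [0.8, 1.5, 0.6],
                  [0.3, 0.6, 1.0]])

# 1. Symmetry.
assert np.allclose(Sigma, Sigma.T)

# 2. Positive semi-definite: all eigenvalues >= 0,
#    so v.T @ Sigma @ v >= 0 for any v.
assert np.all(np.linalg.eigvalsh(Sigma) >= -1e-12)

# 3. Linear transformation: if Y = A X, then Cov(Y) = A Sigma A^T.
A = np.array([[1.0, 2.0, 0.0],
              [0.0, 1.0, -1.0]])
print("Cov(AX):\n", A @ Sigma @ A.T)

# 4. Correlation matrix: divide each entry by sigma_i * sigma_j.
std = np.sqrt(np.diag(Sigma))
R = Sigma / np.outer(std, std)
print("Correlation matrix:\n", R)   # 1s on the diagonal, values in [-1, 1] elsewhere
```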
Geometric Intuition
Think of the Covariance Matrix as defining an ellipsoid (a stretched sphere) in $n$-dimensional space. The equation $(x - \mu)^T \Sigma^{-1} (x - \mu) = c$ defines an elliptical contour of constant probability.
Spherical (Identity Covariance)
Variables are uncorrelated with equal variance. Data cloud is a perfect sphere.
Ellipsoidal (General Covariance)
Variables have different variances and correlations. Data cloud is a tilted ellipsoid.
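One way to see the two shapes without a plot: sample from a multivariate normal with each kind of covariance and compare the sample covariance matrices (a sketch assuming numpy; the ellipsoidal covariance values are made up):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5000

# Spherical: identity covariance -> equal spread in every direction, no tilt.
spherical = rng.multivariate_normal(mean=[0, 0], cov=np.eye(2), size=n)

# Ellipsoidal: unequal variances plus correlation -> stretched, tilted cloud.
Sigma = np.array([[3.0, 1.2],
                  [1.2, 1.0]])
ellipsoidal = rng.multivariate_normal(mean=[0, 0], cov=Sigma, size=n)

print("Sample cov (spherical):\n", np.cov(spherical, rowvar=False))
print("Sample cov (ellipsoidal):\n", np.cov(ellipsoidal, rowvar=False))
```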
Eigen-Decomposition (Spectral Theorem)
This is the bridge between Linear Algebra and Statistics. Since $\Sigma$ is symmetric PSD, it has a special decomposition:
$$\Sigma = Q \Lambda Q^T$$
The columns of $Q$ point in the directions of the principal axes of the ellipsoid. These are the "natural" coordinate axes of the data, where the variables are uncorrelated.
The diagonal matrix $\Lambda = \mathrm{diag}(\lambda_1, \dots, \lambda_n)$, where $\lambda_i$ is the variance along the $i$-th principal axis. Larger eigenvalue = more spread in that direction.
Why This Matters
By rotating data into the eigenvector basis, you decorrelate the variables. The covariance matrix in the new basis is diagonal. This is exactly what PCA does.
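A minimal sketch of that rotation (assuming numpy; the covariance values are made up): project the samples onto the eigenvector basis and check that the covariance in the new basis is diagonal.

```python
import numpy as np

rng = np.random.default_rng(2)

# Correlated 2D data (illustrative covariance).
Sigma = np.array([[3.0, 1.2],
                  [1.2, 1.0]])
X = rng.multivariate_normal(mean=[0, 0], cov=Sigma, size=10_000)

# Spectral decomposition of the sample covariance: Sigma_hat = Q diag(lam) Q^T.
Sigma_hat = np.cov(X, rowvar=False)
lam, Q = np.linalg.eigh(Sigma_hat)   # eigenvalues ascending, Q orthogonal

# Rotate into the eigenvector basis.
X_rot = X @ Q

# The covariance in the new basis is diagonal, with the eigenvalues on the diagonal.
print(np.cov(X_rot, rowvar=False))
print("eigenvalues:", lam)
```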
Case Study: Bulb Quality Control
The Scenario
You measure 3 properties of each bulb: Filament Resistance (Ω), Luminosity (Lumens), and Heat Output (°C). You have data from 1,000 bulbs. How do you detect anomalies?
The Covariance Matrix
The off-diagonal values (1.8, 0.9, 1.4) are positive, meaning all three properties are positively correlated. A brighter bulb is also hotter.
Mahalanobis Distance (Anomaly Detection)
Unlike Euclidean distance (which ignores correlations), the Mahalanobis distance $d_M(x) = \sqrt{(x - \mu)^T \Sigma^{-1} (x - \mu)}$ accounts for the covariance structure. A bulb that's hot but not bright is an outlier, even if each value alone seems normal.
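A minimal anomaly-flagging sketch, assuming numpy; the measurements, the luminosity-heat correlation, the suspect bulb's values, and the rule of thumb at the end are all illustrative rather than taken from the dataset above:

```python
import numpy as np

rng = np.random.default_rng(3)

# Illustrative bulb measurements: resistance, luminosity, heat (in that order).
# Luminosity and heat are strongly correlated in this toy data.
n = 1000
resistance = rng.normal(100, 5, n)
luminosity = rng.normal(800, 30, n)
heat = 0.08 * luminosity + rng.normal(0, 1.5, n)
X = np.column_stack([resistance, luminosity, heat])

mu = X.mean(axis=0)
Sigma_inv = np.linalg.inv(np.cov(X, rowvar=False))

def mahalanobis(x):
    d = x - mu
    return np.sqrt(d @ Sigma_inv @ d)

# A bulb that is hot but not bright: each value alone looks plausible,
# but the combination violates the correlation structure.
suspect = np.array([100.0, 760.0, 69.0])
print("Mahalanobis distance:", mahalanobis(suspect))

# Rough rule of thumb for 3D roughly-Gaussian data: distances above ~3-4 are rare.
```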
PCA (Principal Component Analysis)
PCA is literally just finding the eigenvectors of the Covariance Matrix and projecting data onto the top ones.
The Algorithm
- Center the data (subtract mean from each column).
- Compute the Covariance Matrix $\Sigma$.
- Find the Eigenvectors and Eigenvalues of $\Sigma$.
- Sort by Eigenvalues (Largest = Most Variance = PC1).
- Project the data: $Z = XW$, where $W$ is the matrix of top eigenvectors.
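Putting the steps above together in a minimal numpy sketch (the random input data and the choice of $k = 2$ are illustrative):

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.multivariate_normal(mean=[0, 0, 0],
                            cov=[[3.0, 1.2, 0.5],
                                 [1.2, 2.0, 0.8],
                                 [0.5, 0.8, 1.0]],
                            size=1000)

# 1. Center the data.
Xc = X - X.mean(axis=0)

# 2. Covariance matrix.
Sigma = np.cov(Xc, rowvar=False)

# 3. Eigenvectors / eigenvalues (eigh is for symmetric matrices, ascending order).
eigvals, eigvecs = np.linalg.eigh(Sigma)

# 4. Sort descending: largest eigenvalue first (PC1).
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# 5. Project onto the top k components: Z = Xc W.
k = 2
W = eigvecs[:, :k]
Z = Xc @ W
print("Projected shape:", Z.shape)   # (1000, 2)
```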
Variance Explained
The proportion of variance explained by PC $i$ is $\lambda_i / \sum_j \lambda_j$. If the first 2 PCs explain 95% of the variance, you can visualize 100D data in 2D with minimal information loss.
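For instance, with a hypothetical set of sorted eigenvalues, the ratio and its cumulative sum are one line each:

```python
import numpy as np

# Hypothetical eigenvalues of a covariance matrix, sorted descending.
eigvals = np.array([4.2, 1.1, 0.4, 0.2, 0.1])

ratio = eigvals / eigvals.sum()
cumulative = np.cumsum(ratio)

print("Variance explained per PC:", np.round(ratio, 3))
print("Cumulative:               ", np.round(cumulative, 3))
# Keep enough PCs to reach, e.g., 95% cumulative variance.
```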
Whitening / Sphering
After PCA, scale each PC by $1 / \sqrt{\lambda_i}$. This transforms the ellipsoid back into a sphere (uncorrelated, unit variance). Common preprocessing for neural networks.
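A sketch of PCA whitening under the same assumptions (numpy, made-up covariance); the small `eps` is a numerical guard against near-zero eigenvalues, not part of the definition:

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.multivariate_normal(mean=[0, 0], cov=[[3.0, 1.2], [1.2, 1.0]], size=5000)

Xc = X - X.mean(axis=0)
Sigma = np.cov(Xc, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(Sigma)

# Rotate into the PC basis, then scale each axis by 1/sqrt(lambda_i).
eps = 1e-8
X_white = (Xc @ eigvecs) / np.sqrt(eigvals + eps)

# The whitened covariance is (numerically) the identity: a sphere.
print(np.cov(X_white, rowvar=False))
```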