Correlation

Measuring how things move together.

Relationship Status

Correlation quantifies the strength and direction of the relationship between two variables. It is the bedrock of predictive modeling. If X is correlated with Y, then knowing X helps us predict Y.

But beware: correlation can lie. It can show relationships that don't exist, hide relationships that do, and most dangerously, suggest causation where there is none.

Covariance: The Raw Ingredient

Before Correlation, we must understand Covariance. It measures how two variables change together.

Cov(X,Y) = \frac{1}{n-1}\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})
  • Cov > 0: X up, Y up. They move together.
  • Cov < 0: X up, Y down. They move in opposite directions.
  • Cov ≈ 0: No linear relationship (could still be non-linear!).

Problem with Covariance: It's not scaled. Measure two heights in meters and Cov is one number; switch both to centimeters and Cov is 10,000 times larger, because each variable picks up a factor of 100. The number itself is meaningless without context.
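A quick numpy sketch of the problem, using simulated parent and child heights (all numbers invented): rescaling both variables from meters to centimeters inflates the covariance by a factor of 100 × 100 = 10,000.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated parent/child heights in meters (invented numbers)
parent_m = rng.normal(1.75, 0.08, size=200)
child_m = 0.6 * parent_m + rng.normal(0.70, 0.05, size=200)

cov_m = np.cov(parent_m, child_m)[0, 1]
cov_cm = np.cov(parent_m * 100, child_m * 100)[0, 1]

print(cov_m)           # some small number in m^2
print(cov_cm / cov_m)  # exactly 10,000: same data, different units
```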

Correlation Coefficient (r)

Correlation is standardized covariance. We divide by the product of standard deviations to get a unitless number between -1 and 1.

r = \frac{Cov(X,Y)}{\sigma_X \sigma_Y} = \frac{\sum (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum(X_i-\bar{X})^2}\sqrt{\sum(Y_i-\bar{Y})^2}}
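A minimal sanity check on synthetic data: the formula above matches np.corrcoef, and because r is unitless, rescaling X does not change it.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 2.0 * x + rng.normal(size=100)

# r computed straight from the definition
num = np.sum((x - x.mean()) * (y - y.mean()))
den = np.sqrt(np.sum((x - x.mean()) ** 2)) * np.sqrt(np.sum((y - y.mean()) ** 2))
r_manual = num / den

print(r_manual)
print(np.corrcoef(x, y)[0, 1])        # identical to the manual value
print(np.corrcoef(100 * x, y)[0, 1])  # unchanged: r ignores units
```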

Interpreting r

  • r = 1.0: Perfect positive. Points on a line.
  • r = 0.7: Strong positive. Clear trend.
  • r = 0.3: Weak positive. Noisy but visible.
  • r = 0.0: No linear relationship.
  • r = -0.8: Strong negative. Inverse trend.

Pearson vs Spearman

Not all relationships are straight lines. Choosing the right correlation measure depends on your data.

Pearson Correlation (r)

  • Measures linear relationships.
  • Requires both variables to be approximately normal.
  • Sensitive to outliers (one bad point can destroy your r).
  • If Y = X^2 on a range symmetric around zero, Pearson is near 0 even though the relationship is perfectly deterministic (see the sketch after these lists).

Spearman Rank Correlation (ρ)

  • Measures monotonic relationships (always increasing or decreasing).
  • Converts values to ranks (1st, 2nd, 3rd), then runs Pearson on the ranks.
  • Robust to outliers and non-normal data.
  • Perfect for ordinal data (ratings 1-5).
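A short scipy sketch of both failure modes, on made-up data: Spearman handles a monotonic-but-curved relationship that Pearson understates, while the symmetric parabola Y = X^2 defeats both (it isn't monotonic either).

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Monotonic but strongly non-linear: exponential growth plus noise
x = np.linspace(0, 10, 100)
y = np.exp(x / 2) + rng.normal(0, 1, size=100)
print(stats.pearsonr(x, y)[0])   # well below 1: curvature hurts Pearson
print(stats.spearmanr(x, y)[0])  # near 1: ranks capture monotonicity

# Symmetric parabola: perfect dependence, zero linear/monotonic trend
x2 = np.linspace(-5, 5, 100)
y2 = x2 ** 2
print(stats.pearsonr(x2, y2)[0])   # ~0
print(stats.spearmanr(x2, y2)[0])  # ~0 as well: not monotonic
```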

Case Study: Bulb Factory Quality Control

The Question

Is there a relationship between the voltage supplied during manufacturing and the lifespan of the light bulb? If so, we can optimize the voltage setting.

The Data

We sample 50 bulbs, record the manufacturing voltage (V) and the hours until failure (H).

  • Pearson r: 0.72
  • Spearman ρ: 0.68
  • p-value: < 0.001

The Interpretation

Strong positive correlation. Bulbs manufactured at higher voltage tend to last longer. But wait! Does higher voltage cause longer life? Or do better quality filaments (which last longer) also happen to tolerate higher voltage? We need to run an experiment (A/B test) to prove causation.
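The raw measurements aren't shown here, so the sketch below simulates 50 plausible (voltage, lifespan) pairs just to show the mechanics; the exact coefficients will differ from the table above.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)

# Invented stand-in for the 50 sampled bulbs: voltage in V, lifespan in hours.
# Slope and noise are chosen so the correlation lands roughly near 0.7.
voltage = rng.normal(230, 5, size=50)
lifespan = 20 * voltage - 3000 + rng.normal(0, 100, size=50)

r, p_r = stats.pearsonr(voltage, lifespan)
rho, p_rho = stats.spearmanr(voltage, lifespan)
print(f"Pearson r    = {r:.2f} (p = {p_r:.2g})")
print(f"Spearman rho = {rho:.2f} (p = {p_rho:.2g})")
```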

Interactive Simulator

Adjust the correlation coefficient. See how the cloud of points tightens into a line as r → 1 or r → -1. Notice how at r = 0, the cloud is a perfect circle.


Note: At r=0, points form a circular cloud (σ=1). As |r| → 1, the Y-noise diminishes, collapsing the cloud onto the regression line.
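You can reproduce the simulator's behavior in a few lines, assuming the standard construction Y = rX + sqrt(1 - r^2) · noise with unit-variance Gaussians:

```python
import numpy as np

def correlated_cloud(r, n=400, seed=None):
    """Sample n points whose population correlation is exactly r."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=n)
    noise = rng.normal(size=n)
    y = r * x + np.sqrt(1 - r ** 2) * noise  # Y-noise shrinks as |r| -> 1
    return x, y

for r in (0.0, 0.5, 0.9, -0.8):
    x, y = correlated_cloud(r, seed=0)
    print(r, round(np.corrcoef(x, y)[0, 1], 2))  # sample r tracks target r
```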

Correlation ≠ Causation

This is perhaps the most important lesson in statistics. Just because A and B move together doesn't mean A causes B. There are several possibilities:

1. A causes B (Direct Causation)

Rain causes wet ground. Smoking causes cancer.

2. B causes A (Reverse Causation)

Neighborhoods with more police officers record more crime. Does police presence cause crime? No, it's the reverse: high crime rates draw a heavier police presence.

3. C causes both A and B (Confounding Variable)

Ice cream sales and drowning deaths are correlated. Summer (heat) causes both. Likewise, countries that consume more chocolate win more Nobel prizes, because national wealth drives both chocolate consumption and research funding. This is the most common trap.

4. Pure Coincidence (Spurious)

With enough variables and time series, you'll find spurious correlations by chance.

Spurious Correlations

Sometimes, things correlate purely by accident. These are hilarious to look at but dangerous if you take them seriously.

"The number of people who drowned by falling into a pool correlates with films Nicolas Cage appeared in."

(Real stat: r = 0.66 from 1999-2009).

Bulb Example: A Warning

You notice that bulbs manufactured on Fridays have higher defect rates. Before blaming the day of the week, check if Friday is when the junior shift works, or if Friday is when the machine gets its weekly maintenance (and isn't fully calibrated afterward). The correlation is real, but the cause is something else entirely.

ML Applications

Feature Selection

We want features that correlate strongly with the target (predictive power) but weakly with each other (no redundancy). A correlation matrix heatmap is your first step.
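A minimal pandas sketch with three invented features (f1 is predictive, f2 nearly duplicates it, f3 is noise), showing both checks coming out of the same correlation matrix:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 500

f1 = rng.normal(size=n)               # genuinely predictive
f2 = f1 + rng.normal(0, 0.1, size=n)  # redundant copy of f1
f3 = rng.normal(size=n)               # irrelevant noise
target = 2 * f1 + rng.normal(0, 1, size=n)

df = pd.DataFrame({"f1": f1, "f2": f2, "f3": f3, "target": target})
corr = df.corr()

print(corr["target"].drop("target"))  # feature-target: want these high
print(corr.loc["f1", "f2"])           # feature-feature: want this low (it isn't!)
```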

Multicollinearity

In Linear Regression, if two features are perfectly correlated (e.g., "Temperature in °C" and "Temperature in °F"), the matrix X^T X becomes singular and cannot be inverted. You must drop one.
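A tiny numpy demonstration of that singularity, using the °C/°F example with an intercept column:

```python
import numpy as np

temp_c = np.array([10.0, 15.0, 20.0, 25.0, 30.0])
temp_f = temp_c * 9 / 5 + 32  # exact linear function of temp_c

# Design matrix: intercept, °C, °F
X = np.column_stack([np.ones_like(temp_c), temp_c, temp_f])
xtx = X.T @ X

print(np.linalg.matrix_rank(xtx))  # 2, not 3: one column is redundant
# np.linalg.inv(xtx) now raises LinAlgError
# (or returns numerical garbage, depending on rounding)
```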

VIF (Variance Inflation Factor)

VIF measures how much the variance of a regression coefficient is inflated due to multicollinearity. VIF > 10 is a red flag. Calculate it as 1/(1 - R_i^2), where R_i^2 is the R-squared from regressing feature i on all other features.
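A sketch using statsmodels' variance_inflation_factor on simulated features (a, b, c are invented names, with b nearly duplicating a):

```python
import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(0)
n = 200

X = pd.DataFrame({
    "a": rng.normal(size=n),
    "c": rng.normal(size=n),
})
X["b"] = X["a"] + rng.normal(0, 0.2, size=n)  # highly collinear with a
X["const"] = 1.0  # VIF assumes the design matrix includes an intercept

for i, col in enumerate(X.columns[:-1]):
    print(col, round(variance_inflation_factor(X.values, i), 1))
# a and b come out well above 10; c stays near 1
```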

PCA Preprocessing

If features are highly correlated, PCA can transform them into uncorrelated principal components. This helps with numerical stability and can reduce overfitting.
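A short scikit-learn sketch: two synthetic, strongly correlated features go in, uncorrelated principal components come out.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
a = rng.normal(size=500)
X = np.column_stack([a, a + rng.normal(0, 0.3, size=500)])

print(np.corrcoef(X.T)[0, 1])  # high: raw features are redundant

pcs = PCA(n_components=2).fit_transform(X)
print(np.corrcoef(pcs.T)[0, 1])  # ~0: components are uncorrelated
```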