Introduction
We often ask: "Are these two variables related?" Standard statistics gives us Pearson Correlation, but that only measures linear relationships.
What if the relationship is Y = X²? A perfect parabola. Correlation might be near zero because it can't see non-linear patterns!
Mutual Information (MI) is the information-theoretic solution. It measures any kind of relationship, linear or non-linear, monotonic or chaotic. It asks: "By observing X, how many bits of uncertainty do I remove about Y?"
The Power of MI
Unlike correlation, MI equals zero if and only if X and Y are truly independent. It's the gold standard for detecting any form of statistical dependence.
Intuition: Shared Information
Think of entropy as a "circle" of uncertainty. X has its circle of uncertainty H(X). Y has its circle H(Y). If X and Y are related, their circles overlap.
- H(X): Total uncertainty in X
- H(Y): Total uncertainty in Y
- H(X, Y): Joint uncertainty (the union)
- I(X; Y): Mutual Information (the intersection). The information shared between X and Y.
The Key Insight
Mutual Information tells you: "If I know X, how much less uncertain am I about Y?" If I(X;Y) = 0, knowing X tells you nothing about Y. They are statistically independent.
Correlation vs Mutual Information
This is the critical distinction. Why bother with MI when we have correlation?
Pearson Correlation
Only detects linear relationships.
- Y = X²: correlation ≈ 0
- Y = |X| (symmetric): correlation ≈ 0
- Circular pattern: correlation ≈ 0
Mutual Information
Detects any dependence.
- Y = X²: High MI
- Y = |X|: High MI
- Circular pattern: High MI
The Bottom Line
Correlation = 0 only means no linear relationship. The variables might still be strongly connected.
I(X; Y) = 0 means true independence. No relationship of any kind exists.
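This gap is easy to demonstrate numerically. Below is a minimal sketch using NumPy and a simple histogram-based MI estimator; the 20-bin grid is an arbitrary choice, and the estimate is approximate, not exact.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 100_000)
y = x ** 2  # deterministic but non-linear: a perfect parabola

# Pearson correlation: near zero for a symmetric parabola
corr = np.corrcoef(x, y)[0, 1]

def mi_hist(x, y, bins=20):
    # Histogram estimate of I(X;Y) in bits from the binned joint distribution
    joint, _, _ = np.histogram2d(x, y, bins=bins)
    pxy = joint / joint.sum()
    px = pxy.sum(axis=1, keepdims=True)   # marginal p(x)
    py = pxy.sum(axis=0, keepdims=True)   # marginal p(y)
    nz = pxy > 0
    return float((pxy[nz] * np.log2(pxy[nz] / (px @ py)[nz])).sum())

print(f"correlation = {corr:.3f}")            # approximately zero
print(f"MI estimate = {mi_hist(x, y):.2f} bits")  # clearly positive
```

Correlation sees nothing, while the MI estimate is strongly positive: knowing X pins down Y.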
Interactive: Correlation vs MI
See how correlation fails to detect non-linear relationships that Mutual Information captures perfectly.
The Mathematical Formula
Mutual Information is the KL Divergence between the joint distribution p(x, y) and the product of marginals p(x)p(y):

I(X; Y) = D_KL( p(x, y) || p(x)p(y) ) = Σ_{x, y} p(x, y) log [ p(x, y) / (p(x)p(y)) ]
Why p(x)p(y)? Because if X and Y were independent, their joint distribution would factor as p(x, y) = p(x)p(y). MI measures how far reality is from independence.
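The discrete formula can be evaluated directly on a small joint table. The 2x2 probabilities below are made up for illustration; the cross-check against the entropy identity I(X; Y) = H(X) + H(Y) − H(X, Y) holds for any valid table.

```python
import numpy as np

# A small joint distribution p(x, y) over 2x2 outcomes (illustrative values)
pxy = np.array([[0.4, 0.1],
                [0.1, 0.4]])
px = pxy.sum(axis=1)   # marginal p(x)
py = pxy.sum(axis=0)   # marginal p(y)

# MI as KL(p(x,y) || p(x)p(y)), in bits
indep = np.outer(px, py)
mi = float((pxy * np.log2(pxy / indep)).sum())

# Cross-check against the entropy identity I = H(X) + H(Y) - H(X,Y)
H = lambda p: float(-(p * np.log2(p)).sum())
assert abs(mi - (H(px) + H(py) - H(pxy.ravel()))) < 1e-12
print(f"I(X;Y) = {mi:.4f} bits")
```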
For Continuous Variables
The sum becomes a double integral over the joint density:

I(X; Y) = ∫∫ p(x, y) log [ p(x, y) / (p(x)p(y)) ] dx dy
Interactive: Joint Distribution Landscape
Explore how the structure of the joint distribution relates to Mutual Information. High MI means the landscape is "sharp" or structured (knowing X constrains Y). Low MI means it's "flat" or grid-like (independence).
Observation: When there is structure (lines, circles, clusters), the joint distribution is 'sharper' than the product of marginals. MI is high.
Interactive: Information Bits
Visualize Mutual Information as shared bits in a data stream. When MI is high, the streams become synchronized (purple bits).
Key Properties
1. Non-Negativity
Mutual Information can never be negative. This follows directly from its definition as a KL Divergence, which is always non-negative (Gibbs' inequality).
Why intuitively? You cannot "lose" information by observing something. At worst, an observation tells you nothing new (MI = 0), but it can never make you more uncertain than before.
Equality condition: I(X; Y) = 0 if and only if p(x, y) = p(x)p(y) everywhere, meaning X and Y are statistically independent.
2. Symmetry
The information X reveals about Y is exactly the same as the information Y reveals about X. This is a major difference from KL Divergence, which is asymmetric.
Proof sketch: Using the entropy form: I(X; Y) = H(X) − H(X|Y) = H(Y) − H(Y|X). Both expressions equal H(X) + H(Y) − H(X, Y), which is symmetric in X and Y.
Practical implication: When measuring feature-target relationships, I(Feature; Target) equals I(Target; Feature). The direction doesn't matter for feature selection.
3. Zero Iff Independence
This is the definitive test for statistical independence. Unlike correlation (which can be zero even when variables are dependent), MI = 0 guarantees true independence.
Why correlation fails: Consider Y = X² where X is symmetric around zero. Correlation is zero (no linear relationship), but MI is high because knowing X completely determines Y.
The gold standard: MI captures all forms of dependence: linear, polynomial, periodic, chaotic, or any arbitrary functional relationship.
4. Self-Information Equals Entropy
The mutual information of a variable with itself equals its entropy. This makes intuitive sense: X shares all of its information with itself.
Derivation: I(X; X) = H(X) − H(X|X). Since knowing X tells you X exactly, H(X|X) = 0. Therefore I(X; X) = H(X).
Upper bound: This also gives us I(X; Y) ≤ min(H(X), H(Y)). You can't share more information than either variable contains.
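Properties 1 through 4 can all be checked numerically with a plug-in (empirical) estimator. The discrete data-generating process below is an arbitrary choice for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.integers(0, 4, 50_000)               # discrete X with 4 symbols
y = (x + rng.integers(0, 2, x.size)) % 4     # noisy function of X

def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def mi(a, b):
    # I(A;B) = H(A) + H(B) - H(A,B); encode each pair (a, b) as one label
    joint = a.astype(np.int64) * (int(b.max()) + 1) + b
    return entropy(a) + entropy(b) - entropy(joint)

print(f"I(X;X) = {mi(x, x):.3f}, H(X) = {entropy(x):.3f}")    # equal
print(f"I(X;Y) = {mi(x, y):.3f}, I(Y;X) = {mi(y, x):.3f}")    # symmetric
print(f"bound: {min(entropy(x), entropy(y)):.3f}")             # MI below it
```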
5. Data Processing Inequality
If Z is derived from Y (which is derived from X), then Z cannot contain more information about X than Y does. Processing can only lose information, never create it.
Deep learning insight: In a neural network, if layer 1 output is Y and layer 2 output is Z, then I(X; Z) ≤ I(X; Y). Each layer can only discard information, not add new information about the input.
Information Bottleneck: Good representations maximize I(Hidden; Label) while minimizing I(Hidden; Input). This is the theoretical foundation for compression in deep learning.
Equality condition: I(X; Z) = I(X; Y) only if Z is a sufficient statistic for X given Y, meaning no information is lost.
6. Chain Rule for Mutual Information
The information that two variables jointly share with Y equals the information from the first, plus the additional information from the second given the first.
Feature selection application: If X1 and X2 are redundant features, I(X1, X2; Y) ≈ I(X1; Y). Adding X2 provides no new information. This is why mRMR (minimum Redundancy Maximum Relevance) uses conditional MI.
General form: I(X1, X2; Y) = I(X1; Y) + I(X2; Y | X1)
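The chain rule is an exact algebraic identity in entropies, so it can be verified numerically. As a sketch, here is a synthetic XOR-style dataset (chosen for illustration, not from the examples above) where X1 alone tells you nothing about Y, yet X2 given X1 supplies a full bit:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 100_000
x1 = rng.integers(0, 2, n)
x2 = rng.integers(0, 2, n)
y = x1 ^ x2   # Y depends on both features jointly (XOR)

def H(*cols):
    # Joint entropy of one or more binary columns, via bit-packing
    key = np.zeros(n, dtype=np.int64)
    for c in cols:
        key = key * 2 + c
    _, counts = np.unique(key, return_counts=True)
    p = counts / n
    return float(-(p * np.log2(p)).sum())

i_joint = H(x1, x2) + H(y) - H(x1, x2, y)                      # I(X1,X2; Y)
i_x1 = H(x1) + H(y) - H(x1, y)                                 # I(X1; Y)
i_x2_given_x1 = H(x1, x2) + H(x1, y) - H(x1) - H(x1, x2, y)    # I(X2; Y | X1)
print(i_joint, i_x1 + i_x2_given_x1)   # chain rule: both sides agree
```

Note that i_x1 is near zero while i_joint is near one bit: MI of individual features can miss interactions, which is exactly what the conditional term captures.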
Connection to KL Divergence
Mutual Information is the KL Divergence from the "independence assumption" to reality.
This perspective shows why MI is always non-negative: KL Divergence is always non-negative, and it's zero only when the two distributions match (i.e., when X and Y are truly independent).
Real-World Examples
Weather Prediction
I(Temperature; Humidity) is high. Knowing today's temperature tells you a lot about humidity.
But I(Temperature; Tomorrow's Lottery Numbers) = 0. Completely independent!
Medical Diagnosis
I(Symptoms; Disease) quantifies how informative symptoms are. High MI means the symptom is a good diagnostic indicator.
Used in medical AI to rank symptom importance.
Gene Expression
I(Gene A; Gene B) reveals regulatory relationships. High MI suggests genes are co-regulated or in the same pathway.
Foundation of gene regulatory network inference.
ML Applications
Feature Selection
Have 1000 features? Calculate I(Feature; Target) for each feature. Pick the top k. Unlike correlation, this catches non-linear predictors.
Used in mRMR (minimum Redundancy Maximum Relevance) algorithm for optimal feature subsets.
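A sketch of MI-based feature ranking on synthetic data (the data-generating process and the 16-quantile-bin estimator are my choices for illustration). The label depends on the magnitude of a feature, a non-linear relationship correlation cannot see:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 50_000
f_nonlin = rng.standard_normal(n)
y = (np.abs(f_nonlin) > 1).astype(int)   # label depends on |feature|
f_noise = rng.standard_normal(n)         # irrelevant feature

def mi_bits(feat, label, bins=16):
    # Plug-in MI estimate: quantile-bin the feature, tabulate the joint
    edges = np.quantile(feat, np.linspace(0, 1, bins + 1)[1:-1])
    fb = np.digitize(feat, edges)
    joint = np.zeros((bins, 2))
    np.add.at(joint, (fb, label), 1)
    pxy = joint / n
    px = pxy.sum(axis=1, keepdims=True)
    py = pxy.sum(axis=0, keepdims=True)
    nz = pxy > 0
    return float((pxy[nz] * np.log2(pxy[nz] / (px * py)[nz])).sum())

print(f"|corr|: {abs(np.corrcoef(f_nonlin, y)[0, 1]):.3f}")   # ~ 0
print(f"MI (non-linear feature): {mi_bits(f_nonlin, y):.3f} bits")
print(f"MI (noise feature):      {mi_bits(f_noise, y):.3f} bits")
```

Ranking by MI puts the informative feature on top; ranking by |correlation| would discard it as noise.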
Clustering Evaluation (NMI)
Normalized Mutual Information compares predicted clusters to ground truth labels. NMI = 1 means perfect agreement; NMI = 0 means no better than random.
Standard metric for unsupervised learning benchmarks (available in scikit-learn as normalized_mutual_info_score).
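NMI can be computed from scratch in a few lines. The sketch below normalizes by the arithmetic mean of the two entropies, one common convention among several (max, min, and geometric mean are also used):

```python
import numpy as np

def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def nmi(a, b):
    a = np.asarray(a); b = np.asarray(b)
    h_joint = entropy(a * (int(b.max()) + 1) + b)   # H(A,B) via label pairing
    i = entropy(a) + entropy(b) - h_joint           # I(A;B)
    return i / ((entropy(a) + entropy(b)) / 2)      # arithmetic-mean norm

truth = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2])
perm  = np.array([2, 2, 2, 0, 0, 0, 1, 1, 1])   # same clusters, renamed
print(nmi(truth, perm))   # 1.0: label names differ, partition is identical
```

Because MI only looks at the joint distribution of label assignments, NMI is invariant to permuting cluster names, exactly what a clustering metric needs.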
Decision Trees (Information Gain)
The splitting criterion for ID3 and C4.5 is Information Gain, which is literally mutual information between the feature and the label. Trees choose the split that maximizes I(Feature; Label).
Also used in Random Forests for feature importance ranking.
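A minimal sketch of information gain for one candidate split, on a made-up 8-row dataset. It computes H(label) − H(label | feature), which is the MI between the binary feature and the label:

```python
import numpy as np

def entropy(p):
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

# Toy binary dataset (illustrative values): the feature splits labels
# fairly cleanly, but not perfectly
feature = np.array([0, 0, 0, 0, 1, 1, 1, 1])
label   = np.array([0, 0, 0, 1, 1, 1, 1, 0])

def info_gain(feature, label):
    # Information gain = H(label) - H(label | feature) = I(feature; label)
    _, counts = np.unique(label, return_counts=True)
    h_label = entropy(counts / label.size)
    h_cond = 0.0
    for v in np.unique(feature):
        sub = label[feature == v]
        _, c = np.unique(sub, return_counts=True)
        h_cond += (sub.size / label.size) * entropy(c / sub.size)
    return h_label - h_cond

print(f"information gain = {info_gain(feature, label):.4f} bits")
```

A tree builder evaluates this quantity for every candidate feature and picks the one with the highest gain at each node.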
Representation Learning (InfoNCE)
Contrastive learning methods like SimCLR, MoCo, and CLIP maximize a lower bound on MI between different views/augmentations of the same data. High MI = good representations.
Powers modern self-supervised learning (BERT, GPT pre-training conceptually related).
Neural Network Compression
Information Bottleneck Theory uses MI to understand deep learning: networks are hypothesized to maximize I(Hidden; Label) while minimizing I(Hidden; Input), one proposed explanation for generalization.
Theoretical framework for understanding why deep networks generalize.
Causality Detection
Transfer Entropy (conditional MI) detects causal direction: does X predict future Y better than Y predicts future X? Used in neuroscience, economics, climate science.
Granger causality is a linear approximation of this idea.