Information Theory

Mutual Information

The ultimate measure of dependence. How much does knowing X tell you about Y?

Introduction

We often ask: "Are these two variables related?" Standard statistics gives us Pearson Correlation, but that only measures linear relationships.

What if the relationship is y = x^2? A perfect parabola. If X is symmetric around zero, the correlation is exactly zero: correlation simply cannot see non-linear patterns!

Mutual Information (MI) is the information-theoretic solution. It measures any kind of relationship, linear or non-linear, monotonic or chaotic. It asks: "By observing X, how many bits of uncertainty do I remove about Y?"

The Power of MI

Unlike correlation, MI equals zero if and only if X and Y are truly independent. It's the gold standard for detecting any form of statistical dependence.

Intuition: Shared Information

Think of entropy as a "circle" of uncertainty. X has its circle of uncertainty H(X). Y has its circle H(Y). If X and Y are related, their circles overlap.

  • H(X): Total uncertainty in X
  • H(Y): Total uncertainty in Y
  • H(X, Y): Joint uncertainty (the union)
  • I(X; Y): Mutual Information (the intersection). The information shared between X and Y.

The Key Insight

Mutual Information tells you: "If I know X, how much less uncertain am I about Y?" If I(X;Y) = 0, knowing X tells you nothing about Y. They are statistically independent.

Correlation vs Mutual Information

This is the critical distinction. Why bother with MI when we have correlation?

Pearson Correlation

Only detects linear relationships.

  • y = x : \rho = 1.0
  • y = x^2 (symmetric) : \rho \approx 0
  • y = \sin(x) : \rho \approx 0

Mutual Information

Detects any dependence.

  • y = x : High MI
  • y = x^2 : High MI
  • y = \sin(x) : High MI

The Bottom Line

Correlation = 0 means no linear relationship. The variables might still be strongly connected.
MI = 0 means true independence. No relationship of any kind exists.
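The contrast above can be checked numerically. Below is a minimal sketch using only NumPy and synthetic data: sample X uniformly, set y = x^2, then compare the Pearson coefficient against a plug-in MI estimate built from a 2-D histogram (the 20-bin grid and sample size are arbitrary choices, not a prescription).

```python
# Compare Pearson correlation with a binned MI estimate for y = x^2.
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 100_000)
y = x ** 2

# Pearson correlation: near zero for this symmetric parabola.
rho = np.corrcoef(x, y)[0, 1]

# Plug-in MI estimate (in bits) from an empirical 2-D histogram.
counts, _, _ = np.histogram2d(x, y, bins=20)
pxy = counts / counts.sum()
px = pxy.sum(axis=1, keepdims=True)   # marginal of X
py = pxy.sum(axis=0, keepdims=True)   # marginal of Y
nz = pxy > 0                          # avoid log(0) on empty cells
mi = np.sum(pxy[nz] * np.log2(pxy[nz] / (px @ py)[nz]))

print(f"rho = {rho:.3f}")       # close to 0: correlation is blind
print(f"MI  = {mi:.2f} bits")   # clearly positive: dependence detected
```

The deterministic relationship is invisible to correlation but produces a large MI estimate.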

Interactive: Correlation vs MI

See how correlation fails to detect non-linear relationships that Mutual Information captures perfectly.

[Interactive demo for the relationship y = x^2: Pearson correlation reports \rho = 0.00 (fails, since it only detects linear relationships), while Mutual Information reports I = 1.8 (detects it, since MI captures any dependence). Correlation is blind to the symmetric parabola.]

The Mathematical Formula

Mutual Information is the KL Divergence between the joint distribution P(X, Y) and the product of marginals P(X)P(Y).

Why P(X)P(Y)? Because if X and Y were independent, their joint distribution would factor as P(X, Y) = P(X)P(Y). MI measures how far reality is from independence.

I(X; Y) = \sum_{x} \sum_{y} P(x, y) \log \left( \frac{P(x, y)}{P(x)P(y)} \right)

Alternative form using Entropy:

I(X; Y) = H(X) + H(Y) - H(X, Y)

For Continuous Variables

I(X; Y) = \int\int p(x,y) \log\left(\frac{p(x,y)}{p(x)p(y)}\right) \, dx \, dy
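Both discrete forms can be verified on a tiny example. The sketch below uses a made-up 2x2 joint table and computes MI twice, once from the double-sum definition and once from the entropy identity; the two numbers agree exactly.

```python
# Compute I(X;Y) for a small discrete joint table in two equivalent ways.
import numpy as np

pxy = np.array([[0.30, 0.10],
                [0.10, 0.50]])        # joint P(X, Y): rows = x, cols = y
px = pxy.sum(axis=1)                  # marginal P(X)
py = pxy.sum(axis=0)                  # marginal P(Y)

def H(p):
    """Shannon entropy in bits of a probability vector."""
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

# Definition: sum over x, y of P(x,y) log2( P(x,y) / (P(x)P(y)) )
mi_sum = sum(pxy[i, j] * np.log2(pxy[i, j] / (px[i] * py[j]))
             for i in range(2) for j in range(2))

# Entropy identity: H(X) + H(Y) - H(X, Y)
mi_ent = H(px) + H(py) - H(pxy.ravel())

print(round(mi_sum, 6), round(mi_ent, 6))  # the two forms agree
```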

Interactive: Joint Distribution Landscape

Explore how the structure of the joint distribution P(X,Y)P(X,Y) relates to Mutual Information. High MI means the landscape is "sharp" or structured (knowing X constrains Y). Low MI means it's "flat" or grid-like (independence).

[Interactive heatmap of the joint probability P(X,Y) with a noise/spread control. Example reading: H(X) = 4.30, H(Y) = 4.30, H(X,Y) = 7.84, I(X;Y) = 0.77. Observation: when there is structure (lines, circles, clusters), the joint distribution is "sharper" than the product of marginals, and MI is high.]

Interactive: Information Bits

Visualize Mutual Information as shared bits in a data stream. When MI is high, the streams become synchronized (purple bits).

[Interactive bit-stream visualization. Information decomposition: the joint entropy H(X,Y) splits into H(X|Y) (unique to X), I(X;Y) (shared), and H(Y|X) (unique to Y), so H(X,Y) = H(X|Y) + I(X;Y) + H(Y|X).]

Key Properties

1. Non-Negativity

I(X; Y) \geq 0

Mutual Information can never be negative. This follows directly from its definition as a KL Divergence, which is always non-negative (Gibbs' inequality).

Why intuitively? You cannot "lose" information by observing something. At worst, an observation tells you nothing new (MI = 0), but it can never make you more uncertain than before.

Equality condition: I(X; Y) = 0 if and only if P(X, Y) = P(X)P(Y) everywhere, meaning X and Y are statistically independent.
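A quick sketch of the equality condition: build a joint that factorises exactly as P(x)P(y) (the marginals below are made up) and confirm that every log term vanishes.

```python
# When the joint factorises as P(x)P(y), MI is exactly zero.
import numpy as np

px = np.array([0.2, 0.8])
py = np.array([0.5, 0.3, 0.2])
pxy = np.outer(px, py)     # independent joint: P(x,y) = P(x)P(y)

# Every ratio P(x,y) / (P(x)P(y)) equals 1, so every log2 term is 0.
mi = np.sum(pxy * np.log2(pxy / np.outer(px, py)))
print(mi)  # 0.0
```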

2. Symmetry

I(X; Y) = I(Y; X)

The information X reveals about Y is exactly the same as the information Y reveals about X. This is a major difference from KL Divergence, which is asymmetric.

Proof sketch: Using the entropy form, I(X;Y) = H(X) - H(X|Y) = H(Y) - H(Y|X). Both expressions equal H(X) + H(Y) - H(X,Y), which is symmetric in X and Y.

Practical implication: When measuring feature-target relationships, I(\text{Feature}; \text{Label}) equals I(\text{Label}; \text{Feature}). The direction doesn't matter for feature selection.

3. Zero Iff Independence

I(X; Y) = 0 \iff X \perp\!\!\!\perp Y

This is the definitive test for statistical independence. Unlike correlation (which can be zero even when variables are dependent), MI = 0 guarantees true independence.

Why correlation fails: Consider Y = X^2 where X is symmetric around zero. Correlation is zero (no linear relationship), but MI is high because knowing X completely determines Y.

The gold standard: MI captures all forms of dependence: linear, polynomial, periodic, chaotic, or any arbitrary functional relationship.

4. Self-Information Equals Entropy

I(X; X) = H(X)

The mutual information of a variable with itself equals its entropy. This makes intuitive sense: X shares all of its information with itself.

Derivation: I(X; X) = H(X) - H(X|X). Since knowing X tells you X exactly, H(X|X) = 0. Therefore I(X; X) = H(X).

Upper bound: This also gives us I(X; Y) \leq \min(H(X), H(Y)). You can't share more information than either variable contains.
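This can be checked directly: the joint distribution of X with itself is a diagonal table, and its MI comes out to exactly H(X). The three-outcome distribution below is made up for illustration.

```python
# The joint of X with itself is diagonal, and its MI equals H(X).
import numpy as np

px = np.array([0.5, 0.25, 0.25])
pxx = np.diag(px)              # P(X=i, X=j) = px[i] if i == j else 0

def H(p):
    """Shannon entropy in bits of a probability vector."""
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

nz = pxx > 0                   # only diagonal cells carry mass
mi = np.sum(pxx[nz] * np.log2(pxx[nz] / np.outer(px, px)[nz]))
print(mi, H(px))               # both equal 1.5 bits
```

On the diagonal each term is p_i log2(p_i / p_i^2) = -p_i log2 p_i, which sums to H(X).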

5. Data Processing Inequality

X \to Y \to Z \implies I(X; Z) \leq I(X; Y)

If Z is derived from Y (which is derived from X), then Z cannot contain more information about X than Y does. Processing can only lose information, never create it.

Deep learning insight: In a neural network, if layer 1's output is Y and layer 2's output is Z, then I(\text{Input}; Z) \leq I(\text{Input}; Y). Each layer can only discard information, not add new information about the input.

Information Bottleneck: Good representations maximize I(Z; \text{Label}) while minimizing I(Z; \text{Input}). This is the theoretical foundation for compression in deep learning.

Equality condition: I(X; Z) = I(X; Y) only if Z is a sufficient statistic for X given Y, meaning no information is lost.
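A sketch of the inequality using a chain of binary symmetric channels (the 10% and 20% flip probabilities are arbitrary choices): Z is a further-degraded copy of Y, so it retains strictly less information about X.

```python
# Data processing inequality on a chain X -> Y -> Z of noisy channels.
import numpy as np

def bsc(eps):
    """Binary symmetric channel: row = input, col = output."""
    return np.array([[1 - eps, eps], [eps, 1 - eps]])

def mi(pxy):
    """MI (bits) of a 2-D joint probability table."""
    px = pxy.sum(axis=1, keepdims=True)
    py = pxy.sum(axis=0, keepdims=True)
    nz = pxy > 0
    return np.sum(pxy[nz] * np.log2(pxy[nz] / (px @ py)[nz]))

px = np.array([0.5, 0.5])             # fair coin input
p_xy = np.diag(px) @ bsc(0.1)         # joint P(X, Y): 10% flip noise
p_xz = p_xy @ bsc(0.2)                # joint P(X, Z): a further 20% flip

print(f"I(X;Y) = {mi(p_xy):.3f} bits")
print(f"I(X;Z) = {mi(p_xz):.3f} bits")  # strictly smaller
```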

6. Chain Rule for Mutual Information

I(X_1, X_2; Y) = I(X_1; Y) + I(X_2; Y | X_1)

The information that two variables jointly share with Y equals the information from the first, plus the additional information from the second given the first.

Feature selection application: If X_1 and X_2 are redundant features, I(X_2; Y | X_1) \approx 0. Adding X_2 provides no new information. This is why mRMR (minimum Redundancy Maximum Relevance) uses conditional MI.

General form: I(X_1, ..., X_n; Y) = \sum_{i=1}^{n} I(X_i; Y | X_1, ..., X_{i-1})
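The chain rule is an exact identity, so it can be verified numerically on a random 2x2x2 joint table; the two sides agree to floating-point precision.

```python
# Verify I(X1,X2;Y) = I(X1;Y) + I(X2;Y|X1) on a random binary joint.
import numpy as np

rng = np.random.default_rng(42)
p = rng.random((2, 2, 2))
p /= p.sum()                          # joint P(X1, X2, Y)

def mi(pab):
    """MI (bits) of a 2-D joint probability table."""
    pa = pab.sum(axis=1, keepdims=True)
    pb = pab.sum(axis=0, keepdims=True)
    nz = pab > 0
    return np.sum(pab[nz] * np.log2(pab[nz] / (pa @ pb)[nz]))

# Left side: treat (X1, X2) jointly as one 4-valued variable.
lhs = mi(p.reshape(4, 2))

# Right side: I(X1;Y) plus the conditional term I(X2;Y | X1).
i_x1_y = mi(p.sum(axis=1))            # marginalise out X2
p_x1 = p.sum(axis=(1, 2))
i_x2_y_given_x1 = sum(p_x1[a] * mi(p[a] / p_x1[a]) for a in range(2))
rhs = i_x1_y + i_x2_y_given_x1

print(round(lhs, 6), round(rhs, 6))   # equal, as the chain rule promises
```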

Connection to KL Divergence

I(X; Y) = D_{KL}(P(X,Y) || P(X)P(Y))

Mutual Information is the KL Divergence from the "independence assumption" to reality.

This perspective shows why MI is always non-negative: KL Divergence is always non-negative, and it's zero only when the two distributions match (i.e., when X and Y are truly independent).

Alternative Views

I(X;Y) = H(X) - H(X|Y) (reduction in uncertainty about X given Y)
I(X;Y) = H(Y) - H(Y|X) (reduction in uncertainty about Y given X)

Real-World Examples

Weather Prediction

I(Temperature; Humidity) is high. Knowing today's temperature tells you a lot about humidity.

But I(Temperature; Tomorrow's Lottery Numbers) = 0. Completely independent!

Medical Diagnosis

I(Symptoms; Disease) quantifies how informative symptoms are. High MI means the symptom is a good diagnostic indicator.

Used in medical AI to rank symptom importance.

Gene Expression

I(Gene A; Gene B) reveals regulatory relationships. High MI suggests genes are co-regulated or in the same pathway.

Foundation of gene regulatory network inference.

ML Applications

Feature Selection

Have 1000 features? Calculate I(X_i; Y_{\text{target}}) for each feature. Pick the top k. Unlike correlation, this catches non-linear predictors.

Used in mRMR (minimum Redundancy Maximum Relevance) algorithm for optimal feature subsets.
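A sketch of MI-based feature ranking on hypothetical data, using a binned plug-in MI estimator rather than a library routine: the label depends on x1 through a non-linear rule that correlation misses entirely, while x2 is pure noise.

```python
# Rank features by MI with a binary label; correlation would rank both near zero.
import numpy as np

rng = np.random.default_rng(1)
n = 50_000
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)                    # pure noise feature
y = (np.abs(x1) > 1.0).astype(int)         # non-linear rule: correlation misses it

def mi_feature(x, y, bins=20):
    """Plug-in MI (bits) between a binned feature and a binary label."""
    counts, _, _ = np.histogram2d(x, y, bins=(bins, 2))
    pxy = counts / counts.sum()
    px = pxy.sum(axis=1, keepdims=True)
    py = pxy.sum(axis=0, keepdims=True)
    nz = pxy > 0
    return np.sum(pxy[nz] * np.log2(pxy[nz] / (px @ py)[nz]))

scores = {"x1": mi_feature(x1, y), "x2": mi_feature(x2, y)}
corr = {name: np.corrcoef(f, y)[0, 1] for name, f in [("x1", x1), ("x2", x2)]}
print(scores)  # x1 scores high, x2 near zero
print(corr)    # both correlations near zero: ranking by |rho| would discard x1
```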

Clustering Evaluation (NMI)

Normalized Mutual Information compares predicted clusters to ground truth labels. NMI = 1 means perfect agreement; NMI = 0 means no better than random.

Standard metric for unsupervised learning benchmarks (available in scikit-learn as normalized_mutual_info_score).
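A sketch of NMI from scratch, normalizing by the arithmetic mean of the two entropies (one common convention). Because MI is computed between label assignments, relabeling a clustering does not change its score.

```python
# NMI between two labelings; invariant to permuting the label names.
import numpy as np

def nmi(a, b):
    """Normalized mutual information (arithmetic-mean normalization)."""
    ca = np.unique(a, return_inverse=True)[1]   # map labels to 0..k-1
    cb = np.unique(b, return_inverse=True)[1]
    pxy = np.zeros((ca.max() + 1, cb.max() + 1))
    np.add.at(pxy, (ca, cb), 1)                 # contingency counts
    pxy /= pxy.sum()
    px, py = pxy.sum(1), pxy.sum(0)
    H = lambda p: -np.sum(p[p > 0] * np.log2(p[p > 0]))
    nz = pxy > 0
    mi = np.sum(pxy[nz] * np.log2(pxy[nz] / np.outer(px, py)[nz]))
    return mi / ((H(px) + H(py)) / 2)

truth = [0, 0, 0, 1, 1, 1]
perfect = [1, 1, 1, 0, 0, 0]        # same partition, labels permuted
print(nmi(truth, perfect))          # 1.0 — NMI ignores label names
```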

Decision Trees (Information Gain)

The splitting criterion for ID3 is Information Gain, which is literally the mutual information between the feature and the label (C4.5 uses a normalized variant, the gain ratio). Trees split to maximize I(\text{Feature}; \text{Label}).

Also used in Random Forests for feature importance ranking.
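The identity can be shown on a made-up toy dataset: the information gain of a split, computed the decision-tree way (parent entropy minus weighted child entropies), equals the MI computed from the joint table.

```python
# Information gain of a binary split equals I(Feature; Label).
import numpy as np

# Hypothetical toy data: a binary feature vs. a binary label.
feature = np.array([1, 1, 1, 0, 0, 0, 0, 0])
label   = np.array([0, 0, 1, 1, 1, 1, 0, 1])

def H(p):
    """Shannon entropy in bits of a probability vector."""
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def entropy_of(labels):
    _, counts = np.unique(labels, return_counts=True)
    return H(counts / counts.sum())

# Decision-tree view: H(Label) minus the split-weighted child entropies.
gain = entropy_of(label) - sum(
    (feature == v).mean() * entropy_of(label[feature == v]) for v in (0, 1))

# Information-theory view: MI from the 2x2 joint table.
pxy = np.zeros((2, 2))
np.add.at(pxy, (feature, label), 1)
pxy /= pxy.sum()
px, py = pxy.sum(1), pxy.sum(0)
nz = pxy > 0
mi = np.sum(pxy[nz] * np.log2(pxy[nz] / np.outer(px, py)[nz]))

print(round(gain, 6), round(mi, 6))  # the same number
```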

Representation Learning (InfoNCE)

Contrastive learning methods like SimCLR, MoCo, and CLIP maximize a lower bound on MI between different views/augmentations of the same data. High MI = good representations.

Powers modern self-supervised learning; BERT and GPT pre-training are conceptually related.

Neural Network Compression

Information Bottleneck theory uses MI to understand deep learning: networks are hypothesized to maximize I(Hidden; Label) while minimizing I(Hidden; Input).

Theoretical framework for understanding why deep networks generalize.

Causality Detection

Transfer Entropy (conditional MI) detects causal direction: does X predict future Y better than Y predicts future X? Used in neuroscience, economics, climate science.

Granger causality is a linear approximation of this idea.