The Law of Large Numbers

The guarantee that, eventually, the data reveals the truth.

Introduction

The Law of Large Numbers (LLN) is the anchor of statistics. It states a simple but powerful truth: as you collect more data, the sample average converges to the true expected value.

The Core Promise

$\bar{X}_n \xrightarrow{n \to \infty} \mu$

Sample mean approaches population mean as sample size grows

Without LLN, machine learning would be impossible. We assume that training loss approximates true generalization error. LLN is the mathematical license for that assumption.

The Casino Intuition

Why Casinos Always Win

Bet on Red (Roulette)

Win: 18/38 ≈ 47.4%

Lose: 20/38 ≈ 52.6%

Expected Value

E[X] = (+1)(0.474) + (-1)(0.526) = -$0.052 per bet

1 game: Player might win. Luck matters.
10 games: Still volatile. Streaks happen.
1,000,000 games: Casino's average profit per game ≈ $0.052. Luck is irrelevant.

The casino is not gambling. It is running a business based on LLN.
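This convergence is easy to check numerically. A minimal Python sketch (the `simulate_roulette` helper is illustrative, not from any library): simulate even-money bets on red and watch the player's average profit per bet settle near the theoretical $-2/38 \approx -\$0.05$.

```python
import random

def simulate_roulette(n_bets, seed=0):
    """Average player profit per bet over n_bets even-money bets on red.

    Red wins on 18 of 38 pockets; the player gains +1 on a win, -1 on a loss.
    """
    rng = random.Random(seed)
    total = 0
    for _ in range(n_bets):
        total += 1 if rng.random() < 18 / 38 else -1
    return total / n_bets

# Small n: anything can happen. Large n: the house edge emerges.
for n in (10, 1_000, 1_000_000):
    print(n, simulate_roulette(n))
```

At n = 10 the average can easily be positive; at n = 1,000,000 it is pinned near -0.053, which is why the casino's revenue is predictable.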

Mathematical Statement

Let $X_1, X_2, \dots, X_n$ be i.i.d. random variables with mean $\mu$. The sample mean is:

$\bar{X}_n = \frac{1}{n} \sum_{i=1}^n X_i$

Sample mean = sum of observations divided by count

Law of Large Numbers

As n approaches infinity, the sample mean converges to the true mean:

$\bar{X}_n \to \mu \quad \text{as} \quad n \to \infty$

Interactive: Watch Convergence

Try different distributions and sample sizes. Notice how the running average stabilizes around the true mean as n grows. Small n = noisy. Large n = stable.

[Interactive simulation: draw up to 2,000 samples and watch the running sample mean converge to the true mean μ = 0.5.]

Notice how the "swings" (variance) are huge at the start (small N),
but the line inevitably tightens around the True Mean as N grows.
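The same experiment can be reproduced outside the widget. A short Python sketch (`running_means` is an illustrative name) tracking the running average of Uniform(0, 1) draws, whose true mean is 0.5:

```python
import random

def running_means(n, seed=1):
    """Running sample means of n Uniform(0, 1) draws (true mean = 0.5)."""
    rng = random.Random(seed)
    total, means = 0.0, []
    for i in range(1, n + 1):
        total += rng.random()
        means.append(total / i)
    return means

means = running_means(2000)
# Early entries swing widely; late entries hug 0.5.
print(means[9], means[99], means[-1])
```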

Why It Works: Variance Reduction

The intuition comes from looking at the variance of the sample mean.

Start with variance of sample mean:

$Var(\bar{X}_n) = Var\left(\frac{1}{n} \sum_{i=1}^n X_i\right)$

For independent variables:

$= \frac{1}{n^2} \sum_{i=1}^n Var(X_i) = \frac{1}{n^2} \cdot n\sigma^2$

Result:

$Var(\bar{X}_n) = \frac{\sigma^2}{n}$

The Key Insight

As $n \to \infty$, the variance $\to 0$. A random variable with zero variance is a constant. Therefore, the sample mean becomes the constant $\mu$.
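The $\sigma^2/n$ law is directly observable. A quick empirical check in Python (`sample_mean_variance` is a made-up helper name): for Uniform(0, 1), $\sigma^2 = 1/12$, so the variance of the sample mean should track $1/(12n)$.

```python
import random
import statistics

def sample_mean_variance(n, trials=5000, seed=2):
    """Empirical variance of the mean of n Uniform(0, 1) draws, over many trials."""
    rng = random.Random(seed)
    means = [sum(rng.random() for _ in range(n)) / n for _ in range(trials)]
    return statistics.pvariance(means)

sigma2 = 1 / 12  # variance of Uniform(0, 1)
for n in (1, 10, 100):
    # Empirical variance vs. theoretical sigma^2 / n
    print(n, sample_mean_variance(n), sigma2 / n)
```

Each tenfold increase in $n$ cuts the variance of the average by a factor of ten.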

Weak vs Strong LLN

There are two versions with different mathematical guarantees.

Weak LLN

Convergence in Probability

$\lim_{n\to\infty} P(|\bar{X}_n - \mu| > \epsilon) = 0$

For any margin $\epsilon > 0$, the probability of being farther than $\epsilon$ from $\mu$ goes to zero.

Strong LLN

Almost Sure Convergence

$P\left(\lim_{n\to\infty} \bar{X}_n = \mu\right) = 1$

The sample average converges with probability 1.

For most ML applications, the distinction does not matter. Both guarantee convergence.
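The weak LLN's statement can be estimated directly: fix a margin $\epsilon$ and measure how often the sample mean lands outside it. A Python sketch (function and parameter names are illustrative):

```python
import random

def prob_far_from_mu(n, eps=0.05, trials=4000, seed=3):
    """Estimate P(|mean of n Uniform(0,1) draws - 0.5| > eps) by simulation."""
    rng = random.Random(seed)
    far = 0
    for _ in range(trials):
        mean = sum(rng.random() for _ in range(n)) / n
        if abs(mean - 0.5) > eps:
            far += 1
    return far / trials

# The probability of a large deviation shrinks toward zero as n grows.
for n in (10, 100, 1000):
    print(n, prob_far_from_mu(n))
```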

LLN vs CLT

These are often confused. They describe different aspects of the same process.

Theorem | What it says | Analogy
LLN | Sample mean converges to true mean | Where the arrow lands (the target)
CLT | Distribution of sample means is Normal | The shape of the arrow pattern

See Central Limit Theorem for the distribution story.

The Gambler's Fallacy

The Mistake

"I got 10 heads in a row. LLN says it balances to 50%, so tails is 'due' next."

Why It's Wrong

The coin has no memory. LLN works by dilution, not compensation.

Example

After 10 heads: 10H, 0T = 100% heads
Flip 1,000 more (fair): ~510H, ~500T
New ratio: 510/1010 ≈ 50.5%

The streak did not disappear. It just became statistically insignificant.

Interactive: Gambler's Fallacy Demo

Start with a streak of heads, then flip more. Watch how the ratio approaches 50% through dilution, not correction.

[Interactive simulation: start with 10 heads in a row (100%), then add up to 500 fair flips. The heads percentage falls toward the 50% target while the absolute head surplus persists.]

Observed: The percentage drops toward 50%, but heads still outnumber tails by roughly the original streak of 10.

The universe didn't generate extra tails to "fix" the streak. It just buried the streak under a mountain of new, normal data. That is Dilution.
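Dilution fits in one function. A Python sketch (`dilute` is an illustrative name): start with a 10-head streak, add fair flips, and observe that the percentage approaches 50% while the head surplus is never "corrected away".

```python
import random

def dilute(streak_heads=10, extra_flips=1000, seed=4):
    """Begin with a streak of heads, then add fair flips.

    Returns (heads_fraction, heads_minus_tails). The fraction is diluted
    toward 0.5, but the surplus stays near +streak_heads in expectation.
    """
    rng = random.Random(seed)
    heads, tails = streak_heads, 0
    for _ in range(extra_flips):
        if rng.random() < 0.5:
            heads += 1
        else:
            tails += 1
    return heads / (heads + tails), heads - tails

pct, surplus = dilute()
print(pct, surplus)  # pct near 0.5; surplus fluctuates around +10
```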

ML Applications

Monte Carlo Methods

Replace intractable integrals with sample averages:

$\int f(x)\,p(x)\,dx \approx \frac{1}{N} \sum_{i=1}^N f(x_i), \quad x_i \sim p$

Used in MCMC, Reinforcement Learning (value estimation), Bayesian inference.
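A minimal Monte Carlo sketch in Python (assuming $X \sim \text{Uniform}(0,1)$, so $p(x) = 1$; the `mc_expectation` helper is illustrative): estimating $\int_0^1 x^2\,dx = 1/3$ as a sample average.

```python
import random

def mc_expectation(f, n, seed=5):
    """Monte Carlo estimate of E[f(X)] for X ~ Uniform(0, 1)."""
    rng = random.Random(seed)
    return sum(f(rng.random()) for _ in range(n)) / n

# LLN guarantees this sample average converges to the integral, 1/3.
est = mc_expectation(lambda x: x * x, 100_000)
print(est)
```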

Empirical Risk Minimization

Training loss approximates true generalization loss:

$\frac{1}{n} \sum_{i=1}^n L(f(x_i), y_i) \approx E[L(f(X), Y)]$

The entire justification for training on finite datasets.

Stochastic Gradient Descent

Mini-batch gradient is an unbiased estimate of full gradient. Over many steps, the noise averages out. SGD converges because of LLN.
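A sketch of that unbiasedness in Python (toy 1-D linear model; all names are made up for illustration): the average of many mini-batch gradients matches the full-batch gradient.

```python
import random

# Toy dataset for a 1-D linear model y ≈ w * x, with true slope 3.
rng = random.Random(6)
data = [(x, 3.0 * x + rng.gauss(0, 0.1)) for x in [i / 50 for i in range(50)]]

def grad(w, batch):
    """Gradient of mean squared error (w*x - y)^2 over a batch."""
    return sum(2 * x * (w * x - y) for x, y in batch) / len(batch)

w = 0.0
full = grad(w, data)  # full-batch gradient
# Average many random mini-batch gradients: by LLN it converges to `full`.
mb_mean = sum(grad(w, rng.sample(data, 8)) for _ in range(20_000)) / 20_000
print(full, mb_mean)
```

Any single mini-batch gradient is noisy, but its expectation is the full gradient, which is what lets SGD follow the true descent direction on average.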

AlphaGo & MCTS

Cannot compute exact game tree values. Instead, play thousands of random games from a position. Average outcome converges to true value of the position.