The Law of Large Numbers

The guarantee that, eventually, the data reveals the truth.

Introduction

The Law of Large Numbers (LLN) is the anchor of statistics. It states a simple but powerful truth: as you collect more data, the sample average converges to the true expected value.

The Core Promise

$\bar{X}_n \xrightarrow{n \to \infty} \mu$

Sample mean approaches population mean as sample size grows

Without LLN, machine learning would be impossible. We assume that training loss approximates true generalization error. LLN is the mathematical license for that assumption.

The Casino Intuition

Why Casinos Always Win

Bet on Red (Roulette)

Win: 18/38 ≈ 47.4%

Lose: 20/38 ≈ 52.6%

Expected Value

E[X] = (+1)(0.474) + (-1)(0.526) = -$0.052 per bet

1 game: Player might win. Luck matters.
10 games: Still volatile. Streaks happen.
1,000,000 games: Casino's average profit per game ≈ $0.052. Luck is irrelevant.

The casino is not gambling. It is running a business based on LLN.
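This convergence is easy to check numerically. A minimal Python sketch (the `simulate_roulette` helper is illustrative, not from any library): simulate even-money bets on red and watch the player's average profit per bet settle near the theoretical $-2/38 \approx -\$0.05$.

```python
import random

def simulate_roulette(n_bets, seed=0):
    """Average player profit per bet over n_bets even-money bets on red.

    Red wins on 18 of 38 pockets; the player gains +1 on a win, -1 on a loss.
    """
    rng = random.Random(seed)
    total = 0
    for _ in range(n_bets):
        total += 1 if rng.random() < 18 / 38 else -1
    return total / n_bets

# Small n: anything can happen. Large n: the house edge emerges.
for n in (10, 1_000, 1_000_000):
    print(n, simulate_roulette(n))
```

At n = 10 the average can easily be positive; at n = 1,000,000 it is pinned near -0.053, which is why the casino's revenue is predictable.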

Mathematical Statement

Let $X_1, X_2, \dots, X_n$ be i.i.d. random variables with mean $\mu$. The sample mean is:

$\bar{X}_n = \frac{1}{n} \sum_{i=1}^n X_i$

Sample mean = sum of observations divided by count

Law of Large Numbers

As n approaches infinity, the sample mean converges to the true mean:

$\bar{X}_n \to \mu \quad \text{as} \quad n \to \infty$

Interactive: Watch Convergence

Try different distributions and sample sizes. Notice how the running average stabilizes around the true mean as n grows. Small n = noisy. Large n = stable.

[Interactive simulation: draw up to 2,000 samples and watch the running sample mean converge to the true mean μ = 0.5.]

Notice how the "swings" (variance) are huge at the start (small N),
but the line inevitably tightens around the True Mean as N grows.
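The same experiment can be reproduced outside the widget. A short Python sketch (`running_means` is an illustrative name) tracking the running average of Uniform(0, 1) draws, whose true mean is 0.5:

```python
import random

def running_means(n, seed=1):
    """Running sample means of n Uniform(0, 1) draws (true mean = 0.5)."""
    rng = random.Random(seed)
    total, means = 0.0, []
    for i in range(1, n + 1):
        total += rng.random()
        means.append(total / i)
    return means

means = running_means(2000)
# Early entries swing widely; late entries hug 0.5.
print(means[9], means[99], means[-1])
```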

Why It Works: Variance Reduction

The intuition comes from looking at the variance of the sample mean.

Start with variance of sample mean:

$Var(\bar{X}_n) = Var\left(\frac{1}{n} \sum_{i=1}^n X_i\right)$

For independent variables:

$= \frac{1}{n^2} \sum_{i=1}^n Var(X_i) = \frac{1}{n^2} \cdot n\sigma^2$

Result:

$Var(\bar{X}_n) = \frac{\sigma^2}{n}$

The Key Insight

As $n \to \infty$, the variance $\to 0$. A random variable with zero variance is a constant. Therefore, the sample mean becomes the constant $\mu$.
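The $\sigma^2/n$ law is directly observable. A quick empirical check in Python (`sample_mean_variance` is a made-up helper name): for Uniform(0, 1), $\sigma^2 = 1/12$, so the variance of the sample mean should track $1/(12n)$.

```python
import random
import statistics

def sample_mean_variance(n, trials=5000, seed=2):
    """Empirical variance of the mean of n Uniform(0, 1) draws, over many trials."""
    rng = random.Random(seed)
    means = [sum(rng.random() for _ in range(n)) / n for _ in range(trials)]
    return statistics.pvariance(means)

sigma2 = 1 / 12  # variance of Uniform(0, 1)
for n in (1, 10, 100):
    # Empirical variance vs. theoretical sigma^2 / n
    print(n, sample_mean_variance(n), sigma2 / n)
```

Each tenfold increase in $n$ cuts the variance of the average by a factor of ten.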

Weak vs Strong LLN

There are two versions with different mathematical guarantees.

Weak LLN

Convergence in Probability

$\lim_{n\to\infty} P(|\bar{X}_n - \mu| > \epsilon) = 0$

For any margin $\epsilon > 0$, the probability of being farther than $\epsilon$ from $\mu$ goes to zero.

Strong LLN

Almost Sure Convergence

$P\left(\lim_{n\to\infty} \bar{X}_n = \mu\right) = 1$

The sample average converges with probability 1.

For most ML applications, the distinction does not matter. Both guarantee convergence.
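The weak LLN's statement can be estimated directly: fix a margin $\epsilon$ and measure how often the sample mean lands outside it. A Python sketch (function and parameter names are illustrative):

```python
import random

def prob_far_from_mu(n, eps=0.05, trials=4000, seed=3):
    """Estimate P(|mean of n Uniform(0,1) draws - 0.5| > eps) by simulation."""
    rng = random.Random(seed)
    far = 0
    for _ in range(trials):
        mean = sum(rng.random() for _ in range(n)) / n
        if abs(mean - 0.5) > eps:
            far += 1
    return far / trials

# The probability of a large deviation shrinks toward zero as n grows.
for n in (10, 100, 1000):
    print(n, prob_far_from_mu(n))
```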

LLN vs CLT

These are often confused. They describe different aspects of the same process.

Theorem | What it says | Analogy
LLN | Sample mean converges to true mean | Where the arrow lands (the target)
CLT | Distribution of sample means is Normal | The shape of the arrow pattern

See Central Limit Theorem for the distribution story.

The Gambler's Fallacy

The Mistake

"I got 10 heads in a row. LLN says it balances to 50%, so tails is 'due' next."

Why It's Wrong

The coin has no memory. LLN works by dilution, not compensation.

Example

After 10 heads: 10H, 0T = 100% heads
Flip 1,000 more (fair): ~510H, ~500T
New ratio: 510/1010 ≈ 50.5%

The streak did not disappear. It just became statistically insignificant.

Interactive: Gambler's Fallacy Demo

Start with a streak of heads, then flip more. Watch how the ratio approaches 50% through dilution, not correction.

[Interactive simulation: start with 10 heads in a row (100%), then add up to 500 fair flips. The heads percentage falls toward the 50% target while the absolute head surplus persists.]

Observed: The percentage drops toward 50%, but heads still outnumber tails by roughly the original streak of 10.

The universe didn't generate extra tails to "fix" the streak. It just buried the streak under a mountain of new, normal data. That is Dilution.
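Dilution fits in one function. A Python sketch (`dilute` is an illustrative name): start with a 10-head streak, add fair flips, and observe that the percentage approaches 50% while the head surplus is never "corrected away".

```python
import random

def dilute(streak_heads=10, extra_flips=1000, seed=4):
    """Begin with a streak of heads, then add fair flips.

    Returns (heads_fraction, heads_minus_tails). The fraction is diluted
    toward 0.5, but the surplus stays near +streak_heads in expectation.
    """
    rng = random.Random(seed)
    heads, tails = streak_heads, 0
    for _ in range(extra_flips):
        if rng.random() < 0.5:
            heads += 1
        else:
            tails += 1
    return heads / (heads + tails), heads - tails

pct, surplus = dilute()
print(pct, surplus)  # pct near 0.5; surplus fluctuates around +10
```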

ML Applications

Monte Carlo Methods

Replace intractable integrals with sample averages:

$\int f(x)\,p(x)\,dx \approx \frac{1}{N} \sum_{i=1}^N f(x_i), \quad x_i \sim p$

Used in MCMC, Reinforcement Learning (value estimation), Bayesian inference.
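A minimal Monte Carlo sketch in Python (assuming $X \sim \text{Uniform}(0,1)$, so $p(x) = 1$; the `mc_expectation` helper is illustrative): estimating $\int_0^1 x^2\,dx = 1/3$ as a sample average.

```python
import random

def mc_expectation(f, n, seed=5):
    """Monte Carlo estimate of E[f(X)] for X ~ Uniform(0, 1)."""
    rng = random.Random(seed)
    return sum(f(rng.random()) for _ in range(n)) / n

# LLN guarantees this sample average converges to the integral, 1/3.
est = mc_expectation(lambda x: x * x, 100_000)
print(est)
```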

Empirical Risk Minimization

Training loss approximates true generalization loss:

$\frac{1}{n} \sum_{i=1}^n L(f(x_i), y_i) \approx E[L(f(X), Y)]$

The entire justification for training on finite datasets.

Stochastic Gradient Descent

Mini-batch gradient is an unbiased estimate of full gradient. Over many steps, the noise averages out. SGD converges because of LLN.
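A sketch of that unbiasedness in Python (toy 1-D linear model; all names are made up for illustration): the average of many mini-batch gradients matches the full-batch gradient.

```python
import random

# Toy dataset for a 1-D linear model y ≈ w * x, with true slope 3.
rng = random.Random(6)
data = [(x, 3.0 * x + rng.gauss(0, 0.1)) for x in [i / 50 for i in range(50)]]

def grad(w, batch):
    """Gradient of mean squared error (w*x - y)^2 over a batch."""
    return sum(2 * x * (w * x - y) for x, y in batch) / len(batch)

w = 0.0
full = grad(w, data)  # full-batch gradient
# Average many random mini-batch gradients: by LLN it converges to `full`.
mb_mean = sum(grad(w, rng.sample(data, 8)) for _ in range(20_000)) / 20_000
print(full, mb_mean)
```

Any single mini-batch gradient is noisy, but its expectation is the full gradient, which is what lets SGD follow the true descent direction on average.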

AlphaGo & MCTS

Cannot compute exact game tree values. Instead, play thousands of random games from a position. Average outcome converges to true value of the position.