Chebyshev's Inequality

Bounding tail probabilities with zero assumptions.

Universal Guarantees

In statistics, we often know the mean (μ) and standard deviation (σ) of a distribution, but not its exact shape. Is it Normal? Uniform? Some weird multi-modal thing?

Chebyshev's Inequality answers the question: "How far can values reasonably stray from the mean?" The beauty is that it requires no assumptions about the distribution's shape: it works for ANY distribution with a finite mean and variance.

Why It Matters

Real-world data is often non-Normal (heavy-tailed, skewed). Chebyshev gives you worst-case guarantees when you can't assume Normality. It's a bedrock of robust statistics and theoretical ML.

The Statement

P(|X - \mu| \geq k\sigma) \leq \frac{1}{k^2}
"The probability of landing k or more standard deviations away from the mean is at most 1/k²."

Quick Lookup Table

| k (std devs) | Chebyshev bound (any distribution) | Normal distribution (actual) |
|---|---|---|
| 1 | ≤ 100% | ~32% |
| 2 | ≤ 25% | ~5% |
| 3 | ≤ 11.1% | ~0.3% |
| 4 | ≤ 6.25% | ~0.006% |
| 5 | ≤ 4% | ~0.00006% |

Notice: Chebyshev is conservative. For a Normal distribution, the true probability is much lower. But Chebyshev must hold for ALL distributions, including adversarial ones.
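
The gap in the table is easy to reproduce. Here is a minimal sketch (Python with SciPy, assuming it is available) that prints the Chebyshev bound next to the exact two-sided Normal tail probability:

```python
from scipy.stats import norm

# Compare the distribution-free Chebyshev bound with the exact
# two-sided tail probability of a standard Normal distribution.
for k in [1, 2, 3, 4, 5]:
    chebyshev_bound = 1 / k**2        # P(|X - mu| >= k*sigma) <= 1/k^2
    normal_tail = 2 * norm.sf(k)      # exact P(|Z| >= k) for Z ~ N(0, 1)
    print(f"k={k}:  Chebyshev <= {chebyshev_bound:.4%}   Normal actual = {normal_tail:.6%}")
```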

Interactive Simulator

Adjust k and see the guaranteed probability bound. The shaded region represents the tails beyond k standard deviations.

[Simulator snapshot: a Normal (Gaussian) distribution with k adjustable from 1.0 to 5.0. At k = 2.0 the Chebyshev bound (guaranteed upper limit) is 25.0% (= 1/2.0²), while the actual tail probability for this shape is 4.6%.]
Notice how the Actual Probability is always ≤ the Chebyshev Bound. (For Normal, it's MUCH lower!)
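
If the simulator isn't handy, a quick Monte Carlo check makes the same point for a deliberately non-Normal distribution. A sketch (Python with NumPy; the lognormal choice is just an illustration of a skewed, heavy-tailed shape with finite variance):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.lognormal(mean=0.0, sigma=1.0, size=1_000_000)  # skewed, heavy-tailed, finite variance

mu, sigma = x.mean(), x.std()
for k in [1, 2, 3, 4, 5]:
    empirical = np.mean(np.abs(x - mu) >= k * sigma)    # observed tail frequency
    print(f"k={k}:  observed = {empirical:.4%}   Chebyshev bound = {1 / k**2:.4%}")
```

The observed frequency always stays at or below 1/k², as the inequality guarantees.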

Proof Sketch (via Markov's Inequality)

The proof is elegant and builds on the simpler Markov's Inequality.

Step 1: Markov's Inequality

P(Y \geq a) \leq \frac{E[Y]}{a}

For any non-negative random variable Y and any a > 0: the expected value cannot be too small if a big chunk of probability sits on large values.

Step 2: Apply to Variance

Let Y = (X - μ)². This is non-negative! And E[Y] = Var(X) = σ². Applying Markov with a = k²σ²:

P((X - \mu)^2 \geq k^2\sigma^2) \leq \frac{\sigma^2}{k^2\sigma^2} = \frac{1}{k^2}

Since (X - μ)² ≥ k²σ² is the same event as |X - μ| ≥ kσ, we are done.
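
To see the Markov step concretely, here is a small sketch (Python with NumPy, my own illustration) that estimates P(Y ≥ a) and compares it with E[Y]/a for Y = (X - μ)² and a = k²σ²:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.exponential(scale=2.0, size=1_000_000)  # any distribution with finite variance

mu = x.mean()
y = (x - mu) ** 2                                # non-negative, E[Y] = Var(X)
for k in [1.5, 2.0, 3.0]:
    a = (k * x.std()) ** 2                       # threshold a = k^2 * sigma^2
    lhs = np.mean(y >= a)                        # estimated P(Y >= a)
    rhs = y.mean() / a                           # Markov bound E[Y]/a, i.e. 1/k^2
    print(f"k={k}:  P(Y >= a) = {lhs:.4%}  <=  E[Y]/a = {rhs:.4%}")
```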

Case Study: Bulb Lifespan Quality Control

The Scenario

Your bulb factory claims an average lifespan of 1,200 hours with a standard deviation of 100 hours. A customer asks: "What's the maximum probability that a bulb will last less than 900 hours?" You don't know the exact distribution.

The Calculation

  • 900 hours is (1200 - 900) / 100 = 3 standard deviations below the mean.
  • By Chebyshev, P(|X - 1200| ≥ 300) ≤ 1/9 ≈ 11.1%.
  • This bounds both tails. For just the left tail, we need Cantelli's Inequality (next section).

The Guarantee

At most ~11% of bulbs will deviate by more than 300 hours from the mean (in either direction). This is a worst-case guarantee that holds for ANY lifespan distribution with that mean and variance.

Cantelli's Inequality (One-Sided Chebyshev)

Often we only care about one tail. "What's the probability of being below a threshold?" Cantelli's Inequality provides a tighter bound for one-sided deviations.

P(X - \mu \geq k\sigma) \leq \frac{1}{1 + k^2}
Also: P(X - \mu \leq -k\sigma) \leq \frac{1}{1 + k^2}

Bulb Example Revisited: P(Lifespan < 900 hours) ≤ 1 / (1 + 3²) = 1/10 = 10%. Tighter than the 11.1% from two-sided Chebyshev!
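
As a sanity check on the arithmetic, a minimal sketch (plain Python) computing both bounds for the bulb numbers:

```python
# Bulb example: mean = 1200 h, std = 100 h, threshold = 900 h.
mu, sigma, threshold = 1200, 100, 900

k = (mu - threshold) / sigma                  # 3 standard deviations below the mean
chebyshev_two_sided = 1 / k**2                # bound on P(|X - mu| >= k*sigma)
cantelli_one_sided = 1 / (1 + k**2)           # bound on P(X - mu <= -k*sigma)

print(f"k = {k:.1f}")
print(f"Chebyshev (both tails): <= {chebyshev_two_sided:.1%}")   # 11.1%
print(f"Cantelli  (left tail):  <= {cantelli_one_sided:.1%}")    # 10.0%
```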

Connection to Law of Large Numbers

Chebyshev's Inequality is the tool used to prove the Weak Law of Large Numbers (WLLN).

WLLN Statement

\bar{X}_n \xrightarrow{P} \mu \quad \text{as } n \to \infty

The sample mean converges in probability to the true mean.

Proof Sketch: The variance of the sample mean is σ²/n. Apply Chebyshev:

P(|\bar{X}_n - \mu| \geq \epsilon) \leq \frac{\sigma^2}{n\epsilon^2} \to 0

As n → ∞, the bound goes to 0. The sample mean gets arbitrarily close to the true mean with high probability.
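
A brief sketch (Python with NumPy, my own illustration) showing the Chebyshev bound σ²/(nε²) shrinking alongside the empirical deviation probability as n grows:

```python
import numpy as np

rng = np.random.default_rng(2)
sigma2, eps = 1.0, 0.1                           # per-sample variance, deviation threshold

for n in [10, 100, 1_000, 10_000]:
    means = rng.normal(loc=0.0, scale=1.0, size=(1_000, n)).mean(axis=1)
    empirical = np.mean(np.abs(means) >= eps)    # estimated P(|sample mean - mu| >= eps)
    bound = sigma2 / (n * eps**2)                # Chebyshev bound (may exceed 1 for small n)
    print(f"n={n:>6}:  observed = {empirical:.4f}   Chebyshev bound = {bound:.4f}")
```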

Chernoff Bounds (Tighter for Specific Distributions)

Chebyshev gives polynomial decay (1/k²). For sums of independent random variables (like Bernoulli), we can get exponential decay using Chernoff bounds.

P(X \geq (1+\delta)\mu) \leq e^{-\frac{\delta^2 \mu}{2 + \delta}}

For a sum of n independent Bernoulli trials with mean μ = np and any δ > 0.

Why Tighter? Chebyshev uses only the first 2 moments (mean, variance). Chernoff uses the moment generating function (all moments). More info = tighter bound.
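
A sketch comparing the two bounds against the exact Binomial tail (Python with SciPy; the particular n, p, and δ are just illustrative):

```python
from math import exp
from scipy.stats import binom

n, p = 1_000, 0.5
mu = n * p                                      # mean of the Binomial sum
var = n * p * (1 - p)                           # variance

delta = 0.2                                     # ask about X >= (1 + delta) * mu
t = delta * mu                                  # deviation in absolute terms

chebyshev = var / t**2                          # Chebyshev: P(|X - mu| >= t) <= var / t^2
chernoff = exp(-(delta**2) * mu / (2 + delta))  # one-sided Chernoff bound
exact = binom.sf((1 + delta) * mu - 1, n, p)    # exact P(X >= (1 + delta) * mu)

print(f"exact      = {exact:.2e}")
print(f"Chernoff  <= {chernoff:.2e}")
print(f"Chebyshev <= {chebyshev:.2e}")
```

The exact tail is far below the Chernoff bound, which in turn is orders of magnitude below the Chebyshev bound.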

ML Applications

PAC Learning

PAC Learning theory ("Probably Approximately Correct") uses Chebyshev/Hoeffding bounds to prove sample complexity. It answers: "How many training samples do we need to guarantee a model is within ε of optimal with probability 1-δ?"
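
As a rough illustration (plain Python, my own sketch, not the formal PAC/Hoeffding bound), the Chebyshev form P(|sample mean - μ| ≥ ε) ≤ σ²/(nε²) already gives a crude sample-complexity rule: pick n so the bound falls below δ.

```python
import math

def samples_needed(sigma2: float, eps: float, delta: float) -> int:
    """Smallest n with Chebyshev bound sigma^2 / (n * eps^2) <= delta."""
    return math.ceil(sigma2 / (delta * eps**2))

# Example: a loss bounded in [0, 1] has variance at most 0.25.
# Estimate the true error within eps = 0.05, with failure probability delta = 0.05.
print(samples_needed(sigma2=0.25, eps=0.05, delta=0.05))  # 2000
```

Exponential (Hoeffding/Chernoff) bounds give much smaller n; this is the worst-case, moments-only version.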

Generalization Bounds

The gap between training error and test error can be bounded using concentration inequalities (Chebyshev, Hoeffding, Rademacher). This is the foundation of Statistical Learning Theory.

Outlier Detection

If a bulb lifespan is more than 3σ from the mean, Chebyshev says at most 11% of bulbs should be this extreme. If you observe 20% at that level, something is wrong with your manufacturing process (or distribution assumptions).
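
A minimal sketch (Python with NumPy, my own illustration) of that kind of check against a claimed mean and standard deviation:

```python
import numpy as np

def chebyshev_outlier_check(data: np.ndarray, mu: float, sigma: float, k: float = 3.0) -> None:
    """Compare the observed fraction beyond k*sigma of the *claimed* mean/std
    with the distribution-free 1/k^2 ceiling."""
    observed = np.mean(np.abs(data - mu) >= k * sigma)
    bound = 1 / k**2
    verdict = "within the bound" if observed <= bound else "exceeds the bound -- investigate"
    print(f"observed = {observed:.1%}, Chebyshev ceiling = {bound:.1%} -> {verdict}")

# Hypothetical lifespans (hours) checked against the claimed mu = 1200, sigma = 100.
rng = np.random.default_rng(3)
chebyshev_outlier_check(rng.normal(1200, 100, size=10_000), mu=1200, sigma=100, k=3.0)
```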

Robust Statistics

When data is non-Gaussian (heavy tails, outliers), Chebyshev-based methods are preferred over Gaussian assumptions. The Median Absolute Deviation (MAD) and trimmed means are robust alternatives.
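
For reference, a short sketch (Python with NumPy and SciPy, my own illustration) contrasting the standard deviation with MAD on data containing a few gross outliers:

```python
import numpy as np
from scipy.stats import median_abs_deviation

rng = np.random.default_rng(4)
clean = rng.normal(0, 1, size=1_000)
contaminated = np.concatenate([clean, [50.0, -60.0, 80.0]])  # a few gross outliers

# The standard deviation blows up; MAD barely moves.
print(f"std: clean = {clean.std():.2f}, contaminated = {contaminated.std():.2f}")
print(f"MAD: clean = {median_abs_deviation(clean):.2f}, "
      f"contaminated = {median_abs_deviation(contaminated):.2f}")
```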