Introduction
In everyday logic, we think cause to effect: "If it rains, the grass gets wet." But in reality, we often observe the effect and must infer the cause: "The grass is wet. Did it rain, or did the sprinklers turn on?"
Bayes' Theorem tells us how to reverse conditional probabilities.
Given P(B|A), compute P(A|B).
This is the foundation of Bayesian Inference: treating learning as the continuous process of updating probability distributions as new data arrives. It is arguably the most important theorem in machine learning.
The Core Intuition
Prior + Evidence = Posterior
- Prior: initial belief before seeing data.
- Likelihood: how compatible is the data with the hypothesis?
- Posterior: updated belief after seeing data.
Example: Is My Friend Home?
- Prior: 50% chance she is home (no idea initially).
- Evidence: Lights are on.
- Likelihood: If home, 90% chance lights are on. If not home, 10% chance lights are on.
- Posterior: After seeing lights on, P(home | lights on) = ?
Bayes tells us 90%: (0.9 × 0.5) / (0.9 × 0.5 + 0.1 × 0.5) = 0.45 / 0.50 = 0.9. The evidence dramatically shifted our belief.
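A quick numeric check of this update, as a minimal Python sketch (the numbers mirror the example above):

```python
# Posterior for "friend is home" given "lights are on".
prior_home = 0.5          # P(home) before any evidence
p_lights_if_home = 0.9    # P(lights on | home)
p_lights_if_away = 0.1    # P(lights on | not home)

# Total probability of the evidence, summed over both hypotheses
p_lights = p_lights_if_home * prior_home + p_lights_if_away * (1 - prior_home)

posterior_home = p_lights_if_home * prior_home / p_lights
print(posterior_home)  # 0.9
```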
The Formula
P(A|B) = P(B|A) × P(A) / P(B)

- Posterior P(A|B): what we want.
- Likelihood P(B|A): data fit.
- Prior P(A): initial belief.
- Evidence P(B): normalizer.
Proportional Form
P(A|B) ∝ P(B|A) × P(A). Often we skip computing P(B) and just normalize at the end.
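A short sketch of both routes for a simple two-hypothesis case (hypothetical helper names): compute P(B) explicitly, or compute unnormalized scores and normalize at the end.

```python
def posterior(prior, likelihood_if_true, likelihood_if_false):
    """Full Bayes: compute the evidence P(B) explicitly."""
    evidence = likelihood_if_true * prior + likelihood_if_false * (1 - prior)
    return likelihood_if_true * prior / evidence

def posterior_via_normalization(prior, likelihood_if_true, likelihood_if_false):
    """Proportional form: unnormalized scores, then normalize."""
    score_true = likelihood_if_true * prior
    score_false = likelihood_if_false * (1 - prior)
    return score_true / (score_true + score_false)

print(posterior(0.5, 0.9, 0.1))                    # 0.9
print(posterior_via_normalization(0.5, 0.9, 0.1))  # 0.9, same result
```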
Odds Form
Posterior odds = Likelihood ratio × Prior odds: P(A|B) / P(¬A|B) = [P(B|A) / P(B|¬A)] × [P(A) / P(¬A)].
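The friend-at-home example again, this time in odds form (a sketch using the same numbers as above):

```python
prior_odds = 0.5 / 0.5                          # 1:1 -> 1.0
likelihood_ratio = 0.9 / 0.1                    # lights on is 9x more likely if home
posterior_odds = likelihood_ratio * prior_odds  # 9.0, i.e. 9:1

# Convert odds back to a probability
posterior_prob = posterior_odds / (1 + posterior_odds)
print(posterior_prob)  # 0.9
```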
Interactive: Bayes Calculator
Adjust prior and likelihood to see how the posterior changes. Notice how strong priors resist change, and how strong evidence can overcome weak priors.
- Prevalence: how rare is the event (the prior).
- True Positive Rate: ability to detect true cases (sensitivity).
- False Alarm Rate: healthy people testing positive (false positive rate).
Geometric Intuition: The Posterior P(A|B) is the fraction of total colored area (Positives) that is Green (True).
Bayes Equation: P(A|B) = P(B|A) P(A) / [P(B|A) P(A) + P(B|¬A) P(¬A)]
Even with 99% Sensitivity, if the Prior is very low (e.g. 1%) and the False Positive Rate is moderate (5%), the Posterior drops to ~16%!
Most "positive" tests for rare diseases are actually false alarms. This is the False Positive Paradox.
Derivation
Bayes' Theorem is not a new axiom. It follows directly from the definition of conditional probability and the product rule.
1. Product rule (two ways): P(A ∩ B) = P(A|B) P(B) and P(A ∩ B) = P(B|A) P(A).
2. Set them equal: P(A|B) P(B) = P(B|A) P(A).
3. Divide by P(B): P(A|B) = P(B|A) P(A) / P(B).
See Conditional Probability for background on P(A|B).
The False Positive Paradox
This classic example reveals why humans are notoriously bad at intuitive probability.
The Setup
- 1% of population has the disease (Prior = 0.01)
- Test is 99% accurate for sick people (Sensitivity = 0.99)
- Test has 1% false positive rate for healthy people
You test positive. What is P(Disease | Positive)?
Most people guess 99%. They are wrong.
Step 1: Numerator (True Positives): P(+|Disease) × P(Disease) = 0.99 × 0.01 = 0.0099
Step 2: Denominator (All Positives): 0.99 × 0.01 + 0.01 × 0.99 = 0.0099 + 0.0099 = 0.0198
Step 3: Result: P(Disease | +) = 0.0099 / 0.0198 = 0.5 = 50%
Why only 50%?
The disease is rare. In 10,000 people, only 100 are sick (99 test positive). But 9,900 are healthy, and 1% of those (99 people) also test positive! Half of all positives are false alarms.
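The same counting argument as a simulation sketch (random draws over 100,000 hypothetical people, so counts fluctuate slightly around the expected values):

```python
import random

random.seed(0)
N = 100_000
true_pos = false_pos = 0

for _ in range(N):
    sick = random.random() < 0.01          # 1% prevalence
    if sick:
        if random.random() < 0.99:         # 99% sensitivity
            true_pos += 1
    else:
        if random.random() < 0.01:         # 1% false positive rate
            false_pos += 1

print(true_pos, false_pos)                 # roughly equal counts
print(true_pos / (true_pos + false_pos))   # close to 0.5
```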
Interactive: Medical Test Simulator
Adjust disease rate, sensitivity, and false positive rate to see how they affect the posterior probability. Watch the ratio of true vs false positives change.
If you test positive, what is the chance you actually have the disease?
(Not 99%)
Bayesian Inference in ML
In ML, we replace events (A, B) with parameters (θ) and data (D):

P(θ|D) = P(D|θ) × P(θ) / P(D)

- Posterior P(θ|D): updated weights.
- Likelihood P(D|θ): data fit.
- Prior P(θ): regularization.
- Evidence P(D): intractable!
MAP Estimation
Find the θ that maximizes the posterior: θ_MAP = argmax_θ P(θ|D) = argmax_θ P(D|θ) P(θ).
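A minimal sketch of MAP estimation for a coin's bias with a Beta prior (the counts and prior are hypothetical; the argmax is taken over a grid, which sidesteps P(D) entirely since it does not depend on θ):

```python
import numpy as np

heads, tails = 7, 3            # observed data D
a, b = 2.0, 2.0                # Beta(2, 2) prior: mild belief in a fair coin

theta = np.linspace(1e-6, 1 - 1e-6, 10_001)
log_likelihood = heads * np.log(theta) + tails * np.log(1 - theta)
log_prior = (a - 1) * np.log(theta) + (b - 1) * np.log(1 - theta)

# Maximize log posterior = log likelihood + log prior (up to a constant)
theta_map = theta[np.argmax(log_likelihood + log_prior)]
print(theta_map)  # ~0.667, matching (heads + a - 1) / (heads + tails + a + b - 2)
```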
The Denominator Problem
Computing P(D) requires integrating over all possible θ: P(D) = ∫ P(D|θ) P(θ) dθ.
For neural networks with millions of parameters, this is impossible. Solutions: MCMC, Variational Inference, Laplace Approximation.
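For a single parameter the integral can still be approximated numerically on a grid, which makes the role of P(D) concrete (a sketch of the brute-force approach, not one of the scalable solutions listed above):

```python
import numpy as np

# Same coin example: approximate P(D) = integral of P(D|theta) P(theta) dtheta
heads, tails = 7, 3
theta = np.linspace(1e-6, 1 - 1e-6, 10_001)

likelihood = theta**heads * (1 - theta)**tails
prior = np.ones_like(theta)                     # uniform prior on [0, 1]

step = theta[1] - theta[0]
evidence = np.sum(likelihood * prior) * step    # Riemann sum for P(D)
posterior = likelihood * prior / evidence       # density that integrates to ~1

print(evidence)                                 # ~0.000758 (exact: 1/1320)
print(np.sum(posterior) * step)                 # ~1.0 sanity check
```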
ML Applications
Naive Bayes Classifier
Fast text classification (spam detection). Assumes feature independence given the class: P(x₁, …, xₙ | y) = P(x₁|y) × P(x₂|y) × … × P(xₙ|y).
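A tiny spam-filter sketch using scikit-learn's MultinomialNB (the toy messages and labels below are made up for illustration):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

texts = ["win cash now", "cheap pills offer", "meeting at noon", "lunch tomorrow?"]
labels = [1, 1, 0, 0]                    # 1 = spam, 0 = ham

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)      # bag-of-words counts

model = MultinomialNB()                  # class prior x product of per-word likelihoods
model.fit(X, labels)

test = vectorizer.transform(["cheap cash offer"])
print(model.predict(test))               # likely [1]
print(model.predict_proba(test))         # posterior P(class | words)
```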
Regularization = Prior
L2 regularization is equivalent to a Gaussian prior on weights: minimizing loss + λ‖w‖² is MAP estimation with prior P(w) ∝ exp(−λ‖w‖²).
Belief: "Weights should be small."
Bayesian Neural Networks
Instead of point estimates, maintain distributions over weights. Provides uncertainty quantification.
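A toy sketch of the idea (not a full Bayesian neural network): keep samples of a single weight rather than one point estimate, and let the spread of predictions quantify uncertainty. The posterior below is assumed, not learned.

```python
import numpy as np

rng = np.random.default_rng(0)

# Pretend the posterior over a single slope w is N(2.0, 0.3^2)
w_samples = rng.normal(loc=2.0, scale=0.3, size=1000)

x = 3.0
predictions = w_samples * x       # one prediction per weight sample

print(predictions.mean())         # predictive mean, ~6.0
print(predictions.std())          # predictive uncertainty, ~0.9
```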
Thompson Sampling
Exploration-exploitation in bandits. Sample from the posterior, act on the sample; uncertainty naturally drives the balance between exploring and exploiting.
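A minimal Beta-Bernoulli bandit sketch of Thompson sampling (the hidden success rates are made up):

```python
import numpy as np

rng = np.random.default_rng(0)
true_rates = [0.3, 0.5, 0.7]      # hidden arm success probabilities
successes = np.ones(3)            # Beta(1, 1) prior for each arm
failures = np.ones(3)

for _ in range(2000):
    # Sample one plausible success rate per arm from its posterior, act on the sample
    samples = rng.beta(successes, failures)
    arm = int(np.argmax(samples))
    reward = rng.random() < true_rates[arm]
    successes[arm] += reward
    failures[arm] += 1 - reward

print(successes + failures - 2)   # pull counts; most pulls concentrate on the best arm
```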
For a comparison of Bayesian vs Frequentist approaches, see Bayesian vs Frequentist.