Bayes' Theorem & Inference

The mathematical engine for updating beliefs with evidence.

Introduction

In everyday logic, we think cause to effect: "If it rains, the grass gets wet." But in reality, we often observe the effect and must infer the cause: "The grass is wet. Did it rain, or did the sprinklers turn on?"

Bayes' Theorem tells us how to reverse conditional probabilities.

Given P(\text{effect} | \text{cause}), compute P(\text{cause} | \text{effect}).

This is the foundation of Bayesian Inference: treating learning as the continuous process of updating probability distributions as new data arrives. It is arguably the most important theorem in machine learning.

The Core Intuition

Prior + Evidence = Posterior

  • Prior: the initial belief before seeing data.
  • Likelihood: how compatible the data are with the hypothesis.
  • Posterior: the updated belief after seeing data.

Example: Is My Friend Home?

  • Prior: 50% chance she is home (no idea initially).
  • Evidence: Lights are on.
  • Likelihood: If home, 90% chance lights are on. If not home, 10% chance lights are on.
  • Posterior: After seeing lights on, P(home | lights on) = ?

Bayes tells us: 90%. The evidence dramatically shifted our belief.
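The same update as a few lines of Python (a minimal sketch; the variable names are ours):

```python
# Bayes update for the "is my friend home?" example.
p_home = 0.5               # prior: P(home)
p_lights_given_home = 0.9  # likelihood: P(lights on | home)
p_lights_given_away = 0.1  # likelihood: P(lights on | not home)

# Evidence: total probability of seeing the lights on.
p_lights = p_lights_given_home * p_home + p_lights_given_away * (1 - p_home)

# Posterior: P(home | lights on)
print(p_lights_given_home * p_home / p_lights)  # 0.9
```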

The Formula

P(A|B) = \frac{P(B|A) \cdot P(A)}{P(B)}

  • Posterior P(A|B): what we want.
  • Likelihood P(B|A): how well the data fit the hypothesis.
  • Prior P(A): our initial belief.
  • Evidence P(B): the normalizer.

Proportional Form

P(A|B) \propto P(B|A) \cdot P(A)

Often we skip P(B) and just normalize at the end.

Odds Form

\frac{P(A|B)}{P(\neg A|B)} = \frac{P(B|A)}{P(B|\neg A)} \cdot \frac{P(A)}{P(\neg A)}

Posterior odds = Likelihood ratio × Prior odds
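Plugging in the friend-at-home numbers: the prior odds are 0.5/0.5 = 1 and the likelihood ratio is 0.9/0.1 = 9, so the posterior odds are 9:1, which corresponds to 9/(9+1) = 90%.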

Interactive: Bayes Calculator

Adjust prior and likelihood to see how the posterior changes. Notice how strong priors resist change, and how strong evidence can overcome weak priors.

[Interactive calculator: sliders for Prevalence (how rare is the event?), True Positive Rate (ability to detect true cases), and False Alarm Rate (healthy people testing positive).]

Geometric intuition: the posterior P(A|B) is the fraction of the total colored area (all positives) that is green (the true positives).

For example, with Prevalence = 1.0%, True Positive Rate = 99.0%, and False Alarm Rate = 5.0%:

P(A|B) = \frac{0.99 \times 0.01}{0.99 \times 0.01 + 0.05 \times 0.99} \approx 16.7\%

Even with 99% sensitivity, if the prior is very low (e.g. 1%) and the false positive rate is merely moderate (5%), the posterior drops to about 16.7%!

Most "positive" tests for rare diseases are actually false alarms. This is the False Positive Paradox.

Derivation

Bayes' Theorem is not a new axiom. It follows directly from the definition of conditional probability and the product rule.

1. Product rule (two ways):

P(A \cap B) = P(A|B) \cdot P(B) = P(B|A) \cdot P(A)

2. Set them equal:

P(A|B) \cdot P(B) = P(B|A) \cdot P(A)

3. Divide by P(B):

P(A|B) = \frac{P(B|A) \cdot P(A)}{P(B)}

See Conditional Probability for background on P(A|B).

The False Positive Paradox

This classic example reveals why humans are notoriously bad at intuitive probability.

The Setup

  • 1% of population has the disease (Prior = 0.01)
  • Test is 99% accurate for sick people (Sensitivity = 0.99)
  • Test has 1% false positive rate for healthy people

You test positive. What is P(Disease | Positive)?

Most people guess 99%. They are wrong.

Step 1: Numerator (True Positives)

P(+|D) \times P(D) = 0.99 \times 0.01 = 0.0099

Step 2: Denominator (All Positives)

P(+) = P(+|D)P(D) + P(+|H)P(H)

= 0.99(0.01) + 0.01(0.99) = 0.0198

Step 3: Result

P(D|+) = 0.0099 / 0.0198 = 50\%

Why only 50%?

The disease is rare. In 10,000 people, only 100 are sick (99 test positive). But 9,900 are healthy, and 1% of those (99 people) also test positive! Half of all positives are false alarms.
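You can verify this by counting outcomes directly, as the explanation above does; a quick sketch:

```python
# Count outcomes in a population of 10,000 people.
n = 10_000
sick = n * 0.01              # 100 people have the disease
healthy = n - sick           # 9,900 do not

true_pos = sick * 0.99       # 99 sick people test positive
false_pos = healthy * 0.01   # 99 healthy people also test positive

print(true_pos / (true_pos + false_pos))  # 0.5: half the positives are real
```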

Interactive: Medical Test Simulator

Adjust disease rate, sensitivity, and false positive rate to see how they affect the posterior probability. Watch the ratio of true vs false positives change.

[Interactive simulator: sliders for Prevalence (0.1%–10%), Sensitivity (90%–100%), and False Positive Rate (0.1%–10%), with a population grid of true positives, false positives, missed cases, and healthy people.]

At Prevalence = 1.0%, Sensitivity = 99.0%, and FPR = 1.0%, a population of 1,000 produces about 9.9 true positives and 9.9 false positives. If you test positive, the chance you actually have the disease is 50.0%, not 99%.

Bayesian Inference in ML

In ML, we replace events (A, B) with parameters (θ) and data (D):

P(\theta | D) = \frac{P(D | \theta) \cdot P(\theta)}{P(D)}

  • Posterior P(θ|D): the updated weights.
  • Likelihood P(D|θ): how well the data fit.
  • Prior P(θ): acts as regularization.
  • Evidence P(D): intractable!

MAP Estimation

Find the θ that maximizes the posterior:

\theta_{MAP} = \arg\max_\theta P(D|\theta)P(\theta)

MLE (Special Case)

If prior is uniform (flat):

\theta_{MLE} = \arg\max_\theta P(D|\theta)
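To make the difference concrete, here is a coin-flip sketch of our own (not from the module): with a Beta prior on a Bernoulli parameter, both estimates have closed forms.

```python
# MLE vs MAP for a coin's heads-probability theta.
# Assumed setup: 7 heads in 10 flips, Beta(5, 5) prior ("roughly fair coin").
heads, flips = 7, 10
a, b = 5.0, 5.0  # prior pseudo-counts

theta_mle = heads / flips  # 0.70: trusts the data alone
# MAP = mode of the Beta(a + heads, b + tails) posterior.
theta_map = (heads + a - 1) / (flips + a + b - 2)  # ~0.61: pulled toward 0.5
print(theta_mle, theta_map)
```

With more data, the likelihood dominates the prior and the two estimates converge.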

The Denominator Problem

Computing P(D) requires integrating over all possible values of θ:

P(D) = \int P(D|\theta)P(\theta)\, d\theta

For neural networks with millions of parameters, this integral is intractable. Common workarounds: MCMC, Variational Inference, and the Laplace Approximation.
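In one dimension the integral is easy to approximate on a grid; the sketch below (continuing the hypothetical coin example) also hints at why this fails in high dimensions, since a grid over millions of parameters is out of reach:

```python
import numpy as np

# Grid approximation of the evidence P(D) for a 1-D parameter.
theta = np.linspace(0.001, 0.999, 999)       # grid over parameter space
prior = np.full_like(theta, 1 / len(theta))  # flat prior, sums to 1
likelihood = theta**7 * (1 - theta)**3       # P(D|theta): 7 heads in 10 flips

evidence = (likelihood * prior).sum()        # P(D), the hard denominator
posterior = likelihood * prior / evidence    # now a proper distribution
```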

ML Applications

Naive Bayes Classifier

Fast text classification (e.g. spam detection). Assumes features are conditionally independent given the class:

P(y|x_1,...,x_n) \propto P(y)\prod_i P(x_i|y)
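A toy scorer shows the mechanics (the word counts and priors here are invented for illustration):

```python
import numpy as np
from collections import Counter

spam_counts = Counter({"free": 30, "money": 20, "meeting": 1})
ham_counts = Counter({"free": 2, "money": 3, "meeting": 40})
p_spam, p_ham = 0.4, 0.6  # class priors P(y)
vocab = len(set(spam_counts) | set(ham_counts))

def log_score(words, counts, prior, alpha=1.0):
    """log P(y) + sum_i log P(x_i|y), with Laplace smoothing."""
    total = sum(counts.values())
    return np.log(prior) + sum(
        np.log((counts[w] + alpha) / (total + alpha * vocab)) for w in words
    )

msg = ["free", "money"]
print("spam" if log_score(msg, spam_counts, p_spam)
      > log_score(msg, ham_counts, p_ham) else "ham")
```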

Regularization = Prior

L2 regularization is equivalent to a Gaussian prior on weights:

\theta \sim \mathcal{N}(0, \sigma^2)

Belief: "Weights should be small."
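Taking the negative log of the unnormalized posterior makes the equivalence explicit:

-\log P(\theta|D) = \underbrace{-\log P(D|\theta)}_{\text{training loss}} + \underbrace{\tfrac{1}{2\sigma^2}\|\theta\|^2}_{\text{L2 penalty}} + \text{const}

so the regularization strength is \lambda = 1/(2\sigma^2): a tighter prior (smaller \sigma) means stronger regularization.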

Bayesian Neural Networks

Instead of point estimates, maintain distributions over weights. Provides uncertainty quantification.

Thompson Sampling

Exploration-exploitation in bandits. Sample from posterior, act on sample. Naturally balances uncertainty.
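A minimal Beta-Bernoulli sketch (the two arm rates are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
true_rates = np.array([0.3, 0.6])  # unknown to the agent
wins = np.ones(2)                  # Beta(1, 1) prior on each arm
losses = np.ones(2)

for _ in range(1000):
    samples = rng.beta(wins, losses)  # one draw from each arm's posterior
    arm = int(np.argmax(samples))     # act as if the sample were the truth
    reward = rng.random() < true_rates[arm]
    wins[arm] += reward               # conjugate update: Beta stays Beta
    losses[arm] += 1 - reward

print(wins / (wins + losses))  # posterior means concentrate on the better arm
```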

For a comparison of Bayesian vs Frequentist approaches, see Bayesian vs Frequentist.