Introduction
We constantly update our beliefs based on new information. If it is cloudy, you revise your estimate of the probability of rain. If a medical test is positive, a doctor updates their assessment of whether a patient has a disease.
Conditional Probability answers: "What is the chance of E happening given that F has already happened?"
Written as P(E | F), read "the probability of E given F".
Simple Example: Rolling a Die
Roll a fair six-sided die. What is P(rolled a 5)?
Answer: P(rolled a 5) = 1/6, since there are 6 equally likely outcomes.
Now someone tells you the roll was odd. What is P(5 | odd)?
The possible outcomes are now {1, 3, 5}. Only one of these three is a 5, so P(5 | odd) = 1/3.
The probability jumped from 1/6 to 1/3 because conditioning eliminated half the sample space.
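Since all outcomes are equally likely, conditioning is just counting within the reduced sample space. A minimal Python sketch (our own illustration) that confirms the 1/3:

```python
from fractions import Fraction

outcomes = range(1, 7)                     # fair six-sided die
odd = [o for o in outcomes if o % 2 == 1]  # conditioning event F = {1, 3, 5}

# P(5 | odd) = |{5} ∩ F| / |F|, since all outcomes are equally likely
p_five_given_odd = Fraction(sum(1 for o in odd if o == 5), len(odd))
print(p_five_given_odd)  # 1/3
```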
The Formal Definition
P(E | F) = P(E ∩ F) / P(F), provided that P(F) > 0
P(E ∩ F): the probability that both E and F occur (the intersection)
P(F): the probability that F occurs (the new "universe")
Why divide?
We rescale probabilities so that they sum to 1 over the new sample space F.
Interactive: Visualizing Conditional Probability
Adjust the probabilities and see how P(E|F) and P(F|E) change. Notice that they are generally NOT equal!
P(E|F) asks: "Of the region F, what fraction lies in the intersection E ∩ F?"
Analysis: Independent Events
Knowing F occurred gives us no new information about E. The probability stays the same (P(E|F) ≈ P(E)).
Worked Examples
Movie Recommendations
Netflix wants to predict whether a user will watch "Life is Beautiful" (E), given that they watched "Amelie" (F).
Watching Amelie makes a user 4.5x more likely to watch Life is Beautiful, i.e., the lift P(E | F) / P(E) ≈ 4.5. This is why recommendation systems track conditional probabilities.
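The original view counts are not given here, so the sketch below uses made-up numbers chosen to reproduce the 4.5x lift; the point is the formula lift = P(E | F) / P(E):

```python
# Hypothetical counts (illustrative only, not real Netflix data)
n_users  = 100_000
n_amelie = 10_000   # watched Amelie (F)
n_life   = 2_000    # watched Life is Beautiful (E)
n_both   = 900      # watched both (E ∩ F)

p_e         = n_life / n_users    # P(E)   = 0.02
p_e_given_f = n_both / n_amelie   # P(E|F) = 0.09

print(f"lift = {p_e_given_f / p_e:.1f}x")  # 4.5x with these counts
```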
Family Composition
A family has 4 children. Given that they have at least one boy, what is P(exactly 2 boys)?
Let E = "exactly 2 boys", F = "at least 1 boy"
Assuming each child is independently equally likely to be a boy or a girl, all 16 sequences (BBBB, BBBG, ..., GGGG) are equally likely.
F eliminates only GGGG, leaving 15 outcomes.
Among these 15, exactly 6 have two boys (the C(4, 2) = 6 arrangements), so P(E | F) = 6/15 = 2/5.
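Brute-force enumeration confirms this (a short sketch, assuming boys and girls are equally likely):

```python
from itertools import product
from fractions import Fraction

families = list(product("BG", repeat=4))       # all 16 equally likely sequences
F = [f for f in families if "B" in f]          # at least one boy: 15 outcomes
E_and_F = [f for f in F if f.count("B") == 2]  # exactly two boys: 6 outcomes

print(Fraction(len(E_and_F), len(F)))  # 2/5
```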
Independence
Two events are independent if knowing one occurred does not change the probability of the other.
Definition 1: P(E ∩ F) = P(E) P(F)
Definition 2 (equivalent, when P(F) > 0): P(E | F) = P(E)
Testing Independence
Roll a die. Let A = {3} and B = {1, 3, 5} (odd numbers).
P(A) = 1/6
P(B) = 3/6 = 1/2
P(A ∩ B) = P({3}) = 1/6
Check: P(A) x P(B) = (1/6)(1/2) = 1/12
Since 1/12 ≠ 1/6, the events are NOT independent.
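The same check can be run mechanically. In the sketch below, the contrast pair {1, 2} vs. odd is our own addition; it shows that independent events can coexist on a single die roll:

```python
from fractions import Fraction

def prob(event, space=set(range(1, 7))):
    """Probability of an event (a set of faces) on a fair die."""
    return Fraction(len(event & space), len(space))

A, B = {3}, {1, 3, 5}
print(prob(A & B), prob(A) * prob(B))  # 1/6 vs 1/12 -> NOT independent

C = {1, 2}                             # our contrast example
print(prob(C & B), prob(C) * prob(B))  # 1/6 vs 1/6  -> independent
```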
Independence ≠ Mutual Exclusivity
Mutually exclusive: P(E ∩ F) = 0 (cannot both occur)
Independent: P(E ∩ F) = P(E)P(F) (one does not affect the other)
If events are mutually exclusive with nonzero probabilities, they are never independent. Knowing A occurred tells you B definitely did NOT occur.
The Chain Rule (Multiplication Rule)
Rearranging the conditional probability formula gives us the Chain Rule:
P(A ∩ B) = P(A) P(B | A)
or equivalently
P(A ∩ B) = P(B) P(A | B)
This extends to any number of events. For three events:
P(A ∩ B ∩ C) = P(A) P(B | A) P(C | A, B)
Example: Drawing Marbles
A jar has 7 black and 3 white marbles. Draw two without replacement. What is P(both black)?
P(B1) = 7/10
P(B2|B1) = 6/9 (after removing one black, 6 black remain out of 9)
P(B1 ∩ B2) = (7/10)(6/9) = 42/90 = 7/15 ≈ 0.467
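A quick Monte Carlo check (our own sketch) should land near the exact value of 7/15:

```python
import random

def p_both_black(trials=100_000):
    """Estimate P(both black) when drawing 2 of 10 marbles without replacement."""
    jar = ["black"] * 7 + ["white"] * 3
    hits = sum(random.sample(jar, 2) == ["black", "black"] for _ in range(trials))
    return hits / trials

print(p_both_black())  # ≈ 0.467 (exact: 7/15)
```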
Chain Rule in Language Models
The chain rule is the foundation of autoregressive models like GPT:
P(w_1, w_2, ..., w_n) = P(w_1) P(w_2 | w_1) P(w_3 | w_1, w_2) ... P(w_n | w_1, ..., w_{n-1})
GPT learns to estimate each P(next word | previous words), then multiplies along the chain.
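To make the multiplication concrete, here is a toy bigram scorer (our own sketch; real GPT conditions on the full prefix, whereas a bigram model truncates the context to the previous word). The probabilities are made up:

```python
import math

# Made-up bigram probabilities: P(next word | previous word)
bigram = {("<s>", "the"): 0.5, ("the", "cat"): 0.2, ("cat", "sat"): 0.3}

def sequence_prob(words):
    """Chain rule: P(w_1..w_n) = product of P(w_i | w_{i-1})."""
    logp = sum(math.log(bigram[(prev, w)])
               for prev, w in zip(["<s>"] + words[:-1], words))
    return math.exp(logp)  # real models sum logs to avoid underflow

print(sequence_prob(["the", "cat", "sat"]))  # 0.5 * 0.2 * 0.3 = 0.03
```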
Interactive: Chain Rule with Probability Trees
Probability trees visualize the chain rule. Each path from root to leaf represents a sequence of events. Multiply along the path to get the joint probability.
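In code, a two-stage tree is just nested branch probabilities: multiplying along each root-to-leaf path gives the joint, and the leaves sum to 1. A sketch reusing the marble jar from above:

```python
# Stage 1: first marble; stage 2: second marble given the first (7 black, 3 white)
tree = {
    "black": (7/10, {"black": 6/9, "white": 3/9}),
    "white": (3/10, {"black": 7/9, "white": 2/9}),
}

# Multiply along each root-to-leaf path to get joint probabilities
joint = {(a, b): p1 * p2
         for a, (p1, stage2) in tree.items()
         for b, p2 in stage2.items()}

print(joint[("black", "black")])  # 0.4666... = 7/15
print(sum(joint.values()))        # 1.0 -- the paths partition the sample space
```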
Connection to Bayes' Theorem
Since P(A ∩ B) = P(B ∩ A), we can equate the two chain rule forms:
P(A) P(B | A) = P(B) P(A | B)
Rearranging gives Bayes' Theorem:
P(A | B) = P(B | A) P(A) / P(B)
Medical Diagnosis Example
A rare disease affects 0.1% of people. A test has 92% sensitivity (P(+ | D) = 0.92) and 89% specificity (P(− | D') = 0.89).
If someone tests positive, what is P(disease | positive)?
P(+) = P(+|D)P(D) + P(+|D')P(D') = (0.92)(0.001) + (0.11)(0.999) ≈ 0.111, where P(+|D') = 1 − specificity = 0.11
P(D|+) = (0.92 x 0.001) / 0.111 ≈ 0.0083 (about 0.83%)
Despite a positive test, only 0.83% chance of having the disease. The low base rate (0.1%) dominates.
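The whole calculation fits in a few lines (a sketch of the arithmetic above):

```python
def posterior(prior, sensitivity, specificity):
    """P(disease | positive) via Bayes' theorem."""
    p_pos_given_d     = sensitivity      # P(+ | D)
    p_pos_given_not_d = 1 - specificity  # P(+ | D') = false positive rate
    p_pos = p_pos_given_d * prior + p_pos_given_not_d * (1 - prior)  # total probability
    return p_pos_given_d * prior / p_pos

print(posterior(prior=0.001, sensitivity=0.92, specificity=0.89))  # ≈ 0.0083
```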
For more on Bayesian reasoning, see Bayes' Theorem.
ML Applications
Naive Bayes
Assumes features are conditionally independent given the class:
P(x_1, ..., x_n | y) = P(x_1 | y) P(x_2 | y) ... P(x_n | y)
Used in spam filtering, sentiment analysis
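A minimal sketch of the idea (all probabilities are made-up illustrative values, not a trained model):

```python
# P(y | words) ∝ P(y) * Π P(word | y), by the conditional independence assumption
prior = {"spam": 0.4, "ham": 0.6}
p_word_given = {
    "spam": {"free": 0.30, "meeting": 0.02},
    "ham":  {"free": 0.02, "meeting": 0.10},
}

def score(words, y):
    s = prior[y]
    for w in words:
        s *= p_word_given[y][w]  # multiply per-word conditionals
    return s

msg = ["free", "meeting"]
print(max(("spam", "ham"), key=lambda y: score(msg, y)))  # "spam" here
```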
Hidden Markov Models
Model sequences with hidden states:
Transition: P(s_t | s_{t-1})
Emission: P(o_t | s_t)
Speech recognition, POS tagging
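A toy path probability built from these two conditional distributions (made-up numbers with a POS-tagging flavor):

```python
# Chain rule over a hidden-state path: Π P(s_t | s_{t-1}) * P(o_t | s_t)
transition = {("<s>", "Noun"): 0.5, ("Noun", "Verb"): 0.6}
emission   = {("Noun", "dogs"): 0.1, ("Verb", "run"): 0.2}

def path_prob(states, observations):
    p, prev = 1.0, "<s>"
    for s, o in zip(states, observations):
        p *= transition[(prev, s)] * emission[(s, o)]
        prev = s
    return p

print(path_prob(["Noun", "Verb"], ["dogs", "run"]))  # 0.5*0.1*0.6*0.2 = 0.006
```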
Bayesian Networks
Graphical models encoding conditional independence:
Medical diagnosis, causal reasoning
Reinforcement Learning
Policies are conditional distributions over actions given states:
π(a | s) = P(take action a | current state s)
Game playing, robotics
Common Pitfalls
Confusing P(A|B) with P(B|A)
The "Prosecutor's Fallacy." P(test+ | disease) = 99% does NOT mean P(disease | test+) = 99%.
The posterior depends critically on the base rate (prior).
Assuming Independence
Just because events "seem" unrelated does not mean they are statistically independent. Always verify: does P(A ∩ B) = P(A)P(B)?
Ignoring Base Rates
Even with strong evidence (high likelihood), a very low prior keeps the posterior low. This is why screening tests for rare diseases produce many false positives.
Wrong Chain Rule
Correct
P(A)P(B|A)P(C|A,B)
Wrong
P(A)P(B|A)P(C|B)
Each term must condition on ALL previous events.
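To see the difference, build a toy joint distribution where C depends on both A and B, then evaluate both factorizations (our own construction; the numbers are arbitrary):

```python
from itertools import product

# Toy joint over three binary events; C tracks XOR of A and B,
# so conditioning on B alone loses information about C.
joint = {}
for a, b, c in product([0, 1], repeat=3):
    p_b_given_a  = 0.8 if b == a else 0.2
    p_c_given_ab = 0.9 if c == (a ^ b) else 0.1
    joint[(a, b, c)] = 0.5 * p_b_given_a * p_c_given_ab

def P(**fixed):
    """Marginal probability of a partial assignment, e.g. P(A=1, C=0)."""
    return sum(p for (a, b, c), p in joint.items()
               if all({"A": a, "B": b, "C": c}[k] == v for k, v in fixed.items()))

correct = P(A=1) * (P(A=1, B=1) / P(A=1)) * (P(A=1, B=1, C=0) / P(A=1, B=1))
wrong   = P(A=1) * (P(A=1, B=1) / P(A=1)) * (P(B=1, C=0) / P(B=1))

print(correct, joint[(1, 1, 0)])  # 0.36 0.36 -- the correct form recovers the joint
print(wrong)                      # 0.296     -- dropping A from the last term fails
```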