Introduction
We constantly update our beliefs based on new information. If it is cloudy, you revise your estimate of the probability of rain. If a medical test is positive, a doctor updates their assessment of whether a patient has a disease.
Conditional Probability answers: "What is the chance of E happening given that F has already happened?"
Written as P(E | F), read "the probability of E given F".
Simple Example: Rolling a Die
Roll a fair six-sided die. What is P(rolled a 5)?
Answer: P(rolled a 5) = 1/6, since there are 6 equally likely outcomes.
Now someone tells you the roll was odd. What is P(5 | odd)?
The possible outcomes are now {1, 3, 5}. Only one of these three is a 5, so P(5 | odd) = 1/3.
The probability jumped from 1/6 to 1/3 because conditioning eliminated half the sample space.
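Since all outcomes are equally likely, conditioning is just counting within the reduced sample space. A minimal Python sketch (our own illustration) that confirms the 1/3:

```python
from fractions import Fraction

outcomes = range(1, 7)                     # fair six-sided die
odd = [o for o in outcomes if o % 2 == 1]  # conditioning event F = {1, 3, 5}

# P(5 | odd) = |{5} ∩ F| / |F|, since all outcomes are equally likely
p_five_given_odd = Fraction(sum(1 for o in odd if o == 5), len(odd))
print(p_five_given_odd)  # 1/3
```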
The Formal Definition
P(E | F) = P(E ∩ F) / P(F), provided that P(F) > 0
P(E ∩ F): the probability that both E and F occur (the intersection)
P(F): the probability that F occurs (the new "universe")
Why divide?
We rescale probabilities so that they sum to 1 over the new sample space F.
Interactive: Visualizing Conditional Probability
Adjust the probabilities and see how P(E|F) and P(F|E) change. Notice that they are generally NOT equal!
P(E|F) asks: "Of the region F, what fraction lies in the intersection E ∩ F?"
Analysis: Independent Events
Knowing F occurred gives us no new information about E. The probability stays the same (P(E|F) ≈ P(E)).
Worked Examples
Movie Recommendations
Netflix wants to predict whether a user will watch "Life is Beautiful" (E), given that they watched "Amelie" (F).
Watching Amelie makes a user 4.5x more likely to watch Life is Beautiful, i.e., the lift P(E | F) / P(E) ≈ 4.5. This is why recommendation systems track conditional probabilities.
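The original view counts are not given here, so the sketch below uses made-up numbers chosen to reproduce the 4.5x lift; the point is the formula lift = P(E | F) / P(E):

```python
# Hypothetical counts (illustrative only, not real Netflix data)
n_users  = 100_000
n_amelie = 10_000   # watched Amelie (F)
n_life   = 2_000    # watched Life is Beautiful (E)
n_both   = 900      # watched both (E ∩ F)

p_e         = n_life / n_users    # P(E)   = 0.02
p_e_given_f = n_both / n_amelie   # P(E|F) = 0.09

print(f"lift = {p_e_given_f / p_e:.1f}x")  # 4.5x with these counts
```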
Family Composition
A family has 4 children. Given that they have at least one boy, what is P(exactly 2 boys)?
Let E = "exactly 2 boys", F = "at least 1 boy"
Assuming each child is independently equally likely to be a boy or a girl, all 16 sequences (BBBB, BBBG, ..., GGGG) are equally likely.
F eliminates only GGGG, leaving 15 outcomes.
Among these 15, exactly 6 have two boys (the C(4, 2) = 6 arrangements), so P(E | F) = 6/15 = 2/5.
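Brute-force enumeration confirms this (a short sketch, assuming boys and girls are equally likely):

```python
from itertools import product
from fractions import Fraction

families = list(product("BG", repeat=4))       # all 16 equally likely sequences
F = [f for f in families if "B" in f]          # at least one boy: 15 outcomes
E_and_F = [f for f in F if f.count("B") == 2]  # exactly two boys: 6 outcomes

print(Fraction(len(E_and_F), len(F)))  # 2/5
```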
Independence
Two events are independent if knowing one occurred does not change the probability of the other.
Definition 1: P(E ∩ F) = P(E) P(F)
Definition 2 (equivalent, when P(F) > 0): P(E | F) = P(E)
Testing Independence
Roll a die. Let A = {3} and B = {1, 3, 5} (odd numbers).
P(A) = 1/6
P(B) = 3/6 = 1/2
P(A ∩ B) = P({3}) = 1/6
Check: P(A) x P(B) = (1/6)(1/2) = 1/12
Since 1/12 ≠ 1/6, the events are NOT independent.
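The same check can be run mechanically. In the sketch below, the contrast pair {1, 2} vs. odd is our own addition; it shows that independent events can coexist on a single die roll:

```python
from fractions import Fraction

def prob(event, space=set(range(1, 7))):
    """Probability of an event (a set of faces) on a fair die."""
    return Fraction(len(event & space), len(space))

A, B = {3}, {1, 3, 5}
print(prob(A & B), prob(A) * prob(B))  # 1/6 vs 1/12 -> NOT independent

C = {1, 2}                             # our contrast example
print(prob(C & B), prob(C) * prob(B))  # 1/6 vs 1/6  -> independent
```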
Independence ≠ Mutual Exclusivity
Mutually exclusive: P(E ∩ F) = 0 (cannot both occur)
Independent: P(E ∩ F) = P(E)P(F) (one does not affect the other)
If events are mutually exclusive with nonzero probabilities, they are never independent. Knowing A occurred tells you B definitely did NOT occur.
The Chain Rule (Multiplication Rule)
Rearranging the conditional probability formula gives us the Chain Rule:
P(A ∩ B) = P(A) P(B | A)
or equivalently
P(A ∩ B) = P(B) P(A | B)
This extends to any number of events. For three events:
P(A ∩ B ∩ C) = P(A) P(B | A) P(C | A, B)
Example: Drawing Marbles
A jar has 7 black and 3 white marbles. Draw two without replacement. What is P(both black)?
P(B1) = 7/10
P(B2|B1) = 6/9 (after removing one black, 6 black remain out of 9)
P(B1 ∩ B2) = (7/10)(6/9) = 42/90 = 7/15 ≈ 0.467
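A quick Monte Carlo check (our own sketch) should land near the exact value of 7/15:

```python
import random

def p_both_black(trials=100_000):
    """Estimate P(both black) when drawing 2 of 10 marbles without replacement."""
    jar = ["black"] * 7 + ["white"] * 3
    hits = sum(random.sample(jar, 2) == ["black", "black"] for _ in range(trials))
    return hits / trials

print(p_both_black())  # ≈ 0.467 (exact: 7/15)
```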
Chain Rule in Language Models
The chain rule is the foundation of autoregressive models like GPT:
P(w_1, w_2, ..., w_n) = P(w_1) P(w_2 | w_1) P(w_3 | w_1, w_2) ... P(w_n | w_1, ..., w_{n-1})
GPT learns to estimate each P(next word | previous words), then multiplies along the chain.
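To make the multiplication concrete, here is a toy bigram scorer (our own sketch; real GPT conditions on the full prefix, whereas a bigram model truncates the context to the previous word). The probabilities are made up:

```python
import math

# Made-up bigram probabilities: P(next word | previous word)
bigram = {("<s>", "the"): 0.5, ("the", "cat"): 0.2, ("cat", "sat"): 0.3}

def sequence_prob(words):
    """Chain rule: P(w_1..w_n) = product of P(w_i | w_{i-1})."""
    logp = sum(math.log(bigram[(prev, w)])
               for prev, w in zip(["<s>"] + words[:-1], words))
    return math.exp(logp)  # real models sum logs to avoid underflow

print(sequence_prob(["the", "cat", "sat"]))  # 0.5 * 0.2 * 0.3 = 0.03
```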
Interactive: Chain Rule with Probability Trees
Probability trees visualize the chain rule. Each path from root to leaf represents a sequence of events. Multiply along the path to get the joint probability.
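In code, a two-stage tree is just nested branch probabilities: multiplying along each root-to-leaf path gives the joint, and the leaves sum to 1. A sketch reusing the marble jar from above:

```python
# Stage 1: first marble; stage 2: second marble given the first (7 black, 3 white)
tree = {
    "black": (7/10, {"black": 6/9, "white": 3/9}),
    "white": (3/10, {"black": 7/9, "white": 2/9}),
}

# Multiply along each root-to-leaf path to get joint probabilities
joint = {(a, b): p1 * p2
         for a, (p1, stage2) in tree.items()
         for b, p2 in stage2.items()}

print(joint[("black", "black")])  # 0.4666... = 7/15
print(sum(joint.values()))        # 1.0 -- the paths partition the sample space
```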
Connection to Bayes' Theorem
Since P(A ∩ B) = P(B ∩ A), we can equate the two chain rule forms:
P(A) P(B | A) = P(B) P(A | B)
Rearranging gives Bayes' Theorem:
P(A | B) = P(B | A) P(A) / P(B)
Medical Diagnosis Example
A rare disease affects 0.1% of people. A test has 92% sensitivity (P(+ | D) = 0.92) and 89% specificity (P(− | D') = 0.89).
If someone tests positive, what is P(disease | positive)?
P(+) = P(+|D)P(D) + P(+|D')P(D') = (0.92)(0.001) + (0.11)(0.999) ≈ 0.111, where P(+|D') = 1 − specificity = 0.11
P(D|+) = (0.92 x 0.001) / 0.111 ≈ 0.0083 (about 0.83%)
Despite a positive test, only 0.83% chance of having the disease. The low base rate (0.1%) dominates.
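The whole calculation fits in a few lines (a sketch of the arithmetic above):

```python
def posterior(prior, sensitivity, specificity):
    """P(disease | positive) via Bayes' theorem."""
    p_pos_given_d     = sensitivity      # P(+ | D)
    p_pos_given_not_d = 1 - specificity  # P(+ | D') = false positive rate
    p_pos = p_pos_given_d * prior + p_pos_given_not_d * (1 - prior)  # total probability
    return p_pos_given_d * prior / p_pos

print(posterior(prior=0.001, sensitivity=0.92, specificity=0.89))  # ≈ 0.0083
```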
For more on Bayesian reasoning, see Bayes' Theorem.
ML Applications
Naive Bayes
Assumes features are conditionally independent given the class:
P(x_1, ..., x_n | y) = P(x_1 | y) P(x_2 | y) ... P(x_n | y)
Used in spam filtering, sentiment analysis
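A minimal sketch of the idea (all probabilities are made-up illustrative values, not a trained model):

```python
# P(y | words) ∝ P(y) * Π P(word | y), by the conditional independence assumption
prior = {"spam": 0.4, "ham": 0.6}
p_word_given = {
    "spam": {"free": 0.30, "meeting": 0.02},
    "ham":  {"free": 0.02, "meeting": 0.10},
}

def score(words, y):
    s = prior[y]
    for w in words:
        s *= p_word_given[y][w]  # multiply per-word conditionals
    return s

msg = ["free", "meeting"]
print(max(("spam", "ham"), key=lambda y: score(msg, y)))  # "spam" here
```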
Hidden Markov Models
Model sequences with hidden states:
Transition: P(s_t | s_{t-1})
Emission: P(o_t | s_t)
Speech recognition, POS tagging
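A toy path probability built from these two conditional distributions (made-up numbers with a POS-tagging flavor):

```python
# Chain rule over a hidden-state path: Π P(s_t | s_{t-1}) * P(o_t | s_t)
transition = {("<s>", "Noun"): 0.5, ("Noun", "Verb"): 0.6}
emission   = {("Noun", "dogs"): 0.1, ("Verb", "run"): 0.2}

def path_prob(states, observations):
    p, prev = 1.0, "<s>"
    for s, o in zip(states, observations):
        p *= transition[(prev, s)] * emission[(s, o)]
        prev = s
    return p

print(path_prob(["Noun", "Verb"], ["dogs", "run"]))  # 0.5*0.1*0.6*0.2 = 0.006
```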
Bayesian Networks
Graphical models encoding conditional independence:
Medical diagnosis, causal reasoning
Reinforcement Learning
Policies are conditional distributions over actions given states:
π(a | s) = P(take action a | current state s)
Game playing, robotics
Common Pitfalls
Confusing P(A|B) with P(B|A)
The "Prosecutor's Fallacy." P(test+ | disease) = 99% does NOT mean P(disease | test+) = 99%.
The posterior depends critically on the base rate (prior).
Assuming Independence
Just because events "seem" unrelated does not mean they are statistically independent. Always verify: does P(A ∩ B) = P(A)P(B)?
Ignoring Base Rates
Even with strong evidence (high likelihood), a very low prior keeps the posterior low. This is why screening tests for rare diseases produce many false positives.
Wrong Chain Rule
Correct
P(A)P(B|A)P(C|A,B)
Wrong
P(A)P(B|A)P(C|B)
Each term must condition on ALL previous events.
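To see the difference, build a toy joint distribution where C depends on both A and B, then evaluate both factorizations (our own construction; the numbers are arbitrary):

```python
from itertools import product

# Toy joint over three binary events; C tracks XOR of A and B,
# so conditioning on B alone loses information about C.
joint = {}
for a, b, c in product([0, 1], repeat=3):
    p_b_given_a  = 0.8 if b == a else 0.2
    p_c_given_ab = 0.9 if c == (a ^ b) else 0.1
    joint[(a, b, c)] = 0.5 * p_b_given_a * p_c_given_ab

def P(**fixed):
    """Marginal probability of a partial assignment, e.g. P(A=1, C=0)."""
    return sum(p for (a, b, c), p in joint.items()
               if all({"A": a, "B": b, "C": c}[k] == v for k, v in fixed.items()))

correct = P(A=1) * (P(A=1, B=1) / P(A=1)) * (P(A=1, B=1, C=0) / P(A=1, B=1))
wrong   = P(A=1) * (P(A=1, B=1) / P(A=1)) * (P(B=1, C=0) / P(B=1))

print(correct, joint[(1, 1, 0)])  # 0.36 0.36 -- the correct form recovers the joint
print(wrong)                      # 0.296     -- dropping A from the last term fails
```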