Introduction
Prerequisite: This chapter builds on concepts from Hypothesis Testing. Make sure you understand Null/Alternative hypotheses and P-values before continuing.
In Hypothesis Testing, we never know the absolute truth about a population; we only make inferences based on a sample. Because we are operating with incomplete information, our decisions are subject to uncertainty.
The Fundamental Question
Whenever you make a binary decision (Reject H0 or Fail to Reject H0) based on probability, there are two ways to be right, and two ways to be wrong.
"Which mistake is worse: raising a false alarm, or missing a real discovery?"
Understanding Type I and Type II errors is arguably the most important practical skill in A/B testing, medical diagnosis, and machine learning model evaluation.
Core Definitions
To understand errors, we must first recall the two competing hypotheses in any test:
Null Hypothesis
The default assumption. "There is no effect," "The drug does nothing," or "The defendant is innocent."
Alternative Hypothesis
The claim we want to detect. "There is an effect," "The drug works," or "The defendant is guilty."
The Decision Matrix
This 2x2 matrix is the mental model you must memorize. It maps "Reality" (which we do not know) against "Our Decision" (based on data).
| | H0 is TRUE (Nothing happened) | H0 is FALSE (Real effect exists) |
|---|---|---|
| Fail to Reject H0 (Do nothing) | Correct (True Negative) | Type II Error (False Negative): "Missed Opportunity" |
| Reject H0 (Take action) | Type I Error (False Positive): "False Alarm" | Correct (True Positive): Power = 1 - β |
Memory Trick: Type I = Incorrectly reject a true null (False Positive). Type II = Incorrectly fail to reject a false null (False Negative).
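The decision matrix above can be captured in a few lines of code. This is a minimal sketch; the function name `classify` and its arguments are illustrative, not from any library:

```python
def classify(null_is_true: bool, reject_null: bool) -> str:
    """Return the cell of the 2x2 decision matrix for one test outcome."""
    if null_is_true and reject_null:
        return "Type I Error (False Positive)"   # false alarm
    if null_is_true and not reject_null:
        return "Correct (True Negative)"         # nothing happened, did nothing
    if not null_is_true and reject_null:
        return "Correct (True Positive)"         # detected a real effect
    return "Type II Error (False Negative)"      # missed opportunity

print(classify(null_is_true=True, reject_null=True))
print(classify(null_is_true=False, reject_null=False))
```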
Deep Dive: Type I Error (False Positive)
Definition
Rejecting a True Null Hypothesis. You conclude there is an effect when there is none.
The Probability: Alpha (α)
The probability of committing a Type I error is exactly equal to the Significance Level (α) you choose before the test. For example, α = 0.05 means a 5% risk of a false positive.
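A quick Monte Carlo simulation makes this concrete: if H0 is really true and we test at α = 0.05, about 5% of experiments still raise a false alarm. A sketch in Python, with illustrative names and parameters:

```python
import random
from statistics import NormalDist

random.seed(42)
ALPHA = 0.05
N = 30            # observations per experiment
TRIALS = 20_000   # simulated experiments

std_normal = NormalDist()
false_alarms = 0
for _ in range(TRIALS):
    # H0 is true: the data really comes from N(0, 1), so any rejection is an error
    sample = [random.gauss(0, 1) for _ in range(N)]
    z = (sum(sample) / N) * N ** 0.5               # z = mean / (sigma / sqrt(n))
    p = 2 * (1 - std_normal.cdf(abs(z)))           # two-sided p-value
    if p < ALPHA:
        false_alarms += 1

print(f"Empirical Type I rate: {false_alarms / TRIALS:.3f}")  # close to 0.05
```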
The Consequence
This error usually leads to taking an action that should not have been taken:
- Prescribing a drug that does not work
- Launching a feature that does not increase revenue
- Publishing a "discovery" that is not real
Analogy: The False Alarm
The smoke detector goes off (Reject H0) but there is no fire (H0 is True). You panic and evacuate for no reason.
Deep Dive: Type II Error (False Negative)
Definition
Failing to Reject a False Null Hypothesis. You fail to detect an effect that actually exists.
The Probability: Beta (β)
Unlike Alpha, we do not set Beta directly. It depends on:
- Sample size (bigger = lower beta)
- Effect size (bigger = lower beta)
- Variance/noise (lower = lower beta)
- Alpha level (higher = lower beta)
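For a one-sided z-test these levers have a closed form, so each factor's effect on beta can be checked directly. A sketch (the function and parameter names are illustrative):

```python
from statistics import NormalDist

def beta_ztest(n: int, effect: float, sigma: float = 1.0,
               alpha: float = 0.05) -> float:
    """Type II error rate for a one-sided z-test of H1: mean > 0."""
    z_crit = NormalDist().inv_cdf(1 - alpha)   # rejection threshold under H0
    shift = effect * n ** 0.5 / sigma          # how far H1 sits from H0
    return NormalDist().cdf(z_crit - shift)    # probability of missing the effect

base = beta_ztest(n=50, effect=0.3)
print(f"baseline beta: {base:.3f}")
print(f"bigger sample: {beta_ztest(n=200, effect=0.3):.3f}")             # lower
print(f"bigger effect: {beta_ztest(n=50, effect=0.6):.3f}")              # lower
print(f"less noise:    {beta_ztest(n=50, effect=0.3, sigma=0.5):.3f}")   # lower
print(f"looser alpha:  {beta_ztest(n=50, effect=0.3, alpha=0.10):.3f}")  # lower
```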
The Consequence
This is a missed opportunity:
- Failing to treat a sick patient
- Killing a project that would have been profitable
- Missing a scientific breakthrough
Analogy: The Silent Fire
There is a fire (H0 is False), but the alarm stays silent (Fail to Reject H0). The house burns down.
Interactive Demo: The Alpha-Beta Trade-off
Here is the cruel reality of statistics: You generally cannot minimize both errors simultaneously without changing the sample size. Watch how adjusting alpha affects beta in this visualization.
Type I & Type II Errors Visualized
Two distributions: H0 (null is true) and H1 (alternative is true). See how alpha, beta, and power interact.
Smaller alpha = harder to reject H0
Larger effect = easier to detect (more power)
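The trade-off can also be shown numerically: with the sample fixed, sliding the rejection threshold only trades one error for the other. This sketch assumes a one-sided z-test where the alternative sits 2 standard errors above the null (illustrative numbers):

```python
from statistics import NormalDist

nd = NormalDist()
H1_MEAN = 2.0  # where the alternative distribution sits, in z units

for threshold in (1.0, 1.645, 2.326):
    alpha = 1 - nd.cdf(threshold)        # area of H0 beyond the threshold
    beta = nd.cdf(threshold - H1_MEAN)   # area of H1 below the threshold
    print(f"threshold={threshold:.3f}  alpha={alpha:.3f}  beta={beta:.3f}")
```

Raising the threshold (stricter alpha) always pushes beta up, and vice versa; only more data or a bigger effect shrinks both.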
Statistical Power (1 - β)
Power is the probability that a test correctly rejects a false null hypothesis. In simple terms, it is the ability of the test to detect an effect if one actually exists.
Standard Target
Most A/B tests aim for 80% power. This means if there is a real difference, we have an 80% chance of finding it.
This implies we accept a 20% risk (β = 0.20) of missing the effect (Type II Error).
Factors that Increase Power:
- Larger sample size: narrows the sampling distributions, reducing overlap. This is the main lever for reducing BOTH errors.
- Higher alpha: makes it easier to reject H0, increasing power, but also increases Type I error risk.
- Larger effect size: a massive difference is easier to detect than a tiny one. Not always controllable.
Interactive Demo: Power Analysis Calculator
How many samples do you need to achieve 80% power? Adjust the parameters to find out. This is a critical tool for experiment design.
Power Analysis: Sample Size Calculator
See how sample size affects your ability to detect real effects. The curve shows power for different sample sizes.
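Outside the demo, the standard sample-size formula for a one-sample, two-sided z-test fits in a few lines. This is a sketch with illustrative parameter names; real A/B calculators refine it (e.g. for two groups or unequal variances):

```python
from math import ceil
from statistics import NormalDist

def required_n(effect: float, sigma: float = 1.0,
               alpha: float = 0.05, power: float = 0.80) -> int:
    """Samples needed to detect `effect` with the given power (one-sample z-test)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided critical value
    z_beta = NormalDist().inv_cdf(power)           # quantile for the target power
    return ceil(((z_alpha + z_beta) * sigma / effect) ** 2)

print(required_n(effect=0.5))   # medium effect
print(required_n(effect=0.2))   # a small effect needs far more samples
```

Halving the effect size roughly quadruples the required sample, since n scales with 1/effect².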
Real-World Scenarios
Criminal Trial
Medical Testing
Spam Filter (ML)
Interactive Demo: Which Error is Worse?
The optimal threshold depends entirely on the relative costs of Type I vs Type II errors in your specific domain. Explore different scenarios:
Which Error is Worse? Context Matters!
Select a scenario to see which type of error is more costly and how to optimize your test accordingly.
H0: Defendant is innocent
H1: Defendant is guilty
Type I Error: Convict an innocent person. Destroys the life of an innocent person; irreversible harm.
Type II Error: Acquit a guilty person. The criminal goes free, but can potentially be retried.
Recommendation: Set a very strict alpha (0.01). "Beyond reasonable doubt."