Introduction
Hypothesis testing is the statistical method for making decisions under uncertainty. It pits two opposing claims about a population against each other and uses sample data to decide whether the evidence is strong enough to reject one in favor of the other.
The Core Question
In Data Science, we constantly see patterns. "Model A has 85% accuracy, Model B has 86%." Is Model B actually better? Or was that 1% difference just random luck?
Hypothesis testing gives us a mathematical "Yes" or "No" (with a probability attached) to that question.
For example, if a company claims their website gets 50 visitors a day on average, we use hypothesis testing to analyze past visitor data. We determine whether the data are consistent with an average of 50, or whether the deviation we see in our sample is too large to be explained by chance alone.
The Courtroom Analogy
The logic of hypothesis testing parallels a criminal trial. This analogy helps clarify why we phrase things as "Fail to reject" rather than "Accept."
Null Hypothesis
"The defendant is innocent."
We start by assuming the status quo. We assume there is no effect, no difference, or no crime committed.
Alternative Hypothesis
"The defendant is guilty."
This is what the prosecutor (or data scientist) attempts to prove. It claims there is a significant effect or difference.
The Verdict
In court, we never declare a defendant "Innocent." We declare them "Not Guilty". This means there was not enough evidence to convict. Similarly, in statistics, we never "Accept the Null Hypothesis." We only "Fail to Reject the Null Hypothesis."
Core Definitions
1. The Hypotheses
Null Hypothesis (H0)
The starting assumption. Assumes no effect or difference.
"Average visits are 50"
Alternative Hypothesis (H1)
The opposite of the null, suggesting a difference exists.
"Average visits are NOT 50"
2. Significance Level (α)
The threshold for how strong the evidence must be before we reject the claim. Usually, we choose α = 0.05 (5%).
This is the probability of rejecting the null hypothesis when it is actually true (making a Type I error). Think of it as setting the bar for "beyond reasonable doubt."
3. The P-Value
The probability of seeing data as extreme as (or more extreme than) what we observed, assuming H0 is true.
If the P-value < α: Reject H0 (the data are unlikely under the null).
If the P-value ≥ α: Fail to reject H0 (the result is inconclusive).
P-values are widely misunderstood. See the P-Value chapter for common fallacies (ASA Statement) and pitfalls (P-Hacking).
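As a quick illustration, here is a minimal sketch of how a P-value is computed from an observed test statistic, assuming the statistic follows a standard normal distribution under H0 (i.e. a Z-test) and using a hypothetical value of 1.8:

```python
from scipy.stats import norm

z_observed = 1.8   # hypothetical test statistic computed from our sample
alpha = 0.05

# Two-tailed P-value: probability of a statistic at least this extreme
# in either direction, assuming H0 is true.
p_value = 2 * norm.sf(abs(z_observed))

print(f"P-value = {p_value:.4f}")
print("Reject H0" if p_value < alpha else "Fail to reject H0")
```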
Interactive Demo: P-Value & Rejection
See how the test statistic, alpha level, and test type affect whether we reject H0. Drag the test statistic and watch the P-value change.
One-Tailed vs. Two-Tailed Tests
The direction of your test depends on your Alternative Hypothesis (H1). Are you checking for any difference, or a difference in a specific direction?
Two-Tailed Test
Used when we want to see if there is a difference in either direction (higher OR lower).
Example: Testing if a marketing strategy affects sales (it could increase or decrease them).
One-Tailed Test
Used when we expect a change in only one direction.
Example: Testing if a new algorithm improves accuracy (we only care if it goes up).
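The choice of tails changes only how the P-value is computed from the same test statistic. A minimal sketch, assuming a Z-statistic with a hypothetical value of 1.8:

```python
from scipy.stats import norm

z = 1.8  # hypothetical test statistic

# Two-tailed: any difference (higher OR lower)
p_two_tailed = 2 * norm.sf(abs(z))

# One-tailed (right): we only care about an increase
p_right = norm.sf(z)

# One-tailed (left): we only care about a decrease
p_left = norm.cdf(z)

print(f"two-tailed: {p_two_tailed:.4f}, right: {p_right:.4f}, left: {p_left:.4f}")
```

With α = 0.05, this same statistic is not significant in a two-tailed test (p ≈ 0.072) but is significant in a right-tailed test (p ≈ 0.036), which is why the direction of H1 must be fixed before looking at the data.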
Choosing the Right Test Statistic
The test statistic is a number calculated from sample data that helps us decide whether to reject H0. Here is a quick reference:
| Test | When to Use |
|---|---|
| Z-Test | Population variance known OR large sample (n ≥ 30) |
| T-Test | Population variance unknown AND small sample (n < 30) |
| Chi-Square | Categorical data (comparing observed vs expected counts) |
| ANOVA | Comparing means of 3+ groups |
Z vs T Decision: The full decision flowchart with formulas is in the Confidence Intervals chapter. For a practical T-test walkthrough, see the One-Sample T-Test chapter.
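In Python, each of these tests has a ready-made implementation. The snippet below is a sketch with made-up sample data that maps each row of the table to a common SciPy/statsmodels call:

```python
import numpy as np
from scipy.stats import ttest_1samp, chisquare, f_oneway
from statsmodels.stats.weightstats import ztest

rng = np.random.default_rng(0)
sample = rng.normal(52, 10, size=40)   # hypothetical daily visitor counts

# Z-test: large sample, testing H0: mean = 50
z_stat, p_z = ztest(sample, value=50)

# T-test: small sample, population variance unknown
t_stat, p_t = ttest_1samp(sample[:15], popmean=50)

# Chi-square: observed vs expected counts for a categorical variable
chi_stat, p_chi = chisquare(f_obs=[18, 22, 20], f_exp=[20, 20, 20])

# ANOVA: comparing the means of 3+ groups
f_stat, p_f = f_oneway(sample[:13], sample[13:26], sample[26:])
```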
Type I & Type II Errors
Since we rely on probabilities, there is always a chance we make the wrong decision. These errors are classified into two types:
Type I Error (False Positive)
Rejecting a TRUE Null Hypothesis.
Type II Error (False Negative)
Failing to reject a FALSE Null Hypothesis.
Deep Dive: The Error Trade-off
Understanding when each error matters more (Spam filters vs. Medical tests) and how to balance them is critical for real-world applications.
Read the full Type I vs Type II chapter.
Interactive Demo: Error Trade-offs
Visualize the relationship between alpha, beta, and power. See how changing alpha affects beta, and how effect size impacts your ability to detect true differences.
Type I & Type II Errors Visualized
Two distributions: H0 (null is true) and H1 (alternative is true). See how alpha, beta, and power interact.
Smaller alpha = harder to reject H0
Larger effect = easier to detect (more power)
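The same trade-off can be checked numerically. The simulation below is a sketch with hypothetical numbers (null mean 100, true mean 105, σ = 15, n = 30); it estimates the Type I error rate and the power of a one-tailed Z-test:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(42)
alpha, n, sigma = 0.05, 30, 15
mu0, mu1 = 100, 105                 # null mean vs hypothetical true mean
z_crit = norm.ppf(1 - alpha)        # one-tailed critical value

def rejection_rate(true_mean, trials=20_000):
    """Fraction of simulated samples in which H0 is rejected."""
    samples = rng.normal(true_mean, sigma, size=(trials, n))
    z = (samples.mean(axis=1) - mu0) / (sigma / np.sqrt(n))
    return np.mean(z > z_crit)

type_i = rejection_rate(mu0)   # should land close to alpha (~0.05)
power = rejection_rate(mu1)    # 1 - beta; grows with effect size and n
print(f"Type I error ~ {type_i:.3f}, power ~ {power:.3f}")
```

Lowering alpha shrinks the Type I error rate but also shrinks the power (raises beta); moving mu1 further from mu0 or increasing the sample size raises the power.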
The 6-Step Process
A structured approach ensures valid results. Follow this checklist:
State Hypotheses
Define H0 and H1 clearly before collecting data.
Example: H0: The drug has no effect.
Choose Significance Level
Set α (usually 0.05). This sets the bar for 'beyond reasonable doubt.'
Collect & Analyze Data
Gather sample data through proper experimental design.
Example: Measure blood pressure before/after treatment.
Calculate Test Statistic
Compute Z, T, or Chi-Square based on the data.
Find P-Value or Critical Value
Compare your statistic to the distribution under H0.
Example: P-value = 0.003
Make Decision & Interpret
Reject or fail to reject H0. State practical significance.
Example: Reject H0; the drug significantly lowers blood pressure.
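Putting the checklist together in code, here is a minimal sketch of the drug example. The blood pressure readings are made up for illustration, and a paired T-test is used because the same patients are measured before and after treatment:

```python
import numpy as np
from scipy.stats import ttest_rel

# Step 1: H0: the drug has no effect; H1: it lowers blood pressure.
# Step 2: significance level
alpha = 0.05

# Step 3: hypothetical before/after systolic readings (mmHg) for 10 patients
before = np.array([142, 138, 150, 145, 139, 148, 152, 141, 147, 144])
after  = np.array([136, 135, 144, 140, 138, 141, 147, 137, 142, 140])

# Steps 4-5: test statistic and P-value (one-tailed: H1 says "before > after")
t_stat, p_value = ttest_rel(before, after, alternative='greater')

# Step 6: decision and interpretation
if p_value < alpha:
    print(f"Reject H0 (p = {p_value:.4f}): the drug significantly lowers BP.")
else:
    print(f"Fail to reject H0 (p = {p_value:.4f}): no significant effect found.")
```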
Practical Example: The Z-Test
The Z-Test is the "Hello World" of hypothesis testing. It is used when we know the population standard deviation (σ) and want to test if a sample mean belongs to that population.
The Scenario: IQ Scores
IQ scores are normally distributed with a mean of μ = 100 and a standard deviation of σ = 15. A school principal claims her students are "significantly smarter" than average. She samples n students and finds a sample average x̄ above 100.
We compute the Z-score, z = (x̄ − 100) / (15 / √n), and compare it to the critical value. Since the Z-score is greater than the critical value (it falls in the rejection region), we Reject H0.
Conclusion: The students are statistically significantly smarter than the average population.
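In code, the calculation looks like this. The sample size and sample mean below are hypothetical placeholders chosen for illustration:

```python
import numpy as np
from scipy.stats import norm

mu0, sigma = 100, 15     # population IQ mean and standard deviation
n, x_bar = 30, 108       # hypothetical sample size and sample mean
alpha = 0.05

# Z-statistic: how many standard errors the sample mean sits above mu0
z = (x_bar - mu0) / (sigma / np.sqrt(n))

# One-tailed test: H1 claims the students are smarter (mean > 100)
p_value = norm.sf(z)
z_crit = norm.ppf(1 - alpha)

print(f"z = {z:.2f}, critical value = {z_crit:.2f}, p = {p_value:.4f}")
print("Reject H0" if z > z_crit else "Fail to reject H0")
```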
Reality Check: In real life, we rarely know the true population standard deviation (σ). When σ is unknown, we must use the sample standard deviation (s) and switch to the T-Test.
Applications in Machine Learning
Hypothesis testing is fundamental for making data-driven decisions in ML.
1. Feature Selection
Use hypothesis tests to decide if a feature is relevant to the target.
- H0: Feature X has no correlation with Target Y.
- Test: Pearson Correlation or Chi-Square.
- Result: If P-value is high, drop the feature to reduce noise.
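A minimal sketch with synthetic data, using the Pearson correlation test named above:

```python
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(0)
feature = rng.normal(size=200)                   # hypothetical feature X
target = 0.3 * feature + rng.normal(size=200)    # hypothetical target Y

# H0: no linear correlation between feature and target
r, p_value = pearsonr(feature, target)
print(f"r = {r:.2f}, p = {p_value:.4f}")
if p_value > 0.05:
    print("Fail to reject H0: consider dropping the feature.")
else:
    print("Reject H0: the feature carries signal about the target.")
```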
2. A/B Testing
Comparing two versions of a model, website, or feature.
- H0: Conversion Rate A = Conversion Rate B.
- Test: Two-sample Z-test for proportions.
- Result: If we reject H0, the difference is statistically significant.
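A sketch of the two-sample Z-test for proportions, using statsmodels and made-up conversion counts:

```python
from statsmodels.stats.proportion import proportions_ztest

# Hypothetical results: version A vs version B
conversions = [120, 156]      # users who converted
visitors = [2400, 2380]       # users who saw each version

# H0: conversion rate A = conversion rate B (two-tailed)
z_stat, p_value = proportions_ztest(count=conversions, nobs=visitors)
print(f"z = {z_stat:.2f}, p = {p_value:.4f}")
if p_value < 0.05:
    print("Reject H0: the difference in conversion rates is significant.")
else:
    print("Fail to reject H0: the difference could be random noise.")
```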
3. Model Comparison
Determining if one model is statistically better than another. See confidence intervals for reporting uncertainty.
- H0: Model A accuracy = Model B accuracy.
- Test: Paired T-test on cross-validation scores.
- Result: Avoid overfitting to a single train/test split.
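A minimal sketch: pair the accuracies from the same cross-validation folds (the scores below are invented) and run a paired T-test:

```python
import numpy as np
from scipy.stats import ttest_rel

# Hypothetical accuracies of two models on the SAME 10 CV folds
model_a = np.array([0.84, 0.86, 0.85, 0.83, 0.87, 0.85, 0.84, 0.86, 0.85, 0.84])
model_b = np.array([0.86, 0.87, 0.86, 0.85, 0.88, 0.86, 0.86, 0.87, 0.86, 0.85])

# H0: the mean accuracies are equal (folds are paired, so use ttest_rel)
t_stat, p_value = ttest_rel(model_a, model_b)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
```

Pairing per fold removes fold-to-fold variance, so the test focuses on the consistent gap between the models rather than on how hard each individual fold happens to be.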
Limitations & Pitfalls
The "File Drawer" Effect
Studies with significant results (p < 0.05) are more likely to be published than those with non-significant results. This leads to a bias in the literature where effects seem stronger than they are.
Statistical vs. Practical Significance
With a massive sample size, even a tiny difference (e.g., 0.001% improvement) can have a small P-value. While "statistically significant," this might be practically useless for the business.
P-Hacking
Running many tests until you find one with p < 0.05, or tweaking analysis parameters to achieve significance. This inflates false positive rates. Pre-register your hypotheses to avoid this.