Introduction
Hypothesis testing is the statistical method for making decisions under uncertainty. It pits two opposing claims about a population against each other and uses sample data to decide whether the evidence is strong enough to reject one in favor of the other.
The Core Question
In Data Science, we constantly see patterns. "Model A has 85% accuracy, Model B has 86%." Is Model B actually better? Or was that 1% difference just random luck?
Hypothesis testing gives us a mathematical "Yes" or "No" (with a probability attached) to that question.
For example, if a company claims their website gets 50 visitors a day on average, we use hypothesis testing to analyze past visitor data. We determine whether the data are consistent with an average of 50, or whether the deviation we see in our sample is too large to be explained by chance alone.
The Courtroom Analogy
The logic of hypothesis testing parallels a criminal trial. This analogy helps clarify why we phrase things as "Fail to reject" rather than "Accept."
Null Hypothesis
"The defendant is innocent."
We start by assuming the status quo. We assume there is no effect, no difference, or no crime committed.
Alternative Hypothesis
"The defendant is guilty."
This is what the prosecutor (or data scientist) attempts to prove. It claims there is a significant effect or difference.
The Verdict
In court, we never declare a defendant "Innocent." We declare them "Not Guilty". This means there was not enough evidence to convict. Similarly, in statistics, we never "Accept the Null Hypothesis." We only "Fail to Reject the Null Hypothesis."
Core Definitions
1. The Hypotheses
Null Hypothesis (H0)
The starting assumption. Assumes no effect or difference.
"Average visits are 50"
Alternative Hypothesis (H1)
The opposite of the null, suggesting a difference exists.
"Average visits are NOT 50"
2. Significance Level (α)
The threshold for how strong the evidence must be before we reject the claim. Usually, we choose α = 0.05 (5%).
This is the probability of rejecting the null hypothesis when it is actually true (making a Type I error). Think of it as setting the bar for "beyond reasonable doubt."
3. The P-Value
The probability of seeing data as extreme as (or more extreme than) what we observed, assuming H0 is true.
If the P-value < α: Reject H0 (the data are unlikely under the null).
If the P-value ≥ α: Fail to reject H0 (the result is inconclusive).
P-values are widely misunderstood. See the P-Value chapter for common fallacies (ASA Statement) and pitfalls (P-Hacking).
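As a quick illustration, here is a minimal sketch of how a P-value is computed from an observed test statistic, assuming the statistic follows a standard normal distribution under H0 (i.e. a Z-test) and using a hypothetical value of 1.8:

```python
from scipy.stats import norm

z_observed = 1.8   # hypothetical test statistic computed from our sample
alpha = 0.05

# Two-tailed P-value: probability of a statistic at least this extreme
# in either direction, assuming H0 is true.
p_value = 2 * norm.sf(abs(z_observed))

print(f"P-value = {p_value:.4f}")
print("Reject H0" if p_value < alpha else "Fail to reject H0")
```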
Interactive Demo: P-Value & Rejection
See how the test statistic, alpha level, and test type affect whether we reject H0. Drag the test statistic and watch the P-value change.
One-Tailed vs. Two-Tailed Tests
The direction of your test depends on your Alternative Hypothesis (H1). Are you checking for any difference, or a difference in a specific direction?
Two-Tailed Test
Used when we want to see if there is a difference in either direction (higher OR lower).
Example: Testing if a marketing strategy affects sales (it could increase or decrease them).
One-Tailed Test
Used when we expect a change in only one direction.
Example: Testing if a new algorithm improves accuracy (we only care if it goes up).
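The choice of tails changes only how the P-value is computed from the same test statistic. A minimal sketch, assuming a Z-statistic with a hypothetical value of 1.8:

```python
from scipy.stats import norm

z = 1.8  # hypothetical test statistic

# Two-tailed: any difference (higher OR lower)
p_two_tailed = 2 * norm.sf(abs(z))

# One-tailed (right): we only care about an increase
p_right = norm.sf(z)

# One-tailed (left): we only care about a decrease
p_left = norm.cdf(z)

print(f"two-tailed: {p_two_tailed:.4f}, right: {p_right:.4f}, left: {p_left:.4f}")
```

With α = 0.05, this same statistic is not significant in a two-tailed test (p ≈ 0.072) but is significant in a right-tailed test (p ≈ 0.036), which is why the direction of H1 must be fixed before looking at the data.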
Choosing the Right Test Statistic
The test statistic is a number calculated from sample data that helps us decide whether to reject H0. Here is a quick reference:
| Test | When to Use |
|---|---|
| Z-Test | Population variance known OR large sample (n ≥ 30) |
| T-Test | Population variance unknown AND small sample (n < 30) |
| Chi-Square | Categorical data (comparing observed vs expected counts) |
| ANOVA | Comparing means of 3+ groups |
Z vs T Decision: The full decision flowchart with formulas is in the Confidence Intervals chapter. For a practical T-test walkthrough, see the One-Sample T-Test chapter.
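In Python, each of these tests has a ready-made implementation. The snippet below is a sketch with made-up sample data that maps each row of the table to a common SciPy/statsmodels call:

```python
import numpy as np
from scipy.stats import ttest_1samp, chisquare, f_oneway
from statsmodels.stats.weightstats import ztest

rng = np.random.default_rng(0)
sample = rng.normal(52, 10, size=40)   # hypothetical daily visitor counts

# Z-test: large sample, testing H0: mean = 50
z_stat, p_z = ztest(sample, value=50)

# T-test: small sample, population variance unknown
t_stat, p_t = ttest_1samp(sample[:15], popmean=50)

# Chi-square: observed vs expected counts for a categorical variable
chi_stat, p_chi = chisquare(f_obs=[18, 22, 20], f_exp=[20, 20, 20])

# ANOVA: comparing the means of 3+ groups
f_stat, p_f = f_oneway(sample[:13], sample[13:26], sample[26:])
```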
Type I & Type II Errors
Since we rely on probabilities, there is always a chance we make the wrong decision. These errors are classified into two types:
Type I Error (False Positive)
Rejecting a TRUE Null Hypothesis.
Type II Error (False Negative)
Failing to reject a FALSE Null Hypothesis.
Deep Dive: The Error Trade-off
Understanding when each error matters more (Spam filters vs. Medical tests) and how to balance them is critical for real-world applications.
Read the full Type I vs Type II chapter.
Interactive Demo: Error Trade-offs
Visualize the relationship between alpha, beta, and power. See how changing alpha affects beta, and how effect size impacts your ability to detect true differences.
Type I & Type II Errors Visualized
Two distributions: H0 (null is true) and H1 (alternative is true). See how alpha, beta, and power interact.
Smaller alpha = harder to reject H0
Larger effect = easier to detect (more power)
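The same trade-off can be checked numerically. The simulation below is a sketch with hypothetical numbers (null mean 100, true mean 105, σ = 15, n = 30); it estimates the Type I error rate and the power of a one-tailed Z-test:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(42)
alpha, n, sigma = 0.05, 30, 15
mu0, mu1 = 100, 105                 # null mean vs hypothetical true mean
z_crit = norm.ppf(1 - alpha)        # one-tailed critical value

def rejection_rate(true_mean, trials=20_000):
    """Fraction of simulated samples in which H0 is rejected."""
    samples = rng.normal(true_mean, sigma, size=(trials, n))
    z = (samples.mean(axis=1) - mu0) / (sigma / np.sqrt(n))
    return np.mean(z > z_crit)

type_i = rejection_rate(mu0)   # should land close to alpha (~0.05)
power = rejection_rate(mu1)    # 1 - beta; grows with effect size and n
print(f"Type I error ~ {type_i:.3f}, power ~ {power:.3f}")
```

Lowering alpha shrinks the Type I error rate but also shrinks the power (raises beta); moving mu1 further from mu0 or increasing the sample size raises the power.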
The 6-Step Process
A structured approach ensures valid results. Follow this checklist:
State Hypotheses
Define H0 and H1 clearly before collecting data.
Example: H0: The drug has no effect.
Choose Significance Level
Set α (usually 0.05). This sets the bar for 'beyond reasonable doubt.'
Collect & Analyze Data
Gather sample data through proper experimental design.
Example: Measure blood pressure before/after treatment.
Calculate Test Statistic
Compute Z, T, or Chi-Square based on the data.
Find P-Value or Critical Value
Compare your statistic to the distribution under H0.
Example: P-value = 0.003
Make Decision & Interpret
Reject or fail to reject H0. State practical significance.
Example: Reject H0; the drug significantly lowers blood pressure.
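Putting the checklist together in code, here is a minimal sketch of the drug example. The blood pressure readings are made up for illustration, and a paired T-test is used because the same patients are measured before and after treatment:

```python
import numpy as np
from scipy.stats import ttest_rel

# Step 1: H0: the drug has no effect; H1: it lowers blood pressure.
# Step 2: significance level
alpha = 0.05

# Step 3: hypothetical before/after systolic readings (mmHg) for 10 patients
before = np.array([142, 138, 150, 145, 139, 148, 152, 141, 147, 144])
after  = np.array([136, 135, 144, 140, 138, 141, 147, 137, 142, 140])

# Steps 4-5: test statistic and P-value (one-tailed: H1 says "before > after")
t_stat, p_value = ttest_rel(before, after, alternative='greater')

# Step 6: decision and interpretation
if p_value < alpha:
    print(f"Reject H0 (p = {p_value:.4f}): the drug significantly lowers BP.")
else:
    print(f"Fail to reject H0 (p = {p_value:.4f}): no significant effect found.")
```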
Practical Example: The Z-Test
The Z-Test is the "Hello World" of hypothesis testing. It is used when we know the population standard deviation (σ) and want to test if a sample mean belongs to that population.
The Scenario: IQ Scores
IQ scores are normally distributed with a mean of μ = 100 and a standard deviation of σ = 15. A school principal claims her students are "significantly smarter" than average. She samples n students and finds a sample average x̄ above 100.
We compute the Z-score, z = (x̄ − 100) / (15 / √n), and compare it to the critical value. Since the Z-score is greater than the critical value (it falls in the rejection region), we Reject H0.
Conclusion: The students are statistically significantly smarter than the average population.
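In code, the calculation looks like this. The sample size and sample mean below are hypothetical placeholders chosen for illustration:

```python
import numpy as np
from scipy.stats import norm

mu0, sigma = 100, 15     # population IQ mean and standard deviation
n, x_bar = 30, 108       # hypothetical sample size and sample mean
alpha = 0.05

# Z-statistic: how many standard errors the sample mean sits above mu0
z = (x_bar - mu0) / (sigma / np.sqrt(n))

# One-tailed test: H1 claims the students are smarter (mean > 100)
p_value = norm.sf(z)
z_crit = norm.ppf(1 - alpha)

print(f"z = {z:.2f}, critical value = {z_crit:.2f}, p = {p_value:.4f}")
print("Reject H0" if z > z_crit else "Fail to reject H0")
```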
Reality Check: In real life, we rarely know the true population standard deviation (σ). When σ is unknown, we must use the sample standard deviation (s) and switch to the T-Test.
Applications in Machine Learning
Hypothesis testing is fundamental for making data-driven decisions in ML.
1. Feature Selection
Use hypothesis tests to decide if a feature is relevant to the target.
- H0: Feature X has no correlation with Target Y.
- Test: Pearson Correlation or Chi-Square.
- Result: If P-value is high, drop the feature to reduce noise.
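A minimal sketch with synthetic data, using the Pearson correlation test named above:

```python
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(0)
feature = rng.normal(size=200)                   # hypothetical feature X
target = 0.3 * feature + rng.normal(size=200)    # hypothetical target Y

# H0: no linear correlation between feature and target
r, p_value = pearsonr(feature, target)
print(f"r = {r:.2f}, p = {p_value:.4f}")
if p_value > 0.05:
    print("Fail to reject H0: consider dropping the feature.")
else:
    print("Reject H0: the feature carries signal about the target.")
```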
2. A/B Testing
Comparing two versions of a model, website, or feature.
- H0: Conversion Rate A = Conversion Rate B.
- Test: Two-sample Z-test for proportions.
- Result: If we reject H0, the difference is statistically significant.
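A sketch of the two-sample Z-test for proportions, using statsmodels and made-up conversion counts:

```python
from statsmodels.stats.proportion import proportions_ztest

# Hypothetical results: version A vs version B
conversions = [120, 156]      # users who converted
visitors = [2400, 2380]       # users who saw each version

# H0: conversion rate A = conversion rate B (two-tailed)
z_stat, p_value = proportions_ztest(count=conversions, nobs=visitors)
print(f"z = {z_stat:.2f}, p = {p_value:.4f}")
if p_value < 0.05:
    print("Reject H0: the difference in conversion rates is significant.")
else:
    print("Fail to reject H0: the difference could be random noise.")
```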
3. Model Comparison
Determining if one model is statistically better than another. See confidence intervals for reporting uncertainty.
- H0: Model A accuracy = Model B accuracy.
- Test: Paired T-test on cross-validation scores.
- Result: Avoid overfitting to a single train/test split.
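A minimal sketch: pair the accuracies from the same cross-validation folds (the scores below are invented) and run a paired T-test:

```python
import numpy as np
from scipy.stats import ttest_rel

# Hypothetical accuracies of two models on the SAME 10 CV folds
model_a = np.array([0.84, 0.86, 0.85, 0.83, 0.87, 0.85, 0.84, 0.86, 0.85, 0.84])
model_b = np.array([0.86, 0.87, 0.86, 0.85, 0.88, 0.86, 0.86, 0.87, 0.86, 0.85])

# H0: the mean accuracies are equal (folds are paired, so use ttest_rel)
t_stat, p_value = ttest_rel(model_a, model_b)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
```

Pairing per fold removes fold-to-fold variance, so the test focuses on the consistent gap between the models rather than on how hard each individual fold happens to be.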
Limitations & Pitfalls
The "File Drawer" Effect
Studies with significant results (p < 0.05) are more likely to be published than those with non-significant results. This leads to a bias in the literature where effects seem stronger than they are.
Statistical vs. Practical Significance
With a massive sample size, even a tiny difference (e.g., 0.001% improvement) can have a small P-value. While "statistically significant," this might be practically useless for the business.
P-Hacking
Running many tests until you find one with p < 0.05, or tweaking analysis parameters to achieve significance. This inflates false positive rates. Pre-register your hypotheses to avoid this.