Context: This chapter dives deep into P-values. If you need a refresher on Null/Alternative hypotheses and the overall testing framework, see the Hypothesis Testing chapter first.
The Intuition: How "Surprised" Are You?
Forget the math for a second. Think of the p-value as a Surprise Meter. It measures how weird your observed data would be if the Null Hypothesis were actually true.
The Coin Toss Analogy
Your friend gives you a coin. The Null Hypothesis ($H_0$) is that the coin is fair. You flip it 10 times:
- 5 heads: Are you surprised? No. This is exactly what you expect.
- 8 heads: Are you surprised? Yes, a little. Rare but possible.
- 10 heads: Are you surprised? Extremely. This coin is probably rigged.
Key Insight
The lower the p-value, the more surprised you should be if $H_0$ were actually true. It measures the incompatibility between your data and the Null Hypothesis.
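To put numbers on the Surprise Meter, here is a minimal sketch using SciPy's exact binomial test on the three outcomes above:

```python
# How surprising is k heads in 10 flips if the coin is fair (H0: p = 0.5)?
from scipy.stats import binomtest

for heads in (5, 8, 10):
    result = binomtest(heads, n=10, p=0.5, alternative="two-sided")
    print(f"{heads}/10 heads -> p-value = {result.pvalue:.4f}")

# 5/10 heads  -> p-value = 1.0000  (no surprise at all)
# 8/10 heads  -> p-value = 0.1094  (rare but possible)
# 10/10 heads -> p-value = 0.0020  (extremely surprising)
```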
The Formal Definition
The P-value is the probability of observing test results at least as extreme as the results actually observed, under the assumption that the Null Hypothesis is correct.
What it IS
The probability of the evidence (data) given that the hypothesis is true.
What it is NOT
NOT the probability that the hypothesis is true given the evidence. This is a common misconception! See Bayes' theorem.
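In symbols, writing $D$ as shorthand for "data at least as extreme as what was observed":

$$
p = P(D \mid H_0) \qquad \text{whereas} \qquad p \neq P(H_0 \mid D)
$$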
Interactive Demo: The Tail Area
Visually, the P-value is the area under the curve in the tail(s) beyond your observed test statistic. Drag the slider to see how more extreme data leads to smaller P-values.
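The same idea in a few lines of SciPy (a minimal sketch, assuming a standard normal test statistic $z$ and a two-sided test):

```python
# Two-sided p-value for a z-statistic: the area in both tails beyond |z|.
from scipy.stats import norm

def two_sided_p(z: float) -> float:
    # norm.sf is the survival function 1 - CDF, i.e. the upper-tail area.
    return 2 * norm.sf(abs(z))

print(two_sided_p(1.00))  # ~0.317: unsurprising
print(two_sided_p(1.96))  # ~0.050: right at the conventional threshold
print(two_sided_p(3.00))  # ~0.0027: very surprising
```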
The P-Value Fallacy (ASA Statement)
In 2016, the American Statistical Association (ASA) released a landmark statement to correct rampant misuse of p-values in science.
"$p = 0.05$ means there is a 95% chance the hypothesis is true."
Correction: False. It only means that if the null were true, you would see data this extreme 5% of the time. It says nothing about the truth of the hypothesis itself.
"$p > 0.05$ means no effect."
Correction: Absence of evidence is not evidence of absence. A high p-value might just mean your sample size was too small to detect the effect (Low Power). See Type I vs Type II Errors.
"$p = 0.04$ is significantly better than $p = 0.06$."
Correction: The threshold of 0.05 is arbitrary. In reality, 0.04 and 0.06 represent very similar levels of evidence. Treat the p-value as a continuous measure of compatibility, not a hard switch.
How to Interpret P-Values
| P-Value Range | Evidence Against $H_0$ | Typical Interpretation |
|---|---|---|
| $p < 0.001$ | Very Strong | Extremely unlikely under $H_0$. Strong rejection. |
| $0.001 \le p < 0.01$ | Strong | Very unlikely under $H_0$. Usually reject. |
| $0.01 \le p < 0.05$ | Moderate | Conventionally "significant." Reject at $\alpha = 0.05$. |
| $0.05 \le p < 0.10$ | Weak | "Marginally significant." Worth investigating further. |
| $p \ge 0.10$ | Little to None | Data consistent with $H_0$. Fail to reject. |
Important: These are guidelines, not rules. Context matters! In medical trials, you might want $p < 0.01$. In exploratory analysis, $p < 0.10$ might warrant further investigation.
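If you want these guidelines as code (purely illustrative; the cutoffs are the conventions from the table above, not rules):

```python
def evidence_label(p: float) -> str:
    """Map a p-value to the conventional evidence label. Context should override this."""
    if p < 0.001:
        return "very strong evidence against H0"
    if p < 0.01:
        return "strong evidence against H0"
    if p < 0.05:
        return "moderate evidence (conventionally significant)"
    if p < 0.10:
        return "weak / marginal evidence"
    return "little to none: data consistent with H0"
```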
ML Application: Feature Selection
In Machine Learning, p-values are heavily used in Filter Methods for feature selection.
Backward Elimination Algorithm
1. Train a linear regression model with ALL features.
2. Calculate the p-value for the coefficient (weight) of every feature, testing $H_0: \beta_i = 0$ (the feature adds no value).
3. Identify the feature with the highest p-value (e.g., 0.85).
4. If $p > 0.05$, drop that feature.
5. Retrain and repeat until all remaining features have $p \le 0.05$.
Result: A simpler, more interpretable model with less noise. Features with high p-values likely contain no predictive signal.
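A compact sketch of that loop, assuming a pandas DataFrame `X` of features and a target vector `y` (hypothetical names), with statsmodels supplying the per-coefficient p-values:

```python
import pandas as pd
import statsmodels.api as sm

def backward_elimination(X: pd.DataFrame, y, alpha: float = 0.05) -> list[str]:
    """Drop the least significant feature until all p-values are <= alpha."""
    features = list(X.columns)
    while features:
        # Fit OLS with an intercept on the current feature set.
        model = sm.OLS(y, sm.add_constant(X[features])).fit()
        pvalues = model.pvalues.drop("const")  # one p-value per coefficient
        worst = pvalues.idxmax()               # feature with the highest p-value
        if pvalues[worst] > alpha:
            features.remove(worst)             # feature adds no detectable value
        else:
            break                              # every remaining feature is significant
    return features
```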
Pitfalls: P-Hacking & Corrections
P-Hacking (or Data Dredging) is the practice of running many statistical tests on the same data and reporting only the ones that come back significant.
The Multiple Comparisons Problem
If you run 100 independent tests where $H_0$ is actually true (no effect), with $\alpha = 0.05$, you will statistically expect about 5 "significant" results.
These are Type I errors (false positives). If you only publish these 5 "discoveries," you are publishing garbage.
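As a quick check on that expectation (assuming the tests are independent):

$$
\mathbb{E}[\text{false positives}] = n\alpha = 100 \times 0.05 = 5,
\qquad
P(\text{at least one}) = 1 - (1 - \alpha)^{n} = 1 - 0.95^{100} \approx 0.994
$$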
The Fix: Bonferroni Correction
If you run $n$ tests, divide your significance level $\alpha$ by $n$:

$$
\alpha_{\text{corrected}} = \frac{\alpha}{n}
$$
Example: Running 20 tests? Use $\alpha = 0.05 / 20 = 0.0025$ instead of 0.05. This is conservative but protects against false discoveries.
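In code, the correction is one call to statsmodels' `multipletests` (a sketch; `pvals` is a hypothetical array of raw p-values):

```python
import numpy as np
from statsmodels.stats.multitest import multipletests

pvals = np.array([0.001, 0.020, 0.040, 0.300])  # hypothetical raw p-values

# Bonferroni scales each p-value by n (capped at 1) before comparing to alpha.
reject, corrected, _, _ = multipletests(pvals, alpha=0.05, method="bonferroni")
print(reject)     # [ True False False False]
print(corrected)  # [0.004 0.08  0.16  1.  ]
```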
Interactive Demo: P-Hacking Simulation
See p-hacking in action. Each test is run where $H_0$ is TRUE (no real effect). Watch how many "significant" results appear by pure chance.
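The same simulation in code (a sketch; both groups are drawn from the same distribution, so $H_0$ is true by construction):

```python
# Run many t-tests where H0 is TRUE and count the accidental "discoveries".
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(42)
n_tests, alpha = 100, 0.05

false_positives = 0
for _ in range(n_tests):
    a = rng.normal(0.0, 1.0, size=30)  # group A: N(0, 1)
    b = rng.normal(0.0, 1.0, size=30)  # group B: same N(0, 1), no real effect
    if ttest_ind(a, b).pvalue < alpha:
        false_positives += 1

print(f"{false_positives}/{n_tests} tests 'significant' by chance alone")
# On average you expect about n_tests * alpha = 5 false positives.
```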