What is a t-test and when should you use it?

A t-test is a parametric statistical hypothesis test used to determine whether the mean of a sample is significantly different from a known or hypothesized value (one-sample), or whether the means of two groups differ from each other. You should use a t-test when your data is continuous, observations are independent, and either the sample size is large enough for the Central Limit Theorem to apply or the population is approximately normally distributed. It is preferred over a z-test when the population standard deviation is unknown.

What is the difference between a one-sample and two-sample t-test?

A one-sample t-test compares the mean of a single sample against a known or hypothesized population mean (e.g., checking whether a factory's protein bars average 20g). A two-sample t-test compares the means of two independent groups to determine whether they differ significantly (e.g., comparing test scores between two classrooms). When the same subjects are measured twice, such as before and after a treatment, the two sets of measurements are not independent, so a paired t-test is the appropriate choice; it analyzes the per-subject differences rather than treating the measurements as two separate groups.
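As a rough illustration of how these variants map onto code, the sketch below uses SciPy's `ttest_1samp`, `ttest_ind`, and `ttest_rel`. All measurements and the hypothesized value of 70 are invented for illustration. The last call checks a useful identity: a paired t-test gives the same result as a one-sample t-test on the per-subject differences against 0.

```python
import numpy as np
from scipy import stats

# Hypothetical measurements, invented purely for illustration.
before = np.array([72.0, 68.5, 80.1, 75.3, 69.9, 77.4, 71.2, 74.8])
after = np.array([70.1, 67.9, 77.5, 74.0, 69.5, 75.2, 70.8, 72.9])
other_group = np.array([73.5, 70.2, 79.0, 76.8, 72.1, 78.3, 74.4, 75.9])

# One-sample: does the mean of `before` differ from a hypothesized 70?
one_sample = stats.ttest_1samp(before, popmean=70.0)

# Two-sample (independent): do two unrelated groups have different means?
two_sample = stats.ttest_ind(before, other_group)

# Paired: the same subjects measured twice.
paired = stats.ttest_rel(before, after)

# Equivalent formulation: one-sample test on the differences against 0.
on_differences = stats.ttest_1samp(before - after, popmean=0.0)

for name, res in [("one-sample", one_sample), ("two-sample", two_sample),
                  ("paired", paired), ("on differences", on_differences)]:
    print(f"{name:>15}: t = {res.statistic:.3f}, p = {res.pvalue:.4f}")
```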

What are the assumptions of a t-test?

The four key assumptions of a t-test are: (1) Independence — each observation must be independent of the others; (2) Continuous data — the variable must be measured on an interval or ratio scale; (3) Random sampling — data should come from a random sample of the population to avoid selection bias; and (4) Normality — the population distribution should be approximately normal, though the test is robust to this assumption for large samples (n > 30) due to the Central Limit Theorem.

What is the difference between a t-test and a z-test?

Both the t-test and z-test evaluate whether a sample mean differs significantly from a hypothesized value, but they differ in what information is available. A z-test requires the true population standard deviation to be known, which is rare in practice. A t-test uses the sample standard deviation as an estimate and relies on the t-distribution, which has heavier tails to account for the additional uncertainty. In practice, the t-test is almost always preferred; when sample sizes are very large, the t-distribution converges to the standard normal distribution and the two tests yield virtually identical results.
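The convergence described above can be checked numerically. The following sketch, assuming SciPy is available, compares two-tailed critical values from the t-distribution with the standard normal value at $\alpha = 0.05$; the list of degrees of freedom is an arbitrary choice for illustration.

```python
from scipy import stats

alpha = 0.05
# Two-tailed critical value of the standard normal distribution.
z_crit = stats.norm.ppf(1 - alpha / 2)

# Two-tailed t critical values for increasing degrees of freedom.
t_crits = {df: stats.t.ppf(1 - alpha / 2, df) for df in (5, 10, 30, 100, 1000)}

for df, t_crit in t_crits.items():
    print(f"df = {df:>4}: t_crit = {t_crit:.3f}  (z_crit = {z_crit:.3f})")
```

The t critical value is always larger than the z value (heavier tails), and the gap shrinks as the degrees of freedom grow.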

How is the t-test used in machine learning model comparison?

In machine learning, the t-test is commonly used to assess whether the performance difference between two models is statistically significant rather than due to random variation in cross-validation splits. A paired t-test is typically applied to the per-fold accuracy or error scores of two models evaluated on the same k-fold partitions. If the p-value is below a chosen significance threshold (e.g., 0.05), you have evidence that one model outperforms the other on the given dataset. Treat this as evidence rather than proof: the training sets of different folds overlap, so the per-fold scores are not fully independent and the resulting p-values tend to be optimistic.
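A minimal sketch of this comparison, assuming the per-fold scores have already been computed, might look like the following. The accuracy values are hypothetical and stand in for the output of a real 10-fold cross-validation run; `model_a` and `model_b` are illustrative names.

```python
import numpy as np
from scipy import stats

# Hypothetical per-fold accuracies for two models on the SAME 10 folds.
model_a = np.array([0.81, 0.79, 0.84, 0.80, 0.83, 0.78, 0.82, 0.85, 0.80, 0.81])
model_b = np.array([0.78, 0.77, 0.80, 0.79, 0.80, 0.77, 0.79, 0.81, 0.78, 0.80])

# Paired t-test: each fold contributes one (model_a, model_b) pair.
result = stats.ttest_rel(model_a, model_b)
mean_diff = float(np.mean(model_a - model_b))

print(f"mean accuracy difference = {mean_diff:.3f}")
print(f"t = {result.statistic:.3f}, p = {result.pvalue:.5f}")

# Note: fold scores share training data, so this p-value is only a rough guide.
```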

One-Sample T-Test: Complete Tutorial

Introduction

The One-Sample T-Test is a parametric statistical procedure used to determine whether the mean of a single sample ( $\bar{x}$ ) is statistically different from a known or hypothesized population mean ( $\mu_0$ ).

This test bridges the gap between your specific observations and the "truth" you are comparing them against. Unlike the Z-test, which requires knowing the true population standard deviation ( $\sigma$ ), the T-test is practical because it uses the sample standard deviation ( $s$ ) as an estimate. This makes it the standard tool for real-world analysis where population parameters are almost never known.

Real-World Scenarios

Quality Control: A factory claims its protein bars contain 20g of protein. You collect a random sample of 31 bars. Is the average really 20g, or are they systematically underfilling (or overfilling)?
Healthcare: A hospital wants to know if the average cholesterol level of a specific patient group differs from the national health goal of 200 mg/dL.
Environmental Safety: An inspector needs to verify if the lead levels in an apartment building exceed the EPA safety clearance level of 10 micrograms per square foot.

When to Use This Test

You should use a One-Sample T-Test when you meet specific criteria. It is not a catch-all for every mean comparison problem.

You HAVE

A single continuous variable (e.g., weight, time, height, pH level).
A known "Test Value" or hypothesized mean ( $\mu_0$ ) to compare against. This could be a legal standard, an industry benchmark, or a historical average.

You Do NOT Have

Two separate groups of people (Use Independent Samples T-Test).
The same people measured twice (Use Paired Samples T-Test).
More than two groups (Use ANOVA).
Categorical data (Use Chi-Square).

Crucial Assumptions

For the results of a t-test to be valid, your data must adhere to four key assumptions. Violating these can lead to incorrect conclusions (false positives or false negatives).

1. Independence

The value of one observation does not influence another. For example, measuring the weight of 30 different people is independent. Measuring the weight of the same person 30 times is dependent (not allowed).

2. Continuous Data

The variable must be interval or ratio level (e.g., grams, seconds, height). Categorical variables (e.g., "Yes/No", "High/Low") cannot be used here.

3. Random Sampling

The data should be obtained via a simple random sample from the population to avoid selection bias. If you only pick the heaviest protein bars, your result will be biased high.

4. Normality

The population from which the sample is drawn should be approximately normally distributed.
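Note: The t-test is "robust" to violations of normality if the sample size is large ( $n > 30$ ) thanks to the Central Limit Theorem. For small samples ( $n < 30$ ), normality is critical. The cutoff of 30 is a rule of thumb rather than a hard boundary; strongly skewed data may need a larger sample.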
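To see what this robustness means in practice, the simulation sketch below draws many samples from a skewed (exponential) population whose true mean is known, so the null hypothesis is true by construction, and counts how often the t-test wrongly rejects it. The population, the sample sizes (5 and 200), the seed, and the number of repetitions are all assumptions chosen for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
alpha = 0.05
true_mean = 1.0          # mean of an exponential population with scale 1
n_experiments = 10_000

def false_positive_rate(n):
    """Share of simulated samples of size n in which a true H0 is rejected."""
    samples = rng.exponential(scale=1.0, size=(n_experiments, n))
    # One t-test per row; H0 (mean == true_mean) is true in every experiment.
    p_values = stats.ttest_1samp(samples, popmean=true_mean, axis=1).pvalue
    return float(np.mean(p_values < alpha))

rate_small = false_positive_rate(5)
rate_large = false_positive_rate(200)

print(f"n =   5: false positive rate = {rate_small:.3f}")
print(f"n = 200: false positive rate = {rate_large:.3f}")
```

With a well-behaved test, the false positive rate should sit near $\alpha$. In this setup the small-sample rate drifts above the nominal level, while the large-sample rate stays close to it.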

Setting up Hypotheses

Before calculating anything, you must define your null ( $H_0$ ) and alternative ( $H_1$ or $H_a$ ) hypotheses. This dictates whether you perform a one-tailed or two-tailed test.

Test Type	Question Asked	Null Hypothesis ( $H_0$ )	Alternative Hypothesis ( $H_1$ )
Two-Tailed	Is the mean different from $\mu_0$ ?	$\mu = \mu_0$	$\mu \neq \mu_0$
Right-Tailed	Is the mean greater than $\mu_0$ ?	$\mu \le \mu_0$	$\mu > \mu_0$
Left-Tailed	Is the mean less than $\mu_0$ ?	$\mu \ge \mu_0$	$\mu < \mu_0$

Example: For the protein bars (Label = 20g), we usually care if it's different in *either* direction, so we use a Two-Tailed test. $H_0: \mu = 20$ , $H_1: \mu \neq 20$ .
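In software, the choice of hypothesis usually becomes a single argument. The sketch below uses the `alternative` parameter of SciPy's `ttest_1samp` (available in SciPy 1.6 and later) on a small set of made-up protein measurements; the data are hypothetical and not taken from the example in this tutorial.

```python
import numpy as np
from scipy import stats

# Hypothetical protein measurements in grams, invented for illustration.
protein_g = np.array([21.8, 19.6, 22.4, 20.9, 23.1, 18.7, 21.5, 22.0, 20.3, 24.2])
mu_0 = 20.0

# H1: mu != mu_0
two_tailed = stats.ttest_1samp(protein_g, mu_0, alternative="two-sided")
# H1: mu > mu_0
right_tailed = stats.ttest_1samp(protein_g, mu_0, alternative="greater")
# H1: mu < mu_0
left_tailed = stats.ttest_1samp(protein_g, mu_0, alternative="less")

print(f"t statistic (same for all three): {two_tailed.statistic:.3f}")
print(f"two-tailed p   = {two_tailed.pvalue:.4f}")
print(f"right-tailed p = {right_tailed.pvalue:.4f}")
print(f"left-tailed p  = {left_tailed.pvalue:.4f}")
```

The t-statistic is identical in all three cases; only the tail area used for the p-value changes.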

The T-Statistic Formula

The t-statistic essentially answers the question: "How many standard errors away from the hypothesized mean is our sample mean?"

$t = \frac{\bar{x} - \mu_0}{s / \sqrt{n}}$

The Signal (Numerator)

$\bar{x} - \mu_0$

The actual difference between your sample mean and the hypothesized population mean.

The Noise (Denominator)

$SE = \frac{s}{\sqrt{n}}$

The Standard Error. This estimates how much sample means fluctuate naturally.
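The signal-over-noise formula can be written directly as a small function. This is a sketch using only the standard library; the function name `t_statistic` and the five sample values are hypothetical.

```python
import math

def t_statistic(sample, mu_0):
    """Return (sample mean - mu_0) divided by the standard error."""
    n = len(sample)
    mean = sum(sample) / n
    # Sample variance divides by n - 1 (Bessel's correction), giving s squared.
    variance = sum((x - mean) ** 2 for x in sample) / (n - 1)
    standard_error = math.sqrt(variance) / math.sqrt(n)   # the "noise"
    signal = mean - mu_0                                   # the "signal"
    return signal / standard_error

sample = [21.0, 19.5, 22.5, 20.0, 23.0]   # hypothetical measurements
print(f"t = {t_statistic(sample, 20.0):.3f}")
```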

Prerequisite

If the concepts of Standard Error or T-distribution feel unfamiliar, review the Sampling Distributions chapter first. It includes interactive visualizations that build intuition for why we use $t$ instead of $z$ .

Manual Calculation Example

Let us walk through the calculation manually, step-by-step, using the Protein Bar example.

The Scenario

Claim (Hypothesis): Protein bars have 20g of protein.
Sample Data: We collected $n = 31$ bars.
Sample Mean ( $\bar{x}$ ): Calculated average is 21.40g.
Sample Std Dev ( $s$ ): Calculated standard deviation is 2.54g.
Significance Level ( $\alpha$ ): 0.05 (5% risk).

Step 1: Calculate the Difference (Signal)

$\text{Difference} = \bar{x} - \mu_0 = 21.40 - 20 = 1.40$

Step 2: Calculate Standard Error (Noise)

$SE = \frac{s}{\sqrt{n}} = \frac{2.54}{\sqrt{31}} \approx \frac{2.54}{5.568} \approx 0.456$

Step 3: Calculate t-Statistic

$t = \frac{\text{Signal}}{\text{Noise}} = \frac{1.40}{0.456} \approx 3.07$

Our sample mean is 3.07 standard errors away from the hypothesized mean.

Step 4: Determine Critical Value & Decision

Degrees of Freedom (df): $n - 1 = 31 - 1 = 30$

Look up the two-tailed critical value for $\alpha = 0.05$ and $df = 30$ in a t-table (that is, $t_{0.025, 30}$ , with 2.5% in each tail). The critical value is approximately 2.042.

Comparison:

$|t_{calc}| = 3.07$ vs $t_{crit} = 2.042$

Since $3.07 > 2.042$ , our result is in the "rejection region."

Conclusion: Reject $H_0$ . The mean protein content is likely NOT 20g; the sample indicates the bars contain significantly more on average.
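The four steps above can be reproduced from the summary statistics alone. The sketch below assumes SciPy is available and uses `scipy.stats.t.ppf` in place of a printed t-table; the variable names are illustrative.

```python
import math
from scipy import stats

# Summary statistics from the protein bar scenario.
x_bar, mu_0, s, n, alpha = 21.40, 20.0, 2.54, 31, 0.05

# Step 1: difference (signal)
signal = x_bar - mu_0
# Step 2: standard error (noise)
standard_error = s / math.sqrt(n)
# Step 3: t-statistic
t_calc = signal / standard_error
# Step 4: two-tailed critical value and decision
df = n - 1
t_crit = stats.t.ppf(1 - alpha / 2, df)
reject_h0 = abs(t_calc) > t_crit

print(f"signal = {signal:.2f}, SE = {standard_error:.3f}")
print(f"t = {t_calc:.3f}, df = {df}, t_crit = {t_crit:.3f}")
print("Reject H0" if reject_h0 else "Fail to reject H0")
```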

Interactive Demo: T-Test Calculator

Adjust the sample statistics and see the t-statistic calculation and decision in real-time. Watch how the t-value changes position on the distribution curve.

Interpreting Results (P-Value)

In modern analysis (using software like JMP, SPSS, or Python), we rely more on the P-Value than manually comparing t-statistics to critical values.

The P-Value Logic

The P-value answers: "If the true mean were actually 20g, what is the probability of randomly drawing a sample whose mean is at least as far from 20g as 21.4g?"

P-Value $< \alpha$ (e.g., $0.0046 < 0.05$ ): Highly unlikely to happen by chance. Reject $H_0$ .
P-Value $> \alpha$ (e.g., $0.23 > 0.05$ ): Could easily happen by chance. Fail to Reject $H_0$ .

For our protein bar example, software calculates a p-value of 0.0046. This means that, if the label were correct (a true mean of 20g), there would be only about a 0.46% chance of observing a sample mean at least this far from 20g. That is strong evidence against the labeled value.
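As a sketch of where such a p-value comes from, the tail area of the t-distribution can be computed with `scipy.stats.t.sf` (the survival function, i.e. one minus the CDF). Because this starts from the rounded summary statistics rather than the raw data, the result should land close to, but not necessarily exactly on, the value quoted above.

```python
import math
from scipy import stats

# Summary statistics from the protein bar scenario.
x_bar, mu_0, s, n = 21.40, 20.0, 2.54, 31
t_calc = (x_bar - mu_0) / (s / math.sqrt(n))
df = n - 1

# Two-tailed p-value: area in both tails beyond |t|.
p_two_tailed = 2 * stats.t.sf(abs(t_calc), df)
# Right-tailed p-value: area above t only (for H1: mu > mu_0).
p_right = stats.t.sf(t_calc, df)

print(f"t = {t_calc:.3f}, df = {df}")
print(f"two-tailed p   = {p_two_tailed:.4f}")
print(f"right-tailed p = {p_right:.4f}")
```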

Checking Normality

Before trusting your t-test, you should visualize your data to check if the normality assumption holds.

1. Histogram

Look for a bell shape. It does not need to be perfect, but it should not be heavily skewed or have massive outliers.

"Is the distribution roughly symmetric and mound-shaped?"

2. Q-Q Plot (Quantile-Quantile)

Points should roughly follow a straight diagonal line. If they curve significantly, the data might not be normal.

"Do the sample quantiles match theoretical normal quantiles?"
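The Q-Q check can also be summarized numerically. The sketch below uses `scipy.stats.probplot`, which returns the theoretical and ordered sample quantiles along with the correlation coefficient of the fitted line; a value near 1 means the points hug the diagonal. The two synthetic datasets, the seed, and the helper name `qq_correlation` are assumptions made for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
normal_data = rng.normal(loc=20.0, scale=2.5, size=200)   # roughly bell-shaped
skewed_data = rng.exponential(scale=2.0, size=200)        # strongly right-skewed

def qq_correlation(data):
    """Correlation between sample quantiles and theoretical normal quantiles."""
    # probplot returns ((theoretical, ordered), (slope, intercept, r)).
    (_, _), (_, _, r) = stats.probplot(data, dist="norm")
    return float(r)

r_normal = qq_correlation(normal_data)
r_skewed = qq_correlation(skewed_data)

print(f"normal data: r = {r_normal:.4f}")
print(f"skewed data: r = {r_skewed:.4f}")
```

This number is a convenience, not a substitute for looking at the plot: the shape of the departure from the line (curved, S-shaped, a few stray points) tells you what kind of violation you have.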

What if data is not normal?

If the sample size is small ( $n < 30$ ) and the data is clearly skewed or contains extreme outliers, the t-test results may be unreliable. Consider a non-parametric alternative such as the Wilcoxon Signed-Rank Test instead.
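A minimal sketch of the non-parametric route, assuming SciPy's `scipy.stats.wilcoxon`: subtract the hypothesized value from each observation and test whether the differences are centered on zero. The ten measurements below are hypothetical. Note that this test addresses the location (median) of a symmetric distribution of differences rather than the mean, so it answers a slightly different question than the t-test.

```python
import numpy as np
from scipy import stats

# Hypothetical small, right-skewed sample (grams), invented for illustration.
sample = np.array([20.4, 20.9, 21.3, 19.9, 22.8, 20.6, 25.1, 21.7, 20.3, 23.4])
mu_0 = 20.0

# Wilcoxon signed-rank test on the differences from the hypothesized value.
result = stats.wilcoxon(sample - mu_0)

print(f"W = {result.statistic}, p = {result.pvalue:.4f}")
```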

Interactive Demo: Normality Diagnostics

Explore how different data distributions appear in histogram and Q-Q plots. Learn to visually diagnose normality violations.

How to check: Look for a bell shape in the histogram and points following the diagonal reference line in the Q-Q plot. If n > 30, the t-test is robust to mild violations due to the Central Limit Theorem.
