What is A/B Testing?
Imagine you have an idea: "Changing this button from blue to green will get more clicks." How do you know if you're right?
You could just change it and see what happens. But maybe sales went up because it was a holiday, not because of the button. You'd never know for sure.
A/B Testing is the scientific way to find out. You split your users into two groups. Group A sees the old blue button. Group B sees the new green button. Since everything else is the same, if Group B clicks more, you know the green button caused it.
The "Fair Test" Rule
The key is randomness. You can't just show the new button to VIP users and the old one to new users. That's not a fair fight. By flipping a coin for every user, you ensure the two groups are practically identical before the test starts.
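In code, that coin flip is usually a deterministic hash of the user ID, so the same visitor always lands in the same group on every visit. A minimal sketch (the `assign_variant` helper and the experiment name are hypothetical):

```python
import hashlib

def assign_variant(user_id: str, experiment: str = "green_button") -> str:
    """Deterministic 50/50 'coin flip' based on the user ID.

    Hashing (experiment + user_id) behaves like a random flip across users,
    but the same user always gets the same answer on repeat visits.
    """
    digest = hashlib.md5(f"{experiment}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100          # a number from 0 to 99
    return "A" if bucket < 50 else "B"      # A = old blue, B = new green

print(assign_variant("user_42"))            # always the same group for user_42
```

Randomizing on the user rather than the page view also prevents one person from seeing both buttons across visits.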
Setting the Rules
Before we start, we need to agree on what we're testing. In statistics, we act like a judge in a courtroom. We assume the change does nothing until the evidence is overwhelming.
The Skeptic ($H_0$)
Also called the Null Hypothesis. This is the assumption that your idea didn't work. "The green button is just as good (or bad) as the blue one."
The Discovery ($H_1$)
Also called the Alternative Hypothesis. This is what you hope to prove. "The green button is actually different."
Type I & Type II Errors
We can never be 100% sure. We deal in probabilities. There are two ways to be wrong, and they have real business consequences.
| | Truth: $H_0$ is True (No real effect) | Truth: $H_1$ is True (Real effect exists) |
|---|---|---|
| We Decide: Reject $H_0$ | Type I Error (False Positive). We thought it worked, but it was luck. Prob = $\alpha$ | Correct Decision (True Positive). We found a real win! Prob = $1 - \beta$ (Power) |
| We Decide: Keep $H_0$ | Correct Decision (True Negative). We rightly ignored the noise. | Type II Error (False Negative). We missed a real opportunity. Prob = $\beta$ |
Business Impact Example: Light Bulb Packaging
Type I Error (False Positive): You roll out the new packaging to all stores based on a fluke result. The packaging actually doesn't help. You spent $50,000 on new printing plates for nothing.
Type II Error (False Negative): The new packaging would have increased sales by 5%, but your test was too short or too small to detect it. You abandon a great idea and lose $100,000/year in potential revenue.
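A quick simulation makes the two error rates concrete. The sketch below (made-up conversion rates and group sizes) runs thousands of fake experiments using the two-sample Z-test defined in the next section: first with no real difference, so every rejection is a Type I error, then with a genuine 5% to 6% lift, so the rejection rate is the power.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, alpha, sims = 2000, 0.05, 5000      # users per group, significance level, repeats

def simulated_p_value(p_a, p_b):
    """Simulate one experiment and return the two-sided p-value of a pooled Z-test."""
    conv_a, conv_b = rng.binomial(n, p_a), rng.binomial(n, p_b)
    p_pool = (conv_a + conv_b) / (2 * n)
    se = np.sqrt(p_pool * (1 - p_pool) * 2 / n)
    z = (conv_b / n - conv_a / n) / se
    return 2 * (1 - stats.norm.cdf(abs(z)))

# H0 is really true (both buttons convert at 5%): any rejection is a Type I error
type_1 = np.mean([simulated_p_value(0.05, 0.05) < alpha for _ in range(sims)])
# H1 is really true (green converts at 6%): the rejection rate is the power
power = np.mean([simulated_p_value(0.05, 0.06) < alpha for _ in range(sims)])

print(f"Type I rate ≈ {type_1:.3f} (close to alpha = {alpha})")
print(f"Power ≈ {power:.3f}, so Type II rate ≈ {1 - power:.3f}")
```

With only 2,000 users per group, the power should come out well below the 80% industry standard, which is exactly the problem the Power Analysis section below tackles.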
The Math (Two-Sample Z-Test)
How do we know if the difference between Group A and Group B is real? We calculate a Z-score, which tells us how many standard deviations away our observed difference is from zero (no difference):
$$Z = \frac{\hat{p}_B - \hat{p}_A}{\sqrt{\hat{p}(1-\hat{p})\left(\tfrac{1}{n_A} + \tfrac{1}{n_B}\right)}}$$
where $\hat{p}_A$ and $\hat{p}_B$ are the observed conversion rates and $\hat{p}$ is the pooled rate across both groups.
If $|Z| > 1.96$, we are 95% confident the difference is not due to chance ($\alpha = 0.05$).
Critical Z-Values Table
| Confidence Level | $\alpha$ (two-sided) | Critical $Z$ |
|---|---|---|
| 90% | 0.10 | 1.645 |
| 95% | 0.05 | 1.960 |
| 99% | 0.01 | 2.576 |
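A minimal sketch of that calculation in Python (the conversion counts below are made up for illustration):

```python
import numpy as np
from scipy import stats

def two_proportion_ztest(conv_a, n_a, conv_b, n_b):
    """Pooled two-sample Z-test for conversion rates.

    conv_a, conv_b : conversions (e.g., clicks or purchases) in each group
    n_a, n_b       : users exposed to each variant
    """
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)           # conversion rate under H0
    se = np.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - stats.norm.cdf(abs(z)))         # two-sided
    return z, p_value

# Hypothetical example: blue button 500/10,000 clicks, green button 580/10,000
z, p = two_proportion_ztest(500, 10_000, 580, 10_000)
print(f"Z = {z:.2f}, p-value = {p:.4f}")
```

statsmodels ships an equivalent test as `proportions_ztest` if you prefer a library call.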
Power Analysis
Before starting, you must ask: "How many users do I need?" If you stop too early, you might miss a real effect (Type II Error). This is called being "underpowered."
Statistical Power ($1 - \beta$)
The probability of detecting an effect if it actually exists. Industry standard is 80%, meaning you have an 80% chance of finding a true effect. Some critical tests (e.g., medical) require 90%.
Power is affected by four levers:
- Baseline Rate: Current conversion rate (e.g., 5%). Higher baselines are easier to shift.
- MDE (Minimum Detectable Effect): The smallest lift you care about (e.g., +1%). Detecting smaller lifts requires much more data.
- Significance Level ($\alpha$): Usually 5%. A lower (stricter) $\alpha$ requires more data.
- Sample Size ($n$): Bigger $n$ = more power. This is the main lever you control.
Sample Size Calculation
The formula for required sample size per group for a two-proportion Z-test is:
$$n = \frac{(z_{1-\alpha/2} + z_{1-\beta})^2 \left[p_1(1-p_1) + p_2(1-p_2)\right]}{(p_1 - p_2)^2}$$
Numerical Example: Bulb Packaging
You want to test if a new packaging increases sales rate from 5% to 6% (1% lift).
- Baseline ($p_1$) = 5%
- Expected ($p_2$) = 6%
- $\alpha$ = 0.05, $\beta$ = 0.20
- $z_{1-\alpha/2}$ = 1.96, $z_{1-\beta}$ = 0.84
Result: You need ~8,100 customers per group (~16,300 total) to detect a 1% absolute lift with 80% power.
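The same calculation as a short Python sketch, plugging the formula above into SciPy with the example's numbers:

```python
from scipy.stats import norm

def sample_size_per_group(p1, p2, alpha=0.05, power=0.80):
    """Required users per group for a two-sided, two-proportion Z-test."""
    z_alpha = norm.ppf(1 - alpha / 2)            # 1.96 for alpha = 0.05
    z_beta = norm.ppf(power)                     # 0.84 for power = 0.80
    variance = p1 * (1 - p1) + p2 * (1 - p2)     # sum of the two group variances
    return (z_alpha + z_beta) ** 2 * variance / (p1 - p2) ** 2

n = sample_size_per_group(0.05, 0.06)
print(f"~{n:,.0f} customers per group, ~{2 * n:,.0f} total")
```

Plugging in the 5% to 6% example prints roughly 8,100 per group, matching the result above.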
Case Study: Bulb Packaging Redesign
The Scenario
Your light bulb company has been selling bulbs in plain cardboard boxes. The marketing team proposes a premium-looking box with a window showing the bulb. They believe it will increase the purchase rate by 10%.
The Setup
- Control (A): Existing cardboard box. Shown to 50% of store visitors randomly.
- Treatment (B): New premium box. Shown to the other 50%.
- Metric: Purchase rate (buyers / visitors).
- Duration: 2 weeks (to account for weekend/weekday effects).
The Results
Lift: +0.7% absolute (+13.5% relative). P-value: 0.018. Decision: Statistically significant at $\alpha = 0.05$. Roll out the new packaging!
Interactive Simulator
Run a simulated A/B test. In this simulation, Control (A) represents your current baseline performance, while Treatment (B) has the added "True Lift". Watch how the p-value fluctuates early on (why you shouldn't peek!) and how it converges as sample size grows. This demonstrates why pre-determining your sample size is critical.
Simulator panels: Distribution Overlap (as N grows, the curves narrow) and Convergence History (avoid "peeking" early!).
Common Pitfalls
1. Peeking (P-Hacking)
Checking the results every day and stopping as soon as $p < 0.05$. This inflates your False Positive rate massively. If you peek 10 times, your true $\alpha$ is closer to 26%, not 5%.
Solution: Use Sequential Testing or decide sample size before starting and stick to it.
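A rough simulation of the problem (the exact inflation depends on how often and when you peek; the batch sizes and rates below are made up): we run A/A tests where nothing is actually different, peek after every batch of users, and stop the moment the p-value dips below 0.05.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
sims, peeks, batch, p = 2000, 10, 1000, 0.05   # A/A test: no real effect exists

false_positives = 0
for _ in range(sims):
    a = rng.binomial(1, p, peeks * batch)       # control conversions, one per user
    b = rng.binomial(1, p, peeks * batch)       # "treatment" with the same true rate
    for k in range(1, peeks + 1):               # peek after each batch of users
        n = k * batch
        p_pool = (a[:n].sum() + b[:n].sum()) / (2 * n)
        se = np.sqrt(p_pool * (1 - p_pool) * 2 / n)
        z = (b[:n].mean() - a[:n].mean()) / se
        if 2 * (1 - stats.norm.cdf(abs(z))) < 0.05:   # "significant!" -> ship it
            false_positives += 1
            break

print(f"False positive rate with {peeks} peeks: {false_positives / sims:.1%}")
```

Even though each individual look uses $\alpha = 0.05$, giving yourself ten chances to stop early pushes the overall false positive rate far above 5%.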
2. Novelty Effect
Customers respond to the new bulb packaging just because it's new. The lift disappears after a week as they get used to it.
Solution: Run tests for at least 1-2 full business cycles (weeks). Analyze day-over-day trends.
3. Simpson's Paradox
The new packaging looks better overall, but is worse for every single store. This happens if the traffic mix shifted (e.g., more visitors to high-performing stores in Treatment).
Solution: Always check segments (by store, user type, device).
4. Network Effects / Spillover
If Treatment users tell Control users about the new packaging, the Control group is contaminated. This is a violation of SUTVA (Stable Unit Treatment Value Assumption).
Solution: Use cluster randomization (randomize by geography, not individual).
ML Applications
Model Deployment
You trained a new Recommender System. Offline metrics (RMSE) look great. But does it drive revenue? You A/B test Model A (Control) vs Model B (Treatment) on live traffic. The metric that matters is business revenue, not RMSE.
Multi-Armed Bandits
A/B testing wastes traffic on the loser. Bandit algorithms (Thompson Sampling, $\epsilon$-Greedy) dynamically shift traffic to the winner during the test. It's "Exploration vs Exploitation." Used by Google for ad optimization.
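A minimal Thompson Sampling sketch for two "arms" (the conversion rates are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
true_rates = [0.05, 0.06]        # hypothetical: arm B is truly better
wins = np.ones(2)                # Beta(1, 1) prior pseudo-counts per arm
losses = np.ones(2)

for _ in range(10_000):          # each iteration = one visitor
    sampled = rng.beta(wins, losses)             # sample a plausible rate per arm
    arm = int(np.argmax(sampled))                # play the arm that looks best now
    converted = rng.random() < true_rates[arm]   # observe that visitor's outcome
    wins[arm] += converted
    losses[arm] += 1 - converted

traffic = wins + losses - 2      # subtract the prior pseudo-counts
print(f"Traffic share: A = {traffic[0] / traffic.sum():.0%}, B = {traffic[1] / traffic.sum():.0%}")
```

Over time most of the traffic drifts to B (exploitation), while the randomness in the Beta samples keeps occasionally trying A (exploration).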
Hyperparameter Tuning
Is learning rate 0.001 significantly better than 0.0001? You can run A/B-like comparisons in cross-validation folds to ensure your "best" hyperparameter is statistically significantly better, not just lucky on one fold.
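A sketch of that idea with scikit-learn (the dataset, the model, and the two regularization strengths are placeholders; fold scores are not fully independent, so treat the p-value as a sanity check rather than proof):

```python
import numpy as np
from scipy import stats
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = make_classification(n_samples=2000, n_informative=5, random_state=0)
cv = KFold(n_splits=10, shuffle=True, random_state=0)   # identical folds for both candidates

# Hypothetical comparison: two regularization strengths for the same model
scores_a = cross_val_score(LogisticRegression(C=0.01, max_iter=1000), X, y, cv=cv)
scores_b = cross_val_score(LogisticRegression(C=1.0, max_iter=1000), X, y, cv=cv)

# Paired test: each fold is scored by both candidates, so compare fold-by-fold
t, p = stats.ttest_rel(scores_b, scores_a)
print(f"mean lift = {scores_b.mean() - scores_a.mean():+.4f}, p = {p:.3f}")
```

The key detail is reusing the exact same folds for both candidates so the comparison is paired.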
Causal ML
When A/B testing is impossible (ethics, cost), we use Causal Inference techniques (Propensity Score Matching, Instrumental Variables) to estimate treatment effects from observational data.