What is A/B Testing?
Imagine you have an idea: "Changing this button from blue to green will get more clicks." How do you know if you're right?
You could just change it and see what happens. But maybe sales went up because it was a holiday, not because of the button. You'd never know for sure.
A/B Testing is the scientific way to find out. You split your users into two groups. Group A sees the old blue button. Group B sees the new green button. Since everything else is the same, if Group B clicks more, you know the green button caused it.
The "Fair Test" Rule
The key is randomness. You can't just show the new button to VIP users and the old one to new users. That's not a fair fight. By flipping a coin for every user, you ensure the two groups are practically identical before the test starts.
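In code, that coin flip is usually a deterministic hash of the user ID, so the same visitor always lands in the same group on every visit. A minimal sketch (the `assign_variant` helper and the experiment name are hypothetical):

```python
import hashlib

def assign_variant(user_id: str, experiment: str = "green_button") -> str:
    """Deterministic 50/50 'coin flip' based on the user ID.

    Hashing (experiment + user_id) behaves like a random flip across users,
    but the same user always gets the same answer on repeat visits.
    """
    digest = hashlib.md5(f"{experiment}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100          # a number from 0 to 99
    return "A" if bucket < 50 else "B"      # A = old blue, B = new green

print(assign_variant("user_42"))            # always the same group for user_42
```

Randomizing on the user rather than the page view also prevents one person from seeing both buttons across visits.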
Setting the Rules
Before we start, we need to agree on what we're testing. In statistics, we act like a judge in a courtroom. We assume the change does nothing until the evidence is overwhelming.
The Skeptic ($H_0$)
Also called the Null Hypothesis. This is the assumption that your idea didn't work. "The green button is just as good (or bad) as the blue one."
The Discovery ($H_1$)
Also called the Alternative Hypothesis. This is what you hope to prove. "The green button is actually different."
Type I & Type II Errors
We can never be 100% sure. We deal in probabilities. There are two ways to be wrong, and they have real business consequences.
| | Truth: $H_0$ is True (No real effect) | Truth: $H_1$ is True (Real effect exists) |
|---|---|---|
| We Decide: Reject $H_0$ | Type I Error (False Positive). We thought it worked, but it was luck. Prob = $\alpha$ | Correct Decision (True Positive). We found a real win! Prob = $1 - \beta$ (Power) |
| We Decide: Keep $H_0$ | Correct Decision (True Negative). We rightly ignored the noise. | Type II Error (False Negative). We missed a real opportunity. Prob = $\beta$ |
Business Impact Example: Light Bulb Packaging
Type I Error (False Positive): You roll out the new packaging to all stores based on a fluke result. The packaging actually doesn't help. You spent $50,000 on new printing plates for nothing.
Type II Error (False Negative): The new packaging would have increased sales by 5%, but your test was too short or too small to detect it. You abandon a great idea and lose $100,000/year in potential revenue.
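A quick simulation makes the two error rates concrete. The sketch below (made-up conversion rates and group sizes) runs thousands of fake experiments using the two-sample Z-test defined in the next section: first with no real difference, so every rejection is a Type I error, then with a genuine 5% to 6% lift, so the rejection rate is the power.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, alpha, sims = 2000, 0.05, 5000      # users per group, significance level, repeats

def simulated_p_value(p_a, p_b):
    """Simulate one experiment and return the two-sided p-value of a pooled Z-test."""
    conv_a, conv_b = rng.binomial(n, p_a), rng.binomial(n, p_b)
    p_pool = (conv_a + conv_b) / (2 * n)
    se = np.sqrt(p_pool * (1 - p_pool) * 2 / n)
    z = (conv_b / n - conv_a / n) / se
    return 2 * (1 - stats.norm.cdf(abs(z)))

# H0 is really true (both buttons convert at 5%): any rejection is a Type I error
type_1 = np.mean([simulated_p_value(0.05, 0.05) < alpha for _ in range(sims)])
# H1 is really true (green converts at 6%): the rejection rate is the power
power = np.mean([simulated_p_value(0.05, 0.06) < alpha for _ in range(sims)])

print(f"Type I rate ≈ {type_1:.3f} (close to alpha = {alpha})")
print(f"Power ≈ {power:.3f}, so Type II rate ≈ {1 - power:.3f}")
```

With only 2,000 users per group, the power should come out well below the 80% industry standard, which is exactly the problem the Power Analysis section below tackles.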
The Math (Two-Sample Z-Test)
How do we know if the difference between Group A and Group B is real? We calculate a Z-score, which tells us how many standard deviations away our observed difference is from zero (no difference):
$$Z = \frac{\hat{p}_B - \hat{p}_A}{\sqrt{\hat{p}(1-\hat{p})\left(\tfrac{1}{n_A} + \tfrac{1}{n_B}\right)}}$$
where $\hat{p}_A$ and $\hat{p}_B$ are the observed conversion rates and $\hat{p}$ is the pooled rate across both groups.
If $|Z| > 1.96$, we are 95% confident the difference is not due to chance ($\alpha = 0.05$).
Critical Z-Values Table
| Confidence Level | $\alpha$ (two-sided) | Critical $Z$ |
|---|---|---|
| 90% | 0.10 | 1.645 |
| 95% | 0.05 | 1.960 |
| 99% | 0.01 | 2.576 |
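A minimal sketch of that calculation in Python (the conversion counts below are made up for illustration):

```python
import numpy as np
from scipy import stats

def two_proportion_ztest(conv_a, n_a, conv_b, n_b):
    """Pooled two-sample Z-test for conversion rates.

    conv_a, conv_b : conversions (e.g., clicks or purchases) in each group
    n_a, n_b       : users exposed to each variant
    """
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)           # conversion rate under H0
    se = np.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - stats.norm.cdf(abs(z)))         # two-sided
    return z, p_value

# Hypothetical example: blue button 500/10,000 clicks, green button 580/10,000
z, p = two_proportion_ztest(500, 10_000, 580, 10_000)
print(f"Z = {z:.2f}, p-value = {p:.4f}")
```

statsmodels ships an equivalent test as `proportions_ztest` if you prefer a library call.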
Power Analysis
Before starting, you must ask: "How many users do I need?" If you stop too early, you might miss a real effect (Type II Error). This is called being "underpowered."
Statistical Power ($1 - \beta$)
The probability of detecting an effect if it actually exists. Industry standard is 80%, meaning you have an 80% chance of finding a true effect. Some critical tests (e.g., medical) require 90%.
Power is affected by four levers:
- Baseline Rate: Current conversion rate (e.g., 5%). Higher baselines are easier to shift.
- MDE (Minimum Detectable Effect): The smallest lift you care about (e.g., +1%). Detecting smaller lifts requires much more data.
- Significance Level ($\alpha$): Usually 5%. A lower (stricter) $\alpha$ requires more data.
- Sample Size ($n$): Bigger $n$ = more power. This is the main lever you control.
Sample Size Calculation
The formula for required sample size per group for a two-proportion Z-test is:
$$n = \frac{(z_{1-\alpha/2} + z_{1-\beta})^2 \left[p_1(1-p_1) + p_2(1-p_2)\right]}{(p_1 - p_2)^2}$$
Numerical Example: Bulb Packaging
You want to test if a new packaging increases sales rate from 5% to 6% (1% lift).
- Baseline ($p_1$) = 5%
- Expected ($p_2$) = 6%
- $\alpha$ = 0.05, $\beta$ = 0.20
- $z_{1-\alpha/2}$ = 1.96, $z_{1-\beta}$ = 0.84
Result: You need ~8,100 customers per group (~16,300 total) to detect a 1% absolute lift with 80% power.
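The same calculation as a short Python sketch, plugging the formula above into SciPy with the example's numbers:

```python
from scipy.stats import norm

def sample_size_per_group(p1, p2, alpha=0.05, power=0.80):
    """Required users per group for a two-sided, two-proportion Z-test."""
    z_alpha = norm.ppf(1 - alpha / 2)            # 1.96 for alpha = 0.05
    z_beta = norm.ppf(power)                     # 0.84 for power = 0.80
    variance = p1 * (1 - p1) + p2 * (1 - p2)     # sum of the two group variances
    return (z_alpha + z_beta) ** 2 * variance / (p1 - p2) ** 2

n = sample_size_per_group(0.05, 0.06)
print(f"~{n:,.0f} customers per group, ~{2 * n:,.0f} total")
```

Plugging in the 5% to 6% example prints roughly 8,100 per group, matching the result above.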
Case Study: Bulb Packaging Redesign
The Scenario
Your light bulb company has been selling bulbs in plain cardboard boxes. The marketing team proposes a premium-looking box with a window showing the bulb. They believe it will increase the purchase rate by 10%.
The Setup
- Control (A): Existing cardboard box. Shown to 50% of store visitors randomly.
- Treatment (B): New premium box. Shown to the other 50%.
- Metric: Purchase rate (buyers / visitors).
- Duration: 2 weeks (to account for weekend/weekday effects).
The Results
Lift: +0.7% absolute (+13.5% relative). P-value: 0.018. Decision: Statistically significant at $\alpha = 0.05$. Roll out the new packaging!
Interactive Simulator
Run a simulated A/B test. In this simulation, Control (A) represents your current baseline performance, while Treatment (B) has the added "True Lift". Watch how the p-value fluctuates early on (why you shouldn't peek!) and how it converges as sample size grows. This demonstrates why pre-determining your sample size is critical.
Simulator panels: Distribution Overlap (as N grows, the curves narrow) and Convergence History (avoid "peeking" early!).
Common Pitfalls
1. Peeking (P-Hacking)
Checking the results every day and stopping as soon as $p < 0.05$. This inflates your False Positive rate massively. If you peek 10 times, your true $\alpha$ is closer to 26%, not 5%.
Solution: Use Sequential Testing or decide sample size before starting and stick to it.
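A rough simulation of the problem (the exact inflation depends on how often and when you peek; the batch sizes and rates below are made up): we run A/A tests where nothing is actually different, peek after every batch of users, and stop the moment the p-value dips below 0.05.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
sims, peeks, batch, p = 2000, 10, 1000, 0.05   # A/A test: no real effect exists

false_positives = 0
for _ in range(sims):
    a = rng.binomial(1, p, peeks * batch)       # control conversions, one per user
    b = rng.binomial(1, p, peeks * batch)       # "treatment" with the same true rate
    for k in range(1, peeks + 1):               # peek after each batch of users
        n = k * batch
        p_pool = (a[:n].sum() + b[:n].sum()) / (2 * n)
        se = np.sqrt(p_pool * (1 - p_pool) * 2 / n)
        z = (b[:n].mean() - a[:n].mean()) / se
        if 2 * (1 - stats.norm.cdf(abs(z))) < 0.05:   # "significant!" -> ship it
            false_positives += 1
            break

print(f"False positive rate with {peeks} peeks: {false_positives / sims:.1%}")
```

Even though each individual look uses $\alpha = 0.05$, giving yourself ten chances to stop early pushes the overall false positive rate far above 5%.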
2. Novelty Effect
Customers respond to the new bulb packaging just because it's new. The lift disappears after a week as they get used to it.
Solution: Run tests for at least 1-2 full business cycles (weeks). Analyze day-over-day trends.
3. Simpson's Paradox
The new packaging looks better overall, but is worse for every single store. This happens if the traffic mix shifted (e.g., more visitors to high-performing stores in Treatment).
Solution: Always check segments (by store, user type, device).
4. Network Effects / Spillover
If Treatment users tell Control users about the new packaging, the Control group is contaminated. This is a violation of SUTVA (Stable Unit Treatment Value Assumption).
Solution: Use cluster randomization (randomize by geography, not individual).
ML Applications
Model Deployment
You trained a new Recommender System. Offline metrics (RMSE) look great. But does it drive revenue? You A/B test Model A (Control) vs Model B (Treatment) on live traffic. The metric that matters is business revenue, not RMSE.
Multi-Armed Bandits
A/B testing wastes traffic on the loser. Bandit algorithms (Thompson Sampling, $\epsilon$-Greedy) dynamically shift traffic to the winner during the test. It's "Exploration vs Exploitation." Used by Google for ad optimization.
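A minimal Thompson Sampling sketch for two "arms" (the conversion rates are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
true_rates = [0.05, 0.06]        # hypothetical: arm B is truly better
wins = np.ones(2)                # Beta(1, 1) prior pseudo-counts per arm
losses = np.ones(2)

for _ in range(10_000):          # each iteration = one visitor
    sampled = rng.beta(wins, losses)             # sample a plausible rate per arm
    arm = int(np.argmax(sampled))                # play the arm that looks best now
    converted = rng.random() < true_rates[arm]   # observe that visitor's outcome
    wins[arm] += converted
    losses[arm] += 1 - converted

traffic = wins + losses - 2      # subtract the prior pseudo-counts
print(f"Traffic share: A = {traffic[0] / traffic.sum():.0%}, B = {traffic[1] / traffic.sum():.0%}")
```

Over time most of the traffic drifts to B (exploitation), while the randomness in the Beta samples keeps occasionally trying A (exploration).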
Hyperparameter Tuning
Is learning rate 0.001 significantly better than 0.0001? You can run A/B-like comparisons in cross-validation folds to ensure your "best" hyperparameter is statistically significantly better, not just lucky on one fold.
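A sketch of that idea with scikit-learn (the dataset, the model, and the two regularization strengths are placeholders; fold scores are not fully independent, so treat the p-value as a sanity check rather than proof):

```python
import numpy as np
from scipy import stats
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = make_classification(n_samples=2000, n_informative=5, random_state=0)
cv = KFold(n_splits=10, shuffle=True, random_state=0)   # identical folds for both candidates

# Hypothetical comparison: two regularization strengths for the same model
scores_a = cross_val_score(LogisticRegression(C=0.01, max_iter=1000), X, y, cv=cv)
scores_b = cross_val_score(LogisticRegression(C=1.0, max_iter=1000), X, y, cv=cv)

# Paired test: each fold is scored by both candidates, so compare fold-by-fold
t, p = stats.ttest_rel(scores_b, scores_a)
print(f"mean lift = {scores_b.mean() - scores_a.mean():+.4f}, p = {p:.3f}")
```

The key detail is reusing the exact same folds for both candidates so the comparison is paired.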
Causal ML
When A/B testing is impossible (ethics, cost), we use Causal Inference techniques (Propensity Score Matching, Instrumental Variables) to estimate treatment effects from observational data.