Population vs. Sample

The fundamental distinction that defines Statistical Inference.

Introduction

In a perfect world, a data scientist would possess "God's View" access to every single data point, past, present, and future. If you wanted to know the average height of humans, you would instantly measure all 8 billion people. If you wanted to know whether a user will click an ad, you would observe every possible user in existence.

In reality, we are constrained by time, cost, and physics. We cannot measure everything. We must rely on a subset of reality to understand the whole. This is the core conflict of statistics: The Truth (Population) vs. The Evidence (Sample).

The Intuition

Think of soup tasting.

  • Population: The entire pot of soup. To know the exact flavor profile perfectly, you would need to drink the entire pot (which destroys the product).
  • Sample: A single spoonful.
  • Inference: If the spoonful is salty, you assume the whole pot is salty.

But what if you didn't stir the pot? Your spoonful might be bland while the bottom is pure salt. This problem is called sampling bias.

|  | Population | Sample |
| --- | --- | --- |
| What it is | The whole group (e.g., N = 100) | The observed subset (e.g., n = 10) |
| What we compute | Parameters: $\mu$, $\sigma^2$ | Statistics: $\bar{x}$, $s^2$ |
| Role | The ground truth | The basis for estimation |

Random sampling takes us from the population to the sample; estimation takes us back.

Formal Definitions

Population ($N$)

The entire set of all elements (individuals, objects, events) that are of interest for a specific study. It is the "ground truth."

Examples: All pixels in an image, all transactions in a bank's history, every human on Earth.

Sample ($n$)

A subset of the population selected for analysis. We collect data from the sample to estimate truths about the population.

Examples: A 224x224 crop of an image, the last 1000 transactions, a survey of 500 people.

The Notation

In interviews and literature, the notation tells you immediately if we are discussing the theoretical truth (Population) or the calculated estimate (Sample).

| Metric | Population Parameter | Sample Statistic | Relationship |
| --- | --- | --- | --- |
| Size | $N$ | $n$ | Usually $n \ll N$ |
| Mean | $\mu$ (mu) | $\bar{x}$ (x-bar) | $\bar{x}$ estimates $\mu$ |
| Variance | $\sigma^2$ (sigma squared) | $s^2$ | $s^2$ estimates $\sigma^2$ |
| Std. Deviation | $\sigma$ (sigma) | $s$ | $s$ estimates $\sigma$ |
| Proportion | $p$ or $\pi$ | $\hat{p}$ (p-hat) | $\hat{p}$ estimates $p$ |

Core Formulas

Population Formulas

Mean
$\mu = \frac{\sum X_i}{N}$
Variance
$\sigma^2 = \frac{\sum (X_i - \mu)^2}{N}$
Std. Deviation
$\sigma = \sqrt{\sigma^2}$

Sample Formulas

Mean
$\bar{x} = \frac{\sum x_i}{n}$
Variance
$s^2 = \frac{\sum (x_i - \bar{x})^2}{n-1}$

*Note the $n-1$ correction.*

Std. Deviation
$s = \sqrt{s^2}$
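
As a sketch of how these formulas map onto code, the snippet below uses NumPy's `ddof` ("delta degrees of freedom") argument, which sets the divisor to N − ddof; the lifespan values are made up for illustration.

```python
# A minimal sketch mapping the formulas above onto NumPy.
# ddof=0 divides by N (population formula); ddof=1 divides by n-1 (sample formula).
import numpy as np

data = np.array([998.0, 1021.0, 1055.0, 1049.0, 1102.0])  # hypothetical lifespans (hours)

mean = data.mean()             # same formula for mu and x-bar
pop_var = data.var(ddof=0)     # divide by N
sample_var = data.var(ddof=1)  # divide by n-1 (Bessel's correction, explained below)
sample_std = data.std(ddof=1)  # s = sqrt(s^2)

print(f"mean={mean:.1f}  pop var={pop_var:.1f}  sample var={sample_var:.1f}  s={sample_std:.1f}")
```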

Engineering Case Study: Light Bulb Quality Control

You're a manufacturing engineer at a light bulb factory. Your company sells bulbs with a warranty: "Each bulb lasts at least 1,000 hours on average." To honor this warranty, you need to know the true average lifespan ($\mu$) of your product.

Here's your dilemma: you can't test all of the millions of bulbs to find their average lifespan. You have two realistic options.

Scenario A: The Population Approach (Test Every Bulb)

The Plan

You test every single bulb before shipping. Each bulb requires thousands of hours of testing, and you destroy it in the process.

What You Get

Perfect Knowledge: You know $\mu$ (the mean) exactly. Zero estimation error.

Statistical Result: If the true population mean is $\mu = 1047$ hours, you discover it's exactly 1,047 hours.

The Fatal Problem

You've destroyed your entire inventory. You have zero bulbs left to sell. Your business is dead before you know the answer.

In Statistics Jargon: This is the Destructive Testing Problem. When testing destroys the product, measuring the entire population is self-defeating.

Scenario B: The Smart Sampling Approach (Test 1,000 Bulbs)

The Plan

You randomly select 1,000 bulbs from production and test them to failure. You calculate their average lifespan $\bar{x}$ to estimate $\mu$.

What You Get

Fast Estimate: Your sample gives you $\bar{x} = 1045$ hours. You infer that the true $\mu \approx 1045$.

Survivable Cost: You only lose 1,000 bulbs to testing, not millions. You still have inventory to sell.

The Trade-off: Uncertainty

Your sample mean $\bar{x} = 1045$ is your best estimate, but it's not exact. There's sampling error.

We'll learn more about this in Sampling Distributions.

Warranty Decision: You safely offer "1,000 hours guaranteed." Even the low end of your estimate (about 1,036 hours, as we'll see below) leaves a 36-hour buffer against worst-case scenarios.

The Key Insight

Perfect knowledge doesn't matter if you go bankrupt getting it. A ±9-hour estimate from 1,000 tested bulbs beats destroying millions of bulbs for "exact" knowledge. Sampling trades a small error margin for a functioning business.
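
To see the trade-off concretely, here is a minimal simulation of Scenario B, assuming lifespans follow a normal distribution with $\mu = 1047$ and a hypothetical standard deviation of 150 hours:

```python
# A minimal sketch of Scenario B, assuming lifespans ~ Normal(1047, 150)
# (the standard deviation is a made-up value for illustration).
import numpy as np

rng = np.random.default_rng(42)
population = rng.normal(loc=1047, scale=150, size=1_000_000)  # every bulb made this year
mu = population.mean()  # "God's View" answer, knowable only inside a simulation

sample = rng.choice(population, size=1000, replace=False)  # destructively test 1,000 bulbs
x_bar = sample.mean()

print(f"True mean:      {mu:.1f} hours")
print(f"Sample mean:    {x_bar:.1f} hours")
print(f"Sampling error: {x_bar - mu:+.1f} hours")
```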

Sampling Strategies

How you select your sample determines if your data is valid or garbage. Let's compare strategies using our light bulb factory scenario.

Why Sampling Method Matters

Imagine your 1,000-bulb sample came from a single bad production batch. Your estimate would be biased low: you'd think the mean lifespan is 1,045 hours when it's really, say, 1,065 hours.

Solution: Use proper sampling strategies. Different methods ensure your sample represents the entire population, not just a convenient slice of it.

The Problem Statement

Let's extend our light bulb factory example:

Goal:

Estimate the true average lifespan ($\mu$) of 1 million bulbs produced this year ($N$ = 1,000,000).

Constraint:

Testing destroys the bulb. You can only afford to test 1,000 bulbs ($n$ = 1,000).

1. Simple Random Sampling

Every bulb in the warehouse has exactly the same probability of being selected. Requires a complete inventory list (a Sampling Frame) and a random number generator. Statistically pure, but can be impractical for massive warehouses.

Scenario: Assign a unique ID to every bulb. Use a random number generator to pick 1,000 IDs. Test those bulbs.


2. Stratified Sampling

Divide the population into subgroups (strata) based on shared characteristics, then sample randomly from each stratum proportionally. This guarantees representation of all subgroups.

Scenario: If 70% of bulbs come from Line A and 30% from Line B, sample 700 from Line A and 300 from Line B. This ensures both production lines are represented.


3. Cluster Sampling

Divide the population into natural groups (clusters). Randomly select a few entire clusters and test every unit in them. Cheaper than random sampling but assumes each cluster is a mini-representation of the whole.

Scenario: The factory ships bulbs in boxes of 50. Randomly select 20 boxes and test every bulb in those boxes.


4. Systematic Sampling

Line up the population and select every $k$-th unit, starting at a random point, where $k = N/n$. Easier to implement than full random sampling: just pick a starting point and count.

Scenario: As bulbs roll off the assembly line, pick every 1,000th bulb, starting at a random position between 1 and 1,000.


5. Convenience Sampling

Sample whatever is easiest to reach. This introduces massive selection bias: "easy to reach" units often share specific traits. Avoid it in rigorous analysis.

Scenario: Test the first 1,000 bulbs that come off the line on Monday morning. Problem: Monday batches might be worse due to machine warm-up issues.

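As a code sketch, here is how the first four strategies might select bulb indices, assuming bulbs are numbered 0 to N−1, a 70/30 split between Lines A and B, and boxes of 50:

```python
# A minimal sketch of strategies 1-4 as index selection.
# The line and box assignments are hypothetical.
import numpy as np

rng = np.random.default_rng(0)
N, n = 1_000_000, 1000

# 1. Simple random: every bulb has an equal chance of selection.
simple = rng.choice(N, size=n, replace=False)

# 2. Stratified: sample each production line proportionally (70% A, 30% B).
line_a = np.arange(0, 700_000)
line_b = np.arange(700_000, N)
stratified = np.concatenate([
    rng.choice(line_a, size=700, replace=False),
    rng.choice(line_b, size=300, replace=False),
])

# 3. Cluster: pick 20 whole boxes of 50 bulbs and test everything in them.
boxes = rng.choice(N // 50, size=20, replace=False)
cluster = np.concatenate([np.arange(b * 50, (b + 1) * 50) for b in boxes])

# 4. Systematic: every k-th bulb, starting at a random offset.
k = N // n  # k = 1000
start = rng.integers(0, k)
systematic = np.arange(start, N, k)

for name, idx in [("simple", simple), ("stratified", stratified),
                  ("cluster", cluster), ("systematic", systematic)]:
    print(f"{name:>10}: {len(idx)} bulbs selected")
```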

Bias & Representativeness

A sample is Representative if its distribution of attributes matches the population. If not, it is Biased.

Survivorship Bias: The WWII Planes

The Problem:

During WWII, the US military wanted to add protective armor to their bombers to reduce casualties. Armor is heavy, so they could only add it to the most critical areas. To decide where, they analyzed planes returning from missions and mapped where the bullet holes were concentrated, mostly in the wings and fuselage.

Their Initial Plan:

Add armor (protective metal plating) to the wings and fuselage, the areas with the most bullet holes.

The Fatal Flaw:

Statistician Abraham Wald pointed out: You're only looking at planes that survived. The planes hit in the engines and cockpit didn't return; they crashed. The areas with no bullet holes on surviving planes are precisely where you should add armor. Those hits were fatal.

Population:

All planes that flew missions (survivors + crashed)

Sample:

Only planes that returned (survivors)

The Lesson for Data Scientists:

Your data only contains records of things that made it into your dataset. Ask yourself: What's missing? When analyzing customer behavior, you're seeing active customers, not churned ones. When studying successful startups, you're missing the 90% that failed quietly.

[Figure: "Analysis Report: B-17F" (REF: WALD-1943-SEC, SRG-45), visualizing bullet-hole locations on returning bombers. The data shows heavy concentration on the wings and fuselage, with no damage observed on the engines or cockpit.]

Bessel's Correction

This is a favorite interview question. Why do the formulas for variance differ?

Population Variance
$\sigma^2 = \frac{\sum (X_i - \mu)^2}{N}$

Divide by $N$.

Sample Variance
$s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1}$

Divide by $n - 1$.

Why $n-1$?

When we calculate sample variance, we don't know the true population mean $\mu$, so we use the sample mean $\bar{x}$ inside the formula.

However, the sample data is always closer to its own mean ($\bar{x}$) than it is to the true population mean ($\mu$). This makes the numerator slightly smaller than it should be, resulting in a Biased Estimate (specifically, an underestimation) of the variance. Dividing by $n-1$ instead of $n$ slightly increases the result, correcting this bias.

Numerical Proof (The "Tiny Production Run")

Imagine a tiny factory that produced just 3 bulbs ever. Their lifespans were $\{1, 2, 3\}$ years.
True Population Mean $\mu = 2$. True Population Variance $\sigma^2 = 2/3 \approx 0.67$.

You pick a sample of 2 bulbs (e.g., $\{1, 2\}$):
Sample Mean $\bar{x} = 1.5$.

Using $N$ (Biased)

Variance $= 0.25$

(Too low! $0.25 < 0.67$)

Using $n - 1$ (Corrected)

Variance $= 0.50$

(Closer to $0.67$)

If you average $s^2$ across all possible samples (drawn with replacement), the $n-1$ formula recovers the population variance exactly.
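
You can check this numerically. The sketch below enumerates every ordered size-2 sample drawn with replacement from the three-bulb population and averages both variance formulas:

```python
# A minimal sketch verifying Bessel's correction on the tiny production run.
from itertools import product

population = [1, 2, 3]
N = len(population)
mu = sum(population) / N
sigma_sq = sum((x - mu) ** 2 for x in population) / N  # true variance = 2/3

biased, unbiased = [], []
for sample in product(population, repeat=2):  # all 9 ordered samples
    n = len(sample)
    x_bar = sum(sample) / n
    ss = sum((x - x_bar) ** 2 for x in sample)
    biased.append(ss / n)          # divide by n
    unbiased.append(ss / (n - 1))  # divide by n-1

print(f"True population variance: {sigma_sq:.4f}")                       # 0.6667
print(f"Average of biased s^2:    {sum(biased) / len(biased):.4f}")      # 0.3333
print(f"Average of unbiased s^2:  {sum(unbiased) / len(unbiased):.4f}")  # 0.6667
```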

The Goal: Statistical Inference

We don't calculate sample statistics just to describe the sample; we do it to infer the truth about the population.

Descriptive Statistics

Summarizing the data you have.

"The average lifespan of our 1,000 tested bulbs is 1,045 hours."

Inferential Statistics

Predicting the data you don't have.

"We are 95% confident the average lifespan of ALL 1M bulbs is between 1,036 and 1,054 hours."

The "Black Box" Model

[Diagram: The Population is a sealed box with unknown parameters ($\mu$, $\sigma$). Sampling leads out of the box to the Sample, with known statistics ($\bar{x} = 1045$, $s = 150$); Inference points back into the box.]

We can never open the black box. We can only look at the sample and use Probability Theory to guess what's inside the box.

The Two Pillars of Inference

Statistical inference boils down to two main activities: Estimation (point estimates and confidence intervals) and Hypothesis Testing. We will learn both in the next chapters.

Machine Learning Applications

In ML, the distinction between Population and Sample defines our entire workflow.

1. Training Data is a Sample

Your dataset (ImageNet, Titanic, etc.) is always a sample ($n$). The real world where your model is deployed is the population ($N$). The Central Limit Theorem explains why sampling works.

The Challenge: We want to minimize error on the Population (Generalization Error), but we can only optimize error on the Sample (Training Error).

2. Overfitting

Overfitting happens when a model learns the "noise" of the sample rather than the "signal" of the population. It effectively memorizes the specific nn examples but fails when exposed to the NN universe. Regularization helps prevent this.
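
A toy sketch of this effect (all data and model choices are made up for illustration): a high-degree polynomial fit to a handful of noisy points drives training error toward zero while error on fresh draws from the same "population" stays high.

```python
# A minimal sketch of overfitting: polynomials of two degrees fit to
# n = 15 noisy samples of a sine-wave "population".
import numpy as np

rng = np.random.default_rng(0)

def draw(n):
    x = rng.uniform(0, 1, n)
    return x, np.sin(2 * np.pi * x) + rng.normal(0, 0.3, n)  # signal + noise

x_train, y_train = draw(15)  # the sample we can see
x_new, y_new = draw(1000)    # stand-in for the population

for degree in (3, 12):
    coeffs = np.polyfit(x_train, y_train, degree)
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    new_mse = np.mean((np.polyval(coeffs, x_new) - y_new) ** 2)
    print(f"degree {degree:2d}: train MSE {train_mse:.3f}, unseen MSE {new_mse:.3f}")
```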

3. Train/Test Split

We split our available sample into "Train" and "Test". We pretend the "Test" set is the Population. If the model performs well on the Test set (data it hasn't seen), we infer it will perform well on the real Population.
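
A minimal sketch of this workflow, assuming scikit-learn and a synthetic dataset:

```python
# Split the available sample into train/test; the held-out test set
# stands in for the unseen population.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=42)  # our full "sample"
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

model = LogisticRegression().fit(X_train, y_train)
print(f"Train accuracy: {model.score(X_train, y_train):.3f}")
print(f"Test accuracy:  {model.score(X_test, y_test):.3f}")  # proxy for population performance
```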

Common Mistakes

The "Big Data" Fallacy

Thinking that because your $n$ is huge (e.g., 1 million bulbs), it equals $N$. If those 1 million bulbs all come from just one factory line (e.g., Line A), it is still a biased sample of the population (all factory lines).

Data Leakage

Using information from the Population (or Test set) to influence the Sample (Training set). For example, imputing missing values using the mean of the entire dataset instead of just the training set.
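
A sketch of the imputation example with made-up arrays; the fix is to compute the fill value from the training set only:

```python
# Mean imputation done wrong (leaky) vs. right (train-only), assuming
# NaN marks missing values.
import numpy as np

train = np.array([1.0, 2.0, np.nan, 4.0])
test = np.array([100.0, np.nan])  # the test set happens to have larger values

# WRONG: the fill value peeks at the test set (leakage).
leaky_mean = np.nanmean(np.concatenate([train, test]))

# RIGHT: compute the fill value from the training set only.
train_mean = np.nanmean(train)
train_filled = np.where(np.isnan(train), train_mean, train)
test_filled = np.where(np.isnan(test), train_mean, test)

print(f"Leaky fill value:      {leaky_mean:.2f}")  # inflated by the test data
print(f"Train-only fill value: {train_mean:.2f}")
```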