Basic Statistics for Machine Learning

Learn mean, median, mode, variance, and standard deviation with interactive visualizations. Essential statistics for ML engineers.

Introduction

Before diving into machine learning algorithms, we need a solid foundation in descriptive statistics. These are the tools that allow us to summarize and understand data at a glance.

Think about your exam scores from last semester. Instead of memorizing every single grade, wouldn't it be easier to just know your average? That's statistics in action: taking a bunch of numbers and distilling them into something useful.

What is Descriptive Statistics?

Descriptive statistics are the tools we use to summarize and describe data. Instead of looking at thousands of numbers, we extract a few key metrics that tell the story.

For example: "This app has a 4.2 star rating based on 10,000 reviews." That one number summarizes thousands of individual opinions into a single, useful metric.

In this guide, we'll cover the essential building blocks: the mean, median, and mode (where is the "center" of your data?) and variance and standard deviation (how spread out is it?), along with skewness, kurtosis, and robust measures like the range and IQR. These concepts show up everywhere: from interpreting survey results to building machine learning models and working with probability distributions.

Why This Matters for ML Engineers

  • Make sense of data: Quickly summarize thousands of numbers into a few key insights.
  • Spot problems: Identify unusual values that might be errors or important exceptions.
  • Compare fairly: Understand if two datasets are truly different or just appear so.
  • Build intuition: These concepts are the foundation for everything in data science, from Bayesian inference to batch normalization.

The Mean (Average)

The most common way to summarize data is to find its center: a single value that represents the "typical" data point. The mean is the most widely used measure of center.

Definition

The arithmetic mean is the sum of all values divided by the count of values. It is what most people call "the average."

\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i = \frac{x_1 + x_2 + \ldots + x_n}{n}

Example: Bulb Lifespans

You test 5 bulbs with lifespans: 980, 1000, 1010, 1020, 1040 hours.

Mean = (980 + 1000 + 1010 + 1020 + 1040) / 5 = 1010 hours
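As a minimal sketch, here is the same calculation in Python (NumPy is used only for convenience; the values are the lifespans from the example above):

```python
# Mean of the bulb lifespans, by hand and with NumPy.
import numpy as np

lifespans = [980, 1000, 1010, 1020, 1040]  # hours

mean_manual = sum(lifespans) / len(lifespans)  # (980 + ... + 1040) / 5
mean_numpy = np.mean(lifespans)

print(mean_manual, mean_numpy)  # 1010.0 1010.0
```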

The Median (Middle Value)

The mean works well for symmetric data, but it has a weakness: outliers can pull it away from the true center. The median solves this by focusing on position rather than value.

Definition

The median is the middle value when all data points are sorted. If there is an even number of points, it is the average of the two middle values.

\text{Median} = \begin{cases} x_{\frac{n+1}{2}} & \text{if } n \text{ is odd} \\[1em] \dfrac{x_{\frac{n}{2}} + x_{\frac{n}{2}+1}}{2} & \text{if } n \text{ is even} \end{cases}

Example: With an Outlier

Lifespans: 500, 980, 1000, 1010, 1020 hours. (One bulb failed early.)

Sorted: 500, 980, 1000, 1010, 1020

Median = 1000 hours (the middle value)

Mean = 902 hours (pulled down by the outlier)

Notice how the median is more representative of the "typical" bulb.
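A short sketch of the same comparison in Python, using the lifespans above:

```python
# Mean vs. median on the dataset with the early-failing bulb.
import numpy as np

lifespans = [500, 980, 1000, 1010, 1020]  # hours, one outlier at 500

print(np.mean(lifespans))    # 902.0  -> pulled down by the outlier
print(np.median(lifespans))  # 1000.0 -> middle of the sorted values
```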

The Mode (Most Frequent)

Sometimes you do not care about the average or middle value. You want to know what shows up the most. This is especially useful for categorical data where mean and median do not apply.

Definition

The mode is the value that appears most frequently in the dataset. A dataset can have no mode, one mode (unimodal), or multiple modes (bimodal, multimodal).

Example: Categorical Data

Bulb quality ratings: Good, Good, Good, Excellent, Fair, Good, Excellent

Mode = Good (appears 4 times)
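A minimal sketch using Python's standard library (statistics.mode and collections.Counter), with the ratings from the example:

```python
# Mode of categorical quality ratings.
from collections import Counter
from statistics import mode, multimode

ratings = ["Good", "Good", "Good", "Excellent", "Fair", "Good", "Excellent"]

print(mode(ratings))       # 'Good'
print(multimode(ratings))  # ['Good'] -- returns all modes if there are ties
print(Counter(ratings))    # Counter({'Good': 4, 'Excellent': 2, 'Fair': 1})
```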

Skewness

So far we have covered three ways to find the center of your data: the mean, median, and mode. But here is a question: what if the mean and median give different answers? That happens when your data is not symmetric.

Skewness describes the shape of your data distribution. Is it balanced on both sides, or does it have a long tail stretching in one direction?

  • Symmetric (skewness ≈ 0): Mean ≈ Median ≈ Mode.
  • Right-skewed / positive (skewness > 0): the tail stretches to the right, so Mode < Median < Mean.
  • Left-skewed / negative (skewness < 0): the tail stretches to the left, so Mean < Median < Mode.
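As a rough check of these sign conventions, here is a sketch using scipy.stats.skew on simulated data (the distributions chosen here are just illustrative assumptions):

```python
# Skewness of a roughly symmetric sample vs. a right-skewed sample.
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(0)
symmetric = rng.normal(loc=0, scale=1, size=10_000)    # roughly symmetric
right_skewed = rng.exponential(scale=1, size=10_000)   # long right tail

print(round(skew(symmetric), 2))     # close to 0
print(round(skew(right_skewed), 2))  # clearly positive (about 2 for an exponential)
```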

Skewness Explorer

Drag the slider to see how skewness affects the shape of the distribution and the relationship between the mean, median, and mode. In symmetric data, the measures of center coincide (Mean ≈ Median ≈ Mode).

ML Tip: Many machine learning models perform poorly on skewed data. If your data is heavily skewed (like house prices, where a few mansions distort the average), we often apply mathematical tricks to make it more symmetric before training.
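One common example of such a trick is a log transform. A minimal sketch on synthetic, price-like data (the lognormal parameters are arbitrary assumptions):

```python
# Reducing right skew with a log transform.
# np.log1p computes log(1 + x), which is safe when values can be zero.
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(42)
prices = rng.lognormal(mean=12, sigma=0.8, size=10_000)  # synthetic, house-price-like

print(round(skew(prices), 2))            # strongly positive (heavy right tail)
print(round(skew(np.log1p(prices)), 2))  # close to 0 after the transform
```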

Kurtosis

While skewness tells us about the asymmetry of our data, kurtosis tells us about the tails. Specifically: how likely are extreme values (outliers) compared to a normal distribution?

Definition

Kurtosis measures the "tailedness" of a distribution. It tells you whether your data has heavy tails (more outliers) or light tails (fewer outliers) compared to a normal distribution.

\text{Kurtosis} = \frac{1}{n} \sum_{i=1}^{n} \left( \frac{x_i - \bar{x}}{\sigma} \right)^4

The fourth power amplifies extreme deviations, making kurtosis highly sensitive to outliers.

  • Platykurtic (kurtosis < 3): flatter peak, thinner tails; fewer extreme values than normal.
  • Mesokurtic (kurtosis = 3): the normal distribution; the baseline for comparison.
  • Leptokurtic (kurtosis > 3): sharper peak, fatter tails; more extreme values than normal.
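A minimal sketch comparing the three cases on simulated data. Note that scipy.stats.kurtosis returns excess kurtosis by default, so fisher=False is passed here to get the raw values used above:

```python
# Raw kurtosis (normal baseline = 3) for three sample distributions.
import numpy as np
from scipy.stats import kurtosis

rng = np.random.default_rng(0)
samples = {
    "uniform": rng.uniform(size=100_000),   # platykurtic: kurtosis ~ 1.8
    "normal": rng.normal(size=100_000),     # mesokurtic:  kurtosis ~ 3
    "laplace": rng.laplace(size=100_000),   # leptokurtic: kurtosis ~ 6
}

for name, sample in samples.items():
    print(name, round(kurtosis(sample, fisher=False), 2))
```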

Kurtosis Explorer

Drag the slider to see how kurtosis affects the "tailedness" of the distribution; a dashed normal distribution is shown for reference. The normal distribution is the mesokurtic baseline (kurtosis = 3, or excess kurtosis = 0).

Measures of Spread (Dispersion)

We have looked at the center (mean, median) and the shape (skewness). But there is one more critical piece of the puzzle: spread.

Imagine two companies with the exact same average salary of $100k. In Company A, everyone earns $100k. In Company B, the CEO earns $1M and everyone else earns $10k. The average is the same, but the spread tells the real story.

So how do we measure this? We need a way to calculate how far, on average, the data points are from the center.

Variance and Standard Deviation

These are the most commonly used measures of spread in statistics and machine learning. They quantify how much each data point deviates from the mean.

Variance

\sigma^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2

The average of the squared deviations from the mean. Larger variance means more spread.

Standard Deviation

\sigma = \sqrt{\sigma^2} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2}

The square root of the variance. Same units as the original data.

Example: Bulb Lifespans

Lifespans: 980, 1000, 1010, 1020, 1040 hours

Step 1: Calculate the mean

Mean = (980 + 1000 + 1010 + 1020 + 1040) / 5 = 5050 / 5 = 1010 hours

Step 2: Find each deviation from the mean

980 - 1010 = -30

1000 - 1010 = -10

1010 - 1010 = 0

1020 - 1010 = +10

1040 - 1010 = +30

Step 3: Square each deviation

(-30)² = 900

(-10)² = 100

(0)² = 0

(+10)² = 100

(+30)² = 900

Step 4: Calculate variance (average of squared deviations)

Variance = (900 + 100 + 0 + 100 + 900) / 5

Variance = 2000 / 5 = 400 hours²

Step 5: Calculate standard deviation (square root of variance)

Std Dev = √400 = 20 hours

Why Square the Deviations? Two reasons: (1) Positive and negative deviations would cancel out otherwise. (2) Squaring penalizes larger deviations more heavily, making the metric more sensitive to outliers.
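A minimal sketch reproducing this worked example with NumPy (np.var and np.std use the population formulas shown above by default, i.e. dividing by n):

```python
# Variance and standard deviation of the bulb lifespans, step by step.
import numpy as np

lifespans = np.array([980, 1000, 1010, 1020, 1040])

mean = lifespans.mean()              # 1010.0
deviations = lifespans - mean        # [-30, -10, 0, 10, 30]
variance = np.mean(deviations ** 2)  # 400.0 hours^2
std_dev = np.sqrt(variance)          # 20.0 hours

print(mean, variance, std_dev)
print(np.var(lifespans), np.std(lifespans))  # same results via the built-ins
```

Many libraries also expose the sample versions, which divide by n − 1 instead of n (ddof=1 in NumPy); this guide uses the population formulas throughout.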

Range and Interquartile Range (IQR)

Range

\text{Range} = \text{Max} - \text{Min}

The simplest measure of spread. Just the difference between the largest and smallest values.

Weakness: Extremely sensitive to outliers. One extreme value can inflate the range dramatically.

Interquartile Range (IQR)

\text{IQR} = Q_3 - Q_1

The range of the middle 50% of the data. Q1 is the 25th percentile, Q3 is the 75th percentile.

Strength: Robust to outliers because it ignores the extreme 25% on each end.

Outlier Detection: A common rule is to flag any data point as an outlier if it falls below Q1 − 1.5 × IQR or above Q3 + 1.5 × IQR. This is how box plots draw their "whiskers."
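A minimal sketch of this rule, reusing the lifespan data with the early-failing bulb (np.percentile computes the quartiles; note that different quartile conventions can shift the cutoffs slightly):

```python
# Flagging outliers with the 1.5 * IQR rule.
import numpy as np

data = np.array([500, 980, 1000, 1010, 1020])  # hours, one early failure

q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = data[(data < lower) | (data > upper)]
print(iqr, lower, upper, outliers)  # only the 500-hour bulb is flagged
```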

Measure Spread: Focus + Context

Use the slider to change the variance and compare the full range with the central IQR.

Formulas Reference

| Metric | Formula | Use When |
| --- | --- | --- |
| Mean | \bar{x} = \frac{1}{n}\sum x_i | Data is symmetric, no outliers |
| Median | Middle value (sorted) | Skewed data or outliers present |
| Mode | Most frequent value | Categorical data or identifying peaks |
| Range | \text{Max} - \text{Min} | Quick overview, no outliers |
| IQR | Q_3 - Q_1 | Robust spread, outlier detection |
| Variance | \sigma^2 = \frac{1}{n}\sum(x_i - \bar{x})^2 | Mathematical analysis, ML algorithms |
| Std Dev | \sigma = \sqrt{\sigma^2} | Interpretable spread in original units |