Introduction
In algebra, a variable x stands for a fixed number. In probability, a Random Variable represents a value determined by chance. It could be any value within a range, each with some probability.
Random Variable: A function that maps outcomes of a random process to numbers.
Expectation: The theoretical average if we repeated the process infinitely.
In ML, everything is a random variable: input data X, target labels y, and (in Bayesian learning) even model parameters θ. Understanding expectation is how we define what "good" means for a model.
What is a Random Variable?
Formal Definition
A random variable X is a function X: Ω → ℝ that assigns a real number to every possible outcome in the sample space Ω.
Example: Two Coin Flips
Sample Space:
{HH, HT, TH, TT}
X = number of Heads:
X(HH) = 2
X(HT) = X(TH) = 1
X(TT) = 0
The random variable X converts outcomes (like "HT") into numbers (like 1) that we can do math with.
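This mapping is easy to make concrete in code. Here is a minimal sketch (the names `outcomes`, `X`, and `values` are just illustrative):

```python
from itertools import product

# Sample space for two coin flips: every ordered pair of H and T.
outcomes = ["".join(pair) for pair in product("HT", repeat=2)]

# The random variable X maps each outcome to a number: the count of Heads.
def X(outcome):
    return outcome.count("H")

values = {o: X(o) for o in outcomes}
print(values)  # {'HH': 2, 'HT': 1, 'TH': 1, 'TT': 0}
```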
Discrete vs Continuous
Discrete
Countable values: dice rolls, coin flips, number of emails.
Described by a probability mass function (PMF): p(x) = P(X = x).
Each value has a specific probability, and the probabilities sum to 1.
Continuous
Any value in a range: height, time, temperature.
Described by a probability density function (PDF) f(x): P(a ≤ X ≤ b) = ∫ f(x) dx over [a, b].
P(X = exact value) = 0. Only ranges have nonzero probability.
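A quick way to check the "probabilities sum to 1" requirement for a discrete variable is to use exact fractions, which avoid floating-point rounding. A sketch, using a fair die purely for illustration:

```python
from fractions import Fraction

# PMF of a fair six-sided die: each of the six values has probability 1/6.
pmf = {x: Fraction(1, 6) for x in range(1, 7)}

# A valid PMF assigns nonnegative probabilities that sum to exactly 1.
assert all(p >= 0 for p in pmf.values())
assert sum(pmf.values()) == 1
```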
For details on specific distributions, see Probability Distributions.
Expected Value (Mean)
The Expectation E[X] is the theoretical weighted average. Think of it as the "center of mass" of the distribution.
It answers: "If we repeated this experiment infinitely, what would the average value be?"
Discrete
E[X] = Σ x · P(X = x) — the sum of (value × probability)
Continuous
E[X] = ∫ x · f(x) dx — the integral of (value × density)
Example: Rigged Die
A die where rolling 6 is 50% likely, and 1-5 are 10% each.
E[X] = (1)(0.1) + (2)(0.1) + (3)(0.1) + (4)(0.1) + (5)(0.1) + (6)(0.5)
E[X] = 0.1 + 0.2 + 0.3 + 0.4 + 0.5 + 3.0 = 4.5
The expected value is 4.5 (you can never roll 4.5!). The high probability on 6 pulls the average upward.
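The same calculation in code, again with exact fractions (a sketch of the rigged die above):

```python
from fractions import Fraction

# Rigged die from the example: P(6) = 1/2, P(1..5) = 1/10 each.
pmf = {x: Fraction(1, 10) for x in range(1, 6)}
pmf[6] = Fraction(1, 2)
assert sum(pmf.values()) == 1  # sanity check: valid distribution

# E[X] = sum of value * probability.
expected = sum(x * p for x, p in pmf.items())
print(float(expected))  # 4.5
```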
Interactive: Expected Value
See how E[X] acts as the "balance point" of the distribution. Try different probability distributions and watch where the fulcrum lands.
Variance & Standard Deviation
Expectation tells us the center. Variance tells us the spread (how uncertain we are).
Variance: the expected squared deviation from the mean, Var(X) = E[(X − E[X])²]
Computational formula: Var(X) = E[X²] − (E[X])²
Variance
Its units are squared (e.g., dollars²), so it is hard to interpret directly.
Standard Deviation
σ = √Var(X). Same units as X, so it is more interpretable.
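Both variance formulas give the same answer, which is easy to verify on the rigged die from earlier (a sketch):

```python
from fractions import Fraction
from math import sqrt

# Rigged die again: P(6) = 1/2, P(1..5) = 1/10 each.
pmf = {x: Fraction(1, 10) for x in range(1, 6)}
pmf[6] = Fraction(1, 2)

mean = sum(x * p for x, p in pmf.items())                      # E[X] = 4.5
var_def = sum((x - mean) ** 2 * p for x, p in pmf.items())     # E[(X - E[X])^2]
var_comp = sum(x * x * p for x, p in pmf.items()) - mean ** 2  # E[X^2] - (E[X])^2
assert var_def == var_comp  # the two formulas always agree

print(float(var_def))           # 3.25
print(round(sqrt(var_def), 3))  # standard deviation, ~1.803
```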
Interactive: Variance Comparison
Compare distributions with the same mean but different variances. Higher variance = wider spread = more uncertainty.
Key Properties
Linearity of Expectation
For any random variables X and Y (even if dependent!): E[X + Y] = E[X] + E[Y].
This is incredibly powerful. It lets us break complex problems into simple pieces.
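A quick check that linearity needs no independence assumption: take X as a fair die roll and Y = 7 − X, which is completely determined by X (a sketch):

```python
from fractions import Fraction

# X = roll of a fair die; Y = 7 - X is *completely dependent* on X.
p = Fraction(1, 6)
E_X = sum(x * p for x in range(1, 7))                # 7/2
E_Y = sum((7 - x) * p for x in range(1, 7))          # also 7/2
E_sum = sum((x + (7 - x)) * p for x in range(1, 7))  # X + Y is always 7

# Linearity holds even though X and Y are dependent.
assert E_sum == E_X + E_Y == 7
```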
Variance of Sum (Independent Variables)
If X and Y are independent: Var(X + Y) = Var(X) + Var(Y).
Variances add for independent variables. This is why errors accumulate.
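This can be verified exactly for two independent fair dice (a sketch; the `var` helper is just illustrative):

```python
from fractions import Fraction
from itertools import product

def var(dist):
    """Variance of a list of (value, probability) pairs."""
    mean = sum(v * p for v, p in dist)
    return sum(v * v * p for v, p in dist) - mean ** 2

# One fair die, and the sum of two independent fair dice.
die = [(x, Fraction(1, 6)) for x in range(1, 7)]
two_dice = [(x + y, Fraction(1, 36)) for x, y in product(range(1, 7), repeat=2)]

# Var(X + Y) = Var(X) + Var(Y) for independent X, Y: 35/6 = 35/12 + 35/12.
assert var(two_dice) == 2 * var(die)
```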
Scaling Properties
Multiplying by a constant a scales the variance by a²: Var(aX) = a² Var(X), while E[aX] = a E[X].
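The squared scaling is easy to confirm on a fair die (a sketch; the `var` helper is illustrative):

```python
from fractions import Fraction

def var(dist):
    """Variance of a list of (value, probability) pairs."""
    mean = sum(v * p for v, p in dist)
    return sum(v * v * p for v, p in dist) - mean ** 2

die = [(x, Fraction(1, 6)) for x in range(1, 7)]
a = 3
scaled = [(a * x, p) for x, p in die]

# Var(aX) = a^2 * Var(X): tripling the values multiplies the variance by 9.
assert var(scaled) == a ** 2 * var(die)
```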
ML Applications
Expected Risk (Loss Minimization)
We train models to minimize the expected loss on unseen data: R(θ) = E[L(f(x; θ), y)], where the expectation is taken over the true data distribution.
Since we cannot compute the true expectation, we approximate it with the empirical average over the training data.
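A simulation sketch of this approximation, under an assumed toy setup (y = 2x plus unit-variance Gaussian noise, and a predictor that already knows the true function, so the expected squared-error loss equals the noise variance, 1.0):

```python
import random

random.seed(0)

# Hypothetical data model: y = 2x + Gaussian noise with variance 1.
# The predictor y_hat = 2x is "perfect", so every remaining error is noise.
def sample_loss():
    x = random.uniform(-1, 1)
    y = 2 * x + random.gauss(0, 1)
    y_hat = 2 * x
    return (y - y_hat) ** 2

# The empirical average over many samples approximates the true expectation.
n = 100_000
empirical_risk = sum(sample_loss() for _ in range(n)) / n
print(empirical_risk)  # close to 1.0
```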
SGD Justification
By linearity of expectation, the expected gradient of a mini-batch equals the expected gradient of the full dataset. This is why Stochastic Gradient Descent works!
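A simulation sketch of this unbiasedness, using an assumed toy loss per example, (w − x_i)², whose gradient is 2(w − x_i):

```python
import random

random.seed(1)

# Toy dataset and a toy per-example loss (w - x_i)^2 with gradient 2*(w - x_i).
data = [random.gauss(0, 1) for _ in range(1_000)]
w = 0.5

def grad(batch):
    return sum(2 * (w - x) for x in batch) / len(batch)

full_grad = grad(data)

# Average the gradients of many random mini-batches: by linearity of
# expectation, this converges to the full-dataset gradient.
avg = sum(grad(random.sample(data, 32)) for _ in range(5_000)) / 5_000
print(abs(avg - full_grad))  # small
```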
Reinforcement Learning
The Value Function is the expected future return: V(s) = E[Σₜ γᵗ rₜ | S₀ = s], the expected discounted sum of rewards starting from state s.
The agent maximizes expected cumulative reward.
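A minimal sketch with an assumed toy environment (reward 1 at every step, discount γ = 0.9, finite horizon T): with deterministic rewards the expected return is just the discounted sum, which approaches 1 / (1 − γ) = 10 as T grows.

```python
# Toy chain: reward 1 at every step, discount gamma = 0.9, horizon T.
gamma = 0.9
T = 50
V = sum(gamma ** t * 1.0 for t in range(T))
print(V)  # just under 10 = 1 / (1 - gamma)
```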
Variance in Model Performance
High variance in predictions means the model is sensitive to its training data (overfitting). The bias-variance tradeoff is fundamentally about Bias² vs Variance.