Random Variables & Expectation

The mathematical language for quantifying uncertainty and predicting averages.

Introduction

In algebra, a variable x stands for a fixed number. In probability, a Random Variable represents a value determined by chance: it can take any of several values, each with some probability.

Random Variable: A function that maps outcomes of a random process to numbers.

Expectation: The theoretical average if we repeated the process infinitely.

In ML, everything is a random variable: input data X, target labels Y, and even model parameters θ (in Bayesian learning). Understanding expectation is how we define what "good" means for a model.

What is a Random Variable?

Formal Definition

X: \Omega \to \mathbb{R}

A function that assigns a real number to every possible outcome in the sample space Ω.

Example: Two Coin Flips

Sample Space:

{HH, HT, TH, TT}

X = number of Heads:

X(HH) = 2

X(HT) = X(TH) = 1

X(TT) = 0

The random variable X converts outcomes (like "HT") into numbers (like 1) that we can do math with.
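
A minimal sketch in Python of this definition: the sample space as a plain list of outcomes, and X as an ordinary function on it.

```python
from itertools import product

# Sample space for two coin flips: {HH, HT, TH, TT}.
sample_space = ["".join(flips) for flips in product("HT", repeat=2)]

def X(outcome: str) -> int:
    """Random variable: maps an outcome like 'HT' to a number (count of Heads)."""
    return outcome.count("H")

for omega in sample_space:
    print(f"X({omega}) = {X(omega)}")
# X(HH) = 2, X(HT) = 1, X(TH) = 1, X(TT) = 0
```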

Discrete vs Continuous

Discrete

Countable values: dice rolls, coin flips, number of emails.

Described by PMF:

P(X = x)

Each value has a specific probability. Sum of all probabilities = 1.

Continuous

Any value in a range: height, time, temperature.

Described by PDF:

P(a < X < b) = \int_a^b f(x)\,dx

P(X = exact value) = 0. Only ranges have nonzero probability.

For details on specific distributions, see Probability Distributions.
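
As a sketch of the PMF/PDF distinction, SciPy's stock distributions work well; the fair die and standard normal below are illustrative choices, not examples from this section.

```python
from scipy import stats

# Discrete: a fair die, uniform on {1, ..., 6}. The PMF gives P(X = x).
die = stats.randint(1, 7)
print(die.pmf(3))                            # P(X = 3) = 1/6 ≈ 0.1667
print(sum(die.pmf(x) for x in range(1, 7)))  # probabilities sum to 1

# Continuous: a standard normal. P(X = exact value) is 0; only ranges
# have probability, via the integral of the PDF (a CDF difference).
Z = stats.norm(0, 1)
print(Z.cdf(1) - Z.cdf(-1))                  # P(-1 < Z < 1) ≈ 0.6827
```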

Expected Value (Mean)

The Expectation E[X] is the theoretical weighted average. Think of it as the "center of mass" of the distribution.

It answers: "If we repeated this experiment infinitely, what would the average value be?"

Discrete

E[X] = \sum_x x \cdot P(X = x)

Sum of (value × probability)

Continuous

E[X] = \int_{-\infty}^{\infty} x \cdot f(x)\,dx

Integral of (value × density)

Example: Rigged Die

A die where rolling 6 is 50% likely, and 1-5 are 10% each.

E[X] = (1)(0.1) + (2)(0.1) + (3)(0.1) + (4)(0.1) + (5)(0.1) + (6)(0.5)

E[X] = 0.1 + 0.2 + 0.3 + 0.4 + 0.5 + 3.0 = 4.5

The expected value is 4.5 (you can never roll 4.5!). The high probability on 6 pulls the average upward.
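
A quick sketch checks this both ways: the weighted sum directly, and a simulation whose long-run average converges to the same number.

```python
import random

# Rigged die: P(6) = 0.5, P(1) = ... = P(5) = 0.1.
values = [1, 2, 3, 4, 5, 6]
probs = [0.1, 0.1, 0.1, 0.1, 0.1, 0.5]

# Theoretical expectation: sum of value * probability.
print(sum(v * p for v, p in zip(values, probs)))   # 4.5

# Simulation: the average of many rolls approaches E[X].
rolls = random.choices(values, weights=probs, k=100_000)
print(sum(rolls) / len(rolls))                     # ≈ 4.5
```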

Interactive: Expected Value

See how E[X] acts as the "balance point" of the distribution. Try different probability distributions and watch where the fulcrum lands.

[Interactive widget: bars for values 1–10, each at 10% probability; the fulcrum finds the center of mass and balances at E[X] = 5.50.]

Variance & Standard Deviation

Expectation tells us the center. Variance tells us the spread (how uncertain we are).

Variance: Expected squared deviation from mean

Var(X) = E[(X - \mu)^2]

Computational formula:

Var(X) = E[X^2] - (E[X])^2

Variance

Units are squared (e.g., dollars squared). Hard to interpret directly.

Standard Deviation

σ = √Var(X). Same units as X. More interpretable.
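
A short sketch confirming that the definitional and computational formulas agree, reusing the rigged die from the expectation example.

```python
values = [1, 2, 3, 4, 5, 6]
probs = [0.1, 0.1, 0.1, 0.1, 0.1, 0.5]

E_X = sum(v * p for v, p in zip(values, probs))       # E[X] = 4.5
E_X2 = sum(v**2 * p for v, p in zip(values, probs))   # E[X^2] = 23.5

var_definitional = sum((v - E_X) ** 2 * p for v, p in zip(values, probs))
var_computational = E_X2 - E_X**2

print(var_definitional, var_computational)   # both 3.25
print(var_computational ** 0.5)              # σ ≈ 1.803, same units as X
```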

Interactive: Variance Comparison

Compare distributions with the same mean but different variances. Higher variance = wider spread = more uncertainty.

[Interactive plot: three densities sharing mean μ = 0. Low variance: tall & narrow (high certainty). Medium variance: balanced shape (moderate certainty). High variance: short & wide (low certainty).]

Key Properties

Linearity of Expectation

For any random variables X and Y (even if dependent!):

E[aX + bY] = aE[X] + bE[Y]

This is incredibly powerful. It lets us break complex problems into simple pieces.
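
A sketch of why the "even if dependent" part matters: below, Y = X² is completely determined by X, yet linearity still holds (for sample averages it is exact, up to floating-point rounding).

```python
import random

N = 200_000
xs = [random.randint(1, 6) for _ in range(N)]   # fair die rolls
ys = [x**2 for x in xs]                         # Y = X^2: strongly dependent on X

E_X = sum(xs) / N
E_Y = sum(ys) / N
E_combo = sum(2 * x + 3 * y for x, y in zip(xs, ys)) / N

print(E_combo)            # ≈ 52.5 (theoretical: 2·3.5 + 3·91/6)
print(2 * E_X + 3 * E_Y)  # identical, by linearity
```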

Variance of Sum (Independent Variables)

If X and Y are independent:

Var(X + Y) = Var(X) + Var(Y)

Variances add for independent variables. This is why errors accumulate.
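
A sketch with two independent Gaussians, with variances 4 and 9 chosen purely for illustration.

```python
import random
from statistics import variance

N = 100_000
xs = [random.gauss(0, 2) for _ in range(N)]   # Var(X) = 2² = 4
ys = [random.gauss(0, 3) for _ in range(N)]   # Var(Y) = 3² = 9, independent of X

sums = [x + y for x, y in zip(xs, ys)]
print(variance(xs), variance(ys))   # ≈ 4, ≈ 9
print(variance(sums))               # ≈ 13 = 4 + 9
```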

Scaling Properties

E[aX] = aE[X]
Var(aX) = a^2 Var(X)

Multiplying by a constant a scales the variance by a²!
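
A quick numerical check of the scaling rule, with a = 3 as an arbitrary choice.

```python
import random
from statistics import mean, variance

a = 3
xs = [random.gauss(0, 1) for _ in range(100_000)]   # Var(X) = 1
scaled = [a * x for x in xs]

print(mean(scaled))      # ≈ 0   (mean scales by a)
print(variance(scaled))  # ≈ 9   (variance scales by a² = 9)
```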

ML Applications

Expected Risk (Loss Minimization)

We train models to minimize the expected loss on unseen data:

\theta^* = \arg\min_\theta E_{(x,y)}[L(f(x;\theta), y)]

Since we cannot compute the true expectation, we approximate with the empirical average over training data.
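
A minimal sketch of the empirical approximation, using a hypothetical one-parameter model f(x; θ) = θx with squared loss and a made-up dataset.

```python
# Toy model and loss, purely illustrative.
def f(x: float, theta: float) -> float:
    return theta * x

def loss(prediction: float, target: float) -> float:
    return (prediction - target) ** 2

data = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2)]   # made-up (x, y) pairs

def empirical_risk(theta: float) -> float:
    """Average loss over the dataset: the stand-in for the true expectation."""
    return sum(loss(f(x, theta), y) for x, y in data) / len(data)

print(empirical_risk(2.0))   # training searches for the θ minimizing this
```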

SGD Justification

By linearity of expectation, the expected gradient of a random mini-batch equals the gradient over the full dataset. This is why Stochastic Gradient Descent works!
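
A sketch of that unbiasedness on a similar toy model: the average of many random mini-batch gradients closes in on the full-dataset gradient.

```python
import random

# Toy setup: f(x; θ) = θx, squared loss, so dL/dθ = 2(θx - y)x per example.
data = [(x, 2.0 * x + random.gauss(0, 0.1)) for x in range(1, 101)]
theta = 0.5

def grad(batch):
    """Average gradient of the squared loss over a batch."""
    return sum(2 * (theta * x - y) * x for x, y in batch) / len(batch)

full_gradient = grad(data)
minibatch_avg = sum(grad(random.sample(data, 10)) for _ in range(5_000)) / 5_000

print(full_gradient)   # exact value depends on the noise draw
print(minibatch_avg)   # agrees closely: the mini-batch gradient is unbiased
```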

Reinforcement Learning

The Value Function is the expected future return:

V(s) = E\left[\sum_{t=0}^\infty \gamma^t r_t \mid s_0 = s\right]

The agent maximizes expected cumulative reward.
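
A Monte Carlo sketch of this expectation, under assumed dynamics invented for illustration: reward 1 per step, a 10% chance of termination at each step, and γ = 0.9.

```python
import random

gamma, p_end = 0.9, 0.1

def rollout() -> float:
    """Sample one episode's discounted return G = Σ γ^t r_t."""
    G, discount = 0.0, 1.0
    while random.random() > p_end:   # survive this step with probability 0.9
        G += discount * 1.0          # collect reward r_t = 1
        discount *= gamma
    return G

# Estimate V(s) = E[G] by averaging many sampled returns.
estimate = sum(rollout() for _ in range(100_000)) / 100_000
print(estimate)   # ≈ 0.9 / (1 - 0.9·0.9) ≈ 4.74 analytically
```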

Variance in Model Performance

High variance in predictions means the model is sensitive to the training data (overfitting). The bias-variance tradeoff is fundamentally about E[\text{error}] vs Var(\text{error}).
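
A sketch of prediction variance: fit the same one-parameter model on bootstrap resamples of a made-up dataset and measure the spread of its predictions at a fixed input.

```python
import random
from statistics import mean, variance

random.seed(0)
data = [(x, 2.0 * x + random.gauss(0, 1.0)) for x in [1.0, 2.0, 3.0, 4.0, 5.0]]

def fit(sample):
    """Closed-form least squares for y = θx: θ = Σxy / Σx²."""
    return sum(x * y for x, y in sample) / sum(x * x for x, _ in sample)

preds = []
for _ in range(2_000):
    boot = [random.choice(data) for _ in data]   # resample the training set
    preds.append(fit(boot) * 3.0)                # prediction at x = 3

# High spread means high sensitivity to the training data.
print(mean(preds), variance(preds))
```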