Introduction
A Random Variable is a variable whose value depends on the outcome of a random event. A Probability Distribution describes the likelihood of each possible value.
Core Question: If I sample from this process, what values am I likely to see, and how often?
Discrete
Countable outcomes: coin flips, dice rolls, number of emails.
Function: PMF (Probability Mass Function)
Rule: every probability is non-negative and they all sum to 1: $\sum_x P(X = x) = 1$.
Continuous
Uncountably many outcomes within a range: height, time, temperature.
Function: PDF (Probability Density Function)
Rule: the density is non-negative and the total area under it is 1: $\int_{-\infty}^{\infty} f(x)\,dx = 1$.
PMF, PDF, and CDF
Together these functions describe any probability distribution: the PMF for discrete variables, the PDF for continuous ones, and the CDF for both.
PMF: Probability Mass Function (Discrete)
Gives the probability of each exact value: $p(x) = P(X = x)$.
Example: for a fair die, $P(X = k) = \frac{1}{6}$ for each face $k = 1, \dots, 6$.
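A quick numeric check of the die PMF, using plain NumPy (a minimal sketch, nothing beyond the standard scientific stack assumed):

```python
# Minimal sketch: the PMF of a fair six-sided die assigns 1/6 to each face.
import numpy as np

faces = np.arange(1, 7)
pmf = np.full(6, 1 / 6)

print(pmf[faces == 3][0])  # P(X = 3) = 0.1666...
print(pmf.sum())           # a valid PMF sums to 1
```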
PDF: Probability Density Function (Continuous)
Gives probability density, not probability itself: $P(X = x) = 0$ for any single point!
The probability of a range is the area under the curve: $P(a \le X \le b) = \int_a^b f(x)\,dx$.
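A small sketch with SciPy's standard Normal (an illustrative choice, not tied to any dataset) showing that a density value is not a probability, while areas are:

```python
# Density at a point vs. probability of a range, for a standard Normal.
from scipy.stats import norm

print(norm.pdf(0.0))                   # ~0.399: a density, not a probability
print(norm.cdf(1.0) - norm.cdf(-1.0))  # ~0.683 = P(-1 <= X <= 1), an area
print(norm.cdf(2.0) - norm.cdf(2.0))   # 0.0: a single point has zero probability
```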
CDF: Cumulative Distribution Function (Both)
Gives the probability of being at or below a value: $F(x) = P(X \le x)$.
Always starts at 0, ends at 1, and never decreases.
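A sketch checking those three CDF properties numerically for a Binomial(10, 0.5) (illustrative parameters):

```python
# The CDF of Binomial(10, 0.5): starts at 0, ends at 1, never decreases.
import numpy as np
from scipy.stats import binom

k = np.arange(-1, 12)             # values below and above the support
F = binom.cdf(k, n=10, p=0.5)

print(F[0], F[-1])                # 0.0 and 1.0
print(np.all(np.diff(F) >= 0))    # True: non-decreasing
print(binom.cdf(5, n=10, p=0.5))  # P(X <= 5) ~ 0.623
```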
Interactive: Distribution Explorer
Select different distributions and adjust their parameters to build intuition for how they behave.
(Widget panels: Probability Density Function plot, Parameters, Theoretical Moments.)
Discrete Distributions
Bernoulli
Single trial. One trial with two outcomes: Success (1) with probability p, Failure (0) with probability 1 - p.
ML: Logistic Regression output. Binary Cross-Entropy is derived from Bernoulli likelihood.
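A tiny sketch with scipy.stats (the 0.7 bias is an arbitrary illustrative value):

```python
# Bernoulli(p = 0.7): a single yes/no trial, e.g. "does this user click?"
from scipy.stats import bernoulli

rv = bernoulli(p=0.7)
print(rv.pmf(1), rv.pmf(0))             # 0.7 and 0.3
print(rv.rvs(size=10, random_state=0))  # ten simulated 0/1 outcomes
```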
Binomial
n trials. Number of successes k in n independent Bernoulli trials.
ML: A/B testing (how many users convert out of 1000?).
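A sketch of that A/B-testing question with SciPy; the 5% conversion rate is an assumed baseline:

```python
# Binomial(n = 1000, p = 0.05): conversions out of 1000 users at a 5% rate.
from scipy.stats import binom

n, p = 1000, 0.05
print(binom.pmf(50, n, p))  # probability of exactly 50 conversions
print(binom.sf(64, n, p))   # P(X >= 65): how surprising is an unusually good day?
```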
Poisson
Rare events. Number of events in a fixed interval, given an average rate λ.
ML: Website traffic spikes, call center volume, Poisson Regression.
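A sketch with an assumed rate of 3 events per interval (think support calls per minute):

```python
# Poisson(lambda = 3): event counts per fixed interval at an average rate of 3.
from scipy.stats import poisson

lam = 3
print(poisson.pmf(0, lam))  # chance of a quiet interval, ~0.05
print(poisson.pmf(5, lam))  # chance of exactly 5 events
print(poisson.cdf(6, lam))  # chance of at most 6 events
```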
Continuous Distributions
Normal (Gaussian)
The King. The bell curve, defined by mean μ and variance σ². By the Central Limit Theorem, sums of many independent random variables converge to it.
ML: Weight initialization (Xavier/He), L2 regularization = Gaussian prior, VAEs.
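A quick Central Limit Theorem sketch: sums of Uniform(0, 1) draws (an arbitrary non-Gaussian choice) come out approximately Normal:

```python
# CLT in action: sums of 30 Uniform(0, 1) draws are approximately Normal.
import numpy as np

rng = np.random.default_rng(0)
sums = rng.uniform(size=(100_000, 30)).sum(axis=1)

print(sums.mean())  # ~15.0  (30 * 0.5)
print(sums.std())   # ~1.58  (sqrt(30 * 1/12))
```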
Exponential
Waiting time. Models the time between Poisson events. Has the memoryless property: P(wait 5 more min) does not depend on how long you have already waited.
ML: Survival analysis, time-to-failure predictions.
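A numeric check of the memoryless property, with an assumed rate of 0.2 events per minute:

```python
# Memorylessness: P(X > s + t | X > s) equals P(X > t) for an Exponential.
from scipy.stats import expon

rate = 0.2
rv = expon(scale=1 / rate)      # SciPy parameterizes by scale = 1 / rate
s, t = 5.0, 3.0
print(rv.sf(s + t) / rv.sf(s))  # conditional survival, ~0.549
print(rv.sf(t))                 # unconditional survival, same ~0.549
```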
Log-Normal
Heavy tail. If ln(X) is Normal, then X is Log-Normal. Models positive data with a heavy right tail.
ML: House prices, incomes, stock prices. This is why we log-transform targets in regression!
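A sketch with synthetic "prices" (the Log-Normal parameters are made up) showing why the log-transform helps:

```python
# Log-Normal data: heavy right tail before the log, Gaussian-looking after.
import numpy as np

rng = np.random.default_rng(0)
x = rng.lognormal(mean=12.0, sigma=0.5, size=100_000)  # synthetic "prices"

print(x.mean(), np.median(x))     # mean well above the median: right-skewed
log_x = np.log(x)
print(log_x.mean(), log_x.std())  # ~12.0 and ~0.5: Normal again
```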
Bayesian Prior Distributions
These distributions describe probabilities of probabilities. They are used as priors in Bayesian inference.
Beta Distribution
Defined on [0, 1]. Represents uncertainty about a probability (like a coin's bias).
"I think the coin lands heads 60% of the time, but I'm not sure."
Conjugate prior for Binomial likelihood.
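A sketch of the conjugate update (the Beta(2, 2) prior and the 7 heads / 3 tails data are illustrative):

```python
# Beta-Binomial conjugacy: a Beta(a, b) prior plus (heads, tails) data
# gives a Beta(a + heads, b + tails) posterior over the coin's bias.
from scipy.stats import beta

a, b = 2, 2
heads, tails = 7, 3
posterior = beta(a + heads, b + tails)

print(posterior.mean())          # ~0.64: pulled toward the observed 70% heads
print(posterior.interval(0.95))  # 95% credible interval for the bias
```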
Dirichlet Distribution
Multivariate Beta. Distribution over probability vectors that sum to 1.
Used in Latent Dirichlet Allocation (LDA) for topic modeling.
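A sketch of Dirichlet draws over three topics (the concentration values are arbitrary):

```python
# Each Dirichlet sample is a probability vector: non-negative, summing to 1.
import numpy as np

rng = np.random.default_rng(0)
samples = rng.dirichlet(alpha=[2, 3, 5], size=4)

print(samples)              # each row is a distribution over 3 topics
print(samples.sum(axis=1))  # every row sums to 1 (up to floating point)
```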
For more on Bayesian inference, see Bayesian vs. Frequentist.
The Distribution Family Tree
Distributions are not isolated. They are connected through limiting relationships.
Bernoulli → Binomial: sum n Bernoulli(p) variables and you get Binomial(n, p).
Binomial → Normal: as n → ∞, the Binomial looks Gaussian (CLT).
Binomial → Poisson: as n → ∞ and p → 0 (with np = λ held constant), the Binomial becomes Poisson(λ).
Exponential → Gamma: sum k Exponentials and you get Gamma(k, λ).
Normal → Chi-Square: the sum of k squared standard Normals gives Chi-Square with k degrees of freedom.
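A simulation sketch checking two of these links (sample sizes and parameters are arbitrary):

```python
# Verifying two family-tree relationships by simulation.
import numpy as np

rng = np.random.default_rng(0)

# Bernoulli -> Binomial: summing n Bernoulli(p) draws behaves like Binomial(n, p).
n, p = 20, 0.3
bern_sums = rng.binomial(1, p, size=(50_000, n)).sum(axis=1)
print(bern_sums.mean(), n * p)                # both ~6.0

# Binomial -> Poisson: large n, small p, with np = 4 held constant.
binom_draws = rng.binomial(n=10_000, p=4 / 10_000, size=50_000)
pois_draws = rng.poisson(lam=4, size=50_000)
print(binom_draws.mean(), pois_draws.mean())  # both ~4.0
print(binom_draws.var(), pois_draws.var())    # both ~4.0
```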
ML Applications: Loss Functions
Many common deep learning loss functions are negative log-likelihoods of specific distributions. Choosing a loss function implicitly assumes a distribution over your target variable.
MSE Loss ↔ Gaussian
Minimizing MSE assumes the target y follows a Gaussian distribution around the prediction.
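A numeric sketch (single prediction, unit variance assumed) showing the two quantities differ only by a constant:

```python
# MSE term vs. Gaussian negative log-likelihood for one prediction, sigma = 1.
import numpy as np
from scipy.stats import norm

y, y_hat = 3.0, 2.2
nll = -norm.logpdf(y, loc=y_hat, scale=1.0)
mse_term = 0.5 * (y - y_hat) ** 2

print(nll)                                 # 1.2389...
print(mse_term + 0.5 * np.log(2 * np.pi))  # same value: NLL = squared error / 2 + constant
```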
BCE Loss ↔ Bernoulli
Binary Cross-Entropy assumes the target follows a Bernoulli distribution.
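A one-example sketch (the label and predicted probability are made up):

```python
# BCE on one example equals the Bernoulli negative log-likelihood.
import numpy as np

y, p = 1.0, 0.8  # true label, predicted probability
bce = -(y * np.log(p) + (1 - y) * np.log(1 - p))
bernoulli_nll = -np.log(p**y * (1 - p) ** (1 - y))

print(bce, bernoulli_nll)  # identical: 0.2231...
```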
Categorical CE ↔ Multinomial
Categorical Cross-Entropy assumes the target follows a Multinomial (Categorical) distribution.
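The same identity for a one-hot label over four classes (illustrative probabilities):

```python
# Categorical cross-entropy on a one-hot label = negative log of the true class's probability.
import numpy as np

y = np.array([0, 0, 1, 0])          # one-hot label: true class is index 2
p = np.array([0.1, 0.2, 0.6, 0.1])  # predicted class probabilities

ce = -np.sum(y * np.log(p))
print(ce, -np.log(p[2]))            # identical: 0.5108...
```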
Regularization ↔ Prior
L2 regularization = Gaussian prior on weights. L1 regularization = Laplace prior on weights.
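A sketch of the L2 ↔ Gaussian-prior correspondence; the weights and the prior standard deviation are arbitrary:

```python
# Negative log of a zero-mean Gaussian prior on the weights = L2 penalty + constant.
import numpy as np
from scipy.stats import norm

w = np.array([0.5, -1.2, 0.3])  # some model weights
tau = 2.0                       # prior standard deviation on each weight

neg_log_prior = -norm.logpdf(w, loc=0.0, scale=tau).sum()
l2_penalty = np.sum(w**2) / (2 * tau**2)
const = len(w) * (0.5 * np.log(2 * np.pi) + np.log(tau))

print(neg_log_prior)       # same value either way:
print(l2_penalty + const)  # the L2 strength corresponds to 1 / (2 * tau**2)
```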
Key insight: Understanding probability distributions helps you choose the right loss function for your problem. Predicting counts? Consider Poisson loss. Predicting positive values with heavy tails? Consider log-transforming the target (Log-Normal assumption).