
Probability Distributions

The shapes of randomness: understanding how outcomes are distributed.

Introduction

A Random Variable is a variable whose value depends on the outcome of a random event. A Probability Distribution describes the likelihood of each possible value.

Core Question: If I sample from this process, what values am I likely to see, and how often?

Discrete

Countable outcomes: coin flips, dice rolls, number of emails.

Function: PMF (Probability Mass Function)

Rule: \sum_x P(x) = 1

Continuous

Infinite outcomes in a range: height, time, temperature.

Function: PDF (Probability Density Function)

Rule: \int_{-\infty}^{\infty} f(x)\,dx = 1

PMF, PDF, and CDF

These three functions fully describe any probability distribution.

PMF: Probability Mass Function (Discrete)

Gives the probability of exactly each value: P(X = k)

Example: For a fair die, PMF(3) = 1/6

PDF: Probability Density Function (Continuous)

Gives probability density, not probability itself. P(X = x) = 0 for any single point!

The probability of a range is the area under the curve: P(a < X < b) = \int_a^b f(x)\,dx

CDF: Cumulative Distribution Function (Both)

Gives the probability of being at or below a value: F(x) = P(X \leq x)

Always starts at 0, ends at 1, and never decreases.

Interactive figure: a PDF f(x) with a shaded region and, beneath it, the corresponding CDF. The CDF is obtained by summing (discrete) or integrating (continuous) the density up to x:

F(x) = P(X \leq x) = \int_{-\infty}^{x} f(t)\,dt

The CDF's height at x equals the shaded area under the PDF/PMF up to that point. For a continuous variable, any single point has probability 0 (only areas carry probability), and the CDF always starts at 0, ends at 1, and never decreases.
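A minimal sketch of these three functions in Python with SciPy; the fair die and the standard normal below are just illustrative choices:

```python
import numpy as np
from scipy import stats

# Discrete: a fair six-sided die, modeled as a uniform PMF over {1, ..., 6}
die = stats.randint(low=1, high=7)       # randint's upper bound is exclusive
print(die.pmf(3))                        # PMF(3) = 1/6 ≈ 0.1667
print(die.cdf(3))                        # CDF(3) = P(X <= 3) = 0.5

# Continuous: a standard normal
z = stats.norm(loc=0.0, scale=1.0)
print(z.pdf(0.0))                        # density at 0 ≈ 0.3989 (not a probability)
print(z.cdf(0.0))                        # P(X <= 0) = 0.5
print(z.cdf(1.0) - z.cdf(-1.0))          # P(-1 < X < 1) ≈ 0.6827 (area under the curve)
```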

Interactive: Distribution Explorer

Select different distributions and adjust their parameters to build intuition for how they behave.

Interactive figure: the density of the selected distribution (default: Normal with mean \mu = 0 and standard deviation \sigma = 1), with sliders for its parameters, a sample button to draw data from the current settings, and a readout of the theoretical moments (expected value and variance).

Discrete Distributions

Bernoulli

Single trial

One trial with two outcomes: Success (1) with probability p, Failure (0) with probability 1-p.

Mean: p
Variance: p(1-p)

ML: Logistic Regression output. Binary Cross-Entropy is derived from Bernoulli likelihood.
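A quick simulation sketch, assuming NumPy; p = 0.3 is an arbitrary illustrative success probability:

```python
import numpy as np

rng = np.random.default_rng(0)
p = 0.3                                            # illustrative success probability
samples = rng.binomial(n=1, p=p, size=100_000)     # Bernoulli = Binomial with n = 1

print(samples.mean())   # ≈ p = 0.3
print(samples.var())    # ≈ p(1 - p) = 0.21
```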

Binomial

n trials

Number of successes k in n independent Bernoulli trials.

P(k) = \binom{n}{k} p^k (1-p)^{n-k}
Mean: np
Variance: np(1-p)

ML: A/B testing (how many users convert out of 1000?).
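A small SciPy sketch of this A/B-testing framing; the values n = 1000 and p = 0.05 are illustrative assumptions:

```python
from scipy import stats

# Out of n = 1000 users who each convert independently with probability p = 0.05,
# how likely are 60 or more conversions?
n, p = 1000, 0.05
conversions = stats.binom(n=n, p=p)

print(conversions.mean())   # np = 50
print(conversions.var())    # np(1 - p) = 47.5
print(conversions.sf(59))   # P(X >= 60) via the survival function
```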

Poisson

Rare events

Number of events in a fixed interval, given an average rate \lambda.

P(k) = \frac{\lambda^k e^{-\lambda}}{k!}
Mean: \lambda
Variance: \lambda

ML: Website traffic spikes, call center volume, Poisson Regression.
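A short SciPy sketch; the rate of 4 events per minute is an illustrative assumption:

```python
from scipy import stats

lam = 4.0                             # average requests per minute (illustrative)
traffic = stats.poisson(mu=lam)

print(traffic.pmf(0))                 # a quiet minute with zero requests: e^(-4) ≈ 0.018
print(traffic.mean(), traffic.var())  # both equal lambda = 4.0
print(traffic.sf(9))                  # P(X >= 10): probability of a traffic spike
```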

Continuous Distributions

Normal (Gaussian)

The King

The bell curve. Defined by mean \mu and variance \sigma^2. Due to the Central Limit Theorem, sums of many independent random variables converge to it.

f(x) = \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{(x-\mu)^2}{2\sigma^2}}

ML: Weight initialization (Xavier/He), L2 regularization = Gaussian prior, VAEs.
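A minimal sketch of He-style Gaussian weight initialization with NumPy; the layer sizes are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
fan_in, fan_out = 512, 256            # illustrative layer widths

# He initialization: zero-mean Gaussian with variance 2 / fan_in
W = rng.normal(loc=0.0, scale=np.sqrt(2.0 / fan_in), size=(fan_in, fan_out))

print(W.mean())   # ≈ 0
print(W.var())    # ≈ 2 / fan_in ≈ 0.0039
```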

Exponential

Waiting time

Models time between Poisson events. Has the memoryless property: P(wait 5 more min) is independent of how long you have waited.

f(x) = \lambda e^{-\lambda x}, \quad x \geq 0
Mean: 1/\lambda
Variance: 1/\lambda^2

ML: Survival analysis, time-to-failure predictions.
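A simulation sketch of the memoryless property, assuming NumPy; the rate and thresholds are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
lam = 0.5                                          # illustrative rate; mean wait = 1/lam = 2
waits = rng.exponential(scale=1.0 / lam, size=1_000_000)

# Memoryless check: P(X > 7 | X > 2) should equal P(X > 5)
p_cond = (waits > 7).sum() / (waits > 2).sum()
p_plain = (waits > 5).mean()
print(p_cond, p_plain)                             # both ≈ exp(-lam * 5) ≈ 0.082
```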

Log-Normal

Heavy tail

If ln(X) is Normal, then X is Log-Normal. Models data that is positive with a heavy right tail.

ML: House prices, incomes, stock prices. This is why we log-transform targets in regression!
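A NumPy sketch of the log-transform idea; the parameters are illustrative, not real price data:

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 12.0, 0.5                              # illustrative parameters for price-like data
prices = rng.lognormal(mean=mu, sigma=sigma, size=100_000)

print(np.mean(prices), np.median(prices))          # mean > median: heavy right tail
log_prices = np.log(prices)                        # log-transform recovers a symmetric Normal
print(np.mean(log_prices), np.std(log_prices))     # ≈ mu = 12.0 and sigma = 0.5
```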

Bayesian Prior Distributions

These distributions describe probabilities of probabilities. They are used as priors in Bayesian inference.

Beta Distribution

Defined on [0, 1]. Represents uncertainty about a probability (like a coin's bias).

"I think the coin lands heads 60% of the time, but I'm not sure."

Conjugate prior for Binomial likelihood.
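A conjugacy sketch with SciPy; the Beta(2, 2) prior and the 7-heads-in-10-flips data are illustrative assumptions:

```python
from scipy import stats

# Start with a Beta(2, 2) prior on a coin's heads probability, observe 7 heads
# and 3 tails; conjugacy gives the posterior Beta(2 + 7, 2 + 3) in closed form.
prior = stats.beta(a=2, b=2)
heads, tails = 7, 3
posterior = stats.beta(a=2 + heads, b=2 + tails)

print(prior.mean())       # 0.5 (before seeing data)
print(posterior.mean())   # 9/14 ≈ 0.64 (pulled toward the observed 0.7)
```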

Dirichlet Distribution

Multivariate Beta. Distribution over probability vectors that sum to 1.

Used in Latent Dirichlet Allocation (LDA) for topic modeling.
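A minimal NumPy sketch; the concentration parameters alpha are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
alpha = np.array([0.5, 0.5, 0.5])      # illustrative concentration parameters

# One Dirichlet draw is a probability vector (non-negative, sums to 1),
# e.g. the topic proportions of a single document in LDA.
theta = rng.dirichlet(alpha)
print(theta, theta.sum())              # components sum to 1.0
```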

For more on Bayesian inference, see Bayesian vs. Frequentist.

The Distribution Family Tree

Distributions are not isolated. They are connected through limiting relationships.

Bernoulli \rightarrow Binomial: Sum n independent Bernoulli(p) trials and you get Binomial(n, p).

Binomial \rightarrow Normal: As n \to \infty, Binomial looks Gaussian (CLT).

Binomial \rightarrow Poisson: As n \to \infty and p \to 0 (with np = \lambda held constant), Binomial becomes Poisson (a quick numerical check follows this list).

Exponential \rightarrow Gamma: Sum k Exponentials and you get \text{Gamma}(k, \lambda).

Normal^2 \rightarrow Chi-Square: A sum of squared standard normals gives Chi-Square.
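A quick numerical check of the Binomial-to-Poisson limit with SciPy; n and lambda below are illustrative:

```python
from scipy import stats

# Binomial(n, p) with large n, small p, and np = lam fixed is close to Poisson(lam).
lam, n = 3.0, 10_000
p = lam / n

binom_pmf = stats.binom.pmf(5, n=n, p=p)
poisson_pmf = stats.poisson.pmf(5, mu=lam)
print(binom_pmf, poisson_pmf)   # nearly identical, ≈ 0.1008
```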

ML Applications: Loss Functions

Deep learning loss functions are negative log-likelihoods of specific distributions. Choosing a loss function implicitly assumes a distribution on your target variable.

MSE Loss \leftrightarrow Gaussian

Minimizing MSE assumes the target y follows a Gaussian distribution around the prediction.

\text{MSE} = -\log P(y \mid \hat{y}, \sigma) + \text{const}
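A numerical sketch of this equivalence, assuming NumPy, a fixed sigma = 1, and illustrative targets and predictions:

```python
import numpy as np

y = np.array([1.2, 0.3, -0.5])        # illustrative targets
y_hat = np.array([1.0, 0.0, -0.2])    # illustrative predictions

# Gaussian negative log-likelihood per point (sigma = 1): 0.5*(y - y_hat)^2 + 0.5*log(2*pi)
mse = np.mean((y - y_hat) ** 2)
gauss_nll = np.mean(0.5 * (y - y_hat) ** 2 + 0.5 * np.log(2 * np.pi))

print(0.5 * mse + 0.5 * np.log(2 * np.pi), gauss_nll)   # identical: MSE = NLL up to constants
```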

BCE Loss \leftrightarrow Bernoulli

Binary Cross-Entropy assumes the target follows a Bernoulli distribution.

\text{BCE} = -[y \log(p) + (1-y)\log(1-p)]
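A numerical sketch of this equivalence, with illustrative labels and predicted probabilities:

```python
import numpy as np

y = np.array([1, 0, 1, 1])              # illustrative binary labels
p = np.array([0.9, 0.2, 0.6, 0.75])     # illustrative predicted probabilities

# BCE is the average negative log of the Bernoulli PMF p^y * (1-p)^(1-y)
bce = -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))
bernoulli_nll = -np.mean(np.log(np.where(y == 1, p, 1 - p)))

print(bce, bernoulli_nll)               # identical
```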

Categorical CE \leftrightarrow Multinomial

Categorical Cross-Entropy assumes the target follows a Multinomial (Categorical) distribution.

Regularization \leftrightarrow Prior

L2 regularization = Gaussian prior on weights. L1 regularization = Laplace prior on weights.
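A sketch of the L2-Gaussian correspondence, assuming NumPy/SciPy and an illustrative regularization strength lam:

```python
import numpy as np
from scipy import stats

w = np.array([0.5, -1.2, 0.3])    # illustrative weight vector
lam = 0.1                         # illustrative regularization strength

# The negative log-density of an isotropic Gaussian prior N(0, 1/lam) on the weights
# equals the L2 penalty (lam/2) * ||w||^2 plus a constant independent of w.
l2_penalty = 0.5 * lam * np.sum(w ** 2)
neg_log_prior = -np.sum(stats.norm.logpdf(w, loc=0.0, scale=np.sqrt(1.0 / lam)))
const = 0.5 * len(w) * np.log(2 * np.pi / lam)

print(l2_penalty, neg_log_prior - const)   # identical
```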

Key insight: Understanding probability distributions helps you choose the right loss function for your problem. Predicting counts? Consider Poisson loss. Predicting positive values with heavy tails? Consider log-transforming the target (Log-Normal assumption).