
Maximum Likelihood Estimation

Finding the parameters that make your data most probable.

Introduction

Prerequisites: This chapter assumes familiarity with Sampling Distributions and basic Probability Distributions.

Throughout statistics, we often assume we know the population parameters (like the mean μ or the variance σ²). But in the real world, we never know these values. We only have data.

The Core Question

Given some observed data, what are the most reasonable values for the unknown parameters?

Maximum Likelihood Estimation (MLE) answers this by asking: "Which parameters would have made our observed data most probable?" It is the foundation of most ML loss functions.

The Big Idea: A Simple Example

The Bag of Balls

Imagine a bag with 3 balls. Each ball is either Red or Blue, but you do not know the combination. Let θ = the number of Blue balls. Possible values: 0, 1, 2, or 3.

Experiment: You draw 4 balls with replacement and observe: Blue, Red, Blue, Blue

Which hypothesis about θ makes this outcome most probable?

θ = 0: P(Blue) = 0  →  P(data) = 0 (impossible)
θ = 1: P(Blue) = 1/3  →  P(data) = 2/81 ≈ 0.025
θ = 2: P(Blue) = 2/3  →  P(data) = 8/81 ≈ 0.099 (highest!)
θ = 3: P(Red) = 0  →  P(data) = 0 (impossible)

MLE chooses θ = 2 because it maximizes the probability of observing the data.
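If you prefer to see this as code, here is a minimal Python sketch (standard library only) that brute-forces the same table:

```python
# Minimal sketch of the bag-of-balls example: evaluate P(data | theta)
# for every candidate theta and pick the one that makes the data most probable.
observed = ["Blue", "Red", "Blue", "Blue"]

def prob_of_data(theta, draws, n_balls=3):
    """Probability of the observed draws if theta of the n_balls are Blue."""
    p_blue = theta / n_balls
    prob = 1.0
    for color in draws:
        prob *= p_blue if color == "Blue" else (1 - p_blue)
    return prob

likelihoods = {theta: prob_of_data(theta, observed) for theta in range(4)}
for theta, lik in likelihoods.items():
    print(f"theta = {theta}: P(data) = {lik:.4f}")

mle = max(likelihoods, key=likelihoods.get)
print("MLE:", mle)  # prints 2, matching the table above
```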

Likelihood vs. Probability

These terms sound interchangeable, but in statistics they describe opposite perspectives on the same expression.

Probability P(x | θ)

Fix the parameter, ask about data.

"If the coin is fair (θ=0.5\theta = 0.5), what is the probability of getting 7 heads in 10 flips?"

Likelihood L(θ | x)

Fix the data, ask about parameters.

"I observed 7 heads in 10 flips. How likely is it that θ=0.5\theta = 0.5? How about θ=0.7\theta = 0.7?"

Key Insight

Likelihood is NOT a probability distribution over θ. It does not sum (or integrate) to 1 across all values of θ. It is simply a function that tells us how "compatible" each θ is with our observed data.

Interactive: Visualizing the Likelihood Function

For the coin flip example, the likelihood function is L(p) = p^k (1-p)^{n-k}, where k is the number of heads observed and n is the total number of flips. Adjust the sliders to see how the likelihood curve changes.

Interactive Likelihood Function

Adjust the observed coin flips and watch the likelihood function change. The peak shows the MLE estimate.

[Interactive chart: L(p) plotted over p ∈ [0, 1] for the default data of 7 heads and 3 tails; the curve peaks at the MLE p = 0.700 with maximum likelihood ≈ 2.22 × 10⁻³.]

The curve shows L(p) = p^7 * (1-p)^3. The peak at p = 0.70 is the Maximum Likelihood Estimate.
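You can reproduce the curve without the widget. The sketch below (assuming NumPy is available) evaluates L(p) on a fine grid and reads off the peak:

```python
import numpy as np

# Evaluate L(p) = p^k (1-p)^(n-k) on a grid, mirroring the 7H/3T example.
k, n = 7, 10
p_grid = np.linspace(0.001, 0.999, 999)
likelihood = p_grid**k * (1 - p_grid)**(n - k)

best = p_grid[np.argmax(likelihood)]
print(f"MLE ≈ {best:.3f}")                          # ≈ 0.700
print(f"max likelihood ≈ {likelihood.max():.2e}")   # ≈ 2.22e-03
```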

The Log-Likelihood Trick

With many observations, the likelihood becomes a product of many small numbers. This causes two problems:

Problem 1: Underflow

0.01 × 0.01 × 0.01 × … = 10⁻²⁰⁰ after just 100 such factors; keep multiplying and the product eventually underflows to 0 in floating point.

Problem 2: Derivatives

Taking the derivative of a long product means applying the product rule over and over, which quickly becomes unwieldy.

Solution: Take the natural log. Since log is monotonically increasing, the θ that maximizes L(θ) also maximizes ln L(θ).

Log-Likelihood

\ell(\theta) = \ln L(\theta) = \sum_{i=1}^{n} \ln f(x_i; \theta)

Products become sums. Sums are easy to differentiate.
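A quick sketch of the underflow issue and its fix, using only Python's standard library: the naive product of many small probabilities collapses to zero, while the sum of logs stays perfectly representable.

```python
import math

# 500 observations that each contribute probability 0.01.
probs = [0.01] * 500

# Naive product: 10^-1000 is far below the smallest positive double, so it underflows.
product = 1.0
for p in probs:
    product *= p
print(product)            # 0.0

# Summing logs instead keeps the quantity in a comfortable range.
log_likelihood = sum(math.log(p) for p in probs)
print(log_likelihood)     # ≈ -2302.6, i.e. the log of 10^-1000
```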

Worked Example: Biased Coin

Observe n coin flips with k heads. What is the MLE for p (probability of heads)?

Step 1: Write the Likelihood

L(p) = p^k (1-p)^{n-k}

Step 2: Take the Log

\ell(p) = k \ln(p) + (n-k) \ln(1-p)

Step 3: Differentiate

\frac{d\ell}{dp} = \frac{k}{p} - \frac{n-k}{1-p}

Step 4: Set to Zero and Solve

\frac{k}{p} = \frac{n-k}{1-p} \quad\Rightarrow\quad k(1-p) = p(n-k) \quad\Rightarrow\quad k = np

\hat{p}_{MLE} = \frac{k}{n}

The MLE for p is simply the observed proportion of heads. Intuitive!
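As a sanity check, here is a small sketch (assuming NumPy and SciPy are installed) that minimizes the negative log-likelihood numerically and confirms it lands on k/n:

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Confirm numerically that k/n maximizes the coin-flip log-likelihood.
k, n = 7, 10

def neg_log_likelihood(p):
    return -(k * np.log(p) + (n - k) * np.log(1 - p))

result = minimize_scalar(neg_log_likelihood, bounds=(1e-6, 1 - 1e-6), method="bounded")
print(f"numerical MLE: {result.x:.4f}")   # ≈ 0.7000
print(f"closed form k/n: {k / n:.4f}")    # 0.7000
```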

Worked Example: Normal Distribution

Data x_1, …, x_n drawn from N(μ, σ²). Estimate μ.

Step 1: Likelihood (PDF Product)

L(\mu) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x_i - \mu)^2}{2\sigma^2}\right)

Step 2: Log-Likelihood

\ell(\mu) = -\frac{n}{2}\ln(2\pi\sigma^2) - \sum_{i=1}^{n} \frac{(x_i - \mu)^2}{2\sigma^2}

Step 3: Differentiate w.r.t. μ

The first term is constant with respect to μ, so its derivative vanishes.

\frac{d\ell}{d\mu} = \frac{1}{\sigma^2} \sum_{i=1}^{n} (x_i - \mu)

Step 4: Solve

\sum_{i=1}^{n} (x_i - \mu) = 0 \quad\Rightarrow\quad \sum_{i=1}^{n} x_i = n\mu

\hat{\mu}_{MLE} = \frac{1}{n} \sum_{i=1}^{n} x_i = \bar{x}

The MLE for the mean is the sample mean!
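The same numerical check works here. The sketch below (again assuming NumPy and SciPy, with σ² treated as known) shows the maximizing μ coinciding with the sample mean:

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Synthetic Normal data; the mu that maximizes the log-likelihood is x-bar.
rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=2.0, size=1_000)
sigma2 = 4.0  # known variance

def neg_log_likelihood(mu):
    return 0.5 * len(x) * np.log(2 * np.pi * sigma2) + np.sum((x - mu) ** 2) / (2 * sigma2)

result = minimize_scalar(neg_log_likelihood, bounds=(0.0, 10.0), method="bounded")
print(f"numerical MLE for mu: {result.x:.4f}")
print(f"sample mean x-bar:    {x.mean():.4f}")   # the two agree
```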

Why MLE Works: Key Properties

For large datasets, MLE is theoretically optimal. Here is why:

1. Consistency

As n approaches infinity, the estimate converges to the true parameter. More data = more accurate.

2. Efficiency

Asymptotically, the MLE attains the Cramér–Rao lower bound: for large n, no unbiased estimator has lower variance. MLE extracts the maximum information from the data.

3. Invariance

If θ̂ is the MLE for θ, then g(θ̂) is the MLE for g(θ). Transformations are easy.

4. Asymptotic Normality

For large n, the sampling distribution of the MLE is approximately Normal. This makes confidence interval construction easy.
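For instance, the standard Wald recipe is θ̂ ± 1.96 · SE. A small sketch for the coin example, assuming the Normal approximation is adequate at this sample size and using the asymptotic standard error sqrt(p̂(1-p̂)/n):

```python
import math

# Wald-style 95% confidence interval for the coin-flip MLE.
k, n = 70, 100
p_hat = k / n
se = math.sqrt(p_hat * (1 - p_hat) / n)   # asymptotic standard error
z = 1.96                                  # 97.5th percentile of N(0, 1)

print(f"p_hat = {p_hat:.3f}, 95% CI = ({p_hat - z*se:.3f}, {p_hat + z*se:.3f})")
# p_hat = 0.700, 95% CI ≈ (0.610, 0.790)
```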

Interactive: MLE Convergence

Watch how the MLE estimate converges to the true parameter as you increase sample size. This demonstrates the Consistency property in action.

MLE Convergence Simulator

Consistency Property: N → ∞ ⇒ θ̂ → θ

[Interactive simulator: as the number of samples grows, the running estimate θ̂ converges to the true value θ = 0.5.]
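The same experiment as a code sketch, assuming NumPy and a coin with true θ = 0.5 (the MLE for a Bernoulli parameter is just the sample proportion):

```python
import numpy as np

# Consistency in action: the sample proportion settles near the true theta as n grows.
rng = np.random.default_rng(42)
true_theta = 0.5

for n in [10, 100, 1_000, 10_000, 100_000]:
    flips = rng.random(n) < true_theta   # True = heads
    theta_hat = flips.mean()
    print(f"n = {n:>7}: theta_hat = {theta_hat:.4f}")
```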

MLE in Machine Learning

MLE is not just a statistical concept. It is the foundation of most ML loss functions!

MSE Loss = MLE for Gaussian

If we assume errors are normally distributed, minimizing negative log-likelihood gives:

-\ell(\mu) \propto \sum_{i=1}^{n} (x_i - \mu)^2

This is exactly Mean Squared Error! Linear regression minimizes MSE because it assumes Gaussian noise.
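A small numerical sketch of this equivalence (assuming NumPy, with σ² fixed at 1 for illustration): the Gaussian negative log-likelihood and the MSE trace out curves with the same minimizer.

```python
import numpy as np

# The Gaussian NLL is an affine function of the MSE, so both pick the same mu.
rng = np.random.default_rng(1)
x = rng.normal(loc=3.0, scale=1.0, size=500)
sigma2 = 1.0

mus = np.linspace(2.0, 4.0, 201)
nll = np.array([0.5 * len(x) * np.log(2 * np.pi * sigma2)
                + np.sum((x - mu) ** 2) / (2 * sigma2) for mu in mus])
mse = np.array([np.mean((x - mu) ** 2) for mu in mus])

print("argmin of NLL:", mus[np.argmin(nll)])
print("argmin of MSE:", mus[np.argmin(mse)])   # same mu for both
```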

Cross-Entropy Loss = MLE for Bernoulli

For binary classification with predicted probability p:

-\ell(p) = -\left[ y \ln(p) + (1-y)\ln(1-p) \right]

This is Binary Cross-Entropy! Logistic regression uses BCE because it models Bernoulli outcomes.
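Here is a minimal sketch (assuming NumPy) showing that the binary cross-entropy formula and the averaged negative Bernoulli log-likelihood give identical numbers on the same predictions:

```python
import numpy as np

# BCE is exactly the negative Bernoulli log-likelihood, averaged over observations.
y = np.array([1, 0, 1, 1, 0], dtype=float)   # observed labels
p = np.array([0.9, 0.2, 0.7, 0.6, 0.1])      # predicted P(y = 1)

bce = -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))
neg_log_lik = -np.sum(np.log(np.where(y == 1, p, 1 - p))) / len(y)

print(bce, neg_log_lik)   # identical values
```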

Softmax + Cross-Entropy = MLE for Categorical

For multi-class classification, minimizing categorical cross-entropy is equivalent to MLE assuming a Categorical distribution over classes.
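A small sketch (assuming NumPy) of that equivalence for a single example with three classes: the categorical cross-entropy is just the negative log of the softmax probability assigned to the true class.

```python
import numpy as np

# Categorical cross-entropy = negative log-probability of the true class under softmax.
logits = np.array([2.0, 0.5, -1.0])     # raw scores for 3 classes
true_class = 0

probs = np.exp(logits - logits.max())   # numerically stable softmax
probs /= probs.sum()

cross_entropy = -np.log(probs[true_class])
print(probs, cross_entropy)             # cross-entropy ≈ 0.24
```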

Bottom Line

When you train a neural network by minimizing cross-entropy or MSE, you are performing MLE. The loss function encodes your assumption about the data distribution.

Limitations and the Bayesian Fix

MLE is powerful but has a critical weakness with small data.

The "Zero Count" Problem

Flip a coin 3 times, get 3 heads. MLE says: P(heads) = 1.0

This means MLE concludes tails is impossible. Obviously wrong!

MLE overfits to small samples because it only considers the data, with no prior beliefs.

The Fix: MAP (Maximum A Posteriori)

MAP adds a Prior distribution encoding our beliefs before seeing data:

\hat{\theta}_{MAP} = \arg\max_{\theta} \underbrace{L(\theta)}_{\text{Likelihood}} \cdot \underbrace{P(\theta)}_{\text{Prior}}

In ML, the Prior corresponds to Regularization:

  • L2 regularization = Gaussian prior on weights
  • L1 regularization = Laplace prior on weights
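As a concrete sketch of how a prior tames the zero-count problem, take a Beta(α, β) prior on p; the values α = β = 2 below are illustrative, not canonical. The MAP estimate is the mode of the resulting Beta posterior:

```python
# MAP for the coin with a Beta(alpha, beta) prior. With alpha = beta = 2
# (a mild belief that the coin is roughly fair), 3 heads in 3 flips no longer
# yields the absurd estimate P(heads) = 1.0.
k, n = 3, 3              # 3 heads in 3 flips
alpha, beta = 2.0, 2.0   # prior pseudo-counts (assumed values for illustration)

p_mle = k / n
p_map = (k + alpha - 1) / (n + alpha + beta - 2)   # mode of the Beta posterior

print(f"MLE: {p_mle:.3f}")   # 1.000
print(f"MAP: {p_map:.3f}")   # 0.800
```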

For a deeper dive, see Bayesian vs. Frequentist.