Introduction
Prerequisites: This chapter assumes familiarity with Sampling Distributions and basic Probability Distributions.
Throughout statistics, we often assume we know the population parameters (like the mean μ or the variance σ²). But in the real world, we never know these values. We only have data.
The Core Question
Given some observed data, what are the most reasonable values for the unknown parameters?
Maximum Likelihood Estimation (MLE) answers this by asking: "Which parameters would have made our observed data most probable?" It is the foundation of most ML loss functions.
The Big Idea: A Simple Example
The Bag of Balls
Imagine a bag with 3 balls. Each ball is either Red or Blue, but you do not know the combination. Let θ = number of Blue balls. Possible values: θ = 0, 1, 2, or 3.
Experiment: You draw 4 balls with replacement and observe: Blue, Red, Blue, Blue
Which hypothesis about θ makes this outcome most probable?
If the bag holds θ Blue balls, then P(Blue) = θ/3 on each draw, so P(data | θ) = (θ/3)³ · (1 - θ/3). This is 0 for θ = 0 and θ = 3, about 0.025 for θ = 1, and about 0.099 for θ = 2. MLE chooses θ = 2 because it maximizes the probability of observing the data.
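Here is a minimal sketch of that comparison in Python (numpy only; the variable names are purely illustrative):

```python
import numpy as np

# Each hypothesis: theta = number of Blue balls in the bag (0..3).
# Drawing with replacement, P(Blue) = theta / 3 under each hypothesis.
draws = ["Blue", "Red", "Blue", "Blue"]  # the observed data

for theta in range(4):
    p_blue = theta / 3
    # Probability of this exact sequence under the hypothesis
    prob = np.prod([p_blue if d == "Blue" else 1 - p_blue for d in draws])
    print(f"theta = {theta}: P(data | theta) = {prob:.4f}")

# The probability peaks at theta = 2 (about 0.0988), so MLE picks theta = 2.
```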
Likelihood vs. Probability
These terms sound interchangeable but have opposite meanings in statistics.
Probability
Fix the parameter, ask about data.
"If the coin is fair (), what is the probability of getting 7 heads in 10 flips?"
Likelihood
Fix the data, ask about parameters.
"I observed 7 heads in 10 flips. How likely is it that ? How about ?"
Key Insight
Likelihood is NOT a probability distribution over θ. It does not sum to 1 across all values of θ. It is simply a function that tells us how "compatible" each value of θ is with our observed data.
Interactive: Visualizing the Likelihood Function
For the coin flip example, the likelihood function is L(p) = p^k * (1-p)^(n-k), where k = heads observed and n = total flips. Adjust the sliders to see how the likelihood curve changes.
The curve shows L(p) = p^7 * (1-p)^3. The peak at p = 0.70 is the Maximum Likelihood Estimate.
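If you want to reproduce that peak numerically, a minimal sketch using a grid search instead of calculus:

```python
import numpy as np

k, n = 7, 10                                      # 7 heads in 10 flips
p_grid = np.linspace(0.001, 0.999, 999)
likelihood = p_grid**k * (1 - p_grid)**(n - k)    # L(p) = p^7 * (1-p)^3

p_mle = p_grid[np.argmax(likelihood)]
print(f"MLE estimate: p = {p_mle:.2f}")           # prints p = 0.70
```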
The Log-Likelihood Trick
With many observations, the likelihood becomes a product of many small numbers. This causes two problems:
Problem 1: Underflow
A product of thousands of probabilities, each less than 1, quickly drops below the smallest value a standard float can represent (around 10^-308), which computers round to 0.
Problem 2: Derivatives
Taking the derivative of a long product requires applying the Product Rule over and over, which quickly becomes unwieldy.
Solution: Take the natural log. Since log is monotonically increasing, the θ that maximizes L(θ) also maximizes log L(θ).
Log-Likelihood
log L(θ) = log ∏ᵢ P(xᵢ | θ) = Σᵢ log P(xᵢ | θ). Products become sums. Sums are easy to differentiate.
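A quick illustration of why the log matters numerically, using synthetic probabilities purely for demonstration:

```python
import numpy as np

rng = np.random.default_rng(0)
probs = rng.uniform(0.01, 0.2, size=1000)   # 1,000 small probabilities

print(np.prod(probs))          # 0.0 -- the raw product underflows to zero
print(np.sum(np.log(probs)))   # a perfectly usable finite number
```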
Worked Example: Biased Coin
Observe n coin flips with k heads. What is the MLE for p (probability of heads)?
Step 1: Write the Likelihood
L(p) = p^k * (1-p)^(n-k)
Step 2: Take the Log
log L(p) = k log p + (n - k) log(1 - p)
Step 3: Differentiate
d/dp log L(p) = k/p - (n - k)/(1 - p)
Step 4: Set to Zero and Solve
k/p - (n - k)/(1 - p) = 0  ⇒  p̂ = k/n
The MLE for p is simply the observed proportion of heads. Intuitive!
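To sanity-check the calculus, one can maximize the log-likelihood numerically and compare against k/n. A minimal sketch using scipy, with the 7-heads-in-10-flips numbers from the interactive example above:

```python
import numpy as np
from scipy.optimize import minimize_scalar

k, n = 7, 10

def neg_log_likelihood(p):
    # Negative log-likelihood of k heads in n flips under Bernoulli(p)
    return -(k * np.log(p) + (n - k) * np.log(1 - p))

result = minimize_scalar(neg_log_likelihood, bounds=(1e-6, 1 - 1e-6), method="bounded")
print(result.x, k / n)   # both are approximately 0.7
```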
Worked Example: Normal Distribution
Data x₁, …, xₙ drawn from a Normal distribution N(μ, σ²). Estimate μ.
Step 1: Likelihood (PDF Product)
L(μ) = ∏ᵢ (1 / √(2πσ²)) · exp( -(xᵢ - μ)² / (2σ²) )
Step 2: Log-Likelihood
log L(μ) = -(n/2) log(2πσ²) - (1/(2σ²)) Σᵢ (xᵢ - μ)²
Step 3: Differentiate w.r.t. μ
The first term is constant w.r.t. μ, so it disappears:
d/dμ log L(μ) = (1/σ²) Σᵢ (xᵢ - μ)
Step 4: Solve
Σᵢ (xᵢ - μ) = 0  ⇒  μ̂ = (1/n) Σᵢ xᵢ = x̄
The MLE for the mean is the sample mean!
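A minimal sketch that checks this result on simulated data (the true μ = 5.0 and σ = 2.0 below are arbitrary choices):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(42)
data = rng.normal(loc=5.0, scale=2.0, size=10_000)   # true mu = 5.0

mu_mle = data.mean()              # the MLE derived above is the sample mean
loc_hat, scale_hat = norm.fit(data)   # scipy's MLE fit of a Normal

print(mu_mle, loc_hat)            # both are close to 5.0 and agree with each other
```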
Why MLE Works: Key Properties
For large datasets, MLE is theoretically optimal. Here is why:
1. Consistency
As n approaches infinity, the estimate converges to the true parameter. More data = more accurate.
2. Efficiency
Asymptotically, no unbiased estimator has lower variance; MLE attains the Cramér-Rao lower bound. It extracts maximum information from the data.
3. Invariance
If θ̂ is the MLE for θ, then g(θ̂) is the MLE for g(θ). Transformations are easy.
4. Asymptotic Normality
For large n, MLE follows a Normal distribution. This allows easy confidence interval construction.
Interactive: MLE Convergence
Watch how the MLE estimate converges to the true parameter as you increase sample size. This demonstrates the Consistency property in action.
MLE Convergence Simulator (Consistency property: as n → ∞, θ̂ → θ).
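The same effect can be reproduced offline. A minimal sketch, using a biased coin with an arbitrarily chosen true p = 0.3:

```python
import numpy as np

rng = np.random.default_rng(0)
true_p = 0.3   # true parameter of the Bernoulli (biased coin)

for n in [10, 100, 1_000, 10_000, 100_000]:
    flips = rng.binomial(1, true_p, size=n)
    p_hat = flips.mean()   # MLE for a Bernoulli is the sample proportion
    print(f"n = {n:>6}: p_hat = {p_hat:.4f}  (|error| = {abs(p_hat - true_p):.4f})")
```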
MLE in Machine Learning
MLE is not just a statistical concept. It is the foundation of most ML loss functions!
MSE Loss = MLE for Gaussian
If we assume errors are normally distributed, minimizing the negative log-likelihood gives:
-log L = constant + (1/(2σ²)) Σᵢ (yᵢ - ŷᵢ)²
Up to constants, this is exactly Mean Squared Error! Linear regression minimizes MSE because it assumes Gaussian noise.
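A minimal sketch of that equivalence on synthetic predictions (σ = 1.0 is an assumed noise level, and the data below is made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
y_true = rng.normal(size=100)
y_pred = y_true + rng.normal(scale=0.5, size=100)   # stand-in predictions
sigma = 1.0                                         # assumed noise std deviation
n = len(y_true)

mse = np.mean((y_true - y_pred) ** 2)

# Negative log-likelihood of the residuals under N(0, sigma^2)
nll = 0.5 * n * np.log(2 * np.pi * sigma**2) \
      + np.sum((y_true - y_pred) ** 2) / (2 * sigma**2)

# The NLL equals a constant plus a positive multiple of the MSE,
# so both are minimized by the same predictions.
print(np.isclose(nll, 0.5 * n * np.log(2 * np.pi * sigma**2)
                 + n * mse / (2 * sigma**2)))   # True
```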
Cross-Entropy Loss = MLE for Bernoulli
For binary classification with label yᵢ ∈ {0, 1} and predicted probability pᵢ:
-log L = -Σᵢ [ yᵢ log pᵢ + (1 - yᵢ) log(1 - pᵢ) ]
This is Binary Cross-Entropy! Logistic regression uses BCE because it models Bernoulli outcomes.
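A minimal sketch, using made-up labels and predicted probabilities:

```python
import numpy as np

y = np.array([1, 0, 1, 1, 0])              # binary labels
p = np.array([0.9, 0.2, 0.7, 0.6, 0.1])    # predicted P(y = 1)

# The average negative log-likelihood under a Bernoulli model...
nll = -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

# ...is literally the binary cross-entropy formula.
print(nll)
```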
Softmax + Cross-Entropy = MLE for Categorical
For multi-class classification, minimizing categorical cross-entropy is equivalent to MLE assuming a Categorical distribution over classes.
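A minimal sketch of the multi-class case, with arbitrary logits for three classes:

```python
import numpy as np

logits = np.array([2.0, 0.5, -1.0])   # raw scores for 3 classes
true_class = 0                        # the observed label

# Softmax turns the scores into a Categorical distribution over classes.
probs = np.exp(logits) / np.sum(np.exp(logits))

# Categorical cross-entropy for this example is the negative
# log-likelihood of the observed class under that distribution.
loss = -np.log(probs[true_class])
print(probs, loss)
```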
Bottom Line
When you train a neural network by minimizing cross-entropy or MSE, you are performing MLE. The loss function encodes your assumption about the data distribution.
Limitations and the Bayesian Fix
MLE is powerful but has a critical weakness with small data.
The "Zero Count" Problem
Flip a coin 3 times, get 3 heads. MLE says: P(heads) = 1.0
This means MLE concludes tails is impossible. Obviously wrong!
MLE overfits to small samples because it only considers the data, with no prior beliefs.
The Fix: MAP (Maximum A Posteriori)
MAP adds a Prior distribution encoding our beliefs before seeing data:
θ̂_MAP = argmax over θ of [ log L(θ) + log P(θ) ]
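As a concrete illustration of how a prior fixes the zero-count problem, here is a minimal sketch using a Beta prior on p (the Beta(2, 2) choice is just an example of a mild belief that the coin is roughly fair):

```python
k, n = 3, 3                 # 3 heads in 3 flips

p_mle = k / n               # 1.0 -- declares tails impossible

# MAP with a Beta(a, b) prior on p. For the Bernoulli/Beta pair the
# posterior mode has a closed form: (k + a - 1) / (n + a + b - 2).
a, b = 2, 2
p_map = (k + a - 1) / (n + a + b - 2)

print(p_mle)   # 1.0
print(p_map)   # 0.8 -- pulled back toward 0.5 by the prior
```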
In ML, the Prior corresponds to Regularization:
- L2 regularization = Gaussian prior on weights
- L1 regularization = Laplace prior on weights
For a deeper dive, see Bayesian vs. Frequentist.