Introduction
In ML, we rarely deal with single variables in isolation. Features interact, labels depend on inputs, and model parameters relate to each other. To model these relationships, we need multivariate probability.
Prerequisites: Conditional Probability, Random Variables.
Think of a dataset with multiple columns. There are three fundamental ways to view it:
The Trio Defined
Joint
The probability that X takes value x AND Y takes value y simultaneously: P(X = x, Y = y).
Marginal
The probability of X regardless of Y: P(X = x). Sum out the other variable.
Conditional
The probability of Y given that we observed X: P(Y = y | X = x). Slice and normalize.
Interactive: Contingency Table
This table shows 100 people categorized by Age (Young/Old) and Coffee preference (Latte/Espresso). Click the buttons to highlight different distributions.
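As a concrete (if made-up) version of such a table, here is a minimal NumPy sketch. The counts below are illustrative placeholders, not the widget's actual numbers; the joint distribution is simply each cell's count divided by the total.

```python
import numpy as np

# Hypothetical counts for 100 people (illustration only, not the widget's data).
# Rows: Age (Young, Old); Columns: Coffee (Latte, Espresso)
counts = np.array([[30, 10],
                   [20, 40]])

joint = counts / counts.sum()   # P(Age, Coffee): divide each cell by 100
print(joint)
# [[0.3 0.1]
#  [0.2 0.4]]
print(joint.sum())              # 1.0 -- a valid joint distribution sums to 1
```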
Joint Distribution Landscape
Visualize the joint probability as a heatmap. Mutual Information measures how "structured" this landscape is compared to the product of marginals.
Observation: When there is structure (lines, circles, clusters), the joint distribution is 'sharper' than the product of marginals. MI is high.
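A minimal sketch of that comparison, assuming a small discrete joint grid: the helper below contrasts the joint against the product of its marginals and returns the mutual information in nats.

```python
import numpy as np

def mutual_information(joint, eps=1e-12):
    """I(X; Y) in nats, computed from a discrete joint probability grid."""
    px = joint.sum(axis=1, keepdims=True)   # marginal P(x), shape (nx, 1)
    py = joint.sum(axis=0, keepdims=True)   # marginal P(y), shape (1, ny)
    indep = px * py                         # product of marginals
    mask = joint > eps                      # skip empty cells to avoid log(0)
    return np.sum(joint[mask] * np.log(joint[mask] / indep[mask]))

# A "structured" joint (mass concentrated on the diagonal)
structured = np.array([[0.45, 0.05],
                       [0.05, 0.45]])
# The corresponding product of marginals: same marginals, no structure
flat = np.outer(structured.sum(axis=1), structured.sum(axis=0))

print(mutual_information(structured))  # high: joint is sharper than the product
print(mutual_information(flat))        # ~0: no structure beyond the marginals
```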
Joint Distribution P(X, Y)
The joint distribution is the "master" distribution. If you have P(X, Y), you can derive everything else: marginals, conditionals, expectations, covariances.
Normalization: the sum over all combinations equals 1: Σ_x Σ_y P(x, y) = 1.
For continuous variables: ∫∫ f(x, y) dx dy = 1.
What can we compute from P(X, Y)? (A short NumPy sketch follows this list.)
- Marginals: P(x) = Σ_y P(x, y)
- Conditionals: P(y | x) = P(x, y) / P(x)
- Expected values: E[g(X, Y)] = Σ_{x,y} g(x, y) P(x, y)
- Covariance: Cov(X, Y) = E[XY] - E[X] E[Y]
- Independence check: does P(x, y) = P(x) P(y) hold for all x, y?
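A short NumPy sketch of the expectation, covariance, and independence items, using a hypothetical 3x3 joint table (marginals and conditionals get their own sketches in the sections that follow):

```python
import numpy as np

# A hypothetical 3x3 joint table P(X, Y); X and Y each take the values 0, 1, 2.
pxy = np.array([[0.10, 0.05, 0.05],
                [0.05, 0.20, 0.10],
                [0.05, 0.10, 0.30]])
assert np.isclose(pxy.sum(), 1.0)          # normalization check

p_x = pxy.sum(axis=1)                      # marginal P(X) (sum rule)
p_y = pxy.sum(axis=0)                      # marginal P(Y)

vals = np.arange(3)                        # the values 0, 1, 2
e_x  = np.sum(vals * p_x)                  # E[X]
e_y  = np.sum(vals * p_y)                  # E[Y]
e_xy = np.sum(np.outer(vals, vals) * pxy)  # E[XY]
cov  = e_xy - e_x * e_y                    # Cov(X, Y) = E[XY] - E[X]E[Y]

# Independence check: does P(x, y) = P(x) P(y) hold in every cell?
independent = np.allclose(pxy, np.outer(p_x, p_y))
print(cov, independent)                    # positive covariance, not independent
```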
Marginal Distribution P(X)
Marginalization means "summing out" the variables you do not care about. The result is written in the "margins" of a table.
The Sum Rule
To find P(Young), add P(Young, Latte) + P(Young, Espresso). In general, P(x) = Σ_y P(x, y).
Why is it called "marginal"?
In old-school contingency tables, these sums were written in the margins of the paper. The term stuck.
Interactive: Marginalization
Step through to see how summing rows or columns gives us marginal distributions.
The Joint Distribution P(X, Y)
This 3x3 grid contains all the information: each cell is the probability of a specific (X, Y) pair occurring.
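The same row-and-column summing can be done in one line each with NumPy. A minimal sketch, reusing the same kind of hypothetical 3x3 joint table as above:

```python
import numpy as np

# Summing rows or columns of the joint grid gives the marginals.
# The values here are hypothetical, not the widget's actual numbers.
pxy = np.array([[0.10, 0.05, 0.05],
                [0.05, 0.20, 0.10],
                [0.05, 0.10, 0.30]])

p_x = pxy.sum(axis=1)   # sum across each row:    P(X), one entry per row
p_y = pxy.sum(axis=0)   # sum down each column:   P(Y), one entry per column
print(p_x, p_x.sum())   # [0.2 0.35 0.45], sums to 1
print(p_y, p_y.sum())   # [0.2 0.35 0.45], sums to 1
```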
Conditional Distribution P(Y|X)
Conditioning is "slicing" the joint distribution and renormalizing so probabilities sum to 1 in that slice.
P(Y = y | X = x) = P(X = x, Y = y) / P(X = x): the joint probability divided by the marginal probability.
Example: P(Latte | Young) = P(Young, Latte) / P(Young).
Why normalize?
So that Σ_y P(y | x) = 1: the probabilities in a slice must sum to 1.
For details on conditional probability, see Conditional Probability.
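A minimal slice-and-normalize sketch, reusing a hypothetical 3x3 joint table like the one above. Note that the raw slice sums to the marginal P(X = x), not to 1, which is exactly why we renormalize:

```python
import numpy as np

# Rows index X, columns index Y (hypothetical values).
pxy = np.array([[0.10, 0.05, 0.05],
                [0.05, 0.20, 0.10],
                [0.05, 0.10, 0.30]])

x = 1                                  # condition on X = 1
row = pxy[x]                           # un-normalized slice: P(X=1, Y=y) for each y
p_y_given_x = row / row.sum()          # divide by the marginal P(X=1)

print(row.sum())                       # 0.35 -- the marginal P(X=1), not yet a distribution
print(p_y_given_x)                     # [0.143 0.571 0.286] (approx.)
print(p_y_given_x.sum())               # 1.0 -- the renormalized slice is a distribution
```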
Product Rule & Chain Rule
Rearranging the conditional definition gives us the Product Rule:
Joint = Conditional × Marginal: P(X, Y) = P(Y | X) P(X).
Generalized to n variables, this becomes the Chain Rule: P(x₁, x₂, …, xₙ) = P(x₁) P(x₂ | x₁) P(x₃ | x₁, x₂) … P(xₙ | x₁, …, xₙ₋₁).
GPT Connection
Autoregressive language models like GPT compute P(next word | all previous words) using the chain rule. Each token is conditioned on everything before it.
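A toy sketch of that chain-rule factorization: the bigram table `cond` below is a made-up stand-in for a real model's conditional (an actual autoregressive LM conditions on the entire prefix, not just the previous token), but the accumulation of log-conditionals is exactly the chain rule.

```python
from math import log

# Made-up conditional probabilities P(next | previous token); each row sums to 1.
cond = {
    "<s>": {"the": 0.7, "cat": 0.2, "sat": 0.1},
    "the": {"the": 0.1, "cat": 0.6, "sat": 0.3},
    "cat": {"the": 0.2, "cat": 0.1, "sat": 0.7},
    "sat": {"the": 0.5, "cat": 0.3, "sat": 0.2},
}

def sequence_logprob(tokens):
    """Chain rule: log P(w1..wn) = sum_i log P(w_i | prefix)."""
    logp, prev = 0.0, "<s>"
    for tok in tokens:
        logp += log(cond[prev][tok])   # one conditional factor per token
        prev = tok
    return logp

print(sequence_logprob(["the", "cat", "sat"]))  # log(0.7) + log(0.6) + log(0.7)
```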
ML Applications
Discriminative Models
Model the conditional P(y | x) directly.
Examples: Logistic Regression, Neural Networks, SVMs
Generative Models
Model the joint P(x, y).
Examples: Naive Bayes, GANs, VAEs, Diffusion Models
Bayesian Inference
Posterior = (Likelihood × Prior) / Evidence: P(θ | D) = P(D | θ) P(θ) / P(D).
The denominator P(D) = ∫ P(D | θ) P(θ) dθ is a marginal, and in general an intractable integral.
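A small grid-based sketch of Bayes' rule for a coin's bias θ: on a discrete grid the evidence becomes a plain sum, so nothing here is intractable. The data (7 heads in 10 flips) and the uniform prior are illustrative choices.

```python
import numpy as np

theta = np.linspace(0.01, 0.99, 99)          # candidate parameter values
prior = np.ones_like(theta) / len(theta)     # uniform prior P(theta)
likelihood = theta**7 * (1 - theta)**3       # P(data | theta): 7 heads, 3 tails
evidence = np.sum(likelihood * prior)        # P(data): marginalize out theta
posterior = likelihood * prior / evidence    # Bayes' rule

print(posterior.sum())              # ~1.0 -- dividing by the evidence normalizes
print(theta[np.argmax(posterior)])  # ~0.7, the most probable bias
```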
Latent Variable Models
Marginalize over hidden variables z: P(x) = Σ_z P(x, z), or ∫ P(x, z) dz for continuous z.
Examples: VAEs, Mixture Models, HMMs
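A minimal sketch of marginalizing a latent variable: here z is the hidden component of a two-component Gaussian mixture, and the weights, means, and standard deviations are made-up illustration values.

```python
import numpy as np

# Marginal density of a two-component Gaussian mixture (illustrative parameters):
# p(x) = sum_z P(z) * p(x | z), where z is the hidden component label.
weights = np.array([0.3, 0.7])      # P(z)
means   = np.array([-2.0, 1.5])     # mean of p(x | z)
stds    = np.array([0.5, 1.0])      # std of p(x | z)

def gaussian_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def marginal_density(x):
    """Sum out the latent z: we never observe which component generated x."""
    return np.sum(weights * gaussian_pdf(x, means, stds))

print(marginal_density(0.0))   # p(x = 0), computed without knowing z
```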