Introduction
In ML, we rarely deal with single variables in isolation. Features interact, labels depend on inputs, and model parameters relate to each other. To model these relationships, we need multivariate probability.
Prerequisites: Conditional Probability, Random Variables.
Think of a dataset with multiple columns. There are three fundamental ways to view it:
The Trio Defined
Joint
The probability that X takes value x AND Y takes value y simultaneously: P(X = x, Y = y).
Marginal
The probability of X regardless of Y: P(X = x). Sum out the other variable.
Conditional
The probability of Y given that we observed X: P(Y = y | X = x). Slice and normalize.
Interactive: Contingency Table
This table shows 100 people categorized by Age (Young/Old) and Coffee preference (Latte/Espresso). Click the buttons to highlight different distributions.
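As a concrete (if made-up) version of such a table, here is a minimal NumPy sketch. The counts below are illustrative placeholders, not the widget's actual numbers; the joint distribution is simply each cell's count divided by the total.

```python
import numpy as np

# Hypothetical counts for 100 people (illustration only, not the widget's data).
# Rows: Age (Young, Old); Columns: Coffee (Latte, Espresso)
counts = np.array([[30, 10],
                   [20, 40]])

joint = counts / counts.sum()   # P(Age, Coffee): divide each cell by 100
print(joint)
# [[0.3 0.1]
#  [0.2 0.4]]
print(joint.sum())              # 1.0 -- a valid joint distribution sums to 1
```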
Joint Distribution Landscape
Visualize the joint probability as a heatmap. Mutual Information measures how "structured" this landscape is compared to the product of marginals.
Observation: When there is structure (lines, circles, clusters), the joint distribution is 'sharper' than the product of marginals. MI is high.
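A minimal sketch of that comparison, assuming a small discrete joint grid: the helper below contrasts the joint against the product of its marginals and returns the mutual information in nats.

```python
import numpy as np

def mutual_information(joint, eps=1e-12):
    """I(X; Y) in nats, computed from a discrete joint probability grid."""
    px = joint.sum(axis=1, keepdims=True)   # marginal P(x), shape (nx, 1)
    py = joint.sum(axis=0, keepdims=True)   # marginal P(y), shape (1, ny)
    indep = px * py                         # product of marginals
    mask = joint > eps                      # skip empty cells to avoid log(0)
    return np.sum(joint[mask] * np.log(joint[mask] / indep[mask]))

# A "structured" joint (mass concentrated on the diagonal)
structured = np.array([[0.45, 0.05],
                       [0.05, 0.45]])
# The corresponding product of marginals: same marginals, no structure
flat = np.outer(structured.sum(axis=1), structured.sum(axis=0))

print(mutual_information(structured))  # high: joint is sharper than the product
print(mutual_information(flat))        # ~0: no structure beyond the marginals
```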
Joint Distribution P(X, Y)
The joint distribution is the "master" distribution. If you have P(X, Y), you can derive everything else: marginals, conditionals, expectations, covariances.
Normalization: the sum over all combinations equals 1: Σ_x Σ_y P(x, y) = 1.
For continuous variables: ∫∫ f(x, y) dx dy = 1.
What can we compute from P(X, Y)? (A short NumPy sketch follows this list.)
- Marginals: P(x) = Σ_y P(x, y)
- Conditionals: P(y | x) = P(x, y) / P(x)
- Expected values: E[g(X, Y)] = Σ_{x,y} g(x, y) P(x, y)
- Covariance: Cov(X, Y) = E[XY] - E[X] E[Y]
- Independence check: does P(x, y) = P(x) P(y) hold for all x, y?
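A short NumPy sketch of the expectation, covariance, and independence items, using a hypothetical 3x3 joint table (marginals and conditionals get their own sketches in the sections that follow):

```python
import numpy as np

# A hypothetical 3x3 joint table P(X, Y); X and Y each take the values 0, 1, 2.
pxy = np.array([[0.10, 0.05, 0.05],
                [0.05, 0.20, 0.10],
                [0.05, 0.10, 0.30]])
assert np.isclose(pxy.sum(), 1.0)          # normalization check

p_x = pxy.sum(axis=1)                      # marginal P(X) (sum rule)
p_y = pxy.sum(axis=0)                      # marginal P(Y)

vals = np.arange(3)                        # the values 0, 1, 2
e_x  = np.sum(vals * p_x)                  # E[X]
e_y  = np.sum(vals * p_y)                  # E[Y]
e_xy = np.sum(np.outer(vals, vals) * pxy)  # E[XY]
cov  = e_xy - e_x * e_y                    # Cov(X, Y) = E[XY] - E[X]E[Y]

# Independence check: does P(x, y) = P(x) P(y) hold in every cell?
independent = np.allclose(pxy, np.outer(p_x, p_y))
print(cov, independent)                    # positive covariance, not independent
```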
Marginal Distribution P(X)
Marginalization means "summing out" the variables you do not care about. The result is written in the "margins" of a table.
The Sum Rule
To find P(Young), add P(Young, Latte) + P(Young, Espresso). In general, P(x) = Σ_y P(x, y).
Why is it called "marginal"?
In old-school contingency tables, these sums were written in the margins of the paper. The term stuck.
Interactive: Marginalization
Step through to see how summing rows or columns gives us marginal distributions.
The Joint Distribution P(X, Y)
This 3x3 grid contains all the information: each cell is the probability of a specific (X, Y) pair occurring.
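The same row-and-column summing can be done in one line each with NumPy. A minimal sketch, reusing the same kind of hypothetical 3x3 joint table as above:

```python
import numpy as np

# Summing rows or columns of the joint grid gives the marginals.
# The values here are hypothetical, not the widget's actual numbers.
pxy = np.array([[0.10, 0.05, 0.05],
                [0.05, 0.20, 0.10],
                [0.05, 0.10, 0.30]])

p_x = pxy.sum(axis=1)   # sum across each row:    P(X), one entry per row
p_y = pxy.sum(axis=0)   # sum down each column:   P(Y), one entry per column
print(p_x, p_x.sum())   # [0.2 0.35 0.45], sums to 1
print(p_y, p_y.sum())   # [0.2 0.35 0.45], sums to 1
```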
Conditional Distribution P(Y|X)
Conditioning is "slicing" the joint distribution and renormalizing so probabilities sum to 1 in that slice.
P(Y = y | X = x) = P(X = x, Y = y) / P(X = x): the joint probability divided by the marginal probability.
Example: P(Latte | Young) = P(Young, Latte) / P(Young).
Why normalize?
So that Σ_y P(y | x) = 1: the probabilities in a slice must sum to 1.
For details on conditional probability, see Conditional Probability.
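A minimal slice-and-normalize sketch, reusing a hypothetical 3x3 joint table like the one above. Note that the raw slice sums to the marginal P(X = x), not to 1, which is exactly why we renormalize:

```python
import numpy as np

# Rows index X, columns index Y (hypothetical values).
pxy = np.array([[0.10, 0.05, 0.05],
                [0.05, 0.20, 0.10],
                [0.05, 0.10, 0.30]])

x = 1                                  # condition on X = 1
row = pxy[x]                           # un-normalized slice: P(X=1, Y=y) for each y
p_y_given_x = row / row.sum()          # divide by the marginal P(X=1)

print(row.sum())                       # 0.35 -- the marginal P(X=1), not yet a distribution
print(p_y_given_x)                     # [0.143 0.571 0.286] (approx.)
print(p_y_given_x.sum())               # 1.0 -- the renormalized slice is a distribution
```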
Product Rule & Chain Rule
Rearranging the conditional definition gives us the Product Rule:
Joint = Conditional × Marginal: P(X, Y) = P(Y | X) P(X).
Generalized to n variables, this becomes the Chain Rule: P(x₁, x₂, …, xₙ) = P(x₁) P(x₂ | x₁) P(x₃ | x₁, x₂) … P(xₙ | x₁, …, xₙ₋₁).
GPT Connection
Autoregressive language models like GPT compute P(next word | all previous words) using the chain rule. Each token is conditioned on everything before it.
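A toy sketch of that chain-rule factorization: the bigram table `cond` below is a made-up stand-in for a real model's conditional (an actual autoregressive LM conditions on the entire prefix, not just the previous token), but the accumulation of log-conditionals is exactly the chain rule.

```python
from math import log

# Made-up conditional probabilities P(next | previous token); each row sums to 1.
cond = {
    "<s>": {"the": 0.7, "cat": 0.2, "sat": 0.1},
    "the": {"the": 0.1, "cat": 0.6, "sat": 0.3},
    "cat": {"the": 0.2, "cat": 0.1, "sat": 0.7},
    "sat": {"the": 0.5, "cat": 0.3, "sat": 0.2},
}

def sequence_logprob(tokens):
    """Chain rule: log P(w1..wn) = sum_i log P(w_i | prefix)."""
    logp, prev = 0.0, "<s>"
    for tok in tokens:
        logp += log(cond[prev][tok])   # one conditional factor per token
        prev = tok
    return logp

print(sequence_logprob(["the", "cat", "sat"]))  # log(0.7) + log(0.6) + log(0.7)
```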
ML Applications
Discriminative Models
Model the conditional P(y | x) directly.
Examples: Logistic Regression, Neural Networks, SVMs
Generative Models
Model the joint P(x, y).
Examples: Naive Bayes, GANs, VAEs, Diffusion Models
Bayesian Inference
Posterior = (Likelihood × Prior) / Evidence: P(θ | D) = P(D | θ) P(θ) / P(D).
The denominator P(D) = ∫ P(D | θ) P(θ) dθ is a marginal, and in general an intractable integral.
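A small grid-based sketch of Bayes' rule for a coin's bias θ: on a discrete grid the evidence becomes a plain sum, so nothing here is intractable. The data (7 heads in 10 flips) and the uniform prior are illustrative choices.

```python
import numpy as np

theta = np.linspace(0.01, 0.99, 99)          # candidate parameter values
prior = np.ones_like(theta) / len(theta)     # uniform prior P(theta)
likelihood = theta**7 * (1 - theta)**3       # P(data | theta): 7 heads, 3 tails
evidence = np.sum(likelihood * prior)        # P(data): marginalize out theta
posterior = likelihood * prior / evidence    # Bayes' rule

print(posterior.sum())              # ~1.0 -- dividing by the evidence normalizes
print(theta[np.argmax(posterior)])  # ~0.7, the most probable bias
```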
Latent Variable Models
Marginalize over hidden variables z: P(x) = Σ_z P(x, z), or ∫ P(x, z) dz for continuous z.
Examples: VAEs, Mixture Models, HMMs
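A minimal sketch of marginalizing a latent variable: here z is the hidden component of a two-component Gaussian mixture, and the weights, means, and standard deviations are made-up illustration values.

```python
import numpy as np

# Marginal density of a two-component Gaussian mixture (illustrative parameters):
# p(x) = sum_z P(z) * p(x | z), where z is the hidden component label.
weights = np.array([0.3, 0.7])      # P(z)
means   = np.array([-2.0, 1.5])     # mean of p(x | z)
stds    = np.array([0.5, 1.0])      # std of p(x | z)

def gaussian_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def marginal_density(x):
    """Sum out the latent z: we never observe which component generated x."""
    return np.sum(weights * gaussian_pdf(x, means, stds))

print(marginal_density(0.0))   # p(x = 0), computed without knowing z
```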