Monte Carlo Methods

Solving deterministic problems using pure randomness.

Introduction

Many problems are theoretically solvable but computationally intractable. Calculating portfolio risk requires integrating over thousands of correlated assets. Finding the optimal Go move requires searching a tree with more states than there are atoms in the universe.

The Monte Carlo Philosophy

If you cannot calculate the answer, simulate it millions of times and average.

"If you cannot calculate the area of a lake, throw a million rocks at it. Count how many splash vs. hit ground. That ratio gives you the area."

History: The Manhattan Project

The Problem

1940s Los Alamos. Stanislaw Ulam needed to calculate the probability of neutron chain reactions. The differential equations were too complex.

The Solution

Instead of solving the equations, simulate individual neutrons moving randomly and observe the outcomes. The method was named after the casino in Monaco because, like gambling, it relies on chance; the Law of Large Numbers (LLN) makes the averaged outcomes reliable.

The Dartboard Method (Estimating Pi)

How do you calculate pi without geometry? Throw darts randomly.

Setup

  • Square with side $2r$. Area $= 4r^2$
  • Circle inscribed with radius $r$. Area $= \pi r^2$
  • Ratio $= \frac{\pi r^2}{4r^2} = \frac{\pi}{4}$

Simulation

  • Throw $N$ darts randomly at the square
  • Count how many land inside the circle
  • Ratio: $\frac{N_{\text{inside}}}{N} \approx \frac{\pi}{4}$
  • So $\pi \approx 4 \cdot \frac{N_{\text{inside}}}{N}$

More darts = better estimate. This is the Law of Large Numbers in action.
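
A minimal NumPy sketch of this estimator (the function name and dart counts are illustrative):

```python
import numpy as np

def estimate_pi(n_darts, seed=0):
    """Estimate pi by throwing n_darts at the square [-1, 1]^2."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(-1, 1, n_darts)
    y = rng.uniform(-1, 1, n_darts)
    inside = np.count_nonzero(x**2 + y**2 <= 1)  # darts inside the inscribed circle
    return 4 * inside / n_darts

for n in (100, 10_000, 1_000_000):
    print(f"N = {n:>9,}: pi ~ {estimate_pi(n):.5f}")
```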

Interactive: Estimate Pi

Throw virtual darts at a square. Watch the estimate converge to 3.14159... as you add more darts.

Convergence

[Chart: estimated π vs. number of darts thrown (0 to 10,000), converging to 3.14159...]

Monte Carlo Integration

For complex functions in high dimensions, standard numerical integration fails. Monte Carlo converts integrals into expectations.

I = \int_V f(x)\, dx \approx V \cdot \frac{1}{N} \sum_{i=1}^N f(x_i)

Replace the integral with a sample average: $V$ is the volume of the domain and the $x_i$ are random samples drawn uniformly from it.
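
A short sketch of this estimator in NumPy (the helper name mc_integrate and the test integrand are illustrative choices, not from the text):

```python
import numpy as np

def mc_integrate(f, lower, upper, n_samples=100_000, seed=0):
    """Estimate the integral of f over a box domain by sample averaging."""
    rng = np.random.default_rng(seed)
    lower, upper = np.atleast_1d(lower), np.atleast_1d(upper)
    volume = np.prod(upper - lower)  # V: volume of the domain
    x = rng.uniform(lower, upper, size=(n_samples, lower.size))  # x_i: uniform samples
    return volume * np.mean(f(x))

# 1D sanity check: the integral of sin(x) on [0, pi] is exactly 2
estimate = mc_integrate(lambda x: np.sin(x[:, 0]), 0.0, np.pi)
print(f"estimate: {estimate:.4f}")  # ~2.00
```

The same function handles ten dimensions as easily as one; only the variance of the estimate changes.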

Convergence Rate

\text{Error} \propto \frac{1}{\sqrt{N}}

To halve the error, you need $4\times$ as many samples. Convergence is slow, but the rate is independent of dimension!
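
A quick empirical check of this rate, reusing the dartboard estimator (the repeat count of 200 is arbitrary): quadrupling $N$ should roughly halve the average error.

```python
import numpy as np

rng = np.random.default_rng(0)

def pi_error(n):
    """Absolute error of one pi estimate using n darts."""
    pts = rng.uniform(-1, 1, size=(n, 2))
    inside = np.count_nonzero((pts**2).sum(axis=1) <= 1)
    return abs(4 * inside / n - np.pi)

for n in (1_000, 4_000, 16_000):
    mean_err = np.mean([pi_error(n) for _ in range(200)])
    print(f"N = {n:>6}: mean |error| = {mean_err:.4f}")  # halves as N quadruples
```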

Beating the Curse of Dimensionality

Grid-based methods suffer exponentially as dimensions increase. Monte Carlo does not.

Dimension      Grid Points (10/axis)    Monte Carlo Samples
1D (Line)      10                       ~100
3D (Cube)      1,000                    ~100
10D            10 billion               ~1,000
100D           10^100 (impossible)      ~10,000

This is why Monte Carlo is often the only practical way to solve high-dimensional problems in physics and Bayesian inference.

Markov Chain Monte Carlo (MCMC)

What if we cannot sample from the target distribution directly? MCMC generates samples from complex distributions by running a Markov chain that gravitates toward high-probability regions.

Metropolis-Hastings Algorithm

  1. Start at a random point $x_0$
  2. Propose a new point $x'$ near the current point $x$: $x' = x + \text{noise}$
  3. Calculate the acceptance ratio: $\alpha = \frac{P(x')}{P(x)}$
  4. If $\alpha \geq 1$: always move to $x'$
  5. If $\alpha < 1$: move with probability $\alpha$, otherwise stay at $x$
  6. Repeat thousands of times (see the sketch below)
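
A compact NumPy sketch of these steps, targeting an assumed unnormalized bimodal density (two Gaussian bumps, chosen to mirror the interactive below):

```python
import numpy as np

def target_density(x):
    """Unnormalized bimodal target: Gaussian bumps at -2 and +2 (illustrative)."""
    return np.exp(-0.5 * (x - 2.0) ** 2) + np.exp(-0.5 * (x + 2.0) ** 2)

def metropolis_hastings(n_steps=10_000, step_size=0.8, x0=0.0, seed=0):
    rng = np.random.default_rng(seed)
    samples = np.empty(n_steps)
    x, n_accepted = x0, 0
    for t in range(n_steps):
        x_new = x + rng.normal(scale=step_size)            # step 2: propose nearby
        alpha = target_density(x_new) / target_density(x)  # step 3: acceptance ratio
        if rng.random() < alpha:                           # steps 4-5: accept or stay
            x, n_accepted = x_new, n_accepted + 1
        samples[t] = x
    return samples, n_accepted / n_steps

samples, acceptance_rate = metropolis_hastings()
print(f"acceptance rate: {acceptance_rate:.1%}")  # tune step_size toward 20-50%
kept = samples[1_000:]  # discard burn-in before using the samples
```

Note that only the ratio of densities is ever needed, so the target never has to be normalized.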

Why it works

The chain spends more time in high-probability regions. After an initial "burn-in" period, the samples are approximately distributed according to the target.

Key parameter

Step size matters! Too small: slow exploration. Too large: many rejections. Target 20-50% acceptance.

Interactive: MCMC Visualization

Watch Metropolis-Hastings explore a bimodal distribution (two peaks). Adjust step size to see how it affects exploration.

Goldilocks Principle:
Too high (>80%): Steps are too small, exploring slowly.
Too low (<20%): Steps are too big, constantly rejected.
Target: ~20-50% for optimal mixing.

Why Bimodal?

Notice how the chain gets "stuck" in one peak for a while, then eventually makes a lucky jump across the valley to the other peak. This "mode hopping" is one of the hardest challenges in MCMC.

ML Applications

Reinforcement Learning (MC Prediction)

Estimate the state value $V(s)$ by playing many episodes and averaging the returns:

V(s) \approx \frac{1}{N} \sum_{i=1}^N G_i^{(s)}

where $G_i^{(s)}$ is the return of the $i$-th episode that passes through state $s$.

Used in AlphaGo (via Monte Carlo Tree Search) and in policy gradient methods.
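
A toy sketch of Monte Carlo prediction on a 5-state random walk (an illustrative environment, not one from the text): every visit to a state contributes that episode's return to the state's running average.

```python
import numpy as np

def run_episode(rng, n_states=5):
    """Random walk from the middle; reward 1 only if we exit off the right edge."""
    s, visited = n_states // 2, []
    while 0 <= s < n_states:
        visited.append(s)
        s += rng.choice((-1, 1))
    return visited, (1.0 if s >= n_states else 0.0)

def mc_prediction(n_episodes=50_000, n_states=5, seed=0):
    """Every-visit Monte Carlo: V(s) = average of returns observed from s."""
    rng = np.random.default_rng(seed)
    total_return = np.zeros(n_states)
    visit_count = np.zeros(n_states)
    for _ in range(n_episodes):
        visited, G = run_episode(rng, n_states)  # undiscounted return = final reward
        for s in visited:
            total_return[s] += G
            visit_count[s] += 1
    return total_return / visit_count

print(mc_prediction())  # approaches the true values [1/6, 2/6, 3/6, 4/6, 5/6]
```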

Bayesian Inference

Sample from the posterior $P(\theta|D)$ when the normalizing integral $P(D)$ is intractable:

P(\theta|D) \propto P(D|\theta)\, P(\theta)

MCMC, Hamiltonian MC, Variational Inference.
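
A tiny worked example with made-up coin-flip data (7 heads in 10 flips, uniform prior): Metropolis-Hastings only ever uses the ratio of posterior densities, so the intractable normalizer $P(D)$ cancels and is never computed.

```python
import numpy as np

heads, flips = 7, 10  # assumed observed data

def unnormalized_posterior(theta):
    """Likelihood x uniform prior; P(D) is never needed, it cancels in the ratio."""
    if not 0.0 < theta < 1.0:
        return 0.0
    return theta**heads * (1.0 - theta)**(flips - heads)

rng = np.random.default_rng(0)
theta, samples = 0.5, []
for _ in range(20_000):
    proposal = theta + rng.normal(scale=0.1)
    if rng.random() < unnormalized_posterior(proposal) / unnormalized_posterior(theta):
        theta = proposal
    samples.append(theta)

print(f"posterior mean ~ {np.mean(samples[2_000:]):.3f}")  # exact answer: 8/12 = 0.667
```

(The exact posterior here is Beta(8, 4) with mean 8/12, which makes the check easy.)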

Dropout as Monte Carlo

Running inference with dropout multiple times and averaging is equivalent to approximate Bayesian inference. Each forward pass samples a different sub-network.
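
A minimal PyTorch sketch (the architecture and pass count are arbitrary): calling model.train() keeps dropout active during inference, so each forward pass samples a different sub-network.

```python
import torch
import torch.nn as nn

# Illustrative model; any network containing dropout layers works the same way
model = nn.Sequential(nn.Linear(10, 64), nn.ReLU(), nn.Dropout(p=0.5), nn.Linear(64, 1))

def mc_dropout_predict(model, x, n_passes=100):
    """Average stochastic forward passes; the spread approximates uncertainty."""
    model.train()  # train mode keeps dropout ON at inference time
    with torch.no_grad():
        preds = torch.stack([model(x) for _ in range(n_passes)])
    return preds.mean(dim=0), preds.std(dim=0)

x = torch.randn(1, 10)
mean, std = mc_dropout_predict(model, x)
print(f"prediction: {mean.item():.3f} +/- {std.item():.3f}")
```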

Finance: Value at Risk (VaR)

Simulate thousands of market scenarios to estimate worst-case losses at a given confidence level.
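
An illustrative NumPy sketch under a deliberately simple assumption (i.i.d. normal daily returns for a single portfolio; the parameters are invented, and real VaR models use correlated multi-asset scenarios):

```python
import numpy as np

rng = np.random.default_rng(0)
portfolio_value = 1_000_000            # assumed portfolio size in dollars
mu, sigma = 0.0005, 0.02               # assumed daily mean return and volatility
n_scenarios = 100_000

simulated_returns = rng.normal(mu, sigma, n_scenarios)  # one return per scenario
losses = -portfolio_value * simulated_returns

# 99% VaR: the loss exceeded in only 1% of simulated scenarios
var_99 = np.percentile(losses, 99)
print(f"1-day 99% VaR: ${var_99:,.0f}")
```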