Introduction
All the optimizers we've discussed (Momentum, Adam, etc.) are variations on one core algorithm: Gradient Descent.
But there's a fundamental choice that comes before everything else: How much data do you use to compute each gradient?
The Batch Size Choice
This choice, the batch size, affects everything: convergence speed, generalization, memory usage, and whether training works at all. It's one of the most important hyperparameters in deep learning.
Batch Gradient Descent
In classical (Batch) Gradient Descent, we compute the gradient using the entire dataset:

$$\theta \leftarrow \theta - \eta \,\nabla_\theta \frac{1}{N}\sum_{i=1}^{N} \ell\big(f(x_i;\theta),\, y_i\big)$$
Advantages
- Gradient is accurate (no sampling noise)
- Stable, predictable convergence
- Easy to analyze theoretically
Disadvantages
- Extremely slow (one step = full dataset pass)
- Requires all data in memory
- Can't escape local minima
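To make the full-batch update concrete, here is a minimal NumPy sketch of Batch GD on a synthetic least-squares problem; the data, model, learning rate, and step count are all invented for illustration.

```python
import numpy as np

# Toy least-squares problem: y = X @ w_true + noise (synthetic data for illustration)
rng = np.random.default_rng(0)
N, D = 1_000, 5
X = rng.normal(size=(N, D))
w_true = rng.normal(size=D)
y = X @ w_true + 0.1 * rng.normal(size=N)

def full_batch_gradient(w):
    # Gradient of the mean squared error over the ENTIRE dataset
    residual = X @ w - y                 # shape (N,)
    return 2.0 * X.T @ residual / N      # shape (D,)

w = np.zeros(D)
lr = 0.1
for step in range(200):                  # one step = one full pass over the data
    w -= lr * full_batch_gradient(w)

print("distance to true weights:", np.linalg.norm(w - w_true))
```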
Batch GD Failure Case
With deterministic gradients, following steepest descent from the left of a saddle point leads straight into a local minimum, and there is no noise to escape it.
Batch GD is impractical for modern deep learning. Training GPT on the full internet for one gradient update would take years.
Stochastic Gradient Descent (SGD)
At the opposite extreme, Stochastic GD uses a single random sample:

$$\theta \leftarrow \theta - \eta \,\nabla_\theta\, \ell\big(f(x_i;\theta),\, y_i\big)$$

where $(x_i, y_i)$ is a single randomly chosen sample.
Why Stochasticity Helps
The gradient from one sample is a noisy approximation of the true gradient. But this noise is not always bad:
- Acts like exploration in the loss landscape
- Helps escape shallow local minima and saddle points
- Serves as implicit regularization
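For comparison, here is the same kind of toy problem updated one random sample at a time; the per-sample gradient is a noisy but unbiased estimate of the full gradient (again, every number here is illustrative).

```python
import numpy as np

# Same synthetic least-squares setup as the Batch GD sketch above
rng = np.random.default_rng(0)
N, D = 1_000, 5
X = rng.normal(size=(N, D))
w_true = rng.normal(size=D)
y = X @ w_true + 0.1 * rng.normal(size=N)

w = np.zeros(D)
lr = 0.01
for step in range(5_000):
    i = rng.integers(N)                  # pick one random sample per step
    residual = X[i] @ w - y[i]           # scalar error on that sample
    grad_i = 2.0 * residual * X[i]       # noisy estimate of the true gradient
    w -= lr * grad_i

print("distance to true weights:", np.linalg.norm(w - w_true))
```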
Mini-Batch SGD: The Sweet Spot
In practice, we use mini-batches: small random subsets of the data.

$$\theta \leftarrow \theta - \frac{\eta}{|\mathcal{B}|} \sum_{i \in \mathcal{B}} \nabla_\theta\, \ell\big(f(x_i;\theta),\, y_i\big)$$

where $\mathcal{B}$ is a mini-batch of size 32, 64, 128, 256, etc.
Variance Reduction
Averaging over $B$ samples reduces the gradient's standard deviation by a factor of $\sqrt{B}$ (its variance by a factor of $B$).
Hardware Efficiency
GPUs are optimized for matrix ops. A batch of 64 is nearly as fast as a batch of 1.
Memory Constraint
Batch must fit in GPU memory. Larger batches = more memory usage.
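Putting this together, a minimal mini-batch training loop might look like the PyTorch sketch below; the linear model, random tensors, and hyperparameters are placeholders. Setting batch_size=1 recovers pure SGD, and batch_size=len(dataset) recovers Batch GD.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

# Placeholder data and model, just to make the loop runnable
X = torch.randn(1_000, 20)
y = torch.randn(1_000, 1)
loader = DataLoader(TensorDataset(X, y), batch_size=64, shuffle=True)  # mini-batches of 64

model = nn.Linear(20, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.MSELoss()

for epoch in range(5):
    for xb, yb in loader:                 # each xb/yb is one mini-batch
        loss = loss_fn(model(xb), yb)     # loss averaged over the batch
        optimizer.zero_grad()
        loss.backward()                   # gradient of the batch-averaged loss
        optimizer.step()                  # theta <- theta - lr * grad
```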
Interactive: Batch Size Effect
Watch how batch size affects the optimization path. Smaller batches = more noise = more exploration.
Medium batches offer a trade-off: enough noise to explore, but stable enough to converge efficiently.
Interactive: Gradient Noise
See how batch size reduces gradient variance. The Central Limit Theorem in action!
Key Insight
Increasing the batch size $B$ reduces gradient noise (the standard deviation of the gradient estimate) by a factor of $1/\sqrt{B}$.
To use large batches effectively, you often need to scale the Learning Rate up, either linearly with the batch size or with its square root, to compensate for the reduced gradient noise.
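A quick, purely illustrative NumPy experiment can check the $1/\sqrt{B}$ scaling and apply the linear scaling heuristic; the "true" gradient, noise level, and batch sizes below are made up.

```python
import numpy as np

rng = np.random.default_rng(0)
true_grad = 1.0
per_sample_std = 2.0

for B in [1, 8, 64, 512]:
    # Each mini-batch gradient is the mean of B noisy per-sample gradients
    batch_grads = rng.normal(true_grad, per_sample_std, size=(10_000, B)).mean(axis=1)
    print(f"B={B:4d}  empirical std={batch_grads.std():.3f}  "
          f"predicted std={per_sample_std / np.sqrt(B):.3f}")

# Linear scaling heuristic: if the batch size grows k-fold, grow the
# learning rate k-fold as well (or by sqrt(k) as a gentler variant).
base_lr, base_batch = 0.1, 64
new_batch = 512
print("scaled lr:", base_lr * new_batch / base_batch)
```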
Batch Size Tradeoffs
| Aspect | Small Batch (32) | Large Batch (2048+) |
|---|---|---|
| Gradient Noise | High | Low |
| Exploration | Good (escapes local minima) | Poor (gets stuck) |
| Generalization | Often better | Can be worse |
| Training Speed | Slower (more steps) | Faster (fewer steps) |
| GPU Utilization | Under-utilized | Fully utilized |
| Memory | Low | High |
The Large Batch Problem
Keskar et al. (2017) showed that large batch training finds sharp minima that generalize poorly. The noise in small batches acts as implicit regularization, pushing toward flat minima with better generalization.
Learning Rate Schedules
The optimal learning rate changes during training. Start high for fast progress, then decrease to refine the solution. Common schedules:
Step Decay
Reduce LR by a factor (e.g., 10×) at fixed epochs.
epochs [30, 60, 90]: LR = 0.1 → 0.01 → 0.001
Cosine Annealing
Smoothly decay LR following a cosine curve.
One Cycle Policy
LR increases from small to max, then decreases. Popularized by fast.ai. Often achieves same accuracy in fewer epochs.
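All three schedules can be written as simple functions of training progress. The sketch below uses made-up hyperparameters, and the one-cycle curve is a simplified linear version of the policy rather than fast.ai's exact implementation.

```python
import math

def step_decay(epoch, lr0=0.1, drops=(30, 60, 90), factor=0.1):
    # Multiply the LR by `factor` at each drop epoch: 0.1 -> 0.01 -> 0.001 -> ...
    return lr0 * factor ** sum(epoch >= d for d in drops)

def cosine_annealing(epoch, total_epochs, lr_max=0.1, lr_min=0.0):
    # Smooth decay from lr_max to lr_min along a half cosine
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * epoch / total_epochs))

def one_cycle(step, total_steps, lr_max=0.1, pct_warmup=0.3):
    # Linear ramp up to lr_max, then linear ramp back down (simplified one-cycle)
    peak = int(pct_warmup * total_steps)
    if step < peak:
        return lr_max * step / peak
    return lr_max * (total_steps - step) / (total_steps - peak)

# Compare the shapes at a few points in training
for epoch in (0, 30, 60, 90):
    print(epoch, step_decay(epoch), round(cosine_annealing(epoch, 100), 4))
```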
Interactive: Learning Rate Schedules
Compare different LR schedules. Each has different convergence characteristics.
A cosine schedule decays smoothly and often reaches better optima by spending more time at a high LR initially before fine-tuning, and it needs only one extra hyperparameter (the total number of epochs).
Linear Warmup
Starting with a large learning rate can destabilize training. Warmup gradually increases the LR over the first few epochs.
At Initialization
Weights are random. Gradients are unreliable. Large steps cause divergence.
After Warmup
Model has learned structure. Gradients are meaningful. Can use full learning rate.
Warmup for Large Batch Training
When using large batches (to fill GPU memory), warmup becomes essential. The rule of thumb: warmup steps = 5-10% of total training steps. This lets Adam's moment estimates stabilize.
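One common way to implement linear warmup is as a multiplier on the base learning rate, which composes with any schedule. The sketch below uses PyTorch's LambdaLR with a placeholder model and arbitrary step counts, following the ~5% rule of thumb.

```python
import torch

model = torch.nn.Linear(20, 1)                      # placeholder model
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)

total_steps = 10_000                                # arbitrary for illustration
warmup_steps = int(0.05 * total_steps)              # ~5% of training

def lr_lambda(step):
    if step < warmup_steps:
        return (step + 1) / warmup_steps            # linear ramp from ~0 up to the base LR
    return 1.0                                      # full base LR after warmup

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

for step in range(total_steps):
    # ... forward pass, loss.backward(), optimizer.step() would go here ...
    scheduler.step()                                # update the LR every step
```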
Practical Tuning Guide
Step 1: Start with Defaults
- Optimizer: Adam or AdamW
- Learning rate: 3e-4 (the "Karpathy constant")
- Batch size: As large as GPU memory allows (64-256 typical)
- Warmup: 5% of total steps
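Collected in code, those defaults might look roughly like this (PyTorch, with a placeholder model, random data, and an arbitrary epoch count):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.randn(10_000, 20), torch.randn(10_000, 1))  # placeholder data
model = torch.nn.Linear(20, 1)                                            # placeholder model

loader = DataLoader(dataset, batch_size=128, shuffle=True)    # as large as memory allows
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)    # the "Karpathy constant"

total_steps = len(loader) * 50                                # e.g. 50 epochs
warmup_steps = int(0.05 * total_steps)                        # 5% warmup
```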
Step 2: Learning Rate Search
Run a learning rate range test (Leslie Smith, 2017):
- Start with tiny LR (1e-7)
- Increase exponentially each step
- Plot loss vs LR
- Use LR where loss decreases fastest
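A bare-bones version of the range test might look like this; the model, data, and LR bounds are placeholders, and the "loss blew up" threshold is an arbitrary choice.

```python
import math
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

# Placeholder model and data
model = nn.Linear(20, 1)
loader = DataLoader(TensorDataset(torch.randn(4_096, 20), torch.randn(4_096, 1)),
                    batch_size=64, shuffle=True)
loss_fn = nn.MSELoss()

lr_min, lr_max, num_steps = 1e-7, 1.0, 200
gamma = (lr_max / lr_min) ** (1 / num_steps)        # multiplicative LR increase per step

optimizer = torch.optim.SGD(model.parameters(), lr=lr_min)
history = []                                        # (lr, loss) pairs to plot afterwards

data_iter = iter(loader)
for step in range(num_steps):
    try:
        xb, yb = next(data_iter)
    except StopIteration:                           # restart the loader if we run out
        data_iter = iter(loader)
        xb, yb = next(data_iter)

    lr = lr_min * gamma ** step
    for group in optimizer.param_groups:            # set the LR for this step
        group["lr"] = lr

    loss = loss_fn(model(xb), yb)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    history.append((lr, loss.item()))
    if not math.isfinite(loss.item()) or loss.item() > 10 * history[0][1]:
        break                                       # stop once the loss blows up
```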
Step 3: Tune Batch Size
- If training is unstable: Reduce batch size or add warmup
- If validation loss plateaus early: Reduce batch size (more noise helps generalization)
- If training is too slow: Increase batch size (if memory allows)
Common Failure Modes
- Loss explodes: LR too high. Reduce by 10×.
- Loss stuck: LR too low, or bad initialization. Try larger LR.
- Training loss good, validation loss bad: Overfitting. Add regularization, reduce model size, or use a smaller batch.