Introduction
All the optimizers we've discussed (Momentum, Adam, etc.) are variations on one core algorithm: Gradient Descent.
But there's a fundamental choice that comes before everything else: How much data do you use to compute each gradient?
The Batch Size Choice
This choice, the batch size, affects everything: convergence speed, generalization, memory usage, and whether training works at all. It's one of the most important hyperparameters in deep learning.
Batch Gradient Descent
In classical (Batch) Gradient Descent, we compute the gradient using the entire dataset:

$$\theta \leftarrow \theta - \eta \,\nabla_\theta \frac{1}{N}\sum_{i=1}^{N} \ell\big(f(x_i;\theta),\, y_i\big)$$
Advantages
- Gradient is accurate (no sampling noise)
- Stable, predictable convergence
- Easy to analyze theoretically
Disadvantages
- Extremely slow (one step = full dataset pass)
- Requires all data in memory
- Can't escape local minima
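To make the full-batch update concrete, here is a minimal NumPy sketch of Batch GD on a synthetic least-squares problem; the data, model, learning rate, and step count are all invented for illustration.

```python
import numpy as np

# Toy least-squares problem: y = X @ w_true + noise (synthetic data for illustration)
rng = np.random.default_rng(0)
N, D = 1_000, 5
X = rng.normal(size=(N, D))
w_true = rng.normal(size=D)
y = X @ w_true + 0.1 * rng.normal(size=N)

def full_batch_gradient(w):
    # Gradient of the mean squared error over the ENTIRE dataset
    residual = X @ w - y                 # shape (N,)
    return 2.0 * X.T @ residual / N      # shape (D,)

w = np.zeros(D)
lr = 0.1
for step in range(200):                  # one step = one full pass over the data
    w -= lr * full_batch_gradient(w)

print("distance to true weights:", np.linalg.norm(w - w_true))
```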
Batch GD Failure Case
With deterministic gradients, following steepest descent from the left of a saddle point leads straight into a local minimum, and there is no noise to escape it.
Batch GD is impractical for modern deep learning. Training GPT on the full internet for one gradient update would take years.
Stochastic Gradient Descent (SGD)
At the opposite extreme, Stochastic GD uses a single random sample:

$$\theta \leftarrow \theta - \eta \,\nabla_\theta\, \ell\big(f(x_i;\theta),\, y_i\big)$$

where $(x_i, y_i)$ is a single randomly chosen sample.
Why Stochasticity Helps
The gradient from one sample is a noisy approximation of the true gradient. But this noise is not always bad:
- Acts like exploration in the loss landscape
- Helps escape shallow local minima and saddle points
- Serves as implicit regularization
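For comparison, here is the same kind of toy problem updated one random sample at a time; the per-sample gradient is a noisy but unbiased estimate of the full gradient (again, every number here is illustrative).

```python
import numpy as np

# Same synthetic least-squares setup as the Batch GD sketch above
rng = np.random.default_rng(0)
N, D = 1_000, 5
X = rng.normal(size=(N, D))
w_true = rng.normal(size=D)
y = X @ w_true + 0.1 * rng.normal(size=N)

w = np.zeros(D)
lr = 0.01
for step in range(5_000):
    i = rng.integers(N)                  # pick one random sample per step
    residual = X[i] @ w - y[i]           # scalar error on that sample
    grad_i = 2.0 * residual * X[i]       # noisy estimate of the true gradient
    w -= lr * grad_i

print("distance to true weights:", np.linalg.norm(w - w_true))
```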
Mini-Batch SGD: The Sweet Spot
In practice, we use mini-batches: small random subsets of the data.

$$\theta \leftarrow \theta - \frac{\eta}{|\mathcal{B}|} \sum_{i \in \mathcal{B}} \nabla_\theta\, \ell\big(f(x_i;\theta),\, y_i\big)$$

where $\mathcal{B}$ is a mini-batch of size 32, 64, 128, 256, etc.
Variance Reduction
Averaging over $B$ samples reduces the gradient's standard deviation by a factor of $\sqrt{B}$ (its variance by a factor of $B$).
Hardware Efficiency
GPUs are optimized for matrix ops. A batch of 64 is nearly as fast as a batch of 1.
Memory Constraint
Batch must fit in GPU memory. Larger batches = more memory usage.
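Putting this together, a minimal mini-batch training loop might look like the PyTorch sketch below; the linear model, random tensors, and hyperparameters are placeholders. Setting batch_size=1 recovers pure SGD, and batch_size=len(dataset) recovers Batch GD.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

# Placeholder data and model, just to make the loop runnable
X = torch.randn(1_000, 20)
y = torch.randn(1_000, 1)
loader = DataLoader(TensorDataset(X, y), batch_size=64, shuffle=True)  # mini-batches of 64

model = nn.Linear(20, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.MSELoss()

for epoch in range(5):
    for xb, yb in loader:                 # each xb/yb is one mini-batch
        loss = loss_fn(model(xb), yb)     # loss averaged over the batch
        optimizer.zero_grad()
        loss.backward()                   # gradient of the batch-averaged loss
        optimizer.step()                  # theta <- theta - lr * grad
```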
Interactive: Batch Size Effect
Watch how batch size affects the optimization path. Smaller batches = more noise = more exploration.
Medium batches offer a trade-off: enough noise to explore, but stable enough to converge efficiently.
Interactive: Gradient Noise
See how batch size reduces gradient variance. The Central Limit Theorem in action!
Key Insight
Increasing the batch size $B$ reduces gradient noise (the standard deviation of the gradient estimate) by a factor of $1/\sqrt{B}$.
To use large batches effectively, you often need to scale the Learning Rate up, either linearly with the batch size or with its square root, to compensate for the reduced gradient noise.
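A quick, purely illustrative NumPy experiment can check the $1/\sqrt{B}$ scaling and apply the linear scaling heuristic; the "true" gradient, noise level, and batch sizes below are made up.

```python
import numpy as np

rng = np.random.default_rng(0)
true_grad = 1.0
per_sample_std = 2.0

for B in [1, 8, 64, 512]:
    # Each mini-batch gradient is the mean of B noisy per-sample gradients
    batch_grads = rng.normal(true_grad, per_sample_std, size=(10_000, B)).mean(axis=1)
    print(f"B={B:4d}  empirical std={batch_grads.std():.3f}  "
          f"predicted std={per_sample_std / np.sqrt(B):.3f}")

# Linear scaling heuristic: if the batch size grows k-fold, grow the
# learning rate k-fold as well (or by sqrt(k) as a gentler variant).
base_lr, base_batch = 0.1, 64
new_batch = 512
print("scaled lr:", base_lr * new_batch / base_batch)
```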
Batch Size Tradeoffs
| Aspect | Small Batch (32) | Large Batch (2048+) |
|---|---|---|
| Gradient Noise | High | Low |
| Exploration | Good (escapes local minima) | Poor (gets stuck) |
| Generalization | Often better | Can be worse |
| Training Speed | Slower (more steps) | Faster (fewer steps) |
| GPU Utilization | Under-utilized | Fully utilized |
| Memory | Low | High |
The Large Batch Problem
Keskar et al. (2017) showed that large batch training finds sharp minima that generalize poorly. The noise in small batches acts as implicit regularization, pushing toward flat minima with better generalization.
Learning Rate Schedules
The optimal learning rate changes during training. Start high for fast progress, then decrease to refine the solution. Common schedules:
Step Decay
Reduce LR by a factor (e.g., 10×) at fixed epochs.
epochs [30, 60, 90]: LR = 0.1 → 0.01 → 0.001
Cosine Annealing
Smoothly decay LR following a cosine curve.
One Cycle Policy
LR increases from small to max, then decreases. Popularized by fast.ai. Often achieves same accuracy in fewer epochs.
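All three schedules can be written as simple functions of training progress. The sketch below uses made-up hyperparameters, and the one-cycle curve is a simplified linear version of the policy rather than fast.ai's exact implementation.

```python
import math

def step_decay(epoch, lr0=0.1, drops=(30, 60, 90), factor=0.1):
    # Multiply the LR by `factor` at each drop epoch: 0.1 -> 0.01 -> 0.001 -> ...
    return lr0 * factor ** sum(epoch >= d for d in drops)

def cosine_annealing(epoch, total_epochs, lr_max=0.1, lr_min=0.0):
    # Smooth decay from lr_max to lr_min along a half cosine
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * epoch / total_epochs))

def one_cycle(step, total_steps, lr_max=0.1, pct_warmup=0.3):
    # Linear ramp up to lr_max, then linear ramp back down (simplified one-cycle)
    peak = int(pct_warmup * total_steps)
    if step < peak:
        return lr_max * step / peak
    return lr_max * (total_steps - step) / (total_steps - peak)

# Compare the shapes at a few points in training
for epoch in (0, 30, 60, 90):
    print(epoch, step_decay(epoch), round(cosine_annealing(epoch, 100), 4))
```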
Interactive: Learning Rate Schedules
Compare different LR schedules. Each has different convergence characteristics.
A cosine schedule decays smoothly and often reaches better optima by spending more time at a high LR initially before fine-tuning, and it needs only one extra hyperparameter (the total number of epochs).
Linear Warmup
Starting with a large learning rate can destabilize training. Warmup gradually increases the LR over the first few epochs.
At Initialization
Weights are random. Gradients are unreliable. Large steps cause divergence.
After Warmup
Model has learned structure. Gradients are meaningful. Can use full learning rate.
Warmup for Large Batch Training
When using large batches (to fill GPU memory), warmup becomes essential. The rule of thumb: warmup steps = 5-10% of total training steps. This lets Adam's moment estimates stabilize.
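One common way to implement linear warmup is as a multiplier on the base learning rate, which composes with any schedule. The sketch below uses PyTorch's LambdaLR with a placeholder model and arbitrary step counts, following the ~5% rule of thumb.

```python
import torch

model = torch.nn.Linear(20, 1)                      # placeholder model
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)

total_steps = 10_000                                # arbitrary for illustration
warmup_steps = int(0.05 * total_steps)              # ~5% of training

def lr_lambda(step):
    if step < warmup_steps:
        return (step + 1) / warmup_steps            # linear ramp from ~0 up to the base LR
    return 1.0                                      # full base LR after warmup

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

for step in range(total_steps):
    # ... forward pass, loss.backward(), optimizer.step() would go here ...
    scheduler.step()                                # update the LR every step
```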
Practical Tuning Guide
Step 1: Start with Defaults
- Optimizer: Adam or AdamW
- Learning rate: 3e-4 (the "Karpathy constant")
- Batch size: As large as GPU memory allows (64-256 typical)
- Warmup: 5% of total steps
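Collected in code, those defaults might look roughly like this (PyTorch, with a placeholder model, random data, and an arbitrary epoch count):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.randn(10_000, 20), torch.randn(10_000, 1))  # placeholder data
model = torch.nn.Linear(20, 1)                                            # placeholder model

loader = DataLoader(dataset, batch_size=128, shuffle=True)    # as large as memory allows
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)    # the "Karpathy constant"

total_steps = len(loader) * 50                                # e.g. 50 epochs
warmup_steps = int(0.05 * total_steps)                        # 5% warmup
```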
Step 2: Learning Rate Search
Run a learning rate range test (Leslie Smith, 2017):
- Start with tiny LR (1e-7)
- Increase exponentially each step
- Plot loss vs LR
- Use LR where loss decreases fastest
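A bare-bones version of the range test might look like this; the model, data, and LR bounds are placeholders, and the "loss blew up" threshold is an arbitrary choice.

```python
import math
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

# Placeholder model and data
model = nn.Linear(20, 1)
loader = DataLoader(TensorDataset(torch.randn(4_096, 20), torch.randn(4_096, 1)),
                    batch_size=64, shuffle=True)
loss_fn = nn.MSELoss()

lr_min, lr_max, num_steps = 1e-7, 1.0, 200
gamma = (lr_max / lr_min) ** (1 / num_steps)        # multiplicative LR increase per step

optimizer = torch.optim.SGD(model.parameters(), lr=lr_min)
history = []                                        # (lr, loss) pairs to plot afterwards

data_iter = iter(loader)
for step in range(num_steps):
    try:
        xb, yb = next(data_iter)
    except StopIteration:                           # restart the loader if we run out
        data_iter = iter(loader)
        xb, yb = next(data_iter)

    lr = lr_min * gamma ** step
    for group in optimizer.param_groups:            # set the LR for this step
        group["lr"] = lr

    loss = loss_fn(model(xb), yb)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    history.append((lr, loss.item()))
    if not math.isfinite(loss.item()) or loss.item() > 10 * history[0][1]:
        break                                       # stop once the loss blows up
```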
Step 3: Tune Batch Size
- If training is unstable: Reduce batch size or add warmup
- If validation loss plateaus early: Reduce batch size (more noise helps generalization)
- If training is too slow: Increase batch size (if memory allows)
Common Failure Modes
- Loss explodes: LR too high. Reduce by 10×.
- Loss stuck: LR too low, or bad initialization. Try larger LR.
- Training loss good, validation loss bad: Overfitting. Add regularization, reduce model size, or use a smaller batch.