Regularization (L1, L2, Dropout)

The art of making models "dumber" to make them smarter on new data. Fighting overfitting with constraints.

Introduction

A deep neural network with millions of parameters is a "universal function approximator." It can memorize every single pixel of your training data. With a powerful enough model and cross-entropy loss, the model fits the training set perfectly.

This is a problem. When a model memorizes training data instead of learning general patterns, it fails catastrophically on new, unseen data. This is overfitting, and understanding loss landscapes helps explain why it happens.

Regularization intentionally constrains the model's capacity. We force it to be "simpler," which paradoxically makes it generalize better. It's like training with handicaps so you perform better in the real competition.

Bias-Variance Tradeoff

Every ML model's error can be decomposed into three parts:

\text{Error} = \text{Bias}^2 + \text{Variance} + \text{Irreducible Noise}

High Bias (Underfitting)

Model is too simple. Can't capture the true pattern.

Example: Fitting a line to parabolic data.

High Variance (Overfitting)

Model is too complex. Fits noise in training data.

Example: Fitting a degree-20 polynomial through 10 points.

Regularization increases bias slightly but dramatically reduces variance. The tradeoff is usually worth it.
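
To make the tradeoff concrete, here is a minimal NumPy sketch with hypothetical data: polynomials of increasing degree are fit to noisy parabolic observations. The underfit model (degree 1) has high error everywhere, while the overfit model (degree 9 on 12 points) typically shows very low training error but worse test error.

```python
import numpy as np

rng = np.random.default_rng(0)

def true_fn(x):
    # The underlying pattern is a parabola; observations add irreducible noise.
    return 0.5 * x**2 - x + 2

x_train = np.sort(rng.uniform(-3, 3, 12))
y_train = true_fn(x_train) + rng.normal(0, 0.5, x_train.size)
x_test = np.linspace(-3, 3, 200)
y_test = true_fn(x_test) + rng.normal(0, 0.5, x_test.size)

for degree in (1, 2, 9):  # underfit, about right, overfit
    coeffs = np.polyfit(x_train, y_train, degree)
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f"degree {degree}: train MSE {train_mse:.3f}, test MSE {test_mse:.3f}")
```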

L2 Regularization (Ridge)

L2 regularization adds the sum of squared weights to the loss function. During gradient descent, this penalizes large weights:

\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{data}} + \lambda \sum_i w_i^2

Gradient Update with L2

The gradient of the penalty is 2λw:

w_{\text{new}} = w - \eta(\nabla L + 2\lambda w)
w_{\text{new}} = w(1 - 2\eta\lambda) - \eta \nabla L

This is called Weight Decay. Every step, weights are multiplied by a factor slightly less than 1. See AdamW for why decoupling weight decay matters.
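
A quick numeric check of this equivalence (toy values; `grad_data` is just a stand-in for whatever backpropagation would produce):

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(size=5)          # a toy weight vector
grad_data = rng.normal(size=5)  # placeholder for the data-loss gradient at w

eta, lam = 0.1, 0.01            # hypothetical learning rate and L2 strength

# View 1: gradient descent on L_data + lam * sum(w**2)
w_penalty = w - eta * (grad_data + 2 * lam * w)

# View 2: "weight decay" -- shrink w toward zero, then apply the plain data gradient
w_decay = w * (1 - 2 * eta * lam) - eta * grad_data

print(np.allclose(w_penalty, w_decay))  # True: the two views are identical
```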

L2's Effect

L2 penalizes large weights quadratically. A weight of 10 is penalized 100x more than a weight of 1. Result: Weights are pushed toward zero but rarely reach exactly zero. All features contribute, just with smaller coefficients. The L2 penalty is also convex, making optimization tractable.

L1 Regularization (Lasso)

L1 regularization adds the sum of absolute weights:

\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{data}} + \lambda \sum_i |w_i|

Gradient Update with L1

The gradient of |w| is sign(w) (either +1 or −1, for w ≠ 0):

w_{\text{new}} = w - \eta(\nabla L + \lambda \cdot \text{sign}(w))

The penalty subtracts a constant from the weight magnitude at each step, regardless of the weight's current value.

L1's Effect: Sparsity

Unlike L2, L1 pushes weights all the way to exactly zero. This produces sparse models where many features are completely ignored. L1 effectively performs feature selection, similar in spirit to how Information Gain selects features in decision trees.
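
A small scikit-learn sketch of this effect, using hypothetical data where only 3 of 20 features actually matter. With these settings, the L1 model typically drives most of the 17 irrelevant coefficients to exactly zero, while the L2 model keeps every coefficient small but non-zero.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)

# 100 samples, 20 features, but only the first 3 features carry signal.
X = rng.normal(size=(100, 20))
true_w = np.zeros(20)
true_w[:3] = [3.0, -2.0, 1.5]
y = X @ true_w + rng.normal(0, 0.5, size=100)

lasso = Lasso(alpha=0.1).fit(X, y)  # L1 penalty
ridge = Ridge(alpha=0.1).fit(X, y)  # L2 penalty

print("L1 coefficients exactly zero:", int(np.sum(lasso.coef_ == 0)), "of 20")
print("L2 coefficients exactly zero:", int(np.sum(ridge.coef_ == 0)), "of 20")
```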

Geometric Intuition

The magic of L1 sparsity comes from geometry. Think of regularization as constraining weights to lie within a region. This is closely related to constrained optimization.

L1: The Diamond

|w_1| + |w_2| \le C

Sharp corners on the axes. Loss contours likely hit a corner first, zeroing one weight.

L2: The Circle

w_1^2 + w_2^2 \le C

No corners. Loss contours hit a smooth edge. Both weights stay non-zero.
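
The same intuition can be checked numerically with a brute-force grid search over a hypothetical quadratic loss: restricted to the L1 diamond, the best point lands on a corner (one weight exactly zero), while on the L2 circle both weights stay non-zero.

```python
import numpy as np

# A hypothetical quadratic loss whose unconstrained minimum lies outside both regions.
def loss(w1, w2):
    return (w1 - 2.0) ** 2 + 4.0 * (w2 - 0.2) ** 2

# Brute-force search over a grid of weight vectors inside each constraint region.
w1, w2 = np.meshgrid(np.linspace(-1, 1, 1001), np.linspace(-1, 1, 1001))
L = loss(w1, w2)
C = 1.0

for name, inside in [("L1 diamond", np.abs(w1) + np.abs(w2) <= C),
                     ("L2 circle ", w1**2 + w2**2 <= C)]:
    masked = np.where(inside, L, np.inf)       # ignore points outside the region
    i, j = np.unravel_index(np.argmin(masked), masked.shape)
    print(f"{name}: best w1 = {w1[i, j]:+.3f}, best w2 = {w2[i, j]:+.3f}")
```

For this loss, the L1 search returns the corner (1.000, 0.000), while the L2 search returns a point with both coordinates non-zero.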

Interactive: L1 vs L2 Geometry

Watch how the loss contours intersect the constraint regions. L1's corners naturally produce zeros.

[Interactive demo: L1 (Diamond) vs L2 (Circle) Geometry. Loss-contour ellipses are drawn over the two constraint regions, with the unconstrained minimum and the current w1, w2 values marked. The constraint region shape determines where the loss contours first touch.]

L2: Circle = Distributed Weights

The loss contours hit the circle at an arbitrary point on the edge. Both weights are small but non-zero. L2 shrinks weights uniformly without zeroing them out.

Elastic Net

Why choose? Elastic Net combines both:

\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{data}} + \lambda_1 \sum_i |w_i| + \lambda_2 \sum_i w_i^2

Elastic Net gets L1's feature selection while avoiding its instability when features are correlated. It's available out of the box in scikit-learn as linear_model.ElasticNet.

The Grouping Effect

L1 regularization has a known weakness: if a group of features are highly correlated (e.g., "height in cm" and "height in inches"), L1 tends to select just one at random and zero out the others. This is unstable and arbitrary.

Elastic Net's L2 term encourages the grouping effect: correlated features are retained together and assigned similar weights. This makes the model more robust and interpretable when features are redundant.
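
A toy comparison (synthetic data, hypothetical penalty strengths): two near-duplicate features carry the same signal, plus one independent feature. The exact numbers depend on the seed and solver, but Lasso tends to concentrate the shared weight on one twin while Elastic Net assigns the twins similar coefficients.

```python
import numpy as np
from sklearn.linear_model import Lasso, ElasticNet

rng = np.random.default_rng(0)
n = 200

# Two near-duplicate features (think "height in cm" vs "height in inches")
# plus one independent feature.
shared = rng.normal(size=n)
x1 = shared + rng.normal(0, 0.05, size=n)
x2 = shared + rng.normal(0, 0.05, size=n)
x3 = rng.normal(size=n)
X = np.column_stack([x1, x2, x3])
y = 2.0 * shared + 1.0 * x3 + rng.normal(0, 0.1, size=n)

lasso = Lasso(alpha=0.1).fit(X, y)
enet = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)  # half L1, half L2

# Lasso's split between the twin features is unstable and often lopsided;
# Elastic Net tends to spread the shared weight across both.
print("Lasso coefficients:      ", np.round(lasso.coef_, 3))
print("Elastic Net coefficients:", np.round(enet.coef_, 3))
```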

Bayesian Interpretation

Regularization has a beautiful Bayesian interpretation. Adding a penalty is equivalent to placing a prior belief on the weights.

L2 = Gaussian Prior

"I believe weights are normally distributed around 0."

P(w) \propto e^{-w^2}

L1 = Laplace Prior

"I believe most weights should be exactly zero."

P(w) \propto e^{-|w|}

MAP Estimation

Regularized loss is equivalent to Maximum A Posteriori (MAP) estimation, closely related to MLE. We're finding the most probable weights given both the data AND our prior beliefs. Bayes' theorem in action.
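
Sketching the standard derivation (with σ² and b denoting the prior scale parameters): taking the negative log-posterior turns each prior into exactly the corresponding penalty term.

```latex
\begin{aligned}
w_{\text{MAP}} &= \arg\max_w \, P(w \mid D)
               = \arg\max_w \, P(D \mid w)\, P(w) \\
              &= \arg\min_w \, \big[ \underbrace{-\log P(D \mid w)}_{\mathcal{L}_{\text{data}}} \;-\; \log P(w) \big] \\[4pt]
\text{Gaussian prior: } P(w_i) \propto e^{-w_i^2 / 2\sigma^2}
  \;&\Rightarrow\; -\log P(w) = \frac{1}{2\sigma^2} \sum_i w_i^2 + \text{const}
  \quad \text{(L2 with } \lambda = \tfrac{1}{2\sigma^2}\text{)} \\
\text{Laplace prior: } P(w_i) \propto e^{-|w_i| / b}
  \;&\Rightarrow\; -\log P(w) = \frac{1}{b} \sum_i |w_i| + \text{const}
  \quad \text{(L1 with } \lambda = \tfrac{1}{b}\text{)}
\end{aligned}
```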

Dropout

Dropout (Srivastava et al., 2014) is a radical regularization technique for neural networks. During training, we randomly "kill" neurons by setting their outputs to zero. It works differently from L1/L2 and complements batch normalization (which also has regularization effects).

How Dropout Works

For each training batch, each neuron has probability p of being dropped (output set to 0). Formally, we apply a mask vector r of Bernoulli random variables, where r_j = 1 means neuron j is kept, which happens with probability 1 − p:

r_j \sim \text{Bernoulli}(1 - p)
h = f(W(x \odot r) + b)

Intuition: Imagine a team where any member might call in sick randomly. The team can't rely on one "superstar" to do everything. Everyone must learn to contribute. This forces redundant, robust representations.
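
A minimal NumPy sketch of the training-time mask, using the drop-probability convention above (toy activation values):

```python
import numpy as np

rng = np.random.default_rng(0)

p = 0.5                       # drop probability, as defined above
h = rng.normal(size=(4, 8))   # a batch of 4 hidden-layer activations, 8 units each

# Keep-mask: 1 with probability 1 - p, 0 with probability p.
r = rng.binomial(1, 1 - p, size=h.shape)
h_dropped = h * r             # dropped units output exactly 0 for this batch

print(r[0])                   # which of the 8 units survived in the first example
print(h_dropped[0])
```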

Dropout Simulation

[Interactive demo: a small network with 3 inputs, three hidden layers (5, 4, and 4 neurons), and 2 outputs; a slider sets the probability that a neuron is dropped (zeroed out).]

Training: Each batch randomly drops neurons across all 3 hidden layers. The network becomes a "thinned" version of itself. This prevents neurons from relying too much on any specific peer and forces the network to learn robust, distributed representations.

Preventing Co-Adaptation

Without dropout, neurons develop complex co-dependencies: "I only need to detect X because neuron 47 detects Y." With dropout, neuron 47 might be absent, so each neuron must be individually useful. This prevents brittle, specialized features and forces the network to learn more generalizable patterns.

Inverted Dropout & Scaling

There's a subtle problem: during training, only a (1 − p) fraction of neurons is active. At test time, all neurons are active. The expected activation magnitude is different!

The Scaling Problem

With p = 0.5 dropout in a 100-neuron layer, only about 50 neurons are active during training, so the downstream signal is roughly half as strong as at test time, when all 100 neurons fire. Without a correction, test-time activations are about 2x stronger than anything the network saw during training.

Inverted Dropout (The Fix)

Instead of scaling at test time (which is annoying), we scale up activations during training:

y = \frac{1}{1-p} \cdot \text{mask} \cdot x

With p = 0.5, we multiply surviving activations by 2. This ensures the expected value of the activation remains constant between training and testing. At test time, we simply run the network normally without any scaling or masking. This is computationally efficient and standard in modern frameworks like PyTorch and TensorFlow.
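
A minimal NumPy sketch of inverted dropout (not any particular framework's implementation), showing that the mean activation is preserved between training and test modes:

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(x, p=0.5, train=True):
    """Inverted dropout: scale surviving activations by 1/(1-p) during training."""
    if not train or p == 0.0:
        return x                        # test time: no masking, no scaling
    mask = rng.binomial(1, 1 - p, size=x.shape)
    return x * mask / (1 - p)           # E[output] == x, matching test time

x = np.ones((100_000, 10))              # large toy batch so the average is stable
out = dropout(x, p=0.5, train=True)

# The expected activation is preserved: the training-mode mean stays close to 1.0.
print(out.mean())                                 # approximately 1.0
print(dropout(x, p=0.5, train=False).mean())      # exactly 1.0 at test time
```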

Dropout as Ensemble

There's a deep theoretical justification for dropout: it implicitly trains an ensemble of 2^N different networks (where N is the number of neurons). This connects to the power of bootstrap aggregation.

Each dropout mask creates a different "thinned" sub-network. By training with random masks, we're training exponentially many sub-networks simultaneously, all sharing weights. At test time, using all neurons approximates averaging the predictions of all these sub-networks.

This connects to the success of Random Forests and Bagging. Ensemble methods work because they average out individual model errors. Dropout achieves this "for free" within a single network.
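
For a single linear layer this averaging can be checked directly with a toy NumPy sketch (hypothetical weights): averaging the outputs of many randomly thinned, inverted-dropout sub-networks converges to the full network's output. With nonlinearities, the test-time network is only an approximation of the ensemble average.

```python
import numpy as np

rng = np.random.default_rng(0)

# A single linear "network": y = W @ x.
W = rng.normal(size=(3, 50))
x = rng.normal(size=50)
p = 0.5

full_output = W @ x   # test-time forward pass: all units active, no scaling

masked_outputs = []
for _ in range(20_000):                               # sample many "thinned" sub-networks
    mask = rng.binomial(1, 1 - p, size=x.shape)
    masked_outputs.append(W @ (x * mask / (1 - p)))   # inverted-dropout sub-network

ensemble_avg = np.mean(masked_outputs, axis=0)
print(np.round(full_output, 2))
print(np.round(ensemble_avg, 2))   # close to the full output; exact only in expectation
```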

Comparison Table

Technique   | Best For                          | Effect
------------|-----------------------------------|----------------------------------
L1 (Lasso)  | Feature selection, sparse models  | Drives weights to exactly zero
L2 (Ridge)  | General purpose, weight decay     | Shrinks all weights uniformly
Elastic Net | Correlated features               | L1 + L2 combined
Dropout     | Deep neural networks              | Forces redundant representations