Regularization (L1, L2, Dropout)

The art of making models "dumber" to make them smarter on new data. Fighting overfitting with constraints.

Introduction

A deep neural network with millions of parameters is a "universal function approximator." It can memorize every single pixel of your training data. With a powerful enough model and cross-entropy loss, the model fits the training set perfectly.

This is a problem. When a model memorizes training data instead of learning general patterns, it fails catastrophically on new, unseen data. This is overfitting, and understanding loss landscapes helps explain why it happens.

Regularization intentionally constrains the model's capacity. We force it to be "simpler," which paradoxically makes it generalize better. It's like training with handicaps so you perform better in the real competition.

Bias-Variance Tradeoff

Every ML model's error can be decomposed into three parts:

\text{Error} = \text{Bias}^2 + \text{Variance} + \text{Irreducible Noise}

High Bias (Underfitting)

Model is too simple. Can't capture the true pattern.

Example: Fitting a line to parabolic data.

High Variance (Overfitting)

Model is too complex. Fits noise in training data.

Example: Fitting a degree-20 polynomial through 10 points.

Regularization increases bias slightly but dramatically reduces variance. The tradeoff is usually worth it.
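
To make the tradeoff concrete, here is a minimal NumPy sketch with hypothetical data: polynomials of increasing degree are fit to noisy parabolic observations. The underfit model (degree 1) has high error everywhere, while the overfit model (degree 9 on 12 points) typically shows very low training error but worse test error.

```python
import numpy as np

rng = np.random.default_rng(0)

def true_fn(x):
    # The underlying pattern is a parabola; observations add irreducible noise.
    return 0.5 * x**2 - x + 2

x_train = np.sort(rng.uniform(-3, 3, 12))
y_train = true_fn(x_train) + rng.normal(0, 0.5, x_train.size)
x_test = np.linspace(-3, 3, 200)
y_test = true_fn(x_test) + rng.normal(0, 0.5, x_test.size)

for degree in (1, 2, 9):  # underfit, about right, overfit
    coeffs = np.polyfit(x_train, y_train, degree)
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f"degree {degree}: train MSE {train_mse:.3f}, test MSE {test_mse:.3f}")
```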

L2 Regularization (Ridge)

L2 regularization adds the sum of squared weights to the loss function. During gradient descent, this penalizes large weights:

\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{data}} + \lambda \sum_i w_i^2

Gradient Update with L2

The gradient of the penalty is 2λw:

w_{\text{new}} = w - \eta(\nabla L + 2\lambda w)
w_{\text{new}} = w(1 - 2\eta\lambda) - \eta \nabla L

This is called Weight Decay. Every step, weights are multiplied by a factor slightly less than 1. See AdamW for why decoupling weight decay matters.
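
A quick numeric check of this equivalence (toy values; `grad_data` is just a stand-in for whatever backpropagation would produce):

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(size=5)          # a toy weight vector
grad_data = rng.normal(size=5)  # placeholder for the data-loss gradient at w

eta, lam = 0.1, 0.01            # hypothetical learning rate and L2 strength

# View 1: gradient descent on L_data + lam * sum(w**2)
w_penalty = w - eta * (grad_data + 2 * lam * w)

# View 2: "weight decay" -- shrink w toward zero, then apply the plain data gradient
w_decay = w * (1 - 2 * eta * lam) - eta * grad_data

print(np.allclose(w_penalty, w_decay))  # True: the two views are identical
```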

L2's Effect

L2 penalizes large weights quadratically. A weight of 10 is penalized 100x more than a weight of 1. Result: Weights are pushed toward zero but rarely reach exactly zero. All features contribute, just with smaller coefficients. The L2 penalty is also convex, making optimization tractable.

L1 Regularization (Lasso)

L1 regularization adds the sum of absolute weights:

\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{data}} + \lambda \sum_i |w_i|

Gradient Update with L1

The gradient of |w| is sign(w) (either +1 or −1, for w ≠ 0):

w_{\text{new}} = w - \eta(\nabla L + \lambda \cdot \text{sign}(w))

The penalty subtracts a constant from the weight magnitude at each step, regardless of the weight's current value.

L1's Effect: Sparsity

Unlike L2, L1 pushes weights all the way to exactly zero. This produces sparse models where many features are completely ignored. L1 effectively performs feature selection, similar in spirit to how Information Gain selects features in decision trees.
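
A small scikit-learn sketch of this effect, using hypothetical data where only 3 of 20 features actually matter. With these settings, the L1 model typically drives most of the 17 irrelevant coefficients to exactly zero, while the L2 model keeps every coefficient small but non-zero.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)

# 100 samples, 20 features, but only the first 3 features carry signal.
X = rng.normal(size=(100, 20))
true_w = np.zeros(20)
true_w[:3] = [3.0, -2.0, 1.5]
y = X @ true_w + rng.normal(0, 0.5, size=100)

lasso = Lasso(alpha=0.1).fit(X, y)  # L1 penalty
ridge = Ridge(alpha=0.1).fit(X, y)  # L2 penalty

print("L1 coefficients exactly zero:", int(np.sum(lasso.coef_ == 0)), "of 20")
print("L2 coefficients exactly zero:", int(np.sum(ridge.coef_ == 0)), "of 20")
```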

Geometric Intuition

The magic of L1 sparsity comes from geometry. Think of regularization as constraining weights to lie within a region. This is closely related to constrained optimization.

L1: The Diamond

|w_1| + |w_2| \le C

Sharp corners on the axes. Loss contours likely hit a corner first, zeroing one weight.

L2: The Circle

w_1^2 + w_2^2 \le C

No corners. Loss contours hit a smooth edge. Both weights stay non-zero.
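
The same intuition can be checked numerically with a brute-force grid search over a hypothetical quadratic loss: restricted to the L1 diamond, the best point lands on a corner (one weight exactly zero), while on the L2 circle both weights stay non-zero.

```python
import numpy as np

# A hypothetical quadratic loss whose unconstrained minimum lies outside both regions.
def loss(w1, w2):
    return (w1 - 2.0) ** 2 + 4.0 * (w2 - 0.2) ** 2

# Brute-force search over a grid of weight vectors inside each constraint region.
w1, w2 = np.meshgrid(np.linspace(-1, 1, 1001), np.linspace(-1, 1, 1001))
L = loss(w1, w2)
C = 1.0

for name, inside in [("L1 diamond", np.abs(w1) + np.abs(w2) <= C),
                     ("L2 circle ", w1**2 + w2**2 <= C)]:
    masked = np.where(inside, L, np.inf)       # ignore points outside the region
    i, j = np.unravel_index(np.argmin(masked), masked.shape)
    print(f"{name}: best w1 = {w1[i, j]:+.3f}, best w2 = {w2[i, j]:+.3f}")
```

For this loss, the L1 search returns the corner (1.000, 0.000), while the L2 search returns a point with both coordinates non-zero.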

Interactive: L1 vs L2 Geometry

Watch how the loss contours intersect the constraint regions. L1's corners naturally produce zeros.

[Interactive demo: L1 (Diamond) vs L2 (Circle) Geometry. Loss-contour ellipses are drawn over the two constraint regions, with the unconstrained minimum and the current w1, w2 values marked. The constraint region shape determines where the loss contours first touch.]

L2: Circle = Distributed Weights

The loss contours hit the circle at an arbitrary point on the edge. Both weights are small but non-zero. L2 shrinks weights uniformly without zeroing them out.

Elastic Net

Why choose? Elastic Net combines both:

\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{data}} + \lambda_1 \sum_i |w_i| + \lambda_2 \sum_i w_i^2

Elastic Net gets L1's feature selection while avoiding its instability when features are correlated. It's available out of the box in scikit-learn as linear_model.ElasticNet.

The Grouping Effect

L1 regularization has a known weakness: if a group of features are highly correlated (e.g., "height in cm" and "height in inches"), L1 tends to select just one at random and zero out the others. This is unstable and arbitrary.

Elastic Net's L2 term encourages the grouping effect: correlated features are retained together and assigned similar weights. This makes the model more robust and interpretable when features are redundant.
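
A toy comparison (synthetic data, hypothetical penalty strengths): two near-duplicate features carry the same signal, plus one independent feature. The exact numbers depend on the seed and solver, but Lasso tends to concentrate the shared weight on one twin while Elastic Net assigns the twins similar coefficients.

```python
import numpy as np
from sklearn.linear_model import Lasso, ElasticNet

rng = np.random.default_rng(0)
n = 200

# Two near-duplicate features (think "height in cm" vs "height in inches")
# plus one independent feature.
shared = rng.normal(size=n)
x1 = shared + rng.normal(0, 0.05, size=n)
x2 = shared + rng.normal(0, 0.05, size=n)
x3 = rng.normal(size=n)
X = np.column_stack([x1, x2, x3])
y = 2.0 * shared + 1.0 * x3 + rng.normal(0, 0.1, size=n)

lasso = Lasso(alpha=0.1).fit(X, y)
enet = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)  # half L1, half L2

# Lasso's split between the twin features is unstable and often lopsided;
# Elastic Net tends to spread the shared weight across both.
print("Lasso coefficients:      ", np.round(lasso.coef_, 3))
print("Elastic Net coefficients:", np.round(enet.coef_, 3))
```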

Bayesian Interpretation

Regularization has a beautiful Bayesian interpretation. Adding a penalty is equivalent to placing a prior belief on the weights.

L2 = Gaussian Prior

"I believe weights are normally distributed around 0."

P(w) \propto e^{-w^2}

L1 = Laplace Prior

"I believe most weights should be exactly zero."

P(w) \propto e^{-|w|}

MAP Estimation

Regularized loss is equivalent to Maximum A Posteriori (MAP) estimation, closely related to MLE. We're finding the most probable weights given both the data AND our prior beliefs. Bayes' theorem in action.
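
Sketching the standard derivation (with σ² and b denoting the prior scale parameters): taking the negative log-posterior turns each prior into exactly the corresponding penalty term.

```latex
\begin{aligned}
w_{\text{MAP}} &= \arg\max_w \, P(w \mid D)
               = \arg\max_w \, P(D \mid w)\, P(w) \\
              &= \arg\min_w \, \big[ \underbrace{-\log P(D \mid w)}_{\mathcal{L}_{\text{data}}} \;-\; \log P(w) \big] \\[4pt]
\text{Gaussian prior: } P(w_i) \propto e^{-w_i^2 / 2\sigma^2}
  \;&\Rightarrow\; -\log P(w) = \frac{1}{2\sigma^2} \sum_i w_i^2 + \text{const}
  \quad \text{(L2 with } \lambda = \tfrac{1}{2\sigma^2}\text{)} \\
\text{Laplace prior: } P(w_i) \propto e^{-|w_i| / b}
  \;&\Rightarrow\; -\log P(w) = \frac{1}{b} \sum_i |w_i| + \text{const}
  \quad \text{(L1 with } \lambda = \tfrac{1}{b}\text{)}
\end{aligned}
```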

Dropout

Dropout (Srivastava et al., 2014) is a radical regularization technique for neural networks. During training, we randomly "kill" neurons by setting their outputs to zero. It works differently from L1/L2 and complements batch normalization (which also has regularization effects).

How Dropout Works

For each training batch, each neuron has probability p of being dropped (output set to 0). Formally, we apply a mask vector r of Bernoulli random variables, where r_j = 1 means neuron j is kept, which happens with probability 1 − p:

r_j \sim \text{Bernoulli}(1 - p)
h = f(W(x \odot r) + b)

Intuition: Imagine a team where any member might call in sick randomly. The team can't rely on one "superstar" to do everything. Everyone must learn to contribute. This forces redundant, robust representations.
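
A minimal NumPy sketch of the training-time mask, using the drop-probability convention above (toy activation values):

```python
import numpy as np

rng = np.random.default_rng(0)

p = 0.5                       # drop probability, as defined above
h = rng.normal(size=(4, 8))   # a batch of 4 hidden-layer activations, 8 units each

# Keep-mask: 1 with probability 1 - p, 0 with probability p.
r = rng.binomial(1, 1 - p, size=h.shape)
h_dropped = h * r             # dropped units output exactly 0 for this batch

print(r[0])                   # which of the 8 units survived in the first example
print(h_dropped[0])
```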

Dropout Simulation

[Interactive demo: a small network with 3 inputs, three hidden layers (5, 4, and 4 neurons), and 2 outputs; a slider sets the probability that a neuron is dropped (zeroed out).]

Training: Each batch randomly drops neurons across all 3 hidden layers. The network becomes a "thinned" version of itself. This prevents neurons from relying too much on any specific peer and forces the network to learn robust, distributed representations.

Preventing Co-Adaptation

Without dropout, neurons develop complex co-dependencies: "I only need to detect X because neuron 47 detects Y." With dropout, neuron 47 might be absent, so each neuron must be individually useful. This prevents brittle, specialized features and forces the network to learn more generalizable patterns.

Inverted Dropout & Scaling

There's a subtle problem: during training, only a (1 − p) fraction of neurons is active. At test time, all neurons are active. The expected activation magnitude is different!

The Scaling Problem

With p = 0.5 dropout in a 100-neuron layer, only about 50 neurons are active during training, so the downstream signal is roughly half as strong as at test time, when all 100 neurons fire. Without a correction, test-time activations are about 2x stronger than anything the network saw during training.

Inverted Dropout (The Fix)

Instead of scaling at test time (which is annoying), we scale up activations during training:

y = \frac{1}{1-p} \cdot \text{mask} \cdot x

With p = 0.5, we multiply surviving activations by 2. This ensures the expected value of the activation remains constant between training and testing. At test time, we simply run the network normally without any scaling or masking. This is computationally efficient and standard in modern frameworks like PyTorch and TensorFlow.
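
A minimal NumPy sketch of inverted dropout (not any particular framework's implementation), showing that the mean activation is preserved between training and test modes:

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(x, p=0.5, train=True):
    """Inverted dropout: scale surviving activations by 1/(1-p) during training."""
    if not train or p == 0.0:
        return x                        # test time: no masking, no scaling
    mask = rng.binomial(1, 1 - p, size=x.shape)
    return x * mask / (1 - p)           # E[output] == x, matching test time

x = np.ones((100_000, 10))              # large toy batch so the average is stable
out = dropout(x, p=0.5, train=True)

# The expected activation is preserved: the training-mode mean stays close to 1.0.
print(out.mean())                                 # approximately 1.0
print(dropout(x, p=0.5, train=False).mean())      # exactly 1.0 at test time
```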

Dropout as Ensemble

There's a deep theoretical justification for dropout: it implicitly trains an ensemble of 2^N different networks (where N is the number of neurons). This connects to the power of bootstrap aggregation.

Each dropout mask creates a different "thinned" sub-network. By training with random masks, we're training exponentially many sub-networks simultaneously, all sharing weights. At test time, using all neurons approximates averaging the predictions of all these sub-networks.

This connects to the success of Random Forests and Bagging. Ensemble methods work because they average out individual model errors. Dropout achieves this "for free" within a single network.
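
For a single linear layer this averaging can be checked directly with a toy NumPy sketch (hypothetical weights): averaging the outputs of many randomly thinned, inverted-dropout sub-networks converges to the full network's output. With nonlinearities, the test-time network is only an approximation of the ensemble average.

```python
import numpy as np

rng = np.random.default_rng(0)

# A single linear "network": y = W @ x.
W = rng.normal(size=(3, 50))
x = rng.normal(size=50)
p = 0.5

full_output = W @ x   # test-time forward pass: all units active, no scaling

masked_outputs = []
for _ in range(20_000):                               # sample many "thinned" sub-networks
    mask = rng.binomial(1, 1 - p, size=x.shape)
    masked_outputs.append(W @ (x * mask / (1 - p)))   # inverted-dropout sub-network

ensemble_avg = np.mean(masked_outputs, axis=0)
print(np.round(full_output, 2))
print(np.round(ensemble_avg, 2))   # close to the full output; exact only in expectation
```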

Comparison Table

Technique   | Best For                          | Effect
------------|-----------------------------------|----------------------------------
L1 (Lasso)  | Feature selection, sparse models  | Drives weights to exactly zero
L2 (Ridge)  | General purpose, weight decay     | Shrinks all weights uniformly
Elastic Net | Correlated features               | L1 + L2 combined
Dropout     | Deep neural networks              | Forces redundant representations