Calculus

Backpropagation & Gradient Descent

Computing gradients by hand and using them to train neural networks. The mathematical backbone of deep learning.

Introduction

In PyTorch or TensorFlow, you call .backward() and gradients magically appear. But to truly understand optimization, you must be able to derive gradients by hand for simple networks.

This is the ultimate test of your understanding of the chain rule and partial derivatives. We will dissect a single neuron with Sigmoid activation and MSE loss, then learn how to use those gradients for optimization.

What We'll Learn

Part 1: Backpropagation

Compute $\frac{\partial L}{\partial w}$ and $\frac{\partial L}{\partial b}$ by hand

Part 2: Gradient Descent

Use gradients to iteratively minimize loss

The Setup: A Single Neuron

We define a computational graph with input x, weight w, bias b, and target y_true.

Variables

$x = 2.0$ (Input)

$w = 3.0$ (Weight)

$b = 1.0$ (Bias)

$y_{true} = 1.0$ (Target)

Functions

Linear: $z = wx + b$

Activation: $a = \sigma(z) = \frac{1}{1+e^{-z}}$

Loss: $L = (a - y_{true})^2$

Computation Graph

x, w, b → z = wx + b → a = σ(z) → L = (a - y_true)²

Forward Pass

Calculate values from input to output:

1. Linear Combination

$$z = (3.0)(2.0) + 1.0 = 7.0$$

2. Sigmoid Activation

$$a = \sigma(7.0) = \frac{1}{1+e^{-7}} \approx 0.999089$$

3. MSE Loss

$$L = (0.999089 - 1.0)^2 \approx 8.3 \times 10^{-7}$$
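As a sanity check, here is a minimal sketch of this forward pass in plain Python:

```python
import math

# Forward pass for the single neuron: z = w*x + b, a = sigmoid(z), L = (a - y_true)^2
x, w, b, y_true = 2.0, 3.0, 1.0, 1.0

z = w * x + b                    # 7.0
a = 1.0 / (1.0 + math.exp(-z))   # ~ 0.999089
L = (a - y_true) ** 2            # ~ 8.3e-7

print(f"z = {z}, a = {a:.6f}, L = {L:.2e}")
```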

Backward Pass (Backpropagation)

We want $\frac{\partial L}{\partial w}$ and $\frac{\partial L}{\partial b}$. We can't jump directly; we must use the chain rule:

$$\frac{\partial L}{\partial w} = \frac{\partial L}{\partial a} \cdot \frac{\partial a}{\partial z} \cdot \frac{\partial z}{\partial w}$$

Part A: Loss w.r.t Activation

Function: $L = (a - y_{true})^2$

$$\frac{\partial L}{\partial a} = 2(a - y_{true}) = 2(0.999089 - 1.0) = -0.001822$$

Part B: Activation w.r.t Linear

Sigmoid derivative: $\sigma'(z) = \sigma(z)(1-\sigma(z)) = a(1-a)$

$$\frac{\partial a}{\partial z} = 0.999089(1-0.999089) \approx 0.000910$$
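If that identity is unfamiliar, it follows directly from the chain rule applied to $\sigma(z) = (1+e^{-z})^{-1}$:

$$\sigma'(z) = \frac{d}{dz}(1+e^{-z})^{-1} = \frac{e^{-z}}{(1+e^{-z})^{2}} = \frac{1}{1+e^{-z}} \cdot \frac{e^{-z}}{1+e^{-z}} = \sigma(z)\bigl(1-\sigma(z)\bigr)$$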

Part C: Linear w.r.t Weight

Function: $z = wx + b$

$$\frac{\partial z}{\partial w} = x = 2.0$$

Chain Rule in Action

Now multiply all the local gradients together:

Gradient for Weight (w)

$$\frac{\partial L}{\partial w} = (-0.001822) \times (0.000910) \times (2.0)$$

$$\frac{\partial L}{\partial w} \approx -3.316 \times 10^{-6}$$

Gradient for Bias (b)

Note: $\frac{\partial z}{\partial b} = 1$

$$\frac{\partial L}{\partial b} = (-0.001822) \times (0.000910) \times (1)$$

$$\frac{\partial L}{\partial b} \approx -1.658 \times 10^{-6}$$

Why So Small?

The gradients are tiny because the prediction (0.999) is very close to the target (1.0), so the loss is already nearly minimized. Also, Sigmoid at z=7 has a very small derivative (0.0009) due to saturation.
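To connect this back to the .backward() call from the introduction, here is a minimal sketch that checks the hand-derived gradients against PyTorch's autograd (assuming PyTorch is installed; expect tiny differences from float32 rounding):

```python
import torch

x = torch.tensor(2.0)
y_true = torch.tensor(1.0)
w = torch.tensor(3.0, requires_grad=True)
b = torch.tensor(1.0, requires_grad=True)

z = w * x + b              # forward: linear, 7.0
a = torch.sigmoid(z)       # forward: activation, ~0.999089
loss = (a - y_true) ** 2   # forward: loss, ~8.3e-7

loss.backward()            # backward pass (backpropagation)
print(w.grad.item())       # ~ -3.3e-6, matches dL/dw above
print(b.grad.item())       # ~ -1.7e-6, matches dL/db above
```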

Gradient Descent: Using the Gradients

Now that we know how to compute gradients, we need to know what to do with them. Gradient Descent is the optimization algorithm that uses gradients to iteratively update parameters, minimizing the loss function.

The Core Idea

Start somewhere. Look at the slope. Take a small step downhill. Repeat until you reach the bottom.

The Blindfolded Hiker Analogy

1. Feel the Ground: Use your feet to determine which direction is "downhill." (Compute gradient)

2. Take a Step: Move in the downhill direction. Step size depends on confidence. (Update parameters)

3. Repeat: Keep going until the ground feels flat (gradient near 0). You've reached a minimum.

The Update Rule

Let $J(\theta)$ be our loss function, where $\theta$ represents all model parameters:

$$\theta_{new} = \theta_{old} - \alpha \nabla J(\theta_{old})$$

$\theta$: Parameters we're optimizing (weights, biases). Could be millions of values.

$\alpha$: Learning rate. A hyperparameter controlling step size. Critical to get right.

$\nabla J$: Gradient vector. Points toward steepest ascent. Hence the minus sign.

Why Subtract the Gradient?

The gradient $\nabla J(\theta)$ points in the direction of steepest increase. Since we want to minimize loss (go downhill), we move in the opposite direction. If we added the gradient, we'd be doing gradient ascent.

Learning Rate Effects

Too small a learning rate leads to slow convergence, too large leads to divergence, and just right leads to smooth optimization. The worked example below shows a single update step.

Worked example: minimize $J(\theta) = \theta^2$ starting from $\theta = 4.00$ with learning rate $\alpha = 0.10$. The gradient is $\nabla J = 2\theta = 8.00$, the loss is $J = 16.00$, and the step is $-\alpha \nabla J = -0.80$:

$$\theta_{new} = 4.00 - 0.1\,(8.00) = 3.20$$
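A minimal sketch of that update rule run in a loop, on the same toy objective $J(\theta) = \theta^2$ with start $\theta = 4.0$ and $\alpha = 0.1$:

```python
theta = 4.0   # starting parameter
alpha = 0.1   # learning rate

for step in range(10):
    grad = 2 * theta       # dJ/dtheta for J(theta) = theta**2
    theta -= alpha * grad  # update: theta_new = theta_old - alpha * grad
    print(f"step {step + 1}: theta = {theta:.4f}, loss = {theta ** 2:.4f}")

# Each step multiplies theta by (1 - 2 * alpha) = 0.8, so theta decays toward 0.
```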

Learning Rate Analysis

The learning rate $\alpha$ is the most critical hyperparameter.

Too Small

Training takes forever. Model might get stuck in local minima early.

Symptoms: Loss decreases very slowly over thousands of epochs.

Just Right

Loss decreases steadily and plateaus at minimum. Efficient convergence.

Strategy: Often use learning rate decay (start high, decrease).

Too Large

Loss oscillates or explodes to infinity/NaN. You overshoot the valley.

Symptoms: Loss goes up, NaN values, training crashes.

Learning Rate Schedules

  • Step Decay: Reduce LR by factor every N epochs (e.g., halve every 30 epochs)
  • Exponential Decay: $\alpha_t = \alpha_0 \cdot e^{-kt}$
  • Cosine Annealing: Smoothly decrease to near-zero following cosine curve
  • Warmup: Start very small, increase, then decay. Helps with large batches.
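As a rough illustration, here is a minimal sketch of the first two schedules in Python; the decay factors and intervals are assumed example values, not prescribed ones:

```python
import math

def step_decay(lr0, epoch, drop=0.5, every=30):
    # Step decay: halve the learning rate every 30 epochs (example values).
    return lr0 * (drop ** (epoch // every))

def exponential_decay(lr0, epoch, k=0.05):
    # Exponential decay: alpha_t = alpha_0 * exp(-k * t).
    return lr0 * math.exp(-k * epoch)

for epoch in (0, 30, 60, 90):
    print(epoch, step_decay(0.1, epoch), exponential_decay(0.1, epoch))
```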

Gradient Descent Variants

How much data do we use to compute each gradient update?

Batch Gradient Descent

Use ALL training examples for ONE update.

+ Stable, true gradient
- Slow, memory intensive

Stochastic Gradient Descent (SGD)

Use ONE random example for ONE update.

+ Fast, can escape local minima via noise
- Very noisy, unstable path

Mini-Batch Gradient Descent (Standard)

Use a small batch (32, 64, 128) of examples.

+ Best of both worlds
+ GPU-friendly (matrix ops)

This is what everyone uses in practice.
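Below is a minimal NumPy sketch of one epoch of mini-batch gradient descent on a toy linear-regression problem; the data, batch size of 32, and learning rate are assumptions chosen for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))                 # toy dataset: 1000 examples, 5 features
true_w = np.array([1.0, -2.0, 3.0, 0.0, 0.5])  # assumed "true" weights
y = X @ true_w

w = np.zeros(5)
alpha, batch_size = 0.1, 32

order = rng.permutation(len(X))                 # shuffle once per epoch
for start in range(0, len(X), batch_size):
    idx = order[start:start + batch_size]       # one mini-batch of indices
    Xb, yb = X[idx], y[idx]
    grad = 2 * Xb.T @ (Xb @ w - yb) / len(idx)  # MSE gradient on the mini-batch
    w -= alpha * grad                           # one update per mini-batch

print(w)  # approaches true_w after a single epoch
```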

Convergence Conditions

When does gradient descent actually find the minimum?

Guarantees for Convex Functions

If $J(\theta)$ is convex (bowl-shaped), gradient descent with an appropriate learning rate is guaranteed to find the global minimum.

Convex

Linear regression, logistic regression. Single global minimum.

Non-Convex

Neural networks. Many local minima, saddle points. No guarantee.

The Surprising Success of Deep Learning

Despite non-convexity, deep networks train well because: (1) local minima in high dimensions are often nearly as good as global minima, (2) SGD noise helps escape bad minima, (3) over-parameterization creates many good solutions.

ML Applications

Momentum

Instead of using only the current gradient, accumulate a "velocity" term, $v_t = \beta v_{t-1} + \nabla J$, and update the parameters with that velocity: $\theta \leftarrow \theta - \alpha v_t$. Helps power through saddle points and noisy regions.
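A minimal sketch of momentum on the toy objective $J(\theta) = \theta^2$ (β = 0.9 is an assumed, commonly used value):

```python
theta, velocity = 4.0, 0.0
alpha, beta = 0.1, 0.9

for _ in range(100):
    grad = 2 * theta                   # gradient of J(theta) = theta**2
    velocity = beta * velocity + grad  # accumulate velocity
    theta -= alpha * velocity          # update with the velocity, not the raw gradient

print(theta)  # close to 0, after oscillating past the minimum a few times
```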

Adam Optimizer

Combines momentum with adaptive learning rates per parameter. Maintains running averages of both first and second moments of gradients. The default choice for most deep learning.
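A minimal single-parameter sketch of the Adam update on the same toy objective (beta1 = 0.9, beta2 = 0.999, eps = 1e-8 are the commonly quoted defaults; the learning rate is an assumed example value):

```python
theta = 4.0
alpha, beta1, beta2, eps = 0.01, 0.9, 0.999, 1e-8
m, v = 0.0, 0.0                              # first and second moment estimates

for t in range(1, 1001):
    grad = 2 * theta                         # gradient of J(theta) = theta**2
    m = beta1 * m + (1 - beta1) * grad       # running mean of gradients
    v = beta2 * v + (1 - beta2) * grad ** 2  # running mean of squared gradients
    m_hat = m / (1 - beta1 ** t)             # bias correction
    v_hat = v / (1 - beta2 ** t)
    theta -= alpha * m_hat / (v_hat ** 0.5 + eps)

print(theta)  # close to 0
```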

Gradient Clipping

When gradients explode (common in RNNs), clip them to a maximum norm. Prevents weight updates from being too large and destabilizing training.
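A minimal NumPy sketch of clipping by global L2 norm (in PyTorch the analogous utility is torch.nn.utils.clip_grad_norm_):

```python
import numpy as np

def clip_by_norm(grad, max_norm):
    # Rescale the gradient so its L2 norm never exceeds max_norm.
    norm = np.linalg.norm(grad)
    if norm > max_norm:
        grad = grad * (max_norm / norm)
    return grad

g = np.array([30.0, -40.0])           # "exploding" gradient with norm 50
print(clip_by_norm(g, max_norm=5.0))  # rescaled to norm 5: [ 3. -4.]
```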

Learning Rate Finder

Technique from fast.ai: gradually increase LR during one epoch, plot loss vs LR. The optimal LR is usually where loss decreases fastest, just before it explodes.
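A rough sketch of the idea on the toy objective $J(\theta) = \theta^2$; the real finder does this over one epoch of a model's mini-batches, and the growth factor here is an assumed example value:

```python
theta = 4.0
lr, growth = 1e-4, 1.2
history = []

while lr < 2.0:
    grad = 2 * theta                  # gradient of J(theta) = theta**2
    theta -= lr * grad                # one training step at the current learning rate
    history.append((lr, theta ** 2))  # record loss vs. learning rate
    lr *= growth                      # gradually increase the learning rate

for lr, loss in history:
    print(f"lr = {lr:8.5f}   loss = {loss:12.6g}")
# The loss improves fastest for lr near 0.5 and starts growing again once lr exceeds 1.0,
# the stability limit for this quadratic.
```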