The Chain Rule

The mathematical backbone of deep learning. Without this rule, there is no backpropagation, no gradient descent, no modern AI.

Introduction

In calculus, we learn rules for basic derivatives: the power rule (x^n \to nx^{n-1}), the sum rule, the product rule. But real-world functions are rarely simple. They are compositions of functions nested inside one another.

A neural network is not y = mx + b. It is y = f(g(h(k(x)))), where f, g, h, k are layers of weights and activation functions. If we want to train this network, we need to answer one question: "How does changing a weight deep inside affect the final output?"

The Chain Rule in One Sentence

To find the derivative of a composition, multiply the derivatives of each step.

The Gear System Intuition

Imagine three interconnected gears: A drives B, B drives C.

  • If Gear A turns 1 rotation, Gear B turns 2 rotations. (Rate: dB/dA = 2)
  • If Gear B turns 1 rotation, Gear C turns 3 rotations. (Rate: dC/dB = 3)

Question: How much does C turn when A turns 1 rotation?

Answer: Rate(A to C) = Rate(A to B) * Rate(B to C) = 2 * 3 = 6

This is the chain rule: multiply the local rates to get the global rate.

Single Variable Chain Rule

If y is a function of u, and u is a function of x (i.e., y = f(g(x))), then:

\frac{dy}{dx} = \frac{dy}{du} \cdot \frac{du}{dx}

In function notation: \frac{d}{dx}[f(g(x))] = f'(g(x)) \cdot g'(x)

The "Onion Peeling" Method

Let y = (3x + 1)^2. Find dy/dx.

Step 1: Identify outer and inner

Outer: u^2. Inner: u = 3x + 1

Step 2: Differentiate outer (keep inner intact)

\frac{dy}{du} = 2u = 2(3x+1)

Step 3: Differentiate inner

\frac{du}{dx} = 3

Step 4: Multiply

\frac{dy}{dx} = 2(3x+1) \cdot 3 = 6(3x+1)
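
You can sanity-check a result like this numerically. Here is a minimal Python sketch (not part of the original lesson) that compares the analytic derivative 6(3x + 1) against a central finite difference:

```python
# Check dy/dx = 6(3x + 1) for y = (3x + 1)^2 against a finite difference.

def y(x):
    return (3 * x + 1) ** 2

def dy_dx_analytic(x):
    return 6 * (3 * x + 1)

def dy_dx_numeric(x, h=1e-6):
    # Central difference: (y(x + h) - y(x - h)) / (2h)
    return (y(x + h) - y(x - h)) / (2 * h)

for x in (0.0, 1.0, 2.5):
    print(x, dy_dx_analytic(x), dy_dx_numeric(x))  # the two derivatives agree
```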

Interactive: Chain Rule

Watch the chain rule in action. We compute the derivative by propagating gradients backward through the computation graph.

Chain Rule Calculator

Computing \frac{dy}{dx} for y = (x + 2)^2 at x = 1.0.

Forward pass: compute values from input to output.

x = 1.0 \to u = 1.0 + 2 = 3.0 \to y = 3.0^2 = 9.00
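
The backward pass then multiplies the local derivatives from the output back to the input. A small hand-rolled sketch of both passes (same function and input as the demo above):

```python
# Forward pass for y = (x + 2)^2 at x = 1.0, storing intermediates.
x = 1.0
u = x + 2        # u = 3.0
y = u ** 2       # y = 9.0

# Backward pass: multiply local derivatives from output back to input.
dy_du = 2 * u    # derivative of u^2 with respect to u  -> 6.0
du_dx = 1.0      # derivative of x + 2 with respect to x
dy_dx = dy_du * du_dx
print(y, dy_dx)  # 9.0 6.0
```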

Computational Graphs

Modern deep learning frameworks (PyTorch, TensorFlow) represent computations as directed acyclic graphs (DAGs). Each node is an operation, each edge carries data. You can learn more about this in our Computational Graphs blog.

Example: f(x, y, z) = (x + y) * z

We break this into intermediate steps:

  • q = x + y
  • f = q * z
Graph: x and y feed into q = x + y, which combines with z to produce f = q * z.

Forward Pass

Compute values left-to-right. Store intermediate results at each node.

Backward Pass

Compute gradients right-to-left. Multiply local gradients using chain rule.
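
Here is what those two passes look like for this example, written out by hand in Python (the input values are arbitrary, chosen for illustration):

```python
# Forward pass: evaluate the graph left-to-right, storing intermediates.
x, y, z = -2.0, 5.0, -4.0     # arbitrary example inputs
q = x + y                     # q = 3.0
f = q * z                     # f = -12.0

# Backward pass: apply the chain rule right-to-left.
df_dq = z                     # d(q*z)/dq = z
df_dz = q                     # d(q*z)/dz = q
dq_dx = 1.0                   # d(x+y)/dx
dq_dy = 1.0                   # d(x+y)/dy
df_dx = df_dq * dq_dx         # -4.0
df_dy = df_dq * dq_dy         # -4.0
print(df_dx, df_dy, df_dz)    # -4.0 -4.0 3.0
```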

Multivariable Chain Rule

In ML, a variable often influences the output through multiple paths. For example, an input feature might feed into multiple neurons.

If f depends on u and v, and both depend on x:

\frac{df}{dx} = \frac{\partial f}{\partial u}\frac{du}{dx} + \frac{\partial f}{\partial v}\frac{dv}{dx}

Sum the gradients over ALL paths from x to f.
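
As a concrete (hypothetical) example, let f = u \cdot v with u = x + 1 and v = x^2. Both paths contribute to df/dx, and the sum matches differentiating f(x) = x^3 + x^2 directly:

```python
# Multivariable chain rule: f = u * v, where u = x + 1 and v = x**2.

def df_dx_chain(x):
    u = x + 1
    v = x ** 2
    df_du = v        # partial of u*v with respect to u
    df_dv = u        # partial of u*v with respect to v
    du_dx = 1.0
    dv_dx = 2 * x
    # Sum the gradient contributions over both paths from x to f
    return df_du * du_dx + df_dv * dv_dx

def df_dx_direct(x):
    # f(x) = (x + 1) * x^2 = x^3 + x^2, so f'(x) = 3x^2 + 2x
    return 3 * x ** 2 + 2 * x

print(df_dx_chain(2.0), df_dx_direct(2.0))  # 16.0 16.0
```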

The River Delta Analogy

If you dump dye into a river (x), and the river splits into two streams (u and v) before joining into a lake (f), the total dye in the lake depends on flow through both streams. You must add up contributions from all paths.

Backpropagation

Backpropagation is simply the recursive application of the chain rule, starting from the loss function and moving backward to the weights. For a hands-on walkthrough, see our Backpropagation & Gradient Descent guide.

Single Neuron Example

Consider: z = wx + b, a = \sigma(z), L = (a - y)^2

We want \frac{\partial L}{\partial w} to update the weight.

Step 1: \frac{\partial L}{\partial a} = 2(a - y)

Step 2: \frac{\partial a}{\partial z} = \sigma'(z) = a(1 - a)

Step 3: \frac{\partial z}{\partial w} = x

The Chain

\frac{\partial L}{\partial w} = \frac{\partial L}{\partial a} \cdot \frac{\partial a}{\partial z} \cdot \frac{\partial z}{\partial w}
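
Plugging in some made-up numbers for w, b, x, and y makes the three-factor chain concrete (a sketch, not part of the original example):

```python
import math

# Hypothetical values for the single-neuron example.
w, b = 0.5, 0.1
x, y = 2.0, 1.0

# Forward pass
z = w * x + b                     # 1.1
a = 1.0 / (1.0 + math.exp(-z))    # sigmoid(z)
L = (a - y) ** 2

# Backward pass: dL/dw = dL/da * da/dz * dz/dw
dL_da = 2 * (a - y)
da_dz = a * (1 - a)               # sigmoid'(z) expressed in terms of a
dz_dw = x
dL_dw = dL_da * da_dz * dz_dw
print(L, dL_dw)
```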

The Vanishing Gradient Problem

The chain rule reveals a critical vulnerability in deep networks. When we compute gradients through many layers, we multiply many terms together. This multiplication can cause gradients to shrink (vanish) or grow (explode) exponentially.

The Mathematics

For a network with n layers, the gradient of the loss with respect to early layer weights involves:

\frac{\partial L}{\partial w_1} = \frac{\partial L}{\partial a_n} \cdot \frac{\partial a_n}{\partial z_n} \cdot \frac{\partial z_n}{\partial a_{n-1}} \cdots \frac{\partial a_1}{\partial z_1} \cdot \frac{\partial z_1}{\partial w_1}

The terms \frac{\partial a_i}{\partial z_i} are the activation function derivatives. For Sigmoid, the maximum is 0.25.

Why Sigmoid Causes Vanishing Gradients

The Sigmoid function \sigma(z) = \frac{1}{1+e^{-z}} has derivative \sigma'(z) = \sigma(z)(1 - \sigma(z)).

The maximum derivative occurs at z = 0: \sigma'(0) = 0.5 \times 0.5 = 0.25

After n layers:

0.25^5 \approx 0.00098 \quad 0.25^{10} \approx 9.5 \times 10^{-7}
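
The decay is easy to reproduce by multiplying the per-layer bound directly (0.25 is the sigmoid's best case; the weight terms in the chain are ignored here):

```python
# Upper bound on the gradient after n sigmoid layers,
# assuming each activation derivative contributes at most 0.25.
for n in (1, 5, 10, 20):
    print(n, 0.25 ** n)
# 1  0.25
# 5  0.0009765625
# 10 9.5367431640625e-07
# 20 9.094947017729282e-13
```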

Gradient Flow Visualization

Interactive demo: a gradient of 1.0 at the loss is propagated backward through five sigmoid layers (Input, L1 through L5, Loss). Because the sigmoid derivative is at most 0.25 (at z = 0), the gradient shrinks by at least 75% at every layer, so very little signal reaches the input layer.

Vanishing Gradients

Product of small numbers goes to 0. Early layers stop learning because they receive near-zero gradients.

Common with Sigmoid/Tanh activations.

Exploding Gradients

Product of numbers > 1 grows exponentially. Weights become NaN and training fails.

Common in RNNs with long sequences.

Solutions

ReLU Activation

Derivative is 0 or 1. No multiplication decay when active.

Residual Connections

Skip connections create gradient "highways" bypassing layers.

Batch Normalization

Normalizes activations, keeping gradients in a healthy range.

Gradient Clipping

Caps gradient magnitude to prevent explosion (used in RNNs).
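
In PyTorch, clipping is a single call between backward() and the optimizer step. A minimal sketch with a stand-in model and random data:

```python
import torch

model = torch.nn.Linear(10, 1)                          # stand-in model
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
inputs, targets = torch.randn(4, 10), torch.randn(4, 1)

loss = torch.nn.functional.mse_loss(model(inputs), targets)
optimizer.zero_grad()
loss.backward()
# Rescale gradients so their global norm is at most 1.0
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
```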

ML Applications

Automatic Differentiation

PyTorch and TensorFlow use reverse-mode autodiff: build the graph forward, then traverse backward multiplying gradients. This is exactly the chain rule automated. One backward pass computes gradients for all parameters.
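
For example, the y = (x + 2)^2 computation from the interactive demo above takes a few lines with PyTorch autograd:

```python
import torch

# Reverse-mode autodiff: build the graph forward, then call backward().
x = torch.tensor(1.0, requires_grad=True)
u = x + 2
y = u ** 2

y.backward()          # propagate dy/dy = 1 back through the graph
print(y.item())       # 9.0
print(x.grad.item())  # 6.0, i.e. dy/dx = 2(x + 2) at x = 1
```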

Matrix Calculus

In vector/matrix operations, the chain rule becomes Jacobian-vector products. For a layer Y = XW, we get \frac{\partial L}{\partial W} = X^T \frac{\partial L}{\partial Y}. This is the chain rule in matrix form.
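
You can verify the identity against autograd with a quick sketch (shapes are arbitrary; the loss here is just sum(Y^2) so that \frac{\partial L}{\partial Y} has a simple closed form):

```python
import torch

# Layer Y = X @ W with a simple scalar loss L = sum(Y**2).
X = torch.randn(4, 3)
W = torch.randn(3, 2, requires_grad=True)
Y = X @ W
L = (Y ** 2).sum()
L.backward()

dL_dY = 2 * Y                          # gradient of sum(Y**2) w.r.t. Y
manual = (X.T @ dL_dY).detach()        # chain rule in matrix form
print(torch.allclose(W.grad, manual))  # True
```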

LSTM/GRU Gates

Long Short-Term Memory networks use gating mechanisms specifically designed to create "gradient highways" that allow gradients to flow across many time steps without vanishing.

Gradient Checkpointing

To save memory, we can recompute forward activations during the backward pass instead of storing them all. This trades compute for memory, enabled by the modular nature of the chain rule.
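
In PyTorch this is exposed through torch.utils.checkpoint. A minimal sketch, with an arbitrary stand-in block (the use_reentrant=False flag is the variant recommended in recent PyTorch versions):

```python
import torch
from torch.utils.checkpoint import checkpoint

# A stand-in block whose intermediate activations we prefer to recompute
# during the backward pass rather than store during the forward pass.
block = torch.nn.Sequential(
    torch.nn.Linear(64, 64),
    torch.nn.ReLU(),
    torch.nn.Linear(64, 64),
)

x = torch.randn(8, 64, requires_grad=True)
y = checkpoint(block, x, use_reentrant=False)  # activations recomputed on backward
loss = y.sum()
loss.backward()
print(x.grad.shape)  # torch.Size([8, 64])
```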