Introduction
In calculus, we learn rules for basic derivatives: the power rule ($\frac{d}{dx}x^n = nx^{n-1}$), the sum rule, the product rule. But real-world functions are rarely simple. They are compositions of functions nested inside one another.
A neural network is not $f(x)$. It is $f(g(h(k(x))))$, where f, g, h, k are layers of weights and activation functions. If we want to train this network, we need to compute: "How does changing a weight deep inside affect the final output?"
The Chain Rule in One Sentence
To find the derivative of a composition, multiply the derivatives of each step.
The Gear System Intuition
Imagine three interconnected gears: A drives B, B drives C.
- If Gear A turns 1 rotation, Gear B turns 2 rotations. (Rate: dB/dA = 2)
- If Gear B turns 1 rotation, Gear C turns 3 rotations. (Rate: dC/dB = 3)
Question: How much does C turn when A turns 1 rotation?
Answer: Rate(A to C) = Rate(A to B) * Rate(B to C) = 2 * 3 = 6
This is the chain rule: multiply the local rates to get the global rate.
Single Variable Chain Rule
If y is a function of u, and u is a function of x (i.e., $y = f(u)$ and $u = g(x)$), then:
$$\frac{dy}{dx} = \frac{dy}{du} \cdot \frac{du}{dx}$$
In function notation:
$$(f \circ g)'(x) = f'(g(x)) \cdot g'(x)$$
The "Onion Peeling" Method
Let $y = (x^2 + 1)^3$. Find dy/dx.
Step 1: Identify outer and inner
Outer: $y = u^3$. Inner: $u = x^2 + 1$
Step 2: Differentiate outer (keep inner intact)
$\frac{dy}{du} = 3u^2 = 3(x^2 + 1)^2$
Step 3: Differentiate inner
$\frac{du}{dx} = 2x$
Step 4: Multiply
$\frac{dy}{dx} = 3(x^2 + 1)^2 \cdot 2x = 6x(x^2 + 1)^2$
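To sanity-check the result, here is a small snippet in plain Python (no framework needed) that compares the chain-rule answer against a central finite difference:

```python
# Numerical sanity check of the chain-rule result above.

def y(x):
    return (x**2 + 1) ** 3

def dy_dx_chain_rule(x):
    # Outer derivative 3u^2 (with u = x^2 + 1) times inner derivative 2x
    return 3 * (x**2 + 1) ** 2 * 2 * x

def dy_dx_numeric(x, h=1e-6):
    # Central finite difference approximates the true derivative
    return (y(x + h) - y(x - h)) / (2 * h)

x = 1.5
print(dy_dx_chain_rule(x))  # 95.0625
print(dy_dx_numeric(x))     # ~95.0625
```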
Interactive: Chain Rule
Watch the chain rule in action. We compute the derivative by propagating gradients backward through the computation graph.
Computational Graphs
Modern deep learning frameworks (PyTorch, TensorFlow) represent computations as directed acyclic graphs (DAGs). Each node is an operation, each edge carries data. You can learn more about this in our Computational Graphs blog.
Example: f(x, y, z) = (x + y) * z
We break this into intermediate steps:
- q = x + y
- f = q * z
Forward Pass
Compute values left-to-right. Store intermediate results at each node.
Backward Pass
Compute gradients right-to-left. Multiply local gradients using chain rule.
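Here is a minimal Python sketch of both passes for this example; the input values are arbitrary illustrative choices:

```python
# Forward and backward pass for f(x, y, z) = (x + y) * z, done by hand.

x, y, z = -2.0, 5.0, -4.0

# Forward pass: compute values left-to-right and store intermediates
q = x + y              # q = 3.0
f = q * z              # f = -12.0

# Backward pass: start from df/df = 1 and multiply local gradients
df_df = 1.0
df_dq = z * df_df      # d(q*z)/dq = z  -> -4.0
df_dz = q * df_df      # d(q*z)/dz = q  ->  3.0
df_dx = 1.0 * df_dq    # d(x+y)/dx = 1  -> -4.0
df_dy = 1.0 * df_dq    # d(x+y)/dy = 1  -> -4.0

print(df_dx, df_dy, df_dz)  # -4.0 -4.0 3.0
```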
Multivariable Chain Rule
In ML, a variable often influences the output through multiple paths. For example, an input feature might feed into multiple neurons.
If f depends on u and v, and both depend on x:
$$\frac{df}{dx} = \frac{\partial f}{\partial u} \cdot \frac{du}{dx} + \frac{\partial f}{\partial v} \cdot \frac{dv}{dx}$$
Sum the gradients over ALL paths from x to f.
The River Delta Analogy
If you dump dye into a river (x), and the river splits into two streams (u and v) before joining into a lake (f), the total dye in the lake depends on flow through both streams. You must add up contributions from all paths.
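To make the two-path sum concrete, here is a small sketch with made-up intermediates $u = x^2$ and $v = 3x$ feeding $f = uv$:

```python
# Multivariable chain rule: f depends on x through u and through v.
# u = x^2, v = 3x, f = u * v (which equals 3x^3).

x = 2.0
u, v = x**2, 3 * x

# Partial derivatives of f with respect to its direct inputs
df_du = v              # d(u*v)/du
df_dv = u              # d(u*v)/dv

# Derivatives of each intermediate with respect to x
du_dx = 2 * x
dv_dx = 3.0

# Sum the contributions from both paths
df_dx = df_du * du_dx + df_dv * dv_dx
print(df_dx)           # 36.0, matching d(3x^3)/dx = 9x^2 at x = 2
```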
Backpropagation
Backpropagation is simply the recursive application of the chain rule, starting from the loss function and moving backward to the weights. For a hands-on walkthrough, see our Backpropagation & Gradient Descent guide.
Single Neuron Example
Consider: $z = wx + b$, $a = \sigma(z)$, $L = (a - y)^2$
We want $\frac{\partial L}{\partial w}$ to update the weight.
The Chain
$$\frac{\partial L}{\partial w} = \frac{\partial L}{\partial a} \cdot \frac{\partial a}{\partial z} \cdot \frac{\partial z}{\partial w}$$
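A minimal numeric sketch of this chain in plain Python, assuming the squared-error loss written above and made-up values for the input, target, and parameters:

```python
import math

# Single-neuron backprop: z = w*x + b, a = sigmoid(z), L = (a - y)^2.

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

x, y_target = 1.5, 1.0   # input and target (illustrative values)
w, b = 0.8, -0.2         # initial parameters (illustrative values)

# Forward pass
z = w * x + b
a = sigmoid(z)
L = (a - y_target) ** 2

# Backward pass: multiply the three local derivatives in the chain
dL_da = 2 * (a - y_target)   # derivative of the squared error
da_dz = a * (1 - a)          # sigmoid derivative
dz_dw = x                    # derivative of w*x + b with respect to w

dL_dw = dL_da * da_dz * dz_dw

# One gradient descent step (learning rate is arbitrary)
w -= 0.1 * dL_dw
print(dL_dw, w)
```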
The Vanishing Gradient Problem
The chain rule reveals a critical vulnerability in deep networks. When we compute gradients through many layers, we multiply many terms together. This multiplication can cause gradients to shrink (vanish) or grow (explode) exponentially.
The Mathematics
For a network with n layers, the gradient of the loss with respect to an early-layer weight involves a product with one factor per layer:
$$\frac{\partial L}{\partial w_1} = \frac{\partial L}{\partial a_n} \cdot \frac{\partial a_n}{\partial a_{n-1}} \cdots \frac{\partial a_2}{\partial a_1} \cdot \frac{\partial a_1}{\partial w_1}, \qquad \text{where } \frac{\partial a_i}{\partial a_{i-1}} = \sigma'(z_i)\, w_i$$
The $\sigma'(z_i)$ terms are the activation function derivatives. For Sigmoid, the maximum is 0.25.
Why Sigmoid Causes Vanishing Gradients
The Sigmoid function has derivative $\sigma'(z) = \sigma(z)\,(1 - \sigma(z))$.
Maximum derivative occurs at z=0: $\sigma'(0) = 0.5 \times 0.5 = 0.25$
After n layers: the gradient picks up a factor of at most $0.25^n$ (for n = 10, that is roughly $10^{-6}$).
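A short loop makes the decay tangible:

```python
# Best case for Sigmoid: every layer contributes a factor of at most 0.25.

max_sigmoid_grad = 0.25
for n in (1, 5, 10, 20):
    print(n, max_sigmoid_grad ** n)
# 1  0.25
# 5  0.0009765625
# 10 9.5367431640625e-07
# 20 9.094947017729282e-13
```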
Gradient Flow Visualization
Interactive: watch the running product as the gradient passes backward through a stack of Sigmoid layers. Because the Sigmoid derivative peaks at 0.25 (at z=0), the gradient shrinks by at least 75% at every layer.
Vanishing Gradients
Product of small numbers goes to 0. Early layers stop learning because they receive near-zero gradients.
Common with Sigmoid/Tanh activations.
Exploding Gradients
Product of numbers > 1 grows exponentially. Weight updates blow up, values overflow to NaN, and training fails.
Common in RNNs with long sequences.
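Both failure modes come from the same arithmetic. A loop over 50 made-up per-layer factors shows the two regimes:

```python
# Repeated multiplication either collapses toward 0 or blows up.

prod_small, prod_large = 1.0, 1.0
for _ in range(50):
    prod_small *= 0.9    # per-layer factor slightly below 1
    prod_large *= 1.5    # per-layer factor slightly above 1

print(prod_small)        # ~0.005   (vanishing)
print(prod_large)        # ~6.4e+08 (exploding)
```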
Solutions
ReLU Activation
Derivative is 0 or 1. No multiplication decay when active.
Residual Connections
Skip connections create gradient "highways" bypassing layers.
Batch Normalization
Normalizes activations, keeping gradients in a healthy range.
Gradient Clipping
Caps gradient magnitude to prevent explosion (used in RNNs).
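For illustration, here is a minimal PyTorch sketch of gradient clipping inside one training step; the model, data, and max_norm of 1.0 are arbitrary placeholder choices:

```python
import torch
import torch.nn as nn

# Toy model and batch, purely for illustration
model = nn.Sequential(nn.Linear(10, 10), nn.ReLU(), nn.Linear(10, 1))
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
inputs, targets = torch.randn(4, 10), torch.randn(4, 1)

loss = nn.functional.mse_loss(model(inputs), targets)
loss.backward()                                    # chain rule fills every .grad

# Cap the global gradient norm before the update to prevent explosion
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
optimizer.zero_grad()
```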
ML Applications
Automatic Differentiation
PyTorch and TensorFlow use reverse-mode autodiff: build the graph forward, then traverse backward multiplying gradients. This is exactly the chain rule automated. One backward pass computes gradients for all parameters.
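For example, a few lines of PyTorch reproduce the (x + y) * z graph from earlier and recover all three gradients with a single backward() call:

```python
import torch

x = torch.tensor(-2.0, requires_grad=True)
y = torch.tensor(5.0, requires_grad=True)
z = torch.tensor(-4.0, requires_grad=True)

f = (x + y) * z      # forward: build the graph
f.backward()         # backward: traverse it, multiplying local gradients

print(x.grad, y.grad, z.grad)   # -4.0, -4.0, 3.0
```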
Matrix Calculus
In vector/matrix operations, the chain rule becomes Jacobian-vector products. For a layer $Y = XW$, we get $\frac{\partial L}{\partial X} = \frac{\partial L}{\partial Y} W^\top$ and $\frac{\partial L}{\partial W} = X^\top \frac{\partial L}{\partial Y}$. This is the chain rule in matrix form.
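A short check of these two identities against autograd (the shapes are arbitrary illustrative choices):

```python
import torch

X = torch.randn(4, 3, requires_grad=True)
W = torch.randn(3, 2, requires_grad=True)

Y = X @ W
Y.sum().backward()     # with L = sum(Y), the upstream gradient dL/dY is all ones

dL_dY = torch.ones(4, 2)
print(torch.allclose(X.grad, dL_dY @ W.detach().T))   # dL/dX = dL/dY @ W^T  -> True
print(torch.allclose(W.grad, X.detach().T @ dL_dY))   # dL/dW = X^T @ dL/dY  -> True
```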
LSTM/GRU Gates
Long Short-Term Memory networks use gating mechanisms specifically designed to create "gradient highways" that allow gradients to flow across many time steps without vanishing.
Gradient Checkpointing
To save memory, we can recompute forward activations during the backward pass instead of storing them all. This trades compute for memory, enabled by the modular nature of the chain rule.
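A minimal sketch with torch.utils.checkpoint (the block and shapes are arbitrary, and the use_reentrant flag assumes a recent PyTorch version):

```python
import torch
from torch.utils.checkpoint import checkpoint

# A block whose activations we choose not to store during the forward pass
block = torch.nn.Sequential(
    torch.nn.Linear(256, 256), torch.nn.ReLU(),
    torch.nn.Linear(256, 256), torch.nn.ReLU(),
)

x = torch.randn(32, 256, requires_grad=True)

out = checkpoint(block, x, use_reentrant=False)  # forward is recomputed during backward
loss = out.sum()
loss.backward()                                  # same gradients, lower peak memory
```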