Introduction
In single-variable calculus, the derivative tells us the rate of change of y with respect to x. But neural networks have millions of parameters. A loss function might depend on weights $w_1, w_2, \ldots, w_n$.
We can't simply ask "what's the slope?" because the function slopes differently in every direction. Instead, we ask: "How does the output change if I wiggle just one input while holding all others constant?"
Why Partial Derivatives Matter
- Gradient Descent: The gradient is a vector of partial derivatives.
- Backpropagation: Computing partials for every weight in a neural network.
- Optimization: Finding which direction reduces loss the most.
- Sensitivity Analysis: Which inputs affect outputs the most?
Formal Definition
Let $f(x, y)$ be a function of two variables. The partial derivative with respect to x is:

$$\frac{\partial f}{\partial x} = \lim_{h \to 0} \frac{f(x + h, y) - f(x, y)}{h}$$
Note the symbol $\partial$ ("partial") instead of $d$. This explicitly indicates that y is held constant.
Similarly, the partial with respect to y is:

$$\frac{\partial f}{\partial y} = \lim_{h \to 0} \frac{f(x, y + h) - f(x, y)}{h}$$
The Key Insight
When computing $\frac{\partial f}{\partial x}$, treat y as if it were a constant number (like 5 or $\pi$). Then differentiate with respect to x using your normal single-variable rules.
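A tiny finite-difference sketch makes this concrete: nudge one input while freezing the other. The function f below is just an illustrative example, not one used elsewhere in this article.

```python
# Approximate partial derivatives numerically by wiggling one input at a time.
def f(x, y):
    return x**2 + 3 * x * y   # illustrative example function

def partial_x(f, x, y, h=1e-5):
    # Only x is nudged; y is held constant.
    return (f(x + h, y) - f(x - h, y)) / (2 * h)

def partial_y(f, x, y, h=1e-5):
    # Only y is nudged; x is held constant.
    return (f(x, y + h) - f(x, y - h)) / (2 * h)

print(partial_x(f, 1.0, 2.0))  # ≈ 8.0, matching 2x + 3y
print(partial_y(f, 1.0, 2.0))  # ≈ 3.0, matching 3x
```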
Notation & Computation
Several notations are used for partial derivatives:
- Leibniz: $\frac{\partial f}{\partial x}$
- Subscript: $f_x$
- Operator: $\partial_x f$
- D-notation: $D_x f$
Worked Example
Let $f(x, y) = x^2 y + y^3$. Find all partial derivatives.

Partial with respect to x:

Treat y as constant: the $y^3$ term contains no x, so it differentiates to 0, leaving

$$\frac{\partial f}{\partial x} = 2xy$$

Partial with respect to y:

Treat x as constant: $x^2$ acts as a fixed coefficient, so

$$\frac{\partial f}{\partial y} = x^2 + 3y^2$$
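If sympy is available, the worked example above can be checked symbolically; this is just a sanity check, not part of the derivation.

```python
import sympy as sp

x, y = sp.symbols('x y')
f = x**2 * y + y**3   # the example function from above

print(sp.diff(f, x))  # 2*x*y         (the y**3 term acts as a constant)
print(sp.diff(f, y))  # x**2 + 3*y**2
```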
Geometric Intuition
Consider a 3D surface representing a landscape with hills and valleys. Standing at a point (x, y), you can walk in infinitely many directions.
Slicing the Surface
When we compute $\frac{\partial f}{\partial x}$, we "slice" the 3D surface with a vertical plane where y = constant.
This slice creates a 2D curve. The partial derivative is the slope of this curve at our point.
Walking East-West
$\frac{\partial f}{\partial x}$ tells you the slope if you walk strictly in the x-direction (East-West).
$\frac{\partial f}{\partial y}$ tells you the slope if you walk strictly in the y-direction (North-South).
The Gradient Vector
The gradient collects all partial derivatives into a single vector. For a function $f(x_1, x_2, \ldots, x_n)$:

$$\nabla f = \left( \frac{\partial f}{\partial x_1}, \frac{\partial f}{\partial x_2}, \ldots, \frac{\partial f}{\partial x_n} \right)$$
The symbol $\nabla$ is called "nabla" or "del".
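As a minimal numpy sketch (using the f(x, y) = x² + y² example from the interactive below), the gradient is simply the partials stacked into a vector:

```python
import numpy as np

def grad_f(x, y):
    # Gradient of f(x, y) = x**2 + y**2: the vector of its two partials.
    return np.array([2 * x, 2 * y])

g = grad_f(1.0, 2.0)
print(g)                   # [2. 4.]
print(np.linalg.norm(g))   # its magnitude (the steepness), ≈ 4.47
```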
Direction of Steepest Ascent
The gradient points in the direction where f increases fastest. If you're standing on a hill, the gradient points uphill.
Gradient Descent
To minimize f (like a loss function), move in the opposite direction:

$$\theta_{t+1} = \theta_t - \eta \, \nabla f(\theta_t)$$

where $\eta$ is the learning rate (step size).
Magnitude = Steepness
The magnitude $\|\nabla f\|$ tells you how steep the slope is. Large gradient = steep terrain = big updates.
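Putting the pieces together, here is a minimal gradient-descent sketch on f(x, y) = x² + y² (the function used in the interactive below); the starting point and learning rate are arbitrary choices for illustration.

```python
import numpy as np

def grad_f(p):
    # Gradient of f(x, y) = x**2 + y**2 at the point p = (x, y).
    return 2 * p

p = np.array([3.0, -2.0])    # arbitrary starting point
lr = 0.1                     # learning rate (assumed value)

for _ in range(50):
    p = p - lr * grad_f(p)   # step against the gradient

print(p)  # very close to the minimum at the origin
```

Each step scales the point by the constant factor 1 - 2·lr = 0.8, which is why the iterate heads straight toward the origin.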
Interactive: The Gradient
For the function $f(x, y) = x^2 + y^2$, the gradient is:

$$\nabla f = (2x, 2y)$$
Drag the point to see how the gradient always points toward steepest ascent (away from the minimum at the origin).
The interactive also shows that the gradient vector is literally just the list of partial derivatives, that its magnitude measures the local steepness, and that it always points perpendicular to the contour lines.
Directional Derivatives
Partial derivatives tell us the slope along coordinate axes. But what if we want to know the slope in an arbitrary direction?
Directional Derivative
The rate of change of f in the direction of a unit vector $\mathbf{u}$ is:

$$D_{\mathbf{u}} f = \nabla f \cdot \mathbf{u} = \|\nabla f\| \cos\theta$$

where $\theta$ is the angle between the gradient and the direction $\mathbf{u}$.
- Maximum: when $\mathbf{u}$ points along the gradient, $\cos(0°) = 1$, giving the maximum rate of increase.
- Minimum: when $\mathbf{u}$ points opposite the gradient, $\cos(180°) = -1$, giving the maximum rate of decrease.
- Zero: when $\mathbf{u}$ is perpendicular to the gradient, $\cos(90°) = 0$; you are moving along a level curve (contour).
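These three cases are easy to verify numerically; the sketch below uses f(x, y) = x² + y² at the point (1, 2).

```python
import numpy as np

def grad_f(p):
    # Gradient of f(x, y) = x**2 + y**2.
    return 2 * p

p = np.array([1.0, 2.0])
g = grad_f(p)

def directional_derivative(g, u):
    u = u / np.linalg.norm(u)   # ensure u is a unit vector
    return np.dot(g, u)

print(directional_derivative(g, g))                        # along the gradient: ≈ +4.47 (the magnitude)
print(directional_derivative(g, -g))                       # against the gradient: ≈ -4.47
print(directional_derivative(g, np.array([-g[1], g[0]])))  # perpendicular: ≈ 0
```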
Gradient is Perpendicular to Level Curves
The gradient is always perpendicular to level curves (contours where f is constant). This is because moving along a level curve means zero change in f, which requires $\cos\theta = 0$.
Higher-Order Partial Derivatives
We can take partial derivatives of partial derivatives. These second-order partials are crucial for optimization (see the Hessian matrix for more).
Second-Order Partials
- Pure second partial: $\frac{\partial^2 f}{\partial x^2} = \frac{\partial}{\partial x}\left(\frac{\partial f}{\partial x}\right)$
- Mixed partial: $\frac{\partial^2 f}{\partial y \, \partial x} = \frac{\partial}{\partial y}\left(\frac{\partial f}{\partial x}\right)$
Clairaut's Theorem (Symmetry of Mixed Partials)
If the mixed partials are continuous, then the order of differentiation doesn't matter:

$$\frac{\partial^2 f}{\partial x \, \partial y} = \frac{\partial^2 f}{\partial y \, \partial x}$$
This is why the Hessian matrix is symmetric for most functions we encounter.
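A quick symbolic check of this symmetry (sympy, with an arbitrary smooth example function):

```python
import sympy as sp

x, y = sp.symbols('x y')
f = sp.exp(x * y) + x**3 * y**2   # arbitrary smooth example

f_xy = sp.diff(f, x, y)   # differentiate with respect to x, then y
f_yx = sp.diff(f, y, x)   # differentiate with respect to y, then x

print(sp.simplify(f_xy - f_yx))   # 0: the order doesn't matter
```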
ML Applications
Backpropagation
When training a neural network, we need $\frac{\partial L}{\partial w_i}$ for every weight $w_i$. The chain rule lets us compute these efficiently by propagating gradients backward through the network.
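For a single linear "neuron" with a squared-error loss, the chain rule can be written out by hand. This toy sketch only shows the shape of the computation, not a real backpropagation implementation.

```python
# Toy model: y_hat = w * x, with squared-error loss L = (y_hat - y_true)**2.
x, y_true, w = 2.0, 3.0, 0.5

y_hat = w * x                     # forward pass
loss = (y_hat - y_true) ** 2      # 4.0

# Chain rule: dL/dw = dL/dy_hat * dy_hat/dw
dL_dyhat = 2 * (y_hat - y_true)   # -4.0
dyhat_dw = x                      #  2.0
dL_dw = dL_dyhat * dyhat_dw       # -8.0: increasing w would reduce the loss

print(dL_dw)
```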
Automatic Differentiation
PyTorch and TensorFlow build computation graphs and use the chain rule to compute all partial derivatives automatically. When you call loss.backward(), it computes the gradient of loss with respect to all parameters.
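A minimal PyTorch sketch of that workflow (assuming torch is installed); it reproduces the hand computation from the backpropagation example above.

```python
import torch

x, y_true = torch.tensor(2.0), torch.tensor(3.0)
w = torch.tensor(0.5, requires_grad=True)   # a trainable "parameter"

y_hat = w * x                  # the computation graph is recorded here
loss = (y_hat - y_true) ** 2

loss.backward()                # chain rule applied backward through the graph
print(w.grad)                  # tensor(-8.), matching the hand calculation
```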
Feature Importance (Saliency Maps)
For a trained model, $\frac{\partial f}{\partial x_i}$ (the partial of the output with respect to input feature $x_i$) tells us how sensitive the output is to each input feature. This creates "saliency maps" showing which pixels matter most for image classification.
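A hedged sketch of that idea in PyTorch: the model and input below are placeholders, but the pattern (require gradients on the input, backprop a class score, read off input.grad) is the standard one.

```python
import torch

# Placeholder "trained model" and input image; in practice these are real.
model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 8 * 8, 10))
image = torch.rand(1, 3, 8, 8, requires_grad=True)

score = model(image)[0, 3]     # score of some target class (index 3 here)
score.backward()               # partials of that score w.r.t. every pixel

saliency = image.grad.abs().max(dim=1).values   # per-pixel importance
print(saliency.shape)                           # torch.Size([1, 8, 8])
```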
Regularization via Gradients
Some regularization techniques penalize large gradients. For example, spectral normalization in GANs constrains the Lipschitz constant (maximum gradient magnitude) of the discriminator.
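Spectral normalization itself rescales weight matrices, but a closely related idea, the gradient penalty used in WGAN-GP, penalizes the input-gradient norm directly and is easy to sketch with torch.autograd.grad. The critic and samples below are placeholders.

```python
import torch

critic = torch.nn.Sequential(torch.nn.Linear(16, 1))    # placeholder critic
samples = torch.rand(32, 16, requires_grad=True)         # placeholder inputs

scores = critic(samples)
grads, = torch.autograd.grad(
    outputs=scores.sum(),    # summing yields per-sample input gradients
    inputs=samples,
    create_graph=True,       # so the penalty itself can be backpropagated
)

grad_norms = grads.norm(dim=1)                        # gradient magnitude per sample
gradient_penalty = ((grad_norms - 1.0) ** 2).mean()   # push magnitudes toward 1
print(gradient_penalty)
```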