Introduction
In single-variable calculus, the derivative tells us the rate of change of y with respect to x. But neural networks have millions of parameters. A loss function might depend on weights $w_1, w_2, \ldots, w_n$.
We can't simply ask "what's the slope?" because the function slopes differently in every direction. Instead, we ask: "How does the output change if I wiggle just one input while holding all others constant?"
Why Partial Derivatives Matter
- Gradient Descent: The gradient is a vector of partial derivatives.
- Backpropagation: Computing partials for every weight in a neural network.
- Optimization: Finding which direction reduces loss the most.
- Sensitivity Analysis: Which inputs affect outputs the most?
Formal Definition
Let $f(x, y)$ be a function of two variables. The partial derivative with respect to x is:

$$\frac{\partial f}{\partial x} = \lim_{h \to 0} \frac{f(x + h, y) - f(x, y)}{h}$$
Note the symbol $\partial$ ("partial") instead of $d$. This explicitly indicates that y is held constant.
Similarly, the partial with respect to y is:

$$\frac{\partial f}{\partial y} = \lim_{h \to 0} \frac{f(x, y + h) - f(x, y)}{h}$$
The Key Insight
When computing $\frac{\partial f}{\partial x}$, treat y as if it were a constant number (like 5 or $\pi$). Then differentiate with respect to x using your normal single-variable rules.
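A tiny finite-difference sketch makes this concrete: nudge one input while freezing the other. The function f below is just an illustrative example, not one used elsewhere in this article.

```python
# Approximate partial derivatives numerically by wiggling one input at a time.
def f(x, y):
    return x**2 + 3 * x * y   # illustrative example function

def partial_x(f, x, y, h=1e-5):
    # Only x is nudged; y is held constant.
    return (f(x + h, y) - f(x - h, y)) / (2 * h)

def partial_y(f, x, y, h=1e-5):
    # Only y is nudged; x is held constant.
    return (f(x, y + h) - f(x, y - h)) / (2 * h)

print(partial_x(f, 1.0, 2.0))  # ≈ 8.0, matching 2x + 3y
print(partial_y(f, 1.0, 2.0))  # ≈ 3.0, matching 3x
```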
Notation & Computation
Several notations are used for partial derivatives:
- Leibniz: $\frac{\partial f}{\partial x}$
- Subscript: $f_x$
- Operator: $\partial_x f$
- D-notation: $D_x f$
Worked Example
Let $f(x, y) = x^2 y + y^3$. Find all partial derivatives.

Partial with respect to x:

Treat y as constant: the $y^3$ term contains no x, so it differentiates to 0, leaving

$$\frac{\partial f}{\partial x} = 2xy$$

Partial with respect to y:

Treat x as constant: $x^2$ acts as a fixed coefficient, so

$$\frac{\partial f}{\partial y} = x^2 + 3y^2$$
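If sympy is available, the worked example above can be checked symbolically; this is just a sanity check, not part of the derivation.

```python
import sympy as sp

x, y = sp.symbols('x y')
f = x**2 * y + y**3   # the example function from above

print(sp.diff(f, x))  # 2*x*y         (the y**3 term acts as a constant)
print(sp.diff(f, y))  # x**2 + 3*y**2
```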
Geometric Intuition
Consider a 3D surface representing a landscape with hills and valleys. Standing at a point (x, y), you can walk in infinitely many directions.
Slicing the Surface
When we compute $\frac{\partial f}{\partial x}$, we "slice" the 3D surface with a vertical plane where y = constant.
This slice creates a 2D curve. The partial derivative is the slope of this curve at our point.
Walking East-West
$\frac{\partial f}{\partial x}$ tells you the slope if you walk strictly in the x-direction (East-West).
$\frac{\partial f}{\partial y}$ tells you the slope if you walk strictly in the y-direction (North-South).
The Gradient Vector
The gradient collects all partial derivatives into a single vector. For a function $f(x_1, x_2, \ldots, x_n)$:

$$\nabla f = \left( \frac{\partial f}{\partial x_1}, \frac{\partial f}{\partial x_2}, \ldots, \frac{\partial f}{\partial x_n} \right)$$
The symbol $\nabla$ is called "nabla" or "del".
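As a minimal numpy sketch (using the f(x, y) = x² + y² example from the interactive below), the gradient is simply the partials stacked into a vector:

```python
import numpy as np

def grad_f(x, y):
    # Gradient of f(x, y) = x**2 + y**2: the vector of its two partials.
    return np.array([2 * x, 2 * y])

g = grad_f(1.0, 2.0)
print(g)                   # [2. 4.]
print(np.linalg.norm(g))   # its magnitude (the steepness), ≈ 4.47
```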
Direction of Steepest Ascent
The gradient points in the direction where f increases fastest. If you're standing on a hill, the gradient points uphill.
Gradient Descent
To minimize f (like a loss function), move in the opposite direction:

$$\theta_{t+1} = \theta_t - \eta \, \nabla f(\theta_t)$$

where $\eta$ is the learning rate (step size).
Magnitude = Steepness
The magnitude $\|\nabla f\|$ tells you how steep the slope is. Large gradient = steep terrain = big updates.
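Putting the pieces together, here is a minimal gradient-descent sketch on f(x, y) = x² + y² (the function used in the interactive below); the starting point and learning rate are arbitrary choices for illustration.

```python
import numpy as np

def grad_f(p):
    # Gradient of f(x, y) = x**2 + y**2 at the point p = (x, y).
    return 2 * p

p = np.array([3.0, -2.0])    # arbitrary starting point
lr = 0.1                     # learning rate (assumed value)

for _ in range(50):
    p = p - lr * grad_f(p)   # step against the gradient

print(p)  # very close to the minimum at the origin
```

Each step scales the point by the constant factor 1 - 2·lr = 0.8, which is why the iterate heads straight toward the origin.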
Interactive: The Gradient
For the function $f(x, y) = x^2 + y^2$, the gradient is:

$$\nabla f = (2x, 2y)$$
Drag the point to see how the gradient always points toward steepest ascent (away from the minimum at the origin).
The interactive also shows that the gradient vector is literally just the list of partial derivatives, that its magnitude measures the local steepness, and that it always points perpendicular to the contour lines.
Directional Derivatives
Partial derivatives tell us the slope along coordinate axes. But what if we want to know the slope in an arbitrary direction?
Directional Derivative
The rate of change of f in the direction of a unit vector $\mathbf{u}$ is:

$$D_{\mathbf{u}} f = \nabla f \cdot \mathbf{u} = \|\nabla f\| \cos\theta$$

where $\theta$ is the angle between the gradient and the direction $\mathbf{u}$.
- Maximum: when $\mathbf{u}$ points along the gradient, $\cos(0°) = 1$, giving the maximum rate of increase.
- Minimum: when $\mathbf{u}$ points opposite the gradient, $\cos(180°) = -1$, giving the maximum rate of decrease.
- Zero: when $\mathbf{u}$ is perpendicular to the gradient, $\cos(90°) = 0$; you are moving along a level curve (contour).
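These three cases are easy to verify numerically; the sketch below uses f(x, y) = x² + y² at the point (1, 2).

```python
import numpy as np

def grad_f(p):
    # Gradient of f(x, y) = x**2 + y**2.
    return 2 * p

p = np.array([1.0, 2.0])
g = grad_f(p)

def directional_derivative(g, u):
    u = u / np.linalg.norm(u)   # ensure u is a unit vector
    return np.dot(g, u)

print(directional_derivative(g, g))                        # along the gradient: ≈ +4.47 (the magnitude)
print(directional_derivative(g, -g))                       # against the gradient: ≈ -4.47
print(directional_derivative(g, np.array([-g[1], g[0]])))  # perpendicular: ≈ 0
```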
Gradient is Perpendicular to Level Curves
The gradient is always perpendicular to level curves (contours where f is constant). This is because moving along a level curve means zero change in f, which requires $\cos\theta = 0$.
Higher-Order Partial Derivatives
We can take partial derivatives of partial derivatives. These second-order partials are crucial for optimization (see the Hessian matrix for more).
Second-Order Partials
- Pure second partial: $\frac{\partial^2 f}{\partial x^2} = \frac{\partial}{\partial x}\left(\frac{\partial f}{\partial x}\right)$
- Mixed partial: $\frac{\partial^2 f}{\partial y \, \partial x} = \frac{\partial}{\partial y}\left(\frac{\partial f}{\partial x}\right)$
Clairaut's Theorem (Symmetry of Mixed Partials)
If the mixed partials are continuous, then the order of differentiation doesn't matter:

$$\frac{\partial^2 f}{\partial x \, \partial y} = \frac{\partial^2 f}{\partial y \, \partial x}$$
This is why the Hessian matrix is symmetric for most functions we encounter.
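A quick symbolic check of this symmetry (sympy, with an arbitrary smooth example function):

```python
import sympy as sp

x, y = sp.symbols('x y')
f = sp.exp(x * y) + x**3 * y**2   # arbitrary smooth example

f_xy = sp.diff(f, x, y)   # differentiate with respect to x, then y
f_yx = sp.diff(f, y, x)   # differentiate with respect to y, then x

print(sp.simplify(f_xy - f_yx))   # 0: the order doesn't matter
```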
ML Applications
Backpropagation
When training a neural network, we need $\frac{\partial L}{\partial w_i}$ for every weight $w_i$. The chain rule lets us compute these efficiently by propagating gradients backward through the network.
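For a single linear "neuron" with a squared-error loss, the chain rule can be written out by hand. This toy sketch only shows the shape of the computation, not a real backpropagation implementation.

```python
# Toy model: y_hat = w * x, with squared-error loss L = (y_hat - y_true)**2.
x, y_true, w = 2.0, 3.0, 0.5

y_hat = w * x                     # forward pass
loss = (y_hat - y_true) ** 2      # 4.0

# Chain rule: dL/dw = dL/dy_hat * dy_hat/dw
dL_dyhat = 2 * (y_hat - y_true)   # -4.0
dyhat_dw = x                      #  2.0
dL_dw = dL_dyhat * dyhat_dw       # -8.0: increasing w would reduce the loss

print(dL_dw)
```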
Automatic Differentiation
PyTorch and TensorFlow build computation graphs and use the chain rule to compute all partial derivatives automatically. When you call loss.backward(), it computes the gradient of loss with respect to all parameters.
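A minimal PyTorch sketch of that workflow (assuming torch is installed); it reproduces the hand computation from the backpropagation example above.

```python
import torch

x, y_true = torch.tensor(2.0), torch.tensor(3.0)
w = torch.tensor(0.5, requires_grad=True)   # a trainable "parameter"

y_hat = w * x                  # the computation graph is recorded here
loss = (y_hat - y_true) ** 2

loss.backward()                # chain rule applied backward through the graph
print(w.grad)                  # tensor(-8.), matching the hand calculation
```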
Feature Importance (Saliency Maps)
For a trained model, $\frac{\partial f}{\partial x_i}$ (the partial of the output with respect to input feature $x_i$) tells us how sensitive the output is to each input feature. This creates "saliency maps" showing which pixels matter most for image classification.
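A hedged sketch of that idea in PyTorch: the model and input below are placeholders, but the pattern (require gradients on the input, backprop a class score, read off input.grad) is the standard one.

```python
import torch

# Placeholder "trained model" and input image; in practice these are real.
model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 8 * 8, 10))
image = torch.rand(1, 3, 8, 8, requires_grad=True)

score = model(image)[0, 3]     # score of some target class (index 3 here)
score.backward()               # partials of that score w.r.t. every pixel

saliency = image.grad.abs().max(dim=1).values   # per-pixel importance
print(saliency.shape)                           # torch.Size([1, 8, 8])
```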
Regularization via Gradients
Some regularization techniques penalize large gradients. For example, spectral normalization in GANs constrains the Lipschitz constant (maximum gradient magnitude) of the discriminator.
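Spectral normalization itself rescales weight matrices, but a closely related idea, the gradient penalty used in WGAN-GP, penalizes the input-gradient norm directly and is easy to sketch with torch.autograd.grad. The critic and samples below are placeholders.

```python
import torch

critic = torch.nn.Sequential(torch.nn.Linear(16, 1))    # placeholder critic
samples = torch.rand(32, 16, requires_grad=True)         # placeholder inputs

scores = critic(samples)
grads, = torch.autograd.grad(
    outputs=scores.sum(),    # summing yields per-sample input gradients
    inputs=samples,
    create_graph=True,       # so the penalty itself can be backpropagated
)

grad_norms = grads.norm(dim=1)                        # gradient magnitude per sample
gradient_penalty = ((grad_norms - 1.0) ** 2).mean()   # push magnitudes toward 1
print(gradient_penalty)
```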