Introduction
In PyTorch you call .backward() (TensorFlow's GradientTape does the same job) and gradients magically appear. But to truly understand optimization, you must be able to derive gradients by hand for simple networks.
This is the ultimate test of your understanding of the chain rule and partial derivatives. We will dissect a single neuron with Sigmoid activation and MSE loss, then learn how to use those gradients for optimization.
What We'll Learn
Part 1: Backpropagation
Compute $\partial L/\partial w$ and $\partial L/\partial b$ by hand
Part 2: Gradient Descent
Use gradients to iteratively minimize loss
The Setup: A Single Neuron
We define a computational graph with input x, weight w, bias b, and target y_true.
Variables
$x$ (Input)
$w$ (Weight)
$b$ (Bias)
$y_{\text{true}} = 1.0$ (Target)
Functions
Linear: $z = wx + b$
Activation: $a = \sigma(z) = \dfrac{1}{1 + e^{-z}}$
Loss: $L = (a - y_{\text{true}})^2$
Computation Graph
$x, w, b \;\rightarrow\; z = wx + b \;\rightarrow\; a = \sigma(z) \;\rightarrow\; L = (a - y_{\text{true}})^2$
Forward Pass
Calculate values from input to output:
1. Linear Combination: $z = wx + b = 7$
2. Sigmoid Activation: $a = \sigma(7) = \dfrac{1}{1 + e^{-7}} \approx 0.9991$
3. MSE Loss: $L = (0.9991 - 1.0)^2 \approx 8.3 \times 10^{-7}$
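Here is a minimal NumPy sketch of the forward pass. The text pins down $z = 7$ and $y_{\text{true}} = 1.0$ but not the individual inputs, so the values of x, w, and b below are illustrative assumptions chosen to reproduce $z = 7$:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative values (assumed), chosen so z = wx + b = 7 as in the text
x, w, b = 2.0, 3.0, 1.0
y_true = 1.0

z = w * x + b              # 1. linear combination: z = 7
a = sigmoid(z)             # 2. sigmoid activation: a ≈ 0.9991
L = (a - y_true) ** 2      # 3. MSE loss: L ≈ 8.3e-07

print(f"z = {z}, a = {a:.4f}, L = {L:.2e}")
```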
Backward Pass (Backpropagation)
We want $\dfrac{\partial L}{\partial w}$ and $\dfrac{\partial L}{\partial b}$. We can't jump directly; we must use the chain rule:
$$\frac{\partial L}{\partial w} = \frac{\partial L}{\partial a} \cdot \frac{\partial a}{\partial z} \cdot \frac{\partial z}{\partial w}, \qquad \frac{\partial L}{\partial b} = \frac{\partial L}{\partial a} \cdot \frac{\partial a}{\partial z} \cdot \frac{\partial z}{\partial b}$$
Part A: Loss w.r.t Activation
Function: $L = (a - y_{\text{true}})^2$, so $\dfrac{\partial L}{\partial a} = 2(a - y_{\text{true}}) \approx 2(0.9991 - 1.0) \approx -0.0018$
Part B: Activation w.r.t Linear
Sigmoid derivative: $\dfrac{\partial a}{\partial z} = \sigma(z)\,(1 - \sigma(z)) \approx 0.9991 \times 0.0009 \approx 0.0009$
Part C: Linear w.r.t Weight
Function: $z = wx + b$, so $\dfrac{\partial z}{\partial w} = x$ (and $\dfrac{\partial z}{\partial b} = 1$)
Chain Rule in Action
Now multiply all the local gradients together:
Gradient for Weight (w)
$\dfrac{\partial L}{\partial w} = 2(a - y_{\text{true}}) \cdot \sigma(z)(1 - \sigma(z)) \cdot x \approx (-0.0018) \times (0.0009) \times x \approx -1.7 \times 10^{-6}\, x$
Gradient for Bias (b)
$\dfrac{\partial L}{\partial b} = 2(a - y_{\text{true}}) \cdot \sigma(z)(1 - \sigma(z)) \cdot 1 \approx -1.7 \times 10^{-6}$
Note: both gradients share the same upstream factors; only the final local derivative differs ($\partial z/\partial w = x$ versus $\partial z/\partial b = 1$).
Why So Small?
The gradients are tiny because the prediction (0.999) is very close to the target (1.0), so the loss is already nearly minimized. Also, Sigmoid at z=7 has a very small derivative (0.0009) due to saturation.
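To make the backward pass concrete, here is a NumPy sketch that computes all three local gradients and chains them, using the same illustrative x, w, b as the forward-pass snippet above; the finite-difference check at the end is a standard way to verify a hand-derived gradient:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x, w, b, y_true = 2.0, 3.0, 1.0, 1.0   # same illustrative values as above

# Forward pass
z = w * x + b
a = sigmoid(z)

# The three chain-rule factors (local gradients)
dL_da = 2 * (a - y_true)       # Part A: ~ -0.0018
da_dz = a * (1 - a)            # Part B: ~  0.0009
dz_dw = x                      # Part C: dz/dw = x
dz_db = 1.0                    #          dz/db = 1

# Multiply along the chain
dL_dw = dL_da * da_dz * dz_dw  # ~ -3.3e-06 (i.e. -1.7e-06 * x with x = 2)
dL_db = dL_da * da_dz * dz_db  # ~ -1.7e-06

# Verify dL/dw with a central finite difference
eps = 1e-6
L_plus  = (sigmoid((w + eps) * x + b) - y_true) ** 2
L_minus = (sigmoid((w - eps) * x + b) - y_true) ** 2
print(dL_dw, (L_plus - L_minus) / (2 * eps))   # the two should agree
```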
Gradient Descent: Using the Gradients
Now that we know how to compute gradients, we need to know what to do with them. Gradient Descent is the optimization algorithm that uses gradients to iteratively update parameters, minimizing the loss function.
The Core Idea
Start somewhere. Look at the slope. Take a small step downhill. Repeat until you reach the bottom.
The Blindfolded Hiker Analogy
Feel the Ground
Use your feet to determine which direction is "downhill." (Compute gradient)
Take a Step
Move in the downhill direction. Step size depends on confidence. (Update parameters)
Repeat
Keep going until the ground feels flat (gradient near 0). You've reached a minimum.
The Update Rule
Let $L(\theta)$ be our loss function, where $\theta$ represents all model parameters:
$$\theta_{t+1} = \theta_t - \eta \, \nabla L(\theta_t)$$
$\theta$: Parameters we're optimizing (weights, biases). Could be millions of values.
$\eta$: Learning rate. A hyperparameter controlling step size. Critical to get right.
$\nabla L$: Gradient vector. Points toward steepest ascent. Hence the minus sign.
Why Subtract the Gradient?
The gradient points in the direction of steepest increase. Since we want to minimize loss (go downhill), we move in the opposite direction. If we added the gradient, we'd be doing gradient ascent.
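As a concrete illustration, here is a minimal gradient descent loop that trains the single neuron from Part 1 using the gradients we derived; the starting point, learning rate, and step count are arbitrary choices for this sketch:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x, y_true = 2.0, 1.0     # illustrative data (assumed), as in Part 1
w, b = -1.0, 0.0         # start somewhere away from the solution
eta = 0.5                # learning rate

for step in range(500):
    a = sigmoid(w * x + b)                        # forward pass
    grad_common = 2 * (a - y_true) * a * (1 - a)  # dL/da * da/dz
    grad_w, grad_b = grad_common * x, grad_common
    w -= eta * grad_w                             # theta <- theta - eta * grad
    b -= eta * grad_b

print(f"final loss: {(sigmoid(w * x + b) - y_true) ** 2:.2e}")
```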
Interactive: Learning Rate Effects
Experiment with different learning rates. Watch how too small leads to slow convergence, too large leads to divergence, and just right leads to smooth optimization.
[Interactive demo: Gradient Descent Optimization — watch the optimizer take steps downhill]
Learning Rate Analysis
The learning rate is the most critical hyperparameter.
Too Small
Training takes forever. Model might get stuck in local minima early.
Symptoms: Loss decreases very slowly over thousands of epochs.
Just Right
Loss decreases steadily and plateaus at minimum. Efficient convergence.
Strategy: Often use learning rate decay (start high, decrease).
Too Large
Loss oscillates or explodes to infinity/NaN. You overshoot the valley.
Symptoms: Loss goes up, NaN values, training crashes.
Learning Rate Schedules
- Step Decay: Reduce LR by factor every N epochs (e.g., halve every 30 epochs)
- Exponential Decay: $\eta_t = \eta_0 e^{-kt}$ for a decay constant $k$
- Cosine Annealing: Smoothly decrease to near-zero following cosine curve
- Warmup: Start very small, increase, then decay. Helps with large batches.
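Minimal sketches of three of these schedules (warmup omitted); the function names and constants are illustrative, not from any particular library:

```python
import math

def step_decay(eta0, epoch, drop=0.5, every=30):
    """Step decay: multiply by `drop` every `every` epochs (e.g. halve every 30)."""
    return eta0 * drop ** (epoch // every)

def exponential_decay(eta0, epoch, k=0.05):
    """Exponential decay: eta_t = eta_0 * exp(-k * t)."""
    return eta0 * math.exp(-k * epoch)

def cosine_annealing(eta0, epoch, total_epochs=100):
    """Cosine annealing: smooth decrease from eta_0 toward zero."""
    return 0.5 * eta0 * (1 + math.cos(math.pi * epoch / total_epochs))

for epoch in (0, 30, 60, 90):
    print(epoch, step_decay(0.1, epoch), exponential_decay(0.1, epoch),
          cosine_annealing(0.1, epoch))
```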
Gradient Descent Variants
How much data do we use to compute each gradient update?
Batch Gradient Descent
Use ALL training examples for ONE update.
Stochastic Gradient Descent (SGD)
Use ONE random example for ONE update.
Mini-Batch Gradient Descent (Standard)
Use a small batch (32, 64, 128) of examples.
This is what everyone uses in practice.
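A minimal mini-batch SGD sketch on a synthetic linear-regression dataset (the data, batch size, and learning rate are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic dataset (assumed): 1000 examples, 3 features, known true weights
X = rng.normal(size=(1000, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=1000)

w = np.zeros(3)
eta, batch_size = 0.1, 32        # 32/64/128 are the usual batch sizes

for epoch in range(20):
    perm = rng.permutation(len(X))                  # reshuffle every epoch
    for i in range(0, len(X), batch_size):
        idx = perm[i:i + batch_size]
        Xb, yb = X[idx], y[idx]
        grad = 2 * Xb.T @ (Xb @ w - yb) / len(idx)  # MSE gradient on the batch
        w -= eta * grad

print(w)  # recovers approximately [2.0, -1.0, 0.5]
```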
Convergence Conditions
When does gradient descent actually find the minimum?
Guarantees for Convex Functions
If $L$ is convex (bowl-shaped), gradient descent with an appropriate learning rate is guaranteed to find the global minimum.
Convex
Linear regression, logistic regression. Single global minimum.
Non-Convex
Neural networks. Many local minima, saddle points. No guarantee.
The Surprising Success of Deep Learning
Despite non-convexity, deep networks train well because: (1) local minima in high dimensions are often nearly as good as global minima, (2) SGD noise helps escape bad minima, (3) over-parameterization creates many good solutions.
ML Applications
Momentum
Instead of using only the current gradient, accumulate a "velocity" term: $v_t = \beta v_{t-1} + \nabla L(\theta_t)$, then update $\theta_{t+1} = \theta_t - \eta\, v_t$. Helps power through saddle points and noisy regions.
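A sketch of this update on a toy one-parameter loss (the loss $L(\theta) = \theta^2$ and the hyperparameters are illustrative):

```python
def grad(theta):
    return 2 * theta   # illustrative loss (assumed): L(theta) = theta^2

theta, v = 5.0, 0.0
eta, beta = 0.1, 0.9   # step size and momentum coefficient

for _ in range(200):
    v = beta * v + grad(theta)   # accumulate velocity
    theta -= eta * v             # step along the smoothed direction

print(theta)  # ~0: the minimum of theta^2
```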
Adam Optimizer
Combines momentum with adaptive learning rates per parameter. Maintains running averages of both first and second moments of gradients. The default choice for most deep learning.
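A from-scratch sketch of the Adam update on the same toy loss; $\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 10^{-8}$ are the usual defaults, while the learning rate is chosen for this toy problem:

```python
def grad(theta):
    return 2 * theta   # same illustrative loss: L(theta) = theta^2

theta, m, v = 5.0, 0.0, 0.0
eta = 0.01                               # step size for this toy problem
beta1, beta2, eps = 0.9, 0.999, 1e-8     # the usual Adam defaults

for t in range(1, 1001):
    g = grad(theta)
    m = beta1 * m + (1 - beta1) * g          # first moment (momentum)
    v = beta2 * v + (1 - beta2) * g * g      # second moment (per-parameter scale)
    m_hat = m / (1 - beta1 ** t)             # bias correction for early steps
    v_hat = v / (1 - beta2 ** t)
    theta -= eta * m_hat / (v_hat ** 0.5 + eps)

print(theta)  # ~0: the minimum
```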
Gradient Clipping
When gradients explode (common in RNNs), clip them to a maximum norm. Prevents weight updates from being too large and destabilizing training.
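A minimal clip-by-norm sketch in NumPy; in PyTorch the equivalent operation is provided by torch.nn.utils.clip_grad_norm_:

```python
import numpy as np

def clip_by_norm(grads, max_norm):
    """Rescale the whole gradient vector if its L2 norm exceeds max_norm."""
    total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    if total_norm > max_norm:
        scale = max_norm / total_norm
        grads = [g * scale for g in grads]
    return grads

# Exploding gradients get rescaled; small ones pass through untouched.
grads = [np.array([300.0, -400.0])]          # norm = 500
print(clip_by_norm(grads, max_norm=5.0))     # -> [array([ 3., -4.])]
```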
Learning Rate Finder
Technique from fast.ai: gradually increase LR during one epoch, plot loss vs LR. The optimal LR is usually where loss decreases fastest, just before it explodes.
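A rough sketch of the idea on a synthetic regression problem; the growth factor and bounds are arbitrary, and a real implementation would smooth the recorded losses before plotting:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(256, 3))
y = X @ np.array([2.0, -1.0, 0.5])    # synthetic data (assumed)

w = np.zeros(3)
lrs, losses = [], []
lr = 1e-6
while lr < 10.0:
    pred = X @ w
    loss = np.mean((pred - y) ** 2)
    if not np.isfinite(loss) or loss > 1e6:
        break                         # loss has exploded: stop the sweep
    lrs.append(lr); losses.append(loss)
    grad = 2 * X.T @ (pred - y) / len(y)
    w -= lr * grad                    # one update at the current rate
    lr *= 1.2                         # exponentially increase the rate

# Inspect losses vs lrs (log x-axis): a good LR sits where the loss
# drops fastest, just before the blow-up.
```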