Introduction
Computers are surprisingly limited. At the hardware level, they can only add, subtract, multiply, and divide. So how does your calculator compute sin(x) or e^x?
It cheats. It replaces these complex transcendental functions with really long polynomials. Polynomials are just addition and multiplication, which computers understand.
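For instance, a short polynomial already nails sin(x) near zero. A minimal sketch in plain Python (the degree-7 polynomial is an illustrative choice, not what any particular math library actually ships):

```python
import math

def sin_poly(x):
    """Degree-7 polynomial for sin(x): x - x^3/3! + x^5/5! - x^7/7!."""
    return x - x**3 / 6 + x**5 / 120 - x**7 / 5040

x = 0.5
print(sin_poly(x))  # ≈ 0.47942553, agrees with math.sin(0.5) to ~8 decimals
print(math.sin(x))
```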
Why This Matters for ML
- Gradient Descent: First-order Taylor approx (linear) justifies the update rule.
- Newton's Method: Second-order Taylor approx (quadratic) enables smarter steps.
- XGBoost: Uses Taylor expansion of loss for efficient tree construction.
- Understanding Optimization: Taylor series explains why optimization algorithms work.
The Core Idea
Imagine forging a signature. To make your forgery match the original at a specific point, you need to match several properties:
- Position: your approximation passes through the same point, f(a) = P(a).
- Velocity (slope): the curve heads in the same direction, f'(a) = P'(a).
- Acceleration (curvature): the curve bends at the same rate, f''(a) = P''(a).
Match enough derivatives at point a, and your polynomial becomes indistinguishable from the original function near that point.
The Taylor Formula
The Taylor series of f(x) centered at point a is:
General form:

$$f(x) = \sum_{n=0}^{\infty} \frac{f^{(n)}(a)}{n!}(x - a)^n = f(a) + f'(a)(x - a) + \frac{f''(a)}{2!}(x - a)^2 + \frac{f'''(a)}{3!}(x - a)^3 + \cdots$$
Maclaurin Series
When a = 0, we call it a Maclaurin series. Common examples you should know:

$$e^x = 1 + x + \frac{x^2}{2!} + \frac{x^3}{3!} + \cdots$$

$$\sin(x) = x - \frac{x^3}{3!} + \frac{x^5}{5!} - \cdots$$

$$\cos(x) = 1 - \frac{x^2}{2!} + \frac{x^4}{4!} - \cdots$$

$$\frac{1}{1 - x} = 1 + x + x^2 + x^3 + \cdots \quad (|x| < 1)$$
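If you want to generate these expansions yourself, sympy's series function does it symbolically. A small sketch (assuming sympy is available):

```python
import sympy as sp

x = sp.symbols('x')

# Maclaurin expansions (around x0 = 0), truncated at x^5
for f in (sp.exp(x), sp.sin(x), sp.cos(x), 1 / (1 - x)):
    print(f, '->', sp.series(f, x, 0, 6).removeO())
```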
Convergence
The series converges to f(x) only within a "radius of convergence," and the farther x is from a, the more terms you need for the same accuracy. For e^x, sin, and cos that radius is infinite; for 1/(1 − x) it is |x| < 1.
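A quick numerical illustration, reusing the degree-7 sine polynomial from above: the truncation error is tiny near the center and blows up as x moves away.

```python
import math

def sin_poly(x):
    # Degree-7 Maclaurin polynomial for sin(x), centered at a = 0
    return x - x**3 / 6 + x**5 / 120 - x**7 / 5040

for x in (0.5, 1.0, 2.0, 4.0):
    print(f"x = {x}: |error| = {abs(sin_poly(x) - math.sin(x)):.2e}")
```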
Interactive: Taylor Approximation
Watch how higher-order Taylor polynomials approximate sin(x) better and better near the center point.
Orders of Approximation
In optimization, we typically use only the first few terms:
- First-Order (Linear): f(x) ≈ f(a) + f'(a)(x − a). Approximates f as a straight line (the tangent). This is what gradient descent "sees."
- Second-Order (Quadratic): f(x) ≈ f(a) + f'(a)(x − a) + ½f''(a)(x − a)². Adds curvature information from the second derivative. This is what Newton's method uses.
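To make the orders concrete, here is a sketch comparing the linear and quadratic approximations of f(x) = e^x around a = 0 (the function and the evaluation points are arbitrary illustrative choices):

```python
import math

a = 0.0
f_a = df_a = d2f_a = math.exp(a)   # e^x is its own derivative, so all equal 1 at a = 0

def linear(x):       # first-order: the tangent line gradient descent "sees"
    return f_a + df_a * (x - a)

def quadratic(x):    # second-order: adds the curvature term Newton's method uses
    return linear(x) + 0.5 * d2f_a * (x - a) ** 2

for x in (0.1, 0.5, 1.0):
    print(f"x = {x}: exact = {math.exp(x):.4f}, "
          f"linear = {linear(x):.4f}, quadratic = {quadratic(x):.4f}")
```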
Multivariate Taylor Series
Neural networks have vector inputs. The multivariate second-order Taylor expansion is:

$$f(\mathbf{x}) \approx f(\mathbf{a}) + \nabla f(\mathbf{a})^\top (\mathbf{x} - \mathbf{a}) + \frac{1}{2} (\mathbf{x} - \mathbf{a})^\top H(\mathbf{a}) (\mathbf{x} - \mathbf{a})$$
- Constant term f(a): the current value.
- Linear term ∇f(a)ᵀ(x − a): the gradient gives the direction of change.
- Quadratic term ½(x − a)ᵀH(x − a): the Hessian carries curvature information.
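A small numpy sketch of the expansion above, using a hand-picked 2-D function whose gradient and Hessian we can write down by hand (the function and points are assumptions for illustration):

```python
import numpy as np

def f(v):
    x, y = v
    return x**2 + 3*x*y + 2*y**2 + np.sin(x)

def grad(v):                       # nabla f
    x, y = v
    return np.array([2*x + 3*y + np.cos(x), 3*x + 4*y])

def hessian(v):                    # H
    x, y = v
    return np.array([[2 - np.sin(x), 3.0],
                     [3.0,           4.0]])

a = np.array([0.5, -0.5])
g, H = grad(a), hessian(a)

def quad_approx(v):
    d = v - a
    return f(a) + g @ d + 0.5 * d @ H @ d   # constant + linear + quadratic terms

v = np.array([0.6, -0.4])
print(f(v), quad_approx(v))        # nearly identical this close to a
```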
Connection to Optimization
Why Gradient Descent Works
Near our current point θ, the loss looks like a plane:

$$L(\theta + \epsilon) \approx L(\theta) + \nabla L(\theta)^\top \epsilon$$

To minimize, we want the inner product ∇L(θ)ᵀε to be as negative as possible. This means pointing ε opposite to the gradient, which gives the update ε = −η∇L(θ).
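A minimal gradient descent loop built directly on that first-order picture (toy quadratic loss, made-up learning rate):

```python
import numpy as np

target = np.array([3.0, -1.0])          # the loss is minimized here

def grad(theta):
    return 2 * (theta - target)         # gradient of ||theta - target||^2

theta = np.zeros(2)
lr = 0.1
for _ in range(100):
    theta -= lr * grad(theta)           # epsilon = -lr * gradient
print(theta)                            # ≈ [3., -1.]
```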
Newton's Method
Uses the quadratic approximation. If f is approximately a parabola near θ, we can jump directly to the minimum of that parabola:

$$\theta_{\text{new}} = \theta - H^{-1} \nabla L(\theta)$$
Much faster convergence than GD, but computing H⁻¹ is O(n³).
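For contrast, one Newton step on a purely quadratic loss lands exactly on the minimum, because the second-order Taylor expansion of a quadratic is the quadratic itself (same toy loss as above; np.linalg.solve avoids forming H⁻¹ explicitly):

```python
import numpy as np

target = np.array([3.0, -1.0])

def grad(theta):
    return 2 * (theta - target)     # gradient of ||theta - target||^2

H = 2 * np.eye(2)                   # Hessian of this quadratic is constant

theta = np.zeros(2)
theta = theta - np.linalg.solve(H, grad(theta))   # Newton step
print(theta)                        # exactly [3., -1.] in a single step
```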
ML Applications
XGBoost's Second-Order Objective
XGBoost approximates the loss around the current prediction ŷ using a second-order Taylor expansion:

$$L(y, \hat{y} + f(x)) \approx L(y, \hat{y}) + g\, f(x) + \frac{1}{2}\, h\, f(x)^2$$

where g = ∂L/∂ŷ is the gradient and h = ∂²L/∂ŷ² is the Hessian of the loss with respect to the current prediction.
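A sketch of those quantities for logistic loss, plus the closed-form leaf weight the second-order approximation yields (λ and the toy data are assumptions; the variable names are mine, not XGBoost's API):

```python
import numpy as np

def grad_hess_logloss(y_true, raw_score):
    """Per-example g and h of logistic loss w.r.t. the raw (logit) prediction."""
    p = 1.0 / (1.0 + np.exp(-raw_score))   # predicted probability
    return p - y_true, p * (1.0 - p)       # gradient, Hessian

# Toy labels and current raw scores for the examples falling in one leaf
y = np.array([1, 0, 1, 1])
scores = np.array([0.2, -0.1, 0.0, 0.4])
g, h = grad_hess_logloss(y, scores)

lam = 1.0                                  # L2 regularization strength (assumed)
w_star = -g.sum() / (h.sum() + lam)        # optimal leaf weight under the
print(w_star)                              # second-order approximation
```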
Natural Gradient
Uses Fisher information matrix (related to Hessian) to take steps in the "natural" parameter space. More efficient for probabilistic models.
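A toy sketch of the idea for a single Bernoulli parameter p, where the Fisher information has the closed form 1/(p(1−p)) (the data and step size are made up):

```python
import numpy as np

def natural_gradient_step(p, data, lr=0.1):
    grad = np.mean(data / p - (1 - data) / (1 - p))   # d/dp of mean log-likelihood
    fisher = 1.0 / (p * (1 - p))                      # Fisher information of Bernoulli(p)
    return p + lr * grad / fisher                     # precondition the step by F^-1

data = np.array([1, 1, 0, 1, 0, 1, 1, 1])             # toy coin flips, mean 0.75
p = 0.5
for _ in range(50):
    p = natural_gradient_step(p, data)
print(p)                                              # approaches the sample mean 0.75
```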
BFGS / L-BFGS
Quasi-Newton methods that approximate the Hessian from gradient history. Get second-order benefits without computing full Hessian.
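In practice you rarely implement this yourself; scipy ships an L-BFGS routine. A quick demo on the Rosenbrock function (chosen only because scipy provides it and its gradient):

```python
import numpy as np
from scipy.optimize import minimize, rosen, rosen_der

x0 = np.zeros(5)
result = minimize(rosen, x0, jac=rosen_der, method='L-BFGS-B')
print(result.x)   # ≈ [1, 1, 1, 1, 1], the Rosenbrock minimum
```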
Activation Function Approximations
The exact GELU is x·Φ(x), where Φ is the standard normal CDF. It can be approximated for efficient computation with a tanh of a cubic polynomial, whose correction term comes from a series expansion:

$$\text{GELU}(x) \approx \frac{1}{2}\, x \left(1 + \tanh\!\left[\sqrt{2/\pi}\left(x + 0.044715\, x^3\right)\right]\right)$$
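A quick comparison of the exact form (via erf) against that approximation, in plain Python:

```python
import math

def gelu_exact(x):
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def gelu_tanh(x):
    c = math.sqrt(2.0 / math.pi)
    return 0.5 * x * (1.0 + math.tanh(c * (x + 0.044715 * x**3)))

for x in (-2.0, -0.5, 0.0, 0.5, 2.0):
    print(f"{x:5.1f}  exact = {gelu_exact(x):.6f}  approx = {gelu_tanh(x):.6f}")
```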