Introduction
Computers are surprisingly limited. At the hardware level, they can only add, subtract, multiply, and divide. So how does your calculator compute sin(x) or e^x?
It cheats. It replaces these complex transcendental functions with really long polynomials. Polynomials are just addition and multiplication, which computers understand.
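For instance, a short polynomial already nails sin(x) near zero. A minimal sketch in plain Python (the degree-7 polynomial is an illustrative choice, not what any particular math library actually ships):

```python
import math

def sin_poly(x):
    """Degree-7 polynomial for sin(x): x - x^3/3! + x^5/5! - x^7/7!."""
    return x - x**3 / 6 + x**5 / 120 - x**7 / 5040

x = 0.5
print(sin_poly(x))  # ≈ 0.47942553, agrees with math.sin(0.5) to ~8 decimals
print(math.sin(x))
```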
Why This Matters for ML
- Gradient Descent: First-order Taylor approx (linear) justifies the update rule.
- Newton's Method: Second-order Taylor approx (quadratic) enables smarter steps.
- XGBoost: Uses Taylor expansion of loss for efficient tree construction.
- Understanding Optimization: Taylor series explains why optimization algorithms work.
The Core Idea
Imagine forging a signature. To make your forgery match the original at a specific point, you need to match several properties:
- Position: your approximation passes through the same point, f(a) = P(a).
- Velocity (slope): the curve heads in the same direction, f'(a) = P'(a).
- Acceleration (curvature): the curve bends at the same rate, f''(a) = P''(a).
Match enough derivatives at point a, and your polynomial becomes indistinguishable from the original function near that point.
The Taylor Formula
The Taylor series of f(x) centered at point a is:
General form:

$$f(x) = \sum_{n=0}^{\infty} \frac{f^{(n)}(a)}{n!}(x - a)^n = f(a) + f'(a)(x - a) + \frac{f''(a)}{2!}(x - a)^2 + \frac{f'''(a)}{3!}(x - a)^3 + \cdots$$
Maclaurin Series
When a = 0, we call it a Maclaurin series. Common examples you should know:

$$e^x = 1 + x + \frac{x^2}{2!} + \frac{x^3}{3!} + \cdots$$

$$\sin(x) = x - \frac{x^3}{3!} + \frac{x^5}{5!} - \cdots$$

$$\cos(x) = 1 - \frac{x^2}{2!} + \frac{x^4}{4!} - \cdots$$

$$\frac{1}{1 - x} = 1 + x + x^2 + x^3 + \cdots \quad (|x| < 1)$$
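If you want to generate these expansions yourself, sympy's series function does it symbolically. A small sketch (assuming sympy is available):

```python
import sympy as sp

x = sp.symbols('x')

# Maclaurin expansions (around x0 = 0), truncated at x^5
for f in (sp.exp(x), sp.sin(x), sp.cos(x), 1 / (1 - x)):
    print(f, '->', sp.series(f, x, 0, 6).removeO())
```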
Convergence
The series converges to f(x) only within a "radius of convergence," and the farther x is from a, the more terms you need for the same accuracy. For e^x, sin, and cos that radius is infinite; for 1/(1 − x) it is |x| < 1.
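A quick numerical illustration, reusing the degree-7 sine polynomial from above: the truncation error is tiny near the center and blows up as x moves away.

```python
import math

def sin_poly(x):
    # Degree-7 Maclaurin polynomial for sin(x), centered at a = 0
    return x - x**3 / 6 + x**5 / 120 - x**7 / 5040

for x in (0.5, 1.0, 2.0, 4.0):
    print(f"x = {x}: |error| = {abs(sin_poly(x) - math.sin(x)):.2e}")
```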
Interactive: Taylor Approximation
Watch how higher-order Taylor polynomials approximate sin(x) better and better near the center point.
Orders of Approximation
In optimization, we typically use only the first few terms:
- First-Order (Linear): f(x) ≈ f(a) + f'(a)(x − a). Approximates f as a straight line (the tangent). This is what gradient descent "sees."
- Second-Order (Quadratic): f(x) ≈ f(a) + f'(a)(x − a) + ½f''(a)(x − a)². Adds curvature information from the second derivative. This is what Newton's method uses.
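To make the orders concrete, here is a sketch comparing the linear and quadratic approximations of f(x) = e^x around a = 0 (the function and the evaluation points are arbitrary illustrative choices):

```python
import math

a = 0.0
f_a = df_a = d2f_a = math.exp(a)   # e^x is its own derivative, so all equal 1 at a = 0

def linear(x):       # first-order: the tangent line gradient descent "sees"
    return f_a + df_a * (x - a)

def quadratic(x):    # second-order: adds the curvature term Newton's method uses
    return linear(x) + 0.5 * d2f_a * (x - a) ** 2

for x in (0.1, 0.5, 1.0):
    print(f"x = {x}: exact = {math.exp(x):.4f}, "
          f"linear = {linear(x):.4f}, quadratic = {quadratic(x):.4f}")
```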
Multivariate Taylor Series
Neural networks have vector inputs. The multivariate second-order Taylor expansion is:

$$f(\mathbf{x}) \approx f(\mathbf{a}) + \nabla f(\mathbf{a})^\top (\mathbf{x} - \mathbf{a}) + \frac{1}{2} (\mathbf{x} - \mathbf{a})^\top H(\mathbf{a}) (\mathbf{x} - \mathbf{a})$$
- Constant term f(a): the current value.
- Linear term ∇f(a)ᵀ(x − a): the gradient gives the direction of change.
- Quadratic term ½(x − a)ᵀH(x − a): the Hessian carries curvature information.
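A small numpy sketch of the expansion above, using a hand-picked 2-D function whose gradient and Hessian we can write down by hand (the function and points are assumptions for illustration):

```python
import numpy as np

def f(v):
    x, y = v
    return x**2 + 3*x*y + 2*y**2 + np.sin(x)

def grad(v):                       # nabla f
    x, y = v
    return np.array([2*x + 3*y + np.cos(x), 3*x + 4*y])

def hessian(v):                    # H
    x, y = v
    return np.array([[2 - np.sin(x), 3.0],
                     [3.0,           4.0]])

a = np.array([0.5, -0.5])
g, H = grad(a), hessian(a)

def quad_approx(v):
    d = v - a
    return f(a) + g @ d + 0.5 * d @ H @ d   # constant + linear + quadratic terms

v = np.array([0.6, -0.4])
print(f(v), quad_approx(v))        # nearly identical this close to a
```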
Connection to Optimization
Why Gradient Descent Works
Near our current point θ, the loss looks like a plane:

$$L(\theta + \epsilon) \approx L(\theta) + \nabla L(\theta)^\top \epsilon$$

To minimize, we want the inner product ∇L(θ)ᵀε to be as negative as possible. This means pointing ε opposite to the gradient, which gives the update ε = −η∇L(θ).
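A minimal gradient descent loop built directly on that first-order picture (toy quadratic loss, made-up learning rate):

```python
import numpy as np

target = np.array([3.0, -1.0])          # the loss is minimized here

def grad(theta):
    return 2 * (theta - target)         # gradient of ||theta - target||^2

theta = np.zeros(2)
lr = 0.1
for _ in range(100):
    theta -= lr * grad(theta)           # epsilon = -lr * gradient
print(theta)                            # ≈ [3., -1.]
```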
Newton's Method
Uses the quadratic approximation. If f is approximately a parabola near θ, we can jump directly to the minimum of that parabola:

$$\theta_{\text{new}} = \theta - H^{-1} \nabla L(\theta)$$
Much faster convergence than GD, but computing H⁻¹ is O(n³).
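For contrast, one Newton step on a purely quadratic loss lands exactly on the minimum, because the second-order Taylor expansion of a quadratic is the quadratic itself (same toy loss as above; np.linalg.solve avoids forming H⁻¹ explicitly):

```python
import numpy as np

target = np.array([3.0, -1.0])

def grad(theta):
    return 2 * (theta - target)     # gradient of ||theta - target||^2

H = 2 * np.eye(2)                   # Hessian of this quadratic is constant

theta = np.zeros(2)
theta = theta - np.linalg.solve(H, grad(theta))   # Newton step
print(theta)                        # exactly [3., -1.] in a single step
```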
ML Applications
XGBoost's Second-Order Objective
XGBoost approximates the loss around the current prediction ŷ using a second-order Taylor expansion:

$$L(y, \hat{y} + f(x)) \approx L(y, \hat{y}) + g\, f(x) + \frac{1}{2}\, h\, f(x)^2$$

where g = ∂L/∂ŷ is the gradient and h = ∂²L/∂ŷ² is the Hessian of the loss with respect to the current prediction.
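A sketch of those quantities for logistic loss, plus the closed-form leaf weight the second-order approximation yields (λ and the toy data are assumptions; the variable names are mine, not XGBoost's API):

```python
import numpy as np

def grad_hess_logloss(y_true, raw_score):
    """Per-example g and h of logistic loss w.r.t. the raw (logit) prediction."""
    p = 1.0 / (1.0 + np.exp(-raw_score))   # predicted probability
    return p - y_true, p * (1.0 - p)       # gradient, Hessian

# Toy labels and current raw scores for the examples falling in one leaf
y = np.array([1, 0, 1, 1])
scores = np.array([0.2, -0.1, 0.0, 0.4])
g, h = grad_hess_logloss(y, scores)

lam = 1.0                                  # L2 regularization strength (assumed)
w_star = -g.sum() / (h.sum() + lam)        # optimal leaf weight under the
print(w_star)                              # second-order approximation
```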
Natural Gradient
Uses Fisher information matrix (related to Hessian) to take steps in the "natural" parameter space. More efficient for probabilistic models.
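A toy sketch of the idea for a single Bernoulli parameter p, where the Fisher information has the closed form 1/(p(1−p)) (the data and step size are made up):

```python
import numpy as np

def natural_gradient_step(p, data, lr=0.1):
    grad = np.mean(data / p - (1 - data) / (1 - p))   # d/dp of mean log-likelihood
    fisher = 1.0 / (p * (1 - p))                      # Fisher information of Bernoulli(p)
    return p + lr * grad / fisher                     # precondition the step by F^-1

data = np.array([1, 1, 0, 1, 0, 1, 1, 1])             # toy coin flips, mean 0.75
p = 0.5
for _ in range(50):
    p = natural_gradient_step(p, data)
print(p)                                              # approaches the sample mean 0.75
```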
BFGS / L-BFGS
Quasi-Newton methods that approximate the Hessian from gradient history. Get second-order benefits without computing full Hessian.
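In practice you rarely implement this yourself; scipy ships an L-BFGS routine. A quick demo on the Rosenbrock function (chosen only because scipy provides it and its gradient):

```python
import numpy as np
from scipy.optimize import minimize, rosen, rosen_der

x0 = np.zeros(5)
result = minimize(rosen, x0, jac=rosen_der, method='L-BFGS-B')
print(result.x)   # ≈ [1, 1, 1, 1, 1], the Rosenbrock minimum
```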
Activation Function Approximations
The exact GELU is x·Φ(x), where Φ is the standard normal CDF. It can be approximated for efficient computation with a tanh of a cubic polynomial, whose correction term comes from a series expansion:

$$\text{GELU}(x) \approx \frac{1}{2}\, x \left(1 + \tanh\!\left[\sqrt{2/\pi}\left(x + 0.044715\, x^3\right)\right]\right)$$
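A quick comparison of the exact form (via erf) against that approximation, in plain Python:

```python
import math

def gelu_exact(x):
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def gelu_tanh(x):
    c = math.sqrt(2.0 / math.pi)
    return 0.5 * x * (1.0 + math.tanh(c * (x + 0.044715 * x**3)))

for x in (-2.0, -0.5, 0.0, 0.5, 2.0):
    print(f"{x:5.1f}  exact = {gelu_exact(x):.6f}  approx = {gelu_tanh(x):.6f}")
```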