Optimization

Lagrange Multipliers & Duality

How to optimize when you have rules you cannot break. The mathematics behind SVMs and regularization.

Introduction

Standard gradient descent solves unconstrained optimization: "Find the lowest point anywhere in this landscape."

But real problems have constraints:

  • "Maximize returns, but don't invest more than $10,000."
  • "Find the nearest point to the query, but it must be on this surface."
  • "Train a classifier with maximum margin, but classify all training points correctly."

The Lagrange Method

Lagrange Multipliers transform these constrained problems into unconstrained ones that we can solve with standard calculus. This technique is the foundation of Support Vector Machines and explains why regularization works.

The Constrained Optimization Problem

The general form of a constrained optimization problem:

Minimize f(x)

subject to

g_i(x) = 0 (equality constraints)

h_j(x) \le 0 (inequality constraints)
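To make the form concrete, here is a minimal sketch (my example, not from the module) of how it maps onto SciPy's general-purpose solver. Note SciPy's convention: an "ineq" constraint means fun(x) ≥ 0, so h(x) ≤ 0 is passed as -h(x) ≥ 0.

    # Illustrative problem: minimize f(x, y) = (x - 2)^2 + (y - 1)^2
    # subject to g(x, y) = x + y - 2 = 0 and h(x, y) = x^2 - y <= 0.
    from scipy.optimize import minimize

    f = lambda v: (v[0] - 2) ** 2 + (v[1] - 1) ** 2
    constraints = [
        {"type": "eq", "fun": lambda v: v[0] + v[1] - 2},        # g(x) = 0
        {"type": "ineq", "fun": lambda v: -(v[0] ** 2 - v[1])},  # -h(x) >= 0
    ]
    res = minimize(f, x0=[0.0, 0.0], method="SLSQP", constraints=constraints)
    print(res.x)  # the constrained minimizer, here approximately (1, 1)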

Example: Maximum Margin Classifier

Objective: Minimize ||w||^2 (make weights small = wide margin)

Constraints: y_i(w^T x_i + b) \ge 1 for all training points (classify correctly with margin)

Geometric Intuition

Imagine a topographic map with elevation contours. You want to find the highest point, but you must stay on a specific hiking trail (the constraint).

The Key Insight

As you walk along the trail, you cross contour lines of the objective function. Your elevation changes. You keep walking until you reach a point where the trail becomes tangent to a contour line.

At this tangent point, the trail and the contour "kiss." Moving along the trail no longer changes your elevation. This is the optimum subject to the constraint.

The Parallel Gradient Condition

\nabla f = \lambda \nabla g

At the optimum, the gradient of the objective is parallel to the gradient of the constraint.

Why Parallel?

The gradient \nabla g points perpendicular to the constraint surface. If \nabla f had any component along the constraint surface, you could move in that direction to improve the objective. Only when \nabla f is purely perpendicular (parallel to \nabla g) are you at an optimum.

Interactive: Parallel Gradients

Watch how the optimal point moves as you change the constraint. Notice the gradient vectors always point in the same direction at the optimum.

[Interactive demo: maximize f(x,y) = x + y subject to x^2 + y^2 = r^2. At the optimum the gradients of objective and constraint are parallel; for the radius shown, x = y ≈ 1.061 and f(x,y) ≈ 2.121. This is the core condition: \nabla f = \lambda \nabla g.]
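To ground the picture numerically, here is a tiny sketch (assuming r = 1.5, back-solved from the readouts above) checking that the gradients are parallel at the optimum of the demo's problem:

    import numpy as np

    # f(x, y) = x + y, constraint g(x, y) = x^2 + y^2 - r^2 = 0
    r = 1.5                            # assumed radius, matching the ~1.061 readout
    x = y = r / np.sqrt(2)             # optimum by symmetry: x = y = r/sqrt(2)

    grad_f = np.array([1.0, 1.0])      # \nabla f = (1, 1) everywhere
    grad_g = np.array([2 * x, 2 * y])  # \nabla g = (2x, 2y)

    lam = grad_f[0] / grad_g[0]        # solve the first component for lambda
    print(np.allclose(grad_f, lam * grad_g))  # True: \nabla f = \lambda \nabla g
    print(x + y)                               # ~2.121, the f readout above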

The Lagrangian Function

We encode the parallel gradient condition into a single function called the Lagrangian:

\mathcal{L}(x, \lambda) = f(x) - \lambda \cdot g(x)

We've transformed a constrained problem in n variables into an unconstrained problem in n+1 variables.

Setting the gradient of the Lagrangian to zero gives us a system of equations:

Derivative w.r.t. x:

\nabla_x \mathcal{L} = \nabla f - \lambda \nabla g = 0

This recovers the parallel gradient condition.

Derivative w.r.t. λ:

\nabla_\lambda \mathcal{L} = -g(x) = 0

This forces the constraint to be satisfied.

What is λ?

The Lagrange multiplier λ has a beautiful interpretation: it is the "shadow price" of the constraint. If you could relax the constraint slightly, how much would the optimal value improve? At a rate of exactly λ per unit of relaxation.

Worked Example

Problem

Maximize f(x,y) = xy subject to x + y = 10.

(What rectangle with perimeter 20 has maximum area?)

Step 1: Form the Lagrangian

\mathcal{L} = xy - \lambda(x + y - 10)

Step 2: Take partial derivatives

\frac{\partial \mathcal{L}}{\partial x} = y - \lambda = 0

\frac{\partial \mathcal{L}}{\partial y} = x - \lambda = 0

\frac{\partial \mathcal{L}}{\partial \lambda} = -(x + y - 10) = 0

Step 3: Solve the system

From equations 1 and 2: y = \lambda and x = \lambda, so x = y.

From equation 3: x + y = 10, so 2x = 10, giving x = y = 5.

Answer: Maximum area = 25 when x = y = 5 (a square!)
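The whole system can also be solved mechanically. A SymPy sketch (my illustration, with the constraint generalized to x + y = c so the shadow-price claim from "What is λ?" can be checked too):

    import sympy as sp

    x, y, lam, c = sp.symbols("x y lam c")

    # Lagrangian for: maximize xy subject to x + y = c (the example uses c = 10)
    L = x * y - lam * (x + y - c)
    equations = [sp.diff(L, v) for v in (x, y, lam)]
    sol = sp.solve(equations, (x, y, lam), dict=True)[0]
    print(sol)  # {x: c/2, y: c/2, lam: c/2} -> x = y = 5, lam = 5 at c = 10

    # Shadow price: the optimal area is (c/2)^2, and d/dc of that is c/2 = lam.
    area = (c / 2) ** 2
    print(sp.simplify(sp.diff(area, c) - sol[lam]))  # 0: relaxing by dc gains lam*dc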

KKT Conditions

Equality constraints (g(x) = 0) are handled beautifully by Lagrange multipliers. But what about inequalities (h(x) \le 0)? "Maximize margin, but classify all points correctly." "Minimize loss, but keep weights bounded."

The Karush-Kuhn-Tucker (KKT) conditions, discovered independently by Karush (1939) and by Kuhn and Tucker (1951), generalize Lagrange multipliers to handle both equality and inequality constraints. For convex problems, they are both necessary AND sufficient for optimality.

The Four KKT Conditions

1. Stationarity

\nabla f + \sum_i \lambda_i \nabla g_i + \sum_j \mu_j \nabla h_j = 0

The gradient of the Lagrangian vanishes at the optimum.

2. Primal Feasibility

g_i(x) = 0, \quad h_j(x) \le 0

The solution must satisfy all constraints.

3. Dual Feasibility

\mu_j \ge 0

Inequality multipliers must be non-negative.

4. Complementary Slackness

\mu_j \cdot h_j(x) = 0

Either the constraint is active OR its multiplier is zero.

Complementary Slackness: The Key Insight

For each inequality constraint, either:

  • Active (h_j(x) = 0): On the boundary, the constraint matters, \mu_j can be positive.
  • Inactive (h_j(x) < 0): Strictly inside, the constraint doesn't bind, \mu_j = 0.

This is crucial for SVMs: only support vectors (points on the margin) have non-zero multipliers. This sparsity is what makes SVMs efficient.

Geometric Interpretation

At an optimal point with active constraints, the negative gradient -\nabla f must lie in the cone generated by the outward-pointing constraint normals. Intuitively: every "improving" direction is blocked by an active constraint.

Interactive: Complementary Slackness

Explore the KKT complementary slackness condition. Toggle between active and inactive constraints to see μ·h(x) = 0 in action.

[Interactive figure: a constrained minimum x* on the boundary of the region h(x) ≤ 0, with \nabla f and \nabla h drawn at x*. Conditions check for the active case: h(x*) = 0 (on the boundary), \mu > 0 (force required), and the product \mu \cdot h(x*) = 0: a non-zero multiplier times a zero constraint value.]

KKT Worked Example

Problem

\min_{x,y} \quad x^2 + y^2

subject to

x + y \ge 2

Find the point in the half-plane x + y ≥ 2 closest to the origin.

Step 1: Lagrangian

\mathcal{L} = x^2 + y^2 + \lambda(-x - y + 2)

Rewrite x + y ≥ 2 as -x - y + 2 ≤ 0

Step 2: Apply KKT

Stationarity: 2x - \lambda = 0, \quad 2y - \lambda = 0

Primal: x + y \ge 2

Dual: \lambda \ge 0

Complementary: \lambda(-x - y + 2) = 0

Step 3: Solve

From stationarity: x = λ/2 = y, so x = y.

Assume constraint is active (x + y = 2). Then 2x = 2, so x = y = 1, λ = 2.

Check: λ = 2 > 0 ✓, x + y = 2 ✓, λ·(2 - x - y) = 2·0 = 0 ✓

Solution: (x*, y*) = (1, 1) with minimum value f* = 2
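A quick numeric cross-check of this result (a sketch; SciPy's "ineq" means fun(x) ≥ 0, so x + y ≥ 2 is passed directly):

    from scipy.optimize import minimize

    # minimize x^2 + y^2 subject to x + y >= 2
    res = minimize(
        lambda v: v[0] ** 2 + v[1] ** 2,
        x0=[0.0, 0.0],
        method="SLSQP",
        constraints=[{"type": "ineq", "fun": lambda v: v[0] + v[1] - 2}],
    )
    print(res.x, res.fun)  # ~[1, 1] and 2.0, matching (x*, y*) = (1, 1), f* = 2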

Constraint Qualification

KKT conditions are necessary for optimality only under a constraint qualification. The most common is Slater's condition.

Slater's Condition

For a convex problem, there exists a strictly feasible point where:

  • All inequality constraints are strictly satisfied: hⱼ(x) < 0
  • All equality constraints are satisfied: gᵢ(x) = 0

Intuitively: there's a point strictly in the interior of the feasible region.

Why It Matters

When Slater's condition holds for a convex problem, strong duality holds and KKT becomes both necessary and sufficient. Find a point satisfying KKT → you've solved the problem.

Primal-Dual Gap & Strong Duality

Every constrained optimization problem has a dual problem. This isn't just mathematical curiosity; it has deep practical implications.

Primal Problem

Minimize f(x) over x, subject to constraints.

Optimal value: p^*

Dual Problem

Maximize the dual function \min_x \mathcal{L}(x, \lambda) over \lambda \ge 0.

Optimal value: d^*

Weak Duality (Always True)

d^* \le p^*

The dual provides a lower bound on the primal. This is always true, for any optimization problem.

Strong Duality (For Convex Problems)

d^* = p^*

For convex problems satisfying Slater's condition, the gap closes. Solving the dual is equivalent to solving the primal.
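For the half-plane example from the KKT section, the dual function has a closed form, so the zero gap can be seen directly (a sketch using that same problem):

    import numpy as np

    # Primal: min x^2 + y^2 s.t. x + y >= 2, with p* = 2 at (1, 1).
    # Dual function: g(lam) = min_{x,y} [x^2 + y^2 + lam * (2 - x - y)].
    # The inner minimum is at x = y = lam/2, so g(lam) = 2*lam - lam**2 / 2.
    g = lambda lam: 2 * lam - lam**2 / 2

    lams = np.linspace(0, 4, 401)
    values = g(lams)
    print(values.max())           # 2.0 -> d* = p*: the gap is zero
    print(lams[values.argmax()])  # 2.0, the same multiplier KKT found earlier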

SVM Application

Support Vector Machines are the most famous application of Lagrangian duality in ML. The dual formulation enabled the kernel trick.

SVM Problem

\min_{w,b} \quad \frac{1}{2}||w||^2

subject to

y_i(w^T x_i + b) \ge 1 \quad \forall i

Maximize margin while correctly classifying all points.

Step 1: Form the Lagrangian

\mathcal{L} = \frac{1}{2}||w||^2 - \sum_{i} \alpha_i [y_i(w^T x_i + b) - 1]

Step 2: Apply Stationarity

\frac{\partial \mathcal{L}}{\partial w} = 0 \Rightarrow w = \sum_i \alpha_i y_i x_i

\frac{\partial \mathcal{L}}{\partial b} = 0 \Rightarrow \sum_i \alpha_i y_i = 0

Step 3: Substitute into Dual

\max_\alpha \sum_i \alpha_i - \frac{1}{2}\sum_{i,j} \alpha_i \alpha_j y_i y_j (x_i^T x_j)

subject to \alpha_i \ge 0 and \sum_i \alpha_i y_i = 0. The dual depends only on the dot products x_i^T x_j.

The Kernel Trick

Since the dual only needs dot products, replace x_i^T x_j with a kernel K(x_i, x_j). This implicitly maps data to higher dimensions without computing coordinates. RBF, polynomial, and other kernels became possible only because of duality.
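A tiny check (my example) of why this works: the polynomial kernel K(x, z) = (x^T z)^2 on 2D inputs equals an ordinary dot product in a 3D feature space that is never explicitly constructed.

    import numpy as np

    # Feature map realizing K(x, z) = (x . z)^2 in 2D:
    # phi(x) = (x1^2, sqrt(2)*x1*x2, x2^2)
    def phi(x):
        return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

    x = np.array([1.0, 2.0])
    z = np.array([3.0, 0.5])

    explicit = phi(x) @ phi(z)  # dot product in the mapped 3D space
    implicit = (x @ z) ** 2     # kernel evaluated in the original 2D space
    print(np.isclose(explicit, implicit))  # True: same value, no mapping computed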

Complementary Slackness in SVMs

The KKT condition \alpha_i [y_i(w^T x_i + b) - 1] = 0 has profound implications:

  • Points on the margin (y_i(w^T x_i + b) = 1): \alpha_i > 0. These are the support vectors.
  • Points beyond the margin (y_i(w^T x_i + b) > 1): \alpha_i = 0. These don't influence the solution.

Only support vectors define the hyperplane. This sparsity makes SVMs memory-efficient.
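This sparsity is directly visible in scikit-learn, whose SVC exposes the support-vector indices (support_) and the products \alpha_i y_i (dual_coef_). A sketch on synthetic data:

    import numpy as np
    from sklearn.svm import SVC

    # Two separable Gaussian blobs, labeled -1 and +1
    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(-2, 1, (20, 2)), rng.normal(2, 1, (20, 2))])
    y = np.array([-1] * 20 + [1] * 20)

    clf = SVC(kernel="linear", C=10.0).fit(X, y)
    print(len(clf.support_), "of", len(X), "points have nonzero alpha")

    # Stationarity gave w = sum_i alpha_i y_i x_i; dual_coef_ stores alpha_i * y_i
    w_from_duals = clf.dual_coef_ @ clf.support_vectors_
    print(np.allclose(w_from_duals, clf.coef_))  # True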

Interactive: SVM Margins

Visualize how SVMs find the maximum margin hyperplane. Support vectors (highlighted) are the only points that matter.

[Interactive figure: a 2D dataset of 12 points with the separating hyperplane and its +1/-1 margins drawn. Only 3 points sit on the margins; these support vectors are the only points with non-zero \alpha. Hovering over a point inspects its dual variable \alpha and its KKT status.]

Complementary Slackness

\alpha_i [y_i(w^T x_i + b) - 1] = 0

This condition forces sparsity. Either a point is correctly classified with room to spare (bracket term > 0, so \alpha_i = 0), OR it is exactly on the margin (bracket term = 0, so \alpha_i > 0).

ML Applications

Beyond SVMs, KKT conditions appear throughout modern ML:

Constrained RL

Maximize reward subject to safety constraints. KKT provides the Lagrangian formulation for constrained policy optimization.

Fairness Constraints

Minimize loss subject to demographic parity. The Lagrangian framework handles fairness as inequality constraints.

Portfolio Optimization

Maximize return subject to risk budget. The Markowitz model uses Lagrangian duality for mean-variance optimization.

Neural Architecture Search

Minimize loss subject to FLOPS/memory constraints. Lagrange multipliers enable differentiable architecture search.

Regularization as a Constraint

Have you ever wondered why we add \lambda ||w||^2 to the loss in Ridge Regression? It's a Lagrange multiplier in disguise!

Constrained View

Minimize Loss subject to ||w||^2 \le C

"Keep weights in a ball of radius C."

Lagrangian View

Minimize \text{Loss} + \lambda ||w||^2

"Penalize large weights directly."

The regularization strength \lambda is exactly the Lagrange multiplier for the weight constraint. Larger \lambda = tighter constraint = smaller weights. This equivalence is why regularization is sometimes called "implicit constraints."
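The correspondence is easy to see numerically: solving ridge regression in closed form for increasing \lambda shrinks ||w||^2, exactly as a shrinking constraint radius would. A sketch on synthetic data:

    import numpy as np

    rng = np.random.default_rng(1)
    X = rng.normal(size=(50, 5))
    y = X @ np.array([3.0, -2.0, 1.0, 0.5, -1.0]) + rng.normal(0, 0.1, 50)

    # Ridge closed form: w = (X^T X + lam * I)^{-1} X^T y
    for lam in [0.0, 1.0, 10.0, 100.0]:
        w = np.linalg.solve(X.T @ X + lam * np.eye(5), X.T @ y)
        print(f"lambda = {lam:6.1f}   ||w||^2 = {w @ w:.4f}")
    # Larger lambda -> tighter implicit constraint -> smaller ||w||^2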