Introduction
Standard gradient descent solves unconstrained optimization: "Find the lowest point anywhere in this landscape."
But real problems have constraints:
- "Maximize returns, but don't invest more than $10,000."
- "Find the nearest point to the query, but it must be on this surface."
- "Train a classifier with maximum margin, but classify all training points correctly."
The Lagrange Method
Lagrange Multipliers transform these constrained problems into unconstrained ones that we can solve with standard calculus. This technique is the foundation of Support Vector Machines and explains why regularization works.
The Constrained Optimization Problem
The general form of a constrained optimization problem:
Minimize f(x)
subject to
hⱼ(x) = 0 (equality constraints)
gᵢ(x) ≤ 0 (inequality constraints)
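For instance, here is a minimal sketch of this general form using scipy; the objective and constraints are toy placeholders, and note scipy's convention that an "ineq" constraint means fun(x) ≥ 0, so g(x) ≤ 0 is passed as -g(x) ≥ 0:

```python
import numpy as np
from scipy.optimize import minimize

# Toy instance of the general form:
#   minimize   f(x) = x0^2 + x1^2
#   subject to h(x) = x0 + x1 - 2 = 0   (equality)
#              g(x) = x0 - 1.5 <= 0     (inequality)
f = lambda x: x[0]**2 + x[1]**2
cons = [
    {"type": "eq",   "fun": lambda x: x[0] + x[1] - 2},
    {"type": "ineq", "fun": lambda x: 1.5 - x[0]},  # scipy wants fun(x) >= 0
]
res = minimize(f, x0=np.zeros(2), constraints=cons)
print(res.x)  # ~[1, 1]
```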
Example: Maximum Margin Classifier
Objective: Minimize ½‖w‖² (make weights small = wide margin)
Constraints: yᵢ(w·xᵢ + b) ≥ 1 for all training points i (classify correctly with margin)
Geometric Intuition
Imagine a topographic map with elevation contours. You want to find the highest point, but you must stay on a specific hiking trail (the constraint).
The Key Insight
As you walk along the trail, you cross contour lines of the objective function. Your elevation changes. You keep walking until you reach a point where the trail becomes tangent to a contour line.
At this tangent point, the trail and the contour "kiss." Moving along the trail no longer changes your elevation. This is the optimum subject to the constraint.
The Parallel Gradient Condition
At the optimum, the gradient of the objective is parallel to the gradient of the constraint:
∇f(x*) = λ∇g(x*)
Why Parallel?
The gradient ∇g points perpendicular to the constraint surface. If ∇f had any component along the constraint surface, you could move in that direction to improve the objective. Only when ∇f is purely perpendicular to the surface (parallel to ∇g) are you at an optimum.
Interactive: Parallel Gradients
Watch how the optimal point moves as you change the constraint. Notice the gradient vectors always point in the same direction at the optimum.
(Interactive demo: maximize f(x, y) = x + y subject to x² + y² = r². The optimum is where the gradients are parallel, ∇f = λ∇g; for r = 1.5 it sits at x = y ≈ 1.061 with f(x, y) ≈ 2.121.)
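Here is a minimal numeric check of the parallel-gradient condition for the circle example above, assuming the demo's setting r = 1.5:

```python
import numpy as np

# maximize f(x, y) = x + y subject to g(x, y) = x^2 + y^2 - r^2 = 0
r = 1.5
x = y = r / np.sqrt(2)          # by symmetry, the constrained maximizer

grad_f = np.array([1.0, 1.0])   # gradient of f(x, y) = x + y
grad_g = np.array([2 * x, 2 * y])

lam = grad_f[0] / grad_g[0]     # the lambda that scales grad g onto grad f
print(np.allclose(grad_f, lam * grad_g))  # True: gradients are parallel
print(x, x + y)                 # ~1.061, ~2.121 (matches the demo)
```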
The Lagrangian Function
We encode the parallel gradient condition into a single function called the Lagrangian:
L(x, λ) = f(x) - λ·g(x)
We've transformed a constrained problem in n variables into an unconstrained problem in n+1 variables.
Setting the gradient of the Lagrangian to zero gives us a system of equations:
Derivative w.r.t. x: ∇ₓL = ∇f(x) - λ∇g(x) = 0
This recovers the parallel gradient condition ∇f = λ∇g.
Derivative w.r.t. λ: ∂L/∂λ = -g(x) = 0
This forces the constraint g(x) = 0 to be satisfied.
What is λ?
The Lagrange multiplier λ has a beautiful interpretation: it tells you the "shadow price" of the constraint. If you could relax the constraint slightly, how much would the optimal value improve? The answer is λ.
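A quick finite-difference sketch of the shadow-price interpretation, reusing the circle example above and treating the constraint level c = r² as the quantity being relaxed:

```python
import numpy as np

# maximize f(x, y) = x + y subject to x^2 + y^2 = c
def f_star(c):
    x = np.sqrt(c / 2)        # by symmetry, the optimum is x = y
    return 2 * x

c = 2.25                      # r = 1.5 from the demo above
lam = 1 / np.sqrt(2 * c)      # multiplier from grad f = lambda * grad g

eps = 1e-6
shadow = (f_star(c + eps) - f_star(c)) / eps
print(shadow, lam)            # both ~0.4714: relaxing the constraint by one
                              # unit improves the optimum by ~lambda
```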
Worked Example
Problem
Maximize f(x, y) = xy subject to x + y = 10.
(What rectangle with perimeter 20 has maximum area?)
Step 1: Form the Lagrangian
L(x, y, λ) = xy - λ(x + y - 10)
Step 2: Take partial derivatives
∂L/∂x = y - λ = 0
∂L/∂y = x - λ = 0
∂L/∂λ = -(x + y - 10) = 0
Step 3: Solve the system
From equations 1 and 2: y = λ and x = λ, so x = y.
From equation 3: x + y = 10, so 2x = 10, giving x = y = 5.
Answer: Maximum area = 25 when x = y = 5 (a square!)
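Here is a sketch of the same derivation done symbolically, assuming sympy is available:

```python
import sympy as sp

x, y, lam = sp.symbols("x y lam", real=True)
L = x * y - lam * (x + y - 10)           # the Lagrangian

# Set all partial derivatives of L to zero and solve the system
eqs = [sp.diff(L, v) for v in (x, y, lam)]
sols = sp.solve(eqs, [x, y, lam], dict=True)
print(sols)                    # [{x: 5, y: 5, lam: 5}]
print((x * y).subs(sols[0]))   # 25, the maximum area
```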
KKT Conditions
Equality constraints (h(x) = 0) are handled beautifully by Lagrange multipliers. But what about inequalities (g(x) ≤ 0)? "Maximize margin, but classify all points correctly." "Minimize loss, but keep weights bounded."
The Karush-Kuhn-Tucker (KKT) conditions, discovered independently by Karush (1939) and by Kuhn and Tucker (1951), generalize Lagrange multipliers to handle both equality and inequality constraints. For convex problems, they are both necessary AND sufficient for optimality.
The Four KKT Conditions
Stationarity
∇f(x*) + Σᵢ μᵢ∇gᵢ(x*) + Σⱼ λⱼ∇hⱼ(x*) = 0
The gradient of the Lagrangian vanishes at optimum.
Primal Feasibility
gᵢ(x*) ≤ 0, hⱼ(x*) = 0
The solution must satisfy all constraints.
Dual Feasibility
μᵢ ≥ 0
Inequality multipliers must be non-negative.
Complementary Slackness
μᵢ·gᵢ(x*) = 0
Either the constraint is active OR its multiplier is zero.
Complementary Slackness: The Key Insight
For each inequality constraint, either:
- Active (gᵢ(x*) = 0): On the boundary, constraint matters, μᵢ can be positive.
- Inactive (gᵢ(x*) < 0): Strictly inside, constraint doesn't bind, μᵢ = 0.
This is crucial for SVMs: only support vectors (points on the margin) have non-zero multipliers. This sparsity is what makes SVMs efficient.
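Here is a tiny sketch of the two cases for the problem "minimize x² + y² subject to x + y ≥ b", solved analytically (the b = 2 instance is worked in full in the example below):

```python
def solve(b):
    """Minimize x^2 + y^2 subject to x + y >= b (analytic solution)."""
    if b <= 0:
        x = y = 0.0        # unconstrained minimum is feasible: inactive
        mu = 0.0           # ... so complementary slackness forces mu = 0
    else:
        x = y = b / 2      # optimum sits on the boundary: active
        mu = b             # ... from stationarity 2x - mu = 0 with x = b/2
    slack = x + y - b      # 0 when active, positive when inactive
    return mu, slack, mu * slack   # the product is always 0

print(solve(2))    # (2, 0.0, 0.0)   active: mu > 0, slack = 0
print(solve(-2))   # (0.0, 2.0, 0.0) inactive: mu = 0, slack > 0
```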
Geometric Interpretation
At an optimal point with active constraints, the negative gradient must lie in the cone generated by the outward-pointing constraint normals. Intuitively: all "improving" directions are blocked by active constraints.
Interactive: Complementary Slackness
Explore the KKT complementary slackness condition. Toggle between active and inactive constraints to see μ·g(x) = 0 in action.
KKT Worked Example
Problem
Find the point in the half-plane x + y ≥ 2 closest to the origin.
Step 1: Lagrangian
The objective is f(x, y) = x² + y² (squared distance to the origin). Rewrite x + y ≥ 2 as 2 - x - y ≤ 0:
L(x, y, λ) = x² + y² + λ(2 - x - y)
Step 2: Apply KKT
Stationarity: 2x - λ = 0, 2y - λ = 0
Primal: x + y ≥ 2
Dual: λ ≥ 0
Complementary: λ·(2 - x - y) = 0
Step 3: Solve
From stationarity: x = λ/2 = y, so x = y.
Assume constraint is active (x + y = 2). Then 2x = 2, so x = y = 1, λ = 2.
Check: λ = 2 > 0 ✓, x + y = 2 ✓, λ·(2 - x - y) = 2·0 = 0 ✓
Solution: (x*, y*) = (1, 1) with minimum value f* = 2
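A numeric sketch confirming the solution with scipy (recall scipy's convention that "ineq" constraints mean fun(x) ≥ 0):

```python
import numpy as np
from scipy.optimize import minimize

f = lambda p: p[0]**2 + p[1]**2                              # squared distance
cons = [{"type": "ineq", "fun": lambda p: p[0] + p[1] - 2}]  # x + y >= 2

res = minimize(f, x0=np.zeros(2), constraints=cons)
print(res.x)    # ~[1, 1]
print(res.fun)  # ~2.0
```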
Constraint Qualification
KKT conditions are necessary for optimality only under a constraint qualification. The most common is Slater's condition.
Slater's Condition
For a convex problem, there exists a strictly feasible point where:
- All inequality constraints are strictly satisfied: gᵢ(x) < 0
- All equality constraints are satisfied: hⱼ(x) = 0
Intuitively: there's a point strictly in the interior of the feasible region.
Why It Matters
When Slater's condition holds for a convex problem, strong duality holds and KKT becomes both necessary and sufficient. Find a point satisfying KKT → you've solved the problem.
Primal-Dual Gap & Strong Duality
Every constrained optimization problem has a dual problem. This isn't just mathematical curiosity; it has deep practical implications.
Primal Problem
Minimize f(x) over x, subject to constraints.
Optimal value: p*
Dual Problem
Maximize the dual function d(λ) = minₓ L(x, λ) over λ ≥ 0.
Optimal value: d*
Weak Duality (Always True)
d* ≤ p*: the dual provides a lower bound on the primal. This is always true, for any optimization problem.
Strong Duality (For Convex Problems)
d* = p*: for convex problems satisfying Slater's condition, the gap closes. Solving the dual is equivalent to solving the primal.
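For the half-plane example above, minimizing the Lagrangian over x and y at fixed λ gives x = y = λ/2, so the dual function works out to d(λ) = 2λ - λ²/2. A quick sketch of maximizing it:

```python
import numpy as np

# Dual of: minimize x^2 + y^2 subject to x + y >= 2 (primal p* = 2)
d = lambda lam: 2 * lam - lam**2 / 2   # dual function d(lambda)

lams = np.linspace(0, 4, 401)
vals = d(lams)
print(lams[np.argmax(vals)], vals.max())  # lambda* = 2.0, d* = 2.0
# d* equals p*: strong duality holds for this convex problem
```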
SVM Application
Support Vector Machines are the most famous application of Lagrangian duality in ML. The dual formulation enabled the kernel trick.
SVM Problem
Maximize the margin while correctly classifying all points:
Minimize ½‖w‖² subject to yᵢ(w·xᵢ + b) ≥ 1 for all i
Step 1: Form the Lagrangian
L(w, b, α) = ½‖w‖² - Σᵢ αᵢ[yᵢ(w·xᵢ + b) - 1], with αᵢ ≥ 0
Step 2: Apply Stationarity
∂L/∂w = 0 ⟹ w = Σᵢ αᵢyᵢxᵢ
∂L/∂b = 0 ⟹ Σᵢ αᵢyᵢ = 0
Step 3: Substitute into Dual
Maximize Σᵢ αᵢ - ½ Σᵢ Σⱼ αᵢαⱼyᵢyⱼ(xᵢ·xⱼ) subject to αᵢ ≥ 0 and Σᵢ αᵢyᵢ = 0
The dual depends only on dot products xᵢ·xⱼ.
The Kernel Trick
Since the dual only needs dot products, replace xᵢ·xⱼ with a kernel K(xᵢ, xⱼ). This implicitly maps data to higher dimensions without computing coordinates. RBF, polynomial, and other kernels became possible only because of duality.
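A small sketch of why this works: the polynomial kernel K(x, z) = (x·z + 1)² on 2-D inputs equals an ordinary dot product in a 6-D feature space, computed without ever building the 6-D vectors:

```python
import numpy as np

def phi(v):
    """Explicit 6-D feature map matching the kernel (x.z + 1)^2 in 2-D."""
    x1, x2 = v
    s = np.sqrt(2)
    return np.array([1, s * x1, s * x2, x1**2, x2**2, s * x1 * x2])

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])

kernel = (x @ z + 1) ** 2     # computed in 2-D
explicit = phi(x) @ phi(z)    # computed in 6-D
print(kernel, explicit)       # both 4.0
```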
Complementary Slackness in SVMs
The KKT condition αᵢ[yᵢ(w·xᵢ + b) - 1] = 0 has profound implications:
- Points on margin (yᵢ(w·xᵢ + b) = 1): αᵢ can be positive. These are support vectors.
- Points beyond margin (yᵢ(w·xᵢ + b) > 1): αᵢ = 0. These don't influence the solution.
Only support vectors define the hyperplane. This sparsity makes SVMs memory-efficient.
Interactive: SVM Margins
Visualize how SVMs find the maximum margin hyperplane. Support vectors (highlighted) are the only points that matter.
This condition forces sparsity. Either a point is correctly classified with room to spare (the bracket term yᵢ(w·xᵢ + b) - 1 > 0, so αᵢ = 0), OR it is exactly on the margin (the bracket term = 0, so αᵢ can be positive).
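Here is a sketch of this sparsity using scikit-learn's SVC on synthetic data (a large C approximates the hard-margin problem above):

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Two well-separated Gaussian blobs, 100 points each
X = np.vstack([rng.normal(-2, 1, (100, 2)), rng.normal(2, 1, (100, 2))])
y = np.array([-1] * 100 + [1] * 100)

clf = SVC(kernel="linear", C=1e6).fit(X, y)   # large C ~ hard margin
print(len(clf.support_), "support vectors out of", len(X))
print(clf.dual_coef_)   # signed alphas: nonzero only for support vectors
```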
ML Applications
Beyond SVMs, KKT conditions appear throughout modern ML:
Constrained RL
Maximize reward subject to safety constraints. KKT provides the Lagrangian formulation for constrained policy optimization.
Fairness Constraints
Minimize loss subject to demographic parity. The Lagrangian framework handles fairness as inequality constraints.
Portfolio Optimization
Maximize return subject to risk budget. The Markowitz model uses Lagrangian duality for mean-variance optimization.
Neural Architecture Search
Minimize loss subject to FLOPS/memory constraints. Lagrange multipliers enable differentiable architecture search.
Regularization as a Constraint
Have you ever wondered why we add λ‖w‖² to the loss in Ridge Regression? It's a Lagrange multiplier in disguise!
Constrained View
Minimize Loss(w) subject to ‖w‖² ≤ C
"Keep weights in a ball of radius √C."
Lagrangian View
Minimize Loss(w) + λ‖w‖²
"Penalize large weights directly."
The regularization strength λ is exactly the Lagrange multiplier for the weight constraint. Larger λ = tighter constraint = smaller weights. This equivalence is why regularization is sometimes called an "implicit constraint."
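A minimal sketch of this correspondence using the closed-form ridge solution w = (XᵀX + λI)⁻¹Xᵀy: as λ grows, ‖w‖ shrinks, exactly as a tightening norm constraint would force:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([3.0, -2.0, 1.0]) + rng.normal(size=50)

for lam in [0.0, 1.0, 10.0, 100.0]:
    # Closed-form ridge solution for this lambda
    w = np.linalg.solve(X.T @ X + lam * np.eye(3), X.T @ y)
    print(f"lambda={lam:6.1f}  ||w|| = {np.linalg.norm(w):.3f}")
# ||w|| decreases monotonically: larger lambda = tighter implicit constraint
```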