Optimization

Lagrange Multipliers & Duality

How to optimize when you have rules you cannot break. The mathematics behind SVMs and regularization.

Introduction

Standard gradient descent solves unconstrained optimization: "Find the lowest point anywhere in this landscape."

But real problems have constraints:

  • "Maximize returns, but don't invest more than $10,000."
  • "Find the nearest point to the query, but it must be on this surface."
  • "Train a classifier with maximum margin, but classify all training points correctly."

The Lagrange Method

Lagrange Multipliers transform these constrained problems into unconstrained ones that we can solve with standard calculus. This technique is the foundation of Support Vector Machines and explains why regularization works.

The Constrained Optimization Problem

The general form of a constrained optimization problem:

Minimize f(x)

subject to

g_i(x) = 0 (equality constraints)

h_j(x) \le 0 (inequality constraints)
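To make the form concrete, here is a minimal sketch (my example, not from the module) of how it maps onto SciPy's general-purpose solver. Note SciPy's convention: an "ineq" constraint means fun(x) ≥ 0, so h(x) ≤ 0 is passed as -h(x) ≥ 0.

    # Illustrative problem: minimize f(x, y) = (x - 2)^2 + (y - 1)^2
    # subject to g(x, y) = x + y - 2 = 0 and h(x, y) = x^2 - y <= 0.
    from scipy.optimize import minimize

    f = lambda v: (v[0] - 2) ** 2 + (v[1] - 1) ** 2
    constraints = [
        {"type": "eq", "fun": lambda v: v[0] + v[1] - 2},        # g(x) = 0
        {"type": "ineq", "fun": lambda v: -(v[0] ** 2 - v[1])},  # -h(x) >= 0
    ]
    res = minimize(f, x0=[0.0, 0.0], method="SLSQP", constraints=constraints)
    print(res.x)  # the constrained minimizer, here approximately (1, 1)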

Example: Maximum Margin Classifier

Objective: Minimize ||w||^2 (make weights small = wide margin)

Constraints: y_i(w^T x_i + b) \ge 1 for all training points (classify correctly with margin)

Geometric Intuition

Imagine a topographic map with elevation contours. You want to find the highest point, but you must stay on a specific hiking trail (the constraint).

The Key Insight

As you walk along the trail, you cross contour lines of the objective function. Your elevation changes. You keep walking until you reach a point where the trail becomes tangent to a contour line.

At this tangent point, the trail and the contour "kiss." Moving along the trail no longer changes your elevation. This is the optimum subject to the constraint.

The Parallel Gradient Condition

\nabla f = \lambda \nabla g

At the optimum, the gradient of the objective is parallel to the gradient of the constraint.

Why Parallel?

The gradient \nabla g points perpendicular to the constraint surface. If \nabla f had any component along the constraint surface, you could move in that direction to improve the objective. Only when \nabla f is purely perpendicular (parallel to \nabla g) are you at an optimum.

Interactive: Parallel Gradients

Watch how the optimal point moves as you change the constraint. Notice the gradient vectors always point in the same direction at the optimum.

[Interactive demo: maximize f(x,y) = x + y subject to x^2 + y^2 = r^2. At the optimum the gradients of objective and constraint are parallel; for the radius shown, x = y ≈ 1.061 and f(x,y) ≈ 2.121. This is the core condition: \nabla f = \lambda \nabla g.]
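To ground the picture numerically, here is a tiny sketch (assuming r = 1.5, back-solved from the readouts above) checking that the gradients are parallel at the optimum of the demo's problem:

    import numpy as np

    # f(x, y) = x + y, constraint g(x, y) = x^2 + y^2 - r^2 = 0
    r = 1.5                            # assumed radius, matching the ~1.061 readout
    x = y = r / np.sqrt(2)             # optimum by symmetry: x = y = r/sqrt(2)

    grad_f = np.array([1.0, 1.0])      # \nabla f = (1, 1) everywhere
    grad_g = np.array([2 * x, 2 * y])  # \nabla g = (2x, 2y)

    lam = grad_f[0] / grad_g[0]        # solve the first component for lambda
    print(np.allclose(grad_f, lam * grad_g))  # True: \nabla f = \lambda \nabla g
    print(x + y)                               # ~2.121, the f readout above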

The Lagrangian Function

We encode the parallel gradient condition into a single function called the Lagrangian:

\mathcal{L}(x, \lambda) = f(x) - \lambda \cdot g(x)

We've transformed a constrained problem in n variables into an unconstrained problem in n+1 variables.

Setting the gradient of the Lagrangian to zero gives us a system of equations:

Derivative w.r.t. x:

\nabla_x \mathcal{L} = \nabla f - \lambda \nabla g = 0

This recovers the parallel gradient condition.

Derivative w.r.t. λ:

\nabla_\lambda \mathcal{L} = -g(x) = 0

This forces the constraint to be satisfied.

What is λ?

The Lagrange multiplier λ has a beautiful interpretation: it is the "shadow price" of the constraint. If you could relax the constraint slightly, how much would the optimal value improve? At a rate of exactly λ per unit of relaxation.

Worked Example

Problem

Maximize f(x,y) = xy subject to x + y = 10.

(What rectangle with perimeter 20 has maximum area?)

Step 1: Form the Lagrangian

\mathcal{L} = xy - \lambda(x + y - 10)

Step 2: Take partial derivatives

\frac{\partial \mathcal{L}}{\partial x} = y - \lambda = 0

\frac{\partial \mathcal{L}}{\partial y} = x - \lambda = 0

\frac{\partial \mathcal{L}}{\partial \lambda} = -(x + y - 10) = 0

Step 3: Solve the system

From equations 1 and 2: y = \lambda and x = \lambda, so x = y.

From equation 3: x + y = 10, so 2x = 10, giving x = y = 5.

Answer: Maximum area = 25 when x = y = 5 (a square!)
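The whole system can also be solved mechanically. A SymPy sketch (my illustration, with the constraint generalized to x + y = c so the shadow-price claim from "What is λ?" can be checked too):

    import sympy as sp

    x, y, lam, c = sp.symbols("x y lam c")

    # Lagrangian for: maximize xy subject to x + y = c (the example uses c = 10)
    L = x * y - lam * (x + y - c)
    equations = [sp.diff(L, v) for v in (x, y, lam)]
    sol = sp.solve(equations, (x, y, lam), dict=True)[0]
    print(sol)  # {x: c/2, y: c/2, lam: c/2} -> x = y = 5, lam = 5 at c = 10

    # Shadow price: the optimal area is (c/2)^2, and d/dc of that is c/2 = lam.
    area = (c / 2) ** 2
    print(sp.simplify(sp.diff(area, c) - sol[lam]))  # 0: relaxing by dc gains lam*dc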

KKT Conditions

Equality constraints (g(x) = 0) are handled beautifully by Lagrange multipliers. But what about inequalities (h(x) \le 0)? "Maximize margin, but classify all points correctly." "Minimize loss, but keep weights bounded."

The Karush-Kuhn-Tucker (KKT) conditions, discovered independently by Karush (1939) and by Kuhn and Tucker (1951), generalize Lagrange multipliers to handle both equality and inequality constraints. For convex problems, they are both necessary AND sufficient for optimality.

The Four KKT Conditions

1. Stationarity

\nabla f + \sum_i \lambda_i \nabla g_i + \sum_j \mu_j \nabla h_j = 0

The gradient of the Lagrangian vanishes at the optimum.

2. Primal Feasibility

g_i(x) = 0, \quad h_j(x) \le 0

The solution must satisfy all constraints.

3. Dual Feasibility

\mu_j \ge 0

Inequality multipliers must be non-negative.

4. Complementary Slackness

\mu_j \cdot h_j(x) = 0

Either the constraint is active OR its multiplier is zero.

Complementary Slackness: The Key Insight

For each inequality constraint, either:

  • Active (h_j(x) = 0): On the boundary, the constraint matters, \mu_j can be positive.
  • Inactive (h_j(x) < 0): Strictly inside, the constraint doesn't bind, \mu_j = 0.

This is crucial for SVMs: only support vectors (points on the margin) have non-zero multipliers. This sparsity is what makes SVMs efficient.

Geometric Interpretation

At an optimal point with active constraints, the negative gradient -\nabla f must lie in the cone generated by the outward-pointing constraint normals. Intuitively: every "improving" direction is blocked by an active constraint.

Interactive: Complementary Slackness

Explore the KKT complementary slackness condition. Toggle between active and inactive constraints to see μ·h(x) = 0 in action.

[Interactive figure: a constrained minimum x* on the boundary of the region h(x) ≤ 0, with \nabla f and \nabla h drawn at x*. Conditions check for the active case: h(x*) = 0 (on the boundary), \mu > 0 (force required), and the product \mu \cdot h(x*) = 0: a non-zero multiplier times a zero constraint value.]

KKT Worked Example

Problem

\min_{x,y} \quad x^2 + y^2

subject to

x + y \ge 2

Find the point in the half-plane x + y ≥ 2 closest to the origin.

Step 1: Lagrangian

\mathcal{L} = x^2 + y^2 + \lambda(-x - y + 2)

Rewrite x + y ≥ 2 as -x - y + 2 ≤ 0

Step 2: Apply KKT

Stationarity: 2x - \lambda = 0, \quad 2y - \lambda = 0

Primal: x + y \ge 2

Dual: \lambda \ge 0

Complementary: \lambda(-x - y + 2) = 0

Step 3: Solve

From stationarity: x = λ/2 = y, so x = y.

Assume constraint is active (x + y = 2). Then 2x = 2, so x = y = 1, λ = 2.

Check: λ = 2 > 0 ✓, x + y = 2 ✓, λ·(2 - x - y) = 2·0 = 0 ✓

Solution: (x*, y*) = (1, 1) with minimum value f* = 2
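A quick numeric cross-check of this result (a sketch; SciPy's "ineq" means fun(x) ≥ 0, so x + y ≥ 2 is passed directly):

    from scipy.optimize import minimize

    # minimize x^2 + y^2 subject to x + y >= 2
    res = minimize(
        lambda v: v[0] ** 2 + v[1] ** 2,
        x0=[0.0, 0.0],
        method="SLSQP",
        constraints=[{"type": "ineq", "fun": lambda v: v[0] + v[1] - 2}],
    )
    print(res.x, res.fun)  # ~[1, 1] and 2.0, matching (x*, y*) = (1, 1), f* = 2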

Constraint Qualification

KKT conditions are necessary for optimality only under a constraint qualification. The most common is Slater's condition.

Slater's Condition

For a convex problem, there exists a strictly feasible point where:

  • All inequality constraints are strictly satisfied: hⱼ(x) < 0
  • All equality constraints are satisfied: gᵢ(x) = 0

Intuitively: there's a point strictly in the interior of the feasible region.

Why It Matters

When Slater's condition holds for a convex problem, strong duality holds and KKT becomes both necessary and sufficient. Find a point satisfying KKT → you've solved the problem.

Primal-Dual Gap & Strong Duality

Every constrained optimization problem has a dual problem. This isn't just mathematical curiosity; it has deep practical implications.

Primal Problem

Minimize f(x) over x, subject to constraints.

Optimal value: p^*

Dual Problem

Maximize the dual function \min_x \mathcal{L}(x, \lambda) over \lambda \ge 0.

Optimal value: d^*

Weak Duality (Always True)

d^* \le p^*

The dual provides a lower bound on the primal. This is always true, for any optimization problem.

Strong Duality (For Convex Problems)

d^* = p^*

For convex problems satisfying Slater's condition, the gap closes. Solving the dual is equivalent to solving the primal.
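For the half-plane example from the KKT section, the dual function has a closed form, so the zero gap can be seen directly (a sketch using that same problem):

    import numpy as np

    # Primal: min x^2 + y^2 s.t. x + y >= 2, with p* = 2 at (1, 1).
    # Dual function: g(lam) = min_{x,y} [x^2 + y^2 + lam * (2 - x - y)].
    # The inner minimum is at x = y = lam/2, so g(lam) = 2*lam - lam**2 / 2.
    g = lambda lam: 2 * lam - lam**2 / 2

    lams = np.linspace(0, 4, 401)
    values = g(lams)
    print(values.max())           # 2.0 -> d* = p*: the gap is zero
    print(lams[values.argmax()])  # 2.0, the same multiplier KKT found earlier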

SVM Application

Support Vector Machines are the most famous application of Lagrangian duality in ML. The dual formulation enabled the kernel trick.

SVM Problem

\min_{w,b} \quad \frac{1}{2}||w||^2

subject to

y_i(w^T x_i + b) \ge 1 \quad \forall i

Maximize margin while correctly classifying all points.

Step 1: Form the Lagrangian

\mathcal{L} = \frac{1}{2}||w||^2 - \sum_{i} \alpha_i [y_i(w^T x_i + b) - 1]

Step 2: Apply Stationarity

\frac{\partial \mathcal{L}}{\partial w} = 0 \Rightarrow w = \sum_i \alpha_i y_i x_i

\frac{\partial \mathcal{L}}{\partial b} = 0 \Rightarrow \sum_i \alpha_i y_i = 0

Step 3: Substitute into Dual

\max_\alpha \sum_i \alpha_i - \frac{1}{2}\sum_{i,j} \alpha_i \alpha_j y_i y_j (x_i^T x_j)

subject to \alpha_i \ge 0 and \sum_i \alpha_i y_i = 0. The dual depends only on the dot products x_i^T x_j.

The Kernel Trick

Since the dual only needs dot products, replace x_i^T x_j with a kernel K(x_i, x_j). This implicitly maps data to higher dimensions without computing coordinates. RBF, polynomial, and other kernels became possible only because of duality.
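A tiny check (my example) of why this works: the polynomial kernel K(x, z) = (x^T z)^2 on 2D inputs equals an ordinary dot product in a 3D feature space that is never explicitly constructed.

    import numpy as np

    # Feature map realizing K(x, z) = (x . z)^2 in 2D:
    # phi(x) = (x1^2, sqrt(2)*x1*x2, x2^2)
    def phi(x):
        return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

    x = np.array([1.0, 2.0])
    z = np.array([3.0, 0.5])

    explicit = phi(x) @ phi(z)  # dot product in the mapped 3D space
    implicit = (x @ z) ** 2     # kernel evaluated in the original 2D space
    print(np.isclose(explicit, implicit))  # True: same value, no mapping computed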

Complementary Slackness in SVMs

The KKT condition \alpha_i [y_i(w^T x_i + b) - 1] = 0 has profound implications:

  • Points on the margin (y_i(w^T x_i + b) = 1): \alpha_i > 0. These are the support vectors.
  • Points beyond the margin (y_i(w^T x_i + b) > 1): \alpha_i = 0. These don't influence the solution.

Only support vectors define the hyperplane. This sparsity makes SVMs memory-efficient.
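This sparsity is directly visible in scikit-learn, whose SVC exposes the support-vector indices (support_) and the products \alpha_i y_i (dual_coef_). A sketch on synthetic data:

    import numpy as np
    from sklearn.svm import SVC

    # Two separable Gaussian blobs, labeled -1 and +1
    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(-2, 1, (20, 2)), rng.normal(2, 1, (20, 2))])
    y = np.array([-1] * 20 + [1] * 20)

    clf = SVC(kernel="linear", C=10.0).fit(X, y)
    print(len(clf.support_), "of", len(X), "points have nonzero alpha")

    # Stationarity gave w = sum_i alpha_i y_i x_i; dual_coef_ stores alpha_i * y_i
    w_from_duals = clf.dual_coef_ @ clf.support_vectors_
    print(np.allclose(w_from_duals, clf.coef_))  # True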

Interactive: SVM Margins

Visualize how SVMs find the maximum margin hyperplane. Support vectors (highlighted) are the only points that matter.

[Interactive figure: a 2D dataset of 12 points with the separating hyperplane and its +1/-1 margins drawn. Only 3 points sit on the margins; these support vectors are the only points with non-zero \alpha. Hovering over a point inspects its dual variable \alpha and its KKT status.]

Complementary Slackness

\alpha_i [y_i(w^T x_i + b) - 1] = 0

This condition forces sparsity. Either a point is correctly classified with room to spare (bracket term > 0, so \alpha_i = 0), OR it is exactly on the margin (bracket term = 0, so \alpha_i > 0).

ML Applications

Beyond SVMs, KKT conditions appear throughout modern ML:

Constrained RL

Maximize reward subject to safety constraints. KKT provides the Lagrangian formulation for constrained policy optimization.

Fairness Constraints

Minimize loss subject to demographic parity. The Lagrangian framework handles fairness as inequality constraints.

Portfolio Optimization

Maximize return subject to risk budget. The Markowitz model uses Lagrangian duality for mean-variance optimization.

Neural Architecture Search

Minimize loss subject to FLOPS/memory constraints. Lagrange multipliers enable differentiable architecture search.

Regularization as a Constraint

Have you ever wondered why we add \lambda ||w||^2 to the loss in Ridge Regression? It's a Lagrange multiplier in disguise!

Constrained View

Minimize Loss subject to ||w||^2 \le C

"Keep weights in a ball of radius C."

Lagrangian View

Minimize \text{Loss} + \lambda ||w||^2

"Penalize large weights directly."

The regularization strength \lambda is exactly the Lagrange multiplier for the weight constraint. Larger \lambda = tighter constraint = smaller weights. This equivalence is why regularization is sometimes called "implicit constraints."
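The correspondence is easy to see numerically: solving ridge regression in closed form for increasing \lambda shrinks ||w||^2, exactly as a shrinking constraint radius would. A sketch on synthetic data:

    import numpy as np

    rng = np.random.default_rng(1)
    X = rng.normal(size=(50, 5))
    y = X @ np.array([3.0, -2.0, 1.0, 0.5, -1.0]) + rng.normal(0, 0.1, 50)

    # Ridge closed form: w = (X^T X + lam * I)^{-1} X^T y
    for lam in [0.0, 1.0, 10.0, 100.0]:
        w = np.linalg.solve(X.T @ X + lam * np.eye(5), X.T @ y)
        print(f"lambda = {lam:6.1f}   ||w||^2 = {w @ w:.4f}")
    # Larger lambda -> tighter implicit constraint -> smaller ||w||^2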