Optimization

Convex vs Non-Convex Optimization

The single most important distinction in optimization theory. It determines whether you can trust your optimizer to find the best solution.

Introduction

Every optimization problem in machine learning boils down to this: you have a loss function (a landscape of mountains and valleys) and you want to find the lowest point. The question that determines everything is: What does your landscape look like?

If your landscape is a bowl (convex), there is exactly one lowest point, and no matter where you start rolling a ball, it will always end up at the bottom. Life is good.

If your landscape is an egg crate (non-convex), there are countless little dips and valleys. Your ball might roll into a shallow puddle and get stuck there, never finding the deep ocean trench that represents the true best solution.

The Core Distinction

Convex

Every local minimum IS the global minimum.

Non-Convex

Local minima can trap you far from the global best.

Why Should You Care?

This distinction has massive practical implications:

Linear Regression, Logistic Regression, SVMs

These are convex problems. You are guaranteed to find a globally optimal solution. If two people train the same model with different initializations, they reach the same optimal loss, and for strictly convex losses, the exact same final model.
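Here is that guarantee in action, as a minimal sketch (synthetic data and an arbitrary step size are assumed): ridge-regularized least squares is strictly convex, so gradient descent lands on the same weights no matter where it starts.

```python
import numpy as np

# Convexity in action (a minimal sketch: synthetic data, arbitrary step size).
# Ridge-regularized least squares is strictly convex, so gradient descent reaches
# the same weights no matter where it starts.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + 0.1 * rng.normal(size=200)
lam = 0.1

def gradient_descent(w, steps=5000, lr=1e-3):
    for _ in range(steps):
        grad = X.T @ (X @ w - y) + lam * w   # grad of 0.5*||Xw - y||^2 + 0.5*lam*||w||^2
        w = w - lr * grad
    return w

w_a = gradient_descent(10 * rng.normal(size=5))     # two wildly different initializations
w_b = gradient_descent(-10 * rng.normal(size=5))
print(np.allclose(w_a, w_b, atol=1e-4))             # True: same minimizer either way
```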

Neural Networks

These are non-convex. Two people training the same architecture on the same data will get different models depending on random initialization. There are no guarantees.

Understanding convexity explains why SVMs were dominant in the 2000s (mathematical guarantees!) and why deep learning required tricks like careful initialization, batch normalization, and the Adam optimizer to work at all.

Convex Sets

Before we can discuss convex functions, we need to understand convex sets. This is where the geometry begins.

Definition: Convex Set

A set $C$ is convex if for any two points $x, y \in C$, the entire line segment connecting them lies within $C$.

\forall \lambda \in [0,1]: \quad \lambda x + (1-\lambda)y \in C

Convex Sets

  • Circles, spheres
  • Rectangles, cubes
  • Triangles (filled)
  • Half-spaces
  • Intersections of convex sets

Non-Convex Sets

  • Donuts (has a hole)
  • Crescent moons
  • Stars
  • Any shape with an "indent"
  • L-shapes

The "Rubber Band" Test

Imagine stretching a rubber band between any two points in the set. If the rubber band ever "pokes out" of the set, it's not convex. For a convex set, the rubber band always stays inside.
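The definition and the rubber-band test translate directly into a brute-force check. This is a toy sketch (the unit disk and an annulus are assumed example sets): random sampling can only gather evidence for convexity, but a single point that pokes out is definitive proof of non-convexity.

```python
import numpy as np

# A brute-force "rubber band" test (toy sketch): sample pairs of points from a set and
# check that every convex combination lambda*x + (1 - lambda)*y stays inside.
rng = np.random.default_rng(0)

in_disk    = lambda p: np.linalg.norm(p) <= 1.0          # filled unit disk: convex
in_annulus = lambda p: 0.5 <= np.linalg.norm(p) <= 1.0   # donut with a hole: not convex

def looks_convex(contains, n_pairs=2000):
    for _ in range(n_pairs):
        pts = []
        while len(pts) < 2:                   # rejection-sample two members of the set
            p = rng.uniform(-1, 1, size=2)
            if contains(p):
                pts.append(p)
        x, y = pts
        for lam in np.linspace(0.0, 1.0, 21):
            if not contains(lam * x + (1 - lam) * y):
                return False                  # the segment "poked out" of the set
    return True

print(looks_convex(in_disk))      # True
print(looks_convex(in_annulus))   # False
```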

Convex Functions

A function is convex if the region above its graph (its epigraph) is a convex set. Equivalently:

Definition: Convex Function

A function $f$ is convex if the line segment between any two points on its graph lies above or on the graph.

f(\lambda x + (1-\lambda)y) \le \lambda f(x) + (1-\lambda)f(y)

for all $x, y$ and $\lambda \in [0, 1]$

Convex Functions

  • $f(x) = x^2$ (parabola)
  • $f(x) = e^x$
  • $f(x) = |x|$
  • $f(x) = x \log x$
  • Norms: $\|x\|_1$, $\|x\|_2$
  • Sum of convex functions

Non-Convex Functions

  • $f(x) = \sin(x)$
  • $f(x) = x^3$ (cubic)
  • $f(x) = -x^2$ (concave)
  • Neural network losses
  • Any function with multiple local minima

Strictly Convex

If the inequality is strict ($<$ instead of $\le$) for $x \neq y$ and $\lambda \in (0, 1)$, the function is strictly convex. This guarantees at most one global minimum (and exactly one if a minimum exists). $x^2$ is strictly convex; a flat line is convex but not strictly convex.
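The definition can also be checked numerically. The sketch below brute-force tests the inequality on random $(x, y, \lambda)$ triples over an arbitrary range for a few of the functions listed above; as with any sampling test, it can only disprove convexity, never prove it.

```python
import numpy as np

# Brute-force check of the convexity inequality on random (x, y, lambda) triples
# over [-3, 3] (an arbitrary test range).
rng = np.random.default_rng(0)

def violates_convexity(f, trials=10_000, lo=-3.0, hi=3.0):
    x, y = rng.uniform(lo, hi, size=(2, trials))
    lam = rng.uniform(0.0, 1.0, size=trials)
    lhs = f(lam * x + (1 - lam) * y)
    rhs = lam * f(x) + (1 - lam) * f(y)
    return np.any(lhs > rhs + 1e-9)           # True if f(lam*x + (1-lam)*y) <= ... fails

for name, f in [("x^2", np.square), ("e^x", np.exp), ("|x|", np.abs),
                ("sin(x)", np.sin), ("x^3", lambda x: x**3)]:
    print(name, "violates convexity:", violates_convexity(f))
# x^2, e^x, |x| -> False;  sin(x), x^3 -> True
```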

Interactive: Convex vs Non-Convex

Watch gradient descent navigate these two landscapes. On the convex bowl, it always finds the bottom. On the non-convex surface, it can get trapped in local minima.

[Interactive demo: "Convex vs Non-Convex Landscapes". Gradient descent is animated on a convex bowl and on a non-convex surface, with live readouts of the current position, $f(x)$, and gradient. Convex case: only one minimum exists, and gradient descent is guaranteed to find it.]
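The convex half of the demo boils down to a few lines. A minimal sketch, assuming the bowl is $f(x) = x^2$ (which matches the readouts): gradient descent from any starting point slides to the unique minimum at $x = 0$.

```python
# Gradient descent on the convex bowl, assuming f(x) = x^2 to match the demo above.
x, lr = 0.8, 0.1
for _ in range(100):
    grad = 2 * x          # f'(x) = 2x
    x -= lr * grad        # each step multiplies x by (1 - 2*lr) = 0.8
print(round(x, 6))        # 0.0: the unique global minimum, whatever the starting point
```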

The Hessian Test for Convexity

For twice-differentiable functions, there's an easy test using the Hessian matrix.

The Second Derivative Test

A twice-differentiable function $f$ is convex if and only if its Hessian $H$ is positive semi-definite (PSD) everywhere.

H(x) \succeq 0 \quad \forall x

PSD means: all eigenvalues of $H$ are non-negative. Geometrically, the function curves "upward" (or stays flat) in every direction.

Example: MSE Loss for Linear Regression

Loss: $L(w) = \|Xw - y\|^2$

Hessian: $H = 2X^TX$

Since $X^TX$ is always PSD (it's a Gram matrix), MSE loss is convex. This is why linear regression has no spurious local minima, and the solution is unique whenever $X^TX$ is invertible.
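A quick numerical confirmation (a sketch using an arbitrary random design matrix): the eigenvalues of $2X^TX$ are all non-negative, so the Hessian test passes.

```python
import numpy as np

# Numerical version of the Hessian test for MSE loss, using an arbitrary random design
# matrix: the eigenvalues of H = 2 * X^T X are all non-negative, so H is PSD.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 6))
H = 2 * X.T @ X
eigvals = np.linalg.eigvalsh(H)        # eigenvalues of the symmetric matrix H
print(np.all(eigvals >= -1e-10))       # True -> PSD everywhere -> the loss is convex
```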

Jensen's Inequality

This inequality is a direct consequence of convexity and appears throughout ML, especially in variational inference and the EM algorithm.

Jensen's Inequality

For a convex function $f$ and a random variable $X$:

f(E[X]) \le E[f(X)]

"The function of the mean is at most the mean of the function."

Worked Example

Let $f(x) = x^2$ (convex). Let $X$ take values $-2$ and $+2$ with equal probability.

Left side: $f(E[X])$

$E[X] = 0$, so $f(0) = 0$

Right side: $E[f(X)]$

$E[X^2] = (4 + 4)/2 = 4$

$0 \le 4$. Jensen's inequality holds.
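The same check by simulation, as a short sketch (the second distribution is an arbitrary extra example, not from the worked example above):

```python
import numpy as np

# Monte Carlo check of Jensen's inequality f(E[X]) <= E[f(X)].
rng = np.random.default_rng(0)

X = rng.choice([-2.0, 2.0], size=100_000)      # the +/-2 coin flip from the worked example
print(np.mean(X) ** 2, np.mean(X ** 2))        # ~0.0  <=  4.0        (f(x) = x^2)

Y = rng.normal(size=100_000)                   # extra example: f = exp, Y ~ N(0, 1)
print(np.exp(np.mean(Y)), np.mean(np.exp(Y)))  # ~1.0  <=  ~1.65 (= e^0.5)
```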

Why Jensen Matters in ML

In Variational Autoencoders (VAEs), we can't compute $\log p(x)$ directly. Jensen's inequality lets us construct the ELBO (Evidence Lower Bound), a tractable lower bound we can maximize instead. The entire VAE framework rests on Jensen.

The Convex World: Where Life is Good

In convex optimization, you have mathematical guarantees that would make any theorist weep with joy:

Guarantee #1: Local = Global

Any local minimum is automatically the global minimum. No need to restart with different initializations.

Guarantee #2: Efficient Algorithms

Convex problems can be solved in polynomial time with proven convergence rates.

Guarantee #3: Duality

Strong duality holds (under mild conditions such as Slater's condition), so the dual problem has the same optimal value as the primal.

Convex ML Models

Linear Regression

MSE loss is quadratic (convex)

Logistic Regression

Cross-entropy loss is convex

Support Vector Machines

Hinge loss is convex

Ridge/Lasso Regression

Convex loss + convex penalty

The Non-Convex Reality: Deep Learning's Jungle

Neural networks are non-convex. The moment you compose linear transformations with nonlinear activations (ReLU, Sigmoid, etc.), convexity is destroyed.

The Hazards

  • Local Minima: Valleys that aren't the deepest. Getting stuck here means a suboptimal model.
  • Saddle Points: Points where gradient = 0, but they're not minima. Far more common than local minima in high dimensions.
  • Plateaus: Vast flat regions where gradients are tiny. Training stalls for thousands of steps.
  • Ill-Conditioning: Loss landscapes that are steep in some directions and flat in others, causing oscillation.

Why Neural Networks are Non-Convex

Consider a simple 2-layer network: $f(x) = W_2 \sigma(W_1 x)$

  • Activation Curvature: The nonlinear $\sigma$ bends the function.
  • Weight Products: $W_2 W_1$ creates hyperbolic valleys.
  • Symmetry: Swapping neurons gives identical outputs (multiple equivalent minima); see the sketch below.
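Here is the symmetry point as a minimal sketch ($\tanh$ is an assumed choice of $\sigma$, and the layer shapes are arbitrary): permuting hidden units, or flipping the sign of a unit's incoming and outgoing weights, leaves the network's output unchanged.

```python
import numpy as np

# Weight-space symmetry in a tiny 2-layer net f(x) = W2 @ tanh(W1 @ x)
# (tanh is an assumed choice of sigma; the shapes are arbitrary).
rng = np.random.default_rng(0)
W1 = rng.normal(size=(4, 3))             # 3 inputs -> 4 hidden units
W2 = rng.normal(size=(1, 4))             # 4 hidden units -> 1 output
x  = rng.normal(size=3)

def f(W1, W2, x):
    return W2 @ np.tanh(W1 @ x)

perm = rng.permutation(4)                # relabel (permute) the hidden units
flip = np.array([1.0, -1.0, 1.0, -1.0])  # flip the sign of two units' in/out weights

print(f(W1, W2, x))                                      # original network
print(f(W1[perm], W2[:, perm], x))                       # identical output
print(f(flip[:, None] * W1, W2 * flip[None, :], x))      # identical output (tanh is odd)
# Many distinct weight settings compute the same function, so any trained network
# is just one of many equivalent minima of the loss surface.
```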

Loss Landscape

[Interactive demo: the loss surface $L(w_{in}, w_{out})$ of a tiny two-parameter network, shown next to the function it fits to the target data, with live readouts of the current loss and parameters. Notice the symmetry: $(w, -w)$ works the same as $(-w, w)$.]

Why Deep Learning Works Anyway

If non-convex optimization is so hard (NP-hard in the worst case), why do neural networks train at all? This was one of the big mysteries of the 2010s deep learning revolution.

Insight #1: High Dimensionality is a Blessing

Most critical points in deep learning are saddle points, not local minima. And saddle points can be escaped.

Insight #2: Local Minima Are Often Good Enough

Most local minima have loss values very close to the global minimum. The "bad" local minima are rare.

Insight #3: Over-Parameterization Creates Easy Paths

Solutions form connected valleys (mode connectivity), and gradient descent can easily slide along them.

Insight #4: SGD Noise Helps

The stochastic noise from mini-batch gradients can shake the optimizer out of shallow local minima.
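A toy illustration of this insight (everything here is made up for the sketch: a 1-D double-well loss, the step size, and the noise schedule): plain gradient descent started on the right side stalls in the shallow valley, while the same descent with noise injected into the gradient typically hops the barrier and settles in the deeper one.

```python
import numpy as np

# 1-D double-well loss with a shallow minimum near x = +0.93 (loss ~ +0.48) and a
# deeper one near x = -1.06 (loss ~ -0.51), separated by a barrier near x = 0.13.
f      = lambda x: (x**2 - 1)**2 + 0.5 * x
grad_f = lambda x: 4 * x * (x**2 - 1) + 0.5

rng = np.random.default_rng()          # unseeded: the noisy run is genuinely stochastic

def descend(noise_scale, x=1.5, lr=0.05, steps=4000, noisy_steps=2000):
    for t in range(steps):
        noise = noise_scale * rng.normal() if t < noisy_steps else 0.0
        x -= lr * (grad_f(x) + noise)  # noise stands in for mini-batch gradient noise
    return x

x_gd  = descend(noise_scale=0.0)       # deterministic: stuck at the shallow minimum
x_sgd = descend(noise_scale=3.0)       # noisy: usually hops the barrier to the deep one
print(f"plain GD: x = {x_gd:+.3f}, loss = {f(x_gd):+.3f}")
print(f"noisy GD: x = {x_sgd:+.3f}, loss = {f(x_sgd):+.3f}")
```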

Practical ML Implications

When to Choose Convex Models

If interpretability and guaranteed convergence matter, consider logistic regression, SVMs, or linear models.

Surviving Non-Convex Training

The Convex Relaxation Trick

"Relax" a non-convex problem into a convex one, solve that, then round back. Used in L1 relaxation of L0.