Introduction
Every optimization problem in machine learning boils down to this: you have a loss function (a landscape of mountains and valleys) and you want to find the lowest point. The question that determines everything is: What does your landscape look like?
If your landscape is a bowl (convex), there is exactly one lowest point, and no matter where you start rolling a ball, it will always end up at the bottom. Life is good.
If your landscape is an egg crate (non-convex), there are countless little dips and valleys. Your ball might roll into a shallow puddle and get stuck there, never finding the deep ocean trench that represents the true best solution.
The Core Distinction
Convex
Every local minimum IS the global minimum.
Non-Convex
Local minima can trap you far from the global best.
Why Should You Care?
This distinction has massive practical implications:
Linear Regression, Logistic Regression, SVMs
These are convex problems. You are guaranteed to find the globally optimal solution. If two people train the same model with different initializations, they get the exact same final model.
Neural Networks
These are non-convex. Two people training the same architecture on the same data will get different models depending on random initialization. There are no guarantees.
Understanding convexity explains why SVMs were dominant in the 2000s (mathematical guarantees!) and why deep learning required tricks like careful initialization, batch normalization, and the Adam optimizer to work at all.
Convex Sets
Before we can discuss convex functions, we need to understand convex sets. This is where the geometry begins.
Definition: Convex Set
A set C is convex if for any two points x, y in C, the entire line segment connecting them lies within C: lambda x + (1 - lambda) y is in C for every lambda in [0, 1].
Convex Sets
- Circles, spheres
- Rectangles, cubes
- Triangles (filled)
- Half-spaces
- Intersections of convex sets
Non-Convex Sets
- Donuts (has a hole)
- Crescent moons
- Stars
- Any shape with an "indent"
- L-shapes
The "Rubber Band" Test
Imagine stretching a rubber band between any two points in the set. If the rubber band ever "pokes out" of the set, it's not convex. For a convex set, the rubber band always stays inside.
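To make the rubber band test concrete, here is a small sketch (the helper segment_stays_inside is an illustrative name, not a library function) that samples points along the segment and checks membership for a disc and for a donut-shaped set:

```python
import numpy as np

def segment_stays_inside(p, q, contains, n_samples=100):
    """Check whether the segment from p to q stays inside a set,
    where `contains` tests membership of a single point."""
    for lam in np.linspace(0.0, 1.0, n_samples):
        point = lam * np.asarray(p) + (1 - lam) * np.asarray(q)
        if not contains(point):
            return False
    return True

# Unit disc: a convex set.
disc = lambda pt: np.linalg.norm(pt) <= 1.0
# Annulus ("donut"): non-convex because of the hole in the middle.
donut = lambda pt: 0.5 <= np.linalg.norm(pt) <= 1.0

p, q = (-0.9, 0.0), (0.9, 0.0)             # two points on opposite sides of the hole
print(segment_stays_inside(p, q, disc))    # True: the rubber band stays inside
print(segment_stays_inside(p, q, donut))   # False: the segment crosses the hole
```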
Convex Functions
A function is convex if the region above its graph is a convex set. Equivalently:
Definition: Convex Function
A function f is convex if the line segment between any two points on its graph lies above or on the graph:
f(lambda x + (1 - lambda) y) <= lambda f(x) + (1 - lambda) f(y) for all x, y and lambda in [0, 1]
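As a quick sanity check (a numerical sketch, not a proof), we can probe this inequality at random points for x^2 and x^3:

```python
import numpy as np

def violates_convexity(f, trials=10_000, scale=3.0, seed=0):
    """Search for a counterexample to
    f(lam*x + (1-lam)*y) <= lam*f(x) + (1-lam)*f(y)."""
    rng = np.random.default_rng(seed)
    for _ in range(trials):
        x, y = rng.uniform(-scale, scale, size=2)
        lam = rng.uniform(0.0, 1.0)
        if f(lam * x + (1 - lam) * y) > lam * f(x) + (1 - lam) * f(y) + 1e-12:
            return True   # found a violation: not convex
    return False          # no violation found, consistent with convexity

print(violates_convexity(lambda x: x ** 2))  # False: x^2 passes the test
print(violates_convexity(lambda x: x ** 3))  # True: x^3 is non-convex
```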
Convex Functions
- f(x) = x^2 (parabola)
- Norms (e.g., ||x||_1 and ||x||_2)
- Sum of convex functions
Non-Convex Functions
- f(x) = x^3 (cubic)
- Concave functions such as -x^2
- Neural network losses
- Any function with multiple local minima
Strictly Convex
If the inequality is strict (< instead of <=) whenever x != y and 0 < lambda < 1, the function is strictly convex, so its global minimum (if one exists) is unique. f(x) = x^2 is strictly convex; a flat line is convex but not strictly convex.
Interactive: Convex vs Non-Convex
Watch gradient descent navigate these two landscapes. On the convex bowl, it always finds the bottom. On the non-convex surface, it can get trapped in local minima.
[Interactive demo: gradient descent on a convex bowl, where the single minimum is always found, versus a non-convex landscape where it can get stuck, with live readouts of the current position, f(x), and the gradient.]
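For readers without the interactive version, here is a minimal sketch of the same experiment; the bumpy function below is an illustrative choice, not the demo's exact curve:

```python
import numpy as np

def gradient_descent(grad, x0, lr=0.05, steps=500):
    """Vanilla gradient descent: repeatedly step downhill along -grad."""
    x = x0
    for _ in range(steps):
        x = x - lr * grad(x)
    return x

# Convex bowl: f(x) = x^2, so grad f(x) = 2x. Every start reaches x = 0.
bowl_grad = lambda x: 2 * x
print(gradient_descent(bowl_grad, x0=0.8))    # ~0.0
print(gradient_descent(bowl_grad, x0=-5.0))   # ~0.0

# Non-convex example: f(x) = x^2 + 2*sin(3x), grad f(x) = 2x + 6*cos(3x).
# This curve has several valleys, so the answer depends on where you start.
bumpy_grad = lambda x: 2 * x + 6 * np.cos(3 * x)
print(gradient_descent(bumpy_grad, x0=0.0))   # ~ -0.47, the global minimum
print(gradient_descent(bumpy_grad, x0=-2.0))  # ~ -2.3, stuck in a worse valley
```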
The Hessian Test for Convexity
For twice-differentiable functions, there's an easy test using the Hessian matrix.
The Second Derivative Test
A twice-differentiable function f is convex if and only if its Hessian is Positive Semi-Definite (PSD) everywhere.
PSD means: All eigenvalues of H are non-negative. Geometrically, the function curves "upward" (or stays flat) in every direction.
Example: MSE Loss for Linear Regression
Loss: L(w) = (1/n) * ||Xw - y||^2
Hessian: H = (2/n) * X^T X
Since X^T X is always PSD (it's a Gram matrix), the MSE loss is convex: every critical point is a global minimum, and the minimizer is unique whenever X^T X is invertible. This is why linear regression has no bad local minima.
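A quick numerical check of this fact, using a randomly generated design matrix purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))        # n = 100 samples, 5 features
H = (2 / X.shape[0]) * X.T @ X       # Hessian of the MSE loss

eigenvalues = np.linalg.eigvalsh(H)  # eigvalsh: eigenvalues of a symmetric matrix
print(eigenvalues)                   # all non-negative (up to floating-point error)
print(np.all(eigenvalues >= -1e-10)) # True: the Hessian is PSD
```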
Jensen's Inequality
This inequality is a direct consequence of convexity and appears throughout ML, especially in variational inference and the EM algorithm.
Jensen's Inequality
For a convex function f and a random variable X:
f(E[X]) <= E[f(X)]
"The function of the mean is at most the mean of the function."
Worked Example
Let f(x) = x^2 (convex). Let X take values -2 and +2 with equal probability.
Left side: f(E[X])
E[X] = 0, so f(0) = 0
Right side: E[f(X)]
E[X^2] = (4 + 4)/2 = 4
0 <= 4. Jensen's inequality holds.
Why Jensen Matters in ML
In Variational Autoencoders (VAEs), we can't compute the log-evidence log p(x) directly. Jensen's inequality lets us construct the ELBO (Evidence Lower Bound), a tractable lower bound we can maximize instead. The entire VAE framework rests on Jensen.
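A sketch of the standard derivation, assuming a latent-variable model with joint density p(x, z) and an approximate posterior q(z | x):

```latex
\log p(x)
  = \log \int p(x, z)\, dz
  = \log \mathbb{E}_{q(z \mid x)}\!\left[\frac{p(x, z)}{q(z \mid x)}\right]
  \ge \mathbb{E}_{q(z \mid x)}\!\left[\log \frac{p(x, z)}{q(z \mid x)}\right]
  = \text{ELBO}(q)
% Jensen's inequality applied to -log (convex), equivalently to the concave log,
% turns the intractable log p(x) into a tractable lower bound to maximize.
```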
The Convex World: Where Life is Good
In convex optimization, you have mathematical guarantees that would make any theorist weep with joy:
Guarantee #1: Local = Global
Any local minimum is automatically the global minimum. No need to restart with different initializations.
Guarantee #2: Efficient Algorithms
Convex problems can be solved in polynomial time with proven convergence rates.
Guarantee #3: Duality
Strong duality holds. The dual problem has the same optimal value.
Convex ML Models
Linear Regression
MSE loss is quadratic (convex)
Logistic Regression
Cross-entropy loss is convex
Support Vector Machines
Hinge loss is convex
Ridge/Lasso Regression
Convex loss + convex penalty
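To see Guarantee #1 in action, here is a minimal sketch: hand-rolled gradient descent on the convex cross-entropy loss of logistic regression, started from two very different random initializations on synthetic data (all hyperparameters here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))                               # synthetic features
true_w = np.array([1.5, -2.0, 0.5])
y = (X @ true_w + rng.normal(size=200) > 0).astype(float)   # noisy labels

def fit_logistic(X, y, w0, lr=0.5, steps=10_000):
    """Gradient descent on the convex cross-entropy loss of logistic regression."""
    w = w0.copy()
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-X @ w))      # predicted probabilities
        w -= lr * X.T @ (p - y) / len(y)      # gradient of the mean cross-entropy
    return w

# Two different random starting points...
w_a = fit_logistic(X, y, rng.normal(size=3) * 5.0)
w_b = fit_logistic(X, y, rng.normal(size=3) * 5.0)
print(np.allclose(w_a, w_b, atol=1e-6))       # True: both runs find the same model
```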
The Non-Convex Reality: Deep Learning's Jungle
Neural networks are non-convex. The moment you compose linear transformations with nonlinear activations (ReLU, Sigmoid, etc.), convexity is destroyed.
The Hazards
- Local Minima: Valleys that aren't the deepest. Getting stuck here means a suboptimal model.
- Saddle Points: Points where the gradient is zero but that aren't minima. Far more common than local minima in high dimensions (see the sketch after this list).
- Plateaus: Vast flat regions where gradients are tiny. Training stalls for thousands of steps.
- Ill-Conditioning: Loss landscapes that are steep in some directions and flat in others, causing oscillation.
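To make the saddle-point hazard concrete, take f(x, y) = x^2 - y^2: the gradient vanishes at the origin, but the Hessian has one positive and one negative eigenvalue, so the origin is neither a minimum nor a maximum. A tiny illustrative check:

```python
import numpy as np

# f(x, y) = x^2 - y^2: zero gradient at the origin, but it curves
# upward along x and downward along y, the definition of a saddle.
def grad(p):
    x, y = p
    return np.array([2 * x, -2 * y])

hessian = np.array([[2.0, 0.0],
                    [0.0, -2.0]])   # constant Hessian of x^2 - y^2

print(grad(np.zeros(2)))            # zero gradient: a critical point
print(np.linalg.eigvalsh(hessian))  # [-2.  2.]: mixed signs, so a saddle
```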
Why Neural Networks are Non-Convex
Consider a simple 2-layer network, f(x) = w2 * sigma(w1 * x):
- Activation Curvature: The nonlinear activation sigma bends the function.
- Weight Products: The product w1 * w2 creates hyperbolic valleys in the loss surface.
- Symmetry: Swapping neurons gives identical outputs (multiple minima).
[Interactive: the loss landscape over the two weights, and the resulting function approximation. Notice the symmetry: distinct weight settings produce the same output, so the landscape has multiple equivalent minima.]
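Even without the activation, the product structure alone breaks convexity. As an illustrative toy loss (my own example, not the interactive's), take L(w1, w2) = (w1*w2 - 1)^2, whose global minima form a hyperbola rather than a single point:

```python
import numpy as np

# Toy "deep" model: f(x) = w1 * w2 * x, trained to output 1 at x = 1.
loss = lambda w: (w[0] * w[1] - 1) ** 2

# Every point with w1 * w2 = 1 is a global minimum: a whole valley of solutions.
w_a = np.array([2.0, 0.5])
w_b = np.array([0.5, 2.0])
print(loss(w_a), loss(w_b))        # 0.0 0.0

# But the midpoint of two global minima is not a minimum at all,
# violating the convexity inequality from the definition above.
midpoint = 0.5 * (w_a + w_b)       # (1.25, 1.25)
print(loss(midpoint))              # 0.31640625 > 0, so the loss is non-convex
```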
Why Deep Learning Works Anyway
If non-convex optimization is so hard (NP-hard in the worst case), why do neural networks train at all? This was one of the big mysteries of the 2010s deep learning revolution.
Insight #1: High Dimensionality is a Blessing
Most critical points in deep learning are saddle points, not local minima. And saddle points can be escaped.
Insight #2: Local Minima Are Often Good Enough
Most local minima have loss values very close to the global minimum. The "bad" local minima are rare.
Insight #3: Over-Parameterization Creates Easy Paths
Solutions form connected valleys (mode connectivity), and gradient descent can easily slide along them.
Insight #4: SGD Noise Helps
The stochastic noise from mini-batch gradients can shake the optimizer out of shallow local minima.
Practical ML Implications
When to Choose Convex Models
If interpretability and guaranteed convergence matter, consider logistic regression, SVMs, or linear models.
Surviving Non-Convex Training
- Xavier/He initialization (sketched after this list)
- Batch normalization
- Skip connections (ResNets)
- Momentum
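As one example from this list, He initialization for ReLU layers draws weights from N(0, 2 / fan_in) so that activation magnitudes stay roughly constant from layer to layer. A minimal numpy sketch:

```python
import numpy as np

def he_init(fan_in, fan_out, rng):
    """He (Kaiming) initialization for ReLU layers: weights ~ N(0, 2 / fan_in)."""
    return rng.normal(0.0, np.sqrt(2.0 / fan_in), size=(fan_in, fan_out))

rng = np.random.default_rng(0)
x = rng.normal(size=(1000, 256))      # a batch of unit-variance inputs

# Push the batch through a stack of randomly initialized ReLU layers.
for _ in range(5):
    W = he_init(x.shape[1], 256, rng)
    x = np.maximum(0.0, x @ W)        # linear layer followed by ReLU
    print(round(float((x ** 2).mean()), 2))   # mean squared activation stays near 1.0
```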
The Convex Relaxation Trick
"Relax" a non-convex problem into a convex one, solve that, then round back. Used in L1 relaxation of L0.