Optimization

Convex vs Non-Convex Optimization

The single most important distinction in optimization theory. It determines whether you can trust your optimizer to find the best solution.

Introduction

Every optimization problem in machine learning boils down to this: you have a loss function (a landscape of mountains and valleys) and you want to find the lowest point. The question that determines everything is: What does your landscape look like?

If your landscape is a bowl (convex), there is exactly one lowest point, and no matter where you start rolling a ball, it will always end up at the bottom. Life is good.

If your landscape is an egg crate (non-convex), there are countless little dips and valleys. Your ball might roll into a shallow puddle and get stuck there, never finding the deep ocean trench that represents the true best solution.

The Core Distinction

Convex

Every local minimum IS the global minimum.

Non-Convex

Local minima can trap you far from the global best.

Why Should You Care?

This distinction has massive practical implications:

Linear Regression, Logistic Regression, SVMs

These are convex problems. You are guaranteed to find a globally optimal solution. If two people train the same model with different initializations, they reach the same optimal loss, and for strictly convex losses, the exact same final model.
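Here is that guarantee in action, as a minimal sketch (synthetic data and an arbitrary step size are assumed): ridge-regularized least squares is strictly convex, so gradient descent lands on the same weights no matter where it starts.

```python
import numpy as np

# Convexity in action (a minimal sketch: synthetic data, arbitrary step size).
# Ridge-regularized least squares is strictly convex, so gradient descent reaches
# the same weights no matter where it starts.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + 0.1 * rng.normal(size=200)
lam = 0.1

def gradient_descent(w, steps=5000, lr=1e-3):
    for _ in range(steps):
        grad = X.T @ (X @ w - y) + lam * w   # grad of 0.5*||Xw - y||^2 + 0.5*lam*||w||^2
        w = w - lr * grad
    return w

w_a = gradient_descent(10 * rng.normal(size=5))     # two wildly different initializations
w_b = gradient_descent(-10 * rng.normal(size=5))
print(np.allclose(w_a, w_b, atol=1e-4))             # True: same minimizer either way
```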

Neural Networks

These are non-convex. Two people training the same architecture on the same data will get different models depending on random initialization. There are no guarantees.

Understanding convexity explains why SVMs were dominant in the 2000s (mathematical guarantees!) and why deep learning required tricks like careful initialization, batch normalization, and the Adam optimizer to work at all.

Convex Sets

Before we can discuss convex functions, we need to understand convex sets. This is where the geometry begins.

Definition: Convex Set

A set $C$ is convex if for any two points $x, y \in C$, the entire line segment connecting them lies within $C$.

\forall \lambda \in [0,1]: \quad \lambda x + (1-\lambda)y \in C

Convex Sets

  • Circles, spheres
  • Rectangles, cubes
  • Triangles (filled)
  • Half-spaces
  • Intersections of convex sets

Non-Convex Sets

  • Donuts (has a hole)
  • Crescent moons
  • Stars
  • Any shape with an "indent"
  • L-shapes

The "Rubber Band" Test

Imagine stretching a rubber band between any two points in the set. If the rubber band ever "pokes out" of the set, it's not convex. For a convex set, the rubber band always stays inside.
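The definition and the rubber-band test translate directly into a brute-force check. This is a toy sketch (the unit disk and an annulus are assumed example sets): random sampling can only gather evidence for convexity, but a single point that pokes out is definitive proof of non-convexity.

```python
import numpy as np

# A brute-force "rubber band" test (toy sketch): sample pairs of points from a set and
# check that every convex combination lambda*x + (1 - lambda)*y stays inside.
rng = np.random.default_rng(0)

in_disk    = lambda p: np.linalg.norm(p) <= 1.0          # filled unit disk: convex
in_annulus = lambda p: 0.5 <= np.linalg.norm(p) <= 1.0   # donut with a hole: not convex

def looks_convex(contains, n_pairs=2000):
    for _ in range(n_pairs):
        pts = []
        while len(pts) < 2:                   # rejection-sample two members of the set
            p = rng.uniform(-1, 1, size=2)
            if contains(p):
                pts.append(p)
        x, y = pts
        for lam in np.linspace(0.0, 1.0, 21):
            if not contains(lam * x + (1 - lam) * y):
                return False                  # the segment "poked out" of the set
    return True

print(looks_convex(in_disk))      # True
print(looks_convex(in_annulus))   # False
```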

Convex Functions

A function is convex if the region above its graph (its epigraph) is a convex set. Equivalently:

Definition: Convex Function

A function $f$ is convex if the line segment between any two points on its graph lies above or on the graph.

f(\lambda x + (1-\lambda)y) \le \lambda f(x) + (1-\lambda)f(y)

for all $x, y$ and $\lambda \in [0, 1]$

Convex Functions

  • $f(x) = x^2$ (parabola)
  • $f(x) = e^x$
  • $f(x) = |x|$
  • $f(x) = x \log x$
  • Norms: $\|x\|_1$, $\|x\|_2$
  • Sum of convex functions

Non-Convex Functions

  • $f(x) = \sin(x)$
  • $f(x) = x^3$ (cubic)
  • $f(x) = -x^2$ (concave)
  • Neural network losses
  • Any function with multiple local minima

Strictly Convex

If the inequality is strict ($<$ instead of $\le$) for $x \neq y$ and $\lambda \in (0, 1)$, the function is strictly convex. This guarantees at most one global minimum (and exactly one if a minimum exists). $x^2$ is strictly convex; a flat line is convex but not strictly convex.
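The definition can also be checked numerically. The sketch below brute-force tests the inequality on random $(x, y, \lambda)$ triples over an arbitrary range for a few of the functions listed above; as with any sampling test, it can only disprove convexity, never prove it.

```python
import numpy as np

# Brute-force check of the convexity inequality on random (x, y, lambda) triples
# over [-3, 3] (an arbitrary test range).
rng = np.random.default_rng(0)

def violates_convexity(f, trials=10_000, lo=-3.0, hi=3.0):
    x, y = rng.uniform(lo, hi, size=(2, trials))
    lam = rng.uniform(0.0, 1.0, size=trials)
    lhs = f(lam * x + (1 - lam) * y)
    rhs = lam * f(x) + (1 - lam) * f(y)
    return np.any(lhs > rhs + 1e-9)           # True if f(lam*x + (1-lam)*y) <= ... fails

for name, f in [("x^2", np.square), ("e^x", np.exp), ("|x|", np.abs),
                ("sin(x)", np.sin), ("x^3", lambda x: x**3)]:
    print(name, "violates convexity:", violates_convexity(f))
# x^2, e^x, |x| -> False;  sin(x), x^3 -> True
```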

Interactive: Convex vs Non-Convex

Watch gradient descent navigate these two landscapes. On the convex bowl, it always finds the bottom. On the non-convex surface, it can get trapped in local minima.

[Interactive demo: "Convex vs Non-Convex Landscapes". Gradient descent is animated on a convex bowl and on a non-convex surface, with live readouts of the current position, $f(x)$, and gradient. Convex case: only one minimum exists, and gradient descent is guaranteed to find it.]
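The convex half of the demo boils down to a few lines. A minimal sketch, assuming the bowl is $f(x) = x^2$ (which matches the readouts): gradient descent from any starting point slides to the unique minimum at $x = 0$.

```python
# Gradient descent on the convex bowl, assuming f(x) = x^2 to match the demo above.
x, lr = 0.8, 0.1
for _ in range(100):
    grad = 2 * x          # f'(x) = 2x
    x -= lr * grad        # each step multiplies x by (1 - 2*lr) = 0.8
print(round(x, 6))        # 0.0: the unique global minimum, whatever the starting point
```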

The Hessian Test for Convexity

For twice-differentiable functions, there's an easy test using the Hessian matrix.

The Second Derivative Test

A twice-differentiable function $f$ is convex if and only if its Hessian $H$ is positive semi-definite (PSD) everywhere.

H(x) \succeq 0 \quad \forall x

PSD means: all eigenvalues of $H$ are non-negative. Geometrically, the function curves "upward" (or stays flat) in every direction.

Example: MSE Loss for Linear Regression

Loss: $L(w) = \|Xw - y\|^2$

Hessian: $H = 2X^TX$

Since $X^TX$ is always PSD (it's a Gram matrix), MSE loss is convex. This is why linear regression has no spurious local minima, and the solution is unique whenever $X^TX$ is invertible.
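A quick numerical confirmation (a sketch using an arbitrary random design matrix): the eigenvalues of $2X^TX$ are all non-negative, so the Hessian test passes.

```python
import numpy as np

# Numerical version of the Hessian test for MSE loss, using an arbitrary random design
# matrix: the eigenvalues of H = 2 * X^T X are all non-negative, so H is PSD.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 6))
H = 2 * X.T @ X
eigvals = np.linalg.eigvalsh(H)        # eigenvalues of the symmetric matrix H
print(np.all(eigvals >= -1e-10))       # True -> PSD everywhere -> the loss is convex
```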

Jensen's Inequality

This inequality is a direct consequence of convexity and appears throughout ML, especially in variational inference and the EM algorithm.

Jensen's Inequality

For a convex function $f$ and a random variable $X$:

f(E[X]) \le E[f(X)]

"The function of the mean is at most the mean of the function."

Worked Example

Let $f(x) = x^2$ (convex). Let $X$ take values $-2$ and $+2$ with equal probability.

Left side: $f(E[X])$

$E[X] = 0$, so $f(0) = 0$

Right side: $E[f(X)]$

$E[X^2] = (4 + 4)/2 = 4$

$0 \le 4$. Jensen's inequality holds.
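The same check by simulation, as a short sketch (the second distribution is an arbitrary extra example, not from the worked example above):

```python
import numpy as np

# Monte Carlo check of Jensen's inequality f(E[X]) <= E[f(X)].
rng = np.random.default_rng(0)

X = rng.choice([-2.0, 2.0], size=100_000)      # the +/-2 coin flip from the worked example
print(np.mean(X) ** 2, np.mean(X ** 2))        # ~0.0  <=  4.0        (f(x) = x^2)

Y = rng.normal(size=100_000)                   # extra example: f = exp, Y ~ N(0, 1)
print(np.exp(np.mean(Y)), np.mean(np.exp(Y)))  # ~1.0  <=  ~1.65 (= e^0.5)
```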

Why Jensen Matters in ML

In Variational Autoencoders (VAEs), we can't compute $\log p(x)$ directly. Jensen's inequality lets us construct the ELBO (Evidence Lower Bound), a tractable lower bound we can maximize instead. The entire VAE framework rests on Jensen.

The Convex World: Where Life is Good

In convex optimization, you have mathematical guarantees that would make any theorist weep with joy:

Guarantee #1: Local = Global

Any local minimum is automatically the global minimum. No need to restart with different initializations.

Guarantee #2: Efficient Algorithms

Convex problems can be solved in polynomial time with proven convergence rates.

Guarantee #3: Duality

Strong duality holds (under mild conditions such as Slater's condition), so the dual problem has the same optimal value as the primal.

Convex ML Models

Linear Regression

MSE loss is quadratic (convex)

Logistic Regression

Cross-entropy loss is convex

Support Vector Machines

Hinge loss is convex

Ridge/Lasso Regression

Convex loss + convex penalty

The Non-Convex Reality: Deep Learning's Jungle

Neural networks are non-convex. The moment you compose linear transformations with nonlinear activations (ReLU, Sigmoid, etc.), convexity is destroyed.

The Hazards

  • Local Minima: Valleys that aren't the deepest. Getting stuck here means a suboptimal model.
  • Saddle Points: Points where gradient = 0, but they're not minima. Far more common than local minima in high dimensions.
  • Plateaus: Vast flat regions where gradients are tiny. Training stalls for thousands of steps.
  • Ill-Conditioning: Loss landscapes that are steep in some directions and flat in others, causing oscillation.

Why Neural Networks are Non-Convex

Consider a simple 2-layer network: $f(x) = W_2 \sigma(W_1 x)$

  • Activation Curvature: The nonlinear $\sigma$ bends the function.
  • Weight Products: $W_2 W_1$ creates hyperbolic valleys.
  • Symmetry: Swapping neurons gives identical outputs (multiple equivalent minima); see the sketch below.
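Here is the symmetry point as a minimal sketch ($\tanh$ is an assumed choice of $\sigma$, and the layer shapes are arbitrary): permuting hidden units, or flipping the sign of a unit's incoming and outgoing weights, leaves the network's output unchanged.

```python
import numpy as np

# Weight-space symmetry in a tiny 2-layer net f(x) = W2 @ tanh(W1 @ x)
# (tanh is an assumed choice of sigma; the shapes are arbitrary).
rng = np.random.default_rng(0)
W1 = rng.normal(size=(4, 3))             # 3 inputs -> 4 hidden units
W2 = rng.normal(size=(1, 4))             # 4 hidden units -> 1 output
x  = rng.normal(size=3)

def f(W1, W2, x):
    return W2 @ np.tanh(W1 @ x)

perm = rng.permutation(4)                # relabel (permute) the hidden units
flip = np.array([1.0, -1.0, 1.0, -1.0])  # flip the sign of two units' in/out weights

print(f(W1, W2, x))                                      # original network
print(f(W1[perm], W2[:, perm], x))                       # identical output
print(f(flip[:, None] * W1, W2 * flip[None, :], x))      # identical output (tanh is odd)
# Many distinct weight settings compute the same function, so any trained network
# is just one of many equivalent minima of the loss surface.
```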

Loss Landscape

[Interactive demo: the loss surface $L(w_{in}, w_{out})$ of a tiny two-parameter network, shown next to the function it fits to the target data, with live readouts of the current loss and parameters. Notice the symmetry: $(w, -w)$ works the same as $(-w, w)$.]

Why Deep Learning Works Anyway

If non-convex optimization is so hard (NP-hard in the worst case), why do neural networks train at all? This was one of the big mysteries of the 2010s deep learning revolution.

Insight #1: High Dimensionality is a Blessing

Most critical points in deep learning are saddle points, not local minima. And saddle points can be escaped.

Insight #2: Local Minima Are Often Good Enough

Most local minima have loss values very close to the global minimum. The "bad" local minima are rare.

Insight #3: Over-Parameterization Creates Easy Paths

Solutions form connected valleys (mode connectivity), and gradient descent can easily slide along them.

Insight #4: SGD Noise Helps

The stochastic noise from mini-batch gradients can shake the optimizer out of shallow local minima.
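A toy illustration of this insight (everything here is made up for the sketch: a 1-D double-well loss, the step size, and the noise schedule): plain gradient descent started on the right side stalls in the shallow valley, while the same descent with noise injected into the gradient typically hops the barrier and settles in the deeper one.

```python
import numpy as np

# 1-D double-well loss with a shallow minimum near x = +0.93 (loss ~ +0.48) and a
# deeper one near x = -1.06 (loss ~ -0.51), separated by a barrier near x = 0.13.
f      = lambda x: (x**2 - 1)**2 + 0.5 * x
grad_f = lambda x: 4 * x * (x**2 - 1) + 0.5

rng = np.random.default_rng()          # unseeded: the noisy run is genuinely stochastic

def descend(noise_scale, x=1.5, lr=0.05, steps=4000, noisy_steps=2000):
    for t in range(steps):
        noise = noise_scale * rng.normal() if t < noisy_steps else 0.0
        x -= lr * (grad_f(x) + noise)  # noise stands in for mini-batch gradient noise
    return x

x_gd  = descend(noise_scale=0.0)       # deterministic: stuck at the shallow minimum
x_sgd = descend(noise_scale=3.0)       # noisy: usually hops the barrier to the deep one
print(f"plain GD: x = {x_gd:+.3f}, loss = {f(x_gd):+.3f}")
print(f"noisy GD: x = {x_sgd:+.3f}, loss = {f(x_sgd):+.3f}")
```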

Practical ML Implications

When to Choose Convex Models

If interpretability and guaranteed convergence matter, consider logistic regression, SVMs, or linear models.

Surviving Non-Convex Training

The Convex Relaxation Trick

"Relax" a non-convex problem into a convex one, solve that, then round back. Used in L1 relaxation of L0.