Introduction
Before we can discuss derivatives (the engine of optimization) or integrals (the foundation of probability), we must understand the bedrock they stand on: limits.
Calculus is the mathematics of change. But change happens at an instant, and measuring something "at an instant" means dividing a zero change by a zero duration (the indeterminate ratio $\frac{0}{0}$). To bypass this mathematical impossibility, we use limits: asking not "what is the value at this point," but "what value do we approach as we get infinitely close?"
Why Limits Matter for ML
- Derivatives: The definition of the derivative is a limit: $f'(x) = \lim_{h \to 0} \frac{f(x+h) - f(x)}{h}$ (see the sketch after this list).
- Gradient Descent: Requires continuous, differentiable loss functions; continuity and differentiability are both defined via limits.
- Activation Functions: Understanding where functions break (discontinuities) explains ReLU vs Sigmoid choices.
- Convergence: Training loops converge when loss approaches a limit.
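A minimal sketch of the first bullet: computing the derivative of f(x) = x² straight from the limit definition. The use of sympy here is my assumption, chosen purely for illustration.

```python
# Sketch: the derivative of f(x) = x**2, computed as the limit of a difference quotient.
# sympy is assumed here for illustration; any computer algebra system would do.
import sympy as sp

x, h = sp.symbols('x h')
f = x**2

# Difference quotient (f(x + h) - f(x)) / h, then take the limit h -> 0.
difference_quotient = (f.subs(x, x + h) - f) / h
derivative = sp.limit(difference_quotient, h, 0)

print(derivative)             # 2*x
print(derivative.subs(x, 3))  # 6
```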
What is a Limit?
A limit describes the value a function approaches as the input approaches some value; we write $\lim_{x \to c} f(x) = L$. The function doesn't need to actually reach that value; it just needs to get arbitrarily close.
Classic Example
Consider $f(x) = \frac{x^2 - 1}{x - 1}$. If we plug in x = 1, we get $\frac{0}{0}$, which is undefined.
But we can factor: $\frac{x^2 - 1}{x - 1} = \frac{(x - 1)(x + 1)}{x - 1} = x + 1$ for x != 1.
So as x approaches 1, f(x) approaches 2. The limit exists even though f(1) is undefined!
Left-hand Limit
Approaching c from values less than c (from the left on the number line): $\lim_{x \to c^-} f(x)$.
Right-hand Limit
Approaching c from values greater than c (from the right): $\lim_{x \to c^+} f(x)$.
Limit Exists If and Only If
The limit exists if and only if both one-sided limits exist AND are equal: $\lim_{x \to c^-} f(x) = \lim_{x \to c^+} f(x) = L$, in which case $\lim_{x \to c} f(x) = L$.
Interactive: Approaching a Limit
Watch how both sides approach the same value as we get closer to the point. The function has a "hole" at x = 1, but the limit still exists!
[Interactive widget: "Approaching the Limit" graph with a hole at x = 1, a values table, and a "Limit Exists!" indicator.]
Notice that as x gets closer to 1 from BOTH sides, f(x) gets closer to 2.
Even though f(1) is undefined (a hole), we say $\lim_{x \to 1} f(x) = 2$.
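If you don't have the interactive version handy, the same table of values is easy to reproduce numerically:

```python
# Sketch: numerically approaching x = 1 from both sides for f(x) = (x**2 - 1)/(x - 1).
def f(x):
    return (x**2 - 1) / (x - 1)

for step in [0.1, 0.01, 0.001, 0.0001]:
    left, right = 1 - step, 1 + step
    print(f"x = {left:<8} f(x) = {f(left):.6f}   |   x = {right:<8} f(x) = {f(right):.6f}")

# Both columns approach 2, even though f(1) itself raises ZeroDivisionError.
```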
The Epsilon-Delta Definition
The formal definition of a limit is often considered one of the hardest concepts in Calculus 1. But it's actually just a way of describing a guarantee of precision.
The Manufacturing Analogy
Imagine you are manufacturing a high-precision piston (input $x$) that must fit into a cylinder (output $f(x)$).
- 1. The Goal (L): The cylinder has a perfect target width, say 10 cm.
- 2. The Tolerance ($\epsilon$): The customer says, "The width must be within 0.01 cm of the target." This is your error margin.
- 3. The Input Precision ($\delta$): You ask, "How precise must my piston mold be?" Maybe if the mold is within 0.005 cm of the target size, the final piston will be within the customer's tolerance.
The Limit Exists If: No matter how strict the customer's tolerance ($\epsilon$) is, you can always find a high-enough precision for your machine ($\delta$) to satisfy it.
Formal Definition
We say $\lim_{x \to c} f(x) = L$ if:
For every $\epsilon > 0$ (the challenge), there exists a $\delta > 0$ (the response) such that:
$$0 < |x - c| < \delta \implies |f(x) - L| < \epsilon$$
Translation: If the input is within $\delta$ of $c$, then the output is guaranteed to be within $\epsilon$ of $L$.
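Here is a brute-force numerical illustration of the challenge-response game for our earlier example, where the limit is 2 at c = 1. The candidate deltas and sampling strategy are arbitrary choices for demonstration, not a proof:

```python
# Sketch: a brute-force epsilon-delta check for lim_{x->1} (x**2 - 1)/(x - 1) = 2.
# Given an epsilon "challenge", search for a delta "response" by sampling points.
def f(x):
    return (x**2 - 1) / (x - 1)

def find_delta(c, L, epsilon, candidates=(0.5, 0.1, 0.05, 0.01, 0.005, 0.001)):
    for delta in candidates:
        # Sample x values with 0 < |x - c| < delta and check |f(x) - L| < epsilon.
        offsets = [delta * t for t in (0.999, 0.5, 0.001)]
        xs = [c - d for d in offsets] + [c + d for d in offsets]
        if all(abs(f(x) - L) < epsilon for x in xs):
            return delta
    return None

for eps in [0.1, 0.01, 0.001]:
    print(f"epsilon = {eps}: delta = {find_delta(1, 2, eps)}")
```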
Limit Laws
These rules let us compute limits of complex expressions from simpler ones. If $\lim_{x \to c} f(x) = L$ and $\lim_{x \to c} g(x) = M$, then:
- Sum: $\lim_{x \to c} [f(x) + g(x)] = L + M$
- Difference: $\lim_{x \to c} [f(x) - g(x)] = L - M$
- Product: $\lim_{x \to c} [f(x) \cdot g(x)] = L \cdot M$
- Quotient: $\lim_{x \to c} \frac{f(x)}{g(x)} = \frac{L}{M}$, provided $M \neq 0$
- Constant multiple: $\lim_{x \to c} [k \cdot f(x)] = kL$
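As a quick sanity check, the laws can be verified on concrete functions with a computer algebra system; the snippet below uses sympy, my choice for illustration.

```python
# Sketch: the limit laws in action for two simple functions (sympy assumed).
import sympy as sp

x = sp.symbols('x')
f = x**2 + 1   # limit as x -> 2 is 5
g = 3 * x      # limit as x -> 2 is 6

L = sp.limit(f, x, 2)
M = sp.limit(g, x, 2)

print(sp.limit(f + g, x, 2) == L + M)  # True  (sum law)
print(sp.limit(f * g, x, 2) == L * M)  # True  (product law)
print(sp.limit(f / g, x, 2) == L / M)  # True  (quotient law, M != 0)
```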
L'Hopital's Rule
When direct substitution gives an indeterminate form like $\frac{0}{0}$ or $\frac{\infty}{\infty}$, L'Hopital's Rule provides a way forward.
L'Hopital's Rule
If $\lim_{x \to c} \frac{f(x)}{g(x)}$ gives $\frac{0}{0}$ or $\frac{\infty}{\infty}$, then:
$$\lim_{x \to c} \frac{f(x)}{g(x)} = \lim_{x \to c} \frac{f'(x)}{g'(x)}$$
provided the limit on the right exists (or is infinity).
Example
Find $\lim_{x \to 0} \frac{\sin x}{x}$:
Direct substitution: $\frac{\sin 0}{0} = \frac{0}{0}$ (indeterminate)
Apply L'Hopital: $\lim_{x \to 0} \frac{\cos x}{1} = \cos 0 = 1$
This limit, $\lim_{x \to 0} \frac{\sin x}{x} = 1$, is fundamental in deriving the derivative of sine.
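A quick numerical and symbolic check of this limit (sympy is again assumed for the symbolic part):

```python
# Sketch: checking lim_{x->0} sin(x)/x = 1 numerically and symbolically.
import numpy as np
import sympy as sp

for x in [0.1, 0.01, 0.001]:
    print(f"x = {x}: sin(x)/x = {np.sin(x) / x:.8f}")  # approaches 1

t = sp.symbols('t')
print(sp.limit(sp.sin(t) / t, t, 0))                   # 1
```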
Caution
Only apply L'Hopital when you have an indeterminate form! Applying it to a form like $\frac{1}{0}$ (which is just undefined, not indeterminate) gives wrong answers.
Continuity
Intuitively, a function is continuous if you can draw it without lifting your pen. Formally, continuity requires three conditions at a point c:
Three Conditions for Continuity at c
$f(c)$ is defined
No holes at the point.
$\lim_{x \to c} f(x)$ exists
Left and right limits agree.
$\lim_{x \to c} f(x) = f(c)$
The limit equals the function value.
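These three conditions translate directly into a rough numerical test. The sketch below is an illustration with arbitrary step sizes and tolerances, not robust numerics:

```python
# Sketch: a rough numerical check of the three continuity conditions at a point c.
def check_continuity(f, c, h=1e-6, tol=1e-4):
    try:
        value = f(c)                       # 1. f(c) is defined
    except (ZeroDivisionError, ValueError):
        return "f(c) is not defined"
    left = f(c - h)                        # approximate left-hand limit
    right = f(c + h)                       # approximate right-hand limit
    if abs(left - right) > tol:            # 2. the limit must exist
        return "one-sided limits differ: limit does not exist"
    if abs(left - value) > tol:            # 3. the limit must equal f(c)
        return "limit exists but does not equal f(c)"
    return "continuous (numerically) at c"

print(check_continuity(lambda x: x**2, 2.0))                   # continuous
print(check_continuity(lambda x: (x**2 - 1) / (x - 1), 1.0))   # f(c) not defined
print(check_continuity(lambda x: 0.0 if x < 0 else 1.0, 0.0))  # one-sided limits differ
```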
Important Properties
- Polynomials are continuous everywhere.
- Rational functions are continuous except where the denominator equals 0.
- Exponential, logarithmic, and trig functions are continuous on their domains.
- Sums, products, and compositions of continuous functions are continuous.
Types of Discontinuity
Understanding where and how functions "break" is crucial for choosing activation functions and understanding gradient flow.
Removable Discontinuity
The limit exists (left = right), but the function is undefined at the point (or defined to a different value). It can be "fixed" by redefining f(c) = L.
Example: (x^2 - 1)/(x - 1) at x = 1
Why it matters for ML: removable discontinuities are trivial; we just "fill the hole", for example defining 0/0 or 0*log(0) as 0 in entropy calculations.
Jump Discontinuity
Left and right limits both exist but differ. The function "jumps."
Example: the Heaviside step function
Infinite Discontinuity
The function approaches infinity (vertical asymptote), so the limit does not exist.
Example: 1/x at x = 0
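These cases can be distinguished numerically by probing one-sided values near the point. A rough sketch (the thresholds are arbitrary illustration choices, not robust numerics):

```python
# Sketch: classifying a discontinuity at c by probing one-sided values numerically.
def classify(f, c, h=1e-6):
    left, right = f(c - h), f(c + h)
    if abs(left) > 1e5 or abs(right) > 1e5:
        return "infinite (vertical asymptote)"
    if abs(left - right) > 1e-3:
        return "jump (one-sided limits differ)"
    try:
        f(c)
        return "no discontinuity detected"
    except ZeroDivisionError:
        return "removable (limit exists, f(c) undefined)"

print(classify(lambda x: (x**2 - 1) / (x - 1), 1.0))   # removable
print(classify(lambda x: 0.0 if x < 0 else 1.0, 0.0))  # jump
print(classify(lambda x: 1.0 / x, 0.0))                # infinite
```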
Continuity vs Differentiability
This distinction is critical for understanding activation functions in deep learning.
The Hierarchy
Differentiable implies Continuous
If f'(c) exists, then f must be continuous at c. You can't have a derivative at a gap or jump.
Continuous does NOT imply Differentiable
A function can be continuous but have a "corner" where the derivative is undefined.
The Classic Example: |x|
The absolute value function f(x) = |x| is continuous everywhere. But at x = 0:
- Left derivative: $\lim_{h \to 0^-} \frac{|0 + h| - |0|}{h} = \frac{-h}{h} = -1$
- Right derivative: $\lim_{h \to 0^+} \frac{|0 + h| - |0|}{h} = \frac{h}{h} = +1$
Since -1 != +1, the derivative at 0 does not exist. The function has a "corner."
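You can see the corner numerically: the one-sided difference quotients settle on different values.

```python
# Sketch: one-sided difference quotients of f(x) = |x| at 0 disagree, so no derivative exists there.
def f(x):
    return abs(x)

for h in [0.1, 0.01, 0.001]:
    left = (f(0 - h) - f(0)) / (-h)   # left-hand difference quotient  -> -1
    right = (f(0 + h) - f(0)) / h     # right-hand difference quotient -> +1
    print(f"h = {h}: left = {left:.4f}, right = {right:.4f}")
```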
ML Applications
The Death of the Perceptron
Early neural networks used the step function: output 0 if input is negative, 1 otherwise.
Problem: It has a jump discontinuity at 0, where the derivative is undefined, and the derivative is 0 everywhere else, so gradients cannot flow. Backpropagation fails completely. This is why we moved to smooth activations like Sigmoid.
ReLU: The Practical Compromise
ReLU: $f(x) = \max(0, x)$. It is continuous everywhere but has a corner at x = 0 (not differentiable there).
Solution: We use subgradients. We arbitrarily define f'(0) = 0 or 1. Since the probability of x being exactly 0.0000... is negligible, this works perfectly in practice.
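A minimal NumPy sketch of this convention (picking 0 for the gradient at the corner is one common choice, not the only valid one):

```python
# Sketch: ReLU and the conventional "subgradient" choice at the corner x = 0.
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def relu_grad(x):
    # Derivative is 0 for x < 0 and 1 for x > 0; at x = 0 we arbitrarily pick 0.
    return (x > 0).astype(float)

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(x))       # [0.  0.  0.  0.5 2. ]
print(relu_grad(x))  # [0. 0. 0. 1. 1.]
```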
Softmax and Numerical Stability
Softmax: $\mathrm{softmax}(z)_i = \frac{e^{z_i}}{\sum_j e^{z_j}}$. When $z_i$ gets large, $e^{z_i}$ can overflow to infinity. We use the trick of subtracting max(z) from all inputs, which doesn't change the output but keeps values bounded. This is a practical application of limit behavior.
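A minimal sketch of the naive vs. stabilized computation in NumPy:

```python
# Sketch: naive vs numerically stable softmax.
import numpy as np

def softmax_naive(z):
    exp_z = np.exp(z)           # overflows for large z
    return exp_z / exp_z.sum()

def softmax_stable(z):
    shifted = z - np.max(z)     # subtracting max(z) leaves the output unchanged
    exp_z = np.exp(shifted)     # now the largest exponent is exp(0) = 1
    return exp_z / exp_z.sum()

z = np.array([1000.0, 1001.0, 1002.0])
print(softmax_naive(z))   # [nan nan nan] (with an overflow warning)
print(softmax_stable(z))  # [0.09003057 0.24472847 0.66524096]
```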
Loss Function Continuity
MSE Loss is continuous and differentiable everywhere, making gradient descent smooth. Cross-entropy loss is continuous on its domain but diverges to infinity as predictions approach 0 or 1, which is why we clip probabilities away from exactly 0 and 1 (see the sketch below).
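A sketch of the standard clipping fix; the eps value of 1e-7 is an arbitrary illustrative choice:

```python
# Sketch: binary cross-entropy with clipping so log() never sees exactly 0.
import numpy as np

def binary_cross_entropy(y_true, y_pred, eps=1e-7):
    # Clip predictions into [eps, 1 - eps] to avoid log(0) = -inf.
    p = np.clip(y_pred, eps, 1.0 - eps)
    return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

y_true = np.array([1.0, 0.0, 1.0])
y_pred = np.array([1.0, 0.0, 0.5])           # unclipped, the log(0) terms would blow up
print(binary_cross_entropy(y_true, y_pred))  # finite, ~0.231
```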