Introduction
At the heart of machine learning's linear algebra are two fundamental operations: the dot product (alignment) and norms (magnitude).
Every neural network prediction, recommender system suggestion, and regularization penalty uses these operations. Together, they form the foundation of how machines understand similarity, distance, and direction in high-dimensional space.
Why These Matter
Understanding dot products and norms is crucial for modern ML architectures. Matrix multiplication is just batched dot products. Transformer attention computes similarity via scaled dot products. Regularization techniques like Ridge and Lasso use different norms to constrain model complexity.
The Dot Product
The dot product (also called inner product or scalar product) takes two vectors of equal length and returns a single scalar. It is the most fundamental operation in linear algebra.
Multiply corresponding elements, then sum.
Algorithm
a = [1, 3]
b = [4, -2]
a · b = (1)(4) + (3)(-2) = -2
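The same computation in NumPy (the vectors are the example above; NumPy is the library the section recommends later):

```python
import numpy as np

a = np.array([1, 3])
b = np.array([4, -2])

# Multiply corresponding elements, then sum: (1)(4) + (3)(-2) = -2
print(np.dot(a, b))    # -2
print(a @ b)           # same result via the @ operator
print(np.sum(a * b))   # the definition written out: elementwise multiply, then sum
```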
Matrix Notation
A row vector times a column vector: a · b = aᵀb. This generalizes to larger matrices, where each entry of a matrix product is the dot product of a specific row with a specific column.
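A small NumPy sketch of that generalization, with illustrative 2×2 matrices: every entry of the product is one row-column dot product.

```python
import numpy as np

A = np.array([[1, 3],
              [0, 2]])
B = np.array([[4, 1],
              [-2, 5]])

C = A @ B
# Entry C[i, j] is the dot product of row i of A with column j of B.
assert C[0, 1] == A[0, :] @ B[:, 1]
print(C)
```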
Geometric Interpretation
The dot product measures alignment between vectors. It bridges algebra and geometry.
a · b = ‖a‖ ‖b‖ cos θ: the product of the magnitudes times the cosine of the angle between them.
Positive
Vectors point in roughly the same direction (θ < 90°).
Zero
Vectors are orthogonal (perpendicular): θ = 90°.
Negative
Vectors point in roughly opposite directions (θ > 90°).
The Projection View
Think of it as casting a shadow of vector a onto b. The dot product tells you how much of a goes in the direction of b: the scalar projection of a onto b is (a · b) / ‖b‖. This projection view is why dot products measure similarity.
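A short NumPy sketch of the angle and projection computations, reusing the example vectors from above (variable names are illustrative):

```python
import numpy as np

a = np.array([1.0, 3.0])
b = np.array([4.0, -2.0])

dot = a @ b
cos_theta = dot / (np.linalg.norm(a) * np.linalg.norm(b))
theta_deg = np.degrees(np.arccos(np.clip(cos_theta, -1.0, 1.0)))

# Scalar projection: how much of a points along b (negative here, since a . b < 0).
proj_length = dot / np.linalg.norm(b)
# Vector projection: the "shadow" of a on the line through b.
proj_vector = proj_length * b / np.linalg.norm(b)

print(theta_deg, proj_length, proj_vector)
```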
Properties of the Dot Product
Commutative
Order does not matter: a · b = b · a.
Distributive
It distributes over vector addition: a · (b + c) = a · b + a · c.
Self Dot Product
Dotting a vector with itself gives the squared norm (magnitude): a · a = ‖a‖².
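A quick numerical check of all three properties on arbitrary random vectors:

```python
import numpy as np

rng = np.random.default_rng(0)
a, b, c = rng.standard_normal((3, 5))   # three arbitrary 5-dimensional vectors

assert np.isclose(a @ b, b @ a)                   # commutative
assert np.isclose(a @ (b + c), a @ b + a @ c)     # distributive over addition
assert np.isclose(a @ a, np.linalg.norm(a) ** 2)  # self dot product = squared norm
```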
Interactive: Dot Product
An interactive widget (sliders for both vectors) shows how the dot product relates to the angle, its cosine, and the projection, comparing the geometric form a · b = ‖a‖ ‖b‖ cos θ with the algebraic form a · b = Σ aᵢbᵢ.
Vector Norms
A norm measures the "size" or "magnitude" of a vector. Different norms measure size in different ways, each with unique geometric interpretations and ML applications.
L2 Norm (Euclidean)
‖x‖₂ = √(Σ xᵢ²). Straight-line distance. Used in Ridge Regression and KNN.
L1 Norm (Manhattan)
‖x‖₁ = Σ |xᵢ|. Sum of absolute values (taxi-cab distance). Promotes sparsity (Lasso).
L-infinity Norm (Max)
‖x‖∞ = max |xᵢ|. Largest absolute element. Used in adversarial robustness.
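All three norms in NumPy, for an illustrative vector:

```python
import numpy as np

x = np.array([3.0, -4.0, 0.0])

print(np.linalg.norm(x, ord=2))       # 5.0 -> sqrt(3^2 + 4^2 + 0^2)
print(np.linalg.norm(x, ord=1))       # 7.0 -> |3| + |-4| + |0|
print(np.linalg.norm(x, ord=np.inf))  # 4.0 -> largest absolute element
```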
Properties of Norms
For a function to be a valid norm, it must satisfy three axioms:
- Non-negativity: ‖x‖ ≥ 0, and ‖x‖ = 0 only if x is the zero vector.
- Absolute Homogeneity: ‖αx‖ = |α| ‖x‖; scaling the vector scales the norm by the absolute value of the scalar.
- Triangle Inequality: ‖x + y‖ ≤ ‖x‖ + ‖y‖; the direct path is never longer than a detour (checked numerically below).
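A brief numerical check of the last two axioms (the example vectors and the scalar are arbitrary):

```python
import numpy as np

x = np.array([1.0, 2.0, -3.0])
y = np.array([4.0, 0.0, 1.0])

# Triangle inequality: the norm of a sum never exceeds the sum of the norms.
assert np.linalg.norm(x + y) <= np.linalg.norm(x) + np.linalg.norm(y)

# Absolute homogeneity: scaling by -2 scales the norm by |-2| = 2.
assert np.isclose(np.linalg.norm(-2 * x), 2 * np.linalg.norm(x))
```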
The p-Norm Family
All common norms are special cases of the generalized p-norm ‖x‖_p = (Σ |xᵢ|^p)^(1/p), where p ≥ 1.
L1 Norm (Manhattan): p = 1
L2 Norm (Euclidean): p = 2
Higher-order norms: p > 2, which weight large components more heavily
L-infinity (Max): the limit as p → ∞
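A sketch of the general p-norm showing how it approaches the max norm as p grows (the p_norm helper and sample vector are illustrative):

```python
import numpy as np

x = np.array([3.0, -4.0, 1.0])

def p_norm(x, p):
    # Generalized p-norm: (sum of |x_i|^p) ** (1/p)
    return np.sum(np.abs(x) ** p) ** (1.0 / p)

for p in [1, 2, 4, 10, 100]:
    print(p, p_norm(x, p))

# As p grows, the p-norm approaches the largest absolute element (4.0).
print("inf", np.linalg.norm(x, ord=np.inf))
```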
Interactive: Unit Balls
The "unit ball" (vectors with norm = 1) has a different shape for each p-norm. This shape explains why L1 regularization promotes sparsity (sharp corners on axes).
An interactive plot visualizes the unit balls ‖x‖_p = 1 for different p-norms.
Normalization (Unit Vectors)
Normalization converts a vector to a unit vector (norm = 1) pointing in the same direction, isolating direction from magnitude.
Divide by the norm: x̂ = x / ‖x‖.
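Normalization in NumPy (the example vector is illustrative):

```python
import numpy as np

x = np.array([3.0, 4.0])
x_hat = x / np.linalg.norm(x)   # unit vector [0.6, 0.8], same direction as x

print(np.linalg.norm(x_hat))    # 1.0 -- magnitude removed, direction preserved
```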
Applications
- Cosine Similarity: For normalized vectors, the dot product reduces to cosine similarity (see the sketch after this list).
- Batch Norm: Normalizes layer activations to stabilize training.
- Word Embeddings: Often normalized so only direction (meaning) matters, not frequency.
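A minimal cosine-similarity sketch; the cosine_similarity helper and the toy vectors are illustrative, not taken from any real embedding model:

```python
import numpy as np

def cosine_similarity(u, v):
    # For unit vectors this is just u @ v; in general, divide by the norms.
    return (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))

u = np.array([0.8, 0.6, 0.1])   # toy "embedding" vectors
v = np.array([0.7, 0.7, 0.2])
print(cosine_similarity(u, v))  # close to 1: similar direction
```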
ML Applications
Regularization
Penalizing the norm of weights prevents overfitting.
- L1 (Lasso): adds λ‖w‖₁ to the loss, which creates sparse models (feature selection).
- L2 (Ridge): adds λ‖w‖₂² to the loss, which distributes weight values and prevents unstably large weights (a sketch of both penalties follows).
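A rough sketch of how the two penalties enter a training loss; regularized_loss, lam, and kind are illustrative names, not a specific library API:

```python
import numpy as np

def regularized_loss(w, X, y, lam, kind="l2"):
    # Mean squared error plus a norm penalty on the weights.
    mse = np.mean((X @ w - y) ** 2)
    if kind == "l1":
        return mse + lam * np.sum(np.abs(w))   # Lasso: lambda * ||w||_1
    return mse + lam * np.sum(w ** 2)          # Ridge: lambda * ||w||_2^2

w = np.array([0.5, 0.0, -1.2])
X = np.random.randn(10, 3)
y = np.random.randn(10)
print(regularized_loss(w, X, y, lam=0.1, kind="l1"))
```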
Attention Mechanism
Transformer attention is built on scaled dot products: relevance scores between tokens are the dot products of query and key vectors, scaled by √d_k and passed through a softmax.
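A compact NumPy sketch of scaled dot-product attention (single head, no masking; the token count and dimension are illustrative):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    # Relevance scores: dot products of queries with keys, scaled by sqrt(d_k).
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    # Softmax over keys turns scores into attention weights.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

Q = np.random.randn(4, 8)   # 4 tokens, d_k = 8
K = np.random.randn(4, 8)
V = np.random.randn(4, 8)
print(scaled_dot_product_attention(Q, K, V).shape)  # (4, 8)
```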
Distance Metrics
KNN and K-Means rely on the L2 distance ‖x − y‖₂. Using a different norm here changes the algorithm's behavior drastically.
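The same pair of points under different norms (example coordinates are arbitrary):

```python
import numpy as np

x = np.array([1.0, 2.0])
y = np.array([4.0, 6.0])

print(np.linalg.norm(x - y, ord=2))       # 5.0 -- Euclidean (L2) distance
print(np.linalg.norm(x - y, ord=1))       # 7.0 -- Manhattan (L1) distance
print(np.linalg.norm(x - y, ord=np.inf))  # 4.0 -- max (L-infinity) distance
```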
Computational Considerations
Time Complexity
A dot product or norm over n-dimensional vectors costs O(n): one multiply and one add per element.
Hardware
GPUs optimize these O(n) ops via massive parallelism (SIMD). Always use vectorized libraries (PyTorch/NumPy), never loops.
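A rough timing sketch contrasting a Python loop with the vectorized call (exact numbers depend on the machine):

```python
import time
import numpy as np

x = np.random.randn(100_000)
y = np.random.randn(100_000)

t0 = time.perf_counter()
slow = sum(a * b for a, b in zip(x, y))   # Python loop: one element at a time
t1 = time.perf_counter()
fast = x @ y                              # vectorized: dispatched to optimized BLAS
t2 = time.perf_counter()

assert np.isclose(slow, fast)
print(f"loop: {t1 - t0:.4f}s, vectorized: {t2 - t1:.6f}s")
```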