Introduction
At the heart of machine learning's linear algebra are two fundamental operations: the dot product (alignment) and norms (magnitude).
Every neural network prediction, recommender system suggestion, and regularization penalty uses these operations. Together, they form the foundation of how machines understand similarity, distance, and direction in high-dimensional space.
Why These Matter
Understanding dot products and norms is crucial for modern ML architectures. Matrix multiplication is just batched dot products. Transformer attention computes similarity via scaled dot products. Regularization techniques like Ridge and Lasso use different norms to constrain model complexity.
The Dot Product
The dot product (also called inner product or scalar product) takes two vectors of equal length and returns a single scalar. It is the most fundamental operation in linear algebra.
Multiply corresponding elements, then sum.
Algorithm
a = [1, 3]
b = [4, -2]
a · b = (1)(4) + (3)(-2) = -2
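The same computation in NumPy (the vectors are the example above; NumPy is the library the section recommends later):

```python
import numpy as np

a = np.array([1, 3])
b = np.array([4, -2])

# Multiply corresponding elements, then sum: (1)(4) + (3)(-2) = -2
print(np.dot(a, b))    # -2
print(a @ b)           # same result via the @ operator
print(np.sum(a * b))   # the definition written out: elementwise multiply, then sum
```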
Matrix Notation
A row vector times a column vector: a · b = aᵀb. This generalizes to larger matrices, where each entry of a matrix product is the dot product of a specific row with a specific column.
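A small NumPy sketch of that generalization, with illustrative 2×2 matrices: every entry of the product is one row-column dot product.

```python
import numpy as np

A = np.array([[1, 3],
              [0, 2]])
B = np.array([[4, 1],
              [-2, 5]])

C = A @ B
# Entry C[i, j] is the dot product of row i of A with column j of B.
assert C[0, 1] == A[0, :] @ B[:, 1]
print(C)
```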
Geometric Interpretation
The dot product measures alignment between vectors. It bridges algebra and geometry.
a · b = ‖a‖ ‖b‖ cos θ: the product of the magnitudes times the cosine of the angle between them.
Positive
Vectors point in roughly the same direction (θ < 90°).
Zero
Vectors are orthogonal (perpendicular): θ = 90°.
Negative
Vectors point in roughly opposite directions (θ > 90°).
The Projection View
Think of it as casting a shadow of vector a onto b. The dot product tells you how much of a goes in the direction of b: the scalar projection of a onto b is (a · b) / ‖b‖. This projection view is why dot products measure similarity.
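A short NumPy sketch of the angle and projection computations, reusing the example vectors from above (variable names are illustrative):

```python
import numpy as np

a = np.array([1.0, 3.0])
b = np.array([4.0, -2.0])

dot = a @ b
cos_theta = dot / (np.linalg.norm(a) * np.linalg.norm(b))
theta_deg = np.degrees(np.arccos(np.clip(cos_theta, -1.0, 1.0)))

# Scalar projection: how much of a points along b (negative here, since a . b < 0).
proj_length = dot / np.linalg.norm(b)
# Vector projection: the "shadow" of a on the line through b.
proj_vector = proj_length * b / np.linalg.norm(b)

print(theta_deg, proj_length, proj_vector)
```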
Properties of the Dot Product
Commutative
Order does not matter: a · b = b · a.
Distributive
It distributes over vector addition: a · (b + c) = a · b + a · c.
Self Dot Product
Dotting a vector with itself gives the squared norm (magnitude): a · a = ‖a‖².
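A quick numerical check of all three properties on arbitrary random vectors:

```python
import numpy as np

rng = np.random.default_rng(0)
a, b, c = rng.standard_normal((3, 5))   # three arbitrary 5-dimensional vectors

assert np.isclose(a @ b, b @ a)                   # commutative
assert np.isclose(a @ (b + c), a @ b + a @ c)     # distributive over addition
assert np.isclose(a @ a, np.linalg.norm(a) ** 2)  # self dot product = squared norm
```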
Interactive: Dot Product
An interactive widget (sliders for both vectors) shows how the dot product relates to the angle, its cosine, and the projection, comparing the geometric form a · b = ‖a‖ ‖b‖ cos θ with the algebraic form a · b = Σ aᵢbᵢ.
Vector Norms
A norm measures the "size" or "magnitude" of a vector. Different norms measure size in different ways, each with unique geometric interpretations and ML applications.
L2 Norm (Euclidean)
‖x‖₂ = √(Σ xᵢ²). Straight-line distance. Used in Ridge Regression and KNN.
L1 Norm (Manhattan)
‖x‖₁ = Σ |xᵢ|. Sum of absolute values (taxi-cab distance). Promotes sparsity (Lasso).
L-infinity Norm (Max)
‖x‖∞ = max |xᵢ|. Largest absolute element. Used in adversarial robustness.
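All three norms in NumPy, for an illustrative vector:

```python
import numpy as np

x = np.array([3.0, -4.0, 0.0])

print(np.linalg.norm(x, ord=2))       # 5.0 -> sqrt(3^2 + 4^2 + 0^2)
print(np.linalg.norm(x, ord=1))       # 7.0 -> |3| + |-4| + |0|
print(np.linalg.norm(x, ord=np.inf))  # 4.0 -> largest absolute element
```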
Properties of Norms
For a function to be a valid norm, it must satisfy three axioms:
- Non-negativity: ‖x‖ ≥ 0, and ‖x‖ = 0 only if x is the zero vector.
- Absolute Homogeneity: ‖αx‖ = |α| ‖x‖; scaling the vector scales the norm by the absolute value of the scalar.
- Triangle Inequality: ‖x + y‖ ≤ ‖x‖ + ‖y‖; the direct path is never longer than a detour (checked numerically below).
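A brief numerical check of the last two axioms (the example vectors and the scalar are arbitrary):

```python
import numpy as np

x = np.array([1.0, 2.0, -3.0])
y = np.array([4.0, 0.0, 1.0])

# Triangle inequality: the norm of a sum never exceeds the sum of the norms.
assert np.linalg.norm(x + y) <= np.linalg.norm(x) + np.linalg.norm(y)

# Absolute homogeneity: scaling by -2 scales the norm by |-2| = 2.
assert np.isclose(np.linalg.norm(-2 * x), 2 * np.linalg.norm(x))
```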
The p-Norm Family
All common norms are special cases of the generalized p-norm ‖x‖_p = (Σ |xᵢ|^p)^(1/p), where p ≥ 1.
L1 Norm (Manhattan): p = 1
L2 Norm (Euclidean): p = 2
Higher-order norms: p > 2, which weight large components more heavily
L-infinity (Max): the limit as p → ∞
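A sketch of the general p-norm showing how it approaches the max norm as p grows (the p_norm helper and sample vector are illustrative):

```python
import numpy as np

x = np.array([3.0, -4.0, 1.0])

def p_norm(x, p):
    # Generalized p-norm: (sum of |x_i|^p) ** (1/p)
    return np.sum(np.abs(x) ** p) ** (1.0 / p)

for p in [1, 2, 4, 10, 100]:
    print(p, p_norm(x, p))

# As p grows, the p-norm approaches the largest absolute element (4.0).
print("inf", np.linalg.norm(x, ord=np.inf))
```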
Interactive: Unit Balls
The "unit ball" (vectors with norm = 1) has a different shape for each p-norm. This shape explains why L1 regularization promotes sparsity (sharp corners on axes).
An interactive plot visualizes the unit balls ‖x‖_p = 1 for different p-norms.
Normalization (Unit Vectors)
Normalization converts a vector to a unit vector (norm = 1) pointing in the same direction, isolating direction from magnitude.
Divide by the norm: x̂ = x / ‖x‖.
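Normalization in NumPy (the example vector is illustrative):

```python
import numpy as np

x = np.array([3.0, 4.0])
x_hat = x / np.linalg.norm(x)   # unit vector [0.6, 0.8], same direction as x

print(np.linalg.norm(x_hat))    # 1.0 -- magnitude removed, direction preserved
```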
Applications
- Cosine Similarity: For normalized vectors, the dot product reduces to cosine similarity (see the sketch after this list).
- Batch Norm: Normalizes layer activations to stabilize training.
- Word Embeddings: Often normalized so only direction (meaning) matters, not frequency.
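A minimal cosine-similarity sketch; the cosine_similarity helper and the toy vectors are illustrative, not taken from any real embedding model:

```python
import numpy as np

def cosine_similarity(u, v):
    # For unit vectors this is just u @ v; in general, divide by the norms.
    return (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))

u = np.array([0.8, 0.6, 0.1])   # toy "embedding" vectors
v = np.array([0.7, 0.7, 0.2])
print(cosine_similarity(u, v))  # close to 1: similar direction
```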
ML Applications
Regularization
Penalizing the norm of weights prevents overfitting.
- L1 (Lasso): adds λ‖w‖₁ to the loss, which creates sparse models (feature selection).
- L2 (Ridge): adds λ‖w‖₂² to the loss, which distributes weight values and prevents unstably large weights (a sketch of both penalties follows).
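A rough sketch of how the two penalties enter a training loss; regularized_loss, lam, and kind are illustrative names, not a specific library API:

```python
import numpy as np

def regularized_loss(w, X, y, lam, kind="l2"):
    # Mean squared error plus a norm penalty on the weights.
    mse = np.mean((X @ w - y) ** 2)
    if kind == "l1":
        return mse + lam * np.sum(np.abs(w))   # Lasso: lambda * ||w||_1
    return mse + lam * np.sum(w ** 2)          # Ridge: lambda * ||w||_2^2

w = np.array([0.5, 0.0, -1.2])
X = np.random.randn(10, 3)
y = np.random.randn(10)
print(regularized_loss(w, X, y, lam=0.1, kind="l1"))
```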
Attention Mechanism
Transformer attention is built on scaled dot products: relevance scores between tokens are the dot products of query and key vectors, scaled by √d_k and passed through a softmax.
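A compact NumPy sketch of scaled dot-product attention (single head, no masking; the token count and dimension are illustrative):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    # Relevance scores: dot products of queries with keys, scaled by sqrt(d_k).
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    # Softmax over keys turns scores into attention weights.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

Q = np.random.randn(4, 8)   # 4 tokens, d_k = 8
K = np.random.randn(4, 8)
V = np.random.randn(4, 8)
print(scaled_dot_product_attention(Q, K, V).shape)  # (4, 8)
```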
Distance Metrics
KNN and K-Means rely on the L2 distance ‖x − y‖₂. Using a different norm here changes the algorithm's behavior drastically.
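The same pair of points under different norms (example coordinates are arbitrary):

```python
import numpy as np

x = np.array([1.0, 2.0])
y = np.array([4.0, 6.0])

print(np.linalg.norm(x - y, ord=2))       # 5.0 -- Euclidean (L2) distance
print(np.linalg.norm(x - y, ord=1))       # 7.0 -- Manhattan (L1) distance
print(np.linalg.norm(x - y, ord=np.inf))  # 4.0 -- max (L-infinity) distance
```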
Computational Considerations
Time Complexity
A dot product or norm over n-dimensional vectors costs O(n): one multiply and one add per element.
Hardware
GPUs optimize these O(n) ops via massive parallelism (SIMD). Always use vectorized libraries (PyTorch/NumPy), never loops.
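A rough timing sketch contrasting a Python loop with the vectorized call (exact numbers depend on the machine):

```python
import time
import numpy as np

x = np.random.randn(100_000)
y = np.random.randn(100_000)

t0 = time.perf_counter()
slow = sum(a * b for a, b in zip(x, y))   # Python loop: one element at a time
t1 = time.perf_counter()
fast = x @ y                              # vectorized: dispatched to optimized BLAS
t2 = time.perf_counter()

assert np.isclose(slow, fast)
print(f"loop: {t1 - t0:.4f}s, vectorized: {t2 - t1:.6f}s")
```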