Introduction
In the GNN introduction, we saw that GNNs learn by aggregating information from neighbors. Message Passing is the formal framework that unifies virtually all GNN architectures.
The idea is elegant: each node sends "messages" to its neighbors, collects messages from its neighbors, and uses them to update its own representation. Different GNN architectures (GCN, GAT, GraphSAGE, GIN) are just different implementations of this same idea.
Why It Works
Message passing is a form of inductive bias for neural networks. By forcing nodes to learn from their local neighborhoods, we inject the prior knowledge that "connected things are related." This is similar to how CNNs inject the prior that "nearby pixels are related."
The MPNN Framework
The Message Passing Neural Network (MPNN) framework, introduced by Gilmer et al. (2017), provides a unified view. One layer of message passing consists of three steps:
General Message Passing Layer
Step 1: MESSAGE
Creates a "message" from neighbor u to target v using their features and edge attributes
Step 2: AGGREGATE
Combines all incoming messages (must be permutation-invariant: sum, mean, max, attention)
Step 3: UPDATE
Merges aggregated message with node's current features (usually an MLP or GRU)
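A minimal sketch of one such layer in plain PyTorch is shown below. The class name `MPNNLayer`, the sum aggregator, and the concatenation-based update are illustrative choices, not part of the framework itself.

```python
import torch
import torch.nn as nn

class MPNNLayer(nn.Module):
    """Illustrative message-passing layer: MESSAGE -> AGGREGATE -> UPDATE."""
    def __init__(self, dim):
        super().__init__()
        self.msg_fn = nn.Linear(2 * dim, dim)          # MESSAGE: built from (h_u, h_v) pairs
        self.update_fn = nn.Sequential(                # UPDATE: MLP on [h_v || aggregated]
            nn.Linear(2 * dim, dim), nn.ReLU()
        )

    def forward(self, h, edge_index):
        # h: [num_nodes, dim]; edge_index: [2, num_edges] of (source u, target v) pairs
        src, dst = edge_index
        # Step 1: MESSAGE -- one message per edge, from both endpoint features
        messages = self.msg_fn(torch.cat([h[src], h[dst]], dim=-1))
        # Step 2: AGGREGATE -- permutation-invariant sum of incoming messages per target
        agg = torch.zeros_like(h).index_add_(0, dst, messages)
        # Step 3: UPDATE -- combine the aggregated message with the node's own features
        return self.update_fn(torch.cat([h, agg], dim=-1))
```

In PyTorch Geometric the same pattern corresponds to subclassing `MessagePassing` and overriding its `message`, `aggregate`, and `update` hooks.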
The Unifying Pattern
Every GNN architecture follows this pattern. They differ only in how they implement MESSAGE, AGGREGATE, and UPDATE. This abstraction makes it easy to understand and compare different approaches.
Aggregation Functions
The aggregation function must be permutation-invariant: the result should not depend on the order in which we process neighbors. This is a fundamental requirement for graph neural networks.
SUM
Preserves neighborhood size information. Captures "how much" signal.
Different scales for different degrees. Needs normalization.
Used by: GIN, original spectral GNNs
MEAN
Normalizes for degree. Stable across graph sizes.
Loses neighborhood size information.
MAX
Captures strongest signal. Robust to outliers.
Loses information from non-max neighbors.
Used by: GraphSAGE-pool, PointNet
ATTENTION (Weighted Sum)
Learns which neighbors are important. Adaptive weighting.
More parameters, slower, can overfit on small graphs.
Used by: GAT, Graph Transformers
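All of the aggregators above share the permutation-invariance property required earlier. The toy PyTorch check below (hand-picked message values) verifies that sum, mean, and max give the same result regardless of neighbor order, while concatenating messages in visit order does not.

```python
import torch

# Messages from three neighbors of some node v (one row per neighbor).
msgs = torch.tensor([[1.0, 2.0],
                     [3.0, 0.0],
                     [0.5, 4.0]])
perm = msgs[[2, 0, 1]]  # same multiset of messages, visited in a different order

# Permutation-invariant aggregators: the result does not depend on neighbor order.
aggregators = {
    "sum":  lambda m: m.sum(dim=0),
    "mean": lambda m: m.mean(dim=0),
    "max":  lambda m: m.max(dim=0).values,
}
for name, agg in aggregators.items():
    assert torch.allclose(agg(msgs), agg(perm)), name

# Concatenating in visit order is NOT permutation-invariant: the output would
# depend on an arbitrary ordering of neighbors.
assert not torch.equal(msgs.flatten(), perm.flatten())
```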
Update Functions
The update function combines the aggregated message with the node's own features. The choice of update function affects both expressivity and trainability.
Simple: Replace
Just use the message, ignore self. Risk: lose self-information over layers.
Concatenate + MLP
Concatenate self and message, let MLP learn combination. Most flexible.
Skip Connection
Residual connection. Essential for deep GNNs (helps gradient flow).
GRU Update
Gated recurrent unit. Used in GGNN for sequential message passing.
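The four update styles above each fit in a line or two of PyTorch. The sketch below assumes the aggregated message `m` has already been computed; the layer sizes and activation choices are arbitrary.

```python
import torch
import torch.nn as nn

dim = 16
h = torch.randn(5, dim)   # current node features
m = torch.randn(5, dim)   # aggregated messages (output of the AGGREGATE step)

# Replace: use the message only -- self-information can wash out over layers.
h_replace = torch.relu(nn.Linear(dim, dim)(m))

# Concatenate + MLP: most flexible, lets the network learn how to mix self and message.
mlp = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU())
h_concat = mlp(torch.cat([h, m], dim=-1))

# Skip connection: residual update, helps gradients flow in deeper GNNs.
h_skip = h + torch.relu(nn.Linear(dim, dim)(m))

# GRU update (as in GGNN): treat the message as the input and h as the hidden state.
gru = nn.GRUCell(input_size=dim, hidden_size=dim)
h_gru = gru(m, h)
```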
The Over-Smoothing Problem
Deep GNNs (many layers) suffer from over-smoothing: all node representations converge toward the same value. Skip connections and careful normalization (such as BatchNorm) help, but in practice over-smoothing limits typical GNN depth to 2-4 layers.
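A toy demonstration of the effect, under the simplifying assumption of no learned weights (just repeated mean aggregation over a small hand-written graph):

```python
import torch

# 4-node toy graph, dense adjacency (no learned weights -- aggregation only).
A = torch.tensor([[0, 1, 1, 0],
                  [1, 0, 1, 0],
                  [1, 1, 0, 1],
                  [0, 0, 1, 0]], dtype=torch.float)
A_hat = A + torch.eye(4)                      # add self-loops
P = A_hat / A_hat.sum(dim=1, keepdim=True)    # row-normalized mean aggregation

h = torch.randn(4, 3)
for _ in range(50):
    h = P @ h                                 # one round of mean message passing

# After many rounds the rows are nearly identical: over-smoothing.
print(h.std(dim=0))                           # close to zero in every feature dimension
```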
GCN: Graph Convolutional Network
GCN (Kipf & Welling, 2017) is the foundational modern GNN. It's derived from spectral graph theory but has a simple spatial interpretation:
Where à = A + I (add self-loops) and D̃ is the degree matrix of Ã.
In message passing terms:
MESSAGE: Linear transform of neighbor feature: m = W · h_u
AGGREGATE: Normalized sum with symmetric weighting: m_v = Σ_{u∈N(v)} (1 / √(d̃_u · d̃_v)) · W·h_u
UPDATE: Include self-loop (via Ã), then apply nonlinearity σ
Key Insight: Symmetric Normalization
The normalization prevents high-degree nodes from dominating. It's derived from the normalized graph Laplacian and ensures stable learning across graphs with varying degree distributions.
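A dense-adjacency sketch of one GCN layer, transcribing the formula above with σ taken as ReLU (in practice you would use sparse operations or a library layer such as PyTorch Geometric's `GCNConv`):

```python
import torch
import torch.nn as nn

class GCNLayer(nn.Module):
    """One GCN layer: H' = ReLU(D̃^-1/2 Ã D̃^-1/2 H W), written with a dense adjacency."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.W = nn.Linear(in_dim, out_dim, bias=False)

    def forward(self, H, A):
        A_hat = A + torch.eye(A.size(0), device=A.device)    # Ã = A + I (self-loops)
        d = A_hat.sum(dim=1)                                  # degrees of Ã
        D_inv_sqrt = torch.diag(d.pow(-0.5))                  # D̃^-1/2
        A_norm = D_inv_sqrt @ A_hat @ D_inv_sqrt              # symmetric normalization
        return torch.relu(A_norm @ self.W(H))                 # aggregate, then nonlinearity
```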
GraphSAGE
GraphSAGE (Hamilton et al., 2017) introduced two key innovations: neighborhood sampling for scalability and explicit self-loop handling via concatenation.
Neighborhood Sampling
Instead of using all neighbors, sample a fixed number (e.g., 10-25). This makes the per-node cost constant rather than proportional to degree, enabling training on massive graphs.
Inductive Learning
Learns an aggregator function, not node-specific embeddings. Can generalize to completely unseen nodes and graphs at test time.
Aggregator options: MEAN (simple average), LSTM (order neighbors randomly for sequence), MAX-POOL (apply MLP element-wise then max).
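A simplified GraphSAGE-mean layer is sketched below. The neighbor-dictionary representation and the per-node Python loop are for readability only; the concatenation with the node's own features and the final L2 normalization follow the original paper.

```python
import random
import torch
import torch.nn as nn

class SAGEMeanLayer(nn.Module):
    """GraphSAGE-style layer: sample neighbors, mean-aggregate, concatenate with self."""
    def __init__(self, dim, num_samples=10):
        super().__init__()
        self.num_samples = num_samples
        self.lin = nn.Linear(2 * dim, dim)

    def forward(self, h, neighbors):
        # h: [num_nodes, dim]; neighbors: dict {node_id: list of neighbor ids} (assumed given)
        out = []
        for v in range(h.size(0)):
            nbrs = neighbors.get(v, [])
            if len(nbrs) > self.num_samples:                   # fixed-size neighbor sample
                nbrs = random.sample(nbrs, self.num_samples)
            agg = h[nbrs].mean(dim=0) if nbrs else torch.zeros_like(h[v])
            out.append(torch.cat([h[v], agg]))                 # concat self with aggregate
        out = self.lin(torch.stack(out))
        return out / out.norm(dim=1, keepdim=True).clamp(min=1e-6)  # L2-normalize
```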
GAT: Graph Attention Network
GAT (Veličković et al., 2018) uses attention mechanisms to learn which neighbors are most important for each node. This is a major departure from fixed aggregation weights.
Attention coefficient: e_vu = LeakyReLU( aᵀ [ W·h_v ‖ W·h_u ] )
Softmax normalization: α_vu = exp(e_vu) / Σ_{k∈N(v)} exp(e_vk)
Weighted aggregation: h'_v = σ( Σ_{u∈N(v)} α_vu · W·h_u )
Multi-Head Attention
Like Transformers, GAT uses K independent attention heads and concatenates (or averages) their outputs. This stabilizes training and captures different "aspects" of neighbor importance. Typically K = 8 for intermediate layers.
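A single-head GAT layer on a dense adjacency might look like the sketch below; multi-head attention would run K copies and concatenate (or average) their outputs. The all-pairs expansion is O(N²) and is used here only for clarity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GATLayer(nn.Module):
    """Single-head GAT layer on a dense adjacency (illustrative, not optimized)."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.W = nn.Linear(in_dim, out_dim, bias=False)
        self.a = nn.Linear(2 * out_dim, 1, bias=False)   # attention vector a

    def forward(self, h, A):
        Wh = self.W(h)                                   # [N, out_dim]
        N = Wh.size(0)
        # e_vu = LeakyReLU(a^T [W h_v || W h_u]) for every pair (v, u)
        pairs = torch.cat([Wh.unsqueeze(1).expand(N, N, -1),
                           Wh.unsqueeze(0).expand(N, N, -1)], dim=-1)
        e = F.leaky_relu(self.a(pairs).squeeze(-1), negative_slope=0.2)
        # Mask non-edges (keep self-loops), then softmax over each node's neighborhood
        mask = (A + torch.eye(N, device=A.device)) > 0
        e = e.masked_fill(~mask, float("-inf"))
        alpha = torch.softmax(e, dim=1)                  # α_vu
        return F.elu(alpha @ Wh)                         # h'_v = σ(Σ_u α_vu W h_u)
```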
Interactive: Attention Weights
See how attention mechanisms dynamically weight neighbors. Unlike GCN's fixed normalization, GAT learns which connections matter most.
In the demo, GAT learns to weight neighbors based on feature similarity (simplified there as dot-product similarity).
GIN: Graph Isomorphism Network
GIN (Xu et al., 2019) asks a fundamental question: what's the most expressive possible GNN? The answer: one that's as powerful as the Weisfeiler-Lehman (WL) graph isomorphism test.
h'_v = MLP( (1 + ε) · h_v + Σ_{u∈N(v)} h_u ), where ε is a learnable parameter (or a fixed small value like 0).
Key Theoretical Insights
- SUM not MEAN: Sum preserves multiset cardinality; mean loses it. Critical for distinguishing graphs.
- MLP not linear: Needs a universal approximator (MLP) to be injective. Linear transforms can't distinguish all neighborhoods.
- (1 + ε) term: Distinguishes self from sum of neighbors. Without it, some non-isomorphic graphs look identical.
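The GIN update translates almost line-for-line into code. A dense-adjacency sketch, with an arbitrary two-layer MLP and a learnable ε:

```python
import torch
import torch.nn as nn

class GINLayer(nn.Module):
    """GIN layer: h'_v = MLP((1 + ε) · h_v + Σ_u h_u), over a dense adjacency."""
    def __init__(self, dim):
        super().__init__()
        self.eps = nn.Parameter(torch.zeros(1))   # learnable ε (freeze it for the fixed-ε variant)
        self.mlp = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, h, A):
        neighbor_sum = A @ h                      # SUM aggregation (A has no self-loops here)
        return self.mlp((1 + self.eps) * h + neighbor_sum)
```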
Readout & Graph-Level Tasks
For graph classification or regression, we need a single vector representing the entire graph. This is done via a READOUT (or pooling) function that aggregates node embeddings.
Global Sum/Mean
Simplest approach. Sum (or mean) all final node embeddings. Works surprisingly well.
Hierarchical Pooling
Learn to coarsen graph iteratively (DiffPool, MinCutPool). More expressive but complex.
Set2Set
LSTM-based attention over all nodes. Order-invariant but sequential processing.
Virtual Node
Add a "super node" connected to all others. Its embedding represents the graph.
Architecture Comparison
| Model | Aggregation | Self-Loop | Expressivity | Best For |
|---|---|---|---|---|
| GCN | Normalized sum | In Ã matrix | Low (below 1-WL) | Semi-supervised node classification |
| GraphSAGE | Mean/Max/LSTM | Concatenate | Medium | Large-scale, inductive learning |
| GAT | Attention (learned) | In attention | Medium | Heterogeneous neighbor importance |
| GIN | Sum + MLP | (1+ε) weighted | High (WL-equivalent) | Graph classification, maximum expressivity |
Choosing the Right Architecture
GCN for node classification with smooth labels. GraphSAGE for massive graphs or inductive settings. GAT when neighbor importance varies (e.g., molecular graphs). GIN for graph-level tasks requiring maximum expressivity.
Interactive: All Architectures
Compare all four GNN architectures side-by-side. Select an architecture to see its specific message passing mechanism, aggregation function, and update rule on the same graph.
GNN Architecture Comparison
Select an architecture to see how it processes graph information. Click nodes to change target.
| Feature | GCN | SAGE | GAT | GIN | MPNN |
|---|---|---|---|---|---|
| Aggregation | Norm Sum | Mean/Max | Attention | Sum | Any |
| Inductive? | ❌ | ✅ | ✅ | ✅ | ✅ |
| Weights | Fixed | Fixed | Learned | Fixed | Variable |
| Expressivity | Low | Medium | Medium | High | High |