Message Passing Neural Networks

The fundamental paradigm behind all modern GNN architectures. Understand this, and you understand GNNs.

Introduction

In the GNN introduction, we saw that GNNs learn by aggregating information from neighbors. Message Passing is the formal framework that unifies virtually all GNN architectures.

The idea is elegant: each node sends "messages" to its neighbors, collects messages from its neighbors, and uses them to update its own representation. Different GNN architectures (GCN, GAT, GraphSAGE, GIN) are just different implementations of this same idea.

Why It Works

Message passing is a form of inductive bias. By forcing nodes to learn from their local neighborhoods, we inject the prior knowledge that "connected things are related," much as CNNs inject the prior that "nearby pixels are related."

The MPNN Framework

The Message Passing Neural Network (MPNN) framework, introduced by Gilmer et al. (2017), provides a unified view. One layer of message passing consists of three steps:

General Message Passing Layer

Step 1: MESSAGE

m_{v \leftarrow u} = \text{MESSAGE}(h_u^{(l)}, h_v^{(l)}, e_{uv})

Creates a "message" from neighbor u to target v using their features and edge attributes

Step 2: AGGREGATE

m_v = \text{AGGREGATE}(\{m_{v \leftarrow u} : u \in \mathcal{N}(v)\})

Combines all incoming messages (must be permutation-invariant: sum, mean, max, attention)

Step 3: UPDATE

h_v^{(l+1)} = \text{UPDATE}(h_v^{(l)}, m_v)

Merges aggregated message with node's current features (usually an MLP or GRU)

The Unifying Pattern

Every GNN architecture follows this pattern. They differ only in how they implement MESSAGE, AGGREGATE, and UPDATE. This abstraction makes it easy to understand and compare different approaches.
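To make the three steps concrete, here is a minimal NumPy sketch of one generic message-passing layer. The function and argument names (message_passing_layer, message_fn, aggregate_fn, update_fn, adj) are illustrative, not from any particular library:

```python
import numpy as np

def message_passing_layer(H, adj, message_fn, aggregate_fn, update_fn):
    """One generic MPNN layer.
    H: (n, d) node features, row v = h_v.  adj: (n, n) adjacency matrix."""
    n, d = H.shape
    H_next = []
    for v in range(n):
        neighbors = np.nonzero(adj[:, v])[0]                    # nodes u with an edge u -> v
        msgs = [message_fn(H[u], H[v]) for u in neighbors]      # Step 1: MESSAGE
        m_v = aggregate_fn(msgs) if msgs else np.zeros(d)       # Step 2: AGGREGATE (zero message if isolated)
        H_next.append(update_fn(H[v], m_v))                     # Step 3: UPDATE
    return np.stack(H_next)

# A simple instantiation: identity messages, mean aggregation,
# concatenate-then-linear update with tanh.
rng = np.random.default_rng(0)
H = rng.normal(size=(5, 4))
A = (rng.random((5, 5)) < 0.4).astype(float)
W = rng.normal(size=(8, 4)) * 0.5
H_out = message_passing_layer(
    H, A,
    message_fn=lambda h_u, h_v: h_u,
    aggregate_fn=lambda msgs: np.mean(msgs, axis=0),
    update_fn=lambda h_v, m_v: np.tanh(np.concatenate([h_v, m_v]) @ W),
)
print(H_out.shape)  # (5, 4)
```

Swapping out the three callables is exactly how the architectures below differ from one another.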

Aggregation Functions

The aggregation function must be permutation-invariant: the result should not depend on the order in which we process neighbors. This is a fundamental requirement for graph neural networks.

SUM

m_v = \sum_{u \in \mathcal{N}(v)} m_{v \leftarrow u}
Pros:

Preserves neighborhood size information. Captures "how much" signal.

Cons:

Different scales for different degrees. Needs normalization.

Used by: GIN, original spectral GNNs

MEAN

m_v = \frac{1}{|\mathcal{N}(v)|} \sum_{u \in \mathcal{N}(v)} m_{v \leftarrow u}
Pros:

Normalizes for degree. Stable across graph sizes.

Cons:

Loses neighborhood size information.

Used by: GCN, GraphSAGE-mean

MAX

m_v = \max_{u \in \mathcal{N}(v)} m_{v \leftarrow u}
Pros:

Captures strongest signal. Robust to outliers.

Cons:

Loses information from non-max neighbors.

Used by: GraphSAGE-pool, PointNet

ATTENTION (Weighted Sum)

m_v = \sum_{u \in \mathcal{N}(v)} \alpha_{vu} \cdot m_{v \leftarrow u}
Pros:

Learns which neighbors are important. Adaptive weighting.

Cons:

More parameters, slower, can overfit on small graphs.

Used by: GAT, Graph Transformers
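A small sketch of the four aggregators over a stack of incoming messages. Everything here is illustrative; in particular, the attention variant simplifies GAT's learned scoring to a plain dot product with the target's features:

```python
import numpy as np

def aggregate(msgs, h_v=None, mode="sum"):
    """Permutation-invariant aggregation over a (k, d) stack of incoming messages."""
    if mode == "sum":            # preserves neighborhood size ("how much" signal)
        return msgs.sum(axis=0)
    if mode == "mean":           # degree-normalized, loses size information
        return msgs.mean(axis=0)
    if mode == "max":            # element-wise strongest signal
        return msgs.max(axis=0)
    if mode == "attention":      # weights from softmaxed similarity to the target h_v
        scores = msgs @ h_v
        alpha = np.exp(scores - scores.max())
        alpha /= alpha.sum()
        return alpha @ msgs      # weighted sum
    raise ValueError(mode)

msgs = np.array([[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]])
h_v = np.array([1.0, 0.0])
for mode in ("sum", "mean", "max", "attention"):
    print(mode, aggregate(msgs, h_v, mode))
```

Shuffling the rows of msgs leaves every output unchanged, which is the permutation-invariance requirement in action.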

Update Functions

The update function combines the aggregated message with the node's own features. The choice of update function affects both expressivity and trainability.

Simple: Replace

h'_v = \sigma(W \cdot m_v)

Just use the message, ignore self. Risk: lose self-information over layers.

Concatenate + MLP

h'_v = \text{MLP}([h_v \, \| \, m_v])

Concatenate self and message, let MLP learn combination. Most flexible.

Skip Connection

h'_v = h_v + \sigma(W \cdot m_v)

Residual connection. Essential for deep GNNs (helps gradient flow).

GRU Update

h'_v = \text{GRU}(h_v, m_v)

Gated recurrent unit. Used in GGNN for sequential message passing.
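Three of these update rules in a NumPy sketch (the weight shapes and the tanh nonlinearity are placeholders, and the GRU variant is omitted):

```python
import numpy as np

rng = np.random.default_rng(1)
d = 4
W = rng.normal(size=(d, d)) * 0.1          # illustrative weight matrices
W_cat = rng.normal(size=(2 * d, d)) * 0.1
sigma = np.tanh

def update_replace(h_v, m_v):
    return sigma(m_v @ W)                   # ignores h_v: self-information can fade over layers

def update_concat_mlp(h_v, m_v):
    return sigma(np.concatenate([h_v, m_v]) @ W_cat)   # single linear layer standing in for an MLP

def update_skip(h_v, m_v):
    return h_v + sigma(m_v @ W)             # residual connection keeps h_v around

h_v, m_v = rng.normal(size=d), rng.normal(size=d)
print(update_replace(h_v, m_v), update_concat_mlp(h_v, m_v), update_skip(h_v, m_v))
```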

The Over-Smoothing Problem

Deep GNNs (many layers) suffer from over-smoothing: node representations converge toward the same value. Skip connections and careful normalization (e.g., BatchNorm) help, but over-smoothing still limits typical GNN depth to 2-4 layers.

GCN: Graph Convolutional Network

GCN (Kipf & Welling, 2017) is the foundational modern GNN. It's derived from spectral graph theory but has a simple spatial interpretation:

H^{(l+1)} = \sigma\left( \tilde{D}^{-1/2} \tilde{A} \tilde{D}^{-1/2} H^{(l)} W^{(l)} \right)

where Ã = A + I (add self-loops) and D̃ is the degree matrix of Ã.

In message passing terms:

MESSAGE: Linear transform of neighbor feature: m = W · h_u

AGGREGATE: Normalized sum with symmetric weighting: \frac{1}{\sqrt{d_u \cdot d_v}}

UPDATE: Include self-loop (via Ã), then apply nonlinearity σ

Key Insight: Symmetric Normalization

The 1/\sqrt{d_u \cdot d_v} normalization prevents high-degree nodes from dominating. It's derived from the normalized graph Laplacian and ensures stable learning across graphs with varying degree distributions.
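A dense-matrix NumPy transcription of this propagation rule. Real implementations use sparse operations, and the choice of ReLU for σ is an assumption here:

```python
import numpy as np

def gcn_layer(H, A, W):
    """One GCN layer: H' = ReLU(D̃^{-1/2} Ã D̃^{-1/2} H W)."""
    A_tilde = A + np.eye(A.shape[0])              # Ã = A + I  (add self-loops)
    d = A_tilde.sum(axis=1)                       # degrees of Ã
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    A_hat = D_inv_sqrt @ A_tilde @ D_inv_sqrt     # symmetric normalization
    return np.maximum(0.0, A_hat @ H @ W)         # σ = ReLU

# Tiny 3-node path graph with 2-dimensional features.
A = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)
H = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
W = np.array([[0.5, -0.2], [0.3, 0.8]])
print(gcn_layer(H, A, W))
```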

GraphSAGE

GraphSAGE (Hamilton et al., 2017) introduced two key innovations: neighborhood sampling for scalability and explicit self-loop handling via concatenation.

h_v^{(l+1)} = \sigma\left( W \cdot \text{CONCAT}\left( h_v^{(l)}, \text{AGG}(\{h_u^{(l)} : u \in \mathcal{N}(v)\}) \right) \right)

Neighborhood Sampling

Instead of using all neighbors, sample a fixed number (e.g., 10-25). This makes the per-node cost O(1) instead of O(degree), enabling training on massive graphs.

Inductive Learning

Learns an aggregator function, not node-specific embeddings. Can generalize to completely unseen nodes and graphs at test time.

Aggregator options: MEAN (simple average), LSTM (order neighbors randomly for sequence), MAX-POOL (apply MLP element-wise then max).
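A minimal sketch of a GraphSAGE-mean layer with neighborhood sampling. The names (sage_mean_layer, adj_list, num_samples) are illustrative; the L2 normalization of outputs follows the paper, while the other details are assumptions:

```python
import numpy as np

def sage_mean_layer(H, adj_list, W, num_samples=2, rng=None):
    """One GraphSAGE-mean layer with neighborhood sampling.
    adj_list: dict node -> list of neighbor indices.
    W: (2*d_in, d_out), applied to [h_v || mean of sampled neighbors]."""
    rng = rng or np.random.default_rng(0)
    H_next = []
    for v in range(H.shape[0]):
        nbrs = adj_list[v]
        if len(nbrs) > num_samples:                           # fixed-size neighbor sample
            nbrs = rng.choice(nbrs, size=num_samples, replace=False)
        agg = H[nbrs].mean(axis=0) if len(nbrs) else np.zeros(H.shape[1])
        h = np.maximum(0.0, np.concatenate([H[v], agg]) @ W)  # CONCAT, transform, σ = ReLU
        H_next.append(h / (np.linalg.norm(h) + 1e-8))         # L2-normalize, as in the paper
    return np.stack(H_next)

H = np.eye(4)                                  # 4 nodes with one-hot features
adj = {0: [1, 2], 1: [0, 2, 3], 2: [0, 1, 3], 3: [1, 2]}
W = np.random.default_rng(0).normal(size=(8, 3)) * 0.5
print(sage_mean_layer(H, adj, W).shape)        # (4, 3)
```

Because the layer is just an aggregator plus shared weights, it can be applied to nodes never seen during training, which is what makes GraphSAGE inductive.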

GAT: Graph Attention Network

GAT (Veličković et al., 2018) uses attention mechanisms to learn which neighbors are most important for each node. This is a major departure from fixed aggregation weights.

Attention coefficient:

e_{vu} = \text{LeakyReLU}\left( \mathbf{a}^T [W h_v \| W h_u] \right)

Softmax normalization:

\alpha_{vu} = \frac{\exp(e_{vu})}{\sum_{u' \in \mathcal{N}(v)} \exp(e_{vu'})}

Weighted aggregation:

h_v^{(l+1)} = \sigma\left( \sum_{u \in \mathcal{N}(v)} \alpha_{vu} W h_u \right)

Multi-Head Attention

Like Transformers, GAT uses K independent attention heads and concatenates (or averages) their outputs. This stabilizes training and captures different "aspects" of neighbor importance. Typically K = 8 for intermediate layers.
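A single-head NumPy sketch of these three steps. The attention set also includes the node itself (GAT adds self-loops); σ is taken as ReLU here, whereas the paper uses ELU, and multi-head attention would simply run K copies and concatenate their outputs:

```python
import numpy as np

def gat_layer(H, adj, W, a, negative_slope=0.2):
    """Single-head GAT layer (sketch).  W: (d_in, d_out), a: (2*d_out,)."""
    Z = H @ W                                          # shared transform W h for every node
    H_next = np.zeros_like(Z)
    for v in range(H.shape[0]):
        nbrs = np.append(np.nonzero(adj[:, v])[0], v)  # neighbors plus self
        # e_{vu} = LeakyReLU(a^T [W h_v || W h_u])
        e = np.array([np.concatenate([Z[v], Z[u]]) @ a for u in nbrs])
        e = np.where(e > 0, e, negative_slope * e)     # LeakyReLU
        alpha = np.exp(e - e.max())
        alpha /= alpha.sum()                           # softmax over the neighborhood
        H_next[v] = np.maximum(0.0, alpha @ Z[nbrs])   # weighted sum, then σ
    return H_next

rng = np.random.default_rng(0)
A = np.array([[0, 1, 1], [1, 0, 1], [1, 1, 0]], dtype=float)
H = rng.normal(size=(3, 4))
W = rng.normal(size=(4, 2)) * 0.5
a = rng.normal(size=(4,)) * 0.5
print(gat_layer(H, A, W, a))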

Interactive: Attention Weights

See how attention mechanisms dynamically weight neighbors. Unlike GCN's fixed normalization, GAT learns which connections matter most.

GAT learns to weight neighbors based on feature similarity. In the demo, the attention score is simplified to a dot-product similarity between the target's features (query) and each neighbor's features (keys), then softmax-normalized into the weights α.

GIN: Graph Isomorphism Network

GIN (Xu et al., 2019) asks a fundamental question: what's the most expressive possible GNN? The answer: one that's as powerful as the Weisfeiler-Lehman (WL) graph isomorphism test.

h_v^{(l+1)} = \text{MLP}\left( (1 + \epsilon) \cdot h_v^{(l)} + \sum_{u \in \mathcal{N}(v)} h_u^{(l)} \right)

ε is a learnable parameter (or a fixed value, typically 0).

Key Theoretical Insights

  • SUM not MEAN: Sum preserves multiset cardinality; mean loses it. Critical for distinguishing graphs.
  • MLP not linear: Needs a universal approximator (MLP) to be injective. Linear transforms can't distinguish all neighborhoods.
  • (1 + ε) term: Distinguishes self from sum of neighbors. Without it, some non-isomorphic graphs look identical.
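A NumPy sketch of one GIN layer with a two-layer MLP. Here ε is passed as a fixed value; making it learnable and choosing the MLP widths are implementation details left open:

```python
import numpy as np

def gin_layer(H, A, W1, b1, W2, b2, eps=0.0):
    """One GIN layer: MLP((1 + ε) · h_v + Σ_{u∈N(v)} h_u), with a 2-layer MLP."""
    agg = (1.0 + eps) * H + A @ H                # sum over neighbors + weighted self term
    hidden = np.maximum(0.0, agg @ W1 + b1)      # first MLP layer with ReLU
    return hidden @ W2 + b2                      # second MLP layer

rng = np.random.default_rng(0)
A = np.array([[0, 1, 1], [1, 0, 0], [1, 0, 0]], dtype=float)
H = rng.normal(size=(3, 4))
W1, b1 = rng.normal(size=(4, 8)) * 0.5, np.zeros(8)
W2, b2 = rng.normal(size=(8, 4)) * 0.5, np.zeros(4)
print(gin_layer(H, A, W1, b1, W2, b2, eps=0.1).shape)   # (3, 4)
```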

Readout & Graph-Level Tasks

For graph classification or regression, we need a single vector representing the entire graph. This is done via a READOUT (or pooling) function that aggregates node embeddings.

h_G = \text{READOUT}(\{h_v^{(L)} : v \in V\})

Global Sum/Mean

Simplest approach. Sum (or mean) all final node embeddings. Works surprisingly well.

Hierarchical Pooling

Learn to coarsen graph iteratively (DiffPool, MinCutPool). More expressive but complex.

Set2Set

LSTM-based attention over all nodes. Order-invariant but sequential processing.

Virtual Node

Add a "super node" connected to all others. Its embedding represents the graph.

Architecture Comparison

Model | Aggregation | Self-Loop | Expressivity | Best For
GCN | Normalized sum | In Ã matrix | Low (1-WL) | Semi-supervised node classification
GraphSAGE | Mean/Max/LSTM | Concatenate | Medium | Large-scale, inductive learning
GAT | Attention (learned) | In attention | Medium | Heterogeneous neighbor importance
GIN | Sum + MLP | (1+ε)-weighted | High (WL-equivalent) | Graph classification, maximum expressivity

Choosing the Right Architecture

GCN for node classification with smooth labels. GraphSAGE for massive graphs or inductive settings. GAT when neighbor importance varies (e.g., molecular graphs). GIN for graph-level tasks requiring maximum expressivity.

Interactive: All Architectures

Compare all four GNN architectures side-by-side. Select an architecture to see its specific message passing mechanism, aggregation function, and update rule on the same graph.

GNN Architecture Comparison

Feature | GCN | SAGE | GAT | GIN | MPNN
Aggregation | Norm Sum | Mean/Max | Attention | Sum | Any
Weights | Fixed | Fixed | Learned | Fixed | Variable
Expressivity | Low | Medium | Medium | High | High