Message Passing Neural Networks

The fundamental paradigm behind all modern GNN architectures. Understand this, and you understand GNNs.

Introduction

In the GNN introduction, we saw that GNNs learn by aggregating information from neighbors. Message Passing is the formal framework that unifies virtually all GNN architectures.

The idea is elegant: each node sends "messages" to its neighbors, collects messages from its neighbors, and uses them to update its own representation. Different GNN architectures (GCN, GAT, GraphSAGE, GIN) are just different implementations of this same idea.

Why It Works

Message passing is a form of inductive bias. By forcing nodes to learn from their local neighborhoods, we inject the prior knowledge that "connected things are related," much as CNNs inject the prior that "nearby pixels are related."

The MPNN Framework

The Message Passing Neural Network (MPNN) framework, introduced by Gilmer et al. (2017), provides a unified view. One layer of message passing consists of three steps:

General Message Passing Layer

Step 1: MESSAGE

m_{v \leftarrow u} = \text{MESSAGE}(h_u^{(l)}, h_v^{(l)}, e_{uv})

Creates a "message" from neighbor u to target v using their features and edge attributes

Step 2: AGGREGATE

m_v = \text{AGGREGATE}(\{m_{v \leftarrow u} : u \in \mathcal{N}(v)\})

Combines all incoming messages (must be permutation-invariant: sum, mean, max, attention)

Step 3: UPDATE

h_v^{(l+1)} = \text{UPDATE}(h_v^{(l)}, m_v)

Merges aggregated message with node's current features (usually an MLP or GRU)

The Unifying Pattern

Every GNN architecture follows this pattern. They differ only in how they implement MESSAGE, AGGREGATE, and UPDATE. This abstraction makes it easy to understand and compare different approaches.
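To make the three steps concrete, here is a minimal NumPy sketch of one generic message-passing layer. The function and argument names (message_passing_layer, message_fn, aggregate_fn, update_fn, adj) are illustrative, not from any particular library:

```python
import numpy as np

def message_passing_layer(H, adj, message_fn, aggregate_fn, update_fn):
    """One generic MPNN layer.
    H: (n, d) node features, row v = h_v.  adj: (n, n) adjacency matrix."""
    n, d = H.shape
    H_next = []
    for v in range(n):
        neighbors = np.nonzero(adj[:, v])[0]                    # nodes u with an edge u -> v
        msgs = [message_fn(H[u], H[v]) for u in neighbors]      # Step 1: MESSAGE
        m_v = aggregate_fn(msgs) if msgs else np.zeros(d)       # Step 2: AGGREGATE (zero message if isolated)
        H_next.append(update_fn(H[v], m_v))                     # Step 3: UPDATE
    return np.stack(H_next)

# A simple instantiation: identity messages, mean aggregation,
# concatenate-then-linear update with tanh.
rng = np.random.default_rng(0)
H = rng.normal(size=(5, 4))
A = (rng.random((5, 5)) < 0.4).astype(float)
W = rng.normal(size=(8, 4)) * 0.5
H_out = message_passing_layer(
    H, A,
    message_fn=lambda h_u, h_v: h_u,
    aggregate_fn=lambda msgs: np.mean(msgs, axis=0),
    update_fn=lambda h_v, m_v: np.tanh(np.concatenate([h_v, m_v]) @ W),
)
print(H_out.shape)  # (5, 4)
```

Swapping out the three callables is exactly how the architectures below differ from one another.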

Aggregation Functions

The aggregation function must be permutation-invariant: the result should not depend on the order in which we process neighbors. This is a fundamental requirement for graph neural networks.

SUM

m_v = \sum_{u \in \mathcal{N}(v)} m_{v \leftarrow u}
Pros:

Preserves neighborhood size information. Captures "how much" signal.

Cons:

Different scales for different degrees. Needs normalization.

Used by: GIN, original spectral GNNs

MEAN

m_v = \frac{1}{|\mathcal{N}(v)|} \sum_{u \in \mathcal{N}(v)} m_{v \leftarrow u}
Pros:

Normalizes for degree. Stable across graph sizes.

Cons:

Loses neighborhood size information.

Used by: GCN, GraphSAGE-mean

MAX

m_v = \max_{u \in \mathcal{N}(v)} m_{v \leftarrow u}
Pros:

Captures strongest signal. Robust to outliers.

Cons:

Loses information from non-max neighbors.

Used by: GraphSAGE-pool, PointNet

ATTENTION (Weighted Sum)

m_v = \sum_{u \in \mathcal{N}(v)} \alpha_{vu} \cdot m_{v \leftarrow u}
Pros:

Learns which neighbors are important. Adaptive weighting.

Cons:

More parameters, slower, can overfit on small graphs.

Used by: GAT, Graph Transformers
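A small sketch of the four aggregators over a stack of incoming messages. Everything here is illustrative; in particular, the attention variant simplifies GAT's learned scoring to a plain dot product with the target's features:

```python
import numpy as np

def aggregate(msgs, h_v=None, mode="sum"):
    """Permutation-invariant aggregation over a (k, d) stack of incoming messages."""
    if mode == "sum":            # preserves neighborhood size ("how much" signal)
        return msgs.sum(axis=0)
    if mode == "mean":           # degree-normalized, loses size information
        return msgs.mean(axis=0)
    if mode == "max":            # element-wise strongest signal
        return msgs.max(axis=0)
    if mode == "attention":      # weights from softmaxed similarity to the target h_v
        scores = msgs @ h_v
        alpha = np.exp(scores - scores.max())
        alpha /= alpha.sum()
        return alpha @ msgs      # weighted sum
    raise ValueError(mode)

msgs = np.array([[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]])
h_v = np.array([1.0, 0.0])
for mode in ("sum", "mean", "max", "attention"):
    print(mode, aggregate(msgs, h_v, mode))
```

Shuffling the rows of msgs leaves every output unchanged, which is the permutation-invariance requirement in action.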

Update Functions

The update function combines the aggregated message with the node's own features. The choice of update function affects both expressivity and trainability.

Simple: Replace

h'_v = \sigma(W \cdot m_v)

Just use the message, ignore self. Risk: lose self-information over layers.

Concatenate + MLP

h'_v = \text{MLP}([h_v \, \| \, m_v])

Concatenate self and message, let MLP learn combination. Most flexible.

Skip Connection

h'_v = h_v + \sigma(W \cdot m_v)

Residual connection. Essential for deep GNNs (helps gradient flow).

GRU Update

h'_v = \text{GRU}(h_v, m_v)

Gated recurrent unit. Used in GGNN for sequential message passing.
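Three of these update rules in a NumPy sketch (the weight shapes and the tanh nonlinearity are placeholders, and the GRU variant is omitted):

```python
import numpy as np

rng = np.random.default_rng(1)
d = 4
W = rng.normal(size=(d, d)) * 0.1          # illustrative weight matrices
W_cat = rng.normal(size=(2 * d, d)) * 0.1
sigma = np.tanh

def update_replace(h_v, m_v):
    return sigma(m_v @ W)                   # ignores h_v: self-information can fade over layers

def update_concat_mlp(h_v, m_v):
    return sigma(np.concatenate([h_v, m_v]) @ W_cat)   # single linear layer standing in for an MLP

def update_skip(h_v, m_v):
    return h_v + sigma(m_v @ W)             # residual connection keeps h_v around

h_v, m_v = rng.normal(size=d), rng.normal(size=d)
print(update_replace(h_v, m_v), update_concat_mlp(h_v, m_v), update_skip(h_v, m_v))
```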

The Over-Smoothing Problem

Deep GNNs (many layers) suffer from over-smoothing: node representations converge toward the same value. Skip connections and careful normalization (e.g., BatchNorm) help, but over-smoothing still limits typical GNN depth to 2-4 layers.

GCN: Graph Convolutional Network

GCN (Kipf & Welling, 2017) is the foundational modern GNN. It's derived from spectral graph theory but has a simple spatial interpretation:

H^{(l+1)} = \sigma\left( \tilde{D}^{-1/2} \tilde{A} \tilde{D}^{-1/2} H^{(l)} W^{(l)} \right)

where Ã = A + I (add self-loops) and D̃ is the degree matrix of Ã.

In message passing terms:

MESSAGE: Linear transform of neighbor feature: m = W · h_u

AGGREGATE: Normalized sum with symmetric weighting: \frac{1}{\sqrt{d_u \cdot d_v}}

UPDATE: Include self-loop (via Ã), then apply nonlinearity σ

Key Insight: Symmetric Normalization

The 1/\sqrt{d_u \cdot d_v} normalization prevents high-degree nodes from dominating. It's derived from the normalized graph Laplacian and ensures stable learning across graphs with varying degree distributions.
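A dense-matrix NumPy transcription of this propagation rule. Real implementations use sparse operations, and the choice of ReLU for σ is an assumption here:

```python
import numpy as np

def gcn_layer(H, A, W):
    """One GCN layer: H' = ReLU(D̃^{-1/2} Ã D̃^{-1/2} H W)."""
    A_tilde = A + np.eye(A.shape[0])              # Ã = A + I  (add self-loops)
    d = A_tilde.sum(axis=1)                       # degrees of Ã
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    A_hat = D_inv_sqrt @ A_tilde @ D_inv_sqrt     # symmetric normalization
    return np.maximum(0.0, A_hat @ H @ W)         # σ = ReLU

# Tiny 3-node path graph with 2-dimensional features.
A = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)
H = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
W = np.array([[0.5, -0.2], [0.3, 0.8]])
print(gcn_layer(H, A, W))
```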

GraphSAGE

GraphSAGE (Hamilton et al., 2017) introduced two key innovations: neighborhood sampling for scalability and explicit self-loop handling via concatenation.

h_v^{(l+1)} = \sigma\left( W \cdot \text{CONCAT}\left( h_v^{(l)}, \text{AGG}(\{h_u^{(l)} : u \in \mathcal{N}(v)\}) \right) \right)

Neighborhood Sampling

Instead of using all neighbors, sample a fixed number (e.g., 10-25). This makes the per-node cost O(1) instead of O(degree), enabling training on massive graphs.

Inductive Learning

Learns an aggregator function, not node-specific embeddings. Can generalize to completely unseen nodes and graphs at test time.

Aggregator options: MEAN (simple average), LSTM (order neighbors randomly for sequence), MAX-POOL (apply MLP element-wise then max).
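A minimal sketch of a GraphSAGE-mean layer with neighborhood sampling. The names (sage_mean_layer, adj_list, num_samples) are illustrative; the L2 normalization of outputs follows the paper, while the other details are assumptions:

```python
import numpy as np

def sage_mean_layer(H, adj_list, W, num_samples=2, rng=None):
    """One GraphSAGE-mean layer with neighborhood sampling.
    adj_list: dict node -> list of neighbor indices.
    W: (2*d_in, d_out), applied to [h_v || mean of sampled neighbors]."""
    rng = rng or np.random.default_rng(0)
    H_next = []
    for v in range(H.shape[0]):
        nbrs = adj_list[v]
        if len(nbrs) > num_samples:                           # fixed-size neighbor sample
            nbrs = rng.choice(nbrs, size=num_samples, replace=False)
        agg = H[nbrs].mean(axis=0) if len(nbrs) else np.zeros(H.shape[1])
        h = np.maximum(0.0, np.concatenate([H[v], agg]) @ W)  # CONCAT, transform, σ = ReLU
        H_next.append(h / (np.linalg.norm(h) + 1e-8))         # L2-normalize, as in the paper
    return np.stack(H_next)

H = np.eye(4)                                  # 4 nodes with one-hot features
adj = {0: [1, 2], 1: [0, 2, 3], 2: [0, 1, 3], 3: [1, 2]}
W = np.random.default_rng(0).normal(size=(8, 3)) * 0.5
print(sage_mean_layer(H, adj, W).shape)        # (4, 3)
```

Because the layer is just an aggregator plus shared weights, it can be applied to nodes never seen during training, which is what makes GraphSAGE inductive.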

GAT: Graph Attention Network

GAT (Veličković et al., 2018) uses attention mechanisms to learn which neighbors are most important for each node. This is a major departure from fixed aggregation weights.

Attention coefficient:

e_{vu} = \text{LeakyReLU}\left( \mathbf{a}^T [W h_v \| W h_u] \right)

Softmax normalization:

\alpha_{vu} = \frac{\exp(e_{vu})}{\sum_{u' \in \mathcal{N}(v)} \exp(e_{vu'})}

Weighted aggregation:

h_v^{(l+1)} = \sigma\left( \sum_{u \in \mathcal{N}(v)} \alpha_{vu} W h_u \right)

Multi-Head Attention

Like Transformers, GAT uses K independent attention heads and concatenates (or averages) their outputs. This stabilizes training and captures different "aspects" of neighbor importance. Typically K = 8 for intermediate layers.
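A single-head NumPy sketch of these three steps. The attention set also includes the node itself (GAT adds self-loops); σ is taken as ReLU here, whereas the paper uses ELU, and multi-head attention would simply run K copies and concatenate their outputs:

```python
import numpy as np

def gat_layer(H, adj, W, a, negative_slope=0.2):
    """Single-head GAT layer (sketch).  W: (d_in, d_out), a: (2*d_out,)."""
    Z = H @ W                                          # shared transform W h for every node
    H_next = np.zeros_like(Z)
    for v in range(H.shape[0]):
        nbrs = np.append(np.nonzero(adj[:, v])[0], v)  # neighbors plus self
        # e_{vu} = LeakyReLU(a^T [W h_v || W h_u])
        e = np.array([np.concatenate([Z[v], Z[u]]) @ a for u in nbrs])
        e = np.where(e > 0, e, negative_slope * e)     # LeakyReLU
        alpha = np.exp(e - e.max())
        alpha /= alpha.sum()                           # softmax over the neighborhood
        H_next[v] = np.maximum(0.0, alpha @ Z[nbrs])   # weighted sum, then σ
    return H_next

rng = np.random.default_rng(0)
A = np.array([[0, 1, 1], [1, 0, 1], [1, 1, 0]], dtype=float)
H = rng.normal(size=(3, 4))
W = rng.normal(size=(4, 2)) * 0.5
a = rng.normal(size=(4,)) * 0.5
print(gat_layer(H, A, W, a))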

Interactive: Attention Weights

See how attention mechanisms dynamically weight neighbors. Unlike GCN's fixed normalization, GAT learns which connections matter most.

GAT learns to weight neighbors based on feature similarity. In the demo, the attention score is simplified to a dot-product similarity between the target's features (query) and each neighbor's features (keys), then softmax-normalized into the weights α.

GIN: Graph Isomorphism Network

GIN (Xu et al., 2019) asks a fundamental question: what's the most expressive possible GNN? The answer: one that's as powerful as the Weisfeiler-Lehman (WL) graph isomorphism test.

h_v^{(l+1)} = \text{MLP}\left( (1 + \epsilon) \cdot h_v^{(l)} + \sum_{u \in \mathcal{N}(v)} h_u^{(l)} \right)

ε is a learnable parameter (or a fixed value, typically 0).

Key Theoretical Insights

  • SUM not MEAN: Sum preserves multiset cardinality; mean loses it. Critical for distinguishing graphs.
  • MLP not linear: Needs a universal approximator (MLP) to be injective. Linear transforms can't distinguish all neighborhoods.
  • (1 + ε) term: Distinguishes self from sum of neighbors. Without it, some non-isomorphic graphs look identical.
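A NumPy sketch of one GIN layer with a two-layer MLP. Here ε is passed as a fixed value; making it learnable and choosing the MLP widths are implementation details left open:

```python
import numpy as np

def gin_layer(H, A, W1, b1, W2, b2, eps=0.0):
    """One GIN layer: MLP((1 + ε) · h_v + Σ_{u∈N(v)} h_u), with a 2-layer MLP."""
    agg = (1.0 + eps) * H + A @ H                # sum over neighbors + weighted self term
    hidden = np.maximum(0.0, agg @ W1 + b1)      # first MLP layer with ReLU
    return hidden @ W2 + b2                      # second MLP layer

rng = np.random.default_rng(0)
A = np.array([[0, 1, 1], [1, 0, 0], [1, 0, 0]], dtype=float)
H = rng.normal(size=(3, 4))
W1, b1 = rng.normal(size=(4, 8)) * 0.5, np.zeros(8)
W2, b2 = rng.normal(size=(8, 4)) * 0.5, np.zeros(4)
print(gin_layer(H, A, W1, b1, W2, b2, eps=0.1).shape)   # (3, 4)
```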

Readout & Graph-Level Tasks

For graph classification or regression, we need a single vector representing the entire graph. This is done via a READOUT (or pooling) function that aggregates node embeddings.

h_G = \text{READOUT}(\{h_v^{(L)} : v \in V\})

Global Sum/Mean

Simplest approach. Sum (or mean) all final node embeddings. Works surprisingly well.

Hierarchical Pooling

Learn to coarsen graph iteratively (DiffPool, MinCutPool). More expressive but complex.

Set2Set

LSTM-based attention over all nodes. Order-invariant but sequential processing.

Virtual Node

Add a "super node" connected to all others. Its embedding represents the graph.

Architecture Comparison

Model | Aggregation | Self-Loop | Expressivity | Best For
GCN | Normalized sum | In Ã matrix | Low (1-WL) | Semi-supervised node classification
GraphSAGE | Mean/Max/LSTM | Concatenate | Medium | Large-scale, inductive learning
GAT | Attention (learned) | In attention | Medium | Heterogeneous neighbor importance
GIN | Sum + MLP | (1+ε)-weighted | High (WL-equivalent) | Graph classification, maximum expressivity

Choosing the Right Architecture

GCN for node classification with smooth labels. GraphSAGE for massive graphs or inductive settings. GAT when neighbor importance varies (e.g., molecular graphs). GIN for graph-level tasks requiring maximum expressivity.

Interactive: All Architectures

Compare all four GNN architectures side-by-side. Select an architecture to see its specific message passing mechanism, aggregation function, and update rule on the same graph.

GNN Architecture Comparison

Feature | GCN | SAGE | GAT | GIN | MPNN
Aggregation | Norm Sum | Mean/Max | Attention | Sum | Any
Weights | Fixed | Fixed | Learned | Fixed | Variable
Expressivity | Low | Medium | Medium | High | High