Computational Graphs & Backpropagation

The fundamental architecture that powers modern deep learning systems.

Introduction

In a perfect world, training a neural network would be instantaneous. Every computation would happen simultaneously, gradients would flow backwards effortlessly, and we would have infinite computational resources. Reality, however, demands a more elegant solution.

This is where computational graphs come in. They are the fundamental data structure that modern deep learning frameworks use to organize, optimize, and execute the complex web of calculations required to train neural networks. Understanding them is key to understanding how frameworks like TensorFlow and PyTorch actually work under the hood.

Why This Matters

Computational graphs are not just implementation details: they fundamentally shape how we can train and deploy models. They enable automatic differentiation, distributed training, and efficient memory management.

What Are Computational Graphs?

A computational graph is a directed graph where:

Nodes

Represent operations (addition, multiplication, activation functions) or variables

Edges

Represent the flow of data (tensors, scalars, gradients) between operations

Dependencies

Determine execution order: a node cannot execute until all its parent nodes complete

The Key Insight

Computations do not always need to happen sequentially. Some operations are independent and can run in parallel. A computational graph automatically identifies these opportunities, managing dependencies and parallelizing execution across threads or distributed systems.

Think of it like a recipe. Some steps must happen in order (you cannot bake before mixing), but others can happen simultaneously (chopping vegetables while water boils). The graph structure encodes these dependencies naturally.

A Simple Graph Example

Let us build a computational graph for a simple linear transformation:

z = (x * w) + b

[Diagram: the input x and weight w feed a multiply node (*); its result and the bias b feed an add node (+), producing the output z. Nodes are grouped as variables (inputs/parameters), operations, and the output.]

In this graph:

  • Variable nodes hold the input (x) and the parameters (w, b)
  • Operation nodes apply multiply (*) and add (+)
  • The output node holds the result (z)
  • Edges show data flow from left to right (forward pass)

During Backpropagation

Gradients flow in the reverse direction (right to left). Each operation node knows how to compute its local gradients. The multiply node swaps its inputs: dL/dx = dL/dz * w and dL/dw = dL/dz * x.
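
To make this concrete, here is a minimal sketch of a multiply node (plain Python; the class and method names are illustrative, not from any framework) that caches its inputs on the forward pass so the backward pass can swap them:

```python
class Multiply:
    """Toy multiply node: caches its inputs so the backward pass can reuse them."""

    def forward(self, x, w):
        self.x, self.w = x, w   # remember inputs for the backward pass
        return x * w

    def backward(self, dL_dz):
        # Local gradients "swap" the inputs:
        # dL/dx = dL/dz * w, dL/dw = dL/dz * x
        return dL_dz * self.w, dL_dz * self.x


node = Multiply()
z = node.forward(2.0, 3.0)          # forward value: 6.0
dL_dx, dL_dw = node.backward(1.0)   # gradients: (3.0, 2.0)
```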

Interactive: Execution Demo

Watch how a computational graph executes. Nodes that do not depend on each other run in parallel, while dependent nodes wait for their parents to complete.


The Anatomy of a Node

Each node in a computational graph operates like a miniature processing unit. Understanding what happens inside a node helps explain how the entire graph coordinates execution.

Node Lifecycle

1. Wait State

The node monitors its parent nodes, waiting for all required resources (data, gradients, or execution signals) to become available.

2. Execute

Once all parent resources arrive, the node performs its computation. This might be a matrix multiplication, applying an activation function, or any mathematical operation.

3. Propagate

After computation completes, the node makes its output available to its child nodes, signaling them to check if they are ready to execute.

In concurrent programming terms, you can think of each node as a critical section that acquires locks on parent resources before executing. This design elegantly handles both multi-threading (multiple cores on one machine) and distributed computing (multiple machines working together).
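
The lifecycle can be sketched in a few lines of plain Python (single-threaded for clarity; all names here are illustrative): each node keeps a count of unfinished parents, executes when that count reaches zero, and then notifies its children.

```python
class Node:
    def __init__(self, name, op, parents=()):
        self.name, self.op = name, op
        self.parents = list(parents)
        self.children = []
        self.pending = len(self.parents)   # 1. Wait: parents still to finish
        self.value = None
        for p in self.parents:
            p.children.append(self)

    def parent_done(self, ready):
        self.pending -= 1
        if self.pending == 0:              # all parent resources have arrived
            ready.append(self)


def run(nodes):
    # Nodes without parents (inputs/parameters) are ready immediately.
    ready = [n for n in nodes if not n.parents]
    while ready:
        node = ready.pop()
        node.value = node.op(*(p.value for p in node.parents))   # 2. Execute
        for child in node.children:                              # 3. Propagate
            child.parent_done(ready)


# Build the graph for z = (x * w) + b and run it.
x = Node("x", lambda: 2.0)
w = Node("w", lambda: 3.0)
b = Node("b", lambda: 1.0)
mul = Node("mul", lambda a, c: a * c, parents=(x, w))
z = Node("z", lambda a, c: a + c, parents=(mul, b))
run([x, w, b, mul, z])
print(z.value)   # 7.0
```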

Why DAGs Matter

Most computational graphs are Directed Acyclic Graphs (DAGs): directed graphs with no cycles. This seemingly simple constraint provides powerful guarantees:

No Deadlocks

Without cycles, nodes never wait in circular dependencies

Topological Ordering

There is always a valid execution order

Clean Backpropagation

Gradients flow backwards through the graph unambiguously

Parallelization

All nodes at the same "level" can execute simultaneously

While it is technically possible to have cycles in computational graphs (recurrent neural networks use them), most frameworks "unroll" these cycles into DAGs for a fixed number of time steps during training.
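
As a sketch of how a framework can exploit these guarantees (a Kahn-style traversal over a plain adjacency dict; the function and node names are illustrative), the helper below groups a DAG's nodes into levels such that every node depends only on earlier levels, so nodes within a level could run in parallel:

```python
from collections import defaultdict

def topo_levels(edges):
    """Group a DAG's nodes into levels of mutually independent nodes.
    `edges` maps each node to the list of nodes it feeds into."""
    indegree = defaultdict(int)
    nodes = set(edges)
    for src, dsts in edges.items():
        for dst in dsts:
            indegree[dst] += 1
            nodes.add(dst)

    level = [n for n in nodes if indegree[n] == 0]
    levels = []
    while level:
        levels.append(sorted(level))
        nxt = []
        for n in level:
            for child in edges.get(n, []):
                indegree[child] -= 1
                if indegree[child] == 0:
                    nxt.append(child)
        level = nxt
    return levels

# z = (x * w) + b, with the add and output split into separate nodes
graph = {"x": ["mul"], "w": ["mul"], "b": ["add"], "mul": ["add"], "add": ["z"]}
print(topo_levels(graph))
# [['b', 'w', 'x'], ['mul'], ['add'], ['z']]
```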

Understanding Backpropagation

Here is where computational graphs become truly powerful. The same structure that organizes forward computation can be traversed backwards to compute gradients automatically. This is called automatic differentiation or backpropagation.

Forward Pass

  • Computes output from inputs
  • Stores information needed for gradients later

Backward Pass

  • Receives gradient signals from children
  • Computes gradients w.r.t. inputs via chain rule
  • Passes gradients back to parents

The Magic

You only need to define how to compute the forward pass. The framework automatically figures out how to compute gradients for all parameters by traversing the graph backwards and applying the chain rule at each node.
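
For example, in PyTorch (discussed further below) the forward expression is all you write; calling `.backward()` traverses the recorded graph and fills in the gradients. A minimal sketch, assuming PyTorch is installed:

```python
import torch

# Leaf tensors that should receive gradients
x = torch.tensor(2.0, requires_grad=True)
w = torch.tensor(3.0, requires_grad=True)
b = torch.tensor(1.0, requires_grad=True)

# Forward pass: PyTorch records the computational graph as we go
z = x * w + b            # z = 7.0

# Backward pass: traverse the graph in reverse, applying the chain rule
z.backward()

print(x.grad, w.grad, b.grad)   # tensor(3.) tensor(2.) tensor(1.)
```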

Interactive: Backprop Demo

Watch how gradients flow backwards through our simple graph z = (x * w) + b. Click through each step to see the forward pass compute values, then the backward pass compute gradients. For a deeper dive, see the full backpropagation page.

Step-by-Step Backpropagation

[Interactive diagram: forward-pass values x = 2, w = 3, x · w = 6, b = 1, z = 7, with a toggle between forward and backward steps.]

Initial State

We have a simple computational graph: z = (x * w) + b. Let x=2, w=3, b=1.

$$z = x \cdot w + b = 2 \cdot 3 + 1 = 7$$

Key Observations

  • Addition distributes gradients equally to both inputs (both get the same gradient)
  • Multiplication swaps: the gradient for x uses w, and vice versa
  • Each node only needs its local gradient and the upstream gradient to compute the gradients it passes back to its parents
  • This is the Chain Rule in action: $\frac{\partial L}{\partial x} = \frac{\partial L}{\partial z} \cdot \frac{\partial z}{\partial x}$ (worked through numerically in the sketch below)
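
The sketch below (plain Python, no framework) applies these observations to the demo's values, taking the loss to be z itself so the upstream gradient dL/dz is 1:

```python
# Forward pass
x, w, b = 2.0, 3.0, 1.0
m = x * w          # 6.0
z = m + b          # 7.0

# Backward pass (take L = z, so the upstream gradient dL/dz = 1)
dL_dz = 1.0
dL_dm = dL_dz * 1.0   # addition passes the gradient through unchanged...
dL_db = dL_dz * 1.0   # ...to both of its inputs
dL_dx = dL_dm * w     # multiplication swaps: gradient for x uses w
dL_dw = dL_dm * x     # gradient for w uses x

print(dL_dx, dL_dw, dL_db)   # 3.0 2.0 1.0
```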

Gradient Descent Intuition

Imagine you are blindfolded on a mountainside, trying to reach the lowest valley. You cannot see the entire landscape, but you can feel the slope beneath your feet. The smartest strategy is to repeatedly step downhill: following the steepest descent at each point.

This is exactly what gradient descent does. The gradient tells us the direction of steepest ascent, so we step in the opposite direction to minimize our loss function.

$$\theta_{\text{new}} = \theta_{\text{old}} - \alpha \cdot \nabla L(\theta)$$

where $\alpha$ is the learning rate and $\nabla L(\theta)$ is the gradient of the loss function.

The Learning Rate Trade-off

Too small: slow convergence. Too large: overshooting, instability. Finding the right balance is crucial for effective training.

Interactive: Training Demo

Let us see gradient descent in action. We will try to approximate f(x) = 3x + 7 using only input-output pairs. Adjust the learning rate and watch the algorithm converge.

[Interactive widget: adjustable learning rate; parameters w (slope) and b (intercept) start at -2.000 and 15.000, with a live epoch counter and MSE readout.]
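
To reproduce the demo offline, here is a minimal plain-Python sketch; the starting values w = -2 and b = 15 come from the widget above, while the learning rate, batch size, and epoch count are illustrative choices:

```python
import random

# Target function we are trying to recover from samples: f(x) = 3x + 7
def target(x):
    return 3.0 * x + 7.0

w, b = -2.0, 15.0      # starting parameters (as in the widget above)
lr = 0.01              # learning rate, chosen for illustration

for epoch in range(500):
    # Sample a small batch of input-output pairs
    xs = [random.uniform(-5, 5) for _ in range(32)]
    ys = [target(x) for x in xs]

    # Forward pass and MSE loss: L = mean((w*x + b - y)^2)
    preds = [w * x + b for x in xs]
    mse = sum((p - y) ** 2 for p, y in zip(preds, ys)) / len(xs)

    # Gradients of the MSE with respect to w and b
    dL_dw = sum(2 * (p - y) * x for p, y, x in zip(preds, ys, xs)) / len(xs)
    dL_db = sum(2 * (p - y) for p, y in zip(preds, ys)) / len(xs)

    # Gradient descent update
    w -= lr * dL_dw
    b -= lr * dL_db

print(round(w, 2), round(b, 2))   # approaches 3.0 and 7.0
```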

The Mathematics

Let us work through the math of backpropagation with a simple example. Consider a linear function:

$$f(x) = w \cdot x + b$$

Step 1: Compute Partial Derivatives

We need to know how the output changes with respect to each parameter:

$\frac{\partial f}{\partial w} = x$
(derivative w.r.t. weight)
$\frac{\partial f}{\partial b} = 1$
(derivative w.r.t. bias)

These partial derivatives form the gradient vector $\langle \frac{\partial f}{\partial w}, \frac{\partial f}{\partial b} \rangle$. This vector points in the direction of greatest increase. To decrease the error, we move in the opposite direction.

Step 2: The Chain Rule for Deeper Networks

Real neural networks are compositions of functions. The chain rule lets us compute gradients through these layers:

For $f(g(x))$: $\frac{\partial f}{\partial x} = \frac{\partial f}{\partial g} \cdot \frac{\partial g}{\partial x}$
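
For example (an illustrative composition, not one from the lesson), if $f(g) = g^2$ and $g(x) = 3x + 1$, then

$$\frac{\partial f}{\partial x} = \frac{\partial f}{\partial g} \cdot \frac{\partial g}{\partial x} = 2g \cdot 3 = 6(3x + 1)$$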

Step 3: Update Parameters

We adjust each parameter by a small step opposite to the gradient:

$$w = w - \text{learning\_rate} \times \frac{\partial L}{\partial w}$$
$$b = b - \text{learning\_rate} \times \frac{\partial L}{\partial b}$$

where $L$ is our loss function (like mean squared error)

Scaling to Neural Networks

Real neural networks extend these concepts in several important ways:

Non-linear Activations

Instead of linear functions, we use non-linear activations like sigmoid, ReLU, or tanh to enable the network to learn complex patterns.

Deep Architectures

We stack many layers, creating compositions of functions. Each layer's output becomes the next layer's input, allowing the network to learn hierarchical representations.

Loss Functions

Instead of manually checking errors, we use differentiable loss functions:

  • MSE: Mean Squared Error for regression tasks
  • Cross-Entropy: For classification tasks

Automatic Differentiation

Frameworks automatically compute all gradients through the entire network using the computational graph. You define the forward pass; backpropagation happens automatically.
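
As a sketch of what this looks like in practice (PyTorch chosen because it is discussed next; the layer sizes and random data are made up for illustration), you write only the forward pass and the loss, and `loss.backward()` computes every parameter's gradient through the graph:

```python
import torch
import torch.nn as nn

# A tiny two-layer network: you only describe the forward computation
model = nn.Sequential(
    nn.Linear(3, 16),
    nn.ReLU(),
    nn.Linear(16, 1),
)
loss_fn = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

# Made-up batch of 8 examples with 3 features each
inputs = torch.randn(8, 3)
targets = torch.randn(8, 1)

optimizer.zero_grad()                   # clear old gradients
loss = loss_fn(model(inputs), targets)  # forward pass builds the graph
loss.backward()                         # backward pass: gradients for every parameter
optimizer.step()                        # gradient descent update
```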

Modern Frameworks

Modern deep learning frameworks like PyTorch and TensorFlow handle all this complexity for you. They build computational graphs automatically, manage memory efficiently, and parallelize execution across GPUs.

PyTorch

Dynamic computational graphs built on-the-fly. More Pythonic, easier to debug. Popular in research.

TensorFlow

Historically static computational graphs; since TensorFlow 2, eager execution is the default and graphs are built with tf.function. Optimized for production deployment.

The Power of Abstraction

You define what you want to compute (the forward pass), and the framework automatically builds a computational graph and computes gradients for all parameters. This is the magic of modern deep learning frameworks. Automatic differentiation lets you focus on model design rather than gradient computation. Learn more about the backpropagation algorithm.