Series

LLM Internals

Interactive, visual deep-dives into the techniques behind modern LLMs: attention variants, KV cache, LoRA, RLHF, quantization, and more.

30 topics

KV Cache

Memoization of linear projections: why attention asks new questions over the same memory.

RoPE (Rotary Position Embedding)

Complex numbers, rotation matrices, and relative positions encoded through geometry.

Coming soon

Quantization

Discretization, rounding errors, and the math of compressing model weights.

Coming soon

Attention Mechanism

The Query-Key-Value architecture that revolutionized deep learning.

Coming soon

FlashAttention

Tiling, IO-complexity, and Online Softmax. How to make attention linear in memory.

Coming soon

Sliding Window Attention

Efficient local attention that scales linearly with sequence length.

Coming soon

Multi-Query Attention (MQA)

Share a single KV head across all query heads for maximum inference speed.

Coming soon

Grouped Query Attention (GQA)

Share KV heads to reduce memory while preserving quality.

Coming soon

Multi-head Latent Attention (MLA)

DeepSeek's KV cache compression via low-rank latent projections.

Coming soon

PagedAttention

Virtual memory for the KV cache, behind vLLM's reported throughput gains of up to 24x over Hugging Face Transformers.

Coming soon

Speculative Decoding

Verification probability, acceptance sampling, and trading compute for latency.

Coming soon

Mixture of Experts (MoE)

Routing probabilities, linear algebra, gradients, and entropy at work inside sparse expert layers.

Coming soon

GPT Pretraining

Autoregressive language modeling, causal masks, and the next-token prediction objective.

Coming soon

Chain of Thought Reasoning

Unlock reasoning by letting models think step by step.

Coming soon

LoRA (Low-Rank Adaptation)

Rank, subspaces, and matrix factorization explain why fine-tuning works without touching all parameters.

Coming soon

QLoRA

Fine-tune a 65B model on a single 48 GB GPU with 4-bit quantization.

Coming soon

Direct Preference Optimization (DPO)

The elegant alternative to RLHF. Skip the reward model and RL loop entirely.

Coming soon

RLHF

Reinforcement Learning from Human Feedback. The alignment pipeline.

Coming soon

PPO (Proximal Policy Optimization)

KL divergence, clipped objectives, and why policy updates need trust regions.

Coming soon

GRPO (Group Relative Policy Optimization)

Relative advantages, group normalization, and why baselines matter for variance reduction.

Coming soon

Mixed Precision Training

FP16 vs BF16. Loss scaling, dynamic range, and the danger of underflow.

Coming soon

Model Compression

Magnitude pruning, knowledge distillation, and shrinking models by up to 90% with minimal accuracy loss.

Coming soon

Structured Pruning

Layer pruning, head pruning, and hardware-friendly module-level compression.

Coming soon

Tensor vs Pipeline Parallelism

Matrix partitioning, communication costs, and bubble time. How to scale across GPUs.

Coming soon

DeepSpeed ZeRO

Zero Redundancy Optimizer for trillion-parameter training.

Coming soon

Weight Initialization

Variance propagation, gradient flow, and depth scaling. Why initialization makes or breaks training.

Coming soon

Gradient Accumulation

Simulating large batches with limited GPU memory. Effective batch size and variance reduction.

Coming soon

Chinchilla Scaling Laws

Power laws, compute-optimal frontiers, and why GPT-3 was over-parameterized for the amount of data it was trained on.

Coming soon

RAG

Embeddings, similarity search, and conditioning. How LLMs use external knowledge.

Coming soon

Model Merging

Combine fine-tuned models without retraining using TIES, DARE, and SLERP.