LLM Internals
Interactive, visual deep-dives into the techniques behind modern LLMs: attention variants, KV cache, LoRA, RLHF, quantization, and more.
30 topics
KV Cache
Memoization of linear projections: why attention asks new questions over the same memory.
RoPE (Rotary Position Embedding)
Complex numbers, rotation matrices, and relative positions encoded through geometry.
Quantization
Discretization, rounding errors, and the math of compressing model weights.
Attention Mechanism
The Query-Key-Value architecture that revolutionized deep learning.
FlashAttention
Tiling, IO complexity, and online softmax. How to make attention's memory footprint linear in sequence length.
Sliding Window Attention
Efficient local attention that scales linearly with sequence length.
Multi-Query Attention (MQA)
Share a single KV head across all queries for maximum inference speed.
Grouped Query Attention (GQA)
Share KV heads to reduce memory while preserving quality.
Multi-Head Latent Attention (MLA)
DeepSeek's KV cache compression via low-rank latent projections.
PagedAttention
Virtual-memory-style paging for the KV cache, behind vLLM's reported throughput gains of up to 24x over Hugging Face Transformers.
Speculative Decoding
Verification probability, acceptance sampling, and trading compute for latency.
Mixture of Experts (MoE)
See probability, linear algebra, gradients, and entropy come alive inside this modern architecture.
GPT Pretraining
Autoregressive language modeling, causal masks, and the next-token prediction objective.
Chain of Thought Reasoning
Unlock reasoning by letting models think step by step.
LoRA (Low-Rank Adaptation)
Rank, subspaces, and matrix factorization explain why fine-tuning works without touching all parameters.
QLoRA
Fine-tune a 65B model on a single 48 GB GPU with 4-bit quantization.
Direct Preference Optimization (DPO)
The elegant alternative to RLHF. Skip the reward model and RL loop entirely.
RLHF
Reinforcement Learning from Human Feedback. The alignment pipeline.
PPO (Proximal Policy Optimization)
KL divergence, clipped objectives, and why policy updates need trust regions.
GRPO (Group Relative Policy Optimization)
Relative advantages, group normalization, and why baselines matter for variance reduction.
Mixed Precision Training
FP16 vs BF16. Loss scaling, dynamic range, and the danger of underflow.
Model Compression
Magnitude pruning, knowledge distillation, and shrinking models by up to 90% with minimal accuracy loss.
Structured Pruning
Layer pruning, head pruning, and hardware-friendly module-level compression.
Tensor vs Pipeline Parallelism
Matrix partitioning, communication costs, and bubble time. How to scale across GPUs.
DeepSpeed ZeRO
Zero Redundancy Optimizer for trillion-parameter training.
Weight Initialization
Variance propagation, gradient flow, and depth scaling. Why initialization makes or breaks training.
Gradient Accumulation
Simulating large batches with limited GPU memory. Effective batch size and variance reduction.
Chinchilla Scaling Laws
Power laws, compute-optimal frontiers, and why GPT-3 was undertrained for its parameter count.
RAG
Embeddings, similarity search, and conditioning. How LLMs use external knowledge.
Model Merging
Combine fine-tuned models without retraining using TIES, DARE, and SLERP.