
LLM Systems

The document is a comprehensive guide on LLM systems, covering architecture, training, inference, and system design. It is aimed at candidates preparing for interviews and engineers building LLM systems, providing insights into interview expectations and technical foundations. The content includes chapters on various aspects of LLM systems, from transformer architecture to evaluation metrics.

LLM System Interview

A Guide to Architecture, Training, Inference, and System Design

Author: Hao Hoang


Institute: AI Interview Prep
Date: April 25, 2026
Version: First edition, 2026
Focus: Applied AI/ML and LLM systems



Contents

Preface
How to Use This Book
For Candidates Preparing for Interviews
For Engineers Building LLM Systems
Notation and Symbols
Acknowledgments
About the Author

I Overview and Interview Landscape

Chapter 1 The LLM Systems Interview
1.1 What Companies Actually Ask
1.2 Roles That Require LLM Systems Knowledge

Chapter 2 How to Approach an LLM System Design Question
2.1 A Repeatable Framework
2.2 Trade-Off Axes You Must Name
2.3 Common Pitfalls in Answers

II Foundations: Transformer Architecture and Training Signals

Chapter 3 Transformer Architecture Interview Essentials
3.1 The Baseline Decoder-Only Transformer
3.2 Normalization Choices
3.3 Activation Functions in Modern LLMs
3.4 Position Encodings

Chapter 4 Hyperparameters You Will Be Asked to Justify
4.1 Width and Depth
4.2 Attention Head Configuration
4.3 Vocabulary Size and Tokenization Impact
4.4 Regularization in Pre-Training

Chapter 5 Stability Tricks in Large-Scale Training
5.1 Softmax Instabilities
5.2 Numerical Precision and Training Health

III Mixture of Experts and Sparse Architectures

Chapter 6 Why MoE Is Now Standard in Frontier Systems
6.1 Dense vs. Sparse Models for Fixed FLOPs
6.2 Routing: The Heart of Every MoE

Chapter 7 MoE Training and Systems Considerations
7.1 Fine-Grained and Shared Experts
7.2 Load Balancing and Stability
7.3 Expert Parallelism Gotchas for Interviews

IV GPUs, Kernels, and Single-Device Performance

Chapter 8 GPU Architecture for LLM Engineers
8.1 The Compute and Memory Hierarchy
8.2 Arithmetic Intensity and the Roofline Model

Chapter 9 Making a Single GPU Go Fast
9.1 Reducing Memory Movement
9.2 Exploiting Memory Layout
9.3 Lower Precision and Quantization-Aware Training

Chapter 10 Writing and Benchmarking Custom Kernels
10.1 Benchmarking and Profiling Discipline
10.2 Implementing Kernels
10.3 FlashAttention as a Case Study

V Distributed Training: Parallelism at Scale

Chapter 11 Multi-GPU and Multi-Node Fundamentals
11.1 Interconnect Topologies
11.2 Collective Communication Primitives
11.3 The Software Stack

Chapter 12 Parallelism Strategies
12.1 Data Parallelism and ZeRO
12.2 Model Parallelism
12.3 Activation and Sequence Parallelism
12.4 Putting It All Together: 3D and 4D Parallelism

VI Scaling Laws and Training Economics

Chapter 13 Predictable Scaling for Interview Answers
13.1 The Three Canonical Scaling Laws
13.2 Using Scaling Laws to Make Design Decisions

Chapter 14 Chinchilla and Beyond
14.1 Compute-Optimal Training
14.2 Inference-Aware Token-to-Parameter Ratios
14.3 Maximal Update Parameterization (muP)

VII Inference Systems

Chapter 15 The Inference Workload
15.1 Prefill vs. Generation
15.2 Latency, Throughput, and Cost

Chapter 16 Reducing the KV Cache
16.1 Attention Variants for Cheaper Inference
16.2 Cross-Layer and Local Attention

Chapter 17 Going Beyond the Transformer for Inference
17.1 State-Space and Linear-Attention Models
17.2 Non-Autoregressive Generation

Chapter 18 Speculative Decoding and Serving Optimizations
18.1 Speculative Sampling
18.2 Serving System Design

Chapter 19 Compression Techniques for Deployment
19.1 Quantization
19.2 Pruning and Distillation

VIII Data: The Real Differentiator

Chapter 20 Pre-Training Data Pipelines
20.1 Where Training Data Comes From
20.2 Evolution of Open Pre-Training Corpora
20.3 Legal and Ethical Considerations

Chapter 21 Data Filtering and Deduplication Algorithms
21.1 Quality Filtering Algorithms
21.2 Targeted Filtering Tasks
21.3 Deduplication at Scale

Chapter 22 Mid-Training and Post-Training Data
22.1 Instruction and Chat Data
22.2 Long-Context and Domain Data
22.3 Data Quality Heuristics That Matter in Interviews

IX Evaluation

Chapter 23 Designing an Evaluation
23.1 Goals of Evaluation
23.2 Metrics and Protocols

Chapter 24 Benchmarks You Must Know
24.1 Knowledge and Reasoning Benchmarks
24.2 Instruction Following and Chat Evaluation
24.3 Agentic and Safety Benchmarks

Chapter 25 Validity, Contamination, and Real-World Use
25.1 Train-Test Overlap and Contamination
25.2 Real-World Evaluation

X Alignment and Post-Training

Chapter 26 Supervised Fine-Tuning
26.1 What SFT Can and Cannot Teach
26.2 SFT Data in Practice

Chapter 27 Preference-Based Alignment (RLHF)
27.1 The RLHF Pipeline
27.2 Optimization Algorithms
27.3 Pitfalls of RLHF

Chapter 28 Reinforcement Learning from Verifiable Rewards
28.1 From RLHF to RLVR
28.2 Policy Gradient Methods
28.3 Case Studies in Reasoning Models

XI End-to-End System Design Drills

Chapter 29 Building a Production LLM Serving Stack
29.1 Reference Architecture
29.2 Capacity Planning

Chapter 30 Designing a Pre-Training Run from Scratch
30.1 From Compute Budget to Architecture
30.2 Operations and Failure Modes

Chapter 31 Designing a Fine-Tuning and Alignment Pipeline
31.1 Scoping the Problem
31.2 Operational Concerns

Appendix A Napkin Math for LLM Interviews
A.1 Parameter Counts and Memory Footprints
A.2 FLOPs per Forward and Backward Pass
A.3 KV-Cache Sizing and Latency Estimates

Appendix B Common Interview Questions
B.1 Architecture and Training Questions
B.2 Inference and Serving Questions
B.3 Alignment and Evaluation Questions

Appendix C Checklists for System Design Answers
C.1 The 10-Minute LLM System Design Checklist
C.2 The Pre-Training Run Readiness Checklist
C.3 The Inference Deployment Readiness Checklist

Appendix D Further Reading
D.1 Canonical Papers by Topic
D.2 Blogs, Talks, and Engineering Reports

Preface

This book exists because the interview and the job have converged in a way they did not five years ago.
Until recently, most software engineering interviews tested general systems-design intuition: databases,
caches, load balancers, message queues. The underlying assumption was that specialized domain knowledge
could be learned on the job once a strong generalist was hired. That assumption has broken down at frontier
AI companies. A candidate who cannot reason about KV cache memory pressure, arithmetic intensity on a
GPU, or the difference between compute-bound prefill and memory-bandwidth-bound decoding will not pass a
senior-level systems interview at the teams building today’s most important models. The domain has become
so operationally specific that generalist intuition is no longer sufficient, and no single paper, blog post, or course
covers it end to end.
The gap this book fills. Academic courses teach the theory of transformers; they rarely explain why GQA
with $N_{kv} = N/8$ reduces KV cache memory by 8× and how that changes the feasible batch size on a single
H100. Research papers present results; they do not explain how to answer “walk me through how you would
size the hardware for a 70B model serving 5,000 QPS” in the 45 minutes of a live interview. Engineering blogs
cover individual components in depth; they do not provide the unified arithmetic framework that lets you move
fluidly from model parameters to memory budgets to parallelism strategies to cost estimates. This book does.
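
As a minimal sketch of the GQA claim above, the following Python computes per-sequence KV cache size with the commonly used formula (keys and values stored for every layer and every KV head) and compares $N_{kv} = N$ against $N_{kv} = N/8$. The layer count, head count, head dimension, context length, and the 40 GB KV budget are hypothetical values chosen for illustration; none of them comes from the text.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_elem=2):
    """KV cache for one sequence: keys and values for every layer and KV head."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem


def max_batch(kv_budget_bytes, per_seq_bytes):
    """Largest number of sequences whose KV caches fit in the budget."""
    return kv_budget_bytes // per_seq_bytes


# Hypothetical configuration (assumption, for illustration only).
N_LAYERS, N_HEADS, HEAD_DIM, SEQ_LEN = 32, 32, 128, 4096
KV_BUDGET = 40 * 10**9  # hypothetical memory left for KV cache after weights

mha = kv_cache_bytes(N_LAYERS, N_HEADS, HEAD_DIM, SEQ_LEN)       # N_kv = N
gqa = kv_cache_bytes(N_LAYERS, N_HEADS // 8, HEAD_DIM, SEQ_LEN)  # N_kv = N/8

reduction = mha / gqa
print(f"KV cache reduction: {reduction:.0f}x")
print(f"max batch, N_kv = N:   {max_batch(KV_BUDGET, mha)}")
print(f"max batch, N_kv = N/8: {max_batch(KV_BUDGET, gqa)}")
```

The reduction factor depends only on the ratio of KV heads, so it holds for any layer count or context length; the feasible batch size, by contrast, depends on every parameter in the formula and on how much memory the weights leave free.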
What this book is. It is a technical reference and an interview preparation guide written for engineers who
already know how to code and who already understand basic ML. Every chapter develops the arithmetic behind
a topic from first principles, shows how that arithmetic connects to real hardware constraints, and then frames
the result as an interview question with a worked answer. The appendices collect the most important formulas
for fast reference during active preparation.
What this book is not. It is not an introduction to deep learning. It does not teach backpropagation or
explain what a transformer is from scratch. It does not survey the research literature comprehensively. Readers
who need that foundation first should work through a course on neural networks before returning here. It is also
not a prescriptive guide to any specific company’s interview process; the frameworks and derivations are generic
because the physics of hardware and the economics of inference are generic.
Coverage. The book follows a deliberate progression. Part I maps the interview landscape and introduces
the four-question framework that structures every design answer. Parts II through IV build the technical foundation:
transformer architecture, training hyperparameters, GPU hardware, and kernel-level performance. Parts V
and VI scale that foundation across distributed training and scaling laws. Part VII covers the inference stack
in depth, from the prefill-decode split through KV cache optimizations, speculative decoding, and compression.
Parts VIII through X address the three domains that interviewers probe as differentiators: data pipelines,
evaluation design, and alignment methods. Part XI synthesizes everything into three full end-to-end design
drills.
Using derivations, not memorization. Every number in this book is derived, not asserted. The estimate
that a 7B model in bf16 requires approximately 13 GB of inference memory follows from $2P$ bytes, with
$P = 6.7 \times 10^{9}$. The estimate that a single H100 decode step takes roughly 8 ms at batch size 1 follows from dividing
$M_{\text{weights}}$ by the HBM bandwidth. These derivations are the point. An interviewer who changes one constraint
(“now make it a 13B model”) expects a candidate who can update the arithmetic in real time, not one who has
memorized a table. First principles produce that flexibility; memorization does not.
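
The two derivations above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the book's own code: the function names are hypothetical, the bandwidth value is a placeholder to be replaced with the figure for your device, and the latency estimate is a lower bound that counts only the time to read the weights once per decode step.

```python
def weight_bytes(n_params, bytes_per_param=2):
    """Weight memory: 2P bytes for bf16."""
    return n_params * bytes_per_param


def decode_step_lower_bound_s(m_weights_bytes, hbm_bandwidth_bytes_per_s):
    """Time to stream the weights once; ignores KV cache reads and compute."""
    return m_weights_bytes / hbm_bandwidth_bytes_per_s


# Placeholder bandwidth (assumption): substitute your hardware's figure.
HBM_BANDWIDTH = 2.0e12  # bytes per second

for label, n_params in (("7B", 6.7e9), ("13B", 13e9)):
    m_weights = weight_bytes(n_params)
    t_step = decode_step_lower_bound_s(m_weights, HBM_BANDWIDTH)
    print(f"{label}: weights = {m_weights / 1e9:.1f} GB, "
          f"decode step >= {t_step * 1e3:.1f} ms at batch size 1")
```

Changing the model from 7B to 13B changes only one input; both outputs update without any lookup table, which is the habit the paragraph above describes.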
How to Use This Book

The book supports three distinct uses: interview preparation, on-the-job reference, and systematic study of
LLM systems. Read it linearly if you are building up the subject from scratch. Follow a targeted path if you are
preparing for a specific role or plugging specific gaps.

Linear Reading Order


The parts are ordered by dependency. Each part assumes the vocabulary and arithmetic developed in all
prior parts.

Part   Topic                                   Prerequisite
I      Interview landscape and framework       None
II     Transformer architecture and training   None
III    Mixture of Experts                      Part II
IV     GPU architecture and custom kernels     Part II
V      Distributed training and parallelism    Parts II, IV
VI     Scaling laws and training economics     Parts II, V
VII    Inference systems                       Parts II, IV, VI
VIII   Data pipelines                          Part II
IX     Evaluation                              Parts II, VIII
X      Alignment and post-training             Parts II, VIII, IX
XI     End-to-end design drills                All prior parts

Appendix A (parameter counts, FLOPs, KV cache formulas) and Appendix C (checklists) are designed to be
printed and kept at your desk during active preparation. Appendix B is a curated question bank organized by
topic. Appendix D is an annotated reading list of papers and engineering reports.

Targeted Reading Paths

Path 1 - Candidate Preparing for an Inference / Serving Interview


Focus: memory arithmetic, KV cache, latency-throughput trade-offs, speculative decoding, quantization.
1. Chapter 2 (framework) - read fully, internalize the four-axis requirement template.
2. Chapter 3, Sections 1-2 (transformer essentials, parameter count).
3. Appendix A (reference formulas) - commit to memory.
4. Chapter 8 (GPU architecture, roofline model, arithmetic intensity).
5. Chapter 15 (the inference workload: prefill vs. decode, latency vs. throughput).
6. Chapter 16 (KV cache reduction: GQA, MQA, MLA).
7. Chapter 18 (speculative decoding and continuous batching).
8. Chapter 19 (quantization and compression).
9. Chapter 29 (full end-to-end serving design drill).
Path 2 - Candidate Preparing for a Training / Infrastructure Interview


Focus: distributed training, parallelism strategies, memory budgets, scaling laws, resilience.
1. Chapter 2 (framework).
2. Chapter 3 (architecture) and Chapter 4 (hyperparameters).
3. Chapter 5 (training stability).
4. Chapter 8 (GPU architecture).
5. Chapter 11 (multi-GPU fundamentals: collectives, bandwidth).
6. Chapter 12 (parallelism strategies: TP, PP, DP, ZeRO/FSDP).
7. Chapter 13 (scaling laws and napkin math).
8. Chapter 14 (Chinchilla and compute-optimal training).
9. Appendix A (FLOPs and memory formulas).
10. Chapter 30 (end-to-end pre-training design drill).

Path 3 - Candidate Preparing for an Applied / Product Engineering Interview


Focus: serving economics, fine-tuning, alignment, evaluation, end-to-end system design.
1. Chapter 1 (interview landscape, role definitions).
2. Chapter 2 (framework).
3. Chapter 15 (inference workload fundamentals).
4. Chapter 19 (compression and deployment).
5. Chapters 20-22 (data pipelines and data quality).
6. Chapters 23-25 (evaluation design, benchmarks, contamination).
7. Chapter 26 (supervised fine-tuning).
8. Chapter 27 (RLHF and preference alignment).
9. Chapter 31 (fine-tuning and alignment pipeline drill).

Chapter Difficulty
Chapters are written at three levels. The level is not labeled explicitly because the same content serves both
preparation and reference; the depth at which you engage with it is the variable.
Conceptual (most of Parts I, II, VIII, IX, X) Develops vocabulary, mental models, and interview framing.
Suitable as first-pass reading and as review before an interview day.
Arithmetic (most of Parts IV, VI, VII; Appendix A) Derives quantitative estimates from first principles.
Requires pencil and paper. Work through the examples actively rather than reading passively.
Implementation (most of Parts III, V; Chapter 10) Discusses kernel-level, compiler-level, or systems-level
mechanics. Assumed background: familiarity with PyTorch, CUDA concepts, and distributed training
frameworks.

Conventions in Each Chapter


Every chapter follows the same internal structure:
The Take - the single most important insight for an interview, stated in one paragraph.
Technical content - derivations, diagrams, or system descriptions.

How This Shows Up - two or three representative interview questions with annotated strong answers.
Key Takeaways - a short bullet list of the points an interviewer will probe.
Questions in How This Shows Up appear in italics. Strong-answer annotations focus on structure (what to say
first, what to derive, what follow-up to anticipate) rather than providing a script. Interviewers at frontier labs
probe depth by changing constraints; the annotations show which numbers to re-derive when that happens.

For Candidates Preparing for Interviews

The LLM systems interview rewards candidates who derive answers, not candidates who recall them. Every
technique in this section is aimed at building that derivation habit before you sit down with an interviewer.

What the Interview Actually Tests


A senior-level LLM systems interview yields three observable signals that interviewers collect
simultaneously.
First: vocabulary precision. “The KV cache grows with sequence length” is not an answer. “At 4,096
tokens, a single sequence in a 7B GQA model with 8 KV heads and head dimension 128 costs approximately
0.17 GB of KV cache in bf16” is. The arithmetic formulas in Appendix A are the vocabulary; the chapters
explain where each term comes from.
Second: constraint-driven reasoning. Every design question has a binding constraint: usually KV cache
memory pressure for serving, usually communication overhead for training. Strong candidates name the binding
constraint before proposing any optimization. Candidates who list optimizations without identifying the
bottleneck signal that they have read blog posts, not reasoned from hardware physics.
Third: adaptability under constraint changes. The most reliable signal in a live interview is what
happens when the interviewer says “what if context length doubles?” or “what if you now need to support
five model variants?” A candidate who derived their original answer updates the arithmetic immediately. A
candidate who memorized an architecture has to restart. This is why every technical chapter in this book teaches
derivation, not recall.

A Four-Week Preparation Plan


This plan assumes roughly two hours of active study per day. Adjust the pacing to your timeline; the
sequence of topics is fixed, but the time budget per topic is flexible.
Week 1 - Foundation Read Chapters 1-5 and Appendix A fully. After each chapter, close the book and re-
derive the key formulas from memory on paper. Target: be able to estimate model memory footprint
(inference and training), forward-pass FLOPs, and KV cache size for any model configuration given to
you verbally.
Week 2 - Hardware and Distributed Systems Read Chapters 8-12. Work through every numerical example
with a calculator. Practice explaining the roofline model out loud to an imaginary interviewer. Spend extra
time on Chapter 12 (parallelism strategies): draw the TP/PP/DP combination diagram from memory until
it takes less than two minutes.
Week 3 - Inference and Scaling Read Chapters 13-19. Focus on Chapter 15 (the inference workload) and
Chapter 18 (speculative decoding and serving). For each optimization technique in Chapters 16-19, write
one sentence explaining which constraint it addresses and what it trades away. Read Appendix C (checklists)
and internalize the ten-minute system design checklist.
Week 4 - Integration and Drills Read Chapters 20-28 at pace (one per day). Spend the last three days on
Part XI (Chapters 29-31), working each drill as a timed mock interview: 45 minutes, whiteboard or
paper, no references. After each drill, compare your answer against the chapter’s annotated response and
identify which constraints you named late or which numbers you could not derive.

How to Drill a Chapter


Passive reading does not build the derivation habit. Use this sequence for every chapter in the arithmetic
and implementation tiers.
1. Read the chapter fully once, following along with any derivations.
2. Cover the page. Write down the key formula from memory.
3. Plug in a different model configuration (change $d$, change $L$, change $N_{kv}$) and re-derive the result.
4. Answer the “How This Shows Up” questions from the chapter out loud, targeting 90 seconds per answer
before expanding to the full 5-minute version.
5. Read the Key Takeaways bullet list and verify you can explain each point without the book.

Timing Your Answers


A 45-minute design interview has roughly the following pacing for a strong candidate:

Phase                         Time        What you are doing
Requirements clarification    3-5 min     Name TTFT, throughput, quality, cost; ask which is binding
End-to-end pipeline sketch    5-8 min     Tokenize → prefill → decode → scheduler; label bottlenecks
Binding constraint analysis   5-7 min     Derive KV cache footprint or memory budget; identify the limit
Targeted optimizations        10-15 min   Propose GQA, speculative decoding, quantization as constraint responses
Trade-off discussion          5-8 min     Answer the interviewer’s constraint-change follow-ups
Wrap-up                       2-3 min     Summarize and invite feedback

The requirements clarification phase is the most commonly skipped and the most damaging to skip. Interviewers
at staff and principal level watch for it explicitly.

Common Failure Modes


Cargo-culting numbers without derivation Stating “Chinchilla says 20 tokens per parameter” without being
able to derive where the 20 comes from, or what changes when inference cost is amortized over more
queries. Chapter 14 derives this ratio and explains when it does not apply.
Conflating latency and throughput Describing TTFT and tokens-per-second as if they optimize together.
They do not; Chapter 15 derives why and what each responds to.
Listing optimizations before naming the bottleneck Proposing speculative decoding, FlashAttention, and
quantization in the same breath before establishing whether the system is KV-cache-constrained,
compute-constrained, or communication-constrained. Chapter 2 provides the framework for identifying the
constraint first.
Treating batch size as a free variable Failing to recognize that batch size is bounded by VRAM and that
both data parallelism and pipeline parallelism consume it as a resource. Appendix A and Chapter 12
derive the precise constraints.

Ignoring inference economics during training design Designing a training run without considering the serving
cost of the resulting model. Chapter 13 establishes why inference cost is a first-class training
constraint.
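
The first failure mode above, quoting a tokens-per-parameter ratio without knowing what it implies, can be made concrete with a short sketch. It assumes the widely used approximation that training compute is $C \approx 6ND$ FLOPs for $N$ parameters and $D$ tokens; that approximation and the function name are assumptions introduced here, not something this section states. Given a compute budget and a ratio $D/N$, the model size and token count follow directly.

```python
import math


def compute_optimal(compute_flops, tokens_per_param=20.0):
    """Solve C = 6 * N * D with D = tokens_per_param * N for N and D."""
    n_params = math.sqrt(compute_flops / (6.0 * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens


BUDGET = 1e23  # hypothetical training budget in FLOPs

n20, d20 = compute_optimal(BUDGET, tokens_per_param=20.0)
n100, d100 = compute_optimal(BUDGET, tokens_per_param=100.0)

print(f"ratio  20: N = {n20:.2e} params, D = {d20:.2e} tokens")
print(f"ratio 100: N = {n100:.2e} params, D = {d100:.2e} tokens")
```

Raising the ratio at a fixed budget yields a smaller model trained on more tokens, which is the direction the trade-off moves when serving cost is amortized over many queries.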

Using the Appendices


Appendix A (formulas) is the most important reference during active preparation. Print it. Put it next to
your desk. After two weeks of preparation you should not need to look at it; if you still do, spend another session
re-deriving each formula from scratch.
Appendix B (question bank) is organized by topic. Use it to simulate a 45-minute interview: pick one
design question and one deep-dive question from the same topic area, set a timer, and answer both without
references.
Appendix C (checklists) provides a ten-minute LLM system design checklist, a pre-training run readiness
checklist, and an inference deployment readiness checklist. Internalize the design checklist until you can
reproduce it verbally in under two minutes.
Appendix D (reading list) is for candidates preparing at the staff or principal level, where interviewers
expect familiarity with the primary literature. Papers are annotated with the specific claims you are expected to
reproduce, not just name.

For Engineers Building LLM Systems

This book is organized around interview questions, but its technical content is not interview-specific.
The arithmetic behind batch-size constraints, the parallelism trade-offs in distributed training, the
memory-bandwidth math behind KV cache sizing: these are the same calculations you do on the job. This section maps
the book’s chapters to the decisions you encounter in production.

When to Reach for This Book


You are sizing a serving fleet for a new model Start with Appendix A (KV cache formula, memory footprint)
and Chapter 15 (the inference workload, latency-throughput trade-offs). The KV-crossover batch
size $B^{*} = M_{\text{weights}} / C_{\text{kv}}$ is the first number to compute; it determines whether you are in the superlinear
or sublinear throughput regime before you touch any configuration.
You are choosing a parallelism strategy for a training run Chapter 12 derives the memory and communication
cost of every combination: tensor-parallel, pipeline-parallel, data-parallel, ZeRO stages 1-3, and
FSDP. The worked examples are parameterized so you can substitute your own model size and cluster
topology.
You are debugging a training run that is not hitting expected MFU Chapter 8 (arithmetic intensity and the
roofline model) and Chapter 11 (collective communication bandwidth and all-reduce timing) are the diagnostic
starting points. Low MFU is almost always communication-bound, memory-bound, or
pipeline-bubble-bound; the chapters give you the arithmetic to distinguish the three cases.
You are evaluating a KV cache optimization (GQA, MLA, paged attention) Chapter 16 derives the memory
reduction and the resulting change in maximum batch size for each technique. The formulas let you
estimate the throughput gain on your specific hardware and model before running any experiment.
You are designing an evaluation harness for a fine-tuned model Chapters 23-25 cover evaluation design
from first principles: metric selection, contamination detection, benchmark validity, and the gap between
benchmark performance and deployment behavior. Chapter 25 covers domain-specific evaluation design
(medical, legal, code) and deployment telemetry.
You are deciding between SFT, DPO, and RLVR for a post-training objective Chapter 26 (SFT), Chapter 27
(DPO and RLHF), and Chapter 28 (GRPO and RLVR) each cover the method’s mechanics, its data requirements,
its failure modes, and the scenarios where it is the right choice. Chapter 31 synthesizes the
decision into an end-to-end alignment pipeline design.
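
To make the KV cache scenario above concrete, the following is a minimal sketch of how the per-sequence KV cache and the resulting batch limit change with the number of key-value heads. It uses the KV cache formula from the Notation chapter; the model shape (80 layers, 64 query heads, head dimension 128, 4096-token context) and the 180 GB KV budget are illustrative assumptions, not figures from the book, and MLA and paged attention are not modeled.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_elem=2):
    """Per-sequence KV cache: 2 (K and V) x L x N_kv x H x S x bytes per element."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem


def max_batch(kv_budget_bytes, kv_bytes_per_seq):
    """Largest number of sequences whose KV caches fit in the given budget."""
    return kv_budget_bytes // kv_bytes_per_seq


# Hypothetical model shape and memory budget (assumptions for illustration only).
N_LAYERS, N_QUERY_HEADS, HEAD_DIM, SEQ_LEN = 80, 64, 128, 4096
KV_BUDGET = 180 * 10**9  # bytes left for KV cache after loading weights

# N_kv = N for MHA, N / G for GQA with group size G, and 1 for MQA.
variants = {"MHA": N_QUERY_HEADS, "GQA (G=8)": N_QUERY_HEADS // 8, "MQA": 1}

results = {}
for name, n_kv in variants.items():
    c_kv = kv_cache_bytes(N_LAYERS, n_kv, HEAD_DIM, SEQ_LEN)
    results[name] = (c_kv, max_batch(KV_BUDGET, c_kv))
    print(f"{name:10s} C_kv = {c_kv / 1e9:6.2f} GB   max batch = {results[name][1]}")
```

Because the cache size is linear in the number of key-value heads, a group size of 8 cuts the per-sequence cache by a factor of 8 and raises the batch limit by roughly the same factor under these assumptions.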
For Engineers Building LLM Systems

Chapter Map by Engineering Decision

Decision / Problem                          Primary chapters
Model memory footprint (inference)          Appendix A, Ch. 3
Model memory footprint (training)           Appendix A, Ch. 9
Forward-pass and training FLOPs             Appendix A, Ch. 13
KV cache sizing and batch limits            Appendix A, Ch. 15, 16
GPU roofline and arithmetic intensity       Ch. 8
FlashAttention and kernel fusion            Ch. 9, 10
Collective communication cost               Ch. 11
Parallelism strategy selection              Ch. 12
Compute-optimal token budget                Ch. 13, 14
Inference latency-throughput curve          Ch. 15
GQA / MQA / MLA trade-offs                  Ch. 16
State-space and hybrid architectures        Ch. 17
Speculative decoding setup                  Ch. 18
Quantization (INT8, INT4, FP8)              Ch. 19
Pre-training data pipeline design           Ch. 20, 21
Instruction and preference data curation    Ch. 22
Evaluation metric selection                 Ch. 23
Benchmark selection and contamination       Ch. 24, 25
SFT data and training recipe                Ch. 26
Reward modeling and RLHF                    Ch. 27
GRPO and verifiable reward RL               Ch. 28
End-to-end serving stack design             Ch. 29
End-to-end pre-training design              Ch. 30
End-to-end fine-tuning pipeline             Ch. 31

The Arithmetic Is the Point


Every number in production LLM engineering is derivable. When a colleague states that “a 70B model
needs at least 4 H100s to serve,” the correct response is to ask whether that accounts for KV cache at the target
batch size and context length, not to accept it as a fact. The derivation:
$M_{\text{weights}} = 2 \times 70 \times 10^{9}$ bytes $\approx 140$ GB in bf16, spread across
$\lceil 140/80 \rceil = 2$ H100s for weights alone, but a realistic batch size and context window can
double that figure.
Chapter 15 and Appendix A derive these estimates in full, including the formulas for $B_{\max}$ (VRAM-limited
batch size), $B^{*}$ (KV-crossover batch size), and $t_{\text{step}}$ (decode step latency as a function of batch
size and model size). These are not rules of thumb; they are consequences of memory capacity and bandwidth,
and they update correctly when you change the model, the hardware, or the context length.
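
As an illustration of the derivation above, here is a minimal sketch that evaluates the weight footprint, the GPU count for weights alone, and the three quantities $B_{\max}$, $B^{*}$, and $t_{\text{step}}$ using the formulas listed in the Notation chapter. Everything beyond the 70B parameter count and the 80 GB H100 is an assumption made for the example: the model shape (80 layers, 8 key-value heads, head dimension 128), the 4096-token context, the per-GPU bandwidth of 3.35 TB/s, and the idealization that memory and bandwidth pool perfectly across 4 GPUs.

```python
import math

GB = 10**9  # SI gigabyte, matching the book's memory-unit convention


def kv_bytes_per_seq(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_elem=2):
    """C_kv = 2 x L x N_kv x H x S x bytes per element (bf16 by default)."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem


def b_max(total_vram, m_weights, c_kv):
    """VRAM-limited batch size: floor((M_GPU - M_weights) / C_kv)."""
    return (total_vram - m_weights) // c_kv


def b_star(m_weights, c_kv):
    """KV-crossover batch size: M_weights / C_kv."""
    return m_weights / c_kv


def t_step(m_weights, batch, c_kv, bandwidth):
    """Decode step latency in seconds: (M_weights + B * C_kv) / BW."""
    return (m_weights + batch * c_kv) / bandwidth


params = 70 * 10**9
m_weights = 2 * params                        # bf16: 2 bytes per parameter
gpu_vram = 80 * GB                            # one H100
gpus_for_weights = math.ceil(m_weights / gpu_vram)

# Assumed model shape and serving setup (hypothetical, for illustration only).
c_kv = kv_bytes_per_seq(n_layers=80, n_kv_heads=8, head_dim=128, seq_len=4096)
n_gpus = 4
total_vram = n_gpus * gpu_vram
total_bw = n_gpus * 3.35e12                   # bytes/s, assumes ideal scaling

print(f"M_weights        = {m_weights / GB:.0f} GB")
print(f"GPUs for weights = {gpus_for_weights}")
print(f"C_kv             = {c_kv / GB:.2f} GB per sequence")
print(f"B_max on {n_gpus} GPUs  = {b_max(total_vram, m_weights, c_kv)}")
print(f"B*               = {b_star(m_weights, c_kv):.1f}")
print(f"t_step at B=32   = {1e3 * t_step(m_weights, 32, c_kv, total_bw):.2f} ms")
```

Changing `seq_len`, `n_kv_heads`, or `n_gpus` and rerunning shows how quickly the "4 H100s" claim shifts with context length and batch size.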


Connecting the Book to the Engineering Literature


The book’s technical content is grounded in papers and engineering reports that have shaped the current
state of production LLM systems. Where a chapter’s content derives from a specific paper, the source is cited
inline. Appendix D provides an annotated reading list organized by topic, with each entry annotated to indicate
which specific claims are worth understanding in depth rather than simply citing by name.
The FlashAttention algorithm (Chapters 9 and 10), the Chinchilla scaling law (Chapter 14), continuous
batching (Chapter 18), and the GRPO training algorithm (Chapter 28) are examples where reading the original
paper alongside the corresponding chapter will deepen your understanding of the design choices behind the
method. The reading list indicates which papers repay that deeper reading and which are sufficient to know at
the result level.

Staying Current
The LLM systems field moves fast. The core arithmetic and hardware physics in this book (memory
bandwidth, arithmetic intensity, roofline analysis, parallelism trade-offs) are stable. Specific techniques
(attention variants, serving schedulers, compression methods) continue to evolve. When a new technique is
announced, the most reliable way to evaluate it is to ask: which constraint does it address, and what does it
trade away? That question is the same one this book trains you to ask, and it does not go stale.

Notation and Symbols

This chapter standardizes the symbols, abbreviations, and conventions used throughout the book.
Definitions are precise; where a symbol carries multiple meanings in the literature, the book’s chosen
convention is stated explicitly. Units are always written out on first use in each chapter; the tables below
list the canonical forms.

Model and Architecture


Symbol              Meaning                                     Notes
$P$                 Total parameter count                       e.g. $P = 7 \times 10^{9}$ for a 7B model
$L$                 Number of transformer layers (depth)
$d$                 Hidden / residual stream dimension          also written $d_{\text{model}}$
$d_{\text{model}}$  Hidden dimension (explicit form)            synonym for $d$
$d_{\text{head}}$   Attention head dimension                    typically 128
$d_{\text{ff}}$     Feed-forward (MLP) intermediate dimension   $\lfloor 8d/3 \rfloor$ for SwiGLU; $4d$ for ReLU/GELU
$N$                 Number of query attention heads
$N_{\text{kv}}$     Number of key-value heads                   $N_{\text{kv}} = N$ (MHA); $N_{\text{kv}} < N$ (GQA/MQA)
$H$                 Head dimension; also $d_{\text{head}}$      context disambiguates
$H_Q$               Query head count                            used in GQA ratio $H_Q / H_{KV}$
$H_{KV}$            Key-value head count                        $H_{KV} = 1$ (MQA); $H_{KV} = H_Q / G$ (GQA)
$G$                 GQA group size ($N / N_{\text{kv}}$)        reduction factor for KV cache
$V$                 Vocabulary size                             tokens in the tokenizer
$S$                 Sequence length in tokens                   also written $T$ in some chapters
$T$                 Sequence length in tokens                   synonym for $S$; also used for training tokens
$K_e$               Expert utilization rate in MoE              fraction of experts activated per token
$r$                 LoRA rank                                   dimension of low-rank adapters

Training and Optimization



Symbol              Meaning                                                             Notes
$C$                 Total compute budget (FLOPs)
$C_{\text{fwd}}$    FLOPs for one forward pass                                          $\approx 2 N_T P$
$C_{\text{step}}$   FLOPs for one training step                                         $\approx 6 N_T P$ (fwd + bwd)
$N_T$               Total tokens in a batch: $B \times T$
$D$                 Training dataset size (tokens)
$D^{*}$             Chinchilla-optimal token count                                      $D^{*} \approx 20P$ at compute-optimal
$N^{*}$             Chinchilla-optimal parameter count
$\eta$              Learning rate
$\alpha$            Adam $\beta_1$ decay (first moment) or generic scaling coefficient
$\beta$             Adam $\beta_2$ decay (second moment); also KL penalty coefficient   context disambiguates
$\epsilon$          Adam numerical stability term; also PPO clip ratio tolerance
$m_t$               Adam first moment (mean) at step $t$
$v_t$               Adam second moment (variance) at step $t$
$\mu$               Momentum coefficient; also mean of a distribution
$\gamma$            Gradient clipping threshold; also discount factor in RL
$\lambda$           L2 regularization weight; also GAE discount in RL
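
The FLOPs entries in this table combine into a quick compute-budget estimate. The sketch below applies $C_{\text{fwd}} \approx 2 N_T P$, $C_{\text{step}} \approx 6 N_T P$, and $D^{*} \approx 20P$ to a 7B-parameter model. The batch size and sequence length are hypothetical values chosen for illustration, and the total budget assumes a single pass over $D^{*}$ tokens.

```python
def forward_flops(n_tokens, n_params):
    """C_fwd ~= 2 * N_T * P (one multiply-accumulate counts as 2 FLOPs)."""
    return 2 * n_tokens * n_params


def step_flops(n_tokens, n_params):
    """C_step ~= 6 * N_T * P (forward plus backward)."""
    return 6 * n_tokens * n_params


def chinchilla_optimal_tokens(n_params):
    """D* ~= 20 * P at the compute-optimal point."""
    return 20 * n_params


n_params = 7 * 10**9                # P for a 7B model
batch_size, seq_len = 1024, 4096    # hypothetical training configuration
n_tokens = batch_size * seq_len     # N_T = B * T

d_star = chinchilla_optimal_tokens(n_params)
total_flops = step_flops(d_star, n_params)   # 6 * P * D* for one pass over D*
n_steps = d_star / n_tokens

print(f"N_T per step  = {n_tokens:,} tokens")
print(f"C_fwd         = {forward_flops(n_tokens, n_params):.3e} FLOPs")
print(f"C_step        = {step_flops(n_tokens, n_params):.3e} FLOPs")
print(f"D*            = {d_star:.3e} tokens")
print(f"Total compute = {total_flops:.3e} FLOPs over about {n_steps:,.0f} steps")
```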

Inference and Serving


Symbol                Meaning                                             Notes
$B$                   Batch size (number of concurrent sequences)
$B_{\max}$            Maximum batch size given VRAM                       $\lfloor (M_{\text{GPU}} - M_{\text{weights}}) / C_{\text{kv}} \rfloor$
$B^{*}$               KV-crossover batch size                             $M_{\text{weights}} / C_{\text{kv}}$; throughput shifts from superlinear to sublinear
$C_{\text{kv}}$       KV cache memory per sequence                        $2 \times L \times N_{\text{kv}} \times H \times S \times 2$ bytes (bf16)
$C_{\text{seq}}$      KV cache memory per sequence (alternate notation)   synonym for $C_{\text{kv}}$
$M_{\text{weights}}$  Model weight memory                                 $2P$ bytes in bf16
$M_{\text{GPU}}$      Total GPU VRAM                                      e.g. 80 GB for H100 SXM
$t_{\text{step}}$     Decode step latency                                 $(M_{\text{weights}} + B \cdot C_{\text{kv}}) / \mathrm{BW}$
TPS                   Tokens per second (throughput)                      aggregate across all concurrent sequences
TTFT                  Time to first token                                 prefill latency; compute-bound
TBT                   Time between tokens (decode latency)                per-token; memory-bandwidth-bound
TPOT                  Time per output token                               synonym for TBT
ITL                   Inter-token latency                                 synonym for TBT / TPOT
$\pi_\theta$          Policy (language model) parameterized by $\theta$   used in RLHF/RL chapters
$\pi_{\text{ref}}$    Reference policy (frozen pre-RLHF model)
$r_\phi$              Reward model parameterized by $\phi$
$\theta$              Model parameters (generic)
$G$                   Number of rollouts per prompt in GRPO               distinct from GQA group size $G$; context disambiguates
$\hat{A}_i$           Group-normalized advantage for response $i$         GRPO: $(r_i - \mu_R) / \sigma_R$
$y_w, y_l$            Preferred / rejected response in a preference pair  DPO/RLHF notation
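
The group-normalized advantage in the table can be computed in a few lines. The sketch below is one plausible reading of $\hat{A}_i = (r_i - \mu_R) / \sigma_R$ over the $G$ rollouts of a single prompt. Two details are assumptions not stated in the table: the population standard deviation is used for $\sigma_R$, and a small `eps` is added to the denominator so that a group with identical rewards yields zero advantages instead of a division by zero.

```python
import statistics


def group_normalized_advantages(rewards, eps=1e-8):
    """Return (r_i - mean) / (std + eps) for the rollouts of one prompt.

    `eps` and the choice of population standard deviation are assumptions
    made for this sketch, not part of the notation table.
    """
    mu = statistics.fmean(rewards)
    sigma = statistics.pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical verifiable rewards for G = 4 rollouts (1.0 = correct answer).
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_normalized_advantages(rewards)
print(advantages)

# A group where every rollout gets the same reward carries no learning signal.
print(group_normalized_advantages([1.0, 1.0, 1.0, 1.0]))
```

With binary rewards and an even split, correct rollouts receive an advantage close to +1 and incorrect ones close to -1; the uniform group returns all zeros.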

Hardware and Performance


Symbol                Meaning                                                 Notes
BW                    Memory bandwidth (HBM)                                  GB/s or TB/s
MFU                   Model FLOPs Utilization                                 achieved FLOP/s ÷ peak FLOP/s; target ≥ 0.4
MBU                   Memory Bandwidth Utilization                            achieved BW ÷ peak BW
$P_{\text{attn}}$     Parameters in attention layers                          $\approx 4d^{2}$ per layer (MHA)
$P_{\text{mlp}}$      Parameters in MLP layers                                $3\, d\, d_{\text{ff}}$ per layer (SwiGLU)
$B_r, B_c$            FlashAttention tile (block) sizes for rows and columns  SRAM tile dimensions
$A_k, B_k$            Left and right matrix tiles in tiled GEMM
$t_{\text{wave}}$     Time to execute one SM wave                             GPU scheduling unit
$N_{\text{total}}$    Total expert count in MoE
$N_{\text{active}}$   Active experts per token in MoE                         $N_{\text{active}} \ll N_{\text{total}}$
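
The parameter-count and MFU entries above lend themselves to a quick sanity check. The sketch below estimates transformer-block parameters from $P_{\text{attn}} \approx 4d^{2}$ and $P_{\text{mlp}} = 3\, d\, d_{\text{ff}}$ with $d_{\text{ff}} = \lfloor 8d/3 \rfloor$, then computes training MFU as achieved FLOP/s over peak FLOP/s, taking achieved FLOP/s as $6P$ times tokens per second. The model shape (d = 4096, 32 layers), the throughput of 8,000 tokens/s per GPU, and the peak of 989 TFLOP/s are assumptions for illustration; embeddings and norms are ignored, and real models often round $d_{\text{ff}}$ to a hardware-friendly multiple.

```python
def attn_params_per_layer(d_model):
    """P_attn ~= 4 * d^2 per layer (Q, K, V and output projections, MHA)."""
    return 4 * d_model * d_model


def swiglu_ff_dim(d_model):
    """d_ff = floor(8d / 3) for SwiGLU."""
    return (8 * d_model) // 3


def mlp_params_per_layer(d_model, d_ff):
    """P_mlp = 3 * d * d_ff per layer (gate, up and down projections)."""
    return 3 * d_model * d_ff


def training_mfu(n_params, tokens_per_second, peak_flops_per_second):
    """Achieved FLOP/s (6 * P * tokens/s) divided by peak FLOP/s."""
    return 6 * n_params * tokens_per_second / peak_flops_per_second


# Hypothetical model shape (assumption for illustration only).
d_model, n_layers = 4096, 32
d_ff = swiglu_ff_dim(d_model)
per_layer = attn_params_per_layer(d_model) + mlp_params_per_layer(d_model, d_ff)
block_params = n_layers * per_layer

# Hypothetical measured throughput and assumed peak for one GPU.
utilization = training_mfu(block_params, tokens_per_second=8_000,
                           peak_flops_per_second=989e12)

print(f"d_ff             = {d_ff}")
print(f"params per layer = {per_layer:,}")
print(f"block parameters = {block_params:,}")
print(f"MFU              = {utilization:.2f}")
```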

Alignment and Post-Training


Symbol                                              Meaning                                       Notes
$D_{\mathrm{KL}}(\cdot \| \cdot)$                   Kullback-Leibler divergence
$D_{\mathrm{KL}}(\pi_\theta \| \pi_{\text{ref}})$  Policy drift from reference                   regularization term in RLHF/DPO
$\sigma$                                            Standard deviation; also sigmoid activation   context disambiguates
$p, q$                                              Generic probability distributions
$x$                                                 Input token sequence
$i, j, k$                                           Generic indices

Common Abbreviations


Term      Meaning
MHA       Multi-Head Attention
GQA       Grouped-Query Attention
MQA       Multi-Query Attention (GQA with $N_{\text{kv}} = 1$)
MLA       Multi-Head Latent Attention
KV cache  Key-Value cache stored between decode steps
HBM       High-Bandwidth Memory (GPU DRAM)
SRAM      Static RAM; on-chip shared memory / L1 cache
SM        Streaming Multiprocessor (GPU compute unit)
FLOP      Floating-point operation (one multiply-add = 2 FLOPs)
MFU       Model FLOPs Utilization
TP        Tensor Parallelism (intra-layer, intra-node)
PP        Pipeline Parallelism (inter-layer, inter-node)
DP        Data Parallelism
ZeRO      Zero Redundancy Optimizer (stages 1-3)
FSDP      Fully Sharded Data Parallel (ZeRO stage 3 in PyTorch)
SFT       Supervised Fine-Tuning
RLHF      Reinforcement Learning from Human Feedback
DPO       Direct Preference Optimization
PPO       Proximal Policy Optimization
GRPO      Group Relative Policy Optimization
RLVR      Reinforcement Learning with Verifiable Rewards
LoRA      Low-Rank Adaptation
QLoRA     Quantized LoRA
PEFT      Parameter-Efficient Fine-Tuning
MoE       Mixture of Experts
CoT       Chain of Thought
BF16      Brain Float 16 (1 sign, 8 exponent, 7 mantissa bits)
FP16      IEEE Float 16 (1 sign, 5 exponent, 10 mantissa bits)
INT8      8-bit integer quantization
SLO       Service Level Objective
QPS       Queries per second

Conventions
Tokens vs. sequences. “Tokens” refers to individual vocabulary elements; “sequence” or “context” refers
to an ordered list of tokens. Batch size $B$ counts sequences, not tokens; $N_T = B \times T$ counts tokens.
FLOPs counting. One multiply-accumulate (MAC) operation = 2 FLOPs. The formula $C_{\text{fwd}} \approx 2 N_T P$
follows this convention. “FLOP/s” (floating-point operations per second) uses the same factor.
Memory units. 1 GB = $10^{9}$ bytes throughout (SI prefix, not binary). Bandwidth is reported in GB/s or
TB/s.
Latency units. Milliseconds (ms) for per-token and per-request latencies; seconds (s) for end-to-end
generation time; microseconds (µs) for kernel-level timings.
Throughput units. Tokens per second (tok/s or tokens/s) for generation throughput; queries per second
(QPS) for request-level throughput.


Precision notation. Memory footprints assume bf16 (2 bytes/parameter) unless otherwise stated. The
training-memory figure of 16P bytes assumes AdamW in fp32.
Prefill vs. decode. Prefill encodes the full prompt in one parallel forward pass (compute-bound). Decode
generates tokens one at a time in an autoregressive loop (memory-bandwidth-bound). These two phases
have distinct bottlenecks and optimizations.
Latency vs. throughput. TTFT optimizes differently from aggregate TPS. Minimize TTFT with small
batches and fast prefill hardware; maximize TPS with large batches and high memory bandwidth. A single
serving configuration cannot simultaneously optimize both without workload-specific scheduling.
Overloaded symbols. G denotes both the GQA group size and the GRPO rollout count; T denotes
both sequence length and training token count; β denotes both the Adam second-moment decay and the
KL penalty coefficient. In each case, context and the surrounding equation make the intended meaning
unambiguous.

Acknowledgments

A book about systems is itself a system, and this one had many contributors.

The technical content in this book was shaped by countless conversations with engineers and researchers
working on real production systems. Several people gave detailed feedback on draft chapters, caught errors in
derivations, and pointed out places where the framing did not match how the work actually gets done in practice.

A number of colleagues reviewed early drafts and offered both technical corrections and perspective on
what candidates actually encounter in interviews at frontier labs. Their input made the interview framing in
Part I and the end-to-end drills in Part XI significantly sharper.

The open-source research community deserves specific acknowledgment. The derivations in this book
stand on the published work of the teams behind FlashAttention, the Chinchilla scaling law study, the DeepSeek
architecture and training reports, the LLaMA model series, the Mixtral MoE architecture, the RLHF and DPO
preference alignment papers, and the GRPO and RLVR training algorithms. Each of these is cited in the relevant
chapter and listed in Appendix D. The existence of detailed technical reports from frontier labs, a relatively
recent norm, made it possible to ground the book’s engineering content in actual production practice rather than
academic approximations.

Finally, thank you to everyone who read early versions and offered encouragement at the moments when
the project was hardest to continue. You know who you are.
About the Author

I am Hao Hoang, an applied AI/ML engineer and technical writer. I am based in Los Angeles, California,
and I am originally from Quang Tri, Viet Nam.
I own AI Interview Prep and write for the community every day on Substack and on LinkedIn. On Substack
I publish Daily AI Interview Questions and longer notes on LLM system design, reinforcement learning, vision-
language models, RAG-style retrieval, and other high-signal topics for engineers and researchers preparing for
rigorous AI interviews. On LinkedIn I share the same thread of ideas in shorter, daily posts so people can follow
along between editions. That public writing reaches more than 55,000 LinkedIn followers and 12,000 Substack
readers, and it is the main place I teach, learn in public, and answer what the community is asking about.
About this book. I am the sole author: I researched and wrote every chapter myself, without co-authors
or ghostwriters. I wrote it because I wanted a single, opinionated reference for the patterns I kept having to
re-derive when preparing for interviews and when shipping LLM systems in production, at the intersection of
research, systems engineering, and deployment.

You can reach me at:


Email: [Link]@[Link]
Website: [Link]
Substack: [Link]
LinkedIn: [Link]
I publish daily posts on LinkedIn and the newsletter at [Link]. Errata
or corrections are welcome at the email above.
