LLM Handbook
Amer Hussein
Expanded Edition with Full Mathematical Derivations
Linkedin: [Link]/in/amer-hussein/
Table of Contents
8.1 KV Cache
8.2 Quantization
11.2 LLM-as-a-Judge
Glossary
References
Chapter 1: Mathematical Foundations
Understanding Large Language Models requires a solid foundation in several areas of mathematics. This
chapter provides the essential mathematical background needed to comprehend the algorithms, training
procedures, and theoretical underpinnings of modern LLMs. We present both the key concepts and detailed
derivations that appear throughout the field.
Language models are fundamentally probabilistic models. They learn to predict the probability distribution
over possible next tokens given a sequence of previous tokens. This section reviews the essential concepts
from probability theory that underpin language modeling.
A probability distribution $P$ over a discrete set $X$ assigns a value $P(x) \in [0, 1]$ to each $x \in X$ such that:
$$\sum_{x \in X} P(x) = 1 \tag{1.1}$$
Conditional Probability
$$P(A \mid B) = \frac{P(A \cap B)}{P(B)} = \frac{P(A, B)}{P(B)} \tag{1.2}$$
where $P(B) > 0$. This leads to the chain rule: $P(A, B) = P(A \mid B)\,P(B) = P(B \mid A)\,P(A)$.
In language modeling, we care about the conditional probability of the next token given previous tokens. For
a sequence of tokens $x_1, x_2, \ldots, x_n$, the joint probability can be factorized using the chain rule:
$$P(x_1, x_2, \ldots, x_n) = \prod_{i=1}^{n} P(x_i \mid x_1, \ldots, x_{i-1}) \tag{1.3}$$
where $x_{<i}$ is shorthand for the preceding tokens $x_1, \ldots, x_{i-1}$.
Language models are trained using maximum likelihood estimation (MLE). Given a dataset D of sequences,
we find parameters θ that maximize the likelihood of the data:
$$\theta^* = \arg\max_{\theta} \prod_{x \in D} P(x; \theta) \tag{1.4}$$
Key Takeaway
$$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)} \tag{1.6}$$
$$P(\theta \mid D) = \frac{P(D \mid \theta)\,P(\theta)}{P(D)} \tag{1.7}$$
1.2.1 Entropy
The entropy of a discrete random variable X with distribution P measures the average uncertainty:
$$H(X) = -\sum_{x} P(x) \log P(x) \tag{1.8}$$
Entropy is measured in bits when using base-2 logarithms. It represents the minimum average number of bits
needed to encode outcomes from the distribution.
Properties of Entropy
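Non-negativity: $H(X) \geq 0$
Upper bound: for a distribution over $k$ outcomes, Jensen's inequality (the logarithm is concave) gives
$$H(X) = E\left[\log \frac{1}{P(X)}\right] \leq \log E\left[\frac{1}{P(X)}\right] = \log \sum_{x:\,P(x) > 0} P(x) \cdot \frac{1}{P(x)} \leq \log k$$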
1.2.2 Cross-Entropy
The cross-entropy between a true distribution P and model distribution Q measures the average number of
bits needed to encode samples from P using a code optimized for Q:
$$H(P, Q) = -\sum_{x} P(x) \log Q(x) \tag{1.9}$$
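For language modeling, $P$ is the empirical distribution from training data (one-hot at the true token), and $Q$
is the model's predicted distribution. Because $P$ is one-hot, only the true token's term survives, and the cross-entropy loss becomes the negative log-likelihood of the observed tokens:
$$\mathcal{L}_{CE} = -\sum_{i} \log Q(x_i \mid x_{<i}) \tag{1.10}$$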
1.2.3 KL Divergence
The Kullback-Leibler (KL) divergence measures the difference between two probability distributions:
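$$D_{KL}(P \,\|\, Q) = \sum_{x} P(x) \log \frac{P(x)}{Q(x)} = E_P\left[\log \frac{P(X)}{Q(X)}\right] \tag{1.11}$$
Cross-entropy decomposes into entropy plus KL divergence:
$$H(P, Q) = -\sum_{x} P(x) \log Q(x) = -\sum_{x} P(x) \log P(x) + \sum_{x} P(x) \log \frac{P(x)}{Q(x)} = H(P) + D_{KL}(P \,\|\, Q)$$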
This shows that minimizing cross-entropy is equivalent to minimizing KL divergence (since H(P ) is
constant with respect to Q).
1.2.4 Perplexity
Perplexity is the standard metric for language model evaluation, defined as the exponential of cross-entropy:
$$\text{Perplexity} = \exp\left(-\frac{1}{N} \sum_{i=1}^{N} \log P(x_i \mid x_{<i})\right)$$
Figure 1.1: Binary entropy function and cross-entropy comparison across different model
predictions.
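The quantities in this section can be checked numerically. The following is a minimal sketch in plain Python; the distributions `p` and `q` are made-up illustrative values, and the function names are not from the original text. It verifies the decomposition $H(P, Q) = H(P) + D_{KL}(P \,\|\, Q)$ on a three-outcome example.

```python
import math

def entropy(p):
    """Shannon entropy H(P) in bits; outcomes with P(x) = 0 contribute 0."""
    return -sum(px * math.log2(px) for px in p if px > 0)

def cross_entropy(p, q):
    """Cross-entropy H(P, Q) in bits; assumes Q(x) > 0 wherever P(x) > 0."""
    return -sum(px * math.log2(qx) for px, qx in zip(p, q) if px > 0)

def kl_divergence(p, q):
    """KL divergence D_KL(P || Q) in bits."""
    return sum(px * math.log2(px / qx) for px, qx in zip(p, q) if px > 0)

def perplexity(token_probs):
    """exp of the mean negative natural-log probability of the observed tokens."""
    n = len(token_probs)
    return math.exp(-sum(math.log(pr) for pr in token_probs) / n)

p = [0.5, 0.25, 0.25]  # illustrative "true" distribution (assumption)
q = [0.4, 0.4, 0.2]    # illustrative model distribution (assumption)

h_p = entropy(p)
h_pq = cross_entropy(p, q)
d_kl = kl_divergence(p, q)
print(f"H(P)={h_p:.4f}  H(P,Q)={h_pq:.4f}  KL={d_kl:.4f}")
```

A model that assigns probability 0.25 to every observed token has perplexity 4, which matches the intuition of choosing uniformly among four options.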
Training LLMs involves optimizing billions of parameters using gradient-based methods. This section covers
the optimization algorithms and theoretical foundations used in deep learning.
$$\theta_{t+1} = \theta_t - \eta \nabla_\theta L(\theta_t)$$
where $\eta$ is the learning rate and $\nabla_\theta L(\theta_t)$ is the gradient of the loss with respect to parameters.
For large datasets, computing the full gradient is expensive. SGD approximates the gradient using a mini-
batch:
1.3.3 Momentum
$$\theta_{t+1} = \theta_t - \eta v_{t+1} \tag{1.16}$$
Adam (Adaptive Moment Estimation) is the most commonly used optimizer for training LLMs. It combines
momentum with adaptive learning rates:
The first moment estimate is initialized to zero, causing bias in early iterations:
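$$m_t = (1 - \beta_1) \sum_{i=1}^{t} \beta_1^{t-i} g_i$$
$$E[m_t] = E\left[(1 - \beta_1) \sum_{i=1}^{t} \beta_1^{t-i} g_i\right] = E[g_t]\,(1 - \beta_1) \sum_{i=1}^{t} \beta_1^{t-i} + \zeta = E[g_t]\,(1 - \beta_1^t) + \zeta$$
The last step uses the geometric sum $(1 - \beta_1) \sum_{i=1}^{t} \beta_1^{t-i} = 1 - \beta_1^t$, and $\zeta$ is a residual that vanishes when the gradient distribution is stationary. Dividing $m_t$ by $1 - \beta_1^t$ therefore removes the bias and gives the corrected estimate $\hat{m}_t$.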
1.3.5 AdamW
AdamW decouples weight decay from gradient updates, which improves generalization:
$$\theta_t = \theta_{t-1} - \eta \left( \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon} + \lambda \theta_{t-1} \right) \tag{1.17}$$
where λ is the weight decay coefficient. This differs from L2 regularization in Adam, which adds weight
decay to the gradient before the adaptive update.
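To make the decoupling concrete, here is a minimal sketch of a single AdamW step on a list of scalar parameters, following Eq. (1.17). The function name and the default hyperparameters (`lr`, `beta1`, `beta2`, `eps`, `weight_decay`) are illustrative assumptions, not values prescribed by the text.

```python
import math

def adamw_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.01):
    """One AdamW update over lists of floats; t is the 1-based step count."""
    new_theta, new_m, new_v = [], [], []
    for th, g, mi, vi in zip(theta, grad, m, v):
        mi = beta1 * mi + (1 - beta1) * g        # first moment estimate
        vi = beta2 * vi + (1 - beta2) * g * g    # second moment estimate
        m_hat = mi / (1 - beta1 ** t)            # bias-corrected first moment
        v_hat = vi / (1 - beta2 ** t)            # bias-corrected second moment
        # Weight decay is added outside the adaptive term (decoupled),
        # rather than being folded into the gradient g as in L2 regularization.
        th = th - lr * (m_hat / (math.sqrt(v_hat) + eps) + weight_decay * th)
        new_theta.append(th)
        new_m.append(mi)
        new_v.append(vi)
    return new_theta, new_m, new_v

theta, m, v = adamw_step([1.0], [2.0], [0.0], [0.0], t=1)
print(theta)
```

On the first step the bias correction makes $\hat{m}_1 / \sqrt{\hat{v}_1}$ equal to the sign of the gradient, so the parameter moves by roughly `lr` plus the decay term.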
$$\eta_t = \eta_{min} + \frac{1}{2}(\eta_{max} - \eta_{min})\left(1 + \cos\left(\frac{t}{T}\pi\right)\right) \tag{1.18}$$
Cosine decay with linear warmup is commonly used for LLM training:
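$$\eta_t = \begin{cases} \eta_{max} \cdot \dfrac{t}{T_{warmup}} & t < T_{warmup} \\ \eta_{min} + \dfrac{1}{2}(\eta_{max} - \eta_{min})\left(1 + \cos\left(\dfrac{t - T_{warmup}}{T - T_{warmup}}\pi\right)\right) & t \geq T_{warmup} \end{cases} \tag{1.19}$$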
Figure 1.2: Common learning rate schedules for LLM training: cosine decay, linear decay, and constant with warmup.
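The warmup-plus-cosine schedule of Eq. (1.19) is short enough to write out directly. This is a minimal sketch; the function name `lr_at` and the example step counts and learning rates are assumptions chosen for illustration.

```python
import math

def lr_at(t, total_steps, warmup_steps, lr_max, lr_min=0.0):
    """Learning rate at step t: linear warmup, then cosine decay to lr_min."""
    if t < warmup_steps:
        return lr_max * t / warmup_steps
    progress = (t - warmup_steps) / (total_steps - warmup_steps)
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * progress))

# Illustrative settings (assumptions): 1000 steps, 100 of them warmup.
schedule = [lr_at(t, 1000, 100, lr_max=3e-4, lr_min=3e-5) for t in range(1001)]
print(schedule[0], schedule[100], schedule[1000])
```

The rate starts at zero, reaches `lr_max` exactly at the end of warmup, and ends at `lr_min` on the final step.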
Transformers rely heavily on matrix operations for efficient parallel computation. This section reviews the
essential linear algebra concepts.
Transpose: $(A^T)_{ij} = A_{ji}$
$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V \tag{1.20}$$
where $Q, K, V \in \mathbb{R}^{n \times d_k}$ are matrices of queries, keys, and values respectively, and $n$ is the sequence
length.
Operation | Time | Memory
$QK^T$ | $O(n^2 \cdot d_k)$ | $O(n^2)$
Multiply by $V$ | $O(n^2 \cdot d_k)$ | $O(n \cdot d_k)$
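$$A = U \Sigma V^T \tag{1.21}$$
where $U \in \mathbb{R}^{m \times m}$ and $V \in \mathbb{R}^{n \times n}$ are orthogonal matrices, and $\Sigma \in \mathbb{R}^{m \times n}$ is diagonal with non-negative singular values $\sigma_1 \geq \sigma_2 \geq \cdots \geq 0$.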
$$g_{clipped} = \begin{cases} g & \text{if } \|g\| \leq c \\ c \cdot \dfrac{g}{\|g\|} & \text{otherwise} \end{cases} \tag{1.25}$$
$$W_{ij} \sim U\left(-\sqrt{\frac{6}{n_{in} + n_{out}}},\ \sqrt{\frac{6}{n_{in} + n_{out}}}\right) \tag{1.26}$$
$$W_{ij} \sim N\left(0, \frac{2}{n_{in}}\right) \tag{1.27}$$
For independent, zero-mean weights and inputs:
$$\text{Var}(y_i) = \sum_{j=1}^{n_{in}} \text{Var}(W_{ij})\,\text{Var}(x_j) = n_{in} \cdot \text{Var}(W) \cdot \text{Var}(x)$$
$$\text{Var}(W) = \frac{2}{n_{in} + n_{out}}$$
$$\text{softmax}(x_i) = \frac{e^{x_i}}{\sum_j e^{x_j}} \tag{1.22}$$
This can overflow for large $x_i$. The numerically stable version subtracts the maximum:
$$\text{softmax}(x_i) = \frac{e^{x_i - \max(x)}}{\sum_j e^{x_j - \max(x)}} \tag{1.23}$$
Proof of Equivalence
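$$\frac{e^{x_i - c}}{\sum_j e^{x_j - c}} = \frac{e^{x_i} e^{-c}}{\sum_j e^{x_j} e^{-c}} = \frac{e^{-c}\, e^{x_i}}{e^{-c} \sum_j e^{x_j}} = \frac{e^{x_i}}{\sum_j e^{x_j}}$$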
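The equivalence can also be seen in code. The sketch below implements the subtract-max form in plain Python; the input values are arbitrary illustrations. With logits near 1000, calling `math.exp` on the raw values would raise `OverflowError`, whereas the shifted version only exponentiates non-positive numbers.

```python
import math

def softmax(xs):
    """Numerically stable softmax: shift by max(xs) before exponentiating."""
    c = max(xs)
    exps = [math.exp(x - c) for x in xs]  # every exponent is <= 0, so no overflow
    total = sum(exps)
    return [e / total for e in exps]

large = softmax([1000.0, 1001.0, 1002.0])  # naive exp(1000.0) would overflow
small = softmax([0.0, 1.0, 2.0])           # same logits shifted by a constant
print(large)
print(small)
```

Both calls print the same distribution, since the two inputs differ only by the constant $c = 1000$.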
$$\text{LayerNorm}(x) = \gamma \odot \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta \tag{1.24}$$
where $\epsilon$ (typically $10^{-6}$ to $10^{-5}$) prevents division by zero and improves numerical stability.
Mixed precision uses FP16/BF16 for forward/backward and FP32 for optimizer states. Key considerations:
Loss scaling: Multiply loss by a large constant to preserve small gradients in FP16
Figure 1.3: Gradient flow in deep neural networks showing vanishing, exploding, and stable gradient patterns.
Chapter 1 Summary
Language models learn conditional probability distributions over tokens using the chain rule: $P(x_1, \ldots, x_n) = \prod_{i=1}^{n} P(x_i \mid x_{<i})$
Cross-entropy loss equals negative log-likelihood: $\mathcal{L}_{CE} = -\sum \log P(x_i \mid x_{<i})$
Numerical stability techniques (softmax subtract-max, loss scaling) are essential for training
Chapter 2: Tokenization and Embeddings
Before neural networks can process text, it must be converted into numerical representations. This chapter
covers the fundamental techniques for tokenizing text and creating meaningful vector representations of
words and subwords. We present both the algorithms and their mathematical foundations.
Method | Description | Vocab Size | Pros | Cons
Subword (BPE) | Merge frequent character pairs | 32K-100K | Balance of vocab and length | Can split words awkwardly
SentencePiece | Language-agnostic BPE/Unigram | 32K-250K | Works on raw text | Requires pre-training
BPE is the most widely used tokenization algorithm in modern LLMs. It starts with a character vocabulary
and iteratively merges the most frequent adjacent pairs.
for i = 1 to num_merges:
    t_AB = most frequent adjacent token pair (A, B) in corpus
BPE can be viewed as optimizing a compression objective. Each merge reduces the total number of
tokens in the corpus. The merge score for pair (A, B) is:
score(A, B) = count(A, B)
The algorithm greedily selects the pair with highest count, which locally optimizes the compression
ratio. This is a greedy approximation to the NP-hard problem of finding the optimal set of merges.
BPE Example
After 10 merges: Vocabulary includes common subwords like "low", "new", "er", "est"
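The greedy merge loop can be sketched in a few lines of Python. This is an illustrative toy trainer, not a production tokenizer: the word-frequency corpus below is an assumption chosen to resemble the subwords mentioned above, and ties between equally frequent pairs are broken by first-seen order, which is one of several reasonable conventions.

```python
from collections import Counter

def get_pair_counts(words):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

def train_bpe(word_freqs, num_merges):
    """Greedy BPE: repeatedly merge the most frequent adjacent pair."""
    words = {tuple(w): f for w, f in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(words)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # ties: first pair seen wins
        merges.append(best)
        words = merge_pair(words, best)
    return merges, words

# Toy corpus (assumption): word -> frequency
corpus = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
merges, segmented = train_bpe(corpus, num_merges=3)
print(merges)
```

On this corpus the frequent suffix "est" is assembled within the first two merges, which illustrates how common subwords enter the vocabulary early.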
Figure 2.1: Byte Pair Encoding merge process showing vocabulary growth over training steps.
WordPiece, used in BERT, differs from BPE in how it selects merges. Instead of frequency, it uses a
language modeling objective:
$$\text{score}(A, B) = \frac{P(AB)}{P(A)\,P(B)} \tag{2.1}$$
This score measures how much more likely the merged pair is compared to the product of individual
probabilities. Higher scores indicate stronger co-occurrence patterns.
SentencePiece's Unigram algorithm takes a different approach, starting with a large vocabulary and pruning
it:
$$V = V \setminus \{x^*\}$$
$$P(x) = \sum_{s \in S(x)} P(s) \tag{2.14}$$
Engineering Pitfall
Tokenization is not reversible in general. Different tokenizers may split the same text differently,
leading to subtle but important differences in model behavior. Always use the tokenizer that matches
the pre-trained model. Mixing tokenizers (e.g., using GPT-2's tokenizer with LLaMA's weights) will
produce garbage outputs.
Word embeddings map discrete tokens to continuous vector spaces where semantic relationships are
preserved. This section covers the evolution of word embedding techniques.
The simplest representation is one-hot encoding, where each word is represented as a vector with a single 1
at its vocabulary index:
This representation is sparse and fails to capture any semantic relationships between words. The dot product
of any two distinct one-hot vectors is zero.
Word2Vec (Mikolov et al., 2013) introduced efficient methods for learning dense word embeddings. Two
architectures were proposed:
$$L = \frac{1}{T} \sum_{t=1}^{T} \sum_{-c \leq j \leq c,\ j \neq 0} \log P(w_{t+j} \mid w_t) \tag{2.5}$$
$$P(w_O \mid w_I) = \frac{\exp\left({v'_{w_O}}^{T} v_{w_I}\right)}{\sum_{w=1}^{W} \exp\left({v'_{w}}^{T} v_{w_I}\right)} \tag{2.6}$$
Here $v_w$ is the input vector and $v'_w$ is the output vector for word $w$.
The full softmax is computationally expensive for large vocabularies. Negative sampling approximates it by
contrasting true pairs with random negative samples:
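$$L = \log \sigma\left({v'_{w_O}}^{T} v_{w_I}\right) + \sum_{i=1}^{k} E_{w_i \sim P_n(w)}\left[\log \sigma\left(-{v'_{w_i}}^{T} v_{w_I}\right)\right] \tag{2.7}$$
where $k$ is the number of negative samples and $P_n(w)$ is the noise distribution (typically proportional to the unigram frequency raised to the 3/4 power).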
GloVe combines global corpus statistics with local context window methods. It uses word co-occurrence
counts Xij (how often word j appears in the context of word i):
$$L = \sum_{i,j=1}^{V} f(X_{ij}) \left( w_i^T \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2 \tag{2.8}$$
$$f(x) = \begin{cases} (x / x_{max})^{\alpha} & \text{if } x < x_{max} \\ 1 & \text{otherwise} \end{cases} \tag{2.9}$$
Figure 2.2: Word embeddings showing semantic relationships in 2D projection. Similar words cluster together and analogies
form parallelograms.
Static embeddings (Word2Vec, GloVe) assign the same vector to a word regardless of context. Contextual
embeddings address this limitation by producing different representations based on surrounding words.
ELMo generates contextualized word representations by combining hidden states from a bidirectional LSTM
language model:
$$\text{ELMo}_k = E(R_k; \Theta) = \gamma \sum_{j=0}^{L} s_j h_{k,j} \tag{2.10}$$
where $s_j$ are softmax-normalized layer weights and $\gamma$ is a scale factor.
where $E_{token} \in \mathbb{R}^{|V| \times d}$ is the token embedding matrix and $E_{pos} \in \mathbb{R}^{L_{max} \times d}$ is the positional embedding
matrix.
Word embeddings exhibit interesting arithmetic properties that reflect semantic relationships.
This can be formalized as finding the word $w^*$ that maximizes cosine similarity:
$$w^* = \arg\max_{w} \frac{(v_a - v_b + v_c)^T v_w}{\|v_a - v_b + v_c\| \, \|v_w\|} \tag{2.13}$$
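The arg max in Eq. (2.13) can be sketched with a handful of hand-made vectors. The 2-D embeddings below are invented for illustration and are not learned vectors; excluding the three query words from the candidates is a common convention that the equation itself does not state.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors given as lists."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def analogy(emb, a, b, c):
    """Return the word w maximizing cos(v_a - v_b + v_c, v_w), excluding a, b, c."""
    target = [x - y + z for x, y, z in zip(emb[a], emb[b], emb[c])]
    candidates = [w for w in emb if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(target, emb[w]))

# Toy embeddings (assumption): axis 0 ~ "royalty", axis 1 ~ "gender"
emb = {
    "king":  [0.9, 0.8],
    "queen": [0.9, 0.2],
    "man":   [0.1, 0.8],
    "woman": [0.1, 0.2],
    "apple": [-0.7, -0.5],
}
result = analogy(emb, "king", "man", "woman")
print(result)
```

Here `king - man + woman` lands exactly on the toy vector for "queen"; with real embeddings the match is only approximate.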
Mikolov et al. (2013) showed that semantic and syntactic relationships form linear subspaces in the
embedding space:
Key Takeaway
The ability of embeddings to capture analogies suggests they learn meaningful semantic
representations. However, this property is not perfect and embeddings can encode societal biases
present in the training data. For example, $v_{doctor} - v_{nurse}$ may have a gender component due to gender stereotypes in the training text.
Figure 2.3: Effect of temperature on softmax distribution. Lower temperature makes the distribution sharper (more confident),
while higher temperature makes it more uniform.
Chapter 2 Summary
BPE is the dominant tokenization method, iteratively merging frequent character pairs
Subword tokenization balances vocabulary size and sequence length
Chapter 3: Transformer Architecture
The Transformer architecture, introduced in "Attention Is All You Need" (Vaswani et al., 2017),
revolutionized natural language processing by replacing recurrent layers with attention mechanisms. This
chapter provides a comprehensive mathematical treatment of the transformer architecture.
Figure 3.1: Transformer block architecture showing the flow from input through multi-head attention, add and norm, feed-
forward network, and output. Residual connections enable gradient flow in deep networks.
Self-attention allows each position in a sequence to attend to all other positions, computing a weighted
representation based on relevance scores. This is the core innovation of the transformer architecture.
The core attention computation involves three matrices: Queries (Q), Keys (K ), and Values (V ):
$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V \tag{3.1}$$
The scaling factor $1/\sqrt{d_k}$ prevents the dot products from growing too large in magnitude, which would push
the softmax into regions with extremely small gradients.
For queries and keys with components drawn i.i.d. from $N(0, 1)$:
$$q \cdot k = \sum_{i=1}^{d_k} q_i k_i$$
$$E[q \cdot k] = 0, \quad \text{Var}(q \cdot k) = d_k$$
Dividing by $\sqrt{d_k}$ therefore rescales the scores to unit variance.
The attention weights $\alpha_{ij}$ represent how much position $i$ should attend to position $j$:
$$\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k=1}^{n} \exp(e_{ik})} \tag{3.2}$$
where $e_{ij} = \dfrac{Q_i K_j^T}{\sqrt{d_k}}$ is the attention score between positions $i$ and $j$.
Key Takeaway
Attention can be viewed as a differentiable key-value lookup: queries match against keys, and the
resulting weights determine how much of each value to retrieve. The softmax ensures weights sum
to 1, creating a weighted average. This is analogous to database queries but with soft, differentiable
retrieval.
For language modeling, we use causal attention that prevents positions from attending to future positions:
$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}} + M\right) V \tag{3.3}$$
where $M$ is a mask matrix with $M_{ij} = -\infty$ for $j > i$ (future positions) and 0 otherwise:
$$M_{ij} = \begin{cases} 0 & \text{if } j \leq i \\ -\infty & \text{if } j > i \end{cases} \tag{3.4}$$
Figure 3.2: Attention mask patterns: causal (lower triangular) for autoregressive models, bidirectional (full) for encoder
models, and sliding window for sparse attention.
$$A = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right) \in \mathbb{R}^{n \times n} \tag{3.5}$$
$$\text{Output} = AV \in \mathbb{R}^{n \times d_k} \tag{3.6}$$
The attention matrix A contains all pairwise attention weights, enabling parallel computation on GPUs.
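As a concrete illustration of Eqs. (3.3) to (3.6), the following sketch computes causal scaled dot-product attention with plain Python lists. It is written for readability rather than speed; real implementations use batched matrix operations. The small `Q`, `K`, `V` matrices are arbitrary example values.

```python
import math

def softmax(row):
    """Stable softmax; exp(-inf) evaluates to 0.0, so masked entries get weight 0."""
    c = max(row)
    exps = [math.exp(x - c) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def causal_attention(Q, K, V):
    """Return (output, weights) for single-head causal attention."""
    n, d_k = len(Q), len(Q[0])
    d_v = len(V[0])
    output, weights = [], []
    for i in range(n):
        scores = []
        for j in range(n):
            if j > i:
                scores.append(float("-inf"))  # mask future positions
            else:
                dot = sum(q * k for q, k in zip(Q[i], K[j]))
                scores.append(dot / math.sqrt(d_k))
        a = softmax(scores)
        weights.append(a)
        output.append([sum(a[j] * V[j][c] for j in range(n)) for c in range(d_v)])
    return output, weights

Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
out, A = causal_attention(Q, K, V)
print(A)
```

The printed weight matrix is lower triangular, and the first position can only attend to itself, so its output equals the first row of `V`.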
Multi-head attention allows the model to jointly attend to information from different representation
subspaces at different positions.
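$$\text{head}_i = \text{Attention}(Q W_i^Q, K W_i^K, V W_i^V)$$
$$\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \ldots, \text{head}_h)\, W^O$$
where the output projection is
$$W^O \in \mathbb{R}^{h d_v \times d_{model}} \tag{3.10}$$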
Operation | FLOPs | Memory
$QK^T$ computation | $h \cdot n^2 \cdot d_k = n^2 d_{model}$ | $h \cdot n^2 = n^2 h$
Softmax | $h \cdot n^2$ | $h \cdot n^2$
Each transformer layer includes a position-wise feed-forward network applied independently to each
position.
The inner dimension is typically $d_{ff} = 4 \times d_{model}$. This expansion allows the model to learn more complex
transformations.
$$\text{Params}_{FFN} = d_{model} \cdot d_{ff} + d_{ff} + d_{ff} \cdot d_{model} + d_{model} = 2 \cdot d_{model} \cdot d_{ff} + d_{ff} + d_{model} \tag{3.14}$$
$$\text{Params}_{FFN} \approx 8 \cdot d_{model}^2 \tag{3.15}$$
Modern LLMs often use GELU (Gaussian Error Linear Unit) instead of ReLU:
$$\text{GELU}(x) = x \cdot \Phi(x) = x \cdot \frac{1}{2}\left[1 + \text{erf}\left(\frac{x}{\sqrt{2}}\right)\right] \tag{3.16}$$
where Φ(x) is the cumulative distribution function of the standard normal distribution.
GELU can be interpreted as a stochastic regularizer. If we multiply the input by a Bernoulli random
variable with probability Φ(x):
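$$\text{LayerNorm}(x) = \gamma \odot \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta \tag{3.18}$$
where $\mu$ and $\sigma^2$ are computed across the feature dimension for each sample:
$$\mu = \frac{1}{d} \sum_{i=1}^{d} x_i, \quad \sigma^2 = \frac{1}{d} \sum_{i=1}^{d} (x_i - \mu)^2 \tag{3.19}$$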
Pre-norm:
$$x_{l+1} = x_l + \text{Sublayer}(\text{LayerNorm}(x_l)) \tag{3.20}$$
Post-norm:
$$x_{l+1} = \text{LayerNorm}(x_l + \text{Sublayer}(x_l)) \tag{3.21}$$
$$x_{l+1} = x_l + \text{Sublayer}(\text{LayerNorm}(x_l)) \tag{3.23}$$
Pre-norm is more stable for deep networks and is used in GPT, LLaMA, and most modern architectures.
Key Takeaway
Pre-norm architecture places layer normalization before the attention/FFN sublayers, while post-
norm places it after the residual addition. Pre-norm is more stable for deep networks (100+ layers)
because it prevents the gradient from exploding/vanishing through the residual path.
Encoder-only models use bidirectional attention and are trained with masked language modeling:
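$$L_{MLM} = -\sum_{i \in M} \log P(x_i \mid x_{\setminus M}; \theta) \tag{3.24}$$
where $M$ is the set of masked positions and $x_{\setminus M}$ is the input with those positions masked.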
Decoder-only models use causal attention and are trained with autoregressive language modeling:
$$L = -\sum_{i=1}^{N} \log P(x_i \mid x_1, \ldots, x_{i-1}; \theta) \tag{3.25}$$
Encoder-decoder models use bidirectional attention on the input and causal attention on the output:
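$$L = -\sum_{i=1}^{N_{target}} \log P(y_i \mid y_1, \ldots, y_{i-1}, x; \theta)$$
where $x$ is the input sequence and $y$ is the target sequence.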
The first term accounts for matrix multiplications, the second for attention.
Figure 3.3: Comparison of transformer architecture variants: encoder-only (BERT), decoder-only (GPT), and encoder-
decoder (T5).
Figure 3.4: Comparison of training and inference pipelines for transformer models, showing the different computational
requirements.
Chapter 3 Summary
Chapter 4: Positional Encodings
The sinusoidal encoding uses sine and cosine functions of different frequencies:
$$PE_{(pos, 2i)} = \sin\left(\frac{pos}{10000^{2i/d_{model}}}\right) \tag{4.1}$$
$$PE_{(pos, 2i+1)} = \cos\left(\frac{pos}{10000^{2i/d_{model}}}\right) \tag{4.2}$$
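Sinusoidal encodings allow the model to learn to attend to relative positions. For any fixed offset $k$,
$PE_{pos+k}$ can be expressed as a linear function of $PE_{pos}$. Writing $\omega_i = 1/10000^{2i/d_{model}}$, each sine-cosine pair is rotated by an angle that depends only on $k$:
$$\begin{pmatrix} \sin(\omega_i (pos + k)) \\ \cos(\omega_i (pos + k)) \end{pmatrix} = \begin{pmatrix} \cos \omega_i k & \sin \omega_i k \\ -\sin \omega_i k & \cos \omega_i k \end{pmatrix} \begin{pmatrix} \sin(\omega_i\, pos) \\ \cos(\omega_i\, pos) \end{pmatrix}$$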
This means the model can learn to attend to relative positions by learning attention patterns that depend
on the relative phase shifts.
BERT and GPT use learned positional embeddings, treating position as another embedding lookup:
Engineering Pitfall
Learned absolute embeddings have a fixed maximum sequence length. Extrapolation to longer
sequences often fails catastrophically. Models like GPT-2 were trained with $L_{max} = 1024$ and
struggle with longer inputs. This is a fundamental limitation of learned position embeddings.
Figure 4.1: Heatmap visualization of sinusoidal positional encoding. Each row represents a position, and each column
represents a dimension. The wavelength increases with dimension index.
$$e_{ij} = \frac{x_i W^Q (x_j W^K)^T}{\sqrt{d_k}} + b_{j-i} \tag{4.5}$$
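$$R_{\Theta,m} = \begin{pmatrix} \cos m\theta_1 & -\sin m\theta_1 & 0 & 0 & \cdots \\ \sin m\theta_1 & \cos m\theta_1 & 0 & 0 & \cdots \\ 0 & 0 & \cos m\theta_2 & -\sin m\theta_2 & \cdots \\ 0 & 0 & \sin m\theta_2 & \cos m\theta_2 & \cdots \\ \vdots & \vdots & \vdots & \vdots & \ddots \end{pmatrix} \tag{4.6}$$
where $\theta_i = 10000^{-2i/d_{model}}$ are the rotation frequencies.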
$$q_m = R_{\Theta,m} W_q x_m, \quad k_n = R_{\Theta,n} W_k x_n \tag{4.7}$$
The key property of RoPE is that the inner product naturally encodes relative position:
$$q_m^T k_n = x_m^T W_q^T R_{\Theta,n-m} W_k x_n \tag{4.8}$$
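$$R_{\theta,m} = \begin{pmatrix} \cos m\theta & -\sin m\theta \\ \sin m\theta & \cos m\theta \end{pmatrix}$$
$$q_m^T k_n = (R_{\theta,m} q)^T (R_{\theta,n} k) = q^T R_{\theta,m}^T R_{\theta,n} k$$
Since rotation matrices are orthogonal, $R_{\theta,m}^T = R_{\theta,-m}$, so
$$q_m^T k_n = q^T R_{\theta,n-m} k$$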
This shows the inner product depends only on the relative position n − m.
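The relative-position property is easy to confirm numerically in the 2-D case. In this minimal sketch the vectors `q`, `k` and the frequency `theta` are arbitrary example values; the check is that two position pairs with the same offset $n - m = 4$ give the same score, and that both match $q^T R_{\theta,n-m} k$.

```python
import math

def rotate(vec, pos, theta):
    """Apply the 2-D rotation R_{theta,pos} to a 2-D vector."""
    c, s = math.cos(pos * theta), math.sin(pos * theta)
    x, y = vec
    return [c * x - s * y, s * x + c * y]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

q, k, theta = [1.0, 2.0], [0.5, -1.0], 0.1  # example values (assumptions)

score_a = dot(rotate(q, 3, theta), rotate(k, 7, theta))    # m = 3,  n = 7
score_b = dot(rotate(q, 10, theta), rotate(k, 14, theta))  # m = 10, n = 14
score_c = dot(q, rotate(k, 4, theta))                      # q^T R_{theta,4} k
print(score_a, score_b, score_c)
```

All three printed scores agree up to floating-point error, even though the absolute positions differ.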
Key Takeaway
RoPE has become the dominant positional encoding method in modern LLMs (LLaMA, PaLM,
Mistral) because it: (1) naturally handles relative positions, (2) decays attention for distant positions
through rotation, and (3) extrapolates better to longer sequences than absolute encodings.
where m is a head-specific slope. ALiBi enables strong extrapolation to longer sequences than seen during
training.
$$PE'_{pos} = PE_{pos \cdot s} \tag{4.10}$$
where $s = L_{original} / L_{target} < 1$ is the scaling factor. This allows models to handle longer contexts with
minimal fine-tuning.
(4.11)
Loriginal
Chapter 4 Summary
Chapter 5: Training and Scaling Laws
Training large language models at scale requires understanding how model performance scales with
compute, data, and parameters. This chapter covers the empirical laws that govern LLM training.
Figure 5.1: The LLM lifecycle from pre-training through fine-tuning, alignment, and deployment.
The standard pre-training objective for decoder-only models (GPT, LLaMA) is next-token prediction:
$$L = -\sum_{i=1}^{N} \log P(x_i \mid x_1, \ldots, x_{i-1}; \theta) \tag{5.1}$$
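Masked language modeling, used by encoder-only models such as BERT, instead predicts the tokens at a set of masked positions $M$:
$$L_{MLM} = -\sum_{i \in M} \log P(x_i \mid x_{\setminus M}; \theta) \tag{5.2}$$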
T5 uses a text-to-text format where the input can attend bidirectionally and the output is autoregressive:
$$L = -\sum_{i=1}^{N_{target}} \log P(y_i \mid y_1, \ldots, y_{i-1}, x; \theta) \tag{5.3}$$
Kaplan et al. (2020) from OpenAI established the first scaling laws for language models, showing that loss
scales as a power law with model size, data, and compute.
The test loss L follows a power law relationship with model size N :
$$L(N) = \left(\frac{N_c}{N}\right)^{\alpha_N} + L_\infty \tag{5.4}$$
$$L(D) = \left(\frac{D_c}{D}\right)^{\alpha_D} + L_\infty \tag{5.5}$$
where $\alpha_D \approx 0.095$.
The combined scaling law for both model size and data is:
$$L(N, D) = \left[\left(\frac{N_c}{N}\right)^{\alpha_N / \alpha_D} + \frac{D_c}{D}\right]^{\alpha_D} + L_\infty \tag{5.6}$$
Figure 5.2: Scaling laws showing validation loss decreasing as a power law with training compute. Each point represents a
model checkpoint.
$$C \approx 6ND \tag{5.7}$$
Hoffmann et al. (2022) from DeepMind revisited scaling laws and found that previous models were
significantly undertrained.
For a model with N parameters, the optimal number of training tokens is:
$$D_{opt} \approx 20N \tag{5.8}$$
This means a 70B parameter model should be trained on approximately 1.4 trillion tokens for compute-
optimal training.
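The arithmetic behind this figure, combined with the compute estimate $C \approx 6ND$ from Eq. (5.7), can be written as a two-line calculator. This is a minimal sketch; the function names are illustrative, and the factor of 20 tokens per parameter is the rule of thumb stated above rather than an exact constant.

```python
def compute_optimal_tokens(n_params, tokens_per_param=20):
    """Chinchilla rule of thumb: D_opt ~ 20 * N."""
    return tokens_per_param * n_params

def training_flops(n_params, n_tokens):
    """Approximate training compute: C ~ 6 * N * D."""
    return 6 * n_params * n_tokens

n = 70_000_000_000              # 70B parameters
d = compute_optimal_tokens(n)   # tokens for compute-optimal training
c = training_flops(n, d)        # total training FLOPs
print(f"tokens = {d:.2e}, FLOPs = {c:.2e}")
```

For the 70B example this gives 1.4 trillion tokens and roughly $5.9 \times 10^{23}$ training FLOPs.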
Figure 5.3: Chinchilla scaling laws showing optimal model size and training tokens for a given compute budget. The dashed
line shows the optimal trade-off.
Data quality significantly impacts model performance. Key filtering techniques include:
$$J(A, B) = \frac{|A \cap B|}{|A \cup B|} \approx \frac{|\text{minhash}(A) \cap \text{minhash}(B)|}{k} \tag{5.9}$$
where $k$ is the number of hash functions in the MinHash signature.
where ϕ(d) are features like perplexity, readability, and language model scores.
Figure 5.4: Training loss curves for different model sizes showing how larger models converge to lower validation loss.
Chapter 5 Summary
Next-token prediction is the standard pre-training objective: $\mathcal{L} = -\sum \log P(x_i \mid x_{<i})$
Loss scales as a power law with model size and data: $L(N) = (N_c/N)^{\alpha_N} + L_\infty$
Chapter 6: Distributed Training
Training billion-parameter models requires distributing computation across hundreds or thousands of GPUs.
This chapter covers the parallelism strategies and memory optimization techniques that make large-scale
training feasible.
Figure 6.1: Overview of distributed training strategies including data, tensor, and pipeline parallelism.
Data parallelism replicates the model across multiple GPUs, with each GPU processing a different batch of
data.
$$L_i = \text{loss}(\hat{y}_i, y_i)$$
The all-reduce operation aggregates gradients across GPUs. Efficient implementations use ring-allreduce or
tree-based algorithms.
$$\text{all\_reduce}(g_i) = \frac{1}{k} \sum_{j=1}^{k} g_j \tag{6.1}$$
$$\text{Time}_{ring} = \frac{2(n - 1)}{n} \cdot \frac{\text{data\_size}}{\text{bandwidth}}$$
For large n, this approaches 2 ⋅ data_size/bandwidth, which is optimal. The algorithm works
by:
1. Each GPU sends its chunk to the next GPU in the ring
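The ring algorithm can be simulated on a single machine to see why it takes $2(n-1)$ communication steps. The sketch below is a toy simulation with scalar "chunks", not a real collective: it splits the work into the usual reduce-scatter and all-gather phases, and the chunk indexing convention is one common choice among several. Dividing the result by the number of GPUs gives the average in Eq. (6.1).

```python
def ring_allreduce(data):
    """Simulate ring all-reduce (sum). data[r][c] is chunk c held by GPU r,
    with n GPUs and n chunks. Returns every GPU's buffer after both phases."""
    n = len(data)
    buf = [list(row) for row in data]
    # Phase 1, reduce-scatter: after n - 1 steps GPU r holds the full sum
    # of chunk (r + 1) % n.
    for step in range(n - 1):
        msgs = [(r, (r - step) % n, buf[r][(r - step) % n]) for r in range(n)]
        for r, c, val in msgs:            # all sends happen simultaneously
            buf[(r + 1) % n][c] += val
    # Phase 2, all-gather: pass the completed chunks around the ring.
    for step in range(n - 1):
        msgs = [(r, (r + 1 - step) % n, buf[r][(r + 1 - step) % n]) for r in range(n)]
        for r, c, val in msgs:
            buf[(r + 1) % n][c] = val
    return buf

def ring_time(n, data_size, bandwidth):
    """Communication time from the formula above."""
    return 2 * (n - 1) / n * data_size / bandwidth

grads = [[1, 2, 3], [10, 20, 30], [100, 200, 300]]  # 3 GPUs, 3 chunks each
reduced = ring_allreduce(grads)
averaged = [[x / len(grads) for x in row] for row in reduced]
print(reduced)
```

Each GPU sends only one chunk per step, which is why the total traffic per GPU stays close to twice the data size regardless of how many GPUs participate.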
When models exceed single-GPU memory, we must partition the model itself across GPUs.
Tensor parallelism splits individual layers across GPUs. For a linear layer Y = XW :
Megatron-LM implements efficient tensor parallelism for transformer layers by splitting attention heads and
FFN weights.
The main challenge is pipeline bubbles (idle time). Techniques like GPipe and PipeDream reduce bubbles
through micro-batching.
Figure 6.2: Memory requirements per GPU for different parallelism strategies across various model sizes. ZeRO-3 provides
the best scaling.
ZeRO (Zero Redundancy Optimizer) and FSDP (Fully Sharded Data Parallel) optimize memory usage by
partitioning optimizer states, gradients, and parameters.
Total: 16 bytes per parameter. With ZeRO-3, only 16Ψ/k bytes per GPU.
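import torch
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, MixedPrecision
from torch.distributed.fsdp.wrap import size_based_auto_wrap_policy

# Assumes torch.distributed is already initialized and `model` is an nn.Module.
model = FSDP(
    model,
    auto_wrap_policy=size_based_auto_wrap_policy,
    # FSDP expects a MixedPrecision config here, not a bare dtype
    mixed_precision=MixedPrecision(param_dtype=torch.bfloat16),
    device_id=torch.cuda.current_device(),
    limit_all_gathers=True,
)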
Mixed precision uses FP16/BF16 for forward/backward and FP32 for optimizer states:
BF16 is preferred for LLMs due to better numerical stability (same exponent range as FP32).
Flash Attention reduces memory usage and improves speed through IO-aware algorithms:
Chapter 7: Alignment and Fine-tuning
Pre-trained language models require alignment to follow instructions and behave safely. This chapter covers
supervised fine-tuning, RLHF, and alternative alignment methods with full mathematical derivations.
Figure 7.1: Reward model training progress showing accuracy improvement over training steps. Human-level agreement is
typically achieved around 80% accuracy.
SFT adapts pre-trained models to follow instructions using labeled (prompt, response) pairs.
Engineering Pitfall
Overfitting during SFT is common. Typical SFT uses 10-100K examples for 1-3 epochs with
learning rates 10-100x smaller than pre-training (e.g., 1e-5 to 1e-6). Too much SFT can cause
catastrophic forgetting of pre-trained knowledge. Monitor validation loss on held-out tasks to detect
overfitting.
Given pairs $(y_w, y_l)$ where $y_w$ is preferred over $y_l$, we train a reward model using the Bradley-Terry
model:
$$L_R = -E\left[\log \sigma\left(r_\theta(x, y_w) - r_\theta(x, y_l)\right)\right] \tag{7.3}$$
The Bradley-Terry model assumes each item has an underlying "strength" parameter. The probability
that item i beats item j is:
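$$P(i \succ j) = \frac{\pi_i}{\pi_i + \pi_j}$$
Parameterizing the strengths as $\pi_i = e^{r_i}$ gives:
$$P(i \succ j) = \frac{e^{r_i}}{e^{r_i} + e^{r_j}} = \frac{1}{1 + e^{-(r_i - r_j)}} = \sigma(r_i - r_j)$$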
This logistic form is convenient for optimization and naturally handles ties through the difference in
rewards.
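A minimal sketch of the resulting reward-model loss follows. It takes scalar rewards directly instead of a neural network, so the function names and the example reward values are illustrative assumptions; the point is only to show how the loss depends on the reward difference.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def preference_prob(r_i, r_j):
    """Bradley-Terry probability that item i is preferred over item j."""
    return sigmoid(r_i - r_j)

def reward_model_loss(pairs):
    """Mean of -log sigma(r_w - r_l) over (r_w, r_l) reward pairs."""
    return -sum(math.log(sigmoid(r_w - r_l)) for r_w, r_l in pairs) / len(pairs)

good = reward_model_loss([(2.0, 0.0), (1.5, -0.5)])  # preferred responses score higher
bad = reward_model_loss([(0.0, 2.0), (-0.5, 1.5)])   # preferred responses score lower
print(good, bad)
```

When the reward model ranks the preferred response higher, the loss is small; when it ranks the pair the wrong way round, the loss grows roughly linearly with the margin.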
Proximal Policy Optimization (PPO) updates the policy while preventing large deviations from the reference
policy:
where $r_t(\theta) = \dfrac{\pi_\theta(y \mid x)}{\pi_{ref}(y \mid x)}$ is the probability ratio and $\hat{A}_t$ is the advantage.
To prevent reward hacking (exploiting the reward model), a KL penalty is added to the reward:
$$r(x, y) = r_\theta(x, y) - \beta \log \frac{\pi_\theta(y \mid x)}{\pi_{ref}(y \mid x)} \tag{7.5}$$
Taking the expectation over $y \sim \pi_\theta$, the penalty term equals $\beta D_{KL}(\pi_\theta \,\|\, \pi_{ref})$.
Figure 7.2: RLHF training dynamics showing raw reward increasing but KL divergence also increasing. The penalized
reward peaks at the optimal stopping point.
DPO eliminates the need for explicit reward modeling and RL training by directly optimizing from
preferences.
DPO derives a closed-form solution for the optimal policy under the Bradley-Terry model. The optimal RL
policy is:
$$\pi^*(y \mid x) = \frac{1}{Z(x)} \pi_{ref}(y \mid x) \exp\left(\frac{1}{\beta} r(x, y)\right) \tag{7.6}$$
Solving for the reward:
$$r(x, y) = \beta \log \frac{\pi^*(y \mid x)}{\pi_{ref}(y \mid x)} + \beta \log Z(x)$$
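The Bradley-Terry model depends only on the reward difference $r(x, y_w) - r(x, y_l)$, so the partition function $Z(x)$ cancels out! This gives the DPO objective:
$$L_{DPO} = -E\left[\log \sigma\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{ref}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{ref}(y_l \mid x)}\right)\right] \tag{7.7}$$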
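For a single preference pair, the DPO objective reduces to a short function of four sequence log-probabilities. The sketch below is illustrative only: the function name, the default `beta=0.1`, and the example log-probabilities are assumptions, and a real implementation would compute the log-probabilities from the policy and reference models over batches.

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO loss for one pair, given log pi(y|x) under policy and reference."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return math.log1p(math.exp(-margin))  # equals -log(sigmoid(margin))

# Policy identical to the reference: margin is 0, loss is log 2.
at_init = dpo_loss(-12.0, -15.0, -12.0, -15.0)
# Policy has shifted probability toward the preferred response.
improved = dpo_loss(-10.0, -17.0, -12.0, -15.0)
print(at_init, improved)
```

Note that no reward model appears anywhere: the implicit reward is the scaled log-ratio between the policy and the reference.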
Key Takeaway
DPO achieves similar performance to RLHF with simpler training (no reward model, no PPO). The
key insight is that the optimal RL policy under a Bradley-Terry preference model can be expressed
in closed form, allowing direct optimization. DPO is more stable than PPO and requires less
hyperparameter tuning.
Figure 7.3: Training convergence comparison: DPO converges faster than RLHF (PPO) but RLHF is more stable in later
stages.
1. Supervised stage: Model critiques and revises its own responses based on constitutional principles
"Prefer responses that acknowledge uncertainty rather than making things up"
Chapter 7 Summary
RLHF uses human preferences to train a reward model: $L_R = -E[\log \sigma(r_\theta(x, y_w) - r_\theta(x, y_l))]$
Chapter 8: Inference Optimization
Serving LLMs efficiently requires optimizing memory usage, reducing computation, and maximizing
throughput. This chapter covers the key techniques: KV caching, quantization, Flash Attention, and speculative decoding.
Figure 8.1: Overview of inference optimization techniques and their combined impact on latency and throughput.
8.1 KV Cache
The KV cache stores key and value tensors from previous tokens to avoid recomputation during
autoregressive generation.
Without caching, each new token requires computing attention over all previous tokens:
Layers: l = 32
Heads: h = 32
Head dimension: $d_h = 128$
Precision: FP16 = 2 bytes
$\text{Memory}_{KV}$ = 17.2 GB
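The figure above can be reproduced with a small calculator. The sketch assumes the standard accounting of one key and one value vector per head, per layer, per token. The sequence length and batch size behind the 17.2 GB number do not appear in the text above; 32,768 tokens at batch size 1 is an inference that happens to reproduce it.

```python
def kv_cache_bytes(layers, heads, head_dim, seq_len, batch=1, bytes_per_value=2):
    """KV cache size: 2 tensors (K and V) per layer, per head, per token."""
    return 2 * layers * heads * head_dim * seq_len * batch * bytes_per_value

per_token = kv_cache_bytes(32, 32, 128, seq_len=1)
total = kv_cache_bytes(32, 32, 128, seq_len=32768)  # assumed context length
print(f"{per_token / 2**20:.2f} MiB per token, {total / 1e9:.1f} GB total")
```

With these settings each token costs 0.5 MiB of cache, so memory grows linearly with both context length and batch size.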
Figure 8.2: KV cache memory growth with sequence length for different model sizes. Long contexts require significant
memory.
GQA is a middle ground between MHA and MQA, grouping query heads to share keys/values:
8.2 Quantization
$$w_{fake} = s \cdot \text{round}\left(\frac{w}{s}\right) \tag{8.6}$$
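A minimal sketch of Eq. (8.6) for symmetric per-tensor quantization follows. Choosing the scale as `max|w| / qmax` is one common convention and is an assumption here, as are the function name and the example weights.

```python
def fake_quantize(weights, num_bits=8):
    """Symmetric fake quantization: returns (dequantized, integer codes, scale)."""
    qmax = 2 ** (num_bits - 1) - 1               # 127 for INT8
    scale = max(abs(w) for w in weights) / qmax  # assumed scale choice
    if scale == 0:
        return list(weights), [0] * len(weights), 0.0
    codes = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return [scale * q for q in codes], codes, scale

weights = [0.5, -1.27, 0.003, 1.27]  # example values (assumption)
w_fake, codes, scale = fake_quantize(weights)
print(codes, scale)
```

Each weight is reproduced to within half a quantization step, and very small weights such as 0.003 collapse to zero, which is the main source of quantization error.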
Key Takeaway
INT8 quantization typically achieves near-lossless compression for inference. INT4/4-bit methods
(GGUF, AWQ, GPTQ) enable running 70B models on consumer GPUs with acceptable quality
degradation. GPTQ uses second-order information for better quantization.
Flash Attention optimizes the attention computation through IO-aware algorithms and tiling.
$$S = QK^T \in \mathbb{R}^{N \times N}, \quad P = \text{softmax}(S) \in \mathbb{R}^{N \times N} \tag{8.7}$$
Flash Attention uses tiling and online softmax to avoid materializing the full matrix:
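Sequence length | Standard attention memory | Flash Attention memory | Speedup
1K | 4 MB | 0.5 MB | 1.2x
4K | 64 MB | 2 MB | 2.0x
16K | 1 GB | 8 MB | 2.5x
64K | 16 GB | 32 MB | 3.0x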
Initialize: $m = -\infty$, $l = 0$, $O_i = 0$
After each block: $m = m_{new}$, $l = l_{new}$
Key insight: Compute attention in blocks that fit in SRAM (fast memory), avoiding slow HBM reads/writes.
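The online softmax at the heart of this idea can be sketched for a single query row. The code below is a simplification under stated assumptions: scalar values instead of value vectors, a fixed `block_size`, and no actual SRAM/HBM management. It keeps the running maximum `m`, normalizer `l`, and unnormalized output `o`, rescaling the old statistics whenever a new block raises the maximum.

```python
import math

def online_softmax_weighted_sum(scores, values, block_size=2):
    """Compute sum_j softmax(scores)_j * values_j one block at a time."""
    m = float("-inf")  # running max of scores seen so far
    l = 0.0            # running softmax normalizer
    o = 0.0            # running unnormalized output
    for start in range(0, len(scores), block_size):
        s_blk = scores[start:start + block_size]
        v_blk = values[start:start + block_size]
        m_new = max(m, max(s_blk))
        correction = math.exp(m - m_new)  # exp(-inf) = 0.0 on the first block
        p = [math.exp(s - m_new) for s in s_blk]
        l_new = l * correction + sum(p)
        o = o * correction + sum(pi * vi for pi, vi in zip(p, v_blk))
        m, l = m_new, l_new
    return o / l

def naive(scores, values):
    """Reference: materialize the full softmax row first."""
    c = max(scores)
    exps = [math.exp(s - c) for s in scores]
    total = sum(exps)
    return sum(e / total * v for e, v in zip(exps, values))

scores = [0.3, 2.0, -1.0, 4.5, 0.0]  # example values (assumptions)
values = [1.0, 2.0, 3.0, 4.0, 5.0]
blocked = online_softmax_weighted_sum(scores, values)
reference = naive(scores, values)
print(blocked, reference)
```

The blocked result matches the reference even though the full row of softmax weights is never stored at once.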
Speculative decoding accelerates inference by drafting multiple tokens with a small model, then verifying
with the large model.
$$\text{Speedup} \approx \frac{1}{1 - \alpha + \dfrac{c_{draft}}{c_{target}} \cdot K} \tag{8.8}$$
Figure 8.3: Speculative decoding speedup as a function of draft token acceptance rate for different numbers of draft tokens
(k).
Figure 8.4: Latency versus throughput trade-offs with different batch sizes. Larger batches improve throughput but increase
latency per request.
Chapter 8 Summary
Quantization (INT8, INT4) reduces memory 2-8x with minimal quality loss
Chapter 9: RAG, Agents, and Tool Use
Extending LLMs with external knowledge and capabilities enables more accurate, up-to-date, and actionable
responses. This chapter covers retrieval-augmented generation, tool calling, and agent architectures.
RAG grounds LLM responses in external knowledge, reducing hallucinations and enabling access to
information beyond the training data.
$$P(y \mid x) = \sum_{d \in D} P(d \mid x)\, P(y \mid x, d) \tag{9.1}$$
where $D$ is the set of retrieved documents.
Figure 9.2: Comparison of vector database indexing methods showing recall vs embedding dimensions.
{"tool": "weather_api",
"parameters": {
"location": "San Francisco",
"date": "2024-01-15"
}}
Figure 9.3: LLM agent tool calling loop showing the iterative decision-execute-respond cycle.
Thought: I need to find the current weather in Paris to answer the question.
Action: weather_api(location="Paris")
Observation: {"temperature": 15, "condition": "cloudy"}
Thought: Now I have the weather information. I can provide the answer.
Final Answer: The current weather in Paris is 15°C and cloudy.
Chapter 9 Summary
Tool calling enables LLMs to interact with APIs and external systems
Chapter 10: Multimodal LLMs
Extending language models to process and generate multiple modalities (vision, audio) enables richer
interactions and new capabilities. This chapter covers vision-language models and multimodal architectures.
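$$z_0 = [x_{class};\ x_p^1 E;\ x_p^2 E;\ \ldots;\ x_p^N E] + E_{pos} \tag{10.1}$$
where $x_p^i$ is the $i$-th image patch and $E$ is the patch embedding matrix.
$$N = \frac{H \cdot W}{P^2} \tag{10.2}$$
$$x_p^i \in \mathbb{R}^{P^2 \cdot C} \rightarrow x_p^i E \in \mathbb{R}^{D} \tag{10.3}$$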
Vision-language models combine visual encoders with language models for multimodal understanding.
CLIP learns joint representations of images and text through contrastive learning:
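$$L = \frac{1}{2}\left(-\sum_i \log \frac{\exp(\langle I_i, T_i \rangle / \tau)}{\sum_j \exp(\langle I_i, T_j \rangle / \tau)} - \sum_i \log \frac{\exp(\langle I_i, T_i \rangle / \tau)}{\sum_j \exp(\langle I_j, T_i \rangle / \tau)}\right) \tag{10.4}$$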
where τ is a temperature parameter and ⟨I, T ⟩ is the cosine similarity between image and text embeddings.
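A minimal pure-Python sketch of this symmetric contrastive loss, computed as half the sum of an image-to-text term and a text-to-image term. The function name `clip_loss`, the default temperature of 0.07, and the toy similarity matrices are assumptions made for illustration; practical implementations usually average over the batch rather than sum.

```python
import math

def clip_loss(sim, tau=0.07):
    """Symmetric contrastive loss for a square similarity matrix.

    sim[i][j] is the cosine similarity between image i and text j;
    matching pairs sit on the diagonal.
    """
    n = len(sim)
    logits = [[s / tau for s in row] for row in sim]

    def log_softmax_at(values, index):
        # log(exp(values[index]) / sum_j exp(values[j])), computed stably.
        peak = max(values)
        log_norm = peak + math.log(sum(math.exp(v - peak) for v in values))
        return values[index] - log_norm

    # Each image should pick out its own text among all texts (rows).
    image_to_text = -sum(log_softmax_at(logits[i], i) for i in range(n))
    # Each text should pick out its own image among all images (columns).
    text_to_image = -sum(
        log_softmax_at([logits[j][i] for j in range(n)], i) for i in range(n)
    )
    return 0.5 * (image_to_text + text_to_image)

matched = [[1.0, 0.1], [0.2, 1.0]]    # correct pairs score highest
shuffled = [[0.1, 1.0], [1.0, 0.2]]   # wrong pairs score highest
print(clip_loss(matched) < clip_loss(shuffled))  # True
```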
LLaVA connects a vision encoder (CLIP) to an LLM with a simple projection layer:
Image → CLIP Vision Encoder → Projection → LLM Token Space → Language Model
Speech models convert audio to text and vice versa, enabling voice interfaces.
Chapter 10 Summary
Chapter 11: Evaluation and Safety
Evaluating LLMs requires diverse benchmarks covering knowledge, reasoning, coding, and safety. This
chapter covers evaluation methodologies and safety considerations.
Figure 11.1: LLM performance across key evaluation benchmarks comparing GPT-4, GPT-3.5, and Llama-2-70B.
11.1.3 HumanEval
$$\text{Pass@}k = \mathbb{E}_{\text{Problems}}\left[ 1 - \frac{\binom{n-c}{k}}{\binom{n}{k}} \right] \tag{11.1}$$
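Equation 11.1 can be evaluated directly with binomial coefficients. In the sketch below, n is taken to be the number of samples generated per problem and c the number that pass the unit tests, following the usual reading of this estimator; the `(n, c)` pairs are hypothetical.

```python
from math import comb

def pass_at_k(n, c, k):
    """Per-problem estimate 1 - C(n - c, k) / C(n, k).

    n: samples generated, c: samples that pass the tests, k: budget (k <= n).
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k samples fail, giving pass@k = 1.
    return 1.0 - comb(n - c, k) / comb(n, k)

def mean_pass_at_k(results, k):
    """Average over problems; results is a list of (n, c) pairs."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)

# Hypothetical results for three problems, 10 samples each.
results = [(10, 3), (10, 0), (10, 10)]
print(mean_pass_at_k(results, k=1), mean_pass_at_k(results, k=5))
```

Generating n > k samples and applying this formula gives a lower-variance estimate than drawing exactly k samples and checking whether any passes.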
11.2 LLM-as-a-Judge
Using LLMs to evaluate other LLMs addresses scalability challenges of human evaluation.
Engineering Pitfall
LLM-as-a-Judge can be unreliable for nuanced reasoning tasks. Always validate against human
judgments and be aware of known biases such as position, verbosity, and self-preference bias. GPT-4 as a
judge agrees with human preferences in roughly 80% of pairwise comparisons, but this varies significantly by task domain.
Red teaming involves actively trying to make models produce harmful outputs to identify vulnerabilities.
Chapter 11 Summary
Chapter 12: Reasoning and Advanced Topics
Advancing beyond pattern matching to genuine reasoning is a key frontier in LLM research. This chapter
covers reasoning techniques, test-time compute, and open research problems.
Q: A juggler has 16 balls. Half are golf balls and half are tennis balls.
If 3 golf balls are removed, how many golf balls remain?
A: Let's think step by step.
12.1.2 Self-Consistency
Generate multiple CoT reasoning paths and take the majority vote:
$$\hat{y} = \arg\max_{y} \sum_{i=1}^{k} \mathbb{1}[y_i = y] \tag{12.1}$$
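The majority vote in Equation 12.1 amounts to counting final answers. The sketch below assumes the final answer has already been extracted from each sampled reasoning path; the sampled values are hypothetical and loosely follow the juggler question above.

```python
from collections import Counter

def self_consistency(answers):
    """Return the most frequent final answer and its share of the votes."""
    if not answers:
        raise ValueError("need at least one sampled answer")
    counts = Counter(answers)
    best, votes = counts.most_common(1)[0]
    return best, votes / len(answers)

# Hypothetical final answers extracted from k = 5 sampled chains of thought.
sampled = ["5", "5", "8", "5", "13"]
print(self_consistency(sampled))  # ('5', 0.6)
```

The vote share can serve as a rough confidence signal: a low share suggests the sampled reasoning paths disagree.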
Increasing computation at inference time can improve performance without training larger models.
Mixture of Experts (MoE) scales model capacity without proportional compute increase.
$$y = \sum_{i=1}^{N} G(x)_i \cdot E_i(x) \tag{12.2}$$
where $G$ is the gating network and $E_i$ are the expert networks. Only the top-k experts are activated.
$$L_{\text{aux}} = \alpha \cdot N \cdot \sum_{i=1}^{N} f_i \cdot P_i \tag{12.3}$$
where $f_i$ is the fraction of tokens routed to expert $i$ and $P_i$ is the average routing probability assigned to expert $i$.
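A small pure-Python sketch of top-k gating (Equation 12.2) and the auxiliary load-balancing loss (Equation 12.3). Several details are assumptions rather than statements from the text: the gate renormalizes the top-k probabilities, the routed fraction is computed with top-1 routing, alpha defaults to 0.01, and the scalar "experts" are toy functions.

```python
import math

def softmax(values):
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_gate(router_logits, k):
    """Keep the k largest gate probabilities, renormalize them, zero the rest."""
    probs = softmax(router_logits)
    keep = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    mass = sum(probs[i] for i in keep)
    return [probs[i] / mass if i in keep else 0.0 for i in range(len(probs))]

def moe_output(x, router_logits, experts, k):
    """y = sum_i G(x)_i * E_i(x), evaluating only the selected experts."""
    gates = top_k_gate(router_logits, k)
    return sum(g * experts[i](x) for i, g in enumerate(gates) if g > 0.0)

def load_balancing_loss(batch_router_logits, alpha=0.01):
    """L_aux = alpha * N * sum_i f_i * P_i, with f_i from top-1 routing."""
    n_experts = len(batch_router_logits[0])
    n_tokens = len(batch_router_logits)
    f = [0.0] * n_experts   # fraction of tokens routed to each expert
    p = [0.0] * n_experts   # mean routing probability per expert
    for logits in batch_router_logits:
        probs = softmax(logits)
        f[probs.index(max(probs))] += 1.0 / n_tokens
        for i, prob in enumerate(probs):
            p[i] += prob / n_tokens
    return alpha * n_experts * sum(fi * pi for fi, pi in zip(f, p))

# Hypothetical scalar experts and router logits for a single token.
experts = [lambda x: 2 * x, lambda x: x + 1, lambda x: -x]
print(top_k_gate([2.0, 1.0, -1.0], k=2))
print(moe_output(3.0, [2.0, 1.0, -1.0], experts, k=2))

balanced = load_balancing_loss([[5.0, 0.0], [0.0, 5.0]])
collapsed = load_balancing_loss([[5.0, 0.0], [5.0, 0.0]])
print(balanced, collapsed)
```

The loss is smallest when tokens spread evenly over experts and grows when the router collapses onto one expert, which is the behavior the auxiliary term is meant to discourage.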
Figure 12.2: Growth of large language model sizes over time, showing the rapid increase in parameters from 2018 to 2024.
Research Insight
The field is rapidly evolving. Techniques that were state-of-the-art 6 months ago may be obsolete
today. Key trends include: (1) increased focus on reasoning over scaling, (2) test-time compute as a
new scaling dimension, (3) multimodal as default, and (4) efficiency enabling broader access.
Chapter 12 Summary
Appendix: Worked Examples
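Compute: $Q = K = V = X = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix}$
Attention scores: $S = QK^T = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix}$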
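Forward: $y = Wx + b$
Loss: $L = \frac{1}{2}\lVert y - t \rVert^2$
Gradients:
$\frac{\partial L}{\partial y} = y - t$
$\frac{\partial L}{\partial W} = (y - t)\,x^T$
$\frac{\partial L}{\partial b} = y - t$
$\frac{\partial L}{\partial x} = W^T (y - t)$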
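These gradient formulas can be sanity-checked against a finite-difference estimate. The weights, bias, input, and target below are hypothetical values chosen only for the check.

```python
def forward(W, b, x):
    """y = W x + b for a weight matrix W given as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]

def loss(W, b, x, t):
    """L = 0.5 * ||y - t||^2."""
    return 0.5 * sum((yi - ti) ** 2 for yi, ti in zip(forward(W, b, x), t))

def gradients(W, b, x, t):
    """Analytic gradients of the squared-error loss for a linear layer."""
    delta = [yi - ti for yi, ti in zip(forward(W, b, x), t)]   # dL/dy = y - t
    dW = [[d * xi for xi in x] for d in delta]                 # (y - t) x^T
    db = list(delta)                                           # y - t
    dx = [sum(W[i][j] * delta[i] for i in range(len(W)))       # W^T (y - t)
          for j in range(len(x))]
    return dW, db, dx

# Hypothetical small example used only to check the formulas numerically.
W = [[1.0, 2.0], [0.5, -1.0]]
b = [0.1, -0.2]
x = [1.0, 3.0]
t = [2.0, 1.0]
dW, db, dx = gradients(W, b, x, t)

# Finite-difference estimate of dL/dW[0][1] for comparison.
eps = 1e-6
W_plus = [row[:] for row in W]
W_plus[0][1] += eps
numeric = (loss(W_plus, b, x, t) - loss(W, b, x, t)) / eps
print(dW[0][1], numeric)
```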
Question: What perplexity would a 70B model trained on 1.4T tokens achieve?
$L = L_\infty + (N_c / 70\text{B})^{0.076} + (D_c / 1400\text{B})^{0.095}$
Memory breakdown:
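Given: Reference policy $\pi_{\text{ref}}(y_w \mid x) = 0.3$, $\pi_{\text{ref}}(y_l \mid x) = 0.2$
With $\beta = 0.1$:
$r_w = \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} = 0.1 \times \log \frac{0.5}{0.3} = 0.1 \times 0.51 = 0.051$
$r_l = \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)} = 0.1 \times (-0.29) = -0.029$
$L = -\log \sigma(r_w - r_l) = -\log \sigma(0.08) = -\log(0.52) = 0.65$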
Layers: 80
Heads: 64
Head dimension: 128
= 2 × 8192 × 80 × 8 × 128 × 2 bytes
= 2,684,354,560 bytes ≈ 2.68 GB (2.5 GiB)
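The product above can be wrapped in a small helper. The reading of each factor is an assumption, since the surviving text does not label them: 2 tensors (keys and values), a sequence length of 8192 tokens, 80 layers, 8 key/value heads (as with grouped-query attention), a head dimension of 128, and 2 bytes per FP16 value.

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim,
                   bytes_per_value=2, batch_size=1):
    """Bytes needed to cache keys and values for every layer and position."""
    return 2 * batch_size * seq_len * n_layers * n_kv_heads * head_dim * bytes_per_value

# Assumed interpretation of the factors in the product above.
total = kv_cache_bytes(seq_len=8192, n_layers=80, n_kv_heads=8, head_dim=128)
print(total, "bytes =", total / 1e9, "GB =", total / 2**30, "GiB")

# For comparison: caching all 64 heads instead of 8 key/value heads.
full = kv_cache_bytes(seq_len=8192, n_layers=80, n_kv_heads=64, head_dim=128)
print(full / 2**30, "GiB")
```

Under these assumptions, sharing key/value heads across query heads shrinks the cache by the ratio of query heads to key/value heads, here a factor of 8.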
T = 1.0:
T = 0.5:
$P = [e^4, e^2, e^1] / (e^4 + e^2 + e^1)$
T = 2.0:
Top-k with k=2: Sample from [0.4, 0.3] (renormalized to [0.57, 0.43])
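A short sketch of temperature scaling and top-k renormalization. The logits [2.0, 1.0, 0.5] are an assumption inferred from the T = 0.5 line (dividing them by 0.5 gives exponents 4, 2, and 1), and the four-token distribution passed to the top-k step is hypothetical apart from its two largest entries.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Softmax of logits / T; T < 1 sharpens and T > 1 flattens the distribution."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_renormalize(probs, k):
    """Keep the k most likely tokens and rescale them to sum to 1."""
    keep = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    mass = sum(probs[i] for i in keep)
    return {i: probs[i] / mass for i in keep}

# Assumed logits; see the note above.
logits = [2.0, 1.0, 0.5]
for T in (1.0, 0.5, 2.0):
    print(T, [round(p, 3) for p in softmax_with_temperature(logits, T)])

# Assumed full distribution whose two largest entries are 0.4 and 0.3.
top = top_k_renormalize([0.4, 0.3, 0.2, 0.1], k=2)
print({i: round(p, 2) for i, p in top.items()})
```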
Step 2: For "The": logits [1.5, 1.0, 0.5] → probs [0.506, 0.307, 0.186]
For "A": logits [1.0, 0.8, 0.3] → probs [0.432, 0.354, 0.214]
Document embeddings:
Cosine similarity:
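$\text{sim}(q, d_1) = \frac{0.5(0.6) + 0.3(0.2) + 0.8(0.7)}{\lVert q \rVert \lVert d_1 \rVert} = \frac{0.92}{0.990 \times 0.943} \approx 0.985$
$\text{sim}(q, d_2) = \frac{0.5(0.1) + 0.3(0.9) + 0.8(0.3)}{\lVert q \rVert \lVert d_2 \rVert} = \frac{0.56}{0.990 \times 0.954} \approx 0.593$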
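The retrieval ranking can be recomputed directly from the vectors. The query and document embeddings below are read off the component products in the numerators above; the helper name and the ranking step are illustrative.

```python
import math

def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their Euclidean norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Vectors inferred from the numerators of the two similarity calculations.
q = [0.5, 0.3, 0.8]
docs = {"d1": [0.6, 0.2, 0.7], "d2": [0.1, 0.9, 0.3]}

# Rank documents by similarity to the query, most similar first.
ranked = sorted(docs, key=lambda name: cosine_similarity(q, docs[name]), reverse=True)
for name in ranked:
    print(name, round(cosine_similarity(q, docs[name]), 3))
```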
Model probabilities:
$-\frac{1}{3}(\log 0.1 + \log 0.05 + \log 0.2) = -\frac{1}{3}(-2.30 - 3.00 - 1.61) = 2.30$
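Exponentiating this average negative log-likelihood gives the perplexity. A minimal sketch, using the three token probabilities that appear in the calculation above:

```python
import math

def perplexity(token_probs):
    """exp of the average negative log-likelihood of the observed tokens."""
    nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(nll)

probs = [0.1, 0.05, 0.2]   # probabilities appearing in the calculation above
avg_nll = -sum(math.log(p) for p in probs) / len(probs)
print(round(avg_nll, 2), round(perplexity(probs), 2))  # 2.3 10.0
```

A perplexity of 10 means the model is, on average, as uncertain as if it were choosing uniformly among 10 tokens at each position.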
Glossary
Attention: A mechanism that computes weighted representations by comparing queries against keys to
determine relevance. The core operation is $\text{softmax}(QK^T / \sqrt{d_k})\,V$.
Autoregressive: A model that generates output one token at a time, conditioning each new token on
previously generated tokens. Used in GPT-style decoder-only models.
Backpropagation: Algorithm for computing gradients of the loss with respect to model parameters by
applying the chain rule through the computation graph.
BERT: Bidirectional Encoder Representations from Transformers; encoder-only model pre-trained with
masked language modeling.
BPE: Byte Pair Encoding; subword tokenization algorithm that iteratively merges the most frequent pair of
adjacent symbols to build a vocabulary.
Chain-of-Thought: Prompting technique that elicits step-by-step reasoning from language models by
providing examples of intermediate reasoning steps.
Context Window: The maximum sequence length a model can process at once. Limited by memory and
positional encoding method.
Cross-Entropy: Loss function measuring the difference between predicted and true probability distributions:
$H(P, Q) = -\sum_x P(x) \log Q(x)$.
Decoder: The part of a transformer that generates output autoregressively; used in GPT-style models with
causal attention.
DPO: Direct Preference Optimization; alignment method that directly optimizes from preferences without
reward modeling.
Embedding: A dense vector representation of discrete tokens in a continuous space, typically learned during
training.
Encoder: The part of a transformer that processes input bidirectionally; used in BERT-style models.
FSDP: Fully Sharded Data Parallel; PyTorch's ZeRO-style data parallelism that shards parameters, gradients, and optimizer states across workers.
Gradient Checkpointing: Memory optimization technique that recomputes activations during backward
pass instead of storing them.
Hallucination: Generation of plausible but factually incorrect or unsupported content by a language model.
KV Cache: Storage of key and value tensors from previous tokens to avoid recomputation during
autoregressive inference.
LLM: Large Language Model; neural network with billions of parameters trained on vast text corpora.
LoRA: Low-Rank Adaptation; parameter-efficient fine-tuning method using low-rank matrix decomposition.
Masking: Preventing attention to certain positions, typically future tokens in autoregressive models using a
causal mask.
MoE: Mixture of Experts; architecture using sparse activation of specialized sub-networks to scale model
capacity.
Multi-Head Attention: Parallel attention computations with different learned projections, allowing attention
to different representation subspaces.
Perplexity: Metric for language model quality; exponential of average negative log-likelihood:
$\text{PPL} = \exp\left(-\frac{1}{N} \sum_i \log P(x_i)\right)$.
Positional Encoding: Method to inject sequence position information into transformer models, which are
otherwise permutation-invariant.
PPO: Proximal Policy Optimization; reinforcement learning algorithm used in RLHF that limits each policy
update with a clipped surrogate objective, approximating a trust-region constraint.
Quantization: Reducing numerical precision of model weights to decrease memory and increase speed, e.g.,
FP16 → INT8.
RAG: Retrieval-Augmented Generation; augmenting LLMs with external knowledge retrieval to reduce
hallucination.
RLHF: Reinforcement Learning from Human Feedback; alignment method using human preferences to train
a reward model and optimize policy.
RoPE: Rotary Position Embedding; relative positional encoding using rotation matrices that encode relative
position naturally.
Self-Attention: Attention mechanism where queries, keys, and values come from the same sequence.
SFT: Supervised Fine-Tuning; adapting pre-trained models with labeled instruction-response pairs.
Token: Discrete unit of text (word, subword, or character) processed by language models.
Tokenization: Process of converting text into discrete tokens that can be processed by neural networks.
Transformer: Neural network architecture based on attention mechanisms, replacing recurrence with
parallelizable self-attention.
ZeRO: Zero Redundancy Optimizer; memory optimization technique for distributed training that partitions
optimizer states, gradients, and parameters.
References
1. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., & Polosukhin, I. (2017).
Attention is all you need. Advances in Neural Information Processing Systems, 30, 5998-6008.
2. Radford, A., Narasimhan, K., Salimans, T., & Sutskever, I. (2018). Improving language understanding by generative
pre-training. OpenAI Technical Report.
3. Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., & Sutskever, I. (2019). Language models are unsupervised
multitask learners. OpenAI Blog.
4. Brown, T., Mann, B., Ryder, N., et al. (2020). Language models are few-shot learners. Advances in Neural Information
Processing Systems, 33, 1877-1901.
5. Kaplan, J., McCandlish, S., Henighan, T., et al. (2020). Scaling laws for neural language models. arXiv preprint
arXiv:2001.08361.
6. Raffel, C., Shazeer, N., Roberts, A., et al. (2020). Exploring the limits of transfer learning with a unified text-to-text
transformer. Journal of Machine Learning Research, 21(140), 1-67.
7. Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of deep bidirectional transformers for
language understanding. Proceedings of NAACL-HLT, 4171-4186.
8. Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient estimation of word representations in vector space.
ICLR Workshop.
9. Pennington, J., Socher, R., & Manning, C. D. (2014). GloVe: Global vectors for word representation. Proceedings of
EMNLP, 1532-1543.
10. Hoffmann, J., Borgeaud, S., Mensch, A., et al. (2022). Training compute-optimal large language models. arXiv preprint
arXiv:2203.15556.
11. Ouyang, L., Wu, J., Jiang, X., et al. (2022). Training language models to follow instructions with human feedback.
Advances in Neural Information Processing Systems, 35, 27730-27744.
12. Rafailov, R., Sharma, A., Mitchell, E., Ermon, S., Manning, C. D., & Finn, C. (2023). Direct preference optimization:
Your language model is secretly a reward model. Advances in Neural Information Processing Systems, 36.
13. Bai, Y., Kadavath, S., Kundu, S., et al. (2022). Constitutional AI: Harmlessness from AI feedback. arXiv preprint
arXiv:2212.08073.
14. Touvron, H., Lavril, T., Izacard, G., et al. (2023). LLaMA: Open and efficient foundation language models. arXiv
preprint arXiv:2302.13971.
15. Touvron, H., Martin, L., Stone, K., et al. (2023). Llama 2: Open foundation and fine-tuned chat models. arXiv preprint
arXiv:2307.09288.
16. Su, J., Lu, Y., Pan, S., Murtadha, A., Wen, B., & Liu, Y. (2024). RoFormer: Enhanced transformer with rotary position
embedding. Neurocomputing, 568, 127063.
17. Press, O., Smith, N. A., & Lewis, M. (2022). Train short, test long: Attention with linear biases enables input length
extrapolation. ICLR.
18. Dao, T., Fu, D. Y., Ermon, S., Rudra, A., & Ré, C. (2022). FlashAttention: Fast and memory-efficient exact attention
with IO-awareness. Advances in Neural Information Processing Systems, 35, 16344-16359.
19. Lewis, P., Perez, E., Piktus, A., et al. (2020). Retrieval-augmented generation for knowledge-intensive NLP tasks.
Advances in Neural Information Processing Systems, 33, 9459-9474.
20. Yao, S., Zhao, J., Yu, D., et al. (2023). ReAct: Synergizing reasoning and acting in language models. ICLR.
21. Wei, J., Wang, X., Schuurmans, D., et al. (2022). Chain-of-thought prompting elicits reasoning in large language
models. Advances in Neural Information Processing Systems, 35, 24824-24837.
23. Rajbhandari, S., Rasley, J., Ruwase, O., & He, Y. (2020). ZeRO: Memory optimizations toward training trillion
parameter models. SC20: International Conference for High Performance Computing, 1-16.
24. Shoeybi, M., Patwary, M., Puri, R., et al. (2019). Megatron-LM: Training multi-billion parameter language models using
model parallelism. arXiv preprint arXiv:1909.08053.
25. Dettmers, T., Lewis, M., Belkada, Y., & Zettlemoyer, L. (2022). LLM.int8(): 8-bit matrix multiplication for transformers
at scale. Advances in Neural Information Processing Systems, 35, 30318-30332.
26. Lin, J., Tang, J., Tang, H., et al. (2024). AWQ: Activation-aware weight quantization for LLM compression and
acceleration. MLSys.
27. Leviathan, Y., Kalman, M., & Matias, Y. (2023). Fast inference from transformers via speculative decoding. ICML,
19274-19286.
28. Ainslie, J., Lee-Thorp, J., de Jong, M., et al. (2023). GQA: Training generalized multi-query transformer models from
multi-head checkpoints. EMNLP.
29. Fedus, W., Zoph, B., & Shazeer, N. (2022). Switch transformers: Scaling to trillion parameter models with simple and
efficient sparsity. Journal of Machine Learning Research, 23(120), 1-39.
30. Hendrycks, D., Burns, C., Basart, S., et al. (2021). Measuring massive multitask language understanding. ICLR.
31. Cobbe, K., Kosaraju, V., Bavarian, M., et al. (2021). Training verifiers to solve math word problems. arXiv preprint
arXiv:2110.14168.
32. Chen, M., Tworek, J., Jun, H., et al. (2021). Evaluating large language models trained on code. arXiv preprint
arXiv:2107.03374.
33. Zellers, R., Holtzman, A., Bisk, Y., et al. (2019). HellaSwag: Can a machine really finish your sentence? ACL, 4791-4800.
34. Lin, S., Hilton, J., & Evans, O. (2022). TruthfulQA: Measuring how models mimic human falsehoods. ACL, 3214-3252.
35. Suzgun, M., Scales, N., Schärli, N., et al. (2023). Challenging BIG-Bench tasks and whether chain-of-thought can
solve them. ACL Findings, 13003-13051.
36. Zheng, L., Chiang, W. L., Sheng, Y., et al. (2023). Judging LLM-as-a-judge with MT-Bench and chatbot arena.
Advances in Neural Information Processing Systems, 36.
37. Radford, A., Kim, J. W., Hallacy, C., et al. (2021). Learning transferable visual models from natural language
supervision. ICML, 8748-8763.
38. Liu, H., Li, C., Wu, Q., & Lee, Y. J. (2024). Visual instruction tuning. Advances in Neural Information Processing
Systems, 36.
39. Radford, A., Kim, J. W., Xu, T., et al. (2023). Robust speech recognition via large-scale weak supervision. ICML,
28492-28518.
40. Yao, S., Yu, D., Zhao, J., et al. (2024). Tree of thoughts: Deliberate problem solving with large language models.
Advances in Neural Information Processing Systems, 36.
Amer Hussein
Connect:
LinkedIn: [Link]/in/amer-hussein/