LLM Handbook

The LLM Handbook provides a comprehensive overview of foundational theories and advanced research related to large language models. It covers essential topics such as mathematical foundations, tokenization, transformer architecture, and positional encodings. The expanded edition includes full mathematical derivations to support the concepts discussed.


The LLM Handbook

From Foundational Theory to Advanced Research and Production Deployment

Mathematical Foundations • Architecture • Training • Alignment • Inference • Applications

Amer Hussein
Expanded Edition with Full Mathematical Derivations

Linkedin: [Link]/in/amer-hussein/
Table of Contents

Chapter 1: Mathematical Foundations

1.1 Probability Theory

1.2 Information Theory

1.3 Optimization Theory

1.4 Linear Algebra for Deep Learning

1.5 Numerical Stability

Chapter 2: Tokenization and Embeddings

2.1 Text Tokenization Algorithms

2.2 Word Embeddings

2.3 Contextual Embeddings

2.4 Embedding Arithmetic

Chapter 3: Transformer Architecture

3.1 Self-Attention Mechanism

3.2 Multi-Head Attention

3.3 Feed-Forward Networks

3.4 Layer Normalization

3.5 Architecture Variants

Chapter 4: Positional Encodings

4.1 Absolute Positional Encodings

4.2 Relative Positional Encodings

4.3 Rotary Position Embeddings (RoPE)

4.4 ALiBi and Long Context Methods


Chapter 5: Training and Scaling Laws

5.1 Pre-training Objectives

5.2 Scaling Laws Derivation

5.3 Chinchilla Scaling

5.4 Training at Scale

Chapter 6: Distributed Training

6.1 Data Parallelism

6.2 Tensor and Pipeline Parallelism

6.3 ZeRO and FSDP

6.4 Memory Optimization

Chapter 7: Alignment and Fine-tuning

7.1 Supervised Fine-tuning (SFT)

7.2 Reinforcement Learning from Human Feedback (RLHF)

7.3 Direct Preference Optimization (DPO)

7.4 Constitutional AI and RLAIF

Chapter 8: Inference Optimization

8.1 KV Cache

8.2 Quantization

8.3 Flash Attention

8.4 Speculative Decoding

Chapter 9: RAG, Agents, and Tool Use

9.1 Retrieval-Augmented Generation

9.2 Tool Calling

9.3 LLM Agents

9.4 ReAct and Chain-of-Thought


Chapter 10: Multimodal LLMs

10.1 Vision Transformers (ViT)

10.2 Vision-Language Models

10.3 Audio and Speech

Chapter 11: Evaluation and Safety

11.1 Evaluation Benchmarks

11.2 LLM-as-a-Judge

11.3 Safety and Alignment

11.4 Red Teaming

Chapter 12: Reasoning and Advanced Topics

12.1 Chain of Thought

12.2 Test-Time Compute

12.3 Mixture of Experts

12.4 Open Research Problems

Glossary

References
Chapter 1: Mathematical Foundations

Understanding Large Language Models requires a solid foundation in several areas of mathematics. This
chapter provides the essential mathematical background needed to comprehend the algorithms, training
procedures, and theoretical underpinnings of modern LLMs. We present both the key concepts and detailed
derivations that appear throughout the field.

1.1 Probability Theory

Language models are fundamentally probabilistic models. They learn to predict the probability distribution
over possible next tokens given a sequence of previous tokens. This section reviews the essential concepts
from probability theory that underpin language modeling.

1.1.1 Basic Probability

A probability distribution $P$ over a discrete set $X$ assigns a value $P(x) \in [0, 1]$ to each $x \in X$ such that:

$$\sum_{x \in X} P(x) = 1 \tag{1.1}$$

Conditional Probability

The conditional probability of event A given event B is defined as:

$$P(A \mid B) = \frac{P(A \cap B)}{P(B)} = \frac{P(A, B)}{P(B)} \tag{1.2}$$

where $P(B) > 0$. This leads to the chain rule: $P(A, B) = P(A \mid B)P(B) = P(B \mid A)P(A)$.

1.1.2 Chain Rule for Sequences

In language modeling, we care about the conditional probability of the next token given previous tokens. For a sequence of tokens $x_1, x_2, \ldots, x_n$, the joint probability can be factorized using the chain rule:

$$P(x_1, x_2, \ldots, x_n) = \prod_{i=1}^{n} P(x_i \mid x_1, \ldots, x_{i-1}) \tag{1.3}$$

Derivation of the Chain Rule

Starting from the definition of conditional probability:

$$P(x_1, x_2) = P(x_2 \mid x_1)P(x_1)$$

Extending to three variables:

$$P(x_1, x_2, x_3) = P(x_3 \mid x_1, x_2)P(x_1, x_2) = P(x_3 \mid x_1, x_2)P(x_2 \mid x_1)P(x_1)$$

By induction, for $n$ variables:

$$P(x_1, \ldots, x_n) = P(x_n \mid x_{<n})P(x_{<n}) = \prod_{i=1}^{n} P(x_i \mid x_{<i})$$

where $x_{<i} = (x_1, \ldots, x_{i-1})$ denotes all tokens before position $i$.
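The chain-rule factorization can be illustrated with a small sketch. The bigram table below is hypothetical and uses a simplifying assumption that the source does not make: each conditional depends only on the previous token. The function names are illustrative, not from any library.

```python
import math

# Hypothetical toy bigram model: P(next | previous). "<s>" marks the sequence start.
BIGRAM = {
    "<s>": {"the": 0.6, "a": 0.4},
    "the": {"cat": 0.5, "dog": 0.5},
    "a": {"cat": 0.3, "dog": 0.7},
    "cat": {"sat": 1.0},
    "dog": {"sat": 1.0},
}

def sequence_log_prob(tokens):
    """Sum log P(x_i | x_{i-1}) over the sequence (chain rule with bigram context)."""
    log_p = 0.0
    prev = "<s>"
    for tok in tokens:
        log_p += math.log(BIGRAM[prev][tok])
        prev = tok
    return log_p

log_p = sequence_log_prob(["the", "cat", "sat"])
prob = math.exp(log_p)  # product of the three conditionals: 0.6 * 0.5 * 1.0
```

Working in log space turns the product into a sum, which avoids underflow for long sequences.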

1.1.3 Maximum Likelihood Estimation

Language models are trained using maximum likelihood estimation (MLE). Given a dataset $\mathcal{D}$ of sequences, we find parameters $\theta$ that maximize the likelihood of the data:

$$\theta_{MLE} = \arg\max_{\theta} P(\mathcal{D}; \theta) = \arg\max_{\theta} \prod_{x \in \mathcal{D}} P(x; \theta) \tag{1.4}$$

For computational convenience, we maximize the log-likelihood:

$$\mathcal{L}(\theta) = \sum_{x \in \mathcal{D}} \log P(x; \theta) = \sum_{x \in \mathcal{D}} \sum_{i=1}^{|x|} \log P(x_i \mid x_{<i}; \theta) \tag{1.5}$$

Key Takeaway

Next-token prediction is equivalent to maximizing the log-likelihood of the training data. This is why the cross-entropy loss is the standard training objective for language models. The equivalence follows from:

$$\mathcal{L}_{CE} = -\frac{1}{N} \sum_{i=1}^{N} \log P(x_i^{true} \mid x_{<i}; \theta)$$

which is the negative log-likelihood averaged over the $N$ training tokens, so minimizing $\mathcal{L}_{CE}$ maximizes $\mathcal{L}(\theta)$.

1.1.4 Bayes' Theorem

Bayes' theorem provides a way to update beliefs given evidence:

$$P(A \mid B) = \frac{P(B \mid A)P(A)}{P(B)} \tag{1.6}$$

In the context of language modeling, this can be used for posterior inference:

$$P(\theta \mid \mathcal{D}) = \frac{P(\mathcal{D} \mid \theta)P(\theta)}{P(\mathcal{D})} \tag{1.7}$$

where $P(\theta)$ is the prior over parameters, $P(\mathcal{D} \mid \theta)$ is the likelihood, and $P(\theta \mid \mathcal{D})$ is the posterior.

1.2 Information Theory

Information theory provides the theoretical framework for understanding compression, prediction, and the fundamental limits of language modeling. Developed by Claude Shannon, it quantifies the amount of information in random variables.

1.2.1 Entropy

The entropy of a discrete random variable $X$ with distribution $P$ measures the average uncertainty:

$$H(X) = -\sum_{x} P(x) \log P(x) = \mathbb{E}_P[-\log P(X)] \tag{1.8}$$

Entropy is measured in bits when using base-2 logarithms. It represents the minimum average number of bits
needed to encode outcomes from the distribution.

Properties of Entropy

Non-negativity: $H(X) \geq 0$

Proof: Since $0 \leq P(x) \leq 1$, we have $-\log P(x) \geq 0$, thus $H(X) \geq 0$.

Maximum entropy: For a distribution with $k$ outcomes, $H(X) \leq \log k$

Proof using Jensen's inequality: Since $\log$ is concave,

$$H(X) = \mathbb{E}\left[\log \frac{1}{P(X)}\right] \leq \log \mathbb{E}\left[\frac{1}{P(X)}\right] = \log \sum_{x:\,P(x)>0} P(x) \cdot \frac{1}{P(x)} \leq \log k$$

Equality holds for the uniform distribution $P(x) = 1/k$.
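Both properties can be checked numerically. The following is a minimal sketch; the `entropy` helper and the two example distributions are illustrative choices, not part of the original text.

```python
import math

def entropy(p, base=2.0):
    """Shannon entropy of a discrete distribution; zero-probability terms contribute 0."""
    return -sum(pi * math.log(pi, base) for pi in p if pi > 0)

uniform = [0.25, 0.25, 0.25, 0.25]
skewed = [0.7, 0.1, 0.1, 0.1]

h_uniform = entropy(uniform)  # attains the upper bound log2(4) = 2 bits
h_skewed = entropy(skewed)    # strictly below the bound, and non-negative
```

With base-2 logarithms the result is in bits, matching the coding interpretation above.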

1.2.2 Cross-Entropy

The cross-entropy between a true distribution P and model distribution Q measures the average number of
bits needed to encode samples from P using a code optimized for Q:

$$H(P, Q) = -\sum_{x} P(x) \log Q(x) = \mathbb{E}_P[-\log Q(X)] \tag{1.9}$$

For language modeling, $P$ is the empirical distribution from training data (one-hot at the true token), and $Q$ is the model's predicted distribution. The cross-entropy loss becomes:

$$\mathcal{L}_{CE} = -\log Q(x_{true}) \tag{1.10}$$

1.2.3 KL Divergence

The Kullback-Leibler (KL) divergence measures the difference between two probability distributions:

$$D_{KL}(P \| Q) = \sum_{x} P(x) \log \frac{P(x)}{Q(x)} = \mathbb{E}_P\left[\log \frac{P(X)}{Q(X)}\right] \tag{1.11}$$

Relationship Between Cross-Entropy, Entropy, and KL Divergence

We can decompose cross-entropy as:

$$H(P, Q) = -\sum_{x} P(x) \log Q(x) = -\sum_{x} P(x) \log P(x) + \sum_{x} P(x) \log \frac{P(x)}{Q(x)}$$

$$H(P, Q) = H(P) + D_{KL}(P \| Q)$$

This shows that minimizing cross-entropy is equivalent to minimizing KL divergence (since $H(P)$ is constant with respect to $Q$).

Gibbs' inequality: $D_{KL}(P \| Q) \geq 0$ with equality iff $P = Q$.
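The decomposition can be verified on a small example. This is a sketch under the assumption that $Q(x) > 0$ wherever $P(x) > 0$; the distributions `p` and `q` are arbitrary illustrative values.

```python
import math

def entropy(p):
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

def kl_divergence(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.3, 0.2]  # "true" distribution (illustrative)
q = [0.4, 0.4, 0.2]  # model distribution (illustrative)

# H(P, Q) - (H(P) + KL(P || Q)) should vanish up to floating-point error.
gap = cross_entropy(p, q) - (entropy(p) + kl_divergence(p, q))
```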

1.2.4 Perplexity

Perplexity is the standard metric for language model evaluation, defined as the exponential of cross-entropy:

$$\text{Perplexity} = \exp\left(-\frac{1}{N} \sum_{i=1}^{N} \log P(x_i \mid x_{<i})\right) \tag{1.12}$$

Perplexity can be interpreted as the effective vocabulary size: a model with perplexity 100 is as confused as if it had to choose uniformly among 100 tokens at each step.
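As a rough illustration of the "effective vocabulary size" reading, the sketch below computes perplexity from a list of per-token probabilities. The input format (a plain list of the probabilities the model assigned to the true tokens) is an assumption made for this example.

```python
import math

def perplexity(token_probs):
    """exp of the average negative log-probability assigned to the true tokens."""
    n = len(token_probs)
    avg_nll = -sum(math.log(p) for p in token_probs) / n
    return math.exp(avg_nll)

# A model that always spreads probability uniformly over 100 tokens.
ppl_uniform = perplexity([0.01] * 50)

# A model that is usually confident about the true token.
ppl_confident = perplexity([0.9, 0.8, 0.95, 0.7])
```

The uniform case recovers a perplexity of 100 (up to rounding), while the confident model lands close to 1.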

Figure 1.1: Binary entropy function and cross-entropy comparison across different model
predictions.

1.3 Optimization Theory


Training LLMs involves optimizing billions of parameters using gradient-based methods. This section covers
the optimization algorithms and theoretical foundations used in deep learning.

1.3.1 Gradient Descent

Gradient descent iteratively updates parameters in the direction of steepest descent:

$$\theta_{t+1} = \theta_t - \eta \nabla_\theta \mathcal{L}(\theta_t) \tag{1.13}$$

where $\eta$ is the learning rate and $\nabla_\theta \mathcal{L}(\theta_t)$ is the gradient of the loss with respect to parameters.

1.3.2 Stochastic Gradient Descent (SGD)

For large datasets, computing the full gradient is expensive. SGD approximates the gradient using a mini-batch:

$$\theta_{t+1} = \theta_t - \eta \nabla_\theta \mathcal{L}(\theta_t; B_t) \tag{1.14}$$

where $B_t$ is a random subset (mini-batch) of the training data.


1.3.3 Momentum

Momentum accelerates convergence by accumulating velocity:

$$v_{t+1} = \beta v_t + \nabla_\theta \mathcal{L}(\theta_t) \tag{1.15}$$

$$\theta_{t+1} = \theta_t - \eta v_{t+1} \tag{1.16}$$

where $\beta \in [0, 1)$ is the momentum coefficient (typically $\beta = 0.9$).
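A minimal sketch of the momentum update on a one-dimensional problem follows. The quadratic loss $L(\theta) = \tfrac{1}{2}\theta^2$, the learning rate, and the step count are assumptions chosen only to make the behaviour easy to inspect.

```python
def grad(theta):
    # Gradient of the hypothetical loss L(theta) = 0.5 * theta**2
    return theta

def sgd_momentum(theta, lr=0.1, beta=0.9, steps=200):
    v = 0.0
    for _ in range(steps):
        v = beta * v + grad(theta)  # velocity update (Eq. 1.15)
        theta = theta - lr * v      # parameter update (Eq. 1.16)
    return theta

theta_final = sgd_momentum(theta=5.0)
```

On this toy problem the iterate oscillates around the minimum at zero with shrinking amplitude, which is the typical signature of momentum.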

1.3.4 Adam Optimizer

Adam (Adaptive Moment Estimation) is the most commonly used optimizer for training LLMs. It combines
momentum with adaptive learning rates:

Algorithm 1: Adam Optimizer

// Initialize first and second moment estimates
$m_0 = 0,\ v_0 = 0,\ t = 0$

while not converged:
    $t = t + 1$
    $g_t = \nabla_\theta \mathcal{L}(\theta_{t-1})$ // Compute gradient
    $m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t$ // Update biased first moment
    $v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2$ // Update biased second moment
    $\hat{m}_t = m_t / (1 - \beta_1^t)$ // Bias-corrected first moment
    $\hat{v}_t = v_t / (1 - \beta_2^t)$ // Bias-corrected second moment
    $\theta_t = \theta_{t-1} - \eta \cdot \hat{m}_t / (\sqrt{\hat{v}_t} + \epsilon)$ // Update parameters

Typical hyperparameters: $\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 10^{-8}$.
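Algorithm 1 translates almost line by line into code. The sketch below is a scalar version, assuming a hypothetical objective $L(\theta) = (\theta - 3)^2$; the function name `adam_minimize`, the learning rate, and the step count are illustrative choices, and real implementations operate on tensors element-wise.

```python
import math

def adam_minimize(grad_fn, theta, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8, steps=1000):
    m, v = 0.0, 0.0  # first and second moment estimates
    for t in range(1, steps + 1):
        g = grad_fn(theta)
        m = beta1 * m + (1 - beta1) * g      # biased first moment
        v = beta2 * v + (1 - beta2) * g * g  # biased second moment
        m_hat = m / (1 - beta1 ** t)         # bias-corrected first moment
        v_hat = v / (1 - beta2 ** t)         # bias-corrected second moment
        theta -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta

# Hypothetical objective L(theta) = (theta - 3)^2, with gradient 2 * (theta - 3).
theta_star = adam_minimize(lambda th: 2.0 * (th - 3.0), theta=0.0)
```

Note that the very first step has magnitude close to `lr` regardless of the gradient scale, because $\hat{m}_1 / \sqrt{\hat{v}_1} = g_1 / |g_1|$.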

Bias Correction in Adam

The first moment estimate is initialized to zero, causing bias in early iterations. Unrolling the recursion gives:

$$m_t = (1 - \beta_1) \sum_{i=1}^{t} \beta_1^{t-i} g_i$$

At initialization ($t = 0$), $m_0 = 0$ even though $\mathbb{E}[g_t] \neq 0$ in general. Taking expectations:

$$\mathbb{E}[m_t] = \mathbb{E}\left[(1 - \beta_1) \sum_{i=1}^{t} \beta_1^{t-i} g_i\right] = \mathbb{E}[g_t](1 - \beta_1) \sum_{i=1}^{t} \beta_1^{t-i} + \zeta = \mathbb{E}[g_t](1 - \beta_1^t) + \zeta$$

where $\zeta = 0$ if $\mathbb{E}[g_i]$ is stationary. Thus, $\hat{m}_t = m_t / (1 - \beta_1^t)$ is bias-corrected.

1.3.5 AdamW

AdamW decouples weight decay from gradient updates, which improves generalization:

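$$\theta_t = \theta_{t-1} - \eta\left(\frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon} + \lambda\theta_{t-1}\right) \tag{1.17}$$

where $\lambda$ is the weight decay coefficient. This differs from L2 regularization in Adam, which adds $\lambda\theta$ to the gradient before the adaptive update, so the decay term is rescaled by $1/(\sqrt{\hat{v}_t} + \epsilon)$ along with the gradient.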

1.3.6 Learning Rate Schedules

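Learning rate schedules adjust $\eta$ during training. Cosine decay over $T$ total steps is:

$$\eta_t = \eta_{min} + \frac{1}{2}(\eta_{max} - \eta_{min})\left(1 + \cos\left(\frac{t}{T}\pi\right)\right) \tag{1.18}$$

Cosine decay with linear warmup is commonly used for LLM training:

$$\eta_t = \begin{cases} \eta_{max} \cdot \dfrac{t}{T_{warmup}} & t < T_{warmup} \\ \eta_{min} + \dfrac{1}{2}(\eta_{max} - \eta_{min})\left(1 + \cos\left(\dfrac{t - T_{warmup}}{T - T_{warmup}}\pi\right)\right) & t \geq T_{warmup} \end{cases} \tag{1.19}$$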
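The warmup-plus-cosine schedule can be sketched as a small function. The function name `lr_at` and the specific step counts and learning rates below are assumptions for illustration only.

```python
import math

def lr_at(t, total_steps, warmup_steps, lr_max, lr_min=0.0):
    """Linear warmup to lr_max, then cosine decay to lr_min (Eq. 1.19)."""
    if t < warmup_steps:
        return lr_max * t / warmup_steps
    progress = (t - warmup_steps) / (total_steps - warmup_steps)
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * progress))

schedule = [
    lr_at(t, total_steps=1000, warmup_steps=100, lr_max=3e-4, lr_min=3e-5)
    for t in range(1001)
]
```

The schedule starts at zero, peaks at `lr_max` when warmup ends, and reaches `lr_min` at the final step.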

Figure 1.2: Common learning rate schedules for LLM training: cosine decay, linear decay, and constant with warmup.

1.4 Linear Algebra for Deep Learning

Transformers rely heavily on matrix operations for efficient parallel computation. This section reviews the
essential linear algebra concepts.

Matrix multiplication: $C = AB$ where $C_{ij} = \sum_k A_{ik} B_{kj}$

Hadamard (element-wise) product: $(A \odot B)_{ij} = A_{ij} B_{ij}$

Outer product: $(u \otimes v)_{ij} = u_i v_j$ for vectors $u, v$

Transpose: $(A^T)_{ij} = A_{ji}$

1.4.2 Attention as Matrix Operations

The attention mechanism can be expressed efficiently using matrix multiplication:

$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V \tag{1.20}$$

where $Q, K, V \in \mathbb{R}^{n \times d_k}$ are matrices of queries, keys, and values respectively, and $n$ is the sequence length.

1.4.3 Computational Complexity

Table 1.1: Computational Complexity of Attention Operations

| Operation | Time Complexity | Space Complexity |
|---|---|---|
| $QK^T$ | $O(n^2 \cdot d_k)$ | $O(n^2)$ |
| Softmax | $O(n^2)$ | $O(n^2)$ |
| Multiply by $V$ | $O(n^2 \cdot d_k)$ | $O(n \cdot d_k)$ |
| Total Attention | $O(n^2 \cdot d_k)$ | $O(n^2)$ |

1.4.4 Eigendecomposition and SVD

The Singular Value Decomposition (SVD) of a matrix $A \in \mathbb{R}^{m \times n}$ is:

$$A = U \Sigma V^T \tag{1.21}$$

where $U \in \mathbb{R}^{m \times m}$ and $V \in \mathbb{R}^{n \times n}$ are orthogonal matrices, and $\Sigma \in \mathbb{R}^{m \times n}$ is diagonal with non-negative singular values $\sigma_1 \geq \sigma_2 \geq \cdots \geq 0$.

SVD is used in various LLM techniques including:

Low-rank adaptation (LoRA) for efficient fine-tuning

Model compression via truncated SVD

Analyzing the effective rank of weight matrices

1.5 Numerical Stability


Numerical stability is crucial for training deep networks. This section covers techniques for maintaining
stable computations.

1.5.4 Gradient Clipping

Gradient clipping prevents exploding gradients by capping their norm:

$$g_{clipped} = \begin{cases} g & \text{if } \|g\| \leq c \\ c \cdot \dfrac{g}{\|g\|} & \text{otherwise} \end{cases} \tag{1.25}$$

Common values for c range from 1.0 to 5.0.
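Clipping by norm rescales the gradient without changing its direction. The sketch below treats the gradient as a flat list of floats, which is a simplification; the helper name `clip_by_norm` is illustrative.

```python
import math

def clip_by_norm(grad, max_norm):
    """Rescale grad so its L2 norm is at most max_norm (Eq. 1.25)."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm <= max_norm:
        return list(grad)
    scale = max_norm / norm
    return [g * scale for g in grad]

clipped = clip_by_norm([3.0, 4.0], max_norm=1.0)    # norm 5 is rescaled to norm 1
unchanged = clip_by_norm([0.3, 0.4], max_norm=1.0)  # norm 0.5 is left as-is
```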

1.5.5 Weight Initialization

Xavier (Glorot) initialization sets weights to preserve variance through layers:

$$W_{ij} \sim U\left(-\sqrt{\frac{6}{n_{in} + n_{out}}},\ \sqrt{\frac{6}{n_{in} + n_{out}}}\right) \tag{1.26}$$

For ReLU activations, He initialization is preferred:

$$W_{ij} \sim \mathcal{N}\left(0, \frac{2}{n_{in}}\right) \tag{1.27}$$

Xavier Initialization Derivation

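For a linear layer $y = Wx$ with $n_{in}$ inputs and $n_{out}$ outputs, and assuming zero-mean, independent weights and inputs:

$$\text{Var}(y_i) = \sum_{j=1}^{n_{in}} \text{Var}(W_{ij})\,\text{Var}(x_j) = n_{in} \cdot \text{Var}(W) \cdot \text{Var}(x)$$

To maintain $\text{Var}(y) = \text{Var}(x)$:

$$\text{Var}(W) = \frac{1}{n_{in}}$$

The same argument applied to backpropagation gives $\text{Var}(W) = 1/n_{out}$. Taking the harmonic mean of the two conditions gives:

$$\text{Var}(W) = \frac{2}{n_{in} + n_{out}}$$

A uniform distribution on $[-a, a]$ has variance $a^2/3$, so $a = \sqrt{6/(n_{in} + n_{out})}$ recovers the bounds in Equation 1.26.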

​ ​

1.5.1 Softmax Numerical Stability

The softmax function is defined as:

$$\text{softmax}(x_i) = \frac{e^{x_i}}{\sum_j e^{x_j}} \tag{1.22}$$

This can overflow for large $x_i$. The numerically stable version subtracts the maximum:

$$\text{softmax}(x_i) = \frac{e^{x_i - \max(x)}}{\sum_j e^{x_j - \max(x)}} \tag{1.23}$$

Proof of Equivalence

$$\frac{e^{x_i - c}}{\sum_j e^{x_j - c}} = \frac{e^{x_i} e^{-c}}{\sum_j e^{x_j} e^{-c}} = \frac{e^{-c} e^{x_i}}{e^{-c} \sum_j e^{x_j}} = \frac{e^{x_i}}{\sum_j e^{x_j}}$$

Setting $c = \max(x)$ ensures all exponents are $\leq 0$, preventing overflow.
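The difference between the two forms shows up as soon as the inputs are large. A minimal sketch in plain Python, where `math.exp` raises `OverflowError` for arguments beyond the double-precision range; the input values are illustrative.

```python
import math

def softmax_naive(xs):
    exps = [math.exp(x) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def softmax_stable(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]  # every exponent is <= 0
    total = sum(exps)
    return [e / total for e in exps]

small = [1.0, 2.0, 3.0]
large = [1000.0, 1001.0, 1002.0]  # same values shifted by 999

probs_large = softmax_stable(large)
try:
    softmax_naive(large)
    naive_overflowed = False
except OverflowError:
    naive_overflowed = True
```

Because softmax is shift-invariant, `probs_large` matches the softmax of `small`, while the naive version fails on the shifted inputs.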

1.5.2 Layer Normalization Stability

Layer normalization computes:

$$\text{LayerNorm}(x) = \gamma \odot \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta \tag{1.24}$$

where $\epsilon$ (typically $10^{-6}$ to $10^{-5}$) prevents division by zero and improves numerical stability.

1.5.3 Mixed Precision Training

Mixed precision uses FP16/BF16 for forward/backward and FP32 for optimizer states. Key considerations:

Loss scaling: Multiply the loss by a large constant so that small gradients do not underflow in FP16

Gradient unscaling: Divide the gradients by the same constant before the weight update

Master weights: Maintain an FP32 copy of the weights for the optimizer

Figure 1.3: Gradient flow in deep neural networks showing vanishing, exploding, and stable gradient patterns.

Chapter 1 Summary

Language models learn conditional probability distributions over tokens using the chain rule: $P(x_1, \ldots, x_n) = \prod_{i=1}^{n} P(x_i \mid x_{<i})$

Cross-entropy loss equals negative log-likelihood: $\mathcal{L}_{CE} = -\sum \log P(x_i \mid x_{<i})$

Perplexity measures effective vocabulary size: $\text{PPL} = \exp(H(P, Q))$

Adam optimizer combines momentum with adaptive learning rates using first and second moment estimates

Attention has $O(n^2 \cdot d_k)$ time and $O(n^2)$ space complexity

Numerical stability techniques (softmax subtract-max, loss scaling) are essential for training

Chapter 2: Tokenization and Embeddings

Before neural networks can process text, it must be converted into numerical representations. This chapter
covers the fundamental techniques for tokenizing text and creating meaningful vector representations of
words and subwords. We present both the algorithms and their mathematical foundations.

2.1 Text Tokenization Algorithms


Tokenization is the process of breaking text into discrete units (tokens) that can be processed by neural
networks. The choice of tokenization strategy significantly impacts model performance, vocabulary size, and
out-of-vocabulary handling.

2.1.1 Tokenization Strategies

Table 2.1: Comparison of Tokenization Methods

| Method | Description | Vocab Size | Pros | Cons |
|---|---|---|---|---|
| Word-level | Split on whitespace/punctuation | 100K-1M+ | Intuitive, short sequences | Large vocabulary, OOV issues |
| Character-level | Each character is a token | ~100-256 | No OOV, tiny vocabulary | Very long sequences |
| Subword (BPE) | Merge frequent character pairs | 32K-100K | Balance of vocab and length | Can split words awkwardly |
| Subword (WordPiece) | Greedy left-to-right | 32K-100K | Good for morphological | Language-dependent |
| SentencePiece | Language-agnostic BPE/Unigram | 32K-250K | Works on raw text | Requires pre-training |

2.1.2 Byte Pair Encoding (BPE)

BPE is the most widely used tokenization algorithm in modern LLMs. It starts with a character vocabulary
and iteratively merges the most frequent adjacent pairs.

Algorithm 2: Byte Pair Encoding

// Initialize vocabulary with all characters
V = set of all characters in training data

for i = 1 to num_merges:
    (A, B) = most frequent adjacent token pair in corpus
    V = V ∪ {AB} // Add merged token to vocabulary
    Replace all occurrences of (A, B) with AB in corpus
    Update pair frequencies


BPE Merge Objective

BPE can be viewed as optimizing a compression objective. Each merge reduces the total number of
tokens in the corpus. The merge score for pair (A, B) is:

score(A, B) = count(A, B)

The algorithm greedily selects the pair with highest count, which locally optimizes the compression
ratio. This is a greedy approximation to the NP-hard problem of finding the optimal set of merges.

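BPE Example

Initial corpus: "low", "lower", "lowest", "new", "newer", "newest"

Step 0: Vocabulary = {l, o, w, e, r, s, t, n}

Step 1: Merge (e, r), which occurs in "lower" and "newer" → "er"

Step 2: Merge (er, </w>), where </w> is the end-of-word marker → "er</w>"

Step 3: Merge (e, s), which occurs in "lowest" and "newest" → "es"

After 10 merges: Vocabulary includes common subwords like "low", "new", "er", "est"

The merge order shown is illustrative: which pair is most frequent at each step depends on the word frequencies in the corpus. If each of the six words occurs exactly once, the most frequent initial pair is (w, e), with four occurrences.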
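A compact sketch of the BPE training loop is shown below. It assumes each word in the corpus occurs once, omits the end-of-word marker for brevity, and breaks ties between equally frequent pairs alphabetically; these are simplifications for illustration, and the function names are not taken from any tokenizer library.

```python
from collections import Counter

def get_pair_counts(words):
    """Count adjacent symbol pairs across a {symbol-tuple: frequency} corpus."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(words, pair):
    """Replace every adjacent occurrence of `pair` with the concatenated symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

def train_bpe(corpus, num_merges):
    words = dict(Counter(tuple(w) for w in corpus))  # start from characters
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(words)
        if not pairs:
            break
        # Highest count wins; ties are broken alphabetically for determinism.
        best = min(pairs, key=lambda p: (-pairs[p], p))
        merges.append(best)
        words = merge_pair(words, best)
    return merges, words

corpus = ["low", "lower", "lowest", "new", "newer", "newest"]
merges, segmented = train_bpe(corpus, num_merges=3)
```

With uniform word counts, the first learned merge is `("w", "e")`, since that pair appears in four of the six words.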

Figure 2.1: Byte Pair Encoding merge process showing vocabulary growth over training steps.


2.1.3 WordPiece Algorithm

WordPiece, used in BERT, differs from BPE in how it selects merges. Instead of frequency, it uses a
language modeling objective:

$$\text{score}(A, B) = \frac{P(AB)}{P(A)P(B)} \tag{2.1}$$

This score measures how much more likely the merged pair is compared to the product of individual
probabilities. Higher scores indicate stronger co-occurrence patterns.

2.1.4 SentencePiece Unigram

SentencePiece's Unigram algorithm takes a different approach, starting with a large vocabulary and pruning
it:

Algorithm 3: Unigram Tokenization

// Initialize with large seed vocabulary
V = all substrings from corpus with frequency ≥ threshold

while $|V|$ > target_size:
    // Compute unigram language model
    $P(x) = \dfrac{c(x)}{\sum_{y \in V} c(y)}$ for all $x \in V$
    // Compute loss (negative log-likelihood) with each token removed
    $L(x) = -\sum_{s \in \text{corpus}} \log P(s; V \setminus \{x\})$
    // Remove token with smallest loss increase
    $x^* = \arg\min_x L(x)$
    $V = V \setminus \{x^*\}$

2.1.5 Subword Regularization

Subword regularization introduces randomness during tokenization to improve robustness:

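$$P(x) = \sum_{s \in S(x)} P(s) \tag{2.14}$$

where $S(x)$ is the set of all possible segmentations of $x$. During training, a segmentation is sampled from $S(x)$ according to its probability instead of always taking the most probable one.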

2.1.6 Byte-Level BPE


GPT-2 uses byte-level BPE to handle any Unicode text:

Base vocabulary: 256 bytes

Merges learned on byte sequences

No unknown tokens possible

2.1.7 Special Tokens

Modern tokenizers include special tokens for specific purposes:

Table 2.2: Common Special Tokens in LLMs

| Token | Purpose | Used In |
|---|---|---|
| `<\|endoftext\|>` | End of sequence marker | GPT series |
| `</s>` | End of sequence | LLaMA, BLOOM |
| `<\|user\|>`, `<\|assistant\|>` | Role markers in chat | Chat models |
| `<pad>` | Padding for batching | Most models |
| `<unk>` | Unknown token | Word-level models |
| `<\|im_start\|>`, `<\|im_end\|>` | Chat message boundaries | ChatML format |

Engineering Pitfall

Tokenization is not reversible in general. Different tokenizers may split the same text differently,
leading to subtle but important differences in model behavior. Always use the tokenizer that matches
the pre-trained model. Mixing tokenizers (e.g., using GPT-2's tokenizer with LLaMA's weights) will
produce garbage outputs.

2.2 Word Embeddings

Word embeddings map discrete tokens to continuous vector spaces where semantic relationships are
preserved. This section covers the evolution of word embedding techniques.


2.2.1 One-Hot Encoding

The simplest representation is one-hot encoding, where each word is represented as a vector with a single 1
at its vocabulary index:

$$\text{one\_hot}(w_i) = e_i \in \{0, 1\}^{|V|} \tag{2.2}$$

This representation is sparse and fails to capture any semantic relationships between words. The dot product
of any two distinct one-hot vectors is zero.

2.2.2 Distributed Representations

Word2Vec (Mikolov et al., 2013) introduced efficient methods for learning dense word embeddings. Two
architectures were proposed:

CBOW (Continuous Bag of Words): Predict target word from context

$$P(w_t \mid w_{t-c}, \ldots, w_{t-1}, w_{t+1}, \ldots, w_{t+c}) \tag{2.3}$$

Skip-gram: Predict context words from target word

$$P(w_{t+j} \mid w_t) \quad \text{for } j \in \{-c, \ldots, -1, 1, \ldots, c\} \tag{2.4}$$

The skip-gram objective maximizes:

$$\mathcal{L} = \frac{1}{T} \sum_{t=1}^{T} \sum_{-c \leq j \leq c,\, j \neq 0} \log P(w_{t+j} \mid w_t) \tag{2.5}$$

where the probability is computed using softmax:

$$P(w_O \mid w_I) = \frac{\exp\left({v'_{w_O}}^{T} v_{w_I}\right)}{\sum_{w=1}^{W} \exp\left({v'_{w}}^{T} v_{w_I}\right)} \tag{2.6}$$

Here $v_w$ is the input vector and $v'_w$ is the output vector for word $w$.


2.2.3 Negative Sampling

The full softmax is computationally expensive for large vocabularies. Negative sampling approximates it by
contrasting true pairs with random negative samples:

$$\mathcal{L} = \log \sigma\left({v'_{w_O}}^{T} v_{w_I}\right) + \sum_{i=1}^{k} \mathbb{E}_{w_i \sim P_n(w)}\left[\log \sigma\left(-{v'_{w_i}}^{T} v_{w_I}\right)\right] \tag{2.7}$$

where $k$ is the number of negative samples and $P_n(w)$ is the noise distribution (typically proportional to word frequency raised to the 3/4 power).

2.2.4 GloVe: Global Vectors

GloVe combines global corpus statistics with local context window methods. It uses word co-occurrence counts $X_{ij}$ (how often word $j$ appears in the context of word $i$):

$$\mathcal{L} = \sum_{i,j=1}^{V} f(X_{ij})\left(w_i^T \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij}\right)^2 \tag{2.8}$$

where $f$ is a weighting function that down-weights rare co-occurrences:

$$f(x) = \begin{cases} (x / x_{max})^{\alpha} & \text{if } x < x_{max} \\ 1 & \text{otherwise} \end{cases} \tag{2.9}$$

Typical values: $x_{max} = 100$, $\alpha = 0.75$.


Figure 2.2: Word embeddings showing semantic relationships in 2D projection. Similar words cluster together and analogies
form parallelograms.

2.3 Contextual Embeddings

Static embeddings (Word2Vec, GloVe) assign the same vector to a word regardless of context. Contextual
embeddings address this limitation by producing different representations based on surrounding words.

2.3.1 ELMo: Embeddings from Language Models

ELMo generates contextualized word representations by combining hidden states from a bidirectional LSTM
language model:

$$\text{ELMo}_k = E(R_k; \Theta) = \gamma \sum_{j=0}^{L} s_j h_{k,j} \tag{2.10}$$

where $h_{k,j}$ is the hidden state at layer $j$ for token $k$, $s_j$ are learned scalar weights, and $\gamma$ is a task-specific scale factor.

The input representation $h_0$ sums the token and positional embeddings:

$$h_0 = E_{token}(x) + E_{pos}(\text{position}) \tag{2.11}$$

where $E_{token} \in \mathbb{R}^{|V| \times d}$ is the token embedding matrix and $E_{pos} \in \mathbb{R}^{L_{max} \times d}$ is the positional embedding matrix.

2.4 Embedding Arithmetic

Word embeddings exhibit interesting arithmetic properties that reflect semantic relationships.

2.4.1 Analogical Reasoning

Word2Vec embeddings capture analogies through vector arithmetic:

$$v_{king} - v_{man} + v_{woman} \approx v_{queen} \tag{2.12}$$

This can be formalized as finding the word $w^*$ that maximizes cosine similarity:

$$w^* = \arg\max_{w} \frac{(v_a - v_b + v_c)^T v_w}{\|v_a - v_b + v_c\| \, \|v_w\|} \tag{2.13}$$
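The arg-max search can be sketched with a handful of hand-made vectors. The 2-D embeddings below are hypothetical values constructed so that the analogy holds exactly; real embeddings are learned and only approximately linear. Excluding the three query words from the candidates is a common convention, assumed here rather than stated in the text.

```python
import math

# Hypothetical 2-D embeddings (axis 0 ~ "royalty", axis 1 ~ "gender"), for illustration only.
EMB = {
    "king": [0.9, 0.8],
    "queen": [0.9, -0.8],
    "man": [0.1, 0.8],
    "woman": [0.1, -0.8],
    "apple": [-0.7, 0.05],
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def analogy(a, b, c):
    """Return argmax_w cos(v_a - v_b + v_c, v_w), excluding the query words."""
    target = [x - y + z for x, y, z in zip(EMB[a], EMB[b], EMB[c])]
    candidates = [w for w in EMB if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(target, EMB[w]))

result = analogy("king", "man", "woman")
```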

2.4.2 Linear Substructures

Mikolov et al. (2013) showed that semantic and syntactic relationships form linear subspaces in the
embedding space:

Gender: $v_{he} - v_{she} \approx v_{king} - v_{queen} \approx v_{actor} - v_{actress}$

Plural: $v_{apple} - v_{apples} \approx v_{car} - v_{cars}$

Capital-Country: $v_{Paris} - v_{France} \approx v_{Rome} - v_{Italy}$


Key Takeaway

The ability of embeddings to capture analogies suggests they learn meaningful semantic representations. However, this property is not perfect and embeddings can encode societal biases present in the training data. For example, $v_{doctor} - v_{nurse}$ may have a gender component due to biased training corpora.

Figure 2.3: Effect of temperature on softmax distribution. Lower temperature makes the distribution sharper (more confident),
while higher temperature makes it more uniform.

Chapter 2 Summary

BPE is the dominant tokenization method, iteratively merging frequent character pairs

Subword tokenization balances vocabulary size and sequence length

Word2Vec uses skip-gram with negative sampling for efficient training

GloVe combines global co-occurrence statistics with local context

Contextual embeddings (ELMo, BERT, GPT) capture word meaning in context

Embedding arithmetic reflects semantic relationships: $v_{king} - v_{man} + v_{woman} \approx v_{queen}$

Chapter 3: Transformer Architecture

The Transformer architecture, introduced in "Attention Is All You Need" (Vaswani et al., 2017),
revolutionized natural language processing by replacing recurrent layers with attention mechanisms. This
chapter provides a comprehensive mathematical treatment of the transformer architecture.

Figure 3.1: Transformer block architecture showing the flow from input through multi-head attention, add and norm, feed-forward network, and output. Residual connections enable gradient flow in deep networks.

3.1 Self-Attention Mechanism

Self-attention allows each position in a sequence to attend to all other positions, computing a weighted
representation based on relevance scores. This is the core innovation of the transformer architecture.

3.1.1 Scaled Dot-Product Attention

The core attention computation involves three matrices: Queries ($Q$), Keys ($K$), and Values ($V$):

$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V \tag{3.1}$$

where $Q, K, V \in \mathbb{R}^{n \times d_k}$ and $n$ is the sequence length.

Why the Scaling Factor $\sqrt{d_k}$?

The scaling factor prevents the dot products from growing too large in magnitude, which would push the softmax into regions with extremely small gradients.

For queries and keys with components drawn i.i.d. from $\mathcal{N}(0, 1)$:

$$q \cdot k = \sum_{i=1}^{d_k} q_i k_i$$

$$\mathbb{E}[q \cdot k] = 0, \quad \text{Var}(q \cdot k) = d_k$$

Thus, the dot product has standard deviation $\sqrt{d_k}$. Dividing by $\sqrt{d_k}$ normalizes the variance to 1, keeping softmax gradients well-behaved.
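The variance claim can be checked empirically. The following is a small Monte Carlo sketch using only the standard library; the sample count, seed, and $d_k = 64$ are arbitrary choices for illustration.

```python
import random
import statistics

def dot_product_variance(d_k, num_samples=4000, seed=0):
    """Empirical variance of q.k for q, k with i.i.d. N(0, 1) components."""
    rng = random.Random(seed)
    dots = []
    for _ in range(num_samples):
        q = [rng.gauss(0.0, 1.0) for _ in range(d_k)]
        k = [rng.gauss(0.0, 1.0) for _ in range(d_k)]
        dots.append(sum(a * b for a, b in zip(q, k)))
    return statistics.pvariance(dots)

d_k = 64
raw_var = dot_product_variance(d_k)  # should be close to d_k
scaled_var = raw_var / d_k           # Var(q.k / sqrt(d_k)) = Var(q.k) / d_k, close to 1
```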

3.1.2 Attention Score Interpretation

The attention weights $\alpha_{ij}$ represent how much position $i$ should attend to position $j$:

$$\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k=1}^{n} \exp(e_{ik})} \tag{3.2}$$

where $e_{ij} = \dfrac{Q_i K_j^T}{\sqrt{d_k}}$ is the attention score between positions $i$ and $j$.

Key Takeaway

Attention can be viewed as a differentiable key-value lookup: queries match against keys, and the
resulting weights determine how much of each value to retrieve. The softmax ensures weights sum
to 1, creating a weighted average. This is analogous to database queries but with soft, differentiable
retrieval.

3.1.3 Causal (Autoregressive) Attention

For language modeling, we use causal attention that prevents positions from attending to future positions:

$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}} + M\right)V \tag{3.3}$$

where $M$ is a mask matrix with $M_{ij} = -\infty$ for $j > i$ (future positions) and 0 otherwise:

$$M_{ij} = \begin{cases} 0 & \text{if } j \leq i \\ -\infty & \text{if } j > i \end{cases} \tag{3.4}$$
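The effect of the mask is easiest to see in a small, loop-based sketch. This version uses nested Python lists rather than a tensor library, purely for readability; the function name and the example matrices are illustrative assumptions.

```python
import math

def softmax(row):
    m = max(row)
    exps = [math.exp(x - m) for x in row]  # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]

def causal_attention(Q, K, V):
    """Scaled dot-product attention with a causal mask (Eqs. 3.3 and 3.4)."""
    n, d_k = len(Q), len(Q[0])
    weights = []
    for i in range(n):
        scores = []
        for j in range(n):
            if j > i:
                scores.append(float("-inf"))  # future position: masked out
            else:
                dot = sum(q * k for q, k in zip(Q[i], K[j]))
                scores.append(dot / math.sqrt(d_k))
        weights.append(softmax(scores))
    output = [
        [sum(weights[i][j] * V[j][c] for j in range(n)) for c in range(len(V[0]))]
        for i in range(n)
    ]
    return output, weights

Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
output, weights = causal_attention(Q, K, V)
```

The resulting weight matrix is lower triangular, and the first position can only attend to itself, so its output equals the first row of `V`.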
​ ​

Figure 3.2: Attention mask patterns: causal (lower triangular) for autoregressive models, bidirectional (full) for encoder
models, and sliding window for sparse attention.

3.1.4 Attention as Matrix Multiplication

The full attention operation can be written as a series of matrix multiplications:

$$A = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right) \in \mathbb{R}^{n \times n} \tag{3.5}$$

$$\text{Output} = AV \in \mathbb{R}^{n \times d_k} \tag{3.6}$$

The attention matrix A contains all pairwise attention weights, enabling parallel computation on GPUs.

3.2 Multi-Head Attention

Multi-head attention allows the model to jointly attend to information from different representation
subspaces at different positions.


3.2.1 Multi-Head Formulation

$$\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \ldots, \text{head}_h)W^O \tag{3.7}$$

where each head is computed as:

$$\text{head}_i = \text{Attention}(QW_i^Q, KW_i^K, VW_i^V) \tag{3.8}$$

The projection matrices are:

$$W_i^Q \in \mathbb{R}^{d_{model} \times d_k}, \quad W_i^K \in \mathbb{R}^{d_{model} \times d_k}, \quad W_i^V \in \mathbb{R}^{d_{model} \times d_v} \tag{3.9}$$

$$W^O \in \mathbb{R}^{hd_v \times d_{model}} \tag{3.10}$$

Typical configurations use $h = 8$ or $h = 16$ heads with $d_k = d_v = d_{model}/h$.

3.2.2 Parameter Count

The total number of parameters in multi-head attention is:

$$\text{Params}_{MHA} = h \cdot (d_{model} \cdot d_k + d_{model} \cdot d_k + d_{model} \cdot d_v) + hd_v \cdot d_{model} \tag{3.11}$$

With $d_k = d_v = d_{model}/h$:

$$\text{Params}_{MHA} = 4 \cdot d_{model}^2 \tag{3.12}$$
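The simplification from Equation 3.11 to 3.12 can be confirmed with a few lines of arithmetic. The sketch assumes $d_{model}$ is divisible by $h$ and ignores bias terms, as the formulas above do; the helper name `mha_params` is illustrative.

```python
def mha_params(d_model, num_heads):
    """Weight count of multi-head attention per Eq. 3.11 (biases ignored)."""
    d_k = d_v = d_model // num_heads
    per_head = d_model * d_k + d_model * d_k + d_model * d_v  # W_Q, W_K, W_V
    output_proj = num_heads * d_v * d_model                   # W_O
    return num_heads * per_head + output_proj

gpt2_small = mha_params(d_model=768, num_heads=12)  # 2,359,296 weights, about 2.36M
wide = mha_params(d_model=8192, num_heads=64)
```

The count depends only on $d_{model}$, not on how many heads the dimension is split into.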

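Table 3.1: Multi-Head Attention Configurations in Popular Models

| Model | $d_{model}$ | Heads ($h$) | $d_k = d_v$ | Params per Layer |
|---|---|---|---|---|
| GPT-2 Small | 768 | 12 | 64 | 2.36M |
| BERT-Base | 768 | 12 | 64 | 2.36M |
| GPT-3 | 12288 | 96 | 128 | 604M |
| LLaMA-2 70B | 8192 | 64 | 128 | 268M |
| PaLM 540B | 18432 | 48 | 384 | 1.36B |

Parameter counts follow Equation 3.12 and assume standard multi-head attention with $d_k = d_v = d_{model}/h$. Models that share key/value heads (grouped-query attention in LLaMA-2 70B, multi-query attention in PaLM) have fewer attention parameters in practice.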

3.2.3 Computational Complexity

The computational complexity of multi-head attention is dominated by the matrix multiplications:

Table 3.2: Computational Complexity of Multi-Head Attention

| Operation | FLOPs | Memory |
|---|---|---|
| Linear projections ($Q, K, V$) | $3 \cdot h \cdot n \cdot d_{model} \cdot d_k = 3nd_{model}^2$ | $3 \cdot h \cdot n \cdot d_k = 3nd_{model}$ |
| $QK^T$ computation | $h \cdot n^2 \cdot d_k = n^2 d_{model}$ | $h \cdot n^2 = n^2 h$ |
| Softmax | $h \cdot n^2$ | $h \cdot n^2$ |
| Attention × Values | $h \cdot n^2 \cdot d_v = n^2 d_{model}$ | $h \cdot n \cdot d_v = nd_{model}$ |
| Output projection | $n \cdot hd_v \cdot d_{model} = nd_{model}^2$ | $n \cdot d_{model}$ |
| Total | $O(n^2 d_{model} + nd_{model}^2)$ | $O(n^2 h + nd_{model})$ |

3.3 Feed-Forward Networks

Each transformer layer includes a position-wise feed-forward network applied independently to each
position.

3.3.1 FFN Architecture

$$\text{FFN}(x) = \max(0, xW_1 + b_1)W_2 + b_2 \tag{3.13}$$

The inner dimension is typically $d_{ff} = 4 \times d_{model}$. This expansion allows the model to learn more complex transformations.

3.3.2 Parameter Count

$$\text{Params}_{FFN} = d_{model} \cdot d_{ff} + d_{ff} + d_{ff} \cdot d_{model} + d_{model} = 2 \cdot d_{model} \cdot d_{ff} + d_{ff} + d_{model} \tag{3.14}$$

With $d_{ff} = 4d_{model}$:

$$\text{Params}_{FFN} \approx 8 \cdot d_{model}^2 \tag{3.15}$$

3.3.3 GELU Activation

Modern LLMs often use GELU (Gaussian Error Linear Unit) instead of ReLU:

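$$\text{GELU}(x) = x \cdot \Phi(x) = x \cdot \frac{1}{2}\left[1 + \text{erf}\left(\frac{x}{\sqrt{2}}\right)\right] \tag{3.16}$$

where $\Phi(x)$ is the cumulative distribution function of the standard normal distribution.

A common approximation is:

$$\text{GELU}(x) \approx 0.5x\left(1 + \tanh\left[\sqrt{\frac{2}{\pi}}\left(x + 0.044715x^3\right)\right]\right) \tag{3.17}$$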
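The exact form and the tanh approximation can be compared directly. A minimal sketch using the standard library's `math.erf`; the evaluation grid over $[-5, 5]$ is an arbitrary choice for illustration.

```python
import math

def gelu_exact(x):
    """GELU via the Gaussian CDF (Eq. 3.16)."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def gelu_tanh(x):
    """Tanh approximation of GELU (Eq. 3.17)."""
    inner = math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)
    return 0.5 * x * (1.0 + math.tanh(inner))

grid = [i / 100.0 for i in range(-500, 501)]
max_abs_diff = max(abs(gelu_exact(x) - gelu_tanh(x)) for x in grid)
```

On this grid the two curves stay very close, which is why the approximation is often used as a drop-in replacement.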

GELU as Stochastic Regularization

GELU can be interpreted as the expected value of a stochastic regularizer. Suppose we multiply the input by a Bernoulli random variable that equals 1 with probability $\Phi(x)$:

$$\text{Dropout}(x) = x \cdot \mathbb{1}[\epsilon < \Phi(x)], \quad \epsilon \sim \text{Uniform}(0, 1)$$

Taking the expectation:

$$\mathbb{E}[\text{Dropout}(x)] = x \cdot P(\epsilon < \Phi(x)) = x \cdot \Phi(x) = \text{GELU}(x)$$

Because $\Phi$ is smooth, GELU gates inputs softly rather than with ReLU's hard threshold at zero, which is why its gradient is continuous everywhere.

3.4 Layer Normalization


Layer normalization stabilizes training by normalizing across the feature dimension.

3.4.1 Layer Normalization Formula

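$$\text{LayerNorm}(x) = \gamma \odot \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta \tag{3.18}$$

where $\mu$ and $\sigma^2$ are computed across the feature dimension for each sample:

$$\mu = \frac{1}{d}\sum_{i=1}^{d} x_i, \quad \sigma^2 = \frac{1}{d}\sum_{i=1}^{d}(x_i - \mu)^2 \tag{3.19}$$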
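The formulas (3.18) and (3.19) can be sketched for a single feature vector as follows. This is a minimal pure-Python illustration; the function name `layer_norm` and the default `eps` are assumptions, and real implementations operate on batched tensors.

```python
import math
from typing import List

def layer_norm(x: List[float], gamma: List[float], beta: List[float],
               eps: float = 1e-5) -> List[float]:
    """Normalize one feature vector across its d features, then scale and shift."""
    d = len(x)
    mu = sum(x) / d                                # mean over features
    var = sum((xi - mu) ** 2 for xi in x) / d      # biased variance over features
    inv_std = 1.0 / math.sqrt(var + eps)
    return [g * (xi - mu) * inv_std + b for xi, g, b in zip(x, gamma, beta)]

x = [1.0, 2.0, 3.0, 4.0]
y = layer_norm(x, gamma=[1.0] * 4, beta=[0.0] * 4)
```

With γ = 1 and β = 0 the output has mean zero and variance very close to one, whatever the scale of the input.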

3.4.2 Pre-Norm vs Post-Norm

There are two common placements of layer normalization:

Post-Norm (original Transformer): LayerNorm is applied after the residual connection

$$x_{l+1} = \text{LayerNorm}(x_l + \text{Sublayer}(x_l)) \tag{3.22}$$

Pre-Norm (modern LLMs): LayerNorm is applied before the sublayer

$$x_{l+1} = x_l + \text{Sublayer}(\text{LayerNorm}(x_l)) \tag{3.23}$$

Pre-norm is more stable for deep networks and is used in GPT, LLaMA, and most modern architectures.

Key Takeaway

Pre-norm architecture places layer normalization before the attention/FFN sublayers, while post-
norm places it after the residual addition. Pre-norm is more stable for deep networks (100+ layers)
because it prevents the gradient from exploding/vanishing through the residual path.

3.5 Architecture Variants

Different transformer variants have emerged for different use cases.

3.5.1 Encoder-Only (BERT)

Encoder-only models use bidirectional attention and are trained with masked language modeling:

$$\mathcal{L} = -\mathbb{E}_{x \sim \mathcal{D}} \sum_{i \in \mathcal{M}} \log P(x_i \mid x_{\setminus \mathcal{M}}; \theta) \tag{3.24}$$

3.5.2 Decoder-Only (GPT)

Decoder-only models use causal attention and are trained with autoregressive language modeling:

$$\mathcal{L} = -\sum_{i=1}^{N} \log P(x_i \mid x_1, \ldots, x_{i-1}; \theta) \tag{3.25}$$

3.5.3 Encoder-Decoder (T5)

Encoder-decoder models use bidirectional attention on the input and causal attention on the output:

$$\mathcal{L} = -\sum_{i=1}^{N_{\text{target}}} \log P(y_i \mid x, y_1, \ldots, y_{i-1}; \theta) \tag{3.26}$$

3.5.4 Model Parameter Count

Total parameters in a transformer model:

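$$N_{\text{total}} \approx L \times (4d_{\text{model}}^2 + 8d_{\text{model}}^2) + 2d_{\text{model}} \cdot d_{\text{vocab}} = 12 L d_{\text{model}}^2 + 2d_{\text{model}} \cdot d_{\text{vocab}} \tag{3.27}$$

Breaking down by component:

Attention per layer: $4d_{\text{model}}^2$

FFN per layer: $8d_{\text{model}}^2$ (with $d_{\text{ff}} = 4d_{\text{model}}$)

Embeddings: $2d_{\text{model}} \cdot d_{\text{vocab}}$ (input + output), counted once for the whole model rather than per layer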
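The component counts above can be turned into a rough estimator, with the per-layer terms multiplied by the number of layers and the embedding matrices counted once. The sketch below is illustrative only: `transformer_params` is a hypothetical helper, biases and LayerNorm parameters are ignored, and the layer count (12) and vocabulary size (50257) are the commonly cited GPT-2 Small values, supplied here as assumptions.

```python
def transformer_params(n_layers: int, d_model: int, d_vocab: int) -> dict:
    """Approximate parameter count, ignoring biases and LayerNorm weights."""
    attention = 4 * d_model ** 2           # W_Q, W_K, W_V, W_O
    ffn = 8 * d_model ** 2                 # two matrices with d_ff = 4 * d_model
    embeddings = 2 * d_model * d_vocab     # untied input and output embeddings
    layers_total = n_layers * (attention + ffn)
    return {
        "per_layer": attention + ffn,
        "layers_total": layers_total,
        "embeddings": embeddings,
        "total": layers_total + embeddings,
    }

counts = transformer_params(n_layers=12, d_model=768, d_vocab=50257)
```

Models that tie the input and output embeddings would count the embedding matrix only once, so this untied estimate is an upper bound for them.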


3.5.5 FLOPs per Forward Pass

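$$\text{FLOPs} \approx 2N \cdot \text{batch\_size} \cdot \text{seq\_len} + 4L \cdot \text{batch\_size} \cdot \text{seq\_len}^2 \cdot d_{\text{model}} \tag{3.28}$$

The first term accounts for the weight matrix multiplications (about 2 FLOPs per parameter per token). The second accounts for the $QK^T$ and attention-times-values products in each of the $L$ layers, which grow quadratically with sequence length.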


Figure 3.3: Comparison of transformer architecture variants: encoder-only (BERT), decoder-only (GPT), and encoder-
decoder (T5).

Figure 3.4: Comparison of training and inference pipelines for transformer models, showing the different computational
requirements.


Chapter 3 Summary

Self-attention computes weighted representations: $\text{Attention}(Q, K, V) = \text{softmax}(QK^T / \sqrt{d_k})V$

Scaling by $\sqrt{d_k}$ prevents softmax saturation and maintains gradient flow

Causal attention uses a lower-triangular mask to prevent attending to future tokens

Multi-head attention has $4d_{\text{model}}^2$ parameters and $O(n^2 d_{\text{model}})$ complexity

FFN has $8d_{\text{model}}^2$ parameters (with $d_{\text{ff}} = 4d_{\text{model}}$)

GELU provides smoother gradients than ReLU: $\text{GELU}(x) = x \cdot \Phi(x)$

Pre-norm architecture is more stable for deep networks


Chapter 4: Positional Encodings

Since the self-attention mechanism is permutation-equivariant (reordering the input tokens simply reorders the outputs), transformers need a way to incorporate positional information. This chapter explores the evolution of positional encoding methods and their mathematical foundations.

4.1 Absolute Positional Encodings


The original Transformer used fixed sinusoidal encodings to represent position. These encodings have nice
properties that enable the model to learn relative positions.

4.1.1 Sinusoidal Positional Encoding

The sinusoidal encoding uses sine and cosine functions of different frequencies:

$$PE_{(pos, 2i)} = \sin\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right) \tag{4.1}$$

$$PE_{(pos, 2i+1)} = \cos\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right) \tag{4.2}$$

Why Sinusoidal Functions?

Sinusoidal encodings allow the model to learn to attend to relative positions. For any fixed offset $k$, $PE_{pos+k}$ can be expressed as a linear function of $PE_{pos}$:

sin(ω ⋅ (pos + k)) = sin(ω ⋅ pos) cos(ω ⋅ k) + cos(ω ⋅ pos) sin(ω ⋅ k)

cos(ω ⋅ (pos + k)) = cos(ω ⋅ pos) cos(ω ⋅ k) − sin(ω ⋅ pos) sin(ω ⋅ k)

This means the model can learn to attend to relative positions by learning attention patterns that depend
on the relative phase shifts.
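The linear relationship can be verified numerically. The sketch below builds the encoding from (4.1) and (4.2) and then maps PE(pos) to PE(pos + k) using only the angle-addition identities; the helper names `sinusoidal_pe` and `shift` are illustrative, and `d_model` is assumed to be even.

```python
import math
from typing import List

def sinusoidal_pe(pos: int, d_model: int) -> List[float]:
    """Interleaved [sin, cos, sin, cos, ...] encoding for one position."""
    pe = []
    for i in range(d_model // 2):
        angle = pos / (10000 ** (2 * i / d_model))
        pe.extend([math.sin(angle), math.cos(angle)])
    return pe

def shift(pe: List[float], k: int, d_model: int) -> List[float]:
    """Map PE(pos) to PE(pos + k) with a per-frequency 2x2 linear map."""
    out = []
    for i in range(d_model // 2):
        omega = 1.0 / (10000 ** (2 * i / d_model))
        s, c = pe[2 * i], pe[2 * i + 1]
        out.append(s * math.cos(omega * k) + c * math.sin(omega * k))  # sin(w(pos+k))
        out.append(c * math.cos(omega * k) - s * math.sin(omega * k))  # cos(w(pos+k))
    return out

d_model, pos, k = 8, 5, 3
predicted = shift(sinusoidal_pe(pos, d_model), k, d_model)
actual = sinusoidal_pe(pos + k, d_model)
```

The coefficients of the map depend only on the offset k and the frequency, never on pos, which is what makes the transformation linear in PE(pos).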

4.1.2 Learned Absolute Embeddings

BERT and GPT use learned positional embeddings, treating position as another embedding lookup:

$$h_0 = E_{\text{token}}(x) + E_{\text{pos}}(\text{position}) \tag{4.3}$$

where $E_{\text{pos}} \in \mathbb{R}^{L_{\max} \times d_{\text{model}}}$ is a learned matrix.

Engineering Pitfall

Learned absolute embeddings have a fixed maximum sequence length. Extrapolation to longer sequences often fails catastrophically. Models like GPT-2 were trained with $L_{\max} = 1024$ and struggle with longer inputs. This is a fundamental limitation of learned position embeddings.

Figure 4.1: Heatmap visualization of sinusoidal positional encoding. Each row represents a position, and each column
represents a dimension. The wavelength increases with dimension index.

4.2 Relative Positional Encodings


Relative positional encodings encode the distance between positions rather than absolute position, which
generalizes better to longer sequences.

4.2.1 Shaw et al. (2018)

This method adds relative position embeddings to the keys when computing the attention scores:

$$e_{ij} = \frac{x_i W^Q (x_j W^K + a_{ij})^T}{\sqrt{d_k}} \tag{4.4}$$

where $a_{ij}$ is a learned embedding based on the relative position $j - i$.

4.2.2 T5 Relative Bias

T5 simplifies this by adding a learned bias to attention scores:

$$e_{ij} = \frac{x_i W^Q (x_j W^K)^T}{\sqrt{d_k}} + b_{j-i} \tag{4.5}$$

where $b_k$ is a learned scalar bias for relative position $k$.


4.3 Rotary Position Embeddings (RoPE)


RoPE (Su et al., 2021) encodes position by rotating the query and key vectors in 2D subspaces. This has
become the standard for modern LLMs.

4.3.1 RoPE Formulation

RoPE applies a rotation matrix to pairs of dimensions:

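$$R_{\Theta,m} = \begin{pmatrix} \cos m\theta_1 & -\sin m\theta_1 & 0 & 0 & \cdots \\ \sin m\theta_1 & \cos m\theta_1 & 0 & 0 & \cdots \\ 0 & 0 & \cos m\theta_2 & -\sin m\theta_2 & \cdots \\ 0 & 0 & \sin m\theta_2 & \cos m\theta_2 & \cdots \\ \vdots & \vdots & \vdots & \vdots & \ddots \end{pmatrix} \tag{4.6}$$

where $\theta_i = 10000^{-2i/d_{\text{model}}}$ are the rotation frequencies.

The query at position $m$ and the key at position $n$ are rotated by angles $m\theta_i$ and $n\theta_i$ respectively:

$$q_m = R_{\Theta,m} W_q x_m, \quad k_n = R_{\Theta,n} W_k x_n \tag{4.7}$$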

4.3.2 Relative Position Property

The key property of RoPE is that the inner product naturally encodes relative position:

$$q_m^T k_n = x_m^T W_q^T R_{\Theta,n-m} W_k x_n \tag{4.8}$$


Proof of Relative Position Encoding

For a 2D rotation matrix at frequency θ:

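$$R_{\theta,m} = \begin{pmatrix} \cos m\theta & -\sin m\theta \\ \sin m\theta & \cos m\theta \end{pmatrix}$$

The inner product of rotated vectors:

$$q_m^T k_n = (R_{\theta,m} q)^T (R_{\theta,n} k) = q^T R_{\theta,m}^T R_{\theta,n} k$$

Since rotation matrices are orthogonal, $R_{\theta,m}^T = R_{\theta,-m}$, and composing rotations adds their angles:

$$= q^T R_{\theta,n-m} k$$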

This shows the inner product depends only on the relative position n − m.
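The property can be checked numerically with a small sketch. The `rope` helper below is illustrative: it rotates consecutive pairs of dimensions and uses a zero-based pair index for the frequencies (an assumed convention), so it is not a drop-in copy of any library's implementation.

```python
import math
from typing import List

def rope(x: List[float], position: int, base: float = 10000.0) -> List[float]:
    """Rotate each pair (x[2i], x[2i+1]) by the angle position * theta_i."""
    d = len(x)
    out = []
    for i in range(d // 2):
        theta = base ** (-2 * i / d)
        angle = position * theta
        a, b = x[2 * i], x[2 * i + 1]
        out.append(a * math.cos(angle) - b * math.sin(angle))
        out.append(a * math.sin(angle) + b * math.cos(angle))
    return out

def dot(u: List[float], v: List[float]) -> float:
    return sum(a * b for a, b in zip(u, v))

q = [0.3, -1.2, 0.8, 0.5]
k = [1.1, 0.4, -0.7, 0.9]
score_a = dot(rope(q, 2), rope(k, 7))        # positions 2 and 7, offset 5
score_b = dot(rope(q, 102), rope(k, 107))    # positions shifted by 100, same offset
```

Both scores share the offset n − m = 5, so they agree up to floating-point error even though the absolute positions differ by 100.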

Key Takeaway

RoPE has become the dominant positional encoding method in modern LLMs (LLaMA, PaLM,
Mistral) because it: (1) naturally handles relative positions, (2) decays attention for distant positions
through rotation, and (3) extrapolates better to longer sequences than absolute encodings.

4.4 ALiBi and Long Context Methods

4.4.1 ALiBi (Attention with Linear Biases)

ALiBi adds a penalty to attention scores proportional to distance:

$$\text{softmax}\left(q_i K^T + m \cdot [-(i-1), \ldots, -1, 0]\right) \tag{4.9}$$

where m is a head-specific slope. ALiBi enables strong extrapolation to longer sequences than seen during
training.
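The bias vector in (4.9) is simple to construct. In the sketch below, `alibi_bias` follows the formula directly, while `alibi_slopes` uses the geometric sequence commonly described for ALiBi when the head count is a power of two; both names are illustrative, and the slope schedule is stated here as an assumption since the text only says the slope is head-specific.

```python
from typing import List

def alibi_slopes(n_heads: int) -> List[float]:
    """Geometric slopes 2^(-8/n), 2^(-16/n), ... (assumes n_heads is a power of two)."""
    return [2.0 ** (-8.0 * (h + 1) / n_heads) for h in range(n_heads)]

def alibi_bias(i: int, slope: float) -> List[float]:
    """Bias added to query i's scores over keys 1..i: slope * [-(i-1), ..., -1, 0]."""
    return [slope * -(i - j) for j in range(1, i + 1)]

slopes = alibi_slopes(8)
bias = alibi_bias(4, slopes[0])
```

Heads with large slopes focus on nearby tokens, while heads with small slopes can still attend far back in the sequence.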

4.4.2 Position Interpolation

To extend context length, position interpolation scales position indices:

$$PE'_{pos} = PE_{pos \cdot s} \tag{4.10}$$

where $s = L_{\text{original}} / L_{\text{target}} < 1$ is the scaling factor. This allows models to handle longer contexts with minimal fine-tuning.

4.4.3 NTK-Aware Scaling

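NTK-aware scaling modifies the base of the RoPE frequencies rather than the position indices. Writing $\theta_i = b^{-2i/d_{\text{model}}}$ with $b = 10000$:

$$b' = b \cdot \left(\frac{L_{\text{target}}}{L_{\text{original}}}\right)^{d_{\text{model}}/(d_{\text{model}}-2)}, \quad \theta_i' = (b')^{-2i/d_{\text{model}}} \tag{4.11}$$

This provides better extrapolation by leaving the high-frequency components almost unchanged while stretching the low-frequency ones.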

Table 4.1: Comparison of Positional Encoding Methods

Method Type Extrapolation Complexity Used In

Sinusoidal Absolute Poor O(1) Original Transformer

Learned Absolute Poor O(L_max ⋅ d) BERT, GPT-2

Relative (Shaw) Relative Moderate O(n2 ⋅ d) Transformer-XL

T5 Bias Relative Moderate O(n2 ) T5

RoPE Relative Good O(n ⋅ d) LLaMA, PaLM, Mistral

ALiBi Relative Excellent O(n2 ) BLOOM, MPT


Chapter 4 Summary

Positional encodings provide sequence order information to transformers

Sinusoidal encodings: $PE_{(pos, 2i)} = \sin(pos / 10000^{2i/d_{\text{model}}})$

Absolute encodings (sinusoidal, learned) extrapolate poorly beyond the training length

Relative encodings generalize better to longer sequences

RoPE is the current standard: $q_m^T k_n$ depends only on relative position $n - m$

ALiBi offers excellent extrapolation through linear distance penalties


Chapter 5: Training and Scaling Laws

Training large language models at scale requires understanding how model performance scales with
compute, data, and parameters. This chapter covers the empirical laws that govern LLM training.

Figure 5.1: The LLM lifecycle from pre-training through fine-tuning, alignment, and deployment.

5.1 Pre-training Objectives

5.1.1 Autoregressive Language Modeling

The standard pre-training objective for decoder-only models (GPT, LLaMA) is next-token prediction:

$$\mathcal{L} = -\sum_{i=1}^{N} \log P(x_i \mid x_1, \ldots, x_{i-1}; \theta) \tag{5.1}$$
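To make the objective concrete, the sketch below evaluates (5.1) for a toy sequence. The function `autoregressive_nll` is a hypothetical helper that takes, for each position, the model's predicted distribution over the vocabulary; a real training loop would obtain these from the network's softmax output.

```python
import math
from typing import Sequence

def autoregressive_nll(token_ids: Sequence[int],
                       next_token_probs: Sequence[Sequence[float]]) -> float:
    """Sum of -log P(x_i | x_<i); next_token_probs[i] is the distribution for position i."""
    return -sum(math.log(probs[tok]) for tok, probs in zip(token_ids, next_token_probs))

vocab_size = 4
tokens = [2, 0, 3]
# A model that knows nothing predicts a uniform distribution at every position.
uniform = [[1.0 / vocab_size] * vocab_size for _ in tokens]
loss = autoregressive_nll(tokens, uniform)
```

For a uniform predictor the loss is N · log(vocab_size), which gives a useful sanity check for the initial loss of an untrained model.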

5.1.2 Masked Language Modeling

Encoder models like BERT use masked language modeling:

$$\mathcal{L} = -\mathbb{E}_{x \sim \mathcal{D}} \sum_{i \in \mathcal{M}} \log P(x_i \mid x_{\setminus \mathcal{M}}; \theta) \tag{5.2}$$

where M is the set of masked positions.


5.1.3 Prefix LM and Encoder-Decoder

T5 uses a text-to-text format where the input can attend bidirectionally and the output is autoregressive:

$$\mathcal{L} = -\sum_{i=1}^{N_{\text{target}}} \log P(y_i \mid x, y_1, \ldots, y_{i-1}; \theta) \tag{5.3}$$

5.2 Scaling Laws Derivation

Kaplan et al. (2020) from OpenAI established the first scaling laws for language models, showing that loss
scales as a power law with model size, data, and compute.

5.2.1 Power Law Scaling

The test loss L follows a power law relationship with model size N :

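$$L(N) = \left(\frac{N_c}{N}\right)^{\alpha_N} + L_\infty \tag{5.4}$$

where $N_c$ is a critical parameter count, $\alpha_N \approx 0.076$, and $L_\infty$ is the irreducible loss.

Similarly for data size $D$:

$$L(D) = \left(\frac{D_c}{D}\right)^{\alpha_D} + L_\infty \tag{5.5}$$

where $\alpha_D \approx 0.095$.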

5.2.2 Joint Scaling Law

The combined scaling law for both model size and data is:

$$L(N, D) = \left[\left(\frac{N_c}{N}\right)^{\alpha_N/\alpha_D} + \frac{D_c}{D}\right]^{\alpha_D} + L_\infty \tag{5.6}$$

Figure 5.2: Scaling laws showing validation loss decreasing as a power law with training compute. Each point represents a
model checkpoint.

5.2.3 Compute-Optimal Training

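Training compute is well approximated by:

$$C \approx 6ND \tag{5.7}$$

FLOPs, where $N$ is the parameter count and $D$ is the number of training tokens (roughly $2ND$ for the forward pass and $4ND$ for the backward pass). For a fixed compute budget $C$, compute-optimal training chooses $N$ and $D$ to minimize the loss subject to this constraint.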

5.3 Chinchilla Scaling

Hoffmann et al. (2022) from DeepMind revisited scaling laws and found that previous models were
significantly undertrained.

5.3.1 Chinchilla Optimal

For a model with N parameters, the optimal number of training tokens is:

$$D_{\text{opt}} \approx 20N \tag{5.8}$$

This means a 70B parameter model should be trained on approximately 1.4 trillion tokens for compute-
optimal training.
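Combining the compute estimate C ≈ 6ND with the rule D ≈ 20N gives a closed form for the model size that a given budget supports. The sketch below is a back-of-the-envelope calculator under exactly those two approximations; `chinchilla_optimal` is an illustrative name.

```python
import math
from typing import Tuple

def chinchilla_optimal(compute_flops: float,
                       tokens_per_param: float = 20.0) -> Tuple[float, float]:
    """Solve C = 6 * N * D with D = tokens_per_param * N for (N, D)."""
    n_params = math.sqrt(compute_flops / (6.0 * tokens_per_param))
    return n_params, tokens_per_param * n_params

budget = 6 * 70e9 * 1.4e12          # FLOPs to train 70B parameters on 1.4T tokens
n_opt, d_opt = chinchilla_optimal(budget)
```

Because N scales with the square root of C under this rule, a 100x larger budget supports roughly a 10x larger model trained on 10x more tokens.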

Figure 5.3: Chinchilla scaling laws showing optimal model size and training tokens for a given compute budget. The dashed
line shows the optimal trade-off.


Table 5.1: Comparison of Model Training Regimes

Model Parameters Tokens Tokens/Param Chinchilla-Optimal?

GPT-3 175B 300B 1.7 Undertrained

Chinchilla 70B 1.4T 20 Optimal

PaLM 540B 780B 1.4 Undertrained

LLaMA-65B 65B 1.4T 21.5 Optimal

LLaMA-2-70B 70B 2T 28.6 Overtrained

5.4 Training at Scale

5.4.1 Training Data

Modern LLMs are trained on diverse web-scale datasets:

Table 5.2: Common Pre-training Data Sources

Source Description Typical % Quality

Common Crawl Web crawl data 60-80% Variable

C4 Cleaned Common Crawl 15-30% Medium

GitHub Code repositories 5-15% High

Wikipedia Encyclopedia articles 2-5% High

Books Book corpora 2-10% High

ArXiv Scientific papers 1-3% High

5.4.2 Data Quality and Filtering

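Data quality significantly impacts model performance. Key filtering techniques include near-duplicate removal and classifier-based quality scoring, covered in the next subsections, together with toxicity, language, and PII filtering.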

5.4.3 MinHash Deduplication


MinHash is used for efficient near-duplicate detection:

$$J(A, B) = \frac{|A \cap B|}{|A \cup B|} \approx \frac{|\text{minhash}(A) \cap \text{minhash}(B)|}{k} \tag{5.9}$$

where k is the number of hash functions.
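A minimal sketch of the estimator in (5.9) follows. It simulates k hash functions by salting MD5 with a seed, which is an assumption made for determinism and simplicity; production pipelines typically use faster hash families and add LSH banding on top. All function names are illustrative.

```python
import hashlib
from typing import List, Set

def _hash(seed: int, item: str) -> int:
    """One member of a family of hash functions, selected by seed."""
    digest = hashlib.md5(f"{seed}:{item}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")

def minhash_signature(items: Set[str], k: int = 256) -> List[int]:
    """Signature = minimum hash value of the set under each of k hash functions."""
    return [min(_hash(seed, item) for item in items) for seed in range(k)]

def estimate_jaccard(sig_a: List[int], sig_b: List[int]) -> float:
    """Fraction of signature positions on which the two sets agree."""
    matches = sum(1 for a, b in zip(sig_a, sig_b) if a == b)
    return matches / len(sig_a)

doc_a = {f"shingle-{i}" for i in range(0, 100)}
doc_b = {f"shingle-{i}" for i in range(50, 150)}
true_j = len(doc_a & doc_b) / len(doc_a | doc_b)
est_j = estimate_jaccard(minhash_signature(doc_a), minhash_signature(doc_b))
```

The estimate's standard error shrinks as 1/√k, so larger signatures trade memory for accuracy.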

5.4.4 Quality Scoring

Quality classifiers assign scores to documents:

$$\text{quality}(d) = \sigma(w^T \phi(d) + b) \tag{5.10}$$

where ϕ(d) are features like perplexity, readability, and language model scores.

5.4.5 Training Curriculum

Some approaches use curriculum learning, starting with high-quality data:

$$P(d) \propto \text{quality}(d)^{\alpha} \tag{5.11}$$

where α controls the sampling temperature.

In summary, the main data-filtering steps are:

Deduplication: Remove near-duplicate documents using MinHash LSH

Quality filtering: Use classifiers to remove low-quality content

Toxicity filtering: Remove harmful or inappropriate content

Language filtering: Select desired languages using langid

PII removal: Remove personally identifiable information


Figure 5.4: Training loss curves for different model sizes showing how larger models converge to lower validation loss.

Chapter 5 Summary

Next-token prediction is the standard pre-training objective: $\mathcal{L} = -\sum_i \log P(x_i \mid x_{<i}; \theta)$

Loss scales as a power law with model size and data: $L(N) = (N_c/N)^{\alpha_N} + L_\infty$

Compute-optimal training requires C ≈ 6ND FLOPs


Chinchilla scaling: optimal tokens ≈ 20 × parameters

Many large models (GPT-3, PaLM) are undertrained relative to Chinchilla-optimal

Data quality filtering is as important as quantity


Chapter 6: Distributed Training

Training billion-parameter models requires distributing computation across hundreds or thousands of GPUs.
This chapter covers the parallelism strategies and memory optimization techniques that make large-scale
training feasible.

Figure 6.1: Overview of distributed training strategies including data, tensor, and pipeline parallelism.

6.1 Data Parallelism

Data parallelism replicates the model across multiple GPUs, with each GPU processing a different batch of
data.

6.1.1 Basic Data Parallel Training


Algorithm 4: Data Parallel Training (DDP)

// Each GPU holds a full copy of the model
for each batch (x, y):
    // Split the batch across the k GPUs
    (x_1, y_1), …, (x_k, y_k) = split((x, y), k)

    // Forward and backward pass on each GPU (in parallel)
    for i = 1 to k in parallel:
        ŷ_i = model(x_i)
        L_i = loss(ŷ_i, y_i)
        g_i = backward(L_i)

    // Synchronize gradients: all_reduce returns the mean over GPUs, see (6.1)
    g = all_reduce(g_1, …, g_k)

    // Update parameters (identical on all GPUs)
    optimizer.step(g)

6.1.2 All-Reduce Operation

The all-reduce operation aggregates gradients across GPUs. Efficient implementations use ring-allreduce or
tree-based algorithms.

$$\text{all\_reduce}(g_1, \ldots, g_k) = \frac{1}{k}\sum_{j=1}^{k} g_j \tag{6.1}$$

Ring All-Reduce Complexity

Ring all-reduce achieves the optimal bandwidth for gradient synchronization:

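$$\text{Time}_{\text{ring}} = \frac{2(n-1)}{n} \cdot \frac{\text{data\_size}}{\text{bandwidth}}$$

where $n$ is the number of GPUs. For large $n$, this approaches $2 \cdot \text{data\_size}/\text{bandwidth}$, which is optimal. The algorithm works by: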

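1. The gradient buffer on each GPU is split into n chunks, and each GPU sends one chunk to the next GPU in the ring

2. Each GPU adds the chunk it receives to its own copy of that chunk

3. After n − 1 steps, each GPU holds the fully reduced sum for one chunk (reduce-scatter)

4. The reduced chunks are passed around the ring for another n − 1 steps until every GPU has all of them (all-gather)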
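The ring schedule can be simulated in a few lines. The sketch below is a single-process illustration under simplifying assumptions: each worker's buffer is split into exactly n scalar chunks, all sends in a step happen simultaneously, and the result is a sum (dividing by n afterwards gives the mean in (6.1)). The function name `ring_all_reduce` is illustrative.

```python
from typing import List

def ring_all_reduce(buffers: List[List[float]]) -> List[List[float]]:
    """Simulate ring all-reduce; buffers[r][c] is worker r's value for chunk c."""
    n = len(buffers)
    data = [list(b) for b in buffers]

    # Reduce-scatter: after n - 1 steps worker r holds the full sum of chunk (r + 1) % n.
    for step in range(n - 1):
        msgs = [(r, (r - step) % n, data[r][(r - step) % n]) for r in range(n)]
        for r, chunk, value in msgs:
            data[(r + 1) % n][chunk] += value

    # All-gather: circulate the reduced chunks until every worker has all of them.
    for step in range(n - 1):
        msgs = [(r, (r + 1 - step) % n, data[r][(r + 1 - step) % n]) for r in range(n)]
        for r, chunk, value in msgs:
            data[(r + 1) % n][chunk] = value
    return data

result = ring_all_reduce([[1, 2, 3], [10, 20, 30], [100, 200, 300]])
```

Each worker sends one chunk per step for 2(n − 1) steps, which matches the 2(n − 1)/n factor in the time formula.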

6.2 Tensor and Pipeline Parallelism

When models exceed single-GPU memory, we must partition the model itself across GPUs.

6.2.1 Tensor Parallelism

Tensor parallelism splits individual layers across GPUs. For a linear layer Y = XW :

$$W = [W_1 \mid W_2 \mid \cdots \mid W_k] \;\Rightarrow\; Y = [XW_1 \mid XW_2 \mid \cdots \mid XW_k] \tag{6.2}$$

Megatron-LM implements efficient tensor parallelism for transformer layers by splitting attention heads and
FFN weights.
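The column split in (6.2) can be demonstrated with plain lists. This sketch is illustrative only: each "GPU" is just a list entry, the helper names are hypothetical, and the column count is assumed to divide evenly by k.

```python
from typing import List

Matrix = List[List[float]]

def matmul(a: Matrix, b: Matrix) -> Matrix:
    return [[sum(a[i][t] * b[t][j] for t in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]

def split_columns(w: Matrix, k: int) -> List[Matrix]:
    """Split W into k column blocks (assumes the column count is divisible by k)."""
    cols = len(w[0]) // k
    return [[row[s * cols:(s + 1) * cols] for row in w] for s in range(k)]

def concat_columns(parts: List[Matrix]) -> Matrix:
    """Concatenate the per-shard outputs side by side."""
    return [sum((p[i] for p in parts), []) for i in range(len(parts[0]))]

X = [[1.0, 2.0], [3.0, 4.0]]
W = [[1.0, 0.0, 2.0, 1.0], [0.0, 1.0, 1.0, 3.0]]
shards = split_columns(W, k=2)                       # one shard per device
Y_parallel = concat_columns([matmul(X, w_s) for w_s in shards])
Y_full = matmul(X, W)
```

Each shard needs the full input X but only its own slice of W, so no communication is required until the outputs are gathered.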

6.2.2 Pipeline Parallelism

Pipeline parallelism assigns different layers to different GPUs:

$$\text{GPU 1}: \text{Layers } 1 \text{ to } L/k, \quad \text{GPU 2}: \text{Layers } L/k + 1 \text{ to } 2L/k, \quad \ldots \tag{6.3}$$

The main challenge is pipeline bubbles (idle time). Techniques like GPipe and PipeDream reduce bubbles
through micro-batching.


Figure 6.2: Memory requirements per GPU for different parallelism strategies across various model sizes. ZeRO-3 provides
the best scaling.

6.3 ZeRO and FSDP

ZeRO (Zero Redundancy Optimizer) and FSDP (Fully Sharded Data Parallel) optimize memory usage by
partitioning optimizer states, gradients, and parameters.

6.3.1 ZeRO Stages

Table 6.1: ZeRO Memory Optimization Stages

Stage Partitions Memory Formula Reduction

Baseline None 16Ψ + activations 1x

ZeRO-1 Optimizer states 4Ψ + 12Ψ/k + activations ~4x

ZeRO-2 + Gradients 2Ψ + 14Ψ/k + activations ~8x

ZeRO-3 + Parameters 16Ψ/k + activations kx

6.3.2 Memory Breakdown

For a model with Ψ parameters trained with Adam in mixed precision:

Model parameters: 2Ψ bytes (FP16)

Gradients: 2Ψ bytes (FP16)

Optimizer states: 12Ψ bytes (FP32 copy + momentum + variance)

Activations: Variable (depends on batch size and sequence length)

Total without ZeRO: 16Ψ bytes + activations.

Adam Optimizer State Breakdown

For each parameter, mixed-precision training with Adam stores:

FP32 parameter copy: 4 bytes

FP16 parameter: 2 bytes

FP16 gradient: 2 bytes

FP32 momentum (m): 4 bytes

FP32 variance (v ): 4 bytes

Total: 16 bytes per parameter. With ZeRO-3, only 16Ψ/k bytes per GPU.
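The memory formulas in Table 6.1 can be evaluated directly. The sketch below covers model states only (activations are excluded), and `zero_bytes_per_gpu` is an illustrative helper rather than part of any library.

```python
def zero_bytes_per_gpu(n_params: float, n_gpus: int, stage: int) -> float:
    """Model-state bytes per GPU for ZeRO stages 0 (baseline) to 3, excluding activations."""
    psi, k = n_params, n_gpus
    if stage == 0:
        return 16 * psi                   # everything replicated
    if stage == 1:
        return 4 * psi + 12 * psi / k     # optimizer states sharded
    if stage == 2:
        return 2 * psi + 14 * psi / k     # + gradients sharded
    if stage == 3:
        return 16 * psi / k               # + parameters sharded
    raise ValueError("stage must be 0, 1, 2, or 3")

# Example: a 7B-parameter model on 64 GPUs, in gigabytes per GPU
gb = {stage: zero_bytes_per_gpu(7e9, 64, stage) / 1e9 for stage in range(4)}
```

For this example the per-GPU model state drops from 112 GB at baseline to under 2 GB with ZeRO-3, at the cost of extra communication to gather parameters.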

6.3.3 FSDP (PyTorch)

FSDP is PyTorch's native implementation of ZeRO-3:

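import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp import MixedPrecision
from torch.distributed.fsdp.wrap import size_based_auto_wrap_policy

# Assumes dist.init_process_group() has been called and `model` is an nn.Module.
model = FSDP(
    model,
    auto_wrap_policy=size_based_auto_wrap_policy,
    # mixed_precision expects a MixedPrecision config, not a bare dtype
    mixed_precision=MixedPrecision(param_dtype=torch.bfloat16),
    device_id=torch.cuda.current_device(),
    limit_all_gathers=True,
)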

6.4 Memory Optimization

6.4.1 Activation Checkpointing

Activation checkpointing (gradient checkpointing) trades computation for memory by recomputing activations during the backward pass:

$$\text{Memory}_{\text{checkpoint}} = O(1) \text{ per layer} \quad \text{vs} \quad \text{Memory}_{\text{no\_checkpoint}} = O(L) \text{ for } L \text{ layers} \tag{6.4}$$

Cost: ~20-30% additional forward computation.

6.4.2 Mixed Precision Training

Mixed precision uses FP16/BF16 for forward/backward and FP32 for optimizer states:

Table 6.2: Floating Point Formats Comparison

Format Exponent Mantissa Range Use Case

FP32 8 bits 23 bits ~10^±38 Master weights

FP16 5 bits 10 bits ~10^±5 Compute

BF16 8 bits 7 bits ~10^±38 Compute (preferred)

BF16 is preferred for LLMs due to better numerical stability (same exponent range as FP32).

6.4.3 Flash Attention

Flash Attention reduces memory usage and improves speed through IO-aware algorithms:

Standard attention: $O(N^2)$ memory for attention matrix

Flash attention: $O(N)$ memory through tiling and recomputation


Typical speedup: 2-4x on A100 GPUs


Chapter 7: Alignment and Fine-tuning

Pre-trained language models require alignment to follow instructions and behave safely. This chapter covers
supervised fine-tuning, RLHF, and alternative alignment methods with full mathematical derivations.

Figure 7.1: Reward model training progress showing accuracy improvement over training steps. Human-level agreement is
typically achieved around 80% accuracy.

7.1 Supervised Fine-tuning (SFT)

SFT adapts pre-trained models to follow instructions using labeled (prompt, response) pairs.

7.1.1 SFT Objective

$$\mathcal{L}_{SFT} = -\sum_{(x, y) \in \mathcal{D}_{SFT}} \sum_{t=1}^{|y|} \log P(y_t \mid x, y_{<t}; \theta) \tag{7.1}$$

where $x$ is the instruction/prompt and $y$ is the desired response.

7.1.2 SFT Dataset Construction

High-quality SFT data is crucial for model performance:

Human-written: Expert annotators write responses

Distillation: Use stronger models (GPT-4) to generate training data

Rejection sampling: Generate multiple responses, select best

Engineering Pitfall

Overfitting during SFT is common. Typical SFT uses 10-100K examples for 1-3 epochs with
learning rates 10-100x smaller than pre-training (e.g., 1e-5 to 1e-6). Too much SFT can cause
catastrophic forgetting of pre-trained knowledge. Monitor validation loss on held-out tasks to detect
overfitting.

7.2 Reinforcement Learning from Human Feedback (RLHF)


RLHF further aligns models with human preferences using reinforcement learning. The pipeline consists of
two stages: reward model training and policy optimization.

7.2.1 Reward Model Training

Given pairs $(y_w, y_l)$ where $y_w$ is preferred over $y_l$, we train a reward model using the Bradley-Terry model:

$$P(y_w \succ y_l \mid x) = \sigma(r_\theta(x, y_w) - r_\theta(x, y_l)) \tag{7.2}$$

The loss function is:

$$\mathcal{L}_R = -\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}\left[\log \sigma(r_\theta(x, y_w) - r_\theta(x, y_l))\right] \tag{7.3}$$


Bradley-Terry Model Derivation

The Bradley-Terry model assumes each item has an underlying "strength" parameter. The probability
that item i beats item j is:

$$P(i \succ j) = \frac{\pi_i}{\pi_i + \pi_j}$$

Setting $\pi_i = e^{r_i}$ where $r_i$ is the reward:

$$P(i \succ j) = \frac{e^{r_i}}{e^{r_i} + e^{r_j}} = \frac{1}{1 + e^{-(r_i - r_j)}} = \sigma(r_i - r_j)$$

This logistic form is convenient for optimization because it depends only on the difference in rewards; equal rewards give a preference probability of exactly 1/2.
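The preference probability and the corresponding reward-model loss can be sketched in a few lines. The function names below are illustrative, and the scalar rewards stand in for the outputs of a reward model on the chosen and rejected responses.

```python
import math
from typing import List, Tuple

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def preference_prob(r_w: float, r_l: float) -> float:
    """Bradley-Terry probability that the response with reward r_w is preferred."""
    return sigmoid(r_w - r_l)

def reward_model_loss(pairs: List[Tuple[float, float]]) -> float:
    """Mean negative log-likelihood over (chosen reward, rejected reward) pairs."""
    return -sum(math.log(preference_prob(r_w, r_l)) for r_w, r_l in pairs) / len(pairs)

p_equal = preference_prob(1.0, 1.0)
p_shifted_a = preference_prob(2.0, 0.0)
p_shifted_b = preference_prob(5.0, 3.0)
loss_at_init = reward_model_loss([(0.0, 0.0)])
```

Adding a constant to both rewards leaves the probability unchanged, so the reward scale is only identified up to an additive offset.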

7.2.2 PPO Training

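Proximal Policy Optimization (PPO) updates the policy while preventing large deviations from the policy that generated the samples:

$$\mathcal{L}_{PPO} = \mathbb{E}\left[\min\left(r_t(\theta) A_t,\ \text{clip}(r_t(\theta), 1 - \epsilon, 1 + \epsilon) A_t\right)\right] \tag{7.4}$$

where $r_t(\theta) = \frac{\pi_\theta(y \mid x)}{\pi_{\theta_{\text{old}}}(y \mid x)}$ is the probability ratio between the current policy and the policy used for sampling, and $A_t$ is the advantage.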

7.2.3 KL Divergence Penalty

To prevent reward hacking (exploiting the reward model), a KL penalty is added to the reward:

$$r(x, y) = r_\theta(x, y) - \beta \log \frac{\pi_\theta(y \mid x)}{\pi_{\text{ref}}(y \mid x)} \tag{7.5}$$

In expectation over $y \sim \pi_\theta$, the penalty term equals $\beta D_{KL}(\pi_\theta \| \pi_{\text{ref}})$.

Figure 7.2: RLHF training dynamics showing raw reward increasing but KL divergence also increasing. The penalized
reward peaks at the optimal stopping point.

7.3 Direct Preference Optimization (DPO)

DPO eliminates the need for explicit reward modeling and RL training by directly optimizing from
preferences.

7.3.1 DPO Derivation

DPO derives a closed-form solution for the optimal policy under the Bradley-Terry model. The optimal RL policy is:

$$\pi^*(y \mid x) = \frac{1}{Z(x)} \pi_{\text{ref}}(y \mid x) \exp\left(\frac{1}{\beta} r(x, y)\right) \tag{7.6}$$

where $Z(x) = \sum_y \pi_{\text{ref}}(y \mid x) \exp(r(x, y)/\beta)$ is the partition function.

DPO Objective Derivation

Solving for the reward from the optimal policy:

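$$r(x, y) = \beta \log \frac{\pi^*(y \mid x)}{\pi_{\text{ref}}(y \mid x)} + \beta \log Z(x)$$

Substituting into the Bradley-Terry preference probability:

$$P(y_w \succ y_l \mid x) = \sigma(r(x, y_w) - r(x, y_l))$$

$$= \sigma\left(\beta \log \frac{\pi^*(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi^*(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)}\right)$$

The partition function $Z(x)$ appears in both rewards and cancels out. Replacing $\pi^*$ with the trainable policy $\pi_\theta$ and maximizing the likelihood of the observed preferences gives the DPO objective:

$$\mathcal{L}_{DPO} = -\mathbb{E}\left[\log \sigma\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)}\right)\right]$$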

7.3.2 DPO Objective

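$$\mathcal{L}_{DPO} = -\mathbb{E}_{(x, y_w, y_l)}\left[\log \sigma\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)}\right)\right] \tag{7.7}$$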

Key Takeaway

DPO achieves similar performance to RLHF with simpler training (no reward model, no PPO). The
key insight is that the optimal RL policy under a Bradley-Terry preference model can be expressed
in closed form, allowing direct optimization. DPO is more stable than PPO and requires less
hyperparameter tuning.
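For a single preference pair, the DPO loss reduces to a few arithmetic operations on sequence log-probabilities. The sketch below assumes those log-probabilities have already been computed by the policy and the frozen reference model; `dpo_loss` and its argument names are illustrative, and β = 0.1 is only an example value.

```python
import math

def dpo_loss(policy_logp_w: float, policy_logp_l: float,
             ref_logp_w: float, ref_logp_l: float, beta: float = 0.1) -> float:
    """-log sigmoid(beta * (log-ratio of chosen - log-ratio of rejected)) for one pair."""
    margin = beta * ((policy_logp_w - ref_logp_w) - (policy_logp_l - ref_logp_l))
    return math.log1p(math.exp(-margin))   # equals -log(sigmoid(margin))

# Policy identical to the reference: the margin is zero and the loss is log 2.
loss_start = dpo_loss(-5.0, -7.0, -5.0, -7.0)
# Policy has raised the chosen response and lowered the rejected one.
loss_better = dpo_loss(-4.0, -8.0, -5.0, -7.0)
```

Only log-probability ratios against the reference enter the loss, which is how the reference model acts as an implicit KL anchor.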


Figure 7.3: Training convergence comparison: DPO converges faster than RLHF (PPO) but RLHF is more stable in later
stages.

7.4 Constitutional AI and RLAIF


Constitutional AI uses AI feedback instead of human feedback for alignment.

7.4.1 Constitutional AI Process

1. Supervised stage: Model critiques and revises its own responses based on constitutional principles

2. RL stage: Train preference model on AI-labeled preferences, then RLHF

7.4.2 Example Constitutional Principles

"Choose the response that is most helpful, honest, and harmless"

"Avoid responses that promote illegal, violent, or hateful content"

"Prefer responses that acknowledge uncertainty rather than making things up"

7.4.3 RLAIF (RL from AI Feedback)

RLAIF uses a pre-trained LLM to generate preference labels:

1. Generate multiple responses to prompts

2. Use LLM to rank responses based on criteria


4. Apply standard RLHF pipeline

Table 7.1: Comparison of Alignment Methods

Method Human Labels Reward Model RL Training Complexity

SFT Yes No No Low

RLHF Yes Yes Yes (PPO) High

DPO Yes No No Medium

Constitutional AI No Yes (AI) Yes High

RLAIF No Yes (AI) Yes High

Chapter 7 Summary

SFT adapts models to follow instructions using labeled data

RLHF uses human preferences to train a reward model: $\mathcal{L}_R = -\mathbb{E}[\log \sigma(r_\theta(x, y_w) - r_\theta(x, y_l))]$

PPO optimizes policy with KL penalty: $r(x, y) = r_\theta(x, y) - \beta D_{KL}(\pi_\theta \| \pi_{\text{ref}})$

DPO directly optimizes from preferences without explicit reward modeling

Constitutional AI and RLAIF reduce reliance on human labels

KL penalty prevents reward hacking during RL training

Chapter 8: Inference Optimization

Serving LLMs efficiently requires optimizing memory usage, reducing computation, and maximizing throughput. This chapter covers the key techniques for production inference.

Figure 8.1: Overview of inference optimization techniques and their combined impact on latency and throughput.

8.1 KV Cache

The KV cache stores key and value tensors from previous tokens to avoid recomputation during
autoregressive generation.

8.1.1 Why KV Cache Matters

Without caching, generating each new token recomputes attention for the whole prefix:

$$\text{FLOPs per token without cache} = O(n^2 \cdot d) \tag{8.1}$$

With KV cache, only the new token's query, key, and value need computation, and the query attends over the cached keys and values:

$$\text{FLOPs per token with cache} = O(n \cdot d) \tag{8.2}$$

8.1.2 KV Cache Memory

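The memory required for KV cache grows with sequence length:

$$\text{Memory}_{KV} = 2 \cdot b \cdot n \cdot l \cdot h \cdot d_h \cdot \text{bytes per element} \tag{8.3}$$

where $b$ = batch size, $n$ = sequence length, $l$ = layers, $h$ = heads, $d_h$ = head dimension. The leading factor of 2 accounts for storing both keys and values.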


KV Cache Memory Calculation

For a 7B parameter model (LLaMA-2 7B):

Layers: l = 32
Heads: h = 32
Head dimension: d_h = 128
Precision: FP16 = 2 bytes

For batch size b = 1, sequence length n = 4096:

$$\text{Memory}_{KV} = 2 \cdot 1 \cdot 4096 \cdot 32 \cdot 32 \cdot 128 \cdot 2 \text{ bytes} \approx 2.15 \text{ GB}$$

For n = 32768 (32K context):

$$\text{Memory}_{KV} \approx 17.2 \text{ GB}$$

This exceeds the memory of most consumer GPUs!
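The calculation above generalizes to a one-line helper. In this sketch, `kv_cache_bytes` is an illustrative name; the parameter `n_kv_heads` is the number of key-value heads, which equals the number of attention heads for standard multi-head attention.

```python
def kv_cache_bytes(batch: int, seq_len: int, n_layers: int,
                   n_kv_heads: int, head_dim: int, bytes_per_elem: int = 2) -> int:
    """Bytes needed to cache keys and values (factor 2) for every layer and position."""
    return 2 * batch * seq_len * n_layers * n_kv_heads * head_dim * bytes_per_elem

# Configuration from the worked example: 32 layers, 32 heads, head dimension 128, FP16
mem_4k = kv_cache_bytes(1, 4096, 32, 32, 128)
mem_32k = kv_cache_bytes(1, 32768, 32, 32, 128)
```

The cache grows linearly in both batch size and sequence length, so serving many long requests concurrently is usually limited by this term rather than by the model weights.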

Figure 8.2: KV cache memory growth with sequence length for different model sizes. Long contexts require significant
memory.

8.1.3 Multi-Query Attention (MQA)

MQA shares a single key head and a single value head across all query heads:

Standard MHA: h key heads, h value heads

MQA: 1 key head, 1 value head (shared)

Memory reduction: h× (typically 8-32x)

8.1.4 Grouped-Query Attention (GQA)

GQA is a middle ground between MHA and MQA, grouping query heads to share keys/values:

$$\text{query heads per KV head} = g, \quad \text{KV heads} = h/g \tag{8.4}$$

LLaMA-2 70B uses GQA with h = 64 query heads sharing 8 key-value heads (g = 8)

Balances memory efficiency with quality

Memory reduction: g×

8.2 Quantization

Quantization reduces model precision to decrease memory and increase speed.

8.2.1 Quantization Methods

Table 8.1: Quantization Precision Formats

Format Bits Memory Reduction Quality Impact Use Case

FP32 32 1x (baseline) None Training

FP16/BF16 16 2x Minimal Standard inference

INT8 8 4x Small (<1%) Production

INT4 4 8x Moderate (2-5%) Edge deployment

NF4/GGUF 4 8x Small-Moderate Consumer GPUs


8.2.2 Post-Training Quantization (PTQ)

PTQ quantizes a pre-trained model without retraining:

$$w_{\text{quant}} = \text{round}\left(\frac{w - z}{s}\right), \quad s = \frac{\max(w) - \min(w)}{2^n - 1} \tag{8.5}$$

where s is the scale and z is the zero point.
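A minimal sketch of (8.5) for a list of weights follows. It assumes the zero point is the minimum weight, so that quantized values land in the range 0 to 2^n − 1, and that the weights are not all equal (otherwise the scale would be zero). The helper names are illustrative; real PTQ methods typically quantize per channel or per group and calibrate on data.

```python
from typing import List, Tuple

def quantize(weights: List[float], n_bits: int = 8) -> Tuple[List[int], float, float]:
    """Map floats to integers in [0, 2^n - 1] using a scale and a zero point."""
    z = min(weights)                              # assumption: zero point = min(w)
    s = (max(weights) - z) / (2 ** n_bits - 1)    # scale from (8.5)
    q = [round((w - z) / s) for w in weights]
    return q, s, z

def dequantize(q: List[int], s: float, z: float) -> List[float]:
    """Approximate reconstruction of the original weights."""
    return [s * qi + z for qi in q]

weights = [-0.62, -0.10, 0.0, 0.27, 0.91]
q, s, z = quantize(weights)
restored = dequantize(q, s, z)
```

The reconstruction error of each weight is bounded by half the scale, which is why outlier weights (which inflate the scale) are so harmful to low-bit quantization.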

8.2.3 Quantization-Aware Training (QAT)

QAT simulates quantization during training to improve accuracy:

$$w_{\text{fake}} = s \cdot \text{round}\left(\frac{w}{s}\right) \tag{8.6}$$

Key Takeaway

INT8 quantization typically achieves near-lossless compression for inference. INT4/4-bit methods
(GGUF, AWQ, GPTQ) enable running 70B models on consumer GPUs with acceptable quality
degradation. GPTQ uses second-order information for better quantization.

8.3 Flash Attention

Flash Attention optimizes the attention computation through IO-aware algorithms and tiling.

8.3.1 Standard Attention Memory

Standard attention materializes the full N × N attention matrix:

$$S = QK^T \in \mathbb{R}^{N \times N}, \quad P = \text{softmax}(S) \in \mathbb{R}^{N \times N} \tag{8.7}$$

Memory: $O(N^2)$ for the attention matrix alone.

8.3.2 Flash Attention Algorithm

Flash Attention uses tiling and online softmax to avoid materializing the full matrix (see Algorithm 5).

8.3.3 Memory and Speedup

Table 8.2: Flash Attention Memory and Speed Comparison

Sequence Length Standard Memory Flash Memory Speedup

1K 4 MB 0.5 MB 1.2x

4K 64 MB 2 MB 2.0x

16K 1 GB 8 MB 2.5x

64K 16 GB 32 MB 3.0x

8.3.4 Flash Attention-2

Flash Attention-2 further optimizes by:

Reducing non-matmul FLOPs


Better parallelism across sequence length

Improved work partitioning

Speedup over Flash Attention: 1.5-2x on A100.

Algorithm 5: Flash Attention (Simplified)


// Tile Q, K, V into blocks that fit in SRAM
for each block Q_i of Q:
    m = −∞, l = 0, O_i = 0
    for each block K_j, V_j:
        S_ij = Q_i K_j^T                                  // On-chip compute
        m_new = max(m, rowmax(S_ij))
        P_ij = exp(S_ij − m_new)
        l_new = l ⋅ exp(m − m_new) + rowsum(P_ij)
        O_i = (1 / l_new) ⋅ (l ⋅ exp(m − m_new) ⋅ O_i + P_ij V_j)
        m = m_new, l = l_new

Key insight: Compute attention in blocks that fit in SRAM (fast memory), avoiding slow HBM reads/writes.
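The online-softmax update at the heart of Algorithm 5 can be checked against ordinary softmax attention. The sketch below handles a single query row with scalar values per key, which is a deliberate simplification: it illustrates the running-max and running-sum bookkeeping, not the GPU tiling itself. Function names are illustrative.

```python
import math
from typing import List

def attention_row_reference(scores: List[float], values: List[float]) -> float:
    """Standard softmax-weighted sum, materializing all weights at once."""
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)

def attention_row_tiled(scores: List[float], values: List[float], block_size: int) -> float:
    """Same result computed block by block with running max m and running sum l."""
    m, l, o = float("-inf"), 0.0, 0.0
    for start in range(0, len(scores), block_size):
        s_blk = scores[start:start + block_size]
        v_blk = values[start:start + block_size]
        m_new = max(m, max(s_blk))
        p = [math.exp(s - m_new) for s in s_blk]
        scale = math.exp(m - m_new)               # rescales earlier partial results
        l_new = l * scale + sum(p)
        o = (l * scale * o + sum(pi * vi for pi, vi in zip(p, v_blk))) / l_new
        m, l = m_new, l_new
    return o

scores = [0.5, 2.0, -1.0, 3.0, 0.0, 1.5]
values = [1.0, -2.0, 0.5, 4.0, 3.0, -1.0]
reference = attention_row_reference(scores, values)
tiled = [attention_row_tiled(scores, values, b) for b in (1, 2, 4)]
```

The block size changes only how the work is scheduled, not the result, which is why the tiled computation needs memory proportional to the block rather than to the full row.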


8.4 Speculative Decoding

Speculative decoding accelerates inference by drafting multiple tokens with a small model, then verifying
with the large model.

8.4.1 Speculative Decoding Algorithm

1. Use small draft model to generate K candidate tokens

2. Large model verifies all K tokens in parallel

3. Accept tokens up to first rejection

4. Resample rejected token and continue

8.4.2 Speedup Analysis

Theoretical speedup depends on draft model acceptance rate α :

$$\text{Speedup} \approx \frac{1}{1 - \alpha + \frac{c_{\text{draft}}}{c_{\text{target}}} \cdot K} \tag{8.8}$$

where $c_{\text{draft}}/c_{\text{target}}$ is the cost ratio of draft to target model.
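Equation (8.8) is easy to explore numerically. The helper below simply evaluates the handbook's approximation; the function name and the example values (80% acceptance, 4 draft tokens, a draft model costing 5% of the target) are illustrative assumptions, not measurements.

```python
def speculative_speedup(acceptance_rate: float, n_draft_tokens: int,
                        cost_ratio: float) -> float:
    """Evaluate 1 / (1 - alpha + (c_draft / c_target) * K) from (8.8)."""
    return 1.0 / (1.0 - acceptance_rate + cost_ratio * n_draft_tokens)

speedup = speculative_speedup(acceptance_rate=0.8, n_draft_tokens=4, cost_ratio=0.05)
```

Under this approximation, a cheaper draft model or a higher acceptance rate both shrink the denominator, while drafting more tokens only helps if the draft cost stays small.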


​ ​

Figure 8.3: Speculative decoding speedup as a function of draft token acceptance rate for different numbers of draft tokens
(k).


Medusa: Train multiple heads to predict future tokens

Prompt lookup: Match against prompt for token copying

Figure 8.4: Latency versus throughput trade-offs with different batch sizes. Larger batches improve throughput but increase
latency per request.

Chapter 8 Summary

KV cache reduces per-token computation from O(n²) to O(n)

MQA/GQA reduce KV cache memory by sharing key-value heads

Quantization (INT8, INT4) reduces memory 2-8x with minimal quality loss

Flash Attention reduces attention memory from O(N²) to O(N)

Speculative decoding achieves 2-3x speedup with draft models

Combined optimizations can achieve 10-20x inference speedup


Chapter 9: RAG, Agents, and Tool Use

Extending LLMs with external knowledge and capabilities enables more accurate, up-to-date, and actionable
responses. This chapter covers retrieval-augmented generation, tool calling, and agent architectures.

Figure 9.1: Retrieval-Augmented Generation architecture showing the retrieve-augment-generate pipeline.

9.1 Retrieval-Augmented Generation

RAG grounds LLM responses in external knowledge, reducing hallucinations and enabling access to
information beyond the training data.

9.1.1 RAG Pipeline

1. Indexing: Documents are chunked, embedded, and stored in a vector database

2. Retrieval: Query is embedded and similar documents are retrieved

3. Augmentation: Retrieved documents are added to the prompt context


4. Generation: LLM generates response grounded in retrieved context
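The pipeline above can be sketched end to end with a toy retriever. This is only an illustration under loud assumptions: `embed` uses bag-of-words counts as a stand-in for a learned embedding model, the in-memory list stands in for a vector database, and all function names are hypothetical. The generation step is left as a prompt string to be passed to whichever LLM is in use.

```python
import math
from collections import Counter

def chunk(text, size=8):
    """Indexing, part 1: split text into chunks of at most `size` words."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def embed(text):
    """Toy embedding: word counts (a stand-in for a real embedding model)."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def build_index(documents, size=8):
    """Indexing, part 2: store (chunk, embedding) pairs."""
    return [(c, embed(c)) for doc in documents for c in chunk(doc, size)]

def retrieve(index, query, top_k=2):
    """Retrieval: rank stored chunks by similarity to the query embedding."""
    q = embed(query)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [c for c, _ in ranked[:top_k]]

def augment(query, contexts):
    """Augmentation: place the retrieved chunks in the prompt context."""
    joined = "\n".join(f"- {c}" for c in contexts)
    return f"Context:\n{joined}\n\nQuestion: {query}\nAnswer:"

docs = ["paris is the capital of france", "tokyo is the capital of japan"]
index = build_index(docs)
prompt = augment("capital of france", retrieve(index, "capital of france", top_k=1))
```

In a real system, the exhaustive `sorted` call would typically be replaced by an approximate nearest neighbor index.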

9.1.2 Embedding Models


Table 9.1: Embedding Models for RAG

| Model | Dimensions | Context Length (tokens) | Best For |
|---|---|---|---|
| text-embedding-ada-002 | 1536 | 8192 | General purpose |
| text-embedding-3-large | 3072 | 8192 | High accuracy |
| e5-large-v2 | 1024 | 512 | Open source |
| bge-large-en | 1024 | 512 | Open source, English text |

9.1.3 Retrieval Algorithms

Approximate Nearest Neighbor (ANN) Search:

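$$\text{NN}(q) = \arg\min_{d \in D} \lVert e(q) - e(d) \rVert_2 \tag{9.1}$$

where $e(\cdot)$ is the embedding function and $D$ is the document set. ANN methods approximate this exact search, trading a small loss in recall for lower latency.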

Common ANN algorithms include HNSW, IVF, and product quantization.

Figure 9.2: Comparison of vector database indexing methods showing recall vs embedding dimensions.

9.2 Tool Calling


9.2.1 Tool Calling Format

Models are trained to output structured tool calls:

{
  "tool": "weather_api",
  "parameters": {
    "location": "San Francisco",
    "date": "2024-01-15"
  }
}

9.2.2 Tool Calling Loop

1. User provides query and available tools

2. LLM decides whether to call a tool or respond directly


3. If tool call: execute tool, return result to LLM

4. LLM synthesizes final response using tool output
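The loop above can be sketched as follows. Everything here is a stand-in for illustration: `stub_llm` imitates a model that requests one tool call in the JSON format from Section 9.2.1 and then answers, and `weather_api` returns canned data instead of calling a real service. A real deployment would replace both with an actual model call and actual tools.

```python
import json

def weather_api(location, date=None):
    """Hypothetical tool: returns canned data, not a real forecast."""
    return {"location": location, "temperature": 15, "condition": "cloudy"}

TOOLS = {"weather_api": weather_api}

def stub_llm(messages):
    """Stand-in for a model call: request the tool once, then answer."""
    tool_results = [m for m in messages if m["role"] == "tool"]
    if not tool_results:
        return json.dumps({"tool": "weather_api",
                           "parameters": {"location": "San Francisco"}})
    data = json.loads(tool_results[-1]["content"])
    return f"It is {data['temperature']} C and {data['condition']} in {data['location']}."

def run_tool_loop(query, llm=stub_llm, tools=TOOLS, max_steps=5):
    messages = [{"role": "user", "content": query}]
    for _ in range(max_steps):
        reply = llm(messages)
        try:
            call = json.loads(reply)
        except json.JSONDecodeError:
            return reply  # plain text means the model answered directly
        if not (isinstance(call, dict) and "tool" in call):
            return reply
        # Execute the requested tool and return its result to the model.
        result = tools[call["tool"]](**call.get("parameters", {}))
        messages.append({"role": "assistant", "content": reply})
        messages.append({"role": "tool", "content": json.dumps(result)})
    raise RuntimeError("tool loop did not finish within max_steps")

print(run_tool_loop("What is the weather in San Francisco?"))
```

The `max_steps` bound is a design choice to keep a misbehaving model from looping indefinitely.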

Figure 9.3: LLM agent tool calling loop showing the iterative decision-execute-respond cycle.

9.3 LLM Agents


Planning: Break down complex tasks into subtasks

Memory: Short-term (context) and long-term (vector store)

Tool use: Interact with external systems

Reflection: Evaluate and improve actions

9.4 ReAct and Chain-of-Thought

9.4.1 ReAct (Reasoning + Acting)

ReAct interleaves reasoning traces with actions:

Thought: I need to find the current weather in Paris to answer the question.
Action: weather_api(location="Paris")
Observation: {"temperature": 15, "condition": "cloudy"}
Thought: Now I have the weather information. I can provide the answer.
Final Answer: The current weather in Paris is 15°C and cloudy.

9.4.2 Chain-of-Thought Prompting

CoT encourages step-by-step reasoning:

Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls.
Each can has 3 tennis balls. How many tennis balls does he have now?
A: Roger started with 5 balls. 2 cans of 3 tennis balls each is 6 tennis balls.
5 + 6 = 11. The answer is 11.


Chapter 9 Summary

RAG grounds LLM responses in external knowledge sources

Vector databases enable efficient similarity search

Tool calling enables LLMs to interact with APIs and external systems

LLM agents combine reasoning, planning, memory, and tool use


ReAct interleaves reasoning traces with actions

Chain-of-thought prompting improves reasoning capabilities


Chapter 10: Multimodal LLMs

Extending language models to process and generate multiple modalities (vision, audio) enables richer
interactions and new capabilities. This chapter covers vision-language models and multimodal architectures.

10.1 Vision Transformers (ViT)


Vision Transformers apply the transformer architecture to images by treating image patches as tokens.

10.1.1 ViT Architecture

1. Split image into fixed-size patches (e.g., 16×16)

2. Flatten and linearly embed each patch


3. Add positional embeddings

4. Feed to standard transformer encoder

5. Use [CLS] token for classification

$$z_0 = [x_{\text{class}};\; x_p^1 E;\; x_p^2 E;\; \cdots;\; x_p^N E] + E_{\text{pos}} \tag{10.1}$$

where $x_p^i$ is the $i$-th image patch and $E$ is the patch embedding matrix.

10.1.2 Patch Embedding

For an image of size H × W and patch size P :

$$N = \frac{H \cdot W}{P^2} \tag{10.2}$$

Each patch is flattened and projected to dimension D :

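$$x_p^i \in \mathbb{R}^{P^2 \cdot C} \;\rightarrow\; x_p^i E \in \mathbb{R}^{D} \tag{10.3}$$

where $C$ is the number of image channels.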
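To make the shapes concrete, the patch count and flattened patch dimension can be computed for a given image size. The function name is hypothetical, and the sketch assumes the image dimensions are divisible by the patch size.

```python
def patch_shapes(height, width, patch, channels=3):
    """Return (N, P^2 * C): the number of patches and the flattened patch size."""
    if height % patch or width % patch:
        raise ValueError("image size must be divisible by patch size")
    num_patches = (height // patch) * (width // patch)
    patch_dim = patch * patch * channels
    return num_patches, patch_dim

# A 224x224 RGB image with 16x16 patches:
print(patch_shapes(224, 224, 16))  # (196, 768)
```

The patch embedding matrix E then has shape (P²·C) × D, so here it maps each 768-dimensional flattened patch to the model dimension D.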


10.2 Vision-Language Models

Vision-language models combine visual encoders with language models for multimodal understanding.

10.2.1 CLIP (Contrastive Language-Image Pre-training)

CLIP learns joint representations of images and text through contrastive learning:

$$L = \frac{1}{2}\left( -\sum_{i} \log \frac{\exp(\langle I_i, T_i \rangle/\tau)}{\sum_{j} \exp(\langle I_i, T_j \rangle/\tau)} - \sum_{i} \log \frac{\exp(\langle I_i, T_i \rangle/\tau)}{\sum_{j} \exp(\langle I_j, T_i \rangle/\tau)} \right) \tag{10.4}$$

where τ is a temperature parameter and ⟨I, T ⟩ is the cosine similarity between image and text embeddings.
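A small pure-Python sketch of this symmetric loss is given below. It follows the equation as written (sums over the batch rather than means), uses cosine similarity for ⟨I, T⟩, and assumes a temperature of 0.07 purely for illustration. The function name is hypothetical.

```python
import math

def clip_loss(image_emb, text_emb, tau=0.07):
    """Symmetric contrastive loss over N matched (image, text) pairs."""
    def normalize(v):
        n = math.sqrt(sum(x * x for x in v))
        return [x / n for x in v]

    imgs = [normalize(v) for v in image_emb]
    txts = [normalize(v) for v in text_emb]
    n = len(imgs)
    # sim[i][j]: cosine similarity of image i and text j, divided by tau
    sim = [[sum(a * b for a, b in zip(imgs[i], txts[j])) / tau
            for j in range(n)] for i in range(n)]

    def neg_log_softmax_diag(rows):
        total = 0.0
        for i, row in enumerate(rows):
            m = max(row)  # subtract the max for numerical stability
            lse = m + math.log(sum(math.exp(s - m) for s in row))
            total -= row[i] - lse
        return total

    image_to_text = neg_log_softmax_diag(sim)
    text_to_image = neg_log_softmax_diag([list(col) for col in zip(*sim)])
    return 0.5 * (image_to_text + text_to_image)
```

Correctly matched pairs give a loss near zero, while mismatched pairs give a large loss, which is the signal that pulls matching image and text embeddings together.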

10.2.2 LLaVA (Large Language and Vision Assistant)

LLaVA connects a vision encoder (CLIP) to an LLM with a simple projection layer:

Image → CLIP Vision Encoder → Projection → LLM Token Space → Language Model

10.2.3 Multimodal GPT

Multimodal GPT models process images as additional tokens:

Image is encoded into visual tokens

Visual tokens are interleaved with text tokens


Standard autoregressive training on multimodal data

10.3 Audio and Speech

Speech models convert audio to text and vice versa, enabling voice interfaces.

10.3.1 Whisper (OpenAI)

Whisper is an encoder-decoder transformer for speech recognition:

Log-Mel spectrogram as input


Encoder processes audio features

Decoder generates text tokens autoregressively

Trained on 680,000 hours of multilingual audio

Chapter 10 Summary

ViT treats image patches as tokens for transformer processing

CLIP learns joint image-text representations via contrastive learning

LLaVA connects vision encoders to LLMs with projection layers

Multimodal GPT models process images as visual tokens


Whisper uses encoder-decoder architecture for speech recognition


Chapter 11: Evaluation and Safety

Evaluating LLMs requires diverse benchmarks covering knowledge, reasoning, coding, and safety. This
chapter covers evaluation methodologies and safety considerations.

Figure 11.1: LLM performance across key evaluation benchmarks comparing GPT-4, GPT-3.5, and Llama-2-70B.

11.1 Evaluation Benchmarks

11.1.1 Knowledge and Reasoning Benchmarks


Table 11.1: Key LLM Evaluation Benchmarks

| Benchmark | Measures | Format | Examples |
|---|---|---|---|
| MMLU | Knowledge (57 subjects) | Multiple choice | 15,908 |
| GSM8K | Math reasoning | Open response | 8,500 |
| HumanEval | Code generation | Function completion | 164 |
| HellaSwag | Common sense | Sentence completion | 10,000 |
| TruthfulQA | Factuality | QA | 817 |
| BBH | Complex reasoning | Various | 23 tasks |

11.1.2 MMLU (Massive Multitask Language Understanding)

MMLU tests knowledge across 57 subjects from elementary to professional level:

STEM: mathematics, physics, chemistry, biology, computer science


Humanities: history, philosophy, law, politics

Social Sciences: psychology, economics, geography

Other: medicine, accounting, business, misc

11.1.3 HumanEval

HumanEval measures functional correctness of generated code:

164 programming problems


Model generates function completion

Pass@k metric: probability that at least one of k samples passes tests

$$\text{Pass@}k = \mathbb{E}_{\text{Problems}}\left[ 1 - \frac{\binom{n-c}{k}}{\binom{n}{k}} \right] \tag{11.1}$$

where $n$ is the number of samples and $c$ is the number of correct samples.


11.2 LLM-as-a-Judge

Using LLMs to evaluate other LLMs addresses scalability challenges of human evaluation.

11.2.1 LaaJ Methodology

1. Define evaluation criteria (helpfulness, accuracy, safety)

2. Prompt judge LLM to compare two responses

3. Aggregate pairwise comparisons for ranking

11.2.2 Common Biases

Position bias: Preference for first/second response

Verbosity bias: Preference for longer responses

Self-enhancement: Models favor their own outputs

Style bias: Preference for certain writing styles

Engineering Pitfall

LLM-as-a-Judge can be unreliable for nuanced reasoning tasks. Always validate against human
judgments and be aware of known biases. GPT-4 as judge correlates ~80% with human judgments,
but this varies significantly by task domain.

11.3 Safety and Alignment

11.3.1 Safety Categories

Hate speech: Content attacking protected groups

Harassment: Bullying, threats, intimidation


Self-harm: Encouraging suicide or self-injury

Sexual content: Explicit or inappropriate sexual content

Violence: Graphic violence or instructions for harm

Deception: Scams, phishing, misinformation


11.3.2 Safety Evaluation Datasets

Table 11.2: Safety Evaluation Datasets

| Dataset | Focus | Size |
|---|---|---|
| HHH (Helpful-Honest-Harmless) | Red teaming | 170K+ |
| MT-Bench | Multi-turn safety | 80 |
| StrongREJECT | Jailbreak robustness | 346 |
| XSTest | Refusal behavior | 200 |

11.4 Red Teaming

Red teaming involves actively trying to make models produce harmful outputs to identify vulnerabilities.

11.4.1 Red Teaming Approaches

Manual: Human experts craft adversarial prompts

Automated: Use LLMs to generate test cases

Optimization: Gradient-based adversarial attacks

11.4.2 Jailbreak Techniques

Roleplay: "Pretend you are a character without safety constraints"


Encoding: Base64, rot13, or other encodings

Translation: Request in low-resource languages

Few-shot: Provide examples of harmful responses

Suffix attacks: Add optimized adversarial suffix


Chapter 11 Summary

MMLU measures knowledge across 57 subjects

HumanEval tests functional code correctness with Pass@k

LLM-as-a-Judge scales evaluation but has known biases

Safety evaluation requires diverse harm categories


Red teaming identifies vulnerabilities through adversarial testing


Chapter 12: Reasoning and Advanced Topics

Advancing beyond pattern matching to genuine reasoning is a key frontier in LLM research. This chapter
covers reasoning techniques, test-time compute, and open research problems.

12.1 Chain of Thought

Chain-of-Thought (CoT) prompting elicits step-by-step reasoning by providing examples.

12.1.1 Zero-Shot CoT

Simply adding "Let's think step by step" can improve reasoning:

Q: A juggler has 16 balls. Half are golf balls and half are tennis balls.
If 3 golf balls are removed, how many golf balls remain?
A: Let's think step by step.

12.1.2 Self-Consistency

Generate multiple CoT reasoning paths and take the majority vote:

$$\hat{y} = \arg\max_{y} \sum_{i=1}^{k} \mathbb{1}[y_i = y] \tag{12.1}$$
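The majority vote in Equation 12.1 is a few lines of code once the final answers have been extracted from each sampled reasoning path. The function name is hypothetical, and answer extraction itself is assumed to have happened upstream.

```python
from collections import Counter

def self_consistent_answer(answers):
    """Return the most frequent final answer among k sampled reasoning paths."""
    counts = Counter(answers)
    return counts.most_common(1)[0][0]

# Five sampled chains of thought, reduced to their final answers:
print(self_consistent_answer(["11", "11", "12", "11", "9"]))  # 11
```

Ties are broken in favor of the answer that appeared first, which is a property of `Counter.most_common` rather than part of the method.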

12.1.3 Tree of Thoughts (ToT)

ToT maintains multiple reasoning paths and uses search:

1. Thought decomposition: Break problem into steps

2. Thought generation: Generate candidate thoughts

3. State evaluation: Heuristic to evaluate progress

4. Search: BFS or DFS over thought space


12.2 Test-Time Compute

Increasing computation at inference time can improve performance without training larger models.

12.2.1 Methods for Test-Time Compute

Sampling: Generate multiple answers, select best

Verification: Train verifier to score answers

Process reward models: Score each reasoning step

Monte Carlo Tree Search: Search over reasoning paths

12.2.2 o1 and Test-Time Training

OpenAI's o1 models use learned reasoning strategies:

Extended thinking time for complex problems

Internal chain of thought

Revised answer based on self-reflection

12.3 Mixture of Experts

Mixture of Experts (MoE) scales model capacity without proportional compute increase.


12.3.1 MoE Architecture

$$y = \sum_{i=1}^{N} G(x)_i \cdot E_i(x) \tag{12.2}$$

where $G$ is the gating network and $E_i$ are expert networks. Only the top-k experts are activated.
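A minimal sketch of top-k routing is shown below for scalar inputs. It assumes the gate scores are renormalized with a softmax over only the selected experts, which is one common choice and is not specified by the text above. The gate logits are passed in directly rather than computed from x, and all names are hypothetical.

```python
import math

def moe_forward(x, gate_logits, experts, top_k=2):
    """Combine the outputs of the top_k experts, weighted by gate scores."""
    # Select the k experts with the highest gate logits.
    order = sorted(range(len(experts)), key=lambda i: gate_logits[i], reverse=True)
    top = order[:top_k]
    # Softmax over the selected experts only (an assumption, see lead-in).
    m = max(gate_logits[i] for i in top)
    weights = {i: math.exp(gate_logits[i] - m) for i in top}
    z = sum(weights.values())
    # Experts outside the top k are never evaluated: this is the compute saving.
    return sum(weights[i] / z * experts[i](x) for i in top)

experts = [lambda x: x, lambda x: 2 * x, lambda x: 100 * x]
print(moe_forward(3.0, [0.0, 0.0, -5.0], experts, top_k=2))  # 4.5
```

The third expert has the largest output but the lowest gate score, so it is skipped entirely.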

12.3.2 Load Balancing

To prevent collapse to a few experts, load balancing is enforced:

$$L_{\text{aux}} = \alpha \cdot N \cdot \sum_{i=1}^{N} f_i \cdot P_i \tag{12.3}$$

where $f_i$ is the fraction of tokens routed to expert $i$ and $P_i$ is the average routing probability for expert $i$.
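The auxiliary loss can be computed from per-token routing probabilities as sketched below. The sketch assumes top-1 routing when computing the fractions f_i and an illustrative α of 0.01; both are assumptions, and the function name is hypothetical.

```python
def load_balance_loss(router_probs, alpha=0.01):
    """Eq. 12.3 for a batch; router_probs holds N probabilities per token."""
    num_tokens = len(router_probs)
    n = len(router_probs[0])
    # f_i: fraction of tokens whose highest-probability expert is i
    f = [0.0] * n
    for probs in router_probs:
        f[max(range(n), key=lambda i: probs[i])] += 1.0 / num_tokens
    # P_i: mean routing probability assigned to expert i
    p = [sum(probs[i] for probs in router_probs) / num_tokens for i in range(n)]
    return alpha * n * sum(fi * pi for fi, pi in zip(f, p))

balanced = load_balance_loss([[0.6, 0.4], [0.4, 0.6]])
collapsed = load_balance_loss([[0.9, 0.1], [0.9, 0.1]])
print(balanced < collapsed)  # True
```

The loss is smallest when tokens and probability mass are spread evenly across experts, and grows when routing collapses onto a few experts.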
​ ​

12.4 Open Research Problems

12.4.1 Key Challenges

Factual hallucination: Models generate plausible but false information

Reasoning limitations: Struggle with multi-step logical deduction

Context window: Limited ability to process very long documents


Alignment: Ensuring models behave safely across all inputs

Interpretability: Understanding what models have learned

Efficiency: Reducing compute and memory requirements

12.4.2 Future Directions

Test-time compute scaling: Better inference-time reasoning

World models: Internal simulation of environments

Neuro-symbolic: Combining neural and symbolic reasoning

Continual learning: Updating knowledge without forgetting


Multimodal reasoning: Cross-modal understanding

Figure 12.2: Growth of large language model sizes over time, showing the rapid increase in parameters from 2018 to 2024.

Research Insight

The field is rapidly evolving. Techniques that were state-of-the-art 6 months ago may be obsolete
today. Key trends include: (1) increased focus on reasoning over scaling, (2) test-time compute as a
new scaling dimension, (3) multimodal as default, and (4) efficiency enabling broader access.

Chapter 12 Summary

Chain-of-Thought elicits step-by-step reasoning

Self-consistency aggregates multiple reasoning paths

Test-time compute can improve performance without larger models

Mixture of Experts scales capacity with sparse activation

Hallucination, reasoning, and alignment remain open challenges

The field is evolving rapidly with new techniques emerging constantly


Appendix: Worked Examples

A.1 Attention Computation Example

Let's walk through a complete attention computation for a simple example.

Example: Computing Self-Attention

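Given: Sequence "the cat sat" with embeddings:

$x_1 = [1, 0, 0]^T$ (the), $x_2 = [0, 1, 0]^T$ (cat), $x_3 = [0, 0, 1]^T$ (sat)

Weight matrices: $W_Q = W_K = W_V = I$ (identity)

Compute:

$$Q = K = V = X = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix}$$

Attention scores:

$$S = QK^T = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix}$$

Softmax (with scaling $\sqrt{d_k} = \sqrt{3}$): each row of $S/\sqrt{3}$ has one entry $1/\sqrt{3} \approx 0.577$ and two zeros, so the diagonal weight is $e^{0.577}/(e^{0.577} + 2) \approx 0.471$ and each off-diagonal weight is $1/(e^{0.577} + 2) \approx 0.264$:

$$A = \text{softmax}(S/\sqrt{3}) = \begin{bmatrix} 0.471 & 0.264 & 0.264 \\ 0.264 & 0.471 & 0.264 \\ 0.264 & 0.264 & 0.471 \end{bmatrix}$$

Output (since $V = I$, the output equals the attention weights):

$$O = AV = \begin{bmatrix} 0.471 & 0.264 & 0.264 \\ 0.264 & 0.471 & 0.264 \\ 0.264 & 0.264 & 0.471 \end{bmatrix}$$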
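The example can be recomputed from the definitions with a few lines of pure Python. This sketch hard-codes the identity projections assumed in the example, so Q = K = V = X; the helper names are hypothetical.

```python
import math

def softmax(row):
    m = max(row)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def self_attention(x):
    """Single-head attention with W_Q = W_K = W_V = I, so Q = K = V = X."""
    d_k = len(x[0])
    x_t = [list(col) for col in zip(*x)]
    scores = matmul(x, x_t)  # S = Q K^T
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return weights, matmul(weights, x)  # A and O = A V

X = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
A, O = self_attention(X)
for row in A:
    print([round(v, 3) for v in row])
```

Each row of A sums to 1, and because V is the identity here, O equals A.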

A.2 Gradient Computation Example


Example: Backpropagation Through a Linear Layer

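Forward: $y = Wx + b$

Loss: $L = \frac{1}{2}\lVert y - t \rVert^2$

Gradients:

$$\frac{\partial L}{\partial y} = y - t$$

$$\frac{\partial L}{\partial W} = (y - t)\,x^T$$

$$\frac{\partial L}{\partial b} = y - t$$

$$\frac{\partial L}{\partial x} = W^T (y - t)$$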

A.3 Scaling Law Calculation

Example: Predicting Model Performance

Given: A 7B parameter model trained on 300B tokens achieves perplexity 12.

Question: What perplexity would a 70B model trained on 1.4T tokens achieve?

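Solution: Using a Chinchilla-style additive form with the exponents $\alpha_N = 0.076$, $\alpha_D = 0.095$ reported by Kaplan et al. (2020), where $L$ is the loss in nats per token (log-perplexity):

$$L(N, D) = L_\infty + (N_c/N)^{\alpha_N} + (D_c/D)^{\alpha_D}$$

From the 7B model (perplexity 12, so $L = \ln 12 \approx 2.48$): $2.48 = L_\infty + (N_c/7\text{B})^{0.076} + (D_c/300\text{B})^{0.095}$

For 70B model on 1.4T tokens:

$$L = L_\infty + (N_c/70\text{B})^{0.076} + (D_c/1400\text{B})^{0.095}$$

Growing $N$ by 10× scales the model term by $10^{-0.076} \approx 0.84$, and growing $D$ by about 4.7× scales the data term by $4.7^{-0.095} \approx 0.86$.

Assuming $L_\infty = 1$ and reduced terms of roughly 0.7 and 0.6: $L \approx 1 + 0.7 + 0.6 = 2.3$ (in log-perplexity units)

Converting: Perplexity $\approx e^{2.3} \approx 10$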


A.4 Memory Calculation Example

Example: Training Memory Requirements

Model: 13B parameters, sequence length 2048, batch size 32

Memory breakdown:

Parameters (FP16): 2 × 13B = 26 GB


Gradients (FP16): 2 × 13B = 26 GB
Optimizer states (FP32): 12 × 13B = 156 GB
Activations: ≈ 32 × 2048 × 40 × 5120 × 2 = 27 GB

Total: ≈ 235 GB without ZeRO

With ZeRO-3 across 8 GPUs: ≈ 29 GB per GPU

A.5 DPO Loss Derivation

Example: Computing DPO Loss

Given: Reference policy $\pi_{\text{ref}}(y_w \mid x) = 0.3$, $\pi_{\text{ref}}(y_l \mid x) = 0.2$

Current policy: $\pi_\theta(y_w \mid x) = 0.5$, $\pi_\theta(y_l \mid x) = 0.15$

With $\beta = 0.1$:

$$r_w = \beta \log \frac{0.5}{0.3} = 0.1 \times 0.51 = 0.051$$

$$r_l = \beta \log \frac{0.15}{0.2} = 0.1 \times (-0.29) = -0.029$$

$$L_{\text{DPO}} = -\log \sigma(0.051 - (-0.029)) = -\log \sigma(0.08) = -\log(0.52) = 0.65$$
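The same calculation can be checked numerically. This sketch computes the loss for a single preference pair from sequence-level probabilities, using the natural logarithm as in the worked example; the function and argument names are hypothetical.

```python
import math

def dpo_loss(pi_w, pi_l, ref_w, ref_l, beta=0.1):
    """DPO loss for one (chosen, rejected) pair.

    pi_w, pi_l:   current-policy probabilities of the chosen / rejected response
    ref_w, ref_l: reference-policy probabilities of the same responses
    """
    r_w = beta * math.log(pi_w / ref_w)  # implicit reward of the chosen response
    r_l = beta * math.log(pi_l / ref_l)  # implicit reward of the rejected response
    margin = r_w - r_l
    return -math.log(1.0 / (1.0 + math.exp(-margin)))  # -log sigmoid(margin)

print(round(dpo_loss(0.5, 0.15, 0.3, 0.2), 2))  # 0.65
```

In practice the probabilities are products over many tokens, so implementations work with summed log-probabilities instead of raw probabilities to avoid underflow.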

A.6 KV Cache Size Calculation


Example: Computing KV Cache Memory

Model: LLaMA-2 70B

Layers: 80

Heads: 64
Head dimension: 128

GQA groups: 8 (so 8 KV heads)

For batch size 1, sequence 8192:

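$$\text{Memory}_{KV} = 2 \times 1 \times 8192 \times 80 \times 8 \times 128 \times 2 \text{ bytes}$$

The factors are, in order: keys and values (2), batch size, sequence length, layers, KV heads, head dimension, and bytes per FP16 value (2).

$$= 2{,}684{,}354{,}560 \text{ bytes} \approx 2.68 \text{ GB} \;(2.5 \text{ GiB})$$

For batch size 32: ≈ 85.9 GB (80 GiB)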
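The product of factors can be wrapped in a small helper for trying other configurations. The function name and argument names are hypothetical, and FP16 storage (2 bytes per value) is assumed by default.

```python
def kv_cache_bytes(batch, seq_len, layers, kv_heads, head_dim, bytes_per_value=2):
    """KV cache size in bytes: keys and values for every layer, head, and position."""
    return 2 * batch * seq_len * layers * kv_heads * head_dim * bytes_per_value

# Configuration from the example: 80 layers, 8 KV heads (GQA), head dimension 128.
size = kv_cache_bytes(batch=1, seq_len=8192, layers=80, kv_heads=8, head_dim=128)
print(f"{size / 1e9:.2f} GB, {size / 2**30:.2f} GiB")
```

Because the size is linear in every factor, using all 64 heads as KV heads instead of 8 would multiply the result by 8, which is the saving GQA provides here.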

A.7 Temperature Sampling


Example: Effect of Temperature on Sampling

Logits: [2.0, 1.0, 0.5] for tokens ["cat", "dog", "bird"]

T = 1.0:

$$P = [e^{2}, e^{1}, e^{0.5}] / (e^{2} + e^{1} + e^{0.5})$$

$$= [7.39, 2.72, 1.65]/11.76 = [0.63, 0.23, 0.14]$$

T = 0.5:

$$P = [e^{4}, e^{2}, e^{1}] / (e^{4} + e^{2} + e^{1})$$

$$= [54.6, 7.39, 2.72]/64.7 = [0.84, 0.11, 0.04]$$

T = 2.0:

$$P = [e^{1}, e^{0.5}, e^{0.25}] / (e^{1} + e^{0.5} + e^{0.25})$$

$$= [2.72, 1.65, 1.28]/5.65 = [0.48, 0.29, 0.23]$$
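The three cases above follow from dividing the logits by T before the softmax, which can be sketched as follows. The function name is hypothetical.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Softmax over logits / T; lower T sharpens, higher T flattens."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]  # "cat", "dog", "bird"
for t in (1.0, 0.5, 2.0):
    print(t, [round(p, 2) for p in softmax_with_temperature(logits, t)])
# 1.0 [0.63, 0.23, 0.14]
# 0.5 [0.84, 0.11, 0.04]
# 2.0 [0.48, 0.29, 0.23]
```

The ranking of tokens never changes with temperature; only how concentrated the distribution is.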

A.8 Top-k and Top-p Sampling

Example: Nucleus (Top-p) Sampling

Sorted probabilities: [0.4, 0.3, 0.15, 0.1, 0.05]

Top-k with k=2: Sample from [0.4, 0.3] (renormalized to [0.57, 0.43])

Top-p with p=0.8:

Cumulative: [0.4, 0.7, 0.85, 0.95, 1.0]

Include tokens until cumulative ≥ 0.8: [0.4, 0.3, 0.15]

Renormalized (dividing by 0.85): [0.47, 0.35, 0.18]
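Both filters can be sketched in a few lines. The functions below return a mapping from token index to renormalized probability; the names are hypothetical, and the top-p rule assumed here keeps the smallest prefix of the sorted tokens whose cumulative probability reaches p.

```python
def top_k_filter(probs, k):
    """Keep the k most probable tokens and renormalize."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    total = sum(probs[i] for i in ranked)
    return {i: probs[i] / total for i in ranked}

def top_p_filter(probs, p):
    """Keep the smallest set of top tokens whose cumulative mass reaches p."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    total = sum(probs[i] for i in kept)
    return {i: probs[i] / total for i in kept}

probs = [0.4, 0.3, 0.15, 0.1, 0.05]
print(top_k_filter(probs, 2))
print(top_p_filter(probs, 0.8))
```

Top-k always keeps a fixed number of tokens, while top-p adapts the number kept to how peaked the distribution is.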

A.9 Quantization Error Analysis


Example: Computing Quantization Error

Weights: w = [0.5, −0.3, 0.8, −0.2]

INT8 quantization (scale = 127 / 0.8 = 158.75):

$w_{\text{quant}} = \text{round}(w \times 158.75) = [79, -48, 127, -32]$

$w_{\text{dequant}} = w_{\text{quant}} / 158.75 = [0.498, -0.302, 0.8, -0.202]$

MSE: $\frac{1}{4} \sum (w - w_{\text{dequant}})^2 \approx 3.4 \times 10^{-6}$
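The round trip can be reproduced with a short sketch of symmetric per-tensor quantization, where the scale maps the largest absolute weight to 127. The function name is hypothetical, and real libraries typically quantize per channel or per group rather than per tensor.

```python
def quantize_int8(weights):
    """Symmetric INT8 quantization: returns (quantized, dequantized, mse)."""
    scale = 127.0 / max(abs(w) for w in weights)
    # Round to the nearest integer and clamp to the symmetric INT8 range.
    q = [max(-127, min(127, round(w * scale))) for w in weights]
    dq = [v / scale for v in q]
    mse = sum((w - d) ** 2 for w, d in zip(weights, dq)) / len(weights)
    return q, dq, mse

q, dq, mse = quantize_int8([0.5, -0.3, 0.8, -0.2])
print(q)
print([round(v, 3) for v in dq])
print(f"{mse:.2e}")
```

The largest-magnitude weight is reproduced exactly, and every other weight is off by at most half a quantization step, that is 0.5 / scale.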


A.10 Beam Search

Example: Beam Search with Width 2

Step 1: Logits [2.0, 1.5, 0.5] → Probs [0.55, 0.33, 0.12]

Keep top 2: "The" (0.55), "A" (0.33)

Step 2: For "The": logits [1.5, 1.0, 0.5] → probs [0.51, 0.31, 0.19]

For "A": logits [1.0, 0.8, 0.3] → probs [0.43, 0.35, 0.21]

Scores: "The cat" (0.55×0.51=0.28), "The dog" (0.55×0.31=0.17),

"A cat" (0.33×0.43=0.14), "A dog" (0.33×0.35=0.12)

Keep top 2: "The cat", "The dog"
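The procedure can be sketched with a lookup table standing in for the model. The table below reuses the logits from the example; the token label "<other>" for the unnamed third option is an assumption, as are the function names. Scores are accumulated as log-probabilities, which ranks beams identically to multiplying probabilities.

```python
import math

# Hypothetical next-token logits, keyed by the tokens generated so far.
LOGITS = {
    (): {"The": 2.0, "A": 1.5, "<other>": 0.5},
    ("The",): {"cat": 1.5, "dog": 1.0, "<other>": 0.5},
    ("A",): {"cat": 1.0, "dog": 0.8, "<other>": 0.3},
}

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def beam_search(logits_table, width=2, steps=2):
    beams = [((), 0.0)]  # (tokens so far, log-probability)
    for _ in range(steps):
        candidates = []
        for tokens, logp in beams:
            options = logits_table[tokens]
            probs = softmax(list(options.values()))
            for tok, p in zip(options, probs):
                candidates.append((tokens + (tok,), logp + math.log(p)))
        # Keep only the `width` highest-scoring partial sequences.
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:width]
    return beams

for tokens, logp in beam_search(LOGITS):
    print(" ".join(tokens), round(math.exp(logp), 2))
```

With width 1 this reduces to greedy decoding; larger widths explore more alternatives at proportionally higher cost.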

A.11 Cosine Similarity in RAG


Example: Computing Retrieval Scores

Query embedding: q = [0.5, 0.3, 0.8]

Document embeddings:

d1 = [0.6, 0.2, 0.7], d2 = [0.1, 0.9, 0.3]


Cosine similarity:
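$$\text{sim}(q, d_1) = \frac{0.5(0.6) + 0.3(0.2) + 0.8(0.7)}{\lVert q \rVert \lVert d_1 \rVert} = \frac{0.92}{0.990 \times 0.943} \approx 0.99$$

$$\text{sim}(q, d_2) = \frac{0.5(0.1) + 0.3(0.9) + 0.8(0.3)}{\lVert q \rVert \lVert d_2 \rVert} = \frac{0.56}{0.990 \times 0.954} \approx 0.59$$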

Result: Document 1 is more relevant to the query.

A.12 Perplexity Calculation

Example: Computing Model Perplexity

Sentence: "The cat sat"

Model probabilities:

P (The) = 0.1, P (cat∣The) = 0.05, P (sat∣The cat) = 0.2

Joint probability: 0.1 × 0.05 × 0.2 = 0.001

Average negative log-likelihood:

$$-\frac{1}{3}(\log 0.1 + \log 0.05 + \log 0.2) = -\frac{1}{3}(-2.30 - 3.00 - 1.61) = 2.30$$

Perplexity: $e^{2.30} \approx 10.0$ (equivalently, $0.001^{-1/3} = 10$)
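The calculation generalizes to any sequence of per-token probabilities. A minimal sketch, with a hypothetical function name:

```python
import math

def perplexity(token_probs):
    """exp of the average negative log-likelihood of the observed tokens."""
    nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(nll)

# P(The), P(cat | The), P(sat | The cat)
print(round(perplexity([0.1, 0.05, 0.2]), 2))  # 10.0
```

Perplexity equals the inverse geometric mean of the token probabilities, so a model that assigned probability 0.1 to every token would score the same.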


Glossary

Attention: A mechanism that computes weighted representations by comparing queries against keys to
determine relevance. The core operation is $\text{softmax}(QK^T/\sqrt{d_k})V$.

Autoregressive: A model that generates output one token at a time, conditioning each new token on
previously generated tokens. Used in GPT-style decoder-only models.

Backpropagation: Algorithm for computing gradients of the loss with respect to model parameters by
applying the chain rule through the computation graph.

BERT: Bidirectional Encoder Representations from Transformers; encoder-only model pre-trained with
masked language modeling.

BPE: Byte Pair Encoding; subword tokenization algorithm that iteratively merges frequent character pairs to
build a vocabulary.

Chain-of-Thought: Prompting technique that elicits step-by-step reasoning from language models by
providing examples of intermediate reasoning steps.

Context Window: The maximum sequence length a model can process at once. Limited by memory and
positional encoding method.

Cross-Entropy: Loss function measuring the difference between predicted and true probability distributions:
$H(P, Q) = -\sum_x P(x) \log Q(x)$.

Decoder: The part of a transformer that generates output autoregressively; used in GPT-style models with
causal attention.

DPO: Direct Preference Optimization; alignment method that directly optimizes from preferences without
reward modeling.

Embedding: A dense vector representation of discrete tokens in a continuous space, typically learned during
training.

Encoder: The part of a transformer that processes input bidirectionally; used in BERT-style models.


FSDP: Fully Sharded Data Parallel; PyTorch's implementation of ZeRO for distributed training.

GPT: Generative Pre-trained Transformer; decoder-only autoregressive language model architecture.

Gradient Checkpointing: Memory optimization technique that recomputes activations during backward
pass instead of storing them.

Hallucination: Generation of plausible but factually incorrect or unsupported content by a language model.

KV Cache: Storage of key and value tensors from previous tokens to avoid recomputation during
autoregressive inference.

LLM: Large Language Model; neural network with billions of parameters trained on vast text corpora.

LoRA: Low-Rank Adaptation; parameter-efficient fine-tuning method using low-rank matrix decomposition.

Masking: Preventing attention to certain positions, typically future tokens in autoregressive models using a
causal mask.

MoE: Mixture of Experts; architecture using sparse activation of specialized sub-networks to scale model
capacity.

Multi-Head Attention: Parallel attention computations with different learned projections, allowing attention
to different representation subspaces.

Perplexity: Metric for language model quality; exponential of average negative log-likelihood:
$\text{PPL} = \exp\left(-\frac{1}{N} \sum_i \log P(x_i)\right)$.

Positional Encoding: Method to inject sequence position information into transformer models, which are
otherwise permutation-invariant.

PPO: Proximal Policy Optimization; reinforcement learning algorithm used in RLHF to optimize policies
with a trust region constraint.

Quantization: Reducing numerical precision of model weights to decrease memory and increase speed, e.g.,
FP16 → INT8.

RAG: Retrieval-Augmented Generation; augmenting LLMs with external knowledge retrieval to reduce
hallucination.


RLHF: Reinforcement Learning from Human Feedback; alignment method using human preferences to train
a reward model and optimize policy.

RoPE: Rotary Position Embedding; relative positional encoding using rotation matrices that encode relative
position naturally.

Self-Attention: Attention mechanism where queries, keys, and values come from the same sequence.

SFT: Supervised Fine-Tuning; adapting pre-trained models with labeled instruction-response pairs.

Softmax: Function converting logits to probabilities that sum to 1: $\text{softmax}(x_i) = e^{x_i} / \sum_j e^{x_j}$.


Token: Discrete unit of text (word, subword, or character) processed by language models.

Tokenization: Process of converting text into discrete tokens that can be processed by neural networks.

Transformer: Neural network architecture based on attention mechanisms, replacing recurrence with
parallelizable self-attention.

ZeRO: Zero Redundancy Optimizer; memory optimization technique for distributed training that partitions
optimizer states, gradients, and parameters.


References

1. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., & Polosukhin, I. (2017).
Attention is all you need. Advances in Neural Information Processing Systems, 30, 5998-6008.

2. Radford, A., Narasimhan, K., Salimans, T., & Sutskever, I. (2018). Improving language understanding by generative
pre-training. OpenAI Technical Report.

3. Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., & Sutskever, I. (2019). Language models are unsupervised
multitask learners. OpenAI Blog.

4. Brown, T., Mann, B., Ryder, N., et al. (2020). Language models are few-shot learners. Advances in Neural Information
Processing Systems, 33, 1877-1901.

5. Kaplan, J., McCandlish, S., Henighan, T., et al. (2020). Scaling laws for neural language models. arXiv preprint
arXiv:2001.08361.

6. Raffel, C., Shazeer, N., Roberts, A., et al. (2020). Exploring the limits of transfer learning with a unified text-to-text
transformer. Journal of Machine Learning Research, 21(140), 1-67.

7. Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of deep bidirectional transformers for
language understanding. Proceedings of NAACL-HLT, 4171-4186.

8. Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient estimation of word representations in vector space.
ICLR Workshop.

9. Pennington, J., Socher, R., & Manning, C. D. (2014). GloVe: Global vectors for word representation. Proceedings of
EMNLP, 1532-1543.

10. Hoffmann, J., Borgeaud, S., Mensch, A., et al. (2022). Training compute-optimal large language models. arXiv preprint
arXiv:2203.15556.

11. Ouyang, L., Wu, J., Jiang, X., et al. (2022). Training language models to follow instructions with human feedback.
Advances in Neural Information Processing Systems, 35, 27730-27744.

12. Rafailov, R., Sharma, A., Mitchell, E., Ermon, S., Manning, C. D., & Finn, C. (2023). Direct preference optimization:
Your language model is secretly a reward model. Advances in Neural Information Processing Systems, 36.

13. Bai, Y., Kadavath, S., Kundu, S., et al. (2022). Constitutional AI: Harmlessness from AI feedback. arXiv preprint
arXiv:2212.08073.

14. Touvron, H., Lavril, T., Izacard, G., et al. (2023). LLaMA: Open and efficient foundation language models. arXiv
preprint arXiv:2302.13971.


15. Touvron, H., Martin, L., Stone, K., et al. (2023). Llama 2: Open foundation and fine-tuned chat models. arXiv preprint
arXiv:2307.09288.

16. Su, J., Lu, Y., Pan, S., Murtadha, A., Wen, B., & Liu, Y. (2024). RoFormer: Enhanced transformer with rotary position
embedding. Neurocomputing, 568, 127063.

17. Press, O., Smith, N. A., & Lewis, M. (2022). Train short, test long: Attention with linear biases enables input length
extrapolation. ICLR.

18. Dao, T., Fu, D. Y., Ermon, S., Rudra, A., & Re, C. (2022). FlashAttention: Fast and memory-efficient exact attention
with IO-awareness. Advances in Neural Information Processing Systems, 35, 16344-16359.

19. Lewis, P., Perez, E., Piktus, A., et al. (2020). Retrieval-augmented generation for knowledge-intensive NLP tasks.
Advances in Neural Information Processing Systems, 33, 9459-9474.

20. Yao, S., Zhao, J., Yu, D., et al. (2023). ReAct: Synergizing reasoning and acting in language models. ICLR.

21. Wei, J., Wang, X., Schuurmans, D., et al. (2022). Chain-of-thought prompting elicits reasoning in large language
models. Advances in Neural Information Processing Systems, 35, 24824-24837.

22. OpenAI. (2023). GPT-4 technical report. arXiv preprint arXiv:2303.08774.

23. Rajbhandari, S., Rasley, J., Ruwase, O., & He, Y. (2020). ZeRO: Memory optimizations toward training trillion
parameter models. SC20: International Conference for High Performance Computing, 1-16.

24. Shoeybi, M., Patwary, M., Puri, R., et al. (2019). Megatron-LM: Training multi-billion parameter language models using
model parallelism. arXiv preprint arXiv:1909.08053.

25. Dettmers, T., Lewis, M., Belkada, Y., & Zettlemoyer, L. (2022). LLM.int8(): 8-bit matrix multiplication for transformers
at scale. Advances in Neural Information Processing Systems, 35, 30318-30332.

26. Lin, J., Tang, J., Tang, H., et al. (2024). AWQ: Activation-aware weight quantization for LLM compression and
acceleration. MLSys.

27. Leviathan, Y., Kalman, M., & Matias, Y. (2023). Fast inference from transformers via speculative decoding. ICML,
19274-19286.

28. Ainslie, J., Lee-Thorp, J., de Jong, M., et al. (2023). GQA: Training generalized multi-query transformer models from
multi-head checkpoints. EMNLP.

29. Fedus, W., Zoph, B., & Shazeer, N. (2022). Switch transformers: Scaling to trillion parameter models with simple and
efficient sparsity. Journal of Machine Learning Research, 23(120), 1-39.


30. Hendrycks, D., Burns, C., Basart, S., et al. (2021). Measuring massive multitask language understanding. ICLR.

31. Cobbe, K., Kosaraju, V., Bavarian, M., et al. (2021). Training verifiers to solve math word problems. arXiv preprint
arXiv:2110.14168.

32. Chen, M., Tworek, J., Jun, H., et al. (2021). Evaluating large language models trained on code. arXiv preprint
arXiv:2107.03374.

33. Zellers, R., Holtzman, A., Bisk, Y., et al. (2019). HellaSwag: Can a machine really finish your sentence? ACL, 4791-
4800.

34. Lin, S., Hilton, J., & Evans, O. (2022). TruthfulQA: Measuring how models mimic human falsehoods. ACL, 3214-3252.

35. Suzgun, M., Scales, N., Schärli, N., et al. (2023). Challenging BIG-Bench tasks and whether chain-of-thought can
solve them. ACL Findings, 13003-13051.

36. Zheng, L., Chiang, W. L., Sheng, Y., et al. (2023). Judging LLM-as-a-judge with MT-Bench and chatbot arena.
Advances in Neural Information Processing Systems, 36.

37. Radford, A., Kim, J. W., Hallacy, C., et al. (2021). Learning transferable visual models from natural language
supervision. ICML, 8748-8763.

38. Liu, H., Li, C., Wu, Q., & Lee, Y. J. (2024). Visual instruction tuning. Advances in Neural Information Processing
Systems, 36.

39. Radford, A., Kim, J. W., Xu, T., et al. (2023). Robust speech recognition via large-scale weak supervision. ICML,
28492-28518.

40. Yao, S., Yu, D., Zhao, J., et al. (2024). Tree of thoughts: Deliberate problem solving with large language models.
Advances in Neural Information Processing Systems, 36.


Amer Hussein
Connect:
LinkedIn: [Link]/in/amer-hussein/
