Mathematical Foundations of Transformers & RNNs:
From Equations to Matrix Multiplications
Your Name
April 11, 2025
Introduction
Transformers and RNNs rely on key mathematical operations that are ultimately converted
into large matrix multiplications (MatMul) for efficient computation on GPUs/TPUs. This
document explains the core equations and their conversion to MatMul operations.
1 Recurrent Neural Networks (RNNs)
RNNs process sequential data using recurrent connections. The core operation is the hidden
state update.
Vanilla RNN Equations
For a timestep t, input xt , hidden state ht , and output yt :
ht = σ(Wxh xt + Whh ht−1 + bh )
yt = Why ht + by
Where:
• Wxh , Whh , Why are weight matrices.
• σ is an activation function (e.g., tanh).
How it becomes a MatMul?
1. Concatenate Input & Hidden State: Combine xt and ht−1 into a single matrix:
xt
zt =
ht−1
2. Stack Weights: Combine Wxh and Whh into one matrix:
Wh = Wxh Whh
1
3. Single Matrix Multiplication: The hidden state update becomes:
ht = σ(Wh zt + bh )
Example
If xt ∈ Rdx and ht−1 ∈ Rdh , then:
Wh ∈ Rdh ×(dx +dh ) , zt ∈ R(dx +dh )×1
The product Wh zt is a large MatMul.
2 Transformers (Self-Attention & Feed-Forward)
Transformers rely on self-attention and feed-forward networks, both heavily using Mat-
Mul.
(A) Self-Attention Mechanism
Given input X ∈ Rn×d (sequence length n, embedding dim d):
1. Compute Queries, Keys, Values:
Q = XWQ , K = XWK , V = XWV
(where WQ , WK , WV ∈ Rd×dk )
2. Attention Scores:
QKT
Attention(Q, K, V) = softmax √ V
dk
How it becomes MatMul?
• QKT is a MatMul of size (n × dk ) × (dk × n) → (n × n).
• The second MatMul: (n × n) × (n × dk ) → (n × dk ).
Example
If n = 1024 and dk = 64, then:
QKT is (1024 × 64) × (64 × 1024) → 1024 × 1024 matrix.
2
(B) Feed-Forward Network (FFN)
Each position i in the sequence undergoes:
FFN(xi ) = ReLU(xi W1 + b1 )W2 + b2
Where:
• W1 ∈ Rd×df f , W2 ∈ Rdf f ×d .
• df f is typically 4d (e.g., d = 512 → df f = 2048).
How it becomes MatMul?
The entire sequence X ∈ Rn×d is processed as:
Y = ReLU(XW1 + b1 )W2 + b2
• First MatMul: (n × d) × (d × df f ) → (n × df f ).
• Second MatMul: (n × df f ) × (df f × d) → (n × d).
Example
For n = 1024, d = 512, df f = 2048:
First MatMul: (1024 × 512) × (512 × 2048).
Second MatMul: (1024 × 2048) × (2048 × 512).
Why MatMul Dominates?
• Hardware Optimization: GPUs/TPUs are optimized for large matrix operations.
• Parallelization: MatMul can be batched and computed in parallel.
• Efficiency: Combining operations into a single MatMul reduces overhead.
Summary
Model Core Operation MatMul Conversion Example
RNN Hidden state update ht = σ([Wxh Whh ][xt ; ht−1 ] + bh )
Transformer Self-Attention QKT (n×n attention scores)
Transformer Feed-Forward Network XW1 W2 (two large MatMuls)