0% found this document useful (0 votes)
11 views3 pages

Math Behind Transformers & RNNs

The document discusses the mathematical foundations of Transformers and RNNs, focusing on how core operations are converted into matrix multiplications for efficient computation. It details the equations for RNNs, including hidden state updates, and explains the self-attention mechanism and feed-forward networks in Transformers, highlighting their reliance on matrix multiplications. The advantages of using matrix multiplications, such as hardware optimization and efficiency, are also emphasized.

Uploaded by

dibawed780
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
11 views3 pages

Math Behind Transformers & RNNs

The document discusses the mathematical foundations of Transformers and RNNs, focusing on how core operations are converted into matrix multiplications for efficient computation. It details the equations for RNNs, including hidden state updates, and explains the self-attention mechanism and feed-forward networks in Transformers, highlighting their reliance on matrix multiplications. The advantages of using matrix multiplications, such as hardware optimization and efficiency, are also emphasized.

Uploaded by

dibawed780
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd

Mathematical Foundations of Transformers & RNNs:

From Equations to Matrix Multiplications


Your Name
April 11, 2025

Introduction
Transformers and RNNs rely on key mathematical operations that are ultimately converted
into large matrix multiplications (MatMul) for efficient computation on GPUs/TPUs. This
document explains the core equations and their conversion to MatMul operations.

1 Recurrent Neural Networks (RNNs)


RNNs process sequential data using recurrent connections. The core operation is the hidden
state update.

Vanilla RNN Equations


For a timestep t, input xt , hidden state ht , and output yt :

ht = σ(Wxh xt + Whh ht−1 + bh )


yt = Why ht + by
Where:
• Wxh , Whh , Why are weight matrices.
• σ is an activation function (e.g., tanh).

How it becomes a MatMul?


1. Concatenate Input & Hidden State: Combine xt and ht−1 into a single matrix:
 
xt
zt =
ht−1

2. Stack Weights: Combine Wxh and Whh into one matrix:


 
Wh = Wxh Whh

1
3. Single Matrix Multiplication: The hidden state update becomes:

ht = σ(Wh zt + bh )

Example
If xt ∈ Rdx and ht−1 ∈ Rdh , then:

Wh ∈ Rdh ×(dx +dh ) , zt ∈ R(dx +dh )×1

The product Wh zt is a large MatMul.

2 Transformers (Self-Attention & Feed-Forward)


Transformers rely on self-attention and feed-forward networks, both heavily using Mat-
Mul.

(A) Self-Attention Mechanism


Given input X ∈ Rn×d (sequence length n, embedding dim d):

1. Compute Queries, Keys, Values:

Q = XWQ , K = XWK , V = XWV

(where WQ , WK , WV ∈ Rd×dk )

2. Attention Scores:
QKT
 
Attention(Q, K, V) = softmax √ V
dk

How it becomes MatMul?


• QKT is a MatMul of size (n × dk ) × (dk × n) → (n × n).

• The second MatMul: (n × n) × (n × dk ) → (n × dk ).

Example
If n = 1024 and dk = 64, then:

QKT is (1024 × 64) × (64 × 1024) → 1024 × 1024 matrix.

2
(B) Feed-Forward Network (FFN)
Each position i in the sequence undergoes:

FFN(xi ) = ReLU(xi W1 + b1 )W2 + b2

Where:

• W1 ∈ Rd×df f , W2 ∈ Rdf f ×d .

• df f is typically 4d (e.g., d = 512 → df f = 2048).

How it becomes MatMul?


The entire sequence X ∈ Rn×d is processed as:

Y = ReLU(XW1 + b1 )W2 + b2

• First MatMul: (n × d) × (d × df f ) → (n × df f ).

• Second MatMul: (n × df f ) × (df f × d) → (n × d).

Example
For n = 1024, d = 512, df f = 2048:

First MatMul: (1024 × 512) × (512 × 2048).

Second MatMul: (1024 × 2048) × (2048 × 512).

Why MatMul Dominates?


• Hardware Optimization: GPUs/TPUs are optimized for large matrix operations.

• Parallelization: MatMul can be batched and computed in parallel.

• Efficiency: Combining operations into a single MatMul reduces overhead.

Summary

Model Core Operation MatMul Conversion Example


RNN Hidden state update ht = σ([Wxh Whh ][xt ; ht−1 ] + bh )
Transformer Self-Attention QKT (n×n attention scores)
Transformer Feed-Forward Network XW1 W2 (two large MatMuls)

You might also like