Math Behind Transformers & RNNs

The document discusses the mathematical foundations of Transformers and RNNs, focusing on how core operations are converted into matrix multiplications for efficient computation. It details the equations for RNNs, including hidden state updates, and explains the self-attention mechanism and feed-forward networks in Transformers, highlighting their reliance on matrix multiplications. The advantages of using matrix multiplications, such as hardware optimization and efficiency, are also emphasized.

Uploaded by

dibawed780

Mathematical Foundations of Transformers & RNNs:

From Equations to Matrix Multiplications

Your Name
April 11, 2025

Introduction
Transformers and RNNs rely on key mathematical operations that are ultimately converted
into large matrix multiplications (MatMul) for efficient computation on GPUs/TPUs. This
document explains the core equations and their conversion to MatMul operations.

1 Recurrent Neural Networks (RNNs)

RNNs process sequential data using recurrent connections. The core operation is the hidden
state update.

Vanilla RNN Equations

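For a timestep $t$, with input $x_t$, hidden state $h_t$, and output $y_t$:

\[ h_t = \sigma(W_{xh} x_t + W_{hh} h_{t-1} + b_h) \]

\[ y_t = W_{hy} h_t + b_y \]

Where:
• $W_{xh}$, $W_{hh}$, $W_{hy}$ are weight matrices, and $b_h$, $b_y$ are bias vectors.
• $\sigma$ is an activation function (e.g., tanh).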
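
The two equations above can be sketched directly in NumPy. This is a minimal illustration, not a reference implementation: the sizes (`d_x`, `d_h`, `d_y`), the random weights, the zero biases, and the function name `rnn_step` are all assumptions made for the example, and tanh is used as the activation.

```python
import numpy as np

def rnn_step(x_t, h_prev, W_xh, W_hh, b_h, W_hy, b_y):
    """One vanilla RNN step: returns (h_t, y_t)."""
    h_t = np.tanh(W_xh @ x_t + W_hh @ h_prev + b_h)  # hidden state update
    y_t = W_hy @ h_t + b_y                           # linear readout
    return h_t, y_t

# Hypothetical sizes, chosen only for illustration.
d_x, d_h, d_y = 3, 4, 2
rng = np.random.default_rng(0)
W_xh = rng.standard_normal((d_h, d_x))
W_hh = rng.standard_normal((d_h, d_h))
W_hy = rng.standard_normal((d_y, d_h))
b_h = np.zeros(d_h)
b_y = np.zeros(d_y)

h = np.zeros(d_h)                    # initial hidden state h_0
xs = rng.standard_normal((5, d_x))   # a sequence of 5 input vectors
for x_t in xs:
    # Each step depends on the previous hidden state, so the loop is sequential.
    h, y = rnn_step(x_t, h, W_xh, W_hh, b_h, W_hy, b_y)
```

The loop makes the recurrence explicit: step $t$ cannot start until $h_{t-1}$ is available.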

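How does it become a MatMul?

1. Concatenate Input & Hidden State: Stack $x_t$ and $h_{t-1}$ into a single column vector:

\[ z_t = \begin{bmatrix} x_t \\ h_{t-1} \end{bmatrix} \]

2. Stack Weights: Place $W_{xh}$ and $W_{hh}$ side by side in one matrix:

\[ W_h = \begin{bmatrix} W_{xh} & W_{hh} \end{bmatrix} \]

3. Single Matrix Multiplication: The hidden state update becomes:

\[ h_t = \sigma(W_h z_t + b_h) \]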

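Example
If $x_t \in \mathbb{R}^{d_x}$ and $h_{t-1} \in \mathbb{R}^{d_h}$, then:

\[ W_h \in \mathbb{R}^{d_h \times (d_x + d_h)}, \quad z_t \in \mathbb{R}^{(d_x + d_h) \times 1} \]

For a single sequence, $W_h z_t$ is a matrix-vector product. With a batch of $B$ sequences, the $z_t$ columns stack into a $(d_x + d_h) \times B$ matrix and the update becomes one large MatMul.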
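
A quick numerical check of the stacking trick, as a sketch under assumed sizes: the dimensions, the random values, and the batch size `B` are hypothetical. It compares the two-product form with the single stacked product, then applies the same stacked weights to a batch of columns.

```python
import numpy as np

# Hypothetical sizes, chosen only for illustration.
d_x, d_h, B = 3, 4, 8
rng = np.random.default_rng(0)
W_xh = rng.standard_normal((d_h, d_x))
W_hh = rng.standard_normal((d_h, d_h))
b_h = rng.standard_normal((d_h, 1))
x_t = rng.standard_normal((d_x, 1))
h_prev = rng.standard_normal((d_h, 1))

# Original form: two separate products.
separate = np.tanh(W_xh @ x_t + W_hh @ h_prev + b_h)

# Stacked form: one product with concatenated weights and inputs.
W_h = np.hstack([W_xh, W_hh])    # shape (d_h, d_x + d_h)
z_t = np.vstack([x_t, h_prev])   # shape (d_x + d_h, 1)
fused = np.tanh(W_h @ z_t + b_h)

# Batched form: B sequences side by side turn the product into a true MatMul.
X = rng.standard_normal((d_x, B))
H = rng.standard_normal((d_h, B))
Z = np.vstack([X, H])            # shape (d_x + d_h, B)
H_next = np.tanh(W_h @ Z + b_h)  # b_h broadcasts across the B columns

print(np.allclose(separate, fused))  # True
```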

2 Transformers (Self-Attention & Feed-Forward)

Transformers rely on self-attention and feed-forward networks, both of which rely heavily on MatMul.

(A) Self-Attention Mechanism

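Given input $X \in \mathbb{R}^{n \times d}$ (sequence length $n$, embedding dimension $d$):

1. Compute Queries, Keys, Values:

\[ Q = X W_Q, \quad K = X W_K, \quad V = X W_V \]

(where $W_Q, W_K, W_V \in \mathbb{R}^{d \times d_k}$)

2. Attention Scores:

\[ \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{T}}{\sqrt{d_k}}\right) V \]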

How does it become a MatMul?

• $Q K^{T}$ is a MatMul of size $(n \times d_k) \times (d_k \times n) \to (n \times n)$.

• The second MatMul multiplies the softmax weights by $V$: $(n \times n) \times (n \times d_k) \to (n \times d_k)$.
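
The attention formula and its two MatMuls can be sketched in NumPy as follows. This is a single-head illustration under assumed sizes; the helper names `softmax` and `self_attention`, the dimensions, and the random weights are hypothetical, and masking and multiple heads are left out.

```python
import numpy as np

def softmax(scores):
    """Row-wise softmax, shifted by the row max for numerical stability."""
    shifted = scores - scores.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def self_attention(X, W_Q, W_K, W_V):
    """Single-head scaled dot-product self-attention (no masking)."""
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V   # three (n, d) x (d, d_k) MatMuls
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)       # (n, d_k) x (d_k, n) -> (n, n)
    weights = softmax(scores)             # each row sums to 1
    return weights @ V, weights           # (n, n) x (n, d_k) -> (n, d_k)

# Hypothetical sizes, chosen only for illustration.
n, d, d_k = 6, 8, 4
rng = np.random.default_rng(0)
X = rng.standard_normal((n, d))
W_Q, W_K, W_V = (rng.standard_normal((d, d_k)) for _ in range(3))

out, weights = self_attention(X, W_Q, W_K, W_V)
print(out.shape, weights.shape)  # (6, 4) (6, 6)
```

Apart from the row-wise softmax and the scaling by $\sqrt{d_k}$, every step is a matrix product.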

Example
If $n = 1024$ and $d_k = 64$, then:

$Q K^{T}$ is $(1024 \times 64) \times (64 \times 1024) \to$ a $1024 \times 1024$ matrix.
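
To put numbers on this example, the sketch below counts operations and output entries for the two attention MatMuls. The helper `matmul_cost` is hypothetical, the $2mkn$ FLOP convention (one multiply and one add per term) is an assumption, and the 4-byte float32 storage is assumed for the memory figure.

```python
def matmul_cost(m, k, n):
    """Return (flops, output_entries) for an (m, k) x (k, n) product.

    Assumes the common convention of 2*m*k*n FLOPs: one multiply and
    one add per inner-product term.
    """
    return 2 * m * k * n, m * n

n, d_k = 1024, 64
score_flops, score_entries = matmul_cost(n, d_k, n)  # Q K^T
value_flops, value_entries = matmul_cost(n, n, d_k)  # softmax weights times V
score_bytes_fp32 = score_entries * 4                 # assumes 4 bytes per float32

print(score_flops)       # 134217728
print(score_entries)     # 1048576
print(score_bytes_fp32)  # 4194304 (4 MiB)
print(value_entries)     # 65536
```

The score matrix has $n^2$ entries, so doubling the sequence length quadruples its size.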

(B) Feed-Forward Network (FFN)
Each position $i$ in the sequence undergoes:

\[ \mathrm{FFN}(x_i) = \mathrm{ReLU}(x_i W_1 + b_1) W_2 + b_2 \]

Where:

• $W_1 \in \mathbb{R}^{d \times d_{ff}}$, $W_2 \in \mathbb{R}^{d_{ff} \times d}$.

• $d_{ff}$ is typically $4d$ (e.g., $d = 512 \to d_{ff} = 2048$).

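How does it become a MatMul?

The entire sequence $X \in \mathbb{R}^{n \times d}$ is processed as:

\[ Y = \mathrm{ReLU}(X W_1 + b_1) W_2 + b_2 \]

• First MatMul: $(n \times d) \times (d \times d_{ff}) \to (n \times d_{ff})$.

• Second MatMul: $(n \times d_{ff}) \times (d_{ff} \times d) \to (n \times d)$.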
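
The sketch below illustrates why the per-position formula and the whole-sequence formula agree: applying the FFN to each row separately gives the same result as two MatMuls over all of $X$. The sizes, random weights, and the function name `ffn` are assumptions made for the example.

```python
import numpy as np

def ffn(X, W1, b1, W2, b2):
    """Position-wise feed-forward network: ReLU(X W1 + b1) W2 + b2."""
    hidden = np.maximum(X @ W1 + b1, 0.0)  # first MatMul, then ReLU
    return hidden @ W2 + b2                # second MatMul

# Hypothetical sizes, chosen only for illustration (d_ff = 4 * d).
n, d, d_ff = 5, 8, 32
rng = np.random.default_rng(0)
X = rng.standard_normal((n, d))
W1 = rng.standard_normal((d, d_ff))
b1 = rng.standard_normal(d_ff)
W2 = rng.standard_normal((d_ff, d))
b2 = rng.standard_normal(d)

Y = ffn(X, W1, b1, W2, b2)                                  # whole sequence at once
Y_rows = np.stack([ffn(x_i, W1, b1, W2, b2) for x_i in X])  # one position at a time

print(Y.shape)                 # (5, 8)
print(np.allclose(Y, Y_rows))  # True
```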

Example
For $n = 1024$, $d = 512$, $d_{ff} = 2048$:

First MatMul: $(1024 \times 512) \times (512 \times 2048)$.

Second MatMul: $(1024 \times 2048) \times (2048 \times 512)$.
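
As a rough sense of scale for these shapes, the arithmetic below counts the work in each product. It assumes the $2mkn$ FLOP convention (one multiply and one add per term) and ignores biases and the ReLU.

```python
n, d, d_ff = 1024, 512, 2048

first_flops = 2 * n * d * d_ff    # (n, d) x (d, d_ff)
second_flops = 2 * n * d_ff * d   # (n, d_ff) x (d_ff, d)
weight_params = d * d_ff + d_ff * d  # entries in W1 plus entries in W2

print(first_flops)    # 2147483648
print(second_flops)   # 2147483648
print(weight_params)  # 2097152
```

Both products cost the same, about 2.1 billion operations each under this convention.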

Why Does MatMul Dominate?

• Hardware Optimization: GPUs/TPUs are optimized for large matrix operations.

• Parallelization: MatMul can be batched and computed in parallel.

• Efficiency: Combining operations into a single MatMul reduces per-operation overhead, such as separate kernel launches and repeated reads of the same input.
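
One common illustration of combining operations, offered here as a sketch rather than something the equations above prescribe: the three projections $X W_Q$, $X W_K$, $X W_V$ can be computed as one product by concatenating the weight matrices column-wise. The name `W_QKV`, the sizes, and the random values are hypothetical.

```python
import numpy as np

# Hypothetical sizes, chosen only for illustration.
n, d, d_k = 6, 8, 4
rng = np.random.default_rng(0)
X = rng.standard_normal((n, d))
W_Q, W_K, W_V = (rng.standard_normal((d, d_k)) for _ in range(3))

# Three separate MatMuls, each reading X once.
Q, K, V = X @ W_Q, X @ W_K, X @ W_V

# One fused MatMul: concatenate the weights column-wise, then split the result.
W_QKV = np.hstack([W_Q, W_K, W_V])                  # shape (d, 3 * d_k)
Q_f, K_f, V_f = np.split(X @ W_QKV, 3, axis=1)      # three (n, d_k) blocks

print(np.allclose(Q, Q_f) and np.allclose(K, K_f) and np.allclose(V, V_f))  # True
```

The arithmetic is identical; the fused form issues one larger product instead of three smaller ones.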

Summary

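Model       | Core Operation       | MatMul Conversion Example
------------|----------------------|--------------------------
RNN         | Hidden state update  | $h_t = \sigma([W_{xh} \; W_{hh}][x_t; h_{t-1}] + b_h)$
Transformer | Self-Attention       | $Q K^{T}$ ($n \times n$ attention scores)
Transformer | Feed-Forward Network | $\mathrm{ReLU}(X W_1 + b_1) W_2 + b_2$ (two large MatMuls)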
