CO 2 6 Transformers

The document provides an overview of Transformers in the context of Computer Vision, detailing their advantages over traditional CNNs, such as modeling global relationships and handling long-range dependencies. It outlines the structure and process of Vision Transformers (ViT), including image patching, embedding, and the use of self-attention mechanisms. The session aims to familiarize students with the course objectives, basic concepts of Transformers, and their applications in image processing.

Department of CSE

23AVI3202R - COMPUTER VISION (R)

TOPIC:
TRANSFORMERS

Session - 1

CREATED BY K. VICTOR BABU

AIM OF THE SESSION
To familiarize students with the COs, syllabus and evaluation plan of the course.
To familiarize students with the basic concepts of computer vision.

INSTRUCTIONAL OBJECTIVES

This Session is designed to:

1. Explain the COs, syllabus, and evaluation plan of the course
2. Describe the basic concepts of Transformers
3. List out the various applications of Transformers

LEARNING OUTCOMES

At the end of this session, you should be able to:

1. Define the concept of Transformers
2. Describe the various applications of Transformers

“Transformer”

Introduction to Transformers in Image Processing
Limitations of CNNs
❌ Each convolution layer captures only local features
❌ Long-range dependencies require many stacked layers
❌ Fixed inductive biases (locality, translation equivariance)

Motivation for Vision Transformers

Transformers were highly successful in NLP (natural language processing) because they:
✔ Model global relationships
✔ Use self-attention
✔ Learn long-range dependencies directly

• Transformers are deep learning models originally developed for sequence modeling (language), based on the self-attention mechanism.
• In image processing, they are adapted to handle images by treating parts of an image as a sequence of visual tokens.
Why Transformers for Images?
• Traditional CNNs are excellent at capturing local features (edges, textures), but they struggle with long-range dependencies. Transformers overcome this by:
• Modeling global relationships between all parts of an image
• Using self-attention instead of convolution
• Enabling parallel computation, which improves scalability

Transformer in CV
1. The transformer works with a set of tokens.
2. What are tokens in images?
Tokens are regions of interest in the image; in ViT, they are fixed-size image patches.

How Transformers Process Images
• Image Patching:
An input image is divided into fixed-size patches (e.g., 16×16 pixels).
• Patch Embedding:
Each patch is flattened and projected into a feature vector.
• Positional Encoding:
Spatial information is added so the model knows where each patch comes from.
• Transformer Encoder:
Stacked layers of multi-head self-attention and MLPs learn global context.
• Classification / Output Head:
A special token (or pooled features) is used for image classification or other tasks.
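
The first two steps above (patching and patch embedding) can be sketched in NumPy. This is a minimal illustration under stated assumptions, not a reference ViT implementation: the 224×224×3 image, the embedding size `embed_dim = 64`, and the randomly initialised projection `W_embed` are hypothetical choices made for the example.

```python
import numpy as np

def patchify(image, patch_size):
    """Split an (H, W, C) image into flattened, non-overlapping square patches."""
    h, w, c = image.shape
    assert h % patch_size == 0 and w % patch_size == 0, "image must tile evenly"
    gh, gw = h // patch_size, w // patch_size
    # (gh, P, gw, P, C) -> (gh, gw, P, P, C) -> (gh * gw, P * P * C)
    patches = image.reshape(gh, patch_size, gw, patch_size, c)
    patches = patches.transpose(0, 2, 1, 3, 4)
    return patches.reshape(gh * gw, patch_size * patch_size * c)

rng = np.random.default_rng(0)
image = rng.random((224, 224, 3))          # stand-in for a real RGB image

# Image patching: 224 / 16 = 14 patches per side, 14 * 14 = 196 patches.
patches = patchify(image, patch_size=16)   # shape (196, 768)

# Patch embedding: one shared linear projection applied to every flattened patch.
embed_dim = 64                             # hypothetical embedding size
W_embed = rng.normal(scale=0.02, size=(patches.shape[1], embed_dim))
b_embed = np.zeros(embed_dim)
tokens = patches @ W_embed + b_embed       # shape (196, 64)

print(patches.shape, tokens.shape)
```

Each row of `tokens` plays the role that a word embedding plays in NLP; in a trained model `W_embed` and `b_embed` would be learned rather than random.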
ViT Architecture

Vision Transformer (ViT)
• Image to Patches: The image is split into a sequence of small, fixed-size patches, which are then embedded as vectors to act as input tokens.

• Patch Embedding: Each patch is flattened and mapped to a vector of fixed dimension, turning the structured image data into a sequence of embeddings. Without patch embeddings, an image cannot be fed into a standard Transformer.

• Adding the CLS Token: A learnable class token [CLS] is prepended to the sequence. It gathers contextual information from all patches via self-attention, so it can represent the entire image's content for tasks like image classification.

• Position Embeddings: Learnable, ordered vectors are added to all tokens (CLS and patches) to provide the spatial information the Transformer needs to understand the image structure. Without position embeddings, the Transformer would treat the image as a "bag of patches," losing all spatial context.
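
The CLS token and position embedding steps can be sketched as follows. This is an illustrative sketch only: the sizes (`num_patches = 196`, `embed_dim = 64`) and the random stand-in for the patch embeddings are assumptions, and the "learnable" parameters are simply initialised here rather than trained.

```python
import numpy as np

rng = np.random.default_rng(0)
num_patches, embed_dim = 196, 64           # hypothetical sizes

# Stand-in for the patch embeddings produced by the previous step.
patch_tokens = rng.normal(size=(num_patches, embed_dim))

# Learnable parameters (only initialised here; training would update them).
cls_token = np.zeros((1, embed_dim))                                  # one [CLS] vector
pos_embed = rng.normal(scale=0.02, size=(num_patches + 1, embed_dim)) # one vector per position

# Prepend [CLS], then add a position embedding to every token (CLS and patches).
sequence = np.concatenate([cls_token, patch_tokens], axis=0)          # shape (197, 64)
sequence = sequence + pos_embed

print(sequence.shape)
```

Because each position receives a different vector, shuffling the patches now changes the encoder input, which is what prevents the "bag of patches" behaviour.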

• Transformer Encoder: The encoder treats the flattened, linearly projected patches as sequence tokens, similar to words in NLP. It applies global self-attention to model dependencies between all patches, so the model captures global context rather than relying only on local convolutions. This supports tasks like classification, detection, and segmentation.

• Classification: The output of the CLS token (its final embedding after all encoder layers) is fed into a simple MLP (multilayer perceptron) head, which uses this summary of the entire image to make the prediction.

• MLP Head: The final classification layer placed on top of the Transformer encoder. It maps the encoder's feature representation to the final output (e.g., class probabilities).
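
The classification step can be sketched as a small MLP applied to the CLS output. Everything numeric here is an assumption for illustration: the encoder output is a random stand-in, the sizes are hypothetical, and the one-hidden-layer `tanh` MLP is just one possible head design.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over the last axis."""
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
embed_dim, hidden_dim, num_classes = 64, 128, 10   # hypothetical sizes

# Stand-in for the encoder output: row 0 is the CLS token, rows 1.. are patches.
encoder_output = rng.normal(size=(197, embed_dim))
cls_output = encoder_output[0]

# MLP head: one hidden layer, then a projection to class scores (logits).
W1 = rng.normal(scale=0.02, size=(embed_dim, hidden_dim))
b1 = np.zeros(hidden_dim)
W2 = rng.normal(scale=0.02, size=(hidden_dim, num_classes))
b2 = np.zeros(num_classes)

hidden = np.tanh(cls_output @ W1 + b1)
logits = hidden @ W2 + b2
probs = softmax(logits)                  # class probabilities, summing to 1
predicted_class = int(np.argmax(probs))

print(probs.round(3), predicted_class)
```

Only the CLS row is used for the prediction; the patch rows influence it indirectly through self-attention in the encoder.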

What is Attention?
Attention is a mechanism that allows a model to focus on the most relevant parts of the input while processing information.
• Instead of treating all inputs equally, attention assigns different importance (weights) to different elements.

Mathematically:
$$\text{Attention} = \sum_i \text{weight}_i \times \text{value}_i$$
• Higher weight → more importance
• Lower weight → less importance
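
A tiny worked example of the weighted sum above, with hand-picked numbers. The weights and values are made up for illustration; in a real model the weights come from a softmax over learned similarity scores.

```python
import numpy as np

# Three input elements, each with a 2-dimensional value vector.
values = np.array([[1.0, 0.0],
                   [0.0, 1.0],
                   [1.0, 1.0]])

# Attention weights: non-negative and summing to 1.
weights = np.array([0.7, 0.2, 0.1])

# Attention = sum_i weight_i * value_i
attention_output = (weights[:, None] * values).sum(axis=0)

# 0.7*[1, 0] + 0.2*[0, 1] + 0.1*[1, 1] is approximately [0.8, 0.3]
print(attention_output)
```

The first element dominates the result because it carries the highest weight.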

Why attention is needed

• Captures long-range dependencies
• Improves context understanding
• Reduces information loss in long sequences

What is Self-Attention?
Self-attention is an attention mechanism where elements in a sequence attend to other elements within the same sequence.

Example
Sentence: “The animal didn’t cross the road because it was tired.”
Question: What does “it” refer to?
Self-attention computes relationships between:
• it ↔ animal
• it ↔ road
The model learns to give the “it” ↔ “animal” pair the higher weight, so “it” is resolved to “animal”.

How it works
• Each word creates:
• Query (Q)
• Key (K)
• Value (V)
• Then:
• $\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right) V$
• So each word:
• looks at all other words
• decides how much to attend to them

• No external input
• Same sequence provides Q, K, V
• Hence the name self-attention.
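
Scaled dot-product self-attention, as described above, can be sketched in NumPy. The token count, dimensions, and random projection matrices `W_q`, `W_k`, `W_v` are assumptions for the example; in a trained model the projections are learned.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over the last axis."""
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    """softmax(Q K^T / sqrt(d_k)) V, with Q, K, V all derived from the same X."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    d_k = K.shape[-1]                       # dimension of the key vectors
    scores = Q @ K.T / np.sqrt(d_k)         # (n, n): every token scored against every token
    weights = softmax(scores)               # each row sums to 1
    return weights @ V, weights

rng = np.random.default_rng(0)
n_tokens, d_model, d_k = 5, 8, 4            # hypothetical sizes
X = rng.normal(size=(n_tokens, d_model))    # one row per token (word or patch)
W_q, W_k, W_v = (rng.normal(size=(d_model, d_k)) for _ in range(3))

output, weights = self_attention(X, W_q, W_k, W_v)
print(output.shape, weights.shape)
```

Row `i` of `weights` shows how much token `i` attends to every token in the same sequence, including itself.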

What is Global Attention?
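Global attention allows each token to attend to all tokens in the entire sequence.
Every position can see the whole sequence (the whole sentence in NLP, every patch in an image).

Characteristics
✔ Captures long-distance relationships
✔ High accuracy, especially with large training datasets
❌ Computationally expensive: cost grows as O(n²) in the number of tokens n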
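
To make the quadratic cost concrete, the sketch below counts the entries of one attention matrix for a square image at several patch sizes. The 224×224 image size and the single extra [CLS] token are assumptions for the example; the count is per head and per layer.

```python
def attention_matrix_entries(image_size, patch_size, extra_tokens=1):
    """Return (number of tokens n, n * n) for a square image split into square patches."""
    n = (image_size // patch_size) ** 2 + extra_tokens
    return n, n * n

for patch_size in (32, 16, 8):
    n, entries = attention_matrix_entries(224, patch_size)
    print(f"patch {patch_size:>2}: n = {n:>3} tokens, attention matrix has {entries} entries")
```

Halving the patch size roughly quadruples the number of tokens, so the attention matrix grows by roughly a factor of sixteen.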

Where global attention is used

• Original Transformer
• BERT
• GPT (with a causal mask: each token attends to all earlier tokens)
• Machine translation

What is Multi-Head Attention?

Multi-head attention uses multiple attention mechanisms in parallel to learn different types of relationships simultaneously.

How it works
• Input embedding is projected into multiple Q, K, V sets:
• For each head i:
• $\text{head}_i = \text{Attention}(Q_i, K_i, V_i)$
• Then:
• $\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \ldots, \text{head}_h)\, W^O$
Why multiple heads?
Because a single attention head cannot capture every type of relationship.
Multi-head allows:
• richer representation
• better contextual understanding
• parallel learning of relations
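
Multi-head attention can be sketched by splitting the projected Q, K, V into heads, attending per head, then concatenating and applying the output projection. The sizes and random weights are assumptions for illustration, and splitting one `d_model`-wide projection into `num_heads` slices is one common way to form the per-head Q, K, V sets.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over the last axis."""
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(X, W_q, W_k, W_v, W_o, num_heads):
    n, d_model = X.shape
    assert d_model % num_heads == 0, "d_model must divide evenly across heads"
    d_head = d_model // num_heads

    def split_heads(M):
        # (n, d_model) -> (num_heads, n, d_head)
        return M.reshape(n, num_heads, d_head).transpose(1, 0, 2)

    Q, K, V = split_heads(X @ W_q), split_heads(X @ W_k), split_heads(X @ W_v)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)     # (num_heads, n, n)
    heads = softmax(scores) @ V                             # head_i = Attention(Q_i, K_i, V_i)
    concat = heads.transpose(1, 0, 2).reshape(n, d_model)   # Concat(head_1, ..., head_h)
    return concat @ W_o                                     # multiply by W^O

rng = np.random.default_rng(0)
n_tokens, d_model, num_heads = 5, 8, 2                      # hypothetical sizes
X = rng.normal(size=(n_tokens, d_model))
W_q, W_k, W_v, W_o = (rng.normal(size=(d_model, d_model)) for _ in range(4))

out = multi_head_attention(X, W_q, W_k, W_v, W_o, num_heads)
print(out.shape)
```

Each head works in its own `d_head`-dimensional subspace, so the heads can learn different attention patterns at the same total width.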
Vision Transformers (ViT)

• The Vision Transformer (ViT) is a widely used transformer-based image model. It:
• Treats image patches like words in a sentence
• Uses only transformer encoders (no convolution)
• Achieves competitive or superior performance to CNNs when trained on large datasets

Flow Diagram
Input Image
↓
Split into Patches
↓
Patch Embedding + Positional Encoding
↓
Transformer Encoder (× L layers)
↓
Classification Head
↓
Output Class
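
The whole flow can be traced end to end in a compact NumPy sketch. It is a toy under several simplifying assumptions that are not taken from the slides: single-head attention, a pre-norm layer layout, ReLU in the MLP, layer norm without learned scale or shift, no bias terms, random untrained weights, and small hypothetical sizes.

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def layer_norm(x, eps=1e-6):
    # Normalise each token vector (no learned scale/shift in this sketch).
    return (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + eps)

def init_layer(rng, d, d_mlp):
    shapes = {"Wq": (d, d), "Wk": (d, d), "Wv": (d, d), "Wo": (d, d),
              "W1": (d, d_mlp), "W2": (d_mlp, d)}
    return {name: rng.normal(scale=0.02, size=shape) for name, shape in shapes.items()}

def encoder_layer(x, p):
    # Self-attention sub-layer with a residual connection.
    h = layer_norm(x)
    q, k, v = h @ p["Wq"], h @ p["Wk"], h @ p["Wv"]
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1])) @ v
    x = x + attn @ p["Wo"]
    # MLP sub-layer with a residual connection.
    h = layer_norm(x)
    return x + np.maximum(h @ p["W1"], 0.0) @ p["W2"]

rng = np.random.default_rng(0)
P, D, L, num_classes = 16, 32, 2, 10       # patch size, width, depth, classes

# Input image -> split into patches
image = rng.random((64, 64, 3))
g = 64 // P
patches = image.reshape(g, P, g, P, 3).transpose(0, 2, 1, 3, 4).reshape(g * g, -1)

# Patch embedding + [CLS] token + positional encoding
x = patches @ rng.normal(scale=0.02, size=(patches.shape[1], D))
x = np.concatenate([np.zeros((1, D)), x]) + rng.normal(scale=0.02, size=(g * g + 1, D))

# Transformer encoder (x L layers)
layers = [init_layer(rng, D, 4 * D) for _ in range(L)]
for params in layers:
    x = encoder_layer(x, params)

# Classification head on the CLS token -> output class
logits = layer_norm(x)[0] @ rng.normal(scale=0.02, size=(D, num_classes))
output_class = int(np.argmax(logits))
print(patches.shape, x.shape, logits.shape, output_class)
```

The sequence shape stays fixed through every encoder layer; only the head reduces it to a single vector of class scores.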

Figures for each step: Input Image & Patch Splitting; Patch Embedding; Class Token & Positional Encoding; Transformer Encoder (Repeated L Times); Classification Head

CNN vs Transformers

• CNNs learn local patterns (edges, textures) using convolution and gradually build global understanding.
• Use CNNs for local feature extraction
• Transformers (ViT) learn global relationships directly using self-attention over image patches.
• Use Transformers for global reasoning

THANK YOU
