Department of CSE
23AVI3202R - COMPUTER VISION (R)
TOPIC:
TRANSFORMERS
Session - 1
CREATED BY K. VICTOR BABU
AIM OF THE SESSION
To familiarize students with the COs, syllabus and evaluation plan of the course.
To familiarize students with the basic concepts of computer vision
INSTRUCTIONAL OBJECTIVES
This Session is designed to:
1. Explain the COs, syllabus and evaluation plan of the course
2. Describe the basic concepts of Transformers
3. List out the various applications of Transformers
LEARNING OUTCOMES
At the end of this session, you should be able to:
1. Define the concept of Transformers
2. Describe the various applications of Transformers
“Transformer”
Introduction to Transformers in Image Processing
Limitations of CNNs
❌ Capture only local features
❌ Long-range dependencies need many layers
❌ Fixed inductive biases (locality, translation equivariance)
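The claim that long-range dependencies need many layers can be checked with simple arithmetic. The sketch below assumes a plain stack of stride-1 convolutions with no pooling or dilation, which is a simplification; real CNNs use strides and pooling to grow the receptive field faster. The function names are illustrative.

```python
def receptive_field(num_layers: int, kernel_size: int = 3) -> int:
    """Receptive field (pixels along one axis) of stacked stride-1 convolutions."""
    # Each stride-1 layer widens the field by (kernel_size - 1) pixels.
    return 1 + num_layers * (kernel_size - 1)


def layers_to_cover(image_size: int, kernel_size: int = 3) -> int:
    """Smallest number of stride-1 layers whose receptive field spans the image."""
    layers = 0
    while receptive_field(layers, kernel_size) < image_size:
        layers += 1
    return layers


print(receptive_field(5))    # 11
print(layers_to_cover(224))  # 112
```

Under these assumptions, one output position only sees the full width of a 224-pixel image after 112 layers of 3×3 convolution, whereas a single self-attention layer connects every patch to every other patch.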
Motivation for Vision Transformers
Transformers were highly successful in NLP (natural language processing) because they:
✔ Model global relationships
✔ Use self-attention
✔ Learn long-range dependencies directly
• Transformers are deep learning models originally developed for sequence modeling (language),
based on the self-attention mechanism.
• In image processing, they are adapted to handle images by treating parts of an image as a sequence
of visual tokens.
Why Transformers for Images?
• Traditional CNNs are excellent at capturing local
features (edges, textures), but they struggle with
long-range dependencies. Transformers overcome
this by:
• Modeling global relationships between all parts of
an image
• Using self-attention instead of convolution
• Enabling parallel computation, which improves
scalability
Transformer in CV
1. The transformer works with a set of tokens
2. What are tokens in images?
Tokens are regions of the image; in ViT, each token is a fixed-size patch.
Transformer in CV
How Transformers Process Images
• Image Patching:
An input image is divided into fixed-size patches (e.g., 16×16 pixels).
• Patch Embedding:
Each patch is flattened and projected into a feature vector.
• Positional Encoding:
Spatial information is added so the model knows where each patch comes
from.
• Transformer Encoder:
Stacked layers of multi-head self-attention and MLPs learn global
context.
• Classification / Output Head:
A special token (or pooled features) is used for image classification or
other tasks.
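The patching and patch-embedding steps above can be sketched in NumPy. This is a minimal illustration, not a reference implementation: the image is random data, the embedding width of 64 is an arbitrary assumption, and the projection matrix is random where a trained model would use learned weights.

```python
import numpy as np


def image_to_patches(image: np.ndarray, patch_size: int) -> np.ndarray:
    """Split an (H, W, C) image into flattened, non-overlapping patches."""
    h, w, c = image.shape
    assert h % patch_size == 0 and w % patch_size == 0, "image must tile evenly"
    gh, gw = h // patch_size, w // patch_size
    # (gh, P, gw, P, C) -> (gh, gw, P, P, C) -> (gh * gw, P * P * C)
    patches = image.reshape(gh, patch_size, gw, patch_size, c)
    patches = patches.transpose(0, 2, 1, 3, 4)
    return patches.reshape(gh * gw, patch_size * patch_size * c)


rng = np.random.default_rng(0)
image = rng.random((224, 224, 3))       # stand-in for a real RGB image
patches = image_to_patches(image, 16)   # 14 x 14 = 196 patches of 16*16*3 = 768 values

embed_dim = 64                          # assumed embedding width
w_embed = rng.normal(scale=0.02, size=(patches.shape[1], embed_dim))
tokens = patches @ w_embed              # one embedding vector per patch

print(patches.shape, tokens.shape)      # (196, 768) (196, 64)
```

Each row of `tokens` plays the role that a word embedding plays in NLP.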
ViT Architecture
Vision Transformer(ViT)
• Image to Patches: The image is split into a sequence of small, fixed-size patches, which are then
embedded as vectors to act as input tokens.
• Patch embedding: It converts structured image data into a sequence of vector embeddings by
splitting images into small, flattened patches and mapping them to a fixed-size dimension.
Without Patch Embeddings: Images cannot be fed into a standard Transformer.
• Adding CLS Token: A learnable class token [CLS] is prepended to the patch sequence. It gathers
contextual information from all patches via self-attention, enabling it to represent the entire
image's content for tasks like image classification.
• Position embeddings are learnable, ordered vectors added to all tokens (CLS and patches) to
provide the spatial information needed for Transformers to understand the image
structure. Without Position Embedding, the Transformer would treat the image as a "bag of
patches," losing all spatial context.
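A minimal sketch of the CLS token and position embedding steps, assuming 196 patch tokens of width 64. The sizes and variable names are illustrative, and the random arrays stand in for parameters that a real model would learn.

```python
import numpy as np

rng = np.random.default_rng(0)
num_patches, embed_dim = 196, 64                 # illustrative sizes
patch_tokens = rng.normal(size=(num_patches, embed_dim))

# In a trained model these are learned parameters; here they are random stand-ins.
cls_token = rng.normal(scale=0.02, size=(1, embed_dim))
pos_embed = rng.normal(scale=0.02, size=(num_patches + 1, embed_dim))

# Prepend [CLS], then add one position vector per token (CLS included).
sequence = np.concatenate([cls_token, patch_tokens], axis=0)
sequence = sequence + pos_embed

print(sequence.shape)                            # (197, 64)
```

The sequence grows by one token, and every token now carries a vector that depends on its position in the image grid.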
Vision Transformer(ViT)
• A Transformer encoder processes the image as a sequence of tokens (the flattened, linearly
projected patches), similar to words in NLP. It applies global self-attention to model dependencies
between all pairs of patches, so the model captures global context instead of relying only on local
convolutions. This can improve performance in tasks like classification, detection, and segmentation.
• Classification: The output of the CLS token (its final embedding after all layers) is fed
into a simple MLP (Multi-Layer Perceptron) head for final classification, effectively
summarizing the entire image for the prediction.
• The MLP (Multilayer Perceptron) head is the final classification layer placed on top of the
transformer encoder, responsible for taking the processed, rich feature representations and mapping
them to the final output (e.g., class probabilities).
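The classification step can be sketched as follows. This assumes the CLS token sits at position 0 of the encoder output and uses a single linear layer followed by softmax as the head, which is a simplification of the MLP head described above. The encoder output and weights are random stand-ins, and the sizes are illustrative.

```python
import numpy as np


def softmax(x: np.ndarray) -> np.ndarray:
    """Numerically stable softmax over the last axis."""
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)


rng = np.random.default_rng(0)
embed_dim, num_classes = 64, 10                      # illustrative sizes
encoder_output = rng.normal(size=(197, embed_dim))   # stand-in for the encoder output

cls_output = encoder_output[0]                       # final embedding of the CLS token
w_head = rng.normal(scale=0.02, size=(embed_dim, num_classes))
b_head = np.zeros(num_classes)

logits = cls_output @ w_head + b_head
probs = softmax(logits)                              # class probabilities
predicted_class = int(np.argmax(probs))
print(probs.shape, predicted_class)
```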
What is Attention?
Attention is a mechanism that allows a model to focus on the most relevant parts of the input
while processing information.
• Instead of treating all inputs equally, attention assigns different importance (weights) to different
elements.
Mathematically:
$\text{Attention} = \sum_i \text{weight}_i \times \text{value}_i$
• Higher weight → more importance
• Lower weight → less importance
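A small worked example of the weighted sum, with made-up weights and values chosen only for illustration:

```python
weights = [0.7, 0.2, 0.1]                          # importance of each input (sums to 1)
values = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]      # one 2-D value vector per input

# Attention output = sum over i of weight_i * value_i, computed per dimension.
attention_output = [
    sum(w * v[d] for w, v in zip(weights, values))
    for d in range(len(values[0]))
]
print(attention_output)
```

The result is approximately [0.8, 0.3]: the first value dominates because it has the highest weight.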
Why attention is needed
• Captures long-range dependencies
• Improves context understanding
• Avoids information loss in long sequences
What is Self-Attention?
Self-attention is an attention mechanism where elements in a sequence attend to other elements
within the same sequence.
Example
Sentence: “The animal didn’t cross the road because it was tired.”
Question: What does “it” refer to?
Self-attention computes relationships between:
• it ↔ animal
• it ↔ road
• The model learns that “it” refers to “animal”.
How it works
• Each word creates:
• Query (Q)
• Key (K)
• Value (V)
• Then:
• $\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^{T}}{\sqrt{d_k}}\right) V$
• So each word:
• looks at all other words
• decides how much to attend to them
• No external input
• Same sequence provides Q, K, V
• Hence the name self-attention.
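The formula above can be sketched in NumPy as single-head scaled dot-product self-attention. The token count, dimensions, and random projection matrices are assumptions for illustration; a trained model would learn `w_q`, `w_k`, and `w_v`.

```python
import numpy as np


def softmax(x: np.ndarray) -> np.ndarray:
    """Numerically stable softmax over the last axis."""
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)


def self_attention(x, w_q, w_k, w_v):
    """Scaled dot-product self-attention: Q, K, V all come from the same x."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    d_k = k.shape[-1]
    scores = q @ k.T / np.sqrt(d_k)     # similarity of every token to every token
    weights = softmax(scores)           # each row sums to 1
    return weights @ v, weights


rng = np.random.default_rng(0)
n_tokens, d_model, d_k = 5, 8, 4        # illustrative sizes
x = rng.normal(size=(n_tokens, d_model))
w_q, w_k, w_v = (rng.normal(size=(d_model, d_k)) for _ in range(3))

output, weights = self_attention(x, w_q, w_k, w_v)
print(output.shape, weights.shape)      # (5, 4) (5, 5)
```

Row i of `weights` shows how much token i attends to each token in the sequence, including itself.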
What is Global Attention?
Global attention allows each token to attend to all tokens in the entire sequence.
Every position can see the whole sentence.
Characteristics
✔ Captures long-distance relationships
✔ High accuracy
❌ Computationally expensive (O(n²))
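The O(n²) cost can be made concrete for images. The sketch below counts tokens and attention weights for a 224×224 image at a few patch sizes; it ignores the extra class token, and the function name is illustrative.

```python
def attention_matrix_entries(image_size: int, patch_size: int):
    """Return (number of patch tokens n, n * n attention weights per head)."""
    n = (image_size // patch_size) ** 2
    return n, n * n


for patch_size in (32, 16, 8):
    n, entries = attention_matrix_entries(224, patch_size)
    print(f"patch {patch_size:>2}: {n:>4} tokens, {entries:>7} attention weights per head")
```

Halving the patch size multiplies the token count by 4 and the attention matrix by 16 (49, 196, and 784 tokens give 2,401, 38,416, and 614,656 weights).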
Where global attention is used
• Original Transformer
• BERT
• GPT (with a causal mask, so each token attends to all earlier tokens)
• Machine translation
What is Multi-Head Attention?
Multi-head attention uses multiple attention mechanisms in parallel to learn different types of relationships
simultaneously.
How it works
• Input embedding is projected into multiple Q, K, V sets:
• For each head i:
• $\text{head}_i = \text{Attention}(Q_i, K_i, V_i)$
• Then:
• $\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \ldots, \text{head}_h)\, W^{O}$
Why multiple heads?
Because a single attention head cannot capture every type of relationship.
Multi-head allows:
• richer representation
• better contextual understanding
• parallel learning of relations
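A minimal NumPy sketch of multi-head self-attention. It assumes the common layout in which one projection of width d_model is split evenly across heads; the sizes and random weights are illustrative stand-ins for learned parameters.

```python
import numpy as np


def softmax(x: np.ndarray) -> np.ndarray:
    """Numerically stable softmax over the last axis."""
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)


def multi_head_attention(x, w_q, w_k, w_v, w_o, num_heads):
    """Run num_heads attention heads in parallel, concatenate, then apply W^O."""
    n, d_model = x.shape
    assert d_model % num_heads == 0, "d_model must divide evenly across heads"
    d_head = d_model // num_heads

    def split_heads(t):
        # (n, d_model) -> (num_heads, n, d_head)
        return t.reshape(n, num_heads, d_head).transpose(1, 0, 2)

    q, k, v = split_heads(x @ w_q), split_heads(x @ w_k), split_heads(x @ w_v)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)   # (num_heads, n, n)
    heads = softmax(scores) @ v                           # (num_heads, n, d_head)
    concat = heads.transpose(1, 0, 2).reshape(n, d_model) # Concat(head_1..head_h)
    return concat @ w_o


rng = np.random.default_rng(0)
n_tokens, d_model, num_heads = 6, 16, 4                   # illustrative sizes
x = rng.normal(size=(n_tokens, d_model))
w_q, w_k, w_v, w_o = (rng.normal(scale=0.1, size=(d_model, d_model)) for _ in range(4))

out = multi_head_attention(x, w_q, w_k, w_v, w_o, num_heads)
print(out.shape)                                          # (6, 16)
```

Each head works on its own 4-dimensional slice, so the four heads can weight the same tokens differently before `w_o` mixes their outputs.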
Vision Transformers (ViT)
• The Vision Transformer (ViT) is one of the most widely used transformer-based
image models. It:
• Treats image patches like words in a sentence
• Uses only transformer encoders (no convolution)
• Achieves competitive or superior performance to CNNs when trained on large
datasets
Flow Diagram
Input Image
↓
Split into Patches
↓
Patch Embedding + Positional Encoding
↓
Transformer Encoder (× L layers)
↓
Classification Head
↓
Output Class
Input Image & Patch Splitting
Patch Embedding
Class Token & Positional Encoding
Transformer Encoder (Repeated L Times)
Classification Head
CNN vs Transformers
• CNNs learn local patterns (edges, textures) using convolution and
gradually build global understanding.
• Use CNNs for local feature extraction
• Transformers (ViT) learn global relationships directly using
self-attention over image patches.
• Use Transformers for global reasoning
THANK YOU