CO 2 6 Transformers

The document provides an overview of Transformers in the context of Computer Vision, detailing their advantages over traditional CNNs, such as modeling global relationships and handling long-range dependencies. It outlines the structure and process of Vision Transformers (ViT), including image patching, embedding, and the use of self-attention mechanisms. The session aims to familiarize students with the course objectives, basic concepts of Transformers, and their applications in image processing.

Department of CSE

23AVI3202R - COMPUTER VISION (R)


TOPIC:
TRANSFORMERS

Session - 1

CREATED BY K. VICTOR BABU


AIM OF THE SESSION
To familiarize students with the COs, syllabus and evaluation plan of the course.
To familiarize students with the basic concepts of computer vision.

INSTRUCTIONAL OBJECTIVES

This Session is designed to:


1. Explain the COs, syllabus and evaluation plan of the course
2. Describe the basic concepts of Transformers
3. List out the various applications of Transformers

LEARNING OUTCOMES

At the end of this session, you should be able to:


1. Define the concept of Transformers
2. Describe the various applications of Transformers



“Transformer”



Introduction to Transformers in Image Processing
Limitations of CNNs
❌ Capture only local features
❌ Long-range dependencies need many layers
❌ Fixed inductive biases (locality, translation equivariance)

Motivation for Vision Transformers


Transformers were highly successful in NLP (natural language processing) because they:
✔ Model global relationships
✔ Use self-attention
✔ Learn long-range dependencies directly

• Transformers are deep learning models originally developed for sequence modeling (language),
based on the self-attention mechanism.
• In image processing, they are adapted to handle images by treating parts of an image as a sequence
of visual tokens.
Why Transformers for Images?
• Traditional CNNs are excellent at capturing local
features (edges, textures), but they struggle with
long-range dependencies. Transformers overcome
this by:
• Modeling global relationships between all parts of
an image
• Using self-attention instead of convolution
• Enabling parallel computation, which improves
scalability



Transformer in CV
1. The transformer works with a set of tokens
2. What are tokens in images?
Tokens are regions of the image; in ViT, each token is a fixed-size patch.





How Transformers Process Images
• Image Patching:
An input image is divided into fixed-size patches (e.g., 16×16 pixels).
• Patch Embedding:
Each patch is flattened and projected into a feature vector.
• Positional Encoding:
Spatial information is added so the model knows where each patch comes
from.
• Transformer Encoder:
Stacked layers of multi-head self-attention and MLPs learn global
context.
• Classification / Output Head:
A special token (or pooled features) is used for image classification or
other tasks.
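
The first two steps above (image patching and patch embedding) can be sketched in NumPy. This is a minimal illustration, not a reference implementation: the 224×224 input size, the embedding width of 64, and the names `patchify`, `W_embed` and `b_embed` are assumptions chosen for the example, and the projection weights are random rather than learned.

```python
import numpy as np

def patchify(image, patch_size):
    """Split an (H, W, C) image into flattened, non-overlapping patches."""
    h, w, c = image.shape
    assert h % patch_size == 0 and w % patch_size == 0
    gh, gw = h // patch_size, w // patch_size
    # (gh, p, gw, p, c) -> (gh, gw, p, p, c) -> (gh * gw, p * p * c)
    patches = image.reshape(gh, patch_size, gw, patch_size, c)
    patches = patches.transpose(0, 2, 1, 3, 4)
    return patches.reshape(gh * gw, patch_size * patch_size * c)

rng = np.random.default_rng(0)
image = rng.random((224, 224, 3))            # assumed input size
patches = patchify(image, patch_size=16)     # shape (196, 768)

# Patch embedding: one linear projection shared by all patches.
embed_dim = 64                               # assumed, kept small for illustration
W_embed = rng.normal(0.0, 0.02, (patches.shape[1], embed_dim))  # learned in practice
b_embed = np.zeros(embed_dim)
tokens = patches @ W_embed + b_embed         # shape (196, 64)
```

Patches are ordered row by row, so `patches[1]` is the 16×16 block immediately to the right of `patches[0]`.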
ViT Architecture





Vision Transformer(ViT)
• Image to Patches: The image is split into a sequence of small, fixed-size patches, which are then
embedded as vectors to act as input tokens.

• Patch embedding: It converts structured image data into a sequence of vector embeddings by
splitting images into small, flattened patches and mapping them to a fixed-size dimension.
Without Patch Embeddings: Images cannot be fed into a standard Transformer.

• Adding CLS Token: A learnable class token [CLS] is prepended to the patch sequence. It gathers
contextual information from all patches via self-attention, enabling it to represent the entire
image's content for tasks like image classification.

• Position embeddings are learnable, ordered vectors added to all tokens (CLS and patches) to
provide the spatial information needed for Transformers to understand the image
structure. Without Position Embedding, the Transformer would treat the image as a "bag of
patches," losing all spatial context.
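
The CLS token and position embedding steps can be sketched as follows. The sizes (196 patches, embedding width 64) and the variable names are assumptions for illustration. In a trained model `cls_token` and `pos_embed` are learned parameters; here they are a zero vector and random values.

```python
import numpy as np

rng = np.random.default_rng(0)
num_patches, embed_dim = 196, 64             # assumed sizes

# Stand-in for the embedded patches produced by the patch embedding step.
patch_tokens = rng.normal(size=(num_patches, embed_dim))

cls_token = np.zeros((1, embed_dim))                            # learned in practice
pos_embed = rng.normal(0.0, 0.02, (num_patches + 1, embed_dim)) # learned in practice

# Prepend [CLS], then add one position vector per token (CLS included).
sequence = np.concatenate([cls_token, patch_tokens], axis=0)    # shape (197, 64)
sequence = sequence + pos_embed
```

The sequence length grows from 196 to 197 because of the extra class token at index 0.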



Vision Transformer(ViT)

• A Transformer encoder processes images by treating flattened, linearly projected patches as sequence
tokens, similar to words in NLP. It applies global self-attention to model dependencies between all
patches, so the model captures global context rather than relying only on local convolutions. This
supports tasks such as classification, detection, and segmentation.

• Classification: The output of the CLS token (its final embedding after all layers) is fed
into a simple MLP (Multi-Layer Perceptron) head for final classification, effectively
summarizing the entire image for the prediction.

• The MLP (Multilayer Perceptron) head is the final classification layer placed on top of the
transformer encoder, responsible for taking the processed, rich feature representations and mapping
them to the final output (e.g., class probabilities).
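
A minimal sketch of the classification step, assuming the encoder output is already available. The encoder output here is random stand-in data, the head is a single linear layer (the simplest form of an MLP head), and the 10 classes and the names `W_head` and `b_head` are assumptions for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
embed_dim, num_classes = 64, 10               # assumed sizes

# Stand-in for the encoder output: 1 CLS token + 196 patch tokens.
encoder_output = rng.normal(size=(197, embed_dim))

cls_final = encoder_output[0]                 # final embedding of the CLS token

W_head = rng.normal(0.0, 0.02, (embed_dim, num_classes))  # learned in practice
b_head = np.zeros(num_classes)

logits = cls_final @ W_head + b_head
probs = softmax(logits)                       # class probabilities
predicted_class = int(np.argmax(probs))
```

Only the CLS row of the encoder output is used; the patch rows influence the prediction indirectly, through self-attention in the encoder layers.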



What is Attention?
Attention is a mechanism that allows a model to focus on the most relevant parts of the input
while processing information.
• Instead of treating all inputs equally, attention assigns different importance (weights) to different
elements.

Mathematically:
$$\mathrm{Attention} = \sum_i \mathrm{weight}_i \times \mathrm{value}_i$$
• Higher weight → more importance
• Lower weight → less importance
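
A small numeric sketch of the weighted sum above. The weights and values are made-up numbers chosen for illustration; the weights are picked so that they sum to 1.

```python
import numpy as np

weights = np.array([0.7, 0.2, 0.1])   # importance of each element, sums to 1
values = np.array([[1.0, 0.0],
                   [0.0, 1.0],
                   [1.0, 1.0]])       # one value vector per element

# Attention output: sum over i of weight_i * value_i
output = weights @ values             # approximately [0.8, 0.3]
```

The first value vector dominates the result because it carries the largest weight.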

Why attention is needed


• Captures long-range dependencies
• Improves context understanding
• Avoids information loss in long sequences



What is Self-Attention?
Self-attention is an attention mechanism where elements in a sequence attend to other elements
within the same sequence.

Example
Sentence: “The animal didn’t cross the road because it was tired.”
Question: What does “it” refer to?
Self-attention computes relationships between:
• it ↔ animal
• it ↔ road
• The model learns that “it” refers to “animal”.



How it works
• Each word creates:
• Query (Q)
• Key (K)
• Value (V)
• Then:
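• $$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{d_k}}\right) V$$
where d_k is the dimension of the key vectors.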
• So each word:
• looks at all other words
• decides how much to attend to them

• No external input
• Same sequence provides Q, K, V
• Hence the name self-attention.
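
The scaled dot-product self-attention described above can be sketched in NumPy. The sequence length, dimensions, and the projection matrices `W_q`, `W_k`, `W_v` are assumptions for illustration, and the matrices are random rather than learned.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    """Single-head self-attention: Q, K and V all come from the same sequence X."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)           # (n, n) pairwise similarities
    weights = softmax(scores, axis=-1)        # each row sums to 1
    return weights @ V, weights

rng = np.random.default_rng(0)
n, d_model, d_k = 5, 8, 4                     # assumed: 5 tokens, small dimensions
X = rng.normal(size=(n, d_model))
W_q, W_k, W_v = (rng.normal(size=(d_model, d_k)) for _ in range(3))

out, weights = self_attention(X, W_q, W_k, W_v)   # out: (5, 4), weights: (5, 5)
```

Row i of `weights` shows how much token i attends to every token in the sequence, including itself.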

What is Global Attention?
Global attention allows each token to attend to all tokens in the entire sequence.
Every position can see the whole sentence.

Characteristics
✔ Captures long-distance relationships
✔ High accuracy
❌ Computationally expensive (O(n²))
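
To make the O(n²) cost concrete, the sketch below counts the entries of the attention matrix for one head in one layer. The image sizes, the 16×16 patch size, and the single extra class token are assumptions for illustration, and `attention_matrix_entries` is a hypothetical helper.

```python
def attention_matrix_entries(image_size, patch_size, extra_tokens=1):
    """Return (number of tokens n, entries n * n in one attention matrix)."""
    n = (image_size // patch_size) ** 2 + extra_tokens
    return n, n * n

n_small, entries_small = attention_matrix_entries(224, 16)   # 197 tokens, 38809 entries
n_large, entries_large = attention_matrix_entries(448, 16)   # 785 tokens, 616225 entries

# Doubling the image side roughly quadruples n and multiplies the matrix size by about 16.
growth = entries_large / entries_small
```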

Where global attention is used


• Original Transformer
• BERT
• GPT (with a causal mask, so each token attends only to itself and earlier tokens)
• Machine translation



What is Multi-Head Attention?

Multi-head attention uses multiple attention mechanisms in parallel to learn different types of relationships
simultaneously.

How it works
• Input embedding is projected into multiple Q, K, V sets:
• For each head i:
• $$\mathrm{head}_i = \mathrm{Attention}(Q_i, K_i, V_i)$$
• Then:
• $$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\, W^{O}$$
Why multiple heads?
Because one attention cannot capture everything.
Multi-head allows:
• richer representation
• better contextual understanding
• parallel learning of relations
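
A minimal NumPy sketch of multi-head self-attention. It assumes the common layout in which one large projection per Q, K and V is split into equal slices, one slice per head; the sizes and weight matrices are illustrative and random rather than learned.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, W_q, W_k, W_v, W_o, num_heads):
    n, d_model = X.shape
    assert d_model % num_heads == 0
    d_head = d_model // num_heads
    Q, K, V = X @ W_q, X @ W_k, X @ W_v       # each (n, d_model)
    heads = []
    for i in range(num_heads):
        s = slice(i * d_head, (i + 1) * d_head)          # columns of head i
        scores = Q[:, s] @ K[:, s].T / np.sqrt(d_head)   # (n, n)
        heads.append(softmax(scores) @ V[:, s])          # head_i: (n, d_head)
    # Concat(head_1, ..., head_h) W^O
    return np.concatenate(heads, axis=-1) @ W_o

rng = np.random.default_rng(0)
n, d_model, num_heads = 5, 8, 2               # assumed sizes
X = rng.normal(size=(n, d_model))
W_q, W_k, W_v, W_o = (rng.normal(size=(d_model, d_model)) for _ in range(4))

out = multi_head_attention(X, W_q, W_k, W_v, W_o, num_heads)   # shape (5, 8)
```

Each head computes its own attention weights on its own slice, which is what lets different heads learn different relationships.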
Vision Transformers (ViT)

• The Vision Transformer (ViT) is a widely used transformer-based image model. It:
• Treats image patches like words in a sentence
• Uses only transformer encoders (no convolution)
• Achieves competitive or superior performance to CNNs when trained on large datasets



Flow Diagram
Input Image
↓
Split into Patches
↓
Patch Embedding + Positional Encoding
↓
Transformer Encoder (× L layers)
↓
Classification Head
↓
Output Class
Input Image & Patch Splitting

Patch Embedding

Class Token & Positional Encoding

Transformer Encoder (Repeated L Times)

Classification Head


CNN vs Transformers

• CNNs learn local patterns (edges, textures) using convolution and gradually build global understanding.
• Use CNNs for local feature extraction
• Transformers (ViT) learn global relationships directly using self-attention over image patches.
• Use Transformers for global reasoning



THANK YOU
