Vision Transformer

The Vision Transformer (ViT) architecture utilizes self-attention to process images as sequences of patches, enabling a global understanding without convolutions. It includes components such as patch embedding, positional encoding, and a classification token, and excels in tasks like image classification and object detection. ViTs require large datasets for training due to their low inductive bias and have advantages over CNNs in capturing global relationships and scalability, but also face challenges like high computational costs and sensitivity to hyperparameters.

Uploaded by

kvds_2012

Vision Transformer (ViT) Architecture

Vision Transformer (ViT) is a deep learning architecture that applies the Transformer model to
images. Instead of relying on convolutions, ViTs use self-attention to capture relationships
across all image patches, enabling a global understanding of the image. This approach has
achieved state-of-the-art results in various computer vision tasks.
• Uses self-attention to model global dependencies between image patches.
• Unlike CNNs, it does not rely on convolution operations for feature extraction.
• Demonstrates strong performance in image classification, object detection and
segmentation.

Vision Transformer (ViT) Architecture Overview

Instead of processing words, ViT treats an image as a sequence of fixed-size patches and
applies self-attention across them. This allows the model to capture long range dependencies
between different parts of an image without relying on convolution operations.
The ViT architecture includes the following major components:

1. Image Patching and Patch Embedding

This stage converts a 2D image into a sequence of patch embeddings, analogous to tokens in
NLP. It forms the input to the Transformer by turning spatial information into a linear
sequence.
• Patch Splitting: The image is divided into fixed-size, non-overlapping patches. Each patch
is treated as a token, so the image becomes a 1D sequence for the Transformer. For an
$H \times W$ image and patch size $P$, this gives $N = HW / P^2$ patches, which reduces
computation compared with attending over individual pixels while preserving local spatial
information within each patch.
• Patch Flattening: Each patch of size $P \times P \times C$ is flattened into a single
vector of length $P^2 \cdot C$. This flattening removes the spatial dimensions temporarily
and allows the model to treat patches uniformly. The flattened vectors serve as the raw
inputs to the linear projection layer.
• Patch Embedding (Linear Projection): Each flattened patch is mapped by a learnable linear
projection to a D-dimensional embedding, enabling the model to learn high-level features,
similar to word embeddings in NLP.
Patch embeddings can also be computed with a convolution layer whose kernel size and stride
both equal the patch size, so that each application of the kernel extracts and projects one
patch.
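
The splitting, flattening and projection steps can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the 32 × 32 RGB toy image, the patch size P = 8, the embedding size D = 64, the random initialisation and the helper name `patchify` are all chosen here and do not come from the original text.

```python
import numpy as np

def patchify(image, patch_size):
    """Split an (H, W, C) image into N flattened patches of length P*P*C."""
    H, W, C = image.shape
    P = patch_size
    assert H % P == 0 and W % P == 0, "image size must be divisible by patch size"
    # (H/P, P, W/P, P, C) -> (H/P, W/P, P, P, C) -> (N, P*P*C)
    patches = image.reshape(H // P, P, W // P, P, C).transpose(0, 2, 1, 3, 4)
    return patches.reshape(-1, P * P * C)

rng = np.random.default_rng(0)
image = rng.random((32, 32, 3))          # toy 32x32 RGB image (assumed size)
P, D = 8, 64                             # assumed patch size and embedding dimension
patches = patchify(image, P)             # shape (16, 192): N = 16, P*P*C = 192

# Linear projection E and bias b; learnable in practice, random in this sketch
E = rng.normal(0.0, 0.02, size=(P * P * 3, D))
b = np.zeros(D)
patch_embeddings = patches @ E + b       # shape (16, 64)
```

Patches are ordered row by row, so `patches[1]` is the patch immediately to the right of `patches[0]`.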

2. Positional Encoding

Since self-attention is permutation-equivariant and has no built-in notion of token order,
positional encodings inject spatial order so the model knows where each patch came from.
• Need for Positional Encoding: Since Transformers treat tokens as an unordered set,
positional encodings are added to retain spatial structure and patch location
information.
• Learnable Positional Embeddings: ViT uses learnable positional vectors, one per token,
which are added to the patch embeddings and can capture both local and global spatial
relationships. Because they are learned for a fixed number of patches, they are typically
interpolated when the input resolution changes.
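
A small experiment makes the need for positional information concrete. The sketch below assumes a simplified single-head self-attention with identity projections (Q = K = V = X) and uses random vectors as a stand-in for learned positional embeddings; the sizes and the fixed permutation are arbitrary choices for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X):
    """Single-head self-attention with identity projections (Q = K = V = X)."""
    scores = X @ X.T / np.sqrt(X.shape[-1])
    return softmax(scores) @ X

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 8))                   # 6 patch tokens, 8 features each
perm = np.array([3, 0, 5, 1, 4, 2])           # a fixed shuffle of the patch order

# Without positions: shuffling the input only shuffles the output rows
order_is_ignored = np.allclose(self_attention(X)[perm], self_attention(X[perm]))

# With position-dependent vectors added, the patch order changes the result
pos = rng.normal(size=(6, 8))                 # stand-in for learned positional embeddings
order_matters = not np.allclose(self_attention(X + pos)[perm],
                                self_attention(X[perm] + pos))
```

Both flags come out true: without positional vectors the model cannot tell a shuffled image from the original, and with them it can.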

3. Adding the Classification Token (CLS Token)

A learnable CLS token is prepended to the patch sequence to aggregate information from all
patches, serving as the image-level representation for classification.
• Purpose of the CLS Token: The CLS token is a learnable vector prepended to the patch
embeddings. It gathers global information and is used for the final classification,
similar to the [CLS] token in BERT.
• How the CLS Token Learns an Image-Level Representation: The CLS token attends to all
patches to learn global image features, and only its final output is used for prediction,
without CNN-style pooling.
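
As a minimal sketch of how the encoder input is assembled, the snippet below prepends a CLS vector to a set of patch embeddings and then adds one positional vector per token. The sizes (N = 16, D = 64) and the random values are assumptions for illustration; in a trained model `cls_token` and `pos_embedding` would be learned parameters.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 16, 64                                     # assumed patch count and embedding size
patch_embeddings = rng.normal(size=(N, D))        # stand-in for projected patches

# Learnable parameters in a real model; randomly initialised in this sketch
cls_token = rng.normal(0.0, 0.02, size=(1, D))
pos_embedding = rng.normal(0.0, 0.02, size=(N + 1, D))

# Prepend the CLS token, then add one positional vector per token
tokens = np.concatenate([cls_token, patch_embeddings], axis=0)   # (N + 1, D)
encoder_input = tokens + pos_embedding                           # (N + 1, D)
```

Row 0 of `encoder_input` is the CLS position; rows 1 to N are the patches in their original order.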

4. Transformer Encoder (Pre-LayerNorm Architecture)

Pre-LayerNorm applies LayerNorm before both the attention and feed-forward blocks, inside
each residual branch. This stabilizes gradient flow and helps mitigate exploding and
vanishing gradients in deep Transformers.
Each encoder block has:
• Multi-Head Self-Attention (MSA)
• Feed-Forward Network (FFN)
• Residual connections and LayerNorm
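
Written as equations, one Pre-LN encoder block can be summarized as follows. The symbols are assumptions introduced here: $z_{\ell-1}$ is the token sequence entering block $\ell$, $z'_\ell$ is the intermediate result after the attention branch, LN is LayerNorm, and $L$ is the number of blocks.

```latex
\begin{aligned}
z'_\ell &= \mathrm{MSA}\big(\mathrm{LN}(z_{\ell-1})\big) + z_{\ell-1}, \\
z_\ell  &= \mathrm{FFN}\big(\mathrm{LN}(z'_\ell)\big) + z'_\ell, \qquad \ell = 1, \dots, L
\end{aligned}
```

In both lines the normalization sits inside the residual branch, so the skip path carries the unnormalized input forward unchanged.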

5. Multi-Head Self-Attention (MSA)

MSA allows each patch to attend to every other patch, modeling global dependencies and
capturing relationships between distant image regions.
1. Self-Attention Mechanism
• Self-attention enables each patch to relate to all others through query, key and value
projections, with the attention matrix controlling how much each token influences every
other token. The input sequence consists of N image patches plus 1 CLS token, with each
token represented by a D-dimensional embedding.
• Compute Queries, Keys and Values: The input $X \in \mathbb{R}^{(N+1) \times D}$ is
multiplied by learned matrices to give $Q = X W^Q$, $K = X W^K$ and $V = X W^V$.
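
The single-head computation can be sketched as below. It assumes the standard scaled dot-product form, softmax(Q Kᵀ / √d_k) V, from the original Transformer; the sizes, the random weights and the function names are illustrative choices rather than values from the text.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    """Scaled dot-product self-attention for one head."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    d_k = Q.shape[-1]
    A = softmax(Q @ K.T / np.sqrt(d_k))       # (N + 1, N + 1) attention matrix
    return A @ V, A

rng = np.random.default_rng(0)
N, D, d_k = 16, 64, 32                        # assumed sizes
X = rng.normal(size=(N + 1, D))               # N patches plus 1 CLS token
W_q, W_k, W_v = (rng.normal(0.0, D ** -0.5, size=(D, d_k)) for _ in range(3))
out, A = self_attention(X, W_q, W_k, W_v)     # out: (17, 32), A: (17, 17)
```

Each row of `A` sums to 1 and gives the weights with which one token mixes the value vectors of all tokens.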

2. Multi-Head Attention
Multiple attention heads allow the model to attend to different types of information
simultaneously (e.g., edges, colors, textures, global shapes). The outputs of all $h$ heads
are concatenated and linearly projected to form the final attention output. This parallel
attention mechanism leads to richer and more diverse feature representations.
$$\mathrm{MSA}(X) = \mathrm{Concat}(\mathrm{head}_1, \dots, \mathrm{head}_h)\, W^O$$
$$\mathrm{head}_i = \mathrm{Attention}(X W_i^Q,\; X W_i^K,\; X W_i^V)$$
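
A compact NumPy sketch of the two formulas follows. It assumes each head has size D / h and that the per-head matrices $W_i^Q$, $W_i^K$, $W_i^V$ are stored as column slices of three D × D matrices, which is a common implementation choice but not something the text specifies. All sizes and weights are illustrative.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_self_attention(X, W_q, W_k, W_v, W_o, h):
    """MSA(X) = Concat(head_1, ..., head_h) W_o with h heads of size D // h."""
    T, D = X.shape
    d_h = D // h

    def split(M):
        # (T, D) -> (h, T, d_h): columns [i*d_h, (i+1)*d_h) belong to head i
        return M.reshape(T, h, d_h).transpose(1, 0, 2)

    Q, K, V = split(X @ W_q), split(X @ W_k), split(X @ W_v)
    A = softmax(Q @ K.transpose(0, 2, 1) / np.sqrt(d_h))   # (h, T, T)
    heads = A @ V                                           # (h, T, d_h)
    concat = heads.transpose(1, 0, 2).reshape(T, D)         # heads side by side
    return concat @ W_o

rng = np.random.default_rng(0)
T, D, h = 17, 64, 4                                         # assumed sizes
X = rng.normal(size=(T, D))
W_q, W_k, W_v, W_o = (rng.normal(0.0, D ** -0.5, size=(D, D)) for _ in range(4))
Y = multi_head_self_attention(X, W_q, W_k, W_v, W_o, h)     # shape (17, 64)
```

Computing all heads from one batched matrix product gives the same result as running each head separately and concatenating, but it avoids a Python loop.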

6. Feed-Forward Network (FFN)

The FFN transforms each patch embedding into a higher-dimensional space and back using two
dense layers with a GELU activation between them, enabling complex feature learning. It
operates independently on each token with shared weights, allowing efficient non-linear
transformations. The expansion increases expressiveness, and GELU provides a smooth
non-linearity that improves learning and stability.
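
The following sketch shows the expand, activate and project-back pattern. It uses the tanh approximation of GELU rather than the exact erf form, and the hidden size of 4 × D is a common choice assumed here rather than taken from the text; weights are random placeholders.

```python
import numpy as np

def gelu(x):
    """GELU, tanh approximation."""
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x ** 3)))

def ffn(X, W1, b1, W2, b2):
    """Position-wise FFN: expand D -> D_hidden, apply GELU, project back to D."""
    return gelu(X @ W1 + b1) @ W2 + b2

rng = np.random.default_rng(0)
T, D, D_hidden = 17, 64, 256                 # assumed sizes, D_hidden = 4 * D
X = rng.normal(size=(T, D))
W1, b1 = rng.normal(0.0, 0.02, size=(D, D_hidden)), np.zeros(D_hidden)
W2, b2 = rng.normal(0.0, 0.02, size=(D_hidden, D)), np.zeros(D)

Y = ffn(X, W1, b1, W2, b2)                   # shape (17, 64)
# Each token is transformed independently with the same weights
same_per_token = np.allclose(Y[3], ffn(X[3:4], W1, b1, W2, b2)[0])
```

Because no token looks at any other token here, all mixing between patches happens in the attention layer, and the FFN only refines each token's features.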

7. Residual Connections and Layer Normalization

Residual connections and layer normalization keep training stable in deep networks by
preserving information and normalizing activations.
• Residual (Skip) Connections: Residual connections bypass transformation blocks
to preserve earlier layer information, preventing degradation in deep networks.
They enable the model to learn incremental refinements, improving convergence
and stability in deep ViTs.
• Layer Normalization: LayerNorm normalizes each token's features across the embedding
dimension, which stabilizes training. In the Pre-LN arrangement it helps keep gradients
well conditioned and the scale of activations consistent across tokens in deep Transformers.
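
Both ideas fit in a few lines of NumPy. The sketch below is illustrative only: `pre_ln_residual` is a hypothetical helper name, the sublayer is a stand-in linear map rather than real attention or an FFN, and the scale and shift parameters are left at their usual initial values of 1 and 0.

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-6):
    """Normalise each token over its feature dimension, then scale and shift."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

def pre_ln_residual(x, sublayer, gamma, beta):
    """Pre-LN residual step: x + sublayer(LayerNorm(x))."""
    return x + sublayer(layer_norm(x, gamma, beta))

rng = np.random.default_rng(0)
T, D = 17, 64                                # assumed sizes
x = rng.normal(loc=3.0, scale=5.0, size=(T, D))
gamma, beta = np.ones(D), np.zeros(D)

normed = layer_norm(x, gamma, beta)          # per-token mean close to 0, variance close to 1
W = rng.normal(0.0, 0.02, size=(D, D))       # hypothetical stand-in sublayer
y = pre_ln_residual(x, lambda z: z @ W, gamma, beta)
```

If the sublayer outputs zeros, the block returns its input unchanged, which is the sense in which the skip path preserves earlier-layer information.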

8. Classification Head (MLP Head)

The classification head converts the CLS token output into class probabilities using a small
feed-forward network.
• MLP Head Structure: The classification head uses one or two fully connected
layers on the final CLS token to produce class probabilities, optionally with dropout
for regularization. It serves as the ViT’s final decision-making component.
• Softmax for Prediction: Softmax converts logits into normalized probabilities
summing to 1, with the highest probability indicating the predicted class. It enables
multi-class classification and pairs with cross-entropy loss during training.
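
A minimal sketch of the head, assuming a single linear layer, 10 classes, random weights and a hypothetical ground-truth label of 3:

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max()                # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def cross_entropy(probs, target):
    """Negative log-probability assigned to the correct class."""
    return -np.log(probs[target])

rng = np.random.default_rng(0)
D, num_classes = 64, 10                      # assumed sizes
cls_output = rng.normal(size=D)              # final encoder output for the CLS token
W_head = rng.normal(0.0, 0.02, size=(D, num_classes))
b_head = np.zeros(num_classes)

logits = cls_output @ W_head + b_head
probs = softmax(logits)                      # non-negative, sums to 1
predicted_class = int(np.argmax(probs))
loss = cross_entropy(probs, target=3)        # hypothetical ground-truth label
```

Softmax preserves the ordering of the logits, so the predicted class is the same whether it is read from `logits` or from `probs`.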

9. Training Vision Transformers

ViTs need more data than CNNs because of their low inductive bias, and training typically
involves pretraining on large datasets followed by fine-tuning.
• Inductive Bias Differences: CNNs build in strong inductive biases such as locality and
translation invariance, while ViTs treat images as patch sequences, requiring more
data but offering greater flexibility.
• Data Requirements: Unlike CNNs, ViTs need large-scale datasets and strong data
augmentation to generalize well, due to their low inductive bias.
• Pretraining: Pretraining lets ViTs learn general visual features via supervised or
self-supervised methods, reducing compute needs for downstream tasks.
• Fine-tuning: Fine-tuning adapts pretrained ViTs to specific datasets using fewer
labels, often with layer-wise learning rate decay to improve performance.
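
Layer-wise learning rate decay can be illustrated with a short helper. The geometric schedule below is one common form and is an assumption here, as are the function name and the values for the base rate, depth and decay factor; actual fine-tuning recipes differ in the details.

```python
def layerwise_learning_rates(base_lr, num_layers, decay):
    """Return one learning rate per encoder layer, smallest for the earliest layer."""
    return [base_lr * decay ** (num_layers - 1 - i) for i in range(num_layers)]

# Assumed values: 12 encoder layers, base rate 1e-3, decay factor 0.75
lrs = layerwise_learning_rates(base_lr=1e-3, num_layers=12, decay=0.75)
```

The last layer trains at the base rate while earlier layers, which hold more general pretrained features, are updated more gently.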

Vision Transformer (ViT) vs. Convolutional Neural Networks (CNNs)

Here we compare ViTs with CNNs:

| Feature          | CNNs                                              | ViTs                                             |
|------------------|---------------------------------------------------|--------------------------------------------------|
| Attention Scope  | Capture local features via convolutions           | Capture global relationships via self-attention  |
| Inductive Bias   | Strong biases (locality, translation invariance)  | Minimal biases, more flexible but data-hungry    |
| Data Requirement | Work well with small datasets                     | Need large datasets for best performance         |
| Feature Learning | Learn hierarchical features                       | Learn context-rich, long-range features          |

Advantages
• Global Context: Captures long-range dependencies between patches,
understanding the overall image context.
• Scalability: Performs well with larger datasets and deeper architectures for
complex vision tasks.
• Parallel Processing: All tokens in a layer are processed at once with dense matrix
operations, which map efficiently onto GPUs and other accelerators.
• Unified Architecture: The same Transformer backbone can handle different input
modalities (e.g., image patches or text tokens) once they are converted to token sequences.
• Strong Representations: Learns powerful high-level feature representations due to
attention mechanisms capturing diverse patterns.

Limitations
• Data-Hungry Nature: Requires very large datasets to learn meaningful visual
representations.
• High Computational Cost: Self-attention scales quadratically with the number of
patches, increasing memory and compute requirements.
• Lack of Local Feature Bias: Does not naturally exploit local patterns, reducing
sample efficiency.
• Sensitivity to Hyperparameters: Patch size, embedding dimension, and attention
heads need careful tuning.
• Difficulty with Small Images: Few tokens from small images reduce attention
effectiveness.
• Longer Training Times: High complexity and large datasets lead to extended training
durations.
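
The quadratic cost noted in the list above is easy to see with some arithmetic. The sketch assumes a square 224 × 224 input, a common choice that is not stated in the text, and counts the entries of one attention matrix per head and per layer.

```python
def attention_cost(image_size, patch_size):
    """Token count and attention-matrix entries (per head, per layer) for a square image."""
    num_patches = (image_size // patch_size) ** 2
    num_tokens = num_patches + 1                 # patches plus the CLS token
    return num_tokens, num_tokens ** 2

for patch_size in (32, 16, 8):
    tokens, entries = attention_cost(224, patch_size)
    print(f"patch {patch_size:>2}: {tokens:>4} tokens, {entries:>7} attention entries")
# patch 32:   50 tokens,    2500 attention entries
# patch 16:  197 tokens,   38809 attention entries
# patch  8:  785 tokens,  616225 attention entries
```

Halving the patch size roughly quadruples the number of tokens and multiplies the size of the attention matrix by about 16, which is why patch size is such a sensitive hyperparameter.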
