Vision Transformer vs CNNs in Image Classification

The Vision Transformer (ViT) is a neural network architecture that beats CNNs such as ResNet at image classification by a small margin, provided it is pretrained on a sufficiently large dataset (at least 100 million images). ViT applies the Transformer, originally designed for natural language processing, to images by splitting each image into patches. The document covers the architecture, the pretraining and fine-tuning procedure, and accuracy comparisons between ViT and CNNs for pretraining datasets of different sizes.

Uploaded by

janarajan04


Vision Transformer (ViT)

[Figure: What is in the image? A neural network takes the image as input and outputs a confidence for each class (bird, car, cat, dog, fox, jet, snake, tiger). The example confidences are 0.4, 0.2, 0.12, 0.07, 0.06, 0.06, 0.05, and 0.04, which sum to 1.]

Image Classification

• CNNs, e.g., ResNet, were the best solutions to image classification.

• Vision Transformer (ViT) [1] beats CNNs (by a small margin), if the
dataset for pretraining is sufficiently large (at least 100 million
images).
• ViT is based on Transformer (for NLP) [2].

Reference

1. Dosovitskiy et al. An Image Is Worth 16×16 Words: Transformers for Image Recognition at Scale. In ICLR, 2021.
2. Vaswani et al. Attention Is All You Need. In NIPS, 2017.

Split Image into Patches

• Here, the patches do not overlap.
• The patches can overlap.
• User specifies:
  • patch size, e.g., 16×16;
  • stride, e.g., 16×16.
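
The role of patch size and stride can be sketched in a few lines of NumPy. This is a minimal illustration, not the implementation from [1]: the function name `split_into_patches`, the 48×48 test image, and the choice to drop partial patches at the border are all assumptions made for the example.

```python
import numpy as np

def split_into_patches(image, patch_size=(16, 16), stride=(16, 16)):
    """Split an H x W x C image into patches of the given size and stride.

    Patches overlap when the stride is smaller than the patch size.
    Returns an array of shape (n, patch_h, patch_w, C).
    """
    height, width, _ = image.shape
    patch_h, patch_w = patch_size
    stride_h, stride_w = stride
    patches = []
    # Slide the window top-to-bottom, left-to-right. Partial patches at the
    # border are dropped (an assumption; the slides do not discuss padding).
    for top in range(0, height - patch_h + 1, stride_h):
        for left in range(0, width - patch_w + 1, stride_w):
            patches.append(image[top:top + patch_h, left:left + patch_w, :])
    return np.stack(patches)

image = np.zeros((48, 48, 3))  # hypothetical 48x48 RGB image
non_overlapping = split_into_patches(image, (16, 16), (16, 16))
overlapping = split_into_patches(image, (16, 16), (8, 8))
print(non_overlapping.shape)  # (9, 16, 16, 3)
print(overlapping.shape)      # (25, 16, 16, 3)
```

With stride equal to the patch size the image yields 3×3 = 9 patches; halving the stride makes neighboring patches overlap and yields 5×5 = 25.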

Vectorization

If the patches are $d_1 \times d_2 \times d_3$ tensors, then the vectors are $d_1 d_2 d_3 \times 1$.
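
As a small worked example of the vectorization step, the sketch below reshapes one patch into a column vector. The sizes (a 16×16 patch with 3 color channels) are assumed for illustration only.

```python
import numpy as np

d1, d2, d3 = 16, 16, 3  # hypothetical patch: 16x16 pixels, 3 color channels
patch = np.arange(d1 * d2 * d3, dtype=np.float64).reshape(d1, d2, d3)

# Vectorization: reshape the d1 x d2 x d3 tensor into a (d1*d2*d3) x 1 vector.
x = patch.reshape(-1, 1)
print(x.shape)  # (768, 1)

# No information is lost: reshaping back recovers the original patch.
restored = x.reshape(d1, d2, d3)
print(np.array_equal(restored, patch))  # True
```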

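[Figure: the example image is split into nine patches, vectorized as $\mathbf{x}_1, \cdots, \mathbf{x}_9$; in general there are $n$ vectors $\mathbf{x}_1, \mathbf{x}_2, \mathbf{x}_3, \cdots, \mathbf{x}_n$.]

• Apply a dense layer to each vector: $\mathbf{z}_1 = \mathbf{W}\mathbf{x}_1 + \mathbf{b}$, $\mathbf{z}_2 = \mathbf{W}\mathbf{x}_2 + \mathbf{b}$, and so on up to $\mathbf{z}_n = \mathbf{W}\mathbf{x}_n + \mathbf{b}$.
• The dense layers share parameters: the same $\mathbf{W}$ and $\mathbf{b}$ are used for every patch.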
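
Because the dense layers share one weight matrix and one bias, applying them to all patch vectors amounts to a single matrix product. The sketch below checks this numerically; the sizes (9 patches, input dimension 768, output dimension 64) and the random values are assumptions for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
n, input_dim, embed_dim = 9, 768, 64  # hypothetical sizes

X = rng.normal(size=(n, input_dim))          # row i is the patch vector x_i
W = rng.normal(size=(embed_dim, input_dim))  # shared weight matrix
b = rng.normal(size=embed_dim)               # shared bias vector

# Apply the same dense layer to each vector, one at a time: z_i = W x_i + b.
Z_loop = np.stack([W @ X[i] + b for i in range(n)])

# Because W and b are shared, the whole step is one matrix product.
Z = X @ W.T + b

print(Z.shape)                 # (9, 64)
print(np.allclose(Z, Z_loop))  # True
```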

Positional Encoding

• Add positional encoding vectors to $\mathbf{z}_1, \mathbf{z}_2, \cdots, \mathbf{z}_n$. (Why?)

[Figure: the nine image patches, numbered ① to ⑨, shown first in their original order and then rearranged; each patch vector passes through the shared dense layer to give $\mathbf{z}_1, \mathbf{z}_2, \mathbf{z}_3, \cdots, \mathbf{z}_n$.]
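
One standard answer to the "(Why?)" is that self-attention by itself does not see the order of its inputs: rearranging the patches merely rearranges the outputs in the same way. The sketch below illustrates this with a simplified single-head self-attention. The random weights, the sizes, and the random stand-in positional encodings are all assumptions for the demonstration, not the encodings used in [1].

```python
import numpy as np

def softmax(scores):
    # Row-wise softmax, shifted by the row maximum for numerical stability.
    shifted = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(shifted)
    return weights / weights.sum(axis=-1, keepdims=True)

def self_attention(Z, Wq, Wk, Wv):
    # Single-head scaled dot-product self-attention over the rows of Z.
    Q, K, V = Z @ Wq, Z @ Wk, Z @ Wv
    weights = softmax(Q @ K.T / np.sqrt(K.shape[1]))
    return weights @ V

rng = np.random.default_rng(0)
n, d = 9, 8                                  # hypothetical: 9 patches, width 8
Z = rng.normal(size=(n, d))                  # z_1, ..., z_n as rows
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
positions = rng.normal(size=(n, d))          # stand-in positional encodings
order = np.roll(np.arange(n), 1)             # rearrange the patches (cyclic shift)

# Without positional encoding, rearranging the patches only rearranges the outputs.
plain = self_attention(Z, Wq, Wk, Wv)
plain_shuffled = self_attention(Z[order], Wq, Wk, Wv)
print(np.allclose(plain_shuffled, plain[order]))      # True

# With positional encoding, slot i always receives the encoding of slot i,
# so the rearranged image now produces genuinely different outputs.
encoded = self_attention(Z + positions, Wq, Wk, Wv)
encoded_shuffled = self_attention(Z[order] + positions, Wq, Wk, Wv)
print(np.allclose(encoded_shuffled, encoded[order]))  # False
```

In other words, without the positional encoding the network would treat an image and a scrambled copy of it as the same set of patches.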
[Figure: a [CLS] token is embedded as $\mathbf{z}_0$ and placed before $\mathbf{z}_1, \mathbf{z}_2, \mathbf{z}_3, \cdots, \mathbf{z}_n$, which come from the dense layers applied to $\mathbf{x}_1, \cdots, \mathbf{x}_n$. The sequence passes through the Transformer encoder network (stacked multi-head self-attention layers and dense layers), which outputs $\mathbf{c}_0, \mathbf{c}_1, \mathbf{c}_2, \mathbf{c}_3, \cdots, \mathbf{c}_n$.]
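
The data flow through the encoder can be sketched as follows: prepend the [CLS] embedding, apply multi-head self-attention, then a dense layer applied to every position. This is a deliberately simplified sketch with random weights and assumed sizes. It uses a single attention-plus-dense block, and it omits the skip connections and layer normalization of the full Transformer encoder in [2]; the ReLU activation is likewise a placeholder choice.

```python
import numpy as np

def softmax(scores):
    # Softmax over the last axis, shifted for numerical stability.
    shifted = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(shifted)
    return weights / weights.sum(axis=-1, keepdims=True)

def multi_head_self_attention(Z, Wq, Wk, Wv, Wo, num_heads):
    """Multi-head self-attention over the rows of Z, shape (tokens, d)."""
    tokens, d = Z.shape
    head_dim = d // num_heads

    def split(M):
        # Split the feature axis into heads: (heads, tokens, head_dim).
        return M.reshape(tokens, num_heads, head_dim).transpose(1, 0, 2)

    Q, K, V = split(Z @ Wq), split(Z @ Wk), split(Z @ Wv)
    weights = softmax(Q @ K.transpose(0, 2, 1) / np.sqrt(head_dim))
    heads = weights @ V                                   # (heads, tokens, head_dim)
    concat = heads.transpose(1, 0, 2).reshape(tokens, d)  # rejoin the heads
    return concat @ Wo

rng = np.random.default_rng(0)
n, d, num_heads = 9, 16, 4                   # hypothetical sizes
Z_patches = rng.normal(size=(n, d))          # z_1, ..., z_n
cls_embedding = rng.normal(size=(1, d))      # z_0 (random stand-in here)
Z = np.concatenate([cls_embedding, Z_patches], axis=0)  # rows z_0, z_1, ..., z_n

Wq, Wk, Wv, Wo = (rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(4))
attended = multi_head_self_attention(Z, Wq, Wk, Wv, Wo, num_heads)

# Dense layer applied to every position with shared parameters.
W_dense = rng.normal(size=(d, d)) / np.sqrt(d)
b_dense = np.zeros(d)
C = np.maximum(attended @ W_dense + b_dense, 0.0)  # rows c_0, c_1, ..., c_n

print(Z.shape)  # (10, 16)
print(C.shape)  # (10, 16)
```

The encoder keeps the sequence length: n + 1 vectors go in and n + 1 vectors come out, with row 0 of `C` playing the role of the output for the [CLS] position.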
[Figure: of the Transformer encoder outputs, only $\mathbf{c}_0$ (the output at the [CLS] position) is fed to a softmax classifier, which outputs a confidence for each class (bird, car, cat, dog, fox, jet, snake, tiger).]
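
The final classification step can be sketched as a linear layer followed by a softmax applied to the [CLS] output alone. The weights and the input vector below are random stand-ins and the dimension 16 is assumed, so the resulting confidences are meaningless except for their shape and normalization.

```python
import numpy as np

def softmax(logits):
    # Softmax of a 1-D vector, shifted by the maximum for numerical stability.
    shifted = logits - logits.max()
    exps = np.exp(shifted)
    return exps / exps.sum()

classes = ["bird", "car", "cat", "dog", "fox", "jet", "snake", "tiger"]
rng = np.random.default_rng(0)
d = 16                                   # hypothetical encoder width
c0 = rng.normal(size=d)                  # stand-in for the encoder output c_0
W = rng.normal(size=(len(classes), d))   # classifier weights (random here)
b = np.zeros(len(classes))               # classifier bias

p = softmax(W @ c0 + b)                  # one confidence per class
predicted = classes[int(np.argmax(p))]   # class with the highest confidence

print(p.shape)                   # (8,)
print(np.isclose(p.sum(), 1.0))  # True
```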
[Figure: a randomly initialized model is pretrained on Dataset A, fine-tuned on the training set of Dataset B, and then evaluated on the test set of Dataset B to obtain the test accuracy.]

Datasets

Dataset                  # of Images    # of Classes
ImageNet (Small)         1.3 Million    1 Thousand
ImageNet-21K (Medium)    14 Million     21 Thousand
JFT (Big)                300 Million    18 Thousand
Image Classification Accuracies

• Pretrain the model on Dataset A, fine-tune the model on the training set of Dataset B, and evaluate the model on the test set of Dataset B.

• Pretrained on ImageNet (small), ViT is slightly worse than ResNet.

• Pretrained on ImageNet-21K (medium), ViT is comparable to ResNet.
• Pretrained on JFT (large), ViT is slightly better than ResNet.
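
The pretrain, fine-tune, evaluate procedure can be illustrated on a toy problem. Everything below is an assumption made for the sketch: the model is a tiny two-layer network rather than ViT or ResNet, both datasets are synthetic, and replacing the output layer before fine-tuning is one common way to handle the different class sets of Dataset A and Dataset B. The printed accuracy says nothing about the results above.

```python
import numpy as np

def softmax(logits):
    shifted = logits - logits.max(axis=1, keepdims=True)
    exps = np.exp(shifted)
    return exps / exps.sum(axis=1, keepdims=True)

def train(backbone, head, X, y, steps=200, lr=0.1):
    """Gradient descent on cross-entropy; returns the updated weight matrices."""
    onehot = np.eye(head.shape[1])[y]
    for _ in range(steps):
        hidden = np.tanh(X @ backbone)
        probs = softmax(hidden @ head)
        grad_logits = (probs - onehot) / len(X)
        grad_head = hidden.T @ grad_logits
        grad_hidden = (grad_logits @ head.T) * (1.0 - hidden ** 2)
        grad_backbone = X.T @ grad_hidden
        head = head - lr * grad_head
        backbone = backbone - lr * grad_backbone
    return backbone, head

def accuracy(backbone, head, X, y):
    predictions = np.argmax(np.tanh(X @ backbone) @ head, axis=1)
    return float(np.mean(predictions == y))

def make_dataset(rng, num_samples, num_features, num_classes):
    # Synthetic toy data: labels come from a random linear rule.
    X = rng.normal(size=(num_samples, num_features))
    y = np.argmax(X @ rng.normal(size=(num_features, num_classes)), axis=1)
    return X, y

rng = np.random.default_rng(0)
features, hidden_dim = 20, 32
X_a, y_a = make_dataset(rng, 2000, features, 10)  # "Dataset A": larger, 10 classes
X_b, y_b = make_dataset(rng, 300, features, 3)    # "Dataset B": smaller, 3 classes
X_b_train, y_b_train = X_b[:200], y_b[:200]
X_b_test, y_b_test = X_b[200:], y_b[200:]

# 1. Random initialization, then pretraining on Dataset A.
backbone = rng.normal(size=(features, hidden_dim)) * 0.1
head_a = rng.normal(size=(hidden_dim, 10)) * 0.1
backbone, head_a = train(backbone, head_a, X_a, y_a)

# 2. Fine-tuning on the training set of Dataset B: keep the pretrained
#    backbone, start a new output layer for the classes of Dataset B.
head_b = rng.normal(size=(hidden_dim, 3)) * 0.1
backbone, head_b = train(backbone, head_b, X_b_train, y_b_train)

# 3. Evaluation on the held-out test set of Dataset B.
test_accuracy = accuracy(backbone, head_b, X_b_test, y_b_test)
print(f"test accuracy on Dataset B: {test_accuracy:.2f}")
```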
[Figure: axis showing the number of images used for pretraining, with marks at 100M and 300M images; ResNet is better on the smaller side, ViT is better on the larger side.]
Thank You!
