
Vision Transformer vs CNNs in Image Classification

The Vision Transformer (ViT) is a neural network architecture that outperforms traditional CNNs like ResNet in image classification when pretrained on large datasets of at least 100 million images. ViT utilizes a transformer model originally designed for natural language processing and involves splitting images into patches for processing. The document discusses the architecture, training, and performance comparisons of ViT against CNNs across various datasets.


Vision Transformer (ViT)

What is in the image?

[Figure: an image is fed to a neural network, which outputs a confidence for each class (bird, car, cat, dog, fox, jet, snake, tiger). Confidences shown in the bar chart: 0.4, 0.2, 0.12, 0.06, 0.06, 0.07, 0.05, 0.04.]
Image Classification

• CNNs, e.g., ResNet, were the best solutions to image classification.
• Vision Transformer (ViT) [1] beats CNNs (by a small margin), if the dataset for pretraining is sufficiently large (at least 100 million images).
• ViT is based on Transformer (for NLP) [2].

References

1. Dosovitskiy et al. An image is worth 16×16 words: Transformers for image recognition at scale. In ICLR, 2021.
2. Vaswani et al. Attention is all you need. In NIPS, 2017.
Split Image into Patches

• Here, the patches do not overlap (the stride equals the patch size).
• In general, the patches can overlap (the stride is smaller than the patch size).
• User specifies:
  • patch size, e.g., 16×16;
  • stride, e.g., 16×16.
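
As a rough illustration of how patch size and stride interact, the following NumPy sketch splits an image array into patches. The function name `split_into_patches` and the 224×224×3 input are assumptions made for this example; they do not come from the slides.

```python
import numpy as np

def split_into_patches(image, patch_size=16, stride=16):
    """Return patches as an array of shape (n, patch_size, patch_size, channels)."""
    height, width, _ = image.shape
    patches = []
    # Slide a patch_size x patch_size window over the image in steps of `stride`.
    for top in range(0, height - patch_size + 1, stride):
        for left in range(0, width - patch_size + 1, stride):
            patches.append(image[top:top + patch_size, left:left + patch_size, :])
    return np.stack(patches)

image = np.zeros((224, 224, 3))                       # hypothetical input image
patches = split_into_patches(image)                   # stride 16: no overlap
overlapping = split_into_patches(image, stride=8)     # stride 8: patches overlap
print(patches.shape, overlapping.shape)
```

With a stride equal to the patch size, a 224×224 image yields 14 × 14 = 196 patches; halving the stride yields 27 × 27 = 729 overlapping patches.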


Vectorization

If the patches are $d_1 \times d_2 \times d_3$ tensors, then the vectors are $d_1 d_2 d_3 \times 1$.
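
A minimal sketch of the vectorization step, assuming the patches are stored in a single NumPy array with one patch per leading index. The helper name `vectorize` is hypothetical. The slides describe each vector as a column; here each vector is stored as a row, which is only a layout choice.

```python
import numpy as np

def vectorize(patches):
    """Flatten each d1 x d2 x d3 patch into a vector of length d1*d2*d3."""
    return patches.reshape(patches.shape[0], -1)

# Two hypothetical 16 x 16 x 3 patches filled with consecutive numbers.
patches = np.arange(2 * 16 * 16 * 3, dtype=float).reshape(2, 16, 16, 3)
vectors = vectorize(patches)
print(vectors.shape)  # (2, 768), since 16 * 16 * 3 = 768
```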

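[Figure: the image is split into nine patches, vectorized as $\mathbf{x}_1, \dots, \mathbf{x}_9$; in general there are $n$ vectors $\mathbf{x}_1, \dots, \mathbf{x}_n$.]

Apply a dense layer to each vector:

$\mathbf{z}_1 = \mathbf{W} \mathbf{x}_1 + \mathbf{b}$,
$\mathbf{z}_2 = \mathbf{W} \mathbf{x}_2 + \mathbf{b}$,
and in general $\mathbf{z}_i = \mathbf{W} \mathbf{x}_i + \mathbf{b}$ for $i = 1, \dots, n$.

The dense layers share parameters: the same $\mathbf{W}$ and $\mathbf{b}$ are applied to every $\mathbf{x}_i$.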
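
The shared dense layer, z_i = W x_i + b, can be sketched in NumPy as a single matrix product over all patch vectors. The sizes below (9 patches, 768 inputs, 64 outputs) and the random initialization are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, input_dim, embed_dim = 9, 768, 64          # illustrative sizes (assumed)

x = rng.normal(size=(n, input_dim))           # rows are x_1, ..., x_n
W = rng.normal(size=(embed_dim, input_dim)) * 0.01
b = np.zeros(embed_dim)

# One matrix product applies the same W and b to every patch vector.
z = x @ W.T + b                               # rows are z_1, ..., z_n

# Applying the dense layer to x_2 alone gives the same z_2.
z_2 = W @ x[1] + b
print(z.shape, np.allclose(z[1], z_2))
```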
Positional Encoding

Add positional encoding vectors to $\mathbf{z}_1, \mathbf{z}_2, \cdots, \mathbf{z}_n$. (Why?)

[Figure: the nine patches numbered ① to ⑨ in a 3×3 grid, shown in the original order and with some patches rearranged.]

Without positional encoding, the Transformer treats the patches as an unordered set, so rearranging the patches would not change the prediction even though it changes the image.
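
A small sketch of adding positional encoding vectors, assuming one trainable vector per position with the same dimension as the z vectors (initialized randomly here). The slides do not specify how the positional vectors are obtained, so this choice is an assumption.

```python
import numpy as np

rng = np.random.default_rng(0)
n, embed_dim = 9, 64                                   # illustrative sizes (assumed)

z = rng.normal(size=(n, embed_dim))                    # z_1, ..., z_n
positions = rng.normal(size=(n, embed_dim)) * 0.02     # one vector per position (assumed trainable)

z_with_position = z + positions

# Swap the first two patches. Without positions, the second patch would
# contribute exactly the same vector wherever it appears in the sequence.
order = np.array([1, 0, 2, 3, 4, 5, 6, 7, 8])
shuffled = z[order] + positions

# With positions, the second patch now carries position 1 instead of position 2.
print(np.allclose(shuffled[0], z_with_position[1]))
```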

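[Figure: a [CLS] token is placed in front of $\mathbf{x}_1, \dots, \mathbf{x}_n$. An embedding layer maps [CLS] to $\mathbf{z}_0$, and the shared dense layer maps $\mathbf{x}_1, \dots, \mathbf{x}_n$ to $\mathbf{z}_1, \dots, \mathbf{z}_n$.]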
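
Prepending the [CLS] embedding can be sketched as a concatenation. The sketch assumes that z_0 is a single trainable vector with the same dimension as the patch embeddings; it is randomly initialized here, and the sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n, embed_dim = 9, 64                                  # illustrative sizes (assumed)

z_patches = rng.normal(size=(n, embed_dim))           # z_1, ..., z_n
z_cls = rng.normal(size=(1, embed_dim)) * 0.02        # z_0, the embedding of [CLS]

# The sequence fed to the Transformer has n + 1 vectors: z_0, z_1, ..., z_n.
z = np.concatenate([z_cls, z_patches], axis=0)
print(z.shape)  # (10, 64)
```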

Transformer Encoder Network

[Figure: the Transformer encoder network is a stack of multi-head self-attention layers and dense layers.]

The Transformer encoder network maps the inputs $\mathbf{z}_0, \mathbf{z}_1, \dots, \mathbf{z}_n$ to the outputs $\mathbf{c}_0, \mathbf{c}_1, \dots, \mathbf{c}_n$. Of these outputs, only $\mathbf{c}_0$ is used in the next step.
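
The following is a simplified sketch of an encoder built from multi-head self-attention and dense layers, intended only to show that n + 1 input vectors produce n + 1 output vectors. It is not a full Transformer encoder: skip connections and layer normalization are omitted, and the ReLU dense layer, the sizes, and the random weights are assumptions of this sketch.

```python
import numpy as np

def softmax(a):
    e = np.exp(a - a.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(z, w_q, w_k, w_v):
    """Single attention head: every output row is a weighted average of values."""
    q, k, v = z @ w_q, z @ w_k, z @ w_v
    weights = softmax(q @ k.T / np.sqrt(k.shape[1]))   # (n+1, n+1), rows sum to 1
    return weights @ v

def encoder_block(z, heads, w_dense, b_dense):
    """Multi-head self-attention followed by a dense layer (simplified)."""
    attended = np.concatenate([self_attention(z, *h) for h in heads], axis=1)
    return np.maximum(attended @ w_dense + b_dense, 0.0)

rng = np.random.default_rng(0)
seq_len, embed_dim, num_heads, num_blocks = 10, 64, 4, 2   # illustrative sizes
head_dim = embed_dim // num_heads

c = rng.normal(size=(seq_len, embed_dim))                  # z_0, z_1, ..., z_n
for _ in range(num_blocks):
    heads = [tuple(rng.normal(size=(embed_dim, head_dim)) * 0.1 for _ in range(3))
             for _ in range(num_heads)]
    w_dense = rng.normal(size=(embed_dim, embed_dim)) * 0.1
    c = encoder_block(c, heads, w_dense, np.zeros(embed_dim))

print(c.shape)  # rows are c_0, c_1, ..., c_n
```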
[Figure: the output $\mathbf{c}_0$ of the Transformer encoder network is fed to a softmax classifier, which outputs a confidence for each class (bird, car, cat, dog, fox, jet, snake, tiger). Confidences shown in the bar chart: 0.4, 0.2, 0.12, 0.06, 0.06, 0.07, 0.05, 0.04.]
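
A sketch of the softmax classifier applied to c_0, assuming a single linear layer followed by softmax. The weights and the encoder outputs are random stand-ins, so the predicted class here is meaningless; only the shapes and the normalization are of interest.

```python
import numpy as np

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

classes = ["bird", "car", "cat", "dog", "fox", "jet", "snake", "tiger"]

rng = np.random.default_rng(0)
embed_dim = 64
c = rng.normal(size=(10, embed_dim))                  # stand-in for c_0, c_1, ..., c_n
W = rng.normal(size=(len(classes), embed_dim)) * 0.1  # classifier weights (random here)
b = np.zeros(len(classes))

# Only c_0 is used; c_1, ..., c_n are ignored by the classifier.
p = softmax(W @ c[0] + b)                             # one confidence per class
predicted = classes[int(p.argmax())]
print(predicted, float(p.sum()))
```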
[Figure: the training pipeline.]

• A randomly initialized model is trained on Dataset A, giving the pretrained model.
• The pretrained model is trained on the training set of Dataset B, giving the fine-tuned model.
• The fine-tuned model is evaluated on the test set of Dataset B, giving the test accuracy.
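
The pretrain, fine-tune, evaluate pipeline can be sketched with a toy two-layer model on synthetic data. Everything here is an assumption made for illustration: the model, the data generator `make_dataset`, the hyperparameters, and the choice to replace the output layer before fine-tuning because Dataset B has different classes. It is not ViT and says nothing about real accuracies.

```python
import numpy as np

def softmax(a):
    e = np.exp(a - a.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def train(body, head, x, y, steps=300, lr=0.5):
    """Gradient descent on cross-entropy for the model tanh(x @ body) @ head."""
    onehot = np.eye(head.shape[1])[y]
    for _ in range(steps):
        h = np.tanh(x @ body)
        grad_logits = (softmax(h @ head) - onehot) / len(x)
        grad_head = h.T @ grad_logits
        grad_body = x.T @ ((grad_logits @ head.T) * (1.0 - h ** 2))
        body, head = body - lr * grad_body, head - lr * grad_head
    return body, head

def accuracy(body, head, x, y):
    return float(((np.tanh(x @ body) @ head).argmax(axis=1) == y).mean())

def make_dataset(rng, size, num_classes, dim=10):
    """Synthetic data: the label is the argmax of a random linear map."""
    x = rng.normal(size=(size, dim))
    y = (x @ rng.normal(size=(dim, num_classes))).argmax(axis=1)
    return x, y

rng = np.random.default_rng(0)
x_a, y_a = make_dataset(rng, 2000, num_classes=5)      # Dataset A (large)
x_b, y_b = make_dataset(rng, 300, num_classes=3)       # Dataset B (small)
x_b_train, y_b_train = x_b[:200], y_b[:200]
x_b_test, y_b_test = x_b[200:], y_b[200:]

hidden = 32
body = rng.normal(size=(10, hidden)) * 0.1             # randomly initialized
head_a = rng.normal(size=(hidden, 5)) * 0.1
body, head_a = train(body, head_a, x_a, y_a)           # pretrained on Dataset A

head_b = rng.normal(size=(hidden, 3)) * 0.1            # new output layer for B's classes
body, head_b = train(body, head_b, x_b_train, y_b_train)   # fine-tuned on B's training set

test_accuracy = accuracy(body, head_b, x_b_test, y_b_test)  # evaluated on B's test set
print(test_accuracy)
```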
Datasets

Dataset                  # of Images    # of Classes
ImageNet (Small)         1.3 Million    1 Thousand
ImageNet-21K (Medium)    14 Million     21 Thousand
JFT (Big)                300 Million    18 Thousand
Image Classification Accuracies

• Pretrain the model on Dataset A, fine-tune the model on Dataset B, and evaluate the model on Dataset B.
• Pretrained on ImageNet (small), ViT is slightly worse than ResNet.
• Pretrained on ImageNet-21K (medium), ViT is comparable to ResNet.
• Pretrained on JFT (large), ViT is slightly better than ResNet.
[Figure: an axis showing the number of images for pretraining, with marks at 100M images and 300M images. The region on the left is labeled "ResNet is better" and the region on the right is labeled "ViT is better".]
Thank You!
