Vision Transformer (ViT)
What is in the image?
[Figure: an image is fed to a neural network, which outputs a confidence for each class (bird, car, cat, dog, fox, jet, snake, tiger). The confidences shown range from 0.04 to 0.4.]
Image Classification
• CNNs, e.g., ResNet, used to be the best-performing models for image classification.
• Vision Transformer (ViT) [1] beats CNNs (by a small margin) if the dataset for pretraining is sufficiently large (at least 100 million images).
• ViT is based on Transformer (for NLP) [2].
References
1. Dosovitskiy et al. An image is worth 16×16 words: Transformers for image recognition at scale. In ICLR, 2021.
2. Vaswani et al. Attention is all you need. In NIPS, 2017.
Split Image into Patches
• In this example, the patches do not overlap.
• In general, the patches can overlap.
• The user specifies:
• patch size, e.g., 16×16;
• stride, e.g., 16×16 (a stride equal to the patch size gives non-overlapping patches).
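The effect of patch size and stride can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the function name `split_into_patches` and the 224×224 RGB input are hypothetical, not taken from the slides.

```python
import numpy as np

def split_into_patches(image, patch_size=16, stride=16):
    """Return an array of shape (n, patch_size, patch_size, channels)."""
    height, width, _ = image.shape
    patches = []
    # Slide a patch_size x patch_size window over the image in steps of `stride`.
    for top in range(0, height - patch_size + 1, stride):
        for left in range(0, width - patch_size + 1, stride):
            patches.append(image[top:top + patch_size, left:left + patch_size, :])
    return np.stack(patches)

image = np.zeros((224, 224, 3))  # hypothetical 224x224 RGB image

# Stride equal to the patch size: patches do not overlap.
non_overlapping = split_into_patches(image, patch_size=16, stride=16)
# Stride smaller than the patch size: neighbouring patches overlap.
overlapping = split_into_patches(image, patch_size=16, stride=8)

print(non_overlapping.shape)  # (196, 16, 16, 3)
print(overlapping.shape)      # (729, 16, 16, 3)
```

A smaller stride yields more patches, and therefore a longer input sequence for the Transformer.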
Vectorization
If each patch is a 𝑑1×𝑑2×𝑑3 tensor, then vectorizing (flattening) it gives a 𝑑1𝑑2𝑑3×1 vector.
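As a small sketch of this step, the NumPy snippet below flattens each patch into one vector. The sizes (9 patches of shape 16×16×3) are assumptions chosen for illustration.

```python
import numpy as np

d1, d2, d3 = 16, 16, 3   # assumed patch shape
n = 9                    # assumed number of patches

# Stand-in patches; in practice these come from the split image.
patches = np.arange(n * d1 * d2 * d3, dtype=np.float64).reshape(n, d1, d2, d3)

# Flatten every d1 x d2 x d3 patch into a vector of length d1*d2*d3.
vectors = patches.reshape(n, d1 * d2 * d3)

print(vectors.shape)  # (9, 768)
```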
• Denote the vectorized patches by 𝐱1, 𝐱2, ⋯, 𝐱𝑛.
• Apply a dense layer to each vector: 𝐳𝑖 = 𝐖𝐱𝑖 + 𝐛, for 𝑖 = 1, ⋯, 𝑛.
• The dense layers share parameters: the same 𝐖 and 𝐛 are used for every patch.
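Because the dense layers share parameters, applying 𝐳𝑖 = 𝐖𝐱𝑖 + 𝐛 to every patch is the same as one matrix product over all patches. The following is a minimal sketch; the sizes and the random initialization are assumptions, not values from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)
n, input_dim, embed_dim = 9, 768, 64   # assumed sizes

X = rng.normal(size=(n, input_dim))                  # rows are x_1, ..., x_n
W = rng.normal(size=(embed_dim, input_dim)) * 0.02   # shared weight matrix
b = np.zeros(embed_dim)                              # shared bias vector

# One dense layer per patch, all using the same W and b.
Z_loop = np.stack([W @ x + b for x in X])

# Equivalent batched form: every row of X is multiplied by the same W.
Z = X @ W.T + b

print(Z.shape)  # (9, 64)
```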
Positional Encoding
• Add positional encoding vectors to 𝐳1, 𝐳2, ⋯, 𝐳𝑛, one for each position 1, 2, 3, ⋯, 𝑛.
• Why? Self-attention has no built-in notion of order, so without positional encoding the model could not tell the original image from one whose patches have been shuffled.
[Figure: a 3×3 grid of patches numbered ① to ⑨, shown being rearranged.]
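The need for positional encoding can be checked numerically. The sketch below makes several assumptions that are not in the slides: a single-head self-attention layer with random weights, random vectors standing in for the positional encodings, and mean pooling over the outputs as a simple order-independent summary.

```python
import numpy as np

def softmax(scores):
    scores = scores - scores.max(axis=-1, keepdims=True)
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum(axis=-1, keepdims=True)

def self_attention(Z, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention (simplified)."""
    Q, K, V = Z @ Wq, Z @ Wk, Z @ Wv
    weights = softmax(Q @ K.T / np.sqrt(K.shape[-1]))
    return weights @ V

rng = np.random.default_rng(0)
n, d = 9, 8                                   # assumed sizes
Z = rng.normal(size=(n, d))                   # patch embeddings z_1, ..., z_n
P = rng.normal(size=(n, d))                   # stand-in positional encoding vectors
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
shuffle = np.roll(np.arange(n), 1)            # a fixed rearrangement of the patches

def pooled_output(tokens):
    # Average the attention outputs over all positions.
    return self_attention(tokens, Wq, Wk, Wv).mean(axis=0)

# Without positional encoding, shuffling the patches changes nothing.
same_without_pe = np.allclose(pooled_output(Z), pooled_output(Z[shuffle]))
# With positional encoding, position i always receives P[i], so the shuffle is visible.
same_with_pe = np.allclose(pooled_output(Z + P), pooled_output(Z[shuffle] + P))

print(same_without_pe, same_with_pe)
```

Under these assumptions the first comparison holds and the second does not, which is the motivation for adding position information to each 𝐳𝑖.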
Transformer Encoder Network
• Prepend a [CLS] token to the sequence; an embedding layer maps it to 𝐳0, which has the same size as 𝐳1, ⋯, 𝐳𝑛.
• Feed 𝐳0, 𝐳1, ⋯, 𝐳𝑛 to a stack of multi-head self-attention layers and dense layers. Together, these layers form the Transformer encoder network.
• The encoder outputs 𝐜0, 𝐜1, ⋯, 𝐜𝑛.
• Only 𝐜0 is used for classification: a softmax classifier maps 𝐜0 to a confidence for each class.
[Figure: bar chart of the predicted confidence for each class (bird, car, cat, dog, fox, jet, snake, tiger).]
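The path from 𝐳0, ⋯, 𝐳𝑛 to the class confidences can be sketched as follows. This is a deliberately simplified illustration, not the architecture from [1]: it assumes single-head attention instead of multi-head, omits residual connections and layer normalization, and uses random weights and made-up sizes.

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    exp_x = np.exp(x)
    return exp_x / exp_x.sum(axis=-1, keepdims=True)

def init_layer(rng, d):
    # Random weights for one simplified encoder layer.
    Wq, Wk, Wv, Wd = (rng.normal(size=(d, d)) * 0.1 for _ in range(4))
    return Wq, Wk, Wv, Wd, np.zeros(d)

def encoder_layer(Z, params):
    """Single-head self-attention followed by a dense layer with ReLU."""
    Wq, Wk, Wv, Wd, bd = params
    Q, K, V = Z @ Wq, Z @ Wk, Z @ Wv
    attended = softmax(Q @ K.T / np.sqrt(K.shape[-1])) @ V
    return np.maximum(attended @ Wd + bd, 0.0)

rng = np.random.default_rng(0)
n, d, num_layers, num_classes = 9, 16, 2, 8   # assumed sizes

Z_patches = rng.normal(size=(n, d))   # z_1, ..., z_n (positional encoding already added)
z0 = rng.normal(size=(1, d))          # embedding of the [CLS] token
Z = np.concatenate([z0, Z_patches], axis=0)   # shape (n + 1, d)

# Run the sequence through the stacked encoder layers.
C = Z
for params in [init_layer(rng, d) for _ in range(num_layers)]:
    C = encoder_layer(C, params)

# Only c_0 is passed to the softmax classifier.
c0 = C[0]
W_cls = rng.normal(size=(num_classes, d)) * 0.1
b_cls = np.zeros(num_classes)
p = softmax(W_cls @ c0 + b_cls)

print(C.shape)  # (10, 16)
print(p.shape)  # (8,)
```

The vector `p` holds one non-negative confidence per class and sums to 1, matching the bar chart of confidences in the slides.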
[Figure: training pipeline. The model is randomly initialized, pretrained on Dataset A, fine-tuned on the training set of Dataset B, and then evaluated on the test set of Dataset B to obtain the test accuracy.]
Datasets
Dataset                  # of Images    # of Classes
ImageNet (Small)         1.3 Million    1 Thousand
ImageNet-21K (Medium)    14 Million     21 Thousand
JFT (Big)                300 Million    18 Thousand
Image Classification Accuracies
• Pretrain the model on Dataset A, fine-tune the model on Dataset B,
and evaluate the model on Dataset B.
• Pretrained on ImageNet (small), ViT is slightly worse than ResNet.
• Pretrained on ImageNet-21K (medium), ViT is comparable to ResNet.
• Pretrained on JFT (large), ViT is slightly better than ResNet.
[Figure: axis showing the number of images for pretraining, with marks at 100M and 300M images. ResNet is better with fewer pretraining images; ViT is better with more.]
Thank You!