Vision Transformers
Transformers are a modern neural network architecture designed to handle sequential data
(like text or time series) using a mechanism called attention, instead of
recurrence (RNN) or convolution (CNN).
Introduced in 2017, Transformers Achieved Astonishing Performance for NLP Problems
Inspired, researchers in the computer vision community explored
transformers for many vision problems and discovered they perform well
Khan et al. Transformers in Vision: A Survey. C
Common Paradigm for NLP Transformers
Transformers can provide effective features for downstream tasks
Why ViT?
Named after the proposed technique: Vision Transformer
Dosovitskiy et al. An Image is Worth 16x16 Words: Transformers for Image Recognition
at Scale. ICLR 2021.
Novelty:
First paper to demonstrate that a pure transformer architecture can achieve
strong performance on vision tasks, with image classification results
comparable to or better than the best methods at the time
Approach
ViT: Key Ingredients for Success
Transformer architecture (embeds self-attention)
Pre-training with massive amounts of data
Architecture
Architecture: Uses Popular BERT (Bidirectional Encoder Representations from Transformers)
Architecture: Key Novelty is Self-Attention
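To make the self-attention idea concrete, here is a minimal single-head sketch in NumPy. It is an illustration only, not the ViT implementation: the token count, dimensions, and random projection matrices `w_q`, `w_k`, `w_v` are hypothetical, and a real model uses multiple heads with learned weights.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def self_attention(tokens, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention.

    tokens: (num_tokens, dim); w_q, w_k, w_v: (dim, head_dim).
    Returns the attended outputs and the attention weights.
    """
    q = tokens @ w_q  # queries
    k = tokens @ w_k  # keys
    v = tokens @ w_v  # values
    # Every token scores every other token; scale by sqrt(head_dim).
    scores = q @ k.T / np.sqrt(k.shape[-1])
    weights = softmax(scores, axis=-1)  # each row sums to 1
    return weights @ v, weights

# Hypothetical sizes: 5 tokens, 8-dim embeddings, 4-dim head.
rng = np.random.default_rng(0)
tokens = rng.normal(size=(5, 8))
w_q, w_k, w_v = (rng.normal(size=(8, 4)) for _ in range(3))
out, weights = self_attention(tokens, w_q, w_k, w_v)
print(out.shape, weights.shape)
```

Each output row is a weighted average of all value vectors, so every token can draw on information from every other token in a single step.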
ViT Solution: Input Patches Instead of Pixels
Example: a 160 x 160 pixel image is decomposed into non-overlapping patches (simplified);
linear projections of the "flattened" patches serve as the input features
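The patch step can be sketched as follows. This is a rough illustration under stated assumptions: the 16 x 16 patch size is taken from the paper's title, while `embed_dim`, the random `projection` matrix, and the function name are hypothetical stand-ins for learned parameters.

```python
import numpy as np

def image_to_patch_embeddings(image, patch_size, projection):
    """Split an (H, W, C) image into non-overlapping patches, flatten each
    patch, and apply a linear projection to get one feature vector per patch."""
    h, w, c = image.shape
    assert h % patch_size == 0 and w % patch_size == 0
    rows, cols = h // patch_size, w // patch_size
    # (rows, p, cols, p, c) -> (rows, cols, p, p, c)
    patches = image.reshape(rows, patch_size, cols, patch_size, c)
    patches = patches.transpose(0, 2, 1, 3, 4)
    # Flatten every patch into a single vector of length p * p * c.
    flat = patches.reshape(rows * cols, patch_size * patch_size * c)
    return flat @ projection

rng = np.random.default_rng(0)
image = rng.random((160, 160, 3))  # the 160 x 160 example, 3 color channels
patch_size = 16                    # assumption: patch size from the paper title
embed_dim = 64                     # hypothetical embedding size
projection = rng.normal(size=(patch_size * patch_size * 3, embed_dim)) * 0.02

patch_embeddings = image_to_patch_embeddings(image, patch_size, projection)
print(patch_embeddings.shape)  # (100, 64)
```

With these assumed sizes the image yields 10 x 10 = 100 patches, so the transformer sees a sequence of 100 tokens rather than 25,600 individual pixels.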
ViT Solution: Use [CLS] for Image Classification
An extra [CLS] token is added to the input sequence;
its output represents the entire image
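A small sketch of how a [CLS] token might be used for classification. All sizes are hypothetical, and `toy_encoder` is only a stand-in (one parameter-free attention step) for the real transformer encoder; in an actual model the [CLS] vector and the classification weights are learned.

```python
import numpy as np

rng = np.random.default_rng(0)
num_patches, embed_dim, num_classes = 100, 64, 10  # hypothetical sizes

patch_embeddings = rng.normal(size=(num_patches, embed_dim))
cls_token = rng.normal(size=(1, embed_dim)) * 0.02  # learnable in a real model

# Prepend [CLS] so the sequence has num_patches + 1 tokens.
tokens = np.concatenate([cls_token, patch_embeddings], axis=0)

def toy_encoder(x):
    # Stand-in for the transformer encoder: one parameter-free self-attention
    # step, so every token (including [CLS]) mixes information from all tokens.
    scores = x @ x.T / np.sqrt(x.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ x

encoded = toy_encoder(tokens)

# Only the output at the [CLS] position (index 0) feeds the classifier.
cls_output = encoded[0]
classifier = rng.normal(size=(embed_dim, num_classes)) * 0.02  # hypothetical
logits = cls_output @ classifier
predicted_class = int(np.argmax(logits))
print(tokens.shape, logits.shape, predicted_class)
```

Because attention lets the [CLS] position gather information from every patch, its output can act as a single summary vector for the whole image.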
ViT Pre-Training
• Dataset: JFT with 303M labeled images
(proprietary Google dataset)
• Task: classification loss (supervised)
• Optimizer: Adam
* Note: research is also exploring how smaller training data can be effective;
e.g., data-efficient image transformers (DeiT), "Training data-efficient image
transformers & distillation through attention"
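The pre-training recipe above (supervised classification loss, Adam optimizer) can be illustrated on a toy scale. This is a sketch only: it trains a single linear classifier on random stand-in features rather than a full ViT, and the sizes, learning rate, and step count are hypothetical; only the cross-entropy loss and the Adam update rule follow their standard definitions.

```python
import numpy as np

def softmax(logits):
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def cross_entropy(logits, labels):
    # Mean negative log-likelihood of the correct class.
    probs = softmax(logits)
    return -np.mean(np.log(probs[np.arange(len(labels)), labels] + 1e-12))

def adam_step(param, grad, m, v, t, lr=0.01, b1=0.9, b2=0.999, eps=1e-8):
    # Standard Adam update with bias-corrected moment estimates.
    m = b1 * m + (1 - b1) * grad
    v = b2 * v + (1 - b2) * grad ** 2
    m_hat = m / (1 - b1 ** t)
    v_hat = v / (1 - b2 ** t)
    return param - lr * m_hat / (np.sqrt(v_hat) + eps), m, v

rng = np.random.default_rng(0)
features = rng.normal(size=(32, 8))   # stand-in for encoder outputs
labels = rng.integers(0, 4, size=32)  # 4 hypothetical classes
weights = np.zeros((8, 4))
m, v = np.zeros_like(weights), np.zeros_like(weights)

losses = []
for step in range(1, 201):
    logits = features @ weights
    losses.append(cross_entropy(logits, labels))
    # Gradient of the mean cross-entropy with respect to the weights.
    grad_logits = softmax(logits)
    grad_logits[np.arange(len(labels)), labels] -= 1.0
    grad = features.T @ grad_logits / len(labels)
    weights, m, v = adam_step(weights, grad, m, v, step)

print(losses[0], losses[-1])
```

In the real setting the same loss and optimizer are applied to all ViT parameters over the full labeled dataset, not to a single linear layer.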
ViT Training
ViT Fine-Tuning: Other Image Classification Tasks
The MLP classification head is replaced with a single linear layer when
fine-tuning to new classification categories
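A minimal sketch of the head swap, assuming hypothetical sizes and a random stand-in for the encoder's [CLS] feature. The tanh non-linearity in the pre-training head and the zero initialization of the new linear layer are assumptions made for illustration; the encoder itself (not shown) would be kept and fine-tuned.

```python
import numpy as np

rng = np.random.default_rng(0)
embed_dim, hidden_dim = 64, 128                   # hypothetical sizes
num_pretrain_classes, num_new_classes = 1000, 10  # hypothetical class counts

def mlp_head(x, w1, b1, w2, b2):
    # Pre-training head: an MLP with one hidden layer (tanh is an assumption).
    return np.tanh(x @ w1 + b1) @ w2 + b2

def linear_head(x, w, b):
    # Fine-tuning head: a single linear layer.
    return x @ w + b

cls_feature = rng.normal(size=(embed_dim,))  # stand-in for the encoder output

# Head used during pre-training.
w1 = rng.normal(size=(embed_dim, hidden_dim)) * 0.02
b1 = np.zeros(hidden_dim)
w2 = rng.normal(size=(hidden_dim, num_pretrain_classes)) * 0.02
b2 = np.zeros(num_pretrain_classes)
pretrain_logits = mlp_head(cls_feature, w1, b1, w2, b2)

# For fine-tuning, discard the MLP head and attach a fresh linear layer
# sized for the new label set (zero initialization is an assumption here).
w_new = np.zeros((embed_dim, num_new_classes))
b_new = np.zeros(num_new_classes)
finetune_logits = linear_head(cls_feature, w_new, b_new)

print(pretrain_logits.shape, finetune_logits.shape)
```

Only the head changes shape between the two stages; the features feeding it come from the same pre-trained encoder.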