
ViTs

Vision Transformers (ViT) leverage a pure transformer architecture to achieve strong performance in image classification tasks, comparable to traditional methods. Key ingredients for ViT's success include self-attention mechanisms and pre-training on large datasets, such as Google's JFT with 303 million labeled images. The approach involves inputting image patches instead of pixels and using a [CLS] token for classification.

Uploaded by

architmishra062

Vision Transformers

Transformers are a modern neural network architecture designed to handle sequential data (such as text or time series) using a mechanism called attention, instead of recurrence (RNN) or convolution (CNN).
Introduced in 2017, Transformers Achieved Astonishing Performance for NLP Problems

Inspired, researchers in the computer vision community explored transformers for many vision problems and discovered they perform well.
Khan et al. Transformers in Vision: A Survey. C
Common Paradigm for NLP Transformers

Transformers can provide effective features for downstream tasks
Why ViT?
Named after the proposed technique: Vision Transformer

Dosovitskiy et al. An Image is Worth 16x16 Words: Transformers for Image Recognition
at Scale. ICLR 2021.

Novelty:
First paper to demonstrate that a pure transformer architecture can achieve
strong performance on vision tasks, with image classification results
comparable to or better than the best methods at the time
Approach

ViT: Key Ingredients for Success

Transformer architecture (embeds self-attention)

Pre-training with massive amounts of data



Architecture

Architecture: Uses Popular BERT (Bidirectional Encoder Representations from Transformers)
Architecture: Key Novelty is Self-Attention
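
To make the self-attention idea concrete, here is a minimal NumPy sketch of single-head scaled dot-product self-attention. It is an illustration under assumptions, not the ViT implementation: the token count, the dimensions, and the random projection matrices `w_q`, `w_k`, `w_v` are hypothetical stand-ins for learned parameters.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(tokens, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention.

    tokens: (n_tokens, d_model); w_q, w_k, w_v: (d_model, d_head).
    Returns the attended values and the attention weights.
    """
    q = tokens @ w_q                         # queries
    k = tokens @ w_k                         # keys
    v = tokens @ w_v                         # values
    scores = q @ k.T / np.sqrt(k.shape[-1])  # every token scores every token
    weights = softmax(scores, axis=-1)       # each row sums to 1
    return weights @ v, weights

# Hypothetical sizes and random (untrained) projections, for illustration only.
rng = np.random.default_rng(0)
n_tokens, d_model, d_head = 5, 8, 4
tokens = rng.normal(size=(n_tokens, d_model))
w_q, w_k, w_v = (rng.normal(size=(d_model, d_head)) for _ in range(3))

out, weights = self_attention(tokens, w_q, w_k, w_v)
print(out.shape)      # (5, 4)
print(weights.shape)  # (5, 5)
```

Each output row is a weighted mix of all value vectors, so every token can draw on every other token in a single step, with no recurrence or convolution.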

ViT Solution: Input Patches Instead of Pixels

Example (simplified): a 160 x 160 pixel image is decomposed into non-overlapping patches, and each patch is “flattened” into input features.
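
The patch idea can be sketched in a few lines of NumPy. This is only an illustration: the 16 x 16 patch size is taken from the paper title, and the 3 color channels and the helper name `image_to_patches` are assumptions.

```python
import numpy as np

def image_to_patches(image, patch_size):
    """Split an (H, W, C) image into non-overlapping, flattened patches.

    Returns an array of shape (num_patches, patch_size * patch_size * C),
    with patches ordered row by row across the image.
    """
    h, w, c = image.shape
    assert h % patch_size == 0 and w % patch_size == 0, "image must tile evenly"
    grid_h, grid_w = h // patch_size, w // patch_size
    # (grid_h, patch, grid_w, patch, C) -> (grid_h, grid_w, patch, patch, C)
    patches = image.reshape(grid_h, patch_size, grid_w, patch_size, c)
    patches = patches.transpose(0, 2, 1, 3, 4)
    return patches.reshape(grid_h * grid_w, patch_size * patch_size * c)

# A dummy 160 x 160 RGB image; the pixel values are just a running counter.
image = np.arange(160 * 160 * 3, dtype=np.float32).reshape(160, 160, 3)
patches = image_to_patches(image, patch_size=16)
print(patches.shape)  # (100, 768)
```

Under these assumptions the transformer sees a sequence of 100 tokens instead of 25,600 pixels, which keeps the quadratic cost of self-attention manageable.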
ViT Solution: Use [CLS] for Image Classification

The [CLS] token represents the image.
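
A rough sketch of how a [CLS] token can be used for classification is shown below. Everything here is a hypothetical stand-in: the sizes are arbitrary, the parameters are random rather than learned, and `toy_encoder` is a crude placeholder that only mixes tokens by adding their mean, not a real transformer encoder.

```python
import numpy as np

rng = np.random.default_rng(0)
n_patches, d_model, n_classes = 100, 32, 10  # hypothetical sizes

patch_embeddings = rng.normal(size=(n_patches, d_model))
cls_token = rng.normal(size=(1, d_model))  # a learned parameter in a real model

# Prepend [CLS] so the sequence has one extra token at position 0.
sequence = np.concatenate([cls_token, patch_embeddings], axis=0)

def toy_encoder(x):
    """Placeholder for the transformer encoder (NOT real attention):
    adds the mean of all tokens to each token so information is shared."""
    return x + x.mean(axis=0, keepdims=True)

encoded = toy_encoder(sequence)
cls_output = encoded[0]  # only the [CLS] position feeds the classifier

w_head = rng.normal(size=(d_model, n_classes))  # untrained classification head
logits = cls_output @ w_head
predicted_class = int(np.argmax(logits))
print(sequence.shape, logits.shape)  # (101, 32) (10,)
```

The design point is that the classifier reads a single vector regardless of how many patches the image produced.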
ViT Pre-Training
• Dataset: JFT with 303M labeled images
(proprietary Google dataset)
• Task: classification loss (supervised)
• Optimizer: Adam

* Note: research is also exploring how smaller training data can be effective; e.g., data-efficient image transformers (DeiT), “Training data-efficient image transformers & distillation through attention”
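
The recipe above (supervised classification loss optimized with Adam) can be sketched on a toy problem. This is a minimal illustration under assumptions: a linear classifier on random features stands in for the full model, and the learning rate, step count, and data sizes are hypothetical, not the paper's settings.

```python
import numpy as np

def softmax_cross_entropy(logits, labels):
    """Mean cross-entropy loss and its gradient with respect to the logits."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    n = logits.shape[0]
    loss = -log_probs[np.arange(n), labels].mean()
    grad = np.exp(log_probs)            # softmax probabilities
    grad[np.arange(n), labels] -= 1.0   # probabilities minus one-hot labels
    return loss, grad / n

def adam_step(param, grad, m, v, t, lr=1e-2, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; returns the new parameter and moment estimates."""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)        # bias-corrected first moment
    v_hat = v / (1 - beta2 ** t)        # bias-corrected second moment
    return param - lr * m_hat / (np.sqrt(v_hat) + eps), m, v

rng = np.random.default_rng(0)
features = rng.normal(size=(64, 16))    # toy stand-in for image features
labels = rng.integers(0, 4, size=64)    # toy labels for 4 classes
weights = np.zeros((16, 4))
m, v = np.zeros_like(weights), np.zeros_like(weights)

losses = []
for t in range(1, 101):
    loss, grad_logits = softmax_cross_entropy(features @ weights, labels)
    losses.append(loss)
    grad_w = features.T @ grad_logits   # chain rule back to the weights
    weights, m, v = adam_step(weights, grad_w, m, v, t)

print(losses[0] > losses[-1])
```

With zero-initialized weights the first loss equals log(4), the loss of a uniform guess over 4 classes, and it falls as Adam updates the weights.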
ViT Training

ViT Fine-Tuning: Other Image Classification Tasks
MLP replaced with a single linear layer when
fine-tuning to new classification categories
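
The head swap can be sketched as follows. This is an assumption-laden illustration: the layer sizes and class counts are hypothetical, `tanh` is an illustrative choice of hidden activation, and zero-initializing the new linear layer is one reasonable option rather than a detail stated on this slide.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_hidden = 32, 64                     # hypothetical sizes
n_pretrain_classes, n_new_classes = 1000, 10   # hypothetical class counts

# Pre-training head: an MLP with one hidden layer.
pretrain_head = {
    "w1": rng.normal(size=(d_model, d_hidden)) * 0.02,
    "b1": np.zeros(d_hidden),
    "w2": rng.normal(size=(d_hidden, n_pretrain_classes)) * 0.02,
    "b2": np.zeros(n_pretrain_classes),
}

def mlp_head(x, p):
    """MLP classification head used during pre-training."""
    return np.tanh(x @ p["w1"] + p["b1"]) @ p["w2"] + p["b2"]

# Fine-tuning head: the MLP is discarded and replaced by one linear layer
# sized for the new label set (zero-initialized here as an assumption).
finetune_head = {
    "w": np.zeros((d_model, n_new_classes)),
    "b": np.zeros(n_new_classes),
}

def linear_head(x, p):
    """Single linear classification head used during fine-tuning."""
    return x @ p["w"] + p["b"]

cls_features = rng.normal(size=(4, d_model))   # stand-in [CLS] outputs for 4 images
pretrain_logits = mlp_head(cls_features, pretrain_head)
finetune_logits = linear_head(cls_features, finetune_head)
print(pretrain_logits.shape, finetune_logits.shape)  # (4, 1000) (4, 10)
```

Only the head changes shape with the new categories; the encoder that produces the [CLS] features is reused from pre-training.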
