Vision Transformer (ViT) Architecture
Vision Transformer (ViT) is a deep learning architecture that applies the Transformer model to
images. Instead of relying on convolutions, ViTs use self-attention to capture relationships
across all image patches, enabling a global understanding of the image. This approach has
achieved state-of-the-art results in various computer vision tasks.
Uses self-attention to model global dependencies between image patches.
Unlike CNNs, it does not rely on convolution operations for feature extraction.
Demonstrates strong performance in image classification, object detection and
segmentation.
Vision Transformer (ViT) Architecture Overview
Instead of processing words, ViT treats an image as a sequence of fixed-size patches and
applies self-attention across them. This allows the model to capture long range dependencies
between different parts of an image without relying on convolution operations.
ViT architecture includes the following major components:
This stage converts a 2D image into a sequence of patch embeddings, analogous to tokens in
NLP. It forms the input for the Transformer by turning spatial information into a linear
sequence.
Patch Splitting:The image is divided into fixed size and non-overlapping patches
each treated as a token and converted into a 1D sequence for the Transformer
reducing computation while preserving local spatial information.
Patch Flattening: Each patch of size P × P ×C is flattened into a single vector of
length P2 × C . This flattening removes spatial dimensions temporarily and allows
the model to treat patches uniformly. The flattened vectors serve as the raw inputs
to the linear projection layer.
Patch Embedding (Linear Projection): Each flattened patch is mapped to a
learnable D dimensional embedding enabling the model to learn high level features
similar to word embeddings in NLP.
Patch embeddings can also be extracted using a convolution layer with kernel size
and stride equal to the patch size, making each convolution act as a patch extractor.
2. Positional Encoding
Since Transformers are permutation invariant, positional encodings inject spatial order so the
model knows the relative positions of patches .
Need for Positional Encoding: Since Transformers treat tokens as unordered
positional encodings are added to retain spatial structure and patch location
information.
Learnable Positional Embeddings: ViT uses learnable positional vectors to
capture local and global spatial relationships adapting better than fixed encodings
across image resolutions.
3. Adding the Classification Token (CLS Token)
A learnable CLS token is prepended to the patch sequence to aggregate information from all
patches, serving as the image-level representation for classification.
Purpose of the CLS Token: The CLS token is a learnable vector added to patch
embeddings that gathers global information and is used for final classification,
similar to BERT.
How the CLS Token Learns Image-Level Representation: The CLS token
attends to all patches to learn global image features and its final output alone is used
for prediction without CNN style pooling .
4. Transformer Encoder (Pre-LayerNorm Architecture)
Pre-LayerNorm applies LayerNorm before both the attention and feed-forward blocks. This
stabilizes gradient flow and prevents the exploding/vanishing gradient problem in deep
Transformers.
Each Encoder Block has:
Multi-Head Self-Attention (MSA)
Feed-Forward Network (FFN)
Residual connections and LayerNorm
5. Multi-Head Self-Attention (MSA)
Allows each patch to attend to every other patch to model global dependencies,
capturing relationships between distant image regions.
1. Self-Attention Mechanism
Self-attention enables each patch to relate to all others by using query, key and value
projections with the attention matrix controlling token influence. The input sequence
consists of N image patches plus 1 CLS token, with each token represented by a D-
dimensional embedding.
Compute Queries, Keys and Values
2. Multi-Head Attention
Multiple attention heads allow the model to attend to different types of information
simultaneously. The outputs of all heads are concatenated and linearly projected to form the
final attention output. This parallel attention mechanism leads to richer and more diverse
feature representations.
MSA(X)=Concat(head1,…,headh)WO
Multiple heads (hhh) allow the model to focus on different types of relationships
simultaneously (e.g., edges, color, textures, global shapes)
i i i
headi =Attention ( X W Q , X W K , X W V )
6. Feed-Forward Network (FFN)
The FFN transforms each patch embedded to a higher-dimensional space and back using two
dense layers with a GELU activation, enabling complex feature learning. It operates
independently on each token with shared weights, allowing efficient non-linear
transformations.
Expands and transforms features for better expressiveness. GELU Activation is used for
smooth non-linearity improves learning and stability.
7. Residual Connections and Layer Normalization
Ensures stable training in deep networks by preserving information and normalizing
activations.
Residual (Skip) Connections: Residual connections bypass transformation blocks
to preserve earlier layer information, preventing degradation in deep networks.
They enable the model to learn incremental refinements, improving convergence
and stability in deep ViTs.
Layer Normalization: LayerNorm normalizes features across the input, stabilizing
training and reducing internal covariate shift. Pre-LN ensures well-conditioned
gradients and consistent scaling across tokens in deep Transformers.
8. Classification Head (MLP Head)
Converts the CLS token output into class probabilities using a small feed-forward network.
MLP Head Structure: The classification head uses one or two fully connected
layers on the final CLS token to produce class probabilities, optionally with dropout
for regularization. It serves as the ViT’s final decision-making component.
Softmax for Prediction: Softmax converts logits into normalized probabilities
summing to 1, with the highest probability indicating the predicted class. It enables
multi-class classification and pairs with cross-entropy loss during training.
9. Training Vision Transformers
ViTs need more data than CNNs due to low inductive bias and training involves pretraining on
large datasets followed by finetuning.
Inductive Bias Differences: CNNs use strong inductive biases like locality and
translation invariance, while ViTs treat images as patch sequences, requiring more
data but offering greater flexibility.
Data Requirements: ViTs need large-scale datasets and augmentations to
generalize well due to their low inductive bias, unlike CNNs.
Pretraining: Pretraining lets ViTs learn general visual features via supervised or
self-supervised methods, reducing compute needs for downstream tasks.
Finetuning: Finetuning adapts pretrained ViTs to specific datasets using fewer
labels, often with layer-wise learning rate decay to improve performance.
Vision Transformer (ViT) vs. Convolutional Neural Networks (CNNs)
Here we compare ViT with CNN
Features CNNs ViTs
Attention Scope Capture local features via Capture global relationships via
convolutions self-attention
Inductive Bias Strong biases (locality, Minimal biases, more flexible but
translation invariance) data-hungry
Data Work well with small datasets Need large datasets for best
Requirement performance
Feature Learn hierarchical features Learn context-rich, long-range
Learning features
Advantages
Global Context: Captures long-range dependencies between patches,
understanding the overall image context.
Scalability: Performs well with larger datasets and deeper architectures for
complex vision tasks.
Parallel Processing: Transformer architecture allows for efficient parallel
computation compared to sequential CNN operations.
Unified Architecture: Can handle different input modalities (images, patches, or
tokens) within the same framework.
Strong Representations: Learns powerful high-level feature representations due to
attention mechanisms capturing diverse patterns.
Limitations
Data-Hungry Nature: Requires very large datasets to learn meaningful visual
representations.
High Computational Cost: Self-attention scales quadratically with the number of
patches, increasing memory and compute requirements.
Lack of Local Feature Bias: Does not naturally exploit local patterns, reducing
sample efficiency.
Sensitivity to Hyperparameters: Patch size, embedding dimension, and attention
heads need careful tuning.
Difficulty with Small Images: Few tokens from small images reduce attention
effectiveness.
Longer Training Times: High complexity and large datasets lead to extended training
durations.