Deep Learning Interview Prep - Transformers & ViT

The document is a comprehensive guide for advanced deep learning interview preparation, focusing on Transformers and attention mechanisms. It includes 50 interview questions categorized into topics such as Transformer foundations, self-attention, multi-head attention, encoder-decoder architecture, and Vision Transformers, along with Python code snippets for practical understanding. The content aims to equip candidates with essential knowledge and skills for deep learning roles, particularly in relation to the Transformer architecture.

Uploaded by

Nirupama sekar


DEEP LEARNING INTERVIEW PREP SERIES

Advanced Deep Learning

Transformers and Attention Mechanisms
Self-Attention | Multi-Head Attention | Encoder-Decoder | Positional Encoding

50 Interview Questions with Python Code Snippets

• Transformer Foundations
• Self-Attention
• Multi-Head Attention
• Encoder-Decoder
• Vision Transformer (ViT)
• CNN vs ViT & Hybrids

LAMHOT SIAGIAN
AI Engineering Insider
2026 Edition • From Words to Witness: The Rise of Transformers & ViT
Contents

Category 1: Transformer Foundations
Q1: What is a Transformer and why was it introduced to replace RNN/LSTM?
Q2: List and explain the four core components of a Transformer block.
Q3: What is Positional Encoding and why is it necessary?
Q4: How does parallel processing in Transformers improve training compared to RNNs?
Q5: What is Layer Normalisation and how does it differ from Batch Normalisation in Transformers?
Q6: Explain the residual (skip) connections in the Transformer and their purpose.
Q7: What is the Feed-Forward Network (FFN) inside a Transformer block?
Q8: Why does self-attention have $O(n^2)$ complexity and what methods reduce it?
Q9: What applications leverage the Transformer architecture?
Q10: What is the difference between Pre-LN and Post-LN Transformer variants?

Category 2: Self-Attention Mechanism
Q11: Explain the Query, Key, Value (Q, K, V) abstraction in self-attention.
Q12: Write out the full scaled dot-product attention formula and explain each term.
Q13: Why do we divide by $\sqrt{d_k}$ in the attention formula?
Q14: What is an attention mask and when is it used?
Q15: How does self-attention capture long-range dependencies?
Q16: What is cross-attention (encoder-decoder attention)?
Q17: What are the advantages and limitations of self-attention?
Q18: Implement self-attention from scratch in NumPy/PyTorch.
Q19: What is the role of Softmax in attention and can it be replaced?
Q20: How do you visualise attention weights for interpretability?

Category 3: Multi-Head Attention
Q21: What is Multi-Head Attention (MHA) and why is it better than single-head attention?
Q22: Walk through the step-by-step computation of Multi-Head Attention.
Q23: How do you choose the number of attention heads $h$?
Q24: What does each attention head typically learn?
Q25: What is the output projection matrix $W^O$ in MHA and why is it needed?
Q26: Explain Grouped Query Attention (GQA) used in LLaMA-2/3.
Q27: What is attention dropout and why is it used?
Q28: Compare MHA parameter counts: 1 head vs 8 heads for $d_{\text{model}} = 512$.

Category 4: Encoder–Decoder Architecture
Q29: What does the Encoder do in a Transformer and what is its output?
Q30: Describe the three attention sub-layers inside a Transformer decoder block.
Q31: Why is Masked Self-Attention used in the decoder?
Q32: How does the Transformer generate output sequences (inference)?
Q33: What is the output layer of a Transformer and how are logits produced?
Q34: Compare encoder-only, decoder-only, and encoder-decoder Transformer architectures.
Q35: What is the KV-Cache and how does it accelerate autoregressive decoding?
Q36: What is Beam Search and how is it used in Transformer generation?

Category 5: Vision Transformer (ViT)
Q37: What is a Vision Transformer (ViT) and how does it process an image?
Q38: Why does ViT need positional embeddings for image patches?
Q39: What is the CLS token in ViT and how is it used for classification?
Q40: What are the advantages of ViT over CNNs?
Q41: What are the limitations of ViT compared to CNNs?
Q42: How does a Swin Transformer address ViT’s scalability issues?
Q43: Implement a minimal ViT forward pass in PyTorch.

Category 6: CNN vs. ViT & Hybrid Models
Q44: What is inductive bias and why do CNNs have more of it than ViTs?
Q45: When should you choose a CNN over ViT and vice versa?
Q46: What are Hybrid CNN-Transformer models and give two examples.
Q47: What is DINO and how does it enable self-supervised ViT training?
Q48: Compare BERT (encoder-only) vs GPT (decoder-only) architectures.
Q49: What is Flash Attention and how does it speed up Transformer training?
Q50: What is LoRA (Low-Rank Adaptation) and how is it used for efficient fine-tuning of large Transformers?

Category 1: Transformer Foundations (Q1–Q10)

Core concepts of the Transformer architecture: what it is, why it was introduced, and how
its key components work together.

Q1
What is a Transformer and why was it introduced to replace RNN/LSTM?

A Transformer is a deep learning architecture based entirely on attention mechanisms,
dispensing with recurrence and convolution. It was introduced in the landmark 2017 paper
“Attention Is All You Need” to overcome two critical RNN/LSTM limitations:
• Sequential processing: RNNs process tokens one at a time, which prevents parallelisation
across time steps and makes training slow on long sequences.
• Long-range dependency: vanishing-gradient issues cause information from distant
tokens to be lost.
Transformers address both by processing all tokens simultaneously through self-attention,
giving every token direct access to every other token regardless of distance.

Code Python
import torch
import torch.nn as nn

# Minimal Transformer encoder usage
encoder_layer = nn.TransformerEncoderLayer(
    d_model=512, nhead=8, dim_feedforward=2048,
    dropout=0.1, batch_first=True
)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=6)

# src: (batch, seq_len, d_model)
src = torch.rand(2, 10, 512)
out = encoder(src)  # (2, 10, 512)
print("Encoder output shape:", out.shape)

Q2
List and explain the four core components of a Transformer block.

The encoder is built from four core components. The first two are applied once at the
input; the last two are repeated in every encoder block:

1. Input Embedding: converts discrete tokens to continuous $d_{\text{model}}$-dimensional vectors.
2. Positional Encoding: injects order information (sine/cosine or learned), since attention
on its own ignores token order.
3. Multi-Head Self-Attention: allows each token to attend to all others, computing
weighted context vectors in parallel.
4. Feed-Forward Network (FFN): two linear layers with a ReLU (or GELU) in
between, applied position-wise; expands then contracts the representation.
Each sub-layer (attention and FFN) is wrapped with a residual connection and layer normalisation:
output = LayerNorm(x + Sublayer(x)).

Code Python
import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    """Post-LN encoder block: LayerNorm(x + Sublayer(x))."""

    def __init__(self, d_model=512, nhead=8, ffn_dim=2048):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, nhead, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, ffn_dim), nn.ReLU(),
            nn.Linear(ffn_dim, d_model)
        )
        self.ln1 = nn.LayerNorm(d_model)
        self.ln2 = nn.LayerNorm(d_model)

    def forward(self, x):
        # Self-attention + residual
        attn_out, _ = self.attn(x, x, x)
        x = self.ln1(x + attn_out)
        # FFN + residual
        x = self.ln2(x + self.ffn(x))
        return x

Q3
What is Positional Encoding and why is it necessary?

Self-attention has no built-in notion of order (strictly, it is permutation-equivariant):
shuffling the input tokens simply shuffles the outputs the same way, and the attention
weight between any two tokens is unchanged. Without explicit position signals, the model
cannot distinguish “dog bites man” from “man bites dog”.
The original paper uses fixed sinusoidal encoding:
$$PE_{(pos,\,2i)} = \sin\!\left(\frac{pos}{10000^{2i/d}}\right), \qquad PE_{(pos,\,2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d}}\right)$$
These are added (not concatenated) to the token embeddings. Sinusoidal encodings can be
evaluated at sequence lengths not seen in training; learned positional embeddings often
perform similarly but cannot represent positions beyond the maximum length they were trained with.

Code Python
import math
import torch

def positional_encoding(seq_len, d_model):
    pe = torch.zeros(seq_len, d_model)
    pos = torch.arange(0, seq_len).unsqueeze(1).float()
    # 1 / 10000^(2i/d_model), computed in log space
    div = torch.exp(
        torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model)
    )
    pe[:, 0::2] = torch.sin(pos * div)  # even dims
    pe[:, 1::2] = torch.cos(pos * div)  # odd dims
    return pe  # (seq_len, d_model)

pe = positional_encoding(50, 512)
print(pe.shape)  # torch.Size([50, 512])
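
As a quick sanity check of the sinusoidal formula, the sketch below evaluates it by hand for single entries using only the standard library. The helper name `pe_entry` and the tiny `d_model = 4` are illustrative assumptions, chosen so the numbers can be verified with a calculator.

```python
import math

def pe_entry(pos, k, d_model):
    """Sinusoidal encoding value at position `pos`, feature index `k`."""
    i = k // 2                                   # dims 2i and 2i+1 share one frequency
    angle = pos / (10000 ** (2 * i / d_model))
    return math.sin(angle) if k % 2 == 0 else math.cos(angle)

d_model = 4                                      # illustrative, not a realistic model width
row0 = [pe_entry(0, k, d_model) for k in range(d_model)]
row1 = [pe_entry(1, k, d_model) for k in range(d_model)]
print(row0)                                      # position 0: sin(0) and cos(0) alternate
print([round(v, 4) for v in row1])
```

Position 0 always encodes to alternating 0 and 1, and higher feature indices oscillate more slowly, which is what lets the model read off both fine and coarse position information.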


Q4
How does parallel processing in Transformers improve training compared to
RNNs?

In an RNN the hidden state at step $t$ depends on step $t-1$, creating a strict sequential
dependency that prevents parallelism across time steps during training. This means:
• GPU utilisation is poor for long sequences.
• The number of sequential steps, and hence wall-clock training time, grows linearly with sequence length.
Transformers compute attention for all positions simultaneously via matrix multiplications,
allowing the GPU/TPU to process the entire sequence in a single forward pass. The total
work per layer is $O(n^2 \cdot d)$ (quadratic in sequence length) but it is highly parallel,
whereas an RNN needs $O(n)$ steps that must run one after another.

Code Python
import time
import torch

# RNN: sequential hidden-state updates
rnn = torch.nn.RNN(512, 512, batch_first=True)
batch = torch.rand(8, 512, 512)  # (batch, seq, feat)

t0 = time.perf_counter()
for _ in range(50):
    rnn(batch)
print(f"RNN 50 passes: {time.perf_counter() - t0:.3f}s")

# Transformer: all positions at once
enc = torch.nn.TransformerEncoderLayer(
    d_model=512, nhead=8, batch_first=True)

t0 = time.perf_counter()
for _ in range(50):
    enc(batch)
print(f"TFM 50 passes: {time.perf_counter() - t0:.3f}s")

Q5
What is Layer Normalisation and how does it differ from Batch Normalisation
in Transformers?

Batch Norm normalises each feature across the batch dimension. This is problematic
for the variable-length sequences and small batches common in NLP.
Layer Norm normalises across the feature dimension for each single sample independently:
$$\hat{x}_i = \frac{x_i - \mu_i}{\sigma_i + \epsilon} \cdot \gamma + \beta$$
where $\mu_i$ and $\sigma_i$ are the mean and standard deviation computed over the
$d_{\text{model}}$ features of sample $i$ (in a Transformer, each token position). PyTorch's
nn.LayerNorm uses the closely related denominator $\sqrt{\sigma_i^2 + \epsilon}$. Because no
statistics are shared across samples, Layer Norm is batch-size independent and well-suited
for sequence models.

Code Python
import torch
import torch.nn as nn

x = torch.rand(4, 10, 512)  # (batch, seq, d_model)

bn = nn.BatchNorm1d(512)  # needs (N, C) or (N, C, L)
ln = nn.LayerNorm(512)

# LayerNorm: works directly on (batch, seq, d_model)
out_ln = ln(x)  # shape unchanged

# BatchNorm: must transpose seq / feature dims
out_bn = bn(x.transpose(1, 2)).transpose(1, 2)

print("LN:", out_ln.shape)  # (4, 10, 512)
print("BN:", out_bn.shape)  # (4, 10, 512)
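
To see what Layer Norm computes for a single token, here is a minimal standard-library sketch. It assumes the $\sqrt{\sigma^2 + \epsilon}$ denominator and the biased variance that PyTorch's `nn.LayerNorm` uses; the function name `layer_norm` and the scalar `gamma`/`beta` defaults are simplifications for illustration (in a real layer they are learned per-feature vectors).

```python
import math

def layer_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalise one token's feature vector, given as a plain list of floats."""
    mu = sum(x) / len(x)
    var = sum((v - mu) ** 2 for v in x) / len(x)      # biased variance over features
    return [(v - mu) / math.sqrt(var + eps) * gamma + beta for v in x]

token = [1.0, 2.0, 3.0, 4.0]
normed = layer_norm(token)
print([round(v, 3) for v in normed])
print("mean:", round(sum(normed) / len(normed), 6))
```

The statistics come only from the token's own features, so the result is the same whether the token sits in a batch of 1 or 1,000.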

Q6
Explain the residual (skip) connections in the Transformer and their purpose.

Each sub-layer output is computed as $y = \mathrm{LayerNorm}(x + F(x))$, where $F(x)$ is either the
attention or FFN sub-layer. Residual connections:
• Alleviate the vanishing gradient problem in deep stacks (the gradient can flow directly
through the identity path).
• Allow the sub-layer to learn residuals (small refinements) rather than full transformations,
easing optimisation.
• Enable training very deep networks (the original paper uses 6 encoder and 6 decoder
layers; GPT-3 uses 96).

Code Python
import torch
import torch.nn as nn

class ResidualSubLayer(nn.Module):
    def __init__(self, sublayer, d_model=512):
        super().__init__()
        self.sublayer = sublayer
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x, **kwargs):
        # Post-Norm (original paper) would be:
        #   return self.norm(x + self.sublayer(x, **kwargs))
        # Pre-Norm variant (used in GPT-2 and later):
        return x + self.sublayer(self.norm(x), **kwargs)

Q7
What is the Feed-Forward Network (FFN) inside a Transformer block?

The position-wise FFN is applied identically and independently to each token position after
the attention sub-layer:

$$\mathrm{FFN}(x) = \max(0,\; xW_1 + b_1)\, W_2 + b_2$$

Typically $d_{ff} = 4 \times d_{\text{model}}$ (e.g. 2048 for $d_{\text{model}} = 512$). The expansion-then-contraction
structure lets the network project into a higher-dimensional space to perform non-linear
feature mixing before returning to the residual stream. The original paper uses ReLU, as in
the formula above; GELU is preferred in many later models (BERT, GPT).

Code Python
import torch.nn as nn
import torch.nn.functional as F

class PositionwiseFFN(nn.Module):
    def __init__(self, d_model=512, d_ff=2048, dropout=0.1):
        super().__init__()
        self.w1 = nn.Linear(d_model, d_ff)
        self.w2 = nn.Linear(d_ff, d_model)
        self.drop = nn.Dropout(dropout)

    def forward(self, x):
        # expand -> GELU -> dropout -> contract
        return self.w2(self.drop(F.gelu(self.w1(x))))
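
The 4× expansion makes the FFN the largest single component of a block. The short arithmetic sketch below counts its weights and biases for the sizes used above; `ffn_param_count` is a hypothetical helper name introduced only for this illustration.

```python
def ffn_param_count(d_model, d_ff):
    """Weights and biases of the two linear layers in a position-wise FFN."""
    first = d_model * d_ff + d_ff        # W1 and b1
    second = d_ff * d_model + d_model    # W2 and b2
    return first + second

print(ffn_param_count(512, 2048))  # 2099712
```

For $d_{\text{model}} = 512$ this is roughly twice the $4 \times 512^2 = 1{,}048{,}576$ weights in the attention projections of the same block.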

Q8
Why does self-attention have O(n2 ) complexity and what methods reduce it?

For a sequence of length $n$, every token attends to every other token, producing an $n \times n$
attention matrix. Computing and storing this matrix is $O(n^2)$ in both time and memory.
Efficient variants include:
• Sparse Attention (Longformer, BigBird): each token attends to a fixed local window plus
a few global tokens.
• Linformer: projects K and V along the sequence-length axis down to a fixed size $k$,
reducing complexity to $O(nk)$, i.e. linear in $n$.
• Flash Attention: IO-aware CUDA kernel; same $O(n^2)$ computation but far fewer
HBM reads/writes.
• Performer: approximates the softmax kernel with random feature maps.

Code Python
def attention_complexity(n, d):
    """Theoretical operation counts (one multiply-accumulate counted as one op)."""
    qk_matmul = n * n * d    # Q K^T -> (n, n)
    softmax_ops = n * n      # row softmax
    av_matmul = n * n * d    # Attn V -> (n, d)
    return qk_matmul + softmax_ops + av_matmul

for n in [128, 512, 1024, 4096]:
    print(f"n={n:5d} FLOPs ~ {attention_complexity(n, 64) / 1e6:.1f} M")
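
To make the saving from windowed sparse attention concrete, the sketch below counts how many (query, key) pairs are scored with full attention versus a local window. The function names and the window half-width `w = 256` are illustrative assumptions, and global tokens are ignored for simplicity.

```python
def attended_pairs_full(n):
    """Every query scores every key."""
    return n * n

def attended_pairs_window(n, w):
    """Each query scores only keys at most `w` positions away on either side."""
    return sum(min(n - 1, i + w) - max(0, i - w) + 1 for i in range(n))

n, w = 4096, 256
full = attended_pairs_full(n)
local = attended_pairs_window(n, w)
print(f"full: {full:,}  window: {local:,}  ratio: {full / local:.1f}x")
```

The windowed count grows as roughly $n(2w + 1)$, so doubling the sequence length doubles the work instead of quadrupling it.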

Q9
What applications leverage the Transformer architecture?

Transformers now underpin most state-of-the-art systems across modalities:

Domain       Task                                Model examples
NLP          Translation, QA, summarisation      BERT, GPT-4, T5
Vision       Image classification, detection     ViT, DINO, CLIP
Speech       ASR, TTS                            Whisper, SpeechT5
Biology      Protein folding                     AlphaFold2
Multimodal   Image + text                        GPT-4V, Flamingo

Q10
What is the difference between Pre-LN and Post-LN Transformer variants?

Post-LN (original paper): $y = \mathrm{LN}(x + \mathrm{Sublayer}(x))$. LayerNorm is applied after the
residual addition. Works well but requires careful warm-up scheduling due to unstable
gradients early in training.
Pre-LN (GPT-2, most modern LLMs): $y = x + \mathrm{Sublayer}(\mathrm{LN}(x))$. LayerNorm is applied
before the sublayer. Training is typically more stable and often needs little or no warm-up;
peak performance can be slightly lower, but Pre-LN is preferred in practice for large models.

Code Python
import torch.nn as nn

class PreLNBlock(nn.Module):
    def __init__(self, d=512, h=8, ff=2048):
        super().__init__()
        self.ln1 = nn.LayerNorm(d)
        self.ln2 = nn.LayerNorm(d)
        self.attn = nn.MultiheadAttention(d, h, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(d, ff), nn.GELU(), nn.Linear(ff, d))

    def forward(self, x):
        normed = self.ln1(x)  # normalise once, reuse as Q, K and V
        x2, _ = self.attn(normed, normed, normed)
        x = x + x2
        x = x + self.ffn(self.ln2(x))
        return x

Category 2: Self-Attention Mechanism (Q11–Q20)

The mathematical heart of the Transformer: Query, Key, Value formulation, scaled dot-
product attention, and its properties.

Q11
Explain the Query, Key, Value (Q, K, V) abstraction in self-attention.

Each token embedding is linearly projected into three distinct vectors via learned weight
matrices $W^Q, W^K, W^V \in \mathbb{R}^{d \times d_k}$:
• Query (Q): “What information am I looking for?”
• Key (K): “What information do I have to offer?”
• Value (V): “What content do I actually contribute?”
The similarity between a query and all keys (dot products) determines how much each value
is retrieved. In self-attention all three projections come from the same input sequence.

Code Python
import torch
import torch.nn as nn

d_model, d_k = 512, 64
Wq = nn.Linear(d_model, d_k, bias=False)
Wk = nn.Linear(d_model, d_k, bias=False)
Wv = nn.Linear(d_model, d_k, bias=False)

x = torch.rand(2, 10, d_model)  # (batch, seq, d_model)
Q = Wq(x)  # (2, 10, 64)
K = Wk(x)  # (2, 10, 64)
V = Wv(x)  # (2, 10, 64)
print(Q.shape, K.shape, V.shape)

Q12
Write out the full scaled dot-product attention formula and explain each term.

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right) V$$

1. $QK^\top$: dot products measure how similar each query is to each key ($n \times n$ score matrix).
2. $\div \sqrt{d_k}$: scaling factor. Without it, large $d_k$ causes dot products to grow in magnitude,
pushing softmax into regions with tiny gradients.
3. $\mathrm{softmax}(\cdot)$: converts scores to a probability distribution (attention weights) that sums
to 1 per query.
4. $(\cdots)\,V$: weighted sum of values; each query retrieves a blend of all values proportional
to attention weights.

Code Python
import math
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V, mask=None):
    # mask convention here: 1/True = keep, 0/False = masked out
    d_k = Q.size(-1)
    scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(d_k)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float('-inf'))
    weights = F.softmax(scores, dim=-1)
    return torch.matmul(weights, V), weights

Q = torch.rand(2, 10, 64)
K = torch.rand(2, 10, 64)
V = torch.rand(2, 10, 64)
out, w = scaled_dot_product_attention(Q, K, V)
print(out.shape, w.shape)  # (2, 10, 64) (2, 10, 10)
Q13
Why do we divide by $\sqrt{d_k}$ in the attention formula?

For queries and keys whose components are independent with zero mean and unit variance,
the dot product $q \cdot k = \sum_{i=1}^{d_k} q_i k_i$ has variance $d_k$. With large $d_k$ (e.g. 64),
dot products can be large in magnitude, pushing the softmax into saturation regions where
gradients approach zero. Dividing by $\sqrt{d_k}$ rescales the variance to 1, keeping gradients healthy.

Code Python
import math
import torch
import torch.nn.functional as F

d_k = 64
q = torch.randn(1, 1, d_k)
k = torch.randn(1, 10, d_k)

raw_score = torch.bmm(q, k.transpose(1, 2))
scaled_score = raw_score / math.sqrt(d_k)

print(f"Raw score std: {raw_score.std():.3f}")
print(f"Scaled score std: {scaled_score.std():.3f}")
print(f"Softmax entropy (raw): "
      f"{(-F.softmax(raw_score, -1) * F.log_softmax(raw_score, -1)).sum():.3f}")
print(f"Softmax entropy (scaled): "
      f"{(-F.softmax(scaled_score, -1) * F.log_softmax(scaled_score, -1)).sum():.3f}")
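
With only ten keys, the estimate above is noisy. A larger Monte Carlo run makes the "variance equals $d_k$" claim easier to see. The sketch below is a standard-library illustration under the same assumption of independent standard-normal components; the seed and trial count are arbitrary choices.

```python
import random
import statistics

random.seed(0)                       # fixed seed so the run is repeatable
d_k, trials = 64, 5000

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

raw = []
for _ in range(trials):
    q = [random.gauss(0, 1) for _ in range(d_k)]
    k = [random.gauss(0, 1) for _ in range(d_k)]
    raw.append(dot(q, k))

scaled = [s / d_k ** 0.5 for s in raw]
var_raw = statistics.pvariance(raw)        # expected to land near d_k
var_scaled = statistics.pvariance(scaled)  # expected to land near 1
print(f"var(q.k)             ~ {var_raw:.1f}")
print(f"var(q.k / sqrt(d_k)) ~ {var_scaled:.2f}")
```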

Q14
What is an attention mask and when is it used?

An attention mask prevents certain positions from contributing to attention. Two main
types:
• Padding mask: sets the scores of padded positions (added to make sequences the same
length in a batch) to $-\infty$ before softmax, so they receive zero weight.
• Causal (look-ahead) mask: used in the decoder to ensure position $i$ can only attend to
positions $\le i$ (prevents the decoder from “seeing the future” during training).

Code Python
import torch

def causal_mask(seq_len):
    """Boolean mask; True above the diagonal = masked out."""
    return torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool),
                      diagonal=1)

def padding_mask(lengths, max_len):
    """True where token is PAD; `lengths` is a 1-D tensor of sequence lengths."""
    return torch.arange(max_len).unsqueeze(0) >= lengths.unsqueeze(1)

mask = causal_mask(5)
print(mask.int())
# [[0, 1, 1, 1, 1],
#  [0, 0, 1, 1, 1], ...]
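
Why $-\infty$ rather than zero? The standard-library sketch below applies a mask to one row of scores and shows that masked positions end up with exactly zero weight while the rest still sum to 1. `masked_softmax` is a hypothetical helper written for this illustration; it assumes at least one position in the row is left unmasked.

```python
import math

def masked_softmax(scores, masked):
    """Softmax over one row of scores; positions where masked[j] is True get weight 0."""
    filled = [float("-inf") if m else s for s, m in zip(scores, masked)]
    top = max(filled)                            # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in filled]   # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]

scores = [0.5, 1.5, 2.5, 0.1]
# Query at position 1 under a causal mask: positions 2 and 3 lie in the future.
weights = masked_softmax(scores, [False, False, True, True])
print([round(w, 3) for w in weights])
```

Setting a score to 0 instead would still give that position a weight of $e^0 / \sum_j e^{s_j} > 0$, so information would leak from masked positions.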

Q15
How does self-attention capture long-range dependencies?

In self-attention, the path length between any two tokens is constant (O(1)) regardless
of their distance in the sequence. Every token computes an attention score with every
other token in a single layer. In contrast, an RNN must propagate information through
O(n) hidden-state transitions, causing gradients to vanish over long distances. This direct
any-to-any connectivity is why Transformers excel at coreference resolution, long-document
summarisation, and other tasks requiring distant context.

Q16
What is cross-attention (encoder-decoder attention)?

In the decoder, the cross-attention sub-layer uses:

• Q from the decoder’s previous sub-layer output.
• K and V from the encoder’s final output.
This allows every decoder position to attend over all encoder positions, effectively “reading”
the encoded source representation while generating each output token. It is the mechanism
that enables sequence-to-sequence tasks such as machine translation.

Code Python
import torch
import torch.nn as nn

class CrossAttention(nn.Module):
    def __init__(self, d_model=512, nhead=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, nhead, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, decoder_x, encoder_out):
        # Q from decoder, K & V from encoder
        out, _ = self.attn(query=decoder_x,
                           key=encoder_out,
                           value=encoder_out)
        return self.norm(decoder_x + out)

Q17
What are the advantages and limitations of self-attention?

Advantages                                                               Limitations
Captures long-range dependencies in $O(1)$ path length                   $O(n^2)$ time and memory w.r.t. sequence length
Fully parallelisable, with no sequential bottleneck                      Positional information must be injected explicitly
Better contextual understanding through direct token-to-token attention  Memory-intensive for very long sequences (documents, genomes)

Q18
Implement self-attention from scratch in NumPy/PyTorch.


Code Python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfAttention(nn.Module):
    def __init__(self, d_model=512, d_k=64):
        super().__init__()
        self.Wq = nn.Linear(d_model, d_k, bias=False)
        self.Wk = nn.Linear(d_model, d_k, bias=False)
        self.Wv = nn.Linear(d_model, d_k, bias=False)
        self.scale = math.sqrt(d_k)

    def forward(self, x, mask=None):
        # mask convention here: boolean, True = masked out
        Q, K, V = self.Wq(x), self.Wk(x), self.Wv(x)
        scores = torch.bmm(Q, K.transpose(1, 2)) / self.scale
        if mask is not None:
            scores = scores.masked_fill(mask, float('-inf'))
        weights = F.softmax(scores, dim=-1)
        return torch.bmm(weights, V), weights  # output, attn_map

sa = SelfAttention()
x = torch.rand(2, 10, 512)
out, w = sa(x)
print(out.shape, w.shape)  # (2, 10, 64) (2, 10, 10)

Q19
What is the role of Softmax in attention and can it be replaced?

Softmax normalises the raw attention scores into a valid probability distribution (non-negative,
sums to 1), ensuring the weighted sum of values remains well-scaled. However, it
forces every attention distribution to be dense (every position gets non-zero weight).
Alternatives investigated in the literature:
• Sparsemax: produces sparse distributions; many weights are exactly 0.
• α-entmax: a family controlled by a parameter α that interpolates between softmax (α = 1)
and sparsemax (α = 2).
• ReLU-based attention: used in some efficient Transformers to avoid the normalisation
bottleneck.
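
To illustrate how sparsemax can assign exactly zero weight, here is a minimal standard-library sketch of its usual sort-and-threshold formulation for a single score vector. It is an illustration only: the function operates on a plain Python list, and a practical attention layer would need a batched, differentiable tensor version.

```python
def sparsemax(z):
    """Euclidean projection of the score vector z onto the probability simplex."""
    z_sorted = sorted(z, reverse=True)
    cumsum, tau = 0.0, 0.0
    for j, zj in enumerate(z_sorted, start=1):
        cumsum += zj
        if 1 + j * zj > cumsum:          # the j-th largest score is still in the support
            tau = (cumsum - 1) / j       # threshold for the current support size
    return [max(v - tau, 0.0) for v in z]

print(sparsemax([2.0, 1.0, 0.1]))   # [1.0, 0.0, 0.0]
print(sparsemax([0.5, 0.4, 0.1]))   # scores this close together all keep some weight
```

With well-separated scores, all the mass goes to the winner and the other weights are exactly 0, whereas softmax would leave them small but non-zero.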

Q20
How do you visualise attention weights for interpretability?

The n × n attention weight matrix for each head can be plotted as a heatmap. High values
between positions (i, j) indicate that token i attends strongly to token j. Tools like BertViz
provide interactive head views for BERT-style models.


Code Python
import matplotlib.pyplot as plt
import seaborn as sns
import torch

tokens = ["The", "animal", "didn't", "cross", "because", "it", "tired"]
n = len(tokens)
# Synthetic attention weights (replace with real model output)
weights = torch.softmax(torch.randn(n, n), dim=-1).numpy()

fig, ax = plt.subplots(figsize=(6, 5))
sns.heatmap(weights, xticklabels=tokens, yticklabels=tokens,
            cmap='Blues', ax=ax)
ax.set_title("Self-Attention Weights")
plt.tight_layout()
plt.savefig("attn_heatmap.png", dpi=150)

Category 3: Multi-Head Attention (Q21–Q28)

Why a single attention head is insufficient, the mathematics of multiple parallel heads,
and implementation details.

Q21
What is Multi-Head Attention (MHA) and why is it better than single-head
attention?

Multi-Head Attention runs $h$ independent attention operations (“heads”) in parallel, each
with its own learned projection matrices:

$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\, W^O$$

$$\mathrm{head}_i = \mathrm{Attention}(QW_i^Q,\; KW_i^K,\; VW_i^V)$$

Each head can learn to attend to different aspects of the input (syntax, coreference, semantics,
positional patterns), whereas a single head is forced to average over all of these. Total
parameter count is similar to one large head because $d_k = d_{\text{model}}/h$.

Code Python
import torch
import torch.nn as nn

mha = nn.MultiheadAttention(embed_dim=512, num_heads=8,
                            batch_first=True)
x = torch.rand(2, 10, 512)
out, attn_weights = mha(x, x, x)  # self-attention
print("Output:", out.shape)  # (2, 10, 512)
print("Weights:", attn_weights.shape)  # (2, 10, 10), averaged over heads
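
The statement that the parameter count does not grow with the number of heads can be checked with simple arithmetic. The sketch below counts projection weights only (no biases); `mha_param_count` is a hypothetical helper name used for this illustration.

```python
def mha_param_count(d_model, h):
    """Weight count (no biases) for multi-head attention with d_k = d_model / h."""
    assert d_model % h == 0, "d_model must be divisible by the number of heads"
    d_k = d_model // h
    per_head_qkv = 3 * d_model * d_k      # W_i^Q, W_i^K, W_i^V for one head
    output_proj = (h * d_k) * d_model     # W^O maps the concatenation back to d_model
    return h * per_head_qkv + output_proj

for h in (1, 8):
    print(h, mha_param_count(512, h))
```

Both settings give $4 \times 512^2 = 1{,}048{,}576$ weights: splitting into more heads shrinks each head's projections by exactly the factor by which the number of heads grows.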

Q22
Walk through the step-by-step computation of Multi-Head Attention.

1. Linear projections: project input $X \in \mathbb{R}^{n \times d}$ to $Q_i, K_i, V_i \in \mathbb{R}^{n \times d_k}$ for each head $i$,
where $d_k = d/h$.
2. Scaled dot-product attention: compute $\mathrm{head}_i = \mathrm{Attention}(Q_i, K_i, V_i)$.
3. Concatenation: stack all head outputs along the feature dimension:
$[h_1; h_2; \ldots; h_h] \in \mathbb{R}^{n \times d}$.
4. Output projection: multiply by $W^O \in \mathbb{R}^{d \times d}$ to produce the final MHA output.

Code Python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadAttention(nn.Module):
    def __init__(self, d=512, h=8):
        super().__init__()
        self.h, self.dk = h, d // h
        self.Wq = nn.Linear(d, d, bias=False)
        self.Wk = nn.Linear(d, d, bias=False)
        self.Wv = nn.Linear(d, d, bias=False)
        self.Wo = nn.Linear(d, d, bias=False)

    def split_heads(self, x, B, T):
        # (B, T, d) -> (B, h, T, dk)
        return x.view(B, T, self.h, self.dk).transpose(1, 2)

    def forward(self, x):
        B, T, _ = x.shape
        Q = self.split_heads(self.Wq(x), B, T)  # (B, h, T, dk)
        K = self.split_heads(self.Wk(x), B, T)
        V = self.split_heads(self.Wv(x), B, T)
        scores = Q @ K.transpose(-2, -1) / math.sqrt(self.dk)
        weights = F.softmax(scores, dim=-1)
        heads = (weights @ V).transpose(1, 2).contiguous()
        heads = heads.view(B, T, -1)  # concat
        return self.Wo(heads)

Q23
How do you choose the number of attention heads h?

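The standard constraint is $d_k = d_{\text{model}}/h$, so $d_{\text{model}}$ must be divisible by $h$. Common choices:

Model              d_model   h    d_k   Layers
Transformer-base   512       8    64    6
BERT-base          768       12   64    12
GPT-3 175B         12288     96   128   96
ViT-B              768       12   64    12

More heads let the model attend to more distinct patterns. Because each head shrinks to
$d_k = d_{\text{model}}/h$, the total matrix-multiplication cost stays roughly constant, although
the number of $n \times n$ attention maps (and the memory to hold them) grows linearly with $h$.
Head pruning research shows that many heads learn redundant patterns.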

Q24
What does each attention head typically learn?

Probing studies (e.g. Clark et al., 2019 on BERT) reveal that individual heads tend to
specialise:
• Syntactic heads — attend to direct objects, subjects, or dependency arcs.
• Positional heads — attend to immediately adjacent tokens (local window).
• Coreference heads — link pronouns to their antecedents (“it” → “animal”).
• Rare token heads — pay disproportionate attention to infrequent, informative tokens.
This emergent specialisation is a key reason MHA outperforms single-head attention.


Q25
What is the output projection matrix W O in MHA and why is it needed?

After concatenating the $h$ head outputs (each $d_k$-dimensional), we obtain a vector of
dimension $h \times d_k = d_{\text{model}}$. The projection $W^O \in \mathbb{R}^{d \times d}$ mixes information across heads,
allowing the model to learn how different heads’ specialised representations should be combined.
Without $W^O$, each head’s output would be an independent, isolated view with no
cross-head interaction.

Q26
Explain Grouped Query Attention (GQA) used in LLaMA-2/3.

Grouped Query Attention (GQA) is a memory-efficient variant that shares K and V
projections across groups of query heads:
• MHA: $h$ Q heads, $h$ K heads, $h$ V heads.
• Multi-Query Attention (MQA): $h$ Q heads, 1 K/V head.
• GQA: $h$ Q heads, $g$ K/V heads ($1 < g < h$); each group of $h/g$ query heads shares one
K/V pair.
GQA reduces the KV-cache size during inference (critical for long-context generation) while
maintaining quality close to full MHA.

Code Python
# Conceptual GQA: 8 query heads, 2 KV groups
import torch
import torch.nn as nn

H_Q, H_KV, d = 8, 2, 512
dk = d // H_Q
Wq = nn.Linear(d, H_Q * dk, bias=False)
Wk = nn.Linear(d, H_KV * dk, bias=False)
Wv = nn.Linear(d, H_KV * dk, bias=False)

x = torch.rand(1, 10, d)
Q = Wq(x).view(1, 10, H_Q, dk).transpose(1, 2)   # (1, 8, 10, dk)
K = Wk(x).view(1, 10, H_KV, dk).transpose(1, 2)  # (1, 2, 10, dk)
V = Wv(x).view(1, 10, H_KV, dk).transpose(1, 2)  # (1, 2, 10, dk)

# Repeat K, V so each group of H_Q // H_KV query heads shares one K/V head
K = K.repeat_interleave(H_Q // H_KV, dim=1)  # (1, 8, 10, dk)
V = V.repeat_interleave(H_Q // H_KV, dim=1)  # (1, 8, 10, dk)
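
To make the KV-cache saving concrete, the following sketch counts cached elements for MHA, GQA, and MQA. The helper `kv_cache_elements` and the model sizes (32 layers, 32 query heads, head dimension 128, 4096 tokens, 8 KV heads for GQA) are illustrative assumptions, not figures from the text.

```python
def kv_cache_elements(n_layers, n_kv_heads, head_dim, seq_len, batch=1):
    """Number of cached scalars: one K and one V tensor per layer."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch

# Hypothetical model sizes, chosen only for illustration
n_layers, n_q_heads, head_dim, seq_len = 32, 32, 128, 4096

mha = kv_cache_elements(n_layers, n_q_heads, head_dim, seq_len)  # one K/V head per query head
gqa = kv_cache_elements(n_layers, 8, head_dim, seq_len)          # g = 8 K/V heads
mqa = kv_cache_elements(n_layers, 1, head_dim, seq_len)          # single K/V head

print(f"GQA cache is {mha // gqa}x smaller than MHA")
print(f"MQA cache is {mha // mqa}x smaller than MHA")
```

The cache shrinks by the factor h/g, which is why GQA sits between MHA (g = h) and MQA (g = 1).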

Q27
What is attention dropout and why is it used?

Attention dropout applies a dropout mask to the attention weight matrix after softmax and
before the weighted sum over values. This randomly zeroes out some attention connections
during training, preventing the model from over-relying on specific (query, key) pairs and
improving generalisation. A typical rate is 0.1 for base-sized models; very large models and
some fine-tuning setups often set it to 0.0.

Code Python
import math
import torch.nn.functional as F

def attention_with_dropout(Q, K, V, dropout=0.1, training=True):
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)
    weights = F.softmax(scores, dim=-1)
    # Dropout acts on the attention weights: after softmax, before the weighted sum
    weights = F.dropout(weights, p=dropout, training=training)
    return weights @ V

Q28
Compare MHA parameter counts: 1 head vs 8 heads for $d_{\text{model}} = 512$.

Both configurations have the same total parameter count (biases omitted):

• 1 head, $d_k = 512$: $4 \times 512^2 = 1{,}048{,}576$ params ($W^Q$, $W^K$, $W^V$, $W^O$).
• 8 heads, $d_k = 64$: $8 \times 3 \times 512 \times 64 + 512^2 = 786{,}432 + 262{,}144 = 1{,}048{,}576$ params.
The cost is the same, but 8 heads give richer subspace representations.

Code Python
import torch.nn as nn

def count_params(m):
    return sum(p.numel() for p in m.parameters())

d = 512
single = nn.MultiheadAttention(d, num_heads=1, bias=False, batch_first=True)
multi = nn.MultiheadAttention(d, num_heads=8, bias=False, batch_first=True)

print(f"1-head params: {count_params(single):,}")
print(f"8-heads params: {count_params(multi):,}")
# Both should print 1,048,576

[ENC] Category 4: Encoder–Decoder Architecture (Q29–Q36)

The full Transformer pipeline: encoder stack, decoder stack, masked attention, and output
generation.

Q29
What does the Encoder do in a Transformer and what is its output?

The encoder processes the entire source sequence in parallel. Each of its N identical blocks
applies multi-head self-attention followed by a position-wise FFN, with residual connections
and layer normalisation.
The output is a sequence of contextualised embeddings $Z \in \mathbb{R}^{n \times d}$, one vector per
source token, enriched with global context from all other tokens. This representation is
passed to every decoder layer via cross-attention.

Code Python
import torch
import torch.nn as nn

encoder_layer = nn.TransformerEncoderLayer(
    d_model=512, nhead=8, batch_first=True)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=6)

src = torch.rand(2, 15, 512)  # (batch, src_len, d)
# Optional: pass src_key_padding_mask (True where tokens are PAD) to ignore padding
memory = encoder(src)         # (2, 15, 512)
print("Memory shape:", memory.shape)

Q30
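Describe the three sub-layers inside a Transformer decoder block.

Each decoder block has three sub-layers (two attention sub-layers and one feed-forward
network), each wrapped in a residual connection and layer normalisation: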

1. Masked Multi-Head Self-Attention — the decoder attends to its own previous
outputs, but a causal mask prevents position t from seeing positions > t (autoregressive
property).
2. Encoder–Decoder Cross-Attention — Q comes from the decoder, K and V come
from the encoder memory Z. Enables the decoder to “look at” the full source.
3. Position-wise FFN — same as in the encoder.

Q31
Why is Masked Self-Attention used in the decoder?

During training, the entire target sequence is fed to the decoder simultaneously (teacher
forcing). Without masking, position t could attend to ground-truth tokens at positions
t + 1, t + 2, . . . — it would “cheat” by reading the answer. The causal mask ensures that the
prediction at position t depends only on positions ≤ t, replicating the left-to-right generation
process used at inference time.

Code Python
import torch
import torch.nn as nn

d, h, N = 512, 8, 6
dec_layer = nn.TransformerDecoderLayer(d_model=d, nhead=h,
                                       batch_first=True)
decoder = nn.TransformerDecoder(dec_layer, num_layers=N)

tgt = torch.rand(2, 10, d)
memory = torch.rand(2, 15, d)  # encoder output

# Causal mask: float matrix with -inf above the diagonal (masked), 0 elsewhere
tgt_mask = nn.Transformer.generate_square_subsequent_mask(10)

out = decoder(tgt, memory, tgt_mask=tgt_mask)
print(out.shape)  # (2, 10, 512)

Q32
How does the Transformer generate output sequences (inference)?

At inference the decoder generates tokens autoregressively:

1. Start with a [BOS] (beginning-of-sequence) token.
2. Feed encoder output (computed once) + current decoder sequence to the decoder.
3. Apply a linear projection + softmax to the last decoder hidden state to get a probability
distribution over the vocabulary.
4. Sample or take the argmax to obtain the next token.
5. Append the token and repeat until [EOS] or max length.
The KV-cache optimisation stores computed K/V pairs at each step to avoid recomputation.
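
The five steps above can be sketched as a greedy decoding loop. This is a toy, untrained setup: the vocabulary size, the token ids chosen for BOS/EOS, and the tiny `nn.Transformer` are assumptions for illustration, positional encodings are omitted, and no KV-cache is used, so the generated ids are arbitrary.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
vocab, d, BOS, EOS, max_len = 20, 32, 1, 2, 8   # illustrative sizes and token ids

embed = nn.Embedding(vocab, d)
model = nn.Transformer(d_model=d, nhead=4, num_encoder_layers=1,
                       num_decoder_layers=1, dim_feedforward=64,
                       batch_first=True).eval()
to_logits = nn.Linear(d, vocab)                  # output projection to the vocabulary

src = torch.randint(3, vocab, (1, 5))            # dummy source token ids
with torch.no_grad():
    memory = model.encoder(embed(src))           # step 2: encoder output, computed once
    ys = torch.tensor([[BOS]])                   # step 1: start from [BOS]
    for _ in range(max_len):
        t = ys.size(1)
        causal = torch.triu(torch.full((t, t), float("-inf")), diagonal=1)
        h = model.decoder(embed(ys), memory, tgt_mask=causal)
        # steps 3-4: project the last hidden state and take the argmax (greedy)
        next_id = to_logits(h[:, -1]).argmax(dim=-1, keepdim=True)
        ys = torch.cat([ys, next_id], dim=1)     # step 5: append and repeat
        if next_id.item() == EOS:
            break
print(ys)
```

Replacing the argmax with sampling from the softmax distribution gives stochastic decoding; the loop structure stays the same.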

Q33
What is the output layer of a Transformer and how are logits produced?

The final decoder hidden state $h_t \in \mathbb{R}^d$ is passed through:

$\text{logits} = h_t W_{\text{out}} + b, \quad W_{\text{out}} \in \mathbb{R}^{d \times |V|}$

followed by softmax to get a probability over vocabulary V. In many models (e.g. GPT-2)
$W_{\text{out}}$ is tied to the input embedding matrix to reduce parameters and improve generalisation.

Code Python
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab_size, d_model = 32000, 512

# Weight-tied output projection
embedding = nn.Embedding(vocab_size, d_model)
output_proj = lambda h: h @ embedding.weight.T  # tied

h = torch.rand(2, 10, d_model)   # decoder output
logits = output_proj(h)          # (2, 10, 32000)
probs = F.softmax(logits, dim=-1)
next_token = probs[:, -1, :].argmax(dim=-1)
print("Next token ids:", next_token)

Q34
Compare encoder-only, decoder-only, and encoder-decoder Transformer architectures.

Type           Examples                         Attention                Best For
Encoder-only   BERT, RoBERTa                    Bidirectional            Classification, NER, QA
Decoder-only   GPT-2/3/4, LLaMA                 Causal (masked)          Generation, LM
Enc-Dec        T5, BART, original Transformer   Bidirectional + cross    Seq2Seq, translation

Q35
What is the KV-Cache and how does it accelerate autoregressive decoding?

During greedy or beam decoding, the K and V projections of all previous tokens never change,
so recomputing them at every step is wasteful. The KV-cache stores the K and V tensors
from all past decoder steps; at step t only the new token’s K/V pair is computed and
appended. This cuts the per-step projection cost from $O(t \cdot d^2)$ (re-projecting every
token) to $O(d^2)$ (projecting only the new token); attending over the cached keys still
costs $O(t \cdot d)$. The cache needs $O(n \cdot L \cdot d)$ memory, where n is the sequence
length and L is the number of layers.

Code Python
# Simplified KV-cache concept (single layer, single head; needs PyTorch 2.0+)
import torch
import torch.nn as nn
import torch.nn.functional as F

d = 64
# Hypothetical projection layers standing in for one attention layer
q_proj, k_proj, v_proj = (nn.Linear(d, d, bias=False) for _ in range(3))

def decode_step(x_new, past_k, past_v):
    # x_new: (1, 1, d); past_k, past_v: cached (1, t, d) tensors, or None at step 0
    k_new = k_proj(x_new)
    v_new = v_proj(x_new)
    K_all = k_new if past_k is None else torch.cat([past_k, k_new], dim=1)
    V_all = v_new if past_v is None else torch.cat([past_v, v_new], dim=1)
    # Q from x_new only; K/V from the full history
    q = q_proj(x_new)
    out = F.scaled_dot_product_attention(q, K_all, V_all)
    return out, K_all, V_all

Q36
What is Beam Search and how is it used in Transformer generation?

Beam search is a heuristic decoding strategy that keeps the top-k (beam width) most
probable partial sequences at each step, expanding each candidate and retaining the k best,
until all beams end with [EOS] or a maximum length is reached. It approximates finding the
globally most probable sequence without an intractable exhaustive search.

Code Python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("t5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")

inputs = tokenizer("translate English to French: Hello world",
                   return_tensors="pt")
outputs = model.generate(**inputs, num_beams=4, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
# Prints a French translation, e.g. "Bonjour monde" (exact text may vary)
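
To show the mechanics without a pretrained model, here is a minimal beam search over a hand-written toy distribution. Everything in it is hypothetical: `beam_search`, `toy_log_probs`, and the four-token vocabulary are made up so that greedy decoding (beam width 1) misses the most probable sequence while beam width 2 finds it. Length normalisation, which practical implementations usually add, is left out.

```python
import math

NEG_INF = float("-inf")

def toy_log_probs(seq):
    """Toy next-token log-probs. Vocabulary: 0=BOS, 1=EOS, 2='a', 3='b'."""
    if len(seq) == 1:                       # after BOS: 'a' looks best locally
        probs = [0.0, 0.0, 0.6, 0.4]
    elif len(seq) == 2:                     # 'b' leads to a much more confident next step
        probs = [0.0, 0.0, 0.5, 0.5] if seq[-1] == 2 else [0.0, 0.0, 0.9, 0.1]
    else:                                   # every sequence then ends
        probs = [0.0, 1.0, 0.0, 0.0]
    return [math.log(p) if p > 0 else NEG_INF for p in probs]

def beam_search(step_log_probs, beam_width, max_len, bos=0, eos=1):
    beams, finished = [([bos], 0.0)], []
    for _ in range(max_len):
        # Expand every live beam by every vocabulary token
        candidates = [(seq + [tok], score + lp)
                      for seq, score in beams
                      for tok, lp in enumerate(step_log_probs(seq))]
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for seq, score in candidates[:beam_width]:   # keep the k best
            (finished if seq[-1] == eos else beams).append((seq, score))
        if not beams:                                # all kept beams ended with EOS
            break
    finished.extend(beams)                           # beams cut off at max_len
    return max(finished, key=lambda c: c[1])

greedy_seq, greedy_score = beam_search(toy_log_probs, beam_width=1, max_len=5)
beam_seq, beam_score = beam_search(toy_log_probs, beam_width=2, max_len=5)
print(greedy_seq, round(math.exp(greedy_score), 2))
print(beam_seq, round(math.exp(beam_score), 2))
```

In this toy, greedy decoding commits to 'a' (probability 0.6) and ends with total probability 0.30, whereas the width-2 beam keeps 'b' alive and reaches 0.4 × 0.9 = 0.36.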

[VIT] Category 5: Vision Transformer — ViT (Q37–Q43)

Applying Transformer architecture to image data: patch tokenisation, positional embeddings,
classification head, and design choices.

Q37
What is a Vision Transformer (ViT) and how does it process an image?

ViT (Dosovitskiy et al., 2020) applies the standard Transformer encoder directly to images
by converting them into a sequence of flat patch embeddings:
1. Patch splitting: divide the $H \times W \times C$ image into $N = HW/P^2$ non-overlapping
patches of size $P \times P$ (typically $P = 16$).
2. Linear projection: flatten each patch to a vector of size $P^2 C$ and project it to $d_{\text{model}}$.
3. CLS token: prepend a learnable [CLS] embedding; its final-layer output is used for
classification.
4. Positional embeddings: add learned 1-D positional embeddings.
5. Transformer encoder: process the $N + 1$ token sequence.

Code Python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    def __init__(self, img_size=224, patch_size=16,
                 in_channels=3, d_model=768):
        super().__init__()
        n_patches = (img_size // patch_size) ** 2
        # A strided conv is equivalent to flattening each patch and applying a linear layer
        self.proj = nn.Conv2d(in_channels, d_model,
                              kernel_size=patch_size,
                              stride=patch_size)
        self.cls_tok = nn.Parameter(torch.zeros(1, 1, d_model))
        self.pos_emb = nn.Parameter(
            torch.zeros(1, n_patches + 1, d_model))

    def forward(self, x):                      # x: (B, 3, 224, 224)
        B = x.size(0)
        x = self.proj(x)                       # (B, d, H/P, W/P)
        x = x.flatten(2).transpose(1, 2)       # (B, N, d)
        cls = self.cls_tok.expand(B, -1, -1)
        x = torch.cat([cls, x], dim=1)         # (B, N+1, d)
        return x + self.pos_emb

Q38
Why does ViT need positional embeddings for image patches?

Like text Transformers, ViT’s self-attention has no built-in notion of order (it is
permutation-equivariant): without position information, patch #5 and patch #37 would be
indistinguishable. Positional embeddings encode each patch’s spatial location so the model
can learn which patches are neighbours, far apart, or arranged in semantic regions.
ViT uses learnable 1-D positional embeddings (one per patch index). Surprisingly, 2-D
positional embeddings provide little benefit; the model learns to encode 2-D structure
internally. When fine-tuning or testing at a different resolution, the pre-trained positional
embeddings are resized by 2-D interpolation.
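
The resizing step can be sketched as follows. The function name `interpolate_pos_emb`, the bicubic mode, and the 224 → 384 resolution change are assumptions chosen for illustration; the CLS position is kept as-is and only the patch grid is resized.

```python
import torch
import torch.nn.functional as F

def interpolate_pos_emb(pos_emb, new_grid):
    """pos_emb: (1, 1 + g*g, d) with the CLS slot first -> (1, 1 + new_grid**2, d)."""
    cls_pos, patch_pos = pos_emb[:, :1], pos_emb[:, 1:]
    old_grid = int(patch_pos.size(1) ** 0.5)
    d = patch_pos.size(-1)
    # (1, g*g, d) -> (1, d, g, g) so interpolation acts on the 2-D patch grid
    grid = patch_pos.reshape(1, old_grid, old_grid, d).permute(0, 3, 1, 2)
    grid = F.interpolate(grid, size=(new_grid, new_grid),
                         mode="bicubic", align_corners=False)
    patch_pos = grid.permute(0, 2, 3, 1).reshape(1, new_grid * new_grid, d)
    return torch.cat([cls_pos, patch_pos], dim=1)

pos_224 = torch.randn(1, 1 + 14 * 14, 768)            # 224 / 16 = 14 patches per side
pos_384 = interpolate_pos_emb(pos_224, new_grid=384 // 16)
print(pos_384.shape)                                   # 1 + 24 * 24 = 577 positions
```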

Q39
What is the CLS token in ViT and how is it used for classification?

The CLS token is a learnable parameter prepended to the sequence of patch embeddings
before the Transformer encoder. Because all tokens (including CLS) attend to each other
through self-attention, by the final encoder layer the CLS token has aggregated information
from all patches. Its final hidden state is fed into a small MLP classification head:

$\hat{y} = \mathrm{MLP}(\mathrm{LN}(z_L^0))$

where $z_L^0$ is the CLS token output at layer L.

Code Python
import torch
import torch.nn as nn

class ViTClassifier(nn.Module):
    def __init__(self, d_model=768, num_classes=1000):
        super().__init__()
        enc_layer = nn.TransformerEncoderLayer(
            d_model, nhead=12, batch_first=True, norm_first=True)
        self.enc = nn.TransformerEncoder(enc_layer, 12)
        self.norm = nn.LayerNorm(d_model)
        self.head = nn.Linear(d_model, num_classes)

    def forward(self, tokens):          # (B, N+1, d)
        z = self.enc(tokens)            # (B, N+1, d)
        cls = self.norm(z[:, 0])        # CLS token
        return self.head(cls)           # (B, num_classes)

Q40
What are the advantages of ViT over CNNs?

• Global receptive field — every patch attends to every other patch from the first layer;
CNNs build global context only in deeper layers through stacking local convolutions.
• Scalability — ViT performance scales predictably with data and model size; larger
datasets yield consistently better models.
• Transfer across modalities — the same architecture works for text, images, audio,
video with minimal modification.
• Interpretability — attention maps provide intuitive visualisations of what the model
“looks at”.

Q41
What are the limitations of ViT compared to CNNs?

• Data hungry: lacks CNN’s inductive biases (translation equivariance, locality). Requires
large datasets (ImageNet-21k, JFT-300M) to outperform CNNs; on small datasets
CNNs often win.
• Quadratic complexity: $O(N^2)$ attention w.r.t. the number of patches; high-resolution
images produce many patches.
• High compute cost: pre-training from scratch requires significant GPU resources.
• Positional encoding sensitivity: variable-resolution fine-tuning requires positional
embedding interpolation.
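
A quick calculation illustrates the quadratic growth. The helper `vit_attention_cost` and the chosen resolutions are illustrative assumptions; the count is the number of entries in one attention matrix for one head, with the CLS token included.

```python
def vit_attention_cost(img_size, patch_size=16):
    """Return (number of tokens, entries in one N x N attention matrix)."""
    n_tokens = (img_size // patch_size) ** 2 + 1   # patches + CLS
    return n_tokens, n_tokens * n_tokens

for res in (224, 384, 1024):
    n_tokens, entries = vit_attention_cost(res)
    print(f"{res}x{res}: {n_tokens} tokens, {entries:,} attention entries")
```

Going from 224 to 1024 pixels per side multiplies the token count by roughly 21 and the attention matrix by roughly 430.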

Q42
How does a Swin Transformer address ViT’s scalability issues?

The Swin Transformer (Liu et al., 2021) introduces two key ideas:
1. Shifted Window Attention: attention is computed within local non-overlapping
windows (not globally), reducing complexity from $O(N^2)$ to $O(N)$ for a fixed window
size. Shifted windows connect adjacent windows across layers.
2. Hierarchical feature maps: patches are merged as depth increases, producing multi-scale
features similar to CNN feature pyramids, which is essential for detection and
segmentation.
At publication, Swin reported state-of-the-art results on COCO detection and ADE20K
segmentation alongside strong ImageNet accuracy, and it is the backbone of many
detection/segmentation frameworks (e.g. Mask R-CNN + Swin).
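
The window mechanism can be sketched with plain tensor reshapes. This is a simplified illustration, not the reference implementation: the 56 × 56 × 96 feature map and window size 7 are assumed example sizes, and the attention mask that the real shifted-window layer applies to wrapped-around regions is omitted.

```python
import torch

def window_partition(x, ws):
    """x: (B, H, W, C) -> (B * num_windows, ws * ws, C)."""
    B, H, W, C = x.shape
    x = x.view(B, H // ws, ws, W // ws, ws, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, ws * ws, C)

ws = 7
x = torch.rand(2, 56, 56, 96)                  # assumed feature-map size
windows = window_partition(x, ws)              # attention runs inside each window
# Cyclic shift by ws // 2 so the next layer's windows straddle the old boundaries
shifted = torch.roll(x, shifts=(-(ws // 2), -(ws // 2)), dims=(1, 2))
shifted_windows = window_partition(shifted, ws)

n_tokens = 56 * 56
global_pairs = n_tokens ** 2                           # full attention: N^2
window_pairs = (n_tokens // ws ** 2) * (ws ** 2) ** 2  # windows * (ws^2)^2, linear in N
print(windows.shape, global_pairs // window_pairs)
```

Because the per-window cost is fixed by the window size, the total grows with the number of windows, which is linear in the number of patches.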

Q43
Implement a minimal ViT forward pass in PyTorch.

Code Python
import torch
import torch.nn as nn

class MiniViT(nn.Module):
    def __init__(self, img=224, patch=16, c=3,
                 d=768, heads=12, layers=12, classes=1000):
        super().__init__()
        n = (img // patch) ** 2
        self.patch_emb = nn.Conv2d(c, d, patch, stride=patch)
        self.cls = nn.Parameter(torch.zeros(1, 1, d))
        self.pos = nn.Parameter(torch.zeros(1, n + 1, d))
        enc = nn.TransformerEncoderLayer(
            d, heads, dim_feedforward=d * 4,
            batch_first=True, norm_first=True)
        self.enc = nn.TransformerEncoder(enc, layers)
        self.norm = nn.LayerNorm(d)
        self.head = nn.Linear(d, classes)

    def forward(self, x):
        B = x.shape[0]
        x = self.patch_emb(x).flatten(2).transpose(1, 2)   # (B, N, d)
        cls = self.cls.expand(B, -1, -1)
        x = torch.cat([cls, x], 1) + self.pos              # (B, N+1, d)
        z = self.enc(x)
        return self.head(self.norm(z[:, 0]))               # classify from CLS

model = MiniViT()
dummy = torch.rand(2, 3, 224, 224)
print(model(dummy).shape)  # (2, 1000)

[HYB] Category 6: CNN vs. ViT & Hybrid Models (Q44–Q50)

Head-to-head comparison of CNNs and ViTs; hybrid architectures that blend both worlds;
practical guidance on model selection.

Q44
What is inductive bias and why do CNNs have more of it than ViTs?

Inductive bias is a set of assumptions built into the architecture that constrain the
hypothesis space, helping generalisation when data is limited.
CNN inductive biases:
• Translation equivariance — the same filter is applied across all spatial locations; a
feature detected in the top-left is detected everywhere.
• Locality — each neuron looks at a local receptive field, reflecting the assumption that
nearby pixels are most correlated.
ViT has neither by default: it treats patches as unordered tokens and must learn spatial
structure from data alone, requiring more examples to compensate.
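
Translation equivariance can be checked numerically. In the sketch below, circular padding is an assumption made so that the property holds exactly at the image borders; with zero padding it holds only away from the edges. Shifting the input and then convolving should match convolving and then shifting the output.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
# Circular padding (an illustrative choice) makes the shift wrap around cleanly
conv = nn.Conv2d(1, 4, kernel_size=3, padding=1, padding_mode="circular", bias=False)

x = torch.rand(1, 1, 16, 16)
shift = (3, 5)                                   # arbitrary shift in rows and columns
with torch.no_grad():
    shift_then_conv = conv(torch.roll(x, shifts=shift, dims=(2, 3)))
    conv_then_shift = torch.roll(conv(x), shifts=shift, dims=(2, 3))

equivariant = torch.allclose(shift_then_conv, conv_then_shift, atol=1e-5)
print("conv commutes with translation:", equivariant)
```

A ViT gets no such guarantee from its architecture: moving an object changes which patch tokens and positional embeddings it falls into, so the behaviour has to be learned from data.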

Q45
When should you choose a CNN over ViT and vice versa?

Scenario          Prefer CNN                     Prefer ViT
Dataset size      Small/medium (<100k samples)   Large (>1M samples)
Compute budget    Constrained                    High
Task              Dense pred. (seg., det.)       Classification, retrieval
Domain shift      Low                            High (with pre-training)
Edge deployment   Efficient CNNs (MobileNet)     ViT-Tiny
In practice: use a pre-trained ViT when data is sufficient, otherwise fine-tune a CNN
backbone with ImageNet pre-training.

Q46
What are Hybrid CNN-Transformer models? Give two examples.

Hybrid models combine CNN’s local feature extraction with Transformer’s global context
modelling:
• CNN + Transformer (sequential) — a CNN extracts feature maps, which are then
flattened and fed as tokens to a Transformer encoder. Examples: ViT with ResNet
stem, TransUNet (medical image segmentation).
• ConvNeXt — a pure CNN modernised with Transformer design choices (large kernels,
LayerNorm, GELU, inverted bottleneck).
• Swin Transformer — uses shifted-window attention inside a CNN-like hierarchical
pyramid.

Code Python
import torch
import torch.nn as nn
import torchvision.models as tv

class HybridViT(nn.Module):
    def __init__(self, num_classes=1000):
        super().__init__()
        # CNN stem: ResNet50 minus avgpool & fc
        resnet = tv.resnet50(weights='IMAGENET1K_V1')
        self.cnn_stem = nn.Sequential(*list(resnet.children())[:-2])
        # Transformer on CNN feature map tokens
        enc = nn.TransformerEncoderLayer(
            2048, 8, dim_feedforward=4096, batch_first=True)
        self.tfm = nn.TransformerEncoder(enc, 2)
        self.pool = nn.AdaptiveAvgPool1d(1)
        self.fc = nn.Linear(2048, num_classes)

    def forward(self, x):
        f = self.cnn_stem(x)                        # (B, 2048, 7, 7)
        tok = f.flatten(2).transpose(1, 2)          # (B, 49, 2048)
        z = self.tfm(tok)                           # (B, 49, 2048)
        z = self.pool(z.transpose(1, 2))[:, :, 0]   # (B, 2048), mean over tokens
        return self.fc(z)

Q47
What is DINO and how does it enable self-supervised ViT training?

DINO (Self-DIstillation with NO labels) is a self-supervised learning framework for ViT
(Caron et al., 2021). It trains a student ViT to match the output distribution of a teacher
ViT (updated via exponential moving average) using different augmented views of the same
image (multi-crop strategy).
Key properties:
• The CLS token learns semantically meaningful representations without labels.
• Attention maps produce surprisingly clear object segmentation (“emergent segmentation”).
• DINOv2 scales this to produce powerful general-purpose visual features.
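
One training step of the student/teacher scheme might look like the sketch below. It is heavily simplified: small MLPs stand in for the ViT backbones, random tensors stand in for two augmented views, and the temperatures, momentum, and centre update rate are illustrative values rather than a prescribed recipe.

```python
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
# Stand-ins for a ViT backbone + projection head (assumption for illustration)
student = nn.Sequential(nn.Linear(32, 64), nn.GELU(), nn.Linear(64, 16))
teacher = copy.deepcopy(student)
for p in teacher.parameters():
    p.requires_grad = False                     # teacher gets no gradients

center = torch.zeros(16)
t_student, t_teacher, momentum = 0.1, 0.04, 0.996   # illustrative hyperparameters

def dino_loss(student_out, teacher_out):
    # Teacher distribution is centred and sharpened; student matches it via cross-entropy
    t = F.softmax((teacher_out - center) / t_teacher, dim=-1)
    s = F.log_softmax(student_out / t_student, dim=-1)
    return -(t * s).sum(dim=-1).mean()

opt = torch.optim.SGD(student.parameters(), lr=0.1)
view1, view2 = torch.rand(8, 32), torch.rand(8, 32)  # stand-ins for two augmented views

with torch.no_grad():
    t1, t2 = teacher(view1), teacher(view2)
# Each student view is trained to match the teacher's output on the other view
loss = dino_loss(student(view1), t2) + dino_loss(student(view2), t1)
opt.zero_grad()
loss.backward()
opt.step()

with torch.no_grad():
    # Teacher weights follow the student via an exponential moving average
    for ps, pt in zip(student.parameters(), teacher.parameters()):
        pt.mul_(momentum).add_(ps, alpha=1 - momentum)
    # Running centre of teacher outputs, used to discourage collapse
    center.mul_(0.9).add_(0.1 * torch.cat([t1, t2]).mean(dim=0))
print(f"loss: {loss.item():.4f}")
```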

Q48
Compare BERT (encoder-only) vs GPT (decoder-only) architectures.

Property       BERT                                  GPT
Architecture   Encoder-only                          Decoder-only
Attention      Bidirectional (all tokens see all)    Causal (only left context)
Pre-training   Masked LM + NSP                       Autoregressive LM
Strengths      Understanding (NLU)                   Generation (NLG)
Fine-tuning    Add classification head               Prompt/instruction tuning
Examples       BERT, RoBERTa, ALBERT                 GPT-2/3/4, LLaMA, Mistral

Q49
What is Flash Attention and how does it speed up Transformer training?

Flash Attention (Dao et al., 2022) is an IO-aware exact attention algorithm that avoids
materialising the full $n \times n$ attention matrix in high-bandwidth memory (HBM):
• Tiles Q, K, V into blocks that fit in SRAM.
• Fuses the softmax + matmul operations into a single GPU kernel.
• Achieves the same mathematical output as standard attention but uses $O(n)$ memory
(no $n^2$ matrix stored).
• Provides a 2–4× wall-clock speedup and enables longer context windows (32k, 128k
tokens).

Code Python
# PyTorch 2.0+ scaled_dot_product_attention
# automatically dispatches to Flash Attention when available
import torch
import torch.nn.functional as F

Q = torch.rand(2, 8, 1024, 64, device='cuda', dtype=torch.float16)
K = torch.rand(2, 8, 1024, 64, device='cuda', dtype=torch.float16)
V = torch.rand(2, 8, 1024, 64, device='cuda', dtype=torch.float16)

# Restrict dispatch to the Flash kernel (requires a supported CUDA GPU).
# Newer PyTorch releases deprecate this context manager in favour of
# torch.nn.attention.sdpa_kernel.
with torch.backends.cuda.sdp_kernel(
        enable_flash=True, enable_math=False,
        enable_mem_efficient=False):
    out = F.scaled_dot_product_attention(Q, K, V, is_causal=True)
print(out.shape)  # (2, 8, 1024, 64)

Q50
What is LoRA (Low-Rank Adaptation) and how is it used for efficient fine-tuning of large
Transformers?

LoRA (Hu et al., 2021) fine-tunes large pre-trained Transformers by injecting trainable
low-rank decompositions into the weight matrices instead of updating all parameters:

$W' = W + \Delta W = W + AB, \quad A \in \mathbb{R}^{d \times r},\; B \in \mathbb{R}^{r \times k},\; r \ll \min(d, k)$

W is frozen; only A and B are trained. With r = 8, a 768 × 768 matrix goes from about 590k
to about 12k trainable parameters (768 × 8 + 8 × 768 = 12,288, a 98% reduction), while
matching or approaching full fine-tuning quality.

Code Python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, in_f, out_f, rank=8, alpha=16):
        super().__init__()
        self.linear = nn.Linear(in_f, out_f, bias=False)
        self.linear.weight.requires_grad = False  # frozen
        # Shapes follow nn.Linear's (out, in) layout, so the update is B @ A.
        # B starts at zero, making the adapter a no-op at initialisation.
        self.A = nn.Parameter(torch.randn(rank, in_f) * 0.01)
        self.B = nn.Parameter(torch.zeros(out_f, rank))
        self.scale = alpha / rank

    def forward(self, x):
        base = self.linear(x)
        lora = (x @ self.A.T) @ self.B.T
        return base + self.scale * lora

layer = LoRALinear(768, 768, rank=8)
x = torch.rand(2, 10, 768)
print(layer(x).shape)  # (2, 10, 768)
trainable = sum(p.numel() for p in layer.parameters()
                if p.requires_grad)
print(f"Trainable params: {trainable:,}")  # 12,288
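
Because the adapted weight is a single matrix, the low-rank update can be folded into the frozen weight after training, so inference costs the same as the base layer. The sketch below checks this numerically with raw tensors; the sizes, the random "trained" values of A and B, and the tolerance are illustrative assumptions.

```python
import torch

torch.manual_seed(0)
d_in, d_out, r, alpha = 768, 768, 8, 16
scale = alpha / r

W = torch.randn(d_out, d_in)         # frozen base weight, nn.Linear layout (out, in)
A = torch.randn(r, d_in) * 0.01      # low-rank factors, non-zero as if after training
B = torch.randn(d_out, r) * 0.01

x = torch.rand(2, 10, d_in)
# Adapter form: base path plus scaled low-rank path
adapter_out = x @ W.T + scale * (x @ A.T) @ B.T
# Merged form: fold the update into one weight matrix
W_merged = W + scale * (B @ A)
merged_out = x @ W_merged.T

same = torch.allclose(adapter_out, merged_out, atol=1e-3)
print("merged weights reproduce adapter output:", same)
print("trainable LoRA params:", A.numel() + B.numel())
```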

[REF] Quick-Reference: Key Formulas & Concepts

Concept                Formula / Key Fact
Scaled Dot-Product     $\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^\top}{\sqrt{d_k}}\right) V$
Multi-Head             $\text{MH} = \text{Concat}(\text{head}_1, \ldots)\, W^O$
Positional Encoding    $PE_{(p, 2i)} = \sin\left(p / 10000^{2i/d}\right)$
FFN                    $\text{FFN}(x) = \max(0, xW_1 + b_1) W_2 + b_2$
Residual               $y = \text{LN}(x + \text{Sublayer}(x))$
ViT patches            $N = HW/P^2$, typically $P = 16$
Self-attn complexity   $O(n^2 d)$ time, $O(n^2)$ memory
LoRA update            $W' = W + AB$, $r \ll d$
BERT                   Encoder-only, bidirectional, MLM
GPT                    Decoder-only, causal LM, generation
T5                     Encoder-decoder, text-to-text
ViT-B/16               d = 768, 12 heads, 12 layers, 86M params

» Lamhot Siagian
AI Engineering Insider
Advanced Deep Learning — From Words to Witness:
The Rise of Transformers & Vision Transformers (ViT)

50 Questions · 6 Categories · Python Code Snippets · Product Company Ready
