Deep Learning Interview Prep - Transformers & ViT
Deep Learning
Advanced
Transformers and Attention Mechanisms
Self-Attention | Multi-Head Attention | Encoder-Decoder | Positional Encoding
LAMHOT SIAGIAN
AI Engineering Insider
2026 Edition • From Words to Witness: The Rise of Transformers & ViT
Contents
Category 1: Transformer Foundations
Q1: What is a Transformer and why was it introduced to replace RNN/LSTM?
Q2: List and explain the four core components of a Transformer block.
Q3: What is Positional Encoding and why is it necessary?
Q4: How does parallel processing in Transformers improve training compared to RNNs?
Q5: What is Layer Normalisation and how does it differ from Batch Normalisation in Transformers?
Q6: Explain the residual (skip) connections in the Transformer and their purpose.
Q7: What is the Feed-Forward Network (FFN) inside a Transformer block?
Q8: Why does self-attention have O(n²) complexity and what methods reduce it?
Q9: What applications leverage the Transformer architecture?
Q10: What is the difference between Pre-LN and Post-LN Transformer variants?
Deep Learning Interview Prep Lamhot Siagian »
Q1
What is a Transformer and why was it introduced to replace RNN/LSTM?
Code Python
import torch
import torch.nn as nn

# Minimal Transformer encoder usage
encoder_layer = nn.TransformerEncoderLayer(
    d_model=512, nhead=8, dim_feedforward=2048,
    dropout=0.1, batch_first=True
)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=6)

# src: (batch, seq_len, d_model)
src = torch.rand(2, 10, 512)
out = encoder(src)  # (2, 10, 512)
print("Encoder output shape:", out.shape)
Q2
List and explain the four core components of a Transformer block.
Code Python
import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    """Post-LN block: self-attention and FFN, each wrapped in residual + LayerNorm."""
    def __init__(self, d_model=512, nhead=8, ffn_dim=2048):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, nhead, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, ffn_dim), nn.ReLU(),
            nn.Linear(ffn_dim, d_model)
        )
        self.ln1 = nn.LayerNorm(d_model)
        self.ln2 = nn.LayerNorm(d_model)

    def forward(self, x):
        # Self-attention + residual, then LayerNorm
        attn_out, _ = self.attn(x, x, x)
        x = self.ln1(x + attn_out)
        # FFN + residual, then LayerNorm
        x = self.ln2(x + self.ffn(x))
        return x
Q3
What is Positional Encoding and why is it necessary?
Code Python
import math
import torch

def positional_encoding(seq_len, d_model):
    """Sinusoidal positional encoding; assumes d_model is even."""
    pe = torch.zeros(seq_len, d_model)
    pos = torch.arange(0, seq_len).unsqueeze(1).float()
    div = torch.exp(
        torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model)
    )
    pe[:, 0::2] = torch.sin(pos * div)  # even dims
    pe[:, 1::2] = torch.cos(pos * div)  # odd dims
    return pe  # (seq_len, d_model)

pe = positional_encoding(50, 512)
print(pe.shape)  # torch.Size([50, 512])
Q4
How does parallel processing in Transformers improve training compared to
RNNs?
In an RNN the hidden state at step t depends on step t−1, creating a strict sequential
dependency that prevents parallelism across time steps during training. This means:
• GPU utilisation is poor for long sequences.
• The number of sequential operations grows linearly with sequence length.
Transformers compute attention for all positions simultaneously via matrix multiplications,
allowing the GPU/TPU to process the entire sequence in a single forward pass. Total work
per layer is O(n² · d) (quadratic in sequence length), but it needs only O(1) sequential
steps per layer, versus the O(n) serial steps of an RNN.
Code Python
import time
import torch

# Rough CPU timing; on a GPU, call torch.cuda.synchronize() before reading the clock.
batch = torch.rand(8, 512, 512)  # (batch, seq, feat)

# RNN: sequential hidden-state updates
rnn = torch.nn.RNN(512, 512, batch_first=True)

with torch.no_grad():
    t0 = time.perf_counter()
    for _ in range(50):
        rnn(batch)
    print(f"RNN 50 passes: {time.perf_counter() - t0:.3f} s")

# Transformer: all positions at once
enc = torch.nn.TransformerEncoderLayer(
    d_model=512, nhead=8, batch_first=True)
enc.eval()  # disable dropout so both models run in inference mode

with torch.no_grad():
    t0 = time.perf_counter()
    for _ in range(50):
        enc(batch)
    print(f"TFM 50 passes: {time.perf_counter() - t0:.3f} s")
Q5
What is Layer Normalisation and how does it differ from Batch Normalisation
in Transformers?
Batch Norm normalises across the batch dimension for each feature. This is problematic
for variable-length sequences and small batches common in NLP.
Layer Norm normalises across the feature dimension for each single sample independently:
\[ \hat{x}_i = \frac{x_i - \mu_i}{\sigma_i + \epsilon} \cdot \gamma + \beta \]
where µ_i and σ_i are computed over the d_model features of sample i. This makes it batch-size
independent and well-suited for sequence models.
Code Python
import torch
import torch.nn as nn

x = torch.rand(4, 10, 512)  # (batch, seq, d_model)

bn = nn.BatchNorm1d(512)    # needs (N, C) or (N, C, L)
ln = nn.LayerNorm(512)

# LayerNorm: works directly on (batch, seq, d_model)
out_ln = ln(x)              # shape unchanged

# BatchNorm: must transpose seq / feature dims
out_bn = bn(x.transpose(1, 2)).transpose(1, 2)

print("LN:", out_ln.shape)  # (4, 10, 512)
print("BN:", out_bn.shape)  # (4, 10, 512)
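As a quick check of the Layer Norm formula, the sketch below recomputes the normalisation by hand and compares it with `nn.LayerNorm`. One caveat that comes from PyTorch rather than from the text above: PyTorch places ε inside the square root, as sqrt(σ² + ε), and uses the biased variance, so the manual version follows that convention.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.rand(4, 10, 512)           # (batch, seq, d_model)
ln = nn.LayerNorm(512)               # gamma = 1, beta = 0 at initialisation

# Statistics are taken over the last (feature) dimension of each token
mu = x.mean(dim=-1, keepdim=True)
var = x.var(dim=-1, unbiased=False, keepdim=True)
manual = (x - mu) / torch.sqrt(var + ln.eps) * ln.weight + ln.bias

max_diff = (manual - ln(x)).abs().max().item()
print("max abs difference:", max_diff)
```

The statistics never touch the batch dimension, which is why the result is independent of batch size.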
Q6
Explain the residual (skip) connections in the Transformer and their purpose.
Each sub-layer output is computed as y = LayerNorm(x + F(x)), where F(x) is either the
attention or FFN sub-layer. Residual connections:
• Alleviate the vanishing gradient problem in deep stacks (the gradient can flow directly
through the identity path).
• Allow the sub-layer to learn residuals (small refinements) rather than full
transformations, easing optimisation.
• Enable training very deep networks (the original paper uses 6 encoder and 6 decoder
layers; GPT-3 uses 96).
Code Python
import torch
import torch.nn as nn

class ResidualSubLayer(nn.Module):
    def __init__(self, sublayer, d_model=512):
        super().__init__()
        self.sublayer = sublayer
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x, **kwargs):
        # Pre-Norm variant (used in GPT-2+):
        return x + self.sublayer(self.norm(x), **kwargs)
        # Post-Norm (original paper):
        # return self.norm(x + self.sublayer(x, **kwargs))
Q7
What is the Feed-Forward Network (FFN) inside a Transformer block?
The position-wise FFN is applied identically and independently to each token position after
the attention sub-layer:
\[ \mathrm{FFN}(x) = \max(0,\; x W_1 + b_1)\, W_2 + b_2 \]
(ReLU form, as in the original Transformer paper). Typically d_ff = 4 × d_model (e.g. 2048
for d_model = 512). The expansion-then-contraction structure lets the network project into a
higher-dimensional space to perform non-linear feature mixing before returning to the
residual stream. GELU activation is preferred in modern models (BERT, GPT).
Code Python
import torch.nn as nn
import torch.nn.functional as F

class PositionwiseFFN(nn.Module):
    def __init__(self, d_model=512, d_ff=2048, dropout=0.1):
        super().__init__()
        self.w1 = nn.Linear(d_model, d_ff)   # expand
        self.w2 = nn.Linear(d_ff, d_model)   # contract
        self.drop = nn.Dropout(dropout)

    def forward(self, x):
        return self.w2(self.drop(F.gelu(self.w1(x))))
Q8
Why does self-attention have O(n²) complexity and what methods reduce it?
For a sequence of length n, every token attends to every other token, producing an n × n
attention matrix. Computing and storing this matrix is O(n²) in both time and memory.
Efficient variants include:
• Sparse Attention (Longformer, BigBird): each token attends to a fixed local window plus
a few global tokens.
• Linformer: projects K and V along the sequence dimension from length n down to a fixed
k, reducing complexity to O(n · k), which is linear in n.
• Flash Attention: IO-aware CUDA kernel; same O(n²) computation but far fewer HBM
reads/writes, because the full n × n matrix is never materialised in HBM.
• Performer: approximates the softmax kernel with random feature maps.
Code Python
def attention_complexity(n, d):
    """Approximate multiply-accumulate counts for one attention head."""
    qk_matmul = n * n * d    # Q K^T -> (n, n)
    softmax_ops = n * n      # row-wise softmax
    av_matmul = n * n * d    # Attn V -> (n, d)
    return qk_matmul + softmax_ops + av_matmul  # = 2*d*n^2 + n^2

for n in [128, 512, 1024, 4096]:
    print(f"n={n:5d}  FLOPs ~ {attention_complexity(n, 64) / 1e6:.1f} M")
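To make the sparse-attention idea concrete, here is a minimal sketch of a sliding-window pattern of the kind used by local-window attention. The helper name `sliding_window_mask` and the window half-width `w` are illustrative assumptions, not part of any library. It counts how many (query, key) pairs survive compared with dense attention.

```python
import torch

def sliding_window_mask(n, w):
    """Boolean (n, n) mask: True where |i - j| <= w (attention allowed)."""
    idx = torch.arange(n)
    return (idx[:, None] - idx[None, :]).abs() <= w

n, w = 1024, 64
mask = sliding_window_mask(n, w)
allowed = int(mask.sum())
print(f"dense pairs: {n * n:,}  windowed pairs: {allowed:,}")
```

The number of allowed pairs grows as roughly n · (2w + 1), so it is linear in n for a fixed window. Note that this sketch still builds a dense n × n mask for clarity; practical implementations use banded kernels so that the quadratic matrix is never stored.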
Q9
What applications leverage the Transformer architecture?
Q10
What is the difference between Pre-LN and Post-LN Transformer variants?
Code Python
import torch.nn as nn

class PreLNBlock(nn.Module):
    def __init__(self, d=512, h=8, ff=2048):
        super().__init__()
        self.ln1 = nn.LayerNorm(d)
        self.ln2 = nn.LayerNorm(d)
        self.attn = nn.MultiheadAttention(d, h, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(d, ff), nn.GELU(), nn.Linear(ff, d))

    def forward(self, x):
        # Normalise once, before the sub-layer; the residual path stays un-normalised
        h = self.ln1(x)
        x2, _ = self.attn(h, h, h)
        x = x + x2
        x = x + self.ffn(self.ln2(x))
        return x
Q11
Explain the Query, Key, Value (Q, K, V) abstraction in self-attention.
Each token embedding is linearly projected into three distinct vectors via learned weight
matrices W^Q, W^K, W^V ∈ R^{d × d_k}:
• Query (Q) — “What information am I looking for?”
• Key (K) — “What information do I have to offer?”
• Value (V) — “What content do I actually contribute?”
The similarity between a query and all keys (dot products) determines how much each value
is retrieved. In self-attention all three projections come from the same input sequence.
Code Python
import torch
import torch.nn as nn

d_model, d_k = 512, 64
Wq = nn.Linear(d_model, d_k, bias=False)
Wk = nn.Linear(d_model, d_k, bias=False)
Wv = nn.Linear(d_model, d_k, bias=False)

x = torch.rand(2, 10, d_model)  # (batch, seq, d_model)
Q = Wq(x)  # (2, 10, 64)
K = Wk(x)  # (2, 10, 64)
V = Wv(x)  # (2, 10, 64)
print(Q.shape, K.shape, V.shape)
Q12
Write out the full scaled dot-product attention formula and explain each term.
\[ \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^\top}{\sqrt{d_k}}\right) V \]
1. QK^⊤: dot products measure how similar each query is to each key (n × n score matrix).
2. ÷ √d_k: scaling factor. Without it, large d_k causes dot products to grow in magnitude,
pushing softmax into regions with tiny gradients.
3. softmax(·): converts scores to a probability distribution (attention weights) that sums
to 1 per query.
4. (· · ·) V: weighted sum of values; each query retrieves a blend of all values proportional
to attention weights.
Code Python
import math
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V, mask=None):
    """mask convention here: non-zero = keep, 0 = masked out."""
    d_k = Q.size(-1)
    scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(d_k)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float('-inf'))
    weights = F.softmax(scores, dim=-1)
    return torch.matmul(weights, V), weights

Q = torch.rand(2, 10, 64)
K = torch.rand(2, 10, 64)
V = torch.rand(2, 10, 64)
out, w = scaled_dot_product_attention(Q, K, V)
print(out.shape, w.shape)  # (2, 10, 64) (2, 10, 10)
Q13
Why do we divide by √d_k in the attention formula?
For queries and keys whose components are independent with zero mean and unit variance,
the dot product q · k = Σ_{i=1}^{d_k} q_i k_i has variance d_k. With large d_k (e.g. 64),
dot products can be large in magnitude, pushing the softmax into saturation regions where
gradients approach zero. Dividing by √d_k rescales the variance back to 1, keeping
gradients healthy.
Code Python
import math
import torch
import torch.nn.functional as F

d_k = 64
q = torch.randn(1, 1, d_k)
k = torch.randn(1, 10, d_k)

raw_score = torch.bmm(q, k.transpose(1, 2))   # (1, 1, 10)
scaled_score = raw_score / math.sqrt(d_k)

def entropy(scores):
    """Entropy of the softmax distribution over the last dimension."""
    return (-F.softmax(scores, -1) * F.log_softmax(scores, -1)).sum()

print(f"Raw score std:    {raw_score.std():.3f}")
print(f"Scaled score std: {scaled_score.std():.3f}")
print(f"Softmax entropy (raw):    {entropy(raw_score):.3f}")
print(f"Softmax entropy (scaled): {entropy(scaled_score):.3f}")
Q14
What is an attention mask and when is it used?
An attention mask prevents certain positions from contributing to attention. Two main
types:
• Padding mask: sets the scores of padded positions (added to make sequences the same
length in a batch) to −∞ before softmax, so they receive zero weight.
• Causal (look-ahead) mask: used in the decoder to ensure position i can only attend to
positions ≤ i (prevents the decoder from “seeing the future” during training).
Code Python
import torch

def causal_mask(seq_len):
    """Upper-triangular True = masked out."""
    return torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool),
                      diagonal=1)

def padding_mask(lengths, max_len):
    """True where token is PAD. lengths: (batch,) tensor of true lengths."""
    return torch.arange(max_len).unsqueeze(0) >= lengths.unsqueeze(1)

mask = causal_mask(5)
print(mask.int())
# [[0, 1, 1, 1, 1],
#  [0, 0, 1, 1, 1], ... ]
Q15
How does self-attention capture long-range dependencies?
In self-attention, the path length between any two tokens is constant (O(1)) regardless
of their distance in the sequence. Every token computes an attention score with every
other token in a single layer. In contrast, an RNN must propagate information through
O(n) hidden-state transitions, causing gradients to vanish over long distances. This direct
any-to-any connectivity is why Transformers excel at coreference resolution, long-document
summarisation, and other tasks requiring distant context.
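The O(1) path length can be observed directly. In the sketch below (sizes chosen arbitrarily for illustration), a single attention layer is applied to a 256-token sequence and the gradient of the last position's output is taken with respect to every input position. Even the first token, 255 positions away, receives a non-zero gradient after just one layer.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
n, d = 256, 32
attn = nn.MultiheadAttention(d, num_heads=1, batch_first=True)

x = torch.randn(1, n, d, requires_grad=True)
out, _ = attn(x, x, x)

# Gradient of the last position's output w.r.t. every input position
out[0, -1].sum().backward()
grad_norms = x.grad[0].norm(dim=-1)     # (n,)
print("grad norm at token 0:", grad_norms[0].item())
```

In an RNN the same dependency would have to pass through 255 hidden-state transitions, which is where the vanishing-gradient problem arises.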
Q16
What is cross-attention (encoder-decoder attention)?
Code Python
import torch
import torch.nn as nn

class CrossAttention(nn.Module):
    def __init__(self, d_model=512, nhead=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, nhead, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, decoder_x, encoder_out):
        # Q from decoder, K & V from encoder
        out, _ = self.attn(query=decoder_x,
                           key=encoder_out,
                           value=encoder_out)
        return self.norm(decoder_x + out)
Q17
What are the advantages and limitations of self-attention?
Advantages | Limitations
Captures long-range dependencies in O(1) path length | O(n²) time and memory w.r.t. sequence length
Fully parallelisable, with no sequential bottleneck | Positional information must be injected explicitly
Better contextual understanding through direct token-to-token attention | Memory-intensive for very long sequences (documents, genomes)
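The limitation that positional information must be injected explicitly can be checked with a small experiment. This is a sketch with arbitrary sizes: without positional encodings, shuffling the input tokens simply shuffles the attention outputs in the same way (permutation equivariance), so the layer by itself cannot tell token order apart.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
attn = nn.MultiheadAttention(embed_dim=64, num_heads=4, batch_first=True)
attn.eval()

x = torch.randn(1, 6, 64)
perm = torch.randperm(6)
x_perm = x[:, perm]

with torch.no_grad():
    out, _ = attn(x, x, x)
    out_perm, _ = attn(x_perm, x_perm, x_perm)

# Shuffling the input tokens just shuffles the outputs the same way
equivariant = torch.allclose(out[:, perm], out_perm, atol=1e-4)
print("permutation-equivariant:", equivariant)
```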
Q18
Implement self-attention from scratch in NumPy/PyTorch.
Code Python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfAttention(nn.Module):
    def __init__(self, d_model=512, d_k=64):
        super().__init__()
        self.Wq = nn.Linear(d_model, d_k, bias=False)
        self.Wk = nn.Linear(d_model, d_k, bias=False)
        self.Wv = nn.Linear(d_model, d_k, bias=False)
        self.scale = math.sqrt(d_k)

    def forward(self, x, mask=None):
        """mask: boolean tensor, True = masked out."""
        Q, K, V = self.Wq(x), self.Wk(x), self.Wv(x)
        scores = torch.bmm(Q, K.transpose(1, 2)) / self.scale
        if mask is not None:
            scores = scores.masked_fill(mask, float('-inf'))
        weights = F.softmax(scores, dim=-1)
        return torch.bmm(weights, V), weights  # output, attn_map

sa = SelfAttention()
x = torch.rand(2, 10, 512)
out, w = sa(x)
print(out.shape, w.shape)  # (2, 10, 64) (2, 10, 10)
Q19
What is the role of Softmax in attention and can it be replaced?
Softmax normalises the raw attention scores into a valid probability distribution (non-
negative, sums to 1), ensuring the weighted sum of values remains well-scaled. However it
forces all attention distributions to be dense (every position gets non-zero weight).
Alternatives investigated in the literature:
• Sparsemax: produces sparse distributions; many weights are exactly 0.
• α-entmax: a family indexed by α that interpolates between softmax (α = 1) and
sparsemax (α = 2).
• ReLU-based attention: used in some efficient Transformers to avoid the normalisation
bottleneck.
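To illustrate the difference between dense and sparse attention distributions, here is a minimal sparsemax sketch following the sort-and-threshold procedure of Martins & Astudillo (2016). The function below is an illustrative implementation written for this example, not a PyTorch built-in.

```python
import torch

def sparsemax(z, dim=-1):
    """Sparsemax: Euclidean projection of the scores onto the probability simplex."""
    z_sorted, _ = torch.sort(z, dim=dim, descending=True)
    cumsum = z_sorted.cumsum(dim)
    k = torch.arange(1, z.size(dim) + 1, dtype=z.dtype, device=z.device)
    shape = [1] * z.dim()
    shape[dim] = -1
    k = k.view(shape)
    support = 1 + k * z_sorted > cumsum            # sorted entries that stay non-zero
    k_max = support.sum(dim=dim, keepdim=True)     # size of the support
    tau = (cumsum.gather(dim, k_max - 1) - 1) / k_max.to(z.dtype)
    return torch.clamp(z - tau, min=0)

scores = torch.tensor([[1.0, 0.8, 0.1, -1.0]])
p_soft = torch.softmax(scores, dim=-1)
p_sparse = sparsemax(scores)
print("softmax  :", p_soft)
print("sparsemax:", p_sparse)
```

For these scores softmax gives every position a non-zero weight, whereas sparsemax assigns exactly zero to the two lowest-scoring positions while still summing to 1.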
Q20
How do you visualise attention weights for interpretability?
The n × n attention weight matrix for each head can be plotted as a heatmap. High values
between positions (i, j) indicate that token i attends strongly to token j. Tools like BertViz
provide interactive head views for BERT-style models.
Code Python
import torch
import matplotlib.pyplot as plt
import seaborn as sns

tokens = ["The", "animal", "didn't", "cross", "because", "it", "tired"]
n = len(tokens)
# Synthetic attention weights (replace with real model output)
weights = torch.softmax(torch.randn(n, n), dim=-1).numpy()

fig, ax = plt.subplots(figsize=(6, 5))
sns.heatmap(weights, xticklabels=tokens, yticklabels=tokens,
            cmap='Blues', ax=ax)
ax.set_xlabel("Key (attended-to token)")
ax.set_ylabel("Query token")
ax.set_title("Self-Attention Weights")
plt.tight_layout()
plt.savefig("attn_heatmap.png", dpi=150)
Q21
What is Multi-Head Attention (MHA) and why is it better than single-head
attention?
Code Python
import torch
import torch.nn as nn

mha = nn.MultiheadAttention(embed_dim=512, num_heads=8,
                            batch_first=True)
x = torch.rand(2, 10, 512)
out, attn_weights = mha(x, x, x)          # self-attention
print("Output:", out.shape)               # (2, 10, 512)
print("Weights:", attn_weights.shape)     # (2, 10, 10), averaged over heads by default
Q22
Walk through the step-by-step computation of Multi-Head Attention.
Code Python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadAttention(nn.Module):
    def __init__(self, d=512, h=8):
        super().__init__()
        self.h, self.dk = h, d // h
        self.Wq = nn.Linear(d, d, bias=False)
        self.Wk = nn.Linear(d, d, bias=False)
        self.Wv = nn.Linear(d, d, bias=False)
        self.Wo = nn.Linear(d, d, bias=False)

    def split_heads(self, x, B, T):
        return x.view(B, T, self.h, self.dk).transpose(1, 2)

    def forward(self, x):
        B, T, _ = x.shape
        Q = self.split_heads(self.Wq(x), B, T)  # (B, h, T, dk)
        K = self.split_heads(self.Wk(x), B, T)
        V = self.split_heads(self.Wv(x), B, T)
        scores = Q @ K.transpose(-2, -1) / math.sqrt(self.dk)
        weights = F.softmax(scores, dim=-1)
        heads = (weights @ V).transpose(1, 2).contiguous()
        heads = heads.view(B, T, -1)  # concat
        return self.Wo(heads)
Q23
How do you choose the number of attention heads h?
Q24
What does each attention head typically learn?
Probing studies (e.g. Clark et al., 2019 on BERT) reveal that individual heads tend to
specialise:
• Syntactic heads — attend to direct objects, subjects, or dependency arcs.
• Positional heads — attend to immediately adjacent tokens (local window).
• Coreference heads — link pronouns to their antecedents (“it” → “animal”).
• Rare token heads — pay disproportionate attention to infrequent, informative tokens.
This emergent specialisation is a key reason MHA outperforms single-head attention.
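A first step towards this kind of head-level analysis is to extract one attention map per head rather than the head-averaged map. The sketch below uses the `average_attn_weights=False` argument of `nn.MultiheadAttention` (available in recent PyTorch releases) on a randomly initialised layer, so the maps are only placeholders for what a trained model would produce.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
mha = nn.MultiheadAttention(embed_dim=512, num_heads=8, batch_first=True)
x = torch.rand(2, 10, 512)

# average_attn_weights=False keeps one (T, T) map per head
_, per_head = mha(x, x, x, need_weights=True, average_attn_weights=False)
print(per_head.shape)   # (batch, heads, T, T)

# Which key position does each head attend to most, for query position 0?
top_key = per_head[0, :, 0].argmax(dim=-1)   # (heads,)
print(top_key)
```

With a trained model, comparing these per-head maps against syntactic or coreference annotations is roughly how probing studies identify specialised heads.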
Q25
What is the output projection matrix W^O in MHA and why is it needed?
After concatenating the h head outputs (each d_k-dimensional), we obtain a vector of
dimension h × d_k = d_model. The projection W^O ∈ R^{d × d} mixes information across heads,
allowing the model to learn how different heads’ specialised representations should be
combined. Without W^O, each head’s output would be an independent, isolated view with no
cross-head interaction.
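One way to see what the output projection does is to note that concatenating the heads and multiplying by W^O is the same as giving each head its own d_k × d slice of W^O and summing the results. The sketch below checks this numerically with random stand-in tensors; the sizes follow the d_model = 512, h = 8 example used throughout.

```python
import torch

torch.manual_seed(0)
h, dk, T = 8, 64, 10
d = h * dk                                       # 512

heads = [torch.randn(T, dk) for _ in range(h)]   # stand-ins for per-head outputs
Wo = torch.randn(d, d) / d ** 0.5                # output projection

# (a) concatenate, then project
out_concat = torch.cat(heads, dim=-1) @ Wo       # (T, d)

# (b) project each head with its own (dk, d) slice of Wo, then sum
out_sum = sum(hd @ Wo[i * dk:(i + 1) * dk] for i, hd in enumerate(heads))

same = torch.allclose(out_concat, out_sum, atol=1e-4)
print("concat-then-project == sum of per-head projections:", same)
```

Each head therefore writes into the full d_model-dimensional output through a learned map, which is how the heads' contributions get combined.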
Q26
Explain Grouped Query Attention (GQA) used in LLaMA-2/3.
Code Python
# Conceptual GQA: 8 query heads, 2 KV groups
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

H_Q, H_KV, d = 8, 2, 512
dk = d // H_Q
Wq = nn.Linear(d, H_Q * dk, bias=False)
Wk = nn.Linear(d, H_KV * dk, bias=False)
Wv = nn.Linear(d, H_KV * dk, bias=False)

x = torch.rand(1, 10, d)
Q = Wq(x).view(1, 10, H_Q, dk).transpose(1, 2)    # (1, 8, 10, dk)
K = Wk(x).view(1, 10, H_KV, dk).transpose(1, 2)   # (1, 2, 10, dk)
V = Wv(x).view(1, 10, H_KV, dk).transpose(1, 2)

# Repeat K, V so each group of H_Q // H_KV query heads shares one KV head
K = K.repeat_interleave(H_Q // H_KV, dim=1)       # (1, 8, 10, dk)
V = V.repeat_interleave(H_Q // H_KV, dim=1)

# Standard scaled dot-product attention on the expanded heads
weights = F.softmax(Q @ K.transpose(-2, -1) / math.sqrt(dk), dim=-1)
out = weights @ V                                 # (1, 8, 10, dk)
Q27
What is attention dropout and why is it used?
Attention dropout applies a dropout mask to the attention weight matrix after softmax and
before the weighted sum over values. This randomly zeroes out some attention connections
during training, preventing the model from over-relying on specific (query, key) pairs and
improving generalisation. Typical dropout rates: 0.1 for base-sized models; very large
models and fine-tuning runs often use 0.0.
Code Python
import math
import torch
import torch.nn.functional as F

def attention_with_dropout(Q, K, V, dropout=0.1, training=True):
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)
    weights = F.softmax(scores, dim=-1)
    # Dropout is applied after softmax, before the weighted sum over values
    weights = F.dropout(weights, p=dropout, training=training)
    return weights @ V
Q28
Compare MHA parameter counts: 1 head vs 8 heads for d_model = 512.
Code Python
import torch.nn as nn

def count_params(m):
    return sum(p.numel() for p in m.parameters())

d = 512
single = nn.MultiheadAttention(d, num_heads=1, bias=False, batch_first=True)
multi = nn.MultiheadAttention(d, num_heads=8, bias=False, batch_first=True)

print(f"1-head params:  {count_params(single):,}")
print(f"8-heads params: {count_params(multi):,}")
# Both should print 1,048,576 (4 * d * d: the Q, K, V and output projections)
Q29
What does the Encoder do in a Transformer and what is its output?
The encoder processes the entire source sequence in parallel. Each of its N identical blocks
applies multi-head self-attention followed by a position-wise FFN, with residual connections
and layer normalisation.
The output is a sequence of contextualised embeddings Z ∈ R^{n × d}: one vector per
source token, enriched with global context from all other tokens. This representation is
passed to every decoder layer via cross-attention.
Code Python
import torch
import torch.nn as nn

encoder_layer = nn.TransformerEncoderLayer(
    d_model=512, nhead=8, batch_first=True)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=6)

src = torch.rand(2, 15, 512)  # (batch, src_len, d)
# For padded batches, also pass src_key_padding_mask (True where tokens are PAD)
memory = encoder(src)         # (2, 15, 512)
print("Memory shape:", memory.shape)
Q30
Describe the three attention sub-layers inside a Transformer decoder block.
Q31
Why is Masked Self-Attention used in the decoder?
During training, the entire target sequence is fed to the decoder simultaneously (teacher
forcing). Without masking, position t could attend to ground-truth tokens at positions
t + 1, t + 2, . . . — it would “cheat” by reading the answer. The causal mask ensures that the
prediction at position t depends only on positions ≤ t, replicating the left-to-right generation
process used at inference time.
Code Python
import torch
import torch.nn as nn

d, h, N = 512, 8, 6
dec_layer = nn.TransformerDecoderLayer(d_model=d, nhead=h,
                                       batch_first=True)
decoder = nn.TransformerDecoder(dec_layer, num_layers=N)

tgt = torch.rand(2, 10, d)
memory = torch.rand(2, 15, d)  # encoder output

# Causal mask: float tensor with -inf above the diagonal (masked), 0 elsewhere
tgt_mask = nn.Transformer.generate_square_subsequent_mask(10)

out = decoder(tgt, memory, tgt_mask=tgt_mask)
print(out.shape)  # (2, 10, 512)
Q32
How does the Transformer generate output sequences (inference)?
Q33
What is the output layer of a Transformer and how are logits produced?
The final decoder hidden states are mapped to vocabulary logits by a linear projection W_out,
followed by softmax to get a probability over vocabulary V. In many models (e.g. GPT-2)
W_out is tied to the input embedding matrix to reduce parameters and improve generalisation.
Code Python
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab_size, d_model = 32000, 512

# Weight-tied output projection: reuse the embedding matrix
embedding = nn.Embedding(vocab_size, d_model)

def output_proj(h):
    return h @ embedding.weight.T  # (..., d_model) -> (..., vocab_size)

h = torch.rand(2, 10, d_model)     # decoder output
logits = output_proj(h)            # (2, 10, 32000)
probs = F.softmax(logits, dim=-1)
next_token = probs[:, -1, :].argmax(dim=-1)  # greedy pick at the last position
print("Next token ids:", next_token)
Q34
Compare encoder-only, decoder-only, and encoder-decoder Transformer archi-
tectures.
Q35
What is the KV-Cache and how does it accelerate autoregressive decoding?
During greedy/beam decoding, the K and V projections for all previous tokens never change.
Recomputing them at every step is wasteful. The KV-cache stores the K and V tensors
from all past decoder steps; at step t only the new token’s K/V pair is computed and
appended. This reduces the per-step projection cost from O(t · d²) to O(d²); attending over
the cached keys still costs O(t · d) per step. Memory cost is O(n · L · d), where L is the
number of layers.
Code Python
import torch
import torch.nn.functional as F

# Simplified KV-cache concept for a single attention layer.
# `attn_layer` is assumed to expose query_proj / key_proj / val_proj linear maps.
def decode_step(x_new, past_k, past_v, attn_layer):
    """x_new: (B, 1, d); past_k / past_v: (B, t-1, dk) tensors, or None at step 1."""
    q = attn_layer.query_proj(x_new)      # (B, 1, dk)
    k_new = attn_layer.key_proj(x_new)    # (B, 1, dk)
    v_new = attn_layer.val_proj(x_new)
    # Append the new K/V to the cache along the sequence dimension
    K_all = k_new if past_k is None else torch.cat([past_k, k_new], dim=1)
    V_all = v_new if past_v is None else torch.cat([past_v, v_new], dim=1)
    # Q from x_new only; K, V from the full history (no mask needed for one query)
    scores = q @ K_all.transpose(-2, -1) / K_all.size(-1) ** 0.5
    out = F.softmax(scores, dim=-1) @ V_all
    return out, K_all, V_all
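The O(n · L · d) memory cost is easy to turn into a back-of-the-envelope estimate. The helper below is a sketch; the configuration passed to it is hypothetical and chosen only for illustration. It assumes standard multi-head attention, where keys and values each have width d_model, and 2 bytes per element (fp16).

```python
def kv_cache_bytes(n_tokens, n_layers, d_model, bytes_per_elem=2, batch=1):
    """K and V each store n_tokens * d_model values per layer."""
    return 2 * batch * n_layers * n_tokens * d_model * bytes_per_elem

# Hypothetical configuration, chosen only for illustration
size = kv_cache_bytes(n_tokens=4096, n_layers=32, d_model=4096)
print(f"KV-cache: {size / 2**30:.1f} GiB")
```

Schemes that share K/V heads across query heads, such as GQA, shrink the cache in proportion to the number of KV heads.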
Q36
What is Beam Search and how is it used in Transformer generation?
Beam search is a heuristic decoding strategy that keeps the top-k (beam width) most
probable partial sequences at each step, expanding each candidate and retaining the k best,
until all beams end with [EOS] or a maximum length is reached. It approximates finding the
globally most probable sequence without an intractable exhaustive search.
Code Python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("t5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")

inputs = tokenizer("translate English to French: Hello world",
                   return_tensors="pt")
outputs = model.generate(**inputs, num_beams=4, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
# Expected: a French translation such as "Bonjour monde" (exact text may vary)
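To show the mechanics without downloading a model, here is a toy beam search written from scratch. The "model" is a hypothetical fixed table of next-token probabilities that depends only on the previous token, and the helper names are illustrative. The table is built so that greedy decoding (beam width 1) commits to the locally best first token, while beam width 2 finds a more probable complete sequence.

```python
import math

# Toy next-token model (an assumption for illustration): log-probabilities
# that depend only on the previous token.
VOCAB = ["a", "b", "[EOS]"]
LOGP = {
    "[BOS]": {"a": math.log(0.55), "b": math.log(0.45), "[EOS]": -math.inf},
    "a":     {"a": math.log(0.30), "b": math.log(0.30), "[EOS]": math.log(0.40)},
    "b":     {"a": math.log(0.05), "b": math.log(0.05), "[EOS]": math.log(0.90)},
}

def beam_search(beam_width=2, max_len=5):
    beams = [(["[BOS]"], 0.0)]            # (tokens, cumulative log-prob)
    finished = []
    for _ in range(max_len):
        candidates = []
        for tokens, score in beams:
            for tok in VOCAB:
                lp = LOGP[tokens[-1]][tok]
                if lp > -math.inf:
                    candidates.append((tokens + [tok], score + lp))
        # Keep the best `beam_width` hypotheses; completed ones leave the beam
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for tokens, score in candidates[:beam_width]:
            (finished if tokens[-1] == "[EOS]" else beams).append((tokens, score))
        if not beams:
            break
    return max(finished + beams, key=lambda c: c[1])

best_tokens, best_score = beam_search()
print(best_tokens, round(math.exp(best_score), 3))
```

Here greedy decoding picks "a" first (0.55) and ends with probability 0.55 × 0.40 = 0.22, whereas the width-2 beam keeps "b" alive and returns the sequence ending in "b" with probability 0.45 × 0.90 = 0.405. Production implementations usually add length normalisation and other refinements that this sketch omits.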
Q37
What is a Vision Transformer (ViT) and how does it process an image?
ViT (Dosovitskiy et al., 2020) applies the standard Transformer encoder directly to images
by converting them into a sequence of flat patch embeddings:
1. Patch splitting: divide the H × W × C image into N = HW/P² non-overlapping patches
of size P × P (typically P = 16).
2. Linear projection: flatten each patch to a vector of size P²C and project to d_model.
3. CLS token: prepend a learnable [CLS] embedding; its final-layer output is used for
classification.
4. Positional embeddings: add learned 1-D positional embeddings.
5. Transformer encoder: process the N + 1 token sequence.
Code Python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    def __init__(self, img_size=224, patch_size=16,
                 in_channels=3, d_model=768):
        super().__init__()
        n_patches = (img_size // patch_size) ** 2
        # A conv with kernel = stride = patch size is equivalent to
        # flattening each patch and applying a shared linear layer
        self.proj = nn.Conv2d(in_channels, d_model,
                              kernel_size=patch_size,
                              stride=patch_size)
        self.cls_tok = nn.Parameter(torch.zeros(1, 1, d_model))
        self.pos_emb = nn.Parameter(
            torch.zeros(1, n_patches + 1, d_model))

    def forward(self, x):                      # x: (B, 3, 224, 224)
        B = x.size(0)
        x = self.proj(x)                       # (B, d, H/P, W/P)
        x = x.flatten(2).transpose(1, 2)       # (B, N, d)
        cls = self.cls_tok.expand(B, -1, -1)
        x = torch.cat([cls, x], dim=1)         # (B, N+1, d)
        return x + self.pos_emb
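The patch arithmetic in steps 1 and 2 can be verified directly. The sketch below cuts a 224 × 224 RGB image into 16 × 16 patches with `Tensor.unfold`, as an explicit alternative to the strided convolution above, and checks the patch count N = HW/P² and the flattened patch size P²C.

```python
import torch

H = W = 224
P = 16
C = 3
img = torch.rand(1, C, H, W)

# Cut the image into non-overlapping P x P patches with unfold
patches = img.unfold(2, P, P).unfold(3, P, P)          # (1, C, H/P, W/P, P, P)
patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(1, -1, C * P * P)

n_patches, patch_dim = patches.shape[1], patches.shape[2]
print(n_patches, patch_dim)
```

For these sizes N = 14 × 14 = 196 and P²C = 768, which happens to equal the d_model = 768 used in the patch embedding code, although the two are independent in general.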
Q38
Why does ViT need positional embeddings for image patches?
Q39
What is the CLS token in ViT and how is it used for classification?
The CLS token is a learnable parameter prepended to the sequence of patch embeddings
before the Transformer encoder. Because all tokens (including CLS) attend to each other
through self-attention, by the final encoder layer the CLS token has aggregated information
from all patches. Its final hidden state is fed into a small MLP classification head:
\[ \hat{y} = \mathrm{MLP}\big(\mathrm{LN}(z_0^{L})\big) \]
Code Python
import torch
import torch.nn as nn

class ViTClassifier(nn.Module):
    def __init__(self, d_model=768, num_classes=1000):
        super().__init__()
        enc_layer = nn.TransformerEncoderLayer(
            d_model, nhead=12, batch_first=True, norm_first=True)
        self.enc = nn.TransformerEncoder(enc_layer, 12)
        self.norm = nn.LayerNorm(d_model)
        self.head = nn.Linear(d_model, num_classes)

    def forward(self, tokens):       # (B, N+1, d)
        z = self.enc(tokens)         # (B, N+1, d)
        cls = self.norm(z[:, 0])     # CLS token
        return self.head(cls)        # (B, num_classes)
Q40
What are the advantages of ViT over CNNs?
• Global receptive field — every patch attends to every other patch from the first layer;
CNNs build global context only in deeper layers through stacking local convolutions.
• Scalability — ViT performance scales predictably with data and model size; larger
datasets yield consistently better models.
• Transfer across modalities — the same architecture works for text, images, audio,
video with minimal modification.
• Interpretability — attention maps provide intuitive visualisations of what the model
“looks at”.
Q41
What are the limitations of ViT compared to CNNs?
• Data hungry: ViT lacks CNN’s inductive biases (translation equivariance, locality). It
requires large datasets (ImageNet-21k, JFT-300M) to outperform CNNs; on small datasets
CNNs often win.
• Quadratic complexity: attention costs $O(N^2)$ in the number of patches $N$, and
high-resolution images produce many patches.
• High compute cost: pre-training from scratch requires significant GPU resources.
• Positional encoding sensitivity: fine-tuning at a different resolution changes the
number of patches, so the learned positional embeddings must be interpolated.
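The last two points can be made concrete with a short sketch. It counts patches and attention-matrix entries at two resolutions, then resizes a learned positional embedding from a 14 × 14 grid to a 24 × 24 grid. The helper names `num_patches` and `resize_pos_emb`, the patch size of 16, and the choice of bicubic interpolation are illustrative assumptions, not something this guide prescribes.

```python
import math
import torch
import torch.nn.functional as F

def num_patches(img_size: int, patch: int = 16) -> int:
    """Number of non-overlapping patches in a square image."""
    return (img_size // patch) ** 2

def resize_pos_emb(pos_emb: torch.Tensor, new_grid: int) -> torch.Tensor:
    """Resize (1, 1 + g*g, d) positional embeddings to a new_grid x new_grid layout.

    Hypothetical helper: the CLS slot (index 0) is kept as-is and only the
    patch positions are interpolated on their 2D grid.
    """
    cls_pos, patch_pos = pos_emb[:, :1], pos_emb[:, 1:]
    old_grid = math.isqrt(patch_pos.shape[1])
    d = patch_pos.shape[-1]
    # (1, g*g, d) -> (1, d, g, g) so that F.interpolate sees a 2D "image"
    grid = patch_pos.reshape(1, old_grid, old_grid, d).permute(0, 3, 1, 2)
    grid = F.interpolate(grid, size=(new_grid, new_grid),
                         mode="bicubic", align_corners=False)
    patch_pos = grid.permute(0, 2, 3, 1).reshape(1, new_grid * new_grid, d)
    return torch.cat([cls_pos, patch_pos], dim=1)

# Quadratic growth: patches N and attention entries N*N per head
for size in (224, 384):
    n = num_patches(size)
    print(f"{size}px -> N = {n}, N^2 = {n * n}")

# Fine-tuning at 384px with embeddings learned at 224px
pos_224 = torch.randn(1, 1 + num_patches(224), 768)
pos_384 = resize_pos_emb(pos_224, new_grid=384 // 16)
print(pos_384.shape)
```

Going from 224 to 384 pixels roughly triples the patch count, so the attention matrix grows by almost a factor of nine.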
Q42
How does a Swin Transformer address ViT’s scalability issues?
The Swin Transformer (Liu et al., 2021) introduces two key ideas:
1. Shifted window attention: attention is computed within local non-overlapping
windows rather than globally, reducing complexity from $O(N^2)$ to $O(N)$ in the number
of patches for a fixed window size. Shifting the window grid between consecutive layers
lets information flow across window boundaries.
2. Hierarchical feature maps: patches are merged as depth increases, producing
multi-scale features similar to CNN feature pyramids, which are essential for detection
and segmentation.
At publication, Swin reported strong ImageNet classification accuracy and state-of-the-art
COCO detection and ADE20K segmentation results, and it is a common backbone for
detection and segmentation frameworks (e.g. Mask R-CNN with a Swin backbone).
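A minimal sketch of the window idea, assuming a 56 × 56 feature map with 96 channels and 7 × 7 windows (illustrative values). The helper `window_partition` is written here for illustration; the cyclic shift uses `torch.roll`, and the attention masking that a full shifted-window layer needs after the shift is omitted.

```python
import torch

def window_partition(x: torch.Tensor, m: int) -> torch.Tensor:
    """Split a (B, H, W, C) feature map into (B * num_windows, m*m, C) token groups."""
    B, H, W, C = x.shape
    x = x.view(B, H // m, m, W // m, m, C)
    # Bring the two window-index axes together, then flatten each window
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, m * m, C)

feat = torch.rand(2, 56, 56, 96)
windows = window_partition(feat, 7)          # attention would run inside each group
print(windows.shape)

# Shifted windows: roll the map by half a window before partitioning,
# so the next layer's windows straddle the previous layer's boundaries
shifted = torch.roll(feat, shifts=(-3, -3), dims=(1, 2))
shifted_windows = window_partition(shifted, 7)

# Query-key pairs per image: global attention vs. windowed attention
n = 56 * 56
global_pairs = n * n                 # grows as N^2
window_pairs = (n // 49) * 49 * 49   # grows as N * 49, linear in N
print(global_pairs, window_pairs)
```

With the window size fixed, doubling the number of patches doubles the windowed cost but quadruples the global cost.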
Q43
Implement a minimal ViT forward pass in PyTorch.
Code Python
import torch
import torch.nn as nn

class MiniViT(nn.Module):
    def __init__(self, img=224, patch=16, c=3,
                 d=768, heads=12, layers=12, classes=1000):
        super().__init__()
        n = (img // patch) ** 2                            # number of patches
        # A strided conv projects each non-overlapping patch to d dimensions
        self.patch_emb = nn.Conv2d(c, d, patch, stride=patch)
        self.cls = nn.Parameter(torch.zeros(1, 1, d))      # learnable CLS token
        self.pos = nn.Parameter(torch.zeros(1, n + 1, d))  # learnable positions
        enc = nn.TransformerEncoderLayer(
            d, heads, dim_feedforward=d * 4,
            batch_first=True, norm_first=True)
        self.enc = nn.TransformerEncoder(enc, layers)
        self.norm = nn.LayerNorm(d)
        self.head = nn.Linear(d, classes)

    def forward(self, x):                                  # (B, c, img, img)
        B = x.shape[0]
        x = self.patch_emb(x).flatten(2).transpose(1, 2)   # (B, n, d)
        cls = self.cls.expand(B, -1, -1)                   # (B, 1, d)
        x = torch.cat([cls, x], 1) + self.pos              # (B, n+1, d)
        z = self.enc(x)
        return self.head(self.norm(z[:, 0]))               # logits from CLS

model = MiniViT()
dummy = torch.rand(2, 3, 224, 224)
print(model(dummy).shape)  # torch.Size([2, 1000])
Q44
What is inductive bias and why do CNNs have more of it than ViTs?
Inductive bias is the set of assumptions built into an architecture that constrain the
hypothesis space, helping generalisation when data is limited.
CNN inductive biases:
• Translation equivariance: the same filter is applied at every spatial location, so a
feature learned in the top-left corner is detected wherever it appears.
• Locality: each neuron looks at a local receptive field, reflecting the assumption that
nearby pixels are most correlated.
ViT builds in far less of either: beyond the patch split and the learned positional
embeddings, self-attention treats patches as an unordered set of tokens, so spatial
structure must be learned from data, which takes more examples.
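Both properties can be checked numerically. The sketch below is a toy demonstration under stated assumptions: the convolution uses circular padding so that equivariance holds exactly at the borders, and the attention layer is a bare `nn.MultiheadAttention` with no positional embeddings. The layer sizes and variable names are arbitrary.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# 1) Convolution: shifting the input shifts the output by the same amount
conv = nn.Conv2d(1, 4, kernel_size=3, padding=1, padding_mode="circular")
img = torch.rand(1, 1, 8, 8)
with torch.no_grad():
    shift_then_conv = conv(torch.roll(img, shifts=(2, 3), dims=(2, 3)))
    conv_then_shift = torch.roll(conv(img), shifts=(2, 3), dims=(2, 3))
conv_equivariant = torch.allclose(shift_then_conv, conv_then_shift, atol=1e-5)

# 2) Self-attention without positions: permuting the tokens only permutes
#    the outputs, so the layer itself carries no notion of spatial order
attn = nn.MultiheadAttention(embed_dim=16, num_heads=4, batch_first=True).eval()
tokens = torch.rand(1, 6, 16)
perm = torch.randperm(6)
with torch.no_grad():
    out, _ = attn(tokens, tokens, tokens)
    shuffled = tokens[:, perm]
    out_shuffled, _ = attn(shuffled, shuffled, shuffled)
order_blind = torch.allclose(out[:, perm], out_shuffled, atol=1e-5)

print("conv is translation-equivariant:", conv_equivariant)
print("attention ignores token order:  ", order_blind)
```

The convolution gets its spatial prior for free from weight sharing, whereas the attention layer only learns about position through whatever embeddings are added to its inputs.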
Q45
When should you choose a CNN over ViT and vice versa?
Q46
What are Hybrid CNN-Transformer models and give two examples.
Hybrid models combine CNN’s local feature extraction with Transformer’s global context
modelling:
• CNN + Transformer (sequential): a CNN extracts feature maps, which are then
flattened and fed as tokens to a Transformer encoder. Examples: ViT with a ResNet
stem, TransUNet (medical image segmentation).
• Swin Transformer: uses shifted-window attention inside a CNN-like hierarchical
pyramid.
• ConvNeXt (related, but not a true hybrid): a pure CNN modernised with Transformer
design choices (large kernels, LayerNorm, GELU, inverted bottleneck); it contains no
attention layers.
Code Python
import torch
import torch.nn as nn
import torchvision.models as tv

class HybridViT(nn.Module):
    def __init__(self, num_classes=1000):
        super().__init__()
        # CNN stem: ResNet50 minus avgpool & fc (downloads pretrained weights)
        resnet = tv.resnet50(weights="IMAGENET1K_V1")
        self.cnn_stem = nn.Sequential(*list(resnet.children())[:-2])
        # Transformer on CNN feature map tokens
        enc = nn.TransformerEncoderLayer(
            2048, 8, dim_feedforward=4096, batch_first=True)
        self.tfm = nn.TransformerEncoder(enc, 2)
        self.pool = nn.AdaptiveAvgPool1d(1)
        self.fc = nn.Linear(2048, num_classes)

    def forward(self, x):                          # shapes below assume 224x224 input
        f = self.cnn_stem(x)                       # (B, 2048, 7, 7)
        tok = f.flatten(2).transpose(1, 2)         # (B, 49, 2048)
        z = self.tfm(tok)                          # (B, 49, 2048)
        z = self.pool(z.transpose(1, 2))[:, :, 0]  # mean over tokens, (B, 2048)
        return self.fc(z)
Q47
What is DINO and how does it enable self-supervised ViT training?
Q48
Compare BERT (encoder-only) vs GPT (decoder-only) architectures.
Q49
What is Flash Attention and how does it speed up Transformer training?
Flash Attention (Dao et al., 2022) is an IO-aware exact attention algorithm that avoids
materialising the full $n \times n$ attention matrix in high-bandwidth memory (HBM):
• Tiles Q, K, V into blocks that fit in on-chip SRAM.
• Fuses the softmax and matmul operations into a single GPU kernel.
• Produces the same mathematical output as standard attention but needs only $O(n)$
extra memory, since no $n^2$ matrix is stored.
• Provides roughly 2 to 4× wall-clock speedup for attention and makes longer context
windows (32k, 128k tokens) practical.
Code Python
# PyTorch 2.0+: F.scaled_dot_product_attention automatically
# dispatches to Flash Attention when available
import torch
import torch.nn.functional as F

# (batch, heads, seq_len, head_dim); the flash kernel needs fp16/bf16 on CUDA
Q = torch.rand(2, 8, 1024, 64, device="cuda", dtype=torch.float16)
K = torch.rand(2, 8, 1024, 64, device="cuda", dtype=torch.float16)
V = torch.rand(2, 8, 1024, 64, device="cuda", dtype=torch.float16)

# Allow only the flash backend, so an unsupported setup fails loudly.
# Newer PyTorch releases deprecate this context manager in favour of
# torch.nn.attention.sdpa_kernel.
with torch.backends.cuda.sdp_kernel(
        enable_flash=True, enable_math=False,
        enable_mem_efficient=False):
    out = F.scaled_dot_product_attention(Q, K, V, is_causal=True)
print(out.shape)  # torch.Size([2, 8, 1024, 64])
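To see why tiling can still give the exact result, the following CPU sketch processes keys and values one block at a time and keeps a running softmax maximum and normaliser per query row. It illustrates the idea only and is not the fused GPU kernel: the function name `tiled_attention`, the block size, and the single-head 2D shapes are assumptions made for this example, and queries are not tiled here.

```python
import math
import torch

def standard_attention(q, k, v):
    """Reference: materialises the full (n, n) score matrix."""
    scores = q @ k.T / math.sqrt(q.shape[-1])
    return torch.softmax(scores, dim=-1) @ v

def tiled_attention(q, k, v, block=32):
    """Exact attention computed over key/value blocks with an online softmax."""
    n, d = q.shape
    scale = 1.0 / math.sqrt(d)
    out = torch.zeros_like(q)                       # unnormalised running output
    row_max = torch.full((n, 1), float("-inf"))     # running max of scores per row
    row_sum = torch.zeros(n, 1)                     # running softmax denominator
    for start in range(0, k.shape[0], block):
        kb, vb = k[start:start + block], v[start:start + block]
        s = (q @ kb.T) * scale                      # only an (n, block) tile exists
        new_max = torch.maximum(row_max, s.max(dim=-1, keepdim=True).values)
        correction = torch.exp(row_max - new_max)   # rescale earlier partial sums
        p = torch.exp(s - new_max)
        row_sum = row_sum * correction + p.sum(dim=-1, keepdim=True)
        out = out * correction + p @ vb
        row_max = new_max
    return out / row_sum

torch.manual_seed(0)
q, k, v = (torch.randn(128, 16) for _ in range(3))
ref = standard_attention(q, k, v)
tiled = tiled_attention(q, k, v, block=32)
print(torch.allclose(ref, tiled, atol=1e-5))
```

The rescaling step is what lets each tile be discarded after use, which is where the memory saving comes from.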
Q50
What is LoRA (Low-Rank Adaptation) and how is it used for efficient fine-
tuning of large Transformers?
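LoRA (Hu et al., 2021) fine-tunes large pre-trained Transformers by injecting trainable
low-rank decompositions into the weight matrices instead of updating all parameters:
$h = W x + \frac{\alpha}{r} B A x, \quad A \in \mathbb{R}^{r \times d_{\text{in}}},\; B \in \mathbb{R}^{d_{\text{out}} \times r},\; r \ll \min(d_{\text{in}}, d_{\text{out}})$
W is frozen; only A and B are trained. With r = 8, a 768 × 768 matrix goes from about
590k to about 12k trainable parameters (2 × 8 × 768 = 12,288, a 98% reduction), while
matching or approaching full fine-tuning quality.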
Code Python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, in_f, out_f, rank=8, alpha=16):
        super().__init__()
        self.linear = nn.Linear(in_f, out_f, bias=False)
        self.linear.weight.requires_grad = False             # frozen W
        self.A = nn.Parameter(torch.randn(rank, in_f) * 0.01)
        self.B = nn.Parameter(torch.zeros(out_f, rank))      # zero init: update starts at 0
        self.scale = alpha / rank

    def forward(self, x):
        base = self.linear(x)                    # W x
        lora = (x @ self.A.T) @ self.B.T         # B A x
        return base + self.scale * lora

layer = LoRALinear(768, 768, rank=8)
x = torch.rand(2, 10, 768)
print(layer(x).shape)  # torch.Size([2, 10, 768])
trainable = sum(p.numel() for p in layer.parameters()
                if p.requires_grad)
print(f"Trainable params: {trainable:,}")  # 12,288
» Lamhot Siagian
AI Engineering Insider
Advanced Deep Learning — From Words to Witness:
The Rise of Transformers & Vision Transformers (ViT)