Attention Mechanism
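[Figure: encoder-decoder with hidden states h1 to h4 and h0 to h4; attention weights α21 to α24 combine the hidden states into the context vector c2; decoder sequence: <start> मुझे नहीं पता <end> ("I do not know" in Hindi).]
Compute context vector: $c_2 = \alpha_{21} h_1 + \alpha_{22} h_2 + \alpha_{23} h_3 + \alpha_{24} h_4$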
The attention mechanism enables models to
selectively focus on relevant parts of the input
sequence while generating each element of the output
sequence, improving accuracy and capturing long-
range dependencies in tasks like machine translation.
Example input sentence: "I do not know"
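As a rough illustration of the context vector on this slide, the sketch below forms c2 as a weighted sum of four hidden states. The hidden-state values and the attention weights are made-up numbers chosen only so the arithmetic is easy to follow; they do not come from the slides.

```python
import numpy as np

# Hypothetical encoder hidden states h1..h4 (4 tokens, 3 dimensions each).
H = np.array([[1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0],
              [0.0, 0.0, 1.0],
              [1.0, 1.0, 1.0]])

# Hypothetical attention weights alpha_21..alpha_24 for decoder step 2.
# They are non-negative and sum to 1.
alpha_2 = np.array([0.1, 0.6, 0.2, 0.1])

# Context vector: c2 = alpha_21*h1 + alpha_22*h2 + alpha_23*h3 + alpha_24*h4
c2 = alpha_2 @ H
print(c2)
```

The largest weight (0.6 on h2) dominates the result, which is the sense in which the decoder "focuses" on one input position.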
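Language translation with Attention Mechanism
Input: Sequence
Output: Sequence
Encoder: each hidden state $h^e_i$ is computed from the current input token and the previous state $h^e_{i-1}$.
Alignment: the scores $e_{t,i}$ compare the previous decoder state $h^d_{t-1}$ with each encoder state $h^e_i$, where the scoring function is an MLP; normalizing the scores gives the attention weights $a_{t,i}$.
Decoder: each hidden state $h^d_t$ is computed from the previous output token, the previous state $h^d_{t-1}$, and a context vector, where without attention the context vector c is often the final encoder state.
[Figure, decoding step 1: the encoder reads "I do not know" into states $h^e_0 \dots h^e_3$; alignment scores $e_{1,0} \dots e_{1,3}$ become attention weights $a_{1,0} \dots a_{1,3}$, which weight (⨂) the encoder states to give the context $c_1$; from <start> and $c_1$ the decoder moves from $h^d_0$ to $h^d_1$ and outputs "मुझे".]
credit: from slide of Vikas sir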
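One decoding step of this scheme can be sketched as follows. This is a minimal sketch of additive (MLP-scored) attention, not the slide author's code: the one-hidden-layer form of the MLP, the names `W1`, `W2`, `v`, and all sizes are assumptions made for illustration.

```python
import numpy as np

def softmax(x):
    z = np.exp(x - np.max(x))  # subtract the max for numerical stability
    return z / z.sum()

def attention_step(h_enc, h_dec_prev, W1, W2, v):
    """One decoder step of additive attention.

    h_enc:      (T, D) encoder hidden states
    h_dec_prev: (D,)   previous decoder hidden state
    W1, W2, v:  parameters of a hypothetical one-hidden-layer scoring MLP
    """
    # Alignment scores e_{t,i}: one scalar per encoder position.
    e = np.tanh(h_enc @ W1 + h_dec_prev @ W2) @ v   # shape (T,)
    a = softmax(e)                                   # attention weights, sum to 1
    c = a @ h_enc                                    # context vector, shape (D,)
    return a, c

rng = np.random.default_rng(0)
T, D, hidden = 4, 8, 16            # 4 input tokens: "I do not know"
h_enc = rng.normal(size=(T, D))
h_dec_prev = rng.normal(size=D)
W1 = rng.normal(size=(D, hidden))
W2 = rng.normal(size=(D, hidden))
v = rng.normal(size=hidden)

a, c = attention_step(h_enc, h_dec_prev, W1, W2, v)
print(a, c.shape)
```

At the next step the same function would be called again with the updated decoder state, giving a new set of weights and a new context vector.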
Language translation with Attention Mechanism
[Figure, decoding steps 2 to 4; each step repeats the same computation with the latest decoder state:
• Step 2: scores $e_{2,0} \dots e_{2,3}$, weights $a_{2,0} \dots a_{2,3}$, context $c_2$; decoder inputs "<start> मुझे", new state $h^d_2$, output so far "मुझे नहीं".
• Step 3: scores $e_{3,0} \dots e_{3,3}$, weights $a_{3,0} \dots a_{3,3}$, context $c_3$; decoder inputs "<start> मुझे नहीं", new state $h^d_3$, output so far "मुझे नहीं पता".
• Step 4: scores $e_{4,0} \dots e_{4,3}$, weights $a_{4,0} \dots a_{4,3}$, context $c_4$; decoder inputs "<start> मुझे नहीं पता", new state $h^d_4$, output so far "मुझे नहीं पता <end>".]
credit: from slide of Vikas sir
Word Embedding
After an embedding model is trained on a large corpus of text data, each word is mapped to a single fixed vector.
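Problem with Word Embedding
Apple = [taste, technology]
1. An apple a day keeps the doctor away. [0.6, 0]
2. Apple is healthy. [0.7, 0]
3. Apple is better than orange. [0.8, 0]
4. Apple makes great phones. [0.75, 0.2]
[Figure: the vectors plotted on Taste and Technology axes.]
A static embedding gives "Apple" one vector for every context, so the fruit sense and the company sense end up sharing the same representation.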
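Self attention
Inputs:
• Input vectors: $x_i$'s (shape: D)
Operations:
• Key vectors: $k_i = x_i W_k$
• Value vectors: $v_i = x_i W_v$
• Query vectors: $q_i = x_i W_q$
• Alignment: $e_{i,j} = q_j \cdot k_i / \sqrt{D}$
• Attention: $a_{:,j} = \mathrm{softmax}(e_{:,j})$, normalized over the keys $i$ for each query $j$
• Output: $y_j = \sum_i a_{i,j} v_i$ (mul + add)
Outputs:
• Context vectors: $y_1, \dots, y_T$ (each of shape D)
Permutation equivariant: reordering the inputs only reorders the outputs in the same way, so the layer has no notion of position.
Problem: how can we encode ordered sequences like language or spatially ordered image features?
[Figure: input vectors $x_1 \dots x_T$ give keys $k_1 \dots k_T$, values $v_1 \dots v_T$, and queries $q_1 \dots q_T$; the alignment grid $e_{1,1} \dots e_{T,T}$ passes through a softmax to give the attention grid $a_{1,1} \dots a_{T,T}$, which weights the values to produce the outputs $y_1 \dots y_T$.]
credit: from slide of Vikas sir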
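The operations named on this slide (keys, values, queries, alignment, softmax attention, output) can be sketched in a few lines of NumPy. This is a minimal single-head sketch under assumed shapes and random weights; the scaling by the square root of the key dimension follows the usual scaled dot-product form. Note that the code stores the scores with queries along rows, so `E[j, i]` holds the score of query j against key i. The last lines check the ordering property raised on the slide: permuting the inputs simply permutes the outputs.

```python
import numpy as np

def softmax(x, axis=-1):
    z = np.exp(x - x.max(axis=axis, keepdims=True))
    return z / z.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """X: (T, D) input vectors; returns (T, D_v) output vectors."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    E = Q @ K.T / np.sqrt(K.shape[-1])  # E[j, i] = q_j . k_i / sqrt(D)
    A = softmax(E, axis=-1)             # each query's weights over keys sum to 1
    return A @ V                        # weighted sum of value vectors

rng = np.random.default_rng(0)
T, D = 3, 4                             # assumed sizes for illustration
X = rng.normal(size=(T, D))
Wq, Wk, Wv = (rng.normal(size=(D, D)) for _ in range(3))

Y = self_attention(X, Wq, Wk, Wv)

# Reordering the inputs reorders the outputs in the same way.
perm = np.array([2, 0, 1])
Y_perm = self_attention(X[perm], Wq, Wk, Wv)
print(np.allclose(Y_perm, Y[perm]))
```

Because the layer itself cannot tell one ordering from another, order information has to be supplied separately, for example through positional encodings added to the inputs.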
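Self attention
Example: "How Are You"
[Figure: each word embedding $e_{how}$, $e_{are}$, $e_{you}$ is multiplied by $W_q$, $W_k$, $W_v$ to give a query, key, and value vector per word.]
[Figure: each query ($q_{how}$, $q_{are}$, $q_{you}$) is compared with all three keys; a softmax turns each set of scores into weights $W_{11} \dots W_{13}$, $W_{21} \dots W_{23}$, $W_{31} \dots W_{33}$; the weights multiply the value vectors and the products are summed:]
$y_{how} = W_{11} v_{how} + W_{12} v_{are} + W_{13} v_{you}$
$y_{are} = W_{21} v_{how} + W_{22} v_{are} + W_{23} v_{you}$
$y_{you} = W_{31} v_{how} + W_{32} v_{are} + W_{33} v_{you}$
Image credit: [Link]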
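Multihead attention
“The man saw the astronomer with a telescope.”
The sentence is ambiguous (who has the telescope?); with several heads, each head can attend to a different relation between the words.
[Figure: embeddings (labelled E_bank and E_money) are projected by two heads, each with its own matrices: $W_q^1, W_k^1, W_v^1$ for head 1 and $W_q^2, W_k^2, W_v^2$ for head 2, giving per-head query, key, and value vectors ($q_{how}^1$, k, v and $q_{how}^2$, k, v).]
Image credit: [Link]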
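A two-head version of the idea on this slide might look like the sketch below. Each head has its own query, key, and value matrices; the head outputs are concatenated and mixed by an output projection `Wo`. The output projection, the head size, and all weights are assumptions taken from the usual Transformer formulation rather than from the slide.

```python
import numpy as np

def softmax(x, axis=-1):
    z = np.exp(x - x.max(axis=axis, keepdims=True))
    return z / z.sum(axis=axis, keepdims=True)

def multi_head_attention(X, heads, Wo):
    """heads: list of (Wq, Wk, Wv) tuples, one per head."""
    outputs = []
    for Wq, Wk, Wv in heads:
        Q, K, V = X @ Wq, X @ Wk, X @ Wv
        A = softmax(Q @ K.T / np.sqrt(K.shape[-1]))  # per-head attention weights
        outputs.append(A @ V)
    # Concatenate the heads along the feature axis, then project back to D.
    return np.concatenate(outputs, axis=-1) @ Wo

rng = np.random.default_rng(0)
T, D, n_heads = 3, 8, 2               # "how are you", assumed model size 8
d_head = D // n_heads
X = rng.normal(size=(T, D))
heads = [tuple(rng.normal(size=(D, d_head)) for _ in range(3))
         for _ in range(n_heads)]
Wo = rng.normal(size=(n_heads * d_head, D))

Z = multi_head_attention(X, heads, Wo)
print(Z.shape)
```

Since the heads use different projections, their attention patterns over the same sentence generally differ.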
Add and Norm
• Layer normalization is used in the Transformer.
• Normalization stabilizes the training process.
• The residual connection allows the model to learn more effectively without vanishing gradients.
[Figure, bottom to top: the inputs X1, X2, X3 ("how are you") pass through multi-head attention to give z1, z2, z3; the residual connection adds X1, X2, X3 back to give Z1, Z2, Z3; layer normalization then outputs Z1-norm, Z2-norm, Z3-norm.]
Image credit: [Link]
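The residual connection followed by layer normalization can be sketched as below. This is a simplified sketch: the learnable gain and bias that layer normalization normally carries are omitted, and the inputs are random stand-ins for X1..X3 and the attention outputs z1..z3.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each token vector to zero mean and unit variance
    # across its feature dimension (no learnable gain/bias here).
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def add_and_norm(x, sublayer_out):
    # Residual connection (x + sublayer output), then layer normalization.
    return layer_norm(x + sublayer_out)

rng = np.random.default_rng(0)
X = rng.normal(size=(3, 8))   # stand-in for X1, X2, X3
z = rng.normal(size=(3, 8))   # stand-in for the attention outputs z1, z2, z3

Z_norm = add_and_norm(X, z)
print(Z_norm.mean(axis=-1), Z_norm.std(axis=-1))
```

Each row of the result has mean close to 0 and standard deviation close to 1, which is the stabilizing effect the bullets refer to.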
Feed Forward Network
• First layer with 2048 neurons and a ReLU activation function.
• Second layer with 512 neurons and a linear activation function.
• The ReLU activation in the first layer introduces non-linearities into the model.
• This allows the FFNN to learn more complex patterns than it could with a simple linear transformation.
[Figure, bottom to top: Z1-norm, Z2-norm, Z3-norm; 512×2048 weights; 2048 neurons with ReLU activation; 2048×512 weights; 512 neurons with linear activation; outputs Y1, Y2, Y3.]
Image credit: [Link]
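The two layers described above can be sketched as follows, using the 512 and 2048 sizes from the slide. The weight initialization and the random inputs are placeholders chosen for illustration.

```python
import numpy as np

def feed_forward(x, W1, b1, W2, b2):
    hidden = np.maximum(0.0, x @ W1 + b1)  # first layer: 2048 units, ReLU
    return hidden @ W2 + b2                # second layer: 512 units, linear

rng = np.random.default_rng(0)
d_model, d_ff = 512, 2048
W1 = rng.normal(scale=0.02, size=(d_model, d_ff))  # the 512*2048 weights
b1 = np.zeros(d_ff)
W2 = rng.normal(scale=0.02, size=(d_ff, d_model))  # the 2048*512 weights
b2 = np.zeros(d_model)

Z_norm = rng.normal(size=(3, d_model))  # stand-in for Z1-norm, Z2-norm, Z3-norm
Y = feed_forward(Z_norm, W1, b1, W2, b2)
print(Y.shape)
```

The same weights are applied to every position independently, so the three token vectors do not interact inside this block.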
Masked multi-head attention
• The decoder works in an auto-regressive manner, meaning it generates each token in a sequence by using the previously generated tokens.
• During training, the decoder doesn't follow the auto-regressive approach:
• If we were to treat the training process as fully auto-regressive, similar to inference, it would slow down the entire process.
• Instead of predicting each word one by one, we can parallelize the entire process.
• Prevent vectors from looking at future vectors.
• Manually set the alignment scores of future positions to -infinity, so that their attention weights become zero after the softmax.
Image credit: from slide of Vikas sir
Image credit: [Link]
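The masking step can be sketched as below: scores for future positions are set to -infinity before the softmax, so each position only attends to itself and earlier positions. Shapes and random inputs are assumptions for illustration, and a single head is shown for brevity.

```python
import numpy as np

def softmax(x, axis=-1):
    z = np.exp(x - x.max(axis=axis, keepdims=True))
    return z / z.sum(axis=axis, keepdims=True)

def masked_self_attention(Q, K, V):
    T = Q.shape[0]
    E = Q @ K.T / np.sqrt(K.shape[-1])                  # E[j, i]: query j vs key i
    future = np.triu(np.ones((T, T), dtype=bool), k=1)  # True where i > j
    E = np.where(future, -np.inf, E)                    # hide future positions
    A = softmax(E, axis=-1)                             # masked entries become 0
    return A @ V, A

rng = np.random.default_rng(0)
T, D = 4, 8
Q, K, V = (rng.normal(size=(T, D)) for _ in range(3))

Y, A = masked_self_attention(Q, K, V)
print(np.round(A, 2))
```

The printed weight matrix is lower triangular, so all target positions can be processed in parallel during training without any position seeing its own future.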
Cross attention
• Cross attention identifies connections between two sequences.
• It generates query vectors from the output sequence (Hindi), while key and value vectors are derived from the input sequence (English).
• This process helps the model determine how strongly each word of the output sequence (Hindi) relates to each word of the input sequence (English).
Image credit: [Link]
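A minimal sketch of cross attention follows: queries come from the decoder side (Hindi), keys and values from the encoder side (English). The sequence lengths, model size, and random inputs are assumptions made for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    z = np.exp(x - x.max(axis=axis, keepdims=True))
    return z / z.sum(axis=axis, keepdims=True)

def cross_attention(X_dec, X_enc, Wq, Wk, Wv):
    Q = X_dec @ Wq                      # queries from the output sequence
    K, V = X_enc @ Wk, X_enc @ Wv       # keys and values from the input sequence
    A = softmax(Q @ K.T / np.sqrt(K.shape[-1]), axis=-1)
    return A @ V, A

rng = np.random.default_rng(0)
D = 8
X_enc = rng.normal(size=(4, D))   # 4 English tokens, e.g. "I do not know"
X_dec = rng.normal(size=(3, D))   # 3 Hindi tokens generated so far
Wq, Wk, Wv = (rng.normal(size=(D, D)) for _ in range(3))

Y, A = cross_attention(X_dec, X_enc, Wq, Wk, Wv)
print(A.shape)
```

Row j of `A` gives how strongly Hindi token j attends to each of the four English tokens.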
Autoformer: Decomposition Transformers with
Auto-Correlation for Long-Term Series
Forecasting
• Published in 2021
• Conference on Neural Information Processing Systems (NeurIPS 2021) – A*
• Datasets used: ETT, Electricity, Exchange, Traffic, Weather, and ILI (influenza-like illness)
Decomposition Layer
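• Autoformer incorporates decomposition into the Transformer architecture.
• The encoder and decoder use a decomposition block to aggregate the trend-cyclical part and extract the seasonal part from the series progressively.
• For an input series $\mathcal{X}$ with length L, the decomposition layer returns the trend-cyclical part $\mathcal{X}_t$ and the seasonal part $\mathcal{X}_s$, defined as:
$\mathcal{X}_t = \mathrm{AvgPool}(\mathrm{Padding}(\mathcal{X})), \quad \mathcal{X}_s = \mathcal{X} - \mathcal{X}_t$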
Image credit: Autoformer: Decomposition Transformers with Auto-Correlation for Long-Term Series Forecasting
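The decomposition can be sketched for a one-dimensional series as a moving average (the trend-cyclical part) and the remainder (the seasonal part). This is a simplified NumPy sketch, not the authors' implementation: the edge padding by repeating the first and last values, the odd kernel size of 5, and the synthetic series are all assumptions.

```python
import numpy as np

def series_decomp(x, kernel_size=5):
    """Split a 1-D series into (seasonal, trend) with a moving average.

    kernel_size is assumed odd so the output keeps the input length.
    """
    pad = (kernel_size - 1) // 2
    # Pad both ends by repeating the edge values, then average.
    padded = np.concatenate([np.repeat(x[:1], pad), x, np.repeat(x[-1:], pad)])
    kernel = np.ones(kernel_size) / kernel_size
    trend = np.convolve(padded, kernel, mode="valid")
    seasonal = x - trend
    return seasonal, trend

# Synthetic series: a linear trend plus a period-12 oscillation.
t = np.arange(48)
x = 0.1 * t + np.sin(2 * np.pi * t / 12)

seasonal, trend = series_decomp(x, kernel_size=5)
print(trend[:5], seasonal[:5])
```

By construction the two parts add back up to the original series, so the block only separates the signal and discards nothing.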
Attention (Autocorrelation) Mechanism
• Autoformer employs a novel auto-correlation mechanism, which replaces self-attention.
• In Autoformer, attention weights are computed in the frequency domain (using the fast Fourier transform) and aggregated by time delay.
Image credit: Autoformer: Decomposition Transformers with Auto-Correlation for Long-Term Series Forecasting
Frequency Domain Attention
• Given a time lag $\tau$, the autocorrelation of a single discrete variable $y$ measures the "relationship" between the variable's current value at time $t$ and its past value at time $t-\tau$:
$\mathrm{Autocorrelation}(\tau) = \mathrm{Corr}(y_t, y_{t-\tau})$
• Using autocorrelation, Autoformer extracts frequency-based dependencies from the queries and keys, instead of the standard dot-product between them.
• Computing autocorrelation with the FFT is justified by the Wiener–Khinchin theorem.
Image credit: Autoformer: Decomposition Transformers with Auto-Correlation for Long-Term Series Forecasting
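The FFT route can be sketched as follows: multiply the spectrum of the queries by the complex conjugate of the spectrum of the keys and transform back. This is a minimal one-dimensional sketch that returns unnormalized, circular correlations; the synthetic period-8 signal and the function name are assumptions. The loop at the end recomputes the same quantity directly in the time domain as a check.

```python
import numpy as np

def autocorrelation_fft(q, k):
    """Circular correlation r[tau] = sum_t q[t] * k[t - tau], computed via FFT."""
    q_f = np.fft.rfft(q)
    k_f = np.fft.rfft(k)
    return np.fft.irfft(q_f * np.conj(k_f), n=len(q))

# Synthetic series with period 8; queries and keys are the same series here.
L = 32
t = np.arange(L)
y = np.sin(2 * np.pi * t / 8)

r = autocorrelation_fft(y, y)

# Direct time-domain computation for comparison.
r_direct = np.array([np.sum(y * np.roll(y, tau)) for tau in range(L)])
print(np.allclose(r, r_direct))
```

For this signal the correlation peaks at delays 0, 8, 16, and 24, which are the multiples of the period; this is the kind of period-based dependency the mechanism looks for.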
Time Delay Aggregation
• Time delay aggregation starts from the autocorrelations $\mathcal{R}_{Q,K}(\tau)$ (referred to as attn_weights).
• $V$ is aligned by calculating its value for each time delay $\tau_1, \tau_2, \dots, \tau_k$, which is also known as Rolling.
• Subsequently, we conduct element-wise multiplication between the aligned $V$ and the autocorrelations.
• In the figure, the left side shows the rolling of $V$ by time delay, while the right side illustrates the element-wise multiplication with the autocorrelations.
Image credit: Autoformer: Decomposition Transformers with Auto-Correlation for Long-Term Series Forecasting
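The rolling and multiplication steps can be sketched for a one-dimensional value series as below. Several details are assumptions based on one reading of the Autoformer paper rather than on this slide: only the top-k delays are kept, their autocorrelations are passed through a softmax before weighting, and the roll direction moves elements shifted past the first position to the end. The input numbers are made up.

```python
import numpy as np

def time_delay_aggregation(v, corr, top_k=2):
    """Aggregate rolled copies of v, weighted by the strongest autocorrelations."""
    delays = np.argsort(corr)[-top_k:]               # indices of the top-k delays
    weights = np.exp(corr[delays] - corr[delays].max())
    weights = weights / weights.sum()                # softmax over the kept delays
    out = np.zeros_like(v, dtype=float)
    for tau, w in zip(delays, weights):
        out += w * np.roll(v, -tau)                  # align v by delay tau (Rolling)
    return out, delays, weights

v = np.arange(6.0)                                   # made-up value series
corr = np.array([0.1, 0.2, 0.9, 0.3, 0.0, 0.4])      # made-up autocorrelations

out, delays, weights = time_delay_aggregation(v, corr, top_k=2)
print(delays, weights, out)
```

With `top_k=1` the result is simply `v` rolled by the single strongest delay, which makes the role of each term in the sum easy to inspect.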