Autoformer: Long-Term Forecasting Method

The document discusses the attention mechanism in neural networks, particularly its application in language translation and self-attention. It explains how attention allows models to focus on relevant input parts, improving accuracy in tasks like machine translation. Additionally, it introduces concepts like multi-head attention, cross attention, and the Autoformer architecture for long-term series forecasting, highlighting the use of autocorrelation in attention mechanisms.

Attention Mechanism

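The attention mechanism enables models to selectively focus on relevant parts of the input sequence while generating each element of the output sequence, improving accuracy and capturing long-range dependencies in tasks like machine translation.

[Figure: English-to-Hindi translation of "i do not know". The decoder receives "<start> मुझे नहीं पता" and emits "मुझे नहीं पता <end>". To compute the context vector c2, the attention weights α21, α22, α23, α24 over the states h1, h2, h3, h4 are combined by a weighted sum (+). A second row of states is labelled h0, h1, h2, h3, h4.]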
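Language translation with Attention Mechanism

Input: Sequence
Output: Sequence
Encoder: [formula lost in extraction], where the function is an MLP
Decoder: [formula lost in extraction], where context vector c is often [remainder lost in extraction]

[Figure, decoding step 1: encoder states h^e_0, h^e_1, h^e_2, h^e_3 over the input "I do not know"; alignment scores e_{1,0}, e_{1,1}, e_{1,2}, e_{1,3}; attention weights a_{1,0}, a_{1,1}, a_{1,2}, a_{1,3}; context vector c_1; decoder states h^d_0, h^d_1; decoder input "start"; output "मुझे".]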
credit: from slide of Vikas sir
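Language translation with Attention Mechanism

[Figure, decoding step 2: encoder states h^e_0, h^e_1, h^e_2, h^e_3 over the input "I do not know"; alignment scores e_{2,0}, e_{2,1}, e_{2,2}, e_{2,3}; attention weights a_{2,0}, a_{2,1}, a_{2,2}, a_{2,3}; context vector c_2; decoder states h^d_0, h^d_1, h^d_2; decoder inputs "start", "मुझे"; output so far "मुझे नहीं".]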
credit: from slide of Vikas sir
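Language translation with Attention Mechanism

[Figure, decoding step 3: encoder states h^e_0, h^e_1, h^e_2, h^e_3 over the input "I do not know"; alignment scores e_{3,0}, e_{3,1}, e_{3,2}, e_{3,3}; attention weights a_{3,0}, a_{3,1}, a_{3,2}, a_{3,3}; context vector c_3; decoder states h^d_0, h^d_1, h^d_2, h^d_3; decoder inputs "start", "मुझे", "नहीं"; output so far "मुझे नहीं पता".]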
credit: from slide of Vikas sir
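Language translation with Attention Mechanism

[Figure, decoding step 4: encoder states h^e_0, h^e_1, h^e_2, h^e_3 over the input "I do not know"; alignment scores e_{4,0}, e_{4,1}, e_{4,2}, e_{4,3}; attention weights a_{4,0}, a_{4,1}, a_{4,2}, a_{4,3}; context vector c_4; decoder states h^d_0, h^d_1, h^d_2, h^d_3, h^d_4; decoder inputs "start", "मुझे", "नहीं", "पता"; final output "मुझे नहीं पता end".]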
credit: from slide of Vikas sir
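
One decoding step of the kind drawn in the slides above can be sketched in NumPy. This is only an illustration under assumptions: it uses an additive (one-hidden-layer MLP) scoring function, and the weight names W_e, W_d, v and all dimensions are hypothetical and randomly initialised rather than taken from the slides.

```python
import numpy as np

def softmax(x):
    # Subtract the max for numerical stability before exponentiating.
    z = np.exp(x - np.max(x))
    return z / z.sum()

def attention_step(enc_states, dec_state, W_e, W_d, v):
    """Return (attention weights, context vector) for one decoder step.

    enc_states: (T, H) encoder hidden states, one row per source token
    dec_state:  (H,)   previous decoder hidden state
    W_e, W_d, v: hypothetical MLP parameters (assumed, not from the slides)
    """
    # Alignment scores: a small MLP applied to each (encoder state, decoder state) pair.
    scores = np.tanh(enc_states @ W_e + dec_state @ W_d) @ v   # (T,)
    weights = softmax(scores)        # attention weights, sum to 1
    context = weights @ enc_states   # weighted sum of encoder states, (H,)
    return weights, context

rng = np.random.default_rng(0)
T, H, A = 4, 8, 16                   # 4 source tokens: "I do not know"
enc_states = rng.normal(size=(T, H))
dec_state = rng.normal(size=H)
W_e = rng.normal(size=(H, A))
W_d = rng.normal(size=(H, A))
v = rng.normal(size=A)

weights, context = attention_step(enc_states, dec_state, W_e, W_d, v)
print(weights.round(3), context.shape)
```

Repeating this step with the updated decoder state gives c_1, c_2, c_3, c_4 in turn, one context vector per generated Hindi token.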
Word Embedding

Word embeddings are obtained after a model is trained on a large corpus of text data.
Problem with Word Embedding

Apple = [taste, technology]

1. An apple a day keeps the doctor away. [0.6, 0]
2. Apple is healthy. [0.7, 0]
3. Apple is better than orange. [0.8, 0]
4. Apple makes great phones. [0.75, 0.2]

[Figure: 2-D plot with axes Taste and Technology.]
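Self attention

Inputs:
• Input vectors: x_i's (shape: D)

Operations:
• Key vectors: k_i = W_k x_i
• Value vectors: v_i = W_v x_i
• Query: q_j = W_q x_j
• Alignment: e_{i,j} = k_i · q_j
• Attention: a_{:,j} = softmax(e_{:,j})
• Output: y_j = Σ_i a_{i,j} v_i (mul + add)

Outputs:
• context vector: c (shape: D)

Permutation equivariant: reordering the input vectors reorders the outputs in the same way, so self-attention by itself has no notion of order.

Problem: how can we encode ordered sequences like language or spatially ordered image features?

[Figure: input vectors x_1 ... x_T produce keys k_1 ... k_T, values v_1 ... v_T, and queries q_1 ... q_T. The alignment scores e_{1,1} ... e_{T,T} pass through a softmax to give attention weights a_{1,1} ... a_{T,T}, which are multiplied with the values and added to give the outputs y_1 ... y_T.]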
credit: from slide of Vikas sir
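
The operations on this slide can be sketched in a few lines of NumPy. This is a minimal illustration, assuming unscaled dot-product alignment as listed above (the original Transformer additionally divides the scores by the square root of the key dimension); the weight matrices are random placeholders, and rows of E are indexed by query rather than by key as in the figure.

```python
import numpy as np

def softmax(x, axis=-1):
    z = np.exp(x - x.max(axis=axis, keepdims=True))
    return z / z.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    """X: (T, D) input vectors; returns (T, D_v) outputs."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v   # queries, keys, values
    E = Q @ K.T                           # E[j, i] = q_j . k_i (alignment)
    A = softmax(E, axis=-1)               # each query's weights over all keys
    return A @ V                          # y_j = sum_i A[j, i] * v_i

rng = np.random.default_rng(0)
T, D = 3, 4
X = rng.normal(size=(T, D))
W_q, W_k, W_v = (rng.normal(size=(D, D)) for _ in range(3))

Y = self_attention(X, W_q, W_k, W_v)

# Reordering the inputs reorders the outputs in the same way.
perm = [2, 0, 1]
Y_perm = self_attention(X[perm], W_q, W_k, W_v)
print(np.allclose(Y_perm, Y[perm]))
```

The final check illustrates why positional information has to be added separately: the layer treats its inputs as an unordered set.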


Self attention

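[Figure: the words "How", "Are", "You" are embedded as e_how, e_are, e_you. Each embedding is multiplied by W_q, W_k, and W_v to give that word's query, key, and value vectors (q_how, q_are, q_you, each with its own k and v).]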


Self attention

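[Figure: each query (q_how, q_are, q_you) is paired with all three keys k. A softmax over each query's three scores gives the weights W11, W12, W13 (for "how"), W21, W22, W23 (for "are"), and W31, W32, W33 (for "you"). Each weight multiplies the corresponding value vector v, and the products are summed (+) to give y_how, y_are, y_you.]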
Image credit: [Link]
Multihead attention
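“The man saw the astronomer with a telescope.”

The sentence is ambiguous: either the man used the telescope, or the astronomer had it. Separate attention heads can attend to different relationships between the same words.

[Figure: embeddings (labelled E_bank and E_money) are each projected by two heads with separate weights, W_q^1, W_k^1, W_v^1 and W_q^2, W_k^2, W_v^2, giving one query, key, and value per head, e.g. q_how1, k, v and q_how2, k, v.]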
Image credit: [Link]
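
A two-head version of the computation can be sketched as below. This is a minimal illustration under assumptions: the scaling by the square root of the key dimension and the output projection W_o follow the standard Transformer design and are not shown on the slide, and all weights are random placeholders.

```python
import numpy as np

def softmax(x, axis=-1):
    z = np.exp(x - x.max(axis=axis, keepdims=True))
    return z / z.sum(axis=axis, keepdims=True)

def multi_head_attention(X, heads, W_o):
    """X: (T, D); heads: list of (W_q, W_k, W_v), one tuple per head."""
    outputs = []
    for W_q, W_k, W_v in heads:
        Q, K, V = X @ W_q, X @ W_k, X @ W_v
        A = softmax(Q @ K.T / np.sqrt(K.shape[-1]))   # per-head attention weights
        outputs.append(A @ V)                          # per-head output, (T, d_head)
    # Concatenate the heads and mix them with the (assumed) output projection.
    return np.concatenate(outputs, axis=-1) @ W_o

rng = np.random.default_rng(0)
T, D, n_heads = 3, 8, 2            # 3 tokens: "how are you"
d_head = D // n_heads
X = rng.normal(size=(T, D))
heads = [tuple(rng.normal(size=(D, d_head)) for _ in range(3))
         for _ in range(n_heads)]
W_o = rng.normal(size=(n_heads * d_head, D))

Z = multi_head_attention(X, heads, W_o)
print(Z.shape)
```

Because each head has its own W_q, W_k, and W_v, the heads can assign different attention weights to the same pair of words.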
Add and Norm

• Layer normalization is used in the Transformer.
• Normalization stabilizes the training process.
• The residual connection allows the model to learn more effectively without vanishing gradients.

[Figure: the inputs X1, X2, X3 ("how", "are", "you") pass through multi-head attention to give z1, z2, z3. The residual connection adds X1, X2, X3 to these outputs (+) to give Z1, Z2, Z3, and layer normalization then produces Z1-norm, Z2-norm, Z3-norm.]
Image credit: [Link]
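
The add-and-norm step can be sketched as follows. This is a simplified illustration: the learnable gain and bias that layer normalization normally carries are omitted, and the attention output is a random stand-in.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each token vector (last axis) to zero mean and unit variance.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def add_and_norm(x, sublayer_out):
    # Residual connection (add) followed by layer normalization (norm).
    return layer_norm(x + sublayer_out)

rng = np.random.default_rng(0)
X = rng.normal(size=(3, 8))        # X1, X2, X3
z = rng.normal(size=(3, 8))        # stand-in for the multi-head attention output
Z_norm = add_and_norm(X, z)
print(Z_norm.mean(axis=-1).round(6), Z_norm.std(axis=-1).round(3))
```

The residual term x passes through the addition unchanged, which gives gradients a direct path around the attention sublayer.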
Feed Forward Network

• First layer with 2048 neurons and a ReLU activation function.
• Second layer with 512 neurons and a linear activation function.
• The ReLU activation in the first layer introduces non-linearities into the model.
• This allows the FFNN to learn more complex patterns than it could with a simple linear transformation.

[Figure: Z1-norm, Z2-norm, Z3-norm pass through a 512×2048 weight matrix into 2048 neurons with ReLU activation, then through a 2048×512 weight matrix into 512 neurons with linear activation, giving Y1, Y2, Y3.]
Image credit: [Link]
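
The two layers described above can be sketched directly from the stated sizes (512 → 2048 → 512). The weights and biases here are random placeholders, and the 0.02 initialisation scale is an arbitrary choice for the illustration.

```python
import numpy as np

def feed_forward(z, W1, b1, W2, b2):
    """Position-wise feed-forward network applied to each token vector."""
    hidden = np.maximum(0.0, z @ W1 + b1)   # first layer: 2048 units, ReLU
    return hidden @ W2 + b2                 # second layer: 512 units, linear

rng = np.random.default_rng(0)
d_model, d_ff = 512, 2048
W1 = rng.normal(scale=0.02, size=(d_model, d_ff))   # 512 x 2048
b1 = np.zeros(d_ff)
W2 = rng.normal(scale=0.02, size=(d_ff, d_model))   # 2048 x 512
b2 = np.zeros(d_model)

Z_norm = rng.normal(size=(3, d_model))   # Z1-norm, Z2-norm, Z3-norm
Y = feed_forward(Z_norm, W1, b1, W2, b2)
print(Y.shape)
```

Without the ReLU, the two matrix products would collapse into a single linear map, which is the point the last bullet makes.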
Masked multi-head attention

• The decoder works in an auto-regressive manner, meaning it generates each token in a sequence using the tokens generated so far.
• During training, the decoder does not follow the auto-regressive approach.
• If we were to treat the training process as fully auto-regressive, similar to inference, it would slow down the entire process.
• Instead of predicting each word one by one, we can parallelize the entire process.
• To keep this valid, prevent vectors from looking at future vectors.
• Manually set the alignment scores for future positions to -infinity.

Image credit: from slide of Vikas sir


Image credit: [Link]
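
The masking step can be sketched as below: alignment scores above the diagonal (future positions) are set to -infinity before the softmax, so their attention weights become exactly zero. The score matrix is a random placeholder, and the function name is hypothetical.

```python
import numpy as np

def causal_attention_weights(scores):
    """scores: (T, T) alignment scores, row = query position, column = key position."""
    T = scores.shape[0]
    future = np.triu(np.ones((T, T), dtype=bool), k=1)   # True where key is in the future
    masked = np.where(future, -np.inf, scores)
    # Row-wise softmax; exp(-inf) evaluates to 0, so future positions get zero weight.
    z = np.exp(masked - masked.max(axis=-1, keepdims=True))
    return z / z.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
scores = rng.normal(size=(4, 4))
A = causal_attention_weights(scores)
print(A.round(3))
```

All rows are computed in one pass, which is what lets training run in parallel while still matching what the decoder would see step by step at inference time.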
Cross attention

• Cross attention identifies connections between two sequences.
• It generates query vectors from the output sequence (Hindi), while key and value vectors are derived from the input sequence (English).
• This process helps the model determine how similar or related words from the output sequence (Hindi) are to words from the input sequence (English).

Image credit: [Link]
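
Cross attention differs from self attention only in where the queries, keys, and values come from, as the sketch below shows. It is a minimal illustration: the scaling by the square root of the key dimension follows the standard Transformer and is an assumption here, and the embeddings and weights are random placeholders.

```python
import numpy as np

def softmax(x, axis=-1):
    z = np.exp(x - x.max(axis=axis, keepdims=True))
    return z / z.sum(axis=axis, keepdims=True)

def cross_attention(dec_X, enc_X, W_q, W_k, W_v):
    Q = dec_X @ W_q          # queries from the output (Hindi) sequence
    K = enc_X @ W_k          # keys from the input (English) sequence
    V = enc_X @ W_v          # values from the input (English) sequence
    A = softmax(Q @ K.T / np.sqrt(K.shape[-1]))   # (T_out, T_in)
    return A @ V, A

rng = np.random.default_rng(0)
D = 8
enc_X = rng.normal(size=(4, D))   # "I do not know"
dec_X = rng.normal(size=(3, D))   # "मुझे नहीं पता"
W_q, W_k, W_v = (rng.normal(size=(D, D)) for _ in range(3))

out, A = cross_attention(dec_X, enc_X, W_q, W_k, W_v)
print(out.shape, A.shape)
```

Row j of A holds the weights that Hindi token j places on each of the four English tokens.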


Autoformer: Decomposition Transformers with Auto-Correlation for Long-Term Series Forecasting

• Published in 2021
• Conference on Neural Information Processing Systems (NeurIPS 2021) – A*
• Datasets used – ETT, Electricity, Exchange, Traffic, Weather, and ILI (influenza-like illness)
Decomposition Layer

• Autoformer incorporates decomposition into the Transformer architecture.
• The encoder and decoder use a decomposition block to aggregate the trend-cyclical part and extract the seasonal part from the series progressively.
• For an input series X of length L, the decomposition layer returns the trend-cyclical part X_t and the seasonal part X_s, defined as:

X_t = AvgPool(Padding(X))
X_s = X − X_t
Image credit: Autoformer: Decomposition Transformers with Auto-Correlation for Long-Term Series Forecasting
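
A moving-average decomposition of this kind can be sketched for a 1-D series as follows. This is an illustration under assumptions: the kernel size of 25 and the padding by repeating the first and last values are choices made for the sketch, and the function name series_decomp is hypothetical.

```python
import numpy as np

def series_decomp(x, kernel_size=25):
    """Split a 1-D series into (seasonal, trend) using a moving average."""
    pad_front = (kernel_size - 1) // 2
    pad_end = kernel_size - 1 - pad_front
    # Pad by repeating the edge values so the trend keeps the original length.
    padded = np.concatenate([np.full(pad_front, x[0]), x, np.full(pad_end, x[-1])])
    kernel = np.ones(kernel_size) / kernel_size
    trend = np.convolve(padded, kernel, mode="valid")   # moving average, length L
    seasonal = x - trend                                # what the average leaves behind
    return seasonal, trend

t = np.arange(96)
x = 0.05 * t + np.sin(2 * np.pi * t / 24)   # linear trend plus a daily-style cycle
seasonal, trend = series_decomp(x)
print(trend.shape, seasonal.shape)
```

Because the seasonal part is defined as the residual, the two parts always sum back to the input series.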
Attention (Autocorrelation) Mechanism

• Autoformer employs a novel auto-correlation mechanism, which replaces self-attention.
• In Autoformer, attention weights are computed in the frequency domain (using the fast Fourier transform) and aggregated by time delay.

Image credit: Autoformer: Decomposition Transformers with Auto-Correlation for Long-Term Series Forecasting
Frequency Domain Attention

• For a time lag τ, the autocorrelation of a single discrete variable y measures the "relationship" between the variable's current value at time t and its past value at time t−τ:

Autocorrelation(τ) = Corr(y_t, y_{t−τ})

• Using autocorrelation, Autoformer extracts frequency-based dependencies from the queries and keys, instead of the standard dot-product between them.
• The theory behind computing autocorrelation using the FFT is based on the Wiener–Khinchin theorem.

Image credit: Autoformer: Decomposition Transformers with Auto-Correlation for Long-Term Series Forecasting
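
The FFT route to autocorrelation can be checked numerically. The sketch below is an illustration under assumptions: it computes the unnormalized circular correlation for every lag in one pass, compares it with a direct lag-by-lag sum, and uses a toy sine series; the function names are hypothetical and do not come from any Autoformer implementation.

```python
import numpy as np

def autocorrelation_fft(q, k):
    """Circular correlation of q and k at every lag, computed in the frequency domain."""
    L = q.shape[0]
    q_fft = np.fft.rfft(q)
    k_fft = np.fft.rfft(k)
    # Multiply one spectrum by the conjugate of the other, then transform back.
    return np.fft.irfft(q_fft * np.conj(k_fft), n=L)

def autocorrelation_direct(q, k):
    """Same quantity computed lag by lag: sum_t q[t] * k[t - tau] (circular)."""
    L = q.shape[0]
    return np.array([np.sum(q * np.roll(k, tau)) for tau in range(L)])

L = 32
t = np.arange(L)
y = np.sin(2 * np.pi * t / 8)        # series with period 8

r_fft = autocorrelation_fft(y, y)
r_direct = autocorrelation_direct(y, y)
print(np.allclose(r_fft, r_direct), r_fft[[0, 4, 8]].round(3))
```

The direct version costs O(L²) operations, while the FFT version costs O(L log L), which matters for long input series.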
Time Delay Aggregation
• The autocorrelations are referred to as attn_weights.
• The values are aligned by computing them for each time delay τ1, τ2, ..., τk, which is also known as rolling.
• Subsequently, we conduct element-wise multiplication between the aligned values and the autocorrelations.
• In the figure, the left side shows the rolling of the values by time delay, while the right side illustrates the element-wise multiplication with the autocorrelations.

Image credit: Autoformer: Decomposition Transformers with Auto-Correlation for Long-Term Series Forecasting
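
Rolling and weighting can be sketched for a single 1-D series as below. This is an illustration under assumptions: keeping only the top_k delays, normalizing their autocorrelations with a softmax, and rolling to the left are my reading of the Autoformer paper and are not spelled out in the text above; the function name is hypothetical.

```python
import numpy as np

def time_delay_aggregation(values, attn_weights, top_k):
    """values: (L,) series; attn_weights: (L,) autocorrelation for each delay."""
    delays = np.argsort(attn_weights)[-top_k:]     # delays with the highest autocorrelation
    top = attn_weights[delays]
    w = np.exp(top - top.max())
    w = w / w.sum()                                # softmax over the selected delays (assumed)
    out = np.zeros(values.shape, dtype=float)
    for tau, weight in zip(delays, w):
        # Rolling: shift the series by tau, then scale by that delay's weight.
        out += weight * np.roll(values, -int(tau))
    return out

# Toy check: a single dominant delay of 1 simply shifts the series by one step.
shifted = time_delay_aggregation(np.arange(4.0), np.array([0.0, 10.0, 0.0, 0.0]), top_k=1)

# Periodic series: the strongest delays are multiples of the period (8).
L = 32
values = np.sin(2 * np.pi * np.arange(L) / 8)
spectrum = np.fft.rfft(values)
attn_weights = np.fft.irfft(spectrum * np.conj(spectrum), n=L)
aggregated = time_delay_aggregation(values, attn_weights, top_k=4)
print(shifted, np.allclose(aggregated, values))
```

For the periodic series, rolling by a multiple of the period leaves it unchanged, so the aggregation reinforces the repeating pattern rather than blurring it.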
