Machine Learning
T-MachLe
12. Recurrent Neural Networks
Andres Perez Uribe
Jean Hennebert
1
Master of Science
in Engineering
Plan
12.1 From feed-forward to recurrent networks
12.2 Time-Delay-Neural Networks (TDNN)
12.3 Backpropagation through time (BPTT)
12.4 The vanishing gradients problem
12.5 Long short-term Memory (LSTM) networks
12.6 Beyond RNN models
12.7 Applications
Practical Work 12
MSE - T-MachLe - Andres Perez-Uribe 2
Master of Science
in Engineering
Threshold logic
XOR ?
MSE - T-MachLe - Andres Perez-Uribe
Minsky & Pappert, 1969 3
Master of Science
in Engineering
Combinational behavior of an MLP
Combinational refers to systems whose output is a function
of the present value of the inputs only
MLP Encoder
MSE - T-MachLe - Andres Perez-Uribe 4
Master of Science
in Engineering
Taking care of sequences
Lots of information that we store in our brains is not
random access, because they were learned as a
sequence. Examples:
• Try to list the alphabet backwards
• Try to list the musical notes in an octave backwards
• Try to say your phone number backwards
• Try to sing the lyrics of a song starting the in the middle of a
verse
How can we incorporate this into a Machine Learning
algorithm ?
MSE - T-MachLe - Andres Perez-Uribe 5
Master of Science
in Engineering
Taking care of sequences (2)
Lots of data is of temporal nature and generally does
not change in an abrupt manner. Examples:
• The mean temperature of a day (e.g., it depends on
seasonality, weather conditions, etc)
• Financial indicators
• Heart frequency
How can we deal with this sort of data using Neural Networks ?
MSE - T-MachLe - Andres Perez-Uribe 6
Master of Science
in Engineering
Examples of sequences
DO, RE -> MI Songs:
RE, MI -> FA DO,RE,MI -> DO
MI, FA -> SOL DO,DO,RE -> DO
FA, SOL -> LA
chatbot:
SOL, LA -> SI
Comment tu vas ? -> bien
LA, SI -> DO
Comment tu t’appelle ? -> Pepper
CHF vs € (t, t-1,.. t-n) -> CHF vs € (t+1)
MSE - T-MachLe - Andres Perez-Uribe 7
Master of Science
in Engineering
TDNN
Time series: A series of data points given in time order.
Problem: predict future values from historical ones.
Given {x(t), x(t-1), … x(t-n)} predict x(t+s), where s is
the time horizon.
if f is an MLP we talk about Time-Delay Neural Networks
MSE - T-MachLe - Andres Perez-Uribe 8
Master of Science
in Engineering
DTRNN
Discrete Time Recurrent Neural Network
Memorize the previous state of the network and use it with time-delay
inputs to compute the predicted value
MSE - T-MachLe - Andres Perez-Uribe 9
Master of Science
in Engineering
Elman and Jordan Networks
Memorize previous output (Jordan) or hidden (Elman) values and use
Jordan How – 1990, Elman
them in the following iteration. Trains as a feed-froward network using
Backpropagation
g • Simpler than Jordan network
• Hidden representation into special
units
across • Trained with backprop
nging
e steps • Fundamental to the development of
LSTMs
s Elkan. “A Critical Review of Recurrent Neural Networks for Sequence Learning.”
org/abs/1506.00019. Jordan Networks (1986) Elman Networks (1990)
Lipton, Zachary C., John Berkowitz, and Charles Elkan. “A Critical Review of Recurrent Neural Networks for Sequence Learning.”
arXiv:1506.00019 [cs], May 29, 2015. [Link]
Lipton et al., “A Critical Review of Recurrent Neural Networks for Sequence Learning.”,
MSE - T-MachLe - Andres Perez-Uribe [Link] 10
Master of Science
in Engineering
Recurrent Neural Networks
• The training is similar to that of a feed-forward network, but each
epoch must run through the observations in sequential order.
• Supposing that the network has been unfolded to a depth of k=3, each
training pattern consists of [x(t-1), x(t), x(t+1) ; O(t+1)]
• Given an error function E(Ot+1,Ôt+1), the objective is to find U, W and V
by using gradient descent to minimize E.
MSE - T-MachLe - Andres Perez-Uribe adapted from Wikipedia 11
Master of Science
in Engineering
Weight sharing in time
The final learned weights U, W and V are the same when we
unfold the computation performed by the recurrent unit in time:
MSE - T-MachLe - Andres Perez-Uribe 12
Master of Science
in Engineering
Back propagation through time (BPTT)
Task: given x(t-n) … x(t-1),x(t),x(t+1) -> predict y(t+1),y(t+2)…y(t+m)
ŷt
yt is the correct output at time t, and ŷt is
the predicted one. We want to predict a
full sequence of yt’s. That full sequence
is treated as one training example, thus
we must sum up the errors.
Just like we sum up the errors, we also sum up the
gradients at each time step for one training example:
MSE - T-MachLe - Andres Perez-Uribe from [Link] 13
Master of Science
in Engineering
BPTT (2)
Suppose a “simple” task: given x(t-n) … x(t-1),x(t),x(t+1) -> predict y(t+1)
….at a given moment: learn to predict y3 based on x3, x2, x1 and x0
E3 = E( y3, ŷ3)
W
U
MSE - T-MachLe - Andres Perez-Uribe from [Link] 14
Master of Science
in Engineering
BPTT (3)
Now, note that s3 is a function of s2, which depends on W and s2,
and so on: s3 = tanh(Ux3 + Ws2), s2 = tanh(Ux2 + Ws1), etc…
W W W
U U U
, where k is the depth
and for example:
thus, in general, the larger the temporal horizon, the more longer the
chain rule of derivatives (which imply more and more products)
MSE - T-MachLe - Andres Perez-Uribe from [Link] 15
Master of Science
in Engineering
Vanishing/exploding gradients
Gradient contributions from “far away” steps become zero, and the state
at those steps doesn’t contribute to what you are learning: we end up
not learning long-range dependencies.
BPTT works for short-time scale dependencies between inputs
and outputs
MSE - T-MachLe - Andres Perez-Uribe from [Link] 16
Master of Science
in Engineering
Key problem
• Learning long-term dependencies is hard
DO -> RE
DO, RE -> MI
DO, RE -> ___ , ___ , ___ , _?_
MSE - T-MachLe - Andres Perez-Uribe Image Credit: Christopher Olah ([Link] 17
Master of Science
in Engineering
Towards modern RNNs
Avoiding vanishing gradients: Juergen Schmidhuber, IDSIA
• Instead of multiplying the previous hidden state by a
matrix to get the new state, add something to the old
hidden state to get the new one.
• Gate all operations to store only what is useful and to
retrieve only what can help predicting the target output.
Memory cells with a self-connected recurrent edge (Hochreiter & Schmidhuber, 1997)
Possibility of forgetting (Gers, Schmidhuber, Cummins, 1999)
MSE - T-MachLe - Andres Perez-Uribe 18
Master of Science
in Engineering
Standard Recurrent Neural Network
• All recurrent neural networks have the form of a chain of repeating modules of
neural network. Here a single layer of tanh functions.
MSE - T-MachLe - Andres Perez-Uribe From Christopher Olah ([Link] 19
Master of Science
in Engineering
Long Short Term Memory Networks
(LSTM)
• An LSTM is a RNN composed of four different layers
MSE - T-MachLe - Andres Perez-Uribe From Christopher Olah ([Link] 20
Master of Science
in Engineering
The LSTM building blocks (1)
There are three gates (i,f,o)
controlled by “Perceptrons” weighing
the inputs and the recurrent outputs
, Given an input IN, we compute a
new state c’ that can:
be ignored (i = 0)
replace the previous state (f=0 and i=1)
be used to compute a new state C + c’
(f=1 and i=1)
Given a state C, the network outputs
OUT (o=1) or not (o=0)
MSE - T-MachLe - Andres Perez-Uribe 21
Master of Science
in Engineering
LSTMs: Cell state (or Memory) and Gates
• The cell state is kind of like a conveyor belt.
• The LSTM does have the ability to remove or add information
to the cell state, carefully regulated by structures called gates.
∑wixi
Gates are composed of a sigmoid
neural net layer and a pointwise
multiplication operation.
The sigmoid layer outputs numbers between zero and one, describing how
much of each component should be let through. A value of zero means “let
nothing through,” while a value of one means “let everything through!”
MSE - T-MachLe - Andres Perez-Uribe From Christopher Olah ([Link] 22
Master of Science
in Engineering
Language example
AdaLovelace was the daughter of LordByron. _____
met CharlesBabbage who was mathematician and
inventor. ______ designed the first mechanical
computer. She ______ interested in his invention and
came up with the first computer program.
Objective: complete the sentences
What to memorize in the cell state (c) from the
sequence of words (inputs) to decide what to
output ?
MSE - T-MachLe - Andres Perez-Uribe 23
Master of Science
in Engineering
LSTMs: Forget gate
• Should we continue to remember this “bit” of information or not ?
• A ft = 1 represents “completely keep this” while a 0 represents
“completely get rid of this.”
Example: we want to predict the next word based on all the previous ones. In such a
problem, the cell state might include the gender of the present subject, so that the correct
pronouns (he or she) can be used. When we see a new subject in the input (xt), we want
to forget the gender of the old subject.
MSE - T-MachLe - Andres Perez-Uribe From Christopher Olah ([Link] 24
Master of Science
in Engineering
LSTMs: Input gate
• Should we update this “bit” of information or not?
– If so, with what?
Example: in the example of our language model, we want to add the gender of
the new subject to the cell state, to replace the old one we’re forgetting.
MSE - T-MachLe - Andres Perez-Uribe From Christopher Olah ([Link] 25
Master of Science
in Engineering
LSTMs: Memory update
• Forget that + memorize this
Example: in the case of the language model, this is where we will actually
drop the information about the old subject’s gender and add the new one.
MSE - T-MachLe - Andres Perez-Uribe From Christopher Olah ([Link] 26
Master of Science
in Engineering
LSTMs: Output gate
• What shall we output given the context ?
• The output is a function of the cell state (Ct). A tanh pushes the cell state values
towards +1 or -1
• A « Perceptron » selects what to output based on the current input xt and the
previous output ht-1, by multiplying its output ot and tanh(ct)
Example: in the case of the language model, given that we just saw a new subject,
this cell might decide how to conjugate a verb, based on its knowledge of the
subject, whether it is plural or singular.
MSE - T-MachLe - Andres Perez-Uribe From Christopher Olah ([Link] 27
Master of Science
in Engineering
The LSTM building blocks (2)
MSE - T-MachLe - Andres Perez-Uribe 28
Master of Science
in Engineering
A two-unit LSTM network
outputs
Each output is fed back to all
Perceptron gates of each LSTM
unit (here we only show the
connections of the outputs of one
LSTM unit)
Each input is connected to all
Perceptron gates of each LSTM
unit
MSE - T-MachLe - Andres Perez-Uribe
inputs 29
Master of Science
in Engineering
LSTM complexity
4 x (Ninputs + NLSTMblocks + bias) x NLSTMblocks
weights for 32 LSTM units & 2-dim inputs: 4 x (2 + 32 + 1) x 32 = 4480
MSE - T-MachLe - Andres Perez-Uribe 30
Master of Science
in Engineering
– Long-Term Dependencies
LSTM vs RNNsWhy – Long-Term Dependen
want to remember everything, just the important things for a long time
Input forgetting when looking Memorized inputs for long-term
for long-term dependencies in dependencies in LSTMs
RNNs
pervised*sequence*labelling.)Springer)Berlin)Heidelberg,)2012.
On the LSTM (right), the memory cell « remembers » the first input as
long as the forget gate isGraves,)Alex.
open (‘o’) and the input gate is closed (‘-‘).
Supervised*sequence*labelling.)Springer)Berlin)Heidelberg,)2012.
The sensitivity of the output can be switched on and off by the output
gate without affecting the rest of the cell.
MSE - T-MachLe - Andres Perez-Uribe from Graves, Alex, « Supervised sequence labelling », 2012 31
Master of Science
in Engineering
LSTM-based architectures & applications
a) Feed-forward network
b) Text and video classification: a sequence is
mapped to one fixed length vector
c) Image captioning: the input image is a single
non-sequential data point.
d) Natural language translation, a sequence-
to-sequence task (they might have varying and
different sizes)
13: Recurrent neural networks have been used successfully to model both
e) Learn a generative model for text, predicting
tial inputs and sequential outputs as well as mappings between single data
and sequences (in both directions). This figure, based on a similar at each
figure step the following character.
pathy [2015] shows how numerous tasks can be modeled with RNNs with
tial inputs and/or sequential outputs. In Lipton
each subfigure, blueReview
et al., “A Critical rectangles
of Recurrent Neural Networks for Sequence Learning.”,
pond toMSEinputs, red rectangles to outputs
- T-MachLe - Andres Perez-Uribe and green rectangles to the en-
[Link] 32
dden state of the neural network. (a) This is the conventional independent
Master of Science
in Engineering
Example applications
Image captioning
ave been used successfully to model both
as well as mappings between single data
s). This figure, based on a similar figure
s tasks can be modeled with RNNs with
3: Recurrent
puts. In eachneural networks have
subfigure, bluebeen used successfully to model both
rectangles
al inputs and
outputs and sequential
green outputs as well
rectangles to as
themappings
en- between single data
MSE - T-MachLe
nd sequences - Andres
(in both This figure, based on a similarMachine
Perez-Uribe
directions). figure translation 33
(a) This is the conventional independent
Master of Science
in Engineering
Seq2Seq (Sutskever, Ilya; Vinyals, Oriol; Le, Quoc Viet (2014)
The original encoder-decoder architecture supposes that a vector is
enough to encode the input sequence, but we ease the problem if each
state of the decoder can have access to the output of the previous
state to guide the generation of the subsequent output (e.g., use ‘I’ to
compute “saw”).
MSE - T-MachLe - Andres Perez-Uribe from [Link] 34
Master of Science
in Engineering
Attention: at different steps, let a model
"focus" on different parts of the input.
Here, each state of the decoder uses information from the input in addition
to the output of the previous state, to guide the generation of the
subsequent output.
MSE - T-MachLe - Andres Perez-Uribe from [Link] 35
Master of Science
in Engineering
Self-Attention is all you need (Vaswani et al., 2017)
Self-attention computes for each input a representation that is a function of all
the inputs of the sequence. It can be run in parallel (since, it is of
combinatorial nature), which is way more efficient than sequential processing
using LSTMs. This new module is at the core of the new Transformers.
MSE - T-MachLe - Andres Perez-Uribe 36
Master of Science
in Engineering
Runner speed prediction (case study)
How to guide the beginner runner on uphill slopes ?
Figure 13: Recurrent neural networks have been used
sequential inputs and sequential outputs as well as map
points and sequences (in both directions). This figure
in Karpathy [2015] shows how numerous tasks can be
sequential inputs and/or sequential outputs. In each
20km de Lausanne
correspond to inputs, red rectangles to outputs and g
tire hidden state of the neural network. (a) This is the
case, as assumed by standard feedforward networks.
MSE - T-MachLe - Andres Perez-Uribe 37
fication are tasks in which a sequence is mapped to on
Master of Science
in Engineering
Runner speed prediction (2)
preceding km 𝑣→
slope
𝑣
→ accumulated
distance/height
𝑣
predictive
→ model
𝑣
→
Runner training data
MSE - T-MachLe - Andres Perez-Uribe 38
Master of Science
in Engineering
Runner speed prediction (3)
speed
v(t-n) …. v(t-2) v(t-1) v(t) ?
t-n …. t-2 t-1 t next time
Figure 13: Recurrent neural networks have been used successfull
sequential inputs and sequential outputs as well as mappings bet
slope(t-n) …. slope(t-2) slope(t-1) slope(t) next slope
points and sequences (in both directions). This figure, based on
in Karpathy [2015] shows how numerous tasks can be modeled w
sequential inputs and/or sequential outputs. In each subfigure,
correspond to inputs, red rectangles to outputs and green recta
tire hidden state of the neural network. (a) This is the conventio
case, as assumed by standard feedforward networks. (b) Text a
fication are tasks in which a sequence is mapped to one fixed le
MSE - T-MachLe - Andres Perez-Uribe 39
Image captioning presents the converse case, where the input i
Master of Science
in Engineering
Typical training session
MSE - T-MachLe - Andres Perez-Uribe 40
Master of Science
in Engineering
Speed prediction (example)
What is the race time prediction error ?
MSE - T-MachLe - Andres Perez-Uribe 41