0% found this document useful (0 votes)
27 views41 pages

Understanding Recurrent Neural Networks

This document provides an overview of Recurrent Neural Networks (RNNs) and their evolution, particularly focusing on Long Short-Term Memory (LSTM) networks. It discusses the challenges of learning long-term dependencies, the architecture of LSTMs, and their components such as gates and memory cells. Additionally, it highlights practical applications and the importance of handling sequential data in machine learning.

Uploaded by

Ss d1
Copyright
© All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
27 views41 pages

Understanding Recurrent Neural Networks

This document provides an overview of Recurrent Neural Networks (RNNs) and their evolution, particularly focusing on Long Short-Term Memory (LSTM) networks. It discusses the challenges of learning long-term dependencies, the architecture of LSTMs, and their components such as gates and memory cells. Additionally, it highlights practical applications and the importance of handling sequential data in machine learning.

Uploaded by

Ss d1
Copyright
© All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd

Machine Learning

T-MachLe
12. Recurrent Neural Networks

Andres Perez Uribe


Jean Hennebert

1
Master of Science
in Engineering

Plan
12.1 From feed-forward to recurrent networks
12.2 Time-Delay-Neural Networks (TDNN)
12.3 Backpropagation through time (BPTT)
12.4 The vanishing gradients problem
12.5 Long short-term Memory (LSTM) networks
12.6 Beyond RNN models
12.7 Applications

Practical Work 12

MSE - T-MachLe - Andres Perez-Uribe 2


Master of Science
in Engineering

Threshold logic

XOR ?

MSE - T-MachLe - Andres Perez-Uribe


Minsky & Pappert, 1969 3
Master of Science
in Engineering

Combinational behavior of an MLP


Combinational refers to systems whose output is a function
of the present value of the inputs only

MLP Encoder
MSE - T-MachLe - Andres Perez-Uribe 4
Master of Science
in Engineering

Taking care of sequences


Lots of information that we store in our brains is not
random access, because they were learned as a
sequence. Examples:
• Try to list the alphabet backwards
• Try to list the musical notes in an octave backwards
• Try to say your phone number backwards
• Try to sing the lyrics of a song starting the in the middle of a
verse
How can we incorporate this into a Machine Learning
algorithm ?

MSE - T-MachLe - Andres Perez-Uribe 5


Master of Science
in Engineering

Taking care of sequences (2)


Lots of data is of temporal nature and generally does
not change in an abrupt manner. Examples:
• The mean temperature of a day (e.g., it depends on
seasonality, weather conditions, etc)
• Financial indicators
• Heart frequency

How can we deal with this sort of data using Neural Networks ?
MSE - T-MachLe - Andres Perez-Uribe 6
Master of Science
in Engineering

Examples of sequences

DO, RE -> MI Songs:


RE, MI -> FA DO,RE,MI -> DO
MI, FA -> SOL DO,DO,RE -> DO
FA, SOL -> LA
chatbot:
SOL, LA -> SI
Comment tu vas ? -> bien
LA, SI -> DO
Comment tu t’appelle ? -> Pepper

CHF vs € (t, t-1,.. t-n) -> CHF vs € (t+1)

MSE - T-MachLe - Andres Perez-Uribe 7


Master of Science
in Engineering

TDNN
Time series: A series of data points given in time order.
Problem: predict future values from historical ones.
Given {x(t), x(t-1), … x(t-n)} predict x(t+s), where s is
the time horizon.

if f is an MLP we talk about Time-Delay Neural Networks


MSE - T-MachLe - Andres Perez-Uribe 8
Master of Science
in Engineering

DTRNN
Discrete Time Recurrent Neural Network

Memorize the previous state of the network and use it with time-delay
inputs to compute the predicted value

MSE - T-MachLe - Andres Perez-Uribe 9


Master of Science
in Engineering

Elman and Jordan Networks


Memorize previous output (Jordan) or hidden (Elman) values and use
Jordan How – 1990, Elman
them in the following iteration. Trains as a feed-froward network using
Backpropagation
g • Simpler than Jordan network

• Hidden representation into special


units

across • Trained with backprop


nging
e steps • Fundamental to the development of
LSTMs

s Elkan. “A Critical Review of Recurrent Neural Networks for Sequence Learning.”


org/abs/1506.00019. Jordan Networks (1986) Elman Networks (1990)
Lipton, Zachary C., John Berkowitz, and Charles Elkan. “A Critical Review of Recurrent Neural Networks for Sequence Learning.”
arXiv:1506.00019 [cs], May 29, 2015. [Link]

Lipton et al., “A Critical Review of Recurrent Neural Networks for Sequence Learning.”,
MSE - T-MachLe - Andres Perez-Uribe [Link] 10
Master of Science
in Engineering

Recurrent Neural Networks

• The training is similar to that of a feed-forward network, but each


epoch must run through the observations in sequential order.

• Supposing that the network has been unfolded to a depth of k=3, each
training pattern consists of [x(t-1), x(t), x(t+1) ; O(t+1)]

• Given an error function E(Ot+1,Ôt+1), the objective is to find U, W and V


by using gradient descent to minimize E.
MSE - T-MachLe - Andres Perez-Uribe adapted from Wikipedia 11
Master of Science
in Engineering

Weight sharing in time

The final learned weights U, W and V are the same when we


unfold the computation performed by the recurrent unit in time:

MSE - T-MachLe - Andres Perez-Uribe 12


Master of Science
in Engineering

Back propagation through time (BPTT)


Task: given x(t-n) … x(t-1),x(t),x(t+1) -> predict y(t+1),y(t+2)…y(t+m)

ŷt

yt is the correct output at time t, and ŷt is


the predicted one. We want to predict a
full sequence of yt’s. That full sequence
is treated as one training example, thus
we must sum up the errors.

Just like we sum up the errors, we also sum up the


gradients at each time step for one training example:

MSE - T-MachLe - Andres Perez-Uribe from [Link] 13


Master of Science
in Engineering

BPTT (2)
Suppose a “simple” task: given x(t-n) … x(t-1),x(t),x(t+1) -> predict y(t+1)

….at a given moment: learn to predict y3 based on x3, x2, x1 and x0

E3 = E( y3, ŷ3)

W
U

MSE - T-MachLe - Andres Perez-Uribe from [Link] 14


Master of Science
in Engineering

BPTT (3)
Now, note that s3 is a function of s2, which depends on W and s2,
and so on: s3 = tanh(Ux3 + Ws2), s2 = tanh(Ux2 + Ws1), etc…

W W W
U U U

, where k is the depth

and for example:

thus, in general, the larger the temporal horizon, the more longer the
chain rule of derivatives (which imply more and more products)
MSE - T-MachLe - Andres Perez-Uribe from [Link] 15
Master of Science
in Engineering

Vanishing/exploding gradients
Gradient contributions from “far away” steps become zero, and the state
at those steps doesn’t contribute to what you are learning: we end up
not learning long-range dependencies.

BPTT works for short-time scale dependencies between inputs


and outputs

MSE - T-MachLe - Andres Perez-Uribe from [Link] 16


Master of Science
in Engineering

Key problem
• Learning long-term dependencies is hard

DO -> RE
DO, RE -> MI
DO, RE -> ___ , ___ , ___ , _?_

MSE - T-MachLe - Andres Perez-Uribe Image Credit: Christopher Olah ([Link] 17


Master of Science
in Engineering

Towards modern RNNs

Avoiding vanishing gradients: Juergen Schmidhuber, IDSIA

• Instead of multiplying the previous hidden state by a


matrix to get the new state, add something to the old
hidden state to get the new one.
• Gate all operations to store only what is useful and to
retrieve only what can help predicting the target output.

Memory cells with a self-connected recurrent edge (Hochreiter & Schmidhuber, 1997)
Possibility of forgetting (Gers, Schmidhuber, Cummins, 1999)

MSE - T-MachLe - Andres Perez-Uribe 18


Master of Science
in Engineering

Standard Recurrent Neural Network


• All recurrent neural networks have the form of a chain of repeating modules of
neural network. Here a single layer of tanh functions.

MSE - T-MachLe - Andres Perez-Uribe From Christopher Olah ([Link] 19


Master of Science
in Engineering

Long Short Term Memory Networks


(LSTM)
• An LSTM is a RNN composed of four different layers

MSE - T-MachLe - Andres Perez-Uribe From Christopher Olah ([Link] 20


Master of Science
in Engineering

The LSTM building blocks (1)

There are three gates (i,f,o)


controlled by “Perceptrons” weighing
the inputs and the recurrent outputs
, Given an input IN, we compute a
new state c’ that can:
be ignored (i = 0)
replace the previous state (f=0 and i=1)
be used to compute a new state C + c’
(f=1 and i=1)
Given a state C, the network outputs
OUT (o=1) or not (o=0)
MSE - T-MachLe - Andres Perez-Uribe 21
Master of Science
in Engineering

LSTMs: Cell state (or Memory) and Gates

• The cell state is kind of like a conveyor belt.


• The LSTM does have the ability to remove or add information
to the cell state, carefully regulated by structures called gates.

∑wixi
Gates are composed of a sigmoid
neural net layer and a pointwise
multiplication operation.

The sigmoid layer outputs numbers between zero and one, describing how
much of each component should be let through. A value of zero means “let
nothing through,” while a value of one means “let everything through!”

MSE - T-MachLe - Andres Perez-Uribe From Christopher Olah ([Link] 22


Master of Science
in Engineering

Language example

AdaLovelace was the daughter of LordByron. _____


met CharlesBabbage who was mathematician and
inventor. ______ designed the first mechanical
computer. She ______ interested in his invention and
came up with the first computer program.

Objective: complete the sentences


What to memorize in the cell state (c) from the
sequence of words (inputs) to decide what to
output ?

MSE - T-MachLe - Andres Perez-Uribe 23


Master of Science
in Engineering

LSTMs: Forget gate

• Should we continue to remember this “bit” of information or not ?


• A ft = 1 represents “completely keep this” while a 0 represents
“completely get rid of this.”

Example: we want to predict the next word based on all the previous ones. In such a
problem, the cell state might include the gender of the present subject, so that the correct
pronouns (he or she) can be used. When we see a new subject in the input (xt), we want
to forget the gender of the old subject.
MSE - T-MachLe - Andres Perez-Uribe From Christopher Olah ([Link] 24
Master of Science
in Engineering

LSTMs: Input gate

• Should we update this “bit” of information or not?


– If so, with what?

Example: in the example of our language model, we want to add the gender of
the new subject to the cell state, to replace the old one we’re forgetting.

MSE - T-MachLe - Andres Perez-Uribe From Christopher Olah ([Link] 25


Master of Science
in Engineering

LSTMs: Memory update

• Forget that + memorize this

Example: in the case of the language model, this is where we will actually
drop the information about the old subject’s gender and add the new one.

MSE - T-MachLe - Andres Perez-Uribe From Christopher Olah ([Link] 26


Master of Science
in Engineering

LSTMs: Output gate


• What shall we output given the context ?
• The output is a function of the cell state (Ct). A tanh pushes the cell state values
towards +1 or -1
• A « Perceptron » selects what to output based on the current input xt and the
previous output ht-1, by multiplying its output ot and tanh(ct)

Example: in the case of the language model, given that we just saw a new subject,
this cell might decide how to conjugate a verb, based on its knowledge of the
subject, whether it is plural or singular.
MSE - T-MachLe - Andres Perez-Uribe From Christopher Olah ([Link] 27
Master of Science
in Engineering

The LSTM building blocks (2)

MSE - T-MachLe - Andres Perez-Uribe 28


Master of Science
in Engineering

A two-unit LSTM network


outputs
Each output is fed back to all
Perceptron gates of each LSTM
unit (here we only show the
connections of the outputs of one
LSTM unit)

Each input is connected to all


Perceptron gates of each LSTM
unit
MSE - T-MachLe - Andres Perez-Uribe
inputs 29
Master of Science
in Engineering

LSTM complexity

4 x (Ninputs + NLSTMblocks + bias) x NLSTMblocks


weights for 32 LSTM units & 2-dim inputs: 4 x (2 + 32 + 1) x 32 = 4480

MSE - T-MachLe - Andres Perez-Uribe 30


Master of Science
in Engineering

– Long-Term Dependencies
LSTM vs RNNsWhy – Long-Term Dependen
want to remember everything, just the important things for a long time

Input forgetting when looking Memorized inputs for long-term


for long-term dependencies in dependencies in LSTMs
RNNs
pervised*sequence*labelling.)Springer)Berlin)Heidelberg,)2012.
On the LSTM (right), the memory cell « remembers » the first input as
long as the forget gate isGraves,)Alex.
open (‘o’) and the input gate is closed (‘-‘).
Supervised*sequence*labelling.)Springer)Berlin)Heidelberg,)2012.
The sensitivity of the output can be switched on and off by the output
gate without affecting the rest of the cell.
MSE - T-MachLe - Andres Perez-Uribe from Graves, Alex, « Supervised sequence labelling », 2012 31
Master of Science
in Engineering

LSTM-based architectures & applications


a) Feed-forward network

b) Text and video classification: a sequence is


mapped to one fixed length vector

c) Image captioning: the input image is a single


non-sequential data point.

d) Natural language translation, a sequence-


to-sequence task (they might have varying and
different sizes)

13: Recurrent neural networks have been used successfully to model both
e) Learn a generative model for text, predicting
tial inputs and sequential outputs as well as mappings between single data
and sequences (in both directions). This figure, based on a similar at each
figure step the following character.
pathy [2015] shows how numerous tasks can be modeled with RNNs with
tial inputs and/or sequential outputs. In Lipton
each subfigure, blueReview
et al., “A Critical rectangles
of Recurrent Neural Networks for Sequence Learning.”,
pond toMSEinputs, red rectangles to outputs
- T-MachLe - Andres Perez-Uribe and green rectangles to the en-
[Link] 32
dden state of the neural network. (a) This is the conventional independent
Master of Science
in Engineering

Example applications

Image captioning

ave been used successfully to model both


as well as mappings between single data
s). This figure, based on a similar figure
s tasks can be modeled with RNNs with
3: Recurrent
puts. In eachneural networks have
subfigure, bluebeen used successfully to model both
rectangles
al inputs and
outputs and sequential
green outputs as well
rectangles to as
themappings
en- between single data
MSE - T-MachLe
nd sequences - Andres
(in both This figure, based on a similarMachine
Perez-Uribe
directions). figure translation 33
(a) This is the conventional independent
Master of Science
in Engineering

Seq2Seq (Sutskever, Ilya; Vinyals, Oriol; Le, Quoc Viet (2014)


The original encoder-decoder architecture supposes that a vector is
enough to encode the input sequence, but we ease the problem if each
state of the decoder can have access to the output of the previous
state to guide the generation of the subsequent output (e.g., use ‘I’ to
compute “saw”).

MSE - T-MachLe - Andres Perez-Uribe from [Link] 34


Master of Science
in Engineering

Attention: at different steps, let a model


"focus" on different parts of the input.

Here, each state of the decoder uses information from the input in addition
to the output of the previous state, to guide the generation of the
subsequent output.
MSE - T-MachLe - Andres Perez-Uribe from [Link] 35
Master of Science
in Engineering

Self-Attention is all you need (Vaswani et al., 2017)

Self-attention computes for each input a representation that is a function of all


the inputs of the sequence. It can be run in parallel (since, it is of
combinatorial nature), which is way more efficient than sequential processing
using LSTMs. This new module is at the core of the new Transformers.
MSE - T-MachLe - Andres Perez-Uribe 36
Master of Science
in Engineering

Runner speed prediction (case study)

How to guide the beginner runner on uphill slopes ?

Figure 13: Recurrent neural networks have been used


sequential inputs and sequential outputs as well as map
points and sequences (in both directions). This figure
in Karpathy [2015] shows how numerous tasks can be
sequential inputs and/or sequential outputs. In each
20km de Lausanne
correspond to inputs, red rectangles to outputs and g
tire hidden state of the neural network. (a) This is the
case, as assumed by standard feedforward networks.
MSE - T-MachLe - Andres Perez-Uribe 37
fication are tasks in which a sequence is mapped to on
Master of Science
in Engineering

Runner speed prediction (2)

preceding km 𝑣→
slope
𝑣
→ accumulated
distance/height
𝑣
predictive
→ model

𝑣

Runner training data

MSE - T-MachLe - Andres Perez-Uribe 38


Master of Science
in Engineering

Runner speed prediction (3)

speed
v(t-n) …. v(t-2) v(t-1) v(t) ?

t-n …. t-2 t-1 t next time

Figure 13: Recurrent neural networks have been used successfull


sequential inputs and sequential outputs as well as mappings bet
slope(t-n) …. slope(t-2) slope(t-1) slope(t) next slope
points and sequences (in both directions). This figure, based on
in Karpathy [2015] shows how numerous tasks can be modeled w
sequential inputs and/or sequential outputs. In each subfigure,
correspond to inputs, red rectangles to outputs and green recta
tire hidden state of the neural network. (a) This is the conventio
case, as assumed by standard feedforward networks. (b) Text a
fication are tasks in which a sequence is mapped to one fixed le
MSE - T-MachLe - Andres Perez-Uribe 39
Image captioning presents the converse case, where the input i
Master of Science
in Engineering

Typical training session

MSE - T-MachLe - Andres Perez-Uribe 40


Master of Science
in Engineering

Speed prediction (example)

What is the race time prediction error ?

MSE - T-MachLe - Andres Perez-Uribe 41

You might also like