Understanding Recurrent Neural Networks

This document provides an overview of Recurrent Neural Networks (RNNs) and their variants, highlighting their ability to handle sequential data and retain memory of previous inputs. It discusses the architecture of RNNs, including the encoder-decoder framework and the introduction of Long Short-Term Memory (LSTM) networks to address long-term dependencies. Additionally, it contrasts RNNs with Recursive Neural Networks (RvNNs), emphasizing their different applications and structures.

Uploaded by

nandanasivadas86
Copyright
© All Rights Reserved

MODULE 4

RNN-RECURRENT NEURAL NETWORKS


Feed-Forward Neural Networks

Features
Decisions are based on the current input
No memory of the past
No future scope
Issues in FF neural networks
 Can't handle sequential data
 Considers only the current input
 Can't memorize previous inputs
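Recurrent Neural Networks
 An RNN works on the principle of saving the output of a layer and feeding it back to the input in order to predict the output of the layer.
 U, W, V are parameters: U connects the input x to the hidden state h, W connects h to itself across time steps, and V connects h to the output y.
 [Figure: x → (U) → h, with recurrent connection W on h; h → (V) → y]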
Advantages
 Can handle sequential data.
 Considers the current input and previously received inputs.
 Can memorize previous inputs due to its internal memory.
Applications of RNN
 Image captioning (e.g. a dog catching a ball)
 Time series prediction (e.g. stock prices)
 NLP (text mining and sentiment analysis)
 Machine translation
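What does an RNN look like?
Working
 We can process a sequence of vectors by applying a recurrence formula at every time step:
h(t) = f(h(t−1), x(t); θ), where h(t) is the new state, h(t−1) is the previous state, x(t) is the input at time step t, and the same parameters θ are used at every step.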
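The recurrence can be sketched in NumPy as follows. This is a minimal sketch, assuming a tanh hidden activation, a linear output layer, and small illustrative sizes; the function names `rnn_step` and `run_rnn` are hypothetical, while U, W and V are the parameters named earlier.

```python
import numpy as np

def rnn_step(x_t, h_prev, U, W, V):
    """One time step: new hidden state and output (tanh is an assumption)."""
    h_t = np.tanh(U @ x_t + W @ h_prev)   # input-to-hidden plus hidden-to-hidden
    y_t = V @ h_t                         # hidden-to-output
    return h_t, y_t

def run_rnn(xs, U, W, V):
    """Apply the same step, with the same U, W, V, at every position of xs."""
    h = np.zeros(W.shape[0])              # initial hidden state
    ys = []
    for x_t in xs:
        h, y_t = rnn_step(x_t, h, U, W, V)
        ys.append(y_t)
    return np.stack(ys), h

rng = np.random.default_rng(0)
input_size, hidden_size, output_size = 3, 4, 2
U = rng.normal(scale=0.1, size=(hidden_size, input_size))
W = rng.normal(scale=0.1, size=(hidden_size, hidden_size))
V = rng.normal(scale=0.1, size=(output_size, hidden_size))

xs = rng.normal(size=(5, input_size))     # a sequence of 5 input vectors
ys, h_last = run_rnn(xs, U, W, V)
print(ys.shape, h_last.shape)             # (5, 2) (4,)
```

Because the parameters are shared across steps, the same `run_rnn` call works for a sequence of any length.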
RNN- Variants/Types

1. One to one
2. One to many
3. Many to one
4. Many to many
One to one
 Single input, single output
 It is known as a vanilla neural network
 Used for regular ML problems
One to Many
 Single input and multiple outputs
 Generates a sequence of outputs
 E.g. image captioning
Many to One
 Takes in a sequence of inputs and generates one output
 E.g. sentiment analysis
Many to many
 Takes in a sequence of inputs and generates a sequence of outputs
 E.g. machine translation
Computational Graphs
A computational graph is a way to formalize the structure of a set of computations, such as those involved in mapping inputs and parameters to outputs and loss.
 Unfolding a recursive or recurrent computation leads to a computational graph that has a repetitive structure, typically corresponding to a chain of events.
 For example, consider the classical form of a dynamical system:
s(t) = f(s(t−1); θ), where s(t) is called the state of the system.
 This equation is recurrent because the definition of s at time t refers back to the same definition at time t − 1.
Unfolding the computational graph
 For a finite number of time steps τ, the graph can be unfolded by applying the definition τ − 1 times.
 For example, if we unfold the equation s(t) = f(s(t−1); θ) for τ = 3 time steps, we obtain
s(3) = f(s(2); θ)
     = f(f(s(1); θ); θ)
 Unfolding the equation by repeatedly applying the definition in this way yields an expression that does not involve recurrence.
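A small sketch of the unfolding for τ = 3. The transition function `f` below is hypothetical and chosen only so the numbers are easy to follow; the point is that the loop applying the definition τ − 1 times gives the same value as the nested, recurrence-free expression.

```python
def f(s, theta):
    """Hypothetical transition function, used only for illustration."""
    return theta * s + 1.0

def unfold(s1, theta, tau):
    """Apply the definition tau - 1 times, starting from s(1)."""
    s = s1
    for _ in range(tau - 1):
        s = f(s, theta)
    return s

theta, s1 = 0.5, 4.0
s3_loop = unfold(s1, theta, tau=3)
s3_nested = f(f(s1, theta), theta)   # s(3) = f(f(s(1); theta); theta)
print(s3_loop, s3_nested)            # 2.5 2.5
```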
Unfolding the computational graph (continued)
 Such an expression can now be represented by a traditional directed acyclic computational graph.
 The unfolded computational graph of the equation s(t) = f(s(t−1); θ) is illustrated in the figure.
 Each node represents the state at
some time t and the function f maps
the state at t to the state at t + 1. The
same parameters (the same value of θ
used to parametrize f) are used for all
time steps.
Example 2- Unfolding a recurrent network
with no outputs
Unfolding
The unfolding process thus introduces two major advantages:
1. Regardless of the sequence length, the learned model always has the same input
size, because it is specified in terms of transition from one state to another state, rather
than specified in terms of a variable-length history of states.
2. It is possible to use the same transition function f with the same parameters at every
time step.
These two factors make it possible to learn a single model f that operates on all time steps and
all sequence lengths, rather than needing to learn a separate model g(t) for all possible time
steps.
Learning a single, shared model allows generalization to sequence lengths that did not appear in the
training set, and allows the model to be estimated with far fewer training examples than would be
required without parameter sharing.

The unfolded graph also helps to illustrate the idea of information flow forward in time (computing outputs and losses) and backward in time (computing gradients) by explicitly showing the path along which this information flows.
 [Figure] The computational graph to compute the training loss of a recurrent network that maps an input sequence of x values to a corresponding sequence of output o values.
 L is the loss and y is the target output.
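The loss for such a graph can be sketched as a sum of per-step losses. This is a minimal sketch under assumptions not stated on the slide: the outputs o(t) are treated as unnormalized class scores, each target y(t) is a class index, and the per-step loss is the negative log-likelihood under a softmax. The names `softmax` and `sequence_loss` are illustrative.

```python
import numpy as np

def softmax(o):
    """Turn unnormalized scores into probabilities (numerically stable)."""
    e = np.exp(o - o.max())
    return e / e.sum()

def sequence_loss(outputs, targets):
    """Total loss: sum over time steps of -log softmax(o(t))[y(t)]."""
    return sum(-np.log(softmax(o_t)[y_t]) for o_t, y_t in zip(outputs, targets))

# Two time steps, three classes; the scores favor the correct class at each step.
outputs = np.array([[2.0, 0.0, 0.0],
                    [0.0, 2.0, 0.0]])
targets = [0, 1]
total_loss = sequence_loss(outputs, targets)
print(total_loss)
```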
Time-unfolded recurrent neural network with a single output at the end of the sequence.
 Such a network can be used to summarize a sequence and produce a fixed-size representation used as input for further processing.
We have seen that
 An RNN can map an input sequence to a fixed-size vector.
 An RNN can map a fixed-size vector to a sequence.
 An RNN can map an input sequence to an output sequence of the same length.
Encoder-Decoder Sequence-to-Sequence Architectures
 An RNN can be trained to map an input sequence to an output
sequence which is not necessarily of the same length.
 This comes up in many applications, such as speech recognition,
machine translation or question answering, where the input and
output sequences in the training set are generally not of the same
length.
 We often call the input to the RNN the “context.”
 We want to produce a representation of this context, C .
 The context C might be a vector or sequence of vectors that
summarize the input sequence X = (x(1), x (2), . . . , x(nx )).
Encoder-Decoder Sequence-to-Sequence Architectures
 The simplest RNN architecture for mapping a variable-length sequence to another variable-length sequence is called the encoder-decoder or sequence-to-sequence architecture.
The idea is very simple:
(1) An encoder (also called reader or input RNN) processes the input sequence. The encoder emits the context C, usually as a simple function of its final hidden state.
(2) A decoder (also called writer or output RNN) is conditioned on that fixed-length vector to generate the output sequence Y = (y(1), . . . , y(ny)).
Example of an encoder-decoder or sequence-to-sequence RNN architecture, for learning to generate an output sequence (y(1), . . . , y(ny)) given an input sequence (x(1), x(2), . . . , x(nx)).
 The last state h(nx) of the encoder RNN is typically used as a representation C of the input sequence that is provided as input to the decoder RNN.
 If the context C is a vector, then the decoder RNN is simply a vector-to-sequence RNN.
 There are at least two ways for a vector-to-sequence RNN to receive
input.
1. The input can be provided as the initial state of the RNN,
2. The input can be connected to the hidden units at each
time step.
 These two ways can also be combined.
 There is no constraint that the encoder must have the same size of
hidden layer as the decoder.
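The points above can be sketched as a small encoder-decoder in NumPy. This is a sketch under stated assumptions, not a definitive implementation: tanh units, random untrained weights, and a decoder that does not feed its previous output back in. All parameter names (`enc_U`, `dec_init`, `dec_C`, and so on) are hypothetical. The context C is used both ways described above: to set the decoder's initial state and as an extra input to the hidden units at each step. The encoder and decoder hidden sizes differ on purpose.

```python
import numpy as np

rng = np.random.default_rng(0)
x_dim, enc_hidden, dec_hidden, y_dim = 3, 5, 4, 2   # encoder/decoder sizes differ

def init(rows, cols):
    """Small random matrix; stands in for trained weights."""
    return rng.normal(scale=0.1, size=(rows, cols))

enc_U, enc_W = init(enc_hidden, x_dim), init(enc_hidden, enc_hidden)
dec_init = init(dec_hidden, enc_hidden)   # maps C to the decoder's initial state
dec_C = init(dec_hidden, enc_hidden)      # feeds C to the hidden units at each step
dec_W = init(dec_hidden, dec_hidden)
dec_V = init(y_dim, dec_hidden)

def encode(xs):
    """Read the whole input; the final hidden state is the context C."""
    h = np.zeros(enc_hidden)
    for x_t in xs:
        h = np.tanh(enc_U @ x_t + enc_W @ h)
    return h

def decode(C, n_y):
    """Generate n_y outputs conditioned on the fixed-length context C."""
    h = np.tanh(dec_init @ C)                 # way 1: C sets the initial state
    ys = []
    for _ in range(n_y):
        h = np.tanh(dec_W @ h + dec_C @ C)    # way 2: C is an input at every step
        ys.append(dec_V @ h)
    return np.stack(ys)

xs = rng.normal(size=(7, x_dim))   # input sequence of length 7
C = encode(xs)
ys = decode(C, n_y=4)              # output sequence of length 4
print(C.shape, ys.shape)           # (5,) (4, 2)
```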
 One clear limitation of this architecture is when the
context C output by the encoder RNN has a dimension that is too small to properly summarize a long sequence.
 One proposal was to make C a variable-length sequence rather than a fixed-size vector.
 The same proposal introduced an attention mechanism that learns to associate elements of the sequence C with elements of the output sequence.
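A rough sketch of how attention over a variable-length C can work. As a simplifying assumption, the score between a decoder state and each element of C is a plain dot product here, rather than a learned scoring function; the function name `attend` and the toy vectors are hypothetical.

```python
import numpy as np

def attend(encoder_states, decoder_state):
    """Weight each element of C by its relevance to the current decoder state."""
    scores = encoder_states @ decoder_state   # one score per input position
    weights = np.exp(scores - scores.max())   # softmax over positions
    weights /= weights.sum()
    context = weights @ encoder_states        # weighted average of the elements of C
    return context, weights

# C as a sequence of three encoder states (one row per input position).
encoder_states = np.array([[1.0, 0.0],
                           [0.0, 1.0],
                           [1.0, 1.0]])
decoder_state = np.array([0.0, 2.0])
context, weights = attend(encoder_states, decoder_state)
print(weights)   # positions 2 and 3 get equal, larger weights than position 1
```

The weights sum to one, so the context passed to the decoder is a weighted average of the encoder states, recomputed for each output element.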
Deep Recurrent Networks
The computation in most RNNs can be decomposed into
three blocks of parameters and associated
transformations:
1. from the input to the hidden state,
2. from the previous hidden state to the next
hidden state, and
3. from the hidden state to the output.
 It is advantageous to introduce depth in each of these operations.
 The experimental evidence is in agreement with the idea that we
need enough depth in order to perform the required mappings.
A recurrent neural network can be made deep in many ways:
(a) The hidden recurrent state can be broken down into groups organized hierarchically.
(b) Deeper computation (e.g., an MLP) can be introduced in the input-to-hidden, hidden-to-hidden and hidden-to-output parts. This may lengthen the shortest path linking different time steps.
(c) The path-lengthening effect can be mitigated by introducing skip connections.
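One common reading of the hierarchical option is a stack of recurrent layers, where each layer reads the hidden states of the layer below. The sketch below assumes tanh units and random untrained weights; `stacked_rnn` and the layer sizes are illustrative names and values only.

```python
import numpy as np

def stacked_rnn(xs, layers):
    """layers is a list of (U, W) pairs; layer k reads the states of layer k - 1."""
    seq = xs
    for U, W in layers:
        h = np.zeros(W.shape[0])
        states = []
        for v in seq:
            h = np.tanh(U @ v + W @ h)   # same recurrence as a single-layer RNN
            states.append(h)
        seq = np.stack(states)           # this layer's states feed the next layer
    return seq

rng = np.random.default_rng(0)
sizes = [3, 4, 4]                        # input size, then two hidden layers
layers = [(rng.normal(scale=0.1, size=(n_out, n_in)),
           rng.normal(scale=0.1, size=(n_out, n_out)))
          for n_in, n_out in zip(sizes[:-1], sizes[1:])]

top_states = stacked_rnn(rng.normal(size=(6, 3)), layers)
print(top_states.shape)                  # (6, 4)
```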
Recursive Neural Networks (RvNNs)
 Recursive Neural Networks are a class of deep neural
networks that can learn detailed and structured information.
With RvNN, we get a structured prediction by recursively
applying the same set of weights on structured inputs.
 Recursive neural networks represent yet another
generalization of recurrent networks, with a different kind of
computational graph, which is structured as a deep tree,
rather than the chain-like structure of RNNs.
 Due to their deep tree-like structure, Recursive Neural
Networks can handle hierarchical data. The tree structure
means combining child nodes and producing parent nodes.
Each child-parent bond has a weight matrix, and similar
children have the same weights. The number of children for
A typical computational graph for a recursive network is illustrated in the figure.
A variable-size sequence x(1), x(2), . . . , x(t) can be mapped to a fixed-size representation (the output o), with a fixed set of parameters (the weight matrices U, V, W).
Recursive Neural Networks
 Recursive networks have been successfully applied to processing
data structures as input to neural nets in natural language
processing as well as in computer vision .
 RvNNs are used when there's a need to parse an entire sentence.
 One clear advantage of recursive nets over recurrent nets is that
for a sequence of the same length τ, the depth (measured as the
number of compositions of nonlinear operations) can be drastically
reduced from τ to O(log τ ), which might help deal with long-term
dependencies.
 One option is to use a tree structure that does not depend on the data, such as a balanced binary tree.
 Ideally, the learner itself would discover and infer the tree structure that is appropriate for any given input.
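The depth reduction can be illustrated with a balanced binary tree over 8 inputs. This is a sketch under assumptions: each parent is a tanh of a weighted sum of its two children, the same two weight matrices are shared at every node, and an unpaired node is carried up unchanged. The names `compose` and `reduce_balanced` are hypothetical.

```python
import numpy as np

def compose(left, right, W_l, W_r):
    """Combine two child representations into one parent (shared weights)."""
    return np.tanh(W_l @ left + W_r @ right)

def reduce_balanced(nodes, W_l, W_r):
    """Combine nodes pairwise, level by level; returns (root, depth)."""
    depth = 0
    while len(nodes) > 1:
        nxt = [compose(nodes[i], nodes[i + 1], W_l, W_r)
               for i in range(0, len(nodes) - 1, 2)]
        if len(nodes) % 2 == 1:
            nxt.append(nodes[-1])        # an unpaired node is carried up unchanged
        nodes = nxt
        depth += 1
    return nodes[0], depth

rng = np.random.default_rng(0)
d = 4
W_l = rng.normal(scale=0.1, size=(d, d))
W_r = rng.normal(scale=0.1, size=(d, d))
leaves = [rng.normal(size=d) for _ in range(8)]

root, depth = reduce_balanced(leaves, W_l, W_r)
print(depth)   # 3 levels for 8 leaves, versus 8 steps along a chain
```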
Recurrent Neural Networks vs Recursive Neural Networks
 Recurrent Neural Networks (RNNs) are another well-known class of neural networks used for processing sequential data. They are closely related to the Recursive Neural Network.
 Recurrent neural networks model temporal sequences, so they find application in natural language processing (NLP), since language data such as sentences and paragraphs are sequential in nature.
 Recurrent networks are usually chain structures.
 The weights are shared across the chain length, keeping the dimensionality constant.
Recurrent Neural Networks vs Recursive Neural Networks
 On the other hand, Recursive Neural Networks operate on hierarchical data models due to their tree structure. There is a fixed number of children for each node in the tree, so that the network can execute recursive operations and use the same weights at each step. Child representations are combined into parent representations.
 The efficiency of a recursive network is higher than that of a feed-forward network.
 Recurrent networks apply the recursion along a chain over time; recursive networks generalize this to tree structures.
LSTM - Long Short-Term Memory
 Long Short-Term Memory is a kind of recurrent neural network.
 In an RNN, the output from the previous step is fed as input to the current step.
 LSTM tackles the long-term dependency problem of RNNs: a plain RNN predicts well from recent information but struggles to use information from far back in the sequence.
 As the gap length increases, the performance of a plain RNN degrades.
 An LSTM can, by default, retain information for a long period of time.
LSTM
 Long Short-Term Memory (LSTM) is a type of Recurrent Neural Network (RNN) that is specifically designed to handle sequential data, such as time series, speech, and text.
 LSTM networks are capable of learning long-term
dependencies in sequential data, which makes them well
suited for tasks such as language translation, speech
recognition, and time series forecasting.
How does it work?
 A traditional RNN has a single hidden state that is passed through time,
which can make it difficult for the network to learn long-term
dependencies.
 LSTMs address this problem by introducing a memory cell, which is a
container that can hold information for an extended period of time.
 The memory cell is controlled by three gates:

1. Input gate,
2. Forget gate,
3. Output gate.
 These gates decide what information to add to, remove from, and
output from the memory cell.
 The input gate controls what information is added to the memory cell.
 The forget gate controls what information is removed from the
memory cell.
 The output gate controls what information is output from the memory
cell.
 This allows LSTM networks to selectively retain or discard information
as it flows through the network, which allows them to learn long-term
dependencies.
 LSTMs can be stacked to create deep LSTM networks, which can learn
even more complex patterns in sequential data.
 LSTMs can also be used in combination with other neural network
architectures, such as Convolutional Neural Networks (CNNs) for image
and video analysis.
Structure of LSTM: the repeating module in an LSTM contains four interacting layers.
Structure of LSTM
 An LSTM has a chain structure that contains four neural network layers and memory blocks called cells.
 [Figure labels: cell state Ct-1 → Ct]
 Information is retained by the cells, and the memory manipulations are done by the gates.
Gates
 Gates are a way to optionally let information through.
 They are composed out of a sigmoid neural net layer
and a pointwise multiplication operation.
 The sigmoid layer outputs numbers between zero and
one, describing how much of each component should be
let through. A value of zero means “let nothing
through,” while a value of one means “let everything
through!”
 An LSTM has three of these gates, to protect and control
the cell state.
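A gate as described above can be sketched in a few lines: a sigmoid squashes its input to a value between 0 and 1, and a pointwise multiplication scales each component of the signal. The variable names and the example numbers are illustrative only.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

signal = np.array([5.0, -3.0, 2.0])
gate_inputs = np.array([-20.0, 0.0, 20.0])   # very negative, neutral, very positive

gate = sigmoid(gate_inputs)                  # roughly [0, 0.5, 1]
passed = gate * signal                       # pointwise multiplication
print(passed)   # first component blocked, second halved, third passed almost unchanged
```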
Forget Gate
The information that is no longer useful in the cell state is removed by the forget gate.
Two inputs, xt (the input at the current time step) and ht-1 (the previous hidden state), are fed to the gate, multiplied by weight matrices, and a bias is added.
The result is passed through a sigmoid activation function, which gives an output between 0 and 1 for each component of the cell state.
An output near 0 means that piece of information is forgotten; an output near 1 means the information is retained for future use.
Input gate
 The addition of useful information to the cell state is done by the input gate.
 First, the information is regulated using the sigmoid function, which filters the values to be remembered, similar to the forget gate, using inputs ht-1 and xt.
 Then, a vector of candidate values is created from ht-1 and xt using the tanh function, which gives outputs from -1 to +1.
 Finally, the candidate vector and the regulated values are multiplied to obtain the useful information, which is added to the cell state.
Output gate
The task of extracting useful information from the current cell state to be presented as output is done by the output gate.
First, a vector is generated by applying the tanh function to the cell state.
Then, the information is regulated using the sigmoid function, which filters the values to be output using inputs ht-1 and xt.
Finally, the vector and the regulated values are multiplied, and the result is sent as the output and as the hidden-state input to the next cell.
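Putting the three gates together, one LSTM step can be sketched as below. This follows the commonly used formulation, with the previous hidden state and the current input concatenated before each weight matrix. The parameter names (`W_f`, `b_f`, and so on), sizes, and random untrained weights are assumptions for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, p):
    """One LSTM step; p holds hypothetical weight matrices and biases."""
    z = np.concatenate([h_prev, x_t])           # inputs ht-1 and xt together
    f_t = sigmoid(p["W_f"] @ z + p["b_f"])      # forget gate: what to drop from the cell
    i_t = sigmoid(p["W_i"] @ z + p["b_i"])      # input gate: how much new info to admit
    g_t = np.tanh(p["W_g"] @ z + p["b_g"])      # candidate values in (-1, 1)
    c_t = f_t * c_prev + i_t * g_t              # updated cell state
    o_t = sigmoid(p["W_o"] @ z + p["b_o"])      # output gate: what to expose
    h_t = o_t * np.tanh(c_t)                    # new hidden state / output
    return h_t, c_t

rng = np.random.default_rng(0)
n_x, n_h = 3, 4
params = {}
for name in "figo":
    params[f"W_{name}"] = rng.normal(scale=0.1, size=(n_h, n_h + n_x))
    params[f"b_{name}"] = np.zeros(n_h)

h, c = np.zeros(n_h), np.zeros(n_h)
for x_t in rng.normal(size=(5, n_x)):           # run over a sequence of 5 inputs
    h, c = lstm_step(x_t, h, c, params)
print(h.shape, c.shape)                         # (4,) (4,)
```

The four weight matrices correspond to the four interacting layers of the repeating module: three sigmoid gates and one tanh layer.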
Some of the applications of LSTM

1. Language Modeling: LSTMs have been used for natural language processing
tasks such as language modeling, machine translation, and text summarization.
They can be trained to generate coherent and grammatically correct sentences
by learning the dependencies between words in a sentence.
2. Speech Recognition: LSTMs have been used for speech recognition tasks such
as transcribing speech to text and recognizing spoken commands. They can be
trained to recognize patterns in speech and match them to the corresponding
text.
3. Time Series Forecasting: LSTMs have been used for time series forecasting
tasks such as predicting stock prices, weather, and energy consumption. They
can learn patterns in time series data and use them to make predictions about
future events.
4. Anomaly Detection: LSTMs have been used for anomaly detection tasks
such as detecting fraud and network intrusion. They can be trained to
identify patterns in data that deviate from the norm and flag them as
potential anomalies.

5. Recommender Systems: LSTMs have been used for recommendation tasks such as recommending movies, music, and books. They can learn patterns in user behaviour and use them to make personalized recommendations.

6. Video Analysis: LSTMs have been used for video analysis tasks such as
object detection, activity recognition, and action classification. They can
be used in combination with other neural network architectures, such as
Convolutional Neural Networks (CNNs), to analyze video data and extract
useful information.
GRU - Gated Recurrent Unit
 The GRU, or gated recurrent unit, is an advancement of the standard RNN. It was introduced by Kyunghyun Cho et al. in 2014.
 Among sequence modeling techniques, the GRU came after the RNN and the LSTM and addresses the same long-term dependency problem.
 GRUs are very similar to Long Short-Term Memory (LSTM) networks.
 Just like an LSTM, a GRU uses gates to control the flow of information. Its architecture is simpler than that of the LSTM, with fewer gates and a single state.
Structure of GRU compared to LSTM
 Due to the simpler architecture, GRUs have fewer parameters and are typically faster to train.
The architecture of Gated Recurrent Unit
 At each time step t, it takes an input Xt and the hidden state Ht-1 from the previous time step t-1. It then outputs a new hidden state Ht, which is passed on to the next time step.
 There are two gates in a GRU, as opposed to three gates in an LSTM cell.
 The first gate is the reset gate and the other is the update gate.
Reset Gate (Short term memory)
 The reset gate is responsible for the short-term memory of the network, i.e. the hidden state (Ht).
 Equation of the reset gate.
 The value of rt ranges from 0 to 1 because of the sigmoid function.
 Here Ur and Wr are weight matrices for the reset gate.
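A common way to write the reset gate, assuming Ur acts on the input Xt and Wr on the previous hidden state Ht-1 (this assignment of the two matrices is an assumption here), with σ the sigmoid function and bias terms omitted:

```latex
r_t = \sigma\left( x_t U_r + H_{t-1} W_r \right)
```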
Update Gate (Long Term memory)
 There is an update gate for long-term memory, and the equation of the gate is shown below.
 Uu and Wu are weight matrices for the update gate.
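The update gate has the same form as the reset gate with its own weights. As a sketch under the same assumption (Uu acts on the input, Wu on the previous hidden state, bias terms omitted):

```latex
u_t = \sigma\left( x_t U_u + H_{t-1} W_u \right)
```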


Working of GRU
 To find the hidden state Ht, a GRU follows a two-step process.
 The first step is to generate what is known as the candidate hidden state.
Candidate Hidden State
 It takes in the input and the hidden state from the previous time step t-1, which is multiplied by the reset gate output rt.
 This combined information is then passed through the tanh function; the resulting value is the candidate hidden state.
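Written as an equation, the description above corresponds to the following form. The weight names U_g and W_g are hypothetical (they are not named in these slides), ⊙ denotes element-wise multiplication, and bias terms are omitted:

```latex
\hat{H}_t = \tanh\left( x_t U_g + (r_t \odot H_{t-1}) W_g \right)
```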
Working of GRU …
 The most important part of this equation is how it uses the value of the reset gate to control how much influence the previous hidden state can have on the candidate state.
 If the value of rt is equal to 1, the entire information from the previous hidden state Ht-1 is considered.
 Likewise, if the value of rt is 0, the information from the previous hidden state is completely ignored.
Working of GRU …
Hidden state
 Once we have the candidate state, it is used to generate the current hidden state Ht.
 This is where the update gate comes into the picture.
 Instead of using separate gates as in an LSTM, a GRU uses a single update gate to control both the historical information, which is Ht-1, and the new information, which comes from the candidate state: Ht = ut ⊙ Ht-1 + (1 − ut) ⊙ (candidate state), where ⊙ is element-wise multiplication.
 Now assume the value of ut is around 0. Then the first term in the equation vanishes, which means the new hidden state will not have much information from the previous hidden state. The factor (1 − ut) in the second term becomes almost one, so the hidden state at the current time step consists of the information from the candidate state only.
 Similarly, if the value of ut is 1, the second term becomes entirely 0 and the current hidden state depends entirely on the first term, i.e. the information from the hidden state at the previous time step t-1.
 Hence the value of ut, which can range from 0 to 1, is critical in this equation.
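The whole GRU step can be sketched in NumPy as below. This is a minimal sketch under assumptions: inputs are row vectors multiplied on the left of each weight matrix, bias terms are omitted, the weights are random and untrained, and the candidate-state weights `U_g` and `W_g` are hypothetical names.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, h_prev, p):
    """One GRU step; p holds hypothetical weight matrices (no biases)."""
    r_t = sigmoid(x_t @ p["U_r"] + h_prev @ p["W_r"])             # reset gate
    u_t = sigmoid(x_t @ p["U_u"] + h_prev @ p["W_u"])             # update gate
    h_cand = np.tanh(x_t @ p["U_g"] + (r_t * h_prev) @ p["W_g"])  # candidate state
    h_t = u_t * h_prev + (1.0 - u_t) * h_cand                     # mix old and new
    return h_t, (r_t, u_t, h_cand)

rng = np.random.default_rng(0)
n_x, n_h = 3, 4
params = {}
for gate in "rug":
    params[f"U_{gate}"] = rng.normal(scale=0.1, size=(n_x, n_h))
    params[f"W_{gate}"] = rng.normal(scale=0.1, size=(n_h, n_h))

x_t = rng.normal(size=n_x)
h_prev = rng.normal(size=n_h)
h_t, (r_t, u_t, h_cand) = gru_step(x_t, h_prev, params)
print(h_t.shape)   # (4,)
```

Each component of the new hidden state lies between the corresponding components of the previous hidden state and the candidate state, with ut deciding how close it sits to each.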
How is GRU different from LSTM?
 An LSTM has three gates, whereas a GRU has only two gates.
 In an LSTM they are the input gate, forget gate, and output gate, whereas in a GRU they are the reset gate and update gate.
 An LSTM has two states: the cell state, or long-term memory, and the hidden state, also known as short-term memory.
 In the case of a GRU, there is only one state, i.e. the hidden state (Ht).
References
 Goodfellow I, Bengio Y, and Courville A, Deep Learning, MIT Press, 2016
 [Link]
 [Link]
lstm-tutorial-deep-learning-tutorial-simplilearn
 [Link]
memory/
