Module # 5
Recurrent Neural Network
CSDC8011.5
Analyze and compare different types of Recurrent Neural
Networks (RNNs) to select appropriate models for sequential
data applications.
What is Sequence Learning Problem?
► In all of the networks that we have covered so far (Fully Connected Neural Network (FCNN), Convolutional Neural Network (CNN)):
► the output for one input is independent of the previous inputs/outputs
► the input was always of a fixed length/size
Sequence Learning Problem
► In sequence learning problems, these two properties of FCNNs and CNNs do not hold:
► The output at any time step depends on previous inputs/outputs
► The length of the input is not fixed.
► Consider auto-completion: the user types the letter ‘d’, and the model tries to predict the next character.
► The FCNN and CNN models covered so far will not work here.
● Data comes in sequence form
○ Text (sentences)
○ Time-series (stock prices)
○ Speech signals
● Problem:
○ Traditional NN cannot remember past inputs
● Need:
○ Model that uses previous information
Unfolding Computational Graphs
A computational graph is a directed graph where:
● Nodes represent variables or intermediate values
● Edges represent operations that transform these values
The beauty of computational graphs is that they allow us to break down complex functions
into simple operations, making it straightforward to compute derivatives using the chain
rule.
► We can unfold a recursive or recurrent computation into a computational graph that
has a repetitive structure
• Corresponding to a chain of events
► Unfolding this graph results in sharing of parameters across a deep network structure
Unfolding Computational Graph
● RNN can be unrolled over time
● Each time step = one layer
👉 Example:
x₁ → h₁ → y₁
x₂ → h₂ → y₂
x₃ → h₃ → y₃
h₁ → h₂ → h₃ (each hidden state is also passed on to the next time step)
● Helps in training and visualization
When training neural networks, we need to compute
gradients of a loss function with respect to model
parameters.
For a network with thousands or millions of
parameters, computing these derivatives manually
would be impractical.
Deep learning frameworks solve this problem using
computational graphs combined with automatic
differentiation.
The real power of computational graphs becomes apparent when we scale to deep
networks with many layers:
x → Block 1 → x₁ → Block 2 → x₂ → … → Block n → xₙ → Final Block → z
Forward Pass
During the forward pass, we compute the output by sequentially applying each
computational block:
● Start with input x
● Apply Block 1 to get x₁
● Apply Block 2 to get x₂
● Continue through all blocks
● Get final output z
Backward Pass
For the backward pass, we compute gradients iteratively using the chain rule, moving from the final block back to Block 1.
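The forward and backward passes can be sketched in a few lines of plain Python. This is a minimal illustration, not how a real framework is implemented: the three scalar blocks (scale, shift, tanh) and the names `blocks`, `forward`, and `backward` are assumptions chosen for the example.

```python
import math

# Each block is a (function, derivative) pair acting on a scalar.
# These three blocks are illustrative choices, not taken from the slides.
blocks = [
    (lambda x: 3.0 * x, lambda x: 3.0),                    # Block 1: scale
    (lambda x: x + 1.0, lambda x: 1.0),                    # Block 2: shift
    (math.tanh, lambda x: 1.0 - math.tanh(x) ** 2),        # Final block
]

def forward(x):
    """Apply the blocks in order, remembering the input seen by each block."""
    inputs = []
    for f, _ in blocks:
        inputs.append(x)
        x = f(x)
    return x, inputs

def backward(inputs):
    """Walk the blocks in reverse, multiplying local derivatives (chain rule)."""
    grad = 1.0  # dz/dz
    for (_, df), x_in in zip(reversed(blocks), reversed(inputs)):
        grad *= df(x_in)
    return grad  # dz/dx

z, cache = forward(0.2)
dz_dx = backward(cache)
print(z, dz_dx)
```

The backward pass reuses the values cached during the forward pass, which is the reason frameworks record the graph while computing the output.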
Dynamic vs Static Graphs
PyTorch uses dynamic computational graphs (built on-the-fly during
execution), while older versions of TensorFlow used static graphs
(defined before execution).
Dynamic graphs are more flexible and intuitive, which is one reason
for PyTorch’s popularity.
What is RNN?
● RNN = Neural Network with memory
● Uses the previous step's output (hidden state) as input to the current step
● Handles sequential data
Example:
● Predict next word in sentence
● “I am going to ___”
Recurrent Neural Network
► A Recurrent Neural Network (RNN) is a type of neural network where the output from the previous step is fed as input to the current step.
► In traditional neural networks, all the inputs and outputs are independent of each other. To predict the next word of a sentence, however, the previous words are required, so the network needs to remember them.
► RNNs address this with the help of a hidden layer that maintains a hidden state.
Recurrent Neural Network
► The main and most important feature of RNN is its Hidden state,
which remembers some information about a sequence.
► The state is also referred to as the memory state, since it summarizes the previous inputs to the network.
► The RNN uses the same parameters at every time step, because it performs the same task on each input and hidden state to produce the output.
► This keeps the number of parameters small, unlike other neural networks where each layer has its own weights.
RNN Working
● Input → Hidden State → Output
● Hidden state stores past information
● Same weights used at every step
Key idea:
● Loop (feedback connection)
Architecture of Recurrent Neural Network
► RNNs have the same input and output architecture as any other deep
neural architecture.
► However, differences arise in the way information flows from input to
output.
► Unlike deep neural networks, where each dense layer has its own weight matrix, in an RNN the weights are the same across all time steps.
► It calculates a hidden state Hi for every input Xi.
How RNN works
► The Recurrent Neural Network consists of multiple fixed activation
function units, one for each time step.
► Each unit has an internal state which is called the hidden state of the
unit.
► This hidden state signifies the past knowledge that the network
currently holds at a given time step.
► This hidden state is updated at every time step to signify the change
in the knowledge of the network about the past.
► The hidden state is updated using the following recurrence relation:
How RNN works
► The formula for calculating the current state:
► $h_t = f(h_{t-1}, x_t)$
► where:
► $h_t$ -> current state
► $h_{t-1}$ -> previous state
► $x_t$ -> input at the current time step
How RNN works
► Formula for applying the activation function (tanh):
► $h_t = \tanh(W_{hh} h_{t-1} + W_{xh} x_t)$
► where:
► $W_{hh}$ -> weight at recurrent neuron
► $W_{xh}$ -> weight at input neuron
How RNN works
► The formula for calculating output:
► $y_t = W_{hy} h_t$
► $y_t$ -> output
► $W_{hy}$ -> weight at output layer
► These parameters are updated using Backpropagation.
► However, since RNN works on sequential data here we use an updated
backpropagation which is known as Backpropagation through time.
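The three formulas on these slides can be put together as a small forward pass. The sketch below uses plain Python lists, assumes tanh as the activation and no bias terms, and uses made-up weight values; the function names `rnn_step` and `rnn_forward` are hypothetical.

```python
import math

def rnn_step(h_prev, x_t, W_hh, W_xh):
    """One step: h_t = tanh(W_hh h_{t-1} + W_xh x_t), vectors as lists."""
    h_t = []
    for i in range(len(W_hh)):
        s = sum(W_hh[i][j] * h_prev[j] for j in range(len(h_prev)))
        s += sum(W_xh[i][k] * x_t[k] for k in range(len(x_t)))
        h_t.append(math.tanh(s))
    return h_t

def rnn_forward(xs, h0, W_hh, W_xh, W_hy):
    """Run the same weights over every time step of the sequence xs."""
    h, hs, ys = h0, [], []
    for x_t in xs:
        h = rnn_step(h, x_t, W_hh, W_xh)
        # Output: y_t = W_hy h_t
        y = [sum(W_hy[i][j] * h[j] for j in range(len(h)))
             for i in range(len(W_hy))]
        hs.append(h)
        ys.append(y)
    return hs, ys

# Illustrative weights: 2 hidden units, 1 input feature, 1 output.
W_hh = [[0.5, -0.3], [0.1, 0.4]]
W_xh = [[1.0], [-1.0]]
W_hy = [[1.0, 1.0]]
xs = [[1.0], [0.5], [-1.0]]  # a sequence of length 3
hs, ys = rnn_forward(xs, [0.0, 0.0], W_hh, W_xh, W_hy)
print(hs[-1], ys[-1])
```

Because the loop reuses `W_hh`, `W_xh`, and `W_hy` at every step, the same function handles a sequence of any length without adding parameters.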
Backpropagation Through Time
► In an RNN the computation is ordered: each variable is computed one at a time in a specified order, first h1, then h2, then h3, and so on.
► Hence we apply backpropagation through all these hidden states sequentially, in reverse order.
Backpropagation Through Time
► L(θ)(loss function) depends on h3
► h3 in turn depends on h2 and W
► h2 in turn depends on h1 and W
► h1 in turn depends on h0 and W
► where h0 is a constant starting state.
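For this three-step chain, the dependency list above can be written out as a single gradient expression. This is a sketch under the assumption that the same W is used at every step and that each state has the form $h_k = f(h_{k-1}, W)$; the symbol $\partial^{+} h_k / \partial W$ is notation introduced here for the direct derivative of $h_k$ with respect to W, treating $h_{k-1}$ as a constant.

```latex
\frac{\partial L}{\partial W}
  = \sum_{k=1}^{3}
    \frac{\partial L}{\partial h_3}
    \left( \prod_{j=k+1}^{3} \frac{\partial h_j}{\partial h_{j-1}} \right)
    \frac{\partial^{+} h_k}{\partial W}
```

For k = 3 the product is empty (equal to 1). For k = 1 it contains two factors, which is why contributions from early time steps pass through the longest chain of derivatives.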
Backpropagation Through Time (BPTT)
● Training method for RNN
● Errors are propagated back through time steps
Steps:
1. Forward pass
2. Calculate error
3. Backward pass through all time steps
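The three steps can be sketched for the simplest possible case: a scalar RNN with a single hidden unit. The squared-error loss on the final state, the tanh activation, and the names `bptt_scalar`, `w_hh`, and `w_xh` are assumptions made for this illustration.

```python
import math

def bptt_scalar(xs, target, w_hh, w_xh, h0=0.0):
    """Return the loss and its gradients w.r.t. w_hh and w_xh."""
    # 1. Forward pass: store every hidden state, hs[0] is the start state.
    hs = [h0]
    for x in xs:
        hs.append(math.tanh(w_hh * hs[-1] + w_xh * x))

    # 2. Calculate error (assumed here: squared error on the last state).
    loss = 0.5 * (hs[-1] - target) ** 2

    # 3. Backward pass through all time steps, last step first.
    dh = hs[-1] - target              # dL/dh_T
    g_hh = g_xh = 0.0
    for t in range(len(xs), 0, -1):
        da = dh * (1.0 - hs[t] ** 2)  # back through tanh at step t
        g_hh += da * hs[t - 1]        # shared weight: gradients accumulate
        g_xh += da * xs[t - 1]
        dh = da * w_hh                # pass the error to step t-1
    return loss, g_hh, g_xh

loss, g_hh, g_xh = bptt_scalar([1.0, 0.5, -1.0], 0.3, w_hh=0.8, w_xh=0.5)
print(loss, g_hh, g_xh)
```

Since the weights are shared across time, each step adds its own contribution to the same two gradients instead of producing separate ones.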
Need for bidirectionality
► In speech recognition, the correct interpretation of the
current sound may depend on the next few phonemes
because of coarticulation and the next few words
because of linguistic dependencies
► Also true of handwriting recognition
A bidirectional RNN
A bidirectional RNN combines an RNN that moves forward through time from the start of the sequence with another RNN that moves backward through time beginning from the end of the sequence.
It consists of two RNNs stacked on top of each other: one processes the input in its original order, and the other processes the reversed input sequence.
The output is then computed based on the hidden states of both RNNs.
► A typical bidirectional RNN maps input sequences x to target sequences y, with loss L(t) at each step t.
► The h recurrence propagates information forward in time (to the right); the g recurrence propagates information backward in time (to the left).
► This allows the output units o(t) to compute a representation that depends on both the past and the future.
Bidirectional RNN
● Uses:
○ Forward sequence
○ Backward sequence
Advantage:
● Uses past + future context
Example:
● Understanding sentence meaning
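A minimal sketch of the idea, assuming scalar inputs, scalar hidden states, and tanh units: one RNN reads the sequence left to right, a second reads it right to left, and the two states are paired at each position. The names `run_rnn` and `bidirectional` and the weight values are hypothetical.

```python
import math

def run_rnn(xs, w_rec, w_in):
    """Scalar RNN: returns the hidden state after each input."""
    h, states = 0.0, []
    for x in xs:
        h = math.tanh(w_rec * h + w_in * x)
        states.append(h)
    return states

def bidirectional(xs, fwd_params, bwd_params):
    """Pair the forward state h(t) with the backward state g(t) at each t."""
    h = run_rnn(xs, *fwd_params)              # sees x_1 .. x_t
    g = run_rnn(xs[::-1], *bwd_params)[::-1]  # sees x_T .. x_t
    return list(zip(h, g))

xs = [1.0, 0.0, -1.0, 0.5]
states = bidirectional(xs, (0.5, 1.0), (0.5, 1.0))
for t, (h_t, g_t) in enumerate(states, start=1):
    print(t, h_t, g_t)
```

An output layer would then combine each pair, for example by concatenating the two states, so that the prediction at step t can use both past and future context.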
► Exploding and vanishing gradient problems arise during backpropagation.
► Gradients are the values used to update a neural network's weights. In other words, the gradient carries the information the network learns from.
Vanishing Gradient Problem
● Gradients become very small
● Model stops learning long-term dependencies
Problem:
Cannot remember old information
● Vanishing gradient is a big problem in deep neural networks. As the gradient is propagated back to earlier layers (earlier time steps in an RNN), it shrinks or grows quickly. This makes the RNN unable to hold information over longer sequences, so the RNN effectively has only short-term memory.
● If we apply an RNN to a paragraph, it may leave out necessary information due to gradient problems and fail to carry information from the initial time steps to later time steps.
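A small numeric illustration of why this happens: during backpropagation through time, the gradient reaching an early step is a product of one factor per time step. The sketch below assumes, purely for illustration, that this factor is the same constant at every step; the function name `gradient_through_time` is hypothetical.

```python
def gradient_through_time(factor, steps):
    """Multiply the same per-step factor `steps` times."""
    grad = 1.0
    for _ in range(steps):
        grad *= factor
    return grad

vanishing = gradient_through_time(0.5, 50)  # factor below 1: shrinks
exploding = gradient_through_time(1.5, 50)  # factor above 1: grows
print(vanishing, exploding)
```

With a factor below 1 the gradient from 50 steps back is effectively zero, so the weights receive almost no signal about early inputs. With a factor above 1 the same product becomes very large.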
Exploding Gradient Problem
● Gradients become very large
● Model becomes unstable
Solution:
● Gradient clipping
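One common form of gradient clipping rescales the whole gradient vector when its norm exceeds a threshold. The sketch below is a plain-Python illustration of that idea; the function name `clip_by_norm` and the threshold value are assumptions, not part of the slides.

```python
import math

def clip_by_norm(grads, max_norm):
    """Rescale grads so their L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm:
        return list(grads)          # small gradients are left unchanged
    scale = max_norm / norm
    return [g * scale for g in grads]

clipped = clip_by_norm([30.0, -40.0], max_norm=5.0)
print(clipped)
```

Clipping changes only the length of the update, not its direction, so training stays stable without discarding the gradient signal.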
► A plain RNN carries relevant and irrelevant information forward alike. What is needed is a model that can decide which information in a paragraph is relevant, remember only that, and throw away the rest.
► This is achieved by using gates. LSTM (Long Short-Term Memory) and GRU (Gated Recurrent Unit) networks have gates as an internal mechanism, which control what information to keep and what information to throw out. In this way, LSTM and GRU networks largely mitigate the vanishing gradient problem; exploding gradients are still usually handled with gradient clipping.
► Almost every state-of-the-art (SOTA) RNN-based model uses LSTM or GRU networks for prediction.
LSTM
► Long Short-Term Memory (LSTM) networks are sequential neural networks that allow information to persist.
► LSTM is a special type of Recurrent Neural Network that is capable of handling the vanishing gradient problem faced by plain RNNs.
► The shortcoming of plain RNNs is that they cannot remember long-term dependencies, due to the vanishing gradient. LSTMs are explicitly designed to avoid this long-term dependency problem.
What is LSTM?
LSTM (Long Short-Term Memory) is a recurrent neural network (RNN) architecture
widely used in Deep Learning. It excels at capturing long-term dependencies,
making it ideal for sequence prediction tasks.
LSTM has become a powerful tool in artificial intelligence and deep learning,
enabling breakthroughs in various fields by uncovering valuable insights from
sequential data
Every LSTM cell contains three gates to control the flow of information and a cell state to hold information. The cell state carries information from early time steps to later ones without it vanishing.
Forget Gate:
–This gate decides what information should be carried forward and what information should be ignored.
–The previous hidden state and the current input pass through the sigmoid function. Values that come out of the sigmoid are always between 0 and 1.
–A value closer to 1 means the information should be kept, and a value closer to 0 means it should be ignored.
Input Gate:
–After the forget gate has decided what to keep, the input gate decides which new information is relevant and passes it on, which updates the cell state. Put simply, this is the step that writes to memory.
–The input gate adds the new relevant information to the existing information by updating the cell state.
Output Gate:
–After the cell state has been updated through the input gate, the output gate comes into play.
–The output gate generates the next hidden state. The cell state and the hidden state are carried over to the next time step.
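The three gates can be sketched as one step of a scalar LSTM cell. This follows the commonly used LSTM formulation rather than anything specific to these slides; the function `lstm_step`, the parameter dictionary `p`, and the weight values are assumptions chosen to make the gate behaviour easy to see.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x, h_prev, c_prev, p):
    """One LSTM step on scalars. p maps a name to (w_x, w_h, bias)."""
    def pre(name):
        w_x, w_h, b = p[name]
        return w_x * x + w_h * h_prev + b

    f = sigmoid(pre("forget"))       # forget gate: how much old memory to keep
    i = sigmoid(pre("input"))        # input gate: how much new info to write
    g = math.tanh(pre("candidate"))  # candidate values to write
    o = sigmoid(pre("output"))       # output gate: how much memory to expose

    c = f * c_prev + i * g           # updated cell state
    h = o * math.tanh(c)             # next hidden state
    return h, c

# Illustrative setting: forget gate almost fully open, input gate almost shut.
p = {
    "forget": (0.0, 0.0, 20.0),
    "input": (0.0, 0.0, -20.0),
    "candidate": (1.0, 1.0, 0.0),
    "output": (0.0, 0.0, 0.0),
}
h, c = lstm_step(1.0, 0.0, 0.7, p)
print(h, c)
```

With these settings the cell state stays very close to its previous value of 0.7, which shows how the cell state can carry information across time steps when the gates allow it.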
Long Short-Term Memory (LSTM)
● Special type of RNN
● Solves vanishing gradient problem
Uses gates:
● Control information flow
LSTM Gates (Simple)
1. Forget Gate → What to remove
2. Input Gate (Write) → What to store
3. Output Gate (Read) → What to output
Acts like:
Memory box with control switches
Selective Operations in LSTM
● Selective Read → Output important info
● Selective Write → Store useful info
● Selective Forget → Remove useless info
GRU
GRUs (Gated Recurrent Units) are similar to LSTM networks. The GRU is a newer variant of the gated RNN. However, there are some differences between GRU and LSTM:
–GRU doesn’t contain a cell state
–GRU uses its hidden state to transport information
–It contains only 2 gates (Reset and Update Gate)
–GRU is generally faster than LSTM
–GRU has fewer tensor operations, which makes it faster
1. Update Gate
–The update gate acts as a combination of the LSTM's forget gate and input gate. It decides what information to ignore and what information to add to memory.
2. Reset Gate
–The reset gate determines how much past information should be forgotten when the new candidate state is computed.
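The two gates can be sketched as one step of a scalar GRU. This is an illustration under stated assumptions: conventions differ on whether z or (1 - z) multiplies the old state, and the version below uses h = (1 - z) * h_prev + z * candidate. The function `gru_step`, the parameter dictionary `p`, and the weight values are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def gru_step(x, h_prev, p):
    """One GRU step on scalars. p maps a name to (w_x, w_h, bias)."""
    wz_x, wz_h, bz = p["update"]
    wr_x, wr_h, br = p["reset"]
    wc_x, wc_h, bc = p["candidate"]

    z = sigmoid(wz_x * x + wz_h * h_prev + bz)  # update gate
    r = sigmoid(wr_x * x + wr_h * h_prev + br)  # reset gate
    # The reset gate scales how much of the old state enters the candidate.
    h_cand = math.tanh(wc_x * x + wc_h * (r * h_prev) + bc)
    # The update gate blends the old state with the candidate.
    return (1.0 - z) * h_prev + z * h_cand

# Illustrative setting: update gate almost closed, so the state is kept.
p = {
    "update": (0.0, 0.0, -20.0),
    "reset": (0.0, 0.0, 0.0),
    "candidate": (1.0, 1.0, 0.0),
}
h = gru_step(1.0, 0.6, p)
print(h)
```

There is no separate cell state here: the single hidden state `h` is both the memory and the output, which is where the GRU saves computation compared with the LSTM.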
Gated Recurrent Unit (GRU)
● Simpler than LSTM
● Uses:
○ Update Gate
○ Reset Gate
Advantages:
● Faster
● Less complex
● Good performance
LSTM vs GRU
Recent Trends & Applications
● NLP (Chatbots, Translation)
● Speech Recognition
● Time-series Forecasting
● Video Analysis
Used in:
● Google Translate
● Voice Assistants
RNN handles sequence data
BPTT used for training
Problems:
● Vanishing gradient
Solutions:
● LSTM, GRU