Understanding Recurrent Neural Networks

This document provides an overview of Recurrent Neural Networks (RNNs), including their structure, functionality, and advantages in handling sequential data. It explains concepts such as parameter sharing, unfolding computational graphs, and the differences between RNNs and Bidirectional RNNs, which utilize future context for improved accuracy. Additionally, it discusses the potential for deepening RNN architectures to enhance learning capabilities and introduces Recursive Neural Networks as a distinct type of neural network.

Uploaded by

vlpriya742

Module 4

Recurrent and Recursive Neural Networks: Unfolding Computational Graphs,
Recurrent Neural Network, Bidirectional RNNs, Deep Recurrent Networks,
Recursive Neural Networks, The Long Short-Term Memory and Other Gated RNNs.
Text Book – 2: 10.1-10.3, 10.5, 10.6, 10.10
10. Recurrent neural networks
What is a Recurrent Neural Network (RNN)?
A Recurrent Neural Network (RNN) is a special type of neural network
designed to work with sequential data—data where the order matters.
Examples of sequential data include:
• Sentences (where word order matters)
• Time series data (like stock prices)
• Audio signals (where each moment depends on the previous ones)
Just like Convolutional Neural Networks (CNNs) are made for images (grids of
pixels), RNNs are made for sequences of values.
Why Are RNNs Special?
In a regular neural network (like a feedforward neural network):
• Each input is processed independently.
• It’s difficult to use it for data where previous inputs affect the current
prediction.
But in RNNs:
• Previous information is passed forward through something called a
"hidden state".
• This means the network can "remember" what happened before.
So, RNNs are good at handling things where context is important:
For example, in the sentence:
• “I went to Nepal in 2009”
• “In 2009, I went to Nepal.”
Even though "2009" appears at different positions, an RNN can still
learn that it refers to the year of the event.
How Does RNN Work?
At each time step t, the RNN:
1. Takes an input vector x(t) (like a word in a sentence).
2. Combines it with the previous hidden state (memory of the past).
3. Produces:
o An output (like a prediction)
o A new hidden state, passed on to the next time step.
This repeating structure is like a loop:
x(1) → h(1)
        ↓
x(2) → h(2)
        ↓
x(3) → h(3)
        ↓
...    ...
The same function (with the same weights) is used at each time step. This is
called parameter sharing.
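The steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: the tanh activation, the weight names U, W, V, the biases b and c, and all sizes are assumptions chosen for the example.

```python
import numpy as np

def rnn_forward(xs, h0, U, W, V, b, c):
    """Run a simple tanh RNN over a sequence, reusing the same weights at every step."""
    h = h0
    hidden_states, outputs = [], []
    for x in xs:                          # one iteration per time step t
        h = np.tanh(U @ x + W @ h + b)    # combine current input with previous hidden state
        o = V @ h + c                     # output at this step (e.g. prediction scores)
        hidden_states.append(h)
        outputs.append(o)
    return np.stack(hidden_states), np.stack(outputs)

rng = np.random.default_rng(0)
n_in, n_hid, n_out = 3, 4, 2              # illustrative sizes (assumed)
U = rng.normal(size=(n_hid, n_in)) * 0.5
W = rng.normal(size=(n_hid, n_hid)) * 0.5
V = rng.normal(size=(n_out, n_hid)) * 0.5
b = np.zeros(n_hid)
c = np.zeros(n_out)

# The same parameters handle sequences of different lengths.
short_seq = rng.normal(size=(2, n_in))
long_seq = rng.normal(size=(5, n_in))
hs_short, os_short = rnn_forward(short_seq, np.zeros(n_hid), U, W, V, b, c)
hs_long, os_long = rnn_forward(long_seq, np.zeros(n_hid), U, W, V, b, c)
print(hs_short.shape, os_long.shape)
```

Note that nothing in `rnn_forward` depends on the sequence length: the loop simply runs for as many steps as there are inputs.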
Parameter Sharing – Why Does It Matter?
Imagine you have 3 sentences of different lengths:
• "He runs"
• "She is running"
• "They have been running fast"
If your model had different parameters for each word position, it would:
• Need to learn language rules separately for each position.
• Struggle with new sentence lengths.
Instead, RNNs share parameters at every step, so:
• The same rules apply to each time step.
• The model can generalize to longer or shorter sequences.
This is like teaching one rule for "verb tense" and applying it wherever needed,
no matter where the verb is in the sentence.
RNN vs 1D Convolution
• A 1D Convolutional Neural Network also uses parameter sharing, but
each output looks at only a small set of nearby inputs (a local window).
• An RNN, in contrast, builds on the entire sequence history: its hidden
state can, in principle, depend on every input up to the current time step.
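The difference can be checked with a small experiment: change only the first input of a sequence and see which model's last output reacts. This is a sketch under assumed values; the kernel, the scalar RNN weights w and u, and the sequence itself are all made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=8)             # a scalar sequence of length 8
x_changed = x.copy()
x_changed[0] += 1.0                # perturb only the very first input

def conv1d_last(seq, kernel):
    """1D convolution output at the last position: sees only the last len(kernel) inputs."""
    k = len(kernel)
    return float(np.dot(seq[-k:], kernel))

def rnn_last(seq, w=0.9, u=0.5):
    """Final hidden state of a scalar tanh RNN: built from every input so far."""
    h = 0.0
    for x_t in seq:
        h = np.tanh(w * h + u * x_t)
    return float(h)

kernel = np.array([0.2, 0.5, 0.3])
conv_diff = abs(conv1d_last(x, kernel) - conv1d_last(x_changed, kernel))
rnn_diff = abs(rnn_last(x) - rnn_last(x_changed))
print(conv_diff, rnn_diff)
```

The convolution output does not move at all, because the first input lies outside its window. The RNN state does change, although the effect of early inputs usually shrinks as the sequence gets longer.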
Does RNN Always Mean Real-Time?
Not necessarily. The time steps t = 1 to τ just refer to sequence positions, not
real-world time. So RNNs can be used even when:
• The sequence is already complete (like in a paragraph).
• The sequence is spatial (like processing a row in an image).
Some advanced RNNs even process sequences backwards and forwards
(called Bidirectional RNNs).
Cycles in Computation
In most neural networks, data flows straight from input to output (a directed
acyclic graph, or DAG).
But in RNNs, the compact (folded) graph contains a cycle because:
• The hidden state at one step is fed back in at the next step.
• This creates a loop in the computational graph.
This loop allows RNNs to carry information across time, which acts as a form of memory.

✅ Summary for Students

Feature                        Feedforward Neural Net    RNN
Input type                     Fixed, non-sequential     Sequence (like text or time series)
Memory of previous inputs      ❌ No                     ✅ Yes
Handles variable-length data   ❌ No                     ✅ Yes
Parameter Sharing              ❌ No                     ✅ Yes (across time steps)
Example use cases              Image classification      Language modeling, time series, speech

10.1 Unfolding Computational Graphs


Understanding Recurrent Neural Networks (RNNs) and Computational
Graphs
What is a Computational Graph?
A computational graph is a way to clearly show the flow of operations in a
neural network. It represents how inputs, weights (parameters), and functions
are used step-by-step to calculate outputs and losses (errors). It helps us
understand and organize the calculations.
What is Recurrence and Unfolding?
Some systems (like time-based systems or sequences) have repetitive or
looped structures. These are called recurrent systems. In neural networks,
this concept is used in Recurrent Neural Networks (RNNs).
For example, let’s say we have a formula:
s(t) = f(s(t−1); θ)
Here:
• s(t) is the state of the system at time t
• s(t−1) is the state at the previous time step
• θ is a set of shared parameters (weights)
• f is a function that determines how the state changes
This is recursive because the current state depends on the previous one.
To train or analyze this structure, we unfold it.
What Does “Unfolding” Mean?
Unfolding means writing out the loop or repetition step-by-step for a certain
number of time steps.
Example: If we do it for 3 steps:
s(3) = f(s(2); θ)
= f(f(s(1); θ); θ)
Now the repetition is written as a regular sequence of functions. This forms a
computational graph without loops, also called a directed acyclic graph (DAG).
Every step has its own node in the graph.
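Unfolding can be checked directly in code: running the loop and writing out the nested calls by hand give the same number. The transition function below (a scaled tanh) and the values of s(1) and θ are hypothetical choices made only for this sketch.

```python
import math

def f(s, theta):
    """One application of a hypothetical transition function."""
    return math.tanh(theta * s)

def unfold(s1, theta, steps):
    """Compute s(steps) by applying f repeatedly, starting from s(1)."""
    s = s1
    for _ in range(steps - 1):    # s(1) -> s(2) -> ... -> s(steps)
        s = f(s, theta)
    return s

s1, theta = 0.5, 1.3
s3_loop = unfold(s1, theta, steps=3)
s3_nested = f(f(s1, theta), theta)   # the unfolded expression written out by hand
print(s3_loop, s3_nested)
```

The loop is the compact (recurrent) view; the nested expression is the unfolded view. Both use the same θ at every step.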
Why Is This Useful?
Unfolding the recurrent function gives two big advantages:
1. Same input size at every time step:
We don’t need a separate model for sequences of different lengths.
Each step just takes the current input and the previous state.
2. Parameter sharing:
The same function f with the same weights θ is used at every time
step. This helps the model:
o Learn more efficiently
o Generalize to longer or shorter sequences than those seen during
training

Unfolding with External Inputs


Sometimes the system also takes an input at each time step, like:
s(t) = f(s(t−1), x(t); θ)
Now, each step depends on:
• The previous state (s(t−1))
• The current input (x(t))
In RNNs, we often call the state h(t) instead of s(t). So the formula becomes:
h(t) = f(h(t−1), x(t); θ)
This means:
• h(t) is the hidden state (a summary of previous inputs)
• x(t) is the current input
• θ are the shared weights
• f is the function (often a neural network layer)
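The text leaves f general. One common concrete choice, stated here as an assumption, is an affine map followed by tanh, where the weight matrices W and U and the bias b together make up θ:

```latex
h^{(t)} = \tanh\left(b + W h^{(t-1)} + U x^{(t)}\right), \qquad \theta = \{W, U, b\}
```

Here W acts on the previous hidden state and U acts on the current input; both are reused at every time step.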
What Happens in an RNN?
• At each time step, the RNN updates its hidden state h(t)
• It uses h(t) to make predictions or pass information to the next step
• The same function f is applied again and again, with the same
parameters
This forms a chain-like structure. If we draw it step-by-step, it looks like a
long sequence of repeated blocks.
Two Ways to Draw an RNN
1. Compact (Recurrent) Diagram:
o Shows just one unit with a loop (representing repetition)
o Looks like a simple circuit
2. Unfolded Diagram:
o Shows each time step as a separate block
o Makes it easier to understand how information flows over time

Information Flow in RNNs


When training:
• Forward pass: Calculates outputs and loss by moving from past to
future
• Backward pass: Calculates gradients (errors) from future to past using
Backpropagation Through Time (BPTT)
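The two passes can be illustrated on a tiny scalar RNN. This is a sketch under assumptions: the model h(t) = tanh(w·h(t−1) + u·x(t)), the squared-error loss on the final state only, and all numbers are chosen for illustration. A finite-difference check is included to confirm the backward pass.

```python
import math

def forward(xs, w, u, h0=0.0):
    """Forward pass: returns hidden states h(0..T), moving from past to future."""
    hs = [h0]
    for x in xs:
        hs.append(math.tanh(w * hs[-1] + u * x))
    return hs

def loss_fn(xs, y, w, u):
    """Squared error between the final hidden state and the target y."""
    return 0.5 * (forward(xs, w, u)[-1] - y) ** 2

def bptt(xs, y, w, u):
    """Backward pass: walk from the last step to the first, accumulating gradients."""
    hs = forward(xs, w, u)
    dh = hs[-1] - y                       # dL/dh(T)
    dw = du = 0.0
    for t in range(len(xs), 0, -1):       # future -> past
        da = dh * (1.0 - hs[t] ** 2)      # back through tanh
        dw += da * hs[t - 1]              # w is shared, so contributions add up over steps
        du += da * xs[t - 1]
        dh = da * w                       # gradient handed back to h(t-1)
    return dw, du

xs = [0.5, -1.0, 0.8, 0.3]
y, w, u = 0.2, 0.7, -0.4
dw, du = bptt(xs, y, w, u)

# Finite-difference estimate of the same gradients
eps = 1e-6
dw_num = (loss_fn(xs, y, w + eps, u) - loss_fn(xs, y, w - eps, u)) / (2 * eps)
du_num = (loss_fn(xs, y, w, u + eps) - loss_fn(xs, y, w, u - eps)) / (2 * eps)
print(dw, dw_num, du, du_num)
```

Because w and u are shared across time steps, their gradients are sums of one term per step, which is exactly what the accumulation in the backward loop does.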
Why Use RNNs?
RNNs are useful when:
• You want to remember previous inputs
• The input sequence length is variable
• Examples: speech recognition, text prediction, time-series forecasting
Key Takeaways
• RNNs work on sequences by remembering past data using a hidden
state.
• They use the same function repeatedly over time steps, sharing
parameters.
• Unfolding the recurrent computation helps visualize and train the
model using traditional backpropagation.
• Compact (looped) and unfolded (step-by-step) graphs are two ways to
represent the same model.
10.3 Bidirectional RNNs
Understanding Bidirectional Recurrent Neural Networks (Bidirectional
RNNs)
Basic RNN Limitation: Only Looks Backward
Up to this point, the recurrent neural networks (RNNs) we've discussed only
look at the past and present information to make a decision.
For example, at time step t, the RNN can access:
• Past inputs: x(1), x(2), ..., x(t-1)
• Current input: x(t)
But it cannot look into the future, such as x(t+1), x(t+2), ....
This is called a “causal” structure, because the model only considers causes
from the past and not future context.
Why Is Future Information Important?
In many real-life problems, the current output depends not only on the past
and present but also on the future inputs. Here are some examples:
Speech Recognition:
To identify the correct sound (phoneme) you're hearing now, it might help to
know:
• What sounds come next
• Even what words follow next
Why? Because sounds can blend together (co-articulation), and meaning can
depend on nearby words. Example:
• The words “write” and “right” sound the same. Only the next few words
may clarify the meaning.
Handwriting Recognition:
The interpretation of a letter may depend on what comes next — some letters
look similar and are disambiguated by their neighbors.
So, using only past data is not always enough!
Bidirectional RNN to the Rescue!
To solve this, researchers created the Bidirectional RNN.
What is it?
A Bidirectional RNN has two RNNs:
• One processes the sequence forward (from the start to the end)
• The other processes the sequence backward (from the end to the start)
These two RNNs work in parallel.
Each time step t now has:
• h(t): State from the forward RNN
• g(t): State from the backward RNN
The output o(t) at each time step depends on both directions:
o(t) = function of [ h(t), g(t) ]
This means the model has complete context — past, present, and future.
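A bidirectional RNN can be sketched by running the same kind of simple RNN twice, once over the reversed sequence, and joining the two states at each position. The tanh cell, the weight shapes, and the choice to concatenate h(t) and g(t) are assumptions made for this illustration.

```python
import numpy as np

def run_rnn(xs, U, W):
    """Return the hidden state at every position for one direction."""
    h = np.zeros(W.shape[0])
    states = []
    for x in xs:
        h = np.tanh(U @ x + W @ h)
        states.append(h)
    return np.stack(states)

def bidirectional(xs, fwd, bwd):
    h = run_rnn(xs, *fwd)                   # h(t): summarises x(1..t)
    g = run_rnn(xs[::-1], *bwd)[::-1]       # g(t): summarises x(t..T), re-aligned to position t
    return np.concatenate([h, g], axis=1)   # [h(t), g(t)] is what the output o(t) is computed from

rng = np.random.default_rng(0)
n_in, n_hid, T = 3, 4, 5                    # illustrative sizes (assumed)
fwd = (rng.normal(size=(n_hid, n_in)) * 0.5, rng.normal(size=(n_hid, n_hid)) * 0.5)
bwd = (rng.normal(size=(n_hid, n_in)) * 0.5, rng.normal(size=(n_hid, n_hid)) * 0.5)
xs = rng.normal(size=(T, n_in))
features = bidirectional(xs, fwd, bwd)

# Changing the LAST input alters the backward half at the FIRST position,
# but leaves the forward half there untouched.
xs2 = xs.copy()
xs2[-1] += 1.0
features2 = bidirectional(xs2, fwd, bwd)
print(features.shape)
```

The forward and backward RNNs have separate weights; only their states are combined.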
Benefits of Bidirectional RNNs
• Output is more accurate because it uses more context.
• No need to define a fixed “look-ahead” window.
• Especially helpful when decisions at the current point depend on what’s
coming next.
Applications of Bidirectional RNNs
Bidirectional RNNs have been successfully used in:
• Handwriting recognition
• Speech recognition
• Bioinformatics (like DNA sequence analysis)
They help because these tasks often require understanding of the whole
sequence, not just part of it.
Extending the Idea to Images (2D Input)
What if the input is an image instead of a sequence?
You can extend bidirectional RNNs to two dimensions (2D), where data comes
not just left to right, but also top to bottom.
So, you can have four RNNs moving in four directions:
• Left → Right
• Right → Left
• Top → Bottom
• Bottom → Top
Each point (i, j) in the image grid can combine information from all directions
to compute its output O(i, j).
This allows:
• Local detail understanding (like edges, textures)
• Long-distance interactions (far-away pixels can influence the output)
RNN vs CNN in Images
• CNNs (Convolutional Neural Networks) are more efficient and faster.
• But RNNs for images can:
o Capture long-distance relationships better
o Allow more flexible context usage
• This comes at a cost: RNNs are computationally heavier and slower than
CNNs.
Summary

Concept Explanation

Causal RNNs Use only past and present data

Bidirectional RNNs Use both past and future data

Why needed? Some tasks need full context (e.g., speech, handwriting)

How? Two RNNs: one forward, one backward

Extension to 2D Use 4 RNNs: up, down, left, right for image data

Pros Better context and accuracy

Cons More computation and memory usage

10.5 Deep Recurrent Networks


Computation in Recurrent Neural Networks (RNNs): Deepening the
Architecture
In most Recurrent Neural Networks (RNNs), the processing can be split into
three main parts or transformations:
1. Input to Hidden State: This step takes the input at the current time
step and transforms it into a hidden representation.
2. Hidden to Hidden State: This part takes the hidden state from the
previous time step and uses it to update the hidden state for the current
time.
3. Hidden State to Output: Finally, this step transforms the current
hidden state into an output.
Each of these transformations is normally done using a simple operation: a
weight matrix (which is learned during training), followed by a non-linear
activation function like tanh or ReLU. These are called shallow
transformations, similar to one layer in a deep neural network (MLP).
Can We Make These Transformations Deeper?
Researchers asked: Can we improve performance by making each of these
three parts deeper?
• Instead of just one simple layer between input and hidden, or hidden-
to-hidden, or hidden-to-output, we can use a deep network (like a
multi-layer perceptron - MLP).
• This was found to work well in practice by researchers such as Graves
et al. (2013) and Pascanu et al. (2014a).
• They found that deeper RNNs can learn better because they can capture
more complex patterns in data.
How Can We Add Depth?
Figure 10.13 (explained in simple terms):
1. Figure 10.13a – Deep Hidden States:
Instead of one hidden layer, we stack multiple hidden layers (like a
hierarchy). The lower layers focus on raw input, and higher layers
learn more abstract features.
2. Figure 10.13b – Deep Transforms in All 3 Parts:
We can use deep networks (like MLPs) in each of the three parts:
input-to-hidden, hidden-to-hidden, and hidden-to-output. This adds
more depth and learning capacity.
3. Figure 10.13c – Skip Connections to Help Training:
One problem with adding depth is that it can make optimization harder:
the shortest path between variables at different time steps becomes
longer, so information and gradients must travel through more layers.
To reduce this, we can use skip connections (shortcuts that skip layers),
which make it easier for gradients to flow during backpropagation.
These help prevent learning from getting stuck.
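The stacked arrangement of Figure 10.13a can be sketched as follows. This is a minimal illustration with an assumed tanh cell and made-up sizes; it shows only the stacking of hidden layers, not the deeper transforms of 10.13b or the skip connections of 10.13c.

```python
import numpy as np

def stacked_rnn(xs, layers):
    """Each layer is (U, W); layer k reads the hidden states of layer k-1 as its inputs."""
    inputs = xs
    for U, W in layers:
        h = np.zeros(W.shape[0])
        outputs = []
        for x in inputs:
            h = np.tanh(U @ x + W @ h)
            outputs.append(h)
        inputs = np.stack(outputs)    # becomes the input sequence of the next layer
    return inputs                     # hidden states of the top layer

rng = np.random.default_rng(0)
n_in, n_hid, T = 3, 5, 4              # illustrative sizes (assumed)
layers = [
    (rng.normal(size=(n_hid, n_in)) * 0.5, rng.normal(size=(n_hid, n_hid)) * 0.5),   # lower layer: raw input
    (rng.normal(size=(n_hid, n_hid)) * 0.5, rng.normal(size=(n_hid, n_hid)) * 0.5),  # higher layer: lower states
]
top = stacked_rnn(rng.normal(size=(T, n_in)), layers)
print(top.shape)
```

Each layer has its own weights, but within a layer the same weights are still shared across all time steps.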
Key Takeaways
• Standard RNNs are shallow in their operations.
• Making them deep by adding layers inside the input, recurrent, and
output parts can help the model learn better.
• But adding depth increases the path length between time steps, which
may slow learning.
• This issue can be fixed using skip connections, which shorten the
learning path.
This idea of deep RNNs is similar to why deep feedforward networks (like
ResNet) became popular — deeper models learn better, but we have to help
them train efficiently.
10.6 Recursive Neural Networks
What Are Recursive Neural Networks?
Recursive Neural Networks (RecNNs) are a special type of neural network.
They are different from Recurrent Neural Networks (RNNs), even though their
names sound similar.
RNNs have a chain-like structure — they process one input after another in a
straight line over time.
Recursive Neural Networks, on the other hand, have a tree-like structure.
Instead of processing a sequence in a line, they combine parts of the input in
a hierarchical way, like building a pyramid from the bottom up.
Where Are Recursive Neural Networks Used?
Recursive networks are useful when the input data is naturally tree-
structured, such as:
• Natural Language Processing (NLP): Words in a sentence can be
arranged into a parse tree (like a grammar structure).
• Computer Vision: Objects in an image can be combined based on their
parts (e.g., eyes, nose → face).
• Any structured data: XML, code trees, molecule structures, etc.
Some researchers who helped develop and apply this idea include:
• Pollack (1990) – introduced the concept,
• Socher et al. (2011–2013) – applied it to NLP and vision,
• Frasconi et al. (1998) – used it for structured data,
• Bottou (2011) – described their potential use for learning to reason.
Why Use Recursive Networks?
Recursive networks offer one major advantage over RNNs:
• For a sequence of length τ, an RNN has a depth (number of nonlinear
processing steps between the first input and the final state) of τ.
• A Recursive Network with a balanced tree can reduce this depth to around
log(τ) (logarithmic), so information and gradients travel along a much
shorter path.
This is especially useful for long sequences, where RNNs may struggle to
remember information from earlier steps.
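The depth difference can be counted explicitly. The sketch below combines the same leaf vectors once as a balanced binary tree and once as a chain, using one shared composition function; the tanh composition, the weight matrix W, and the sizes are assumptions for illustration.

```python
import math
import numpy as np

def compose(left, right, W):
    """Shared composition function: merge two vectors into one parent vector."""
    return np.tanh(W @ np.concatenate([left, right]))

def balanced_tree(nodes, W):
    """Reduce a list of leaf vectors pairwise; returns (root vector, depth)."""
    depth = 0
    while len(nodes) > 1:
        merged = [compose(nodes[i], nodes[i + 1], W) for i in range(0, len(nodes) - 1, 2)]
        if len(nodes) % 2:            # an odd leftover node is carried up unchanged
            merged.append(nodes[-1])
        nodes = merged
        depth += 1
    return nodes[0], depth

def chain(nodes, W):
    """Chain as in an RNN: start from a zero state and absorb one element per step."""
    state, depth = np.zeros_like(nodes[0]), 0
    for x in nodes:
        state = compose(state, x, W)
        depth += 1
    return state, depth

rng = np.random.default_rng(0)
d, tau = 4, 16                        # vector size and sequence length (assumed)
W = rng.normal(size=(d, 2 * d)) * 0.5
leaves = [rng.normal(size=d) for _ in range(tau)]
_, tree_depth = balanced_tree(leaves, W)
_, chain_depth = chain(leaves, W)
print(tree_depth, chain_depth)        # 4 and 16 for tau = 16
```

For τ = 16 the tree needs log2(16) = 4 levels, while the chain needs 16 steps.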
How Do You Build the Tree?
This is still a challenging question in practice. There are a few options:
1. Use a fixed structure like a balanced binary tree.
2. Use a structure from outside tools, such as a parser that tells how a
sentence should be broken down into phrases and words.
3. Learn the structure automatically: Ideally, the model should figure out
the best tree structure by itself while learning. This is an ongoing
research area.
Variations in Recursive Networks
Recursive networks can come in different forms:
• Some models associate each node of the tree with both an input and a
target output.
• The computation at each node doesn't have to be a simple linear
operation. It can be more advanced, like:
o Tensor operations
o Bilinear forms (used to model relationships between different
concepts)
These methods help capture complex relationships between elements (like
how two words in a sentence interact).
Also, each input (like a word or object) is often represented using embeddings
— continuous-valued vectors that capture meaning or features.

Summary

Feature                    Recursive Neural Network (RecNN)
Structure                  Tree (not a chain like RNN)
Useful for                 NLP (parse trees), vision, structured data
Advantage                  Shorter depth for long sequences → better memory
Tree building              Fixed, parser-based, or learned during training
Computation at each node   Can use advanced operations (tensor, bilinear, etc.)
Representation             Works with vector embeddings

10.10 The Long Short-Term Memory and Other Gated RNNs

What Are Gated RNNs?


Gated RNNs are a special type of Recurrent Neural Network (RNN) used to
process sequences such as text, time series, and speech.
They are smarter versions of regular RNNs, built to address two big problems
that traditional RNNs often face:
1. Vanishing gradients (the network forgets past information quickly),
2. Exploding gradients (the network becomes unstable).
Gated RNNs reduce these problems, the vanishing gradient in particular, by
controlling what information is remembered, updated, or forgotten at each
step using special gates.
What Does "Gated" Mean?
Imagine a gate in real life — it either lets something pass or blocks it.
Similarly, gated RNNs use "gates" (mathematical functions) that control the
flow of information in the network. These gates help the network:
• Store important information for a long time,
• Forget irrelevant information when it’s no longer needed,
• Decide what to output at each time step.
Two Famous Types of Gated RNNs:
1. LSTM (Long Short-Term Memory)
2. GRU (Gated Recurrent Unit)
How Does an LSTM Cell Work?
An LSTM cell is like a small computer inside the neural network. It has:
1. Input gate
Controls how much of the new information from the current input should be
added to the memory.
2. Forget gate
Decides how much of the old memory (from the previous step) should be
erased or kept.
3. Output gate
Decides what information from the memory should be output at this step.
4. Cell state
This is like a memory lane that carries information from one time step to the
next, with very little change unless a gate says otherwise.
Diagram Overview (from Figure 10.16)
Here's what happens inside an LSTM cell at each time step:
• The input goes in and is processed.
• The input gate decides if this info is worth saving.
• The forget gate checks whether to remove any old memory.
• The cell state gets updated (keeps important info across time steps).
• The output gate controls what gets passed to the next layer or next time
step.
All gates use a sigmoid activation (values between 0 and 1, like "yes/no"
decisions).
The input signal can use other activation functions (like tanh or ReLU).
Why Is This Useful?
In many real-world tasks, we need to:
• Remember information for a long time (e.g., remembering a subject from
the start of a sentence),
• Forget old details (e.g., when switching topics in a sentence),
• Update knowledge (e.g., when learning new facts).
Leaky units (used in older models) kept adding to the memory but had no
good way to clear it when needed. Gated RNNs (like LSTM) fix that by learning
when to forget.
Real-Life Example: Understanding a Paragraph
Let’s say the network reads this paragraph:
“John went to the shop. He bought a pen. Then he went home.”
To answer the question “Who bought a pen?”, the model must remember
“John” from the first sentence and connect it with “he” in the second sentence.
An LSTM network can keep “John” in memory, thanks to its cell state and
gates, while reading the rest of the sentences.
Summary Table

Feature Gated RNNs (LSTM/GRU)

Solves Vanishing/exploding gradient problems

Memory Uses a long-term memory (the cell state in LSTM)

Gates used LSTM: input, forget, and output gates; GRU: update and reset gates

Learns to Remember, update, or forget information

Works well for Long sentences, sequences, time series, etc.


Advantage Better control of what is stored and erased

10.10.1 LSTM
Understanding the Core Idea Behind LSTM
In normal Recurrent Neural Networks (RNNs), when we try to pass
information across many time steps, the gradients (which are needed for
learning) either become too small (vanish) or too large (explode). This makes
it very hard for the network to learn long-term patterns.
To solve this, researchers introduced a clever idea in LSTM networks — they
added self-loops (paths that allow information to flow from one time step to
the next without changing too much). This helps the gradient pass smoothly
through many time steps during training, without vanishing or exploding.
Self-Loops with Gating – Smarter Memory Control
Instead of using a fixed weight (strength) for this self-loop, LSTM networks
use a gate to control the weight dynamically, based on the current context or
input.
• This means: Even if the LSTM has fixed parameters, it can adapt how
long it remembers something, depending on the input sequence.
What is an LSTM Cell?
An LSTM cell is a special type of unit in a neural network. It replaces the
regular hidden unit of a simple RNN. Unlike a basic neuron, an LSTM cell
has:
1. An internal memory (state) that can keep information over time.
2. Three gates to control the flow of information:
o Forget Gate: Decides what to forget from the past.
o Input Gate: Decides what new information to add.
o Output Gate: Decides what to show as output from the memory.
Forward Pass in LSTM: Step-by-Step
Let’s see what happens at every time step t:
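One time step of an LSTM cell can be sketched as follows. This assumes a common formulation: sigmoid gates, a tanh candidate for the new content, and a tanh applied to the cell state before the output gate. The parameter layout (a dictionary of U, W, b per gate) and all sizes are hypothetical choices for this example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, s_prev, p):
    """One LSTM time step. p maps a name to its (U, W, b) parameters."""
    def affine(name):
        U, W, b = p[name]
        return U @ x + W @ h_prev + b

    f = sigmoid(affine("forget"))         # forget gate: how much old cell state to keep
    g = sigmoid(affine("input"))          # input gate: how much new content to write
    q = sigmoid(affine("output"))         # output gate: how much of the cell to expose
    candidate = np.tanh(affine("cell"))   # proposed new content (tanh is an assumption)
    s = f * s_prev + g * candidate        # gated self-loop on the cell state
    h = q * np.tanh(s)                    # hidden state passed to the next step / layer
    return h, s

rng = np.random.default_rng(0)
n_in, n_hid = 3, 4                        # illustrative sizes (assumed)
p = {name: (rng.normal(size=(n_hid, n_in)) * 0.5,
            rng.normal(size=(n_hid, n_hid)) * 0.5,
            np.zeros(n_hid))
     for name in ("forget", "input", "output", "cell")}

h, s = np.zeros(n_hid), np.zeros(n_hid)
for x in rng.normal(size=(5, n_in)):      # run the cell over a 5-step sequence
    h, s = lstm_step(x, h, s, p)
print(h.shape, s.shape)
```

The line `s = f * s_prev + g * candidate` is the key one: when the forget gate is near 1 and the input gate near 0, the cell state is copied forward almost unchanged.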

Why Is LSTM Powerful?
• LSTM can learn when to remember and when to forget.
• It can adjust the duration for which it keeps information — useful for
tasks like:
o Speech recognition
o Handwriting generation
o Machine translation
o Image captioning
o Parsing text
Optional Feature: Using Memory as Input to Gates
Sometimes, the cell state s_i(t) of unit i is used as an extra input to that
unit's three gates. This can help the gates make more precise decisions.
However, it adds three more weights per unit (one for each gate).
Performance
LSTM has shown excellent performance in:
• Synthetic datasets where long-term memory is tested.
• Real-world applications where previous RNNs failed.
Many researchers have built variations and improvements over the original
LSTM to make it even better.
10.10.2 Other Gated RNNs
What Are the Important Parts of the LSTM Architecture?
Researchers asked a very practical question:
"Which parts of the LSTM are really important? Can we design other simpler
architectures that still allow the network to learn what to remember and
what to forget?"
To answer this, scientists developed another type of neural network called
Gated Recurrent Units (GRUs).
GRU – A Simpler Version of LSTM
GRUs were introduced in 2014 by Cho and others. GRUs are similar to LSTM,
but simpler.
The key idea is that instead of using separate gates like LSTM (input gate,
forget gate, output gate), GRU uses only two gates:
• Update Gate (u) – Controls how much of the previous memory should
be kept.
• Reset Gate (r) – Controls how much of the past should be forgotten when
calculating the current state.
These gates help the network decide when to keep or forget information, based
on the input sequence.
GRU Update Equation
The GRU updates the hidden state using this formula:
h_t = u_t * h_(t-1) + (1 - u_t) * new_info
Here:
• h_t is the current hidden state.
• u_t is the update gate value.
• h_(t-1) is the previous hidden state.
• new_info is the candidate state computed using the input and the reset
gate.
This formula mixes the old memory and the new memory, controlled by the
update gate.
How Do the Gates Work?
• The update gate (u) decides how much past information should be
carried forward.
If u_t = 1, it keeps the old memory. If u_t = 0, it forgets the old and
uses the new info.
• The reset gate (r) decides how much of the past should be considered
when computing the new information.
It acts like a filter to ignore irrelevant old data.
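The update formula and the two gates can be put together in a short sketch. The tanh used for the candidate state, the parameter layout, and the sizes are assumptions for illustration; the final line of `gru_step` is the update equation given above.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x, h_prev, p):
    """One GRU step following h_t = u_t * h_(t-1) + (1 - u_t) * new_info."""
    Uu, Wu, bu = p["update"]
    Ur, Wr, br = p["reset"]
    Uh, Wh, bh = p["candidate"]
    u = sigmoid(Uu @ x + Wu @ h_prev + bu)                # update gate
    r = sigmoid(Ur @ x + Wr @ h_prev + br)                # reset gate
    new_info = np.tanh(Uh @ x + Wh @ (r * h_prev) + bh)   # candidate; reset gate filters the past
    return u * h_prev + (1.0 - u) * new_info

rng = np.random.default_rng(0)
n_in, n_hid = 3, 4                                        # illustrative sizes (assumed)

def init(bias=0.0):
    """Hypothetical helper: random (U, W) and a constant bias vector."""
    return (rng.normal(size=(n_hid, n_in)) * 0.5,
            rng.normal(size=(n_hid, n_hid)) * 0.5,
            np.full(n_hid, bias))

p = {"update": init(), "reset": init(), "candidate": init()}
h_prev = rng.normal(size=n_hid)
x = rng.normal(size=n_in)
h_new = gru_step(x, h_prev, p)

# Push the update gate towards 1: the old state is carried forward unchanged.
p_keep = dict(p, update=init(bias=50.0))
h_kept = gru_step(x, h_prev, p_keep)
print(h_new.shape, np.allclose(h_kept, h_prev))
```

With a very large update-gate bias, u_t is effectively 1 and the new information is ignored, which matches the u_t = 1 case described above.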
Variations and Experiments
Researchers experimented with many variations of LSTM and GRU, like:
• Sharing reset/forget gates across multiple units.
• Using global gates for entire layers and combining with individual gates
per neuron.
However, no new version was found to consistently perform better than both
LSTM and GRU across many tasks.
Key Findings
1. The forget gate in LSTM is very important for learning long-term
dependencies.
2. A small tweak – adding a bias of 1 to the forget gate – was found to
significantly improve LSTM performance.
This idea was suggested by Gers et al. in 2000.
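A rough calculation suggests why the bias matters. Assume, as a simplification, that the other inputs to the forget gate are near zero early in training, so the gate value is just sigmoid(bias), and that the same value applies at every step. The cell state is then scaled by that value once per time step.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

steps = 20                          # number of time steps considered (assumed)
for bias in (0.0, 1.0):
    f = sigmoid(bias)               # forget-gate value under the simplification above
    retained = f ** steps           # fraction of the cell state surviving `steps` steps
    print(f"bias={bias}: f={f:.3f}, retained after {steps} steps={retained:.2e}")

retained_bias0 = sigmoid(0.0) ** steps
retained_bias1 = sigmoid(1.0) ** steps
```

With a bias of 0 the gate starts at 0.5 and halves the memory at every step; with a bias of 1 it starts at about 0.73, so far more of the cell state (and of the gradient flowing through it) survives over many steps.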
Summary
• GRU is a simpler alternative to LSTM, using only two gates (update and
reset).
• These gates allow the model to dynamically learn what to remember or
forget.
• Despite many experiments, LSTM and GRU remain the most successful
for sequence learning.
• The forget gate is one of the most crucial components in these models.
