RNNs and Computational Graphs Explained
RNNs and Computational Graphs Explained
Sequence Modeling- Recurrent and Recursive Nets: Unfolding Computational Graphs, Recurrent
Neural Networks, Bidirectional RNNs, Encoder-Decoder Sequence-to-Sequence Architectures, Deep
Recurrent Networks, Recursive Neural Networks, Leaky Units, LSTM.
-------
Recurrent Neural Networks (RNNs) are designed to handle sequential data. Just as convolutional
networks are tailored for processing grid-like data such as images, RNNs are specialized for sequences of
values x1 ….. xT. They can efficiently scale to long sequences and handle variable-length inputs, making
them practical for tasks involving sequential data.
To move from multi-layer networks to recurrent networks, we utilize a key concept: parameter sharing.
This allows the model to handle sequences of varying lengths and generalize across them. Without
shared parameters, the model couldn't adapt to new sequence lengths or leverage statistical patterns
across different positions in time.
For instance, consider the sentences “I went to Nepal in 2009” and “In 2009, I went to Nepal.” A model
should recognize "2009" as the year regardless of its position. A traditional feedforward network would
need separate parameters for each word position, making it inefficient. In contrast, a recurrent neural
network shares weights across time steps, making it more effective for such tasks.
The convolution kernel is a small window that slides over the sequence, and the output of the
convolution operation is a new sequence where each element is a function of the elements in the input
sequence that overlap with the kernel. It can be used to share parameters across time, but it is shallow.
Recurrent neural networks (RNNs): A type of neural network that can model sequences. RNNs
have an internal state that is updated at each time step. The output of the RNN at each time step ‘t’ is a
function of the input at that time step and the internal state.
RNNs can learn long-range dependencies because the internal state of the RNN can store information
about previous time steps. They are difficult to train because they can easily overfit the training data. In
recurrent networks, the parameters are shared through a very deep computational graph. This means that
the same parameters are used to compute the output at each time step.
We can unfold a recursive or recurrent computation into a computational graph that has a repetitive
structure, typically corresponding to a chain of events. Unfolding this graph results in the sharing of
parameters across a deep network structure.
For example, consider the classical form of a dynamical system: s(t) = f(s(t−1); θ), where s(t) is called the
state of the system. The above Equation is recurrent because the definition of s at time t refers back to
the same definition at time t − 1.
For a finite number of time steps τ , the graph can be unfolded by applying the definition τ − 1 times. For
example, if we unfold above equation for τ = 3 time steps, we obtain s(3) =f(s(2) ; θ) =f (f (s (1); θ); θ)
Unfolding equation by repeatedly applying the definition in this way will yielded expression without
recurrence. s(1) is ground state and s(2) computed by applying f. Such an expression can be represented by
a traditional acyclic computational graph.
These notes are prepared using material from "Deep Learning" by Ian Goodfellow, Yoshua Bengio, and Aaron Courville, MIT Press, 2016, and various
resources from the World Wide Web. Please email your valuable suggestions to [Link]@[Link].
2
Each node represents the state at some time t and the function f maps the state at t to the state at t + 1.
The same parameters (the same value of θ used to parametrize f) are used for all time steps.
As, let us consider another example , a dynamical system driven by an external signal x(t), s(t) = f(s(t−1) ,
x(t); θ), where we see that the state now contains information about the whole past sequence.
s(t) depends on s(t−1), which depends on s(t−2), and so on. This chain of dependence means that s(t)
implicitly contains information about all past states and inputs, assuming the system is observable and
the function f is invertible or retains sufficient information.
Recurrent neural networks can be built in many different ways. Much as almost any function can be
considered a feedforward neural network, essentially any function involving recurrence can be
considered a recurrent neural network.
Many recurrent neural nets use same equation to define values of hidden units. To indicate that the state
is the hidden, we now rewrite equation s(t) = f(s(t−1); θ) using the variable h to represent the state, h(t) =
f(h(t−1), x(t); θ) as show below:
The above is a recurrent network with no outputs. This recurrent network just processes information
from the input x by incorporating it into the state h that is passed forward through time.
Circuit Diagram (Left): The circuit diagram on the left shows the recurrent network in its folded
form. The input x is processed by the function f along with the previous state h to produce the new
state h. The delay element (black square in the diagram) indicates a delay of a single time step, ensures
that the state h is passed forward to the next time step.
Unfolded Computational Graph (Right): The unfolded computational graph on the right
shows the recurrent network as it processes a sequence of inputs over time. Each node in the graph is
associated with a specific time instance.
At each time step t, the input x(t) and the previous state h(t−1) are processed by the function f to produce
the new state h(t). This process is repeated for each time step, with the state h being passed forward
through time.
This recurrent network processes input information by incorporating it into the internal state h, which is
passed forward through time. The network does not produce any outputs but focuses on updating the
internal state based on the input sequence.
We can represent the unfolded recurrence after t steps with a function g(t):
h(t) =g(t) (x(t), x(t−1), x(t−2) , . . . , x(2), x(1)) = f(h(t−1) , x(t); θ)
These notes are prepared using material from "Deep Learning" by Ian Goodfellow, Yoshua Bengio, and Aaron Courville, MIT Press, 2016, and various
resources from the World Wide Web. Please email your valuable suggestions to [Link]@[Link].
3
(t) (t) (t-1) (t-2) (2) (1)
The function g takes the whole past sequence (x , x , x , … , x , x ) as input and produces the
current state, but the unfolded recurrent structure allows us to factorize g(t) into repeated application of a
function f.
These two factors enable learning a single model f that works across all time steps and sequence lengths,
eliminating the need for separate models g(t) for each time step.
This approach allows the model to generalize to new sequence lengths not seen during training and
reduces the number of training examples needed compared to models without parameter sharing.
Armed with the graph unrolling and parameter sharing we can design a wide variety of recurrent neural
networks. Three important design patterns for recurrent neural networks are:
1. Recurrent networks that produce an output at each time step and have recurrent connections between
hidden units, illustrated in the below figure 1.
The sequence flow of the diagram begins with an input x at each time step, which is processed
through a weight matrix U to update the hidden state h .
The black square in the diagram represents a delay element, which holds the hidden state for one time
step before passing it on, ensuring that the hidden state from the current time step is used in the next
time step.
This delayed hidden state is then passed through a recurrent connection, represented by the weight
matrix W , to produce the next hidden state.
Simultaneously, the current hidden state is processed through another weight matrix V to generate
the output o is the unnormalized log probabilities.
The output is compared to the target y using a loss function L , which measures the loss between
the predicted output and the actual target.
This process is repeated for each time step in the sequence, with the hidden state being updated and
passed forward at each step, allowing the network to handle sequences of variable length.
2. Recurrent networks that produce an output at each time step and have recurrent connections only
from the output at one time step to the hidden units at the next time step, as shown in below figure 2.
The below differences of above two diagrams highlight the trade-offs between simplicity and the ability
to capture complex temporal patterns in sequential data.
Feedback Mechanism: Diagram 1 has recurrent connections within the hidden layer itself, while
Diagram 2 has feedback from the output to the hidden layer.
Hidden State Update: Diagram 1 updates it based on the current input and the previous hidden state,
whereas Diagram 2 updates the hidden state based on the current input and feedback from the
previous output.
Complexity and Power: Diagram 1 is more complex and powerful in capturing long-term
dependencies, while Diagram 2 is simpler and potentially easier to train but less powerful.
3. Recurrent networks with recurrent connections between hidden units, that read an entire sequence
and then produce a single output, illustrated in figure 3 below.
These notes are prepared using material from "Deep Learning" by Ian Goodfellow, Yoshua Bengio, and Aaron Courville, MIT Press, 2016, and various
resources from the World Wide Web. Please email your valuable suggestions to [Link]@[Link].
5
RNN with hidden unit connections together with a(t) = b+Wh(t-1) + Ux(t) is Universal, i.e., Any function
computable by a Turing machine can be computed by such a recurrent network of a finite size.
The output can be read from the RNN after a number of time steps that is asymptotically linear in the
number of time steps used by the Turing machine and asymptotically linear in the length of the input.
These notes are prepared using material from "Deep Learning" by Ian Goodfellow, Yoshua Bengio, and Aaron Courville, MIT Press, 2016, and various
resources from the World Wide Web. Please email your valuable suggestions to [Link]@[Link].
6
The above figure does not specify the choice of activation function for the hidden units. Here we assume
the hyperbolic tangent activation function. i.e. h(t) = tanh(a(t))
a(t) is an intermediate value calculated at each time step t. It represents the linear transformation of the
input and the previous hidden state.
The figure does not specify exactly what form the output and loss function take. Here we assume that the
output is discrete, as if the RNN is used to predict words or characters. A natural way to represent
discrete variables is to regard the output o as giving the unnormalized log probabilities of each possible
value of the discrete variable.
We can then apply the softmax operation as a post-processing step to obtain a vector yˆ of normalized
probabilities over the output.
Forward propagation begins with a specification of the initial state h (0). Then, for each time step from t =
1 to t = τ (tau, (τ) indicates the specific time step) we apply the following update equations:
o(t )= c + Vh(t) where
(t) (t)
h = tanh(a ) - a(t) : Intermediate affine transformation at time step τ .
(t) (t-1) (t)
a = b + Wh + Ux - o(t) : Unnormalized log probabilities of the output at time step τ
yˆ(t) = softmax(o(t)) - b : Bias vector for the hidden layer.
- c : Bias vector for the output layer.
- h(t-1) : Hidden state from the previous time step.
- x(t) : Input at the current time step.
- U : Weight matrix for the input-to-hidden connections.
- V : Weight matrix for the hidden-to-output connections.
- W : Weight matrix for the hidden-to-hidden connections.
These equations describe how the RNN processes sequential data and produces discrete predictions at
each time step. The use of the softmax function ensures that the output probabilities sum to 1, making it
suitable for classification tasks.
Teacher Forcing and Networks with Output Recurrence The network with recurrent
connections only from the output at one time step to the hidden units at the next time step:
This network is strictly less powerful
This network cannot simulate a universal Turing machine.
This network requires that the output units capture all of the information about the past that the
network will use to predict the future.
These notes are prepared using material from "Deep Learning" by Ian Goodfellow, Yoshua Bengio, and Aaron Courville, MIT Press, 2016, and various
resources from the World Wide Web. Please email your valuable suggestions to [Link]@[Link].
7
The advantage of eliminating hidden-to-hidden recurrence is that, for any loss function based on
comparing the prediction at time t to the training target at time t, all the time steps are decoupled.
Training can thus be parallelized, with the gradient for each step t computed in isolation. There is no
need to compute the output for the previous time step first, because the training set provides the ideal
value of that output.
Models that have recurrent connections from their outputs leading back into the model may be trained
with teacher forcing.
Teacher forcing is a procedure that emerges from the maximum likelihood criterion, in which during
training the model receives the ground truth output y(t) as input at time t + 1.
We can see this by examining a sequence with two time steps. The conditional maximum likelihood
criterion is log (p(y(1) , y(2)| x(1) , x(2)) = log p(y(2)| y(1) , x(1) , x(2)) + log p(y(1)| x(1) , x(2))
At time t=2, model is trained to maximize conditional probability of y (2) given both the x sequence so far
and the previous y value from the training set. We are using y(1) as teacher forcing, rather than only x(i)
Maximum likelihood specifies that during training, rather than feeding the model’s own output back into
itself, these connections should be fed with the target values specifying what the correct output should
be. This is illustrated in figure below.
Teacher Forcing is a training technique applicable to RNNs that have connections from output to hidden
states at next time step
These notes are prepared using material from "Deep Learning" by Ian Goodfellow, Yoshua Bengio, and Aaron Courville, MIT Press, 2016, and various
resources from the World Wide Web. Please email your valuable suggestions to [Link]@[Link].
8
Two Solutions are there to mitigate this problem:
1. Train with both teacher-forced inputs and free running inputs. E.g., predicting the correct target a no of
steps in the future through the unfolded recurrent output-to-input paths. Thus network can learn to take
into account input conditions not seen during training.
2. Mitigate the gap between inputs seen at training time and test time by generating values as input. This
approach exploits a curriculum learning strategy to gradually use more of the generated values as input
Computing the Gradient in an RNN: Computing the gradient through a recurrent neural
network is straightforward.
1. Applies the generalized back-propagation algorithm to the unrolled computational graph. No
specialized algorithms are necessary.
2. Gradients obtained by back-propagation may then be used with any general-purpose gradient-based
techniques to train an RNN.
1. General Backpropagation to compute gradient: The below algorithm outlines the skeleton of a
back-propagation algorithm, specifically focusing on the setup and cleanup aspects. The core
computational work, particularly the gradient calculations, is delegated to a subroutine named
build_grad. We are computing gradients T of variable z wrt variables in computational graph G
2. Build-grad function of Generalized Backprop: The inner loop subroutine build_grad(V, G, G′,
grad_table) of the back-propagation algorithm, called by the back-propagation algorithm discussed above
is show below.
To gain intuition for how the BPTT algorithm behaves, and how to compute gradients by BPTT for the
given RNN equations
o(t )= c + Vh(t)
h(t) = tanh(a(t))
a(t) = b + Wh(t-1) + Ux(t)
yˆ(t) = softmax(o(t))
The nodes of our computational graph include the parameters U , V , W , b and c as well as the sequence
of nodes indexed by t for x(t), h(t), o(t) and L(t).
For each node N we need to compute the gradient ∇NL recursively, based on the gradient computed at
nodes that follow it in the graph. We start the recursion with the nodes immediately preceding the final
loss ∂L / ∂L(t) = 1
We assume that the outputs o(t) are used as the argument to the softmax function to obtain the vector ˆy of
y(t) given the input so far. The gradient ∇o(t)L on the outputs at time step t, for all i, t, is as follows:
probabilities over the output. We also assume that the loss is the negative log-likelihood of the true target
These notes are prepared using material from "Deep Learning" by Ian Goodfellow, Yoshua Bengio, and Aaron Courville, MIT Press, 2016, and various
resources from the World Wide Web. Please email your valuable suggestions to [Link]@[Link].
10
Recurrent Networks as Directed Graphical Models: In the recurrent networks, we've used cross-
entropy losses L to compare the network's outputs o with the training targets y(t).
(t) (t)
In recurrent neural networks (RNNs), we compute losses at each time step, similar to feedforward
networks. The output of RNN is often treated as a probability distribution, and cross-entropy naturally
used to define the loss.
2. Mean Squared Error (MSE) Loss used for regression tasks/Time series forecast:
- Used when predicting continuous values (e.g., stock prices).
- Equivalent to cross-entropy loss if we assume the output follows a Gaussian distribution.
The loss function helps the RNN learn by penalizing incorrect predictions, just like in standard neural
networks.
We train the Recurrent Neural Network (RNN) to estimate the conditional distribution of the next
element in the sequence, y(t), given the past inputs (and optionally past outputs). This involves
maximizing the log-likelihood of the observed data.
If the model does not include connections from previous outputs, we maximize: log p(y(t)|x(1),…,x(t))
If the model includes connections from the output at one time step to the next, we maximize:
log p(y(t)|x(1),…,x(t),y(1),…,y(t−1))
Decomposing the joint probability over the sequence of y values as a series of one-step probabilistic
predictions is a method to capture the full joint distribution across the whole sequence. This approach
allows us to model the dependencies between sequence elements effectively.
When we do not feed past y values as inputs that condition the next step prediction, the
outputs y are conditionally independent given the sequence of x values. This means each y(t) depends
only on the input sequence x and not on previous y values.
When we do feed the actual y values (not their prediction, but the actual observed or generated values)
back into the network, the directed graphical model contains edges from all y (i) values in the past to the
current y(t) value. This setup captures dependencies not only on the input sequence x but also on the
history of y values, making the model more expressive but also more complex.
These notes are prepared using material from "Deep Learning" by Ian Goodfellow, Yoshua Bengio, and Aaron Courville, MIT Press, 2016, and various
resources from the World Wide Web. Please email your valuable suggestions to [Link]@[Link].
11
Fully connected graphical model for a sequence
This also leads to Computational Explosion as Each step requires more memory and computation than
the previous one.
RNNs obtain the same full connectivity but efficient parameterization by introducing the state variable in
the Probabilistic Graphical Model (PGM) of RNN as shown in below figure.
In the above diagram, RNNs introduce a state variable h(t) at each time step, which acts as a
"memory" summarizing past information.
Even though h(t) is deterministic (computed from inputs and previous state), it simplifies the graphical
model by compressing all past inputs into a fixed-size vector.
This model is efficeint in Parameter Sharing. The same function (with the same parameters)
computes h(t) and y(t) at every time step.
o Example: h(t) =f(x(t), h(t-1);θ), where θ is reused across all t. ( x(t) is the input vector at time
step t).
Fixed Input Size: Unlike the previous diagram (where each y(t) depends on all past inputs), in this
model by introdcuing a state Variable y(t) depends only on x(t) and h(t-1).
Bidirectional RNNs: Traditional RNNs have a "causal" structure, meaning that the state at any given
time t only captures information from past inputs x (1), . . . , x(t−1) and the current input x(t). This limits
their ability to use future context when making predictions.
In many applications, such as speech recognition or handwriting recognition, the correct interpretation of
the current input may depend on future inputs. If there are are two interpretations of the current word that
are both acoustically plausible, we may have to look far into the future (and the past) to disambiguate
them. In speech recognition, the sound "tee" might be part of the word "tea" or "steam"—you need to
hear the next few sounds/words to decide.
For example, understanding a spoken word might require context from subsequent words or sounds.
Dhaval Loves Apple, it keeps him healthy
Dhaval Loves Apple, the company produces best Electronics
"Bank" could mean:
- Financial institution (if followed by "account")
- River side (if followed by "of the river")
These notes are prepared using material from "Deep Learning" by Ian Goodfellow, Yoshua Bengio, and Aaron Courville, MIT Press, 2016, and various
resources from the World Wide Web. Please email your valuable suggestions to [Link]@[Link].
12
Bidirectional RNNs address this limitation by combining two RNNs: one that processes the sequence in
the forward direction (from start to end) and another that processes it in the backward direction (from end
to start).
At each time step t , the bidirectional RNN has two
states:
- h(t) : The hidden state of the forward-moving
RNN.
- g(t) : The hidden state of the backward-moving
RNN.
1. Time Steps in Top Row Shows the sequence positions from 1 to T (total length)
Example: For a sentence "I love AI", positions would be: [1]="I", [2]="love", [3]="AI"
2. Forward RNN Processes data from left to right (past → future). Each h(t) depends on Current input
x(t) and Previous hidden state h(t-1)
These notes are prepared using material from "Deep Learning" by Ian Goodfellow, Yoshua Bengio, and Aaron Courville, MIT Press, 2016, and various
resources from the World Wide Web. Please email your valuable suggestions to [Link]@[Link].
13
Example for "I love Deep Learning":
- h(1) knows only "I"
- h(2) knows "I" + "love"
- h(3) knows "I" + "love" + "Deep Learnng”
3. Backward RNN Processes data from right to left (future → past). Each g(t) depends on Current input
x(t) and Next hidden state g(t+1)
Example for "I love Deep Learnng ":
- g(3) knows only " Deep Learnng "
- g(2) knows " Deep Learnng " + "love"
- g(1) knows " Deep Learnng " + "love" + "I"
4. Output: Each output o(t) combines both h(t) and g(t). The "(h+g)" notation means concatenation or
weighted combination.
Bidirectional RNNs enhance the capability of traditional RNNs by incorporating future context, making
them suitable for tasks where understanding the full sequence context is crucial. They are particularly
useful in applications like speech and handwriting recognition, where future information can
significantly impact the interpretation of current inputs.
Figure below shows how an RNN can map a fixed-size vector to a sequence.
These notes are prepared using material from "Deep Learning" by Ian Goodfellow, Yoshua Bengio, and Aaron Courville, MIT Press, 2016, and various
resources from the World Wide Web. Please email your valuable suggestions to [Link]@[Link].
14
Figure below shows how an RNN can map an input sequence to an output sequence of the same length.
Now we will learn how an RNN can be trained to map an input sequence to an output sequence which is
not necessarily of the same length.
This comes up in many applications, such as speech recognition, machine translation or question
answering, where the input and output sequences in the training set are generally not of the same length
(although their lengths might be related).
The simplest RNN architecture for mapping a variable-length sequence to another variable-length
sequence is as shown below. This architeicture is known as the encoder-decoder or sequence-to-sequence
architecture.
The primary role of the encoder is to process
the input sequence (x(1), x (2), . . . , x(nx)) and
encode it into a fixed-size context vector,
often referred to as the "context".
The encoder is typically implemented as a
Recurrent Neural Network (RNN), such as a
Long Short-Term Memory (LSTM) network
or a Gated Recurrent Unit (GRU). These
types of RNNs are well-suited for handling
sequential data due to their ability to maintain
a hidden state that captures information over
time.
After processing the entire input sequence, the encoder's final hidden state is used to generate the context
vector C. The context vector C serves as a condensed representation of the input sequence, capturing the
most relevant information needed to generate the output sequence.
This context vector is typically a simple function of the final hidden state, such as a direct copy or a
linear transformation. It encapsulates the information from the entire input sequence, ensuring that the
decoder has access to a comprehensive summary of the input sequence.
These notes are prepared using material from "Deep Learning" by Ian Goodfellow, Yoshua Bengio, and Aaron Courville, MIT Press, 2016, and various
resources from the World Wide Web. Please email your valuable suggestions to [Link]@[Link].
15
The final hidden state of the encoder is crucial because it encapsulates the information from the entire
input sequence. By using this state to generate the context vector, the model ensures that the decoder has
access to a comprehensive summary of the input sequence.
A decoder or writer or output RNN is conditioned on that fixed-length vector C to generate the output
sequence Y = (y(1), … , y(ny)). The decoder uses the context vector to produce the output sequence
element by element, leveraging the information encoded by the encoder.
Deep Recurrent Networks The computation in most RNNs can be decomposed into three
blocks of parameters and associated transformations:
1. from the input to the hidden state, (How current input affects memory)
2. from the previous hidden state to the next hidden state, (How memory evolves over time) and
3. from the hidden state to the output. (How memory generates predictions)
With the RNN architecture of figure, each of these three blocks is associated with a single weight matrix.
These notes are prepared using material from "Deep Learning" by Ian Goodfellow, Yoshua Bengio, and Aaron Courville, MIT Press, 2016, and various
resources from the World Wide Web. Please email your valuable suggestions to [Link]@[Link].
16
Hierarchical Hidden State: This subfigure shows a hierarchical
organization of the hidden recurrent state.
The hidden state h(t) is split into hierarchical groups (sub-states) that
operate at different temporal scales. This means that instead of having a
single hidden state, the network has multiple layers of hidden states.
This may cause Longer path between distant time steps, but it has better
hierarchical feature learning.
These notes are prepared using material from "Deep Learning" by Ian Goodfellow, Yoshua Bengio, and Aaron Courville, MIT Press, 2016, and various
resources from the World Wide Web. Please email your valuable suggestions to [Link]@[Link].
17
Skip Connections: This subfigure shows the use of skip
connections to mitigate the path-lengthening effect.
This balances depth (for complexity) and trainability (via skip paths).
Real-World Analogies
1. Bookmarking a Page:
o Without skips: Read every page to find a key detail.
o With skips: Jump to the bookmarked page instantly.
2. Highway vs. Local Roads:
o Skips are highways for information;
o standard RNN paths are local roads.
Recursive Neural Networks: Recursive neural networks represent yet another generalization of
recurrent networks, with a different kind of computational graph. RNNs Process sequences as
a chain (linear order) where as Recursive Neural Networks Generalize to tree structures, where each
node combines inputs hierarchically.
It is structured as a deep tree, rather than the chain-like structure of RNNs. The typical computational
graph for a recursive network is illustrated in figure.
A variable-size sequence x(1),x(2), . . . , x(t) can be mapped to a fixed-size representation (the output o),
with a fixed set of parameters (the weight matrices U , V , W ).
The figure illustrates a supervised learning case in which some target y is provided which is associated
with the whole sequence.
These notes are prepared using material from "Deep Learning" by Ian Goodfellow, Yoshua Bengio, and Aaron Courville, MIT Press, 2016, and various
resources from the World Wide Web. Please email your valuable suggestions to [Link]@[Link].
18
One clear advantage of recursive nets over recurrent
nets is that for a sequence of the same length τ, the
depth (measured as the number of compositions of
nonlinear operations) can be drastically reduced from τ
to O(log τ ), which might help deal with long-term
dependencies.
Vanishing Gradients are most common. When gradients are backpropagated over many time steps,
they are multiplied repeatedly by weight matrices (Jacobians). If these matrices have small
eigenvalues (<1), gradients shrink exponentially (e.g., 0.9100≈00.9100≈0).
Exploding Gradients are Less Common but more Destructive: If weight matrices have large
eigenvalues (>1), gradients grow exponentially (e.g., 1.1100≈13,7801.1100≈13,780), which may cuse
Numerical instability → NaN errors or chaotic parameter updates.
Leaky Units: To deal with long-term dependencies we have to design a model that operates at
multiple time scales, so that some parts of the model operate at fine-grained time scales and can handle
small details, updating quickly, while other parts operate at coarse time scales and transfer information
slowly from the distant past to the present more efficiently.
Various strategies for building both fine and coarse time scales are possible. These include
1. the addition of skip connections across time,
2. “leaky units” that integrate signals with different time constants, and
These notes are prepared using material from "Deep Learning" by Ian Goodfellow, Yoshua Bengio, and Aaron Courville, MIT Press, 2016, and various
resources from the World Wide Web. Please email your valuable suggestions to [Link]@[Link].
19
3. the removal of some of the connections used to model fine-grained time scales.
1. Adding skip connections through time: One way to obtain coarse time scales is to add direct
connections from variables in the distant past to variables in the present.
In an ordinary recurrent network, a recurrent connection goes from a unit at time t to a unit at time t + 1.
It is possible to construct recurrent networks with longer delays.
As we know that gradients may vanish or explode exponentially with respect to the number of time steps.
We can introduce recurrent connections with a time-delay of d to mitigate this problem.
Due to adding skip connections in RNN, Gradients now diminish exponentially as a function of τ / d
rather than τ and shorten gradient paths. Since there are both delayed and single step connections,
gradients may still explode exponentially in τ. This allows the learning algorithm to capture longer
dependencies although not all long-term dependencies may be represented well in this way.
Standard RNN: h(1) → h(2) → ... → h(100) # Gradient path: 100 steps
With Skips (d=10): h(1) → h(10) → h(20) → ... → h(100) # Gradient path: 10 steps
2. Leaky units and a spectrum of time scales: Rather than an integer skip of d time steps, the effect
can be obtained smoothly by adjusting a real-valued α
When we accumulate a running average μ(t) of some value v(t) by applying the update
μ(t) ← α μ(t−1) + (1 − α)v(t) the α parameter is called linear self-correction from μ(t−1) to μ(t).
When α is near one, the running average remembers information about the past for a long time, and
when α is near zero, information about the past is rapidly discarded i.e short memory (fine scale).
Hidden units with linear self-connections can behave similarly to such running averages. Such hidden
units are called leaky units.
Can obtain product of derivatives close to 1 by having linear self-connections and a weight near 1 on
those connections.
There are two basic strategies for setting the time constants α used by leaky units. One strategy is to
manually fix them to values that remain constant, for example by sampling their values from some
distribution once at initialization time.
Another strategy is to make the time constants free parameters and learn them. Having such leaky units
at different time scales appears to help with long-term dependencies
This idea differs from the skip connections through time because it involves actively removing length-
one connections and replacing them with longer connections. Units modified in such a way are forced to
operate on a long-time scale. Skip connections through time add edges. Units receiving such new
connections may learn to operate on a long-time scale but may also choose to focus on their other short-
term connections.
These notes are prepared using material from "Deep Learning" by Ian Goodfellow, Yoshua Bengio, and Aaron Courville, MIT Press, 2016, and various
resources from the World Wide Web. Please email your valuable suggestions to [Link]@[Link].
20
Long-Short Term Memory (LSTM)
RNNs have loops. A chunk of neural network A looks at some input xt and outputs
a value ht.A loop allows information to be passed from one step of the network to
the next
An unrolled RNN.
Chain-like
structure reveals
that RNNs are
intimately related
to sequences and
lists
Application to Part
of Speech (POS)
Tagging
Different Types of RNNs: RNNs mainly used for Sequence Classification, Sentiment & Video
Classification, Sequence Labelling, POS (Part-of-Speech) & NE (Named Entity) Tagging ,Sequence
Generation, MT & Transliteration
These notes are prepared using material from "Deep Learning" by Ian Goodfellow, Yoshua Bengio, and Aaron Courville, MIT Press, 2016, and various
resources from the World Wide Web. Please email your valuable suggestions to [Link]@[Link].
21
RNNs often fail to retain distant information needed for tasks like language modeling, even though they
technically have access to the entire sequence. Hidden states tend to focus on local context (recent
inputs) and lose critical long-range dependencies.
2. Vanishing Gradients:
- During backpropagation, gradients are multiplied repeatedly by weights over many time steps.
- Result: Gradients shrink to near-zero (vanish) for distant words like flights, preventing learning.
To fix these issues, Long Short-Term Memory (LSTM) networks were designed. They:
1. Explicitly manage memory via "gates" that:
- Remember critical info (e.g., flights is plural).
- Forget irrelevant info (e.g., intermediate words).
2. Avoid vanishing gradients by using additive updates (not multiplicative).
Components of LSTM:
1. Input: The input to the LSTM cell at a given time step is denoted by the arrow labeled "input."
This input is typically a feature vector representing the data at that time step.
These notes are prepared using material from "Deep Learning" by Ian Goodfellow, Yoshua Bengio, and Aaron Courville, MIT Press, 2016, and various
resources from the World Wide Web. Please email your valuable suggestions to [Link]@[Link].
22
2. Gates: LSTM cells use three types of gates to regulate the flow of information:
- Forget Gate f i(t)): The forget gate applies a sigmoid function to the cell state to determine which parts
of the state should be retained si(t-1) and which should be forgotten. This is also done through pointwise
multiplication.
The self-loop weight (or the associated time constant) is controlled by a forget gate unit f(t)i (for time step
t and cell i), that sets this weight to a value between 0 and 1 via a sigmoid unit:
- Input Gate (g i(t)): The input gate uses a sigmoid unit to decide which values from the input should
be used to update the cell state. The input data is transformed by a pointwise multiplication with the
weights produced by the input gate.
External input gate unit is computed similar to forget gate with a sigmoid unit to obtain a gating value
between 0 and 1 but with its own parameters
Same structure as forget gate, but with independent parameters (bg, Ug, Wg)
- Output Gate (q i(t)): The output gate uses a sigmoid function to decide which parts of the cell state
should be output. The output is then computed by applying the output gate's weights to the cell state
through pointwise multiplication.
These notes are prepared using material from "Deep Learning" by Ian Goodfellow, Yoshua Bengio, and Aaron Courville, MIT Press, 2016, and various
resources from the World Wide Web. Please email your valuable suggestions to [Link]@[Link].
23
4. Cell State Unit: The cell state, often referred to as the "state," is linear self-loop. sigmoid function
produces weights which is controlled by forget gate. It acts as the memory of the LSTM cell, carrying
information across time steps.
Cell State Update (s i(t)): The cell state is updated by combining the transformed input (controlled by
the input gate) and the filtered previous cell state (controlled by the forget gate).
5. Self-Loop: The self-loop represents the recurrent connection of the cell state, allowing it to maintain
its state over time. This loop is crucial for the LSTM's ability to capture long-term dependencies. self-
loop weight (or the associated time constant) is controlled by a forget gate unit .
6. Output: The output of the LSTM cell is computed based on the current cell state and controlled by
the output gate. This output can be used as the input to the next time step or as the final output of the
network.
These notes are prepared using material from "Deep Learning" by Ian Goodfellow, Yoshua Bengio, and Aaron Courville, MIT Press, 2016, and various
resources from the World Wide Web. Please email your valuable suggestions to [Link]@[Link].