MODULE 4
RNNs are neural networks for sequences (like sentences or time
series). Unlike regular networks, they remember information from
previous steps by reusing the same weights across the sequence.
This allows them to handle sequences of different lengths and
understand patterns anywhere in the sequence. For example, they
can find the year “2009” in a sentence whether it’s at the start or end.
Basically, RNNs are good at learning from data where order matters.
1. Unfolding Computational Graph
A computational graph is like a map showing all the steps a
network takes to turn inputs into outputs.
In a recurrent system, the current state s(t) depends on the previous
state s(t−1).
This creates a loop, which can be hard to handle directly.
By unfolding the recurrent system for a fixed number of steps, we
can turn the loop into a straight chain of computations:
•This chain has no loops and can now be represented as a
normal, acyclic computational graph as shown in figure 4.1.
Why it’s useful:
• Makes it easier to visualize and compute gradients.
• Allows parameter sharing because the same function f (with
parameters θ) is applied at each step.
Page 1 of 33
• Helps the network handle sequences of different lengths
using the same model..
Figure 4.1: directed acyclic computational graph [TB2]
As another example, let us consider a dynamical system driven by
an external signal x(t) ,
s(t) = f(s(t−1) , x(t); θ),
where we see that the state now contains information about the
whole past sequence.
To indicate that the state is the hidden units of the network, we now
rewrite this equation using the variable h to represent the state:
h(t) = f(h(t−1), x(t); θ), illustrated in figure 4.2, typical RNNs will
add extra architectural features such as output layers that read
information out of the state h to make predictions.
Figure 4.2: A recurrent neural network with no output[TB2]
The unfolding process thus introduces two major advantages:
1. Regardless of the sequence length, the learned model always has
the same input size, because it is specified in terms of transition from
one state to another state, rather than specified in terms of a variable-
length history of states.
2. It is possible to use the same transition function f with the same
parameters at every time step.
Page 2 of 33
2. Recurrent Neural Networks
RNNs are a type of neural network designed to handle sequential
data, such as text, speech, or time series. They process inputs one at
a time while maintaining a hidden state that carries information from
previous time steps.
Key Components in an RNN
• Input-to-Hidden (U): Connects the input at time t to the hidden
state.
• Hidden-to-Hidden (W): Connects the hidden state at t−1 to the
hidden state at t (recurrent connection).
• Hidden-to-Output (V): Maps the hidden state at time t to the
output.
• Bias vectors (b, c) are added for hidden and output layers.
Some examples of important design patterns for recurrent neural
networks include the following:
• Recurrent networks that produce an output at each time step and
have
recurrent connections between hidden units, illustrated in figure 4.3.
• Recurrent networks that produce an output at each time step and
have
recurrent connections only from the output at one time step to the
hidden units at the next time step, illustrated in figure 4.4
• Recurrent networks with recurrent connections between hidden
units, that read an entire sequence and then produce a single output,
illustrated in figure 4.5.
Forward Propagation in RNNs
At each time step t, the computations are:
1. Compute activation:
2. Apply non-linearity (e.g., tanh) to get hidden state:
3. Compute output:
4. Apply softmax (for classification tasks) to get probabilities:
Page 3 of 33
Loss Computation
• If the RNN produces an output for every time step, the total
loss for a sequence is the sum of losses at each step:
where pmodel (y(t) | {x(1) , . . . , x(t) }) is given by reading the entry
for y(t) from the model’s output vector yˆ(t).
Figure 4.3: Computational graph of an RNN with unfolded time
steps and training loss computation. [TB2]
Page 4 of 33
Figure 4.4: An RNN whose only recurrence is the feedback
connection from the output to the hidden layer. [TB2]
Figure 4.5: Time-unfolded recurrent neural network with a single
output at the end of the sequence. [TB2]
2.1 Teacher Forcing and Networks with Output Recurrence
Teacher forcing is a training technique used in RNNs where, instead
of letting the model use its own output from the previous time step,
we feed the correct target output (from the training data) as input to
the next step as shown in figure 4.6.
• Normally, in RNNs, the hidden state at time t depends on:
• The input x(t)
• The hidden state (or output) from the previous step.
• If we always feed the model’s own predicted output, errors can
accumulate during training and make learning harder.
• Teacher forcing avoids this by giving the model the right
answer at each step during training, so it learns faster and more
reliably.
Cons of Teacher Forcing:
At test time, the model won’t have the true target — it must use its
own predictions.
This can cause a train-test mismatch i.e during training, it always
saw perfect inputs, but at test time, it has to work with noisy, self-
generated ones.
This mismatch is called exposure bias.
Page 5 of 33
Solutions:
1. Mix teacher forcing with free-running mode: Sometimes feed
the true outputs, sometimes let the model use its own
predictions.
2. Scheduled Sampling (Bengio et al., 2015): Start with full
teacher forcing, then gradually replace true targets with the
model’s own outputs during training.
Figure 4.6: llustration of teacher forcing. [TB2]
2.2 Computing the Gradient in a Recurrent Neural Network
Computing the gradient through a recurrent neural network is
straightforward.
One simply applies the generalized back-propagation algorithm
to the unrolled computational graph. No specialized algorithms are
necessary.
(note : BPTT means back propagation through time)
Page 6 of 33
’
In this derivation we assume that the outputs o(t) are used as the
argument to the softmax function to obtain the vector ˆy of
probabilities over the output. We also assume that the loss is the
negative log-likelihood of the true target y(t) given the input so far.
The gradient ∇o(t)L on the outputs at time step t, for all i, t, is as
follows
The gradient on the remaining parameters is given by
Page 7 of 33
2.3 Recurrent Networks as Directed Graphical Models
Similar to feedforward networks, many loss functions can be used
(cross-entropy, MSE, etc.).
• Usually, cross-entropy or negative log-likelihood is used
because outputs are interpreted as probability distributions.
Predictive Training:
• RNNs are trained to estimate the conditional distribution of
the next element in a sequence:
o Without output feedback:
p(y(t)∣x(1),…,x(t)
o With output feedback:
p(y(t)∣x(1),…,x(t),y(1),…,y(t−1))
Graphical Model View:
• RNNs define a probabilistic model over sequences.
• Each output y(t) may depend on:
o Just the inputs x (conditionally independent given xxx).
o Both inputs and past outputs y(1),…,y(t−1)y(1)
Chain Rule Decomposition:
• The joint probability of a sequence is factorized as:
Loss Function:
Page 8 of 33
• The training loss is the negative log-likelihood of the
sequence:
•
•
• Meaning: at each step, the RNN maximizes the probability of
the correct next element.
RNN can also be interpreted as a graphical model with a complete
graph structure over the output variables y as shown in figure 4.7.
In this view, every output y(t) can have a direct dependency on all
other outputs in the sequence.
This interpretation comes from marginalizing out the hidden states
h(t) (i.e., ignoring them and focusing only on outputs).
The result is a fully connected dependency graph among outputs,
meaning the RNN implicitly captures complex dependencies across
the entire sequence.
Figure 4.7 : graph where every past output y(i) is directly connected
to every future output y(t).[TB2]
Instead of explicitly connecting all past y(i) to future outputs, an
RNN uses hidden states h(t) as shown in figure 4.8.
Figure 4.8: RNN version[TB2]
Page 9 of 33
2.4 Modeling Sequences Conditioned on Context with RNNs
Earlier: RNN modeled only a sequence of outputs y(1),y(2),…y(1),
y(2), …y(1),y(2),… → joint distribution P(y).
Extension: RNN can model a conditional distribution P(y∣x).
This is done by letting parameters depend on an external input x.
Three common ways to provide input x:
1. As an extra input at each time step (most common).
2. As the initial hidden state h(0).
3. As both (1 + 2).
In the first method as seen in figure 4.9:
• Introduce a new weight matrix R.
• Each hidden unit gets additional input term (x^T)* R.
• This acts like a bias depending on x.
Interpretation:
• x provides context or conditions for generating the sequence.
• Makes the RNN capable of modeling P(y∣x) efficiently.
In a basic RNN, the outputs y(t) are assumed to be conditionally
independent given the inputs and hidden states — meaning, y(t)
does not directly depend on earlier outputs.
To fix this, Figure 4.10 adds a connection from the output at time t
→ to the hidden state at time t+1.
This way, the hidden state at the next step doesn’t just depend on the
input and past hidden states, but also on the previous output.
Because of this extra connection, the model can now represent
arbitrary probability distributions over the sequence of outputs
y(1),y(2),…
Page 10 of 33
Figure 4.9: RNN with Input Conditioning. [TB2]
Figure 4.10: RNN with Output-to-Hidden Connections. [TB2]
[Link] RNNs
combine two RNNs: one moves forward through the sequence, and
the other moves backward as seen in figure 4.11.
The output at each time step depends on both past and future inputs,
allowing better context understanding.
This approach avoids the need for a fixed-size look-ahead window,
unlike feedforward or regular RNNs.
Multidimensional extension: For 2D data (like images), four
RNNs can move in up, down, left, and right directions.
Each output in the 2D grid captures local information while also
considering long-range dependencies.
RNNs for images are more computationally expensive than CNNs
but can model long-range lateral interactions between features.
Forward propagation in multidimensional RNNs combines bottom-
up convolutional input with recurrent propagation to capture
complex spatial dependencies.
Page 11 of 33
Figure 4.11 : Bidirectional RNN. [TB2]
4. Encoder Decoder sequence to sequence architectures
• Traditional RNNs can map an input sequence to a fixed-size
vector, a fixed-size vector to an output sequence, or an input
sequence to an output sequence of the same length.
• Many applications, such as speech recognition and machine
translation, require mapping input sequences to output
sequences of different lengths.
• In the encoder-decoder or sequence-to-sequence architecture,
the encoder RNN processes the input sequence and produces
a context vector or sequence that summarizes the input.
• The decoder RNN generates the output sequence conditioned
on the context produced by the encoder, allowing the input and
output sequence lengths to differ.
• Decoder inputs can be provided either as the initial hidden
state, as input to the hidden units at each time step, or by
combining both methods.
Page 12 of 33
• The encoder and decoder can have different hidden layer sizes
depending on the task requirements.
• A limitation of this architecture is that a fixed-size context
vector may be too small to summarize long input sequences.
• This limitation is addressed by using a variable-length context
sequence and an attention mechanism, which aligns elements
of the input sequence with the output sequence.
• Figure 4.12 shows an Example of an encoder-decoder or
sequence-to-sequence RNN architecture, for learning to
generate an output sequence (y(1), . . . , y(ny)) given an input
sequence (x(1), x(2), . . . , x(nx)).
Figure 4.12: Encoder-Decoder Sequence-to-Sequence Architecture.
[TB2]
5. Deep Recurrent Networks
• Most RNN computations can be decomposed into three
blocks:
• (i) input to hidden state,
• (ii) previous hidden state to next hidden state, and
Page 13 of 33
• (iii) hidden state to output.
• In simple RNN architectures, each block is represented by a
single weight matrix, which corresponds to a shallow
transformation (a single layer with an affine transformation
followed by a nonlinearity).
• Introducing depth in each of these blocks can improve the
network’s ability to perform complex mappings, as suggested
by experimental evidence.
• One approach is to decompose the hidden state hierarchically,
where lower layers transform raw input into higher-level
representations for upper layers.
• Another approach is to replace each block with a separate
MLP (possibly deep), which increases representational
capacity but also lengthens the shortest path between time
steps, making optimization harder.
• This path-lengthening problem can be mitigated by
introducing skip connections in the hidden-to-hidden
transitions.
• Deep RNN architectures have been shown to improve
performance, but care must be taken to balance depth with
ease of training.
Figure 4.13 illustrates how an RNN can be made “deep” in different
ways to increase its representational power:
(a): Hidden state divided hierarchically to add depth.
(b): Deeper computations (MLPs) in input-hidden, hidden-hidden,
hidden-output paths.
(c): Skip connections in hidden-hidden path to ease training.
Page 14 of 33
Figure 4.13: Ways to introduce depth in recurrent neural networks.
[TB2]
6. Recursive Neural Networks
Recursive neural networks2 represent yet another
generalization of recurrent networks, with a different kind of
computational graph, which is structured as a deep tree, rather
than the chain-like structure of RNNs. The typical
computational graph for a recursive network is illustrated in
figure 4.14 .
Recursive neural networks were introduced by Pollack.
Recursive networks have been successfully applied to
processing data structures as input to neural nets, in natural
language processing as well as in computer vision.
One clear advantage of recursive nets over recurrent nets is
that for a sequence of the same length τ, the depth (measured
as the number of compositions of nonlinear operations) can be
drastically reduced from τ to O(logτ), which might help deal
with long-term dependencies.
Page 15 of 33
An open question is how to best structure the tree. One option
is to have a tree structure which does not depend on the data,
such as a balanced binary tree.
In some application domains, external methods can suggest
the appropriate tree structure.
For example, when processing natural language sentences, the
tree structure for the recursive network can be fixed to the
structure of the parse tree of the sentence provided by a natural
language parser.
Ideally, one would like the learner itself to discover and infer
the tree structure that is appropriate for any given input. Many
variants of the recursive net idea are possible.
For example, Frasconi et al. ( Frasconi et al. ( 1998 ) associate
the data with a tree structure, and associate the inputs and
targets with individual nodes of the tree.
The computation performed by each node does not have to be
the traditional artificial neuron computation (affine
transformation of all inputs followed by a monotone
nonlinearity).
For example, Socher et al. 2013a ( ) propose using tensor
operations and bilinear forms, which have previously been
found useful to model relationships between concepts when
the concepts are represented by continuous vectors
(embeddings).
Page 16 of 33
Figure 4.14 : Recursive neural network [TB2]
7. The Challenge of Long Term Dependencies
The basic problem is that gradients propagated over many stages
tend to either vanish (most of the time) or explode (rarely, but with
much damage to the optimization). Even if we assume that the
parameters are such that the recurrent network is stable (can store
memories, with gradients not exploding), the difficulty with long-
term dependencies arises from the exponentially smaller weights
given to long-term interactions (involving the multiplication of
many Jacobians) compared to short-term ones. Many other sources
provide a deeper treatment. Recurrent networks involve the
composition of the same function multiple times, once per timestep.
These compositions can result in extremely nonlinear behavior, as
illustrated in figure 4.15.
Page 17 of 33
Figure 4.15: Repeated function composition. [TB2]
The eigenvalues are raised to the power of t causing eigenvalues
with magnitude less than one to decay to zero and eigenvalues with
magnitude greater than one to explode.
The vanishing and exploding gradient problem for RNNs was
independently discovered by separate researchers ( , ; Hochreiter
1991 Bengio et al. 1993 1994 , , ).
One may hope that the problem can be avoided simply by staying in
a region of parameter space where the gradients do not vanish or
explode.
Page 18 of 33
Unfortunately, in order to store memories in a way that is robust to
small perturbations, the RNN must enter a region of parameter space
where gradients vanish ( Bengio et al. 1993 , , 1994).
Specifically, whenever the model is able to represent long term
dependencies, the gradient of a long term interaction has
exponentially smaller magnitude than the gradient of a short term
interaction.
It does not mean that it is impossible to learn, but that it might take
a very long time to learn long-term dependencies, because the signal
about these dependencies will tend to be hidden by the smallest
Bengio et al. 1994 ( fluctuations arising from short-term
dependencies. In practice, the experiments in ) show that as we
increase the span of the dependencies that need to be captured,
gradient-based optimization becomes increasingly difficult, with the
probability of successful training of a traditional RNN via SGD
rapidly reaching 0 for sequences of only length 10 or 20.
8. The Long Short-Term Memory and Other Gated RNNs
• The most effective sequence models in practice are gated
RNNs, including LSTM and GRU.
• Gated RNNs are designed to maintain derivatives through
time, preventing vanishing or exploding gradients.
• They generalize leaky units, which accumulate information
over long durations, by allowing connection weights to change
at each time step.
• Leaky units can store evidence over long sequences but cannot
automatically forget old states when needed.
• Gated RNNs introduce learnable mechanisms that allow the
network to decide when to forget or retain information,
improving sequence modeling over long dependencies.
LSTM
• As seen in figure 4.16, LSTM introduces self-loops in the cell
state to create paths where gradients can flow for long
durations, addressing vanishing/exploding gradients.
• Gated Self-Loop: The weight of the self-loop is controlled by
a forget gate, allowing the network to dynamically adjust the
time scale of integration based on input sequences.
Page 19 of 33
• Applications: Extremely successful in handwriting
recognition, speech recognition, handwriting generation,
machine translation, image captioning, and parsing.
• LSTM Cell Structure:
• Has an internal recurrence (self-loop) in addition to the outer
RNN recurrence.
Forget gate is given by
Where x(t) is the current input vector and h(t) is the current
hidden layer vector, containing the outputs of all the LSTM
cells, and bf, Uf, Wf are respectively biases, input weights, and
recurrent weights for the forget gates. The LSTM cell
internal state is thus updated as follows, but with a conditional
self-loop weight fi(t).
if forget gate is denoted by f, external input gate by g , hidden output
by h and output gate by q then
Page 20 of 33
Figure 4.16 : Block diagram of the LSTM recurrent network cell.
[TB2]
Other Gated RNN
Units of gated RNNs are also known as gated recurrent units, or
GRUs.
The main difference with the LSTM is that a single gating unit
simultaneously controls the forgetting factor and the decision to
update the state unit. The update equations are the following:
Purpose: Explore which components of LSTM are necessary and
design simpler gated architectures.
Gated Recurrent Units (GRUs):
• Combines forgetting and state update into a single update
gate u(t).
Page 21 of 33
• Reset gate r(t) controls how much past state contributes to
computing the new state.
Functionality:
• Update gate: Acts as a conditional leaky integrator, deciding
which dimensions of the state to keep or update.
• Reset gate: Introduces nonlinearity by controlling which parts
of the past state influence the new state.
Variants:
• Gates can be shared across multiple units or combined as
global and local gates.
• Studies show no variant consistently outperforms standard
LSTM or GRU across tasks.
• Crucial component for performance: forget gate, often
initialized with bias 1 for better learning.
9. Optimization for Long-Term Dependencies
When training RNNs for many time steps, the gradients
(signals that tell the network how to learn) can either:
• Vanishing: become too small → learning stops.
• Exploding: become too large → unstable learning.
Martens & Sutskever (2011) said: If the first derivative
(gradient) becomes tiny, the second derivative (curvature) may
also shrink.
Dividing them (which second-order methods do) can cancel
out the problem , making learning stable.
But the problem is:
Second-order methods are very slow, need lots of data at once,
and often get stuck in saddle points.
So later, Sutskever (2013) showed that simpler tricks (like
Nesterov momentum + careful starting values) work just as
well.
Today, people mostly just use LSTMs with normal SGD
instead of complex second-order math.
Page 22 of 33
9.1 Clipping Gradients
• RNNs often face exploding or vanishing gradients because
their functions are highly nonlinear over many time steps.
• The cost function looks like flat regions separated by sharp
cliffs, where gradients can suddenly become very large.
• A very large gradient can cause the parameter update to jump
too far, moving into a worse region and undoing earlier
progress.
• A learning rate that works in flat areas may be too large when
entering a steep, curved region, causing instability.
• Gradient clipping prevents updates from being too large by
limiting gradient values before updating parameters.
• Types of clipping:
• Element-wise clipping: Clip each gradient component
individually (Mikolov, 2012).
• Norm clipping: Clip the overall gradient magnitude if it
exceeds a threshold (Pascanu et al., 2013).
• Norm clipping formula:
• ensuring updates stay within a safe bound.
• Advantages of norm clipping: Keeps the update in the same
gradient direction, but with bounded step size.
• Bias introduced: Gradient clipping changes the gradient
estimate—making it biased—but this bias is empirically
useful for stabilizing training.
• Variants: Clipping can also be applied to hidden unit gradients
(Graves, 2013). Though methods differ slightly, they behave
similarly in practice.
Page 23 of 33
9.2 Regularizing to Encourage Information Flow
Gradient clipping limitation: It only helps with exploding gradients,
not with vanishing gradients.
Goal is to maintain gradient flow through time so that information
from earlier steps still influences later steps.
To achieve this create a computational paths where the product of
gradients across time steps stays close to 1.
One way is using architectures like LSTMs that include self-loops
and gates to preserve long-term information.
Another way is to add a penalty term (regularizer) that encourages
gradients to keep their strength instead of vanishing.
Mathematical objective: Ensure that
has similar magnitude to ∇h(t)L
Pascanu et al. (2013) proposed
Page 24 of 33
Approximation: Treat the backpropagated gradients ∇h(t)L as
constants, simplifying computation.
Effectiveness: With gradient norm clipping (to handle explosion),
this regularizer increases how long an RNN can remember
dependencies.
Weakness: It works but is less effective than LSTMs when lots of
data is available (e.g., in language modeling).
10. Explicit Memory
Neural networks are good at pattern recognition (implicit
knowledge) but not good at remembering clear facts (explicit
knowledge).
• Humans use working memory to hold and manipulate
information, but neural networks cannot do this well.
• Memory Networks (2014) introduced external memory cells
but needed supervision to know how to use them.
• Neural Turing Machines (2014) improved this by letting
networks read/write memory automatically without
supervision.
• They use soft attention (softmax-weighted averages) to
access memory, which makes training with gradient descent
possible.
• Each memory cell stores a vector, so richer information can
be saved and retrieved by similarity or position.
• Keeping memory contents over time helps prevent
vanishing/exploding gradients.
• With explicit memory, neural networks can solve tasks that
RNNs/LSTMs alone cannot, because information can be kept
and used for long durations.
• Hard (stochastic) memory access is possible, but it makes
training harder since it involves discrete choices.
Page 25 of 33
• Attention mechanisms (first in handwriting, later in
translation) combined with memory give networks stronger
reasoning and flexibility to handle complex real-world tasks.
Page 26 of 33