Module 4
RNN
Recurrent neural networks – Computational graphs. RNN design. Encoder – decoder sequence to
sequence architectures. Language modeling example of RNN. Deep recurrent networks. Recursive
neural networks. Challenges of training Recurrent Networks. Gated RNNs LSTM and GRU. Case
study: BERT, Social Media Sentiment Analysis.
Recurrent neural networks – Computational graphs
Recurrent neural networks, or RNNs, are a family of neural networks for processing sequential data. Much
as a convolutional network is a neural network that is specialized for processing a grid of values X
such as an image, a recurrent neural network is a neural network that is specialized for processing a
sequence of values x(1), . . . , x(τ) . Most recurrent networks can also process sequences of variable
length. Parameter sharing makes it possible to extend and apply the RNN model to different forms
(different lengths, here) and generalize across them. If we had separate parameters for each value of
the time index, we could not generalize to sequence lengths not seen during training, nor share
statistical strength across different sequence lengths and across different positions in time.
For the simplicity of exposition, we refer to RNNs as operating on a sequence that contains vectors
x(t) with the time step index t ranging from 1 to τ . RNNs may also be applied in two dimensions
across spatial data such as images, and even when applied to data involving time, the network may
have connections that go backwards in time, provided that the entire sequence is observed before it
is provided to the network.
Unfolding Computational Graphs
A computational graph is a way to formalize the structure of a set of computations, such as those
involved in mapping inputs and parameters to outputs and loss.
The idea is to unfold a recursive or recurrent computation into a computational graph that has a
repetitive structure, typically corresponding to a chain of events. Unfolding this graph results in the
sharing of parameters across a deep network structure.
For example, consider the classical form of a dynamical system:
$$s^{(t)} = f(s^{(t-1)}; \theta) \tag{10.1}$$
where s(t) is called the state of the system. It is recurrent because the definition of s at time t refers
back to the same definition at time t − 1.
For a finite number of time steps τ , the graph can be unfolded by applying the definition τ − 1 times.
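For example, if we unfold Eq. 10.1 for τ = 3 time steps, we obtain
$$s^{(3)} = f(s^{(2)}; \theta) = f(f(s^{(1)}; \theta); \theta)$$
The unfolded computational graph can be visualized as a chain in which each state s(t) is computed
from s(t−1) by the same function f with the same parameters θ.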
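Essentially any function involving recurrence can be considered a recurrent neural network. To
indicate that the state is formed by the hidden units of the network, we use the variable h to
represent the state, which is now also driven by an external input x(t):
$$h^{(t)} = f(h^{(t-1)}, x^{(t)}; \theta)$$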
When the recurrent network is trained to perform a task that requires predicting the future from the
past, the network typically learns to use h(t) as a kind of lossy summary of the task-relevant aspects
of the past sequence of inputs up to t.
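As a minimal sketch of unfolding, the recurrence h(t) = f(h(t−1), x(t); θ) can be written as a loop that reuses the same parameters at every step. The tanh transition, the parameter names W and U, and the sizes below are illustrative assumptions, not something fixed by these notes.

```python
import numpy as np

def unfold(xs, h0, W, U):
    """Apply h(t) = tanh(W h(t-1) + U x(t)) for every x(t) in xs.

    The same W and U are reused at every time step (parameter sharing),
    so sequences of any length can be processed.
    """
    h = h0
    states = []
    for x in xs:                      # one iteration = one node of the unfolded graph
        h = np.tanh(W @ h + U @ x)
        states.append(h)
    return states

rng = np.random.default_rng(0)
W = rng.normal(scale=0.5, size=(3, 3))   # hidden-to-hidden weights (assumed size)
U = rng.normal(scale=0.5, size=(3, 2))   # input-to-hidden weights (assumed size)
h0 = np.zeros(3)

short_seq = [rng.normal(size=2) for _ in range(4)]
long_seq = [rng.normal(size=2) for _ in range(9)]

# The same parameters handle both sequence lengths.
print(len(unfold(short_seq, h0, W, U)), len(unfold(long_seq, h0, W, U)))
```

Each returned state plays the role of the lossy summary h(t) of the inputs seen up to step t.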
A recurrent network with no outputs. This recurrent network just processes information from the
input x by incorporating it into the state h that is passed forward through time. (Left) Circuit
diagram. The black square indicates a delay of 1 time step. (Right) The same network seen as an
unfolded computational graph, where each node is now associated with one particular time
instance.
The simplest recurrent neural network is shown in Figure 7.2(a). A key point here is the
presence of the self-loop in Figure 7.2(a). In practice, one only works with sequences of finite length,
and it makes sense to unfold the loop into a “time-layered” network that looks more like a feed-
forward network. This network is shown in Figure 7.2(b).
Figure 7.2 shows a case in which each time-stamp has an input, output, and hidden unit. In practice,
it is possible for either the input or the output units to be missing at any particular time-stamp.
Examples of cases with missing inputs and outputs are shown in Figure 7.3. The choice of missing
inputs and outputs would depend on the specific application at hand.
RNN design
Armed with the graph unrolling and parameter sharing ideas, we can design a wide variety of
recurrent neural networks.
Recurrent networks that produce an output at each time step and have recurrent
connections between hidden units, illustrated in Fig
The computational graph to compute the training loss of a recurrent network that maps an input
sequence of x values to a corresponding sequence of output o values. A loss L measures how far
each o is from the corresponding training target y. When using softmax outputs, we assume o is the
unnormalized log probabilities. The loss L internally computes ŷ = softmax(o) and compares this to
the target y. The RNN has input-to-hidden connections parametrized by a weight matrix U,
hidden-to-hidden recurrent connections parametrized by a weight matrix W, and hidden-to-output
connections parametrized by a weight matrix V. Eq. 10.8 defines forward propagation in this model.
(Left) The RNN and its loss drawn with recurrent connections. (Right) The same seen as a
time-unfolded computational graph, where each node is now associated with one particular time
instance.
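The forward pass described in this caption can be sketched as follows. This is a minimal illustration under assumptions: a tanh hidden unit, bias vectors b and c, integer class targets, and a negative log-likelihood loss summed over time. U, W and V are the weight matrices named above; everything else is a hypothetical choice for the example.

```python
import numpy as np

def softmax(o):
    e = np.exp(o - np.max(o))        # subtract the max for numerical stability
    return e / e.sum()

def rnn_forward(xs, ys, U, W, V, b, c, h0):
    """Return the total loss and the per-step predictions y_hat.

    xs: list of input vectors, ys: list of integer class targets.
    """
    h, loss, y_hats = h0, 0.0, []
    for x, y in zip(xs, ys):
        a = b + W @ h + U @ x        # pre-activation of the hidden layer
        h = np.tanh(a)               # hidden state h(t)
        o = c + V @ h                # unnormalized log probabilities o(t)
        y_hat = softmax(o)
        loss += -np.log(y_hat[y])    # L(t): negative log-likelihood of the target
        y_hats.append(y_hat)
    return loss, y_hats

rng = np.random.default_rng(1)
n_in, n_hid, n_out = 3, 5, 4         # assumed sizes
U = rng.normal(scale=0.1, size=(n_hid, n_in))
W = rng.normal(scale=0.1, size=(n_hid, n_hid))
V = rng.normal(scale=0.1, size=(n_out, n_hid))
b, c, h0 = np.zeros(n_hid), np.zeros(n_out), np.zeros(n_hid)

xs = [rng.normal(size=n_in) for _ in range(6)]
ys = [0, 2, 1, 3, 3, 0]
loss, y_hats = rnn_forward(xs, ys, U, W, V, b, c, h0)
print(float(loss))
```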
Recurrent networks that produce an output at each time step and have recurrent
connections only from the output at one time step to the hidden units at the next time step,
illustrated in Fig.
An RNN whose only recurrence is the feedback connection from the output to the hidden layer. At
each time step t, the input is x(t), the hidden layer activations are h(t), the outputs are o(t), the
targets are y(t) and the loss is L(t) . (Left) Circuit diagram. (Right) Unfolded computational graph. The
RNN in this figure is trained to put a specific output value into o, and o is the only information it is
allowed to send to the future. There are no direct connections from h going forward. The previous h
is connected to the present only indirectly, via the predictions it was used to produce. Unless o is
very high-dimensional and rich, it will usually lack important information from the past. This makes
the RNN in this figure less powerful, but it may be easier to train because each time step can be
trained in isolation from the others, allowing greater parallelization during training.
Recurrent networks with recurrent connections between hidden units, that read an entire
sequence and then produce a single output, illustrated in Fig
Time-unfolded recurrent neural network with a single output at the end of the sequence. Such a
network can be used to summarize a sequence and produce a fixed-size representation used as input
for further processing. There might be a target right at the end (as depicted here) or the gradient
on the output o(t) can be obtained by back-propagating from further downstream modules.
Recurrent networks that map a fixed-length vector x to a variable-length sequence Y.
An RNN that maps a fixed-length vector x into a distribution over sequences Y. This RNN is
appropriate for tasks such as image captioning, where a single image is used as input to a model
that then produces a sequence of words describing the image. Each element y(t) of the observed
output sequence serves both as input (for the current time step) and, during training, as target
(for the previous time step).
Encoder – decoder sequence to sequence architectures
An RNN can be trained to map an input sequence to an output sequence which is not necessarily of
the same length. This comes up in many applications, such as speech recognition, machine
translation or question answering, where the input and output sequences in the training set are
generally not of the same length.
The idea of encoder-decoder or sequence-to-sequence architecture is very simple: (1) an encoder or
reader or input RNN processes the input sequence. The encoder emits the context C, usually as a
simple function of its final hidden state. (2) a decoder or writer or output RNN is conditioned on that
fixed-length vector (just like in Fig. 10.9) to generate the output sequence Y = (y(1), . . . , y(ny )).
The figure shows an encoder-decoder or sequence-to-sequence RNN architecture for learning to
generate an output sequence (y(1), . . . , y(ny)) given an input sequence (x(1), x(2), . . . , x(nx)). It is
composed of an encoder RNN that reads the input sequence and a decoder RNN that generates the
output sequence (or computes the probability of a given output sequence). The final hidden state of
the encoder RNN is used to compute a generally fixed-size context variable C which represents a
semantic summary of the input sequence and is given as input to the decoder RNN.
One clear limitation of this architecture is when the context C output by the encoder RNN has a
dimension that is too small to properly summarize a long sequence.
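The two-stage idea can be sketched in a few lines. This is only an illustration under assumptions: both RNNs use tanh units, the context C is taken to be the encoder's final hidden state, C is fed to the decoder at every step, and the output length is passed in by hand (a trained decoder would normally decide when to stop). All parameter names are hypothetical.

```python
import numpy as np

def encode(xs, U, W):
    """Encoder RNN: read the whole input sequence, return the context C."""
    h = np.zeros(W.shape[0])
    for x in xs:
        h = np.tanh(W @ h + U @ x)
    return h                                   # C = final hidden state

def decode(C, n_steps, W_dec, R_dec, V_dec):
    """Decoder RNN: generate n_steps outputs conditioned on the fixed-length C."""
    h = np.zeros(W_dec.shape[0])
    ys = []
    for _ in range(n_steps):
        h = np.tanh(W_dec @ h + R_dec @ C)     # C enters the decoder at every step
        ys.append(V_dec @ h)
    return ys

rng = np.random.default_rng(2)
n_in, n_hid, n_out = 3, 4, 2                   # assumed sizes
U = rng.normal(size=(n_hid, n_in))
W = rng.normal(size=(n_hid, n_hid))
W_dec = rng.normal(size=(n_hid, n_hid))
R_dec = rng.normal(size=(n_hid, n_hid))
V_dec = rng.normal(size=(n_out, n_hid))

xs = [rng.normal(size=n_in) for _ in range(7)]  # input length n_x = 7
C = encode(xs, U, W)
ys = decode(C, 4, W_dec, R_dec, V_dec)          # output length n_y = 4
print(C.shape, len(ys))
```

Whatever the input length, C has the same size, which is exactly why a long input can overwhelm a small context vector.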
Language modeling example of RNN.
To illustrate the workings of the RNN, we will use an example of a single sequence defined on a
vocabulary of four words. Consider the sentence: “The cat chased the mouse.”
In this case, we have a lexicon of four words, which are {“the,” “cat,” “chased,” “mouse”}. In Figure
7.4, we have shown the probabilistic prediction of the next word at each of the time stamps from 1 to 4.
Ideally, we would like the probability of the next word to be predicted correctly from the
probabilities of the previous words. Each one-hot encoded input vector xt has length four, in which
only one bit is 1 and the remaining bits are 0s.
Wxh will be a 2 × 4 matrix, so that it maps a one-hot encoded input vector xt into a hidden vector ht
of size 2. Whh and Why are of sizes 2 × 2 and 4 × 2, respectively. The output yt is defined by yt = Why ht.
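The shapes in this example can be checked with a short sketch. The weights below are random and untrained, so the printed probabilities are meaningless as predictions; the tanh hidden unit and the softmax applied to yt are assumptions made for the illustration.

```python
import numpy as np

vocab = ["the", "cat", "chased", "mouse"]
index = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    x = np.zeros(len(vocab))
    x[index[word]] = 1.0
    return x

rng = np.random.default_rng(0)
Wxh = rng.normal(size=(2, 4))   # input-to-hidden, 2 x 4
Whh = rng.normal(size=(2, 2))   # hidden-to-hidden, 2 x 2
Why = rng.normal(size=(4, 2))   # hidden-to-output, 4 x 2

inputs = ["the", "cat", "chased", "the"]       # words at time stamps 1..4
targets = ["cat", "chased", "the", "mouse"]    # next word at each time stamp

h = np.zeros(2)
probs = []
for word, target in zip(inputs, targets):
    h = np.tanh(Wxh @ one_hot(word) + Whh @ h)
    y = Why @ h                                # real-valued scores, one per word
    p = np.exp(y - y.max())
    p /= p.sum()                               # softmax over the lexicon
    probs.append(p)
    print(word, "->", [round(float(v), 2) for v in p], "target:", target)
```

Training would adjust Wxh, Whh and Why so that the probability assigned to each target word becomes large.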
Deep recurrent networks
The computation in most RNNs can be decomposed into three blocks of parameters and associated
transformations: 1. from the input to the hidden state, 2. from the previous hidden state to the next
hidden state, and 3. from the hidden state to the output.
Introducing depth in an RNN can help transform the raw input into a representation that is
more appropriate at the higher levels of the hidden state. However, adding depth may hurt learning by
making optimization more difficult.
The figure shows that a recurrent neural network can be made deep in many ways. (a) The hidden
recurrent state can be broken down into groups organized hierarchically. (b) Deeper computation
(e.g., an MLP) can be introduced in the input-to-hidden, hidden-to-hidden and hidden-to-output
parts. This may lengthen the shortest path linking different time steps. (c) The path-lengthening
effect can be mitigated by introducing skip connections.
An example of a deep network containing three layers is shown in Figure 7.6. Note that nodes in
higher-level layers receive input from those in lower-level layers. The relationships among the
hidden states can be generalized directly from the single-layer network.
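A stacked network of this kind can be sketched by letting each layer's hidden state serve as the input of the layer above. The tanh units, the layer sizes and the (U, W) pairing below are assumptions made for illustration.

```python
import numpy as np

def deep_rnn_forward(xs, layers):
    """Run a stacked RNN and return the final hidden state of every layer.

    layers: list of (U, W) pairs, lowest layer first. U maps the layer's
    input to its hidden state, W is its hidden-to-hidden matrix.
    """
    hs = [np.zeros(W.shape[0]) for _, W in layers]
    for x in xs:
        inp = x
        for k, (U, W) in enumerate(layers):
            hs[k] = np.tanh(U @ inp + W @ hs[k])   # same recurrence in every layer
            inp = hs[k]                            # output feeds the layer above
    return hs

rng = np.random.default_rng(3)
n_in, n_hid = 3, 4                                 # assumed sizes
layers = [(rng.normal(scale=0.5, size=(n_hid, n_in)),
           rng.normal(scale=0.5, size=(n_hid, n_hid)))]
layers += [(rng.normal(scale=0.5, size=(n_hid, n_hid)),
            rng.normal(scale=0.5, size=(n_hid, n_hid))) for _ in range(2)]

xs = [rng.normal(size=n_in) for _ in range(5)]
final_states = deep_rnn_forward(xs, layers)
print(len(final_states), final_states[-1].shape)
```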
Recursive neural networks
Recursive neural networks represent yet another generalization of recurrent networks, with a
different kind of computational graph, which is structured as a deep tree, rather than the chain-like
structure of RNNs. One clear advantage of recursive nets over recurrent nets is that for a sequence
of the same length τ, the depth (measured as the number of compositions of nonlinear operations)
can be drastically reduced from τ to O(log τ ), which might help deal with long-term dependencies.
A recursive network has a computational graph that generalizes that of the recurrent network
from a chain to a tree. A variable-size sequence x(1), x(2), . . . , x(t) can be mapped to a
fixed-size representation (the output o), with a fixed set of parameters (the weight matrices U,
V, W). The figure illustrates a supervised learning case in which some target y is provided which
is associated with the whole sequence.
An open question is how to best structure the tree. For example, when processing natural
language sentences, the tree structure for the recursive network can be fixed to the structure
of the parse tree of the sentence provided by a natural language parser.
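The depth reduction from τ to O(log τ) can be seen in a small sketch that combines neighbouring vectors pairwise until one vector remains. The balanced binary tree, the tanh composition and the names W_left and W_right are assumptions chosen for the example; a parse tree from a natural language parser would give a different structure.

```python
import numpy as np

def compose(left, right, W_left, W_right):
    """Combine two child representations into one parent representation."""
    return np.tanh(W_left @ left + W_right @ right)

def recursive_encode(leaves, W_left, W_right):
    """Reduce a list of vectors to one vector with a balanced binary tree.

    Returns the root vector and the number of composition levels (tree depth).
    The same two weight matrices are shared by every node of the tree.
    """
    level, depth = list(leaves), 0
    while len(level) > 1:
        nxt = [compose(level[i], level[i + 1], W_left, W_right)
               for i in range(0, len(level) - 1, 2)]
        if len(level) % 2 == 1:          # an unpaired element is carried up unchanged
            nxt.append(level[-1])
        level, depth = nxt, depth + 1
    return level[0], depth

rng = np.random.default_rng(4)
d = 4                                    # assumed representation size
W_left, W_right = rng.normal(size=(d, d)), rng.normal(size=(d, d))
leaves = [rng.normal(size=d) for _ in range(8)]

root, depth = recursive_encode(leaves, W_left, W_right)
print(root.shape, depth)                 # 3 levels for 8 leaves, versus 8 steps in a chain
```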
Challenges of training Recurrent Networks
Recurrent neural networks are very hard to train because the time-layered
network is a very deep network, especially if the input sequence is long.
The loss function has highly varying sensitivities (i.e., loss gradients) with respect to
different temporal layers, but the same parameter matrices are shared by different
temporal layers. This combination of varying sensitivity and shared parameters in different
layers can lead to some unusually unstable effects.
The primary challenge associated with a recurrent neural network is that of the vanishing
and exploding gradient problems.
Consider a set of T consecutive layers, in which the tanh activation function, Φ(·), is applied
between each pair of layers. The shared weight between a pair of hidden nodes is denoted
by w. Let h1 ...hT be the hidden values in the various layers. Let Φʹ (ht) be the derivative of the
activation function in hidden layer t. Let the copy of the shared weight w in the tth layer be
denoted by wt so that it is possible to examine the effect of the backpropagation update. Let
∂L/∂ht be the derivative of the loss function with respect to the hidden activation ht. The
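neural architecture is illustrated in Figure 7.7. Then, one derives the following update
equation using backpropagation:
$$\frac{\partial L}{\partial h_t} = \Phi'(h_{t+1}) \cdot w_{t+1} \cdot \frac{\partial L}{\partial h_{t+1}}$$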
Since the shared weights in different temporal layers are the same, the gradient is multiplied
with the same quantity wt = w for each layer. Such a multiplication will have a consistent
bias towards vanishing when w < 1, and it will have a consistent bias towards exploding
when w > 1. However, the choice of the activation function will also play a role because the
derivative Φʹ (ht+1) is included in the product.
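The effect of repeatedly multiplying by the same weight can be checked numerically. The sketch below follows a scalar chain of T layers and multiplies the gradient by w times the activation derivative once per layer. The linear activation option, the starting value h1 and the choice T = 50 are assumptions added for illustration.

```python
import math

def gradient_magnitude(w, T, activation="tanh", h1=0.1):
    """Return |dL/dh_1| when dL/dh_T = 1 for a chain of T scalar layers.

    Forward:  h_{t+1} = phi(w * h_t).
    Backward: each layer multiplies the gradient by w * phi'(.) of layer t+1.
    """
    pre_activations = []
    h = h1
    for _ in range(T - 1):                   # forward pass through T-1 links
        z = w * h
        pre_activations.append(z)
        h = math.tanh(z) if activation == "tanh" else z
    grad = 1.0
    for z in reversed(pre_activations):      # backward pass
        dphi = 1.0 - math.tanh(z) ** 2 if activation == "tanh" else 1.0
        grad *= w * dphi
    return abs(grad)

for w in (0.5, 1.5):
    print(w, gradient_magnitude(w, 50, "linear"), gradient_magnitude(w, 50, "tanh"))
```

With the linear activation the gradient behaves like w^(T−1), so it shrinks for w = 0.5 and grows for w = 1.5. With tanh, every factor is additionally scaled by a derivative no larger than 1, which is the activation-function effect mentioned above.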
There are several solutions to the vanishing and exploding gradient problems, not all of which are
equally effective. For example, the simplest solution is to use strong regularization on the
parameters, which tends to reduce some of the problematic instability caused by the vanishing and
exploding gradient problems. A second solution is gradient clipping. Gradient clipping is well suited
to solving the exploding gradient problem. There are two types of clipping that are commonly used.
The first is value-based clipping, and the second is norm-based clipping.
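The two clipping variants can be sketched as below. The function names and thresholds are hypothetical; the sketch assumes the gradient is available as a single NumPy vector.

```python
import numpy as np

def clip_by_value(grad, limit):
    """Value-based clipping: clamp every component to [-limit, limit]."""
    return np.clip(grad, -limit, limit)

def clip_by_norm(grad, max_norm):
    """Norm-based clipping: rescale the vector if its L2 norm exceeds max_norm."""
    norm = np.linalg.norm(grad)
    if norm > max_norm:
        return grad * (max_norm / norm)
    return grad

g = np.array([3.0, -4.0, 0.5])
print(clip_by_value(g, 1.0))   # each component limited separately
print(clip_by_norm(g, 1.0))    # direction preserved, length capped
```

Value-based clipping can change the direction of the gradient, whereas norm-based clipping only shortens it.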
The type of instability faced by the optimization process is sensitive to the specific point on the loss
surface at which the current solution resides. Therefore, choosing good initialization points is crucial.
Using momentum methods can also help in addressing some of the instability.
Another useful trick that is often used to address the vanishing and exploding gradient problems is
batch normalization, although a variant known as layer normalization is more effective in recurrent
networks. In layer normalization, the normalization is performed only over a single training instance:
the normalization factor is obtained by using all the current activations in that layer for only
the current instance.
In order to understand how layer-wise normalization works, we repeat the hidden-to-hidden
recursion:
$$h_t = \tanh(W_{xh} x_t + W_{hh} h_{t-1})$$
The normalization is applied to the pre-activation values before applying the tanh activation function.
Therefore, the pre-activation value at the tth time-stamp is computed as follows:
$$a_t = W_{xh} x_t + W_{hh} h_{t-1}$$
Compute the mean μt and standard deviation σt of the pre-activation values in at, which has as many
components as the number (p) of units in the hidden layer:
$$\mu_t = \frac{1}{p} \sum_{i=1}^{p} a_{ti}, \qquad \sigma_t = \sqrt{\frac{1}{p} \sum_{i=1}^{p} a_{ti}^2 - \mu_t^2}$$
Here, ati denotes the ith component of the vector at. For the p units in the tth layer, we have a
p-dimensional vector of gain parameters γt, and a p-dimensional vector of bias parameters denoted by
βt. These parameters are analogous to the parameters γi and βi in batch normalization. The purpose
of these parameters is to re-scale the normalized values and add bias in a learnable way. The hidden
activations ht of the next layer are therefore computed as follows:
$$h_t = \tanh\left(\frac{\gamma_t}{\sigma_t} \odot (a_t - \mu_t) + \beta_t\right)$$
Here, the notation ⊙ indicates elementwise multiplication, and the notation μt refers to a vector
containing p copies of the scalar μt. The effect of layer normalization is to ensure that the
magnitudes of the activations do not continuously increase or decrease with time-stamp.
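One layer-normalized recurrent step can be sketched as follows, assuming the pre-activation is a_t = Wxh x_t + Whh h_(t−1) and that a small constant eps is added under the square root to avoid division by zero (eps is an addition for numerical safety, not part of the description above). The concrete weights are arbitrary.

```python
import numpy as np

def layer_norm_rnn_step(x_t, h_prev, Wxh, Whh, gamma, beta, eps=1e-5):
    a_t = Wxh @ x_t + Whh @ h_prev          # pre-activation values (p components)
    mu_t = a_t.mean()                       # scalar mean over the p units
    sigma_t = np.sqrt(a_t.var() + eps)      # scalar standard deviation (eps assumed)
    return np.tanh(gamma / sigma_t * (a_t - mu_t) + beta)

p = 6
Wxh = np.arange(12, dtype=float).reshape(p, 2)   # arbitrary illustrative weights
Whh = 0.1 * np.eye(p)
x_t, h_prev = np.array([1.0, 2.0]), np.zeros(p)
gamma, beta = np.ones(p), np.zeros(p)            # gain and bias parameters

h = layer_norm_rnn_step(x_t, h_prev, Wxh, Whh, gamma, beta)
h_big = layer_norm_rnn_step(x_t, h_prev, 1000.0 * Wxh, Whh, gamma, beta)

print(np.round(h, 3))
print(np.allclose(h, h_big, atol=1e-6))          # rescaled weights give the same activations
```

Scaling the input weights by 1000 leaves the activations essentially unchanged, which illustrates why the magnitudes do not drift with the time-stamp.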
Gated RNNs LSTM
The most effective sequence models used in practical applications are called gated RNNs. These
include the long short-term memory and networks based on the gated recurrent unit.
LSTM (Long Short-Term Memory) is a recurrent neural network architecture widely used in Deep
Learning. It excels at capturing long-term dependencies, making it ideal for sequence prediction
tasks. LSTM recurrent networks have “LSTM cells” that have an internal recurrence (a self-loop), in
addition to the outer recurrence of the RNN. Each cell has the same inputs and outputs as an
ordinary recurrent network, but has more parameters and a system of gating units that controls the
flow of information.
The most important component is the state unit si(t) (in the figure Ct) that has a linear self-loop similar
to the leaky units described in the previous section. However, here, the self-loop weight (or the
associated time constant) is controlled by a forget gate unit fi(t) (for time step t and cell i), that sets
this weight to a value between 0 and 1 via a sigmoid unit:
$$f_i^{(t)} = \sigma\Big(b_i^f + \sum_j U_{i,j}^f x_j^{(t)} + \sum_j W_{i,j}^f h_j^{(t-1)}\Big)$$
where x(t) is the current input vector and h(t) is the current hidden layer vector, containing the outputs
of all the LSTM cells, and bf ,Uf , Wf are respectively biases, input weights and recurrent weights for
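the forget gates. The LSTM cell internal state is thus updated as follows, but with a conditional
self-loop weight fi(t):
$$s_i^{(t)} = f_i^{(t)} s_i^{(t-1)} + g_i^{(t)} \sigma\Big(b_i + \sum_j U_{i,j} x_j^{(t)} + \sum_j W_{i,j} h_j^{(t-1)}\Big)$$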
where b, U and W respectively denote the biases, input weights and recurrent weights into the LSTM
cell. The external input gate unit gi(t) is computed similarly to the forget gate (with a sigmoid unit to
obtain a gating value between 0 and 1), but with its own parameters:
$$g_i^{(t)} = \sigma\Big(b_i^g + \sum_j U_{i,j}^g x_j^{(t)} + \sum_j W_{i,j}^g h_j^{(t-1)}\Big)$$
The output hi(t) of the LSTM cell can also be shut off, via the output gate qi(t) (in the figure Ot), which
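also uses a sigmoid unit for gating:
$$h_i^{(t)} = \tanh\big(s_i^{(t)}\big) q_i^{(t)}$$
$$q_i^{(t)} = \sigma\Big(b_i^o + \sum_j U_{i,j}^o x_j^{(t)} + \sum_j W_{i,j}^o h_j^{(t-1)}\Big)$$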
which has parameters bo, Uo, Wo for its biases, input weights and recurrent weights, respectively.
LSTM networks have been shown to learn long-term dependencies more easily than the simple
recurrent architectures.
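One time step of an LSTM cell with forget gate f, external input gate g and output gate q can be sketched as below. This is a minimal illustration under assumptions: the parameter dictionary and its key names are hypothetical, the weights are random, and the new content entering the cell is squashed with a sigmoid (many implementations use tanh at that point instead).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def init_params(n_in, n_hid, rng):
    """Create b, U, W for the cell input ("") and for each gate (f, g, o)."""
    p = {}
    for gate in ("f", "g", "o", ""):
        p["U" + gate] = rng.normal(scale=0.1, size=(n_hid, n_in))
        p["W" + gate] = rng.normal(scale=0.1, size=(n_hid, n_hid))
        p["b" + gate] = np.zeros(n_hid)
    return p

def lstm_step(x, h_prev, s_prev, p):
    """One LSTM time step; returns the new output h and internal state s."""
    f = sigmoid(p["bf"] + p["Uf"] @ x + p["Wf"] @ h_prev)   # forget gate
    g = sigmoid(p["bg"] + p["Ug"] @ x + p["Wg"] @ h_prev)   # external input gate
    q = sigmoid(p["bo"] + p["Uo"] @ x + p["Wo"] @ h_prev)   # output gate
    new = sigmoid(p["b"] + p["U"] @ x + p["W"] @ h_prev)    # new content (sigmoid assumed)
    s = f * s_prev + g * new                                # gated linear self-loop
    h = np.tanh(s) * q                                      # output can be shut off by q
    return h, s

rng = np.random.default_rng(5)
params = init_params(3, 4, rng)
h, s = np.zeros(4), np.zeros(4)
for x in [rng.normal(size=3) for _ in range(5)]:
    h, s = lstm_step(x, h, s, params)
print(h.shape, s.shape)
```

When the forget gate is close to 1 and the input gate close to 0, the state s is copied forward almost unchanged, which is how the cell can keep information over many steps.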
GRU
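The main difference with the LSTM is that a single gating unit simultaneously controls the forgetting
factor and the decision to update the state unit. The update equations are the following:
$$h_i^{(t)} = u_i^{(t-1)} h_i^{(t-1)} + \big(1 - u_i^{(t-1)}\big) \sigma\Big(b_i + \sum_j U_{i,j} x_j^{(t-1)} + \sum_j W_{i,j} r_j^{(t-1)} h_j^{(t-1)}\Big)$$
where u stands for “update” gate and r for “reset” gate. Their value is defined as usual:
$$u_i^{(t)} = \sigma\Big(b_i^u + \sum_j U_{i,j}^u x_j^{(t)} + \sum_j W_{i,j}^u h_j^{(t)}\Big)$$
$$r_i^{(t)} = \sigma\Big(b_i^r + \sum_j U_{i,j}^r x_j^{(t)} + \sum_j W_{i,j}^r h_j^{(t)}\Big)$$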
The reset and update gates can individually “ignore” parts of the state vector. The update gates act
like conditional leaky integrators that can linearly gate any dimension, thus choosing to copy it (at
one extreme of the sigmoid) or completely ignore it (at the other extreme) by replacing it by the new
“target state” value (towards which the leaky integrator wants to converge). The reset gates control
which parts of the state get used to compute the next target state, introducing an additional
nonlinear effect in the relationship between past state and future state.
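A single GRU step with update gate u and reset gate r can be sketched as follows. The parameter names, the random weights and the sigmoid used for the target state are assumptions for illustration (tanh is a common alternative); both gates are computed from the current input and the previous state.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def init_params(n_in, n_hid, rng):
    """Create b, U, W for the target state ("") and for the gates u and r."""
    p = {}
    for gate in ("u", "r", ""):
        p["U" + gate] = rng.normal(scale=0.1, size=(n_hid, n_in))
        p["W" + gate] = rng.normal(scale=0.1, size=(n_hid, n_hid))
        p["b" + gate] = np.zeros(n_hid)
    return p

def gru_step(x, h_prev, p):
    u = sigmoid(p["bu"] + p["Uu"] @ x + p["Wu"] @ h_prev)         # update gate
    r = sigmoid(p["br"] + p["Ur"] @ x + p["Wr"] @ h_prev)         # reset gate
    target = sigmoid(p["b"] + p["U"] @ x + p["W"] @ (r * h_prev)) # new target state
    return u * h_prev + (1.0 - u) * target   # one gate both keeps and replaces

rng = np.random.default_rng(6)
params = init_params(3, 4, rng)
h = np.zeros(4)
for x in [rng.normal(size=3) for _ in range(5)]:
    h = gru_step(x, h, params)
print(h.shape)
```

Setting u near 1 copies the old state, while u near 0 replaces it with the target state, matching the two extremes of the sigmoid described above.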
Case study: BERT
BERT (Bidirectional Encoder Representations from Transformers) stands as a pioneering model in
natural language processing. BERT is a deep learning model in which every output element is
connected to every input element, and the weightings between them are dynamically calculated
based upon their connection. Its unique architecture allows for a deeper understanding of context in
language by considering both preceding and succeeding words. Through pre-training on vast
amounts of text data, BERT learns contextualized word representations, enabling it to grasp nuanced
meanings and relationships within sentences. This bidirectional approach sets BERT apart,
empowering it to excel in various language understanding tasks, from sentiment analysis to question
answering, making it a cornerstone in modern NLP models.
Unlike RNNs, transformers like BERT don't rely on sequential processing of words. Instead, they
process words in parallel and consider the entire context of the sentence bidirectionally, capturing
relationships between words in a more comprehensive way.
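The statement that every output element is connected to every input element with dynamically calculated weightings can be illustrated with a single self-attention step. This is a simplified sketch and not BERT itself: it assumes one attention head, random projection matrices named Wq, Wk and Wv, and random token vectors, whereas BERT stacks many trained multi-head layers.

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """X has one row per token. Returns the outputs and the weight matrix."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[1])          # every token against every token
    scores -= scores.max(axis=1, keepdims=True)     # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)   # each row sums to 1
    return weights @ V, weights

rng = np.random.default_rng(7)
n_tokens, d_model, d_head = 5, 8, 4                 # assumed sizes
X = rng.normal(size=(n_tokens, d_model))
Wq = rng.normal(size=(d_model, d_head))
Wk = rng.normal(size=(d_model, d_head))
Wv = rng.normal(size=(d_model, d_head))

out, weights = self_attention(X, Wq, Wk, Wv)
print(out.shape, weights.shape)
```

All rows are computed at once with matrix products, so no token has to wait for the previous one, and row i of the weight matrix covers tokens both before and after position i.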
Social Media Sentiment Analysis.
Social Media Sentiment Analysis involves mining and analyzing user-generated content on platforms
like Twitter, Facebook, and Instagram to gauge public opinion, emotions, or attitudes towards
specific topics, products, or events. By employing natural language processing and machine learning
techniques, this analysis identifies sentiments—positive, negative, or neutral—in user posts,
comments, or reviews. It helps businesses understand customer feedback, track trends, and make
informed decisions, while also offering insights into public perception and societal trends. This
analysis serves as a valuable tool for companies, marketers, and researchers in understanding and
responding to the dynamic landscape of public opinion on social media.
Recurrent Neural Networks play a pivotal role in Social Media Sentiment Analysis due to their ability
to capture sequential dependencies in text data. Unlike traditional feedforward neural networks,
RNNs excel in understanding context and relationships between words in a sentence, making them
particularly effective in analyzing the nuanced and contextual nature of social media posts. With
their memory of past information, RNNs can retain and utilize historical context, crucial for
understanding sentiment in longer texts or posts with complex structures. Their proficiency in
handling sequential data enables better comprehension of user sentiments, facilitating more
accurate sentiment classification and providing deeper insights into public opinions, emotions, and
trends across various social media platforms.
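A sentiment classifier of this kind reads a post word by word and scores the final hidden state. The sketch below is a toy under loud assumptions: the six-word vocabulary, the embedding matrix E, and the weights U, W and v are all hypothetical and untrained, so the printed probabilities carry no meaning until the parameters are learned from labeled posts.

```python
import numpy as np

vocab = {"<unk>": 0, "i": 1, "love": 2, "hate": 3, "this": 4, "phone": 5}

def sentiment_probability(post, E, U, W, v):
    """Return an (untrained) probability that the post is positive."""
    h = np.zeros(W.shape[0])
    for token in post.lower().split():
        x = E[vocab.get(token, 0)]          # embedding lookup; unknown words map to <unk>
        h = np.tanh(W @ h + U @ x)          # state summarises the post read so far
    return 1.0 / (1.0 + np.exp(-(v @ h)))   # sigmoid on the final state

rng = np.random.default_rng(8)
n_emb, n_hid = 3, 4                         # assumed sizes
E = rng.normal(size=(len(vocab), n_emb))
U = rng.normal(scale=0.5, size=(n_hid, n_emb))
W = rng.normal(scale=0.5, size=(n_hid, n_hid))
v = rng.normal(size=n_hid)

for post in ["I love this phone", "I hate this"]:
    print(post, "->", round(float(sentiment_probability(post, E, U, W, v)), 3))
```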