RNNs and Computational Graphs Explained

The document discusses Recurrent Neural Networks (RNNs) and their ability to handle sequential data through parameter sharing and unfolding computational graphs. It explains the structure and functioning of RNNs, including their internal state updates, the unfolding process, and various design patterns for RNN architectures. Additionally, it covers concepts like teacher forcing, loss functions, and the advantages of using recurrent connections in neural networks.

Uploaded by

jayalakshmikoruprolu2018


Sequence Modeling- Recurrent and Recursive Nets: Unfolding Computational Graphs, Recurrent
Neural Networks, Bidirectional RNNs, Encoder-Decoder Sequence-to-Sequence Architectures, Deep
Recurrent Networks, Recursive Neural Networks, Leaky Units, LSTM.
-------
Recurrent Neural Networks (RNNs) are designed to handle sequential data. Just as convolutional
networks are tailored for processing grid-like data such as images, RNNs are specialized for sequences of
values x(1), …, x(T). They can efficiently scale to long sequences and handle variable-length inputs, making
them practical for tasks involving sequential data.

To move from multi-layer networks to recurrent networks, we utilize a key concept: parameter sharing.
This allows the model to handle sequences of varying lengths and generalize across them. Without
shared parameters, the model couldn't adapt to new sequence lengths or leverage statistical patterns
across different positions in time.

For instance, consider the sentences “I went to Nepal in 2009” and “In 2009, I went to Nepal.” A model
should recognize "2009" as the year regardless of its position. A traditional feedforward network would
need separate parameters for each word position, making it inefficient. In contrast, a recurrent neural
network shares weights across time steps, making it more effective for such tasks.

Convolution: A mathematical operation that takes a multidimensional input and produces a
multidimensional output. It is used to extract features from temporal sequences.

The convolution kernel is a small window that slides over the sequence, and the output of the
convolution operation is a new sequence where each element is a function of the elements in the input
sequence that overlap with the kernel. It can be used to share parameters across time, but it is shallow.

Recurrent neural networks (RNNs): A type of neural network that can model sequences. RNNs
have an internal state that is updated at each time step. The output of the RNN at each time step ‘t’ is a
function of the input at that time step and the internal state.

RNNs can learn long-range dependencies because the internal state of the RNN can store information
about previous time steps. In practice they are difficult to train, largely because gradients propagated
through many time steps tend to vanish or explode. In recurrent networks, the parameters are shared
through a very deep computational graph. This means that the same parameters are used to compute the
output at each time step.

Computational Graphs: A computational graph is a way to formalize the structure of a set of
computations, such as those involved in mapping inputs and parameters to outputs and loss.

We can unfold a recursive or recurrent computation into a computational graph that has a repetitive
structure, typically corresponding to a chain of events. Unfolding this graph results in the sharing of
parameters across a deep network structure.

For example, consider the classical form of a dynamical system: s(t) = f(s(t−1); θ), where s(t) is called the
state of the system. The above Equation is recurrent because the definition of s at time t refers back to
the same definition at time t − 1.

For a finite number of time steps τ, the graph can be unfolded by applying the definition τ − 1 times. For
example, if we unfold the above equation for τ = 3 time steps, we obtain s(3) = f(s(2); θ) = f(f(s(1); θ); θ).

Unfolding the equation by repeatedly applying the definition in this way yields an expression that does not
involve recurrence. s(1) is the initial state, and s(2) is computed from it by applying f. Such an expression
can be represented by a traditional directed acyclic computational graph.

Each node represents the state at some time t and the function f maps the state at t to the state at t + 1.
The same parameters (the same value of θ used to parametrize f) are used for all time steps.
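The unfolding described above can be sketched in a few lines of Python. This is only an illustration: the transition function `f` (a scalar tanh map) and the values of `theta` and `s1` are assumptions chosen for the example, not part of the notes.

```python
import math

def f(s, theta):
    """Hypothetical transition function: one step of the dynamical system."""
    return math.tanh(theta * s)

def unfold(s1, theta, tau):
    """Return [s(1), ..., s(tau)] by applying f exactly tau - 1 times."""
    states = [s1]
    for _ in range(tau - 1):
        states.append(f(states[-1], theta))  # same theta at every step
    return states

theta = 0.5
states = unfold(s1=1.0, theta=theta, tau=3)

# s(3) written out without recurrence, as in the text: f(f(s(1); theta); theta)
s3_explicit = f(f(1.0, theta), theta)
print(states[-1] == s3_explicit)
```

The loop and the explicit nested expression compute the same value, which is the sense in which the recurrent definition and the unfolded acyclic graph are equivalent.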

As another example, let us consider a dynamical system driven by an external signal x(t): s(t) = f(s(t−1),
x(t); θ), where we see that the state now contains information about the whole past sequence.

s(t) depends on s(t−1), which depends on s(t−2), and so on. This chain of dependence means that s(t)
implicitly contains information about all past states and inputs, assuming the system is observable and
the function f is invertible or retains sufficient information.

Recurrent neural networks can be built in many different ways. Much as almost any function can be
considered a feedforward neural network, essentially any function involving recurrence can be
considered a recurrent neural network.

Many recurrent neural networks use the same equation to define the values of their hidden units. To indicate
that the state is the hidden units of the network, we now rewrite the equation s(t) = f(s(t−1), x(t); θ) using
the variable h to represent the state: h(t) = f(h(t−1), x(t); θ).

The above is a recurrent network with no outputs. This recurrent network just processes information
from the input x by incorporating it into the state h that is passed forward through time.

Circuit Diagram (Left): The circuit diagram on the left shows the recurrent network in its folded
form. The input x is processed by the function f along with the previous state h to produce the new
state h. The delay element (black square in the diagram) indicates a delay of a single time step and ensures
that the state h is passed forward to the next time step.

Unfolded Computational Graph (Right): The unfolded computational graph on the right
shows the recurrent network as it processes a sequence of inputs over time. Each node in the graph is
associated with a specific time instance.

At each time step t, the input x(t) and the previous state h(t−1) are processed by the function f to produce
the new state h(t). This process is repeated for each time step, with the state h being passed forward
through time.

This recurrent network processes input information by incorporating it into the internal state h, which is
passed forward through time. The network does not produce any outputs but focuses on updating the
internal state based on the input sequence.
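As a minimal sketch of this output-free recurrence, the NumPy code below folds input sequences of two different lengths into a state of the same size using one shared parameter set. The tanh form of `f`, the parameter tuple `theta = (W, U, b)` and all sizes are assumptions made for illustration.

```python
import numpy as np

def f(h_prev, x, theta):
    """One state update h(t) = f(h(t-1), x(t); theta). The tanh form is an assumption."""
    W, U, b = theta
    return np.tanh(W @ h_prev + U @ x + b)

def run(xs, h0, theta):
    """Fold a whole input sequence into the final state, reusing theta at every step."""
    h = h0
    for x in xs:
        h = f(h, x, theta)
    return h

rng = np.random.default_rng(0)
n_hidden, n_in = 4, 3
theta = (rng.normal(scale=0.5, size=(n_hidden, n_hidden)),  # W: state-to-state
         rng.normal(scale=0.5, size=(n_hidden, n_in)),      # U: input-to-state
         np.zeros(n_hidden))                                # b: bias
h0 = np.zeros(n_hidden)

short_seq = rng.normal(size=(2, n_in))  # 2 time steps
long_seq = rng.normal(size=(7, n_in))   # 7 time steps
h_short = run(short_seq, h0, theta)
h_long = run(long_seq, h0, theta)
print(h_short.shape, h_long.shape)      # both states have size n_hidden
```

Because the model is defined in terms of the transition from one state to the next, nothing in `run` depends on the sequence length.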

We can represent the unfolded recurrence after t steps with a function g(t):
h(t) =g(t) (x(t), x(t−1), x(t−2) , . . . , x(2), x(1)) = f(h(t−1) , x(t); θ)

These notes are prepared using material from "Deep Learning" by Ian Goodfellow, Yoshua Bengio, and Aaron Courville, MIT Press, 2016, and various
resources from the World Wide Web. Please email your valuable suggestions to [Link]@[Link].
The function g(t) takes the whole past sequence (x(t), x(t−1), x(t−2), …, x(2), x(1)) as input and produces the
current state, but the unfolded recurrent structure allows us to factorize g(t) into repeated application of a
function f.

The unfolding process offers two key advantages:

1. The learned model maintains a consistent input size, regardless of the sequence length, as it focuses on
transitions between states rather than the entire history of states.
2. The same transition function f with identical parameters can be used at every time step.

These two factors enable learning a single model f that works across all time steps and sequence lengths,
eliminating the need for separate models g(t) for each time step.

This approach allows the model to generalize to new sequence lengths not seen during training and
reduces the number of training examples needed compared to models without parameter sharing.

Armed with the graph unrolling and parameter sharing we can design a wide variety of recurrent neural
networks. Three important design patterns for recurrent neural networks are:

1. Recurrent networks that produce an output at each time step and have recurrent connections between
hidden units, illustrated in figure 1 below.

• The sequence flow of the diagram begins with an input x at each time step, which is processed
through a weight matrix U to update the hidden state h.
• The black square in the diagram represents a delay element, which holds the hidden state for one time
step before passing it on, ensuring that the hidden state from the current time step is used in the next
time step.
• This delayed hidden state is then passed through a recurrent connection, represented by the weight
matrix W, to produce the next hidden state.
• Simultaneously, the current hidden state is processed through another weight matrix V to generate
the output o, which gives the unnormalized log probabilities.
• The output is compared to the target y using a loss function L, which measures the discrepancy between
the predicted output and the actual target.
• This process is repeated for each time step in the sequence, with the hidden state being updated and
passed forward at each step, allowing the network to handle sequences of variable length.

Unfolded Computational Graph (Right):

• The unfolded computational graph on the right shows the RNN as it processes a sequence of inputs
over time.
• Each node in the graph is associated with a specific time instance.
• At each time step t, the input x(t) is processed by the weight matrix U to update the hidden state h(t).
• The hidden state h(t) is then processed by the weight matrix W to produce the next hidden state h(t+1).
• The hidden state h(t) is also processed by the weight matrix V to produce the output o(t).
• The loss L(t) is computed based on the output o(t) and the target y(t).

2. Recurrent networks that produce an output at each time step and have recurrent connections only
from the output at one time step to the hidden units at the next time step, as shown in below figure 2.

The differences between the two diagrams above highlight the trade-off between simplicity and the ability
to capture complex temporal patterns in sequential data.

• Feedback Mechanism: Diagram 1 has recurrent connections within the hidden layer itself, while
Diagram 2 has feedback from the output to the hidden layer.
• Hidden State Update: Diagram 1 updates the hidden state based on the current input and the previous
hidden state, whereas Diagram 2 updates the hidden state based on the current input and feedback from the
previous output.
• Complexity and Power: Diagram 1 is more complex and more powerful in capturing long-term
dependencies, while Diagram 2 is simpler and potentially easier to train but less powerful.

3. Recurrent networks with recurrent connections between hidden units, that read an entire sequence
and then produce a single output, illustrated in figure 3 below.

1. The network processes a sequence of inputs x(1), …, x(t−1), x(t), …, x(τ) over time.
2. At each time step t, the input x(t) is processed by the weight matrix U to update the hidden state h(t).
3. The hidden state h(t) is then passed through the recurrent connection, represented by the weight
matrix W, to produce the next hidden state h(t+1).
4. At the final time step τ, the hidden state h(τ) is processed by the weight matrix V to produce the
output o(τ).
5. The output o(τ) is compared to the target y(τ) to compute the loss L(τ).
• The gradient on the output o(τ) can be obtained by backpropagating the loss through the network. This
gradient is used to update the weights of the network during training.
• The backpropagation process involves computing the gradients of the loss with respect to the
weights U, W, and V, and then updating the weights using an optimization algorithm such as
stochastic gradient descent.

An RNN with hidden-to-hidden connections and the update a(t) = b + Wh(t−1) + Ux(t) is universal, i.e., any
function computable by a Turing machine can be computed by such a recurrent network of finite size.

The output can be read from the RNN after a number of time steps that is asymptotically linear in the
number of time steps used by the Turing machine and asymptotically linear in the length of the input.

The above figure does not specify the choice of activation function for the hidden units. Here we assume
the hyperbolic tangent activation function, i.e., h(t) = tanh(a(t)).

a(t) is an intermediate value calculated at each time step t. It represents the linear transformation of the
input and the previous hidden state.

The figure does not specify exactly what form the output and loss function take. Here we assume that the
output is discrete, as if the RNN is used to predict words or characters. A natural way to represent
discrete variables is to regard the output o as giving the unnormalized log probabilities of each possible
value of the discrete variable.

We can then apply the softmax operation as a post-processing step to obtain a vector ŷ of normalized
probabilities over the output.

Forward propagation begins with a specification of the initial state h(0). Then, for each time step from t =
1 to t = τ (where τ is the final time step, i.e., the length of the sequence), we apply the following update
equations:

a(t) = b + Wh(t−1) + Ux(t)
h(t) = tanh(a(t))
o(t) = c + Vh(t)
ŷ(t) = softmax(o(t))

where
- a(t) : Intermediate affine transformation at time step t.
- o(t) : Unnormalized log probabilities of the output at time step t.
- b : Bias vector for the hidden layer.
- c : Bias vector for the output layer.
- h(t−1) : Hidden state from the previous time step.
- x(t) : Input at the current time step.
- U : Weight matrix for the input-to-hidden connections.
- V : Weight matrix for the hidden-to-output connections.
- W : Weight matrix for the hidden-to-hidden connections.

These equations describe how the RNN processes sequential data and produces discrete predictions at
each time step. The use of the softmax function ensures that the output probabilities sum to 1, making it
suitable for classification tasks.
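The update equations can be written almost line for line in NumPy. The sketch below is illustrative only: the layer sizes, the random initialization and the integer encoding of the targets are assumptions, and the total loss is taken to be the sum over time steps of the negative log-likelihood of the target class.

```python
import numpy as np

def softmax(o):
    """Numerically stable softmax over a vector of unnormalized log probabilities."""
    e = np.exp(o - np.max(o))
    return e / e.sum()

def rnn_forward(xs, ys, U, V, W, b, c, h0):
    """Apply the update equations for t = 1..tau; ys holds integer target classes."""
    h, loss = h0, 0.0
    hs, y_hats = [], []
    for x, y in zip(xs, ys):
        a = b + W @ h + U @ x        # a(t) = b + W h(t-1) + U x(t)
        h = np.tanh(a)               # h(t) = tanh(a(t))
        o = c + V @ h                # o(t) = c + V h(t)
        y_hat = softmax(o)           # y_hat(t) = softmax(o(t))
        loss += -np.log(y_hat[y])    # negative log-likelihood of the target
        hs.append(h)
        y_hats.append(y_hat)
    return np.array(hs), np.array(y_hats), loss

rng = np.random.default_rng(0)
n_in, n_hidden, n_out, tau = 3, 5, 4, 6   # assumed sizes
U = rng.normal(scale=0.3, size=(n_hidden, n_in))
W = rng.normal(scale=0.3, size=(n_hidden, n_hidden))
V = rng.normal(scale=0.3, size=(n_out, n_hidden))
b, c = np.zeros(n_hidden), np.zeros(n_out)
xs = rng.normal(size=(tau, n_in))
ys = rng.integers(0, n_out, size=tau)

hs, y_hats, loss = rnn_forward(xs, ys, U, V, W, b, c, h0=np.zeros(n_hidden))
print(y_hats.sum(axis=1))  # each row of probabilities sums to 1
```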

Teacher Forcing and Networks with Output Recurrence: Consider the network with recurrent
connections only from the output at one time step to the hidden units at the next time step:
• This network is strictly less powerful because it lacks hidden-to-hidden recurrent connections.
• This network cannot simulate a universal Turing machine.
• This network requires that the output units capture all of the information about the past that the
network will use to predict the future.

The advantage of eliminating hidden-to-hidden recurrence is that, for any loss function based on
comparing the prediction at time t to the training target at time t, all the time steps are decoupled.

Training can thus be parallelized, with the gradient for each step t computed in isolation. There is no
need to compute the output for the previous time step first, because the training set provides the ideal
value of that output.

Models that have recurrent connections from their outputs leading back into the model may be trained
with teacher forcing.

Teacher forcing is a procedure that emerges from the maximum likelihood criterion, in which during
training the model receives the ground truth output y(t) as input at time t + 1.

We can see this by examining a sequence with two time steps. The conditional maximum likelihood
criterion is log p(y(1), y(2) | x(1), x(2)) = log p(y(2) | y(1), x(1), x(2)) + log p(y(1) | x(1), x(2)).

At time t = 2, the model is trained to maximize the conditional probability of y(2) given both the x sequence
so far and the previous y value from the training set. In other words, the ground-truth y(1) is supplied as an
input (teacher forcing) alongside the x values, rather than the model's own prediction.

Maximum likelihood specifies that during training, rather than feeding the model’s own output back into
itself, these connections should be fed with the target values specifying what the correct output should
be. This is illustrated in figure below.

Teacher forcing is a training technique applicable to RNNs that have connections from the output at one
time step to the hidden state at the next time step.

• Train time: We feed the correct output y(t) (from the teacher), drawn from the training set, as input to
h(t+1).
• Test time: The true output is not known. We approximate the correct output y(t) with the model's
output o(t) and feed that output back into the model.
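The difference between the two modes can be sketched as follows. This is an assumed toy setup, not the notes' own implementation: the network has output-to-hidden recurrence only, the previous output is fed back as a one-hot vector, free-running mode feeds back the arg-max class, and all parameter names in the dictionary `p` are hypothetical.

```python
import numpy as np

def softmax(o):
    e = np.exp(o - np.max(o))
    return e / e.sum()

def one_hot(k, n):
    v = np.zeros(n)
    v[k] = 1.0
    return v

def step(x, prev_out, p):
    """One step of an output-recurrent network: the hidden units see x(t) and the previous output."""
    h = np.tanh(p["b"] + p["W"] @ prev_out + p["U"] @ x)
    return softmax(p["c"] + p["V"] @ h)

def run(xs, p, targets=None):
    """With targets: teacher forcing. Without: free running on the model's own outputs."""
    n_out = p["c"].size
    prev = np.zeros(n_out)                       # no previous output at t = 1
    preds = []
    for t, x in enumerate(xs):
        y_hat = step(x, prev, p)
        preds.append(y_hat)
        if targets is not None:
            prev = one_hot(targets[t], n_out)    # feed the ground truth y(t)
        else:
            prev = one_hot(int(np.argmax(y_hat)), n_out)  # feed the model's own guess
    return np.array(preds)

rng = np.random.default_rng(0)
n_in, n_h, n_out, tau = 3, 4, 5, 6               # assumed sizes
p = {"U": rng.normal(scale=0.5, size=(n_h, n_in)),
     "W": rng.normal(scale=0.5, size=(n_h, n_out)),   # output-to-hidden weights
     "V": rng.normal(scale=0.5, size=(n_out, n_h)),
     "b": np.zeros(n_h), "c": np.zeros(n_out)}
xs = rng.normal(size=(tau, n_in))
targets = rng.integers(0, n_out, size=tau)

forced = run(xs, p, targets)   # training-time behaviour
free = run(xs, p)              # test-time behaviour

# With teacher forcing, step t needs only x(t) and y(t-1), so it can be computed in isolation.
isolated = step(xs[3], one_hot(targets[2], n_out), p)
print(np.allclose(isolated, forced[3]))
```

The last check illustrates why training decouples across time steps: under teacher forcing, no earlier model output is needed to compute the prediction at a given step.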

Disadvantage of Teacher Forcing: A problem arises if the network is later used in an open-loop mode, with
network outputs (or samples from the output distribution) fed back as inputs. In this case, the kind of inputs
that the network sees during training could be quite different from the kind it sees at test time. There are
two ways to mitigate this:

1. Train with both teacher-forced inputs and free-running inputs, e.g., by predicting the correct target a
number of steps in the future through the unfolded recurrent output-to-input paths. The network can thus
learn to take into account input conditions not seen during training.

2. Mitigate the gap between the inputs seen at training time and at test time by randomly choosing to use
either generated values or actual data values as input. This approach exploits a curriculum learning strategy
to gradually use more of the generated values as input.

Computing the Gradient in an RNN: Computing the gradient through a recurrent neural
network is straightforward.
1. We apply the generalized back-propagation algorithm to the unrolled computational graph. No
specialized algorithms are necessary. Applied to the unrolled graph, this procedure is called
back-propagation through time (BPTT).
2. Gradients obtained by back-propagation may then be used with any general-purpose gradient-based
technique to train an RNN.

1. General Backpropagation to compute gradient: The algorithm below outlines the skeleton of the
back-propagation algorithm, focusing on the setup and cleanup steps. The core computational work, the
gradient calculation itself, is delegated to a subroutine named build_grad. We are computing the gradients
of a variable z with respect to the variables in T, within the computational graph G.

Require: T, the target set of variables whose gradients must be computed.
Require: G, the computational graph
Require: z, the variable to be differentiated
Let G′ be the pruned version of G, which contains only nodes that are ancestors of z and descendants of
nodes in T. This step ensures that only the relevant parts of the computational graph are considered for
gradient computation, improving efficiency.
Initialize grad_table, a data structure associating tensors/variables to their gradients
grad_table[z] ← 1 // the gradient of z with respect to itself is 1, since dz/dz = 1
for V in T do // V is a variable whose gradient is being computed
    build_grad(V, G, G′, grad_table)
end for
Return grad_table restricted to T

2. Build-grad function of Generalized Backprop: The inner-loop subroutine build_grad(V, G, G′,
grad_table) of the back-propagation algorithm, called by the algorithm discussed above,
is shown below.

Require: V, the variable whose gradient should be added to G and grad_table.
Require: G, the graph to modify.
Require: G′, the pruned version of G with only the nodes that participate in the gradient.
Require: grad_table, a data structure mapping nodes to their gradients
if V is in grad_table then
    Return grad_table[V]
end if
i ← 1
for C in get_consumers(V, G′) do // consumers of V: the nodes in G′ that take V as an input
    op ← get_operation(C)
    D ← build_grad(C, G, G′, grad_table) // gradient of z w.r.t. C's output
    G(i) ← op.bprop(get_inputs(C, G′), V, D) // local gradient contribution, computed by the operation's back-propagation method (op.bprop)
    i ← i + 1
end for
G ← Σi G(i) // here G(i) and their sum G denote gradients of z w.r.t. V, not the graph G
grad_table[V] = G // store the gradient G (the sum over consumers) in grad_table for future use
Insert the gradient G and the operations creating it into the graph G
Return the gradient G
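To make the recursion concrete, here is a small Python sketch of the same idea on scalar nodes. It is a simplification under stated assumptions: gradients are computed as numbers rather than inserted into the graph as new symbolic operations, the pruning step is skipped because the example graph is tiny, and the `Node`, `Add`, `Mul` and `apply` names are hypothetical helpers introduced only for this example.

```python
class Node:
    """A scalar node: a leaf holding a value, or the output of an operation."""
    def __init__(self, op=None, inputs=(), value=None):
        self.op, self.inputs, self.value = op, tuple(inputs), value

class Add:
    @staticmethod
    def f(a, b):
        return a + b
    @staticmethod
    def bprop(inputs, wrt, D):
        # d(a + b)/d(wrt) is 1 for each position where wrt appears among the inputs
        return D * sum(1.0 for i in inputs if i is wrt)

class Mul:
    @staticmethod
    def f(a, b):
        return a * b
    @staticmethod
    def bprop(inputs, wrt, D):
        a, b = inputs
        local = 0.0
        if a is wrt:
            local += b.value   # d(a * b)/da = b
        if b is wrt:
            local += a.value   # d(a * b)/db = a
        return D * local

def apply(op, *inputs):
    """Create the node for op(inputs) and compute its value."""
    return Node(op, inputs, op.f(*(i.value for i in inputs)))

def get_consumers(V, graph):
    """Nodes in the graph that take V as an input."""
    return [n for n in graph if any(i is V for i in n.inputs)]

def build_grad(V, graph, grad_table):
    if V in grad_table:                           # already computed: reuse it
        return grad_table[V]
    total = 0.0
    for C in get_consumers(V, graph):
        D = build_grad(C, graph, grad_table)      # gradient of z w.r.t. C
        total += C.op.bprop(C.inputs, V, D)       # C's contribution to dz/dV
    grad_table[V] = total
    return total

def backprop(targets, graph, z):
    grad_table = {z: 1.0}                         # dz/dz = 1
    for V in targets:
        build_grad(V, graph, grad_table)
    return [grad_table[V] for V in targets]

x, y = Node(value=3.0), Node(value=4.0)
prod = apply(Mul, x, y)        # prod = x * y
z = apply(Add, prod, x)        # z = x * y + x
dz_dx, dz_dy = backprop([x, y], [x, y, prod, z], z)
print(dz_dx, dz_dy)            # 5.0 3.0, i.e. y + 1 and x
```

The `grad_table` lookup is what keeps the cost down: the gradient with respect to `prod` is computed once and reused when `y` is processed.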

To gain intuition for how the BPTT algorithm behaves, we now work through how to compute gradients by
BPTT for the RNN equations given earlier:

o(t) = c + Vh(t)
h(t) = tanh(a(t))
a(t) = b + Wh(t−1) + Ux(t)
ŷ(t) = softmax(o(t))

The nodes of our computational graph include the parameters U , V , W , b and c as well as the sequence
of nodes indexed by t for x(t), h(t), o(t) and L(t).

For each node N we need to compute the gradient ∇_N L recursively, based on the gradient computed at
nodes that follow it in the graph. We start the recursion with the nodes immediately preceding the final
loss: ∂L/∂L(t) = 1.

We assume that the outputs o(t) are used as the argument to the softmax function to obtain the vector ŷ of
probabilities over the output. We also assume that the loss is the negative log-likelihood of the true target
y(t) given the input so far. The gradient ∇o(t)L on the outputs at time step t, for all i, t, is as follows:

(∇o(t)L)_i = ∂L/∂o_i(t) = (∂L/∂L(t)) (∂L(t)/∂o_i(t)) = ŷ_i(t) − 1[i = y(t)]

where 1[i = y(t)] is 1 when i is the target class and 0 otherwise.

We work our way backwards, starting from the end of the sequence. At the final time step τ, h(τ) only
has o(τ) as a descendant, so its gradient is simple: ∇h(τ)L = Vᵀ ∇o(τ)L
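A compact NumPy sketch of BPTT for these equations is given below, together with a finite-difference check on one weight. The notes state only the output gradient and the final-step hidden gradient; the recursion for earlier steps used here (the gradient on h(t) also receives Wᵀ times the gradient on a(t+1)) is the standard continuation and is flagged in the comments. Sizes, initialization and the dictionary `p` are assumptions.

```python
import numpy as np

def softmax(o):
    e = np.exp(o - np.max(o))
    return e / e.sum()

def forward(xs, ys, p, h0):
    h, hs, y_hats, loss = h0, [], [], 0.0
    for x, y in zip(xs, ys):
        h = np.tanh(p["b"] + p["W"] @ h + p["U"] @ x)
        y_hat = softmax(p["c"] + p["V"] @ h)
        loss -= np.log(y_hat[y])                 # negative log-likelihood
        hs.append(h)
        y_hats.append(y_hat)
    return hs, y_hats, loss

def bptt(xs, ys, p, h0):
    hs, y_hats, _ = forward(xs, ys, p, h0)
    grads = {k: np.zeros_like(v) for k, v in p.items()}
    dh_next = np.zeros_like(h0)                  # nothing flows in from beyond t = tau
    for t in reversed(range(len(xs))):
        do = y_hats[t].copy()
        do[ys[t]] -= 1.0                         # grad on o(t): y_hat - one_hot(y)
        dh = p["V"].T @ do + dh_next             # at t = tau this is just V^T grad_o
        da = (1.0 - hs[t] ** 2) * dh             # back through tanh
        h_prev = hs[t - 1] if t > 0 else h0
        grads["c"] += do
        grads["V"] += np.outer(do, hs[t])
        grads["b"] += da
        grads["W"] += np.outer(da, h_prev)
        grads["U"] += np.outer(da, xs[t])
        dh_next = p["W"].T @ da                  # standard recursion to step t - 1
    return grads

rng = np.random.default_rng(0)
n_in, n_h, n_out, tau = 3, 4, 5, 6               # assumed sizes
p = {"U": rng.normal(scale=0.3, size=(n_h, n_in)),
     "W": rng.normal(scale=0.3, size=(n_h, n_h)),
     "V": rng.normal(scale=0.3, size=(n_out, n_h)),
     "b": np.zeros(n_h), "c": np.zeros(n_out)}
xs = rng.normal(size=(tau, n_in))
ys = rng.integers(0, n_out, size=tau)
h0 = np.zeros(n_h)
grads = bptt(xs, ys, p, h0)

# Central finite-difference check on a single entry of W.
eps = 1e-6
p["W"][0, 1] += eps
loss_plus = forward(xs, ys, p, h0)[2]
p["W"][0, 1] -= 2 * eps
loss_minus = forward(xs, ys, p, h0)[2]
p["W"][0, 1] += eps                              # restore the original value
numeric = (loss_plus - loss_minus) / (2 * eps)
print(abs(numeric - grads["W"][0, 1]))           # should be very small
```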

Recurrent Networks as Directed Graphical Models: In the recurrent networks discussed so far, we have used
cross-entropy losses L(t) to compare the network's outputs o(t) with the training targets y(t).

In recurrent neural networks (RNNs), we compute losses at each time step, similar to feedforward
networks. The output of an RNN is often treated as a probability distribution, and cross-entropy is naturally
used to define the loss.

The choice of loss function depends on the task:

1. Cross-Entropy Loss (most common for classification):

- Used when the output is a probability distribution (e.g., predicting the next word in a sentence).
- Measures how well the predicted probabilities match the true labels.

2. Mean Squared Error (MSE) Loss (for regression tasks and time-series forecasting):
- Used when predicting continuous values (e.g., stock prices).
- Equivalent, up to a constant offset and scale factor, to the cross-entropy (negative log-likelihood) loss if
we assume the output follows a Gaussian distribution with fixed variance.

The loss function helps the RNN learn by penalizing incorrect predictions, just like in standard neural
networks.
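The link between MSE and the Gaussian negative log-likelihood can be checked numerically. In the sketch below, the targets and the two candidate predictions are made-up numbers and a unit-variance Gaussian is assumed; the point is only that the two losses differ by a constant offset and a factor of 1/2, so they rank predictions identically.

```python
import numpy as np

def gaussian_nll(y, mu, sigma=1.0):
    """Negative log-density of y under a Gaussian with mean mu and std sigma."""
    return 0.5 * np.log(2 * np.pi * sigma ** 2) + (y - mu) ** 2 / (2 * sigma ** 2)

def mse(y, mu):
    return np.mean((y - mu) ** 2)

def mean_nll(y, mu):
    return np.mean(gaussian_nll(y, mu))

y = np.array([1.0, 2.0, 3.0])        # made-up targets
mu_a = np.array([1.5, 2.0, 2.0])     # made-up predictions, model A
mu_b = np.array([0.0, 2.5, 4.0])     # made-up predictions, model B

# The constant term cancels in the difference, leaving half the MSE difference.
diff_nll = mean_nll(y, mu_a) - mean_nll(y, mu_b)
diff_mse = mse(y, mu_a) - mse(y, mu_b)
print(np.isclose(diff_nll, 0.5 * diff_mse))
```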

When we use a predictive log-likelihood training objective, we train the Recurrent Neural Network (RNN) to
estimate the conditional distribution of the next element in the sequence, y(t), given the past inputs (and
optionally past outputs). This involves maximizing the log-likelihood of the observed data.

If the model does not include connections from previous outputs, we maximize: log p(y(t)|x(1),…,x(t))

If the model includes connections from the output at one time step to the next, we maximize:

log p(y(t)|x(1),…,x(t),y(1),…,y(t−1))

Decomposing the joint probability over the sequence of y values as a series of one-step probabilistic
predictions is a method to capture the full joint distribution across the whole sequence. This approach
allows us to model the dependencies between sequence elements effectively.

When we do not feed past y values as inputs that condition the next step prediction, the
outputs y are conditionally independent given the sequence of x values. This means each y(t) depends
only on the input sequence x and not on previous y values.

When we do feed the actual y values (not their prediction, but the actual observed or generated values)
back into the network, the directed graphical model contains edges from all y(i) values in the past to the
current y(t) value. This setup captures dependencies not only on the input sequence x but also on the
history of y values, making the model more expressive but also more complex.

In a fully connected graphical model, every past observation y(i) can influence future outputs y(t)
(where t > i).

This leads to inefficiency: the number of inputs and parameters grows with the sequence
length (e.g., y(t) depends on all of y(1), …, y(t−1)).

It also leads to a computational explosion, as each step requires more memory and computation than
the previous one.

RNNs obtain the same full connectivity with an efficient parameterization by introducing the state variable
into the probabilistic graphical model (PGM) of the RNN, as shown in the figure below.

• In this diagram, RNNs introduce a state variable h(t) at each time step, which acts as a
"memory" summarizing past information.
• Even though h(t) is deterministic (computed from inputs and previous state), it simplifies the graphical
model by compressing all past inputs into a fixed-size vector.
• Parameter Sharing: The model is efficient because the same function (with the same parameters)
computes h(t) and y(t) at every time step.
  - Example: h(t) = f(x(t), h(t−1); θ), where θ is reused across all t (x(t) is the input vector at time
step t).
• Fixed Input Size: Unlike the previous diagram (where each y(t) depends on all past values), in this
model, thanks to the state variable, y(t) depends only on x(t) and h(t−1).

Bidirectional RNNs: Traditional RNNs have a "causal" structure, meaning that the state at any given
time t only captures information from past inputs x(1), …, x(t−1) and the current input x(t). This limits
their ability to use future context when making predictions.

In many applications, such as speech recognition or handwriting recognition, the correct interpretation of
the current input may depend on future inputs. If there are two interpretations of the current word that
are both acoustically plausible, we may have to look far into the future (and the past) to disambiguate
them. In speech recognition, the sound "tee" might be part of the word "tea" or "steam"; you need to
hear the next few sounds/words to decide.

For example, understanding a word might require context from subsequent words or sounds:
• "Dhaval loves Apple, it keeps him healthy" (the fruit)
• "Dhaval loves Apple, the company produces the best electronics" (the company)
• "Bank" could mean:
  - Financial institution (if followed by "account")
  - River side (if followed by "of the river")

Bidirectional RNNs address this limitation by combining two RNNs: one that processes the sequence in
the forward direction (from start to end) and another that processes it in the backward direction (from end
to start).
At each time step t, the bidirectional RNN has two states:
- h(t) : The hidden state of the forward-moving RNN.
- g(t) : The hidden state of the backward-moving RNN.

The output units o(t) can compute a representation that depends on both past and future inputs, making
them sensitive to the entire sequence without needing a fixed-size window.

Advantages of Bidirectional RNNs:

• Contextual Awareness: Bidirectional RNNs can capture dependencies from both past and future
inputs, making them more effective for tasks where context from the entire sequence is important.
• Flexibility: They do not require a fixed-size look-ahead buffer/filters, unlike feedforward or
convolutional networks, making them more flexible for sequence-to-sequence learning tasks.
• Extension to 2D Inputs: The concept can be extended to two-dimensional inputs like images
by using four RNNs, each processing the image in one of the four directions: up, down, left, and
right. This allows capturing local and long-range dependencies in the image data.
• Comparison with Convolutional Networks: While RNNs applied to images are typically more
computationally expensive than convolutional networks, they allow for long-range pixel relationships
(e.g., global shapes), whereas CNNs are more efficient for local patterns (e.g., edges).

For a BRNN, the output at time t combines both RNNs:

o(t) = Combine(h(t), g(t))
where:
• h(t) = ForwardRNN(x(1), ..., x(t)) (at step t, h(t) summarizes information from x(1) to x(t))
• g(t) = BackwardRNN(x(t), ..., x(T)) (g(t) summarizes information from x(T) back to x(t), where T is the
sequence length).

Time Steps: [1] [2] [3] ... [T]

Forward RNN: h(1) → h(2) → h(3) → ... → h(T)
Backward RNN: g(1) ← g(2) ← g(3) ← ... ← g(T)
Output: o(1) o(2) o(3) ... o(T)
(h+g) (h+g) (h+g) (h+g)

1. Time Steps (top row): Shows the sequence positions from 1 to T (the total length).
Example: For the sentence "I love AI", the positions would be: [1]="I", [2]="love", [3]="AI"

2. Forward RNN: Processes data from left to right (past → future). Each h(t) depends on the current input
x(t) and the previous hidden state h(t−1).
Example for "I love Deep Learning" (treating "Deep Learning" as the single token at position 3):
- h(1) knows only "I"
- h(2) knows "I" + "love"
- h(3) knows "I" + "love" + "Deep Learning"

3. Backward RNN: Processes data from right to left (future → past). Each g(t) depends on the current input
x(t) and the next hidden state g(t+1).
Example for "I love Deep Learning":
- g(3) knows only "Deep Learning"
- g(2) knows "Deep Learning" + "love"
- g(1) knows "Deep Learning" + "love" + "I"

4. Output: Each output o(t) combines both h(t) and g(t). The "(h+g)" notation means concatenation or
weighted combination.

Example for word "love" (position 2):

- Uses h(2) ("I" + "love") + g(2) ("Deep Learning" + "love")
- So o(2) understands both previous and next words
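The walkthrough above can be sketched in NumPy as two independent RNNs, one run on the reversed sequence, whose states are concatenated per time step. Concatenation is just one of the combination options mentioned above; the tanh cell, the sizes and the random weights are assumptions for illustration.

```python
import numpy as np

def run_rnn(xs, W, U, b):
    """Plain tanh RNN; returns the state after each input, in the order the inputs were given."""
    h = np.zeros(b.size)
    states = []
    for x in xs:
        h = np.tanh(b + W @ h + U @ x)
        states.append(h)
    return np.array(states)

def brnn(xs, fwd, bwd):
    hs = run_rnn(xs, *fwd)                  # h(t): summary of x(1), ..., x(t)
    gs = run_rnn(xs[::-1], *bwd)[::-1]      # g(t): summary of x(T), ..., x(t), re-aligned to time order
    return np.concatenate([hs, gs], axis=1) # per-step representation [h(t), g(t)]

rng = np.random.default_rng(0)
n_in, n_h, T = 3, 4, 5                      # assumed sizes

def make_params():
    return (rng.normal(scale=0.5, size=(n_h, n_h)),
            rng.normal(scale=0.5, size=(n_h, n_in)),
            np.zeros(n_h))

fwd, bwd = make_params(), make_params()     # two separate parameter sets
xs = rng.normal(size=(T, n_in))
reps = brnn(xs, fwd, bwd)

# Perturb only the last input: h(1) is unaffected, g(1) changes.
xs_changed = xs.copy()
xs_changed[-1] += 1.0
reps_changed = brnn(xs_changed, fwd, bwd)
print(np.allclose(reps[0, :n_h], reps_changed[0, :n_h]),
      np.allclose(reps[0, n_h:], reps_changed[0, n_h:]))
```

The perturbation test mirrors the "love" example: the forward half of the representation at an early position cannot see a later input, while the backward half can.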

Bidirectional RNNs enhance the capability of traditional RNNs by incorporating future context, making
them suitable for tasks where understanding the full sequence context is crucial. They are particularly
useful in applications like speech and handwriting recognition, where future information can
significantly impact the interpretation of current inputs.

Encoder-Decoder Sequence-to-Sequence Architectures: Figure below shows how an RNN can
map an input sequence to a fixed-size vector.

Figure below shows how an RNN can map a fixed-size vector to a sequence.

Now we will learn how an RNN can be trained to map an input sequence to an output sequence which is
not necessarily of the same length.

This comes up in many applications, such as speech recognition, machine translation or question
answering, where the input and output sequences in the training set are generally not of the same length
(although their lengths might be related).

The simplest RNN architecture for mapping a variable-length sequence to another variable-length
sequence is as shown below. This architecture is known as the encoder-decoder or sequence-to-sequence
architecture.

The primary role of the encoder is to process the input sequence (x(1), x(2), …, x(nx)) and
encode it into a fixed-size vector, often referred to as the "context".

The encoder is typically implemented as a Recurrent Neural Network (RNN), such as a
Long Short-Term Memory (LSTM) network or a Gated Recurrent Unit (GRU). These
types of RNNs are well-suited for handling sequential data due to their ability to maintain
a hidden state that captures information over time.

The encoder processes the input sequence one element at a time. At each time step, it
takes an input element (e.g., a word or a token in a sentence) and updates its hidden
state based on this input and the previous hidden state.

After processing the entire input sequence, the encoder's final hidden state is used to generate the context
vector C. The context vector C serves as a condensed representation of the input sequence, capturing the
most relevant information needed to generate the output sequence.

This context vector is typically a simple function of the final hidden state, such as a direct copy or a
linear transformation. It encapsulates the information from the entire input sequence, ensuring that the
decoder has access to a comprehensive summary of the input sequence.

These notes are prepared using material from "Deep Learning" by Ian Goodfellow, Yoshua Bengio, and Aaron Courville, MIT Press, 2016, and various
resources from the World Wide Web. Please email your valuable suggestions to [Link]@[Link].
The final hidden state of the encoder is crucial because it encapsulates the information from the entire
input sequence. By using this state to generate the context vector, the model ensures that the decoder has
access to a comprehensive summary of the input sequence.

A decoder or writer or output RNN is conditioned on that fixed-length vector C to generate the output
sequence Y = (y(1), … , y(ny)). The decoder uses the context vector to produce the output sequence
element by element, leveraging the information encoded by the encoder.
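
The encoder and decoder roles described above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: plain tanh cells instead of LSTM/GRU, random weights, a context C that is a direct copy of the final encoder state, and a decoder that runs for a fixed number of steps. The names `encode`, `decode`, `W_dec` and `n_steps` are hypothetical.

```python
import numpy as np

def encode(xs, U, W):
    """Read the whole input sequence; the final hidden state is the context C."""
    h = np.zeros(W.shape[0])
    for x in xs:
        h = np.tanh(U @ x + W @ h)
    return h  # context vector C (here: a direct copy of the final state)

def decode(C, n_steps, W_dec, V):
    """Generate n_steps outputs, starting from a hidden state initialised with C."""
    h = C
    ys = []
    for _ in range(n_steps):
        h = np.tanh(W_dec @ h)
        ys.append(V @ h)
    return np.stack(ys)

rng = np.random.default_rng(0)
n_in, n_hid, n_out = 3, 5, 2
U = rng.normal(scale=0.5, size=(n_hid, n_in))
W = rng.normal(scale=0.5, size=(n_hid, n_hid))
W_dec = rng.normal(scale=0.5, size=(n_hid, n_hid))
V = rng.normal(scale=0.5, size=(n_out, n_hid))

xs = rng.normal(size=(7, n_in))               # input length n_x = 7
C = encode(xs, U, W)                          # fixed-size, whatever n_x is
ys = decode(C, n_steps=4, W_dec=W_dec, V=V)   # output length n_y = 4
```

The point of the sketch is that the input length (7) and the output length (4) are independent, because all information passes through the fixed-size vector C. A practical decoder would also feed back its previous output and decide for itself when to stop.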

Deep Recurrent Networks: The computation in most RNNs can be decomposed into three
blocks of parameters and associated transformations:
1. from the input to the hidden state (how the current input affects memory),
2. from the previous hidden state to the next hidden state (how memory evolves over time), and
3. from the hidden state to the output (how memory generates predictions).

With the RNN architecture shown in the figure, each of these three blocks is associated with a single
weight matrix.

In other words, when the network is unfolded, each of these corresponds to a shallow transformation.
Each transformation is single-layered (no intermediate hidden layers).

Typically this is a transformation represented by a learned affine transformation followed by a fixed
nonlinearity.

Experimental evidence strongly suggests that we need enough depth in order to perform the required
transformations, so it can be advantageous to introduce depth into each of these blocks.

We have three ways of making an RNN deep.

Hierarchical Hidden State: This subfigure shows a hierarchical organization of the hidden recurrent
state.

The hidden state h(t) is split into hierarchical groups (sub-states) that operate at different temporal
scales. This means that instead of having a single hidden state, the network has multiple layers of
hidden states.

Group 1: Captures short-term dependencies (e.g., phonemes).

Group 2: Captures long-term dependencies (e.g., sentence structure).

By organizing the hidden states hierarchically, the network can capture more complex dependencies and
patterns in the data. This hierarchical structure allows the network to learn representations at different
levels of abstraction.

Benefit: it explicitly models multi-scale temporal patterns.

Real-world example: imagine a newsroom.
- Reporters (lower level): Gather raw facts (short-term).
- Editors (higher level): Combine facts into stories (long-term), then guide reporters on what to
  investigate next (top-down).
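
A hierarchical hidden state can be sketched as two stacked recurrent layers, where the higher group reads the state of the lower group at every time step. This is a minimal NumPy illustration with assumed tanh cells and random weights; the function name `stacked_rnn` and the parameter names are hypothetical, and the top-down guidance from the newsroom analogy is not modelled.

```python
import numpy as np

def stacked_rnn(xs, params):
    """Two recurrent layers: layer 1 reads the input, layer 2 reads layer 1's state."""
    U1, W1, U2, W2 = params
    h1 = np.zeros(W1.shape[0])
    h2 = np.zeros(W2.shape[0])
    top_states = []
    for x in xs:
        h1 = np.tanh(U1 @ x + W1 @ h1)    # lower group of the hidden state
        h2 = np.tanh(U2 @ h1 + W2 @ h2)   # higher group, built on the lower one
        top_states.append(h2)
    return np.stack(top_states)

rng = np.random.default_rng(0)
n_in, n_h1, n_h2, T = 3, 6, 4, 5
params = (
    rng.normal(scale=0.5, size=(n_h1, n_in)),   # U1: input -> lower group
    rng.normal(scale=0.5, size=(n_h1, n_h1)),   # W1: lower group recurrence
    rng.normal(scale=0.5, size=(n_h2, n_h1)),   # U2: lower group -> higher group
    rng.normal(scale=0.5, size=(n_h2, n_h2)),   # W2: higher group recurrence
)
top = stacked_rnn(rng.normal(size=(T, n_in)), params)  # shape (T, n_h2)
```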
Deep Computation in Transitions: This subfigure replaces single-matrix transformations (shallow) with
multi-layer blocks.

Deeper computation, such as a Multi-Layer Perceptron (MLP), is introduced in the input-to-hidden,
hidden-to-hidden, and hidden-to-output parts of the network. This means that each of these transitions
involves multiple layers of computation.

By introducing deeper computation, the network can learn more complex transformations and
representations. However, this lengthens the shortest path linking different time steps: a single-layer
transition links step t to step t + 1 in one nonlinear step, whereas a transition with k layers takes k
steps. Longer paths make optimization harder and long-range dependencies more difficult to learn.

The trade-off is a longer path between distant time steps in exchange for better hierarchical feature
learning.

Risk: vanishing gradients due to the increased depth, which skip connections are designed to mitigate.

Skip Connections: This subfigure shows the use of skip connections to mitigate the path-lengthening
effect.

Skip connections are introduced to create direct links between non-consecutive layers or time steps.
These connections allow information to flow more directly through the network.

Skip connections help mitigate the issue of vanishing gradients and allow the network to learn
dependencies across longer sequences more effectively. By providing shortcuts for the gradient flow,
skip connections enable the network to capture both short-term and long-term dependencies in the data.

This balances depth (for complexity) and trainability (via skip paths).

Real-World Analogies
1. Bookmarking a Page:
   - Without skips: Read every page to find a key detail.
   - With skips: Jump to the bookmarked page instantly.
2. Highway vs. Local Roads:
   - Skips are highways for information.
   - Standard RNN paths are local roads.
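
A deep hidden-to-hidden transition with an optional skip path can be sketched as follows. This is a minimal NumPy illustration, not the exact architecture of the figure: the skip path is assumed to be a simple addition of the previous state, and the names `deep_transition`, `layers` and `skip` are hypothetical.

```python
import numpy as np

def deep_transition(h_prev, x, U, layers, skip=True):
    """Hidden-to-hidden transition made of several tanh layers, with an optional skip path."""
    z = np.tanh(U @ x + layers[0] @ h_prev)   # first layer mixes input and old state
    for W in layers[1:]:
        z = np.tanh(W @ z)                    # further layers deepen the transition
    return z + h_prev if skip else z          # skip path: old state bypasses the deep block

rng = np.random.default_rng(0)
n_in, n_hid = 3, 4
U = rng.normal(scale=0.5, size=(n_hid, n_in))
layers = [rng.normal(scale=0.5, size=(n_hid, n_hid)) for _ in range(3)]
h_prev = rng.normal(size=n_hid)
x = rng.normal(size=n_in)

h_deep = deep_transition(h_prev, x, U, layers, skip=False)
h_skip = deep_transition(h_prev, x, U, layers, skip=True)
```

With the skip path, the previous state reaches the next state through a single addition, no matter how many layers the deep block contains. That is the shortcut the analogies describe.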

Recursive Neural Networks: Recursive neural networks represent yet another generalization of
recurrent networks, with a different kind of computational graph. RNNs process sequences as
a chain (linear order), whereas recursive neural networks generalize to tree structures, where each
node combines its inputs hierarchically.

It is structured as a deep tree, rather than the chain-like structure of RNNs. The typical computational
graph for a recursive network is illustrated in figure.

A variable-size sequence x(1),x(2), . . . , x(t) can be mapped to a fixed-size representation (the output o),
with a fixed set of parameters (the weight matrices U , V , W ).

The figure illustrates a supervised learning case in which some target y is provided which is associated
with the whole sequence.

One clear advantage of recursive nets over recurrent
nets is that for a sequence of the same length τ, the
depth (measured as the number of compositions of
nonlinear operations) can be drastically reduced from τ
to O(log τ ), which might help deal with long-term
dependencies.

Leaf Nodes: Process raw inputs (e.g., word embeddings).
Parent Nodes: Combine children via a shared function.
Root Node: Produces the final output o (e.g., sentence embedding).
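
The leaf, parent and root roles can be sketched as a short recursive function. This is an illustration under assumptions: the shared function is taken to be a tanh of an affine map of the concatenated children (a common choice, not necessarily the one in the figure), the embeddings are random, and the names `compose`, `depth` and `embed` are hypothetical.

```python
import numpy as np

def compose(tree, embed, W, b):
    """Return the vector for a tree given as a word (leaf) or a (left, right) pair."""
    if isinstance(tree, str):
        return embed[tree]                        # leaf node: word embedding
    left, right = tree
    children = np.concatenate([compose(left, embed, W, b),
                               compose(right, embed, W, b)])
    return np.tanh(W @ children + b)              # parent node: shared function

def depth(tree):
    """Number of nonlinear compositions between the deepest leaf and the root."""
    return 0 if isinstance(tree, str) else 1 + max(depth(tree[0]), depth(tree[1]))

rng = np.random.default_rng(0)
d = 4
embed = {w: rng.normal(size=d) for w in ["The", "movie", "was", "great"]}
W = rng.normal(scale=0.5, size=(d, 2 * d))        # same W and b at every parent node
b = np.zeros(d)

balanced = (("The", "movie"), ("was", "great"))   # tree-structured composition
chain = ((("The", "movie"), "was"), "great")      # chain-like composition
o = compose(balanced, embed, W, b)                # root vector (sentence embedding)
```

For these four words the balanced tree has depth 2 while the chain has depth 3, a small instance of the reduction from τ to O(log τ) mentioned above.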

Example: Sentence Sentiment

- Input: "The movie was not good."
- Tree (the sub-phrase that decides the sentiment):

          [not good] (negative)
           /      \
        [not]    [good]

  The remaining words form the sub-tree [The movie was...].
- Output (o): Correctly classifies the sentence as negative by composing "not" + "good."

A complete tree for the input "The movie was great":

               [o] (Output: "positive")
              /     \
      [The movie]   [was great]
       /     \        /     \
    "The"  "movie"  "was"  "great"   (Leaf nodes)

The Challenge of Long-Term Dependencies: Recurrent Neural Networks (RNNs) struggle to learn
relationships between events separated by long time gaps. This arises due to vanishing or
exploding gradients during backpropagation through time (BPTT).

Vanishing gradients are the more common problem. When gradients are backpropagated over many time
steps, they are multiplied repeatedly by Jacobians that contain the recurrent weight matrix. If these
matrices have eigenvalues of magnitude less than 1, gradients shrink exponentially
(e.g., 0.9^100 ≈ 0.00003 ≈ 0).

Exploding gradients are less common but more destructive. If the weight matrices have eigenvalues of
magnitude greater than 1, gradients grow exponentially (e.g., 1.1^100 ≈ 13,780), which may cause
numerical instability (NaN errors) or chaotic parameter updates.
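
The two numerical examples can be checked with a short NumPy sketch. It makes simplifying assumptions: the recurrent matrix is a scaled identity, so every eigenvalue equals the scale, and the derivative of the nonlinearity is ignored. The function name `gradient_norm_after` is hypothetical.

```python
import numpy as np

def gradient_norm_after(W, steps):
    """Norm of a unit-length gradient after being multiplied by W^T `steps` times."""
    g = np.ones(W.shape[0]) / np.sqrt(W.shape[0])   # unit-norm starting gradient
    for _ in range(steps):
        g = W.T @ g
    return np.linalg.norm(g)

shrink = gradient_norm_after(0.9 * np.eye(3), 100)  # all eigenvalues 0.9: vanishes
grow = gradient_norm_after(1.1 * np.eye(3), 100)    # all eigenvalues 1.1: explodes
print(f"{shrink:.2e}  {grow:.2e}")
```

Because the starting gradient has norm 1, the two results equal 0.9^100 and 1.1^100 respectively.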

Leaky Units: To deal with long-term dependencies we have to design a model that operates at
multiple time scales, so that some parts of the model operate at fine-grained time scales and can handle
small details, updating quickly, while other parts operate at coarse time scales and transfer information
slowly from the distant past to the present more efficiently.

Various strategies for building both fine and coarse time scales are possible. These include
1. the addition of skip connections across time,
2. “leaky units” that integrate signals with different time constants, and
3. the removal of some of the connections used to model fine-grained time scales.
1. Adding skip connections through time: One way to obtain coarse time scales is to add direct
connections from variables in the distant past to variables in the present.

In an ordinary recurrent network, a recurrent connection goes from a unit at time t to a unit at time t + 1.
It is possible to construct recurrent networks with longer delays.

As we know, gradients may vanish or explode exponentially with respect to the number of time steps.
We can introduce recurrent connections with a time delay of d to mitigate this problem.

With skip connections of delay d, gradients diminish exponentially as a function of τ / d rather than τ,
because the gradient paths are shorter. Since there are both delayed and single-step connections,
gradients may still explode exponentially in τ. This allows the learning algorithm to capture longer
dependencies, although not all long-term dependencies may be represented well in this way.

Standard RNN: h(1) → h(2) → ... → h(100) # Gradient path: 100 steps
With Skips (d=10): h(1) → h(10) → h(20) → ... → h(100) # Gradient path: 10 steps
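
The effect of the delay d on the shortest gradient path can be put into numbers with a small sketch. It assumes that every hop along the path scales the gradient by the same factor, which is a simplification; the function name `path_factor` and the factor 0.9 are illustrative choices.

```python
def path_factor(per_hop, tau, d=1):
    """Gradient scaling along the shortest path spanning tau time steps
    when each connection jumps d steps at a time."""
    hops = -(-tau // d)          # ceil(tau / d) hops are needed
    return per_hop ** hops

plain = path_factor(0.9, tau=100)           # 100 hops: 0.9 ** 100
skipped = path_factor(0.9, tau=100, d=10)   # 10 hops:  0.9 ** 10
print(plain, skipped)
```

With d = 10 the gradient is scaled by 0.9^10 instead of 0.9^100, so far more of the signal from the distant past survives.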

2. Leaky units and a spectrum of time scales: Rather than an integer skip of d time steps, the effect
can be obtained smoothly by adjusting a real-valued α.

When we accumulate a running average μ(t) of some value v(t) by applying the update
μ(t) ← α μ(t−1) + (1 − α) v(t), the α parameter is an example of a linear self-connection from μ(t−1)
to μ(t).

When α is near one, the running average remembers information about the past for a long time (long
memory, coarse scale), and when α is near zero, information about the past is rapidly discarded, i.e.,
short memory (fine scale).

Hidden units with linear self-connections can behave similarly to such running averages. Such hidden
units are called leaky units.

A product of derivatives close to 1 can be obtained by giving units linear self-connections with a
weight near 1 on those connections.

There are two basic strategies for setting the time constants α used by leaky units. One strategy is to
manually fix them to values that remain constant, for example by sampling their values from some
distribution once at initialization time.

Another strategy is to make the time constants free parameters and learn them. Having such leaky units
at different time scales appears to help with long-term dependencies.
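
The running-average update μ(t) ← α μ(t−1) + (1 − α) v(t) can be tried out directly. The sketch below is a minimal NumPy illustration that feeds a single pulse into two leaky units with different, hand-picked values of α; the function name `leaky_average` and the chosen values are assumptions made for the example.

```python
import numpy as np

def leaky_average(v, alpha):
    """Apply mu(t) = alpha * mu(t-1) + (1 - alpha) * v(t), starting from mu = 0."""
    mu = 0.0
    out = []
    for v_t in v:
        mu = alpha * mu + (1.0 - alpha) * v_t
        out.append(mu)
    return np.array(out)

# A single pulse at t = 0 followed by silence.
v = np.zeros(20)
v[0] = 1.0
slow = leaky_average(v, alpha=0.95)   # long memory: the pulse fades slowly
fast = leaky_average(v, alpha=0.10)   # short memory: the pulse is forgotten quickly
```

After the pulse, each trace decays as (1 − α) α^t, so the unit with α near one still carries a clear trace of the pulse at t = 19, while the unit with α near zero has essentially forgotten it.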

3. Removing Connections: Another approach to handling long-term dependencies is to organize the state
of the RNN at multiple time scales, with information flowing more easily through long distances at the
slower time scales.

This idea differs from skip connections through time because it involves actively removing length-one
connections and replacing them with longer connections. Units modified in such a way are forced to
operate on a long time scale. Skip connections through time, by contrast, only add edges: units
receiving such new connections may learn to operate on a long time scale but may also choose to focus
on their other short-term connections.

Summary of what we have learned till now

RNNs have loops. A chunk of neural network A looks at some input xt and outputs a value ht. A loop
allows information to be passed from one step of the network to the next.

Figure: An unrolled RNN. The chain-like structure reveals that RNNs are intimately related to sequences
and lists.

Figure: Application to Part of Speech (POS) tagging.

Different Types of RNNs: RNNs are mainly used for:
- Sequence Classification: Sentiment & Video Classification
- Sequence Labelling: POS (Part-of-Speech) & NE (Named Entity) Tagging
- Sequence Generation: MT (Machine Translation) & Transliteration

The figure shows five input/output patterns:
1. Vanilla mode of processing without RNN, from fixed-sized input to fixed-sized output (e.g. image
   classification).
2. Image captioning: takes an image and outputs a sentence.
3. Sentiment analysis: a sentence is classified as positive or negative.
4. Machine translation: seq of words -> seq of words.
5. Synced sequence input and output (e.g. video classification: label each frame of the video).

Each rectangle is a vector and arrows represent functions (e.g. matrix multiplication).

RNNs often fail to retain distant information needed for tasks like language modeling, even though they
technically have access to the entire sequence. Hidden states tend to focus on local context (recent
inputs) and lose critical long-range dependencies.

Example: Language Modeling

Sentence: The flights the airline was canceling were full.
- Easy: Predicting was after airline (singular agreement).
- Hard: Predicting were (plural) because:
- The plural flights is distant.
- The closer airline (singular) dominates the hidden state.

There are two Key Reasons for Failure

1. Dual Task Conflict:
- Hidden states must both:
- Make current predictions (e.g., was after airline).
- Preserve future-relevant info (e.g., plural flights).
- This overload causes distant info to be overwritten.

2. Vanishing Gradients:
- During backpropagation, gradients are multiplied repeatedly by weights over many time steps.
- Result: Gradients shrink to near-zero (vanish) for distant words like flights, preventing learning.

To fix these issues, Long Short-Term Memory (LSTM) networks were designed. They:
1. Explicitly manage memory via "gates" that:
- Remember critical info (e.g., flights is plural).
- Forget irrelevant info (e.g., intermediate words).
2. Avoid vanishing gradients by using additive updates (not multiplicative).

The image depicts a block diagram of a Long Short-Term Memory (LSTM) cell, which is a type of
recurrent neural network (RNN) architecture designed to overcome the limitations of traditional RNNs,
particularly in learning long-term dependencies.

Here's a detailed explanation of the components and flow within an LSTM cell:

Components of LSTM:

1. Core Components of an LSTM Cell:

Each LSTM cell consists of:
- State Unit ($s_i^{(t)}$): The "memory" of the cell, updated over time.
- Gates (controlled by sigmoid units σ):
  1. Forget Gate ($f_i^{(t)}$) → Decides what to discard from the state.
  2. Input Gate ($g_i^{(t)}$) → Decides what new information to store.
  3. Output Gate ($q_i^{(t)}$) → Decides what to output to the next layer.
- Self-loop (Internal Recurrence): Unlike traditional RNNs, LSTMs have a linear self-loop (controlled
  by $f_i^{(t)}$) that allows gradients to flow nearly unchanged over time when the gate is close to 1,
  mitigating vanishing gradients.

1. Input: The input to the LSTM cell at a given time step is denoted by the arrow labeled "input."
This input is typically a feature vector representing the data at that time step.

2. Gates: LSTM cells use three types of gates to regulate the flow of information:

- Forget Gate ($f_i^{(t)}$): The forget gate applies a sigmoid function to the current input and the
previous hidden state to determine which parts of the previous cell state $s_i^{(t-1)}$ should be retained
and which should be forgotten. The gate value is applied to the state through pointwise multiplication.
The self-loop weight (or the associated time constant) is controlled by a forget gate unit $f_i^{(t)}$ (for
time step t and cell i), which sets this weight to a value between 0 and 1 via a sigmoid unit:

$$f_i^{(t)} = \sigma\Big(b_i^f + \sum_j U_{i,j}^f x_j^{(t)} + \sum_j W_{i,j}^f h_j^{(t-1)}\Big)$$

- Inputs: Current input x(t) and previous hidden state h(t−1).
- Sigmoid (σ): Squashes its argument to a value between 0 (forget) and 1 (retain).
- $b_i^f$: Bias term for the forget gate.
- $U_{i,j}^f$: Weights for the current input x(t).
- $W_{i,j}^f$: Weights for the previous hidden state h(t−1).
- x(t), h(t−1): External input and hidden state vectors.

- Input Gate ($g_i^{(t)}$): The input gate uses a sigmoid unit to decide which values from the input
should be used to update the cell state. The transformed input is multiplied pointwise by the gating
values produced by the input gate.

The external input gate unit is computed similarly to the forget gate, with a sigmoid unit that yields a
gating value between 0 and 1, but with its own parameters:

$$g_i^{(t)} = \sigma\Big(b_i^g + \sum_j U_{i,j}^g x_j^{(t)} + \sum_j W_{i,j}^g h_j^{(t-1)}\Big)$$

This has the same structure as the forget gate, but with independent parameters ($b^g$, $U^g$, $W^g$).

- Output Gate ($q_i^{(t)}$): The output gate uses a sigmoid function to decide which parts of the cell
state should be output. The output is then computed by multiplying the tanh-squashed cell state
pointwise by the output gate's values:

$$h_i^{(t)} = \tanh\big(s_i^{(t)}\big)\, q_i^{(t)}, \qquad q_i^{(t)} = \sigma\Big(b_i^o + \sum_j U_{i,j}^o x_j^{(t)} + \sum_j W_{i,j}^o h_j^{(t-1)}\Big)$$

- Output ($h_i^{(t)}$): A filtered version of the cell state ($s_i^{(t)}$).

- tanh: Ensures the output is between -1 and 1.

3. Sigmoid and Pointwise Operations:

- Sigmoid Units (σ): These are used in the gates to produce values between 0 and 1, which act as
weights to control the flow of information. The sigmoid function introduces a nonlinearity that helps in
learning complex patterns.
- Pointwise Multiplication (×): This operation is used to apply the weights produced by the sigmoid
units to the input data or the cell state.

4. Cell State Unit: The cell state, often referred to as the "state," has a linear self-loop whose weight
is produced by a sigmoid unit, the forget gate. It acts as the memory of the LSTM cell, carrying
information across time steps.

Cell State Update ($s_i^{(t)}$): The cell state is updated by combining the transformed input (controlled
by the input gate) and the filtered previous cell state (controlled by the forget gate):

$$s_i^{(t)} = f_i^{(t)} s_i^{(t-1)} + g_i^{(t)}\, \sigma\Big(b_i + \sum_j U_{i,j} x_j^{(t)} + \sum_j W_{i,j} h_j^{(t-1)}\Big)$$

- First term ($f_i^{(t)} s_i^{(t-1)}$): Retains a fraction of the past state.

- Second term ($g_i^{(t)} \cdot \sigma(\ldots)$): Adds a fraction of the new candidate state.

5. Self-Loop: The self-loop represents the recurrent connection of the cell state, allowing it to maintain
its state over time. This loop is crucial for the LSTM's ability to capture long-term dependencies. The
self-loop weight (or the associated time constant) is controlled by the forget gate unit.

6. Output: The output of the LSTM cell is computed based on the current cell state and controlled by
the output gate. This output can be used as the input to the next time step or as the final output of the
network.
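
The six components can be put together in one time step of an LSTM cell. The sketch below is a minimal NumPy illustration that follows the notation of these notes (gates f, g, q and state s) and, as in the second term of the state update above, uses a sigmoid for the candidate input; many library implementations use tanh there instead. The function names, the parameter dictionary `p` and its keys are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, s_prev, p):
    """One LSTM time step: returns the new output h and the new cell state s."""
    f = sigmoid(p["bf"] + p["Uf"] @ x + p["Wf"] @ h_prev)        # forget gate
    g = sigmoid(p["bg"] + p["Ug"] @ x + p["Wg"] @ h_prev)        # input gate
    q = sigmoid(p["bo"] + p["Uo"] @ x + p["Wo"] @ h_prev)        # output gate
    candidate = sigmoid(p["b"] + p["U"] @ x + p["W"] @ h_prev)   # transformed input
    s = f * s_prev + g * candidate                               # cell state update
    h = np.tanh(s) * q                                           # gated output
    return h, s

rng = np.random.default_rng(0)
n_in, n_hid = 3, 4
p = {}
for name in ["f", "g", "o", ""]:      # forget, input, output gate, candidate
    p["b" + name] = np.zeros(n_hid)
    p["U" + name] = rng.normal(scale=0.5, size=(n_hid, n_in))
    p["W" + name] = rng.normal(scale=0.5, size=(n_hid, n_hid))

x = rng.normal(size=n_in)
h, s = lstm_step(x, np.zeros(n_hid), np.zeros(n_hid), p)
```

When the forget gate is close to 1 and the input gate is close to 0, the state is copied almost unchanged from one step to the next, which is how the self-loop preserves information over long spans.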
