1. Explain how the Recurrent Neural Network (RNN) processes data sequences
A Recurrent Neural Network (RNN) is a type of neural network specifically designed to
process sequential data such as time series, text, speech, and video. Unlike feedforward
neural networks, RNNs have memory, which allows them to retain information from
previous inputs while processing the current input.
Basic Idea of Sequence Processing in RNN
An RNN processes a sequence one element at a time. At each time step, the network:
Takes the current input
Uses information from the previous time step
Produces an output and updates its internal state
This enables the network to capture temporal dependencies in data.
Working of RNN on a Data Sequence
Consider an input sequence:
x_1, x_2, x_3, \dots, x_T
Step-by-step processing:
1. Input at Time Step t
At time step t, the RNN receives input x_t.
2. Hidden State Update
The RNN maintains a hidden state h_t, which acts as memory.
The hidden state is updated using:
h_t = f(W_x x_t + W_h h_{t-1} + b)
where:
x_t = current input
h_{t-1} = previous hidden state
W_x, W_h = input and recurrent weight matrices
b = bias
f(\cdot) = activation function (tanh or ReLU)
3. Output Generation
The output at time step t is computed as:
y_t = g(W_y h_t)
where:
W_y = output weight matrix
g(\cdot) = activation function (softmax / sigmoid)
4. Information Flow Through Time
The hidden state h_t carries information from:
Current input x_t
All previous inputs (x_1, x_2, \dots, x_{t-1})
This process repeats for every element in the sequence.
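The four steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: it assumes tanh for f and softmax for g, and the sizes, random weights, and the name rnn_forward are chosen only for the example.

```python
import numpy as np

def rnn_forward(xs, W_x, W_h, W_y, b, h0=None):
    """Run a simple tanh RNN over a sequence; return all hidden states and outputs."""
    h = np.zeros(W_h.shape[0]) if h0 is None else h0
    hs, ys = [], []
    for x_t in xs:                               # one element per time step
        h = np.tanh(W_x @ x_t + W_h @ h + b)     # h_t = f(W_x x_t + W_h h_{t-1} + b)
        logits = W_y @ h
        y = np.exp(logits - logits.max())        # numerically stable softmax
        y /= y.sum()                             # y_t = g(W_y h_t)
        hs.append(h)
        ys.append(y)
    return np.array(hs), np.array(ys)

# Illustrative sizes and random weights (assumptions for this sketch).
rng = np.random.default_rng(0)
T, n_in, n_hid, n_out = 5, 3, 4, 2
xs = rng.normal(size=(T, n_in))
W_x = rng.normal(scale=0.5, size=(n_hid, n_in))
W_h = rng.normal(scale=0.5, size=(n_hid, n_hid))
W_y = rng.normal(scale=0.5, size=(n_out, n_hid))
b = np.zeros(n_hid)

hs, ys = rnn_forward(xs, W_x, W_h, W_y, b)
print(hs.shape, ys.shape)   # (5, 4) (5, 2)
```

The same W_x, W_h, W_y, and b are used at every step, which is the weight sharing listed among the key characteristics, and the loop runs for any sequence length T.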
Key Characteristics of Sequence Processing in RNN
Weight sharing across all time steps
Memory of past inputs through hidden states
Suitable for variable-length sequences
Captures temporal and sequential dependencies
2. Bidirectional Recurrent Neural Networks (BRNNs) with Suitable Architecture
A Bidirectional Recurrent Neural Network (Bidirectional RNN or BRNN) is an extension of
the standard RNN that processes a data sequence in both forward and backward directions.
Unlike a unidirectional RNN, which considers only past context, a BRNN uses past as well as
future information to make predictions at each time step.
Motivation for Bidirectional RNN
In many sequence processing tasks, the output at a given time depends not only on previous
inputs but also on future inputs.
Examples:
Speech recognition
Natural language processing
Handwriting recognition
To capture this complete context, Bidirectional RNNs are used.
Working Principle of Bidirectional RNN
A Bidirectional RNN consists of two separate RNN layers:
1. Forward RNN
Processes the input sequence from time step 1 to T
Generates forward hidden states
2. Backward RNN
Processes the same sequence from time step T to 1
Generates backward hidden states
The outputs from both directions are combined to produce the final output.
Mathematical Representation
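Given an input sequence: x_1, x_2, x_3, \dots, x_T
Forward hidden state: \overrightarrow{h}_t = f(\overrightarrow{W}_x x_t + \overrightarrow{W}_h \overrightarrow{h}_{t-1} + \overrightarrow{b})
Backward hidden state: \overleftarrow{h}_t = f(\overleftarrow{W}_x x_t + \overleftarrow{W}_h \overleftarrow{h}_{t+1} + \overleftarrow{b})
Output at time step t: y_t = g(W_y [\overrightarrow{h}_t ; \overleftarrow{h}_t])
The forward and backward layers are separate, so each has its own weights and bias.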
Architecture of Bidirectional RNN
The input sequence is fed simultaneously to:
Forward RNN layer
Backward RNN layer
Each layer maintains its own hidden states
Hidden states from both directions are concatenated or summed at each time step to form the final representation
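The architecture described above can be sketched as two independent RNN passes whose states are concatenated. This is an illustrative NumPy sketch under assumptions: tanh units, separate randomly initialised parameters per direction, and helper names (run_rnn, birnn_states, init_params) invented for the example.

```python
import numpy as np

def run_rnn(xs, W_x, W_h, b):
    """Return the hidden state at every step of a tanh RNN."""
    h = np.zeros(W_h.shape[0])
    hs = []
    for x_t in xs:
        h = np.tanh(W_x @ x_t + W_h @ h + b)
        hs.append(h)
    return np.array(hs)

def birnn_states(xs, fwd_params, bwd_params):
    """Concatenate forward and backward hidden states at each time step."""
    h_fwd = run_rnn(xs, *fwd_params)                # reads x_1 ... x_T
    h_bwd = run_rnn(xs[::-1], *bwd_params)[::-1]    # reads x_T ... x_1, re-aligned to t
    return np.concatenate([h_fwd, h_bwd], axis=1)

def init_params(rng, n_in, n_hid):
    """Random weights for one direction (illustrative initialisation)."""
    return (rng.normal(scale=0.5, size=(n_hid, n_in)),
            rng.normal(scale=0.5, size=(n_hid, n_hid)),
            np.zeros(n_hid))

rng = np.random.default_rng(0)
T, n_in, n_hid = 6, 3, 4
xs = rng.normal(size=(T, n_in))
fwd = init_params(rng, n_in, n_hid)   # forward layer has its own parameters
bwd = init_params(rng, n_in, n_hid)   # backward layer has its own parameters

H = birnn_states(xs, fwd, bwd)
print(H.shape)   # (6, 8): n_hid forward features + n_hid backward features per step
```

Row t of H combines a summary of the inputs up to t (first half) with a summary of the inputs from t onward (second half), which is the past-plus-future context a BRNN provides.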
3. Long Short-Term Memory (LSTM) is a special type of recurrent neural network that
introduces gated self-loops to allow gradients to flow over long time durations, thereby
effectively learning long-term dependencies.
LSTM was introduced by Hochreiter and Schmidhuber (1997).
It mitigates the vanishing gradient problem of traditional RNNs.
Uses a memory cell (state) with a linear self-loop.
The self-loop weight is not fixed; it is controlled by a forget gate.
The time scale of memory integration can change dynamically based on input.
LSTM contains three main gates: Forget gate, Input gate, Output gate
Gates use sigmoid activation to control information flow.
The cell output is regulated using a tanh activation.
LSTM is widely used in:
Speech recognition
Handwriting recognition and generation
Machine translation
Image captioning
Parsing tasks
LSTM Components and Forward Propagation Equations
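1. Forget Gate
Controls how much of the previous cell state is retained.
f_i^{(t)} = \sigma( b_i^f + \sum_j U_{i,j}^f x_j^{(t)} + \sum_j W_{i,j}^f h_j^{(t-1)} )
2. Cell State Update
Updates internal memory using the forget gate f_i^{(t)} and the input gate g_i^{(t)}.
s_i^{(t)} = f_i^{(t)} s_i^{(t-1)} + g_i^{(t)} \tanh( b_i + \sum_j U_{i,j} x_j^{(t)} + \sum_j W_{i,j} h_j^{(t-1)} )
3. Input Gate
Controls how much new information is added to the cell state.
g_i^{(t)} = \sigma( b_i^g + \sum_j U_{i,j}^g x_j^{(t)} + \sum_j W_{i,j}^g h_j^{(t-1)} )
4. Output Gate and Hidden State
Controls the output of the LSTM cell.
q_i^{(t)} = \sigma( b_i^o + \sum_j U_{i,j}^o x_j^{(t)} + \sum_j W_{i,j}^o h_j^{(t-1)} )
h_i^{(t)} = \tanh( s_i^{(t)} ) q_i^{(t)}
Here x^{(t)} is the current input, h^{(t-1)} is the previous hidden state, and b, U, W (with superscripts f, g, o for the gates) are biases, input weights, and recurrent weights.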
All gates use sigmoid activation (values between 0 and 1).
The cell state may also be used as an additional input to gates.
LSTM cells replace standard hidden units in RNNs.
Same parameters are reused at each time step.
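A single LSTM time step following the forward propagation equations above might look like this in NumPy. It is a sketch under assumptions: the parameter dictionary layout, the sizes, and the random weights are chosen for illustration, and the optional cell-state input to the gates is left out.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, s_prev, p):
    """One LSTM time step; p holds U, W, b for each gate and for the candidate."""
    f = sigmoid(p["b_f"] + p["U_f"] @ x + p["W_f"] @ h_prev)   # forget gate
    g = sigmoid(p["b_g"] + p["U_g"] @ x + p["W_g"] @ h_prev)   # input gate
    q = sigmoid(p["b_o"] + p["U_o"] @ x + p["W_o"] @ h_prev)   # output gate
    cand = np.tanh(p["b"] + p["U"] @ x + p["W"] @ h_prev)      # candidate update
    s = f * s_prev + g * cand                                  # cell state update
    h = np.tanh(s) * q                                         # hidden state / output
    return h, s

# Illustrative sizes and random parameters (assumptions for this sketch).
rng = np.random.default_rng(0)
n_in, n_hid = 3, 4
p = {}
for name in ["_f", "_g", "_o", ""]:
    p["U" + name] = rng.normal(scale=0.5, size=(n_hid, n_in))
    p["W" + name] = rng.normal(scale=0.5, size=(n_hid, n_hid))
    p["b" + name] = np.zeros(n_hid)

h, s = np.zeros(n_hid), np.zeros(n_hid)
for x_t in rng.normal(size=(5, n_in)):     # same parameters reused at each step
    h, s = lstm_step(x_t, h, s, p)
print(h.shape, s.shape)   # (4,) (4,)
```

When the forget gate is close to 1 and the input gate close to 0, the cell state is copied almost unchanged from one step to the next, which is how the linear self-loop preserves information over long spans.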
4. The Encoder–Decoder Sequence-to-Sequence (Seq2Seq) architecture is a neural network
framework used to transform an input sequence into an output sequence, where the
lengths of input and output sequences may differ. This architecture is widely used in
applications such as machine translation, text summarization, and speech recognition.
Basic Idea of Sequence-to-Sequence Learning
In sequence-to-sequence problems:
Input is a sequence: x_1, x_2, \dots, x_T
Output is another sequence: y_1, y_2, \dots, y_{T'}, where T' may differ from T
The encoder–decoder architecture solves this by using:
Encoder → to encode the input sequence
Decoder → to generate the output sequence
Encoder–Decoder Architecture Overview
The architecture consists of two main components:
Encoder Network
Decoder Network
Both are usually implemented using RNN, LSTM, or GRU units.
Encoder
The encoder processes the input sequence one time step at a time.
It converts the input sequence into a fixed-length context vector.
For each time step t:
h_t = f(h_{t-1}, x_t)
where:
x_t = input at time t
h_t = hidden state
After the final input:
The last hidden state h_T represents the context vector C, i.e. C = h_T.
Decoder
The decoder generates the output sequence using the context vector from the encoder.
It predicts one output symbol at a time.
At time step t:
s_t = f(s_{t-1}, y_{t-1}, C)
y_t = g(s_t)
where:
s_t = decoder hidden state
y_{t-1} = previous output
g(\cdot) = output activation function (softmax)
Working Principle of Encoder–Decoder Architecture
Encoder reads the entire input sequence
Information is compressed into a context vector
Decoder uses this context to generate output sequence
Output is produced step-by-step
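The working principle above can be sketched with a plain tanh encoder and a greedy decoder. This is only an illustration with untrained random weights: the matrix names (V_s, V_y, V_c, W_out), the token ids START and END, and the max_len cutoff are assumptions made for the example, and practical systems typically use LSTM or GRU units.

```python
import numpy as np

def encode(xs, W_x, W_h):
    """Encoder: h_t = tanh(W_x x_t + W_h h_{t-1}); the last state is the context C."""
    h = np.zeros(W_h.shape[0])
    for x_t in xs:
        h = np.tanh(W_x @ x_t + W_h @ h)
    return h

def decode(C, V_s, V_y, V_c, W_out, start_id, end_id, max_len):
    """Greedy decoder: s_t = tanh(V_s s_{t-1} + V_y onehot(y_{t-1}) + V_c C)."""
    n_vocab = W_out.shape[0]
    s = np.zeros(V_s.shape[0])
    y_prev, out = start_id, []
    for _ in range(max_len):
        y_onehot = np.eye(n_vocab)[y_prev]
        s = np.tanh(V_s @ s + V_y @ y_onehot + V_c @ C)
        y_prev = int(np.argmax(W_out @ s))   # argmax of softmax equals argmax of logits
        if y_prev == end_id:                 # stop when the end symbol is produced
            break
        out.append(y_prev)
    return out

# Illustrative sizes, token ids, and random weights (assumptions for this sketch).
rng = np.random.default_rng(0)
n_in, n_hid, n_vocab = 3, 5, 6
START, END = 0, 1
xs = rng.normal(size=(4, n_in))              # input sequence of length 4
W_x = rng.normal(scale=0.5, size=(n_hid, n_in))
W_h = rng.normal(scale=0.5, size=(n_hid, n_hid))
V_s = rng.normal(scale=0.5, size=(n_hid, n_hid))
V_y = rng.normal(scale=0.5, size=(n_hid, n_vocab))
V_c = rng.normal(scale=0.5, size=(n_hid, n_hid))
W_out = rng.normal(scale=0.5, size=(n_vocab, n_hid))

C = encode(xs, W_x, W_h)
ys = decode(C, V_s, V_y, V_c, W_out, START, END, max_len=10)
print(C.shape, len(ys))
```

The output length is decided by the decoder (end symbol or max_len), not by the input length, which is why input and output sequences may differ in length.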
5. Teacher Forcing and Networks with Output Recurrence
Teacher forcing is a training technique for recurrent neural networks where, during
training, the true output from the training set is fed back into the network at the next time
step instead of the model’s own prediction.
Networks with Output Recurrence
These networks have recurrent connections only from output to hidden units.
They lack hidden-to-hidden recurrence, making them less powerful.
Such networks cannot simulate a universal Turing machine.
Output units must store all past information needed for future predictions.
Training becomes simpler and parallelizable because:
Each time step is decoupled.
Gradients can be computed independently.
No need to wait for previous outputs during training.
Teacher forcing is derived from the maximum likelihood criterion, where the model is trained
by feeding the ground-truth output y^{(t)} as input for predicting the next time step.
Maximum Likelihood Formulation:
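\log p(y^{(1)}, y^{(2)} \mid x^{(1)}, x^{(2)})    (10.15)
= \log p(y^{(2)} \mid y^{(1)}, x^{(1)}, x^{(2)}) + \log p(y^{(1)} \mid x^{(1)}, x^{(2)})    (10.16)
At time t = 2, the model is trained to maximize the conditional probability of y^{(2)} given the inputs so far and the true previous output y^{(1)}.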
This shows why ground-truth outputs should be used during training.
Key Points:
During training:
True output y^{(t)} is fed into the model at time t + 1.
During testing:
True output is unavailable.
Model’s own output o^{(t)} is fed back.
Teacher forcing:
Avoids back-propagation through time (BPTT) when no hidden-to-hidden recurrence
exists.
Can still be used in models with hidden recurrence, but BPTT becomes necessary.
Some models use both teacher forcing and BPTT.
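Solutions to Teacher Forcing Problem
If the trained network is later run with its own outputs fed back as inputs, the inputs it sees at test time can differ from those it saw during training. This mismatch can be reduced as follows: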
Train using a mix of:
Teacher-forced inputs
Free-running (self-generated) inputs
Predict targets multiple steps ahead.
Scheduled sampling (Bengio et al., 2015):
Randomly choose between true output and generated output.
Gradually increase use of generated outputs (curriculum learning).
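The mix of teacher-forced and free-running inputs can be sketched as follows. This is a toy illustration, not the procedure from the cited paper: the linear decay in teacher_prob, the rollout function, and the toy_step model are all assumptions introduced for the example.

```python
import numpy as np

def teacher_prob(epoch, n_epochs):
    """Probability of feeding the true output; decays linearly from 1 to 0 (assumed schedule)."""
    return max(0.0, 1.0 - epoch / max(1, n_epochs - 1))

def rollout(y_true, step_fn, state, p_truth, rng, start=0):
    """Run the model over one training sequence, choosing each next input at random."""
    prev, preds, n_truth = start, [], 0
    for t in range(len(y_true)):
        pred, state = step_fn(prev, state)
        preds.append(pred)
        if rng.random() < p_truth:
            prev = y_true[t]        # teacher forcing: feed the ground truth
            n_truth += 1
        else:
            prev = pred             # free-running: feed the model's own output
    return preds, n_truth

def toy_step(prev, state):
    """Hypothetical stand-in for a trained model: predicts the previous symbol plus one."""
    return (prev + 1) % 10, state

rng = np.random.default_rng(0)
y_true = [5, 7, 2, 9]
for epoch in range(3):
    p = teacher_prob(epoch, n_epochs=3)
    preds, n_truth = rollout(y_true, toy_step, None, p, rng)
    print(epoch, p, preds, n_truth)
```

Early epochs behave like pure teacher forcing (p = 1), and later epochs expose the model to its own predictions, which is the curriculum idea behind scheduled sampling.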
6. Deep Recurrent Neural Networks (Deep RNNs) in Detail
A Deep Recurrent Neural Network (Deep RNN) is an extension of the standard Recurrent
Neural Network in which multiple recurrent layers are stacked on top of each other. This
depth enables the network to learn hierarchical and complex temporal representations
from sequential data such as speech, text, and time series.
Standard (shallow) RNNs have limited representational power because they contain only one
recurrent layer. For complex sequence learning tasks, shallow RNNs may fail to capture:
High-level temporal patterns
Long-range dependencies
Deep RNNs overcome this limitation by introducing depth in the temporal model.
Architecture of Deep Recurrent Neural Network
A Deep RNN consists of:
Multiple recurrent hidden layers
Each layer processes the output sequence of the previous layer
For a Deep RNN with L layers:
Layer 1 processes the input sequence
Higher layers process increasingly abstract temporal features.
Working Principle of Deep RNN
Consider an input sequence: x_1, x_2, \dots, x_T
Let h_t^{(l)} be the hidden state at time t and layer l, with h_t^{(0)} = x_t.
Hidden state update equation: h_t^{(l)} = f(W^{(l)} h_t^{(l-1)} + U^{(l)} h_{t-1}^{(l)} + b^{(l)})
Output Generation
The output at time step t is computed from the topmost recurrent layer: y_t = g(h_t^{(L)})
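The stacked update can be sketched in NumPy as a loop over time with an inner loop over layers. This is a minimal illustration assuming tanh units, zero initial states, and made-up layer sizes; the function name deep_rnn_forward is hypothetical, and the output function g is omitted so the sketch returns the top-layer states.

```python
import numpy as np

def deep_rnn_forward(xs, layers):
    """layers is a list of (W, U, b), one per layer; returns top-layer states, shape (T, n_top)."""
    hs = [np.zeros(U.shape[0]) for (_, U, _) in layers]   # initial state of every layer
    top = []
    for x_t in xs:
        below = x_t                                       # layer 0 is the input itself
        for l, (W, U, b) in enumerate(layers):
            # hs[l] on the right-hand side still holds this layer's state from time t-1
            hs[l] = np.tanh(W @ below + U @ hs[l] + b)
            below = hs[l]                                 # feeds the next layer up
        top.append(below)
    return np.array(top)

# Illustrative sizes and random weights (assumptions for this sketch).
rng = np.random.default_rng(0)
T, n_in, sizes = 5, 3, [4, 4, 2]
xs = rng.normal(size=(T, n_in))
layers, n_below = [], n_in
for n_hid in sizes:
    layers.append((rng.normal(scale=0.5, size=(n_hid, n_below)),
                   rng.normal(scale=0.5, size=(n_hid, n_hid)),
                   np.zeros(n_hid)))
    n_below = n_hid

top = deep_rnn_forward(xs, layers)
print(top.shape)   # (5, 2)
```

Each layer has its own weights and its own recurrent state, so information flows both upward through the layers and forward through time.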
Key Characteristics of Deep RNN
Depth both in time (across time steps) and in space (across stacked layers)
Learns hierarchical temporal features
Each layer captures different levels of abstraction
Uses Backpropagation Through Time (BPTT) for training
Explain Recursive Neural Networks (RvNNs)
A Recursive Neural Network (Recursive NN or RvNN) is a type of neural network
designed to process structured and hierarchical data rather than simple sequences. Unlike
Recurrent Neural Networks, which operate over time sequences, Recursive Neural Networks
operate over tree-like or graph structures, making them suitable for data with a recursive
structure such as parse trees in natural language processing.
The same neural network is applied repeatedly at every node
Structure of the network follows the structure of the input data
Working Principle of Recursive Neural Network
Consider a tree-structured input (e.g., sentence parse tree).
Step-by-step working:
Leaf Nodes
Leaf nodes represent basic inputs (words or tokens).
Each leaf node is converted into a vector representation.
Recursive Composition
Parent node representation is computed by combining its child nodes.
The same function is used at every node.
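h_p = f(W [h_{c1} ; h_{c2}] + b)
where h_{c1} and h_{c2} are the child representations, [ ; ] denotes concatenation, and h_p is the parent representation.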
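The bottom-up composition can be sketched as a short recursive function over a binary tree. This is an illustration under assumptions: trees are written as nested Python tuples of tokens, f is tanh, and the word vectors and weights are random rather than learned.

```python
import numpy as np

def compose(tree, embed, W, b):
    """Return the vector for a binary tree given as nested tuples of tokens."""
    if isinstance(tree, str):                  # leaf node: look up the word vector
        return embed[tree]
    left, right = tree
    h_c1 = compose(left, embed, W, b)          # children are computed first (bottom-up)
    h_c2 = compose(right, embed, W, b)
    return np.tanh(W @ np.concatenate([h_c1, h_c2]) + b)   # same W and b at every node

# Illustrative vocabulary, vector size, and random parameters (assumptions for this sketch).
rng = np.random.default_rng(0)
d = 4
embed = {w: rng.normal(size=d) for w in ["the", "cat", "sat"]}
W = rng.normal(scale=0.5, size=(d, 2 * d))
b = np.zeros(d)

tree = (("the", "cat"), "sat")                 # ((the cat) sat)
root = compose(tree, embed, W, b)
print(root.shape)   # (4,)
```

The recursion depth follows the shape of the tree, so a different parse of the same words produces a different computation and, in general, a different root vector.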
Key Characteristics of Recursive Neural Networks
Operate on hierarchical structures
Use weight sharing across tree nodes
Process data in a bottom-up manner
Depth varies depending on input structure
Describe the Computation of Gradient in a Recurrent Neural Network (RNN)
Training a Recurrent Neural Network (RNN) requires computing gradients of the loss
function with respect to network parameters. Since RNNs have recurrent connections
across time steps, gradient computation is more complex than in feedforward networks. This
is performed using a technique called Backpropagation Through Time (BPTT).
Why Gradient Computation is Different in RNN
RNN parameters are shared across all time steps
The hidden state at a given time depends on previous hidden states
Errors must be propagated backward through time
Thus, gradients must account for temporal dependencies.
Forward Computation in RNN
For an input sequence x_1, x_2, \dots, x_T:
Hidden state: h_t = f(W_x x_t + W_h h_{t-1} + b)
Output: y_t = g(W_y h_t)
Total loss: L = \sum_{t=1}^{T} L_t(y_t, \hat{y}_t)
Backpropagation Through Time (BPTT)
BPTT unfolds the RNN across time steps, converting it into a deep feedforward network.
Gradients are then computed using the chain rule.
Gradient Computation Steps
1. Gradient w.r.t. Output Weights W_y
2. Gradient w.r.t. Hidden State
The error at time step t depends on:
Error from the output at time t
Error propagated from future time steps
3. Gradient w.r.t. Recurrent Weights W_h
The gradient with respect to W_h is a sum of contributions from all time steps, so gradients accumulate across time.
4. Gradient w.r.t. Input Weights W_x
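The formulas for these four steps are not given in the text, so the following is one standard way to write them, derived from the forward equations with the chain rule. The symbols a_t, o_t, \delta^{o}_t, and \delta^{a}_t are notation introduced here for the sketch, and f is assumed to act elementwise.

```latex
% Notation introduced for this sketch:
%   a_t = W_x x_t + W_h h_{t-1} + b   (pre-activation),   o_t = W_y h_t   (pre-output)
\begin{align}
\delta^{o}_t &= \frac{\partial L_t}{\partial o_t} \\
\frac{\partial L}{\partial W_y} &= \sum_{t=1}^{T} \delta^{o}_t \, h_t^{\top} \\
\frac{\partial L}{\partial h_t} &= W_y^{\top} \delta^{o}_t + W_h^{\top} \delta^{a}_{t+1},
  \qquad \delta^{a}_{T+1} = 0 \\
\delta^{a}_t &= f'(a_t) \odot \frac{\partial L}{\partial h_t} \\
\frac{\partial L}{\partial W_h} &= \sum_{t=1}^{T} \delta^{a}_t \, h_{t-1}^{\top} \\
\frac{\partial L}{\partial W_x} &= \sum_{t=1}^{T} \delta^{a}_t \, x_t^{\top}
\end{align}
```

The third line has two terms: the error from the output at time t and the error propagated back from time t + 1. Because the weights are shared, each weight gradient is a sum over all time steps.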
Vanishing and Exploding Gradients
During BPTT:
Repeated multiplication of gradients can cause:
Vanishing gradients (values → 0)
Exploding gradients (values → ∞)
This makes learning long-term dependencies difficult.
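The effect of repeated multiplication can be seen in a small numeric sketch. It is a simplification under stated assumptions: the derivative of the activation function is ignored, and W_h is a scaled identity matrix so that its largest singular value is obvious.

```python
import numpy as np

def backprop_norms(W_h, T):
    """Norm of a gradient vector after 1..T multiplications by W_h^T."""
    g = np.ones(W_h.shape[0])
    norms = []
    for _ in range(T):
        g = W_h.T @ g                 # one step back through time (activation ignored)
        norms.append(np.linalg.norm(g))
    return norms

n = 4
small = backprop_norms(0.5 * np.eye(n), 30)   # largest singular value 0.5
large = backprop_norms(1.5 * np.eye(n), 30)   # largest singular value 1.5
print(small[-1], large[-1])
```

After 30 steps the first norm has shrunk by a factor of 0.5^30 and the second has grown by a factor of 1.5^30, so the signal from distant time steps is either lost or overwhelms the update.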
Techniques to Handle Gradient Problems
Gradient clipping
Proper weight initialization
Using LSTM or GRU instead of simple RNN
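Of these techniques, gradient clipping is the simplest to show in code. The sketch below rescales all gradients together when their combined norm exceeds a threshold; the function name clip_by_global_norm and the threshold value are illustrative choices, not part of the text above.

```python
import numpy as np

def clip_by_global_norm(grads, max_norm):
    """Rescale a list of gradient arrays so their combined L2 norm is at most max_norm."""
    total = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    if total <= max_norm:
        return grads, total                     # small enough: leave unchanged
    scale = max_norm / total
    return [g * scale for g in grads], total    # same direction, smaller magnitude

# Two hypothetical gradient arrays with combined norm 5.
grads = [np.array([3.0, 0.0]), np.array([[0.0, 4.0]])]
clipped, total = clip_by_global_norm(grads, max_norm=1.0)
print(total, clipped)
```

Clipping addresses exploding gradients only; vanishing gradients are usually handled by careful initialization or by gated units such as LSTM and GRU.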