Understanding RNNs and Their Variants

The document provides an overview of Recurrent Neural Networks (RNNs) and their variants, including Bidirectional RNNs and Long Short-Term Memory (LSTM) networks, highlighting their ability to process sequential data and capture temporal dependencies. It also discusses the Encoder-Decoder architecture for sequence-to-sequence tasks, the concept of teacher forcing in training, and the challenges of gradient computation in RNNs, particularly the vanishing and exploding gradient problems. Additionally, it introduces Recursive Neural Networks (RvNNs) for structured data and Deep RNNs for learning complex temporal representations.

Uploaded by

udemy6061
Copyright
© All Rights Reserved

1. Explain how the Recurrent Neural Network (RNN) processes data sequences

A Recurrent Neural Network (RNN) is a type of neural network specifically designed to
process sequential data such as time series, text, speech, and video. Unlike feedforward
neural networks, RNNs have memory, which allows them to retain information from
previous inputs while processing the current input.

Basic Idea of Sequence Processing in RNN

An RNN processes a sequence one element at a time. At each time step, the network:

 Takes the current input
 Uses information from the previous time step
 Produces an output and updates its internal state

This enables the network to capture temporal dependencies in data.

Working of RNN on a Data Sequence

Consider an input sequence:

$x_1, x_2, x_3, \dots, x_T$

Step-by-step processing:

1. Input at Time Step $t$

At time step $t$, the RNN receives input $x_t$.

2. Hidden State Update

 The RNN maintains a hidden state $h_t$, which acts as memory.

 The hidden state is updated using:

$h_t = f(W_x x_t + W_h h_{t-1} + b)$

where:
 $x_t$ = current input
 $h_{t-1}$ = previous hidden state
 $W_x, W_h$ = weight matrices
 $b$ = bias
 $f(\cdot)$ = activation function (tanh or ReLU)

3. Output Generation

The output at time step $t$ is computed as:

$y_t = g(W_y h_t)$

where:
 $W_y$ = output weight matrix
 $g(\cdot)$ = activation function (softmax / sigmoid)

4. Information Flow Through Time

The hidden state $h_t$ carries information from:

 Current input $x_t$
 All previous inputs $(x_1, x_2, \dots, x_{t-1})$

This process repeats for every element in the sequence.
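
The four steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: the layer sizes, the random weights, the choice of tanh for $f$ and softmax for $g$, and the names `rnn_forward` and `softmax` are all assumptions made for the example.

```python
import numpy as np

def softmax(z):
    z = z - z.max()              # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def rnn_forward(xs, W_x, W_h, W_y, b, h0):
    """Run a simple RNN over a sequence; return all hidden states and outputs."""
    h = h0
    hs, ys = [], []
    for x_t in xs:                               # one element per time step
        h = np.tanh(W_x @ x_t + W_h @ h + b)     # h_t = f(W_x x_t + W_h h_{t-1} + b)
        y = softmax(W_y @ h)                     # y_t = g(W_y h_t)
        hs.append(h)
        ys.append(y)
    return np.array(hs), np.array(ys)

# Illustrative sizes and random (untrained) weights.
rng = np.random.default_rng(0)
T, n_in, n_hid, n_out = 5, 3, 4, 2
xs = rng.normal(size=(T, n_in))
W_x = rng.normal(scale=0.5, size=(n_hid, n_in))
W_h = rng.normal(scale=0.5, size=(n_hid, n_hid))
W_y = rng.normal(scale=0.5, size=(n_out, n_hid))
b = np.zeros(n_hid)

hs, ys = rnn_forward(xs, W_x, W_h, W_y, b, np.zeros(n_hid))
print(hs.shape, ys.shape)   # (5, 4) (5, 2)
```

The same `W_x`, `W_h`, `W_y`, and `b` are used at every iteration of the loop, which is the weight sharing across time steps, and the loop runs for as many elements as the sequence contains, which is why variable-length inputs need no change to the parameters.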

Key Characteristics of Sequence Processing in RNN

 Weight sharing across all time steps


 Memory of past inputs through hidden states
 Suitable for variable-length sequences
 Captures temporal and sequential dependencies

2. Bidirectional Recurrent Neural Networks (BRNNs) with Suitable Architecture


A Bidirectional Recurrent Neural Network (Bidirectional RNN or BRNN) is an extension of
the standard RNN that processes a data sequence in both forward and backward directions.
Unlike a unidirectional RNN, which considers only past context, a BRNN uses past as well as
future information to make predictions at each time step.

Motivation for Bidirectional RNN

In many sequence processing tasks, the output at a given time depends not only on previous
inputs but also on future inputs.

Examples:

 Speech recognition
 Natural language processing
 Handwriting recognition

To capture this complete context, Bidirectional RNNs are used.

Working Principle of Bidirectional RNN

A Bidirectional RNN consists of two separate RNN layers:

1. Forward RNN

 Processes the input sequence from time step 1 to T
 Generates forward hidden states

2. Backward RNN

 Processes the same sequence from time step T to 1
 Generates backward hidden states

The outputs from both directions are combined to produce the final output.

Mathematical Representation

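Given an input sequence: $x_1, x_2, x_3, \dots, x_T$

Forward hidden state: $\overrightarrow{h}_t = f(\overrightarrow{W}_x x_t + \overrightarrow{W}_h \overrightarrow{h}_{t-1} + \overrightarrow{b})$

Backward hidden state: $\overleftarrow{h}_t = f(\overleftarrow{W}_x x_t + \overleftarrow{W}_h \overleftarrow{h}_{t+1} + \overleftarrow{b})$

Output at time step $t$: $y_t = g(W_y [\overrightarrow{h}_t, \overleftarrow{h}_t])$

Because the forward and backward RNNs are separate layers, each direction has its own weights and bias.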

Architecture of Bidirectional RNN

 Input sequence is fed simultaneously to:

 Forward RNN layer


 Backward RNN layer

 Each layer maintains its own hidden states


 Hidden states from both directions are:
 Concatenated or summed
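
A small NumPy sketch of this architecture follows. It assumes tanh units, zero initial states, random untrained weights, and concatenation as the combining rule; the helper names `run_direction` and `brnn_hidden` are made up for the illustration.

```python
import numpy as np

def run_direction(xs, W_x, W_h, b):
    """Run one tanh RNN over xs in the order given; return its hidden states."""
    h = np.zeros(W_h.shape[0])
    hs = []
    for x_t in xs:
        h = np.tanh(W_x @ x_t + W_h @ h + b)
        hs.append(h)
    return np.array(hs)

def brnn_hidden(xs, fwd, bwd):
    """Concatenate forward and backward hidden states at every time step."""
    h_fwd = run_direction(xs, *fwd)                # reads time steps 1 .. T
    h_bwd = run_direction(xs[::-1], *bwd)[::-1]    # reads T .. 1, then re-aligned to 1 .. T
    return np.concatenate([h_fwd, h_bwd], axis=1)

rng = np.random.default_rng(0)
T, n_in, n_hid = 4, 3, 4

def make_params():
    # Each direction gets its own (W_x, W_h, b).
    return (rng.normal(scale=0.5, size=(n_hid, n_in)),
            rng.normal(scale=0.5, size=(n_hid, n_hid)),
            np.zeros(n_hid))

fwd, bwd = make_params(), make_params()
xs = rng.normal(size=(T, n_in))
H = brnn_hidden(xs, fwd, bwd)
print(H.shape)   # (4, 8)
```

Each row of `H` holds the forward state (context from the past) followed by the backward state (context from the future), so an output layer applied to row `t` sees the whole sequence.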

3. Long Short-Term Memory (LSTM) is a special type of recurrent neural network that
introduces gated self-loops to allow gradients to flow over long time durations, thereby
effectively learning long-term dependencies.

 LSTM was introduced by Hochreiter and Schmidhuber (1997).


 It mitigates the vanishing gradient problem of traditional RNNs.
 Uses a memory cell (state) with a linear self-loop.
 The self-loop weight is not fixed; it is controlled by a forget gate.
 The time scale of memory integration can change dynamically based on input.
 LSTM contains three main gates: Forget gate, Input gate, Output gate
 Gates use sigmoid activation to control information flow.
 The cell output is regulated using a tanh activation.

LSTM is widely used in:

 Speech recognition
 Handwriting recognition and generation
 Machine translation
 Image captioning
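 Parsing tasks

LSTM Components and Forward Propagation Equations

1. Forget Gate

Controls how much of the previous cell state is retained.

$f_i^{(t)} = \sigma\left(b_i^f + \sum_j U_{i,j}^f x_j^{(t)} + \sum_j W_{i,j}^f h_j^{(t-1)}\right)$

2. Cell State Update

Updates internal memory using the forget gate $f_i^{(t)}$ and the input gate $g_i^{(t)}$ (defined in step 3).

$s_i^{(t)} = f_i^{(t)} s_i^{(t-1)} + g_i^{(t)} \tanh\left(b_i + \sum_j U_{i,j} x_j^{(t)} + \sum_j W_{i,j} h_j^{(t-1)}\right)$

3. Input Gate

Controls how much new information is added to the cell state.

$g_i^{(t)} = \sigma\left(b_i^g + \sum_j U_{i,j}^g x_j^{(t)} + \sum_j W_{i,j}^g h_j^{(t-1)}\right)$

4. Output Gate and Hidden State

Controls the output of the LSTM cell.

$q_i^{(t)} = \sigma\left(b_i^o + \sum_j U_{i,j}^o x_j^{(t)} + \sum_j W_{i,j}^o h_j^{(t-1)}\right)$

The hidden state is the tanh-squashed cell state scaled by the output gate:

$h_i^{(t)} = \tanh\left(s_i^{(t)}\right) q_i^{(t)}$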

 All gates use sigmoid activation (values between 0 and 1).


 The cell state may also be used as an additional input to gates.
 LSTM cells replace standard hidden units in RNNs.
 Same parameters are reused at each time step.
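
The LSTM equations translate almost line for line into vectorised NumPy. The sketch below is illustrative only: the dictionary layout of the parameters, the gate names used as keys, and the function name `lstm_step` are assumptions, and the weights are random rather than trained.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, s_prev, p):
    """One LSTM time step. p maps a gate name to its (U, W, b) triple."""
    def affine(name):
        U, W, b = p[name]
        return b + U @ x + W @ h_prev
    f = sigmoid(affine("forget"))                  # forget gate f^(t)
    g = sigmoid(affine("input"))                   # input gate g^(t)
    q = sigmoid(affine("output"))                  # output gate q^(t)
    s = f * s_prev + g * np.tanh(affine("cell"))   # cell state s^(t)
    h = np.tanh(s) * q                             # hidden state h^(t)
    return h, s

rng = np.random.default_rng(0)
n_in, n_hid = 3, 4

def make_gate():
    return (rng.normal(scale=0.5, size=(n_hid, n_in)),
            rng.normal(scale=0.5, size=(n_hid, n_hid)),
            np.zeros(n_hid))

params = {name: make_gate() for name in ("forget", "input", "output", "cell")}

h, s = np.zeros(n_hid), np.zeros(n_hid)
for x_t in rng.normal(size=(5, n_in)):   # the same params are reused at every step
    h, s = lstm_step(x_t, h, s, params)
print(h.shape, s.shape)   # (4,) (4,)
```

When the forget gate is close to 1 and the input gate close to 0, the update reduces to `s = s_prev`, which is the linear self-loop that lets information (and gradients) persist over many time steps.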

4. The Encoder–Decoder Sequence-to-Sequence (Seq2Seq) architecture is a neural network
framework used to transform an input sequence into an output sequence, where the
lengths of input and output sequences may differ. This architecture is widely used in
applications such as machine translation, text summarization, and speech recognition.

Basic Idea of Sequence-to-Sequence Learning

 In sequence-to-sequence problems:
 Input is a sequence: $x_1, x_2, \dots, x_T$
 Output is another sequence: $y_1, y_2, \dots, y_{T'}$, where $T'$ may differ from $T$
 The encoder–decoder architecture solves this by using:
 Encoder → to encode the input sequence
 Decoder → to generate the output sequence

Encoder–Decoder Architecture Overview

 The architecture consists of two main components:
 Encoder Network
 Decoder Network
 Both are usually implemented using RNN, LSTM, or GRU units.

Encoder

 The encoder processes the input sequence one time step at a time.
 It converts the input sequence into a fixed-length context vector.
 For each time step $t$:
 $h_t = f(h_{t-1}, x_t)$
 $x_t$ = input at time $t$
 $h_t$ = hidden state
 After the final input:
 The last hidden state $h_T$ represents the context vector $C$: $C = h_T$

Decoder

 The decoder generates the output sequence using the context vector from the encoder.
 It predicts one output symbol at a time.
 At time step $t$:
 $s_t = f(s_{t-1}, y_{t-1}, C)$
 $y_t = g(s_t)$

 where:
 $s_t$ = decoder hidden state
 $y_{t-1}$ = previous output
 $g(\cdot)$ = output activation function (softmax)

Working Principle of Encoder–Decoder Architecture

 Encoder reads the entire input sequence


 Information is compressed into a context vector
 Decoder uses this context to generate output sequence
 Output is produced step-by-step
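
The encode-then-decode loop can be sketched as below. Everything specific here is an assumption for illustration: one-hot token inputs, plain tanh RNN cells instead of LSTM or GRU, greedy (argmax) decoding, the start and end token ids `SOS` and `EOS`, and random untrained weights, so the generated tokens are meaningless.

```python
import numpy as np

def one_hot(i, n):
    v = np.zeros(n)
    v[i] = 1.0
    return v

def encode(xs, W_x, W_h):
    """Read the whole input sequence and return the context vector C = h_T."""
    h = np.zeros(W_h.shape[0])
    for x_t in xs:
        h = np.tanh(W_x @ x_t + W_h @ h)
    return h

def decode(C, V_s, V_y, V_c, W_out, sos, eos, max_len):
    """Greedy decoding with s_t = f(s_{t-1}, y_{t-1}, C)."""
    n_vocab = W_out.shape[0]
    s = np.zeros(V_s.shape[0])
    y_prev = sos
    out = []
    for _ in range(max_len):
        s = np.tanh(V_s @ s + V_y @ one_hot(y_prev, n_vocab) + V_c @ C)
        y_prev = int(np.argmax(W_out @ s))   # argmax of the logits equals argmax of softmax
        if y_prev == eos:                    # stop when the end token is produced
            break
        out.append(y_prev)
    return out

rng = np.random.default_rng(0)
n_vocab, n_hid = 6, 8
SOS, EOS = 0, 1   # hypothetical start and end token ids
W_x = rng.normal(scale=0.5, size=(n_hid, n_vocab))
W_h = rng.normal(scale=0.5, size=(n_hid, n_hid))
V_s = rng.normal(scale=0.5, size=(n_hid, n_hid))
V_y = rng.normal(scale=0.5, size=(n_hid, n_vocab))
V_c = rng.normal(scale=0.5, size=(n_hid, n_hid))
W_out = rng.normal(scale=0.5, size=(n_vocab, n_hid))

src = [one_hot(i, n_vocab) for i in (2, 3, 4)]
C = encode(src, W_x, W_h)
out = decode(C, V_s, V_y, V_c, W_out, SOS, EOS, max_len=10)
print(C.shape, len(out) <= 10)
```

The context vector has the same size whatever the input length, and the decoder stops on its own end token or length limit, which is how the output length is decoupled from the input length.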
5. Teacher Forcing and Networks with Output Recurrence

Teacher forcing is a training technique for recurrent neural networks where, during
training, the true output from the training set is fed back into the network at the next time
step instead of the model’s own prediction.

Networks with Output Recurrence

 These networks have recurrent connections only from output to hidden units.
 They lack hidden-to-hidden recurrence, making them less powerful.
 Such networks cannot simulate a universal Turing machine.
 Output units must store all past information needed for future predictions.
 Training becomes simpler and parallelizable because:

 Each time step is decoupled.


 Gradients can be computed independently.
 No need to wait for previous outputs during training.

Teacher forcing is derived from the maximum likelihood criterion, where the model is trained
by feeding the ground-truth output $y^{(t)}$ as input for predicting the next time step.

Maximum Likelihood Formulation:

$\log p(y^{(1)}, y^{(2)} \mid x^{(1)}, x^{(2)})$ (10.15)

$= \log p(y^{(2)} \mid y^{(1)}, x^{(1)}, x^{(2)}) + \log p(y^{(1)} \mid x^{(1)}, x^{(2)})$ (10.16)

 At time $t = 2$, the model is trained using the true previous output $y^{(1)}$.
 This shows why ground-truth outputs should be used during training.

Key Points:

 During training:
 True output $y^{(t)}$ is fed into the model at time $t+1$.
 During testing:
 True output is unavailable.
 Model’s own output $o^{(t)}$ is fed back.
 Teacher forcing:
 Avoids back-propagation through time (BPTT) when no hidden-to-hidden recurrence exists.
 Can still be used in models with hidden recurrence, but BPTT becomes necessary.
 Some models use both teacher forcing and BPTT.

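Solutions to Teacher Forcing Problem

Because the model is fed ground-truth outputs during training but its own outputs during
testing, the inputs it sees at test time can differ from those it was trained on. Ways to
reduce this mismatch:

Train using a mix of:

 Teacher-forced inputs
 Free-running (self-generated) inputs

Predict targets multiple steps ahead.

Scheduled sampling (Bengio et al., 2015):

 Randomly choose between true output and generated output.
 Gradually increase use of generated outputs (curriculum learning).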
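
The difference between teacher forcing, free-running, and scheduled sampling comes down to which token is fed in at each step. The sketch below shows only that selection logic. The names `choose_inputs`, `p_truth_schedule`, and `toy_model` are hypothetical, the "model" is a stand-in function rather than a network, and the linear decay is just one possible curriculum, not necessarily the schedule used by Bengio et al.

```python
import numpy as np

def choose_inputs(targets, predict_next, p_truth, rng, start_token=0):
    """Return the token fed to the model at each time step.

    p_truth = 1.0 is pure teacher forcing, p_truth = 0.0 is free-running,
    and values in between give scheduled sampling.
    """
    fed = []
    prev_true = prev_pred = start_token
    for y_true in targets:
        inp = prev_true if rng.random() < p_truth else prev_pred
        fed.append(inp)
        prev_pred = predict_next(inp)   # what the model would emit at this step
        prev_true = y_true              # ground truth for this step
    return fed

def p_truth_schedule(epoch, n_epochs):
    """Linear decay from 1 (teacher forcing) towards 0 (free-running)."""
    return max(0.0, 1.0 - epoch / n_epochs)

toy_model = lambda tok: (tok + 1) % 5   # hypothetical stand-in for a trained network
targets = [3, 1, 4, 1]
rng = np.random.default_rng(0)

print(choose_inputs(targets, toy_model, 1.0, rng))   # [0, 3, 1, 4]
print(choose_inputs(targets, toy_model, 0.0, rng))   # [0, 1, 2, 3]
```

With `p_truth = 1.0` the inputs are the targets shifted by one step, so every time step can be prepared in advance; with `p_truth = 0.0` each input depends on the model's previous output, as at test time.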

6. Deep Recurrent Neural Networks (Deep RNNs) in Detail


A Deep Recurrent Neural Network (Deep RNN) is an extension of the standard Recurrent
Neural Network in which multiple recurrent layers are stacked on top of each other. This
depth enables the network to learn hierarchical and complex temporal representations
from sequential data such as speech, text, and time series.
Standard (shallow) RNNs have limited representational power because they contain only one
recurrent layer. For complex sequence learning tasks, shallow RNNs may fail to capture:

 High-level temporal patterns


 Long-range dependencies

Deep RNNs overcome this limitation by introducing depth in the temporal model.

Architecture of Deep Recurrent Neural Network

A Deep RNN consists of:

 Multiple recurrent hidden layers


 Each layer processes the output sequence of the previous layer

For a Deep RNN with $L$ layers:

 Layer 1 processes the input sequence


 Higher layers process increasingly abstract temporal features.

Working Principle of Deep RNN

 Consider an input sequence: $x_1, x_2, \dots, x_T$
 Let $h_t^{(l)}$ be the hidden state at time $t$ and layer $l$, with $h_t^{(0)} = x_t$.
 Hidden state update equation: $h_t^{(l)} = f(W^{(l)} h_t^{(l-1)} + U^{(l)} h_{t-1}^{(l)} + b^{(l)})$

Output Generation

 The output at time step $t$ is computed from the topmost recurrent layer: $y_t = g(h_t^{(L)})$
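
A stacked RNN can be sketched by running one recurrent layer over the full hidden sequence of the layer below it. The code assumes tanh units, zero initial states, and random untrained weights; `deep_rnn_forward` and the layer sizes are illustrative names and values.

```python
import numpy as np

def deep_rnn_forward(xs, layers):
    """layers is a list of (W, U, b); layer l reads the hidden sequence of layer l-1."""
    seq = xs                                   # h^(0)_t = x_t
    for W, U, b in layers:
        h = np.zeros(U.shape[0])
        out = []
        for inp in seq:
            h = np.tanh(W @ inp + U @ h + b)   # h^(l)_t = f(W h^(l-1)_t + U h^(l)_{t-1} + b)
            out.append(h)
        seq = np.array(out)
    return seq                                 # hidden states of the top layer

rng = np.random.default_rng(0)
T, sizes = 6, [3, 5, 4]   # input size 3, then recurrent layers of width 5 and 4
layers = [(rng.normal(scale=0.5, size=(n_out, n_in)),
           rng.normal(scale=0.5, size=(n_out, n_out)),
           np.zeros(n_out))
          for n_in, n_out in zip(sizes[:-1], sizes[1:])]

top = deep_rnn_forward(rng.normal(size=(T, sizes[0])), layers)
print(top.shape)   # (6, 4)
```

The output $y_t$ would then be computed from row `t` of `top`, the topmost layer's hidden state.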

Key Characteristics of Deep RNN


 Depth in both time (unrolled time steps) and space (stacked layers)
 Learns hierarchical temporal features
 Each layer captures different levels of abstraction
 Uses Backpropagation Through Time (BPTT) for training

Explain Recursive Neural Networks (RvNNs)

A Recursive Neural Network (Recursive NN or RvNN) is a type of neural network
designed to process structured and hierarchical data rather than simple sequences. Unlike
Recurrent Neural Networks, which operate over time sequences, Recursive Neural Networks
operate over tree-like or graph structures, making them suitable for data with a recursive
structure such as parse trees in natural language processing.

 The same neural network is applied repeatedly
 Structure of the network follows the structure of the input data

Working Principle of Recursive Neural Network

Consider a tree-structured input (e.g., sentence parse tree).

Step-by-step working:

Leaf Nodes

 Leaf nodes represent basic inputs (words or tokens).


 Each leaf node is converted into a vector representation.

Recursive Composition

 Parent node representation is computed by combining its child nodes.


 The same function is used at every node.

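$h_p = f(W[h_{c_1}, h_{c_2}] + b)$

where $h_{c_1}$ and $h_{c_2}$ are the child representations and $h_p$ is the resulting parent representation.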
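
The bottom-up composition can be written as a short recursive function. This sketch assumes binary trees given as nested tuples of word strings, tanh for $f$, and random vectors in place of learned word embeddings; `compose` and the example trees are made up for illustration.

```python
import numpy as np

def compose(tree, embed, W, b):
    """Return the vector for a binary tree given as nested tuples of words."""
    if isinstance(tree, str):                 # leaf node: look up its vector
        return embed[tree]
    left, right = tree                        # internal node: combine the two children
    h_l = compose(left, embed, W, b)
    h_r = compose(right, embed, W, b)
    return np.tanh(W @ np.concatenate([h_l, h_r]) + b)   # same W and b at every node

rng = np.random.default_rng(0)
d = 4
words = ["the", "cat", "sat", "down"]
embed = {w: rng.normal(size=d) for w in words}   # stand-in for learned embeddings
W = rng.normal(scale=0.5, size=(d, 2 * d))
b = np.zeros(d)

left_branching = ((("the", "cat"), "sat"), "down")
balanced = (("the", "cat"), ("sat", "down"))
v1 = compose(left_branching, embed, W, b)
v2 = compose(balanced, embed, W, b)
print(v1.shape, v2.shape)   # (4,) (4,)
```

Both trees contain the same words in the same order, yet the two root vectors differ because the network's computation graph follows the tree structure, and the recursion depth varies with that structure.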

Key Characteristics of Recursive Neural Networks

 Operate on hierarchical structures


 Use weight sharing across tree nodes
 Process data in a bottom-up manner
 Depth varies depending on input structure

Describe the Computation of Gradient in a Recurrent Neural Network (RNN)

Training a Recurrent Neural Network (RNN) requires computing gradients of the loss
function with respect to network parameters. Since RNNs have recurrent connections
across time steps, gradient computation is more complex than in feedforward networks. This
is performed using a technique called Backpropagation Through Time (BPTT).

Why Gradient Computation is Different in RNN

 RNN parameters are shared across all time steps


 The hidden state at a given time depends on previous hidden states
 Errors must be propagated backward through time
 Thus, gradients must account for temporal dependencies.

Forward Computation in RNN

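For an input sequence $x_1, x_2, \dots, x_T$:

 Hidden state: $h_t = f(W_x x_t + W_h h_{t-1} + b)$
 Output: $y_t = g(W_y h_t)$
 Total loss: $L = \sum_{t=1}^{T} L_t(y_t, \hat{y}_t)$, where $y_t$ is the network output and $\hat{y}_t$ is the target at time step $t$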

Backpropagation Through Time (BPTT)

BPTT unfolds the RNN across time steps, converting it into a deep feedforward network.
Gradients are then computed using the chain rule.

Gradient Computation Steps


1. Gradient w.r.t. Output Weights $W_y$

2. Gradient w.r.t. Hidden State

The error at time step $t$ depends on:

 Error from the output at time $t$
 Error propagated from future time steps

3. Gradient w.r.t. Recurrent Weights $W_h$

The gradient is a sum of contributions from every time step, so gradients accumulate across time steps.

4. Gradient w.r.t. Input Weights $W_x$
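
The formulas for these four steps do not appear in the text, so the following is a reconstruction from the forward equations $h_t = f(W_x x_t + W_h h_{t-1} + b)$ and $y_t = g(W_y h_t)$ using the chain rule. The intermediate symbols $a_t$, $o_t$, $\delta^{o}_t$, and $\delta^{a}_t$ are introduced here for convenience and are not notation from the original notes.

```latex
\begin{align*}
a_t &= W_x x_t + W_h h_{t-1} + b, \qquad h_t = f(a_t), \qquad o_t = W_y h_t, \qquad y_t = g(o_t) \\
\delta^{o}_t &= \frac{\partial L_t}{\partial o_t} \\
\frac{\partial L}{\partial W_y} &= \sum_{t=1}^{T} \delta^{o}_t \, h_t^{\top} \\
\frac{\partial L}{\partial h_t} &= W_y^{\top} \delta^{o}_t
  + W_h^{\top} \operatorname{diag}\!\big(f'(a_{t+1})\big) \frac{\partial L}{\partial h_{t+1}},
  \qquad \frac{\partial L}{\partial h_{T+1}} = 0 \\
\delta^{a}_t &= \operatorname{diag}\!\big(f'(a_t)\big) \frac{\partial L}{\partial h_t} \\
\frac{\partial L}{\partial W_h} &= \sum_{t=1}^{T} \delta^{a}_t \, h_{t-1}^{\top} \\
\frac{\partial L}{\partial W_x} &= \sum_{t=1}^{T} \delta^{a}_t \, x_t^{\top}
\end{align*}
```

The first term of $\partial L / \partial h_t$ is the error from the output at time $t$ and the second is the error propagated back from time $t+1$; at $t = T$ the second term is dropped. Unrolling that recursion multiplies by $W_h^{\top} \operatorname{diag}(f'(\cdot))$ once per time step.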

Vanishing and Exploding Gradients

 During BPTT:
 Repeated multiplication of gradients can cause:
 Vanishing gradients (values → 0)
 Exploding gradients (values → ∞)
 This makes learning long-term dependencies difficult.
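
The effect of repeated multiplication is easy to see numerically. The sketch below assumes the simplest possible case, a linear RNN ($f$ is the identity) whose recurrent matrix is a scaled identity, so that each backward step multiplies the gradient by a fixed factor. The function name `backprop_norms` and the factors 0.5 and 1.5 are chosen only for illustration.

```python
import numpy as np

def backprop_norms(W_h, steps):
    """Norm of a gradient vector after each repeated multiplication by W_h^T."""
    g = np.ones(W_h.shape[0])
    norms = []
    for _ in range(steps):
        g = W_h.T @ g                 # one step back in time (identity activation)
        norms.append(np.linalg.norm(g))
    return norms

small = backprop_norms(0.5 * np.eye(3), 50)   # factor below 1: gradient vanishes
large = backprop_norms(1.5 * np.eye(3), 50)   # factor above 1: gradient explodes
print(f"{small[-1]:.1e}  {large[-1]:.1e}")
```

After 50 steps the first norm has shrunk by a factor of $0.5^{50}$ and the second has grown by $1.5^{50}$, so a dependency 50 steps in the past contributes either almost nothing or an overwhelming amount to the weight update.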

Techniques to Handle Gradient Problems

 Gradient clipping
 Proper weight initialization
 Using LSTM or GRU instead of simple RNN
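
Gradient clipping, the first technique in the list, can be sketched as a rescaling of all gradients when their combined norm exceeds a threshold. Clipping by global norm is one common variant; the function name `clip_by_global_norm` and the threshold value are assumptions for the example.

```python
import numpy as np

def clip_by_global_norm(grads, max_norm):
    """Rescale a list of gradient arrays so their combined L2 norm is at most max_norm."""
    total = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    if total <= max_norm:
        return grads, total           # small gradients are left untouched
    scale = max_norm / total
    return [g * scale for g in grads], total

grads = [np.array([3.0, 4.0]), np.array([12.0])]   # combined norm is 13
clipped, norm = clip_by_global_norm(grads, max_norm=1.0)
print(norm)   # 13.0
```

Rescaling keeps the direction of the update and only limits its size. This addresses exploding gradients; it does nothing for vanishing gradients, which is where gated units such as LSTM or GRU help.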
