
Unit III Notes DL

This document provides an overview of Recurrent Neural Networks (RNNs), explaining their architecture, key components, and how they process sequential data by maintaining memory of past inputs. It discusses challenges such as vanishing and exploding gradients, along with techniques to mitigate these issues, and introduces advanced architectures like Bidirectional RNNs and Stacked RNNs. The document emphasizes the importance of RNNs in natural language processing and other sequential data tasks.

Uploaded by

Aditya Morale

UNIT III NOTES

Deep Learning

Introduction to Recurrent Neural Networks


Recurrent Neural Networks (RNNs) are a class of neural networks designed to process
sequential data by retaining information from previous steps. They are especially effective for
tasks where context and order matter.
• Designed for sequential and temporal data
• Maintains memory of past inputs
• Widely used in NLP, forecasting and speech tasks

Imagine reading a sentence and trying to predict the next word: you do not rely only on the
current word but also remember the words that came before. RNNs work similarly by
“remembering” past information, i.e. they take the earlier words into account when choosing
the most likely next word.
This memory of previous steps helps the network understand context and make better
predictions.
Key Components of RNNs
There are mainly two components of RNNs that we will discuss.
1. Recurrent Neurons
The fundamental processing unit in an RNN is the recurrent unit. It holds a hidden state that
maintains information about previous inputs in a sequence. Recurrent units can "remember"
information from prior steps by feeding their hidden state back into themselves, allowing them
to capture dependencies across time.
2. RNN Unfolding
RNN unfolding or unrolling is the process of expanding the recurrent structure over time steps.
During unfolding each step of the sequence is represented as a separate layer in a series
illustrating how information flows across each time step.
This unrolling enables backpropagation through time (BPTT), a learning process in which errors
are propagated backwards across time steps to adjust the network’s weights, enhancing the
RNN’s ability to learn dependencies within sequential data.

Recurrent Neural Network Architecture


RNNs share similarities in input and output structures with other deep learning architectures
but differ significantly in how information flows from input to output. Unlike traditional deep
neural networks, where each dense layer has its own weight matrices, RNNs use the same weights
at every time step, which lets them carry information along a sequence.
In RNNs the hidden state $h_t$ is calculated for every input $X_t$ to retain sequential dependencies.
The computations follow these core formulas:
1. Hidden State Calculation:
$$h_t = \sigma(U \cdot X_t + W \cdot h_{t-1} + B)$$
Here:
• $h_t$ represents the current hidden state and $h_{t-1}$ the previous hidden state.
• $X_t$ is the input at time step $t$.
• $U$ and $W$ are weight matrices.
• $B$ is the bias.
• $\sigma$ is the activation function.
2. Output Calculation:
$$Y_t = O(V \cdot h_t + C)$$
The output $Y_t$ is calculated by applying $O$, an activation function, to the weighted hidden state,
where $V$ and $C$ represent weights and bias.
3. Overall Function:
$$Y = f(X, h, W, U, V, B, C)$$
This function defines the entire RNN operation: the outputs depend on the inputs, on the
network's state $h_t$ at each time step $t$, and on the shared parameters.
How does RNN work?
At each time step, an RNN applies the same recurrent unit, with a fixed activation function, to
the current input. The unit has an internal hidden state that acts as memory, retaining
information from previous time steps. This memory allows the network to store past knowledge
and adapt based on new inputs.
Updating the Hidden State in RNNs
The current hidden state ℎ𝑡 depends on the previous state ℎ𝑡−1 and the current input 𝑥𝑡 and is
calculated using the following relations:
1. State Update:
ℎ𝑡 = 𝑓(ℎ𝑡−1 , 𝑥𝑡 )
where:
• ℎ𝑡 is the current state
• ℎ𝑡−1 is the previous state
• 𝑥𝑡 is the input at the current time step
2. Activation Function Application:
ℎ𝑡 = tanh⁡(𝑊ℎℎ ⋅ ℎ𝑡−1 + 𝑊𝑥ℎ ⋅ 𝑥𝑡 )
Here, 𝑊ℎℎ is the weight matrix for the recurrent neuron and 𝑊𝑥ℎ is the weight matrix for the
input neuron.
3. Output Calculation:
𝑦𝑡 = 𝑊ℎ𝑦 ⋅ ℎ𝑡
where 𝑦𝑡 is the output and 𝑊ℎ𝑦 is the weight at the output layer.
These parameters are updated using backpropagation. However, since an RNN works on
sequential data, it uses a variant of backpropagation known as backpropagation
through time.
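The three relations above can be sketched as a short NumPy loop. This is a minimal illustration rather than a reference implementation: the layer sizes, the random initialisation and the function name `rnn_forward` are assumptions made for the example.

```python
import numpy as np

def rnn_forward(xs, W_xh, W_hh, W_hy, h0):
    """Run a vanilla RNN over a sequence; return all outputs and hidden states."""
    h = h0
    hs, ys = [], []
    for x_t in xs:                            # one iteration per time step
        h = np.tanh(W_hh @ h + W_xh @ x_t)    # h_t = tanh(W_hh h_{t-1} + W_xh x_t)
        hs.append(h)
        ys.append(W_hy @ h)                   # y_t = W_hy h_t
    return np.array(ys), np.array(hs)

# Illustrative sizes (assumptions, not taken from the notes)
rng = np.random.default_rng(0)
input_size, hidden_size, output_size, T = 3, 4, 2, 5
W_xh = rng.normal(scale=0.5, size=(hidden_size, input_size))
W_hh = rng.normal(scale=0.5, size=(hidden_size, hidden_size))
W_hy = rng.normal(scale=0.5, size=(output_size, hidden_size))
xs = rng.normal(size=(T, input_size))

ys, hs = rnn_forward(xs, W_xh, W_hh, W_hy, np.zeros(hidden_size))
print(ys.shape, hs.shape)  # (5, 2) (5, 4)
```

Note that the same three matrices are reused at every time step, which is the weight sharing described earlier.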
Backpropagation Through Time (BPTT) in RNNs
Since RNNs process sequential data, Backpropagation Through Time (BPTT) is used to update
the network's parameters. In this three-step example, the loss function L(θ) depends on the
final hidden state ℎ3, and each hidden state relies on the preceding one, forming a sequential
dependency chain:
ℎ3 depends on ℎ2 , ℎ2 depends on ℎ1 and ℎ1 depends on ℎ0
In BPTT, gradients are backpropagated through each time step. This is essential for updating
network parameters based on temporal dependencies.
1. Simplified Gradient Calculation:
$$\frac{\partial L(\theta)}{\partial W} = \frac{\partial L(\theta)}{\partial h_3} \cdot \frac{\partial h_3}{\partial W}$$
2. Handling Dependencies in Layers: Each hidden state is updated based on its dependencies:
ℎ3 = 𝜎(𝑊 ⋅ ℎ2 + 𝑏)
The gradient is then calculated for each state, considering dependencies from previous hidden
states.
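3. Gradient Calculation with Explicit and Implicit Parts: The gradient is broken down into an
explicit part, written $\partial^{+} h_k / \partial W$ (the direct dependence of $h_k$ on $W$ with
earlier hidden states held fixed), and implicit parts that sum up the indirect paths from each
earlier hidden state to the weights:
$$\frac{\partial h_3}{\partial W} = \frac{\partial^{+} h_3}{\partial W} + \frac{\partial h_3}{\partial h_2} \cdot \frac{\partial^{+} h_2}{\partial W} + \frac{\partial h_3}{\partial h_2} \cdot \frac{\partial h_2}{\partial h_1} \cdot \frac{\partial^{+} h_1}{\partial W}$$
4. Final Gradient Expression: The final derivative of the loss function with respect to the
weight matrix W is computed as:
$$\frac{\partial L(\theta)}{\partial W} = \frac{\partial L(\theta)}{\partial h_3} \cdot \sum_{k=1}^{3} \frac{\partial h_3}{\partial h_k} \cdot \frac{\partial^{+} h_k}{\partial W}$$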
This iterative process is the essence of backpropagation through time.
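The sum over time steps can be checked numerically on a toy case. The sketch below assumes a scalar RNN $h_t = \tanh(w\,h_{t-1} + u\,x_t)$ with a squared-error loss on the last hidden state; the recurrence, the loss, the numbers and the function names are all illustrative assumptions, chosen only so that the BPTT sum can be compared with a finite-difference estimate.

```python
import numpy as np

def forward(w, u, xs, h0):
    """Return [h_0, h_1, ..., h_T] for h_t = tanh(w * h_{t-1} + u * x_t)."""
    hs = [h0]
    for x in xs:
        hs.append(np.tanh(w * hs[-1] + u * x))
    return hs

def loss(w, u, xs, h0, target):
    """Squared error on the final hidden state only."""
    return 0.5 * (forward(w, u, xs, h0)[-1] - target) ** 2

def bptt_grad_w(w, u, xs, h0, target):
    """dL/dw as dL/dh_T times the sum over k of (dh_T/dh_k) * (explicit dh_k/dw)."""
    hs = forward(w, u, xs, h0)
    T = len(xs)
    dL_dhT = hs[T] - target
    total = 0.0
    dhT_dhk = 1.0                       # dh_T/dh_k, starting at k = T
    for k in range(T, 0, -1):
        local = 1.0 - hs[k] ** 2        # derivative of tanh at step k
        total += dhT_dhk * local * hs[k - 1]   # explicit part, h_{k-1} held fixed
        dhT_dhk *= local * w            # extend the chain by dh_k/dh_{k-1}
    return dL_dhT * total

w, u, h0, target = 0.9, 0.5, 0.5, 0.2   # arbitrary illustrative values
xs = [1.0, -0.5, 0.25]                  # three time steps, matching h1..h3
analytic = bptt_grad_w(w, u, xs, h0, target)
eps = 1e-6
numeric = (loss(w + eps, u, xs, h0, target)
           - loss(w - eps, u, xs, h0, target)) / (2 * eps)
print(analytic, numeric)  # the two values should agree closely
```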
Vanishing and Exploding Gradients Problems in Deep Learning
To train deep neural networks effectively, managing the Vanishing and Exploding Gradients
Problems is important. These issues occur during backpropagation when gradients become too
small or too large, making it difficult for the model to learn properly. Both problems directly
affect the model’s convergence and overall performance.
Vanishing Gradient Problem
Vanishing gradients occur when gradients become extremely small during backpropagation,
causing early layers to learn very slowly or stop learning. During backpropagation the gradient
of the loss 𝐿 with respect to a weight 𝑤𝑖 in layer 𝑖 is calculated using the chain rule:
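$$\frac{\partial L}{\partial w_i} = \frac{\partial L}{\partial a_n} \cdot \frac{\partial a_n}{\partial a_{n-1}} \cdot \frac{\partial a_{n-1}}{\partial a_{n-2}} \cdots \frac{\partial a_{i+1}}{\partial a_i} \cdot \frac{\partial a_i}{\partial w_i}$$
where
• $L$: Loss function.
• $w_i$: Weight parameter in layer $i$.
• $a_k$: Activation output of layer $k$, for a network with $n$ layers.
• $\partial L / \partial w_i$: Gradient of the loss with respect to the weight.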
When activation functions like Sigmoid or Tanh are used, their derivatives are at most 1 (at
most 0.25 for Sigmoid) and become much smaller when the unit saturates. Repeated
multiplication through layers causes the gradient to vanish as it moves backwards, making the
lower layers unable to learn.
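A small numerical sketch makes the shrinkage concrete. It multiplies only the Sigmoid derivative factors of the chain rule and ignores the weight terms, which is a simplifying assumption; the pre-activation values are random and purely illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)      # never exceeds 0.25, the value at z = 0

def chain_factor(pre_activations):
    """Product of the local Sigmoid derivatives, one per layer."""
    return np.prod(sigmoid_grad(pre_activations))

rng = np.random.default_rng(0)
for n_layers in (5, 10, 20):
    z = rng.normal(size=n_layers)       # hypothetical pre-activation per layer
    # The product is bounded above by 0.25 ** n_layers
    print(n_layers, chain_factor(z), 0.25 ** n_layers)
```

Even in the best case (every pre-activation at zero) the factor is 0.25 per layer, so twenty layers already scale the gradient by less than 1e-12.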
Exploding Gradient Problem
Exploding gradients occur when gradients grow too large during backpropagation, leading to
unstable weight updates and divergence in loss. When derivatives or weights are greater than
1, their repeated multiplication across layers leads to exponential growth.
$$\prod_{i=1}^{n} \frac{\partial a_i}{\partial a_{i-1}} \longrightarrow \infty$$
The gradient update rule in gradient descent is:
$$w_{t+1} = w_t - \eta \cdot \frac{\partial L}{\partial w_t}$$
where
• $w_t$: Current weight value at update step $t$.
• $\eta$: Learning rate.
• $\partial L / \partial w_t$: Gradient of the loss with respect to the weight.
• $w_{t+1}$: Updated weight after applying gradient descent.
If $\partial L / \partial w_t$ is too large, weight updates become massive, causing the model
loss to oscillate or diverge.
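The effect of oversized updates can be seen on a toy problem. The sketch below applies the update rule to the quadratic loss $L(w) = w^2$, which is an assumption chosen only because its gradient is easy to write down; the learning rates are illustrative.

```python
def gradient_descent(w0, eta, steps):
    """Minimise L(w) = w**2 with the update w <- w - eta * dL/dw."""
    w = w0
    history = [w]
    for _ in range(steps):
        grad = 2.0 * w          # dL/dw for L(w) = w**2
        w = w - eta * grad
        history.append(w)
    return history

stable = gradient_descent(w0=1.0, eta=0.1, steps=20)    # each step scales w by 0.8
unstable = gradient_descent(w0=1.0, eta=1.1, steps=20)  # each step scales w by -1.2

print(abs(stable[-1]))     # shrinks towards the minimum at 0
print(abs(unstable[-1]))   # grows while the sign flips every step
```

In the second run the weight changes sign on every update and its magnitude keeps growing, which is the oscillate-then-diverge behaviour described above.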
Why do the Gradients Vanish or Explode
• Activation Functions: Sigmoid or Tanh have small derivatives that shrink gradients.
• Weight Initialization: Too small or too large weights cause vanishing or exploding
gradients.
• Deep Networks: Many layers multiply gradients repeatedly leading to instability.
• Learning Rate: High learning rate or unscaled inputs can make gradients explode.
Techniques to Fix Vanishing and Exploding Gradients
Vanishing and exploding gradients make training deep neural networks difficult. The following
methods help stabilize gradient flow and improve learning.
1. Proper Weight Initialization
Choosing the right weight initialization keeps gradients balanced during backpropagation.
• Xavier Initialization: Keeps activation variance consistent across layers to stabilize
gradients.
• Kaiming Initialization: Scales weights for ReLU to preserve signal strength and
prevent gradient decay.
2. Use Non Saturating Activation Functions
Sigmoid and Tanh saturate and can shrink gradients. Using ReLU or its variants helps mitigate
vanishing gradients:
• ReLU: Basic rectified linear unit.
• Leaky ReLU: Allows small gradients for negative inputs.
• ELU / SELU: Produce smooth, non-zero outputs for negative inputs; SELU is additionally
designed to give self-normalizing behaviour.
3. Apply Batch Normalization
Normalizes layer inputs to have zero mean and unit variance, stabilizing gradients and
accelerating convergence.
4. Gradient Clipping
Limits gradients to a maximum threshold to prevent them from exploding and destabilizing
training.
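Two of the techniques above, weight initialization and gradient clipping, are short enough to sketch in NumPy. The formulas used are the commonly quoted ones (Xavier uniform with limit sqrt(6 / (fan_in + fan_out)), Kaiming normal with standard deviation sqrt(2 / fan_in), and clipping by global norm); the function names and layer sizes are assumptions for illustration, and deep learning libraries provide their own versions.

```python
import numpy as np

rng = np.random.default_rng(0)

def xavier_uniform(fan_in, fan_out):
    """Xavier/Glorot uniform initialisation for a (fan_out, fan_in) weight matrix."""
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-limit, limit, size=(fan_out, fan_in))

def kaiming_normal(fan_in, fan_out):
    """Kaiming/He normal initialisation, intended for ReLU layers."""
    std = np.sqrt(2.0 / fan_in)
    return rng.normal(0.0, std, size=(fan_out, fan_in))

def clip_by_norm(grad, max_norm):
    """Rescale grad so that its L2 norm does not exceed max_norm."""
    norm = np.linalg.norm(grad)
    if norm > max_norm:
        return grad * (max_norm / norm)
    return grad

W_tanh = xavier_uniform(64, 32)      # hypothetical layer sizes
W_relu = kaiming_normal(256, 128)
g = np.array([3.0, 4.0])             # gradient with norm 5
print(clip_by_norm(g, 1.0))          # same direction, norm clipped to 1
```

Clipping by norm keeps the direction of the gradient and only limits its length, so a single very large gradient cannot push the weights arbitrarily far in one update.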
Bidirectional Recurrent Neural Network
RNNs are designed to handle sequential data such as speech, text and time series. Unlike
traditional feedforward neural networks, which process inputs as fixed-length vectors, RNNs
can manage variable-length sequences by maintaining a hidden state that stores information
from previous steps in the sequence.
This memory mechanism enables RNNs to capture key features within the sequence. However,
traditional RNNs face challenges. One is the vanishing gradient problem, where gradients
become too small during backpropagation, making training difficult. Another is that a
unidirectional RNN only sees past context: its prediction at a given step cannot use inputs
that come later in the sequence. To address this second limitation, the Bidirectional
Recurrent Neural Network (BRNN) was developed. This section explores BRNNs in more detail.
Overview of Bidirectional Recurrent Neural Networks (BRNNs)
A Bidirectional Recurrent Neural Network (BRNN) is an extension of the traditional RNN that
processes sequential data in both forward and backward directions. This allows the network to
utilize both past and future context when making predictions providing a more comprehensive
understanding of the sequence.
Like a traditional RNN, a BRNN moves forward through the sequence, updating the hidden
state based on the current input and the prior hidden state at each time step. The key difference
is that a BRNN also has a backward hidden layer which processes the sequence in reverse,
updating the hidden state based on the current input and the hidden state of the next time step.
Compared to unidirectional RNNs, BRNNs can improve accuracy by considering both the past and
future context. This is because the two hidden layers, i.e. forward and backward, complement
each other, and predictions are made using the combined outputs of both layers.
Example:
Consider the sentence: "I like apple. It is very healthy."
In a traditional unidirectional RNN the network might struggle to understand whether "apple"
refers to the fruit or the company based on the first sentence alone. A BRNN handles this case
more easily. By processing the text in both directions, it can use the future context provided
by the second sentence ("It is very healthy.") to infer that "apple" refers to the fruit.

Figure: Bi-directional Recurrent Neural Network

Working of Bidirectional Recurrent Neural Networks (BRNNs)


1. Inputting a Sequence: A sequence of data points, each represented as a vector with the same
dimensionality, is fed into the BRNN. Different sequences may have different lengths.
2. Dual Processing: BRNNs process data in two directions:
• Forward direction: The hidden state at each time step is determined by the current
input and the previous hidden state.
• Backward direction: The hidden state at each time step is influenced by the current
input and the next hidden state.
3. Computing the Hidden State: In each direction, a non-linear activation function is applied to
the weighted sum of the input and the neighbouring hidden state (the previous one in the
forward pass, the next one in the backward pass), creating a memory mechanism that allows the
network to retain information from other steps.
4. Determining the Output: A non-linear activation function is applied to the weighted
combination of the forward and backward hidden states to compute the output at each step.
This output can either be:
• The final output of the network.
• An input to another layer for further processing.
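The dual processing described in these steps can be sketched as two independent RNN passes whose states are joined at each time step. This is a minimal sketch under assumptions: concatenation is used to combine the two directions, biases are omitted, and the names `rnn_pass` and `birnn_forward` and all sizes are illustrative.

```python
import numpy as np

def rnn_pass(xs, W_x, W_h):
    """Plain tanh RNN over xs in the given order; returns one state per step."""
    h = np.zeros(W_h.shape[0])
    states = []
    for x_t in xs:
        h = np.tanh(W_x @ x_t + W_h @ h)
        states.append(h)
    return np.array(states)

def birnn_forward(xs, fwd_params, bwd_params):
    """Concatenate forward states with backward states at every time step."""
    h_fwd = rnn_pass(xs, *fwd_params)                # uses x_1 .. x_t
    h_bwd = rnn_pass(xs[::-1], *bwd_params)[::-1]    # uses x_T .. x_t, re-aligned in time
    return np.concatenate([h_fwd, h_bwd], axis=1)

rng = np.random.default_rng(0)
T, D, H = 6, 3, 4                                    # illustrative sizes
xs = rng.normal(size=(T, D))
fwd = (rng.normal(scale=0.5, size=(H, D)), rng.normal(scale=0.5, size=(H, H)))
bwd = (rng.normal(scale=0.5, size=(H, D)), rng.normal(scale=0.5, size=(H, H)))

out = birnn_forward(xs, fwd, bwd)
print(out.shape)  # (6, 8)
```

Each row of `out` holds past context in its first `H` entries and future context in its last `H` entries, and can be passed to an output layer or to another recurrent layer.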

Stacked RNNs in NLP


Stacked RNNs are RNNs that have multiple recurrent layers stacked on top of one another.
For that reason, Stacked RNNs are also called Deep RNNs. A typical example is a model built
from multiple SimpleRNN layers (stacked SimpleRNN) trained on the IMDB dataset.
What is RNN?
RNN, or Recurrent Neural Network, belongs to the neural network family and is commonly
used for Natural Language Processing (NLP) tasks. RNNs specialize in handling sequential
data (be it video, text or time series). This is because of the hidden state inside an RNN
layer. The hidden state carries information from the previous time step forward, so that each
step's computation, and therefore the weight updates during training, can take earlier inputs
into account.

What are Stacked RNNs


A single-layer RNN model has only one hidden layer, which is responsible for processing the
sequential data. A Stacked RNN, in contrast, has multiple RNN layers placed one on top of
another. This creates a 'stack'. The first layer processes the input sequence and each later
layer processes the output sequence of the layer below it.
• When an Input is passed to Layer 1:
o The input ( 𝑥𝑡 ) passes through the RNN layer 1. There, the hidden state gets
updated as:
ℎ𝑡 = 𝜎(𝑤𝑥 𝑥𝑡 + 𝑤ℎ ℎ𝑡−1 + 𝑏ℎ )
where
o ℎ𝑡 = present Hidden state
o ℎ𝑡−1 = previous hidden state
o 𝑥𝑡 = input to the RNN layer
o 𝑤𝑥 = weights associated with the input
o 𝑤ℎ = weights associated with the hidden layer
o 𝑏ℎ = bias associated with RNN layer
o 𝜎 = activation function
• The hidden state updates itself using the information it retained from the previous time
step.
• The present hidden state is then used to compute the output of the layer, optionally
followed by an appropriate activation function:
o 𝑦𝑡 = 𝑊ℎ𝑡 + 𝑏𝑦 where,
o W = weights assigned to the output of the layer
o ℎ𝑡 = hidden state
o 𝑏𝑦 = bias associated with the output layer
• For the second layer, the output sequence of the first RNN layer is fed in as its input,
and it goes through the same process again.
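The layer-by-layer flow above can be sketched in NumPy as follows. As a simplifying assumption, each layer passes its hidden-state sequence directly to the next layer instead of applying a separate output projection (this is also what Keras `SimpleRNN` returns with `return_sequences=True`); the helper names and sizes are illustrative.

```python
import numpy as np

def rnn_layer(xs, W_x, W_h, b_h):
    """One recurrent layer: h_t = tanh(W_x x_t + W_h h_{t-1} + b_h) for every step."""
    h = np.zeros(W_h.shape[0])
    out = []
    for x_t in xs:
        h = np.tanh(W_x @ x_t + W_h @ h + b_h)
        out.append(h)
    return np.array(out)

def stacked_rnn(xs, layers):
    """Feed the full output sequence of each layer into the layer above it."""
    seq = xs
    for W_x, W_h, b_h in layers:
        seq = rnn_layer(seq, W_x, W_h, b_h)
    return seq

def make_layer(rng, n_in, n_hidden):
    """Hypothetical helper that draws small random parameters for one layer."""
    return (rng.normal(scale=0.3, size=(n_hidden, n_in)),
            rng.normal(scale=0.3, size=(n_hidden, n_hidden)),
            np.zeros(n_hidden))

rng = np.random.default_rng(0)
xs = rng.normal(size=(7, 5))                       # 7 time steps, 5 features
layers = [make_layer(rng, 5, 8), make_layer(rng, 8, 6), make_layer(rng, 6, 4)]
top = stacked_rnn(xs, layers)
print(top.shape)  # (7, 4)
```

The input size of every layer after the first must equal the hidden size of the layer below, which is why the sizes chain as 5 → 8 → 6 → 4 here.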

Figure: Stacked RNN architecture

Stacking layers in this way helps the network capture both short-term and long-term patterns.
Stacked RNNs can learn and retain patterns across longer sequences while still combining the
current input with what was learned from the previous state. Adding layers generally lets the
network capture more complex patterns in sequential data, although deeper stacks are also
harder to train. If your data is nested and contains several kinds of complex patterns, a
Stacked RNN can be a better model because each of its layers can learn a different level of
abstraction.
What is LSTM - Long Short Term Memory?
Long Short-Term Memory (LSTM) is an enhanced version of the Recurrent Neural Network
(RNN) designed by Hochreiter and Schmidhuber. LSTMs can capture long-term dependencies
in sequential data, making them well suited to tasks like language translation, speech
recognition and time series forecasting. Unlike traditional RNNs, which use a single hidden
state passed through time, LSTMs introduce a memory cell that holds information over extended
periods, addressing the challenge of learning long-term dependencies.
Problem with Long-Term Dependencies in RNN
Recurrent Neural Networks (RNNs) are designed to handle sequential data by maintaining a
hidden state that captures information from previous time steps. However, they often struggle
to learn long-term dependencies, where information from distant time steps becomes crucial
for making accurate predictions at the current step. This difficulty is caused by the
vanishing and exploding gradient problems.
• Vanishing Gradient: When training a model over time, the gradients which help the
model learn can shrink as they pass through many steps. This makes it hard for the
model to learn long-term patterns since earlier information becomes almost irrelevant.
• Exploding Gradient: Sometimes gradients can grow too large causing instability. This
makes it difficult for the model to learn properly as the updates to the model become
erratic and unpredictable.
Both of these issues make it challenging for standard RNNs to effectively capture long-term
dependencies in sequential data.
LSTM Architecture
The LSTM architecture is built around a memory cell that is controlled by three gates:
1. Input gate: Controls what information is added to the memory cell.
2. Forget gate: Determines what information is removed from the memory cell.
3. Output gate: Controls what information is output from the memory cell.
This allows LSTM networks to selectively retain or discard information as it flows through the
network which allows them to learn long-term dependencies. The network has a hidden state
which is like its short-term memory. This memory is updated using the current input, the
previous hidden state and the current state of the memory cell.
Working of LSTM
The LSTM architecture has a chain structure that contains four neural network layers (three
gates and one candidate layer) and memory blocks called cells.

Information is retained by the cells and the memory manipulations are done by the gates. There
are three gates:
1. Forget Gate
The information that is no longer useful in the cell state is removed with the forget gate. Two
inputs, 𝑥𝑡 (the input at the current time step) and ℎ𝑡−1 (the previous hidden state), are fed to
the gate and multiplied with weight matrices, followed by the addition of a bias. The result is
passed through a sigmoid activation function, which gives an output between 0 and 1. If for a
particular element of the cell state the output is 0 or close to 0, that piece of information
is forgotten; for an output of 1 or close to 1, the information is retained for future use.
The equation for the forget gate is:
𝑓𝑡 = 𝜎(𝑊𝑓 ⋅ [ℎ𝑡−1 , 𝑥𝑡 ] + 𝑏𝑓 )
Where:
• 𝑊𝑓 represents the weight matrix associated with the forget gate.
• [ℎ𝑡−1 , 𝑥𝑡 ] denotes the concatenation of the previous hidden state and the current
input.
• 𝑏𝑓 is the bias with the forget gate.
• 𝜎 is the sigmoid activation function.

2. Input gate
The addition of useful information to the cell state is done by the input gate. First, a sigmoid
function regulates the information and filters the values to be remembered, similar to the
forget gate, using the inputs $h_{t-1}$ and $x_t$. Then a candidate vector $\hat{C}_t$ is created
using the tanh function, which gives outputs from -1 to +1 and contains the possible new values
computed from $h_{t-1}$ and $x_t$. Finally, the candidate values and the regulated values are
multiplied to obtain the useful information. The equations for the input gate are:
$$i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i)$$
$$\hat{C}_t = \tanh(W_c \cdot [h_{t-1}, x_t] + b_c)$$
We multiply the previous cell state $C_{t-1}$ by $f_t$, effectively filtering out the information
we had decided to ignore earlier. Then we add $i_t \odot \hat{C}_t$, which represents the new
candidate values scaled by how much we decided to update each state value.
$$C_t = f_t \odot C_{t-1} + i_t \odot \hat{C}_t$$
where
• ⊙ denotes element-wise multiplication
• tanh is activation function

3. Output gate
The output gate is responsible for deciding what part of the current cell state should be sent as
the hidden state (output) for this time step. First, the gate uses a sigmoid function to
determine which information from the current cell state will be output. This is done using the
previous hidden state ℎ𝑡−1 and the current input 𝑥𝑡 :
𝑜𝑡 = 𝜎(𝑊𝑜 ⋅ [ℎ𝑡−1 , 𝑥𝑡 ] + 𝑏𝑜 )
Next, the current cell state 𝐶𝑡 is passed through a tanh activation to scale its values
between −1 and +1. Finally, this transformed cell state is multiplied element-wise with 𝑜𝑡 to
produce the hidden state ℎ𝑡 :
ℎ𝑡 = 𝑜𝑡 ⊙ tanh⁡(𝐶𝑡 )
Here:
• 𝑜𝑡 is the output gate activation.
• 𝐶𝑡 is the current cell state.
• ⊙ represents element-wise multiplication.
• 𝜎 is the sigmoid activation function.
This hidden state ℎ𝑡 is then passed to the next time step and can also be used for generating the
output of the network.
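Putting the three gates together, one LSTM time step can be sketched in NumPy as below. The parameter dictionary, the sizes and the random initialisation are assumptions for illustration; the arithmetic follows the forget, input, cell-update and output equations given above.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM time step; returns the new hidden state and cell state."""
    concat = np.concatenate([h_prev, x_t])                    # [h_{t-1}, x_t]
    f_t = sigmoid(params["W_f"] @ concat + params["b_f"])     # forget gate
    i_t = sigmoid(params["W_i"] @ concat + params["b_i"])     # input gate
    c_hat = np.tanh(params["W_c"] @ concat + params["b_c"])   # candidate values
    c_t = f_t * c_prev + i_t * c_hat                          # cell state update
    o_t = sigmoid(params["W_o"] @ concat + params["b_o"])     # output gate
    h_t = o_t * np.tanh(c_t)                                  # new hidden state
    return h_t, c_t

rng = np.random.default_rng(0)
hidden, n_in = 4, 3                                           # illustrative sizes
params = {f"W_{g}": rng.normal(scale=0.3, size=(hidden, hidden + n_in)) for g in "fico"}
params.update({f"b_{g}": np.zeros(hidden) for g in "fico"})

h, c = np.zeros(hidden), np.zeros(hidden)
for x_t in rng.normal(size=(5, n_in)):                        # run over 5 time steps
    h, c = lstm_step(x_t, h, c, params)
print(h.shape, c.shape)  # (4,) (4,)
```

The cell state `c` is only ever scaled element-wise and added to, which is what lets information persist across many steps, while `h` is the gated, squashed view of it that is passed on as output.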
Encoder Decoder Models
In deep learning, the encoder-decoder model is a type of neural network mainly used for
tasks where both the input and output are sequences. This architecture is particularly useful
when the input and output sequences are not the same length, for example when translating a
sentence from one language to another, summarizing a paragraph, describing an image with a
caption or converting speech into text. It works in two stages:
• Encoder: The encoder takes the input data like a sentence and processes each word
one by one then creates a single, fixed-size summary of the entire input called a context
vector or latent space.
• Decoder: The decoder takes the context vector and begins to produce the output one
step at a time.
Encoder-Decoder Model Architecture
In an encoder-decoder model, the encoder and the decoder are separate networks, each with
its own specific task. These networks can be of different types, such as Recurrent Neural
Networks (RNNs), Long Short-Term Memory networks (LSTMs), Gated Recurrent Units
(GRUs), Convolutional Neural Networks (CNNs) or more advanced models
like Transformers.
Encoder
The encoder's job is to process the input data and convert it into a form that the model can
understand. In a Transformer-based encoder, this is done using two main components:
1. Self-Attention Layer: This layer helps the encoder focus on different parts of the input
data that are important for understanding the context. For example, in a sentence it
allows the model to consider how each word relates to the others.
2. Feed-Forward Neural Network: After the self-attention layer, this network processes
the information further to capture complex patterns and relationships in the data.
Decoder
The decoder takes the processed information from the encoder and generates the output. In a
Transformer-based decoder, it has three main components:
1. Self-Attention Layer: Similar to the encoder, this layer allows the decoder to focus on
different parts of the output it has generated so far.
2. Encoder-Decoder Attention Layer: This layer enables the decoder to focus on
relevant parts of the input data, helping it generate more accurate outputs.
3. Feed-Forward Neural Network: Like the encoder, the decoder uses this network to
process the information and generate the final output.
Working of Encoder Decoder Model
The working of the encoder-decoder model is shown in the diagram below. We will go through
it step by step:

Step 1: Tokenizing the Input Sentence


• The sentence "I am learning AI" is first broken into tokens: ["I", "am", "learning",
"AI"].
• Each word (token) is converted into a vector that a machine can understand. This
process is called embedding.
Step 2: Encoding the Input
• The Encoder processes these embeddings using self-attention.
• Self-attention helps the encoder to focus on important words. For example while
encoding "learning", it understands its relation with "I" and "AI."
• After processing, the encoder generates a Context Vector which captures the meaning
of the entire sentence. In the image, the arrows show how each word relates to the
others during encoding. The final output from the encoder is the context
representation.
Step 3: Passing the Context to the Decoder
• The Context Vector is passed to the Decoder, as shown in the image.
• It acts like a summary of the full input sentence.
Step 4: Decoder Generates Output Step-by-Step
• The Decoder uses the context and starts creating the output one word at a time.
• First it predicts the first word, then uses that to predict the second word, and so on.
Step 5: Decoder Attention
• While generating each word the decoder attends to different parts of the input sentence
to make better predictions.
• For example, when generating the translation of "learning," it might pay more attention
to the word "learning" in the input.
Step 6: Producing the Final Output
• The decoder continues generating until the full translated sentence is produced.
• Each output token depends on the previous ones and the input context. You finally see
the output tokens generated on the right side of the diagram completing the translation.
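The attention steps in this walkthrough can be sketched with scaled dot-product attention, the form used in Transformers. The sketch is a simplification under stated assumptions: the learned query, key and value projections are omitted, the embeddings are random stand-ins for the four tokens, and the function names are illustrative.

```python
import numpy as np

def softmax(scores, axis=-1):
    shifted = scores - scores.max(axis=axis, keepdims=True)   # for numerical stability
    e = np.exp(shifted)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Each query row receives a weighted average of the rows of V."""
    d_k = K.shape[-1]
    weights = softmax(Q @ K.T / np.sqrt(d_k))   # one row of weights per query
    return weights @ V, weights

rng = np.random.default_rng(0)
tokens = ["I", "am", "learning", "AI"]
X = rng.normal(size=(len(tokens), 8))           # stand-in embeddings, 8 dimensions

# Self-attention in the encoder: queries, keys and values all come from the input
enc_out, enc_weights = scaled_dot_product_attention(X, X, X)

# Encoder-decoder attention: queries come from (hypothetical) decoder states
dec_states = rng.normal(size=(2, 8))
context, cross_weights = scaled_dot_product_attention(dec_states, enc_out, enc_out)
print(enc_weights.shape, cross_weights.shape)   # (4, 4) (2, 4)
```

Row `i` of `cross_weights` says how strongly output position `i` attends to each input token, which is the "pays more attention to" behaviour described in Step 5.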

seq2seq Model
Sequence-to-Sequence (Seq2Seq) models are neural networks designed to transform one
sequence into another, even when the input and output lengths differ. They are built using
the encoder-decoder architecture.
• It processes an input sequence and generates a corresponding output sequence.
• Handles variable-length input and output sequences
• It is used in NLP, machine translation, speech recognition and time-series prediction.

Both the input and the output are treated as sequences of varying lengths and the model is
composed of two parts:
1. Encoder:
• Processes the input sequence token by token.
• Encodes the entire sequence into a fixed-length context vector (or a series of hidden
states) that summarizes the important information from the input.
2. Decoder:
• Takes the context vector as input.
• Generates the output sequence one token at a time, predicting each token based on the
context vector and previously generated tokens.
The model is commonly used in tasks that map between sequences of varying lengths,
such as converting a sentence in one language to another or predicting a sequence of future
events based on past data (i.e. time-series forecasting).
Seq2Seq with RNNs
In the simplest Seq2Seq model, RNNs are used in both the encoder and decoder to process
sequential data. For a given input sequence $(x_1, x_2, \ldots, x_T)$, an RNN generates a sequence
of outputs $(y_1, y_2, \ldots, y_T)$ through iterative computation based on the following equations:
$$h_t = \sigma(W^{hx} x_t + W^{hh} h_{t-1})$$
$$y_t = W^{yh} h_t$$
Here
• $h_t$ represents the hidden state at time step t
• $x_t$ represents the input at time step t
• $W^{hx}$, $W^{hh}$ and $W^{yh}$ represent the weight matrices
• $h_{t-1}$ represents the hidden state from the previous time step (t-1)
• $\sigma$ represents the sigmoid activation function.
• $y_t$ represents the output at time step t
Limitations of Vanilla RNNs:
• Vanilla RNNs struggle with long-term dependencies due to the vanishing gradient
problem.
• To overcome this, advanced RNN variants like LSTM (Long Short-Term Memory) or
GRU (Gated Recurrent Unit) are used in Seq2Seq models. These architectures are better
at capturing long-range dependencies.
How Does the Seq2Seq Model Work?
A Sequence-to-Sequence (Seq2Seq) model consists of two primary phases: encoding the input
sequence and decoding it into an output sequence.
1. Encoding the Input Sequence
• The encoder processes the input sequence token by token, updating its internal state at
each step.
• After processing the entire sequence, the encoder produces a context vector, i.e. a
fixed-length representation summarizing the important information from the input.
2. Decoding the Output Sequence
The decoder takes the context vector and generates the output sequence one token at a time.
For example, in machine translation:
• Input: "I am learning"
• Output: "J'apprends"
Each token is predicted based on the context vector and previously generated tokens.
3. Teacher Forcing
During training, teacher forcing is commonly used. Instead of feeding the decoder’s own
previous prediction as the next input, the actual target token from the training data is provided.
Benefits:
• Accelerates training
• Reduces error propagation
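The encode, decode and teacher-forcing steps can be sketched with a tiny RNN encoder-decoder. Everything specific here is an assumption for illustration: the vocabulary size, the use of token 0 as a start symbol, the shared embedding table and the untrained random weights. The sketch shows only how data flows, not a trained translator.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, hidden = 6, 8                                   # hypothetical sizes; token 0 is <start>
E = rng.normal(scale=0.5, size=(vocab, hidden))        # embedding table
W_enc = rng.normal(scale=0.5, size=(hidden, hidden))   # encoder recurrence
W_dec = rng.normal(scale=0.5, size=(hidden, hidden))   # decoder recurrence
W_out = rng.normal(scale=0.5, size=(vocab, hidden))    # hidden state -> token scores

def encode(src_tokens):
    """Read the input token by token; the final state is the context vector."""
    h = np.zeros(hidden)
    for tok in src_tokens:
        h = np.tanh(E[tok] + W_enc @ h)
    return h

def decode(context, targets, teacher_forcing):
    """Generate one token per target position, starting from the context vector."""
    h, prev_tok, preds = context, 0, []
    for target in targets:
        h = np.tanh(E[prev_tok] + W_dec @ h)
        pred = int(np.argmax(W_out @ h))               # greedy choice of next token
        preds.append(pred)
        # Teacher forcing feeds the true token; otherwise feed the model's own guess
        prev_tok = target if teacher_forcing else pred
    return preds

src, tgt = [1, 2, 3], [4, 5, 2, 1]
context = encode(src)
forced = decode(context, tgt, teacher_forcing=True)
free = decode(context, tgt, teacher_forcing=False)
print(forced, free)
```

With teacher forcing, an early wrong prediction does not change the inputs of later steps, which is why errors propagate less during training.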

Autoencoders vs. PCA


Introduction
In data science, dimensionality reduction is an essential technique that helps simplify high-
dimensional data while retaining the most critical features. Two popular methods for
dimensionality reduction are Principal Component Analysis (PCA) and Autoencoders. PCA
is a classic linear approach, while Autoencoders leverage neural networks for more complex,
non-linear data compression. This section compares PCA and Autoencoders, explores
how they work, and discusses their strengths, limitations, and use cases to help you choose the
right approach for your data.
What is Principal Component Analysis (PCA)?
Principal Component Analysis (PCA) is a statistical method used to reduce the
dimensionality of data by transforming it into a new set of orthogonal axes (called principal
components) that capture the most variance. PCA projects data onto these components in a
way that maximizes the variance in fewer dimensions, allowing for more compact
representation without losing too much information.
How PCA Works:

1. Standardize the Data: Normalize the data to ensure each feature has equal weight.
2. Covariance Matrix Calculation: Compute the covariance matrix to identify how
variables relate to each other.
3. Eigenvectors and Eigenvalues: Calculate the eigenvectors and eigenvalues of the
covariance matrix to determine the direction and magnitude of variance in the data.
4. Project Data onto Principal Components: Select the top principal components (those
with the highest eigenvalues) and project the original data onto this lower-dimensional
space.

Mathematical Representation:

Let ‘X’ be the input data matrix (standardized), and the goal is to find the projection
matrix ‘W’ that transforms ‘X’ into a lower-dimensional subspace:
$$Z = XW$$
Where:
• ‘Z’ is the projected data (in reduced dimensions),
• ‘W’ is the matrix of eigenvectors corresponding to the largest eigenvalues of the
covariance matrix.
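The principal components maximize the variance:
$$\max_{w_i,\ \lVert w_i \rVert = 1} \mathrm{Var}(Z_i), \qquad Z_i = X w_i$$
Where ‘Zi’ are the principal components and $w_i$ is the i-th column of ‘W’.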
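The four steps of PCA listed earlier can be sketched directly with NumPy's eigendecomposition. The data below is synthetic and the function name `pca` is an assumption for illustration; libraries such as scikit-learn provide a ready-made implementation.

```python
import numpy as np

def pca(X, n_components):
    """PCA via eigendecomposition of the covariance matrix."""
    X_std = (X - X.mean(axis=0)) / X.std(axis=0)       # 1. standardize each feature
    cov = np.cov(X_std, rowvar=False)                  # 2. covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)             # 3. eigen-pairs (ascending order)
    order = np.argsort(eigvals)[::-1][:n_components]   #    keep the largest eigenvalues
    W = eigvecs[:, order]                              #    projection matrix
    Z = X_std @ W                                      # 4. project: Z = XW
    return Z, W, eigvals[order]

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                          # synthetic data, 3 features
X[:, 2] = X[:, 0] + 0.1 * rng.normal(size=100)         # make two features correlated

Z, W, variances = pca(X, n_components=2)
print(Z.shape, W.shape)        # (100, 2) (3, 2)
print(variances)               # variance captured by each component, largest first
```

Because the columns of `W` are eigenvectors of the covariance matrix, the variance of each column of `Z` equals the corresponding eigenvalue.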


Advantages of PCA:
• Simplicity: PCA is easy to implement and computationally efficient.
• Interpretable: PCA provides an intuitive understanding of variance and how features
relate to each other.
• Linear Dimensionality Reduction: Suitable for data that lies in or near a linear subspace,
making it highly effective when the relationships between features are approximately linear.
Limitations of PCA:
• Linearity Assumption: PCA is a linear method, so it struggles with capturing complex,
non-linear relationships in data.
• Sensitivity to Scaling: PCA requires data to be normalized, as it is sensitive to the scale
of the variables.
• Lack of Flexibility: PCA does not handle highly non-linear datasets well, leading to
suboptimal performance when data has complex structures.
What is an Autoencoder?
Autoencoders are a type of neural network designed for unsupervised learning. They
learn to compress (encode) data into a lower-dimensional space and then reconstruct
(decode) the original data from this compressed representation. Unlike PCA,
Autoencoders can capture non-linear relationships, making them more suitable for
complex datasets with non-linear structures.
How Autoencoders Work:
• Encoder: The first part of the Autoencoder network compresses the input data into a
lower-dimensional latent space (bottleneck). The goal is to capture the most important
features of the data in fewer dimensions.
• Latent Representation: The compressed version of the data is represented in the
bottleneck layer, which has fewer neurons than the input.
• Decoder: The second part of the network attempts to reconstruct the original input data
from the compressed latent space.
Structure of an Autoencoder:
• Input Layer: Original high-dimensional data.
• Hidden Layers: Intermediate layers used for encoding and decoding.
• Bottleneck Layer: The lower-dimensional latent space representation of the data.
• Output Layer: Reconstructed version of the input data.
Mathematical Representation: Let ‘X’ represent the input data. The Autoencoder
consists of an encoder function ‘f’ and a decoder function ‘g’:
• Z = f(X) → (Encoding: compressed representation)
• X̂ = g(Z) → (Decoding: reconstruction of the input)
• The Autoencoder aims to minimize the reconstruction loss ‘L’ between the
input ‘X’ and its reconstruction ‘X̂’, commonly the squared error:
$$L(X, \hat{X}) = \lVert X - \hat{X} \rVert^2$$
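The encoder, bottleneck and decoder can be sketched as a one-hidden-layer autoencoder trained with plain gradient descent. This is a minimal sketch under assumptions: synthetic data, a tanh encoder with a linear decoder, mean squared error as the reconstruction loss, no biases, and hand-written gradients instead of a deep learning framework.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))                   # synthetic data: 200 samples, 8 features
W_enc = rng.normal(scale=0.1, size=(8, 3))      # encoder weights, bottleneck of size 3
W_dec = rng.normal(scale=0.1, size=(3, 8))      # decoder weights

def forward(X, W_enc, W_dec):
    Z = np.tanh(X @ W_enc)                      # Z = f(X): non-linear encoding
    X_hat = Z @ W_dec                           # X_hat = g(Z): linear decoding
    return Z, X_hat

def mse(X, X_hat):
    return np.mean((X - X_hat) ** 2)            # reconstruction loss

lr, losses = 0.1, []
for _ in range(500):
    Z, X_hat = forward(X, W_enc, W_dec)
    losses.append(mse(X, X_hat))
    d_X_hat = 2.0 * (X_hat - X) / X.size        # gradient of the loss w.r.t. X_hat
    d_W_dec = Z.T @ d_X_hat
    d_Z = d_X_hat @ W_dec.T
    d_W_enc = X.T @ (d_Z * (1.0 - Z ** 2))      # back through tanh
    W_enc -= lr * d_W_enc
    W_dec -= lr * d_W_dec

print(losses[0], losses[-1])                    # the loss should go down during training
```

The three-unit bottleneck forces the network to summarise eight features in three numbers, so the reconstruction cannot be perfect on this random data; the point of the sketch is the encode, decode, compare and update loop.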


Advantages of Autoencoders:
• Non-Linear Dimensionality Reduction: Autoencoders can learn non-linear
transformations, making them suitable for complex datasets that cannot be captured by
linear models like PCA.
• Customizable Architectures: The flexibility of neural networks allows for deeper
architectures and more complex models tailored to specific tasks.
• Unsupervised Feature Learning: Autoencoders can discover underlying features in data
without the need for labeled data.
Limitations of Autoencoders:
• More Complex to Train: Autoencoders require significant computational resources and
expertise to train, especially if deep architectures are used.
• Prone to Overfitting: Without careful regularization, Autoencoders can overfit the training
data, leading to poor generalization.
• Interpretability: The latent space representation learned by Autoencoders is often less
interpretable than the principal components from PCA.
Key Differences Between PCA and Autoencoders

Recurrent Neural Networks (RNN)


Architecture Components
Input layer → Hidden state → Output layer
At time step 𝑡:
ℎ𝑡 = 𝑓(𝑊𝑥 𝑥𝑡 + 𝑊ℎ ℎ𝑡−1 + 𝑏)
𝑦𝑡 = 𝑔(𝑊𝑦 ℎ𝑡 )

Where:
• 𝑥𝑡 = input at time 𝑡
• ℎ𝑡 = hidden state
• ℎ𝑡−1= previous hidden state
• 𝑊𝑥 , 𝑊ℎ , 𝑊𝑦 = weight matrices
• 𝑏 = bias
• 𝑓 = hidden activation function (tanh/ReLU)
• 𝑔 = output activation function

Given:

𝑥𝑡 = 0.5
ℎ𝑡−1 = 0.3
𝑊𝑥 = 0.8
𝑊ℎ = 0.6

Compute the hidden state ℎ𝑡 using tanh activation, taking the bias 𝑏 = 0 since no bias is given.


Solution

ℎ𝑡 = 𝑡𝑎𝑛ℎ(𝑊𝑥 𝑥𝑡 + 𝑊ℎ ℎ𝑡−1 )
ℎ𝑡 = 𝑡𝑎𝑛ℎ((0.8)(0.5) + (0.6)(0.3))
= 𝑡𝑎𝑛ℎ(0.4 + 0.18)
= 𝑡𝑎𝑛ℎ(0.58)
ℎ𝑡 ≈ 0.52
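The arithmetic in this solution can be checked with a few lines of Python. The variable names are chosen for the example, and the bias is left out, as in the solution above.

```python
import math

x_t, h_prev = 0.5, 0.3      # input and previous hidden state from the problem
W_x, W_h = 0.8, 0.6         # input and recurrent weights from the problem

pre_activation = W_x * x_t + W_h * h_prev    # 0.4 + 0.18 = 0.58
h_t = math.tanh(pre_activation)

print(round(pre_activation, 2), round(h_t, 2))  # 0.58 0.52
```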
