RNN:
In traditional neural networks, inputs and outputs are treated independently. However,
tasks like predicting the next word in a sentence require information from previous words
to make accurate predictions. To address this limitation, Recurrent Neural Networks
(RNNs) were developed.
Recurrent Neural Networks introduce a mechanism where the output from one step is fed
back as input to the next, allowing them to retain information from previous inputs. This
design makes RNNs well-suited for tasks where context from earlier steps is essential,
such as predicting the next word in a sentence.
The defining feature of RNNs is their hidden state—also called the memory state—which
preserves essential information from previous inputs in the sequence. By using the same
parameters across all steps, RNNs perform consistently across inputs, reducing parameter
complexity compared to traditional neural networks. This capability makes RNNs highly
effective for sequential tasks.
KEY COMPONENTS OF RNN:
[Link] neurons:
A recurrent neuron is a type of artificial neuron that retains a connection to itself from the
previous time step.
This enables the neuron to maintain a hidden state that acts as a memory, storing
information about prior inputs.
[Link] unfolding:
Unfolding refers to the process of representing the recurrent structure of the RNN as a
sequence of interconnected layers, one for each time step.
This transformation allows the network to be trained using standard backpropagation
(specifically, backpropagation through time, or BPTT).
Types Of Recurrent Neural Networks
There are four types of RNNs based on the number of inputs and outputs in the network:
1. One-to-One RNN
One-to-One RNN behaves as the Vanilla Neural Network, is the simplest type of neural
network architecture. In this setup, there is a single input and a single output. Commonly
used for straightforward classification tasks where input data points do not depend on
previous elements.
One to One RNN
2. One-to-Many RNN
In a One-to-Many RNN, the network processes a single input to produce multiple
outputs over time. This setup is beneficial when a single input element should generate a
sequence of predictions.
For example, for image captioning task, a single image as input, the model predicts a
sequence of words as a caption.
One to Many RNN
3. Many-to-One RNN
The Many-to-One RNN receives a sequence of inputs and generates a single output.
This type is useful when the overall context of the input sequence is needed to make one
prediction.
In sentiment analysis, the model receives a sequence of words (like a sentence) and
produces a single output, which is the sentiment of the sentence (positive, negative, or
neutral).
Many to One RNN
4. Many-to-Many RNN
The Many-to-Many RNN type processes a sequence of inputs and generates a sequence
of outputs. This configuration is ideal for tasks where the input and output sequences
need to align over time, often in a one-to-one or many-to-many mapping.
In language translation task, a sequence of words in one language is given as input, and a
corresponding sequence in another language is generated as output.
Training Recurrent Neural Networks (RNNs) poses unique challenges due to their
sequential nature and the need to propagate information over time. Among the most
significant issues are the vanishing gradient and exploding gradient problems, which
severely impact the training of RNNs, especially when dealing with long sequences.
Challenges in Training RNNs
RNNs involve processing sequential data, where the hidden state at each time step
depends on the current input and the hidden state of the previous step. This temporal
dependency causes gradients to propagate backward through time, which can lead to:
1. Vanishing Gradient Problem:
○ During backpropagation through time (BPTT), gradients of the loss
function with respect to earlier weights diminish exponentially as they are
multiplied by small derivatives (e.g., from activation functions like tanh or
sigmoid).
○ This makes it difficult for the network to learn long-term dependencies, as
the weights associated with earlier time steps receive extremely small
updates.
2. Exploding Gradient Problem:
○ Conversely, when the gradients are repeatedly multiplied by values greater
than 1 (e.g., large weights or improper initialization), they grow
exponentially.
○ This leads to extremely large weight updates, destabilizing the training
process, and causing the loss function to diverge.
METHODS TO OVERCOME CHALLENGES:
1. Gradient Clipping (for Exploding Gradients):
● Limit the gradients to a predefined maximum value during backpropagation to
prevent them from growing excessively.
● Common technique: scale gradients if their norm exceeds a threshold.
2. Use of Gated Architectures (for Vanishing Gradients):
● Long Short-Term Memory (LSTM):
○ Introduces gates (input, forget, and output) that control the flow of
information, allowing the model to retain long-term dependencies by
maintaining constant error flow.
● Gated Recurrent Units (GRU):
○ A simpler alternative to LSTMs, also designed to mitigate vanishing
gradients.
● These architectures use mechanisms like cell states and gates to preserve and
update information over long sequences.
3. Better Weight Initialization:
● Use methods like Xavier or He initialization to ensure weights start with
appropriate magnitudes, reducing the risk of vanishing or exploding gradients.
4. Use of Activation Functions:
● Replace squashing functions like sigmoid or tanh with ReLU or its variants (e.g.,
Leaky ReLU) to reduce the risk of gradients shrinking excessively.
5. Layer Normalization:
● Normalize the activations within layers to stabilize gradients and improve
convergence.
6. Shorter Sequences:
● Use techniques like truncating the sequence or segmenting longer sequences into
smaller chunks to simplify gradient propagation.