
Supervised Deep Learning Techniques

Uploaded by

neeharika.sssvv
Copyright
© All Rights Reserved
SUPERVISED DEEP LEARNING

Without an activation function, we are restricted to a linear output only.
z = net input
b = bias term
f = activation function
a = output of one layer, passed as input to the next layer.

Why neural network?


A single neuron (like logistic regression) only permits a linear decision boundary. Most
real-world problems are considerably more complicated and so we need stacked neural
networks rather than having only a single neuron.
Moving in the forward direction, i.e., from input to output → feedforward networks.
Hidden_layer_size = (5,2) → two hidden layers, one with 5 units and the other with
2 units. len((5,2)) is the number of hidden layers we have.
Every layer between the input layer and the output layer is a hidden layer. Weights are
represented by matrices.

In practice, we do these computations for many datapoints at the same time, by
stacking the rows into a matrix. But the equations look the same.
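The forward pass described above can be sketched in a few lines. This is a minimal illustration only: the sigmoid activation, the constant weights of 0.1 and the layer sizes 3 → 5 → 2 → 1 are assumptions chosen for the example, not values from these notes.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def dense(inputs, weights, biases, activation):
    """One layer: a = f(z), where z = w . x + b is computed for each unit."""
    outputs = []
    for unit_weights, b in zip(weights, biases):
        z = sum(w * x for w, x in zip(unit_weights, inputs)) + b
        outputs.append(activation(z))
    return outputs

def forward(x, layers):
    """Feed one datapoint through every layer, input to output."""
    a = x
    for weights, biases in layers:
        a = dense(a, weights, biases, sigmoid)
    return a

# Hypothetical network: 3 inputs -> hidden layers (5, 2) -> 1 output.
sizes = [3, 5, 2, 1]
layers = [
    ([[0.1] * n_in for _ in range(n_out)], [0.0] * n_out)
    for n_in, n_out in zip(sizes[:-1], sizes[1:])
]

# Many datapoints at once: each row of the batch goes through the same equations.
batch = [[1.0, 2.0, 3.0], [0.0, 0.0, 0.0]]
predictions = [forward(row, layers) for row in batch]
print(predictions)
```

Each weight matrix has one row per unit in the layer, so the list of matrix shapes follows directly from the layer sizes.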
Optimization and Gradient Descent:
Update the parameters by going through one observation at a time – Stochastic
gradient descent.
Update the parameters using only a subset of the observations in our data – Mini-batch
gradient descent.

Gradient Descent:
Too large alpha (learning rate) → we overshoot the minimum.
Too small alpha → it takes more time to reach the minimum, i.e., more time to
optimize the model.
Each point can be iteratively calculated from the previous one.
This process is repeated and we eventually end up at a minimum (the global minimum
when the cost function is convex).

Stochastic Gradient Descent:

It speeds up each update by using only a single data point to determine the gradient of
the cost function.
The path is less direct due to the noise in a single data point or observation, hence
“Stochastic”.
Mini-batch gradient descent:
Let n be a number between 1 and the size of the entire dataset. Perform an update for
every n training examples.

Here, we reduce the memory needed relative to the original (vanilla) gradient descent,
which uses the entire dataset for each update.
It is less noisy and reaches the optimal value much more smoothly than stochastic
gradient descent.

• Select the method or methods that best help you find the same results as using
matrix linear algebra to solve the equation $\theta = (X^{T}X)^{-1}X^{T}y$.
o Use stochastic gradient descent, use scikit-learn to build a linear
regression model, or train a neural network model.
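To make the comparison concrete, here is a small sketch that fits y = t0 + t1*x with the closed-form solution and with full-batch, mini-batch and stochastic gradient descent. The toy data, learning rate and epoch count are assumptions chosen so that all methods land on the same answer; on noisy data the stochastic variants would only get close.

```python
import random

xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0 + 2.0 * x for x in xs]          # noise-free toy data: y = 1 + 2x

def normal_equation(xs, ys):
    """Closed form for y = t0 + t1*x: the 2x2 case of (X^T X)^-1 X^T y."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    det = n * sxx - sx * sx
    return (sxx * sy - sx * sxy) / det, (n * sxy - sx * sy) / det

def gradient_descent(xs, ys, batch_size, alpha=0.02, epochs=3000, seed=0):
    """batch_size=1 is SGD, batch_size=len(xs) is full batch, else mini-batch."""
    rng = random.Random(seed)
    t0, t1 = 0.0, 0.0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            # Gradient of half the mean squared error over this batch
            errors = [(t0 + t1 * xs[i]) - ys[i] for i in batch]
            g0 = sum(errors) / len(batch)
            g1 = sum(e * xs[i] for e, i in zip(errors, batch)) / len(batch)
            t0 -= alpha * g0
            t1 -= alpha * g1
    return t0, t1

print("normal equation:", normal_equation(xs, ys))
for size in (len(xs), 2, 1):
    print("batch size", size, "->", gradient_descent(xs, ys, size))
```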

Back-propagation:
Training a Neural Network

In a nutshell this is the process to train a neural network:

• Put in training inputs, get the output.
• Compare the output to the correct answers: look at the loss function J.
• Adjust and repeat.
• Backpropagation tells us how to make a single adjustment using calculus.

How have we trained before?

Ans: Using Gradient descent

1. Make prediction.
2. Calculate loss.
3. Calculate gradient of the loss function w.r.t. parameters.
4. Update the parameters by taking a step in the direction opposite to the gradient.
5. Iterate.
Backpropagation:

We obtain the desired changes to the weights using calculus:

o Functions are chosen to have nice derivatives.
o Numerical issues are to be considered, such as exploding and vanishing gradients.
Using this partial derivative, we will be able to update the weights in the correct
direction.

So, the idea of backpropagation is that we first run our neural network with our
initialized weights. Then, moving back through the layers, we take the derivative of the
loss function with respect to each weight in the final layer. We use that, via the chain
rule, to get the partial derivatives with respect to the layer 2 weights and finally the
layer 1 weights. We use these derivatives to update the initialized values, then feed the
updated weights through the neural net again and repeat the process.
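The procedure in the paragraph above can be checked numerically on a very small network. The sketch below assumes one input, one sigmoid hidden unit, one linear output unit and a squared-error loss; all numbers are made up for illustration. It computes the gradients with the chain rule and compares them to finite-difference estimates.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def loss(params, x, y):
    w1, b1, w2, b2 = params
    a1 = sigmoid(w1 * x + b1)          # hidden layer
    y_hat = w2 * a1 + b2               # linear output layer
    return 0.5 * (y_hat - y) ** 2

def backprop(params, x, y):
    w1, b1, w2, b2 = params
    # Forward pass, keeping the intermediate values
    a1 = sigmoid(w1 * x + b1)
    y_hat = w2 * a1 + b2
    # Backward pass: chain rule from the loss back to each parameter
    d_yhat = y_hat - y                 # dJ/dy_hat
    d_w2 = d_yhat * a1
    d_b2 = d_yhat
    d_a1 = d_yhat * w2
    d_z1 = d_a1 * a1 * (1.0 - a1)      # sigmoid'(z1) = a1 * (1 - a1)
    d_w1 = d_z1 * x
    d_b1 = d_z1
    return [d_w1, d_b1, d_w2, d_b2]

def numerical_gradient(params, x, y, eps=1e-6):
    grads = []
    for i in range(len(params)):
        up, down = list(params), list(params)
        up[i] += eps
        down[i] -= eps
        grads.append((loss(up, x, y) - loss(down, x, y)) / (2 * eps))
    return grads

params = [0.5, -0.3, 0.8, 0.1]         # initialized weights (arbitrary)
x, y = 1.5, 1.0
analytic = backprop(params, x, y)
numeric = numerical_gradient(params, x, y)

# One update step in the direction opposite to the gradient
alpha = 0.1
updated = [p - alpha * g for p, g in zip(params, analytic)]
print(loss(params, x, y), "->", loss(updated, x, y))
```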

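If there are more layers, the gradient gets very small at the early layers during
backpropagation; this is called the vanishing gradient. It happens because the sigmoid
output satisfies 0≤f(z)≤1 and its derivative f'(z) = f(z)(1 - f(z)) is at most 0.25, so
multiplying many such factors shrinks the gradient. For this reason, other activation
functions such as ReLU became more common.
Movement is in the backward direction.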
Sigmoid graph:

Tanh graph:

ReLU (Rectified Linear Unit) graph:


On the left side (negative inputs), rather than tiny changes, there is zero change. So,
these values will actually zero out particular nodes. This zeroing out allows us to
ignore nodes that may not be providing much extra information. Thus, it may be more
efficient than the sigmoid or hyperbolic tangent functions, which always maintain at
least some information at each node. On the other hand, no learning happens at the
nodes that are zeroed out, and perhaps you want to ensure some learning at all nodes.
The solution is Leaky ReLU.

Leaky ReLU graph:

Alpha is a small number here. Leaky ReLU is not necessarily better than ReLU all the time.
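The four activation functions and their derivatives can be written out directly. In this sketch the Leaky ReLU slope alpha = 0.01 is an assumed, commonly used value; the notes only say it is small.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)              # never larger than 0.25

def tanh_grad(z):
    return 1.0 - math.tanh(z) ** 2    # never larger than 1

def relu(z):
    return max(0.0, z)

def relu_grad(z):
    return 1.0 if z > 0 else 0.0      # zero gradient: the node stops learning

def leaky_relu(z, alpha=0.01):
    return z if z > 0 else alpha * z

def leaky_relu_grad(z, alpha=0.01):
    return 1.0 if z > 0 else alpha    # small but non-zero gradient

# Largest possible gradient factor contributed by 10 stacked sigmoid layers
# (ignoring the weights): it shrinks geometrically, hence "vanishing".
sigmoid_chain_bound = sigmoid_grad(0.0) ** 10
print(sigmoid_chain_bound)
```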

Regularization techniques for deep learning:

Dropout and early stopping are two of the regularization techniques used in neural
networks. With more layers, we can learn more complex models. These models may fit our
training data perfectly but overfit, and so not generalize well to new data.
To prevent overfitting, we have several means to regularize neural networks:
• Adding a regularization penalty to the cost function, similar to Lasso and Ridge.
• Dropout, where we randomly drop certain neurons in our network during training to
ensure our model is not over-reliant on any particular neuron.
• Early stopping, the idea of stopping gradient descent short so that the model is
not perfectly fit to the training set.
• Stochastic and mini-batch gradient descent (to some extent): they don’t perfectly fit
the training data and therefore may generalize better than full-batch gradient
descent.
Optimizers and Data shuffling:
Optimizers:
Optimizers are the different methods of optimizing the weights.

Momentum:
More momentum → more smoothing.
η is replaced by beta, and then alpha = 1 - beta. Alpha is the learning rate.
Momentum may cause the update to overshoot the optimum value.

AdaGrad (Adaptive Gradient):

The idea is to scale the update for each weight separately. η here is the learning rate.
• Update frequently updated weights less.
• Keep a running sum of the squares of the previous gradients.
• Divide new updates by the square root of that sum.
This leads to smaller updates each iteration. As we get closer to the optimal value, the
learning rate shrinks, and this avoids overshooting.
RMSProp (Root Mean Square Propagation):
Similar to AdaGrad.
Rather than using the sum of previous gradients, decay older gradients more than more
recent ones → more weight on more recent gradients.
It is more adaptive to recent gradients.
Adam (Adaptive moment estimation):
The idea is to use both first-moment (mean) and second-moment (squared) gradient
information and decay both over time.
Momentum + RMSProp = Adam

Generally, beta1 = 0.9 and beta2 = 0.999.
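The update rules above can be compared side by side on a one-parameter problem. This is a sketch under assumptions: the toy cost J(w) = (w - 3)^2, the learning rates, and the exponentially weighted form of momentum are choices made for the example, and the step functions are hypothetical helpers rather than any library's API.

```python
import math

def grad(w):
    """Gradient of the toy cost J(w) = (w - 3)^2."""
    return 2.0 * (w - 3.0)

def momentum_step(w, state, eta=0.1, beta=0.9):
    v = beta * state.get("v", 0.0) + (1 - beta) * grad(w)   # smoothed gradient
    state["v"] = v
    return w - eta * v

def adagrad_step(w, state, eta=0.5, eps=1e-8):
    g = grad(w)
    state["sum"] = state.get("sum", 0.0) + g * g            # running sum of squares
    return w - eta * g / (math.sqrt(state["sum"]) + eps)

def rmsprop_step(w, state, eta=0.05, rho=0.9, eps=1e-8):
    g = grad(w)
    state["s"] = rho * state.get("s", 0.0) + (1 - rho) * g * g   # decayed average
    return w - eta * g / (math.sqrt(state["s"]) + eps)

def adam_step(w, state, eta=0.1, beta1=0.9, beta2=0.999, eps=1e-8):
    g = grad(w)
    t = state["t"] = state.get("t", 0) + 1
    m = state["m"] = beta1 * state.get("m", 0.0) + (1 - beta1) * g      # momentum part
    v = state["v"] = beta2 * state.get("v", 0.0) + (1 - beta2) * g * g  # RMSProp part
    m_hat = m / (1 - beta1 ** t)      # bias correction for the zero initialisation
    v_hat = v / (1 - beta2 ** t)
    return w - eta * m_hat / (math.sqrt(v_hat) + eps)

def run(step, w=0.0, steps=500):
    state = {}
    for _ in range(steps):
        w = step(w, state)
    return w

for name, step in [("momentum", momentum_step), ("adagrad", adagrad_step),
                   ("rmsprop", rmsprop_step), ("adam", adam_step)]:
    print(name, run(step))
```

With a constant learning rate, RMSProp and Adam keep taking steps of roughly fixed size near the optimum, so they settle close to w = 3 rather than exactly on it.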

Which optimizer to choose?

RMSProp and Adam seem to be quite popular.
It can be difficult to predict in advance which will be best for a particular problem.
It is still an active area of inquiry.
• Adam speeds up the optimization process tremendously and usually does a fairly
good job at finding optimal solutions. There are going to be times when it does
have trouble converging.

Details of training a neural network:

Classical approach: Get the derivative for the entire dataset, then take a step in that
direction.
Pro: Each step is informed by all the data.
Con: Very slow, especially as the data gets big.
Stochastic gradient descent: Get the derivative for just one point, then take a step in
that direction. Here, steps are less informed, but you take more of them, which should
balance out the missteps. So, take a smaller step size. It also helps with regularization,
as it does not fit the training data perfectly.
Mini-batch gradient descent: Get the derivative for a small set of points, then take a
step in that direction. It strikes a balance between the extremes (full-batch and
stochastic gradient descent).

An epoch refers to a single pass through all the training data.

• In full-batch gradient descent, there would be one step taken per epoch.
• In SGD/online learning, there would be n steps taken per epoch (n = training set
size).
• In mini-batch, there would be (n/batch size) steps taken per epoch (n = number
of rows).

Training neural networks is sensitive to how the derivative for each weight is computed
and how convergence is reached. Important concepts involved at this step:

• Batching methods, which include techniques like full-batch, mini-batch, and
stochastic gradient descent, get the derivative for a set of points.
• Data shuffling, which aids convergence by making sure data is presented in a
different order every epoch.
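Batching and shuffling fit together in one small helper. In the sketch below, `iterate_minibatches` is a hypothetical name and the ten integers stand in for training rows; when n is not a multiple of the batch size, the last batch is smaller, so the number of steps per epoch is n/batch size rounded up.

```python
import math
import random

def iterate_minibatches(rows, batch_size, rng):
    """Yield shuffled mini-batches that cover every row exactly once (one epoch)."""
    order = list(range(len(rows)))
    rng.shuffle(order)                       # a different order every epoch
    for start in range(0, len(order), batch_size):
        yield [rows[i] for i in order[start:start + batch_size]]

rows = list(range(10))                       # stand-in for 10 training examples
rng = random.Random(0)

for batch_size in (len(rows), 1, 4):         # full batch, SGD, mini-batch
    batches = list(iterate_minibatches(rows, batch_size, rng))
    expected_steps = math.ceil(len(rows) / batch_size)
    print("batch size", batch_size, "steps per epoch", len(batches), expected_steps)
```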
Keras library:
Some of the most common libraries:
• TensorFlow – Developed by Google. Used to build AI-related products.
o It has Keras built in.
• Theano – Grandfather of deep learning frameworks.
o No longer actively developed as of 2018.
• PyTorch – Research oriented. Developed by Facebook.
Keras is a very high-level library that can run on either TensorFlow or Theano.
Typical command structure:
o Build the structure of the network (how many layers we want).
o Compile the model, specifying the loss function, metrics, and optimizer (this
includes the learning rate).
o Fit the model to the training data (specifying batch size and number of
epochs).
o Predict on new data.
o Evaluate results.

Keras provides two approaches to building the structure of a model:

• Sequential model: allows a linear stack of layers – simpler and more convenient
if the model has this form.
• Functional API: more detailed and complex but allows more complicated
architectures.
In machine learning there is an approach called early stopping to avoid overfitting. In
that approach you plot the error rate on the training and validation data. The horizontal
axis is the number of epochs and the vertical axis is the error rate. You should stop
training when the error rate on the validation data is at its minimum. If you keep
increasing the number of epochs beyond that point, you will have an over-fitted model.
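The stopping rule can be sketched without any framework. The `patience` parameter below is an assumption, not something these notes define: it is a common way to tolerate a few noisy epochs before giving up. The validation errors are made-up numbers.

```python
def early_stopping_epoch(val_errors, patience=2):
    """Return the index of the epoch whose weights should be kept."""
    best_epoch, best_error = 0, float("inf")
    for epoch, error in enumerate(val_errors):
        if error < best_error:
            best_epoch, best_error = epoch, error
        elif epoch - best_epoch >= patience:
            break       # no improvement for `patience` epochs: stop training
    return best_epoch

# Hypothetical validation error per epoch: it falls, then starts rising.
val_errors = [0.50, 0.40, 0.33, 0.30, 0.31, 0.34, 0.29]

print(early_stopping_epoch(val_errors, patience=2))
print(early_stopping_epoch(val_errors, patience=5))
```

A small patience stops at the first clear minimum, while a larger patience trains longer and may find a later, lower minimum.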

Convolutional Neural Network:

Convolutional layers have relatively few weights, and CNNs have more layers than other
architectures. In practice, data scientists add layers to CNNs to solve specific problems
using Transfer Learning.

A kernel is used to find edges, corners, etc. in our image. A kernel is a grid of weights
“overlaid” on the image, centered on one pixel.

o Each kernel weight is multiplied by the pixel value it is overlaid on.
o The output over the centered pixel is the sum of these products:
$\sum_{p} W_p \cdot \text{pixel}_p$. This is the convolution operation.

Kernels are local feature detectors. A kernel does not need to be square.
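A sketch of this operation in plain Python follows. The image and the vertical-edge kernel are made-up examples. As in most deep learning libraries, the kernel is applied without flipping it (strictly speaking a cross-correlation), with no padding and a stride of 1.

```python
def convolve2d(image, kernel):
    """Slide the kernel over the image and sum weight * pixel at each position."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    output = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            row.append(sum(kernel[a][b] * image[i + a][j + b]
                           for a in range(kh) for b in range(kw)))
        output.append(row)
    return output

# Image with a dark-to-bright step in the middle
image = [
    [0, 0, 0, 1, 1, 1],
    [0, 0, 0, 1, 1, 1],
    [0, 0, 0, 1, 1, 1],
]
# Vertical-edge detector: responds where left and right neighbours differ
kernel = [
    [-1, 0, 1],
    [-1, 0, 1],
    [-1, 0, 1],
]
edges = convolve2d(image, kernel)
print(edges)   # [[0, 3, 3, 0]]
```

The output is non-zero only where the kernel straddles the step, which is what "local feature detector" means in practice.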

Primary ideas behind CNN:

• Let the neural network learn which kernels are most useful.
• Use the same set of kernels across the entire image (translation invariance).
• This reduces the number of parameters and the variance (from a bias-variance point of
view).

When we work with these centered values and output one value per center pixel, the
edges and corners of our image are overlooked. This can be solved by padding.

Padding:

• Pixels at the edge are not used as center pixels since there are not enough
surrounding pixels.
• Padding adds extra pixels around the frame, so the pixels from the original image
become center pixels as the kernel moves across the image.
• These added pixels are typically of value zero (zero-padding).

Striding:

• Striding is the step size with which the kernel moves across the image.
• When stride > 1, it scales down the output dimension.
• Stride = 2 → move 2 steps both horizontally and vertically. The stride can be
different for horizontal and vertical steps.
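The combined effect of padding and stride on the output size follows the standard formula floor((n + 2p - k) / s) + 1 along each dimension. The helper name and the 28-pixel example below are assumptions made for illustration.

```python
def conv_output_size(n, kernel, padding=0, stride=1):
    """Output length along one dimension: floor((n + 2p - k) / s) + 1."""
    return (n + 2 * padding - kernel) // stride + 1

print(conv_output_size(28, 5))                        # no padding: 24
print(conv_output_size(28, 5, padding=2))             # zero-padding keeps 28
print(conv_output_size(28, 5, padding=2, stride=2))   # stride 2 halves it: 14
```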

Depth:

In images, we often have multiple numbers associated with each pixel location. These
numbers are referred to as channels. Example: RGB – 3 channels, CMYK – 4 channels.

CMYK is Cyan, Magenta, Yellow and Black. The number of channels is the depth. So, the
kernel itself will have a depth equal to the number of input channels.
Example: a 5 * 5 kernel on an RGB image → 5 * 5 * 3 (RGB) = 75 weights.

The output from the layer will also have depth.

• The network typically trains many different kernels.
• Each kernel outputs a single number at each pixel location.
• So, if there are 10 kernels → 10 different patterns are detected.

Pooling:

The idea is to reduce the image size by mapping a patch of pixels to a single value.

• Shrinks the dimensions of an image.
• Does not have parameters, though there are different types of pooling operations
(like max, average).
o Max-pool – For each distinct patch, represent it by its maximum value.
o Average-pool – For each distinct patch, represent it by its average value.
• Pooling is a fixed operation and we need not learn any weights.
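Both pooling types can be sketched with one helper that maps each distinct (non-overlapping) patch to a single value. The 4 * 4 image and the function names are made up for the example.

```python
def pool2d(image, size, reduce_fn):
    """Map each distinct size x size patch to one value using reduce_fn."""
    pooled = []
    for i in range(0, len(image) - size + 1, size):
        row = []
        for j in range(0, len(image[0]) - size + 1, size):
            patch = [image[i + a][j + b] for a in range(size) for b in range(size)]
            row.append(reduce_fn(patch))
        pooled.append(row)
    return pooled

def average(values):
    return sum(values) / len(values)

image = [
    [1, 3, 2, 4],
    [5, 7, 6, 8],
    [9, 2, 1, 0],
    [3, 4, 5, 6],
]
max_pooled = pool2d(image, 2, max)        # [[7, 8], [9, 6]]
avg_pooled = pool2d(image, 2, average)    # [[4.0, 5.0], [4.5, 3.0]]
print(max_pooled, avg_pooled)
```

There is nothing to train here: the reduction function is fixed, which is why pooling adds no weights.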

Transfer Learning

The main idea of Transfer Learning consists of keeping the early layers of a pre-trained
network and re-training the later layers for a specific application.

Last layers in the network capture features that are more particular to the specific data
you are trying to classify.

Later layers are easier to train as adjusting their weights has a more immediate impact
on the final result.

Guiding Principles for Fine Tuning

While there are no rules of thumb, these are some guiding principles to keep in mind:

• The more similar your data and problem are to the source data of the pre-trained
network, the less intensive fine-tuning will be.
• If your data is substantially different in nature than the data the source model
was trained on, Transfer Learning may be of little value.

CNN Architectures

LeNet-5

• Created by Yann LeCun in the 1990s.
• Used on the MNIST data set, black and white images.
• Novel idea: use convolutions to efficiently learn features on the data set.

AlexNet

• Considered the “flash point” for modern deep learning.
• Created in 2012 for the ImageNet Large Scale Visual Recognition Challenge
(ILSVRC).
• Task: predict the correct label from among 1000 classes.
• Dataset: around 1.2 million images.

AlexNet developers performed data augmentation for training.

• Cropping, horizontal flipping, and other manipulations.
• This augmentation helps reduce overfitting.

Basic AlexNet Template:

• Convolutions with ReLUs.
• Sometimes add maxpool after a convolutional layer.
• Fully connected layers at the end before a softmax classifier.

VGG

Simplified network structure: it has the same concepts and ideas as LeNet, but is
considerably deeper. The architecture is simpler but is still able to find more complex
features.

This architecture avoids manual choices of convolution size and uses a very deep
network with 3x3 convolutions.

Stacking these small convolutions gives rise to the effect of larger convolutions.

This was one of the first architectures to experiment with many layers (More is better!).
It can use multiple 3x3 convolutions to simulate larger kernels with fewer parameters,
and it served as a “base model” for future works.

VGG reduces the number of weights needed.

9, 49 and 27 are numbers of weights: one 3 * 3 kernel has 9, one 7 * 7 kernel has 49,
and three stacked 3 * 3 kernels, which cover the same 7 * 7 region, have 27.
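The arithmetic behind this claim can be checked directly. The sketch assumes stride 1 and a single channel, and the helper names are hypothetical; with stride 1, each extra k * k layer widens the region seen by one output value by k - 1 pixels.

```python
def stacked_receptive_field(kernel_size, n_layers):
    """Width of the input region seen by one output after n stacked layers."""
    return 1 + n_layers * (kernel_size - 1)

def n_weights(kernel_size, n_layers=1):
    """Weights in n stacked square kernels (single channel, no biases)."""
    return n_layers * kernel_size * kernel_size

print(stacked_receptive_field(3, 3), n_weights(3, 3))   # 7-wide view, 27 weights
print(n_weights(7))                                     # one 7 * 7 kernel: 49 weights
```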

Inception

Proposed by Szegedy et al. (2014), this architecture was built to turn each layer of the
neural network into further branches of convolutions. Each branch handles a smaller
portion of the workload. It combines different layers together in a single layer.

With Inception, the idea is that you may not know exactly what type of filter or what
type of layer you want at each step, so you may want to combine or try a bunch of them
together. But this can be computationally expensive, so we want to accomplish it with
some level of computational efficiency. We also want to reduce the total number of
activations that need to run through the entire network.

The network concatenates the different branches at the end. These networks use different
receptive fields and have sparse activations of groups of neurons.

Inception V3 is a relevant example of an Inception architecture.

ResNet

Researchers were building deeper and deeper networks but started finding these
issues:

In theory, very deep (56-layer) networks should fit the training data better (even if
they overfit), but that was not happening.

It seemed that the early layers were just not getting updated and the signal got lost
(due to vanishing-gradient-type issues).

These are the main reasons why adding layers does not always decrease training error:

• Early layers of deep networks are very slow to adjust.
• This is analogous to the “Vanishing Gradient” issue.
• In theory, the network should be able to just learn an “identity” transformation that
makes the deeper network behave like a shallow one.

In a nutshell, a ResNet:

• Has several layers such as convolutions.
• Enforces “best transformation” by adding “shortcut connections”.
• Adds the inputs from an earlier layer to the output of the current layer.
• Keeps passing both the initial unchanged information and the transformed
information to the next layer.
• Works with much deeper networks and still gets high accuracy.

CNN Applications (supervised):

1. Image recognition/classification (animals, digits, malignant/benign)
a. Automatic feature selection/extraction
2. Object detection in images
3. Coloring black and white images
4. Creating art images
5. Natural language processing
6. Speech recognition
7. Face detection
8. Recommender system
9. Image smoothing, blurring, noise filtering, edge detection.
Edge detection is the process of detecting primitive features of an image such as edges,
boundaries, and curves. This is done by using a kernel to convolve the image matrix.
Convolution types (1D):
1. Full (with 0 padding) → output length: len1 + len2 - 1
2. Same (with left-sided 0 padding) → output length: max(len1, len2)
3. Valid (without 0 padding) → output length: max(len1, len2) - min(len1, len2) + 1
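The three output lengths can be verified with a short sketch. The signal and kernel are made-up values. The "same" mode here follows the left-sided zero padding described in the list; this is an assumption about the intended convention, and some libraries (NumPy's `convolve`, for instance) centre the window instead.

```python
def conv1d_full(signal, kernel):
    """Full 1D convolution: every overlap between signal and kernel."""
    out = [0] * (len(signal) + len(kernel) - 1)
    for i, s in enumerate(signal):
        for j, k in enumerate(kernel):
            out[i + j] += s * k
    return out

def conv1d(signal, kernel, mode="full"):
    full = conv1d_full(signal, kernel)
    longer = max(len(signal), len(kernel))
    shorter = min(len(signal), len(kernel))
    if mode == "full":
        return full
    if mode == "same":                    # left-sided zero padding only
        return full[:longer]
    if mode == "valid":                   # only positions with complete overlap
        return full[shorter - 1:longer]
    raise ValueError("mode must be 'full', 'same' or 'valid'")

signal, kernel = [1, 2, 3, 4], [1, 1, 1]
print(conv1d(signal, kernel, "full"))     # length 4 + 3 - 1 = 6
print(conv1d(signal, kernel, "same"))     # length max(4, 3) = 4
print(conv1d(signal, kernel, "valid"))    # length 4 - 3 + 1 = 2
```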
Validation data is used to measure model properties such as classification error, and
from this to tune hyperparameters such as the number of hidden units or the stopping
point for backpropagation.
Test data is used to evaluate the performance and accuracy of the model against “real
life situations”. No further optimization is done beyond this point.

Transfer learning:
It is difficult to train on large datasets, as it takes more time to fit and is
computationally expensive. However, the basic features (edges, shapes) learned in the
early layers of the network should generalize fairly well to other datasets with similar
problems. Also, the results of the training are just weights (numbers), which are easy to
store.
The idea is to keep the early layers of a pre-trained network and re-train the later
layers for a specific application. This is called transfer learning.
Remove the final layer, or several layers from the back, of the pre-trained model and
train new layers on top of it.
The additional training of a pre-trained network on a specific new dataset is referred to
as Fine Tuning.
There are different options in “how much” and “how far back” to fine-tune.
• Should I train only the last layer?
• Go back a few layers?
• Re-train the entire network (from the starting point of the existing network)?
A few guiding principles of fine tuning:
• The more similar the data and problem are to the source data of the pre-trained
network, the less fine tuning is necessary.
o Example: Using a network trained on ImageNet to distinguish “dogs” from
“cats” should need relatively little fine tuning. It already
distinguished different breeds of dogs and cats, so it likely has all the
features you will need.
• The more data you have about your specific problem, the more the network will
benefit from longer and deeper fine tuning.
o Example: If you have 100 dogs and 100 cats in your training data, you
probably want to do little fine tuning, for example remove the final layer or
two and reuse many of the features learned from ImageNet.
o On the other hand, if you have 100,000 dogs and 100,000 cats, you may
get more value from longer and deeper fine tuning: going back further, or
even retraining the full network using the pre-trained network to initialize
the weights.
• If your data is substantially different in nature than the data the source model was
trained on, Transfer Learning may be of little value.
o Example: A network that is based on recognizing typed Latin alphabet
characters would not be useful in distinguishing dogs from cats. But it
would likely be useful as a starting point for recognizing Cyrillic alphabet
characters.
Recurrent Neural Network (RNN):
Recurrent Neural Networks are a class of neural networks that allow previous outputs to
be used as inputs while having hidden states. They are mostly used in applications of
natural language processing and speech recognition.

One of the main motivations for RNNs is to derive insights from text and do better than
“bag of words” implementations. Ideally, each word is processed or understood in the
appropriate context.

Words should be handled differently depending on “context”. Also, each word should
update the context.

Under the notion of recurrence, words are input one by one. This way, we can handle
variable lengths of text. This means that the response to a word depends on the words
that preceded it.

These are the two main outputs of an RNN:

• Prediction: What the prediction would be if the sequence ended with that
word.
• State: Summary of everything that happened in the past.
Mathematical Details

Mathematically, there are cores and subsequent dense layers.

current state = function1(old state, current input).

current output = function2(current state).

We learn function1 and function2 by training our network!

r = dimension of input vector
s = dimension of hidden state
t = dimension of output vector (after dense layer)

U is an s × r matrix (linear transformation applied to the input)
W is an s × s matrix (weights applied to the previous state)
V is a t × s matrix (maps the state to the output)

The weight matrices U, V, W are the same across all positions.

The kernel initializer is the weight initializer for the inputs, whereas the recurrent
initializer is the weight initializer for the states.
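One recurrent step with the three matrices can be sketched as follows. The tanh non-linearity for function1, the purely linear function2, and the constant weight values are all assumptions for illustration; the notes only fix the shapes of U, W and V.

```python
import math

def matvec(matrix, vector):
    return [sum(m * x for m, x in zip(row, vector)) for row in matrix]

def rnn_step(U, W, V, state, x):
    """current state = tanh(U x + W old_state); current output = V current_state."""
    pre = [a + b for a, b in zip(matvec(U, x), matvec(W, state))]
    new_state = [math.tanh(p) for p in pre]
    output = matvec(V, new_state)
    return new_state, output

r, s, t = 3, 2, 1                       # input, state and output dimensions
U = [[0.1] * r for _ in range(s)]       # s x r
W = [[0.2] * s for _ in range(s)]       # s x s
V = [[0.5] * s for _ in range(t)]       # t x s

state = [0.0] * s
sequence = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [1.0, 1.0, 1.0]]
outputs = []
for x in sequence:                      # the same U, W, V at every position
    state, y = rnn_step(U, W, V, state, x)
    outputs.append(y)
print(outputs[-1], state)
```

Training on just the final output would use only `outputs[-1]`; the intermediate outputs are computed but ignored.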

Practical Details

Often, we train on just the “final” output and ignore intermediate outputs.

A slight variation of backpropagation called Backpropagation Through Time (BPTT) is used
to train RNNs.

Training is sensitive to the length of the sequence (due to the “vanishing/exploding
gradient” problem).

In practice, we still set a maximum length for our sequences. If the input is shorter than
the maximum, we “pad” it. If the input is longer than the maximum, we truncate it.
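A minimal version of this padding and truncation step is shown below. The helper name, the pad value 0, and the choice to pad and cut at the end of the sequence are assumptions; libraries differ on whether they pad at the front or the back.

```python
def pad_or_truncate(tokens, max_len, pad_value=0):
    """Force a token list to exactly max_len entries."""
    if len(tokens) >= max_len:
        return tokens[:max_len]                               # too long: truncate
    return tokens + [pad_value] * (max_len - len(tokens))     # too short: pad

print(pad_or_truncate([4, 8, 15], 5))                  # [4, 8, 15, 0, 0]
print(pad_or_truncate([4, 8, 15, 16, 23, 42, 7], 5))   # [4, 8, 15, 16, 23]
```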

RNN Applications

RNNs often focus on text applications, but are commonly used for other sequential data:

• Forecasting: Customer Sales, Loss Rates, Network Traffic.
• Speech Recognition: Call Center Automation, Voice Applications.
• Manufacturing Sensor Data
• Genome Sequences
Weakness of RNN:

The nature of the state transition means it is hard to keep information from the distant
past in current memory without reinforcement. Example: “I am from France, I speak ___.”
In this blank we expect the RNN to fill in “French”. But an RNN cannot remember long
sequences. This is the weakness of RNNs. The solutions to this are LSTM and GRU. LSTMs
have a more complex mechanism for updating the state.
Structure of an RNN model: pad or truncate the input to the maximum length → Embedding
layer → RNN → Dense layer. Here, the embedding layer maps words to vectors so that
similar words (fast, quickly) have similar embeddings to be passed into the network.

Standard RNNs have poor memory.

• The transition matrix necessarily weakens the signal.
o Solution: we need a structure that can leave some dimensions unchanged
over many steps.
o This problem is addressed by LSTM.

Long Short-Term Memory RNNs (LSTM)

LSTMs are a special kind of RNN (invented in 1997). The motivation for LSTMs is to solve
one of the main weaknesses of RNNs: their transitional nature makes it hard to keep
information from the distant past in current memory without reinforcement. LSTMs
define a more complicated update mechanism for changing the internal state. By
default, LSTMs remember the information from the last step. On top of that, there is
more flexibility in retaining or forgetting large portions of information from earlier
steps, not just the last one (remembering).

LSTMs have a more complex mechanism for updating the state.

Standard RNNs have poor memory because the transition matrix necessarily weakens the
signal.

This is the problem addressed by Long Short-Term Memory RNNs (LSTM).

To solve it, you need a structure that can leave some dimensions unchanged over many
steps.

• By default, LSTMs remember the information from the last step.
• Items are overwritten as an active choice.

The idea RNNs use for updating states is old, but the computing power available to apply
it to sequence-to-sequence mapping, explicit memory units, and text generation tasks is
relatively new.

Augment RNNs with a few additional Gate Units:

• Gate Units control how long/if events will stay in memory.
• Input Gate: When its value is close to 1, it causes items to be stored in memory.
• Forget Gate: When its value is close to 0, it causes items to be removed from memory.
• Output Gate: When its value is close to 1, it causes the hidden unit to feed forward
(output) in the network.

The cell state gets updated in two stages. The × (multiplication) gate is the forget
gate: based on the prior hidden state and the current input coming in, it decides what
information from the prior cell state to forget. The + gate is the "add new information"
portion, which tells us what new information is worth maintaining.

This is the forget gate. It takes the previous hidden state, concatenates it with the
current input, and multiplies the result by the weights (Wf) of the forget gate. This is
then passed through a sigmoid function, whose output is between 0 and 1.

This is the input gate, with its own weights (Wi). It has the same functionality as the
forget gate, except that it also has a tanh function, whose output is between -1 and 1.
The idea is that the tanh output is the actual information you are deciding whether or
not to add. The sigmoid output, between zero and one, then tells you what portion of
that new information to add. If it is close to one, we add all of the information. If it
is close to zero, we add very little of that new information.

This is used to find the cell state.
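For reference, the gate descriptions above correspond to the standard LSTM equations, written out below. The figures that accompanied these notes are not available, so the symbols other than Wf and Wi (the biases, the candidate weights W_C, the output-gate weights W_o, and the subscripted names) are assumptions taken from the usual textbook formulation. Here [h_{t-1}, x_t] is the previous hidden state concatenated with the current input, and \odot is element-wise multiplication.

```latex
\begin{aligned}
f_t &= \sigma\left(W_f\,[h_{t-1}, x_t] + b_f\right) && \text{(forget gate, between 0 and 1)}\\
i_t &= \sigma\left(W_i\,[h_{t-1}, x_t] + b_i\right) && \text{(input gate, between 0 and 1)}\\
\tilde{C}_t &= \tanh\left(W_C\,[h_{t-1}, x_t] + b_C\right) && \text{(candidate information, between -1 and 1)}\\
C_t &= f_t \odot C_{t-1} + i_t \odot \tilde{C}_t && \text{(cell state: forget, then add)}\\
o_t &= \sigma\left(W_o\,[h_{t-1}, x_t] + b_o\right) && \text{(output gate)}\\
h_t &= o_t \odot \tanh(C_t) && \text{(new hidden state)}
\end{aligned}
```

The fourth line is the two-stage update of the cell state: the multiplication by f_t is the forget stage and the addition of i_t \odot \tilde{C}_t is the "add new information" stage.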


Gated Recurrent Units (GRUs)

GRUs are a gating mechanism for RNNs that is an alternative to LSTM. It is based on
the principle of a removed cell state:

• The hidden state is now used to transfer past information.
• Think of it as a “simpler” and faster version of LSTM.

These are the gates of a GRU:

Reset gate: helps decide how much past information to forget.

Update gate: helps decide what information to throw away and what new information to
keep.

LSTM vs GRU

LSTMs are a bit more complex and may therefore be able to find more complicated
patterns.

Conversely, GRUs are a bit simpler and therefore are quicker to train.

GRUs will generally perform about as well as LSTMs with shorter training time,
especially for smaller datasets.
In Keras it is easy to switch from one to the other by specifying a layer type, so it is
relatively quick to swap one for the other.

Sequence-to-Sequence Models (Seq2Seq)

Thinking back to how any type of RNN interprets text, the model has a new hidden state
at each step of the sequence, containing information about all past words. Seq2Seq is
powerful for language translation and helps us understand how words or sequences that
may have different lengths, but are related to one another, are pieced together. It
works like a language translator.

Seq2Seq improves on keeping the necessary information in the hidden state from one
sequence to the next.

This way, at the end of a sentence, the hidden state will have all the information
relating to the past words. The size of the vector from the hidden state is the same no
matter the size of the sentence. In machine translation, the encoder's input is the
corpus of sentences in the original language.

In a nutshell, there is an encoder, a hidden state, and a decoder.

Currently the model produces only a single word at a time. It translates one word at a
time, and the word being produced is conditional on whatever prior word was produced. If
it produces one wrong word, we may end up throwing off the whole sequence of words. A
solution to this is beam search.

Beam Search

A solution to the above problem is to produce multiple different hypotheses, generating
words until <EOS>, and then see which full sentence is most likely.

Solution: Beam search is an attempt to solve greedy inference.

Solution: The s(i, j) function will weigh the different embedding layer hidden states to
give us a better embedding for the prediction of the next word. This will better allow
you to translate between different languages when the ordering of words may be different.

• Greedy inference means that a model producing one word at a time may output an
entirely wrong sequence of words if it produces one wrong word.
• Beam search tries to produce multiple different hypotheses, generating words
until <EOS>, and then sees which full sentence is most likely.
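The difference between greedy inference and beam search shows up even on a tiny, made-up model. In the sketch below the `NEXT` table of next-word probabilities is purely hypothetical, standing in for a trained decoder. A beam width of 1 is greedy inference, and a width of 2 keeps two hypotheses alive at each step.

```python
import math

# Hypothetical toy model: probability of the next token given the previous one
NEXT = {
    "<s>":   {"I": 0.6, "We": 0.4},
    "I":     {"am": 0.55, "speak": 0.45},
    "We":    {"speak": 0.9, "am": 0.1},
    "am":    {"<EOS>": 1.0},
    "speak": {"<EOS>": 1.0},
}

def beam_search(next_probs, beam_width, max_len=10):
    beams = [(0.0, ["<s>"])]                  # (log-probability, tokens so far)
    finished = []
    for _ in range(max_len):
        candidates = []
        for score, tokens in beams:
            for token, p in next_probs[tokens[-1]].items():
                candidates.append((score + math.log(p), tokens + [token]))
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = []
        for cand in candidates[:beam_width]:  # keep only the best hypotheses
            if cand[1][-1] == "<EOS>":
                finished.append(cand)
            else:
                beams.append(cand)
        if not beams:
            break
    return max(finished or beams, key=lambda c: c[0])[1]

print(beam_search(NEXT, beam_width=1))   # greedy: commits to "I" first
print(beam_search(NEXT, beam_width=2))   # beam: finds the likelier full sentence
```

Greedy inference picks "I" because it is the likeliest first word and ends with a sentence of probability 0.33, while the beam keeps "We" alive and finds "We speak" with probability 0.36.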
These are examples of common enterprise applications of LSTM models:

• Forecasting (LSTMs are among the most common Deep Learning models used in
forecasting).
• Speech Recognition, speech to text
• Machine Translation, sentiment analysis
• Image Captioning
• Question Answering (customer care, e.g., say “yes” if this is your request).
• Anomaly Detection
• Robotic Control
• Sentence completion, to solve the problem of modelling sequential data.

If you want to build complex architectures such as Inception or ResNet, you have to use
the functional API instead of the sequential model. For example, Inception concatenates a
bunch of different types of layers together, and ResNet carries the output of one layer
forward to later layers; both require something like the functional API.

Common questions

Powered by AI

Dropout and early stopping are two distinct regularization techniques in neural networks. Dropout works by randomly deactivating a subset of neurons during training, which prevents the model from becoming overly reliant on specific paths through the network and encourages a more robust neural representation. Its implementation involves adjusting network architecture dynamically during training. Early stopping, in contrast, involves halting training early based on validation performance. It directly monitors the model's error on a validation set, stopping when subsequent training yields increasing validation error, thus reducing overfitting. While dropout affects the architecture during each training iteration, early stopping is a global training strategy .

Early stopping prevents overfitting by halting the training process once the error on the validation set starts increasing, indicating that the model begins overfitting to the training data. It is a practical and dynamic method that directly uses the model's performance as feedback. Compared to other regularization techniques like dropout, which operates by randomly deactivating certain neurons during training to prevent reliance on specific paths, and regularization penalties such as L1 or L2 that add constraints to the cost function, early stopping is responsive to the data and requires fewer hyperparameter adjustments. However, it does not specifically alter the network architecture or training process .

LSTMs address the short memory problem of standard RNNs by using a complex unit structure that allows for better preservation and management of information over long sequences. They incorporate specialized memory gates to control the flow and retention of information. The input gate determines how much input information is stored in memory, the forget gate dictates what portion of past information to discard, and the output gate decides when information is moved to the next layer of the network. This gating mechanism allows LSTMs to selectively preserve important information, avoiding the signal decay frequently seen in standard RNNs with long sequences .

Data shuffling is important in neural network training because it changes the composition of the mini-batches from one epoch to the next, which improves the model's ability to generalize. Presenting the data in a different order each epoch reduces the chance that the model learns spurious patterns caused by a fixed ordering. It also makes convergence more stable, because the parameter updates reflect average behavior across the data rather than systematic biases introduced by sequential ordering.
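
A minimal sketch of per-epoch shuffling into mini-batches is shown below. The helper name `minibatches` and the toy dataset are assumptions made for illustration; real training code would normally shuffle inputs and labels together in the same way.

```python
import random

def minibatches(data, batch_size, rng):
    """Yield mini-batches of `data` in a freshly shuffled order."""
    indices = list(range(len(data)))
    rng.shuffle(indices)  # new random order on every call, i.e. every epoch
    for start in range(0, len(indices), batch_size):
        yield [data[i] for i in indices[start:start + batch_size]]

rng = random.Random(42)  # seeded only to make the demo repeatable
data = list(range(10))   # stand-in for 10 training examples
for epoch in range(2):
    batches = list(minibatches(data, batch_size=4, rng=rng))
    print(f"epoch {epoch}: {batches}")
```

Every example is still seen exactly once per epoch; only the order and the grouping into batches change.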

Convolutional layers are fundamental to CNNs because they detect spatial patterns in images through local receptive fields. Kernels (filters) within these layers slide across the image and learn to identify features such as edges and textures. Padding adds extra pixels around the image border so that the kernel can also be centered on edge pixels; this preserves the spatial dimensions of the input instead of shrinking them. Zero-padding, one common approach, adds zero-valued pixels around the border. Stride is the step size used when moving the kernel across the image, and a larger stride reduces the spatial size of the feature maps. Adjusting padding and stride therefore controls the resolution of the output feature maps.
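
The effect of padding and stride on output size follows the standard formula floor((n + 2p - k) / s) + 1 for input size n, kernel size k, padding p and stride s. The sketch below checks that formula against a small one-dimensional example; the function names are hypothetical, and the operation is written as cross-correlation (no kernel flip), which is what deep learning libraries usually call convolution.

```python
def conv_output_size(n, kernel, padding=0, stride=1):
    """Output length along one axis: floor((n + 2p - k) / s) + 1."""
    return (n + 2 * padding - kernel) // stride + 1

def conv1d(signal, kernel, padding=0, stride=1):
    """1-D cross-correlation with zero-padding and stride."""
    padded = [0.0] * padding + list(signal) + [0.0] * padding
    k = len(kernel)
    return [
        sum(padded[start + j] * kernel[j] for j in range(k))
        for start in range(0, len(padded) - k + 1, stride)
    ]

signal = [1, 2, 3, 4, 5]
edge_kernel = [1, 0, -1]  # simple difference filter

for padding, stride in [(0, 1), (1, 1), (1, 2)]:
    out = conv1d(signal, edge_kernel, padding=padding, stride=stride)
    expected_len = conv_output_size(len(signal), len(edge_kernel), padding, stride)
    print(f"padding={padding} stride={stride} -> length {len(out)} (formula: {expected_len})")
```

With a kernel of size 3, a padding of 1 and a stride of 1 keep the output the same length as the input, while a stride of 2 roughly halves it.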

In Keras, the choice of loss function and metrics directly influences how a model learns from data and how it is evaluated. The loss function quantifies the difference between predicted and actual outputs and guides the optimization process. For instance, Mean Squared Error (MSE) is common in regression problems, while categorical crossentropy is preferred for multi-class classification. Each loss function shapes the gradient descent path differently, which affects convergence speed and model accuracy. Metrics such as accuracy or precision, on the other hand, are reported during training to evaluate performance on the task but are not used to compute the gradient updates. Pairing a loss function and metrics that match the problem requirements supports both effective learning and meaningful evaluation.
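
To make the distinction between a loss and a metric concrete, the sketch below computes MSE, categorical crossentropy and accuracy by hand in plain Python. These are simplified stand-ins written for illustration, not the Keras implementations; the small `eps` clipping value and the sample predictions are assumptions.

```python
import math

def mean_squared_error(y_true, y_pred):
    """Average squared difference, typical for regression."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def categorical_crossentropy(y_true, y_pred, eps=1e-12):
    """y_true: one-hot rows; y_pred: rows of predicted class probabilities."""
    total = 0.0
    for t_row, p_row in zip(y_true, y_pred):
        # Only the probability assigned to the true class contributes.
        total -= sum(t * math.log(max(p, eps)) for t, p in zip(t_row, p_row))
    return total / len(y_true)

def accuracy(y_true, y_pred):
    """Fraction of rows whose most probable class matches the label."""
    hits = sum(t.index(max(t)) == p.index(max(p)) for t, p in zip(y_true, y_pred))
    return hits / len(y_true)

labels = [[1, 0, 0], [0, 1, 0]]
confident = [[0.9, 0.05, 0.05], [0.1, 0.8, 0.1]]
hesitant = [[0.4, 0.3, 0.3], [0.3, 0.4, 0.3]]

for name, preds in [("confident", confident), ("hesitant", hesitant)]:
    print(name, "loss:", round(categorical_crossentropy(labels, preds), 4),
          "accuracy:", accuracy(labels, preds))
print("regression MSE:", mean_squared_error([1.0, 2.0, 3.0], [1.5, 2.0, 2.0]))
```

Both sets of predictions reach the same accuracy, but the loss is lower for the confident one, which illustrates why the loss rather than the metric is what gradient descent optimizes.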

Backpropagation lets a neural network adjust the weights of earlier layers using derivatives of the loss function, computed layer by layer with the chain rule. It does not by itself solve the vanishing gradient problem; rather, it is where the problem appears. When there are many layers, the gradient reaching the early layers is a product of many per-layer derivatives and can become very small, which leads to slow learning or none at all. This is particularly common with sigmoid or hyperbolic tangent activations, whose outputs saturate (the sigmoid satisfies 0 ≤ f(z) ≤ 1) and whose derivatives are small over most of their range. To mitigate this, alternative activation functions such as ReLU, which is zero for negative inputs and linear for positive ones, became popular because their gradient does not shrink for positive inputs. Leaky ReLU can be used when it is important to ensure some learning at all nodes, since it provides a small, non-zero gradient for negative inputs.
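
The shrinking of gradients with depth can be illustrated numerically. The sketch below is a deliberate simplification: it multiplies the same local activation derivative across `depth` layers and ignores the weights, so it shows only the activation function's contribution. The function names and the leaky ReLU slope `alpha=0.01` are assumptions.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)  # never larger than 0.25 (reached at z = 0)

def relu_grad(z):
    return 1.0 if z > 0 else 0.0

def leaky_relu_grad(z, alpha=0.01):
    return 1.0 if z > 0 else alpha

def chained_gradient(local_grad, z, depth):
    """Product of `depth` identical local derivatives (weights ignored)."""
    return local_grad(z) ** depth

for depth in (1, 5, 10, 20):
    print(f"depth={depth:2d}  "
          f"sigmoid={chained_gradient(sigmoid_grad, 0.0, depth):.3e}  "
          f"relu={chained_gradient(relu_grad, 1.0, depth):.1f}")

print("leaky ReLU gradient for a negative input:", leaky_relu_grad(-2.0))
```

Even at its most favorable point the sigmoid derivative is 0.25, so the product falls off geometrically with depth, whereas the ReLU derivative stays at 1 for positive inputs.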

The decision between LSTMs and GRUs depends on the complexity of the problem and on resource constraints. LSTMs, with their three gates (input, forget, and output), are more complex and can capture intricate patterns at the expense of longer training times, which makes them suitable for tasks that require retaining long-term dependencies. GRUs offer a more streamlined architecture with fewer gates (an update gate and a reset gate) and often provide comparable performance with more efficient training, which is especially beneficial on smaller datasets or in applications that need faster turnaround. GRUs may perform equally well in less complex scenarios, while LSTMs are often preferred for data with long-range dependencies, although which one performs better in a given case is best checked empirically.
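
One way to see the difference in complexity is to count trainable parameters. The sketch below assumes the textbook formulations, in which an LSTM layer has four weight blocks (three gates plus the candidate memory) and a GRU layer has three (two gates plus the candidate state), each with one bias vector; specific libraries may add extra bias terms, so the exact numbers they report can differ slightly.

```python
def recurrent_block_params(input_dim, units):
    """Parameters of one gate-like block: input weights, recurrent weights, bias."""
    return units * input_dim + units * units + units

def lstm_params(input_dim, units):
    # Input, forget and output gates plus the candidate cell state: 4 blocks.
    return 4 * recurrent_block_params(input_dim, units)

def gru_params(input_dim, units):
    # Update and reset gates plus the candidate hidden state: 3 blocks.
    return 3 * recurrent_block_params(input_dim, units)

input_dim, units = 10, 32  # hypothetical layer sizes
print("LSTM parameters:", lstm_params(input_dim, units))
print("GRU parameters: ", gru_params(input_dim, units))
print("GRU / LSTM ratio:", gru_params(input_dim, units) / lstm_params(input_dim, units))
```

Under these assumptions a GRU layer has three quarters of the parameters of an LSTM layer of the same size, which is one reason it tends to train faster.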

AdaGrad adapts the learning rate for each parameter, scaling it inversely by the accumulated history of that parameter's squared gradients. This benefits sparse data but can make the learning rate decay too aggressively. RMSProp addresses this by scaling with a moving average of squared gradients instead of the full history, which stabilizes the updates for nonstationary objectives. Adam combines RMSProp-style per-parameter scaling with momentum, a moving average of the gradients themselves, which can accelerate convergence. Adam is generally favored for its speed and efficiency, especially with noisy gradients, but it may struggle in situations that require precise tuning of the learning rate for optimal convergence, and predicting which optimizer will perform best in a particular scenario remains challenging.
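
The three update rules can be compared side by side on a single scalar parameter. The sketch below follows the commonly cited forms of AdaGrad, RMSProp and Adam (including Adam's bias correction); the hyperparameter defaults and the test function f(w) = w**2 are assumptions chosen for illustration.

```python
import math

def adagrad_step(w, g, state, lr=0.1, eps=1e-8):
    # Accumulate ALL past squared gradients, so the step size only shrinks.
    state["G"] = state.get("G", 0.0) + g * g
    return w - lr * g / (math.sqrt(state["G"]) + eps)

def rmsprop_step(w, g, state, lr=0.1, rho=0.9, eps=1e-8):
    # Exponential moving average of squared gradients forgets old history.
    state["v"] = rho * state.get("v", 0.0) + (1.0 - rho) * g * g
    return w - lr * g / (math.sqrt(state["v"]) + eps)

def adam_step(w, g, state, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8):
    t = state["t"] = state.get("t", 0) + 1
    state["m"] = beta1 * state.get("m", 0.0) + (1.0 - beta1) * g       # momentum
    state["v"] = beta2 * state.get("v", 0.0) + (1.0 - beta2) * g * g   # scaling
    m_hat = state["m"] / (1.0 - beta1 ** t)  # bias correction for early steps
    v_hat = state["v"] / (1.0 - beta2 ** t)
    return w - lr * m_hat / (math.sqrt(v_hat) + eps)

for name, step in [("AdaGrad", adagrad_step), ("RMSProp", rmsprop_step), ("Adam", adam_step)]:
    w, state = 1.0, {}
    for _ in range(50):
        w = step(w, 2.0 * w, state)  # gradient of f(w) = w**2 is 2w
    print(f"{name:8s} w after 50 steps: {w:+.5f}")
```

The only structural difference between AdaGrad and RMSProp is how the squared gradients are accumulated, and Adam adds the moving average of the gradient itself on top of the RMSProp-style scaling.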

Seq2Seq models are advantageous in machine translation because they can generate variable-length target sequences from variable-length input sequences, which is essential for translating human languages. They use an encoder-decoder architecture: the encoder compresses the input sequence into a fixed-dimensional context vector (its final hidden state) that captures the essential information, and the decoder generates the translated output sequence from that vector. Passing this hidden state from encoder to decoder lets the model use the context of the whole input sentence, which a plain RNN producing one output per input step cannot do. Beam search further improves the output by keeping several candidate sequences during decoding instead of committing to the single most probable word at each step.
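
Beam search on the decoder side can be sketched independently of any neural network. In the example below the decoder is replaced by a hypothetical lookup table (`TOY_TABLE`) that returns next-token probabilities given the last token; a real decoder would condition on the context vector and the full prefix. The sketch ranks hypotheses by summed log-probability and applies no length normalization.

```python
import math

def beam_search(next_probs, beam_width, max_len, start="<s>", end="</s>"):
    """Return the final beams as (tokens, log_probability), best first."""
    beams = [([start], 0.0)]
    for _ in range(max_len):
        candidates = []
        for tokens, score in beams:
            if tokens[-1] == end:
                candidates.append((tokens, score))  # finished: carry over as is
                continue
            for token, p in next_probs(tokens).items():
                candidates.append((tokens + [token], score + math.log(p)))
        # Keep only the `beam_width` most probable hypotheses.
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
        if all(tokens[-1] == end for tokens, _ in beams):
            break
    return beams

# Hypothetical stand-in for a decoder: probabilities depend on the last token only.
TOY_TABLE = {
    "<s>": {"the": 0.6, "a": 0.4},
    "the": {"cat": 0.4, "dog": 0.3, "</s>": 0.3},
    "a": {"cat": 0.9, "</s>": 0.1},
    "cat": {"</s>": 1.0},
    "dog": {"</s>": 1.0},
}

def toy_decoder(tokens):
    return TOY_TABLE[tokens[-1]]

greedy = beam_search(toy_decoder, beam_width=1, max_len=5)[0]
beam = beam_search(toy_decoder, beam_width=2, max_len=5)[0]
print("greedy:", " ".join(greedy[0]), f"(p={math.exp(greedy[1]):.2f})")
print("beam-2:", " ".join(beam[0]), f"(p={math.exp(beam[1]):.2f})")
```

With a beam width of 1 the search is greedy and commits to "the" because it is the most probable first word, whereas a width of 2 also keeps "a" alive and finds the sequence with the higher overall probability.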
