Supervised Deep Learning Training Guide

Module 3 covers the training of supervised deep learning networks, focusing on Convolutional Neural Networks (CNNs) and optimization techniques like gradient descent. It explains the training process, including the goal of generalization, the architecture of CNNs, and the backpropagation method used to minimize loss. Key components such as convolution layers, max-pooling, and the softmax classifier are detailed to illustrate how CNNs learn from labeled data.

Uploaded by

udhbav23

Module 3

Training Supervised Deep Learning Networks: Training Convolution Neural Networks, Gradient Descent-Based Optimization Techniques, Challenges in Training Deep Networks.
Supervised Deep Learning Architectures: LeNet-5, AlexNet
Text Book - 1 : Ch 3.2,3.4,3.5, Ch 4.2,4.3
Training Supervised Deep Learning Networks
Understanding Supervised Deep Learning and CNN Training – Simple
Explanation
What is Supervised Learning?
• In supervised learning, we teach the machine using labeled data.
• This means every piece of training data has:
o An input (like an image),
o And a correct output label (like the name of the object in the
image).
Example: If you're training a model to recognize fruits, the input could be a
picture of an apple, and the label would be “apple.”
How is a Deep Learning Model Trained?
• A deep learning network (like a CNN) learns by adjusting its internal
settings (called parameters or weights) so it can predict the correct
labels.
• The training process is like trial and error:
1. The model guesses the output.
2. It compares the guess to the correct label.
3. If it's wrong, it updates its internal weights to improve next time.
• This continues over and over with many examples.
What is the Goal?
• The goal is not just to remember the training data, but to generalize.
• Generalization means the model can make correct predictions on new,
unseen data — not just the examples it has already seen.
Like a student who learns the concept in class and can apply it to new exam
questions they've never seen before.
Why Use Convolutional Neural Networks (CNNs)?

• CNNs are a type of deep learning model that are especially good at:
o Recognizing patterns in images,
o Extracting important features (like edges, shapes),
o And classifying objects correctly.
• CNNs are one of the most commonly used models in supervised
learning, especially for image-related tasks (like face recognition,
medical imaging, self-driving cars, etc.).
3.2 Training Convolution Neural Networks
Training Supervised Deep Neural Networks – Explained Simply
What Is the Goal of Training a Neural Network?
Training a deep learning model means finding the best set of weights (or
parameters) that help the model make accurate predictions.
To do this, we define a loss function (also called a cost function), which
tells us how far off the model’s predictions are from the correct answers.
The goal of training is to reduce this loss as much as possible — this means
the model is learning better.
How Do We Minimize the Loss?
We use a method called gradient descent, which is a step-by-step process
to adjust the weights of the network to reduce the loss.
Think of it like walking downhill to reach the lowest point (minimum error).
Backpropagation – How Learning Happens
The most common technique used in training deep learning models is
called:
Backpropagation with Gradient Descent
Here’s how it works step by step:
1. Initialize weights: Start with random weights (or based on a probability
distribution).
2. Feedforward: Give an input to the model and let it produce an output.
3. Calculate Error: Compare the output to the correct answer using a loss
function.
4. Backpropagate: Send the error backward through the network to work out how much each weight contributed to it (the gradient).
5. Update Weights: Move each weight a small step against its gradient, which is the direction of steepest decrease of the error.
6. Repeat: Do this for many inputs until the model performs well.
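The six steps above can be sketched in a few lines of NumPy. This is only a minimal illustration, not the textbook's code: it trains a single softmax layer instead of a full CNN, and the toy data, the learning rate `lr`, and the number of steps are all assumptions chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy labeled data (assumed for illustration): 200 points, 2 features, 2 classes.
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
Y = np.eye(2)[y]                                   # one-hot labels

# 1. Initialize weights from a small random distribution.
W = rng.normal(scale=0.01, size=(2, 2))
b = np.zeros(2)
lr = 0.5                                           # hypothetical learning rate

def forward(X, W, b):
    scores = X @ W + b
    scores = scores - scores.max(axis=1, keepdims=True)  # numerical stability
    exp = np.exp(scores)
    return exp / exp.sum(axis=1, keepdims=True)          # softmax probabilities

losses = []
for step in range(100):                            # 6. Repeat
    probs = forward(X, W, b)                       # 2. Feedforward
    loss = -np.mean(np.sum(Y * np.log(probs + 1e-12), axis=1))  # 3. Calculate error
    losses.append(loss)
    grad_scores = (probs - Y) / len(X)             # 4. Backpropagate (softmax + cross-entropy)
    grad_W = X.T @ grad_scores
    grad_b = grad_scores.sum(axis=0)
    W -= lr * grad_W                               # 5. Update weights against the gradient
    b -= lr * grad_b

accuracy = np.mean(forward(X, W, b).argmax(axis=1) == y)
```

In a real CNN the same loop is used; only the forward pass and the gradient computation become longer because there are more layers.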
Example CNN Architecture (Similar to LeNet)
Let’s understand this with an example of a small CNN model:
Input:
• Image of size 32 × 32 pixels
First Convolution Layer:
• Uses 6 filters of size 5 × 5
• Output: 6 feature maps of size 28 × 28
Filters slide across the image, detecting features like edges and textures.
First Max-Pooling Layer:
• Down-samples each 28 × 28 feature map to size 14 × 14
• This reduces size and helps the model focus on important features
Second Convolution Layer:
• Applies 16 filters of size 5 × 5
• Output: 16 feature maps of size 10 × 10
Second Max-Pooling Layer:
• Down-samples 10 × 10 maps to size 5 × 5
• Output: 16 feature maps, each of size 5 × 5
Third Convolution Layer:
• Uses 120 filters of size 5 × 5
• Each filter connects fully to a 5 × 5 map
• Output: 120 values (a flat vector)
Fully Connected Layer:
• Connects these 120 values to 10 output units (for 10 different classes)
• Output represents scores for each class
Softmax Classifier:
• Converts the 10 scores into probabilities

• The class with the highest probability is selected as the model’s
prediction
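The feature-map sizes listed above follow from simple arithmetic. The sketch below recomputes them, assuming no padding, stride 1 for the convolutions, and 2 × 2 pooling with stride 2 (the helper names `conv_out` and `pool_out` are made up for this example).

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Output width/height of a convolution along one dimension."""
    return (size + 2 * padding - kernel) // stride + 1

def pool_out(size, window=2, stride=2):
    """Output width/height of a pooling layer along one dimension."""
    return (size - window) // stride + 1

# (layer name, layer kind, number of output maps) for the example network.
layers = [("C1", "conv", 6), ("P2", "pool", 6),
          ("C3", "conv", 16), ("P4", "pool", 16),
          ("C5", "conv", 120)]

size = 32                      # input image is 32 x 32
shapes = []
for name, kind, maps in layers:
    size = conv_out(size, kernel=5) if kind == "conv" else pool_out(size)
    shapes.append((name, maps, size, size))

for name, maps, h, w in shapes:
    print(f"{name}: {maps} maps of {h} x {w}")
```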
Layer 1 (C1)
Understanding the First Two Layers in a CNN – Convolution and Max-
Pooling
In Convolutional Neural Networks (CNNs), the first few layers are used to
extract features from input images. Let's break down what's happening in
these layers step by step.

🔹 Layer 1: Convolution Layer (C1)

• Input Image Size: 32 × 32 pixels
• Filters Used: 6 filters, each of size 5 × 5
• Output: 6 feature maps of size 28 × 28
Why does the output shrink to 28 × 28?
Because the 5 × 5 filters slide over the 32 × 32 input, and we don’t use
any padding here.
Mathematical Operation of Convolution (Equation 3.1)
Let’s understand what this equation means in simple words:

In simple terms:
You slide a 5 × 5 filter over the image, multiply corresponding values,
add them up, then add a bias. This gives one pixel in the output
feature map.
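The slide, multiply, add, and add-bias description above can be written as a short NumPy sketch. It assumes a single-channel image, stride 1, and no padding, and it does not flip the filter (as is common in deep learning code); the random image and filter values are placeholders.

```python
import numpy as np

def conv2d_valid(image, kernel, bias=0.0):
    """Slide the kernel over the image with no padding and stride 1."""
    H, W = image.shape
    kH, kW = kernel.shape
    out = np.zeros((H - kH + 1, W - kW + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            patch = image[i:i + kH, j:j + kW]             # region under the filter
            out[i, j] = np.sum(patch * kernel) + bias     # multiply, add up, add bias
    return out

rng = np.random.default_rng(0)
image = rng.random((32, 32))          # placeholder 32 x 32 input
kernel = rng.normal(size=(5, 5))      # placeholder 5 x 5 filter
feature_map = conv2d_valid(image, kernel, bias=0.1)
print(feature_map.shape)              # one feature map of C1
```

The output is 28 × 28 because a 5 × 5 filter fits in 32 − 5 + 1 = 28 positions along each side.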

🔹 Layer 2: Max-Pooling Layer (P2)
Now the output from C1 is passed to the next layer: Max-Pooling.
• Input: 6 feature maps from C1 (each of size 28 × 28)
• Operation: Apply max-pooling with a 2 × 2 filter (usually with stride 2)
• Output: 6 feature maps of size 14 × 14 (reduced size)

🧮 Max-Pooling Operation (Equation 3.4)

In simple words:
Max-pooling shrinks the image while keeping the strongest features.
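As a small illustration of this operation, the sketch below applies 2 × 2 max-pooling with stride 2 to one feature map. The function name `max_pool_2x2` and the 4 × 4 example values are assumptions made for this example.

```python
import numpy as np

def max_pool_2x2(fmap):
    """2 x 2 max-pooling with stride 2 on a single feature map."""
    H, W = fmap.shape
    # Group the map into non-overlapping 2 x 2 blocks, then take each block's maximum.
    blocks = fmap[:H - H % 2, :W - W % 2].reshape(H // 2, 2, W // 2, 2)
    return blocks.max(axis=(1, 3))

fmap = np.arange(16, dtype=float).reshape(4, 4)
pooled = max_pool_2x2(fmap)
print(pooled)        # [[ 5.  7.] [13. 15.]]
```

Applied to a 28 × 28 map from C1, the same function returns a 14 × 14 map.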
Summary for Students

Layer       | Operation Type  | Input Size | Output Size | Purpose
Convolution | 6 filters (5×5) | 32 × 32    | 28 × 28     | Extract features like edges, textures
ReLU        | Activation      | 28 × 28    | 28 × 28     | Make model learn non-linear patterns
Max-Pooling | 2 × 2 max-pool  | 28 × 28    | 14 × 14     | Reduce size, highlight strong features

Understanding Layer 3 and Layer 4 in a CNN

Layer 3 (C3): Second Convolution Layer
• Input to this layer: Output from the previous max-pooling layer (P2),
which is 6 feature maps of size 14 × 14.
• Operation: This layer uses 16 filters (each of size 5 × 5) to extract deeper
features from the input.
• Output: 16 feature maps, each of size 10 × 10.
What’s happening?
Unlike the first convolution layer (which takes only one input image), this
layer takes multiple input feature maps (6 of them, from P2).
Each filter looks at all the 6 input maps to compute one output map.

In simple terms:
Each of the 16 filters "looks at" all 6 input maps, processes them, adds
the results together, adds a bias, and passes it through ReLU. This gives
one new feature map per filter.
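A convolution layer with several input maps can be sketched as below. This is an illustrative NumPy version under the assumptions of stride 1, no padding, and random placeholder weights; `conv_layer` is a name invented for the example.

```python
import numpy as np

def conv_layer(inputs, filters, biases):
    """inputs: (C_in, H, W); filters: (C_out, C_in, k, k); biases: (C_out,)."""
    c_out, c_in, k, _ = filters.shape
    _, H, W = inputs.shape
    out = np.zeros((c_out, H - k + 1, W - k + 1))
    for f in range(c_out):
        for i in range(H - k + 1):
            for j in range(W - k + 1):
                patch = inputs[:, i:i + k, j:j + k]        # looks at all input maps at once
                out[f, i, j] = np.sum(patch * filters[f]) + biases[f]
    return np.maximum(out, 0.0)                            # ReLU

rng = np.random.default_rng(0)
p2 = rng.random((6, 14, 14))                   # placeholder output of P2
w = rng.normal(scale=0.1, size=(16, 6, 5, 5))  # 16 filters, each spanning 6 maps
b = np.zeros(16)
c3 = conv_layer(p2, w, b)
print(c3.shape)                                # (16, 10, 10)
```

The same function with 120 filters of shape (16, 5, 5) applied to a (16, 5, 5) input returns 120 maps of size 1 × 1, since each filter fits in only one position.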

Layer 4 (P4): Second Max-Pooling Layer
• Input: 16 feature maps of size 10 × 10 (from C3)
• Operation: Max-pooling with a 2 × 2 window
• Output: 16 feature maps of size 5 × 5
What’s happening here?
Just like before, we reduce the size of each feature map by picking the
maximum value from each 2 × 2 block.

In simple terms:
Max-pooling shrinks each 10 × 10 map to 5 × 5 by selecting the most
important values.
Summary for Students

Layer | Type        | Input              | Output             | Purpose
C3    | Convolution | 6 maps of 14 × 14  | 16 maps of 10 × 10 | Extract deeper features using 5 × 5 filters and combine across channels
ReLU  | Activation  | 16 maps            | 16 maps            | Make the output non-linear
P4    | Max-Pooling | 16 maps of 10 × 10 | 16 maps of 5 × 5   | Reduce size, highlight strongest features

Understanding Layers 5 and 6 + Final Output in a CNN

Layer 5 (C5): Third Convolution Layer
• Input: 16 feature maps of size 5 × 5 from the previous max-pooling layer
(P4).
• Operation: Uses 120 filters, each covering the entire 5 × 5 input area.
• Output: Produces 120 feature maps, but each one is of size 1 × 1.

What’s happening?
In this layer:
• Each filter has access to all 16 input maps.
• Each filter is the same size as the input (5 × 5), so it slides only once
over the input.
• That means for each of the 120 filters, we get just one number – like
compressing all information into a single feature.

In simple words:
Each of the 120 filters combines information from all 16 maps using a
weighted sum, adds a bias, and applies ReLU. This gives a single number
per filter – so we now have a vector of 120 numbers.
Layer 6 (F6): Fully Connected Layer
• Input: The 120 values from C5.
• Operation: Fully connected layer with 10 neurons – one for each class
(for example: digits 0 to 9).
• Output: 10 raw values, one from each neuron.

💡 In simple words:
Each of the 10 neurons takes all 120 inputs and computes a weighted
sum. These are the raw scores before final classification.

🎯 Final Layer: Softmax Activation

The softmax function is used to convert the 10 raw scores from the fully
connected layer into probabilities.

In simple words:
Softmax turns the 10 raw scores into probabilities that add up to 1. So, if the outputs after softmax are:
• Z_0 = 0.01
• Z_1 = 0.02
• …
• Z_7 = 0.85
then we can say the model is 85% confident that the input belongs to class 7.
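The fully connected layer followed by softmax can be sketched as below. The weights and the 120 input values are random placeholders, so the predicted class here has no meaning; the point is only to show the weighted sum and the conversion of scores to probabilities.

```python
import numpy as np

def softmax(z):
    """Convert raw scores to probabilities that sum to 1."""
    z = z - np.max(z)          # subtracting the max avoids overflow in exp
    e = np.exp(z)
    return e / e.sum()

rng = np.random.default_rng(0)
c5 = rng.random(120)                          # placeholder output of C5
W = rng.normal(scale=0.1, size=(10, 120))     # one row of weights per class
b = np.zeros(10)

scores = W @ c5 + b          # F6: weighted sum for each of the 10 neurons
probs = softmax(scores)      # probabilities for the 10 classes
predicted = int(np.argmax(probs))
```

Softmax keeps the order of the scores, so the class with the highest raw score is also the class with the highest probability.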
Backward Pass
Loss Layer
What Happens During Training?

Goal of Training: Minimize the Loss
To improve the CNN’s accuracy, we want to minimize the error. We do this
by adjusting the weights of the network. This is done using a method called
backpropagation and an optimization method like gradient descent.
How Do We Update the Weights?
Let’s focus on the last layer (F6), which is fully connected and produces
10 outputs (for 10 classes).
For each neuron in this layer:

Breaking Down the Derivatives

Summary for Students

Step | Description
1    | Feed input image into CNN
2    | Get predicted output (probabilities)
3    | Compare with true output using a loss function
4    | Compute error and calculate gradient using chain rule
5    | Update weights using gradient descent
6    | Repeat the process for all data until loss is minimized

Hidden Layer (C5)

What is the backward pass doing here?
In deep learning, during training, after the output is predicted, we
compare it with the actual label to calculate the error. Then, we go
backwards through the network and adjust the weights so that the model
can improve.

🔹Forward Pass of Layer C5 (for context)

🔹Goal in Backward Pass
We want to update each weight w_{k,d,m,n} using the error calculated at
the output layer.
But C5 is a hidden layer, not the final layer. So first, we need to pass the
error from the last layer (F6) back to C5.

🔸Step 1: Backpropagate the Error from F6 to C5

We calculate the error e_k at C5 using:
e_k = ∑_{l=1}^{10} δF6_l × w_lk
• δF6_l: the delta from neuron l in output layer F6.
• w_lk: the weight connecting neuron k in layer C5 to neuron l in F6.
• e_k: the error for the k-th neuron/filter in C5.
• l: runs from 1 to 10 (because there are 10 output classes).
This tells us how much the final error depends on each output of C5.
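In code, this sum over the 10 output neurons is a single matrix-vector product. The sketch below uses random placeholder values for the F6 weights and deltas; `W_f6[l, k]` plays the role of w_lk.

```python
import numpy as np

rng = np.random.default_rng(0)
W_f6 = rng.normal(size=(10, 120))   # W_f6[l, k]: weight from C5 output k to F6 neuron l
delta_f6 = rng.normal(size=10)      # placeholder deltas of the 10 F6 neurons

# e_k = sum over l of delta_f6[l] * W_f6[l, k], computed for all k at once.
e_c5 = W_f6.T @ delta_f6

# The same value for one k, written as the explicit sum from the formula.
k = 3
e_k = sum(delta_f6[l] * W_f6[l, k] for l in range(10))
```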

In Simple Terms:
• C5 receives input from P4 and applies convolution.
• In the forward pass, it generates 120 feature maps.
• In the backward pass, we:
1. Pass back error from the next layer (F6) to C5.
2. Use this error to compute how much each neuron in C5 is
responsible (δC5_k).
3. Update the weights by calculating how much each weight
contributed to that error.
Hidden Layer (C3)
Backpropagation from C5 to C3 (via P4)
We are now going deeper into the convolutional network during training.
After computing the error at layer C5 and updating its weights, we need to
pass the error backward to layer C3, which is two layers back. But between
C5 and C3, we have the pooling layer P4.

Important Note about Max Pooling (P4)
We only propagate the error through those positions that were selected
during the max pooling operation in the forward pass.
• In max pooling, we pick the highest value from a small region (e.g.,
2×2).
• When going backward, only that maximum value gets the error.
Others are ignored.
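This routing of the error can be sketched as below for one feature map with 2 × 2 pooling and stride 2. The function name `max_pool_backward` and the example numbers are assumptions for illustration.

```python
import numpy as np

def max_pool_backward(fmap, grad_pooled):
    """Send each pooled error back to the position that held the maximum."""
    H, W = fmap.shape
    grad = np.zeros_like(fmap)
    for i in range(H // 2):
        for j in range(W // 2):
            block = fmap[2 * i:2 * i + 2, 2 * j:2 * j + 2]
            a, b = np.unravel_index(np.argmax(block), block.shape)
            grad[2 * i + a, 2 * j + b] = grad_pooled[i, j]   # others stay zero
    return grad

fmap = np.array([[1., 3., 2., 0.],
                 [4., 2., 1., 5.],
                 [0., 1., 7., 2.],
                 [6., 2., 3., 1.]])
grad_pooled = np.array([[10., 20.],
                        [30., 40.]])
grad = max_pool_backward(fmap, grad_pooled)
print(grad)
```

Only the four positions that held the maxima (4, 5, 6 and 7) receive an error; the other twelve entries stay at zero.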

In Simple Words:
1. We start at the output layer, calculate the error, and then move
backward step by step.
2. To update C3, we must go through P4 (pooling).
3. Only the neurons selected during max pooling will carry the error
backward.
4. We calculate how much each neuron in C3 is responsible for the error
using its output and weights.

5. We then compute how much to adjust each weight in C3 to reduce the
overall error.
Hidden Layer (C1)
We want to update the weights in Layer C1 (the very first convolutional
layer) based on how much error the network made.
But before we can update, we need to:
1. Backpropagate the error from Layer C3, through the pooling layer P2,
to reach Layer C1.
2. Calculate delta values for Layer C1.
3. Use these deltas to calculate the gradients (i.e., how much each
weight should be changed).

In Simple Words:
• We are working backward to adjust the first layer's weights.
• We pass the error from the deeper layer (C3) back to the first layer
(C1).
• We use convolution (with a rotated filter) to figure out how much error
each neuron in C1 is responsible for.
• Then, we calculate how much each weight in that filter should be
changed by multiplying the delta and the input image values.
Real-World Analogy:
Imagine you're baking a cake (output). You made a mistake in taste
(error), and now you want to trace back and figure out how much of that
mistake came from sugar, salt, or flour (weights in C1).
• You're tracing back the flavor to the exact ingredient and amount you
used.
• You flip the recipe steps (like flipping the filter) and work backward
from the final result to the starting step.
3.4 Gradient Descent-Based Optimization Techniques
Gradient Descent is a popular method used in machine learning and
deep learning to train models by improving their accuracy step by step.
Let’s break it down:
• Every machine learning model has something called a cost function
(also known as a loss function).
This function measures how wrong the model’s predictions are.
The goal of training is to reduce this error as much as possible.
• To reduce this error, we need to adjust the model’s parameters (like
weights in a neural network). But how do we know which way to adjust
them?
• That’s where gradients come in. A gradient shows the direction and rate
of change of the cost function with respect to each parameter. Think of
it like a slope on a hill — it tells you which direction to move to reach
the bottom (minimum error).
• Gradient Descent is the technique that:
1. Calculates the gradients of the cost function,
2. Uses these gradients to update the parameters (move in the
direction where the error decreases),
3. Repeats this process until the error is as small as possible (or
until we reach a stopping condition).
• There are different variants (types) of Gradient Descent. These
variants change how and how fast the parameters are updated. Some
common ones include:
o Stochastic Gradient Descent (SGD)
o Mini-Batch Gradient Descent
o Momentum
o Adam Optimizer
Each variant has its own way of improving the learning process, either by
speeding it up, avoiding unnecessary oscillations, or making the learning
more stable.
3.4.1 Gradient Descent Variants
There are three main types of Gradient Descent, and the difference between
them is based on how much data they use at a time to calculate the
gradient (i.e., the direction and size of the update to the model).
Batch Gradient Descent (GD)
Understanding Traditional Gradient Descent (Batch Gradient Descent)
In traditional Gradient Descent (also called Batch Gradient Descent), we
use the entire training dataset to calculate the error and update the
model’s parameters (like weights).
Let’s break this down step by step:
How the Weight is Updated

Suppose we are adjusting a weight w in our model to reduce the error. The
formula for updating the weight is:
w = w - μ · ∇E(w)
Here’s what each part means:
• w – the weight we are updating
• ∇E(w) – this is called the gradient (or slope) of the error with respect to
the weight w. It tells us the direction in which the weight should change
to reduce the error.
• μ (mu) – this is the learning rate. It controls how big the step we take
in the direction of the gradient.
What is Learning Rate?
• It is a hyperparameter (a value we set manually before training).
• If the learning rate is too high, the updates might be too large and we
may miss the best solution.
• If it is too small, the updates are tiny, and the training becomes very
slow.
Choosing a good learning rate is important for effective learning.
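To see how the learning rate changes the update w = w - μ · ∇E(w), here is a small sketch on a made-up one-weight error E(w) = w², whose gradient is 2w. The three learning rates are arbitrary values picked for illustration.

```python
def gradient_descent(w0, lr, steps=20):
    """Repeat w = w - lr * gradient for E(w) = w**2 (gradient is 2*w)."""
    w = w0
    for _ in range(steps):
        grad = 2.0 * w
        w = w - lr * grad
    return w

w_good = gradient_descent(5.0, lr=0.1)     # moves close to the minimum at w = 0
w_small = gradient_descent(5.0, lr=0.001)  # barely moves in 20 steps
w_large = gradient_descent(5.0, lr=1.1)    # overshoots and moves further away each step

print(w_good, w_small, w_large)
```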
Why is it Slow in Practice?
• When we use the entire training dataset to compute the gradient, it can
take a lot of time and memory, especially if the dataset has thousands
or millions of examples.
• Imagine trying to load and process all that data at once — it’s like trying
to carry all your groceries in one trip. It's heavy and slow.
• This is why Batch Gradient Descent is not practical for large datasets
— it takes too long to compute and needs a lot of memory.
Stochastic Gradient Descent (SGD)
Stochastic Gradient Descent (SGD) – Explained in Simple Words
To solve the problems of traditional Gradient Descent (which is slow and
memory-heavy), we can use a faster technique called Stochastic Gradient
Descent, or SGD.
It is also known as Incremental Gradient Descent.
What Makes SGD Different?
• In SGD, instead of using all the training examples at once, we use just
one example at a time to:

1. Calculate the error (gradient),
2. Update the model parameters (like weights).
So the update happens immediately after seeing each training example.
This makes the learning process much faster, especially for very large
datasets.
Formula for Weight Update in SGD
w = w - μ · ∇E(w; x(i), y(i))
Here’s what this means:
• w – the weight (parameter) we are updating.
• μ – the learning rate (step size).
• ∇E(w; x(i), y(i)) – this is the gradient of the error (or loss function) with
respect to the weight w, but calculated only for one training example,
i.e., the pair {x(i), y(i)}.
This way, we make small updates after each individual data point.
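A minimal sketch of this one-example-at-a-time update is shown below. It assumes a toy problem (fitting y ≈ 3x with a single weight and squared error); the data, learning rate, and number of epochs are placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=500)
y = 3.0 * x + rng.normal(scale=0.1, size=500)   # toy data with true weight 3

w = 0.0
lr = 0.1                                        # learning rate (mu)
for epoch in range(5):
    for i in rng.permutation(len(x)):           # visit examples in random order
        error = w * x[i] - y[i]
        grad = error * x[i]                     # gradient of 0.5 * error**2 for this example
        w = w - lr * grad                       # update right after each example

print(w)
```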
What Happens to the Error (Loss Function)?
• Because we are updating the weights using only one sample at a time,
the value of the loss function tends to go up and down a lot during
training.
• This behavior is called fluctuation, and it makes the learning process a
bit noisy.
• However, these small fluctuations can also help the model avoid getting
stuck in bad solutions (local minima).
Summary (in Simple Words):
• SGD updates the model after seeing each training example, instead of
waiting for the whole dataset.
• This makes it faster and uses less memory than traditional Gradient
Descent.
• The training process is less smooth because the error jumps around,
but it can help reach a better solution in the end.
Mini-batch Gradient Descent
Mini-Batch Gradient Descent – Explained in Simple Words
Mini-Batch Gradient Descent, also called Mini-Batch SGD, is a method
that combines the advantages of both Standard (Batch) Gradient Descent
and Stochastic Gradient Descent (SGD).

How Does It Work?
• Instead of using all the training data at once (like Batch GD), or using
just one example at a time (like SGD), Mini-Batch Gradient Descent
breaks the data into small groups, called mini-batches.
• Each mini-batch contains n training examples (for example: 32, 64,
128, etc.).
• The model updates its parameters after processing each mini-batch, not
after each example or the whole dataset.
So, if your dataset has 10,000 examples and your batch size is 100, you’ll
have 100 updates in one full pass (called one epoch) over the data.
Weight Update Formula
w = w - μ · ∇E(w; x(i:i+n), y(i:i+n))
Here:
• w is the weight to be updated.
• μ is the learning rate.
• ∇E(w; x(i:i+n), y(i:i+n)) is the gradient (error slope) calculated for a mini-
batch of inputs (from sample i to i+n).
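The mini-batch update can be sketched as below for the 10,000-example, batch-size-100 case mentioned earlier. The linear model, the data, and the learning rate are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 3))
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w + rng.normal(scale=0.1, size=10_000)   # toy regression data

w = np.zeros(3)
lr = 0.1
batch_size = 100
updates = 0

order = rng.permutation(len(X))                 # shuffle once for this epoch
for start in range(0, len(X), batch_size):
    idx = order[start:start + batch_size]       # indices of one mini-batch
    residual = X[idx] @ w - y[idx]
    grad = X[idx].T @ residual / len(idx)       # gradient averaged over the mini-batch
    w -= lr * grad
    updates += 1

print(updates)    # 100 updates in one epoch
```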
What Batch Size Should You Choose?
Choosing the right mini-batch size is important. Here’s how it affects
learning:
Large Mini-Batches (e.g., 256, 512, etc.):
• Give more accurate estimates of the gradient.
• But require more memory and may slow down training if hardware (like
GPU) is not powerful enough.
Small Mini-Batches (e.g., 32, 64):
• Need less memory and can add a bit of randomness (which can help
prevent overfitting).
• But because of that randomness, they can be unstable if the learning
rate is too high.
• You’ll need to use a smaller learning rate, which can make the training
slower.
Summary (In Simple Words):
• Mini-Batch Gradient Descent processes a small group of training
examples at a time.

• It’s a good balance between speed (SGD) and accuracy (Batch GD).
• It's the most commonly used technique in deep learning.
• Choosing the right batch size is important: larger sizes are accurate but
heavy on memory; smaller sizes are faster but need careful tuning of
learning rate.
3.4.2 Improving Gradient Descent for Faster Convergence
The main objective of optimization is to minimize the cost/loss or objective
function. There are many methods available that help an optimization
algorithm to converge faster. Some of the commonly used methods are
discussed below
AdaGrad
Understanding Learning Rate Problems in SGD and the AdaGrad
Solution
In Stochastic Gradient Descent (SGD), the learning rate (μ) is usually fixed
– we choose one value at the beginning and use it throughout the training.
But this can cause problems. Let’s understand why:
Problems with Fixed Learning Rate
1. If the gradient is large:
o A large learning rate will cause the model to take big steps.
o It may overshoot the best solution, and the model may keep
jumping around the minimum without settling.
2. If the gradient is small:
o A small learning rate means the model takes tiny steps.
o It will take a very long time to reach the minimum (slow learning).
So, using the same learning rate for all parameters and all situations
doesn’t always work well.
Solution: Use Adaptive Learning Rate – AdaGrad
To fix this, we can use an adaptive method that automatically adjusts the
learning rate during training.
AdaGrad (Adaptive Gradient Algorithm) is one such method.
How Does AdaGrad Work?
Instead of keeping the learning rate fixed, AdaGrad changes it based on
how big the gradients have been in the past.
Here's the idea:

• For each parameter (like each weight), it keeps track of the sum of
squares of the past gradients.
• It then divides the learning rate by the square root of that sum.
Weight Update Formula:
w(t+1),i = w(t),i - (μ / √Gi) · ∇t,i
Where:
• w(t),i = weight of the parameter i at time t
• ∇t,i = gradient of the loss with respect to that parameter at time t
• Gi = the sum of the squares of gradients up to time t
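The AdaGrad update can be sketched as below. The small constant `eps` is not in the formula above; it is an assumption added here, as is common, to avoid dividing by zero. The gradient values are placeholders.

```python
import numpy as np

def adagrad_step(w, grad, G, lr=0.1, eps=1e-8):
    """One AdaGrad update; G holds the running sum of squared gradients."""
    G = G + grad ** 2
    w = w - lr / (np.sqrt(G) + eps) * grad
    return w, G

w = np.array([1.0, 1.0])
G = np.zeros(2)
grad = np.array([100.0, 1.0])       # one steep and one shallow direction
w, G = adagrad_step(w, grad, G)

effective_lr = 0.1 / np.sqrt(G)     # per-parameter learning rate after this step
print(w, effective_lr)
```

After this step the parameter with the large gradient has an effective learning rate of 0.001 and the one with the small gradient has 0.1, yet both weights moved by the same amount (from 1.0 to about 0.9).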
What Does This Achieve?
• If a parameter has had large gradients, its effective learning rate becomes smaller, so it doesn't jump too far.
• If a parameter has had small gradients, its effective learning rate shrinks much less, so it takes relatively bigger steps and doesn’t get stuck.
This helps the model learn more efficiently without the need to manually
set the perfect learning rate.
Drawback of AdaGrad
While AdaGrad adjusts the learning rate automatically, there’s one issue:
• The sum in the denominator keeps increasing over time.
• As a result, the learning rate keeps getting smaller and smaller.
• Eventually, it can become too small, causing the model to stop learning
completely (learning stalls).
Summary (in simple words):
• In regular SGD, a fixed learning rate may not work well for all
parameters.
• AdaGrad automatically adjusts the learning rate for each parameter:
o Big gradients → smaller steps
o Small gradients → relatively bigger steps
• It helps speed up training and reduce the need for manual tuning.
• But over time, the learning rate may become too small, which can slow
down or stop learning.
AdaDelta

AdaDelta – An Improved Version of AdaGrad
AdaDelta is an advanced optimization technique that was created to fix a
major problem with AdaGrad.
Let’s first recall the issue with AdaGrad:
• In AdaGrad, the learning rate keeps getting smaller over time because
it keeps adding up all the past squared gradients.
• Eventually, the learning rate becomes so small that the model stops
learning.
What Does AdaDelta Do Differently?
AdaDelta solves this issue by making two key changes:
1. Instead of adding up all past gradients (which causes the learning rate
to decay), it only keeps a limited memory — a kind of moving average of
recent past gradients.
2. It uses this average to calculate a stable and adaptive learning rate.
How Does It Work?
• AdaDelta calculates an average of recent squared gradients, not all of
them.
• This average is updated at every time step and is called Avg∇²(t).
So, the updated weight formula becomes:
w(t+1) = w(t) - (μ / √Avg∇²(t)) · ∇t
Or more commonly written as:
w(t+1) = w(t) - μ · ∇t / RMS(∇t)
Where:
• ∇t is the current gradient at time step t.
• Avg∇²(t) is the running average of past squared gradients.
• RMS(∇t) means the Root Mean Square of the gradients (just a way to
measure the average size of recent gradients).
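The RMS-scaled update above can be sketched as below. The decay factor 0.9 and the constant `eps` are assumptions for illustration. As far as the published AdaDelta method goes, it also keeps a running average of past squared updates in place of μ; that part is left out here to stay close to the formula in the text.

```python
import numpy as np

def rms_scaled_step(w, grad, avg_sq, lr=0.01, decay=0.9, eps=1e-8):
    """Update with the gradient divided by the RMS of recent gradients."""
    avg_sq = decay * avg_sq + (1 - decay) * grad ** 2   # running average of squared gradients
    rms = np.sqrt(avg_sq + eps)
    w = w - lr * grad / rms
    return w, avg_sq

w, avg_sq, sum_sq = 0.0, 0.0, 0.0
for _ in range(200):
    grad = 2.0                      # constant gradient, for illustration only
    sum_sq += grad ** 2             # what AdaGrad would accumulate
    w_before = w
    w, avg_sq = rms_scaled_step(w, grad, avg_sq)

last_step = w_before - w
print(avg_sq, sum_sq, last_step)
```

With a constant gradient the running average settles near 4 while the AdaGrad-style sum keeps growing (800 after 200 steps), so the step size here stays near 0.01 instead of shrinking.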
Why Is This Useful?
• By using a moving average instead of a total sum, AdaDelta prevents
the learning rate from shrinking to zero.
• It allows the learning rate to adapt automatically based on how the
gradient is changing — like AdaGrad, but without the decay problem.

• And since it keeps only a running average rather than a stored history of gradients, the extra memory per parameter stays fixed.
Summary (in Simple Words):
• AdaDelta improves on AdaGrad by using a moving average of recent
gradients instead of accumulating all past gradients.
• This prevents the learning rate from becoming too small over time.
• It uses the Root Mean Square (RMS) of gradients to control the learning
rate.
• It is an adaptive method that automatically adjusts learning — no need
to manually set or reduce the learning rate.
RMSProp
RMSProp – Solving the Learning Rate Problem in AdaGrad
We’ve already seen that AdaGrad tends to reduce the learning rate too
much over time, which can slow down or even stop learning.
RMSProp is an improved version of AdaGrad that fixes this problem.
What Does RMSProp Do Differently?
The main idea in RMSProp is to avoid the vanishing learning rate problem
in AdaGrad by using a moving average of recent squared gradients, giving
more importance to the latest gradients.
This is done using an exponentially weighted moving average – meaning
recent gradients are weighted more than older ones. It helps the model
forget the very old history, so the learning rate stays stable and effective.
How RMSProp Works (Step by Step):
Let’s understand the steps in simple terms:
(a) Keep Weight Updates Balanced
• RMSProp tries to make all weight updates have similar magnitude (size).
• We set maximum and minimum limits for how big or small the weight
updates can be.
(b) Adjust Learning Rate Based on Gradient Behavior
• At each training step (iteration), we compare the current gradient and
the previous gradient.

✔️ If both have the same sign (direction):

• This means the model is moving in the right direction.
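• So, we increase the learning rate slightly (by multiplying it by 1.2):
• η = η × 1.2
• The weight update becomes:
• Update = min(η+, max)

❌ If the signs are different:

• This means the model might be bouncing or overshooting.
• So, we decrease the learning rate (by multiplying it by 0.5):
• η = η × 0.5
• The weight update becomes:
• Update = max(η−, min)
Here, η (eta) is the learning rate, η+ and η− are its increased and decreased values, and max and min are the preset limits on
how large or small the updates can be.
This sign-based rule comes from Rprop, the method RMSProp was developed from. RMSProp itself is usually written as dividing each gradient by the square root of the exponentially weighted moving average of squared gradients.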
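The sign-based rule described above can be sketched as below. The limits `step_min` and `step_max` are hypothetical values chosen for the example; the factors 1.2 and 0.5 are the ones given in the text. Note that this rule is usually known as Rprop; it is shown here only because it is the step the text describes.

```python
def adapt_step(step, grad, prev_grad, up=1.2, down=0.5,
               step_min=1e-6, step_max=50.0):
    """Grow the step if the gradient kept its sign, shrink it if the sign flipped."""
    if grad * prev_grad > 0:            # same direction as before
        return min(step * up, step_max)
    if grad * prev_grad < 0:            # direction changed: likely overshooting
        return max(step * down, step_min)
    return step                         # one of the gradients is zero: keep the step

same_sign = adapt_step(0.1, grad=1.0, prev_grad=2.0)     # 0.1 * 1.2
flipped = adapt_step(0.1, grad=-1.0, prev_grad=2.0)      # 0.1 * 0.5
capped = adapt_step(45.0, grad=1.0, prev_grad=1.0)       # limited by step_max
print(same_sign, flipped, capped)
```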
Key Features of RMSProp:
• Learns fast without letting the learning rate shrink to zero.
• Adapts the learning rate depending on how the gradient is changing.
• Ignores outdated gradients, focusing more on recent changes.
• Very useful for training deep neural networks and works well with non-
stationary data (data that changes over time).
Summary (in Simple Words):
• RMSProp is an improved version of AdaGrad that keeps learning fast
and stable.
• It uses a moving average of recent squared gradients, not all past gradients.
• It checks if the gradient direction is consistent:
o If it is, increase the learning rate slightly.
o If it changes direction, reduce the learning rate.
• It helps avoid too small learning rates and makes the training process
smoother and faster.
Adam
Adam Optimizer – An Adaptive Optimization Technique
Adam (short for Adaptive Moment Estimation) is one of the most popular
and effective optimization techniques used in deep learning today.
It is an adaptive method, which means it automatically adjusts the
learning rate for each parameter as training progresses.

Adam combines the best features of two other optimization methods:
• AdaGrad – which handles sparse data well.
• RMSProp – which works well for non-stationary (changing) data.
What Makes Adam Special?
Adam keeps track of two types of information during training:
1. The average (mean) of gradients – this is called the first moment (mt).
It tells us the direction in which we should move.
2. The average of squared gradients – this is called the second moment
(vt).
It gives us an idea of the magnitude (size) of the updates.
These averages are exponentially weighted moving averages, meaning more
recent gradients have more influence than older ones.
How mt and vt Are Calculated
At each training step t, Adam calculates:
mt = β1 · mt−1 + (1 − β1) · gt
vt = β2 · vt−1 + (1 − β2) · gt²
Where:
• gt is the gradient at time step t
• mt is the moving average of the gradient (like momentum)
• vt is the moving average of the squared gradient
• β1 and β2 are hyperparameters (typically close to 1, like 0.9 and 0.999)
that control how much past information is remembered
Bias Correction (Why It’s Needed)
At the beginning of training (when t is small), both mt and vt can be biased
toward zero. To fix this, Adam uses bias-corrected versions:
m̂t = mt / (1 − β1ᵗ)
v̂t = vt / (1 − β2ᵗ)
These correct the early values to make them more accurate.
Final Weight Update Formula
Adam then updates the weights using the following rule:
wt+1 = wt − μ · (m̂t / (√v̂t + ε))
Where:

• μ is the learning rate (usually 0.001)
• ε (epsilon) is a small constant (like 1e-8) added to the denominator to
avoid division by zero
• m̂t is the bias-corrected mean (first moment)
• v̂t is the bias-corrected second moment (the average squared gradient, an uncentered variance)
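Putting the formulas for mt, vt, the bias correction, and the weight update together gives the sketch below. The default values follow the typical ones quoted in the text (β1 = 0.9, β2 = 0.999, μ = 0.001, ε = 1e-8); the gradient values are placeholders.

```python
import numpy as np

def adam_step(w, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update at time step t (t starts at 1)."""
    m = beta1 * m + (1 - beta1) * grad           # first moment: average gradient
    v = beta2 * v + (1 - beta2) * grad ** 2      # second moment: average squared gradient
    m_hat = m / (1 - beta1 ** t)                 # bias correction
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

w = np.array([1.0, -1.0])
m = np.zeros(2)
v = np.zeros(2)
grad = np.array([0.5, -20.0])                    # very different gradient sizes
w, m, v = adam_step(w, grad, m, v, t=1)
print(w)
```

On the first step the bias-corrected ratio is close to ±1 for every parameter, so both weights move by about the learning rate (0.001) even though their gradients differ by a factor of 40.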
Why Adam Is So Powerful
Fast Convergence: Adam often reaches a low error in fewer steps than
plain SGD.
Stable Learning: It balances the direction and size of updates
automatically.
Little Manual Tuning Needed: Adam adapts the step size for each parameter automatically, so
usually only the overall learning rate μ has to be chosen.
Handles Complex Problems Well: Adam works well even when the data is
noisy or changing.
Summary (in Simple Words):
• Adam = RMSProp + Momentum
• It keeps track of the average gradient (direction) and the average
squared gradient (magnitude).
• Uses exponentially weighted averages to give more weight to recent
updates.
• Automatically adjusts learning rate for each parameter.
• Learns quickly, efficiently, and reliably in most deep learning tasks.
3.5 Challenges in Training Deep Networks
Training a deep neural network is a challenging task, and some of the
prominent challenges in training deep models are discussed below.
3.5.1 Vanishing Gradient
Vanishing Gradient Problem in Deep Neural Networks – Explained Simply
When we train deep neural networks (networks with many layers), especially
those using activation functions like sigmoid or tanh, we face a serious issue
called the vanishing gradient problem.
Let’s understand what that means and how it affects learning.
What Is Backpropagation?
• Backpropagation is the method used to train a neural network.

• It works by calculating the error (difference between predicted and
actual output) and sending it backward through the layers.
• Based on this error, the weights of the network are updated using a
method called gradient descent.
What Is the Vanishing Gradient Problem?
• In deep networks (with many layers), as the error is backpropagated
from the output to the input, the gradients (used to update weights)
become smaller and smaller.
• Eventually, in the earlier layers (close to the input), the gradient
becomes almost zero.
• This means the initial layers stop learning — they don't get updated
properly.
This issue is called the vanishing gradient problem.
Why Does This Happen with Sigmoid?
Let’s look at the sigmoid activation function:
f(x) = 1 / (1 + e^(-x))
Its derivative is:
f'(x) = f(x) * (1 - f(x))
From this formula, you can see:
• The maximum value of the derivative is 0.25 (reached at x = 0, where
f(x) = 0.5).
• That means every time the gradient passes through a sigmoid layer, it
gets multiplied by a value less than or equal to 0.25.
• So, if you have many layers, the gradient gets multiplied by small
numbers again and again, and becomes almost zero.
• As a result, the early layers in the network do not learn — they are left
almost unchanged.
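The shrinking effect can be checked numerically. The sketch below is only an illustration: it looks at the sigmoid derivative alone and ignores the weight terms that also appear in a real backpropagated gradient, so 0.25 ** n is an upper bound on the activation part only. The function names are chosen for this example.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    # f'(x) = f(x) * (1 - f(x))
    s = sigmoid(x)
    return s * (1.0 - s)

# Largest possible derivative, reached at x = 0.
max_grad = sigmoid_grad(0.0)

# Upper bound on the activation part of the gradient after passing
# backward through n sigmoid layers.
for n in (1, 5, 10, 20):
    print(n, max_grad ** n)
```

Even in this best case, ten sigmoid layers scale the gradient by less than one millionth.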

Why Is This a Big Problem?
• In deep learning, we want all layers to learn, not just the last few.
• If the earlier layers don’t learn properly, the network performs poorly,
especially on complex tasks like image recognition or language
understanding.
How ReLU Fixes This Problem
To solve this, we use a different activation function called ReLU (Rectified
Linear Unit).
ReLU Function:
f(x) = x if x > 0, otherwise f(x) = 0
Derivative:
f′(x) = 1 when x > 0, and 0 when x ≤ 0
Why ReLU Helps:
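• For positive inputs, the derivative is 1, so no shrinking happens.
• The gradients are not scaled down by the activation function as they go
back through active (positive) units.
• For negative inputs the derivative is 0, so a unit that is never active
receives no gradient through it.
• In practice, this helps deep networks train much faster.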
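A small sketch of the ReLU function and its derivative, following the definitions above. The list of pre-activation values is made up for illustration and stands for one path through ten layers in which every unit happens to be active.

```python
def relu(x):
    return x if x > 0 else 0.0

def relu_grad(x):
    # Derivative as given above: 1 for x > 0, and 0 for x <= 0.
    return 1.0 if x > 0 else 0.0

# Hypothetical pre-activation values along one path through 10 layers,
# all positive, so every unit on the path is active.
pre_activations = [0.5, 1.2, 0.3, 2.0, 0.7, 1.5, 0.9, 0.4, 1.1, 0.6]

# Product of the activation derivatives along the path.
path_factor = 1.0
for z in pre_activations:
    path_factor *= relu_grad(z)
```

Here path_factor stays at 1.0, whereas ten sigmoid layers would contribute at most 0.25 ** 10. If any value on the path were zero or negative, the factor would drop to 0 for that path.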
That’s why ReLU has become the default activation function in modern deep
learning.
Summary (in Simple Words):
• In deep neural networks, gradients can vanish during training when
using sigmoid or tanh activation functions.
• This means earlier layers stop learning, which harms the performance.
• The sigmoid’s derivative is never larger than 0.25, causing the gradient to
shrink at each layer.
• ReLU fixes this because it has a derivative of 1 for positive values, so
gradients stay strong and training becomes effective.
3.5.2 Training Data Size
Why Deep Learning Needs a Lot of Data – Explained Simply
Deep learning models are powerful tools that can learn complex patterns and
relationships between input data (like images, text, etc.) and output labels
(like categories or predictions). But to work well, these models need a lot of
learning, and that learning comes from training data.
Why Do Deep Models Need So Much Data?

Deep networks:
• Have many layers.
• Learn nonlinear (complex) patterns.
• Contain millions of parameters (weights) that must be adjusted during
training.
Because of this complexity, they need:
• More examples to learn from.
• Larger datasets to avoid mistakes.
• Better quality data to learn useful patterns.
More data → better learning → better accuracy.
Example: ImageNet and Popular Models
Popular models like AlexNet, VGG, GoogLeNet, and ResNet were trained using
a very large image dataset called ImageNet.
• The ImageNet subset used for these models has:
o Around 1.2 million training images
o Spread across 1000 different categories (like dog, airplane, car,
etc.)
This huge dataset helped these models learn very well how to recognize and
classify images — even when objects are shown in different poses, sizes,
colors, and backgrounds.
But Do All Problems Need Big Data?
Not always.
Some tasks (for example, some medical image classification problems):
• Are less complex.
• Have fewer variations in images.
• Can often be solved using smaller and simpler models.
• May not need millions of examples to train.
But still, data is important — not just quantity, but also quality.
Data Quality Matters
Imagine training a model on:
• Blurry images,
• Wrong labels,

• Or messy, irrelevant data.
That’s called noisy data.
In such cases, the model:
• Learns slower.
• Makes more mistakes.
• Needs more data to balance out the noise.
Noisy data has a low Signal-to-Noise Ratio (SNR), which means the model has
to work harder to find useful patterns.
So, even for small problems, bad-quality data can ruin the training.
So, How Much Data Is Enough?
That’s a tricky question.
• There's no fixed rule for how much data is required.
• It depends on:
o The complexity of the task.
o The size and depth of the model.
o The quality of the training data.
However, one general rule holds true:
“In most cases, more high-quality data leads to better performance.”
In Summary:
• Deep models learn complex patterns but need lots of good data.
• Large datasets like ImageNet help models learn to recognize real-world
objects better.
• Simpler problems may need less data, but quality is still important.
• There’s no universal rule to decide how much data is enough.
• But overall, more clean and diverse data leads to higher accuracy.
3.5.3 Overfitting and Underfitting
What Is Generalization in Deep Learning?
After we train a model using training data, we expect it to work well on new
data — data it has never seen before.
This ability of the model to perform well on unseen data is called
generalization.

A good deep learning model:
• Should not just memorize the training data.
• Should be able to apply what it has learned to make correct predictions
on new data.
To check this, we evaluate the model on separate data that wasn’t used during
training: a validation set (used to monitor and tune the model while training)
and a test set (used for the final check).
Two Major Problems in Deep Learning
1. Overfitting
2. Underfitting
Let’s understand both in simple terms:
Overfitting – When the Model Learns Too Much
• Overfitting happens when the model performs very well on training data
but poorly on new data.
• It means the model is not really “learning,” but memorizing the training
examples.
• It fails to generalize.
• Imagine a student who memorizes answers without understanding —
they may score well in practice tests but fail in the real exam.
In graphs:
• Training error is very low.
• Validation (or test) error stops decreasing and starts to rise after some
point, even though training error keeps falling. This is the sign of
overfitting.
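The pattern in the graphs can also be spotted from the recorded error values. The sketch below uses made-up error curves purely for illustration; the helper name best_epoch is an assumption for this example.

```python
def best_epoch(val_errors):
    """Return the index of the epoch with the lowest validation error."""
    return min(range(len(val_errors)), key=lambda i: val_errors[i])

# Hypothetical error curves (made-up numbers for illustration only).
train_errors = [0.90, 0.60, 0.40, 0.25, 0.15, 0.08, 0.04, 0.02]
val_errors   = [0.92, 0.65, 0.48, 0.40, 0.38, 0.41, 0.47, 0.55]

epoch = best_epoch(val_errors)

# After this epoch the training error keeps falling while the validation
# error rises, so the gap between the two curves keeps growing.
gaps = [v - t for t, v in zip(train_errors, val_errors)]
```

Stopping training near the epoch with the lowest validation error is one simple way to limit overfitting.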
Underfitting – When the Model Fails to Learn
• Underfitting happens when the model doesn’t learn well even on
training data.
• It means the model is too simple or the training is not enough.
• The model performs poorly on both training and test data.
Why Overfitting Happens in Deep Learning?
Deep models like Convolutional Neural Networks (CNNs) have many layers and
millions of weights (parameters) to learn.
• If there’s not enough data, the model can memorize the training examples
instead of learning general patterns.
So:

• More parameters = more risk of overfitting
• Less data = even more risk of overfitting
How to Reduce Overfitting?
Here are some common techniques:
1. Increase Training Data
o The more data, the better the model learns real patterns.
2. Reduce the Size of the Network
o Smaller models have fewer parameters, so they’re less likely to
overfit.
3. Data Augmentation
o Create more training data by modifying the existing data.
o For example:
▪ Rotate the images
▪ Zoom in/out
▪ Shift or flip the images
o This makes the model see more variety.
4. Regularization (L1, L2)
o These are techniques that add a penalty on large weights to the loss
function (L1 penalizes absolute weight values, L2 penalizes squared
weight values).
o They help keep the model simpler and prevent memorization.
5. Dropout (Very Popular)
o During training, random neurons are turned off temporarily.
o This means:
▪ Those neurons don’t participate in that training round.
▪ It forces the network to not rely on a few specific neurons.
o As a result, the model becomes more flexible and generalizes
better.
Example:
o Think of a group project where sometimes a team member is
absent.
o Everyone else has to step up.

o So, everyone learns to contribute — just like in Dropout, where
all parts of the network learn to perform better.
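Dropout can be sketched in a few lines. The version below is a minimal illustration using only the standard library. It assumes the common "inverted dropout" variant, in which the surviving activations are divided by the keep probability so that their expected value is unchanged. The text above does not describe this rescaling, so treat it as an assumption of the example. The function name and sample values are also made up.

```python
import random

def dropout(activations, drop_prob, training=True, rng=random):
    """Inverted dropout on a list of activations (drop_prob must be < 1).

    During training each value is zeroed with probability drop_prob and the
    kept values are rescaled by 1 / (1 - drop_prob). Outside training the
    activations pass through unchanged.
    """
    if not training or drop_prob == 0.0:
        return list(activations)
    keep_prob = 1.0 - drop_prob
    out = []
    for a in activations:
        if rng.random() < keep_prob:
            out.append(a / keep_prob)   # neuron kept, output rescaled
        else:
            out.append(0.0)             # neuron turned off for this round
    return out

# Hypothetical layer output, with a seeded generator for repeatability.
rng = random.Random(0)
layer_output = [0.8, 0.1, 0.5, 0.9, 0.3, 0.7]
train_out = dropout(layer_output, drop_prob=0.5, rng=rng)
test_out = dropout(layer_output, drop_prob=0.5, training=False)
```

A different random set of neurons is dropped on every call during training, while at test time all neurons take part.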
Summary for Students:
• Generalization = how well a model performs on new data.
• Overfitting = model memorizes training data and fails on new data.
• Underfitting = model fails to learn even from training data.
• Overfitting is common in deep networks but can be reduced using:
o More data,
o Smaller models,
o Data augmentation,
o Regularization (L1/L2),
o Dropout (removing random neurons during training).

3.5.4 High-Performance Hardware

Why Training Deep Learning Models Needs Powerful Machines
Training deep learning models — especially on large datasets like ImageNet
— requires a lot of computing power and memory.
Why?
Because:
• Deep learning models have millions of parameters.

• They need to process a huge number of images or data samples many
times.
• This takes a lot of calculations and storage space.
The Role of GPUs (Graphics Processing Units)
To handle this heavy workload, we use GPUs (Graphics Processing Units)
instead of regular CPUs (like those in normal laptops).
• GPUs are very fast at handling large amounts of data in parallel.
• They help models train faster and make the process more efficient.
• GPUs with many cores (processing units) are especially helpful.
Think of a GPU like a super-fast calculator that can do thousands of math
problems at the same time, while a normal CPU can do only a few at once.
But There’s a Problem
Using high-performance GPUs and machines has some downsides:
1. They are expensive
o These machines cost a lot of money to buy and maintain.
2. They consume a lot of energy
o Running powerful GPUs requires a lot of electricity.
o This makes deep learning energy-hungry and not always eco-
friendly.
Why This Matters in Real-World Use
Because of the high cost and energy use:
• Using deep learning in real-world applications becomes expensive.
• Not every company or individual can afford to train such models.
• It’s important to optimize models and use resources wisely.
In Short (Summary for Students):
• Training deep learning models needs powerful machines with lots of
memory and fast processors.
• GPUs are used because they are much faster than normal CPUs for
deep learning.
• But, GPUs are costly and consume a lot of energy.
• So, using deep learning in real-life projects can be expensive and
energy-consuming.

4.2 LeNet-5

What is LeNet-5?
LeNet-5 is one of the earliest and most famous Convolutional Neural Networks
(CNNs), developed by Yann LeCun to recognize handwritten digits (like in the
MNIST dataset). It takes an image and processes it through several layers to
extract features and then classifies the image.
Input to the Network
• The input is a gray-scale image of size 32×32 pixels.

• The pixel values are normalized so that the mean becomes 0 and
variance becomes 1.
This makes the learning faster and more stable.
Structure of LeNet-5
LeNet-5 has 7 layers, excluding the input. These include:
• Convolutional layers (for extracting features),
• Subsampling (pooling) layers (for reducing size),
• Fully connected layers (for decision making).
Let’s break it down layer by layer:
Layer 1: Convolutional Layer (C1)
• Input size: 32×32
• Uses 6 filters (kernels) of size 5×5
• Output: 6 feature maps of size 28×28
• Number of trainable parameters: 156
• Activation function: Tanh (non-linear function)
This layer detects basic features like edges and curves.
Layer 2: Subsampling Layer (S2)
• Input: 6 feature maps of size 28×28
• Applies pooling with a 2×2 window
• Output: 6 feature maps of size 14×14
• Number of trainable parameters: 12
This layer reduces size and helps make the features more robust.
Layer 3: Convolutional Layer (C3)
• Input: 6 maps of size 14×14
• Uses 16 filters of size 5×5
• Output: 16 feature maps of size 10×10
• Number of trainable parameters: 1,516 (each C3 map is connected to only
a subset of the 6 input maps, so the count depends on the connection
scheme)
This extracts more complex features by combining previous ones.
Layer 4: Subsampling Layer (S4)
• Input: 16 maps of 10×10

• Applies 2×2 pooling
• Output: 16 maps of size 5×5
• Parameters: 32 trainable
Again reduces size while keeping the important information.
Layer 5: Convolutional Layer (C5)
• Input: 16 maps of 5×5
• Uses 120 filters of size 5×5
• Output: 120 feature maps of size 1×1
• Number of trainable connections (parameters): 48,120
At this point, we get compressed feature representations.
Layer 6: Fully Connected Layer (F6)
• Input: 120 values (from previous layer)
• Output: 84 neurons
• Trainable parameters: 10,164
This layer works like a traditional neural network and helps in classification.
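The parameter counts quoted for the layers so far can be reproduced with a little arithmetic. The sketch below is an illustration, with helper names chosen for this example. The C3 line assumes the partial connection scheme of the original LeNet-5 paper (6 maps connected to 3 input maps, 9 maps to 4 input maps, and 1 map to all 6), which is not spelled out in the text above.

```python
def conv_params(num_filters, kernel_size, in_maps):
    """Weights plus one bias per filter, for filters that see in_maps input maps."""
    return num_filters * (kernel_size * kernel_size * in_maps + 1)

def subsample_params(num_maps):
    """One trainable coefficient and one bias per feature map."""
    return 2 * num_maps

def dense_params(in_units, out_units):
    """One weight per input plus one bias, for every output neuron."""
    return out_units * (in_units + 1)

c1 = conv_params(6, 5, 1)        # 6 * (25 + 1)
s2 = subsample_params(6)
# C3: assumed connection scheme from the original LeNet-5 paper.
c3 = conv_params(6, 5, 3) + conv_params(9, 5, 4) + conv_params(1, 5, 6)
s4 = subsample_params(16)
c5 = conv_params(120, 5, 16)     # 120 * (16 * 25 + 1)
f6 = dense_params(120, 84)       # 84 * (120 + 1)
```

These evaluate to 156, 12, 1,516, 32, 48,120 and 10,164, which matches the figures listed for C1, S2, S4, C5 and F6.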
Output Layer (Softmax Layer)
• Input: 84 values
• Output: 10 neurons (one for each digit from 0 to 9)
Each neuron gives a probability for the input belonging to one of the 10
classes.
For example, for the digit “7”, the ideal output vector (positions counted from
digit 0) is:
[0, 0, 0, 0, 0, 0, 0, 1, 0, 0]
How LeNet-5 Learns (Training Steps)
1. Initialize all weights and filters with random values.
2. Forward pass: Image goes through all layers, and the network predicts
probabilities.
3. Calculate error: Compare predicted result with the actual label.
4. Backpropagation: Compute how much each weight contributed to the
error and update weights using gradient descent.
5. Repeat: Steps 2 to 4 are repeated for all images in the training set,
usually over several passes through the data.
Important Notes

• Convolution + Subsampling = Feature extraction
• Fully Connected layers = Classification
• The architecture is simple yet powerful and was the base for many
modern CNNs.
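• In the version described here, all layers except the output use Tanh
activation, and the output uses Softmax to give probabilities. (The
original 1998 network used a different output layer based on radial basis
functions.)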
Training of LeNet-5 – Step-by-Step Explanation
Training LeNet-5 means teaching the network how to recognize and correctly
classify handwritten digits (0–9) by adjusting its internal weights and filters
based on examples.
Step 1: Initialization
• All the filters and weights in the network are given random starting
values.
• These include:
o Weights in convolution layers,
o Weights in the fully connected layers.
Think of it like randomly guessing answers at first, and then learning the correct
ones step by step.
Step 2: Forward Propagation
• The input image (say, a handwritten "7") is passed through the network.
• It flows through:
o Convolutional layers (to detect features),
o Pooling layers (to reduce size and keep important features),
o Fully connected layers (to combine features and make decisions).
This process is called forward propagation.
• At the end of this step, the network gives a probability for each digit (0–
9).
o Example output: [0.01, 0.02, 0.01, 0.03, 0.01, 0.02, 0.01, 0.86,
0.01, 0.02] → Likely digit: 7
Step 3: Calculate Error
• The network’s predicted output is compared with the actual correct
answer (target label).
• The difference between them is called the error or loss.

o Example: The correct output for digit 7 should be [0, 0, 0, 0, 0, 0,
0, 1, 0, 0]
o The closer the predicted output is to this target, the better.
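The text does not name a specific loss function. As one common choice for softmax outputs, the sketch below assumes cross-entropy and applies it to the example vectors from Steps 2 and 3. The function name is chosen for this illustration.

```python
import math

def cross_entropy(predicted, target):
    """Cross-entropy between predicted probabilities and a one-hot target."""
    return -sum(t * math.log(p) for p, t in zip(predicted, target) if t > 0)

# Example vectors from the text: network output and target for digit 7.
predicted = [0.01, 0.02, 0.01, 0.03, 0.01, 0.02, 0.01, 0.86, 0.01, 0.02]
target    = [0, 0, 0, 0, 0, 0, 0, 1, 0, 0]

loss = cross_entropy(predicted, target)   # equals -ln(0.86)

# The predicted digit is the position with the highest probability.
predicted_digit = max(range(10), key=lambda d: predicted[d])
```

With a one-hot target, only the probability given to the correct digit matters: the loss is 0 when that probability is 1 and grows as it falls toward 0.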
Step 4: Backpropagation and Weight Update
• In this step, the network figures out how much each weight/filter
contributed to the error.
• It does this in two parts:
o Backpropagation calculates the gradients (slopes) of the error with
respect to each weight,
o Gradient descent uses those gradients to adjust the weights in a way
that the error will become smaller next time.
• Only the weights and filters are updated.
• The filter sizes, number of filters, and other design choices
(hyperparameters) are fixed and do not change during training.
This is like learning from your mistakes — changing the approach if something
goes wrong.
Step 5: Repeat for All Training Images
• Steps 2, 3, and 4 are repeated for each image in the training dataset.
• The network keeps improving with each image, gradually becoming
better at classification.
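The five steps can be seen in miniature in the sketch below. It is deliberately not LeNet-5: as a stand-in, it trains a single sigmoid neuron on a made-up one-dimensional dataset, so that the whole loop fits in a few lines of plain Python. The data, the learning rate, and all names are assumptions for illustration.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Hypothetical toy data: the label is 1 when the input is positive.
data = [(-2.0, 0), (-1.0, 0), (-0.5, 0), (0.5, 1), (1.0, 1), (2.0, 1)]

def total_loss(w, b):
    """Sum of cross-entropy losses over the toy dataset."""
    loss = 0.0
    for x, y in data:
        p = sigmoid(w * x + b)
        loss -= math.log(p if y == 1 else 1.0 - p)
    return loss

rng = random.Random(0)
w = rng.uniform(-0.5, 0.5)        # Step 1: random initialization
b = 0.0
lr = 0.5                          # learning rate (assumed value)

initial_loss = total_loss(w, b)

for epoch in range(200):          # Step 5: repeat over the training data
    for x, y in data:
        p = sigmoid(w * x + b)    # Step 2: forward pass
        error = p - y             # Step 3: compare prediction with label
        grad_w = error * x        # Step 4: gradients via the chain rule ...
        grad_b = error
        w -= lr * grad_w          # ... then the gradient descent update
        b -= lr * grad_b

final_loss = total_loss(w, b)
```

In a real CNN the forward pass and the gradient calculation run through every layer, but the loop has the same shape.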
Understanding Layer-Wise Transformations with Image Size
Let’s look at how the image size and number of feature maps change step by
step:

Layer             Operation      Output Size    Number of Feature Maps
Input Image       -              32×32          1
Conv Layer 1      5×5 filter     28×28          6
Pooling Layer 1   2×2 pooling    14×14          6
Conv Layer 2      5×5 filter     10×10          16
Pooling Layer 2   2×2 pooling    5×5            16
Conv Layer 3      5×5 filter     1×1            120
Fully Connected   -              1×1            84
Output Layer      -              1×1            10
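The sizes in the table follow from a simple formula: output = (input − filter) / stride + 1. The sketch below traces them, assuming stride 1 and no padding for the convolutions and non-overlapping 2×2 pooling (stride 2). Those settings are implied by the sizes in the table rather than stated explicitly, and the helper names are chosen for this example.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Output width/height of a convolution."""
    return (size + 2 * padding - kernel) // stride + 1

def pool_out(size, window, stride=None):
    """Output width/height of pooling (non-overlapping by default)."""
    stride = window if stride is None else stride
    return (size - window) // stride + 1

# (layer name, kind, filter or window size)
layers = [
    ("Conv Layer 1", "conv", 5),
    ("Pooling Layer 1", "pool", 2),
    ("Conv Layer 2", "conv", 5),
    ("Pooling Layer 2", "pool", 2),
    ("Conv Layer 3", "conv", 5),
]

size = 32   # input image is 32×32
trace = []
for name, kind, k in layers:
    size = conv_out(size, k) if kind == "conv" else pool_out(size, k)
    trace.append((name, size))
```

The trace gives 28, 14, 10, 5 and 1, matching the table row by row.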

Final Output
• The final layer gives 10 outputs, each representing a digit (0 to 9).
• The one with the highest probability is chosen as the prediction.
Summary in Simple Words
• LeNet-5 learns by going through many images and slowly correcting its
guesses.
• It uses forward pass to predict, compares it with the true label,
calculates error, and updates weights.
• This process is repeated many times until the network becomes good at
recognizing digits.
4.3 AlexNet

What is AlexNet?
AlexNet is a Convolutional Neural Network (CNN) model that became very
famous after it won the ImageNet Large Scale Visual Recognition Challenge
(ILSVRC) in 2012 by a large margin in accuracy. It showed that deep learning
works very well for large and complex image datasets.
It was designed to classify images into 1000 categories such as animals,
vehicles, furniture, etc.
Why was AlexNet Important?
Before AlexNet, deep neural networks were hard to train because of issues
like the vanishing gradient problem. AlexNet reduced this problem by using
the ReLU (Rectified Linear Unit) activation function instead of sigmoid or
tanh. This made training faster and more stable.
Dataset Used
AlexNet was trained on a subset of the ImageNet dataset:
• Training images: 1.2 million
• Testing images: 150,000
• Categories: 1000
Architecture of AlexNet (Simplified)
AlexNet has 8 layers with learnable weights:
• 5 Convolutional Layers – extract features (edges, shapes, patterns)
• 3 Fully Connected Layers – perform classification
A Softmax is applied to the output of the last fully connected layer to give
the final output as probabilities for each class.
Let’s break down the layers one by one
Layer-by-Layer Breakdown

Layer      Type              What it does                                  Details
Input      Image             Takes image of size 224×224×3 (RGB image)     ---
Conv1      Convolution       Uses 96 filters of size 11×11                 Output: 55×55×96
MaxPool1   Pooling           Reduces size using 3×3 window, stride 2       Output: 27×27×96
Conv2      Convolution       Uses 256 filters of size 5×5                  Output: 27×27×256
MaxPool2   Pooling           Same method, reduces to 13×13×256             Output: 13×13×256
Conv3      Convolution       Uses 384 filters of size 3×3                  Output: 13×13×384
Conv4      Convolution       Uses 384 filters of size 3×3                  Output: 13×13×384
Conv5      Convolution       Uses 256 filters of size 3×3                  Output: 13×13×256
MaxPool3   Pooling           Final downsampling to 6×6×256                 Output: 6×6×256
FC1        Fully Connected   Connects all outputs from previous layer      4096 neurons
FC2        Fully Connected   Another layer with 4096 neurons               4096 neurons
FC3        Fully Connected   Final layer with 1000 neurons                 1000 neurons
Softmax    Classifier        Outputs probability of each class             1000 probabilities
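The output sizes in the table can be traced with the usual formula, output = (input + 2 × padding − filter) / stride + 1. The table does not list strides or padding for the convolutions, so the values below (stride 4 for Conv1, padding 2 for Conv2, padding 1 for Conv3 to Conv5) are assumptions taken from commonly used descriptions of AlexNet. Note also that with an 11×11 filter, stride 4 and no padding, a 224×224 input gives 54 rather than 55. A 227×227 input reproduces the 55×55 shown in the table, so the sketch traces both.

```python
def out_size(size, kernel, stride=1, padding=0):
    """Output width/height of a convolution or pooling layer."""
    return (size + 2 * padding - kernel) // stride + 1

# (name, kernel, stride, padding, output channels)
# Strides and paddings of the convolutions are assumptions (see above).
layers = [
    ("Conv1",    11, 4, 0,  96),
    ("MaxPool1",  3, 2, 0,  96),
    ("Conv2",     5, 1, 2, 256),
    ("MaxPool2",  3, 2, 0, 256),
    ("Conv3",     3, 1, 1, 384),
    ("Conv4",     3, 1, 1, 384),
    ("Conv5",     3, 1, 1, 256),
    ("MaxPool3",  3, 2, 0, 256),
]

def trace(input_size):
    """Return (name, height, width, channels) after each layer."""
    size, shapes = input_size, []
    for name, kernel, stride, padding, channels in layers:
        size = out_size(size, kernel, stride, padding)
        shapes.append((name, size, size, channels))
    return shapes

shapes_227 = trace(227)   # reproduces the sizes in the table
shapes_224 = trace(224)   # first layer comes out as 54×54 instead
```

With a 227×227 input, the final 6×6×256 block contains 9,216 values, which are flattened and fed to FC1.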

Key Features Used in AlexNet

1. ReLU Activation Function
o Faster training than sigmoid/tanh
o Helps to avoid vanishing gradients.
2. Dropout Regularization
o Helps prevent overfitting by randomly turning off some neurons
during training.
3. Overlapping Pooling
o The pooling windows overlap (3×3 window with stride 2), unlike
traditional non-overlapping pooling, which keeps more information.
4. GPU Training
o AlexNet was one of the first large CNNs trained on GPUs to speed up
learning.
What Happens During Training?
1. Image is passed through convolution + pooling layers
2. Important features are extracted (edges, textures, patterns)
3. Output is passed to fully connected layers
4. Final softmax layer gives probability for each class
5. Backpropagation is used to update weights using gradient descent
Why Did AlexNet Succeed?
• Used deep architecture with many layers
• ReLU made training faster and more stable
• Used GPU to handle large image data
• Used smart tricks like dropout and overlapping pooling
Summary for Students
• AlexNet is a deep learning model for image classification.
• It uses convolution layers to learn patterns in images.
• Fully connected layers do the job of classifying the image.
• It popularized ReLU, dropout, and GPU-based training in deep learning.
• It proved that deep networks can perform very well on large image
datasets.
