Supervised Deep Learning Training Guide
• CNNs are a type of deep learning model that is especially good at:
o Recognizing patterns in images,
o Extracting important features (like edges, shapes),
o And classifying objects correctly.
• CNNs are one of the most commonly used models in supervised
learning, especially for image-related tasks (like face recognition,
medical imaging, self-driving cars, etc.).
3.2 Training Convolutional Neural Networks
Training Supervised Deep Neural Networks – Explained Simply
What Is the Goal of Training a Neural Network?
Training a deep learning model means finding the best set of weights (or
parameters) that help the model make accurate predictions.
To do this, we define a loss function (also called a cost function), which
tells us how far off the model’s predictions are from the correct answers.
The goal of training is to reduce this loss as much as possible — this means
the model is learning better.
How Do We Minimize the Loss?
We use a method called gradient descent, which is a step-by-step process
to adjust the weights of the network to reduce the loss.
Think of it like walking downhill to reach the lowest point (minimum error).
Backpropagation – How Learning Happens
The most common technique used in training deep learning models is
called:
Backpropagation with Gradient Descent
Here’s how it works step by step:
1. Initialize weights: Start with small random weights (drawn from a
probability distribution).
2. Feedforward: Give an input to the model and let it produce an output.
3. Calculate Error: Compare the output to the correct answer using a loss
function.
4. Backpropagate: Send the error backward through the network to
compute the gradient of the loss with respect to each weight.
5. Update Weights: Move each weight a small step in the direction opposite
to its gradient (the direction of steepest decrease), which reduces the
error.
6. Repeat: Do this for many inputs until the model performs well.
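The six steps above can be sketched on the simplest possible "network": a single linear neuron trained with squared error. This is only an illustration. The toy data, the learning rate of 0.1, and the names `train_neuron`, `w`, and `b` are assumptions made for the example; a real CNN runs the same loop over many more weights.

```python
import random

def train_neuron(data, lr=0.1, epochs=200, seed=0):
    """Fit y ~ w*x + b with plain gradient descent (toy example)."""
    rng = random.Random(seed)
    # 1. Initialize weights with small random values.
    w = rng.uniform(-0.5, 0.5)
    b = rng.uniform(-0.5, 0.5)
    for _ in range(epochs):              # 6. Repeat for many inputs.
        for x, y in data:
            y_hat = w * x + b            # 2. Feedforward.
            error = y_hat - y            # 3. Error; the loss is 0.5 * error**2.
            grad_w = error * x           # 4. Backpropagate: d(loss)/dw.
            grad_b = error               #    d(loss)/db.
            w -= lr * grad_w             # 5. Step against the gradient.
            b -= lr * grad_b
    return w, b

# Hypothetical data generated by y = 2x + 1.
toy_data = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]
w, b = train_neuron(toy_data)
print(round(w, 3), round(b, 3))
```

After training, `w` and `b` should be close to the values 2 and 1 that generated the toy data.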
Example CNN Architecture (Similar to LeNet)
Let’s understand this with an example of a small CNN model:
Input:
• Image of size 32 × 32 pixels
First Convolution Layer:
• Uses 6 filters of size 5 × 5
• Output: 6 feature maps of size 28 × 28
Filters slide across the image, detecting features like edges and textures.
First Max-Pooling Layer:
• Down-samples each 28 × 28 feature map to size 14 × 14
• This reduces size and helps the model focus on important features
Second Convolution Layer:
• Applies 16 filters of size 5 × 5
• Output: 16 feature maps of size 10 × 10
Second Max-Pooling Layer:
• Down-samples 10 × 10 maps to size 5 × 5
• Output: 16 feature maps, each of size 5 × 5
Third Convolution Layer:
• Uses 120 filters of size 5 × 5
• Each filter covers the full 5 × 5 area of all 16 input maps
• Output: 120 values (a flat vector)
Fully Connected Layer:
• Connects these 120 values to 10 output units (for 10 different classes)
• Output represents scores for each class
Softmax Classifier:
• Converts the 10 scores into probabilities
• The class with the highest probability is selected as the model’s
prediction
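The map sizes in this example follow from simple arithmetic: with no padding and stride 1, a k × k filter turns a map of width n into one of width n - k + 1, and 2 × 2 pooling with stride 2 halves the width. The short sketch below traces those sizes; the function names are hypothetical and the "no padding, stride 1" setting is an assumption that matches the numbers given above.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Output width of a convolution layer."""
    return (size + 2 * padding - kernel) // stride + 1

def pool_out(size, window=2, stride=2):
    """Output width of a pooling layer."""
    return (size - window) // stride + 1

def lenet_like_shapes(input_size=32):
    """Return (layer name, number of maps, map width) for each stage."""
    shapes = []
    size = conv_out(input_size, kernel=5)
    shapes.append(("C1", 6, size))
    size = pool_out(size)
    shapes.append(("P2", 6, size))
    size = conv_out(size, kernel=5)
    shapes.append(("C3", 16, size))
    size = pool_out(size)
    shapes.append(("P4", 16, size))
    size = conv_out(size, kernel=5)
    shapes.append(("C5", 120, size))
    return shapes

for name, maps, size in lenet_like_shapes():
    print(f"{name}: {maps} maps of {size} x {size}")
```

The last stage comes out as 120 maps of size 1 × 1, which is why C5 can be read as a flat vector of 120 values.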
Layer 1 (C1)
Understanding the First Two Layers in a CNN – Convolution and Max-
Pooling
In Convolutional Neural Networks (CNNs), the first few layers are used to
extract features from input images. Let's break down what's happening in
these layers step by step.
In simple terms:
You slide a 5 × 5 filter over the image, multiply corresponding values,
add them up, then add a bias. This gives one pixel in the output
feature map.
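The "slide, multiply, add up, add a bias" recipe can be sketched in plain Python as follows. The input image and the averaging filter are made-up values, and `conv2d_valid` is a hypothetical name. Like most deep learning code, the sketch does not flip the filter before sliding it.

```python
def conv2d_valid(image, kernel, bias=0.0):
    """Slide `kernel` over `image` (no padding, stride 1); return the feature map."""
    k = len(kernel)
    out_h = len(image) - k + 1
    out_w = len(image[0]) - k + 1
    feature_map = []
    for r in range(out_h):
        row = []
        for c in range(out_w):
            total = bias
            # Multiply corresponding values and add them up.
            for i in range(k):
                for j in range(k):
                    total += image[r + i][c + j] * kernel[i][j]
            row.append(total)            # one pixel of the output map
        feature_map.append(row)
    return feature_map

# Hypothetical 32 x 32 input and a 5 x 5 averaging filter.
image = [[float(r + c) for c in range(32)] for r in range(32)]
kernel = [[1.0 / 25] * 5 for _ in range(5)]
fmap = conv2d_valid(image, kernel, bias=0.5)
print(len(fmap), len(fmap[0]))
```

A 32 × 32 input and a 5 × 5 filter give a 28 × 28 feature map, matching layer C1.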
🔹 Layer 2: Max-Pooling Layer (P2)
Now the output from C1 is passed to the next layer: Max-Pooling.
• Input: 6 feature maps from C1 (each of size 28 × 28)
• Operation: Apply max-pooling with a 2 × 2 filter (usually with stride 2)
• Output: 6 feature maps of size 14 × 14 (reduced size)
In simple words:
Max-pooling shrinks the image while keeping the strongest features.
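A minimal sketch of 2 × 2 max-pooling with stride 2 is shown here. The 4 × 4 input values are invented for illustration and `max_pool2d` is a hypothetical name.

```python
def max_pool2d(feature_map, window=2, stride=2):
    """Keep the largest value in each window of the feature map."""
    pooled = []
    for r in range(0, len(feature_map) - window + 1, stride):
        row = []
        for c in range(0, len(feature_map[0]) - window + 1, stride):
            block = [feature_map[r + i][c + j]
                     for i in range(window) for j in range(window)]
            row.append(max(block))
        pooled.append(row)
    return pooled

fmap = [[1, 3, 2, 0],
        [4, 2, 1, 5],
        [0, 1, 7, 2],
        [3, 2, 4, 6]]
pooled = max_pool2d(fmap)   # [[4, 5], [3, 7]]
print(pooled)
```

Applied to a 28 × 28 map, the same function returns a 14 × 14 map.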
Summary for Students
Layer         Operation Type    Input Size   Output Size   Purpose
Convolution   6 filters (5×5)   32 × 32      28 × 28       Extract features like edges, textures
Max-Pooling   2 × 2 max-pool    28 × 28      14 × 14       Reduce size, highlight strong features
In simple terms:
Each of the 16 filters "looks at" all 6 input maps, processes them, adds
the results together, adds a bias, and passes it through ReLU. This gives
one new feature map per filter.
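The description above can be sketched as a convolution that sums over all input maps before applying ReLU. The constant inputs, filter weights, and biases below are invented so the result is easy to check by hand; `conv_multi_channel` is a hypothetical name.

```python
def relu(x):
    return x if x > 0 else 0.0

def conv_multi_channel(inputs, filters, biases):
    """inputs: C maps of H x W; filters: K filters, each C x k x k."""
    k = len(filters[0][0])
    out_h = len(inputs[0]) - k + 1
    out_w = len(inputs[0][0]) - k + 1
    outputs = []
    for f, b in zip(filters, biases):
        fmap = [[0.0] * out_w for _ in range(out_h)]
        for r in range(out_h):
            for c in range(out_w):
                total = b                                # add the bias
                for ch, channel in enumerate(inputs):    # look at every input map
                    for i in range(k):
                        for j in range(k):
                            total += channel[r + i][c + j] * f[ch][i][j]
                fmap[r][c] = relu(total)                 # pass through ReLU
        outputs.append(fmap)                             # one new map per filter
    return outputs

# Hypothetical values: 6 input maps of 14 x 14, 16 filters of 6 x 5 x 5.
inputs = [[[1.0] * 14 for _ in range(14)] for _ in range(6)]
filters = [[[[0.01] * 5 for _ in range(5)] for _ in range(6)] for _ in range(16)]
biases = [-2.0 if n % 2 else 0.0 for n in range(16)]
c3 = conv_multi_channel(inputs, filters, biases)
print(len(c3), len(c3[0]), len(c3[0][0]))
```

With these made-up numbers, filters with bias 0 give 1.5 everywhere (6 × 25 × 0.01), while filters with bias -2 give a negative sum that ReLU clips to 0.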
Layer 4 (P4): Second Max-Pooling Layer
• Input: 16 feature maps of size 10 × 10 (from C3)
• Operation: Max-pooling with a 2 × 2 window
• Output: 16 feature maps of size 5 × 5
What’s happening here?
Just like before, we reduce the size of each feature map by picking the
maximum value from each 2 × 2 block.
In simple terms:
Max-pooling shrinks each 10 × 10 map to 5 × 5 by selecting the most
important values.
Summary for Students
Layer   Type          Input                Output               Purpose
C3      Convolution   6 maps of 14 × 14    16 maps of 10 × 10   Extract deeper features using 5 × 5 filters and combine across channels
P4      Max-Pooling   16 maps of 10 × 10   16 maps of 5 × 5     Reduce size, highlight strongest features
What’s happening?
In this layer:
• Each filter has access to all 16 input maps.
• Each filter is the same size as the input (5 × 5), so it slides only once
over the input.
• That means for each of the 120 filters, we get just one number – like
compressing all information into a single feature.
In simple words:
Each of the 120 filters combines information from all 16 maps using a
weighted sum, adds a bias, and applies ReLU. This gives a single number
per filter – so we now have a vector of 120 numbers.
Layer 6 (F6): Fully Connected Layer
• Input: The 120 values from C5.
• Operation: Fully connected layer with 10 neurons – one for each class
(for example: digits 0 to 9).
• Output: 10 raw values, one from each neuron.
💡 In simple words:
Each of the 10 neurons takes all 120 inputs and computes a weighted
sum. These are the raw scores before final classification.
In simple words:
Softmax makes sure the 10 outputs become probabilities. So, if:
• Z_0 = 0.01
• Z_1 = 0.02
• …
• Z_7 = 0.85
then we can say the model is 85% confident that the input belongs to
class 7.
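A small sketch of softmax is given here. The ten raw scores are invented for illustration, and subtracting the largest score before exponentiating is a common numerical-stability trick rather than something stated in this guide.

```python
import math

def softmax(scores):
    """Turn raw class scores into probabilities that sum to 1."""
    shift = max(scores)                          # for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical raw outputs of the 10 neurons in F6.
raw_scores = [0.5, 1.0, -0.3, 0.2, 0.0, 1.2, 0.1, 4.0, 0.4, -1.0]
probs = softmax(raw_scores)
predicted_class = max(range(len(probs)), key=probs.__getitem__)
print(predicted_class, round(sum(probs), 6))
```

The class with the largest raw score (index 7 here) also receives the largest probability, so it becomes the prediction.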
Backward Pass
Loss Layer
What Happens During Training?
Goal of Training: Minimize the Loss
To improve the CNN’s accuracy, we want to minimize the error. We do this
by adjusting the weights of the network. This is done using a method called
backpropagation and an optimization method like gradient descent.
How Do We Update the Weights?
Let’s focus on the last layer (F6), which is fully connected and produces
10 outputs (for 10 classes).
For each neuron in this layer:
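One common way to write this update is sketched here. It assumes a softmax output trained with cross-entropy loss, which is an assumption because the loss is not named in this guide. Under that assumption the error term of neuron j is simply (predicted probability - target), and the gradient of each weight is that error times the input it multiplies. All function names are hypothetical.

```python
import math

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def f6_forward(W, b, x):
    """Raw scores z_j = sum_k W[j][k] * x[k] + b[j]."""
    return [sum(wjk * xk for wjk, xk in zip(row, x)) + bj
            for row, bj in zip(W, b)]

def cross_entropy(W, b, x, target):
    """Loss for one example whose correct class is `target`."""
    return -math.log(softmax(f6_forward(W, b, x))[target])

def f6_gradients(W, b, x, target):
    """Gradients of the loss with respect to W and b for one example."""
    probs = softmax(f6_forward(W, b, x))
    # Error term: predicted probability minus the one-hot target.
    delta = [p - (1.0 if j == target else 0.0) for j, p in enumerate(probs)]
    grad_W = [[d * xk for xk in x] for d in delta]
    grad_b = delta[:]
    return grad_W, grad_b, delta

def f6_update(W, b, x, target, lr=0.1):
    """One gradient descent step on the F6 weights and biases."""
    grad_W, grad_b, _ = f6_gradients(W, b, x, target)
    new_W = [[w - lr * g for w, g in zip(w_row, g_row)]
             for w_row, g_row in zip(W, grad_W)]
    new_b = [bj - lr * g for bj, g in zip(b, grad_b)]
    return new_W, new_b
```

In the network described here, `W` would have 10 rows of 120 weights and `x` would be the 120 outputs of C5.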
Summary for Students
Step Description
🔹Goal in Backward Pass
We want to update each weight w_{k,d,m,n} using the error calculated at
the output layer.
But C5 is a hidden layer, not the final layer. So first, we need to pass the
error from the last layer (F6) back to C5.
In Simple Terms:
• C5 receives input from P4 and applies convolution.
• In the forward pass, it generates 120 outputs (one 1 × 1 feature map per
filter).
• In the backward pass, we:
1. Pass back error from the next layer (F6) to C5.
2. Use this error to compute how much each neuron in C5 is
responsible (δ_k^{C5}).
3. Update the weights by calculating how much each weight
contributed to that error.
Hidden Layer (C3)
Backpropagation from C5 to C3 (via P4)
We are now going deeper into the convolutional network during training.
After computing the error at layer C5 and updating its weights, we need to
pass the error backward to layer C3, which is two layers back. But between
C5 and C3, we have the pooling layer P4.
Important Note about Max Pooling (P4)
We only propagate the error through those positions that were selected
during the max pooling operation in the forward pass.
• In max pooling, we pick the highest value from a small region (e.g.,
2×2).
• When going backward, only the position that held the maximum receives
the error. The other positions in the region receive zero.
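The routing rule above can be sketched by remembering, during the forward pass, where each maximum came from. The function names and the example numbers are hypothetical.

```python
def max_pool_forward(fmap, window=2):
    """Return the pooled map and the (row, col) of each selected maximum."""
    pooled, argmax = [], []
    for r in range(0, len(fmap) - window + 1, window):
        pooled_row, argmax_row = [], []
        for c in range(0, len(fmap[0]) - window + 1, window):
            candidates = ((fmap[r + i][c + j], (r + i, c + j))
                          for i in range(window) for j in range(window))
            value, position = max(candidates, key=lambda t: t[0])
            pooled_row.append(value)
            argmax_row.append(position)
        pooled.append(pooled_row)
        argmax.append(argmax_row)
    return pooled, argmax

def max_pool_backward(grad_pooled, argmax, input_shape):
    """Send each pooled error back to the position that won the max."""
    grad_input = [[0.0] * input_shape[1] for _ in range(input_shape[0])]
    for r, argmax_row in enumerate(argmax):
        for c, (in_r, in_c) in enumerate(argmax_row):
            grad_input[in_r][in_c] += grad_pooled[r][c]
    return grad_input

fmap = [[1, 3, 2, 0],
        [4, 2, 1, 5],
        [0, 1, 7, 2],
        [3, 2, 4, 6]]
pooled, argmax = max_pool_forward(fmap)
grad_input = max_pool_backward([[0.1, 0.2], [0.3, 0.4]], argmax, (4, 4))
for row in grad_input:
    print(row)
```

Only four of the sixteen input positions receive a non-zero error, one per 2 × 2 block.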
In Simple Words:
1. We start at the output layer, calculate the error, and then move
backward step by step.
2. To update C3, we must go through P4 (pooling).
3. Only the neurons selected during max pooling will carry the error
backward.
4. We calculate how much each neuron in C3 is responsible for the error
using its output and weights.
5. We then compute how much to adjust each weight in C3 to reduce the
overall error.
Hidden Layer (C1)
We want to update the weights in Layer C1 (the very first convolutional
layer) based on how much error the network made.
But before we can update, we need to:
1. Backpropagate the error from Layer C3, through the pooling layer P2,
to reach Layer C1.
2. Calculate delta values for Layer C1.
3. Use these deltas to calculate the gradients (i.e., how much each
weight should be changed).
In Simple Words:
• We are working backward to adjust the first layer's weights.
• We pass the error from the deeper layer (C3) back to the first layer
(C1).
• We use convolution (with a rotated filter) to figure out how much error
each neuron in C1 is responsible for.
• Then, we calculate how much each weight in that filter should be
changed by multiplying the delta and the input image values.
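Both ideas in the list above can be sketched for a single-channel convolution layer with stride 1 and no padding (a simplification; the real layers have several channels and a pooling layer in between). The weight gradient multiplies deltas by the input values they saw, and the error for the layer's input is obtained by sliding the 180° rotated filter over the zero-padded deltas. All names and numbers are hypothetical.

```python
def rotate180(kernel):
    """Flip a filter top-to-bottom and left-to-right."""
    return [row[::-1] for row in kernel[::-1]]

def zero_pad(matrix, p):
    """Surround `matrix` with a border of p zeros on every side."""
    width = len(matrix[0]) + 2 * p
    border = [[0.0] * width for _ in range(p)]
    middle = [[0.0] * p + list(row) + [0.0] * p for row in matrix]
    return border + middle + [row[:] for row in border]

def correlate_valid(a, k):
    """Slide k over a (no padding, stride 1), multiplying and summing."""
    kh, kw = len(k), len(k[0])
    return [[sum(a[r + i][c + j] * k[i][j]
                 for i in range(kh) for j in range(kw))
             for c in range(len(a[0]) - kw + 1)]
            for r in range(len(a) - kh + 1)]

def conv_backward(x, kernel, delta):
    """x: layer input, kernel: square filter, delta: error at the layer output."""
    # How much each filter weight should change: delta times input values.
    grad_kernel = correlate_valid(x, delta)
    # Error passed back to the input: rotated filter over padded deltas.
    grad_input = correlate_valid(zero_pad(delta, len(kernel) - 1),
                                 rotate180(kernel))
    return grad_kernel, grad_input

x = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]
kernel = [[1.0, 0.0], [0.0, -1.0]]
delta = [[1.0, 0.0], [0.0, 2.0]]
grad_kernel, grad_input = conv_backward(x, kernel, delta)
print(grad_kernel)
print(grad_input)
```

The input gradient has the same 3 × 3 shape as `x`, and the filter gradient has the same 2 × 2 shape as `kernel`.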
Real-World Analogy:
Imagine you're baking a cake (output). You made a mistake in taste
(error), and now you want to trace back and figure out how much of that
mistake came from sugar, salt, or flour (weights in C1).
• You're tracing back the flavor to the exact ingredient and amount you
used.
• You flip the recipe steps (like flipping the filter) and work backward
from the final result to the starting step.
3.4 Gradient Descent-Based Optimization Techniques
Gradient Descent is a popular method used in machine learning and
deep learning to train models by improving their accuracy step by step.
Let’s break it down:
• Every machine learning model has something called a cost function
(also known as a loss function).
This function measures how wrong the model’s predictions are.
The goal of training is to reduce this error as much as possible.
• To reduce this error, we need to adjust the model’s parameters (like
weights in a neural network). But how do we know which way to adjust
them?
• That’s where gradients come in. A gradient shows the direction and rate
of change of the cost function with respect to each parameter. Think of
it like the slope of a hill: the gradient points uphill, so moving the
opposite way takes you toward the bottom (minimum error).
• Gradient Descent is the technique that:
1. Calculates the gradients of the cost function,
2. Uses these gradients to update the parameters (move in the
direction where the error decreases),
3. Repeats this process until the error is as small as possible (or
until we reach a stopping condition).
• There are different variants (types) of Gradient Descent. These
variants change how and how fast the parameters are updated. Some
common ones include:
o Stochastic Gradient Descent (SGD)
o Mini-Batch Gradient Descent
o Momentum
o Adam Optimizer
Each variant has its own way of improving the learning process, either by
speeding it up, avoiding unnecessary oscillations, or making the learning
more stable.
3.4.1 Gradient Descent Variants
There are three main types of Gradient Descent, and the difference between
them is based on how much data they use at a time to calculate the
gradient (i.e., the direction and size of the update to the model).
Batch Gradient Descent (GD)
Understanding Traditional Gradient Descent (Batch Gradient Descent)
In traditional Gradient Descent (also called Batch Gradient Descent), we
use the entire training dataset to calculate the error and update the
model’s parameters (like weights).
Let’s break this down step by step:
How the Weight is Updated
Suppose we are adjusting a weight w in our model to reduce the error. The
formula for updating the weight is:
w = w - μ · ∇E(w)
Here’s what each part means:
• w – the weight we are updating
• ∇E(w) – this is called the gradient (or slope) of the error with respect to
the weight w. It tells us the direction in which the weight should change
to reduce the error.
• μ (mu) – this is the learning rate. It controls how big the step we take
in the direction of the gradient.
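The update rule can be sketched on a one-weight model y = w · x with mean squared error. The data, the learning rate, and the function names are assumptions chosen so the answer (w = 3) is known in advance.

```python
def batch_gradient(w, xs, ys):
    """Gradient of E(w) = mean((w*x - y)**2) / 2, using ALL examples."""
    n = len(xs)
    return sum((w * x - y) * x for x, y in zip(xs, ys)) / n

def batch_gradient_descent(xs, ys, lr=0.1, steps=100, w=0.0):
    for _ in range(steps):
        w = w - lr * batch_gradient(w, xs, ys)   # w = w - mu * grad E(w)
    return w

xs = [1.0, 2.0, 3.0, 4.0]       # hypothetical inputs
ys = [3.0, 6.0, 9.0, 12.0]      # targets generated by y = 3x
w_fit = batch_gradient_descent(xs, ys)
print(round(w_fit, 6))
```

Every single update reads the whole dataset, which is exactly what becomes expensive when there are millions of examples.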
What is Learning Rate?
• It is a hyperparameter (a value we set manually before training).
• If the learning rate is too high, the updates might be too large and we
may overshoot the best solution or even diverge.
• If it is too small, the updates are tiny, and the training becomes very
slow.
Choosing a good learning rate is important for effective learning.
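The effect of the learning rate is easy to see on the toy error function E(w) = w², whose gradient is 2w and whose minimum is at w = 0. The three learning rates below are arbitrary values chosen for illustration.

```python
def descend(lr, steps=20, w=1.0):
    """Run gradient descent on E(w) = w**2 starting from w."""
    for _ in range(steps):
        w = w - lr * 2 * w
    return w

too_high = descend(lr=1.1)      # each step multiplies w by -1.2: it diverges
too_low = descend(lr=0.001)     # each step multiplies w by 0.998: barely moves
reasonable = descend(lr=0.3)    # each step multiplies w by 0.4: converges fast
print(too_high, too_low, reasonable)
```

After 20 steps the first run has moved far away from the minimum, the second is still close to its starting point, and the third is essentially at zero.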
Why is it Slow in Practice?
• When we use the entire training dataset to compute the gradient, it can
take a lot of time and memory, especially if the dataset has thousands
or millions of examples.
• Imagine trying to load and process all that data at once — it’s like trying
to carry all your groceries in one trip. It's heavy and slow.
• This is why Batch Gradient Descent is not practical for large datasets
— it takes too long to compute and needs a lot of memory.
Stochastic Gradient Descent (SGD)
Stochastic Gradient Descent (SGD) – Explained in Simple Words
To solve the problems of traditional Gradient Descent (which is slow and
memory-heavy), we can use a faster technique called Stochastic Gradient
Descent, or SGD.
It is also known as Incremental Gradient Descent.
What Makes SGD Different?
• In SGD, instead of using all the training examples at once, we use just
one example at a time to:
1. Calculate the error (gradient),
2. Update the model parameters (like weights).
So the update happens immediately after seeing each training example.
This makes the learning process much faster, especially for very large
datasets.
Formula for Weight Update in SGD
w = w - μ · ∇E(w; x(i), y(i))
Here’s what this means:
• w – the weight (parameter) we are updating.
• μ – the learning rate (step size).
• ∇E(w; x(i), y(i)) – this is the gradient of the error (or loss function) with
respect to the weight w, but calculated only for one training example,
i.e., the pair {x(i), y(i)}.
This way, we make small updates after each individual data point.
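A minimal sketch of SGD on a one-weight model y = w · x follows. The slightly noisy data, the learning rate, and the random shuffling of examples each epoch are assumptions made for the example.

```python
import random

def sgd(xs, ys, lr=0.01, epochs=50, w=0.0, seed=0):
    """Update w after every single training example."""
    rng = random.Random(seed)
    order = list(range(len(xs)))
    history = []
    for _ in range(epochs):
        rng.shuffle(order)                        # visit examples in random order
        for i in order:
            grad = (w * xs[i] - ys[i]) * xs[i]    # gradient from ONE pair (x_i, y_i)
            w = w - lr * grad
            history.append(w)                     # record w after each update
    return w, history

xs = [1.0, 2.0, 3.0, 4.0]        # hypothetical inputs
ys = [3.1, 5.9, 9.2, 11.8]       # roughly y = 3x, with a little noise
w_sgd, history = sgd(xs, ys)
print(round(w_sgd, 2), len(history))
```

With 4 examples and 50 epochs the weight is updated 200 times, and because the data is noisy the recorded values keep wobbling slightly around 3 instead of settling exactly.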
What Happens to the Error (Loss Function)?
• Because we are updating the weights using only one sample at a time,
the value of the loss function tends to go up and down a lot during
training.
• This behavior is called fluctuation, and it makes the learning process a
bit noisy.
• However, these small fluctuations can also help the model avoid getting
stuck in bad solutions (local minima).
Summary (in Simple Words):
• SGD updates the model after seeing each training example, instead of
waiting for the whole dataset.
• This makes it faster and uses less memory than traditional Gradient
Descent.
• The training process is less smooth because the error jumps around,
but it can help reach a better solution in the end.
Mini-batch Gradient Descent
Mini-Batch Gradient Descent – Explained in Simple Words
Mini-Batch Gradient Descent, also called Mini-Batch SGD, is a method
that combines the advantages of both Standard (Batch) Gradient Descent
and Stochastic Gradient Descent (SGD).
How Does It Work?
• Instead of using all the training data at once (like Batch GD), or using
just one example at a time (like SGD), Mini-Batch Gradient Descent
breaks the data into small groups, called mini-batches.
• Each mini-batch contains n training examples (for example: 32, 64,
128, etc.).
• The model updates its parameters after processing each mini-batch, not
after each example or the whole dataset.
So, if your dataset has 10,000 examples and your batch size is 100, you’ll
have 100 updates in one full pass (called one epoch) over the data.
Weight Update Formula
w = w - μ · ∇E(w; x(i:i+n), y(i:i+n))
Here:
• w is the weight to be updated.
• μ is the learning rate.
• ∇E(w; x(i:i+n), y(i:i+n)) is the gradient (error slope) calculated for a mini-
batch of inputs (from sample i to i+n).
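The mini-batch update, and the "10,000 examples with batch size 100 gives 100 updates per epoch" arithmetic, can be sketched as follows. The one-weight model, data, and names are hypothetical, and the usual per-epoch shuffling is left out to keep the sketch short.

```python
def minibatches(n_examples, batch_size):
    """Yield (start, stop) index pairs that cover the dataset once."""
    for start in range(0, n_examples, batch_size):
        yield start, min(start + batch_size, n_examples)

def minibatch_gd(xs, ys, batch_size, lr=0.05, epochs=100, w=0.0):
    updates = 0
    for _ in range(epochs):
        for start, stop in minibatches(len(xs), batch_size):
            bx, by = xs[start:stop], ys[start:stop]
            # Average gradient over the examples in this mini-batch only.
            grad = sum((w * x - y) * x for x, y in zip(bx, by)) / len(bx)
            w = w - lr * grad
            updates += 1
    return w, updates

updates_per_epoch = len(list(minibatches(10_000, 100)))   # 100

xs = [1.0, 2.0, 3.0, 4.0]        # hypothetical inputs
ys = [3.0, 6.0, 9.0, 12.0]       # targets generated by y = 3x
w_mb, n_updates = minibatch_gd(xs, ys, batch_size=2)
print(updates_per_epoch, round(w_mb, 6), n_updates)
```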
What Batch Size Should You Choose?
Choosing the right mini-batch size is important. Here’s how it affects
learning:
Large Mini-Batches (e.g., 256, 512, etc.):
• Give more accurate estimates of the gradient.
• But require more memory and may slow down training if hardware (like
GPU) is not powerful enough.
Small Mini-Batches (e.g., 32, 64):
• Need less memory and can add a bit of randomness (which can help
prevent overfitting).
• But because of that randomness, they can be unstable if the learning
rate is too high.
• You’ll need to use a smaller learning rate, which can make the training
slower.
Summary (In Simple Words):
• Mini-Batch Gradient Descent processes a small group of training
examples at a time.
• It’s a good balance between speed (SGD) and accuracy (Batch GD).
• It's the most commonly used technique in deep learning.
• Choosing the right batch size is important: larger sizes are accurate but
heavy on memory; smaller sizes are faster but need careful tuning of
learning rate.
3.4.2 Improving Gradient Descent for Faster Convergence
The main objective of optimization is to minimize the cost/loss or objective
function. There are many methods available that help an optimization
algorithm to converge faster. Some of the commonly used methods are
discussed below.
AdaGrad
Understanding Learning Rate Problems in SGD and the AdaGrad
Solution
In Stochastic Gradient Descent (SGD), the learning rate (μ) is usually fixed
– we choose one value at the beginning and use it throughout the training.
But this can cause problems. Let’s understand why:
Problems with Fixed Learning Rate
1. If the gradient is large:
o A large learning rate will cause the model to take big steps.
o It may overshoot the best solution, and the model may keep
jumping around the minimum without settling.
2. If the gradient is small:
o A small learning rate means the model takes tiny steps.
o It will take a very long time to reach the minimum (slow learning).
So, using the same learning rate for all parameters and all situations
doesn’t always work well.
Solution: Use Adaptive Learning Rate – AdaGrad
To fix this, we can use an adaptive method that automatically adjusts the
learning rate during training.
AdaGrad (Adaptive Gradient Algorithm) is one such method.
How Does AdaGrad Work?
Instead of keeping the learning rate fixed, AdaGrad changes it based on
how big the gradients have been in the past.
Here's the idea:
• For each parameter (like each weight), it keeps track of the sum of
squares of the past gradients.
• It then divides the learning rate by the square root of that sum.
Weight Update Formula:
w(t+1),i = w(t),i - (μ / √Gi) · ∇t,i
Where:
• w(t),i = weight of the parameter i at time t
• ∇t,i = gradient of the loss with respect to that parameter at time t
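• Gi = the sum of the squares of the gradients of parameter i up to time t
(in practice a small constant ε is added to the denominator to avoid
division by zero)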
What Does This Achieve?
• If a parameter has had large gradients, its learning rate shrinks quickly,
so it doesn't jump too far.
• If a parameter has had small or infrequent gradients, its learning rate
shrinks more slowly and stays comparatively large, so it doesn’t get stuck.
This helps the model learn more efficiently without the need to manually
set the perfect learning rate.
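A minimal sketch of the AdaGrad update for a list of parameters is shown below. The small constant `eps` is an assumption added to avoid division by zero (it does not appear in the formula above), and the gradient values are invented to contrast a "large gradient" parameter with a "small gradient" one.

```python
import math

def adagrad_step(w, grad, G, lr=0.1, eps=1e-8):
    """One AdaGrad update; G holds each parameter's sum of squared gradients."""
    new_w, new_G = [], []
    for wi, gi, Gi in zip(w, grad, G):
        Gi = Gi + gi * gi                                  # accumulate g^2
        new_w.append(wi - lr / (math.sqrt(Gi) + eps) * gi)
        new_G.append(Gi)
    return new_w, new_G

w, G = [1.0, 1.0], [0.0, 0.0]
grads = [10.0, 0.1]             # hypothetical: one large, one small gradient
for _ in range(3):
    w, G = adagrad_step(w, grads, G)

effective_lr = [0.1 / math.sqrt(Gi) for Gi in G]
print(G, effective_lr)
```

After three steps the parameter with large gradients has a much smaller effective learning rate than the one with small gradients, and both accumulators can only grow, which is the source of the drawback discussed next.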
Drawback of AdaGrad
While AdaGrad adjusts the learning rate automatically, there’s one issue:
• The sum in the denominator keeps increasing over time.
• As a result, the learning rate keeps getting smaller and smaller.
• Eventually, it can become too small, causing the model to stop learning
completely (learning stalls).
Summary (in simple words):
• In regular SGD, a fixed learning rate may not work well for all
parameters.
• AdaGrad automatically adjusts the learning rate for each parameter:
o Big gradients → smaller steps
o Small gradients → bigger steps
• It helps speed up training and reduce the need for manual tuning.
• But over time, the learning rate may become too small, which can slow
down or stop learning.
AdaDelta
AdaDelta – An Improved Version of AdaGrad
AdaDelta is an advanced optimization technique that was created to fix a
major problem with AdaGrad.
Let’s first recall the issue with AdaGrad:
• In AdaGrad, the learning rate keeps getting smaller over time because
it keeps adding up all the past squared gradients.
• Eventually, the learning rate becomes so small that the model stops
learning.
What Does AdaDelta Do Differently?
AdaDelta solves this issue by making two key changes:
1. Instead of adding up all past gradients (which causes the learning rate
to decay), it only keeps a limited memory — a kind of moving average of
recent past gradients.
2. It uses this average to calculate a stable and adaptive learning rate.
How Does It Work?
• AdaDelta calculates an average of recent squared gradients, not all of
them.
• This average is updated at every time step and is called Avg∇²(t).
So, the updated weight formula becomes:
w(t+1) = w(t) - (μ / √(Avg∇²(t) + ε)) · ∇t
Or more commonly written as:
w(t+1) = w(t) - μ · ∇t / RMS(∇t)
Where:
• ∇t is the current gradient at time step t.
• Avg∇²(t) is the running average of past squared gradients.
• RMS(∇t) = √(Avg∇²(t) + ε) is the Root Mean Square of the gradients (just
a way to measure the average size of recent gradients); ε is a small
constant that avoids division by zero.
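The published AdaDelta method goes one step further than the formula above: it replaces μ with the RMS of recent weight updates, so no learning rate has to be chosen. The sketch below follows that fuller form. The decay value 0.95 and ε = 1e-6 are commonly used settings taken here as assumptions, and the toy error function E(w) = w² is invented for the example.

```python
import math

def adadelta_step(w, grad, avg_sq_grad, avg_sq_update, rho=0.95, eps=1e-6):
    """One AdaDelta update for a single parameter."""
    # Moving average of squared gradients (replaces AdaGrad's growing sum).
    avg_sq_grad = rho * avg_sq_grad + (1 - rho) * grad * grad
    rms_grad = math.sqrt(avg_sq_grad + eps)
    # RMS of previous updates plays the role of the learning rate.
    rms_update = math.sqrt(avg_sq_update + eps)
    update = -(rms_update / rms_grad) * grad
    avg_sq_update = rho * avg_sq_update + (1 - rho) * update * update
    return w + update, avg_sq_grad, avg_sq_update

# Minimize E(w) = w**2 (gradient 2*w), starting from w = 1.
w, avg_g, avg_u = 1.0, 0.0, 0.0
for _ in range(10):
    w, avg_g, avg_u = adadelta_step(w, 2 * w, avg_g, avg_u)
print(w, avg_g)
```

The average of squared gradients stays bounded by the largest squared gradient seen, so the step size does not decay toward zero the way it does in AdaGrad.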
Why Is This Useful?
• By using a moving average instead of a total sum, AdaDelta prevents
the learning rate from shrinking to zero.
• It allows the learning rate to adapt automatically based on how the
gradient is changing — like AdaGrad, but without the decay problem.
• Because the running average is updated in place at every step, it needs
only one stored value per parameter, not a history of past gradients.
Summary (in Simple Words):
• AdaDelta improves on AdaGrad by using a moving average of recent
gradients instead of accumulating all past gradients.
• This prevents the learning rate from becoming too small over time.
• It uses the Root Mean Square (RMS) of gradients to control the learning
rate.
• It is an adaptive method that automatically adjusts learning — no need
to manually set or reduce the learning rate.
RMSProp
RMSProp – Solving the Learning Rate Problem in AdaGrad
We’ve already seen that AdaGrad tends to reduce the learning rate too
much over time, which can slow down or even stop learning.
RMSProp is an improved version of AdaGrad that fixes this problem.
What Does RMSProp Do Differently?
The main idea in RMSProp is to avoid the vanishing learning rate problem
in AdaGrad by using a moving average of recent gradients, but giving more
importance to the latest gradients.
This is done using an exponentially weighted moving average – meaning
recent gradients are weighted more than older ones. It helps the model
forget the very old history, so the learning rate stays stable and effective.
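The exponentially weighted moving average described above can be sketched as a one-parameter RMSProp update. The decay of 0.9, the learning rate of 0.01, and the constant gradient of 2 are assumptions chosen to make the behaviour easy to follow.

```python
import math

def rmsprop_step(w, grad, avg_sq_grad, lr=0.01, decay=0.9, eps=1e-8):
    """One RMSProp update for a single parameter."""
    # Recent gradients count more; old ones fade out geometrically.
    avg_sq_grad = decay * avg_sq_grad + (1 - decay) * grad * grad
    w = w - lr * grad / (math.sqrt(avg_sq_grad) + eps)
    return w, avg_sq_grad

w, avg_sq = 0.0, 0.0
last_step = 0.0
for _ in range(200):
    new_w, avg_sq = rmsprop_step(w, 2.0, avg_sq)   # constant gradient of 2
    last_step = abs(new_w - w)
    w = new_w
print(avg_sq, last_step)
```

With a constant gradient the average settles near the squared gradient (4 here) instead of growing forever, so the step size settles near the learning rate rather than shrinking toward zero.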
How RMSProp Works (Step by Step):
Let’s understand the steps in simple terms:
(a) Keep Weight Updates Balanced
• RMSProp tries to make all weight updates have similar magnitude (size).
• We set maximum and minimum limits for how big or small the weight
updates can be.
(b) Adjust Learning Rate Based on Gradient Behavior
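• At each training step (iteration), we compare the sign of the current
gradient with the sign of the previous gradient.
• If the two signs agree, the step size is increased by a factor η⁺ = 1.2.
• The weight update becomes:
• Update = min(η⁺ · previous update, max)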
Adam
Adam combines the best features of two other optimization methods:
• AdaGrad – which handles sparse data well.
• RMSProp – which works well for non-stationary (changing) data.
What Makes Adam Special?
Adam keeps track of two types of information during training:
1. The average (mean) of gradients – this is called the first moment (mt).
It tells us the direction in which we should move.
2. The average of squared gradients – this is called the second moment
(vt).
It gives us an idea of the magnitude (size) of the updates.
These averages are exponentially weighted moving averages, meaning more
recent gradients have more influence than older ones.
How mt and vt Are Calculated
At each training step t, Adam calculates:
mt = β1 · mt−1 + (1 − β1) · gt
vt = β2 · vt−1 + (1 − β2) · gt²
Where:
• gt is the gradient at time step t
• mt is the moving average of the gradient (like momentum)
• vt is the moving average of the squared gradient
• β1 and β2 are hyperparameters (typically close to 1, like 0.9 and 0.999)
that control how much past information is remembered
Bias Correction (Why It’s Needed)
At the beginning of training (when t is small), both mt and vt can be biased
toward zero. To fix this, Adam uses bias-corrected versions:
m̂t = mt / (1 − β1ᵗ)
v̂t = vt / (1 − β2ᵗ)
These correct the early values to make them more accurate.
Final Weight Update Formula
Adam then updates the weights using the following rule:
wt+1 = wt − μ · (m̂t / (√v̂t + ε))
Where:
• μ is the learning rate (usually 0.001)
• ε (epsilon) is a small constant (like 1e-8) added to the denominator to
avoid division by zero
• m̂t is the bias-corrected mean (first moment)
• v̂t is the bias-corrected second moment (the uncentered variance of the
gradient)
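The formulas above translate almost line for line into code. This is a minimal single-parameter sketch using the typical values quoted in the text (μ = 0.001, β1 = 0.9, β2 = 0.999, ε = 1e-8); the function name and the toy error function E(w) = w² are assumptions.

```python
import math

def adam_step(w, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a single parameter; t starts at 1."""
    m = beta1 * m + (1 - beta1) * grad            # first moment (mean)
    v = beta2 * v + (1 - beta2) * grad * grad     # second moment
    m_hat = m / (1 - beta1 ** t)                  # bias correction
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (math.sqrt(v_hat) + eps)
    return w, m, v

# Minimize E(w) = w**2 (gradient 2*w), starting from w = 1.
w, m, v = 1.0, 0.0, 0.0
for t in range(1, 101):
    w, m, v = adam_step(w, 2 * w, m, v, t)
print(w)
```

On the very first step m is only 0.2 and v only 0.004, but after bias correction they become 2 and 4, so the first update has size of about μ instead of being shrunk toward zero.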
Why Adam Is So Powerful
Fast Convergence: Adam often reaches a low error in fewer steps than plain
SGD.
Stable Learning: It balances the direction and size of updates
automatically.
Little Manual Tuning: Adam adapts the step size for each parameter, so the
default settings work well on many problems, although the base learning
rate μ may still need tuning.
Handles Complex Problems Well: Adam works well even when the data is
noisy or changing.
Summary (in Simple Words):
• Adam = RMSProp + Momentum
• It keeps track of the average gradient (direction) and the average
squared gradient (magnitude).
• Uses exponentially weighted averages to give more weight to recent
updates.
• Automatically adjusts learning rate for each parameter.
• Learns quickly, efficiently, and reliably in most deep learning tasks.
3.5 Challenges in Training Deep Networks
Training a deep neural network is a challenging task, and some of the
prominent challenges in training deep models are discussed below.
3.5.1 Vanishing Gradient
Vanishing Gradient Problem in Deep Neural Networks – Explained Simply
When we train deep neural networks (networks with many layers), especially
those using activation functions like sigmoid or tanh, we face a serious issue
called the vanishing gradient problem.
Let’s understand what that means and how it affects learning.
What Is Backpropagation?
• Backpropagation is the method used to train a neural network.
• It works by calculating the error (difference between predicted and
actual output) and sending it backward through the layers.
• Based on this error, the weights of the network are updated using a
method called gradient descent.
What Is the Vanishing Gradient Problem?
• In deep networks (with many layers), as the error is backpropagated
from the output to the input, the gradients (used to update weights)
become smaller and smaller.
• Eventually, in the earlier layers (close to the input), the gradient
becomes almost zero.
• This means the initial layers stop learning — they don't get updated
properly.
This issue is called the vanishing gradient problem.
Why Does This Happen with Sigmoid?
Let’s look at the sigmoid activation function:
f(x) = 1 / (1 + e^(-x))
Its derivative is:
f'(x) = f(x) * (1 - f(x))
From this formula, you can see:
• The maximum value of the derivative is 0.25.
• That means every time the gradient passes through a sigmoid layer, it
gets multiplied by a value less than or equal to 0.25.
• So, if you have many layers, the gradient gets multiplied by small
numbers again and again, and becomes almost zero.
• As a result, the early layers in the network do not learn — they are left
almost unchanged.
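The shrinking effect can be checked numerically. The sketch below multiplies the sigmoid derivatives along one path through the layers; it ignores the weights, which is a simplifying assumption, and the function names are hypothetical.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)            # f'(x) = f(x) * (1 - f(x))

def gradient_scale(pre_activations):
    """Product of sigmoid derivatives along one path through the layers."""
    scale = 1.0
    for z in pre_activations:
        scale *= sigmoid_grad(z)
    return scale

# Best case: every layer sits at x = 0, where the derivative is largest.
best_case_10_layers = gradient_scale([0.0] * 10)    # 0.25 ** 10
print(best_case_10_layers)
```

Even in this best case, ten sigmoid layers scale the gradient by 0.25¹⁰, which is less than one millionth.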
Why Is This a Big Problem?
• In deep learning, we want all layers to learn, not just the last few.
• If the earlier layers don’t learn properly, the network performs poorly,
especially on complex tasks like image recognition or language
understanding.
How ReLU Fixes This Problem
To solve this, we use a different activation function called ReLU (Rectified
Linear Unit).
ReLU Function:
f(x) = x if x > 0, otherwise f(x) = 0
Derivative:
f′(x) = 1 when x > 0, and 0 when x ≤ 0
Why ReLU Helps:
• For positive inputs, the derivative is 1, so no shrinking happens.
• The gradients are much less likely to vanish as they pass back through
the layers (units whose input is negative pass no gradient at all).
• This helps deep networks train faster.
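The same kind of check for ReLU, again ignoring the weights as a simplifying assumption, shows the gradient passing through unchanged wherever the units are active. The input values are invented.

```python
def relu(x):
    return x if x > 0 else 0.0

def relu_grad(x):
    return 1.0 if x > 0 else 0.0

def gradient_scale(pre_activations):
    """Product of ReLU derivatives along one path through the layers."""
    scale = 1.0
    for z in pre_activations:
        scale *= relu_grad(z)
    return scale

active_path = gradient_scale([0.7, 2.0, 0.1, 5.0, 1.3])   # all inputs positive
blocked_path = gradient_scale([0.7, -2.0, 0.1])           # one negative input
print(active_path, blocked_path)
```

Along the active path the scale stays exactly 1.0, while a single unit with a negative input sets it to 0.0 for that path.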
That’s why ReLU has become the default activation function in modern deep
learning.
Summary (in Simple Words):
• In deep neural networks, gradients can vanish during training when
using sigmoid or tanh activation functions.
• This means earlier layers stop learning, which harms the performance.
• The sigmoid’s derivative is never larger than 0.25, causing the gradient
to shrink at each layer.
• ReLU fixes this because it has a derivative of 1 for positive values, so
gradients stay strong and training becomes effective.
3.5.2 Training Data Size
Why Deep Learning Needs a Lot of Data – Explained Simply
Deep learning models are powerful tools that can learn complex patterns and
relationships between input data (like images, text, etc.) and output labels
(like categories or predictions). But to work well, these models need a lot of
learning, and that learning comes from training data.
Why Do Deep Models Need So Much Data?
Deep networks:
• Have many layers.
• Learn nonlinear (complex) patterns.
• Contain millions of parameters (weights) that must be adjusted during
training.
Because of this complexity, they need:
• More examples to learn from.
• Larger datasets to avoid mistakes.
• Better quality data to learn useful patterns.
More data → better learning → better accuracy.
Example: ImageNet and Popular Models
Popular models like AlexNet, VGG, GoogLeNet, and ResNet were trained using
a very large image dataset called ImageNet.
• ImageNet has:
o Around 1.2 million images
o Spread across 1000 different categories (like dog, airplane, car,
etc.)
This huge dataset helped these models learn very well how to recognize and
classify images — even when objects are shown in different poses, sizes,
colors, and backgrounds.
But Do All Problems Need Big Data?
Not always.
Some tasks (for example, certain medical image classification problems):
• Are less complex.
• Have fewer variations in images.
• Can often be solved using smaller and simpler models.
• May not need millions of examples to train.
But still, data is important — not just quantity, but also quality.
Data Quality Matters
Imagine training a model on:
• Blurry images,
• Wrong labels,
• Or messy, irrelevant data.
That’s called noisy data.
In such cases, the model:
• Learns slower.
• Makes more mistakes.
• Needs more data to balance out the noise.
This is called a low Signal-to-Noise Ratio (SNR) — which means the model has
to work harder to find useful patterns.
So, even for small problems, bad-quality data can ruin the training.
So, How Much Data Is Enough?
That’s a tricky question.
• There's no fixed rule for how much data is required.
• It depends on:
o The complexity of the task.
o The size and depth of the model.
o The quality of the training data.
However, one general rule holds true:
“In most cases, more high-quality data leads to better performance.”
In Summary:
• Deep models learn complex patterns but need lots of good data.
• Large datasets like ImageNet help models learn to recognize real-world
objects better.
• Simpler problems may need less data, but quality is still important.
• There’s no universal rule to decide how much data is enough.
• But overall, more clean and diverse data leads to higher accuracy.
3.5.3 Overfitting and Underfitting
What Is Generalization in Deep Learning?
After we train a model using training data, we expect it to work well on new
data — data it has never seen before.
This ability of the model to perform well on unseen data is called
generalization.
A good deep learning model:
• Should not just memorize the training data.
• Should be able to apply what it has learned to make correct predictions
on new data.
To check this, we test the model on a separate dataset (called a test set or
validation set) that wasn’t used during training.
Two Major Problems in Deep Learning
1. Overfitting
2. Underfitting
Let’s understand both in simple terms:
Overfitting – When the Model Learns Too Much
• Overfitting happens when the model performs very well on training data
but poorly on new data.
• It means the model is not really “learning,” but memorizing the training
examples.
• It fails to generalize.
• Imagine a student who memorizes answers without understanding —
they may score well in practice tests but fail in the real exam.
In graphs:
• Training error is very low.
• Validation (or test) error stops falling at some point and starts to rise.
This growing gap between the two curves is the sign of overfitting.
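As a rough illustration of the graph described above, the sketch below scans the validation error recorded after each epoch and reports the last epoch before it stopped improving. The function name `first_overfit_epoch`, the `patience` parameter, and the error values are made up for this example.

```python
def first_overfit_epoch(val_err, patience=2):
    """Return the epoch with the lowest validation error, provided the error
    then fails to improve for `patience` epochs in a row; otherwise None."""
    best_epoch = 0
    for epoch in range(1, len(val_err)):
        if val_err[epoch] < val_err[best_epoch]:
            best_epoch = epoch                    # still improving
        elif epoch - best_epoch >= patience:
            return best_epoch                     # validation error has turned upward
    return None

# Made-up curves: training error keeps falling, validation error turns upward.
train_err = [0.50, 0.35, 0.25, 0.18, 0.12, 0.08, 0.05]
val_err = [0.52, 0.40, 0.33, 0.30, 0.32, 0.35, 0.39]
overfit_start = first_overfit_epoch(val_err)
```

Stopping training around the reported epoch is the idea behind "early stopping", which is a common companion to the techniques listed later in this section.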
Underfitting – When the Model Fails to Learn
• Underfitting happens when the model doesn’t learn well even on
training data.
• It means the model is too simple or has not been trained long enough.
• The model performs poorly on both training and test data.
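The two problems can be told apart by looking at the training and validation errors together. The sketch below is only a rule of thumb: the function name `diagnose_fit` and both thresholds are arbitrary values chosen for illustration, not standard constants.

```python
def diagnose_fit(train_err, val_err, high_err=0.20, max_gap=0.10):
    """Classify a trained model from its final training and validation error."""
    if train_err > high_err:
        return "underfitting"      # poor even on the data it was trained on
    if val_err - train_err > max_gap:
        return "overfitting"       # good on training data, poor on new data
    return "reasonable fit"

cases = {
    "too simple": diagnose_fit(0.35, 0.37),
    "memorized": diagnose_fit(0.02, 0.30),
    "balanced": diagnose_fit(0.05, 0.08),
}
```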
Why Does Overfitting Happen in Deep Learning?
Deep models like Convolutional Neural Networks (CNNs) have many layers and
millions of weights (parameters) to learn.
• With that much capacity and not enough data, the model can memorize the
training examples instead of learning general patterns.
So:
• More parameters = more risk of overfitting
• Less data = even more risk of overfitting
How to Reduce Overfitting?
Here are some common techniques:
1. Increase Training Data
o The more data, the better the model learns real patterns.
2. Reduce the Size of the Network
o Smaller models have fewer parameters, so they’re less likely to
overfit.
3. Data Augmentation
o Create more training data by modifying the existing data.
o For example:
▪ Rotate the images
▪ Zoom in/out
▪ Shift or flip the images
o This makes the model see more variety.
4. Regularization (L1, L2)
o These techniques add a penalty for large weights to the loss function
(L1 penalizes absolute values, L2 penalizes squared values).
o They help keep the model simpler and prevent memorization.
5. Dropout (Very Popular)
o During training, random neurons are turned off temporarily.
o This means:
▪ Those neurons don’t participate in that training round.
▪ It forces the network to not rely on a few specific neurons.
o As a result, the model becomes more robust and generalizes
better.
Example:
o Think of a group project where sometimes a team member is
absent.
o Everyone else has to step up.
o So, everyone learns to contribute — just like in Dropout, where
all parts of the network learn to perform better.
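A minimal sketch of Dropout on one layer's outputs, using only the Python standard library. The function name `dropout` and the sample values are made up. The scaling of the surviving neurons ("inverted dropout") is a common variant that is assumed here; it is not described in the text above.

```python
import random

def dropout(activations, drop_prob=0.5, training=True, rng=random):
    """Zero each activation with probability `drop_prob` during training.
    Survivors are scaled by 1 / (1 - drop_prob) so the expected total stays
    the same; outside training the values pass through unchanged."""
    if not training or drop_prob == 0.0:
        return list(activations)
    keep_prob = 1.0 - drop_prob
    return [a / keep_prob if rng.random() < keep_prob else 0.0
            for a in activations]

rng = random.Random(0)                  # fixed seed so the run is repeatable
layer_output = [0.2, 1.5, 0.7, 0.9, 0.1, 1.1]
train_out = dropout(layer_output, drop_prob=0.5, rng=rng)   # some values become 0.0
test_out = dropout(layer_output, training=False)            # nothing is dropped
```

A different random set of neurons is switched off on every call, which is what stops the network from relying on a few specific neurons.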
Summary for Students:
• Generalization = how well a model performs on new data.
• Overfitting = model memorizes training data and fails on new data.
• Underfitting = model fails to learn even from training data.
• Overfitting is common in deep networks but can be reduced using:
o More data,
o Smaller models,
o Data augmentation,
o Regularization (L1/L2),
o Dropout (temporarily switching off random neurons during training).
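Two of the techniques in this summary are easy to try by hand. The sketch below shows data augmentation (flip and shift) on a tiny image stored as a list of rows, and L1/L2 penalties added to a loss value. All function names and numbers are made up for illustration; real projects normally use a library for both.

```python
def flip_horizontal(image):
    """Mirror an image (list of rows) left to right."""
    return [row[::-1] for row in image]

def shift_right(image, pixels=1, fill=0):
    """Shift every row right by `pixels`, padding the left edge with `fill`."""
    return [[fill] * pixels + row[:len(row) - pixels] for row in image]

def l1_penalty(weights, lam=0.01):
    """L1 regularization term: lam times the sum of absolute weights."""
    return lam * sum(abs(w) for w in weights)

def l2_penalty(weights, lam=0.01):
    """L2 regularization term: lam times the sum of squared weights."""
    return lam * sum(w * w for w in weights)

image = [[1, 2, 3],
         [4, 5, 6]]
# One original image becomes three training examples.
augmented = [image, flip_horizontal(image), shift_right(image)]

data_loss = 0.40                                # made-up loss on the training data
weights = [0.5, -1.0, 2.0]
total_loss = data_loss + l2_penalty(weights)    # the penalized loss that training minimizes
```

Because large weights raise `total_loss`, gradient descent is pushed toward smaller weights, which tends to give a simpler model.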
• They need to process a huge number of images or data samples many
times.
• This takes a lot of calculations and storage space.
The Role of GPUs (Graphics Processing Units)
To handle this heavy workload, we use GPUs (Graphics Processing Units)
instead of regular CPUs (like those in normal laptops).
• GPUs are very fast at handling large amounts of data in parallel.
• They help models train faster and make the process more efficient.
• Multi-core GPUs (with many processing units) are especially helpful.
Think of a GPU like a super-fast calculator that can do thousands of math
problems at the same time, while a normal CPU can do only a few at once.
But There’s a Problem
Using high-performance GPUs and machines has some downsides:
1. They are expensive
o These machines cost a lot of money to buy and maintain.
2. They consume a lot of energy
o Running powerful GPUs requires a lot of electricity.
o This makes deep learning energy-hungry and not always eco-
friendly.
Why This Matters in Real-World Use
Because of the high cost and energy use:
• Using deep learning in real-world applications becomes expensive.
• Not every company or individual can afford to train such models.
• It’s important to optimize models and use resources wisely.
In Short (Summary for Students):
• Training deep learning models needs powerful machines with lots of
memory and fast processors.
• GPUs are used because they are much faster than normal CPUs for
deep learning.
• But, GPUs are costly and consume a lot of energy.
• So, using deep learning in real-life projects can be expensive and
energy-consuming.
4.2 LeNet-5
What is LeNet-5?
LeNet-5 is one of the earliest and most famous Convolutional Neural Networks
(CNNs), developed by Yann LeCun to recognize handwritten digits (like in the
MNIST dataset). It takes an image and processes it through several layers to
extract features and then classifies the image.
Input to the Network
• The input is a gray-scale image of size 32×32 pixels.
• The pixel values are normalized so that the mean becomes 0 and
variance becomes 1.
This makes the learning faster and more stable.
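One simple way to get mean 0 and variance 1 is to standardize the pixel values, as sketched below. This is an assumption for illustration: the exact normalization used by the original LeNet-5 may differ, and the function name `standardize` and the sample gray levels are made up.

```python
import math

def standardize(pixels):
    """Shift and scale a flat list of pixel values to mean 0 and variance 1."""
    n = len(pixels)
    mean = sum(pixels) / n
    variance = sum((p - mean) ** 2 for p in pixels) / n
    std = math.sqrt(variance) or 1.0     # avoid dividing by zero on a blank image
    return [(p - mean) / std for p in pixels]

raw = [0, 0, 255, 255, 128, 64, 192, 32]     # made-up gray levels (0 to 255)
normalized = standardize(raw)
```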
Structure of LeNet-5
LeNet-5 has 7 layers, excluding the input. These include:
• Convolutional layers (for extracting features),
• Subsampling (pooling) layers (for reducing size),
• Fully connected layers (for decision making).
Let’s break it down layer by layer:
Layer 1: Convolutional Layer (C1)
• Input size: 32×32
• Uses 6 filters (kernels) of size 5×5
• Output: 6 feature maps of size 28×28
• Number of trainable parameters: 156
• Activation function: Tanh (non-linear function)
This layer detects basic features like edges and curves.
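The numbers for C1 can be checked with a small sketch. It assumes stride 1 and no padding, which is what turns a 32×32 input into a 28×28 map with a 5×5 filter (32 − 5 + 1 = 28). The function name `conv2d_valid` and the dummy image and filter values are made up.

```python
def conv2d_valid(image, kernel, bias=0.0):
    """Slide a square `kernel` over a square `image` with stride 1 and no padding.
    As in most deep learning libraries, this is technically cross-correlation."""
    k = len(kernel)
    out_size = len(image) - k + 1
    return [[bias + sum(image[r + i][c + j] * kernel[i][j]
                        for i in range(k) for j in range(k))
             for c in range(out_size)]
            for r in range(out_size)]

image = [[1.0] * 32 for _ in range(32)]      # dummy 32x32 input
kernel = [[0.04] * 5 for _ in range(5)]      # dummy 5x5 filter
feature_map = conv2d_valid(image, kernel)    # one 28x28 feature map

# One 5x5 filter has 25 weights + 1 bias; C1 has 6 such filters.
c1_params = 6 * (5 * 5 + 1)
```

C1 repeats this with 6 different filters, which gives the 6 feature maps and the 156 trainable parameters listed above.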
Layer 2: Subsampling Layer (S2)
• Input: 6 feature maps of size 28×28
• Applies pooling with a 2×2 window
• Output: 6 feature maps of size 14×14
• Number of trainable parameters: 12
This layer reduces size and helps make the features more robust.
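A sketch of 2×2 subsampling with one trainable coefficient and one trainable bias per feature map, which is where the 12 parameters come from (6 maps × 2). It averages each block, a simplification of the original design, and it leaves out the activation function. The name `subsample_2x2` and the dummy values are made up.

```python
def subsample_2x2(feature_map, coeff=1.0, bias=0.0):
    """Average each non-overlapping 2x2 block, then apply one coefficient
    and one bias that are shared across the whole map."""
    size = len(feature_map) // 2
    return [[coeff * (feature_map[2 * r][2 * c] + feature_map[2 * r][2 * c + 1]
                      + feature_map[2 * r + 1][2 * c]
                      + feature_map[2 * r + 1][2 * c + 1]) / 4.0 + bias
             for c in range(size)]
            for r in range(size)]

fmap = [[float(r + c) for c in range(28)] for r in range(28)]   # dummy 28x28 map
pooled = subsample_2x2(fmap)                                    # 14x14 result

s2_params = 6 * 2     # 6 maps x (1 coefficient + 1 bias)
```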
Layer 3: Convolutional Layer (C3)
• Input: 6 maps of size 14×14
• Uses 16 filters of size 5×5
• Output: 16 feature maps of size 10×10
• Number of trainable parameters: 1,516 in the original LeNet-5 design,
where each filter connects to only some of the 6 input maps
This extracts more complex features by combining previous ones.
Layer 4: Subsampling Layer (S4)
• Input: 16 maps of 10×10
• Applies 2×2 pooling
• Output: 16 maps of size 5×5
• Number of trainable parameters: 32
Again reduces size while keeping the important information.
Layer 5: Convolutional Layer (C5)
• Input: 16 maps of 5×5
• Uses 120 filters of size 5×5
• Output: 120 feature maps of size 1×1
• Number of trainable parameters (connections): 48,120
At this point, we get compressed feature representations.
Layer 6: Fully Connected Layer (F6)
• Input: 120 values (from previous layer)
• Output: 84 neurons
• Trainable parameters: 10,164
This layer works like a traditional neural network and helps in classification.
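The parameter counts quoted for the layers so far can be recomputed with a few lines. The helper names are made up. The C3 count assumes the partial connection scheme of the original LeNet-5 paper (6 maps read 3 input maps, 9 maps read 4, and 1 map reads all 6), which is not spelled out in the text above.

```python
def conv_params(num_filters, kernel_size, in_maps):
    """Weights plus one bias per filter, each filter reading `in_maps` input maps."""
    return num_filters * (kernel_size * kernel_size * in_maps + 1)

def dense_params(inputs, outputs):
    """One weight per input plus one bias, for every output neuron."""
    return outputs * (inputs + 1)

c1 = conv_params(6, 5, 1)          # 6 filters on the single input image
s2 = 6 * 2                         # one coefficient + one bias per map
c3 = sum(n_maps * (5 * 5 * n_inputs + 1)
         for n_maps, n_inputs in [(6, 3), (9, 4), (1, 6)])
s4 = 16 * 2
c5 = conv_params(120, 5, 16)       # every filter reads all 16 maps
f6 = dense_params(120, 84)

counts = {"C1": c1, "S2": s2, "C3": c3, "S4": s4, "C5": c5, "F6": f6}
```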
Output Layer (Softmax Layer)
• Input: 84 values
• Output: 10 neurons (one for each digit from 0 to 9)
Each neuron gives a probability for the input belonging to one of the 10
classes.
For example, for the digit “7”, the ideal output vector (counting positions
from 0) looks like:
[0, 0, 0, 0, 0, 0, 0, 1, 0, 0]
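A small sketch of how a softmax layer turns 10 raw scores into probabilities, and how the ideal target for a digit is written as a vector. Positions are counted from 0, so digit 7 is at index 7. The function names and the scores are made up for illustration.

```python
import math

def softmax(scores):
    """Turn raw output scores into probabilities that sum to 1."""
    peak = max(scores)                           # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def one_hot(digit, num_classes=10):
    """Ideal output: 1 at the position of the correct digit, 0 elsewhere."""
    return [1 if i == digit else 0 for i in range(num_classes)]

scores = [0.1, -1.2, 0.3, 0.0, -0.5, 0.2, 0.4, 3.5, -0.8, 0.6]   # made-up scores
probs = softmax(scores)        # largest probability is at index 7
target = one_hot(7)
```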
How LeNet-5 Learns (Training Steps)
1. Initialize all weights and filters with random values.
2. Forward pass: Image goes through all layers, and the network predicts
probabilities.
3. Calculate error: Compare predicted result with the actual label.
4. Backpropagation: Compute how much each weight contributed to the
error and update weights using gradient descent.
5. Repeat: These steps are repeated for all images in the training set.
Important Notes
• Convolution + Subsampling = Feature extraction
• Fully Connected layers = Classification
• The architecture is simple yet powerful and was the base for many
modern CNNs.
• All layers except the output use Tanh activation. The output uses
Softmax to give probabilities.
Training of LeNet-5 – Step-by-Step Explanation
Training LeNet-5 means teaching the network how to recognize and correctly
classify handwritten digits (0–9) by adjusting its internal weights and filters
based on examples.
Step 1: Initialization
• All the filters and weights in the network are given random starting
values.
• These include:
o Weights in convolution layers,
o Weights in the fully connected layers.
Think of it like randomly guessing answers at first, and then learning the correct
ones step by step.
Step 2: Forward Propagation
• The input image (say, a handwritten "7") is passed through the network.
• It flows through:
o Convolutional layers (to detect features),
o Pooling layers (to reduce size and keep important features),
o Fully connected layers (to combine features and make decisions).
This process is called forward propagation.
• At the end of this step, the network gives a probability for each digit (0–
9).
o Example output: [0.01, 0.02, 0.01, 0.03, 0.01, 0.02, 0.01, 0.86,
0.01, 0.02] → Likely digit: 7
Step 3: Calculate Error
• The network’s predicted output is compared with the actual correct
answer (target label).
• The difference between them is called the error or loss.
o Example: The correct output for digit 7 should be [0, 0, 0, 0, 0, 0,
0, 1, 0, 0]
o The closer the predicted output is to this target, the better.
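The text does not name a specific loss function. A common choice with softmax outputs is cross-entropy, which is assumed in the sketch below. It uses the example prediction and target from Steps 2 and 3; the function name `cross_entropy` is made up.

```python
import math

def cross_entropy(predicted, target, eps=1e-12):
    """Cross-entropy between predicted probabilities and a one-hot target.
    `eps` guards against taking the logarithm of 0."""
    return -sum(t * math.log(p + eps) for p, t in zip(predicted, target))

predicted = [0.01, 0.02, 0.01, 0.03, 0.01, 0.02, 0.01, 0.86, 0.01, 0.02]
target = [0, 0, 0, 0, 0, 0, 0, 1, 0, 0]          # correct digit: 7

loss = cross_entropy(predicted, target)          # approximately -ln(0.86)
perfect = cross_entropy(target, target)          # approximately 0 for a perfect prediction
```

Only the probability given to the correct digit matters here: the closer it is to 1, the smaller the loss.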
Step 4: Backpropagation and Weight Update
• In this step, the network figures out how much each weight/filter
contributed to the error.
• It uses backpropagation together with gradient descent to:
o Calculate gradients (slopes) of the error with respect to each weight
(backpropagation),
o Use those gradients to adjust the weights so that the error
becomes smaller next time (gradient descent).
Only the weights and filters are updated.
• The filter sizes, number of filters, and other design choices
(hyperparameters) are fixed and do not change during training.
This is like learning from your mistakes — changing the approach if something
goes wrong.
Step 5: Repeat for All Training Images
• Steps 2, 3, and 4 are repeated for each image in the training dataset.
• The network keeps improving with each image, gradually becoming
better at classification.
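The five steps can be seen end to end in a very small model. The sketch below is not LeNet-5: it trains a single softmax layer on a made-up 2-feature dataset, with cross-entropy as an assumed loss, so that the whole loop fits in a few lines. The names, the learning rate, and the data are all illustrative.

```python
import math
import random

def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def forward(weights, biases, x):
    """Step 2: compute class probabilities for one input vector."""
    scores = [sum(w * xi for w, xi in zip(row, x)) + b
              for row, b in zip(weights, biases)]
    return softmax(scores)

def train(samples, num_classes, num_features, epochs=200, lr=0.5, seed=0):
    rng = random.Random(seed)
    # Step 1: small random starting weights.
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(num_features)]
               for _ in range(num_classes)]
    biases = [0.0] * num_classes
    history = []
    for _ in range(epochs):                          # Step 5: repeat over the data
        epoch_loss = 0.0
        for x, label in samples:
            probs = forward(weights, biases, x)      # Step 2: forward pass
            epoch_loss += -math.log(probs[label])    # Step 3: error for this sample
            for k in range(num_classes):             # Step 4: gradient and update
                # Gradient of cross-entropy w.r.t. score k is (probability - target).
                grad = probs[k] - (1.0 if k == label else 0.0)
                for j in range(num_features):
                    weights[k][j] -= lr * grad * x[j]
                biases[k] -= lr * grad
        history.append(epoch_loss / len(samples))
    return weights, biases, history

# Tiny made-up dataset: 2 features, 2 classes.
samples = [([0.0, 0.1], 0), ([0.2, 0.0], 0), ([1.0, 0.9], 1), ([0.8, 1.0], 1)]
weights, biases, history = train(samples, num_classes=2, num_features=2)
predictions = [max(range(2), key=lambda k: forward(weights, biases, x)[k])
               for x, _ in samples]
```

`history` holds the average error per pass over the data, so comparing its first and last entries shows the network improving. In LeNet-5 the same loop applies, but Step 4 has to pass the gradients back through every layer.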
Understanding Layer-Wise Transformations with Image Size
Let’s look at how the image size and number of feature maps change step by
step:
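These sizes follow from the layer descriptions given earlier. A minimal sketch, assuming stride 1 and no padding for the convolutions and stride 2 for the 2×2 subsampling, recomputes them. The helper name `conv_out` is made up.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Output width/height of a convolution or pooling window."""
    return (size + 2 * padding - kernel) // stride + 1

# (layer name, kind, window size, number of output maps)
layers = [
    ("C1", "conv", 5, 6),
    ("S2", "pool", 2, 6),
    ("C3", "conv", 5, 16),
    ("S4", "pool", 2, 16),
    ("C5", "conv", 5, 120),
]

size = 32
shapes = [("Input", size, 1)]
for name, kind, window, maps in layers:
    stride = 2 if kind == "pool" else 1
    size = conv_out(size, window, stride)
    shapes.append((name, size, maps))

for name, side, maps in shapes:
    print(f"{name}: {maps} map(s) of {side}x{side}")
```

After C5 the 120 maps of size 1×1 are simply 120 numbers, which feed the 84 neurons of F6 and then the 10 outputs.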
Final Output
• The final layer gives 10 outputs, each representing a digit (0 to 9).
• The one with the highest probability is chosen as the prediction.
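Choosing the prediction is just picking the position of the largest value. A minimal sketch, using the example output from Step 2 (the function name `predict_digit` is made up):

```python
def predict_digit(probabilities):
    """Return the index (digit) with the highest probability."""
    return max(range(len(probabilities)), key=lambda i: probabilities[i])

output = [0.01, 0.02, 0.01, 0.03, 0.01, 0.02, 0.01, 0.86, 0.01, 0.02]
digit = predict_digit(output)
```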
Summary in Simple Words
• LeNet-5 learns by going through many images and slowly correcting its
guesses.
• It uses a forward pass to predict, compares the prediction with the true
label, calculates the error, and updates the weights.
• This process is repeated many times until the network becomes good at
recognizing digits.
4.3 AlexNet
What is AlexNet?
AlexNet is a Convolutional Neural Network (CNN) model that became very
famous after it won the ImageNet competition in 2012 with a big margin in
accuracy. It showed that deep learning works very well for large and complex
image datasets.
It was designed to classify images into 1000 categories such as animals,
vehicles, furniture, etc.
Why was AlexNet Important?
Before AlexNet, deep neural networks were hard to train because of issues
like the vanishing gradient problem. AlexNet reduced this problem by using the
ReLU (Rectified Linear Unit) activation function, which was not yet widely used
in CNNs at the time. This made training faster and more stable.
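A rough way to see why ReLU helps: during backpropagation the gradient is multiplied by the activation's slope at every layer. The sketch below compares Tanh and ReLU for one positive input across 8 layers. It is a simplification that ignores the weights, and the depth and input value are made up.

```python
import math

def tanh_grad(x):
    """Slope of tanh at x; close to 0 when |x| is large."""
    return 1.0 - math.tanh(x) ** 2

def relu(x):
    return max(0.0, x)

def relu_grad(x):
    """Slope of ReLU: 1 for positive inputs, 0 otherwise."""
    return 1.0 if x > 0 else 0.0

depth = 8          # number of layers the gradient passes through
x = 3.0            # a fairly large pre-activation value

tanh_chain = tanh_grad(x) ** depth    # shrinks toward 0 (vanishing gradient)
relu_chain = relu_grad(x) ** depth    # stays at 1.0
```

With Tanh the product becomes vanishingly small, so the early layers barely learn. With ReLU the slope is exactly 1 for positive inputs, so the gradient survives.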
Dataset Used
AlexNet was trained on a subset of the ImageNet dataset:
• Training images: 1.2 million
• Testing images: 150,000
• Categories: 1000
Architecture of AlexNet (Simplified)
AlexNet has 8 main layers that learn from images:
• 5 Convolutional Layers – extract features (edges, shapes, patterns)
• 3 Fully Connected Layers – perform classification
• 1 Softmax Layer (no weights of its own) – gives final output as
probabilities for each class
Let’s break down the layers one by one.
Layer-by-Layer Breakdown
Layer      Type             What it does                        Details
Conv1      Convolution      Uses 96 filters of size 11×11       Output: 55×55×96
Conv2      Convolution      Uses 256 filters of size 5×5        Output: 27×27×256
Conv3      Convolution      Uses 384 filters of size 3×3        Output: 13×13×384
Conv4      Convolution      Uses 384 filters of size 3×3        Output: 13×13×384
Conv5      Convolution      Uses 256 filters of size 3×3        Output: 13×13×256
MaxPool3   Pooling          Final downsampling to 6×6×256       Output: 6×6×256
FC2        Fully Connected  Another layer with 4096 neurons     4096 neurons
FC3        Fully Connected  Final layer with 1000 neurons       1000 neurons
Softmax    Classifier       Outputs probability of each class   1000 probabilities
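The output sizes in the table can be reproduced with the usual output-size formula. The table does not give the input size, strides, padding, or the first two pooling layers, so the values below are assumptions based on common descriptions of AlexNet: a 227×227 input, stride 4 for Conv1, and 3×3 max pooling with stride 2 after Conv1, Conv2, and Conv5. The helper name `out_size` is made up.

```python
def out_size(size, kernel, stride=1, padding=0):
    """Output width/height of a convolution or pooling window."""
    return (size + 2 * padding - kernel) // stride + 1

# (name, kernel size, stride, padding) -- assumed values, see the note above
steps = [
    ("Conv1", 11, 4, 0),
    ("MaxPool1", 3, 2, 0),
    ("Conv2", 5, 1, 2),
    ("MaxPool2", 3, 2, 0),
    ("Conv3", 3, 1, 1),
    ("Conv4", 3, 1, 1),
    ("Conv5", 3, 1, 1),
    ("MaxPool3", 3, 2, 0),
]

size = 227
sizes = {}
for name, kernel, stride, padding in steps:
    size = out_size(size, kernel, stride, padding)
    sizes[name] = size

fc_inputs = sizes["MaxPool3"] ** 2 * 256    # values passed to the fully connected layers
```

Under these assumptions the sketch gives 55 for Conv1, 27 for Conv2, 13 for Conv3 to Conv5, and 6 after the last pooling, which matches the table.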