Training Supervised Deep Learning Networks
Training Convolutional Neural Networks
• Training a supervised deep neural network is formulated as minimizing a loss function.
• In this context, training means searching for the set of parameter values (weights) of the network at which the loss function attains its minimum.
• Gradient descent is an optimization technique which is used to minimize the error by
calculating gradients necessary to update the values of the parameters of the network.
• The most common and successful learning algorithm for deep learning models is gradient descent-based backpropagation, in which the error is propagated backward from the last layer to the first layer.
• In this learning technique, all the weights of a neural network are initialized randomly, typically by sampling them from a chosen probability distribution. An input is fed through the network to get the output. The obtained output and the desired output are then used to calculate the error using a cost function (error function).
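The loop described above (initialize, feed forward, compute the error, update) can be sketched in a few lines. This is only an illustration: the one-weight linear model, the squared-error cost, and the name `train_step` are assumptions, not something the slides specify.

```python
import random

def train_step(w, b, x, target, lr=0.1):
    """One forward pass, error computation, and gradient descent update."""
    output = w * x + b                      # forward pass
    error = 0.5 * (output - target) ** 2    # cost function (squared error)
    grad_out = output - target              # dE/d(output)
    grad_w = grad_out * x                   # chain rule: dE/dw
    grad_b = grad_out                       # chain rule: dE/db
    return w - lr * grad_w, b - lr * grad_b, error

random.seed(0)
w = random.uniform(-1, 1)                   # random initialization
b = random.uniform(-1, 1)

errors = []
for _ in range(50):
    w, b, err = train_step(w, b, x=2.0, target=3.0)
    errors.append(err)

print("first error:", errors[0], "last error:", errors[-1])
```

In a real network the same chain rule is applied layer by layer, from the output back to the input.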
• To see how backpropagation works, consider a small Convolutional Neural Network (CNN) model.
• Flow: 32×32 → Conv (6×28×28) → Pool (6×14×14) → Conv (16×10×10) → Pool
(16×5×5) → Conv (120×1×1) → FC (10) → Softmax
• Hierarchy of features: Edges → Shapes → Complex structures → Classification.
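• Convolution Formula (Equation 3.1)
• The mathematical operation is:
$$C_{i,j}^{k} = \sum_{m=0}^{4} \sum_{n=0}^{4} w_{m,n}^{k} \, I_{i+m,\,j+n} + b^{k}$$
• $C_{i,j}^{k}$ = value at position (i,j) of the k-th feature map
• $w_{m,n}^{k}$ = weight at position (m,n) of the k-th filter
• $I_{i+m,\,j+n}$ = pixel value from the input image at shifted position (i+m, j+n)
• $b^{k}$ = bias term for the k-th filter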
• Intuition:
The filter (5×5) slides over the image → multiplies and sums pixel values →
produces a feature map that highlights patterns like edges, textures, etc.
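The sliding multiply-and-sum can be written out directly. The sketch below assumes stride 1 and no padding, and uses a made-up vertical-edge filter and toy image purely for illustration; `conv2d_valid` is a hypothetical helper name.

```python
def conv2d_valid(image, kernel, bias=0.0):
    """Slide `kernel` over `image` (lists of lists), stride 1, no padding."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    feature_map = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            total = bias
            for m in range(kh):
                for n in range(kw):
                    # weight at (m, n) times the pixel at the shifted position
                    total += kernel[m][n] * image[i + m][j + n]
            row.append(total)
        feature_map.append(row)
    return feature_map

# Toy 32x32 image: left half dark (0), right half bright (1)
image = [[0.0] * 16 + [1.0] * 16 for _ in range(32)]
# Hypothetical 5x5 vertical-edge filter: -1 in the left column, +1 in the right
kernel = [[-1.0, 0.0, 0.0, 0.0, 1.0] for _ in range(5)]

fmap = conv2d_valid(image, kernel)
print(len(fmap), len(fmap[0]))   # 28 28
```

The feature map is non-zero only where the filter straddles the dark-to-bright boundary, which is the sense in which a filter "highlights" an edge.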
Layer | Input → Output | Filter / Operation | # Feature Maps | Size Change | Special Note
C1 (Convolution 1) | 32×32 (grayscale image) → 6×28×28 | 6 filters, each 5×5×1 | 6 | 32 → 28 | First convolution. Each filter sees raw pixels. Summation over (m,n) only.
P2 (Pooling 1) | 6×28×28 → 6×14×14 | 2×2 max pooling | 6 | 28 → 14 | Reduces size by half, keeps max values, no learnable parameters.
C3 (Convolution 2) | 6×14×14 → 16×10×10 | 16 filters, each 5×5×6 | 16 | 14 → 10 | Deeper convolution. Each filter now combines info from all 6 input maps, so the summation runs over (d,m,n).
P4 (Pooling 2) | 16×10×10 → 16×5×5 | 2×2 max pooling | 16 | 10 → 5 | Again halves size, keeps strongest features.
C5 (Convolution 3) | 16×5×5 → 120×1×1 | 120 filters, each 5×5×16 | 120 | 5 → 1 | This is like a fully connected layer because the filter covers the entire input.
F6 (Fully Connected) | 120 → 84 neurons | Dense connections | 84 | – | Classic fully connected layer, learns high-level combinations.
Output Layer | 84 → 10 classes | Fully connected + Softmax | 10 | – | Produces class probabilities.
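The sizes in the table follow from two rules: a 5×5 convolution with stride 1 and no padding shrinks each side by 4, and 2×2 pooling halves it. The sketch below recomputes the shapes and also counts learnable parameters, assuming every filter spans all input maps (as the table states) and has one bias; the parameter counts are derived under that assumption and are not given in the slides.

```python
def conv_out(size, kernel):
    """Output width/height of a stride-1, unpadded convolution."""
    return size - kernel + 1

def pool_out(size, window=2):
    """Output width/height of non-overlapping pooling."""
    return size // window

# (name, kind, number of filters, kernel/window size), taken from the table
layers = [
    ("C1", "conv", 6, 5),
    ("P2", "pool", None, 2),
    ("C3", "conv", 16, 5),
    ("P4", "pool", None, 2),
    ("C5", "conv", 120, 5),
]

size, channels = 32, 1            # 32x32 grayscale input
shapes, params = {}, {}
for name, kind, n_filters, k in layers:
    if kind == "conv":
        # Assumption: every filter spans all input maps and has one bias
        params[name] = n_filters * (k * k * channels + 1)
        size, channels = conv_out(size, k), n_filters
    else:
        params[name] = 0          # pooling has no learnable parameters
        size = pool_out(size, k)
    shapes[name] = (channels, size, size)

# Fully connected layers: one weight per input plus one bias per neuron
params["F6"] = 84 * (120 + 1)
params["Output"] = 10 * (84 + 1)

for name in shapes:
    print(name, shapes[name], params[name])
```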
Gradient Descent-Based Optimization Techniques
• Gradient descent is an optimization technique used to minimize the cost function by calculating the gradients needed to update the values of the parameters of the network.
There are three commonly used Gradient Descent (GD) variants:
i. Batch Gradient Descent (GD)
ii. Stochastic Gradient Descent (SGD)
iii. Mini-batch Gradient Descent
Batch Gradient Descent (GD)
• Traditional Gradient Descent (GD) is also known as batch gradient descent.
• The error gradient with respect to the weight parameter w is computed over the entire training set, and only then is w updated. In other words, it uses all the training data to compute a single gradient per update.
• Update rule: $w_{t+1} = w_t - \mu \, \nabla_w E(w_t)$, where $\mu$ is the learning rate and $E$ is the error over the whole training set.
• Pros: Stable, exact gradient.
• Cons: Very slow, requires huge memory if the dataset is large.
• When to use:
• Only when the dataset is small (fits easily into memory).
• Example: A dataset with a few thousand rows (as in simple regression problems).
• Why:
• The gradient is computed on the whole dataset for every update, so each step is slow on big data.
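A minimal sketch of batch gradient descent, assuming a hypothetical one-parameter model y = w·x with squared error and a four-point toy dataset (none of which come from the slides). Note that the whole dataset is visited before each single update.

```python
def batch_gradient(w, data):
    """Mean gradient of 0.5 * (w*x - y)^2 over the whole dataset."""
    return sum((w * x - y) * x for x, y in data) / len(data)

data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]   # generated by y = 2x

w, lr = 0.0, 0.05
for epoch in range(100):
    # one update per full pass over the data
    w = w - lr * batch_gradient(w, data)

print(w)
```

With 100 passes there are exactly 100 updates, which is why this variant becomes slow once a single pass is expensive.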
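Stochastic Gradient Descent (SGD)
• The above problem can be addressed by using Stochastic Gradient Descent (SGD).
• It is also known as incremental gradient descent.
• The gradient is computed for one training example at a time, and the parameter values are updated immediately afterwards. In other words, it uses just 1 example per update.
• Each update is much cheaper than in standard gradient descent because it needs only one example, so the parameters are updated far more often per pass over the data.
• Update rule: $w_{t+1} = w_t - \mu \, \nabla_w E(w_t; x_i, y_i)$, where $(x_i, y_i)$ is a single training example.
• Pros: Very fast updates; the noise in the updates can help escape shallow local minima.
• Cons: Updates fluctuate a lot (zig-zag path).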
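A sketch of SGD on the same kind of toy problem, again assuming a hypothetical model y = w·x and slightly noisy made-up data. The weight is updated after every single example, so the recorded history jitters instead of following a smooth path.

```python
import random

def sgd(data, lr=0.02, epochs=50, seed=0):
    rng = random.Random(seed)
    w = 0.0
    history = []
    for _ in range(epochs):
        samples = data[:]
        rng.shuffle(samples)              # visit examples in random order
        for x, y in samples:
            grad = (w * x - y) * x        # gradient from a single example
            w -= lr * grad                # update immediately
            history.append(w)
    return w, history

# Noisy observations of roughly y = 2x
data = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2), (4.0, 7.8)]
w, history = sgd(data)
print(w, len(history))
```

Fifty passes over four examples give 200 updates, compared with 50 for batch gradient descent over the same number of passes.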
Mini-batch Gradient Descent
• Mini-batch gradient descent, also known as mini-batch SGD, is a combination of standard gradient descent and SGD.
• Mini-batch SGD divides the entire training set into mini-batches of n training examples and updates the parameter values once per mini-batch. In other words, it uses a small batch (say 32, 64, or 128 examples) per update.
• It thereby takes advantage of both standard gradient descent and SGD.
• It is a commonly used optimization technique in deep learning.
• Pros: Best of both worlds: efficient, less noisy, works well with GPUs.
• Cons: Needs careful batch size selection (too big = memory issue, too small = unstable).
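A sketch of the mini-batch variant, assuming the same hypothetical model y = w·x. The batch size of 4 and the ten-point dataset are chosen only to keep the example small; the helper names are illustrative.

```python
import random

def minibatches(data, batch_size, rng):
    """Shuffle the data and yield consecutive slices of `batch_size` examples."""
    shuffled = data[:]
    rng.shuffle(shuffled)
    for start in range(0, len(shuffled), batch_size):
        yield shuffled[start:start + batch_size]

def minibatch_sgd(data, batch_size=4, lr=0.01, epochs=200, seed=0):
    rng = random.Random(seed)
    w, updates = 0.0, 0
    for _ in range(epochs):
        for batch in minibatches(data, batch_size, rng):
            # average gradient over the mini-batch only
            grad = sum((w * x - y) * x for x, y in batch) / len(batch)
            w -= lr * grad
            updates += 1
    return w, updates

data = [(float(x), 2.0 * x) for x in range(1, 11)]   # 10 examples, y = 2x
w, updates = minibatch_sgd(data)
print(w, updates)
```

Ten examples with a batch size of 4 give batches of 4, 4, and 2, so each pass produces three updates.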
Improving Gradient Descent for Faster Convergence
1) AdaGrad (Adaptive Gradient Algorithm)
In standard Stochastic Gradient Descent (SGD), the learning rate is fixed for all parameters, which can cause
issues. If the gradient is large, a large learning rate might overshoot the optimum, and if the gradient is small,
convergence becomes very slow.
AdaGrad addresses this by adapting the learning rate for each parameter individually.
It keeps track of the sum of squares of all previous gradients for each parameter, and divides the learning rate
by the square root of this accumulated value.
Formula:
w(t+1,i) = w(t,i) − μ / √G(i) * ∇(t,i)
Here, G(i) represents the sum of squared gradients for parameter i.
Effectively, parameters with large gradients get smaller learning rates, and parameters with small gradients
get larger learning rates.
Advantage: Learning rate is adjusted automatically.
Limitation: The sum in the denominator increases over time, causing the learning rate to decay too much,
which can slow or stop training.
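A minimal sketch of the AdaGrad update for a list of parameters. The small constant `eps` (to avoid division by zero) is an assumption added here and does not appear in the formula above; the two-parameter loss is a made-up example with one steep and one shallow direction.

```python
import math

def adagrad(grad_fn, w, lr=0.5, steps=100, eps=1e-8):
    G = [0.0] * len(w)        # per-parameter sum of squared gradients
    for _ in range(steps):
        g = grad_fn(w)
        for i in range(len(w)):
            G[i] += g[i] ** 2
            # each parameter gets its own effective learning rate
            w[i] -= lr / (math.sqrt(G[i]) + eps) * g[i]
    return w, G

# Hypothetical loss 0.5 * (100*w0^2 + w1^2): steep in w0, shallow in w1
grad_fn = lambda w: [100.0 * w[0], 1.0 * w[1]]
w, G = adagrad(grad_fn, [1.0, 1.0])
print(w, G)
```

Although the gradient for `w0` is 100 times larger, both parameters follow almost the same path, because each gradient is divided by its own accumulated magnitude. `G` only ever grows, which is the decay problem named in the limitation.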
2) AdaDelta
AdaDelta is an improved version of AdaGrad that prevents the learning rate from continuously decaying. Instead of summing all past squared gradients, it restricts the history to a window of recent gradients.
In practice the window is implemented as an exponentially decaying average of squared gradients, which helps maintain a balanced learning rate.
Formula (simplified):
w(t+1) = w(t) − μ / RMS(∇t) * ∇t
Here, RMS(∇t) is the Root Mean Square of recent gradients. This ensures the denominator stays within a useful range. In the full AdaDelta method, μ is replaced by RMS(Δw(t−1)), the Root Mean Square of recent parameter updates, which is why no global learning rate has to be tuned manually.
Advantages:
• Prevents vanishing learning rate.
• No manual tuning of global learning rate required.
• Performs well in practice for deep networks.
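A sketch of the AdaDelta update in its full published form, where the RMS of recent parameter updates takes the place of the learning rate μ. The decay factor `rho`, the constant `eps`, and the toy loss 0.5·w² are assumptions chosen for illustration.

```python
import math

def adadelta(grad_fn, w, rho=0.95, eps=1e-6, steps=500):
    avg_sq_grad = 0.0      # decaying average of squared gradients
    avg_sq_step = 0.0      # decaying average of squared updates
    for _ in range(steps):
        g = grad_fn(w)
        avg_sq_grad = rho * avg_sq_grad + (1 - rho) * g * g
        # RMS of past updates stands in for the learning rate
        step = -math.sqrt(avg_sq_step + eps) / math.sqrt(avg_sq_grad + eps) * g
        avg_sq_step = rho * avg_sq_step + (1 - rho) * step * step
        w += step
    return w

w_start = 1.0
w_end = adadelta(lambda w: w, w_start)   # gradient of the toy loss 0.5 * w^2
print(w_start, w_end)
```

Because both averages decay, old gradients drop out of the denominator and the step size does not shrink toward zero the way it does in AdaGrad.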
3) RMSProp (Root Mean Square Propagation)
RMSProp improves AdaGrad by introducing an exponentially weighted moving average of
squared gradients. It ‘forgets’ very old gradients and focuses on recent ones.
Formula:
w(t+1) = w(t) − μ / RMS(∇t) * ∇t
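Background: RMSProp grew out of Rprop, which uses only the sign of the gradient and works as follows:
(a) Set equal update magnitude for all weights and define max/min limits.
(b) If current and previous gradients have the same sign, increase the step size (×1.2).
(c) If signs differ, reduce the step size (×0.5).
RMSProp carries this idea over to mini-batch training by dividing the gradient by a running average of its recent magnitude, which makes learning stable and damps oscillations.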
Advantages:
• Solves AdaGrad’s decaying learning rate problem.
• Performs well on non-stationary and sequential data (like RNNs).
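A minimal sketch of the RMSProp update for a single parameter. The decay factor 0.9, the constant `eps`, and the toy loss (w − 3)² are assumptions for illustration and are not given in the slides.

```python
import math

def rmsprop(grad_fn, w, lr=0.01, decay=0.9, eps=1e-8, steps=1000):
    avg_sq = 0.0
    for _ in range(steps):
        g = grad_fn(w)
        # exponentially weighted average: old gradients fade out
        avg_sq = decay * avg_sq + (1 - decay) * g * g
        w -= lr / (math.sqrt(avg_sq) + eps) * g
    return w

w = rmsprop(lambda w: 2.0 * (w - 3.0), w=0.0)   # minimizes (w - 3)^2
print(w)
```

Dividing by the RMS makes the step size roughly independent of the gradient's scale, so the parameter moves at about `lr` per step until it is close to the minimum.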
4) Adam (Adaptive Moment Estimation)
Adam combines the benefits of AdaGrad and RMSProp. It maintains two exponential moving averages:
1. m(t): the mean of gradients (first moment)
2. v(t): the uncentered variance (second moment)
Formulas:
m(t) = β1 * m(t−1) + (1−β1) * g(t)
v(t) = β2 * v(t−1) + (1−β2) * g(t)^2
Bias-corrected estimates:
m̂(t) = m(t) / (1−β1^t)
v̂(t) = v(t) / (1−β2^t)
Update rule:
w(t+1) = w(t) − μ * m̂(t) / (√v̂(t) + ε)
Advantages:
• Combines adaptive learning rate and momentum.
• Fast convergence.
• Works well for most deep learning applications.
• Automatically adjusts learning rates for each parameter.
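The Adam formulas above translate almost line by line into code. In this sketch the values β1 = 0.9, β2 = 0.999, and ε = 1e-8 are the commonly used defaults, while the learning rate 0.1 and the toy loss (w − 3)² are assumptions for illustration.

```python
import math

def adam(grad_fn, w, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8, steps=1000):
    m = 0.0    # first moment: moving average of gradients
    v = 0.0    # second moment: moving average of squared gradients
    for t in range(1, steps + 1):
        g = grad_fn(w)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        m_hat = m / (1 - beta1 ** t)     # bias correction for the zero start
        v_hat = v / (1 - beta2 ** t)
        w -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return w

w = adam(lambda w: 2.0 * (w - 3.0), w=0.0)   # minimizes (w - 3)^2
print(w)
```

Without the bias correction, `m` and `v` would be pulled toward zero during the first steps because both start at 0.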
• AdaGrad – Adapts learning rate per parameter but learning rate decays over time.
• AdaDelta – Fixes AdaGrad’s decay problem by keeping a limited history of gradients.
• RMSProp – Maintains an exponentially decaying average of squared gradients.
• Adam – Combines RMSProp and Momentum; fast, adaptive, and efficient.
Among these, Adam is the most widely used in deep learning for its balance of speed and stability.
Challenges in Training Deep Networks
1) Vanishing Gradient:
Deep neural networks that use saturating activation functions such as sigmoid or tanh and are trained through backpropagation are prone to the vanishing gradient problem.
Vanishing gradients make it very hard to train and update the parameters of the initial layers in the network.
This problem worsens as the number of layers in the network increases.
The aim of backpropagation in neural networks is to update the parameters such that the error of the network is minimized and the actual output gets closer to the target output.
During backpropagation, the weights are updated using gradient descent.
• Why does the gradient “vanish”?
• Let’s look at the sigmoid function: $\sigma(x) = \frac{1}{1 + e^{-x}}$
• Its derivative is: $\sigma'(x) = \sigma(x)\,(1 - \sigma(x))$
• This derivative (which is what’s used during backpropagation) has a maximum value of 0.25 (at x = 0) and always lies between 0 and 0.25.
• That means:
Each time the gradient passes through a sigmoid activation, it gets multiplied by a number that is at most 0.25.
• Now imagine a deep network with 10 layers.
If each layer multiplies the gradient by 0.25, then: $0.25^{10} \approx 9.5 \times 10^{-7}$
• So the gradient becomes almost zero by the time it reaches the first few layers.
• This is why we say the gradient “vanishes”: it becomes too small for the earlier layers to learn anything.
• What happens because of it?
• The initial layers (the ones close to input) stop learning, because their weights barely change.
• The later layers (near output) might still learn, but the overall network won’t improve much.
• Training becomes very slow or even stuck — the loss doesn’t reduce further.
• This is why deep neural networks with sigmoid or tanh activations were historically very hard to train —
especially before ReLU was introduced.
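The shrinking can be checked numerically. The sketch below takes the best case for the sigmoid, where every layer contributes its maximum derivative of 0.25, and ignores the weights, which is a simplification made here for illustration.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)

max_grad = sigmoid_grad(0.0)      # the derivative peaks at x = 0

gradient = 1.0
per_layer = []
for layer in range(10):
    gradient *= max_grad          # one multiplication per layer passed
    per_layer.append(gradient)

print(max_grad, gradient)
```

Even in this best case the gradient that reaches the first layer is smaller than one millionth of the gradient at the output.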
How ReLU helps
• ReLU (Rectified Linear Unit) is defined as: $f(x) = \max(0, x)$
• Its derivative is: $f'(x) = 1$ for $x > 0$ and $f'(x) = 0$ for $x < 0$
• For positive values, the derivative is 1, not a small number like 0.25.
So when backpropagation happens, gradients passing through active units don’t shrink; they stay strong enough for all layers to keep learning.
• That’s why ReLU and its variants (like Leaky ReLU, ELU) are widely used today: they largely avoid the vanishing gradient problem and make deep networks trainable.
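For comparison, the same 10-layer product can be formed with ReLU derivatives. The sketch assumes every unit on the path is active (positive input); the Leaky ReLU slope of 0.01 is a commonly used value and is an assumption here.

```python
def relu(x):
    return max(0.0, x)

def relu_grad(x):
    return 1.0 if x > 0 else 0.0

def leaky_relu_grad(x, alpha=0.01):
    # alpha is an assumed slope for negative inputs
    return 1.0 if x > 0 else alpha

sigmoid_chain = 0.25 ** 10        # best case for sigmoid over 10 layers

relu_chain = 1.0
for _ in range(10):
    relu_chain *= relu_grad(2.0)  # active unit: derivative is exactly 1

print(sigmoid_chain, relu_chain)
```

For negative inputs the plain ReLU derivative is 0, so an inactive unit passes no gradient at all; Leaky ReLU keeps a small non-zero slope there.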
Training Data Size
• Deep neural networks use training data for learning and can model
complex nonlinear relationships between input data and output
labels. The number of parameters in these networks is very large,
making the training data size a critical factor influencing model
success.
Importance of Large Data
• Deep networks have millions of parameters that need to be learned.
More parameters require more data to ensure effective training.
Complex models mean more powerful abstraction but also require
vast amounts of data to generalize well.
Real-World Examples of Large Datasets
• Successful deep models such as AlexNet, GoogLeNet, VGG, and ResNet were all trained on the ImageNet dataset. The ImageNet challenge (ILSVRC) subset used for training contains around 1.2 million labeled images distributed across 1,000 classes. Such large datasets help these models handle variations in object pose, color, lighting, and background.
When Smaller Data Works
For less complex problems, such as some medical image classification tasks where variations between images are small, less complex models can perform well even with smaller datasets. However, both model complexity and data quality determine the actual data required.
Role of Data Quality
• The quality of training data is as important as its size. Noisy or low-
quality data reduces the Signal-to-Noise Ratio (SNR), making learning
harder and requiring more data for convergence. Hence, high-quality
and clean data helps deep models train efficiently.
Data Size vs. Problem Complexity
The required dataset size depends on both the complexity of the
problem and the nature of the data. Highly variable data, such as
natural images, needs larger datasets, while low-variation data can be
trained with fewer examples.
How Much Data is Enough?
• There is no universal rule for the amount of data required to train a
deep model. Generally, more data improves accuracy and
generalization. However, factors such as model size, task complexity,
and data quality determine the exact requirement.
Overfitting and Underfitting
Generalization in Deep Learning
• Once a deep learning model is trained on a given training dataset, its primary objective is not just to perform well on that data, but also to generalize, that is, to perform accurately on new, unseen data. The ability of a deep learning model to maintain good performance on unseen data is called generalization. Generalization is one of the most important qualities of a good deep learning model.
• To assess a model’s generalization ability, the dataset is generally split into training, validation, and test sets. The model is trained using the training set and evaluated on the validation or test set. If the model performs well on the training data but poorly on new data, it indicates poor generalization.
Overfitting
• Overfitting occurs when a model learns the training data too well, including its noise and minor details, instead of learning the general patterns. This results in a model that performs exceptionally well on training data but fails to generalize to unseen data.
• In overfitting, the training error becomes very low, but the validation (or test) error remains high. On a plot of error against training time, this shows up as the training error continuing to decrease while the validation error starts to increase after a certain point.
• Overfitting commonly occurs in deep networks like CNNs, which have a large number of learnable parameters. If the training dataset is too small relative to the number of parameters, the network starts memorizing the examples instead of learning general features.
Underfitting
• Underfitting occurs when a model is not able to learn effectively from the training data. It happens when
the model is too simple to capture the underlying patterns of the data, or when it has not been trained
for enough iterations. In this case, the model shows high error on both the training and validation sets,
indicating that it has not learned the task properly.
Underfitting is often caused by using a model that is too simple, insufficient training, or an inappropriate
learning rate.
Techniques to Reduce Overfitting
• Although overfitting is a common challenge in deep networks, several strategies can be used to reduce it:
(a) Increase the training dataset: A larger dataset allows the model to see more variations and improve
generalization.
(b) Reduce network size: Simplifying the architecture by reducing layers or neurons lowers the model’s capacity to memorize the training data.
(c) Data augmentation: Generating new examples by transforming existing data (scaling, rotation, flipping, etc.) increases the effective dataset size.
(d) Regularization (L1 and L2): Adding penalty terms discourages large weights and reduces complexity.
(e) Dropout: Randomly dropping neurons during training prevents reliance on specific neurons, forcing
the model to learn robust representations.
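Items (d) and (e) can be sketched in a few lines. The L2 penalty below is λ·Σw², and dropout is written in the "inverted" form that rescales the surviving activations; both choices, along with the function names and the values of `lam` and `drop_prob`, are assumptions for illustration.

```python
import random

def l2_penalty(weights, lam):
    """Extra loss term that grows with the size of the weights."""
    return lam * sum(w * w for w in weights)

def l2_gradient(weights, lam):
    """Gradient of the penalty: pushes every weight toward zero."""
    return [2.0 * lam * w for w in weights]

def dropout(activations, drop_prob, rng, training=True):
    """Inverted dropout: zero each unit with probability drop_prob and
    rescale the survivors so the expected activation is unchanged."""
    if not training or drop_prob == 0.0:
        return list(activations)
    keep = 1.0 - drop_prob
    return [a / keep if rng.random() < keep else 0.0 for a in activations]

rng = random.Random(0)
weights = [1.0, -2.0]
print(l2_penalty(weights, lam=0.1), l2_gradient(weights, lam=0.1))

hidden = [1.0] * 8
print(dropout(hidden, drop_prob=0.5, rng=rng))                  # training
print(dropout(hidden, drop_prob=0.5, rng=rng, training=False))  # inference
```

Dropout is applied only during training; at inference time all units are kept, which is why the last call returns the activations unchanged.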