Training Supervised Deep Learning Models

The document discusses the training of supervised deep learning networks, focusing on minimizing loss functions using gradient descent and its variants, including batch, stochastic, and mini-batch gradient descent. It highlights the importance of data size and quality for effective training, as well as challenges like vanishing gradients and overfitting, providing strategies to mitigate these issues. Additionally, it introduces advanced optimization techniques like AdaGrad, RMSProp, and Adam for improved convergence in deep learning models.

Uploaded by

shreekd2004

Training Supervised Deep Learning Networks
Training Convolutional Neural Networks
• Training a supervised deep neural network is formulated as minimizing a loss function.
• In this context, training means searching for the set of parameter values (weights) at which the loss function attains its minimum.
• Gradient descent is an optimization technique that minimizes the error by computing the gradients needed to update the parameters of the network.
• The most common and successful learning algorithm for deep learning models is gradient descent-based backpropagation, in which the error is propagated backward from the last layer to the first layer.
• In this learning technique, all the weights of the network are initialized randomly, for example by sampling from a probability distribution. An input is fed through the network to get the output. The obtained output and the desired output are then used to calculate the error using a cost function (error function).
• To illustrate how backpropagation works, consider a small Convolutional Neural Network (CNN) model.

• Flow: 32×32 → Conv (6×28×28) → Pool (6×14×14) → Conv (16×10×10) → Pool (16×5×5) → Conv (120×1×1) → FC (10) → Softmax
• Hierarchy of features: Edges → Shapes → Complex structures → Classification.
• Convolution Formula (Equation 3.1)
• The mathematical operation is:

y(k,i,j) = Σ_m Σ_n w(k,m,n) * x(i+m, j+n) + b(k)

• y(k,i,j) = value at position (i,j) of the k-th feature map
• w(k,m,n) = weight at position (m,n) of the k-th filter
• x(i+m, j+n) = pixel value from the input image at the shifted position (i+m, j+n)
• b(k) = bias term for the k-th filter
• Intuition: The filter (5×5) slides over the image, multiplies and sums pixel values, and produces a feature map that highlights patterns such as edges and textures.
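The sliding-window operation can be sketched in a few lines of NumPy. This is a minimal illustration, not an efficient implementation: the function name `conv2d_valid`, the stand-in image, and the averaging filter are assumptions made for the example, and the code handles a single input channel and a single filter.

```python
import numpy as np

def conv2d_valid(image, kernel, bias=0.0):
    """Slide `kernel` over `image` without padding and sum the products."""
    H, W = image.shape
    kH, kW = kernel.shape
    out = np.zeros((H - kH + 1, W - kW + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # Multiply the filter with the patch at (i, j), sum, add the bias
            out[i, j] = np.sum(kernel * image[i:i + kH, j:j + kW]) + bias
    return out

image = np.arange(32 * 32, dtype=float).reshape(32, 32)  # stand-in grayscale image
kernel = np.ones((5, 5)) / 25.0                          # hypothetical averaging filter
feature_map = conv2d_valid(image, kernel)
print(feature_map.shape)  # (28, 28)
```

A 5×5 filter on a 32×32 input yields a 28×28 feature map, which matches the first step of the flow.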
| Layer | Input → Output | Filter / Operation | # Feature Maps | Size Change | Special Note |
|---|---|---|---|---|---|
| C1 (Convolution 1) | Input: 32×32 (grayscale image) → Output: 6×28×28 | 6 filters, each 5×5×1 | 6 | 32 → 28 | First convolution. Each filter sees raw pixels. Summation over (m,n) only. |
| P2 (Pooling 1) | 6×28×28 → 6×14×14 | 2×2 max pooling | 6 | 28 → 14 | Reduces size by half, keeps max values, no learnable parameters. |
| C3 (Convolution 2) | 6×14×14 → 16×10×10 | 16 filters, each 5×5×6 | 16 | 14 → 10 | Deeper convolution. Each filter now combines info from all 6 input maps → summation over (d,m,n). |
| P4 (Pooling 2) | 16×10×10 → 16×5×5 | 2×2 max pooling | 16 | 10 → 5 | Again halves size, keeps strongest features. |
| C5 (Convolution 3) | 16×5×5 → 120×1×1 | 120 filters, each 5×5×16 | 120 | 5 → 1 | This is like a fully-connected layer because the filter covers the entire input. |
| F6 (Fully Connected) | 120 → 84 neurons | Dense connections | 84 | – | Classic fully connected layer, learns high-level combinations. |
| Output Layer | 84 → 10 classes | Fully connected + Softmax | 10 | – | Produces class probabilities. |
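The spatial sizes in the flow can be checked with the standard output-size rule for unpadded convolution and pooling, (size − kernel) / stride + 1. The sketch below assumes stride 1 for the convolutions and stride 2 for the 2×2 pooling, which is consistent with the sizes stated in the flow; the helper name `out_size` and the `layers` list are illustrative.

```python
def out_size(size, kernel, stride=1, padding=0):
    """Spatial output size of a convolution or pooling layer."""
    return (size + 2 * padding - kernel) // stride + 1

# (name, kernel size, stride, number of feature maps)
layers = [
    ("C1", 5, 1, 6),
    ("P2", 2, 2, 6),
    ("C3", 5, 1, 16),
    ("P4", 2, 2, 16),
    ("C5", 5, 1, 120),
]

size = 32  # 32×32 input image
shapes = []
for name, kernel, stride, maps in layers:
    size = out_size(size, kernel, stride)
    shapes.append((name, maps, size, size))
    print(f"{name}: {maps}×{size}×{size}")
```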
Gradient Descent-Based Optimization Techniques
• Gradient descent is an optimization technique used to minimize the cost function by computing the gradients needed to update the parameters of the network. There are three commonly used Gradient Descent (GD) variants:
i. Batch Gradient Descent (GD)
ii. Stochastic Gradient Descent (SGD)
iii. Mini-batch Gradient Descent
Batch Gradient Descent (GD)

• Traditional Gradient Descent (GD), also known as batch gradient descent, computes the error gradient with respect to the weight parameter w over the entire training set and then updates w. In other words, it uses all the training data to compute a single gradient per update.
• Update rule:

w(t+1) = w(t) − μ * ∇E(w(t))

Here, μ is the learning rate and E is the error averaged over the entire training set.

• Pros: Stable, exact gradient.
• Cons: Very slow; requires a lot of memory if the dataset is large.
• When to use:
• Only when the dataset is small (fits easily into memory).
• Example: a dataset with a few thousand rows (as in simple regression problems).
• Why:
• The gradient is computed on the whole dataset for every update, so it is slow for big data.
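As a rough sketch, batch gradient descent on a small linear-regression problem might look like the following. The synthetic data, the mean-squared-error loss, the learning rate `mu`, and the number of epochs are all assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 1))
y = 3.0 * X[:, 0] + 0.5 + 0.1 * rng.normal(size=200)  # synthetic targets
Xb = np.hstack([X, np.ones((200, 1))])                # append a bias column

w = np.zeros(2)  # [slope, intercept]
mu = 0.1         # learning rate (assumed value)
for epoch in range(200):
    # Gradient of the mean squared error over ALL training examples
    grad = 2.0 / len(y) * Xb.T @ (Xb @ w - y)
    w -= mu * grad  # exactly one update per pass over the data

print(w)  # close to the true slope 3.0 and intercept 0.5
```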
Stochastic Gradient Descent (SGD)
• The cost of batch gradient descent on large datasets can be avoided by using Stochastic Gradient Descent (SGD).
• It is also known as incremental gradient descent.
• The gradient is computed for one training example at a time, and the parameter values are updated immediately afterward.
• Each update is much cheaper than in standard gradient descent, so SGD usually makes progress much faster on large datasets.
• Update rule:

w(t+1) = w(t) − μ * ∇E_i(w(t))

Here, E_i is the error on the single training example i.

• Pros: Very fast updates; the noise in the updates can help escape shallow local minima.
• Cons: Updates fluctuate a lot (zig-zag path).
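The same kind of problem can be used to sketch SGD, where every single example triggers an update. The data, the squared-error loss, and the hyperparameters below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 1))
y = 3.0 * X[:, 0] + 0.5 + 0.1 * rng.normal(size=200)  # synthetic targets
Xb = np.hstack([X, np.ones((200, 1))])                # append a bias column

w = np.zeros(2)
mu = 0.01  # learning rate (assumed value)
for epoch in range(20):
    for i in rng.permutation(len(y)):  # visit examples in random order
        err = Xb[i] @ w - y[i]
        # Gradient of the squared error on ONE example, then update at once
        w -= mu * 2.0 * err * Xb[i]

print(w)  # noisy, but close to slope 3.0 and intercept 0.5
```

Each epoch performs 200 updates instead of one, which is where the speed advantage on large datasets comes from.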
Mini-batch Gradient Descent
• Mini-batch gradient descent, also known as mini-batch SGD, is a combination of standard gradient descent and SGD.
• Mini-batch SGD divides the entire training set into mini-batches of n training examples and updates the parameter values once per mini-batch. Typical batch sizes are small, for example 32, 64, or 128 examples.
• This takes advantage of both techniques: gradients are less noisy than in SGD, and updates are cheaper than in batch gradient descent.
• It is a commonly used optimization technique in deep learning.

• Pros: Best of both worlds: efficient, less noisy, works well with GPUs.
• Cons: Needs careful batch size selection (too big = memory issues, too small = unstable).
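A minimal mini-batch loop is sketched below. The batch size of 32 comes from the examples above; the data, loss, and other hyperparameters are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 200
X = rng.normal(size=(N, 1))
y = 3.0 * X[:, 0] + 0.5 + 0.1 * rng.normal(size=N)  # synthetic targets
Xb = np.hstack([X, np.ones((N, 1))])                # append a bias column

w = np.zeros(2)
mu = 0.05        # learning rate (assumed value)
batch_size = 32
for epoch in range(50):
    order = rng.permutation(N)  # reshuffle every epoch
    for start in range(0, N, batch_size):
        idx = order[start:start + batch_size]
        # Gradient of the mean squared error over one mini-batch
        grad = 2.0 / len(idx) * Xb[idx].T @ (Xb[idx] @ w - y[idx])
        w -= mu * grad

print(w)  # close to slope 3.0 and intercept 0.5
```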
Improving Gradient Descent for Faster Convergence
1) AdaGrad (Adaptive Gradient Algorithm)
In standard Stochastic Gradient Descent (SGD), the same fixed learning rate is used for all parameters, which can cause issues. If the gradient is large, a large learning rate might overshoot the optimum, and if the gradient is small, convergence becomes very slow.

AdaGrad addresses this by adapting the learning rate for each parameter individually.

It keeps track of the sum of squares of all previous gradients for each parameter, and divides the learning rate by the square root of this accumulated value.

Formula:
w(t+1,i) = w(t,i) − μ / (√G(t,i) + ε) * ∇(t,i)

Here, G(t,i) is the sum of squared gradients for parameter i up to step t, and ε is a small constant that avoids division by zero.

Effectively, parameters with large gradients get smaller learning rates, and parameters with small gradients get larger learning rates.

Advantage: Learning rate is adjusted automatically.

Limitation: The sum in the denominator increases over time, causing the learning rate to decay too much, which can slow or stop training.
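The per-parameter scaling and the decay can both be seen in a short sketch. The function name `adagrad_step`, the constant gradient, and the small constant `eps` (added to avoid division by zero) are assumptions for the example.

```python
import numpy as np

def adagrad_step(w, grad, G, mu=0.1, eps=1e-8):
    """One AdaGrad update; G accumulates squared gradients per parameter."""
    G = G + grad ** 2
    w = w - mu * grad / (np.sqrt(G) + eps)
    return w, G

w = np.array([1.0, 1.0])
G = np.zeros_like(w)
grad = np.array([10.0, 0.1])  # hypothetical constant gradient, very different scales
step_sizes = []
for t in range(4):
    w_new, G = adagrad_step(w, grad, G)
    step_sizes.append(np.abs(w_new - w))
    w = w_new

print(step_sizes[0])  # both parameters move by about mu = 0.1
print(step_sizes[3])  # after 4 steps the step has shrunk to about mu / 2
```

With a constant gradient, the step at iteration t is roughly μ / √t for every parameter, which shows both the equalizing effect and the decay described in the limitation.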
2) AdaDelta

AdaDelta is an improved version of AdaGrad that prevents the learning rate from continuously decaying. Instead of summing all past squared gradients, it restricts the history to a window of recent gradients. In practice the window is implemented as an exponentially decaying average of squared gradients, which helps maintain a balanced learning rate.

Formula:
Δw(t) = − RMS(Δw)(t−1) / RMS(∇)(t) * ∇t
w(t+1) = w(t) + Δw(t)

Here, RMS(∇)(t) is the Root Mean Square of recent gradients and RMS(Δw)(t−1) is the Root Mean Square of recent parameter updates. Their ratio takes the place of the global learning rate μ and keeps the step size within a useful range.

Advantages:
• Prevents vanishing learning rate.
• No manual tuning of global learning rate required.
• Performs well in practice for deep networks.
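A sketch of one AdaDelta step is given below, following the commonly published form in which the ratio of two running RMS values replaces the global learning rate. The function name `adadelta_step`, the decay rate `rho`, the constant `eps`, and the toy objective f(w) = w² are assumptions for illustration.

```python
import numpy as np

def adadelta_step(w, grad, Eg2, Edw2, rho=0.95, eps=1e-6):
    """One AdaDelta update using running averages of g^2 and (delta w)^2."""
    Eg2 = rho * Eg2 + (1 - rho) * grad ** 2
    delta = -np.sqrt(Edw2 + eps) / np.sqrt(Eg2 + eps) * grad
    Edw2 = rho * Edw2 + (1 - rho) * delta ** 2
    return w + delta, Eg2, Edw2

w = np.array([1.0])
Eg2 = np.zeros_like(w)
Edw2 = np.zeros_like(w)
for t in range(50):
    grad = 2.0 * w  # gradient of the toy objective f(w) = w^2
    w, Eg2, Edw2 = adadelta_step(w, grad, Eg2, Edw2)

print(w)  # has moved from 1.0 toward the minimum at 0
```

Note that no learning rate appears in the update; the early steps are small because the running average of past updates starts at zero.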
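3) RMSProp (Root Mean Square Propagation)

RMSProp improves AdaGrad by replacing the ever-growing sum of squared gradients with an exponentially weighted moving average. It ‘forgets’ very old gradients and focuses on recent ones.

Formula:
E[∇²](t) = β * E[∇²](t−1) + (1−β) * ∇t^2
w(t+1) = w(t) − μ / RMS(∇t) * ∇t

Here, RMS(∇t) = √E[∇²](t) + ε, and β is the decay rate of the moving average (0.9 is a common choice).

Background (Rprop): RMSProp was developed as a mini-batch version of Rprop, a sign-based method that works as follows:
(a) Set an equal initial update magnitude for all weights and define max/min limits.
(b) If the current and previous gradients have the same sign, increase the update magnitude (×1.2).
(c) If the signs differ, reduce the update magnitude (×0.5).

Dividing by the running RMS keeps update magnitudes comparable across parameters, which makes learning more stable and dampens oscillations.

Advantages:
• Solves AdaGrad’s decaying learning rate problem.
• Performs well on non-stationary and sequential data (like RNNs).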
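The moving-average version of the update can be sketched as follows. The function name `rmsprop_step`, the decay rate `beta`, the constant `eps`, and the constant test gradient are assumptions for the example.

```python
import numpy as np

def rmsprop_step(w, grad, Eg2, mu=0.01, beta=0.9, eps=1e-8):
    """One RMSProp update with an exponentially weighted average of g^2."""
    Eg2 = beta * Eg2 + (1 - beta) * grad ** 2
    w = w - mu * grad / (np.sqrt(Eg2) + eps)
    return w, Eg2

w = np.array([1.0, 1.0])
Eg2 = np.zeros_like(w)
grad = np.array([10.0, 0.1])  # hypothetical constant gradient
for t in range(100):
    w_new, Eg2 = rmsprop_step(w, grad, Eg2)
    last_step = np.abs(w_new - w)
    w = w_new

print(last_step)  # both close to mu = 0.01
```

Under a constant gradient the step settles near μ instead of shrinking toward zero, which is the behavior that distinguishes RMSProp from AdaGrad.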
4) Adam (Adaptive Moment Estimation)

Adam combines the benefits of AdaGrad and RMSProp. It maintains two exponential moving averages:
1. m(t): the mean of gradients (first moment)
2. v(t): the mean of squared gradients (uncentered second moment)

Formulas:
m(t) = β1 * m(t−1) + (1−β1) * g(t)
v(t) = β2 * v(t−1) + (1−β2) * g(t)^2

Bias-corrected estimates:
m̂(t) = m(t) / (1−β1^t)
v̂(t) = v(t) / (1−β2^t)

Update rule:
w(t+1) = w(t) − μ * m̂(t) / (√v̂(t) + ε)

Advantages:
• Combines adaptive learning rate and momentum.
• Fast convergence.
• Works well for most deep learning applications.
• Automatically adjusts learning rates for each parameter.
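The formulas above translate almost line for line into code. In this sketch the function name `adam_step` is illustrative, and the default values for `mu`, `beta1`, `beta2`, and `eps` are commonly used settings assumed here rather than taken from the notes.

```python
import numpy as np

def adam_step(w, g, m, v, t, mu=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; t is the 1-based step count."""
    m = beta1 * m + (1 - beta1) * g       # first moment
    v = beta2 * v + (1 - beta2) * g ** 2  # second moment
    m_hat = m / (1 - beta1 ** t)          # bias correction
    v_hat = v / (1 - beta2 ** t)
    w = w - mu * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

w = np.zeros(2)
m = np.zeros(2)
v = np.zeros(2)
g = np.array([10.0, -0.1])  # hypothetical gradient with very different scales
w, m, v = adam_step(w, g, m, v, t=1)
print(w)  # approximately [-0.001, 0.001]
```

Because of the bias correction, the very first step has magnitude close to μ for every parameter, regardless of the gradient scale.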
• AdaGrad – Adapts learning rate per parameter but learning rate decays over time.
• AdaDelta – Fixes AdaGrad’s decay problem by keeping a limited history of gradients.
• RMSProp – Maintains an exponentially decaying average of squared gradients.
• Adam – Combines RMSProp and Momentum; fast, adaptive, and efficient.

Among these, Adam is the most widely used in deep learning for its balance of speed and stability.
Challenges in Training Deep Networks
1) Vanishing Gradient:
Deep neural networks that use saturating activation functions such as sigmoid or tanh and are trained through backpropagation are prone to the vanishing gradient problem.
Vanishing gradients make it very hard to train and update the parameters of the initial layers in the network.
This problem worsens as the number of layers in the network increases.
The aim of backpropagation in neural networks is to update the parameters such that the error of the network is minimized and the actual output gets closer to the target output.
During backpropagation, the weights are updated using gradient descent.
• Why does the gradient “vanish”?
• Let’s look at the sigmoid function:

σ(x) = 1 / (1 + e^(−x))

• Its derivative is:

σ'(x) = σ(x) * (1 − σ(x))

• This derivative (which is what’s used during backpropagation) is always between 0 and 0.25 and reaches its maximum of 0.25 at x = 0.
• That means: each time the gradient passes through a sigmoid activation, it gets multiplied by a number no larger than 0.25.
• Now imagine a deep network with 10 layers. If each layer multiplies the gradient by 0.25, then:

(0.25)^10 ≈ 9.5 × 10^−7

• So the gradient becomes almost zero by the time it reaches the first few layers.
• This is why we say the gradient “vanishes”: it becomes too small for the earlier layers to learn anything.
• What happens because of it?
• The initial layers (the ones close to input) stop learning, because their weights barely change.
• The later layers (near output) might still learn, but the overall network won’t improve much.
• Training becomes very slow or even stuck, and the loss doesn’t reduce further.
• This is why deep neural networks with sigmoid or tanh activations were historically very hard to train, especially before ReLU was introduced.
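The shrinking factor can be verified numerically. The sketch below only looks at the activation derivatives and ignores the weight terms that also enter the backpropagated gradient; the function names are illustrative.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)

max_grad = sigmoid_grad(0.0)  # the derivative peaks at x = 0
factor = max_grad ** 10       # best case after passing through 10 sigmoid layers

print(max_grad)  # 0.25
print(factor)    # about 9.5e-07
```

Even in the best case, where every unit sits at x = 0, the gradient reaching the first layer is scaled by less than one millionth.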
How ReLU helps
• ReLU (Rectified Linear Unit) is defined as:

f(x) = max(0, x)

• Its derivative is:

f'(x) = 1 if x > 0, and 0 if x < 0

• For positive values, the derivative is 1, not a small number like 0.25. So during backpropagation, gradients passing through active units are not shrunk by the activation and stay strong enough for the earlier layers to keep learning.
• That’s why ReLU and its variants (like Leaky ReLU, ELU) are widely used today: they mitigate the vanishing gradient problem and make deep networks easier to train.
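For comparison with the sigmoid case, a short sketch of ReLU and its derivative follows. The function names are illustrative, the derivative at exactly x = 0 is set to 0 by convention here, and weight terms are again ignored.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def relu_grad(x):
    # 1 for positive inputs, 0 otherwise (0 chosen at x = 0 by convention)
    return (x > 0).astype(float)

x = np.array([-2.0, 0.5, 3.0])
print(relu(x))       # [0.  0.5 3. ]
print(relu_grad(x))  # [0. 1. 1.]

# Activation-derivative factor after 10 layers for a unit that stays active
active_factor = relu_grad(np.array([1.0]))[0] ** 10
print(active_factor)  # 1.0
```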
Training Data Size
• Deep neural networks use training data for learning and can model
complex nonlinear relationships between input data and output
labels. The number of parameters in these networks is very large,
making the training data size a critical factor influencing model
success.
Importance of Large Data
• Deep networks have millions of parameters that need to be learned.
More parameters require more data to ensure effective training.
Complex models mean more powerful abstraction but also require
vast amounts of data to generalize well.
Real-World Examples of Large
Datasets
• Successful deep models such as AlexNet, GoogLeNet, VGG, and
ResNet were all trained on the ImageNet dataset. ImageNet contains
around 1.2 million labeled images distributed across 1,000 classes.
Such large datasets help these models handle variations in object
pose, color, lighting, and background.
When Smaller Data Works
For less complex problems—such as medical image classification,
where variations are small—less complex models can perform well
even with smaller datasets. However, both model complexity and data
quality determine the actual data required.
Role of Data Quality
• The quality of training data is as important as its size. Noisy or low-
quality data reduces the Signal-to-Noise Ratio (SNR), making learning
harder and requiring more data for convergence. Hence, high-quality
and clean data helps deep models train efficiently.
Data Size vs. Problem Complexity
The required dataset size depends on both the complexity of the
problem and the nature of the data. Highly variable data, such as
natural images, needs larger datasets, while low-variation data can be
trained with fewer examples.
How Much Data is Enough?
• There is no universal rule for the amount of data required to train a
deep model. Generally, more data improves accuracy and
generalization. However, factors such as model size, task complexity,
and data quality determine the exact requirement.
Overfitting and Underfitting
• Generalization in Deep Learning
• Once a deep learning model is trained on a given training dataset, its primary objective is not just to perform well on that
data, but also to generalize — that is, to perform accurately on new, unseen data. The ability of a deep learning model to
maintain good performance on unseen data is called generalization. Generalization is one of the most important qualities
of a good deep learning model.

To assess a model’s generalization ability, the dataset is generally split into training, validation, and test sets. The model is
trained using the training set and evaluated on the validation or test set. If the model performs well on the training data
but poorly on new data, it indicates poor generalization.

• Overfitting:
• Overfitting occurs when a model learns the training data too well, including its noise and minor details, instead of learning the general
patterns. This results in a model that performs exceptionally well on training data but fails to generalize to unseen data.

In overfitting, the training error becomes very low, but the validation (or test) error remains high. This behavior can be visualized where the
training error keeps decreasing while the validation error increases after a certain point. Overfitting commonly occurs in deep networks like
CNNs, which have a large number of learnable parameters. If the training dataset is too small relative to the number of parameters, the
network starts memorizing the examples instead of learning general features.
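The gap between training and validation error can be illustrated with a toy curve-fitting problem, using polynomial degree as a stand-in for model capacity. The synthetic data, the noise level, and the chosen degrees are assumptions for the example, not part of the original notes.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_data(n):
    x = rng.uniform(-1.0, 1.0, size=n)
    y = np.sin(3.0 * x) + 0.2 * rng.normal(size=n)  # noisy synthetic targets
    return x, y

x_train, y_train = make_data(10)  # deliberately tiny training set
x_val, y_val = make_data(200)     # held-out validation set

def mse(coeffs, x, y):
    return float(np.mean((np.polyval(coeffs, x) - y) ** 2))

errors = {}
for degree in (1, 3, 9):
    coeffs = np.polyfit(x_train, y_train, degree)
    errors[degree] = (mse(coeffs, x_train, y_train), mse(coeffs, x_val, y_val))
    print(degree, errors[degree])  # (training error, validation error)
```

The degree-9 polynomial has as many coefficients as there are training points, so it passes through every training point, noise included, and its training error is essentially zero. Its validation error is typically much larger, which is the signature of overfitting.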
Underfitting
• Underfitting occurs when a model is not able to learn effectively from the training data. It happens when
the model is too simple to capture the underlying patterns of the data, or when it has not been trained
for enough iterations. In this case, the model shows high error on both the training and validation sets,
indicating that it has not learned the task properly.

Underfitting is often caused by using a model that is too simple, insufficient training, or an inappropriate
learning rate.

Techniques to Reduce Overfitting

• Although overfitting is a common challenge in deep networks, several strategies can be used to reduce it:

(a) Increase the training dataset: A larger dataset allows the model to see more variations and improve
generalization.
(b) Reduce network size: Simplifying the architecture by reducing layers or neurons prevents overfitting.
(c) Data augmentation: Generating new examples by transforming existing data (scaling, rotation,
flipping, etc.) increases dataset size.
(d) Regularization (L1 and L2): Adding penalty terms discourages large weights and reduces complexity.
(e) Dropout: Randomly dropping neurons during training prevents reliance on specific neurons, forcing
the model to learn robust representations.
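Two of these techniques, dropout and L2 regularization, are simple enough to sketch directly. The code below uses the common "inverted dropout" variant, which rescales the surviving activations during training; the function names, the drop probability, and the penalty weight `lam` are assumptions for illustration.

```python
import numpy as np

def dropout(activations, p_drop, rng):
    """Inverted dropout: zero out units with probability p_drop, rescale the rest."""
    mask = rng.random(activations.shape) >= p_drop
    return activations * mask / (1.0 - p_drop)

def l2_penalty(weights, lam):
    """L2 regularization term that is added to the loss."""
    return lam * np.sum(weights ** 2)

rng = np.random.default_rng(0)
a = np.ones(10000)
dropped = dropout(a, p_drop=0.5, rng=rng)
print(np.unique(dropped))  # surviving units are scaled to 2.0, dropped ones are 0.0
print(dropped.mean())      # stays close to 1.0 on average

penalty = l2_penalty(np.array([1.0, 2.0]), lam=0.1)
print(penalty)  # 0.1 * (1 + 4) = 0.5
```

Dropout is applied only during training; at inference time the activations are used unchanged, which the rescaling makes consistent on average.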
