Deep Learning Optimization Techniques

The document discusses optimization techniques in deep learning, focusing on constrained and unconstrained optimization, gradient descent methods, and various optimization algorithms like Adagrad, RMSProp, and Adam. It highlights the importance of gradient descent in finding optimal parameters for machine learning models and explains different types of gradient descent, including batch, stochastic, and mini-batch. Additionally, it covers adaptive learning rates and the effectiveness of modern optimizers in navigating complex loss landscapes.

Uploaded by

LALITH Machavarapu

22AIE304 Deep Learning

Constrained and Unconstrained Optimization, Gradient Descent Technique, Adagrad, RMSProp, Adam, AdaBelief

Dr. Deepa Gupta, Professor

Human Language Technology Lab
Dept. of Computer Science and Engineering
Amrita School of Computing, Bangalore
Optimization Techniques
Drawbacks of analytical optimization techniques
1. Difficulty in checking the second-order sufficiency conditions.
2. Difficulty in finding all possible solutions of the nonlinear equations.

Numerical Optimization Techniques
1. Optimization problems are solved with numerical methods to obtain near-exact solutions.
2. Different algorithms are used.
3. Algorithms are mainly classified into two types:
   - Direct search methods: use the objective function alone to search for the optimum.
   - Gradient-based methods: use the derivatives of the function to search for the optimum.

Constrained and Unconstrained Optimization

• An unconstrained optimization problem is one where you only have to be concerned with the objective function you are trying to optimize.
• None of the variables in the objective function are constrained.
• Constrained optimization occurs when one or more of the variables in the objective function are constrained by some function.
• Hence a constrained optimization problem has an objective function and a set of constraints.
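
As a compact illustration of the two problem classes, they can be written as follows. The constraint symbols $g_i$ and $h_j$ are assumed notation and do not appear in the slides.

```latex
% Unconstrained: only the objective function matters
\min_{x \in \mathbb{R}^n} f(x)

% Constrained: an objective function plus a set of constraints
\min_{x \in \mathbb{R}^n} f(x)
\quad \text{subject to} \quad
g_i(x) \le 0,\; i = 1, \dots, m,
\qquad
h_j(x) = 0,\; j = 1, \dots, p
```

For instance, minimizing $f(x, y) = x^2 + y^2$ with no constraints gives $(0, 0)$, while adding the constraint $x + y = 1$ moves the minimizer to $(1/2, 1/2)$.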

Gradient Descent Algorithm
for Machine Learning

Gradient Descent
• Gradient descent is a simple optimization procedure that you can use with many machine learning algorithms.
• Gradient descent is an optimization algorithm used to find the values of the parameters (coefficients) of a function (f) that minimize an objective function (cost/loss).
• Gradient descent is best used when the parameters cannot be calculated analytically and must be searched for by an optimization algorithm.

Iterative formula
• In any numerical algorithm for finding the minimum of an unconstrained problem, the iterative formula used is:
$x^{(k+1)} = x^{(k)} + \alpha^{(k)} d^{(k)}$
where
• $x^{(k)}$ is the point obtained in the previous iteration,
• $\alpha^{(k)}$ is the step length in step $k$, and
• $d^{(k)}$ is the direction of descent in step $k$.

Condition: $f(x^{(k+1)}) < f(x^{(k)})$
Steepest Descent method
• The iterative formula for the method of steepest descent is
$x^{(k+1)} = x^{(k)} + \alpha^{(k)} d^{(k)}$
• The descent direction $d^{(k)}$ is the steepest descent direction, $d^{(k)} = -\nabla f(x^{(k)})$.
• The step length $\alpha^{(k)}$ is obtained by minimizing the single-variable function
$\varphi(\alpha^{(k)}) = f(x^{(k)} + \alpha^{(k)} d^{(k)})$
(i.e. $\varphi'(\alpha^{(k)}) = 0$, $\varphi''(\alpha^{(k)}) > 0$)
Algorithm:
Step 1: Choose an initial starting point $x^{(0)}$ and a termination parameter $\varepsilon$.
Step 2: Compute the steepest descent direction, $d^{(0)} = -\nabla f(x^{(0)})$.
Step 3: Compute the step length $\alpha^{(0)}$.
Step 4: Evaluate $x^{(1)} = x^{(0)} + \alpha^{(0)} d^{(0)}$.
Step 5: Compute $\nabla f(x^{(1)})$. If $\|\nabla f(x^{(1)})\| < \varepsilon$, stop and report $x^{(1)}$ as the minimum;
else go to Step 2 with $x^{(0)} = x^{(1)}$.
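
The steps above can be sketched in Python as follows. This is a minimal illustration, not the lecture's own code: it assumes a quadratic objective, $f(x, y) = 2x^2 - 2xy + y^2$, so that the line search for the step length has a closed form, $\alpha = (g \cdot g) / (g \cdot A g)$ with $g$ the gradient and $A$ the constant Hessian. The function names and the starting point are illustrative choices.

```python
import math

def grad_f(x, y):
    """Gradient of the assumed objective f(x, y) = 2x^2 - 2xy + y^2."""
    return (4 * x - 2 * y, -2 * x + 2 * y)

def exact_step(g):
    """Step length minimizing phi(alpha) = f(x + alpha * d) for d = -g.

    Valid only for this quadratic: f = 0.5 * x^T A x with A = [[4, -2], [-2, 2]],
    which gives alpha = (g . g) / (g . A g).
    """
    gx, gy = g
    ag = (4 * gx - 2 * gy, -2 * gx + 2 * gy)  # A applied to g
    return (gx * gx + gy * gy) / (gx * ag[0] + gy * ag[1])

def steepest_descent(x0, eps=1e-6, max_iter=1000):
    """Repeat Steps 2-5 until the gradient norm falls below eps."""
    x, y = x0
    for k in range(max_iter):
        gx, gy = grad_f(x, y)
        if math.hypot(gx, gy) < eps:       # termination test
            return (x, y), k
        alpha = exact_step((gx, gy))       # step length from the line search
        x, y = x - alpha * gx, y - alpha * gy  # move along d = -grad f
    return (x, y), max_iter

point, iterations = steepest_descent((1.0, 2.0))
```

Here `point` ends up close to the minimizer $(0, 0)$ of the assumed quadratic. For a general objective the closed-form step would have to be replaced by a numerical line search.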

Descent Direction
A direction of descent at a point is a direction along which the function value decreases.
E.g. f(x, y) = x + y has descent directions (-1, 0), (-1, -1), and (-2, 1) from the point (0, 0).

Note: $\nabla f$ gives the direction of maximum increase of a function, i.e. the direction of steepest ascent; hence the direction of steepest descent is given by $-\nabla f$.
• Theorem: A direction given by the vector $d$ is a descent direction at $x^{(k)}$ if $\nabla f(x^{(k)}) \cdot d < 0$.
Proof: A descent step requires $f(x^{(k+1)}) < f(x^{(k)})$, i.e.
$f(x^{(k)} + \alpha^{(k)} d^{(k)}) < f(x^{(k)})$
$f(x^{(k)}) + \alpha^{(k)} \nabla f(x^{(k)}) \cdot d^{(k)} < f(x^{(k)})$ (first-order Taylor series, valid for small $\alpha^{(k)}$)
$\alpha^{(k)} \nabla f(x^{(k)}) \cdot d^{(k)} < 0$
As the step length cannot be negative, $\nabla f(x^{(k)}) \cdot d^{(k)} < 0$.

• The example above can be verified using the theorem.
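
The check $\nabla f \cdot d < 0$ is a single dot product, so the example can be verified with a short sketch. The helper name `is_descent_direction` is hypothetical; the gradient of f(x, y) = x + y is (1, 1) everywhere.

```python
def is_descent_direction(grad, d):
    """Return True when the dot product grad . d is negative."""
    return sum(g * di for g, di in zip(grad, d)) < 0

# Gradient of f(x, y) = x + y at (0, 0); it is (1, 1) at every point.
grad_at_origin = (1.0, 1.0)

directions = [(-1, 0), (-1, -1), (-2, 1)]
results = {d: is_descent_direction(grad_at_origin, d) for d in directions}
```

All three directions give a negative dot product (-1, -2, and -1), so each is a descent direction. A direction such as (1, 1) gives +2 and is not.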

f(x, y) = x + y

E.g. f(x, y) = x + y has descent directions (-1, 0), (-1, -1), and (-2, 1) from the point (0, 0).

$\nabla f(x^{(k)}) \cdot d < 0$, so $d$ is a descent direction.

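Remark:
The condition $\langle d, \nabla f(x) \rangle < 0$ (i.e. $d \cdot \nabla f(x) < 0$) means geometrically that the vectors $d$ and $\nabla f(x)$ make an angle of more than 90 degrees (in the plane that contains them):
$\langle d, \nabla f(x) \rangle = \|d\| \, \|\nabla f(x)\| \cos\theta$, where $\|d\| > 0$, $\|\nabla f(x)\| > 0$, and $-1 \le \cos\theta \le 1$.

The maximum rate of decrease occurs when $\cos\theta = -1$,

i.e. the maximum rate of decrease is obtained only when $d$ points along $-\nabla f(x)$.

Note:
When we speak of direction, the magnitude of the vector does not matter; e.g.
$\nabla f(x)$, $5\nabla f(x)$, $\frac{\nabla f(x)}{20}$, and $\frac{\nabla f(x)}{\|\nabla f(x)\|}$ all point in the same direction.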

Check whether the given function has a descent direction from $x = (2, -1)$ along the given directions $d_1$ and $d_2$:
$f = 2x_1^2 + x_2^2 - 2x_1 x_2 + 2x_1^3 + x_2^4$, $d_1 = (-2, 3)$, $d_2 = (1, 1)$
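
A worked check, reading the extracted exponents as $f = 2x_1^2 + x_2^2 - 2x_1 x_2 + 2x_1^3 + x_2^4$ (an assumption about the original slide), applies the test $\nabla f \cdot d < 0$ at $x = (2, -1)$:

```latex
\nabla f(x) =
\begin{pmatrix}
4x_1 - 2x_2 + 6x_1^2 \\
2x_2 - 2x_1 + 4x_2^3
\end{pmatrix},
\qquad
\nabla f(2, -1) =
\begin{pmatrix} 8 + 2 + 24 \\ -2 - 4 - 4 \end{pmatrix}
=
\begin{pmatrix} 34 \\ -10 \end{pmatrix}

\nabla f(2, -1) \cdot d_1 = 34(-2) + (-10)(3) = -98 < 0

\nabla f(2, -1) \cdot d_2 = 34(1) + (-10)(1) = 24 > 0
```

Under that reading, $d_1$ is a descent direction at $(2, -1)$ and $d_2$ is not.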

Find a minimum of the function $f(x, y) = (x-1)^2 + (y-2)^2$ starting from the point (1, 3) using the steepest descent method. Choose the termination parameter $\varepsilon = 0.0001$ and the step length $\alpha = 0.001$.
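
This exercise fixes the step length instead of computing it by a line search, so each iteration is just $x \leftarrow x - \alpha \nabla f(x)$. A minimal sketch with the values stated above follows; the function names and the `max_iter` safety cap are assumptions added for illustration.

```python
import math

def grad_f(x, y):
    """Gradient of f(x, y) = (x - 1)^2 + (y - 2)^2."""
    return (2 * (x - 1), 2 * (y - 2))

def gradient_descent_fixed_step(start, alpha, eps, max_iter=100_000):
    """Steepest descent with a constant step length alpha."""
    x, y = start
    for k in range(1, max_iter + 1):
        gx, gy = grad_f(x, y)
        x, y = x - alpha * gx, y - alpha * gy   # x(k+1) = x(k) - alpha * grad f
        gx, gy = grad_f(x, y)                   # gradient at the new point
        if math.hypot(gx, gy) < eps:            # stop when ||grad f|| < eps
            return (x, y), k
    return (x, y), max_iter

minimum, steps = gradient_descent_fixed_step((1.0, 3.0), alpha=0.001, eps=0.0001)
```

Because $\alpha$ is small, each update shrinks the distance to the minimizer (1, 2) by only a factor of 0.998, so the loop needs a few thousand updates before the gradient norm drops below $\varepsilon$.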

Questions
1. Check whether the given function has a descent direction from $x = (2, -1)$ along the given directions $d_1$ and $d_2$:
$f = 2x_1^2 + x_2^2 - 2x_1 x_2 + 2x_1^3 + x_2^4$
$d_1 = (-2, 3)$, $d_2 = (1, 1)$

2. Check whether the given function has a descent direction from $x = (4, 2, -1)$ along the given directions $d_1$ and $d_2$:
$f = (x_1 - 1)^4 + (x_2 - 3)^2 + 4(x_3 + 5)^4$
$d_1 = (-1, 10, -1)$, $d_2 = (-1, 2, 1)$
__________________________________________________________
1. Find a minimum of the function $f(x, y) = (x-1)^2 + (y-2)^2$ starting from the point (10, -1) using the steepest descent method, with step length $\alpha = 0.01$ and termination parameter $\varepsilon = 0.1$.
2. Find a minimum of the function $f(x, y) = 2x^2 - 2xy + y^2$ starting from the point (1, 2) using the steepest descent method, with step length $\alpha = 0.001$ and termination parameter $\varepsilon = 0.1$.
Types of Gradient Descent
1. Batch Gradient Descent (GD)
• Full-batch optimization.
• Updates weights after computing gradients on the entire dataset.
• Slow and inefficient for large datasets.

2. Stochastic Gradient Descent (SGD)
• Updates weights using one sample at a time.
• Faster per update and more memory-efficient than full GD.
• Can be noisy and may oscillate.

3. Mini-Batch Gradient Descent
• A hybrid of GD and SGD.
• Updates weights after a small batch of samples.
• Offers a good balance between the cost of each update and the accuracy of the gradient estimate.
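
The three variants differ only in how many samples feed each weight update, which the sketch below expresses as a single `batch_size` parameter. It is an illustration under stated assumptions: a one-variable linear model, mean squared error, and hypothetical noise-free data from y = 2x + 1. None of these come from the slides.

```python
import random

def train_linear(xs, ys, batch_size, lr=0.1, epochs=500, seed=0):
    """Fit y = w*x + b by gradient descent on the mean squared error.

    batch_size == len(xs) -> batch gradient descent
    batch_size == 1       -> stochastic gradient descent
    anything in between   -> mini-batch gradient descent
    """
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)  # visit the samples in a new order each epoch
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            # Average gradient of (w*x + b - y)^2 over the current batch.
            grad_w = sum(2 * (w * xs[i] + b - ys[i]) * xs[i] for i in batch) / len(batch)
            grad_b = sum(2 * (w * xs[i] + b - ys[i]) for i in batch) / len(batch)
            w -= lr * grad_w
            b -= lr * grad_b
    return w, b

# Hypothetical noise-free data generated from y = 2x + 1.
xs = [i / 10 for i in range(20)]
ys = [2 * x + 1 for x in xs]

w_batch, b_batch = train_linear(xs, ys, batch_size=len(xs))  # 1 update per epoch
w_sgd, b_sgd = train_linear(xs, ys, batch_size=1)            # 20 updates per epoch
w_mini, b_mini = train_linear(xs, ys, batch_size=5)          # 4 updates per epoch
```

On this toy data all three settings recover w close to 2 and b close to 1. With noisy data the single-sample updates would fluctuate around the solution rather than settle exactly.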
[Figure: three panels titled Batch Gradient Descent, Stochastic Gradient Descent, and Mini-Batch Gradient Descent]

Types of Optimizers
Here are some well-known optimizers and what their names stand for:
1. SGD – Stochastic Gradient Descent
2. Adam – Adaptive Moment Estimation
3. RMSprop – Root Mean Square Propagation
4. Adagrad – Adaptive Gradient Algorithm
5. Adadelta – an extension of Adagrad that seeks to reduce its aggressive, monotonically decreasing learning rate
6. Nadam – Nesterov-accelerated Adaptive Moment Estimation
7. AdamW – Adam with Weight Decay; an improved version of Adam that applies weight decay directly to the weights, decoupled from the gradient-based update, instead of adding an L2 penalty to the loss
Each of these optimizers has its own way of updating model parameters to achieve better performance during training.
In deep learning, optimizers are algorithms or methods used to adjust the weights of neural networks to minimize the
loss function. Different optimizers affect the speed, stability, and final accuracy of training. Here are the most commonly
used ones:

Momentum: Smoother & Faster Learning
Concept:
• Momentum accelerates updates along directions where successive gradients agree, leading to faster convergence.
• It adds a "velocity" term that carries forward past gradients to smooth the update process.
• Think of it like pushing a ball down a hill: it gains speed and doesn't stop at every small bump.
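
One common way to write the velocity idea is $v \leftarrow \beta v + g$, $x \leftarrow x - \eta v$. The sketch below uses that form; other formulations scale the terms differently. The objective f(x) = x^2 and the values of `lr` and `beta` are assumptions chosen for illustration.

```python
def gd_momentum(grad, x0, lr=0.1, beta=0.9, steps=300):
    """Gradient descent with a velocity term: v <- beta*v + g, x <- x - lr*v."""
    x, v = x0, 0.0
    for _ in range(steps):
        g = grad(x)
        v = beta * v + g   # velocity accumulates a decaying sum of past gradients
        x = x - lr * v     # step along the smoothed direction
    return x

# Hypothetical test objective f(x) = x^2, whose gradient is 2x.
x_final = gd_momentum(lambda x: 2 * x, x0=5.0)
```

With these settings the iterate overshoots and oscillates around the minimum at 0 with shrinking amplitude, much like the ball in the analogy rolling past the bottom before settling.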

What Is Adaptive Learning Rate?
In standard gradient descent, a single fixed learning rate $\eta$ is used for all weights: $w \leftarrow w - \eta \, \partial L / \partial w$.
But sometimes:

• Some weights learn too fast → cause instability.

• Others learn too slowly → slow convergence.

Solution: Adaptive learning rate techniques adjust the learning rate individually for each parameter based on its historical gradients.

Imagine you're learning to ride a bike on different terrains:

• On smooth road: You go fast.
• On rocky road: You slow down and adjust carefully.

Similarly, adaptive optimizers:

• Speed up learning on stable/flat areas.
• Slow down on steep, sensitive or noisy areas.

Popular Optimizers with Adaptive Learning Rates
1. Adagrad
• Adapts the learning rate for each parameter based on the sum of past squared gradients.
• Great for sparse data, but the learning rate decreases too much over time.

2. RMSprop
• Improves on Adagrad by using an exponentially weighted moving average of squared gradients.
• Works well on non-stationary problems (such as training RNNs).

3. Adam (Adaptive Moment Estimation)
• Combines ideas from RMSprop and Momentum.
• Maintains moving averages of both past gradients and past squared gradients.
• Widely used and effective in many deep learning tasks.

4. Adadelta
• Extension of Adagrad that addresses its rapidly decreasing learning rate.

5. AdamW
• Adam with proper (decoupled) weight decay; also uses adaptive learning rates.
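
The differences between the first three optimizers show up in how each one scales the gradient. The sketch below writes one scalar update per optimizer in a commonly used form. The hyperparameter defaults and the `state` dictionary are illustrative assumptions, not values from the slides or any library's API.

```python
import math

def adagrad_step(x, g, state, lr=0.1, eps=1e-8):
    """Adagrad: divide by the root of the running SUM of squared gradients."""
    state["G"] = state.get("G", 0.0) + g * g
    return x - lr * g / (math.sqrt(state["G"]) + eps)

def rmsprop_step(x, g, state, lr=0.1, rho=0.9, eps=1e-8):
    """RMSprop: use an exponentially weighted moving AVERAGE instead of a sum."""
    state["v"] = rho * state.get("v", 0.0) + (1 - rho) * g * g
    return x - lr * g / (math.sqrt(state["v"]) + eps)

def adam_step(x, g, state, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8):
    """Adam: moving averages of the gradient (m) and squared gradient (v)."""
    t = state["t"] = state.get("t", 0) + 1
    state["m"] = beta1 * state.get("m", 0.0) + (1 - beta1) * g
    state["v"] = beta2 * state.get("v", 0.0) + (1 - beta2) * g * g
    m_hat = state["m"] / (1 - beta1 ** t)   # bias correction for the zero start
    v_hat = state["v"] / (1 - beta2 ** t)
    return x - lr * m_hat / (math.sqrt(v_hat) + eps)

# One step from x = 1 on the hypothetical objective f(x) = x^2 (gradient 2x = 2).
first_steps = {
    "adagrad": adagrad_step(1.0, 2.0, {}),
    "rmsprop": rmsprop_step(1.0, 2.0, {}),
    "adam": adam_step(1.0, 2.0, {}),
}
```

On the first step Adagrad and Adam both move by about the learning rate (to roughly 0.9), because the gradient is divided by its own magnitude. RMSprop in this form moves further, since its average starts at zero and has no bias correction. Over many steps Adagrad's sum keeps growing, which is the shrinking learning rate that RMSprop and Adadelta were designed to avoid.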
Why Gradient Descent Still Works (Even for Non-Convex Loss):
1. Gradient Descent Only Needs Gradients:
As long as the loss is differentiable, gradient descent can be applied; convexity is not a requirement.

2. Local Minima Are Often Good Enough:
In deep learning, we don't need the global minimum. Many local minima generalize well on unseen data.

3. Modern Optimizers Help:
Algorithms like Adam, RMSProp, and SGD with momentum help navigate complex, non-convex landscapes better.
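
A one-dimensional sketch makes the first two points concrete. The non-convex function $f(x) = x^4 - 3x^2 + x$, the learning rate, and the starting points are assumptions chosen for illustration. Plain gradient descent runs without any convexity assumption and settles in whichever basin it starts in.

```python
def f(x):
    """A hypothetical non-convex objective with two local minima."""
    return x**4 - 3 * x**2 + x

def grad_f(x):
    """Derivative of f."""
    return 4 * x**3 - 6 * x + 1

def gradient_descent(x0, lr=0.01, steps=2000):
    """Plain gradient descent: only the gradient is needed."""
    x = x0
    for _ in range(steps):
        x -= lr * grad_f(x)
    return x

x_left = gradient_descent(-2.0)   # ends in the basin left of the origin
x_right = gradient_descent(2.0)   # ends in the basin right of the origin
```

Both runs stop at points where the gradient is essentially zero. The left one is the global minimum and the right one is a local minimum with a higher function value, so the result depends on initialization.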
Thank you
