Deep Learning Optimization Techniques

The document discusses optimization techniques in deep learning, focusing on constrained and unconstrained optimization, gradient descent methods, and various optimization algorithms like Adagrad, RMSProp, and Adam. It highlights the importance of gradient descent in finding optimal parameters for machine learning models and explains different types of gradient descent, including batch, stochastic, and mini-batch. Additionally, it covers adaptive learning rates and the effectiveness of modern optimizers in navigating complex loss landscapes.

22AIE304 Deep Learning

Constrained and Unconstrained Optimization, Gradient Descent Technique,
Adagrad, RMSProp, Adam, AdaBelief

Dr. Deepa Gupta, Professor


Human Language Technology Lab
Dept. of Computer Science and Engineering
Amrita School of Computing, Bangalore

Optimization Techniques

Drawbacks of analytical optimization techniques
1. It is difficult to check the second-order sufficiency conditions.
2. It is difficult to find all possible solutions of the resulting nonlinear equations.

Numerical Optimization Techniques
1. Optimization problems are solved with numerical methods to obtain solutions close to the exact one.
2. Different algorithms are used.
3. The algorithms are mainly classified into two types:
-- Direct search methods
Use the objective function alone to search for the optimum.
-- Gradient-based methods
Use the derivatives of the function to search for the optimum.

Constrained and Unconstrained Optimization

• An unconstrained optimization problem is one where you only have to be concerned with the objective function you are trying to optimize.
• None of the variables in the objective function are constrained.
• Constrained optimization occurs when one or more of the variables in the objective function are constrained by some function.
• Hence a constrained optimization problem has an objective function and a set of constraints.

Gradient Descent Algorithm for Machine Learning

Gradient Descent
• Gradient descent is a simple optimization procedure that you can use with many machine learning algorithms.
• Gradient descent is an optimization algorithm used to find the values of the parameters (coefficients) of a function f that minimize an objective function (cost/loss).
• Gradient descent is best used when the parameters cannot be calculated analytically and must be searched for by an optimization algorithm.

Iterative formula
• In any numerical algorithm for finding the minimum of an unconstrained problem, the iterative formula used is:
$x^{(k+1)} = x^{(k)} + \alpha^{(k)} d^{(k)}$
where
• $x^{(k)}$ is the current iteration point (the result of the previous step),
• $\alpha^{(k)}$ is the step length in step $k$, and
• $d^{(k)}$ is the direction of descent in step $k$.

Condition: $f(x^{(k+1)}) < f(x^{(k)})$

Steepest Descent method
• The iterative formula for the method of steepest descent is
$x^{(k+1)} = x^{(k)} + \alpha^{(k)} d^{(k)}$
• The descent direction $d^{(k)}$ is the steepest descent direction, $d^{(k)} = -\nabla f(x^{(k)})$.
• The step length $\alpha^{(k)}$ is obtained by minimizing the single-variable function
$\varphi(\alpha^{(k)}) = f(x^{(k)} + \alpha^{(k)} d^{(k)})$
(that is, $\varphi'(\alpha^{(k)}) = 0$ and $\varphi''(\alpha^{(k)}) > 0$).
Algorithm:
Step 1: Choose an initial starting point $x^{(0)}$ and a termination parameter $\varepsilon$.
Step 2: Compute the steepest descent direction, $d^{(0)} = -\nabla f(x^{(0)})$.
Step 3: Compute the step length $\alpha^{(0)}$.
Step 4: Evaluate $x^{(1)} = x^{(0)} + \alpha^{(0)} d^{(0)}$.
Step 5: Compute $\nabla f(x^{(1)})$. If $\|\nabla f(x^{(1)})\| < \varepsilon$, stop and report $x^{(1)}$ as the minimum;
else go to Step 2 with $x^{(0)} = x^{(1)}$.
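
The five steps above can be sketched in Python. This is a minimal sketch under stated assumptions, not a prescribed implementation: the slides do not say how to minimize $\varphi$, so a golden-section search on the bracket [0, alpha_max] is used here, and alpha_max = 2.0, the tolerances, and all function names are illustrative choices. The test function $f(x, y) = 2x^2 - 2xy + y^2$ and the starting point (1, 2) are borrowed from the exercises in this deck. The gradient-norm test is done at the top of each pass, which is equivalent to checking it after each update.

```python
import math

def golden_section(phi, lo, hi, tol=1e-10):
    """Minimize a unimodal one-variable function phi on [lo, hi]."""
    ratio = (math.sqrt(5) - 1) / 2  # inverse golden ratio, about 0.618
    a, b = lo, hi
    while b - a > tol:
        c = b - ratio * (b - a)
        d = a + ratio * (b - a)
        if phi(c) < phi(d):
            b = d  # minimum lies in [a, d]
        else:
            a = c  # minimum lies in [c, b]
    return (a + b) / 2

def steepest_descent(f, grad, x0, eps=1e-6, alpha_max=2.0, max_iter=1000):
    x = list(x0)                                        # Step 1
    for _ in range(max_iter):
        g = grad(x)
        if math.sqrt(sum(gi * gi for gi in g)) < eps:   # Step 5 (termination)
            break
        d = [-gi for gi in g]                           # Step 2: d = -grad f(x)
        # Step 3: step length from a line search on phi(alpha) = f(x + alpha d).
        # alpha_max is an assumed bracket, not something the slides specify.
        phi = lambda a: f([xi + a * di for xi, di in zip(x, d)])
        alpha = golden_section(phi, 0.0, alpha_max)
        x = [xi + alpha * di for xi, di in zip(x, d)]   # Step 4
    return x

f = lambda p: 2 * p[0] ** 2 - 2 * p[0] * p[1] + p[1] ** 2
grad = lambda p: [4 * p[0] - 2 * p[1], -2 * p[0] + 2 * p[1]]

x_star = steepest_descent(f, grad, [1.0, 2.0])
print(x_star)  # close to the minimizer (0, 0)
```

The bracket must contain the minimizer of $\varphi$; for other functions alpha_max may need to be enlarged or replaced by a proper bracketing step.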

Descent Direction
A direction of descent at a point is a direction along which the function value decreases.
E.g.: $f(x, y) = x + y$ has descent directions $(-1, 0)$, $(-1, -1)$ and $(-2, 1)$ from the point $(0, 0)$.

Note: $\nabla f$ gives the direction of maximum increase of a function, i.e. the direction of steepest ascent; hence the direction of steepest descent is given by $-\nabla f$.
• Theorem: A direction given by the vector $d$ is a descent direction if $\nabla f(x^{(k)}) \cdot d < 0$.
Proof: $f(x^{(k+1)}) < f(x^{(k)})$
$f(x^{(k)} + \alpha^{(k)} d^{(k)}) < f(x^{(k)})$
$f(x^{(k)}) + \alpha^{(k)} \nabla f(x^{(k)}) \cdot d^{(k)} < f(x^{(k)})$ (first-order Taylor series, valid for small $\alpha^{(k)}$)
$\alpha^{(k)} \nabla f(x^{(k)}) \cdot d^{(k)} < 0$
As the step length cannot be negative, $\nabla f(x^{(k)}) \cdot d < 0$.

• The example mentioned above can be verified using the theorem.


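[Figure: plot of $f(x, y) = x + y$]

E.g.: $f(x, y) = x + y$ has descent directions $(-1, 0)$, $(-1, -1)$ and $(-2, 1)$ from the point $(0, 0)$.

Here $\nabla f = (1, 1)$, so $\nabla f \cdot d$ equals $-1$, $-2$ and $-1$ for the three directions.
In each case $\nabla f(x^{(k)}) \cdot d < 0$, so $d$ is a descent direction.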
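
The test $\nabla f \cdot d < 0$ is easy to script. A minimal sketch for the example $f(x, y) = x + y$, whose gradient is $(1, 1)$ at every point; the helper names are illustrative, and the direction $(1, 1)$ is added here only as a counterexample.

```python
def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def is_descent_direction(grad_at_x, d):
    """True when the directional derivative along d is strictly negative."""
    return dot(grad_at_x, d) < 0

grad_f = (1.0, 1.0)  # gradient of f(x, y) = x + y, the same at every point
directions = [(-1, 0), (-1, -1), (-2, 1)]
results = [is_descent_direction(grad_f, d) for d in directions]
print(results)                                  # [True, True, True]
print(is_descent_direction(grad_f, (1, 1)))     # False: f increases along (1, 1)
```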

Remark:
The condition $\langle d, \nabla f(x) \rangle < 0$ (that is, $d \cdot \nabla f(x) < 0$) means geometrically that the vectors $d$ and $\nabla f(x)$ make an angle of more than 90 degrees (in the plane that contains them):
$\langle d, \nabla f(x) \rangle = \|d\| \, \|\nabla f(x)\| \cos\theta$
where $\|d\|$ and $\|\nabla f(x)\|$ are positive and $-1 \le \cos\theta \le 1$.

The maximum rate of decrease occurs when $\cos\theta = -1$, i.e. only when $d$ points along $-\nabla f(x)$.

Note:
When we speak of a direction, the magnitude of the vector does not matter; e.g. $\nabla f(x)$, $5\nabla f(x)$, $\nabla f(x)/20$ and $\nabla f(x)/\|\nabla f(x)\|$ are all in the same direction.

Check whether the given function has a descent direction from $x = (2, -1)$ along the given directions $d_1$ and $d_2$:
$f = 2x_1^2 + x_2^2 - 2x_1 x_2 + 2x_1^3 + x_2^4$, $d_1 = (-2, 3)$, $d_2 = (1, 1)$

Find a minimum of the function $f(x, y) = (x-1)^2 + (y-2)^2$ starting from the point $(1, 3)$,
using the steepest descent method. Choose the termination parameter $\varepsilon = 0.0001$ and the step
length $\alpha = 0.001$.
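
This example can be run directly. A minimal sketch, assuming the step length is held fixed at $\alpha = 0.001$ on every iteration (as the problem statement suggests) and that the stopping test is $\|\nabla f\| < \varepsilon$; the function names and the iteration cap are illustrative.

```python
import math

def grad_f(x, y):
    # Gradient of f(x, y) = (x - 1)^2 + (y - 2)^2.
    return 2 * (x - 1), 2 * (y - 2)

def fixed_step_descent(x, y, alpha=0.001, eps=0.0001, max_iter=100_000):
    """Return the final point and the number of updates performed."""
    for k in range(max_iter):
        gx, gy = grad_f(x, y)
        if math.hypot(gx, gy) < eps:      # termination: gradient norm below eps
            return x, y, k
        x, y = x - alpha * gx, y - alpha * gy
    return x, y, max_iter

x_min, y_min, n_iter = fixed_step_descent(1.0, 3.0)
print(x_min, y_min, n_iter)  # approaches (1, 2)
```

Each update shrinks the distance to the minimizer (1, 2) by the factor $1 - 2\alpha = 0.998$, so with such a small step the run takes a few thousand iterations.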

Questions
1. Check whether the given function has a descent direction from $x = (2, -1)$ along the
given directions $d_1$ and $d_2$:
$f = 2x_1^2 + x_2^2 - 2x_1 x_2 + 2x_1^3 + x_2^4$
$d_1 = (-2, 3)$, $d_2 = (1, 1)$

2. Check whether the given function has a descent direction from $x = (4, 2, -1)$ along the
given directions $d_1$ and $d_2$:
$f = (x_1 - 1)^4 + (x_2 - 3)^2 + 4(x_3 + 5)^4$
$d_1 = (-1, 10, -1)$, $d_2 = (-1, 2, 1)$
__________________________________________________________
1. Find a minimum of the function $f(x, y) = (x-1)^2 + (y-2)^2$ starting from the point $(10, -1)$,
using the steepest descent method. Here the step length is $\alpha = 0.01$. Choose the termination
parameter $\varepsilon = 0.1$.
2. Find a minimum of the function $f(x, y) = 2x^2 - 2xy + y^2$ starting from the point $(1, 2)$,
using the steepest descent method. Here the step length is $\alpha = 0.001$. Choose the termination
parameter $\varepsilon = 0.1$.
Types of Gradient Descent
1. Batch Gradient Descent (GD)
• Full-batch optimization.
• Updates weights after computing gradients on the entire dataset.
• Slow and inefficient for large datasets.

2. Stochastic Gradient Descent (SGD)
• Updates weights using one sample at a time.
• Faster and more memory-efficient than full GD.
• Can be noisy and may oscillate.

3. Mini-Batch Gradient Descent
• A hybrid of GD and SGD.
• Updates weights after a small batch of samples.
• Offers a good balance between performance and accuracy.
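
The three variants differ only in how many samples feed each weight update. The sketch below makes that explicit with a single `batch_size` parameter on a toy linear regression; the data set (noise-free points on $y = 2x + 1$), the learning rate, the epoch count and the function name are all assumptions made for illustration.

```python
import random

def minibatch_gd(xs, ys, batch_size, lr=0.2, epochs=500, seed=0):
    """Fit y ~ w*x + b by minimizing the mean squared error."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    idx = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(idx)  # visit the samples in a new order every epoch
        for start in range(0, len(idx), batch_size):
            batch = idx[start:start + batch_size]
            # Gradient of the mean squared error over this batch only.
            gw = sum(2 * (w * xs[i] + b - ys[i]) * xs[i] for i in batch) / len(batch)
            gb = sum(2 * (w * xs[i] + b - ys[i]) for i in batch) / len(batch)
            w -= lr * gw
            b -= lr * gb
    return w, b

xs = [i / 31 for i in range(32)]
ys = [2 * x + 1 for x in xs]

batch_fit = minibatch_gd(xs, ys, batch_size=len(xs))  # batch GD: 1 update per epoch
sgd_fit = minibatch_gd(xs, ys, batch_size=1)          # SGD: 32 updates per epoch
mini_fit = minibatch_gd(xs, ys, batch_size=8)         # mini-batch: 4 updates per epoch
print(batch_fit, sgd_fit, mini_fit)
```

Because the toy data are noise-free, all three settings approach w = 2, b = 1; on real, noisy data the single-sample updates would fluctuate more around the solution.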
[Figure: three panels titled Batch Gradient Descent, Stochastic Gradient Descent and Mini-Batch Gradient Descent]

Types of Optimizers
Here are some well-known optimizers and what their names stand for:
1. SGD – Stochastic Gradient Descent
2. Adam – Adaptive Moment Estimation
3. RMSprop – Root Mean Square Propagation
4. Adagrad – Adaptive Gradient Algorithm
5. Adadelta – An extension of Adagrad that seeks to reduce its aggressive, monotonically decreasing learning rate
6. Nadam – Nesterov-accelerated Adaptive Moment Estimation
7. AdamW – "Adam with Weight Decay": a variant of Adam that applies weight decay directly to the weights, decoupled from the gradient-based update, instead of adding an L2 penalty to the loss
Each of these optimizers updates the model parameters in its own way to achieve better performance during training.

In deep learning, optimizers are algorithms or methods used to adjust the weights of neural networks to minimize the
loss function. Different optimizers affect the speed, stability, and final accuracy of training. Here are the most commonly
used ones:

Momentum: Smoother & Faster Learning
Concept:
• Momentum accelerates the updates along directions in which successive gradients agree, leading to faster convergence.
• It adds a "velocity" term that carries forward past gradients to smooth the update process.
• Think of it like pushing a ball down a hill: it gains speed and doesn't stop with every small bump.
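
The slides give no formula for the velocity term, so the sketch below uses one common formulation (classical, or "heavy-ball", momentum): $v \leftarrow \beta v - \alpha \nabla f(x)$, then $x \leftarrow x + v$. The quadratic test function, the values of `lr` and `beta`, and the function names are assumptions chosen for illustration.

```python
def grad(p):
    # Gradient of the illustrative quadratic f(x, y) = 0.5 * (x^2 + 25 y^2),
    # which is much steeper along y than along x.
    x, y = p
    return [x, 25.0 * y]

def momentum_descent(p0, lr=0.03, beta=0.9, steps=300):
    p = list(p0)
    v = [0.0] * len(p)  # velocity: exponentially decaying sum of past steps
    for _ in range(steps):
        g = grad(p)
        v = [beta * vi - lr * gi for vi, gi in zip(v, g)]  # carry past gradients forward
        p = [pi + vi for pi, vi in zip(p, v)]              # move along the velocity
    return p

p_final = momentum_descent([10.0, 1.0])
print(p_final)  # close to the minimizer (0, 0)
```

Setting `beta=0.0` recovers plain gradient descent with the same learning rate, which makes the two easy to compare on the same problem.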

What Is Adaptive Learning Rate?
In standard gradient descent, you use a single fixed learning rate $\eta$ for all weights:
$w \leftarrow w - \eta \nabla L(w)$
But sometimes:
• Some weights learn too fast → cause instability.
• Others learn too slowly → slow convergence.

Solution: Adaptive learning rate techniques adjust the learning rate individually for each parameter based on its historical gradients.

Imagine you're learning to ride a bike on different terrains:
• On a smooth road: you go fast.
• On a rocky road: you slow down and adjust carefully.

Similarly, adaptive optimizers:
• Speed up learning on stable/flat areas.
• Slow down on steep, sensitive or noisy areas.

Popular Optimizers with Adaptive Learning Rates
1. Adagrad
• Adapts the learning rate for each parameter based on the sum of past squared gradients
• Great for sparse data, but the learning rate decreases too much over time

2. RMSprop
• Improves Adagrad by using an exponentially weighted moving average of squared gradients
• Works well in non-stationary problems (like RNNs)

3. Adam (Adaptive Moment Estimation)
• Combines ideas from RMSprop and Momentum
• Maintains running averages of both past gradients and past squared gradients
• Widely used and effective in many deep learning tasks

4. Adadelta
• Extension of Adagrad, addressing its rapidly decreasing learning rate

5. AdamW
• Adam with proper (decoupled) weight decay; also uses adaptive learning rates
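
The slides describe these optimizers in words only. The sketch below writes out single-step updates for Adagrad, RMSprop and Adam in their commonly published forms; the default hyperparameters (`lr`, `rho`, `beta1`, `beta2`, `eps`), the state dictionaries and the function names are assumptions made for illustration, not taken from the slides or from any particular library.

```python
import math

def adagrad_step(x, g, state, lr=0.1, eps=1e-8):
    # state["G"]: running SUM of squared gradients, one entry per parameter.
    state["G"] = [Gi + gi * gi for Gi, gi in zip(state["G"], g)]
    return [xi - lr * gi / (math.sqrt(Gi) + eps)
            for xi, gi, Gi in zip(x, g, state["G"])]

def rmsprop_step(x, g, state, lr=0.1, rho=0.9, eps=1e-8):
    # state["s"]: exponentially weighted moving AVERAGE of squared gradients.
    state["s"] = [rho * si + (1 - rho) * gi * gi for si, gi in zip(state["s"], g)]
    return [xi - lr * gi / (math.sqrt(si) + eps)
            for xi, gi, si in zip(x, g, state["s"])]

def adam_step(x, g, state, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8):
    # state["m"], state["v"]: moving averages of gradients and squared gradients.
    state["t"] += 1
    t = state["t"]
    state["m"] = [beta1 * mi + (1 - beta1) * gi for mi, gi in zip(state["m"], g)]
    state["v"] = [beta2 * vi + (1 - beta2) * gi * gi for vi, gi in zip(state["v"], g)]
    m_hat = [mi / (1 - beta1 ** t) for mi in state["m"]]  # bias correction
    v_hat = [vi / (1 - beta2 ** t) for vi in state["v"]]
    return [xi - lr * mh / (math.sqrt(vh) + eps)
            for xi, mh, vh in zip(x, m_hat, v_hat)]

x0 = [1.0, -2.0]
g0 = [3.0, -0.5]  # two parameters with very different gradient magnitudes

ada = {"G": [0.0, 0.0]}
x_ada1 = adagrad_step(x0, g0, ada)
x_ada2 = adagrad_step(x_ada1, g0, ada)  # same gradient again: a smaller step

rms = {"s": [0.0, 0.0]}
x_rms1 = rmsprop_step(x0, g0, rms)

adam = {"t": 0, "m": [0.0, 0.0], "v": [0.0, 0.0]}
x_adam1 = adam_step(x0, g0, adam)

print(x_ada1, x_ada2, x_rms1, x_adam1)
```

On the first step, Adagrad and Adam move both parameters by about `lr` even though their gradients differ by a factor of six, which is the per-parameter scaling described above. Feeding Adagrad the same gradient a second time gives a smaller step, because its sum of squared gradients only grows.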

Why Gradient Descent Still Works (Even for Non-Convex Loss):
1. Gradient Descent Only Needs Gradients:
As long as the loss is differentiable, gradient descent can be applied; convexity is not a requirement.

2. Local Minima Are Often Good Enough:
In deep learning, we don't need the global minimum.
Many local minima generalize well on unseen data.

3. Modern Optimizers Help:
Algorithms like Adam, RMSProp and SGD with momentum help navigate
complex, non-convex landscapes better.
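
A small experiment shows gradient descent running on a non-convex function. The one-dimensional function $f(x) = x^4 - 3x^2 + x$ is an illustrative choice, not taken from the slides; it has two minima of different depth, and which one plain gradient descent reaches depends on the starting point. The learning rate and step count are likewise assumptions.

```python
def f(x):
    return x ** 4 - 3 * x ** 2 + x

def df(x):
    return 4 * x ** 3 - 6 * x + 1

def gradient_descent(x, lr=0.01, steps=2000):
    for _ in range(steps):
        x -= lr * df(x)  # only the derivative is needed, not convexity
    return x

x_left = gradient_descent(-2.0)   # starts left of the hump between the minima
x_right = gradient_descent(2.0)   # starts right of it

print(x_left, f(x_left))
print(x_right, f(x_right))
```

Both runs stop at a point where the derivative vanishes, but the minimum reached from the right is shallower than the one reached from the left: gradient descent finds a local minimum, not necessarily the global one.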
Thank you