Understanding Optimization Algorithms

Uploaded by

prof.severussnape.hp

Introduction to Optimization
CHAPTER 4
Learning objectives
• Understand the need for optimization algorithms.
• Know the types of optimization functions.
• Understand the first-order gradient descent algorithms.
• Know the types of gradient descent algorithms.
• Know about momentum.
• Know the types of momentum-based algorithms.
• Understand the second-order optimization algorithms.
Overview of Optimization
• Training a neural network amounts to finding optimal values for the
model parameters based on the training dataset.
• The stages of training are:
(i) Compute the loss for every data element of the training
data.
(ii) Compute the gradients of the loss function with respect to
every model parameter using backpropagation.
(iii) Update the model parameters with the optimization
algorithm, using the computed gradients.
Parameters of the Neural Network:
Hyperparameters: These are user-controlled parameters that can be
tuned for the optimal performance of the neural network.
Parameters: These are the model-controlled parameters that must be tuned by
the learning algorithm for optimal performance by learning from data.
Optimization Problem:
An optimization problem has an objective function and a set of constraints.
The objective is either the minimization or the maximization of a function, such as
the loss function.
Solutions that satisfy all the constraints are called feasible solutions; there may be many of them.
An optimal solution is a feasible solution that gives the best value of the
objective function.
Optimization Algorithm
• The optimizer is an algorithm or method that is used to minimize the loss
function and thereby improve the performance of the neural network.
• Popular optimizers are:
Stochastic Gradient Descent
ADAM
AdaGrad
• Role of the optimizer:
Minimizing loss functions
Efficiency
Handling of non-convex functions
Adaptation to data
Convergence
Search Space
• The optimization problems can be framed as search problems.
• The search space is the universe of candidate solutions for the objective function.
• The optimization algorithms search the search space efficiently to find the best
solution.
• The maximum or minimum of the function is called the optimum of the problem.
• The function that needs to be optimized may be a convex function or a
non-convex function.
Surface plot of a convex function
Non-Convex Function
Some of the issues that are associated with non-convex function are listed below:
Local Minima
Saddle Point
Plateau
Types of Optimization Problem
Hyperparameter Tuning
• Hyperparameters are parameters of the learning algorithm itself and
are not learned from the data
• Some of the naïve algorithms for hyperparameter tuning are Random
Search and Grid Search.
Steps of Random Search Algorithm
1. Sample hyperparameter combinations randomly from a chosen distribution.
2. For each sampled combination, evaluate the model and record
the results.
3. Pick the combination of the inputs for which the model generalizes
best.
The advantage of this is that no assumption is made about the optimization
function, and little memory is required for this procedure.
The disadvantage is that the computational cost can be high, as every sampled
combination requires a model evaluation, and the best combination may never be sampled.
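The three steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `validation_score` is a hypothetical stand-in for training and evaluating a real model, and the two hyperparameters and their ranges are invented for the example.

```python
import random

def validation_score(learning_rate, num_units):
    # Hypothetical stand-in for "train the model and measure validation accuracy".
    # By construction it is highest near learning_rate = 0.01 and num_units = 64.
    return 1.0 - abs(learning_rate - 0.01) * 10 - abs(num_units - 64) / 1000

def random_search(num_trials, seed=0):
    rng = random.Random(seed)
    best_score, best_config = float("-inf"), None
    for _ in range(num_trials):
        # Step 1: sample each hyperparameter from its distribution.
        config = {
            "learning_rate": 10 ** rng.uniform(-4, -1),  # log-uniform sample
            "num_units": rng.choice([16, 32, 64, 128]),
        }
        # Step 2: evaluate the model for the sampled combination.
        score = validation_score(**config)
        # Step 3: keep the best combination seen so far.
        if score > best_score:
            best_score, best_config = score, config
    return best_config, best_score

best_config, best_score = random_search(num_trials=50)
print(best_config, best_score)
```

Only the best combination is stored, which is why the memory requirement stays small regardless of the number of trials.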
Grid Search
It consists of the following steps:
1. Take each hyperparameter with a discrete set of candidate values.
2. Form all the combinations of these values.
3. Choose a metric, such as accuracy, for the evaluation of the model.
4. Test every combination against the chosen metric.
5. Choose the best combination for the final model.
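A minimal sketch of these grid search steps follows. The function `validation_accuracy` and the grid values are hypothetical placeholders, not taken from the slides; in practice the function would train and evaluate the model.

```python
from itertools import product

def validation_accuracy(learning_rate, batch_size):
    # Hypothetical stand-in for training the model and measuring accuracy.
    return 1.0 - abs(learning_rate - 0.01) * 10 - abs(batch_size - 32) / 1000

# Step 1: each hyperparameter with a discrete set of values.
grid = {
    "learning_rate": [0.001, 0.01, 0.1],
    "batch_size": [16, 32, 64],
}

# Steps 2 to 4: enumerate every combination and evaluate it.
names = list(grid)
results = []
for values in product(*(grid[name] for name in names)):
    config = dict(zip(names, values))
    results.append((validation_accuracy(**config), config))

# Step 5: pick the combination with the best metric.
best_accuracy, best_config = max(results, key=lambda item: item[0])
print(best_config, best_accuracy)
```

The number of evaluations is the product of the list lengths (here 3 x 3 = 9), which is why grid search becomes expensive as hyperparameters are added.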
Gradient Descent Algorithms
• It is used to find the values of the model parameters that minimize the
loss function during the training process.
• The gradient means the “slope” or “slant” of the surface.
• The objective of finding the minimum is achieved as follows:
(i) For example, consider a univariate function f(x). The derivative of the
function with respect to x is computed.
(ii) Gradient descent is an iterative algorithm that starts with an initial
point, say w0. Then, the slope of the objective function is obtained with respect
to each variable. If the function has more variables x0, x1, …, then the partial
derivatives of the function with respect to all these variables are computed
and stored as a vector called the gradient.
(iii) The gradient of the multivariate function indicates two factors:
1. Magnitude, which indicates how much the function changes.
2. Direction, which indicates the direction of steepest increase; gradient descent moves in the opposite direction.
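As a small illustration of the univariate case, the sketch below minimizes f(x) = (x - 3)^2. The function, the starting point w0 = 0, and the learning rate are assumptions chosen for the example.

```python
def f(x):
    return (x - 3.0) ** 2

def df(x):
    # Derivative of f: the slope at x.
    return 2.0 * (x - 3.0)

def gradient_descent(w0, learning_rate=0.1, num_iters=100):
    w = w0
    for _ in range(num_iters):
        # Move against the slope, scaled by the learning rate.
        w = w - learning_rate * df(w)
    return w

w_min = gradient_descent(w0=0.0)
print(w_min, f(w_min))
```

With these settings each step shrinks the distance to the minimum at x = 3 by a constant factor, so the iterate approaches 3.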
Batch Gradient Descent
• Batch Gradient Descent is a gradient descent algorithm that uses the entire training dataset to compute each update.
The procedure for batch gradient descent is given as follows:
1. Initialize the model parameters θ with random values or with predefined
values.
2. Initialize the learning rate η, which determines the step size. Initialize
the number of iterations.
3. Repeat until convergence or until the number of iterations is reached:
3.1 For the entire dataset, compute the gradient of the cost function as follows:

$\nabla J(\theta_t) = \frac{1}{m} \sum_{i=1}^{m} \nabla J(\theta_t; x_i, y_i)$

Here, m is the number of samples in the training dataset, J is the cost
function, and θ is the parameter vector of the neural network. (xi, yi) is the
i-th data point, and ∇J(θt; xi, yi) is the gradient at that point.
3.2 Update the model parameters by taking a step in the direction of the negative
gradient, scaled by η, as

$\theta_{t+1} = \theta_t - \eta \, \nabla J(\theta_t)$

The symbol θt indicates the model parameters at iteration t.

4. Return the optimal parameters and end.
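The procedure can be sketched on a toy linear model. The dataset (points on y = 2x + 1), the squared-error cost, and the values of η and the iteration count are assumptions made for illustration only.

```python
# Toy dataset generated from y = 2x + 1 (assumed for the example).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]

def batch_gradient(w, b):
    # Average gradient of the cost 0.5 * (w*x + b - y)^2 over ALL m samples.
    m = len(xs)
    grad_w = sum((w * x + b - y) * x for x, y in zip(xs, ys)) / m
    grad_b = sum((w * x + b - y) for x, y in zip(xs, ys)) / m
    return grad_w, grad_b

def batch_gradient_descent(eta=0.05, num_iters=2000):
    w, b = 0.0, 0.0                      # step 1: initialize parameters
    for _ in range(num_iters):           # step 3: repeat
        grad_w, grad_b = batch_gradient(w, b)   # step 3.1
        w -= eta * grad_w                # step 3.2
        b -= eta * grad_b
    return w, b                          # step 4

w, b = batch_gradient_descent()
print(w, b)
```

Every update touches the whole dataset, which makes each step deterministic but costly when m is large.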
Advantages and Disadvantages of Batch Gradient Descent Algorithm
Stochastic Gradient Descent (SGD)
• SGD is based on randomness.
• The stochastic gradient descent algorithm takes only one random
sample from the dataset for each gradient computation and update.
• This random selection introduces randomness (noise) into the optimization
process.

• The procedure for stochastic gradient descent is given as follows:

1. Initialize the model parameters with random values or with predefined
values.
2. Initialize the learning rate η and the number of iterations. The randomization is
carried out by shuffling the dataset. Randomization is crucial for the success of
the optimization process.
3. Repeat until convergence or until the number of iterations is reached:
3.1 Shuffle the dataset.
3.2 Take a training sample and compute the gradient of the cost for
the current point with respect to the model parameters.
3.3 Update the model parameters by taking a step in the direction of the negative
gradient, scaled by η, as
$\theta_{t+1} = \theta_t - \eta \, \nabla J(\theta_t; x_i, y_i)$
Here (xi, yi) is the current data point and ∇J(θt; xi, yi) is the gradient at
the current point. The symbol θt indicates the model parameters at
iteration t.
4. Return the optimal parameters and end.
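A minimal sketch of this procedure on a toy linear model is given below. The data, the squared-error cost, the seed, and the hyperparameter values are assumptions for the example; each parameter update uses a single sample.

```python
import random

# Toy dataset generated from y = 2x + 1 (assumed for the example).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]

def sgd(eta=0.02, num_epochs=2000, seed=0):
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    indices = list(range(len(xs)))
    for _ in range(num_epochs):
        rng.shuffle(indices)              # step 3.1: shuffle the dataset
        for i in indices:
            # Step 3.2: gradient of 0.5 * (w*x + b - y)^2 at ONE sample.
            error = w * xs[i] + b - ys[i]
            # Step 3.3: move against that gradient, scaled by eta.
            w -= eta * error * xs[i]
            b -= eta * error
    return w, b

w, b = sgd()
print(w, b)
```

Because this toy data is noise-free, the single-sample updates settle close to w = 2, b = 1; on real data the iterates would keep fluctuating around the minimum.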
• Advantages and Disadvantages of Stochastic Gradient Descent Algorithm

Mini Batch
• Mini-Batch Gradient Descent is another optimization algorithm.
• This algorithm is a trade-off between SGD and the batch gradient descent algorithm.
• This algorithm achieves the compromise by computing each update from a small set of k training
samples called a mini-batch.
• The procedure for mini-batch gradient descent is given as follows:
1. Initialize the model parameters with random values or with predefined
values. Choose the learning rate η, which determines the step size, and
k, the number of examples in each mini-batch.
2. Introduce the randomization by shuffling the data and creating
mini-batches with k data samples each.
3. Repeat for a fixed number of steps or until convergence.
3.1 Compute the gradient over the current mini-batch as follows:

$\nabla J(\theta_t) = \frac{1}{k} \sum_{i=1}^{k} \nabla J(\theta_t; x_i, y_i)$

3.2 Update the model parameters by taking a step in the direction of the negative
gradient, scaled by η, as

$\theta_{t+1} = \theta_t - \eta \, \nabla J(\theta_t)$
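The mini-batch procedure can be sketched as follows. The toy data, the squared-error cost, the seed, and the values of η and k are assumptions for illustration.

```python
import random

# Toy dataset generated from y = 2x + 1 (assumed for the example).
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0 * x + 1.0 for x in xs]

def mini_batch_gd(eta=0.02, k=2, num_epochs=2000, seed=0):
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    indices = list(range(len(xs)))
    for _ in range(num_epochs):
        rng.shuffle(indices)                       # step 2: shuffle
        for start in range(0, len(indices), k):
            batch = indices[start:start + k]       # k samples per mini-batch
            # Step 3.1: average gradient over the mini-batch only.
            grad_w = sum((w * xs[i] + b - ys[i]) * xs[i] for i in batch) / len(batch)
            grad_b = sum((w * xs[i] + b - ys[i]) for i in batch) / len(batch)
            # Step 3.2: move against the gradient, scaled by eta.
            w -= eta * grad_w
            b -= eta * grad_b
    return w, b

w, b = mini_batch_gd()
print(w, b)
```

Setting k = 1 reduces this sketch to SGD, and setting k to the dataset size reduces it to batch gradient descent.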
• Advantages and Disadvantages of Mini-Batch Gradient Descent Algorithm
Concept of Momentum
• The major difficulties in traditional gradient descent algorithms are:
• Slow convergence.
• Oscillations due to the step size.
• The navigation of complex and poorly conditioned landscapes.
Momentum can be used to address these problems:
(i) Accelerated convergence – Momentum addresses slow convergence by
speeding up the algorithm in directions where the gradient consistently points the same way.
(ii) Dampening oscillations – Momentum helps to reduce oscillations or fluctuations.
The oscillations typically occur in steep, narrow regions of the cost function, where
successive gradients change sign and partly cancel in the accumulated velocity.
(iii) Saddle points – Momentum helps the algorithm to escape from saddle points.
(iv) Efficient navigation of the plateaus – The plateaus are the regions where the cost
function has very small gradients. The accumulated velocity carries the algorithm
across these flat regions.
(v) Improved generalization – Momentum may help improve generalization
performance and reduce overfitting.
There are two types of momentum: traditional momentum and Nesterov momentum.

Traditional Momentum
• In the traditional momentum, the algorithm takes the current
gradient and also the accumulated previous updates.
• The update is made as below
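One commonly used form of the traditional momentum update is written out below. It is stated as an assumption, since conventions differ between texts (some scale the gradient by 1 − β instead of η). Here v is the velocity, β the momentum coefficient, and η the learning rate.

```latex
\begin{aligned}
v_t &= \beta\, v_{t-1} + \eta\, \nabla J(\theta_t) \\
\theta_{t+1} &= \theta_t - v_t
\end{aligned}
```

With β = 0 the velocity equals the scaled current gradient and the rule reduces to plain gradient descent.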
Nesterov Momentum
• Nesterov momentum “looks ahead” at the future position of the parameters
before computing the gradients.
• It estimates the direction of the next update.
• The Nesterov momentum procedure is given below:
1. Find the look-ahead position.
2. Compute the gradient at the look-ahead position.
3. Update the velocity using the gradient at the look-ahead position.
4. Update the parameters using the velocity.
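The four steps can be sketched as below, using one common formulation of Nesterov momentum (look-ahead point θ − βv). The exact equations are an assumption here, and the test function f(x, y) = 0.5 (x² + 25 y²) and the hyperparameter values are chosen only for illustration.

```python
def grad(theta):
    # Gradient of the assumed test function f(x, y) = 0.5 * (x**2 + 25 * y**2).
    x, y = theta
    return [x, 25.0 * y]

def nesterov_gd(theta, eta=0.03, beta=0.9, num_iters=500):
    v = [0.0, 0.0]
    for _ in range(num_iters):
        # 1. Look-ahead position: where the current velocity alone would lead.
        lookahead = [ti - beta * vi for ti, vi in zip(theta, v)]
        # 2. Gradient at the look-ahead position.
        g = grad(lookahead)
        # 3. Velocity update using that gradient.
        v = [beta * vi + eta * gi for vi, gi in zip(v, g)]
        # 4. Parameter update using the velocity.
        theta = [ti - vi for ti, vi in zip(theta, v)]
    return theta

theta = nesterov_gd([5.0, 2.0])
print(theta)
```

The only difference from traditional momentum in this sketch is step 2: the gradient is evaluated at the look-ahead point rather than at the current parameters.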

Comparisons of Traditional and Nesterov Momentum

Gradient Descent with Momentum

• It is another optimization algorithm that extends the traditional gradient descent
algorithm with momentum
• The convergence is faster if all the gradients point in the same direction so that
the algorithm rolls faster
• The procedure for the gradient with momentum is given below:
1. Initialize the model parameters θ with random or predefined values. Let η
be the learning rate and β the momentum coefficient. The value of β ranges from 0 to 1
and controls how much of the accumulated gradients is retained. Its typical value is 0.9.
2. Repeat until convergence:
2.1 The initial steps are the same as in SGD.
2.2 Accumulate the gradients into the velocity v.

2.3 Apply the update rule using v.

Here v is the velocity, a decaying accumulation of all past gradients.
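A minimal sketch of gradient descent with momentum is shown below. It assumes the update v ← βv + η∇J followed by θ ← θ − v, which is one common convention (another scales the gradient by 1 − β). The poorly conditioned test function and the hyperparameter values are illustrative assumptions, and the full gradient is used instead of a sampled one to keep the example deterministic.

```python
def grad(theta):
    # Gradient of the assumed test function f(x, y) = 0.5 * (x**2 + 25 * y**2).
    x, y = theta
    return [x, 25.0 * y]

def momentum_gd(theta, eta=0.03, beta=0.9, num_iters=500):
    v = [0.0, 0.0]  # velocity starts at zero
    for _ in range(num_iters):
        g = grad(theta)
        # Step 2.2: accumulate the gradient into the velocity.
        v = [beta * vi + eta * gi for vi, gi in zip(v, g)]
        # Step 2.3: move the parameters along the velocity.
        theta = [ti - vi for ti, vi in zip(theta, v)]
    return theta

theta = momentum_gd([5.0, 2.0])
print(theta)
```

The function is 25 times steeper along y than along x, the kind of landscape where momentum is expected to help.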

AdaGrad (Adaptive Gradient Descent)
• AdaGrad is another optimization algorithm, proposed by Duchi, Hazan, and Singer in 2011.
• The procedure for AdaGrad is given below:
1. Initialize the model parameters θ to zero or to some random
values. Let r be a vector that keeps track of the sum of the squared gradients;
it is initialized to 0.
2. Initialize the learning rate η and the parameter ε to 10^−7, whose purpose
is to avoid division by zero.
3. Repeat until convergence:
3.1 The initial steps are similar to the stochastic gradient descent algorithm: compute the gradient g.
3.2 Accumulate the square of the gradients in r as
$r \leftarrow r + g \odot g$
3.3 Compute the update as
$\Delta\theta = -\frac{\eta}{\varepsilon + \sqrt{r}} \odot g$
3.4 Apply the update as
$\theta \leftarrow \theta + \Delta\theta$
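The AdaGrad steps can be sketched as follows, assuming the standard element-wise form of the algorithm. The test function, starting point, and η = 0.5 are assumptions for the example, and the full gradient is used to keep it deterministic.

```python
import math

def grad(theta):
    # Gradient of the assumed test function f(x, y) = 0.5 * (x**2 + 25 * y**2).
    x, y = theta
    return [x, 25.0 * y]

def adagrad(theta, eta=0.5, eps=1e-7, num_iters=2000):
    r = [0.0] * len(theta)  # running sum of squared gradients, one per parameter
    for _ in range(num_iters):
        g = grad(theta)
        # Step 3.2: accumulate the squared gradients.
        r = [ri + gi * gi for ri, gi in zip(r, g)]
        # Step 3.3: per-parameter update; eps avoids division by zero.
        delta = [-eta * gi / (eps + math.sqrt(ri)) for gi, ri in zip(g, r)]
        # Step 3.4: apply the update.
        theta = [ti + di for ti, di in zip(theta, delta)]
    return theta

theta = adagrad([5.0, 2.0])
print(theta)
```

Since r never decreases, the effective step size η / (ε + √r) can only shrink over time.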

• Advantages and Disadvantages of AdaGrad Algorithm

• The specialties of AdaGrad are:

(i) Personalized learning rate for every parameter.
(ii) Adaptation is based on the gradient history, so the effective learning rate decreases over
time.
(iii) Effective for sparse and high-dimensional data. Parameters with infrequent
updates will have larger updates, and parameters with frequent updates will
have smaller updates.
RMSProp
• RMSProp (root mean square propagation) is an adaptive optimization
algorithm introduced by Geoffrey Hinton in a lecture in
2012.
• The key idea of RMSProp is to accumulate the squared gradients as an
exponentially weighted moving average, so that a decay factor controls
how much of the previous accumulation is retained.
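A minimal sketch of this idea is given below, assuming the commonly used RMSProp update with decay factor ρ. The symbol names, the test function, and the hyperparameter values are assumptions for the example.

```python
import math

def grad(theta):
    # Gradient of the assumed test function f(x, y) = 0.5 * (x**2 + 25 * y**2).
    x, y = theta
    return [x, 25.0 * y]

def rmsprop(theta, eta=0.01, rho=0.9, eps=1e-8, num_iters=2000):
    r = [0.0] * len(theta)  # moving average of squared gradients
    for _ in range(num_iters):
        g = grad(theta)
        # Keep a fraction rho of the old average, add (1 - rho) of the new square.
        r = [rho * ri + (1.0 - rho) * gi * gi for ri, gi in zip(r, g)]
        # Scale each parameter's step by the root of its moving average.
        theta = [ti - eta * gi / (math.sqrt(ri) + eps)
                 for ti, gi, ri in zip(theta, g, r)]
    return theta

theta = rmsprop([5.0, 2.0])
print(theta)
```

Unlike the AdaGrad accumulator, r can decrease again when recent gradients are small, so the effective learning rate does not have to shrink monotonically.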
• Differences between RMSProp and AdaGrad
ADAM
• ADAM stands for ADAptive Moment estimation.
ADAM is a combination of the gradient descent algorithm with
momentum and RMSProp.
• The specialties of ADAM are listed below:
(i) Combining the advantages of AdaGrad
and RMSProp gives more robustness and efficiency.
(ii) Adaptive learning rate for each parameter, obtained by combining
exponentially weighted moving averages (EWMA) of the gradient (first moment) and of
the squared gradient (second moment). This yields faster convergence and
the ability to handle nonstationary objectives.
(iii) Bias correction to remove the bias of the moment estimates toward zero in early iterations.
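The points above can be sketched as follows, assuming the commonly cited ADAM update with default-style values β1 = 0.9 and β2 = 0.999. The test function, starting point, learning rate, and iteration count are assumptions for the example.

```python
import math

def grad(theta):
    # Gradient of the assumed test function f(x, y) = 0.5 * (x**2 + 25 * y**2).
    x, y = theta
    return [x, 25.0 * y]

def adam(theta, eta=0.01, beta1=0.9, beta2=0.999, eps=1e-8, num_iters=3000):
    m = [0.0] * len(theta)  # EWMA of gradients (first moment)
    v = [0.0] * len(theta)  # EWMA of squared gradients (second moment)
    for t in range(1, num_iters + 1):
        g = grad(theta)
        m = [beta1 * mi + (1.0 - beta1) * gi for mi, gi in zip(m, g)]
        v = [beta2 * vi + (1.0 - beta2) * gi * gi for vi, gi in zip(v, g)]
        # Bias correction: both averages start at zero and are biased early on.
        m_hat = [mi / (1.0 - beta1 ** t) for mi in m]
        v_hat = [vi / (1.0 - beta2 ** t) for vi in v]
        theta = [ti - eta * mh / (math.sqrt(vh) + eps)
                 for ti, mh, vh in zip(theta, m_hat, v_hat)]
    return theta

theta = adam([5.0, 2.0])
print(theta)
```

The first-moment average plays the role of momentum, and the second-moment average plays the role of the RMSProp scaling.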
Second-Order Optimization Algorithms
• Second-order optimization algorithms are algorithms that use the first-order
information (the gradient) as well as the second-order information (the Hessian).
• The second-order optimization algorithms are of two types: Newton’s method and quasi-Newton methods.
Newton’s Method
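• Newton’s method is often not used because of the computational
overhead involved in the calculation of the Hessian matrix and its
inverse.
• The update formula for Newton’s method is given below:

$\theta_{t+1} = \theta_t - H_t^{-1} \nabla J(\theta_t)$

Here, Ht is the Hessian matrix of J evaluated at θt.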
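As a small worked example of a Newton update, the sketch below applies one step to an assumed quadratic cost J(x, y) = x² + xy + 2y² − 4x − 5y. The cost function and the helper names are hypothetical. The linear system H d = ∇J is solved directly instead of forming the inverse explicitly.

```python
def gradient(theta):
    # Gradient of the assumed cost J(x, y) = x^2 + x*y + 2*y^2 - 4*x - 5*y.
    x, y = theta
    return [2 * x + y - 4, x + 4 * y - 5]

def hessian(theta):
    # The Hessian of a quadratic cost is constant.
    return [[2.0, 1.0], [1.0, 4.0]]

def solve_2x2(A, b):
    # Solve A d = b for a 2x2 matrix using Cramer's rule.
    det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
    return [(b[0] * A[1][1] - A[0][1] * b[1]) / det,
            (A[0][0] * b[1] - A[1][0] * b[0]) / det]

def newton_step(theta):
    # theta_new = theta - H^{-1} * gradient, computed via a linear solve.
    d = solve_2x2(hessian(theta), gradient(theta))
    return [t - di for t, di in zip(theta, d)]

theta = newton_step([0.0, 0.0])
print(theta)  # the minimizer of this quadratic is (11/7, 6/7)
```

For a quadratic cost a single Newton step lands on the minimizer; for general costs the step is repeated, and the cost of building and solving with the Hessian at every iteration is the overhead mentioned above.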

• Quasi-Newton Method
• The Broyden-Fletcher-Goldfarb-Shanno (BFGS) algorithm is a quasi-Newton method. It is
an iterative optimization algorithm that finds the minimum of an unconstrained
problem using an approximation of the Hessian instead of the exact Hessian.
1. Initialize model parameters θ with random values or using a predefined
procedure.
2. Initialize the Hessian matrix

3. Compute the gradient of the cost function at iteration t

4. Update the parameters

Here, Ht is the approximation of the Hessian matrix and αt is the step size.

The Hessian approximation is updated as
5. The final update is given as

6. Repeat the steps until convergence. The convergence criterion can be
a predefined number of iterations or a small change in the objective function.
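The BFGS steps can be sketched as follows. This is one standard variant, stated as an assumption: it maintains an approximation H of the inverse Hessian (initialized to the identity) and picks the step size αt by a simple backtracking line search. The quadratic cost and all helper names are hypothetical.

```python
def cost(theta):
    x, y = theta
    return x * x + x * y + 2 * y * y - 4 * x - 5 * y

def grad(theta):
    x, y = theta
    return [2 * x + y - 4, x + 4 * y - 5]

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def mat_vec(M, v):
    return [dot(row, v) for row in M]

def bfgs(theta, num_iters=50, tol=1e-8):
    n = len(theta)
    # Initialize the inverse-Hessian approximation to the identity.
    H = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    g = grad(theta)
    for _ in range(num_iters):
        if dot(g, g) ** 0.5 < tol:
            break
        p = [-d for d in mat_vec(H, g)]  # search direction
        # Backtracking line search: halve alpha until the cost decreases enough.
        alpha = 1.0
        while cost([t + alpha * d for t, d in zip(theta, p)]) > cost(theta) + 1e-4 * alpha * dot(g, p):
            alpha *= 0.5
        s = [alpha * d for d in p]
        theta = [t + si for t, si in zip(theta, s)]  # parameter update
        g_new = grad(theta)
        y = [gn - go for gn, go in zip(g_new, g)]
        ys = dot(y, s)
        if ys > 1e-12:  # skip the update if the curvature condition fails
            rho = 1.0 / ys
            Hy = mat_vec(H, y)
            yHy = dot(y, Hy)
            # H <- (I - rho*s*y^T) H (I - rho*y*s^T) + rho*s*s^T, expanded.
            H = [[H[i][j] - rho * (s[i] * Hy[j] + Hy[i] * s[j])
                  + (rho * rho * yHy + rho) * s[i] * s[j]
                  for j in range(n)] for i in range(n)]
        g = g_new
    return theta

theta = bfgs([0.0, 0.0])
print(theta)  # the minimizer of this quadratic is (11/7, 6/7)
```

Only gradients are evaluated; the curvature information is built up from the differences s and y between successive iterates, which avoids computing and inverting the exact Hessian.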
Advantages and Disadvantages of BFGS Algorithm
