0% found this document useful (0 votes)
6 views33 pages

Understanding Optimization Algorithms

Copyright
© All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PPTX, PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
6 views33 pages

Understanding Optimization Algorithms

Copyright
© All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PPTX, PDF, TXT or read online on Scribd

Introduction to

Optimization
CHAPTER 4
Learning objectives
• Understand the need for optimization algorithms.
• Know the types of optimization functions.
• Understand the first-order gradient descent algorithms.
• Know the types of gradient descent algorithms.
• Know about momentum.
• Know the types of momentum-based algorithms.
• Understand the second-order optimization algorithms.
Overview of Optimization
• The training of neural networks is all about assigning optimal
parameters to the model parameters based on the training dataset.
• The stages of Training are :
(i) Computation of the loss for every data element of the training
data.
(ii) Compute the gradients of the loss function with respect to
every model parameter.
(iii) The algorithms perform backpropagation to update the
model parameters.
Parameter of the neural Network:
Hyperparameter : Hyperparameters are user-controlled parameters that can be
tuned for the optimal performance of the neural networks.
Parameters:These are the model-controlled parameters that must be tuned by
the learning algorithms for optimal performance by learning from data.
Optimization Problem:
An optimization problem has an objective function and a set of constraints.
An objective function is either minimization or maximization of a loss function of
an optimization problem.
There are many solutions called feasible solutions.
An optimal solution satisfies the objective function without violating the
constraints.
Optimization Algorithm
• The optimizer is an algorithm or method that is used to minimize the loss
function to maximize the efficiency of neural networks
• Popular Optimizer are:
Stochastic Gradient Descent
ADAM
AdaGrad
• Role of Optimizer
Minimizing Loss Functions
Efficiency
Handling of Non-convex Functions
Adaption to Data
Convergence
Search Space
• The optimization problems can be framed as search problems.
• The search space is the universe of candidate solutions for the objective function.
• The optimization algorithms effectively search the search space and find the best
solution
• The maximum or minimum of the function is called the optima of the problem
• The function that needs to be optimized may be a convex function or a non-
convex function
Surface plot of a convex function
Non-Convex Function
Some of the issues that are associated with non-convex function are listed below:
Local Minima
Saddle Point
Plateau
Types of Optimization Problem
Hyperparameter Tuning
• Hyperparameters are parameters of the learning algorithm itself and
are not learned from the data
• Some of the naïve algorithms for hyperparameter tuning are Random
Search and Grid Search.
Steps of Random Search Algorithm
1. Sample inputs randomly from the distribution.
2. Check for all combinations of input, evaluate the model, and check
the results.
3. Pick the best combination of the inputs that generalizes the model
well.
The advantage of this is that assumption is made about the optimization
function, and less memory is required for this procedure.
The disadvantage is that the computational complexity is very high as it involves
checking all combinations of the input.
Grid Search
It consists of the following steps:
1. Takes the hyperparameter with a discrete set of values.
2. Find all the combinations of the inputs.
3. A metric like accuracy is chosen for the evaluation of the model.
4. All the possible combinations are tested for accuracy (if that criterion
is chosen for evaluation).
5. The best combinations are chosen for the hypermodel.
Gradient Descent Algorithms
• It is used to find the minimum value of model parameters and minimize the
loss function during the training process
• The meaning of gradient is “slope” or “slant” of the surface
• The objective of finding the minimum is achieved as follows:
(i) For example, consider a univariate function f(x). The derivative of the
loss function is computed
(ii) Gradient descent is an iterative algorithm that starts with an initial
point, say w0. Then, the slope of the objective function is obtained with respect
to each feature. If the function has more variables x0 x1, … then the partial
derivative of the loss function with respect to all these variables is computed
and stored as a vector called the gradient.
(iii) The gradient of the multivariate function indicates two factors:
1. Magnitude that indicates how much change.
Batch Gradient Descent
• Batch Gradient Descent is a gradient descent algorithm
The procedure for batch gradient descent is given as follows:
1. Initialize the model parameters θ with random values or with predefined
values.
2. Initialize the learning rate η. The learning rate indicates the step size or learning
rate. Initialize the number of iterations.
3. Repeat until iterations converge or the number of iterations is reached
3.1 For the entire dataset, Compute the cost function as follows:

Here, m is the number of samples in the training dataset. J is the cost


function and θ is the parameter of the neural network. Here ,xi yi is the
current data point, and i ∇J (θ; xi yi ) is the gradient of the current point.
3.2 Update the model parameters by taking the direction of negative gradient
descent scaled by η as

The symbol θt indicates the model parameters. t is the iteration.


4. Return optimal parameters and End
Advantages and Disadvantages of Batch Gradient Descent Algorithm
Stochastic Gradient Descent (SGD)
• SGD is based on randomness.
• The stochastic gradient descent algorithm takes only one random
sample from the dataset for gradient computation and update.
• This random selection only introduces randomness in the optimization
process.

• The procedure for stochastic gradient descent is given as follows:


1. Initialize the model parameters with random values or with predefined
values.
2. Initialize the learning rate η and the number of iterations The randomization is
carried out by shuffling the dataset. Randomization is crucial for the success of
the optimization process.
3. Repeat until iterations converge or the number of iterations is reached
3.1 Shuffle the dataset.
3.2 Take a training sample and compute the gradient of the cost for
the current point with respect to the model parameters.
3.3 Update the model parameters by taking the direction of negative
gradient descent scaled by η as
Θt+1 = θt - η x ∇ J (θt ; xi , yi )
Here xi , yi is the current data point and ∇ J (θt ; xi , yi ) the gradient of
the current point. The symbol θt indicates the model parameters. t is
the iteration.
4. Return optimal parameters and End.
• Advantages and Disadvantages of Stochastic Gradient Descent Algorithm

Mini Batch
• Mini-Batch Gradient Descent is another optimization algorithm
• This algorithm is the trade-off between SGD and the Batch gradient algorithm
• This algorithm achieves the compromise or balance by choosing a small set of training
samples called mini-batch (k)
• The procedure for mini-batch gradient descent is given as follows:
1. Initialize the model parameters with random values or with predefined
values. The chosen parameter is the learning rate a, which indicates the step size or
learning rate. k is the number of examples in each mini-batch.
2. Introduce the randomization by shuffling the data and creating a mini-
batch with k data samples.
3. Repeat for a fixed number of steps or until convergence.
3.1 Compute the gradient as follows:

3.2 Update the model parameters by taking the direction of negative


gradient descent scaled by η as
• Advantages and Disadvantages of Mini-Batch Gradient Descent Algorithm
Concept of Momentum
• The major difficulties in traditional gradient descent algorithms are slow convergence
• Oscillations due to step size
• The navigation of complex and poorly conditioned landscapes.
Momentum can be used to solve these problems
(i) Accelerated convergence – Momentum solves the problem of slow convergence by
speeding up the algorithm in areas where the gradient consistently points.
(ii) Dampening oscillations – Momentum helps to remove oscillations or fluctuations.
The oscillations occur in flat or shallow regions in the cost function.
(iii) Saddle Points: Momentum helps the algorithm to escape from saddle points.
(iv) Efficient navigation of the plateaus – The plateaus are the regions where the cost
function has very small gradients. This effective navigation is done by accumulating velocity
over flat regions.
(v) Improved generalization – Momentum helps in improved generalization
performance and helps in overfitting.
There are two types of momentum

Traditional Momentum
• In the traditional momentum, the algorithm takes the current
gradient and also the accumulated previous updates.
• The update is made as below
Nesterov Momentum
• Nesterov momentum “looks ahead” at the future position of the parameters
before computing the gradients.
• It estimates the direction of the next update.
• The Nesterov momentum procedure is given below:
1. Find the look ahead position.

2. Compute the Gradient at the look-ahead position as

3. Update the velocity at the look-ahead position as

4. Update the parameters using the velocity as


Comparisons of Traditional and Nesterov Momentum

Gradient Descent with Momentum


• It is another optimization algorithm that extends the traditional gradient descent
algorithm with momentum
• The convergence is faster if all the gradients point in the same direction so that
the algorithm rolls faster
• The procedure for the gradient with momentum is given below:
1. Initialize the model parameters θ with random or predefined values. Let η
be the learning rate and β the momentum coefficient. Its value ranges from 0 to 1
and controls the accumulated gradients. Its typical value is 0.9.
2. Repeat until converge
2.1 Initial steps are the same as SGD.
2.2 Accumulate the gradients into velocity as

2.3 Apply the update rule as

v, which is the average of all past gradients.


AdaGrad (Adaptive Gradient Descent)
• AdaGrad is another optimization algorithm designed by Duchi and Stinger in 2011
• The procedure for AdaGrad is given below:
1. Initialize The model parameters θ are initialized to zero or some random
values. Let r be a vector that keeps track of the squared sum of the gradients
that are initialized to 0.
2. Initialize the learning rate η and the parameter ε to 10−7 whose purpose
is to avoid division by zero error.
3. Repeat until convergence
3.1 The initial steps are similar to the Stochastic gradient algorithm.
3.2 Accumulate the square of the gradients in r as
3.3 Update the parameters

3.4 Apply the update


• Advantages and Disadvantages of AdaGrad Algorithm

• the specialties of AdaGrad are:


(i) Personalized learning rate for every parameter.
(ii) Adaption is based on gradient history. So, the learning rate decreases over
time.
(iii) Effective for sparse and higher dimensional data. Parameters with infrequent
intervals will have larger updates and parameters with frequent updates will
have smaller updates.
RMSProp
• RMSProp (root mean square propagation) is an adaptive optimization
algorithm introduced by the legendary Geoffrey Hinton in a lecture in
2012.
• The key idea of RMSProp is to accumulate the squared gradients and
also pick a fraction of the previous updates using an exponential
weighted average decay factor
• Differences between RMSProp and AdaGrad
ADAM
• ADAM stands for ADAptive Moment estimation algorithm.
ADAM is a combination of gradient descent algorithm with
momentum and RMS prop.
• The specialties of ADAM are listed below:
(i) Advantages of combining the advantages of AdaGrad
and RMSProp gives more robustness and efficiency.
(ii) Adaptive learning rate for each parameter by combining
EWMA of mean and variance. This yields faster convergence and
the ability to solve the nonstationary targets.
(iii) Bias correction to remove the bias
Second-Order Optimization Algorithms
• Second-order optimization algorithms are algorithms that take the first-
order information as well as the second-order information
• The second-order optimization algorithms are of two types
Newton’s Method
• Newton’s method is often not used because of the computational
overhead involved in the calculation of the Hessian matrix and its
inverse
• The updated formula for the Newtons method is given below:

• Quasi-Newton Method
• Quasi-newton methods is Broyden-Fletcher-Goldfarb-Shanno (BFGS) is
an iterative optimization algorithm that finds minimum without any
constraints
1. Initialize model parameters θ with random values or using a predefined
procedure.
2. Initialize the Hessian matrix

3. Compute the gradient of the cost function at iteration t

4. Update the parameters

Here, Ht is the approximate Hessian function and αt is the step size.


The Hessian matrix is updated as
5. The final update is given as

6. Repeat the steps till convergence is met. The convergence can be


predefined iterations or a small change in the objective function.
Advantages and Disadvantages of BFGS Algorithm

You might also like