Deep Learning Unit-3

The document discusses various optimization methods for training neural networks, including Adagrad, AdaDelta, RMSProp, and Adam, along with regularization techniques like dropout and batch normalization. It highlights the importance of these methods in improving model performance and addressing challenges such as overfitting and saddle points in high-dimensional loss surfaces. Additionally, it covers second-order methods like Newton's method and BFGS for faster convergence in optimization tasks.

Uploaded by

pallijanamisai434

UNIT-III:

Better Training of Neural Networks: Newer optimization methods for neural networks
(Adagrad, AdaDelta, RMSProp, Adam, NAG), second-order methods for training, the saddle point
problem in neural networks, regularization methods (dropout, DropConnect, batch
normalization).

Neural network optimization techniques aim to improve a network's performance by adjusting
its parameters during training to minimize a loss function and enhance accuracy.

Key techniques include gradient descent and its variants, regularization methods, and
hyperparameter tuning.

Optimization Techniques (overview diagram)

➢ Gradient-Based Optimization Techniques: Gradient Descent, Stochastic Gradient Descent,
SGD with Momentum, Nesterov Accelerated Gradient, Adaptive Gradient Methods

➢ Regularization Techniques: L1 & L2 Regularization, Dropout, Batch Normalization,
Data Augmentation

➢ Hyperparameter Tuning: Grid Search, Random Search, Bayesian Optimization

Adagrad Optimization
Adagrad is short for Adaptive Gradient Algorithm. It is an adaptive learning rate
optimization algorithm used for training deep learning models. It is particularly effective for
sparse data or scenarios where features vary widely in magnitude. Adagrad adjusts
the learning rate for each parameter individually. Unlike standard gradient descent, where a
fixed learning rate is applied to all parameters, Adagrad adapts the learning rate based on the
historical gradients of each parameter. Parameters with small or infrequent gradients keep a
relatively large learning rate, while parameters with large or frequent gradients have theirs
reduced, so rarely seen features are still learned efficiently.
Working of Adagrad Optimization
The primary concept behind Adagrad is the idea of adapting the learning rate based on the
historical sum of squared gradients for each parameter. Here's a step-by-step explanation of
how Adagrad works:
1. Initialization
Adagrad begins by initializing the parameter values randomly, just like other
optimization algorithms. Additionally, it initializes a running sum of squared gradients
for each parameter, which will track the gradients over time.
2. Gradient Calculation
For each training step, the gradient of the loss function with respect to the model's parameters is
calculated, just like in standard gradient descent.
3. Adaptive Learning Rate
The key difference comes next. Instead of using a fixed learning rate, Adagrad adjusts the
learning rate for each parameter based on the accumulated sum of squared gradients.
The updated learning rate for each parameter is calculated as follows:

$$\eta_t = \frac{\eta}{\sqrt{G_t + \epsilon}}$$

Where:
η is the global learning rate (a small constant value)
G_t is the sum of squared gradients for a given parameter up to time step t
ϵ is a small value added to avoid division by zero
Here, the denominator √(G_t + ϵ) grows as the squared gradients accumulate, causing the
learning rate to decrease over time, which helps to stabilize the training.
4. Parameter Update
The model's parameters are updated by subtracting the product of the adaptive learning rate and
the gradient at each step:

$$\theta_{t+1} = \theta_t - \eta_t \nabla_\theta J(\theta_t)$$

Where:
θ_t is the current parameter
∇θJ(θ_t) is the gradient of the loss function with respect to the parameter
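The four Adagrad steps can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: the function name `adagrad`, the toy gradient `toy_grad`, and the values `lr=0.5` and `eps=1e-8` are assumptions chosen for the example.

```python
import numpy as np

def adagrad(grad_fn, theta0, lr=0.5, eps=1e-8, steps=200):
    """Minimal Adagrad loop (illustrative sketch)."""
    theta = np.asarray(theta0, dtype=float).copy()
    G = np.zeros_like(theta)  # running sum of squared gradients, one entry per parameter
    for _ in range(steps):
        g = grad_fn(theta)                   # step 2: gradient of the loss
        G += g ** 2                          # accumulate squared gradients
        theta -= lr / np.sqrt(G + eps) * g   # steps 3 and 4: per-parameter rate, then update
    return theta

# Hypothetical toy loss J = 0.5 * (10 * x^2 + 0.1 * y^2): the two parameters
# see gradients of very different magnitude.
def toy_grad(theta):
    return np.array([10.0, 0.1]) * theta

theta_final = adagrad(toy_grad, [1.0, 1.0])
```

Because each parameter is divided by its own gradient history, both coordinates of this badly scaled toy problem shrink at nearly the same rate, which plain gradient descent with a single learning rate would not do.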
Uses

➢ Problems with sparse data and features (e.g., natural language processing,
recommender systems).

➢ Tasks where features have different levels of importance and frequency.

➢ Training models that do not require very fast convergence but benefit from a
more stable optimization process.
If the learning rate should not keep shrinking towards zero over a long training run (as in many
deep learning tasks), variants such as RMSProp or Adam may be more appropriate.

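AdaDelta
AdaDelta is a modification of Adagrad that limits the accumulation of past gradients. Instead of
summing all squared gradients, it keeps an exponentially decaying average of them, and it replaces
the global learning rate with a running average of past squared parameter updates, which gives a
more stable and bounded update rule. The key update for AdaDelta is:

$$\Delta\theta_t = -\frac{\sqrt{E[\Delta\theta^2]_{t-1} + \epsilon}}{\sqrt{E[g^2]_t + \epsilon}}\, g_t, \qquad \theta_{t+1} = \theta_t + \Delta\theta_t$$

Where:
g_t is the gradient at time step t
E[g^2]_t is the running average of past squared gradients
E[Δθ^2]_t is the running average of past squared parameter updates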
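The AdaDelta rule can be sketched as follows. This is a hedged illustration only: the function name `adadelta`, the decay `rho=0.95`, `eps=1e-6`, and the toy gradient `toy_grad` are assumptions made for the example, not values taken from these notes.

```python
import numpy as np

def adadelta(grad_fn, theta0, rho=0.95, eps=1e-6, steps=100):
    """Minimal AdaDelta loop (illustrative sketch); note there is no global learning rate."""
    theta = np.asarray(theta0, dtype=float).copy()
    Eg2 = np.zeros_like(theta)   # running average of squared gradients
    Edx2 = np.zeros_like(theta)  # running average of squared parameter updates
    for _ in range(steps):
        g = grad_fn(theta)
        Eg2 = rho * Eg2 + (1 - rho) * g ** 2
        dx = -np.sqrt(Edx2 + eps) / np.sqrt(Eg2 + eps) * g  # ratio of the two RMS terms
        Edx2 = rho * Edx2 + (1 - rho) * dx ** 2
        theta += dx
    return theta

# Hypothetical toy loss J = 0.5 * ||theta||^2, whose gradient is theta itself.
def toy_grad(theta):
    return theta

theta0 = np.array([1.0, -2.0])
theta_100 = adadelta(toy_grad, theta0, steps=100)
```

The first updates are tiny (on the order of the square root of `eps`) and grow as the running average of past updates builds up, so this sketch only shows the mechanism and steady progress towards the minimum, not fast convergence.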

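Adam (Adaptive Moment Estimation)

Adam combines the benefits of adaptive learning rate methods (Adagrad, RMSProp) and
momentum-based methods. It uses moving averages of both the gradients and the squared
gradients to adapt the learning rate. Adam is widely used because it is robust and performs
well across many machine learning tasks with little tuning.
Adam has the following update rules, where g_t = ∇θJ(θ_t), β1 and β2 are decay rates
(commonly 0.9 and 0.999), and m_0 = v_0 = 0:

1. First moment estimate (m_t):

$$m_t = \beta_1 m_{t-1} + (1 - \beta_1)\, g_t$$

2. Second moment estimate (v_t):

$$v_t = \beta_2 v_{t-1} + (1 - \beta_2)\, g_t^2$$

3. Corrected moment estimates:

$$\hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \qquad \hat{v}_t = \frac{v_t}{1 - \beta_2^t}$$

4. Parameter update:

$$\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon}\, \hat{m}_t$$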
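The four Adam steps map directly onto code. The sketch below is illustrative: the function name `adam`, the toy gradient `toy_grad`, and the hyperparameters (`lr=0.1`, `beta1=0.9`, `beta2=0.999`, `eps=1e-8`) are assumptions for the example.

```python
import numpy as np

def adam(grad_fn, theta0, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8, steps=500):
    """Minimal Adam loop (illustrative sketch)."""
    theta = np.asarray(theta0, dtype=float).copy()
    m = np.zeros_like(theta)  # first moment: moving average of gradients
    v = np.zeros_like(theta)  # second moment: moving average of squared gradients
    for t in range(1, steps + 1):
        g = grad_fn(theta)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g ** 2
        m_hat = m / (1 - beta1 ** t)  # bias correction for the zero initialization
        v_hat = v / (1 - beta2 ** t)
        theta -= lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta

# Hypothetical toy loss J = 0.5 * ||theta||^2, whose gradient is theta itself.
def toy_grad(theta):
    return theta

theta0 = np.array([1.0, -2.0])
theta_first = adam(toy_grad, theta0, steps=1)
theta_final = adam(toy_grad, theta0, steps=500)
```

With bias correction, the very first step has magnitude close to `lr` in every coordinate regardless of the gradient's scale, which is one reason Adam tends to be forgiving about the initial learning rate.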

RMSProp (Root Mean Square Propagation):

RMSProp addresses Adagrad's diminishing learning rate by using an exponentially
decaying average of the squared gradients instead of accumulating their sum. This prevents the
learning rate from decreasing too quickly, making the algorithm more effective in training deep
neural networks.
The update rule for RMSProp is as follows:

$$G_t = \gamma G_{t-1} + (1 - \gamma)\left(\nabla_\theta J(\theta_t)\right)^2$$

Where:
G_t is the exponentially decaying average of squared gradients
γ is the decay factor (typically set to 0.9)
∇θJ(θ_t) is the gradient
The parameter update rule is:

$$\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{G_t + \epsilon}}\, \nabla_\theta J(\theta_t)$$
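A minimal RMSProp loop looks almost identical to Adagrad, with the running sum replaced by a decaying average. The function name `rmsprop`, the toy gradient `toy_grad`, and the settings `lr=0.01` and `eps=1e-8` are assumptions for this sketch; `gamma=0.9` follows the typical value mentioned above.

```python
import numpy as np

def rmsprop(grad_fn, theta0, lr=0.01, gamma=0.9, eps=1e-8, steps=300):
    """Minimal RMSProp loop (illustrative sketch)."""
    theta = np.asarray(theta0, dtype=float).copy()
    G = np.zeros_like(theta)  # decaying average of squared gradients
    for _ in range(steps):
        g = grad_fn(theta)
        G = gamma * G + (1 - gamma) * g ** 2  # old gradients are forgotten, unlike Adagrad
        theta -= lr / np.sqrt(G + eps) * g
    return theta

# Hypothetical toy loss J = 0.5 * theta^2, whose gradient is theta itself.
def toy_grad(theta):
    return theta

theta_first = rmsprop(toy_grad, np.array([1.0]), steps=1)
theta_final = rmsprop(toy_grad, np.array([1.0]), steps=300)
```

Because G_t tracks only recent gradients, the effective step size stays roughly proportional to `lr` throughout training instead of decaying towards zero.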

Second order methods for training:

Second-order optimization methods use curvature information (second derivatives) in addition
to the gradient, which can give faster convergence to the optimal solution than first-order methods.
We discuss the following second-order methods for training: Newton's method, BFGS, and the
conjugate gradient method.
Newton's Method
Newton's method is an iterative optimization algorithm that uses both the gradient and the
Hessian matrix of an objective function to rapidly converge to a critical point (a minimum or
maximum) of that function. This approach can be visualized as using a spotlight that shines
brightest at the exit, guiding you directly towards the optimal solution.
It is based on a Taylor series expansion to approximate J(θ) near some point θ0, incorporating
second-order derivative terms and ignoring derivatives of higher order:

$$J(\theta) \approx J(\theta_0) + (\theta - \theta_0)^T \nabla J(\theta_0) + \frac{1}{2} (\theta - \theta_0)^T H(\theta_0) (\theta - \theta_0)$$

Solving for the critical point of this function, we obtain the Newton parameter update rule:

$$\theta^* = \theta_0 - H(\theta_0)^{-1} \nabla J(\theta_0)$$

Where:
J(θ) = Objective function to optimize.
θ = Parameter vector to update.
θ0 = The initial guess for the parameter vector.
∇J(θ0) = The gradient vector of the objective function evaluated at θ0. It points in the
direction of steepest ascent; its negative is the direction of steepest descent.
H(θ0) = The Hessian matrix (second-order partial derivatives) of the objective function
evaluated at θ0. It provides information about the curvature of the function.
(θ−θ0)^T = The transpose of the difference between the current parameter vector θ and the
initial guess θ0.
The last term of the expansion, involving the Hessian matrix, accounts for the curvature of the
function. It ensures that the update considers both the gradient and the curvature.

Efficient Minimization Using Newton's Method

Step 1: Define the Objective Function
We start by defining the objective function that we want to minimize. For this example, we'll
use a simple quadratic function f(x) = x^2 + 2x + 1.
Step 2: Compute the Gradient and Hessian
The gradient is the first derivative of the function, and the Hessian is the second derivative. For
our quadratic function:
The gradient is ∇f(x) = 2x + 2.
The Hessian is H(x) = 2 (a constant in this case).
Step 3: Implement Newton's Update Rule
Newton's update rule is x_new = x − H^{-1} ∇f(x).
Step 4: Iterate Until Convergence
We iterate using the update rule until the gradient is close to zero or we reach a maximum
number of iterations. Because f is quadratic, a single Newton step from any starting point lands
exactly on the minimum at x = −1.
Step 5: Visualize the Process
We plot the function and the path taken by Newton's Method to reach the minimum.
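Steps 1 to 4 can be sketched in plain Python for the quadratic f(x) = x^2 + 2x + 1. The function names (`f`, `grad`, `hess`, `newton`), the starting point 5.0, and the tolerance are assumptions for this illustration; plotting (Step 5) is left out to keep the example dependency-free.

```python
def f(x):
    return x ** 2 + 2 * x + 1

def grad(x):
    return 2 * x + 2      # first derivative of f

def hess(x):
    return 2.0            # second derivative of f (constant for a quadratic)

def newton(x0, tol=1e-10, max_iter=20):
    """Newton's method in one dimension; returns the final point and the path taken."""
    x = float(x0)
    path = [x]
    for _ in range(max_iter):
        g = grad(x)
        if abs(g) < tol:          # converged: gradient is numerically zero
            break
        x = x - g / hess(x)       # x_new = x - H^{-1} * grad f(x)
        path.append(x)
    return x, path

x_min, path = newton(5.0)
# path is [5.0, -1.0]: one Newton step reaches the minimum of a quadratic
```

The recorded `path` is what one would overlay on a plot of f to visualize the process.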

BFGS (Broyden–Fletcher–Goldfarb–Shanno): The Experienced Pathfinder

The BFGS (Broyden–Fletcher–Goldfarb–Shanno) algorithm is an optimization technique
for nonlinear optimization problems where computing the exact Hessian is too expensive.
As a quasi-Newton method, BFGS avoids computing the inverse Hessian (H^{-1}) directly
and instead approximates it with a matrix (M_t) that is iteratively refined using low-rank
(rank-2) updates.
Here's a detailed explanation of the BFGS method:
1. Newton's Method without the Computational Burden: BFGS provides a way to
perform optimizations similar to Newton's method but without the heavy
computations.
2. Primary Difficulty is Computation of H^{-1}: Directly computing the inverse Hessian
is expensive and complex.
3. BFGS Approximates H^{-1} by Matrix M_t: Instead of calculating H^{-1} directly, BFGS
uses an iterative process to refine an approximation matrix M_t.
4. Direction of Descent: Once M_t is updated, the direction of descent ρ_t is
determined by ρ_t = −M_t g_t, where g_t is the gradient. The minus sign makes ρ_t a downhill
direction whenever M_t is positive definite.
5. Line Search: A line search is conducted in the direction ρ_t to determine the optimal
step size ϵ*.
6. Parameter Update: The parameters are updated using the equation below.

$$\theta_{t+1} = \theta_t + \epsilon^* \rho_t$$

where:
θ_t: Parameters at iteration t.
θ_{t+1}: Updated parameters at iteration t+1.
ϵ*: Optimal step size determined through line search.
ρ_t: Direction of descent.
The BFGS method is effective in a variety of settings because each iteration refines its
curvature estimate using the most recent step and the resulting change in the gradient. Its main
cost is memory: the approximation matrix needs O(n^2) storage for n parameters, which is
impractical for most large neural networks.
How does BFGS work?
Memory of Past Steps: Remembers the path you have taken to improve future steps.
Curvature Update: Adjusts the map based on the terrain you have crossed.
Efficient Minimization Using BFGS Algorithm:

Step 1: Define the Objective Function
We will use a function of two variables for the BFGS algorithm:
f(x, y) = x^2 + y^2 + xy + 4.
Step 2: Compute the Gradient
The gradient is a vector of partial derivatives. For f(x, y):
∇f(x, y) = (2x + y, 2y + x).
Step 3: Implement the BFGS Algorithm
The BFGS algorithm updates the approximation M of the inverse Hessian matrix and uses it to
perform the update step.
The update rule is x_new = x − ϵ M ∇f(x), where ϵ is the step size chosen by the line search.
Step 4: Iterate Until Convergence
We iterate using the update rule and update the inverse Hessian approximation until the
gradient is close to zero or we reach a maximum number of iterations.
Step 5: Visualize the Process
We plot the function and the path taken by the BFGS algorithm to reach the minimum.
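The BFGS steps for f(x, y) = x^2 + y^2 + xy + 4 can be sketched with NumPy as below. Several choices here are assumptions made for illustration rather than part of these notes: the identity matrix as the initial inverse-Hessian approximation, a simple backtracking (Armijo) line search in place of an exact one, the starting point (3, −2), and all function names.

```python
import numpy as np

def f(p):
    x, y = p
    return x ** 2 + y ** 2 + x * y + 4.0

def grad(p):
    x, y = p
    return np.array([2 * x + y, 2 * y + x])

def bfgs(p0, tol=1e-6, max_iter=50):
    """Minimal BFGS with backtracking line search (illustrative sketch)."""
    p = np.asarray(p0, dtype=float)
    I = np.eye(len(p))
    M = I.copy()                    # initial inverse-Hessian approximation
    g = grad(p)
    for _ in range(max_iter):
        if np.linalg.norm(g) < tol:
            break
        rho = -M @ g                # descent direction
        step = 1.0
        # Backtracking line search: halve the step until f decreases enough.
        while f(p + step * rho) > f(p) + 1e-4 * step * (g @ rho):
            step *= 0.5
        s = step * rho              # actual parameter change
        p_new = p + s
        g_new = grad(p_new)
        y = g_new - g               # change in gradient
        sy = s @ y
        if sy > 1e-12:              # curvature condition keeps M positive definite
            M = (I - np.outer(s, y) / sy) @ M @ (I - np.outer(y, s) / sy) \
                + np.outer(s, s) / sy   # rank-2 BFGS update of the inverse Hessian
        p, g = p_new, g_new
    return p

p_min = bfgs([3.0, -2.0])
```

The minimum of this function is at (0, 0) with value 4, which the sketch should reach in a handful of iterations without ever forming or inverting the true Hessian.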
Conjugate Gradient Method:
The Conjugate Gradient (CG) method is an optimization algorithm primarily used for solving
large systems of linear equations where the coefficient matrix is symmetric and positive
definite, as well as for solving large-scale unconstrained optimization problems. This method
is especially valuable when dealing with large problems where storing the full Hessian matrix
is impractical due to memory constraints.
The method avoids direct computation of the inverse Hessian matrix (H^{-1}) by
iteratively descending along conjugate directions. Specifically, at iteration t, the next search
direction, denoted d_t, takes the form:

$$d_t = -\nabla_\theta J(\theta_t) + \beta_t d_{t-1}$$

Two directions d_t and d_{t-1} are considered conjugate if they satisfy:

$$d_t^T H d_{t-1} = 0$$

Where:
d_t: The search direction at iteration t.
∇θJ(θ_t): The gradient vector at the current point.
β_t: A coefficient that controls how much of the previous search direction d_{t-1} is carried
into the current one.
H: The Hessian matrix.

How Does the Conjugate Gradient Method Work?

Direction Finding: Each new search direction is built from the current gradient and the previous
direction, so that progress made along earlier directions is not undone.
Step Size Adjustment: Adjusts how big or small each step along the search direction should be.
Efficient Minimization Using Conjugate Gradient Method:
Step 1: Define the Objective Function
We will use a simple quadratic function f(x) = (1/2) x^T A x − b^T x, where A is a symmetric
positive definite matrix.
Step 2: Compute the Gradient
The gradient is ∇f(x) = Ax − b.
Step 3: Implement the Conjugate Gradient Method
The Conjugate Gradient Method iterates to find the minimum of the quadratic function using
conjugate directions.
Step 4: Iterate Until Convergence
We iterate using the update rule until the residual r = b − Ax is small, indicating convergence.
Step 5: Visualize the Process
We plot the function and the path taken by the Conjugate Gradient Method to reach the
minimum.
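For the quadratic f(x) = (1/2) x^T A x − b^T x, minimizing f is the same as solving Ax = b, so the steps above reduce to the classic linear conjugate gradient iteration. The sketch below is illustrative; the function name `conjugate_gradient` and the particular 2×2 matrix `A` and vector `b` are assumptions chosen for the example.

```python
import numpy as np

def conjugate_gradient(A, b, x0=None, tol=1e-10, max_iter=None):
    """Linear conjugate gradient for symmetric positive definite A (illustrative sketch)."""
    n = len(b)
    x = np.zeros(n) if x0 is None else np.asarray(x0, dtype=float).copy()
    r = b - A @ x                 # residual, equal to the negative gradient of f
    d = r.copy()                  # first direction is plain steepest descent
    rs_old = r @ r
    for _ in range(max_iter or n):
        if np.sqrt(rs_old) < tol:
            break
        Ad = A @ d
        alpha = rs_old / (d @ Ad)  # exact step size along d for a quadratic
        x += alpha * d
        r -= alpha * Ad
        rs_new = r @ r
        beta = rs_new / rs_old     # weight given to the previous direction
        d = r + beta * d           # new direction, conjugate to the previous one
        rs_old = rs_new
    return x

A = np.array([[4.0, 1.0], [1.0, 3.0]])  # symmetric positive definite
b = np.array([1.0, 2.0])
x_star = conjugate_gradient(A, b)
```

In exact arithmetic the method needs at most n iterations for an n-dimensional quadratic, which is why the default iteration cap in this sketch is n.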

Saddle point problem in neural networks:

A saddle point in neural networks is a point on the loss surface where the gradient is
zero, but the point is not a minimum. It behaves like a minimum in one direction and a maximum in
another direction.

Definition

A saddle point is a stationary point where:

∇L(θ) = 0
but which is neither a local minimum nor a local maximum.
Example:
f(x, y) = x^2 − y^2

At (0,0) the gradient is zero, but:

• along x → function increases (minimum behaviour)

• along y → function decreases (maximum behaviour)

So (0,0) is a saddle point.
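One way to check the example numerically is to look at the gradient and the eigenvalues of the Hessian at (0, 0): a zero gradient with eigenvalues of mixed sign indicates a saddle point. The sketch below assumes NumPy; the names `grad`, `hessian`, and `is_saddle` are illustrative.

```python
import numpy as np

def grad(p):
    """Gradient of f(x, y) = x^2 - y^2."""
    x, y = p
    return np.array([2 * x, -2 * y])

# Hessian of f is constant: d2f/dx2 = 2, d2f/dy2 = -2, mixed partials = 0.
hessian = np.array([[2.0, 0.0],
                    [0.0, -2.0]])

g0 = grad(np.array([0.0, 0.0]))
eigvals = np.linalg.eigvalsh(hessian)   # eigenvalues in ascending order

# Stationary point with both positive and negative curvature => saddle point.
is_saddle = bool(np.allclose(g0, 0.0) and eigvals.min() < 0 < eigvals.max())
```

The positive eigenvalue corresponds to the x direction (minimum behaviour) and the negative one to the y direction (maximum behaviour).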

Why is it a problem?

• Deep neural networks have high-dimensional loss surfaces
• Many saddle points occur
• Gradient descent slows down because:
o gradient becomes very small
o training gets stuck in flat regions
o convergence becomes slow

Solutions:

• Use Momentum
• Use optimizers like Adam / RMSProp
• Use Batch Normalization

Regularization methods (dropout, drop connect, batch normalization):

Regularization is a technique used in machine learning to prevent overfitting and improve the
generalization performance of a model on unseen data. Overfitting occurs when a model learns to
perform well on the training data but fails to generalize to new, unseen data. Many regularization
methods add a penalty term to the loss function, discouraging the model from fitting the training
data too closely and promoting simpler or more regular patterns in the learned parameters. Others,
such as dropout, inject noise during training instead. Either way, the aim is to keep the model
from capturing noise or irrelevant details in the training data, leading to better performance on
new, unseen data.

Dropout is a regularization technique used in deep neural networks to prevent overfitting. It
involves randomly ignoring, or "dropping out", some layer outputs during training.
Dropout is applied per layer and can be used with dense (fully connected), convolutional, and
recurrent layers, but not with the output layer. The dropout probability specifies the chance of
dropping each output, and different probabilities are often used for input and hidden layers.
Randomly removing units prevents any one neuron from becoming too specialized or overly
dependent on the presence of specific features in the training data.
Understanding Dropout Regularization
➢ Dropout regularization specifically addresses overfitting, which occurs when a model
performs well on training data but poorly on new, unseen data.

➢ During training, dropout randomly deactivates a chosen proportion of neurons (and
their connections) within a layer. This temporarily removes them from the network.

➢ The deactivated neurons are chosen at random for each training iteration. This
randomness is crucial for preventing overfitting.

➢ To account for the deactivated neurons, the outputs of the remaining active neurons are
scaled up by the inverse of the probability of keeping a neuron active (e.g., if 50% are
dropped, the remaining ones are multiplied by 2). At test time all neurons are active and no
scaling is applied.
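The masking and rescaling described above (often called inverted dropout) can be sketched with NumPy. The function name `dropout_forward`, its parameters, and the fixed random seed are assumptions made for this illustration.

```python
import numpy as np

def dropout_forward(x, drop_prob, training=True, rng=None):
    """Inverted dropout on an array of activations (illustrative sketch)."""
    if not training or drop_prob == 0.0:
        return x                          # inference: all units active, no scaling
    rng = np.random.default_rng() if rng is None else rng
    keep_prob = 1.0 - drop_prob
    mask = rng.random(x.shape) < keep_prob   # True for units that stay active
    return x * mask / keep_prob              # rescale survivors by 1 / keep_prob

rng = np.random.default_rng(0)
activations = np.ones((1000, 100))
dropped = dropout_forward(activations, drop_prob=0.5, rng=rng)
unchanged = dropout_forward(activations, drop_prob=0.5, training=False)
```

With a drop probability of 0.5, each surviving activation of 1 becomes 2, so the expected value of every output stays at 1 and the next layer sees inputs on the same scale during training and inference.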
Advantages of Dropout Regularization in Deep Learning
➢ Prevents Overfitting: By randomly disabling neurons, the network cannot overly
rely on the specific connections between them.

➢ Ensemble Effect: Dropout acts like training an ensemble of smaller neural networks
with varying structures during each iteration. This ensemble effect
improves the model's ability to generalize to unseen data.

➢ Enhancing Data Representation: By injecting noise into the hidden activations, dropout
acts somewhat like training on additional perturbed samples, which encourages more robust
feature representations.
Drawbacks of Dropout Regularization and How to Mitigate Them
Despite its benefits, dropout regularization in deep learning is not without its drawbacks. Here
are some of the challenges related to dropout and methods to mitigate them:
1. Longer Training Times: Because a different random subset of units is dropped at each
step, the updates are noisier and training typically needs more epochs to converge. To
address this, consider more powerful computing resources or parallelize training where possible.
2. Optimization Complexity: Why dropout works is not fully understood, which makes its
effect on optimization hard to predict. Experiment with dropout rates on a smaller scale
before full implementation to fine-tune model performance.
3. Hyperparameter Tuning: Dropout adds the dropout rate as a hyperparameter, and it
interacts with others such as the learning rate, so careful tuning is required. Use techniques
such as grid search or random search to systematically find good combinations.
4. Redundancy with Batch Normalization: Batch normalization has a regularizing effect of
its own and can sometimes make dropout unnecessary. Evaluate model performance with and
without dropout when using batch normalization to determine its necessity.
5. Model Complexity: Dropout layers add complexity. Simplify the model
architecture where possible, ensuring each dropout layer is justified by
performance gains in validation.
Batch Normalization:
Batch Normalization was introduced to reduce the problem of internal covariate shift in neural
networks, that is, the change in the distribution of each layer's inputs as the parameters of
earlier layers are updated. It works by normalizing the data within each mini-batch: it calculates
the mean and variance of the data in a batch and adjusts the values so that they have zero mean
and unit variance. It then scales and shifts the values with learnable parameters so that the model
can still represent whatever range it needs.

This process keeps the inputs to each layer of the network in a stable range even if the outputs
of earlier layers change during training. As a result, training becomes faster and more stable.
Need of Batch Normalization
1. Keeps the outputs of each layer steady as the model learns, which helps the model train
faster and learn more effectively.
2. Reduces the problem of internal covariate shift.
3. Makes training faster and more stable.
4. Allows use of higher learning rates.
5. Helps avoid vanishing or exploding gradients.
6. Can act as a regularizer, sometimes reducing the need for dropout.
Fundamentals of Batch Normalization:
This section describes the steps taken to perform batch normalization.
Step 1: Compute the Mean and Variance of Mini-Batches
For a mini-batch of activations x_1, x_2, ..., x_m, the mean μ_B and variance σ_B^2 of the
mini-batch are computed:

$$\mu_B = \frac{1}{m} \sum_{i=1}^{m} x_i, \qquad \sigma_B^2 = \frac{1}{m} \sum_{i=1}^{m} (x_i - \mu_B)^2$$

Step 2: Normalization
Each activation x_i is normalized using the computed mean and variance of the mini-batch.
The normalization process subtracts the mean μ_B from each activation and divides by the
square root of the variance σ_B^2, ensuring that the normalized activations have zero mean
and unit variance.
Additionally, a small constant ϵ is added to the denominator for numerical stability,
particularly to prevent division by zero:

$$\hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}$$

Step 3: Scale and Shift the Normalized Activations

The normalized activations x̂_i are then scaled by a learnable parameter γ and shifted by another
learnable parameter β. These parameters allow the model to learn the optimal scaling and
shifting of the normalized activations, giving the network additional flexibility:

$$y_i = \gamma \hat{x}_i + \beta$$
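The three steps can be sketched as a training-time forward pass in NumPy. This is a simplified illustration: the function name `batch_norm_forward`, the value `eps=1e-5`, and the random test batch are assumptions, and the running statistics that real implementations keep for inference are omitted.

```python
import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    """Batch normalization over the batch axis (rows = samples, columns = features)."""
    mu = x.mean(axis=0)                      # Step 1: per-feature mini-batch mean
    var = x.var(axis=0)                      # Step 1: per-feature mini-batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)    # Step 2: normalize
    y = gamma * x_hat + beta                 # Step 3: learnable scale and shift
    return y, x_hat

rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=3.0, size=(64, 4))   # hypothetical batch: 64 samples, 4 features
y, x_hat = batch_norm_forward(x, gamma=np.full(4, 2.0), beta=np.full(4, 1.0))
```

After normalization each feature of `x_hat` has mean close to 0 and variance close to 1; with γ = 2 and β = 1 the output `y` then has mean close to 1 and standard deviation close to 2, showing that the network can restore any scale it needs.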

Benefits of Batch Normalization

• Faster Convergence: Batch Normalization reduces internal covariate shift, allowing
for faster convergence during training.
• Higher Learning Rates: With Batch Normalization, higher learning rates can be used
with less risk of divergence.
• Regularization Effect: Batch Normalization introduces a slight regularization effect
that can reduce the need for additional regularization techniques like dropout.
