
Deep Learning Unit-3

The document discusses various optimization methods for training neural networks, including Adagrad, AdaDelta, RMSProp, and Adam, along with regularization techniques like dropout and batch normalization. It highlights the importance of these methods in improving model performance and addressing challenges such as overfitting and saddle points in high-dimensional loss surfaces. Additionally, it covers second-order methods like Newton's method and BFGS for faster convergence in optimization tasks.

UNIT-III: Better Training of Neural Networks

Newer optimization methods for neural networks (Adagrad, AdaDelta, RMSProp, Adam, NAG),
second-order methods for training, saddle point problem in neural networks, regularization
methods (dropout, DropConnect, batch normalization).

Neural network optimization techniques aim to improve a network's performance by adjusting
its parameters during training to minimize a loss function and enhance accuracy.

Key techniques include gradient descent and its variants, regularization methods, and
hyperparameter tuning.

Optimization Techniques

Gradient-Based Optimization Techniques:
➢ Gradient Descent
➢ Stochastic Gradient Descent
➢ SGD with Momentum
➢ Nesterov Accelerated Gradient
➢ Adaptive Gradient Methods

Regularization Techniques:
➢ L1 & L2 Regularization
➢ Dropout
➢ Batch Normalization
➢ Data Augmentation

Hyperparameter Tuning:
➢ Grid Search
➢ Random Search
➢ Bayesian Optimization

Adagrad Optimization
Adagrad is an abbreviation for Adaptive Gradient Algorithm. It is an adaptive learning rate
optimization algorithm used for training deep learning models. It is particularly effective for
sparse data or scenarios where features exhibit a large variation in magnitude. Adagrad adjusts
the learning rate for each parameter individually. Unlike standard gradient descent, where a
fixed learning rate is applied to all parameters, Adagrad adapts the learning rate based on the
historical gradients for each parameter. Parameters that have received small or infrequent
gradients keep a relatively large learning rate, while parameters with large or frequent
gradients get a smaller one, so rarely seen features are still learned efficiently.
Working of Adagrad Optimization
The primary concept behind Adagrad is the idea of adapting the learning rate based on the
historical sum of squared gradients for each parameter. Here's a step-by-step explanation of
how Adagrad works:
1. Initialization
Adagrad begins by initializing the parameter values randomly, just like other
optimization algorithms. Additionally, it initializes a running sum of squared gradients
for each parameter, which will track the gradients over time.
2. Gradient Calculation
For each training step, the gradient of the loss function with respect to the model's parameters is
calculated, just like in standard gradient descent.
3. Adaptive Learning Rate
The key difference comes next. Instead of using a fixed learning rate, Adagrad adjusts the
learning rate for each parameter based on the accumulated sum of squared gradients.
The effective learning rate for each parameter is calculated as follows:

$$\eta_t = \frac{\eta}{\sqrt{G_t + \epsilon}}$$

Where:
η is the global learning rate (a small constant value)
Gt is the sum of squared gradients for a given parameter up to time step t
ϵ is a small value added to avoid division by zero
Here, the denominator √(Gt + ϵ) grows as the squared gradients accumulate, causing the
learning rate to decrease over time, which helps to stabilize the training.
4. Parameter Update
The model's parameters are updated by subtracting the product of the adaptive learning rate and
the gradient at each step:

$$\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{G_t + \epsilon}} \, \nabla_\theta J(\theta_t)$$

Where:
θt is the current parameter
∇θJ(θt) is the gradient of the loss function with respect to the parameter
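The four steps above can be sketched in a few lines of NumPy. This is only an illustrative sketch, not a reference implementation: the function name `adagrad`, the toy loss J(θ) = θ₀² + 10·θ₁², and the hyperparameter values are assumptions chosen for the example.

```python
import numpy as np

def adagrad(grad_fn, theta0, lr=0.5, eps=1e-8, steps=200):
    """Minimal Adagrad loop (illustrative; names and defaults are assumptions)."""
    theta = np.asarray(theta0, dtype=float).copy()
    G = np.zeros_like(theta)              # step 1: running sum of squared gradients
    for _ in range(steps):
        g = grad_fn(theta)                # step 2: gradient of the loss
        G += g ** 2                       # accumulate squared gradients per parameter
        theta -= lr / np.sqrt(G + eps) * g  # steps 3-4: per-parameter rate and update
    return theta, G

# Toy loss J(theta) = theta_0^2 + 10 * theta_1^2 (hypothetical example)
grad = lambda th: np.array([2.0 * th[0], 20.0 * th[1]])
theta, G = adagrad(grad, [3.0, -2.0])
print(theta, G)
```

Because the second coordinate sees much larger gradients, its accumulator `G` grows faster and its effective learning rate shrinks more, which is the per-parameter adaptation described above.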
Uses

➢ Problems with sparse data and features (e.g., natural language processing,
recommender systems).

➢ Tasks where features have different levels of importance and frequency.

➢ Training models that do not require a very fast convergence rate but benefit from a
more stable optimization process.

Because Gt only grows, Adagrad's learning rate shrinks monotonically and can become too small
late in training. If you are dealing with problems where a more constant learning rate is
preferable (e.g., in some deep learning tasks), using variants like RMSProp or Adam might be
more appropriate.

AdaDelta
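AdaDelta is a modification of Adagrad that limits the accumulation of past gradients. Instead
of summing all past squared gradients, it keeps an exponentially decaying moving average of
them, and it replaces the global learning rate with a running average of past squared parameter
updates, which gives a more stable and bounded update rule. The key update for AdaDelta is:

$$\theta_{t+1} = \theta_t - \frac{\sqrt{E[\Delta\theta^2]_{t-1} + \epsilon}}{\sqrt{E[g^2]_t + \epsilon}} \, g_t$$

Where:
gt = ∇θJ(θt) is the gradient at time step t
E[g²]t is the running average of past squared gradients
E[Δθ²]t−1 is the running average of past squared parameter updates
ϵ is a small value added for numerical stability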
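A minimal sketch of this update rule is shown below. The decay rate `rho`, the value of `eps`, and the toy loss J(θ) = θ² are assumptions for illustration; they are not given in the notes.

```python
import numpy as np

def adadelta(grad_fn, theta0, rho=0.95, eps=1e-6, steps=2000):
    """Minimal AdaDelta loop (illustrative; rho and eps are assumed values)."""
    theta = np.asarray(theta0, dtype=float).copy()
    Eg2 = np.zeros_like(theta)    # running average of squared gradients
    Edx2 = np.zeros_like(theta)   # running average of squared parameter updates
    for _ in range(steps):
        g = grad_fn(theta)
        Eg2 = rho * Eg2 + (1 - rho) * g ** 2
        # Ratio of RMS(past updates) to RMS(gradients) replaces the learning rate
        delta = -np.sqrt(Edx2 + eps) / np.sqrt(Eg2 + eps) * g
        Edx2 = rho * Edx2 + (1 - rho) * delta ** 2
        theta += delta
    return theta

grad = lambda th: 2.0 * th              # gradient of the toy loss J(theta) = theta^2
theta_10 = adadelta(grad, [1.0], steps=10)
theta_final = adadelta(grad, [1.0])
print(theta_10, theta_final)
```

Note that no global learning rate appears: the step size is set entirely by the two running averages, and the first steps are small because `Edx2` starts at zero.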

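Adam (Adaptive Moment Estimation)

Adam combines the benefits of both Adagrad and momentum-based methods. It uses both the
moving average of the gradients and the moving average of the squared gradients to adapt the
learning rate. Adam is widely used because it is robust and performs well across many machine
learning tasks with little tuning.
Adam has the following update rules, where gt = ∇θJ(θt) is the gradient and β1, β2 are the
decay rates of the two moving averages:

1. First moment estimate (mt):

$$m_t = \beta_1 m_{t-1} + (1 - \beta_1) \, g_t$$

2. Second moment estimate (vt):

$$v_t = \beta_2 v_{t-1} + (1 - \beta_2) \, g_t^2$$

3. Corrected moment estimates:

$$\hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \qquad \hat{v}_t = \frac{v_t}{1 - \beta_2^t}$$

4. Parameter update:

$$\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon} \, \hat{m}_t$$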
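The four update rules translate almost line for line into code. The sketch below assumes the commonly used defaults β1 = 0.9, β2 = 0.999 and a toy quadratic loss; the function name `adam` and all values are illustrative assumptions.

```python
import numpy as np

def adam(grad_fn, theta0, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8, steps=500):
    """Minimal Adam loop (illustrative; defaults are commonly used values)."""
    theta = np.asarray(theta0, dtype=float).copy()
    m = np.zeros_like(theta)   # first moment (moving average of gradients)
    v = np.zeros_like(theta)   # second moment (moving average of squared gradients)
    for t in range(1, steps + 1):
        g = grad_fn(theta)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g ** 2
        m_hat = m / (1 - beta1 ** t)   # bias correction for the zero initialization
        v_hat = v / (1 - beta2 ** t)
        theta -= lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta

# Toy loss J(theta) = theta_0^2 + 10 * theta_1^2 (hypothetical example)
grad = lambda th: np.array([2.0 * th[0], 20.0 * th[1]])
theta_one = adam(grad, [3.0, -2.0], steps=1)
theta = adam(grad, [3.0, -2.0])
print(theta_one, theta)
```

With bias correction, the very first step has magnitude close to `lr` in every coordinate regardless of the gradient scale, which is why `theta_one` moves by about 0.1 in both directions.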

RMSProp (Root Mean Square Propagation):

RMSProp addresses Adagrad's diminishing learning rate issue by introducing an exponentially
decaying average of the squared gradients instead of accumulating the sum. This prevents the
learning rate from decreasing too quickly, making the algorithm more effective in training deep
neural networks.
The update rule for RMSProp is as follows:

$$G_t = \gamma \, G_{t-1} + (1 - \gamma) \, (\nabla_\theta J(\theta_t))^2$$

Where:
Gt is the exponentially decaying average of the squared gradients
γ is the decay factor (typically set to 0.9)
∇θJ(θt) is the gradient
The parameter update rule is:

$$\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{G_t + \epsilon}} \, \nabla_\theta J(\theta_t)$$
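As a rough sketch, RMSProp differs from Adagrad by a single line: the accumulator is a decaying average rather than a sum. The function name `rmsprop`, the learning rate, and the toy loss are assumptions for illustration.

```python
import numpy as np

def rmsprop(grad_fn, theta0, lr=0.01, gamma=0.9, eps=1e-8, steps=1000):
    """Minimal RMSProp loop (illustrative; lr and the toy loss are assumptions)."""
    theta = np.asarray(theta0, dtype=float).copy()
    G = np.zeros_like(theta)   # decaying average of squared gradients
    for _ in range(steps):
        g = grad_fn(theta)
        G = gamma * G + (1 - gamma) * g ** 2   # old gradients fade out over time
        theta -= lr / np.sqrt(G + eps) * g
    return theta

# Toy loss J(theta) = theta_0^2 + 10 * theta_1^2 (hypothetical example)
grad = lambda th: np.array([2.0 * th[0], 20.0 * th[1]])
theta_one = rmsprop(grad, [3.0, -2.0], steps=1)
theta = rmsprop(grad, [3.0, -2.0])
print(theta_one, theta)
```

Since `G` forgets old gradients, the effective learning rate does not decay toward zero the way it does in Adagrad.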

Second order methods for training:


Second-order optimization methods use curvature (second-derivative) information in addition to
the gradient, which can give faster convergence to the optimal solution, usually at a higher
cost per iteration.
This section discusses Newton's method, BFGS, and the conjugate gradient method.
Newton's Method
Newton's method is an iterative optimization algorithm that uses both the gradient and the
Hessian matrix of an objective function to rapidly converge to the minimum or maximum of
that function. This approach can be visualized as using a spotlight that shines brightest at the
exit, guiding you directly towards the optimal solution.
It is based on a Taylor series expansion that approximates J(θ) near some point θ0,
incorporating second-order derivative terms and ignoring derivatives of higher order:

$$J(\theta) \approx J(\theta_0) + (\theta - \theta_0)^T \nabla J(\theta_0) + \frac{1}{2} (\theta - \theta_0)^T H(\theta_0) \, (\theta - \theta_0)$$

Solving for the critical point of this function we obtain the Newton parameter update rule:

$$\theta^{*} = \theta_0 - H(\theta_0)^{-1} \, \nabla J(\theta_0)$$
Where:
J(θ) = Objective function to optimize.
θ = Parameter vector to update.
θ0 = The initial guess for the parameter vector.
∇J(θ0) = The gradient vector of the objective function evaluated at θ0. It points in the
direction of steepest ascent; its negative is the direction of steepest descent.
H(θ0) = The Hessian matrix (second-order partial derivatives) of the objective function
evaluated at θ0. It provides information about the curvature of the function.
(θ−θ0)T = The transpose of the difference between the current parameter vector θ and the
initial guess θ0.
The last term of the expansion, involving the Hessian matrix, accounts for the curvature of the
function. It ensures that the update considers both the gradient and the curvature.

Efficient Minimization Using Newton's Method

Step 1: Define the Objective Function
We start by defining the objective function that we want to minimize. For this example, we'll
use a simple quadratic function f(x) = x² + 2x + 1.
Step 2: Compute the Gradient and Hessian
The gradient is the first derivative of the function, and the Hessian is the second derivative.
For our quadratic function:
The gradient is ∇f(x) = 2x + 2.
The Hessian is H(x) = 2 (a constant in this case).
Step 3: Implement Newton's Update Rule
Newton's update rule is x_new = x − H⁻¹∇f(x). Here this gives x_new = x − (2x + 2)/2 = −1, so
for a quadratic function a single Newton step lands exactly on the minimum.
Step 4: Iterate Until Convergence
We iterate using the update rule until the gradient is close to zero or we reach a maximum
number of iterations.
Step 5: Visualize the Process
We plot the function and the path taken by Newton's Method to reach the minimum.
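Steps 1 to 4 can be sketched as follows. The starting point x0 = 5.0, the tolerance, and the function names are assumptions for illustration; the recorded `path` is what Step 5 would plot.

```python
def f(x):
    return x ** 2 + 2 * x + 1      # objective from Step 1

def grad(x):
    return 2 * x + 2               # first derivative

def hess(x):
    return 2.0                     # second derivative (constant here)

def newton_1d(x0, tol=1e-10, max_iter=20):
    """1-D Newton iteration; returns the final point and the visited path."""
    x = float(x0)
    path = [x]
    for _ in range(max_iter):
        g = grad(x)
        if abs(g) < tol:           # stop once the gradient is (numerically) zero
            break
        x = x - g / hess(x)        # Newton update: x_new = x - H^{-1} * grad
        path.append(x)
    return x, path

x_min, path = newton_1d(5.0)
print(path)   # [5.0, -1.0]
```

The path has only two entries because the function is quadratic: the second-order Taylor expansion is exact, so one Newton step reaches the minimum at x = −1.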

BFGS (Broyden–Fletcher–Goldfarb–Shanno): The Experienced Pathfinder

The BFGS (Broyden–Fletcher–Goldfarb–Shanno) algorithm is an optimization technique for
nonlinear optimization problems where the exact computation of the Hessian is
computationally burdensome. It is a quasi-Newton method: instead of calculating the
inverse Hessian matrix (H−1) directly, it approximates it with another matrix (Mt) that is
iteratively refined using low-rank (rank-2) updates.
Here's a detailed explanation of the BFGS method:
1. Newton's Method without the Computational Burden: BFGS provides a way to
perform optimizations similar to Newton's method but without the heavy
computations.
2. Primary Difficulty is Computation of H−1: Directly computing the inverse Hessian
is expensive and complex.
3. BFGS Approximates H−1 by Matrix Mt: Instead of calculating H−1 directly, BFGS
uses an iterative process to refine an approximation matrix Mt.
4. Direction of Descent: Once Mt is updated, the direction of descent ρt is
determined by ρt = −Mt gt, where gt is the gradient.
5. Line Search: A line search is conducted in the direction ρt to determine the optimal
step size ϵ∗.
6. Parameter Update: The parameters are updated using the below equation.

θt+1 = θt + ϵ∗ ρt

where:
θt: Parameters at iteration t.
θt+1: Updated parameters at iteration t+1.
ϵ∗: Optimal step size determined through line search.
ρt: Direction of descent.
The BFGS method is a potent tool for optimization in a variety of settings, because every
iteration refines both its curvature estimate and its search direction.
How does BFGS work?
Memory of Past Steps: Remembers the path you have taken to improve future steps.
Curvature Update: Adjusts the map based on the terrain you have crossed.

Efficient Minimization Using BFGS Algorithm:

Step 1: Define the Objective Function
We will use a two-variable function for the BFGS algorithm:
f(x, y) = x² + y² + xy + 4.
Step 2: Compute the Gradient
The gradient is a vector of partial derivatives. For f(x, y):
∇f(x, y) = (2x + y, x + 2y).
Step 3: Implement the BFGS Algorithm
The BFGS algorithm updates the approximation M of the inverse Hessian matrix and uses it to
perform the update step.
The update rule is x_new = x − ϵ M ∇f(x), where ϵ is the step size found by the line search.
Step 4: Iterate Until Convergence
We iterate using the update rule and update the inverse Hessian approximation until the
gradient is close to zero or we reach a maximum number of iterations.
Step 5: Visualize the Process
We plot the function and the path taken by the BFGS algorithm to reach the minimum.
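A compact sketch of these steps is given below. It uses the standard BFGS rank-2 update of the inverse Hessian approximation and a simple backtracking (Armijo) line search; the line-search constants, the starting point, and the helper names are assumptions for illustration.

```python
import numpy as np

def f(p):
    x, y = p
    return x ** 2 + y ** 2 + x * y + 4.0

def grad(p):
    x, y = p
    return np.array([2 * x + y, x + 2 * y])

def bfgs(p0, tol=1e-8, max_iter=100):
    """Minimal BFGS with backtracking line search (illustrative sketch)."""
    p = np.asarray(p0, dtype=float)
    I = np.eye(p.size)
    M = I.copy()                       # initial inverse-Hessian approximation
    g = grad(p)
    path = [p.copy()]
    for _ in range(max_iter):
        if np.linalg.norm(g) < tol:
            break
        rho = -M @ g                   # direction of descent
        step = 1.0                     # backtracking line search (Armijo condition)
        while f(p + step * rho) > f(p) + 1e-4 * step * (g @ rho) and step > 1e-12:
            step *= 0.5
        s = step * rho                 # parameter change
        p_new = p + s
        g_new = grad(p_new)
        y = g_new - g                  # gradient change
        sy = s @ y
        if sy > 1e-12:                 # curvature condition keeps M positive definite
            r = 1.0 / sy
            M = (I - r * np.outer(s, y)) @ M @ (I - r * np.outer(y, s)) + r * np.outer(s, s)
        p, g = p_new, g_new
        path.append(p.copy())
    return p, path

p_min, path = bfgs([3.0, -2.0])
print(p_min, f(p_min))
```

The minimum of this function is at (0, 0) with f = 4, and the Hessian itself is never formed or inverted: only gradients and the matrix M are used.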
Conjugate Gradient Method:
The Conjugate Gradient (CG) method is an optimization algorithm primarily used for solving
large systems of linear equations where the coefficient matrix is symmetric and positive
definite, as well as for solving large-scale unconstrained optimization problems. This method
is especially valuable when dealing with large problems where storing the full Hessian matrix
is impractical due to memory constraints.
The method efficiently avoids direct computation of the inverse Hessian matrix (H−1) by
iteratively descending along conjugate directions. Specifically, at iteration t, the next search
direction, denoted as dt, takes the form:

$$d_t = -\nabla_\theta J(\theta_t) + \beta_t \, d_{t-1}$$

Two directions dt and dt−1 are considered conjugate if they satisfy:

$$d_t^T \, H \, d_{t-1} = 0$$

Where:
dt: The search direction at iteration t.
∇θJ(θt): The gradient vector at the current iterate.
βt: The coefficient that controls how much of the previous search direction (dt−1) is kept.
H: The Hessian matrix.

How does the Conjugate Gradient Method work?

Direction Finding: It builds each new search direction to be conjugate to the previous ones,
so progress made along earlier directions is not undone.
Step Size Adjustment: Adjusts how big or small each step should be.
Efficient Minimization Using Conjugate Gradient Method:
Step 1: Define the Objective Function
We will use a simple quadratic function f(x) = ½ xᵀAx − bᵀx, where A is a symmetric positive
definite matrix.
Step 2: Compute the Gradient
The gradient is ∇f(x) = Ax − b.
Step 3: Implement the Conjugate Gradient Method
The Conjugate Gradient Method iterates to find the minimum of the quadratic function using
conjugate directions.
Step 4: Iterate Until Convergence
We iterate using the update rule until the residual r = b − Ax is small, indicating convergence.
Step 5: Visualize the Process
We plot the function and the path taken by the Conjugate Gradient Method to reach the
minimum.
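The steps above can be sketched for a small quadratic. The matrix `A`, the vector `b`, and the Fletcher-Reeves style formula used for β are assumptions chosen for this example; minimizing f(x) = ½ xᵀAx − bᵀx is equivalent to solving Ax = b.

```python
import numpy as np

def conjugate_gradient(A, b, x0=None, tol=1e-10, max_iter=None):
    """Linear conjugate gradient for symmetric positive definite A (sketch)."""
    n = b.size
    x = np.zeros(n) if x0 is None else np.asarray(x0, dtype=float).copy()
    r = b - A @ x                 # residual, equal to the negative gradient
    d = r.copy()                  # first direction is steepest descent
    rs_old = r @ r
    path = [x.copy()]
    for _ in range(max_iter or n):
        if np.sqrt(rs_old) < tol:
            break
        Ad = A @ d
        alpha = rs_old / (d @ Ad)     # exact step size along d
        x = x + alpha * d
        r = r - alpha * Ad
        rs_new = r @ r
        beta = rs_new / rs_old        # weight given to the previous direction
        d = r + beta * d              # new direction, conjugate to the old one
        rs_old = rs_new
        path.append(x.copy())
    return x, path

A = np.array([[4.0, 1.0], [1.0, 3.0]])   # hypothetical SPD matrix
b = np.array([1.0, 2.0])
x_min, path = conjugate_gradient(A, b)
print(x_min, len(path) - 1)
```

For an n-dimensional quadratic, CG needs at most n iterations in exact arithmetic, so this 2-D example finishes in two steps while only ever multiplying by A, never inverting it.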

Saddle point problem in neural networks:

A saddle point in neural networks is a point on the loss function surface where the gradient becomes
zero, but the point is not a minimum. It behaves like a minimum in one direction and a maximum in
another direction.

Definition

A saddle point is a stationary point where:

∇L(θ) = 0
but it is neither a local minimum nor a local maximum (the Hessian has both positive and
negative eigenvalues).
Example:
f(x, y) = x² − y²

At (0,0) the gradient is zero, but:

➢ along x → function increases (minimum behaviour)
➢ along y → function decreases (maximum behaviour)

So (0,0) is a saddle point.
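The example can be checked numerically. The sketch below confirms the saddle via the signs of the Hessian eigenvalues and then runs plain gradient descent from a point very close to the x-axis; the starting point, learning rate, and number of steps are assumptions for illustration.

```python
import numpy as np

def grad(p):
    x, y = p
    return np.array([2 * x, -2 * y])     # gradient of f(x, y) = x^2 - y^2

hessian = np.array([[2.0, 0.0], [0.0, -2.0]])
eigvals = np.linalg.eigvalsh(hessian)    # eigenvalues in ascending order

is_stationary = np.allclose(grad(np.zeros(2)), 0.0)
# Mixed-sign eigenvalues at a stationary point indicate a saddle
is_saddle = is_stationary and eigvals[0] < 0 < eigvals[-1]

# Gradient descent started almost exactly on the x-axis
p = np.array([1.0, 1e-6])
lr = 0.1
for _ in range(50):
    p = p - lr * grad(p)

print(is_saddle, p)
```

After 50 steps the x-coordinate has collapsed toward the saddle while the y-coordinate, which is the only way out, has grown only slowly from its tiny starting value. This is the kind of slow progress near saddle points that the next part describes.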

Why is it a problem?

➢ Deep neural networks have high-dimensional loss surfaces
➢ Many saddle points occur
➢ Gradient descent slows down because:
o gradient becomes very small
o training gets stuck in flat regions
o convergence becomes slow

Solutions:

➢ Use Momentum
➢ Use optimizers like Adam / RMSProp
➢ Use Batch Normalization
Regularization methods (dropout, DropConnect, batch normalization):

Regularization is a technique used in machine learning to prevent overfitting and improve the
generalization performance of a model on unseen data. Overfitting occurs when a model learns to
perform well on the training data but fails to generalize to new, unseen data. Many regularization
methods introduce a penalty term to the loss function, discouraging the model from fitting the
training data too closely and promoting simpler or more regular patterns in the learned
parameters. Others, such as dropout, constrain the model by injecting noise during training.
Either way, this helps prevent the model from capturing noise or irrelevant details in the
training data, leading to better performance on new, unseen data.

Dropout is a regularization technique used in deep neural networks to prevent overfitting. It
involves randomly ignoring, or "dropping out", some layer outputs during training.
Dropout is implemented per layer and can be applied to most layer types, such as dense (fully
connected), convolutional, and recurrent layers, but not to the output layer. The dropout
probability specifies the chance of dropping each output, and different probabilities can be used
for input and hidden layers. Randomly dropping outputs prevents any one neuron from becoming
too specialized or overly dependent on the presence of specific features in the training data.
Understanding Dropout Regularization
➢ Dropout regularization applies dropout during training in deep learning
models to specifically address overfitting, which occurs when a model performs well
on training data but poorly on new, unseen data.

➢ During training, dropout randomly deactivates a chosen proportion of neurons (and
their connections) within a layer. This temporarily removes them from the network.

➢ The deactivated neurons are chosen at random for each training iteration. This
randomness is crucial for preventing overfitting.

➢ To account for the deactivated neurons, the outputs of the remaining active neurons are
scaled up by the inverse of the probability of keeping a neuron active (e.g., if 50% are
dropped, the keep probability is 0.5 and the remaining ones are multiplied by 1/0.5 = 2).
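The points above correspond to what is often called inverted dropout, sketched below. The function name `dropout`, its signature, and the fixed random seed are assumptions for illustration, not a particular library's API.

```python
import numpy as np

def dropout(x, drop_prob, training=True, rng=None):
    """Inverted dropout on an activation array (illustrative sketch)."""
    if not training or drop_prob == 0.0:
        return x                              # no dropout at inference time
    rng = np.random.default_rng() if rng is None else rng
    keep_prob = 1.0 - drop_prob
    mask = rng.random(x.shape) < keep_prob    # new random mask on every call
    return x * mask / keep_prob               # rescale the surviving activations

rng = np.random.default_rng(0)                # fixed seed only for repeatability
x = np.ones((4, 1000))
out = dropout(x, 0.5, rng=rng)
print(np.unique(out), out.mean())
```

With a drop probability of 0.5 every output is either 0 or 2, so the expected activation stays at 1. Scaling during training is what allows the layer to be used unchanged at inference time.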
Advantages of Dropout Regularization in Deep Learning
➢ Prevents Overfitting: By randomly disabling neurons, the network cannot overly
rely on the specific connections between them.

➢ Ensemble Effect: Dropout acts like training an ensemble of smaller neural networks
with varying structures during each iteration. This ensemble effect
improves the model's ability to generalize to unseen data.

➢ Enhancing Data Representation: Dropout introduces noise into the activations, so the
network effectively sees many perturbed versions of each training sample, which
encourages more robust representations.
Drawbacks of Dropout Regularization and How to Mitigate Them
Despite its benefits, dropout regularization in deep learning is not without its drawbacks. Here
are some of the challenges related to dropout and methods to mitigate them:
1. Longer Training Times: Because each update trains only a random sub-network, dropout
typically needs more iterations to converge. To address this, consider powerful computing
resources or parallelize training where possible.
2. Optimization Complexity: Why dropout works is not fully understood, which makes
optimization challenging. Experiment with dropout rates on a smaller scale before full
implementation to fine-tune model performance.
3. Hyperparameter Tuning: Dropout adds the dropout rate as a hyperparameter, and it interacts
with others such as the learning rate, requiring careful tuning. Use techniques such as grid
search or random search to systematically find good combinations.
4. Redundancy with Batch Normalization: Batch normalization can sometimes
replace dropout effects. Evaluate model performance with and without dropout when
using batch normalization to determine its necessity.
5. Model Complexity: Dropout layers add complexity. Simplify the model
architecture where possible, ensuring each dropout layer is justified by
performance gains in validation.
Batch Normalization:
Batch Normalization is used to reduce the problem of internal covariate shift in neural
networks. It works by normalizing the data within each mini-batch. This means it calculates the
mean and variance of the data in a batch and then adjusts the values so that they have zero mean
and unit variance. After that it scales and shifts the values with learnable parameters so that
the model learns effectively.

This process keeps the inputs to each layer of the network in a stable range even if the outputs
of earlier layers change during training. As a result, training becomes faster and more stable.
Need of Batch Normalization
1. Batch Normalization makes sure the outputs of each layer stay steady as the model learns.
This helps the model train faster and learn more effectively.
2. Reduces the problem of internal covariate shift.
3. Makes training faster and more stable.
4. Allows use of higher learning rates.
5. Helps avoid vanishing or exploding gradients.
6. Can act like a regularizer and sometimes reduces the need for dropout.
Fundamentals of Batch Normalization:
This section describes the steps taken to perform batch normalization.
Step 1: Compute the Mean and Variance of Mini-Batches
For a mini-batch of activations x1, x2, ..., xm, the mean μB and variance σB² of the mini-batch
are computed:

$$\mu_B = \frac{1}{m} \sum_{i=1}^{m} x_i, \qquad \sigma_B^2 = \frac{1}{m} \sum_{i=1}^{m} (x_i - \mu_B)^2$$

Step 2: Normalization
Each activation xi is normalized using the computed mean and variance of the mini-batch.
The normalization process subtracts the mean μB from each activation and divides by the
square root of the variance σB², ensuring that the normalized activations have a zero mean
and unit variance.
Additionally, a small constant ϵ is added to the denominator for numerical stability,
particularly to prevent division by zero:

$$\hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}$$

Step 3: Scale and Shift the Normalized Activations
The normalized activations x̂i are then scaled by a learnable parameter γ and shifted by another
learnable parameter β. These parameters allow the model to learn the optimal scaling and
shifting of the normalized activations, giving the network additional flexibility:

$$y_i = \gamma \, \hat{x}_i + \beta$$
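The three steps can be sketched as a training-mode forward pass. The function name `batch_norm_forward`, the batch shape, and the values of γ and β are assumptions for illustration; running statistics for inference are deliberately left out.

```python
import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    """Training-mode batch norm over axis 0 of a (batch, features) array (sketch)."""
    mu = x.mean(axis=0)                       # step 1: per-feature batch mean
    var = x.var(axis=0)                       # step 1: per-feature batch variance (1/m)
    x_hat = (x - mu) / np.sqrt(var + eps)     # step 2: normalize
    y = gamma * x_hat + beta                  # step 3: learnable scale and shift
    return y, x_hat

rng = np.random.default_rng(0)
x = rng.normal(loc=10.0, scale=5.0, size=(8, 3))   # hypothetical activations
gamma = np.full(3, 2.0)
beta = np.full(3, 0.5)
y, x_hat = batch_norm_forward(x, gamma, beta)
print(x_hat.mean(axis=0), x_hat.var(axis=0), y.mean(axis=0))
```

Whatever the scale of the incoming activations, `x_hat` has approximately zero mean and unit variance per feature, and the output mean and spread are then set by β and γ alone.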

Benefits of Batch Normalization

• Faster Convergence: Batch Normalization reduces internal covariate shift, allowing
for faster convergence during training.
• Higher Learning Rates: With Batch Normalization, higher learning rates can be used
with less risk of divergence.
• Regularization Effect: Batch Normalization introduces a slight regularization effect
that can reduce the need for additional regularization techniques like dropout.
