0% found this document useful (0 votes)
231 views8 pages

Gradient-Based Optimization in Deep Learning

Deep Learning

Uploaded by

Kavitha
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as DOCX, PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
231 views8 pages

Gradient-Based Optimization in Deep Learning

Deep Learning

Uploaded by

Kavitha
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as DOCX, PDF, TXT or read online on Scribd

3.

GRADIENT-BASED OPTIMIZATION

 Most deep learning algorithms involve optimization of some sort.


 Optimization refers to the task of either minimizing or maximizing some function f(x) by
altering x.
 We usually phrase most optimization problems in terms of minimizing f(x).
 Maximization may be accomplished via a minimization algorithm by minimizing - f(x).
 The function we want to minimize or maximize is called the objective function or
criterion.
 When we are minimizing it, we may also call it the cost function, loss function, or error
function.

For example: We might say x ∗ = arg min f (x)

 S
uppose
we have a
function
y = f(x),
where
both x
and y are
real
numbers.

 T
he derivative of this function is denoted as f’(x) or as dy/ dx. The derivative f‘(x) gives the
slope of f (x) at the point x.

 In other words, it specifies how to scale a small change in the input in order to obtain the
corresponding change in the output: f(x +∈ ) ≈ f(x) +∈f’(x).

 The derivative is therefore useful for minimizing a function because it tells us how to
change x in order to make a small improvement in y.
For example:

 We know that f (x − ∈ sign (f ∈(x))) is less than f (x) for small enough∈.

 We can thus reduce f (x) by moving x in small steps with opposite sign of the derivative.

 When f ‘(x) = 0, the derivative provides no information about which direction to move.
Points

 Where f ‘(x) = 0 are known as critical points or stationary points.

 A local minimum is a point where f (x) is lower than at all neighboring points, so it is no
longer possible to decrease f(x) by making infinitesimal steps. A local maximum is a point
where f (x) is higher than at all neighboring points.
 The directional derivative in direction u (a unit vector) is the slope of the function f in
direction u.

 In other words, the directional derivative is the derivative of the function f(x + αu) with
respect to α, evaluated at α = 0.

 Using the chain rule, we can see that ∂ /∂α f(x + αu) evaluates to u T∇xf(x) when α = 0.

 To minimize f, we would like to find the direction in which f decreases the fastest. We can
do this using the directional derivative:

Where θ is the angle between u and the gradient.

 Substituting in ||u||2 = 1 and ignoring factors that do not depend on u, this simplifies to
minu cos θ. This is minimized when u points in the opposite direction as the gradient. In
other words, the gradient points directly uphill, and the negative gradient points directly
downhill. We can decrease f by moving in the direction of the negative gradient. This is
known as the method of steepest descent or gradient descent. Steepest descent proposes a
new p

where ∈ is the learning rate, a positive scalar determining the size of the step. We can
choose ∈ in several different ways. A popular approach is to set ∈ to a small constant.
Beyond the Gradient: Jacobian and Hessian Matrices

 Jacobian and Hessian Matrices Sometimes we need to find all of the partial derivatives of
a function whose input and output are both vectors.

 The matrix containing all such partial derivatives is known as a Jacobian matrix.

 Specifically, if we have a function f : R m → Rn , then the Jacobian matrix J ∈ Rn×m of f


is defined such that Ji,j = ∂/ ∂xj f(x)i.

 We are also sometimes interested in a derivative of a derivative. This is known as a second


derivative.
For example:- For a function f : R n → R, the derivative with respect to xi of the derivative of f
with respect to xj is denoted as ∂2 / ∂xi ∂xj f.
 In a single dimension, we can denote d2 /dx2 f by f “(x).
 The second derivative tells us how the first derivative will change as we vary the input.

Suppose we have a quadratic function


o If such a function has a second derivative of zero, then there is no curvature.

o It is a perfectly flat line, and its value can be predicted using only the gradient.

o If the gradient is 1, then we can make a step of size ∈ along the negative gradient, and the
cost function will decrease by ∈ .

o If the second derivative is negative, the function curves downward, so the cost function
will actually decrease by more than ∈.

o Finally, if the second derivative is positive, the function curves upward, so the cost
function can decrease by less than ∈.
When our function has multiple input dimensions, there are many second derivatives. These
derivatives can be collected together into a matrix called the Hessian matrix. The Hessian matrix
H(f)(x) is defined such that

 Most of the functions we encounter in the context of deep learning have a symmetric
Hessian almost everywhere. Because the Hessian matrix is real and symmetric, we can
decompose it into a set of real eigenvalues and an orthogonal basis eigenvectors.

 The second derivative in a specific direction represented by a unit vector d is given by


dTHd.

 When d is an eigenvector of H , the second derivative in that direction is given by the


corresponding eigenvalue.

 For other directions of d the directional second derivative is a weighted average of all of
the eigenvalues, with weights between 0 and 1, and eigenvectors that have smaller angle
with d receiving more weight.

 The maximum eigenvalue determines the maximum second derivative and the minimum
eigenvalue determines the minimum second derivative.

 The (directional) second derivative tells us how well we can expect a gradient descent step
to perform.
 We can make a second-order Taylor series approximation to the function f(x) around the
current point x (0) :

 When gTHg is zero or negative, the Taylor series approximation predicts that increasing ∈
forever will decrease f forever.

 When gTHg is positive, solving for the optimal step size that decreases the Taylor series
approximation of the function the most yields
Figure: A saddle point containing both positive and negative curvature.

 The function in this example is f (x) = x2 1 − x2 2 . Along the axis corresponding to x1, the
function curves upward.

Figure : Gradient descent fails to exploit the curvature information

 The simplest method for doing so is known as Newton’s method. Newton’s method is
based on using a second-order Taylor series expansion to approximate f(x) near some
point x (0) :

Optimization algorithms that use only the gradient such as gradient descent are called first-
order optimization algorithms.

 Optimization algorithms that also use the Hessian matrix, such as Newton’s method are
called second-order optimization algorithms.
 In the context of deep learning, we sometimes gain some guarantees by restricting
ourselves to functions that are either Lipschitz continuous or have Lipschitz continuous
derivatives.

 Convex optimization algorithms are able to provide many more guarantees by making
stronger restrictions.

 Convex optimization algorithms are applicable only to convex functions -- functions for
which the Hessian is positive semidefinite everywhere.

 Such functions are well-behaved because they lack saddle points and all of their local
minima are necessarily global minima.

 However, most problems in deep learning are difficult to express in terms of convex
optimization. Convex optimization is used only as a subroutine of some deep learning
algorithms.

 Ideas from the analysis of convex optimization algorithms can be useful for proving the
convergence of deep learning algorithms.

 However, in general, the importance of convex optimization is greatly diminished in the


context of deep learning.

Common questions

Powered by AI

The Jacobian matrix offers the first-order partial derivatives for multivariable functions, providing insights into how the function's output changes with respect to each input variable . The Hessian matrix expands upon this by offering the second-order partial derivatives, thus indicating the curvature around each point and helping to understand how these changes accelerate or decelerate . In deep learning, these matrices are critical for tasks such as stability analysis, understanding parameter sensitivity, and enabling optimization routines, particularly when fine-tuning convergence strategies and ensuring robustness against overfitting .

The Hessian matrix, which contains all second-order partial derivatives, provides information about the curvature of a function. Specifically, the eigenvalues of the Hessian inform us of the curvature in different directions; a positive eigenvalue implies upward curvature, while a negative eigenvalue implies downward curvature . In deep learning, the Hessian's symmetry allows for decomposition into real eigenvalues and an orthogonal basis of eigenvectors . This information can be used in second-order optimization algorithms, such as Newton's method, which can exploit the curvature information for more efficient optimization compared to first-order methods like gradient descent .

The method of steepest descent, or gradient descent, involves moving in the direction of the negative gradient to minimize a function, as it indicates the direction of steepest decrease . The learning rate is a crucial parameter that affects the size of each step taken in the descent. A small constant learning rate is commonly used to ensure stable convergence, but it can impact the speed and accuracy of reaching the minimum, requiring careful tuning for optimal performance .

The second derivative, encapsulated in the Hessian matrix, provides insight into the curvature of the cost function. In second-order optimization methods like Newton's method, the second derivative helps adjust the step size and direction based on how the tangent to the function changes, allowing these methods to better navigate regions of varied curvature and potentially converge faster than first-order methods, especially when near a local minimum where the curvature is more pronounced .

Eigenvalues and eigenvectors of the Hessian matrix provide a comprehensive picture of a function's curvature at a given point. In gradient descent optimization, understanding the spectrum of the Hessian can indicate whether the descent path will be effectively navigable or hindered by saddle points. A predominant positive eigenvalue corresponds to a potential descent funnel, while a negative indicates a potential escalation towards divergence . This helps in diagnosing optimization difficulties and selecting appropriate algorithms or preconditioning techniques to mitigate issues .

First-order optimization algorithms, such as gradient descent, use only the gradient for optimization and are generally simpler and require less computational resources . Second-order optimization algorithms, like Newton's method, also use the Hessian matrix, providing additional curvature information that can improve convergence speed and accuracy. However, they are more computationally intensive due to the need to compute and invert the Hessian . In deep learning, first-order methods are often preferred due to their scalability with large datasets and models, despite potentially slower convergence rates .

The gradient of a function provides the direction of the steepest ascent. To minimize the function, one should move in the opposite direction of the gradient, which is known as gradient descent or steepest descent . The directional derivative in any direction u provides the slope of the function in that direction. To find the direction where the function decreases the fastest, the directional derivative can be evaluated, and it is minimized when u points in the opposite direction as the gradient .

Lipschitz continuity or having Lipschitz continuous derivatives implies that changes in the function do not vary too rapidly, which provides certain stability and boundedness guarantees in optimization . In deep learning, these properties can be useful for proving convergence of algorithms, ensuring that the steps taken by the optimization algorithms do not diverge. However, not all deep learning problems naturally exhibit Lipschitz continuity, which limits its applicability .

Saddle points are critical points where the function does not achieve a local minimum or maximum, characterized by directions of both positive and negative curvature . In gradient descent, reaching a saddle point can halt progress as the gradient tends to zero, misleading the algorithm to believe a minimum has been found while it is in fact a flat region . This can significantly slow down or mislead convergence, particularly in high-dimensional spaces common in deep learning .

Convex optimization refers to optimization over convex functions, where the Hessian is positive semidefinite everywhere, ensuring all local minima are global . While convex optimization provides many guarantees, deep learning problems are typically non-convex due to complex loss functions, making direct applications challenging . Thus, its role is limited, used primarily in subroutines or to inform convergence proofs without being directly applied to full deep learning models .

You might also like