
Gradient Descent: Learning Rate & Types

Uploaded by

charankolan9

1. Explain the role of the learning rate and the cost function in Gradient Descent, and
describe what happens if the learning rate is set too high or too low.

Gradient Descent (GD) is an iterative, first-order optimization algorithm that minimizes a
function (usually a cost or loss function) by repeatedly stepping in the direction of steepest
descent, i.e. the negative gradient. It finds a local minimum of that function, and it is widely
used in machine learning (ML) and deep learning (DL) to update model parameters and reduce
error.

The starting point is an arbitrary initial choice of parameters. At that point we compute the
derivative (slope) of the cost function; the tangent line there shows how steep the slope is and
in which direction it falls. The slope then determines the update to the parameters, i.e. the
weights and bias.
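
The procedure above can be sketched in a few lines of Python. This is a minimal illustration, not a method from the original notes: the cost J(x) = (x - 3)^2, the function names, the starting point 10.0, and the default learning rate and step count are all assumptions chosen for the example.

```python
def gradient_descent(grad, start, learning_rate=0.1, steps=100):
    """Repeatedly step against the gradient, starting from an arbitrary point."""
    x = start
    for _ in range(steps):
        # Move opposite to the slope; the learning rate scales the step.
        x = x - learning_rate * grad(x)
    return x


def grad_j(x):
    """Derivative of the assumed cost J(x) = (x - 3)^2."""
    return 2.0 * (x - 3.0)


x_min = gradient_descent(grad_j, start=10.0)
print(x_min)  # very close to 3.0, the minimizer of the assumed cost
```

Any differentiable cost could be substituted by passing a different `grad` function; only the derivative is needed, which is what makes the method first-order.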

Role of Learning Rate in Gradient Descent

The learning rate (denoted as α) controls the step size when updating model parameters during
optimization.

Function: It determines how fast or slow the algorithm converges to the minimum of the cost
function.

Too High: If the learning rate is too large, the algorithm may overshoot the minimum, causing
divergence or oscillation.

Too Low: If the learning rate is very small, convergence becomes extremely slow, requiring
many iterations and increasing computational cost.
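
A short worked case may make both failure modes concrete. It assumes a one-dimensional cost J(θ) = θ², chosen only for illustration (it does not come from the original notes), and uses α for the learning rate as above.

```latex
\begin{aligned}
J(\theta) &= \theta^2, \qquad \nabla J(\theta) = 2\theta \\
\theta_{t+1} &= \theta_t - \alpha \, \nabla J(\theta_t) = (1 - 2\alpha)\,\theta_t \\
\theta_t &= (1 - 2\alpha)^t \, \theta_0
\end{aligned}
```

For this particular cost, the iterates shrink toward the minimum only when |1 - 2α| < 1, that is, 0 < α < 1. A very small α makes the factor (1 - 2α) close to 1, so each step removes little error and convergence is slow. For 0.5 < α < 1 the factor is negative, so θ jumps across the minimum on every step (oscillation) while still shrinking. For α > 1 the factor exceeds 1 in magnitude and the iterates diverge. The exact thresholds depend on the cost function; only the qualitative pattern carries over.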

Role of Cost Function in Gradient Descent

The cost function (also called loss function) measures the error between predicted output and
actual output.

Purpose: Gradient Descent minimizes this cost function by adjusting parameters in the opposite
direction of the gradient.

Example: For linear regression, a common cost function is the Mean Squared Error (MSE).
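
As a hedged sketch of what this looks like in code, the snippet below computes the MSE and its gradient for a one-feature linear model y ≈ w·x + b. The function names, the model form, and the toy data are assumptions made for illustration.

```python
def predict(w, b, x):
    """Linear model with one feature (assumed form)."""
    return w * x + b


def mse(w, b, xs, ys):
    """Mean Squared Error between predictions and targets."""
    m = len(xs)
    return sum((predict(w, b, x) - y) ** 2 for x, y in zip(xs, ys)) / m


def mse_gradient(w, b, xs, ys):
    """Partial derivatives of the MSE with respect to w and b."""
    m = len(xs)
    dw = sum(2.0 * (predict(w, b, x) - y) * x for x, y in zip(xs, ys)) / m
    db = sum(2.0 * (predict(w, b, x) - y) for x, y in zip(xs, ys)) / m
    return dw, db


# Hypothetical data generated by y = 2x + 1.
xs = [1.0, 2.0, 3.0]
ys = [3.0, 5.0, 7.0]

print(mse(2.0, 1.0, xs, ys))           # zero error at the true parameters
print(mse(0.0, 0.0, xs, ys))           # larger error at a poor guess
print(mse_gradient(0.0, 0.0, xs, ys))  # both partials negative: increase w and b
```

Gradient Descent would subtract the learning rate times `dw` and `db` from `w` and `b` at each step.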

If the learning rate is set too high or too low:

[Figure: effect of the learning rate on Gradient Descent]

• Blue path (low LR = 0.1): moves slowly toward the minimum, requiring many steps.
• Red path (high LR = 0.8): overshoots the minimum and oscillates, risking divergence.
• Green point: starting position.
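
The two paths described above can be approximated numerically. The sketch below assumes the cost J(x) = x² and a starting point of 1.0, since the function behind the original diagram is not stated; the third rate, 1.1, is added only to show outright divergence.

```python
def descend(learning_rate, start=1.0, steps=10):
    """Return the sequence of points visited on the assumed cost J(x) = x^2."""
    path = [start]
    x = start
    for _ in range(steps):
        x = x - learning_rate * 2.0 * x  # gradient of x^2 is 2x
        path.append(x)
    return path


low = descend(0.1)       # stays on one side, shrinking slowly each step
high = descend(0.8)      # jumps across the minimum on every step
too_high = descend(1.1)  # each jump lands farther away than the last

for name, path in [("low", low), ("high", high), ("too_high", too_high)]:
    print(name, [round(v, 3) for v in path[:5]])
```

On this simple cost the 0.8 run still converges despite the oscillation; on less well-behaved costs the same overshooting can fail to settle, which is the risk the diagram points to.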

2. Compare Batch Gradient Descent and Stochastic Gradient Descent in
terms of accuracy, speed, and suitability for large datasets.

Batch Gradient Descent

An optimization algorithm that updates model parameters by computing the gradient of the
cost function using the entire training dataset in each iteration.

Advantages:

• More stable and accurate updates.

Disadvantages:

• Very slow for large datasets.
• Requires high memory.

Stochastic Gradient Descent

An optimization algorithm that updates model parameters by computing the gradient of the cost
function using only one training example at a time.

Advantages:

• Fast updates (especially for large datasets).
• Requires low memory.
• Can escape local minima due to randomness.

Disadvantages:

• High variance in updates; the path to convergence is noisy.
• May not converge exactly, but instead oscillates near the minimum.

| Aspect | Batch Gradient Descent | Stochastic Gradient Descent (SGD) |
|---|---|---|
| Accuracy | High accuracy due to precise gradient calculation using the entire dataset. | Lower accuracy per update because of noisy gradient estimates. |
| Speed | Slow for large datasets since it processes all data before each update. | Fast per iteration because it updates after each sample. |
| Memory Usage | Requires high memory to load and compute on the full dataset. | Low memory requirement as it uses one sample at a time. |
| Suitability for Large Datasets | Less suitable; computationally expensive for big data. | Well-suited for large datasets and online learning. |
| Convergence | Smooth and stable convergence to the minimum. | Converges faster but with fluctuations around the minimum. |
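
The accuracy row can be illustrated with a small numeric sketch: each single-sample gradient that SGD uses is a noisy estimate of the full-dataset gradient that Batch Gradient Descent uses. The model (y ≈ w·x with squared error), the toy data, and the names below are assumptions for illustration only.

```python
# Hypothetical data, roughly y = 2x with some noise.
xs = [0.5, 1.0, 1.5, 2.0]
ys = [1.2, 1.8, 3.1, 3.9]
w = 0.0  # current parameter value


def sample_gradient(w, x, y):
    """Gradient of the squared error (w*x - y)^2 for one example."""
    return 2.0 * (w * x - y) * x


# SGD would take a step using any one of these values on its own.
per_sample = [sample_gradient(w, x, y) for x, y in zip(xs, ys)]

# Batch Gradient Descent averages all of them before taking a single step.
full = sum(per_sample) / len(per_sample)

print(per_sample)
print(full)
```

The individual values scatter around the average, which is why SGD steps are cheap but noisy, while the batch step needs a pass over every example before the parameters move.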

3. What is Gradient Descent? Describe its purpose and explain the three main types of
Gradient Descent, highlighting their key differences.

Gradient Descent

Gradient Descent is an optimization algorithm used to minimize a cost (loss) function by iteratively
adjusting model parameters in the direction of the negative gradient (steepest descent).

• It is used to find the set of parameters (weights) that minimizes the cost function in
machine learning and deep learning models.
• It is commonly used in linear regression, logistic regression, and neural networks.

1. Batch Gradient Descent

Batch Gradient Descent is a variant of the gradient descent algorithm in which the entire dataset
is used to compute the gradient of the loss function with respect to the parameters. In each
iteration, the algorithm calculates the average gradient of the loss function over all training
examples and updates the model parameters accordingly.

Advantages of Batch Gradient Descent:

• Stable Convergence: Since the gradient is averaged over all training examples, the updates are
less noisy and more stable.
• Global View: It considers the entire dataset for each update, providing a global perspective of
the loss landscape.

Disadvantages of Batch Gradient Descent:

• Computationally Expensive: Processing the entire dataset in each iteration can be slow and
resource-intensive, especially for large datasets.
• Memory Intensive: It typically requires storing and processing the entire dataset in memory,
which can be impractical for very large datasets.
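
Written as a formula, one batch update averages the per-example gradients over the whole training set. The symbols are assumptions introduced here for illustration: θ for the parameters, α for the learning rate, m for the number of training examples, and ℓ for the per-example loss of a model f_θ.

```latex
\theta \leftarrow \theta - \alpha \cdot \frac{1}{m} \sum_{i=1}^{m}
    \nabla_{\theta}\, \ell\bigl(f_{\theta}(x_i),\, y_i\bigr)
```

The sum over all m examples is what makes each update stable, and also what makes it expensive: every example must be visited before the parameters change once.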

2. Stochastic Gradient Descent

Stochastic Gradient Descent (SGD) is a variant of the gradient descent algorithm in which the
model parameters are updated using the gradient of the loss function with respect to a single
training example at each iteration. Unlike batch gradient descent, which uses the entire dataset
for every update, SGD updates the parameters far more frequently, which often yields faster
progress per pass over the data.

Advantages of Stochastic Gradient Descent

• Faster Convergence: Frequent updates can lead to faster convergence, especially on large
datasets.
• Less Memory Intensive: Since it processes one training example at a time, it requires less
memory than batch gradient descent.
• Better for Online Learning: Suitable for scenarios where data arrives as a stream,
allowing the model to be updated continuously.

Disadvantages of Stochastic Gradient Descent

• Noisy Updates: Updates can be noisy, leading to a more erratic convergence path.
• Potential for Overshooting: The frequent, noisy updates can cause the algorithm to overshoot
the minimum, especially with a high learning rate.
• Hyperparameter Sensitivity: Requires careful tuning of the learning rate to ensure stable
and efficient convergence.

3. Mini-Batch Gradient Descent

Mini-Batch Gradient Descent is a compromise between Batch Gradient Descent and Stochastic
Gradient Descent. Instead of using the entire dataset or a single training example, Mini-Batch
Gradient Descent updates the model parameters using a small, random subset of the training
data called a mini-batch.

Advantages of Mini-Batch Gradient Descent

• Faster Convergence: By using mini-batches, it achieves a balance between the noisy
updates of SGD and the stable updates of Batch Gradient Descent, often leading to faster
convergence.
• Reduced Memory Usage: Requires less memory than Batch Gradient Descent, as it only
needs to store one mini-batch at a time.
• Efficient Computation: Allows for efficient use of hardware optimizations and parallel
processing, making it suitable for large datasets.

Disadvantages of Mini-Batch Gradient Descent

• Complexity in Tuning: Requires careful tuning of the mini-batch size and learning rate to
ensure optimal performance.
• Less Stable than Batch GD: While more stable than SGD, it can still be less stable than
Batch Gradient Descent, especially if the mini-batch size is too small.
• Potential for Suboptimal Mini-Batch Sizes: Selecting an inappropriate mini-batch size
can lead to suboptimal performance and convergence issues.
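
The key difference between the three types is how many examples feed each update, which a single function with a `batch_size` parameter can illustrate. This is a minimal sketch under stated assumptions: a one-weight model y ≈ w·x with MSE loss, noise-free toy data, and hypothetical names and default values.

```python
import random


def minibatch_gd(xs, ys, batch_size, lr=0.05, epochs=50, seed=0):
    """Fit y ~ w * x by gradient descent; return (w, number_of_updates)."""
    rng = random.Random(seed)
    indices = list(range(len(xs)))
    w, updates = 0.0, 0
    for _ in range(epochs):
        rng.shuffle(indices)  # visit the examples in a new random order
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            # Average the per-example gradients over this mini-batch only.
            grad = sum(2.0 * (w * xs[i] - ys[i]) * xs[i] for i in batch) / len(batch)
            w -= lr * grad
            updates += 1
    return w, updates


# Hypothetical noise-free data generated by y = 2x.
xs = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0]
ys = [2.0 * x for x in xs]

w_sgd, n_sgd = minibatch_gd(xs, ys, batch_size=1)          # behaves like SGD
w_mini, n_mini = minibatch_gd(xs, ys, batch_size=2)        # mini-batch
w_batch, n_batch = minibatch_gd(xs, ys, batch_size=len(xs))  # behaves like Batch GD

print(n_sgd, n_mini, n_batch)  # updates performed over the same 50 epochs
print(w_sgd, w_mini, w_batch)
```

Because the toy data contains no noise, all three settings reach nearly the same weight; the visible difference is the number of updates per pass over the data. With noisy data, the smaller batch sizes would also show more fluctuation around the minimum.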

1. Momentum-Based Gradient Descent

Momentum-Based Gradient Descent is an enhancement of the standard gradient descent algorithm
that aims to accelerate convergence, particularly in the presence of high curvature, small but
consistent gradients, or noisy gradients. It introduces a velocity term that accumulates the
gradient of the loss function over time, thereby smoothing the path taken by the parameters.

Advantages of Momentum-Based Gradient Descent

• Accelerated Convergence: Helps achieve faster convergence, especially in scenarios with small
but consistent gradients.
• Smoother Updates: Reduces the oscillations in the gradient updates, leading to a smoother
and more stable convergence path.
• Effective in Ravines: Particularly effective in dealing with ravines, i.e. regions of steep
curvature, which are common in deep learning loss landscapes.

Disadvantages of Momentum-Based Gradient Descent

• Additional Hyperparameter: Introduces an additional hyperparameter (the momentum term)
that needs to be tuned.
• Complex Implementation: Slightly more complex to implement than standard
gradient descent.
• Potential Overcorrection: If not properly tuned, the momentum can lead to overcorrection
and instability in the updates.
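
The velocity term can be sketched as follows. The notes do not give an exact update formula, so this uses the classical momentum form as an assumption; the cost J(x) = (x - 3)^2, the momentum value 0.9, the small learning rate 0.01, and all names are likewise illustrative.

```python
def plain_descent(grad, start, learning_rate, steps):
    """Standard gradient descent, for comparison."""
    x = start
    for _ in range(steps):
        x -= learning_rate * grad(x)
    return x


def momentum_descent(grad, start, learning_rate, steps, momentum=0.9):
    """Gradient descent with a velocity that accumulates past gradients."""
    x, velocity = start, 0.0
    for _ in range(steps):
        # Keep a fraction of the previous velocity, then add the new step.
        velocity = momentum * velocity - learning_rate * grad(x)
        x += velocity
    return x


def grad_j(x):
    """Derivative of the assumed cost J(x) = (x - 3)^2."""
    return 2.0 * (x - 3.0)


x_plain = plain_descent(grad_j, start=10.0, learning_rate=0.01, steps=100)
x_momentum = momentum_descent(grad_j, start=10.0, learning_rate=0.01, steps=100)

print(abs(x_plain - 3.0))     # remaining distance without momentum
print(abs(x_momentum - 3.0))  # remaining distance with momentum
```

With the same small learning rate and step budget, the momentum run ends much closer to the minimum in this setting, because consecutive gradients point the same way and the velocity builds up. Raising `momentum` too far, or combining it with a large learning rate, can produce the overcorrection listed above.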
