Stochastic Gradient Descent Explained

Q: How does the noise introduced in the data affect the model training, and what steps in SGD can mitigate its impact?

Noise in the data causes the gradient estimates to fluctuate during training, which can slow convergence and hurt generalization. In Stochastic Gradient Descent, this impact can be mitigated by tuning the learning rate carefully, using mini-batches rather than single samples to reduce gradient variance, and applying early stopping or a learning rate schedule to stabilize convergence. Regularization also helps by discouraging the model from fitting the noise.

Q: How does the randomness introduced in Stochastic Gradient Descent (SGD) impact its convergence path compared to traditional Gradient Descent?

The randomness in Stochastic Gradient Descent (SGD) makes its convergence path noisier than that of traditional Gradient Descent. The noise arises because each update uses a single training example or a small batch instead of the entire dataset. The noisy path may need more iterations to reach the minimum, but each iteration is far cheaper, so the exact path matters little as long as the minimum is reached in less total training time. This makes SGD computationally less expensive overall and often preferred in practice for large datasets.

Q: How does the mini-batch size in SGD affect the optimization process and computational efficiency?

The mini-batch size in SGD trades off the high variance of single-example updates against the cost of full-batch Gradient Descent. Smaller mini-batches introduce more noise into the convergence path, which may require more iterations but can also improve generalization. Larger mini-batches reduce the path's randomness but increase the computational cost per iteration. In practice, small to moderate mini-batch sizes are usually preferred because they offer a good trade-off between computational efficiency and stable convergence.
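The effect of batch size on gradient noise can be checked numerically. The sketch below is an illustration under stated assumptions, not part of the lab code: the synthetic linear-regression data, the helper names `minibatch_gradient` and `gradient_spread`, and the batch sizes are all chosen here for demonstration. It draws many mini-batch gradients at one fixed parameter vector and measures how much they scatter.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic linear-regression data (an assumption made for this illustration)
X = rng.normal(size=(10_000, 5))
y = X @ np.array([1.0, 2.0, 3.0, 4.0, 5.0]) + 0.1 * rng.normal(size=10_000)
theta = np.zeros(5)  # every gradient is evaluated at the same point


def minibatch_gradient(batch_size):
    """Gradient of the squared-error cost on one random mini-batch."""
    idx = rng.choice(len(y), size=batch_size, replace=False)
    error = X[idx] @ theta - y[idx]
    return X[idx].T @ error / batch_size


def gradient_spread(batch_size, trials=200):
    """Total variance (summed over coordinates) of repeated mini-batch gradients."""
    grads = np.array([minibatch_gradient(batch_size) for _ in range(trials)])
    return float(grads.var(axis=0).sum())


spreads = {b: gradient_spread(b) for b in (1, 8, 64, 512)}
for b, s in spreads.items():
    print(f"batch_size={b:4d}  gradient variance={s:.4f}")
```

The printed variance should shrink roughly in proportion to the batch size, which is the noise reduction that larger mini-batches buy at a higher cost per update.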

Q: In what ways can learning rate schedules be employed to enhance the training process in SGD, and what are their potential drawbacks?

Learning rate schedules enhance the training process in SGD by adjusting the learning rate as training proceeds. For instance, starting with a larger learning rate that decreases over time can accelerate initial progress while stabilizing final convergence. Fixed schedules such as step or exponential decay reduce the rate as a function of the epoch, whereas adaptive approaches lower it in response to training performance, for example when the loss stops improving. Potential drawbacks include the added complexity of tuning the schedule and the risk of decaying too quickly, which slows convergence before a good solution is reached.
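Two common fixed schedules can be written as small functions of the epoch number. This is a minimal sketch; the function names and parameters are illustrative choices, not taken from a particular library.

```python
import math


def exponential_decay(initial_lr, decay_rate, epoch):
    """Learning rate that shrinks smoothly: initial_lr * exp(-decay_rate * epoch)."""
    return initial_lr * math.exp(-decay_rate * epoch)


def step_decay(initial_lr, drop, epochs_per_drop, epoch):
    """Learning rate multiplied by `drop` once every `epochs_per_drop` epochs."""
    return initial_lr * drop ** (epoch // epochs_per_drop)


for epoch in range(0, 10, 2):
    print(epoch,
          exponential_decay(0.1, 0.3, epoch),
          step_decay(1e-4, 0.75, 2, epoch))
```

With `drop=0.75` and `epochs_per_drop=2`, `step_decay` lowers the rate by a quarter every two epochs, so the rate stays constant within each two-epoch block and then drops.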

Q: What are the computational benefits of using Stochastic Gradient Descent (SGD) in machine learning projects with large datasets?

Stochastic Gradient Descent (SGD) reduces the computational cost per iteration in machine learning projects with large datasets. Unlike traditional Gradient Descent, which processes the entire dataset to calculate each gradient, SGD uses only a single training example or a small batch per update. This significantly reduces the work per update and speeds up optimization, which makes SGD efficient for large datasets.

Q: What convergence criteria are typically used in SGD, and why are they important?

Convergence criteria in SGD typically check whether the change in the cost function between iterations falls below a predefined threshold or whether the norm of the gradient becomes sufficiently small. These criteria matter because they determine when the iterative optimization stops, so that the model has adequately minimized the error without running unnecessary iterations.
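Both criteria fit in one small helper. The sketch below is illustrative: the function name `has_converged` and the default tolerances are assumptions, and in SGD the gradient passed in is usually a noisy mini-batch estimate, so thresholds are often set loosely.

```python
import numpy as np


def has_converged(prev_loss, loss, grad, loss_tol=1e-6, grad_tol=1e-3):
    """Return True if the loss has stopped changing or the gradient is small."""
    small_loss_change = abs(prev_loss - loss) < loss_tol
    small_gradient = np.linalg.norm(grad) < grad_tol
    return bool(small_loss_change or small_gradient)


print(has_converged(0.5201, 0.5200, np.array([0.2, -0.1])))        # loss still moving
print(has_converged(0.5200, 0.5200, np.array([0.2, -0.1])))        # loss has stalled
print(has_converged(0.9000, 0.5200, np.array([1e-5, -1e-5])))      # gradient is tiny
```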

Q: Describe how the concept of back-propagation is applied in conjunction with SGD for model training.

Back-propagation supplies the gradients that SGD uses to update the model parameters. During back-propagation, the error at the output layer is propagated backward through the layers using the chain rule, and the gradient of the loss with respect to each parameter is computed. SGD then uses these gradients to update the parameters, with the step size controlled by the learning rate. Repeating this adjustment reduces the loss and moves the model toward good parameter values.
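To make the division of labor concrete, here is a minimal sketch of a one-hidden-layer network trained with mini-batch SGD, with back-propagation written out by hand. Everything in it is an assumption made for illustration: the toy sine target, the layer sizes, the tanh activation, and the hyperparameters.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy regression data (assumed for illustration)
X = rng.normal(size=(256, 3))
y = np.sin(X.sum(axis=1, keepdims=True))

# Parameters of a 3 -> 8 -> 1 network
W1 = rng.normal(scale=0.5, size=(3, 8))
b1 = np.zeros(8)
W2 = rng.normal(scale=0.5, size=(8, 1))
b2 = np.zeros(1)
lr, batch_size = 0.05, 32


def forward(inputs):
    """Return the hidden activations and the network output."""
    hidden = np.tanh(inputs @ W1 + b1)
    return hidden, hidden @ W2 + b2


def full_loss():
    """Mean squared error over the whole dataset."""
    return float(np.mean((forward(X)[1] - y) ** 2))


initial_loss = full_loss()
for epoch in range(200):
    order = rng.permutation(len(X))  # reshuffle every epoch
    for start in range(0, len(X), batch_size):
        idx = order[start:start + batch_size]
        hidden, output = forward(X[idx])

        # Back-propagation: apply the chain rule from the output backward
        d_output = 2.0 * (output - y[idx]) / len(idx)        # dLoss/dOutput
        grad_W2 = hidden.T @ d_output
        grad_b2 = d_output.sum(axis=0)
        d_hidden = (d_output @ W2.T) * (1.0 - hidden ** 2)   # through tanh
        grad_W1 = X[idx].T @ d_hidden
        grad_b1 = d_hidden.sum(axis=0)

        # SGD: step against each gradient, scaled by the learning rate
        W1 -= lr * grad_W1
        b1 -= lr * grad_b1
        W2 -= lr * grad_W2
        b2 -= lr * grad_b2

final_loss = full_loss()
print(f"loss before: {initial_loss:.4f}  after: {final_loss:.4f}")
```

Back-propagation only computes the gradients; the four update lines at the end of the inner loop are the SGD step.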

Q: What role does the learning rate play in the functionality of Stochastic Gradient Descent, and how can it affect the outcome?

The learning rate in Stochastic Gradient Descent (SGD) determines the size of the step taken toward minimizing the error at each update. A learning rate that is too small leads to slow convergence, while one that is too large can overshoot the minimum or diverge. Tuning the learning rate is therefore crucial: it must balance convergence speed against stability. The learning rate can also follow a schedule or be adapted during training to improve results.

Q: Why is it important to shuffle the training data before each pass in Stochastic Gradient Descent?

Shuffling the training data before each pass (epoch) over the dataset ensures that the mini-batches differ from one pass to the next and that the order of the examples does not bias the updates. This randomness helps prevent the algorithm from falling into repeating cycles and generally improves convergence behavior, although it does not by itself guarantee that the global minimum is found.
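A per-epoch shuffle can be sketched as a small generator that yields mini-batches in a fresh random order. The name `shuffled_minibatches` and the toy arrays are hypothetical, introduced only for this example.

```python
import numpy as np


def shuffled_minibatches(X, y, batch_size, rng):
    """Yield (X_batch, y_batch) pairs covering the data once, in random order."""
    order = rng.permutation(len(y))
    for start in range(0, len(y), batch_size):
        idx = order[start:start + batch_size]
        yield X[idx], y[idx]


rng = np.random.default_rng(0)
X = np.arange(10).reshape(10, 1)
y = np.arange(10)

for epoch in range(2):
    # Calling the generator again reshuffles, so each epoch sees a new order
    batches = list(shuffled_minibatches(X, y, batch_size=4, rng=rng))
    print(f"epoch {epoch}:", [yb.tolist() for _, yb in batches])
```

Indexing `X` and `y` with the same permutation keeps every input paired with its target, and each example is still visited exactly once per epoch.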

Q: What are the primary functions defined in the SGD class for implementing the algorithm, and how do they contribute to model training?

The primary methods defined in the SGD class are Gradient, Fit, and Predict. The Gradient method computes the gradient of the cost function with respect to the model parameters, which is needed to adjust the parameters iteratively. The Fit method fits the model to the training dataset by shuffling the data, calculating the gradient for each mini-batch, and updating the parameters. The Predict method estimates target values by applying the learned model to new data. Together, these methods implement training and prediction.

Stochastic gradient descent (SGD) is an optimization algorithm for machine learning models that addresses the computational inefficiency of traditional gradient descent on large datasets. SGD uses a random sample of the training data in each iteration rather than the full dataset, which reduces the computational cost. Although its path is noisier than that of traditional gradient descent, SGD usually reaches the minimum in less training time because each update is so much cheaper. The document provides pseudocode for an SGD algorithm and an example Python implementation that optimizes the parameters for a random dataset.

Uploaded by

ayten55zoweil


Alamein University

Faculty of Computer Science & Engineering

Neural Networks, Course Code: AIE231

Neural Network Lab 5

Stochastic Gradient Descent (SGD)

Stochastic Gradient Descent (SGD) is a variant of the Gradient Descent algorithm that is
used for optimizing machine learning models. It addresses the
computational inefficiency of traditional Gradient Descent methods when
dealing with large datasets in machine learning projects.

In SGD, instead of using the entire dataset for each iteration, only a single
random training example (or a small batch) is selected to calculate the
gradient and update the model parameters. This random selection
introduces randomness into the optimization process, hence the term
“stochastic” in Stochastic Gradient Descent.

The advantage of using SGD is its computational efficiency, especially when
dealing with large datasets. By using a single example or a small batch, the
computational cost per iteration is significantly reduced compared to
traditional Gradient Descent methods that require processing the entire
dataset.

Stochastic Gradient Descent Algorithm

1. Initialization: Randomly initialize the parameters of the model.
2. Set Parameters: Determine the number of iterations and the learning rate
   (alpha) for updating the parameters.
3. Stochastic Gradient Descent Loop: Repeat the following steps until the
   model converges or reaches the maximum number of iterations:
   a. Shuffle the training dataset to introduce randomness.
   b. Iterate over each training example (or a small batch) in the shuffled
      order.
   c. Compute the gradient of the cost function with respect to the model
      parameters using the current training example (or batch).
   d. Update the model parameters by taking a step in the direction of the
      negative gradient, scaled by the learning rate.
   e. Evaluate the convergence criteria, such as the change in the cost
      function between iterations or the norm of the gradient.
4. Return Optimized Parameters: Once the convergence criteria are met or
   the maximum number of iterations is reached, return the optimized model
   parameters.

In SGD, since only one sample (or a small batch) from the dataset is chosen
at random for each update, the path taken by the algorithm to reach the
minimum is usually noisier than that of the standard Gradient Descent
algorithm. This matters little in practice: the path itself is unimportant
as long as the algorithm reaches the minimum, and it does so with a
significantly shorter training time.

The path taken by Batch Gradient Descent is shown below:

A path taken by Stochastic Gradient Descent looks as follows:

Note that because SGD is generally noisier than standard Gradient Descent,
it usually takes a higher number of iterations to reach the minimum, owing
to the randomness in its descent. Even so, each iteration is so much cheaper
that SGD is still computationally much less expensive overall than standard
Gradient Descent. Hence, in most scenarios, SGD is preferred over Batch
Gradient Descent for optimizing a learning algorithm.
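The cost argument can be checked by counting how many per-example gradient evaluations each method needs to reach the same loss. The sketch below is a rough illustration under assumptions made here: synthetic linear-regression data, a shared learning rate of 0.05, a target mean squared error of 0.02, and a hypothetical helper `run_until`. The loss monitoring itself is not counted as work.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 10_000, 5
X = rng.normal(size=(n, d))
y = X @ np.array([1.0, 2.0, 3.0, 4.0, 5.0]) + 0.1 * rng.normal(size=n)


def full_loss(theta):
    """Mean squared error over the whole dataset."""
    return np.mean((X @ theta - y) ** 2)


def run_until(target_loss, lr, batch_size, max_steps=10_000):
    """Return (update steps, per-example gradient evaluations) needed
    to bring the full-dataset loss below target_loss."""
    theta = np.zeros(d)
    steps = 0
    while full_loss(theta) > target_loss and steps < max_steps:
        idx = rng.choice(n, size=batch_size, replace=False)
        error = X[idx] @ theta - y[idx]
        theta -= lr * X[idx].T @ error / batch_size
        steps += 1
    return steps, steps * batch_size


# batch_size=n uses every example per update, i.e. Batch Gradient Descent
batch_steps, batch_cost = run_until(0.02, lr=0.05, batch_size=n)
sgd_steps, sgd_cost = run_until(0.02, lr=0.05, batch_size=32)

print(f"batch GD: {batch_steps} updates, {batch_cost} example gradients")
print(f"SGD:      {sgd_steps} updates, {sgd_cost} example gradients")
```

On data like this, the number of updates is of the same order for both methods, but each SGD update touches 32 examples instead of 10,000, so the total work differs by a large factor.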

Python Code For Stochastic Gradient Descent

We will create an SGD class with methods for updating the parameters,
fitting the training dataset, and predicting new test data. The methods
are as follows:
Gradient – This method computes the gradient used to update the parameters
of the model. For every mini-batch, it calculates the error between the
predicted and actual values and, from it, the gradient of the cost function.
Fit – This method fits the model to the training dataset. In each iteration
it shuffles the data indices, calculates the gradient for each mini-batch,
and updates the parameter theta.
Predict – This method predicts values for new data points. The prediction
is simply the dot product of the input data and the parameter theta.
import numpy as np


class SGD:
    def __init__(self, lr=0.01, max_iter=1000, batch_size=32, tol=1e-3):
        # learning rate of the SGD optimizer
        self.learning_rate = lr
        # maximum number of passes (epochs) over the data
        self.max_iteration = max_iter
        # mini-batch size of the data
        self.batch_size = batch_size
        # tolerance on the gradient norm used to detect convergence
        self.tolerance_convergence = tol
        # model parameters; initialized in fit()
        self.theta = None

    def fit(self, X, y):
        # store the number of samples (n) and features (d)
        n, d = X.shape
        # initialize a random theta for every feature
        self.theta = np.random.randn(d)
        for epoch in range(self.max_iteration):
            # shuffle the data
            indices = np.random.permutation(n)
            X = X[indices]
            y = y[indices]
            # iterate over mini-batches
            for start in range(0, n, self.batch_size):
                X_batch = X[start:start + self.batch_size]
                y_batch = y[start:start + self.batch_size]
                grad = self.gradient(X_batch, y_batch)
                self.theta -= self.learning_rate * grad
            # check for convergence using the gradient of the last mini-batch
            if np.linalg.norm(grad) < self.tolerance_convergence:
                break

    # gradient of the squared-error cost on a batch of data
    def gradient(self, X, y):
        n = len(y)
        # predict target values as the dot product of the
        # input features and theta
        y_pred = np.dot(X, self.theta)
        # calculate the error between predicted and actual values
        error = y_pred - y
        grad = np.dot(X.T, error) / n
        return grad

    def predict(self, X):
        # predict y values using the learned theta
        y_pred = np.dot(X, self.theta)
        return y_pred

SGD Implementation
We will create a random dataset with 100 rows and 5 columns and fit
our SGD class on this data. We will then use the predict method of the
SGD class to generate predictions.
# Create random dataset with 100 rows and 5 columns
X = np.random.randn(100, 5)
# create the corresponding target values by adding random
# noise to a linear function of X
y = np.dot(X, np.array([1, 2, 3, 4, 5])) \
    + np.random.randn(100) * 0.1
# Create an instance of the SGD class
model = SGD(lr=0.01, max_iter=1000,
            batch_size=32, tol=1e-3)
model.fit(X, y)
# Predict using the predict method of the model
y_pred = model.predict(X)

This cycle of making predictions, measuring the loss, and adjusting the
parameters to reduce it is the training loop. In a neural network, the
gradients used for each adjustment are computed by back-propagation.

Tuning your learning rate

The curve of loss against learning rate consists of three distinct parts: learning rates
that are too small to make meaningful progress; a region of steepest descent, which leads
into an optimal or near-optimal learning rate; and, past the edge of that region, noise and
eventually divergence (this is where the learning rate is too large).
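The three regions can be reproduced with a simple learning rate sweep. The sketch below is an assumption-laden illustration rather than the lab's procedure: it uses synthetic linear-regression data, a fixed budget of 100 mini-batch updates per learning rate, and a hypothetical helper `loss_after_training`.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = X @ np.array([1.0, 2.0, 3.0, 4.0, 5.0]) + 0.1 * rng.normal(size=1000)


def loss_after_training(lr, steps=100, batch_size=32):
    """Run a fixed number of SGD updates and return the final full-data loss."""
    theta = np.zeros(5)
    for _ in range(steps):
        idx = rng.choice(len(y), size=batch_size, replace=False)
        error = X[idx] @ theta - y[idx]
        theta -= lr * X[idx].T @ error / batch_size
    return float(np.mean((X @ theta - y) ** 2))


losses = {lr: loss_after_training(lr) for lr in (1e-4, 1e-3, 1e-2, 1e-1, 3.0)}
for lr, loss in losses.items():
    print(f"lr={lr:<7g} final loss={loss:.4g}")
```

The smallest rates barely reduce the loss within the budget, a rate around 0.1 gets close to the noise floor, and the largest rate makes the loss blow up.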

import keras
from keras.models import Sequential
from keras.layers import Dense
from keras.optimizers import SGD
from keras.callbacks import LearningRateScheduler

import numpy as np
import pandas as pd

# Generate dummy data: the label is the index of the largest feature
X_train = np.random.random((1000, 3))
y_train = pd.get_dummies(np.argmax(X_train[:, :3], axis=1),
                         dtype="float32").values
X_test = np.random.random((100, 3))
y_test = pd.get_dummies(np.argmax(X_test[:, :3], axis=1),
                        dtype="float32").values

# Step decay: multiply the learning rate by 0.75 every 2 epochs
lr_sched = LearningRateScheduler(
    lambda epoch, lr: float(1e-4 * (0.75 ** np.floor(epoch / 2))))

# Build the model.
clf = Sequential()
clf.add(Dense(9, activation='relu', input_dim=3))
clf.add(Dense(9, activation='relu'))
clf.add(Dense(3, activation='softmax'))
# Change the learning rate to show the loss
# (older Keras versions name this argument lr)
optimizer = keras.optimizers.SGD(learning_rate=0.0001)
clf.compile(loss='categorical_crossentropy', optimizer=optimizer)

# Perform training with a fixed learning rate.
clf.fit(X_train, y_train, epochs=10, batch_size=500)

# Recompile and train again, this time with the learning rate schedule.
clf.compile(loss='categorical_crossentropy', optimizer=SGD())

# Perform training.
clf.fit(X_train, y_train, epochs=10, batch_size=500, callbacks=[lr_sched])
