0% found this document useful (0 votes)
21 views12 pages

Stochastic Gradient Descent in Deep Learning

Stochastic Gradient Descent (SGD) is a key algorithm in Deep Learning that optimizes objective functions by using a single sample or mini-batch at each step, making it suitable for large datasets. Unlike traditional Gradient Descent, which can be computationally intensive, SGD reduces the number of calculations significantly, allowing for efficient handling of big data and quick updates when new data is available. The document outlines the basic principles of SGD, its advantages over standard Gradient Descent, and the iterative process involved in optimizing model parameters.

Uploaded by

poojamani133
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
21 views12 pages

Stochastic Gradient Descent in Deep Learning

Stochastic Gradient Descent (SGD) is a key algorithm in Deep Learning that optimizes objective functions by using a single sample or mini-batch at each step, making it suitable for large datasets. Unlike traditional Gradient Descent, which can be computationally intensive, SGD reduces the number of calculations significantly, allowing for efficient handling of big data and quick updates when new data is available. The document outlines the basic principles of SGD, its advantages over standard Gradient Descent, and the iterative process involved in optimizing model parameters.

Uploaded by

poojamani133
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd

DEEP LEARNING

Stochastic Gradient Descent


Rajesh A
Stochastic Gradient Descent

▪ Stochastic Gradient Descent (SGD) is one of the core algorithms used in Deep
Learning
▪ SGD is based on Gradient Descent ; Gradient Descent an iterative optimization
process that searches for an objective function’s optimum value (Min/Max)

▪ Gradient simply measures the change in all weights with regard to the change
in error. You can also think of a gradient as the slope of a function
▪ In mathematical terms, a gradient is a partial derivative with respect to its
inputs
Gradient Descent

▪ A dataset of 3 items of weight and height. Given


a particular weight we want to predict the height
▪ Simple Linear Regression Model
▪ There could be multiple models possible, which
the is the best fit model?
▪ This simple model can be represented as

H=I+sW

▪ What are the values of Intercept(I) and Slope (s)


for the best fit linear regression model?
Gradient Descent – Loss Function

▪ Identify the Loss Function for the proposed model

▪ There are various loss functions like Sum of


squared Error (SSE), Mean Squared Error
(MSE), Root Mean squared Error (RMSE), Mean
Absolute Error (MAE), R2 Error…
Loss function = Sum of Squared Residuals

Loss function = (y1 – yp1)2 + (y2 – yp2)2 + (y3 – yp3)2

▪ The model with least sum squared residuals (Loss Function) value, is the most suitable
model
▪ Optimization (Minimization) problem
▪ Computation requirement for this is Huge. For N data points, P (number of features [I,
s] to evaluate) number of computations is (N x P). More number of features makes the
computation more complex
Gradient Descent – Basic Idea

▪ Take long step (descent) when slope (Gradient) is high


L
▪ As slope(gradient) comes close to zero, take smaller
steps
▪ Helps to decrease the computational complexity

s I
b – Next value of attribute (I or s) L

a – Current Value of attribute (I or s)


γ – Learning Rate
- Gradient

I
Gradient Descent - example


Gradient Descent – Basic Idea…

A gradient descent of an n-dimensional


function f(x) at a given point p is

= - γ
p+1 p

▪ Gradient Descent can be very sensitive to the learning rate (γ)


▪ In Practice, reasonable learning rate can be determined automatically determined
by starting large and getting smaller with each step
Gradient Descent

Step1: Take derivative of loss function for each attribute (feature) in it

Step2: Pick up Random values to the attribute (feature)

Step3: Plugin the current attribute values derivatives (Gradient)

Step4: Calculate step sizes for each attribute : ( slope x γ)

Step5: Calculate the new value for each attribute = current value – step size

This iterative process continues till step size becomes very small ( < threshold ) or
number of iterations becomes very large ( > limit )
Stochastic Gradient Descent

▪ Scenario: 23,000 genes to predict is a person has a particular disease and there are
1,000,000 million samples.
𝑑 𝐿𝑜𝑠𝑠 𝐹𝑢𝑛𝑐𝑡𝑖𝑜𝑛 𝑑 𝐿𝑜𝑠𝑠 𝐹𝑢𝑛𝑐𝑡𝑖𝑜𝑛
▪ This would need 23,000 derivatives …
𝑑 (𝑔𝑒𝑛𝑒1) 𝑑 (𝑔𝑒𝑛𝑒23,000)

▪ In each of the derivative, there are 1,000,000 terms


▪ 23 billion terms to be calculated in each step
▪ It is common to take 1000 steps,
▪ A total of 2.3 trillion (2,300,000,000,000) terms

For Big Data, Gradient Descent is SLOW Stochastic Gradient Descent


Stochastic Gradient Descent…
▪ Stochastic Gradient Descent would randomly pick
one sample for each step
▪ In this example, SGD will reduce number of terms by
3. When we had 1,000,000 samples, SDG would
reduce the number of terms by 1,000,000

▪ Stochastic Gradient Descent would is especially


useful when there are redundancies in data
▪ In this example, we have 12 data points. But there is
redundancy that forms 3 clusters
▪ SDG reduce number of terms by 12
Stochastic Gradient Descent…

▪ Strict definition of Strict Stochastic Gradient Descent


is to only use one sample for step
▪ It is more common to select a small subset of data or
mini-batch for each step for better results while
keeping the number of terms manageable level

▪ SDG can handle new data very efficiently


▪ We can just take another step for attribute value
estimates without taking having to start from scratch
Stochastic Gradient Descent

▪ Stochastic Gradient Descent is just like Gradient Descent, except it only looks at one
sample or a mini-batch for per each step
▪ Gradient Descent may not be computationally feasible on Big Data. Stochastic
Gradient Descent can handle Big Data (Large number of features and large amount
of data points)
▪ Stochastic Gradient can update the model efficiently when new data is available

You might also like