DEEP LEARNING
Stochastic Gradient Descent
Rajesh A
Stochastic Gradient Descent
▪ Stochastic Gradient Descent (SGD) is one of the core algorithms used in Deep
Learning
▪ SGD is based on Gradient Descent ; Gradient Descent an iterative optimization
process that searches for an objective function’s optimum value (Min/Max)
▪ Gradient simply measures the change in all weights with regard to the change
in error. You can also think of a gradient as the slope of a function
▪ In mathematical terms, a gradient is a partial derivative with respect to its
inputs
Gradient Descent
▪ A dataset of 3 items of weight and height. Given
a particular weight we want to predict the height
▪ Simple Linear Regression Model
▪ There could be multiple models possible, which
the is the best fit model?
▪ This simple model can be represented as
H=I+sW
▪ What are the values of Intercept(I) and Slope (s)
for the best fit linear regression model?
Gradient Descent – Loss Function
▪ Identify the Loss Function for the proposed model
▪ There are various loss functions like Sum of
squared Error (SSE), Mean Squared Error
(MSE), Root Mean squared Error (RMSE), Mean
Absolute Error (MAE), R2 Error…
Loss function = Sum of Squared Residuals
Loss function = (y1 – yp1)2 + (y2 – yp2)2 + (y3 – yp3)2
▪ The model with least sum squared residuals (Loss Function) value, is the most suitable
model
▪ Optimization (Minimization) problem
▪ Computation requirement for this is Huge. For N data points, P (number of features [I,
s] to evaluate) number of computations is (N x P). More number of features makes the
computation more complex
Gradient Descent – Basic Idea
▪ Take long step (descent) when slope (Gradient) is high
L
▪ As slope(gradient) comes close to zero, take smaller
steps
▪ Helps to decrease the computational complexity
s I
b – Next value of attribute (I or s) L
a – Current Value of attribute (I or s)
γ – Learning Rate
- Gradient
I
Gradient Descent - example
…
Gradient Descent – Basic Idea…
A gradient descent of an n-dimensional
function f(x) at a given point p is
= - γ
p+1 p
▪ Gradient Descent can be very sensitive to the learning rate (γ)
▪ In Practice, reasonable learning rate can be determined automatically determined
by starting large and getting smaller with each step
Gradient Descent
Step1: Take derivative of loss function for each attribute (feature) in it
Step2: Pick up Random values to the attribute (feature)
Step3: Plugin the current attribute values derivatives (Gradient)
Step4: Calculate step sizes for each attribute : ( slope x γ)
Step5: Calculate the new value for each attribute = current value – step size
This iterative process continues till step size becomes very small ( < threshold ) or
number of iterations becomes very large ( > limit )
Stochastic Gradient Descent
▪ Scenario: 23,000 genes to predict is a person has a particular disease and there are
1,000,000 million samples.
𝑑 𝐿𝑜𝑠𝑠 𝐹𝑢𝑛𝑐𝑡𝑖𝑜𝑛 𝑑 𝐿𝑜𝑠𝑠 𝐹𝑢𝑛𝑐𝑡𝑖𝑜𝑛
▪ This would need 23,000 derivatives …
𝑑 (𝑔𝑒𝑛𝑒1) 𝑑 (𝑔𝑒𝑛𝑒23,000)
▪ In each of the derivative, there are 1,000,000 terms
▪ 23 billion terms to be calculated in each step
▪ It is common to take 1000 steps,
▪ A total of 2.3 trillion (2,300,000,000,000) terms
For Big Data, Gradient Descent is SLOW Stochastic Gradient Descent
Stochastic Gradient Descent…
▪ Stochastic Gradient Descent would randomly pick
one sample for each step
▪ In this example, SGD will reduce number of terms by
3. When we had 1,000,000 samples, SDG would
reduce the number of terms by 1,000,000
▪ Stochastic Gradient Descent would is especially
useful when there are redundancies in data
▪ In this example, we have 12 data points. But there is
redundancy that forms 3 clusters
▪ SDG reduce number of terms by 12
Stochastic Gradient Descent…
▪ Strict definition of Strict Stochastic Gradient Descent
is to only use one sample for step
▪ It is more common to select a small subset of data or
mini-batch for each step for better results while
keeping the number of terms manageable level
▪ SDG can handle new data very efficiently
▪ We can just take another step for attribute value
estimates without taking having to start from scratch
Stochastic Gradient Descent
▪ Stochastic Gradient Descent is just like Gradient Descent, except it only looks at one
sample or a mini-batch for per each step
▪ Gradient Descent may not be computationally feasible on Big Data. Stochastic
Gradient Descent can handle Big Data (Large number of features and large amount
of data points)
▪ Stochastic Gradient can update the model efficiently when new data is available