0% found this document useful (0 votes)
23 views6 pages

Stochastic Gradient Descent in Regression

Data mining
Copyright
© All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
23 views6 pages

Stochastic Gradient Descent in Regression

Data mining
Copyright
© All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd

Multiple Linear Regression

Multiple linear regression model that utilizes multiple features and takes the form

Y = 0 + 1X1+ … + pXp + 

Here, the stochastic component  represents the noise and is generally assumed to have
normal distribution N (0 , σ 2).

A training dataset comprising on n examples can be expressed in matrix notation as


Y =X β +ε

Here  = (0, 1, … , p) is the vector of model parameters, also called regression coefficients.
The parameters of the model are typically learned (estimated) using least squares approach,
resulting in
^β=( X ' X )−1 X ' Y

The predicted values of the response variable for the training examples are computed as

Y^ = X ^β
Let x be the vector containing values of the features for a new case, the expected response for
this case is predicted as

Y^ =x ' ^β
Precision of Variance of ^β is given by

V ( ^β ) =( X ' X ) σ 2
−1

The variance σ 2 of the stochastic component can be unbiasedly estimated by


2 e' e
σ^ =
n− p−1
Where the vector of residuals e is given by

e=Y − Y^
 F statistic can be used to test the significance of the model.
 Different models can also be compared using F test.
Prediction accuracy is measured by

V (^
Y )=X ( X ' X )− 1 X ' σ 2=H σ 2

The estimated prediction accuracy can be obtained using the estimated value of σ 2.
Updating regression estimators
To enable the learning algorithm to keep learning with each new experience, we need to
update the learned parameters. When a new example x becomes available, we can update the
learned values of regression coefficients as
^β new =( X ' X )−1
new X ' new Y new

Where

( X ' X )−1
new = ( X ' X+ x x ' )
−1
…(1)

(
= (X' X) −
−1 (( X ' X )− 1 x )( ( X ' X )−1 x ) '
1+ x ' ( X ' X ) x
−1 ) Rank one update formula

Thus, we have

^β new = ^β + ( X ' X )−1 x


−1
( y − ^y )
1+ x ' ( X ' X ) x
…(2)
Thus, one can use last formula for updating ^β , and then update stored matrix ( X ' X )−1 . This
strategy requires storage of only ( X ' X )−1 in addition to ^β .

Instead of updating ^
β with every new example, one can also choose to update ^β after every
batch of m examples. In that case ^β can be updated by applying the ‘rank-one update formula’
m times, adding one example at a time.

Deciding importance of features


The F-test performed on the basis of ANOVA table for the learned regression model tests
whether the explanation provided by the learnt model is significant.
It must be understood that the significant F statistic does not imply suitability of the model for
prediction purpose. A value of R2 statistic close to 1, however, indicate usefulness of the model
for prediction purpose.
If a regression model is found useful, all the features used in the model may not be important.
It is therefore necessary to identify a subset of features that can provide almost the same
amount of explanation that is provided by the full set of features. Alternatively, we may have a
large set of candidate features which can possibly influence the output/ response that we wish
to model. This task is called Feature Selection.
An associated problem is when the inputs are not directly usable as features for developing the
learning models. In such situations one must first identify the features for the available inputs.
This task is called Feature Extraction.
Since the tasks of ‘feature selection’ and ‘feature extraction’ are important for every supervised
method, we will return to these tasks after covering other supervised methods.

Gradient descent Algorithm


The literal meaning of descending on a gradient means moving downwards on a slope.
Suppose f ( x ) is a function of x that we want to minimize with respect to x.

Let x 0 be the initial guess of the optimum solution.

Then we know that f ' ( x 0 ) is the rate at which value of f changes with x at x = x 0.

In particular,
f ( x 0 +h )=f ( x 0) + h f ' ( x0 )

If f ' ( x 0 ) >0 , f increases, whereas

If f ' ( x 0 ) <0 , f decreases.

Thus, f ' ( x 0 ) denotes the rate and the direction of change.

Since our objective is to minimize f, we take a small step in the direction opposite to gradient. In
other words, for positive gradient, we should decrease x, and for negative gradient, we should
increase x.

In practice we choose the step size h to be  f ' ( x 0) , where  is called the learning rate. This
approach of step size ensures that when f ' ( x 0 ) has a higher magnitude (i.e. x 0 away from the
optimal value), the step size is bigger; and when f ' ( x 0 ) has a smaller magnitude (i.e. x 0 close to
the optimal value), the step size is smaller, so that we obtain optimal value with a good
accuracy.
We keep changing the value of x in above manner until the minimum is reached (equivalently,
gradient becomes 0)
Remark: Although above explanation is for a univariate function, the rationale holds even when
f is a multivariate function.
More precisely, f ' ( x 0 ) is replaced with the vector of partial derivatives, called gradient

( ∂∂x f ( x ) , ∂∂x f ( x ) , … , ∂∂x f ( x ))


1
0
2
0
n
0
The algorithm
1. Take initial guess x 0.
2. Compute gradient at x=x 0
3. Set x 0 = x 0 – learning rate  gradient
4. If the magnitude of gradient is very small, output x 0, Else go to step 2.

Gradient descent Algorithm for multiple regression


Consider the multiple regression model:

Y^ =f ( x ; β )= β 0 + β 1 x1 +...+ β p x p

Suppose Y is the true value and Y^ is the predicted value of the outcome variable. Then the
squared error loss function is given by

L(Y , Y^ )=(Y −Y^ )


2

Since Y is a random variable, our objective is to minimize E(L(Y , Y^ )) . When we have training
data, this expectation is estimated by
n n
^L= 1 ∑ ( Y − ^ 1
i Y i) = ∑ (Y i − f ( x i ; β ) )
2 2

n i =1 n i=1

We then minimize above estimated quantity with respect to the parameters  ’s.
The gradient vector is given by
n n
∂ ^ ∂ 1
∂β
L= ∑ ( Y − f ( x i ; β ))2= 2n ∑ ( Y i − f ( x i ; β )) ∂∂β (Y i − f ( x i ; β ))
∂ β n i=1 i i=1

n
2 ∂
= ∑ (Y i − f ( xi ; β ) ) ( Y − β − β 1 xi , 1 − …− β p x i , p )
n i=1 ∂β i 0
n
2
= ∑ (Y − f ( xi ; β )) ( −1 , − x i ,1 , … ,− x i, p )'
n i=1 i

The gradient descent algorithm

1. Take initial guess β 0.


n
2
2. Compute gradient at β=β 0 given by G= ∑ ( Y − f ( x i ; β 0 ))( − 1 ,− xi , 1 , … , − x i , p ) '
n i=1 i
3. Set β 0 = β 0 − γ G ,  is learning rate
4. If the norm of gradient is very small, output β 0,
Else go to step 2.

This iterative algorithm obtains the same LS estimator of  as discussed earlier.


Q: Why to use alternative formula when a direct formula is available in closed form?

Stochastic Gradient Descent (SGD) Algorithm


Even though the iterative formula has advantage of achieving more accuracy over the direct
formula, in practice it involves a practical difficulty. Since the volume of training data is
supposed to be large for achieving a better training of the model, the computations involved in
the iterative algorithm are prohibitively large.

Instead of estimating E(L(Y , Y^ )) based on the entire training data, we can take a random
sample from the training data and estimate this expected loss using this random sample. Also,
we draw a fresh random sample in every iteration.
Suppose S denote the training data. The gradient descent algorithm can be expressed in terms
of S as
Algorithm 1

1. Set j = 0. Take initial guess β ( 0) =β0 .


2
(j)
2. Compute gradient at β=β ( j ) given by G = ∑ (Y − f ( x i ; β( j) ))( − 1 ,− xi , 1 , … , − x i , p ) '
|S| S i
3. Set β ( j+1 ) = β ( j) − γ G( j ) ,  is learning rate
4. If the norm of gradient is very small, output β ( j+1 ),
Else set j = j+1 and go to step 2.

Here ∑
S
denotes the summation taken over the entire training dataset S, and |S| denote the
cardinality of S. Since we are using the entire training data here, this algorithm is more precisely
referred to as the batch gradient descent algorithm.
The stochastic gradient algorithm given below is instead based on the random samples s j.
Algorithm 2
( 0)
1. Set j = 0. Take initial guess β =β0 .
2. (a) Draw a random sample sj from training dataset S.
2
(b) Compute gradient at β=β ( j ) given by G = ∑ ( Y i − f ( x i ; β )) ( −1 , − x i ,1 ,… , − x i , p ) '
(j) ( j)

|s j| s j

3. Set β ( j+1 ) = β ( j) − γ G( j ) ,  is learning rate


4. If the norm of gradient is very small, output β ( j+1 ),
Else set j = j+1 and go to step 2.
In practice, SGD algorithm is implemented by drawing the random samples s j of size 1, which
greatly reduce the computational effort involved in the iterative approach. This specific form of
the algorithm can be presented as

The SGD Algorithm

1. Set j = 0. Take initial guess β ( 0) =β0 .


2. (a) Draw a random observation x j from training dataset S.
(b) Compute gradient at β=β ( j ) given by G( j )=2 ( Y j − f ( x j ; β ( j) )) ( −1 ,− x i, 1 , … , − x i , p ) '
3. Set β ( j+1 ) = β ( j) − γ G( j ) ,  is learning rate
4. If the norm of gradient is very small, output β ( j+1 ),
Else set j = j+1 and go to step 2.
The most significant benefit of this algorithm is that with every new example that is received
from the field, it can further improve the parameter estimates and thereby improve the
performance. Thus the algorithm makes continuous learning feasible.

Common questions

Powered by AI

Batch gradient descent computes the gradient using the entire dataset, which can lead to slower convergence compared to stochastic gradient descent that updates parameters using individual examples . This allows SGD to take more frequent but noisier steps towards the minimum, often resulting in faster convergence especially in large datasets . Additionally, SGD is computationally more efficient as it requires less memory and immediate updates to the model parameters, whereas batch gradient descent requires storing and computing over all data points for each update, leading to greater computational overhead .

Traditional least squares methods can be computationally intensive when applied to large datasets because they require the inversion of large matrices, which is computationally expensive . Additionally, handling the complete dataset increases memory requirements and can lead to slow processing times. These challenges can be mitigated using iterative techniques like stochastic gradient descent, which estimate parameters incrementally and require less computational power and memory by processing one or a small set of examples at a time .

Updating regression estimators with new data is essential in adaptive learning contexts where models must remain relevant over time. In situations where data streams continuously, such as in online learning environments, incorporating new data allows the model to adjust to changes and improve predictions incrementally . This ability to learn from each new example ensures that the model adapts to the evolving data distribution, maintaining its accuracy and performance .

The precision of the estimated regression coefficients in multiple linear regression is assessed using the variance of the coefficients, V(^β), calculated as (X'X)^−1σ^2 . The unbiased estimator for the variance σ^2 is given by e'e/(n−p−1), where e represents the residuals vector . A smaller variance signifies higher precision of the estimated coefficients.

Stochastic gradient descent (SGD) offers several benefits over traditional batch gradient descent in large-scale machine learning scenarios. First, it reduces computational effort significantly, as it processes one training example per iteration rather than the entire dataset . This property makes SGD particularly well-suited for problems with large datasets because it updates parameters more frequently, enabling faster convergence . Additionally, the algorithm's ability to handle new data iteratively makes it beneficial for online learning where continuous model updates are required .

The F statistic is used to test the overall significance of a multiple linear regression model. It determines if the regression model provides a better fit to the data compared to a model with no predictors . By comparing the explained variance to unexplained variance, the F statistic assesses whether at least one predictor variable is statistically significant in explaining the dependent variable . This helps in verifying the statistical validity of the model.

Feature selection increases the performance of a regression model by identifying and utilizing a subset of features that provide nearly the same level of explanation as the full set of features . This process not only reduces model complexity and overfitting but also can reduce computational costs. By focusing on the most important features, the model predictions become more accurate and interpretable .

Gradient descent ensures convergence to a local minimum by iteratively moving in the direction opposite to the gradient, which is the vector of partial derivatives indicating the steepest ascent . By choosing a step size proportional to the gradient's magnitude, the algorithm scales the step size depending on the proximity to the optimal value, allowing more accurate convergence as the minimum is approached . This method systematically reduces the function's error by updating variable estimates until the gradient is negligible, indicating that a local minimum has been reached .

Feature extraction is necessary when inputs are unsuitable because it transforms raw data into a format that is usable for learning models . This process involves deriving informative features from data, which can effectively influence the output model. Without feature extraction, the model may not be capable of accurately capturing the underlying patterns and complexities of the data .

A high R² value in multiple linear regression models implies that a large proportion of the variance in the dependent variable is explained by the independent variables included in the model . This suggests that the model has a good fit and may be useful for prediction purposes. However, it does not necessarily indicate that all features are important, nor does it imply causation .

You might also like