3 Regression

Uploaded by

NITISH DASA
Copyright
© All Rights Reserved

UNIT - III

Regression

A regression problem is one in which the output variable is a real or continuous value, such as “salary”
or “weight”. Many different models can be used; the simplest is linear regression, which tries to
fit the data with the hyperplane that passes as close as possible to the points.

Regression analysis is a statistical process for estimating the relationships between a
dependent variable (the criterion) and one or more independent variables (the predictors).
It explains how the criterion changes in relation to changes in selected predictors.
More precisely, it estimates the conditional expectation of the criterion given the predictors,
that is, the average value of the dependent variable when the independent variables are held at
given values. Three major uses of regression analysis are determining the strength of predictors,
forecasting an effect, and trend forecasting.

Types of Regression –

• Linear regression
• Logistic regression
• Polynomial regression
• Stepwise regression
• Ridge regression
• Lasso regression
• ElasticNet regression

Linear regression is used for predictive analysis. It is a linear approach for
modeling the relationship between the criterion (the scalar response) and one or more
predictors (explanatory variables). Linear regression focuses on the conditional probability
distribution of the response given the values of the predictors. For linear regression, there is a
danger of overfitting, particularly when there are many predictors relative to the number of
observations. The formula for simple linear regression is: Y’ = bX + A.

Y’ = estimated dependent variable score, A = constant, b = regression coefficient, and X = score
on the independent variable.

Logistic regression is used when the dependent variable is dichotomous. Logistic regression
estimates the parameters of a logistic model and is a form of binomial regression. It is used
for data in which the criterion has two possible values, and it models the relationship between
that criterion and the predictors. The equation for logistic regression is:

z = b0 + b1X1 + b2X2 +....+ bkXk

Where b0 is the constant, k is the number of independent (X) variables, and z is the log-odds of
the outcome; the predicted probability is p = 1 / (1 + e^(-z)). In ordinal logistic regression, the
threshold coefficient is different for every level of the dependent variable. Each threshold
gives the cumulative probability of the dependent variable up to that level.
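
As a rough illustration of the logistic equation, the sketch below computes z = b0 + b1X1 + ... + bkXk and passes it through the logistic function to obtain a probability. The coefficient values and the 0.5 decision threshold are assumptions made up for the example, not values from these notes.

```python
import math

def logistic_probability(coefficients, intercept, features):
    """Return p = 1 / (1 + exp(-z)) where z = b0 + b1*X1 + ... + bk*Xk."""
    z = intercept + sum(b * x for b, x in zip(coefficients, features))
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical coefficients, chosen only for illustration.
b0 = -1.0
b = [0.5, 0.25]

# Here z = -1 + 0.5*2 + 0.25*4 = 1.
p = logistic_probability(b, b0, [2.0, 4.0])

# Turn the probability into a dichotomous prediction (assumed 0.5 threshold).
label = 1 if p >= 0.5 else 0
print(round(p, 4), label)
```

When z = 0 the function returns exactly 0.5, which is why 0.5 is a common (but not mandatory) cut-off between the two classes.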

Polynomial regression is used for curvilinear data. Polynomial regression is fit with the
method of least squares. The goal of regression analysis is to model the expected value of a
dependent variable y in terms of the independent variable x. The equation for polynomial
regression of degree n is:

y = β0 + β1x + β2x^2 + ... + βnx^n + ε

where ε is an unobserved random error with mean zero conditioned on a scalar variable x. In
the special case n = 1 (simple linear regression), for each unit increase in the value of x, the
conditional expectation of y increases by β1 units; for higher degrees the increase depends on
the value of x.
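
The effect of a unit increase in x can be checked numerically. The sketch below evaluates the conditional expectation for a degree-1 and a degree-2 model; both coefficient lists are hypothetical and exist only to show the contrast.

```python
def expected_y(betas, x):
    """E[y | x] = betas[0] + betas[1]*x + betas[2]*x**2 + ... (Horner's rule)."""
    result = 0.0
    for beta in reversed(betas):
        result = result * x + beta
    return result

linear = [1.0, 2.0]          # hypothetical model: y = 1 + 2x
quadratic = [1.0, 2.0, 0.5]  # hypothetical model: y = 1 + 2x + 0.5x^2

# Degree 1: a unit step in x always changes E[y] by beta1 = 2.
linear_steps = [expected_y(linear, x + 1) - expected_y(linear, x) for x in (0, 1, 2)]

# Degree 2: the change produced by a unit step depends on where x starts.
quadratic_steps = [expected_y(quadratic, x + 1) - expected_y(quadratic, x) for x in (0, 1, 2)]

print(linear_steps)
print(quadratic_steps)
```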

Stepwise regression fits a regression model in which the choice of predictive variables is
carried out automatically. At each step, a variable is added to or removed from the set of
explanatory variables. The approaches for stepwise regression are forward selection, backward
elimination, and bidirectional elimination. The formula for stepwise regression, which gives the
standardized coefficient of the jth variable, is

bj(std) = bj × (Sx / Sy)

where Sy and Sx are the standard deviations of the dependent variable and the corresponding
jth independent variable.
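
A minimal sketch of a standardized coefficient, assuming it is computed by rescaling a raw coefficient b_j by the ratio of the two standard deviations, S_x / S_y. The data and the raw coefficient below are hypothetical.

```python
import statistics

def standardized_coefficient(b_j, x_j, y):
    """Rescale a raw regression coefficient by S_x / S_y (sample standard deviations)."""
    s_x = statistics.stdev(x_j)
    s_y = statistics.stdev(y)
    return b_j * s_x / s_y

# Hypothetical data lying exactly on y = 2x, so the raw coefficient is 2.
x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]

b_std = standardized_coefficient(2.0, x, y)
print(b_std)
```

With a single predictor and a perfect linear relationship, the standardized coefficient equals the correlation, which is 1 here. Standardized values make coefficients comparable across predictors measured in different units.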

Ridge regression is a technique for analyzing multiple regression data that suffer from
multicollinearity. When multicollinearity occurs, least squares estimates are unbiased, but their
variances are large, so they may be far from the true value. Ridge regression adds a degree of
bias to the regression estimates and, as a result, reduces the standard errors. The formula for
ridge regression is

β = (X^T X + λI)^(-1) X^T Y

where
β = coefficient vector
X = independent variable = feature = attribute = predictor
λ = the regularization penalty
I = identity matrix
Y = response variable
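
To see the shrinkage effect of λ without matrix algebra, the sketch below takes the simplest case: one predictor and no intercept, minimising sum((y - βx)^2) + λβ^2. In that case the ridge estimate reduces to β = Σxy / (Σx^2 + λ). The data values are hypothetical.

```python
def ridge_coefficient(x, y, lam):
    """Ridge estimate for a single predictor without intercept."""
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    sxx = sum(xi * xi for xi in x)
    return sxy / (sxx + lam)

# Hypothetical centred data that roughly follow y = 2x.
x = [-2.0, -1.0, 0.0, 1.0, 2.0]
y = [-4.1, -1.9, 0.0, 2.1, 3.9]

# A larger penalty pulls the coefficient further towards zero.
betas = {lam: ridge_coefficient(x, y, lam) for lam in (0.0, 10.0, 90.0)}
print(betas)
```

With λ = 0 the result is the ordinary least squares coefficient; as λ grows the coefficient shrinks towards zero but never reaches it exactly.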

Lasso regression is a regression analysis method that performs both variable selection and
regularization. Lasso regression uses soft thresholding, so it selects only a subset
of the provided covariates for use in the final model. The lasso objective is

Objective = RSS + α * (sum of absolute value of coefficients)

Here, α (alpha) works similarly to the penalty in ridge regression and provides a trade-off
between RSS and the magnitude of the coefficients. As in ridge regression, α can take various values.

Let's reiterate them here briefly:

1. α = 0: Same coefficients as simple linear regression
2. α = ∞: All coefficients zero (the penalty dominates the objective)
3. 0 < α < ∞: coefficients between 0 and that of simple linear regression
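
The three cases can be reproduced with a small sketch of soft thresholding. It assumes the simplest setting, one predictor and no intercept, where minimising RSS + α|β| gives β = S(Σxy, α/2) / Σx^2, with S the soft-thresholding operator. The data values are hypothetical.

```python
def soft_threshold(value, threshold):
    """Shrink value towards zero by threshold; return 0 if it falls inside the band."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0

def lasso_coefficient(x, y, alpha):
    """Minimise sum((y - beta*x)^2) + alpha*|beta| for one predictor, no intercept."""
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    sxx = sum(xi * xi for xi in x)
    return soft_threshold(sxy, alpha / 2.0) / sxx

# Hypothetical centred data that roughly follow y = 2x.
x = [-2.0, -1.0, 0.0, 1.0, 2.0]
y = [-4.1, -1.9, 0.0, 2.1, 3.9]

no_penalty = lasso_coefficient(x, y, 0.0)      # same as least squares
some_penalty = lasso_coefficient(x, y, 20.0)   # between 0 and least squares
large_penalty = lasso_coefficient(x, y, 100.0) # exactly zero
print(no_penalty, some_penalty, large_penalty)
```

Unlike ridge, the coefficient becomes exactly zero once α is large enough, which is what makes the lasso a variable selection method.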

ElasticNet regression is a regularized regression method that linearly combines the penalties
of the lasso and ridge methods. ElasticNet regression has been applied to support vector machines,
metric learning, and portfolio optimization. The lasso penalty function is given by:

Penalty = sum of absolute value of coefficients = |β1| + |β2| + ... + |βp|

Use of this penalty function alone has several limitations. For example, in the "large p, small
n" case (high-dimensional data with few examples), the LASSO selects at most n
variables before it saturates. The elastic net overcomes this by adding the ridge penalty, giving
the combined penalty λ1 × (sum of |βj|) + λ2 × (sum of βj^2).
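
A short sketch of an elastic net penalty, assuming the common form λ1 × (sum of absolute coefficients) + λ2 × (sum of squared coefficients). The names lambda1 and lambda2 and the coefficient vector are hypothetical.

```python
def elastic_net_penalty(betas, lambda1, lambda2):
    """Linear combination of the lasso (L1) and ridge (squared L2) penalties."""
    l1 = sum(abs(b) for b in betas)
    l2 = sum(b * b for b in betas)
    return lambda1 * l1 + lambda2 * l2

betas = [3.0, -4.0, 0.0]  # hypothetical coefficient vector

lasso_only = elastic_net_penalty(betas, 1.0, 0.0)  # pure L1 penalty
ridge_only = elastic_net_penalty(betas, 0.0, 1.0)  # pure squared-L2 penalty
mixed = elastic_net_penalty(betas, 0.5, 0.5)       # equal mix of both
print(lasso_only, ridge_only, mixed)
```

Setting either weight to zero recovers the lasso or the ridge penalty, so both methods are special cases of the elastic net.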

BLUE property assumptions


• B-BEST
• L-LINEAR
• U-UNBIASED
• E-ESTIMATOR
An estimator is BLUE (Best Linear Unbiased Estimator) if the following hold:
1. It is linear (in the regression model)
2. It is unbiased
3. It is an efficient estimator (the unbiased estimator with the least variance)

LINEARITY
• An estimator is said to be a linear estimator of β if it is a linear function of the
sample observations.
• The sample mean is a linear estimator because it is a linear function of the X values.

UNBIASEDNESS
• A desirable property of a distribution of estimates is that its mean equals the true
value of the parameter being estimated.
• Formally, an estimator is unbiased if the expected value of its sampling distribution
equals the true value of the population parameter.
• We also write this as follows:

E(β̂) = β

If this is not the case, we say that the estimator is biased, and the bias is

Bias = E(β̂) − β
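
Bias can be computed exactly for a tiny population by listing every possible sample. The sketch below, using a hypothetical population of four values and samples of size 2 drawn with replacement, compares the sample mean with two variance estimators: one dividing by n and one dividing by n - 1.

```python
from itertools import product

population = [1, 2, 3, 4]  # hypothetical tiny population
n = 2                      # sample size

def average(values):
    values = list(values)
    return sum(values) / len(values)

def variance(sample, ddof):
    """Sum of squared deviations divided by (len(sample) - ddof)."""
    centre = average(sample)
    return sum((v - centre) ** 2 for v in sample) / (len(sample) - ddof)

true_mean = average(population)
true_variance = variance(population, ddof=0)

# Every possible sample of size n drawn with replacement, all equally likely.
samples = list(product(population, repeat=n))

# Expected value of each estimator = average of the estimator over all samples.
expected_sample_mean = average(average(s) for s in samples)
expected_var_divide_by_n = average(variance(s, ddof=0) for s in samples)
expected_var_divide_by_n_minus_1 = average(variance(s, ddof=1) for s in samples)

# Bias = expected value of the estimator minus the true value.
bias_of_mean = expected_sample_mean - true_mean
bias_divide_by_n = expected_var_divide_by_n - true_variance
bias_divide_by_n_minus_1 = expected_var_divide_by_n_minus_1 - true_variance
print(bias_of_mean, bias_divide_by_n, bias_divide_by_n_minus_1)
```

The sample mean and the n - 1 variance estimator have zero bias, while dividing by n underestimates the population variance on average.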
MINIMUM VARIANCE
• Just as we wanted the mean of the sampling distribution to be centered around the true
population value, so it is desirable for the sampling distribution to be as narrow (or precise)
as possible.
– Centering around “the truth” but with high variability might be of very little use.
• One way of narrowing the sampling distribution is to increase the sample size.
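
The narrowing effect of a larger sample can be verified by enumeration. The sketch below, on a hypothetical four-value population sampled with replacement, computes the variance of the sample mean for several sample sizes; it should equal the population variance divided by n.

```python
from itertools import product

population = [1, 2, 3, 4]  # hypothetical tiny population, variance 1.25

def sampling_variance_of_mean(population, n):
    """Variance of the sample mean over all equally likely samples of size n."""
    means = [sum(s) / n for s in product(population, repeat=n)]
    centre = sum(means) / len(means)
    return sum((m - centre) ** 2 for m in means) / len(means)

# Spread of the sampling distribution for increasing sample sizes.
spread = {n: sampling_variance_of_mean(population, n) for n in (1, 2, 4)}
print(spread)
```

Each doubling of the sample size halves the variance of the sample mean, so the estimates cluster more tightly around the true value.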
What is the Least Squares Regression Method?
The least-squares regression method is a technique commonly used in Regression Analysis. It
is a mathematical method used to find the best fit line that represents the relationship between
an independent and dependent variable.

To understand the least-squares regression method, let’s get familiar with the concepts involved
in formulating the line of best fit.

What is the Line of Best Fit?


A line of best fit is drawn to represent the relationship between two or more variables. More
specifically, it is drawn across a scatter plot of data points to represent the
relationship between those data points.
Regression analysis makes use of mathematical methods such as least squares to obtain a
definite relationship between the predictor variable(s) and the target variable. The
least-squares method is one of the most effective ways to draw the line of best fit. It is based
on the idea that the sum of the squares of the errors must be minimized as far as possible,
hence the name least squares method.

If we were to plot the best fit line that depicts the sales of a company over a period
of time, the line would run through the scatter of data points, as close as possible to all of
them. This is what an ideal best fit line looks like.

Let’s see how to calculate the line using the Least Squares Regression.

Steps to calculate the Line of Best Fit


To start constructing the line that best depicts the relationship between variables in the data,
the equation used is:

y = mx + c

It is a simple equation that represents a straight line through two-dimensional data, i.e. along
the x-axis and y-axis. To better understand this, let’s break down the equation:

• y: dependent variable
• m: the slope of the line
• x: independent variable
• c: y-intercept

So the aim is to calculate the slope and the y-intercept, and then substitute the corresponding ‘x’
values in the equation in order to derive the value of the dependent variable.

Let’s see how this can be done.

As an assumption, let’s consider that there are ‘n’ data points.

Step 1: Calculate the slope ‘m’ by using the following formula:

m = (n Σxy − Σx Σy) / (n Σx^2 − (Σx)^2)

Step 2: Compute the y-intercept (the value of y at the point where the line crosses the y-axis):

c = (Σy − m Σx) / n

Step 3: Substitute the values in the final equation:

y = mx + c
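
The three steps can be sketched in a few lines of Python. The sketch assumes the standard closed-form expressions for n data points, m = (nΣxy − ΣxΣy) / (nΣx^2 − (Σx)^2) and c = (Σy − mΣx) / n; the function name fit_line and the sample points are hypothetical.

```python
def fit_line(xs, ys):
    """Return (m, c) of the least-squares line y = m*x + c."""
    n = len(xs)
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)

    # Step 1: slope
    m = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    # Step 2: y-intercept
    c = (sum_y - m * sum_x) / n
    return m, c

# Hypothetical points lying exactly on y = 2x + 1.
m, c = fit_line([0, 1, 2, 3], [1, 3, 5, 7])

# Step 3: substitute into y = m*x + c to predict a new value.
prediction = m * 10 + c
print(m, c, prediction)
```

The denominator is zero when all x values are identical, so the function needs at least two distinct x values.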

Simple, isn’t it?

Now let’s look at an example and see how you can use the least-squares regression method to
compute the line of best fit.

Least Squares Regression Example


Consider an example. Tom, who is the owner of a retail shop, recorded the price of different
T-shirts against the number of T-shirts sold at his shop over a period of one week.

He tabulated the price of each T-shirt (x) against the number of T-shirts sold (y).


Let us use the concept of least squares regression to find the line of best fit for the above data.

Step 1: Calculate the slope ‘m’ by using the slope formula. After you substitute the respective
values, m = 1.518 approximately.

Step 2: Compute the y-intercept value. After you substitute the respective values, c = 0.305
approximately.

Step 3: Substitute the values in the final equation. Once you substitute the values, it should
look something like this:

y = 1.518x + 0.305

This is the y = mx + c line of best fit for the data.

Now Tom can use the above equation to estimate how many T-shirts priced at $8 he can sell at
the retail shop.

y = 1.518 × 8 + 0.305 = 12.45 T-shirts

This comes to about 12 T-shirts. That’s how simple it is to make predictions using Linear
Regression.
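
The figures in this example can be checked with a short script. The price/sales table is not reproduced in these notes, so the five (price, sold) pairs below are an assumption: they were chosen because they reproduce the quoted slope and intercept. The sketch uses the equivalent mean-centred form of the least-squares formulas.

```python
# Assumed data (not taken from the notes): price in dollars and T-shirts sold.
prices = [2, 3, 5, 7, 9]
sold = [4, 5, 7, 10, 15]

n = len(prices)
mean_x = sum(prices) / n
mean_y = sum(sold) / n

# Mean-centred least squares: m = Sxy / Sxx, c = mean_y - m * mean_x.
s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(prices, sold))
s_xx = sum((x - mean_x) ** 2 for x in prices)
m = s_xy / s_xx
c = mean_y - m * mean_x

# Estimated sales for a T-shirt priced at $8.
estimate = m * 8 + c
print(round(m, 3), round(c, 3), round(estimate, 2))
```

Under this assumed data the script gives m ≈ 1.518, c ≈ 0.305 and an estimate of about 12.45, matching the values in the worked example.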

Now let’s try to understand on what basis we can confirm that the above line is the line
of best fit.

The least squares regression method works by making the sum of the squares of the errors
as small as possible, hence the name least squares. In other words, the vertical distances between
the line of best fit and the data points must be minimized as much as possible. This is the basic
idea behind the least squares regression method.

A few things to keep in mind before implementing the least squares regression method are:

• The data should be free of outliers, because squaring the errors gives outliers a large
influence and they can pull the line of best fit away from the bulk of the data.
• The line of best fit can be drawn iteratively until you get a line with the minimum
possible sum of squared errors.
• A straight line fits well only when the relationship is roughly linear; for non-linear data,
least squares can still be applied to a curved model such as a polynomial.
• Technically, the difference between the actual value of ‘y’ and the predicted value of
‘y’ is called the residual (it denotes the error).
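
Residuals and the sum of squared errors are simple to compute. The sketch below uses four hypothetical points whose least-squares line works out to y = 1.9x, and compares its sum of squared errors with that of two nearby lines.

```python
def residuals(xs, ys, m, c):
    """Actual y minus predicted y for each data point."""
    return [y - (m * x + c) for x, y in zip(xs, ys)]

def sum_squared_errors(xs, ys, m, c):
    return sum(r * r for r in residuals(xs, ys, m, c))

# Hypothetical data; its least-squares line is y = 1.9x + 0.
xs = [1, 2, 3, 4]
ys = [2, 4, 5, 8]

best_sse = sum_squared_errors(xs, ys, 1.9, 0.0)

# Any other line, e.g. a steeper one or a shifted one, has a larger error.
steeper_sse = sum_squared_errors(xs, ys, 2.0, 0.0)
shifted_sse = sum_squared_errors(xs, ys, 1.9, 0.5)
print(best_sse, steeper_sse, shifted_sse)
```

The residuals of the least-squares line also sum to zero, which is a quick sanity check on a fitted line that includes an intercept.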

Variable Rationalization

Data Rationalization is an enabler of effective Data Governance. How can you govern
information assets if you don’t know where they are or what they mean? Similarly, Data
Rationalization can aid in the development of Master Data Management solutions. By
identifying common data entities, and how these relate to other pieces of data (again, across
many systems), MDM solutions will be able to better accommodate the needs of all the systems
which require the master/reference data.

How does it work?

In order to be able to rationalize your data, meta relationships between model objects (across
model levels) must be established. Of course, we are not talking about supplanting the normal
types of relationships between model objects in the same model. Meta relationships can be
established in multiple ways:

1. Use automated modeling tool functionality (e.g. ERStudio Where Used, PowerDesigner Link and Sync)
2. Use manual modeling tool functionality (e.g. ERStudio User Defined Mapping)
3. Use modeling tool meta data fields (e.g. ERwin User Defined Properties (UDP))
4. Use a Meta Data Repository tool (e.g. Rochade, Adaptive, Advantage Repository, etc.) to
manually establish links using a GUI or other interface.
5. Use a spreadsheet (e.g. an Excel sheet)
