
Simple Linear Regression Overview

Uploaded by

rxn255

Chapter 11

Simple Linear Regression and Correlation


Introduction
We look for a linear relationship between two
quantitative variables X and Y.

The linear relation is

y = β0 + β1x,

where β0 is the y-intercept and β1 is the slope.


Simple Linear Regression Model
We have (x1, y1), (x2, y2), …, (xn, yn) as the paired
data.

X - independent variable with values x1, x2, ..., xn.
Y - dependent variable with values y1, y2, ..., yn.
Scatterplot
We can make a scatterplot to see if there
appears to be a linear relationship between X
and Y.
Scatterplot
EXA For 9 steers taken to market, X = live
weight, Y = dressed weight (both in hundreds
of pounds).

x | 13.7 12.4 15.6 11.1 14.7 14.9 14.1 12.1 12.7
y | 9.1 8.2 10.1 6.8 9.5 9.1 8.5 7.8 8.2

R Code:
x=c(13.7,12.4,15.6,11.1,14.7,14.9,14.1,12.1,12.7)
y=c(9.1,8.2,10.1,6.8,9.5,9.1,8.5,7.8,8.2)
plot(x,y,xlab="Live Weight",ylab="Dressed Weight")

Do the data look fairly linear? Is the association positive or negative?
Simple Linear Regression Model
Let Y = β0 + β1X + ε, where X is the independent
(regressor, explanatory, or predictor) variable,
Y is the dependent (response) variable, and ε is
a random error.

ε is N(0, σ²)

Y|X = x is N(β0 + β1x, σ²)


Simple Linear Regression Model
Since Y|X = x is N(β0 + β1x, σ²), we have

E(Y|X = x) = β0 + β1x

and Var(Y|X = x) = Var(ε) = σ².

Also, y = β0 + β1x is the true regression line.


Proper Interpretation of Line
EXA Let Y denote the flow rate in a device
used for air quality measurement and X
denote the pressure drop across the
device’s filter.

For x between 5 and 20, the estimated regression line is

ŷ = -0.12 + 0.095x.
Proper Interpretation of Line
EXA Let Y denote the reaction time in a
certain chemical process and X denote the
temperature in the chamber in which the
reaction takes place.

The estimated regression line is

ŷ = 5 - 0.01x.
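A short R sketch of how the two estimated lines above are usually read. The function names flow_hat and time_hat are made up for illustration, and the slides give no units, so none are assumed here. The slope is read as the estimated change in the mean response for a one-unit increase in x.

```r
# Estimated regression lines quoted on the slides
# (function names are hypothetical, chosen for this sketch)
flow_hat <- function(x) -0.12 + 0.095 * x   # stated for x between 5 and 20
time_hat <- function(x) 5 - 0.01 * x

# Point estimate of the mean flow rate at a pressure drop of x = 10
flow_10 <- flow_hat(10)

# Slope = estimated change in mean response per one-unit increase in x
flow_slope <- flow_hat(11) - flow_hat(10)
time_slope <- time_hat(11) - time_hat(10)

print(c(flow_10 = flow_10, flow_slope = flow_slope, time_slope = time_slope))
```

The negative slope of the second line means the estimated mean reaction time decreases as temperature increases. Evaluating the first line far outside 5 to 20 (for example at x = 0) would be extrapolation beyond the range for which the line was stated.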
Estimating β0 and β1
Principle of Least Squares: We want to
minimize the sum of the squared vertical distances
from the points to the line in the graph. Hence,
we want to minimize

$$\sum_{i=1}^{n} [y_i - (\beta_0 + \beta_1 x_i)]^2 \qquad (*)$$

with respect to β0 and β1.
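The slides do not show the minimization itself. One standard way to carry it out (a sketch, not necessarily the derivation used in the course) is to set both partial derivatives of the sum of squares to zero:

```latex
\begin{align*}
\frac{\partial}{\partial \beta_0} \sum_{i=1}^{n} [y_i - (\beta_0 + \beta_1 x_i)]^2
  &= -2 \sum_{i=1}^{n} [y_i - (\beta_0 + \beta_1 x_i)] = 0, \\
\frac{\partial}{\partial \beta_1} \sum_{i=1}^{n} [y_i - (\beta_0 + \beta_1 x_i)]^2
  &= -2 \sum_{i=1}^{n} x_i [y_i - (\beta_0 + \beta_1 x_i)] = 0.
\end{align*}
Rearranging gives the two normal equations
\begin{align*}
n \beta_0 + \beta_1 \sum_{i=1}^{n} x_i &= \sum_{i=1}^{n} y_i, \\
\beta_0 \sum_{i=1}^{n} x_i + \beta_1 \sum_{i=1}^{n} x_i^2 &= \sum_{i=1}^{n} x_i y_i.
\end{align*}
```

Solving these two linear equations for β0 and β1 gives the least-squares estimates.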


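Estimating β0 and β1
It turns out that the values which minimize (*)
are:

$$\hat{\beta}_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2} = \frac{n \sum_{i=1}^{n} x_i y_i - (\sum_{i=1}^{n} x_i)(\sum_{i=1}^{n} y_i)}{n \sum_{i=1}^{n} x_i^2 - (\sum_{i=1}^{n} x_i)^2}$$

$$\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$$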
Alternative Formulas for β0 and β1

β̂1 = sxy / sxx,

where sxy = Σ(xi - x̄)(yi - ȳ) and sxx = Σ(xi - x̄)².

β̂0 = ȳ - β̂1 x̄
Estimated Regression Line
Hence, ŷ = β̂0 + β̂1x is the estimated
regression line or least-squares line. Sometimes
this line is called the best fitting line.

EXA Compute the least-squares line for the
steer data.
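A minimal R sketch of this computation for the steer data, using the sxy/sxx form of the formulas. The variable names (sxx, sxy, b0, b1) are chosen here for illustration; lm() is called only as a cross-check.

```r
# Steer data from the slides (hundreds of pounds)
x <- c(13.7, 12.4, 15.6, 11.1, 14.7, 14.9, 14.1, 12.1, 12.7)  # live weight
y <- c(9.1, 8.2, 10.1, 6.8, 9.5, 9.1, 8.5, 7.8, 8.2)          # dressed weight

# Corrected sums of squares and cross-products
sxx <- sum((x - mean(x))^2)
sxy <- sum((x - mean(x)) * (y - mean(y)))

# Least-squares estimates
b1 <- sxy / sxx
b0 <- mean(y) - b1 * mean(x)
print(round(c(b0 = b0, b1 = b1), 4))

# Cross-check against R's built-in fit
print(coef(lm(y ~ x)))
```

The estimates agree, up to rounding, with the coefficients .1213 and .6283 used in the residual plot code later in the slides.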
Comments
Note: For a fixed x, ŷ = β̂0 + β̂1x gives either

(1) a point estimate of E(Y|X = x)

or

(2) a prediction of the Y value that will result from
a value x.
Estimating σ²
Note: σ² is the amount of variability in the
regression model.

For a linear pattern,

(1) large scatter in scatterplot ⇒ large σ².

(2) small scatter in scatterplot ⇒ small σ².
Residuals

Defn: ŷ1 = β̂0 + β̂1x1, …, ŷn = β̂0 + β̂1xn
are the n fitted (or predicted) values. The
residuals are the vertical deviations

y1 - ŷ1, …, yn - ŷn.
Fitted Values or Residuals

EXA Steer data. Find fitted values and
residuals.
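One way to get the fitted values and residuals in R, sketched with lm(), fitted() and resid(). The object names (fit, yhat, res) are just illustrative.

```r
# Steer data from the slides (hundreds of pounds)
x <- c(13.7, 12.4, 15.6, 11.1, 14.7, 14.9, 14.1, 12.1, 12.7)
y <- c(9.1, 8.2, 10.1, 6.8, 9.5, 9.1, 8.5, 7.8, 8.2)

fit  <- lm(y ~ x)
yhat <- fitted(fit)   # fitted values b0 + b1 * x_i
res  <- resid(fit)    # residuals y_i - yhat_i

# Side-by-side table of data, fitted values and residuals
print(data.frame(x, y, yhat = round(yhat, 3), res = round(res, 3)))

# Residuals from a least-squares line sum to 0 (up to floating-point error)
print(sum(res))
```

Computing yhat by hand from rounded coefficients gives a sum of residuals that is close to, but not exactly, 0.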
Residual Plot
We can also make a residual plot. In a residual
plot, a pattern suggests a line is a bad fit to
model the relationship between X and Y.
Note: (1) $\sum_{i=1}^{n} (y_i - \hat{y}_i)$ = sum of the residuals = 0
(in our example, we had a little round off error).
(2) For a line to be a good fit, the residual
plot should be a random scatter of points in a
narrow band about the 0 line.
Residual Plot
For a residual plot, the residual is on the y-axis,
and either the fitted value or the explanatory
variable is on the x-axis.

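EXA Find the residual plot(s) for the steer
example using R.
yhat=.1213+.6283*x   # fitted values from the least-squares line
res=y-yhat           # residuals
plot(x,res,xlab="Live Weight",ylab="Residual")
abline(h=0)          # 0 line on the first plot
plot(yhat,res,xlab="Predicted Dressed Weight",ylab="Residual")
abline(h=0)          # 0 line on the second plot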
Estimate of σ²

Error Sum of Squares = SSE

$$\text{SSE} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \sum_{i=1}^{n} (y_i - (\hat{\beta}_0 + \hat{\beta}_1 x_i))^2$$

Estimate of σ² = σ̂² = SSE/(n - 2) = s².
Estimate of σ²
Note: (1) d.f. = (n - 2) since two parameters
(β0 and β1) had to be estimated
from the data.
(2) E(σ̂²) = σ². So σ̂² is an
unbiased estimator of σ².

EXA Back to steer example. Find σ̂².
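A sketch of this calculation in R for the steer data: SSE from the residuals, then SSE/(n - 2). The names sse and s2 are illustrative; summary(fit)$sigma is R's residual standard error and is used only as a check.

```r
# Steer data from the slides (hundreds of pounds)
x <- c(13.7, 12.4, 15.6, 11.1, 14.7, 14.9, 14.1, 12.1, 12.7)
y <- c(9.1, 8.2, 10.1, 6.8, 9.5, 9.1, 8.5, 7.8, 8.2)
n <- length(x)

fit <- lm(y ~ x)
sse <- sum(resid(fit)^2)   # error sum of squares
s2  <- sse / (n - 2)       # estimate of sigma^2 with n - 2 d.f.

print(c(SSE = sse, s2 = s2, s = sqrt(s2)))

# R reports s (not s^2) as the "Residual standard error"
print(summary(fit)$sigma)
```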
Using R to do Linear Regression
x=c(13.7,12.4,15.6,11.1,14.7,14.9,14.1,12.1,12.7)
y=c(9.1,8.2,10.1,6.8,9.5,9.1,8.5,7.8,8.2)
fit=lm(y~x)
coefficients(fit)
summary(fit)

What pieces do we recognize?


Alternative Formula for SSE
$$\text{SSE} = \sum_{i=1}^{n} y_i^2 - \hat{\beta}_0 \sum_{i=1}^{n} y_i - \hat{\beta}_1 \sum_{i=1}^{n} x_i y_i$$

EXA Steer example. Find SSE.

Note: This alternative formula for SSE is very
sensitive to round off error in β̂0 and β̂1.
Alternative Formula for SSE

SSE = Syy - β̂1 Sxy,

where Syy = Σ(yi - ȳ)² and Sxy = Σ(xi - x̄)(yi - ȳ).

EXA Steer example. Find SSE.

Note: This alternative formula for SSE is a little
sensitive to round off error in β̂1.
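The round-off remarks can be checked numerically. The sketch below evaluates both alternative formulas for the steer data, first with full-precision coefficients and then with the coefficients rounded to four decimals (.1213 and .6283, the values used elsewhere in the slides). The function names sse_alt1 and sse_alt2 are made up for this example.

```r
# Steer data from the slides (hundreds of pounds)
x <- c(13.7, 12.4, 15.6, 11.1, 14.7, 14.9, 14.1, 12.1, 12.7)
y <- c(9.1, 8.2, 10.1, 6.8, 9.5, 9.1, 8.5, 7.8, 8.2)

fit <- lm(y ~ x)
b0 <- unname(coef(fit)[1])
b1 <- unname(coef(fit)[2])

# SSE from the definition
sse_def <- sum((y - (b0 + b1 * x))^2)

# Alternative formula 1: sum(y^2) - b0 * sum(y) - b1 * sum(x * y)
sse_alt1 <- function(b0, b1) sum(y^2) - b0 * sum(y) - b1 * sum(x * y)

# Alternative formula 2: Syy - b1 * Sxy
syy <- sum((y - mean(y))^2)
sxy <- sum((x - mean(x)) * (y - mean(y)))
sse_alt2 <- function(b1) syy - b1 * sxy

full    <- c(definition = sse_def, alt1 = sse_alt1(b0, b1), alt2 = sse_alt2(b1))
rounded <- c(alt1 = sse_alt1(0.1213, 0.6283), alt2 = sse_alt2(0.6283))

print(round(full, 4))      # all three agree at full precision
print(round(rounded, 4))   # rounded coefficients: alt1 drifts more than alt2
```

With full-precision coefficients all three values agree. With rounded coefficients the first formula moves noticeably further from the true SSE than the second, because it subtracts large, nearly equal quantities.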
Coefficient of Determination
How much of the y variation can be attributed
to the proposed linear regression model?

Total Sum of Squares = SST

$$\text{SST} = \sum_{i=1}^{n} (y_i - \bar{y})^2 = \sum_{i=1}^{n} y_i^2 - \Big(\sum_{i=1}^{n} y_i\Big)^2 / n$$
Coefficient of Determination
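Note that SSE ≤ SST. This follows from the
definition of least squares: the horizontal line
y = ȳ has sum of squared deviations equal to SST,
and the least-squares line minimizes that sum over
all lines.

Regression Sum of Squares (SSR) = SST - SSE.

Defn: The coefficient of determination is

r² = 1 - SSE/SST = SSR/SST.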
Interpretation of r²
SST - total amount of variation in observed y
values
SSE - amount of unexplained variation
SSR - amount of total variation explained by
the linear model

r² - proportion of observed y variation that can
be explained by the linear regression model
Comments about r²
(1) SSE ≤ SST ⇒ 0 ≤ SSE/SST ≤ 1
⇒ 0 ≤ r² ≤ 1

(2) r² close to 1 ⇒ line is a good fit

(3) r² close to 0 ⇒ line is a bad fit, so look
for another model
Is a Line a Good Fit?
Exploratory tools we have:

Make scatterplot – does it look linear?

Calculate r - is it close to 1 or -1?

Calculate r² – is it close to 1?

Make residual plot (residuals against x
values) – random scatter …

Calculate s² – is it small?

Look at all these together – not just one of
them by itself.
Is a Line a Good Fit?
Two other things:

Make a normal probability plot of residuals
(qqnorm(res)). Is it linear?

Plot residuals against fitted values. Is it a
random scatter of points about the 0 line in
a narrow band?
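A small R sketch of these two diagnostic plots for the steer data. qqline() is added as a visual reference line; it is not mentioned in the slides.

```r
# Steer data from the slides (hundreds of pounds)
x <- c(13.7, 12.4, 15.6, 11.1, 14.7, 14.9, 14.1, 12.1, 12.7)
y <- c(9.1, 8.2, 10.1, 6.8, 9.5, 9.1, 8.5, 7.8, 8.2)

fit  <- lm(y ~ x)
res  <- resid(fit)
yhat <- fitted(fit)

# Normal probability plot of the residuals, with a reference line
qq <- qqnorm(res, main = "Normal Probability Plot of Residuals")
qqline(res)

# Residuals against fitted values, with the 0 line
plot(yhat, res, xlab = "Predicted Dressed Weight", ylab = "Residual")
abline(h = 0)
```

With only 9 points, both plots are rough guides rather than firm evidence either way.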
Coefficient of Determination
EXA Steer example. Find the coefficient of
determination.
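A sketch of the computation in R for the steer data, following the definitions of SST, SSE and SSR given earlier. The variable names are illustrative, and summary(fit)$r.squared is used only as a check.

```r
# Steer data from the slides (hundreds of pounds)
x <- c(13.7, 12.4, 15.6, 11.1, 14.7, 14.9, 14.1, 12.1, 12.7)
y <- c(9.1, 8.2, 10.1, 6.8, 9.5, 9.1, 8.5, 7.8, 8.2)

fit <- lm(y ~ x)
sst <- sum((y - mean(y))^2)   # total sum of squares
sse <- sum(resid(fit)^2)      # error sum of squares
ssr <- sst - sse              # regression sum of squares
r2  <- 1 - sse / sst          # coefficient of determination

print(round(c(SST = sst, SSE = sse, SSR = ssr, r2 = r2), 4))

# Same quantity as reported by R ("Multiple R-squared")
print(summary(fit)$r.squared)
```

About 90% of the observed variation in dressed weight is explained by the linear regression on live weight.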
Inferences about β1

β̂1 is a point estimator of β1, where

$$\hat{\beta}_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(Y_i - \bar{Y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2} = s_{xy} / s_{xx}.$$
Distribution of β̂1

(1) E(β̂1) = β1.

(2) V(β̂1) =

$$\frac{\sigma^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2} = \frac{\sigma^2}{\sum_{i=1}^{n} x_i^2 - (\sum_{i=1}^{n} x_i)^2 / n} = \frac{\sigma^2}{s_{xx}}.$$
Distribution of β̂1

Estimated variance of β̂1 =

$$\frac{\text{SSE}}{(n - 2)\big[\sum_{i=1}^{n} x_i^2 - (\sum_{i=1}^{n} x_i)^2 / n\big]}.$$

(3) β̂1 is N(β1, V(β̂1)).


Inferences about β1
Theorem:

$$T = \frac{\hat{\beta}_1 - \beta_1}{\sqrt{\text{SSE} / \{(n - 2)[\sum_{i=1}^{n} x_i^2 - (\sum_{i=1}^{n} x_i)^2 / n]\}}}$$

has a t(n - 2) distribution.


100(1 - α)% c.i. for β1

A 100(1 - α)% c.i. for β1 is

$$\hat{\beta}_1 \pm t_{\alpha/2,\, n-2} \sqrt{\text{SSE} / \{(n - 2)[\sum_{i=1}^{n} x_i^2 - (\sum_{i=1}^{n} x_i)^2 / n]\}}$$

EXA Steer example. Find a 95% c.i. for β1.


Confidence Interval for β1
Interpretation of confidence interval found in
steer example:

We are 95% confident that for every pound of
live weight increase, the associated expected
change in dressed weight is between 0.4444
and 0.8122 pounds.
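The interval quoted above can be reproduced in R. This sketch applies the c.i. formula directly and then compares with confint(); the variable names are illustrative.

```r
# Steer data from the slides (hundreds of pounds)
x <- c(13.7, 12.4, 15.6, 11.1, 14.7, 14.9, 14.1, 12.1, 12.7)
y <- c(9.1, 8.2, 10.1, 6.8, 9.5, 9.1, 8.5, 7.8, 8.2)
n <- length(x)

fit <- lm(y ~ x)
b1  <- unname(coef(fit)[2])
sse <- sum(resid(fit)^2)
sxx <- sum(x^2) - sum(x)^2 / n

# Estimated standard error of the slope and the t critical value
se_b1  <- sqrt(sse / ((n - 2) * sxx))
t_crit <- qt(0.975, df = n - 2)

# 95% confidence interval for beta1
ci <- b1 + c(-1, 1) * t_crit * se_b1
print(round(ci, 4))

# Built-in equivalent
print(confint(fit, "x", level = 0.95))
```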
Hypothesis Testing
Consider testing

I.   H0: β1 ≤ β10    II.  H0: β1 ≥ β10    III. H0: β1 = β10
     H1: β1 > β10         H1: β1 < β10         H1: β1 ≠ β10

Test statistic is

$$T = \frac{\hat{\beta}_1 - \beta_{10}}{\sqrt{\text{SSE} / \{(n - 2)[\sum_{i=1}^{n} x_i^2 - (\sum_{i=1}^{n} x_i)^2 / n]\}}}$$
Hypothesis Testing
Reject H0 when
I. T > t_{α,n-2}    II. T < -t_{α,n-2}    III. |T| > t_{α/2,n-2}.

Typically, we want to test H0: β1 = 0 versus
H1: β1 ≠ 0.

This is called the model utility test. (H0 true
implies that Y does not depend on X in a linear
fashion.)
Hypothesis Testing
EXA Steer example. Test H0: β1 = 0 versus
H1: β1 ≠ 0 at α = .05.

Note: The exploratory tools discussed earlier are
subjective in nature (is it close enough?), but the
model utility test is an objective way to see if
there is evidence in support of a linear
relationship between X and Y.
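A sketch of the model utility test for the steer data in R, computing T by hand with β10 = 0 and comparing it with the t value reported by summary(). The variable names are illustrative.

```r
# Steer data from the slides (hundreds of pounds)
x <- c(13.7, 12.4, 15.6, 11.1, 14.7, 14.9, 14.1, 12.1, 12.7)
y <- c(9.1, 8.2, 10.1, 6.8, 9.5, 9.1, 8.5, 7.8, 8.2)
n <- length(x)

fit <- lm(y ~ x)
b1  <- unname(coef(fit)[2])
sse <- sum(resid(fit)^2)
sxx <- sum(x^2) - sum(x)^2 / n

# Test statistic for H0: beta1 = 0
t_stat <- (b1 - 0) / sqrt(sse / ((n - 2) * sxx))

# Two-sided rejection region at alpha = .05 and the p-value
t_crit  <- qt(1 - 0.05 / 2, df = n - 2)
p_value <- 2 * pt(-abs(t_stat), df = n - 2)

print(c(T = t_stat, t_crit = t_crit, p_value = p_value))
print(abs(t_stat) > t_crit)   # TRUE means reject H0

# The same t value appears in the "x" row of the coefficient table
print(summary(fit)$coefficients)
```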

The book gives the respective tests and c.i.'s for β0.


Analysis of Variance Approach to Testing

We can test H0: β1 = 0 versus H1: β1 ≠ 0 at
α = .05 by constructing an ANOVA table.
ANOVA Table

Source of Variation   d.f.    Sum of Squares   Mean Square         F
Regression            1       SSR              SSR                 SSR/s²
Error                 n - 2   SSE              s² = SSE/(n - 2)
Total                 n - 1   SST

Reject H0 when F > F_{α,1,n-2}.


ANOVA Approach to Model Utility Test

EXA Steer example. Use ANOVA table to test
H0: β1 = 0 versus H1: β1 ≠ 0 at α = .05.
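A sketch of the ANOVA computation for the steer data in R. The table entries are built from the definitions and then compared with anova(); variable names are illustrative.

```r
# Steer data from the slides (hundreds of pounds)
x <- c(13.7, 12.4, 15.6, 11.1, 14.7, 14.9, 14.1, 12.1, 12.7)
y <- c(9.1, 8.2, 10.1, 6.8, 9.5, 9.1, 8.5, 7.8, 8.2)
n <- length(x)

fit <- lm(y ~ x)
sst <- sum((y - mean(y))^2)
sse <- sum(resid(fit)^2)
ssr <- sst - sse
s2  <- sse / (n - 2)

# F statistic and critical value at alpha = .05
f_stat <- ssr / s2
f_crit <- qf(0.95, df1 = 1, df2 = n - 2)
print(c(SSR = ssr, SSE = sse, SST = sst, F = f_stat, F_crit = f_crit))
print(f_stat > f_crit)   # TRUE means reject H0

# R's ANOVA table for the same fit
tab <- anova(fit)
print(tab)
```

In simple linear regression this F statistic equals the square of the t statistic for H0: β1 = 0, so the two tests reach the same conclusion.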
Using R

EXA Steer example. Test H0: β1 = 0 versus
H1: β1 ≠ 0 at α = .05.
x=c(13.7,12.4,15.6,11.1,14.7,14.9,14.1,12.1,12.7)
y=c(9.1,8.2,10.1,6.8,9.5,9.1,8.5,7.8,8.2)
fit=lm(y~x)
coefficients(fit)
summary(fit)

What pieces do we recognize?


Correlation
The sample correlation coefficient, r, measures
the strength of the linear relationship between X
and Y.
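Defn:

$$r = \frac{S_{xy}}{\sqrt{S_{xx} S_{yy}}} = \frac{n \sum_{i=1}^{n} x_i y_i - (\sum_{i=1}^{n} x_i)(\sum_{i=1}^{n} y_i)}{\sqrt{[n \sum_{i=1}^{n} x_i^2 - (\sum_{i=1}^{n} x_i)^2][n \sum_{i=1}^{n} y_i^2 - (\sum_{i=1}^{n} y_i)^2]}}$$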
r and r²
Note: r is the square root of r² discussed earlier
(the coefficient of determination), where r has
the same sign as the slope of the line.

EXA Find r for steer data.
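A sketch of this calculation in R, using the definition r = Sxy / sqrt(Sxx Syy) and checking it against cor(). The variable names are illustrative.

```r
# Steer data from the slides (hundreds of pounds)
x <- c(13.7, 12.4, 15.6, 11.1, 14.7, 14.9, 14.1, 12.1, 12.7)
y <- c(9.1, 8.2, 10.1, 6.8, 9.5, 9.1, 8.5, 7.8, 8.2)

sxx <- sum((x - mean(x))^2)
syy <- sum((y - mean(y))^2)
sxy <- sum((x - mean(x)) * (y - mean(y)))

# Sample correlation coefficient from the definition
r <- sxy / sqrt(sxx * syy)
print(round(r, 4))

# Built-in equivalent, and the link to the coefficient of determination
print(cor(x, y))
print(r^2)
```

Here r is positive, matching the positive slope of the least-squares line.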


Properties of r
1. r does not depend on which are the
independent and dependent variables.

2. r is independent of the units in which x and y
are measured (feet to inches, pounds to
ounces).

3. -1 ≤ r ≤ 1.
Properties of r (continued)
4. r = 1 iff all points lie on a line with positive
slope.
r = -1 iff all points lie on a line with negative
slope.

5. The square of the sample correlation
coefficient equals the coefficient of
determination.
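Properties 1, 2 and 5 can be checked numerically on the steer data. The unit conversions below (hundreds of pounds to pounds, and to ounces) are chosen only for illustration.

```r
# Steer data from the slides (hundreds of pounds)
x <- c(13.7, 12.4, 15.6, 11.1, 14.7, 14.9, 14.1, 12.1, 12.7)
y <- c(9.1, 8.2, 10.1, 6.8, 9.5, 9.1, 8.5, 7.8, 8.2)

r <- cor(x, y)

# Property 1: swapping the roles of x and y does not change r
r_swapped <- cor(y, x)

# Property 2: changing units does not change r
r_pounds <- cor(100 * x, 100 * y)    # both in pounds
r_mixed  <- cor(100 * 16 * x, y)     # x in ounces, y unchanged

# Property 5: r^2 equals the coefficient of determination
r2_fit <- summary(lm(y ~ x))$r.squared

print(c(r = r, swapped = r_swapped, pounds = r_pounds, mixed = r_mixed))
print(c(r_squared = r^2, coef_of_determination = r2_fit))
```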
From Moore (2000). Basic Practice of Statistics
Rule of Thumb

Rule of Thumb for Interpretation of r

0 ≤ |r| ≤ .5 ⇒ weak correlation
.5 < |r| < .8 ⇒ moderate correlation
.8 ≤ |r| ≤ 1 ⇒ strong correlation
Inferences about ρ

ρ is the population correlation coefficient. R is a
point estimate of ρ. What is R?

Consider testing:

I.   H0: ρ ≤ 0    II.  H0: ρ ≥ 0    III. H0: ρ = 0
     H1: ρ > 0         H1: ρ < 0         H1: ρ ≠ 0
Inferences about ρ

Test statistic is

$$T = \frac{R \sqrt{n - 2}}{\sqrt{1 - R^2}}.$$

Under H0 and (X, Y) having a bivariate normal
distribution (page 115), T has a t(n - 2)
distribution.
Inferences about ρ

Reject H0 when

I. T > t_{α,n-2}    II. T < -t_{α,n-2}    III. |T| > t_{α/2,n-2}.

EXA Test H0: ρ ≤ 0 versus H1: ρ > 0 in steer
example at α = .05.
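A sketch of this one-sided test in R for the steer data. The statistic is computed from the formula above and compared with cor.test(), which by default runs the Pearson correlation test; the variable names are illustrative.

```r
# Steer data from the slides (hundreds of pounds)
x <- c(13.7, 12.4, 15.6, 11.1, 14.7, 14.9, 14.1, 12.1, 12.7)
y <- c(9.1, 8.2, 10.1, 6.8, 9.5, 9.1, 8.5, 7.8, 8.2)
n <- length(x)

r <- cor(x, y)

# Test statistic T = R * sqrt(n - 2) / sqrt(1 - R^2)
t_stat <- r * sqrt(n - 2) / sqrt(1 - r^2)

# Upper-tailed rejection region at alpha = .05 and the p-value
t_crit  <- qt(0.95, df = n - 2)
p_value <- pt(t_stat, df = n - 2, lower.tail = FALSE)
print(c(T = t_stat, t_crit = t_crit, p_value = p_value))
print(t_stat > t_crit)   # TRUE means reject H0

# Built-in equivalent with a one-sided alternative
ct <- cor.test(x, y, alternative = "greater")
print(ct)
```

The value of T here is the same as the t statistic in the model utility test for β1 = 0.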
