UNIT IV
Simple linear regression
Principle of least squares: The principle of least squares states that the curve of
best fit to the given data points is the one for which the sum of the squares of
the errors is a minimum.
The simple linear regression model is a model with a single regressor x
whose relationship with a response y is a straight line. This model is
given by
ŷ = a + bx ------------ (1)
where a is the intercept, b is the slope.
The parameters a and b are called regression coefficients.
Least Squares Estimation of the parameters:
We will use the method of least squares to estimate the parameters a and
b in (1). That is we will estimate a and b so that the sum of squares of the
differences between the observations yi and the straight line ŷ = a + bx is a
minimum.
From (1), we have
$y_i = a + b x_i$ ---------------- (2), i = 1, 2, ..., n
Equation (1) may be viewed as a population regression model whereas (2) is a
sample regression model, written in terms of the n pairs of data $(x_i, y_i)$, i =
1, 2, ..., n.
According to the least squares criterion,
$f(a, b) = \sum_{i=1}^{n} (y_i - a - b x_i)^2$ ------------------ (3)
is a minimum.
So we obtain the following equations:
$\frac{\partial f}{\partial a} = 0 \Rightarrow -2 \sum_{i=1}^{n} (y_i - a - b x_i) = 0$
$\frac{\partial f}{\partial b} = 0 \Rightarrow -2 \sum_{i=1}^{n} (y_i - a - b x_i)\, x_i = 0$
Simplifying the above equations, we get
$n a + b \sum_{i=1}^{n} x_i = \sum_{i=1}^{n} y_i$
$a \sum_{i=1}^{n} x_i + b \sum_{i=1}^{n} x_i^2 = \sum_{i=1}^{n} x_i y_i$ ------------ (4)
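Solving (4), we get $a = \hat{a}$ and $b = \hat{b}$. Note that
$\hat{a}$ and $\hat{b}$ are called the least squares estimators of a and b. Here the equations (4)
are called the least squares normal equations.
Since $(\bar{x}, \bar{y})$ lies on the least squares line, we have $\bar{y} = \hat{a} + \hat{b}\bar{x}$, so
$\hat{a} = \bar{y} - \hat{b}\bar{x}$ ------------ (5)
and
$\hat{b} = \dfrac{\sum_{i=1}^{n} x_i y_i - \frac{\left(\sum_{i=1}^{n} x_i\right)\left(\sum_{i=1}^{n} y_i\right)}{n}}{\sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}}$ ------------ (6)
Equation (5) is obtained from the first equation of (4) after dividing by n,
where $\bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i$ and $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$.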
The fitted simple linear regression model is given by
$\hat{y} = \hat{a} + \hat{b}x$ ------------ (7)
Equation (7) gives the point estimate of the mean of y for a particular x.
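Let us denote
$S_{xx} = \sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}$
$S_{yy} = \sum_{i=1}^{n} y_i^2 - \frac{\left(\sum_{i=1}^{n} y_i\right)^2}{n}$
and
$S_{xy} = \sum_{i=1}^{n} x_i y_i - \frac{\left(\sum_{i=1}^{n} x_i\right)\left(\sum_{i=1}^{n} y_i\right)}{n}$
Now equation (6) can be written as
$\hat{b} = \frac{S_{xy}}{S_{xx}}$ ------------ (8)
The difference between the observed value $y_i$ and the corresponding fitted
value $\hat{y}_i$ is called the residual.
That is, $e_i = y_i - \hat{y}_i = y_i - (\hat{a} + \hat{b} x_i)$, i = 1, 2, ..., n ------------ (9)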
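The formulas for $S_{xx}$, $S_{xy}$, the two estimators and the residuals can be sketched in a few lines of Python. This is only a minimal illustration: the function name `least_squares_line` and the four data points are assumptions made for the example and do not come from these notes.

```python
def least_squares_line(x, y):
    """Return (a_hat, b_hat, residuals) for the fitted line y = a + b*x."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    # S_xx and S_xy exactly as defined in the text
    s_xx = sum(xi * xi for xi in x) - sum_x ** 2 / n
    s_xy = sum(xi * yi for xi, yi in zip(x, y)) - sum_x * sum_y / n
    b_hat = s_xy / s_xx                      # slope = S_xy / S_xx
    a_hat = sum_y / n - b_hat * sum_x / n    # intercept = y_bar - b_hat * x_bar
    residuals = [yi - (a_hat + b_hat * xi) for xi, yi in zip(x, y)]
    return a_hat, b_hat, residuals


# Illustrative data lying exactly on y = 1 + 2x (hypothetical, not from the notes)
x = [1, 2, 3, 4]
y = [3, 5, 7, 9]
a_hat, b_hat, residuals = least_squares_line(x, y)
print(a_hat, b_hat, residuals)
```

Because the illustrative points lie exactly on a straight line, every residual is zero; with real data the residuals are non-zero but still sum to zero.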
Problem 1: The following are measurements of the air velocity and evaporation
coefficient of burning fuel droplets in an impulse engine:
Air velocity (cm/sec), x:              20    60    100   140   180   220   260   300   340   380
Evaporation coefficient (mm²/sec), y:  0.18  0.37  0.35  0.78  0.56  0.75  1.18  1.36  1.17  1.65
Fit a simple linear regression line to the above data.
Solution: Let $\hat{y} = \hat{a} + \hat{b}x$ be the simple linear regression line.
Here n = 10,
$\sum_{i=1}^{10} x_i = 2000$, $\sum_{i=1}^{10} x_i^2 = 532000$, $\sum_{i=1}^{10} y_i = 8.35$,
$\sum_{i=1}^{10} x_i y_i = 2175.40$, $\sum_{i=1}^{10} y_i^2 = 9.1097$
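$S_{xx} = \sum_{i=1}^{10} x_i^2 - \frac{\left(\sum_{i=1}^{10} x_i\right)^2}{n} = 532000 - \frac{(2000)^2}{10} = 132000$
$S_{xy} = \sum_{i=1}^{10} x_i y_i - \frac{\left(\sum_{i=1}^{10} x_i\right)\left(\sum_{i=1}^{10} y_i\right)}{n} = 2175.40 - \frac{(2000)(8.35)}{10} = 505.40$
$S_{yy} = \sum_{i=1}^{10} y_i^2 - \frac{\left(\sum_{i=1}^{10} y_i\right)^2}{n} = 9.1097 - \frac{(8.35)^2}{10} = 2.13745$
Now, $\hat{b} = \frac{S_{xy}}{S_{xx}} = \frac{505.40}{132000} = 0.00383$
$\hat{a} = \bar{y} - \hat{b}\bar{x} = \frac{8.35}{10} - 0.00383 \times \frac{2000}{10} = 0.069$
$\hat{y} = 0.069 + 0.00383x$ is the least squares regression line.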
Or you can solve the normal equations
10 a + 2000b = 8.35
2000a + 532000b = 2175.40 to find the values of a and b.
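As a cross-check, the two normal equations can be solved directly. The sketch below uses Cramer's rule, which is one possible method and not necessarily the one intended here, and the variable names are chosen only for this illustration.

```python
# Data of Problem 1
x = [20, 60, 100, 140, 180, 220, 260, 300, 340, 380]
y = [0.18, 0.37, 0.35, 0.78, 0.56, 0.75, 1.18, 1.36, 1.17, 1.65]

n = len(x)
sum_x = sum(x)
sum_y = sum(y)
sum_xx = sum(xi * xi for xi in x)
sum_xy = sum(xi * yi for xi, yi in zip(x, y))

# Normal equations:
#   n*a     + sum_x*b  = sum_y
#   sum_x*a + sum_xx*b = sum_xy
det = n * sum_xx - sum_x * sum_x
a_hat = (sum_y * sum_xx - sum_x * sum_xy) / det
b_hat = (n * sum_xy - sum_x * sum_y) / det

print(round(a_hat, 3), round(b_hat, 5))
```

Both routes (the $S_{xy}/S_{xx}$ formula and the normal equations) give the same line, since the first is just the closed-form solution of the second.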
Problem: The following data pertain to the number of computer jobs per day
and the central processing unit (CPU) time required:
Number of jobs (x)   1   2   3   4   5
CPU time (y)         2   5   4   9   10
(i) Fit a least squares line $\hat{y} = \hat{a} + \hat{b}x$.
(ii) Predict the mean CPU time when x = 3.5.
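Solution: (i) $\sum x_i = 15$, $\sum y_i = 30$, $\sum x_i^2 = 55$, $\sum y_i^2 = 226$, $\sum x_i y_i = 110$, $\bar{x} = 3$ and $\bar{y} = 6$
$S_{xx} = 10$, $S_{yy} = 46$, $S_{xy} = 20$
$\hat{b} = \frac{S_{xy}}{S_{xx}} = \frac{20}{10} = 2$
$\hat{a} = \bar{y} - \hat{b}\bar{x} = 6 - (2)(3) = 0$
The least squares line is $\hat{y} = 2x$.
(ii) When x = 3.5, $\hat{y} = 2 \times 3.5 = 7$.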
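The residuals of this fitted line can be checked numerically. The short sketch below takes the estimates 0 and 2 from the solution above and verifies two properties that follow from the normal equations: the residuals sum to zero, and so do the residuals weighted by x. Variable names are illustrative only.

```python
x = [1, 2, 3, 4, 5]
y = [2, 5, 4, 9, 10]
a_hat, b_hat = 0.0, 2.0   # estimates obtained in the solution above

fitted = [a_hat + b_hat * xi for xi in x]
residuals = [yi - fi for yi, fi in zip(y, fitted)]

print(residuals)                                      # residual e_i for each observation
print(sum(residuals))                                 # first normal equation
print(sum(xi * ei for xi, ei in zip(x, residuals)))   # second normal equation

prediction = a_hat + b_hat * 3.5   # mean CPU time predicted at x = 3.5
```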
Problem: A chemical company, wishing to study the effect of extraction time
on the efficiency of an extraction operation, obtained the data shown in the
following table:
Extraction time, x (minutes)    Extraction efficiency, y (%)
27                              57
45                              64
41                              80
19                              46
35                              62
39                              72
19                              52
49                              77
15                              57
31                              68
Fit a straight line to the given data by the method of least squares and use it to
predict the extraction efficiency one can expect when the extraction time is 35
minutes.
Solution:
Let $\hat{y} = \hat{a} + \hat{b}x$ be the least squares straight line.
$\sum x_i = 320$, $\sum y_i = 635$, $\sum x_i^2 = 11490$, $\sum y_i^2 = 41395$, $\sum x_i y_i = 21275$, $\bar{x} = 32$ and $\bar{y} = 63.5$
$S_{xx} = 1250$, $S_{yy} = 1072.5$, $S_{xy} = 955$
$\hat{b} = \frac{S_{xy}}{S_{xx}} = \frac{955}{1250} = 0.764$
$\hat{a} = \bar{y} - \hat{b}\bar{x} = 63.5 - (0.764)(32) = 39.052$
The least squares line is $\hat{y} = 39.052 + 0.764x$.
When x = 35, $\hat{y} = 39.052 + 0.764 \times 35 = 65.792$.
Regression :
The process of estimating the best possible values of one variable given
the values of another variable through a least squares curve is called regression.
Regression is of two types: linear regression and curvilinear regression.
Curvilinear regression: The process of estimating the best possible values of
one variable from the known values of the other variable through a least squares
curve that is not a straight line is called curvilinear regression.
Suppose we have to fit a curve $y = a + bx + cx^2$ to the given data points $(x_i, y_i)$;
i = 1, 2, ..., n.
Normal equations to fit the above curve are
$n a + b \sum_{i=1}^{n} x_i + c \sum_{i=1}^{n} x_i^2 = \sum_{i=1}^{n} y_i$
$a \sum_{i=1}^{n} x_i + b \sum_{i=1}^{n} x_i^2 + c \sum_{i=1}^{n} x_i^3 = \sum_{i=1}^{n} x_i y_i$
$a \sum_{i=1}^{n} x_i^2 + b \sum_{i=1}^{n} x_i^3 + c \sum_{i=1}^{n} x_i^4 = \sum_{i=1}^{n} x_i^2 y_i$
Solving the above equations we get the values of a, b and c. Let us denote
them by $\hat{a}$, $\hat{b}$ and $\hat{c}$ respectively. So, the least squares best fit to the given data
points is $\hat{y} = \hat{a} + \hat{b}x + \hat{c}x^2$.
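The three normal equations come from the same minimisation that was used for the straight line. A short derivation sketch follows; the name $f(a, b, c)$ for the sum of squared errors is notation assumed here by analogy with (3).

```latex
\begin{align*}
f(a,b,c) &= \sum_{i=1}^{n} \left(y_i - a - b x_i - c x_i^2\right)^2 \\
\frac{\partial f}{\partial a} = 0 &\;\Rightarrow\; \sum_{i=1}^{n} \left(y_i - a - b x_i - c x_i^2\right) = 0 \\
\frac{\partial f}{\partial b} = 0 &\;\Rightarrow\; \sum_{i=1}^{n} \left(y_i - a - b x_i - c x_i^2\right) x_i = 0 \\
\frac{\partial f}{\partial c} = 0 &\;\Rightarrow\; \sum_{i=1}^{n} \left(y_i - a - b x_i - c x_i^2\right) x_i^2 = 0
\end{align*}
```

Expanding each sum and moving the terms containing $y_i$ to the right-hand side gives the three normal equations for a, b and c.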
Problem: The following are data on the drying time of a certain varnish and the
amount of an additive that is intended to reduce the drying time:
Amount of varnish additive (grams), x    Drying time (hours), y
0                                        12.0
1                                        10.5
2                                        10.0
3                                        8.0
4                                        7.0
5                                        8.0
6                                        7.5
7                                        8.5
8                                        9.0
Fit a second degree polynomial $y = a + bx + cx^2$ by the method of least squares.
Also predict the drying time of the varnish when 6.5 grams of additive is used.
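Solution: Normal equations to fit $y = a + bx + cx^2$ are
$n a + b \sum_{i=1}^{n} x_i + c \sum_{i=1}^{n} x_i^2 = \sum_{i=1}^{n} y_i$
$a \sum_{i=1}^{n} x_i + b \sum_{i=1}^{n} x_i^2 + c \sum_{i=1}^{n} x_i^3 = \sum_{i=1}^{n} x_i y_i$
$a \sum_{i=1}^{n} x_i^2 + b \sum_{i=1}^{n} x_i^3 + c \sum_{i=1}^{n} x_i^4 = \sum_{i=1}^{n} x_i^2 y_i$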
Here n = 9
Table:
x      y      x²    x³     x⁴     xy     x²y
0      12     0     0      0      0      0
1      10.5   1     1      1      10.5   10.5
2      10     4     8      16     20     40
3      8      9     27     81     24     72
4      7      16    64     256    28     112
5      8      25    125    625    40     200
6      7.5    36    216    1296   45     270
7      8.5    49    343    2401   59.5   416.5
8      9      64    512    4096   72     576
Total  36     80.5  204    1296   8772   299    1697
(totals in column order: x, y, x², x³, x⁴, xy, x²y)
Now we have the equations
9a + 36b + 204c = 80.5
36a + 204b + 1296c = 299
204a + 1296b + 8772c = 1697
Solving the above equations by the known methods we get
a = 12.2, b = -1.85 and c = 0.183
Hence the least squares second degree polynomial is $\hat{y} = 12.2 - 1.85x + 0.183x^2$.
When x = 6.5, $\hat{y} = 12.2 - 1.85 \times 6.5 + 0.183 \times (6.5)^2 = 7.9$.
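The "known methods" for solving the three normal equations include elimination. As a hedged illustration, the sketch below builds the normal equations from the raw data and solves them with a small Gauss-Jordan routine; `solve_linear_system` is a helper written for this example, not a library function.

```python
def solve_linear_system(matrix, rhs):
    """Solve a small square system by Gauss-Jordan elimination with partial pivoting."""
    n = len(rhs)
    aug = [list(row) + [r] for row, r in zip(matrix, rhs)]   # augmented matrix
    for col in range(n):
        # pick the row with the largest entry in this column as the pivot row
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]


# Varnish data from the problem
x = [0, 1, 2, 3, 4, 5, 6, 7, 8]
y = [12.0, 10.5, 10.0, 8.0, 7.0, 8.0, 7.5, 8.5, 9.0]

# Power sums: s[k] = sum of x^k (s[0] = n), t[k] = sum of x^k * y
s = [sum(xi ** k for xi in x) for k in range(5)]
t = [sum((xi ** k) * yi for xi, yi in zip(x, y)) for k in range(3)]

normal_matrix = [[s[i + j] for j in range(3)] for i in range(3)]
a_hat, b_hat, c_hat = solve_linear_system(normal_matrix, t)

prediction = a_hat + b_hat * 6.5 + c_hat * 6.5 ** 2
print(round(a_hat, 2), round(b_hat, 2), round(c_hat, 3), round(prediction, 1))
```

The rows of `normal_matrix` reproduce the coefficients 9, 36, 204, 1296 and 8772 from the table, so the computed values agree with the hand solution after rounding.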
Problem: Fit a second degree polynomial $y = a + bx + cx^2$ by the method of
least squares to the following data
x      y
1      1.1
1.5    1.3
2      1.6
2.5    2
3      2.7
3.5    3.4
4      4.1
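Solution: Normal equations to fit $y = a + bx + cx^2$ are
$n a + b \sum_{i=1}^{n} x_i + c \sum_{i=1}^{n} x_i^2 = \sum_{i=1}^{n} y_i$
$a \sum_{i=1}^{n} x_i + b \sum_{i=1}^{n} x_i^2 + c \sum_{i=1}^{n} x_i^3 = \sum_{i=1}^{n} x_i y_i$
$a \sum_{i=1}^{n} x_i^2 + b \sum_{i=1}^{n} x_i^3 + c \sum_{i=1}^{n} x_i^4 = \sum_{i=1}^{n} x_i^2 y_i$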
Here n = 7
Table:
x      y      x²      x³       x⁴       xy      x²y
1      1.1    1       1        1        1.1     1.1
1.5    1.3    2.25    3.375    5.0625   1.95    2.925
2      1.6    4       8        16       3.2     6.4
2.5    2      6.25    15.625   39.063   5       12.5
3      2.7    9       27       81       8.1     24.3
3.5    3.4    12.25   42.875   150.06   11.9    41.65
4      4.1    16      64       256      16.4    65.6
Total  17.5   16.2    50.75    161.88   548.19  47.65   154.48
(totals in column order: x, y, x², x³, x⁴, xy, x²y)
Now we have the equations
7a + 17.5b + 50.75c = 16.2
17.5a + 50.75b + 161.88c = 47.65
50.75a + 161.88b + 548.19c = 154.48
Solving the above equations by the known methods we get
a = 1.0477, b = -0.204 and c = 0.2451
Hence the least squares second degree polynomial is $\hat{y} = 1.0477 - 0.204x + 0.2451x^2$.
The Regression analysis for studying more than two variables at a time is
known as Multiple Regression.
Fitting of other curves
(1) To fit $y = ab^x$ ------------------ (1)
Taking natural logarithms on both sides we get
ln y = ln a + x ln b
i.e., Y = A + Bx --------------------- (2)
where Y = ln y, A = ln a and B = ln b.
Normal equations to fit (2) are
$nA + B \sum_{i=1}^{n} x_i = \sum_{i=1}^{n} Y_i$
$A \sum_{i=1}^{n} x_i + B \sum_{i=1}^{n} x_i^2 = \sum_{i=1}^{n} x_i Y_i$ ------------ (3)
Solve (3) for A and B and hence find the values of a and b, where $a = e^A$
and $b = e^B$.
(2) To fit $y = ax^b$ -------------------- (1)
Taking natural logarithms on both sides we get
ln y = ln a + b ln x
i.e., Y = A + bX ---------------- (2)
where Y = ln y, A = ln a and X = ln x.
Normal equations to fit (2) are
$nA + b \sum_{i=1}^{n} X_i = \sum_{i=1}^{n} Y_i$
$A \sum_{i=1}^{n} X_i + b \sum_{i=1}^{n} X_i^2 = \sum_{i=1}^{n} X_i Y_i$ ------------ (3)
Solve (3) for A and b and hence find the values of a and b, where $a = e^A$.
(3) To fit $y = ae^{bx}$ ----------------------- (1)
Taking natural logarithms on both sides we get
ln y = ln a + bx
i.e., Y = A + bx --------------------------- (2)
where Y = ln y and A = ln a.
Normal equations to fit (2) are
$nA + b \sum_{i=1}^{n} x_i = \sum_{i=1}^{n} Y_i$
$A \sum_{i=1}^{n} x_i + b \sum_{i=1}^{n} x_i^2 = \sum_{i=1}^{n} x_i Y_i$ ------------ (3)
Solve (3) for A and b and hence find the values of a and b, where $a = e^A$.
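The log transformation used in case (3) can be sketched as follows. The function name `fit_exponential` and the synthetic data, generated exactly from $y = 2e^{0.5x}$, are assumptions made for this illustration.

```python
import math


def fit_exponential(x, y):
    """Fit y = a*exp(b*x) by least squares on the line Y = ln y = A + b*x."""
    n = len(x)
    Y = [math.log(v) for v in y]          # transformed response
    sum_x, sum_Y = sum(x), sum(Y)
    s_xx = sum(xi * xi for xi in x) - sum_x ** 2 / n
    s_xY = sum(xi * Yi for xi, Yi in zip(x, Y)) - sum_x * sum_Y / n
    b = s_xY / s_xx
    A = sum_Y / n - b * sum_x / n
    return math.exp(A), b                 # a = e^A


# Synthetic data lying exactly on y = 2*exp(0.5*x) (hypothetical)
x = [0, 1, 2, 3]
y = [2 * math.exp(0.5 * xi) for xi in x]
a, b = fit_exponential(x, y)
print(round(a, 6), round(b, 6))
```

Note that this procedure minimises the squared errors in ln y rather than in y itself, so it is a least squares fit of the transformed line (2). Cases (1) and (2) work the same way with the appropriate transformations.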
Problem: An experiment gave the following values:
x: 61 26 7 26
y: 350 400 500 600
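Using the principle of least squares, fit a curve $y = ax^b$.
Solution: With X = ln x, Y = ln y and A = ln a, the normal equations to fit $y = ax^b$ are
$nA + b \sum_{i=1}^{n} X_i = \sum_{i=1}^{n} Y_i$
$A \sum_{i=1}^{n} X_i + b \sum_{i=1}^{n} X_i^2 = \sum_{i=1}^{n} X_i Y_i$ ------------ (1)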
Table
x      y      X = ln x   Y = ln y   X²      XY
61     350    4.11       5.86       16.90   24.08
26     400    3.26       5.99       10.62   19.52
7      500    1.95       6.21       3.79    12.09
26     600    3.26       6.40       10.62   20.84
Total  120    1850       12.57      24.46   41.92   76.54
(totals in column order: x, y, X, Y, X², XY)
Then (1) becomes
4A + 12.57b = 24.46 ------------------ (2)
12.57A + 41.92b = 76.54 ------------------- (3)
Solving (2) and (3) we get A = 6.538 and b = -0.1346.
Now $a = e^{6.538} = 690.90$ and b = -0.1346. So, the curve is $y = 690.90\,x^{-0.1346}$.
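The same fit can be reproduced numerically. The sketch below keeps the logarithms at full precision. Because the two normal equations are nearly proportional, rounding the table entries to two decimals affects the solution, so the computed values are expected to differ somewhat from the hand-computed ones (b comes out near -0.146 and a near 716). Variable names are illustrative.

```python
import math

x = [61, 26, 7, 26]
y = [350, 400, 500, 600]

X = [math.log(v) for v in x]   # X = ln x
Y = [math.log(v) for v in y]   # Y = ln y
n = len(x)

sum_X, sum_Y = sum(X), sum(Y)
sum_XX = sum(v * v for v in X)
sum_XY = sum(u * v for u, v in zip(X, Y))

# Solve  n*A + sum_X*b = sum_Y  and  sum_X*A + sum_XX*b = sum_XY  by Cramer's rule
det = n * sum_XX - sum_X ** 2
A = (sum_Y * sum_XX - sum_X * sum_XY) / det
b = (n * sum_XY - sum_X * sum_Y) / det
a = math.exp(A)

print(round(sum_X, 2), round(sum_Y, 2), round(sum_XX, 2), round(sum_XY, 2))
print(a, b)
```

The first printed line reproduces the column totals of the table, which confirms that the difference in the final answer comes from rounding and not from the method.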
Homework:
(1) Fit a curve $y = ab^x$ to the data given below:
x: 2 3 4 5 6
y: 144 172 207 248 298
(2) Fit a curve $y = ae^{bx}$ to the data given below:
x: 2 4 6 8
y: 25 38 56 84
Multiple Linear regression
Let y denote the dependent variable that is linearly related to k independent
variables $x_1, x_2, \ldots, x_k$ through the parameters $\beta_0, \beta_1, \ldots, \beta_k$. The multiple linear
regression model with k regressors is given by
$y = \beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k$ ------------ (1)
Data for multiple linear regression
Observation   Response y   Regressors
                           x1     x2     ...   xk
1             y1           x11    x12    ...   x1k
2             y2           x21    x22    ...   x2k
...           ...          ...    ...    ...   ...
n             yn           xn1    xn2    ...   xnk
The sample regression model corresponding to (1) is given by
$y_i = \beta_0 + \sum_{j=1}^{k} \beta_j x_{ij} + \varepsilon_i$, i = 1, 2, ..., n ------------ (2)
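The least squares function is given by
$S(\beta_0, \beta_1, \ldots, \beta_k) = \sum_{i=1}^{n} \varepsilon_i^2 = \sum_{i=1}^{n} \left(y_i - \beta_0 - \sum_{j=1}^{k} \beta_j x_{ij}\right)^2$ ------------ (3)
For the function S to have a minimum, we need
$\frac{\partial S}{\partial \beta_0} = 0,\ \frac{\partial S}{\partial \beta_1} = 0,\ \ldots,\ \frac{\partial S}{\partial \beta_k} = 0$ ------------ (4)
Solving (4) for $\beta_0, \beta_1, \ldots, \beta_k$ we get the least squares estimators $\hat{\beta}_0, \hat{\beta}_1, \ldots, \hat{\beta}_k$ respectively.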
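The least squares normal equations are given by
$n\hat{\beta}_0 + \hat{\beta}_1 \sum_{i=1}^{n} x_{i1} + \hat{\beta}_2 \sum_{i=1}^{n} x_{i2} + \cdots + \hat{\beta}_k \sum_{i=1}^{n} x_{ik} = \sum_{i=1}^{n} y_i$
$\hat{\beta}_0 \sum_{i=1}^{n} x_{i1} + \hat{\beta}_1 \sum_{i=1}^{n} x_{i1}^2 + \hat{\beta}_2 \sum_{i=1}^{n} x_{i1} x_{i2} + \cdots + \hat{\beta}_k \sum_{i=1}^{n} x_{i1} x_{ik} = \sum_{i=1}^{n} x_{i1} y_i$
...
$\hat{\beta}_0 \sum_{i=1}^{n} x_{ik} + \hat{\beta}_1 \sum_{i=1}^{n} x_{ik} x_{i1} + \hat{\beta}_2 \sum_{i=1}^{n} x_{ik} x_{i2} + \cdots + \hat{\beta}_k \sum_{i=1}^{n} x_{ik}^2 = \sum_{i=1}^{n} x_{ik} y_i$ ------------ (5)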
Problem: Fit the linear regression model y = β0 + β1x1 + β2x2 for the following
data:
y 41 49 69 65 40 50 58 57 31 36 44 57 19 31 33 43
x1 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4
x2 5 5 5 5 10 10 10 10 15 15 15 15 20 20 20 20
Find y when x1 = 2.5 and x2 = 12
Solution:
Table
y      x1    x2    x1²   x1x2   x2²    x1y    x2y
41     1     5     1     5      25     41     205
49     2     5     4     10     25     98     245
69     3     5     9     15     25     207    345
65     4     5     16    20     25     260    325
40     1     10    1     10     100    40     400
50     2     10    4     20     100    100    500
58     3     10    9     30     100    174    580
57     4     10    16    40     100    228    570
31     1     15    1     15     225    31     465
36     2     15    4     30     225    72     540
44     3     15    9     45     225    132    660
57     4     15    16    60     225    228    855
19     1     20    1     20     400    19     380
31     2     20    4     40     400    62     620
33     3     20    9     60     400    99     660
43     4     20    16    80     400    172    860
Total  723   40    200   120    500    3000   1963   8210
(totals in column order: y, x1, x2, x1², x1x2, x2², x1y, x2y)
From the above table n = 16, $\sum x_1 = 40$, $\sum x_2 = 200$, $\sum x_1^2 = 120$, $\sum x_1 x_2 = 500$,
$\sum x_2^2 = 3000$, $\sum y = 723$, $\sum x_1 y = 1963$ and $\sum x_2 y = 8210$.
Normal equations to fit the model are
$n\hat{\beta}_0 + \hat{\beta}_1 \sum x_1 + \hat{\beta}_2 \sum x_2 = \sum y$
$\hat{\beta}_0 \sum x_1 + \hat{\beta}_1 \sum x_1^2 + \hat{\beta}_2 \sum x_1 x_2 = \sum x_1 y$
$\hat{\beta}_0 \sum x_2 + \hat{\beta}_1 \sum x_1 x_2 + \hat{\beta}_2 \sum x_2^2 = \sum x_2 y$
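Substituting the above values in the normal equations we get
$16\hat{\beta}_0 + 40\hat{\beta}_1 + 200\hat{\beta}_2 = 723$
$40\hat{\beta}_0 + 120\hat{\beta}_1 + 500\hat{\beta}_2 = 1963$
$200\hat{\beta}_0 + 500\hat{\beta}_1 + 3000\hat{\beta}_2 = 8210$
Solving the above system of equations by a known method we get
$\hat{\beta}_0 = \frac{743}{16}$, $\hat{\beta}_1 = \frac{311}{40}$, $\hat{\beta}_2 = -\frac{331}{200}$
Therefore $\hat{y} = 46.4375 + 7.775x_1 - 1.655x_2$. So when $x_1 = 2.5$ and $x_2 = 12$ the value of y
is given by $\hat{y} = 46.015 \approx 46$.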
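The fractions in the solution can be reproduced with exact rational arithmetic. The sketch below builds the normal equations from the data columns and solves them by Gauss-Jordan elimination using Python's `fractions.Fraction`. The elimination loop is written out for this example, and the names `columns`, `A` and `rhs` are illustrative.

```python
from fractions import Fraction

y = [41, 49, 69, 65, 40, 50, 58, 57, 31, 36, 44, 57, 19, 31, 33, 43]
x1 = [1, 2, 3, 4] * 4
x2 = [5] * 4 + [10] * 4 + [15] * 4 + [20] * 4

# A column of ones for the intercept, followed by the regressors
columns = [[1] * len(y), x1, x2]

# Normal equations: A[j][k] = sum_i x_ij * x_ik,  rhs[j] = sum_i x_ij * y_i
A = [[Fraction(sum(u * v for u, v in zip(cj, ck))) for ck in columns] for cj in columns]
rhs = [Fraction(sum(u * v for u, v in zip(cj, y))) for cj in columns]

# Gauss-Jordan elimination in exact arithmetic
m = len(rhs)
for col in range(m):
    pivot = next(r for r in range(col, m) if A[r][col] != 0)
    A[col], A[pivot] = A[pivot], A[col]
    rhs[col], rhs[pivot] = rhs[pivot], rhs[col]
    for r in range(m):
        if r != col:
            f = A[r][col] / A[col][col]
            A[r] = [p - f * q for p, q in zip(A[r], A[col])]
            rhs[r] -= f * rhs[col]

beta = [rhs[i] / A[i][i] for i in range(m)]
y_hat = beta[0] + beta[1] * Fraction(5, 2) + beta[2] * 12   # x1 = 2.5, x2 = 12

print(beta)
print(float(y_hat))
```

Using `Fraction` avoids rounding altogether, which makes it easy to compare the result against the hand-computed fractions.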
Bivariate Data:
In bivariate data, two variables are observed. One variable is
independent and the other is dependent. These variables are usually denoted by
X and Y. Here we analyze the changes that occur between the two variables.
Thus, bivariate analysis is the analysis of exactly two variables. Multivariate
analysis is the analysis of more than two variables.
Ex: Income – Expenditure
In this example, Income is the independent variable and Expenditure is
the dependent variable. The Expenditure is determined by Income. Having more
Income increases the Expenditure, but increasing Expenditure will not increase
the Income. Such types of variables are called mutually dependent variables.
Correlation :
Correlation is a statistical technique that studies the relationship
between two variables. In other words, it is used to measure and describe
the strength and direction of the relationship between two variables. Two
variables are said to be correlated when changes in the values of one variable
are associated with changes in the values of the other variable.
Types of Correlation:
(a) Positive Correlation: If the values of two variables deviate in the same
direction i.e., if the increase(decrease) in the values of one variable results
in a corresponding increase(decrease) in the values of the other variable,
such type of Correlation is said to be positive correlation.
Ex: Price and Supply of a commodity, Rainfall and Yield of crop, Heights
and weights etc.
(b) Negative Correlation: Correlation is said to be Negative if the values
deviate in the opposite direction i.e., if the increase(decrease) in the
values of one variable results in a corresponding decrease(increase) in the
values of the other variable.
Ex: Price and Demand of a commodity, Volume and Pressure, Sale of
woolen garments and the Day temperature etc.
(c) Linear and Non-linear Correlation: The Correlation between two
variables is said to be linear if corresponding to a unit change in one
variable, there is a constant change in the other variable.
Ex: X : 1 2 3 4 5
Y : 5 7 9 11 13
The relationship between two variables is said to be non-linear if,
corresponding to a unit change in one variable, the other variable does not
change at a constant rate but at a fluctuating rate.
(d) Uncorrelated variables: If a change in one variable does not affect the other
variable, the two variables are said to be “Uncorrelated”.
Ex: Rainfall and Intelligence.
Scatter diagram: It is a diagram which is obtained by plotting the paired
observations of mutually dependent variables. With the help of the Scatter
diagram we are able to identify the type of correlation that exists between two
mutually dependent variables.
Correlation Coefficient :
The quantitative measure of correlation is called the “Correlation
Coefficient”. In other words, it is a measure of the intensity or degree of linear
relationship between two variables. It was proposed by Karl Pearson.
The correlation coefficient between two variables X and Y is usually denoted by $r_{XY}$
and is defined as
$r_{XY} = \frac{\mathrm{Cov}(X, Y)}{\sigma_X \sigma_Y} = \frac{n\sum xy - \sum x \sum y}{\sqrt{n\sum x^2 - \left(\sum x\right)^2}\,\sqrt{n\sum y^2 - \left(\sum y\right)^2}}$
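The computational form of the formula can be sketched as a small function. The name `pearson_r` is chosen for this example. The test data are the X and Y values from the linear-correlation example earlier in these notes (Y = 2X + 3), for which the coefficient should be 1 up to floating-point rounding.

```python
import math


def pearson_r(x, y):
    """Karl Pearson's correlation coefficient from paired observations."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(xi * yi for xi, yi in zip(x, y))
    sum_xx = sum(xi * xi for xi in x)
    sum_yy = sum(yi * yi for yi in y)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt(n * sum_xx - sum_x ** 2) * math.sqrt(n * sum_yy - sum_y ** 2)
    return numerator / denominator


r_linear = pearson_r([1, 2, 3, 4, 5], [5, 7, 9, 11, 13])     # perfectly linear, increasing
r_reversed = pearson_r([1, 2, 3, 4, 5], [13, 11, 9, 7, 5])   # perfectly linear, decreasing
print(r_linear, r_reversed)
```

A value near +1 indicates a strong positive linear relationship and a value near -1 a strong negative one, matching the types of correlation described above.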
Problem: Calculate the correlation coefficient between the Age and B.P. from
the following data.
Age(X) 56 42 72 36 63 47 55 49 38 42 68 60
B.P(Y) 147 125 160 118 149 128 150 145 115 140 152 155
Solution:
Table:
x      y      xy      x²      y²
56     147    8232    3136    21609
42     125    5250    1764    15625
72     160    11520   5184    25600
36     118    4248    1296    13924
63     149    9387    3969    22201
47     128    6016    2209    16384
55     150    8250    3025    22500
49     145    7105    2401    21025
38     115    4370    1444    13225
42     140    5880    1764    19600
68     152    10336   4624    23104
60     155    9300    3600    24025
Total  628    1684    89894   34416   238822
(totals in column order: x, y, xy, x², y²)
$r_{XY} = \frac{n\sum xy - \sum x \sum y}{\sqrt{n\sum x^2 - \left(\sum x\right)^2}\,\sqrt{n\sum y^2 - \left(\sum y\right)^2}}$
$= \frac{12 \times 89894 - 628 \times 1684}{\sqrt{12 \times 34416 - (628)^2}\,\sqrt{12 \times 238822 - (1684)^2}}$
$r_{XY} = 0.8961$
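When the column totals are already available, as in the table above, the coefficient can be computed from them directly. The helper below, named `r_from_totals` for this illustration, is a minimal sketch of that calculation.

```python
import math


def r_from_totals(n, sum_x, sum_y, sum_xy, sum_xx, sum_yy):
    """Correlation coefficient from the totals of the x, y, xy, x^2 and y^2 columns."""
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt(n * sum_xx - sum_x ** 2) * math.sqrt(n * sum_yy - sum_y ** 2)
    return numerator / denominator


# Totals from the Age / B.P. table
r_age_bp = r_from_totals(12, 628, 1684, 89894, 34416, 238822)
print(round(r_age_bp, 4))
```

Working from totals mirrors the hand calculation step by step, which makes it convenient for checking each intermediate figure.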
Problem : The following are the numbers of minutes it took 10 mechanics to
assemble a piece of machinery in the morning, x, and late afternoon, y:
x 11.1 10.3 12.0 15.1 13.7 18.5 17.3 14.2 14.8 15.3
y 10.9 14.2 13.8 21.5 13.2 21.1 16.4 19.3 17.4 19.0
Calculate the correlation coefficient between x and y.
Solution:
Table:
x      y      xy       x²       y²
11.1   10.9   120.99   123.21   118.81
10.3   14.2   146.26   106.09   201.64
12     13.8   165.6    144      190.44
15.1   21.5   324.65   228.01   462.25
13.7   13.2   180.84   187.69   174.24
18.5   21.1   390.35   342.25   445.21
17.3   16.4   283.72   299.29   268.96
14.2   19.3   274.06   201.64   372.49
14.8   17.4   257.52   219.04   302.76
15.3   19     290.7    234.09   361
Total  142.3  166.8    2434.69  2085.31  2897.8
(totals in column order: x, y, xy, x², y²)
$r_{XY} = \frac{n\sum xy - \sum x \sum y}{\sqrt{n\sum x^2 - \left(\sum x\right)^2}\,\sqrt{n\sum y^2 - \left(\sum y\right)^2}}$
$= \frac{10 \times 2434.69 - 142.3 \times 166.8}{\sqrt{10 \times 2085.31 - (142.3)^2}\,\sqrt{10 \times 2897.8 - (166.8)^2}}$
$= \frac{611.26}{\sqrt{603.81}\,\sqrt{1155.76}}$
$= 0.7317$
Problem: From the following table calculate the coefficient of correlation by
Karl Pearson’s method.
X : 23 27 28 29 30 31 33 35 36 39
Y : 18 22 23 24 25 26 28 29 30 32