Chapter 8
CORRELATION AND REGRESSION ANALYSIS
8.1 Correlation Analysis
So far, we have confined our discussion to distributions involving only one variable.
Sometimes, in practical applications, we come across a set of data in which each item
comprises the values of two or more variables.
Suppose we have a set of 30 students in a class and we want to measure the heights and weights
of all the students. We observe that each individual (unit) of the set assumes two values – one
relating to the height and the other to the weight. Such a distribution in which each individual or
unit of the set is made up of two values is called a bivariate distribution. The following examples
will illustrate clearly the meaning of bivariate distribution.
i. In a class of 60 students the series of marks obtained in two subjects by all of them.
ii. The series of sales revenue and advertising expenditure of two companies in a particular year.
iii. The series of ages of husbands and wives in a sample of selected married couples.
Thus in a bivariate distribution, we are given a set of pairs of observations, wherein each pair
represents the values of two variables. In a bivariate distribution, we are interested in finding a
relationship (if it exists) between the two variables under study.
The concept of ‘correlation’ is a statistical tool which studies the relationship between two
variables and Correlation Analysis involves various methods and techniques used for studying
and measuring the extent of the relationship between the two variables. “Two variables are said
to be in correlation if the change in one of the variables results in a change in the other
variable”.
8.1.1: Types of Correlation
There are two important types of correlation. They are (1) Positive and Negative correlation and,
(2) Linear and Non – Linear correlation.
Positive and Negative Correlation
If the values of the two variables deviate in the same direction, i.e. if an increase (or decrease) in
the values of one variable results, on average, in a corresponding increase (or decrease) in the
values of the other variable, the correlation is said to be positive.
Some examples of series of positive correlation are:
a. Heights and weights;
b. Household income and expenditure;
c. Price and supply of commodities;
d. Amount of rainfall and yield of crops.
Correlation between two variables is said to be negative or inverse if the variables deviate in
opposite directions, that is, if an increase (or decrease) in the values of one variable results,
on average, in a corresponding decrease (or increase) in the values of the other variable.
Some examples of series of negative correlation are:
a. Volume and pressure of a perfect gas;
b. Net income and operating expense;
c. Price and demand of goods;
d. Temperature and altitude.
Graphs of Positive and Negative correlation:
Suppose we are given sets of data relating to heights and weights of students in a class. They can
be plotted on the coordinate plane using x –axis to represent heights and y – axis to represent
weights. The different graphs shown below illustrate the different types of correlations.
[Figure: six scatter diagrams. Panels: perfect positive correlation (r = 1); strong positive
correlation (r = 0.80); zero correlation (r = 0); perfect negative correlation (r = -1);
moderate negative correlation (r = -0.43); strong correlation with an outlier (r = 0.71).]
Note:
i. If the points are very close to each other, a fairly good amount of correlation can be
expected between the two variables. On the other hand if they are widely scattered a
poor correlation can be expected between them.
ii. If the points are scattered and reveal no upward or downward trend, as in the case of
zero correlation (r = 0), then we say the variables are uncorrelated.
iii. If there is an upward trend rising from the lower left hand corner and going upward to
the upper right hand corner, the correlation obtained from the graph is said to be positive.
Also, if there is a downward trend from the upper left hand corner the correlation
obtained is said to be negative.
iv. The graphs shown above are generally termed as scatter diagrams.
Linear and Non – Linear Correlation
The correlation between two variables is said to be linear if a change of one unit in one
variable results in a constant corresponding change in the other variable over the entire range
of values.
For example, consider the following data.
X 2 4 6 8 10
Y 7 13 19 25 31
Thus, for each change of 2 units in the value of x, there is a constant change of 6 units in the
corresponding value of y (3 units of y per unit of x), and the above data can be expressed by the
relation y = 3x + 1.
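As a quick check on the relation above, the following Python sketch computes the slope between
consecutive data points and confirms that it is constant. The helper name constant_slope is
illustrative and does not come from the text.

```python
# Data from the table above.
x = [2, 4, 6, 8, 10]
y = [7, 13, 19, 25, 31]


def constant_slope(xs, ys):
    """Return the common slope if every consecutive pair of points has the
    same rise-over-run; return None if the slope varies."""
    points = list(zip(xs, ys))
    slopes = [(q[1] - p[1]) / (q[0] - p[0]) for p, q in zip(points, points[1:])]
    first = slopes[0]
    return first if all(s == first for s in slopes) else None


b = constant_slope(x, y)   # slope of the line
a = y[0] - b * x[0]        # intercept, from y = a + b*x at the first point
print(b, a)                # 3.0 1.0
```

A return value of None would indicate that the data do not lie on a single straight line.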
In general two variables x and y are said to be linearly related, if there exists a relationship of
the form y = a + bx.
Here y is the dependent variable, x is the independent variable, and the real numbers a and b are
the y-intercept and the slope of the line (the change in y per unit change in x). Plotted on a
graph sheet for constant values of a and b, this relation is a straight line. Such relations
generally occur in the physical sciences but are rarely encountered in the economic and social sciences.
Dependent variable - the variable that is being predicted or estimated (response variable).
Independent variable - the variable that provides the basis for estimation (predictor or explanatory variable).
The relationship between two variables is said to be non-linear if, corresponding to a unit
change in one variable, the other variable does not change at a constant rate but at a
fluctuating rate. In such cases, the plotted data do not fall on a straight line. For example,
one may have a relation of the form y = a + bx + cx² or a more general polynomial.
Coefficient of Correlation (r)
One of the most widely used statistics is the coefficient of correlation ‘r’, which measures the
degree of association between the two values of related variables given in the data set. In other
words, the coefficient of correlation describes the strength of the relationship between two sets of
interval-scaled or ratio-scaled variables. It is designated r and is often referred to as Pearson's r
or the Pearson product-moment correlation coefficient. It takes values from -1 to +1. If two sets of
data have r = +1, they are said to be perfectly positively correlated; if r = -1, they are said to be
perfectly negatively correlated; and if r = 0, they are uncorrelated.
The coefficient of correlation ‘r’ is given by the formula
\[
r = \frac{n\sum xy - \sum x \sum y}{\sqrt{n\sum x^2 - (\sum x)^2}\,\sqrt{n\sum y^2 - (\sum y)^2}}
\]
Example: A study was conducted to determine whether there is a relationship between the number
of sales calls made in a month and the number of copiers sold that month. The sales manager
selects a random sample of 10 representatives and determines the number of sales calls each
representative made last month and the number of copiers sold. The sample information is shown
in the table below. Compute the coefficient of correlation.
Sales representative   Number of sales calls (x)   Number of copiers sold (y)
1 20 30
2 40 60
3 20 40
4 30 60
5 10 30
6 10 40
7 20 40
8 20 50
9 20 30
10 30 70
Solution:
Representative   x     y     x²      y²       xy
1                20    30    400     900      600
2                40    60    1,600   3,600    2,400
3                20    40    400     1,600    800
4                30    60    900     3,600    1,800
5                10    30    100     900      300
6                10    40    100     1,600    400
7                20    40    400     1,600    800
8                20    50    400     2,500    1,000
9                20    30    400     900      600
10               30    70    900     4,900    2,100
Σ                220   450   5,600   22,100   10,800
\[
r = \frac{n\sum xy - \sum x \sum y}{\sqrt{n\sum x^2 - (\sum x)^2}\,\sqrt{n\sum y^2 - (\sum y)^2}}
  = \frac{(10)(10{,}800) - (220)(450)}{\sqrt{(10)(5{,}600) - (220)^2}\,\sqrt{(10)(22{,}100) - (450)^2}}
  = \frac{9{,}000}{(87.17798)(136.0147)} = 0.759
\]
How do we interpret a correlation of 0.759? First, it is positive, so we see there is a direct
relationship between the number of sales calls and the number of copiers sold. The value of 0.759
is fairly close to 1, so we conclude that the association is strong. To put it another way, an
increase in calls will likely lead to more sales.
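The calculation above can be reproduced with a short Python sketch. The function name pearson_r
is an illustrative choice; it applies the computational formula for r term by term to the
sales-call data, using only the standard library.

```python
from math import sqrt

# Sample data from the example: sales calls (x) and copiers sold (y).
calls = [20, 40, 20, 30, 10, 10, 20, 20, 20, 30]
copiers = [30, 60, 40, 60, 30, 40, 40, 50, 30, 70]


def pearson_r(xs, ys):
    """Pearson correlation coefficient via the computational formula."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = sqrt(n * sum_x2 - sum_x ** 2) * sqrt(n * sum_y2 - sum_y ** 2)
    return numerator / denominator


r = pearson_r(calls, copiers)
print(round(r, 3))  # 0.759
```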
Rank Correlation
The product-moment correlation coefficient is used to measure the strength of the linear
association between two variables, i.e. how close the points on a scatter graph lie to a straight
line. It is most appropriate when the points on a scatter graph have an elliptical pattern. The
product-moment correlation coefficient is less appropriate when the points on a scatter graph
seem to follow a curve or when there are outliers (or anomalous values) on the graph.
Data which are arranged in numerical order, usually from largest to smallest, and numbered
1, 2, 3, ... are said to be in ranks, or ranked data. Ranks also prove useful when two
or more values of one variable are the same. The coefficient of correlation for such data
is the Spearman rank-difference correlation coefficient, denoted by R. To calculate R, we
rank the data for each variable and compute the difference in ranks, d, for each pair. The
example that follows illustrates the use of R. R is given by the formula
\[
R = 1 - \frac{6\sum d^2}{n(n^2 - 1)}
\]
Where, d = difference between ranks and n = total number of observations.
Example: The following data show the annual income per head of population, x (in birr), and
the infant mortality, y (per thousand live births), for a sample of 11 countries. Calculate the
rank correlation coefficient ‘R’ for the data.
Country   A     B      C     D      E      F     G     H     I     K      L
x         130   5950   560   2010   1870   170   390   580   820   6620   3800
y         150   43     121   53     41     169   143   59    75    20     39
The relationship between the two variables does not, however, appear to be linear; a scatter
diagram of the data shows a curved pattern. Calculating the product-moment correlation coefficient
for these data is therefore not really appropriate, as it measures how well the data fit a
straight line.
Country   A     B      C     D      E      F     G     H     I     K      L      Total
x         130   5950   560   2010   1870   170   390   580   820   6620   3800
y         150   43     121   53     41     169   143   59    75    20     39
Rank x    1     10     4     8      7      2     3     5     6     11     9
Rank y    10    4      8     5      3      11    9     6     7     1      2
d         -9    6      -4    3      4      -9    -6    -1    -1    10     7
d²        81    36     16    9      16     81    36    1     1     100    49     426
Note: In ranking the numbers above, we have used rank 1 to denote the smallest number in each
row and rank 11 to represent the largest number. Some people would use rank 1 to denote the
largest number and rank 11 to represent the smallest. It does not matter which way you do the
ranking as long as you rank in the same way for both rows.
\[
R = 1 - \frac{6\sum d^2}{n(n^2 - 1)} = 1 - \frac{6(426)}{11(11^2 - 1)} = -0.936
\]
Interpretation: This (-0.936) represents strong negative rank correlation between income and
infant mortality, i.e. infant mortality tends to fall as income per head of population increases.
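The ranking and the formula for R can be sketched in Python as follows. The function names ranks
and spearman_r are illustrative, and the sketch assumes there are no tied values (true for this
data set); tied values would need average ranks, which this version does not handle.

```python
# Income per head (x) and infant mortality (y) for the 11 countries.
income = [130, 5950, 560, 2010, 1870, 170, 390, 580, 820, 6620, 3800]
mortality = [150, 43, 121, 53, 41, 169, 143, 59, 75, 20, 39]


def ranks(values):
    """Rank each value, with rank 1 for the smallest. Assumes no ties."""
    ordered = sorted(values)
    return [ordered.index(v) + 1 for v in values]


def spearman_r(xs, ys):
    """Spearman rank correlation: R = 1 - 6*sum(d^2) / (n*(n^2 - 1))."""
    n = len(xs)
    sum_d2 = sum((rx - ry) ** 2 for rx, ry in zip(ranks(xs), ranks(ys)))
    return 1 - 6 * sum_d2 / (n * (n ** 2 - 1))


print(ranks(income))                            # [1, 10, 4, 8, 7, 2, 3, 5, 6, 11, 9]
print(round(spearman_r(income, mortality), 3))  # -0.936
```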
Simple Linear Regression
Managerial decisions often are based on the relationship between two or more variables. For
example, after considering the relationship between advertising expenditures and sales, a
marketing manager might attempt to predict sales for a given level of advertising expenditures.
In another case, a public utility might use the relationship between the daily high temperature
and the demand for electricity to predict electricity usage on the basis of next month’s
anticipated daily high temperatures. Sometimes a manager will rely on intuition to judge how
two variables are related. However, if data can be obtained, a statistical procedure called
regression analysis can be used to develop an equation showing how the variables are related.
From this scenario, we can understand that regression analysis is concerned with how the values
of one variable depend on the corresponding values of a second variable. This dependence can be
summarized by an equation that enables us to predict or estimate the values of one variable given
values of the other variable. In contrast to correlation problems, which involve measuring only
the strength of a relationship, regression problems are concerned with the form or nature of the
relationship and with estimating the value of one variable from the values of the other through
an equation.
For instance, each of the following questions is labeled as a correlation or a regression problem.
i. How do the sales of a product depend on the price charged? – Regression
ii. How does the strength of a material depend on temperature? - Regression
iii. To what extent is metal pitting related to pollution?- Correlation
iv. How strong is the link between inflation and employment rates?- Correlation
v. How can we use the amount of fertilizer used to predict crop yields?- Regression
In regression terminology, the variable being predicted is called the dependent variable. The
variable or variables being used to predict the value of the dependent variable are called the
independent (explanatory) variables. For example, in analyzing the effect of advertising
expenditures on sales, a marketing manager’s desire to predict sales would suggest making sales
the dependent variable. Advertising expenditure would be the independent variable used to help
predict sales. In statistical notation, y denotes the dependent variable and x denotes the
independent variable.
In this section, we consider the simplest type of regression analysis involving one independent
variable and one dependent variable in which the relationship between the variables is
approximated by a straight line. It is called simple linear regression. Regression analysis
involving two or more independent variables is called multiple regression analysis.
Simple Linear Regression Model
Habesha Shiro is a chain of Ethiopian-food restaurants located in a five-state area. Habesha's most
successful locations are near college campuses. The managers believe that sales for these
restaurants (denoted by y) are related positively to the size of the student population (denoted by
x); that is, restaurants near campuses with a large student population tend to generate more sales
than those located near campuses with a small student population. Using regression analysis, we
can develop an equation showing how the dependent variable y is related to the independent
variable x.
Regression Model and Regression Equation
In the Habesha Shiro restaurant example, the population consists of all the Habesha’s restaurants.
For every restaurant in the population, there is a value of x (student population) and a
corresponding value of y (sales). The equation that describes how y is related to x and an error
term is called the regression model as given below.
y = β0 + β1x + ε
β0 and β1 are referred to as the parameters of the model, and ε (the Greek letter epsilon) is a
random variable referred to as the error term. The error term accounts for the variability in y that
cannot be explained by the linear relationship between x and y.
The population of all Habesha’s restaurants can also be viewed as a collection of subpopulations,
one for each distinct value of x. For example, one subpopulation consists of all Habesha’s
restaurants located near college campuses with 8000 students; another subpopulation with 9000
students; and so on. Each subpopulation has a corresponding distribution of y values. Each
distribution of y values has its own mean or expected value. The equation that describes how the
expected value of y, denoted E(y), is related to x is called the regression equation as shown
below.
E(y) = β0 + β1x
The graph of the simple linear regression equation is a straight line; β0 is the y-intercept of the
regression line, β1 is the slope, and E(y) is the mean or expected value of y for a given value of x.
Estimated Regression Equation
If the values of the population parameters β0 and β1 were known, we could use the above equation
to compute the mean value of y for a given value of x. In practice, the parameter values are not
known and must be estimated using sample data. Sample statistics (denoted b0 and b1) are
computed as estimates of the population parameters β0 and β1. Substituting the values of the sample
statistics b0 and b1 for β0 and β1 in the regression equation, we obtain the estimated regression
equation. The estimated regression equation for simple linear regression follows.
ŷ = b0 + b1x
The graph of the estimated simple linear regression equation is called the estimated regression
line; b0 is the y intercept and b1 is the slope. In the next section, we show how the least squares
method can be used to compute the values of b0 and b1 in the estimated regression equation. In
general, ŷ is the point estimator of E( y), the mean value of y for a given value of x.
The least squares method is a procedure for using sample data to find the estimated regression
equation. To illustrate the least squares method, suppose data were collected from a sample of 10
Habesha Shiro restaurants located near college campuses. For the ith observation or restaurant in
the sample, xi is the size of the student population (in thousands) and yi is the sales (in thousands
of dollars). The values of xi and yi for the 10 restaurants in the sample are summarized in Table
below.
Restaurant      1    2     3    4     5     6     7     8     9     10
Students (xi)   2    6     8    8     12    16    20    20    22    26
Sales (yi)      58   105   88   118   117   137   157   169   149   202
We choose the simple linear regression model to represent the relationship between
sales and student population. Given that choice, our next task is to use the sample data in the
table above to determine the values of b0 and b1 in the estimated simple linear regression
equation. For the ith restaurant, the estimated regression equation provides ŷi = b0 + b1xi.
Where,
ŷi = estimated value of sales ($1000s) for the ith restaurant
b0 = the y intercept of the estimated regression line
b1 = the slope of the estimated regression line
xi = size of the student population (1000s) for the ith restaurant
The least squares method uses the sample data to provide the values of b0 and b1 that minimize
the sum of the squares of the deviations between the observed values of the dependent variable yi
and the estimated values of the dependent variable ŷi, that is, the quantity Σ(yi - ŷi)².
Differential calculus can be used to show that the values of b0 and b1 that minimize this
expression can be found by using the following equations.
\[
b_1 = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2}
\qquad\text{and}\qquad
b_0 = \bar{y} - b_1\bar{x}
\]
where, xi = value of the independent variable for the ith observation
yi = value of the dependent variable for the ith observation
x̄ = mean value for the independent variable
ȳ = mean value for the dependent variable
n = total number of observations
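The calculus step mentioned above can be sketched as follows. This is a standard derivation
written out for illustration; the symbol S for the sum of squared deviations is introduced here
and is not used elsewhere in the text.

```latex
\begin{align*}
S(b_0, b_1) &= \sum_{i=1}^{n} (y_i - b_0 - b_1 x_i)^2 \\[4pt]
\frac{\partial S}{\partial b_0} &= -2\sum (y_i - b_0 - b_1 x_i) = 0
  \;\Longrightarrow\; b_0 = \bar{y} - b_1\bar{x} \\[4pt]
\frac{\partial S}{\partial b_1} &= -2\sum x_i\,(y_i - b_0 - b_1 x_i) = 0 \\[4pt]
\text{Substituting } b_0:\quad
0 &= \sum x_i\,\bigl[(y_i - \bar{y}) - b_1 (x_i - \bar{x})\bigr]
  \;\Longrightarrow\;
  b_1 = \frac{\sum x_i (y_i - \bar{y})}{\sum x_i (x_i - \bar{x})}
      = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2}
\end{align*}
```

The last equality holds because the deviations from a mean sum to zero, so subtracting
x̄Σ(yi - ȳ) from the numerator and x̄Σ(xi - x̄) from the denominator changes neither.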
Some of the calculations necessary to develop the least squares estimated regression equation for
Habesha Shiro are shown in the table below. With the sample of 10 restaurants, we have n = 10
observations. Because the above equations require x̄ and ȳ, we begin the calculations by
computing x̄ and ȳ.
x̄ = Σxi/n = 140/10 = 14 and ȳ = Σyi/n = 1300/10 = 130
Restaurant i   xi         yi           xi - x̄   yi - ȳ   (xi - x̄)(yi - ȳ)   (xi - x̄)²
1              2          58           -12       -72       864                 144
2              6          105          -8        -25       200                 64
3              8          88           -6        -42       252                 36
4              8          118          -6        -12       72                  36
5              12         117          -2        -13       26                  4
6              16         137          2         7         14                  4
7              20         157          6         27        162                 36
8              20         169          6         39        234                 36
9              22         149          8         19        152                 64
10             26         202          12        72        864                 144
Totals         Σxi = 140  Σyi = 1300                       Σ = 2840            Σ = 568
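\[
b_1 = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2} = \frac{2840}{568} = 5
\qquad\text{and}\qquad
b_0 = \bar{y} - b_1\bar{x} = 130 - 5(14) = 60
\]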
Thus, the estimated regression equation is ŷ = 60 + 5x
The slope of the estimated regression equation (b1 = 5) is positive, implying that as student
population increases, sales increase.
If we believe the least squares estimated regression equation adequately describes the
relationship between x and y, it would seem reasonable to use the estimated regression equation
to predict the value of y for a given value of x. For example, if we wanted to predict sales for a
restaurant to be located near a campus with 16,000 students, we would compute
ŷ = 60 + 5x = 60 + 5(16) = 140
Because x is measured in thousands of students and ŷ in thousands of dollars, the predicted sales
are $140,000.
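The least squares calculation and the prediction above can be sketched in Python. The names
least_squares and predict are illustrative. The sales values are those used in the calculation
table (restaurant 2 has sales of 105, which is the value consistent with Σyi = 1300), and x must
be entered in thousands of students.

```python
# Student population (thousands) and sales (thousands of dollars).
students = [2, 6, 8, 8, 12, 16, 20, 20, 22, 26]
sales = [58, 105, 88, 118, 117, 137, 157, 169, 149, 202]


def least_squares(xs, ys):
    """Return (b0, b1) for the estimated regression line y-hat = b0 + b1*x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    b1 = s_xy / s_xx             # slope
    b0 = y_bar - b1 * x_bar      # y intercept
    return b0, b1


b0, b1 = least_squares(students, sales)


def predict(x):
    """Predicted sales (thousands of dollars) for x thousand students."""
    return b0 + b1 * x


print(b0, b1)       # 60.0 5.0
print(predict(16))  # 140.0, i.e. $140,000 for 16,000 students
```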
Coefficient of Determination
For the Habesha Shiro example, we developed the estimated regression equation ŷ = 60 + 5x to
approximate the linear relationship between the size of the student population x and sales y. A
question now is: How well does the estimated regression equation fit the data? In this section, we
show that the coefficient of determination provides a measure of the goodness of fit for the
estimated regression equation.
For the ith observation, the difference between the observed value of the dependent variable, yi,
and the estimated value of the dependent variable, ŷi, is called the ith residual. The ith residual
represents the error in using ŷi to estimate yi. Thus, for the ith observation, the residual is yi - ŷi.
The sum of squares of these residuals or errors is the quantity that is minimized by the least
squares method. This quantity, also known as the sum of squares due to error, is denoted by
SSE.
SSE = Σ(yi - ŷi)²
The value of SSE is a measure of the error in using the estimated regression equation to estimate
the values of the dependent variable in the sample. For instance, for Habesha restaurant 1 the
values of the independent and dependent variables are x1 = 2 and y1 = 58. Using the estimated
regression equation, we find that the estimated value of sales for restaurant 1 is ŷ1 = 60 + 5(2) =
70. Thus, the error in using ŷ1 to estimate y1 for restaurant 1 is y1 - ŷ1 = 58 - 70 = -12. The
squared error is (-12)² = 144.
After computing and squaring the residuals for each restaurant in the sample in the same way, we
sum them to obtain SSE = 1530. Thus, SSE = 1530 measures the error in using the estimated
regression equation ŷ = 60 + 5x to predict sales.
Now suppose we are asked to develop an estimate of quarterly sales without knowledge of the
size of the student population. Without knowledge of any related variables, we would use the
sample mean as an estimate of quarterly sales at any given restaurant. We can compute the sum
of squared deviations obtained by using the sample mean ȳ = 130 to estimate the value of sales
for each restaurant in the sample. For the ith restaurant in the sample, the difference yi - ȳ
provides a measure of the error involved in using ȳ to estimate sales. The corresponding sum of
squares, called the total sum of squares, is denoted SST.
SST = Σ(yi - ȳ)²
Thus, the total sum of squares for the Habesha restaurants is SST = 15,730.
The arithmetic difference between the total sum of squares and the sum of squares due to error is
called the sum of squares due to regression (SSR), i.e. SSR = SST - SSE.
Thus, SSR = 15,730 - 1,530 = 14,200.
The ratio SSR/SST, which will take values between zero and one, is used to evaluate the
goodness of fit for the estimated regression equation. This ratio is called the coefficient of
determination and is denoted by r².
Therefore, r² = SSR/SST = 14,200/15,730 = 0.9027
When we express the coefficient of determination as a percentage, r2 can be interpreted as the
percentage of the total sum of squares that can be explained by using the estimated regression
equation. For Habesha Shiro, we can conclude that 90.27% of the total sum of squares can be
explained by using the estimated regression equation ŷ = 60 + 5x to predict sales. In other words,
90.27% of the variability in sales can be explained by the linear relationship between the size of
the student population and sales. We should be pleased to find such a good fit for the estimated
regression equation.
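The sums of squares and the coefficient of determination can be checked with a short Python
sketch. It takes b0 = 60 and b1 = 5 from the estimated regression equation and uses the sales
values from the least squares calculation table; the variable names are illustrative. The
computed SST of 15,730 agrees with the denominator of the ratio SSR/SST above.

```python
# Student population (thousands) and sales (thousands of dollars).
students = [2, 6, 8, 8, 12, 16, 20, 20, 22, 26]
sales = [58, 105, 88, 118, 117, 137, 157, 169, 149, 202]

b0, b1 = 60, 5  # estimated regression equation: y-hat = 60 + 5x

y_bar = sum(sales) / len(sales)
fitted = [b0 + b1 * x for x in students]

sse = sum((y - f) ** 2 for y, f in zip(sales, fitted))  # sum of squares due to error
sst = sum((y - y_bar) ** 2 for y in sales)              # total sum of squares
ssr = sst - sse                                         # sum of squares due to regression
r2 = ssr / sst                                          # coefficient of determination

print(sse, sst, ssr, round(r2, 4))  # 1530 15730.0 14200.0 0.9027
```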