Correlation
So far we have confined ourselves to distributions involving
only one variable, known as univariate distributions.
In practice we usually collect data on many variables. Whenever
we conduct an experiment we gather information on several
related variables. A questionnaire, for example, records
name, age, education, gender, years of experience, income,
etc.
Correlation is the study of relationship between two or
more variables.
When there are two related variables their joint
distribution is known as a bivariate distribution, and if
there are more than two variables their joint distribution is
known as a multivariate distribution.
In the case of a bivariate or multivariate distribution,
we are interested in discovering and measuring the
magnitude and direction of the relationship between two or more
variables. For this we use the tool known as correlation.
Suppose we have two continuous variables X and Y. If
a change in X is accompanied by a change in Y, the variables
are said to be correlated. In other words, the systematic
relationship between the variables is termed correlation.
When only two variables are involved the correlation is known
as simple correlation, and when more than two variables are
involved it is known as multiple correlation. For example, if
there are two variables X and Y the relationship is called
simple correlation, and if there are three variables, say Y, X1
and X2, the relationship is known as multiple correlation.
Types of correlation
If an increase in the value of X results in an increase in
the value of Y, the two variables move in the same direction.
If a decrease in the value of X results in a decrease in the
value of Y, they also move in the same direction.
Similarly, if an increase in X results in a decrease in Y, or
a decrease in X results in an increase in Y, they move in
opposite directions.
When the variables move in the same direction, these
variables are said to be positively correlated and if they
move in the opposite direction they are said to be negatively
correlated. The positive correlation is also known as direct
correlation. The negative correlation is also termed as
inverse correlation. When the two variables are not at all
related they are said to be independent. An example of each
of the above:
• rainfall and yield of a crop - positive correlation
• the speed of a vehicle and the time taken to reach a
destination - negative correlation
• amount of rainfall in Mumbai and yield of a crop in
Chennai - independent
In correlation we need not know which is the cause
variable and which is the effect variable.
Other examples of positive correlation:
• Height and weight. Taller people tend to be heavier.
• Day temperature and sale of ice cream.
An example of negative correlation would be
height above sea level and temperature. As you climb a
mountain (increase in height) it gets colder (decrease in
temperature).
A zero correlation (independent) exists when there is no
relationship between two variables. For example there is no
relationship between the amount of tea drunk and level of
intelligence.
Some more examples are
• age and salary
• experience and salary
• quantity of fertiliser applied and yield of a crop
• demand and supply
• the cost (in rupees) of a call and its duration (length)
• The more time you spend running on a treadmill, the more
calories you will burn.
• Taller people have larger shoe sizes and shorter people have
smaller shoe sizes.
• The number of days of absences in a course and the final
exam grade
Scatter Diagram
In correlation studies first we have to investigate
whether there is a relation between the variables X and Y.
For this a correlation can be expressed visually. This is done
by drawing a scatter diagram (also known as a scatterplot,
scatter graph, scatter chart).
A scatter diagram is a graphical display that shows the
relationship between two numerical variables, with each
pair of scores represented as a point (or dot).
A scatter diagram indicates the strength and direction of the
correlation between the variables.
To investigate whether there is any relation between
the variables X and Y we use a scatter diagram. Let (x1, y1),
(x2, y2), ..., (xn, yn) be n pairs of observations. If the
variables X and Y are plotted along the X-axis and Y-axis
respectively in the x-y plane of a graph sheet, the resulting
diagram of dots is known as a scatter diagram. From the scatter
diagram we can say whether there is any correlation between x
and y, whether it is positive or negative, and whether it is
linear or curvilinear. It may take any one of the following
forms.
When you draw a scatter diagram it doesn't matter which
variable goes on the x-axis and which goes on the y-axis.
Remember, in correlations we are always dealing with
paired scores, so the values of the 2 variables taken
together will be used to make the diagram.
Decide which variable goes on each axis and then simply
put a dot or a cross at the point where the 2 values coincide.
There is no rule for determining what size of correlation is
considered strong, moderate or weak. The interpretation of
the coefficient depends on the topic of study.
As a rough guide, if the absolute value of the correlation
coefficient is greater than 0.4 we say the relationship is
moderate, and if it is greater than 0.75, relatively strong.
The correlation coefficient can take values from -1 to +1:
• 1 is a perfect positive correlation
• 0 is no correlation (the values don't seem linked at all)
• -1 is a perfect negative correlation
Pearson's product moment correlation coefficient
The most common measure of correlation is Pearson's
product-moment correlation, which is commonly referred to
as the correlation coefficient. The measure of the degree of
relationship between two continuous variables is called the
correlation coefficient. It is denoted by r (in the case of a
sample) and by ρ (rho, in the case of a population). The
correlation coefficient r is known as Pearson's correlation
coefficient as it was developed by Karl Pearson. It is also
called the product-moment correlation.
The correlation coefficient r is given as the ratio of
covariance of the variables X and Y to the product of the
standard deviation of X and Y.
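Symbolically,

$$
r = \frac{\frac{1}{n-1}\sum (x - \bar{x})(y - \bar{y})}
         {\sqrt{\frac{1}{n-1}\sum (x - \bar{x})^2 \cdot \frac{1}{n-1}\sum (y - \bar{y})^2}}
$$

As n - 1 is common to the numerator and the denominator, it
cancels and the formula can be written as

$$
r_{xy} = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}
              {\sqrt{\sum (x_i - \bar{x})^2 \sum (y_i - \bar{y})^2}}
$$

Where: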
• rxy – the correlation coefficient of the linear relationship between
the variables x and y
• xi – the values of the x-variable in a sample
• x̅ – the mean of the values of the x-variable
• yi – the values of the y-variable in a sample
• ȳ – the mean of the values of the y-variable
In order to calculate the correlation coefficient using the formula
above, you must undertake the following steps:
1. Obtain a data sample with the values of x-variable and y-variable.
2. Calculate the means (averages) x̅ for the x-variable and ȳ for the
y-variable.
3. For the x-variable, subtract the mean from each value of the x-
variable (let’s call this new variable “a”). Do the same for the y-
variable (let’s call this variable “b”).
4. Multiply each a-value by the corresponding b-value and find the
sum of these multiplications (the final value is the numerator in
the formula).
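5. Square each a-value and calculate the sum of the results. Do the
same for the b-values.
6. Multiply the two sums of squares and find the square root of the
product (this is the denominator in the formula).
7. Divide the value obtained in step 4 by the value obtained in step 6.
In terms of the covariance and the variances,

$$
r = \frac{\mathrm{cov}(x, y)}{\sqrt{\mathrm{var}(x) \times \mathrm{var}(y)}}
$$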
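The steps above can be sketched in Python as follows. This is a
minimal illustration, not a library routine: the function name
pearson_r and the five-pair sample are hypothetical and chosen only
to keep the arithmetic easy to follow.

```python
import math

def pearson_r(x, y):
    """Pearson's r computed step by step from deviations about the means."""
    n = len(x)
    mean_x = sum(x) / n                                  # step 2
    mean_y = sum(y) / n
    a = [xi - mean_x for xi in x]                        # step 3
    b = [yi - mean_y for yi in y]
    numerator = sum(ai * bi for ai, bi in zip(a, b))     # step 4
    ss_a = sum(ai ** 2 for ai in a)                      # step 5
    ss_b = sum(bi ** 2 for bi in b)
    denominator = math.sqrt(ss_a * ss_b)                 # step 6
    return numerator / denominator                       # step 7

# Hypothetical sample, for illustration only
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
r = pearson_r(x, y)
print(round(r, 4))   # 6 / sqrt(10 * 6), about 0.7746
```

Here the sum of products is 6 and the sums of squares are 10 and 6,
so the function returns 6 / √60.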
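Another formula that can be used is

$$
r = \frac{\sum xy - \frac{\sum x \sum y}{n}}
         {\sqrt{\left(\sum x^2 - \frac{(\sum x)^2}{n}\right)
                \left(\sum y^2 - \frac{(\sum y)^2}{n}\right)}}
$$

or, in abbreviated form,

$$
r = \frac{SP(x, y)}{\sqrt{SS(x) \times SS(y)}}
$$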
This correlation coefficient r is known as Pearson's product
moment correlation coefficient. The numerator is termed the
sum of products of X and Y and is abbreviated as SP(XY). In the
denominator the first term is called the sum of squares of X,
i.e. SS(X), and the second term is called the sum of squares of
Y, i.e. SS(Y):

$$
r = \frac{SP(XY)}{\sqrt{SS(X)\, SS(Y)}}
$$

The denominator in the above formula is always positive.
The numerator may be positive or negative, making r
either positive or negative.
Assumptions in correlation analysis:
Correlation coefficient r is used under certain assumptions
and they are
1. The variables under study are continuous random
variables and they are normally distributed
2. The relationship between the variables is linear
3. Each pair of observations is unconnected with every other
pair (independent)
Problem:
Compute Pearson's coefficient of correlation between plant height (cm),
X, and yield (kg), Y, from the data given below:
X   39   65   62   90   82   75   25   98   36   78
Y   47   53   58   86   62   68   60   91   51   84
Here the means are 650/10 = 65 for X and 660/10 = 66 for Y.
    X     Y   (x-65)  (y-66)  (x-65)²  (y-66)²  (x-65)(y-66)
   39    47    -26     -19       676      361       494
   65    53      0     -13         0      169         0
   62    58     -3      -8         9       64        24
   90    86     25      20       625      400       500
   82    62     17      -4       289       16       -68
   75    68     10       2       100        4        20
   25    60    -40      -6      1600       36       240
   98    91     33      25      1089      625       825
   36    51    -29     -15       841      225       435
   78    84     13      18       169      324       234
  650   660                     5398     2224      2704   (totals)
rxy = 2704 / √(5398 × 2224) = 0.7804
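Using the formula based on raw sums:
n = 10, Σx = 650, Σy = 660, Σxy = 45604, Σx² = 47648, Σy² = 45784

$$
\begin{aligned}
r &= \frac{\sum xy - \frac{\sum x \sum y}{n}}
          {\sqrt{\left(\sum x^2 - \frac{(\sum x)^2}{n}\right)
                 \left(\sum y^2 - \frac{(\sum y)^2}{n}\right)}} \\
  &= \frac{45604 - \frac{(650)(660)}{10}}
          {\sqrt{\left(47648 - \frac{(650)^2}{10}\right)
                 \left(45784 - \frac{(660)^2}{10}\right)}} \\
  &= \frac{45604 - 42900}{(73.47)(47.16)} = 0.7804
\end{aligned}
$$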
The correlation coefficient is positive, so plant height and yield
are positively correlated.
Both the methods give the same result.
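The worked example can be checked with a short Python sketch that
applies both formulas to the plant height and yield data. The
variable names are illustrative assumptions, not part of the
original problem.

```python
import math

x = [39, 65, 62, 90, 82, 75, 25, 98, 36, 78]   # plant height (cm)
y = [47, 53, 58, 86, 62, 68, 60, 91, 51, 84]   # yield (kg)
n = len(x)

# Method 1: deviations from the means (SP and SS)
mean_x, mean_y = sum(x) / n, sum(y) / n
sp = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
ss_x = sum((xi - mean_x) ** 2 for xi in x)
ss_y = sum((yi - mean_y) ** 2 for yi in y)
r_dev = sp / math.sqrt(ss_x * ss_y)

# Method 2: raw sums
sum_xy = sum(xi * yi for xi, yi in zip(x, y))
sum_x2 = sum(xi ** 2 for xi in x)
sum_y2 = sum(yi ** 2 for yi in y)
r_raw = (sum_xy - sum(x) * sum(y) / n) / math.sqrt(
    (sum_x2 - sum(x) ** 2 / n) * (sum_y2 - sum(y) ** 2 / n)
)

print(round(r_dev, 4), round(r_raw, 4))   # 0.7804 0.7804
```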
A second example with n = 12 pairs, using the deviations
u = y - 5.13 and v = x - 79.41 from the (rounded) means:
     Y       X       u        v        u²        v²       u×v
   5.22    94.2    0.09    14.79    0.0081    218.74    1.3311
   8.13    69.3    3      -10.11    9         102.21  -30.33
   6.52   114.3    1.39    34.89    1.9321   1217.3    48.497
   4.16    83.3   -0.97     3.89    0.9409     15.132  -3.773
   8.98    85.4    3.85     5.99   14.823      35.88   23.062
   3.05    68.1   -2.08   -11.31    4.3264    127.92   23.525
   3.49    50.7   -1.64   -28.71    2.6896    824.26   47.084
   5.4     96.2    0.27    16.79    0.0729    281.9     4.5333
   2.39    76.1   -2.74    -3.31    7.5076     10.956   9.0694
   2.71    52     -2.42   -27.41    5.8564    751.31   66.332
   3.97    82.1   -1.16     2.69    1.3456      7.2361 -3.12
   7.56    81.3    2.43     1.89    5.9049      3.5721  4.5927
  61.58   953      0.02     0.08   54.407    3596.4   190.8    (totals)
r = 190.8 / √(54.407 × 3596.4) = 0.4313
The same data after a change of origin (3 subtracted from each Y
value and 20 from each X value) give the same correlation:
     Y       X      y-3     x-20
   5.22    94.2    2.22    74.2
   8.13    69.3    5.13    49.3
   6.52   114.3    3.52    94.3
   4.16    83.3    1.16    63.3
   8.98    85.4    5.98    65.4
   3.05    68.1    0.05    48.1
   3.49    50.7    0.49    30.7
   5.4     96.2    2.4     76.2
   2.39    76.1   -0.61    56.1
   2.71    52     -0.29    32
   3.97    82.1    0.97    62.1
   7.56    81.3    4.56    61.3
r = 0.4313
Properties
1. It is a unit free measure.
2. The correlation coefficient value ranges between -1
and +1. If we get a value of r beyond these limits, it is
an indication of wrong computation.
3. The correlation coefficient is not affected by change
of origin or scale or both. When a constant is added to or
subtracted from the original values of a variable we say
that the origin is changed. When the original values of a
variable are multiplied or divided by a (positive) constant
we say that the scale is changed.
4. It is symmetric. i.e. rxy = ryx.
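Properties 3 and 4 can be illustrated numerically. The Python sketch
below reuses the twelve (Y, X) pairs tabulated above, shifts the
origin (Y - 3, X - 20), changes the scale by positive constants
chosen arbitrarily for illustration, and swaps the two variables.
The helper name pearson_r is an assumption.

```python
import math

def pearson_r(x, y):
    """Pearson's r from deviations about the means."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sp = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    ss_x = sum((xi - mx) ** 2 for xi in x)
    ss_y = sum((yi - my) ** 2 for yi in y)
    return sp / math.sqrt(ss_x * ss_y)

y = [5.22, 8.13, 6.52, 4.16, 8.98, 3.05, 3.49, 5.4, 2.39, 2.71, 3.97, 7.56]
x = [94.2, 69.3, 114.3, 83.3, 85.4, 68.1, 50.7, 96.2, 76.1, 52, 82.1, 81.3]

r_original = pearson_r(x, y)
r_shifted = pearson_r([xi - 20 for xi in x], [yi - 3 for yi in y])   # change of origin
r_scaled = pearson_r([xi * 10 for xi in x], [yi / 2 for yi in y])    # change of scale
r_swapped = pearson_r(y, x)                                          # symmetry

# All four values agree (about 0.4313), apart from floating-point noise
print(round(r_original, 4), round(r_shifted, 4),
      round(r_scaled, 4), round(r_swapped, 4))
```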
Consider
X 2 4 6 8 10
Y 5 8 11 14 17
For an increase in X value of 2 there is an increase of 3 units
in Y. The rate of change is constant and we say they are
linearly related. The correlation calculated for such data is
called simple linear correlation.
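As a check, working through the deviation formula for these five
pairs (the means are 6 for X and 11 for Y) gives a correlation of
exactly 1, as expected for a perfectly linear increasing
relationship:

$$
\sum (x - \bar{x})(y - \bar{y}) = (-4)(-6) + (-2)(-3) + (0)(0) + (2)(3) + (4)(6) = 60
$$

$$
\sum (x - \bar{x})^2 = 16 + 4 + 0 + 4 + 16 = 40, \qquad
\sum (y - \bar{y})^2 = 36 + 9 + 0 + 9 + 36 = 90
$$

$$
r = \frac{60}{\sqrt{40 \times 90}} = \frac{60}{60} = 1
$$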
Rank correlation
One of the assumptions of correlation analysis is that the
two variables are normally distributed. When the variables
are not normally distributed the linear correlation procedure
is not applicable. In such cases we use rank correlation. There
are two methods available to calculate rank correlation. One
was proposed by Spearman and the other by Kendall. Both can
be applied to the same data, but Spearman's rank correlation
is the more popular of the two.
Spearman's rank correlation
This is indicated by rs. This procedure starts with ranking of
the measurements of the variable X and Y separately. The
ranks are assigned with the highest value getting the 1st
rank. The differences between the ranks of each of n pairs
are then found out. They are denoted by d. The Spearman's
rank correlation is then calculated by using the formula

$$
r_s = 1 - \frac{6 \sum d^2}{n(n^2 - 1)}
$$

when there are no ties.
Calculate Spearman's rank correlation for the following data,
where X and Y are the ranks of n = 10 pairs.
    X     Y     d    d²
    5     7    -2     4
    6     6     0     0
    4     2     2     4
    8     4     4    16
    1     3    -2     4
    3     1     2     4
    2     5    -3     9
    7    10    -3     9
    9     8     1     1
   10     9     1     1
∑d² = 52
rs = 1 - (6 ∑d² / (n(n² - 1)))
rs = 1 - (6 × 52 / (10(10² - 1)))
rs = 1 - (312/990) = 0.68
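The arithmetic above can be reproduced with a minimal Python sketch.
It assumes, as in the table, that X and Y are already ranks with no
ties; the function name spearman_from_ranks is hypothetical.

```python
def spearman_from_ranks(rank_x, rank_y):
    """Spearman's rs from two lists of untied ranks."""
    n = len(rank_x)
    sum_d2 = sum((rx - ry) ** 2 for rx, ry in zip(rank_x, rank_y))
    return 1 - 6 * sum_d2 / (n * (n ** 2 - 1))

rank_x = [5, 6, 4, 8, 1, 3, 2, 7, 9, 10]
rank_y = [7, 6, 2, 4, 3, 1, 5, 10, 8, 9]
rs = spearman_from_ranks(rank_x, rank_y)
print(round(rs, 2))   # 0.68
```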
The following are the marks assigned by two judges in an interview.
Calculate Spearman's rank correlation.
Judge A 40, 48, 25, 32, 50, 41, 15, 18, 27, 36, 43, 49
Judge B 35, 40, 22, 25, 47, 38, 17, 19, 26, 33, 41, 39
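One way to work this problem is sketched below in Python. Marks are
ranked with the highest mark getting rank 1, as described earlier.
The helper names are assumptions, and the ranking helper relies on
there being no tied marks, which holds for this data.

```python
def ranks_descending(values):
    """Rank values so that the highest gets rank 1 (assumes no ties)."""
    ordered = sorted(values, reverse=True)
    return [ordered.index(v) + 1 for v in values]

def spearman_rs(x, y):
    """Spearman's rs = 1 - 6*sum(d^2) / (n(n^2 - 1)) for untied data."""
    rank_x, rank_y = ranks_descending(x), ranks_descending(y)
    n = len(x)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(rank_x, rank_y))
    return 1 - 6 * sum_d2 / (n * (n ** 2 - 1))

judge_a = [40, 48, 25, 32, 50, 41, 15, 18, 27, 36, 43, 49]
judge_b = [35, 40, 22, 25, 47, 38, 17, 19, 26, 33, 41, 39]

rs_judges = spearman_rs(judge_a, judge_b)
print(round(rs_judges, 3))   # 0.965
```

The rank differences are zero except for four pairs, giving
∑d² = 10 and rs = 1 - 60/1716, so the two judges agree closely.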