Pearson’s Correlation
Azmi Mohd Tamil

Key Concepts
⚫ Correlation as a statistic
⚫ Positive and Negative Bivariate Correlation
⚫ Range Effects
⚫ Outliers
⚫ Regression & Prediction
⚫ Directionality Problem (& cross-lagged panel)
⚫ Third Variable Problem (& partial correlation)

Assumptions
⚫ Related pairs
⚫ Scale of measurement. For Pearson, data should be interval or ratio in nature.
⚫ Normality
⚫ Linearity
⚫ Homoscedasticity

Example of Non-Linear Relationship
Yerkes-Dodson Law – not for correlation
[Figure: Performance (Worse to Better) plotted against Stress (Low to High)]

Correlation
[Diagram: X and Y, e.g., Stress and Illness]

Correlation – parametric & non-parametric
⚫ 2 Continuous Variables - Pearson
– linear relationship
– e.g., association between height and weight
⚫ 1 Continuous, 1 Categorical Variable (Ordinal) - Spearman/Kendall
– e.g., association between Likert Scale on work satisfaction and work output
– pain intensity (no, mild, moderate, severe) and dosage of pethidine
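
The choice between Pearson and Spearman can be sketched in a few lines of Python. This is a minimal illustration, not a replacement for a statistics package: the helper names (`pearson_r`, `average_ranks`, `spearman_rho`) are ours, and the pain and dose values are made up for the example. Spearman's rho is computed here as Pearson's r applied to the ranks, with tied values sharing the mean of their ranks.

```python
import math

def pearson_r(x, y):
    """Pearson's r from paired observations."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)

def average_ranks(values):
    """Rank values from 1 to n; tied values get the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # sorted positions i..j share this rank
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def spearman_rho(x, y):
    """Spearman's rho: Pearson's r computed on the ranks."""
    return pearson_r(average_ranks(x), average_ranks(y))

# Hypothetical data: pain intensity coded 0=no, 1=mild, 2=moderate, 3=severe,
# and an illustrative pethidine dose for each patient.
pain = [0, 0, 1, 1, 2, 2, 3, 3]
dose = [0, 10, 25, 20, 50, 40, 75, 100]

rho = spearman_rho(pain, dose)
print(f"Spearman rho = {rho:.3f}")
```

Because pain intensity is ordinal, only the ordering of the categories is used; the numeric codes 0 to 3 are labels, not measurements.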

Pearson Correlation
⚫ 2 Continuous Variables
– linear relationship
– e.g., association between height and weight
⚫ measures the degree of linear association between two interval scaled variables
⚫ analysis of the relationship between two quantitative outcomes, e.g., height and weight

History of Pearson’s Correlation
⚫ Sir Francis Galton was studying the relationship between the height of the fathers and the height of their sons and discovered a way to mathematically measure this relationship. He called it the "co-efficient of correlation." He gave a specific formula for computing this number from the data he collected. Galton died in 1911. It was his disciple, Karl Pearson, who first formulated the idea in its most complete form in 1895.
⚫ In 1915, Pearson introduced R.A. Fisher to the difficult problem of determining the statistical distribution of Galton's correlation co-efficient. Fisher thought about the problem, cast it into a geometric formulation, and within a week had a complete answer. He submitted it for publication in Biometrika; but Pearson & William Sealy Gosset had difficulty understanding the paper. Pearson got his workers to check the calculations. In every case, they agreed with Fisher's more general solution.

History of Pearson’s Correlation
⚫ Please note that, when writing to R.A. Fisher, Pearson called it Galton's correlation co-efficient, not Pearson's correlation co-efficient. However, it is now known as Pearson's correlation co-efficient.
⚫ This is an example of what Stephen Stigler, a contemporary historian of science, calls the law of eponymy (also quoted as the "law of misonomy"): nothing in mathematics is ever named after the person who discovered it. Sir Francis Galton was the one who came up with the co-efficient of correlation, but Karl Pearson was the one credited for it.
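
How to calculate r?
$r = \dfrac{a}{\sqrt{b \, c}}$, where
$a = \sum xy - \dfrac{\sum x \sum y}{n}$
$b = \sum x^2 - \dfrac{(\sum x)^2}{n}$
$c = \sum y^2 - \dfrac{(\sum y)^2}{n}$
$t = r \sqrt{\dfrac{n - 2}{1 - r^2}}$, with df = n - 2 (n = number of pairs)

Example
• $\sum x = 4631$, $\sum x^2 = 688837$
• $\sum y = 2863$, $\sum y^2 = 264527$
• $\sum xy = 424780$, $n = 32$
• $a = 424780 - (4631 \times 2863 / 32) = 10{,}450.22$
• $b = 688837 - 4631^2 / 32 = 18{,}644.47$
• $c = 264527 - 2863^2 / 32 = 8{,}377.969$
• $r = a / (b \times c)^{0.5} = 10{,}450.22 / (18{,}644.47 \times 8{,}377.969)^{0.5} = 0.836144$
• $t = 0.836144 \times ((32 - 2) / (1 - 0.836144^2))^{0.5} = 8.349436$
• d.f. = n - 2 = 30

We refer to Table A3, so we use df = 30.
t = 8.349436 > 3.65 (p = 0.001)
Therefore if t = 8.349436, p < 0.001.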
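
The worked example can be checked with a short Python sketch. It applies the same a, b, c computation to the sums given on the slide; the function names `r_from_sums` and `t_statistic` are ours, chosen for illustration. The p-value lookup itself is still done against a t table (Table A3 in the slides), which the sketch does not reproduce.

```python
import math

def r_from_sums(n, sum_x, sum_y, sum_x2, sum_y2, sum_xy):
    """Pearson's r from the summary sums, using the a, b, c terms."""
    a = sum_xy - sum_x * sum_y / n   # sum of cross-products about the means
    b = sum_x2 - sum_x ** 2 / n      # sum of squares of x about its mean
    c = sum_y2 - sum_y ** 2 / n      # sum of squares of y about its mean
    return a / math.sqrt(b * c)

def t_statistic(r, n):
    """t statistic for testing r against zero, with n - 2 degrees of freedom."""
    return r * math.sqrt((n - 2) / (1 - r ** 2))

n = 32
r = r_from_sums(n, sum_x=4631, sum_y=2863,
                sum_x2=688837, sum_y2=264527, sum_xy=424780)
t = t_statistic(r, n)
print(f"r = {r:.6f}, t = {t:.6f}, df = {n - 2}")
```

The printed values should agree with the slide's r = 0.836144 and t = 8.349436 up to rounding.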

Correlation
Two pieces of information:
⚫ The strength of the relationship
⚫ The direction of the relationship

Strength of relationship
⚫ r lies between -1 and 1. Values near 0 mean no (linear) correlation and values near ± 1 mean very strong correlation.
-1.0: Strong Negative
0.0: No Rel.
+1.0: Strong Positive

How to interpret the value of r?

Correlation ( + direction)
⚫ Positive correlation: high values of one variable associated with high values of the other
⚫ Example: Higher course entrance exam scores are associated with better course grades during the final exam.
[Figure: scatter plot, Positive and Linear]

Correlation ( - direction)
⚫ Negative correlation: The negative sign means that the two variables are inversely related, that is, as one variable increases the other variable decreases.
⚫ Example: Increase in body mass index is associated with reduced effort tolerance.
[Figure: scatter plot, Negative and Linear]

Pearson’s r
⚫ A 0.9 is a very strong positive association (as one variable rises, so does the other)
⚫ A -0.9 is a very strong negative association (as one variable rises, the other falls)
r=0.9 has nothing to do with 90%
r=correlation coefficient

Coefficient of Determination Defined
⚫ Pearson’s r can be squared, r², to derive a coefficient of determination.
⚫ Coefficient of determination – the portion of variability in one of the variables that can be accounted for by variability in the second variable

Coefficient of Determination
⚫ Example of depression and CGPA
– Pearson’s r shows negative correlation, r=-0.5
– r²=0.25
– In this example we can say that 1/4 or 0.25 of the variability in CGPA scores can be accounted for by depression (remaining 75% of variability is other factors, habits, ability, motivation, courses studied, etc)

Coefficient of Determination and Pearson’s r
⚫ Pearson’s r can be squared, r²
⚫ If r=0.5, then r²=0.25
⚫ If r=0.7, then r²=0.49
⚫ Thus while r=0.5 versus 0.7 might not look so different in terms of strength, r² tells us that r=0.7 accounts for about twice the variability relative to r=0.5

A study was done to find the association between the mothers’ weight and their babies’ birth weight. The following is the scatter diagram showing the relationship between the two variables.
[Figure: scatter diagram of Baby's Birthweight (0.00 to 5.00) against Mother's Weight (0.0 to 100.0); R Sq Linear = 0.204]
The coefficient of correlation (r) is 0.452. The coefficient of determination (r²) is 0.204. Twenty percent of the variability of the babies’ birth weight is accounted for by the variability of the mothers’ weight.
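
The arithmetic behind these slides is small enough to verify directly. The following sketch squares the r values quoted above; the function name `coefficient_of_determination` is ours.

```python
def coefficient_of_determination(r):
    """Share of variability in one variable accounted for by the other."""
    return r ** 2

# r values quoted in the slides: the 0.5 versus 0.7 comparison, the
# depression and CGPA example (-0.5), and the birth weight example (0.452).
for r in (0.5, 0.7, -0.5, 0.452):
    print(f"r = {r:+.3f}  ->  r squared = {coefficient_of_determination(r):.3f}")

# How much more variability does r = 0.7 account for than r = 0.5?
ratio = coefficient_of_determination(0.7) / coefficient_of_determination(0.5)
print(f"ratio = {ratio:.2f}")
```

The sign of r is lost on squaring, so r² describes strength only; the direction still has to be read from r itself.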

Causal Silence: Correlation Does Not Imply Causality
Causality – must demonstrate that variance in one variable can only be due to influence of the other variable
⚫ Directionality of Effect Problem
⚫ Third Variable Problem

CORRELATION DOES NOT MEAN CAUSATION
⚫ A high correlation does not give us the evidence to make a cause-and-effect statement.
⚫ A common example given is the high correlation between the cost of damage in a fire and the number of firemen helping to put out the fire.
⚫ Does it mean that to cut down the cost of damage, the fire department should dispatch fewer firemen for a fire rescue?
⚫ No: it is the intensity of the fire that is highly correlated with both the cost of damage and the number of firemen dispatched.
⚫ Another example is the high correlation between smoking and lung cancer. One may argue that both could be caused by stress, and that smoking does not cause lung cancer.
⚫ In this case, a correlation between lung cancer and smoking may be a result of a cause-and-effect relationship (by clinical experience + common sense?). To establish this cause-and-effect relationship, controlled experiments should be performed.

Directionality of Effect Problem
[Diagram: Big Fire, More Firemen Sent, More Damage]
[Diagram: X and Y, with the direction of effect in question]
⚫ Class Attendance (X) and Higher Grades (Y)
⚫ Aggressive Behavior (X) and Viewing Violent TV (Y)
Aggressive children may prefer violent programs, or violent programs may promote aggressive behavior.
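
Methods for Dealing with Directionality
⚫ Cross-Lagged Panel design
– A type of longitudinal design
– Investigate correlations at several points in time
– STILL NOT CAUSAL

Cross-Lagged Panel
[Diagram: correlations between preference for violent TV and aggression, measured in 3rd grade and 13th grade]
⚫ Pref for violent TV, 3rd grade, with Pref for violent TV, 13th grade: .05
⚫ Aggression, 3rd grade, with Aggression, 13th grade: .38
⚫ Pref for violent TV with Aggression, both in 3rd grade: .21
⚫ Pref for violent TV with Aggression, both in 13th grade: -.05
⚫ TV to later aggression: .31
⚫ Aggression to later TV: .01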

Third Variable Problem
[Diagram: X and Y, both influenced by a third variable Z]

Class Exercise
Identify the third variable that influences both X and Y

Third Variable Problem
⚫ Class Attendance and GPA (+); third variable: Motivation
⚫ Number of Mosques and Crime Rate (+); third variable: Size of Population
⚫ Ice Cream Consumed and Number of Drownings (+); third variable: Temperature
⚫ Reading Score and Reading Comprehension (+); third variable: IQ
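
The third variable problem, and the partial correlation listed under Key Concepts, can be illustrated with simulated data. The sketch below is an assumption-laden toy: the data are generated, not observed, with temperature driving both ice cream consumption and drownings and no direct link between the two. The partial correlation uses the standard first-order formula $r_{xy \cdot z} = (r_{xy} - r_{xz} r_{yz}) / \sqrt{(1 - r_{xz}^2)(1 - r_{yz}^2)}$; the function names are ours.

```python
import math
import random

def pearson_r(x, y):
    """Pearson's r from paired observations."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)

def partial_r(r_xy, r_xz, r_yz):
    """Correlation between X and Y after removing the linear effect of Z."""
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))

# Simulated (hypothetical) standardized data: Z influences both X and Y,
# but X and Y have no direct effect on each other.
rng = random.Random(0)
n = 2000
temperature = [rng.gauss(0, 1) for _ in range(n)]
ice_cream = [z + rng.gauss(0, 1) for z in temperature]
drownings = [z + rng.gauss(0, 1) for z in temperature]

r_xy = pearson_r(ice_cream, drownings)
r_xz = pearson_r(ice_cream, temperature)
r_yz = pearson_r(drownings, temperature)
r_xy_given_z = partial_r(r_xy, r_xz, r_yz)

print(f"r(ice cream, drownings)               = {r_xy:.3f}")
print(f"partial r, controlling for temperature = {r_xy_given_z:.3f}")
```

In this setup the raw correlation is clearly positive, while the partial correlation falls close to zero once temperature is controlled for, which is the pattern expected when a third variable produces the association.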

Correlation In SPSS

Data Preparation - Correlation
⚫ Screen data for outliers and ensure that there is evidence of linear relationship, since correlation is a measure of linear relationship.
⚫ Assumption is that each pair is bivariate normal.
⚫ If not normal, then use Spearman.

Correlation In SPSS
⚫ For this exercise, we will be using the data from the CD, under Chapter 8, [Link]
⚫ This data is a subset of a case-control study on factors affecting SGA in Kelantan.
⚫ Open the data & select -> Analyse > Correlate > Bivariate…
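
Why outlier screening matters can be shown with a small made-up data set. In the sketch below the first eight points are constructed to have no linear relationship at all; adding a single extreme point is enough to produce a very high r. The numbers are illustrative only and the helper name `pearson_r` is ours.

```python
import math

def pearson_r(x, y):
    """Pearson's r from paired observations."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data with no linear relationship between x and y.
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [5, 3, 6, 2, 7, 4, 6, 3]
r_without = pearson_r(x, y)

# The same data plus one extreme point far from the rest.
r_with = pearson_r(x + [30], y + [30])

print(f"r without the outlier = {r_without:.3f}")
print(f"r with the outlier    = {r_with:.3f}")
```

A scatter diagram would reveal the lone point immediately, which is why the data should be inspected before the coefficient is interpreted.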

Correlation in SPSS
⚫ We want to see whether there is any association between the mothers’ weight and the babies’ weight. So select the variables (weight2 & birthwgt) into ‘Variables’.
⚫ Select ‘Pearson’ Correlation Coefficients.
⚫ Click the ‘OK’ button.

Correlation Results
Correlations
                                 WEIGHT2    BIRTHWGT
WEIGHT2    Pearson Correlation   1          .431*
           Sig. (2-tailed)       .          .017
           N                     30         30
BIRTHWGT   Pearson Correlation   .431*      1
           Sig. (2-tailed)       .017       .
           N                     30         30
*. Correlation is significant at the 0.05 level (2-tailed).
⚫ r = 0.431, and the correlation is statistically significant (p = 0.017).
⚫ The r value indicates a fair and positive linear relationship.
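
As a rough cross-check on the SPSS output, the two-tailed p-value can be approximated from r and N alone. The sketch below uses Fisher's z transformation with a normal approximation; this is an assumption on our part and not the exact t-based calculation SPSS reports, so only approximate agreement with Sig. = .017 should be expected. The function name `fisher_z_p_value` is ours.

```python
import math

def fisher_z_p_value(r, n):
    """Approximate two-tailed p-value for H0: rho = 0 via Fisher's z.

    atanh(r) is approximately normal with standard error 1 / sqrt(n - 3),
    so z below is treated as a standard normal deviate.
    """
    z = math.atanh(r) * math.sqrt(n - 3)
    # P(|Z| > |z|) for a standard normal equals erfc(|z| / sqrt(2)).
    return math.erfc(abs(z) / math.sqrt(2))

# Values taken from the SPSS table above: r = .431, N = 30.
p = fisher_z_p_value(0.431, 30)
print(f"approximate two-tailed p = {p:.3f}")
```

The approximation lands in the same region as the reported significance, below the 0.05 level flagged by the asterisk in the table.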

Scatter Diagram
[Figure: scatter diagram against MOTHERS' WEIGHT (x-axis 0 to 100, y-axis 0.0 to 3.6); Rsq = 0.1861]
⚫ If the correlation is significant, it is best to include the scatter diagram.
⚫ The r square indicates that the mothers’ weight accounts for 19% of the variability of the babies’ weight.