KU | Correlation and Regression Analysis | Prof. Taewan Kim
Marketing Strategy
3960, 3961
Correlation and Regression Analysis
Introduction
What if we are interested in the relationship between two variables?
If we change one variable (x), does another variable (y) also change?
We could also possibly use a variable (x) to predict another variable
(y). These types of analysis, correlation and regression, are useful
and powerful tools for using your data to make decisions.
Objectives
1. Define correlation.
2. Define regression.
3. Interpret scatter plots.
4. Explain simple regression.
5. Explain multiple regression.
6. Interpret a simple regression output.
7. Interpret a multiple regression output.
8. Define dummy variable.
9. Define multicollinearity.
Resources
Primary
Churchill and Brown, Basic Marketing Research, Chapter 20
Correlation and Regression
Correlation: closeness of the relationship between two variables.
“Is x correlated to y?” is the same as “Is y correlated to x?”
Corr(X,Y) = Corr(Y,X)
*Correlation does not mean causation!
Regression: Derive an equation that relates the criterion variable
to one or more predictor variables.
“Is x useful in predicting y?” (This is not the same as “Is y useful
in predicting x?”)
1. Variable of interest: Y
2. Why is Y fluctuating?
a. Scatter plots
b. Pick one Xi from theory or model
c. Fitted model (Ordinary Least Squares)
$\min \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2$
3. Is the model any good?
How close does the fitted line come to the actual scatter plot?
[Figure: TSS, RSS, and ESS marked on the scatter plot and fitted line]
$0 \le R^2 = \frac{RSS}{TSS} \le 1$
Bivariate (two-variable) Scatter Plots
Pearson’s Correlation Coefficient
Association between two variables: how one variable changes (from its
average) with the change (from its average) in another variable.
$r = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum (x_i - \bar{x})^2}\,\sqrt{\sum (y_i - \bar{y})^2}}$
$r = \frac{\mathrm{Cov}(x, y)}{s_x s_y}$
$-1 \le r \le +1$
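The deviation formula for r can be checked numerically. The following is a minimal sketch in plain Python; the function name `pearson_r` and the five data points are made up for illustration and are not part of the course material.

```python
import math

def pearson_r(x, y):
    """Pearson's r computed from deviations about the two means."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    return sxy / (math.sqrt(sxx) * math.sqrt(syy))

# Hypothetical data: x could be advertising, y could be sales.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

r_xy = pearson_r(x, y)
r_yx = pearson_r(y, x)  # swapping the variables leaves r unchanged

print(f"Corr(X,Y) = {r_xy:.3f}")  # Corr(X,Y) = 0.775
print(f"Corr(Y,X) = {r_yx:.3f}")  # Corr(Y,X) = 0.775
```

The two printed values agree, which illustrates Corr(X,Y) = Corr(Y,X).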
Regression: Assessing the Effects of Marketing Mix Variables
Simple Regression Model
Simple (or bivariate) regression examines the effect of one independent
variable (X) on a dependent variable (Y).
The Population Model:
Y =α + βX + ε
Where,
Y = Dependent variable (or Outcome)
X = Independent variable (Input)
α = Intercept; value of Y when X=0
β= Slope, the change in Y due to one unit change in X.
β is a measure of the effect of X on Y.
When β= 0, X has no effect on Y.
When β >0, X has a positive effect on Y.
(e.g., effects of advertising or the number of salespersons on sales)
When β <0, X has a negative effect on Y.
(e.g., Effects of a product's own price on its own sales)
The Fitted Model:
Based on data on X and Y, we fit the regression line as:
$\hat{Y} = a + bX$
Where,
$\hat{Y}$ = Predicted (or fitted) value of the dependent variable for a given value of
X. The fitted value $\hat{Y}$ may not match the actual value of Y. This
results in error $= Y - \hat{Y}$
a = estimate of the intercept α, based on the sample data on X and Y
b = estimate of the slope β, based on the sample data on X and Y
a and b are also referred to as the estimated regression parameters.
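As a rough illustration of how a and b come out of the data, the sketch below uses the standard closed-form least-squares solution, b = Σ(x − x̄)(y − ȳ) / Σ(x − x̄)² and a = ȳ − b·x̄. That formula is not printed in this handout, and the function name `fit_simple_ols` and the data are hypothetical.

```python
def fit_simple_ols(x, y):
    """Return (a, b) for the least-squares line Y-hat = a + b*X."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b = sxy / sxx            # estimate of the slope
    a = y_bar - b * x_bar    # estimate of the intercept
    return a, b

# Hypothetical data, not taken from the course examples.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

a, b = fit_simple_ols(x, y)
fitted = [a + b * xi for xi in x]
errors = [yi - fi for yi, fi in zip(y, fitted)]  # error = Y - Y-hat

print(f"a = {a:.2f}, b = {b:.2f}")  # a = 2.20, b = 0.60
```

The errors of a least-squares line with an intercept sum to zero, which is a quick sanity check on any fit.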
Simple Regression (Continued)
Is the Fitted Regression Model any good?
Note that the analyst picked the independent variable (X) based on
his/her conjecture. S/he assumed that this X (e.g., number of sales visits
made by a salesperson) might have an effect on Y (e.g., actual sales
generated). But s/he might be wrong in this assumption.
The problem is that any data on X and Y will produce a fitted regression
line (i.e., OLS estimates a and b): $\hat{Y} = a + bX$
Therefore, two very important questions to ask are
1. How well does the fitted regression line “fit” (i.e., come close to) the
actual data points (i.e., the scatter plot)?
2. Is the estimated slope parameter b statistically different from zero?
Note that when the slope is zero, it is a flat line, suggesting that
‘X has no effect on Y’.
1. Fit of a Regression
If the analyst has no explanation about why the dependent variable Y (e.g.,
sales) is varying from observation to observation (e.g., salesperson to
salesperson), s/he can only look at each data point $Y_i$ and examine how
much it deviates from the average $\bar{Y}$.
The deviation is $(Y_i - \bar{Y})$. Squaring this and adding over all observations,
we get what we call the Total Sum of Squares (TSS). Thus,
$TSS = \sum_{i=1}^{n} (Y_i - \bar{Y})^2$
Note that $TSS = s^2 (n-1)$, where $s^2$ is the sample variance of Y.
Simple Regression (Continued)
2. Partitioning of TSS
A regression partitions TSS into two parts: TSS = RSS + ESS
(1) Regression (or explained) Sum of Squares (RSS) and
(2) Error (or unexplained) Sum of Squares (ESS)
$RSS = \sum_{i=1}^{n} (\hat{Y}_i - \bar{Y})^2$ : the difference between the mean and the regression
$ESS = \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2$ : the difference between the regression and the data
R² is a measure of “fit”. It gives the share (%) of the total variation in Y (TSS)
that is explained by the regression (RSS), i.e.,
$R^2 = \frac{RSS}{TSS} = 1 - \frac{ESS}{TSS}$, with $0 \le R^2 \le 1$
In simple regression, the correlation satisfies $|r| = \sqrt{R^2}$; r has the same
sign as the slope b.
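The partition TSS = RSS + ESS and the two expressions for R² can be verified on a small example. This is a sketch under stated assumptions: the data are hypothetical, and the helper `fit_line` uses the standard closed-form least-squares solution, which the handout does not spell out.

```python
def fit_line(x, y):
    """Least-squares intercept a and slope b for Y-hat = a + b*X."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    b = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
         / sum((xi - x_bar) ** 2 for xi in x))
    return y_bar - b * x_bar, b

# Hypothetical data, not taken from the course examples.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

a, b = fit_line(x, y)
y_bar = sum(y) / len(y)
y_hat = [a + b * xi for xi in x]

tss = sum((yi - y_bar) ** 2 for yi in y)               # total
rss = sum((yh - y_bar) ** 2 for yh in y_hat)           # explained by the regression
ess = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))  # unexplained (error)

r_squared = rss / tss

print(f"TSS = {tss:.1f}, RSS = {rss:.1f}, ESS = {ess:.1f}")  # TSS = 6.0, RSS = 3.6, ESS = 2.4
print(f"R^2 = {r_squared:.2f}")                              # R^2 = 0.60
```

Here 60% of the variation in Y is explained by X, and the remaining 40% is left in the error term.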
3. Test for the Slope Parameter (b)
Since we are using a sample to generate data for X and Y, the estimates
a and b are not exact values of α and β; there is a confidence interval around
each of them, depending on the respective standard errors (SE).
Therefore, we need to test if the slope parameter is statistically different
from zero.
$H_0: \beta = 0$
$H_a: \beta \ne 0$
Just like in the case of a one-population mean test, we compute the test
statistic as follows:
$t = \frac{b - \beta}{SE(b)}$, which under $H_0: \beta = 0$ becomes $t = \frac{b}{SE(b)}$
The computer output prints the t statistic and the corresponding probability
of error (called the p-value).
If p < 0.05 (i.e., 95% confidence), then reject $H_0$ and conclude that
β ≠ 0, which means that X has a statistically significant effect on Y.
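The t statistic for the slope can be computed by hand on a small data set. The sketch below assumes the standard simple-regression formula SE(b) = sqrt(MSE / Σ(x − x̄)²) with MSE = ESS / (n − 2), which this handout does not state explicitly. The data are hypothetical, and the critical value 3.182 (two-sided, 5%, 3 degrees of freedom) is taken from a standard t table.

```python
import math

def slope_t_statistic(x, y):
    """Return (b, SE(b), t) for H0: beta = 0 in a simple regression."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b = sxy / sxx
    a = y_bar - b * x_bar
    ess = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    mse = ess / (n - 2)            # error variance, n - 2 degrees of freedom
    se_b = math.sqrt(mse / sxx)    # standard error of the slope
    return b, se_b, b / se_b

# Hypothetical data, not taken from the course examples.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

b, se_b, t = slope_t_statistic(x, y)
T_CRITICAL = 3.182  # assumed table value: two-sided 5% level, df = n - 2 = 3

print(f"b = {b:.3f}, SE(b) = {se_b:.3f}, t = {t:.3f}")
print("reject H0" if abs(t) > T_CRITICAL else "fail to reject H0")
```

With only five observations the positive slope is not statistically different from zero, which is why the test matters even when the fitted line slopes upward.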
Simple Regression: Excel Output
SUMMARY OUTPUT

Regression Statistics
Multiple R            0.880
R Square              0.775
Adjusted R Square     0.769
Standard Error       59.560
Observations         40.000

ANOVA
              df           SS           MS        F  Significance F
Regression   1.000   463451.009   463451.009  130.644           0.000
Residual    38.000   134802.015     3547.421
Total       39.000   598253.024

            Coefficients  Standard Error  t Stat  P-value
Intercept        135.434          25.907   5.228    0.000
ADV               25.308           2.214  11.430    0.000
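The pieces of this output are linked by the formulas from the previous pages. As a hedged check, the sketch below recomputes several entries from the printed sums of squares and coefficients; variable names are illustrative, and small differences can arise because the printed values are rounded.

```python
import math

# Values copied from the ANOVA and coefficient tables above.
ss_regression = 463451.009   # RSS (explained)
ss_residual = 134802.015     # ESS (unexplained)
ss_total = 598253.024        # TSS
df_regression, df_residual = 1, 38
b_adv, se_adv = 25.308, 2.214

r_squared = ss_regression / ss_total           # "R Square"
multiple_r = math.sqrt(r_squared)              # "Multiple R"
ms_residual = ss_residual / df_residual        # MS of the Residual row
standard_error = math.sqrt(ms_residual)        # "Standard Error" of the regression
f_stat = (ss_regression / df_regression) / ms_residual
t_adv = b_adv / se_adv                         # t Stat for ADV

print(f"R Square       {r_squared:.3f}")
print(f"Multiple R     {multiple_r:.3f}")
print(f"Standard Error {standard_error:.3f}")
print(f"F              {f_stat:.3f}")
print(f"t Stat (ADV)   {t_adv:.3f}")
```

In a simple regression the F statistic equals the square of the slope's t statistic, so `t_adv ** 2` comes out close to `f_stat`, up to rounding of the printed inputs.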
Multiple Regression
Multiple (or multivariate) regression:
One dependent variable and more than one independent variable
The Population Model (with k independent variables):
$Y = \alpha + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_k X_k + \varepsilon$
Using data on Y and $X_1, X_2, \dots, X_k$, one can fit a model:
The Fitted Model:
$\hat{Y} = a + b_1 X_1 + b_2 X_2 + \dots + b_k X_k$
1. Fit of the Model:
As in simple regression, the fit of a multiple regression model is indicated by
R² (i.e., the proportion of variation in the dependent variable (Y) that can be
explained by the independent variables (all the X’s)). Again:
$0 \le R^2 \le 1$
Unlike simple regression, there is a second issue here. If an analyst keeps
on adding more and more independent variables (X’s) to explain the
variation in the dependent variable (Y), R² never decreases and will
typically keep increasing. The question is: at what cost?
The more variables you include in the model, the more data you need
on these variables. There is a cost for additional information.
Also, it is not fair to compare the R²’s of two models, one with fewer (say 2)
independent (X) variables and another with more (say 3 or 4)
independent (X) variables. When you encounter this type of situation,
you should look at the adjusted R², which penalizes the model
with more independent variables.
$\text{Adjusted } R^2 = 1 - (1 - R^2)\,\frac{n-1}{n-k-1} = R^2\,\frac{n-1}{n-k-1} - \frac{k}{n-k-1}$
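The adjusted R² formula can be applied directly to the two Excel outputs in this handout (n = 40 in both, k = 1 for the simple model and k = 3 for the multiple model). The sketch below is illustrative: the function names are made up, and the R Square inputs are the rounded values as printed.

```python
def adjusted_r2(r2, n, k):
    """Adjusted R^2 for n observations and k independent variables."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)

def adjusted_r2_alt(r2, n, k):
    """Second form of the same formula, written as R^2 scaled minus a penalty."""
    return r2 * (n - 1) / (n - k - 1) - k / (n - k - 1)

simple_adj = adjusted_r2(0.775, n=40, k=1)
multiple_adj = adjusted_r2(0.881, n=40, k=3)

print(f"simple model:   {simple_adj:.3f}")    # simple model:   0.769
print(f"multiple model: {multiple_adj:.3f}")  # multiple model: 0.871
```

Both results match the "Adjusted R Square" lines of the respective outputs, and the penalty is larger for the model with three independent variables.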
Multiple Regression (continued)
2. Effects of the Independent Variables
A second issue that multiple regression models have to contend with is the
relative effects of the independent (X) variables. That is, which X’s have a
stronger impact on Y? Which X’s have weaker or no impact on Y? To
address this issue, we follow two steps.
Step 1: F-test
We start with a very modest question:
Does any one of the X’s have an impact on Y?
Stated differently: is at least one of the β’s ≠ 0? Thus,
$H_0: \beta_1 = \beta_2 = \dots = \beta_k = 0$
$H_a:$ At least one $\beta_i \ne 0$
The test of this hypothesis is carried out through an F-test. Look up the
ANOVA table in the regression output.
$F = \frac{RSS/k}{ESS/(n-k-1)}$
If the value of F is “large” and the
corresponding p-value is small (<0.05), then one can reject H0 and
conclude that at least one of the X’s has a statistically significant impact
on Y. But the F-test does not tell us which variable(s). For this, we have to
go to the next step.
Step 2: t-test
In this step, we answer the question:
Which of the independent (X) variables has/have a statistically significant
effect on the dependent (Y) variable? i.e., which β’s ≠ 0? One has to test a
separate hypothesis for each β associated with each independent variable.
$H_0: \beta_1 = 0$ vs. $H_a: \beta_1 \ne 0$
$H_0: \beta_2 = 0$ vs. $H_a: \beta_2 \ne 0$
...
$H_0: \beta_k = 0$ vs. $H_a: \beta_k \ne 0$
To test these hypotheses, look at the lower part of the regression output.
For each independent variable (X), there is the estimate b of the
parameter β, its standard error (SE) reflecting the variation around it, the
corresponding t-value, and finally, the p-value.
The t is computed as follows:
$t = \frac{b_i - \beta_i}{SE(b_i)}$, which under $H_0: \beta_i = 0$ becomes $t = \frac{b_i}{SE(b_i)}$
If the corresponding p-value is <0.05 (i.e., 95% confidence), then reject
$H_0$ and conclude that this X has a significant effect on Y. That effect can be
either positive or negative.
Multiple Regression: Excel Output
SUMMARY OUTPUT

Regression Statistics
Multiple R            0.939
R Square              0.881
Adjusted R Square     0.871
Standard Error       44.423
Observations         40.000

ANOVA
              df           SS           MS        F  Significance F
Regression   3.000   527209.081   175736.360   89.051           0.000
Residual    36.000    71043.943     1973.443
Total       39.000   598253.024

            Coefficients  Standard Error  t Stat  P-value
Intercept         31.150          34.175   0.911    0.368
ADV               12.968           2.737   4.738    0.000
SalesRep          41.246           7.280   5.666    0.000
Wholesale         11.524           7.691   1.498    0.143
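The two-step reading of this output (F-test first, then one t-test per variable) can be sketched in a few lines. The numbers are copied from the table above; the dictionary layout, the name `predict_y`, and the input values passed to it are hypothetical and only illustrate how the fitted equation is used.

```python
# (coefficient, standard error, p-value) as printed in the output above.
# A printed p-value of 0.000 means "smaller than 0.0005", not exactly zero.
coefficients = {
    "Intercept": (31.150, 34.175, 0.368),
    "ADV":       (12.968, 2.737, 0.000),
    "SalesRep":  (41.246, 7.280, 0.000),
    "Wholesale": (11.524, 7.691, 0.143),
}
ALPHA = 0.05

# Step 1: F = (RSS / k) / (ESS / (n - k - 1)) with k = 3 and n = 40.
f_stat = (527209.081 / 3) / (71043.943 / 36)

# Step 2: t = b / SE(b) for each row, then apply the p < 0.05 rule to the X's.
t_stats = {name: b / se for name, (b, se, _) in coefficients.items()}
significant = [name for name, (_, _, p) in coefficients.items()
               if name != "Intercept" and p < ALPHA]

def predict_y(adv, sales_rep, wholesale):
    """Fitted value Y-hat = a + b1*ADV + b2*SalesRep + b3*Wholesale."""
    return (coefficients["Intercept"][0]
            + coefficients["ADV"][0] * adv
            + coefficients["SalesRep"][0] * sales_rep
            + coefficients["Wholesale"][0] * wholesale)

print(f"F = {f_stat:.3f}")
print("significant at 5%:", significant)  # significant at 5%: ['ADV', 'SalesRep']
print(f"prediction for hypothetical inputs: {predict_y(10, 5, 2):.3f}")
```

Wholesale has a positive coefficient but a p-value of 0.143, so at the 5% level its effect cannot be distinguished from zero.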
Dummy Variables
Let’s consider the role of qualitative independent variables in regression
analysis.
In regression analysis the dependent variable is frequently influenced not
only by variables that can be readily quantified on some well-defined scale
(e.g., income, sales, prices, promotions), but also by variables that are
essentially qualitative in nature (e.g., gender, race, class, residence).
As an example, consider the following model:
Y =α + βD+ ε
Where Y = annual salary of a college professor
D=1 if male college professor
=0 otherwise (i.e., female professor)
From the model we obtain:
Mean salary of female college professor: E(Y|D=0)=α
Mean salary of male college professor: E(Y|D=1)=α+β
A test of the null hypothesis that there is no sex discrimination ( H0: β=0 )
can easily be made by running the regression in the usual manner and checking,
on the basis of the t-test, whether the estimated β is statistically significant.
See the Excel output that follows.
Let us extend the above model with one quantitative variable:
Y =α 1 + α 2 D+ βX+ ε
Where Y = annual salary of a college professor
X = years of teaching experience
D=1 if male college professor
=0 otherwise (i.e., female professor)
We obtain from the model as follow:
Mean salary of female college professor:
E(Y |X , D=0)=α 1 + βX
Mean salary of male college professor:
E(Y |X , D=1 )=(α 1 +α 2 )+βX
Dummy Variables: Excel Output
Salary   Sex (1=male, 0=female)
22.0     1
19.0     0
18.0     0
21.7     1
18.5     0
21.0     1
20.5     1
17.0     0
17.5     0
21.2     1
[Figure: chart of salary by observation (1 to 10)]
SUMMARY OUTPUT

Regression Statistics
Multiple R            0.935
R Square              0.874
Adjusted R Square     0.858
Standard Error        0.697
Observations         10.000

ANOVA
              df      SS      MS       F  Significance F
Regression  1.000  26.896  26.896  55.342           0.000
Residual    8.000   3.888   0.486
Total       9.000  30.784

            Coefficients  Standard Error  t Stat  P-value
Intercept         18.000           0.312  57.735    0.000
sex                3.280           0.441   7.439    0.000
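The coefficients in this output can be reproduced from the ten data points: with a single dummy variable, the intercept is the mean of the D = 0 group and the slope is the difference between the two group means. A minimal sketch, using the data above and the standard closed-form least-squares solution (variable names are illustrative):

```python
# Data from the table above: salary and sex (1 = male, 0 = female).
salary = [22.0, 19.0, 18.0, 21.7, 18.5, 21.0, 20.5, 17.0, 17.5, 21.2]
sex = [1, 0, 0, 1, 0, 1, 1, 0, 0, 1]

female = [y for y, d in zip(salary, sex) if d == 0]
male = [y for y, d in zip(salary, sex) if d == 1]
mean_female = sum(female) / len(female)   # E(Y | D = 0) = alpha
mean_male = sum(male) / len(male)         # E(Y | D = 1) = alpha + beta

# Least-squares fit of salary on the dummy variable.
n = len(salary)
d_bar = sum(sex) / n
y_bar = sum(salary) / n
sxy = sum((d - d_bar) * (y - y_bar) for d, y in zip(sex, salary))
sxx = sum((d - d_bar) ** 2 for d in sex)
b = sxy / sxx
a = y_bar - b * d_bar

print(f"Intercept = {a:.3f}, sex = {b:.3f}")  # Intercept = 18.000, sex = 3.280
print(f"group means: female = {mean_female:.2f}, male = {mean_male:.2f}")
```

The estimated intercept equals the mean female salary (18.00) and the coefficient on sex equals the male mean minus the female mean (21.28 − 18.00).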
Regression with Interaction Term
Let’s consider a model with an interaction term. The following model has,
as independent variables, an interval-scaled variable and the product of a
dummy and an interval-scaled variable. Consider:
$Y = \alpha_1 + \beta_1 X + \beta_2 (D \cdot X) + \varepsilon$
Where Y = annual salary of a college professor
X = years of teaching experience
D=1 if male college professor
=0 otherwise (i.e., female professor)
From the model we obtain:
D=0: $E(Y|X, D=0) = \alpha_1 + \beta_1 X$
D=1: $E(Y|X, D=1) = \alpha_1 + (\beta_1 + \beta_2) X$
Now consider a model with both a dummy variable and an interaction term. The
following model has, as independent variables, a dummy variable, an
interval-scaled variable, and the product of a dummy and an interval-scaled
variable. Consider:
$Y = \alpha_1 + \alpha_2 D + \beta_1 X + \beta_2 (D \cdot X) + \varepsilon$
Where Y = annual salary of a college professor
X = years of teaching experience
D=1 if male college professor
=0 otherwise (i.e., female professor)
From the model we obtain:
D=0: $E(Y|X, D=0) = \alpha_1 + \beta_1 X$
D=1: $E(Y|X, D=1) = (\alpha_1 + \alpha_2) + (\beta_1 + \beta_2) X$
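To see how the dummy and the interaction term produce two separate lines, the sketch below evaluates the second model for each group. The coefficient values are invented purely for illustration (they are not estimates from any data in this handout), and `expected_salary` is a hypothetical helper name.

```python
def expected_salary(x, d, alpha1, alpha2, beta1, beta2):
    """E(Y | X, D) = alpha1 + alpha2*D + beta1*X + beta2*(D*X)."""
    return alpha1 + alpha2 * d + beta1 * x + beta2 * (d * x)

# Hypothetical coefficients, chosen only to make the arithmetic easy to follow.
coef = dict(alpha1=15.0, alpha2=2.0, beta1=0.5, beta2=0.25)

lines = {}
for d, label in ((0, "female"), (1, "male")):
    intercept = expected_salary(0, d, **coef)          # value at X = 0
    slope = expected_salary(1, d, **coef) - intercept  # change per extra year
    lines[label] = (intercept, slope)
    print(f"{label}: intercept = {intercept}, slope = {slope}")
# female: intercept = 15.0, slope = 0.5
# male: intercept = 17.0, slope = 0.75
```

The dummy alone (α2) shifts the intercept, while the interaction term (β2) changes the slope; setting α2 = 0 recovers the first model, in which both groups share one intercept.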
Multicollinearity
Definition: A condition said to be present in a multiple regression analysis
when the independent variables are highly correlated among themselves.
Interpretation of the multiple regression equation depends implicitly on the
assumption that the predictor variables are not strongly interrelated. It is
usual to interpret a regression coefficient as measuring the change in the
dependent variable when the independent variable is increased by one unit
and all other predictor variables are held constant. At higher degrees of
multicollinearity, the coefficients for individual independent variables
become unstable and as a result they cannot be interpreted effectively:
Estimators (b’s) have larger variances (standard deviations)
Confidence intervals get wider
Because the variances get larger, t-stats tend to be insignificant
Even though the t-stats are low, R² can be very high
OLS estimators (b’s) and their standard errors, s.e.(b), can be sensitive
to small changes in the data.
How to detect this problem
Variance Inflation Factor: $VIF_j = \frac{1}{1 - R_j^2}$
where j = 1, 2, …, k (the number of independent variables), and
$R_j^2$ denotes the R² from regressing $X_j$ against all the other X’s;
e.g., with k = 3, $R_1^2$ is the R-square from the regression of $X_1$ against
$X_2$ and $X_3$.
As a rule of thumb, if VIF > 10, then there is a multicollinearity problem.
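The VIF formula is easy to evaluate once R²_j is known. The sketch below covers only the simplest case of two independent variables, where R²_j is just the squared correlation between X1 and X2; the correlation values are hypothetical, and with more than two X's one would need the R² from an auxiliary multiple regression instead.

```python
def vif(r2_j):
    """Variance inflation factor 1 / (1 - R_j^2) for one independent variable."""
    if not 0.0 <= r2_j < 1.0:
        raise ValueError("R_j^2 must lie in [0, 1)")
    return 1.0 / (1.0 - r2_j)

# Hypothetical correlations between X1 and X2 (two-predictor case: R_j^2 = r^2).
for r12 in (0.50, 0.90, 0.95):
    value = vif(r12 ** 2)
    verdict = "multicollinearity problem" if value > 10 else "ok"
    print(f"r = {r12:.2f}  VIF = {value:.2f}  {verdict}")
```

The VIF > 10 cutoff corresponds to R²_j above 0.9, so in the two-variable case even a correlation of 0.90 (R²_j = 0.81) stays below the threshold.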
How to solve this problem
Using a priori information, e.g., with k = 3,
$Y = \alpha + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + \varepsilon$
Suppose there is a historical relationship between $X_1$ and $X_2$ that is
approximated as $X_1 = 0.3 X_2$. Replacing $X_1$ in the original model,
$Y = \alpha + (0.3\beta_1 + \beta_2) X_2 + \beta_3 X_3 + \varepsilon$
By dropping the variable $X_1$ from the equation, we are able to obtain an
accurate estimate of the combined coefficient $0.3\beta_1 + \beta_2$ and of
$\beta_3$. The multicollinearity between $X_1$ and $X_2$ is no
longer present.