Correlation and Regression Analysis Guide

The document provides an overview of correlation and regression analysis, focusing on their definitions, applications, and interpretations in marketing strategy. It outlines the objectives of understanding correlation, regression, scatter plots, and the significance of dummy variables and multicollinearity. Additionally, it includes examples of simple and multiple regression models, their statistical outputs, and the importance of testing hypotheses regarding the effects of independent variables.

KU | Correlation and Regression Analysis | Prof. Taewan Kim

Marketing Strategy
3960, 3961

Correlation and Regression Analysis


Introduction
What if we are interested in the relationship between two variables?
If we change one variable (x), does another variable (y) also change?
We might also use one variable (x) to predict another variable (y).
These two types of analysis, correlation and regression, are powerful
tools for using your data to make decisions.

Objectives
1. Define correlation.
2. Define regression.
3. Interpret scatter plots.
4. Explain simple regression.
5. Explain multiple regression.
6. Interpret a simple regression output.
7. Interpret a multiple regression output.
8. Define dummy variable.
9. Define multicollinearity.

Resources
Primary
- Churchill and Brown, Basic Marketing Research, Chapter 20


Correlation and Regression

Correlation: closeness of the relationship between two variables.

“Is x correlated to y?” is the same as “Is y correlated to x?”

Corr(X,Y) = Corr(Y,X)

*Correlation does not mean causation!

Regression: Derive an equation that relates the criterion variable
to one or more predictor variables.

“Is x useful in predicting y?” (This is not the same as “Is y useful
in predicting x?”)

1. Variable of Interest: Y

2. Why is Y fluctuating?

a. Scatter plots

b. Pick one Xi from theory or model

c. Fitted model (Ordinary Least Squares)


$$\min \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2$$

3. Is the model any good?


How close does the fitted line come to the actual scatter plot?

[Figure: TSS, RSS, and ESS]

$$0 \le R^2 = \frac{RSS}{TSS} \le 1$$


Bivariate (two-variable) Scatter Plots


Pearson’s Correlation Coefficient

Association between two variables: how one variable changes (from its
average) as another variable changes (from its average).

$$r = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum (x_i - \bar{x})^2}\,\sqrt{\sum (y_i - \bar{y})^2}} = \frac{\mathrm{Cov}(x, y)}{s_x s_y}$$

$$-1 \le r \le +1$$
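
As a minimal sketch, the formula for r can be computed directly from deviations around the means. The data below (`adv`, `sales`) are hypothetical and are not taken from the handout; the function name `pearson_r` is also just an illustration.

```python
import math


def pearson_r(x, y):
    """Pearson's r: sum of cross-deviations over the product of root sums of squares."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    return sxy / (math.sqrt(sxx) * math.sqrt(syy))


# Hypothetical data: advertising (x) and sales (y) for five periods.
adv = [1.0, 2.0, 3.0, 4.0, 5.0]
sales = [2.0, 4.0, 5.0, 4.0, 5.0]

r_xy = pearson_r(adv, sales)
r_yx = pearson_r(sales, adv)  # swapping the arguments should give the same value
print(round(r_xy, 4), round(r_yx, 4))
```

Swapping the two arguments leaves the result unchanged, which is the symmetry Corr(X,Y) = Corr(Y,X) noted earlier.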


Regression: Assessing the Effects of Marketing Mix Variables

Simple Regression Model

Simple (or bivariate) regression examines the effect of one independent
variable (X) on a dependent variable (Y).

The Population Model:

$$Y = \alpha + \beta X + \varepsilon$$

Where,
Y = Dependent variable (or Outcome)
X = Independent variable (Input)
α = Intercept; value of Y when X = 0
β = Slope; the change in Y due to a one-unit change in X
ε = Error term

β is a measure of the effect of X on Y.

When β= 0, X has no effect on Y.


When β >0, X has a positive effect on Y.
(e.g., Effects of advertising, # sales persons on sales)
When β <0, X has a negative effect on Y.
(e.g., Effects of a product's own price on its own sales)

The Fitted Model:


Based on data on X and Y, we fit the regression line as:

$$\hat{Y} = a + bX$$

Where,
Ŷ = Predicted (or fitted) value of the dependent variable for a given value of
X. The fitted value Ŷ may not match the actual value of Y. This
results in an error = Y − Ŷ.

a = estimate of the intercept α, based on the sample data on X and Y

b = estimate of the slope β, based on the sample data on X and Y

a and b are also referred to as the (estimated) regression parameters.
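
The handout does not spell out how a and b are obtained. A minimal sketch, assuming the standard least-squares solution b = Σ(x − x̄)(y − ȳ) / Σ(x − x̄)² and a = ȳ − b·x̄, on hypothetical data (the function name `fit_simple_ols` is illustrative):

```python
def fit_simple_ols(x, y):
    """Return (a, b) for the fitted line Y-hat = a + bX using the closed-form OLS solution."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b = sxy / sxx            # slope estimate
    a = y_bar - b * x_bar    # intercept estimate
    return a, b


# Hypothetical data, not from the handout.
adv = [1.0, 2.0, 3.0, 4.0, 5.0]
sales = [2.0, 4.0, 5.0, 4.0, 5.0]

a, b = fit_simple_ols(adv, sales)
fitted = [a + b * xi for xi in adv]                 # Y-hat for each observation
errors = [yi - fi for yi, fi in zip(sales, fitted)]  # error = Y - Y-hat
print(f"a={a:.3f}, b={b:.3f}")
```

With an intercept in the model, the errors sum to zero, so the fitted line passes through the point (x̄, ȳ).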


Simple Regression (Continued)

Is the Fitted Regression Model any good?

Note that the analyst picked the independent variable (X) based on
a conjecture: that this X (e.g., the number of sales visits
made by a salesperson) might have an effect on Y (e.g., actual sales
generated). That assumption might be wrong.

The problem is that any data on X and Y will produce a fitted regression
line (i.e., OLS parameters a and b): Ŷ = a + bX

Therefore, two very important questions to ask are:

1. How well does the fitted regression line “fit” (i.e., come close to) the actual
data points (i.e., the scatter plot)?

2. Is the estimated slope parameter b statistically different from zero?

Note that when the slope is zero, the fitted line is flat, suggesting that
‘X has no effect on Y’.

1. Fit of a Regression
If the analyst has no explanation for why the dependent variable Y (e.g.,
sales) varies from observation to observation (e.g., salesperson to
salesperson), s/he can only look at each data point $Y_i$ and examine how
much it deviates from the average $\bar{Y}$.

The deviation is $(Y_i - \bar{Y})$. Squaring this and adding across all observations,
we get what we call the Total Sum of Squares (TSS). Thus,

$$TSS = \sum_{i=1}^{n} (Y_i - \bar{Y})^2$$

Note that $TSS = s^2 (n-1)$, where $s^2$ is the sample variance of Y.


Simple Regression (Continued)

2. Partitioning of TSS

A regression partitions TSS into two parts: TSS = RSS + ESS
(1) Regression (or explained) Sum of Squares (RSS) and
(2) Error (or unexplained) Sum of Squares (ESS)

$$RSS = \sum_{i=1}^{n} (\hat{Y}_i - \bar{Y})^2$$
: the difference between the regression line and the mean

$$ESS = \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2$$
: the difference between the data and the regression line

R² is a measure of “fit”. It gives the share of the total variation in Y (TSS) that
can be explained by the regression (RSS), i.e.,

$$R^2 = \frac{RSS}{TSS} = 1 - \frac{ESS}{TSS}, \qquad 0 \le R^2 \le 1$$

In simple regression, the correlation r satisfies

$$|r| = \sqrt{R^2}$$

and r takes the sign of the slope b.
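
The partition can be checked numerically. The sketch below uses hypothetical data (not from the handout) and a small helper `fit_line`, which is an illustrative name; it verifies that TSS = RSS + ESS and that TSS equals the sample variance times (n − 1).

```python
import statistics


def fit_line(x, y):
    """Least-squares intercept a and slope b for Y-hat = a + bX."""
    x_bar, y_bar = statistics.mean(x), statistics.mean(y)
    b = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
         / sum((xi - x_bar) ** 2 for xi in x))
    return y_bar - b * x_bar, b


# Hypothetical data, not from the handout.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]

a, b = fit_line(x, y)
y_bar = statistics.mean(y)
fitted = [a + b * xi for xi in x]

tss = sum((yi - y_bar) ** 2 for yi in y)                # total variation
rss = sum((fi - y_bar) ** 2 for fi in fitted)           # explained by the regression
ess = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))  # left unexplained
r_squared = rss / tss

print(f"TSS={tss:.3f} RSS={rss:.3f} ESS={ess:.3f} R^2={r_squared:.3f}")
print(f"s^2*(n-1)={statistics.variance(y) * (len(y) - 1):.3f}")
```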

3. Test for the Slope Parameter (b)

Since we use a sample to generate data on X and Y, the estimates
a and b are not exact values; each has a confidence interval around it,
whose width depends on its standard error (SE).

Therefore, we need to test whether the slope parameter (b) is statistically different
from zero.

$$H_0: \beta = 0$$
$$H_a: \beta \ne 0$$

Just as in a one-population mean test, we compute the test
statistic as follows:

$$t = \frac{b - \beta}{SE(b)} = \frac{b}{SE(b)} \quad \text{under } H_0: \beta = 0$$

The computer output prints the t statistic and the corresponding probability of
error (called the p-value).

If p < 0.05 (i.e., 95% confidence), then reject H0 and conclude that
β ≠ 0, which means that X has a significant effect on Y.
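
A minimal sketch of the slope test on hypothetical data. It assumes the standard textbook formula SE(b) = sqrt(MSE / Σ(x − x̄)²) with MSE = ESS / (n − 2), which the handout does not state, and it compares |t| with a critical value from a t table (3.182 for 3 degrees of freedom, two-sided 5%) instead of computing a p-value.

```python
import math

# Hypothetical data, not from the handout.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n
sxx = sum((xi - x_bar) ** 2 for xi in x)
b = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
a = y_bar - b * x_bar

ess = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
mse = ess / (n - 2)            # error variance estimate; n - 2 degrees of freedom
se_b = math.sqrt(mse / sxx)    # assumed standard formula for SE(b)
t_stat = b / se_b              # t under H0: beta = 0

T_CRITICAL = 3.182             # t table value, df = 3, two-sided 5% (assumption: looked up, not computed)
reject_h0 = abs(t_stat) > T_CRITICAL
print(f"b={b:.3f} SE(b)={se_b:.3f} t={t_stat:.3f} reject H0: {reject_h0}")
```

With only five observations the slope is not significant here even though the correlation is fairly high, which illustrates why the fit and the slope test are treated as two separate questions.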


Simple Regression: Excel Output

SUMMARY OUTPUT

Regression Statistics
Multiple R             0.880
R Square               0.775
Adjusted R Square      0.769
Standard Error        59.560
Observations          40.000

ANOVA
                df          SS          MS        F   Significance F
Regression   1.000  463451.009  463451.009  130.644            0.000
Residual    38.000  134802.015    3547.421
Total       39.000  598253.024

             Coefficients  Standard Error  t Stat  P-value
Intercept         135.434          25.907   5.228    0.000
ADV                25.308           2.214  11.430    0.000
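
As a rough check on how the pieces of this output relate, the sketch below recomputes R², F, and the t statistics from the printed sums of squares, coefficients, and standard errors. Small differences from the printed values come from rounding in the output. The helper `predicted_y` and the input value 10 are hypothetical illustrations; the handout does not state the units of ADV.

```python
import math

# Values copied from the Excel output above.
n, k = 40, 1
ss_regression = 463451.009   # RSS in the handout's notation
ss_residual = 134802.015     # ESS in the handout's notation
ss_total = 598253.024        # TSS

r_squared = ss_regression / ss_total
multiple_r = math.sqrt(r_squared)
f_stat = (ss_regression / k) / (ss_residual / (n - k - 1))

t_intercept = 135.434 / 25.907   # coefficient / standard error
t_adv = 25.308 / 2.214


def predicted_y(adv):
    """Fitted model from the output: Y-hat = 135.434 + 25.308 * ADV."""
    return 135.434 + 25.308 * adv


print(f"R^2={r_squared:.3f} R={multiple_r:.3f} F={f_stat:.3f}")
print(f"t(Intercept)={t_intercept:.3f} t(ADV)={t_adv:.3f}")
print(f"Y-hat at ADV=10 (hypothetical input): {predicted_y(10):.3f}")
```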


Multiple Regression

Multiple (or multivariate) regression:

One dependent variable and more than one independent variable

The Population Model (with k independent variables):

$$Y = \alpha + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_k X_k + \varepsilon$$

Using data on Y and $X_1, X_2, \dots, X_k$, one can fit a model:

The Fitted Model:

$$\hat{Y} = a + b_1 X_1 + b_2 X_2 + \dots + b_k X_k$$

1. Fit of the Model:

As in simple regression, the fit of a multiple regression model is indicated by
R² (i.e., the proportion of variation in the dependent (Y) variable that can be
explained by the independent variables (all the X’s)). Again:

$$0 \le R^2 \le 1$$

Unlike simple regression, there is a second issue here. If an analyst keeps
adding more and more independent variables (X’s) to explain the
variation in the dependent (Y) variable, then s/he can keep
increasing R², because R² never decreases when a variable is added. The question is: at what cost?

The more variables you include in the model, the more
data you need on these variables. There is a cost for additional information.
Also, it is not fair to compare the R²’s of two models, one with fewer (say 2)
independent (X) variables and another with more (say 3 or 4)
independent (X) variables. In this type of situation,
you should look at the adjusted R², which penalizes the model
with more independent variables.

$$\text{Adjusted } R^2 = 1 - (1 - R^2)\,\frac{n-1}{n-k-1} = R^2\left(\frac{n-1}{n-k-1}\right) - \left(\frac{k}{n-k-1}\right)$$
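
A short sketch of the adjusted R² formula, applied to the rounded R² values printed in the two Excel outputs in this handout (n = 40 with k = 1, and n = 40 with k = 3). The function names are illustrative; the second function is the algebraically equivalent form of the same formula.

```python
def adjusted_r_squared(r_squared, n, k):
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - k - 1)."""
    return 1.0 - (1.0 - r_squared) * (n - 1) / (n - k - 1)


def adjusted_r_squared_alt(r_squared, n, k):
    """Equivalent form: R^2 * (n - 1)/(n - k - 1) - k/(n - k - 1)."""
    return r_squared * (n - 1) / (n - k - 1) - k / (n - k - 1)


simple = adjusted_r_squared(0.775, n=40, k=1)     # simple regression output
multiple = adjusted_r_squared(0.881, n=40, k=3)   # multiple regression output
print(f"{simple:.3f} {multiple:.3f}")
```

Both results agree with the "Adjusted R Square" rows of the outputs to three decimals.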


Multiple Regression (continued)

2. Effects of the Independent Variables

A second issue that multiple regression models have to contend with is the
relative effects of the independent (X) variables. That is, which X’s have a
stronger impact on Y? Which X’s have a weaker impact, or none? To
address this issue, we follow two steps.

Step 1: F-test
We start with a very modest question:
Does any one of the X’s have an impact on Y?
Stated differently: is at least one of the β’s ≠ 0? Thus,

$$H_0: \beta_1 = \beta_2 = \dots = \beta_k = 0$$
$$H_a: \text{at least one } \beta_i \ne 0$$

The test of this hypothesis is carried out through an F-test:

$$F = \frac{RSS / k}{ESS / (n - k - 1)}$$

Look up the ANOVA table in the regression output. If the value of F is “large” and the
corresponding p-value is small (< 0.05), then one can reject H0 and
conclude that at least one of the X’s has a statistically significant impact
on Y. But the F-test does not tell us which variable(s). For this, we have to
go to the next step.

Step 2: t-test
In this step, we answer the question:
Which of the independent (X) variables has/have a statistically significant
effect on the dependent (Y) variable? That is, which β’s ≠ 0? One has to test a
separate hypothesis for each β associated with each independent variable.

$$H_0: \beta_1 = 0 \text{ vs. } H_a: \beta_1 \ne 0, \quad H_0: \beta_2 = 0 \text{ vs. } H_a: \beta_2 \ne 0, \quad \dots, \quad H_0: \beta_k = 0 \text{ vs. } H_a: \beta_k \ne 0$$

To test these hypotheses, look at the lower part of the regression output.
For each independent variable ($X_i$), there is an estimate $b_i$ of the
parameter $\beta_i$, its standard error (SE) reflecting the variation around it, the
corresponding t-value, and finally, the p-value.

The t is computed as follows:

$$t = \frac{b_i - \beta_i}{SE(b_i)} = \frac{b_i}{SE(b_i)} \quad \text{under } H_0: \beta_i = 0$$

If the corresponding p-value is < 0.05 (i.e., 95% confidence), then reject
H0 and conclude that $X_i$ has a significant effect on Y. That effect can be
either positive or negative.


Multiple Regression: Excel Output

SUMMARY OUTPUT

Regression Statistics
Multiple R             0.939
R Square               0.881
Adjusted R Square      0.871
Standard Error        44.423
Observations          40.000

ANOVA
                df          SS          MS       F   Significance F
Regression   3.000  527209.081  175736.360  89.051            0.000
Residual    36.000   71043.943    1973.443
Total       39.000  598253.024

             Coefficients  Standard Error  t Stat  P-value
Intercept          31.150          34.175   0.911    0.368
ADV                12.968           2.737   4.738    0.000
SalesRep           41.246           7.280   5.666    0.000
Wholesale          11.524           7.691   1.498    0.143
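
The two-step procedure (F-test, then t-tests) can be traced on this output. The sketch below recomputes F from the printed sums of squares and each t from coefficient / standard error, then flags the X variables whose printed p-value is below 0.05. The dictionary layout is an illustration only; a printed p-value of 0.000 means it rounds to zero at three decimals.

```python
# Values copied from the Excel output above.
n, k = 40, 3
ss_regression = 527209.081   # RSS in the handout's notation
ss_residual = 71043.943      # ESS in the handout's notation

# Step 1: overall F-test statistic.
f_stat = (ss_regression / k) / (ss_residual / (n - k - 1))

# Step 2: one t-test per coefficient.
# name: (coefficient b, standard error, printed p-value)
output_rows = {
    "Intercept": (31.150, 34.175, 0.368),
    "ADV": (12.968, 2.737, 0.000),
    "SalesRep": (41.246, 7.280, 0.000),
    "Wholesale": (11.524, 7.691, 0.143),
}

t_stats = {name: b / se for name, (b, se, p) in output_rows.items()}
significant = [name for name, (b, se, p) in output_rows.items()
               if name != "Intercept" and p < 0.05]

print(f"F = {f_stat:.3f}")
for name, t in t_stats.items():
    print(f"{name:<10} t = {t:.3f}")
print("Significant at the 5% level:", significant)
```

On these numbers ADV and SalesRep are significant at the 5% level, while Wholesale (p = 0.143) is not.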


Dummy Variables

Let’s consider the role of qualitative independent variables in regression
analysis.

In regression analysis the dependent variable is frequently influenced not
only by variables that can be readily quantified on some well-defined scale
(e.g., income, sales, prices, promotions), but also by variables that are
essentially qualitative in nature (e.g., gender, race, class, residence).

As an example, consider the following model:

$$Y = \alpha + \beta D + \varepsilon$$

Where Y = annual salary of a college professor
D = 1 if male college professor
  = 0 otherwise (i.e., female professor)

From the model we obtain:

- Mean salary of female college professor: $E(Y \mid D=0) = \alpha$
- Mean salary of male college professor: $E(Y \mid D=1) = \alpha + \beta$

The null hypothesis that there is no sex discrimination (H0: β = 0)
can be tested by running the regression in the usual manner and checking,
on the basis of the t-test, whether the estimated β is statistically significant.
See the Excel output below.

Let us modify the above model by adding one quantitative variable:

$$Y = \alpha_1 + \alpha_2 D + \beta X + \varepsilon$$

Where Y = annual salary of a college professor
X = years of teaching experience
D = 1 if male college professor
  = 0 otherwise (i.e., female professor)

From the model we obtain:

- Mean salary of female college professor: $E(Y \mid X, D=0) = \alpha_1 + \beta X$
- Mean salary of male college professor: $E(Y \mid X, D=1) = (\alpha_1 + \alpha_2) + \beta X$


Dummy Variables: Excel Output


Salary   Sex (1=male, 0=female)
22.0     1
19.0     0
18.0     0
21.7     1
18.5     0
21.0     1
20.5     1
17.0     0
17.5     0
21.2     1

[Figure: chart of the 10 observations; vertical axis 0 to 25, horizontal axis 1 to 10]

SUMMARY OUTPUT

Regression Statistics
Multiple R             0.935
R Square               0.874
Adjusted R Square      0.858
Standard Error         0.697
Observations          10.000

ANOVA
               df      SS      MS       F   Significance F
Regression  1.000  26.896  26.896  55.342            0.000
Residual    8.000   3.888   0.486
Total       9.000  30.784

             Coefficients  Standard Error  t Stat  P-value
Intercept          18.000           0.312  57.735    0.000
sex                 3.280           0.441   7.439    0.000
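
This output can be reproduced from the ten salary observations listed above. The sketch relies on a known property of a regression with a single 0/1 dummy: the intercept equals the mean of the D = 0 group and the slope equals the difference between the two group means. The SE(b) formula, sqrt(MSE / Σ(D − D̄)²), is the standard simple-regression formula and is an assumption not stated in the handout.

```python
import math

# Data from the handout: salary and sex (1 = male, 0 = female).
salary = [22.0, 19.0, 18.0, 21.7, 18.5, 21.0, 20.5, 17.0, 17.5, 21.2]
male = [1, 0, 0, 1, 0, 1, 1, 0, 0, 1]

n = len(salary)
female_salaries = [s for s, d in zip(salary, male) if d == 0]
male_salaries = [s for s, d in zip(salary, male) if d == 1]

a = sum(female_salaries) / len(female_salaries)    # intercept = E(Y | D = 0)
b = sum(male_salaries) / len(male_salaries) - a    # slope = E(Y | D = 1) - E(Y | D = 0)

y_bar = sum(salary) / n
fitted = [a + b * d for d in male]
tss = sum((y - y_bar) ** 2 for y in salary)
ess = sum((y - f) ** 2 for y, f in zip(salary, fitted))
rss = tss - ess
r_squared = rss / tss

d_bar = sum(male) / n
sdd = sum((d - d_bar) ** 2 for d in male)
mse = ess / (n - 2)
se_b = math.sqrt(mse / sdd)   # assumed standard formula for SE(b)
t_stat = b / se_b
f_stat = rss / mse            # k = 1, so RSS / k = RSS

print(f"a={a:.3f} b={b:.3f} R^2={r_squared:.3f}")
print(f"SE(b)={se_b:.3f} t={t_stat:.3f} F={f_stat:.3f}")
```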


Regression with Interaction Term

Let’s consider models with an interaction term. The following model has,
as independent variables, an interval-scaled variable and the product of a
dummy and an interval-scaled variable. Consider:

$$Y = \alpha_1 + \beta_1 X + \beta_2 (D \cdot X) + \varepsilon$$

Where Y = annual salary of a college professor
X = years of teaching experience
D = 1 if male college professor
  = 0 otherwise (i.e., female professor)

From the model we obtain:

- D = 0: $Y = \alpha_1 + \beta_1 X$
- D = 1: $Y = \alpha_1 + (\beta_1 + \beta_2) X$

Here the two groups share the same intercept but have different slopes.

Next, consider a model that has, as independent variables, a dummy variable,
an interval-scaled variable, and the product of the dummy and the
interval-scaled variable:

$$Y = \alpha_1 + \alpha_2 D + \beta_1 X + \beta_2 (D \cdot X) + \varepsilon$$

Where Y = annual salary of a college professor
X = years of teaching experience
D = 1 if male college professor
  = 0 otherwise (i.e., female professor)

From the model we obtain:

- D = 0: $Y = \alpha_1 + \beta_1 X$
- D = 1: $Y = (\alpha_1 + \alpha_2) + (\beta_1 + \beta_2) X$

Here both the intercept and the slope differ between the two groups.
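
To make the two group equations concrete, the sketch below evaluates the model with a dummy, a quantitative variable, and their product. All coefficient values are hypothetical and chosen only for illustration; they are not estimates from the handout.

```python
# Hypothetical coefficients (illustration only, not estimated from data).
ALPHA1, ALPHA2 = 30.0, 2.0   # baseline intercept and the dummy's shift in intercept
BETA1, BETA2 = 1.0, 0.5      # baseline slope and the interaction's shift in slope


def expected_salary(x, d):
    """E(Y | X, D) = alpha1 + alpha2*D + beta1*X + beta2*(D*X)."""
    return ALPHA1 + ALPHA2 * d + BETA1 * x + BETA2 * (d * x)


def group_line(d):
    """Intercept and slope of the salary-experience line for group D = d."""
    return ALPHA1 + ALPHA2 * d, BETA1 + BETA2 * d


print("D=0 (intercept, slope):", group_line(0))
print("D=1 (intercept, slope):", group_line(1))
print("Expected salary at X=10, D=1:", expected_salary(10, 1))
```

Setting `ALPHA2 = 0.0` reproduces the first model in this section, where the groups differ only in slope.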


Multicollinearity

Definition: A condition said to be present in a multiple regression analysis
when the independent variables are highly correlated among themselves.

Interpretation of the multiple regression equation depends implicitly on the
assumption that the predictor variables are not strongly interrelated. It is
usual to interpret a regression coefficient as measuring the change in the
dependent variable when the independent variable is increased by one unit
and all other predictor variables are held constant. At higher degrees of
multicollinearity, the coefficients for individual independent variables
become unstable and, as a result, cannot be interpreted reliably:

- The estimators (b’s) have larger variances (and standard deviations)
- Confidence intervals get wider
- Because the variances get larger, t-stats tend to be insignificant
- Even with low t-stats, R² can be very high
- The OLS estimators (b’s) and their standard errors, s.e.(b), can be sensitive
  to small changes in the data.

How to detect this problem

- Variance Inflation Factor:

$$VIF_j = \frac{1}{1 - R_j^2}$$

where j = 1, 2, …, k (k is the number of independent variables), and
$R_j^2$ denotes the R² from regressing $X_j$ against all the other X’s.
For example, with k = 3, $R_1^2$ is the R-square from the regression of $X_1$ against
$X_2$ and $X_3$.
If VIF > 10, then there is a multicollinearity problem.
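
A minimal pure-Python sketch of the VIF calculation for k = 3, assuming each auxiliary regression is fitted by OLS through the normal equations. The data are hypothetical: `x2` is deliberately built to track `x1` closely. The function names are illustrative, and the code would fail on perfectly collinear predictors, where R² = 1 and VIF is undefined.

```python
def ols_r_squared(y, xs):
    """R^2 from regressing y on an intercept plus the predictor columns in xs."""
    n = len(y)
    rows = [[1.0] + [col[i] for col in xs] for i in range(n)]
    p = len(rows[0])
    # Normal equations (X'X) beta = X'y as an augmented matrix.
    aug = [[sum(r[j] * r[m] for r in rows) for m in range(p)]
           + [sum(r[j] * yi for r, yi in zip(rows, y))] for j in range(p)]
    # Gauss-Jordan elimination with partial pivoting.
    for c in range(p):
        pivot = max(range(c, p), key=lambda r: abs(aug[r][c]))
        aug[c], aug[pivot] = aug[pivot], aug[c]
        for r in range(p):
            if r != c:
                factor = aug[r][c] / aug[c][c]
                aug[r] = [v - factor * w for v, w in zip(aug[r], aug[c])]
    beta = [aug[j][p] / aug[j][j] for j in range(p)]
    fitted = [sum(bj * v for bj, v in zip(beta, r)) for r in rows]
    y_bar = sum(y) / n
    tss = sum((yi - y_bar) ** 2 for yi in y)
    ess = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    return 1.0 - ess / tss


def vif(xs):
    """VIF_j = 1 / (1 - R_j^2), regressing each X_j on all the other X's."""
    return [1.0 / (1.0 - ols_r_squared(xj, xs[:j] + xs[j + 1:]))
            for j, xj in enumerate(xs)]


# Hypothetical predictors: x2 is almost a copy of x1, x3 is unrelated noise.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
x2 = [1.1, 1.9, 3.2, 3.9, 5.1, 6.0, 6.9, 8.1]
x3 = [5.0, 3.0, 8.0, 1.0, 9.0, 2.0, 7.0, 4.0]

vifs = vif([x1, x2, x3])
for name, value in zip(["X1", "X2", "X3"], vifs):
    print(f"VIF({name}) = {value:.1f}  problem: {value > 10}")
```

In this example X1 and X2 both exceed the VIF > 10 threshold because each is almost fully explained by the other.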

How to solve this problem

- Using a priori information. For example, with k = 3,

$$Y = \alpha + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + \varepsilon$$

Suppose there is a historical relationship between $X_1$ and $X_2$ that is
approximated as $X_1 = 0.3 X_2$. Replacing $X_1$ in the original model,

$$Y = \alpha + (0.3\beta_1 + \beta_2) X_2 + \beta_3 X_3 + \varepsilon$$

By dropping the variable $X_1$ from the equation, we are able to obtain
accurate estimates of $0.3\beta_1 + \beta_2$ and $\beta_3$. The multicollinearity
between $X_1$ and $X_2$ is no longer present.
