Econometrics Ch2 Multiple Regression Analysis

Chapter 2 discusses Multiple Regression Analysis, emphasizing the importance of including multiple variables to avoid Omitted Variable Bias in economic models. It covers the mathematical derivation of regression models, the significance of key assumptions, and the interpretation of coefficients while introducing hypothesis testing methods like the t-test and F-test. The chapter also addresses issues like multicollinearity and provides practical examples and practice problems related to economic analysis.

Uploaded by

thanhhhk24410e1

CHAPTER 2 Multiple Regression Analysis

Multiple Regression Analysis


1. Introduction: Why do we need more variables?

Simple Linear Regression (SLR) is rarely enough for economic analysis because of
Omitted Variable Bias.
• Example: If we regress Wage only on Education, the coefficient for Education is
likely "too high."
• Why? It captures the effect of Education plus the effect of omitted variables
correlated with education (like Intelligence or Family Background).
• Solution: We must explicitly include these variables in the model to control for them.

The Mathematical Derivation


To show students why the error term u becomes correlated with X, you can compare the
"True World" with the "Model We Estimated."

Step A: The True Model


Imagine the true population relationship for the dependent variable Y depends on two
variables, X and Z:

Y = β0 + β1X + β2Z + v
• Here, v is the "true" random noise (uncorrelated with X or Z).

• Z is a relevant variable (β2 ≠ 0).



Step B: The Omitted Model (What we actually run)


Suppose we do not have data for Z (or we simply forget to include it). We run this
regression:

Y = α0 + α1X + u
• We omitted Z.
• u is the error term for this specific misspecified model.

Step C: What is inside u?


By comparing the True Model and the Omitted Model, we can see exactly what the error
term u is composed of:

u = β2Z + v
The error term u is not just random noise anymore; it effectively "absorbs" the omitted
variable Z.

Step D: The Bias Mechanism


Now, we ask: Is X correlated with u?

Cov(X, u) = Cov(X, β2Z + v)


Cov(X, u) = β2Cov(X, Z) + Cov(X, v)
Since v is pure noise, Cov(X, v) = 0. We are left with:
Cov(X, u) = β2Cov(X, Z)

Conclusion: The error term u is correlated with X if and only if both of the following hold:

1. The omitted variable affects Y (β2 ≠ 0).
2. The omitted variable is correlated with X (Cov(X, Z) ≠ 0).

Because the correlation between X and u is driven entirely by the relationship between X
and the missing Z, we call the resulting bias in the estimated slope Omitted Variable Bias.

What happens instead?

If you omit a variable Z that is uncorrelated with X (Cov(X, Z) = 0), two things happen:
1. Bias: Zero. Your estimator β̂1 remains correct on average (it is still unbiased).
2. Variance: The "noise" term (u) gets larger because it now includes Z. This makes
your Standard Errors larger (less precision), but it does not make the coefficient
wrong.
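
The derivation in Steps A to D can be checked numerically. The sketch below is illustrative only: the data, the coefficient values (1, 2, 3) and the helper name `slope` are invented, and the noise v is set to zero so the arithmetic is exact. It also relies on the standard result that the short-regression slope equals β1 + β2·δ, where δ is the slope from regressing Z on X, which is the covariance expression from Step D divided by the variance of X.

```python
def slope(x, y):
    """OLS slope from a simple regression of y on x (with an intercept)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    return sxy / sxx


# Hypothetical true model: Y = 1 + 2X + 3Z (noise v set to zero for clarity).
beta0, beta1, beta2 = 1.0, 2.0, 3.0
x = [1.0, 2.0, 3.0, 4.0, 5.0]

# Case 1: the omitted Z is correlated with X.
z_corr = [2.0, 1.0, 4.0, 3.0, 5.0]
y_corr = [beta0 + beta1 * xi + beta2 * zi for xi, zi in zip(x, z_corr)]
short_slope_corr = slope(x, y_corr)   # slope from the regression that omits Z
delta = slope(x, z_corr)              # slope from regressing Z on X
bias_formula = beta1 + beta2 * delta  # beta1 plus the omitted variable bias

# Case 2: the omitted Z has zero sample covariance with X.
z_uncorr = [1.0, 0.0, 5.0, 0.0, 1.0]
y_uncorr = [beta0 + beta1 * xi + beta2 * zi for xi, zi in zip(x, z_uncorr)]
short_slope_uncorr = slope(x, y_uncorr)

print("true beta1:", beta1)
print("short slope, Z correlated with X:", short_slope_corr)
print("beta1 + beta2 * delta:", bias_formula)
print("short slope, Z uncorrelated with X:", short_slope_uncorr)
```

In the first case the short regression overstates β1 by exactly β2·δ; in the second case the slope is unaffected, matching the "Bias: Zero" point above.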
2. The Population Regression Equation (PRE)

We extend the model to include k independent variables.

Yi = β0 + β1X1i + β2 X2i + … + βk Xki + ui


• Y: The Dependent Variable.
• X1, X2, …: The Independent (Explanatory) Variables.
• β0: The Intercept.
• β1 to βk: The Slope Parameters (Partial Regression Coefficients).
• u: The Error Term (captures unobserved factors).
3. Key Assumptions (The Gauss-Markov Extensions)

We keep the assumptions from Simple Regression (Linearity, Random Sampling, Zero
Conditional Mean, Homoskedasticity), but we add a critical new one:

Assumption: No Perfect Multicollinearity


• The Rule: No independent variable (X) can be a perfect linear combination of the
others.
• Example of Violation: You cannot include both "Expenditure in Dollars" and
"Expenditure in Euros" in the same model. They are perfectly correlated.
• Note: Imperfect multicollinearity (variables are correlated, but not perfectly) is
allowed, though it makes estimation less precise.
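
A small numerical sketch of why the dollars-and-euros example fails. With an intercept and both expenditure variables, the matrix X'X has determinant zero, so the OLS normal equations have no unique solution. The spending figures, the exchange rate of 0.5 (picked only to keep the arithmetic exact) and the helper names `det3` and `gram` are all hypothetical.

```python
def det3(m):
    """Determinant of a 3x3 matrix by cofactor expansion along the first row."""
    (a, b, c), (d, e, f), (g, h, i) = m
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)


def gram(columns):
    """X'X for a design matrix supplied as a list of columns."""
    return [[sum(u * v for u, v in zip(ci, cj)) for cj in columns] for ci in columns]


dollars = [10.0, 20.0, 35.0, 50.0]
rate = 0.5  # hypothetical fixed exchange rate, chosen for exact arithmetic
euros = [rate * d for d in dollars]
ones = [1.0] * len(dollars)  # the intercept column

# Both expenditure measures together: euros is an exact multiple of dollars.
det_collinear = det3(gram([ones, dollars, euros]))

# Replace euros with a regressor that is not a linear combination of the others.
other = [3.0, 1.0, 4.0, 1.0]
det_ok = det3(gram([ones, dollars, other]))

print("det(X'X) with dollars and euros:", det_collinear)
print("det(X'X) with dollars and an unrelated regressor:", det_ok)
```

A zero determinant is the algebraic form of "perfect multicollinearity": the software cannot separate the two effects, which is why it drops a variable or reports an error.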
4. Estimating Coefficients (OLS)

The Ordinary Least Squares (OLS) principle remains the same: we want to minimize the
Sum of Squared Residuals (SSR).
min Σ (Yi − β̂0 − β̂1X1i − … − β̂kXki)², summing over i = 1, …, n

• Geometric Interpretation: In Simple Regression, we fit a line. In Multiple Regression
with two X variables, we fit a plane in a 3D space. With more variables, we fit a
hyperplane.
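
One way to carry out this minimization is to solve the normal equations (X'X)b = X'y, which are the first-order conditions of the SSR problem. The sketch below does this in plain Python for two regressors; the function names `solve` and `ols` and the data are illustrative assumptions, and the code assumes there is no perfect multicollinearity (otherwise a pivot would be zero). In practice a statistics package would be used instead.

```python
def solve(a, b):
    """Solve a @ coef = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    coef = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        tail = sum(m[r][c] * coef[c] for c in range(r + 1, n))
        coef[r] = (m[r][n] - tail) / m[r][r]
    return coef


def ols(rows, y):
    """Return [intercept, slope_1, ..., slope_k]; rows holds one tuple of X values per observation."""
    design = [[1.0] + list(r) for r in rows]
    k = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    xty = [sum(d[i] * yi for d, yi in zip(design, y)) for i in range(k)]
    return solve(xtx, xty)


# Hypothetical data generated exactly from Y = 1 + 2*X1 + 3*X2 (no noise).
rows = [(1.0, 2.0), (2.0, 1.0), (3.0, 4.0), (4.0, 3.0), (5.0, 5.0)]
y = [1.0 + 2.0 * x1 + 3.0 * x2 for x1, x2 in rows]

b0, b1, b2 = ols(rows, y)
print("intercept:", b0, "slope on X1:", b1, "slope on X2:", b2)
```

Because the data lie exactly on a plane, the fitted plane recovers the generating coefficients up to rounding error.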
5. Interpreting Coefficients: "Ceteris Paribus"

This is the most important concept for economics students.

In Simple Regression (Y = β0 + β1X1):


• β1 is the total effect of X1 on Y.
In Multiple Regression (Y = β0 + β1X1 + β2 X2):
• β1 is the partial effect.
• Definition: β1 measures the change in Y for a one-unit change in X1, holding X2
constant (Ceteris Paribus).

Economic Example:

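Predicted log(Wage) = … + 0.08(Education) + 0.05(Experience)

"Holding experience constant, an additional year of education is associated with
approximately an 8% increase in wages." (The percentage reading requires the dependent
variable to be in logs; with Wage in levels, 0.08 would be a change of 0.08 wage units.)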
6. Analyzing Significance (Hypothesis Testing)

We perform two types of tests in Multiple Regression.



A. The t-Test (Individual Significance)


Tests if one specific variable matters, assuming all other variables are already in the
model.

• H0: βj = 0
• t = β̂j / SE(β̂j)
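
A minimal sketch of the t-statistic calculation. The coefficient 0.08 echoes the education example above, while the standard error 0.02 is a made-up number for illustration. The cutoff 1.96 is the large-sample two-sided 5% critical value; for small samples the exact value from the t distribution with n − k − 1 degrees of freedom would be somewhat larger.

```python
def t_statistic(beta_hat, se):
    """t-statistic for H0: beta_j = 0."""
    return beta_hat / se


beta_hat_educ = 0.08  # estimated coefficient on education
se_educ = 0.02        # hypothetical standard error, not taken from the source

t = t_statistic(beta_hat_educ, se_educ)
critical = 1.96       # large-sample two-sided 5% critical value
reject_h0 = abs(t) > critical

print("t-statistic:", t)
print("reject H0 at the 5% level:", reject_h0)
```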

B. The F-Test (Joint Significance)


Tests if the entire group of variables explains Y, or if the model is useless.

• H0 : β1 = β2 = … = βk = 0 (All slopes are zero).


• If the F-statistic is high (P-value < 0.05), at least one variable helps explain Y.
There are two common ways to write the F-statistic formula. Both yield the exact same number.
Formula A: Using Sums of Squares (The ANOVA Method)
This formula compares the "Explained Variance" (Signal) to the "Unexplained Variance"
(Noise).
F = Explained Variance / Unexplained Variance = (ESS/k) / (RSS/(n − k − 1))
• ESS: Explained Sum of Squares (Variation captured by the model).
• RSS: Residual Sum of Squares (Variation missed by the model; the same quantity as the
Sum of Squared Residuals, SSR, that OLS minimizes).
• k: Number of independent variables (degrees of freedom for the model).
• n - k - 1: Degrees of freedom for the residuals.

Formula B: Using R² (The "Goodness of Fit" Method)

This version is often easier for students to calculate if they only have the R² value.

F = (R²/k) / ((1 − R²)/(n − k − 1))
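
The claim that both formulas give the same number can be checked directly. The sketch uses invented values (ESS = 60, RSS = 40, n = 23, k = 2) and the standard identity R² = ESS / (ESS + RSS); the function names are hypothetical.

```python
def f_from_sums(ess, rss, n, k):
    """Formula A: F from the explained and residual sums of squares."""
    return (ess / k) / (rss / (n - k - 1))


def f_from_r2(r2, n, k):
    """Formula B: F from R-squared."""
    return (r2 / k) / ((1 - r2) / (n - k - 1))


# Hypothetical regression summary.
ess, rss = 60.0, 40.0
n, k = 23, 2
r2 = ess / (ess + rss)  # total variation = explained + residual

f_a = f_from_sums(ess, rss, n, k)
f_b = f_from_r2(r2, n, k)

print("R-squared:", r2)
print("F from sums of squares:", f_a)
print("F from R-squared:", f_b)
```

The two agree because dividing both ESS and RSS by the total sum of squares leaves the ratio unchanged.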
7. Variance and Standard Deviation of Estimators

In multiple regression, the precision of our estimates depends on how correlated our X
variables are.

The Variance of a slope coefficient β̂j is:

Var(β̂j) = σ² / (SSTj(1 − Rj²))

• σ²: The variance of the error term (noise in the data).
• SSTj: The total variation in variable Xj (Total Sum of Squares).
• Rj²: The R-squared obtained from regressing Xj on all other independent variables.

The Mathematical Logic: "Partialling Out"

• Step 1: Isolate the variation in Xj.

In a multiple regression, Xj might be correlated with other variables. To find the
specific effect of Xj, OLS essentially looks for the variation in Xj that is unique (not
explained by the other variables).
We find this by regressing Xj on all other independent variables.
◦ The "good" variation is the residual from this regression.
◦ The Sum of Squared Residuals from this auxiliary regression is exactly:
SSTj(1 − Rj²).
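
Step 1 can be illustrated with two regressors, where the auxiliary regression of X1 on X2 is just a simple regression. The sketch checks two things on invented data: that the auxiliary SSR equals SST1(1 − R1²), and that regressing Y on the auxiliary residuals reproduces the multiple regression slope on X1 (the data are generated exactly from Y = 1 + 2·X1 + 3·X2, so that slope is 2). The helper name `simple_fit` is hypothetical.

```python
def simple_fit(x, y):
    """Return (intercept, slope) from a simple regression of y on x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    b = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sum(
        (xi - x_bar) ** 2 for xi in x
    )
    return y_bar - b * x_bar, b


# Hypothetical data: Y = 1 + 2*X1 + 3*X2 exactly, with X1 and X2 correlated.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0]
x2 = [2.0, 1.0, 4.0, 3.0, 5.0]
y = [1.0 + 2.0 * a + 3.0 * b for a, b in zip(x1, x2)]

# Auxiliary regression: X1 on X2. The residuals are the "unique" variation in X1.
a0, a1 = simple_fit(x2, x1)
fitted = [a0 + a1 * b for b in x2]
resid = [a - f for a, f in zip(x1, fitted)]
ssr_aux = sum(r ** 2 for r in resid)

# SST_1 and R_1^2 (computed as explained / total, not from ssr_aux).
x1_bar = sum(x1) / len(x1)
sst_1 = sum((a - x1_bar) ** 2 for a in x1)
r2_1 = sum((f - x1_bar) ** 2 for f in fitted) / sst_1

# Slope on X1 obtained from the unique variation only.
beta1_hat = sum(r * yi for r, yi in zip(resid, y)) / ssr_aux

print("auxiliary SSR:", ssr_aux)
print("SST_1 * (1 - R_1^2):", sst_1 * (1 - r2_1))
print("slope on X1 from partialling out:", beta1_hat)
```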

• Step 2: The General Variance Formula.

For a simple regression (one X), the variance is just σ²/SSTx.
For multiple regression, we replace the total variation (SSTx) with the unique variation
we found in Step 1.

• Step 3: Combine them.

Var(β̂j) = Noise Variance / Unique Variation in Xj

Var(β̂j) = σ² / (SSTj(1 − Rj²))

Intuitive Explanation
A. The Numerator (The Bad Stuff)

• σ² (Error Variance): This represents the noise in the data.
◦ Logic: If the data points are very scattered around the true line (high σ²), it is very
hard to pinpoint the exact slope.
◦ Effect: Higher σ² → Higher Variance (Less precise).



B. The Denominator (The Good Stuff)


The denominator represents the quality of the signal. We want this to be as large as
possible to get a small variance.

1. SSTj (Total Variation in X):


◦ Logic: We need X to move around! If X never changes (e.g., everyone has the
same education level), we can't estimate how it affects wages. The more spread
out X is, the easier it is to see the trend.

◦ Effect: Higher SSTj → Lower Variance (More precise).



2. (1 − Rj²) (Independence of X):

◦ Logic: This measures how distinct Xj is from the other variables.
◦ If Xj is highly correlated with other variables (Multicollinearity), then Rj² is close to
1, and (1 − Rj²) becomes tiny (close to 0).
◦ This makes the denominator tiny, which makes the Variance HUGE.
◦ Effect: We need Xj to have its own unique variation. High correlation (high Rj²)
kills precision.

The "Variance Inflation Factor" (VIF):


The term 1/(1 − Rj²) is called the VIF.

• If Xj is highly correlated with the other X variables (Multicollinearity), Rj² is high.
• This makes the VIF high → Variance increases → Standard Errors increase →
t-statistics get smaller.
• Lesson: Multicollinearity makes it harder to find statistically significant results.
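
To see how quickly precision deteriorates, the variance formula can be evaluated for a few values of Rj². The inputs σ² = 4 and SSTj = 100 are hypothetical, as are the function names.

```python
def vif(r2_j):
    """Variance Inflation Factor for regressor j."""
    return 1.0 / (1.0 - r2_j)


def var_beta(sigma2, sst_j, r2_j):
    """Var(beta_hat_j) = sigma^2 / (SST_j * (1 - R_j^2))."""
    return sigma2 / (sst_j * (1.0 - r2_j))


sigma2 = 4.0   # hypothetical error variance
sst_j = 100.0  # hypothetical total variation in X_j

results = []
for r2_j in (0.0, 0.5, 0.9, 0.99):
    variance = var_beta(sigma2, sst_j, r2_j)
    results.append((r2_j, vif(r2_j), variance, variance ** 0.5))

for r2_j, v, variance, se in results:
    print(f"R_j^2 = {r2_j:.2f}  VIF = {v:7.2f}  Var = {variance:.4f}  SE = {se:.4f}")
```

The variance at any Rj² is the variance at Rj² = 0 multiplied by the VIF, so the standard error grows with the square root of the VIF.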
8. Examples (Economics Focus)

Passage 1: The GPA Model


A university researcher wants to predict student GPA (Y).

• Model A: GPA = β0 + β1(HoursStudied) + u


• Model B: GPA = β0 + β1(HoursStudied) + β2(SAT_Score) + u
After running the regressions, the coefficient for HoursStudied (β1) drops from 0.15 in
Model A to 0.05 in Model B.

Question 1 Which of the following best explains the decrease in the β1 coefficient in Model
B?

A) Model B has a lower R² than Model A.


B) SAT_Score is negatively correlated with GPA.
C) HoursStudied and SAT_Score are positively correlated, and SAT_Score affects GPA,
causing Model A to suffer from Omitted Variable Bias.
D) The sample size was too small to estimate Model B accurately.

Passage 2: Housing Prices


An economist estimates the following model for house prices in a city:

Predicted Price = 50,000 + 100(Size) − 5,000(Distance)


Where Size is in square feet and Distance is miles from the city center.

Question 2 Based on the equation above, what is the interpretation of the coefficient
-5,000?
A) For every additional mile from the city center, the house price decreases by $5,000.
B) For every additional mile from the city center, the house price decreases by $5,000,
holding the size of the house constant.
C) Houses located in the city center cost $5,000 less than houses outside the city.
D) Distance is not a statistically significant predictor of house price.

Passage 3: Multicollinearity Logic


A researcher attempts to predict total household consumption (C) using two variables:

1. Income_Pre_Tax (X1)

2. Income_Post_Tax (X2)
The regression software returns an "Error" or very strange results with massive standard
errors.

Question 3 What is the most likely technical reason for this error?
A) Heteroskedasticity: The variance of consumption is higher for rich people.
B) Perfect Multicollinearity: Pre-tax and Post-tax income are perfectly (or near perfectly)
linearly related.
C) Endogeneity: Consumption causes Income.
D) The sample size is too large.
Practice problems

Problem 1: Determinants of Used Car Prices


A researcher estimates a model to predict the price of used cars based on their age and
mileage. The dataset consists of 500 used cars.

a. Write the estimated regression equation.


b. Interpret the coefficient on the age variable.
c. Predict the price of a car that is 5 years old and has 40,000 miles on it.
d. Is the coefficient on mileage statistically significant at the 1% level? Explain.

Problem 2: Wage Equation with Education and Experience


An economist estimates the effect of education and experience on hourly wages using a
sample of 1,000 workers.

a. Construct a 95% confidence interval for the coefficient on exper.


b. A worker has 16 years of education and 10 years of experience. What is their predicted
hourly wage?
c. Another worker has the same experience but 4 fewer years of education (12 years). How
much less is this worker predicted to earn per hour compared to the worker in part (b)?
d. What does the Root MSE of 3.8729 represent?

Problem 3: Advertising and Sales


A marketing analyst studies the impact of TV and Radio advertising spending on product
sales (in thousands of units).

a. Interpret the coefficient on tv_ads.


b. Is the effect of radio_ads statistically significant at the 5% level? Explain using the
p-value.
c. Calculate the t-statistic for radio_ads (verify the value in the table).
d. The company spends an additional $1,000 on TV ads (which corresponds to a 1 unit
increase in tv_ads if units are in thousands). How much are sales predicted to increase?

Problem 4: House Prices with Dummy Variables


A real estate model predicts house prices (in $1000s) based on size (sq ft) and whether the
house has a pool.

a. Write the regression equation.


b. What is the predicted price of a house with 2,000 sq ft and no pool?
c. What is the estimated "premium" for having a pool, holding size constant?
d. Is the pool premium statistically significant at the 1% level?

Problem 5: Test Scores and Student-Teacher Ratio


A policy analyst examines the relationship between district test scores and two variables:
student-teacher ratio (str) and percentage of English learners (el_pct).

a. Interpret the intercept coefficient (700.00). Does it have a realistic interpretation here?
b. If a district reduces its student-teacher ratio by 2 students (e.g., from 22 to 20), what is
the predicted change in test score, holding el_pct constant?
c. District A has str=20 and el_pct=10. District B has str=20 and el_pct=20. What is
the predicted difference in test scores between District A and District B?
d. Calculate the F-statistic using the MS values provided.
Model Selection and Specification Analysis
1. Model Selection Criteria

The most common question students ask is: "Should I keep this variable in my model?"

The Trap of R²

In Simple Regression, a higher R² is generally better. In Multiple Regression, R² is
dangerous.

• The Rule: Every time you add a variable (even a random nonsense variable), R²
never decreases. It either stays the same or goes up.

• The Problem: You can get an R² of 1.0 just by adding variables until the model has as
many estimated coefficients as observations, creating a meaningless model ("Overfitting").

The Solution: Adjusted R² (R̄²)


This metric imposes a penalty for adding useless variables.

R̄² = 1 − (1 − R²)(n − 1)/(n − k − 1)

• If you add a variable and the Adjusted R² drops, that variable likely didn't add
enough explanatory power to justify the loss of degrees of freedom.
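
A short sketch of the penalty at work. The numbers are invented for illustration and are not the values from any problem in this chapter: with n = 30, Model A has one regressor and R² = 0.50, and Model B adds two near-useless regressors that nudge R² up to 0.51.

```python
def adjusted_r2(r2, n, k):
    """Adjusted R-squared: 1 - (1 - R^2) * (n - 1) / (n - k - 1)."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)


n = 30  # hypothetical sample size

# Hypothetical Model A: one regressor.
r2_a, k_a = 0.50, 1
# Hypothetical Model B: two extra regressors, tiny gain in R^2.
r2_b, k_b = 0.51, 3

adj_a = adjusted_r2(r2_a, n, k_a)
adj_b = adjusted_r2(r2_b, n, k_b)

print(f"Model A: R2 = {r2_a:.2f}, adjusted R2 = {adj_a:.4f}")
print(f"Model B: R2 = {r2_b:.2f}, adjusted R2 = {adj_b:.4f}")
print("Model B preferred by adjusted R2:", adj_b > adj_a)
```

R² rises from Model A to Model B, but the adjusted measure falls, which is the signal that the extra variables did not earn their degrees of freedom.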
2. Specification Errors: The Two Sins

In econometrics, not all mistakes are created equal.



Sin #1: Omitting a Relevant Variable (Underfitting)


Scenario: The true model requires Education and Ability
(wage = β0 + β1Educ + β2Ability), but you only run wage = β0 + β1Educ.
• Consequence: Bias.
• Because Ability is correlated with Education (students with high ability tend to stay in
school longer), the coefficient for Education (β̂1) "steals" the credit for Ability.
• Your estimate is wrong and biased (e.g., you overestimate the return to schooling).

Sin #2: Including an Irrelevant Variable (Overfitting)


Scenario: The true model depends only on Income, but you add "Zodiac Sign" to the
regression.
• Consequence: Inefficiency (Higher Variance).
• Bias: None. Your estimates for the important variables are still unbiased (centered on
the truth).
• Variance: The standard errors of all your coefficients will likely increase. This makes
t-stats smaller, making it harder to find statistically significant results.
• Lesson: It is generally "safer" to include a variable if you aren't sure, than to omit it
and risk bias.
3. The Ramsey RESET Test

How do we know if we have the wrong functional form (e.g., we used a straight line when
the data is curved)?

RESET (Regression Equation Specification Error Test):

1. Run the original regression and get the predicted values (Ŷ).
2. Run a second regression adding powers of those predictions (e.g., Ŷ², Ŷ³) as new
independent variables.
3. Test: Use an F-test to see if these new terms are significant.

◦ Null Hypothesis (H0): Model is correctly specified.
◦ Reject H0: You missed something non-linear (you might need logs or quadratics).
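
The three steps can be sketched in plain Python. This is a simplified illustration under stated assumptions: the data are invented (a quadratic relationship with small alternating noise), only Ŷ² is added in Step 2 rather than both Ŷ² and Ŷ³, and the helper names `solve` and `fit_ssr` are hypothetical. Converting the F-statistic to a p-value requires an F distribution table or a statistics package and is not shown.

```python
def solve(a, b):
    """Solve a @ coef = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    coef = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * coef[c] for c in range(r + 1, n))
        coef[r] = (m[r][n] - tail) / m[r][r]
    return coef


def fit_ssr(columns, y):
    """OLS of y on the given columns plus an intercept; returns (fitted values, SSR)."""
    n = len(y)
    design = [[1.0] + [col[i] for col in columns] for i in range(n)]
    k = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    xty = [sum(d[i] * yi for d, yi in zip(design, y)) for i in range(k)]
    coef = solve(xtx, xty)
    fitted = [sum(c * v for c, v in zip(coef, row)) for row in design]
    ssr = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    return fitted, ssr


# Hypothetical curved data: y = x^2 plus small alternating noise.
x = [float(i) for i in range(10)]
noise = [0.5 if i % 2 == 0 else -0.5 for i in range(10)]
y = [xi ** 2 + e for xi, e in zip(x, noise)]

# Step 1: original (linear) regression and its predicted values.
y_hat, ssr_restricted = fit_ssr([x], y)

# Step 2: add the squared predictions as a new regressor.
y_hat_sq = [v ** 2 for v in y_hat]
_, ssr_unrestricted = fit_ssr([x, y_hat_sq], y)

# Step 3: F-test for the one added term.
q = 1                  # number of added terms
df_resid = len(y) - 3  # n minus the three coefficients in the augmented model
f_stat = ((ssr_restricted - ssr_unrestricted) / q) / (ssr_unrestricted / df_resid)

print("SSR, linear model:", ssr_restricted)
print("SSR, augmented model:", ssr_unrestricted)
print("RESET F-statistic:", f_stat)
```

Because the true relationship here is curved, the added term removes almost all of the residual variation and the F-statistic is very large, so H0 (correct specification) would be rejected.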
4. Hypothesis Testing for Selection (F-Test)

When choosing between a short model (Restricted) and a long model (Unrestricted), we
use the F-Test for Joint Significance.
F = [(SSRrestricted − SSRunrestricted)/q] / [SSRunrestricted/(n − k − 1)]
Here k is the number of independent variables in the unrestricted model.
• Logic: Does the Sum of Squared Residuals (error) drop enough to justify adding the
group of q new variables?
• If F is high (P-value low), the extra variables are jointly significant. Keep them.
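
A minimal sketch of the calculation, using invented inputs (SSR values of 120 and 100, q = 2 added variables, n = 54 observations, k = 3 regressors in the unrestricted model). The function name `f_restricted` is hypothetical.

```python
def f_restricted(ssr_r, ssr_u, q, n, k):
    """F-statistic for q restrictions; k counts regressors in the unrestricted model."""
    return ((ssr_r - ssr_u) / q) / (ssr_u / (n - k - 1))


# Hypothetical inputs, not taken from any problem in this chapter.
ssr_r = 120.0  # SSR of the short (restricted) model
ssr_u = 100.0  # SSR of the long (unrestricted) model
q, n, k = 2, 54, 3

f_stat = f_restricted(ssr_r, ssr_u, q, n, k)
print("F-statistic:", f_stat)
```

The result is then compared with the critical value of the F distribution with q and n − k − 1 degrees of freedom.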
5. Examples (Economics Focus)

Passage 1: The wage gap study


A researcher investigates the gender wage gap.

• Regression 1: Wage = β0 + β1(Female) + u
◦ Result: β̂1 = −5.00 (Females earn $5 less/hour).
• Regression 2: Wage = β0 + β1(Female) + β2(Occupation) + u
◦ Result: β̂1 = −2.00 (Females earn $2 less/hour).

Question 1 Which of the following statements best explains the change in the coefficient
for the Female variable from -5.00 to -2.00?
A) Regression 2 suffers from multicollinearity, making the estimate unreliable.
B) In Regression 1, the Female variable was biased downwards because it suffered from
Omitted Variable Bias regarding Occupation.
C) Occupation is an irrelevant variable and should be removed to restore the efficiency of
the model.

D) The Adjusted R² of Regression 2 is definitely lower than Regression 1.



Passage 2: The Marketing Director's Dilemma

A marketing director is building a model to predict sales. She starts with Price (X1) and
Advertising (X2). She considers adding a third variable: "CEO's Golf Handicap" (X3), which
she knows theoretically has absolutely zero impact on customer behavior.

Question 2 If she includes X3 in the regression model, what will be the statistical
consequence?
A) The coefficient for Price (β̂1) will become biased.
B) The R² of the model will decrease.
C) The Standard Errors of β̂1 and β̂2 will likely increase, reducing the t-statistics.
D) The model will fail the Ramsey RESET test.

Passage 3: Interpreting the RESET


An economics student estimates a production function:
Output = β0 + β1(Labor) + β2(Capital).
He suspects the relationship might actually be Cobb-Douglas (multiplicative/curved) rather
than linear. He runs a Ramsey RESET test and obtains a P-value of 0.01.

Question 3 Based on the P-value of 0.01, what should the student conclude?
A) Fail to reject the Null; the linear model is correctly specified.
B) Reject the Null; the linear specification is likely incorrect and functional forms like logs or
squares should be investigated.
C) Reject the Null; the model suffers from heteroskedasticity.
D) The model is suffering from perfect multicollinearity.
More Problems

Problem 1: Omitted Variable Bias and Coefficient Stability


A labor economist is investigating the return to education. She first estimates a "Short
Model" regressing log_wage on education. She then estimates a "Long Model" that
adds ability (a test score measure) to check for specification bias.

a. Compare the coefficient on education in Model (1) and Model (2). By how much did it
change?
b. Based on the change in coefficients, was the "Short Model" suffering from positive or
negative bias?
c. What two conditions must be true about the ability variable for this bias to exist in the
Short Model?
d. Which model is preferred for estimating the causal effect of education on wages? Explain
briefly using the statistical significance of the added variable.

Problem 2: Functional Form – Testing for Non-Linearity


A researcher models the relationship between corn yield (bushels/acre) and nitrogen
fertilizer (lbs/acre). They suspect diminishing marginal returns and fit a quadratic model.

a. Write the estimated regression equation.


b. Is the quadratic term nitrogen_sq statistically significant at the 5% level? What does
this imply about the functional form specification?
c. Calculate the level of nitrogen where yield is maximized (the turning point).
d. If the researcher had only estimated a linear model (yield = b0 + b1*nitrogen),
would that model be considered correctly specified? Explain.

Problem 3: Joint Hypothesis Testing (F-Test)


An analyst is building a model to predict house prices. They start with square footage
(sqft) and then consider adding neighborhood dummy variables (d_north, d_south,
d_east; West is the base group). They run the Unrestricted Model below.

a. Are any of the location dummy variables individually significant at the 5% level?
b. Suppose you run a Restricted Model excluding all location dummies (regressing price
only on sqft) and find the Restricted SSR (SSRR) is 79,200,000. The Unrestricted SSR
(SSRU) from the table above is 78,400,000.
Calculate the F-statistic for the joint significance of the neighborhood effects.

c. Based on your F-statistic (assume Critical F ≈ 2.65), should you include the
neighborhood dummies in your final model specification?

Problem 4: Adjusted R-Squared and Model Penalties


You are selecting between two models for predicting student test scores.
• Model A: score = b0 + b1*study_time

• Model B: score = b0 + b1*study_time + b2*height + b3*shoe_size


You run the regressions in Stata and obtain the following summary statistics:

a. Calculate the standard R² for Model A and Model B.

b. Calculate the Adjusted R² (R̄²) for Model A and Model B.

c. Which model is preferred based on Adjusted R²? Explain why this metric is better than
standard R² for this comparison.

Problem 5: Interaction Terms and Slope Specification


A researcher estimates the effect of experience on wages, hypothesizing that the return to
experience is different for men and women.
female = 1 if female, 0 if male.
fem_exper = female × exper.

a. What is the estimated return to an additional year of experience for Men (female=0)?
b. What is the estimated return to an additional year of experience for Women
(female=1)?
c. Is the difference in the return to experience between men and women statistically
significant at the 5% level? Which variable tells you this?
d. If the researcher removed the interaction term fem_exper, would they be accurately
modeling the wage dynamics? Explain.