Omitted Variable Bias in Regression Analysis

The document discusses omitted variable bias (OVB) in simple linear regression, explaining that bias occurs when an omitted variable is correlated with the regressor and affects the dependent variable. It uses the example of English language ability as an omitted variable that impacts test scores and class size, demonstrating how ignoring such factors can lead to biased estimates. The document concludes with empirical results from a regression analysis that confirms the presence of OVB in the context of test scores and class size.

Omitted Variable Bias

(SW Section 6.1)


In Simple (Linear) Regression, we posit the population relation
between Y and X as:
$Y_i = \beta_0 + \beta_1 X_i + u_i$
where u denotes the error term. The error u arises because of other
factors (or variables) that influence Y but are not included in the
regression function.
Fact 1. There are always omitted variables; they hide in u.
Fact 2. Sometimes, the omission of those variables can lead to
bias in the OLS estimator.
Question: When does bias occur? Answer: When the assumption (LS A1)
that justifies OLS estimation is violated.

Definition: The bias in the OLS estimator that occurs as a result
of an omitted factor, or variable, is called omitted variable bias.

Let Z denote an omitted variable.


Claim: For omitted variable bias (OVB) to occur, the omitted
variable “Z” must satisfy both of the following conditions:
(1) Z is correlated with the regressor X (i.e., Cov(Z, X) ≠ 0),
(2) Z is a determinant of Y (i.e., Z is part of u).

We will prove this claim. First, let’s return to our primary
example and think again.

Question: Does reducing class size (student to teacher ratio)
improve student achievement (test scores)?
Many factors have been left out. Consider Z1 = English language
ability. We can measure it in various ways. For simplicity we take
$Z_1 = \begin{cases} 1 & \text{if English is the student's second language,} \\ 0 & \text{otherwise.} \end{cases}$
(1) Immigrant communities tend to be less affluent. As a result
they have smaller school budgets and higher student-teacher
ratios (STR): Z1 is correlated with X.
(2) English language ability (whether the student learned English
as a second language) is likely to affect test scores: Z1 is a
determinant of Y.
Since both conditions are met, we expect 𝛽̂1 in SR to be biased.
For omitted variable bias (OVB) to occur, the omitted variable “Z”
must satisfy both conditions.
Question: Do all omitted factors satisfy both conditions?
• Consider Z2 = Time of day of the test.
Is Z2 correlated with X?
Is Z2 a determinant of Y?

• Consider Z3 = Parking lot space per pupil.


Is Z3 correlated with X?
Is Z3 a determinant of Y?

What is the direction of the omitted variable bias?
Common sense suggests an answer. Think about the example of Z1
= English language (in)ability. Common sense says that omitting it
leads to an overstatement of the expected negative class size
effect (see the next two slides).

What is the magnitude of the omitted variable bias?
Common sense will not help here, but there is a formula.

We will obtain an expression for the asymptotic bias in the OLS
estimator of the slope, relying on the algebra we used in SR.
(For an alternative approach, see S&W Appendix 6.1, which builds
on their Appendix 4.3.)
Test Score and Class Size problem
Instead of Z (= 1 if English Language student, = 0 else) we have
fraction (percentage) of students in the district who are English
Learners (PctEL).
Logic: Districts that have higher PctEL are expected to:
(1) Have larger class sizes, and
(2) Do worse on standardized tests

Claim about the direction of bias: Ignoring the effect of “having
many English learners” in the class (omitting PctEL) would result
in a more negative slope (overstatement of the expected negative
class size effect).
Is this actually going on in the California data?

Verification of 1: Districts with higher PctEL have bigger classes.


(compare the distribution of small vs. large classes for each PctEL)

Verification of 2: Districts with higher PctEL have lower test scores.


(compare the test scores by PctEL within each class size)
Verification of the claim about the direction of the bias:
Among districts with comparable PctEL, the effect of class size is
smaller than the overall “test score gap” = 7.4.
In the Simple Regression model, STR gets credit for the negative
effect attributable to the presence of English learners in the class.
Thus, ignoring the effect of “having many English learners” in the
class (omitting PctEL) would result in a more negative slope in
the Simple Regression.

How about the magnitude of the bias?

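The OLS estimator of the slope is a linear function of Y:

$$\hat\beta_1 = \frac{\sum_{i=1}^{n} (X_i - \bar X)(Y_i - \bar Y)}{\sum_{i=1}^{n} (X_i - \bar X)^2} = \frac{\sum_i x_i^* y_i^*}{\sum_i (x_i^*)^2} = \frac{\sum_i x_i^* Y_i}{\sum_i (x_i^*)^2} = \sum_i c_i Y_i \qquad (*)$$

where $x_i^* = X_i - \bar X$, $y_i^* = Y_i - \bar Y$, and $c_i = x_i^* / \sum_j (x_j^*)^2$, a function of
X alone. Recall that the $c_i$'s satisfy the following conditions:

$$\sum_i c_i = 0, \quad \sum_i c_i x_i^* = 1, \quad \sum_i c_i^2 = \frac{1}{\sum_i (x_i^*)^2}, \quad \sum_i c_i X_i = 1.$$

Population relation: $Y_i = \beta_0 + \beta_1 X_i + u_i$. Plug this into (*):

$$\hat\beta_1 = \sum_i c_i Y_i = \sum_i c_i (\beta_0 + \beta_1 X_i + u_i) = \beta_0 \underbrace{\sum_i c_i}_{0} + \beta_1 \underbrace{\sum_i c_i X_i}_{1} + \underbrace{\sum_i c_i u_i}_{\text{linear function of } u}$$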
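The properties of the weights $c_i$ can be checked numerically. The following is a minimal sketch on a small made-up sample (the X and Y values are assumptions chosen only for illustration, not the California data); it computes the slope both as the usual ratio and as the weighted sum of the Y values.

```python
# Hypothetical sample, chosen only for illustration (not from the slides).
X = [15.0, 17.5, 19.0, 20.5, 22.0, 24.0]
Y = [680.0, 672.0, 660.0, 655.0, 650.0, 641.0]

n = len(X)
x_bar = sum(X) / n
y_bar = sum(Y) / n

# Deviations from the mean: x*_i = X_i - X-bar
x_star = [x - x_bar for x in X]
sxx = sum(d * d for d in x_star)

# OLS weights c_i = x*_i / sum_j (x*_j)^2, a function of X alone
c = [d / sxx for d in x_star]

# Slope computed two ways: the ratio form and the linear-in-Y form
slope_ratio = sum(d * (y - y_bar) for d, y in zip(x_star, Y)) / sxx
slope_linear = sum(ci * y for ci, y in zip(c, Y))

# The four properties of the weights
sum_c = sum(c)                                          # should be 0
sum_c_xstar = sum(ci * d for ci, d in zip(c, x_star))   # should be 1
sum_c_sq = sum(ci * ci for ci in c)                     # should be 1 / sxx
sum_c_X = sum(ci * x for ci, x in zip(c, X))            # should be 1

print(slope_ratio, slope_linear)
print(sum_c, sum_c_xstar, sum_c_sq, 1 / sxx, sum_c_X)
```

Because the weights depend on X alone, the same vector c applies to any outcome variable measured on the same sample.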

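Rewrite as:

$$\hat\beta_1 - \beta_1 = \sum_i c_i u_i.$$

Examine the discrepancy further:

$$\sum_i c_i u_i = \sum_i \frac{x_i^* u_i}{\sum_j (x_j^*)^2} = \frac{\sum_i x_i^* u_i}{\sum_i (x_i^*)^2} = \frac{\frac{1}{n}\sum_i x_i^* u_i}{\frac{1}{n}\sum_i (x_i^*)^2}.$$

Now:

$$\frac{1}{n}\sum_i (x_i^*)^2 = \frac{1}{n}\sum_i (X_i - \bar X)^2 \cong s_X^2,$$

$$\frac{1}{n}\sum_i x_i^* u_i = \frac{1}{n}\sum_i (X_i - \bar X) u_i \cong s_{Xu},$$

where $s_X^2$ = sample variance of X and
$s_{Xu}$ = sample covariance between X and u.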

Under LS A2-3: As n → ∞, $s_X^2 \to Var(X)$ and $s_{Xu} \to Cov(X, u)$ in probability.

Thus,

$$\hat\beta_1 - \beta_1 \xrightarrow{P} \frac{Cov(X, u)}{Var(X)}. \qquad (**)$$

LS A1: E(u | X) = 0 implies that Cov(X, u) is zero (because
mean-independence implies zero covariance). In this case the OLS
estimator $\hat\beta_1$ is unbiased and consistent for $\beta_1$.
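The step from mean-independence to zero covariance can be spelled out with the law of iterated expectations. This is the standard argument, sketched here for completeness:

```latex
\begin{aligned}
E(u)       &= E\big[E(u \mid X)\big] = E[0] = 0, \\
E(Xu)      &= E\big[X \, E(u \mid X)\big] = E[X \cdot 0] = 0, \\
Cov(X, u)  &= E(Xu) - E(X)\,E(u) = 0 - E(X) \cdot 0 = 0 .
\end{aligned}
```

The converse does not hold in general: zero covariance rules out a linear relation between X and u, not every form of dependence.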

Since “Z is a determinant of Y,” we may augment the population
regression equation as:
(Multiple: MR) $Y_i = \beta_0 + \beta_1 X_i + \beta_2 Z_i + v_i$
where $v_i$ denotes the new error term. Earlier we used simple
regression (SR) and took the population regression to be
(Simple: SR) $Y_i = \beta_0 + \beta_1 X_i + u_i$.
The relation between (MR) and (SR) is given by $u_i = \beta_2 Z_i + v_i$.
Suppose E(v | X, Z) = 0. Then Cov(X, v) = 0. In this case the
large-sample limit of $\hat\beta_1 - \beta_1$, namely Cov(X, u)/Var(X) in (**), becomes:

$$\frac{Cov(X, u)}{Var(X)} = \frac{Cov(X, \beta_2 Z + v)}{Var(X)} = \frac{\beta_2 Cov(X, Z) + Cov(X, v)}{Var(X)} = \beta_2 \times \frac{Cov(X, Z)}{Var(X)}.$$
Thus, the consequence of using (SR) and running the LS
regression of Y on X when (MR) is the correct model is:

$$\hat\beta_{1,simple} - \beta_1 \xrightarrow{P} \beta_2 \frac{Cov(X, Z)}{Var(X)} \qquad (***)$$

where $\hat\beta_{1,simple}$ is the LS estimator of $\beta_1$ in the regression of
Y on X.

With the help of (***), we can determine both the direction and
the magnitude of the “large sample” bias (inconsistency).
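The large-sample bias formula, $\beta_2 \, Cov(X, Z)/Var(X)$, can be illustrated by simulation. The sketch below draws synthetic data from a population in which (MR) holds; all parameter values and distributions are assumptions made for this example. It then compares the bias of the simple-regression slope with the value the formula predicts.

```python
import random

random.seed(0)

# Hypothetical population parameters (assumptions for this sketch)
beta0, beta1, beta2 = 700.0, -1.0, -0.5
n = 50_000

X, Z, Y = [], [], []
for _ in range(n):
    x = random.gauss(20.0, 2.0)
    # Z is built to be correlated with X: Cov(X, Z) = 1.5 * Var(X)
    z = 1.5 * x + random.gauss(0.0, 5.0)
    # v is drawn independently of X and Z, so E(v | X, Z) = 0
    v = random.gauss(0.0, 10.0)
    X.append(x)
    Z.append(z)
    Y.append(beta0 + beta1 * x + beta2 * z + v)

def mean(a):
    return sum(a) / len(a)

def cov(a, b):
    """Sample covariance of two equal-length lists."""
    ma, mb = mean(a), mean(b)
    return sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b)) / (len(a) - 1)

# Simple regression of Y on X (Z omitted)
slope_simple = cov(X, Y) / cov(X, X)

observed_bias = slope_simple - beta1
predicted_bias = beta2 * cov(X, Z) / cov(X, X)

print(observed_bias, predicted_bias)
```

With these assumed values the population bias is beta2 × 1.5 = −0.75, so the simple regression should report a slope near −1.75 rather than −1.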

Recapitulation: We argued that two conditions have to be met
for the OLS estimator to suffer from omitted variable bias:
OVB condition 1 → 𝐶𝑜𝑣(𝑋, 𝑍) ≠ 0.
OVB condition 2 → 𝛽2 ≠ 0.

If both are met, then the SR slope $\hat\beta_{1,simple}$ estimated by OLS is
not consistent for $\beta_1$.

The “true” population slope of X will be estimated from the
multiple regression (i.e., the regression of Y on X and Z). The LS
estimator of $\beta_1$ (i.e., the slope of X) in that multiple regression will
be called $\hat\beta_{1,multiple}$.

Return to test score-class size problem:
Logic: Districts that have higher PctEL are expected to:
(1) Have larger class sizes (𝐶𝑜𝑣(𝑆𝑇𝑅, 𝑃𝑐𝑡𝐸𝐿) > 0), and
(2) Do worse on standardized tests (𝛽2 < 0).
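Plug these into the bias formula (***), $\hat\beta_{1,simple} - \beta_1 \xrightarrow{P} \beta_2 \, Cov(X, Z)/Var(X)$:
the limit is negative, so $\hat\beta_{1,simple} - \beta_1 < 0$ in large samples (negative bias).

→ Ignoring the effect of “having many English learners” in the
class (omitting PctEL) would result in a more negative slope
(overstatement of the expected negative class size effect).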
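The sign reasoning generalizes to any omitted variable: since Var(X) > 0, the sign of the large-sample bias is the sign of the product of $\beta_2$ and Cov(X, Z). A minimal sketch follows; the function name `ovb_direction` and the placeholder magnitudes are hypothetical, and only the signs matter.

```python
def ovb_direction(beta2: float, cov_xz: float) -> str:
    """Sign of the large-sample bias beta2 * Cov(X, Z) / Var(X), given Var(X) > 0."""
    product = beta2 * cov_xz
    if product > 0:
        return "positive bias"
    if product < 0:
        return "negative bias"
    return "no bias"

# Class-size example: Cov(STR, PctEL) > 0 and beta2 < 0
# (the magnitudes 1.0 and -1.0 are placeholders, not estimates)
class_size_case = ovb_direction(beta2=-1.0, cov_xz=1.0)
print(class_size_case)
```

If either condition fails (beta2 = 0 or Cov(X, Z) = 0) the function reports no bias, which matches the two OVB conditions.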

Magnitude of bias:
We need estimates of: (1) $\gamma = Cov(X, Z)/Var(X)$, and (2) $\beta_2$.
Can we find them using regression? (1) Yes; (2) Yes.

(1) To estimate γ: We regress PctEL on STR.
(Auxiliary) $\widehat{PctEL} = 19.3 + 1.81\,STR$
OVB condition 1 is met: $\hat\gamma \neq 0 \leftrightarrow \widehat{Cov}(X, Z) \neq 0$.

(2) To estimate $\beta_2$: We regress TestScore on STR and PctEL!
(Multiple) $\widehat{TestScore} = 686.0 - 1.10\,STR - 0.65\,PctEL$
OVB condition 2 is met: $\hat\beta_2 \neq 0$.
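In any given sample, the three OLS fits (auxiliary, multiple, simple) are linked exactly: the simple slope equals the multiple-regression slope plus $\hat\beta_2 \hat\gamma$. The sketch below checks this on synthetic data; the data-generating numbers are assumptions, and the two-regressor slopes come from the textbook normal-equation formulas rather than from any statistics package.

```python
import random

random.seed(1)

# Synthetic sample; all numbers here are illustrative assumptions.
n = 200
X = [random.gauss(20.0, 2.0) for _ in range(n)]
Z = [2.0 * x - 25.0 + random.gauss(0.0, 8.0) for x in X]
Y = [690.0 - 1.0 * x - 0.6 * z + random.gauss(0.0, 10.0) for x, z in zip(X, Z)]

def demean(a):
    m = sum(a) / len(a)
    return [ai - m for ai in a]

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

x, z, y = demean(X), demean(Z), demean(Y)
sxx, szz, sxz = dot(x, x), dot(z, z), dot(x, z)
sxy, szy = dot(x, y), dot(z, y)

# Auxiliary regression of Z on X: slope gamma-hat
gamma_hat = sxz / sxx

# Simple regression of Y on X
slope_simple = sxy / sxx

# Multiple regression of Y on X and Z (two-regressor normal equations)
det = sxx * szz - sxz ** 2
slope_multiple = (sxy * szz - szy * sxz) / det
beta2_hat = (szy * sxx - sxy * sxz) / det

gap = slope_simple - slope_multiple
print(gap, beta2_hat * gamma_hat)
```

The two printed numbers agree up to floating-point error, whatever the sample, which is why the reported coefficients can be used to back out the bias.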
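Empirical implementation:
We regressed TestScore on STR and got:
(Simple) $\widehat{TestScore} = 698.9 - 2.28\,STR$

Question: What is the bias in the SR slope estimate?
We can use the bias formula (***), with the multiple-regression slope standing in for $\beta_1$.
Verify that: −2.28 − (−1.10) = −1.18 ≈ −0.65 × 1.81 (equal up to rounding of the reported coefficients).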
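The verification is simple arithmetic on the coefficients reported in the slides. A short check, using the rounded values as given (the variable names are ours):

```python
# Rounded coefficients as reported in the slides
slope_simple = -2.28    # STR coefficient, simple regression
slope_multiple = -1.10  # STR coefficient, multiple regression
beta2_hat = -0.65       # PctEL coefficient, multiple regression
gamma_hat = 1.81        # STR coefficient, auxiliary regression of PctEL on STR

lhs = slope_simple - slope_multiple   # estimated bias in the SR slope
rhs = beta2_hat * gamma_hat           # bias implied by the OVB formula

print(round(lhs, 4), round(rhs, 4))
```

The small difference between the two sides comes only from rounding the coefficients to two or three significant digits.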

Multiple Regression using Stata:

regress testscr str pctel, robust

Regression with robust standard errors                 Number of obs =     420
                                                       F(  2,   417) =  223.82
                                                       Prob > F      =  0.0000
                                                       R-squared     =  0.4264
                                                       Root MSE      =  14.464
------------------------------------------------------------------------------
             |               Robust
     testscr |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         str |  -1.101296   .4328472    -2.54   0.011     -1.95213   -.2504616
       pctel |  -.6497768   .0310318   -20.94   0.000     -.710775   -.5887786
       _cons |   686.0322   8.728224    78.60   0.000     668.8754     703.189
------------------------------------------------------------------------------

Recapitulation: Omitting variables from a regression might
cause a bias if the omitted variable satisfies the two conditions:
(i) being correlated with other independent variables, and
(ii) being a determinant of the dependent variable.

A solution to deal with OVB is to apply Multiple Regression
(MR). Before studying MR in detail, let’s pose and answer the
following question.

Question: Are there any other solutions to the OVB issue?

Ideal Randomized Controlled Experiment
Ideal: subjects all follow the treatment protocol – perfect
compliance, no errors in reporting, etc.!
Randomized: subjects from the population of interest are
randomly assigned to a treatment or control group (so there are no
confounding factors)
Controlled: having a control group permits measuring the
differential effect of the treatment
Experiment: treatment status is assigned as part of the
experiment. The subjects have no choice, so there is no “reverse
causality” in which subjects choose the treatment they think will
work best.

Back to class size:
Imagine an ideal randomized controlled experiment for measuring
the effect on Test Score of reducing STR.
• In that experiment, students would be randomly assigned to
classes, which would have different sizes.
• Because they are randomly assigned, all student characteristics
(and thus ui) would be distributed independently of STRi.
• Thus, E(ui|STRi) = 0. That is, LS A1 holds in a randomized
controlled experiment.
Is this how class size data were collected? No.
Observational data often differ from this ideal! Why? Because
treatment status is not randomly assigned.
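The contrast between randomized and observational data can be sketched by simulation. In the code below, the same outcome equation is used in two worlds: one where class size drifts upward with the omitted variable Z, and one where class size is assigned at random. All parameter values and distributions are assumptions made for illustration (Z is a stylized stand-in for PctEL, not a realistic percentage).

```python
import random

random.seed(2)

beta0, beta1, beta2 = 700.0, -1.0, -0.5   # hypothetical population values
n = 50_000

def ols_slope(x, y):
    """Slope from the simple regression of y on x."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx

def outcome(x, z):
    # Same population relation in both worlds; v is independent noise
    return beta0 + beta1 * x + beta2 * z + random.gauss(0.0, 10.0)

Z = [random.gauss(15.0, 10.0) for _ in range(n)]

# Observational world: class size tends to be larger where Z is larger
X_obs = [18.0 + 0.1 * z + random.gauss(0.0, 1.5) for z in Z]
Y_obs = [outcome(x, z) for x, z in zip(X_obs, Z)]

# Experimental world: class size is assigned at random, ignoring Z
X_rct = [random.choice([15.0, 20.0, 25.0]) for _ in Z]
Y_rct = [outcome(x, z) for x, z in zip(X_rct, Z)]

slope_obs = ols_slope(X_obs, Y_obs)
slope_rct = ols_slope(X_rct, Y_rct)

print(slope_obs, slope_rct)
```

Z still affects the outcome in the experimental world, but because it is unrelated to the assigned class size, the simple regression recovers a slope close to beta1.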
What we know about California school data:
Consider PctEL (percent English learners) in the district. It
plausibly satisfies the two criteria for omitted variable bias.
Thus, the “control” and “treatment” groups differ in a systematic
way, so corr(STR, PctEL) ≠ 0.

Idea: (based on our examination of Table 6.1) We can eliminate
the influence of systematic differences in PctEL between the large
(control) and small (treatment) groups by examining the effect of
class size among districts with the same PctEL.
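The idea of comparing class sizes among districts with the same PctEL can be sketched with a stratified comparison. The data below are synthetic: a binary high/low PctEL indicator, a binary large/small class indicator, and an assumed class-size effect of −2 points and PctEL effect of −20 points. None of these numbers come from Table 6.1.

```python
import random

random.seed(3)

n = 40_000
rows = []  # each row: (high_el, large_class, score)
for _ in range(n):
    high_el = random.random() < 0.5
    # Large classes are more common where PctEL is high (assumed 70% vs 30%)
    large = random.random() < (0.7 if high_el else 0.3)
    score = 660.0 - 2.0 * large - 20.0 * high_el + random.gauss(0.0, 10.0)
    rows.append((high_el, large, score))

def mean_score(select):
    """Average score over the rows for which select(high_el, large) is true."""
    picked = [s for h, big, s in rows if select(h, big)]
    return sum(picked) / len(picked)

# Overall gap: large minus small classes, ignoring PctEL
overall_gap = mean_score(lambda h, big: big) - mean_score(lambda h, big: not big)

# Gap within each PctEL group: large minus small, holding the group fixed
gap_within = {
    h0: mean_score(lambda h, big, h0=h0: h == h0 and big)
        - mean_score(lambda h, big, h0=h0: h == h0 and not big)
    for h0 in (False, True)
}

print(overall_gap, gap_within)
```

Under these assumptions the overall gap is roughly −10 points, while the gap within each PctEL group is close to the assumed class-size effect of −2: the pooled comparison credits class size with part of the PctEL effect.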

How this idea helps us:
If the only systematic difference between the large and small class
size groups is in PctEL, then we would have something similar to
the randomized controlled experiment: within each PctEL group,
assignment to treatment would be random.
This is what Multiple Regression (MR) achieves! Consider
$Y_i = \beta_0 + \beta_1 X_i + \beta_2 Z_i + v_i$.
Provided E(v | X, Z) = 0,
$E(Y \mid X, Z) = \beta_0 + \beta_1 X + \beta_2 Z$.
Math: $\beta_1 = \partial E(Y \mid X, Z) / \partial X$ is the change in E(Y | X, Z) per unit
change in X when Z is held constant. We often say “$\beta_1$ is the
expected change in Y” (for a one-unit change in X) when Z is
held constant.
Summary: Two ways to overcome omitted variable bias:
1. Run a randomized controlled experiment in which treatment
(STR) is randomly assigned. In this case PctEL is still a
determinant of TestScore, but PctEL is uncorrelated with STR.
(This solution to OV bias is rarely feasible.)
2. Use a regression in which the omitted variable (PctEL) is no
longer omitted: include PctEL as an additional regressor in a
multiple regression.
