
Understanding Instrumental Variables in Regression


🧩 The core issue: "Endogeneity" — when X and error are friends

In regression, we want to estimate how an explanatory variable (say, education) affects an outcome (say, income).
We usually assume that:

X (education) is not correlated with the error term (all other unobserved factors).

If this assumption breaks, X is endogenous.

That means the variable we think explains Y (education) is tangled up with the error term: either Y also feeds back into X, or X is linked to missing factors that also affect Y.

Example:

●​ More motivated people both study more and earn more → motivation is missing
from your regression (omitted variable).​

●​ So your "education" variable picks up not just schooling, but also motivation →
bias!​

💡 What you’ve already seen (the earlier chapters recap)


1.​ Omitted variable bias (OVB):​

○​ If you leave out an important variable that’s correlated with X, OLS gives
you biased estimates.​

2.​ Proxy variables:​

○ Sometimes we can fix OVB by including a proxy (a substitute variable) for the missing factor.

○​ But proxies aren’t always available or perfect.​

3.​ Panel data tricks (fixed effects, first differencing):​


○​ With data over time for each person/firm, you can remove time-constant
unobserved stuff (like talent, personality, etc.).​

○​ But these methods fail if:​

■​ You don’t have panel data.​

■​ The variable of interest doesn’t change over time (like gender).​

■ The unobserved factor changes with time (time-varying omitted variables).

🚨 So now what?
If all the above tricks fail, we need a new weapon: Instrumental Variables (IV).

⚙️ What IV does (intuitively)


Suppose education (X) is endogenous because it’s correlated with motivation
(unobserved).​
We need a variable, say distance to nearest college (Z), that:

1.​ Affects education (people closer to college study more) → relevance​

2.​ Is not related to motivation or other hidden stuff that affects income → exogeneity​

Then, Z can serve as a tool (instrument) to isolate the part of education that’s unrelated to
motivation — the “clean” variation.​
That clean part can then show the true effect of education on income.

This is what IV regression does.

🪜 Two Stage Least Squares (2SLS) — the practical way to do IV


It’s called “two-stage” because it happens in two steps:

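1. Stage 1: Regress X on Z and keep the predicted values. This gives the "clean" version of X (the part of X that comes from Z).

2. Stage 2: Regress Y on that predicted X. Because the predicted X moves only with Z, the bias caused by endogeneity is removed (in large samples, provided Z is a valid instrument).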
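The two stages can be sketched on simulated data. This is only an illustration under made-up assumptions: the data-generating process, the coefficient values, and the helper `ols_fit` are all hypothetical and not from the text.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Hypothetical data-generating process (all numbers are assumptions).
ability = rng.normal(size=n)            # unobserved confounder
z = rng.normal(size=n)                  # instrument: moves x, unrelated to ability
x = z + ability + rng.normal(size=n)    # endogenous regressor
y = 1.0 + 2.0 * x + 3.0 * ability + rng.normal(size=n)  # true slope on x is 2


def ols_fit(regressor, outcome):
    """Return (intercept, slope) from a simple OLS regression."""
    slope = np.cov(regressor, outcome)[0, 1] / np.var(regressor, ddof=1)
    intercept = outcome.mean() - slope * regressor.mean()
    return intercept, slope


# Plain OLS of y on x: the slope also absorbs the effect of ability.
_, beta_ols = ols_fit(x, y)

# Stage 1: regress x on z and keep the fitted values.
pi0_hat, pi1_hat = ols_fit(z, x)
x_hat = pi0_hat + pi1_hat * z

# Stage 2: regress y on the fitted values from stage 1.
_, beta_2sls = ols_fit(x_hat, y)

print(f"OLS slope:  {beta_ols:.3f}")
print(f"2SLS slope: {beta_2sls:.3f}")
```

One caveat: running stage 2 by hand like this gives the right slope, but the standard errors that an ordinary OLS routine would print for stage 2 are not the correct IV standard errors, so in practice a dedicated IV routine is the safer choice.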

🧩 What this section is saying, in simple terms


They’re starting the instrumental variables chapter by connecting it to everything you
already learned — OLS, omitted variables, and proxies.

It’s basically saying:

“We’ll now study instrumental variables, and we’ll do it in a way that feels
familiar — like how we studied OLS.”

Let’s decode each part.

🧠 “Our treatment of IV closely follows OLS...”


OLS (ordinary least squares) is the normal regression method.​
They’re saying: we’ll explain IV using the same logic and framework as OLS — random
sample, population, equations, assumptions — so it’s easier to follow.

That means:

●​ We’ll think of IV like an upgraded OLS that works even when OLS fails (because of
endogeneity).​

●​ The structure and math will look similar, just with an extra step.​

🕰 “OLS can be applied to time series data... and so can IV.”


This means: IV methods don’t just work with random survey-type data (cross-sections);
they can also work with time series (data over time) and panel data (data across people
and time).​
They’ll talk about special issues with those later (Sections 15.7 and 15.8).

💡 Now comes the motivation — why we need IV


They remind you:​
Omitted variables cause bias when the omitted factor is related to both X and Y.

When that happens, there are three possible reactions:

Option 1: Ignore it

You just estimate with OLS and accept bias.

Sometimes that’s okay if you know the direction of bias.
Example:
You think job training raises wages, but workers with weaker unobserved earnings prospects are the ones more likely to enroll, so OLS understates the effect.
If you know this bias pulls your estimate toward zero, and you still find a positive effect, you can still say:

“Job training helps, and our estimate is probably too low.”

But usually, you can’t be sure which way the bias goes, so this isn’t reliable.

Option 2: Use a proxy variable

You find a substitute for the unobserved thing.​


Example:​
If you can’t observe “ability,” use IQ as a proxy.​
Then you regress:

log(wage) = β₀ + β₁ educ + β₂ IQ + e

That can work, but only if your proxy is strong and correlated with the true missing factor.
Often, you don’t have such a proxy.

Option 3: Use a method that handles the omitted variable inside the error term

This is where Instrumental Variables (IV) comes in.

Instead of pretending the missing factor doesn’t exist or trying to find a proxy, IV says:

“Okay, ability is inside the error term — but we’ll use a clever external variable
to get around that.”

⚙️ Example — unobserved ability in a wage equation


They use a concrete example to show the problem.

Suppose the true model is:

log(wage) = β₀ + β₁ educ + β₂ abil + e

Where:

●​ educ = years of education​

●​ abil = ability (unobserved talent)​

●​ e = error term (everything else unmeasured)​

Now, if you can’t observe ability, you have to lump it into the error term:

log(wage) = β₀ + β₁ educ + u​
where u = β₂ abil + e

🔍 What’s the problem here?


If education (educ) and ability (abil) are correlated — which they almost certainly are
(smart people tend to get more education) — then education is correlated with the error
term u (since u contains ability).
That’s the definition of endogeneity.​
So OLS will give you a biased and inconsistent estimate of β₁ (the effect of education on
wage).

💥 Therefore...
Instrumental Variables (IV) is introduced as a method to handle this situation:

●​ You can’t observe the omitted variable (like ability),​

●​ You can’t find a good proxy,​

●​ But you can find an instrument — a variable that affects education but not wages
directly (like distance to college).​

That’s what the rest of the chapter builds on.

🧩 The core idea


We have a regression:

y = β₀ + β₁ x + u

where

●​ y = outcome (say, wage)​

●​ x = explanatory variable (say, education)​

●​ u = error term (stuff we can’t measure — like ability, motivation, family background,
etc.)​

The problem:
If x and u are correlated, that is, Cov(x, u) ≠ 0, then OLS is biased and inconsistent.


⚙️ The fix: bring in a helper variable — the instrument (z)
To fix that, we find a new variable z that meets two key conditions.

Let’s see both one by one:

✅ 1. Instrument exogeneity
Cov(z, u) = 0

This means:

z is not related to the error term.
It doesn’t share anything with the unobserved stuff (like ability, motivation, etc.).

So z doesn’t have its own hidden effect on y — it only affects y through x.

💬 Example:​
“Distance to the nearest college” is exogenous if it’s unrelated to ability or motivation.​
Where you live doesn’t directly affect your wage, but it affects how much education you
get.

✅ 2. Instrument relevance
Cov(z, x) ≠ 0

This means:

z must be related to x (and, in practice, strongly enough to be useful).
If z has nothing to do with x, it can’t help us “predict” x or separate the clean variation in x.

💬 Example:​
If you live closer to college → you’re likely to get more education.​
So “distance to college” is correlated with education.

🧠 Summary Table

| Property | Math | Meaning | Example |
|---|---|---|---|
| Exogeneity | Cov(z, u) = 0 | z isn’t connected to unobserved stuff in u | Distance to college unrelated to ability |
| Relevance | Cov(z, x) ≠ 0 | z is related to x | Distance to college affects education |

⚠️ The key difference


●​ You can test relevance because you can check if z actually predicts x in data.​

●​ You cannot test exogeneity because u is unobserved — that’s the whole problem!​

So, exogeneity must come from theory, reasoning, or intuition (not statistics).

🔍 How to test relevance in practice


You can estimate this “first-stage” regression:

x = π₀ + π₁z + v

Then check if π₁ ≠ 0.

If the coefficient π₁ is significantly different from zero → your instrument is relevant.

That’s the meaning of the hypothesis:

H₀: π₁ = 0 (instrument has no power → bad instrument)
H₁: π₁ ≠ 0 (instrument explains x → relevant instrument)
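As a rough sketch of this relevance check, the first-stage slope and its t statistic can be computed by hand. The simulated data and the value π₁ = 0.5 are assumptions chosen only for illustration, and the standard error is the usual homoskedastic OLS one.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1_000

# Hypothetical first stage: x = pi_0 + pi_1 * z + v with pi_1 = 0.5 (assumed).
z = rng.normal(size=n)
x = 0.5 * z + rng.normal(size=n)

# OLS estimates of the first-stage intercept and slope.
z_dev = z - z.mean()
pi1_hat = (z_dev @ (x - x.mean())) / (z_dev @ z_dev)
pi0_hat = x.mean() - pi1_hat * z.mean()

# Homoskedastic standard error and t statistic for H0: pi_1 = 0.
resid = x - pi0_hat - pi1_hat * z
sigma2_hat = (resid @ resid) / (n - 2)
se_pi1 = np.sqrt(sigma2_hat / (z_dev @ z_dev))
t_stat = pi1_hat / se_pi1

print(f"pi1_hat = {pi1_hat:.3f}, se = {se_pi1:.3f}, t = {t_stat:.1f}")
```

A large |t| lets you reject H₀: π₁ = 0. It says nothing about exogeneity, which still has to be argued.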

🪄 Putting it all together (intuition)


Let’s connect all the dots.

Without IV:

y = β₀ + β₁ x + u​
but x is “dirty” because it’s mixed with u.

With IV:

Use z to extract the clean part of x: the variation in x that’s uncorrelated with u.

Then, we use that clean x (the part explained by z) to estimate β₁.

That’s what gives us a consistent estimate even when OLS fails.

🧩 Recap before diving in


We’re dealing with the model:

y = β₀ + β₁ x + u

where x (education, for example) is endogenous — it’s correlated with the error u
(unobserved ability, motivation, etc.).

We bring in an instrument z (like mother’s education or distance to college) that must meet:

1. Exogeneity → Cov(z, u) = 0
→ z is unrelated to unobserved stuff (u).

2. Relevance → Cov(z, x) ≠ 0
→ z is related to x.

Now this section goes from theory to practice:

●​ How do we check if z is relevant?​

●​ What are good or bad examples of z?​


●​ How do we finally compute β₁ with IV?​

🧠 Step 1: Testing for relevance


You test relevance by regressing x on z:

x = π₀ + π₁ z + v

and checking whether π₁ ≠ 0.
That’s a t-test.

If you can reject H₀: π₁ = 0 at 5% or 1% significance → z and x are correlated → relevance confirmed.

🪄 Step 2: Good and bad instrument examples


Example 1: Education and Wages

We want to estimate:

log(wage) = β₀ + β₁ educ + u​
where u includes ability.

We need a z that:

●​ is uncorrelated with ability (Cov(z, u)=0)​

●​ but correlated with education (Cov(z, x)≠0)​

❌ Bad instrument: last digit of Social Security Number


●​ Random → ✅ uncorrelated with ability.​
●​ But also random → ❌ not correlated with education.​
So it fails relevance.​
It’s too random to explain education differences.​

❌ Bad instrument: IQ
●​ IQ is strongly correlated with ability, which is in u → ❌ fails exogeneity.​
●​ But it’s correlated with education → ✅ relevant.​
So it fails for the opposite reason.​

Thus:

● Proxy variables (like IQ) = good for omitted variable correction, precisely because they are correlated with the omitted variable

● Instrumental variables (like distance to college) = must be uncorrelated with the omitted variable

⚠️ Tricky instrument: Mother’s education (motheduc)


●​ Positively correlated with child’s education ✅ (relevant)​

●​ But possibly correlated with child’s ability ❌ (because ability can be inherited)​
So exogeneity might fail.​

Economists debate this one — it’s not black and white.

✅ Possible instrument: Number of siblings (sibs)


● More siblings → less parental time/resources → lower education (negative correlation) ✅ relevant.
● Might be unrelated to ability (maybe) ✅ exogenous.
So sibs could work, but you always need to argue why.

Example 2: Skipping classes and exam scores

We want:

score = β₀ + β₁ skipped + u

Here, u might include motivation or ability.​


Students who skip less may just be more motivated.

So skipped is endogenous.

We need a z that:

● affects skipped (relevance)

● is unrelated to u and doesn’t affect score directly (exogeneity, which is where the risk usually lies)

Candidate instrument: distance to campus

●​ Students living farther skip more ✅ relevant.​


●​ But distance might also be related to income or background (which affect score) ❌
possibly not exogenous.​

So again, the logic must be argued carefully.

💬 The big lesson from these examples


A good instrument is rare and precious —​
It must thread the needle between being:

●​ Related enough to x​
●​ But unrelated to the hidden confounders in u​

And you can check correlation (relevance), but you can’t test exogeneity directly: you must defend it with reasoning.

🧮 Step 3: The actual IV formula


We now get to the algebra that shows how IV identifies β₁.

Start with:

y = β₀ + β₁ x + u​
Take the covariance with z on both sides:

Cov(z, y) = β₁ Cov(z, x) + Cov(z, u)

If z is exogenous → Cov(z, u) = 0.
So:

β₁ = Cov(z, y) / Cov(z, x)

That’s the population formula for IV estimation — it’s how β₁ is identified.

If z and x are uncorrelated (Cov(z, x)=0), then the formula breaks → z fails relevance.

🧩 In sample terms
You just replace “Cov” with actual sample covariances (sums):

β̂₁ = Σ(zᵢ − z̄)(yᵢ − ȳ) / Σ(zᵢ − z̄)(xᵢ − x̄)

That’s your IV slope estimate.

And the intercept is just:

β̂₀ = ȳ − β̂₁x̄

(same as OLS, except the slope is computed differently).
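The sample formula translates almost line by line into code. The sketch below uses a hypothetical helper `iv_simple` and simulated data with an assumed true slope of 2. It also checks that the covariance ratio equals the slope of y on z divided by the slope of x on z.

```python
import numpy as np


def iv_simple(z, x, y):
    """Simple IV estimates (intercept, slope) with one instrument z."""
    z_dev = z - z.mean()
    slope = (z_dev @ (y - y.mean())) / (z_dev @ (x - x.mean()))
    intercept = y.mean() - slope * x.mean()
    return intercept, slope


rng = np.random.default_rng(2)
n = 100_000

# Hypothetical data: ability is unobserved and makes x endogenous.
ability = rng.normal(size=n)
z = rng.normal(size=n)
x = z + ability + rng.normal(size=n)
y = 1.0 + 2.0 * x + 3.0 * ability + rng.normal(size=n)

b0_iv, b1_iv = iv_simple(z, x, y)

# Same number, seen as "effect of z on y" over "effect of z on x".
slope_y_on_z = np.cov(z, y)[0, 1] / np.var(z, ddof=1)
slope_x_on_z = np.cov(z, x)[0, 1] / np.var(z, ddof=1)
ratio = slope_y_on_z / slope_x_on_z

print(f"IV intercept = {b0_iv:.3f}, IV slope = {b1_iv:.3f}, ratio = {ratio:.3f}")
```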

📊 Summary Table

| Concept | Formula | Meaning |
|---|---|---|
| Relevance test | x = π₀ + π₁ z + v → test if π₁ ≠ 0 | z affects x |
| Exogeneity | Cov(z, u) = 0 | z unrelated to unobservables |
| IV estimator | β₁ = Cov(z, y) / Cov(z, x) | true slope using “clean” variation in x |
| Sample version | β̂₁ = Σ(zᵢ − z̄)(yᵢ − ȳ) / Σ(zᵢ − z̄)(xᵢ − x̄) | how we estimate it from data |

🧠 Intuitive summary
●​ The instrument z works like a “remote control” for x — it moves x around, but
without touching u.​

● By seeing how y responds to those z-driven movements in x, we recover the true β₁.

● That’s why β₁ = Cov(z, y) / Cov(z, x):
it’s just the “change in y caused by z” divided by the “change in x caused by z.”

🧩 Starting Point: What’s Going On?


We’re still talking about instrumental variables (IV) — a way to fix bias when your
explanatory variable (x) is endogenous (correlated with the error term u).​
OLS (Ordinary Least Squares) doesn’t work properly in that case, so IV gives us a
consistent alternative.

Now, this section explains 3 things:


1.​ What happens if you don’t need IV (x is exogenous)​

2.​ Why IV is consistent (works in large samples)​

3.​ Why IV is less efficient (has bigger variance) than OLS.​

⚙️ When x = z → IV = OLS
Equation (15.9) for the IV slope, with z set equal to x:

β̂₁ = Σ(xᵢ − x̄)(yᵢ − ȳ) / Σ(xᵢ − x̄)²

That’s just the OLS formula.

✅ So: when x is exogenous (no endogeneity problem) and serves as its own instrument, OLS and IV give the same answer.
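A quick numerical check of this point, on made-up data: plugging x in as its own instrument reproduces the OLS slope. The helper `iv_slope` is a hypothetical name, and `np.polyfit` is used only as an independent OLS fit.

```python
import numpy as np


def iv_slope(z, x, y):
    """IV slope: sample Cov(z, y) divided by sample Cov(z, x)."""
    z_dev = z - z.mean()
    return (z_dev @ (y - y.mean())) / (z_dev @ (x - x.mean()))


rng = np.random.default_rng(3)
x = rng.normal(size=500)
y = 1.0 + 0.5 * x + rng.normal(size=500)   # x is exogenous here by construction

ols_slope = np.polyfit(x, y, deg=1)[0]     # coefficient on x from a degree-1 fit
iv_with_x = iv_slope(x, x, y)              # use x as its own instrument

print(ols_slope, iv_with_x)
```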

🧠 Why IV is consistent (works in large samples)


Consistency means: as the sample gets bigger, your estimate gets closer to the true β₁.

If the two key IV conditions hold:

1.​ Relevance: Cov(z, x) ≠ 0​


→ instrument and regressor are correlated​

2.​ Exogeneity: Cov(z, u) = 0​


→ instrument is unrelated to the error​
Then the IV estimator converges to the real β₁ as sample size → ∞.

So even though IV may be biased in small samples, it’s reliable in large ones.

⚠️ IV is usually biased in small samples


Even though it’s consistent (good long run), with small data sets the IV estimate might still
be off (biased).​
That’s why econometricians prefer large samples when using IV.

💡 Important Clarification — “IV is an estimation method, not a model”


They’re just reminding you of terminology:

●​ A model is your equation, like​


wage = β₀ + β₁ education + u​

●​ An estimator is the method you use to estimate β₀ and β₁ —​


e.g. OLS, Weighted LS, IV, etc.​

So saying “I ran an IV model” is technically wrong — you’re estimating your model using IV,
not that IV is the model itself.

📏 Statistical Inference with IV (testing, confidence intervals)


Like with OLS, you can test whether your β₁ is significantly different from zero, but you
need a standard error for the IV estimate.

For that, we assume:

E(u² | z) = σ² = Var(u)

→ meaning the variance of u is the same no matter what z is (homoskedasticity).


🧮 The variance formula (big scary equation explained)
They give the asymptotic variance of the IV slope estimator:

Avar(β̂₁) = σ² / (n σₓ² ρₓ,z²)

Here’s what each piece means:

●​ σ²: variance of the error term (how noisy the model is)​

●​ n: sample size​

●​ σₓ²: variance of x (spread of x values)​

●​ ρₓ,z²: square of the correlation between x and z​

So if z is strongly correlated with x (ρₓ,z close to 1), your estimate is precise (small
variance).​
If z and x are weakly correlated (ρₓ,z small), your estimate is very noisy (big variance).
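To see how much the instrument's strength matters, here is a small arithmetic sketch. It assumes the asymptotic variance has the form σ² / (n σₓ² ρₓ,z²), built from the pieces listed above. The function name and the input numbers are hypothetical.

```python
import math


def iv_asymptotic_se(sigma2, n, var_x, rho_xz):
    """Square root of sigma^2 / (n * var_x * rho_xz^2)."""
    return math.sqrt(sigma2 / (n * var_x * rho_xz ** 2))


# Assumed inputs: error variance 1, sample size 1000, Var(x) = 1.
se_ols = math.sqrt(1.0 / (1000 * 1.0))              # same formula with rho = 1
se_strong = iv_asymptotic_se(1.0, 1000, 1.0, 0.5)   # instrument with rho = 0.5
se_weak = iv_asymptotic_se(1.0, 1000, 1.0, 0.1)     # instrument with rho = 0.1

print(f"OLS-like s.e.: {se_ols:.4f}")
print(f"IV, rho = 0.5: {se_strong:.4f}")
print(f"IV, rho = 0.1: {se_weak:.4f}")
```

Because ρ enters squared inside the variance, the standard error scales with 1/ρ: going from ρ = 0.5 to ρ = 0.1 multiplies it by five.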

📊 Key insight: IV is always less efficient than OLS


When x is exogenous (so OLS is valid), compare the two estimated variances:

OLS: σ̂² / SSTₓ
IV: σ̂² / (SSTₓ · R²ₓ,z)

where SSTₓ is the total sum of squares of x and R²ₓ,z is the R² from regressing x on z.
Since R²ₓ,z ≤ 1, the denominator of IV is smaller → IV variance is bigger.

That means:

● IV is consistent (it converges to the true β₁ in large samples)

● But it’s less precise (higher variance)

● Especially when your instrument is weakly correlated with x.

If R²ₓ,z = 1 (perfect correlation, z = x), then IV = OLS, same variance.

🧩 Summary in Human Words


| Concept | Meaning |
|---|---|
| Endogeneity problem | x is correlated with u → OLS fails |
| IV (Instrumental Variable) | Variable z related to x but unrelated to u |
| When z = x | OLS and IV are identical |
| IV is consistent | Works correctly in large samples (converges to the true β₁) |
| IV is biased in small samples | Especially with weak instruments |
| IV variance > OLS variance | Because instruments are never perfectly correlated with x |
| Rule of thumb | Use IV only when you must (OLS is better if x is exogenous) |

🎯 The Big Picture


We’re estimating the return to education — how much wages rise with each extra year of
schooling.

The model is:

log(wage) = β₀ + β₁ educ + u

●​ β₁ = % increase in wages from one more year of education​

●​ u = unobserved stuff like ability, motivation, family background, etc.​

Problem: education (educ) may be correlated with u (e.g., more able people get more
education and higher wages).​
→ That’s endogeneity → OLS gives biased results.​
→ So we use Instrumental Variables (IV).

🧮 Example 15.1 – Married Women (MROZ data)


Goal: Estimate the return to education for married working women.

Step 1: OLS

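Estimated equation, showing the slope that the interpretation below relies on:

log(wage)^ = β̂₀ + 0.109 educ

Standard errors for the intercept and slope, in that order:
(.185) (.014)

n = 428, R² = 0.118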

Interpretation:

●​ Each extra year of schooling → ~11% higher wage (because 0.109 ≈ 10.9%).​

●​ But: this could be overstated if smarter or more motivated women both get more
education and earn higher wages.​

Step 2: Pick an instrument — father’s education (fatheduc)

We assume:

●​ Father’s education is correlated with daughter’s education (plausible),​

●​ but uncorrelated with her unobserved ability, u (hopeful assumption).​

This is the instrument validity logic:

●​ Cov(fatheduc, educ) ≠ 0 ✅ (relevant)​


●​ Cov(fatheduc, u) = 0 ❓ (exogenous — assumed)​

Step 3: Check instrument relevance

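Regress educ on fatheduc. Standard errors for the intercept and slope:

(.28) (.029), n = 428, R² = .173
→ t = 9.28 on fatheduc (a slope of about 0.269, since t = slope / s.e.) → strong positive relationship.

So fatheduc is a relevant instrument.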

Step 4: Use IV estimation

IV estimate (slope on educ: 0.059), with standard errors for the intercept and slope:

(.446) (.035), n = 428, R² = .093

Interpretation:

●​ IV return = 5.9% per year of education (≈ half the OLS result).​

●​ Suggests OLS was upward biased, likely because of omitted ability bias (i.e., ability
correlated with education inflated the OLS estimate).​

But notice:

●​ The IV standard error (0.035) is ~2.5× the OLS SE (0.014).​


→ IV estimate is much less precise.​

●​ The 95% confidence interval for β₁ (IV) is wide and includes the OLS estimate.​
→ So statistically, we can’t say the two are significantly different yet.​
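The confidence-interval claim can be checked with a few lines of arithmetic on the numbers reported above (IV slope 0.059 with s.e. 0.035, OLS slope 0.109 with s.e. 0.014). The 1.96 critical value is the usual large-sample normal approximation, which is an assumption here.

```python
# Reported estimates from the married-women example.
beta_iv, se_iv = 0.059, 0.035
beta_ols, se_ols = 0.109, 0.014

z_crit = 1.96  # large-sample 95% critical value (normal approximation)

ci_low = beta_iv - z_crit * se_iv
ci_high = beta_iv + z_crit * se_iv
se_ratio = se_iv / se_ols

print(f"95% CI for the IV slope: ({ci_low:.4f}, {ci_high:.4f})")
print(f"OLS estimate inside the interval: {ci_low < beta_ols < ci_high}")
print(f"IV s.e. / OLS s.e. = {se_ratio:.2f}")
```

The same interval also covers zero, so on these numbers the IV estimate by itself is not statistically significant at the 5% level.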

👨‍🏭 Example 15.2 – Men (WAGE2 data)


Goal: Estimate return to education for men.

Step 1: Choose an instrument — number of siblings (sibs)


Assume:

●​ More siblings → less parental resources → less education → negative correlation


(Cov(sibs, educ) < 0).​

●​ Siblings number not directly related to u → Cov(sibs, u) = 0 (assumed).​

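Check the first stage by regressing educ on sibs. Standard errors for the intercept and slope:

(.11) (.030), n = 935, R² = .057

✅ Sibs and educ are significantly negatively correlated.

Step 2: IV estimation

Standard errors for the intercept and the slope on educ:

(.36) (.026), n = 935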

Compare:

●​ OLS: β₁ = 0.059 (s.e. = 0.006)​

●​ IV: β₁ = 0.122 (s.e. = 0.026)​

So now IV > OLS, opposite to before.

Why the difference?

Several possibilities:
●​ Maybe sibs is not truly exogenous — number of siblings could be linked to family
environment or ability (more siblings → less parental attention → lower ability).​
→ This violates IV exogeneity.​

●​ Or, maybe OLS was biased downward because of measurement error in education
(if education is measured noisily).​

So, the direction of bias depends on what’s driving endogeneity — omitted ability or
measurement error.

🎓 Example 15.3 – Angrist & Krueger (1991): Quarter of Birth as an IV


They used season of birth as an instrument for education among U.S. men.

●​ Model: same as before (log wage = β₀ + β₁ educ + u)​

●​ Instrument: frstqrt = 1 if born in Q1, else 0​

Logic:

●​ Quarter of birth affects when you start school (and therefore total years of
schooling, due to compulsory schooling laws).​

●​ But quarter of birth shouldn’t affect ability or motivation → exogenous.​

✅ So quarter of birth is a binary instrument for a continuous variable (educ).


The empirical result

Using 247,199 men (huge dataset):

| Estimator | β₁ (Return to Education) | Std. Error |
|---|---|---|
| OLS | 0.0801 | 0.0004 |
| IV (Quarter of birth) | 0.0715 | 0.0219 |

Observations:

●​ OLS t-statistic ≈ 200 (!)​

●​ IV t-statistic ≈ 3.26​
→ IV works but is much less precise — again due to weak instrument (very small
R²ₓ,z).​

Interpretation:​
Returns to education ~7–8%.​
IV ≈ OLS → suggests little or no omitted ability bias in this case.

But:​
Bound, Jaeger, and Baker (1995) criticized it — maybe quarter of birth is correlated with
unobserved traits (e.g., seasonality in parental characteristics), violating exogeneity.

🎖️ Example 15.4 – Vietnam Draft Lottery as a Natural Experiment


Model:

log(earns) = β₀ + β₁ veteran + u, where veteran = 1 for veterans and 0 otherwise

Problem: being a veteran isn’t random — people choose → selection bias.

Instrument:​
→ Vietnam draft lottery number — randomly assigned!

●​ Low lottery number = more likely to serve → correlated with veteran status ✅​
●​ Random assignment → uncorrelated with u ✅​
So, draft lottery number = good instrument for veteran status.

Even when both variables (endogenous + instrument) are binary, IV still works.

🧩 Summary Table
| Example | Endogenous variable | Instrument (z) | Expected Sign | IV vs OLS | Key Point |
|---|---|---|---|---|---|
| 15.1 Married Women | educ | fatheduc | + | IV < OLS | OLS upward bias (ability bias) |
| 15.2 Men | educ | sibs | – | IV > OLS | Possibly OLS downward bias (measurement error or invalid IV) |
| 15.3 Angrist & Krueger | educ | quarter of birth | tiny | IV ≈ OLS | Weak IV but large data |
| 15.4 Vietnam Veterans | veteran | draft lottery number | – | IV solves selection bias | Natural experiment |

🌿 Context
We already know:

● An instrument z must satisfy two key conditions:

Cov(z, u) = 0 (exogeneity)
Cov(z, x) ≠ 0 (relevance)

When exogeneity fails, your IV is invalid.

When relevance fails or is weak, your IV is weak, and that’s the focus here.

🧮 Step 1: Probability limit of IV estimator


The population (asymptotic) behavior of the IV estimator is given by

plim β̂₁,IV = β₁ + [Corr(z, u) / Corr(z, x)] · (σᵤ / σₓ)

So the IV estimate converges to the true coefficient β₁ plus a bias term.

Notice that the bias term depends on:

● how much the instrument z is correlated with the error term u

● how much z is correlated with x

💡 Step 2: Why weak instruments are dangerous


If Corr(z,x) is small (i.e., the instrument barely explains x),​
then the denominator is tiny → the bias term becomes large, even if Corr(z,u) is tiny.

So:

Even a “slightly endogenous” instrument can cause massive bias if it’s only
weakly correlated with x.
This is why economists say:

A weak instrument is almost as bad as a bad one.

🔁 Step 3: Comparison with OLS


OLS has its own asymptotic bias:

plim β̂₁,OLS = β₁ + Corr(x, u) · (σᵤ / σₓ)

Comparing:

●​ OLS bias ∝ correlation between x and u​

●​ IV bias ∝ (correlation between z and u) ÷ (correlation between z and x)​
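The comparison is easy to play with numerically. The sketch below assumes the two asymptotic bias terms are [Corr(z, u) / Corr(z, x)] · (σᵤ / σₓ) for IV and Corr(x, u) · (σᵤ / σₓ) for OLS. The function names and the correlation values are made up for illustration.

```python
import math


def iv_asymptotic_bias(corr_zu, corr_zx, sd_u, sd_x):
    """IV bias term: [Corr(z,u) / Corr(z,x)] * (sd_u / sd_x)."""
    return (corr_zu / corr_zx) * (sd_u / sd_x)


def ols_asymptotic_bias(corr_xu, sd_u, sd_x):
    """OLS bias term: Corr(x,u) * (sd_u / sd_x)."""
    return corr_xu * (sd_u / sd_x)


# Hypothetical case: x is moderately endogenous, z is only slightly
# endogenous but also only weakly related to x.
bias_ols = ols_asymptotic_bias(corr_xu=0.10, sd_u=1.0, sd_x=1.0)
bias_iv = iv_asymptotic_bias(corr_zu=0.02, corr_zx=0.05, sd_u=1.0, sd_x=1.0)

print(f"OLS asymptotic bias: {bias_ols:.2f}")
print(f"IV asymptotic bias:  {bias_iv:.2f}")
```

In this made-up case the instrument is five times "cleaner" than x (0.02 versus 0.10), yet IV ends up with four times the asymptotic bias because Corr(z, x) is so small. IV beats OLS on this criterion only when Corr(z, u) / Corr(z, x) is smaller in magnitude than Corr(x, u).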

🧭 Step 4: Example of how this plays out


Example 15.3: Smoking and Birth Weight

Here the instrument for smoking is the cigarette price. In the first-stage regression of smoking on cigarette price, the price has essentially no explanatory power.

That means cigarette price doesn’t explain smoking behavior, maybe because smoking is addictive.

When they still use it as an IV, the result is:

Huge coefficient, wrong sign, massive standard error, i.e., nonsense output.
Because the instrument fails the only thing we can check empirically: relevance.

⚠️ Step 5: The weak instrument problem


The more subtle case is when z and x are correlated, but only weakly.

Then:

●​ IV estimates become extremely unstable.​

●​ Standard errors are large.​

●​ The usual t-tests and confidence intervals become misleading.​

Econometricians like Staiger and Stock (1997) showed:

● If the correlation between z and x shrinks with sample size (at rate 1/√n), then even with large samples, IV inference breaks down.

●​ In such cases, the IV estimator does not follow the usual normal distribution.​

●​ So your reported p-values and confidence intervals may be completely wrong.​
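A small Monte Carlo sketch makes the instability visible. Everything here is an assumption for illustration: the data-generating process, the first-stage slopes 1.0 ("strong") and 0.05 ("weak"), the sample size, and the helper names. The instrument is exogenous by construction, so any trouble comes from weakness alone.

```python
import numpy as np


def iv_slope(z, x, y):
    """IV slope: sample Cov(z, y) divided by sample Cov(z, x)."""
    z_dev = z - z.mean()
    return (z_dev @ (y - y.mean())) / (z_dev @ (x - x.mean()))


def simulate_iv(pi1, n=200, reps=1000, seed=0):
    """Draw `reps` samples and return the IV slope estimate from each."""
    rng = np.random.default_rng(seed)
    estimates = np.empty(reps)
    for r in range(reps):
        z = rng.normal(size=n)                  # exogenous instrument
        a = rng.normal(size=n)                  # unobserved confounder
        x = pi1 * z + a + rng.normal(size=n)    # first stage with slope pi1
        y = 1.0 + 2.0 * x + a + rng.normal(size=n)  # true slope is 2
        estimates[r] = iv_slope(z, x, y)
    return estimates


def iqr(values):
    """Interquartile range, a spread measure that tolerates wild outliers."""
    q25, q75 = np.percentile(values, [25, 75])
    return q75 - q25


strong = simulate_iv(pi1=1.0)
weak = simulate_iv(pi1=0.05)

print(f"strong instrument: median = {np.median(strong):.2f}, IQR = {iqr(strong):.2f}")
print(f"weak instrument:   median = {np.median(weak):.2f}, IQR = {iqr(weak):.2f}")
```

The interquartile range is used instead of the standard deviation because, with a weak instrument, a handful of samples produce enormous estimates that would dominate a variance calculation.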

✅ Step 6: Practical takeaway


To check for weak instruments, you can look at the first-stage regression of x on z.
If the first-stage F-statistic is below 10 (a common rule of thumb), your instruments are probably weak.

That’s why:

●​ We always test relevance empirically.​

●​ We can’t test exogeneity, so we rely on reasoning.​

●​ But even a “valid” instrument that’s too weak can blow up your results.

📘 Summary
| Concept | Condition | What happens if it fails? |
|---|---|---|
| Instrument Exogeneity | Cov(z, u) = 0 | IV is inconsistent (biased, invalid) |
| Instrument Relevance | Cov(z, x) ≠ 0 | IV is weak (huge SEs, possibly worse than OLS) |
| Weak Instrument Problem | Corr(z, x) is small | Bias from Corr(z, u) gets magnified; t-stats unreliable |

3. Why R² behaves oddly in IV

Regression software still reports an R² after IV estimation using the same formula:
R² = 1 − (SSR_IV / SST)

But unlike OLS, IV does not minimize SSR when estimating coefficients.​
So the IV residuals can actually have a larger SSR than the total variation (SST), which
makes R² negative.

That’s because IV estimation is not trying to fit the data best — it’s trying to fix the bias
caused by endogeneity.
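A negative R² is easy to produce on purpose. In this sketch the data-generating process is entirely hypothetical and is rigged so that x and u are strongly negatively correlated. The IV slope lands near the true value of 1, but the IV residuals vary more than y itself.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 50_000

# Hypothetical design: the error u moves strongly against x.
z = rng.normal(size=n)
v = rng.normal(size=n)
x = 0.5 * z + v
u = -1.5 * v + rng.normal(size=n)
y = 1.0 * x + u          # true intercept 0, true slope 1

# Simple IV estimates using z as the instrument for x.
z_dev = z - z.mean()
b1_iv = (z_dev @ (y - y.mean())) / (z_dev @ (x - x.mean()))
b0_iv = y.mean() - b1_iv * x.mean()

# R-squared computed the usual way, but from the IV residuals.
resid = y - b0_iv - b1_iv * x
ssr = resid @ resid
sst = (y - y.mean()) @ (y - y.mean())
r2_iv = 1.0 - ssr / sst

print(f"IV slope = {b1_iv:.3f}, IV R-squared = {r2_iv:.3f}")
```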

4. Why R² in IV has no real meaning

In OLS, R² has a clear interpretation:

“This model explains X% of the variation in y.”

But in IV:

●​ Since x and u are correlated, we can’t cleanly decompose​


Var(y) = β₁² Var(x) + Var(u)​
like we can in OLS.​

●​ So the R² from IV no longer measures how much of y’s variation is “explained.”​

Hence, the IV R² is just a computed value with no meaningful interpretation for


goodness-of-fit.

5. Why we still use IV anyway

If the goal were just to get the highest R², we’d always choose OLS.
But OLS can be biased when x and u are correlated, even if its R² looks great.

IV estimation sacrifices “fit” for consistent causal estimates of β₁.

So:

● OLS: High R², possibly biased

● IV: Low or negative R², but a consistent causal estimate

6. Summary table

| Concept | OLS | IV |
|---|---|---|
| Goal | Fit data well | Estimate causal effect correctly |
| R² range | 0 ≤ R² ≤ 1 | Can be negative |
| Interpretation | % of variance in y explained | No clear interpretation |
| Use when | x and u are uncorrelated | x and u are correlated (endogeneity problem) |

🧩 Step 1: The Setup — Our Model


We start with the model:

y₁ = β₀ + β₁y₂ + β₂z₁ + u₁ (15.22)

where

●​ y₁ = dependent variable (endogenous)​

● y₂ = explanatory variable that is endogenous → it’s correlated with u₁

● z₁ = explanatory variable that is exogenous → uncorrelated with u₁

●​ u₁ = error term​

Assume 𝐸(u₁) = 0.​

🔹 Step 4. Why can’t we use z₁ as the instrument?

Because z₁ (like experience) already appears as an explanatory variable inside the equation we’re estimating, where it effectively serves as its own instrument.
It can’t do double duty as the instrument for y₂ as well: we would have three parameters to estimate but only two exogenous pieces of information (the constant and z₁).

That’s why we need a new exogenous variable z₂, something outside the current regression.

🔹 Step 5. The key exogeneity assumptions (Equation 15.24)


We assume:

E(u₁) = 0,
Cov(z₁, u₁) = 0,
Cov(z₂, u₁) = 0

That means:

● On average, the errors balance out.

● Both z₁ and z₂ are “clean”: not contaminated by the error term.

These are the conditions that make z₁ and z₂ valid exogenous variables.

7. Moment Conditions (How IV Estimates Are Computed)

⚙️ Step 1: Write the Model Again

y₁ = β₀ + β₁y₂ + β₂z₁ + u₁

⚙️ Step 2: What We Know About the True Model
If the model is correct and the instruments are valid, then by definition:

E(z₁u₁) = 0, E(z₂u₁) = 0, E(u₁) = 0

That’s just the assumption that both z₁ and z₂ are exogenous (uncorrelated with the error), together with the zero-mean error.

⚙️ Step 3: Replace the Unknown u₁

We don’t observe u₁, but we can plug in its expression from Step 1:

E[z₁(y₁ − β₀ − β₁y₂ − β₂z₁)] = 0
E[z₂(y₁ − β₀ − β₁y₂ − β₂z₁)] = 0
E[y₁ − β₀ − β₁y₂ − β₂z₁] = 0

The last one (without a z) just enforces that the average residual = 0 (that’s what the intercept does).

These are the population moment conditions: they express the idea that the instruments are uncorrelated with the true error.

⚙️ Step 4: Move to the Sample Version (What We Actually Compute)

In data, we approximate the expectations E(⋅) using sample averages (Σ / n).
So we get the sample moment equations (dividing by n doesn’t change the solution, so it is dropped):

Σ (yᵢ₁ − β₀ − β₁yᵢ₂ − β₂zᵢ₁) = 0
Σ zᵢ₁ (yᵢ₁ − β₀ − β₁yᵢ₂ − β₂zᵢ₁) = 0
Σ zᵢ₂ (yᵢ₁ − β₀ − β₁yᵢ₂ − β₂zᵢ₁) = 0

These are equations (15.25) from your text.

Each equation says:

“The sample covariance between the residual and the instrument (or constant)
is zero.”

⚙️ Step 5: Why Three Equations?


We have three unknowns:​
β₀, β₁, β₂.

We have three equations (the three moment conditions).​


So we can solve them simultaneously — that’s how we estimate the βs.

Moment Conditions (How IV Estimates Are Computed)

We want β₀, β₁, and β₂ so that the instruments (z₁, z₂) are uncorrelated with the estimated
residual.

That gives three “moment equations”:

Σ (yᵢ₁ − β₀ − β₁yᵢ₂ − β₂zᵢ₁) = 0​


Σ zᵢ₁ (yᵢ₁ − β₀ − β₁yᵢ₂ − β₂zᵢ₁) = 0​
Σ zᵢ₂ (yᵢ₁ − β₀ − β₁yᵢ₂ − β₂zᵢ₁) = 0 (15.25)

These three equations (for β₀, β₁, β₂) define the IV estimators.

If y₂ were exogenous, choosing z₂ = y₂ gives the standard OLS equations.
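Because the three moment equations are linear in β₀, β₁, β₂, they can be solved as one small linear system. In matrix form they read Zᵀ(y₁ − Xβ) = 0, where X stacks the regressors (1, y₂, z₁) and Z stacks the instruments (1, z₁, z₂). The sketch below does this on simulated data. The data-generating process and the true values (1, 2, −1) are assumptions made up for the example.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 100_000

# Hypothetical data: a is unobserved and makes y2 endogenous.
z1 = rng.normal(size=n)
z2 = rng.normal(size=n)
a = rng.normal(size=n)
y2 = 1.0 + 0.5 * z1 + 1.0 * z2 + a + rng.normal(size=n)
y1 = 1.0 + 2.0 * y2 - 1.0 * z1 + 2.0 * a + rng.normal(size=n)

ones = np.ones(n)
X = np.column_stack([ones, y2, z1])   # regressors in the structural equation
Z = np.column_stack([ones, z1, z2])   # instruments: constant, z1, z2

# Three equations, three unknowns: solve (Z'X) beta = Z'y1.
beta_hat = np.linalg.solve(Z.T @ X, Z.T @ y1)

# The sample moments evaluated at the solution should be numerically zero.
moments = Z.T @ (y1 - X @ beta_hat) / n

print("beta_hat:", np.round(beta_hat, 3))
print("moments: ", moments)
```

Replacing the z₂ column of Z with y₂ turns the same system into the OLS normal equations, which is the point made in the last sentence above.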

8. Reduced Form for the Endogenous Variable

We can model y₂ as a function of the exogenous variables:

y₂ = π₀ + π₁z₁ + π₂z₂ + v₂ (15.26)

where​
E(v₂) = 0, Cov(z₁, v₂) = 0, Cov(z₂, v₂) = 0.

The key condition is π₂ ≠ 0 — this means z₂ and y₂ are correlated even after controlling for
z₁.​
(We test this by regressing y₂ on z₁ and z₂ and checking if π₂ ≠ 0.)
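Here is a rough sketch of that partial-relevance test: regress y₂ on a constant, z₁, and z₂, then look at the t statistic on z₂. The simulated data (including the assumed value π₂ = 0.3 and the correlation between z₁ and z₂) are hypothetical, and the standard errors are the usual homoskedastic OLS ones.

```python
import numpy as np

rng = np.random.default_rng(6)
n = 2_000

# Hypothetical reduced form with pi_2 = 0.3; z2 is correlated with z1.
z1 = rng.normal(size=n)
z2 = 0.6 * z1 + rng.normal(size=n)
y2 = 1.0 + 0.8 * z1 + 0.3 * z2 + rng.normal(size=n)

# OLS of y2 on (1, z1, z2).
W = np.column_stack([np.ones(n), z1, z2])
pi_hat, *_ = np.linalg.lstsq(W, y2, rcond=None)

# Homoskedastic covariance matrix of the OLS estimates.
resid = y2 - W @ pi_hat
sigma2_hat = (resid @ resid) / (n - W.shape[1])
cov_pi = sigma2_hat * np.linalg.inv(W.T @ W)

# t statistic for H0: pi_2 = 0, holding z1 fixed.
t_pi2 = pi_hat[2] / np.sqrt(cov_pi[2, 2])

print(f"pi2_hat = {pi_hat[2]:.3f}, t = {t_pi2:.1f}")
```

Including z₁ in the regression matters: a raw correlation between z₂ and y₂ could come entirely from z₁, which would not help identify β₁.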

This is the reduced-form equation, because it writes an endogenous variable (y₂) in terms
of exogenous ones.
9. Structural vs. Reduced Form

●​ Structural equation (15.22): The causal relationship we care about.​

●​ Reduced form (15.26): How an endogenous variable depends on exogenous ones.​

10. The General Case (More Variables)

If we have:

y₁ = β₀ + β₁y₂ + β₂z₁ + … + βₖzₖ₋₁ + u₁ (15.28)

and an additional exogenous variable zₖ not in the equation (to serve as instrument for y₂),

then the reduced form for y₂ is:

y₂ = π₀ + π₁z₁ + … + πₖ₋₁zₖ₋₁ + πₖzₖ + v₂

Valid instrument conditions:

E(u₁) = 0
Cov(zⱼ, u₁) = 0 for all j = 1,…,k
πₖ ≠ 0 (15.29–15.31)

Exploring Further 15.2

Suppose we wish to estimate the effect of marijuana usage on college grade point average
(GPA).​
For the population of college seniors at a university, let daysused denote the number of
days in the past month on which a student smoked marijuana, and consider the structural
equation:

colGPA = β₀ + β₁ daysused + β₂ SAT + u

(i)
Let percHS denote the percentage of a student’s high school graduating class that reported regular use of marijuana.
If this is an IV candidate for daysused, then the reduced form for daysused is:

daysused = π₀ + π₁ percHS + π₂ SAT + v

and relevance requires π₁ ≠ 0.

This means that, after controlling for SAT, the percentage of high school classmates who regularly used marijuana (percHS) must still be correlated with an individual’s marijuana use in college (daysused).
It seems plausible that percHS and daysused are positively correlated: students from schools where marijuana use was more common may be more likely to use it in college.
So, condition (15.27) is likely to hold.

(ii)

However, is percHS truly exogenous in the structural equation?

That requires:

Cov(percHS, u) = 0

But there are potential problems:

●​ percHS might be correlated with unobserved factors in u that also affect college
GPA.​
For example, students from schools with higher marijuana use might differ in
average academic motivation or school quality.​

●​ These omitted factors (motivation, environment, peer effects) could enter the
error term u and make percHS endogenous.​
Therefore, percHS might not be a valid instrument if it’s correlated with those unobserved
characteristics.

Example 15.4: Using college proximity as an IV for education

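Card (1995) wanted to estimate the return to education in a wage equation of the form:

log(wage) = β₀ + β₁educ + (other controls) + u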

But education may be endogenous — ability, ambition, etc., affect both education and
wage.

So, he used nearc4 (whether you grew up near a 4-year college) as an instrument.

In the reduced form for educ (educ regressed on nearc4 and the other exogenous variables), the coefficient on nearc4 is 0.320: holding other factors constant, people who lived near a college had about one-third of a year more education on average.
The t-statistic is 3.64, indicating strong partial correlation, so the relevance condition (15.27) holds.

Using nearc4 as IV

If nearc4 affects wages only through education (not directly correlated with unobserved
ability), then we can use it as a valid IV.

The result:
Variable    OLS        IV

educ        0.075      0.132
            (0.003)    (0.055)

(Standard errors in parentheses.)

The IV estimate of return to education is almost double, but less precise (bigger SE) —
typical of IV estimates.

Reduced form for y₁

We start from the structural model — this is the main equation we actually care about
(the “causal” one), equation (15.28):

y₁ = β₀ + β₁y₂ + β₂z₁ + … + βₖzₖ₋₁ + u₁

We think y2 is correlated with u1, so we can’t use OLS directly.

Step 1. Introduce an instrument

We bring in a new exogenous variable zk (like nearc4) to use as an instrument for y2.​
It must satisfy:

Cov(zk, u1) = 0 (exogeneity)

Cov(zk, y2) ≠ 0 (relevance)


Step 2. Reduced form for y₂

We first express the endogenous variable y2 as a function of all the exogenous ones:

y₂ = π₀ + π₁z₁ + … + πₖ₋₁zₖ₋₁ + πₖzₖ + v₂

This is called a reduced form because it “reduces” y2 to depend only on exogenous variables (the z’s).

Here:

●​ v2 is the new error term for this equation.​

●​ We assume E(v2) = 0,
●​ Cov(zj, v2) = 0 for all j.​

Step 3. Substitute this into the structural equation

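Now plug that expression for y2 into the structural equation:

y₁ = β₀ + β₁(π₀ + π₁z₁ + … + πₖzₖ + v₂) + β₂z₁ + … + βₖzₖ₋₁ + u₁

Collecting terms, we get:

y₁ = γ₀ + γ₁z₁ + … + γₖzₖ + e₁

This is the reduced form for y1.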

Step 4. Define the new parameters

Each γ (gamma) and e₁ (error) is made up of the original β’s and π’s:

γ₀ = β₀ + β₁π₀
γⱼ = βⱼ₊₁ + β₁πⱼ for j = 1,…,k−1
γₖ = β₁πₖ
e₁ = u₁ + β₁v₂

Step 5. What does it mean?

●​ The reduced form for y₁ shows how the dependent variable (like wage) depends on
only exogenous variables (the z’s).​

●​ It’s called “reduced” because all the endogenous variables have been replaced by
their expressions in terms of exogenous ones.​

Since all z’s are exogenous (uncorrelated with e₁), we can estimate the γ’s in the reduced
form using OLS.

But — OLS on this reduced form gives you γ’s, not the causal β’s.

To recover β₁ (the true causal effect of y₂ on y₁), you need IV estimation, which combines
both reduced forms (for y₁ and y₂).
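
To make the link between the two reduced forms concrete, here is a minimal simulation sketch. It assumes one control z1 and one excluded instrument zk; every coefficient and the "ability" variable are made up for illustration. Because γₖ = β₁πₖ, dividing the two reduced-form coefficients on zk should land near the true β₁, while plain OLS should not.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50_000

# Simulated data: all numbers are hypothetical, chosen only for illustration.
z1 = rng.normal(size=n)             # included exogenous control
zk = rng.normal(size=n)             # excluded instrument
ability = rng.normal(size=n)        # unobserved factor that creates endogeneity
v2 = ability + rng.normal(size=n)
u1 = ability + rng.normal(size=n)   # correlated with v2 through ability

beta1 = 0.5                         # true causal effect of y2 on y1
y2 = 1.0 + 0.3 * z1 + 0.8 * zk + v2
y1 = 2.0 + beta1 * y2 + 0.4 * z1 + u1


def ols(y, *regressors):
    """OLS coefficients of y on a constant and the given regressors."""
    X = np.column_stack([np.ones(len(y)), *regressors])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef


pi_hat = ols(y2, z1, zk)         # reduced form for y2
gamma_hat = ols(y1, z1, zk)      # reduced form for y1
beta1_ols = ols(y1, y2, z1)[1]   # structural equation by OLS (biased here)

# gamma_k = beta1 * pi_k, so the ratio of the two zk coefficients recovers beta1.
beta1_iv = gamma_hat[2] / pi_hat[2]

print(f"OLS estimate of beta1:      {beta1_ols:.3f}")
print(f"IV (ratio) estimate:        {beta1_iv:.3f}")
```

With a single instrument this ratio is the IV estimate; with several instruments the same idea is carried out by 2SLS.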

Step 6. Special case: when y₂ and z are binary


●​ y2​=1 if a person participates in a program (treatment), 0 otherwise​

●​ zk​=1 if a person is eligible for the program, 0 otherwise​

Example:

●​ y1​: health score​

●​ y2​: actually joined a health program​

●​ zk​: was offered the chance to join (eligibility)

zk is an IV for y2. Here: y2=π0+πkzk+v2​

●​ β1 = effect of actual participation (the causal effect)​

●​ πk = change in probability of participation due to eligibility​

● γk = intention-to-treat (ITT) effect → the effect of offering eligibility: “How much does offering the program raise average outcomes?”

So, ITT (γk) measures the impact of being offered the program, not necessarily of actually
taking it.

🎯 Interpretation: “Intention-to-Treat” (ITT)


●​ γk​measures the effect of being offered the program, not the effect of actually
participating.​
→ It’s the Intention-to-Treat (ITT) effect.​

●​ β1​measures the effect of actual participation (what you’d ideally want).​


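But since some eligible people don’t participate, γk = β1 × πk.

So whenever eligibility raises participation by less than 100 percentage points (0 < πk < 1), the ITT effect is smaller in magnitude than β1: it’s the true effect multiplied by how much eligibility changes participation.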

🧮 Example
Say:

●​ β1​=10: participating raises test score by 10 points​

● πk = 0.5: eligibility increases the probability of participation by 50 percentage points

Then:

γk=β1×πk=10×0.5=5

So, the “offering” of the program raises average test scores by 5 points, even though not
everyone took part.

That’s your intention-to-treat estimate.
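
The same arithmetic can be checked on simulated data. The sketch below is only an illustration: the take-up rule, the "reluctance" variable m, and all numbers are hypothetical. It computes the ITT effect and the change in participation as differences in group means, then divides them to get back β1 (often called the Wald estimator).

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000

z = rng.integers(0, 2, size=n)    # 1 = offered the program (eligible)
m = rng.random(n)                 # unobserved "reluctance", uniform on [0, 1)

# Hypothetical take-up rule: 10% join regardless, a further 50% join only if offered.
y2 = (m < 0.1 + 0.5 * z).astype(float)   # implies pi_k = 0.5

beta1 = 10.0
# Less reluctant people also score higher, which makes y2 endogenous.
y1 = 50.0 + beta1 * y2 - 8.0 * m + rng.normal(scale=5.0, size=n)

itt = y1[z == 1].mean() - y1[z == 0].mean()       # gamma_k
pi_k = y2[z == 1].mean() - y2[z == 0].mean()      # change in participation
wald = itt / pi_k                                 # IV estimate of beta1
naive = y1[y2 == 1].mean() - y1[y2 == 0].mean()   # participants vs. non-participants

print(f"ITT effect (gamma_k):  {itt:.2f}")
print(f"Take-up change (pi_k): {pi_k:.2f}")
print(f"Wald estimate:         {wald:.2f}")
print(f"Naive comparison:      {naive:.2f}")
```

In this setup the naive comparison of participants with non-participants overstates the effect, because the people who join differ in unobserved ways.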

The Problem: Why Do We Need 2SLS?

We have a model like:

y₁ = β₀ + β₁y₂ + β₂z₁ + u₁


The issue:​
y2​is endogenous — i.e., correlated with the error term u1​(maybe smarter people get
more education and higher wages).​
That means OLS gives biased estimates.

So we need an instrumental variable (IV) — something that:

1.​ Affects y2​(relevance)​

2.​ Doesn’t directly affect y1​, except through y2​(exogeneity)​

Example: parents’ education could affect your education (good instrument) but not your
wage directly.

Now, Suppose We Have More Than One IV

Say we have:

●​ z1​: exogenous control (experience)​

●​ z2,z3​: excluded exogenous variables → potential IVs for y2​(like mother’s and
father’s education)​

Then the reduced form for y2 is:

y₂ = π₀ + π₁z₁ + π₂z₂ + π₃z₃ + v₂


This says: y2​depends on all exogenous variables (the z’s).​

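Step 1: The First Stage

Regress y2 on all the exogenous variables (z1, z2, z3) by OLS and keep the fitted values ŷ2.

In short:
We’ve “purged” the endogenous variable of its endogeneity.

Step 2: The Second Stage

Regress y1 on ŷ2 and z1 by OLS. The coefficient on ŷ2 is the 2SLS estimate of β1.

Intuition

ŷ2 is built only from exogenous variables, so it keeps the part of y2 that is unrelated to u1.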

Important Details

●​ If you have only one instrument, 2SLS = IV (they’re identical).​

●​ If you have more than one instrument, 2SLS automatically combines them
optimally — it uses the linear combination most correlated with y2​.​

●​ Before doing 2SLS, always check relevance using the F-test in the first stage (F > 10
is a good rule of thumb).​

●​ You shouldn’t do both stages manually for inference (standard errors won’t be
right) — use built-in 2sls or ivreg commands in software.​
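
As a rough illustration of the relevance check mentioned above, the sketch below computes the first-stage F statistic for the excluded instruments by comparing a restricted and an unrestricted regression of y2. The data are simulated and the function name `first_stage_f` is hypothetical, not a library routine.

```python
import numpy as np


def ssr(y, X):
    """Sum of squared OLS residuals from regressing y on X."""
    coef = np.linalg.lstsq(X, y, rcond=None)[0]
    resid = y - X @ coef
    return resid @ resid


def first_stage_f(y2, included, excluded):
    """F statistic for H0: all coefficients on the excluded instruments are zero.

    included: (n, p) array of exogenous controls from the structural equation.
    excluded: (n, q) array of instruments left out of the structural equation.
    """
    n = len(y2)
    const = np.ones((n, 1))
    X_restricted = np.hstack([const, included])
    X_full = np.hstack([const, included, excluded])
    q = excluded.shape[1]
    df = n - X_full.shape[1]
    ssr_r, ssr_u = ssr(y2, X_restricted), ssr(y2, X_full)
    return ((ssr_r - ssr_u) / q) / (ssr_u / df)


rng = np.random.default_rng(3)
n = 2_000
z1 = rng.normal(size=(n, 1))     # included exogenous control
z23 = rng.normal(size=(n, 2))    # excluded instruments z2, z3
v2 = rng.normal(size=n)

# Hypothetical first stages: one with strong instruments, one with weak ones.
y2_strong = 0.5 * z1[:, 0] + z23 @ np.array([0.4, 0.3]) + v2
y2_weak = 0.5 * z1[:, 0] + z23 @ np.array([0.02, 0.01]) + v2

f_strong = first_stage_f(y2_strong, z1, z23)
f_weak = first_stage_f(y2_weak, z1, z23)
print(f"First-stage F, strong instruments: {f_strong:.1f}")
print(f"First-stage F, weak instruments:   {f_weak:.1f}")
```

This version assumes homoskedastic errors; in applied work the F statistic reported by the IV routine of your software is the safer choice.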

Example
“But even with one instrument, IV also replaces the endogenous y2​with the
part predicted by the instrument — so what’s different when we have two
instruments? Isn’t 2SLS doing the same thing?”
Now, the question becomes:

“Which one do we use as the instrument? 𝑧2, 𝑧3? Both? Or some combination?”
Each individual z is valid, but they differ in how strongly they correlate with 𝑦2​.​
If you use just one, you’re wasting the information in the other(s).​
If you combine them randomly, your estimator may be inefficient.

So — we want a best possible linear combination of 𝑧1, 𝑧2, 𝑧3 that:

●​ stays uncorrelated with 𝑢1​, and​

●​ is most correlated with 𝑦2.​

That’s what the first stage of 2SLS automatically does.
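
A minimal NumPy sketch of that idea follows, on simulated data with made-up coefficients. The function `two_sls` is a hypothetical helper, not a library routine. The first stage projects the regressors onto all exogenous variables, which is exactly the "best linear combination" step. The sketch also returns the standard errors you would get by naively running the second stage by hand, to show that they differ from the ones based on the correct residuals.

```python
import numpy as np


def two_sls(y, X, Z):
    """2SLS of y on X with instrument matrix Z (both include a constant column).

    X: structural regressors (endogenous variable plus included exogenous ones).
    Z: all exogenous variables (included ones plus excluded instruments).
    Returns coefficients, homoskedasticity-only SEs, and naive second-stage SEs.
    """
    # First stage: fitted values of every column of X from a regression on Z.
    X_hat = Z @ np.linalg.lstsq(Z, X, rcond=None)[0]
    # Second stage: regress y on the fitted values.
    beta = np.linalg.lstsq(X_hat, y, rcond=None)[0]
    n, k = X.shape
    xtx_inv_diag = np.diag(np.linalg.inv(X_hat.T @ X_hat))
    # Correct residuals use the ORIGINAL X, not the fitted values.
    resid = y - X @ beta
    se = np.sqrt(resid @ resid / (n - k) * xtx_inv_diag)
    # Naive residuals (what a hand-run second stage would use).
    naive_resid = y - X_hat @ beta
    naive_se = np.sqrt(naive_resid @ naive_resid / (n - k) * xtx_inv_diag)
    return beta, se, naive_se


rng = np.random.default_rng(2)
n = 20_000
z1, z2, z3 = rng.normal(size=(3, n))
ability = rng.normal(size=n)                      # unobserved
y2 = 1.0 + 0.5 * z1 + 0.4 * z2 + 0.3 * z3 + ability + rng.normal(size=n)
y1 = 1.0 + 0.8 * y2 + 0.5 * z1 + ability + rng.normal(size=n)

const = np.ones(n)
X = np.column_stack([const, y2, z1])
Z = np.column_stack([const, z1, z2, z3])

beta, se, naive_se = two_sls(y1, X, Z)
print(f"2SLS estimate of beta1: {beta[1]:.3f}")
print(f"Correct SE: {se[1]:.4f}   Naive second-stage SE: {naive_se[1]:.4f}")
```

The coefficient estimates are identical either way; only the standard errors change, which is why built-in IV commands are preferred for inference.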


Multicollinearity
●​ Multicollinearity happens when two or more independent variables (regressors) in
a regression are highly correlated.

●​ Why it’s a problem: If variables move together, it’s hard for the model to figure out
which one is really affecting the dependent variable.

●​ Effect: The standard errors of the coefficients get bigger, so your estimates are
less precise.

2SLS (Two-Stage Least Squares)


●​ 2SLS is used when one or more independent variables are endogenous (correlated
with the error term).

●​ Idea: Replace the endogenous variable with a predicted version that only uses
exogenous instruments.

Stage 1: Predict the endogenous variable


●​ Example: 𝑦2 is endogenous.

● Regress y₂ on instruments + other exogenous variables → get ŷ₂ (predicted value).

Stage 2: Use the predicted variable


● Replace y₂ in the main regression with ŷ₂ and estimate the coefficients.

Why multicollinearity is worse in 2SLS


● Variance formula (simplified idea):

Var(β̂₁) ∼ σ² / [ Var(ŷ₂) × (1 − R²) ]

where R² comes from regressing ŷ₂ on the other exogenous variables in the equation.

● Two things make variance bigger than OLS:
a. ŷ₂ has less variation than y₂ (because it’s only the part explained by instruments).
b. ŷ₂ is highly correlated with other exogenous variables → “classic multicollinearity problem.”
●​ In short: 2SLS can have huge standard errors if your instruments don’t add
enough new information.

Example
● Regressing the original variable educ on the other exogenous variables gives R² = 0.475 → okay, OLS standard error small.
● Regressing the fitted values of educ (from the first stage) on the same variables gives R² = 0.995 → almost perfectly correlated with exogenous variables → 2SLS standard error becomes very large.
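
The pattern in this example can be reproduced on simulated data. The sketch below is hypothetical throughout (made-up coefficients, one control z1, one weak-ish instrument z2). It compares the R² from regressing y2 on the included exogenous variable with the R² from regressing the first-stage fitted values on it.

```python
import numpy as np


def r_squared(y, X):
    """R-squared from an OLS regression of y on X (X includes a constant)."""
    coef = np.linalg.lstsq(X, y, rcond=None)[0]
    resid = y - X @ coef
    centered = y - y.mean()
    return 1.0 - (resid @ resid) / (centered @ centered)


rng = np.random.default_rng(4)
n = 5_000
z1 = rng.normal(size=n)    # exogenous control that appears in the main equation
z2 = rng.normal(size=n)    # excluded instrument that adds little new information
y2 = 1.0 + 1.0 * z1 + 0.1 * z2 + rng.normal(size=n)

const = np.ones(n)
Z_all = np.column_stack([const, z1, z2])
Z_incl = np.column_stack([const, z1])

# First-stage fitted values.
y2_hat = Z_all @ np.linalg.lstsq(Z_all, y2, rcond=None)[0]

r2_y2 = r_squared(y2, Z_incl)          # drives the OLS variance
r2_y2_hat = r_squared(y2_hat, Z_incl)  # drives the 2SLS variance
var_ratio = y2_hat.var() / y2.var()    # fitted values vary less than y2

print(f"R2 of y2 on z1:      {r2_y2:.3f}")
print(f"R2 of y2_hat on z1:  {r2_y2_hat:.3f}")
print(f"Var(y2_hat)/Var(y2): {var_ratio:.3f}")
```

Both ingredients of the variance formula move the wrong way at once: the fitted values vary less than y2, and they are almost collinear with z1.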

4. Multiple Endogenous Variables


●​ What it means: Sometimes your regression has more than one endogenous
variable.​
Example:

𝑦1 = β0 + β1𝑦2 + β2𝑦3 + β3𝑧1 + 𝑢1

​ Here, 𝑦2 and 𝑦3 are endogenous (they are correlated with the error 𝑢1).

● Problem: Each endogenous variable needs an instrument (something outside the equation that predicts it but is not correlated with the error).

●​ Order condition (necessary):

o You must have at least as many excluded exogenous variables (instruments) as endogenous variables.

o​ Example: 2 endogenous variables → need at least 2 instruments.

o​ This is easy to check: just count variables.

●​ Rank condition (sufficient):

o​ Just having enough instruments is not always enough.

o Instruments must provide independent information for each endogenous variable.

o​ This is more complicated (matrix math), but if it fails, 2SLS estimates are
inconsistent (wrong).

●​ Weak or missing instruments:

o If instruments are weakly correlated with endogenous variables or missing → 2SLS estimates become unreliable and have huge standard errors.
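
To see why counting instruments is not enough, here is a small sketch with two hypothetical helper functions. It works with population reduced-form coefficients that are simply assumed, not estimated. The matrix holds the coefficients on the excluded instruments (rows) in the reduced forms of the endogenous variables (columns), and the rank condition asks for full column rank.

```python
import numpy as np


def order_condition(n_endog, n_excluded):
    """Necessary: at least as many excluded instruments as endogenous variables."""
    return n_excluded >= n_endog


def rank_condition(pi_excl, tol=1e-8):
    """Full column rank of the excluded-instrument coefficients.

    pi_excl has shape (n_excluded, n_endog): column j holds the reduced-form
    coefficients on the excluded instruments for endogenous variable j.
    """
    pi_excl = np.asarray(pi_excl, dtype=float)
    return bool(np.linalg.matrix_rank(pi_excl, tol=tol) == pi_excl.shape[1])


# Two endogenous variables (y2, y3) and two excluded instruments (z2, z3).
# Hypothetical case A: both instruments predict y2 only, none predicts y3.
pi_bad = [[0.5, 0.0],
          [0.3, 0.0]]
# Hypothetical case B: z2 predicts y2 and z3 predicts y3.
pi_good = [[0.5, 0.0],
           [0.0, 0.4]]

print("Order condition:", order_condition(n_endog=2, n_excluded=2))
print("Rank condition, case A:", rank_condition(pi_bad))
print("Rank condition, case B:", rank_condition(pi_good))
```

With estimated coefficients the matrix is almost never exactly rank deficient, so in practice this check is replaced by formal tests; the sketch only illustrates the logic.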
5. Testing after 2SLS
●​ Usual OLS tests don’t work directly:

o​ F-test and R² formulas assume OLS.

o​ In 2SLS, the R² can even be negative → using standard F-tests can give
nonsense results.

●​ Correct approach:

o Use specialized commands in econometrics software that are designed for 2SLS hypothesis testing.

o​ These take into account that the regression used predicted values, not the
original endogenous variables.

IV Solutions to Errors-in-Variables Problems


1. The Problem: Measurement Error
●​ Sometimes, the variable you care about is not measured perfectly.

Example: You want x*1 (true education), but you only observe x1 (self-reported
education).
x₁ = x₁* + e₁

●​ e1 = measurement error (the difference between true value and what we see).

●​ Problem: Using x1 in OLS gives biased estimates. The coefficient is usually smaller
than it should be (this is called attenuation).

2. Why OLS fails


●​ Rewrite the regression using observed x1 instead of true x*1:

y = β₀ + β₁x₁ + β₂x₂ + (u − β₁e₁)
●​ Notice (u - β1 e1) → the measurement error is now part of the error term.

●​ This makes x1 correlated with the error, which breaks OLS assumptions → biased
and inconsistent estimates.
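
A short simulation can illustrate the attenuation. It is a sketch under classical errors-in-variables assumptions (e1 unrelated to the true value and to u), with a single regressor and made-up numbers. Under those assumptions the OLS slope shrinks toward zero by the factor Var(x*) / (Var(x*) + Var(e1)).

```python
import numpy as np

rng = np.random.default_rng(5)
n = 100_000
beta1 = 2.0

x_star = rng.normal(size=n)    # true value (unobserved in practice)
e1 = rng.normal(size=n)        # classical measurement error
x1 = x_star + e1               # what we actually observe
y = 1.0 + beta1 * x_star + rng.normal(size=n)


def slope(y, x):
    """Simple-regression OLS slope of y on x."""
    c = np.cov(x, y)
    return c[0, 1] / c[0, 0]


ols_true = slope(y, x_star)    # infeasible benchmark
ols_noisy = slope(y, x1)       # attenuated toward zero
reliability = x_star.var() / x1.var()

print(f"Slope using true x*:      {ols_true:.3f}")
print(f"Slope using mismeasured:  {ols_noisy:.3f}")
print(f"beta1 * reliability:      {beta1 * reliability:.3f}")
```

Here the error variance equals the variance of the true value, so roughly half of the effect is lost.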
3. The IV solution
●​ Idea: Use an instrumental variable (IV) that can replace the mismeasured variable.

Requirements for a valid IV for x1:


1.​ Correlated with x1 (must predict it).

2.​ Uncorrelated with the regression error u.

3.​ Uncorrelated with measurement error e1.

4. Using a second measurement


●​ If you can measure the same thing in a different way (say z1), you can use it as an IV:
z₁ = x₁* + a₁

●​ a1 = measurement error in the second measurement.

●​ Key assumption: Errors in the two measures are uncorrelated: e1 ⊥ a1.

●​ Why it works: Both x1 and z1 depend on x*1, so they are correlated. But the error
in z1 doesn’t correlate with e1 or u → makes z1 a valid IV for x1.
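
The second-measurement idea can be sketched the same way. The simulation below uses made-up numbers and assumes the two measurement errors are independent of each other and of u. It uses z1 as an instrument for x1 with the simple IV formula Cov(z1, y) / Cov(z1, x1).

```python
import numpy as np

rng = np.random.default_rng(6)
n = 100_000
beta1 = 2.0

x_star = rng.normal(size=n)           # true value (unobserved)
x1 = x_star + rng.normal(size=n)      # first noisy measurement (error e1)
z1 = x_star + rng.normal(size=n)      # second noisy measurement (error a1)
y = 1.0 + beta1 * x_star + rng.normal(size=n)

# OLS of y on the mismeasured x1: attenuated.
ols_noisy = np.cov(x1, y)[0, 1] / np.var(x1, ddof=1)

# IV using the second measurement as the instrument for the first.
iv = np.cov(z1, y)[0, 1] / np.cov(z1, x1)[0, 1]

print(f"OLS with mismeasured x1: {ols_noisy:.3f}")
print(f"IV using z1:             {iv:.3f}")
```

The IV estimate comes back close to the true slope because z1 is correlated with x1 only through the true value.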

Real-life examples:
●​ Self-reported salary vs. employer’s record.

●​ Each spouse reporting household income independently.

●​ Twins reporting each other’s education (Ashenfelter & Krueger study).

5. Using other exogenous variables


●​ Sometimes you don’t have a second measurement.

●​ You can use another variable that is:

a.​ Correlated with x1*

b.​ Uncorrelated with the measurement error e1

●​ Example: Using mother’s and father’s education as IVs for self-reported education.

6. IV for proxy variables


●​ Sometimes we use a proxy for an unobserved variable (like ability).
Example: Wage equation:
log(wage) = β₀ + β₁educ + β₂exper + β₃exper² + abil + u

●​ abil (ability) is unobserved.

●​ We have test scores as indicators of ability:

𝑡𝑒𝑠𝑡1 = γ1𝑎𝑏𝑖𝑙 + 𝑒1, 𝑡𝑒𝑠𝑡2 = δ1𝑎𝑏𝑖𝑙 + 𝑒2

●​ Problem: test1 is correlated with the error (because of measurement error e1) →
OLS fails.

●​ Solution: Use test2 as an IV for test1, assuming errors e1 and e2 are uncorrelated.

🌱 What’s happening here — in plain English


We’re learning how to test if an explanatory variable (like y₂) is endogenous —​
meaning, whether it’s correlated with the error term (u₁) and is messing up OLS.
If it is endogenous → we should use IV / 2SLS.​
If it isn’t → we can just use OLS, which is simpler and more efficient.

🧩 Step 1: The basic problem


We have a main equation (called the structural equation):
y₁ = β₀ + β₁y₂ + β₂z₁ + β₃z₂ + u₁​
→ (z₁ and z₂ are exogenous, i.e., clean variables)​
→ y₂ is suspected to be endogenous.
We also have two extra exogenous variables (z₃ and z₄) that can be used as instruments for
y₂.
The question is:
Is y₂ actually endogenous? Or can we safely treat it as exogenous?

🧠 Step 2: The idea (Hausman logic)


Hausman’s idea (1978) was simple:
If OLS and 2SLS give very different estimates, something’s wrong — probably
endogeneity.
Because:
●​ If everything is exogenous, both OLS and 2SLS are fine → estimates will be similar.

●​ If something is endogenous, OLS becomes biased, but 2SLS stays consistent →


estimates will differ.

So, testing for endogeneity = checking how different OLS and 2SLS estimates are.

⚙️ Step 3: How to actually test it (the regression way)


Instead of manually comparing coefficients, we use a small trick called the Regression
Test.​
It’s easier and more formal.

Step 3a: Get the “predicted part” and “unexplained part” of y₂


Estimate the reduced form for y₂ — that means regress y₂ on all exogenous variables (z₁, z₂,
z₃, z₄):
y₂ = π₀ + π₁z₁ + π₂z₂ + π₃z₃ + π₄z₄ + v₂
Here:
● v₂ is the reduced-form error (the part of y₂ not explained by the exogenous variables).

● If y₂ is correlated with u₁, then u₁ and v₂ will also be correlated.

We can't observe v₂, but we can estimate it with the OLS residuals from this regression: call them v̂₂.

Step 3b: Add those residuals into the main equation


Now take your main equation and add v̂₂ as an extra variable:
y₁ = β₀ + β₁y₂ + β₂z₁ + β₃z₂ + δ₁v̂₂ + error
Where u₁ = δ₁v₂ + e₁
Run this regression using OLS.

Step 3c: Test δ₁ = 0 using a t-test


●​ Null hypothesis (H₀): δ₁ = 0 → means v̂₂ is not related to u₁ → no endogeneity.

●​ If we reject H₀ (i.e., δ₁ is significantly ≠ 0), → endogeneity exists.


That’s it!​
So basically:
If v̂₂ helps explain y₁ → y₂ must have been correlated with u₁.​
If not → OLS was fine all along.
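
Steps 3a to 3c can be sketched in a few lines of NumPy. The data are simulated with made-up coefficients, and `endogeneity_t` is a hypothetical helper name. The test is run twice: once where y₂ really is endogenous and once where it is not.

```python
import numpy as np


def ols(y, X):
    """OLS coefficients, conventional standard errors, and residuals."""
    coef = np.linalg.lstsq(X, y, rcond=None)[0]
    resid = y - X @ coef
    n, k = X.shape
    sigma2 = resid @ resid / (n - k)
    se = np.sqrt(sigma2 * np.diag(np.linalg.inv(X.T @ X)))
    return coef, se, resid


def endogeneity_t(y1, y2, z_incl, z_excl):
    """t statistic on the first-stage residual in the augmented regression."""
    const = np.ones((len(y1), 1))
    z_all = np.hstack([const, z_incl, z_excl])
    _, _, v2_hat = ols(y2, z_all)                                    # Step 3a
    x_aug = np.hstack([const, y2[:, None], z_incl, v2_hat[:, None]])
    coef, se, _ = ols(y1, x_aug)                                     # Step 3b
    return coef[-1] / se[-1]                                         # Step 3c


rng = np.random.default_rng(7)
n = 5_000
z = rng.normal(size=(n, 4))            # columns: z1, z2 (included), z3, z4 (excluded)
ability = rng.normal(size=n)           # unobserved
y2 = 1.0 + z @ np.array([0.5, 0.5, 0.6, 0.6]) + ability + rng.normal(size=n)
noise = rng.normal(size=n)


def make_y1(u1):
    return 1.0 + 0.8 * y2 + z[:, :2] @ np.array([0.3, 0.3]) + u1


t_endog = endogeneity_t(make_y1(ability + noise), y2, z[:, :2], z[:, 2:])
t_exog = endogeneity_t(make_y1(noise), y2, z[:, :2], z[:, 2:])
print(f"t on v2_hat, y2 endogenous: {t_endog:.2f}")
print(f"t on v2_hat, y2 exogenous:  {t_exog:.2f}")
```

The sketch uses conventional standard errors; a heteroskedasticity-robust t statistic is a common alternative.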

🧮 Example — “Return to Education” (they mention)


They tested whether education is endogenous for working women.
●​ Coefficient on v̂₂ = 0.058, t = 1.67 → moderate evidence of correlation.

●​ So education might be mildly endogenous (correlated with error term).

● Hence, report both OLS and 2SLS results: the OLS return (10.8%) is well above the 2SLS return (6.1%), and OLS is biased if education really is endogenous.

🧩 For multiple variables


If you suspect more than one variable is endogenous:
●​ Get residuals (v̂₂, v̂₃, …) for each.

●​ Add all of them to the main equation.

● Run an F-test for joint significance of the residuals.

→ If they are jointly significant → at least one variable is endogenous.

Testing Overidentifying Restrictions

1️⃣ Recap: What an instrument must do


An instrument (IV) must satisfy two things:
1.​ Relevance: It must be correlated with the endogenous variable (e.g., years of
education).

2.​ Exogeneity: It must not be correlated with the error term in the main equation
(e.g., wage equation).

We can test relevance with t-tests or F-tests, but we cannot directly test exogeneity if we
only have just enough instruments.
2️⃣ What “overidentifying” means
●​ Suppose you have 1 endogenous variable, but 2 instruments.

●​ You only need 1 instrument to estimate the effect, so the second instrument is
“extra.”

● This extra instrument creates an overidentifying restriction: the estimates using different instruments should agree if all instruments are valid.

💡 Think of it like a puzzle: if all pieces are correct, they should fit perfectly. Extra pieces
(extra instruments) let you check if the fit is correct.

3️⃣ How the test works (intuition)


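1. Estimate your equation using 2SLS with all instruments → get residuals (û₁).

2. Regress those residuals on all exogenous variables (the instruments plus the exogenous regressors in the equation).

3. Compute a test statistic (nR²) → it basically measures whether residuals are correlated with the instruments. If all instruments are exogenous, nR² is approximately chi-square distributed, with degrees of freedom equal to the number of extra instruments.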

●​ If residuals are uncorrelated: Instruments are likely valid.

●​ If residuals are correlated: At least one instrument is not exogenous → fail the
test.

So the test uses the “extra” instruments to check if they are really valid.
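
The three steps can be sketched on simulated data as follows. Everything here is hypothetical: one endogenous variable, two excluded instruments (so one overidentifying restriction), made-up coefficients, and a helper named `overid_test`. With a single restriction the chi-square(1) p-value can be written with the standard library as erfc(√(stat/2)).

```python
import math

import numpy as np


def overid_test(y1, X, Z):
    """nR^2 overidentification statistic; p-value assumes ONE extra instrument."""
    # Step 1: 2SLS, keeping residuals based on the original regressors.
    X_hat = Z @ np.linalg.lstsq(Z, X, rcond=None)[0]
    beta = np.linalg.lstsq(X_hat, y1, rcond=None)[0]
    u_hat = y1 - X @ beta
    # Step 2: regress the residuals on all exogenous variables.
    fit_resid = u_hat - Z @ np.linalg.lstsq(Z, u_hat, rcond=None)[0]
    centered = u_hat - u_hat.mean()
    r2 = 1.0 - (fit_resid @ fit_resid) / (centered @ centered)
    # Step 3: n * R^2, compared with a chi-square(1) distribution.
    stat = max(len(y1) * r2, 0.0)
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value


rng = np.random.default_rng(8)
n = 5_000
z1, z2, z3 = rng.normal(size=(3, n))
ability = rng.normal(size=n)               # unobserved
y2 = 1.0 + 0.5 * z1 + 0.5 * z2 + 0.5 * z3 + ability + rng.normal(size=n)
u_valid = ability + rng.normal(size=n)
u_invalid = u_valid + 0.5 * z3             # z3 now affects y1 directly

const = np.ones(n)
X = np.column_stack([const, y2, z1])
Z = np.column_stack([const, z1, z2, z3])

stat_ok, p_ok = overid_test(1.0 + 0.8 * y2 + 0.3 * z1 + u_valid, X, Z)
stat_bad, p_bad = overid_test(1.0 + 0.8 * y2 + 0.3 * z1 + u_invalid, X, Z)
print(f"Valid instruments:   nR2 = {stat_ok:.2f}, p = {p_ok:.3f}")
print(f"One invalid (z3):    nR2 = {stat_bad:.2f}, p = {p_bad:.3g}")
```

A rejection says that at least one instrument fails exogeneity, but not which one; and passing the test does not prove validity, since instruments that are invalid in the same direction can still agree with each other.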

4️⃣ Example intuition


●​ Endogenous variable: education

●​ Instruments: mother’s education, father’s education

● Only need 1 instrument to estimate effect, but we have 2 → 1 overidentifying restriction

●​ Test: Do the 2SLS residuals correlate with these instruments?

o​ If no correlation → instruments pass test

o​ If correlation → at least one instrument is invalid


5️⃣ Key points to remember
●​ More instruments than needed → can test exogeneity.

●​ Exactly enough instruments → cannot test exogeneity (just-identified).

●​ Adding instruments can reduce standard errors, but only if they are truly
exogenous. Too many instruments → may bias 2SLS.

In short: “overidentifying restrictions” = extra instruments that let us check if our instruments are valid.
