Inference in Simple Regression
(SW Chapter 5)
Overview of where we are heading:
We want to learn about a population relation. We have data from a
sample, so there is sampling uncertainty. There are five steps towards
this goal:
1. State the population object of interest.
2. Provide an estimator of this population object.
3. Derive the sampling distribution of the estimator (this requires
certain assumptions). In large samples this sampling
distribution will be approximately normal (CLT).
4. The square root of the estimated variance of the sampling
distribution is the standard error (SE) of the estimator.
5. Use the SE to construct t-statistics (for hypothesis tests) and
confidence intervals.
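Object of interest is described by the population regression model:
$$Y_i = \beta_0 + \beta_1 X_i + u_i, \quad i = 1, \dots, n$$
$\beta_1 = \Delta Y / \Delta X$, for an autonomous change in X.
Estimator: the OLS estimator $\hat{\beta}_1$:
$$\hat{\beta}_1 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^{n} (X_i - \bar{X})^2} = \frac{s_{XY}}{s_X^2} \qquad (4.7)$$
The OLS estimator of the intercept:
$$\hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{X} \qquad (4.8)$$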
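A minimal sketch of (4.7) and (4.8) in Python, using only the standard library. The function name `ols_fit` and the five data points are made up for illustration; they are not from the California data set.

```python
def ols_fit(x, y):
    """Return (beta0_hat, beta1_hat) for a simple regression of y on x."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    # Numerator and denominator of (4.7); the 1/(n-1) factors of s_XY and s_X^2 cancel.
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    beta1_hat = s_xy / s_xx
    beta0_hat = y_bar - beta1_hat * x_bar  # equation (4.8)
    return beta0_hat, beta1_hat

# Made-up illustrative data (an assumption, not real observations).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
beta0_hat, beta1_hat = ols_fit(x, y)
print(beta0_hat, beta1_hat)
```

For these numbers the slope is about 1.99 and the intercept about 0.05.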
To derive the statistical properties, we relied on:
The Least Squares Assumptions:
1. E(u|X = x) = 0.
2. (Xi,Yi), i = 1, …, n, are i.i.d.
3. Large outliers are rare.
Under the Least Squares Assumptions, the C.L.T. assures that for n
large, $\hat{\beta}_j$ is approximately normally distributed:
$$\hat{\beta}_j \overset{A}{\sim} N\big(\beta_j, V(\hat{\beta}_j)\big) \quad \text{for } j = 0, 1$$
Note that: The expression of 𝑉(𝛽̂1 ) depends on V(𝑢𝑖 |𝑋).
Also, to put this into use, 𝑉(𝛽̂𝑗 ) has to be estimated.
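The large-sample normality can be checked by simulation. The sketch below is only an illustration under assumed settings: the true coefficients, the uniform X, the skewed (shifted exponential) errors, n = 100 and 2000 replications are all arbitrary choices, not part of the lecture. If the normal approximation is good, about 95% of the simulated slopes should lie within 1.96 standard deviations of their mean.

```python
import random
from statistics import mean, stdev

def ols_slope(x, y):
    """OLS slope estimate for a simple regression of y on x."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    return s_xy / s_xx

def simulate_slopes(n=100, reps=2000, beta0=1.0, beta1=2.0, seed=0):
    """Draw `reps` i.i.d. samples of size n and return the OLS slope of each."""
    rng = random.Random(seed)
    slopes = []
    for _ in range(reps):
        x = [rng.uniform(0.0, 10.0) for _ in range(n)]
        # Deliberately non-normal errors: exponential(1) shifted to have mean zero.
        u = [rng.expovariate(1.0) - 1.0 for _ in range(n)]
        y = [beta0 + beta1 * xi + ui for xi, ui in zip(x, u)]
        slopes.append(ols_slope(x, y))
    return slopes

slopes = simulate_slopes()
center, spread = mean(slopes), stdev(slopes)
share_within = sum(abs(b - center) <= 1.96 * spread for b in slopes) / len(slopes)
print(round(center, 3), round(share_within, 3))
```

Even though the errors are skewed, the simulated slopes center on the assumed true value of 2 and the share inside the ±1.96 band is close to 0.95.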
Remember that SE($\hat{\beta}_1$) is the positive square root of the estimated
variance of $\hat{\beta}_1$:
$$SE(\hat{\beta}_1) = +\sqrt{\hat{V}(\hat{\beta}_1)}$$
Remark: We use 𝑉̂ (𝛽̂1 ) (i.e., with ^ on top of V) as an estimator of
𝑉(𝛽̂1 ).
The expression of 𝑉(𝛽̂1 ) depends on V(𝑢𝑖 |𝑋).
• So, based on the assumption about V(𝑢𝑖 |𝑋), 𝑉̂ (𝛽̂1 ) differs.
• Therefore, 𝑆𝐸(𝛽̂1 ) differs.
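Formula for SE($\hat{\beta}_1$) – for the general case
(i.e., $V(u_i \mid X) = \sigma_i^2$, different for each i)
The expression for the variance of $\hat{\beta}_1$ (for large n) is:
$$V(\hat{\beta}_1) = \frac{1}{n} \cdot \frac{\mathrm{var}[(X_i - \mu_X) u_i]}{(\mathrm{var}[X_i])^2} = \frac{\sigma_v^2}{n (\sigma_X^2)^2}, \quad \text{where } v_i = (X_i - \mu_X) u_i.$$
The estimator of $V(\hat{\beta}_1)$ replaces the unknown population values of $\sigma_v^2$
and $\sigma_X^2$ by estimators constructed from the data:
$$\hat{V}(\hat{\beta}_1) = \hat{\sigma}^2_{\hat{\beta}_1} = \frac{1}{n} \cdot \frac{\text{estimator of } \sigma_v^2}{(\text{estimator of } \sigma_X^2)^2} = \frac{1}{n} \cdot \frac{\frac{1}{n-2} \sum_{i=1}^{n} \hat{v}_i^2}{\left[ \frac{1}{n} \sum_{i=1}^{n} (X_i - \bar{X})^2 \right]^2}$$
where $\hat{v}_i = (X_i - \bar{X}) \hat{u}_i$. [Do you remember the significance of "hats" (^)?]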
Formula for SE($\hat{\beta}_1$) – for the general case:
$$SE(\hat{\beta}_1) = +\sqrt{\hat{\sigma}^2_{\hat{\beta}_1}} = \text{the standard error of } \hat{\beta}_1,$$
$$\hat{\sigma}^2_{\hat{\beta}_1} = \frac{1}{n} \cdot \frac{\sum_{i=1}^{n} [(X_i - \bar{X}) \hat{u}_i]^2 / (n-2)}{\left[ \sum_{i=1}^{n} (X_i - \bar{X})^2 / n \right]^2}.$$
This looks complicated, but it is easily calculated:
• The numerator estimates $\sigma_v^2$, using $\hat{v}_i = (X_i - \bar{X}) \hat{u}_i$.
• Why n – 2? This is the degrees-of-freedom adjustment.
Because 2 coefficients have been estimated ($\beta_0$ and $\beta_1$), we have n – 2.
• The denominator estimates $[\sigma_X^2]^2$.
In practice SE($\hat{\beta}_1$) is computed by regression software. (Stata: robust)
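To see that the general-case formula is only a few sums, here is a hedged Python sketch that follows it term by term. The function name `robust_se_slope` and the data are hypothetical; in practice the regression software does this for you.

```python
from math import sqrt

def robust_se_slope(x, y):
    """Heteroskedasticity-robust SE of the OLS slope, following the slide formula."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    beta1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / s_xx
    beta0 = y_bar - beta1 * x_bar
    resid = [yi - beta0 - beta1 * xi for xi, yi in zip(x, y)]  # residuals u_hat_i
    # Numerator: estimator of sigma_v^2, with the n - 2 degrees-of-freedom adjustment.
    sigma_v2_hat = sum(((xi - x_bar) * ui) ** 2 for xi, ui in zip(x, resid)) / (n - 2)
    # Denominator: the estimator of sigma_X^2, which is squared in the formula.
    sigma_x2_hat = s_xx / n
    return sqrt(sigma_v2_hat / (n * sigma_x2_hat ** 2))

# Made-up illustrative data (an assumption, not real observations).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
se_robust = robust_se_slope(x, y)
print(round(se_robust, 4))
```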
Formula for SE($\hat{\beta}_1$) – for the special case
(i.e., $V(u_i \mid X) = \sigma_u^2$, the same for all i, independently of X.)
The expression for the unconditional variance of $\hat{\beta}_1$ (for large n) is:
$$V(\hat{\beta}_1) = \frac{1}{n} \cdot \frac{\mathrm{var}[u_i]}{\mathrm{var}[X_i]} = \frac{1}{n} \left( \frac{\sigma_u^2}{\sigma_X^2} \right)$$
The estimator of $V(\hat{\beta}_1)$ replaces the unknown population values of $\sigma_u^2$
and $\sigma_X^2$ by estimators constructed from the data:
$$\hat{V}(\hat{\beta}_1) = \frac{1}{n} \cdot \frac{\sum_{i=1}^{n} \hat{u}_i^2 / (n-2)}{\sum_{i=1}^{n} (X_i - \bar{X})^2 / n}$$
where $\hat{u}_i$ refers to residuals.
Then, SE($\hat{\beta}_1$) = $+\sqrt{\hat{V}(\hat{\beta}_1)}$ = the standard error of $\hat{\beta}_1$ can be
calculated.
Software packages have this option as well. (Stata: drop the robust)
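The special-case formula can be sketched the same way. Again, `homoskedastic_se_slope` and the data are hypothetical names and numbers chosen for illustration only.

```python
from math import sqrt

def homoskedastic_se_slope(x, y):
    """Homoskedasticity-only SE of the OLS slope, following the slide formula."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    beta1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / s_xx
    beta0 = y_bar - beta1 * x_bar
    resid = [yi - beta0 - beta1 * xi for xi, yi in zip(x, y)]  # residuals u_hat_i
    sigma_u2_hat = sum(ui ** 2 for ui in resid) / (n - 2)  # estimator of sigma_u^2
    sigma_x2_hat = s_xx / n                                 # estimator of sigma_X^2
    return sqrt(sigma_u2_hat / (n * sigma_x2_hat))

# Made-up illustrative data (an assumption, not real observations).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
se_homo = homoskedastic_se_slope(x, y)
print(round(se_homo, 4))
```

The only difference from the general case is that the residuals are not weighted by $(X_i - \bar{X})$ before squaring, which is legitimate only when the error variance does not depend on X.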
Precision of our estimates
Remark 1: The larger the sample size (i.e., n), the smaller the
variance of 𝛽̂1 .
Remark 2: The smaller the variance of the error term (i.e., 𝜎𝑢2 ), the
smaller the variance of 𝛽̂1 .
Remark 3: The larger the sample variance of X (i.e., $s_X^2 = \sum_i (X_i - \bar{X})^2 / (n - 1)$),
which is an estimator of $\sigma_X^2$, the smaller the
variance of $\hat{\beta}_1$.
Intuition: If there is more variation in X, then there is more
information in the data that you can use to locate the regression line.
This is most easily seen in a figure…
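The larger the variance of X, the smaller the variance of $\hat{\beta}_1$
Question: Which set of dots would yield a more accurate regression
line, blue dots, or black dots?
Hint: Blue dots are more "concentrated": the sample variance of X is smaller for the blue dots than for the black dots, $s_{X,\text{blue}}^2 < s_{X,\text{black}}^2$.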
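The answer can also be checked by simulation. This is a sketch under assumed settings: X is drawn uniformly on [4, 6] for the concentrated design and on [0, 10] for the spread-out design, the errors are standard normal, and n = 50 with 1000 replications. None of these values come from the lecture.

```python
import random
from statistics import stdev

def ols_slope(x, y):
    """OLS slope estimate for a simple regression of y on x."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    return s_xy / s_xx

def slope_spread(x_low, x_high, n=50, reps=1000, seed=1):
    """Standard deviation of the OLS slope across simulated samples."""
    rng = random.Random(seed)
    slopes = []
    for _ in range(reps):
        x = [rng.uniform(x_low, x_high) for _ in range(n)]
        u = [rng.gauss(0.0, 1.0) for _ in range(n)]
        y = [1.0 + 2.0 * xi + ui for xi, ui in zip(x, u)]
        slopes.append(ols_slope(x, y))
    return stdev(slopes)

sd_concentrated = slope_spread(4.0, 6.0)   # X tightly clustered, like the blue dots
sd_spread_out = slope_spread(0.0, 10.0)    # X widely spread, like the black dots
print(round(sd_concentrated, 3), round(sd_spread_out, 3))
```

With the same error variance and sample size, the design with five times the standard deviation of X gives slope estimates that vary about one fifth as much.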
Let’s compare the following two outputs:
Summary: There are two ways to compute standard errors:
• Homoskedasticity-only standard errors – these are valid only if
the errors are homoskedastic. (Strong assumption!)
These can be obtained by omitting the Stata option "robust."
• Heteroskedasticity-robust standard errors – these are valid in large
samples whether or not the errors are heteroskedastic; hence Stock & Watson
prefer them. These require the Stata option "robust."
The main advantage of the homoskedasticity-only standard errors is
that the formula is simpler. But the disadvantage is that the formula
is correct only if the errors are homoskedastic.
Since Stata calculates standard errors for us, it is better to adopt the
general approach.
Conventional way to report regression results concisely:
• Put standard errors in parentheses below the estimated
coefficients to which they apply.
• Write goodness of fit statistics on the same line as the equation.
$$\widehat{TestScore} = \underset{(10.4)}{698.9} - \underset{(0.52)}{2.28}\, STR, \quad R^2 = 0.0512, \quad SER = 18.6$$
How do we find these numbers?
The formulas are given in Stock & Watson, also in Lecture Notes.
Stata will find them for us :)
Caution: Stata reports the root mean squared error (RMSE) ≈ SER.
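For reference, the two goodness-of-fit numbers come from the residuals: $R^2 = 1 - SSR/TSS$ and $SER = \sqrt{SSR/(n-2)}$. A small Python sketch, with a hypothetical function name `fit_summary` and made-up data:

```python
from math import sqrt

def fit_summary(x, y):
    """Return (beta0_hat, beta1_hat, R^2, SER) for a simple regression of y on x."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    beta1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / s_xx
    beta0 = y_bar - beta1 * x_bar
    ssr = sum((yi - beta0 - beta1 * xi) ** 2 for xi, yi in zip(x, y))  # sum of squared residuals
    tss = sum((yi - y_bar) ** 2 for yi in y)                           # total sum of squares
    r2 = 1.0 - ssr / tss
    ser = sqrt(ssr / (n - 2))  # standard error of the regression
    return beta0, beta1, r2, ser

# Made-up illustrative data (an assumption, not real observations).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
beta0_hat, beta1_hat, r2, ser = fit_summary(x, y)
print(f"Y_hat = {beta0_hat:.2f} + {beta1_hat:.2f} X, R2 = {r2:.4f}, SER = {ser:.3f}")
```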
Hypothesis Testing using 𝛽̂𝑗 and the Standard Error of 𝛽̂𝑗
The objective is to reach a conclusion regarding the numerical value
of $\beta_1$, using data (information in a random sample).
General setup
Null hypothesis and two-sided alternative:
H0: $\beta_1 = \beta_{1,0}$ vs. H1: $\beta_1 \neq \beta_{1,0}$
where $\beta_{1,0}$ is the hypothesized numerical value under the null.
Null hypothesis and one-sided alternative:
H0: $\beta_1 = \beta_{1,0}$ vs. H1: $\beta_1 < \beta_{1,0}$
OR
H0: $\beta_1 = \beta_{1,0}$ vs. H1: $\beta_1 > \beta_{1,0}$
General approach: construct t-statistic, and compute p-value (or
obtain the critical value) using the standard normal c.d.f. table.
• In general:
$$t = \frac{\text{estimator} - \text{hypothesized value}}{\text{standard error of the estimator}}$$
where the SE of the estimator is the square root of an estimator
of the variance of the estimator.
• Math 201: test on the mean of Y is based on $t = \dfrac{\bar{Y} - \mu_{Y,0}}{SE(\bar{Y})}$;
• Econ 311: test on $\beta_1$ is based on $t_0 = \dfrac{\hat{\beta}_1 - \beta_{1,0}}{SE(\hat{\beta}_1)}$,
where SE($\hat{\beta}_1$) = the positive square root of a (consistent)
estimator of the variance of the sampling distribution of $\hat{\beta}_1$.
Hypothesis testing - mechanics: to test
H0: $\beta_1 = \beta_{1,0}$ vs. H1: $\beta_1 \neq \beta_{1,0}$,
construct the t-statistic
$$t_0 = \frac{\hat{\beta}_1 - \beta_{1,0}}{SE(\hat{\beta}_1)}.$$
(i) Reject H0 at 5% significance level if |t0 | > 1.96.
(ii) Calculate the p-value = 2 Pr(Z > | t0 |) for the two-sided
alternative, which is the probability contained in the tails of a
standard normal beyond |t0 |. Reject H0 if the p-value is “small”.
Note the relation between (i) and (ii): |t0| > 1.96 holds exactly when the
p-value is below 0.05, so the two rules lead to the same decision at the
5% significance level.
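The mechanics can be sketched in a few lines of Python with the standard normal from the standard library. The function name `t_test` and the example numbers (estimates 0.9 and 1.5, standard error 0.5) are hypothetical, chosen only to show one non-rejection and one rejection.

```python
from statistics import NormalDist

def t_test(beta_hat, se, beta_null=0.0, alpha=0.05):
    """Two-sided large-sample test of H0: beta = beta_null.

    Returns (t0, p_value, reject), using the standard normal approximation.
    """
    z = NormalDist()  # standard normal, mean 0 and standard deviation 1
    t0 = (beta_hat - beta_null) / se
    p_value = 2.0 * (1.0 - z.cdf(abs(t0)))   # probability in both tails beyond |t0|
    critical = z.inv_cdf(1.0 - alpha / 2.0)  # about 1.96 when alpha = 0.05
    return t0, p_value, abs(t0) > critical

# Hypothetical estimates, for illustration only.
t_a, p_a, reject_a = t_test(beta_hat=0.9, se=0.5)
t_b, p_b, reject_b = t_test(beta_hat=1.5, se=0.5)
print(round(t_a, 2), round(p_a, 4), reject_a)
print(round(t_b, 2), round(p_b, 4), reject_b)
```

In the first case the t-statistic is 1.8, below 1.96, and the p-value is above 0.05; in the second it is 3.0 and the p-value is below 0.05, so rules (i) and (ii) agree.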
Remarks:
• By opting for the standard normal, we engage in “practical”
inference.
• This procedure relies on the CLT. Typically n = 30 is large enough
for the approximation to be a good one.
• The term "t-statistic" invokes memories of "Student's t-distribution."
However, the t-statistic follows Student's t-distribution only under
special conditions (e.g., homoskedastic, normally distributed errors).
Zero slope null: The simplest (and most widely used) hypothesis
test sets the numerical value of $\beta_{1,0}$ to "zero":
H0: $\beta_1 = 0$ vs. H1: $\beta_1 \neq 0$
This test is known as the "test of (statistical) significance" of $\hat{\beta}_1$.
When $\beta_{1,0} = 0$, the test statistic
$$t_0 = \frac{\hat{\beta}_1 - \beta_{1,0}}{SE(\hat{\beta}_1)}$$
simplifies to the "t-ratio":
$$\text{t-ratio} = \frac{\hat{\beta}_1}{SE(\hat{\beta}_1)}.$$
In applications users typically check the t-ratio first, to decide
whether the estimated slope is (statistically) different from zero.
On our choice of notation: Stock and Watson refer to the numerical
value of the test statistic for a particular value of $\hat{\beta}_1$ as:
$$t^{act} = \frac{\hat{\beta}_1 - \beta_{1,0}}{SE(\hat{\beta}_1)}$$
where superscript "act" stands for "actual." This notation does not
capture the fact that for a given estimate $\hat{\beta}_1$ of the unknown
parameter $\beta_1$, the value of the test statistic depends on $\beta_{1,0}$, the
numerical value stated in the null hypothesis.
We chose our notation to emphasize the link between the value of
the test statistic and the hypothesized value $\beta_{1,0}$:
$$t_0 = \frac{\hat{\beta}_1 - \beta_{1,0}}{SE(\hat{\beta}_1)}.$$
Example: Test Scores and STR, California school data
A convenient method for summarizing the regression results is to
write down the estimated regression equation, where the standard
errors are shown under the estimated coefficients:
$$\widehat{TestScore} = \underset{(10.4)}{698.9} - \underset{(0.52)}{2.28}\, STR$$
That is, SE($\hat{\beta}_0$) = 10.4, SE($\hat{\beta}_1$) = 0.52.
The t-ratio for the slope is: –2.28/0.52 = –4.38.
(i) At the 1% significance level, the 2-sided critical value from the
standard normal table is z*= 2.58; since |–4.38| > 2.58, we
reject the null at the 1% significance level.
Note that with a t-ratio of this magnitude, the evidence against the
null is “very” strong. Even if we were to impose a tougher standard
than 1%, we would still reject the null.
Question: What is the toughest standard that we can apply, and
still reject the null?
Answer: p-value (marginal level of significance).
(ii) We can compute the p-value for the two-sided alternative from
the standard normal table:
With Z ~ N(0, 1),
p-value = 2Pr(Z > |–4.38|) ≈ 0.00001 (= $10^{-5}$).
When the p-value is this small, we often write: “p-value << 0.01”
and say “the evidence against the null is extremely strong.”
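The "toughest standard" question can be checked numerically. The sketch below uses the reported slope and standard error (–2.28 and 0.52) and Python's standard normal; the particular list of significance levels is an arbitrary choice for illustration.

```python
from statistics import NormalDist

z = NormalDist()  # standard normal
t_ratio = -2.28 / 0.52
p_value = 2.0 * (1.0 - z.cdf(abs(t_ratio)))

# Decision at increasingly tough significance levels (levels chosen for illustration).
decisions = {}
for alpha in (0.01, 0.001, 0.0001, 0.00001):
    critical = z.inv_cdf(1.0 - alpha / 2.0)  # two-sided critical value
    decisions[alpha] = abs(t_ratio) > critical
    print(alpha, round(critical, 3), decisions[alpha])

print(p_value)
```

The null is rejected at the 1%, 0.1% and 0.01% levels but not at the 0.001% level, because the p-value is slightly above $10^{-5}$. The p-value is exactly the boundary between the levels at which we reject and those at which we do not.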
Geometry:
p-value << 0.01
Confidence Intervals for $\beta_1$
Recall that a 95% confidence interval is, equivalently:
• The set of numerical values $\beta_{1,0}$ that cannot be rejected at the 5%
significance level;
• An interval computed as a function of the data that contains the
true parameter value 95% of the time in repeated samples.
CLT: The t-statistic for $\beta_1$ is approximately distributed as Z ~ N(0,1) in large samples. Thus
the (approximate) 95% symmetric confidence interval for $\beta_1$ is:
$$\hat{\beta}_1 \pm 1.96 \times SE(\hat{\beta}_1).$$
Example: Test Scores and STR, California school data
$$\widehat{TestScore} = \underset{(10.4)}{698.9} - \underset{(0.52)}{2.28}\, STR$$
The parameter of interest is $\beta_1$. We have $\hat{\beta}_1 = -2.28$ and
SE($\hat{\beta}_1$) = 0.52. The (approximate) 95% confidence interval for $\beta_1$ is:
$$\hat{\beta}_1 \pm 1.96 \times SE(\hat{\beta}_1) = -2.28 \pm 1.96 \times 0.52 = (-3.30, -1.26).$$
The following two statements are equivalent:
• The 95% confidence interval does not include zero;
• The hypothesis $\beta_1 = 0$ is rejected at the 5% level.
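The interval arithmetic is easy to reproduce. A minimal Python sketch, with a hypothetical helper `confidence_interval` that takes the critical value from the standard normal rather than hard-coding 1.96:

```python
from statistics import NormalDist

def confidence_interval(beta_hat, se, level=0.95):
    """Symmetric large-sample confidence interval: beta_hat +/- z * se."""
    z_crit = NormalDist().inv_cdf(0.5 + level / 2.0)  # about 1.96 for level = 0.95
    return beta_hat - z_crit * se, beta_hat + z_crit * se

# Slope estimate and standard error as reported in the example above.
lower, upper = confidence_interval(-2.28, 0.52)
print(round(lower, 2), round(upper, 2))

# The interval excludes zero, matching the rejection of beta_1 = 0 at the 5% level.
excludes_zero = not (lower <= 0.0 <= upper)
print(excludes_zero)
```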
Test Scores and STR, California school data:
regress testscr str, robust
Regression with robust standard errors Number of obs = 420
F( 1, 418) = 19.26
Prob > F = 0.0000
R-squared = 0.0512
Root MSE = 18.581
-------------------------------------------------------------------------
| Robust
testscr | Coef. Std. Err. t P>|t| [95% Conf. Interval]
--------+----------------------------------------------------------------
str | -2.279808 .5194892 -4.38 0.000 -3.300945 -1.258671
_cons | 698.933 10.36436 67.44 0.000 678.5602 719.3057
-------------------------------------------------------------------------
_cons denotes the intercept. Slope is identified with variable name.
$$\widehat{TestScore} = \underset{(10.4)}{698.9} - \underset{(0.52)}{2.28}\, STR, \quad R^2 = 0.0512, \quad SER = 18.6.$$
t-ratio for $\beta_1$ = –4.38; p-value = 0.000 (2-sided)
Approx. 95% confidence interval for $\beta_1$ is: (–3.30, –1.26).
Statistical inference about the intercept, $\beta_0$: summary
Estimation:
• OLS estimator of $\beta_0$ is $\hat{\beta}_0$.
• $\hat{\beta}_0$ has an approx. normal distribution in large samples.
Testing:
• H0: $\beta_0 = \beta_{0,0}$ vs. H1: $\beta_0 \neq \beta_{0,0}$ ($\beta_{0,0}$ is the value of $\beta_0$ under H0).
• $t_0 = (\hat{\beta}_0 - \beta_{0,0}) / SE(\hat{\beta}_0)$.
• p-value = area under standard normal outside $(-|t_0|, |t_0|)$.
Confidence Intervals:
• 95% confidence interval for $\beta_0$ is $[\hat{\beta}_0 \pm 1.96 \times SE(\hat{\beta}_0)]$.
This is the set of values $\beta_{0,0}$ that are not rejected at the 5% level.
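As a worked illustration with the rounded estimates from the test score regression ($\hat{\beta}_0 = 698.9$, SE = 10.4), and taking $\beta_{0,0} = 0$ as the hypothesized value purely for illustration:

```latex
\begin{align*}
t_0 &= \frac{\hat{\beta}_0 - \beta_{0,0}}{SE(\hat{\beta}_0)} = \frac{698.9 - 0}{10.4} \approx 67.2, \\
\text{95\% CI for } \beta_0 &= 698.9 \pm 1.96 \times 10.4 = 698.9 \pm 20.384 \approx (678.5,\ 719.3).
\end{align*}
```

Since $|t_0|$ is far above 1.96, the null $\beta_0 = 0$ is rejected at the 5% level, and accordingly zero lies outside the interval.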