
F-Test and Regression Analysis Questions

1) The document contains a problem set with multiple choice and analytical questions regarding regression analysis and hypothesis testing. 2) It includes regression output from 3 models examining the relationship between costs and output. The output is used to answer questions about sample sizes, R-squared values, and effects of variables in each model. 3) Questions also address hypothesis testing, model selection, issues like multicollinearity, and pooling time series data across periods.

Uploaded by

Sila Kapsata
Problem Set 6

Multiple Choice Questions

1. The critical value in the F-distribution depends on the degrees of freedom in the
numerator and denominator. How do you find the degrees of freedom in the
numerator?
(a) It is the number of observations minus the number of coefficients estimated
(N − K)
(b) It is the number of hypotheses being tested simultaneously (J)
(c) It is the number of coefficients being estimated (K)
(d) It is the number of observations minus the number of hypotheses tested (N − J)
2. The critical value in the F-distribution depends on the degrees of freedom in the
numerator and denominator. How do you find the degrees of freedom in the
denominator?
(a) It is the number of observations minus the number of coefficients estimated
(N − K)
(b) It is the number of hypotheses being tested simultaneously (J)
(c) It is the number of coefficients being estimated (K)
(d) It is the number of observations minus the number of hypotheses tested (N − J)
3. When performing an F-test, if the null hypothesis is H0: β1 = β2 = 0, what is
the alternative hypothesis?
(a) β1 ≠ 0 and β2 ≠ 0
(b) β1 ≠ 0 or β2 ≠ 0
(c) (β1 ≠ 0 and β2 = 0) or (β1 = 0 and β2 ≠ 0)
(d) β1 = β2 ≠ 0
4. How does omitting a relevant variable from a regression model affect the estimated
coefficients of the other variables in the model?
(a) they are biased downward and have smaller standard errors
(b) they are biased upward and have larger standard errors
(c) they are biased and the bias can be negative or positive
(d) they are unbiased but have larger standard errors
5. How does including an irrelevant variable in a regression model affect the estimated
coefficients of the other variables in the model?
(a) they are biased downward and have smaller standard errors
(b) they are biased upward and have larger standard errors
(c) they are biased and the bias can be negative or positive
(d) they are unbiased but have larger standard errors
6. Which of the following measures is NOT used to evaluate model specification?
(a) The adjusted R2
(b) Akaike Information Criterion
(c) Bayesian Information Criterion
(d) Jarque-Bera test
7. When are the R2 and adjusted R2 equal?
(a) When the model is correctly specified
(b) When K = 1
(c) When the error terms are normally distributed
(d) When an unrestricted model is estimated
8. When highly collinear variables are included in an econometric model, the coefficient
estimates are
(a) biased downward and have smaller standard errors
(b) biased upward and have larger standard errors
(c) biased and the bias can be negative or positive
(d) unbiased but have larger standard errors
9. When a set of variables with perfect collinearity is included in an econometric
model, the coefficient estimates are
(a) undefined
(b) unbiased
(c) biased upward
(d) biased, but the direction is unclear
10. If your regression results show a high R2, a high adjusted R2 and a significant F-test, but low
t-values for the coefficients, what is the most likely cause?
(a) omitted relevant variables
(b) irrelevant variables have been included
(c) multicollinearity
(d) heteroskedasticity

Analytical Questions

11. Past EXAM Question


The following output is taken from OLS regressions of three different models which
try to establish the effect of output (measured in kilograms) on total costs
(measured in £). The first model regresses the level of costs (costs) on the level of
output (output). The second model regresses the natural log of costs (log_cost)
on the natural log of output (log_output), and the third model regresses the level
of costs on the level of output, the square of output (output_sq) and the cube of
output (output_cub). Some of the regression output has been hidden.
Model 1
reg costs output

      Source |       SS       df       MS              Number of obs =
-------------+------------------------------           F(  1,    58) =  662.73
       Model |  733.336303     1  733.336303           Prob > F      =  0.0000
    Residual |  97.3749935    58  1.10653402           R-squared     =  0.8828
-------------+------------------------------           Adj R-squared =  0.8814
       Total |  830.711297    59  9.33383479           Root MSE      =  1.0519
------------------------------------------------------------------------------
       costs |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      output |   .5000000   .0250000            0.000
       _cons |   .6501553   .1677777     3.88   0.000     .3167323    .9835782
------------------------------------------------------------------------------

Model 2
reg log_cost log_output

      Source |       SS       df       MS              Number of obs =
-------------+------------------------------           F(  1,    58) =  185.50
       Model |                 1                       Prob > F      =  0.0000
    Residual |  10.0000000    58  .113636360           R-squared     =
-------------+------------------------------           Adj R-squared =  0.9155
       Total |  100.000000    59  1.69491530           Root MSE      =  1.3019
------------------------------------------------------------------------------
    log_cost |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
  log_output |   .6000000   .0272426    22.39   0.000     .5556884    .6639662
       _cons |  -2.447097   .1569509   -15.59   0.000    -2.759004    -2.13519
------------------------------------------------------------------------------

Model 3
reg costs output output_sq output_cub

      Source |       SS       df       MS              Number of obs =
-------------+------------------------------           F(   ,      ) =
       Model |  855.000000     3   285.00000           Prob > F      =  0.0000
    Residual |  95.0000000    56  1.69642860           R-squared     =  0.9000
-------------+------------------------------           Adj R-squared =  0.8806
       Total |  950.000000    59  16.1016950           Root MSE      =  4.0126
------------------------------------------------------------------------------
       costs |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      output |   .8000000   .2000000     4.00   0.000
   output_sq |  -0.003000   0.003000    -1.00   0.290    -9.07e-06    2.74e-06
  output_cub |   0.000001   0.000001     0.95   0.347    -1.36e-09    3.83e-09
       _cons |   .4343922   .2503542     1.74   0.086    -.0632955    .9320799
------------------------------------------------------------------------------

(a) Find the sample size in model 1


(b) Calculate the R2 value in model 2
(c) Interpret the estimated effect of output on costs in each model
(d) Test the hypothesis that the variable output has some explanatory power in
model 1 (use the 5% significance level for your test and the nearest critical
value in the Table for the relevant degrees of freedom)
(e) Calculate the F test of goodness of fit of the model as a whole in model 3
(f) Explain, briefly, how the adjusted R2 helps with model selection. Use this to
help you choose whether you prefer models 1, 2 or 3
(g) The regression output in Model 3 suggests the presence of what issue that
arises in many multiple regression models? Give reasons for your answer.
(h) Why might we worry if the OLS residuals are not normally distributed?

12. You have time series data for the period 1935-2000. You are given an estimate
of the effects of income (measured in £billion) and interest rates, (measured in
percentage points) on aggregate consumption expenditure (measured in £billion).
Ĉons = 10.00 + 0.90 Income − 6.00 IntRate        TSS = 70   ESS = 10
      (1.00)  (0.45)       (2.00)

You then split the data into two periods, and run 2 separate regressions

For the period 1935-1970:


Ĉons = 6.00 + 0.95 Income − 2.00 IntRate        TSS = 30   ESS = 10
      (1.00) (0.40)       (1.00)

For the period 1971-2000:


Ĉons = 14.00 + 0.85 Income − 10.00 IntRate        TSS = 20   ESS = 10
      (1.00)  (0.50)       (4.00)

Test the hypothesis that the data could be pooled across both time periods and
estimated as a single equation.
13. Consider the model Y = β0 + β1 X + u
(a) What is the formula for the Ordinary Least Squares estimate of β1?
(b) Under what conditions will Ordinary Least Squares produce an unbiased and
efficient estimate of β1 ?
(c) Prove that the Ordinary Least Squares estimate of β1 is unbiased.
14. A researcher is interested in how the proportion of the household budget spent on
transportation (WTRANS) depends on total household expenditure (measured in logs,
LOGEXP), the age of the household head (AGE) and the number of children
in the household (NUMKIDS). The researcher produces the following table of
estimates:
                     WTRANS
Log expenditure      0.0414
                    (0.0071)
Age of HH head      -0.0001
                    (0.0004)
No. of children     -0.0130
                    (0.0055)
Constant            -0.0315
                    (0.0322)
R2                   0.0247
N                    1,519
Standard errors reported in parentheses

(a) What was the theoretical model the researcher took to the data?
(b) Write down the estimated model
(c) Interpret the estimates
(d) Are there any variables you would exclude from the model? Why, or why
not?
(e) Predict the proportion of a budget that will be spent on transportation for a
one-child household when total expenditure and age are set at their sample
means (98.7 and 36 respectively)

Practical Questions

15. When estimating wage equations we expect that young, inexperienced workers will
have relatively low wages and that with additional experience their wages will rise,
but then begin to decline after middle age, as the worker nears retirement. This
lifecycle pattern of wages can be captured by introducing experience and the square
of experience to explain the level of wages.
Consider the theoretical model

Wage = β0 + β1 Exper + β2 Exper² + β3 Educ + u    (1)

(a) What is the marginal effect of experience on wages?


(b) What signs do you expect for each of the coefficients β1 and β2 and why?
(c) After how many years of experience do wages start to decline?
(d) Open the dataset [Link] (we used this dataset previously in Problem
Set 4)
i. Estimate a simple regression model of wages on years of experience
ii. Estimate a second model where you also include years of education
iii. Estimate the full theoretical model in (1) and interpret the estimates.
Are the estimates consistent with your expectations?
iv. Export the estimates from all three models in one single table
• To export the table you need the outreg2 command. (Recall that, if
necessary, you can download the command using ssc install outreg2.)
• After estimating each model you need to save the estimates in STATA’s
internal memory. Then tell STATA to put the estimates together in
one table with the outreg2 command. The syntax will look something
like:
reg . . . . . .
estimates store model1
reg . . . . . .
est sto model2
reg . . . . . .
est sto model3
outreg2 [model1 model2 model3] using datapath\Table1, replace word
v. Compare the coefficient on experience between the simple regression
model and the second model. What happens? Why? What does this
tell you about the correlation between experience and education?

16. The file [Link] available on Moodle contains 56 observations on variables
related to sales of cocaine powder in northeastern California over the period
1984-1991. The data are a subset of those used in the study

Caulkins, J.P. and R. Padman (1993) “Quantity Discounts and Quality Premia
for Illicit Drugs” Journal of the American Statistical Association, 88, 748-757

The variables are:


• PRICE = price per gram in dollars for a cocaine sale
• QUANT= number of grams of cocaine in a given sale
• QUAL = quality of the cocaine expressed as a percentage of purity
• TREND = a time variable with 1984=1 up to 1991=8
Consider the regression model

PRICE = β0 + β1 QUANT + β2 QUAL + β3 TREND + u

(a) What signs would you expect for the coefficients β1, β2 and β3? Explain.
(b) Estimate the model in STATA and interpret the coefficient estimates. Do the
signs of the coefficients conform to your expectations?
(c) What proportion of the variation in cocaine prices is explained jointly by
variation in quantity, quality and time?
17. Use the data [Link] to estimate the following wage equation:

ln(Wage) = β0 + β1 Educ + β2 Exper + β3 Hrswk + u

(a) Interpret the regression output.


(b) Test the hypothesis that an extra year of education increases the wage rate
by 10%
(c) Re-estimate the model with the additional variables EDUC × EXPER,
EDUC² and EXPER². Interpret the regression output
(d) Estimate the marginal effects ∂ln(Wage)/∂EDUC for a woman with 16 years of
education and 2 years of experience and for a woman with 12 years of education
and 2 years of experience. What can you say about the marginal effect of
education for women as education increases?
(e) Estimate the marginal effects ∂ln(Wage)/∂EDUC for a man with 16 years of education
and 2 years of experience and for a man with 12 years of education and 2
years of experience. What can you say about the marginal effect of education
for men as education increases?

Common questions

The degrees of freedom in the numerator for the F-distribution are given by the number of hypotheses being tested simultaneously, represented as J. For the denominator, it is the number of observations minus the number of coefficients being estimated, represented as (N - K).
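As a rough illustration of where J and N - K enter, the F statistic can be sketched from the restricted and unrestricted residual sums of squares. The function name and all numbers below are hypothetical and are not taken from the problem set.

```python
def f_statistic(rss_restricted, rss_unrestricted, j, n, k):
    """F = [(RSS_R - RSS_U) / J] / [RSS_U / (N - K)]."""
    numerator = (rss_restricted - rss_unrestricted) / j   # J numerator df
    denominator = rss_unrestricted / (n - k)              # N - K denominator df
    return numerator / denominator


# Hypothetical values, for illustration only.
j, n, k = 2, 50, 5
f_value = f_statistic(rss_restricted=120.0, rss_unrestricted=90.0, j=j, n=n, k=k)
print(f"F({j}, {n - k}) = {f_value:.2f}")
```

The computed value would then be compared with the critical value of the F-distribution with J and N - K degrees of freedom.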

Omitting a relevant variable from a regression model typically results in biased coefficients for the included variables. The direction of the bias can be positive or negative: it depends on the sign of the omitted variable's coefficient and on the sign of the correlation between the omitted variable and the included variables.
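The direction of the bias can be checked on a small constructed example. This is only a sketch: the data are made up, there is no error term, and the helper `ols_slope` is a hypothetical name. With no noise, the slope from the regression that omits x2 equals beta1 + beta2 * delta, where delta is the slope from regressing x2 on x1.

```python
def ols_slope(x, y):
    """Slope from a simple regression of y on x (with an intercept)."""
    x_bar = sum(x) / len(x)
    y_bar = sum(y) / len(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    return sxy / sxx


# Made-up data: x2 is positively correlated with x1.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0]
x2 = [2.0, 1.0, 4.0, 3.0, 6.0]
beta1, beta2 = 2.0, 3.0
y = [1.0 + beta1 * a + beta2 * b for a, b in zip(x1, x2)]  # no error term

delta = ols_slope(x1, x2)        # slope of the omitted x2 on the included x1
short_slope = ols_slope(x1, y)   # regression of y on x1 that omits x2
print(short_slope, beta1 + beta2 * delta)
```

Here beta2 and delta are both positive, so the short regression overstates beta1; flipping the sign of either one would reverse the direction of the bias.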

Including quadratic and cubic terms in a regression model allows for capturing non-linear relationships between variables. In the context of output and costs, adding these higher-order terms lets the marginal effect of output on costs change with the level of output, which a linear model cannot capture. The coefficients on these terms determine the shape of the cost function, and their significance helps infer whether the curvature is supported by the data.
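To see how the higher-order terms make the effect of output depend on its level, the derivative of a cubic cost function can be sketched as below. The function name and the output levels are hypothetical; the coefficient values simply mirror the point estimates shown for Model 3 and are used for illustration only.

```python
def marginal_effect(b1, b2, b3, q):
    """Derivative of b0 + b1*q + b2*q**2 + b3*q**3 with respect to q."""
    return b1 + 2 * b2 * q + 3 * b3 * q ** 2


# Illustrative coefficients; the output levels q are hypothetical.
for q in (0, 50, 100):
    print(q, marginal_effect(0.8, -0.003, 0.000001, q))
```

In a purely linear model the same derivative would be the constant b1 at every output level.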

A high R-squared and a significant F-test combined with low t-values likely indicate multicollinearity in the regression model. The F-test shows that the regressors are jointly significant, but because they are highly correlated with one another the model cannot separate their individual effects, so the individual t-values are low.

Multicollinearity leaves the coefficient estimates unbiased but inflates their standard errors, which reduces the statistical significance of individual predictors and makes it difficult to assess their individual effects.
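One common way to quantify this inflation is the variance inflation factor, VIF_j = 1 / (1 - R_j^2), where R_j^2 is the R-squared from an auxiliary regression of regressor j on the other regressors. The sketch below assumes that auxiliary R-squared is already known; the value 0.9 is hypothetical.

```python
def vif(r2_aux):
    """Variance inflation factor from the auxiliary R-squared of one regressor."""
    return 1.0 / (1.0 - r2_aux)


r2_aux = 0.9                      # hypothetical auxiliary R-squared
print(vif(r2_aux))                # factor by which the coefficient variance is inflated
print(vif(r2_aux) ** 0.5)         # corresponding factor for the standard error
```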

Adjusted R-squared accounts for the number of coefficients in the model: when a variable is added it rises only if the improvement in fit outweighs the loss of a degree of freedom (equivalently, if the added variable's t-statistic exceeds 1 in absolute value). Hence it is useful for model selection, as it penalizes the inclusion of unnecessary variables. Among models with the same dependent variable, the one with the higher adjusted R-squared is typically preferred as it indicates a better balance between model complexity and fit.
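The penalty can be made concrete with the usual formula, adjusted R2 = 1 - (1 - R2)(N - 1)/(N - K), where K counts all estimated coefficients including the constant. The function name and the numbers are hypothetical.

```python
def adjusted_r2(r2, n, k):
    """Adjusted R-squared; k counts all coefficients, including the constant."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - k)


# Hypothetical values: same R-squared, different numbers of coefficients.
print(adjusted_r2(0.5, n=21, k=1))   # constant only: no penalty applies
print(adjusted_r2(0.5, n=21, k=3))   # two extra coefficients lower the value
```

With K = 1 the factor (N - 1)/(N - K) equals one, so the two measures coincide.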

If OLS residuals are not normally distributed, the t and F statistics no longer follow their exact t and F distributions in small samples, so hypothesis tests and confidence intervals based on those distributions may be unreliable. In large samples the tests remain approximately valid. Normality is not one of the Gauss-Markov assumptions, so non-normal errors do not by themselves make OLS biased or stop it from being the best linear unbiased estimator.
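Normality of the residuals is often checked with the Jarque-Bera statistic, JB = (N/6)(S^2 + (K - 3)^2 / 4), where S is the sample skewness and K the sample kurtosis of the residuals. A minimal sketch follows; the function name and the residual values are made up for illustration.

```python
def jarque_bera(residuals):
    """Jarque-Bera statistic from sample skewness and kurtosis."""
    n = len(residuals)
    mean = sum(residuals) / n
    m2 = sum((e - mean) ** 2 for e in residuals) / n   # second central moment
    m3 = sum((e - mean) ** 3 for e in residuals) / n   # third central moment
    m4 = sum((e - mean) ** 4 for e in residuals) / n   # fourth central moment
    skewness = m3 / m2 ** 1.5
    kurtosis = m4 / m2 ** 2
    return n / 6.0 * (skewness ** 2 + (kurtosis - 3.0) ** 2 / 4.0)


# Made-up residuals, far too few for a real test.
print(jarque_bera([-2.0, -1.0, 0.0, 1.0, 2.0]))
```

Under the null hypothesis of normality the statistic is asymptotically chi-squared with 2 degrees of freedom, so large values count as evidence against normality.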

Pooling data across different time periods is appropriate when the underlying data-generating process is the same across the periods. This can be tested by checking whether the coefficients are stable across the periods, for example with a Chow test for structural stability.
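A Chow test compares the residual sum of squares from the pooled regression with the sum of the residual sums of squares from the two sub-period regressions. The sketch below assumes the same K coefficients are estimated in each regression; the function name and the numbers are hypothetical.

```python
def chow_f(rss_pooled, rss_1, rss_2, n, k):
    """Chow F statistic with k and n - 2k degrees of freedom."""
    rss_split = rss_1 + rss_2                 # unrestricted: separate regressions
    numerator = (rss_pooled - rss_split) / k  # k restrictions imposed by pooling
    denominator = rss_split / (n - 2 * k)     # 2k coefficients estimated in total
    return numerator / denominator


# Hypothetical values, for illustration only.
print(chow_f(rss_pooled=50.0, rss_1=20.0, rss_2=20.0, n=46, k=3))
```

A value above the critical value of F with K and N - 2K degrees of freedom would reject the hypothesis that the data can be pooled.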

Including an irrelevant variable leaves the estimated coefficients of the other variables unbiased, but their standard errors are larger when the irrelevant variable is correlated with the included ones, which reduces the precision of the estimates.

The presence of perfect collinearity between variables in a regression model means that the coefficients on the collinear variables are undefined. Perfect collinearity makes it mathematically impossible to distinguish the individual impact of the collinear variables on the outcome, so their separate effects cannot be estimated.
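The reason the estimates are undefined can be seen in a two-regressor example: when one regressor is an exact multiple of the other, the determinant of the centered cross-product matrix is zero, so the normal equations have no unique solution. The data and the helper name below are made up for illustration.

```python
def centered_cross_products(x1, x2):
    """Return S11, S22 and S12 for two regressors after demeaning."""
    m1 = sum(x1) / len(x1)
    m2 = sum(x2) / len(x2)
    s11 = sum((a - m1) ** 2 for a in x1)
    s22 = sum((b - m2) ** 2 for b in x2)
    s12 = sum((a - m1) * (b - m2) for a, b in zip(x1, x2))
    return s11, s22, s12


x1 = [1.0, 2.0, 3.0, 4.0]
x2 = [2.0 * v for v in x1]        # x2 is an exact multiple of x1

s11, s22, s12 = centered_cross_products(x1, x2)
determinant = s11 * s22 - s12 ** 2
print(determinant)                # the slope formulas divide by this quantity
```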
