Data Analytics
DS342
12/27/2025
Chapter 7
Regression Analysis
Introductory Case: College Scorecard
• College costs and student debt are on the rise
• Students and parents struggle to find clear, reliable data on critical questions of college affordability
and value.
• The Department of Education (DOE) published a redesigned College Scorecard that reports the most
reliable national data on college costs and students’ outcomes at specific colleges.
• Fiona Schmidt, a college counselor, believes that the information from the College Scorecard can help
her as she advises families.
• Fiona wonders what college factors influence post-college earnings and wants answers to the following
questions:
• If a college costs more or has a higher graduation rate, should a student expect to earn more after
graduation?
• If a greater percentage of the students are paying down debt after college, does this somehow
influence post-college earnings?
• And finally, does the location of a college affect post-college earnings?
Introductory Case: College Scorecard
• Information from 116 colleges on the following variables:
• Annual post-college earnings (Earnings in $)
• The average annual cost (Cost in $)
• The graduation rate (Grad in %)
• The percentage of students paying down debt (Debt in %)
• Whether or not a college is located in a city (City equals 1 if a city location, 0 otherwise)
• Use the information to:
• Make predictions for post-college earnings using regression analysis
• Interpret goodness-of-fit measures for the post-college earnings model
• Determine which factors are statistically significant in explaining post-college earnings
Prediction
• Predictions are subject to sampling variability.
• The prediction will change if we use a different sample to estimate the
regression model.
• There is a distinction between the interval estimate for the mean of the
response and the interval estimate for the individual value of the
response.
• Confidence interval: mean.
• Prediction interval: individual.
• Prediction intervals are always wider than the corresponding confidence intervals,
because an individual value varies more than a mean.
Prediction
• The point prediction, or best guess, is found by substituting the given
values of the Xs into the estimated regression equation.
• To measure the accuracy of the point predictions, calculate standard errors of
prediction.
• Standard error of prediction for a single Y:
• This error is approximately equal to the standard error of estimate.
• Standard error of prediction for the mean Y:
• This error is approximately equal to the standard error of estimate divided by the square root of
the sample size.
Prediction
• These standard errors can be used to calculate a
95% prediction interval for an individual value and a 95%
confidence interval for a mean value.
• Go out a tabulated t-multiple of the relevant standard error on either side of the
point prediction: point prediction ± t × standard error.
• The term prediction interval (rather than confidence interval) is used for
an individual value because an individual value of Y is not a population
parameter.
• However, the interpretation is basically the same.
Prediction
• Example: Consider the below model from the College Scorecard
case:
Earnings = β0 + β1 Cost + β2 Grad + β3 Debt + β4 City + ε
• Construct the 95% confidence interval for the expected Earnings if
Cost equals $25,000, Grad equals 60, Debt equals 80, and City
equals 1.
• Construct the 95% prediction interval for an individual college's Earnings if
Cost equals $25,000, Grad equals 60, Debt equals 80, and City
equals 1.
Prediction
• df = n − k − 1 = 116 − 4 − 1 = 111, \( t_{0.025,\,111} = 1.982 \), \( s_e = 5{,}645.83 \)
For the confidence interval:
• \( \hat{y} = 45{,}408.8 \)
• \( \hat{y} \pm t_{\alpha/2,\,df} \cdot \dfrac{s_e}{\sqrt{n}} = 45{,}408.8 \pm 1.982 \cdot \dfrac{5{,}645.83}{\sqrt{116}} = [44{,}370.06,\ 46{,}447.54] \)
With 95% confidence, we can state that mean Earnings falls between
$44,370.06 and $46,447.54.
For the prediction interval:
• \( \hat{y} \pm t_{\alpha/2,\,df} \cdot s_e = 45{,}408.8 \pm 1.982 \cdot 5{,}645.83 = [34{,}221.21,\ 56{,}596.39] \)
With 95% confidence, an individual college's Earnings fall between $34,221.21 and $56,596.39.
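The two intervals can be reproduced with a short Python sketch. It assumes the approximations stated earlier (standard error of prediction roughly sₑ for an individual value and sₑ/√n for the mean) and takes ŷ, sₑ, n, and the tabulated t-value from the slide. The function name `interval` is only illustrative. Because the sketch uses the rounded t-value 1.982, its endpoints can differ slightly from the slide's figures.

```python
import math

def interval(point, t_crit, std_err):
    """Return (lower, upper): point prediction +/- t_crit standard errors."""
    margin = t_crit * std_err
    return point - margin, point + margin

# Values taken from the College Scorecard slide
y_hat = 45_408.8   # point prediction of Earnings
se = 5_645.83      # standard error of the estimate
n = 116            # sample size
t_crit = 1.982     # tabulated t(0.025, df = 111), rounded

# Confidence interval for mean Earnings: approximate standard error se / sqrt(n)
ci_low, ci_high = interval(y_hat, t_crit, se / math.sqrt(n))

# Prediction interval for an individual college: approximate standard error se
pi_low, pi_high = interval(y_hat, t_crit, se)

print(f"95% CI for the mean:      [{ci_low:,.2f}, {ci_high:,.2f}]")
print(f"95% PI for an individual: [{pi_low:,.2f}, {pi_high:,.2f}]")
```

The prediction interval is wider by a factor of about √n, which is why it spans roughly $22,000 while the confidence interval spans roughly $2,000.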
Model Selection
• Example: Recall the College Scorecard case and
consider three models. Which should we choose?
Model 1: Earnings = β0 + β1 Cost + ε
Model 2: Earnings = β0 + β1 Cost + β2 Grad + β3 Debt + ε
Model 3: Earnings = β0 + β1 Cost + β2 Grad + β3 Debt + β4 City + ε
• Several “goodness-of-fit” measures summarize how
well the sample regression equation fits the data.
• The standard error of the estimate, sₑ.
• The coefficient of determination, R².
• The adjusted coefficient of determination, adjusted R².
Model Selection
• Recall that a residual is the difference between the observed and predicted value of
the response,
→ \( e_i = y_i - \hat{y}_i \).
• The sample regression equation provides a good fit when the dispersion of the
residuals is relatively small.
• The sample variance of the residuals, sₑ², is the average squared deviation between the observed
and predicted values.
• The standard deviation of the residuals, or standard error of the estimate, has the
same units of measurement as the response.
→ \( s_e = \sqrt{\dfrac{SSE}{n-k-1}} \)
• SSE is the error sum of squares, k denotes the number of predictors, and n is the
sample size.
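As a minimal sketch, the standard error of the estimate can be computed directly from observed and predicted responses. The function name and the five data points below are hypothetical and serve only to illustrate the formula.

```python
import math

def standard_error_of_estimate(observed, predicted, k):
    """s_e = sqrt(SSE / (n - k - 1)) for a model with k predictors."""
    n = len(observed)
    if n - k - 1 <= 0:
        raise ValueError("need more observations than estimated coefficients")
    # Error sum of squares: sum of squared residuals e_i = y_i - y_hat_i
    sse = sum((y - y_hat) ** 2 for y, y_hat in zip(observed, predicted))
    return math.sqrt(sse / (n - k - 1))

# Hypothetical observed and predicted responses (not from the slides)
y = [10.0, 12.0, 15.0, 19.0, 24.0]
y_hat = [9.0, 13.0, 16.0, 19.0, 23.0]

s_e = standard_error_of_estimate(y, y_hat, k=1)
print(f"s_e = {s_e:.4f}")
```

Here SSE = 4 and n − k − 1 = 3, so sₑ is the square root of 4/3, expressed in the same units as the response.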
Model Selection
• For a fixed sample size, adding predictors changes both the numerator (SSE) and the
denominator (n − k − 1) of sₑ: SSE cannot increase, while n − k − 1 falls.
→ The overall effect helps determine whether new predictors truly improve the model.
• When comparing models with the same response variable, the model with the smaller
standard error of the estimate (sₑ) is preferred.
• The coefficient of determination (R²) measures how much of the variation in the
response is explained by the regression model.
• R² is the ratio of explained variation to total variation in the response variable.
Model Selection
• We cannot use R² for model comparison when the competing models do not include the
same number of predictor variables (even if they have the same response).
• R² never decreases as we add more variables.
→ A model chosen on R² alone may include variables with no economic or intuitive foundation.
• Adjusted R² explicitly accounts for the sample size n and the number of predictor
variables k.
\( \text{Adjusted } R^2 = 1 - (1 - R^2)\left(\dfrac{n-1}{n-k-1}\right) \)
• Imposes a penalty for any additional predictors.
• The higher the adjusted R², the better the model.
• When comparing models with the same response, the model with the higher adjusted R² is
preferred.
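The adjusted R² formula can be checked with a few lines of Python. The sketch below applies it to the R² values reported for the three College Scorecard models in this section (n = 116); the function name is illustrative. Since the reported R² values are rounded to four decimals, the results should agree with the reported adjusted R² values only to within rounding.

```python
def adjusted_r_squared(r2, n, k):
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - k - 1)."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

n = 116  # colleges in the sample

# (R^2, number of predictors k) for the three College Scorecard models
models = {
    "Model 1": (0.2767, 1),
    "Model 2": (0.4023, 3),
    "Model 3": (0.4292, 4),
}

adjusted = {name: adjusted_r_squared(r2, n, k) for name, (r2, k) in models.items()}

for name, value in adjusted.items():
    print(f"{name}: adjusted R^2 = {value:.4f}")
```

The penalty grows with k: Model 3 gives up about two percentage points of R², Model 1 less than one.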
Model Selection
• Example: Recall the introductory case and consider three models.
Model 1: Earnings = β0 + β1 Cost + ε
Model 2: Earnings = β0 + β1 Cost + β2 Grad + β3 Debt + ε
Model 3: Earnings = β0 + β1 Cost + β2 Grad + β3 Debt + β4 City + ε
                                      Model 1      Model 2      Model 3
Standard error of the estimate, sₑ    6,271.4407   5,751.8065   5,645.8306
Coefficient of determination, R²      0.2767       0.4023       0.4292
Adjusted R²                           0.2703       0.3862       0.4087
• a. Which of the three models is the preferred model?
• b. Interpret the coefficient of determination for the preferred model.
• c. What percentage of the sample variation in annual post-college earnings is unexplained
by the preferred model?
Model Selection
• Example:
                                      Model 1      Model 2      Model 3
Standard error of the estimate, sₑ    6,271.4407   5,751.8065   5,645.8306
Coefficient of determination, R²      0.2767       0.4023       0.4292
Adjusted R²                           0.2703       0.3862       0.4087
• a. Model 3 is preferred: it has the lowest standard error of the estimate and the highest adjusted R².
• b. Model 3 explains 42.92% of the sample variation in the earnings.
• c. Model 3 does not explain 57.08% of the sample variation in earnings.
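The selection logic in this example can be written as a small Python sketch. The values come from the table above; the `Fit` record and variable names are illustrative.

```python
from typing import NamedTuple

class Fit(NamedTuple):
    name: str
    se: float      # standard error of the estimate
    r2: float      # coefficient of determination
    adj_r2: float  # adjusted R^2

# Goodness-of-fit measures from the table
fits = [
    Fit("Model 1", 6271.4407, 0.2767, 0.2703),
    Fit("Model 2", 5751.8065, 0.4023, 0.3862),
    Fit("Model 3", 5645.8306, 0.4292, 0.4087),
]

# Same response in every model, so prefer the smallest s_e / largest adjusted R^2
best_by_se = min(fits, key=lambda f: f.se)
best_by_adj = max(fits, key=lambda f: f.adj_r2)

preferred = best_by_adj
explained_pct = preferred.r2 * 100
unexplained_pct = (1 - preferred.r2) * 100

print(f"Lowest s_e: {best_by_se.name}; highest adjusted R^2: {best_by_adj.name}")
print(f"{preferred.name} explains {explained_pct:.2f}% and leaves "
      f"{unexplained_pct:.2f}% of the sample variation unexplained")
```

In this case both criteria point to the same model; when they disagree, the choice needs more judgment than this sketch encodes.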
Model Selection
• Note that the goodness-of-fit measures discussed in this section use the
same sample both to build the model and to assess it.
• Unfortunately, this procedure does not gauge how well the estimated
model will predict in an unseen sample.
• We will discuss cross-validation techniques that evaluate predictive
models by dividing the original sample into a training set to build (train)
the model and a validation set to evaluate (validate) it.
• The validation set is used to provide an independent performance
assessment by exposing the model to unseen data.
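A minimal holdout sketch of the training/validation idea follows. Everything in it is an assumption for illustration: the data are synthetic (y = 3 + 2x plus noise), the split is 70/30, the model is a one-predictor least-squares line, and RMSE is used as the performance measure. It is not the specific cross-validation procedure the course goes on to present.

```python
import math
import random

def fit_simple_ols(xs, ys):
    """Least-squares intercept and slope for y = b0 + b1 * x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    return b0, b1

def rmse(xs, ys, b0, b1):
    """Root mean squared prediction error of the fitted line on (xs, ys)."""
    squared_errors = [(y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

rng = random.Random(42)  # fixed seed so the split is reproducible

# Synthetic sample (hypothetical): y = 3 + 2x + standard normal noise
data = []
for _ in range(100):
    x = rng.uniform(0, 10)
    data.append((x, 3 + 2 * x + rng.gauss(0, 1)))

# Shuffle, then hold out 30% of the sample as the validation set
rng.shuffle(data)
split = len(data) * 7 // 10
train, valid = data[:split], data[split:]
train_x, train_y = zip(*train)
valid_x, valid_y = zip(*valid)

# Build the model on the training set only, then score it on both sets
b0, b1 = fit_simple_ols(train_x, train_y)
train_rmse = rmse(train_x, train_y, b0, b1)
valid_rmse = rmse(valid_x, valid_y, b0, b1)

print(f"fitted line: y = {b0:.2f} + {b1:.2f}x")
print(f"training RMSE = {train_rmse:.3f}, validation RMSE = {valid_rmse:.3f}")
```

The validation RMSE is the number to compare across candidate models, since the validation observations played no part in estimating the coefficients.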
Testing the Model for Significance
• I. Testing the Significance of the Overall Model
• II. Testing the Significance of the Coefficients of the Model
Testing the Model for Significance
• When the sample size is too small, you can get good values for
MSE and r² even if there is no relationship between the variables.
• Testing the model for significance helps determine if the values
are meaningful.
• We do this by performing a statistical hypothesis test.
Regression as Analysis of Variance
ANOVA conducts an F-test to determine whether variation in Y is due to varying
levels of X (to test for significance of regression).
We start with the general linear model
\( Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + \cdots + \beta_k X_k + \varepsilon \)
H0: all population slope coefficients (βi) = 0
H1: at least one of the population slope coefficients (βi) ≠ 0
◼ With a single predictor, the null hypothesis β1 = 0 states that there is no linear relationship between X and Y
◼ The alternate hypothesis is that there is a linear relationship (β1 ≠ 0)
◼ If the null hypothesis can be rejected, we have statistical evidence that there is a relationship
Measuring the Fit of the Regression Model
◼ Regression models can be developed for any variables X
and Y
◼ How do we know the model is actually helpful in predicting
Y based on X?
◼ Three measures of variability are
◼ SST – Total variability about the mean
◼ SSE – Variability about the regression line
◼ SSR – Total variability that is explained by the regression model
Measuring the Fit of the Regression Model
[Figure: Sales ($100,000) plotted against Payroll ($100 million) with the fitted line Ŷ = 2 + 1.25X. It marks the deviations Y − Ȳ (SST, total variability about the mean), Ŷ − Ȳ (SSR, variability explained by the regression line), and Y − Ŷ (SSE, variability about the regression line).]
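The three sums of squares can be computed in a few lines of Python. The six (payroll, sales) pairs below are an assumption: the slides do not list the data, so these values were chosen only because their least-squares line is Ŷ = 2 + 1.25X, the line shown in the figure.

```python
# Illustrative data (assumed, not from the slides): payroll x and sales y
x = [3, 4, 6, 4, 2, 5]
y = [6, 8, 9, 5, 4.5, 9.5]

# Fitted line from the figure; it is the least-squares line for these points
b0, b1 = 2.0, 1.25

y_bar = sum(y) / len(y)
y_hat = [b0 + b1 * xi for xi in x]

# SST: total variability about the mean
sst = sum((yi - y_bar) ** 2 for yi in y)
# SSR: variability explained by the regression line
ssr = sum((yh - y_bar) ** 2 for yh in y_hat)
# SSE: variability about the regression line (squared residuals)
sse = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))

r2 = ssr / sst  # share of total variability explained

print(f"SST = {sst}, SSR = {ssr}, SSE = {sse}")
print(f"SSR + SSE = {ssr + sse}, r^2 = {r2:.4f}")
```

For a least-squares line with an intercept, SST = SSR + SSE, so r² can equivalently be computed as 1 − SSE/SST.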
Analysis of Variance (ANOVA) Table
◼ When software is used to develop a regression model, an ANOVA table is typically
created that shows the observed significance level (p-value) for the calculated F value
◼ This can be compared to the level of significance (α) to make a decision
This is a right-tailed F test
I. Testing the Significance of the Overall Model
▪ If there is very little error, the MSE would be small and the F-statistic
would be large indicating the model is useful.
▪ If the F-statistic is large, the significance level (p-value) will be low,
indicating it is unlikely this would have occurred by chance.
▪ So when the F-value is large, we can reject the null hypothesis and
conclude that there is a linear relationship between X and Y and that the
values of the MSE and r² are meaningful.
I. Testing the Significance of the Overall Model
Make a decision using one of the following methods
a) Reject the null hypothesis if the test statistic is greater than the F-value
from the statistical tables. Otherwise, do not reject the null hypothesis:
Reject if \( F_{\text{calculated}} > F_{\alpha,\,df_1,\,df_2} \)
\( df_1 = k \)
\( df_2 = n - k - 1 \)
b) Reject the null hypothesis if the observed significance level, or p-value, is less than
the level of significance (α). Otherwise, do not reject the null hypothesis:
p-value = P(F > calculated test statistic)
Reject if p-value < α
College Scorecard Example
𝑬𝒂𝒓𝒏𝒊𝒏𝒈𝒔 = 𝜷𝟎 + 𝜷𝟏 𝑪𝒐𝒔𝒕 + 𝜷𝟐 𝑮𝒓𝒂𝒅 + 𝜷𝟑 𝑫𝒆𝒃𝒕 + 𝜷𝟒 𝑪𝒊𝒕𝒚 + 𝜺
Given α = 0.05
Calculate the value of the test statistic:
\( F = \dfrac{MSR}{MSE} = \dfrac{665{,}172{,}989.8}{31{,}875{,}403.3} = 20.87 \)
The value of F associated with a 5% level of significance and with degrees of freedom 4 and
111 from statistical tables is: \( F_{0.05,\,4,\,111} = 2.4 \)
\( F_{\text{calculated}} = 20.87 \)
Reject H0 because 20.87 > 2.4
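The overall F-test for the College Scorecard model can be sketched in Python as follows. MSR, MSE, n, k, and the tabulated critical value are taken from the slides. The cross-check through R² relies on the standard identity F = (R²/k) / ((1 − R²)/(n − k − 1)), which is added here as an assumption since the slides do not state it.

```python
# Values from the College Scorecard slides
msr = 665_172_989.8   # mean square regression
mse = 31_875_403.3    # mean square error
n, k = 116, 4         # sample size and number of predictors

df1, df2 = k, n - k - 1   # numerator and denominator degrees of freedom
f_calc = msr / mse        # test statistic

f_crit = 2.4              # tabulated F(0.05, 4, 111), as given on the slide
reject_h0 = f_calc > f_crit   # right-tailed test

# Cross-check: the same statistic recovered from R^2 (rounded to 4 decimals)
r2 = 0.4292
f_from_r2 = (r2 / k) / ((1 - r2) / df2)

print(f"df = ({df1}, {df2}), F = {f_calc:.2f}, critical value = {f_crit}")
print(f"F from R^2 = {f_from_r2:.2f}")
print("Reject H0" if reject_h0 else "Do not reject H0")
```

The two routes to F agree to within rounding, which is a quick consistency check on an ANOVA table.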
College Scorecard Example
◼ We can conclude there is a statistically
significant relationship between the X’s and Y
◼ The R² value of 0.4292 means about 43% of the
variability in earnings (Y) is explained by the
predictor variables (X’s)
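[Figure: F distribution with a rejection region of area 0.05 to the right of the critical value F = 2.4; the calculated F = 20.87 lies in the rejection region.]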
II. Testing the Significance of the coefficients of the Model
◼ Evaluation is similar to simple linear regression models
◼ The p-value for the F-test and r² are interpreted the same
◼ The hypothesis is different because there is more than one
independent variable
◼ The F-test is investigating whether all the coefficients are equal to 0
II. Testing the Significance of the coefficients of the Model
• There is another important piece of information in regression outputs:
the t-values for the individual regression coefficients.
• Each t-value is the ratio of the estimated coefficient to its standard error.
• It indicates how many standard errors the regression coefficient is from zero.
• A t-value can be used in a hypothesis test for a regression
coefficient.
• If a variable’s coefficient is zero, there is no point in including this variable in the
equation.
• To run this test, simply compare the t-value in the regression output with a
tabulated t-value and reject the null hypothesis only if the t-value from the
computer output is greater in magnitude than the tabulated t-value.
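The t-test for a single coefficient amounts to one division and one comparison, as in the sketch below. The critical value 1.982 is t(0.025, 111) from the College Scorecard example; the coefficient and standard-error pairs are hypothetical, since the regression output itself is not reproduced in the slides.

```python
def t_value(coefficient, std_error):
    """Number of standard errors the estimated coefficient lies from zero."""
    return coefficient / std_error

def is_significant(coefficient, std_error, t_crit):
    """Two-tailed test of H0: coefficient = 0 against a tabulated t-value."""
    return abs(t_value(coefficient, std_error)) > t_crit

t_crit = 1.982  # t(0.025, df = 111)

# Hypothetical (coefficient, standard error) pairs, for illustration only
estimates = {
    "X1": (0.50, 0.10),
    "X2": (120.0, 90.0),
    "X3": (-45.0, 15.0),
}

results = {
    name: (t_value(b, s), is_significant(b, s, t_crit))
    for name, (b, s) in estimates.items()
}

for name, (t, significant) in results.items():
    verdict = "significant" if significant else "not significant"
    print(f"{name}: t = {t:.2f} -> {verdict}")
```

The comparison uses the magnitude of t, so a large negative t-value (X3 here) leads to rejection just as a large positive one does.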
Evaluating Scorecard Example
• All explanatory variables except Debt are significant, since their p-values are
below 0.05.
• T-tabulated = t0.025,111 = 1.982
• Equivalently, all explanatory variables except Debt have t-values greater than 1.982 in
magnitude.
INCLUDE/EXCLUDE DECISIONS
• The t-values of regression coefficients can be used to make
include/exclude decisions for explanatory variables in a
regression equation.
• Finding the best Xs to include in a regression equation is the most
difficult part of any real regression analysis.
• You are always trying to get the best fit possible, but the principle of
parsimony suggests using the fewest number of variables.
• This presents a trade-off, where there are not always easy answers.
• To help with this decision, several guidelines are presented on the next
slide.
Guidelines for Including/Excluding Variables
in a Regression Equation
• Look at a variable’s t-value and its associated p-value. If the p-value is above
some accepted significance level, such as 0.05, this variable is a candidate for
exclusion.
• Check whether a variable’s t-value is less than 1 or greater than 1 in magnitude. If
it is less than 1, the variable is a candidate for exclusion, because removing it decreases the standard error of the estimate.
• When there is a group of variables that are in some sense logically related, it is
sometimes a good idea to include all of them or exclude all of them.
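The first two guidelines are mechanical enough to sketch in code; the third (treating logically related variables as a group) calls for judgment and is not automated here. The function name, the flag wording, and the (t-value, p-value) pairs are all hypothetical, and the flags are prompts for review rather than automatic decisions.

```python
def screen_variable(t_value, p_value, alpha=0.05):
    """Return the guideline flags raised by one explanatory variable."""
    flags = []
    # Guideline 1: p-value above the accepted significance level
    if p_value > alpha:
        flags.append("p-value above significance level")
    # Guideline 2: t-value smaller than 1 in magnitude
    if abs(t_value) < 1:
        flags.append("|t| below 1")
    return flags

# Hypothetical (t-value, p-value) pairs, for illustration only
candidates = {
    "A": (2.50, 0.014),
    "B": (1.45, 0.150),
    "C": (0.60, 0.550),
}

screening = {name: screen_variable(t, p) for name, (t, p) in candidates.items()}

for name, flags in screening.items():
    print(f"{name}: {', '.join(flags) if flags else 'no flags, keep'}")
```

A variable such as B, flagged by the p-value rule but with a t-value above 1 in magnitude, is exactly the kind of trade-off case the previous slide describes.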
Thank You ☺