Simple Linear Regression Project Guide

The document outlines the instructions for Project 2 of the MA 425/625 Applied Regression Analysis course, focusing on Simple Linear Regression using RStudio and RMarkdown. It includes data requirements, analysis tasks for two parts using cost-report years 2000 and 2001, and specifies submission details. The project involves various statistical analyses, including correlations, scatter plots, linear regression models, hypothesis testing, and confidence intervals.

MA 425/625: Applied Regression Analysis - Fall 2025

Project 2: Simple Linear Regression

Instructor: Mohamed Abu Sheha

24 September, 2025

Instructions

• Complete the entire project using RStudio.


• Use RMarkdown to typeset your work.
• You are required to submit two files on Canvas (the RMarkdown file and a PDF/Word/HTML file).
• Make sure your code is readable (i.e. well annotated).
• Due September 30, 2025, at 11 p.m. CT (Total Points = 100).

Data

• The data set named WNH for this project was provided by the Wisconsin Department of Health and
Family Services (DHFS).
• The data is accessible from the following website:
– [Link]

Part A (45 Points)

Use cost-report year 2000 data, and do the following analysis.

• Below is a snapshot of the data set and the definition of variables.

• Read your data into R. Please check your data for NAs. If there are NAs, run your data through
the function na.omit() to omit all the NAs before proceeding to perform analysis. Assign the name
nurse_2000 to the data in R. (5 points)

# write your code here

a. Correlations (10 points)


i. Calculate the correlation between TPY and LOGTPY (logarithm of TPY). Comment on your
result. Note: LOGTPY is a new variable you have to create by taking the log of TPY.

• Write your comment on the result here!

# write your code here

ii. Calculate the pairwise correlations among TPY, NUMBED, and SQRFOOT. Do these variables appear highly
correlated?

• Write your comment here!

# write your code here

b. Scatter plots. Plot TPY (Y-axis) versus NUMBED (X-axis) and TPY versus SQRFOOT. Comment
on the plots. (10 points)

• Write your comment on the plots here!

# write your code here

c. Basic linear regression. (20 points)


i. Fit a basic linear regression model using TPY as the outcome variable and NUMBED as the
explanatory variable. Summarize the fit by quoting the coefficient of determination, the t-statistic
(t value), and the p-value for NUMBED.

• Write your summary here!

# write your code here

ii. Repeat c(i), using SQRFOOT instead of NUMBED. In terms of R², which model fits better?

• Write your response here!

# write your code here

iii. Repeat c(i), using LOGTPY for the outcome variable and LOG(NUMBED) as the explanatory variable.

• Write your summary here!

# write your code here

iv. Repeat c(iii) using LOGTPY for the outcome variable and LOG(SQRFOOT) as the explanatory variable.

• Write your summary here!

# write your code here

Part B (55 Points)

Use cost-report year 2001 data, and do the following analysis.

• Read your data into R. Please check your data for NAs. If there are NAs, run your data through
the function na.omit() to omit all the NAs before proceeding to perform analysis. Assign the name
nurse_2001 to the data in R. (5 points)

# write your code here

You decide to examine the relationship between total patient years (LOGTPY) and the number of beds
(LOGNUMBED), both in logarithmic units.

a. Summary statistics. Create basic summary statistics (Mean, Median, Standard Deviation, Minimum,
and Maximum) for each variable. Summarize the relationship through a correlation statistic and a
scatter plot. (Round values to 3 decimal places.) (10 points)

• Write your summary here!

# write your code here
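
One possible way to build such a summary table is sketched below. The data are simulated and the names `sim`, `five_stats`, `stats_table`, and `r_log` are hypothetical; with the real data you would apply the same steps to the LOGTPY and LOGNUMBED columns of nurse_2001.

```r
# Simulated stand-in data (assumption: not the WNH data).
set.seed(425)
sim <- data.frame(NUMBED = round(runif(150, 20, 300)))
sim$LOGNUMBED <- log(sim$NUMBED)
sim$LOGTPY <- -0.1 + sim$LOGNUMBED + rnorm(150, sd = 0.1)

# Five summary statistics for one numeric vector.
five_stats <- function(x) {
  c(Mean = mean(x), Median = median(x), SD = sd(x), Min = min(x), Max = max(x))
}

# One column per variable, one row per statistic, rounded to 3 decimals.
stats_table <- round(sapply(sim[c("LOGTPY", "LOGNUMBED")], five_stats), 3)
print(stats_table)

# Pearson correlation between the two log variables.
r_log <- round(cor(sim$LOGTPY, sim$LOGNUMBED), 3)
print(r_log)
```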

b. Fit the basic linear model. Summarize the fit by quoting the coefficient of determination, the t-statistic
(t value), and the p-value for LOGNUMBED. (5 points)

• Write your summary here!

# write your code here

c. Hypothesis testing. Test the following hypothesis at the 5% level of significance using the p-value
method; that is, test whether LOGNUMBED is an important/significant predictor of LOGTPY. (10 points)
H0: β1 = 0 versus Ha: β1 ≠ 0.

• Just make a decision and summarize your results here!

d. Construct a 95% confidence interval for the slope parameter (β1). Interpret the interval. (Round
intermediate calculations and the final interval to 4 decimal places.) (10 points)

• Write your interpretation of the confidence interval here!

# write your code here

e. At a specified number of beds, x* = 100: (15 points)


i. Find the predicted value and construct a 95% prediction interval for your prediction (round all
intermediate calculations and the final answer to 6 decimal places).

• Write the predicted value and the prediction interval here!

# write your code here

ii. Convert the point prediction and the prediction interval obtained in part e(i) into total patient years
(through exponentiation).

• Write the predicted value and the prediction interval after exponentiation here!

# write your code here

Common questions

Visualization plays a crucial role in regression analysis by revealing patterns, trends, and potential correlations between variables through visual means. Scatter plots, in particular, provide a visual assessment of the relationship and potential correlation between two quantitative variables. For variables like TPY, NUMBED, and SQRFOOT, scatter plots can help identify linear or non-linear relationships, outliers, and direction of relationships, aiding in preliminary data exploration before formal modeling.

To convert a prediction interval for LOGTPY into total patient years, first compute the point prediction and the prediction interval on the log scale. Then exponentiate the point prediction and both interval endpoints to return to the original TPY scale. Because exponentiation is an increasing function, the transformed endpoints still bound a 95% prediction interval, now expressed in total patient years.
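
A minimal R sketch of this back-transformation is given here, assuming a fitted log-log model. The data are simulated and the names `sim`, `fit`, and `new_home` are hypothetical, not taken from the WNH file. The value x* = 100 beds is entered as log(100) on the assumption that the model's predictor is LOGNUMBED.

```r
# Simulated stand-in data (assumption: not the WNH data).
set.seed(425)
sim <- data.frame(NUMBED = round(runif(150, 20, 300)))
sim$LOGNUMBED <- log(sim$NUMBED)
sim$LOGTPY <- -0.1 + sim$LOGNUMBED + rnorm(150, sd = 0.1)

fit <- lm(LOGTPY ~ LOGNUMBED, data = sim)

# 100 beds, expressed on the log scale used by the model.
new_home <- data.frame(LOGNUMBED = log(100))

# Point prediction and 95% prediction interval on the log scale
# (columns: fit, lwr, upr).
pred_log <- predict(fit, newdata = new_home,
                    interval = "prediction", level = 0.95)
print(round(pred_log, 6))

# Back-transform the prediction and both endpoints to the TPY scale.
pred_tpy <- exp(pred_log)
print(round(pred_tpy, 6))
```

Under normally distributed errors on the log scale, the exponentiated point prediction estimates the median rather than the mean of TPY at that number of beds.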

Logarithmic transformations are used in regression analysis to stabilize variance and to linearize relationships, which brings the data closer to the model assumptions. Variables such as TPY and NUMBED may be log-transformed because they are right-skewed or related non-linearly; the transformation can straighten the relationship and improve the normality of the residuals. The trade-off is that coefficients and predictions are on the log scale: in a log-log model the slope is read as the approximate percentage change in TPY for a 1% change in NUMBED, and predictions must be exponentiated to return to the original units.

To import and clean data in R for regression analysis, you start by using functions such as read.csv() to load the data into R. It's crucial to check for missing data using functions like is.na() and to omit such data with na.omit(). This process ensures that the analysis runs smoothly without errors related to missing values.
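
The import-and-clean steps could look roughly like the following. The file written here is a small made-up example so that the code runs on its own; the path `csv_path`, the values, and the name `nurse_demo` are hypothetical, and the real WNH file will have more columns and rows.

```r
# Write a tiny made-up CSV to a temporary file (assumption: illustration only).
csv_path <- tempfile(fileext = ".csv")
writeLines(c("TPY,NUMBED,SQRFOOT",
             "90.5,100,45.2",
             "NA,60,30.1",
             "55.3,62,NA",
             "120.7,130,70.8"), csv_path)

# read.csv() treats the string "NA" as a missing value by default.
nurse_demo <- read.csv(csv_path)

# Number of missing values in each column.
print(colSums(is.na(nurse_demo)))

# Drop every row that contains at least one NA.
nurse_demo <- na.omit(nurse_demo)
print(nrow(nurse_demo))
```

Note that na.omit() removes whole rows, so it is worth comparing the row count before and after to see how much data was dropped.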

A scatter plot illustrates the relationship between two variables by displaying data points on a Cartesian plane. In the context of TPY versus NUMBED, the scatter plot would help determine if a linear relationship exists between the total patient years and the number of beds. If the points form a linear pattern, it suggests a potential correlation that might be explored through regression analysis. Outliers or non-linear patterns could indicate the need for model adjustments or transformations.
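
A base-R scatter plot of this kind can be sketched as follows. The vectors `numbed` and `tpy` are simulated stand-ins, and the dashed least-squares line (`trend`) is an optional visual aid that the project text does not require.

```r
# Simulated stand-in data (assumption: not the WNH data).
set.seed(425)
numbed <- round(runif(150, 20, 300))
tpy <- 0.9 * numbed + rnorm(150, sd = 8)

# Response on the Y-axis, explanatory variable on the X-axis.
plot(numbed, tpy,
     xlab = "NUMBED", ylab = "TPY",
     main = "TPY versus NUMBED (simulated data)")

# Overlay the least-squares line to help judge linearity.
trend <- lm(tpy ~ numbed)
abline(trend, lty = 2)
```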

The coefficient of determination (R²) in a linear regression model is the proportion of variance in the dependent variable that is explained by the independent variable(s). A higher R² value means that more of the variation in the outcome is explained by the predictors. However, it doesn't imply causation and must be considered along with statistical tests and residual analysis.
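
As a rough illustration on simulated data (the names `sim` and `fit` are hypothetical), R² can be read from the model summary and checked in two ways: as one minus the ratio of residual to total sum of squares, and, for a model with a single predictor, as the squared correlation between the two variables.

```r
# Simulated stand-in data (assumption: not the WNH data).
set.seed(425)
sim <- data.frame(NUMBED = round(runif(150, 20, 300)))
sim$TPY <- 0.9 * sim$NUMBED + rnorm(150, sd = 8)

fit <- lm(TPY ~ NUMBED, data = sim)

# R-squared as reported by summary().
r2 <- summary(fit)$r.squared

# Same quantity from the sums of squares: 1 - SSE / SST.
r2_from_ss <- 1 - sum(resid(fit)^2) / sum((sim$TPY - mean(sim$TPY))^2)

# In simple linear regression, R-squared equals the squared correlation.
r2_from_cor <- cor(sim$TPY, sim$NUMBED)^2

print(c(summary = r2, sums_of_squares = r2_from_ss, squared_cor = r2_from_cor))
```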

Hypothesis testing in regression analysis is used to determine if there is enough statistical evidence to infer that a predictor significantly influences the outcome variable. For evaluating LOGNUMBED as a predictor for LOGTPY, you set up null (H0: β1 = 0) and alternative (Ha: β1 ≠ 0) hypotheses, where rejecting H0 implies LOGNUMBED significantly predicts LOGTPY. Using the p-value method, if the p-value is less than the significance level (usually 0.05), you reject H0, considering LOGNUMBED a significant predictor.
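
The p-value method can be sketched in R as below, using simulated data in which the slope is truly non-zero. The names `sim`, `fit`, `slope_row`, and `reject_h0` are hypothetical; the manual recomputation of the two-sided p-value is included only to show where the reported number comes from.

```r
# Simulated stand-in data (assumption: not the WNH data).
set.seed(425)
sim <- data.frame(LOGNUMBED = log(round(runif(150, 20, 300))))
sim$LOGTPY <- -0.1 + sim$LOGNUMBED + rnorm(150, sd = 0.1)

fit <- lm(LOGTPY ~ LOGNUMBED, data = sim)

# Row of the coefficient table for the slope.
slope_row <- coef(summary(fit))["LOGNUMBED", ]
t_value <- slope_row[["t value"]]
p_value <- slope_row[["Pr(>|t|)"]]

# Two-sided p-value recomputed from the t distribution with n - 2 df.
p_manual <- 2 * pt(abs(t_value), df = df.residual(fit), lower.tail = FALSE)

# Decision at the 5% level: reject H0 when the p-value is below alpha.
alpha <- 0.05
reject_h0 <- p_value < alpha
print(c(t_value = t_value, p_value = p_value))
print(reject_h0)
```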

A correlation coefficient quantifies the strength and direction of the linear relationship between two variables. A value close to 1 or -1 indicates a strong linear relationship, while a value close to 0 indicates a weak one. For example, the correlation between TPY and LOGTPY is expected to be high and positive because the logarithm is an increasing function, but it will fall short of 1 because the logarithm is not a linear function of TPY.
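
The following sketch illustrates the point with a simulated right-skewed variable standing in for TPY (the vector `tpy` and its distribution are assumptions). The Spearman rank correlation is shown only for contrast: it depends on ordering alone, so a strictly increasing transformation leaves it at 1.

```r
# Simulated right-skewed stand-in for TPY (assumption: not the WNH data).
set.seed(425)
tpy <- rlnorm(200, meanlog = 4.4, sdlog = 0.5)
logtpy <- log(tpy)

# Pearson correlation measures linear association only.
r_pearson <- cor(tpy, logtpy)

# Spearman correlation is based on ranks, which log() preserves.
r_spearman <- cor(tpy, logtpy, method = "spearman")

print(c(pearson = r_pearson, spearman = r_spearman))
```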

A confidence interval for a slope parameter in linear regression provides a range of values within which the true slope likely falls, with a certain level of confidence (e.g., 95%). It's constructed using the standard error of the slope estimate and critical values from a t-distribution. Interpretation involves saying that if the interval does not contain zero, there is evidence at the specified confidence level that the slope is significantly different from zero, implying a significant predictor effect.
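
One way to construct such an interval in R is sketched here on simulated data (the names `sim`, `fit`, `ci_builtin`, and `ci_manual` are hypothetical). The interval is computed twice: with confint(), and by hand as estimate ± t critical value × standard error, using n − 2 degrees of freedom.

```r
# Simulated stand-in data (assumption: not the WNH data).
set.seed(425)
sim <- data.frame(LOGNUMBED = log(round(runif(150, 20, 300))))
sim$LOGTPY <- -0.1 + sim$LOGNUMBED + rnorm(150, sd = 0.1)

fit <- lm(LOGTPY ~ LOGNUMBED, data = sim)

# Built-in 95% confidence interval for the slope.
ci_builtin <- confint(fit, "LOGNUMBED", level = 0.95)

# Manual construction from the coefficient table.
est <- coef(summary(fit))["LOGNUMBED", "Estimate"]
se <- coef(summary(fit))["LOGNUMBED", "Std. Error"]
t_crit <- qt(0.975, df = df.residual(fit))
ci_manual <- c(lower = est - t_crit * se, upper = est + t_crit * se)

print(round(ci_builtin, 4))
print(round(ci_manual, 4))
```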

Regression analysis can help quantify the relationship between NUMBED and TPY by estimating how changes in the number of beds predict changes in total patient years. It does this through the regression coefficients: the slope gives the expected change in TPY for a one-unit increase in NUMBED. This quantification describes association, not causation, but it can still support strategic planning and resource allocation in healthcare facilities.