Linear Regression Analysis of Student Performance

This document discusses the results of a linear regression analysis with performance index as the dependent variable. It examines the relationships between various independent variables and performance index, as well as assessing the model fit and significance of predictors. The analysis involved multiple steps, including checking assumptions, interpreting coefficients, and testing for multicollinearity and outliers.

Uploaded by

Faiza Noor

3/18/24, 6:07 PM Assignment 2

Assignment 2
Faiza
2024-03-10

Question 1
Question 1.1
## 'data.frame': 10000 obs. of 7 variables:
## $ Sex : chr "Male" "Female" "Female" "Male" ...
## $ Hours_Studied : int 7 4 5 8 5 7 3 7 8 4 ...
## $ Previous_Scores : int 99 82 77 51 52 75 78 73 45 89 ...
## $ Extracurricular_Activities: chr "Yes" "No" NA "Yes" ...
## $ Sleep_Hours : int 9 4 8 7 5 8 9 5 4 4 ...
## $ Academic_Year : int 2 3 2 2 1 2 2 2 5 1 ...
## $ Performance_Index : int 91 65 61 45 36 66 61 63 42 69 ...

##                   Hours_Studied Previous_Scores Sleep_Hours Performance_Index
## Hours_Studied       1.000000000    -0.012389916 0.001245198        0.37373035
## Previous_Scores    -0.012389916     1.000000000 0.005944219        0.91518914
## Sleep_Hours         0.001245198     0.005944219 1.000000000        0.04810584
## Performance_Index   0.373730351     0.915189141 0.048105835        1.00000000

The correlation matrix suggests linear relationships between the independent variables (Hours Studied, Previous
Scores, Sleep Hours) and the dependent variable (Performance Index): the association is strong for Previous Scores
(r ≈ 0.92), moderate for Hours Studied (r ≈ 0.37), and very weak for Sleep Hours (r ≈ 0.05). The histogram of the
dependent variable does not appear perfectly normally distributed; there seems to be a slight skew towards higher
performance indices. Linear regression assumes a linear relationship between the predictors and the dependent
variable, but it does not require the dependent variable itself to be normally distributed; the normality assumption
concerns the residuals. Despite the slight skewness, linear regression is still a reasonable approach, provided the
assumptions of linearity and homoscedasticity hold reasonably well.

Question 1.2
##
## Call:
## lm(formula = Performance_Index ~ Hours_Studied + Previous_Scores +
## Sleep_Hours + Extracurricular_Activities + Academic_Year,
## data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.9736 -1.4142 0.0066 1.4089 8.8946
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -33.266618 0.135355 -245.774 <2e-16 ***
## Hours_Studied 2.856813 0.008160 350.080 <2e-16 ***
## Previous_Scores 1.018694 0.001218 836.157 <2e-16 ***
## Sleep_Hours 0.481970 0.012462 38.674 <2e-16 ***
## Extracurricular_ActivitiesYes 0.627122 0.042270 14.836 <2e-16 ***
## Academic_Year 0.008730 0.014875 0.587 0.557
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.113 on 9992 degrees of freedom
## (2 observations deleted due to missingness)
## Multiple R-squared: 0.9879, Adjusted R-squared: 0.9879
## F-statistic: 1.634e+05 on 5 and 9992 DF, p-value: < 2.2e-16

The Estimate column of the summary gives the coefficient of each independent variable. Holding the other variables
constant, a one-unit increase in an independent variable is associated with a rise or fall in the Performance Index
equal to its coefficient. For instance, a student's Performance Index rises by about 2.857 points for each additional
hour in Hours_Studied. The other coefficients are interpreted in the same way.
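
As a rough check on how the coefficients combine, the fitted equation can be evaluated by hand. The sketch below is plain Python arithmetic, not the R code behind the output (which is not shown). It uses the estimates from the summary and the first observation listed in the data structure under Question 1.1 (7 hours studied, previous score 99, 9 sleep hours, extracurricular activities "Yes", academic year 2, observed Performance Index 91). The variable names are illustrative.

```python
# Coefficient estimates copied from the lm() summary above.
coefficients = {
    "intercept": -33.266618,
    "Hours_Studied": 2.856813,
    "Previous_Scores": 1.018694,
    "Sleep_Hours": 0.481970,
    "Extracurricular_ActivitiesYes": 0.627122,
    "Academic_Year": 0.008730,
}

# First observation shown in the str() output; "Yes" is coded as 1.
student = {
    "Hours_Studied": 7,
    "Previous_Scores": 99,
    "Sleep_Hours": 9,
    "Extracurricular_ActivitiesYes": 1,
    "Academic_Year": 2,
}
observed = 91

# Fitted value = intercept + sum of (coefficient * predictor value).
predicted = coefficients["intercept"] + sum(
    coefficients[name] * value for name, value in student.items()
)
residual = observed - predicted

print(f"predicted = {predicted:.2f}, residual = {residual:.2f}")
```

The fitted value comes out near 92.6 against an observed 91, a residual of roughly -1.6, which is within the reported residual standard error of 2.113.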

Question 1.3
The R-squared value shows the proportion of variance in the response that the model explains. With an R-squared of
0.9879, the model explains about 98.8% of the variation in Performance_Index. A higher R-squared indicates a closer
fit to the data. The p-value, on the other hand, belongs to the F statistic, which tests the null hypothesis that the
“fit of the intercept-only model and your model are equal,” that is, that all slope coefficients are 0. If the p-value
is less than the significance level, which is typically 0.05, the model fits significantly better than the
intercept-only model. Since the p-value in this case is below 2.2e-16, which is very near 0, we can reject that null
hypothesis. As a result, at least one of the predictors has a significant linear association with Performance_Index.
This model’s low p-value and high R-squared indicate that it is significant and explains a large share of the
variance in the data.
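
The link between R-squared and the F statistic can be checked with a few lines of arithmetic. This is a minimal Python sketch, assuming the standard identity F = (R²/p) / ((1 − R²)/df), with p predictors and df residual degrees of freedom; the inputs are the rounded values printed in the summary, so the result only approximates the reported 1.634e+05.

```python
# Values copied from the lm() summary for Question 1.2.
r_squared = 0.9879   # Multiple R-squared (rounded to four decimals in the output)
n_predictors = 5     # numerator degrees of freedom
residual_df = 9992   # denominator degrees of freedom

# Overall F statistic: explained variance per predictor divided by
# unexplained variance per residual degree of freedom.
f_statistic = (r_squared / n_predictors) / ((1 - r_squared) / residual_df)

print(f"F = {f_statistic:.4g}")
```

The small gap from the reported value comes from using the rounded R-squared.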

Question 1.4
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -33.2666177 0.135354522 -245.77397 0.000000e+00
## Hours_Studied 2.8568131 0.008160465 350.07972 0.000000e+00
## Previous_Scores 1.0186941 0.001218305 836.15686 0.000000e+00
## Sleep_Hours 0.4819703 0.012462285 38.67432 4.819750e-305
## Extracurricular_ActivitiesYes 0.6271216 0.042270067 14.83607 2.858808e-49

The p-values of the coefficients in the multiple linear regression summary show which predictors have a
statistically significant association with the response variable (Performance_Index). Predictors with low p-values,
conventionally below 0.05, are considered statistically significant. The predictors with a statistically significant
relationship to the response are Hours_Studied, Previous_Scores, Sleep_Hours, and Extracurricular_Activities.
Academic_Year (p = 0.557 in the summary for Question 1.2) is not significant.

Question 1.5

Question 1.6
Our figure shows points that lie outside the Cook’s distance boundaries and far from the plot’s center, indicating
that there may be severe outliers or high-leverage points (e.g., points 685 and 7469).

Question 1.7
## Warning: package 'car' was built under R version 4.3.3

## Loading required package: carData

## Warning: package 'carData' was built under R version 4.3.3

## Hours_Studied Previous_Scores
## 1.000217 1.000263
## Sleep_Hours Extracurricular_Activities
## 1.000609 1.000624
## Academic_Year
## 1.000083

A variance inflation factor (VIF) of 1 means that a predictor is uncorrelated with the other predictor variables in
the model. Since every VIF is very close to 1, there is no multicollinearity issue.
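
To make the VIF values concrete, each one can be converted into the share of a predictor's variance explained by the other predictors. The sketch below is plain Python, assuming the usual definition VIF_j = 1 / (1 − R_j²), so R_j² = 1 − 1/VIF_j; the dictionary name is illustrative.

```python
# VIF values copied from the output above.
vif = {
    "Hours_Studied": 1.000217,
    "Previous_Scores": 1.000263,
    "Sleep_Hours": 1.000609,
    "Extracurricular_Activities": 1.000624,
    "Academic_Year": 1.000083,
}

# VIF_j = 1 / (1 - R_j^2)  =>  R_j^2 = 1 - 1 / VIF_j
shared_variance = {name: 1 - 1 / value for name, value in vif.items()}

for name, r2 in shared_variance.items():
    print(f"{name}: R^2 against the other predictors = {r2:.6f}")
```

Every implied R² is below 0.001, meaning the other predictors explain less than 0.1% of the variance of any single predictor.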

Question 1.8
##
## Call:
## lm(formula = Performance_Index ~ Hours_Studied + Previous_Scores +
## Sleep_Hours + Extracurricular_Activities + Academic_Year +
## Sleep6 + Extracurricular_Activities:Sleep6, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.9105 -1.4141 0.0072 1.4011 8.8894
##
## Coefficients:
##                                        Estimate Std. Error  t value Pr(>|t|)
## (Intercept)                          -33.561057   0.236011 -142.201   <2e-16 ***
## Hours_Studied                          2.856778   0.008160  350.088   <2e-16 ***
## Previous_Scores                        1.018685   0.001218  836.085   <2e-16 ***
## Sleep_Hours                            0.518300   0.026027   19.914   <2e-16 ***
## Extracurricular_ActivitiesYes          0.605260   0.059192   10.225   <2e-16 ***
## Academic_Year                          0.008660   0.014875    0.582    0.560
## Sleep6                                 0.118750   0.098042    1.211    0.226
## Extracurricular_ActivitiesYes:Sleep6   0.043896   0.084568    0.519    0.604
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.112 on 9990 degrees of freedom
## (2 observations deleted due to missingness)
## Multiple R-squared: 0.9879, Adjusted R-squared: 0.9879
## F-statistic: 1.167e+05 on 7 and 9990 DF, p-value: < 2.2e-16

After adding the Sleep6 term and its interaction with Extracurricular_Activities, the R-squared value remains the
same (0.9879), and neither new term is statistically significant (p = 0.226 and p = 0.604). The fitted model
therefore shows no improvement from the interaction.

Question 1.9
## The predicted performance index for the given student based on the specified values for Hours Studied, Previous Scores, Extracurricular Activities, and Sleep Hours, using the final regression model is 62.84815

Question 2
Question 2.1
##
## Attaching package: 'dplyr'

## The following object is masked from 'package:car':
##
##     recode

## The following objects are masked from 'package:stats':
##
##     filter, lag

## The following objects are masked from 'package:base':
##
##     intersect, setdiff, setequal, union

## Warning: package 'corrplot' was built under R version 4.3.3

## corrplot 0.92 loaded

Question 2.2
##
## Call:
## glm(formula = HeartDisease ~ Age + BMI + SleepTime + Sex + Smoking +
## AlcoholDrinking + Stroke + DiffWalking + Diabetic + Asthma,
## family = binomial, data = Heart_data)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -7.062934 0.320354 -22.047 < 2e-16 ***
## Age 0.050625 0.002754 18.379 < 2e-16 ***
## BMI 0.026195 0.005990 4.373 1.22e-05 ***
## SleepTime -0.041003 0.024643 -1.664 0.0961 .
## Sex 0.662987 0.079952 8.292 < 2e-16 ***
## Smoking 0.530962 0.078739 6.743 1.55e-11 ***
## AlcoholDrinking -0.481783 0.204702 -2.354 0.0186 *
## Stroke 1.278899 0.125199 10.215 < 2e-16 ***
## DiffWalking 0.664094 0.089216 7.444 9.79e-14 ***
## Diabetic 0.843162 0.086003 9.804 < 2e-16 ***
## Asthma 0.602291 0.101422 5.938 2.88e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 6035.7 on 9993 degrees of freedom
## Residual deviance: 4773.0 on 9983 degrees of freedom
## (6 observations deleted due to missingness)
## AIC: 4795
##
## Number of Fisher Scoring iterations: 6

In the summary output, the p-values in the “Pr(>|z|)” column identify the statistically significant predictors. A
predictor is deemed statistically significant if its p-value is low, usually less than 0.05. Therefore, age, BMI,
sex, smoking, alcohol use, stroke, difficulty walking, diabetes, and asthma are the statistically significant
predictors. SleepTime (p = 0.0961) is not significant at the 0.05 level.

Question 2.3
## the confusion matrix is

## Predicted
## Actual 0 1
## 0 9007 90
## 1 809 88

## the overall fraction of correct predictions is 0.910046

Actual 0 (No Event): There were 9,097 observations in which the actual result was 0 (No Event). Among these, 9,007
were correctly classified by the model as 0 (true negatives), while 90 were incorrectly classified as 1 (false
positives).

Actual 1 (Event): There were 897 observations in which the actual result was 1. Among these, the model correctly
identified only 88 as 1 (true positives), while it incorrectly classified 809 as 0 (false negatives).

Accuracy: The overall proportion of correct predictions is (TP + TN) / Total, where TP stands for true positives, TN
for true negatives, and Total for the total number of observations. This works out to
(9007 + 88) / (9007 + 90 + 809 + 88) = 9095 / 9994, or about 0.9100 (91.0%) in this instance.
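
The same arithmetic can be reproduced in a few lines. This is a minimal Python sketch using the four counts from the confusion matrix; sensitivity, specificity, and the always-predict-0 baseline are added here for context and are not part of the original output.

```python
# Counts from the confusion matrix above (rows = actual, columns = predicted).
tn, fp = 9007, 90   # actual 0: predicted 0, predicted 1
fn, tp = 809, 88    # actual 1: predicted 0, predicted 1

total = tn + fp + fn + tp
accuracy = (tp + tn) / total
sensitivity = tp / (tp + fn)   # share of actual 1s predicted as 1
specificity = tn / (tn + fp)   # share of actual 0s predicted as 0
baseline = (tn + fp) / total   # accuracy of always predicting 0

print(f"accuracy    = {accuracy:.6f}")
print(f"sensitivity = {sensitivity:.4f}")
print(f"specificity = {specificity:.4f}")
print(f"baseline    = {baseline:.6f}")
```

Because only 897 of the 9,994 observations are events, a high accuracy is compatible with a low sensitivity, which is why the per-class rates are worth reading alongside the overall fraction.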

Question 2.4
In logistic regression, the estimated coefficient of a predictor variable, such as age or sex (male/female),
represents its effect on the log-odds of the binary outcome (heart disease, or CHD, in this example). Consider the
coefficient of the Sex predictor (0.663). Because it is positive, and assuming Sex is coded 1 for male, males appear
to have higher odds of CHD than females. The strength of the effect is indicated by the coefficient’s magnitude: a
larger positive (or more negative) coefficient indicates a bigger effect. Similarly, the coefficient of Age is
positive (0.051), indicating that the odds of CHD rise with age. In other words, older individuals are more likely to
have CHD.
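
Coefficients on the log-odds scale are easier to read once exponentiated into odds ratios. The sketch below is plain Python using four estimates from the glm() summary in Question 2.2. It assumes the binary predictors are coded 0/1, which the output does not show.

```python
import math

# Log-odds coefficients copied from the glm() summary in Question 2.2.
log_odds = {
    "Age": 0.050625,
    "Sex": 0.662987,
    "Smoking": 0.530962,
    "AlcoholDrinking": -0.481783,
}

# exp(coefficient) is the multiplicative change in the odds of heart disease
# for a one-unit increase in the predictor, holding the others fixed.
odds_ratios = {name: math.exp(beta) for name, beta in log_odds.items()}

for name, ratio in odds_ratios.items():
    print(f"{name}: odds ratio = {ratio:.3f}")
```

An odds ratio above 1 (Age, Sex, Smoking) means higher odds of heart disease; a ratio below 1 (AlcoholDrinking) means lower odds in this model.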

Question 2.5
## Warning: package 'caret' was built under R version 4.3.3

## Loading required package: lattice

## Predicted
## Actual 0 1
## 0 1788 19
## 1 175 17

## [1] 0.9029515

Question 2.6
## Warning: package 'MASS' was built under R version 4.3.3

##
## Attaching package: 'MASS'

## The following object is masked from 'package:dplyr':
##
##     select

## Warning in FUN(newX[, i], ...): no non-missing arguments to min; returning Inf

## Predicted
## Actual 0 1
## 0 1760 47
## 1 153 39

## [1] 0.89995

Question 2.7
## Warning in FUN(newX[, i], ...): no non-missing arguments to min; returning Inf

## Predicted
## Actual 0 1
## 0 2482 230
## 1 169 118

## [1] 0.8669557

Question 2.8
## Warning: package 'e1071' was built under R version 4.3.3

## prediction_for_NB
## 0 1
## 0 2475 238
## 1 168 119

##
## Accuracy of 'Naive Bayes' testing data: 0.8646667

Question 2.9
## Predicted
## Actual 0 1
## 0 2548 165
## 1 236 51

## [1] 0.8663333

Question 2.10
## Warning in [Link](x, y, weights = w, ...): You are trying to do
## regression and your outcome only has two possible values Are you trying to do
## classification? If so, use a 2 level factor as your outcome column.

## [1] 20

## Predicted
## Actual 0 1
## 0 2710 3
## 1 287 0

## [1] 0.9033333

Question 2.11
All of the classifiers produce a usable confusion matrix, but the KNN classifier, with K selected by
cross-validation, has the highest accuracy of all the methods at approximately 90.33%. It is closely followed by
logistic regression, which has an accuracy of 90.30%. Accuracy alone is misleading here, however: the KNN model
predicts 0 for nearly every observation and identifies none of the 287 actual cases, so its accuracy is slightly
below the 2713/3000 ≈ 90.43% obtained by always predicting 0.
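
The comparison can be laid out side by side from the confusion matrices reported in Questions 2.5 to 2.10. This is a minimal Python sketch; the models are keyed by question number because the output does not name every method, and the `summarize` helper is illustrative.

```python
# (TN, FP, FN, TP) copied from the confusion matrices in Questions 2.5-2.10.
matrices = {
    "2.5": (1788, 19, 175, 17),
    "2.6": (1760, 47, 153, 39),
    "2.7": (2482, 230, 169, 118),
    "2.8": (2475, 238, 168, 119),
    "2.9": (2548, 165, 236, 51),
    "2.10": (2710, 3, 287, 0),
}


def summarize(tn, fp, fn, tp):
    """Return accuracy, sensitivity, and the always-predict-0 baseline."""
    total = tn + fp + fn + tp
    return {
        "accuracy": (tn + tp) / total,
        "sensitivity": tp / (tp + fn),
        "baseline": (tn + fp) / total,
    }


results = {question: summarize(*counts) for question, counts in matrices.items()}

for question, stats in results.items():
    print(
        f"Q{question}: accuracy={stats['accuracy']:.4f} "
        f"sensitivity={stats['sensitivity']:.4f} "
        f"baseline={stats['baseline']:.4f}"
    )
```

Note that the test sets differ in size across questions (1,999, 2,999, or 3,000 observations), so small differences in accuracy should be read with caution.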

Question 2.12
## Predictors: Age_BMI_SleepTime_Sex_Smoking_AlcoholDrinking_Stroke_DiffWalking_Diabetic_Asthma
## Confusion Matrix:
## Predicted
## Actual 0 1
## 0 2679 34
## 1 263 24
##
## Predictors: Sex_Age_SleepTime
## Confusion Matrix:
## Predicted
## Actual 0 1
## 0 2712 1
## 1 286 1
##
## Predictors: Sex_Age_Smoking_Stroke
## Confusion Matrix:
## Predicted
## Actual 0 1
## 0 2693 20
## 1 272 15
##
## Predictors: Sex_Age_SleepTime_BMI
## Confusion Matrix:
## Predicted
## Actual 0 1
## 0 2709 4
## 1 285 2
##
## Predictors: Sex_Age_SleepTime_BMI_Smoking
## Confusion Matrix:
## Predicted
## Actual 0 1
## 0 2706 7
## 1 281 6
##
## Predictors: Sex_Age_SleepTime_BMI_Diabetic
## Confusion Matrix:
## Predicted
## Actual 0 1
## 0 2704 9
## 1 280 7
##
## Predictors: Sex_Age_SleepTime_BMI_Diabetic_Smoking
## Confusion Matrix:
## Predicted
## Actual 0 1
## 0 2703 10
## 1 278 9
##
## Predictors: Sex_Age_SleepTime_BMI_Diabetic_Smoking_Stroke
## Confusion Matrix:
## Predicted
## Actual 0 1
## 0 2689 24
## 1 263 24

Question 2.13.a
##
## Attaching package: 'boot'

## The following object is masked from 'package:lattice':
##
##     melanoma

## The following object is masked from 'package:car':
##
##     logit

##             Age             BMI       SleepTime             Sex         Smoking
##     0.308324917     0.002467518     0.005949438     0.025436124     0.078286538
## AlcoholDrinking          Stroke     DiffWalking        Diabetic          Asthma
##     0.077786822     0.204559764     0.134952458     0.093821823     0.093492344
##            <NA>
##     0.107209235

Question 2.13.b

Question 2.13.c
The standard error reported by glm() in (2.2) is based on the asymptotic properties of the maximum likelihood
estimator, whereas the bootstrap standard error is generated empirically by resampling. In cases where the normality
assumptions are not met, the bootstrap standard error can provide a more accurate approximation of the standard
error.

Question 2.13.d
## 2.5% 97.5%
## 1.05091 1.05930
