3/18/24, 6:07 PM Assignment 2
Assignment 2
Faiza
2024-03-10
Question 1
Question 1.1
## '[Link]': 10000 obs. of 7 variables:
## $ Sex : chr "Male" "Female" "Female" "Male" ...
## $ Hours_Studied : int 7 4 5 8 5 7 3 7 8 4 ...
## $ Previous_Scores : int 99 82 77 51 52 75 78 73 45 89 ...
## $ Extracurricular_Activities: chr "Yes" "No" NA "Yes" ...
## $ Sleep_Hours : int 9 4 8 7 5 8 9 5 4 4 ...
## $ Academic_Year : int 2 3 2 2 1 2 2 2 5 1 ...
## $ Performance_Index : int 91 65 61 45 36 66 61 63 42 69 ...
##                   Hours_Studied Previous_Scores Sleep_Hours Performance_Index
## Hours_Studied       1.000000000    -0.012389916 0.001245198        0.37373035
## Previous_Scores    -0.012389916     1.000000000 0.005944219        0.91518914
## Sleep_Hours         0.001245198     0.005944219 1.000000000        0.04810584
## Performance_Index   0.373730351     0.915189141 0.048105835        1.00000000
The correlation matrix suggests linear relationships between the independent variables (Hours Studied, Previous
Scores, Sleep Hours) and the dependent variable (Performance Index): the correlation is strong for Previous
Scores (0.92), moderate for Hours Studied (0.37), and weak for Sleep Hours (0.05). However, the histogram of the
dependent variable does not appear perfectly normally distributed; there seems to be a slight skew towards higher
performance indices. Linear regression assumes a linear relationship between the predictors and the dependent
variable and, for inference, approximately normal residuals; it does not require the dependent variable itself to be
normally distributed. Despite the slight skewness, linear regression can still be a reasonable approach, especially
if the assumptions of linearity and homoscedasticity hold reasonably well.
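The point about the response not needing to be normal can be illustrated with a small simulation. This is only a sketch on made-up data (the variables `x` and `y` and the helper `skewness` are hypothetical and not part of the assignment data): a skewed predictor produces a skewed response, yet the residuals of the fitted line stay roughly symmetric.

```r
# Hypothetical simulated data: a right-skewed predictor gives a skewed response
# even though the errors themselves are normal.
set.seed(1)
n <- 5000
x <- rexp(n, rate = 1)             # skewed predictor
y <- 3 * x + rnorm(n, sd = 1)      # linear signal plus normal noise

fit <- lm(y ~ x)

# Sample skewness (third standardized moment); helper defined here for illustration
skewness <- function(v) mean((v - mean(v))^3) / sd(v)^3

skew_response  <- skewness(y)               # clearly positive
skew_residuals <- skewness(residuals(fit))  # close to zero

print(c(response = skew_response, residuals = skew_residuals))
```

In practice the same check would be run on the residuals of the fitted Performance_Index model rather than on the raw histogram of the response.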
Question 1.2
##
## Call:
## lm(formula = Performance_Index ~ Hours_Studied + Previous_Scores +
## Sleep_Hours + Extracurricular_Activities + Academic_Year,
## data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.9736 -1.4142 0.0066 1.4089 8.8946
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -33.266618 0.135355 -245.774 <2e-16 ***
## Hours_Studied 2.856813 0.008160 350.080 <2e-16 ***
## Previous_Scores 1.018694 0.001218 836.157 <2e-16 ***
## Sleep_Hours 0.481970 0.012462 38.674 <2e-16 ***
## Extracurricular_ActivitiesYes 0.627122 0.042270 14.836 <2e-16 ***
## Academic_Year 0.008730 0.014875 0.587 0.557
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.113 on 9992 degrees of freedom
## (2 observations deleted due to missingness)
## Multiple R-squared: 0.9879, Adjusted R-squared: 0.9879
## F-statistic: 1.634e+05 on 5 and 9992 DF, p-value: < 2.2e-16
The Estimate column in the summary gives the coefficient of each independent variable. Holding the other
variables constant, a one-unit increase in an independent variable is associated with a rise or reduction in the
Performance Index equal to its coefficient. For instance, a student's Performance Index rises by about 2.857 for
each additional hour in Hours_Studied. The other coefficients are interpreted in the same way.
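The "holding other variables constant" reading can be checked by hand. The sketch below copies the coefficients from the summary output; the two students `a` and `b` and the helper `predict_index` are hypothetical and exist only to show that changing Hours_Studied by one unit shifts the prediction by exactly its coefficient.

```r
# Coefficients copied from the summary output above
coefs <- c(
  "(Intercept)"                 = -33.266618,
  Hours_Studied                 = 2.856813,
  Previous_Scores               = 1.018694,
  Sleep_Hours                   = 0.481970,
  Extracurricular_ActivitiesYes = 0.627122,
  Academic_Year                 = 0.008730
)

# Linear predictor for one student; `student` is a named numeric vector
predict_index <- function(student) {
  unname(coefs["(Intercept)"] + sum(coefs[names(student)] * student))
}

# Two hypothetical students who differ only in Hours_Studied
a <- c(Hours_Studied = 5, Previous_Scores = 70, Sleep_Hours = 7,
       Extracurricular_ActivitiesYes = 1, Academic_Year = 2)
b <- a
b["Hours_Studied"] <- 6

effect_of_one_hour <- predict_index(b) - predict_index(a)
print(effect_of_one_hour)
```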
Question 1.3
The R-squared value shows the proportion of variance in the response that the model explains. With an R-squared
of 0.9879, the model explains about 98.8% of the variation in Performance_Index. A greater R-squared indicates a
closer fit to the data. The p-value, on the other hand, belongs to the F-statistic, which tests the null hypothesis
that the intercept-only model fits as well as this model, i.e. that all slope coefficients are zero. If the p-value is
less than the significance level, typically 0.05, the model fits significantly better than the intercept-only model.
Since the p-value here is below 2.2e-16, we reject the null hypothesis that all β = 0, so at least one predictor has
a significant linear association with Performance_Index. The low p-value and high R-squared indicate that the
model is significant and explains a large share of the variance in the data.
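The link between R-squared and the overall F-statistic can be made explicit with the standard identity F = (R²/p) / ((1 − R²)/df_resid). The sketch below plugs in the rounded values from the summary, so it only approximately reproduces the reported 1.634e+05; the variable names are chosen here for illustration.

```r
r_squared <- 0.9879   # Multiple R-squared (rounded in the output)
p         <- 5        # number of predictors
df_resid  <- 9992     # residual degrees of freedom

# Overall F-statistic implied by R-squared
f_stat <- (r_squared / p) / ((1 - r_squared) / df_resid)

# Upper-tail p-value of the F test; underflows to 0 for a statistic this large
p_value <- pf(f_stat, df1 = p, df2 = df_resid, lower.tail = FALSE)

print(c(F = f_stat, p = p_value))
```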
Question 1.4
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -33.2666177 0.135354522 -245.77397 0.000000e+00
## Hours_Studied 2.8568131 0.008160465 350.07972 0.000000e+00
## Previous_Scores 1.0186941 0.001218305 836.15686 0.000000e+00
## Sleep_Hours 0.4819703 0.012462285 38.67432 4.819750e-305
## Extracurricular_ActivitiesYes 0.6271216 0.042270067 14.83607 2.858808e-49
The p-values of the coefficients in the multiple linear regression summary show which predictors have a
statistically significant association with the response variable (Performance_Index). Predictors with low p-values,
usually less than 0.05, are considered statistically significant. The predictors that have a statistically significant
relationship to the response are: Hours_Studied, Previous_Scores, Sleep_Hours, and Extracurricular_Activities.
Academic_Year (p = 0.557) is not significant.
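The selection rule can be written as a simple filter on the p-values. This is a minimal sketch using the values printed in the outputs above (the Academic_Year p-value comes from the Question 1.2 summary); `alpha` and the vector names are chosen here for illustration.

```r
# p-values copied from the coefficient tables above
p_values <- c(
  Hours_Studied                 = 0,             # printed as 0.000000e+00
  Previous_Scores               = 0,             # printed as 0.000000e+00
  Sleep_Hours                   = 4.819750e-305,
  Extracurricular_ActivitiesYes = 2.858808e-49,
  Academic_Year                 = 0.557
)

alpha <- 0.05
significant <- names(p_values)[p_values < alpha]
print(significant)
```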
Question 1.5
Question 1.6
Our figure shows points that lie far from the center of the plot and outside the Cook's distance boundaries,
indicating that there may be severe outliers or high-leverage points (e.g. points 685 and 7469).
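Influential points can also be listed numerically instead of being read off the plot. The sketch below uses made-up data with one injected outlier (not the assignment data) and flags observations whose Cook's distance exceeds 4/n, which is a common rule of thumb and an assumption here rather than the cutoff drawn in the figure.

```r
# Hypothetical data: a clean linear trend with one gross outlier at the largest x
set.seed(42)
n <- 50
x <- 1:n
y <- 2 * x + rnorm(n)
y[n] <- y[n] + 200

fit <- lm(y ~ x)
cd  <- cooks.distance(fit)   # one value per observation

threshold <- 4 / n           # rule-of-thumb cutoff, not a formal test
flagged   <- which(cd > threshold)
print(flagged)
```

Applied to the fitted Performance_Index model, the same two calls would return the row numbers to inspect.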
Question 1.7
## Warning: package 'car' was built under R version 4.3.3
## Loading required package: carData
## Warning: package 'carData' was built under R version 4.3.3
## Hours_Studied Previous_Scores
## 1.000217 1.000263
## Sleep_Hours Extracurricular_Activities
## 1.000609 1.000624
## Academic_Year
## 1.000083
A VIF of 1 means that a predictor is uncorrelated with the other predictor variables in the model. Since every
value is very close to 1, there is no multicollinearity issue.
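For numeric predictors, the VIF of predictor j equals 1 / (1 − R²_j), where R²_j comes from regressing that predictor on all the others. The sketch below computes this by hand in base R on made-up independent predictors (`x1`, `x2`, `x3` are hypothetical), which is why the values land near 1, as in the output above.

```r
# Hypothetical, mutually independent predictors
set.seed(7)
n <- 1000
predictors <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))

# Regress each predictor on the remaining ones and convert R-squared to a VIF
vif_manual <- sapply(names(predictors), function(v) {
  others <- setdiff(names(predictors), v)
  r2 <- summary(lm(reformulate(others, response = v), data = predictors))$r.squared
  1 / (1 - r2)
})
print(vif_manual)
```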
Question 1.8
##
## Call:
## lm(formula = Performance_Index ~ Hours_Studied + Previous_Scores +
## Sleep_Hours + Extracurricular_Activities + Academic_Year +
## Sleep6 + Extracurricular_Activities:Sleep6, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.9105 -1.4141 0.0072 1.4011 8.8894
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -33.561057 0.236011 -142.201 <2e-16
## Hours_Studied 2.856778 0.008160 350.088 <2e-16
## Previous_Scores 1.018685 0.001218 836.085 <2e-16
## Sleep_Hours 0.518300 0.026027 19.914 <2e-16
## Extracurricular_ActivitiesYes 0.605260 0.059192 10.225 <2e-16
## Academic_Year 0.008660 0.014875 0.582 0.560
## Sleep6 0.118750 0.098042 1.211 0.226
## Extracurricular_ActivitiesYes:Sleep6 0.043896 0.084568 0.519 0.604
##
## (Intercept) ***
## Hours_Studied ***
## Previous_Scores ***
## Sleep_Hours ***
## Extracurricular_ActivitiesYes ***
## Academic_Year
## Sleep6
## Extracurricular_ActivitiesYes:Sleep6
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.112 on 9990 degrees of freedom
## (2 observations deleted due to missingness)
## Multiple R-squared: 0.9879, Adjusted R-squared: 0.9879
## F-statistic: 1.167e+05 on 7 and 9990 DF, p-value: < 2.2e-16
After adding this interaction term to the linear regression model, the R-squared value remains the same (0.9879)
and the interaction coefficient is not significant (p = 0.604), which indicates that the fitted model does not
improve with this interaction.
Question 1.9
## The predicted performance index for the given student based on the specified values for Hours Studied, Previous Scores, Extracurricular Activities, and Sleep Hours, using the final regression model is 62.84815
Question 2
Question 2.1
##
## Attaching package: 'dplyr'
## The following object is masked from 'package:car':
##
## recode
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
## Warning: package 'corrplot' was built under R version 4.3.3
## corrplot 0.92 loaded
Question 2.2
##
## Call:
## glm(formula = HeartDisease ~ Age + BMI + SleepTime + Sex + Smoking +
## AlcoholDrinking + Stroke + DiffWalking + Diabetic + Asthma,
## family = binomial, data = Heart_data)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -7.062934 0.320354 -22.047 < 2e-16 ***
## Age 0.050625 0.002754 18.379 < 2e-16 ***
## BMI 0.026195 0.005990 4.373 1.22e-05 ***
## SleepTime -0.041003 0.024643 -1.664 0.0961 .
## Sex 0.662987 0.079952 8.292 < 2e-16 ***
## Smoking 0.530962 0.078739 6.743 1.55e-11 ***
## AlcoholDrinking -0.481783 0.204702 -2.354 0.0186 *
## Stroke 1.278899 0.125199 10.215 < 2e-16 ***
## DiffWalking 0.664094 0.089216 7.444 9.79e-14 ***
## Diabetic 0.843162 0.086003 9.804 < 2e-16 ***
## Asthma 0.602291 0.101422 5.938 2.88e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 6035.7 on 9993 degrees of freedom
## Residual deviance: 4773.0 on 9983 degrees of freedom
## (6 observations deleted due to missingness)
## AIC: 4795
##
## Number of Fisher Scoring iterations: 6
In the summary output, the p-values in the “Pr(>|z|)” column identify the statistically significant predictors. A
predictor is deemed statistically significant if its p-value is low, usually less than 0.05. Therefore, age, BMI, sex,
smoking, alcohol drinking, stroke, difficulty walking, diabetes, and asthma are the significant predictors.
SleepTime (p = 0.0961) is not significant at the 0.05 level.
Question 2.3
## the confusion matrix is
## Predicted
## Actual 0 1
## 0 9007 90
## 1 809 88
## the overall fraction of correct predictions is 0.910046
Actual 0 (No Event): There were 9,097 observations in which the actual result was 0 (No Event). Among these,
9,007 were correctly classified by the model as 0 (true negatives), while 90 were incorrectly classified as 1 (false
positives).
Actual 1 (Event): There were 897 observations in which the actual result was 1 (Event). Among these, the model
correctly identified only 88 as 1 (true positives), while it incorrectly classified 809 as 0 (false negatives).
Accuracy: The overall proportion of correct predictions is (TP + TN) / Total, where TP stands for true positives,
TN for true negatives, and Total for the total number of observations. This works out to
(9007 + 88) / (9007 + 90 + 809 + 88) = 9095 / 9994, or about 0.9100 (91.0%) in this instance.
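The arithmetic can be reproduced directly from the confusion matrix. The sketch below rebuilds the matrix from the counts printed above and also reports sensitivity and specificity, which are standard companions to accuracy but were not part of the original output; the object names are chosen here for illustration.

```r
# Confusion matrix from the output above (rows = actual, columns = predicted);
# matrix() fills column by column
cm <- matrix(c(9007, 809, 90, 88), nrow = 2,
             dimnames = list(Actual = c("0", "1"), Predicted = c("0", "1")))

tn <- cm["0", "0"]; fp <- cm["0", "1"]
fn <- cm["1", "0"]; tp <- cm["1", "1"]

accuracy    <- (tp + tn) / sum(cm)   # share of all predictions that are correct
sensitivity <- tp / (tp + fn)        # share of actual 1s that are detected
specificity <- tn / (tn + fp)        # share of actual 0s that are detected

print(round(c(accuracy = accuracy, sensitivity = sensitivity,
              specificity = specificity), 4))
```

The low sensitivity shows that the high accuracy is driven mostly by the large number of true negatives.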
Question 2.4
In logistic regression, the estimated coefficient of a predictor variable, such as age or sex (male/female),
represents its effect on the log-odds of the binary outcome (heart disease in this example), holding the other
predictors constant. Consider the coefficient of the Sex predictor (0.663). Given that the coefficient is positive,
and assuming males are coded as 1, males have higher odds of heart disease than females; exponentiating gives
an odds ratio of about 1.94. The strength of the effect is indicated by the coefficient's magnitude: a larger
positive (or more negative) coefficient indicates a stronger effect per unit of the predictor. In a similar vein,
the coefficient of Age is positive (0.051), indicating that the odds of heart disease rise with age, by roughly 5%
per additional year. In other words, older individuals are more likely to have heart disease.
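Exponentiating a logistic regression coefficient turns it into an odds ratio, which is often easier to read than the raw log-odds. The sketch below uses the Age and Sex estimates from the Question 2.2 summary; the ten-year figure is an illustrative extra and not part of the original answer.

```r
# Log-odds coefficients copied from the glm summary in Question 2.2
log_odds <- c(Age = 0.050625, Sex = 0.662987)

# Odds ratio for a one-unit increase in each predictor
odds_ratios <- exp(log_odds)

# Odds ratio for a ten-year increase in Age (coefficients add on the log-odds scale)
ten_year_or <- unname(exp(10 * log_odds["Age"]))

print(round(odds_ratios, 3))
print(round(ten_year_or, 3))
```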
Question 2.5
## Warning: package 'caret' was built under R version 4.3.3
## Loading required package: lattice
## Predicted
## Actual 0 1
## 0 1788 19
## 1 175 17
## [1] 0.9029515
Question 2.6
## Warning: package 'MASS' was built under R version 4.3.3
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
## Warning in FUN(newX[, i], ...): no non-missing arguments to min; returning Inf
## Predicted
## Actual 0 1
## 0 1760 47
## 1 153 39
## [1] 0.89995
Question 2.7
## Warning in FUN(newX[, i], ...): no non-missing arguments to min; returning Inf
## Predicted
## Actual 0 1
## 0 2482 230
## 1 169 118
## [1] 0.8669557
Question 2.8
## Warning: package 'e1071' was built under R version 4.3.3
## prediction_for_NB
## 0 1
## 0 2475 238
## 1 168 119
##
## Accuracy of 'Naive Bayes' testing data: 0.8646667
Question 2.9
## Predicted
## Actual 0 1
## 0 2548 165
## 1 236 51
## [1] 0.8663333
Question 2.10
## Warning in [Link](x, y, weights = w, ...): You are trying to do
## regression and your outcome only has two possible values Are you trying to do
## classification? If so, use a 2 level factor as your outcome column.
## [1] 20
## Predicted
## Actual 0 1
## 0 2710 3
## 1 287 0
## [1] 0.9033333
Question 2.11
All of the classifiers reach a test accuracy between about 86% and 90%. The KNN classifier with K chosen by
cross-validation has the highest accuracy, at approximately 90.33%, closely followed by logistic regression at
90.30%. Note, however, that the KNN confusion matrix contains no true positives, so its accuracy comes almost
entirely from predicting class 0.
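Because heart disease is rare in the test data, accuracy is worth reading next to sensitivity and next to the accuracy of always predicting 0. The sketch below does this for the two best models, using the counts from the confusion matrices in Questions 2.5 and 2.10; the helper `metrics` is defined here for illustration only.

```r
# Accuracy, sensitivity, and the accuracy of always predicting class 0
metrics <- function(tn, fp, fn, tp) {
  total <- tn + fp + fn + tp
  c(accuracy          = (tn + tp) / total,
    sensitivity       = tp / (tp + fn),
    majority_baseline = (tn + fp) / total)
}

logistic <- metrics(tn = 1788, fp = 19, fn = 175, tp = 17)  # Question 2.5
knn_cv   <- metrics(tn = 2710, fp = 3,  fn = 287, tp = 0)   # Question 2.10

print(round(rbind(logistic, knn_cv), 4))
```

Both models sit slightly below their majority-class baseline, which suggests that accuracy alone does not separate them well on this data.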
Question 2.12
## Predictors: Age_BMI_SleepTime_Sex_Smoking_AlcoholDrinking_Stroke_DiffWalking_Diabetic_Asthma
## Confusion Matrix:
## Predicted
## Actual 0 1
## 0 2679 34
## 1 263 24
##
## Predictors: Sex_Age_SleepTime
## Confusion Matrix:
## Predicted
## Actual 0 1
## 0 2712 1
## 1 286 1
##
## Predictors: Sex_Age_Smoking_Stroke
## Confusion Matrix:
## Predicted
## Actual 0 1
## 0 2693 20
## 1 272 15
##
## Predictors: Sex_Age_SleepTime_BMI
## Confusion Matrix:
## Predicted
## Actual 0 1
## 0 2709 4
## 1 285 2
##
## Predictors: Sex_Age_SleepTime_BMI_Smoking
## Confusion Matrix:
## Predicted
## Actual 0 1
## 0 2706 7
## 1 281 6
##
## Predictors: Sex_Age_SleepTime_BMI_Diabetic
## Confusion Matrix:
## Predicted
## Actual 0 1
## 0 2704 9
## 1 280 7
##
## Predictors: Sex_Age_SleepTime_BMI_Diabetic_Smoking
## Confusion Matrix:
## Predicted
## Actual 0 1
## 0 2703 10
## 1 278 9
##
## Predictors: Sex_Age_SleepTime_BMI_Diabetic_Smoking_Stroke
## Confusion Matrix:
## Predicted
## Actual 0 1
## 0 2689 24
## 1 263 24
Question 2.13.a
##
## Attaching package: 'boot'
## The following object is masked from 'package:lattice':
##
## melanoma
## The following object is masked from 'package:car':
##
## logit
## Age BMI SleepTime Sex Smoking
## 0.308324917 0.002467518 0.005949438 0.025436124 0.078286538
## AlcoholDrinking Stroke DiffWalking Diabetic Asthma
## 0.077786822 0.204559764 0.134952458 0.093821823 0.093492344
## <NA>
## 0.107209235
Question 2.13.b
Question 2.13.c
The standard error reported by glm() in (2.2) is based on the asymptotic properties of the maximum likelihood
estimator, whereas the bootstrap standard error is generated empirically by resampling. In cases where the
asymptotic normality assumption does not hold well, the bootstrap standard error can provide a more accurate
approximation of the standard error.
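The two kinds of standard error can be compared side by side. The sketch below is a base-R illustration on made-up data (it does not use the boot package or the heart disease data, and `x`, `y`, and `B` are hypothetical choices): rows are resampled with replacement, the model is refit each time, and the standard deviation of the refitted slope is the bootstrap standard error.

```r
# Hypothetical data for a one-predictor logistic regression
set.seed(123)
n <- 400
x <- rnorm(n)
y <- rbinom(n, size = 1, prob = plogis(-1 + 0.8 * x))
dat <- data.frame(x = x, y = y)

# Asymptotic (maximum likelihood) standard error from glm()
fit    <- glm(y ~ x, family = binomial, data = dat)
se_glm <- summary(fit)$coefficients["x", "Std. Error"]

# Bootstrap: resample rows with replacement and refit B times
B <- 200
boot_coefs <- replicate(B, {
  idx <- sample(n, replace = TRUE)
  coef(glm(y ~ x, family = binomial, data = dat[idx, ]))["x"]
})
se_boot <- sd(boot_coefs)

print(c(glm = se_glm, bootstrap = se_boot))
```

With a well-behaved model and a reasonably large sample, the two values are expected to be similar; larger gaps hint that the asymptotic approximation is less reliable.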
Question 2.13.d
## 2.5% 97.5%
## 1.05091 1.05930