Title : Machine Learning–Based Prediction of Heart Disease Using Logistic Regression, Support Vector
Machine, and Random Forest Classifier
Abstract
Heart disease is one of the leading causes of death worldwide, which creates a strong need for early
and accurate diagnostic methods that support clinical decision-making. In this project, a machine
learning-based predictive system for heart disease was developed and evaluated using a real-world
Heart Failure Prediction dataset containing 918 anonymized patient records and 11 clinical attributes.
During data preprocessing, medically impossible values were identified and treated: invalid
cholesterol readings were replaced with the median, nonsensical entries were removed, categorical
variables were encoded, and features were standardized to prepare the dataset for model training.
Three supervised learning algorithms, Logistic Regression, Support Vector Machine (SVM) with an RBF
kernel, and Random Forest, were then implemented and their performance in binary classification was
compared.
This study also aims to present a detailed overview of existing machine learning approaches used in
heart disease prediction and to identify the obstacles to applying these approaches in the healthcare
domain.
1. Introduction
Heart disease encompasses various disorders that disrupt the normal operation of the heart and blood
vessels. It remains a major contributor to global mortality and morbidity: millions of people worldwide
are already living with a diagnosis, and millions of new cases are diagnosed every year. Because the
disease is so widespread and can be fatal if left untreated, early diagnosis and prompt treatment have
become crucial in the management of heart disease, both to improve patient outcomes and to mitigate
long-term complications (Misra et al., 2023; Saeed et al., 2023).
Recent breakthroughs in artificial intelligence and data-driven technologies have provided new ways
of improving clinical decision-making. Among them, machine learning models are increasingly used to
assist medical staff in recognizing disease, determining risk levels, and providing preventive
healthcare. Using multidimensional data such as the patient's age, cholesterol level, blood pressure,
and electrocardiogram (ECG) results, machine learning is expected to detect early signs of heart
disease more efficiently than conventional approaches and manual assessment alone. These algorithms
can not only identify patterns hidden in such clinical data but also provide quick, consistent, and
easy-to-interpret predictions that help healthcare providers in their decision-making (Mythili et al.,
2013; Nasution et al., 2025; Rani et al., 2021).
This project centers on the creation of a heart disease prediction model based on machine learning.
Its intent is to develop accurate and trustworthy models capable of identifying people at high risk of
heart disease by applying supervised learning algorithms to a real Heart Failure Prediction dataset.
The process consists of data preprocessing, feature encoding, scaling, model training, and assessment,
carried out in a structured manner consistent with standard clinical prediction practices.
Illustration of sources contributing to healthcare data
2. Literature Review
Disease Prediction in Healthcare
Recent advances in data mining and AI have significantly changed disease prediction in healthcare
systems. The rapid growth of electronic records, medical imaging, and clinical datasets has made it
possible for researchers to apply machine learning approaches to support diagnosis, risk
assessment, and the clinical decision-making process.
Predictive analytics has been used to identify patterns in patient data that might not be
detected easily using traditional statistical approaches, thus improving diagnostic accuracy and
treatment planning.
The reviewed literature shows that disease prediction models are mostly applied to life-threatening
and chronic conditions such as diabetes, heart disease, neurological disorders, and cancer.
It also shows that machine learning-based systems can help healthcare experts by reducing
diagnostic errors, enhancing efficiency, and enabling personalized medicine. However, the
effectiveness of such systems largely depends on data quality, the choice of predictive algorithm,
and feature selection.
Machine Learning Techniques for Disease Prediction
Traditional machine learning algorithms remain widely used in healthcare predictive analytics
because of their low computational complexity and interpretability. Among these, Logistic
Regression, Support Vector Machines, and Random Forest are the most commonly reported in the
literature.
Logistic Regression is a commonly used model for binary disease classification problems. The
literature shows that it is highly valued in clinical settings because of its simplicity and
interpretability, which enable healthcare professionals to understand the influence of individual
features on the prediction results.
Several studies have used Logistic Regression to predict diseases such as diabetes and heart disease,
achieving reliable baseline performance. Its main limitation is its inability to model complex
non-linear relationships within medical data.
Random Forest, on the other hand, has been studied extensively as an ensemble learning
technique that enhances predictive performance by combining several decision trees.
Random Forest models are valued for handling high-dimensional healthcare datasets, feature
interactions, and missing values. Most studies show that Random Forest outperforms single
classifiers such as Logistic Regression, especially on complex disease prediction tasks. Despite
its strong predictive performance, Random Forest is often criticized for its reduced
interpretability, which might hinder its acceptance in sensitive clinical environments
where explainability is important.
Support Vector Machines have been widely used in healthcare analytics, especially in scenarios
that involve high-dimensional data. The literature shows that SVMs are effective at separating
complex class boundaries and have been used successfully for disease diagnosis and prognosis.
However, their performance depends on kernel selection. Additionally, SVMs can be computationally
costly when applied to large-scale healthcare datasets, which might hinder their scalability in
real-world applications.
Model | Common Applications in Healthcare | Strengths | Limitations
Logistic Regression | Heart disease, diabetes, cancer risk prediction | Simple, interpretable, low computational cost | Limited to linear relationships
Random Forest | Chronic disease prediction, clinical decision support | High accuracy, handles non-linearity and missing data | Reduced interpretability
Support Vector Machine (SVM) | Disease classification, medical diagnosis | Effective in high-dimensional data | Computationally expensive, sensitive to parameter tuning
Summary of Machine Learning Models Used in Healthcare Disease Prediction
Performance Evaluation in Existing Studies
The reviewed studies commonly evaluate disease prediction models using performance
metrics such as accuracy, precision, recall, F1-score, and the area under the receiver operating
characteristic curve (AUC-ROC). The literature states that accuracy alone is not enough in
healthcare applications because of class imbalance, where the number of healthy cases
exceeds the number of disease cases. Metrics such as recall and AUC-ROC are therefore considered
important, since they better reflect the ability of the model to detect true disease cases.
Cross-validation methods are often used to enhance the robustness of the model and reduce
overfitting. However, the literature also notes that variations in datasets, preprocessing
techniques, and evaluation strategies make direct comparison of model performance across studies
challenging.
Metric | Description | Importance in Healthcare
Accuracy | Overall correctness of predictions | Useful but misleading for imbalanced datasets
Precision | Correct positive predictions | Important to reduce false positives
Recall (Sensitivity) | Ability to detect actual disease cases | Critical for early diagnosis
F1-Score | Balance between precision and recall | Useful for imbalanced data
AUC-ROC | Model discrimination ability | Widely used for clinical evaluation
Performance evaluation metrics and their importance in healthcare
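For reference, the threshold-based metrics in the table can be written in terms of the confusion matrix counts. The symbols TP, TN, FP, and FN (true/false positives and negatives) are standard notation introduced here for illustration and are not taken from the reviewed studies.

```latex
\begin{align*}
\text{Accuracy}  &= \frac{TP + TN}{TP + TN + FP + FN} \\
\text{Precision} &= \frac{TP}{TP + FP} \\
\text{Recall}    &= \frac{TP}{TP + FN} \\
F_1 &= \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}
\end{align*}
```

Accuracy counts both classes together, which is why it can stay high on an imbalanced dataset even when recall on the disease class is poor.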
2.4 Limitations and Research Gaps
Despite the promising results reported in the literature, several limitations remain. Many
studies depend on small or single-source datasets, which limits the generalisability of the
predictive models. Model reliability is also strongly affected by data imbalance and noise in
medical data. In addition, the lack of model interpretability in ensemble and kernel-based
techniques raises concerns regarding clinical trust and adoption.
Ethical problems such as data privacy, fairness, and bias are identified as critical challenges in
healthcare predictive analytics. The survey emphasizes the need for transparent, ethically
responsible machine learning models that can be integrated into real-world healthcare systems.
Challenge | Description
Data Imbalance | Disease cases are often underrepresented
Model Interpretability | Complex models lack transparency
Data Quality | Missing, noisy, or inconsistent data
Ethical Concerns | Bias, privacy, and fairness issues
2.5 Summary
In summary, the reviewed literature shows that machine learning methods play an important role
in healthcare disease prediction. Logistic Regression offers interpretability, Random Forest
provides strong predictive performance, and the Support Vector Machine handles complex decision
boundaries effectively. However, challenges related to interpretability, data quality, and ethical
considerations remain unsolved. These findings justify the use of Logistic Regression, Random
Forest, and Support Vector Machines for the comparative evaluation in this study.
The aim of using these data mining and AI techniques is to evaluate and compare their effectiveness
in predicting disease outcomes, with a major focus on accuracy, F1-score, recall, and
AUC-ROC. This analysis aims to identify a suitable model that is reliable for disease prediction
while considering challenges such as data imbalance and model interpretability.
The findings will support the development of an effective predictive system, thus contributing to
enhanced clinical decision support and early disease detection.
PART B
4. Methodology
This study followed a data mining and machine learning process aimed at creating an accurate heart
disease prediction model. The workflow was divided into four main parts: exploratory data analysis
(EDA), data cleaning, model building, and model testing.
4.1 Dataset Description
The dataset used in this research was obtained from the Kaggle Heart Disease Dataset. It
comprises 918 patient records with 11 clinical attributes related to cardiovascular health. Each
record represents an individual patient and includes both numerical and categorical attributes such as
age, resting blood pressure, cholesterol level, maximum heart rate achieved, chest pain type, and ST
segment slope. In the target variable, 0 represents no heart disease and 1 represents the presence of
heart disease.
Table 4.1 presents the features of the dataset and their descriptions.
Feature | Description
Age | Age of patient (years)
Sex | Biological sex (M/F)
ChestPainType | Type of chest pain (ATA, NAP, ASY, TA)
RestingBP | Resting blood pressure (mm Hg)
Cholesterol | Serum cholesterol (mg/dl)
FastingBS | Fasting blood sugar > 120 mg/dl (1 = True, 0 = False)
RestingECG | Resting electrocardiographic results
MaxHR | Maximum heart rate achieved
ExerciseAngina | Exercise-induced angina (Yes/No)
Oldpeak | ST depression induced by exercise
ST_Slope | Slope of peak exercise ST segment
HeartDisease | Target variable (1 = presence, 0 = absence of disease)
The dataset exhibits a relatively balanced class distribution, which reduces the risk of model bias
toward a dominant class and allows the use of standard classification evaluation metrics such as
accuracy, precision, recall, and F1-score.
Assumptions and Limitations
Several assumptions and limitations are associated with this dataset. First, it is
assumed that all clinical measurements were recorded accurately and thus reflect the patients'
true conditions.
Second, the dataset is retrospective and was collected from a limited number of clinical sources,
so the findings may not generalize to all populations. In addition, genetic, lifestyle, and
socioeconomic factors that may affect the risk of heart disease are not included, which limits
predictive completeness. Despite these limitations, the dataset is still suitable for developing
and evaluating machine learning-based heart disease prediction models.
4.2 Exploratory Data Analysis (EDA)
The main aim of the exploratory data analysis was to get a clearer view of the dataset in terms of its
structure, distribution, and initial quality.
Pandas inspection functions were used for checking the data types, computing basic statistics,
counting missing values, and detecting duplicates.
The analysis yielded the following findings:
The dataset has 918 observations and 11 clinical attributes.
There were neither missing values nor duplicates in the dataset.
Some features, such as Cholesterol, RestingBP, and MaxHR, had unrealistic values or clearly visible
outliers.
The EDA thus identified the data that needed correction and helped in deciding the preprocessing
strategy.
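The checks described above can be sketched as follows. This is a minimal illustration, not the study's actual notebook: the DataFrame name `df`, the tiny stand-in data, and the specific calls chosen (`info`, `describe`, `isnull().sum()`, `duplicated().sum()`) are assumptions about how such an inspection is commonly done in pandas.

```python
import pandas as pd

# Hypothetical miniature stand-in for the dataset; the real file has 918 rows.
df = pd.DataFrame({
    "Age": [40, 49, 37, 48],
    "RestingBP": [140, 160, 0, 138],
    "Cholesterol": [289, 180, 0, 214],
    "MaxHR": [172, 156, 98, 108],
    "HeartDisease": [0, 1, 0, 1],
})

df.info()                                  # column dtypes and non-null counts
summary = df.describe()                    # basic statistics per numeric column
missing_per_column = df.isnull().sum()     # missing values per column
duplicate_rows = int(df.duplicated().sum())  # number of fully duplicated rows

# A minimum of 0 in RestingBP or Cholesterol flags a physiologically
# impossible reading even though the cell is not formally "missing".
zero_counts = (df[["RestingBP", "Cholesterol"]] == 0).sum()
print(summary.loc["min"])
print(zero_counts)
```

Looking at the per-column minimum is what exposes zero-valued readings that a missing-value count alone would not report.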
4.2 Data Pre-Processing
Data preprocessing was the final cleaning step, used to remove data-quality issues and to format the
dataset for the machine learning algorithms.
4.2.1. Medically Impossible Values Handling
Even though there were no missing values, some features had physiologically impossible values. For
example, some cholesterol readings were recorded as 0.
The following actions were taken:
RestingBP: One row with an invalid value was deleted.
Cholesterol: 172 values were found to be medically impossible. Removing all of them would have
reduced the dataset considerably, leaving fewer than 750 records. Therefore, median imputation
was applied.
MaxHR: No invalid values were found. Medically plausible outliers were kept to maintain
natural clinical variability.
Although median imputation preserves the dataset size, it may reduce the variability in
cholesterol levels and thereby bias model learning. Other techniques such as KNN imputation were
considered but rejected due to increased computational complexity and the risk of overfitting on a
relatively small dataset.
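The cleaning steps above could look roughly like this in pandas. The sketch assumes that invalid readings are encoded as 0 and that the median is computed over the valid (non-zero) cholesterol values only; the text does not state either detail, and the small DataFrame is illustrative data, not the study's.

```python
import numpy as np
import pandas as pd

# Illustrative stand-in data; zeros mark physiologically impossible readings.
df = pd.DataFrame({
    "RestingBP": [140, 0, 130, 138, 150],
    "Cholesterol": [289, 200, 0, 214, 0],
})

# Drop rows with an impossible resting blood pressure of 0.
df = df[df["RestingBP"] > 0].copy()

# Median of the valid cholesterol readings only (assumption: zeros excluded).
valid_median = df.loc[df["Cholesterol"] > 0, "Cholesterol"].median()

# Treat zero cholesterol as missing, then fill with that median.
df["Cholesterol"] = df["Cholesterol"].replace(0, np.nan).fillna(valid_median)
print(df)
```

Computing the median before excluding the zeros would pull the imputed value toward 0, which is why the sketch filters first.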
4.2.2. Data Visualization for Validation
Distribution plots were produced for each of the five numerical variables both before and after
cleaning, as shown in the Confusion Matrix & ROC diagrams.
The plots confirmed the following:
The medically implausible values had been properly replaced.
The statistically identified outliers had been retained, because their presence was considered to be
clinically valid.
The dataset therefore kept its diversity and was not over-sanitized.
4.2.3. Encoding of Categorical Variables
Categorical features such as Sex, ChestPainType, and ST_Slope were mapped to numerical values
through label encoding, making them compatible with the machine learning algorithms.
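A minimal sketch of label encoding with scikit-learn follows. The use of `LabelEncoder`, the `encoders` dictionary, and the example category values for ST_Slope are assumptions for illustration; the text only states that label encoding was applied.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Illustrative rows; the ST_Slope values shown here are assumed examples.
df = pd.DataFrame({
    "Sex": ["M", "F", "M"],
    "ChestPainType": ["ATA", "NAP", "ASY"],
    "ST_Slope": ["Up", "Flat", "Up"],
})

encoders = {}
for column in ["Sex", "ChestPainType", "ST_Slope"]:
    encoder = LabelEncoder()
    # Classes are sorted alphabetically and mapped to 0, 1, 2, ...
    df[column] = encoder.fit_transform(df[column])
    encoders[column] = encoder  # kept so codes can be mapped back to labels

print(df)
print(list(encoders["ChestPainType"].classes_))
```

Label encoding assigns an arbitrary integer order to categories. Tree-based models are largely insensitive to this, whereas linear models and SVMs may read the codes as magnitudes, so one-hot encoding is a common alternative for nominal features.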
4.2.4. Train–Test Split
The dataset was separated into:
80% for training
20% for testing
A fixed random state and data shuffling were used to ensure reproducibility and to prevent
sampling bias.
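The split can be sketched with scikit-learn's `train_test_split`. The seed value 42 and the placeholder arrays are assumptions; the row count of 917 assumes that the single invalid RestingBP row had already been removed.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder feature matrix and labels with 917 rows and 11 features.
X = np.zeros((917, 11))
y = np.array([0, 1] * 458 + [0])

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,     # 20% held out for testing
    shuffle=True,      # shuffle rows before splitting
    random_state=42,   # assumed seed; any fixed value gives reproducibility
)
print(X_train.shape, X_test.shape)
```

With 917 rows, a 20% test fraction gives 184 test samples, which matches the total of each confusion matrix reported in Section 4.4.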
4.2.5. Feature Scaling
Standard scaling was fitted on the training set, and the same scaling parameters were then applied
to the test set. This step was essential because algorithms such as Logistic Regression and SVM are
sensitive to feature magnitude. Fitting the scaler on the training data only helps to prevent data
leakage and supports a fair model evaluation.
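The fit-on-train, transform-on-test pattern can be sketched with scikit-learn's `StandardScaler`. The small arrays are illustrative values, not data from the study.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Illustrative values: columns could stand for Age and RestingBP.
X_train = np.array([[40.0, 140.0], [50.0, 160.0], [60.0, 120.0]])
X_test = np.array([[55.0, 150.0]])

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learns mean/std from training data only
X_test_scaled = scaler.transform(X_test)        # reuses the training mean/std

print(scaler.mean_)      # per-feature training means
print(X_test_scaled)
```

Calling `fit_transform` on the full dataset before splitting would let test-set statistics influence the scaling, which is the leakage this ordering avoids.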
4.3. Machine Learning Model and Evaluation
Heart disease prediction is a binary classification problem, and the three models were chosen for
their ability to handle structured clinical data efficiently and accurately in such problems.
4.3.1. Logistic Regression
Logistic Regression acts as an interpretable baseline. The dependent variable, heart disease, is binary
(0 for no disease and 1 for disease), which makes Logistic Regression a natural choice.
The approach works best when the relationship between the independent variables and the outcome is
predominantly linear.
4.3.2. Random Forest Classifier
The Random Forest classifier builds numerous decision trees and aggregates their votes to produce a
final output. The model suits the dataset because it can handle features of different types, both
numerical and categorical.
4.3.3. Support Vector Machine (SVM) with RBF Kernel
An SVM with a radial basis function (RBF) kernel was included to address the potential
complexity and non-linear relationships among heart disease risk factors. The RBF kernel implicitly
maps the original features into a higher-dimensional feature space, which allows easier separation of
the two classes, disease and non-disease.
This is particularly relevant given the overlapping characteristics present in the data. One linear
model, one ensemble-based model, and one non-linear model were selected to enable a
comparative analysis of their predictive capabilities and to identify the most
appropriate model for heart disease detection datasets.
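For reference, the RBF kernel mentioned above is usually written as follows. The notation (feature vectors x and x', width parameter gamma) is the standard textbook form and is introduced here for illustration.

```latex
K(\mathbf{x}, \mathbf{x}') = \exp\left(-\gamma \,\lVert \mathbf{x} - \mathbf{x}' \rVert^{2}\right), \qquad \gamma > 0
```

The kernel value approaches 1 for nearby points and 0 for distant ones, so gamma controls how local the decision boundary is. Because the kernel depends on Euclidean distance, unscaled features with large ranges would dominate it, which is one reason feature scaling matters for this model.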
4.3.4 Hyperparameter Configuration and Model Selection
To ensure a reliable and fair comparison among the chosen models, only limited hyperparameter
configuration was used. The Random Forest Classifier was used with 100 decision trees (n_estimators
= 100) to achieve a balance between predictive performance and computational efficiency. The
Support Vector Machine (SVM) model used a radial basis function (RBF) kernel with default
regularization (C) and gamma parameters, which are suitable for capturing non-linear
relationships in moderately sized datasets. Logistic Regression was implemented using default
regularization settings, serving as an interpretable baseline model.
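The configuration described above might be expressed in scikit-learn as in the sketch below. Several details are assumptions not stated in the text: the synthetic data from `make_classification`, the `random_state` values, `max_iter=1000` for Logistic Regression (a convergence setting, not a regularization change), and `probability=True` for the SVM (needed only if class probabilities are wanted later, for example for ROC curves).

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

# Synthetic stand-in for the scaled training data (11 features, binary target).
X_train, y_train = make_classification(n_samples=200, n_features=11, random_state=0)

models = {
    # Default regularization (C=1.0); max_iter raised only as an assumed safeguard.
    "Logistic Regression": LogisticRegression(max_iter=1000),
    # RBF kernel with default C and gamma, as described in the text.
    "SVM (RBF)": SVC(kernel="rbf", probability=True, random_state=0),
    # 100 trees, as described in the text.
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

for name, model in models.items():
    model.fit(X_train, y_train)
    print(name, "training accuracy:", round(model.score(X_train, y_train), 3))
```

Keeping all three models in one dictionary makes it straightforward to apply the same evaluation code to each of them.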
4.3.5 Validation Strategy and Evaluation Metrics
The models were evaluated with a hold-out validation strategy consisting of an 80% training set
and a 20% testing set. Although cross-validation can offer more robust performance estimates, it
was not used in this study, in order to keep the computation simple and consistent across models.
Multiple evaluation metrics were used to assess model performance. Accuracy, precision, and
recall were used, and the F1-score was selected to balance precision and recall, particularly in
cases of class imbalance.
4.4. Model Validation and Evaluation
The Logistic Regression model achieved a fair result, with 84.24% accuracy, 89.52% precision, 83.93%
recall, and an 86.64% F1-score. Although the model is effective, its results are slightly lower than
those of the SVM and Random Forest classifiers. The lower recall of 83.93% shows that the model misses
more actual positive cases than the other models, which may be crucial
depending on the application scenario. The confusion matrix gives the distribution of correct and
incorrect classifications: 61 true negatives, 94 true positives, 11 false positives, and 18 false negatives.
The model therefore has a slightly higher false negative rate than SVM and Random Forest,
meaning that more actual positive cases go undetected. This is concerning in the context of
medical diagnosis.
Logistic Regression remains a valuable baseline model due to its simplicity, speed, and interpretability,
despite its slightly lower performance. While it does not achieve the highest accuracy,
it offers an acceptable level of accuracy and a clear understanding of model behavior.
The Support Vector Machine (SVM) with the radial basis function (RBF) kernel demonstrated
strong overall performance, achieving an accuracy of 86.41%, precision of 90.65%, recall of
86.61%, and an F1-score of 88.58%.
Considered together, these metrics indicate that the model identifies a large proportion of actual
positive cases (recall) while keeping false positives low (precision). The high F1-score reflects a
good balance between precision and recall, which is highly relevant in scenarios where both false
negatives and false positives carry serious implications. The confusion matrix indicates that 62
negatives and 97 positives were correctly identified, while 10 cases were incorrectly classified as
positive and 15 as negative, which is consistent with the reported precision (97/107) and recall
(97/112).
Confusion Matrix
The figure above shows the confusion matrices. The Random Forest classifier outperformed
Logistic Regression and Support Vector Machine across all key evaluation metrics. It
achieved an accuracy of 87.50%, a precision of 91.59%, a recall of 87.50%, and an F1-score of 89.50%.
These values show that the model delivers trustworthy predictions, with good recall (sensitivity)
and precision. The balanced, high F1-score shows that the model handles the classification problem
with little trade-off between the two types of error (false positives and false negatives).
According to the confusion matrix, the model recorded 63 true negatives and 98 true positives, along
with just 9 false positives and 14 false negatives, the fewest misclassifications of all models. Its
low rates of false positives and false negatives indicate high trustworthiness. The model is
therefore well suited to settings where accurate and reliable predictions are essential.
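As a consistency check, the reported percentages can be recomputed directly from the confusion matrix counts given above for Logistic Regression and Random Forest. The helper name `metrics_from_counts` is hypothetical and used only for this illustration.

```python
def metrics_from_counts(tn, fp, fn, tp):
    """Return accuracy, precision, recall and F1 as percentages (2 decimals)."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * tp / (2 * tp + fp + fn)  # algebraically equal to 2PR / (P + R)
    return {
        "accuracy": round(100 * accuracy, 2),
        "precision": round(100 * precision, 2),
        "recall": round(100 * recall, 2),
        "f1": round(100 * f1, 2),
    }

# Counts as reported in the text.
lr = metrics_from_counts(tn=61, fp=11, fn=18, tp=94)
rf = metrics_from_counts(tn=63, fp=9, fn=14, tp=98)

print("Logistic Regression:", lr)
print("Random Forest:", rf)
```

Both sets of counts sum to 184 test samples and reproduce the reported accuracy, precision, recall, and F1-score values.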
From the validation point of view, Random Forest presents several benefits. First, its ensemble method
is effective because it combines the predictions of many decision trees, which improves robustness,
reduces variance, and helps prevent overfitting. This is particularly important when working with
large or noisy datasets. It can also handle both categorical and numerical data, and it provides
feature importance scores. With strong performance across all metrics and low misclassification
rates, Random Forest emerges as the most reliable model among the three, as illustrated in the ROC
figure.
Receiver Operating Characteristics (ROC)
The Receiver Operating Characteristic (ROC) curve plots the True Positive Rate (sensitivity) against
the False Positive Rate (1 - specificity) across various classification thresholds. The visualization
above aids in understanding how well a model can distinguish the positive class from the negative class.
In this study, three models, specifically Logistic Regression, Support Vector Machine (SVM), and Random
Forest, were employed, and their ROC curves were plotted for comparative analysis. Additionally, the
ROC curve was used to calculate the Area Under the Curve (AUC), which acts as a single scalar
metric summarizing the model's performance. AUC values range from 0 to 1, with higher values
indicating better discrimination. Among the three models, the Random Forest Classifier
attained the highest AUC (0.9391), demonstrating its strong ability to differentiate between
patients with and without heart disease. The ROC curve not only validates the model's effectiveness but
also provides a ranking of the models' performances, making it a useful step in
selecting the most appropriate model for the chosen case.
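The ROC construction and AUC calculation can be sketched with scikit-learn. The labels and scores below are made-up values for illustration, not outputs of the study's models.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

# Made-up true labels and predicted probabilities for the positive class.
y_true = np.array([0, 0, 1, 1, 0, 1, 1, 0])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.7, 0.6])

# Each threshold yields one (false positive rate, true positive rate) point.
fpr, tpr, thresholds = roc_curve(y_true, y_score)
auc = roc_auc_score(y_true, y_score)

for f, t in zip(fpr, tpr):
    print(f"FPR={f:.2f}  TPR={t:.2f}")
print("AUC:", auc)
```

The AUC equals the fraction of positive-negative pairs in which the positive case receives the higher score. Here 14 of the 16 pairs are ordered correctly, giving an AUC of 0.875.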
5. Results and Performance Analysis
5.1 Overview of Results Presentation
These visual aids summarise the model performance metrics, confusion matrices, and ROC curves,
enabling direct comparison between the evaluated classifiers.
Model | Accuracy (%) | Precision (%) | Recall (%) | F1-Score (%) | AUC
Logistic Regression | 84.24 | 89.52 | 83.93 | 86.64 | 0.91
Support Vector Machine (RBF) | 86.41 | 90.65 | 86.61 | 88.58 | 0.93
Random Forest | 87.50 | 91.59 | 87.50 | 89.50 | 0.94
As shown in Table 5.1, the Random Forest classifier achieved the highest overall performance across all
evaluation metrics, including accuracy, F1-score, and AUC.
5.2 Comparative Performance of Machine Learning Models
5.3 Confusion Matrix Analysis
5.4 ROC Curve Analysis
6. Conclusion and Future Recommendations
To sum up, the study's main aim was to predict heart disease using three machine
learning models: Logistic Regression, Support Vector Machine with an RBF kernel, and
Random Forest Classifier.
Each model was evaluated using the key performance metrics of accuracy, precision, recall, F1-score,
and ROC/AUC. The Random Forest classifier performed best overall, with an accuracy of 87.50%,
precision of 91.59%, recall of 87.50%, F1-score of 89.50%, and the highest AUC
value of 0.9391.
This implies that Random Forest is the most reliable model, providing the best trade-off between
correctly identifying positive cases and minimizing false alarms. The ROC curves further
supported this result, as the area under the curve for Random Forest was higher than for SVM and
Logistic Regression. While Logistic Regression served as a
quick and easily interpretable baseline model, its lower recall and slightly higher false negative rate
indicate its limitations in high-stakes situations. SVM performed well, handling
non-linear data through the RBF kernel; however, SVMs typically require more computational resources
and more parameter tuning.
In conclusion, the project was successful in reaching its aim, but several recommendations for future
work remain. One is to collect more diverse data and to add major lifestyle factors such as smoking,
diet, and exercise. Time-based recording would enable better trend analysis. Developing a
user-friendly web or mobile app could make the model more accessible in healthcare practice.
In addition, collaboration with medical professionals would help ensure that predictions are both
practical and interpretable. Burnout effects and more refined data preprocessing could further
improve data quality. Finally, additional models such as
Decision Trees, Naive Bayes, or Neural Networks could be evaluated to see which gives the most
accurate predictions.