Diabetes Prediction Using Machine Learning
AI-Mini-5
Individual contributions
01 Shashwata Datta (LEADER): Coder, Project Report
02 Suraj Munshi: Project Report, Coder
03 Tridib Mondal: Power Point Designer
04 Ankit Dutta: Project Report, Coder
05 Sukreet Biswas: Power Point Designer & Data collection
Problem Statement
01 The Challenge:
Diabetes is a chronic illness affecting millions globally, often remaining undetected until severe complications arise.
02 Current Gaps:
Traditional diagnostic methods can be time-consuming and invasive.
Accessible tools for early detection are lacking, especially in underserved areas.
03 Our Aim:
To bridge this gap by developing a machine learning model that predicts the likelihood of diabetes early, using non-invasive data inputs.
Objectives:
01 Primary Goal:
Develop a machine learning-based model for early and accurate prediction of diabetes.
02 Specific Objectives:
Utilize available datasets to train and test the model for high accuracy.
Optimize the model for speed and reliability in predictions.
Create a user-friendly framework for real-world applications.
03 Impact:
Enable proactive healthcare interventions and reduce the burden of diabetes-related complications.
Methodology / System Design
Used the PIMA Indians Diabetes dataset, containing features such as glucose, BMI, age, and insulin level.
01 Data Preprocessing:
Standardized the features using StandardScaler to bring them to a common scale.
Split the data into features (X) and labels (Y).
02 Train-Test Split:
Used train_test_split to divide the dataset into training and test sets.
03 Model Training:
Used the Support Vector Machine (SVM) classifier from sklearn.
04 Model Evaluation:
Evaluated using accuracy score on both training and test data.
05 Prediction System:
Created a predictive function to classify new data points (likely in later notebook cells).
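The five steps above can be sketched end to end as follows. This is a minimal illustration, not the project's notebook code: the data is a synthetic stand-in generated with make_classification, and the linear kernel, the 80/20 stratified split, the random seeds, and the helper name predict_diabetes are all assumptions. The slide lists standardization before the split; the sketch instead fits the scaler on the training portion only, a common variation that keeps test data out of the scaling statistics.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Assumption: synthetic stand-in with 8 numeric features and a binary label.
# The real project would load the diabetes CSV into X (features) and Y (labels).
X, Y = make_classification(n_samples=768, n_features=8, n_informative=5,
                           random_state=2)

# Train-test split (test_size, stratify and random_state are assumed values).
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.2, stratify=Y, random_state=2)

# Standardize: learn mean and variance on the training set, apply to both sets.
scaler = StandardScaler()
X_train_std = scaler.fit_transform(X_train)
X_test_std = scaler.transform(X_test)

# Model training with an SVM classifier (kernel choice is an assumption).
classifier = SVC(kernel="linear")
classifier.fit(X_train_std, Y_train)

# Model evaluation: accuracy on both training and test data.
train_accuracy = accuracy_score(Y_train, classifier.predict(X_train_std))
test_accuracy = accuracy_score(Y_test, classifier.predict(X_test_std))


def predict_diabetes(sample):
    """Classify one new data point given as a sequence of 8 raw feature values."""
    row = np.asarray(sample, dtype=float).reshape(1, -1)
    label = classifier.predict(scaler.transform(row))[0]
    return "diabetic" if label == 1 else "not diabetic"


print(f"train accuracy: {train_accuracy:.3f}")
print(f"test accuracy:  {test_accuracy:.3f}")
print("first test sample:", predict_diabetes(X_test[0]))
```

The prediction helper reuses the fitted scaler, so a new data point is transformed with the same mean and variance the model saw during training.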
Tools & Technologies Used
01 Programming Environment
Language: Python – Versatile and widely used for machine learning tasks.
Development Tools:
Jupyter Notebook: For interactive coding and data visualization.
Google Colab: For cloud-based development and leveraging GPU/TPU
support.
02 Data Resources
Dataset:
Kaggle’s Diabetes Dataset with 2000 records and 9 features for predicting
diabetic outcomes.
Data Preprocessing:
Feature scaling, normalization, and train-test splitting using Scikit-learn’s
utilities.
Ensemble & Optimization Techniques
Ensemble Methods:
Random Forest and AdaBoost for better generalization and improved
accuracy.
03 Libraries & Frameworks
Data Handling & Preprocessing
NumPy: For numerical operations, including the array grids behind
visualizations such as SVM decision boundaries.
Pandas: Managing datasets and creating DataFrames.
Scikit-learn (StandardScaler): Standardizing and scaling
feature data.
Visualization
Matplotlib: Core plotting library for graphs, ROC curves,
and visualizations.
Seaborn: Statistical plots, including bar charts for
feature importance and accuracy.
Model Training & Implementation
Logistic Regression, SVM, KNN, Naive Bayes, Decision
Tree, Random Forest, AdaBoost: Machine learning
models used for classification tasks.
Evaluation Metrics
Scikit-learn Metrics: Accuracy, classification reports, and
ROC/AUC for performance evaluation.
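As a rough sketch of how the listed models and metrics fit together in scikit-learn, the loop below trains each classifier and reports training accuracy, test accuracy, and ROC AUC. The data is synthetic, and the hyperparameters, seeds, and the helper positive_scores are assumptions for illustration; the numbers it prints are not the project's results.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Assumption: synthetic stand-in for the diabetes data (8 features, binary label).
X, y = make_classification(n_samples=768, n_features=8, n_informative=5,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)
scaler = StandardScaler().fit(X_train)
X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)

# Default hyperparameters throughout; the project's settings are not stated.
models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "K-Nearest Neighbors": KNeighborsClassifier(),
    "SVM": SVC(),
    "Naive Bayes": GaussianNB(),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(random_state=0),
    "AdaBoost": AdaBoostClassifier(random_state=0),
}


def positive_scores(model, features):
    """Return a ranking score for the positive class, as ROC AUC requires."""
    if hasattr(model, "predict_proba"):
        return model.predict_proba(features)[:, 1]
    return model.decision_function(features)  # e.g. SVC without probabilities


results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    train_acc = accuracy_score(y_train, model.predict(X_train))
    test_acc = accuracy_score(y_test, model.predict(X_test))
    auc = roc_auc_score(y_test, positive_scores(model, X_test))
    results[name] = (train_acc, test_acc, auc)
    print(f"{name:<20} train={train_acc:.3f} test={test_acc:.3f} auc={auc:.3f}")

# Per-class precision, recall and F1 for one of the fitted models.
print(classification_report(y_test, models["Logistic Regression"].predict(X_test),
                            target_names=["non-diabetic", "diabetic"]))
```

Reporting training and test accuracy side by side makes it easy to spot models that memorize the training set.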
Results or Demo
Model Performance Metrics:
Algorithm              Training Accuracy   Test Accuracy
Logistic Regression    78.50%              75.97%
K-Nearest Neighbors    82.90%              72.08%
SVM                    82.90%              72.73%
Naive Bayes            75.57%              77.27%
Decision Tree          100.00%             70.78%
Random Forest          100.00%             75.97%
AdaBoost               80.46%              72.73%
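One way to read these figures is to look at the gap between training and test accuracy, since a large gap suggests overfitting. The short sketch below only does arithmetic on the numbers reported in the table; the variable names are illustrative.

```python
# (training accuracy %, test accuracy %) copied from the results table.
reported = {
    "Logistic Regression": (78.50, 75.97),
    "K-Nearest Neighbors": (82.90, 72.08),
    "SVM": (82.90, 72.73),
    "Naive Bayes": (75.57, 77.27),
    "Decision Tree": (100.00, 70.78),
    "Random Forest": (100.00, 75.97),
    "AdaBoost": (80.46, 72.73),
}

# Gap in percentage points; a large positive gap hints at overfitting.
gaps = {name: round(train - test, 2) for name, (train, test) in reported.items()}

best_test = max(reported, key=lambda name: reported[name][1])
largest_gap = max(gaps, key=gaps.get)

for name, gap in sorted(gaps.items(), key=lambda item: item[1], reverse=True):
    print(f"{name:<20} gap = {gap:+.2f} points")
print("highest test accuracy:", best_test)
print("largest train-test gap:", largest_gap)
```

On the reported numbers, Naive Bayes has the highest test accuracy, while the Decision Tree shows the widest gap (100.00% on training data against 70.78% on test data).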
Feature Importance:
Features like Glucose Level, BMI, and Age have significant importance
in prediction (based on Random Forest feature importance plot).
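A ranking like this can be read from a fitted Random Forest through its feature_importances_ attribute. The sketch below is illustrative only: the column names follow the commonly distributed Pima Indians Diabetes CSV (an assumption), and the data and labels are synthetic, constructed so that the "Glucose" and "BMI" columns carry the signal.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Assumption: column names as in the commonly distributed Pima CSV.
feature_names = ["Pregnancies", "Glucose", "BloodPressure", "SkinThickness",
                 "Insulin", "BMI", "DiabetesPedigreeFunction", "Age"]

# Synthetic features; the label depends mostly on column 1 ("Glucose") and,
# to a lesser extent, column 5 ("BMI"). This is not real patient data.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, len(feature_names)))
noise = rng.normal(scale=0.5, size=500)
y = (2.0 * X[:, 1] + 1.0 * X[:, 5] + noise > 0).astype(int)

forest = RandomForestClassifier(n_estimators=200, random_state=0)
forest.fit(X, y)

# Impurity-based importances are non-negative and sum to 1 across features.
importances = forest.feature_importances_
ranking = sorted(zip(feature_names, importances),
                 key=lambda pair: pair[1], reverse=True)

for name, importance in ranking:
    print(f"{name:<26} {importance:.3f}")
```

Passing the sorted names and values to a bar chart would give a feature importance plot of the kind the slide describes.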
Visual Representations:
Feature Importance Plot: Demonstrates the weight of each feature in prediction.
ROC Curve: Showcases the True Positive Rate vs. False Positive Rate across different thresholds.
Confusion Matrix: Highlights the distribution of true positives, true negatives, false positives, and false negatives.
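To make the quantities behind the confusion matrix and the ROC curve concrete, the sketch below computes them by hand on four toy labels and scores. The helper names and the toy data are hypothetical; in practice, scikit-learn's confusion_matrix and roc_curve do the same job.

```python
def confusion_counts(y_true, y_pred):
    """Return (TP, TN, FP, FN) for binary labels where 1 is the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, tn, fp, fn


def roc_points(y_true, scores):
    """Return (FPR, TPR) pairs, sweeping the threshold from strict to loose.

    Assumes both classes occur in y_true, so neither rate divides by zero.
    """
    points = [(0.0, 0.0)]  # threshold above every score: nothing predicted positive
    for threshold in sorted(set(scores), reverse=True):
        y_pred = [1 if s >= threshold else 0 for s in scores]
        tp, tn, fp, fn = confusion_counts(y_true, y_pred)
        points.append((fp / (fp + tn), tp / (tp + fn)))
    return points


def area_under_curve(points):
    """Trapezoidal area under a curve given as (x, y) points ordered by x."""
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(points, points[1:]))


# Toy example: true labels and model scores for four patients (hypothetical).
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

counts_at_half = confusion_counts(y_true, [1 if s >= 0.5 else 0 for s in scores])
curve = roc_points(y_true, scores)
auc = area_under_curve(curve)

print("TP, TN, FP, FN at threshold 0.5:", counts_at_half)
print("ROC points (FPR, TPR):", curve)
print("AUC:", auc)
```

Each ROC point is one confusion matrix summarized as two rates, which is why lowering the threshold moves the curve up and to the right.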
Challenges Faced
01 Data Challenges:
Incomplete or inconsistent data in the dataset.
Imbalanced dataset, leading to biased predictions.
02 Model Challenges:
Difficulty in selecting the optimal algorithm for the problem.
Balancing overfitting and underfitting during training.
03 Resource Limitations:
Limited computational power for training large models.
Challenges in deploying the model for real-world usage.
04 Interpretability:
Ensuring the model's predictions are understandable to non-technical stakeholders.
Future Scope:
01 Enhancing Model Accuracy:
Incorporate larger and more diverse datasets to improve generalization.
Explore advanced techniques like deep learning for better prediction performance.
02 Real-Time Predictions:
Integrate real-time data inputs from wearable devices or IoT sensors.
03 Broader Health Insights:
Extend the model to predict related conditions like hypertension or cardiovascular risks.
04 Collaboration:
Work with healthcare professionals to refine the system for clinical applications.