1. Title of the paper
Title: Comparative analysis of classification algorithms for diabetic prediction
Diabetes is a condition in which blood sugar stays above the desired level on a sustained basis.
The prime objective of this research work is to provide a better classification of diabetes.
Several existing methods have already been implemented for classifying diabetes datasets.
In the medical sector, classification systems are widely used to exploit patient data
and to build predictive models or sets of rules.
In this manuscript, we use the standard “Pima Indians Diabetes Dataset” provided by the UCI
Machine Learning Repository. We first inspect the data to check whether data cleaning is needed.
Where an attribute contains the value 0 in place of a missing measurement, we clean the data by
replacing that 0 with the mean of the attribute. We then perform feature selection in order to
increase the accuracy of the model. In the final stage, we evaluate the model by splitting the
data into training data and testing data. The performance of the different classification
algorithms is compared using accuracy.
2. System Architecture
3. MODULES
Raw Data:
The quality of the data affects the result of prediction to a large extent, and the accuracy
depends mainly on the data considered. In this work, we use an existing data set called the
“Pima Indians Diabetes Dataset” provided by the UCI Machine Learning Repository. This is a
standard dataset whose values are drawn from real instances.
Data Processing:
When we encounter a data set, we first analyze it. This step is necessary to become
familiar with the data, to gain some understanding of the potential features and to see whether
data cleaning is needed.
Diabetes data set dimensions: (768, 9)
We can observe that the data set contains 768 rows and 9 columns. ‘Outcome’ is the column
we are going to predict; it says whether the patient is diabetic or not: 1 means the person is
diabetic and 0 means the person is not.
Data Cleaning:
The dataset contains some missing values. Most of the inaccurate experimental
results were caused by these meaningless values. They can be replaced by a representative
value of the attribute, such as its mean, median or mode.
For example, in the original dataset, a value of 0 indicates that the real value was missing. To
reduce the influence of these meaningless values, we used the means from the training data to
replace all missing values.
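The cleaning step described above can be sketched as follows. This is a minimal illustration rather than the project's actual code: the helper name `impute_zeros_with_train_mean`, the list of columns treated as "0 means missing", and the tiny example frames are all assumptions made for this sketch. It takes the means from the training rows only, as the text describes.

```python
import numpy as np
import pandas as pd

# Columns where a 0 cannot be a real measurement (an assumption of this sketch).
ZERO_AS_MISSING = ["Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI"]


def impute_zeros_with_train_mean(train, test, columns=ZERO_AS_MISSING):
    """Replace 0 in `columns` with the column mean computed on `train` only."""
    train, test = train.copy(), test.copy()
    # Mark zeros as missing first so they do not pull the mean down.
    train[columns] = train[columns].replace(0, np.nan)
    test[columns] = test[columns].replace(0, np.nan)
    means = train[columns].mean()  # NaN entries are skipped by default
    train[columns] = train[columns].fillna(means)
    test[columns] = test[columns].fillna(means)
    return train, test


# Tiny made-up frames, only to show the behaviour.
train_df = pd.DataFrame({"Glucose": [100, 0, 140], "BMI": [30.0, 20.0, 0.0]})
test_df = pd.DataFrame({"Glucose": [0, 90], "BMI": [25.0, 0.0]})
clean_train, clean_test = impute_zeros_with_train_mean(
    train_df, test_df, columns=["Glucose", "BMI"]
)
print(clean_train)
print(clean_test)
```

Computing the means on the training rows alone keeps information from the testing rows out of the cleaning step, so the reported accuracy is not influenced by the data it is measured on.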
Feature Exploration:
Feature exploration is the process of transforming the gathered data into features that
better represent the problem we are trying to solve, in order to improve the model's performance
and accuracy. It creates more input features from the existing features and also combines several
features to produce more intuitive features to feed to the model.
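As a small illustration of deriving and combining features, the sketch below adds two columns to a frame that has `Glucose` and `BMI` columns. Both derived features (`Glucose_x_BMI` and `IsObese`) and the helper name `add_example_features` are hypothetical examples chosen for this sketch; the report does not state which derived features, if any, were used.

```python
import pandas as pd


def add_example_features(df):
    """Return a copy of `df` with two illustrative derived columns."""
    out = df.copy()
    # Hypothetical combined feature: product of two existing columns.
    out["Glucose_x_BMI"] = out["Glucose"] * out["BMI"]
    # Hypothetical binned feature: 1 when BMI is 30 or above, else 0.
    out["IsObese"] = (out["BMI"] >= 30).astype(int)
    return out


# Made-up rows, only to show the result.
sample = pd.DataFrame({"Glucose": [100, 150], "BMI": [25.0, 32.0]})
engineered = add_example_features(sample)
print(engineered)
```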
Model selection:
Model selection, or algorithm selection, is the most exciting phase and the heart of
machine learning. It is the phase where we select the model that performs best for the data set
at hand.
First we calculate the “Classification Accuracy” of a given set of classification models
with their default parameters to determine which model performs best on the diabetes data
set.
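A comparison of this kind might look like the sketch below. It is only an illustration under stated assumptions: synthetic data from `make_classification` stands in for the diabetes data set so that the example runs on its own, and `random_state` and `max_iter` are set for reproducibility and convergence, which is a small departure from pure default parameters.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the real data (768 rows, 8 features); an assumption.
X, y = make_classification(n_samples=768, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, stratify=y, random_state=0
)

models = {
    "Decision tree": DecisionTreeClassifier(random_state=0),
    "SVM": SVC(),
    "Naive Bayes": GaussianNB(),
    "Logistic regression": LogisticRegression(max_iter=1000),
    "KNN": KNeighborsClassifier(),
}

# Fit every model on the training set and record its test-set accuracy.
accuracies = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    accuracies[name] = model.score(X_test, y_test)

# Report the models from most to least accurate.
for name, acc in sorted(accuracies.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: {acc:.3f}")
best_model = max(accuracies, key=accuracies.get)
print("Best on this split:", best_model)
```

The ranking from a single split can change with a different split, so a result like this is best read as a first screening rather than a final verdict.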
Evaluating Methods:
We evaluate the model by splitting the data set into two portions: a “training set”
and a “testing set”. The training set is used to train the model, and the testing set is used to
test it. After the data has been processed by the classification algorithms, we evaluate the
accuracy of the model on the testing set.
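To make the evaluation concrete, the sketch below shows a stratified split and the accuracy calculation, where accuracy is the number of correct predictions divided by the total number of predictions. The arrays are made-up values for illustration only; the 40% test share matches the `test_size=0.4` used in the sample code.

```python
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Made-up data: 10 samples, 2 features, balanced labels.
X = np.arange(20).reshape(10, 2)
y = np.array([0, 1] * 5)

# Stratified split: both portions keep the same class ratio as `y`.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, stratify=y, random_state=0
)
print(len(y_train), len(y_test))  # 6 4

# Accuracy = correct predictions / all predictions.
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])
manual_accuracy = float(np.mean(y_true == y_pred))
library_accuracy = accuracy_score(y_true, y_pred)
print(manual_accuracy, library_accuracy)  # 0.75 0.75
```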
4. Status of implementation
We have completed 30% of the project by using predefined packages. In
future, we shall include the actual implementation of the algorithms and report the accuracy of
the model.
5. Sample code
import numpy as np
import sklearn
import pandas as pd
import matplotlib.pyplot as plt
import os
print(os.listdir("dataset"))
df = pd.read_csv("dataset/[Link]")
[Link]()
import seaborn as sns
[Link]([Link])
# Treat 0 as a missing value in these columns
cols = ['Glucose','BloodPressure','SkinThickness','Insulin','BMI','DiabetesPedigreeFunction','Age']
df[cols] = df[cols].replace(0, np.nan)
[Link]()
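df.fillna(df.mean(), inplace=True)  # replace missing values with the column mean
df.isnull().sum()                   # confirm that no missing values remain
sns.heatmap(df.corr(), annot=True)  # correlation matrix of all columns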
#[Link](df)
fig = plt.gcf()
fig.set_size_inches(8,8)
from sklearn.ensemble import RandomForestClassifier
# Rank the eight input columns by random-forest feature importance
clf = RandomForestClassifier()
x = df[df.columns[:8]]
y = df.Outcome
clf.fit(x, y)
feature_imp = pd.DataFrame(clf.feature_importances_, index=x.columns)
feature_imp.sort_values(by=0, ascending=False)
from sklearn.model_selection import train_test_split
features = df[["Glucose",'BMI','Age','DiabetesPedigreeFunction']]
labels = df.Outcome
[Link]()
# Stratified 60/40 split keeps the class ratio the same in both sets
features_train, features_test, labels_train, labels_test = train_test_split(
    features, labels, stratify=df.Outcome, test_size=0.4)
# Decision tree
from sklearn.tree import DecisionTreeClassifier
dtclf = DecisionTreeClassifier()
dtclf.fit(features_train, labels_train)
print(dtclf.score(features_test, labels_test))
# Support vector machine with a linear kernel
from sklearn import svm
clf = svm.SVC(kernel="linear")
clf.fit(features_train, labels_train)
print(clf.score(features_test, labels_test))
# Gaussian naive Bayes
from sklearn import naive_bayes
nbclf = naive_bayes.GaussianNB()
nbclf.fit(features_train, labels_train)
print(nbclf.score(features_test, labels_test))
# Logistic regression
from sklearn.linear_model import LogisticRegression
clf1 = LogisticRegression()
clf1.fit(features_train, labels_train)
print(clf1.score(features_test, labels_test))
# k-nearest neighbours with k = 2
from sklearn.neighbors import KNeighborsClassifier
knnclf = KNeighborsClassifier(n_neighbors=2)
knnclf.fit(features_train, labels_train)
print(knnclf.score(features_test, labels_test))
6. CONCLUSION
The main motto is “to prevent and cure diabetes and to improve the lives of all people affected
by diabetes”. To support people all over the world, we are trying to detect and
prevent the complications of diabetes at an early stage through predictive analysis by improving
the classification techniques. The support vector machine, naive Bayes and logistic regression
techniques each give an accuracy of 75%. Logistic regression
gives the best fit to the data with respect to diabetic and non-diabetic persons.
7. REFERENCES
1. Type 2 diabetes mellitus prediction model based on data mining (IEEE paper).
2. [Link]
preg: Number of times pregnant
Plas: Plasma glucose concentration at 2 hours in an oral glucose tolerance test
Pres: Diastolic blood pressure (mm Hg)
Skin: Triceps skin fold thickness (mm)
Test: 2-Hour serum insulin (mu U/ml)
mass: Body mass index (weight in kg/(height in m)^2)
Pedi: Diabetes pedigree function
Age: Age (years)
Class: Class variable (0 or 1)