A Comparative Analysis and Predicting for Breast Cancer Detection Based on Data Mining Models

Article in Asian Journal of Research in Computer Science · May 2021


DOI: 10.9734/AJRCOS/2021/v8i430209


Asian Journal of Research in Computer Science

8(4): 45-59, 2021; Article [Link].68450


ISSN: 2581-8260

A Comparative Analysis and Predicting for Breast Cancer Detection Based on Data Mining Models

Shler Farhad Khorshid¹*, Adnan Mohsin Abdulazeez² and Amira Bibo Sallow³

¹ Akre Technical College of Informatics, Duhok Polytechnic University, Duhok, Kurdistan Region, Iraq.
² Duhok Polytechnic University, Duhok, Kurdistan Region, Iraq.
³ Nawroz University, Duhok, Kurdistan Region, Iraq.

Authors’ contributions

This work was carried out in collaboration among all authors. Author SFK managed the literature
searches related to breast cancer classification and wrote the first draft of the manuscript. Author
AMA gave the idea and designed the study. Author ABS performed the statistical analysis of the data and
discussed the results. All authors read and approved the final manuscript.

Article Information
DOI: 10.9734/AJRCOS/2021/v8i430209
Editor(s):
(1) Dr. G. Sudheer, GVP College of Engineering for Women, India.
Reviewers:
(1) S. Rajasekaran, University of Technology and Applied Sciences-Ibri, Oman.
(2) D. Mallikarjuna Reddy, VIT University, India.
(3) Tesfay Gidey Hailu, Addis Ababa Science and Technology University, Ethiopia.
Complete Peer review History: [Link]

Review Article

Received 10 March 2021
Accepted 14 May 2021
Published 19 May 2021

ABSTRACT

Breast cancer is one of the most common diseases among women, accounting for many deaths
each year. Even though cancer can be treated and cured in its early stages, many patients are
diagnosed at a late stage. Data mining is the process of finding or extracting information from
massive databases or datasets, and it is a field of computer science with a lot of potential. It
covers a wide range of areas, one of which is classification, which can be accomplished using a
variety of methods or algorithms. This paper presents a performance comparison, carried out in
MATLAB, among five classifiers: Support Vector Machine (SVM), Logistic Regression (LR),
K-Nearest Neighbors (K-NN), Weighted K-Nearest Neighbors (Weighted K-NN), and Gaussian Naïve
Bayes (Gaussian NB). The data set was taken from the UCI Machine Learning Repository. The main
objective of this study is to classify breast cancer cases using machine learning algorithms and
to compare the algorithms by their accuracy. The results show that Weighted K-NN (96.7%) has the
highest accuracy among all the classifiers.
_____________________________________________________________________________________________________

*Corresponding author: E-mail: [Link]@[Link];


Khorshid et al.; AJRCOS, 8(4): 45-59, 2021; Article [Link].68450

Keywords: Breast cancer; data mining; SVM; logistic regression; weighted K-NN; Gaussian Naïve
Bayes.

1. INTRODUCTION

Data mining (DM) uses a variety of techniques (such as classification, clustering, regression, and association rules) and algorithms (such as Decision Tree (DT), Genetic Algorithms, and Nearest Neighbor methods) to analyze large amounts of raw or multi-dimensional data. In other words, DM can derive hidden knowledge from large databases of clinical or medical data obtained from health centers or hospitals using intelligent data analysis. These insights can help enhance decision-making, prevention, diagnosis, and treatment in the field of medicine [1], [2], [3], [4], [5]. Furthermore, DM may establish relationships or define association rules between various features, such as a patient's personal details and disease symptoms [6], [7].

In the field of medicine, DM plays a significant role in computing applications [8], [9], [10]. The applications and methods of DM are demonstrated in the areas of healthcare administration, patient care, management, and intensive care systems. Breast Cancer (BC) is the most common of all cancers, and it is the leading cause of cancer deaths in women around the world, according to one of the latest DM studies [11], [12]. BC is one of the diseases with the highest number of cases and deaths worldwide [13], [14], [15]. After lung cancer, it is the second leading cause of death in women [16]. There are two types of breast tumors: malignant and benign [17], [18], [19]. A malignant tumor develops when cells in the breast tissue grow without the usual controls on cell death and cell division [20]. Benign tumors have a well-defined contour and are not harmful. They grow slowly in the organ where they first appeared, with no signs of metastatic disease [21]. Benign tumors are made up of cells that look like normal breast tissue cells. Malignant tumors, in contrast, are harmful because they can spread to other parts of the body and cause metastatic disease. Cancer cells in malignant tumors show many abnormalities in form, scale, and contour compared to normal cells, and they lose their original characteristics. DM algorithms can be a useful tool for predicting and diagnosing breast cancer, as well as for classifying tumors as benign or malignant. Earlier treatment of BC can result in curing the body of this disease [22].

In this paper, a comparative analysis is presented of five different DM classification algorithms, namely LR, SVM, K-NN, Weighted K-NN, and Gaussian NB, on the Breast Cancer Data Set by measuring their classification accuracy. Results show that all the presented DM algorithms performed well on the classification task.

The rest of the paper is organized as follows. Section 2 is a presentation of breast cancer. Section 3 discusses data mining. Section 4 focuses on some of the applications of DM. Section 5 gives a review of similar research. The material and methods used in the working process are discussed in Section 6. Section 7 summarizes the findings and discusses them. Section 8 provides a comparison of the related works. Finally, Section 9 presents the conclusion of this paper.

2. BREAST CANCER (BC)

BC develops when cells in the breast tissue divide and grow without the usual controls on cell division and death [23]. It is the most common form of cancer in women [24], [25]. While experts do not know the precise causes of the majority of breast cancers, they do know some of the risk factors that increase a woman's chances of developing the disease. Age, genetic risk, and family history are examples of these factors [26].

BC treatments are divided into two categories, regional and systemic. Systemic treatments include chemotherapy and hormone therapy, while regional treatments include surgery and radiation. The two forms of treatment are frequently used together to obtain better results. Despite the fact that BC is the second most common cause of death in women, it has a high survival rate. Ninety-seven percent of women live for five years or longer if they are diagnosed at an early stage [27].

3. DATA MINING (DM)

In the domain of medicine, DM plays a major role in computing applications [28], [29]. DM research relies on classification algorithms. Classification, clustering, association rules, prediction, and neural networks are some of the DM applications and techniques that are used to analyze large amounts of data. Among these, classification algorithms such as Naïve Bayes (NB), SVM, Artificial Neural Network (ANN), DT (C5.0), and K-NN are used to achieve the most accurate results. DM is currently being used to solve a variety of real-world issues, since the primary aim of DM is to convert raw data into more usable information. Medical databases raise problems for pattern extraction due to their complex features [30], [31], [32], [33]. DM algorithms can be classified into two types: statistical and machine learning (ML) algorithms. DM processes are categorized into descriptive and predictive categories (Fig. 1). Descriptive mining tasks show the general properties of the data in the database; they offer characteristics and definitions for the data set without the need for a predefined objective. Predictive mining tasks make inferences from the data, so that a forecast is produced based on values whose outcomes are already known [34], [35].

Fig. 1. Data mining techniques [34]

DM methods are successful and predictive of future patterns for the following reasons:

a) They are easy to use, and they predict outcomes based on previous events.
b) They operate by learning from previous data.
c) Data from a variety of sources are managed, and only the information needed is extracted.
d) Models are easy to keep up to date through relearning on past data and evolving patterns.

This is what makes DM dependable and realistic in the classification of medical images [36], [37], [38], [39].

4. DATA MINING FUTURE APPLICATIONS

In this part, we look at some of the DM applications and techniques.

1) DM applications in healthcare

Health DM tools have a lot of potential and can be very useful. However, the availability of clean healthcare data is critical to the success of healthcare DM. In this regard, the healthcare industry must investigate how data can be collected, processed, prepared, and mined more effectively. Standardization of clinical terminology and data exchange across organizations are two possible directions for enhancing the benefits of healthcare DM applications [40].

1.1 Future directions of health care system through DM tools

Since healthcare data is not limited to quantitative data (it also includes, e.g., doctor's notes or hospital records), it is critical to look at using text mining to expand the scope of what healthcare data mining can currently do, according to the International Journal of Computer Science, Engineering and Information Technology (IJCSEIT). This is used to combine all of the data before mining the text. It is also worth investigating how images (such as MRI scans) can be incorporated into healthcare data mining applications. It should be noted that progress has been made in these fields [40].

2) DM is used for the construction industry

The discovery of valuable knowledge from the vast collections of data held by industries has drawn a lot of interest in the field of DM [41]. In the construction industry, DM from large amounts of data has become an essential method for information discovery. Energy, building occupancy and occupant actions, safety management, material efficiency, and textual information discovery are some of the most common DM application domains in the construction industry [42].

2.1 Future Directions of construction industry system through DM Tools

Two major developments in the building industry's future growth are sustainable construction and digital construction. Energy management, safety management, and green building all fall under the scope of sustainable construction in a broad sense. This shows that the use of data mining in sustainable construction is a hot topic in the past, present, and future [40].


3) DM methods are used in the web education

In the field of web education, DM techniques are used to upgrade courseware. The connections are discovered by looking at the usage data collected during students' sessions. This knowledge is extremely beneficial to the course's instructor or author, who can determine which changes are most necessary to increase the course's effectiveness. In the twenty-first century, beginners use DM techniques, which are one of the most powerful learning methods available. This allows learners to become more conscious of their surroundings. The application of DM techniques to educational chats is feasible and can improve learning environments in the twenty-first century, according to Web Education [40].

4) In agriculture

Scientists and researchers around the world are dealing with how to make agriculture safe and resilient in the face of continuing conditions and environmental change. Transition and multidisciplinary approaches are needed in the agricultural system. To maintain production and efficiency when working with the same limited resources, intelligent and precision agricultural approaches have been prioritized [43]. The strategy requires the collection of data from a variety of sources and the effective application of that data in the appropriate area. As a result of this need, there has been an increase in interest in extracting information from large troves of data resulting from various research and survey projects. DM techniques advanced the concept of knowledge generation and pattern recognition when they first appeared. Even though DM is a new science, it has a wide range of applications in agriculture and related industries, and it has a bright future [44].

5. RELATED WORK

Singh et al. [45] compared the performance of different classifiers: DT classifiers (J4.8, Simple CART) and Bayes classifiers (NB, Bayesian LR). These were the most popular DM algorithms for BC classification. The paper aimed to determine which classifier produces the most reliable results for the Wisconsin Breast Cancer (original) dataset (WBCO). The BC dataset was taken from the UCI ML repository and analyzed using the WEKA tool. The experimental results show that the DT classifier Simple CART (98.13 percent) has the highest accuracy of all the classifiers.

Bataineh et al. [46] compared five nonlinear algorithms for BC detection: K-NN, Multi-Layer Perceptron (MLP), Classification and Regression Tree (CART), Gaussian NB, and SVM. The author's main goal was to compare the performance and efficacy of BC detection algorithms. The accuracy of each algorithm was also calculated separately by the author. The Wisconsin BC diagnostic dataset (WBCD) was used in the study. To calculate the accuracy of each algorithm, the author used the K-fold validation process. MLP outperformed the K-NN, CART, and NB algorithms with an accuracy of 96.70 percent.

Sinha et al. [47] introduced attribute filtering strategies, such as frequent itemset mining, to identify the most important and applicable attributes from the Wisconsin BC dataset using a classification algorithm such as SVM. Attribute filtering was also used to compare NB, K-NN, and DT. With attribute filtering, SVM generated the highest area under the curve compared to the other classification techniques, resulting in better accuracy and ROC curve.

Bharati et al. [48] presented the capability of NB, Random Forest (RF), LR, MLP, and K-NN classification in evaluating the BC disease dataset from the UCI repository to predict the existence of BC. Kappa statistics, TP rate, FP rate, and other metrics were all investigated. The K-NN classifier was observed to perform best.

Ghani et al. [49] used anthropometric data and parameters obtained during routine blood analysis to predict BC. Using the recursive feature elimination process, they first identified the most relevant attributes in the dataset that could be used as biomarkers. They discovered that the best biomarkers for BC are age, BMI, glucose, HOMA, and resistin. K-NN, ANN, DT, and NB classification techniques were used for classification. ANN was found to be the most accurate classifier, with an accuracy of 80.00 percent.

Basunia et al. [50] proposed a stacking classifier ensemble approach that effectively classifies benign and malignant tumors by combining multiple classification techniques. Their experiment used the “Wisconsin Diagnosis BC” dataset from the UC Irvine Machine Learning Repository. They chose the 20 top features for BC prediction using the Univariate Feature Selection process. Jupyter Notebook was used with some Python open-source libraries to implement various classification techniques such as CART, LR, K-NN, SVM, RF, and the Stacking Classifier. The overall outcomes indicate that the Stacking classifier has the highest accuracy, 97.20%.

Saoud et al. [51] used feature selection techniques to enhance the accuracy of six algorithms for BC classification and diagnosis: Bayes Network (BN), SVM, K-NN, ANN, DT (C4.5), and LR. They used both the WBC and WBCD databases. The feature selection technique increased the accuracy of some classifiers, such as BN, in both WBCD and WBC. However, some classifiers, such as SVM, had their accuracy reduced as a result of the feature selection technique. BN with feature selection is the best model for classifying BC in WBC, while SVM without feature selection is the best for WBCD.

Kumar et al. [52] used two BC datasets taken from the UCI Machine Learning repository. On both datasets, seven algorithms were used: Bayes network, NB, SVM, K-NN, DT, RF, and MLP. The two datasets have 11 and 32 features, respectively. The datasets are split into two parts: the training data accounts for 65 percent of the overall dataset, while the evaluation data accounts for the remaining 35 percent. The accuracy of the Bayesian Network technique on the 11-feature BCDW dataset was 97.13 percent, while that of the SVM technique on the WBCD dataset was 97.89 percent.

Sakri et al. [53] proposed integrating a feature selection algorithm with classification algorithms in BC prognosis. They claimed that using feature selection techniques to reduce the number of features can improve most classification algorithms, because some features are more significant and have a greater impact on the classification results than others. They presented the results of their experiments with and without the feature selection algorithm, particle swarm optimization (PSO), on three common classification algorithms, namely NB, K-NN, and REP Tree. NB obtained better results both with and without PSO, while the other two techniques performed better with PSO.

Sudha et al. [54] suggested an improved lion optimisation algorithm (ILOA) technique that can identify small feature subsets quickly and accurately to classify the BC data set. A total of 500 mammogram images (288 benign and 212 malignant) were used as a case sample in the study. After segmentation using a region growing algorithm, each mass was represented with 123 features, including 96 texture features, 9 histogram features, 11 shape features, and 7 radial distance features. The feature selection technique used a minimum distance classifier, a K-NN classifier, and an SVM classifier. Compared to the other algorithms, ILOA with the K-NN classifier performed well for BC classification [55].

6. MATERIALS AND METHODS

In the proposed model, five classifiers (LR, fine K-NN, linear SVM, weighted K-NN, and Gaussian NB) were used to classify breast cancer tumors with high accuracy and efficiency using nine features. The dataset used was the breast cancer dataset from the UCI Machine Learning Repository. The proposed model goes through the main stages of data processing, validation choosing, classification, and evaluating the results, as shown in the flowchart of the proposed model in Fig. 2. Preprocessing handles missing feature values: in the Bare Nuclei feature, there are 16 instances in Groups 1 to 6 that contain a single missing (i.e., unavailable) attribute value, denoted by “?”. To test the predictive accuracy of the fitted models, 10-fold cross-validation was used, with MATLAB as the classification tool in this study. After classification with all five algorithms, performance was measured by the confusion matrix and ROC area.

6.1 Dataset

The data for this study was obtained from the BC Wisconsin sub-directory of the UCI Machine Learning repository, with 699 examples, two classes (malignant and benign), and 9 integer-valued attributes (as shown in Table 1). In the UCI breast cancer dataset (dataset link: [Link] +cancer), the class distribution is as follows:

1- Benign class: 458 (65.5%) instances.
2- Malignant class: 241 (34.5%) instances.
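The 10-fold cross-validation mentioned in Section 6 can be illustrated with a minimal Python sketch. The study itself used MATLAB, so this is not the authors' code: the function names, the shuffling step, and the fixed seed are assumptions made for illustration, and the split is not stratified by class.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Split range(n_samples) into k shuffled folds of near-equal size.

    The seed and the shuffling are illustrative assumptions; the paper
    does not describe how MATLAB partitioned the 699 samples.
    """
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Fold i takes every k-th shuffled index, starting at position i.
    return [indices[i::k] for i in range(k)]


def train_test_splits(n_samples, k=10, seed=0):
    """Yield (train, test) index lists; each fold is the test set once."""
    folds = k_fold_indices(n_samples, k, seed)
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


# 699 samples, as in the dataset described in Section 6.1.
splits = list(train_test_splits(699, k=10))
fold_sizes = [len(test) for _, test in splits]
print(fold_sizes)
```

Each sample is used for testing exactly once and for training in the other nine rounds, so every reported prediction comes from a model that did not see that sample during fitting.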


Fig. 2. The Proposed Model Flowchart Diagram

6.2 Classification Algorithms

6.2.1 Support vector machine (SVM)

SVM is a superior DM technique that produces accurate classification results [56], [57], [58]. In its basic form, SVM applies to data sets with exactly two classes. It categorizes data by finding the best hyperplane that separates all data points into one of two classes. SVM's main goal is to maximize the margin between the two classes. Cancer diagnosis, face recognition, and text categorization are examples of real-world applications of SVM. It is an effective technique when dealing with binary classification. Because $w^{\top}x + b = 0$ and $c(w^{\top}x + b) = 0$ define the same plane, we are free to choose the normalization of $w$. Choose the normalization such that the positive and negative support vectors satisfy $w^{\top}x_{+} + b = +1$ and $w^{\top}x_{-} + b = -1$, respectively [59], [60], [61].

The margin is then calculated as follows:

$$\frac{w}{\|w\|} \cdot (x_{+} - x_{-}) = \frac{w^{\top}(x_{+} - x_{-})}{\|w\|} = \frac{2}{\|w\|} \qquad (1)$$
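Equation (1) can be checked numerically with a small Python sketch. The weight vector, bias, and the two support vectors below are hypothetical values chosen only so that the normalization conditions hold; they do not come from the paper's fitted model.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def decision_value(w, b, x):
    """Value of w^T x + b for a point x."""
    return dot(w, x) + b


def margin_width(w):
    """Margin 2 / ||w|| from equation (1)."""
    return 2.0 / math.sqrt(dot(w, w))


# Hypothetical hyperplane 2*x1 - 3 = 0 with one support vector per class.
w, b = [2.0, 0.0], -3.0
x_pos = [2.0, 1.0]   # satisfies w^T x + b = +1
x_neg = [1.0, 5.0]   # satisfies w^T x + b = -1

# Left-hand side of equation (1): projection of (x_pos - x_neg) onto w / ||w||.
difference = [p - n for p, n in zip(x_pos, x_neg)]
projection = dot(w, difference) / math.sqrt(dot(w, w))

print(decision_value(w, b, x_pos), decision_value(w, b, x_neg))
print(projection, margin_width(w))
```

Both printed margin values agree, which is the content of equation (1): once the support vectors are normalized to ±1, the margin depends only on the norm of $w$, so maximizing the margin is equivalent to minimizing $\|w\|$.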


Table 1. Breast cancer dataset attribute information

Attribute Domain
1) Sample code number ID Number
2) Clump Thickness 1-10
3) Uniformity of cell size 1-10
4) Uniformity of cell shape 1-10
5) Marginal Adhesion 1-10
6) Single Epithelial cell size 1-10
7) Bare Nuclei 1-10
8) Bland Chromatin 1-10
9) Normal Nucleoli 1-10
10) Mitoses 1-10
11) Class: 2 For benign
4 For malignant
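To make the layout of Table 1 concrete, the following Python sketch parses one record into named fields. The comma-separated format, the function name, and the two sample rows are assumptions for illustration; a missing value marked "?" is kept as `None`, since the paper does not state how missing values were treated.

```python
# Feature names follow attributes 2-10 of Table 1 (identifier spelling is ours).
ATTRIBUTES = [
    "clump_thickness", "uniformity_of_cell_size", "uniformity_of_cell_shape",
    "marginal_adhesion", "single_epithelial_cell_size", "bare_nuclei",
    "bland_chromatin", "normal_nucleoli", "mitoses",
]
# Class coding from attribute 11 of Table 1.
CLASS_LABELS = {2: "benign", 4: "malignant"}


def parse_record(line):
    """Parse 'id,f1,...,f9,class' into (sample_id, features, label).

    Assumes comma-separated fields in the order of Table 1.
    """
    fields = line.strip().split(",")
    if len(fields) != 11:
        raise ValueError(f"expected 11 fields, got {len(fields)}")
    sample_id = fields[0]
    features = {
        name: (None if raw == "?" else int(raw))  # "?" marks a missing value
        for name, raw in zip(ATTRIBUTES, fields[1:10])
    }
    label = CLASS_LABELS[int(fields[10])]
    return sample_id, features, label


# Illustrative rows only; the second one is made up to show a missing value.
print(parse_record("1000025,5,1,1,1,2,1,3,1,1,2"))
print(parse_record("0000001,8,4,5,1,2,?,7,3,1,4"))
```

Keeping the missing value explicit leaves the choice between dropping the record and imputing a value to a later step.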
6.2.2 Logistic Regression (LR)

LR is one of the most widely used generalized linear models in DM [62]. LR predicts the probability of an outcome that can take two values from a collection of predictor variables. It is primarily used for predicting and calculating the probability of an outcome [63].

6.2.3 K-Nearest Neighbors (K-NN)

K-NN is a simple instance-based learning algorithm that classifies objects in the feature space based on their closest training examples [64], [65]. An object is assigned to the class most common among its K nearest neighbors. To find the closest neighbors, the K-NN algorithm used the Euclidean distance metric [66], [67]. The equation below measures the Euclidean distance $d(x, y)$ between two points $x$ and $y$, where $n$ denotes the number of features, with $x = (x_1, x_2, x_3, \ldots, x_n)$ and $y = (y_1, y_2, y_3, \ldots, y_n)$ [68].

$$d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^{2}} \qquad (2)$$

6.2.4 Weighted K-Nearest-Neighbors (Weighted K-NN)

This extension is based on the idea that observations in the learning set that are especially similar to the new observation (y, x) should be given more weight in the decision than observations that are far away from it. This is not the case with K-NN: only the k closest neighbors affect the prediction, but their influence is the same for each of these neighbors, regardless of their similarity to (y, x). To give closer neighbors more weight, the distances used in the first stage of the search for closest neighbors must be transformed into similarity measurements that can be used as weights [69]. Weighted K-NN computes the nearest neighbors, assigns a weight to each of them, and finally assigns the class to the processed instance [70], [71], [72].

6.2.5 Gaussian Naïve Bayes (Gaussian NB)

The Gaussian NB algorithm is a special form of the NB algorithm used for classification [73]. This method is particularly useful when the features have continuous values or when all of the features follow a Gaussian (normal) distribution. The features' likelihood is assumed to be Gaussian [74].

$$P(x_i \mid y) = \frac{1}{\sqrt{2\pi\sigma_y^{2}}} \exp\left(-\frac{(x_i - \mu_y)^{2}}{2\sigma_y^{2}}\right) \qquad (3)$$

In equation (3), $x_i$ is a continuous data variable, and the parameters $\mu_y$ and $\sigma_y$ are calculated using maximum likelihood estimation: after the data has been segmented by class, the mean and variance are measured.

6.3 The Evaluation Metrics of the Classifiers Performance

6.3.1 Confusion matrix

The confusion matrix (also called the "Contingency Matrix") provides a good overview of a classifier's performance. Table 2 shows a standard confusion matrix.


Table 2. A Typical 2x2 Confusion Matrix

| | Predicted Positive | Predicted Negative |
|---|---|---|
| Actual Positive | TP | FN |
| Actual Negative | FP | TN |

a. True positive (TP): number of positive samples correctly predicted.
b. False negative (FN): number of positive samples wrongly predicted as negative.
c. False positive (FP): number of negative samples wrongly predicted as positive.
d. True negative (TN): number of negative samples correctly predicted.
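The paper does not spell out the formulas derived from Table 2. For reference, the standard definitions of the quantities used in the following sections (accuracy, and the two rates plotted on a ROC curve) are sketched below; the notation follows Table 2.

```latex
\begin{align*}
\text{Accuracy} &= \frac{TP + TN}{TP + TN + FP + FN} \\
\text{Sensitivity (true positive rate)} &= \frac{TP}{TP + FN} \\
\text{Specificity (true negative rate)} &= \frac{TN}{TN + FP} \\
\text{False positive rate} &= \frac{FP}{FP + TN} = 1 - \text{Specificity}
\end{align*}
```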

6.3.2 Receiver operating characteristics area or ROC area

A ROC curve is a graphical representation of the true positive rate against the false positive rate for different diagnostic test thresholds. The ROC curve is used to measure a classifier's performance and to compare classifiers: the larger the area under the curve (AUC), the better the classifier. The true positive rate is also known as sensitivity, and the false positive rate equals 1 − specificity. A classifier's performance is rated as excellent (0.90-1), good (0.80-0.90), fair (0.70-0.80), bad (0.60-0.70), or fail (0.50-0.60).

7. RESULTS AND DISCUSSION

The research used the confusion matrix and the area under the ROC curve to determine the degree of performance and applicability of the models.

Fig. 3. Evaluation results for logistic regression (Model 1): (a) confusion matrix, (b) ROC curve

The confusion matrix calculated for Model 1 (LR) in Fig. 3 (a) shows the true and false positives and the true and false negatives of the train set. In this confusion matrix, there are 448 true benign patients, whereas 10 are false benign. There are also 224 true malignant patients and 17 false malignant cases. Fig. 3 (b) illustrates the ROC curve plot. The area under the curve (AUC) is 0.99 and is close to 1 (which is Excellent).

Fig. 4. Evaluation results for SVM (Model 2): (a) confusion matrix, (b) ROC curve

Fig. 4 (a) shows the confusion matrix for SVM. There are 447 true benign patients, whereas 11 are false benign. There are also 228 true malignant and 13 false malignant patients. SVM also has a very good area under the ROC curve (AUC = 1.00, the ideal value).

Fig. 5. Evaluation results for K-NN (Model 3): (a) confusion matrix, (b) ROC curve

The classification made by K-NN was also evaluated using a confusion matrix, shown in Fig. 5 (a). In this matrix, there are 446 true benign patients, whereas 12 are false benign. There are also 221 true malignant and 20 false malignant patients. Fig. 5 (b) shows the ROC curve. The area under the curve (AUC) is 0.95 and is close to 1 (which is Excellent).

Fig. 6. Evaluation results for Weighted K-NN (Model 4): (a) confusion matrix, (b) ROC curve

The confusion matrix for weighted K-NN is illustrated in Fig. 6 (a). There are 447 true benign patients, whereas 11 are false benign. There are also 229 true malignant and 12 false malignant patients. Fig. 6 (b) illustrates the ROC curve plot. The area under the curve (AUC) is 0.99.

Fig. 7. Evaluation results for Gaussian NB (Model 5): (a) confusion matrix, (b) ROC curve

Fig. 7 (a) shows the confusion matrix for Gaussian NB. There are 436 true benign patients in this matrix, while 22 are false benign. There are also 236 true malignant and 5 false malignant patients. Fig. 7 (b) is the ROC curve plot. The AUC summarizes how well the classifier separates the two classes. The AUC is 0.98 and is close to 1 (which is Excellent).

The overall outcomes displayed in Table 3 indicate that the Weighted K-NN classifier has the highest accuracy, 96.7%, whereas Fine K-NN has the lowest accuracy, 95.4%. The results also show that the Weighted K-NN classifier has the shortest training time (0.5096 sec) and a ROC area of 0.99. According to these findings, the Weighted K-NN method is the best classifier among the five proposed classifiers for classifying a tumor as benign or malignant.

Fig. 8. Comparison of accuracies between all the classifiers

Compared to the other classifiers, weighted K-NN has the highest accuracy of 96.7 percent, as shown in Fig. 8.

Fig. 9. Comparison of training time between all the classifiers

Fig. 9 shows the training times of all five classification algorithms. The training time for weighted K-NN is less than that of the other algorithms.

Fig. 10. ROC Area of all classifiers

Fig. 10 shows the graphical representation, produced in MATLAB, of the ROC area of the five classifiers on the dataset. The linear SVM gave the best ROC area, followed by the weighted K-NN, LR, Gaussian NB, and then the fine K-NN classifiers.

8. Comparative Study

The comparison summary of the related works is shown in Table 4. The researchers in the related papers used various techniques of feature selection and classification methods, as well as different datasets with different numbers of


Table 3. Performance Study of Algorithms

Model no. Model Type Accuracy Training time (sec) Area under ROC curve
1 Logistic regression 96.1% 5.0739 0.99
2 Linear SVM 96.6% 2.4899 1.00
3 Fine K-NN 95.4% 1.1064 0.95
4 Weighted K-NN 96.7% 0.5096 0.99
5 Gaussian NB 96.1% 1.2643 0.98
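As a consistency check, the accuracies in Table 3 can be recomputed from the confusion-matrix counts quoted in Section 7. The Python sketch below assumes that accuracy is the share of correctly classified cases (true benign plus true malignant) among all 699 samples; the dictionary and function names are ours.

```python
# (true benign, false benign, true malignant, false malignant), as quoted
# in the text for Figs. 3-7.
REPORTED_COUNTS = {
    "Logistic regression": (448, 10, 224, 17),
    "Linear SVM": (447, 11, 228, 13),
    "Fine K-NN": (446, 12, 221, 20),
    "Weighted K-NN": (447, 11, 229, 12),
    "Gaussian NB": (436, 22, 236, 5),
}


def accuracy_percent(true_benign, false_benign, true_malignant, false_malignant):
    """Percentage of correctly classified cases, rounded to one decimal."""
    total = true_benign + false_benign + true_malignant + false_malignant
    correct = true_benign + true_malignant
    return round(100.0 * correct / total, 1)


recomputed = {
    model: accuracy_percent(*counts) for model, counts in REPORTED_COUNTS.items()
}
for model, accuracy in recomputed.items():
    print(f"{model}: {accuracy}%")
```

Under that assumption, the recomputed values agree with the Accuracy column of Table 3, and every row of counts sums to the 699 samples of the dataset.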

Table 4. Comparison of related works

Each entry lists: Tool; Dataset; Number of attributes; Data type; Data processing method; Evaluation method; Validation technique; Accuracy per classifier.

[45] Tool: WEKA. Dataset: UCI repository. Number of attributes: 11. Data type: Numeric (discrete value).
     Data processing method: Kappa statistics, mean absolute error. Evaluation method: Performance classifiers. Validation technique: -.
     Accuracy: Naive Bayes 95.26%; Bayesian Logistic Regression 65.42%; Simple CART 98.13%; J48 97.27%.

[46] Tool: MATLAB. Dataset: UCI repository, WDBC dataset. Number of attributes: 32. Data type: Images.
     Data processing method: Standardize rescaling method. Evaluation method: Binary classification accuracy method. Validation technique: Cross validation.
     Accuracy: Multilayer Perceptron 99.12%; K-Nearest Neighbours 95.61%; CART 93.85%; Gaussian Naive Bayes 94.73%; Support Vector Machine 98.24%.

[47] Tool: PYTHON. Dataset: UCI repository, WBC. Number of attributes: 31. Data type: Numeric (binary value).
     Data processing method: z-score normalization. Evaluation method: Confusion matrix. Validation technique: -.
     Accuracy: Support Vector Machine 96.61%; Naïve Bayes 96.46%; k-Nearest Neighbours 91.74%; Decision Tree 90.27%.

[48] Tool: WEKA. Dataset: UCI repository. Number of attributes: 10. Data type: Numeric.
     Data processing method: Kappa statistics. Evaluation method: Binary classification accuracy method. Validation technique: -.
     Accuracy: K-Nearest Neighbors 72.37%; Naïve Bayes 71.67%; Random Forest 69.58%; Logistic Regression 68.8%; Multilayer Perceptron 64.68%.

[49] Tool: WEKA. Dataset: UCI repository. Number of attributes: 9. Data type: Numeric.
     Data processing method: Recursive feature elimination for features selection. Evaluation method: Confusion matrix. Validation technique: -.
     Accuracy: K-Nearest Neighbors 77.14%; Artificial Neural Networks 80.00%; Decision Tree 71.43%; Naive Bayesian 73.91%.

[50] Tool: PYTHON. Dataset: UCI repository. Number of attributes: 32. Data type: Numeric.
     Data processing method: Features selection. Evaluation method: Confusion matrix. Validation technique: Cross validation.
     Accuracy: CART 94.74%; Logistic Regression 97.08%; K-Nearest Neighbors 95.91%; Support Vector Machine 95.91%; Random Forest 97.08%; Stacking Classifier 97.20%.

[51] Tool: WEKA. Dataset: UCI repository (WBC, WBCD). Number of attributes: 9 (WBC), 32 (WBCD). Data type: Numeric.
     Data processing method: Features selection. Evaluation method: Confusion matrix. Validation technique: Cross validation.
     Classifiers: Bayes Network; Support Vector Machine; K-Nearest Neighbors; Artificial Neural Networks; Decision Tree (C4.5); Logistic Regression.
     Accuracy: with WBC (BN) 97.42%; with WBCD (SVM) 97.36%.

[52] Tool: WEKA. Dataset: UCI repository (BCWD, WBCD). Number of attributes: 11 (BCWD), 32 (WBCD). Data type: Numeric.
     Data processing method: Data statistics. Evaluation method: Performance classifiers. Validation technique: Cross validation.
     Classifiers: Bayesian Network; Naïve Bayes; SVM; Multilayer Perceptron; K-NN; Decision Tree (J48); Random Forest.
     Accuracy: with BCDW (Bayesian Network) 97.13%; with WBCD (SVM) 97.89%.

[53] Tool: WEKA. Dataset: UCI repository. Number of attributes: 35. Data type: Numeric.
     Data processing method: Features selection and extraction. Evaluation method: Confusion matrix. Validation technique: Cross validation.
     Accuracy: Naïve Bayes 81.3%; K-Nearest Neighbors 75.0%; Fast decision tree learner (REPTREE) 93.6%.

[54] Tool: MATLAB. Dataset: Digital database for screening mammography (DDSM). Number of attributes: -30. Data type: Images.
     Data processing method: Features selection and extraction. Evaluation method: Performance classifiers. Validation technique: Cross validation.
     Accuracy: Support Vector Machine 98.92%; K-Nearest Neighbours 99.31%.

Proposed work. Tool: MATLAB. Dataset: UCI repository. Number of attributes: 9. Data type: Numeric.
     Data processing method: -. Evaluation method: Confusion matrix, ROC area. Validation technique: Cross validation.
     Accuracy: Logistic Regression 96.1%; Support Vector Machine 96.6%; Weighted K-NN 96.7%; K-Nearest Neighbours 95.4%; Gaussian Naïve Bayes 96.1%.
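
The last row of Table 4 lists a confusion matrix and cross-validation as the evaluation and validation methods of the proposed work. The following Python sketch shows, under stated assumptions, how an accuracy figure is derived from confusion-matrix counts and how samples can be split into folds. The helper names, the class labels, the toy predictions, and the choice of five folds are hypothetical and are not taken from the paper, whose experiments were run in MATLAB.

```python
def confusion_counts(actual, predicted, positive="malignant"):
    """Return (tp, fp, fn, tn) for a binary classification problem."""
    tp = fp = fn = tn = 0
    for a, p in zip(actual, predicted):
        if p == positive and a == positive:
            tp += 1          # correctly flagged as positive
        elif p == positive:
            fp += 1          # flagged as positive, actually negative
        elif a == positive:
            fn += 1          # missed positive
        else:
            tn += 1          # correctly left as negative
    return tp, fp, fn, tn


def accuracy(tp, fp, fn, tn):
    """Accuracy = correctly classified samples / all samples."""
    return (tp + tn) / (tp + fp + fn + tn)


def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for k contiguous folds."""
    base, extra = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        # The first `extra` folds take one additional sample.
        size = base + (1 if fold < extra else 0)
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size


# Hypothetical labels for ten samples (not results from the paper).
actual = ["malignant"] * 4 + ["benign"] * 6
predicted = ["malignant"] * 3 + ["benign"] * 6 + ["malignant"]

counts = confusion_counts(actual, predicted)
acc = accuracy(*counts)
folds = list(k_fold_indices(len(actual), k=5))
print(counts, acc, len(folds))
```

In a cross-validated experiment, a classifier would be trained on each `train` index set and scored on the matching `test` set, and the per-fold accuracies would then be averaged.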


features. Compared with the previous works, the proposed method achieves a high accuracy in breast cancer classification. Researchers in [47] and [45] used the WBC (original) dataset to train and test different DM algorithms. They registered accuracies of 96.61% (SVM) and 98.13% (CART), respectively, despite the high execution time of CART. Researchers in [46] used the WDBC dataset with the standardization method to reach 99.12% for MLP. In [48], researchers used fewer attributes and gained an average accuracy of 72.37% for K-NN, 71.67% for NB, 69.58% for RF, and 64.68% for LR. Researchers in [49] obtained 80% for ANN by using the feature selection method. In [50], researchers used the feature selection method to reach 97.20% for the Stacking Classifier. Researchers in [51] used two datasets with a feature selection technique to reach 97.42% from BN for WBC and 97.36% from SVM for WBCD. Researchers in [52] used two different datasets to reach a high accuracy rate of 97.13% from BN for the BCDW dataset and 97.89% from SVM for WBCD. In [53], researchers used many features but achieved a lower accuracy rate (93.6%) for DT. Lastly, researchers in [54] gained a good accuracy result (98.92%) for K-NN. The proposed work utilized five DM classifiers (Logistic Regression (LR), Support Vector Machine (SVM), K-Nearest Neighbors (K-NN), weighted K-Nearest Neighbors (Weighted K-NN), and Gaussian Naïve Bayes (Gaussian NB)), and the best classifier was Weighted K-NN with 96.7% accuracy.

9. CONCLUSION

This paper attempted to improve the accuracy of breast cancer classification using data mining techniques. In this study, the UCI breast cancer dataset was used, and five data mining algorithms were applied for the classification: Logistic Regression (LR), Support Vector Machine (SVM), K-Nearest Neighbors (K-NN), weighted K-Nearest Neighbors (Weighted K-NN), and Gaussian Naïve Bayes (Gaussian NB). All the experiments were done using MATLAB 2021a. The primary goal is to assess how well each algorithm performs in terms of classification test accuracy. The results were evaluated in terms of the confusion matrix and ROC curve. The experimental results show that the Weighted K-NN classifier has the highest accuracy (96.7%), whereas Fine K-NN has the lowest (95.4%). The other four classifiers, Linear SVM, LR, Gaussian NB, and Fine K-NN, reached accuracies of 96.6%, 96.1%, 96.1%, and 95.4%, respectively.

COMPETING INTERESTS

Authors have declared that no competing interests exist.

REFERENCES

1. Abdulqader DM, Abdulazeez AM, Zeebaree DQJML. Machine learning supervised algorithms of gene selection: A Review. 2020;62(03).
2. Ahmed O, Brifcani A. Gene expression classification based on deep learning. in 2019 4th Scientific International Conference Najaf (SICN). 2019;145-149:IEEE.
3. Zeebaree DQ, Haron H, Abdulazeez AM. Gene selection and classification of microarray data using convolutional neural network. in 2018 International Conference on Advanced Science and Engineering (ICOASE). 2018;145-150:IEEE.
4. Eesa AS, Abdulazeez AM, Orman ZJSJoUoZ. A DIDS based on the combination of cuttlefish algorithm and decision tree. 2017;5(4):313-318.
5. Taher KI, Abdulazeez AM, Zebari DAJAJoRiCS. Data mining classification algorithms for analyzing soil data. 2021;17-28.
6. Oskouei RJ, Kor NM, Maleki SAJAjocr. Data mining and medical world: breast cancers’ diagnosis, treatment, prognosis and challenges. 2017;7(3):610.
7. Ibrahim I, Abdulazeez AJJoAS, Trends T. The Role of Machine Learning Algorithms for Diagnosing Diseases. 2021;2(01):10-19.
8. Zebari R, Abdulazeez A, Zeebaree D, Zebari D, Saeed JJJoAS, Trends T. A comprehensive review of dimensionality reduction techniques for feature selection and feature extraction. 2020;1(2):56-70.
9. Charbuty B, Abdulazeez AJJoAS, Trends T. Classification Based on Decision Tree Algorithm for Machine Learning. 2021;2(01):20-28.
10. Sagar M, Vivekkumar G, Reddy M, Devendiran S, Amarnath M. Research on intelligent fault diagnosis of gears using EMD, spectral features and data mining techniques. in IOP Conference Series: Materials Science and Engineering, 2017;263(6):062047:IOP Publishing.


11. PadmaPriya R, Vadivu PSJIJoE, e-ISSN MR. A review on data mining techniques for prediction of breast cancer recurrence. 2019;2250-0758.
12. Zebari DA, Zeebaree DQ, Abdulazeez AM, Haron H, Hamed HNAJIA. Improved Threshold Based and Trainable Fully Automated Segmentation for Breast Cancer Boundary and Pectoral Muscle in Mammogram Images. 2020;8:203097-203116.
13. Denny J, Ali S, Sobha TJIAJER. Efficient segmentation method for roi detection in mammography images using morphological operations. 2020;3(6):1-8.
14. Zeebaree DQ, Haron H, Abdulazeez AM, Zebari DA. Trainable model based on new uniform LBP feature to identify the risk of the breast cancer. in 2019 International Conference on Advanced Science and Engineering (ICOASE). 2019;106-111:IEEE.
15. Najat N, Abdulazeez AM. Gene clustering with partition around mediods algorithm based on weighted and normalized Mahalanobis distance. in 2017 International Conference on Intelligent Informatics and Biomedical Sciences (ICIIBMS). 2017;140-145:IEEE.
16. Khorshid SF, Abdulazeez AMJPsJoAoEE. Breast cancer diagnosis based on k-nearest neighbors: A review. 2021;18(4):1927-1951.
17. Zeebaree DQ, Haron H, Abdulazeez AM, Zebari DA. Machine learning and region growing for breast cancer segmentation. in 2019 International Conference on Advanced Science and Engineering (ICOASE). 2019;88-93:IEEE.
18. Zeebaree DQ, Abdulazeez AM, Zebari DA, Haron H, Hamed HNA. Multi-level fusion in ultrasound for cancer detection based on uniform lbp features.
19. Gupta S, Kumar D, Sharma AJIJoCS, Engineering. Data mining classification techniques applied for breast cancer diagnosis and prognosis. 2011;2(2):188-195.
20. Kadambari S, Jaswal K, Kumar P, Rawat S. Using twitter for tapping public minds, predict trends and generate value. in 2015 Fifth International Conference on Advanced Computing & Communication Technologies. 2015;586-589:IEEE.
21. El-Sebakhy EA, Faisal KA, Helmy T, Azzedin F, Al-Suhaim A. Evaluation of breast cancer tumor classification with unconstrained functional networks classifier. in IEEE International Conference on Computer Systems and Applications. 2006;281-287:IEEE.
22. Eesa AS, Brifcani AMA, Orman ZJIJoC, Engineering I. A New DIDS Design Based on a Combination Feature Selection Approach. 2015;9(8):1914-1918.
23. Jerez-Aragonés JM, Gómez-Ruiz JA, Ramos-Jiménez G, Muñoz-Pérez J, Alba-Conejo EJAiim. A combined neural network and decision trees model for prognosis of breast cancer relapse. 2003;27(1):45-63.
24. Shrivastavat SS, Sant A, Aharwal RPJIJoACR. An overview on data mining approach on breast cancer data. 2013;3(4):256.
25. Moura DC, López MAGJIjocar, surgery. An evaluation of image descriptors combined with clinical data for breast cancer diagnosis. 2013;8(4):561-574.
26. Delen D, Walker G, Kadam AJAiim. Predicting breast cancer survivability: a comparison of three data mining methods. 2005;34(2):113-127.
27. O'Malley CD, Le GM, Glaser SL, Shema SJ, West DWJCIIJotACS. Socioeconomic status and breast carcinoma survival in four racial/ethnic groups: a population‐based study. 2003;97(5):1303-1311.
28. Richards G, Rayward-Smith VJ, Sönksen P, Carey S, Weng CJAiim. Data mining for indicators of early mortality in a database of clinical records. 2001;22(3):215-231.
29. Shearer CJJodw. The CRISP-DM: the new blueprint for data mining. 2000;5(4).
30. Ramana BV, Babu MSP, Venkateswarlu NJIJoDMS. A critical study of selected classification algorithms for liver disease diagnosis. 2011;3(2):101-114.
31. Eesa AS, Orman Z, Brifcani AMAJEswa. A novel feature-selection approach based on the cuttlefish optimization algorithm for intrusion detection systems. 2015;42(5):2670-2679.
32. Sulaiman MAJJoSC, Mining D. Evaluating Data Mining Classification Methods Performance in Internet of Things Applications. 2020;1(2):11-25.
33. Kumar AJSM, Structures. Different interface delamination effects on laminated composite plate structure under free vibration analysis based on classical laminated plate theory. 2020;29(11):115028.
34. Lashari SA, Ibrahim R, Senan N, Taujuddin N. Application of data mining techniques for medical data classification: a review. in MATEC Web of Conferences. 2018;150:06003:EDP Sciences.
35. Omar N, Abdulazeez AM, Sengur A, Al-Ali SGSJIJoEE, Science C. Fused faster RCNNs for efficient detection of the license plates. 2020;19(2):974-982.
36. Fayyad U, Piatetsky-Shapiro G, Smyth PJAm. From data mining to knowledge discovery in databases. 1996;17(3):37-37.
37. Hasan DA, Abdulazeez AMJl. A modified convolutional neural networks model for medical image segmentation. 2020;20:22.
38. Kareem FQ, Abdulazeez AM. Ultrasound medical images classification based on deep learning algorithms: A review.
39. Hiremath N, Reddy DMJMTP. Experimental studies to assess surface wear using grease degradation, bearing temperature and statistical parameter of vibration signals in a roller bearing. 2017;4(8):8370-8377.
40. Reddy DLCJIJoCA. A review on data mining from past to the future. 2011;975:8887.
41. Devendiran S, Mathew ATJMTP. Bearing Fault Diagnosis Using Empirical Mode Decomposition, Entropy Based Features And Data Mining Techniques. 2018;5(5):11460-11475.
42. Yan H, Yang N, Peng Y, Ren YJAiC. Data mining in the construction industry: Present status, opportunities, and future trends. 2020;119;103331.
43. Keleş MK. Breast cancer prediction and detection using data mining classification algorithms: a comparative study. Tehnički Vjesnik. 2019;26(1):149-155.
44. Bhagawati K, Sen A, Shukla KK, Bhagawati RJIJoAER, Science. Application and Scope of Data Mining in Agriculture. 2016;3(7):236783.
45. Singh S, Thakral S. Using data mining tools for breast cancer prediction and analysis. in 2018 4th International Conference on Computing Communication and Automation (ICCCA), 2018;1-4:IEEE.
46. Bataineh AAJIJoML, Computing. A comparative analysis of nonlinear machine learning algorithms for breast cancer detection. 2019;9(3):248-254.
47. Sinha A, Sahoo B, Rautaray SS, Pandey M. Improved framework for breast cancer prediction using frequent itemsets mining for attributes filtering. in 2019 International Conference on Intelligent Computing and Control Systems (ICCS). 2019;979-982:IEEE.
48. Bharati S, Rahman MA, Podder P. Breast cancer prediction applying different classification algorithm with comparative analysis using WEKA. in 2018 4th International Conference on Electrical Engineering and Information & Communication Technology (iCEEiCT). 2018;581-584:IEEE.
49. Ghani MU, Alam TM, Jaskani FH. Comparison of classification models for early prediction of breast cancer. in 2019 International Conference on Innovative Computing (ICIC). 2019;1-6:IEEE.
50. Basunia MR, Pervin IA, Al Mahmud M, Saha S, Arifuzzaman M. On Predicting and Analyzing Breast Cancer using Data Mining Approach. in 2020 IEEE Region 10 Symposium (TENSYMP). 2020;1257-1260:IEEE.
51. Saoud H, Ghadi A, Ghailani M, Abdelhakim BA. Using feature selection techniques to improve the accuracy of breast cancer classification. in The Proceedings of the Third International Conference on Smart City Applications. 2018;307-315:Springer.
52. Kumar A, Sushil R, Tiwari AJIJoCS, Engineering. Comparative study of classification techniques for breast cancer diagnosis. 2019;7(1):234-240.
53. Sakri SB, Rashid NBA, Zain ZMJIA. Particle swarm optimization feature selection for breast cancer recurrence prediction. 2018;6:29637-29647.
54. Sudha M, Selvarajan S, Suganthi MJIJoBIC. Feature selection using improved lion optimisation algorithm for breast cancer classification. 2019;14(4):237-246.
55. AK MF. A comparative analysis of breast cancer detection and diagnosis using data visualization and machine learning applications. In healthcare. 2020;8(2):111. Multidisciplinary digital publishing institute.
56. Han J, Kamber M, Pei JJTMKSiDMS. Data mining concepts and techniques third edition. 2011;5(4):83-124.
57. Cortes C, Vapnik VJMl. Support-vector networks. 1995;20(3):273-297.
58. Abdullah DM, Abdulazeez AMJQAJ. Machine Learning Applications based on SVM Classification A Review. 2021;1(2):81-90.
59. Nisbet R, Elder J, Miner G. Handbook of statistical analysis and data mining applications. Academic Press; 2009.
60. Naveed N, Jaffar AJIJoPS. Malignancy and abnormality detection of mammograms using DWT features and ensembling of classifiers. 2011;6(8):2107-2116.
61. Kamalakannan J, Thirumal T, Vaidhyanathan A, MukeshBhai KD. Study on different classification technique for mammogram image. in 2015 International Conference on Circuits, Power and Computing Technologies [ICCPCT-2015], 2015;1-5:IEEE.
62. Tran HJn. A survey of machine learning and data mining techniques used in multimedia system. 2019;113:13-21.
63. Yusuff H, Mohamad N, Ngah U, Yahaya AJIJoR, Sciences RiA. Breast cancer analysis using logistic regression. 2012;10(1):14-22.
64. Purwanti E, Apsari R. Classification of digital mammograms using nearest neighbor techniques.
65. Mahmood MR, Abdulazeez AM. A Comparative study of a new hand recognition model based on line of features and other techniques. in International Conference of Reliable Information and Communication Technology. 2017;420-432:Springer.
66. Gareth J, Daniela W, Trevor H, Robert T. An introduction to statistical learning: With applications in R. Springer; 2013.
67. Lavanya D, Rani DKUJIJoCS, Engineering. Analysis of feature selection with classification: Breast cancer datasets. 2011;2(5):756-763.
68. Medjahed SA, Saadi TA, Benyettou AJIJoCA. Breast cancer diagnosis by using k-nearest neighbor with different distances and classification rules. 2013;62(1).
69. Bhatia NJapa. Survey of nearest neighbor techniques; 2010.
70. Cherif WJPCS. Optimization of K-NN algorithm by clustering and reliability coefficients: application to breast-cancer diagnosis. 2018;127:293-299.
71. Saeed J, Abdulazeez AMJJoSC, Mining D. Facial beauty prediction and analysis based on deep convolutional neural network: A review. 2021;2(1):1-12.
72. Bailey T, AK J. A note on distance-weighted k-nearest neighbor rules; 1978.
73. Witten IH, Frank EJASR. Data mining: practical machine learning tools and techniques with Java implementations. 2002;31(1):76-77.
74. Karabatak MJM. A new classifier for breast cancer detection based on Naïve Bayesian. 2015;72:32-36.
_________________________________________________________________________________
© 2021 Khorshid et al.; This is an Open Access article distributed under the terms of the Creative Commons Attribution License
([Link] which permits unrestricted use, distribution, and reproduction in any medium,
provided the original work is properly cited.

Peer-review history:
The peer review history for this paper can be accessed here:
[Link]
