UNIT IV
Machine Learning
Hypothesis testing and determining the multiple analytical methodologies
What is Machine Learning?
Machine learning usually refers to changes in systems that perform tasks associated with
artificial intelligence (AI). Such tasks involve recognition, diagnosis, planning, robot control,
prediction, etc.
Machine learning is a branch of computer science that studies the design of algorithms that
can learn. Typical machine learning tasks are concept learning, function learning or
“predictive modeling”, clustering, and finding predictive patterns. These tasks are learned
from available data, observed through experience or instruction, for example. The
expectation is that incorporating experience into a task will eventually improve performance
on it. The ultimate goal is to make this improvement automatic, so that humans like
ourselves don't need to intervene any more.
Machine learning tasks are typically classified into three broad categories, depending on the
nature of the learning "signal" or "feedback" available to a learning system. These are:
Supervised learning: The computer is presented with example inputs and their desired
outputs, given by a "teacher", and the goal is to learn a general rule that maps inputs to
outputs.
Unsupervised learning: No labels are given to the learning algorithm, leaving it on its
own to find structure in its input. Unsupervised learning can be a goal in itself (discovering
hidden patterns in data) or a means towards an end.
Reinforcement learning: A computer program interacts with a dynamic environment in
which it must perform a certain goal (such as driving a vehicle), without a teacher explicitly
telling it whether it has come close to its goal or not. Another example is learning to play a
game by playing against an opponent.
Between supervised and unsupervised learning is semi-supervised learning, where the
teacher gives an incomplete training signal: a training set with some (often many) of the
target outputs missing. Transduction is a special case of this principle where the entire set of
problem instances is known at learning time, except that part of the targets is missing.
[Figure: A support vector machine is a classifier that divides its input space into two regions,
separated by a linear boundary. Here, it has learned to distinguish black and white circles.]
Among other categories of machine learning problems, learning to learn learns its own
inductive bias based on previous experience. Developmental learning, elaborated for robot
learning, generates its own sequences (also called curriculum) of learning situations to
cumulatively acquire repertoires of novel skills through autonomous self-exploration and
social interaction with human teachers, and using guidance mechanisms such as active
learning, maturation, motor synergies, and imitation.
Another categorization of machine learning tasks arises when one considers the desired
output of a machine-learned system.
In classification, inputs are divided into two or more classes, and the learner must produce
a model that assigns unseen inputs to one or more (multi-label classification) of these
classes. This is typically tackled in a supervised way. Spam filtering is an example of
classification, where the inputs are email (or other) messages and the classes are "spam" and
"not spam".
In regression, also a supervised problem, the outputs are continuous rather than discrete.
In clustering, a set of inputs is to be divided into groups. Unlike in classification, the
groups are not known beforehand, making this typically an unsupervised task.
Density estimation finds the distribution of inputs in some space.
Dimensionality reduction simplifies inputs by mapping them into a lower-dimensional
space.
Machine learning and data mining often employ the same methods and overlap significantly.
They can be roughly distinguished as follows:
Machine learning focuses on prediction, based on known properties learned from the
training data.
Data mining focuses on the discovery of (previously) unknown properties in the data. This
is the analysis step of Knowledge Discovery in Databases.
The two areas overlap in many ways: data mining uses many machine learning methods, but
often with a slightly different goal in mind. On the other hand, machine learning also
employs data mining methods as "unsupervised learning" or as a preprocessing step to
improve learner accuracy. Much of the confusion between these two research communities
(which do often have separate conferences and separate journals, ECML PKDD being a
major exception) comes from the basic assumptions they work with: in machine learning,
performance is usually evaluated with respect to the ability to reproduce known knowledge,
while in Knowledge Discovery and Data Mining (KDD) the key task is the discovery of
previously unknown knowledge. Evaluated with respect to known knowledge, an uninformed
(unsupervised) method will easily be outperformed by supervised methods, while in a typical
KDD task, supervised methods cannot be used due to the unavailability of training data.
Machine learning also has intimate ties to optimization: many learning problems are
formulated as minimization of some loss function on a training set of examples. Loss
functions express the discrepancy between the predictions of the model being trained and the
actual problem instances. For example, in classification, one wants to assign a label to
instances, and models are trained to correctly predict the pre-assigned labels of a set of
examples.
The difference between the two fields arises from the goal of generalization: while
optimization algorithms can minimize the loss on a training set, machine learning is
concerned with minimizing the loss on unseen samples.
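To make the distinction concrete, the following minimal Python sketch compares the 0-1 loss of two classifiers on a training set and on unseen samples. The data, the memorizer and the threshold rule are all invented for illustration; they do not come from any library or from the original notes.

```python
def zero_one_loss(predict, examples):
    """Fraction of (input, label) examples that `predict` labels incorrectly."""
    wrong = sum(1 for x, y in examples if predict(x) != y)
    return wrong / len(examples)

# Hypothetical data: the true label is 1 when x >= 5.
# The training point (3, 1) is deliberately noisy.
train_set = [(1, 0), (2, 0), (3, 1), (6, 1), (7, 1), (8, 1)]
unseen_set = [(0, 0), (4, 0), (5, 1), (9, 1)]

lookup = dict(train_set)

def memorizer(x):
    # Recalls training labels exactly; guesses 0 for anything it has not seen.
    return lookup.get(x, 0)

def threshold_rule(x):
    # A simpler rule that ignores the noisy point.
    return 1 if x >= 5 else 0

train_loss_memorizer = zero_one_loss(memorizer, train_set)
unseen_loss_memorizer = zero_one_loss(memorizer, unseen_set)
train_loss_rule = zero_one_loss(threshold_rule, train_set)
unseen_loss_rule = zero_one_loss(threshold_rule, unseen_set)

print("memorizer:", train_loss_memorizer, unseen_loss_memorizer)
print("threshold rule:", train_loss_rule, unseen_loss_rule)
```

In this toy setting the memorizer reaches zero loss on the training set but does worse on the unseen samples than the simpler rule, which is exactly the gap that generalization is concerned with.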
Train model using statistical/machine learning algorithms, Test model
To train the algorithm we feed it quality data known as a training set. A training set is the set
of training examples we'll use to train our machine learning algorithms.
Train the algorithm: This is where the machine learning takes place. This step and the
next step are where the “core” algorithms lie, depending on the algorithm. You feed the
algorithm good, clean data from the first two steps and
extract knowledge or information. You often store this knowledge in a format that's readily
usable by a machine for the next two steps. In the case of unsupervised learning, there's no
training step because you don't have a target value. Everything is used in the next step.
Test the algorithm: This is where the information learned in the previous step is put to use.
When you're evaluating an algorithm, you'll test it to see how well it does. In the case of
supervised learning, you have some known values you can use to evaluate the algorithm. In
unsupervised learning, you may have to use some other metrics to evaluate the success.
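The train-then-test cycle for the supervised case can be sketched in a few lines of Python. The nearest-mean learner and the data below are hypothetical stand-ins chosen for brevity, not an algorithm prescribed by these notes.

```python
def train(examples):
    """Learn one mean feature value per class from (feature, label) pairs."""
    sums, counts = {}, {}
    for x, y in examples:
        sums[y] = sums.get(y, 0.0) + x
        counts[y] = counts.get(y, 0) + 1
    return {y: sums[y] / counts[y] for y in sums}

def predict(model, x):
    """Assign x to the class whose learned mean is closest."""
    return min(model, key=lambda y: abs(x - model[y]))

def accuracy(model, examples):
    """Share of examples whose known label matches the prediction."""
    hits = sum(1 for x, y in examples if predict(model, x) == y)
    return hits / len(examples)

# Made-up labeled data, split into a training set and a held-out test set.
training_set = [(1.0, "low"), (2.0, "low"), (3.0, "low"),
                (7.0, "high"), (8.0, "high"), (9.0, "high")]
test_set = [(2.5, "low"), (5.5, "low"), (6.0, "high"), (8.5, "high")]

model = train(training_set)                 # knowledge extracted from the data
test_accuracy = accuracy(model, test_set)   # evaluated against known values
print(model, test_accuracy)
```

The model is just a small dictionary, which illustrates the idea of storing the learned knowledge in a machine-readable form for the following steps.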
Sample for prediction
For prediction various types of algorithms are used.
Collect data. You could collect the samples by scraping a website and extracting data, or
you could get information from an RSS feed or an API. You could have a device collect wind
speed measurements and send them to you, or blood glucose levels, or anything you can
measure. The number of options is endless. To save some time and effort, you could use
publicly available data.
Prepare the input data. Once you have this data, you need to make sure it's in a usable
format. The format we'll be using in this book is the Python list. We'll talk about Python
more in a little bit, and lists are reviewed in appendix A. The benefit of having this standard
format is that you can mix and match algorithms and data sources. You may need to do some
algorithm-specific formatting here. Some algorithms need features in a special format, some
algorithms can deal with target variables and features as strings, and some
need them to be integers. We'll get to this later, but the algorithm-specific formatting is
usually trivial compared to collecting data.
One idea that naturally arises is combining multiple classifiers. Methods that do this are
known as ensemble methods or meta-algorithms. Ensemble methods can take the form of using
different algorithms, using the same algorithm with different settings, or assigning different
parts of the dataset to different classifiers.
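As a rough illustration of combining classifiers, the Python sketch below lets three deliberately weak rules vote on a label. The three rules and the sample messages are hypothetical; real ensemble methods such as bagging or boosting are considerably more elaborate than this simple majority vote.

```python
from collections import Counter

def majority_vote(classifiers, x):
    """Return the label predicted by most of the classifiers for input x."""
    votes = Counter(clf(x) for clf in classifiers)
    return votes.most_common(1)[0][0]

# Three made-up spam rules, each looking at a different cue in the message.
def has_link(msg):
    return "spam" if "http" in msg else "not spam"

def shouts(msg):
    return "spam" if msg.isupper() else "not spam"

def mentions_prize(msg):
    return "spam" if "prize" in msg.lower() else "not spam"

ensemble = [has_link, shouts, mentions_prize]

verdict_a = majority_vote(ensemble, "CLAIM YOUR PRIZE at http://example.com")
verdict_b = majority_vote(ensemble, "Lunch at noon? See http://example.com/menu")
print(verdict_a, verdict_b)
```

Each rule on its own misfires on some messages (the link rule flags the lunch invitation), but the vote of all three smooths out individual mistakes.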
Explore the chosen algorithms for more accuracy
Analyze the input data. This means looking at the data from the previous task. It could be as
simple as opening the parsed data in a text editor to make sure that it was collected and
prepared properly and that you don't have a bunch of empty
values. You can also look at the data to see if you can recognize any patterns or if there's
anything obvious, such as a few data points that are vastly different from the rest of the set.
Plotting data in one, two, or three dimensions can also help. But most of the time you'll have
more than three features and you can't easily plot the data across all features at one time. You
could, however, use some advanced methods we'll talk about later to distill multiple
dimensions down to two or three so you can visualize the data.
If you're working with a production system and you know what the data should look like, or
you trust its source, you can skip this step. This step takes human involvement, and for an
automated system you don't want human involvement. The value of this step is that it
assures you that you don't have garbage coming in.
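A first pass over the data can be as small as the Python sketch below, which counts empty values and flags points that sit far from the rest. The wind-speed readings are made up, and the cut-off of 10 units from the median is an arbitrary assumption for this example, not a general rule.

```python
import statistics

def summarize(column):
    """Report missing entries and simple range statistics for one feature."""
    present = [v for v in column if v is not None]
    return {
        "missing": len(column) - len(present),
        "min": min(present),
        "max": max(present),
        "median": statistics.median(present),
    }

# Hypothetical wind-speed readings; None marks an empty value.
wind_speed = [12.1, 11.8, None, 12.4, 95.0, 11.9, None, 12.0]

report = summarize(wind_speed)

# Flag readings that are vastly different from the rest of the set.
suspects = [v for v in wind_speed
            if v is not None and abs(v - report["median"]) > 10]

print(report)
print("suspect readings:", suspects)
```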
What is kNN Algorithm?
Let's assume we have several groups of labeled samples. The items present in each group are
homogeneous in nature. Now, suppose we have an unlabeled example which needs to be
classified into one of the several labeled groups.
How do you do that?
Unhesitatingly, using the kNN algorithm.
k nearest neighbors is a simple algorithm that stores all available cases and classifies new
cases by a majority vote of their k nearest neighbors. This algorithm segregates unlabeled data
points into well-defined groups.
How to select appropriate k value?
Choosing the number of nearest neighbors, i.e. the value of k, plays a significant role in
determining the efficacy of the model: it determines how well the kNN algorithm generalizes
from the data. A large k reduces the variance caused by noisy data; the side effect is a
bias that makes the learner tend to ignore smaller patterns which may carry useful
insights. A small k has the opposite effect: it picks up small patterns but is more easily
misled by noise.
The following example will give you practical insight on selecting the appropriate k value.
Example of kNN Algorithm
Let's consider 10 'drinking items' which are rated on two parameters on a scale of 1 to 10.
The two parameters are “sweetness” and “fizziness”. This is more of a perception-based
rating and so may vary between individuals. I will use my own ratings (which might differ
from yours) to take this illustration ahead. The ratings of a few items look somewhat as follows:
“Sweetness” reflects the perceived sugar content of the items. “Fizziness” reflects the
presence of bubbles in the drink due to its carbon dioxide content.
Again, all these ratings are based on personal perception and are strictly relative.
From the above figure, it is clear we have bucketed the 10 items into 4 groups, namely
'COLD DRINKS', 'ENERGY DRINKS', 'HEALTH DRINKS' and 'HARD DRINKS'. The
question here is: into which group would 'Maaza' fall? This will be determined by
calculating distance.
Calculating Distance
Now, calculating the distance between 'Maaza' and its nearest neighbors ('ACTIV', 'Vodka',
'Pepsi' and 'Monster') requires a distance formula, the most popular being the Euclidean
distance formula, i.e. the shortest distance between two points, which could be measured
with a ruler. For two points (x1, y1) and (x2, y2) it is sqrt((x1 - x2)^2 + (y1 - y2)^2).
Using the co-ordinates of Maaza (8,2) and Vodka (2,1), the distance between 'Maaza' and
'Vodka' can be calculated as:
dist(Maaza,Vodka) = sqrt((8 - 2)^2 + (2 - 1)^2) = sqrt(37) ≈ 6.08
Using Euclidean distance, we can calculate the distance of Maaza from each of its nearest
neighbors. Since the distance between Maaza and ACTIV is the smallest, it may be inferred
that Maaza is of the same nature as ACTIV and therefore belongs to the same group (Health
Drinks).
If k=1, the algorithm considers only the nearest neighbor to Maaza, i.e. ACTIV; if k=3, the
algorithm considers the 3 nearest neighbors to Maaza (ACTIV, Vodka, Monster) and compares
their distances; ACTIV is still the nearest to Maaza.
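The procedure can be sketched in Python as follows. Only the co-ordinates of Maaza (8,2) and Vodka (2,1) come from the text; the co-ordinates of ACTIV, Pepsi and Monster are made up for illustration, as are the function names. When votes are tied, this sketch falls back on the label of the closest neighbor, which mirrors the reasoning used for k=3 above; other tie-breaking rules are possible.

```python
import math
from collections import Counter

def euclidean(p, q):
    """Straight-line distance between two points of equal dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def knn_classify(query, samples, k):
    """Label `query` by majority vote among its k nearest labeled samples.

    `samples` is a list of (name, point, label) tuples.
    Returns the winning label and the names of the neighbors consulted.
    """
    # Sort the stored cases by distance to the query and keep the k closest.
    nearest = sorted(samples, key=lambda s: euclidean(query, s[1]))[:k]
    votes = Counter(label for _, _, label in nearest)
    # Equal counts keep first-seen order, so a tie goes to the closest neighbor.
    winner = votes.most_common(1)[0][0]
    return winner, [name for name, _, _ in nearest]

# Points are (sweetness, fizziness). Only Vodka's point is taken from the text;
# the other three are hypothetical.
drinks = [
    ("ACTIV", (9, 1), "Health Drinks"),
    ("Vodka", (2, 1), "Hard Drinks"),
    ("Pepsi", (9, 9), "Cold Drinks"),
    ("Monster", (7, 7), "Energy Drinks"),
]
maaza = (8, 2)

label_k1, neighbors_k1 = knn_classify(maaza, drinks, k=1)
label_k3, neighbors_k3 = knn_classify(maaza, drinks, k=3)

print(round(euclidean(maaza, (2, 1)), 2))
print(label_k1, neighbors_k1)
print(label_k3, neighbors_k3)
```

Note that there is no training step: the "model" is simply the stored list of labeled drinks, and all the work happens when a new item is classified.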
kNN Algorithm – Pros and Cons
Pros: The algorithm is highly unbiased in nature and makes no prior assumption about the
underlying data. Being simple and effective, it is easy to implement and has gained
good popularity.
Cons: kNN is simple indeed, but it has drawn a lot of flak for being extremely
simple! If we take a deeper look, it doesn't create a model, since there's no abstraction
process involved. Yes, the training process is really fast because the data is stored verbatim
(hence "lazy learner"), but prediction is slow, and useful insights are sometimes missing.
Building a robust kNN classifier therefore requires time to be invested in data preparation
(especially treating missing data and categorical features).
Case Study: Detecting Prostate Cancer
Machine learning is used extensively in the pharmaceutical industry, especially for detecting
oncogenic (cancerous) cell growth. R can be used to build machine learning models that
predict the abnormal growth of cells, thereby helping in the detection of cancer and
benefiting the health system.
Let's see the process of building this model using the kNN algorithm in R. Below, I've
explained every line of code written to accomplish this task.
Step 1- Data collection
We will use a data set of 100 patients (created solely for the purpose of practice) to
implement the kNN algorithm and interpret the results. The data set has been prepared
keeping in mind the results which are generally obtained from a DRE (Digital Rectal Exam).
You can download the data set and practice these steps as I explain.
The data set consists of 100 observations and 10 variables (an ID, one categorical variable
and 8 numeric variables). The 8 numeric variables are as follows:
1. Radius
2. Texture
3. Perimeter
4. Area
5. Smoothness
6. Compactness
7. Symmetry
8. Fractal dimension
In real life, there are dozens of important parameters needed to measure the probability of
cancerous growth, but for simplicity let's deal with 8 of them!
Here's what the data set looks like:
Step 2- Preparing and exploring the data
Let's make sure that we understand every line of code before proceeding to the next stage:
setwd("C:/Users/Payal/Desktop/KNN")
# Points R to the folder containing the required file, 'Prostate_Cancer.csv'. Keep in mind
# that it's a common mistake to use “\” instead of “/” in the path given to setwd().
prc <- read.csv("Prostate_Cancer.csv", stringsAsFactors = FALSE)
# Imports the required data set and saves it to the prc data frame.
# stringsAsFactors = FALSE keeps every character vector as text instead of converting it
# to a factor.
str(prc)
# Shows the structure of the data: the variables, their types and the number of observations.
We find that the data is structured with 10 variables and 100 observations. If we observe the
data set, the first variable 'id' is unique in nature and can be removed as it does not provide
useful information.
prc <- prc[-1]
# Removes the first variable (id) from the data set.
The data set contains patients who have been diagnosed as either Malignant (M) or Benign
(B).
table(prc$diagnosis_result)
# Gives the number of patients in each diagnosis group.
(The variable diagnosis_result is our target variable, i.e. the diagnosis that we want to
predict from the 8 numeric variables.)
In case we wish to rename B as "Benign" and M as "Malignant" and see the results in
percentage form, we may write:
prc$diagnosis <- factor(prc$diagnosis_result, levels = c("B", "M"),
                        labels = c("Benign", "Malignant"))
round(prop.table(table(prc$diagnosis)) * 100, digits = 1)
# Gives the result in percentage form, rounded off to 1 decimal place (hence digits = 1).
Normalizing numeric data
This step is of paramount importance since the scale used for the values of each variable
might be different. The best practice is to normalize the data and transform all the values to a
common scale.
normalize <- function(x) {
  return((x - min(x)) / (max(x) - min(x)))
}
Once we run this code, we need to normalize the numeric features in the data set.
Instead of normalizing each of the 8 variables individually, we use:
prc_n <- as.data.frame(lapply(prc[2:9], normalize))
The first variable in our data set (after removal of id) is 'diagnosis_result', which is not
numeric in nature. So, we start from the 2nd variable. The function lapply() applies normalize()
to each feature in the data frame. The final result is stored in the prc_n data frame using the
as.data.frame() function.
Let's check using the variable 'radius' whether the data has been normalized.
summary(prc_n$radius)
The results show that the data has been normalized: the minimum is 0 and the maximum is 1.
Do try with the other variables such as perimeter, area etc.
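Why the common scale matters for a distance-based method can be seen in a small numeric sketch. It is written in Python with made-up area and smoothness values (they are not taken from Prostate_Cancer.csv), and the function name min_max is an assumption for this example; the formula is the same min-max rescaling as normalize() above.

```python
def min_max(values):
    """Rescale a list of numbers linearly so they span the range 0 to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

# Hypothetical measurements for four patients.
area = [200, 500, 800, 1100]
smoothness = [0.07, 0.10, 0.13, 0.10]

# Raw gap between patient 1 and patient 2 on each feature.
raw_gap_area = abs(area[0] - area[1])                      # hundreds of units
raw_gap_smoothness = abs(smoothness[0] - smoothness[1])    # hundredths of a unit

area_n = min_max(area)
smoothness_n = min_max(smoothness)

# After rescaling, both gaps are fractions of each feature's own range.
scaled_gap_area = abs(area_n[0] - area_n[1])
scaled_gap_smoothness = abs(smoothness_n[0] - smoothness_n[1])

print(raw_gap_area, raw_gap_smoothness)
print(round(scaled_gap_area, 3), round(scaled_gap_smoothness, 3))
```

On the raw values the area difference swamps the smoothness difference in any Euclidean distance; after rescaling, both features contribute on comparable terms.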
Creating training and test data set
The kNN algorithm is applied to the training data set and the results are verified on the test
data set.
For this, we divide the data set into 2 portions in the ratio of 65:35 (assumed) for the
training and test data sets respectively. You may use a different ratio altogether depending on
the business requirement!
We shall divide the prc_n data frame into the prc_train and prc_test data frames:
prc_train <- prc_n[1:65, ]
prc_test <- prc_n[66:100, ]
The blank after the comma in each of the above statements indicates that all columns should
be included for the selected rows.
Our target variable is 'diagnosis_result', which we have not included in our training and test
data sets.
prc_train_labels <- prc[1:65, 1]
prc_test_labels <- prc[66:100, 1]
# Takes the diagnosis labels in column 1 of the prc data frame and in turn creates the
# prc_train_labels and prc_test_labels vectors.
Step 3 – Training a model on data
The knn() function is used to train the model; it is provided by the package 'class', which
we need to install. The knn() function identifies the k nearest neighbors using Euclidean
distance, where k is a user-specified number.
You need to type in the following commands to use knn():
install.packages("class")
library(class)
Now we are ready to use the knn() function to classify the test data:
prc_test_pred <- knn(train = prc_train, test = prc_test, cl = prc_train_labels, k = 10)
As a rule of thumb, the value of k is chosen as the square root of the number of observations.
knn() returns a factor of predicted labels, one for each example in the test data set, which is
then assigned to prc_test_pred.
Step 4 – Evaluate the model performance
We have built the model, but we also need to check whether the predicted values in
prc_test_pred match the known values in prc_test_labels. To do this, we use the
CrossTable() function available in the package 'gmodels'.
We can install it using:
install.packages("gmodels")
Once installed, it can be loaded and applied to the actual and predicted labels:
library(gmodels)
CrossTable(x = prc_test_labels, y = prc_test_pred, prop.chisq = FALSE)
The test data consisted of 35 observations. Of these, 5 cases were accurately predicted as
Benign (B) (TN, True Negatives), which constitutes 14.3%. Also, 16 out of 35 observations
were accurately predicted as Malignant (M) (TP, True Positives), which constitutes 45.7%.
There were no False Negatives (FN), meaning no cases that are actually malignant were
predicted as benign. False negatives, if any, pose a potential threat because a malignant
case would go undetected, so the main focus when improving the model is to reduce FNs.
There were 14 False Positives (FP), meaning 14 cases were actually benign but were
predicted as malignant.
The total accuracy of the model is 60% ((TN+TP)/35 = 21/35), which shows that there may be
room to improve the model performance.
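The figures quoted above can be checked with a few lines of arithmetic. The Python sketch below uses only the four counts reported in the text; it additionally derives sensitivity, specificity and precision, which are standard measures that the original write-up does not report.

```python
# Counts reported for the 35 test observations (positive = malignant).
tn, tp, fp, fn = 5, 16, 14, 0
total = tn + tp + fp + fn

accuracy = (tn + tp) / total      # share of all predictions that were right
sensitivity = tp / (tp + fn)      # share of malignant cases that were caught
specificity = tn / (tn + fp)      # share of benign cases recognised as benign
precision = tp / (tp + fp)        # share of "malignant" predictions that were right

print("TN share: %.1f%%" % (100 * tn / total))
print("TP share: %.1f%%" % (100 * tp / total))
print("accuracy: %.1f%%" % (100 * accuracy))
print("sensitivity: %.3f, specificity: %.3f, precision: %.3f"
      % (sensitivity, specificity, precision))
```

The model catches every malignant case (no FNs) but recognises only 5 of the 19 benign cases, which is where the accuracy is lost.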
Step 5 – Improve the performance of the model
This can be done by repeating steps 3 and 4 with a different k-value. Generally, k is taken
as the square root of the number of observations, and in this case we took k=10, which is
exactly the square root of 100. The k-value may be varied around 10 to
check whether the accuracy of the model increases. Do try it out with values of your choice to
increase the accuracy! Also remember to keep the number of FNs as low as possible.
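Repeating the train-and-evaluate cycle for several k-values is easy to automate. The Python sketch below does so on a tiny made-up data set with a single feature; it only shows the shape of the loop and does not reproduce the prostate data or the behaviour of R's knn() in detail.

```python
from collections import Counter

def knn_predict(train, x, k):
    """Majority label among the k training points closest to x (one feature)."""
    nearest = sorted(train, key=lambda pair: abs(pair[0] - x))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def evaluate(train, test, k):
    """Accuracy and number of false negatives (actual M predicted as B)."""
    predictions = [(knn_predict(train, x, k), actual) for x, actual in test]
    correct = sum(1 for pred, actual in predictions if pred == actual)
    false_negatives = sum(1 for pred, actual in predictions
                          if actual == "M" and pred == "B")
    return {"accuracy": correct / len(test), "false_negatives": false_negatives}

# Hypothetical single-feature data: B = benign, M = malignant.
train = [(1.0, "B"), (2.0, "B"), (3.0, "B"), (3.5, "M"),
         (6.0, "M"), (7.0, "M"), (8.0, "M")]
test = [(1.4, "B"), (3.4, "B"), (5.1, "M"), (7.6, "M")]

# Try several odd k-values and compare the scores side by side.
results = {k: evaluate(train, test, k) for k in (1, 3, 5)}
for k, scores in results.items():
    print(k, scores)
```

In this toy example k=1 is misled by the isolated malignant point at 3.5, while k=3 and k=5 smooth it out; on real data the best k has to be found by this kind of comparison, keeping an eye on the FN count as well as the accuracy.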