Machine Learning Tutorial
This tutorial introduces machine learning and a wide range of its techniques,
such as supervised, unsupervised, and reinforcement learning. You will learn
about regression and classification models, clustering methods, hidden Markov
models, and various sequential models.
Machine learning is a subset of artificial intelligence that is mainly
concerned with the development of algorithms which allow a computer to learn
from data and past experience on its own. The term machine learning was
first introduced by Arthur Samuel in 1959. We can define it in a summarized way
as:
Machine learning enables a machine to automatically learn from data, improve performance from
experiences, and predict things without being explicitly programmed.
With the help of sample historical data, known as training data, machine
learning algorithms build a mathematical model that helps in making predictions
or decisions without being explicitly programmed. Machine learning brings
computer science and statistics together to create predictive models. It
constructs or uses algorithms that learn from historical data. The more
information we provide, the better the performance tends to be.
A larger amount of data helps to build a better model, which predicts the output
more accurately.
We can train machine learning algorithms by providing them with a large amount of
data and letting them explore the data, construct the models, and predict the
required output automatically. The performance of a machine learning algorithm
depends on the amount of data, and it can be measured by a cost function. With the
help of machine learning, we can save both time and money.
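As a minimal sketch of what a cost function does, the example below computes the mean squared error, one common choice of cost function, for two sets of predictions. The function name and the sample values are assumptions made for illustration, not part of the original tutorial.

```python
import numpy as np

def mean_squared_error(y_true, y_pred):
    """Average of the squared differences between targets and predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean((y_true - y_pred) ** 2))

# Illustrative target values and two candidate sets of predictions
y_true = [3.0, 5.0, 7.0]
good_predictions = [2.5, 5.0, 7.5]
poor_predictions = [0.0, 9.0, 3.0]

good_cost = mean_squared_error(y_true, good_predictions)
poor_cost = mean_squared_error(y_true, poor_predictions)
print(good_cost, poor_cost)
```

A lower cost means the predictions are closer to the known outputs, so the first set of predictions scores better than the second.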
The importance of machine learning can be easily understood from its use cases.
Currently, machine learning is used in self-driving cars, cyber fraud
detection, face recognition, friend suggestions by Facebook, and more. Top
companies such as Netflix and Amazon have built machine learning models
that use a vast amount of data to analyze user interests and recommend
products accordingly.
Following are some key points which show the importance of Machine
Learning:
1. Supervised learning
2. Unsupervised learning
3. Reinforcement learning
1) Supervised Learning
Supervised learning is a type of machine learning method in which we provide
sample labeled data to the machine learning system in order to train it, and on
that basis, it predicts the output.
The system creates a model using labeled data to understand the dataset and
learn about each data point. Once training and processing are done, we test the
model by providing sample data to check whether it predicts the correct
output.
The goal of supervised learning is to map input data to output data.
Supervised learning is based on supervision, in the same way that a student
learns under the supervision of a teacher. An example of supervised learning
is spam filtering.
o Classification
o Regression
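The supervised workflow described above (train on labeled data, then test on sample data) can be sketched as follows. This is only an illustration: it assumes scikit-learn is installed, and the Iris dataset and the k-nearest-neighbors classifier are choices made here, not methods named by the tutorial.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Labeled data: X holds the input features, y holds the known output labels
X, y = load_iris(return_X_y=True)

# Keep part of the labeled data aside to test the trained model
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

# Train a classifier on the labeled training data
model = KNeighborsClassifier(n_neighbors=5)
model.fit(X_train, y_train)

# Check how often the model predicts the correct label on unseen samples
accuracy = model.score(X_test, y_test)
print(f"Test accuracy: {accuracy:.2f}")
```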
2) Unsupervised Learning
Unsupervised learning is a learning method in which a machine learns without any
supervision.
The training is provided to the machine with the set of data that has not been
labeled, classified, or categorized, and the algorithm needs to act on that data
without any supervision. The goal of unsupervised learning is to restructure the
input data into new features or a group of objects with similar patterns.
o Clustering
o Association
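As a rough sketch of clustering, the example below groups unlabeled points with k-means. It assumes scikit-learn and NumPy are installed; the synthetic data and the choice of k-means are assumptions made for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

# Unlabeled data: two well-separated groups of 2-D points (synthetic)
rng = np.random.default_rng(0)
group_a = rng.normal(loc=[0.0, 0.0], scale=0.5, size=(50, 2))
group_b = rng.normal(loc=[5.0, 5.0], scale=0.5, size=(50, 2))
points = np.vstack([group_a, group_b])

# The algorithm receives no labels; it groups points by similarity alone
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(points)

print(labels[:5], labels[-5:])
```

The points from each synthetic group end up in the same cluster, even though the algorithm was never told which group a point came from.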
3) Reinforcement Learning
Reinforcement learning is a feedback-based learning method in which a learning
agent gets a reward for each right action and a penalty for each wrong action.
The agent learns automatically from this feedback and improves its
performance. In reinforcement learning, the agent interacts with the environment
and explores it. The goal of an agent is to get the most reward points, and hence,
it improves its performance.
A robotic dog that automatically learns the movement of its limbs is an
example of reinforcement learning.
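The reward-and-penalty idea can be sketched with a very small agent that repeatedly chooses one of three actions. This is a simplified illustration (an epsilon-greedy strategy on a so-called multi-armed bandit), not the method of any particular system; the function name, reward probabilities, and parameters are all assumptions.

```python
import random

def run_agent(reward_probs, steps=2000, epsilon=0.1, seed=0):
    """Learn the average reward of each action from feedback alone."""
    rng = random.Random(seed)
    counts = [0] * len(reward_probs)
    values = [0.0] * len(reward_probs)  # running average reward per action
    for _ in range(steps):
        if rng.random() < epsilon:
            # Explore: try a random action
            action = rng.randrange(len(reward_probs))
        else:
            # Exploit: pick the action with the best average so far
            action = max(range(len(values)), key=values.__getitem__)
        # Feedback: +1 reward for a "right" outcome, -1 penalty otherwise
        reward = 1.0 if rng.random() < reward_probs[action] else -1.0
        counts[action] += 1
        values[action] += (reward - values[action]) / counts[action]
    return values, counts

# Hypothetical environment: action 2 is rewarded most often
values, counts = run_agent([0.2, 0.5, 0.8])
best_action = max(range(len(values)), key=values.__getitem__)
print(best_action, counts)
```

The agent is never told which action is best; it discovers this by acting, receiving rewards or penalties, and updating its estimates.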
Note: We will learn about the above types of machine learning in detail in later chapters.
o 1940s: ENIAC, the first electronic general-purpose computer, which was
programmed manually, was completed in 1945. After that, stored-program
computers such as EDSAC in 1949 and EDVAC in 1951 were built.
o 1943: In 1943, a human neural network was modeled with an electrical
circuit. In 1950, scientists started applying this idea and
analyzed how human neurons might work.
o 1952: Arthur Samuel, a pioneer of machine learning, created a
program that helped an IBM computer play checkers. The more it played,
the better it performed.
o 1959: In 1959, the term "Machine Learning" was first coined by Arthur
Samuel.
o The period from 1974 to 1980 was a tough time for AI and ML researchers,
and it came to be called the AI winter.
o During this period, machine translation failed to deliver, and people
lost interest in AI, which led to reduced government funding for
research.
o 2006: In 2006, computer scientist Geoffrey Hinton gave neural net
research a new name, "deep learning," and it has since become one of
the most trending technologies.
o 2012: In 2012, Google created a deep neural network which learned to
recognize images of humans and cats in YouTube videos.
o 2014: In 2014, the chatbot "Eugene Goostman" passed the Turing Test. It
was the first chatbot to convince 33% of the human judges that it was not
a machine.
o 2014: DeepFace was a deep neural network created by Facebook, which
claimed that it could recognize a person with the same precision as a
human.
o 2016: AlphaGo beat Lee Sedol, then the world's number two player, at the
game of Go. In 2017, it beat Ke Jie, the number one player.
o 2017: In 2017, Alphabet's Jigsaw team built an intelligent system that
was able to learn about online trolling. It read millions of comments
from different websites to learn to stop online trolling.
Modern machine learning models can be used for making various predictions,
including weather prediction, disease prediction, stock market analysis,
etc.
Prerequisites
Before learning machine learning, you should have basic knowledge of the
following so that you can easily understand the concepts of machine learning:
Audience
Our machine learning tutorial is designed to help beginners and professionals.
Problems
We assure you that you will not find any difficulty while learning our Machine
learning tutorial. But if there is any mistake in this tutorial, kindly post the problem
or error in the contact form so that we can improve it.
1. Image Recognition:
Image recognition is one of the most common applications of machine learning. It
is used to identify objects, persons, places, and so on in digital images. A
popular use case of image recognition and face detection is automatic
friend-tagging suggestion.
It is based on the Facebook project named "DeepFace," which is responsible for
face recognition and person identification in the picture.
2. Speech Recognition
While using Google, we get the option "Search by voice." It comes under
speech recognition, which is a popular application of machine learning.
3. Traffic prediction:
If we want to visit a new place, we take the help of Google Maps, which shows us
the correct path with the shortest route and predicts the traffic conditions.
o Real-time location of the vehicle from the Google Maps app and sensors
o Average time taken on past days at the same time
Everyone who uses Google Maps helps to make the app better. It takes
information from the user and sends it back to its database to improve
performance.
4. Product recommendations:
Machine learning is widely used by e-commerce and entertainment
companies such as Amazon and Netflix for product recommendations to the
user. Whenever we search for a product on Amazon, we start getting
advertisements for the same product while surfing the internet in the same
browser, and this is because of machine learning.
Google understands user interests using various machine learning algorithms
and suggests products according to customer interest.
5. Self-driving cars:
One of the most exciting applications of machine learning is self-driving cars,
where machine learning plays a significant role. Tesla, a popular car
manufacturer, is working on self-driving cars. It uses an unsupervised
learning method to train the car models to detect people and objects
while driving.
o Content Filter
o Header filter
o General blacklists filter
o Rules-based filters
o Permission filters
ways just by our voice instructions such as Play music, call someone, Open an
email, Scheduling an appointment, etc.
These assistants record our voice instructions, send them to a server in the
cloud, decode them using ML algorithms, and act accordingly.
For each genuine transaction, the output is converted into hash values, and
these values become the input for the next round. Each genuine transaction
follows a specific pattern, which changes for a fraudulent transaction; the
system therefore detects it and makes our online transactions more secure.
Machine learning life cycle involves seven major steps, which are given below:
o Gathering Data
o Data preparation
o Data Wrangling
o Analyse Data
o Train the model
o Test the model
o Deployment
The most important thing in the complete process is to understand the problem
and to know the purpose of the problem. Therefore, before starting the life cycle,
we need to understand the problem because the good result depends on the
better understanding of the problem.
1. Gathering Data:
Data Gathering is the first step of the machine learning life cycle. The goal of this
step is to identify and obtain all data-related problems.
In this step, we need to identify the different data sources, as data can be
collected from various sources such as files, databases, the internet, or mobile
devices. It is one of the most important steps of the life cycle. The quantity and
quality of the collected data determine the quality of the output. In general,
the more data we have, the more accurate the prediction will be.
This step includes the below tasks:
By performing the above tasks, we get a coherent set of data, also called
a dataset. It will be used in further steps.
2. Data preparation
After collecting the data, we need to prepare it for further steps. Data preparation
is a step where we put our data into a suitable place and prepare it to use in our
machine learning training.
In this step, first, we put all data together, and then randomize the ordering of
data.
o Data exploration:
It is used to understand the nature of the data that we have to work with. We
need to understand the characteristics, format, and quality of the data.
A better understanding of the data leads to an effective outcome. In this
step, we find correlations, general trends, and outliers.
o Data pre-processing:
Now the next step is preprocessing of data for its analysis.
3. Data Wrangling
Data wrangling is the process of cleaning and converting raw data into a useable
format. It is the process of cleaning the data, selecting the variable to use, and
transforming the data in a proper format to make it more suitable for analysis in
the next step. It is one of the most important steps of the complete process.
Cleaning of data is required to address the quality issues.
Not all the data we have collected is necessarily useful. In real-world
applications, collected data may have various issues, including:
o Missing Values
o Duplicate data
o Invalid data
o Noise
It is mandatory to detect and remove the above issues because they can negatively
affect the quality of the outcome.
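Detecting and removing such issues can be sketched with pandas as shown below. The small table and its column names are made up for illustration, and treating a negative age as invalid is an assumed rule.

```python
import numpy as np
import pandas as pd

# Hypothetical raw data with a duplicate row, a missing age, and an invalid age
raw = pd.DataFrame({
    "Country": ["India", "France", "France", "Germany", "India"],
    "Age": [38.0, 43.0, 43.0, np.nan, -5.0],
    "Salary": [68000.0, 45000.0, 45000.0, 54000.0, 58000.0],
})

# Detect the issues
missing_per_column = raw.isna().sum()                  # missing values per column
duplicate_rows = int(raw.duplicated().sum())           # exact duplicate rows
invalid_age_rows = int((raw["Age"] < 0).sum())         # assumed rule: age >= 0

# Remove the issues
cleaned = raw.drop_duplicates()                        # drop duplicate rows
cleaned = cleaned.dropna()                             # drop rows with missing values
cleaned = cleaned[cleaned["Age"] >= 0]                 # drop rows with invalid ages
cleaned = cleaned.reset_index(drop=True)

print(missing_per_column)
print(duplicate_rows, invalid_age_rows, len(cleaned))
```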
4. Data Analysis
Now the cleaned and prepared data is passed on to the analysis step. This step
involves:
The aim of this step is to build a machine learning model to analyze the data using
various analytical techniques and review the outcome. It starts with the
determination of the type of the problems, where we select the machine learning
techniques such as Classification, Regression, Cluster analysis, Association,
etc. then build the model using prepared data, and evaluate the model.
Hence, in this step, we take the data and use machine learning algorithms to build
the model.
5. Train Model
The next step is to train the model. In this step, we train our model to improve
its performance and get a better outcome for the problem.
We use datasets to train the model using various machine learning algorithms.
Training a model is required so that it can understand the various patterns, rules,
and features.
6. Test Model
Once our machine learning model has been trained on a given dataset, then we
test the model. In this step, we check for the accuracy of our model by providing a
test dataset to it.
Testing the model determines the percentage accuracy of the model as per the
requirement of project or problem.
7. Deployment
The last step of machine learning life cycle is deployment, where we deploy the
model in the real-world system.
In this topic, we will learn to install Python and an IDE with the help of Anaconda
distribution.
Below some steps are given to show the downloading and installing process of
Anaconda and IDE:
o After clicking on the first link, you will reach to download page of Anaconda,
as shown in the below image:
o Anaconda is available for Windows, Linux, and macOS, so you can
download it for your OS type by clicking on the available options shown in
the below image. It offers Python 2.7 and Python 3.7 versions; since 3.7 is
the later version, we will download the Python 3.7 version. After
clicking on the download option, it will start downloading on your computer.
Note: In this topic, we are downloading Anaconda for Windows; you can choose it as per your OS.
o It will open a License Agreement window. Click on the "I Agree" option and
move further.
o In the next window, you will get two options for installations as given in the
below image. Select the first option (Just me) and click on Next.
o Now you will get a window for installing location, here, you can leave it as
default or change it by browsing a location, and then click on Next. Consider
the below image:
o Now select the second option, and click on install.
o Now installation is completed, tick the checkbox if you want to learn more
about Anaconda and Anaconda cloud. Click on Finish to end the process.
Note: Here, we will use the Spyder IDE to run Python programs.
o Write your first program, and save it using the .py extension.
o Run the program using the triangle Run button.
o You can check the program's output on console pane at the bottom right
side.
How to get datasets for Machine Learning
The key to success in the field of machine learning or to become a great data
scientist is to practice with different types of datasets. But discovering a suitable
dataset for each kind of machine learning project is a difficult task. So, in this topic,
we will provide the detail of the sources from where you can easily get the dataset
according to your project.
Before knowing the sources of the machine learning dataset, let's discuss datasets.
What is a dataset?
A dataset is a collection of data in which data is arranged in some order. A
dataset can contain any data from a series of an array to a database table. Below
table shows an example of the dataset:
Country   Age   Salary   Purchased
India     38    48000    No
Germany   30    54000    No
France    48    65000    No
Germany   40             Yes
o Ordinal data: These data are similar to categorical data but can be
measured on the basis of comparison.
Note: A real-world dataset is of huge size, which is difficult to manage and process at
the initial level. Therefore, to practice machine learning algorithms, we can use any
dummy dataset.
Need of Dataset
To work with machine learning projects, we need a huge amount of data, because,
without the data, one cannot train ML/AI models. Collecting and preparing the
dataset is one of the most crucial parts while creating an ML/AI project.
The technology applied behind any ML projects cannot work properly if the dataset
is not well prepared and pre-processed.
During the development of the ML project, the developers completely rely on the
datasets. In building ML applications, datasets are divided into two parts:
o Training dataset:
o Test Dataset
Note: The datasets are of large size, so to download these datasets, you must have
fast internet on your computer.
1. Kaggle Datasets
Kaggle is one of the best sources for providing datasets for Data Scientists and
Machine Learners. It allows users to find, download, and publish datasets in an
easy way. It also provides the opportunity to work with other machine learning
engineers and solve difficult Data Science related tasks.
Kaggle provides a high-quality dataset in different formats that we can easily find
and download.
UCI Machine learning repository is one of the great sources of machine learning
datasets. This repository contains databases, domain theories, and data
generators that are widely used by the machine learning community for the
analysis of ML algorithms.
Since the year 1987, it has been widely used by students, professors, researchers
as a primary source of machine learning dataset.
It classifies the datasets as per the problems and tasks of machine learning such
as Regression, Classification, Clustering, etc. It also contains some of the
popular datasets such as the Iris dataset, Car Evaluation dataset, Poker
Hand dataset, etc.
The link for the UCI machine learning repository
is [Link]
We can search, download, access, and share the datasets that are publicly
available via AWS resources. These datasets can be accessed through AWS
resources but are provided and maintained by different government
organizations, researchers, businesses, or individuals.
Anyone can analyze and build various services using shared data via AWS
resources. A shared dataset in the cloud helps users to spend more time on data
analysis rather than on data acquisition.
This source provides the various types of datasets with examples and ways to use
the dataset. It also provides the search box using which we can search for the
required dataset. Anyone can add any dataset or example to the Registry of
Open Data on AWS.
The link for the Google dataset search engine
is [Link]
5. Microsoft Datasets
Microsoft has launched the "Microsoft Research Open Data" repository with
a collection of free datasets in various areas such as natural language
processing, computer vision, and domain-specific sciences.
Using this resource, we can download the datasets to use on the current device, or
we can use them directly on the cloud infrastructure.
The link to download or use the dataset from this resource
is [Link]
The link to download the dataset from Awesome public dataset collection
is [Link]
7. Government Datasets
There are different sources to get government-related data. Various countries
publish government data for public use collected by them from different
departments.
o US Government Dataset
Visual Data provides a large number of datasets that are specific to
computer vision tasks such as image classification, video classification, and
image segmentation. Therefore, if you want to build a project on deep learning or
image processing, then you can refer to this source.
The link for downloading the dataset from this source is [Link]
9. Scikit-learn dataset
Scikit-learn is a great source for machine learning enthusiasts. It
provides both toy and real-world datasets. These datasets can be obtained from
the sklearn.datasets package using the general dataset API.
The toy datasets available in scikit-learn can be loaded using predefined
functions such as load_iris([return_X_y]) and load_boston([return_X_y]),
rather than importing any file from external sources (note that load_boston
was removed in scikit-learn 1.2). But these datasets are not
suitable for real-world projects.
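Loading a toy dataset can be sketched as below, assuming scikit-learn is installed. The example uses load_iris, one of the loader functions mentioned above.

```python
from sklearn.datasets import load_iris

# Load features and labels directly as arrays
X, y = load_iris(return_X_y=True)

# Without return_X_y, the loader returns an object that also carries metadata
iris = load_iris()

print(X.shape, y.shape)        # number of samples and features
print(iris.feature_names)      # names of the feature columns
print(iris.target_names)       # names of the classes
```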
Data Preprocessing in Machine learning
Data preprocessing is a process of preparing the raw data and making it suitable
for a machine learning model. It is the first and crucial step while creating a
machine learning model.
When creating a machine learning project, we do not always come across clean
and formatted data. Before doing any operation with data, it is necessary to
clean it and put it in a formatted way. For this, we use data preprocessing.
Datasets may be of different formats for different purposes. For example, a
dataset for a machine learning model with a business purpose will be
different from the dataset required for a liver patient. So each dataset is
different from another dataset. To use the dataset in our code, we usually put
it into a CSV file. However, sometimes we may also need to use an HTML or xlsx
file.
Here we will use a demo dataset for data preprocessing, and for practice, it can be
downloaded from here, "[Link]
learning. For real-world problems, we can download datasets online from various
sources such
as [Link] [Link]
etc.
We can also create our dataset by gathering data using various API with Python
and put that data into a .csv file.
2) Importing Libraries
In order to perform data preprocessing using Python, we need to import some
predefined Python libraries. These libraries are used to perform some specific jobs.
There are three specific libraries that we will use for data preprocessing, which are:
Numpy: The Numpy Python library is used for including any type of mathematical
operation in the code. It is the fundamental package for scientific calculation in
Python. It also supports large, multidimensional arrays and matrices. In
Python, we can import it as:
import numpy as nm
Here we have used nm, which is a short name for Numpy, and it will be used in the
whole program.
Pandas: The last library is the Pandas library, which is one of the most famous
Python libraries and is used for importing and managing datasets. It is an
open-source data manipulation and analysis library. It will be imported as below:
import pandas as pd
Here, we have used pd as a short name for this library. Consider the below image:
Note: We can set any directory as a working directory, but it must contain the required dataset.
Here, in the below image, we can see the Python file along with required dataset.
Now, the current folder is set as a working directory.
read_csv() function:
Now to import the dataset, we will use the read_csv() function of the pandas
library, which reads a CSV file into a DataFrame on which we can perform
various operations. Using this function, we can read a CSV file locally as well
as through a URL.
data_set = pd.read_csv('[Link]')
Here, data_set is the name of the variable that stores our dataset, and inside
the function, we have passed the name of our dataset. Once we execute the above
line of code, it will import the dataset in our code. We can also check
the imported dataset by clicking on the variable explorer section and then
double-clicking on data_set. Consider the below image:
As in the above image, indexing is started from 0, which is the default indexing in
Python. We can also change the format of our dataset by clicking on the format
option.
In our dataset, there are three independent variables, Country, Age, and Salary,
and one dependent variable, Purchased.
x = data_set.iloc[:, :-1].values
In the above code, the first colon(:) is used to take all the rows, and the second
colon(:) is for all the columns. Here we have used :-1, because we don't want to
take the last column as it contains the dependent variable. So by doing this, we
will get the matrix of features.
As we can see in the above output, there are only three variables.
y = data_set.iloc[:, 3].values
Here we have taken all the rows with the last column only. It will give the array of
dependent variables.
Output:
array(['No', 'Yes', 'No', 'No', 'Yes', 'Yes', 'No', 'Yes', 'No', 'Yes'],
      dtype=object)
Note: If you are using Python language for machine learning, then extraction is mandatory, but for
R language it is not required.
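The extraction of independent and dependent variables can be tried on a small in-memory table, as sketched below. The three rows are the first rows of the tutorial's dataset as they appear in the later outputs; building the DataFrame by hand instead of reading a CSV file is an assumption made so that the example is self-contained.

```python
import pandas as pd

# A small stand-in for the tutorial's dataset
data_set = pd.DataFrame({
    "Country": ["India", "France", "Germany"],
    "Age": [38, 43, 30],
    "Salary": [68000, 45000, 54000],
    "Purchased": ["No", "Yes", "No"],
})

# All rows, every column except the last: the matrix of features
x = data_set.iloc[:, :-1].values

# All rows, last column only: the dependent variable
y = data_set.iloc[:, -1].values

print(x.shape)
print(y)
```

Here -1 selects the last column by position, which is equivalent to index 3 for this four-column dataset.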
There are mainly two ways to handle missing data, which are:
By deleting the particular row: This is a common way to deal with null
values. We simply delete the specific row or column which contains null
values. But this approach is not very efficient, and removing data may lead to
a loss of information, which can make the output less accurate.
By calculating the mean: In this way, we calculate the mean of the column
or row which contains a missing value and put it in the place of the missing
value. This strategy is useful for features which have numeric data such as
age, salary, year, etc. Here, we will use this approach.
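The difference between the two approaches can be sketched with pandas on a tiny table. The values below are illustrative, and using pandas here (rather than scikit-learn) is a choice made only to keep the comparison short.

```python
import numpy as np
import pandas as pd

# Hypothetical numeric data with one missing value in each column
df = pd.DataFrame({
    "Age": [38.0, 43.0, np.nan, 48.0],
    "Salary": [68000.0, np.nan, 54000.0, 65000.0],
})

# Approach 1: delete every row that contains a missing value
dropped = df.dropna()

# Approach 2: replace each missing value with the mean of its column
filled = df.fillna(df.mean())

print(len(df), len(dropped))   # rows before and after deletion
print(filled)
```

Deleting rows halves this small dataset, while mean replacement keeps every row.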
To handle missing values, we will use the Scikit-learn library in our code, which
contains various modules for building machine learning models. Here we will
use the SimpleImputer class of the sklearn.impute module (older scikit-learn
versions provided an Imputer class in sklearn.preprocessing, which has since
been removed). Below is the code for it:
# Handling missing data (replacing missing data with the mean value)
import numpy as nm
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(missing_values=nm.nan, strategy='mean')
# Fitting the imputer object to the numeric columns (Age, Salary) of x
imputer = imputer.fit(x[:, 1:3])
# Replacing missing data with the calculated mean value
x[:, 1:3] = imputer.transform(x[:, 1:3])
Output:
array([['India', 38.0, 68000.0],
       ['France', 43.0, 45000.0],
       ['Germany', 30.0, 54000.0],
       ['France', 48.0, 65000.0],
       ['Germany', 40.0, 65222.22222222222],
       ['India', 35.0, 58000.0],
       ['Germany', 41.111111111111114, 53000.0],
       ['France', 49.0, 79000.0],
       ['India', 50.0, 88000.0],
       ['France', 37.0, 77000.0]], dtype=object)
As we can see in the above output, the missing values have been replaced with the
means of the remaining values in each column.
Since a machine learning model works entirely on mathematics and numbers, a
categorical variable in our dataset may create trouble while building the
model. So it is necessary to encode these categorical variables into
numbers.
Firstly, we will convert the Country variable into numeric labels. To do this, we
will use the LabelEncoder() class from the preprocessing library.
# Categorical data
# For the Country variable
from sklearn.preprocessing import LabelEncoder
label_encoder_x = LabelEncoder()
x[:, 0] = label_encoder_x.fit_transform(x[:, 0])
Output:
Out[15]:
array([[2, 38.0, 68000.0],
[0, 43.0, 45000.0],
[1, 30.0, 54000.0],
[0, 48.0, 65000.0],
[1, 40.0, 65222.22222222222],
[2, 35.0, 58000.0],
[1, 41.111111111111114, 53000.0],
[0, 49.0, 79000.0],
[2, 50.0, 88000.0],
[0, 37.0, 77000.0]], dtype=object)
Explanation:
In the above code, we have imported the LabelEncoder class of the sklearn
library. This class has successfully encoded the variables into digits.
But in our case, there are three country values, and as we can see in the above
output, they are encoded into 0, 1, and 2. From these values, the machine
learning model may assume that there is some order or numeric relationship
between the countries, which will produce the wrong output. So to remove this
issue, we will use dummy encoding.
Dummy Variables:
Dummy variables are those variables which have values 0 or 1. A value of 1
indicates the presence of that category in a particular row, and the other
dummy variables are 0. With dummy encoding, we will have a number of columns
equal to the number of categories.
In our dataset, we have 3 categories, so it will produce three columns having 0
and 1 values. For dummy encoding, we will use the OneHotEncoder class
of the preprocessing library.
Output:
8.80000000e+04],
[1.00000000e+00, 0.00000000e+00, 0.00000000e+00, 3.70000000e+01,
7.70000000e+04]])
As we can see in the above output, all the variables are encoded into numbers 0
and 1 and divided into three columns.
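The dummy encoding described above can be sketched as follows. This is one possible approach for recent scikit-learn versions, not necessarily the code the tutorial originally showed: it assumes scikit-learn is installed and uses ColumnTransformer to apply OneHotEncoder to the first column only. The three sample rows come from the tutorial's dataset.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# First three rows of the feature matrix: Country, Age, Salary
x = np.array([
    ["India", 38.0, 68000.0],
    ["France", 43.0, 45000.0],
    ["Germany", 30.0, 54000.0],
], dtype=object)

# One-hot encode column 0 and pass the remaining columns through unchanged
encoder = ColumnTransformer(
    transformers=[("country", OneHotEncoder(), [0])],
    remainder="passthrough",
    sparse_threshold=0,  # always return a dense array
)
x_encoded = encoder.fit_transform(x).astype(float)

print(x_encoded)
```

The categories are ordered alphabetically (France, Germany, India), so the first three columns of each row contain a single 1 marking the country, followed by the age and salary.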
labelencoder_y = LabelEncoder()
y = labelencoder_y.fit_transform(y)
For the second categorical variable, we will only use a labelencoder object
of the LabelEncoder class. Here we are not using the OneHotEncoder class because
the Purchased variable has only two categories, yes and no, which are
automatically encoded into 0 and 1.
Output:
6) Splitting the Dataset into the Training set and Test set
In machine learning data preprocessing, we divide our dataset into a training set
and a test set. This is one of the crucial steps of data preprocessing, as it
lets us estimate how well our machine learning model performs on data it has
not seen.
Suppose we train our machine learning model on one dataset and test it on a
completely different dataset. Then it will be difficult for the model to apply
the correlations it has learned.
If we train our model very well and its training accuracy is very high, but its
performance decreases when we provide a new dataset to it, the model is of
little use. So we always try to make a machine learning model which performs
well with the training set and also with the test dataset. Here, we can define
these datasets as:
Training Set: A subset of dataset to train the machine learning model, and we
already know the output.
Test set: A subset of dataset to test the machine learning model, and by using the
test set, model predicts the output.
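For splitting the dataset, we will use the below lines of code:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)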
Explanation:
o In the above code, the first line imports the train_test_split function,
which splits arrays of the dataset into random train and test subsets.
o In the second line, we have used four variables for our output:
o x_train: features for the training data
o x_test: features for the testing data
o y_train: dependent variable for the training data
o y_test: dependent variable for the testing data
o In the train_test_split() function, we have passed four parameters. The
first two are the arrays of data, and test_size specifies the size of the
test set. The test_size may be .5, .3, or .2, which gives the ratio in which
the data is divided into training and testing sets.
o The last parameter, random_state, sets a seed for the random
generator so that you always get the same result. A commonly used value
is 42.
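The effect of test_size and random_state can be checked with a short sketch, assuming scikit-learn and NumPy are installed. The arrays below are placeholders for illustration, not the tutorial's dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data: 10 samples with 2 features each, and 10 labels
x = np.arange(20).reshape(10, 2)
y = np.array([0, 1, 0, 0, 1, 1, 0, 1, 0, 1])

# test_size=0.2 keeps 20% of the samples for testing
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, random_state=0
)

# The same random_state reproduces exactly the same split
x_train_again, x_test_again, _, _ = train_test_split(
    x, y, test_size=0.2, random_state=0
)

print(x_train.shape, x_test.shape)
print(np.array_equal(x_test, x_test_again))
```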
Output:
By executing the above code, we will get 4 different variables, which can be seen
under the variable explorer section.
As we can see in the above image, the x and y variables are divided into 4
different variables with corresponding values.
7) Feature Scaling
Feature scaling is the final step of data preprocessing in machine learning. It is a
technique to standardize the independent variables of the dataset in a specific
range. In feature scaling, we put our variables in the same range and on the same
scale so that no variable dominates the others.
As we can see, the age and salary column values are not on the same scale. Many
machine learning models are based on Euclidean distance, and if we do not scale
the variables, it will cause issues in such models.
If we compute the distance between any two points using age and salary, the
salary values will dominate the age values, and it will produce an incorrect
result. So to remove this issue, we need to perform feature scaling.
Standardization
Normalization
Here, we will use the standardization method for our dataset.
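As a brief sketch of the two methods, standardization rescales a feature as (x - mean) / standard deviation, while normalization (min-max scaling) rescales it as (x - min) / (max - min). The NumPy example below applies both to a few age values taken from the tutorial's dataset; the variable names are assumptions.

```python
import numpy as np

ages = np.array([38.0, 43.0, 30.0, 48.0, 40.0])

# Standardization: zero mean and unit standard deviation
standardized = (ages - ages.mean()) / ages.std()

# Normalization (min-max scaling): values mapped into the range [0, 1]
normalized = (ages - ages.min()) / (ages.max() - ages.min())

print(standardized)
print(normalized)
```

Standardized values are centered on 0 and are not limited to a fixed interval, whereas normalized values always fall between 0 and 1.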
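Now, we will import the StandardScaler class and create an object of it for the
independent variables or features. Then we will fit and transform the training
dataset.
from sklearn.preprocessing import StandardScaler
st_x = StandardScaler()
x_train = st_x.fit_transform(x_train)
For the test dataset, we directly apply transform() instead of fit_transform(),
because the scaler has already been fitted on the training set.
x_test = st_x.transform(x_test)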
Output:
By executing the above lines of code, we will get the scaled values for x_train and
x_test as:
x_train:
x_test:
As we can see in the above output, all the variables are now on a comparable
scale: each feature is centered at 0 with unit variance on the training set, so
most values lie within a few units of 0.
Note: Here, we have not scaled the dependent variable because it has only two values, 0 and
1. But if the dependent variable had a wider range of values, we would also need to scale
it.
Now, in the end, we can combine all the steps together to make our complete code
more understandable.
# importing libraries
import numpy as nm
import matplotlib.pyplot as mtp
import pandas as pd

# importing the dataset ('Dataset.csv' is a placeholder; the original file name was lost)
data_set = pd.read_csv('Dataset.csv')

# Extracting the independent variables
x = data_set.iloc[:, :-1].values

# Extracting the dependent variable
y = data_set.iloc[:, 3].values

# Handling missing data (replacing missing values with the column mean).
# SimpleImputer replaces the Imputer class, which was removed in scikit-learn 0.22.
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(missing_values=nm.nan, strategy='mean')

# Fitting the imputer object to the independent variables x
imputer = imputer.fit(x[:, 1:3])

# Replacing missing data with the calculated mean value
x[:, 1:3] = imputer.transform(x[:, 1:3])

# Label encoding for the Country variable
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
label_encoder_x = LabelEncoder()
x[:, 0] = label_encoder_x.fit_transform(x[:, 0])

# Encoding for dummy variables.
# ColumnTransformer replaces the categorical_features argument, removed in scikit-learn 0.22.
from sklearn.compose import ColumnTransformer
onehot_encoder = ColumnTransformer([('country', OneHotEncoder(), [0])],
                                   remainder='passthrough', sparse_threshold=0)
x = onehot_encoder.fit_transform(x).astype(float)

# Encoding for the purchased variable
labelencoder_y = LabelEncoder()
y = labelencoder_y.fit_transform(y)

# Splitting the dataset into training and test set
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)

# Feature scaling of the datasets
from sklearn.preprocessing import StandardScaler
st_x = StandardScaler()
x_train = st_x.fit_transform(x_train)
x_test = st_x.transform(x_test)
In the above code, we have included all the data preprocessing steps together. But
there are some steps or lines of code which are not necessary for all machine
learning models. So we can exclude them from our code to make it reusable for all
models.
In supervised learning, the training data provided to the machine works as the
supervisor that teaches the machine to predict the output correctly. It applies the
same concept as a student learning under the supervision of a teacher.
In the real world, supervised learning can be used for risk assessment, image
classification, fraud detection, spam filtering, etc.
Suppose we have a dataset of different types of shapes, which includes squares,
rectangles, triangles, and other polygons. The first step is to train the model on
each shape.
o If the given shape has four sides, and all the sides are equal, then it will be
labelled as a square.
o If the given shape has three sides, then it will be labelled as a triangle.
o If the given shape has six equal sides, then it will be labelled as a hexagon.
Now, after training, we test our model using the test set, and the task of the model
is to identify the shape.
The machine is already trained on all types of shapes, so when it finds a new
shape, it classifies the shape on the basis of the number of sides and predicts the
output.
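The shape example can be sketched in a few lines, assuming we describe each shape by two made-up features (the number of sides, and a flag for whether all sides are equal) and use scikit-learn's DecisionTreeClassifier as one possible supervised algorithm. The feature encoding is an assumption for illustration only.

```python
from sklearn.tree import DecisionTreeClassifier

# Hypothetical features: [number of sides, 1 if all sides are equal else 0].
features = [[4, 1], [4, 0], [3, 0], [3, 1], [6, 1]]
labels = ["square", "rectangle", "triangle", "triangle", "hexagon"]

# Training: the labels act as the supervisor.
model = DecisionTreeClassifier(random_state=0)
model.fit(features, labels)

# Testing: classify shapes described by the same two features.
predictions = model.predict([[4, 1], [3, 0], [6, 1]])
print(predictions.tolist())  # ['square', 'triangle', 'hexagon']
```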
o Determine the suitable algorithm for the model, such as support vector
machine, decision tree, etc.
o Execute the algorithm on the training dataset. Sometimes we also need a
validation set, held out from the training data, to tune the control
parameters of the model.
o Evaluate the accuracy of the model on the test set. If the model predicts the
correct outputs, then our model is accurate.
1. Regression
Regression algorithms are used if there is a relationship between the input variable
and the output variable. It is used for the prediction of continuous variables, such
as Weather forecasting, Market Trends, etc. Below are some popular Regression
algorithms which come under supervised learning:
o Linear Regression
o Regression Trees
o Non-Linear Regression
o Bayesian Linear Regression
o Polynomial Regression
2. Classification
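Classification algorithms are used when the output variable is categorical, which
means it takes one of a fixed set of classes such as Yes-No, Male-Female,
True-False, etc. Spam filtering is a typical example. Below are some popular
classification algorithms which come under supervised learning: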
o Random Forest
o Decision Trees
o Logistic Regression
o Support vector Machines
Unsupervised Machine Learning
In the previous topic, we learned supervised machine learning in which models are
trained using labeled data under the supervision of training data. But there may be
many cases in which we do not have labeled data and need to find the hidden
patterns from the given dataset. So, to solve such types of cases in machine
learning, we need unsupervised learning techniques.
Unsupervised learning is a type of machine learning in which models are trained using an
unlabeled dataset and are allowed to act on that data without any supervision.
Why use Unsupervised Learning?
Below are some main reasons which describe the importance of Unsupervised
Learning:
o Unsupervised learning is helpful for finding useful insights from the data.
o Unsupervised learning is similar to how a human learns to think from their
own experiences, which makes it closer to real AI.
o Unsupervised learning works on unlabeled and uncategorized data, which
makes it especially important because such data is common.
o In the real world, we do not always have input data with the corresponding
output, so to solve such cases we need unsupervised learning.
Here, we have taken unlabeled input data, which means it is not categorized and
the corresponding outputs are not given. This unlabeled input data is fed to the
machine learning model in order to train it. First, the model interprets the raw
data to find the hidden patterns in it, and then applies a suitable algorithm such
as k-means clustering or hierarchical clustering.
Once the suitable algorithm is applied, it divides the data objects into groups
according to the similarities and differences between the objects.
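A minimal sketch of this grouping step, assuming scikit-learn's KMeans and a handful of made-up two-dimensional points, might look as follows. No labels are passed to the model; it only sees the inputs.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical unlabeled data: two visibly separate groups of points.
points = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
                   [8.0, 8.0], [8.2, 7.9], [7.8, 8.1]])

# Ask k-means for two clusters; no output labels are supplied.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
cluster_ids = kmeans.fit_predict(points)

print(cluster_ids)  # the first three points share one id, the last three the other
```

The cluster ids themselves (0 or 1) are arbitrary; only the grouping carries meaning.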
o Clustering: Clustering is a method of grouping objects into clusters such
that objects with the most similarities remain in one group and have few or
no similarities with the objects of another group. Cluster analysis finds the
commonalities between the data objects and categorizes them according to
the presence or absence of those commonalities.
o Association: Association rule learning is an unsupervised learning method
used for finding relationships between variables in a large database. It
determines the sets of items that occur together in the dataset. Association
rules make marketing strategy more effective; for example, people who buy
item X (say, bread) also tend to purchase item Y (butter or jam). A typical
example of association rule learning is Market Basket Analysis.
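To give a feel for how an association such as "bread implies butter" is measured, the sketch below computes two standard quantities, support and confidence, in plain Python. The baskets are hypothetical, and the helper names support and confidence are introduced here only for illustration.

```python
# Hypothetical shopping baskets, for illustration only.
baskets = [
    {"bread", "butter"},
    {"bread", "jam"},
    {"bread", "butter", "milk"},
    {"milk"},
]

def support(itemset):
    """Fraction of baskets that contain every item in itemset."""
    hits = sum(1 for basket in baskets if itemset <= basket)
    return hits / len(baskets)

def confidence(antecedent, consequent):
    """Of the baskets containing antecedent, the fraction that also contain consequent."""
    return support(antecedent | consequent) / support(antecedent)

print(support({"bread", "butter"}))       # 0.5
print(confidence({"bread"}, {"butter"}))  # 2 of the 3 bread baskets, about 0.667
```

Algorithms such as Apriori search for item sets whose support and confidence exceed chosen thresholds.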
o K-means clustering
o KNN (k-nearest neighbors), although it is more commonly treated as a
supervised method
o Hierarchical clustering
o Anomaly detection
o Neural Networks
o Principal Component Analysis
o Independent Component Analysis
o Apriori algorithm
o Singular value decomposition
Difference between Supervised and Unsupervised Learning
Supervised and unsupervised learning are two of the main techniques of machine
learning, but they are used in different scenarios and with different datasets.
Both learning methods are explained below, followed by a table of their
differences.
Example: Suppose we have images of different types of fruits. The task of our
supervised learning model is to identify the fruits and classify them accordingly.
To do so, we give the model the input data as well as the output for it, which
means we train the model on the shape, size, color, and taste of each fruit along
with its label. Once the training is completed, we test the model by giving it a
new set of fruits. The model identifies each fruit and predicts the output using a
suitable algorithm.
Example: To understand unsupervised learning, we will use the same example.
Unlike supervised learning, here we do not provide any supervision to the model.
We just provide the input dataset and allow the model to find the patterns in the
data. With the help of a suitable algorithm, the model trains itself and divides
the fruits into different groups according to the most similar features between
them.
The main differences between Supervised and Unsupervised learning are given
below:
Supervised Learning | Unsupervised Learning
Supervised learning model takes direct feedback to check if it is predicting the correct output or not. | Unsupervised learning model does not take any feedback.
Supervised learning model predicts the output. | Unsupervised learning model finds the hidden patterns in data.
Input data is provided to the model along with the output. | Only input data is provided to the model.
Supervised learning can be used for those cases where we know the input as well as the corresponding outputs. | Unsupervised learning can be used for those cases where we have only input data and no corresponding output data.
Note: Supervised and unsupervised learning are both machine learning methods, and the choice
between them depends on the structure and volume of your dataset and on the use case of the
problem.
Algorithms
Regression Analysis in Machine learning
Regression analysis is a statistical method to model the relationship between a
dependent (target) variable and one or more independent (predictor) variables.
More specifically, regression analysis helps us to understand how the value of the
dependent variable changes with an independent variable when the other
independent variables are held fixed. It predicts continuous/real values such as
temperature, age, salary, price, etc.
We can understand the concept of regression analysis using the below example:
Now, the company wants to spend $200 on advertisement in the year 2019 and
wants to predict the sales for that year. To solve this type of prediction problem
in machine learning, we need regression analysis.
forecasting, time series modeling, and determining the causal-effect
relationship between variables.
In regression, we fit a line or curve to the given datapoints on the
target-predictor graph; using this fit, the machine learning model can make
predictions about the data. In simple words, "Regression shows a line or curve
that passes as close as possible to the datapoints on the target-predictor
graph, in such a way that the vertical distance between the datapoints and the
regression line is minimum." The distance between the datapoints and the line
tells whether a model has captured a strong relationship or not.
Why do we use Regression Analysis?
As mentioned above, regression analysis helps in the prediction of a continuous
variable. There are various scenarios in the real world where we need future
predictions, such as weather conditions, sales, marketing trends, etc. For such
cases we need a technique that can make accurate predictions, and regression
analysis is a statistical method widely used for this in machine learning and data
science. Below are some other reasons for using regression analysis:
Types of Regression
There are various types of regressions which are used in data science and machine
learning. Each type has its own importance on different scenarios, but at the core,
all the regression methods analyze the effect of the independent variable on
dependent variables. Here we are discussing some important types of regression
which are given below:
o Linear Regression
o Logistic Regression
o Polynomial Regression
o Support Vector Regression
o Decision Tree Regression
o Random Forest Regression
o Ridge Regression
o Lasso Regression
Linear Regression:
o The relationship between variables in the linear regression model can be
explained using the below image. Here we are predicting the salary of an
employee on the basis of years of experience.
Y = aX + b
where a is the slope of the line and b is its intercept.
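As a minimal sketch of fitting such a line, the example below uses NumPy's polyfit on hypothetical experience and salary values (salary in thousands). It is one convenient way to obtain a and b by least squares, not necessarily the method used elsewhere in this tutorial.

```python
import numpy as np

# Hypothetical data: years of experience and salary (in thousands).
experience = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
salary = np.array([35.0, 40.0, 45.0, 50.0, 55.0])

# Least-squares fit of a degree-1 polynomial returns the slope a and intercept b.
a, b = np.polyfit(experience, salary, 1)

# Predict the salary for 6 years of experience using Y = aX + b.
predicted = a * 6.0 + b
print(a, b, predicted)  # approximately 5, 30 and 60
```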
Logistic Regression:
When we provide the input values (data) to the function, it gives the S-curve as
follows:
o It uses the concept of threshold levels: values above the threshold level are
rounded up to 1, and values below the threshold level are rounded down to 0.
o Binary(0/1, pass/fail)
o Multi(cats, dogs, lions)
o Ordinal(low, medium, high)
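The S-curve and the threshold rule described above can be sketched with the standard sigmoid function. The helper names sigmoid and classify and the default threshold of 0.5 are assumptions made for this illustration.

```python
import math

def sigmoid(z):
    """Logistic (sigmoid) function: maps any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def classify(z, threshold=0.5):
    """Return 1 when the predicted probability reaches the threshold, else 0."""
    return 1 if sigmoid(z) >= threshold else 0

print(sigmoid(0.0))    # 0.5
print(classify(2.0))   # 1
print(classify(-2.0))  # 0
```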
Polynomial Regression:
linear model. Which means the datapoints are best fitted using a
polynomial line.
o The equation for polynomial regression is also derived from the linear
regression equation: the linear regression equation Y = b0 + b1x is transformed
into the polynomial regression equation Y = b0 + b1x + b2x^2 + b3x^3 + ... + bnx^n.
o Here Y is the predicted/target output, b0, b1, ..., bn are the regression
coefficients, and x is our independent/input variable.
o The model is still linear as the coefficients are still linear with quadratic
Note: This is different from Multiple Linear Regression in that, in polynomial regression, a single
variable appears with different degrees instead of multiple variables each appearing with degree one.
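To illustrate why polynomial regression is still a linear model in its coefficients, the sketch below builds the features 1, x and x^2 by hand and solves an ordinary least-squares problem with NumPy. The data is hypothetical and generated without noise, so the coefficients are recovered almost exactly.

```python
import numpy as np

# Hypothetical data generated from y = 1 + 2x + 3x^2 (no noise).
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 1.0 + 2.0 * x + 3.0 * x**2

# Build polynomial features [1, x, x^2]; the model stays linear in b0, b1, b2.
design = np.column_stack([np.ones_like(x), x, x**2])

# Ordinary least squares on the expanded features.
coefficients, *_ = np.linalg.lstsq(design, y, rcond=None)
print(coefficients)  # approximately [1, 2, 3]
```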
o Hyperplane: In general SVM, it is a separation line between two classes, but
in SVR, it is the line which helps to predict the continuous variable and
covers most of the datapoints.
o Boundary line: Boundary lines are the two lines on either side of the
hyperplane, which create a margin for the datapoints.
o Support vectors: Support vectors are the datapoints which are nearest to
the hyperplane and to the opposite class.
Here, the blue line is called hyperplane, and the other two lines are known as
boundary lines.
o Decision Tree regression builds a tree-like structure in which each internal
node represents a "test" on an attribute, each branch represents the result
of the test, and each leaf node represents the final decision or result.
o A decision tree is constructed starting from the root node/parent node
(dataset), which splits into left and right child nodes (subsets of the
dataset). These child nodes are further divided into their own child nodes,
and themselves become the parent nodes of those nodes. Consider the below
image:
The above image shows an example of Decision Tree regression; here, the model is
trying to predict the choice of a person between a sports car and a luxury car.
g(x)= f0(x)+ f1(x)+ f2(x)+....
Ridge Regression:
o The equation for ridge regression will be:
Lasso Regression:
Linear Regression in Machine Learning
Linear regression is one of the easiest and most popular Machine Learning
algorithms. It is a statistical method that is used for predictive analysis. Linear
regression makes predictions for continuous/real or numeric variables such
as sales, salary, age, product price, etc.
The linear regression model provides a sloped straight line representing the
relationship between the variables. Consider the below image:
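Mathematically, we can represent a linear regression as:
y = a0 + a1x + ε
Here,
y = dependent variable (target variable)
x = independent variable (predictor variable)
a0 = intercept of the line
a1 = linear regression coefficient (slope of the line)
ε = random error
The values of the x and y variables form the training dataset for the Linear
Regression model.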
dependent variable, then such a Linear Regression algorithm is called Simple
Linear Regression.
o Multiple Linear regression:
If more than one independent variable is used to predict the value of a
numerical dependent variable, then such a Linear Regression algorithm is
called Multiple Linear Regression.
Finding the best fit line:
When working with linear regression, our main goal is to find the best fit line,
which means that the error between the predicted values and the actual values
should be minimized. The best fit line will have the least error.
Different values for the weights or coefficients of the line (a0, a1) give different
regression lines, so we need to calculate the best values for a0 and a1 to find the
best fit line. To calculate them, we use a cost function.
Cost function-
o The cost function is used to estimate the values of the coefficients (a0, a1)
for the best fit line.
o The cost function is what we minimize in order to optimize the regression
coefficients or weights. It measures how well a linear regression model is
performing.
o We can use the cost function to find the accuracy of the mapping function,
which maps the input variable to the output variable. This mapping function
is also known as the hypothesis function.
For Linear Regression, we use the Mean Squared Error (MSE) cost function,
which is the average of squared error occurred between the predicted values and
actual values. It can be written as:
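For the above linear equation, MSE can be calculated as:
MSE = \frac{1}{N} \sum_{i=1}^{N} \left( y_i - (a_1 x_i + a_0) \right)^2
Where,
N = total number of observations
y_i = actual value
(a_1 x_i + a_0) = predicted value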
Residuals: The distance between the actual value and the predicted value is called
the residual. If the observed points are far from the regression line, then the
residuals will be high, and so the cost function will be high. If the scatter points
are close to the regression line, then the residuals will be small, and hence so will
the cost function.
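The link between residuals and the cost can be checked numerically. The sketch below computes the mean of the squared residuals for a line y = a0 + a1x on three hypothetical observations; the function name mean_squared_error is our own helper, not a library call.

```python
import numpy as np

def mean_squared_error(x, y, a0, a1):
    """Average squared residual for the line y = a0 + a1 * x."""
    predicted = a0 + a1 * x
    residuals = y - predicted
    return np.mean(residuals ** 2)

# Hypothetical observations.
x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 4.0, 7.0])

good_fit = mean_squared_error(x, y, a0=0.0, a1=2.0)  # residuals 0, 0, 1
poor_fit = mean_squared_error(x, y, a0=0.0, a1=1.0)  # residuals 1, 2, 4

print(good_fit)  # about 0.333
print(poor_fit)  # 7.0, a worse line gives a larger cost
```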
Gradient Descent:
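Gradient descent minimizes the MSE by repeatedly moving a0 and a1 a small step against the gradient of the cost function. The following is a minimal sketch under stated assumptions: the data is hypothetical and lies exactly on y = 1 + 2x, and the learning rate and iteration count are values chosen to converge on this particular data, not general recommendations.

```python
import numpy as np

# Hypothetical data lying exactly on y = 1 + 2x.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = 1.0 + 2.0 * x

a0, a1 = 0.0, 0.0      # initial guesses for intercept and slope
learning_rate = 0.05   # assumed step size, small enough to converge on this data

for _ in range(5000):
    error = (a0 + a1 * x) - y
    # Partial derivatives of the MSE with respect to a0 and a1.
    grad_a0 = 2.0 * np.mean(error)
    grad_a1 = 2.0 * np.mean(error * x)
    a0 -= learning_rate * grad_a0
    a1 -= learning_rate * grad_a1

print(round(a0, 3), round(a1, 3))  # close to 1.0 and 2.0
```

A learning rate that is too large makes the updates diverge, while one that is too small makes convergence slow.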
Model Performance:
The Goodness of fit determines how the line of regression fits the set of
observations. The process of finding the best model out of various models is
called optimization. It can be achieved by below method:
1. R-squared method:
o A high value of R-squared indicates a smaller difference between the
predicted values and the actual values, and hence represents a good model.
o It is also called a coefficient of determination, or coefficient of multiple
determination for multiple regression.
o It can be calculated from the below formula:
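One standard way to write the coefficient of determination is R-squared = 1 - (residual sum of squares / total sum of squares). The sketch below implements that definition with NumPy on hypothetical values; r_squared is a helper name introduced here for illustration.

```python
import numpy as np

def r_squared(actual, predicted):
    """Coefficient of determination: 1 - (residual sum of squares / total sum of squares)."""
    ss_res = np.sum((actual - predicted) ** 2)
    ss_tot = np.sum((actual - np.mean(actual)) ** 2)
    return 1.0 - ss_res / ss_tot

# Hypothetical actual and predicted values.
actual = np.array([3.0, 5.0, 7.0, 9.0])
predicted = np.array([2.8, 5.1, 7.2, 8.9])

score = r_squared(actual, predicted)
print(score)  # close to 1, since the predictions track the actual values
```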
It can be checked using the q-q plot. If the plot shows a straight line without
any deviation, it means the error is normally distributed.
o No autocorrelations:
The linear regression model assumes no autocorrelation in error terms. If
there is any correlation in the error terms, it will drastically reduce the
accuracy of the model. Autocorrelation usually occurs if there is a
dependency between residual errors.
The key point in Simple Linear Regression is that the dependent variable must
be a continuous/real value. However, the independent variable can be
measured on continuous or categorical values.
o Model the relationship between the two variables. Such as the
relationship between Income and expenditure, experience and Salary, etc.
o Forecasting new observations. Such as Weather forecasting according to
temperature, Revenue of a company according to the investments in a year,
etc.
y= a0+a1x+ ε
Where,
Here we are taking a dataset that has two variables: salary (dependent variable)
and experience (independent variable). The goals of this problem are:
In this section, we will create a Simple Linear Regression model to find out the best
fitting line for representing the relationship between these two variables.
To implement the Simple Linear regression model in machine learning using
Python, we need to follow the below steps:
The first step for creating the Simple Linear Regression model is data
pre-processing. We have already done it earlier in this tutorial, but there will be
some changes, which are given in the below steps:
o First, we will import the three important libraries, which will help us in
loading the dataset, plotting the graphs, and creating the Simple Linear
Regression model.
import numpy as nm
import matplotlib.pyplot as mtp
import pandas as pd
o Next, we will load the dataset into our code:
data_set = pd.read_csv('Salary_Data.csv')
By executing the above line of code (ctrl+ENTER), we can read the dataset on our
Spyder IDE screen by clicking on the variable explorer option.
The above output shows the dataset, which has two variables: Salary and
Experience.
Note: In Spyder IDE, the folder containing the code file must be saved as a working directory, and
the dataset or csv file should be in the same folder.
o After that, we need to extract the dependent and independent variables from
the given dataset. The independent variable is years of experience, and the
dependent variable is salary. Below is code for it:
x = data_set.iloc[:, :-1].values
y = data_set.iloc[:, 1].values
In the above lines of code, for x variable, we have taken -1 value since we want to
remove the last column from the dataset. For y variable, we have taken 1 value as
a parameter, since we want to extract the second column and indexing starts from
the zero.
By executing the above line of code, we will get the output for X and Y variable as:
In the above output image, we can see the X (independent) variable and Y
(dependent) variable has been extracted from the given dataset.
o Next, we will split both variables into the test set and training set. We have
30 observations, so we will take 20 observations for the training set and 10
observations for the test set. We are splitting our dataset so that we can
train our model using the training dataset and then test the model using the
test dataset. The code for this is given below:
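A minimal sketch of such a split is shown here on stand-in data, since Salary_Data.csv is not reproduced in this text. The 30 generated rows and the salary formula are hypothetical; the call assumes scikit-learn's train_test_split with test_size=1/3 so that 10 of the 30 observations go to the test set.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in data (hypothetical): 30 observations with one feature each.
x = np.arange(30, dtype=float).reshape(-1, 1)  # years of experience
y = 25000.0 + 9000.0 * x.ravel()               # salary

# test_size=1/3 reserves 10 of the 30 observations for testing.
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=1/3, random_state=0)

print(len(x_train), len(x_test))  # 20 10
```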
By executing the above code, we will get x-test, x-train and y-test, y-train dataset.
Consider the below images:
Test-dataset:
Training Dataset:
o For Simple Linear Regression, we will not use feature scaling, because
rescaling a single predictor does not change the predictions of an ordinary
least squares fit. Now our dataset is well prepared, and we can start building
a Simple Linear Regression model for the given problem.
Now the second step is to fit our model to the training dataset. To do so, we will
import the LinearRegression class of the linear_model library from the scikit
learn. After importing the class, we are going to create an object of the class
named as a regressor. The code for this is given below:
#Fitting the Simple Linear Regression model to the training dataset
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(x_train, y_train)
In the above code, we have used the fit() method to fit our Simple Linear
Regression object to the training set. To the fit() function, we have passed
x_train and y_train, which are the training data for the independent and
dependent variables respectively. We have fitted our regressor object to the
training set so that the model can learn the correlation between the predictor
and target variables. After executing the above lines of code, we will get the
below output.
Output:
dependent (salary) and an independent variable (Experience). So, now, our model
is ready to predict the output for the new observations. In this step, we will provide
the test dataset (new observations) to the model to check whether it can predict
the correct output or not.
We will create a prediction vector y_pred, and x_pred, which will contain
predictions of test dataset, and prediction of training set respectively.
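The prediction step can be sketched as follows. The data is a hypothetical, noise-free stand-in for the salary dataset, so the predictions match the actual values exactly here, which would not happen with real data. The variable names y_pred and x_pred follow the text.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Stand-in data (hypothetical): salary rises linearly with experience.
x = np.arange(1, 31, dtype=float).reshape(-1, 1)
y = 25000.0 + 9000.0 * x.ravel()
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=1/3, random_state=0)

regressor = LinearRegression()
regressor.fit(x_train, y_train)

# Prediction vectors for the test set and the training set.
y_pred = regressor.predict(x_test)
x_pred = regressor.predict(x_train)

print(np.allclose(y_pred, y_test))  # True here, because the stand-in data has no noise
```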
On executing the above lines of code, two variables named y_pred and x_pred will
be generated in the variable explorer, containing the salary predictions for the
test set and the training set respectively.
Output:
You can check the variable by clicking on the variable explorer option in the IDE,
and also compare the result by comparing values from y_pred and y_test. By
comparing these values, we can check how good our model is performing.
Step: 4. visualizing the Training set results:
Now in this step, we will visualize the training set result. To do so, we will use the
scatter() function of the pyplot library, which we have already imported in the pre-
processing step. The scatter () function will create a scatter plot of observations.
In the x-axis, we will plot the Years of Experience of employees and on the y-axis,
salary of employees. In the function, we will pass the real values of training set,
which means a year of experience x_train, training set of Salaries y_train, and color
of the observations. Here we are taking a green color for the observation, but it
can be any color as per the choice.
Now, we need to plot the regression line, so for this, we will use the plot()
function of the pyplot library. In this function, we will pass the years of experience
for training set, predicted salary for training set x_pred, and color of the line.
Next, we will give a title to the plot. Here, we will use the title() function of
the pyplot library and pass the name "Salary vs Experience (Training Dataset)".
After that, we will assign labels for the x-axis and y-axis using the xlabel() and
ylabel() functions.
Finally, we will represent all above things in a graph using show(). The code is
given below:
Output:
By executing the above lines of code, we will get the below graph plot as an
output.
In the above plot, we can see the real values observations in green dots and
predicted values are covered by the red regression line. The regression line shows
a correlation between the dependent and independent variable.
The good fit of the line can be observed by calculating the difference between
actual values and predicted values. But as we can see in the above plot, most of
the observations are close to the regression line, hence our model is
good for the training set.
In the previous step, we have visualized the performance of our model on the
training set. Now, we will do the same for the Test set. The complete code will
remain the same as the above code, except in this, we will use x_test, and y_test
instead of x_train and y_train.
Here we are also changing the color of observations and regression line to
differentiate between the two plots, but it is optional.
#visualizing the Test set results
mtp.scatter(x_test, y_test, color="blue")
# the regression line is the same one that was fitted on the training set
mtp.plot(x_train, x_pred, color="red")
mtp.title("Salary vs Experience (Test Dataset)")
mtp.xlabel("Years of Experience")
mtp.ylabel("Salary(In Rupees)")
mtp.show()
Output:
By executing the above line of code, we will get the output as:
In the above plot, there are observations given by the blue color, and prediction is
given by the red regression line. As we can see, most of the observations are close
to the regression line, hence we can say our Simple Linear Regression is a good
model and able to make good predictions.
Multiple Linear Regression
In the previous topic, we have learned about Simple Linear Regression, where a
single Independent/Predictor(X) variable is used to model the response variable
(Y). But there may be various cases in which the response variable is affected by
more than one predictor variable; for such cases, the Multiple Linear Regression
algorithm is used.
Multiple Linear Regression is one of the important regression algorithms which models the linear
relationship between a single dependent continuous variable and more than one independent
variable.
Example:
Prediction of CO2 emission based on engine size and number of cylinders in a car.
MLR equation:
In Multiple Linear Regression, the target variable (Y) is a linear combination of
multiple predictor variables x1, x2, x3, ..., xn. Since it is an extension of Simple
Linear Regression, the same form applies, and the multiple linear regression
equation becomes:
Y = b0 + b1x1 + b2x2 + b3x3 + ... + bnxn     ............... (a)
Where,
Y= Output/Response variable
Problem Description:
Since we need to find the Profit, so it is the dependent variable, and the other four
variables are independent variables. Below are the main steps of deploying the
MLR model:
o Importing libraries: Firstly we will import the library which will help in
building the model. Below is the code for it:
# importing libraries
import numpy as nm
import matplotlib.pyplot as mtp
import pandas as pd

#importing datasets
data_set = pd.read_csv('50_CompList.csv')
In above output, we can clearly see that there are five variables, in which four
variables are continuous and one is categorical variable.
Output:
Out[5]:
[144372.41, 118671.85, 383199.62, 'New York'],
[142107.34, 91391.77, 366168.42, 'Florida'],
[131876.9, 99814.71, 362861.36, 'New York'],
[134615.46, 147198.87, 127716.82, 'California'],
[130298.13, 145530.06, 323876.68, 'Florida'],
[120542.52, 148718.95, 311613.29, 'New York'],
[123334.88, 108679.17, 304981.62, 'California'],
[101913.08, 110594.11, 229160.95, 'Florida'],
[100671.96, 91790.61, 249744.55, 'California'],
[93863.75, 127320.38, 249839.44, 'Florida'],
[91992.39, 135495.07, 252664.93, 'California'],
[119943.24, 156547.42, 256512.92, 'Florida'],
[114523.61, 122616.84, 261776.23, 'New York'],
[78013.11, 121597.55, 264346.06, 'California'],
[94657.16, 145077.58, 282574.31, 'New York'],
[91749.16, 114175.79, 294919.57, 'Florida'],
[86419.7, 153514.11, 0.0, 'New York'],
[76253.86, 113867.3, 298664.47, 'California'],
[78389.47, 153773.43, 299737.29, 'New York'],
[73994.56, 122782.75, 303319.26, 'Florida'],
[67532.53, 105751.03, 304768.73, 'Florida'],
[77044.01, 99281.34, 140574.81, 'New York'],
[64664.71, 139553.16, 137962.62, 'California'],
[75328.87, 144135.98, 134050.07, 'Florida'],
[72107.6, 127864.55, 353183.81, 'New York'],
[66051.52, 182645.56, 118148.2, 'Florida'],
[65605.48, 153032.06, 107138.38, 'New York'],
[61994.48, 115641.28, 91131.24, 'Florida'],
[61136.38, 152701.92, 88218.23, 'New York'],
[63408.86, 129219.61, 46085.25, 'California'],
[55493.95, 103057.49, 214634.81, 'Florida'],
[46426.07, 157693.92, 210797.67, 'California'],
[46014.02, 85047.44, 205517.64, 'New York'],
[28663.76, 127056.21, 201126.82, 'Florida'],
[44069.95, 51283.14, 197029.42, 'California'],
[20229.59, 65947.93, 185265.1, 'New York'],
[38558.51, 82982.09, 174999.3, 'California'],
[28754.33, 118546.05, 172795.67, 'California'],
[27892.92, 84710.77, 164470.71, 'Florida'],
[23640.93, 96189.63, 148001.11, 'California'],
[15505.73, 127382.3, 35534.17, 'New York'],
[22177.74, 154806.14, 28334.72, 'California'],
[1000.23, 124153.04, 1903.93, 'New York'],
[1315.46, 115816.21, 297114.46, 'Florida'],
[0.0, 135426.92, 0.0, 'California'],
[542.05, 51743.15, 0.0, 'New York'],
[0.0, 116983.8, 45173.06, 'California']], dtype=object)
As we can see in the above output, the last column contains categorical variables
which are not suitable to apply directly for fitting the model. So we need to encode
this variable.
As we have one categorical variable (State), which cannot be directly applied to
the model, so we will encode it. To encode the categorical variable into numbers,
we will use the LabelEncoder class. But it is not sufficient because it still has
some relational order, which may create a wrong model. So in order to remove this
problem, we will use OneHotEncoder, which will create the dummy variables.
Below is code for it:
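#Categorical data
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.compose import ColumnTransformer
labelencoder_x = LabelEncoder()
x[:, 3] = labelencoder_x.fit_transform(x[:, 3])
# ColumnTransformer replaces the categorical_features argument, removed in scikit-learn 0.22.
onehotencoder = ColumnTransformer([('state', OneHotEncoder(), [3])],
                                  remainder='passthrough', sparse_threshold=0)
x = onehotencoder.fit_transform(x).astype(float)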
Here we are only encoding one independent variable, which is state as other
variables are continuous.
Output:
As we can see in the above output, the state column has been converted into
dummy variables (0 and 1). Here each dummy variable column is
corresponding to the one State. We can check by comparing it with the original
dataset. The first column corresponds to the California State, the second column
corresponds to the Florida State, and the third column corresponds to the New
York State.
Note: We should not use all the dummy variables at the same time; the number used must be one
less than the total number of dummy variables, or else we fall into the dummy variable trap.
o Now, we are writing a single line of code just to avoid the dummy variable
trap:
If we do not remove the first dummy variable, then it may introduce
multicollinearity in the model.
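The idea can be sketched on a small hypothetical matrix that has the same layout as the encoded data (three dummy columns followed by a numeric feature). Dropping the first column with a slice is one common way to avoid the trap; the numbers below are made up for illustration.

```python
import numpy as np

# Hypothetical encoded matrix: three dummy columns
# (California, Florida, New York) followed by one numeric feature.
x = np.array([[0.0, 0.0, 1.0, 165349.2],
              [1.0, 0.0, 0.0, 162597.7],
              [0.0, 1.0, 0.0, 153441.51]])

# The three dummy columns always sum to 1, so one of them is redundant.
print(x[:, :3].sum(axis=1))  # [1. 1. 1.]

# Drop the first dummy column to avoid the dummy variable trap.
x = x[:, 1:]
print(x.shape)  # (3, 3)
```

A row with zeros in both remaining dummy columns then stands for the dropped category, California in this layout.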
As we can see in the above output image, the first column has been removed.
o Now we will split the dataset into training and test set. The code for this is
given below:
Output: The above code will split the dataset into a training set and a test set.
You can check the output by clicking on the variable explorer option given in
Spyder IDE. The test set and training set will look like the below image:
Test set:
Training set:
Note: In MLR, we will not do feature scaling, because rescaling the predictors does not change
the predictions of an ordinary least squares model, so we don't need to do it manually.
Output:
Out[9]: LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
normalize=False)
Now, we have successfully trained our model using the training dataset. In the
next step, we will test the performance of the model using the test dataset:
y_pred = regressor.predict(x_test)
By executing the above line of code, a new vector y_pred will be generated under the
variable explorer option. We can test our model by comparing the predicted values
with the test set values.
Output:
In the above output, we have the predicted result set and the test set. We can check
model performance by comparing these two values index by index. For example,
the first index has a predicted profit of $103015 and a test/real profit
of $103282. The difference is only $267, which is a good prediction, so
our model is completed here.
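o We can also check the score for the training dataset and the test dataset. Below is
the code for it:
print('Train Score: ', regressor.score(x_train, y_train))
print('Test Score: ', regressor.score(x_test, y_test))
The score is the coefficient of determination (R²): about 0.95 on the training
dataset and about 0.93 on the test dataset, so the model explains roughly 95% and
93% of the variance in profit, respectively.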
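The score() method of LinearRegression returns the coefficient of determination R², not a classification accuracy. Below is a minimal sketch of that calculation; the function name r2_score_manual and the four sample values are hypothetical and are not taken from the dataset.

```python
import numpy as np

def r2_score_manual(y_true, y_hat):
    """R^2 = 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    ss_res = np.sum((y_true - y_hat) ** 2)          # error left after the fit
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # error of always predicting the mean
    return 1.0 - ss_res / ss_tot

# Made-up real and predicted values.
y_true = [100.0, 120.0, 140.0, 160.0]
y_hat = [110.0, 110.0, 150.0, 150.0]
print(r2_score_manual(y_true, y_hat))
```

A value of 1.0 means the predictions match the real values exactly, and a value near 0 means the model does no better than always predicting the mean.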
Note: In the next topic, we will see how we can improve the performance of the model using
the Backward Elimination process.
What is Backward Elimination?
Backward elimination is a feature selection technique used while building a machine
learning model. It removes those features that do not have a significant
effect on the dependent variable or on the prediction of the output. There are various ways to
build a model in Machine Learning, which are:
1. All-in
2. Backward Elimination
3. Forward Selection
4. Bidirectional Elimination
5. Score Comparison
Above are the possible methods for building the model in Machine Learning, but here we
will only use the Backward Elimination process, as it is the fastest method.
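Step-1: First, select a significance level (SL) for a variable to stay in the model, for example SL = 0.05.
Step-2: Fit the complete model with all possible predictors/independent variables.
Step-3: Choose the predictor which has the highest P-value, such that:
a. If P-value > SL, go to Step-4.
b. Else, finish, and the model is ready.
Step-4: Remove that predictor.
Step-5: Rebuild and fit the model with the remaining variables.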
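The steps above can be sketched as a loop. This is only an illustration under stated assumptions: it computes the p-values with NumPy and SciPy (a t-test on ordinary least squares) rather than with the statsmodels summary table used in this tutorial, it always keeps column 0 as the constant term, and the helper names ols_p_values and backward_elimination as well as the synthetic data are hypothetical.

```python
import numpy as np
from scipy import stats

def ols_p_values(X, y):
    """Two-sided p-values of the OLS coefficients (t-test)."""
    n, k = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    residuals = y - X @ beta
    dof = n - k
    sigma2 = residuals @ residuals / dof
    std_err = np.sqrt(np.diag(sigma2 * np.linalg.inv(X.T @ X)))
    return 2 * stats.t.sf(np.abs(beta / std_err), dof)

def backward_elimination(X, y, sl=0.05):
    """Return the indices of the surviving columns; column 0 (constant) is always kept."""
    keep = list(range(X.shape[1]))
    while len(keep) > 1:
        p = ols_p_values(X[:, keep], y)        # Step-2 / Step-5: fit with current columns
        worst = 1 + int(np.argmax(p[1:]))      # Step-3: highest p-value, skipping the constant
        if p[worst] <= sl:                     # nothing above SL: the model is ready
            break
        del keep[worst]                        # Step-4: remove that predictor
    return keep

# Synthetic data: only the "signal" column actually drives y.
rng = np.random.default_rng(0)
n = 200
signal = rng.normal(size=n)
X = np.column_stack([np.ones(n), signal, rng.normal(size=n), rng.normal(size=n)])
y = 3.0 + 5.0 * signal + rng.normal(scale=0.5, size=n)

print(backward_elimination(X, y))
```

The returned list always contains index 0 (the constant) and, for this data, index 1 (the signal column); whether a pure-noise column survives depends on the random draw.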
Administration spend, Marketing spend, and state (dummy variables))
and one dependent variable (Profit). But that model is not optimal, as we have
included all the independent variables and do not know which independent variable
affects the prediction the most and which one the least.
So, in order to optimize the performance of the model, we will use the Backward
Elimination method. This process optimizes the MLR model by keeping only the
features that affect the prediction the most and removing those that affect it
the least. Let's start to apply it to our MLR model.
# importing libraries
import numpy as nm
import matplotlib.pyplot as mtp
import pandas as pd

# importing datasets
data_set = pd.read_csv('50_CompList.csv')

# Extracting independent and dependent variables
x = data_set.iloc[:, :-1].values
y = data_set.iloc[:, 4].values

# Categorical data
# Note: the categorical_features argument was removed in scikit-learn 0.22;
# newer releases select the column with sklearn.compose.ColumnTransformer instead.
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
labelencoder_x = LabelEncoder()
x[:, 3] = labelencoder_x.fit_transform(x[:, 3])
onehotencoder = OneHotEncoder(categorical_features=[3])
x = onehotencoder.fit_transform(x).toarray()

# Avoiding the dummy variable trap
x = x[:, 1:]

# Splitting the dataset into training and test set
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)

# Fitting the MLR model to the training set
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(x_train, y_train)

# Predicting the test set result
y_pred = regressor.predict(x_test)

# Checking the score
print('Train Score: ', regressor.score(x_train, y_train))
print('Test Score: ', regressor.score(x_test, y_test))
From the above code, we got training and test set result as:
Note: On the basis of this score, we will estimate the effect of features on our model after using the
Backward elimination process.
o Adding a column in matrix of features: As we can check in our MLR
equation (a), there is one constant term b0, but this term is not present in our
matrix of features, so we need to add it manually. We will add a column
having values x0 = 1 associated with the constant term b0.
To add this, we will use the append function of the Numpy library (nm, which we
have already imported into our code) and assign a value of 1. Below is
the code for it.
Here we have used axis =1, as we wanted to add a column. For adding a row, we
can use axis =0.
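A minimal runnable sketch of this step is shown below. It assumes a small stand-in matrix instead of the real matrix of features, and it uses the common alias np for NumPy where this tutorial uses nm; the name x_with_const is hypothetical.

```python
import numpy as np

# Stand-in for the matrix of features: 4 rows, 2 columns (hypothetical values).
x = np.array([[1.5, 20.0],
              [2.5, 30.0],
              [3.5, 40.0],
              [4.5, 50.0]])

# np.append(arr, values, axis=1) places `values` to the right of `arr`,
# so passing the column of ones as `arr` puts the constant column first.
ones = np.ones((x.shape[0], 1))
x_with_const = np.append(arr=ones, values=x, axis=1)

print(x_with_const.shape)   # (4, 3)
print(x_with_const[:, 0])   # [1. 1. 1. 1.]
```

Passing the ones as values and x as arr would instead add the constant column at the end, which would change the column indices used in the later steps.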
Output: By executing the above line of code, a new column will be added into our
matrix of features, which will have all values equal to 1. We can check it by clicking
on the x dataset under the variable explorer option.
As we can see in the above output image, the first column is added successfully,
which corresponds to the constant term of the MLR equation.
Step: 2:
new class OLS of statsmodels library. Then we will fit it by using
the fit() method.
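o Next, we need the p-values to compare with the SL value, so we will
use the summary() method to get the summary table of all the values. Below is
the code for it:
# assumes: import statsmodels.api as sm
x_opt = x[:, [0, 1, 2, 3, 4, 5]]
regressor_OLS = sm.OLS(endog=y, exog=x_opt).fit()
regressor_OLS.summary()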
Output: By executing the above lines of code, we will get a summary table.
Consider the below image:
In the above image, we can clearly see the p-values of all the variables. Here x1 and
x2 are dummy variables, x3 is R&D spend, x4 is Administration spend,
and x5 is Marketing spend.
From the table, we choose the highest p-value, which is 0.953 for x1. Since this
p-value is greater than the SL value, we will remove the x1
variable (a dummy variable) from the table and refit the model. Below is the
code for it:
# assumes: import statsmodels.api as sm
x_opt = x[:, [0, 2, 3, 4, 5]]
regressor_OLS = sm.OLS(endog=y, exog=x_opt).fit()
regressor_OLS.summary()
Output:
As we can see in the output image, now five variables remain. Among these variables,
the highest p-value is 0.961, so we will remove it in the next iteration.
o The highest remaining p-value is 0.961, for the x1 variable, which is the other dummy
variable. So we will remove it and refit the model. Below is the code for it:
x_opt = x[:, [0, 3, 4, 5]]
regressor_OLS = sm.OLS(endog=y, exog=x_opt).fit()
regressor_OLS.summary()
Output:
In the above output image, we can see that the second dummy variable (x2) has been
removed. The next highest p-value is 0.602, which is still greater than the SL value, so we
need to remove it.
o Now we will remove Admin spend, which has a p-value of 0.602, and again
refit the model.
x_opt = x[:, [0, 3, 5]]
regressor_OLS = sm.OLS(endog=y, exog=x_opt).fit()
regressor_OLS.summary()
Output:
As we can see in the above output image, the Admin spend variable has been
removed. But there is still one variable left to remove, Marketing spend, as it has a
high p-value (0.60).
o Finally, we will remove Marketing spend, whose p-value of 0.60 is
above the significance level.
Below is the code for it:
x_opt = x[:, [0, 3]]
regressor_OLS = sm.OLS(endog=y, exog=x_opt).fit()
regressor_OLS.summary()
Output:
As we can see in the above output image, only two columns are left: the constant
and R&D spend. So R&D spend is the only independent variable that is significant
for the prediction, and we can now predict efficiently using this variable alone.
Below is the code for Building Multiple Linear Regression model by only
using R&D spend:
# importing libraries
import numpy as nm
import matplotlib.pyplot as mtp
import pandas as pd

# importing datasets
data_set = pd.read_csv('50_CompList1.csv')

# Extracting independent and dependent variables
x_BE = data_set.iloc[:, :-1].values
y_BE = data_set.iloc[:, 1].values

# Splitting the dataset into training and test set
from sklearn.model_selection import train_test_split
x_BE_train, x_BE_test, y_BE_train, y_BE_test = train_test_split(x_BE, y_BE, test_size=0.2, random_state=0)

# Fitting the MLR model to the training set
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(nm.array(x_BE_train).reshape(-1, 1), y_BE_train)

# Predicting the test set result
y_pred = regressor.predict(x_BE_test)

# Checking the score
print('Train Score: ', regressor.score(x_BE_train, y_BE_train))
print('Test Score: ', regressor.score(x_BE_test, y_BE_test))
Output:
After executing the above code, we will get the training and test scores as:
As we can see, the training score is about 0.94 and the test score is also about
0.94. The difference between both scores is 0.00149, compared with a difference of
0.0154 for the previous model, where we included all the variables.
We got this result by using only one independent variable (R&D spend)
instead of four variables. Hence, our model is now simpler and remains accurate.
ML Polynomial Regression
o Polynomial Regression is a regression algorithm that models the relationship
between a dependent variable (y) and an independent variable (x) as an nth-degree
polynomial. The Polynomial Regression equation is given below:
y = b0 + b1x1 + b2x1^2 + b3x1^3 + ... + bnx1^n
o It is also called a special case of Multiple Linear Regression in ML, because
we add some polynomial terms to the Multiple Linear Regression equation to
convert it into Polynomial Regression.
o It is a linear model with some modification in order to increase the accuracy.
o The dataset used in Polynomial Regression for training is of a non-linear nature.
o It makes use of a linear regression model to fit complicated, non-linear
functions and datasets.
o Hence, "In Polynomial regression, the original features are
converted into Polynomial features of required degree (2,3,..,n) and
then modeled using a linear model."
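The quoted idea can be sketched with NumPy alone: expand x into polynomial features, then solve an ordinary linear least-squares problem on the expanded matrix. The data below is hypothetical, generated from y = 1 + 2x + 3x^2 without noise, so the fit should recover those three coefficients.

```python
import numpy as np

# Hypothetical data generated from y = 1 + 2x + 3x^2 (no noise).
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 1 + 2 * x + 3 * x ** 2

# Convert the original feature into polynomial features [1, x, x^2].
x_poly = np.vander(x, N=3, increasing=True)

# Model the expanded features with a plain linear least-squares fit.
coeffs, *_ = np.linalg.lstsq(x_poly, y, rcond=None)
print(np.round(coeffs, 6))
```

The model is non-linear in x but linear in the coefficients b0, b1, b2, which is why a linear solver is enough.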
o In the above image, we have taken a dataset which is arranged non-linearly.
If we try to cover it with a linear model, we can clearly see that it
hardly covers any data point. On the other hand, a curve, which is what the
Polynomial model produces, is suitable to cover most of the data points.
o Hence, if the datasets are arranged in a non-linear fashion, then we should
use the Polynomial Regression model instead of Simple Linear Regression.
Note: A Polynomial Regression algorithm is also called Polynomial Linear Regression because
linearity here refers to the coefficients, not to the variables: the model is non-linear in x, but
the coefficients enter it in a linear fashion.
Simple Linear Regression equation: y = b0 + b1x
Multiple Linear Regression equation: y = b0 + b1x1 + b2x2 + b3x3 + ... + bnxn
Polynomial Regression equation: y = b0 + b1x + b2x^2 + b3x^3 + ... + bnx^n
When we compare the above three equations, we can clearly see that all three
equations are Polynomial equations but differ by the degree of variables. The
Simple and Multiple Linear equations are Polynomial equations of degree one,
and the Polynomial Regression equation is a linear equation of the nth
degree. So if we add higher-degree terms to our linear equations, they are converted into
Polynomial Linear equations.
Note: To better understand Polynomial Regression, you must have knowledge of Simple Linear
Regression.
o Data Pre-processing
o Build a Linear Regression model and fit it to the dataset
o Build a Polynomial Regression model and fit it to the dataset
o Visualize the result for Linear Regression and Polynomial Regression model.
o Predicting the output.
Note: Here, we will build the Linear Regression model as well as the Polynomial Regression model
to compare their predictions. The Linear Regression model serves as a reference.
The data pre-processing step will remain the same as in previous regression
models, except for some changes. In the Polynomial Regression model, we will not
use feature scaling, and we will not split our dataset into a training and test set.
There are two reasons:
o The dataset contains very little information, which makes it unsuitable to divide
into a test and training set; otherwise our model will not be able to find the
correlations between the salaries and levels.
o In this model, we want very accurate predictions for salary, so the model
should have enough information.
# importing libraries
import numpy as nm
import matplotlib.pyplot as mtp
import pandas as pd

# importing datasets
data_set = pd.read_csv('Position_Salaries.csv')

# Extracting independent and dependent variables
x = data_set.iloc[:, 1:2].values
y = data_set.iloc[:, 2].values
Explanation:
o In the above lines of code, we have imported the Python libraries needed
to import the dataset and operate on it.
o Next, we have imported the dataset 'Position_Salaries.csv', which contains
three columns (Position, Levels, and Salary), but we will consider only two
columns (Salary and Levels).
o After that, we have extracted the dependent variable (y) and the independent variable (x)
from the dataset. For the x variable, we used [:, 1:2] because we want only the
column at index 1 (Levels); writing 1:2 instead of 1 keeps x as a matrix.
Output:
As we can see in the above output, there are three columns present (Positions,
Levels, and Salaries). But we are only considering two columns, because the Levels
can be seen as the encoded form of the Positions.
Here we will predict the output for level 6.5 because the candidate has 4+ years'
experience as a regional manager, so he must be somewhere between levels 7
and 6.
Now, we will build and fit the Linear Regression model to the dataset. In building
polynomial regression, we will take the Linear Regression model as a reference and
compare both results. The code is given below:
from sklearn.linear_model import LinearRegression
lin_regs = LinearRegression()
lin_regs.fit(x, y)
In the above code, we have created the Simple Linear model using lin_regs object
of LinearRegression class and fitted it to the dataset variables (x and y).
Output:
Now we will build the Polynomial Regression model, but it will be a little different
from the Simple Linear model, because here we will
use the PolynomialFeatures class of the sklearn.preprocessing module. We are using this class
to add some extra features to our dataset.
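A minimal sketch of this step follows. The names poly_regs, x_poly, and lin_reg_2 are the ones used elsewhere in this tutorial; the six levels and salaries are hypothetical stand-ins for Position_Salaries.csv, and degree=2 is an assumption for the first attempt.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

# Hypothetical levels and salaries; not the values in Position_Salaries.csv.
x = np.array([[1], [2], [3], [4], [5], [6]])
y = np.array([40.0, 45.0, 55.0, 75.0, 110.0, 160.0])

# degree=2 turns each level x into the row [1, x, x^2].
poly_regs = PolynomialFeatures(degree=2)
x_poly = poly_regs.fit_transform(x)

# The polynomial model is a plain linear regression on the expanded matrix.
lin_reg_2 = LinearRegression()
lin_reg_2.fit(x_poly, y)

print(x_poly[:3])
```

Changing the degree argument to 3 or 4 only adds more columns (x^3, x^4) to x_poly; the second fitting step stays the same.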
After executing the code, we will get another matrix x_poly, which can be seen
under the variable explorer option:
Output:
Now we will visualize the result for Linear regression model as we did in Simple
Linear Regression. Below is the code for it:
# Visualizing the result for the Linear Regression model
mtp.scatter(x, y, color="blue")
mtp.plot(x, lin_regs.predict(x), color="red")
mtp.title("Bluff detection model(Linear Regression)")
mtp.xlabel("Position Levels")
mtp.ylabel("Salary")
mtp.show()
Output:
In the above output image, we can clearly see that the regression line is far
from the data points. Predictions are on the red straight line, and the blue points are the actual
values. If we use this output to predict the salary of the CEO, it will give a salary of
approx. $600000, which is far away from the real value.
So we need a curved model to fit the dataset rather than a straight line.
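Here we will visualize the result of the Polynomial Regression model, the code for which is
a little different from the above model:
# Visualizing the result for the Polynomial Regression model
mtp.scatter(x, y, color="blue")
mtp.plot(x, lin_reg_2.predict(poly_regs.fit_transform(x)), color="red")
mtp.title("Bluff detection model(Polynomial Regression)")
mtp.xlabel("Position Levels")
mtp.ylabel("Salary")
mtp.show()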
Output:
As we can see in the above output image, the predictions are close to the real
values. The above plot will vary as we change the degree.
For degree= 3:
If we change the degree to 3, then we will get a more accurate plot, as shown in
the below image.
So, as we can see in the above output image, the predicted salary for level
6.5 is between about $170K and $190K, which suggests that the future employee is telling the truth
about his salary.
Degree= 4: Let's again change the degree to 4, which gives the closest fit of the
plots shown here. Increasing the degree of the polynomial fits the given points more
closely, although a very high degree can overfit a small dataset like this one.
Now, we will predict the final output using the Linear Regression model to see
whether the employee is telling the truth or bluffing. For this, we will use
the predict() method and pass the value 6.5. Below is the code for it:
lin_pred = lin_regs.predict([[6.5]])
print(lin_pred)
Output:
[330378.78787879]
Now, we will predict the final output using the Polynomial Regression model to
compare with Linear model. Below is the code for it:
poly_pred = lin_reg_2.predict(poly_regs.fit_transform([[6.5]]))
print(poly_pred)
Output:
[158862.45265153]
Classification Algorithm in Machine Learning
As we know, Supervised Machine Learning algorithms can be broadly classified
into Regression and Classification algorithms. With Regression algorithms, we
predicted continuous output values; to predict categorical values,
we need Classification algorithms.
The main goal of a Classification algorithm is to identify the category of a given
observation, and these algorithms are mainly used to predict the output for
categorical data.
Classification algorithms can be better understood using the below diagram. In the
below diagram, there are two classes, class A and Class B. These classes have
features that are similar to each other and dissimilar to other classes.
The algorithm which implements the classification on a dataset is known as a
classifier. In classification problems, there are two types of learners:
1. Lazy Learners: A lazy learner first stores the training dataset and waits until
it receives the test dataset. Classification is then done on the
basis of the most related data stored in the training dataset. It takes less
time in training but more time for predictions.
Example: K-NN algorithm, Case-based reasoning
2. Eager Learners: Eager learners develop a classification model based on a
training dataset before receiving a test dataset. In contrast to lazy learners,
an eager learner takes more time in learning and less time in
prediction. Example: Decision Trees, Naïve Bayes, ANN.
o Linear Models
o Logistic Regression
o Support Vector Machines
o Non-linear Models
o K-Nearest Neighbours
o Kernel SVM
o Naïve Bayes
o Decision Tree Classification
o Random Forest Classification
o For Binary classification, cross-entropy can be calculated as:
-(y log(p) + (1 - y) log(1 - p))
where y is the actual output (0 or 1) and p is the predicted probability.
2. Confusion Matrix:
3. AUC-ROC curve:
o ROC curve stands for Receiver Operating Characteristics Curve and AUC
stands for Area Under the Curve.
o It is a graph that shows the performance of the classification model at
different thresholds.
o To visualize the performance of the multi-class classification model, we use
the AUC-ROC Curve.
o The ROC curve is plotted with TPR (True Positive Rate) on the
Y-axis and FPR (False Positive Rate) on the X-axis.
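Returning to the cross-entropy formula for binary classification given above, the following sketch computes it for a single prediction with the standard library only. The function name binary_cross_entropy and the clipping constant eps are illustrative choices, not part of the original text.

```python
import math

def binary_cross_entropy(y, p, eps=1e-15):
    """Log loss for one sample: -(y*log(p) + (1-y)*log(1-p))."""
    p = min(max(p, eps), 1 - eps)  # clip so that log(0) never occurs
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

print(binary_cross_entropy(1, 0.9))  # confident and correct: small loss
print(binary_cross_entropy(1, 0.1))  # confident and wrong: large loss
```

The loss grows quickly as the predicted probability moves away from the actual label, which is why a lower log loss indicates a better classifier.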
Use cases of Classification Algorithms
Classification algorithms can be used in different places. Below are some popular
use cases of Classification Algorithms:
o The curve from the logistic function indicates the likelihood of something
such as whether the cells are cancerous or not, a mouse is obese or not
based on its weight, etc.
o Logistic Regression is a significant machine learning algorithm because it has
the ability to provide probabilities and classify new data using continuous
and discrete datasets.
o Logistic Regression can be used to classify the observations using different
types of data and can easily determine the most effective variables used for
the classification. The below image is showing the logistic function:
Note: Logistic regression uses the concept of predictive modeling, as regression does; therefore, it is
called logistic regression. But it is used to classify samples; therefore, it falls under the classification
algorithms.
o In logistic regression, we use the concept of a threshold value to decide
between the classes 0 and 1: a predicted probability above the threshold
is mapped to 1, and a value below the threshold is mapped to 0.
o In Logistic Regression, y can be between 0 and 1 only, so for this let's divide
the above equation by (1-y):
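Dividing y by (1 - y) gives the odds, which range from 0 to infinity; taking the logarithm of the odds gives a quantity that can take any real value, and this log-odds is what the linear expression models. The sketch below illustrates the logistic function, its inverse, and the threshold rule; the function names and the 0.5 threshold are assumptions made for illustration.

```python
import math

def sigmoid(z):
    """Logistic function: maps any real z to a value between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def logit(y):
    """Log-odds log(y / (1 - y)): maps (0, 1) back to the whole real line."""
    return math.log(y / (1.0 - y))

def classify(z, threshold=0.5):
    """Apply the threshold to the predicted probability."""
    return 1 if sigmoid(z) >= threshold else 0

for z in (-2.0, 0.0, 2.0):
    print(z, round(sigmoid(z), 4), classify(z))
```

Because logit is the inverse of sigmoid, a linear model for the log-odds always yields probabilities strictly between 0 and 1.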
o Multinomial: In multinomial Logistic regression, there can be 3 or more
possible unordered types of the dependent variable, such as "cat", "dogs", or
"sheep"
o Ordinal: In ordinal Logistic regression, there can be 3 or more possible
ordered types of dependent variables, such as "low", "Medium", or "High".
Example: We are given a dataset which contains information about various users
obtained from social networking sites. A car-making company has
recently launched a new SUV, and it wants to check how many
users from the dataset want to purchase the car.
For this problem, we will build a Machine Learning model using the Logistic
Regression algorithm. The dataset is shown in the below image. In this problem, we
will predict the purchased variable (dependent variable) by using age and
salary (independent variables).
Steps in Logistic Regression: To implement the Logistic Regression using
Python, we will use the same steps as we have done in previous topics of
Regression. Below are the steps:
# importing libraries
import numpy as nm
import matplotlib.pyplot as mtp
import pandas as pd

# importing datasets
data_set = pd.read_csv('user_data.csv')
By executing the above lines of code, we will get the dataset as the output.
Consider the given image:
Now, we will extract the dependent and independent variables from the given
dataset. Below is the code for it:
# Extracting independent and dependent variables
x = data_set.iloc[:, [2, 3]].values
y = data_set.iloc[:, 4].values
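In the above code, we have taken [2, 3] for x because our independent variables
are age and salary, which are at indexes 2 and 3. And we have taken 4 for the y variable
because our dependent variable is at index 4. The output will be:
Now we will split the dataset into a training set and a test set. Below is the code for
it:
# Splitting the dataset into training and test set
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25, random_state=0)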
For test set:
In logistic regression, we will do feature scaling because we want accurate
predictions. Here we will only scale the independent variables, because the dependent
variable has only 0 and 1 values. Below is the code for it:
# Feature scaling
from sklearn.preprocessing import StandardScaler
st_x = StandardScaler()
x_train = st_x.fit_transform(x_train)
x_test = st_x.transform(x_test)
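What fit_transform and transform do here can be sketched with NumPy alone: the mean and standard deviation are learned from the training set only and then applied to both sets. The rows below are hypothetical and are not taken from user_data.csv.

```python
import numpy as np

# Hypothetical age and salary rows; not taken from user_data.csv.
x_train = np.array([[20.0, 20000.0], [30.0, 40000.0],
                    [40.0, 60000.0], [50.0, 80000.0]])
x_test = np.array([[35.0, 50000.0]])

# "fit": learn the per-column mean and standard deviation from the training set only.
mean = x_train.mean(axis=0)
std = x_train.std(axis=0)

# "transform": apply the same training statistics to both sets.
x_train_scaled = (x_train - mean) / std
x_test_scaled = (x_test - mean) / std

print(x_train_scaled.mean(axis=0))
print(x_test_scaled)
```

Reusing the training statistics for the test set is why the code above calls transform, not fit_transform, on x_test: the test data must not influence the scaling.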
We have prepared our dataset, and now we will train the model using the
training set. To fit the model to the training set, we will
import the LogisticRegression class of the sklearn library.
After importing the class, we will create a classifier object and use it to fit the
logistic regression model to the training set. Below is the code for it:
# Fitting Logistic Regression to the training set
from sklearn.linear_model import LogisticRegression
classifier = LogisticRegression(random_state=0)
classifier.fit(x_train, y_train)
Output: By executing the above code, we will get the below output:
Out[5]:
Our model is well trained on the training set, so we will now predict the result by
using the test set data. Below is the code for it:
# Predicting the test set result
y_pred = classifier.predict(x_test)
In the above code, we have created a y_pred vector to predict the test set result.
Output: By executing the above code, a new vector (y_pred) will be created under
the variable explorer option. It can be seen as:
The above output image shows the predictions for the corresponding users, i.e., whether
each user wants to purchase the car or not.
Now we will create the confusion matrix to check the accuracy of the
classification. To create it, we need to import the confusion_matrix function of
the sklearn library. After importing the function, we will call it and store the result in a new
variable cm. The function takes two main parameters: y_true (the actual values)
and y_pred (the values returned by the classifier). Below is the code for it:
# Creating the confusion matrix
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
Output:
By executing the above code, a new confusion matrix will be created. Consider the
below image:
We can find the accuracy of the predicted result by interpreting the confusion
matrix. From the above output, we can see that 65 + 24 = 89 predictions are correct and
8 + 3 = 11 predictions are incorrect.
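The accuracy follows directly from these counts. In the sketch below, the four numbers come from the text above, but which off-diagonal cell holds the 8 and which holds the 3 is an assumption, since the matrix image is not reproduced here.

```python
import numpy as np

# Counts from the text: 65 and 24 correct, 8 and 3 incorrect.
# The placement of the 8 and the 3 in the off-diagonal cells is assumed.
cm = np.array([[65, 3],
               [8, 24]])

correct = np.trace(cm)   # diagonal cells are the correct predictions
total = cm.sum()
accuracy = correct / total

print(correct, total - correct, accuracy)
```

With 89 correct predictions out of 100 test observations, the accuracy is 0.89.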
Finally, we will visualize the training set result. To visualize the result, we will
use ListedColormap class of matplotlib library. Below is the code for it:
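# Visualizing the training set result
from matplotlib.colors import ListedColormap
x_set, y_set = x_train, y_train
x1, x2 = nm.meshgrid(nm.arange(start=x_set[:, 0].min() - 1, stop=x_set[:, 0].max() + 1, step=0.01),
                     nm.arange(start=x_set[:, 1].min() - 1, stop=x_set[:, 1].max() + 1, step=0.01))
mtp.contourf(x1, x2, classifier.predict(nm.array([x1.ravel(), x2.ravel()]).T).reshape(x1.shape),
             alpha=0.75, cmap=ListedColormap(('purple', 'green')))
mtp.xlim(x1.min(), x1.max())
mtp.ylim(x2.min(), x2.max())
for i, j in enumerate(nm.unique(y_set)):
    mtp.scatter(x_set[y_set == j, 0], x_set[y_set == j, 1],
                color=ListedColormap(('purple', 'green'))(i), label=j)
mtp.title('Logistic Regression (Training set)')
mtp.xlabel('Age')
mtp.ylabel('Estimated Salary')
mtp.legend()
mtp.show()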
Output: By executing the above code, we will get the below output:
o In the above graph, we can see that there are some Green points within the
green region and Purple points within the purple region.
o All these data points are the observation points from the training set, which
shows the result for purchased variables.
o This graph is made by using two independent variables, i.e., Age on the
x-axis and Estimated Salary on the y-axis.
o The purple point observations are those for which purchased (dependent
variable) is 0, i.e., users who did not purchase the SUV car.
o The green point observations are those for which purchased (dependent
variable) is 1, i.e., users who purchased the SUV car.
o We can also estimate from the graph that younger users with a
low salary did not purchase the car, whereas older users with a high
estimated salary purchased the car.
o But there are some purple points in the green region (Buying the car) and
some green points in the purple region (Not buying the car). So some users did
not follow this pattern, for example a younger user with a high estimated salary
who purchased the car, or an older user with a low estimated salary who did not.
We have successfully visualized the training set result for the logistic regression,
and our goal for this classification is to divide the users who purchased the SUV car
and who did not purchase the car. So from the output graph, we can clearly see
the two regions (Purple and Green) with the observation points. The Purple region
is for those users who didn't buy the car, and Green Region is for those users who
purchased the car.
Linear Classifier:
As we can see from the graph, the decision boundary is a straight line, i.e., linear in nature, as
we have used a linear model for Logistic Regression. In further topics, we will
learn about non-linear classifiers.
Our model is well trained using the training dataset. Now, we will visualize the
result for new observations (Test set). The code for the test set will remain same
as above except that here we will use x_test and y_test instead of x_train and
y_train. Below is the code for it:
# Visualizing the test set result
from matplotlib.colors import ListedColormap
x_set, y_set = x_test, y_test
x1, x2 = nm.meshgrid(nm.arange(start=x_set[:, 0].min() - 1, stop=x_set[:, 0].max() + 1, step=0.01),
                     nm.arange(start=x_set[:, 1].min() - 1, stop=x_set[:, 1].max() + 1, step=0.01))
mtp.contourf(x1, x2, classifier.predict(nm.array([x1.ravel(), x2.ravel()]).T).reshape(x1.shape),
             alpha=0.75, cmap=ListedColormap(('purple', 'green')))
mtp.xlim(x1.min(), x1.max())
mtp.ylim(x2.min(), x2.max())
for i, j in enumerate(nm.unique(y_set)):
    mtp.scatter(x_set[y_set == j, 0], x_set[y_set == j, 1],
                color=ListedColormap(('purple', 'green'))(i), label=j)
mtp.title('Logistic Regression (Test set)')
mtp.xlabel('Age')
mtp.ylabel('Estimated Salary')
mtp.legend()
mtp.show()
Output:
The above graph shows the test set result. As we can see, the graph is divided into
two regions (Purple and Green). Most green observations are in the green region,
and most purple observations are in the purple region. So we can say it is a good
prediction and model. Some of the green and purple data points are in the wrong
regions; these are the errors that we have already counted using the
confusion matrix (11 incorrect outputs).
Hence our model is pretty good and ready to make new predictions for this
classification problem.
measure. Our KNN model will find the similar features of the new data set to
the cats and dogs images and based on the most similar features it will put it
in either cat or dog category.
How does K-NN work?
The K-NN working can be explained on the basis of the below algorithm:
Suppose we have a new data point and we need to put it in the required category.
Consider the below image:
o Firstly, we will choose the number of neighbors, so we will choose k=5.
o Next, we will calculate the Euclidean distance between the data points.
The Euclidean distance is the straight-line distance between two points, which we have
already studied in geometry. For two points (x1, y1) and (x2, y2), it can be calculated as:
d = sqrt((x2 - x1)^2 + (y2 - y1)^2)
o As we can see the 3 nearest neighbors are from category A, hence this new
data point must belong to category A.
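The steps above can be sketched with the standard library only: compute the Euclidean distance from the new point to every stored point, take the k = 5 nearest, and let them vote. The seven points, their labels, and the query point are made up so that three of the five nearest neighbors belong to category A.

```python
import math
from collections import Counter

def euclidean(a, b):
    """Straight-line distance between two points of equal dimension."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def knn_predict(points, labels, query, k=5):
    """Return the majority label among the k nearest points, plus the vote counts."""
    order = sorted(range(len(points)), key=lambda i: euclidean(points[i], query))
    votes = Counter(labels[i] for i in order[:k])
    return votes.most_common(1)[0][0], votes

# Hypothetical training points: three in category A, four in category B.
points = [(1, 1), (2, 1), (1, 2), (5, 5), (6, 5), (5, 6), (6, 6)]
labels = ["A", "A", "A", "B", "B", "B", "B"]

label, votes = knn_predict(points, labels, query=(3, 3), k=5)
print(label, dict(votes))
```

Note that nothing is learned in advance: every prediction scans all stored points, which is why K-NN is called a lazy learner and why prediction becomes costly on large datasets.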
o A very low value for K, such as K=1 or K=2, can be noisy and makes the
model sensitive to outliers.
o Large values for K smooth out noise, but they increase the computation and can
blur the boundary between classes.
Disadvantages of KNN Algorithm:
o The value of K always needs to be determined, which may be complex at times.
o The computation cost is high because the distance from the new data point
to all the training samples must be calculated.
Problem for K-NN Algorithm: A car manufacturer has
built a new SUV. The company wants to show ads to the users who
are interested in buying that SUV. For this problem, we have a dataset that
contains information about multiple users collected through a social network. The dataset
contains a lot of information, but we will use Estimated Salary and Age
as the independent variables and Purchased as the dependent
variable. Below is the dataset:
Steps to implement the K-NN algorithm:
The Data Pre-processing step will remain exactly the same as Logistic Regression.
Below is the code for it:
# importing libraries
import numpy as nm
import matplotlib.pyplot as mtp
import pandas as pd

# importing datasets
data_set = pd.read_csv('user_data.csv')

# Extracting independent and dependent variables
x = data_set.iloc[:, [2, 3]].values
y = data_set.iloc[:, 4].values

# Splitting the dataset into training and test set
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25, random_state=0)

# Feature scaling
from sklearn.preprocessing import StandardScaler
st_x = StandardScaler()
x_train = st_x.fit_transform(x_train)
x_test = st_x.transform(x_test)
By executing the above code, our dataset is imported into our program and
pre-processed. After feature scaling, our test dataset will look like:
From the above output image, we can see that our data is successfully scaled.
And then we will fit the classifier to the training data. Below is the code for it:
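A minimal sketch of this fitting step is shown below. The parameters n_neighbors=5, metric='minkowski', and p=2 match the classifier output printed in this section; the twelve already-scaled training rows are hypothetical stand-ins for the real x_train and y_train.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical, already-scaled training data: two well-separated groups.
x_train = np.array([[-1.2, -1.0], [-1.0, -1.1], [-0.9, -0.8],
                    [-1.1, -0.9], [-0.8, -1.2], [-1.0, -0.7],
                    [1.0, 1.1], [0.9, 0.8], [1.2, 1.0],
                    [1.1, 0.9], [0.8, 1.2], [1.0, 0.7]])
y_train = np.array([0] * 6 + [1] * 6)

# The Minkowski metric with p=2 is the Euclidean distance.
classifier = KNeighborsClassifier(n_neighbors=5, metric='minkowski', p=2)
classifier.fit(x_train, y_train)

print(classifier.predict([[-1.0, -1.0], [1.0, 1.0]]))
```

With the real data, the same two lines that create and fit the classifier apply unchanged to the scaled x_train and y_train from the pre-processing step.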
Output: By executing the above code, we will get the output as:
Out[10]:
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
metric_params=None, n_jobs=None, n_neighbors=5, p=2,
weights='uniform')
o Predicting the Test Result: To predict the test set result, we will create
a y_pred vector as we did in Logistic Regression. Below is the code for it:
y_pred = classifier.predict(x_test)
Output:
o Creating the Confusion Matrix:
Now we will create the Confusion Matrix for our K-NN model to see the
accuracy of the classifier. Below is the code for it:
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
In the above code, we have imported the confusion_matrix function and stored its result in
the variable cm.
Output: By executing the above code, we will get the matrix as below:
In the above image, we can see there are 64 + 29 = 93 correct predictions and
3 + 4 = 7 incorrect predictions, whereas in Logistic Regression there were 11
incorrect predictions. So we can say that the performance of the model is
improved by using the K-NN algorithm.
# Visualizing the training set result
from matplotlib.colors import ListedColormap
x_set, y_set = x_train, y_train
x1, x2 = nm.meshgrid(nm.arange(start=x_set[:, 0].min() - 1, stop=x_set[:, 0].max() + 1, step=0.01),
                     nm.arange(start=x_set[:, 1].min() - 1, stop=x_set[:, 1].max() + 1, step=0.01))
mtp.contourf(x1, x2, classifier.predict(nm.array([x1.ravel(), x2.ravel()]).T).reshape(x1.shape),
             alpha=0.75, cmap=ListedColormap(('red', 'green')))
mtp.xlim(x1.min(), x1.max())
mtp.ylim(x2.min(), x2.max())
for i, j in enumerate(nm.unique(y_set)):
    mtp.scatter(x_set[y_set == j, 0], x_set[y_set == j, 1],
                color=ListedColormap(('red', 'green'))(i), label=j)
mtp.title('K-NN Algorithm (Training set)')
mtp.xlabel('Age')
mtp.ylabel('Estimated Salary')
mtp.legend()
mtp.show()
Output:
The output graph is different from the graph which we obtained in Logistic
Regression. It can be understood in the below points:
o As we can see, the graph shows red points and green points.
The green points are for Purchased(1) and the red points are for not
Purchased(0).
o The graph shows an irregular boundary instead of a
straight line or a smooth curve, because K-NN classifies each point by
its nearest neighbors.
o The graph has classified users in the correct categories, as most of the
users who didn't buy the SUV are in the red region and users who
bought the SUV are in the green region.
o The graph shows a good result, but there are still some green points
in the red region and red points in the green region. This is not a big
issue, as tolerating a few such points helps keep the model from overfitting.
o Hence our model is well trained.
o Visualizing the Test set result:
After training the model, we will now test it on a new dataset, i.e., the
Test dataset. The code remains the same except for some minor changes:
x_train and y_train are replaced by x_test and y_test.
Below is the code for it:
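# The setup lines are the same as in the training-set code above, with
# x_set, y_set = x_test, y_test (np is numpy, plt is matplotlib.pyplot).
for i, j in enumerate(np.unique(y_set)):
    plt.scatter(x_set[y_set == j, 0], x_set[y_set == j, 1],
                color=ListedColormap(('red', 'green'))(i), label=j)
plt.title('K-NN algorithm (Test set)')
plt.xlabel('Age')
plt.ylabel('Estimated Salary')
plt.legend()
plt.show()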
Output:
The above graph shows the output for the test data set. As we can see in the
graph, the predicted output is good, as most of the red points are in the red
region and most of the green points are in the green region.
However, there are a few green points in the red region and a few red points in the
green region. These are the incorrect predictions that we observed in the
confusion matrix (7 incorrect outputs).
Support Vector Machine Algorithm
Support Vector Machine or SVM is one of the most popular Supervised Learning
algorithms, which is used for Classification as well as Regression problems.
However, primarily, it is used for Classification problems in Machine Learning.
The goal of the SVM algorithm is to create the best line or decision boundary that
can segregate n-dimensional space into classes so that we can easily put the new
data point in the correct category in the future. This best decision boundary is
called a hyperplane.
SVM chooses the extreme points/vectors that help in creating the hyperplane.
These extreme cases are called support vectors, and hence the algorithm is termed
Support Vector Machine. Consider the below diagram in which there are two
different categories that are classified using a decision boundary or hyperplane:
Example: SVM can be understood with the example that we used for the KNN
classifier. Suppose we see a strange cat that also has some features of dogs. If
we want a model that can accurately identify whether it is a cat or a dog, such a
model can be created by using the SVM algorithm. We first train our model
with lots of images of cats and dogs so that it can learn their different features,
and then we test it with this strange creature. Since SVM creates a decision
boundary between these two classes (cat and dog) using the extreme cases
(support vectors), it compares the creature against the extreme cases of cat and
dog. On the basis of the support vectors, it will classify it as a cat. Consider the
below diagram:
SVM algorithm can be used for Face detection, image classification, text
categorization, etc.
Types of SVM
SVM can be of two types:
o Linear SVM: Linear SVM is used for linearly separable data. If a dataset
can be classified into two classes by using a single straight line, then
such data is termed linearly separable data, and the classifier used is
called a Linear SVM classifier.
o Non-linear SVM: Non-linear SVM is used for non-linearly separable data.
If a dataset cannot be classified by using a straight line, then such data
is termed non-linear data, and the classifier used is called a Non-linear
SVM classifier.
The dimension of the hyperplane depends on the number of features in the dataset:
if there are 2 features (as shown in the image), the hyperplane is a straight line,
and if there are 3 features, the hyperplane is a 2-dimensional plane.
We always create a hyperplane that has a maximum margin, which means the
maximum distance between the hyperplane and the nearest data points of each class.
Support Vectors:
The data points or vectors that are closest to the hyperplane and that affect
the position of the hyperplane are termed support vectors. Since these vectors
support the hyperplane, they are called support vectors.
Since this is a 2-d space, we can easily separate these two classes by just using
a straight line. But there can be multiple lines that separate these classes.
Consider the below image:
Hence, the SVM algorithm helps to find the best line or decision boundary; this
best boundary or region is called a hyperplane. The SVM algorithm finds the
points from both classes that lie closest to the boundary. These points are called
support vectors. The distance between the support vectors and the hyperplane is
called the margin, and the goal of SVM is to maximize this margin. The hyperplane
with the maximum margin is called the optimal hyperplane.
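To make the ideas of support vectors and margin concrete, here is a minimal sketch
on six made-up points (an assumption, not the tutorial's dataset). It writes the
hyperplane as w.x + b = 0, where w and b are names introduced here for
illustration, and uses a very large C so that the linear SVC behaves approximately
like a hard-margin classifier.

```python
import numpy as np
from sklearn.svm import SVC

# Toy, linearly separable data (assumption): class 0 on the left, class 1 on the right.
x = np.array([[0.0, 0.0], [0.0, 1.0], [-1.0, 0.5],
              [2.0, 0.0], [2.0, 1.0], [3.0, 0.5]])
y = np.array([0, 0, 0, 1, 1, 1])

# A very large C approximates a hard margin (no points allowed inside the margin).
classifier = SVC(kernel='linear', C=1e6)
classifier.fit(x, y)

w = classifier.coef_[0]        # normal vector of the hyperplane w.x + b = 0
b = classifier.intercept_[0]

# Margin: distance from the hyperplane to the closest points (the support vectors).
margin = 1.0 / np.linalg.norm(w)

# Cross-check: perpendicular distance of every point to the hyperplane.
distances = np.abs(x @ w + b) / np.linalg.norm(w)

print("support vectors:\n", classifier.support_vectors_)
print("margin:", margin)
print("closest point distance:", distances.min())
```

For this toy data the separating line should sit midway between the two groups,
so the margin should come out close to 1. The points farther away at x = -1 and
x = 3 do not influence the boundary.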
Non-Linear SVM:
If data is linearly arranged, then we can separate it by using a straight line, but for
non-linear data, we cannot draw a single straight line. Consider the below image:
So to separate these data points, we need to add one more dimension. For linear
data, we have used two dimensions x and y, so for non-linear data, we will add a
third dimension z. It can be calculated as:
z = x^2 + y^2
By adding the third dimension, the sample space becomes as shown in the below image:
So now, SVM will divide the datasets into classes in the following way. Consider the
below image:
Since we are in 3-d space, the boundary looks like a plane parallel to the x-y
plane. If we convert it back to 2-d space with z=1, then it becomes:
Hence we get a circle of radius 1 in the case of non-linear data.
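The effect of adding the third dimension can be sketched on toy data. The example
below is an illustration under stated assumptions: the two concentric circles
(radius 0.5 and radius 2) are made up, and a linear SVC is fitted first on (x, y)
and then on (x, y, z) with z = x^2 + y^2.

```python
import numpy as np
from sklearn.svm import SVC

# Toy data (assumption): an inner circle of radius 0.5 (class 0)
# surrounded by an outer circle of radius 2 (class 1).
angles = np.linspace(0.0, 2.0 * np.pi, 20, endpoint=False)
inner = 0.5 * np.column_stack([np.cos(angles), np.sin(angles)])
outer = 2.0 * np.column_stack([np.cos(angles), np.sin(angles)])
xy = np.vstack([inner, outer])
labels = np.array([0] * 20 + [1] * 20)

# In the original 2-d space no straight line separates the two classes.
acc_2d = SVC(kernel='linear').fit(xy, labels).score(xy, labels)

# Add the third dimension z = x^2 + y^2.
z = xy[:, 0] ** 2 + xy[:, 1] ** 2
xyz = np.column_stack([xy, z])

# In 3-d space a flat plane can separate the classes.
acc_3d = SVC(kernel='linear').fit(xyz, labels).score(xyz, labels)

print("accuracy with (x, y):   ", acc_2d)
print("accuracy with (x, y, z):", acc_3d)
```

In practice, a non-linear kernel such as kernel='rbf' in SVC performs this kind of
mapping implicitly, so the extra column does not have to be built by hand.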
Now we will implement the SVM algorithm using Python. Here we will use the same
dataset user_data, which we have used in Logistic regression and KNN
classification.
Up to the data pre-processing step, the code remains the same. Below is the code:
# Importing the dataset (the earlier lines of this listing are not shown here)
import pandas as pd
data_set = pd.read_csv('user_data.csv')

# Extracting independent and dependent variables
x = data_set.iloc[:, [2, 3]].values
y = data_set.iloc[:, 4].values

# Splitting the dataset into training and test set
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25, random_state=0)

# Feature scaling
from sklearn.preprocessing import StandardScaler
st_x = StandardScaler()
x_train = st_x.fit_transform(x_train)
x_test = st_x.transform(x_test)
After executing the above code, we will pre-process the data. The code will give
the dataset as:
The scaled output for the test set will be:
Now the training set will be fitted to the SVM classifier. To create the SVM
classifier, we will import the SVC class from the sklearn.svm library. Below is the
code for it:
In the above code, we have used kernel='linear', as here we are creating an SVM for
linearly separable data. However, we can change it for non-linear data. We then
fitted the classifier to the training dataset (x_train, y_train).
Output:
Out[8]:
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
decision_function_shape='ovr', degree=3, gamma='auto_deprecated',
kernel='linear', max_iter=-1, probability=False, random_state=0,
shrinking=True, tol=0.001, verbose=False)
The model performance can be altered by changing the values
of C (regularization factor), gamma, and kernel.
Output: Below is the output for the prediction of the test set:
o Creating the confusion matrix:
Now we will check the performance of the SVM classifier, that is, how many
incorrect predictions it makes compared to the Logistic regression
classifier. To create the confusion matrix, we need to import
the confusion_matrix function of the sklearn library. After importing the
function, we will call it and store the result in a new variable cm. The
function takes two main parameters: y_true (the actual values) and y_pred
(the values returned by the classifier). Below is the code for it:
Output:
As we can see in the above output image, there are 66+24= 90 correct predictions
and 8+2= 10 incorrect predictions. Therefore, we can say that our SVM model
improved as compared to the Logistic regression model, which made 11 incorrect
predictions.
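The fitting, prediction, and confusion-matrix steps for SVM can be sketched as
follows. This is a hedged illustration rather than the tutorial's original
listing: the data is synthetic (an assumption), and the loop over two kernels is
added only to show how changing the kernel parameter changes the result.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.metrics import confusion_matrix

# Synthetic stand-in data (assumption): two overlapping Gaussian blobs.
rng = np.random.default_rng(0)
x = np.vstack([rng.normal(-1.0, 1.0, size=(200, 2)),
               rng.normal(1.0, 1.0, size=(200, 2))])
y = np.array([0] * 200 + [1] * 200)

x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.25, random_state=0)
st_x = StandardScaler()
x_train = st_x.fit_transform(x_train)
x_test = st_x.transform(x_test)

results = {}
for kernel in ('linear', 'rbf'):
    classifier = SVC(kernel=kernel, random_state=0)
    classifier.fit(x_train, y_train)
    y_pred = classifier.predict(x_test)
    cm = confusion_matrix(y_test, y_pred)      # rows: actual, columns: predicted
    incorrect = int(cm.sum() - np.trace(cm))   # off-diagonal entries are errors
    results[kernel] = (cm, incorrect)
    print(kernel, "incorrect predictions:", incorrect)
    print(cm)
```

The diagonal of each matrix holds the correct predictions, which is how the
66+24 and 8+2 totals above are read off.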
# Final lines of the training-set plot (plt is matplotlib.pyplot)
plt.title('SVM classifier (Training set)')
plt.xlabel('Age')
plt.ylabel('Estimated Salary')
plt.legend()
plt.show()
Output:
As we can see, the above output appears similar to the Logistic regression
output. We got a straight line as the hyperplane because we
have used a linear kernel in the classifier. We have also discussed above
that in a 2-d space the hyperplane in SVM is a straight line.
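# Visualizing the test set result
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
x_set, y_set = x_test, y_test
x1, x2 = np.meshgrid(
    np.arange(start=x_set[:, 0].min() - 1, stop=x_set[:, 0].max() + 1, step=0.01),
    np.arange(start=x_set[:, 1].min() - 1, stop=x_set[:, 1].max() + 1, step=0.01))
plt.contourf(x1, x2,
             classifier.predict(np.array([x1.ravel(), x2.ravel()]).T).reshape(x1.shape),
             alpha=0.75, cmap=ListedColormap(('red', 'green')))
plt.xlim(x1.min(), x1.max())
plt.ylim(x2.min(), x2.max())
for i, j in enumerate(np.unique(y_set)):
    plt.scatter(x_set[y_set == j, 0], x_set[y_set == j, 1],
                color=ListedColormap(('red', 'green'))(i), label=j)
plt.title('SVM classifier (Test set)')
plt.xlabel('Age')
plt.ylabel('Estimated Salary')
plt.legend()
plt.show()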
Output:
As we can see in the above output image, the SVM classifier has divided the users
into two regions (Purchased or Not purchased). Users who purchased the SUV are
in the green region with the green scatter points, and users who did not purchase
the SUV are in the red region with the red scatter points. The hyperplane has
divided the two classes into the Purchased and Not purchased variables.
Naïve Bayes Classifier Algorithm
o Naïve Bayes algorithm is a supervised learning algorithm, which is based
on Bayes theorem and used for solving classification problems.
o It is mainly used in text classification that includes a high-dimensional
training dataset.
o Naïve Bayes Classifier is one of the simplest and most effective classification
algorithms, which helps in building fast machine learning models that can
make quick predictions.
o It is a probabilistic classifier, which means it predicts on the basis of
the probability that an object belongs to each class.
o Some popular examples of Naïve Bayes Algorithm are spam filtration,
sentiment analysis, and classifying articles.
o Naïve: It is called Naïve because it assumes that the occurrence of a certain
feature is independent of the occurrence of other features. For example, if a
fruit is identified on the basis of color, shape, and taste, then a red,
spherical, and sweet fruit is recognized as an apple. Each feature individually
contributes to identifying it as an apple, without depending on the others.
o Bayes: It is called Bayes because it depends on the principle of Bayes'
Theorem.
Bayes' Theorem:
o Bayes' theorem is also known as Bayes' Rule or Bayes' law, which is used
to determine the probability of a hypothesis with prior knowledge. It depends
on the conditional probability.
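o The formula for Bayes' theorem is given as:
P(A|B) = P(B|A) * P(A) / P(B)
Where,
P(A|B) is the posterior probability: the probability of hypothesis A given the
observed event B.
P(B|A) is the likelihood: the probability of the evidence B given that hypothesis
A is true.
P(A) is the prior probability: the probability of the hypothesis before observing
the evidence.
P(B) is the marginal probability: the probability of the evidence.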
play or not on a particular day according to the weather conditions. So to solve this
problem, we need to follow the below steps:
Problem: If the weather is sunny, should the player play or not?
Outlook Play
0 Rainy Yes
1 Sunny Yes
2 Overcast Yes
3 Overcast Yes
4 Sunny No
5 Rainy Yes
6 Sunny Yes
7 Overcast Yes
8 Rainy No
9 Sunny No
10 Sunny Yes
11 Rainy No
12 Overcast Yes
13 Overcast Yes
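Frequency table for the weather conditions:
Weather     Yes   No
Overcast    5     0
Rainy       2     2
Sunny       3     2
Total       10    4
Likelihood table for the weather conditions:
Weather     No             Yes
Overcast    0              5               5/14 = 0.36
Rainy       2              2               4/14 = 0.29
Sunny       2              3               5/14 = 0.36
All         4/14 = 0.29    10/14 = 0.71
Applying Bayes' theorem:
P(Yes|Sunny) = P(Sunny|Yes) * P(Yes) / P(Sunny)
P(Sunny|Yes) = 3/10 = 0.3
P(Sunny) = 5/14 = 0.36
P(Yes) = 10/14 = 0.71
So P(Yes|Sunny) = (3/10) * (10/14) / (5/14) = 0.60
P(No|Sunny) = P(Sunny|No) * P(No) / P(Sunny)
P(Sunny|No) = 2/4 = 0.5
P(No) = 4/14 = 0.29
P(Sunny) = 5/14 = 0.36
So P(No|Sunny) = (2/4) * (4/14) / (5/14) = 0.40
Since P(Yes|Sunny) > P(No|Sunny), the player can play on a sunny day.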
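The calculation can be checked directly from the 14 observations listed in the
dataset above. The sketch below is a minimal illustration; the helper name
posterior is hypothetical and was introduced here. Counting the rows gives 10 Yes
and 4 No outcomes, and 5 Sunny days.

```python
from collections import Counter

# The 14 (Outlook, Play) observations from the dataset above.
data = [("Rainy", "Yes"), ("Sunny", "Yes"), ("Overcast", "Yes"), ("Overcast", "Yes"),
        ("Sunny", "No"), ("Rainy", "Yes"), ("Sunny", "Yes"), ("Overcast", "Yes"),
        ("Rainy", "No"), ("Sunny", "No"), ("Sunny", "Yes"), ("Rainy", "No"),
        ("Overcast", "Yes"), ("Overcast", "Yes")]

n = len(data)
play_counts = Counter(play for _, play in data)            # counts of Yes / No
outlook_counts = Counter(outlook for outlook, _ in data)   # counts per weather type
joint_counts = Counter(data)                               # counts per (outlook, play) pair


def posterior(play, outlook):
    """P(play | outlook) = P(outlook | play) * P(play) / P(outlook)."""
    likelihood = joint_counts[(outlook, play)] / play_counts[play]
    prior = play_counts[play] / n
    evidence = outlook_counts[outlook] / n
    return likelihood * prior / evidence


p_yes = posterior("Yes", "Sunny")
p_no = posterior("No", "Sunny")
print(round(p_yes, 2), round(p_no, 2))
```

Working with the exact fractions gives P(Yes|Sunny) = 3/5 and P(No|Sunny) = 2/5,
so the two posteriors sum to 1 and "Yes" is the more probable outcome.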
o Naïve Bayes is one of the fastest and easiest ML algorithms for predicting the
class of a data point.
o It can be used for Binary as well as Multi-class Classifications.
o It performs well in Multi-class predictions as compared to many other
algorithms.
o It is a very popular choice for text classification problems.
Sports, Politics, education, etc.
The classifier uses the frequency of words for the predictors.
o Bernoulli: The Bernoulli classifier works similarly to the Multinomial classifier,
but the predictor variables are independent Boolean variables, such as
whether a particular word is present or not in a document. This model is also
well known for document classification tasks.
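The difference between word frequencies (Multinomial) and word presence
(Bernoulli) can be sketched with scikit-learn's MultinomialNB and BernoulliNB.
The vocabulary, word counts, and category names below are made up for
illustration and are not part of the tutorial's data.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB, MultinomialNB

# Hypothetical vocabulary and word counts, made up for illustration.
vocabulary = ["goal", "match", "vote", "election"]
word_counts = np.array([[3, 2, 0, 0],    # sports document
                        [2, 3, 0, 0],    # sports document
                        [0, 0, 3, 2],    # politics document
                        [0, 0, 2, 3]])   # politics document
categories = np.array(["sports", "sports", "politics", "politics"])

# A new document that mentions "goal" twice and "match" once.
new_document = np.array([[2, 1, 0, 0]])

# Multinomial: uses how often each word occurs.
multinomial = MultinomialNB().fit(word_counts, categories)

# Bernoulli: only uses whether each word is present (counts above 0 become 1).
bernoulli = BernoulliNB(binarize=0.0).fit(word_counts, categories)

pred_multinomial = multinomial.predict(new_document)[0]
pred_bernoulli = bernoulli.predict(new_document)[0]
print(pred_multinomial, pred_bernoulli)
```

On such clearly separated toy counts both models agree. They tend to differ when
the number of times a word appears carries information beyond its mere presence.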
Steps to implement:
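# Importing the dataset (column selection assumed to match the earlier user_data examples)
import pandas as pd
dataset = pd.read_csv('user_data.csv')
x = dataset.iloc[:, [2, 3]].values
y = dataset.iloc[:, 4].values

# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25, random_state=0)

# Feature Scaling
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
x_train = sc.fit_transform(x_train)
x_test = sc.transform(x_test)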
In the above code, we have loaded the dataset into our program using "dataset =
pd.read_csv('user_data.csv')". The loaded dataset is divided into training and
test sets, and then we have scaled the feature variables.
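The feature scaling step can be illustrated on a few made-up rows. This is a
sketch under assumptions: the age and salary values are invented, and the names
x_train_scaled and x_test_scaled are introduced here only to keep the original
and scaled arrays apart.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up (age, estimated salary) rows, used only for illustration.
x_train = np.array([[25.0, 30000.0], [35.0, 60000.0], [45.0, 90000.0]])
x_test = np.array([[35.0, 60000.0], [55.0, 120000.0]])

sc = StandardScaler()
# fit_transform learns the mean and standard deviation from the training set
# and applies them to it.
x_train_scaled = sc.fit_transform(x_train)
# transform reuses those same training statistics on the test set.
x_test_scaled = sc.transform(x_test)

print(sc.mean_)          # per-column means learned from x_train
print(x_train_scaled)
print(x_test_scaled)
```

The scaler is fitted on the training set only, so no information from the test
set leaks into the statistics. After scaling, each training column has mean 0 and
standard deviation 1, which puts age and salary on a comparable scale.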
2) Fitting Naive Bayes to the Training Set:
After the pre-processing step, now we will fit the Naive Bayes model to the
Training set. Below is the code for it:
# Fitting Naive Bayes to the Training set
from sklearn.naive_bayes import GaussianNB
classifier = GaussianNB()
classifier.fit(x_train, y_train)
In the above code, we have used the GaussianNB classifier to fit it to the
training dataset. We can also use other classifiers as per our requirement.
Output:
The above output shows the result for the prediction vector y_pred and the real
vector y_test. We can see that some predictions are different from the real values;
these are the incorrect predictions.
Output:
As we can see in the above confusion matrix output, there are 7+3= 10 incorrect
predictions, and 65+25=90 correct predictions.
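The link between the prediction vector and the confusion matrix can be sketched as
follows. This is a minimal illustration on synthetic data (an assumption, standing
in for the user data). It also prints class probabilities, since Naive Bayes is a
probabilistic classifier.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import confusion_matrix

# Synthetic stand-in data (assumption): two overlapping Gaussian blobs.
rng = np.random.default_rng(0)
x = np.vstack([rng.normal(-1.0, 1.0, size=(200, 2)),
               rng.normal(1.0, 1.0, size=(200, 2))])
y = np.array([0] * 200 + [1] * 200)
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.25, random_state=0)

classifier = GaussianNB()
classifier.fit(x_train, y_train)
y_pred = classifier.predict(x_test)

# Positions where the prediction vector differs from the real vector.
mismatches = int(np.sum(y_pred != y_test))

# The same errors appear as the off-diagonal entries of the confusion matrix.
cm = confusion_matrix(y_test, y_pred)
off_diagonal = int(cm.sum() - np.trace(cm))

# Each row holds the predicted probability of class 0 and of class 1.
probabilities = classifier.predict_proba(x_test[:3])
print(mismatches, off_diagonal)
print(probabilities)
```

The number of positions where y_pred and y_test disagree always equals the sum of
the off-diagonal entries, which is how the 7+3 incorrect predictions above are
obtained.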
# Visualizing the training set result
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
x_set, y_set = x_train, y_train
# Build a fine grid that covers the range of both features
X1, X2 = np.meshgrid(
    np.arange(start=x_set[:, 0].min() - 1, stop=x_set[:, 0].max() + 1, step=0.01),
    np.arange(start=x_set[:, 1].min() - 1, stop=x_set[:, 1].max() + 1, step=0.01))
# Colour every grid point by the class the classifier predicts for it
plt.contourf(X1, X2,
             classifier.predict(np.array([X1.ravel(), X2.ravel()]).T).reshape(X1.shape),
             alpha=0.75, cmap=ListedColormap(('purple', 'green')))
plt.xlim(X1.min(), X1.max())
plt.ylim(X2.min(), X2.max())
# Overlay the actual observations, one colour per class
for i, j in enumerate(np.unique(y_set)):
    plt.scatter(x_set[y_set == j, 0], x_set[y_set == j, 1],
                color=ListedColormap(('purple', 'green'))(i), label=j)
plt.title('Naive Bayes (Training set)')
plt.xlabel('Age')
plt.ylabel('Estimated Salary')
plt.legend()
plt.show()
Output:
In the above output we can see that the Naïve Bayes classifier has segregated the
data points with a fine boundary. The boundary is a smooth curve rather than a
straight line, because the GaussianNB classifier used in our code models each
feature with a Gaussian distribution.
Output:
The above output is the final output for the test set data. As we can see, the
classifier has created a curved boundary to divide the "purchased" and "not
purchased" classes. There are some wrong predictions, which we counted in the
confusion matrix, but it is still a pretty good classifier.