
Machine Learning Tutorial

The Machine Learning tutorial covers both basic and advanced concepts, designed for students and professionals, explaining how machines can learn from data to make predictions. It discusses various machine learning techniques including Supervised, Unsupervised, and Reinforcement learning, along with their applications in real-world scenarios such as image recognition, speech recognition, and self-driving cars. The tutorial also highlights the importance of machine learning in solving complex problems and improving decision-making across various sectors.

Uploaded by

NiharikaGuptas
Copyright
© All Rights Reserved

Machine Learning Tutorial

This machine learning tutorial covers basic and advanced concepts of machine
learning. It is designed for students and working professionals.

Machine learning is a growing technology that enables computers to learn
automatically from past data. It uses various algorithms to build mathematical
models and make predictions from historical data or information. Currently, it
is used for tasks such as image recognition, speech recognition, email
filtering, Facebook auto-tagging, recommender systems, and many more.

This tutorial gives you an introduction to machine learning along with a wide
range of machine learning techniques, such as Supervised, Unsupervised, and
Reinforcement learning. You will learn about regression and classification
models, clustering methods, hidden Markov models, and various sequential
models.

What is Machine Learning


In the real world, we are surrounded by humans, who can learn from their
experiences, and by computers or machines, which work on our instructions.
But can a machine also learn from experience or past data the way a human
does? This is where machine learning comes in.

Machine learning is a subset of artificial intelligence that is mainly
concerned with the development of algorithms that allow a computer to learn
from data and past experiences on its own. The term machine learning was first
introduced by Arthur Samuel in 1959. We can define it in a summarized way
as:

Machine learning enables a machine to automatically learn from data, improve performance from
experiences, and predict things without being explicitly programmed.

With the help of sample historical data, known as training data, machine
learning algorithms build a mathematical model that helps in making predictions
or decisions without being explicitly programmed. Machine learning brings
computer science and statistics together to create predictive models. It
constructs or uses algorithms that learn from historical data. In general, the
more relevant data we provide, the better the performance.

A machine has the ability to learn if it can improve its performance by
gaining more data.

How does Machine Learning work


A machine learning system learns from historical data, builds prediction
models, and, whenever it receives new data, predicts the output for it. The
accuracy of the predicted output depends on the amount of data: a large amount
of data helps to build a better model, which predicts the output more
accurately.
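
The learn-then-predict flow described above can be sketched in a few lines of
Python. This is only a minimal illustration and not part of the original
tutorial: the data values and the function names fit_line and predict are
made up for the example, and the "model" is simply a straight line fitted by
ordinary least squares.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = covariance(x, y) / variance(x)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    slope = cov_xy / var_x
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(model, x):
    """Apply the fitted line to a new input value."""
    slope, intercept = model
    return slope * x + intercept


# Hypothetical historical data: hours studied -> exam score
hours = [1, 2, 3, 4, 5]
scores = [52, 54, 56, 58, 60]

model = fit_line(hours, scores)   # "training" on past data
print(predict(model, 6))          # prediction for a new, unseen input
```

The same two phases, fitting on past data and then predicting for new data,
appear in every technique discussed later, whatever the model looks like.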

Suppose we have a complex problem in which we need to make some predictions.
Instead of writing code for it, we just feed the data to generic algorithms,
and with their help the machine builds the logic from the data and predicts
the output. Machine learning has changed the way we think about such problems.
The block diagram below explains the working of a machine learning algorithm:

Features of Machine Learning:


o Machine learning uses data to detect various patterns in a given dataset.
o It can learn from past data and improve automatically.
o It is a data-driven technology.
o Machine learning is closely related to data mining, as both deal with
huge amounts of data.

Need for Machine Learning


The need for machine learning is increasing day by day. The reason is that it
can perform tasks that are too complex for a person to implement directly. As
humans, we have limitations: we cannot process huge amounts of data manually,
so we need computer systems, and this is where machine learning makes things
easier for us.

We can train machine learning algorithms by providing them with a huge amount
of data and letting them explore the data, construct models, and predict the
required output automatically. The performance of a machine learning algorithm
depends on the amount of data, and it can be measured by a cost function. With
the help of machine learning, we can save both time and money.
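
To make the idea of a cost function concrete, here is a small sketch using
the mean squared error, one common choice. The tutorial does not say which
cost function it has in mind, so treat this choice, the function name, and
the sample numbers as assumptions made for illustration.

```python
def mean_squared_error(y_true, y_pred):
    """Average of squared differences between targets and predictions."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


actual = [3.0, 5.0, 7.0]
model_a = [2.5, 5.5, 7.0]   # hypothetical predictions from model A
model_b = [1.0, 8.0, 4.0]   # hypothetical predictions from model B

# The model with the lower cost fits the data better
print(mean_squared_error(actual, model_a))
print(mean_squared_error(actual, model_b))
```

A lower value means the predictions are closer to the actual outputs, so
here model A would be preferred over model B.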

The importance of machine learning can be easily understood from its use cases.
Currently, machine learning is used in self-driving cars, cyber fraud
detection, face recognition, friend suggestions on Facebook, and more. Top
companies such as Netflix and Amazon have built machine learning models that
use vast amounts of data to analyze user interests and recommend products
accordingly.

Following are some key points which show the importance of Machine
Learning:

o Rapid growth in the production of data
o Solving complex problems that are difficult for a human
o Decision making in various sectors, including finance
o Finding hidden patterns and extracting useful information from data

Classification of Machine Learning


At a broad level, machine learning can be classified into three types:

1. Supervised learning
2. Unsupervised learning
3. Reinforcement learning

1) Supervised Learning
Supervised learning is a type of machine learning method in which we provide
sample labeled data to the machine learning system in order to train it, and on
that basis, it predicts the output.

The system creates a model using labeled data to understand the dataset and
learn about each data point. Once training and processing are done, we test the
model by providing sample data to check whether it predicts the correct output.

The goal of supervised learning is to map input data to output data. Supervised
learning is based on supervision, in the same way that a student learns under
the supervision of a teacher. An example of supervised learning is spam
filtering.

Supervised learning can be grouped further in two categories of algorithms:

o Classification
o Regression
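
As a rough sketch of supervised classification, the snippet below labels a
new email by copying the label of the most similar labeled example (a
1-nearest-neighbor rule). The features, the data, and the function name are
invented for illustration; real spam filters use far richer inputs.

```python
def nearest_neighbor_predict(train_points, train_labels, query):
    """Return the label of the training point closest to `query`."""
    def squared_distance(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    best_index = min(
        range(len(train_points)),
        key=lambda i: squared_distance(train_points[i], query),
    )
    return train_labels[best_index]


# Hypothetical labeled emails: (number of links, number of "free" mentions)
emails = [(0, 0), (1, 0), (7, 5), (9, 4)]
labels = ["ham", "ham", "spam", "spam"]

# Predict labels for two new, unlabeled emails
print(nearest_neighbor_predict(emails, labels, (8, 6)))
print(nearest_neighbor_predict(emails, labels, (0, 1)))
```

The labels supplied with the training emails play the role of the teacher:
the algorithm never decides on its own what "spam" means.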

2) Unsupervised Learning
Unsupervised learning is a learning method in which a machine learns without any
supervision.

The machine is trained on a set of data that has not been labeled, classified,
or categorized, and the algorithm needs to act on that data without any
supervision. The goal of unsupervised learning is to restructure the input data
into new features or groups of objects with similar patterns.

In unsupervised learning, we don't have a predetermined result. The machine tries
to find useful insights from a huge amount of data. It can be further classified
into two categories of algorithms:

o Clustering
o Association
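
Clustering can be illustrated with a tiny one-dimensional k-means sketch. No
labels are given; the algorithm only groups values that lie close together.
The data, the starting centers, and the function name kmeans_1d are
assumptions chosen for the example, and k-means is just one of several
clustering methods.

```python
def kmeans_1d(values, centers, iterations=10):
    """Cluster numbers around the given starting centers (k-means, 1-D)."""
    centers = list(centers)
    for _ in range(iterations):
        # Assignment step: attach every value to its nearest center
        clusters = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # Update step: move each center to the mean of its cluster
        centers = [
            sum(c) / len(c) if c else centers[i]
            for i, c in enumerate(clusters)
        ]
    return centers, clusters


# Hypothetical unlabeled data: customer ages
ages = [18, 21, 22, 45, 47, 50]
centers, clusters = kmeans_1d(ages, centers=[20, 40])
print(centers)
print(clusters)
```

The two groups that emerge were never named in the data; they are a
structure the algorithm found by itself.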

3) Reinforcement Learning
Reinforcement learning is a feedback-based learning method in which a learning
agent gets a reward for each right action and a penalty for each wrong action.
The agent learns automatically from this feedback and improves its
performance. In reinforcement learning, the agent interacts with the environment
and explores it. The goal of the agent is to collect the most reward points,
and in pursuing it the agent improves its performance.

A robotic dog that automatically learns the movement of its limbs is an
example of reinforcement learning.
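
The reward-and-penalty loop can be sketched with tabular Q-learning, one
well-known reinforcement learning algorithm. Everything specific here is an
assumption for illustration: a five-cell corridor as the environment, a
reward of +1 for stepping toward the goal and -1 for stepping away, and the
constants ALPHA, GAMMA, and EPSILON.

```python
import random

N_STATES = 5            # positions 0..4; position 4 is the goal
ACTIONS = (+1, -1)      # +1 = step right (toward goal), -1 = step left
ALPHA, GAMMA, EPSILON = 0.5, 0.9, 0.2


def step(state, action):
    """Move the agent; reward +1 for a step toward the goal, -1 otherwise."""
    next_state = min(max(state + action, 0), N_STATES - 1)
    reward = 1 if action == +1 else -1
    return next_state, reward


def train(episodes=500, max_steps=100, seed=0):
    rng = random.Random(seed)
    q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
    for _ in range(episodes):
        state = 0
        for _ in range(max_steps):
            if state == N_STATES - 1:      # goal reached, episode ends
                break
            # Explore with probability EPSILON, otherwise act greedily
            if rng.random() < EPSILON:
                action = rng.choice(ACTIONS)
            else:
                action = max(ACTIONS, key=lambda a: q[(state, a)])
            next_state, reward = step(state, action)
            best_next = max(q[(next_state, a)] for a in ACTIONS)
            # Q-learning update: nudge the estimate toward reward + future value
            q[(state, action)] += ALPHA * (
                reward + GAMMA * best_next - q[(state, action)]
            )
            state = next_state
    return q


q_table = train()
# Best action in each non-goal state according to the learned values
policy = [max(ACTIONS, key=lambda a: q_table[(s, a)]) for s in range(N_STATES - 1)]
print(policy)
```

After training, the agent prefers the rewarded action in every position,
which is the sense in which feedback improves its performance.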

Note: We will learn about the above types of machine learning in detail in later chapters.

History of Machine Learning


About 40-50 years ago, machine learning was science fiction, but today it is
part of our daily life. Machine learning makes our day-to-day life easier, from
self-driving cars to Amazon's virtual assistant "Alexa". However, the idea
behind machine learning is old and has a long history. Below are some
milestones in the history of machine learning:

The early history of Machine Learning (Pre-1940):

o 1834: Charles Babbage, the father of the computer, conceived a device that
could be programmed with punch cards. The machine was never built, but all
modern computers rely on its logical structure.
o 1936: Alan Turing gave a theory of how a machine can determine and execute
a set of instructions.

The era of stored program computers:

o 1943: A human neural network was modeled with an electrical circuit. In
1950, scientists started applying the idea and analyzed how human neurons
might work.
o 1945: "ENIAC", the first electronic general-purpose computer, was
completed. After that, stored-program computers such as EDSAC in 1949 and
EDVAC in 1951 were built.

Computing machinery and intelligence:

o 1950: Alan Turing published a seminal paper, "Computing Machinery and
Intelligence," on the topic of artificial intelligence. In the paper, he
asked, "Can machines think?"

Machine intelligence in Games:

o 1952: Arthur Samuel, a pioneer of machine learning, created a program that
helped an IBM computer play checkers. It performed better the more it
played.
o 1959: The term "Machine Learning" was first coined by Arthur Samuel.

The first "AI" winter:

o The period from 1974 to 1980 was a tough time for AI and ML researchers,
and it came to be called the AI winter.
o During this period, machine translation failed to deliver, and people lost
interest in AI, which led to reduced government funding for research.

Machine Learning from theory to reality


o 1959: The first neural network was applied to a real-world problem:
removing echoes over phone lines using an adaptive filter.
o 1985: Terry Sejnowski and Charles Rosenberg invented the neural network
NETtalk, which was able to teach itself how to correctly pronounce
20,000 words in one week.
o 1997: IBM's Deep Blue computer won a chess match against the world
champion Garry Kasparov, becoming the first computer to beat a reigning
world chess champion in a match.

Machine Learning in the 21st century

o 2006: Computer scientist Geoffrey Hinton gave neural net research the new
name "deep learning," and it has since become one of the most trending
technologies.
o 2012: Google created a deep neural network that learned to recognize
images of humans and cats in YouTube videos.
o 2014: The chatbot "Eugene Goostman" was reported to have passed the Turing
Test. It was the first chatbot to convince 33% of the human judges that it
was not a machine.
o 2014: Facebook created DeepFace, a deep neural network that it claimed
could recognize a person with the same precision as a human.
o 2016: AlphaGo beat Lee Sedol, then ranked the world's number two player,
at the game of Go. In 2017, it beat the number one player, Ke Jie.
o 2017: Alphabet's Jigsaw team built an intelligent system that learned to
detect online trolling. It read millions of comments from different
websites to learn how to stop online trolling.

Machine Learning at present:


Machine learning research has advanced greatly, and it is now present
everywhere around us, for example in self-driving cars, Amazon Alexa,
chatbots, and recommender systems. It includes supervised, unsupervised,
and reinforcement learning, with clustering, classification, decision tree,
and SVM algorithms, among others.

Modern machine learning models can be used for making various predictions,
including weather prediction, disease prediction, stock market analysis,
etc.

Prerequisites
Before learning machine learning, you should have basic knowledge of the
following so that you can easily understand the concepts:

o Fundamental knowledge of probability and linear algebra.
o The ability to code in any programming language, especially Python.
o Knowledge of calculus, especially derivatives of single-variable and
multivariate functions.

Audience
Our machine learning tutorial is designed to help beginners and professionals.

Problems
We assure you that you will not find any difficulty while learning our Machine
learning tutorial. But if there is any mistake in this tutorial, kindly post the problem
or error in the contact form so that we can improve it.

Applications of Machine learning


Machine learning is a buzzword in today's technology, and it is growing very
rapidly. We use machine learning in our daily life without even knowing it,
for example in Google Maps, Google Assistant, and Alexa. Below are some of
the most trending real-world applications of machine learning:

1. Image Recognition:
Image recognition is one of the most common applications of machine learning. It
is used to identify objects, persons, places, etc. in digital images. A popular
use case of image recognition and face detection is automatic friend tagging
suggestion:

Facebook provides a feature of automatic friend tagging suggestion. Whenever we
upload a photo with our Facebook friends, we automatically get a tagging
suggestion with names, and the technology behind this is machine learning's
face detection and recognition algorithm.

It is based on the Facebook project named "DeepFace," which is responsible for
face recognition and person identification in the picture.

2. Speech Recognition
While using Google, we get an option to "Search by voice." It comes under
speech recognition, a popular application of machine learning.

Speech recognition is the process of converting voice instructions into text;
it is also known as "speech to text" or "computer speech recognition." At
present, machine learning algorithms are widely used in speech recognition
applications. Google Assistant, Siri, Cortana, and Alexa use speech
recognition technology to follow voice instructions.

3. Traffic prediction:
If we want to visit a new place, we take the help of Google Maps, which shows
us the correct path with the shortest route and predicts the traffic
conditions.

It predicts whether traffic is clear, slow-moving, or heavily congested in
two ways:

o Real-time location of the vehicle from the Google Maps app and sensors
o Average time taken on past days at the same time of day

Everyone who uses Google Maps is helping to make the app better. It takes
information from the user and sends it back to its database to improve
performance.

4. Product recommendations:
Machine learning is widely used by e-commerce and entertainment companies
such as Amazon and Netflix for recommending products to the user. Whenever we
search for a product on Amazon, we start getting advertisements for the same
product while surfing the internet in the same browser, and this is because
of machine learning.

Google understands the user's interests using various machine learning
algorithms and suggests products accordingly.

Similarly, when we use Netflix, we find recommendations for series, movies,
etc., and this is also done with the help of machine learning.

5. Self-driving cars:
One of the most exciting applications of machine learning is self-driving
cars, in which machine learning plays a significant role. Tesla, a well-known
car manufacturer, is working on self-driving cars. It uses machine learning
to train its car models to detect people and objects while driving.

6. Email Spam and Malware Filtering:


Whenever we receive a new email, it is filtered automatically as important,
normal, or spam. Important mail arrives in our inbox with the important
symbol, and spam goes to our spam box; the technology behind this is machine
learning. Below are some spam filters used by Gmail:

o Content filter
o Header filter
o General blacklists filter
o Rules-based filters
o Permission filters

Machine learning algorithms such as Multi-Layer Perceptron, decision tree,
and Naïve Bayes classifier are used for email spam filtering and malware
detection.
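
As a rough illustration of how a Naïve Bayes classifier can separate spam
from normal mail, here is a toy version that counts words per class and uses
add-one smoothing. The training messages, labels, and function names are
hypothetical, and this is not how Gmail's filters are actually implemented.

```python
import math
from collections import Counter


def train_naive_bayes(messages):
    """messages: list of (text, label). Returns word counts and label counts."""
    word_counts = {"spam": Counter(), "ham": Counter()}
    label_counts = Counter()
    for text, label in messages:
        label_counts[label] += 1
        word_counts[label].update(text.lower().split())
    return word_counts, label_counts


def classify(text, word_counts, label_counts):
    vocabulary = set(word_counts["spam"]) | set(word_counts["ham"])
    total_messages = sum(label_counts.values())
    scores = {}
    for label in ("spam", "ham"):
        # log P(label) + sum of log P(word | label), with add-one smoothing
        score = math.log(label_counts[label] / total_messages)
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            count = word_counts[label][word]
            score += math.log((count + 1) / (total_words + len(vocabulary)))
        scores[label] = score
    return max(scores, key=scores.get)


# Hypothetical training messages
training = [
    ("win free money now", "spam"),
    ("free prize click now", "spam"),
    ("meeting agenda for monday", "ham"),
    ("lunch on monday", "ham"),
]
counts, priors = train_naive_bayes(training)
print(classify("free money prize", counts, priors))
print(classify("monday meeting", counts, priors))
```

Working with log probabilities avoids multiplying many small numbers, and
the smoothing keeps an unseen word from forcing a probability of zero.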

7. Virtual Personal Assistant:


We have various virtual personal assistants such as Google Assistant, Alexa,
Cortana, and Siri. As the name suggests, they help us find information using
our voice instructions. These assistants can help us in various ways just
through voice instructions such as "play music," "call someone," "open an
email," or "schedule an appointment."

Machine learning algorithms are an important part of these virtual assistants.

The assistants record our voice instructions, send them to a server in the
cloud, decode them using ML algorithms, and act accordingly.

8. Online Fraud Detection:


Machine learning makes our online transactions safe and secure by detecting
fraudulent transactions. Whenever we perform an online transaction, fraud can
take place in various ways, such as fake accounts, fake IDs, and money stolen
in the middle of a transaction. To detect this, a feed-forward neural network
helps us by checking whether a transaction is genuine or fraudulent.

For each genuine transaction, the output is converted into hash values, and
these values become the input for the next round. Each genuine transaction
follows a specific pattern, which changes for a fraudulent transaction; the
network detects the change, making our online transactions more secure.

9. Stock Market trading:


Machine learning is widely used in stock market trading. In the stock market,
there is always a risk of ups and downs in share prices, so machine learning's
long short-term memory (LSTM) neural network is used to predict stock market
trends.

10. Medical Diagnosis:


In medical science, machine learning is used for disease diagnosis. With
this, medical technology is growing very fast and is able to build 3D models
that can predict the exact position of lesions in the brain.

It helps in finding brain tumors and other brain-related diseases easily.

11. Automatic Language Translation:


Nowadays, if we visit a new place and do not know the language, it is not a
problem at all, as machine learning helps us by converting text into
languages we know. Google's GNMT (Google Neural Machine Translation) provides
this feature: it is a neural machine translation system that translates text
into our familiar language, which is called automatic translation.

The technology behind automatic translation is a sequence-to-sequence
learning algorithm, which is used with image recognition and translates text
from one language to another.

Machine learning Life cycle


Machine learning has given computer systems the ability to learn
automatically without being explicitly programmed. But how does a machine
learning system work? It can be described using the machine learning life
cycle, a cyclic process for building an efficient machine learning project.
The main purpose of the life cycle is to find a solution to the problem or
project.

Machine learning life cycle involves seven major steps, which are given below:

o Gathering Data
o Data preparation
o Data Wrangling
o Analyse Data
o Train the model
o Test the model
o Deployment

The most important thing in the complete process is to understand the problem
and to know its purpose. Therefore, before starting the life cycle, we need
to understand the problem, because a good result depends on a good
understanding of it.

In the complete life cycle process, to solve a problem, we create a machine
learning system called a "model", and this model is created by "training"
it. But to train a model, we need data; hence, the life cycle starts with
collecting data.

1. Gathering Data:
Data gathering is the first step of the machine learning life cycle. The goal
of this step is to identify and obtain all the data related to the problem.

In this step, we need to identify the different data sources, as data can be
collected from various sources such as files, databases, the internet, or
mobile devices. It is one of the most important steps of the life cycle. The
quantity and quality of the collected data determine the quality of the
output. In general, the more data we have, the more accurate the prediction.

This step includes the following tasks:

o Identify various data sources
o Collect data
o Integrate the data obtained from different sources

By performing the above tasks, we get a coherent set of data, also called
a dataset. It will be used in further steps.

2. Data preparation
After collecting the data, we need to prepare it for further steps. Data
preparation is the step where we put our data into a suitable place and
prepare it for use in machine learning training.

In this step, we first put all the data together and then randomize its
ordering.

This step can be further divided into two processes:

o Data exploration:
It is used to understand the nature of the data we have to work with. We
need to understand the characteristics, format, and quality of the data.
A better understanding of the data leads to an effective outcome. Here we
find correlations, general trends, and outliers.
o Data pre-processing:
The next step is to preprocess the data for analysis.
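
The preparation steps above (combine, shuffle, explore) might look like the
following sketch. The records are invented, and the "more than two standard
deviations from the mean" test is only one simple rule of thumb for spotting
outliers, not a method prescribed by the tutorial.

```python
import random
import statistics

# Hypothetical records gathered from two sources: (age, salary)
source_a = [(38, 48000), (43, 45000), (30, 54000), (40, 52000)]
source_b = [(48, 65000), (35, 58000), (41, 250000), (29, 50000)]

# Put all data together, then randomize the ordering
records = source_a + source_b
random.Random(42).shuffle(records)   # fixed seed so the run is repeatable

# Data exploration: summary statistics and a simple outlier check
salaries = [salary for _, salary in records]
mean_salary = statistics.mean(salaries)
stdev_salary = statistics.stdev(salaries)

# Flag values more than 2 standard deviations from the mean
outliers = [s for s in salaries if abs(s - mean_salary) > 2 * stdev_salary]
print(mean_salary, stdev_salary, outliers)
```

Shuffling matters because data collected source by source often arrives in
an order that could otherwise bias training.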

3. Data Wrangling
Data wrangling is the process of cleaning raw data and converting it into a
usable format. It involves cleaning the data, selecting the variables to use,
and transforming the data into a proper format to make it more suitable for
analysis in the next step. It is one of the most important steps of the
complete process. Cleaning is required to address quality issues.

Not all the data we have collected is necessarily useful. In real-world
applications, collected data may have various issues, including:

o Missing values
o Duplicate data
o Invalid data
o Noise

So, we use various filtering techniques to clean the data.

It is essential to detect and remove the above issues because they can
negatively affect the quality of the outcome.
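
A minimal cleaning pass for three of the issues listed above (missing
values, duplicates, and invalid data) could be sketched as follows. The rows
and the accepted age range are assumptions for the example; dropping rows is
only one option, and in practice missing values are often filled in instead.

```python
# Hypothetical raw rows: (country, age, salary); None marks a missing value
raw_rows = [
    ("India", 38, 48000),
    ("France", 43, 45000),
    ("France", 43, 45000),     # duplicate row
    ("Germany", 40, None),     # missing salary
    ("India", -5, 58000),      # invalid age
    ("Germany", 30, 54000),
]


def clean(rows):
    """Drop rows with missing values, invalid ages, or exact duplicates."""
    seen = set()
    cleaned = []
    for row in rows:
        country, age, salary = row
        if None in row:                 # missing value
            continue
        if not 0 < age < 120:           # invalid data (assumed valid age range)
            continue
        if row in seen:                 # duplicate data
            continue
        seen.add(row)
        cleaned.append(row)
    return cleaned


print(clean(raw_rows))
```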

4. Data Analysis
The cleaned and prepared data is now passed on to the analysis step. This
step involves:

o Selecting analytical techniques
o Building models
o Reviewing the result

The aim of this step is to build a machine learning model to analyze the data
using various analytical techniques and to review the outcome. It starts with
determining the type of problem, where we select a machine learning technique
such as classification, regression, cluster analysis, or association. We then
build the model using the prepared data and evaluate it.

Hence, in this step, we take the data and use machine learning algorithms to
build the model.

5. Train Model
The next step is to train the model. In this step, we train our model to
improve its performance and get a better outcome for the problem.

We use datasets to train the model using various machine learning algorithms.
Training is required so that the model can learn the various patterns, rules,
and features.
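
One way to picture training as gradual improvement is gradient descent on a
simple linear model, sketched below. The tutorial does not commit to a
particular training algorithm, so the choice of gradient descent, the
learning rate, the epoch count, and the data are all assumptions made for
illustration.

```python
def train(xs, ys, learning_rate=0.05, epochs=200):
    """Fit y = w * x + b by gradient descent on the mean squared error."""
    w, b = 0.0, 0.0
    n = len(xs)
    history = []
    for _ in range(epochs):
        errors = [(w * x + b) - y for x, y in zip(xs, ys)]
        history.append(sum(e * e for e in errors) / n)   # loss before update
        # Gradients of the mean squared error with respect to w and b
        grad_w = 2 * sum(e * x for e, x in zip(errors, xs)) / n
        grad_b = 2 * sum(errors) / n
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b, history


# Hypothetical training data generated from y = 2x + 1
xs = [0, 1, 2, 3, 4]
ys = [1, 3, 5, 7, 9]
w, b, history = train(xs, ys)
print(w, b)
print(history[0], history[-1])   # loss at the start and at the end
```

The recorded loss shrinks from the first epoch to the last, which is what
"improving its performance" means in practice.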

6. Test Model
Once our machine learning model has been trained on a given dataset, we test
it. In this step, we check the accuracy of our model by providing a test
dataset to it.

Testing the model determines its percentage accuracy against the requirements
of the project or problem.
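
Percentage accuracy is simply the share of test examples the model gets
right. The sketch below computes it for a made-up set of test labels and
predictions; the function name and the data are illustrative only.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


# Hypothetical test-set labels and the model's predictions for them
true_labels = ["spam", "ham", "ham", "spam", "ham"]
predictions = ["spam", "ham", "spam", "spam", "ham"]

print(f"Accuracy: {accuracy(true_labels, predictions):.0%}")
```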

7. Deployment
The last step of the machine learning life cycle is deployment, where we
deploy the model in a real-world system.

If the model prepared above produces accurate results as per our requirements
at an acceptable speed, we deploy it in the real system. But before deploying
the project, we check whether it is improving its performance using the
available data. The deployment phase is similar to making the final report
for a project.

Installing Anaconda and Python


To learn machine learning, we will use the Python programming language in this
tutorial. To use Python for machine learning, we need to install it on our
computer system along with a compatible IDE (Integrated Development
Environment).

In this topic, we will learn to install Python and an IDE with the help of the
Anaconda distribution.

Anaconda distribution is a free and open-source platform for the Python and R
programming languages. It can be easily installed on any OS such as Windows,
Linux, and macOS. It provides more than 1500 Python/R data science packages
that are suitable for developing machine learning and deep learning models.

Anaconda distribution installs Python along with various tools such as
Jupyter Notebook, Spyder, and Anaconda Prompt. Hence it is a very convenient
packaged solution that you can easily download and install on your computer.
It will automatically install Python and some basic IDEs and libraries with
it.

The steps below show how to download and install Anaconda and an IDE:

Step-1: Download Anaconda Python:

o To download Anaconda, first open your favorite browser and search for
"Download Anaconda Python", then click on the first link as shown in the
image below. Alternatively, you can download it directly by clicking on
this link, [Link]

o After clicking on the first link, you will reach the download page of
Anaconda, as shown in the image below:

o Since Anaconda is available for Windows, Linux, and macOS, you can
download it for your OS type by clicking on the available options shown in
the image below. It offers Python 2.7 and Python 3.7 versions; since 3.7 is
the later version, we will download the Python 3.7 version. After
clicking on the download option, it will start downloading on your computer.

Note: In this topic, we are downloading Anaconda for Windows; you can choose it as per your OS.

Step-2: Install Anaconda Python (Python 3.7 version):

Once the download is complete, go to Downloads and double-click on the
".exe" file (Anaconda3-2019.03-Windows-x86_64.exe) of Anaconda. It will
open a setup window for the Anaconda installation, as shown in the image
below; then click on Next.

o It will open a License Agreement window; click on the "I Agree" option
and move further.

o In the next window, you will get two options for installation, as shown in
the image below. Select the first option (Just me) and click on Next.

o Now you will get a window for the install location. You can leave it as
default or change it by browsing to a location, and then click on Next.
Consider the image below:

o Now select the second option, and click on Install.

o Once the installation is complete, click on Next.

o The installation is now complete. Tick the checkbox if you want to learn
more about Anaconda and Anaconda Cloud. Click on Finish to end the process.

Note: Here, we will use the Spyder IDE to run Python programs.

Step-3: Open Anaconda Navigator

o After successful installation of Anaconda, use Anaconda Navigator to launch
a Python IDE such as Spyder or Jupyter Notebook.
o To open Anaconda Navigator, press the Windows key, search for
Anaconda Navigator, and click on it. Consider the image below:

o After opening the Navigator, launch the Spyder IDE by clicking on
the Launch button given below Spyder. It will install the Spyder IDE on
your system.

Run your Python program in Spyder IDE.

o Open the Spyder IDE; it will look like the image below:

o Write your first program, and save it using the .py extension.
o Run the program using the triangle Run button.
o You can check the program's output in the console pane at the bottom
right.
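
A first program to try might look like the sketch below. The file name (for
example first_program.py) and the greet function are just suggestions; any
small script will do for checking that the installation works.

```python
import sys


def greet(name):
    """Return a greeting for the given name."""
    return f"Hello, {name}!"


print(greet("Machine Learning"))
# Show which Python version the IDE is running
print("Python version:", sys.version.split()[0])
```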

Step- 4: Close the Spyder IDE.

How to get datasets for Machine Learning
The key to success in the field of machine learning, or to becoming a great
data scientist, is to practice with different types of datasets. But finding
a suitable dataset for each kind of machine learning project is a difficult
task. So, in this topic, we describe sources from which you can easily get a
dataset for your project.

Before looking at the sources of machine learning datasets, let's discuss datasets.

What is a dataset?
A dataset is a collection of data in which the data is arranged in some order.
A dataset can contain anything from a simple array to a database table. The
table below shows an example of a dataset:

Country   Age   Salary   Purchased
India     38    48000    No
France    43    45000    Yes
Germany   30    54000    No
France    48    65000    No
Germany   40             Yes
India     35    58000    Yes

A tabular dataset can be understood as a database table or matrix, where each
column corresponds to a particular variable and each row corresponds to a
record of the dataset. The most widely supported file type for a tabular
dataset is the "Comma Separated Values" file, or CSV. But to store "tree-like"
data, we can use a JSON file more efficiently.
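
As a small sketch of how such a table looks in CSV form and how it can be
read, the snippet below uses Python's standard csv module on the sample
table. The variable names are illustrative, and the empty field stands for
the salary that is missing in the sample data.

```python
import csv
import io

# The sample table as CSV text; the empty field is the missing salary
csv_text = """Country,Age,Salary,Purchased
India,38,48000,No
France,43,45000,Yes
Germany,30,54000,No
France,48,65000,No
Germany,40,,Yes
India,35,58000,Yes
"""

# Each row becomes a dict keyed by the column names in the header line
rows = list(csv.DictReader(io.StringIO(csv_text)))
print(len(rows), "rows; columns:", list(rows[0].keys()))

# Rows whose Salary field is empty
missing = [row for row in rows if row["Salary"] == ""]
print(missing)
```

Reading from a file works the same way: pass an opened file object to
csv.DictReader instead of the io.StringIO wrapper.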

Types of data in datasets

o Numerical data: Such as house price, temperature, etc.
o Categorical data: Such as Yes/No, True/False, Blue/Green, etc.
o Ordinal data: Similar to categorical data, but the categories can be
ranked or compared with one another.
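
Because most algorithms work on numbers, categorical and ordinal values are
usually converted before training. The sketch below shows two common
conventions, one-hot encoding for unordered categories and rank mapping for
ordered ones. These techniques and the example values are additions for
illustration, not something the tutorial prescribes.

```python
# Numerical data can be used as-is
temperatures = [21.5, 19.0, 23.2]


# Categorical data has no order, so one-hot encoding is a common choice
def one_hot(value, categories):
    return [1 if value == c else 0 for c in categories]


colors = ["Blue", "Green"]
print(one_hot("Green", colors))

# Ordinal data has an order, so it can be mapped to ranks
size_rank = {"Small": 0, "Medium": 1, "Large": 2}   # hypothetical ordinal scale
sizes = ["Medium", "Small", "Large"]
print([size_rank[s] for s in sizes])
```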

Note: A real-world dataset is of huge size, which is difficult to manage and process at
the initial level. Therefore, to practice machine learning algorithms, we can use any
dummy dataset.

Need of Dataset
To work on machine learning projects, we need a huge amount of data, because
without data, one cannot train ML/AI models. Collecting and preparing the
dataset is one of the most crucial parts of creating an ML/AI project.

The technology behind any ML project cannot work properly if the dataset
is not well prepared and pre-processed.

During the development of an ML project, the developers rely completely on the
datasets. In building ML applications, datasets are divided into two parts:

o Training dataset
o Test dataset
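
Dividing a dataset into these two parts can be sketched as below. The 80/20
ratio, the fixed seed, and the function name train_test_split are
assumptions made for the example; the tutorial does not specify a ratio.

```python
import random


def train_test_split(rows, test_fraction=0.2, seed=0):
    """Shuffle a copy of the rows and split it into training and test parts."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    test_size = int(len(shuffled) * test_fraction)
    return shuffled[test_size:], shuffled[:test_size]


data = list(range(10))             # stand-in for 10 dataset rows
train_set, test_set = train_test_split(data)
print(len(train_set), len(test_set))
```

Keeping the test rows out of training is what makes the later accuracy
check a fair estimate of how the model behaves on new data.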

Note: The datasets are of large size, so to download these datasets, you must have
fast internet on your computer.

Popular sources for Machine Learning datasets


Below is a list of sources of datasets that are freely available to the public:

1. Kaggle Datasets

30
Kaggle is one of the best sources for providing datasets for Data Scientists and
Machine Learners. It allows users to find, download, and publish datasets in an
easy way. It also provides the opportunity to work with other machine learning
engineers and solve difficult Data Science related tasks.

Kaggle provides high-quality datasets in different formats that we can easily find
and download.

The link for the Kaggle dataset is [Link]

2. UCI Machine Learning Repository

UCI Machine learning repository is one of the great sources of machine learning
datasets. This repository contains databases, domain theories, and data
generators that are widely used by the machine learning community for the
analysis of ML algorithms.

Since 1987, it has been widely used by students, professors, and researchers
as a primary source of machine learning datasets.

It classifies the datasets as per the problems and tasks of machine learning such
as Regression, Classification, Clustering, etc. It also contains some of the
popular datasets such as the Iris dataset, Car Evaluation dataset, Poker
Hand dataset, etc.

The link for the UCI machine learning repository
is [Link]

3. Datasets via AWS

We can search, download, access, and share the datasets that are publicly
available via AWS resources. These datasets can be accessed through AWS
resources but are provided and maintained by different government organizations,
researchers, businesses, or individuals.

Anyone can analyze and build various services using shared data via AWS
resources. The shared dataset on cloud helps users to spend more time on data
analysis rather than on acquisitions of data.

This source provides the various types of datasets with examples and ways to use
the dataset. It also provides the search box using which we can search for the
required dataset. Anyone can add any dataset or example to the Registry of
Open Data on AWS.

The link for the resource is [Link]

4. Google's Dataset Search Engine

Google Dataset Search is a search engine launched by Google on September 5, 2018.
This source helps researchers find online datasets that are freely available for
use.

The link for the Google dataset search engine
is [Link]

5. Microsoft Datasets

Microsoft has launched the "Microsoft Research Open Data" repository with
a collection of free datasets in various areas such as natural language
processing, computer vision, and domain-specific sciences.

Using this resource, we can download the datasets to use on the current device, or
we can also directly use it on the cloud infrastructure.

The link to download or use the dataset from this resource
is [Link]

6. Awesome Public Dataset Collection

Awesome public dataset collection provides high-quality datasets that are
arranged in a well-organized list according to topics such as Agriculture,
Biology, Climate, Complex networks, etc. Most of the datasets are available for
free, but some are not, so it is better to check the license before downloading
a dataset.

The link to download the dataset from Awesome public dataset collection
is [Link]

7. Government Datasets
There are different sources of government-related data. Various countries
publish data collected from their different departments for public use.

The goal of providing these datasets is to increase the transparency of government
work among the people and to encourage innovative uses of the data. Below are
some links of government datasets:

o Indian Government dataset

o US Government Dataset

o Northern Ireland Public Sector Datasets

o European Union Open Data Portal

8. Computer Vision Datasets

Visual Data provides a large number of datasets that are specific to computer
vision tasks such as image classification, video classification, image
segmentation, etc. Therefore, if you want to build a project on deep learning or
image processing, you can refer to this source.

The link for downloading the dataset from this source is [Link]

9. Scikit-learn dataset

Scikit-learn is a great source for machine learning enthusiasts. This source
provides both toy and real-world datasets. These datasets can be obtained from
the sklearn.datasets package using the general dataset API.

The toy datasets available in scikit-learn can be loaded using predefined
functions such as load_iris(return_X_y=True) and load_wine(return_X_y=True),
rather than importing a file from an external source. (load_boston, once a
common example, was removed in scikit-learn 1.2.) These datasets are small, so
they are not suitable for real-world projects.
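As a brief illustration, and assuming scikit-learn is installed, a toy dataset can be loaded in a single call. The variable names X and y are just a common convention.

```python
from sklearn.datasets import load_iris

# return_X_y=True returns the feature matrix and the target vector directly.
X, y = load_iris(return_X_y=True)

print(X.shape)                  # 150 samples, 4 numeric features
print(y.shape)                  # one class label per sample
print(sorted(set(y.tolist())))  # three classes, encoded as 0, 1, 2
```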

The link to download datasets from this source is [Link]

Data Preprocessing in Machine learning
Data preprocessing is a process of preparing the raw data and making it suitable
for a machine learning model. It is the first and crucial step while creating a
machine learning model.

When creating a machine learning project, we do not always come across clean and
formatted data. Before doing any operation with data, it must be cleaned and put
into a consistent format. For this, we use data preprocessing.

Why do we need Data Preprocessing?


Real-world data generally contains noise and missing values, and may be in an
unusable format that cannot be fed directly to machine learning models. Data
preprocessing is the set of tasks required to clean the data and make it suitable
for a machine learning model, which also increases the accuracy and efficiency of
the model.

It involves the steps below:

o Getting the dataset


o Importing libraries
o Importing datasets
o Finding Missing Data
o Encoding Categorical Data
o Splitting dataset into training and test set
o Feature scaling

1) Get the Dataset


To create a machine learning model, the first thing we require is a dataset, as a
machine learning model works entirely on data. The data collected for a
particular problem in a proper format is known as the dataset.

Datasets may come in different formats for different purposes. For example, a
dataset for a business problem will differ from the dataset required for a liver
patient study. So each dataset is different from another dataset. To use the
dataset in our code, we usually put it into a CSV file. However, sometimes we may
also need to use an HTML or xlsx file.

What is a CSV File?


CSV stands for "Comma-Separated Values"; it is a file format that allows us to
save tabular data, such as spreadsheets. It is useful for huge datasets, and
programs can read such files easily.

Here we will use a demo dataset for data preprocessing; for practice, it can be
downloaded from "[Link]". For real-world problems, we can download datasets online
from various sources such as [Link], [Link], etc.

We can also create our own dataset by gathering data through various APIs with
Python and saving that data into a .csv file.

2) Importing Libraries
In order to perform data preprocessing using Python, we need to import some
predefined Python libraries. These libraries are used to perform some specific jobs.
There are three specific libraries that we will use for data preprocessing, which are:

Numpy: The NumPy library is used for performing mathematical operations in the
code. It is the fundamental package for scientific computing in Python. It also
supports large, multidimensional arrays and matrices. In Python, we can import
it as:

    import numpy as nm

Here we have used nm as a short name for NumPy, and it will be used throughout
the program.

Matplotlib: The second library is matplotlib, which is a Python 2D plotting
library. From this library, we need to import the sub-library pyplot, which is
used to plot charts in Python. It will be imported as below:

    import matplotlib.pyplot as mtp

Here we have used mtp as a short name for this library.

Pandas: The last library is Pandas, which is one of the most popular Python
libraries and is used for importing and managing datasets. It is an open-source
data manipulation and analysis library. It will be imported as below:

    import pandas as pd

Here, we have used pd as a short name for this library.

3) Importing the Datasets


Now we need to import the datasets which we have collected for our machine
learning project. But before importing a dataset, we need to set the current
directory as a working directory. To set a working directory in Spyder IDE, we need
to follow the below steps:

1. Save your Python file in the directory which contains dataset.


2. Go to the File explorer option in Spyder IDE, and select the required directory.
3. Press the F5 key or click the Run option to execute the file.

Note: We can set any directory as a working directory, but it must contain the required dataset.

Here, in the below image, we can see the Python file along with required dataset.
Now, the current folder is set as a working directory.

read_csv() function:

Now to import the dataset, we will use the read_csv() function of the pandas
library, which reads a CSV file into a DataFrame on which we can then perform
various operations. Using this function, we can read a CSV file locally as well
as from a URL.

We can use the read_csv function as below:

    data_set = pd.read_csv('[Link]')

Here, data_set is the name of the variable that stores our dataset, and inside the
function, we have passed the name of our dataset file. Once we execute the above
line of code, it will import the dataset into our code. We can also check the
imported dataset by clicking on the variable explorer section and then
double-clicking on data_set. Consider the below image:

As in the above image, indexing is started from 0, which is the default indexing in
Python. We can also change the format of our dataset by clicking on the format
option.
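The import step can be tried without a file on disk. The sketch below assumes the file has the columns Country, Age, Salary, and Purchased, with the ten rows that appear in the outputs later in this section; the CSV text is inlined through io.StringIO, which is only a stand-in for the real file.

```python
import io
import pandas as pd

# Inline stand-in for the CSV file; empty fields become NaN when parsed.
csv_text = """Country,Age,Salary,Purchased
India,38,68000,No
France,43,45000,Yes
Germany,30,54000,No
France,48,65000,No
Germany,40,,Yes
India,35,58000,Yes
Germany,,53000,No
France,49,79000,Yes
India,50,88000,No
France,37,77000,Yes
"""

data_set = pd.read_csv(io.StringIO(csv_text))

print(data_set.head())        # first five rows, indexed from 0
print(data_set.shape)         # (rows, columns)
print(data_set.isna().sum())  # number of missing values per column
```

Checking isna().sum() right after loading is a quick way to see which columns will need the missing-data handling described later.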

Extracting dependent and independent variables:

In machine learning, it is important to distinguish the matrix of features
(independent variables) from the dependent variable in the dataset. In our
dataset, there are three independent variables, Country, Age, and Salary, and
one dependent variable, Purchased.

Extracting independent variable:

To extract the independent variables, we will use the iloc[] indexer of the Pandas
library. It is used to select rows and columns from the dataset by position.

    x = data_set.iloc[:, :-1].values

In the above code, the first colon (:) takes all the rows, and the second slice
(:-1) takes all the columns except the last one, because the last column contains
the dependent variable. By doing this, we get the matrix of features.

By executing the above code, we will get output as:

    [['India' 38.0 68000.0]
     ['France' 43.0 45000.0]
     ['Germany' 30.0 54000.0]
     ['France' 48.0 65000.0]
     ['Germany' 40.0 nan]
     ['India' 35.0 58000.0]
     ['Germany' nan 53000.0]
     ['France' 49.0 79000.0]
     ['India' 50.0 88000.0]
     ['France' 37.0 77000.0]]

As we can see in the above output, there are only three variables.

Extracting dependent variable:

To extract the dependent variable, we will again use the Pandas iloc[] indexer.

    y = data_set.iloc[:, 3].values

Here we have taken all the rows with the last column only. It gives the array of
the dependent variable.

By executing the above code, we will get output as:

Output:

    array(['No', 'Yes', 'No', 'No', 'Yes', 'Yes', 'No', 'Yes', 'No', 'Yes'],
          dtype=object)

Note: If you are using Python language for machine learning, then extraction is mandatory, but for
R language it is not required.

4) Handling Missing data:


The next step of data preprocessing is to handle missing data in the datasets. If
our dataset contains some missing data, then it may create a huge problem for our
machine learning model. Hence it is necessary to handle missing values present in
the dataset.

Ways to handle missing data:

There are mainly two ways to handle missing data, which are:

By deleting the particular row: This is the most common way to deal with null
values. We simply delete the specific row or column that contains null values.
However, this approach is not very efficient, because removing data loses
information and may lead to less accurate output.

By calculating the mean: In this approach, we calculate the mean of the column
or row that contains a missing value and put it in place of the missing value.
This strategy is useful for features that have numeric data, such as age,
salary, year, etc. Here, we will use this approach.
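Both options can be compared on the Age and Salary columns of the example data. This is a minimal NumPy sketch, not the library-based approach used next; the array below copies the values shown in the outputs of this section, with np.nan marking the missing entries.

```python
import numpy as np

# Age and Salary columns of the example data; np.nan marks a missing value.
features = np.array([
    [38, 68000], [43, 45000], [30, 54000], [48, 65000], [40, np.nan],
    [35, 58000], [np.nan, 53000], [49, 79000], [50, 88000], [37, 77000],
], dtype=float)

# Option 1: delete every row that contains a missing value.
complete_rows = features[~np.isnan(features).any(axis=1)]

# Option 2: replace each missing value with the mean of its column,
# computed from the values that are present.
column_means = np.nanmean(features, axis=0)
filled = np.where(np.isnan(features), column_means, features)

print(complete_rows.shape)  # two of the ten rows are dropped
print(column_means)
print(filled[4], filled[6])
```

Deleting rows shrinks an already small dataset from ten rows to eight, which is why mean replacement is preferred here.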

To handle missing values, we will use the Scikit-learn library in our code, which
contains various modules for building machine learning models. Here we will
use the Imputer class of the sklearn.preprocessing module. Below is the code for it:

    # Handling missing data (replacing missing data with the mean value)
    from sklearn.preprocessing import Imputer
    imputer = Imputer(missing_values='NaN', strategy='mean', axis=0)
    # Fitting the imputer object to the independent variables x
    imputer = imputer.fit(x[:, 1:3])
    # Replacing missing data with the calculated mean value
    x[:, 1:3] = imputer.transform(x[:, 1:3])

Output:

    array([['India', 38.0, 68000.0],
           ['France', 43.0, 45000.0],
           ['Germany', 30.0, 54000.0],
           ['France', 48.0, 65000.0],
           ['Germany', 40.0, 65222.22222222222],
           ['India', 35.0, 58000.0],
           ['Germany', 41.111111111111114, 53000.0],
           ['France', 49.0, 79000.0],
           ['India', 50.0, 88000.0],
           ['France', 37.0, 77000.0]], dtype=object)

As we can see in the above output, the missing values have been replaced with the
mean of the remaining values in each column.
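The Imputer class used above was removed in scikit-learn 0.22. Assuming a newer release, a minimal sketch of the same mean imputation with its replacement, SimpleImputer from sklearn.impute, could look like the following. SimpleImputer always works column-wise, so there is no axis argument, and the missing-value marker is np.nan rather than the string 'NaN'.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Age and Salary columns with one missing value each (np.nan).
numeric = np.array([
    [38, 68000], [43, 45000], [30, 54000], [48, 65000], [40, np.nan],
    [35, 58000], [np.nan, 53000], [49, 79000], [50, 88000], [37, 77000],
], dtype=float)

# Learn the column means and fill the gaps in one step.
imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
imputed = imputer.fit_transform(numeric)

print(imputer.statistics_)     # the learned column means
print(imputed[4], imputed[6])  # the two rows that had gaps
```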

5) Encoding Categorical data:


Categorical data is data that falls into a fixed set of categories. In our
dataset, there are two categorical variables, Country and Purchased.

A machine learning model works entirely on mathematics and numbers, so a
categorical variable in our dataset may cause trouble while building the model.
It is therefore necessary to encode these categorical variables into numbers.

For Country variable:

Firstly, we will convert the country names into numeric labels. To do this, we
will use the LabelEncoder class from the preprocessing module.

    # Categorical data
    # For the Country variable
    from sklearn.preprocessing import LabelEncoder
    label_encoder_x = LabelEncoder()
    x[:, 0] = label_encoder_x.fit_transform(x[:, 0])

Output:

Out[15]:
array([[2, 38.0, 68000.0],
[0, 43.0, 45000.0],
[1, 30.0, 54000.0],
[0, 48.0, 65000.0],
[1, 40.0, 65222.22222222222],
[2, 35.0, 58000.0],
[1, 41.111111111111114, 53000.0],
[0, 49.0, 79000.0],
[2, 50.0, 88000.0],
[0, 37.0, 77000.0]], dtype=object)

Explanation:

In the above code, we have imported the LabelEncoder class of the sklearn library.
This class has successfully encoded the country names into digits.

In our case, there are three countries, and as we can see in the above output,
they are encoded as 0, 1, and 2. From these values, the machine learning model
may assume that there is an order or magnitude relationship between the
countries (for example, that 2 is greater than 0), which will produce wrong
output. To remove this issue, we will use dummy encoding.

Dummy Variables:

Dummy variables are variables that take only the values 0 or 1. Each category
gets its own column: a 1 marks the presence of that category in a row, and the
other category columns are 0. With dummy encoding, the number of columns equals
the number of categories.

In our dataset, we have 3 categories, so it will produce three columns having 0
and 1 values. For dummy encoding, we will use the OneHotEncoder class of the
preprocessing module.

    # For the Country variable
    from sklearn.preprocessing import LabelEncoder, OneHotEncoder
    label_encoder_x = LabelEncoder()
    x[:, 0] = label_encoder_x.fit_transform(x[:, 0])
    # Encoding for dummy variables
    onehot_encoder = OneHotEncoder(categorical_features=[0])
    x = onehot_encoder.fit_transform(x).toarray()

Output:

    array([[0.00000000e+00, 0.00000000e+00, 1.00000000e+00, 3.80000000e+01,
            6.80000000e+04],
           [1.00000000e+00, 0.00000000e+00, 0.00000000e+00, 4.30000000e+01,
            4.50000000e+04],
           [0.00000000e+00, 1.00000000e+00, 0.00000000e+00, 3.00000000e+01,
            5.40000000e+04],
           [1.00000000e+00, 0.00000000e+00, 0.00000000e+00, 4.80000000e+01,
            6.50000000e+04],
           [0.00000000e+00, 1.00000000e+00, 0.00000000e+00, 4.00000000e+01,
            6.52222222e+04],
           [0.00000000e+00, 0.00000000e+00, 1.00000000e+00, 3.50000000e+01,
            5.80000000e+04],
           [0.00000000e+00, 1.00000000e+00, 0.00000000e+00, 4.11111111e+01,
            5.30000000e+04],
           [1.00000000e+00, 0.00000000e+00, 0.00000000e+00, 4.90000000e+01,
            7.90000000e+04],
           [0.00000000e+00, 0.00000000e+00, 1.00000000e+00, 5.00000000e+01,
            8.80000000e+04],
           [1.00000000e+00, 0.00000000e+00, 0.00000000e+00, 3.70000000e+01,
            7.70000000e+04]])

As we can see in the above output, the Country variable is encoded into the
numbers 0 and 1 and divided into three columns.
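The categorical_features argument of OneHotEncoder used above was also removed in scikit-learn 0.22. Assuming a newer release, a minimal sketch of the same dummy encoding with ColumnTransformer follows. OneHotEncoder can encode strings directly in these releases, so the LabelEncoder step is not needed for the features. The three-row DataFrame is a shortened stand-in for the dataset.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

features = pd.DataFrame({
    "Country": ["India", "France", "Germany"],
    "Age": [38.0, 43.0, 30.0],
    "Salary": [68000.0, 45000.0, 54000.0],
})

# One-hot encode Country and pass the numeric columns through unchanged.
# sparse_threshold=0 asks for a dense NumPy array as the result.
transformer = ColumnTransformer(
    transformers=[("country", OneHotEncoder(), ["Country"])],
    remainder="passthrough",
    sparse_threshold=0,
)
x = transformer.fit_transform(features)

# Columns: France, Germany, India (alphabetical), then Age and Salary.
print(x)
```

The column layout matches the output above: three dummy columns first, followed by the untouched numeric columns.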

It can be seen more clearly in the variable explorer section, by clicking on the
x option as:

For Purchased Variable:

    labelencoder_y = LabelEncoder()
    y = labelencoder_y.fit_transform(y)

For the second categorical variable, we will only use a labelencoder object of
the LabelEncoder class. Here we are not using the OneHotEncoder class because the
Purchased variable has only two categories, no and yes, which are automatically
encoded into 0 and 1.

Output:

Out[17]: array([0, 1, 0, 0, 1, 1, 0, 1, 0, 1])

It can also be seen as:

6) Splitting the Dataset into the Training set and Test set
In machine learning data preprocessing, we divide our dataset into a training set
and a test set. This is one of the crucial steps of data preprocessing, because
it lets us measure how well our machine learning model performs on data it has
not seen.

Suppose we train our machine learning model on one dataset and then test it on a
completely different dataset. The model will then have difficulty, because the
relationships it learned may not hold in the new data.

If we train our model very well and its training accuracy is very high, its
performance may still drop when we provide a new dataset to it. So we always
try to build a machine learning model that performs well on the training set
and also on the test set. Here, we can define these datasets as:

Training set: A subset of the dataset used to train the machine learning model,
for which we already know the output.

Test set: A subset of the dataset used to test the machine learning model. The
model predicts the output for the test set, and the predictions are compared
with the known values.

For splitting the dataset, we will use the below lines of code:

    from sklearn.model_selection import train_test_split
    x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)

Explanation:

o In the above code, the first line imports the train_test_split function, which
splits arrays of the dataset into random train and test subsets.
o In the second line, we have used four variables for our output:
o x_train: features for the training data
o x_test: features for the testing data
o y_train: dependent variable for the training data
o y_test: dependent variable for the testing data
o In the train_test_split() function, we have passed four parameters, of which
the first two are the arrays of data, and test_size specifies the size of the
test set. The test_size may be .5, .3, or .2, which gives the fraction of the
data placed in the test set.
o The last parameter, random_state, sets a seed for the random generator so
that you always get the same result. Any fixed integer works; 0 (used here)
and 42 are common choices.
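The effect of test_size and random_state can be checked on a small array. The ten samples below are hypothetical placeholders for the preprocessed features and target.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Ten hypothetical samples with two features each, and a binary target.
x = np.arange(20).reshape(10, 2)
y = np.array([0, 1, 0, 0, 1, 1, 0, 1, 0, 1])

# test_size=0.2 puts 20% of the rows (2 of 10) in the test set.
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, random_state=0
)

# The same random_state reproduces exactly the same split.
x_train_again, x_test_again, _, _ = train_test_split(
    x, y, test_size=0.2, random_state=0
)

print(x_train.shape, x_test.shape)
print(np.array_equal(x_test, x_test_again))
```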

Output:

By executing the above code, we will get 4 different variables, which can be seen
under the variable explorer section.

As we can see in the above image, the x and y variables are divided into 4
different variables with corresponding values.

7) Feature Scaling
Feature scaling is the final step of data preprocessing in machine learning. It is
a technique to bring the independent variables of the dataset onto a common
scale. In feature scaling, we put our variables in the same range and on the
same scale so that no variable dominates the others.

Consider the below dataset:

As we can see, the age and salary column values are not on the same scale. Many
machine learning models are based on Euclidean distance, and if we do not scale
the variables, it will cause problems in such models.

Euclidean distance between two points (x1, y1) and (x2, y2) is given as:

$$d = \sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2}$$
If we compute any two values from age and salary, then salary values will
dominate the age values, and it will produce an incorrect result. So to remove this
issue, we need to perform feature scaling for machine learning.
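The dominance of salary can be seen numerically. The sketch below takes the (age, salary) values of the first two rows of the example data and computes the Euclidean distance on the raw, unscaled values.

```python
import math

# (age, salary) for two rows of the example dataset.
person_a = (38.0, 68000.0)
person_b = (43.0, 45000.0)

# Euclidean distance on the raw values.
raw_distance = math.dist(person_a, person_b)

# Contribution of each feature to the squared distance.
age_part = (person_a[0] - person_b[0]) ** 2
salary_part = (person_a[1] - person_b[1]) ** 2
salary_share = salary_part / (age_part + salary_part)

print(round(raw_distance, 4))
print(salary_share)  # very close to 1: salary dominates the distance
```

The five-year age difference barely changes the result, so without scaling the model would effectively ignore age.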

There are two ways to perform feature scaling in machine learning:

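Standardization:

$$x' = \frac{x - \text{mean}(x)}{\text{standard deviation}(x)}$$

Normalization:

$$x' = \frac{x - \min(x)}{\max(x) - \min(x)}$$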
Here, we will use the standardization method for our dataset.
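Both methods can be sketched with NumPy: standardization subtracts the mean and divides by the standard deviation, while normalization (min-max scaling) rescales the values to the range 0 to 1. The ages below are the non-missing Age values from the example data.

```python
import numpy as np

ages = np.array([38, 43, 30, 48, 40, 35, 49, 50, 37], dtype=float)

# Standardization: subtract the mean, divide by the standard deviation.
standardized = (ages - ages.mean()) / ages.std()

# Normalization (min-max scaling): rescale to the range [0, 1].
normalized = (ages - ages.min()) / (ages.max() - ages.min())

print(standardized.round(2))
print(normalized.round(2))
```

Standardized values are centered on 0 and are not limited to a fixed interval, whereas normalized values always lie between 0 and 1.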

For feature scaling, we will import the StandardScaler class of the
sklearn.preprocessing module as:

    from sklearn.preprocessing import StandardScaler

Now, we will create an object of the StandardScaler class for the independent
variables or features. Then we will fit and transform the training dataset.

    st_x = StandardScaler()
    x_train = st_x.fit_transform(x_train)

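For the test dataset, we will directly apply the transform() function instead
of fit_transform(), because the scaler has already been fitted on the training
set and the test set must be scaled with the same mean and standard deviation.

    x_test = st_x.transform(x_test)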
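The difference between the two calls can be observed on a small example. The rows below are hypothetical (age, salary) pairs already split into train and test; the scaler learns its mean and standard deviation from the training rows only.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical (age, salary) rows, already split into train and test.
x_train = np.array([[38, 68000], [43, 45000], [30, 54000], [48, 65000]], dtype=float)
x_test = np.array([[35, 58000], [50, 88000]], dtype=float)

st_x = StandardScaler()
x_train_scaled = st_x.fit_transform(x_train)  # learns mean and std from train
x_test_scaled = st_x.transform(x_test)        # reuses the training mean and std

print(st_x.mean_)                   # per-column means of the training set
print(x_train_scaled.mean(axis=0))  # approximately 0 for each column
print(x_test_scaled)                # not centered exactly; values may exceed 1
```

Fitting the scaler on the test set as well would let information from the test data leak into preprocessing, which makes the evaluation less trustworthy.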

Output:

By executing the above lines of code, we will get the scaled values for x_train and
x_test as:

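x_train:

x_test:

As we can see in the above output, all the variables are now on a comparable
scale. Standardization centers each feature around 0 with unit variance, so most
values lie within a few units of 0, although they are not strictly limited to
the range -1 to 1.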

Note: Here, we have not scaled the dependent variable because it has only two values, 0 and
1. If the dependent variable had a wider range of values, we might need to scale it as
well.

Combining all the steps:

Now, in the end, we can combine all the steps together to make our complete code
more understandable.

    # Importing libraries
    import numpy as nm
    import matplotlib.pyplot as mtp
    import pandas as pd

    # Importing the dataset
    data_set = pd.read_csv('[Link]')

    # Extracting the independent variables
    x = data_set.iloc[:, :-1].values

    # Extracting the dependent variable
    y = data_set.iloc[:, 3].values

    # Handling missing data (replacing missing data with the mean value)
    from sklearn.preprocessing import Imputer
    imputer = Imputer(missing_values='NaN', strategy='mean', axis=0)

    # Fitting the imputer object to the independent variables x
    imputer = imputer.fit(x[:, 1:3])

    # Replacing missing data with the calculated mean value
    x[:, 1:3] = imputer.transform(x[:, 1:3])

    # Encoding the Country variable
    from sklearn.preprocessing import LabelEncoder, OneHotEncoder
    label_encoder_x = LabelEncoder()
    x[:, 0] = label_encoder_x.fit_transform(x[:, 0])

    # Encoding for dummy variables
    onehot_encoder = OneHotEncoder(categorical_features=[0])
    x = onehot_encoder.fit_transform(x).toarray()

    # Encoding the Purchased variable
    labelencoder_y = LabelEncoder()
    y = labelencoder_y.fit_transform(y)

    # Splitting the dataset into training and test sets
    from sklearn.model_selection import train_test_split
    x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)

    # Feature scaling
    from sklearn.preprocessing import StandardScaler
    st_x = StandardScaler()
    x_train = st_x.fit_transform(x_train)
    x_test = st_x.transform(x_test)

In the above code, we have included all the data preprocessing steps together. But
there are some steps or lines of code which are not necessary for all machine
learning models. So we can exclude them from our code to make it reusable for all
models.

Supervised Machine Learning


Supervised learning is the type of machine learning in which machines are trained
using well "labelled" training data, and on the basis of that data, machines
predict the output. Labelled data means that the input data is already tagged
with the correct output.

In supervised learning, the training data provided to the machines works as the
supervisor that teaches the machines to predict the output correctly. It applies
the same concept as a student learning under the supervision of a teacher.

Supervised learning is a process of providing input data as well as correct output
data to the machine learning model. The aim of a supervised learning algorithm is
to find a mapping function to map the input variable (x) to the output
variable (y).

In the real-world, supervised learning can be used for Risk Assessment, Image
classification, Fraud Detection, spam filtering, etc.

How Supervised Learning Works?


In supervised learning, models are trained using a labelled dataset, where the
model learns about each type of data. Once the training process is completed, the
model is tested on test data (a held-out portion of the labelled dataset that was
not used for training), and then it predicts the output.

The working of supervised learning can be easily understood by the below
example and diagram:

Suppose we have a dataset of different types of shapes which includes square,
rectangle, triangle, and Polygon. Now the first step is that we need to train the
model for each shape.

o If the given shape has four sides, and all the sides are equal, then it will be
labelled as a Square.
o If the given shape has three sides, then it will be labelled as a triangle.
o If the given shape has six equal sides then it will be labelled as hexagon.

Now, after training, we test our model using the test set, and the task of the model
is to identify the shape.

The machine is already trained on all types of shapes, and when it finds a new
shape, it classifies the shape on the basis of the number of sides and predicts
the output.
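The shape example can be turned into a small sketch. The feature encoding (number of sides, plus a flag for whether all sides are equal) and the choice of a decision tree are assumptions made for illustration; any classifier could be used in its place.

```python
from sklearn.tree import DecisionTreeClassifier

# Each shape is described by (number of sides, 1 if all sides are equal else 0).
training_features = [[4, 1], [4, 0], [3, 0], [3, 1], [6, 1]]
training_labels = ["square", "rectangle", "triangle", "triangle", "hexagon"]

# Training: the model sees the features together with the correct labels.
model = DecisionTreeClassifier(random_state=0)
model.fit(training_features, training_labels)

# After training, the model labels shapes from their features alone.
predictions = model.predict([[3, 0], [6, 1], [4, 1]])
print(list(predictions))
```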

Steps Involved in Supervised Learning:


o First, determine the type of training dataset.
o Collect/gather the labelled training data.
o Split the dataset into a training dataset, a test dataset, and a
validation dataset.
o Determine the input features of the training dataset, which should carry
enough information for the model to accurately predict the output.
o Determine the suitable algorithm for the model, such as support vector
machine, decision tree, etc.
o Execute the algorithm on the training dataset. Sometimes we need a
validation set, held out from the training data, to tune the control
parameters.
o Evaluate the accuracy of the model on the test set. If the model predicts
the correct output for most test examples, the model is accurate.
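The steps above can be sketched end to end. This is one possible arrangement under stated assumptions: the built-in Iris toy dataset stands in for collected data, a decision tree is the chosen algorithm, the tree depth is the control parameter tuned on the validation set, and the 60/20/20 split is arbitrary.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Collect labelled data (a built-in toy dataset stands in for real data).
X, y = load_iris(return_X_y=True)

# Split into training (60%), validation (20%), and test (20%) sets.
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, test_size=0.4, random_state=0, stratify=y
)
X_val, X_test, y_val, y_test = train_test_split(
    X_rest, y_rest, test_size=0.5, random_state=0, stratify=y_rest
)

# Use the validation set to pick a control parameter (the tree depth).
best_depth, best_score = None, -1.0
for depth in (1, 2, 3, 4):
    candidate = DecisionTreeClassifier(max_depth=depth, random_state=0)
    candidate.fit(X_train, y_train)
    score = accuracy_score(y_val, candidate.predict(X_val))
    if score > best_score:
        best_depth, best_score = depth, score

# Train the chosen model and evaluate it once on the untouched test set.
model = DecisionTreeClassifier(max_depth=best_depth, random_state=0)
model.fit(X_train, y_train)
test_accuracy = accuracy_score(y_test, model.predict(X_test))
print(best_depth, round(test_accuracy, 3))
```

Keeping the test set out of the tuning loop means the final accuracy is an estimate on data the model has never influenced.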

Types of supervised Machine learning Algorithms:


Supervised learning can be further divided into two types of problems:

1. Regression

Regression algorithms are used if there is a relationship between the input variable
and the output variable. It is used for the prediction of continuous variables, such
as Weather forecasting, Market Trends, etc. Below are some popular Regression
algorithms which come under supervised learning:

o Linear Regression
o Regression Trees
o Non-Linear Regression
o Bayesian Linear Regression
o Polynomial Regression
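As a small illustration of regression, the sketch below fits a straight line to made-up points with NumPy's least-squares polynomial fit and predicts a continuous value for a new input. The data are constructed to follow y = 2x + 1 exactly, so the fitted line is easy to check.

```python
import numpy as np

# Hypothetical data that follows y = 2x + 1 exactly.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2.0 * x + 1.0

# Fit a straight line (degree-1 polynomial) by least squares.
slope, intercept = np.polyfit(x, y, deg=1)

# Predict a continuous value for an unseen input.
prediction = slope * 5.0 + intercept
print(round(slope, 3), round(intercept, 3), round(prediction, 3))
```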

2. Classification

Classification algorithms are used when the output variable is categorical, which
means there are two or more classes, such as Yes-No, Male-Female, True-False, etc.
A typical application is spam filtering. Below are some popular classification
algorithms:

o Random Forest
o Decision Trees
o Logistic Regression
o Support vector Machines

Note: We will discuss these algorithms in detail in later chapters.

Advantages of Supervised learning:


o With the help of supervised learning, the model can predict the output on the
basis of prior experiences.
o In supervised learning, we can have an exact idea about the classes of
objects.
o Supervised learning model helps us to solve various real-world problems
such as fraud detection, spam filtering, etc.

Disadvantages of supervised learning:


o Supervised learning models are not suitable for handling the complex tasks.
o Supervised learning cannot predict the correct output if the test data is
different from the training dataset.
o Training requires a lot of computation time.
o In supervised learning, we need enough knowledge about the classes of
object.

Unsupervised Machine Learning
In the previous topic, we learned supervised machine learning in which models are
trained using labeled data under the supervision of training data. But there may be
many cases in which we do not have labeled data and need to find the hidden
patterns from the given dataset. So, to solve such types of cases in machine
learning, we need unsupervised learning techniques.

What is Unsupervised Learning?


As the name suggests, unsupervised learning is a machine learning technique in
which models are not supervised using a labelled training dataset. Instead, the
models themselves find the hidden patterns and insights in the given data. It can
be compared to the learning that takes place in the human brain while learning
new things. It can be defined as:

Unsupervised learning is a type of machine learning in which models are trained using unlabeled
dataset and are allowed to act on that data without any supervision.

Unsupervised learning cannot be directly applied to a regression or classification
problem because, unlike supervised learning, we have the input data but no
corresponding output data. The goal of unsupervised learning is to find the
underlying structure of the dataset, group the data according to similarities,
and represent the dataset in a compressed format.

Example: Suppose the unsupervised learning algorithm is given an input dataset
containing images of different types of cats and dogs. The algorithm is never
given labels for this dataset, which means it has no prior idea about the
features of the dataset. The task of the unsupervised learning algorithm is to
identify the image features on its own. The algorithm performs this task by
clustering the image dataset into groups according to the similarities between
images.

Why use Unsupervised Learning?
Below are some main reasons which describe the importance of Unsupervised
Learning:

o Unsupervised learning is helpful for finding useful insights from the data.
o Unsupervised learning is similar to how a human learns to think from their
own experiences, which makes it closer to real AI.
o Unsupervised learning works on unlabeled and uncategorized data, which
makes it more widely applicable.
o In the real world, we do not always have input data with the corresponding
output, so to solve such cases, we need unsupervised learning.

Working of Unsupervised Learning


Working of unsupervised learning can be understood by the below diagram:

Here, we have taken unlabeled input data, which means it is not categorized and
the corresponding outputs are not given. This unlabeled input data is fed to the
machine learning model in order to train it. First, the model interprets the raw
data to find hidden patterns, and then a suitable algorithm is applied, such as
k-means clustering or hierarchical clustering.

Once the suitable algorithm is applied, it divides the data objects into
groups according to the similarities and differences between the objects.
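A minimal sketch of this workflow with k-means follows. The six 2-D points are made-up data forming two visibly separate groups, and the choice of two clusters is an assumption supplied by the user, since no labels are available to the algorithm.

```python
import numpy as np
from sklearn.cluster import KMeans

# Unlabeled 2-D points forming two visibly separate groups (made-up data).
points = np.array([
    [1.0, 1.1], [0.9, 1.0], [1.2, 0.8],
    [8.0, 8.2], [8.1, 7.9], [7.8, 8.0],
])

# Ask k-means for two clusters; no labels are supplied.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(points)

print(labels)                   # cluster index per point (0/1 order is arbitrary)
print(kmeans.cluster_centers_)  # one center per discovered group
```

The cluster indices carry no meaning by themselves; only the grouping matters.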

Types of Unsupervised Learning Algorithm:


The unsupervised learning algorithm can be further categorized into two types of
problems:

o Clustering: Clustering is a method of grouping objects into clusters such
that objects with the most similarities remain in one group and have few or
no similarities with the objects of another group. Cluster analysis finds the
commonalities between the data objects and categorizes them according to the
presence and absence of those commonalities.
o Association: Association rule learning is an unsupervised learning method
used for finding relationships between variables in a large database. It
determines the sets of items that occur together in the dataset. Association
rules make marketing strategy more effective. For example, people who buy
item X (say, bread) also tend to purchase item Y (butter or jam). A typical
example of association rule learning is Market Basket Analysis.
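The bread and butter example can be quantified with two standard association-rule measures, support and confidence. The baskets below are hypothetical, and the helper functions are written for illustration only; they are not part of any library.

```python
# Hypothetical shopping baskets (transactions).
baskets = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"milk", "eggs"},
    {"bread", "butter", "jam"},
]


def support(items):
    """Fraction of baskets that contain every item in `items`."""
    return sum(items <= basket for basket in baskets) / len(baskets)


def confidence(antecedent, consequent):
    """Of the baskets containing `antecedent`, the fraction that also contain `consequent`."""
    return support(antecedent | consequent) / support(antecedent)


print(support({"bread", "butter"}))       # bread and butter together: 3 of 5 baskets
print(confidence({"bread"}, {"butter"}))  # butter given bread: 3 of the 4 bread baskets
```

A rule such as "bread implies butter" is considered interesting when both its support and its confidence are high.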

Note: We will learn these algorithms in later chapters.

Unsupervised Learning algorithms:


Below is the list of some popular unsupervised learning algorithms:

o K-means clustering
o KNN (k-nearest neighbors), when used for unsupervised neighbor search
o Hierarchical clustering
o Anomaly detection
o Neural Networks
o Principal Component Analysis
o Independent Component Analysis
o Apriori algorithm
o Singular value decomposition

Advantages of Unsupervised Learning


o Unsupervised learning can be used for more complex tasks than supervised
learning, because it does not require labeled input data.
o Unsupervised learning is often preferable because unlabeled data is easier to
obtain than labeled data.

Disadvantages of Unsupervised Learning


o Unsupervised learning is intrinsically more difficult than supervised learning
as it does not have corresponding output.
o The result of the unsupervised learning algorithm might be less accurate as
input data is not labeled, and algorithms do not know the exact output in
advance.

Difference between Supervised and Unsupervised Learning
Supervised and unsupervised learning are two techniques of machine learning, but
they are used in different scenarios and with different datasets. Both learning
methods are explained below, followed by a table of their differences.

Supervised Machine Learning:


Supervised learning is a machine learning method in which models are trained
using labeled data. In supervised learning, models need to find the mapping
function to map the input variable (X) with the output variable (Y).

Supervised learning needs supervision to train the model, much as a student
learns in the presence of a teacher. Supervised learning can be used for two
types of problems: Classification and Regression.

Example: Suppose we have images of different types of fruits. The task of our
supervised learning model is to identify the fruits and classify them accordingly.
To do this, we give the model the input data as well as the output for it, which
means we train the model on the shape, size, color, and taste of each fruit
together with its label. Once the training is completed, we test the model by
giving it a new set of fruits. The model identifies each fruit and predicts the
output using a suitable algorithm.

Unsupervised Machine Learning:


Unsupervised learning is another machine learning method, in which patterns are
inferred from unlabeled input data. The goal of unsupervised learning is to find
the structure and patterns in the input data. Unsupervised learning does not need
any supervision; instead, it finds patterns in the data on its own.

Unsupervised learning can be used for two types of problems: Clustering and
Association.

Example: To understand the unsupervised learning, we will use the example given
above. So unlike supervised learning, here we will not provide any supervision to
the model. We will just provide the input dataset to the model and allow the model
to find the patterns from the data. With the help of a suitable algorithm, the model
will train itself and divide the fruits into different groups according to the most
similar features between them.

The main differences between Supervised and Unsupervised learning are given
below:

Supervised Learning | Unsupervised Learning
Supervised learning algorithms are trained using labeled data. | Unsupervised learning algorithms are trained using unlabeled data.
The model takes direct feedback to check whether it is predicting the correct output. | The model does not take any feedback.
The model predicts the output. | The model finds the hidden patterns in data.
Input data is provided to the model along with the output. | Only input data is provided to the model.
The goal is to train the model so that it can predict the output when it is given new data. | The goal is to find the hidden patterns and useful insights in an unknown dataset.
Needs supervision to train the model. | Does not need any supervision to train the model.
Can be categorized into Classification and Regression problems. | Can be categorized into Clustering and Association problems.
Used for cases where we know the input as well as the corresponding outputs. | Used for cases where we have only input data and no corresponding output data.
Generally produces a more accurate result. | May give a less accurate result than supervised learning.
Is further from true Artificial Intelligence, because the model must first be trained on labeled data before it can predict the correct output. | Is closer to true Artificial Intelligence, because it learns much as a child learns everyday things from experience.
Includes algorithms such as Linear Regression, Logistic Regression, Support Vector Machine, Multi-class Classification, Decision tree, Bayesian Logic, etc. | Includes algorithms such as Clustering, KNN, and the Apriori algorithm.

Note: Supervised and unsupervised learning are both machine learning methods, and the choice
between them depends on the structure and volume of your dataset and on the use case of the
problem.

Algorithms
Regression Analysis in Machine learning
Regression analysis is a statistical method to model the relationship between a
dependent (target) variable and one or more independent (predictor) variables.
More specifically, regression analysis helps us understand how the value of the
dependent variable changes with one independent variable when the other
independent variables are held fixed. It predicts continuous/real values such as
temperature, age, salary, price, etc.

We can understand the concept of regression analysis using the below example:

Example: Suppose there is a marketing company A, which runs various
advertisements every year and gets sales from them. The below list shows the
advertisement spend of the company in the last 5 years and the corresponding
sales:

Now, the company wants to spend $200 on advertisement in the year 2019 and
wants to predict the sales for that year. To solve such prediction problems in
machine learning, we need regression analysis.

Regression is a supervised learning technique which helps in finding the
correlation between variables and enables us to predict a continuous output
variable based on one or more predictor variables. It is mainly used for
prediction, forecasting, time series modeling, and determining the
cause-and-effect relationship between variables.

In regression, we fit a line or curve to the given datapoints, and the machine
learning model uses this fit to make predictions about the data. In simple words,
"Regression finds the line or curve on the target-predictor graph for which the
vertical distance between the datapoints and the regression line is minimum."
The distance between the datapoints and the line tells whether the model has
captured a strong relationship or not.
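The idea of minimum vertical distance can be made concrete with a small sketch
that scores two candidate lines on the same made-up points. The function name
sum_squared_distances and the data are assumptions for illustration; squaring
the distances is the usual least-squares convention.

```python
def sum_squared_distances(xs, ys, slope, intercept):
    """Total squared vertical distance between the points and a line."""
    return sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))


xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]  # made-up points, roughly y = 2x

close_fit = sum_squared_distances(xs, ys, slope=2.0, intercept=0.0)
poor_fit = sum_squared_distances(xs, ys, slope=1.0, intercept=1.0)
print(close_fit, poor_fit)
```

The line with the smaller total is the better fit; a regression algorithm
searches for the line that makes this total as small as possible.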

Some examples of regression can be as:

o Prediction of rain using temperature and other factors


o Determining Market trends
o Prediction of road accidents due to rash driving.

Terminologies Related to the Regression Analysis:


o Dependent Variable: The main factor in regression analysis which we
want to predict or understand is called the dependent variable. It is also
called the target variable.
o Independent Variable: The factors which affect the dependent variable, or
which are used to predict its values, are called independent variables, also
called predictors.
o Outliers: An outlier is an observation with either a very low or a very high
value in comparison to the other observed values. An outlier may distort the
result, so it should be dealt with before fitting.
o Multicollinearity: If the independent variables are highly correlated with
each other, the condition is called multicollinearity. It should not be
present in the dataset, because it creates problems when ranking the most
influential variables.
o Underfitting and Overfitting: If our algorithm works well with the training
dataset but not with the test dataset, the problem is called overfitting. If
our algorithm does not perform well even with the training dataset, the
problem is called underfitting.
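One simple way to spot possible multicollinearity is to compute the correlation
between pairs of independent variables. The sketch below uses the Pearson
correlation coefficient on invented housing features; the function name pearson
and the feature values are assumptions for illustration.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Invented predictor columns for a house-price model.
area_sqft = [800, 1000, 1200, 1500, 1800]
rooms = [2, 3, 3, 4, 5]
age_years = [10, 3, 25, 8, 15]

print(pearson(area_sqft, rooms))      # strongly correlated predictors
print(pearson(area_sqft, age_years))  # weakly correlated predictors
```

A coefficient close to 1 or -1 between two predictors suggests that they carry
largely the same information.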

Why do we use Regression Analysis?
As mentioned above, regression analysis helps in the prediction of a continuous
variable. There are various real-world scenarios where we need future
predictions, such as weather conditions, sales, and marketing trends. For such
cases we need a technique which can make accurate predictions, and regression
analysis, a statistical method used in machine learning and data science, is one
such technique. Below are some other reasons for using regression analysis:

o Regression estimates the relationship between the target and the
independent variables.
o It is used to find the trends in data.
o It helps to predict real/continuous values.
o By performing regression, we can estimate which factors are most
important, which are least important, and how each factor affects the
target.

Types of Regression
There are various types of regression used in data science and machine
learning. Each type has its own importance in different scenarios, but at the
core, all regression methods analyze the effect of the independent variables on
the dependent variable. Here we discuss some important types of regression,
which are given below:

o Linear Regression
o Logistic Regression
o Polynomial Regression
o Support Vector Regression
o Decision Tree Regression
o Random Forest Regression
o Ridge Regression
o Lasso Regression

Linear Regression:

o Linear regression is a statistical regression method which is used for
predictive analysis.
o It is one of the simplest algorithms; it works on regression and shows the
relationship between continuous variables.
o It is used for solving regression problems in machine learning.
o Linear regression shows the linear relationship between the independent
variable (X-axis) and the dependent variable (Y-axis), hence the name linear
regression.
o If there is only one input variable (x), the linear regression is
called simple linear regression. If there is more than one input
variable, it is called multiple linear regression.

o The relationship between variables in the linear regression model can be
explained using the below image. Here we are predicting the salary of an
employee on the basis of the year of experience.

o Below is the mathematical equation for Linear regression:

$Y = aX + b$

Here, Y = dependent variable (target variable),
X = independent variable (predictor variable),
a and b are the linear coefficients.
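For a single input variable, the coefficients a and b can be estimated with the
standard least-squares formulas. The sketch below is a minimal illustration; the
function name fit_line and the experience and salary numbers are invented.

```python
def fit_line(xs, ys):
    """Least-squares estimates of a (slope) and b (intercept) in Y = aX + b."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    a = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    b = mean_y - a * mean_x
    return a, b


years = [1, 2, 3, 4, 5]
salary = [30, 35, 40, 45, 50]  # invented values in thousands, exactly linear

a, b = fit_line(years, salary)
print(a, b)       # slope and intercept
print(a * 6 + b)  # predicted salary for 6 years of experience
```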

Some popular applications of linear regression are:

o Analyzing trends and sales estimates


o Salary forecasting
o Real estate prediction
o Arriving at ETAs in traffic.

Logistic Regression:

o Logistic regression is another supervised learning algorithm, which is used to
solve classification problems. In classification problems, the dependent
variable is in a binary or discrete format, such as 0 or 1.
o The logistic regression algorithm works with categorical variables such as 0 or
1, Yes or No, True or False, Spam or not spam, etc.
o It is a predictive analysis algorithm which works on the concept of
probability.
o Logistic regression is a type of regression, but it differs from the linear
regression algorithm in how it is used.
o Logistic regression uses the sigmoid function, also called the logistic
function, which maps any real-valued input to a value between 0 and 1. This
sigmoid function is used to model the data in logistic regression. The
function can be represented as:

$f(x) = \frac{1}{1 + e^{-x}}$

o f(x) = output between 0 and 1
o x = input to the function
o e = base of the natural logarithm

When we provide the input values (data) to the function, it gives the S-curve as
follows:

o It uses the concept of threshold levels: values above the threshold level are
rounded up to 1, and values below the threshold level are rounded down to 0.

There are three types of logistic regression:

o Binary (0/1, pass/fail)
o Multinomial (cats, dogs, lions)
o Ordinal (low, medium, high)
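The sigmoid function and the threshold rule described above can be sketched in
a few lines. The function names sigmoid and classify, and the default threshold
of 0.5, are assumptions chosen for illustration.

```python
import math


def sigmoid(x):
    """Map any real number to a value between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-x))


def classify(x, threshold=0.5):
    """Return class 1 if the sigmoid output reaches the threshold, else 0."""
    return 1 if sigmoid(x) >= threshold else 0


for value in (-4, -1, 0, 1, 4):
    print(value, sigmoid(value), classify(value))
```

Large positive inputs give outputs near 1, large negative inputs give outputs
near 0, and an input of 0 sits exactly at 0.5, which produces the S-shaped curve.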

Polynomial Regression:

o Polynomial Regression is a type of regression which models a non-linear
dataset using a linear model.
o It is similar to multiple linear regression, but it fits a non-linear curve
between the value of x and the corresponding conditional values of y.
o Suppose there is a dataset whose datapoints are arranged in a non-linear
fashion. In such a case, linear regression will not fit those datapoints
well, and we need Polynomial regression to cover them.
o In Polynomial regression, the original features are transformed into
polynomial features of a given degree and then modeled using a linear model,
which means the datapoints are best fitted using a polynomial curve.
o The equation for polynomial regression is also derived from the linear
regression equation: the linear regression equation $Y = b_0 + b_1 x$ is
transformed into the polynomial regression equation
$Y = b_0 + b_1 x + b_2 x^2 + b_3 x^3 + \dots + b_n x^n$.
o Here Y is the predicted/target output, $b_0, b_1, \dots, b_n$ are the regression
coefficients, and x is our independent/input variable.
o The model is still linear because it is linear in the coefficients, even
though the features are quadratic or of higher degree.

Note: This differs from Multiple Linear regression in that, in Polynomial regression, a single
variable is raised to different degrees instead of several variables each appearing with degree one.
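The transformation into polynomial features, and the fact that the prediction
stays a weighted sum of those features, can be sketched as follows. The function
names polynomial_features and predict and the coefficient values are
hypothetical.

```python
def polynomial_features(x, degree):
    """Expand one input x into the features [1, x, x^2, ..., x^degree]."""
    return [x ** power for power in range(degree + 1)]


def predict(x, coefficients):
    """Weighted sum b0*1 + b1*x + b2*x^2 + ...; linear in the coefficients."""
    features = polynomial_features(x, len(coefficients) - 1)
    return sum(b * f for b, f in zip(coefficients, features))


coefficients = [1.0, 2.0, 3.0]    # hypothetical b0, b1, b2
print(polynomial_features(2, 3))  # [1, 2, 4, 8]
print(predict(2, coefficients))   # 1 + 2*2 + 3*4 = 17.0
```

Because the model is only a weighted sum of the expanded features, the same
fitting machinery used for linear regression can estimate the coefficients.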

Support Vector Regression:


Support Vector Machine is a supervised learning algorithm which can be used for
regression as well as classification problems. When it is used for regression
problems, it is termed Support Vector Regression.

Support Vector Regression is a regression algorithm which works for continuous
variables. Below are some keywords which are used in Support Vector
Regression:

o Kernel: It is a function used to map lower-dimensional data into
higher-dimensional data.
o Hyperplane: In a general SVM, it is the separation line between two classes,
but in SVR, it is the line which helps to predict the continuous variable and
covers most of the datapoints.
o Boundary line: Boundary lines are the two lines on either side of the
hyperplane, which create a margin for the datapoints.
o Support vectors: Support vectors are the datapoints which determine the
position of the hyperplane and the boundary lines; in SVR these are the
points lying on or outside the boundary lines.

In SVR, we try to determine a hyperplane with a margin such that the maximum
number of datapoints are covered within that margin. The main goal of
SVR is to include as many datapoints as possible within the boundary lines,
with the hyperplane (best-fit line) passing as close as possible to the
datapoints. Consider the below image:

Here, the blue line is called hyperplane, and the other two lines are known as
boundary lines.
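The role of the boundary lines can be illustrated with a small sketch that counts
how many points fall within a margin around a candidate line and adds up how far
the remaining points lie outside it. The margin half-width epsilon, the function
names, and the data are assumptions introduced for this example; the tutorial
itself does not name these quantities.

```python
def inside_margin(xs, ys, slope, intercept, epsilon):
    """Count points whose vertical distance to the line is at most epsilon."""
    return sum(
        1 for x, y in zip(xs, ys) if abs(y - (slope * x + intercept)) <= epsilon
    )


def outside_margin_loss(xs, ys, slope, intercept, epsilon):
    """Sum of the distances by which points fall outside the boundary lines."""
    return sum(
        max(0.0, abs(y - (slope * x + intercept)) - epsilon)
        for x, y in zip(xs, ys)
    )


xs = [1, 2, 3, 4, 5]
ys = [2.2, 3.9, 6.0, 8.5, 9.8]  # made-up points near y = 2x

covered = inside_margin(xs, ys, slope=2.0, intercept=0.0, epsilon=0.3)
loss = outside_margin_loss(xs, ys, slope=2.0, intercept=0.0, epsilon=0.3)
print(covered, loss)
```

Points inside the margin contribute nothing to the loss, so only the points on
or beyond the boundary lines influence where the line ends up.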

Decision Tree Regression:

o Decision Tree is a supervised learning algorithm which can be used for
solving both classification and regression problems.
o It can solve problems for both categorical and numerical data.
o Decision Tree regression builds a tree-like structure in which each internal
node represents the "test" for an attribute, each branch represents the result
of the test, and each leaf node represents the final decision or result.
o A decision tree is constructed starting from the root node/parent node
(dataset), which splits into left and right child nodes (subsets of the
dataset). These child nodes are further divided into their own child nodes
and themselves become the parent nodes of those nodes. Consider the below
image:

The above image shows an example of Decision Tree regression. Here, the model is
trying to predict the choice of a person between a sports car and a luxury car.
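A single split of a regression tree can be sketched as below: the root node
tests one attribute against a threshold, and each leaf predicts the mean target
of the rows that reach it. This is a deliberately tiny, one-level illustration
with invented data and hypothetical function names, not a full tree-building
algorithm.

```python
def mean(values):
    return sum(values) / len(values)


def squared_error(values):
    """Sum of squared deviations from the mean of the values."""
    m = mean(values)
    return sum((v - m) ** 2 for v in values)


def best_split(xs, ys):
    """Find the threshold on x that minimises the squared error of both leaves."""
    best = None
    # Skip the smallest x so that the left leaf is never empty.
    for threshold in sorted(set(xs))[1:]:
        left = [y for x, y in zip(xs, ys) if x < threshold]
        right = [y for x, y in zip(xs, ys) if x >= threshold]
        error = squared_error(left) + squared_error(right)
        if best is None or error < best[0]:
            best = (error, threshold, mean(left), mean(right))
    return best[1:]


def predict(x, threshold, left_value, right_value):
    """Follow the branch chosen by the test and return the leaf value."""
    return left_value if x < threshold else right_value


xs = [1, 2, 3, 10, 11, 12]
ys = [5.0, 6.0, 7.0, 20.0, 21.0, 22.0]  # invented data with two clear groups
threshold, left_value, right_value = best_split(xs, ys)
print(threshold, left_value, right_value)
print(predict(2, threshold, left_value, right_value))
```

A full decision tree repeats this split search inside each child node until a
stopping rule is reached.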

Random Forest Regression:

o Random forest is one of the most powerful supervised learning algorithms,
capable of performing regression as well as classification tasks.
o Random Forest regression is an ensemble learning method which
combines multiple decision trees and predicts the final output based on the
average of the tree outputs. The combined decision trees are called base
models, and the ensemble can be represented more formally as:

$g(x) = f_0(x) + f_1(x) + f_2(x) + \dots$

o Random forest uses the Bagging (Bootstrap Aggregation) technique of
ensemble learning, in which the aggregated decision trees run in parallel and
do not interact with each other.
o With the help of Random Forest regression, we can reduce overfitting in the
model by training each tree on a random subset of the dataset.
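Bagging can be sketched as follows: each base model is trained on a bootstrap
sample (rows drawn with replacement), and the final prediction is the average of
the individual outputs. To keep the example short, the base model here is a
one-split tree called fit_stump; that name, the data, and the seed are
assumptions, and a real random forest grows deeper trees and also samples
features.

```python
import random


def fit_stump(xs, ys):
    """Fit a one-split regression tree and return it as a prediction function."""
    overall = sum(ys) / len(ys)
    best = None
    for t in sorted(set(xs))[1:]:
        left = [y for x, y in zip(xs, ys) if x < t]
        right = [y for x, y in zip(xs, ys) if x >= t]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        err = sum((y - lm) ** 2 for y in left) + sum((y - rm) ** 2 for y in right)
        if best is None or err < best[0]:
            best = (err, t, lm, rm)
    if best is None:  # the sample contained a single distinct x value
        return lambda x: overall
    _, t, lm, rm = best
    return lambda x: lm if x < t else rm


def fit_forest(xs, ys, n_trees=25, seed=0):
    """Train each tree on its own bootstrap sample of the rows."""
    rng = random.Random(seed)
    n = len(xs)
    trees = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]  # draw with replacement
        trees.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return trees


def predict_forest(trees, x):
    """Average the outputs of the individual trees."""
    return sum(tree(x) for tree in trees) / len(trees)


xs = [1, 2, 3, 10, 11, 12]
ys = [5.0, 6.0, 7.0, 20.0, 21.0, 22.0]  # invented data
forest = fit_forest(xs, ys)
print(predict_forest(forest, 2), predict_forest(forest, 11))
```

Because every tree sees a slightly different sample, averaging their outputs
smooths out the quirks of any single tree.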

Ridge Regression:

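o Ridge regression is one of the most robust versions of linear regression, in
which a small amount of bias is introduced so that we can get better
long-term predictions.
o The amount of bias added to the model is known as the Ridge Regression
penalty. We compute this penalty term by multiplying lambda by the sum of
the squared weights of the individual features.
o The equation for ridge regression will be:

$\text{Cost} = \sum_{i=1}^{N} (Y_i - \hat{Y}_i)^2 + \lambda \sum_{j=1}^{n} b_j^2$

Here $Y_i$ is the actual value, $\hat{Y}_i$ is the predicted value, $b_j$ are the
regression coefficients, and $\lambda$ controls the strength of the penalty.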

o A general linear or polynomial regression will fail if there is high
collinearity between the independent variables; Ridge regression can be used
to solve such problems.
o Ridge regression is a regularization technique, which is used to reduce the
complexity of the model. It is also called L2 regularization.
o It helps to solve problems where we have more parameters than samples.
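The shrinking effect of the ridge penalty can be seen in the simplest possible
case: one feature and no intercept. Under that assumption, minimising the sum of
squared errors plus lambda times the squared slope has a closed-form solution.
The function name ridge_slope and the data are invented for this sketch.

```python
def ridge_slope(xs, ys, lam):
    """Slope w minimising sum((y - w*x)^2) + lam * w^2 for a no-intercept model."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)


xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]  # exactly y = 2x

for lam in (0.0, 1.0, 10.0):
    print(lam, ridge_slope(xs, ys, lam))
```

With lambda equal to 0 the ordinary least-squares slope is recovered; larger
values pull the slope toward 0 without making it exactly 0.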

Lasso Regression:

o Lasso regression is another regularization technique to reduce the
complexity of the model.
o It is similar to Ridge Regression, except that the penalty term contains the
absolute values of the weights instead of their squares.
o Since it uses absolute values, it can shrink a slope all the way to 0, whereas
Ridge Regression can only shrink it close to 0.
o It is also called L1 regularization. The equation for Lasso regression will
be:

$\text{Cost} = \sum_{i=1}^{N} (Y_i - \hat{Y}_i)^2 + \lambda \sum_{j=1}^{n} |b_j|$
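To illustrate how an absolute-value penalty can shrink a slope exactly to 0, the
sketch below again assumes one feature and no intercept. For that case,
minimising the sum of squared errors plus lambda times the absolute slope leads
to a soft-thresholding rule. The function name lasso_slope and the data are
invented.

```python
def lasso_slope(xs, ys, lam):
    """Slope w minimising sum((y - w*x)^2) + lam * |w| for a no-intercept model."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    # Soft-thresholding: shrink |sxy| by lam / 2, stopping at exactly zero.
    shrunk = max(abs(sxy) - lam / 2.0, 0.0)
    return (shrunk if sxy >= 0 else -shrunk) / sxx


xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]  # exactly y = 2x

for lam in (0.0, 28.0, 56.0, 100.0):
    print(lam, lasso_slope(xs, ys, lam))
```

Once lambda is large enough the slope becomes exactly 0, which is why Lasso can
remove a feature from the model entirely.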

Linear Regression in Machine Learning
Linear regression is one of the easiest and most popular Machine Learning
algorithms. It is a statistical method that is used for predictive analysis. Linear
regression makes predictions for continuous/real or numeric variables such
as sales, salary, age, product price, etc.

The linear regression algorithm shows a linear relationship between a dependent
(y) variable and one or more independent (x) variables, hence the name linear
regression. Because the relationship is linear, the model describes how the
value of the dependent variable changes according to the value of the
independent variable.

The linear regression model provides a sloped straight line representing the
relationship between the variables. Consider the below image:

Mathematically, we can represent a linear regression as:

$y = a_0 + a_1 x + \varepsilon$

Here,

y = Dependent Variable (Target Variable)
x = Independent Variable (Predictor Variable)
$a_0$ = intercept of the line (gives an additional degree of freedom)
$a_1$ = linear regression coefficient (scale factor applied to each input value)
$\varepsilon$ = random error

The values of the x and y variables are the training data for the Linear
Regression model.

Types of Linear Regression


Linear regression can be further divided into two types of algorithm:

o Simple Linear Regression:
If a single independent variable is used to predict the value of a numerical
dependent variable, then such a Linear Regression algorithm is called Simple
Linear Regression.
o Multiple Linear Regression:
If more than one independent variable is used to predict the value of a
numerical dependent variable, then such a Linear Regression algorithm is
called Multiple Linear Regression.

Linear Regression Line


A straight line showing the relationship between the dependent and independent
variables is called a regression line. A regression line can show two types of
relationship:

o Positive Linear Relationship:
If the dependent variable increases on the Y-axis as the independent variable
increases on the X-axis, the relationship is termed a positive linear
relationship.

o Negative Linear Relationship:
If the dependent variable decreases on the Y-axis as the independent variable
increases on the X-axis, the relationship is called a negative linear
relationship.

Finding the best fit line:
When working with linear regression, our main goal is to find the best fit line,
which means the error between the predicted values and the actual values should
be minimized. The best fit line will have the least error.

Different values for the weights or coefficients of the line ($a_0$, $a_1$) give
different regression lines, so we need to calculate the best values for $a_0$ and
$a_1$ to find the best fit line. To calculate them, we use a cost function.

Cost function-

o Different values for the weights or coefficients of the line ($a_0$, $a_1$) give
different regression lines, and the cost function is used to estimate the
values of the coefficients for the best fit line.
o The cost function is used to optimize the regression coefficients or weights.
It measures how well a linear regression model is performing.
o We can use the cost function to find the accuracy of the mapping function,
which maps the input variable to the output variable. This mapping function
is also known as the Hypothesis function.

For Linear Regression, we use the Mean Squared Error (MSE) cost function,
which is the average of the squared errors between the predicted values and the
actual values. For the above linear equation, MSE can be calculated as:

$MSE = \frac{1}{N} \sum_{i=1}^{N} \left( Y_i - (a_1 x_i + a_0) \right)^2$

Where,

N = Total number of observations
$Y_i$ = Actual value
$(a_1 x_i + a_0)$ = Predicted value.

Residuals: The distance between the actual value and the predicted value is
called the residual. If the observed points are far from the regression line,
the residuals will be large, and so the cost function will be high. If the
scatter points are close to the regression line, the residuals will be small,
and hence the cost function will be low.
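The residuals and the Mean Squared Error for a given line can be computed as in
the sketch below. The function names and the four data points are invented; a0
and a1 follow the notation used above for the intercept and the slope.

```python
def residuals(xs, ys, a0, a1):
    """Actual value minus predicted value (a1 * x + a0) for every observation."""
    return [y - (a1 * x + a0) for x, y in zip(xs, ys)]


def mean_squared_error(xs, ys, a0, a1):
    """Average of the squared residuals."""
    errors = residuals(xs, ys, a0, a1)
    return sum(e ** 2 for e in errors) / len(errors)


xs = [1, 2, 3, 4]
ys = [3, 5, 7, 10]  # invented observations

print(residuals(xs, ys, a0=1, a1=2))           # [0, 0, 0, 1]
print(mean_squared_error(xs, ys, a0=1, a1=2))  # 0.25
```

Only the last point lies off the line y = 2x + 1, so it is the only one that
contributes to the cost.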

Gradient Descent:

o Gradient descent is used to minimize the MSE by calculating the gradient of
the cost function.
o A regression model uses gradient descent to update the coefficients of the
line by reducing the cost function.
o It starts from randomly selected coefficient values and then iteratively
updates them to reach the minimum of the cost function.
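The update loop can be sketched for the two coefficients a0 and a1 using the
gradient of the MSE. The learning rate, the number of steps, and the choice to
start from zero instead of random values are assumptions made so that the
example is reproducible.

```python
def gradient_descent(xs, ys, learning_rate=0.05, steps=2000):
    """Minimise the MSE of y = a1 * x + a0 by repeated gradient steps."""
    a0, a1 = 0.0, 0.0  # assumption: start from zero rather than random values
    n = len(xs)
    for _ in range(steps):
        # Prediction error for every observation under the current line.
        errors = [(a1 * x + a0) - y for x, y in zip(xs, ys)]
        # Partial derivatives of the MSE with respect to a0 and a1.
        grad_a0 = 2.0 / n * sum(errors)
        grad_a1 = 2.0 / n * sum(e * x for e, x in zip(errors, xs))
        # Step against the gradient to reduce the cost.
        a0 -= learning_rate * grad_a0
        a1 -= learning_rate * grad_a1
    return a0, a1


xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]  # invented data lying exactly on y = 2x + 1

a0, a1 = gradient_descent(xs, ys)
print(a0, a1)
```

If the learning rate is too large the updates overshoot and diverge; if it is
too small, many more steps are needed to reach the minimum.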

Model Performance:
The goodness of fit determines how well the line of regression fits the set of
observations. The process of finding the best model out of various models is
called optimization. It can be assessed by the below method:

1. R-squared method:

o R-squared is a statistical measure that determines the goodness of fit.
o It measures the strength of the relationship between the dependent and
independent variables on a scale of 0-100%.
o A high value of R-squared indicates a small difference between the
predicted values and the actual values, and hence represents a good model.
o It is also called the coefficient of determination, or the coefficient of
multiple determination for multiple regression.
o It can be calculated from the below formula:

$R^2 = 1 - \frac{\sum_{i} (Y_i - \hat{Y}_i)^2}{\sum_{i} (Y_i - \bar{Y})^2}$

Here $Y_i$ is the actual value, $\hat{Y}_i$ is the predicted value, and $\bar{Y}$ is
the mean of the actual values.
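A small sketch of how R-squared could be computed, assuming the standard
definition of one minus the ratio of the residual sum of squares to the total
sum of squares. The function name r_squared and the numbers are invented.

```python
def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_residual / SS_total."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


actual = [3.0, 5.0, 7.0, 10.0]
predicted = [3.0, 5.0, 7.0, 9.0]  # invented predictions, one of them off by 1

print(r_squared(actual, predicted))
print(r_squared(actual, actual))  # a perfect fit gives 1.0
```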

Assumptions of Linear Regression


Below are some important assumptions of Linear Regression. These are formal
checks to make while building a Linear Regression model, which help to get the
best possible result from the given dataset.

o Linear relationship between the features and target:
Linear regression assumes a linear relationship between the dependent
and independent variables.
o Small or no multicollinearity between the features:
Multicollinearity means high correlation between the independent variables.
Due to multicollinearity, it may be difficult to find the true relationship
between the predictors and the target variable. In other words, it is
difficult to determine which predictor variable is affecting the target
variable and which is not. So, the model assumes either little or no
multicollinearity between the features or independent variables.
o Homoscedasticity Assumption:
Homoscedasticity is the situation in which the variance of the error term is
the same for all values of the independent variables. With homoscedasticity,
there should be no clear pattern in the scatter plot of the residuals.
o Normal distribution of error terms:
Linear regression assumes that the error term follows the normal
distribution. If the error terms are not normally distributed, the
confidence intervals will become either too wide or too narrow, which may
cause difficulties in interpreting the coefficients.
It can be checked using the q-q plot. If the plot shows a straight line without
any deviation, the error is normally distributed.
o No autocorrelations:
The linear regression model assumes no autocorrelation in the error terms. If
there is any correlation in the error term, it will drastically reduce
the accuracy of the model. Autocorrelation usually occurs if there is a
dependency between residual errors.
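One common check for the autocorrelation assumption is the Durbin-Watson
statistic, which is not mentioned in this tutorial and is introduced here only
as an illustration. Values near 2 suggest no first-order autocorrelation, values
well below 2 suggest positive autocorrelation, and values well above 2 suggest
negative autocorrelation. The residual lists below are invented.

```python
def durbin_watson(residuals):
    """Durbin-Watson statistic computed from an ordered list of residuals."""
    numerator = sum(
        (residuals[t] - residuals[t - 1]) ** 2 for t in range(1, len(residuals))
    )
    denominator = sum(e ** 2 for e in residuals)
    return numerator / denominator


trending = [1.0, 1.0, 1.0, -1.0, -1.0, -1.0]  # neighbours resemble each other
alternating = [1.0, -1.0, 1.0, -1.0]          # neighbours have opposite signs

print(durbin_watson(trending))     # well below 2
print(durbin_watson(alternating))  # well above 2
```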

Simple Linear Regression in Machine Learning


Simple Linear Regression is a type of regression algorithm that models the
relationship between a dependent variable and a single independent variable. The
relationship shown by a Simple Linear Regression model is linear, that is, a
sloped straight line, hence the name Simple Linear Regression.

The key point in Simple Linear Regression is that the dependent variable must
be a continuous/real value. However, the independent variable can be
measured on continuous or categorical values.

The Simple Linear Regression algorithm has two main objectives:

o Model the relationship between the two variables, such as the
relationship between income and expenditure, experience and salary, etc.
o Forecast new observations, such as the weather according to
temperature, the revenue of a company according to its investments in a year,
etc.

Simple Linear Regression Model:


The Simple Linear Regression model can be represented using the below equation:

$y = a_0 + a_1 x + \varepsilon$

Where,

$a_0$ = the intercept of the regression line (can be obtained by putting x = 0)
$a_1$ = the slope of the regression line, which tells whether the line is
increasing or decreasing
$\varepsilon$ = the error term (for a good model it will be negligible)

Implementation of Simple Linear Regression Algorithm using Python
Problem Statement example for Simple Linear Regression:

Here we are taking a dataset that has two variables: salary (dependent variable)
and experience (independent variable). The goals of this problem are:

o To find out whether there is any correlation between these two variables.
o To find the best fit line for the dataset.
o To see how the dependent variable changes as the independent variable
changes.

In this section, we will create a Simple Linear Regression model to find the best
fitting line for representing the relationship between these two variables.

To implement the Simple Linear regression model in machine learning using
Python, we need to follow the below steps:

Step-1: Data Pre-processing

The first step for creating the Simple Linear Regression model is data
pre-processing. We have already done it earlier in this tutorial, but there will
be some changes, which are given in the below steps:

o First, we will import the three important libraries, which will help us with
loading the dataset, plotting the graphs, and creating the Simple Linear
Regression model.

    import numpy as nm               # numerical arrays
    import matplotlib.pyplot as mtp  # plotting
    import pandas as pd              # loading and handling the dataset

o Next, we will load the dataset into our code:

    data_set = pd.read_csv('Salary_Data.csv')

After executing the above line of code (ctrl+ENTER), we can view the dataset in
the Spyder IDE by clicking on the variable explorer option.

The above output shows the dataset, which has two variables: Salary and
Experience.

Note: In Spyder IDE, the folder containing the code file must be saved as a working directory, and
the dataset or csv file should be in the same folder.

o After that, we need to extract the dependent and independent variables from
the given dataset. The independent variable is years of experience, and the
dependent variable is salary. Below is code for it:

    x = data_set.iloc[:, :-1].values  # every column except the last (experience)
    y = data_set.iloc[:, 1].values    # the second column (salary)

In the above lines of code, for the x variable we have used :-1, since we want to
take every column except the last one. For the y variable, we have used 1 as the
index, since we want to extract the second column and indexing starts from
zero.

By executing the above line of code, we will get the output for X and Y variable as:

In the above output image, we can see the X (independent) variable and Y
(dependent) variable has been extracted from the given dataset.
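The effect of the two index expressions can be imitated with plain Python lists,
which follow the same slicing rules. This is only an analogy to the pandas code:
the rows below are invented and stand in for the experience and salary columns.

```python
# Each row is [years_of_experience, salary]; the values are invented.
rows = [
    [1.0, 40000.0],
    [2.0, 45000.0],
    [3.0, 50000.0],
]

# ":-1" keeps everything except the last column, so each x entry is still a list.
x = [row[:-1] for row in rows]

# Index 1 picks the second column, because counting starts at zero.
y = [row[1] for row in rows]

print(x)  # [[1.0], [2.0], [3.0]]
print(y)  # [40000.0, 45000.0, 50000.0]
```

Note that x stays two-dimensional (one inner list per row) while y is a flat
list, which is the shape that scikit-learn estimators expect for the features
and the target.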

o Next, we will split both variables into a training set and a test set. We
have 30 observations, so we will take 20 observations for the training set and
10 observations for the test set. We are splitting our dataset so that we can
train our model using the training dataset and then test the model using the
test dataset. The code for this is given below:

    # Splitting the dataset into training and test set.
    from sklearn.model_selection import train_test_split
    x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=1/3, random_state=0)

By executing the above code, we will get x-test, x-train and y-test, y-train dataset.
Consider the below images:

Test-dataset:

Training Dataset:

o For simple linear regression, we will not use feature scaling, because an
ordinary least-squares fit gives the same predictions whether or not the single
input feature is rescaled. Our dataset is now prepared, and we can start
building a Simple Linear Regression model for the given problem.
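The train/test split used in this step can be imitated in plain Python to show
what happens to the 30 observations. This is a simplified stand-in, not the
scikit-learn implementation: the function name split_train_test, the seed, and
the generated data are assumptions.

```python
import random


def split_train_test(x, y, test_fraction, seed=0):
    """Shuffle the row indices and cut them into a training and a test part."""
    indices = list(range(len(x)))
    random.Random(seed).shuffle(indices)
    n_test = round(len(x) * test_fraction)
    test_idx, train_idx = indices[:n_test], indices[n_test:]

    def pick(seq, idx):
        return [seq[i] for i in idx]

    return pick(x, train_idx), pick(x, test_idx), pick(y, train_idx), pick(y, test_idx)


# 30 invented observations where y is exactly twice x.
x = [[float(i)] for i in range(30)]
y = [2.0 * i for i in range(30)]

x_train, x_test, y_train, y_test = split_train_test(x, y, test_fraction=1 / 3)
print(len(x_train), len(x_test))
```

The same shuffled indices are applied to x and y, so every input row stays
paired with its own target value.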

Step-2: Fitting the Simple Linear Regression to the Training Set:

The second step is to fit our model to the training dataset. To do so, we will
import the LinearRegression class from the linear_model module of scikit-learn.
After importing the class, we are going to create an object of the class
named regressor. The code for this is given below:

    # Fitting the Simple Linear Regression model to the training dataset
    from sklearn.linear_model import LinearRegression
    regressor = LinearRegression()
    regressor.fit(x_train, y_train)

In the above code, we have used the fit() method to fit our Simple Linear
Regression object to the training set. To the fit() method we have passed x_train
and y_train, which are our training data for the independent and the dependent
variable respectively. We have fitted our regressor object to the training set so
that the model can learn the correlation between the predictor and target
variables. After executing the above lines of code, we will get the below output.

Output:

Out[7]: LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

Step-3: Prediction of test set result:

Our model has now learned the relationship between the dependent variable
(salary) and the independent variable (experience), so it is ready to predict the
output for new observations. In this step, we will provide the test dataset (new
observations) to the model to check whether it can predict the correct output or
not.

We will create two prediction vectors, y_pred and x_pred, which will contain the
predictions for the test dataset and the training set respectively.

    # Prediction of Test and Training set result
    y_pred = regressor.predict(x_test)
    x_pred = regressor.predict(x_train)

On executing the above lines of code, two variables named y_pred and x_pred will
appear in the variable explorer; they contain the salary predictions for the test
set and the training set respectively.

Output:

You can check the variables by clicking on the variable explorer option in the IDE,
and compare the results by looking at the values of y_pred and y_test side by side.
By comparing these values, we can check how well our model is performing.
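
Instead of comparing the two vectors by eye, the comparison can be summarized in one or two numbers. The sketch below, using made-up values (y_test_demo and y_pred_demo are hypothetical stand-ins for y_test and y_pred), computes the mean absolute error and the root mean squared error with NumPy.

```python
import numpy as np

# Hypothetical actual and predicted salaries, invented for illustration
y_test_demo = np.array([40000.0, 60000.0, 80000.0, 100000.0])
y_pred_demo = np.array([42000.0, 59000.0, 83000.0, 96000.0])

errors = y_test_demo - y_pred_demo       # one error per observation
mae = np.mean(np.abs(errors))            # mean absolute error
rmse = np.sqrt(np.mean(errors ** 2))     # root mean squared error
print(mae, rmse)
```

Both metrics are in the same unit as the target (here, salary), so smaller values mean predictions that sit closer to the real observations.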

Step: 4. Visualizing the Training set results:

Now in this step, we will visualize the training set result. To do so, we will use the
scatter() function of the pyplot library, which we have already imported in the
pre-processing step. The scatter() function will create a scatter plot of observations.

On the x-axis, we will plot the employees' Years of Experience, and on the y-axis,
their salary. We pass the function the real values of the training set, that is, the
years of experience x_train and the salaries y_train, along with the color of the
observations. Here we use green for the observations, but any color can be chosen.

Now, we need to plot the regression line, so for this, we will use the plot()
function of the pyplot library. To this function we pass the years of experience
for the training set x_train, the predicted salaries for the training set x_pred,
and the color of the line.

Next, we will give the plot a title using the title() function of the pyplot library,
passing the name "Salary vs Experience (Training Dataset)".

After that, we will assign labels for the x-axis and y-axis using the xlabel() and
ylabel() functions.

Finally, we will display all of the above in a graph using show(). The code is
given below:

mtp.scatter(x_train, y_train, color="green")
mtp.plot(x_train, x_pred, color="red")
mtp.title("Salary vs Experience (Training Dataset)")
mtp.xlabel("Years of Experience")
mtp.ylabel("Salary(In Rupees)")
mtp.show()

Output:

By executing the above lines of code, we will get the below graph plot as an
output.

In the above plot, the real observations are shown as green dots, and the predicted
values lie on the red regression line. The regression line shows the relationship
between the dependent and independent variables.

How well the line fits can be assessed by calculating the difference between the
actual values and the predicted values. As we can see in the above plot, most of
the observations are close to the regression line, hence our model fits the
training set well.

Step: 5. Visualizing the Test set results:

In the previous step, we visualized the performance of our model on the training
set. Now, we will do the same for the test set. The code remains the same as above,
except that we pass x_test and y_test to scatter() instead of x_train and y_train.
The regression line is still drawn from x_train and x_pred, because the fitted line
is the same whichever set of observations is plotted against it.

Here we are also changing the color of observations and regression line to
differentiate between the two plots, but it is optional.

#visualizing the Test set results
mtp.scatter(x_test, y_test, color="blue")
mtp.plot(x_train, x_pred, color="red")
mtp.title("Salary vs Experience (Test Dataset)")
mtp.xlabel("Years of Experience")
mtp.ylabel("Salary(In Rupees)")
mtp.show()

Output:

By executing the above lines of code, we will get the output as:

In the above plot, the observations are shown in blue, and the predictions are
given by the red regression line. As we can see, most of the observations are close
to the regression line, hence we can say our Simple Linear Regression model fits
well and is able to make good predictions.

Multiple Linear Regression
In the previous topic, we learned about Simple Linear Regression, where a single
independent/predictor variable (X) is used to model the response variable (Y). But
there may be various cases in which the response variable is affected by more than
one predictor variable; for such cases, the Multiple Linear Regression algorithm is
used.

Multiple Linear Regression is an extension of Simple Linear Regression, as it takes
more than one predictor variable to predict the response variable. We can define it
as:

Multiple Linear Regression is one of the important regression algorithms which models the linear
relationship between a single dependent continuous variable and more than one independent
variable.

Example:

Prediction of CO2 emission based on engine size and number of cylinders in a car.

Some key points about MLR:

o For MLR, the dependent or target variable (Y) must be continuous/real, but the
predictor or independent variables may be continuous or categorical.
o Each feature variable must have a linear relationship with the dependent
variable.
o MLR tries to fit a regression line through a multidimensional space of data
points.

MLR equation:
In Multiple Linear Regression, the target variable (Y) is a linear combination of
multiple predictor variables x1, x2, x3, ..., xn. Since it is an extension of Simple
Linear Regression, the same form applies, and the equation becomes:

$Y = b_0 + b_1 x_1 + b_2 x_2 + b_3 x_3 + \dots + b_n x_n$ ...............(a)

Where,

Y = Output/Response variable

b0, b1, b2, b3, ..., bn = Coefficients of the model

x1, x2, x3, ..., xn = Various independent/feature variables
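
As a worked instance of equation (a), the sketch below evaluates the equation for two observations with three features each. The coefficient values and the names b0, b and X are hypothetical, chosen only so the arithmetic is easy to follow.

```python
import numpy as np

# Hypothetical coefficients and two observations with three features each
b0 = 10.0                            # constant term b0
b = np.array([2.0, -1.0, 0.5])       # b1, b2, b3
X = np.array([[1.0, 2.0, 4.0],
              [3.0, 0.0, 2.0]])

Y = b0 + X @ b                       # one prediction per row of X
print(Y)
```

For the first row this is 10 + 2(1) - 1(2) + 0.5(4) = 12, and for the second row 10 + 2(3) - 1(0) + 0.5(2) = 17.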

Assumptions for Multiple Linear Regression:

o A linear relationship should exist between the target and predictor variables.
o The regression residuals must be normally distributed.
o MLR assumes little or no multicollinearity (correlation between the
independent variables) in the data.
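
One simple way to look for multicollinearity is to inspect the pairwise correlations between the predictors. The sketch below does this with NumPy on synthetic data in which the third feature is deliberately built as a near copy of the first; the data and variable names are assumptions made for illustration, and a correlation matrix is only a rough first check.

```python
import numpy as np

rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = rng.normal(size=200)                       # generated independently of x1
x3 = 2.0 * x1 + 0.01 * rng.normal(size=200)     # almost a rescaled copy of x1

features = np.column_stack([x1, x2, x3])
corr = np.corrcoef(features, rowvar=False)      # 3x3 correlation matrix of the columns
print(np.round(corr, 2))
```

An off-diagonal entry close to +1 or -1 (here, between the first and third columns) signals that two predictors carry nearly the same information.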

Implementation of Multiple Linear Regression model using Python:


To implement MLR using Python, we have below problem:

Problem Description:

We have a dataset of 50 start-up companies. This dataset contains five main pieces
of information: R&D Spend, Administration Spend, Marketing Spend, State, and
Profit for a financial year. Our goal is to create a model that can determine which
company has the maximum profit, and which factor most affects the profit of a
company.

Since we need to find the Profit, it is the dependent variable, and the other four
variables are independent variables. Below are the main steps of building the
MLR model:

1. Data Pre-processing Steps
2. Fitting the MLR model to the training set
3. Predicting the result of the test set

Step-1: Data Pre-processing Step:


The very first step is data pre-processing, which we have already discussed in this
tutorial. This process contains the below steps:

o Importing libraries: Firstly we will import the libraries which will help in
building the model. Below is the code for it:

# importing libraries
import numpy as nm
import matplotlib.pyplot as mtp
import pandas as pd

o Importing dataset: Now we will import the dataset (50_CompList), which
contains all the variables. Below is the code for it:

#importing datasets
data_set= pd.read_csv('50_CompList.csv')

Output: We will get the dataset as:

In the above output, we can clearly see that there are five variables, of which
four are continuous and one is categorical.

o Extracting dependent and independent Variables:

#Extracting Independent and dependent Variable
x= data_set.iloc[:, :-1].values
y= data_set.iloc[:, 4].values

Output:

Out[5]:

array([[165349.2, 136897.8, 471784.1, 'New York'],
[162597.7, 151377.59, 443898.53, 'California'],
[153441.51, 101145.55, 407934.54, 'Florida'],
[144372.41, 118671.85, 383199.62, 'New York'],
[142107.34, 91391.77, 366168.42, 'Florida'],
[131876.9, 99814.71, 362861.36, 'New York'],
[134615.46, 147198.87, 127716.82, 'California'],
[130298.13, 145530.06, 323876.68, 'Florida'],
[120542.52, 148718.95, 311613.29, 'New York'],
[123334.88, 108679.17, 304981.62, 'California'],
[101913.08, 110594.11, 229160.95, 'Florida'],
[100671.96, 91790.61, 249744.55, 'California'],
[93863.75, 127320.38, 249839.44, 'Florida'],
[91992.39, 135495.07, 252664.93, 'California'],
[119943.24, 156547.42, 256512.92, 'Florida'],
[114523.61, 122616.84, 261776.23, 'New York'],
[78013.11, 121597.55, 264346.06, 'California'],
[94657.16, 145077.58, 282574.31, 'New York'],
[91749.16, 114175.79, 294919.57, 'Florida'],
[86419.7, 153514.11, 0.0, 'New York'],
[76253.86, 113867.3, 298664.47, 'California'],
[78389.47, 153773.43, 299737.29, 'New York'],
[73994.56, 122782.75, 303319.26, 'Florida'],
[67532.53, 105751.03, 304768.73, 'Florida'],
[77044.01, 99281.34, 140574.81, 'New York'],
[64664.71, 139553.16, 137962.62, 'California'],
[75328.87, 144135.98, 134050.07, 'Florida'],
[72107.6, 127864.55, 353183.81, 'New York'],
[66051.52, 182645.56, 118148.2, 'Florida'],
[65605.48, 153032.06, 107138.38, 'New York'],
[61994.48, 115641.28, 91131.24, 'Florida'],
[61136.38, 152701.92, 88218.23, 'New York'],
[63408.86, 129219.61, 46085.25, 'California'],
[55493.95, 103057.49, 214634.81, 'Florida'],
[46426.07, 157693.92, 210797.67, 'California'],
[46014.02, 85047.44, 205517.64, 'New York'],
[28663.76, 127056.21, 201126.82, 'Florida'],
[44069.95, 51283.14, 197029.42, 'California'],
[20229.59, 65947.93, 185265.1, 'New York'],
[38558.51, 82982.09, 174999.3, 'California'],
[28754.33, 118546.05, 172795.67, 'California'],
[27892.92, 84710.77, 164470.71, 'Florida'],
[23640.93, 96189.63, 148001.11, 'California'],
[15505.73, 127382.3, 35534.17, 'New York'],
[22177.74, 154806.14, 28334.72, 'California'],
[1000.23, 124153.04, 1903.93, 'New York'],
[1315.46, 115816.21, 297114.46, 'Florida'],
[0.0, 135426.92, 0.0, 'California'],
[542.05, 51743.15, 0.0, 'New York'],
[0.0, 116983.8, 45173.06, 'California']], dtype=object)

As we can see in the above output, the last column contains a categorical variable,
which is not suitable for fitting the model directly. So we need to encode this
variable.

Encoding Dummy Variables:

As we have one categorical variable (State), which cannot be directly applied to
the model, we will encode it. To encode the categorical variable into numbers,
we will use the LabelEncoder class. But this is not sufficient, because the
resulting integer codes imply an order between the states that does not exist,
which may produce a wrong model. To remove this problem, we will use
OneHotEncoder, which will create the dummy variables. Below is the code for it:

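#Categorical data
#Note: the categorical_features argument was removed in scikit-learn 0.22,
#so this listing runs only on older versions of the library.
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
labelencoder_x= LabelEncoder()
x[:, 3]= labelencoder_x.fit_transform(x[:,3])
onehotencoder= OneHotEncoder(categorical_features= [3])
x= onehotencoder.fit_transform(x).toarray()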

Here we are encoding only one independent variable, State, as the other
variables are continuous.
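
The categorical_features argument of OneHotEncoder was removed in scikit-learn 0.22, so on newer releases the same encoding is usually written with ColumnTransformer. The following is a minimal sketch of that alternative on three made-up rows laid out like x (three numeric columns followed by State); x_demo, ct and x_encoded are illustrative names, not part of the tutorial's code.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Made-up sample in the same layout as x: three numeric columns, then State
x_demo = np.array([[100.0, 50.0, 30.0, 'New York'],
                   [90.0, 60.0, 20.0, 'California'],
                   [80.0, 40.0, 10.0, 'Florida']], dtype=object)

# One-hot encode column 3; pass the remaining columns through unchanged.
# sparse_threshold=0 asks for a dense result.
ct = ColumnTransformer([('state', OneHotEncoder(), [3])],
                       remainder='passthrough', sparse_threshold=0)
x_encoded = np.asarray(ct.fit_transform(x_demo), dtype=float)
print(x_encoded)
```

The encoded columns come first and follow the alphabetical order of the categories (California, Florida, New York), followed by the three numeric columns, which matches the column order described for the original approach.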

Output:

As we can see in the above output, the State column has been converted into
dummy variables (0 and 1). Each dummy variable column corresponds to one State.
We can check this by comparing it with the original dataset. The first column
corresponds to California, the second column to Florida, and the third column to
New York.

Note: We should not use all the dummy variables at the same time; the number used must be 1 less
than the total number of dummy variables, else it will create a dummy variable trap.

o Now, we are writing a single line of code just to avoid the dummy variable
trap:

#avoiding the dummy variable trap:
x = x[:, 1:]

If we do not remove the first dummy variable, then it may introduce
multicollinearity in the model.

As we can see in the above output image, the first column has been removed.
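
The dummy variable trap can be seen directly in the rank of the design matrix. In the sketch below (made-up dummy columns for three states; all names are illustrative), keeping every dummy column alongside an intercept column makes the columns linearly dependent, while dropping one dummy restores full column rank.

```python
import numpy as np

# Made-up one-hot columns for three states (California, Florida, New York)
dummies = np.array([[1, 0, 0],
                    [0, 1, 0],
                    [0, 0, 1],
                    [1, 0, 0],
                    [0, 1, 0]], dtype=float)
intercept = np.ones((5, 1))

full = np.hstack([intercept, dummies])             # intercept + all 3 dummies: 4 columns
reduced = np.hstack([intercept, dummies[:, 1:]])   # first dummy dropped: 3 columns

rank_full = np.linalg.matrix_rank(full)            # lower than the number of columns
rank_reduced = np.linalg.matrix_rank(reduced)      # equal to the number of columns
print(rank_full, rank_reduced)
```

In the full matrix the three dummy columns always add up to the intercept column, which is exactly the perfect multicollinearity the note above warns about.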

o Now we will split the dataset into training and test set. The code for this is
given below:

# Splitting the dataset into training and test set.
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test= train_test_split(x, y, test_size= 0.2, random_state=0)

Output: The above code will split the dataset into a training set and a test set.
You can check the output by clicking on the variable explorer option given in
Spyder IDE. The test set and training set will look like the below image:

Test set:

Training set:

Note: In MLR, we will not do feature scaling. Ordinary least squares gives the same predictions
whether or not the features are rescaled, so we don't need to do it manually.
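
One way to see why scaling can be skipped for ordinary least squares is to check that the fitted values are the same with and without standardized features. The sketch below does this with NumPy's least-squares solver on made-up data; ols_predict is a hypothetical helper written for this illustration, not a library function.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(30, 2))                        # made-up features
y = 4.0 + X @ np.array([1.5, -2.0]) + rng.normal(scale=0.1, size=30)

def ols_predict(X_fit, y_fit):
    """Fit OLS with an intercept column and return the fitted values."""
    A = np.column_stack([np.ones(len(X_fit)), X_fit])
    beta, *_ = np.linalg.lstsq(A, y_fit, rcond=None)
    return A @ beta

pred_raw = ols_predict(X, y)
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)     # standardized features
pred_scaled = ols_predict(X_scaled, y)

same = np.allclose(pred_raw, pred_scaled)
print(same)
```

The coefficients themselves do change with the scale of the features; only the predictions stay the same. Models with regularization or distance-based models do not share this property.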

Step: 2- Fitting our MLR model to the Training set:


Now that our dataset is well prepared, we will fit our regression model to the
training set. This is similar to what we did in the Simple Linear Regression model.
The code for this will be:

#Fitting the MLR model to the training set:
from sklearn.linear_model import LinearRegression
regressor= LinearRegression()
regressor.fit(x_train, y_train)

Output:

Out[9]: LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

Now, we have successfully trained our model using the training dataset. In the
next step, we will test the performance of the model using the test dataset.

Step: 3- Prediction of Test set results:


The last step for our model is checking the performance of the model. We will do it
by predicting the test set result. For prediction, we will create a y_pred vector.
Below is the code for it:

#Predicting the Test set result;
y_pred= regressor.predict(x_test)

By executing the above lines of code, a new vector will be generated under the
variable explorer option. We can test our model by comparing the predicted values
and test set values.

Output:

In the above output, we have the predicted results and the test set values. We can
check model performance by comparing these two sets of values index by index. For
example, the first index has a predicted profit of $103,015 and a test/real profit
of $103,282. The difference is only $267, which is a good prediction, so our model
is complete.

o We can also check the score for training dataset and test dataset. Below is
the code for it:

print('Train Score: ', regressor.score(x_train, y_train))
print('Test Score: ', regressor.score(x_test, y_test))

Output: The score is:

Train Score: 0.9501847627493607
Test Score: 0.9347068473282446

The score() method returns the coefficient of determination (R²), so the above
scores tell us that our model explains about 95% of the variance in profit on the
training dataset and about 93% on the test dataset.
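
For a regressor, the value returned by score() is the coefficient of determination, R² = 1 - SS_res / SS_tot. The sketch below computes it by hand with NumPy on made-up values; r2_score_manual is a hypothetical helper written for this illustration.

```python
import numpy as np

def r2_score_manual(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)            # variation left unexplained
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)     # total variation around the mean
    return 1.0 - ss_res / ss_tot

# Made-up values for illustration
y_true_demo = [3.0, 5.0, 7.0, 9.0]
y_pred_demo = [2.5, 5.5, 7.0, 9.5]
r2 = r2_score_manual(y_true_demo, y_pred_demo)
print(r2)
```

Here SS_res = 0.75 and SS_tot = 20, so R² = 1 - 0.0375 = 0.9625. A value of 1 means the predictions match the observations exactly.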

Note: In the next topic, we will see how we can improve the performance of the model using
the Backward Elimination process.

Applications of Multiple Linear Regression:


There are mainly two applications of Multiple Linear Regression:

o Effectiveness of Independent variable on prediction:


o Predicting the impact of changes:

What is Backward Elimination?
Backward elimination is a feature selection technique used while building a machine
learning model. It removes those features that do not have a significant effect on
the dependent variable or the prediction of output. There are various ways to
build a model in Machine Learning, which are:

1. All-in
2. Backward Elimination
3. Forward Selection
4. Bidirectional Elimination
5. Score Comparison

Above are the possible methods for building the model in Machine Learning, but here
we will use only the Backward Elimination process, as it is the fastest method.

Steps of Backward Elimination


Below are some main steps which are used to apply backward elimination process:

Step-1: Firstly, we need to select a significance level to stay in the model
(SL = 0.05).

Step-2: Fit the complete model with all possible predictors/independent variables.

Step-3: Choose the predictor which has the highest P-value, such that:

a. If P-value > SL, go to step 4.
b. Else finish; our model is ready.

Step-4: Remove that predictor.

Step-5: Rebuild and fit the model with the remaining variables, then return to
Step-3.
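
The five steps above can be sketched as a loop. The version below is an illustration under stated assumptions rather than the tutorial's method: it computes the OLS p-values with NumPy and SciPy instead of statsmodels so the calculation is visible, it never removes the constant column (a design choice of this sketch), and it runs on synthetic data. The names ols_p_values and backward_elimination are hypothetical.

```python
import numpy as np
from scipy import stats

def ols_p_values(X, y):
    """Two-sided t-test p-value for each OLS coefficient (X includes the constant)."""
    n, k = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    residuals = y - X @ beta
    sigma2 = residuals @ residuals / (n - k)        # residual variance estimate
    cov = sigma2 * np.linalg.inv(X.T @ X)           # covariance of the coefficients
    t_stats = beta / np.sqrt(np.diag(cov))
    return 2 * stats.t.sf(np.abs(t_stats), df=n - k)

def backward_elimination(X, y, sl=0.05):
    """Return the kept column indices; column 0 (the constant) is never dropped."""
    cols = list(range(X.shape[1]))
    while len(cols) > 1:
        p = ols_p_values(X[:, cols], y)             # Step-2 / Step-5: (re)fit
        worst = int(np.argmax(p[1:])) + 1           # Step-3: highest p-value
        if p[worst] <= sl:
            break                                   # every predictor is significant
        cols.pop(worst)                             # Step-4: remove that predictor
    return cols

# Synthetic data: y depends on column 1 only; column 2 is unrelated noise
rng = np.random.default_rng(42)
n = 200
X_demo = np.column_stack([np.ones(n), rng.normal(size=n), rng.normal(size=n)])
y_demo = 3.0 + 2.0 * X_demo[:, 1] + rng.normal(scale=0.5, size=n)

kept = backward_elimination(X_demo, y_demo)
print(kept)
```

The strongly related predictor (column 1) survives the loop, while an unrelated column is removed whenever its p-value exceeds the significance level.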

Need for Backward Elimination: An optimal Multiple Linear Regression model:

In the previous chapter, we discussed and successfully created our Multiple Linear
Regression model, where we took 4 independent variables (R&D spend,
Administration spend, Marketing spend, and State (dummy variables))
and one dependent variable (Profit). But that model is not optimal, as we have
included all the independent variables and do not know which independent variable
affects the prediction the most and which one the least.

Unnecessary features increase the complexity of the model. Hence it is good to
have only the most significant features and keep our model simple to get a
better result.

So, in order to optimize the performance of the model, we will use the Backward
Elimination method. This process is used to optimize the performance of the MLR
model as it will only include the most affecting feature and remove the least
affecting feature. Let's start to apply it to our MLR model.

Steps for Backward Elimination method:


We will use the same model which we built in the previous chapter of MLR. Below
is the complete code for it:

# importing libraries
import numpy as nm
import matplotlib.pyplot as mtp
import pandas as pd

#importing datasets
data_set= pd.read_csv('50_CompList.csv')

#Extracting Independent and dependent Variable
x= data_set.iloc[:, :-1].values
y= data_set.iloc[:, 4].values

#Categorical data (categorical_features requires scikit-learn older than 0.22)
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
labelencoder_x= LabelEncoder()
x[:, 3]= labelencoder_x.fit_transform(x[:,3])
onehotencoder= OneHotEncoder(categorical_features= [3])
x= onehotencoder.fit_transform(x).toarray()

#Avoiding the dummy variable trap:
x = x[:, 1:]

# Splitting the dataset into training and test set.
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test= train_test_split(x, y, test_size= 0.2, random_state=0)

#Fitting the MLR model to the training set:
from sklearn.linear_model import LinearRegression
regressor= LinearRegression()
regressor.fit(x_train, y_train)

#Predicting the Test set result;
y_pred= regressor.predict(x_test)

#Checking the score
print('Train Score: ', regressor.score(x_train, y_train))
print('Test Score: ', regressor.score(x_test, y_test))

From the above code, we got training and test set result as:

Train Score: 0.9501847627493607
Test Score: 0.9347068473282446

The difference between both scores is 0.0154.

Note: On the basis of this score, we will estimate the effect of features on our model after using the
Backward elimination process.

Step: 1- Preparation of Backward Elimination:

o Importing the library: Firstly, we need to import the statsmodels.api module,
which is used for the estimation of various statistical models such as OLS
(Ordinary Least Squares). Below is the code for it:

import statsmodels.api as smf

o Adding a column in matrix of features: As we can check in our MLR
equation (a), there is one constant term b0, but this term is not present in our
matrix of features, so we need to add it manually. We will add a column
having values x0 = 1 associated with the constant term b0.
To add this, we will use the append function of the NumPy library (nm, which we
have already imported into our code), and will assign a value of 1. Below is
the code for it.

x = nm.append(arr = nm.ones((50,1)).astype(int), values=x, axis=1)

Here we have used axis =1, as we wanted to add a column. For adding a row, we
can use axis =0.
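
The effect of the axis argument can be checked on a small array. The sketch below uses a made-up 3x2 matrix (features, with_const and with_row are illustrative names) and takes the number of rows from the array itself rather than hard-coding it.

```python
import numpy as np

features = np.array([[10.0, 20.0],
                     [30.0, 40.0],
                     [50.0, 60.0]])            # made-up 3x2 matrix of features

ones_col = np.ones((features.shape[0], 1))    # one 1 per row
with_const = np.append(arr=ones_col, values=features, axis=1)        # adds a first column
with_row = np.append(arr=features, values=np.ones((1, 2)), axis=0)   # adds a last row

print(with_const.shape, with_row.shape)
```

Because the column of ones is passed as arr and the features as values, the constant ends up as the first column, which is why index 0 refers to the constant term in the later x_opt selections.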

Output: By executing the above line of code, a new column will be added into our
matrix of features, which will have all values equal to 1. We can check it by clicking
on the x dataset under the variable explorer option.

As we can see in the above output image, the first column is added successfully,
which corresponds to the constant term of the MLR equation.

Step: 2:

o Now, we are actually going to apply the backward elimination process. Firstly
we will create a new feature matrix x_opt, which will eventually contain only the
independent features that significantly affect the dependent variable.
o Next, as per the Backward Elimination process, we need to choose a
significance level (0.05), and then fit the model with all possible
predictors. So for fitting the model, we will create a regressor_OLS object of
the OLS class of the statsmodels library. Then we will fit it by using
the fit() method.
o Next we need the p-values to compare with the SL value, so for this we will
use the summary() method to get the summary table of all the values. Below is
the code for it:

x_opt=x[:, [0,1,2,3,4,5]]
regressor_OLS=smf.OLS(endog = y, exog=x_opt).fit()
regressor_OLS.summary()

Output: By executing the above lines of code, we will get a summary table.
Consider the below image:

In the above image, we can clearly see the p-values of all the variables. Here x1
and x2 are dummy variables, x3 is R&D spend, x4 is Administration spend,
and x5 is Marketing spend.

From the table, we will choose the highest p-value, which is 0.953 for x1. This
p-value is greater than the SL value, so we will remove the x1 variable (a dummy
variable) from the matrix and refit the model. Below is the code for it:

x_opt=x[:, [0,2,3,4,5]]
regressor_OLS=smf.OLS(endog = y, exog=x_opt).fit()
regressor_OLS.summary()

Output:

As we can see in the output image, now five variables remain. Among these
variables, the highest p-value is 0.961. So we will remove it in the next iteration.

o Now the next highest value is 0.961 for the x1 variable in this summary, which is
the other dummy variable. So we will remove it and refit the model. Below is the
code for it:

x_opt= x[:, [0,3,4,5]]
regressor_OLS=smf.OLS(endog = y, exog=x_opt).fit()
regressor_OLS.summary()

Output:

In the above output image, we can see that the second dummy variable (x2 in the
first summary table) has been removed. The next highest p-value is 0.602, which is
still greater than the significance level of 0.05, so we need to remove it.

o Now we will remove the Admin spend, which has a p-value of 0.602, and again
refit the model.

x_opt=x[:, [0,3,5]]
regressor_OLS=smf.OLS(endog = y, exog=x_opt).fit()
regressor_OLS.summary()

Output:

As we can see in the above output image, the variable (Admin spend) has been
removed. But there is still one variable left to remove, which is Marketing spend,
as it has a high p-value (0.60). So we need to remove it.

o Finally, we will remove one more variable, Marketing spend, whose p-value
of .60 is more than the significance level.
Below is the code for it:

x_opt=x[:, [0,3]]
regressor_OLS=smf.OLS(endog = y, exog=x_opt).fit()
regressor_OLS.summary()

Output:

As we can see in the above output image, only two variables are left: the constant
term and R&D spend. So R&D spend is the only independent variable that is
significant for the prediction, and we can now predict efficiently using this
variable.

Estimating the performance:


In the previous topic, we calculated the train and test scores of the model
when we used all the feature variables. Now we will check the score with
only one feature variable (R&D spend). Our dataset now looks like:

Below is the code for building the linear regression model by only
using R&D spend:

# importing libraries
import numpy as nm
import matplotlib.pyplot as mtp
import pandas as pd

#importing datasets
data_set= pd.read_csv('50_CompList1.csv')

#Extracting Independent and dependent Variable
x_BE= data_set.iloc[:, :-1].values
y_BE= data_set.iloc[:, 1].values

# Splitting the dataset into training and test set.
from sklearn.model_selection import train_test_split
x_BE_train, x_BE_test, y_BE_train, y_BE_test= train_test_split(x_BE, y_BE, test_size= 0.2, random_state=0)

#Fitting the MLR model to the training set:
from sklearn.linear_model import LinearRegression
regressor= LinearRegression()
regressor.fit(nm.array(x_BE_train).reshape(-1,1), y_BE_train)

#Predicting the Test set result;
y_pred= regressor.predict(x_BE_test)

#Checking the score
print('Train Score: ', regressor.score(x_BE_train, y_BE_train))
print('Test Score: ', regressor.score(x_BE_test, y_BE_test))

Output:

After executing the above code, we will get the Training and test scores as:

Train Score: 0.9449589778363044
Test Score: 0.9464587607787219

As we can see, the training score is about 0.945 and the test score is about 0.946.
The difference between the two scores is about 0.0015, compared with a difference
of 0.0154 for the previous model, where we included all the variables.

We got this result by using only one independent variable (R&D spend)
instead of four variables. Hence, now, our model is simple and accurate.

ML Polynomial Regression
o Polynomial Regression is a regression algorithm that models the relationship
between a dependent (y) and independent variable (x) as an nth degree
polynomial. The Polynomial Regression equation is given below:

$y = b_0 + b_1 x_1 + b_2 x_1^2 + b_3 x_1^3 + \dots + b_n x_1^n$

o It is also called a special case of Multiple Linear Regression in ML, because
we add some polynomial terms to the Multiple Linear Regression equation to
convert it into Polynomial Regression.
o It is a linear model with some modification in order to increase the accuracy.
o The dataset used in Polynomial Regression for training is non-linear in nature.
o It makes use of a linear regression model to fit complicated, non-linear
functions and datasets.
o Hence, "In Polynomial regression, the original features are
converted into Polynomial features of required degree (2,3,..,n) and
then modeled using a linear model."
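
The conversion of original features into polynomial features can be seen on a tiny example. The sketch below applies scikit-learn's PolynomialFeatures with degree 3 to three made-up values of a single feature; x_demo and x_poly_demo are illustrative names.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

x_demo = np.array([[2.0], [3.0], [4.0]])      # single feature, made-up values
poly = PolynomialFeatures(degree=3)
x_poly_demo = poly.fit_transform(x_demo)      # columns: 1, x, x^2, x^3
print(x_poly_demo)
```

Each row becomes [1, x, x², x³], for example [1, 2, 4, 8] for x = 2. A linear model fitted on these columns is then a degree-3 polynomial in the original x.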

Need for Polynomial Regression:


The need for Polynomial Regression in ML can be understood from the below points:

o If we apply a linear model to a linear dataset, it gives a good result, as we
have seen in Simple Linear Regression. But if we apply the same model without
any modification to a non-linear dataset, it will fit poorly: the loss will
increase, the error rate will be high, and accuracy will decrease.
o So for such cases, where data points are arranged in a non-linear
fashion, we need the Polynomial Regression model. We can
understand it in a better way using the below comparison diagram of the
linear dataset and non-linear dataset.

o In the above image, we have taken a dataset which is arranged non-linearly.
So if we try to cover it with a linear model, then we can clearly see that it
hardly covers any data point. On the other hand, a curve, which is what the
Polynomial model produces, is suitable to cover most of the data points.
o Hence, if the datasets are arranged in a non-linear fashion, then we should
use the Polynomial Regression model instead of Simple Linear Regression.
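
The difference between the two models on non-linear data can be quantified. The sketch below fits a straight line and a degree-2 polynomial to made-up, exactly quadratic data using NumPy's polyfit (a stand-in for the scikit-learn models used in this tutorial) and compares their R² values; all names are illustrative.

```python
import numpy as np

x_demo = np.arange(1, 11, dtype=float)     # made-up levels 1..10
y_demo = 5.0 * x_demo ** 2 + 3.0           # made-up, exactly quadratic response

def r_squared(y, y_hat):
    """Coefficient of determination of the fitted values y_hat."""
    return 1.0 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)

line = np.polyval(np.polyfit(x_demo, y_demo, deg=1), x_demo)    # straight-line fit
curve = np.polyval(np.polyfit(x_demo, y_demo, deg=2), x_demo)   # degree-2 fit

r2_line = r_squared(y_demo, line)
r2_curve = r_squared(y_demo, curve)
print(r2_line, r2_curve)
```

The degree-2 fit reproduces the data almost exactly, while the straight line leaves a systematic error because it cannot follow the curvature.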

Note: A Polynomial Regression algorithm is also called Polynomial Linear Regression because the
model is linear in its coefficients, even though it is non-linear in the variable x.

Equation of the Polynomial Regression Model:


Simple Linear Regression equation: $y = b_0 + b_1 x$ .........(a)

Multiple Linear Regression equation: $y = b_0 + b_1 x_1 + b_2 x_2 + b_3 x_3 + \dots + b_n x_n$ .........(b)

Polynomial Regression equation: $y = b_0 + b_1 x + b_2 x^2 + b_3 x^3 + \dots + b_n x^n$ ..........(c)

When we compare the above three equations, we can clearly see that all three
equations are polynomial equations but differ in the degree of the variables. The
Simple and Multiple Linear equations are polynomial equations of degree one, and
the Polynomial Regression equation is linear in the coefficients with the variable
raised up to the nth degree. So if we add higher-degree terms to our linear
equations, they are converted into Polynomial Linear equations.

Note: To better understand Polynomial Regression, you must have knowledge of Simple Linear
Regression.

Implementation of Polynomial Regression using Python:


Here we will implement the Polynomial Regression using Python. We will
understand it by comparing Polynomial Regression model with the Simple Linear
Regression model. So first, let's understand the problem for which we are going to
build the model.

Problem Description: There is a Human Resource company, which is going to
hire a new candidate. The candidate has said that his previous salary was 160K per
annum, and HR has to check whether he is telling the truth or bluffing. To check
this, they only have a dataset from his previous company in which the salaries of
the top 10 positions are listed with their levels. By checking the available
dataset, we have found that there is a non-linear relationship between the
Position levels and the salaries. Our goal is to build a bluffing detector
regression model, so HR can hire an honest candidate. Below are the steps to
build such a model.

Steps for Polynomial Regression:


The main steps involved in Polynomial Regression are given below:

o Data Pre-processing
o Build a Linear Regression model and fit it to the dataset
o Build a Polynomial Regression model and fit it to the dataset
o Visualize the result for Linear Regression and Polynomial Regression model.
o Predicting the output.

Note: Here, we will build the Linear Regression model as well as the Polynomial Regression model
to compare their predictions. The Linear Regression model is for reference.

Data Pre-processing Step:

The data pre-processing step will remain the same as in previous regression
models, except for some changes. In the Polynomial Regression model, we will not
use feature scaling, and we will not split our dataset into training and test sets.
There are two reasons for this:

o The dataset contains very little data, so it is not suitable to divide it
into a test and training set; otherwise our model would not be able to find the
correlations between the salaries and levels.
o In this model, we want very accurate predictions for salary, so the model
should have enough information.

The code for pre-processing step is given below:

# importing libraries
import numpy as nm
import matplotlib.pyplot as mtp
import pandas as pd

#importing datasets
data_set= pd.read_csv('Position_Salaries.csv')

#Extracting Independent and dependent Variable
x= data_set.iloc[:, 1:2].values
y= data_set.iloc[:, 2].values

Explanation:

o In the above lines of code, we have imported the important Python libraries
to import the dataset and operate on it.
o Next, we have imported the dataset 'Position_Salaries.csv', which contains
three columns (Position, Levels, and Salary), but we will consider only two
columns (Salary and Levels).
o After that, we have extracted the dependent (Y) and independent variable (X)
from the dataset. For the x-variable, we have used the index [:, 1:2],
because we want the column at index 1 (levels), and writing it as the slice 1:2
keeps x as a matrix (2-D array) rather than a vector.
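
The difference between selecting a column with a slice and with a single index shows up in the shape of the result. The sketch below uses a made-up frame with the same three-column layout (the column names and values are assumptions for illustration).

```python
import pandas as pd

# Made-up frame with the same three-column layout as Position_Salaries.csv
demo = pd.DataFrame({'Position': ['A', 'B', 'C'],
                     'Level': [1, 2, 3],
                     'Salary': [100, 200, 300]})

x_matrix = demo.iloc[:, 1:2].values    # slice of columns -> 2-D array
x_vector = demo.iloc[:, 1].values      # single column    -> 1-D array
print(x_matrix.shape, x_vector.shape)
```

scikit-learn estimators expect the feature input to be 2-D (rows by features), which is why the slice form is used for x, while the target y can stay 1-D.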

Output:

By executing the above code, we can read our dataset as:

As we can see in the above output, there are three columns present (Positions,
Levels, and Salaries). But we are only considering two columns, because the Levels
are equivalent to the Positions and may be seen as the encoded form of Positions.

Here we will predict the output for level 6.5 because the candidate has 4+ years'
experience as a regional manager, so he must be somewhere between levels 6
and 7.

Building the Linear regression model:

Now, we will build and fit the Linear regression model to the dataset. In building
polynomial regression, we will take the Linear regression model as reference and
compare both the results. The code is given below:

#Fitting the Linear Regression to the dataset
from sklearn.linear_model import LinearRegression
lin_regs= LinearRegression()
lin_regs.fit(x,y)

In the above code, we have created the Simple Linear model using lin_regs object
of LinearRegression class and fitted it to the dataset variables (x and y).

Output:

Out[5]: LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

Building the Polynomial regression model:

Now we will build the Polynomial Regression model, but it will be a little different
from the Simple Linear model, because here we will use the PolynomialFeatures
class of the preprocessing module. We are using this class to add some extra
features to our dataset.

#Fitting the Polynomial regression to the dataset
from sklearn.preprocessing import PolynomialFeatures
poly_regs= PolynomialFeatures(degree= 2)
x_poly= poly_regs.fit_transform(x)
lin_reg_2 =LinearRegression()
lin_reg_2.fit(x_poly, y)

In the above lines of code, we have used poly_regs.fit_transform(x) because we
first convert our feature matrix into a polynomial feature matrix and then fit it
to the Polynomial regression model. The parameter value (degree= 2) is our
choice; we can pick it according to how strongly curved the relationship in the
data appears.

After executing the code, we will get another matrix x_poly, which can be seen
under the variable explorer option:
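
As a rough illustration of what x_poly contains, the sketch below expands a single feature into the columns 1, x, x^2 in plain Python. With one input feature and its default settings, PolynomialFeatures(degree=2) produces columns in this same order; the helper name polynomial_features and the input values are assumptions made for this example.

```python
def polynomial_features(values, degree):
    """Expand each scalar v into the row [v**0, v**1, ..., v**degree]."""
    return [[v ** power for power in range(degree + 1)] for v in values]


levels = [1, 2, 3]  # illustrative inputs, not the tutorial's dataset
x_poly_sketch = polynomial_features(levels, degree=2)

for row in x_poly_sketch:
    print(row)
```

The linear model fitted afterwards simply learns one coefficient per column, which is how a "linear" regressor ends up drawing a curve.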

Next, we have used another LinearRegression object, namely lin_reg_2, to fit
our x_poly matrix to the linear model.

Output:

Out[11]: LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

Visualizing the result for Linear regression:

Now we will visualize the result for Linear regression model as we did in Simple
Linear Regression. Below is the code for it:

#Visualizing the result for Linear Regression model
mtp.scatter(x,y,color="blue")
mtp.plot(x,lin_regs.predict(x), color="red")
mtp.title("Bluff detection model(Linear Regression)")
mtp.xlabel("Position Levels")
mtp.ylabel("Salary")
mtp.show()

Output:

In the above output image, we can clearly see that the regression line is far
from the data points. Predictions lie on the red straight line, and the blue points
are the actual values. If we use this output to predict the salary of the CEO, it
gives approx. $600,000, which is far away from the real value.

So we need a curved model, rather than a straight line, to fit the dataset.

Visualizing the result for Polynomial Regression

Here we will visualize the result of the Polynomial regression model; the code for
it is a little different from the above model.

The code for this is given below:

#Visualizing the result for Polynomial Regression
mtp.scatter(x,y,color="blue")
mtp.plot(x, lin_reg_2.predict(poly_regs.fit_transform(x)), color="red")
mtp.title("Bluff detection model(Polynomial Regression)")
mtp.xlabel("Position Levels")
mtp.ylabel("Salary")
mtp.show()

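In the above code, we have passed poly_regs.fit_transform(x) to
lin_reg_2.predict() rather than the stored x_poly. Both give the same polynomial
feature matrix here; the important point is that the Linear regressor object was
trained on polynomial features, so it must always be given polynomial features,
never the raw x.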

Output:

As we can see in the above output image, the predictions are close to the real
values. The plot will change as we change the degree.

For degree= 3:

If we change the degree to 3, we get a curve that follows the data more closely,
as shown in the below image.

So, as we can see in the above output image, the predicted salary for level
6.5 is roughly $170K-$190K, which suggests that the future employee is telling the
truth about his salary.

Degree= 4: Let's again change the degree to 4; the curve now passes closest to
the data points. Increasing the degree of the polynomial makes the curve fit the
training points more closely, although a very high degree can overfit a dataset
this small.

Predicting the final result with the Linear Regression model:

Now, we will predict the final output using the Linear regression model to see
whether the employee is telling the truth or bluffing. For this, we will use
the predict() method and pass the value 6.5. Below is the code for it:

lin_pred = lin_regs.predict([[6.5]])
print(lin_pred)

Output:

[330378.78787879]

Predicting the final result with the Polynomial Regression model:

Now, we will predict the final output using the Polynomial Regression model to
compare with Linear model. Below is the code for it:

poly_pred = lin_reg_2.predict(poly_regs.fit_transform([[6.5]]))
print(poly_pred)

Output:

[158862.45265153]

As we can see, the predicted output for the Polynomial Regression is
[158862.45265153], which is much closer to the real value; hence, we can say that
the future employee is telling the truth.

Classification Algorithm in Machine Learning
As we know, Supervised Machine Learning algorithms can be broadly classified
into Regression and Classification algorithms. With Regression algorithms, we
predicted continuous values; to predict categorical values, we need
Classification algorithms.

What is the Classification Algorithm?

The Classification algorithm is a Supervised Learning technique that is used to
identify the category of new observations on the basis of training data. In
Classification, a program learns from the given dataset or observations and then
classifies each new observation into one of a number of classes or groups, such as
Yes or No, 0 or 1, Spam or Not Spam, cat or dog, etc. Classes can also be called
targets, labels, or categories.

Unlike regression, the output variable of Classification is a category, not a
numeric value, such as "Green or Blue", "fruit or animal", etc. Since the
Classification algorithm is a Supervised learning technique, it takes labeled
input data, which means each input comes with its corresponding output.

In a classification algorithm, the input variable (x) is mapped to a discrete
output (y) by a function:

y=f(x), where y = categorical output

A common example of an ML classification algorithm is an email spam detector.

The main goal of the Classification algorithm is to identify the category of a
given observation, and these algorithms are mainly used to predict the output for
categorical data.

Classification algorithms can be better understood using the below diagram. In the
below diagram, there are two classes, Class A and Class B. The members of each
class have features that are similar to each other and dissimilar to those of the
other class.

The algorithm which implements the classification on a dataset is known as a
classifier. There are two types of Classifications:

o Binary Classifier: If the classification problem has only two possible
outcomes, then it is called a Binary Classifier.
Examples: YES or NO, MALE or FEMALE, SPAM or NOT SPAM, CAT or DOG,
etc.
o Multi-class Classifier: If a classification problem has more than two
outcomes, then it is called a Multi-class Classifier.
Example: Classifications of types of crops, Classification of types of music.

Learners in Classification Problems:


In the classification problems, there are two types of learners:

1. Lazy Learners: A lazy learner first stores the training dataset and waits until
it receives the test dataset. Classification is then done on the basis of the
most closely related data stored in the training dataset. It takes less
time in training but more time for predictions.
Example: K-NN algorithm, Case-based reasoning
2. Eager Learners: Eager learners develop a classification model based on the
training dataset before receiving a test dataset. In contrast to lazy learners,
an eager learner takes more time in learning and less time in prediction.
Example: Decision Trees, Naïve Bayes, ANN.

Types of ML Classification Algorithms:


Classification Algorithms can be mainly divided into two categories:

o Linear Models
    o Logistic Regression
    o Support Vector Machines
o Non-linear Models
    o K-Nearest Neighbours
    o Kernel SVM
    o Naïve Bayes
    o Decision Tree Classification
    o Random Forest Classification

Note: We will learn the above algorithms in later chapters.

Evaluating a Classification model:


Once our model is completed, it is necessary to evaluate its performance, whether
it is a Classification or a Regression model. For evaluating a Classification
model, we have the following ways:

1. Log Loss or Cross-Entropy Loss:

o It is used for evaluating the performance of a classifier whose output is a
probability value between 0 and 1.
o For a good binary Classification model, the value of log loss should be near to
0.
o The value of log loss increases as the predicted probability deviates from the
actual value.
o A lower log loss indicates better predictions by the model.
o For Binary classification, cross-entropy can be calculated as:

$-\left(y\log(p) + (1-y)\log(1-p)\right)$

Where y = actual output (0 or 1), p = predicted probability.
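
The formula above gives the loss for a single observation. A minimal sketch of averaging it over several observations is shown below; the function name binary_log_loss, the clipping constant eps, and the sample probabilities are assumptions made for illustration.

```python
import math


def binary_log_loss(y_true, p_pred, eps=1e-15):
    """Mean binary cross-entropy over paired labels and predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)  # clip so that log(0) is never evaluated
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)


labels = [1, 0, 1, 0]
confident = [0.9, 0.1, 0.8, 0.2]  # probabilities close to the true labels
poor = [0.4, 0.6, 0.3, 0.7]       # probabilities leaning the wrong way

print(binary_log_loss(labels, confident))
print(binary_log_loss(labels, poor))
```

The first value comes out smaller than the second, matching the statement that log loss grows as predictions move away from the actual labels.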

2. Confusion Matrix:

o The confusion matrix provides a matrix/table as output and describes the
performance of the model.
o It is also known as the error matrix.
o The matrix summarizes the prediction results, giving the total number of
correct predictions and incorrect predictions. The matrix looks like the below
table:

                      Actual Positive    Actual Negative
Predicted Positive    True Positive      False Positive
Predicted Negative    False Negative     True Negative
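
The four cells of the table can be counted directly from paired actual and predicted labels. The sketch below does this in plain Python; the function name confusion_counts and the two small label lists are invented for illustration.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count TP, FP, FN and TN for a binary problem."""
    tp = fp = fn = tn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive and actual == positive:
            tp += 1  # predicted positive, actually positive
        elif predicted == positive:
            fp += 1  # predicted positive, actually negative
        elif actual == positive:
            fn += 1  # predicted negative, actually positive
        else:
            tn += 1  # predicted negative, actually negative
    return {"TP": tp, "FP": fp, "FN": fn, "TN": tn}


actual = [1, 0, 1, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 1, 0, 1, 1, 0]
counts = confusion_counts(actual, predicted)
print(counts)
```

The correct predictions are TP + TN and the incorrect ones are FP + FN, which is how the matrix is read later in this tutorial.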

3. AUC-ROC curve:

o ROC stands for Receiver Operating Characteristic, and AUC stands for Area
Under the Curve.
o The ROC curve is a graph that shows the performance of the classification
model at different thresholds.
o The AUC-ROC curve is used to visualize the performance of a classification
model. It is defined for binary problems and is applied to multi-class problems
one class at a time (one class versus the rest).
o The ROC curve is plotted with TPR (True Positive Rate) on the Y-axis and
FPR (False Positive Rate) on the X-axis.
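
Each point of the ROC curve corresponds to one threshold. The following sketch computes the (FPR, TPR) pair for a few thresholds; the helper name roc_point, the scores, and the thresholds are assumptions chosen for illustration.

```python
def roc_point(y_true, scores, threshold):
    """Return (FPR, TPR) when scores >= threshold count as positive predictions."""
    tp = fp = fn = tn = 0
    for actual, score in zip(y_true, scores):
        predicted = 1 if score >= threshold else 0
        if predicted == 1 and actual == 1:
            tp += 1
        elif predicted == 1 and actual == 0:
            fp += 1
        elif predicted == 0 and actual == 1:
            fn += 1
        else:
            tn += 1
    # Assumes both classes occur in y_true, so neither denominator is zero.
    return fp / (fp + tn), tp / (tp + fn)


y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]  # illustrative predicted probabilities

for threshold in (0.2, 0.5, 0.9):
    print(threshold, roc_point(y_true, scores, threshold))
```

Lowering the threshold moves the point up and to the right (more true positives, but also more false positives); joining the points for all thresholds gives the curve whose area is the AUC.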

Use cases of Classification Algorithms
Classification algorithms can be used in different places. Below are some popular
use cases of Classification Algorithms:

o Email Spam Detection
o Speech Recognition
o Identification of cancer tumor cells
o Drugs Classification
o Biometric Identification, etc.

Logistic Regression in Machine Learning


o Logistic regression is one of the most popular Machine Learning algorithms,
which comes under the Supervised Learning technique. It is used for
predicting a categorical dependent variable using a given set of
independent variables.
o Logistic regression predicts the output of a categorical dependent variable.
Therefore the outcome must be a categorical or discrete value. It can be
either Yes or No, 0 or 1, True or False, etc., but instead of giving the exact
value 0 or 1, it gives probabilistic values which lie between 0 and 1.
o Logistic Regression is very similar to Linear Regression except in how it is
used. Linear Regression is used for solving Regression problems, whereas
Logistic regression is used for solving classification problems.
o In Logistic regression, instead of fitting a regression line, we fit an
"S"-shaped logistic function, whose output is bounded by the two limiting
values 0 and 1.
o The curve from the logistic function indicates the likelihood of something,
such as whether cells are cancerous or not, or whether a mouse is obese or not
based on its weight.
o Logistic Regression is a significant machine learning algorithm because it can
provide probabilities and classify new data using both continuous and discrete
features.
o Logistic Regression can be used to classify observations using different
types of data and can help determine which variables are most effective for
the classification. The below image shows the logistic function:

Note: Logistic regression uses the same predictive-modeling approach as regression, which is why it is
called logistic regression; but because it is used to classify samples, it falls under the classification
algorithms.

Logistic Function (Sigmoid Function):

o The sigmoid function is a mathematical function used to map the predicted
values to probabilities.
o It maps any real value into another value within the range 0 to 1.
o The output of logistic regression must lie between 0 and 1 and cannot go
beyond this limit, so it forms an "S"-shaped curve. This S-shaped curve is
called the Sigmoid function or the logistic function.
o In logistic regression, we use the concept of a threshold value, which
decides whether a probability is mapped to 0 or 1: values above the threshold
are assigned to 1, and values below the threshold are assigned to 0.
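
The points above can be sketched in a few lines of Python. The threshold of 0.5 and the function names sigmoid and classify are assumptions for this example; the two-branch form of the sigmoid is only there to avoid numeric overflow for large negative inputs.

```python
import math


def sigmoid(z):
    """Map any real number z to a value between 0 and 1."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)  # for negative z this form avoids overflow in exp(-z)
    return e / (1.0 + e)


def classify(z, threshold=0.5):
    """Return 1 when the probability reaches the threshold, otherwise 0."""
    return 1 if sigmoid(z) >= threshold else 0


for z in (-5.0, -1.0, 0.0, 1.0, 5.0):
    print(z, round(sigmoid(z), 4), classify(z))
```

Raising the threshold makes the classifier more reluctant to predict class 1, which is the trade-off the ROC curve visualizes.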

Assumptions for Logistic Regression:

o The dependent variable must be categorical in nature.
o The independent variables should not have multi-collinearity.

Logistic Regression Equation:


The Logistic regression equation can be obtained from the Linear Regression
equation. The mathematical steps to get the Logistic Regression equation are given
below:

o We know the equation of the straight line can be written as:

$y = b_0 + b_1x_1 + b_2x_2 + \dots + b_nx_n$

o In Logistic Regression y can be between 0 and 1 only, so let's take the ratio
of y to (1-y), which is 0 for y=0 and tends to infinity as y approaches 1:

$\frac{y}{1-y}$

o But we need a range from -[infinity] to +[infinity], so we take the logarithm
of this ratio, and the equation becomes:

$\log\left[\frac{y}{1-y}\right] = b_0 + b_1x_1 + b_2x_2 + \dots + b_nx_n$

The above equation is the final equation for Logistic Regression.
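
To connect this equation with the sigmoid function described earlier, the short derivation below solves the final equation for y. It assumes the linear part is written as z = b_0 + b_1x_1 + ... + b_nx_n; the symbol z is introduced here only for brevity.

```latex
\begin{aligned}
\log\frac{y}{1-y} &= z, \qquad z = b_0 + b_1x_1 + \dots + b_nx_n \\
\frac{y}{1-y} &= e^{z} \\
y &= (1-y)\,e^{z} \\
y\,(1 + e^{z}) &= e^{z} \\
y &= \frac{e^{z}}{1+e^{z}} = \frac{1}{1+e^{-z}}
\end{aligned}
```

So the predicted probability y is the sigmoid of the linear combination of the inputs, which is why y always stays between 0 and 1.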

Type of Logistic Regression:


On the basis of the categories, Logistic Regression can be classified into three
types:

o Binomial: In binomial Logistic regression, there can be only two possible
types of the dependent variable, such as 0 or 1, Pass or Fail, etc.
o Multinomial: In multinomial Logistic regression, there can be 3 or more
possible unordered types of the dependent variable, such as "cat", "dog", or
"sheep".
o Ordinal: In ordinal Logistic regression, there can be 3 or more possible
ordered types of the dependent variable, such as "Low", "Medium", or "High".

Python Implementation of Logistic Regression (Binomial)


To understand the implementation of Logistic Regression in Python, we will use the
below example:


Example: There is a dataset which contains information about various users
obtained from social networking sites. A car-making company has recently
launched a new SUV, and the company wants to check how many users from the
dataset want to purchase the car.

For this problem, we will build a Machine Learning model using the Logistic
regression algorithm. The dataset is shown in the below image. In this problem, we
will predict the Purchased variable (dependent variable) by using Age and
Salary (independent variables).

Steps in Logistic Regression: To implement Logistic Regression using
Python, we will use the same steps as in the previous Regression topics.
Below are the steps:

o Data Pre-processing step
o Fitting Logistic Regression to the Training set
o Predicting the test result
o Testing the accuracy of the result (creation of the Confusion matrix)
o Visualizing the test set result.

1. Data Pre-processing step: In this step, we will pre-process/prepare the data
so that we can use it in our code efficiently. It will be the same as we have done
in the Data pre-processing topic. The code for this is given below:

#Data Pre-processing Step
# importing libraries
import numpy as nm
import matplotlib.pyplot as mtp
import pandas as pd

#importing datasets
data_set= pd.read_csv('user_data.csv')

By executing the above lines of code, we will get the dataset as the output.
Consider the given image:

Now, we will extract the dependent and independent variables from the given
dataset. Below is the code for it:

#Extracting Independent and dependent Variable
x= data_set.iloc[:, [2,3]].values
y= data_set.iloc[:, 4].values

In the above code, we have taken [2, 3] for x because our independent variables
are age and salary, which are at index 2, 3. And we have taken 4 for y variable
because our dependent variable is at index 4. The output will be:

Now we will split the dataset into a training set and test set. Below is the code for
it:

# Splitting the dataset into training and test set.
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test= train_test_split(x, y, test_size= 0.25, random_state=0)

The output for this is given below:

For test set:

For training set:

In logistic regression, we will do feature scaling so that both independent
variables are on a comparable scale, which helps the model give accurate
predictions. Here we will scale only the independent variables, because the
dependent variable has only the values 0 and 1. Below is the code for it:

#feature Scaling
from sklearn.preprocessing import StandardScaler
st_x= StandardScaler()
x_train= st_x.fit_transform(x_train)
x_test= st_x.transform(x_test)

The scaled output is given below:
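
To show what this scaling does numerically, here is a minimal plain-Python sketch of standardization for one column. It mirrors the fit-on-train, transform-on-test pattern used above; the function names fit_scaler and transform and the age values are invented for illustration, and the population standard deviation is used, as StandardScaler does.

```python
import statistics


def fit_scaler(column):
    """Learn the mean and population standard deviation from training values."""
    return statistics.mean(column), statistics.pstdev(column)


def transform(column, mean, std):
    """Standardize values using statistics learned from the training data."""
    return [(value - mean) / std for value in column]


train_age = [20, 30, 40, 50]  # illustrative training values
test_age = [35, 60]           # illustrative test values

mean, std = fit_scaler(train_age)                # fitted on the training set only
scaled_train = transform(train_age, mean, std)
scaled_test = transform(test_age, mean, std)     # reuses the training statistics

print(scaled_train)
print(scaled_test)
```

The test set is transformed with the mean and deviation learned from the training set, so no information from the test data leaks into the preprocessing.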

2. Fitting Logistic Regression to the Training set:

We have prepared our dataset, and now we will train the model using the
training set. For fitting the model to the training set, we will import
the LogisticRegression class of the sklearn library.

After importing the class, we will create a classifier object and use it to fit the
logistic regression model to the training set. Below is the code for it:

#Fitting Logistic Regression to the training set
from sklearn.linear_model import LogisticRegression
classifier= LogisticRegression(random_state=0)
classifier.fit(x_train, y_train)

Output: By executing the above code, we will get the below output:

Out[5]:

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=0, solver='warn', tol=0.0001, verbose=0,
                   warm_start=False)

Hence our model is well fitted to the training set.

3. Predicting the Test Result

Our model is well trained on the training set, so we will now predict the result by
using test set data. Below is the code for it:

#Predicting the test set result
y_pred= classifier.predict(x_test)

In the above code, we have created a y_pred vector to hold the predicted test set
results.

Output: By executing the above code, a new vector (y_pred) will be created under
the variable explorer option. It can be seen as:

The above output image shows, for each user, the prediction of whether they will
purchase the car or not.

4. Test Accuracy of the result

Now we will create the confusion matrix to check the accuracy of the
classification. To create it, we need to import the confusion_matrix function of
the sklearn library. After importing the function, we will call it and store the
result in a new variable cm. The function takes two main parameters: y_true (the
actual values) and y_pred (the predicted values returned by the classifier).
Below is the code for it:

#Creating the Confusion matrix
from sklearn.metrics import confusion_matrix
cm= confusion_matrix(y_test, y_pred)

Output:

By executing the above code, a new confusion matrix will be created. Consider the
below image:

We can find the accuracy of the predicted result by interpreting the confusion
matrix. From the above output, we can see 65+24= 89 correct predictions and
8+3= 11 incorrect predictions.
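
These counts convert directly into an accuracy figure. The short sketch below uses only the numbers quoted above; the variable names are chosen for illustration.

```python
correct = 65 + 24    # correct predictions read from the confusion matrix
incorrect = 8 + 3    # incorrect predictions read from the confusion matrix

total = correct + incorrect
accuracy = correct / total
error_rate = incorrect / total

print(f"total = {total}, accuracy = {accuracy:.2f}, error rate = {error_rate:.2f}")
```

With 100 test observations, the model classifies 89% of them correctly.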

5. Visualizing the training set result

Finally, we will visualize the training set result. To visualize the result, we will
use ListedColormap class of matplotlib library. Below is the code for it:

#Visualizing the training set result
from matplotlib.colors import ListedColormap
x_set, y_set = x_train, y_train
x1, x2 = nm.meshgrid(nm.arange(start = x_set[:, 0].min() - 1, stop = x_set[:, 0].max() + 1, step =0.01),
nm.arange(start = x_set[:, 1].min() - 1, stop = x_set[:, 1].max() + 1, step = 0.01))
mtp.contourf(x1, x2, classifier.predict(nm.array([x1.ravel(), x2.ravel()]).T).reshape(x1.shape),
alpha = 0.75, cmap = ListedColormap(('purple','green' )))
mtp.xlim(x1.min(), x1.max())
mtp.ylim(x2.min(), x2.max())
for i, j in enumerate(nm.unique(y_set)):
    mtp.scatter(x_set[y_set == j, 0], x_set[y_set == j, 1],
        c = ListedColormap(('purple', 'green'))(i), label = j)
mtp.title('Logistic Regression (Training set)')
mtp.xlabel('Age')
mtp.ylabel('Estimated Salary')
mtp.legend()
mtp.show()

In the above code, we have imported the ListedColormap class of the Matplotlib
library to create the colormap for visualizing the result. We have created two new
variables x_set and y_set to stand in for x_train and y_train. After that, we have
used the nm.meshgrid command to create a rectangular grid that extends from 1
below the minimum to 1 above the maximum of each feature. The grid points are
spaced at a resolution of 0.01.

To create a filled contour, we have used the mtp.contourf command, which creates
regions of the provided colors (purple and green). To this function, we have
passed the output of classifier.predict, so that each grid point is colored by the
class the classifier predicts for it.
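
The grid-then-predict pattern can be seen on a much smaller grid. In the sketch below, toy_predict is a hypothetical stand-in for classifier.predict, and the coarse step of 0.5 is chosen only to keep the arrays small enough to inspect.

```python
import numpy as np


def toy_predict(points):
    """Hypothetical stand-in for classifier.predict: class 1 when x1 + x2 > 0."""
    return (points[:, 0] + points[:, 1] > 0).astype(int)


# Coarse 5 x 5 grid covering -1.0 to 1.0 on both axes.
x1, x2 = np.meshgrid(np.arange(-1.0, 1.5, 0.5), np.arange(-1.0, 1.5, 0.5))

# Flatten both coordinate arrays and pair them up: one row per grid point.
grid_points = np.array([x1.ravel(), x2.ravel()]).T

# Predict a class for every grid point, then restore the grid shape so the
# result lines up with x1 and x2 (the form a filled-contour plot expects).
labels = toy_predict(grid_points).reshape(x1.shape)

print(x1.shape, grid_points.shape, labels.shape)
print(labels)
```

Each entry of labels is the predicted class at the matching (x1, x2) position, so coloring the grid by these values paints the two decision regions.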

Output: By executing the above code, we will get the below output:

The graph can be explained in the below points:

o In the above graph, we can see that there are some Green points within the
green region and Purple points within the purple region.
o All these data points are the observation points from the training set, which
show the result for the Purchased variable.
o This graph is made by using two independent variables, i.e., Age on the
x-axis and Estimated Salary on the y-axis.
o The purple points are the observations for which Purchased (dependent
variable) is 0, i.e., users who did not purchase the SUV car.
o The green points are the observations for which Purchased (dependent
variable) is 1, i.e., users who purchased the SUV car.
o We can also estimate from the graph that younger users with a low salary
did not purchase the car, whereas older users with a high estimated salary
purchased the car.
o But there are some purple points in the green region (buying the car) and
some green points in the purple region (not buying the car). These are the
observations that the model misclassifies: a few users did not purchase the
car although the model places them in the buying region, and a few purchased
it although the model places them in the non-buying region.

The goal of the classifier:

We have successfully visualized the training set result for the logistic regression.
Our goal for this classification is to separate the users who purchased the SUV car
from those who did not. From the output graph, we can clearly see the two regions
(Purple and Green) with the observation points. The Purple region is for those
users who didn't buy the car, and the Green region is for those users who
purchased the car.

Linear Classifier:

As we can see from the graph, the boundary between the two regions is a straight
line, i.e., the classifier is linear in nature, because Logistic Regression is a
linear model. In further topics, we will learn about non-linear classifiers.
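
The reason the boundary is straight can be sketched as follows. With a threshold of 0.5, the boundary is where the linear part of the model equals zero, which is the equation of a line in the (Age, Salary) plane. The coefficients below are hypothetical values on scaled features, not the ones learned by the model above.

```python
import math


def predict_proba(age, salary, w_age, w_salary, bias):
    """Probability of class 1 from a two-feature logistic model."""
    z = w_age * age + w_salary * salary + bias
    return 1.0 / (1.0 + math.exp(-z))


def boundary_salary(age, w_age, w_salary, bias):
    """Salary at which the probability is 0.5: w_age*age + w_salary*salary + bias = 0."""
    return -(w_age * age + bias) / w_salary


# Hypothetical coefficients, chosen only for illustration.
w_age, w_salary, bias = 2.0, 1.0, -0.5

for age in (-1.0, 0.0, 1.0):
    salary = boundary_salary(age, w_age, w_salary, bias)
    print(age, salary, predict_proba(age, salary, w_age, w_salary, bias))
```

Every printed probability is 0.5, and the boundary salaries change by the same amount for each unit step in age, so the boundary points lie on one straight line.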

Visualizing the test set result:

Our model is well trained using the training dataset. Now, we will visualize the
result for new observations (Test set). The code for the test set will remain same
as above except that here we will use x_test and y_test instead of x_train and
y_train. Below is the code for it:

#Visualizing the test set result
from matplotlib.colors import ListedColormap
x_set, y_set = x_test, y_test
x1, x2 = nm.meshgrid(nm.arange(start = x_set[:, 0].min() - 1, stop = x_set[:, 0].max() + 1, step =0.01),
nm.arange(start = x_set[:, 1].min() - 1, stop = x_set[:, 1].max() + 1, step = 0.01))
mtp.contourf(x1, x2, classifier.predict(nm.array([x1.ravel(), x2.ravel()]).T).reshape(x1.shape),
alpha = 0.75, cmap = ListedColormap(('purple','green' )))
mtp.xlim(x1.min(), x1.max())
mtp.ylim(x2.min(), x2.max())
for i, j in enumerate(nm.unique(y_set)):
    mtp.scatter(x_set[y_set == j, 0], x_set[y_set == j, 1],
        c = ListedColormap(('purple', 'green'))(i), label = j)
mtp.title('Logistic Regression (Test set)')
mtp.xlabel('Age')
mtp.ylabel('Estimated Salary')
mtp.legend()
mtp.show()

Output:

The above graph shows the test set result. As we can see, the graph is divided into
two regions (Purple and Green). Most Green observations are in the green region,
and most Purple observations are in the purple region, so we can say the model
predicts well. Some of the green and purple data points are in the wrong
regions; these are the errors we have already counted using the confusion
matrix (11 incorrect outputs).

Hence our model is pretty good and ready to make new predictions for this
classification problem.

K-Nearest Neighbor(KNN) Algorithm for Machine Learning
o K-Nearest Neighbour is one of the simplest Machine Learning algorithms
based on the Supervised Learning technique.
o The K-NN algorithm assumes similarity between the new case/data and the
available cases and puts the new case into the category that is most similar to
the available categories.
o The K-NN algorithm stores all the available data and classifies a new data
point based on similarity. This means that when new data appears, it can be
easily classified into a well-suited category by using the K-NN algorithm.
o The K-NN algorithm can be used for Regression as well as for Classification,
but mostly it is used for Classification problems.
o K-NN is a non-parametric algorithm, which means it does not make any
assumption about the underlying data.
o It is also called a lazy learner algorithm because it does not learn from the
training set immediately; instead, it stores the dataset and, at the time of
classification, performs an action on the dataset.
o At the training phase, the KNN algorithm just stores the dataset; when it gets
new data, it classifies that data into the category that is most similar to
the new data.
o Example: Suppose we have an image of a creature that looks similar to both a
cat and a dog, and we want to know whether it is a cat or a dog. For this
identification, we can use the KNN algorithm, as it works on a similarity
measure. Our KNN model will compare the features of the new image with those of
the cat and dog images and, based on the most similar features, put it
in either the cat or the dog category.

Why do we need a K-NN Algorithm?


Suppose there are two categories, i.e., Category A and Category B, and we have a
new data point x1. In which of these categories will this data point lie? To solve
this type of problem, we need a K-NN algorithm. With the help of K-NN, we can
easily identify the category or class of a particular data point. Consider the
below diagram:

How does K-NN work?
The working of K-NN can be explained on the basis of the below algorithm:

o Step-1: Select the number K of neighbors.
o Step-2: Calculate the Euclidean distance from the new data point to each
training data point.
o Step-3: Take the K nearest neighbors as per the calculated Euclidean
distance.
o Step-4: Among these K neighbors, count the number of data points in
each category.
o Step-5: Assign the new data point to the category for which the number of
neighbors is maximum.
o Step-6: Our model is ready.
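
The steps above can be sketched in plain Python as follows. This is a minimal illustration rather than the scikit-learn implementation used later; the function name knn_classify and the six sample points are invented for the example.

```python
import math
from collections import Counter


def knn_classify(train_points, train_labels, new_point, k):
    """Classify new_point by majority vote among its k nearest training points."""
    # Step 2: Euclidean distance from the new point to every training point.
    distances = [
        (math.dist(point, new_point), label)
        for point, label in zip(train_points, train_labels)
    ]
    # Step 3: keep the k nearest neighbors.
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    # Steps 4 and 5: count the labels and return the most common one.
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Illustrative data: three points per category.
points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.5), (6.0, 6.0), (6.5, 7.0), (7.0, 6.5)]
labels = ["A", "A", "A", "B", "B", "B"]

print(knn_classify(points, labels, (2.0, 2.0), k=3))
print(knn_classify(points, labels, (6.0, 6.5), k=3))
```

Note that there is no separate training step: the "model" is just the stored points, which is what makes K-NN a lazy learner.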

Suppose we have a new data point and we need to put it in the required category.
Consider the below image:

o Firstly, we will choose the number of neighbors, so we will choose k=5.
o Next, we will calculate the Euclidean distance between the data points.
The Euclidean distance is the distance between two points, which we have
already studied in geometry. For two points A(X1, Y1) and B(X2, Y2), it can be
calculated as:

$d(A, B) = \sqrt{(X_2 - X_1)^2 + (Y_2 - Y_1)^2}$

o By calculating the Euclidean distance, we get the nearest neighbors: three
nearest neighbors in category A and two nearest neighbors in category B.
Consider the below image:

o As we can see, 3 of the 5 nearest neighbors are from category A, hence this
new data point must belong to category A.

How to select the value of K in the K-NN Algorithm?


Below are some points to remember while selecting the value of K in the K-NN
algorithm:

o There is no particular way to determine the best value for "K", so we
need to try some values to find the best out of them. A commonly used value for
K is 5.
o A very low value for K, such as K=1 or K=2, can be noisy and makes the model
sensitive to outliers.
o Large values for K smooth out noise, but they increase the computation and can
blur the boundaries between categories.
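
One way to "try some values" is to measure accuracy for each candidate K on held-out data. The sketch below uses leave-one-out evaluation on a tiny invented dataset that contains one noisy point; the function names, the data, and the candidate values are all assumptions for illustration.

```python
import math
from collections import Counter


def knn_predict(train, query, k):
    """train is a list of (point, label) pairs; return the majority label."""
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


def leave_one_out_accuracy(data, k):
    """Predict each point from all the other points and return the hit rate."""
    hits = 0
    for i, (point, label) in enumerate(data):
        rest = data[:i] + data[i + 1:]
        if knn_predict(rest, point, k) == label:
            hits += 1
    return hits / len(data)


# Two clusters plus one noisy "A" point sitting inside the "B" cluster.
data = [
    ((1.0, 1.0), "A"), ((1.2, 0.8), "A"), ((0.9, 1.3), "A"), ((1.1, 1.1), "A"),
    ((3.0, 3.0), "B"), ((3.2, 2.9), "B"), ((2.8, 3.1), "B"), ((3.1, 3.3), "B"),
    ((2.9, 3.0), "A"),
]

for k in (1, 3):
    print(k, leave_one_out_accuracy(data, k))
```

On this data K=1 lets the single noisy point mislead its neighbors, while K=3 outvotes it, which illustrates why very small values of K are sensitive to outliers.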

Advantages of KNN Algorithm:


o It is simple to implement.
o It is robust to noisy training data when K is not too small.
o It can be more effective if the training data is large.

Disadvantages of KNN Algorithm:

o The value of K always needs to be determined, which may be complex at times.
o The computation cost is high because the distance to every training sample
must be calculated for each prediction.

Python implementation of the KNN algorithm


To do the Python implementation of the K-NN algorithm, we will use the same
problem and dataset which we used in Logistic Regression. But here we will
try to improve the performance of the model. Below is the problem description:

Problem for K-NN Algorithm: A car manufacturer has produced a new SUV car. The
company wants to show ads to the users who are interested in buying that SUV.
For this problem, we have a dataset that contains information about multiple
users collected through a social network. The dataset contains lots of
information, but we will consider Estimated Salary and Age as the independent
variables and Purchased as the dependent variable. Below is the dataset:

Steps to implement the K-NN algorithm:

o Data Pre-processing step
o Fitting the K-NN algorithm to the Training set
o Predicting the test result
o Testing the accuracy of the result (creation of the Confusion matrix)
o Visualizing the test set result.

Data Pre-Processing Step:

The Data Pre-processing step will remain exactly the same as Logistic Regression.
Below is the code for it:

# importing libraries
import numpy as nm
import matplotlib.pyplot as mtp
import pandas as pd

#importing datasets
data_set= pd.read_csv('user_data.csv')

#Extracting Independent and dependent Variable
x= data_set.iloc[:, [2,3]].values
y= data_set.iloc[:, 4].values

# Splitting the dataset into training and test set.
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test= train_test_split(x, y, test_size= 0.25, random_state=0)

#feature Scaling
from sklearn.preprocessing import StandardScaler
st_x= StandardScaler()
x_train= st_x.fit_transform(x_train)
x_test= st_x.transform(x_test)

By executing the above code, our dataset is imported into our program and
pre-processed. After feature scaling, our test dataset will look like:

From the above output image, we can see that our data is successfully scaled.

o Fitting K-NN classifier to the Training data:
Now we will fit the K-NN classifier to the training data. To do this, we will
import the KNeighborsClassifier class of the sklearn.neighbors module. After
importing the class, we will create the classifier object of the class. The
parameters of this class will be:
    o n_neighbors: Defines the number of neighbors the algorithm uses.
    The default is 5.
    o metric='minkowski': This is the default metric, and it determines how
    the distance between the points is measured.
    o p=2: With the Minkowski metric, p=2 is equivalent to the standard
    Euclidean metric.
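
The relationship between the Minkowski metric and the Euclidean distance can be checked with a short sketch. The function name minkowski and the two sample points are assumptions for this example.

```python
import math


def minkowski(a, b, p):
    """Minkowski distance of order p between two points of equal length."""
    return sum(abs(u - v) ** p for u, v in zip(a, b)) ** (1 / p)


a, b = (0.0, 0.0), (3.0, 4.0)

print(minkowski(a, b, p=2))  # same as the Euclidean distance
print(math.dist(a, b))       # Euclidean distance from the standard library
print(minkowski(a, b, p=1))  # p=1 gives the Manhattan (city-block) distance
```

Changing p therefore changes what "nearest" means, which can change which neighbors are selected.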

And then we will fit the classifier to the training data. Below is the code for it:

#Fitting K-NN classifier to the training set
from sklearn.neighbors import KNeighborsClassifier
classifier= KNeighborsClassifier(n_neighbors=5, metric='minkowski', p=2 )
classifier.fit(x_train, y_train)

Output: By executing the above code, we will get the output as:

Out[10]:
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
                     metric_params=None, n_jobs=None, n_neighbors=5, p=2,
                     weights='uniform')

o Predicting the Test Result: To predict the test set result, we will create
a y_pred vector as we did in Logistic Regression. Below is the code for it:

#Predicting the test set result
y_pred= classifier.predict(x_test)

Output:

The output for the above code will be:

o Creating the Confusion Matrix:
Now we will create the Confusion Matrix for our K-NN model to see the
accuracy of the classifier. Below is the code for it:

#Creating the Confusion matrix
from sklearn.metrics import confusion_matrix
cm= confusion_matrix(y_test, y_pred)

In the above code, we have imported the confusion_matrix function and stored its
result in the variable cm.

Output: By executing the above code, we will get the matrix as below:

In the above image, we can see there are 64+29= 93 correct predictions and
3+4= 7 incorrect predictions, whereas, in Logistic Regression, there were 11
incorrect predictions. So we can say that the performance of the model is
improved by using the K-NN algorithm.

o Visualizing the Training set result:


Now, we will visualize the training set result for K-NN model. The code will
remain same as we did in Logistic Regression, except the name of the graph.
Below is the code for it:

#Visualizing the training set result
from matplotlib.colors import ListedColormap
x_set, y_set = x_train, y_train
x1, x2 = nm.meshgrid(nm.arange(start = x_set[:, 0].min() - 1, stop = x_set[:, 0].max() + 1, step =0.01),
nm.arange(start = x_set[:, 1].min() - 1, stop = x_set[:, 1].max() + 1, step = 0.01))
mtp.contourf(x1, x2, classifier.predict(nm.array([x1.ravel(), x2.ravel()]).T).reshape(x1.shape),
alpha = 0.75, cmap = ListedColormap(('red','green' )))
mtp.xlim(x1.min(), x1.max())
mtp.ylim(x2.min(), x2.max())
for i, j in enumerate(nm.unique(y_set)):
    mtp.scatter(x_set[y_set == j, 0], x_set[y_set == j, 1],
        c = ListedColormap(('red', 'green'))(i), label = j)
mtp.title('K-NN Algorithm (Training set)')
mtp.xlabel('Age')
mtp.ylabel('Estimated Salary')
mtp.legend()
mtp.show()

Output:

By executing the above code, we will get the below graph:

The output graph is different from the graph that we obtained in Logistic
Regression. It can be understood through the points below:

o As we can see, the graph shows red points and green points.
The green points are for the Purchased(1) variable and the red points for the
not Purchased(0) variable.
o The graph shows an irregular boundary instead of a straight line or a
smooth curve because it is a K-NN algorithm, i.e., each point is classified
by its nearest neighbors.
o The graph has classified users into the correct categories, as most of the
users who didn't buy the SUV are in the red region and most of the users who
bought the SUV are in the green region.
o The graph shows a good result, but there are still some green points
in the red region and red points in the green region. This is not a big
issue, as tolerating a few such points prevents the model from overfitting.
o Hence our model is well trained.
o Visualizing the Test set result:
After training the model, we will now test it on new data, i.e., the test
dataset. The code remains the same except for some minor changes: x_train
and y_train are replaced by x_test and y_test.
Below is the code for it:

#Visualizing the test set result
#(nm is numpy and mtp is matplotlib.pyplot, as imported in the pre-processing step)
from matplotlib.colors import ListedColormap
x_set, y_set = x_test, y_test
#Build a fine grid that covers the range of both features
x1, x2 = nm.meshgrid(nm.arange(start = x_set[:, 0].min() - 1, stop = x_set[:, 0].max() + 1, step = 0.01),
                     nm.arange(start = x_set[:, 1].min() - 1, stop = x_set[:, 1].max() + 1, step = 0.01))
#Color every grid point by its predicted class to draw the decision regions
mtp.contourf(x1, x2, classifier.predict(nm.array([x1.ravel(), x2.ravel()]).T).reshape(x1.shape),
             alpha = 0.75, cmap = ListedColormap(('red', 'green')))
mtp.xlim(x1.min(), x1.max())
mtp.ylim(x2.min(), x2.max())
#Overlay the actual observations, one color per class
for i, j in enumerate(nm.unique(y_set)):
    mtp.scatter(x_set[y_set == j, 0], x_set[y_set == j, 1],
                c = ListedColormap(('red', 'green'))(i), label = j)
mtp.title('K-NN algorithm (Test set)')
mtp.xlabel('Age')
mtp.ylabel('Estimated Salary')
mtp.legend()
mtp.show()

Output:

The above graph shows the output for the test data set. As we can see in the
graph, the predicted output is good, as most of the red points are in the red
region and most of the green points are in the green region.

However, there are a few green points in the red region and a few red points in the
green region. These are the incorrect predictions that we observed in the
confusion matrix (7 incorrect outputs).

Support Vector Machine Algorithm
Support Vector Machine or SVM is one of the most popular Supervised Learning
algorithms, which is used for Classification as well as Regression problems.
However, primarily, it is used for Classification problems in Machine Learning.

The goal of the SVM algorithm is to create the best line or decision boundary that
can segregate n-dimensional space into classes so that we can easily put the new
data point in the correct category in the future. This best decision boundary is
called a hyperplane.

SVM chooses the extreme points/vectors that help in creating the hyperplane.
These extreme cases are called support vectors, and hence the algorithm is termed
Support Vector Machine. Consider the below diagram, in which two different
categories are classified using a decision boundary or hyperplane:

Example: SVM can be understood with the example that we used for the KNN
classifier. Suppose we see a strange cat that also has some features of dogs, and
we want a model that can accurately identify whether it is a cat or a dog. Such a
model can be created by using the SVM algorithm. We first train the model
with many images of cats and dogs so that it learns the different features of
cats and dogs, and then we test it with this strange creature. The SVM draws a
decision boundary between the two classes (cat and dog) based on the extreme
cases (support vectors) of each class. On the basis of these support vectors, it
will classify the creature as a cat. Consider the below diagram:

SVM algorithm can be used for Face detection, image classification, text
categorization, etc.

Types of SVM
SVM can be of two types:

o Linear SVM: Linear SVM is used for linearly separable data, which means that if
a dataset can be classified into two classes by using a single straight line,
then such data is termed linearly separable data, and the classifier used is
called a Linear SVM classifier.
o Non-linear SVM: Non-linear SVM is used for non-linearly separable data,
which means that if a dataset cannot be classified by using a straight line, then
such data is termed non-linear data, and the classifier used is called a
Non-linear SVM classifier.

Hyperplane and Support Vectors in the SVM algorithm:


Hyperplane: There can be multiple lines/decision boundaries to segregate the
classes in n-dimensional space, but we need to find out the best decision boundary
that helps to classify the data points. This best boundary is known as the
hyperplane of SVM.

The dimensions of the hyperplane depend on the number of features in the dataset:
if there are 2 features (as shown in the image), the hyperplane is a straight
line, and if there are 3 features, the hyperplane is a 2-dimensional plane.

We always create the hyperplane that has the maximum margin, which means the
maximum distance between the hyperplane and the nearest data points of each class.

Support Vectors:

The data points or vectors that are closest to the hyperplane and that affect
its position are termed support vectors. Since these vectors support the
hyperplane, they are called support vectors.

How does SVM work?


Linear SVM:

The working of the SVM algorithm can be understood by using an example.


Suppose we have a dataset that has two tags (green and blue) and two features,
x1 and x2. We want a classifier that can classify a pair (x1, x2) of coordinates
as either green or blue. Consider the below image:

Since this is a 2-d space, we can easily separate these two classes with a
straight line. But there can be multiple lines that separate these classes.
Consider the below image:

Hence, the SVM algorithm helps to find the best line or decision boundary; this
best boundary is called a hyperplane. The SVM algorithm finds the points from
both classes that lie closest to the boundary. These points are called support
vectors. The distance between the support vectors and the hyperplane is called
the margin, and the goal of SVM is to maximize this margin. The hyperplane with
the maximum margin is called the optimal hyperplane.
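
The idea of choosing the line with the largest margin can be sketched in a few lines of Python. The example below is only an illustration under stated assumptions: the data points and the three candidate lines are hypothetical, and the code simply picks the best of the given candidates, whereas a real SVM solves an optimization problem over all possible lines.

```python
import math

def distance_to_line(w, b, point):
    """Perpendicular distance from `point` to the line w[0]*x1 + w[1]*x2 + b = 0."""
    return abs(w[0] * point[0] + w[1] * point[1] + b) / math.hypot(w[0], w[1])

def separates(w, b, points, labels):
    """True if every point lies on the side of the line matching its label (+1 or -1)."""
    return all(label * (w[0] * p[0] + w[1] * p[1] + b) > 0
               for p, label in zip(points, labels))

def margin(w, b, points):
    """Distance from the line to its closest data point."""
    return min(distance_to_line(w, b, p) for p in points)

# Hypothetical 2-d data: label -1 for one class ("green"), +1 for the other ("blue").
points = [(1, 1), (2, 1), (1, 2), (4, 4), (5, 4), (4, 5)]
labels = [-1, -1, -1, 1, 1, 1]

# Three hand-picked candidate lines (w, b); each one separates the two classes.
candidates = [((1, 0), -3), ((1, 1), -5.5), ((0, 1), -3.5)]

valid = [(w, b) for w, b in candidates if separates(w, b, points, labels)]
best_w, best_b = max(valid, key=lambda wb: margin(wb[0], wb[1], points))
print(best_w, best_b, round(margin(best_w, best_b, points), 3))
```

The points that attain the minimum distance for the chosen line play the role of the support vectors: moving any other point slightly would not change which line wins.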

Non-Linear SVM:

If data is linearly arranged, then we can separate it by using a straight line, but for
non-linear data, we cannot draw a single straight line. Consider the below image:

So to separate these data points, we need to add one more dimension. For linear
data, we used two dimensions, x and y, so for non-linear data, we will add a
third dimension z. It can be calculated as:

z = x^2 + y^2

By adding the third dimension, the sample space will become as below image:

So now, SVM will divide the datasets into classes in the following way. Consider the
below image:

Since we are in 3-d space, the boundary looks like a plane parallel to the x-axis.
If we convert it back to 2-d space with z = 1, it becomes:

Hence we get a circle of radius 1 as the decision boundary in the case of non-linear data.
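
The effect of the extra dimension can be verified with a short sketch. The data below is hypothetical (an inner cluster surrounded by a ring), and the threshold z = 1 matches the value used in the text; the point is only that a single flat cut in the lifted space separates classes that no straight line separates in 2-d.

```python
import math

def lift(point):
    """Map a 2-d point (x, y) to 3-d by adding z = x^2 + y^2."""
    x, y = point
    return (x, y, x ** 2 + y ** 2)

# Hypothetical data: an inner cluster (class 0) surrounded by a ring (class 1).
angles = (0.0, 1.0, 2.0, 3.0, 4.0, 5.0)
inner = [(0.5 * math.cos(t), 0.5 * math.sin(t)) for t in angles]
outer = [(2.0 * math.cos(t), 2.0 * math.sin(t)) for t in angles]

def classify(point, threshold=1.0):
    """Linear rule in the lifted space: compare z with the plane z = threshold."""
    return 1 if lift(point)[2] > threshold else 0

print([classify(p) for p in inner])  # [0, 0, 0, 0, 0, 0]
print([classify(p) for p in outer])  # [1, 1, 1, 1, 1, 1]
```

Projected back to 2-d, the rule z > 1 is exactly "outside the circle x^2 + y^2 = 1".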

Python Implementation of Support Vector Machine

Now we will implement the SVM algorithm using Python. Here we will use the same
dataset user_data, which we have used in Logistic regression and KNN
classification.

o Data Pre-processing step

Till the Data pre-processing step, the code will remain the same. Below is the code:

#Data Pre-processing Step
# importing libraries
import numpy as nm
import matplotlib.pyplot as mtp
import pandas as pd

#importing datasets
data_set = pd.read_csv('user_data.csv')

#Extracting Independent and dependent Variable
x = data_set.iloc[:, [2,3]].values
y = data_set.iloc[:, 4].values

# Splitting the dataset into training and test set.
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.25, random_state = 0)

#feature Scaling
from sklearn.preprocessing import StandardScaler
st_x = StandardScaler()
x_train = st_x.fit_transform(x_train)
x_test = st_x.transform(x_test)

After executing the above code, we will pre-process the data. The code will give
the dataset as:

The scaled output for the test set will be:

Fitting the SVM classifier to the training set:

Now the training set will be fitted to the SVM classifier. To create the SVM
classifier, we will import the SVC class from the sklearn.svm library. Below is
the code for it:

from sklearn.svm import SVC # "Support vector classifier"
classifier = SVC(kernel='linear', random_state=0)
classifier.fit(x_train, y_train)

In the above code, we have used kernel='linear', as here we are creating an SVM for
linearly separable data. However, we can change the kernel for non-linear data. We
then fitted the classifier to the training dataset (x_train, y_train).

Output:

Out[8]:
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
decision_function_shape='ovr', degree=3, gamma='auto_deprecated',
kernel='linear', max_iter=-1, probability=False, random_state=0,
shrinking=True, tol=0.001, verbose=False)

The model performance can be altered by changing the value
of C(Regularization factor), gamma, and kernel.
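
To give a feel for what the kernel and gamma settings mean, here is a small pure-Python sketch of two kernel functions. It assumes the usual definitions: the linear kernel is a dot product, and the RBF kernel is exp(-gamma * squared distance). The sample vectors are hypothetical. C is not shown; it is commonly described as the trade-off between a wide margin and misclassified training points.

```python
import math

def linear_kernel(a, b):
    """Dot product, the similarity measure behind a linear SVM."""
    return sum(ai * bi for ai, bi in zip(a, b))

def rbf_kernel(a, b, gamma):
    """RBF similarity exp(-gamma * ||a - b||^2); it equals 1 for identical points."""
    squared_distance = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return math.exp(-gamma * squared_distance)

a, b = (0.0, 0.0), (1.0, 1.0)   # hypothetical scaled feature vectors
for gamma in (0.1, 1.0, 10.0):
    # Larger gamma makes the similarity fall off faster with distance.
    print(gamma, rbf_kernel(a, b, gamma))
```

A larger gamma means each training point influences only its close surroundings, which tends to produce a more irregular decision boundary; a smaller gamma gives a smoother one.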

o Predicting the test set result:


Now, we will predict the output for test set. For this, we will create a new
vector y_pred. Below is the code for it:

#Predicting the test set result
y_pred = classifier.predict(x_test)

After getting the y_pred vector, we can compare the result of y_pred and y_test
to check the difference between the actual value and the predicted value.

Output: Below is the output for the prediction of the test set:

o Creating the confusion matrix:
Now we will evaluate the performance of the SVM classifier by checking how
many incorrect predictions it makes compared to the Logistic regression
classifier. To create the confusion matrix, we need to import
the confusion_matrix function of the sklearn library. After importing the
function, we will call it and store the result in a new variable cm. The
function takes two main parameters: y_true (the actual values) and y_pred
(the predicted values returned by the classifier). Below is the code for it:

#Creating the Confusion matrix
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)

Output:

As we can see in the above output image, there are 66+24= 90 correct predictions
and 8+2= 10 incorrect predictions. Therefore we can say that our SVM model
improved as compared to the Logistic regression model.

o Visualizing the training set result:


Now we will visualize the training set result, below is the code for it:

#Visualizing the training set result
#(nm is numpy and mtp is matplotlib.pyplot, as imported in the pre-processing step)
from matplotlib.colors import ListedColormap
x_set, y_set = x_train, y_train
#Build a fine grid that covers the range of both features
x1, x2 = nm.meshgrid(nm.arange(start = x_set[:, 0].min() - 1, stop = x_set[:, 0].max() + 1, step = 0.01),
                     nm.arange(start = x_set[:, 1].min() - 1, stop = x_set[:, 1].max() + 1, step = 0.01))
#Color every grid point by its predicted class to draw the decision regions
mtp.contourf(x1, x2, classifier.predict(nm.array([x1.ravel(), x2.ravel()]).T).reshape(x1.shape),
             alpha = 0.75, cmap = ListedColormap(('red', 'green')))
mtp.xlim(x1.min(), x1.max())
mtp.ylim(x2.min(), x2.max())
#Overlay the actual observations, one color per class
for i, j in enumerate(nm.unique(y_set)):
    mtp.scatter(x_set[y_set == j, 0], x_set[y_set == j, 1],
                c = ListedColormap(('red', 'green'))(i), label = j)
mtp.title('SVM classifier (Training set)')
mtp.xlabel('Age')
mtp.ylabel('Estimated Salary')
mtp.legend()
mtp.show()

Output:

By executing the above code, we will get the output as:

As we can see, the above output looks similar to the Logistic regression
output. We got a straight line as the hyperplane because we used a linear
kernel in the classifier. As discussed above, for a 2-d space the hyperplane
in SVM is a straight line.

o Visualizing the test set result:

#Visualizing the test set result
#(nm is numpy and mtp is matplotlib.pyplot, as imported in the pre-processing step)
from matplotlib.colors import ListedColormap
x_set, y_set = x_test, y_test
#Build a fine grid that covers the range of both features
x1, x2 = nm.meshgrid(nm.arange(start = x_set[:, 0].min() - 1, stop = x_set[:, 0].max() + 1, step = 0.01),
                     nm.arange(start = x_set[:, 1].min() - 1, stop = x_set[:, 1].max() + 1, step = 0.01))
#Color every grid point by its predicted class to draw the decision regions
mtp.contourf(x1, x2, classifier.predict(nm.array([x1.ravel(), x2.ravel()]).T).reshape(x1.shape),
             alpha = 0.75, cmap = ListedColormap(('red', 'green')))
mtp.xlim(x1.min(), x1.max())
mtp.ylim(x2.min(), x2.max())
#Overlay the actual observations, one color per class
for i, j in enumerate(nm.unique(y_set)):
    mtp.scatter(x_set[y_set == j, 0], x_set[y_set == j, 1],
                c = ListedColormap(('red', 'green'))(i), label = j)
mtp.title('SVM classifier (Test set)')
mtp.xlabel('Age')
mtp.ylabel('Estimated Salary')
mtp.legend()
mtp.show()

Output:

By executing the above code, we will get the output as:

As we can see in the above output image, the SVM classifier has divided the users
into two regions (Purchased or Not purchased). Since the code maps class 0 to red
and class 1 to green, users who did not purchase the SUV are in the red region
with the red scatter points, and users who purchased the SUV are in the green
region with the green scatter points. The hyperplane has divided the two classes
into Purchased and Not purchased.

Naïve Bayes Classifier Algorithm
o The Naïve Bayes algorithm is a supervised learning algorithm, which is based
on Bayes' theorem and used for solving classification problems.
o It is mainly used in text classification, which involves a high-dimensional
training dataset.
o The Naïve Bayes classifier is one of the simplest and most effective
classification algorithms, and it helps in building fast machine learning
models that can make quick predictions.
o It is a probabilistic classifier, which means it predicts on the basis of
the probability of an object.
o Some popular applications of the Naïve Bayes algorithm are spam filtration,
sentiment analysis, and classifying articles.

Why is it called Naïve Bayes?


The name Naïve Bayes is made up of two words, Naïve and Bayes, which can
be described as:

o Naïve: It is called Naïve because it assumes that the occurrence of a certain
feature is independent of the occurrence of other features. For example, if a
fruit is identified on the basis of color, shape, and taste, then a red,
spherical, and sweet fruit is recognized as an apple. Each feature individually
contributes to identifying it as an apple, without depending on the others.
o Bayes: It is called Bayes because it depends on the principle of Bayes'
Theorem.

Bayes' Theorem:
o Bayes' theorem, also known as Bayes' Rule or Bayes' law, is used
to determine the probability of a hypothesis with prior knowledge. It depends
on conditional probability.
o The formula for Bayes' theorem is given as:

P(A|B) = P(B|A) * P(A) / P(B)

Where,

P(A|B) is the Posterior probability: the probability of hypothesis A given the observed event B.

P(B|A) is the Likelihood: the probability of the evidence B given that hypothesis A is true.

P(A) is the Prior probability: the probability of the hypothesis before observing the evidence.

P(B) is the Marginal probability: the probability of the evidence.

Working of Naïve Bayes' Classifier:


The working of the Naïve Bayes classifier can be understood with the help of the
example below:

Suppose we have a dataset of weather conditions and a corresponding target
variable "Play". Using this dataset, we need to decide whether we should play
or not on a particular day according to the weather conditions. To solve this
problem, we need to follow the steps below:

1. Convert the given dataset into frequency tables.
2. Generate the likelihood table by finding the probabilities of the given features.
3. Use Bayes' theorem to calculate the posterior probability.

Problem: If the weather is sunny, should the player play or not?

Solution: To solve this, first consider the below dataset:

      Outlook     Play
0     Rainy       Yes
1     Sunny       Yes
2     Overcast    Yes
3     Overcast    Yes
4     Sunny       No
5     Rainy       Yes
6     Sunny       Yes
7     Overcast    Yes
8     Rainy       No
9     Sunny       No
10    Sunny       Yes
11    Rainy       No
12    Overcast    Yes
13    Overcast    Yes

Frequency table for the Weather Conditions:

Weather     Yes    No
Overcast    5      0
Rainy       2      2
Sunny       3      2
Total       10     4

Likelihood table for the weather conditions:

Weather     No             Yes             P(Weather)
Overcast    0              5               5/14 = 0.35
Rainy       2              2               4/14 = 0.29
Sunny       2              3               5/14 = 0.35
All         4/14 = 0.29    10/14 = 0.71

Applying Bayes' theorem:

P(Yes|Sunny) = P(Sunny|Yes)*P(Yes)/P(Sunny)

P(Sunny|Yes) = 3/10 = 0.3
P(Sunny) = 0.35
P(Yes) = 0.71

So P(Yes|Sunny) = 0.3*0.71/0.35 = 0.60

P(No|Sunny) = P(Sunny|No)*P(No)/P(Sunny)

P(Sunny|No) = 2/4 = 0.5
P(No) = 0.29
P(Sunny) = 0.35

So P(No|Sunny) = 0.5*0.29/0.35 = 0.41

As we can see from the above calculation, P(Yes|Sunny) > P(No|Sunny).

Hence, on a sunny day, the player can play the game.
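
The same three steps (frequency table, likelihoods, posterior) can be reproduced in Python directly from the 14 observations in the dataset. This is a minimal sketch; the function name posterior is hypothetical. Exact fractions are used so that no rounding creeps in: they give 3/5 = 0.6 and 2/5 = 0.4, and the value 0.41 in the hand calculation comes from rounding the intermediate probabilities.

```python
from collections import Counter
from fractions import Fraction

# The 14 (Outlook, Play) observations from the dataset table.
data = [
    ("Rainy", "Yes"), ("Sunny", "Yes"), ("Overcast", "Yes"), ("Overcast", "Yes"),
    ("Sunny", "No"), ("Rainy", "Yes"), ("Sunny", "Yes"), ("Overcast", "Yes"),
    ("Rainy", "No"), ("Sunny", "No"), ("Sunny", "Yes"), ("Rainy", "No"),
    ("Overcast", "Yes"), ("Overcast", "Yes"),
]

pair_counts = Counter(data)                                # frequency table
play_counts = Counter(play for _, play in data)            # totals per Play value
outlook_counts = Counter(outlook for outlook, _ in data)   # totals per Outlook value
n = len(data)

def posterior(play, outlook):
    """P(play | outlook) = P(outlook | play) * P(play) / P(outlook)."""
    likelihood = Fraction(pair_counts[(outlook, play)], play_counts[play])
    prior = Fraction(play_counts[play], n)
    evidence = Fraction(outlook_counts[outlook], n)
    return likelihood * prior / evidence

print(posterior("Yes", "Sunny"))  # 3/5
print(posterior("No", "Sunny"))   # 2/5
```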

Advantages of Naïve Bayes Classifier:

o Naïve Bayes is one of the fastest and easiest ML algorithms for predicting the
class of a data point.
o It can be used for Binary as well as Multi-class Classifications.
o It performs well in Multi-class predictions as compared to the other
Algorithms.
o It is a popular choice for text classification problems.

Disadvantages of Naïve Bayes Classifier:

o Naive Bayes assumes that all features are independent or unrelated, so it
cannot learn the relationship between features.

Applications of Naïve Bayes Classifier:

o It is used for Credit Scoring.
o It is used in medical data classification.
o It can be used in real-time predictions because Naïve Bayes Classifier is
an eager learner.
o It is used in Text classification such as Spam filtering and Sentiment
analysis.

Types of Naïve Bayes Model:


There are three types of Naive Bayes Model, which are given below:

o Gaussian: The Gaussian model assumes that features follow a normal
distribution. This means that if predictors take continuous values instead of
discrete ones, the model assumes that these values are sampled from a
Gaussian distribution.
o Multinomial: The Multinomial Naïve Bayes classifier is used when the data
is multinomially distributed. It is primarily used for document classification
problems, i.e., deciding which category a particular document belongs to, such
as Sports, Politics, Education, etc.
The classifier uses the frequency of words as the predictors.
o Bernoulli: The Bernoulli classifier works similarly to the Multinomial classifier,
but the predictor variables are independent Boolean variables, such as
whether a particular word is present in a document or not. This model is also
well known for document classification tasks.
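
The difference between the Multinomial and Bernoulli predictors can be seen by encoding the same document in both ways. The following sketch only illustrates the two feature representations; the vocabulary, the document, and the helper names are hypothetical, and whitespace splitting stands in for a real tokenizer.

```python
from collections import Counter

def count_features(document, vocabulary):
    """Word-frequency vector, the kind of predictor Multinomial Naive Bayes uses."""
    counts = Counter(document.lower().split())
    return [counts[word] for word in vocabulary]

def binary_features(document, vocabulary):
    """Presence/absence vector, the kind of predictor Bernoulli Naive Bayes uses."""
    words = set(document.lower().split())
    return [1 if word in words else 0 for word in vocabulary]

vocabulary = ["goal", "match", "election", "vote"]          # hypothetical vocabulary
document = "Late goal wins the match after an early goal"   # hypothetical document

print(count_features(document, vocabulary))   # [2, 1, 0, 0]
print(binary_features(document, vocabulary))  # [1, 1, 0, 0]
```

The word "goal" appears twice, so it contributes a count of 2 in the first encoding but only a 1 in the second.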

Python Implementation of the Naïve Bayes algorithm:


Now we will implement a Naive Bayes Algorithm using Python. So for this, we will
use the "user_data" dataset, which we have used in our other classification
model. Therefore we can easily compare the Naive Bayes model with the other
models.

Steps to implement:

o Data Pre-processing step


o Fitting Naive Bayes to the Training set
o Predicting the test result
o Test accuracy of the result(Creation of Confusion matrix)
o Visualizing the test set result.

1) Data Pre-processing step:


In this step, we will pre-process/prepare the data so that we can use it efficiently in
our code. It is similar to what we did in the data pre-processing section. The code for
this is given below:

# Importing the libraries
import numpy as nm
import matplotlib.pyplot as mtp
import pandas as pd

# Importing the dataset
dataset = pd.read_csv('user_data.csv')
x = dataset.iloc[:, [2, 3]].values
y = dataset.iloc[:, 4].values

# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.25, random_state = 0)

# Feature Scaling
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
x_train = sc.fit_transform(x_train)
x_test = sc.transform(x_test)

In the above code, we have loaded the dataset into our program using
"dataset = pd.read_csv('user_data.csv')". The loaded dataset is divided into
training and test sets, and then we have scaled the feature variables.
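
The scaling step deserves a closer look: the mean and standard deviation are learned from the training set only (fit_transform), and the test set is then scaled with those same statistics (transform). Below is a pure-Python sketch of that idea with hypothetical helper names and hypothetical rows; it uses the population standard deviation and does not handle a column with zero variance.

```python
import math

def fit_scaler(rows):
    """Return per-column means and standard deviations computed from `rows`."""
    columns = list(zip(*rows))
    means = [sum(col) / len(col) for col in columns]
    # Population standard deviation (divide by n).
    stds = [math.sqrt(sum((v - m) ** 2 for v in col) / len(col))
            for col, m in zip(columns, means)]
    return means, stds

def transform(rows, means, stds):
    """Standardize each value as (value - mean) / std using the given statistics."""
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]

# Hypothetical (age, estimated salary) rows.
train = [[20, 20000], [30, 40000], [40, 60000]]
test = [[30, 50000]]

means, stds = fit_scaler(train)              # statistics come from the training set only
train_scaled = transform(train, means, stds)
test_scaled = transform(test, means, stds)   # the test set reuses the training statistics
print(test_scaled)
```

Reusing the training statistics keeps information about the test set from leaking into the model, and it ensures both sets are expressed on the same scale.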

The output for the dataset is given as:

2) Fitting Naive Bayes to the Training Set:
After the pre-processing step, now we will fit the Naive Bayes model to the
Training set. Below is the code for it:

# Fitting Naive Bayes to the Training set
from sklearn.naive_bayes import GaussianNB
classifier = GaussianNB()
classifier.fit(x_train, y_train)

In the above code, we have used the GaussianNB classifier to fit it to the
training dataset. We can also use other classifiers as per our requirement.

Output:

Out[6]: GaussianNB(priors=None, var_smoothing=1e-09)
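
To show what fitting a Gaussian Naive Bayes model involves, here is a from-scratch sketch under simplifying assumptions. It estimates a prior plus a per-feature mean and variance for each class, and predicts the class with the largest prior times the product of the per-feature normal densities. The function names and data are hypothetical; unlike scikit-learn's GaussianNB, this sketch applies no variance smoothing and multiplies raw probabilities instead of working with logarithms.

```python
import math

def gaussian_pdf(x, mean, var):
    """Density of a normal distribution with the given mean and variance at x."""
    return math.exp(-((x - mean) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)

def fit_gaussian_nb(rows, labels):
    """For each class, store its prior and the per-feature means and variances."""
    model = {}
    for cls in set(labels):
        members = [row for row, label in zip(rows, labels) if label == cls]
        columns = list(zip(*members))
        means = [sum(col) / len(col) for col in columns]
        variances = [sum((v - m) ** 2 for v in col) / len(col)
                     for col, m in zip(columns, means)]
        model[cls] = (len(members) / len(rows), means, variances)
    return model

def predict(model, row):
    """Pick the class with the largest prior * product of per-feature densities."""
    def score(cls):
        prior, means, variances = model[cls]
        likelihood = 1.0
        for x, m, v in zip(row, means, variances):
            likelihood *= gaussian_pdf(x, m, v)   # the "naive" independence assumption
        return prior * likelihood
    return max(model, key=score)

# Hypothetical scaled (age, salary) rows: 0 = not purchased, 1 = purchased.
rows = [(-1.0, -0.8), (-1.2, -1.1), (-0.7, -0.9), (0.9, 1.0), (1.1, 0.7), (1.3, 1.2)]
labels = [0, 0, 0, 1, 1, 1]

model = fit_gaussian_nb(rows, labels)
print(predict(model, (-1.0, -1.0)))  # 0
print(predict(model, (1.0, 1.0)))    # 1
```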

3) Prediction of the test set result:


Now we will predict the test set result. For this, we will create a new predictor
variable y_pred, and will use the predict function to make the predictions.

# Predicting the Test set results
y_pred = classifier.predict(x_test)

Output:

The above output shows the result for the prediction vector y_pred and the real vector
y_test. We can see that some predictions are different from the real values; these
are the incorrect predictions.

4) Creating Confusion Matrix:


Now we will check the accuracy of the Naive Bayes classifier using the Confusion
matrix. Below is the code for it:

# Making the Confusion Matrix
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)

Output:

As we can see in the above confusion matrix output, there are 7+3= 10 incorrect
predictions, and 65+25=90 correct predictions.
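
Since all the classifiers in this tutorial were evaluated on the same dataset, their error counts can be put side by side. The sketch below uses the numbers of incorrect predictions reported in the text (11 for Logistic Regression, 7 for K-NN, 10 for SVM, 10 for Naive Bayes). It assumes that every model was tested on the same 100 test points, which matches the K-NN, SVM, and Naive Bayes totals but is an assumption for Logistic Regression.

```python
# Incorrect predictions as reported in this tutorial.
incorrect = {
    "Logistic Regression": 11,
    "K-NN": 7,
    "SVM (linear kernel)": 10,
    "Naive Bayes": 10,
}
test_size = 100  # assumed to be the same for every model

accuracy = {name: (test_size - wrong) / test_size for name, wrong in incorrect.items()}

# Print the models from most to least accurate.
for name, acc in sorted(accuracy.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: {acc:.2f}")
```

On this single split, K-NN makes the fewest errors; a difference of a few test points is small, so it should not be read as a general ranking of the algorithms.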

5) Visualizing the training set result:


Next we will visualize the training set result using Naïve Bayes Classifier. Below is
the code for it:

# Visualising the Training set results
# (nm is numpy and mtp is matplotlib.pyplot, as imported in the pre-processing step)
from matplotlib.colors import ListedColormap
x_set, y_set = x_train, y_train
# Build a fine grid that covers the range of both features
X1, X2 = nm.meshgrid(nm.arange(start = x_set[:, 0].min() - 1, stop = x_set[:, 0].max() + 1, step = 0.01),
                     nm.arange(start = x_set[:, 1].min() - 1, stop = x_set[:, 1].max() + 1, step = 0.01))
# Color every grid point by its predicted class to draw the decision regions
mtp.contourf(X1, X2, classifier.predict(nm.array([X1.ravel(), X2.ravel()]).T).reshape(X1.shape),
             alpha = 0.75, cmap = ListedColormap(('purple', 'green')))
mtp.xlim(X1.min(), X1.max())
mtp.ylim(X2.min(), X2.max())
# Overlay the actual observations, one color per class
for i, j in enumerate(nm.unique(y_set)):
    mtp.scatter(x_set[y_set == j, 0], x_set[y_set == j, 1],
                c = ListedColormap(('purple', 'green'))(i), label = j)
mtp.title('Naive Bayes (Training set)')
mtp.xlabel('Age')
mtp.ylabel('Estimated Salary')
mtp.legend()
mtp.show()

Output:

In the above output, we can see that the Naïve Bayes classifier has segregated the
data points with a smooth boundary. The boundary is a curve, shaped by the Gaussian
distributions that the GaussianNB classifier fits to each class.

6) Visualizing the Test set result:

# Visualising the Test set results
# (nm is numpy and mtp is matplotlib.pyplot, as imported in the pre-processing step)
from matplotlib.colors import ListedColormap
x_set, y_set = x_test, y_test
# Build a fine grid that covers the range of both features
X1, X2 = nm.meshgrid(nm.arange(start = x_set[:, 0].min() - 1, stop = x_set[:, 0].max() + 1, step = 0.01),
                     nm.arange(start = x_set[:, 1].min() - 1, stop = x_set[:, 1].max() + 1, step = 0.01))
# Color every grid point by its predicted class to draw the decision regions
mtp.contourf(X1, X2, classifier.predict(nm.array([X1.ravel(), X2.ravel()]).T).reshape(X1.shape),
             alpha = 0.75, cmap = ListedColormap(('purple', 'green')))
mtp.xlim(X1.min(), X1.max())
mtp.ylim(X2.min(), X2.max())
# Overlay the actual observations, one color per class
for i, j in enumerate(nm.unique(y_set)):
    mtp.scatter(x_set[y_set == j, 0], x_set[y_set == j, 1],
                c = ListedColormap(('purple', 'green'))(i), label = j)
mtp.title('Naive Bayes (Test set)')
mtp.xlabel('Age')
mtp.ylabel('Estimated Salary')
mtp.legend()
mtp.show()

Output:

The above output is the final output for the test set data. As we can see, the classifier has
created a curved boundary to divide the "purchased" and "not purchased" classes.
There are some wrong predictions, which we counted in the confusion matrix,
but it is still a pretty good classifier.
