
Machine Learning Made Easy Using Python

The document is a guidebook for beginners interested in mastering machine learning with Python, emphasizing the importance of machine learning in automating tasks and making data-driven decisions. It covers essential concepts, tools, and libraries such as NumPy, Pandas, and Scikit-learn, while also introducing advanced topics. The book aims to provide a comprehensive understanding of data science and machine learning for those familiar with Python programming.

Uploaded by

Duane Fisher
Copyright
© All Rights Reserved

PYTHON

MACHINE LEARNING
Machine learning is considered one of the
must-have skills for the coming years. As
tasks multiply, programming each one by hand
has become time consuming. Machine learning
allows machines to learn on their own from
the data we feed them and still produce
the same results.
I assume that you have prior
knowledge of Python programming and
data science. If you don't,
you can check the books on the next page.

A complete guidebook for anyone who


wants to master machine learning
with Python.

Rahul Mula
Machine Learning with Python
by Rahul Mula
© 2020 Machine Learning with Python
All rights reserved. No portion of this
book may be reproduced in any form
without permission from the copyright
holder, except as permitted by U.S.
copyright law.
Cover by Rahul Mula.
All the programs written in this book
are tested and verified by the author.
Cover Template from [Link]

ISBN : 979-8-58-426755-1
MAKE SURE TO CHECK THEM OUT

Python For Beginner


A beginners guide to programming
with python

Data Science with Python


Learn how to perform tasks like data
processing, cleansing, analysis and
visualization
Why should you learn machine learning, and what are
its uses? These are questions that may come to
your mind. The answer is simple. Suppose you are
given data from an online store about its products
and asked to recommend products. The data has product
name, category, quantity, and rate columns
with several hundred rows of products. If you want to
perform some analysis, like finding the product that was
purchased most on a given day, it will take a lot of time to do
manually. To ease these tasks, we use data
analysis, i.e. we run a program that performs
a certain analysis on the data. The computer runs the
program and we get the output in just a few seconds.
Then we classify each user and suggest products
from their preferred categories on the basis of their
previous searches. How do we do that? Well, we
need to learn Data Science and Machine Learning to
perform those tasks.
Businesses & organizations are trying to deal with such data
by building intelligent systems using the concepts and
methodologies from Data Science, Data Mining and
Machine Learning. Among them, machine learning is
the most exciting field of computer science. It would
not be wrong to call machine learning the
application and science of algorithms that give
meaning to data.
This book is prepared especially for beginners (in
Data Science and Machine Learning), but you should
be familiar with programming in Python. We will work
with packages and modules like NumPy, SciPy,
Pandas, Matplotlib, Scikit-learn, etc. to perform
analysis and other tasks. I kept this book open to the
basic concepts of data science to help beginners
understand everything, but the book only covers
the data science concepts that machine learning requires. As
the name suggests, this book is not for you if you're
looking for data science; you can check the other books
page to find that. I also included advanced topics so as
not to limit you to the basics. Machine learning,
algorithms, data science, etc. may seem tough and
boring, but as you handle more and more data, you'll
play with it!
(contents)

PANDAS (Chapter 03)
• Features of Pandas Library
• Series
• Data Frames

MATPLOTLIB (Chapter 06)
• Features of matplotlib
• Data visualization
• PyPlot in matplotlib

SCIKIT LEARN
• Features of Scikit-learn library
• How to work with data?
• Why use Python?

TYPES OF MACHINE LEARNING (Chapter 08)
• Supervised learning
• Unsupervised learning
• Deep learning

MATHEMATICS FOR MACHINE LEARNING
• Data instances
• Statistics
• Probability

SCIKIT LEARN ALGORITHMS
• Regression algorithm
• Classification algorithm
• Clustering algorithm

IMPORTING DATA
• Importing CSV data
• Importing JSON data
• Importing Excel data

DATA OPERATIONS
• NumPy operations
• Pandas operations
• Cleaning data

DATA ANALYSIS & PROCESSING
• Data analytics
• Correlations between attributes
• Skewness of the data

DATA VISUALIZATION (Chapter 14)
• Plotting data
• Univariate plots
• Multivariate plots

CLASSIFICATION (Chapter 16)
• Decision tree
• Linear regression
• Naive Bayes

PERFORMANCE & METRICS (Chapter 20)
• Calculating the model
• Improving the model
• Saving and loading models
MACHINE LEARNING INTRODUCTION
Chapter 01
• What is Machine Learning?
• Uses of Machine Learning
• How do machines learn?
What is Machine Learning?


Data is what you need to do ANALYTICS,
information is what you need to do BUSINESS.

Commonly referred to as the "oil of the 21st
century", our digital data carries the utmost
importance in the field. It has incalculable
benefits in business, research and our everyday
lives.
Machine Learning is the field of computer science
where machines give meaning to data the way we
humans do. Machine Learning is a type of
artificial intelligence that finds patterns in raw
data through various algorithms and makes
predictions like humans do. Machine Learning also
means that machines learn on their own. To better
understand it, think of a newborn child as
a machine learning model. The parents cannot
teach the child everything; that's why they send the child
to school, which in this case is you, working with the
machine learning model. The school has textbooks,
tests, etc. to help the child learn on its own; for the
model, this material is the data.

Uses of Machine Learning


Organizations are investing heavily in
technologies like Artificial Intelligence, Machine
Learning and Deep Learning to get key
information from data, perform several
real-world tasks and solve problems. We can call
these data-driven decisions taken by machines,
particularly to automate a process. Such
data-driven decisions can be used, instead of
programming logic, in problems that
cannot be programmed directly. The fact is that
we can't do without human intelligence, but the other
aspect is that we all need to solve real-world
problems efficiently at a huge scale. That is
why the need for machine learning arises.
The following are some of its real-world
applications:

• Forecasting the weather a day beforehand by finding
patterns in the weather data of previous days
• Predicting the future prices of stocks in the
stock market
• Suggesting a product to a customer in an online store
according to the user's previous search terms
How do machines learn?


So what magic happens that lets machines learn like
us and perform tasks? Let's understand that with an
example. Let's say you are a new computer dealer
with very basic experience. So you ask
another dealer for information. You
note the points that matter,
like the processor cores, RAM and GPU. The dealer
tells you these and you learn in return. Then you
are given the following data about 8 GB RAM modules of
different brands:

[Figure: scatter plot of RAM frequency (MHz, about 1500 to 3500) against price (about 60 to 85)]

By observing the data we can tell that the price
increases as the frequency increases.
You understand the simple logic behind the data.
Then, if you get RAM with a frequency that is in your data,
you can tell its price: for example, RAM
with a frequency of 1666 MHz costs 60.
But what if you get a frequency that you don't
have a record of, like 2600 MHz? Then you have to
learn how to decide the price. We start by finding a
way to calculate it from the given data. We assume
that there is a linear relationship between the
two. We define the relationship as a straight line
drawn through the data points:

[Figure: the same scatter plot with a straight line fitted through the points]

Now we can use the line as a reference and predict
values. So, for 2600 MHz the cost will be about 77.
So how do we draw the line? We follow the
formula cost = a + b * MHz, but what are a and b?
a and b are the parameters of the straight line: a is the
cost where the line meets 0 MHz, and b is how much the
cost rises per MHz. You don't need to sweat about them for now.
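
The line described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the five (frequency, price) pairs are hypothetical values invented for the example, not the data behind the chart, and NumPy's polyfit function is used here as one convenient way to find a and b.

```python
import numpy as np

# Hypothetical (frequency in MHz, price) pairs; illustrative values only,
# not the data behind the book's chart.
mhz = np.array([1600.0, 2000.0, 2400.0, 3000.0, 3200.0])
price = np.array([60.0, 66.0, 73.0, 82.0, 85.0])

# Least-squares fit of price = a + b * mhz.
# polyfit returns the coefficients from the highest degree down, so b comes first.
b, a = np.polyfit(mhz, price, deg=1)


def predict_price(frequency_mhz):
    """Read the price for a frequency off the fitted straight line."""
    return a + b * frequency_mhz


# Predict the price for a frequency that is not in the data.
print(round(predict_price(2600.0), 1))
```

With these made-up numbers the prediction for 2600 MHz lands between the prices recorded for 2400 MHz and 3000 MHz, which is what reading the value off the line by eye would give.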
Likewise in machine learning, the machine, i.e. the
computer, learns the patterns or relations in the
data through algorithms and predicts a value when
a new input is given.
So will there be no errors? No: a prediction will
often differ from the actual answer. We also make
many mistakes, but we learn from them or
change our tutor if the results stay poor.
Machine learning models also learn from their
mistakes, and we change the algorithm when results do
not improve.
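
To make "mistakes" concrete, here is a minimal sketch that measures how wrong a line is. The data values, the first-guess line and the helper name mean_absolute_error are all assumptions made for this illustration; the error measure used is the average distance between predicted and actual prices.

```python
import numpy as np

# Hypothetical data: RAM frequency in MHz and the observed price.
mhz = np.array([1600.0, 2000.0, 2400.0, 3000.0, 3200.0])
price = np.array([60.0, 66.0, 73.0, 82.0, 85.0])


def mean_absolute_error(a, b):
    """Average distance between the predictions a + b * mhz and the real prices."""
    predictions = a + b * mhz
    return np.mean(np.abs(predictions - price))


# Error of a rough first guess for the line.
guess_error = mean_absolute_error(30.0, 0.02)

# Error of the least-squares line found by polyfit.
b_fit, a_fit = np.polyfit(mhz, price, deg=1)
fit_error = mean_absolute_error(a_fit, b_fit)

print(guess_error, fit_error)
```

The fitted line has a much smaller error than the first guess. Reducing an error measure like this one is what "learning from mistakes" amounts to for a model.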
SETTING-UP ENVIRONMENT
Chapter 02
• Installing Anaconda
• Jupyter notebook
• Working with Jupyter notebook
Installing Anaconda
Head to [Link]/products/individual to
download the latest version of Anaconda.

[Screenshot: Anaconda installers for Windows, MacOS and Linux]

You can download the Anaconda installer for your
system, whether it is Windows, Mac or Linux. After
downloading it, just run the installer to install
Anaconda. Then search for Anaconda Prompt and open it.

[Screenshot: searching for Anaconda Prompt (anaconda3) in the Windows Start menu]
This is the Anaconda Command Prompt, from where we
can run programs or perform other operations by
typing commands.

(base) C:\Users\Rahul>
Jupyter Notebook
The Jupyter Notebook is an open-source
web application that allows you to create
and share documents that contain live
code, equations, visualizations and
explanatory text. We will use it to perform
our data processing, analytics,
visualization, etc. on the go.
To open Jupyter Notebook, type jupyter notebook in
the Anaconda command prompt and press Enter.
(base) C:\Users\Rahul>jupyter notebook
[I NotebookApp] JupyterLab extension loaded from C:\Users\Rahul\anaconda3\lib\site-packages\jupyterlab
[I NotebookApp] JupyterLab application directory is C:\Users\Rahul\anaconda3\share\jupyter\lab
[I NotebookApp] Serving notebooks from local directory: C:\Users\Rahul
[I NotebookApp] The Jupyter Notebook is running at:
[I NotebookApp] [Link]
[I NotebookApp] or [Link]
[I NotebookApp] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).
[C NotebookApp]

    To access the notebook, open this file in a browser:
        [Link]
    Or copy and paste one of these URLs:
        [Link]
     or [Link]

Anaconda will redirect you to your browser [it may
ask you which browser to open your Jupyter
notebook in, if you have more than one browser] and a new
tab will appear with your Jupyter notebook hosted.
You can host your Python files here, and also run the
code on the fly.

[Screenshot: the Jupyter file browser with the Files, Running and Clusters tabs, listing the folders of the home directory]

In Jupyter Notebook, we don't need to install any
other module, package or library separately;
everything we need already comes with Anaconda. The
best thing is that you can also code online without
installing any IDE or the Python interpreter, which
makes it a great choice for data scientists.
Working with Jupyter Notebook


To start coding, click on New and select Python 3 to
open a new Python notebook.

[Screenshot: the New menu in the Jupyter file browser, with the options Python 3, Text File, Folder and Terminal]

This is the place where we will write our code [in
the cell] and run it.

[Screenshot: a new, untitled notebook with an empty code cell]

If you cannot create a new file or encounter any
error, you can head directly to [Link]/try and
choose Python.

[Screenshot: the Try Jupyter page, offering Try Classic Notebook, Try JupyterLab and Try Jupyter with Julia]
We can rename our file by clicking its name
[Untitled].

[Screenshot: the notebook title, Untitled, above the menu bar and toolbar]

We have only one code cell; in this cell we will
write our code.

In [ ]:

There are three types of cells: code cells, markdown
cells and raw cells.
We can use markdown cells to display headings or
titles.

# Jupyter Notebook

Now run the cell by clicking the Run button on the
toolbar.

[Screenshot: the markdown cell rendered as the heading Jupyter Notebook]
In code cells, we can write Python code and execute
it instantly.

In [1]: 123*525

Out[1]: 64575

To insert a new cell below the selected cell, press
b on your keyboard or click the + icon.

[Screenshot: a new empty cell inserted below In [1]]

You can select a cell [blue border] or edit it [green border] by
clicking outside the text field or inside the text
field respectively.

Markdown cells give us more ways to display
text gracefully. We can add headings,
sub-headings and lower-level headings by typing # one, two or three
times, followed by a space and then the text.

# Jupyter Notebook
## IPython
### Data

We can create ordered and bulleted lists.

Ordered list, as typed in the cell:

1. Data Science
    1. Python
    2. Jupyter Notebook
    3. Libraries

Rendered result:

1. Data Science
    A. Python
    B. Jupyter Notebook
    C. Libraries

Bulleted list, as typed in the cell:

- Data Science
    * Python
    * Jupyter Notebook
    * Libraries

Rendered result:

• Data Science
    ■ Python
    ■ Jupyter Notebook
    ■ Libraries

To create an ordered list, use 1. for the first list
item, indent the sub-list items with a tab and
number them in order. [The text should follow
the number after a space.]
To create a bulleted list, use - or * as the bullet
marker (the bullet shape depends on the nesting level,
not on the marker), and indent sub-list items in the
same manner as above.
We can also add links, using [] and (). Write the display
text in [] and put the link in (); you can also add
hover text inside the () using " " quotes.

[Jupyter Notebook for Python](https://[Link]/try "Try it!")

Jupyter Notebook for Python

We can also use **<text>** or __<text>__ to render
bold text and *<text>* or _<text>_ to render
italicized text.

We can also insert images by going to
Edit > Insert Image and browsing for the image.

Create tables using |, strictly following the
example below.

|Product|Price|Quantity|
|-------|-----|--------|
|Biscuits|5|2|
|Milk|7|5L|

Product   Price  Quantity
Biscuits  5      2
Milk      7      5L

Here's a complete list of shortcuts for various
operations with cells.

Operation                   Shortcut
change cell to code         y
change cell to markdown     m
change cell to raw          r
close the pager             Esc
restart kernel              0 + 0
copy selected cell          c
cut selected cell           x
delete selected cell        d + d
enter edit mode             Enter
extend selection below      Shift + j
extend selection above      Shift + k
find and replace            f
ignore                      Shift
insert cell above           a
insert cell below           b
interrupt the kernel        i + i
merge cells                 Shift + m
paste cells above           Shift + v
paste cells below           v
run cell and insert below   Alt + Enter
run cell and select below   Shift + Enter
run selected cells          Ctrl + Enter
save notebook               Ctrl + s
scroll notebook up          Shift + Space
scroll notebook down        Space
select all                  Ctrl + a
show keyboard shortcuts     h
toggle all line numbers     Shift + l
toggle cell output          o
toggle cell scrolling       Shift + o
toggle line numbers         l
undo cell deletion          z
PANDAS LIBRARY
Chapter 03
• Features of Pandas library
• Series
• Dataframes
Pandas
Data science requires high-performance data manipulation
and data analysis, which we can achieve
with Pandas data structures. Python with Pandas is
in use in a variety of academic and commercial
domains, including finance, economics, statistics,
advertising, web analytics, and more. Using
Pandas, we can accomplish five typical steps in
the processing and analysis of data, regardless of
the origin of the data: load, organize, manipulate,
model, and analyse.

Key features of Pandas library


We can achieve a lot with Pandas library using
its features like:
• Fast and efficient DataFrame object with
default and customized indexing.
• Tools for loading data into in-memory data
objects from different file formats.
• Data alignment and integrated handling of
missing data.
• Label-based slicing, indexing and subsetting
of large data sets.
• Columns from a data structure can be deleted
or inserted.
• Group by data for aggregation and
transformations.
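
A few of the features listed above can be tried out in a short sketch. The product table below is hypothetical (its names and numbers are made up for the illustration); the calls used are fillna, loc and groupby from Pandas.

```python
import numpy as np
import pandas as pd

# Hypothetical product data; names and numbers are illustrative only.
products = pd.DataFrame({
    "name": ["Biscuits", "Milk", "Bread", "Butter"],
    "category": ["Snacks", "Dairy", "Bakery", "Dairy"],
    "quantity": [2, 5, np.nan, 3],   # np.nan marks a missing value
})

# Integrated handling of missing data: fill the gap with 0.
products["quantity"] = products["quantity"].fillna(0)

# Columns can be inserted and deleted.
products["in_stock"] = products["quantity"] > 0
del products["in_stock"]

# Label-based selection: rows whose category is "Dairy".
dairy = products.loc[products["category"] == "Dairy"]

# Group by category and aggregate the quantities.
totals = products.groupby("category")["quantity"].sum()
print(totals)
```

Filling the missing quantity with 0 is only one possible choice; dropping the row or using an average would be equally valid, depending on the data.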

Series
Pandas handles data through its data
structures, known as Series, DataFrames and Panel
(Panel has been removed from recent versions of Pandas).
A Series is a one-dimensional array-like structure
with homogeneous data. For example, the following
series is a collection of integers:

10 17 23 55 67 71 92
As a Series is a homogeneous data structure, it can
contain only one type of data [here integers]. So,
we conclude that a Pandas Series:
• is a homogeneous data structure
• has a size that cannot be changed
• has values that can be changed
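
As a minimal sketch, the series of integers shown above can be built with pd.Series. The variable name s is just a placeholder chosen for this example.

```python
import pandas as pd

# The series of integers from the text.
s = pd.Series([10, 17, 23, 55, 67, 71, 92])

print(s.dtype)      # one integer data type shared by every value
print(len(s))       # number of values in the series

# Values can be changed in place; the length stays the same.
s.iloc[0] = 11
print(s.tolist())
```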

Data Frames
A DataFrame is a two-dimensional array with
heterogeneous data.

Day        Sales
Monday     33
Tuesday    37
Wednesday  14
Thursday   29

The data shows the sales of a certain product over 4
days. You can think of a DataFrame as a container for
2 or more series. So, we conclude that a Pandas DataFrame:
• can contain heterogeneous data
• has a mutable size
• has mutable data
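
The sales table above can be sketched as a DataFrame like this. The corrected Wednesday value and the Friday row are hypothetical, added only to show that both the data and the size can change.

```python
import pandas as pd

# The sales table from the text.
sales = pd.DataFrame({
    "Day": ["Monday", "Tuesday", "Wednesday", "Thursday"],
    "Sales": [33, 37, 14, 29],
})

# Each column is a Series with its own data type.
print(sales.dtypes)

# Data is mutable: overwrite one value (hypothetical correction).
sales.loc[2, "Sales"] = 15

# Size is mutable: add a row at the next index (hypothetical Friday figure).
sales.loc[len(sales)] = ["Friday", 41]
print(sales)
```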

We will use Pandas Series and DataFrames a lot
in future lessons; make sure to go through this
lesson again and get a grasp of them.

Key Points
• The Pandas library is a high-performance data
manipulation and data analysis tool.
• Pandas data structures include Series and
DataFrames.
• A Series is a 1-dimensional array of
homogeneous data, whose size is immutable
but whose values are mutable.
• A DataFrame is a 2-dimensional array of
heterogeneous data made of 2 or more series,
whose size and data are mutable.
NUMPY PACKAGE
Chapter 04
• Features of NumPy
• ndarray Objects
• List vs. ndarrays
NumPy
NumPy is a Python package whose name stands for
'Numerical Python'. It is a library consisting of
multidimensional array objects and a collection of
routines for processing arrays.

[Figure: 1D, 2D and 3D arrays with axis 0 and axis 1 marked; the 1D array has shape (4,) and the 2D array has shape (2, 3)]

Key features of NumPy


NumPy is a powerful package with many features,
like:
• Mathematical and logical operations on arrays.
• Fourier transforms and routines for shape
manipulation.
• Operations related to linear algebra. NumPy
has built-in functions for linear algebra and
random number generation.
• NumPy ndarrays are much faster than
Python's built-in lists and consume less memory.
• Most of the parts that require fast
computation are written in C or C++.
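
The following sketch touches each of the features listed above once. The numbers are arbitrary example values, and the fixed seed 0 is an assumption made so that the random output is repeatable.

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0, 4.0])

# Mathematical and logical operations work element by element.
print(a * 2)
print(a > 2.5)

# Shape manipulation: view the same four values as a 2 x 2 matrix.
m = a.reshape(2, 2)

# Linear algebra: solve m @ x = b for x.
b = np.array([5.0, 11.0])
x = np.linalg.solve(m, b)
print(x)

# Fourier transform of the original values.
spectrum = np.fft.fft(a)
print(spectrum[0])

# Random number generation with a fixed seed for repeatable output.
rng = np.random.default_rng(0)
print(rng.integers(0, 10, size=3))
```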

ndarray objects
NumPy aims to provide an array object that is up to
50x faster than traditional Python lists. The array
object in NumPy is called ndarray; it provides a lot
of supporting functions that make working with
ndarray very easy. Arrays are very frequently used in
data science, where speed and resources are very
important.
In NumPy, we can create 0-D, 1-D, 2-D and 3-D
ndarrays.

0-D  33

1-D  [11, 27, 18]

2-D  [[3, 5, 6],
      [5, 7, 11]]

3-D  [[[5, 8, 19],
       [6, 9, 10],
       [4, 1, 11]]]
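
As a small sketch, the arrays above can be created with np.array and their number of dimensions checked through the ndim attribute. The variable names are placeholders chosen for this example; note that a 3-D array needs three levels of nested brackets.

```python
import numpy as np

zero_d = np.array(33)
one_d = np.array([11, 27, 18])
two_d = np.array([[3, 5, 6], [5, 7, 11]])
# A 3-D array needs three levels of nesting.
three_d = np.array([[[5, 8, 19], [6, 9, 10], [4, 1, 11]]])

# Print the number of dimensions and the shape of each array.
for arr in (zero_d, one_d, two_d, three_d):
    print(arr.ndim, arr.shape)
```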

In brief, ndarrays or n-dimensional arrays:
• describe a collection of items of the
same type.
• allow items in the collection to be accessed using a
zero-based index.
• store every item in a block of memory of the
same size.
• have a data-type object (called dtype) that
describes each element. Any item extracted from
an ndarray object (by indexing) is represented by a
Python object of one of the array scalar types.
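
These properties can be inspected directly, as in the sketch below. The dtype is set to np.int64 explicitly here, which is an assumption made so that the item size is the same on every platform.

```python
import numpy as np

arr = np.array([11, 27, 18], dtype=np.int64)

print(arr.dtype)      # int64: one data type for the whole array
print(arr.itemsize)   # 8: bytes used by every single item
print(arr[0])         # 11: zero-based indexing

# An extracted item is a NumPy array scalar, not a plain Python int.
item = arr[1]
print(type(item))
```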

Lists vs. ndarray


In Python we have lists that serve the purpose of
arrays, but they are slow to process. NumPy aims to
provide an array object that is up to 50x faster than
traditional Python lists.

Lists
• A list is an array of heterogeneous objects.
• List items are stored in different places in
memory, which makes them slow to process.
• Lists are not optimized to work with the latest CPUs.
• A 1-dimensional list: ['A', 56, 67.05]

ndarrays
• An ndarray is an array of homogeneous objects.
• An ndarray is stored in one continuous place in
memory, which makes it faster to process.
• ndarrays are optimized to work with the latest CPUs.
• A 1-dimensional ndarray: ([12, 17, 25])
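
The difference between the two can be seen in a short sketch that uses the example values from the comparison above. It illustrates behaviour only and does not measure speed.

```python
import numpy as np

py_list = ['A', 56, 67.05]          # mixed types are fine in a list
arr = np.array([12, 17, 25])        # one type for the whole array

# Multiplying a list repeats it; multiplying an ndarray scales each item.
print(py_list * 2)
print(arr * 2)

# Forcing mixed values into an ndarray converts them to one common type
# (here, strings).
mixed = np.array(py_list)
print(mixed.dtype.kind)   # 'U' stands for a Unicode string type
```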

[Figure: list memory allocation, with each item stored at its own separate memory location]

[Figure: ndarray memory allocation, with a header (PyObject_HEAD, data, dimensions, strides) pointing to one contiguous block of items]

You can see from the memory layouts why built-in lists
are slower than ndarrays.
To process data much faster we will
use NumPy in future lessons; make sure to get a hold
of it.

Key Points
• NumPy stands for Numerical Python, which
is a Python package used for working with
arrays.
• It also has functions for working in the domains
of linear algebra, Fourier transforms, and
matrices.
• ndarrays or n-dimensional arrays are
homogeneous arrays, which are optimized
for fast processing.
• ndarrays also provide many functions that
make them suitable for working with data.
SCIPY PACKAGE
Chapter 05
• Features of SciPy
• Data Structures
• SciPy Sub-Packages
SciPy
The SciPy library of Python is built to work with
NumPy arrays and provides many user-friendly and
efficient numerical routines, such as routines for
numerical integration and optimization. Together,
they run on all popular operating systems and are
quick to install.
In [1]: # Import packages
        from scipy import integrate
        import numpy as np

        def my_integrator(a, b, c):
            # Integrate a*exp(b*x) + c over x from 0 to 100
            my_fun = lambda x: a * np.exp(b * x) + c
            y, err = integrate.quad(my_fun, 0, 100)
            print('ans: %1.4e, error: %1.4e' % (y, err))
            return (y, err)

        # Call function
        my_integrator(5, -10, 3)

ans: 3.0050e+02, error: 4.5750e-10

Out[1]: (300.5, 4.574965520082099e-10)

Key features of SciPy


Combined with NumPy, SciPy becomes a powerful tool
for data processing, with features like:
• The SciPy package contains various toolboxes
dedicated to common issues in scientific
computing. Its different submodules correspond
to different applications, such as
interpolation, integration, optimization, image
processing, statistics, special functions, etc.
• SciPy is the core package for scientific
routines in Python; it is meant to operate
efficiently on NumPy arrays, so that NumPy and
SciPy work hand in hand.
• SciPy is organized into sub-packages covering
different scientific computing domains, so you
can import only the parts you need.

Data structures
The basic data structure used by SciPy is a multidimensional
array provided by the NumPy module.
NumPy provides some functions for linear algebra,
Fourier transforms and random number generation,
but not with the generality of the equivalent
functions in SciPy. Besides these, SciPy offers
physical and mathematical constants,
interpolation, data input and output,
sparse matrices, etc.
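
As an illustration of one of these extras, the sketch below stores a mostly-zero matrix in a sparse format. The matrix values are made up for the example, and csr_matrix (Compressed Sparse Row) is just one of several sparse formats that SciPy offers.

```python
import numpy as np
from scipy import sparse

# A mostly-zero matrix (values are illustrative).
dense = np.array([
    [1, 0, 0, 0],
    [0, 0, 3, 0],
    [0, 0, 0, 0],
    [0, 7, 0, 9],
])

# Compressed Sparse Row format stores only the non-zero entries.
compact = sparse.csr_matrix(dense)

print(compact.nnz)          # number of stored non-zero values
print(compact.toarray())    # convert back to a dense ndarray
```

A sparse format pays off when most entries are zero, because only the non-zero values and their positions are kept in memory.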

[Figure: a dense matrix next to a sparse matrix, illustrating the use of sparse matrices]

SciPy sub-packages
As we already know, SciPy is organized into
sub-packages covering different scientific computing
domains, so we can import them according to our
needs rather than importing the whole library.
The following table shows the list of
sub-packages of SciPy:
[Link] Mathematical constants


[Link] Fourier transform
[Link] Integrate routines
[Link] Interpolation
[Link] Data input and output
[Link] Linear algebra routines
[Link] Optimization
[Link] Signal processing
[Link] Sparse matrices
[Link] Spatial data structures
[Link] Special mathematics
[Link] Statistics
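
As a sketch of importing only what is needed, the example below pulls in five SciPy sub-packages by name. The mapping to the table rows is my own reading of the descriptions (constants, integrate, linalg, optimize and stats), and the small calculations are arbitrary examples.

```python
import math

from scipy import constants, integrate, linalg, optimize, stats

# Mathematical constants
print(constants.pi == math.pi)

# Integration routines: integrate x**2 from 0 to 3 (the exact value is 9).
area, _ = integrate.quad(lambda x: x ** 2, 0, 3)

# Linear algebra routines: determinant of a 2 x 2 matrix.
det = linalg.det([[1.0, 2.0], [3.0, 4.0]])

# Optimization: find the minimum of (x - 2)**2.
result = optimize.minimize_scalar(lambda x: (x - 2) ** 2)

# Statistics: mean of the standard normal distribution.
mean = stats.norm.mean()

print(area, det, result.x, mean)
```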

Key Points
• SciPy Package is a toolbox which is used
for common scientific issues.
• SciPy together with NumPy creates a
dynamic tool for data processing.
• Along with NumPy functions, SciPy provides
a lot of functions to perform different
tasks with ndarrays.
• SciPy is divided into sub-packages
dedicated to different tasks.
MATPLOTLIB LIBRARY
Chapter 06
• Features of matplotlib
• Data Visualization
• PyPlot in Matplotlib
MatPlotLib
Matplotlib is a Python library used to create 2D
graphs and plots from Python scripts. It has a
module named pyplot which makes plotting easy
by providing features to control line
styles, font properties, axis formatting, etc.

[Figure: sample Matplotlib charts, one plotted against days of the week and one scatter plot]

Key features of MatPlotLib


Matplotlib is a popular choice for data
visualization because of features like:
• It supports a very wide variety of graphs and
plots, namely histograms, bar charts, power
spectra, error charts, and many more.
• It is used along with NumPy to provide an
environment that is an effective open-source
alternative to MATLAB.
• Using its PyPlot module, plotting simple
graphs or any other charts is very easy.

Data Visualization
Data visualization is the graphical representation
of information and data. By using visual
elements like charts, graphs, and maps, data visualization
tools provide an accessible way to see
and understand trends, outliers, and patterns in
data.

In the world of Big Data, data visualization
tools and technologies are essential to analyze
massive amounts of information and make
data-driven decisions. Data visualization helps us
to view data in a graphical or more interesting
way rather than viewing a big chunk of numbers in
a uniform line.
We will process, analyze and then visualize our
data. If we don't visualize our data, it loses a
lot of the impact it would have in the form of bar graphs,
pie charts, etc.

PyPlot in Matplotlib
matplotlib.pyplot is a collection of functions that make matplotlib work like MATLAB. Each pyplot function makes some change to a figure: e.g., creates a figure, creates a plotting area in a figure, plots some lines in a plotting area, decorates the plot with labels, etc.
To test it yourself, jump to Jupyter Notebook and start off by importing the matplotlib.pyplot module.

In [ ]: import matplotlib.pyplot as mplt

To plot a simple graph, pass a list to the plot function and then call the show function to view the graph.

We have successfully plotted our graph with some random values in a list.
If we want, we can name the x and y axes using the xlabel and ylabel functions respectively.

In [2]: import matplotlib.pyplot as mplt

mplt.plot([1,3,6,9])
mplt.xlabel('X_Axis')
mplt.ylabel('Y_Axis')
mplt.show()

The graph has a solid blue line. We can change its color and line style by passing another argument to the plot function, like 'ro', where 'r' stands for red and 'o' for circles.

In [2]: import matplotlib.pyplot as mplt

mplt.plot([1,3,6,9],'ro')
mplt.xlabel('X_Axis')
mplt.ylabel('Y_Axis')
mplt.show()

[Figure: the four values plotted as red circles, with the x-axis labelled X_Axis]

The letters and symbols of the format string, like 'ro', come from MATLAB: you concatenate a color string with a line style string. There are many symbols for different shapes and colors, for example 'b-' for a solid blue line. You'll find all the symbols for different colors and shapes in the following list.

Line and shape styles

'-'    Solid line
'--'   Dashed line
':'    Dotted line
'-.'   Dash-dot line
'o'    Circle
'+'    Plus sign
'*'    Asterisk
'.'    Point
'x'    Cross
'_'    Horizontal line
'|'    Vertical line
's'    Square
'd'    Diamond
'^'    Upward-pointing triangle
'v'    Downward-pointing triangle
'>'    Right-pointing triangle
'<'    Left-pointing triangle
'p'    Pentagon
'h'    Hexagon

Color styles

'y'    yellow
'm'    magenta
'c'    cyan
'r'    red
'g'    green
'b'    blue
'w'    white
'k'    black
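The format strings described above can be tried out with a short sketch like the one below. It assumes Matplotlib is installed and selects the non-interactive "Agg" backend so that it also runs outside a notebook; the axes object `ax` and the two line variables are names introduced only for this illustration.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: no display available; omit inside Jupyter
import matplotlib.pyplot as mplt

values = [1, 3, 6, 9]

fig, ax = mplt.subplots()
# 'ro' = color 'r' (red) + shape 'o' (circle markers, no connecting line)
(red_circles,) = ax.plot(values, "ro")
# 'b--' = color 'b' (blue) + line style '--' (dashed line)
(blue_dashed,) = ax.plot(values, "b--")
ax.set_xlabel("X_Axis")
ax.set_ylabel("Y_Axis")
# In a notebook you would now call mplt.show() to display the figure.
```

Each call to plot returns the line objects it created, which makes it easy to check which color, marker and line style a format string was turned into.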

Apart from the color and the line and shape styles, the plotted graphs can be customized in many other ways. You can learn those in a separate book dedicated to data science or data visualization.

Key Points
• MatPlotLib is a library used for visualizing our data using its MATLAB-like functions
• MatPlotLib's PyPlot module makes it easier to plot data, with full control over color, line & shape, font, axis labels, etc.
• It supports a wide range of graphs and plots like histograms, bar graphs, pie charts, and even 3-D graphs.
• MatPlotLib is one of the most widely used libraries for data visualization

07 SCIKIT LEARN LIBRARY

• Features of Scikit learn library

Scikit Learn or Sklearn


Scikit-learn or Sklearn is one of the most useful and robust libraries for machine learning in Python. It provides a selection of efficient tools for machine learning and statistical modeling, including classification, regression, clustering and dimensionality reduction, via a consistent interface in Python. This library, which is largely written in Python, is built upon NumPy, SciPy and Matplotlib.

Features of Scikit Learn


Scikit-learn focuses on modelling data. The following are the most popular groups of models provided by the library:
• Supervised learning algorithms, like Linear Regression, Support Vector Machine (SVM), Decision Tree, etc., are part of scikit-learn.
• Unsupervised learning algorithms, from clustering, factor analysis and PCA (Principal Component Analysis) to unsupervised neural networks.
• Cross validation, dimensionality reduction, ensemble methods, feature extraction and feature selection are also features of scikit-learn. They are used, respectively, to check the accuracy of supervised models, reduce the number of attributes in the data, combine the predictions of multiple supervised models, extract features, and identify useful features in the data.
08 TYPES OF MACHINE LEARNING

• Supervised learning
• Unsupervised learning
• Deep learning
• Reinforcement learning
• Deep reinforcement learning

In the previous lessons we learned about the various libraries and packages used in the process of machine learning. Now let's look at the types of machine learning.

Types of machine learning

The following are the different types of machine learning:
• Supervised learning
• Unsupervised learning
• Deep learning
• Reinforcement learning
• Deep reinforcement learning

Supervised learning
As its name suggests, in supervised learning we train a machine learning model with supervision. We feed the data, train the model and tell the model to make predictions. If the predictions are correct, we leave the model to work on its own from there; otherwise we help the model to predict correctly until it learns to do so. It is the same as teaching a child to solve questions at first, until they can solve them on their own.

Types of supervised learning


Regression and classification are two types of supervised machine learning. They can be understood as:
• Regression is the type of machine learning in which we feed the model with data like: 'A' (input, i.e. X) has a value of 65 (output, i.e. Y), 'B' has a value of 66, etc. Based on the given data, the model learns the relation between the input and output (here 'A' and 65).
Once the model is trained with sufficient data, we provide an input, let's say 'C', and let the model predict the output, but you must know the real output for that input. You compare the prediction with the real value to check whether it is correct or wrong. If the predictions are correct we pass the model. If the predictions aren't correct, we continue training the model.
Regression in turn has different types, such as linear regression. Logistic regression, despite its name, is used for classification. We will learn about them in their separate lesson.
• Classification is the type of machine learning in which we feed data and the model classifies the data into different groups. Consider the following example: the data has different types of shapes in it. We will teach the model which shape is which, i.e. what the different groups in the data are. We will provide the groups with their features like:

[Figure: example shapes grouped as circle, square, oval and rounded square]

Now the trained model can classify any data after learning how the groups are formed. If a new shape is passed, it will classify it according to what it has learned. As with regression, we keep feeding it data until it classifies the data correctly.

Classification also has different types, like decision trees, Naive Bayes classification, support vector machines, etc.
We will learn about them in the lesson dedicated to this topic.
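The train, predict and check loop described above for supervised learning can be sketched with scikit-learn as follows. This is only an illustration under stated assumptions: the numbers are made up (they follow y = 2x + 1 exactly), LinearRegression is just one possible regression model, and the variable names are hypothetical.

```python
from sklearn.linear_model import LinearRegression

# Hypothetical training data: inputs X and known outputs Y (here y = 2x + 1)
X_train = [[1], [2], [3], [4], [5]]
y_train = [3, 5, 7, 9, 11]

model = LinearRegression()
model.fit(X_train, y_train)      # "feed the data and train the model"

# Check the model on an input whose real output we already know
X_check = [[10]]
y_true = 21
y_pred = model.predict(X_check)[0]

# If the error is small we "pass" the model, otherwise we keep training
error = abs(y_pred - y_true)
print(y_pred, error)
```

With real data the prediction will rarely match exactly, so in practice you would compare the error against a tolerance that makes sense for your problem.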

Unsupervised learning
Unlike supervised learning, we don't teach the model or check the predictions it makes. Instead we feed it unlabeled data and ask for predictions directly, so the model has to find patterns on its own. The more data you feed it, the more accurate the results tend to be. Unsupervised learning is used in artificial intelligence applications like face detection, object detection, etc.

Deep learning
Deep learning models are based on Artificial Neural Networks (ANNs), for example Convolutional Neural Networks (CNNs). There are several architectures used in deep learning, such as deep neural networks, deep belief networks, recurrent neural networks, and convolutional neural networks.
These models are used to solve problems like:
• computer vision, image classification, etc.
• bioinformatics
• drug design
• games, etc.
Deep learning is a large subject in its own right, so we will not discuss it in much detail, but let's see what it is and how it solves problems.
Deep learning requires a lot of data and computational power, but nowadays high performance computing is available to us. Let's consider an example where a deep learning model tells us whether an animal is a horse or not. The network is given a large number of horse photos as data, analyzes them and tries to extract patterns from them like horns, color, saddle, eyes, etc.

The neural network comes to a conclusion about whether the animal is a horse or not, but how it reached that conclusion is unknown. The reasoning cannot be obtained from deep learning models, which is why they are also considered a black box.

Reinforcement learning
Reinforcement learning consists of learning models which don't require any input or output data; instead they learn to perform certain tasks like solving puzzles or playing games. If the model performs a step correctly it is rewarded with points, otherwise it is penalized. The more the model performs, the more it learns from its mistakes. The model creates data as it performs the task, rather than receiving data at the beginning.
For example, consider the following model:

[Figure not available]

Deep Reinforcement learning


As the name suggests, this is the combination of reinforcement learning and deep learning. Reinforcement algorithms are combined with deep learning to create powerful Deep Reinforcement Learning (DRL) models that are used in fields like robotics, video games, finance and healthcare. Many problems once considered unsolvable are solved by these models. DRL models are still new to us and there is a lot to learn about them.

09 MATHEMATICS FOR MACHINE LEARNING

• Data instances
• Statistics
• Probability
• Bayes Theorem

Although every mathematical calculation will be performed by the computer, you need to know the important formulas and mathematical notations even if you're not solving them yourself. In this chapter we will go through the important concepts in mathematics required for machine learning.

Data instances
Data is what we need to perform all the functions, i.e. data is the base of everything. You need to know what type of data is required in which process. Let's consider the following as our data:

Day         Sales
Monday      33
Tuesday     37
Wednesday   14
Thursday    29

In the above table there are two columns, Day and Sales, and four rows. In the data we have two things, features and labels, i.e. features of the data (numeric values like 33) and labels of the data (descriptive values like Monday). Here Day is the label and Sales is the feature.

The labels also have the following two types:
• Nominal: these data aren't ordered. They have no hierarchy or upper or lower status. In the following data the labels have either a True or False value, i.e. they are considered nominal data

Answer   Question
True     01
False    02
False    03
True     04

• Ordinal: these data are ordered. They have upper or lower status. In the following data the labels have an order in the values, like Good > Average > Bad, i.e. ordinal data

Product ID   Rating
101          Average
102          Good
103          Average
104          Bad

Similarly, the features also have two different types:
• Discrete, or countable values. These values can only take certain separate values. For example, in the following data the feature, number of children (NoC) in different families, can only be a whole number such as 1, 2 or 3, i.e. it is called discrete data

Family     NoC
Smith's    2
Matrin's   1
Cox's      3
Hyde's     2

• Continuous, or infinitely divisible values. These values don't have such a restriction. For example, in the following data the feature, weight of different people, can take any value in a range. It could be 110 pounds, 110.20 pounds or even 110.21 pounds, i.e. continuous data.

Person   Weight
Thon     110
Max      122
Mary     96
Alex     120

Data Collection
Data is collected from many sources. Let's say we want data about the whole country. Either we survey the whole population, which is time consuming, or we select a sample of the population and survey its data. The sample can be selected randomly, on the basis of characteristics, or by other features.
Likewise, instead of feeding a whole data set to a model, we can take a sample from it to save time while still getting useful results.

[Figure: a sample drawn from the population]
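Random sampling as described above can be sketched with Python's standard library. The population below is a made-up list of 1,000 record ids, and the fixed seed is only there so that the sketch gives the same sample on every run.

```python
import random

# Hypothetical population: 1,000 record ids
population = list(range(1, 1001))

random.seed(42)  # fixed seed, an assumption made for reproducibility
# Simple random sample of 100 records, drawn without replacement
sample = random.sample(population, k=100)

population_mean = sum(population) / len(population)
sample_mean = sum(sample) / len(sample)
print(population_mean, sample_mean)
```

The sample mean will usually be close to, but not exactly equal to, the population mean; larger samples tend to land closer.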

Statistics
Statistics is often thought of as data visualization, like bar graphs, etc., but statistics also includes data collection, data analysis and its representation. As you may have learnt in school, we perform statistical analysis of data, like finding the central tendency and visualizing the data on graphs.
Descriptive and inferential statistics are used in machine learning.

Descriptive statistics
In descriptive statistics we work with the whole data, i.e. the population rather than a sample. In descriptive statistics we have the following:
• Central Tendency
The mean, median and mode of a data set are referred to as its central tendency. We can find each of them very easily. Let's consider the following data and find its central tendency,

Day         Sales
Monday      33
Tuesday     37
Wednesday   14
Thursday    29

To find the mean, or the average, of the sales, we need to add all the values together and divide the sum by the total number of values

$$\bar{x} = \frac{\text{sum of all values}}{\text{total no. of values}} = \frac{33 + 37 + 14 + 29}{4} = 28.25$$

The average or mean sales is 28.25. As mentioned earlier, you don't need to perform the calculations yourself; they will be done by the computer. There are even dedicated functions in the pandas library, such as the mean() method of a DataFrame, to find it. You just need to know about it and how the value is obtained, so you understand what happens in which analysis.
To find the median, or the middle value, sort the numbers. If the data has an odd number of values, the middle value is the median, like

12  34  (56)  71  77

56 is the median of the above data. But if the number of values is even, like our sales data (sorted: 14, 29, 33, 37), we need to find the sum of the middle pair and divide that by 2

$$\text{median} = \frac{\text{sum of mid pair}}{2} = \frac{29 + 33}{2} = 31$$

And at last we have the mode, or the most frequently occurring value. It can be found by inspection, but in our data we don't have any repeating values, so we will consider the following example

(12)  34  (12)  71  77  (12)  56  78

12 is the mode in the above data as it occurred three times. A data set can contain several repeated values; the mode is the one that occurs most often.
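The three measures of central tendency can be checked with Python's built-in statistics module, as in the short sketch below. The text mentions pandas for this; the standard library is used here only to keep the example self-contained. The second list is a hypothetical data set in which 12 appears three times.

```python
import statistics

sales = [33, 37, 14, 29]

mean_sales = statistics.mean(sales)      # sum of all values / number of values
median_sales = statistics.median(sales)  # even count: average of the middle pair

# Hypothetical data in which the value 12 occurs three times
values = [12, 34, 12, 71, 77, 12, 56, 78]
mode_value = statistics.mode(values)     # most frequently occurring value

print(mean_sales, median_sales, mode_value)
```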

• Variability or Spread
Range, interquartile range, variance and standard deviation are referred to as variability or measures of spread

[Figure: Spread, divided into Range, Interquartile Range, Variance and Standard Deviation]

Range is the difference between the maximum and minimum value in a data set. For example, the range of the sales data is

$$\text{range} = \max - \min = 37 - 14 = 23$$
Interquartile Range is similar to the range but a bit different. Let's consider the following data

12 27 33 35 35 42 45 47 51 53 54

We will divide the data into quarters with the numbers 33, 42 and 51 as separators

[12 27] 33 [35 35] 42 [45 47] 51 [53 54]

and subtract the first separator from the third separator, which gives the interquartile range

$$\text{Interquartile range} = \text{3rd separator} - \text{1st separator} = 51 - 33 = 18$$
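The range and interquartile range worked out above can be reproduced with the standard library. The sketch relies on statistics.quantiles with its default "exclusive" method, which happens to pick the same three separators as the hand calculation for this data; this choice of function is an assumption of the example, not something the text prescribes.

```python
import statistics

sales = [33, 37, 14, 29]
sales_range = max(sales) - min(sales)   # maximum minus minimum

data = [12, 27, 33, 35, 35, 42, 45, 47, 51, 53, 54]
# Three cut points that split the sorted data into four quarters
q1, q2, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1                           # third separator minus first separator

print(sales_range, (q1, q2, q3), iqr)
```

Be aware that there are several conventions for computing quartiles, so other tools can report slightly different separators for the same data.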
Variance, or the average squared difference of the values from the expected value, can be obtained with the following formula, where x_i is an individual data point, x̄ is the mean and n is the total number of data values

$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n}$$

If you want, you can find the variance of any data by substituting the values into the formula, but make sure to remember how the variance is found

Next, if we want to find the deviation, i.e. the difference of each value from the average or mean, we can use the following formula, where i is the index of a value in the data and μ represents the mean

$$\text{Deviation} = (x_i - \mu)$$

If we want the squared deviation for the data of a population, we use the mean of the whole population, which is represented by μ

$$(x_i - \mu)^2$$

But if we want the squared deviation for the data of a sample from a population, we use the mean of the sample instead of the whole population, which is represented by x̄. Working from a sample like this is part of inferential statistics

$$(x_i - \bar{x})^2$$

Averaging these squared deviations over all values gives the variance, written σ² for a population and s² for a sample.

Similarly, we can find the standard deviation, or the dispersion of the data from its mean, through the following formula, where N is the number of data points or values

$$\sigma = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2}$$
Let's consider an example to understand how to find the standard deviation. We will find the standard deviation of our sales data.

First we need the mean μ, i.e. 28.25, which we found earlier.

Then we find the difference of each sales value from the mean, square the differences and add them

$$(33 - 28.25)^2 + (37 - 28.25)^2 + (14 - 28.25)^2 + (29 - 28.25)^2 = 22.5625 + 76.5625 + 203.0625 + 0.5625 = 302.75$$

Finally we take the square root of the product of 1/N and 302.75 to find σ

$$\sigma = \sqrt{\frac{1}{4} \times 302.75} = \sqrt{75.6875} \approx 8.70 \qquad [N = 4]$$

Therefore, the standard deviation of the sales data is approximately 8.70. As mentioned earlier, don't sweat the calculation, just understand the application of the formula
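The variance and standard deviation steps above (subtract the mean, square, average, take the root) can be written out directly in Python. The sketch below treats the four sales values as a whole population, so it divides by N; the last line shows the sample version from the standard library, which divides by N - 1 instead.

```python
import math
import statistics

sales = [33, 37, 14, 29]

mean = statistics.mean(sales)
# Squared difference of every value from the mean
squared_deviations = [(x - mean) ** 2 for x in sales]

# Population variance and standard deviation: divide by N
population_variance = sum(squared_deviations) / len(sales)
population_std = math.sqrt(population_variance)

# The standard library gives the same population values directly
assert math.isclose(population_variance, statistics.pvariance(sales))
assert math.isclose(population_std, statistics.pstdev(sales))

# Sample standard deviation: divides by N - 1 instead of N
sample_std = statistics.stdev(sales)

print(population_variance, population_std, sample_std)
```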

Entropy and Information gain

Entropy, or the uncertainty in the data, can be found with the following formula:

$$H(S) = -\sum_{i=1}^{N} p_i \log_2 p_i$$

where S stands for the set of all instances in the data, N refers to the number of distinct values and p_i stands for the probability of the i-th value. Through entropy we can further calculate the information gain from a variable through the following formula

$$Gain(A, S) = H(S) - \sum_{j=1}^{v} \frac{|S_j|}{|S|} \, H(S_j)$$

where A is the feature or variable whose information gain is being calculated, H(S) is the entropy of the whole dataset, |S_j| is the number of instances with value j of the feature A, |S| is the number of all data instances in the dataset, v is the number of distinct values of the feature A, and H(S_j) is the entropy of the subset of data instances with value j of feature A. The weighted sum is written H(A,S), the entropy of feature A in the dataset, so that Gain(A, S) = H(S) - H(A,S)
Let's consider an example to understand entropy and information gain more clearly. We have the following dataset

Day   Discount   Advertisement   Sales
1     10%        No              Average
2     25%        No              Maximum
3     20%        Yes             Maximum
4     10%        Yes             Maximum
5     25%        No              Average
6     10%        Yes             Maximum
7     20%        No              Maximum
8     20%        No              Maximum
9     10%        Yes             Average
10    20%        Yes             Maximum

You are told to find the best feature, i.e. Discount or Advertisement, to have Sales as Maximum. So which feature will you choose to create a model to predict the best values to have maximum sales? We will find the information gain from each feature to figure that out. Let's find the information gain from Discount. We have the following details about Discount

Total values: 10 (Average: 3, Maximum: 7)

Discount   Average   Maximum   Total
10%        2         2         4
20%        0         4         4
25%        1         1         2

Now we can find the entropy of the whole dataset using the entropy formula

$$H(S) = -\sum_{i=1}^{N} p_i \log_2 p_i$$

The dataset has 10 instances, and the probabilities of Average sales and Maximum sales are 3/10 and 7/10 respectively

$$H(S) = -\frac{3}{10}\log_2\frac{3}{10} - \frac{7}{10}\log_2\frac{7}{10} \approx 0.88$$

After obtaining the entropy we can substitute the values in the information gain formula to find the information gain from the Discount feature. The 10% and 25% subsets are evenly split between Average and Maximum (entropy 1), while the 20% subset contains only Maximum (entropy 0)

$$Gain(Discount, S) = 0.88 - \frac{4}{10}(1) - \frac{4}{10}(0) - \frac{2}{10}(1) = 0.88 - 0.4 - 0.0 - 0.2 = 0.88 - 0.6 = 0.28$$

The information gain from the Discount feature is 0.28. The feature with the highest information gain is used in models to predict values, to get better results.

Similarly we can find the information gain of the Advertisement feature. Of the 5 days with Advertisement = Yes, 1 has Average and 4 have Maximum sales; of the 5 days with Advertisement = No, 2 have Average and 3 have Maximum sales

$$Gain(A, S) = H(S) - H(A,S)$$

$$= 0.88 - \frac{5}{10}\left(-\frac{1}{5}\log_2\frac{1}{5} - \frac{4}{5}\log_2\frac{4}{5}\right) - \frac{5}{10}\left(-\frac{2}{5}\log_2\frac{2}{5} - \frac{3}{5}\log_2\frac{3}{5}\right)$$

$$= 0.88 - 0.36 - 0.49 = 0.88 - 0.85 = 0.03$$
It is clear that we need to use the Discount feature rather than the Advertisement feature because it gives more information gain. And again, as mentioned earlier, just understand what's going on. This is one of the important techniques for data scientists
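The entropy and information gain formulas can be applied to the ten-day dataset above with a few lines of Python. This is a minimal sketch: the function names entropy and information_gain are made up for this illustration, and the three lists simply copy the Discount, Advertisement and Sales columns of the table.

```python
import math
from collections import Counter

def entropy(labels):
    """H(S) = -sum(p_i * log2(p_i)) over the distinct values in labels."""
    total = len(labels)
    return -sum((count / total) * math.log2(count / total)
                for count in Counter(labels).values())

def information_gain(feature_values, labels):
    """Gain(A, S) = H(S) - sum(|S_j| / |S| * H(S_j)) over the values j of A."""
    total = len(labels)
    weighted_entropy = 0.0
    for value in set(feature_values):
        subset = [lab for feat, lab in zip(feature_values, labels) if feat == value]
        weighted_entropy += len(subset) / total * entropy(subset)
    return entropy(labels) - weighted_entropy

# Columns of the ten-day table, in day order
discount = ["10%", "25%", "20%", "10%", "25%", "10%", "20%", "20%", "10%", "20%"]
advertisement = ["No", "No", "Yes", "Yes", "No", "Yes", "No", "No", "Yes", "Yes"]
sales = ["Average", "Maximum", "Maximum", "Maximum", "Average",
         "Maximum", "Maximum", "Maximum", "Average", "Maximum"]

gain_discount = information_gain(discount, sales)
gain_advertisement = information_gain(advertisement, sales)
print(round(entropy(sales), 3), round(gain_discount, 3), round(gain_advertisement, 3))
```

Whichever feature yields the larger gain is the better first choice for splitting the data, which is how decision tree algorithms use this measure.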

Confusion matrix
A confusion matrix is used to calculate the accuracy of a model. To calculate that, we need to create the confusion matrix first. Let's say we created a model that predicts weather and we asked it to predict whether it will rain for the next 30 days. After 30 days, we matched the predicted values with the actual values and found that the model predicted 25 days correctly but 5 were incorrect.

Total: 30        Predicted no rain   Predicted rain
Actual no rain   8                   3
Actual rain      2                   17

Here 8 is referred to as True negatives (T-), 3 as False positives (F+), 2 as False negatives (F-) and 17 as True positives (T+). We can use these values to calculate the accuracy of the model using the following formula:

$$\text{accuracy} = \frac{(T+) + (T-)}{(T+) + (T-) + (F+) + (F-)}$$

So the accuracy of our model is

$$\text{accuracy} = \frac{17 + 8}{17 + 8 + 3 + 2} = \frac{25}{30} \approx 0.83$$

Similarly we can calculate the error rate or misclassification rate using the following formula:

$$\text{Error rate} = \frac{(F+) + (F-)}{(T+) + (T-) + (F+) + (F-)} = \frac{3 + 2}{17 + 8 + 3 + 2} = \frac{5}{30} \approx 0.17$$

It is the same as 1 - accuracy. Next, we can calculate the precision of the model using the formula below:

$$\text{Precision} = \frac{(T+)}{(T+) + (F+)} = \frac{17}{17 + 3} = \frac{17}{20} = 0.85$$

Precision can also be defined as the share of the model's positive predictions that are actually true positives
We can also calculate the recall using the formula below:

$$\text{Recall} = \frac{(T+)}{(T+) + (F-)} = \frac{17}{17 + 2} = \frac{17}{19} \approx 0.89$$

And finally we can calculate the F-measure, which is useful when a model has high precision and low recall or vice versa, using the formula below:

$$\text{F-measure} = \frac{2 \times \text{Recall} \times \text{Precision}}{\text{Recall} + \text{Precision}} = \frac{2 \times 0.89 \times 0.85}{0.89 + 0.85} = \frac{1.51}{1.74} \approx 0.87$$
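All of the measures derived from the confusion matrix can be computed from its four cells. The sketch below uses the counts of the rain example; the short variable names (tp, tn, fp, fn) are introduced here for the four cells and are not notation from the text.

```python
# Cells of the confusion matrix from the rain example
tp = 17  # true positives  (T+): rain predicted, rain occurred
tn = 8   # true negatives  (T-): no rain predicted, no rain occurred
fp = 3   # false positives (F+): rain predicted, no rain occurred
fn = 2   # false negatives (F-): no rain predicted, rain occurred

total = tp + tn + fp + fn

accuracy = (tp + tn) / total
error_rate = (fp + fn) / total          # same as 1 - accuracy
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f_measure = 2 * recall * precision / (recall + precision)

print(round(accuracy, 2), round(error_rate, 2),
      round(precision, 2), round(recall, 2), round(f_measure, 2))
```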

Probability
Probability is a simple way to quantify how likely the outcome of an event is, using the following formula

$$P(A) = \frac{\text{Favourable outcomes}}{\text{Total outcomes}}$$

Three concepts that belong to both statistics and probability are required for machine learning: the probability density function, the normal distribution and the central limit theorem.
Probability Density Function
A probability density function has the following properties:
• It is continuous over the range
• The area between the curve and the x-axis is equal to 1
• The probability that the variable lies between a and b is the area under the curve between a and b
Any variable that satisfies these conditions is called a continuous random variable

Normal Distribution
Variables (features) that follow a normal distribution with mean 0 and variance 1 are called standard normal random variables. A normal distribution is an arrangement of a data set in which most values cluster in the middle of the range and the rest taper off symmetrically toward either extreme. In a normal distribution the mean, median and mode are the same. We can represent the distribution as the graph below:

[Figure: bell-shaped curve of the normal distribution]

where μ is the mean and σ is the standard deviation. The formula of the normal distribution can be represented as:

$$y = \frac{1}{\sigma\sqrt{2\pi}} \, e^{-\frac{(x - \mu)^2}{2\sigma^2}}$$

where e ≈ 2.718
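The normal distribution formula can be evaluated directly and compared against Python's built-in statistics.NormalDist. In this sketch the function name normal_pdf is hypothetical; it simply spells out the formula with μ and σ as parameters.

```python
import math
from statistics import NormalDist

def normal_pdf(x, mu=0.0, sigma=1.0):
    """Density of the normal distribution with mean mu and std deviation sigma."""
    coefficient = 1.0 / (sigma * math.sqrt(2.0 * math.pi))
    exponent = -((x - mu) ** 2) / (2.0 * sigma ** 2)
    return coefficient * math.exp(exponent)

standard = NormalDist(mu=0.0, sigma=1.0)  # mean 0, variance 1

# The hand-written formula agrees with the library implementation
peak = normal_pdf(0.0)
assert math.isclose(peak, standard.pdf(0.0))

# Share of values within one standard deviation of the mean
within_one_sigma = standard.cdf(1.0) - standard.cdf(-1.0)
print(peak, within_one_sigma)
```

The density is highest at the mean and symmetric around it, which matches the description of values clustering in the middle and tapering off toward both extremes.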

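Central Limit Theorem

The theorem states that the means of sufficiently large samples drawn from a population are approximately normally distributed around the population mean μ, so the average of the sample means should be close to the population mean

[Figure: the means of Sample 1, Sample 2 and Sample 3 compared with the population mean]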
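A small simulation can make the relationship between sample means and the population mean more concrete. Everything in this sketch is an assumption chosen for illustration: the population is 20,000 random values from a skewed (exponential) distribution, each sample has 50 values, and 1,000 samples are drawn with a fixed seed.

```python
import random
import statistics

random.seed(0)  # fixed seed so the simulation is reproducible

# Hypothetical, deliberately skewed population
population = [random.expovariate(1.0) for _ in range(20_000)]
population_mean = statistics.mean(population)

# Draw many samples and record the mean of each one
sample_means = [
    statistics.mean(random.sample(population, k=50))
    for _ in range(1_000)
]
mean_of_sample_means = statistics.mean(sample_means)

print(population_mean, mean_of_sample_means)
# The sample means are spread far less widely than the individual values
print(statistics.stdev(population), statistics.stdev(sample_means))
```

Even though the population itself is far from bell-shaped, a histogram of sample_means would look roughly normal and be centred near the population mean.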

Types of Probability
Probability can be classified into the following three types:
• Marginal probability, the probability of an event without conditions, like drawing a particular number from the first ten natural numbers.
• Joint probability, the probability of two events occurring together, like drawing a card that is both red and a 4 from a deck of cards.
• Conditional probability, the probability of an event given that one or more other events (conditions) have occurred. The condition may already be fulfilled or may be checked at the moment of the event. For example, drawing a joker card from your friend's hand, where the joker may or may not already be present.

Bayes Theorem
Bayes theorem is a way of finding the probability of an event when we know the probabilities of other events or conditions. The formula is given as:

$$P(A|B) = \frac{P(B|A) \, P(A)}{P(B)}$$

Let's say P(Fire) means how often there is fire, and P(Smoke) means how often we see smoke, then:

• P(Fire|Smoke) means how often there is fire when we can see smoke
• P(Smoke|Fire) means how often we can see smoke when there is fire
So if we know the following probabilities:
• dangerous fire, P(Fire) = 1%
• smoke, P(Smoke) = 10%
• smoke when there is a dangerous fire, P(Smoke|Fire) = 90%
Then the probability of a dangerous fire when there is smoke is:

$$P(Fire|Smoke) = \frac{P(Smoke|Fire) \, P(Fire)}{P(Smoke)} = \frac{90\% \times 1\%}{10\%} = 9\%$$

Using Bayes theorem we can find the probabilities of many events. There is even a Naive Bayes model in machine learning, which we will learn in the upcoming lessons.
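The fire and smoke calculation can be written as a tiny function that applies Bayes theorem. The function name bayes and the variable names are made up for this sketch; the probabilities are the ones from the example, read as P(Fire) = 1%, P(Smoke) = 10% and P(Smoke|Fire) = 90%.

```python
def bayes(p_b_given_a, p_a, p_b):
    """Return P(A|B) = P(B|A) * P(A) / P(B)."""
    return p_b_given_a * p_a / p_b

p_fire = 0.01              # how often there is a dangerous fire
p_smoke = 0.10             # how often we see smoke
p_smoke_given_fire = 0.90  # how often a dangerous fire produces smoke

p_fire_given_smoke = bayes(p_smoke_given_fire, p_fire, p_smoke)
print(f"P(Fire|Smoke) = {p_fire_given_smoke:.0%}")
```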

10 SCIKIT LEARN ALGORITHMS

• Regression algorithm
• Classification algorithm
• Clustering algorithm

Algorithms
We have already learnt about the different types of machine learning, like supervised learning, etc. But how will you choose the best algorithm for your problem? For that purpose we need to understand the algorithms we are going to use, so we can decide which algorithm is suitable for our problem

Regression Algorithm
As a rule of thumb, any scikit-learn algorithm requires more than 50 data points or values. The following visual will help you understand how to choose among scikit-learn's regression models, which are used to predict a quantity:

[Figure: flowchart for choosing a regression model. Recoverable labels: fewer than 100k samples; SGD Regressor (Stochastic Gradient Descent, for sparse data); "some features should be important"; a linear model that estimates sparse coefficients; "when there are multiple features which are correlated"; Ordinary Least Squares; Support Vector Regression (kernel = linear)]

If the Ridge regressor and SVR (kernel = linear) don't work out, we can use ensemble regressors and SVR (kernel = rbf) instead

Classification Algorithm
Scikit-learn's classification algorithms are chosen in the following manner to predict categories when we have labeled data:

[Figure: flowchart for choosing a classifier with labeled data. Recoverable labels: more than 100k samples: SGD Classifier, then kernel approximation if it is not working; fewer than 100k samples: Linear SVC; if it is not working, Naive Bayes for text data, otherwise KNeighbors Classifier; if that is not working, ensemble classifiers]

Linear SVC (Support Vector Classification) is faster in the case of a linear kernel, i.e. it doesn't have the kernel parameter.
On the other hand, the SGD (Stochastic Gradient Descent) Classifier implements regularized linear models with stochastic gradient descent (SGD) learning

Clustering Algorithm
Scikit-learn's clustering algorithms are chosen in the following manner to predict categories when we have unlabeled data and know the number of categories:

[Figure: flowchart for unlabeled data with a known number of categories. Recoverable labels: more than 10k samples: MiniBatch KMeans; otherwise KMeans; if not working, Gaussian Mixture Model]

But if we don't know the number of categories, the choice works as follows:

[Figure: flowchart for unlabeled data with an unknown number of categories. Recoverable labels: MeanShift; Variational Bayesian Gaussian mixture model]
We also have dimensionality reduction algorithms for when we just want to view the data. The sole purpose of this chapter was to understand the different algorithms, as we have already learnt the different machine learning types. So before solving any problem you will be able to choose the right algorithm for you, instead of choosing one randomly and selecting another if the accuracy is very low

11 IMPORTING DATA

• Importing CSV Data
• Importing JSON Data
• Importing XLS Data
Data files
Often we have data in multiple file formats, like data on the sales of a product, the number of subscribers, etc. There are a lot of ways to import data, but we will use the pandas library here

Importing CSV Data


Reading data from CSV (comma separated values) files is a fundamental necessity in Data Science. Often, we get data from various sources which can be exported to CSV format so that it can be used by other systems. To work with CSV files we need one first; you can download the sample file from here [Link] (check the Resources)
To import it, you need to move the CSV file to the place where your Jupyter Notebook is hosted. To find that location, use the code below

In [32]: import os
print(os.getcwd())

C:\Users\Rahul

Move your file there, and use the read_csv function of the pandas library to import the CSV file

In [38]: import pandas as pan

dt = pan.read_csv("csv_data.csv")
dt

    id  name             price  sales  brand
0  101  biscuits          5.00    227  HomeFoods
1  102  cookies           7.25    158  TBakery
2  103  cake             12.00     50  TBakery
3  104  whey_supplement  34.90     24  MusleUp
4  105  protein_bars      4.90     85  MusleUp
5  106  potato_chips      1.75    121  HomeFoods

We can access a single column of the CSV data by its name, just as with any DataFrame

In [39]: import pandas as pan

dt = pan.read_csv("csv_data.csv")
dt['sales']

Out[39]: 0    227
1    158
2     50
3     24
4     85
5    121
Name: sales, dtype: int64

We can extract two or more columns from the data using the loc[:, [<columns>]] syntax

In [44]: import pandas as pan

dt = pan.read_csv("csv_data.csv")
dt.loc[:,['name','sales']]

   name             sales
0  biscuits           227
1  cookies            158
2  cake                50
3  whey_supplement     24
4  protein_bars        85
5  potato_chips       121

or just with some rows

In [46]: import pandas as pan
dt = pan.read_csv("csv_data.csv")
dt.loc[4:6,['name','sales']]

Out[46]:
   name          sales
4  protein_bars     85
5  potato_chips    121
To access a single element, we can use its row-column index with the values attribute

In [49]: import pandas as pan

dt = pan.read_csv("csv_data.csv")
dt

    id  name             price  sales  brand
0  101  biscuits          5.00    227  HomeFoods
1  102  cookies           7.25    158  TBakery
2  103  cake             12.00     50  TBakery
3  104  whey_supplement  34.90     24  MusleUp
4  105  protein_bars      4.90     85  MusleUp
5  106  potato_chips      1.75    121  HomeFoods

In [50]: biscuits_sales = dt.values[0,3]

biscuits_sales

Out[50]: 227

[Figure: the DataFrame with column positions 0 to 4 marked; dt.values[0,3] points at the sales value 227 in row 0]

The data values are stored as ndarrays, so to access single elements we can use slicing similar to that of DataFrames
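The CSV steps above can be tried without downloading the sample file. The sketch below is an illustration under assumptions: it recreates the sample data as an in-memory string and reads it through io.StringIO instead of "csv_data.csv", keeping the alias pan used in the text.

```python
import io
import pandas as pan

# In-memory stand-in for the sample CSV file (contents copied from the table)
csv_text = """id,name,price,sales,brand
101,biscuits,5.00,227,HomeFoods
102,cookies,7.25,158,TBakery
103,cake,12.00,50,TBakery
104,whey_supplement,34.90,24,MusleUp
105,protein_bars,4.90,85,MusleUp
106,potato_chips,1.75,121,HomeFoods
"""

dt = pan.read_csv(io.StringIO(csv_text))

sales = dt["sales"]                          # a single column
subset = dt.loc[:, ["name", "sales"]]        # all rows, two columns
last_rows = dt.loc[4:5, ["name", "sales"]]   # rows labelled 4 to 5 (inclusive)
biscuits_sales = dt.values[0, 3]             # row 0, column 3 of the ndarray

# Label-based access to the same cell, without going through the ndarray
same_cell = dt.at[0, "sales"]

print(subset)
print(last_rows)
print(biscuits_sales, same_cell)
```

Note that loc slices by label and includes the end label, whereas positional slicing of the values ndarray excludes the end position, as in NumPy.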

Importing JSON Data


A JSON file stores data as text in a human-readable format. JSON stands for JavaScript Object Notation. Get your sample JSON data here json_data (check the Resources)
Pandas can read JSON files using the read_json function.

In [2]: import pandas as pan
dt = pan.read_json("json_data.json")
dt

    ID  Name             Price  Sales  Brand
0  101  Biscuits          5.00    227  HomeFoods
1  102  Cookies           7.25    158  TBakery
2  103  Cake             12.00     52  TBakery
3  104  Whey Supplement  34.90     24  MusleUp
4  105  Protein Bars      4.90     85  MusleUp
5  106  Potato Chips      1.75    121  HomeFoods

Similar to the CSV files., we can perform all the


slicing and data extraction with JSON data files

In [6]: import pandas as pan


dt = pan.read_json("json_data.json")
print([Link][:,["ID","Name","Sales"]])
print(dt["Name"])
print([Link][5,4])

ID Name Sales
0 101 Biscuits 227
1 102 Cookies 158
2 103 Cake 52
3 104 Whey Supplement 24
4 105 Protein Bars 85
5 106 Potato Chips 121
0 Biscuits
1 Cookies
2 Cake
3 Whey Supplement
4 Protein Bars
5 Potato Chips
Name: Name, dtype: object
HomeFoods

Importing EXCEL Data


Microsoft Excel is a very widely used spreadsheet
program. Its user friendliness and appealing
features make it a very frequently used tool in
Data Science. Get your sample Excel data here:
xlsx_data (check the Resources)
The read_excel function of the pandas library is
used to read the content of an Excel file into the
Python environment as a pandas DataFrame.

In [9]: import pandas as pan


dt = pan.read_excel("xlsx_data.xlsx")
dt

     id  name             price  sales  brand      Unnamed: 5
0   101  biscuits          5.00    227  HomeFoods  NaN
1   102  cookies           7.25    158  TBakery    NaN
2   103  cake             12.00     50  34         TBakery
3   104  whey_supplement  34.90     24  MusleUp    NaN
4   105  protein_bars      4.90     85  MusleUp    NaN
5   106  potato_chips      1.75    121  HomeFoods  NaN

As Excel sheets are imported as Pandas DataFrames,
we can perform all the DataFrame tasks on the Excel
data.
You may notice that we have an Unnamed: 5 column with
NaN values [except dt.values[2,5]]. Let's clean up
our data.
First we need to remove the Unnamed: 5 column,
which we can do using the del keyword

In [10]: import pandas as pan


dt = pan.read_excel("xlsx_data.xlsx")
del dt["Unnamed: 5"]

As we have learned earlier, the del keyword
removes the whole column, so we don't need any
further data cleansing for it.

We have removed the Unnamed: 5 column

In [11]: import pandas as pan


dt = pan.read_excel("xlsx_data.xlsx")
del dt["Unnamed: 5"]
dt

id name price sales brand

0 101 biscuits 5.00 227 HomeFoods

1 102 cookies 7.25 158 TBakery

2 103 cake 12.00 50 34

3 104 whey_supplement 34.90 24 MusleUp

4 105 protein_bars 4.90 85 MusleUp

5 106 potato_chips 1.75 121 HomeFoods

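Now, we need to replace the misplaced value 34 in
the brand column of row 2 [dt.values[2,4] now that
the Unnamed: 5 column is gone] with TBakery. We can
use the replace method, which swaps every matching
value in the DataFrame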

In [12]: import pandas as pan


dt = pan.read_excel("xlsx_data.xlsx")
del dt["Unnamed: 5"]
dt.replace({34:"TBakery"})

id name price sales brand

0 101 biscuits 5.00 227 HomeFoods

1 102 cookies 7.25 158 TBakery

2 103 cake 12.00 50 TBakery

3 104 whey_supplement 34.90 24 MusleUp

4 105 protein_bars 4.90 85 MusleUp

5 106 potato_chips 1.75 121 HomeFoods
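
Because replace() looks at every cell, a quick sketch can show why restricting it to one column is safer. The small DataFrame below is a made-up stand-in for the Excel data (it is not the book's sample file), and it deliberately contains a genuine sales figure of 34 next to the misplaced one.

```python
import pandas as pd

# Hypothetical stand-in for the cleaned Excel data (not the book's file).
dt = pd.DataFrame({
    "id": [101, 102, 103],
    "name": ["biscuits", "cookies", "cake"],
    "sales": [227, 158, 34],                 # a legitimate sales value of 34
    "brand": ["HomeFoods", "TBakery", 34],   # the misplaced 34
})

# Frame-wide replace also rewrites the legitimate 34 in "sales".
everywhere = dt.replace({34: "TBakery"})

# Column-wise replace only touches the "brand" column.
fixed = dt.copy()
fixed["brand"] = fixed["brand"].replace({34: "TBakery"})

print(everywhere.loc[2, "sales"])
print(fixed.loc[2, "sales"], fixed.loc[2, "brand"])
```

In the book's data the number 34 appears only in the brand column, so the frame-wide call works there; on larger data the column-wise form avoids accidental changes.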

So our data is clean with no errors. Try recapping
the chapter and attempt the Exercise, where you'll
be provided with sample data files [links] with
lots of errors, and you have to perform all the data
cleansing practised in the previous lesson. This
will be a very good exercise to help you understand
more about data processing and cleansing
12 DATA OPERATIONS
• NumPy operations
• Pandas operations
• Cleaning data

Python handles data of various formats mainly
through two libraries, Pandas and NumPy. We
have already seen the important features of these
two libraries in the previous chapters. In this
chapter we will see some basic examples from each
of the libraries on how to operate on data and
perform different tasks like cleaning the data,
analytics, etc.

NumPy Operations
To start working with NumPy, we need to import
numpy to create NumPy arrays.

In [ ]: import numpy

Now let's create an array using the array()
function and print it.

In [2]: import numpy

ar = numpy.array([1,5,7])
print(ar)

[1 5 7]

ar is a 1-Dimensional array. We can also create a
2-Dimensional array by placing one or more
1-Dimensional arrays inside another array

In [8]: import numpy

ar = numpy.array([[1,5,7], [2,3,9]])
print(ar)

[[1 5 7]
[2 3 9]]

[Figure: each bracketed row, [1 5 7] and [2 3 9], is a 1-D array; the two rows enclosed together form the 2-D array.]


We can specify the dimension of an array during


creation using the ndmin parameter

In [10]: import numpy


ar = numpy.array([1,5,7], ndmin = 2 )
print(ar)

[[1 5 7]]

Although we passed a 1-Dimensional array, it


became a 2-Dimensional array because of the
specification of the dimensions of the array in the
ndmin parameter
We created an array with integers so, let's create
arrays with strings and floats using the dtype
parameter with the same values
In [11]: import numpy
ar_str = numpy.array([1,5,7], dtype = str )
ar_flt = numpy.array([1,5,7], dtype = float)
print(ar_str)
print(ar_flt)

['1' '5' '7']
[1. 5. 7.]

ar_str is an array of string literals and ar_flt
is an array of floats. We can also change these
numbers to complex numbers the same way using
complex as dtype

In [13]: import numpy


ar_str = numpy.array([1,5,7], dtype = str )
ar_flt = numpy.array([1,5,7], dtype = float)
ar_cmx = numpy.array([1,5,7], dtype = complex )
print(ar_str)
print(ar_flt)
print(ar_cmx)

['1' '5' '7']
[1. 5. 7.]
[1.+0.j 5.+0.j 7.+0.j]

ar_str, ar_flt and ar_cmx are arrays created
with the same data but with different data types:
strings, floats and complex numbers respectively.

Pandas Operations
Pandas handles data through Series and DataFrame
objects (older versions also had a Panel type, which
has since been removed from pandas). We will learn
to create each of these.

Pandas Series
We already know what a Pandas Series is. A pandas
Series can be created using the Series() function,
so let's import pandas and create a series.

In [14]: import pandas

sr = pandas.Series([1,5,7])
print(sr)

0 1
1 5
2 7
dtype: int64

As you can see, our data is indexed from 0 to 2
with the data type printed as integer. We can
specify our own indexes in the index parameter

In [16]: import pandas

sr = pandas.Series([1,5,7], index = ['A','B','C'])
print(sr)

A 1
B 5
C 7
dtype: int64

Like ndarrays, we can also specify the data type


in pandas series using dtype parameter during
series creation

In [18]: import pandas


sr = pandas.Series([1,5,7], dtype = complex )
print(sr)

0 1.000000+0.000000j
1 5.000000+0.000000j
2 7.000000+0.000000j
dtype: complex128

We can use a ndarray to create a pandas series

In [19]: import numpy


import pandas
ar = numpy.array([1,5,7])
sr = pandas.Series( data = ar, copy = True )
# is same as sr = pandas.Series(ar, copy = True)
print(sr)

0 1
1 5
2 7
dtype: int32

We passed the ar ndarray as the data for the
series [use of the data parameter isn't necessary,
it's just for better understanding] and also used
the copy parameter to create a copy of the data.
If you want to get the data without the indexes,
use the values attribute

In [21]: import numpy
import pandas
ar = numpy.array([1,5,7])
sr = pandas.Series(ar)
print(sr.values)

[1 5 7]

You can print a more detailed version of the
above using the array attribute

In [22]: import numpy
import pandas
ar = numpy.array([1,5,7])
sr = pandas.Series(ar)
print(sr.array)

<PandasArray>
[1, 5, 7]
Length: 3, dtype: int32

You can use the values or the array attribute
according to your needs: values gives just the
values, while array gives a summarized view of the
data in that pandas Series. Also note the difference
between the array() function in NumPy and the array
attribute in Pandas.
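
As a small sketch of the same idea, the to_numpy() method is another way to get a plain ndarray out of a Series. The labels A, B, C below are only illustrative.

```python
import numpy as np
import pandas as pd

sr = pd.Series(np.array([1, 5, 7]), index=["A", "B", "C"])

raw = sr.to_numpy()   # plain NumPy ndarray, index labels are dropped
wrapped = sr.array    # pandas array wrapper around the same values

print(type(raw).__name__, raw)
print(type(wrapped).__name__, len(wrapped))
```

The exact class name printed for sr.array depends on the installed pandas version, so treat it as informational only.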

Pandas Data Frames


Pandas DataFrames align data in a tabular
fashion of rows and columns. A pandas DataFrame
can be created using the DataFrame() function; we
need to pass a dictionary as the data

In [23]: import pandas

df = pandas.DataFrame({"Product":['Cookies','Biscuits'],
                       "Sales":[157,227]})
print(df)

Product Sales
0 Cookies 157
1 Biscuits 227

Dictionary keys are the columns and their values


are the content of the rows of the Data Frame. We
can also use index parameter here

In [24]: import pandas


df = pandas.DataFrame({"Product":['Cookies','Biscuits'],
                       "Sales":[157,227]}, index = [1,2])
print(df)

Product Sales
1 Cookies 157
2 Biscuits 227

We can define the columns and their data
separately using ndarrays

In [42]: import pandas
import numpy
ar = numpy.array([[1,3],[6,2]])

df = pandas.DataFrame(data = ar,
                      index = ['A','B'],
                      columns = ['C1','C2'])
print(df)

C1 C2
A 1 3
B 6 2

The data is stored in the ndarray and the columns
are defined in the DataFrame's columns parameter.
Note that a 2-Dimensional ndarray containing two
1-Dimensional arrays is passed to the data
parameter to act as the data

We can add columns to the DataFrame using the
<DataFrame>[<column_name>] = <values>
syntax

In [44]: import pandas
import numpy
ar = numpy.array([[1,3],[6,2]])

df = pandas.DataFrame(data = ar,
                      index = ['A','B'],
                      columns = ['C1','C2'])
df['C3'] = (df['C1']*5)
print(df)

C1 C2 C3
A 1 3 5
B 6 2 30

We can delete columns from the DataFrame using
the del keyword

In [45]: import pandas
import numpy
ar = numpy.array([[1,3],[6,2]])

df = pandas.DataFrame(data = ar,
                      index = ['A','B'],
                      columns = ['C1','C2'])
df['C3'] = (df['C1']*5)
del df['C2']
print(df)

C1 C3
A 1 5
B 6 30

We can print a column of the DataFrame using the
<DataFrame>[<column_name>] syntax

In [46]: import pandas
import numpy
ar = numpy.array([[1,3],[6,2]])

df = pandas.DataFrame(data = ar,
                      index = ['A','B'],
                      columns = ['C1','C2'])
print(df['C1'])

A 1
B 6
Name: C1, dtype: int32

Slicing Syntax
To get a single element from an ndarray, a pandas
Series or a pandas DataFrame, we index it with square
brackets; a range of elements uses the slice syntax
<array>[start:end:step(optional)].
Let's extract some elements from the arrays we
have created so far.

In [59]: import numpy as npy

ar1 = npy.array([1, 5])
ar2 = npy.array([[1, 3],
                 [5, 2]])
ar3 = npy.array([[[1, 3],
                  [5, 2]],
                 [[2, 4],
                  [4, 6]]])

# Slicing 1-Dimensional array
print(ar1[0])

# Slicing 2-Dimensional array
print(ar2[0,1])

# Slicing 3-Dimensional array
print(ar3[1,0,1])

1
3
4

We use a comma , to index further into arrays of 2
or more dimensions. The following figure will help
you understand the slicing of the 3-Dimensional
array better

[Figure: step-by-step slicing of ar3.
Full array:    ar3         = [[[1, 3], [5, 2]], [[2, 4], [4, 6]]]
First slice:   ar3[1]      = [[2, 4], [4, 6]]
Second slice:  ar3[1,0]    = [2, 4]
Final slice:   ar3[1,0,1]  = 4]

Slicing may seem a bit tough for beginners due to
the dimensions; that's why I created the figure to
help you understand slicing better. If you are
confident, try solving the slicing questions in the
Exercise
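
The examples so far pick single elements. As a minimal sketch of the start:end:step form itself (the arrays below are small illustrative ones, slightly different from the ones above), ranges can be taken along any dimension:

```python
import numpy as np

ar1 = np.array([1, 5, 7, 9])
ar2 = np.array([[1, 3],
                [5, 2]])
ar3 = np.array([[[1, 3], [5, 2]],
                [[2, 4], [4, 6]]])

first_two = ar1[0:2]         # positions 0 and 1 (end is excluded)
every_other = ar1[::2]       # whole array in steps of 2
second_column = ar2[:, 1]    # all rows, column 1
block1_col0 = ar3[1, :, 0]   # block 1, all rows, column 0

print(first_two, every_other, second_column, block1_col0)
```

A bare colon means "everything along this dimension", which is how a whole row, column or block is selected.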

To get an element from a pandas Series, we use the
<series>[<label>] or <series>[<position>]
syntax

In [4]: import pandas as pan

sr = pan.Series([1, 3, 5], index = ['a','b','c'])

print(sr['a']) # label-based indexing
print(sr[0]) # position-based indexing (recent pandas prefers sr.iloc[0])

1
1

If the index labels are themselves numbers, like these

In [6]: import pandas as pan
sr = pan.Series([1, 3, 5], index = [2,4,6])

and you want the second element by its label
[the labels defined in the index parameter], use the
<series>.loc[<label>] syntax; to get it by its
position [0,1,2,...], use the
<series>.iloc[<position>] syntax

In [7]: import pandas as pan

sr = pan.Series([1, 3, 5], index = [2,4,6])

print(sr.loc[4])
print(sr.iloc[1])

3
3
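
The difference between the two accessors is easiest to see when the same numbers are used as a range. A short sketch with the same Series:

```python
import pandas as pd

sr = pd.Series([1, 3, 5], index=[2, 4, 6])

by_label = sr.loc[2:4]      # labels 2 through 4, end label included
by_position = sr.iloc[2:4]  # positions 2 and 3, end position excluded

print(by_label.tolist())
print(by_position.tolist())
```

loc treats 2:4 as labels and includes both ends, while iloc treats 2:4 as positions and stops before the end, so only the last element is returned.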

We can modify or delete the elements using indexing

In [9]: import pandas as pan
sr = pan.Series([1, 3, 5], index = [2,4,6])

sr[4] = 7
print(sr)

del sr[6]
print(sr)

2 1
4 7
6 5
dtype: int64
2 1
4 7
dtype: int64
Let's say we have a DataFrame like this

In [27]: import pandas as pan

sr = pan.DataFrame({'Product':['Biscuit','Cookies'],
                    'Sales':[227,158]}, index = [1,2])
sr

Out[27]:
Product Sales

1 Biscuit 227

2 Cookies 158

and we want the Sales column only; for that, use the
<DataFrame>[<column_label>] syntax

In [32]: import pandas as pan
sr = pan.DataFrame({'Product':['Biscuit','Cookies'],
                    'Sales':[227,158]}, index = [1,2])
sr['Sales']

Out[32]: 1 227
2 158
Name: Sales, dtype: int64

or, to get the second row only, use the
<DataFrame>.loc[<row_label>] syntax

In [33]: import pandas as pan
sr = pan.DataFrame({'Product':['Biscuit','Cookies'],
                    'Sales':[227,158]}, index = [1,2])
sr.loc[2] # You can also use sr.iloc[1]

Out[33]: Product Cookies


Sales 158
Name: 2, dtype: object

or, to get the sales of cookies only, use the
<DataFrame>.values[<row>,<column>] syntax

In [37]: import pandas as pan
sr = pan.DataFrame({'Product':['Biscuit','Cookies'],
                    'Sales':[227,158]}, index = [1,2])
sr.values[1,1]

Out[37]: 158

The values are stored as an ndarray, which is why we
used indexing similar to that of 2-Dimensional
ndarrays.
We can delete a whole column from the DataFrame

In [41]: import pandas as pan
sr = pan.DataFrame({'Product':['Biscuit','Cookies'],
                    'Sales':[227,158]}, index = [1,2])
sr

Out[41]:
Product Sales

1 Biscuit 227

2 Cookies 158

In [45]: del sr['Sales']


sr

Out[45]:
Product

1 Biscuit

2 Cookies

but we cannot delete a single value

In [47]: import pandas as pan
sr = pan.DataFrame({'Product':['Biscuit','Cookies'],
                    'Sales':[227,158]}, index = [1,2])
del sr.values[1,1]

ValueError Traceback (most recent call last)
<ipython-input-47-4422f0e71c59> in <module>
2 sr = pan.DataFrame({'Product':['Biscuit','Cookies'],
3 'Sales':[227,158]}, index = [1,2])
----> 4 del sr.values[1,1]

ValueError: cannot delete array elements

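nor can you modify a value this way. The DataFrame
mixes text and numbers, so values hands back a copy
of the data, and the assignment changes that copy
rather than the DataFrame

In [48]: import pandas as pan
sr = pan.DataFrame({'Product':['Biscuit','Cookies'],
                    'Sales':[227,158]}, index = [1,2])
sr

Out[48]:
Product Sales

1 Biscuit 227
2 Cookies 158

In [51]: sr.values[1,1] = 162
sr.values[1,1]

Out[51]: 158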
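
To actually change or drop a single entry, the DataFrame's own accessors can be used instead of values. A minimal sketch with the same two-row data (the new numbers 162 and 230 are arbitrary):

```python
import pandas as pd

df = pd.DataFrame({"Product": ["Biscuit", "Cookies"],
                   "Sales": [227, 158]}, index=[1, 2])

df.at[2, "Sales"] = 162      # label-based: row label 2, column "Sales"
df.iloc[0, 1] = 230          # position-based: first row, second column
trimmed = df.drop(index=1)   # drop() returns a new frame without row 1

print(df)
print(trimmed)
```

at and iloc write into the DataFrame itself, while drop leaves the original untouched and returns a shortened copy.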

More with ndarrays


We can reverse an ndarray using the <array>[::-1] syntax

In [56]: import numpy as npy
ar = npy.array([1,2,3,4])
ar

Out[56]: array([1, 2, 3, 4])

In [55]: ar = ar[::-1]
ar

Out[55]: array([4, 3, 2, 1])

We can sort a whole ndarray in place with the sort()
method, without doing it the long way

In [63]: import numpy as npy

ar = npy.array([5,1,3,9])
ar

Out[63]: array([5, 1, 3, 9])

In [64]: ar.sort()
ar

Out[64]: array([1, 3, 5, 9])

There are many built-in ndarray methods that are
not discussed now but will be used in future
lessons. You may go to the documentation to find
all the functions and their roles; as we don't
require every function for our data processing and
analysis, the miscellaneous functions are not
covered in this book

Data Cleansing
Let's consider a situation like the one below

In [71]: import pandas as pan
import numpy as npy
ar = npy.array([[1,2,3],[4,7,2],[4,9,1]])
df = pan.DataFrame( data = ar, index = ['a','c','e'],
                    columns = ['C1','C2','C3'])
df

Out[71]:
C1 C2 C3

a 1 2 3

c 4 7 2

e 4 9 1

In [72]: df = df.reindex(['a','b','c','d','e'])
df

Out[72]:
C1 C2 C3

a 1.0 2.0 3.0

b NaN NaN NaN

c 4.0 7.0 2.0

d NaN NaN NaN

e 4.0 9.0 1.0

The reindexed DataFrame has NaN values in the b and
d rows. This happened because there is no data for
the b and d rows. Using reindexing, we have created a
DataFrame with missing values. In the output, NaN
means Not a Number. To make detecting missing values
easier (and across different array dtypes), Pandas
provides the isnull() and notnull() functions, which
are also methods on Series and DataFrame objects

Name: C1, dtype: bool

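
As a hedged sketch of how these two methods might be used on the reindexed DataFrame (the variable names below are illustrative, not from the book):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.array([[1, 2, 3], [4, 7, 2], [4, 9, 1]]),
                  index=["a", "c", "e"], columns=["C1", "C2", "C3"])
df = df.reindex(["a", "b", "c", "d", "e"])

missing_c1 = df["C1"].isnull()                 # True where C1 is NaN
missing_per_column = df.isnull().sum()         # number of NaN values per column
complete_rows = df[df.notnull().all(axis=1)]   # rows without any NaN

print(missing_c1)
print(missing_per_column)
print(complete_rows)
```

isnull() on a single column returns a boolean Series, which is the kind of output whose last line reads "Name: C1, dtype: bool".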

Pandas provides various methods for cleaning the
missing values. The fillna function can fill in NaN
values with non-null data in a couple of ways, like
replacing NaN values with 0

In [74]: df.fillna(0)

Out[74]:
C1 C2 C3

a 1.0 2.0 3.0

b 0.0 0.0 0.0

c 4.0 7.0 2.0

d 0.0 0.0 0.0

e 4.0 9.0 1.0

We can copy the value above or below the missing
data using pad or bfill in the method parameter of
the fillna function

In [75]: df.fillna( method = 'pad' )

C1 C2 C3

a 1.0 2.0 3.0

b 1.0 2.0 3.0

c 4.0 7.0 2.0

d 4.0 7.0 2.0

e 4.0 9.0 1.0
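
Recent pandas releases deprecate the method parameter of fillna in favour of the dedicated ffill() and bfill() methods. A minimal sketch of both directions on the same reindexed data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.array([[1, 2, 3], [4, 7, 2], [4, 9, 1]]),
                  index=["a", "c", "e"], columns=["C1", "C2", "C3"])
df = df.reindex(["a", "b", "c", "d", "e"])

forward = df.ffill()    # copy the last valid value downwards (same as 'pad')
backward = df.bfill()   # copy the next valid value upwards

print(forward)
print(backward)
```

With forward fill row b takes the values of row a; with backward fill it takes the values of row c.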

We can drop the rows with missing values with the
dropna function

In [76]: df.dropna()

Out[76]:
C1 C2 C3

a 1.0 2.0 3.0

c 4.0 7.0 2.0

e 4.0 9.0 1.0


If we want to change a value in a DataFrame, we can
use the replace function, which swaps every
occurrence of that value

In [78]: import pandas as pan
import numpy as npy
ar = npy.array([[1,2,3],[4,7,2],[4,9,1]])
df = pan.DataFrame( data = ar, index = ['a','c','e'],
                    columns = ['C1','C2','C3'])
df.replace({3:13})

Out[78]:
C1 C2 C3

a 1 2 13

c 4 7 2

e 4 9 1
13 DATA ANALYSIS & PROCESSING
• Data Analytics
• Correlations between attributes
• Skewness of the data


As we learned in the mathematics for machine
learning lesson, we need a lot of analytics or
statistics of our data to know more about the data.
As we already know, the measures of central tendency,
i.e. mean, median and mode, are the basic statistics
of our data: they tell us the average of the data,
the middle (50%) value and the most frequently
occurring value in the whole data.
Likewise we will analyze our data and, as mentioned
earlier, we don't need to calculate them manually or
through formulas; there are plenty of functions
present in different libraries to conduct the
analysis
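
For instance, the three measures of central tendency are available directly on a pandas Series. The numbers below are made up purely for illustration:

```python
import pandas as pd

values = pd.Series([2, 3, 3, 5, 7, 10])   # illustrative data

mean = values.mean()       # arithmetic average
median = values.median()   # middle value (average of the two middle ones here)
mode = values.mode()       # most frequent value(s), returned as a Series

print(mean, median, mode.tolist())
```

mode() returns a Series because a data set can have more than one most frequent value.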

Data analytics
Before training any model we need to check the
data and its details. We will use 'trees.csv'
as our data for now. You can get the
file either by scanning the QR code or from
the link.
Make sure to move the file to the home
directory of Jupyter Notebook and then
import the csv data. Before doing any
further action, let's have a look at
our raw data

In [2]: import pandas

dt = pandas.read_csv('trees.csv')
dt

Out[2]:
Index "Girth (in)" "Height (ft)" "Volume(ft^3)"

0 1 8.3 70 10.3

1 2 8.6 65 10.3

2 3 8.8 63 10.2

3 4 10.5 72 16.4

4 5 10.7 81 18.8
The first analysis is to know the shape of the
data, i.e. how many rows and columns are present in
the data. We can do so by using the shape
attribute of the DataFrame object

In [4]: import pandas

dt = pandas.read_csv('trees.csv')
dt.shape

Out[4]: (31, 4)

So our data has 31 rows and 4 columns, i.e. 124
values in total. If we want, we can inspect just
the first 10 rows using the head() function,
passing 10 as the argument

In [5]: import pandas

dt = pandas.read_csv('trees.csv')
dt.head(10)

Index "Girth (in)" "Height (ft)" "Volume(ft^3)"

0 1 8.3 70 10.3

1 2 8.6 65 10.3

2 3 8.8 63 10.2

3 4 10.5 72 16.4

4 5 10.7 81 18.8

5 6 10.8 83 19.7

6 7 11.0 66 15.6

7 8 11.0 75 18.2

8 9 11.1 80 22.6

9 10 11.2 75 19.9

To get a statistical overview of the whole data
we can use the describe() function, which provides
8 properties: count, mean, standard deviation,
minimum value, 25% (first quartile), 50% (median),
75% (third quartile) and maximum value

In [6]: import pandas
dt = pandas.read_csv('trees.csv')
dt.describe()

Index "Girth (in)" "Height (ft)" "Volume(ft^3)"

count 31.000000 31.000000 31.000000 31.000000

mean 16.000000 13.248387 76.000000 30.170968

std 9.092121 3.138139 6.371813 16.437846

min 1.000000 8.300000 63.000000 10.200000

25% 8.500000 11.050000 72.000000 19.400000

50% 16.000000 12.900000 76.000000 24.200000

75% 23.500000 15.250000 80.000000 37.300000

max 31.000000 20.600000 87.000000 77.000000

If you want the values rounded off to, say, 2
decimal places, we can use the pandas set_option()
function and set the display precision to 2 (the
full option name, display.precision, works across
pandas versions). We can specify a lot of options
through this function

In [7]: import pandas

dt = pandas.read_csv('trees.csv')
pandas.set_option('display.precision',2)
dt.describe()

Out[7]:
Index "Girth (in)" "Height (ft)" "Volume(ft^3)"

count 31.00 31.00 31.00 31.00

mean 16.00 13.25 76.00 30.17

std 9.09 3.14 6.37 16.44

min 1.00 8.30 63.00 10.20

25% 8.50 11.05 72.00 19.40

50% 16.00 12.90 76.00 24.20

75% 23.50 15.25 80.00 37.30

max 31.00 20.60 87.00 77.00



Correlation between attributes


The relation between two attributes (feature or
label) in a data set is called correlation. It is
important to know the relations between the
attributes. We can measure them using the corr()
function with Pearson's Correlation Coefficient.
Pearson's Correlation Coefficient ranges from -1 to 1
and can be understood by the following:
• 1 represents perfect positive correlation
• 0 represents no linear relation at all
• -1 represents perfect negative correlation

In [2]: import pandas

dt = pandas.read_csv('trees.csv')
pandas.set_option('display.precision',2)
dt.corr(method='pearson')

Index "Girth (in)" "Height (ft)" "Volume(ft^3)"

Index 1.00 0.97 0.47 0.90

"Girth (in)" 0.97 1.00 0.52 0.97

"Height (ft)" 0.47 0.52 1.00 0.60

"Volume(ftA3)" 0.90 0.97 0.60 1.00

Note that we set the precision to 2 to keep the
values rounded off to 2 decimal places. In the
corr() function we specified pearson in the method
parameter.
As we already know that the Girth, Height and Volume
of a tree are correlated, we get values
around 0.5 - 1.0, which represent positive
correlation, i.e. taller trees and trees with a
larger girth tend to have a larger volume, and
vice versa
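
To see what corr() computes, here is a minimal sketch of Pearson's coefficient written out with NumPy. The helper name pearson is an illustrative choice, and the sample values are the first five Girth and Volume rows shown earlier, so the result will differ from the full 31-row figure.

```python
import numpy as np

def pearson(x, y):
    """Pearson's r: covariance of x and y divided by the product of their spreads."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()
    dy = y - y.mean()
    return (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

girth = [8.3, 8.6, 8.8, 10.5, 10.7]       # first five rows only
volume = [10.3, 10.3, 10.2, 16.4, 18.8]

r = pearson(girth, volume)
perfect_negative = pearson([1, 2, 3], [6, 4, 2])

print(round(r, 2), perfect_negative)
```

When one attribute falls exactly in step as the other rises, as in the second call, the coefficient is -1.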

Skewness of the data


Skewness describes data that appears to follow a
normal distribution but is stretched (skewed) to
either the left or the right. We need the
skewness of the data to correct it during its
preparation. The closer the value is to 0, the
less skewed the data is; a negative value means
the data is skewed to the left and a positive
value means it is skewed to the right, with values
near or beyond -1 or 1 indicating strong skew.
Let's check the skewness of our trees data
using the skew() function

In [3]: import pandas

dt = pandas.read_csv('trees.csv')
pandas.set_option('display.precision',2)
dt.skew()

Out[3]: Index 0.00
"Girth (in)" 0.55
"Height (ft)" -0.39
"Volume(ft^3)" 1.12
dtype: float64

As the Index column has evenly spaced values from 1
to 31, its skewness is 0, i.e. no skewness at all.
On the other hand, Girth is skewed to the
right side, Height is skewed to the left side and
Volume is highly skewed to the right side, i.e.
beyond 1. During data preparation we must consider
the skewness and keep it as close as possible
to 0
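
As a rough sketch of what skew() measures, the bias-adjusted sample skewness can be written out with NumPy. The function name sample_skewness is illustrative; the Volume values are the first ten rows shown earlier, so the number differs from the full-data 1.12.

```python
import numpy as np
import pandas as pd

def sample_skewness(values):
    """Adjusted Fisher-Pearson skewness, the estimator pandas uses for skew()."""
    x = np.asarray(values, dtype=float)
    n = x.size
    dev = x - x.mean()
    m2 = (dev ** 2).mean()          # second central moment
    m3 = (dev ** 3).mean()          # third central moment
    g1 = m3 / m2 ** 1.5             # plain (biased) skewness
    return np.sqrt(n * (n - 1)) / (n - 2) * g1

volume = [10.3, 10.3, 10.2, 16.4, 18.8, 19.7, 15.6, 18.2, 22.6, 19.9]
index_like = list(range(1, 32))     # evenly spaced, like the Index column

print(sample_skewness(volume), pd.Series(volume).skew())
print(sample_skewness(index_like))
```

Evenly spaced values are perfectly symmetric around their mean, which is why the Index column reports a skewness of 0.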

Data Processing
Before feeding the data to models we need to
pre-process the data, because the algorithms are
completely dependent on the data, so it must be
as clean and appropriate as possible. While
finding skewness we found that our data is skewed,
i.e. the skewness needs to be closer to 0 for better
results, so let's look at some processes to ready
our data
Scaling
Our data is spread over a wide range with
different scales, which is not suitable for training
models. We need to bring our data to a more
appropriate scale; we can do so using the
MinMaxScaler class and its fit_transform()
method from the scikit-learn library. We can scale
our data to the range of 0 to 1, which is a
common range for the algorithms

In [28]: import pandas

from sklearn import preprocessing
dt = pandas.read_csv('trees.csv')
ar = dt.values # array
# Scaler object
Sclr = preprocessing.MinMaxScaler(feature_range=(0,1))
skl_ar = Sclr.fit_transform(ar) # Scaling
# Scaled data
skl_dt = pandas.DataFrame(skl_ar,
                          columns=['S.No.','Girth','Height','Volume'])

skl_dt.round(1).loc[5:10]

Out[28]:
S.No. Girth Height Volume

5 0.2 0.2 0.8 0.1

6 0.2 0.2 0.1 0.1

7 0.2 0.2 0.5 0.1

8 0.3 0.2 0.7 0.2

9 0.3 0.2 0.5 0.1

10 0.3 0.2 0.7 0.2

You can compare the values with the unscaled values
below. If you want, you can change the range to, say,
0-100 through the feature_range parameter of
MinMaxScaler during the scaler class initialization

S.No. Girth Height Volume
5 5 10.7 81 18.8
6 6 10.8 83 19.7
7 7 11.0 66 15.6
8 8 11.0 75 18.2
9 9 11.1 80 22.6
10 10 11.2 75 19.9
Normalization
Normalization is used to rescale each row of data
to have a length of 1. It is mainly useful in
sparse datasets where we have lots of zeros. There
are two common types of normalization, namely L1
and L2. With the L1 method, the absolute values in
each row sum up to 1. We can demonstrate this
using the Normalizer class and its transform
method. To use the L1 method specify 'l1' in the
norm parameter of the class

In [3]: import pandas

from sklearn import preprocessing
dt = pandas.read_csv('trees.csv')
ar = dt.values
Nm = preprocessing.Normalizer(norm='l1')
nm_ar = Nm.transform(ar)
print(nm_ar[:5])
# Sum of the rows
for i in [0,1,2,3,4]:
    print(sum(nm_ar[i]))

[[0.01116071 0.09263393 0.78125 0.11495536]


[0.02328289 0.10011641 0.75669383 0.11990687]
[0.03529412 0.10352941 0.74117647 0.12 ]
[0.03887269 0.10204082 0.69970845 0.15937804]
[0.04329004 0.09264069 0.7012987 0.16277056]]
1.0
1.0
1.0
0.9999999999999999
1.0

We created the Nm object of the Normalizer class,
passing 'l1' as the normalizing method in the norm
parameter during initialization, normalized our
ar data values with the transform method of the
Normalizer class and stored the result in the nm_ar
variable. Then we printed the first 5 rows of the
normalized data values.
We also created a for loop to print the sum of
each row of the normalized data, i.e. 1 (the 4th
row shows 0.9999999999999999 only because of
floating-point rounding). Note that we didn't
perform any rounding off.
In the next method, i.e. L2 Normalization, the
squares of the values in each row sum up to 1. So
let's use 'l2' in the norm parameter and check
their sums

In [12]: import pandas


from sklearn import preprocessing
dt = pandas.read_csv('trees.csv')
ar = dt.values
Nm = preprocessing.Normalizer(norm='l2')
nm_ar = Nm.transform(ar)

print('L2 Normalization\n')
print(nm_ar[:5])

# Sum of values in the rows
print('\nSum of the values in each row\n')
for i in [0,1,2]:
    print(sum(nm_ar[i]))

# Sum of the squares of the values in the rows
sm_row = '\nSum of squares of the values in each row\n'
for i in [0,1,2,3]:
    print(sm_row)
    sm_row = 0
    for val in nm_ar[i]:
        sm_row += val*val

L2 Normalization

[[0.01403589 0.11649791 0.98251251 0.1445697 ]


[0.03012017 0.12951675 0.97890567 0.1551189 ]
[0.04651593 0.13644674 0.97683459 0.15815417]
[0.05355175 0.14057333 0.96393143 0.21956216]
[0.05953254 0.12739964 0.96442719 0.22384236]]

Sum of the values in each row

1.2576160130346932
1.2936614951212533
1.3179514295489714

Sum of squares of the values in each row

1.0
1.0
1.0
The code may be a bit hard to understand because of
the for loop, so let's walk through it. We
normalized the data as we did before, but this
time we used the 'l2' method and printed the data
values. Then we printed the sum of the values of the
first three rows of the normalized data, but they
didn't sum up to 1. Next, as the L2 method states,
we printed the sum of the squared values in the
first three rows using a for loop, which turned out
to be 1.
Before the for loop, we created an sm_row
variable in which we add up the squared values
of a row, but we stored a string in it at the start.
Then we created the outer loop, which gives us the
index of the rows. We put one extra number
in the list because on the first iteration the
string in sm_row is printed; after
printing it we reset its value to 0 and then
run the inner loop, which performs the
addition. In each iteration of the inner loop we
add the square of one element of the row
with the += compound assignment operator. After all
the values are summed up, we return to the outer
loop, print the sum and reset the value to 0
to store the sum for the next row, until all the
sums are printed
[Figure: trace of the loop. sm_row, the sum holder variable, starts as the heading string '\nSum of squares of...'. 1st run (i = 0): the heading is printed, sm_row is reset to 0, and the squares of nm_ar[0] = [0.01403589, 0.11649791, 0.98251251, 0.1445697] are added to it. 2nd run (i = 1): the sum 1.0 is printed and sm_row is reset to 0 again.]
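
Both normalizations, and the row sums checked above, can also be sketched with vectorized NumPy operations instead of loops. The three rows below are the first three rows of the trees data as shown earlier; this is an illustration of the arithmetic, not a replacement for the Normalizer class.

```python
import numpy as np

rows = np.array([[1, 8.3, 70, 10.3],
                 [2, 8.6, 65, 10.3],
                 [3, 8.8, 63, 10.2]])

# L1: divide each row by the sum of its absolute values.
l1 = rows / np.abs(rows).sum(axis=1, keepdims=True)
# L2: divide each row by the square root of the sum of its squares.
l2 = rows / np.sqrt((rows ** 2).sum(axis=1, keepdims=True))

l1_row_sums = np.abs(l1).sum(axis=1)     # each close to 1
l2_square_sums = (l2 ** 2).sum(axis=1)   # each close to 1

print(l1[0])
print(l2[0])
print(l1_row_sums, l2_square_sums)
```

keepdims=True keeps the per-row totals as a column, so each row is divided by its own total.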
Binarization
In binarization we binarize our data, i.e. reduce
every value to one of only two crisp values
using a threshold. For example, if we set the
threshold to 10, all the values in a data set up to
and including 10 will be converted to 0 and those
above 10 will be converted into
1. Let's binarize our data with the Binarizer class
and its transform() method

In [21]: import pandas

from sklearn import preprocessing
dt = pandas.read_csv('trees.csv')
ar = dt.values
Nm = preprocessing.Normalizer(norm='l1')
nm_ar = Nm.transform(ar)

Bin = preprocessing.Binarizer(threshold=0.1)
bin_ar = Bin.transform(nm_ar)
bin_ar[10:16]

Out[21]: array([[0., 0., 1., 1.],
[0., 0., 1., 1.],
[1., 0., 1., 1.],
[1., 1., 1., 1.],
[1., 0., 1., 1.],
[1., 1., 1., 1.]])

As you can see, we used the normalized (L1) data,
which has a range of 0 to 1; that made it easier to
set a threshold, which is specified in the threshold
parameter, i.e. 0.1.
So all the values up to 0.1 are changed to 0 and
all the values above 0.1 are changed to 1
Standardization
Standardization or standard scaling is the method
of rescaling the data attributes so that each
has the mean and spread of a standard Gaussian
(normal) distribution. In
this method the mean is changed to 0 and the standard
deviation is changed to 1. Let's standardize our data
using the StandardScaler class and its fit()
and transform() methods

In [14]: import pandas


import numpy
from sklearn import preprocessing
dt = pandas.read_csv('trees.csv')
ar = dt.values
# Standardizer
Std = preprocessing.StandardScaler().fit(ar)
std_ar = Std.transform(ar)
print(std_ar[0:3])

print('Mean:', round(numpy.mean(std_ar), 2))
print('Std.Dev:', round(numpy.std(std_ar), 2))

[[-1.67705098 -1.60291968 -0.9572127 -1.22883711]
[-1.56524758 -1.50574137 -1.75488995 -1.22883711]
[-1.45344419 -1.44095583 -2.07396086 -1.23502119]]
Mean: -0.0
Std.Dev: 1.0

During the StandardScaler object initialization
we also called the fit() function to fit the
scaler to our ar array, and then transformed it. If
you don't call fit you'll get an error like
This StandardScaler instance is not fitted
yet. Call 'fit' with appropriate arguments
before using this estimator.
You can also use the fit and transform functions in
the previous preprocessing methods; for
demonstration purposes they aren't used in the
previous examples, but make sure to use them in
applications.
Note that we used the mean() and std() functions of
the numpy package to calculate the mean and
standard deviation, i.e. 0 and 1
Label encoding
In many cases our data has labels (words)
as well as features (numbers), but using words
(strings) in processing limits many activities. For
that purpose we need to change those labels into
numeric notations, as in the following
example

In [15]: import pandas

from sklearn import preprocessing
dt = pandas.DataFrame({'Questions':['A','B','C','D','E'],
                       'Answers':['True','True','False','True','False']})
dt
dt

Out[15]:
Questions Answers

0 A True

1 B True

2 C False
3 D True

4 E False

We can use the LabelEncoder class for label encoding

In [17]: import pandas

from sklearn import preprocessing
dt = pandas.DataFrame({'Questions':['A','B','C','D','E'],
                       'Answers':['True','True','False','True','False']})
Enc = preprocessing.LabelEncoder()
Enc.fit(dt['Answers'])
# Encoded labels
dt['Answers'] = Enc.transform(dt['Answers'])
dt

Out[17]:
Questions Answers

0 A 1

1 B 1

2 C 0
3 D 1

4 E 0
As you can see, we had the Questions as A-E
and the Answers as True or False, and we encoded the
Answers label to be 0 (False) or 1 (True).
If we want, we can get the label for a value, i.e.
decode the 0 or 1 values, using the
inverse_transform() function

In [18]: import pandas
from sklearn import preprocessing
dt = pandas.DataFrame({'Questions':['A','B','C','D','E'],
                       'Answers':['True','True','False','True','False']})
Enc = preprocessing.LabelEncoder()
Enc.fit(dt['Answers'])
# Encoded labels
dt['Answers'] = Enc.transform(dt['Answers'])
# Decoding labels
print(Enc.inverse_transform([0,1]))

['False' 'True']

By encoding we replace the text labels with
numerical values, so we can perform a lot of
operations with them. In this data we had only two
label values, i.e. True and False, but when there
are more values the codes will range from 0 to
one less than the number of distinct labels
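
A short sketch with three distinct labels shows how the codes are assigned. The size labels below are invented for illustration.

```python
from sklearn.preprocessing import LabelEncoder

sizes = ["small", "large", "medium", "small", "large"]   # illustrative labels

enc = LabelEncoder()
codes = enc.fit_transform(sizes)        # fit and transform in one step
decoded = enc.inverse_transform(codes)  # back to the original labels

print(list(enc.classes_))
print(codes.tolist())
print(list(decoded))
```

The codes follow the sorted order of the labels, so 'large' gets 0 and 'small' gets 2; the numbers are identifiers and do not express that one label is bigger than another.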
14 DATA VISUALIZATION
• Plotting data
• Univariate plots
• Multivariate plots

Python has excellent libraries for data
visualization. A combination of Pandas, numpy and
matplotlib can help in creating nearly all types
of visualization charts. We can use the visuals to
better understand our data

Plotting data
We use the numpy library to create the numbers to be plotted and the pyplot module of matplotlib to draw the actual chart:

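In [11]: import numpy as npy
import matplotlib.pyplot as plt
x = npy.arange(0,10) # outputs [0 1 2 3 4 5 6 7 8 9]
y = x ** 2 # square of each value (in Python ^ is bitwise XOR, not power)
plt.plot(x,y)

Out[11]: [<matplotlib.lines.Line2D at 0x26c1a1bc100>]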

The arange(0,10) function creates an ndarray of numbers from 0 to 10 (excluding 10) and the plot() function draws a simple line chart of the data we provided. We can plot any imported data in the same way.

Editing labels and colors


As we already know, matplotlib uses MATLAB-style symbols in format strings to customize colors and markers [refer to page 32]. Here we also see how to add labels to the plots:

In [15]: import numpy as npy
import matplotlib.pyplot as plt
x = npy.arange(0,10)
y = x ** 2
# Editing labels
plt.title("Matplotlib")
plt.xlabel("X_Axis")
plt.ylabel("Y_Axis")
# Editing line style and color
plt.plot(x,y,'c')
plt.plot(x,y,'*')

Out[15]: [<matplotlib.lines.Line2D at 0x26c1a312130>]

We added the title Matplotlib to the plot using the title() function, and the X axis and Y axis labels using the xlabel() and ylabel() functions respectively.
The 'c' represents cyan, which is the color of the line, and '*' is the marker symbol for stars.
Note that we didn't pass these values to any named parameter: matplotlib.pyplot.plot() interprets the third positional argument as a format string.
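Color, marker and line style can also be combined in a single format string instead of two plot() calls. The sketch below assumes that approach; the Agg backend line is an added assumption so the code runs without a display, and it can be omitted inside a Jupyter notebook.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend; omit this line in Jupyter
import matplotlib.pyplot as plt
import numpy as npy

x = npy.arange(0, 10)
y = x ** 2

# 'c' = cyan, '*' = star markers, '-' = solid line, all in one format string
line, = plt.plot(x, y, 'c*-')

# The Line2D object returned by plot() records what the format string set
print(line.get_color(), line.get_marker(), line.get_linestyle())
```

Because plot() returns a list of Line2D objects, unpacking with `line, =` gives direct access to the single line drawn here.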
Univariate plots
Univariate plots show each variable in its own plot so that we can understand the variables individually. We will use the [Link] data, which we used in the previous lesson. First of all, let's plot a histogram for each variable using the hist() function of the pandas library:

In [1]: import pandas
dt = pandas.read_csv('[Link]')
dt.hist()

Out[1]: array([[<matplotlib.axes._subplots.AxesSubplot object
<matplotlib.axes._subplots.AxesSubplot object
[<matplotlib.axes._subplots.AxesSubplot object
<matplotlib.axes._subplots.AxesSubplot object
dtype=object)

As you can see, we have histograms plotted for each variable, i.e. Girth, Height, Volume and Index (which can be ignored). By inspecting the visuals we can learn a lot about the data: its distribution, maximum value, minimum value, etc.
We can also visualize an individual variable by calling the hist() function on the sliced column:
In [2]: import pandas
dt = pandas.read_csv('[Link]',
names=['Index','Girth','Height','Volume'])
dt.drop(0,inplace=True)
dt['Height'].hist()

[Figure: histogram of the Height column]

Note that we renamed the columns while importing the csv data using the names parameter of the read_csv() function. When names is given, the file's own header line is read as the first data row, so we need to remove it. To do so we use the drop() function, passing 0 as the index, i.e. the first row, and True for the inplace parameter, which removes the row from the actual data.
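An alternative to dropping the row afterwards is to tell read_csv() that the first line is a header to be replaced. The sketch below assumes that approach, which is not the one used in the book; the csv text is a small illustrative stand-in for the real file, read from memory with io.StringIO.

```python
import io
import pandas

# Illustrative rows standing in for the book's csv file
csv_text = ("idx,girth,height,volume\n"
            "1,8.3,70,10.3\n"
            "2,8.6,65,10.3\n"
            "3,8.8,63,10.2\n")

# header=0 marks the first line as a header that names= replaces,
# so no extra row appears and the columns keep their numeric types
dt = pandas.read_csv(io.StringIO(csv_text), header=0,
                     names=['Index', 'Girth', 'Height', 'Volume'])

print(dt.shape)                # (3, 4)
print(dt['Height'].tolist())   # [70, 65, 63]
```

Keeping the columns numeric matters for plotting: if the old header text ends up in the data, the whole column is read as strings and the histogram treats every value as a category.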
Next are density plots, which are similar to histograms but have smooth lines, like the chart below.

We create density plots using the plot() function of the pandas library and specifying density in the kind parameter:
In [7]: import pandas
dt = pandas.read_csv('[Link]')
del dt['Index']
dt.plot(kind='density')

Out[7]: <matplotlib.axes._subplots.AxesSubplot at 0x261f2dc

If you want separate plots, specify True in the subplots parameter:

In [8]: dt.plot(kind='density',subplots=True)

Out[8]: array([<matplotlib.axes._subplots.AxesSubplot object
<matplotlib.axes._subplots.AxesSubplot object
<matplotlib.axes._subplots.AxesSubplot object
dtype=object)

Another univariate plot, the box and whisker plot, can be used too. This time we specify box in the kind parameter of the plot() function:

In [9]: import pandas
dt = pandas.read_csv('[Link]')
del dt['Index']
dt.plot(kind='box',subplots=True)

Out[9]: "Girth (in)" AxesSubplot(0.125,0.125;0.227941:
"Height (ft)" AxesSubplot(0.398529,0.125;0.227941:
"Volume(ft^3)" AxesSubplot(0.672059,0.125;0.227941:
dtype: object

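The box plots have the following features: the box spans the first to the third quartile, the line inside the box marks the median, the whiskers extend to the farthest points within 1.5 times the interquartile range from the box, and points beyond the whiskers are drawn individually as outliers.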


Multivariate plots
As the name suggests, multivariate plots help us understand the relations between different attributes or variables in the data. One of the multivariate plots is a correlation matrix. We can plot the correlation matrix for our data like so:

In [16]: from matplotlib import pyplot
import pandas
dt = pandas.read_csv('[Link]')
del dt['Index']
cor = dt.corr() # correlations
# Plotting correlation matrix
vis = pyplot.figure()
ax = vis.add_subplot(111)
cax = ax.matshow(cor, vmin=-1, vmax=1)
# Adding the color meter
vis.colorbar(cax)
pyplot.show()

[Figure: correlation matrix with a color bar; both axes are labelled 0, 1, 2]

We created a figure using the figure() function of the matplotlib library and added a subplot to it using the add_subplot() function with 111 as the argument, i.e. its position. Then we used the matshow() function, passing the correlations calculated with the corr() function along with the minimum and maximum values. We also added a color bar to indicate the value each color stands for. Finally we displayed the matrix with the color bar.

We can also label the axes with our column names, using the set_xticks() and set_yticks() functions to set the tick positions and the set_xticklabels() and set_yticklabels() functions to label them accordingly:

In [19]: vis = pyplot.figure()
ax = vis.add_subplot(111)
cax = ax.matshow(cor, vmin=-1, vmax=1)
vis.colorbar(cax)
ax.set_xticks([0,1,2])
ax.set_yticks([0,1,2])
ax.set_xticklabels(['Girth','Height','Volume'])
ax.set_yticklabels(['Girth','Height','Volume'])
pyplot.show()

[Figure: correlation matrix with both axes labelled Girth, Height and Volume, and a color bar]

We can now easily read the relations between the variables: the correlation between Girth and Volume is close to 1 (a strong positive correlation), while Height and Volume are somewhat less correlated than Girth and Volume.
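To see where a value such as "close to 1" comes from, here is a minimal sketch of the Pearson correlation coefficient, which is the default method of the pandas corr() function. The pearson() helper and the sample values are hypothetical and only show the two extremes.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # How the two variables move together around their means
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Hypothetical values, chosen only to show the two extremes
a = [1, 2, 3, 4, 5]
print(pearson(a, [2, 4, 6, 8, 10]))   # 1.0, rising together
print(pearson(a, [10, 8, 6, 4, 2]))   # -1.0, moving in opposite directions
```

Values near 0 would mean that there is no linear relation between the two variables.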
We can also view scatter matrix plots to understand the relations between variables or attributes using dot graphs. We can do so using the pandas scatter_matrix() function. We access the function as pandas.plotting.scatter_matrix() and pass the data inside the parentheses:

In [1]: from matplotlib import pyplot
import pandas
dt = pandas.read_csv('[Link]')
del dt['Index']
pandas.plotting.scatter_matrix(dt)

Out[1]: array([[<matplotlib.axes._subplots.AxesSubplot object
<matplotlib.axes._subplots.AxesSubplot object
<matplotlib.axes._subplots.AxesSubplot object
[<matplotlib.axes._subplots.AxesSubplot object
<matplotlib.axes._subplots.AxesSubplot object
<matplotlib.axes._subplots.AxesSubplot object
[<matplotlib.axes._subplots.AxesSubplot object
<matplotlib.axes._subplots.AxesSubplot object
<matplotlib.axes._subplots.AxesSubplot object
dtype=object)

[Figure: scatter matrix of "Girth (in)", "Height (ft)" and "Volume(ft^3)", followed by an enlarged view captioned Girth & Volume]
15 REGRESSION ALGORITHM
• What is Regression
• Regressor model
• Linear regression

What is regression?
As we already know, regression is a case of supervised machine learning in which we feed numeric input data to the algorithm, which learns the patterns in the data and predicts the output data. The performance of the algorithm depends entirely on what it learns, so we need to do our best to feed good data to our model. So let's create a regressor with scikit-learn's ready-made algorithms.

Regressor Model
Let's create a simple regressor model to predict the weight of a person when the height is given as the input value. You can download the file from here: [Link]
Then jump to your Jupyter notebook and import the following modules to start the model:

In [1]: import pandas
from sklearn import linear_model
import sklearn.metrics
import matplotlib.pyplot

We need the pandas library to import our csv data, linear_model to create the regressor model, sklearn.metrics to evaluate the accuracy of our model and matplotlib to visualize our data in multiple steps.
Now we can import the csv data and have a look at it before training the model:

In [2]: import pandas
from sklearn import linear_model
import sklearn.metrics
import matplotlib.pyplot
dt = pandas.read_csv('height_and_weight.csv')
dt.head()

Out[2]:
   Index  Height(In)  Weight(lbs)
0  1      65.78       112.99
1  2      71.52       136.49
2  3      69.40       153.03
3  4      68.22       142.34
4  5      67.79       144.30

We imported the csv data and inspected the first five rows using the head() function (if you don't pass any value, it shows the first 5 rows by default). The data frame already has its own index, so we can remove the Index column. We can check the distribution of the data by plotting a histogram using the pandas hist() function:

In [3]: import pandas
from sklearn import linear_model
import sklearn.metrics
import matplotlib.pyplot
dt = pandas.read_csv('height_and_weight.csv')
del dt['Index']
dt.hist()

[Figure: histograms of Height(In) and Weight(lbs)]

The data is approximately normally distributed, so we don't need to do any preprocessing and can move on to training the model.

But first we need to separate the values into input data, i.e. X, and output data, i.e. y. For our input data we will use the height data and for the output we will use the weight data. So, moving on to a new cell (hotkey: b), let's separate the data into the inp_X and out_y variables:

In [4]: inp_X = dt['Height(In)']


out_y = dt['Weight(lbs)']

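Now we can create our regression model RegModel using the LinearRegression class, then call its fit() method and pass the input values inp_X and the output values out_y as arguments. Here inp_X and out_y are built with drop() so that both are two-dimensional data frames: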

In [5]: inp_X = dt.drop(columns=['Weight(lbs)'])
out_y = dt.drop(columns=['Height(In)'])
RegModel = linear_model.LinearRegression()
RegModel.fit(inp_X,out_y)

Out[5]: LinearRegression()

And our model is ready! Now we can ask the model to predict the weight of a person whose height is 60 inches. We will use the predict() method to do so:

In [6]: RegModel.predict([[60]])

Out[6]: array([[99.93286131]])

So, as you can see, our model predicted that a person with a height of 60 inches will weigh approximately 99.93 pounds. We are in a new cell and used the predict() function to predict the value. We passed the value as [[60]] because it is treated as numpy.array([[60]]): the training input was a 2-D array (rows of samples, columns of features), so predictions need input of the same shape, and if you pass 60 or [60] you'll encounter an error. And if you are wondering how the model made the prediction, just review the first lesson.
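The shape requirement can be checked on a tiny model. The sketch below uses hypothetical heights and weights rather than the book's dataset, and assumes only numpy and scikit-learn's LinearRegression.

```python
import numpy as npy
from sklearn import linear_model

# Hypothetical heights (inches) and weights (pounds), not the book's dataset
heights = npy.array([[60], [64], [68], [72]])   # shape (4, 1): 4 samples, 1 feature
weights = npy.array([100, 120, 140, 160])

model = linear_model.LinearRegression().fit(heights, weights)

print(npy.array([[66]]).shape)   # (1, 1): one sample with one feature
print(model.predict([[66]]))     # a single prediction, close to 130

try:
    model.predict([66])          # 1-D input is rejected
except ValueError:
    print("1-D input raised ValueError")
```

Each inner list is one sample, so several predictions can be requested at once by passing something like [[60], [66], [70]].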

But we don't know whether the predicted value is correct, so we can't calculate the accuracy of the model either. To do that we need to divide our dataset into a training set, say 90% of the data, to train the model, and a testing set of 10%, whose input values will be given to the model to make predictions. We will then compare the predictions with the real values.
In the cell where we trained the model, we will divide the input values, i.e. Height, into 90% in inp_X and 10% in tst_X, and do the same with the output values:

In [1]: import pandas
from sklearn import linear_model
from sklearn.model_selection import train_test_split
import sklearn.metrics
import matplotlib.pyplot
dt = pandas.read_csv('height_and_weight.csv')
del dt['Index']

In [2]: height = dt.drop(columns=['Weight(lbs)'])
weight = dt.drop(columns=['Height(In)'])
inp_X, tst_X, out_y, tst_y = train_test_split(
height, weight, test_size=0.1)
RegModel = linear_model.LinearRegression()
RegModel.fit(inp_X,out_y)

Out[2]: LinearRegression()

To ease the data splitting task we used the train_test_split() function from sklearn.model_selection. That's why we imported the function in the first cell, where we imported the data and the required libraries and modules. In the second cell we divided the data into input and output, i.e. height and weight. Then we created four variables, inp_X, tst_X, out_y and tst_y, to store the input for training, the input for testing, the output for training and the output for comparing with the predictions respectively, using the train_test_split() function. We passed the input data height and the output data weight as arguments and 0.1 (10%) to the test_size parameter. The splitting and assignment of the data can be understood from the figure below:

[Figure: train_test_split(height, weight, test_size=0.1) divides height and weight each into a 90% part and a 10% part]
After that, as we did earlier, we create our model and train it.
Now we can ask our RegModel to predict the weights for the tst_X values and compare them with the tst_y values:

In [3]: pred_y = RegModel.predict(tst_X)
cmp = pandas.DataFrame({'Predictions':pred_y.flatten(),
'Actual values':tst_y['Weight(lbs)'].values})
cmp.plot(kind='bar',figsize=(7.5,6))

Out[3]: <matplotlib.axes._subplots.AxesSubplot at 0x2038d756c10>



We used the predict() method, passing the test height values tst_X as the argument, and stored the predictions in pred_y. Then we created a data frame from the pred_y values (we flattened the array from 2-D to 1-D using the flatten() function before passing it as values) and the actual tst_y values (we sliced the Weight(lbs) column and extracted its values), and plotted a grouped bar graph using the plot() function with bar as the kind parameter. We also specified the size of the plot using the figsize parameter.
As you can observe, the predicted values are blue and the actual values are orange, so we can judge the performance of our model visually. Most pairs of bars are close to each other, but some bars are much taller or shorter than their partner, i.e. our model is performing well but in some cases its predictions are off. We can also visualize the regression line and the data points:

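In [4]: matplotlib.pyplot.scatter(tst_X, tst_y, color='black')
matplotlib.pyplot.plot(tst_X, pred_y, color='yellow')
matplotlib.pyplot.show()

[Figure: scatter plot of the test data points with the regression line]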

We used the scatter() function to plot the data points and the plot() function to plot the regression line.

To know how much error there is in the predictions, we can use the MAE (Mean Absolute Error), computed by the mean_absolute_error() function in sklearn.metrics. It returns the average of the absolute differences between the predicted and actual values:

In [6]: sklearn.metrics.mean_absolute_error(tst_y, pred_y)

Out[6]: 11.529538265252379

We passed our actual values tst_y and predicted values pred_y as arguments to the function. So, on average, the predictions of our current model differ from the actual weights by about 11.5 pounds.
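The same quantity can be worked out by hand, which shows what the function averages. The numbers below are hypothetical weights in pounds, not values from the book's dataset.

```python
# Hypothetical actual and predicted weights, in pounds
actual = [112.0, 136.0, 153.0, 142.0]
predicted = [120.0, 130.0, 140.0, 145.0]

# Absolute difference per sample; the sign of each error is discarded
errors = [abs(a - p) for a, p in zip(actual, predicted)]
mae = sum(errors) / len(errors)

print(errors)   # [8.0, 6.0, 13.0, 3.0]
print(mae)      # 7.5
```

Because the sign is discarded, an overestimate of 8 pounds and an underestimate of 8 pounds count equally and do not cancel each other out.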

Linear Regression
As learnt earlier, regression algorithms find the relationship between variables and predict values based on that relationship. In linear regression, the algorithm finds the linear relationship between two variables, i.e. how the dependent variable is affected when an independent variable is changed. For example, consider the following dataset: we have data on different squares, with the length of one side and the area of each. It is clear that if the side (independent variable) is changed, say increased, the area (dependent variable) will change (increase) too, because the area is calculated from the length of the side.

[Table: Side and Area of each square]

Mathematically the relation can be expressed as:

Y = mX + b

where Y is the dependent variable (area), X is the independent variable (side) we are using to make predictions, m is the slope of the regression line, which represents the effect X has on Y, and b is a constant known as the Y-intercept: if X = 0, Y would be equal to b.
The relationship can be either positive or negative.
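As a rough illustration of how m and b can be obtained from data, the sketch below uses the standard least-squares formulas for a straight line. The fit_line() helper and the sample points are hypothetical; this is not a description of scikit-learn's internal solver.

```python
def fit_line(xs, ys):
    """Least-squares slope m and intercept b for Y = mX + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: how much Y changes, on average, per unit change in X
    m = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
    # Intercept: makes the line pass through the point of means
    b = mean_y - m * mean_x
    return m, b

# Hypothetical points lying exactly on Y = 2X + 1
xs = [0, 1, 2, 3, 4]
ys = [1, 3, 5, 7, 9]

m, b = fit_line(xs, ys)
print(m, b)         # 2.0 1.0
print(m * 10 + b)   # 21.0, the prediction for X = 10
```

A positive m means Y rises as X rises; a negative m would mean Y falls as X rises.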

Simple Linear Regression


Simple linear regression, or SLR, is the type of regression in which we predict a value using only one feature, as we did before when we predicted the weight of a person using the height.

[Figure: regression line drawn through the height and weight data points]

Multiple Linear Regression


Multiple linear regression, or MLR, is the type of regression in which we predict a value using two or more features, like predicting the weight of a person using the age and height values. Here, the regression line is calculated using the following formula:

$$h(x_i) = b_0 + b_1 x_{i1} + \dots + b_p x_{ip}$$

where p is the number of features, $h(x_i)$ is the predicted value for the i-th observation and $b_0, b_1, \dots, b_p$ are the regression coefficients. The actual value differs from the prediction by a residual error $e_i$, which gives rise to the following formula:

$$y_i = h(x_i) + e_i$$

where $y_i$ is the actual value of the dependent variable for the i-th observation.
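The formula can be checked against a fitted model: the intercept_ attribute of scikit-learn's LinearRegression plays the role of b_0 and coef_ holds b_1 to b_p. The data below is hypothetical and generated without noise from known coefficients, so the fit should recover them.

```python
import numpy as npy
from sklearn import linear_model

# Hypothetical data generated from y = 1 + 2*x1 + 3*x2 (no noise)
X = npy.array([[1, 1], [2, 1], [1, 2], [3, 2], [2, 3]])
y = 1 + 2 * X[:, 0] + 3 * X[:, 1]

model = linear_model.LinearRegression().fit(X, y)
b0, (b1, b2) = model.intercept_, model.coef_

# h(x_i) computed by hand from the fitted coefficients
x_new = npy.array([4, 5])
by_hand = b0 + b1 * x_new[0] + b2 * x_new[1]

print(round(b0, 6), npy.round(model.coef_, 6))
print(by_hand, model.predict([x_new])[0])   # both are close to 24
```

With real data the residual errors are not zero, so the fitted coefficients only approximate the underlying relation.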
The training process for an MLR model is the same as before, but this time we have more features from which predictions will be made. Let's create a model to predict the salary (annual income) of a person if we get the age and gender as input. You can download the dataset from here: [Link] (check the Resources).

We need to import the same things as we did before, this time including train_test_split from the start, and also import the data set and inspect the first five rows using the head() function:

In [1]: import pandas
from sklearn import linear_model, metrics
from sklearn.model_selection import train_test_split
from matplotlib import pyplot
dt = pandas.read_csv('[Link]')
dt.head()

   CustomerID  Genre   Age  Annual Income (k$)  Spending Score (1-100)
0  1           Male    19   15                  39
1  2           Male    21   15                  81
2  3           Female  20   16                  6
3  4           Female  23   16                  77
4  5           Female  31   17                  40

Our data has 200 rows and 5 columns. As you can see, Male and Female are not numeric values, so this is a good time to practice our preprocessing skills: we will encode them into numeric values. If you don't remember how to do so, go back to the Data analysis and processing lesson.
In [2]: import pandas
from sklearn import linear_model, metrics
from sklearn.model_selection import train_test_split
from matplotlib import pyplot
from sklearn import preprocessing
dt = pandas.read_csv('[Link]')
Enc = preprocessing.LabelEncoder()
Enc.fit(['Male','Female'])
dt['Genre'] = Enc.transform(dt['Genre'])
dt.head()

   CustomerID  Genre  Age  Annual Income (k$)  Spending Score (1-100)
0  1           1      19   15                  39
1  2           1      21   15                  81
2  3           0      20   16                  6
3  4           0      23   16                  77
4  5           0      31   17                  40

We have encoded our Male and Female labels into 1 and 0 respectively. Now we can move on to a new cell and split our data:

In [3]: Input = dt.drop(columns=['CustomerID',
'Annual Income (k$)',
'Spending Score (1-100)'])
Output = dt.drop(columns=['CustomerID',
'Genre','Age',
'Spending Score (1-100)'])
inp_X, tst_X, out_y, tst_y = train_test_split(
Input, Output, test_size=0.1)

The input values are stored in Input and the output values in Output. Then we split the data for training and testing using the train_test_split() function. Now we can create our model and train it:

In [4]: Input = dt.drop(columns=['CustomerID',
'Annual Income (k$)',
'Spending Score (1-100)'])
Output = dt.drop(columns=['CustomerID',
'Genre','Age',
'Spending Score (1-100)'])
inp_X, tst_X, out_y, tst_y = train_test_split(
Input, Output, test_size=0.1)
# Training model
RegModel = linear_model.LinearRegression()
RegModel.fit(inp_X,out_y)

Out[4]: LinearRegression()

Our RegModel is ready to predict the annual income of people if the gender and age are passed as input. But we don't know the range of ages here, so we can execute the following code to find out:

In [31]: dt['Age'].max() - dt['Age'].min()

Out[31]: 52

The range of the ages in the dataset is 52.

Before comparing the predictions with the actual values, we can ask the model to predict some values, like how much a 30-year-old female earns and how much a 42-year-old male earns:

In [5]: RegModel.predict([ [0,30], [1,42] ])

Out[5]: array([[60.5504497 ],
[62.09356734]])

Note the way we passed the values. As Female and Male are encoded as 0 and 1, we pass [ [0,30], [1,42] ], and our model tells us that a 30-year-old female earns about 60.5k and a 42-year-old male earns about 62k. Now let's predict the test input and compare the results with the actual ones:
In [6]: pred_y = RegModel.predict(tst_X)
cmp = pandas.DataFrame({'Predicted':pred_y.flatten(),
'Actual':tst_y['Annual Income (k$)'].values})
cmp.plot(kind='bar',figsize=(7.5,6))

Out[6]: <matplotlib.axes._subplots.AxesSubplot at 0x1402ccc89a0>

As you can observe, most of the predicted values are far from the actual ones because of the distribution of the data, so what do we do now?

There are a lot of ways to improve the performance of a model. We can increase the amount of data: in our case we have only 200 rows, and to predict values precisely we would need at least 10 times the data we have now, because the distribution isn't normal in this case:

In [7]: cmp['Predicted'].plot(kind='density')
pyplot.show()
cmp['Actual'].plot(kind='density',color='orange')
pyplot.show()

[Figure: density plots of the Predicted and Actual values]

We will learn about more methods to improve the performance of our models in detail in the performance and metrics lesson.
16 CLASSIFICATION ALGORITHM
• Decision Tree
• Logistic regression
• Naive Bayes

We already know what a classification algorithm is, but there are two types of classification algorithms: lazy learners and eager learners. Lazy learners, like KNN algorithms, do little work during training and most of it when predicting, whereas eager learners, like decision trees and Naive Bayes, do most of the work during training and less when predicting. Now let's create a classifier using a decision tree.

Decision tree
Let's create a classifier that predicts which flavour a customer likes if the age and gender are provided as input. To do so we will use the decision tree, an algorithm that can perform both classification and regression tasks. It learns the categories in a dataset by building a tree of decisions and then uses the tree to predict the category of an input.
We will use scikit-learn's DecisionTreeClassifier to create our model. You can download the data set from here: [Link]
Now jump to your Jupyter notebook and import the packages and modules needed in this project (importing train_test_split is optional):

In [1]: import pandas
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

Now we can import our dataset and preview it using the head() function:

In [2]: import pandas
from sklearn.tree import DecisionTreeClassifier
dt = pandas.read_csv('[Link]')
dt.head()

   Age  Gender  Flavour
0  6    Male    Chocolate
1  6    Female  Strawberry
2  7    Male    Chocolate
3  8    Female  Strawberry
4  11   Male    Butterscotch

As you can see we need to encode the Gender


labels into numeric values. So let's import the
preprocessing module and encode the labels

In [3]: import pandas
from sklearn.tree import DecisionTreeClassifier
from sklearn import preprocessing
dt = pandas.read_csv('[Link]')
Enc = preprocessing.LabelEncoder().fit(['Male',
'Female'])
dt['Gender'] = Enc.transform(dt['Gender'])
dt.head()

Out[3]:
   Age  Gender  Flavour
0  6    1       Chocolate
1  6    0       Strawberry
2  7    1       Chocolate
3  8    0       Strawberry
4  11   1       Butterscotch

So the Male and Female labels are encoded into 1 and 0 respectively using the LabelEncoder. Note that we fitted Male and Female during initialization.

Now we can move on to a new cell and split our data into input and output. Then we will create our classifier CModel using DecisionTreeClassifier and train it using the input and output values:

In [4]: inp_X = dt.drop(columns='Flavour')
out_y = dt.drop(columns=['Age','Gender'])
# Classifier
CModel = DecisionTreeClassifier()
CModel.fit(inp_X,out_y)

Out[4]: DecisionTreeClassifier()

So we trained our model using the fit() function and it is ready to make predictions. Let's ask our model which flavour a 7-year-old boy and a 9-year-old girl will prefer:

In [5]: CModel.predict([ [7,1], [9,0] ])

Out[5]: array(['Chocolate', 'Strawberry'], dtype=object)

According to our model, a 7-year-old boy [7,1] will prefer chocolate and a 9-year-old girl [9,0] will prefer strawberry, which is pretty much correct. We trained our model with only 20 rows of data and are still getting some decent results. But what if we ask the model to predict what a 30-year-old woman will prefer, i.e. an age beyond the range we provided in the data?

In [8]: CModel.predict([[30,0]])

Out[8]: array(['Coffee'], dtype=object)

According to our model, a 30-year-old woman [30,0] will prefer coffee, which may not be correct because the maximum age for women in the dataset is 24. So how did the model predict that value? Let's find out how our model performs its classifications.

The decision tree, i.e. the algorithm the model uses to classify the values, can be viewed using the code below:

In [1]: import graphviz
from sklearn.tree import export_graphviz
# out_file=None makes export_graphviz return the DOT source as a string
data = export_graphviz(CModel,
out_file=None, feature_names=['Age', 'Gender'],
class_names=['Chocolate', 'Strawberry',
'Butterscotch', 'Vanilla', 'Mango',
'Almond_Choco', 'Coffee'],
filled=True, rounded=True, special_characters=True)
graph = graphviz.Source(data)
graph

[Figure: the full decision tree. The root node reads Age ≤ 20.5, gini = 0.825, samples = 20, value = [3, 6, 2, 3, 2, 2, 2], class = Strawberry; its children are Age ≤ 9.0 (gini = 0.781, samples = 16, value = [2, 6, 2, 0, 2, 2, 2]) and Age ≤ 23.0 (gini = 0.375, samples = 4)]

Before visualizing the decision tree you need to install graphviz using the command below in your Anaconda prompt:
conda install -c conda-forge python-graphviz
Then import export_graphviz and graphviz. Using export_graphviz we export the decision tree in DOT format and store it in data, and using the Source class of graphviz we view the tree.
But as you can see, the tree is too big to fit on the page, so let's understand it by breaking it down.
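If installing graphviz is not an option, scikit-learn also provides export_text(), which prints the same rules as indented text. This is an alternative to the approach used in the book; the rows below are hypothetical [Age, Gender] pairs, not the book's csv data, and random_state is an added assumption for repeatability.

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical rows of [Age, Gender] with a flavour each, not the book's csv
X = [[6, 1], [6, 0], [7, 1], [8, 0], [11, 1], [12, 0]]
y = ['Chocolate', 'Strawberry', 'Chocolate',
     'Strawberry', 'Chocolate', 'Strawberry']

# random_state is an added assumption so that the tree is repeatable
model = DecisionTreeClassifier(random_state=0).fit(X, y)

# export_text returns the learned rules as plain text, one line per node
rules = export_text(model, feature_names=['Age', 'Gender'])
print(rules)
```

Each level of indentation in the printed text corresponds to one comparison on the way down from the root.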

Let's see what will happen if [7,1] (a 7-year-old boy) is the input.

Starting from the root, the first comparison is whether the age is less than or equal to 20.5 (Age ≤ 20.5). The age of the input is 7, so we move on to the True side.

Next the algorithm checks whether the age is less than or equal to 9. Our age is smaller, so we move down (straight to green).

Now we check whether the gender is less than or equal to 0.5, i.e. whether it is 0 or not. Here our gender is 1, so we move to the non-green side.

Once again we compare the age, this time checking whether it is less than or equal to 18. 7 is, so we finally stop at the orange box (orange = True, blue = False) and choose its class, i.e. Chocolate.

Starting from the root, we reach the conclusion that the input has the class Chocolate, i.e. a 7-year-old boy likes chocolate.
But you may ask what the other attributes present there are, like gini, samples, etc. Gini is the name of the cost function used to evaluate the binary splits in the dataset; it works with a categorical target variable such as "Success" or "Failure". A perfect Gini index value is 0 and, for two classes, the worst is 0.5; the value is used to decide whether to split further or not. You can see that nodes with a gini score of 0 aren't split further. samples is the number of data points in the dataset that reach the node, i.e. that have the respective characteristics.

Here is the full decision path for the classification of the 7-year-old boy's preference. If you want to look at the whole tree in better quality, execute the code.

[Figure: the part of the decision tree used for this prediction, from the root node (Age ≤ 20.5, gini = 0.825, samples = 20, value = [3, 6, 2, 3, 2, 2, 2]) down to the leaf node (gini = 0.0, samples = 2, value = [2, 0, 0, 0, 0, 0, 0], class = Chocolate)]

Likewise, we can use decision trees to solve different kinds of classification problems.
But you may also ask how the tree creates those comparisons, or splits. It isn't strictly necessary to know, but it helps. First the algorithm calculates the Gini index for each attribute using the formula below:
$$p^2 + q^2$$
which is the sum of the squares of the probabilities of success (p) and failure (q). The gini value shown in each node of the tree is one minus this sum, which is why a pure node has gini = 0. Then the dataset is split into two lists of rows, given the index of an attribute and a split value for that attribute. Finally the algorithm finds the best possible split by evaluating the cost (gini) of each split.
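The gini values printed in the tree can be reproduced from the value lists of the nodes. The sketch below assumes the usual definition of Gini impurity, one minus the sum of squared class proportions, extended from two classes to any number of classes; the gini() helper is hypothetical, while the count lists are the ones shown in the tree figure.

```python
def gini(counts):
    """Gini impurity of a node from its per-class sample counts."""
    total = sum(counts)
    # One minus the sum of squared class proportions
    return 1 - sum((c / total) ** 2 for c in counts)

# value lists as reported in the nodes of the tree figure
print(round(gini([3, 6, 2, 3, 2, 2, 2]), 3))   # root node: 0.825
print(round(gini([2, 6, 2, 0, 2, 2, 2]), 3))   # Age <= 9.0 node: 0.781
print(round(gini([2, 0, 0, 0, 0, 0, 0]), 3))   # pure leaf: 0.0
```

With seven classes the impurity can exceed 0.5, which is why the root shows 0.825; the 0.5 upper bound applies only when there are two classes.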

Logistic Regression
Logistic regression is a type of model that predicts the outcome as Yes or No, represented by the numeric values 1 or 0 respectively. We can use this type of model to classify a day as rainy or not, a person as healthy or sick, etc. There are different types of logistic regression for different situations.

Binomial Logistic Regression


Binomial, or binary, logistic regression is used to predict exactly two outcomes, i.e. either 1 (positive) or 0 (negative).
Let's use a dataset to predict whether it will rain or not if the temperature and humidity percentage are provided as input. You can download the data set from here: [Link] (check the Resources).

Let's import the modules and the dataset together. This time we will import linear_model and train_test_split from sklearn:

In [1]: import pandas
from sklearn import linear_model
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

We also imported the accuracy_score() function from sklearn.metrics to calculate the accuracy of our model.
Now we can import our dataset, and this time let's view it as it is:

In [2]: import pandas
from sklearn import linear_model
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
dt = pandas.read_csv('Rainfall_data.csv')
dt

Out[2]:
       Unnamed: 0  Temperature  Humidity%  Rain
0      0           34           74.2       Yes
1      1           19           68.2       No
2      2           28           67.2       Yes
3      3           29           66.6       Yes
4      4           26           57.9       Yes
...
19995  19995       30           77.9       Yes
19996  19996       20           74.8       Yes
19997  19997       14           69.4       No
19998  19998       20           60.6       No
19999  19999       22           64.8       No

20000 rows x 4 columns

As you can see we have 20000 rows and 4 columns
worth of data!
Now we can move on to a new cell and split the
data into training and testing inputs and
training and testing outputs

In [3]: Input = dt.drop(columns=['Unnamed: 0','Rain'])
        Output = dt['Rain']
        inp_X, tst_X, out_y, tst_y = train_test_split(
            Input.values,Output,test_size=0.01)

We stored the input features, i.e. Temperature and
Humidity%, in Input and the output, i.e. Rain (Yes
or No), in Output. Then we passed these values to
the train_test_split() function and split the data
into training input, testing input, training
output and testing output, where the test size is
0.01 (1%, i.e. 200 rows)

Now we can create our logistic regression CModel
and train it

In [4]: Input = dt.drop(columns=['Unnamed: 0','Rain'])
        Output = dt['Rain']
        inp_X, tst_X, out_y, tst_y = train_test_split(
            Input.values,Output,test_size=0.01)
        CModel = linear_model.LogisticRegression()
        CModel.fit(inp_X,out_y)

Out[4]: LogisticRegression()

So our model is ready to make predictions. Let's
move on to a new cell and let the model predict.
Then we will compare the values and print the
accuracy score

In [5]: from sklearn import preprocessing
        pred_y = CModel.predict(tst_X)
        Enc = preprocessing.LabelEncoder().fit(['Yes','No'])
        cmp = pandas.DataFrame({'Predicted':Enc.transform(pred_y),
                                'Actual':Enc.transform(tst_y)})
        print('Accuracy Score:',accuracy_score(tst_y,pred_y))
        cmp.plot(kind='density')

Accuracy Score: 0.91

Out[5]: <matplotlib.axes._subplots.AxesSubplot at 0x1fb7a395d30>

So the model has an accuracy score of 0.91, i.e.
91%, which is really good! In other words only 9%
of the test values are predicted wrong by the
model, and the density plot lets you compare the
predicted values with the actual ones

So how did our model predict the values, or how
does logistic regression work? To understand that
we will look at the mathematics behind the
algorithm. If you want you can move ahead, or give
it a read. The following are the steps of the
linear function of binomial logistic regression:
• We already know that the output will be either
0 (No) or 1 (Yes). For that, the linear function is
used as an input to another function g, as in the
following relation:

$$h_\theta(x) = g(\theta^T x), \qquad 0 \le h_\theta(x) \le 1$$

g is the logistic or sigmoid function, which is
given by the following formula:

$$g(z) = \frac{1}{1 + e^{-z}}$$

where z is $\theta^T x$
• The sigmoid curve is S-shaped: it rises from 0 to
1 and crosses 0.5 at z = 0. The classes can be
divided into positive or negative. The output of
the hypothesis function always lies between 0 and
1 and is read as the probability of the positive
class. For our implementation, we interpret the
output as positive if it is greater than or equal
to 0.5 (>= 0.5), otherwise negative
• We also need to define a loss function to
measure how well the algorithm performs with the
current weights, represented by $\theta$, where h
is equal to $g(X\theta)$:

$$J(\theta) = \frac{1}{m}\sum_{i=1}^{m}\left[-y_i \log(h_i) - (1 - y_i)\log(1 - h_i)\right]$$

where m is the number of training rows. After
defining the loss function our prime goal is to
minimize it
• This can be done by fitting the weights, which
means increasing or decreasing them. With the help
of the derivatives of the loss function with
respect to each weight, we know which parameters
should get a higher weight and which a smaller
one. The following gradient descent equation
tells us how the loss would change if we modified
the parameters:

$$\frac{\partial J(\theta)}{\partial \theta_j} = \frac{1}{m} X^T \left(g(X\theta) - y\right)$$
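The steps above can be sketched in a few lines of numpy. This is only an illustration under stated assumptions: the toy data, the learning rate of 0.1 and the 2000 iterations are invented for the example, the loss is taken to be the usual mean cross-entropy loss, and scikit-learn's LogisticRegression uses a more elaborate solver than this plain gradient descent loop.

```python
import numpy as np

def sigmoid(z):
    """Logistic function g(z) = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def loss(X, y, theta):
    """Mean cross-entropy loss with h = g(X theta)."""
    h = sigmoid(X @ theta)
    return float(np.mean(-y * np.log(h) - (1 - y) * np.log(1 - h)))

def gradient(X, y, theta):
    """Gradient of the loss: (1/m) * X^T (g(X theta) - y)."""
    m = len(y)
    return X.T @ (sigmoid(X @ theta) - y) / m

# Toy data: a bias column of ones plus one feature
X = np.array([[1.0, 0.5], [1.0, 1.5], [1.0, 2.5], [1.0, 3.5]])
y = np.array([0.0, 0.0, 1.0, 1.0])

theta = np.zeros(2)
start_loss = loss(X, y, theta)
for _ in range(2000):
    theta -= 0.1 * gradient(X, y, theta)   # step against the gradient
end_loss = loss(X, y, theta)

pred = (sigmoid(X @ theta) >= 0.5).astype(int)   # threshold at 0.5
print(start_loss, end_loss, pred)
```

The loss shrinks as the weights are adjusted, and thresholding the sigmoid output at 0.5 turns the probabilities into the 0/1 classes.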

Multinomial Logistic Regression


As the name suggests, this time we will have to
predict more than two possible outcomes. In
multinomial logistic regression we classify into
three or more categories. The categories can be
nominal, i.e. just different types like Rain,
Hailstorm, Snow, etc., or ordinal, like heavy rain,
moderate rain or low rainfall.
Let's consider the previous situation where we
predicted whether it will rain or not, and create
a model to predict whether the rain will be heavy,
moderate or low. You can download the dataset
(check the Resources) and import the modules as we
did while creating the model to predict the
rainfall data

In [1]: import pandas


from sklearn import linear_model, metrics
from sklearn.model_selection import train_test_split

Now we can import our data and preview it without


the head() function

In [2]: import pandas
        from sklearn import linear_model, metrics
        from sklearn.model_selection import train_test_split
        dt = pandas.read_csv('[Link]')
        dt

       Temperature  Humidity%  Rainfall
0               34       74.2       Low
1               19       68.2   No Rain
2               28       67.2  Moderate
3               29       66.6  Moderate
4               26       57.9       Low
...            ...        ...       ...
17996           31       89.7   No Rain
17997           21       84.7   No Rain
17998           28       74.7   No Rain
17999           30       78.2   No Rain
18000           34       80.4       Low

18001 rows x 3 columns

We have the same Temperature and Humidity%
columns, but the rain is classified into No Rain,
Low, Moderate and High. Now we can move on to the
next step, i.e. splitting the data

In [3]: Input = dt.drop(columns='Rainfall')
        Output = dt.drop(columns=['Temperature','Humidity%'])
        inp_X,tst_X,out_y,tst_y = train_test_split(
            Input,Output,test_size=0.1)

Next we scale our Input data. This step is
optional, but without it we may encounter an
error or a warning while training. We will import
the preprocessing module and scale our input data.
Then we can split our data into training and
testing sets again, create our model and train it

In [4]: from sklearn import preprocessing
        Input = preprocessing.scale(dt.drop(columns='Rainfall').values)
        Output = dt['Rainfall']
        inp_X,tst_X,out_y,tst_y = train_test_split(
            Input,Output,test_size=0.2)
        CModel = linear_model.LogisticRegression()
        CModel.fit(inp_X,out_y)

Out[4]: LogisticRegression()

Our model is trained. Now we can test our model's
predictions against the actual values. To visualize
it we need to use the LabelEncoder and encode the
Rainfall labels into numeric values. We will also
print the accuracy of our model

In [5]: pred_y = CModel.predict(tst_X)
        Enc = preprocessing.LabelEncoder().fit(['No Rain',
            'Low','Moderate','High'])
        cmp = pandas.DataFrame({'Predicted':Enc.transform(
            pred_y),'Actual':Enc.transform(tst_y)})
        acc = metrics.accuracy_score(tst_y,pred_y)
        print('Accuracy:',acc)
        cmp.plot(kind='density')

Accuracy: 0.435156900860872

Out[5]: <matplotlib.axes._subplots.AxesSubplot at 0x239fe078.

[Density plot of the encoded labels; legend: Predicted, Actual]

So our model didn't perform well. Here's a
question for you (answer it without reading
further): why is the accuracy of our model below
50%?

So what do you think? Is it because we scaled the
data? If so, you are wrong. The scaling wasn't
necessary, but it was included to lure you into
thinking that it was the reason. It isn't; in fact
scaling is good practice. The actual reason is the
distribution of the data. If you observe the
plotted graph of the predicted and actual values,
you'll notice that the two distributions are
really different. Our dataset has a spike, i.e. it
has very little 'high rainfall' data, concentrated
in one place, and a lot of no or low rainfall
data. We provided our model with 18000 rows worth
of data, but the distribution wasn't good, i.e.
the model didn't get enough data for moderate or
high rainfall. The places where the lines are
together or overlapping are the predictions our
model got right, i.e. mostly no rainfall and low
rainfall. Our model didn't predict moderate or
high rainfall for any value at all.
Encountering errors like this helps us to counter
problems in actual situations. When this happens
we need to come up with different methods to
improve our model (that will be covered in the
Performance and metrics lesson) or change the
algorithm, so let's look at another
classification algorithm
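A quick way to spot this kind of problem before training is to count the labels. The sketch below uses made-up label counts (not the real rainfall dataset) and also computes the accuracy of a "model" that always predicts the most common label, which is a useful baseline to compare any real accuracy score against.

```python
from collections import Counter

# Hypothetical label counts, chosen only to illustrate an imbalanced target
labels = ['No Rain'] * 70 + ['Low'] * 20 + ['Moderate'] * 8 + ['High'] * 2

counts = Counter(labels)
majority_label, majority_count = counts.most_common(1)[0]

# Accuracy of always predicting the most common label
baseline_accuracy = majority_count / len(labels)

for label, n in counts.most_common():
    print(f'{label:<10}{n:>5}{n / len(labels):>8.1%}')
print('Majority class:', majority_label)
print('Baseline accuracy:', baseline_accuracy)
```

With a pandas DataFrame the same counts are available through the value_counts() method of the label column.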

Naive Bayes
The Naive Bayes algorithm is based on Bayes'
theorem, which we already learned in previous
lessons. We have three types of Naive Bayes
algorithms:
• Gaussian, used when the data for each label is
assumed to be drawn from a Gaussian distribution
• Multinomial, used when the data for each label is
assumed to be drawn from a multinomial distribution
• Bernoulli, used when the features are binary,
like 0 or 1
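To make the Gaussian variant concrete, here is a minimal numpy sketch of the idea: for every class we store a prior plus the mean and variance of each feature, and we predict the class with the highest log prior plus log likelihood. The function names fit_gaussian_nb and predict_gaussian_nb and the toy values are invented for this illustration; scikit-learn's GaussianNB is the implementation to use in practice.

```python
import numpy as np

def fit_gaussian_nb(X, y):
    """Store the prior, mean and variance of every feature for each class."""
    params = {}
    for label in np.unique(y):
        rows = X[y == label]
        params[label] = (len(rows) / len(X),        # class prior
                         rows.mean(axis=0),         # per-feature mean
                         rows.var(axis=0) + 1e-9)   # per-feature variance
    return params

def predict_gaussian_nb(params, X):
    """Pick the class with the highest log prior + summed log likelihood."""
    labels = list(params)
    scores = []
    for label in labels:
        prior, mean, var = params[label]
        log_likelihood = -0.5 * (np.log(2 * np.pi * var) + (X - mean) ** 2 / var)
        scores.append(np.log(prior) + log_likelihood.sum(axis=1))
    return np.array(labels)[np.argmax(np.array(scores), axis=0)]

# Toy data (temperature, humidity); the values are invented for illustration
X = np.array([[30.0, 80.0], [28.0, 75.0], [32.0, 78.0],
              [15.0, 60.0], [18.0, 65.0], [14.0, 62.0]])
y = np.array(['Yes', 'Yes', 'Yes', 'No', 'No', 'No'])

model = fit_gaussian_nb(X, y)
new_days = np.array([[29.0, 77.0], [16.0, 61.0]])
print(predict_gaussian_nb(model, new_days))
```

The "naive" part is the sum over features: each feature is treated as independent of the others once the class is known.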

So let's use the naive Bayes algorithm to create
a model to predict whether it will rain or not
with the dataset used in the binomial logistic
regression. We will use the Gaussian Naive Bayes
algorithm for this model

In [1]: import pandas
        from sklearn import naive_bayes, metrics
        from sklearn.preprocessing import LabelEncoder
        from sklearn.model_selection import train_test_split
        dt = pandas.read_csv('Rainfall_data.csv')

This time we imported naive_bayes from sklearn
to create our model and LabelEncoder to encode
the labels while comparing the predictions. Next
we can split our data into training input, testing
input, training output and testing output
respectively using the train_test_split()
function, with the test_size as 10%

In [2]: Input = dt.drop(columns=['Unnamed: 0','Rain'])
        Output = dt['Rain']
        inp_X,tst_X,out_y,tst_y = train_test_split(
            Input,Output,test_size=0.1)

Now to create our model, we will use the


GaussianNB class from naive_bayes and train it
with the fit() method

In [3]: Input = dt.drop(columns=['Unnamed: 0','Rain'])
        Output = dt['Rain']
        inp_X,tst_X,out_y,tst_y = train_test_split(
            Input,Output,test_size=0.1)
        CModel = naive_bayes.GaussianNB()
        CModel.fit(inp_X,out_y)

Out[3]: GaussianNB()

Our CModel is ready to predict. So let's test
our model with the predict() method and compare
the answers using the density plot. We will also
print the accuracy score

In [4]: pred_y = CModel.predict(tst_X)
        Enc = LabelEncoder().fit(['Yes','No'])
        cmp = pandas.DataFrame({'Predicted':Enc.transform(
            pred_y),'Actual':Enc.transform(tst_y.values)})
        acc = metrics.accuracy_score(tst_y,pred_y)
        print('Accuracy:',acc)
        cmp.plot(kind='density')

Accuracy: 0.9275

Out[4]: <matplotlib.axes._subplots.AxesSubplot at 0x26a130ee

The model has an accuracy of about 92.75%, and the
predicted line and the actual line in the graph
almost overlap each other. On this run our naive
Bayes model performed slightly better than the
logistic regression model, by nearly 2 percentage
points
17
SUPPORT VECTOR MACHINES
• What is SVM
• SVM Models
• SVM Kernels
What is Support Vector Machine


A support vector machine, or SVM, is a supervised
machine learning algorithm which is used for both
regression and classification. But the working of
SVMs can be considered a bit different. An SVM
algorithm divides a dataset into different classes
or categories with a hyperplane in
multidimensional space. The categories are divided
in a manner that finds the maximum marginal
hyperplane, or MMH. These hyperplanes are
generated in an iterative manner to minimize
errors

In the above graph, the data points closest to
the hyperplane are called support vectors.
The line separating class A and class B is called
the hyperplane, i.e. it divides the data into two
classes.
The margin is the gap between the two lines
through the closest data points of the different
classes. It can be calculated as the perpendicular
distance from the line to the support vectors. A
large margin is considered a good margin and a
small margin is considered a bad margin.
The SVM algorithm finds the maximum marginal
hyperplane to divide the data points into
different classes
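The perpendicular distance mentioned above is easy to compute by hand. In the sketch below the hyperplane w . x + b = 0 and the two support vectors are invented values, chosen so that the support vectors lie exactly on the margin lines w . x + b = +1 and w . x + b = -1; under that convention the margin width equals 2 / ||w||.

```python
import numpy as np

# Hypothetical hyperplane w . x + b = 0 (values invented for this sketch)
w = np.array([3.0, 4.0])
b = -5.0

def distance_to_hyperplane(point):
    """Perpendicular distance from a point to the hyperplane."""
    return abs(w @ point + b) / np.linalg.norm(w)

# Two points lying on the margin lines w . x + b = +1 and w . x + b = -1
support_a = np.array([2.0, 0.0])   # 3*2 + 4*0 - 5 = +1
support_b = np.array([0.0, 1.0])   # 3*0 + 4*1 - 5 = -1

margin = distance_to_hyperplane(support_a) + distance_to_hyperplane(support_b)
print(margin, 2 / np.linalg.norm(w))
```

A smaller ||w|| gives a wider margin, which is why maximizing the margin is usually written as minimizing the length of w.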

Support Vector Machine models


We can create our own SVM model using sample
datasets, so jump into your jupyter notebook and
import the following modules:

In [1]: import numpy
        from matplotlib import pyplot
        import seaborn
        from sklearn.datasets import make_blobs

We will use numpy to modify the data, matplotlib
and seaborn to visualize our data and
make_blobs() to create a dataset which will be
linearly separable

In [2]: X, y = make_blobs(n_samples=100,centers=2,
                          cluster_std=0.50)

Using the make_blobs() function we created a
dataset with 100 samples (data points), specified
in the n_samples parameter. We want data with two
centres (concentrations), which is why we passed 2
to the centers parameter, and finally we set the
standard deviation of each cluster to 0.5 with the
cluster_std parameter
We can see the input and output data, but make
sure to view them in a different cell, because
every time you run the cell with the make_blobs()
function the values will be reassigned, i.e. they
will change

In [3]: X[:5],y[:5]

Out[3]: (array([[-3.04236107, -5.08307767],
                [ 4.45165507,  7.04554459],
                [-4.0071927 , -5.82995397],
                [-3.27505396, -4.85255011],
                [-3.86582416, -4.44910783]]),
         array([1, 0, 1, 1, 1]))

So the input is a 2-D array made of 1-D arrays
with 2 elements (features) each, and the output
values are 0 and 1. Index 0 of every 1-D array
holds the first feature of a data point and index
1 holds the second feature. The output value tells
us which of the two centres the data point
belongs to

[Figure: the same array with the index 0 and index 1 columns marked]

We need to divide them into different variables
to visualize them, so let's slice all the values
with index 0 and index 1 from the 1-D arrays and
store them in X1 and X2

In [4]: X1 = X[:, 0]
        X2 = X[:, 1]
        print(X1[:2])
        print(X2[:2])

[-3.04236107 4.45165507]
[-5.08307767 7.04554459]

Now we can visualize our dataset using the


scatter plot

In [5]: X1 = X[:, 0]
        X2 = X[:, 1]
        pyplot.scatter(X1,X2)

Out[5]: <matplotlib.collections.PathCollection at 0x1043f478

As you can see, the data points are concentrated
in two places, i.e. they form 2 clusters. We can
draw lines to separate them like this:

Using numpy to create the line and matplotlib to


plot the lines in the graph we can visualize the
same
In [10]: xfit = numpy.linspace(-5,5)
         pyplot.scatter(X1,X2,c=y)
         for m, b in [(0.375,0.5),(-0.5,2)]:
             pyplot.plot(xfit, m * xfit + b, '-k')
         pyplot.xlim(-5,5)

Out[10]: (-5.0, 5.0)

The code may be a bit hard to understand, so let's
break it down. First of all we created an xfit
variable and stored evenly spaced values from -5 to
5 as an array using the linspace() function; these
are the x values of the lines. Then we plotted a
graph with our two data clusters. Then we created
a for loop to plot the lines in the graph using
the plot() function. Each pair (m, b) in the loop
is the slope and the intercept of one line, so the
multiplication and addition compute a y value for
every x value in xfit, which stretches the line
from one end of the graph to the other:
y = m * xfit + b
With m = 0.375 and b = 0.5 we get the first line
and with m = -0.5 and b = 2 the second line.
And then we limit the x-axis to the range -5 to 5
to view our lines stretched from the left to the
right

So we have plotted two lines that separate the
data into two classes. As we already know, the SVM
algorithm finds the maximum marginal hyperplane
(MMH). We can get a feel for this with our two
lines by drawing margins of some width around
them, like so:

In [16]: xfit = numpy.linspace(-5,5)
         pyplot.scatter(X1,X2,c=y)
         for m, b, d in [(0.375,0.5,3),(-0.5,2,6)]:
             yfit = m * xfit + b
             pyplot.plot(xfit, yfit, '-k')
             pyplot.fill_between(xfit, yfit - d, yfit + d,
                 edgecolor='none',color='#AAAAAA', alpha=0.4)
         pyplot.xlim(-5,5)

Out[16]: (-5.0, 5.0)

As you can see, we have drawn margins around the
lines that reach to the nearest support vector.
The margins are very wide because the data
clusters are a bit far apart.
We did the same as before, plotting our graph with
the data clusters and the separator lines, and
then we used the fill_between() function to create
the margins. We passed the x values and two sets
of y values (yfit - d and yfit + d) to fill the
area between them, which represents our margin. We
set edgecolor to 'none' since we don't need an
edge, the fill colour to grey (#AAAAAA) and alpha
to 0.4, i.e. the transparency.
Now we can import the SVC support vector
classifier from sklearn.svm to create our model

In [36]: from sklearn.svm import SVC
         CModel = SVC(kernel='linear')
         CModel.fit(X, y)

Out[36]: SVC(kernel='linear')

Let's create a function to plot the maximum


marginal hyperplane and the support vectors using
our CModel

In [40]: def MMH(model, ax=None, plot_support=True):
             # 2-D graph plot
             if ax is None:
                 ax = pyplot.gca()
             xlim = ax.get_xlim()
             ylim = ax.get_ylim()
             # Creating grid
             x = numpy.linspace(xlim[0], xlim[1], 30)
             y = numpy.linspace(ylim[0], ylim[1], 30)
             Y, X = numpy.meshgrid(y, x)
             xy = numpy.vstack([X.ravel(), Y.ravel()]).T
             P = model.decision_function(xy).reshape(X.shape)
             # Plotting boundaries and margins
             ax.contour(X, Y, P, colors='k',
                        levels=[-1, 0, 1], alpha=0.5,
                        linestyles=['--'])
             # Plotting support vectors (black edge so the
             # unfilled markers stay visible)
             if plot_support:
                 ax.scatter(model.support_vectors_[:, 0],
                            model.support_vectors_[:, 1],
                            s=300, linewidth=1, facecolors='none',
                            edgecolors='k')
             ax.set_xlim(xlim)
             ax.set_ylim(ylim)

First of all we take the model, ax (axes) and
plot_support (whether to plot the support vectors
or not) as parameters. Then we start off with the
2-D graph plot, and if we don't pass the axes we
get the current ones using the gca() function. We
also find the x-axis limit and y-axis limit using
the get_xlim() and get_ylim() functions
respectively and store them in xlim and ylim.
Then we create the grid, i.e. the base of our
plot, using xlim and ylim. As we did before, we
create values for the lines using the linspace()
function, create the grid using the meshgrid()
function and then use the vstack() function to
vertically stack the arrays, which are flattened
using the ravel() function. We also call
decision_function() to get the values for the
boundaries and margins.
Next we use this data to plot the boundaries and
margins, using the contour() function to draw the
lines and specifying the linestyles and the other
properties with the respective parameters.
At last, we check whether to plot the support
vectors or not, and if so plot them with the
scatter() function, using the values at index 0
and index 1 of each row of the support_vectors_
array.
Finally we can plot our data clusters, call the
MMH() function and pass our SVC model

In [41]: pyplot.scatter(X1,X2,c=y)
         MMH(CModel)

Finally we have the maximum marginal hyperplane


plotted for our data clusters with the support
vectors

Support Vector Machine Kernels


Support vector machines are implemented with
kernels that transform the input data space into a
higher-dimensional one, which gives the support
vector machine more flexibility. In the previous
model we used the linear kernel, but there are
different types of kernels, like:
• Linear kernel, used when the classes can be
separated by a straight line (or flat hyperplane)
• Polynomial kernel, a more generalized version
of the linear kernel, used when the input space is
non-linear
• Radial Basis Function kernel, used for SVMs that
map the input space into infinite dimensions
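Each kernel is just a function that scores how similar two data points are. The sketch below writes the three kernels in their common textbook form; the parameter values (degree=3, coef0=1.0, gamma=0.5) are assumptions picked for the example and are not the defaults used by sklearn.

```python
import numpy as np

def linear_kernel(a, b):
    """Plain dot product of the two points."""
    return a @ b

def polynomial_kernel(a, b, degree=3, coef0=1.0):
    """Dot product shifted by coef0 and raised to a power."""
    return (a @ b + coef0) ** degree

def rbf_kernel(a, b, gamma=0.5):
    """Decays from 1 towards 0 as the points move apart."""
    return np.exp(-gamma * np.sum((a - b) ** 2))

a = np.array([1.0, 2.0])
b = np.array([2.0, 0.0])
print(linear_kernel(a, b), polynomial_kernel(a, b), rbf_kernel(a, b))
```

In SVC the kernel is chosen with the kernel parameter, for example kernel='linear', kernel='poly' or kernel='rbf'.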

This time we will use a sample dataset provided
by sklearn to understand the different kernels.
First of all we need to import our data and
prepare it. Import the following

In [1]: import numpy
        from sklearn import svm, datasets
        from matplotlib import pyplot

We load the iris dataset (a sample dataset of iris
flowers) from the datasets module

In [2]: dt = datasets.load_iris()
        X = dt.data[:, :2]
        y = dt.target

We imported the dataset and split it into the
input data[:, :2] (the first two features, i.e.
the elements with indexes 0 and 1 in all the
arrays) and the output target, and stored them in
X and y respectively

Now we need the data to plot the SVM boundaries,
i.e. the input data spaces (different classes). To
do so we need to create a grid as we did before.
To build the grid we need the minimum and maximum
values of the two input features. Then we flatten
the grid arrays using ravel() and pass them to
numpy.c_, which stacks arrays as columns along
their last axis after upgrading them to at least
2-D. The result is stored in X_plot as the testing
input

In [3]: x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
        y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
        h = (x_max / x_min)/100
        xx, yy = numpy.meshgrid(numpy.arange(x_min, x_max, h),
                                numpy.arange(y_min, y_max, h))
        X_plot = numpy.c_[xx.ravel(), yy.ravel()]

We have the required data to train the SVC


classifier, so let's create it

In [4]: SvcModel = svm.SVC(kernel='linear',C=1.0).fit(X, y)

We create the SvcModel using the SVC() class,
passing linear to the kernel parameter and 1.0
(float) to the regularization parameter C, and
train it using the fit() method.
Now we can predict the X_plot values and store
the result in Z. We will reshape it with the
reshape() function to the shape of the xx
meshgrid.
First of all we will plot the figure (base). Then
we will add a subplot with subplot(121) and draw
the filled contours with the contourf() function,
passing the values created with the meshgrid()
function and the predicted values Z. Now we can
plot the data clusters using the scatter() plot,
and finally limit the x-axis to the minimum and
maximum values of the xx meshgrid.
We can see how our dataset is divided into
different spaces by the support vector classifier
with the linear kernel
In [5]: Z = SvcModel.predict(X_plot)
        Z = Z.reshape(xx.shape)
        pyplot.figure(figsize=(15, 5))
        pyplot.subplot(121)
        pyplot.contourf(xx, yy, Z, alpha=0.3)
        pyplot.scatter(X[:, 0], X[:, 1],c=y)
        pyplot.xlim(xx.min(), xx.max())

Out[5]: (3.3, 8.882727272727251)

Similarly we can create a SVC using the Radial


Basis Function kernel.

In [10]: RbfSvc = svm.SVC(kernel='rbf',C=1.0).fit(X, y)
         Z = RbfSvc.predict(X_plot)
         Z = Z.reshape(xx.shape)
         pyplot.figure(figsize=(15, 5))
         pyplot.subplot(121)
         pyplot.contourf(xx, yy, Z,alpha=0.3)
         pyplot.scatter(X[:, 0], X[:, 1],c=y)
         pyplot.xlim(xx.min(), xx.max())

Out[10]: (3.3, 8.882727272727251)

You can compare the two plots made with the linear
and rbf kernels and notice a clear difference: the
linear kernel separates the classes with straight
lines, the rbf kernel with curves
18
CLUSTERING
• Clustering
• K-Means algorithm
• Mean shift algorithm
• Hierarchical clustering

Clustering is a case of unsupervised machine
learning. A clustering algorithm learns the
relations in the data and sorts it into groups.
Depending on the algorithm, the number of groups
is either provided as input or worked out from
the data

Clustering
The following are the different types of
clustering:
• Density-based, clusters are formed as dense
regions. These algorithms have good accuracy
and the capability to merge two clusters together.
Examples: Density-Based Spatial Clustering of
Applications with Noise (DBSCAN), Ordering
Points To Identify the Clustering Structure (OPTICS)
• Hierarchical-based, clusters are formed as a
hierarchical tree, which can be Agglomerative
(bottom-up approach) or Divisive (top-down
approach). Examples: Clustering Using
Representatives (CURE), Balanced Iterative Reducing
and Clustering using Hierarchies (BIRCH)
• Partitioning, clusters are formed by partitioning
the objects into k groups; the number of clusters
is equal to the number of partitions. Example:
K-Means
• Grid, clusters are formed as a grid. This method
is fast and independent of the number of
objects. Examples: Statistical Information Grid
(STING), Clustering in Quest (CLIQUE)

Until now we have calculated the accuracy of
supervised learning algorithms from the predicted
values and actual values, but how can we do so for
unsupervised learning algorithms, when we are
dealing with unlabeled data?
There are some metrics that can be used to
evaluate the performance or quality of different
unsupervised learning algorithms from the clusters
they produce.
Silhouette analysis is used to check the quality
of a clustering model by measuring the distance
between the clusters. It gives us a way to assess
parameters like the number of clusters with the
help of the silhouette score.
The silhouette score ranges from -1 to 1. The
different values represent the following:
• 1 is the situation when the cluster is far
away from its neighbouring cluster
• 0 is the situation when the cluster is very
close to, or on, the decision boundary separating
the clusters
• -1 is the situation when the clusters aren't
formed correctly, i.e. points are assigned to the
wrong cluster

The silhouette score can be calculated using the
following formula:

Silhouette Score = (p - q) / max(p, q)

where p is the mean distance to the points in the
nearest cluster and q is the mean intra-cluster
distance to all the points
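To see the formula at work, the sketch below scores a single point by hand. The function name silhouette and the toy coordinates are made up for the illustration; for a whole dataset, sklearn.metrics.silhouette_score averages this value over all points.

```python
import numpy as np

def silhouette(point, own_cluster, nearest_cluster):
    """(p - q) / max(p, q) for one point.

    own_cluster holds the other points of the point's own cluster.
    """
    q = np.mean(np.linalg.norm(own_cluster - point, axis=1))       # mean intra-cluster distance
    p = np.mean(np.linalg.norm(nearest_cluster - point, axis=1))   # mean distance to nearest cluster
    return (p - q) / max(p, q)

point = np.array([0.0, 0.0])
own = np.array([[0.0, 1.0], [1.0, 0.0]])      # other members of its cluster
far = np.array([[10.0, 0.0], [0.0, 10.0]])    # a well-separated neighbouring cluster

good = silhouette(point, own, far)   # point sits with its close neighbours
bad = silhouette(point, far, own)    # same point assigned to the distant cluster
print(good, bad)
```

The first score is close to 1 and the second is negative, which matches the interpretation of the score range given above.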
Next we can use the Davies-Bouldin index to know
whether the clusters are well spaced from each
other and how dense they are. We can calculate the
DB index using the following formula:

$$DB = \frac{1}{n}\sum_{i=1}^{n}\max_{j \ne i}\frac{\sigma_i + \sigma_j}{d(c_i, c_j)}$$

where n is the number of clusters, $\sigma_i$ is
the average distance of all points in cluster i
from the cluster centroid $c_i$, and $d(c_i, c_j)$
is the distance between the centroids of clusters
i and j.
Lower values indicate good performance, where 0
is the minimum value

The Dunn index is another metric that can be used
to evaluate the performance of a clustering
algorithm. It is similar to the DB index, but the
differences are:
• It considers only the clusters close together,
whereas the DB index considers all of the clusters
• A lower Dunn index indicates bad performance,
whereas the lower the DB index, the higher the
performance of the algorithm
The Dunn index can be calculated using the
following formula:

$$D = \frac{\min_{1 \le i < j \le n} P(i, j)}{\max_{1 \le k \le n} q(k)}$$

where i, j and k are indices for clusters, n is the
number of clusters, P is the inter-cluster
distance and q is the intra-cluster distance.
The Dunn index increases with the performance of
the clustering algorithm
So we can evaluate the performance of different
clustering algorithms (not the accuracy of one
algorithm) with the following metrics:
• Silhouette Score
• Davies-Bouldin Index
• Dunn Index
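The last two metrics can be sketched with numpy as follows. This is one possible reading of the definitions, not the only one: the function names are invented, and for the Dunn index the inter-cluster distance is taken as the smallest distance between points of two clusters and the intra-cluster distance as the largest distance between two points of the same cluster (other variants exist). For the DB index, sklearn.metrics.davies_bouldin_score is available as a ready-made function.

```python
import numpy as np

def centroid(cluster):
    return cluster.mean(axis=0)

def spread(cluster):
    """Average distance of the points in a cluster from its centroid."""
    return np.mean(np.linalg.norm(cluster - centroid(cluster), axis=1))

def davies_bouldin(clusters):
    n = len(clusters)
    worst_ratios = []
    for i in range(n):
        ratios = [(spread(clusters[i]) + spread(clusters[j]))
                  / np.linalg.norm(centroid(clusters[i]) - centroid(clusters[j]))
                  for j in range(n) if j != i]
        worst_ratios.append(max(ratios))
    return float(np.mean(worst_ratios))

def dunn(clusters):
    """Smallest inter-cluster distance divided by the largest cluster diameter."""
    def pairwise(a, b):
        return np.linalg.norm(a[:, None, :] - b[None, :, :], axis=2)
    n = len(clusters)
    min_between = min(pairwise(clusters[i], clusters[j]).min()
                      for i in range(n) for j in range(i + 1, n))
    max_within = max(pairwise(c, c).max() for c in clusters)
    return float(min_between / max_within)

# Compact, well-separated clusters versus stretched, closer clusters
tight = [np.array([[0.0, 0.0], [0.0, 1.0]]), np.array([[10.0, 0.0], [10.0, 1.0]])]
loose = [np.array([[0.0, 0.0], [0.0, 4.0]]), np.array([[5.0, 0.0], [5.0, 4.0]])]
print(davies_bouldin(tight), dunn(tight))
print(davies_bouldin(loose), dunn(loose))
```

The compact clusters get the lower DB index and the higher Dunn index, in line with the rules stated above.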

Let's see a clustering algorithm in action. We
will use the make_blobs() function to create a
dataset with clusters as we did at the time of
SVM. So let's import the modules and functions we
need

In [1]: from sklearn.cluster import KMeans
        from sklearn.datasets import make_blobs
        from matplotlib import pyplot

In a new cell, create the dataset with the
make_blobs() function and pass 4 to the centers
parameter, 240 to n_samples for 240 datapoints
and 0.6 as the standard deviation. Keep the cell
untouched or the data will be re-assigned randomly

In [2]: dt,y = make_blobs(n_samples=240,centers=4,


cluster_std=0.60)

Let's visualize the data (the values at indexes 0
and 1) using the scatter() plot

In [3]: X1 = dt[:,0]
        X2 = dt[:,1]
        pyplot.scatter(X1,X2)

Out[3]: <matplotlib.collections.PathCollection at 0x259b413


Now we can create our clustering algorithm, i.e.
K-Means. Create a Clstr using KMeans and specify
4 in n_clusters, i.e. the number of clusters

In [4]: Clstr = KMeans(n_clusters=4)
        Clstr.fit(dt)

Out[4]: KMeans(n_clusters=4)

So our clustering algorithm is ready! Let's ask
the Clstr to predict the clusters in the dt
dataset and plot them

In [5]: pred = Clstr.predict(dt)
        pyplot.scatter(X1,X2,c=pred)
        centers = Clstr.cluster_centers_
        pyplot.scatter(centers[:, 0],centers[:, 1],
                       marker='x',c='cyan',s=80)

Out[5]: <matplotlib.collections.PathCollection at 0x259b3fa

We stored the predictions to use them as the
colour in the plot. We also extracted the centres
of our clusters from the Clstr using the
cluster_centers_ attribute and plotted them as the
centres of the clusters in the dataset. As you can
see, our Clstr has grouped the data into 4
clusters

K-Means Algorithm
We already know there are different types of
clustering algorithms, and the K-Means algorithm
is one of them. The K-means clustering algorithm
computes the centroids and iterates until it finds
the optimal centroids. While using this algorithm
we always need to pass the number of clusters
(n_clusters). It is also called a flat clustering
algorithm. The number of clusters the algorithm
identifies in the data is represented by 'K' in
K-means.
We already used the K-Means algorithm in the
previous example, so let's see how it worked.
First of all it divided the dataset into
n_clusters clusters. In our case it created 4
clusters with about 60 datapoints each, because
make_blobs() spread the 240 datapoints evenly over
the 4 centres.

In the basic version of the algorithm the starting
point of this division is random: the algorithm
takes K (n_clusters) datapoints from the sample as
initial centroids and assigns every datapoint to
the closest one, which forms the first clusters.
Next it will compute the cluster centroids for
each of the clusters formed

Next the algorithm keeps iterating the following
steps until it finds the optimal centroids, i.e.
until the assignment of datapoints to the clusters
is not changing any more:
• Compute the sum of squared distances between the
datapoints and the centroids
• Assign each datapoint to the cluster whose
centroid is closest
• Find the centroid of each cluster formed by
taking the average of all the datapoints in it
K-means follows the Expectation-Maximization
approach to solve the problem. The Expectation
step assigns the datapoints to the closest cluster
and the Maximization step computes the centroid of
each cluster
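The loop described above fits in a short numpy function. This is a minimal sketch for illustration only: the name k_means, the toy data and the purely random choice of starting centroids are assumptions of the example, while sklearn's KMeans uses a smarter initialisation and several restarts.

```python
import numpy as np

def k_means(data, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Start from k distinct data points chosen at random
    centroids = data[rng.choice(len(data), size=k, replace=False)]
    for _ in range(n_iter):
        # Expectation step: assign every point to its closest centroid
        distances = np.linalg.norm(data[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Maximization step: move each centroid to the mean of its points
        new_centroids = np.array([data[labels == j].mean(axis=0) if np.any(labels == j)
                                  else centroids[j] for j in range(k)])
        if np.allclose(new_centroids, centroids):   # centroids stopped moving
            break
        centroids = new_centroids
    return labels, centroids

# Two obvious groups of points (invented for illustration)
data = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                 [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])
labels, centroids = k_means(data, k=2)
print(labels)
print(centroids)
```

The three points near the origin end up in one cluster and the three points near (10, 10) in the other; which cluster gets label 0 depends on the random start.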

Mean-Shift Algorithm
Next we have the Mean-Shift algorithm, which
assigns the datapoints to the clusters iteratively
by shifting points towards the highest density of
datapoints, i.e. the cluster centroid. The number
of clusters is evaluated by the algorithm itself
upon receiving the data and analysing it.
We can create a sample dataset using the
make_blobs() function for clustering with the
MeanShift algorithm. Let's import the modules and
functions we need

In [1]: import numpy
        from sklearn.cluster import MeanShift
        from matplotlib import pyplot
        from sklearn.datasets import make_blobs

Then we can move on to a new cell and create our
dataset using the make_blobs() function

In [2]: dt, y = make_blobs(n_samples=270,centers=[[2,3],


[4,5],[3,10]], cluster_std = 0.5)

This time we specified the positions of the
clusters and passed the centers (x and y values)
as a 2-D array, so the number of centre points (3)
is equal to the number of clusters. We can view
our data using the scatter() plot

In [3]: X1 = dt[:,0]
        X2 = dt[:,1]
        pyplot.scatter(X1,X2)

Out[3]: <matplotlib.collections.PathCollection at 0x1dbd820e

Now we can create a clustering model using the


MeanShift algorithm and pass our dt data

In [4]: Clstr = MeanShift()
        Clstr.fit(dt)

Out[4]: MeanShift()

We created the Clstr object of the MeanShift
class and passed our data through the fit()
method. You may note that we didn't specify the
number of clusters in our dataset as we did with
the K-Means algorithm.
Now let's find the number of clusters predicted
by the algorithm and plot the clusters found by
our MeanShift algorithm

In [5]: labels = Clstr.labels_
        cen = Clstr.cluster_centers_
        n_clusters = len(numpy.unique(labels))
        print("Estimated No. of Clusters:", n_clusters)
        colors = 10*['r.','g.','b.']
        for i in range(len(dt)):
            pyplot.plot(dt[i][0], dt[i][1], colors[labels[i]])
        pyplot.scatter(cen[:,0],cen[:,1],
                       marker='x',color='k',s=100,zorder=10)

Estimated No. of Clusters: 3

Out[5]: <matplotlib.collections.PathCollection at 0x1dbd929725

As you can see the Clstr has predicted 3
clusters in our dataset, which is correct. We
extracted the label of each datapoint using the
labels_ attribute, which we use to plot the
clusters. Then we found the cluster centres or
centroids using the cluster_centers_ attribute of
the Clstr object, which holds their x and y
values. The number of unique labels (which matches
the number of rows in the 2-D array of centres)
is the number of clusters predicted by the Clstr.
Then using a for loop we plotted the clusters
(note the use of labels to colour the datapoints
of different clusters differently) and at last we
plotted the centres of the clusters using the
scatter() plot, specifying the zorder as 10 to
draw them in front

So how did the MeanShift algorithm find the
clusters even though we didn't specify the
number of clusters beforehand?
First of all, it starts with every data point
acting as a candidate centroid of its own. Then the
algorithm computes the mean of the points near each
centroid and moves the centroid to that mean.
This process is repeated, moving each centroid towards a
higher density region, until the centroids reach a
position from which they cannot move any further. Centroids
that end up in the same place form one cluster.
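The shifting procedure described above can be sketched in a few lines of plain Python. This is only a minimal illustration and not scikit-learn's implementation: the function name mean_shift, the sample points and the bandwidth value (the radius within which points count as "near" a centroid) are all assumptions made for this example.

```python
import math

def mean_shift(points, bandwidth, max_iter=100, tol=1e-6):
    """Toy mean shift with a flat kernel (hypothetical helper)."""
    # Every point starts as its own candidate centroid
    centroids = [tuple(p) for p in points]
    for _ in range(max_iter):
        moved = 0.0
        shifted = []
        for c in centroids:
            # Points within the bandwidth of the current centroid
            near = [p for p in points if math.dist(c, p) <= bandwidth]
            # Move the centroid to the mean of those points
            mean = tuple(sum(axis) / len(near) for axis in zip(*near))
            moved = max(moved, math.dist(c, mean))
            shifted.append(mean)
        centroids = shifted
        if moved < tol:  # nothing moved any further: converged
            break
    # Centroids that ended up in the same place form one cluster
    centres = []
    for c in centroids:
        if all(math.dist(c, u) > bandwidth / 2 for u in centres):
            centres.append(c)
    return centres

# Assumed sample data: two well separated groups of points
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1),
          (8.0, 8.0), (8.1, 7.9), (7.9, 8.2)]
centres = mean_shift(points, bandwidth=2.0)
print("Estimated No. of Clusters:", len(centres))
```

Note that the number of clusters is never passed in; it falls out of how many distinct places the centroids settle in, which depends on the chosen bandwidth.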

Hierarchical clustering
As the name suggests, hierarchical clustering is
a form of clustering where the dataset is organised
into nested groups like a tree, ranging from
individual datapoints up to a single cluster of the whole dataset.
There are two types of hierarchical clustering:
• Agglomerative is a type of hierarchical
clustering where each individual datapoint starts
as its own cluster, and the closest clusters are then merged
(agglomerated) step by step into larger clusters
until they form a single cluster of the whole
dataset
[Figure: agglomerative tree, from individual datapoints (bottom) through shapes up to the whole dataset cluster (top)]

• Divisive is the opposite of agglomerative:
the whole dataset starts as a single cluster,
which is then split step by step down to individual datapoints

[Figure: divisive tree, from the whole dataset cluster (top) through shapes (filled circle, un-filled circle, filled square, un-filled square) down to individual datapoints (bottom)]
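To make the agglomerative idea concrete, here is a small sketch in plain Python that repeatedly merges the two closest clusters of one-dimensional values and records every step. The function name agglomerate, the sample ages and the choice of "closest pair of members" as the distance between clusters (single linkage) are assumptions for illustration; it is not how scikit-learn implements it.

```python
def agglomerate(values):
    """Merge the two closest clusters until one cluster remains."""
    clusters = [[v] for v in values]  # each datapoint is its own cluster
    history = [[list(c) for c in clusters]]
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single linkage: distance between the closest members
                d = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
        history.append([list(c) for c in clusters])
    return history

ages = [6, 7, 11, 12, 20]  # assumed sample values
history = agglomerate(ages)
for step, clusters in enumerate(history):
    print(step, clusters)
```

Reading the printed steps from top to bottom shows the agglomerative direction; reading them from bottom to top gives the divisive view of the same tree.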


Let's create a hierarchical model to see it in
action. We will use the flavours.csv dataset used
with the DecisionTreeClassifier (page no. 145)
classification algorithm.
Jump into your Jupyter notebook and import the
following modules:

In [1]: import pandas
        import numpy
        from sklearn.cluster import AgglomerativeClustering
        from matplotlib import pyplot
        from scipy.cluster.hierarchy import dendrogram
        from sklearn.preprocessing import LabelEncoder

Before model creation we need to encode the
flavour labels using the LabelEncoder. We will
import the data and encode it:

In [2]: dt = pandas.read_csv('flavours.csv')
        Enc = LabelEncoder().fit(dt['Flavour'])
        dt['Flavour'] = Enc.transform(dt['Flavour'])

Now we can create our agglomerative hierarchical
clustering model, Clstr. We will use the age and
flavour data:

In [3]: X = dt.drop(columns='Gender')
        Clstr = AgglomerativeClustering(distance_threshold=0,
                                        n_clusters=None)
        Clstr.fit(X)

Out[3]: AgglomerativeClustering

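We passed 0 to the distance_threshold parameter (the
linkage distance threshold at or above which clusters
will not be merged), so the model builds the full tree
over the whole dataset instead of stopping at a fixed
number of clusters, and None to n_clusters (it must be
None when distance_threshold is set).
Now we can create a function to plot the hierarchy: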

In [4]: # Create linkage matrix and then plot the dendrogram
        def plot_dendrogram(model, **kwargs):
            # create the counts of samples under each node
            counts = numpy.zeros(model.children_.shape[0])
            n_samples = len(model.labels_)
            for i, merge in enumerate(model.children_):
                current_count = 0
                for child_idx in merge:
                    if child_idx < n_samples:
                        current_count += 1  # leaf node
                    else:
                        current_count += counts[
                            child_idx - n_samples]
                counts[i] = current_count

            linkage_matrix = numpy.column_stack([
                model.children_, model.distances_,
                counts]).astype(float)

            # Plot the corresponding dendrogram
            dendrogram(linkage_matrix, **kwargs)

In plot_dendrogram() we create the counts
of samples under each node and plot the dendrogram.
First of all we create the counts array using
the zeros() function, which returns an array
of zeros with one entry per row of the children_
attribute (one row per non-leaf node). We get
the total number of samples from the
model.labels_ attribute (the label of each
datapoint). Then, using a for loop over
enumerate(model.children_), we count the samples
under each merge (enumerate pairs each element
with its position, starting
from 0, giving (0, children_[0]),
(1, children_[1]), etc.): a child index below n_samples
is a leaf and counts as 1, otherwise it refers to an
earlier merge whose count we reuse. Then we create the
linkage matrix using the column_stack() function
and plot the hierarchy using the dendrogram()
function
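The counting loop is easier to follow on a tiny hand-made example. The sketch below uses a plain list in place of the children_ attribute; the helper name merge_counts and the three merges shown are made up for illustration (four leaves numbered 0 to 3, so merge number i gets the node id 4 + i).

```python
def merge_counts(children, n_samples):
    """Count the samples under each merge node (hypothetical helper)."""
    counts = [0] * len(children)
    for i, merge in enumerate(children):
        current_count = 0
        for child_idx in merge:
            if child_idx < n_samples:
                current_count += 1  # leaf node: one sample
            else:
                # earlier merge: reuse the count stored for it
                current_count += counts[child_idx - n_samples]
        counts[i] = current_count
    return counts

# Assumed merges: (0,2) -> node 4, (1,3) -> node 5, (4,5) -> node 6
children = [[0, 2], [1, 3], [4, 5]]
print(list(enumerate(children)))
counts = merge_counts(children, n_samples=4)
print(counts)
```

The last merge joins nodes 4 and 5, which are not leaves, so its count is the sum of the two earlier counts, i.e. the whole dataset.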

In [5]: plot_dendrogram(Clstr, truncate_mode='level')

As you can see, the values on the x-axis are
the indexes of the elements, which are grouped in
succession until a single cluster of the whole dataset is
formed. You can check the hierarchies: for example,
elements indexed 0 and 2 (chocolate) are
grouped together, elements indexed 1 and 3
(strawberry) are grouped together,
and those two groups in turn are grouped together
because their ages are in the range 6-8, so this
is a group of flavours liked by children.

    Age  Gender  Flavour
 0    6  Male    Chocolate
 1    6  Female  Strawberry
 2    7  Male    Chocolate
 3    8  Female  Strawberry
 4   11  Male    Butterscotch
 5   10  Female  Butterscotch
 6   12  Male    Butterscotch
 7   14  Female  Vanilla
 8   15  Male    Mango
 9   15  Female  Vanilla
10   17  Male    Mango
11   16  Female  Butterscotch
12   19  Male    Almond & Chocolate
13   18  Female  Butterscotch
14   20  Male    Almond & Chocolate
15   20  Female  Butterscotch
16   21  Male    Coffee
17   22  Female  Coffee
18   24  Male    Almond & Chocolate
19   24  Female  Coffee

(Rows 0 to 3, ages 6-8, are the group discussed above.)

19 KNN ALGORITHMS
• Finding nearest neighbours
• Regression with KNN
• Classification with KNN

KNN stands for K-Nearest Neighbors, which is a
supervised machine learning algorithm used for both
regression and classification problems.

Finding nearest neighbours

KNN is a lazy learning algorithm, i.e. it doesn't
have a separate training phase; instead it uses all
the training data at prediction time for classification
or regression. It is
also considered non-parametric because it makes no
assumptions about the underlying data distribution.
K in KNN is the number of nearest datapoints considered. KNN
uses 'feature similarity' to predict the values of
new datapoints, which means that a new
datapoint is assigned a value based on how
closely it matches the points in the training set.
The KNN algorithm works in the following manner:
• First we need to specify K (the number
of nearest datapoints), say 3
• Then calculate the distance between the test datapoint
and each row of the training data
• Now sort the rows in ascending order on the basis
of the distance calculated above
• Finally assign a class or value to the point on
the basis of the K nearest datapoints (the most common
class for classification, the average value for regression)
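The four steps above can be sketched from scratch in plain Python. This is a minimal illustration under assumptions, not scikit-learn's implementation: the function names, the Euclidean distance and the small training lists are all made up for the example.

```python
import math
from collections import Counter

def k_nearest(train_X, train_y, query, k=3):
    """Return the targets of the k training rows closest to query."""
    # Step 2: distance between the query and each training row
    pairs = [(math.dist(x, query), y) for x, y in zip(train_X, train_y)]
    # Step 3: sort in ascending order of distance
    pairs.sort(key=lambda pair: pair[0])
    return [y for _, y in pairs[:k]]

def knn_classify(train_X, train_y, query, k=3):
    # Step 4 (classification): most common class among the neighbours
    return Counter(k_nearest(train_X, train_y, query, k)).most_common(1)[0][0]

def knn_regress(train_X, train_y, query, k=3):
    # Step 4 (regression): average value of the neighbours
    neighbours = k_nearest(train_X, train_y, query, k)
    return sum(neighbours) / len(neighbours)

# Assumed toy data: one feature per row
X_cls = [(1.0,), (2.0,), (3.0,), (10.0,), (11.0,), (12.0,)]
y_cls = ['No', 'No', 'No', 'Yes', 'Yes', 'Yes']
predicted_class = knn_classify(X_cls, y_cls, (2.5,), k=3)

heights = [(65.0,), (66.0,), (68.0,), (70.0,), (72.0,)]
weights = [115.0, 120.0, 130.0, 140.0, 150.0]
predicted_weight = knn_regress(heights, weights, (69.5,), k=3)

print(predicted_class, predicted_weight)
```

Notice that nothing is "trained" here: all the work happens when a prediction is requested, which is exactly what lazy learning means.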

Regression with KNN

We will perform regression with the KNN algorithm
to create a model that predicts the weight of a
person when the height is provided as input, the
same problem solved with the linear regression model
(page no. 130).
We can start off by importing the necessary
modules and the dataset height_and_weight.csv:

In [1]: import pandas
        from sklearn.neighbors import KNeighborsRegressor
        from sklearn.model_selection import train_test_split
        dt = pandas.read_csv('height_and_weight.csv')
        dt

Out[1]:
     Index  Height(In)  Weight(lbs)
0        1       65.78       112.99
1        2       71.52       136.49
2        3       69.40       153.03
3        4       68.22       142.34
4        5       67.79       144.30
...
195    196       65.80       120.84
196    197       66.11       115.78
197    198       68.24       128.30
198    199       68.02       127.47
199    200       71.39       127.88

200 rows x 3 columns

Now we can split our dataset
into training and testing sets:

In [2]: X = dt.drop(columns=['Index','Weight(lbs)'])
        y = dt.drop(columns=['Index','Height(In)'])
        inp_X, tst_X, out_y, tst_y = train_test_split(X, y,
                                                      test_size=0.1)

Our data is ready, so let's create our KNN model
and train it:

In [3]: KNN = KNeighborsRegressor()
        KNN.fit(inp_X, out_y)

Out[3]: KNeighborsRegressor()

We can specify K with the n_neighbors
parameter during the class initialization:

In [3]: KNN = KNeighborsRegressor(n_neighbors=25)
        KNN.fit(inp_X, out_y)

Out[3]: KNeighborsRegressor(n_neighbors=25)

But as we didn't pass any value in our model, the default
of 5 is used.
Our model is ready, so we can let it predict and
then compare the values:

In [4]: pred_y = KNN.predict(tst_X)
        act = tst_y['Weight(lbs)'].tolist()
        prd = (pred_y.flatten()).tolist()
        cmp = pandas.DataFrame({'Predictions': prd,
                                'Actual': act})
        cmp.plot(kind='bar')

Out[4]: <matplotlib.axes._subplots.AxesSubplot at 0x1da4af9<

Our KNN model has performed pretty well: as you
can see, the actual (orange) and predicted (blue)
bars are mostly close. We could try to improve
it further by increasing K (n_neighbors)
during the KNN initialization to about
double, i.e. 10, but
it is already performing pretty well.

Classification with KNN

Similarly, we can use the KNN algorithm to create
classifiers. We can create a classifier with the
Rainfall_data.csv used in the logistic
regression model (page no. 152). So let's import
the modules and the dataset:

In [1]: import pandas
        from sklearn.neighbors import KNeighborsClassifier
        from sklearn.model_selection import train_test_split
        from sklearn.preprocessing import LabelEncoder
        from sklearn.metrics import accuracy_score
        dt = pandas.read_csv('Rainfall_data.csv')
We need to change the Rain labels into numeric
values. We will use the LabelEncoder to do so:

In [2]: Enc = LabelEncoder()
        Enc.fit(['Yes','No'])
        dt['Rain'] = Enc.transform(dt['Rain'])

So let's preview our data using the head()
function to view the encoding:

In [3]: dt.head()

   Unnamed: 0  Temperature  Humidity%  Rain
0           0           34       74.2     1
1           1           19       68.2     0
2           2           28       67.2     1
3           3           29       66.6     1
4           4           26       57.9     1

Now we can split our dataset
into training input, testing input, training
output and testing output using the
train_test_split() function with the test_size
as 10%:

In [4]: X = dt.drop(columns=['Unnamed: 0','Rain'])
        y = dt['Rain']
        inp_X, tst_X, out_y, tst_y = train_test_split(X, y,
                                                      test_size=0.1)

Our data is ready, so let's create our KNN
classifier and train it:

In [5]: KNN = KNeighborsClassifier()
        KNN.fit(inp_X, out_y)

Out[5]: KNeighborsClassifier()

Now we can pass the test values to the KNN
classifier and print the accuracy score:

In [6]: pred_y = KNN.predict(tst_X)
        acc = accuracy_score(tst_y,pred_y)
        print('Accuracy:',acc)

Accuracy: 0.9335

Our model has an accuracy of 93%, which is 2% more than
the logistic regression model. So this is
how we can create KNN models to solve
different problems of regression and classification
alike.
20 PERFORMANCE & METRICS
• Calculating the model
• Improving the model
• Saving and loading models

So far we have created a lot of models with
different algorithms for different tasks like
regression, classification, etc. and also
evaluated their performance visually through
graphs or through their accuracy score. In this lesson we
will look at the methods to calculate the
performance of algorithms.

Calculating the model

In the maths for machine learning lesson we
learned about some methods to calculate error
rate, Precision, Recall and F-measure (page no.
71) using a confusion matrix. All these values can
be used to evaluate the performance of a
classifier. Let's use the KNN classifier model we
created previously and calculate its performance.
First of all, let's print the confusion matrix
using the confusion_matrix() function:

In [8]: from sklearn import metrics
        cm = metrics.confusion_matrix(tst_y,pred_y)
        cm

Out[8]: array([[ 657,   62],
               [  80, 1201]], dtype=int64)

We passed the actual values followed by the predicted
values. Each row of the matrix is an actual class and each
column a predicted class, in the order 0 (No) then 1 (Yes).
We can visualize it better like this:

In [11]: from sklearn import metrics
         cm = metrics.confusion_matrix(tst_y,pred_y)
         cf = pandas.DataFrame(cm,
                               index=['Actual -ve', 'Actual +ve'],
                               columns=['Predicted -ve', 'Predicted +ve'])
         cf

Out[11]:
            Predicted -ve  Predicted +ve
Actual -ve            657             62
Actual +ve             80           1201

So we have 657 True Negatives (predicted
Negative values {0, 'No', etc.} that are actually Negative),
62 False Positives (predicted Positive
values {1, 'Yes', etc.} that are actually Negative),
80 False Negatives (predicted Negative values that
are actually Positive) and 1201 True Positives
(predicted Positive values that are actually Positive).
Using the confusion matrix we can calculate other
metrics like:

In [14]: from sklearn import metrics
         cm = metrics.confusion_matrix(tst_y,pred_y)
         cf = pandas.DataFrame(cm,
                               index=['Actual -ve', 'Actual +ve'],
                               columns=['Predicted -ve', 'Predicted +ve'])
         val = (tst_y,pred_y)

         acc = metrics.accuracy_score(*val)
         pre = metrics.precision_score(*val)
         rcl = metrics.recall_score(*val)
         fms = metrics.f1_score(*val)

         print('Accuracy:',acc)
         print('Precision:',pre)
         print('Recall:',rcl)
         print('F-Measure:',fms)

Accuracy: 0.929
Precision: 0.950910530482977
Recall: 0.9375487900078064
F-Measure: 0.944182389937107

We have calculated the accuracy, precision (true
positive values predicted by the model out of the total
positive values predicted), recall (true positive
values predicted by the model out of the actual positive
values) and F-measure (also known as F1 score)
using the accuracy_score(), precision_score(),
recall_score() and f1_score() functions and
printed them respectively.
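These four scores can also be reproduced by hand from the confusion matrix counts, which is a useful check on what the functions report. The sketch below is a plain-Python illustration; it assumes scikit-learn's layout for a 0/1 problem (rows are actual classes, columns are predicted classes), so the four cells are read as TN, FP, FN and TP. The variable names are chosen for this example.

```python
# Counts from the confusion matrix above, read row by row
tn, fp = 657, 62     # actual 0: predicted 0, predicted 1
fn, tp = 80, 1201    # actual 1: predicted 0, predicted 1

total = tn + fp + fn + tp
accuracy = (tp + tn) / total            # correct predictions / all predictions
precision = tp / (tp + fp)              # correct positives / predicted positives
recall = tp / (tp + fn)                 # correct positives / actual positives
f_measure = 2 * precision * recall / (precision + recall)

print('Accuracy:', accuracy)
print('Precision:', precision)
print('Recall:', recall)
print('F-Measure:', f_measure)
```

The results agree with the values printed by the metrics functions above.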

We can print all of them together in a tabular
form using the classification_report() function:

In [17]: from sklearn import metrics
         rep = metrics.classification_report(tst_y,pred_y)
         print(rep)

              precision    recall  f1-score   support

           0       0.89      0.91      0.90       719
           1       0.95      0.94      0.94      1281

    accuracy                           0.93      2000
   macro avg       0.92      0.93      0.92      2000
weighted avg       0.93      0.93      0.93      2000
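The two average rows of the report can be checked by hand. The sketch below takes the rounded precision and support values from the report for the two classes; the variable names are assumptions for the example, and because the inputs are already rounded the results only match the report to two decimals.

```python
# Per-class precision and support, as printed in the report
precision = {0: 0.89, 1: 0.95}
support = {0: 719, 1: 1281}

# Macro average: every class counts equally
macro_avg = sum(precision.values()) / len(precision)

# Weighted average: every class counts in proportion to its support
total = sum(support.values())
weighted_avg = sum(precision[c] * support[c] for c in precision) / total

print('macro avg:', round(macro_avg, 2))
print('weighted avg:', round(weighted_avg, 2))
```

The weighted average leans towards class 1 because it has almost twice as many samples as class 0.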

The support is the number of actual samples of each
class in the test set, here for the labels 0 and 1,
i.e. No Rain and Rain. The macro average is the plain mean of the
per-class scores (0.5 * score of class 0 + 0.5 * score of class 1
here), and the weighted average
weights each class score by its support (useful when
the classes are imbalanced). We can use these metrics to evaluate
the performance of a classifier. Next, let's look at the
metrics to evaluate a regression model. We will
use the KNN regressor we created earlier:
In [6]: from sklearn import metrics
        val = (tst_y, pred_y)

        err = metrics.max_error(*val)
        mae = metrics.mean_absolute_error(*val)
        mse = metrics.mean_squared_error(*val)
        rsq = metrics.r2_score(*val)

        print('Max Error:',err)
        print('MAE:',mae)
        print('MSE:',mse)
        print('R2:',rsq)

Max Error: 19.824399999999983
MAE: 7.167660000000001
MSE: 84.34250813599998
R2: 0.02620416439244655

We calculated the Maximum Error (maximum residual
error), MAE (mean absolute error, i.e. the average
absolute difference between each actual value and its
prediction), MSE (mean of the squared differences
between each actual value and its prediction) and R2
(explained variation / total variation) using the
max_error(), mean_absolute_error(),
mean_squared_error() and r2_score() functions
and printed them respectively. The lower the Max
Error, MAE and MSE are, the better the performance
of the model is, whereas for R2
closer to 1.0 is better. A constant
model that always predicts the expected
value of y, disregarding the input features, would
get an R2 score of 0.0, so our score of about 0.03
means the model explains very little of the variation in weight.
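To see what each regression metric measures, here is a from-scratch sketch in plain Python. The function name regression_metrics and the four actual/predicted values are made up for illustration; R2 is computed as 1 minus (residual sum of squares / total sum of squares), which is the definition r2_score() uses.

```python
def regression_metrics(actual, predicted):
    """Return (max error, MAE, MSE, R2) for two equal-length lists."""
    residuals = [a - p for a, p in zip(actual, predicted)]
    n = len(actual)
    max_err = max(abs(r) for r in residuals)
    mae = sum(abs(r) for r in residuals) / n
    mse = sum(r * r for r in residuals) / n
    mean_actual = sum(actual) / n
    ss_res = sum(r * r for r in residuals)                # unexplained variation
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)  # total variation
    r2 = 1 - ss_res / ss_tot
    return max_err, mae, mse, r2

# Assumed toy values (weights in lbs)
actual = [120.0, 130.0, 140.0, 150.0]
predicted = [125.0, 128.0, 138.0, 155.0]
max_err, mae, mse, r2 = regression_metrics(actual, predicted)
print(max_err, mae, mse, r2)

# A constant model that always predicts the mean of the actual values
constant = [sum(actual) / len(actual)] * len(actual)
r2_constant = regression_metrics(actual, constant)[3]
print(r2_constant)
```

The constant model leaves all of the variation unexplained, which is why its R2 is zero; a useful model should score clearly above that.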

Improving the model

After calculating the metrics of a model we
can take the following steps to improve the
performance of our models:
• Make sure to train the model with adequate data.
The dataset shouldn't have an abnormal distribution
of features or labels, like 5 samples of Yes and
95 samples of No
• After loading data we should always
apply the most suitable
preprocessing methods to our data
to improve its quality, like
encoding labels
• We shouldn't set aside too much data for testing, but
not too little either. For datasets with over
10k samples, 20% or less is adequate
• In cases of very little data you can create your own
samples for testing instead of splitting the
already scarce data
• You can always test different algorithms to
solve a problem, compare their
metrics, choose the best one and improve it

Saving and loading models

So we have created our model, tested it and even
improved it. Let's say we want to use the model
somewhere else or share it. How do we do that?
Well, we can do so using the joblib module. So
let's save our KNN weight predicting model using
joblib:

In [6]: import joblib
        joblib.dump(KNN, "WeightPred.sav")

Out[6]: ['WeightPred.sav']

We used the dump() function and passed the KNN
regressor and the "WeightPred.sav" filename as
arguments. We used the .sav extension
after the model name here, although joblib accepts any
filename. As we haven't specified any
specific location, the file is stored in the folder where
the Jupyter notebook is running.

[Figure: file list of the notebook folder, showing sales_data.csv and three other files]

Now we can open a new Jupyter notebook, import
joblib and load our model:

In [1]: import joblib
        KNN = joblib.load("WeightPred.sav")
        KNN.predict([[70]])

Out[1]: array([[133.4852]])

We loaded our model using the load() function
and passed the saved model's filename. We also asked the
model to predict the weight of a person with 70
inches of height and it returned about 133.5 pounds.
ML APPLICATION 1
• Movie Recommender

Problem: You have to create a model that
suggests the movie genre a person is likely to enjoy when the
person's age, gender and previously watched movie
genre are provided as input. Here is the sample dataset
for the recommendations:

[Link]

So the first step is to decide which method to
use. The task is to recommend the genre of a
movie, that is, to classify a person, so we will use
classification. Next, we need to decide which
algorithm to use. We aren't dealing with a huge
dataset, so we can go with the decision tree
classifier.
So let's start off by importing all the modules we
need:

In [1]: import pandas
        from sklearn.tree import DecisionTreeClassifier
        from sklearn.preprocessing import LabelEncoder

Now we can import the dataset and preview it
using the head() function:

In [2]: import pandas
        from sklearn.tree import DecisionTreeClassifier
        from sklearn.preprocessing import LabelEncoder
        dt = pandas.read_csv('[Link]')
        dt.head(3)

Out[2]:
   Age  Gender  Watched  Genre
0   19  Male    Comedy   Mystery
1   19  Female  Romance  Drama
2   19  Male    Romance  Drama

As you can see, we need to encode all of the
labels into numeric values. We can use one encoder
for the Gender labels and another encoder for the
Watched and Genre labels:

In [3]: # Gender Encoder
        gndr_enc = LabelEncoder()
        gndr_enc.fit(['Male','Female'])

Out[3]: LabelEncoder()

We created the gndr_enc Gender encoder and
passed the Gender labels to the fit() method. Now
we can create the Genre encoder. But before that
we need all the unique genre labels in both the
Watched and Genre columns:

In [4]: # Unique Genre label extraction
        watched = dt['Watched'].unique()
        genre = dt['Genre'].unique()
        Genres = [*watched]
        for ele in genre:
            if ele in Genres:
                continue
            else:
                Genres.append(ele)

First of all we extracted the unique labels from
the Watched and Genre columns using the unique()
method. Then we created another list variable and
filled it with the Watched uniques (note that we need a
single 1-D list, which is why the watched array
is unpacked with the * operator). Using the for
loop we added the unique labels of the Genre column
that aren't already present in the Genres list.
Now we can create our gnre_enc Genre encoder and
fit the Genres:

In [5]: # Unique Genre label extraction
        watched = dt['Watched'].unique()
        genre = dt['Genre'].unique()
        Genres = [*watched]
        for ele in genre:
            if ele in Genres:
                continue
            else:
                Genres.append(ele)

        # Genre Encoder
        gnre_enc = LabelEncoder()
        gnre_enc.fit(Genres)

Out[5]: LabelEncoder()
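It helps to know which number each genre receives. LabelEncoder sorts the distinct labels and numbers them from 0, so the codes follow alphabetical order rather than the order in which the labels were passed. The plain-Python sketch below imitates that mapping; the helper name fit_label_codes is made up, and the genre list is the set of six genres that this chapter's dataset contains.

```python
def fit_label_codes(labels):
    """Map each distinct label to its position in sorted order."""
    classes = sorted(set(labels))
    return {label: code for code, label in enumerate(classes)}

# The six genres, in an arbitrary (unsorted) order
genres = ['Comedy', 'Romance', 'Horror', 'Mystery', 'Drama', 'Fantasy']
codes = fit_label_codes(genres)
for g in genres:
    print(g, '-', codes[g])

# Decoding is the reverse lookup
decode = {code: label for label, code in codes.items()}
print(decode[4])
```

This is why Comedy is encoded as 0 and Romance as 5 when the recommender prints the encoded values.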

All the encoders are ready, so let's encode the
labels in our dataset with them:

In [6]: for col in ['Gender','Watched','Genre']:
            if col == 'Gender':   # Gender Encode
                dt[col] = gndr_enc.transform(dt[col])
            else:                 # Watched & Genre Encode
                dt[col] = gnre_enc.transform(dt[col])

Now we can divide our dataset into input and
output. Do this in a new cell: the labels are now
encoded, so if you re-run the cell above the encoder
will receive numbers instead of labels and raise an
error.

In [7]: X = dt.drop(columns='Genre')
        y = dt['Genre']

        CModel = DecisionTreeClassifier()
        CModel.fit(X,y)

Out[7]: DecisionTreeClassifier()

We have also trained our CModel, so we can use it to
make predictions. Let's create a function to
pass the values and return the Genre label:

In [8]: def recommend(age=18, gnd=0, watched=0, test=False):
            # Get the input interactively when testing
            if test:
                age = int(input("Age:"))
                gnd = int(input("Gender:"))
                for g in Genres:
                    print(g, '-',
                          *gnre_enc.transform([g]))
                watched = int(input("Watched:"))
            # Ask the model for a recommendation
            pred = CModel.predict([[age,gnd,watched]])
            # Decode the prediction to a label
            rec = gnre_enc.inverse_transform(pred)
            return rec[0]

So we defined a recommend() function with
four parameters: age (by default 18), gnd, the
gender (by default 0, i.e. Female), watched, the genre of the
previously watched movie (by default 0, i.e. Comedy) and
test (by default False), which we can use during
testing to type in the input values.
If we pass test as True, the function
asks for our input and also displays the
encoded value for each genre. Then the model
predicts using the input values. We take the
output (an encoded value), decode it and finally
return it.
So let's move on to a new cell, call our
recommend() function and pass True for
the test parameter:

In [*]: recommend(test=True)

Age: 18

You can see we are shown the input prompt
called in the recommend() function. So let's pass
18 as the age:

In [*]: recommend(test=True)

Age:18
Gender: 1

Pass the Gender as 1 (Male):

In [*]: recommend(test=True)

Age:18
Gender:1
Comedy - 0
Romance - 5
Horror - 3
Mystery - 4
Drama - 1
Fantasy - 2
Watched: 0

You can see the function has displayed the
encoded value for each genre, so let's pass
0 (Comedy):

In [9]: recommend(test=True)

Age:18
Gender:1
Comedy - 0
Romance - 5
Horror - 3
Mystery - 4
Drama - 1
Fantasy - 2
Watched:0

Out[9]: 'Mystery'

Now we get Mystery as the recommendation for
the 18-year-old male who has watched a comedy
movie previously. Because we have very little
data, let's check the answer visually using the
dataset:

Out[2]:
   Age  Gender  Watched  Genre
0   19  Male    Comedy   Mystery
1   19  Female  Romance  Drama
2   19  Male    Romance  Drama

In cell 2 we printed the first three rows
of our dataset, and by looking at them we can say that if an
18-year-old male (whose sample isn't present in
the dataset) has previously watched a Comedy
movie, he'll most likely enjoy a Mystery movie
too, along with Comedy movies.
So we have created our Movie Recommender model
using a very small dataset. Now it's up to you to
test the model, or even ask your
relatives for their details and recommend genres to them using the model!
ML APPLICATION 2
• Advertisement handling

Problem: You have to create a model to decide
whether to show an ad to a user or not and, if yes,
which one: the Car or the Insurance advertisement. The age of
the user and the user class (a group assigned by
another model based on the user's past search
results) are provided as input. You are provided
with the following dataset:

[Link]
Once again we need to decide which method to use,
and clearly this is a classification problem.
We can use the KNN classifier algorithm for this
task.

So let's move on to the Jupyter notebook and import
the algorithm, the LabelEncoder, the pandas library and
the dataset:

In [1]: import pandas
        from sklearn.neighbors import KNeighborsClassifier
        from sklearn.preprocessing import LabelEncoder
        dt = pandas.read_csv('[Link]')
        dt.head(3)

   Age  Search       Ad
0   18  Cars         Car
1   19  Automobiles  None
2   21  Automobiles  None

We again have very little data to work with. We
have Age and Search as input and Ad as output.
But we need to preprocess the labels:

In [2]: Enc = LabelEncoder()
        Enc.fit(['Cars','Automobiles','Health','Car',
                 'Insurance','None'])
        for col in ['Search','Ad']:
            dt[col] = Enc.transform(dt[col])

So we encoded the values in the Search and Ad
columns using the Enc encoder. Now we can move on to
the next step of data splitting, but as mentioned
earlier we don't have enough data for splitting it
into training and testing sets. So we need to
create the testing set ourselves. Let's take a
look at the whole dataset:

    Age  Search       Ad
 0   18  Cars         Car
 1   19  Automobiles  None
 2   21  Automobiles  None
 3   22  Cars         Car
 4   23  None         None
 5   26  None         None
 6   27  Health       Insurance
 7   28  Cars         Car
 8   18  Health       Insurance
 9   19  None         None
10   20  None         None
11   22  Automobiles  None
12   26  None         Insurance
13   30  Health       Insurance
14   30  Automobiles  Car
15   29  Cars         Car
16   29  Health       Insurance
17   29  None         Insurance
18   32  Automobiles  Car
19   32  Health       Insurance

We have 20 rows and 3 columns
worth of data. We already know
the input, i.e. Age and Search,
and the output, i.e. Ad, where
the input Search will be
provided by another classifier
which will classify the user
into the Cars, Automobiles, Health
and None classes on the basis
of previous searches. We can
visually build an understanding
from the data that a person of
age 18-28 should only be shown the
Car advertisement when the
person is in the Cars class, and
the Insurance advertisement
when the person is in the Health
class. Similarly, a person of
age 30 or more should be shown
the Car advertisement if the
person is in the Cars or
Automobiles class, and the Insurance
advertisement otherwise.
And a person aged 18-24 in the None class
should be shown nothing, but if the person is 25 or
more the Insurance ad should be shown.
All these assumed conditions are called
hypotheses. Using these hypotheses we can
create a testing dataset with close to accurate
outputs. So let's create it:

In [3]: import numpy

        X = dt.drop(columns='Ad')
        y = dt['Ad']

        def en(val):
            v = Enc.transform([val])
            return v[0]

        tst_X = numpy.array(
            [[21,en('None')],[21,en('Health')],
             [27,en('None')],[23,en('Automobiles')],
             [34,en('None')],[34,en('Automobiles')]])
        tst_y = numpy.array(
            [[en('None')],[en('Insurance')],
             [en('Insurance')],[en('None')],
             [en('Insurance')],[en('Car')]])

First of all we imported the numpy package to
create our test data. Then we divided our dataset
into training input and training output.
To create the testing set we use the en()
function, which takes a Search or Ad label
and returns the encoded value for it. Now we can
create some testing data based on our hypotheses:
for example, a 21-year-old person of class None
([21,en('None')]) should be shown no advertisement
([en('None')]).
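One way to keep the hypotheses explicit is to write them down as a small function and check that the hand-made test rows agree with it. The sketch below is only an illustration: the function name expected_ad is made up, and the exact age cut-offs (30 for the Automobiles class, 25 for the None class) are assumptions read off the description of the data.

```python
def expected_ad(age, search):
    """Return the ad our hypotheses suggest for a user (illustrative)."""
    if search == 'Cars':
        return 'Car'
    if search == 'Health':
        return 'Insurance'
    if search == 'Automobiles':
        # assumed cut-off: Automobiles users see the Car ad from age 30
        return 'Car' if age >= 30 else 'None'
    # None class: assumed cut-off at age 25 for the Insurance ad
    return 'Insurance' if age >= 25 else 'None'

# The six hand-made test rows: (age, search class, expected ad)
test_rows = [(21, 'None', 'None'), (21, 'Health', 'Insurance'),
             (27, 'None', 'Insurance'), (23, 'Automobiles', 'None'),
             (34, 'None', 'Insurance'), (34, 'Automobiles', 'Car')]

agreement = [expected_ad(age, search) == ad for age, search, ad in test_rows]
print(agreement)
```

Writing the rules as code also makes it obvious where they are only guesses, for example at the age boundaries, where the dataset has few or conflicting samples.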
As mentioned earlier, the testing set is built
upon hypotheses, which may be correct or wrong. We
created them for the purpose of testing our
model. We should only do this in situations like
this one, where only a small
dataset is available. We can visualize the testing set using a
pandas data frame:

In [4]: def dec(val):
            v = Enc.inverse_transform(val)
            return v

        tst = pandas.DataFrame({'Age': tst_X[:,0],
                                'Search': dec(tst_X[:,1]),
                                'Ad': dec(tst_y[:,0])})
        tst

   Age  Search       Ad
0   21  None         None
1   21  Health       Insurance
2   27  None         Insurance
3   23  Automobiles  None
4   34  None         Insurance
5   34  Automobiles  Car

We defined the dec() function to keep our code short.
Some of the ages in the testing set are not present in the
actual dataset. Now we can move on to creating our
KNN classifier and training it:

In [5]: KNN = KNeighborsClassifier()
        KNN.fit(X,y)

Out[5]: KNeighborsClassifier()

Now we can pass the test input to the KNN classifier
and print the accuracy:

In [6]: from sklearn.metrics import accuracy_score
        pred_y = KNN.predict(tst_X)
        accuracy_score(tst_y, pred_y)

Out[6]: 0.8333333333333334

So our model has an accuracy of approx 83%, and
given that the testing set has only 6 rows, our
model has predicted correctly for 5 inputs and wrongly
for only one.
But remember that the testing set is based upon
our hypotheses, so maybe what the model predicted
is right.
You may or may not have wondered about it
until now, but the testing
set was based upon the hypotheses we created, i.e.
we analyzed the data, found connections between the
features and labels, and created the testing set, and
the model is thinking the same way as us for
5 of the inputs. The hypotheses we built are the same
patterns and relations the model uses to
predict. We did that just by having
a thorough look at the data, which is most likely
not wrong, but the model does everything in
a few milliseconds. Imagine there were tens of
thousands of samples like these! Could you have done
the same there?
I hope your understanding of 'machine' and
'learning' in machine learning is clearer now.
ML APPLICATION 3
• Checking wine quality

Problem: In a wine factory, you are asked to
rate the quality of the production on a scale of 1
to 5 when different chemical properties of a produced
batch are passed as input, and
then tell whether the batch is good or not. A good
rating is more than half of the scale (2.5). You have the
following sample dataset of some 1500 samples:

wine_sample.csv (sample: input and output values)
wine_batch.csv (batch)

We can use machine learning models to solve the
problem, but the question is which algorithm to
choose.
If you are thinking of using a classifier algorithm
because we need to rate the wine, then
you're wrong. The rating is on a continuous scale of 1
to 5, where it can be 3 or 3.5 or even
3.45, so for this problem we are going to use
linear regression.
So let's import our sample dataset along with the other
necessities and preview it with
the describe() function:

In [1]: import pandas
        import numpy
        from sklearn import metrics
        from sklearn.linear_model import LinearRegression
        from sklearn.model_selection import train_test_split
        dt = pandas.read_csv('wine_sample.csv')
        (dt.describe()).round(1)
Out[1]:
       fixed acidity  volatile acidity  citric acid  residual sugar  chlorides
count         1499.0            1499.0       1499.0          1499.0     1499.0
mean             8.4               0.5          0.3             2.5        0.1
std              1.7               0.2          0.2             1.4        0.0
min              4.6               0.1          0.0             0.9        0.0
25%              7.2               0.4          0.1             1.9        0.1
50%              8.0               0.5          0.3             2.2        0.1
75%              9.3               0.6          0.4             2.6        0.1
max             15.9               1.6          1.0            15.5        0.6

       free sulfur dioxide  total sulfur dioxide  density      pH  sulphates  alcohol  quality
count               1499.0                1499.0   1499.0  1499.0     1499.0   1499.0   1499.0
mean                  15.6                  46.8      1.0     3.3        0.7     10.4      5.6
std                   10.5                  33.3      0.0     0.2        0.2      1.1      0.8
min                    1.0                   6.0      1.0     2.7        0.3      8.4      3.0
25%                    7.0                  22.0      1.0     3.2        0.6      9.5      5.0
50%                   13.0                  38.0      1.0     3.3        0.6     10.1      6.0
75%                   21.0                  63.0      1.0     3.4        0.7     11.1      6.0
max                   72.0                 289.0      1.0     4.0        2.0     14.9      8.0

In this dataset we have twelve columns with
about 1500 samples each. The first eleven columns are
different chemical properties of wine, i.e. the input,
and quality is the rating, i.e. the output. The minimum
rating is 3 and the maximum is 8. But we need to
rate the quality of wine on a scale of 1 to 5. So
we need to rescale the quality column to the
range 1 to 5, and we will do that using the
MinMaxScaler:

In [2]: from sklearn.preprocessing import MinMaxScaler
        Sclr = MinMaxScaler(feature_range=(1, 5))
        qal = numpy.array(dt['quality'])
        dt['quality'] = Sclr.fit_transform(qal.reshape(-1,1))

We imported the MinMaxScaler and created our
Sclr object of the class. We passed the scale in
the feature_range parameter, i.e. 1-5. Then we
created a numpy array of the quality column. Then
we scaled the data using the fit_transform()
function. Note that we passed the array reshaped
with reshape(-1,1), which
converts the 1-D array [5,6,7,...] to the 2-D array
[[5],[6],[7],...]
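The rescaling itself is a simple linear formula: the smallest value is mapped to the lower end of the range, the largest value to the upper end, and everything else proportionally in between. The plain-Python sketch below shows it for the rating values 3 to 8; the function name min_max_scale and the sample list are assumptions for the example.

```python
def min_max_scale(values, feature_range=(1, 5)):
    """Linearly rescale values so min -> low end and max -> high end."""
    low, high = feature_range
    v_min, v_max = min(values), max(values)
    return [low + (v - v_min) * (high - low) / (v_max - v_min)
            for v in values]

ratings = [3, 4, 5, 6, 7, 8]  # original quality scale in the dataset
scaled = min_max_scale(ratings)
for old, new in zip(ratings, scaled):
    print(old, '->', round(new, 1))
```

So an original rating of 3 becomes 1, a rating of 8 becomes 5, and the common ratings of 5 and 6 land at 2.6 and 3.4.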
Now we can split the data, create our linear
regressor and train it:

In [3]: X = dt.drop(columns='quality')
        y = dt['quality']
        trnX,tstX,trnY,tstY = train_test_split(X,y,test_size=0.1)

        Reg = LinearRegression()
        Reg.fit(trnX,trnY)

Out[3]: LinearRegression()

Before checking the quality of the given batch we
need to test our model and find some metrics. So
let's use the testing sets and compare the model's
predictions:

In [4]: predY = Reg.predict(tstX)
        mae = metrics.mean_absolute_error(tstY,predY)
        err = metrics.max_error(tstY,predY)
        cmp = pandas.DataFrame({'Predicted': predY,
                                'Actual': tstY.values})
        print('MAE:', mae, '\n', 'Max RE:', err)
        cmp.plot(figsize=(7.5,6))

MAE: 0.43020099299063136
 Max RE: 1.5037184643992099

Our model has an MAE (mean absolute error) of
approx 0.43, i.e. the average of the absolute
errors, with a maximum residual error (max error)
of 1.5. We have a lot of values so let's plot the
graph for comparing the values

Out[4]: <matplotlib.axes._subplots.AxesSubplot at 0x23b10ef1160>

Looking at the data we can tell how our model is
performing. By observing the graph we can tell
that our model didn't rate any input as 5, whereas
the actual values contain a 5 only 3 times, which
explains everything: higher ratings are rare in
the data, so the model rarely predicts them. Still,
our model is fine, so let's import the batch
dataset and pass it to the model

In [5]: batch = pandas.read_csv('wine_batch.csv')

        batch_pred = Reg.predict(batch.values)
        batch_pred.mean()

Out[5]: 3.135502047867703

We imported the csv file and passed the values for
predictions. On average the rating is 3.14, and if
we take the MAE (0.43) into account the rating
could also be 2.71 or 3.57. But in all of these
cases the average rating of the batch is higher
than 2.5, so it is fine!
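The arithmetic behind those two bounds can be written out in a few lines. This is only a rough band built from the numbers printed above; the variable names are made up for the sketch, and the MAE is an average error rather than a guaranteed limit.

```python
# Values reported above: mean predicted rating of the batch and the model's MAE
batch_mean = 3.135502047867703
mae = 0.43020099299063136
threshold = 2.5  # pass mark used in the text

low = batch_mean - mae    # pessimistic estimate
high = batch_mean + mae   # optimistic estimate

print(f"pessimistic: {low:.2f}, optimistic: {high:.2f}")
print("batch passes:", low > threshold)
```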
24 ML APPLICATION 4
• Match play decision

Problem: You have to decide whether a match can
be played or not if the temperature, humidity and
rainfall status are provided as input. The
following dataset has over 1000 past days of
samples:

[Link]

Once again we need to decide which method to use,
and clearly this is a classification problem. We
can use the decision tree classifier because we
need to predict Yes or No. Because the task is
simple and we have a wide range of data, the tree
should be really helpful

So let's import the modules and functions we need
for this model, which are:

import pandas
from sklearn.tree import DecisionTreeClassifier
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

and our dataset, and preview it without the
head() function

In [1]: import pandas

        from sklearn.tree import DecisionTreeClassifier
        from sklearn.preprocessing import LabelEncoder
        from sklearn.metrics import classification_report
        from sklearn.model_selection import train_test_split
        dt = pandas.read_csv('play_stats.csv')
        dt

      Temperature  Humidity%  Rain  Play
0              34       74.2   Yes    No
1              19       68.2    No    No
2              28       67.2   Yes   Yes
3              29       66.6   Yes    No
4              26       57.9   Yes   Yes
...           ...        ...   ...   ...
996            28       62.0   Yes   Yes
997            27       71.4   Yes    No
998            19       54.1    No   Yes
999            32       57.4   Yes    No
1000           24       61.4   Yes    No

1001 rows x 4 columns

We need to encode the Rain and Play labels, so
let's create an encoder and encode the labels

In [2]: Enc = LabelEncoder()

        Enc.fit(['Yes','No'])
        for col in ['Rain','Play']:
            dt[col] = Enc.transform(dt[col])

We simply created an encoder Enc using the
LabelEncoder and then fitted the 'Yes' and 'No'
values. Then using a for loop and transform() we
encoded the labels.
Note that you may get an error if you re-run the
cell, because the labels are already encoded and
transform() will receive numbers this time
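It helps to know which number each label received, because the classification report later shows only 0 and 1. The sketch below fits a fresh encoder on the same two labels; the lowercase name enc is used only to keep it separate from the chapter's Enc object.

```python
from sklearn.preprocessing import LabelEncoder

enc = LabelEncoder()
enc.fit(['Yes', 'No'])

# classes_ is stored in sorted order; the position of a label is its integer code
print(list(enc.classes_))

codes = enc.transform(['Yes', 'No', 'Yes'])
print(list(codes))

# inverse_transform maps the codes back to the original labels
print(list(enc.inverse_transform([0, 1])))
```

Since the classes are sorted alphabetically, No becomes 0 and Yes becomes 1, regardless of the order in which they were passed to fit().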


Now we can split the dataset into training and
testing sets, create our classifier and train it

In [3]: X = dt.drop(columns='Play')
        y = dt['Play']
        trnX,tstX,trnY,tstY = train_test_split(X,y,test_size=0.1)
        CModel = DecisionTreeClassifier()
        CModel.fit(trnX,trnY)

Out[3]: DecisionTreeClassifier()

Our classifier CModel is ready, so now we can
test it and print the classification report

In [4]: predY = CModel.predict(tstX)

        print(classification_report(tstY,predY))

              precision    recall  f1-score   support

           0       1.00      1.00      1.00        70
           1       1.00      1.00      1.00        31

    accuracy                           1.00       101
   macro avg       1.00      1.00      1.00       101
weighted avg       1.00      1.00      1.00       101
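To read the columns of the report, it can help to compute them by hand once. The sketch below is a plain-Python version of the standard precision, recall and F1 definitions; class_scores is a made-up helper name, the label lists are rebuilt from the supports shown above (70 and 31), and the second case with one mistake is purely hypothetical.

```python
def class_scores(y_true, y_pred, label):
    """Precision, recall, F1 and support for one class label."""
    tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
    fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    support = tp + fn  # number of true samples of this class
    return precision, recall, f1, support

# Labels laid out to match the supports in the report: 70 zeros and 31 ones
y_true = [0] * 70 + [1] * 31
perfect = list(y_true)   # every prediction correct, as in the report

# Hypothetical variation: one class-1 sample predicted as class 0
one_miss = list(y_true)
one_miss[-1] = 0

print(class_scores(y_true, perfect, 1))
print(class_scores(y_true, one_miss, 1))
```

With every prediction correct all three scores are 1.0; a single missed sample of class 1 already lowers its recall to 30/31.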

Wow! We have created an A-grade classifier! It's
the first time I've seen this in practical
applications.
Our model scored 100% on the test set, so no
questions there. Analysing the classification
report we can tell that No (0) and Yes (1) have a
distribution of roughly 70%-30%. Now you can
define a function like we did with our movie
classifier, get the input data from the user
(input prompt) and print Play or Can't Play
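One possible shape for such a function is sketched below. The name can_play and its arguments are assumptions made for this example, and the small tree is trained only on the ten rows visible in the preview so that the sketch runs on its own; in the chapter you would pass the trained CModel instead.

```python
from sklearn.tree import DecisionTreeClassifier

# The ten rows visible in the preview; Rain and Play encoded as No=0, Yes=1
rows = [
    (34, 74.2, 1, 0), (19, 68.2, 0, 0), (28, 67.2, 1, 1), (29, 66.6, 1, 0),
    (26, 57.9, 1, 1), (28, 62.0, 1, 1), (27, 71.4, 1, 0), (19, 54.1, 0, 1),
    (32, 57.4, 1, 0), (24, 61.4, 1, 0),
]
X = [list(r[:3]) for r in rows]   # Temperature, Humidity%, Rain
y = [r[3] for r in rows]          # Play

model = DecisionTreeClassifier(random_state=0).fit(X, y)

def can_play(model, temperature, humidity, rain):
    """Return 'Play' or "Can't Play" for one day (rain is 'Yes' or 'No')."""
    rain_code = 1 if rain == 'Yes' else 0
    prediction = model.predict([[temperature, humidity, rain_code]])[0]
    return 'Play' if prediction == 1 else "Can't Play"

print(can_play(model, 28, 67.2, 'Yes'))
print(can_play(model, 34, 74.2, 'Yes'))
```

To take the values from the user, each argument can be read with input() and converted with float() before calling the function.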
25 ML APPLICATION 5
• Best striking formation (Player statistics)
Problem: You have to create (predict) a formation
for a football team with the best 3 strikers out
of 5 players, given their statistics from previous
matches. Here is the sample data of performance
and rating of each player in the previous 6
matches:

[Link]

So which method do you think we should use for
this problem? Should we use regression or
classification or something else? You may think we
should use a regressor like the wine rating model,
and you are right, but the ratings there needed to
be accurate to within 0.01 points, whereas here the
player ratings are integers, i.e. non-decimal
values like 5 or 6. So let's use both of the
methods for this problem to rate the players and
then we will define a function to create a
formation of the best three players.
So let's import the sample dataset and the
following:

import pandas
from sklearn.preprocessing import LabelEncoder
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LinearRegression

In [1]: import pandas
        from sklearn.preprocessing import LabelEncoder
        from sklearn.tree import DecisionTreeClassifier
        from sklearn.linear_model import LinearRegression
        dt = pandas.read_csv('Players_data.csv')
        dt

Out[1]:
    Player  Possesion%  Pass%  Goals  Shots  Rating
0    Silva          55     60      2      7       9
1    Deigo          54     64      0      6       5
2   Robert          57     77      2      6      10
3   Davies          49     75      1      5       7
4     Paul          54     65      0      6       6
5    Silva          51     71      1      2       6
6    Deigo          50     74      1      2       6
7   Robert          50     70      1      5       7
8   Davies          51     68      1      5       7
9     Paul          58     61      2      5       9
10   Silva          51     70      1      2       6
11   Deigo          56     71      2      7      10
12  Robert          57     59      2      5       9
13  Davies          52     60      1      3       6
14    Paul          57     77      2      7      10
15   Silva          58     64      2      6       9
16   Deigo          51     68      1      5       7
17  Robert          59     71      2      5      10
18  Davies          52     64      1      2       5
19    Paul          52     66      1      6       7
20   Silva          54     63      0      2       4
21   Deigo          52     64      1      7       6
22  Robert          55     66      2      6      10
23  Davies          59     63      2      6       9
24    Paul          50     75      1      3       7
25   Silva          51     69      1      3       7
26   Deigo          54     60      0      5       5
27  Robert          54     64      0      2       4
28  Davies          50     72      1      5       7
29    Paul          54     65      0      2       5
We have 5 players and their performances in the
previous 6 matches, i.e. 30 samples in total and 6
samples for each player. All the ratings are
rounded off to non-decimal values, that's why we
will create a classifier along with a regressor.
But we need to encode the player names before
moving forward

In [2]: Enc = LabelEncoder()

        plyrs = ['Silva','Deigo','Robert','Davies','Paul']
        Enc.fit(plyrs)
        dt['Player'] = Enc.transform(dt['Player'])

So we have encoded the Player labels. Now let's
move on to splitting, where we will use the first
five matches' data for training and the sixth
match for testing our models

In [3]: X = dt.drop(columns='Rating')
        y = dt['Rating']
        trnX = X[:25]
        tstX = X[25:]
        trnY = y[:25]
        tstY = y[25:]
        RegModel = LinearRegression()
        CModel = DecisionTreeClassifier()
        RegModel.fit(trnX,trnY)
        CModel.fit(trnX,trnY)

Out[3]: DecisionTreeClassifier()

First of all we separated the data into input, i.e.
the players along with their performance, and
output, i.e. the rating of each player. Then we
split the data into training and testing sets.
As mentioned earlier we will use the data of the
first five matches (5 players in 5 matches,
5*5=25) for training and the last match for
testing, which we did manually using the slice
syntax.
Then we created our regressor RegModel and
classifier CModel and trained them with the first
five matches' data
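The slice works because the table lists the five players in the same order for every match. The sketch below rebuilds only that row layout in plain Python (no real statistics) to show which rows land in each set; the tuple layout and variable names are assumptions for illustration.

```python
players = ['Silva', 'Deigo', 'Robert', 'Davies', 'Paul']
n_matches = 6

# Row i of the table belongs to player i % 5 in match i // 5 + 1
rows = [(i, players[i % 5], i // 5 + 1)
        for i in range(len(players) * n_matches)]

train = rows[:25]   # rows 0-24: matches 1 to 5
test = rows[25:]    # rows 25-29: match 6

print(len(train), len(test))
print(test)
```

If the rows were shuffled or sorted by player, the same slice would no longer separate the sixth match, so it is worth checking the order of the data before slicing by position.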

So let's pass the sixth match data and let the
models rate the players based on what they have
learnt. We can create a data frame to compare the
values

In [4]: pred_reg = RegModel.predict(tstX)
        pred_cls = CModel.predict(tstX)

        cmp = pandas.DataFrame({'Regressor':pred_reg,
                                'Classifier':pred_cls,
                                'Actual':tstY.values},
                               index=plyrs)
        cmp

        Regressor  Classifier  Actual
Silva    6.404661           7       7
Deigo    4.613903           5       5
Robert   4.158141           5       4
Davies   6.821053           7       7
Paul     4.160626           5       5

We can also plot the data to analyse it visually

In [5]: cmp.plot(kind='bar')

Out[5]: <matplotlib.axes._subplots.AxesSubplot at 0x18bb7249




To our surprise the classifier model has an
accuracy of 80%, i.e. it accurately rated 4
players. But is that it? If you observe carefully,
the regressor model is lagging just a bit behind,
and to its credit, when rating Robert the
regressor was closer to the actual value than the
classifier. We can also print some metrics to get
more insights

In [6]: from sklearn import metrics

        mae = metrics.mean_absolute_error(tstY,pred_reg)
        err = metrics.max_error(tstY,pred_reg)
        acc = metrics.accuracy_score(tstY,pred_cls)
        print('MAE:',mae,f'where max error is{err}')
        print('CModel Accuracy:',acc)

MAE: 0.431579401888947 where max error is0.83937


CModel Accuracy: 0.8
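These three numbers can be checked by hand from the comparison table. The sketch below copies the five rows of the table (rounded to six decimals as printed, so the MAE agrees only to about six digits) and recomputes each metric in plain Python; the list names are made up for the example.

```python
# Values copied from the comparison table (Silva, Deigo, Robert, Davies, Paul)
regressor = [6.404661, 4.613903, 4.158141, 6.821053, 4.160626]
classifier = [7, 5, 5, 7, 5]
actual = [7, 5, 4, 7, 5]

abs_errors = [abs(a - p) for a, p in zip(actual, regressor)]

mae = sum(abs_errors) / len(abs_errors)   # mean absolute error
max_err = max(abs_errors)                 # largest single miss (Paul)
accuracy = sum(a == p for a, p in zip(actual, classifier)) / len(actual)

print(round(mae, 4), round(max_err, 4), accuracy)
```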

So we can draw the conclusion that the regressor
is the close-to model, i.e. it is close to the
actual values by about 0.4 on average, which can
double in the worst cases, whereas the classifier
is the on-point model for the correct predictions,
but if it predicts wrong the values can differ
from the actual by 1.0 or more.
So according to you which model is preferable?
You may think the classifier, because it is an
on-point model, but it comes with its risks too,
i.e. its wrong predictions differ more from the
actual values than the regressor's max error of
0.84.
In my opinion both models have their own
advantages. But we need to choose one, so let's
choose the regressor. Why the regressor and not
the classifier? Using these models always comes
with the risk of some wrong predictions, but we
must keep in mind that we have to choose the best.
The regressor has a mean absolute error of 0.4,
but it stays close to the actual values, i.e. most
of the time the error is smaller, and when it is
larger, it is only by a bit

All the values have a difference of 0.4 on
average from the actual values, so what if we
round off the values and plot the graph again? We
will get something like this

In [7]: cmp['Regressor'] = round(cmp['Regressor'])

        cmp.plot(kind='bar')

Out[7]: <matplotlib.axes._subplots.AxesSubplot at 0x18bb7375




Now the regressor has 3 values on-point but 2
values with a difference of 1.0, and the
classifier has 4 values on-point but 1 value with
a difference of 1.0. So if the error is around 0.4,
rounding brings the prediction to the actual
value, but if the error is 0.6 or 0.8, as with
Silva and Paul, rounding moves it to a full 1.0
difference. Now what is your judgement? Which
model should we use?
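The effect of rounding can be checked directly on the regressor column of the comparison table. The following sketch uses plain Python and the printed values; the variable names are only illustrative.

```python
players = ['Silva', 'Deigo', 'Robert', 'Davies', 'Paul']
regressor = [6.404661, 4.613903, 4.158141, 6.821053, 4.160626]
actual = [7, 5, 4, 7, 5]

# Round every regressor prediction to the nearest whole rating
rounded = [round(value) for value in regressor]

# Players whose rounded rating matches the actual one
on_point = [p for p, r, a in zip(players, rounded, actual) if r == a]

# Players whose rounded rating misses, with the size of the miss
off = {p: abs(r - a) for p, r, a in zip(players, rounded, actual) if r != a}

print(rounded)
print(on_point)
print(off)
```

Silva (6.40) and Paul (4.16) sit more than 0.5 below their actual ratings, so rounding sends both to the lower whole number.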

So far we have learned which model does what, but
at times like these we also need to choose between
the models. I would like you to choose one of them
and come up with the best formation of strikers
using your chosen model and the provided test
values:

[Link]

Hint: Remember, what are the best qualities of a
model?
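Whichever model you pick, the last step is to turn its ratings into a formation. A minimal sketch of such a function is shown below; best_three is a made-up name, and the ratings are the regressor's sixth-match predictions from the comparison table, not the exercise's test values.

```python
def best_three(ratings):
    """Return the names of the three highest-rated players, best first."""
    ranked = sorted(ratings, key=ratings.get, reverse=True)
    return ranked[:3]

# Regressor ratings for the sixth match, copied from the comparison table
sixth_match = {'Silva': 6.404661, 'Deigo': 4.613903, 'Robert': 4.158141,
               'Davies': 6.821053, 'Paul': 4.160626}

print(best_three(sixth_match))
```

Integer ratings from a classifier can tie (for example three players rated 5), in which case the third striker is decided only by the order of the input, while decimal ratings from a regressor rarely tie.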
26 RESOURCES
• Datasets
• Forum

All the datasets used in this book, whether
json files, txt files, csv files, etc.,
can be found at the link below:
[Link]/datasets
[Link]

If you have any doubts or unanswered questions,
feel free to drop them in the forum below, which
was created to help you:
[Link]/community

[Link]

To separate the readers of this book from
others, you can use the credentials below to
log in, or scan the code:

Username : ml_reader
Password : ml_reader_forum_pass
What's next?
It's been quite a long journey where we learnt a
lot of things. I would like you to look back at
the date when you purchased this book and at what
you've gained since. But it's only the start,
there's still more to learn! I would be really
obliged to know your views about the book and how
your experience was, so please leave a review here
Leave your thoughts
And Best of luck on your way ahead!
