PYTHON
MACHINE LEARNING
Machine learning is one of the skills
considered a must for the coming years. As
tasks grow in number, programming each one
by hand has become time consuming. Machine
learning allows machines to learn on their
own from the data we feed them and still
produce the results we need.
I assume that you have prior
knowledge of Python programming and
data science. If you don't,
you can check the books on the next page.
A complete guidebook for anyone who
wants to master machine learning
with Python.
Rahul Mula
Machine Learning with Python
by Rahul Mula
© 2020 Machine Learning with Python
All rights reserved. No portion of this
book may be reproduced in any form
without permission from the copyright
holder, except as permitted by U.S.
copyright law.
Cover by Rahul Mula.
All the programs written in this book
are tested and verified by the author.
Cover Template from [Link]
ISBN : 979-8-58-426755-1
MAKE SURE TO CHECK THEM OUT
Python For Beginner
A beginners guide to programming
with python
Data Science with Python
Learn how to perform tasks like data
processing, cleansing, analysis and
visualization
Why should you learn machine learning, and what are
its uses? These are the questions that may come to
your mind. The answer is simple. Suppose you are
given data from an online store about its products
and asked to recommend products. The data has product
name, category, quantity, and rate columns,
with several hundred rows of products. If you want to
perform some analysis, such as finding the product
purchased most on a given day, doing it manually will
take a lot of time. To ease such tasks, we use data
analysis, i.e. we run a program that performs
a certain analysis of the data. The computer runs the
program and we get the output in just a few seconds.
Then we classify each user and suggest products
from the categories the user is likely to prefer, based
on the user's previous search terms. How do we do that?
Well, we need to learn Data Science and Machine
Learning to perform those tasks.
Businesses & organizations are trying to deal with
this data by building intelligent systems using the
concepts and methodologies of Data Science, Data
Mining and Machine Learning. Among them, machine
learning is the most exciting field of computer
science. It would not be wrong to call machine
learning the application and science of algorithms
that make sense of data.
This book is prepared especially for beginners (at
Data Science and Machine Learning), but you should
be familiar with programming in Python. We will work
with packages and modules like NumPy, SciPy,
Pandas, Matplotlib, Scikit Learn, etc. to perform
analysis and other tasks. I have kept the book open to
the basic concepts of data science to help beginners
understand everything, but it only covers the
data science concepts that machine learning requires.
As the name suggests, this book is not for you if you
are looking for data science; check the other books
page to find that. I have also included advanced topics
so that you are not limited to the basics. Machine
learning, algorithms, data science, etc. may seem tough
and boring, but as you handle more and more data,
you'll play with it!
(contents)
03 PANDAS
• Features of Pandas Library
• Series
• Data Frames
06 MATPLOTLIB
• Features of matplotlib
• Data visualization
• PyPlot in matplotlib
07 SCIKIT LEARN
• Features of Scikit-learn library
• How to work with data?
• Why use Python?
08 TYPES OF MACHINE LEARNING
• Supervised learning
• Unsupervised learning
• Deep learning
MATHEMATICS FOR MACHINE LEARNING
• Data instances
• Statistics
• Probability
SCIKIT LEARN ALGORITHMS
• Regression algorithm
• Classification algorithm
• Clustering algorithm
IMPORTING DATA
• Importing CSV data
• Importing JSON data
• Importing Excel data
DATA OPERATIONS
• NumPy operations
• Pandas operations
• Cleaning data
DATA ANALYSIS & PROCESSING
• Data analytics
• Correlations between attributes
• Skewness of the data
14 DATA VISUALIZATION
• Plotting data
• Univariate plots
• Multivariate plots
16 CLASSIFICATION
• Decision tree
• Linear regression
• Naive Bayes
20 PERFORMANCE & METRICS
• Calculating the model
• Improving the model
• Saving and loading models
01 MACHINE LEARNING INTRODUCTION
• What is Machine Learning?
• Use of Machine Learning
• How Machines Learn?
What is Machine Learning?
"Data is what you need to do ANALYTICS,
Information is what you need to do BUSINESS."
Commonly referred to as the "oil of the 21st
century", our digital data carries the most
importance in the field. It has incalculable
benefits in business, research and our everyday
lives.
Machine Learning is the field of computer science
where machines provide meaning to data like we
humans do. Machine Learning is a type of
artificial intelligence which finds patterns in raw
data through various algorithms and makes
predictions like humans. Machine Learning also
means that machines learn on their own. To better
understand it, think of a newborn child and compare
the child to a machine learning model. The parents
cannot teach the child everything, so they send the
child to school; in this analogy, you are the school
for the machine learning model. The school has
textbooks, tests, etc. to help the child learn on their
own, which correspond to the data.
Uses of Machine Learning
Organizations are investing heavily in
technologies like Artificial Intelligence, Machine
Learning and Deep Learning to get the key
information from data, perform several
real-world tasks and solve problems. We can call
these data-driven decisions taken by machines,
particularly to automate a process. Such
data-driven decisions can be used, instead of
programming logic, in problems that
cannot be programmed explicitly. The fact is that
we can't do without human intelligence, but we
also need to solve real-world problems
efficiently at a huge scale. That is
why the need for machine learning arises.
The following are some of its applications in the
real world:
• Forecasting the weather a day beforehand by
finding patterns in the weather data
of previous days
• Predicting the future prices of stocks in the
stock market
• Suggesting a product to a customer in an online
store according to the user's previous
search terms
How do machines learn?
So what magic happens that lets machines learn like
us and perform tasks? Let's understand that with an
example. Let's say you are a new computer dealer
with very basic experience. So you ask
another dealer for information and
note down the points that matter,
like the processor cores, RAM and GPU. The dealer
tells you these and you learn in return. Then you
are provided with the following data about 8 GB RAM
from different brands:
[Figure: scatter plot of RAM frequency (about 1500 to 3500 MHz) against price (about 60 to 85)]
By observing the data we can tell that the price
increases as the frequency increases.
You understand the simple logic behind the data.
So if you get RAM with a specification that is in
the data, you can tell its price; for example, RAM
with a frequency of 1666 MHz has a price of 60.
But what if you get a frequency that you don't
have a record of, like 2600 MHz? Then you have to
learn how to decide the price. We start to find a
way to calculate it from the given data. We assume
that there is a linear relationship between
frequency and price, and we describe that
relationship as a straight line, as shown below:
[Figure: the same frequency and price data with a straight line drawn through the points]
Now we can use the line as a reference and predict
values. So, for 2600 MHz the cost will be about 77.
So how do we draw the line? We follow the
formula cost = a + b * Mhz, but what are a and b?
a and b are the parameters of the straight line (its
intercept and slope), which you don't need to sweat about.
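The straight-line idea can be tried out in a few lines of NumPy. This is only a minimal sketch: the frequency and price values below are made up for illustration (they are not the data behind the figure), and np.polyfit is used here as one convenient way to estimate a and b.

```python
import numpy as np

# Hypothetical RAM data (assumed values, chosen only for illustration)
mhz = np.array([1666, 2000, 2400, 3000, 3200])
price = np.array([60, 66, 73, 84, 88])

# Fit price = a + b * mhz; polyfit returns the slope first, then the intercept
b, a = np.polyfit(mhz, price, 1)

# Predict the price for a frequency that is not in the data
predicted = a + b * 2600
print(round(predicted))
```

With these made-up values the prediction for 2600 MHz lands close to the value of about 77 read off the line.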
Likewise in machine learning, the machine, i.e. the
computer, learns the patterns or relations in the
data through algorithms and predicts a value when
a new input is given.
So will there be no errors? No, the predictions will
sometimes differ from the actual answers. We also make
many mistakes, but we learn from them, or
change our tutor if the results stay poor.
Machine learning models too learn from their
mistakes, and we change the algorithm when the results
do not improve.
02 SETTING-UP ENVIRONMENT
• Installing Anaconda
• Jupyter notebook
• Working with Jupyter notebook
Installing Anaconda
Head to [Link]/products/individual to
download the latest version of Anaconda.
You can download the Anaconda installer for your
system, whether it is Windows, Mac or Linux. After
downloading it, just run the installer and install
Anaconda. Then search for Anaconda Prompt (anaconda3)
on your system and open it.
This is the Anaconda Command Prompt, from where we
can run programs or perform other operations by
typing commands.
Anaconda Prompt (anaconda3)
(base) C:\Users\Rahul>
Jupyter Notebook
The Jupyter Notebook is an open-source
web application that allows you to create
and share documents that contain live
code, equations, visualizations and
explanatory text. We will use it to perform
our data processing, analytics,
visualization, etc. on the go.
To open Jupyter Notebook, type jupyter notebook in
the Anaconda command prompt and press Enter.
(base) C:\Users\Rahul>jupyter notebook
[I NotebookApp] JupyterLab extension loaded from C:\Users\Rahul\anaconda3\lib\site-packages\jupyterlab
[I NotebookApp] JupyterLab application directory is C:\Users\Rahul\anaconda3\share\jupyter\lab
[I NotebookApp] Serving notebooks from local directory: C:\Users\Rahul
[I NotebookApp] The Jupyter Notebook is running at:
[I NotebookApp] [Link]
[I NotebookApp] or [Link]
[I NotebookApp] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).
[C NotebookApp]
    To access the notebook, open this file in a browser:
        [Link]
    Or copy and paste one of these URLs:
        [Link]
     or [Link]
Anaconda will redirect you to your browser [it may
ask which browser to use for your Jupyter
notebook if you have more than one browser], and a
new tab will appear with your Jupyter notebook.
It lists the files and folders of your user directory.
You can keep your Python files here and also run the
code on the fly.
In Jupyter Notebook installed through Anaconda, we
don't need to install most other modules, packages or
libraries separately; what we need is already present.
You can also code online without
installing any IDE or the Python interpreter, which
makes it a popular choice for data scientists.
Working with Jupyter Notebook
To start coding, click on New and select Python 3 to
open a new notebook.
[Screenshot: the New menu in Jupyter, with the options Python 3, Text File, Folder and Terminal]
This is the place where we will write our code [in
the cell] and run it.
If you cannot create a new file or encounter any
error, you can head directly to [Link]/try and
choose Python.
We can rename our file by clicking its name
[Untitled].
We have only one code cell; in this cell we will
write our code.
There are three types of cells: code cells, markdown
cells and raw cells.
We can use markdown cells to display headings or
titles.
# Jupyter Notebook
Now run the cell by clicking the Run button in the
toolbar. The text is displayed as a heading.
In code cells, we can write Python code and execute
it instantly.
In [1]: 123*525
Out[1]: 64575
To insert a new cell below the selected cell, press
b on your keyboard or click the + icon.
You can select a cell [blue border] or edit it [green
border] by clicking outside or inside its text field
respectively.
Markdown cells give us more ways to display
text gracefully. We can add headings,
sub-headings and lower-level headings by writing #
one, two or three times respectively, followed by a
space and then the text.
# Jupyter Notebook
## IPython
### Data
We can create ordered and bulleted lists.
Ordered List (as typed in the cell)
1. Data Science
    1. Python
    2. Jupyter Notebook
    3. Libraries
Ordered List (as displayed)
1. Data Science
    A. Python
    B. Jupyter Notebook
    C. Libraries
Bulleted List (as typed in the cell)
- Data Science
    * Python
    * Jupyter Notebook
    * Libraries
Bulleted List (as displayed)
• Data Science
    ■ Python
    ■ Jupyter Notebook
    ■ Libraries
To create an ordered list, use 1. for the first list
item, then indent the sub-list items with a tab and
number them in order. [The text should be written
after the number, separated by a space.]
To create a bulleted list, start each item with - or *
followed by a space, and indent sub-list items in the
same manner as above. In the example, the top-level
item is displayed with a round bullet and the indented
items with square bullets.
We can also add links, using [] and (). Write the
display text in [] and put the link in (); you can also
add a hover text inside the () using " " quotes.
[Jupyter Notebook for Python](https://[Link]/try "Try it!")
Jupyter Notebook for Python
We can also use **<text>** or __<text>__ to render
bold text and *<text>* or _<text>_ to render
italicized text.
We can also insert images by going to
Edit > Insert Image and browsing for the image.
Create tables using |, strictly following the
example below
|Product|Price|Quantity|
|-------|-----|--------|
|Biscuits|5|2|
|Milk|7|5L|
Product   Price  Quantity
Biscuits  5      2
Milk      7      5L
Here's a complete list of shortcuts for various
operations with cells.
Operation                     Shortcut
change cell to code           y
change cell to markdown       m
change cell to raw            r
close the pager               Esc
restart kernel                0 + 0
copy selected cell            c
cut selected cell             x
delete selected cell          d + d
enter edit mode               Enter
extend selection below        Shift + j
extend selection above        Shift + k
find and replace              f
ignore                        Shift
insert cell above             a
insert cell below             b
interrupt the kernel          i + i
merge cells                   Shift + m
paste cells above             Shift + v
paste cells below             v
run cell and insert below     Alt + Enter
run cell and select below     Shift + Enter
run selected cells            Ctrl + Enter
save notebook                 Ctrl + s
scroll notebook up            Shift + Space
scroll notebook down          Space
select all                    Ctrl + a
show keyboard shortcuts       h
toggle all line numbers       Shift + l
toggle cell output            o
toggle cell scrolling         Shift + o
toggle line numbers           l
undo cell deletion            z
03 PANDAS LIBRARY
• Features of Pandas library
• Series
• Dataframes
Pandas
Data science requires high-performance data
manipulation and data analysis, which we can achieve
with Pandas and its data structures. Python with pandas
is in use in a variety of academic and commercial
domains, including Finance, Economics, Statistics,
Advertising, Web Analytics, and more. Using
Pandas, we can accomplish five typical steps in
the processing and analysis of data, regardless of
the origin of the data: load, organize, manipulate,
model, and analyse the data.
Key features of Pandas library
We can achieve a lot with the Pandas library using
its features like:
• Fast and efficient DataFrame object with
default and customized indexing.
• Tools for loading data into in-memory data
objects from different file formats.
• Data alignment and integrated handling of
missing data.
• Label-based slicing, indexing and subsetting
of large data sets.
• Columns from a data structure can be deleted
or inserted.
• Group by data for aggregation and
transformations.
Series
Pandas deals with data through its data
structures, known as series and data frames (older
versions also had a panel structure, which has since
been removed).
A Series is a one-dimensional array-like structure
with homogeneous data. For example, the following
series is a collection of integers:
10 17 23 55 67 71 92
As a series is a homogeneous data structure, it can
contain only one type of data [here integer]. So,
we conclude that a Pandas Series:
• Is a homogeneous data structure
• Has a size that cannot be mutated
• Has values that can be mutated
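The properties listed for a Series can be checked directly in a notebook cell. The following is a minimal sketch using the integers from the example; the variable name s is just an illustrative choice.

```python
import pandas as pd

# A Series holding the integers from the example
s = pd.Series([10, 17, 23, 55, 67, 71, 92])

print(s.dtype)   # a single integer dtype for the whole Series
print(len(s))    # number of values

# Values can be changed in place through the index
s[0] = 11
print(s[0])
```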
Data Frames
A DataFrame is a two-dimensional array with
heterogeneous data.
Day        Sales
Monday     33
Tuesday    37
Wednesday  14
Thursday   29
The data shows the sales of a certain product for 4
days. You can think of a Data Frame as a container for
2 or more series. So, we conclude that a pandas data
frame:
• Can contain heterogeneous data
• Has a mutable size
• Has mutable data
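A short sketch of the same sales table as a DataFrame may help. The column names follow the table; the extra AboveTarget column and its threshold of 30 are made-up additions used only to show that the size and the data can change.

```python
import pandas as pd

# The sales table: a text column and a numeric column
df = pd.DataFrame({
    "Day": ["Monday", "Tuesday", "Wednesday", "Thursday"],
    "Sales": [33, 37, 14, 29],
})

# Each column is itself a Series
sales = df["Sales"]

# The size is mutable: add a column (the target of 30 is a made-up value)
df["AboveTarget"] = df["Sales"] > 30

# The data is mutable: change a single value by row label and column name
df.loc[2, "Sales"] = 15

print(df)
```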
We will use Pandas series and data frames a lot
in future lessons, so make sure to go through this
lesson again and get a grasp of them.
Key Points
• The Pandas library is a high-performance data
manipulation and data analysis tool.
• Pandas data structures include series and
data frames.
• A Series is a 1-Dimensional array of
homogeneous data, whose size is immutable
but whose values are mutable.
• A Data Frame is a 2-Dimensional array of
heterogeneous data made of 2 or more series,
whose size and data are mutable.
04 NUMPY PACKAGE
• Features of NumPy
• ndarray Objects
• List vs. ndarrays
NumPy
NumPy is a Python package whose name stands for
'Numerical Python'. It is a library consisting of
multidimensional array objects and a collection of
routines for processing arrays.
[Figure: a 1D array of shape (4,), a 2D array of shape (2, 3) with axis 0 and axis 1 marked, and a 3D array]
Key features of NumPy
NumPy is a powerful package with many features,
like:
• Mathematical and logical operations on arrays.
• Fourier transforms and routines for shape
manipulation.
• Operations related to linear algebra. NumPy
has in-built functions for linear algebra and
random number generation.
• NumPy ndarrays are much faster than
Python's built-in lists and consume less
memory.
• Most of the parts that require fast
computation are written in C or C++.
ndarray objects
NumPy aims to provide an array object that is up to
50x faster than traditional Python lists. The array
object in NumPy is called ndarray; it provides a lot
of supporting functions that make working with
ndarrays very easy. Arrays are very frequently used in
data science, where speed and resources are very
important.
In NumPy, we can create 0-D, 1-D, 2-D and 3-D
ndarrays.
0-D (33)
1-D ([11, 27, 18])
2-D ([[3, 5, 6],
      [5, 7, 11]])
3-D ([[[5, 8, 19],
       [6, 9, 10],
       [4, 1, 11]]])
In brief, ndarrays or n-dimensional arrays:
• Describe a collection of items of the
same type.
• Allow items in the collection to be accessed using
a zero-based index.
• Use the same size of memory block for every
item.
• Hold elements whose type is described by a data-type
object (called dtype). Any item extracted from an
ndarray object (by indexing) is represented by a
Python object of one of the array scalar types.
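A minimal sketch of creating arrays with 0 to 3 dimensions follows. The values are taken from the examples in this section; note that it is the nesting of the brackets that decides the number of dimensions, and the variable names are only illustrative.

```python
import numpy as np

a0 = np.array(33)                       # 0-D: a single value
a1 = np.array([11, 27, 18])             # 1-D
a2 = np.array([[3, 5, 6], [5, 7, 11]])  # 2-D: 2 rows, 3 columns
a3 = np.array([[[5, 8, 19], [6, 9, 10], [4, 1, 11]]])  # 3-D: one 3x3 block

# ndim is the number of dimensions, shape the size along each axis
for arr in (a0, a1, a2, a3):
    print(arr.ndim, arr.shape, arr.dtype)

# Zero-based indexing: row 1, column 2 of the 2-D array
print(a2[1, 2])
```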
Lists vs. ndarray
In Python we have lists that serve the purpose of
arrays, but they are slow to process. NumPy aims to
provide an array object that is up to 50x faster than
traditional Python lists.
Lists
• A list is an array of heterogeneous objects.
• List items are stored in different places in
memory, which makes it slow to process data.
• Lists are not optimized to work with the latest
CPUs.
• A 1-Dimensional list: ['A', 56, 67.05]
ndarrays
• An ndarray is an array of homogeneous objects.
• ndarray items are stored in one continuous place
in memory, which makes it faster to process data.
• ndarrays are optimized to work with the latest
CPUs.
• A 1-Dimensional ndarray: ([12, 17, 25])
[Figure: memory allocation of a list, whose items sit at scattered memory locations, compared with an ndarray (PyObject_HEAD, data, dimensions, strides), whose items 1 to 7 are stored next to each other]
This is why the built-in lists are slower than
ndarrays.
To process data much faster we will
use NumPy in future lessons, so make sure to get a
hold of it.
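The contiguous storage of an ndarray can be inspected from Python. This is a small sketch, assuming a 64-bit integer dtype is chosen explicitly so that the sizes are the same on every system.

```python
import numpy as np

arr = np.array([12, 17, 25], dtype=np.int64)

print(arr.itemsize)               # bytes used by each element
print(arr.strides)                # bytes to step to reach the next element
print(arr.flags["C_CONTIGUOUS"])  # True when stored in one contiguous block
print(arr.nbytes)                 # total bytes of the data buffer

# A list may mix types; an ndarray converts everything to one dtype
mixed = np.array([1, 2.5])
print(mixed.dtype)
```

Because every element has the same size and sits right after the previous one, NumPy can step through the data with simple arithmetic instead of following a separate reference for each item.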
Key Points
• NumPy stands for Numerical Python, which
is a Python package used for working with
arrays.
• It also has functions for working in the domains
of linear algebra, Fourier transforms, and
matrices.
• ndarrays or n-dimensional arrays are
homogeneous arrays, which are optimized
for fast processing.
• ndarrays also provide many functions that
make them suitable for working with data.
05 SCIPY PACKAGE
• Features of SciPy
• Data Structures
• SciPy Sub-Packages
SciPy
The SciPy library of Python is built to work with
NumPy arrays and provides many user-friendly and
efficient numerical routines, such as routines for
numerical integration and optimization. Together,
they run on all popular operating systems and are
quick to install.
In [1]: # Import packages
        from scipy import integrate
        import numpy as np

        def my_integrator(a, b, c):
            # Integrate a*exp(b*x) + c over x from 0 to 100
            my_fun = lambda x: a * np.exp(b * x) + c
            y, err = integrate.quad(my_fun, 0, 100)
            print('ans: %1.4e, error: %1.4e' % (y, err))
            return (y, err)

        # Call function
        my_integrator(5, -10, 3)
ans: 3.0050e+02, error: 4.5750e-10
Out[1]: (300.5, 4.574965520082099e-10)
Key features of SciPy
SciPy combined with NumPy is a powerful tool
for data processing, with features like:
• The SciPy package contains various toolboxes
dedicated to common issues in scientific
computing. Its different submodules correspond
to different applications, such as
interpolation, integration, optimization, image
processing, statistics, special functions, etc.
• SciPy is the core package for scientific
routines in Python; it is meant to operate
efficiently on NumPy arrays, so that NumPy and
SciPy work hand in hand.
• SciPy is organized into sub-packages covering
different scientific computing domains, so
you can import only the parts you need.
Data structures
The basic data structure used by SciPy is a
multidimensional array provided by the NumPy module.
NumPy provides some functions for Linear Algebra,
Fourier Transforms and Random Number Generation,
but not with the generality of the equivalent
functions in SciPy. Besides these, SciPy offers
physical and mathematical constants, Fourier
transforms, interpolation, data input and output,
sparse matrices, etc.
[Figure: a dense matrix, in which every cell holds a value, compared with a sparse matrix, in which most cells are empty. Caption: Use of Sparse matrix]
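A small sketch of the difference between a dense and a sparse matrix follows. The matrix values are made up for illustration, and csr_matrix (Compressed Sparse Row) is just one of the sparse formats that scipy.sparse offers.

```python
import numpy as np
from scipy import sparse

# A small matrix that is mostly zeros (made-up values)
dense = np.array([
    [0, 0, 3, 0],
    [0, 0, 0, 0],
    [7, 0, 0, 1],
])

# Compressed Sparse Row format stores only the non-zero entries
sp = sparse.csr_matrix(dense)

print(sp.nnz)        # number of stored non-zero values
print(sp.toarray())  # convert back to a dense NumPy array
```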
SciPy sub-packages
As we already know, SciPy is organized into
sub-packages covering different scientific
computing domains, so we can import them according
to our needs rather than importing the whole library.
The following table shows the list of all the
sub-packages of SciPy:
[Link] Mathematical constants
[Link] Fourier transform
[Link] Integrate routines
[Link] Interpolation
[Link] Data input and output
[Link] Linear algebra routines
[Link] Optimization
[Link] Signal processing
[Link] Sparse matrices
[Link] Spatial data structures
[Link] Special mathematics
[Link] Statistics
Key Points
• The SciPy package is a toolbox which is used
for common issues in scientific computing.
• SciPy together with NumPy creates a
dynamic tool for data processing.
• Along with NumPy functions, SciPy provides
a lot of functions to perform different
tasks with ndarrays.
• SciPy is divided into sub-packages
dedicated to different tasks.
06 MATPLOTLIB LIBRARY
• Features of matplotlib
• Data Visualization
• PyPlot in Matplotlib
MatPlotLib
Matplotlib is a Python library used to create 2D
graphs and plots by using Python scripts. It has a
module named pyplot which makes plotting easy
by providing features to control line
styles, font properties, axis formatting, etc.
[Figure: two sample plots created with Matplotlib]
Key features of MatPlotLib
Matplotlib is a popular choice for data
visualization because of its features like:
• It supports a very wide variety of graphs and
plots, namely histograms, bar charts, power
spectra, error charts, and many more.
• It is used along with NumPy to provide an
environment that is an effective open source
alternative to MATLAB.
• Using its PyPlot module, plotting simple
graphs or any other charts is very easy.
Data Visualization
Data visualization is the graphical
representation of information and data. By using
visual elements like charts, graphs, and maps, data
visualization tools provide an accessible way to see
and understand trends, outliers, and patterns in
data.
In the world of Big Data, data visualization
tools and technologies are essential to analyze
massive amounts of information and make
data-driven decisions. Data visualization helps us
to view data in a graphical or more interesting
way rather than viewing a big chunk of numbers in
a uniform line.
We will process, analyze and then visualize our
data. If we don't visualize our data, it loses a
lot of the impact it would have in the form of bar
graphs, pie charts, etc.
PyPlot in Matplotlib
matplotlib.pyplot is a collection of functions
that make matplotlib work like MATLAB. Each pyplot
function makes some change to a figure: e.g.,
creates a figure, creates a plotting area in a
figure, plots some lines in a plotting area,
decorates the plot with labels, etc.
To test it yourself, jump to Jupyter Notebook and
start off by importing the matplotlib.pyplot
module.
In [ ]: import matplotlib.pyplot as mplt
To plot a simple graph, use the plot function
and pass a list, and then use the show function
to view the graph
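A minimal sketch of that first plot is given below. The list values are arbitrary sample numbers, and the matplotlib.use("Agg") line is an assumption added only so the sketch also runs outside a notebook; in Jupyter you can leave it out.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend; omit this line in Jupyter
import matplotlib.pyplot as mplt

# With a single list, the values are used as y and x defaults to 0, 1, 2, ...
lines = mplt.plot([1, 3, 6, 9])

# Display the figure (in a notebook the graph appears below the cell)
mplt.show()
```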
We have successfully plotted our graph with some
sample values in a list.
If we want, we can name the x and y axes using the
xlabel and ylabel functions respectively.
In [2]: import matplotlib.pyplot as mplt
        mplt.plot([1,3,6,9])
        mplt.xlabel('X_Axis')
        mplt.ylabel('Y_Axis')
        mplt.show()
The graph has a solid blue line. We can change its
color and line style by passing another
argument to the plot function, like 'ro' for 'r'
red and 'o' circles.
In [2]: import matplotlib.pyplot as mplt
        mplt.plot([1,3,6,9],'ro')
        mplt.xlabel('X_Axis')
        mplt.ylabel('Y_Axis')
        mplt.show()
[Figure: the four values plotted as red circles, with the axes labelled X_Axis and Y_Axis]
The letters and symbols of the format string, like
'ro', are from MATLAB: you concatenate a color
string with a line style string. There are many
symbols for different shapes and colors, like
'b-' for a blue solid line. You'll find the symbols
for different colors and shapes in the following
lists.
line and shape styles
'-'   Solid line
'--'  Dashed line
':'   Dotted line
'-.'  Dash-dot line
'o'   Circle
'+'   Plus sign
'*'   Asterisk
'.'   Point
'x'   Cross
'_'   Horizontal line
'|'   Vertical line
's'   Square
'd'   Diamond
'^'   Upward-pointing triangle
'v'   Downward-pointing triangle
'>'   Right-pointing triangle
'<'   Left-pointing triangle
'p'   Pentagon
'h'   Hexagon
color styles
'y'   yellow
'm'   magenta
'c'   cyan
'r'   red
'g'   green
'b'   blue
'w'   white
'k'   black
Besides the color and the line & shape style, the
plotted graphs can be customized in many other ways;
you can learn those in the separate book dedicated
to data science or data visualization.
Key Points
• MatPlotLib is a library used for
visualizing our data using its MATLAB-like
functions
• MatPlotLib's PyPlot module makes it easier
to plot data, with full control over
color, line & shape, font, axis labels,
etc.
• It supports a wide range of graphs and plots
like histograms, bar graphs, pie charts,
and even 3-D graphs.
• MatPlotLib is one of the most widely used
libraries for data visualization in Python
07 SCIKIT LEARN LIBRARY
• Features of Scikit learn library
Scikit Learn or Sklearn
Scikit-learn or Sklearn is one of the most useful and
robust libraries for machine learning in Python. It
provides a selection of efficient tools for
machine learning and statistical modeling,
including classification, regression, clustering
and dimensionality reduction, via a consistent
interface in Python. This library, which is
largely written in Python, is built upon NumPy,
SciPy and Matplotlib.
Features of Scikit Learn
Scikit-learn focuses on modelling data. The
following are the most popular groups of models
provided by the library:
• Supervised learning algorithms, like Linear
Regression, Support Vector Machine (SVM), Decision
Tree, etc. are part of scikit-learn.
• Unsupervised learning algorithms, from
clustering, factor analysis and PCA (Principal
Component Analysis) to unsupervised neural
networks.
• Cross Validation, Dimensionality Reduction,
Ensemble methods, Feature extraction and Feature
selection are also features of scikit-learn. They
are used, respectively, to check the accuracy of
supervised models, reduce the number of attributes
in the data, combine the predictions of multiple
supervised models, extract features and identify
useful features in the data.
08 TYPES OF MACHINE LEARNING
• Supervised learning
• Unsupervised learning
• Deep learning
• Reinforcement learning
• Deep reinforcement learning
In the previous lessons we learned about the
various libraries and packages used in the process
of machine learning. Now let's look at the types
of machine learning.
Types of machine learning
The following are the different types of machine
learning:
[Figure: the types of machine learning: supervised learning, unsupervised learning, deep learning, reinforcement learning and deep reinforcement learning]
Supervised learning
As its name suggests, in supervised
learning we train a machine learning
model with supervision. We feed the
data, train the model and tell the
model to make predictions. If the
predictions are correct, we leave the
model to work on its own from there; otherwise
we help the model to predict correctly
until it learns to do so. It is the same as
teaching a child to solve questions at first
until the child can solve them alone.
Types of supervised learning
Regression and classification are two types of
supervised machine learning. They can be understood
as:
• Regression is the type of machine learning in
which we feed the model with data like 'A' (input,
i.e. X) has a value of 65 (output, i.e. Y), 'B' has
a value of 66, etc. Based on the given data, the
model learns the relation between the input and
output (here 'A' & 65).
Once the machine is trained with sufficient data,
we provide an input, let's say 'C', and let the model
predict the output, but you must know the real
output of that input. You compare the prediction
with the real value to check whether it is
correct or wrong. If the predictions are correct,
we pass the model. If the predictions aren't
correct, we keep training the model with more data
until they are.
Regression in turn has different types, like linear
and logistic regression (which, despite its name, is
mainly used for classification). We will learn about
them in their separate lesson.
• Classification is the type of machine learning
in which we feed data and the model classifies the
data into different groups. Consider the following
example:
the data has different types of shapes in it. We
will teach the model which shape is which, i.e. what
the different groups in the data are. We will
provide the groups with their features, like:
[Figure: shapes sorted into four groups: circle, square, oval and rounded squares]
Now the trained model can classify any data after
learning how the groups are formed. If a new shape
is passed, it will classify it according to what it
has learned. As with regression, we will keep feeding
it data until it classifies the data correctly.
Classification also has different types, like
decision trees, Naive Bayes classification, support
vector machines, etc.
We will learn about them in the lesson dedicated
to this topic.
Unsupervised learning
Unlike supervised learning, we
don't teach the model or check the
predictions it makes;
instead we feed the data and ask
for predictions directly. In general,
the more data you feed, the more
accurate the results will be.
Unsupervised learning is
used in artificial intelligence
applications like face detection, object detection,
etc.
Deep learning
Deep learning models are based on Artificial
Neural Networks (ANNs), for example
Convolutional Neural Networks (CNNs). There are
several architectures used in deep learning, such
as deep neural networks, deep belief networks,
recurrent neural networks, and convolutional
neural networks.
These models are used to solve
problems like:
• computer vision, image
classification, etc.
• bioinformatics
• drug design
• games, etc.
Deep learning is a field in itself within machine
learning, so we will not discuss it in much
detail, but let's see what it is and how it
solves problems.
Deep learning requires a lot of data and
computational power, but nowadays high performance
computing is available to us. Let's consider an
example where a deep learning model tells us
whether an animal is a horse or not. The network
is given a large number of horse photos as
data, analyzes them and tries to extract patterns
from them like horns, color, saddle, eyes, etc.
The neural network comes to a conclusion on whether
the animal is a horse or not, but how it reached the
conclusion is unknown. The reasoning cannot be
obtained from deep learning models, which is why
deep learning is also considered a black box.
Reinforcement learning
Reinforcement learning consists of learning
models which don't require any input or output
data; instead they learn to perform certain tasks
like solving puzzles or playing games. If the
model performs a step correctly it is rewarded
with points, otherwise it is penalized. The more the
model performs, the more it learns from its mistakes.
The model creates data as it performs the task,
rather than receiving data at the beginning.
For example, consider the following model:
[Figure: an example reinforcement learning model]
Deep Reinforcement learning
As the name suggests, this is the combination of
reinforcement and deep learning. Reinforcement
algorithms are combined with deep learning to
create powerful Deep Reinforcement Learning (DRL)
models that are used in fields like robotics, video
games, finance and healthcare. Many problems once
considered unsolvable are solved by these models.
DRL models are still new to us and there is a lot
to learn about them.
09 MATHEMATICS FOR ML
• Data instances
• Statistics
• Probability
• Bayes Theorem
Although every mathematical calculation will be
performed by the computer, you need to know
the important formulae and mathematical notations
even if you're not solving them yourself. In
this chapter we will go through the important
concepts in mathematics required for machine
learning.
Data instances
Data is what we need to perform all the
functions, i.e. data is the base of everything.
You need to know what type of data is required in
which process. Let's consider the following as our
data:
Day         Sales
Monday      33
Tuesday     37
Wednesday   14
Thursday    29
In the above table there are two columns, Day &
Sales, and four rows. In the data, we have two
things, features and labels, i.e. features of the data
(numeric values like 33) and labels of the data
(descriptive values like Monday). Here Day is the
label and Sales is the feature.
The labels have the following two types:
• Nominal: these data aren't ordered. They
have no hierarchy or upper or lower status.
In the following data the labels have either a
True or False value, i.e. they are nominal data
Answer Question
True 01
False 02
False 03
True 04
• Ordinal: these data are ordered. They have upper
or lower status. In the following data the
labels have an order in their values, like
Good > Average > Bad, i.e. they are ordinal data
Product ID Rating
101 Average
102 Good
103 Average
104 Bad
Similarly, the features also have two different
types:
• Discrete, or countable values. These values can
only take certain separate values. For example, in
the following data the feature, number of children
(NoC) in different families, can only be a whole
number like 1, 2 or 3, i.e. it is discrete data
Family     NoC
Smith's    2
Martin's   1
Cox's      3
Hyde's     2
• Continuous, or infinite values. These values
don't have such a limit. For example, in the
following data the feature, weight of different
people, isn't restricted to whole numbers. It could
be 110 pounds, 110.20 pounds or even 110.21 pounds,
i.e. it is continuous data.
Person   Weight
Thon 110
Max 122
Mary 96
Alex 120
Data Collection
Data is collected from many sources. Let's say we
want data about a whole country. We could
survey the whole population, which is time
consuming, or we can select a sample of the
population and survey its data. The sample can be
selected randomly, on the basis of characteristics
or other features.
Likewise, instead of feeding a whole data set
to a model, we can obtain a sample from it to save
time while still getting good results.
[Figure: a sample selected from a population]
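The idea of surveying a random sample instead of the whole population can be sketched with Python's standard random module. The population here is made-up data (hypothetical ages), and the sample size of 500 is an arbitrary choice for illustration.

```python
import random

rng = random.Random(42)  # fixed seed so the run is repeatable

# Hypothetical population: ages of 10,000 people (made-up data).
population = [rng.randint(18, 90) for _ in range(10_000)]

# Draw a simple random sample of 500 people, without replacement.
sample = rng.sample(population, k=500)

population_mean = sum(population) / len(population)
sample_mean = sum(sample) / len(sample)

print(f"population mean: {population_mean:.2f}")
print(f"sample mean:     {sample_mean:.2f}")
```

The two printed means should be close to each other, which is why a well-chosen sample can stand in for the full data set.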
Statistics
Statistics is often thought of as data visualization
like bar graphs, etc., but statistics also includes
data collection, data analysis and its
representation. As you may have learnt in school, we
perform statistical analysis of data, like finding
the central tendency and visualizing the data on
graphs.
Descriptive and Inferential statistics are used in
machine learning.
Descriptive statistics
In descriptive statistics we work with the whole
data, i.e. the population rather than a sample. In
descriptive statistics we have the following:
• Central Tendency
The mean, median and mode of a data set are referred
to as its central tendency. We can find each of
them very easily. Let's consider the following
data and find its central tendency,
Day Sales
Monday 33
Tuesday 37
Wednesday 14
Thursday 29
To find the mean or the average of the sales, we
need to add all the values together and divide the
sum by the total number of values:
$$\bar{x} = \frac{\text{sum of all values}}{\text{total no. of values}} = \frac{33 + 37 + 14 + 29}{4} = 28.25$$
The average or mean of the sales is 28.25. As mentioned
earlier, you don't need to perform the calculations
yourself; the computer will do them. There are even
separate functions in the pandas library, like
the mean() method of a DataFrame, to find it. You
just need to know about it and how the value is
obtained, so you understand what happens in which
analysis.
To find the median or the middle value, sort the
numbers. If the data has an odd number of values,
the middle value is the median, like
12  34  (56)  71  77
56 is the median of the above data. But if the
number of values is even, like our sales data
(14, 29, 33, 37 when sorted), we need
to find the sum of the middle pair and divide that
by 2:
$$\text{median} = \frac{\text{sum of mid pair}}{2} = \frac{29 + 33}{2} = 31$$
And at last we have the mode or the most frequently
occurring value. It can be found by inspection, but
in our data we don't have any repeating values, so we
will consider the following example:
(12)  34  (12)  71  77  (12)  56  78
12 is the mode of the above data as it occurred
three times. There can be many repeated values in
a data set.
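As a quick check of the three measures, the sketch below uses Python's built-in statistics module on the sales data and on the eight-value list from the mode example. The variable names are illustrative only.

```python
import statistics

sales = [33, 37, 14, 29]                      # sales data from the table
repeats = [12, 34, 12, 71, 77, 12, 56, 78]    # list used for the mode example

mean_sales = statistics.mean(sales)       # (33 + 37 + 14 + 29) / 4
median_sales = statistics.median(sales)   # average of the middle pair 29 and 33
mode_repeats = statistics.mode(repeats)   # most frequently occurring value

print(mean_sales, median_sales, mode_repeats)
```

statistics.median sorts the values itself, so the list does not need to be ordered beforehand.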
• Variability or Spread
Range, interquartile range, variance and
standard deviation are referred to as variability
or measures of spread
[Figure: Spread divided into Range, Interquartile Range, Variance and Standard Deviation]
Range is the difference between the maximum and
minimum value in a data set. For example, the range
of the sales data is
$$\text{range} = \max - \min = 37 - 14 = 23$$
Interquartile Range is similar to the range but a
bit different. Let's consider the following data
12 27 33 35 35 42 45 47 51 53 54
We will divide the data into quarters with the
numbers 33, 42 and 51 as separators
[12 27] 33 [35 35] 42 [45 47] 51 [53 54]
and subtract the first separator from the third
separator to get the interquartile range:
$$\text{Interquartile range} = \text{3rd separator} - \text{1st separator} = 51 - 33 = 18$$
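The range and interquartile range can be reproduced with a short sketch. It assumes Python 3.8 or later, where statistics.quantiles is available; with its default method the three cut points for this particular list fall exactly on the separators 33, 42 and 51. Other quartile conventions can give slightly different cut points on other data.

```python
import statistics

sales = [33, 37, 14, 29]
data = [12, 27, 33, 35, 35, 42, 45, 47, 51, 53, 54]

# Range: largest value minus smallest value.
value_range = max(sales) - min(sales)

# quantiles(..., n=4) returns the three cut points that split
# the sorted data into four quarters.
q1, q2, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1

print(value_range, q1, q2, q3, iqr)
```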
Variance, or the average squared difference of the
values from their mean, can be obtained with the
following formula, where $x_i$ is an individual data
point, $\bar{x}$ is the mean and n is the total
number of data values:
$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n}$$
If you want, you can find the variance of any data
by substituting the values in the formula, but make
sure to remember how the variance is found.
Next, if we want to find the deviation, i.e. the
difference of each value from the average or
mean, we can use the following formula, where i
is the index of a value in the data and $\mu$
represents the mean:
$$\text{Deviation} = (x_i - \mu)$$
If we want to find the variance of the data of a
population, we use the mean of the whole
population, which is represented by $\mu$:
$$\sigma^2 = \frac{1}{N}\sum_{i=1}^{N} (x_i - \mu)^2$$
But if we want to find the variance of the data of
a sample from a population, we use the mean
of the sample instead of the whole population,
which is represented by $\bar{x}$. Working from a
sample in this way is part of inferential
statistics:
$$s^2 = \frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})^2$$
Similarly, we can find the standard deviation, or
the dispersion of the data from its mean, through the
following formula, where N is the number of data
points or values:
$$\sigma = \sqrt{\frac{1}{N}\sum_{i=1}^{N} (x_i - \mu)^2}$$
Let's consider an example to understand how to
find the standard deviation. We will find the standard
deviation of our sales data.
First we need to find the mean or $\mu$, i.e. 28.25,
which we found earlier.
Then we find the difference of each sales
value from the mean, square the differences and add
them:
$$(33 - 28.25)^2 + (37 - 28.25)^2 + (14 - 28.25)^2 + (29 - 28.25)^2 = 22.5625 + 76.5625 + 203.0625 + 0.5625 = 302.75$$
Finally we take the root of the product of
1/N and 302.75 to find $\sigma$, with N = 4:
$$\sigma = \sqrt{\frac{1}{4} \times 302.75} = \sqrt{75.6875} \approx 8.70$$
Therefore, the standard deviation of the sales
data is approximately 8.70. As mentioned
earlier, don't sweat the calculation, just
understand the application of the formula.
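The steps of the standard deviation formula can be followed one at a time in code. This is a minimal sketch using only the standard library; it divides by N (the population form used in the formula above), and the variable names are illustrative.

```python
import math
import statistics

sales = [33, 37, 14, 29]
n = len(sales)

mu = sum(sales) / n                              # step 1: the mean
squared_diffs = [(x - mu) ** 2 for x in sales]   # step 2: (x_i - mu)^2 for each value
sum_of_squares = sum(squared_diffs)              # step 3: add them up
variance = sum_of_squares / n                    # step 4: multiply by 1/N
sigma = math.sqrt(variance)                      # step 5: take the square root

print(mu, sum_of_squares, variance, round(sigma, 2))

# Cross-check with the library's population standard deviation.
print(round(statistics.pstdev(sales), 2))
```

statistics.pstdev divides by N, while statistics.stdev divides by N - 1 and is meant for samples, so the two give different results on the same list.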
Entropy and Information gain
Entropy, or the uncertainty in a data set, can be
found with the following formula:
$$H(S) = -\sum_{i=1}^{N} p_i \log_2 p_i$$
where S stands for the set of all instances in the
data, N refers to the number of distinct values and
$p_i$ stands for the probability of the i-th value.
Through entropy we can further calculate the
information gain from a variable through the
following formula:
$$\text{Gain}(A, S) = H(S) - \sum_{j=1}^{v} \frac{|S_j|}{|S|} H(S_j)$$
where A is the feature or variable whose information
gain is being calculated, H(S) is the entropy of
the whole dataset, $|S_j|$ is the number of instances
with value j of the feature A, |S| is the number of
all data instances in the dataset, v is the number of
distinct values of the feature A, and $H(S_j)$ is the
entropy of the subset of data instances with value j
of feature A. The weighted sum, written H(A,S), is the
entropy of feature A of the dataset.
Let's consider an example to understand
entropy and information gain more clearly. We have
the following dataset
Day Discount Advertisement Sales
1 10% No Average
2 25% No Maximum
3 20% Yes Maximum
4 10% Yes Maximum
5 25% No Average
6 10% Yes Maximum
7 20% No Maximum
8 20% No Maximum
9 10% Yes Average
10 20% Yes Maximum
You are told to find the best feature, i.e.
Discount or Advertisement, for getting Maximum
Sales. So which feature will you choose to
create a model to predict the best values to have
maximum sales? We will find the information gain
from each feature to figure that out. Let's find
the information gain from Discount. We have the
following details about Discount
[Figure: breakdown of the 10 values. Overall: 3 Average, 7 Maximum. Discount 10%: 2 Average, 2 Maximum. Discount 20%: 0 Average, 4 Maximum. Discount 25%: 1 Average, 1 Maximum]
Now we can find the entropy of the whole dataset
using the entropy formula:
$$H(S) = -\sum_{i=1}^{N} p_i \log_2 p_i$$
There are 10 values in total and two distinct Sales
values, so N is 2, and the probabilities of
Avg. Sales and Max Sales are 3/10 and 7/10
respectively:
$$H(S) = -\frac{3}{10}\log_2\frac{3}{10} - \frac{7}{10}\log_2\frac{7}{10} \approx 0.521 + 0.360 = 0.881$$
After obtaining the entropy, we can substitute the
values in the information gain formula to find the
information gain from the Discount feature. The 10%
subset (4 values: 2 Average, 2 Maximum) has entropy 1,
the 20% subset (4 values, all Maximum) has entropy 0
and the 25% subset (2 values: 1 Average, 1 Maximum)
has entropy 1:
$$\text{Gain}(\text{Discount}, S) = 0.881 - \frac{4}{10}(1) - \frac{4}{10}(0) - \frac{2}{10}(1) = 0.881 - 0.4 - 0.0 - 0.2 = 0.281$$
The information gain from the Discount feature is
about 0.28. The feature with the highest information
gain is used in models to get better predictions.
Similarly, we can find the information gain of the
Advertisement feature. The No subset has 5 values
(2 Average, 3 Maximum) and the Yes subset has 5
values (1 Average, 4 Maximum):
$$\text{Gain}(\text{Advertisement}, S) = H(S) - H(\text{Advertisement}, S)$$
$$= 0.881 - \frac{5}{10}\left(-\frac{2}{5}\log_2\frac{2}{5} - \frac{3}{5}\log_2\frac{3}{5}\right) - \frac{5}{10}\left(-\frac{1}{5}\log_2\frac{1}{5} - \frac{4}{5}\log_2\frac{4}{5}\right)$$
$$\approx 0.881 - 0.485 - 0.361 = 0.035$$
It is clear that we need to use the Discount
feature rather than the Advertisement feature because
it gives more information gain. And again, as
mentioned earlier, just understand what's going on;
this is one of the important techniques for data
scientists.
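To see how the entropy and information gain formulas turn into a calculation, here is a minimal sketch in plain Python over the ten rows of the dataset. The helper names entropy and information_gain are chosen for this illustration and are not part of any library.

```python
import math
from collections import Counter

# (discount, advertisement, sales) for days 1 to 10
rows = [
    ("10%", "No", "Average"), ("25%", "No", "Maximum"),
    ("20%", "Yes", "Maximum"), ("10%", "Yes", "Maximum"),
    ("25%", "No", "Average"), ("10%", "Yes", "Maximum"),
    ("20%", "No", "Maximum"), ("20%", "No", "Maximum"),
    ("10%", "Yes", "Average"), ("20%", "Yes", "Maximum"),
]

def entropy(labels):
    """H(S) = -sum(p_i * log2(p_i)) over the distinct label values."""
    total = len(labels)
    counts = Counter(labels).values()
    return -sum((c / total) * math.log2(c / total) for c in counts)

def information_gain(rows, feature_index, label_index=2):
    """Gain(A, S) = H(S) - sum(|S_j| / |S| * H(S_j))."""
    labels = [row[label_index] for row in rows]
    subsets = {}
    for row in rows:
        # Group the sales labels by the value of the chosen feature.
        subsets.setdefault(row[feature_index], []).append(row[label_index])
    weighted = sum(len(s) / len(rows) * entropy(s) for s in subsets.values())
    return entropy(labels) - weighted

gain_discount = information_gain(rows, 0)
gain_advertisement = information_gain(rows, 1)
print(round(gain_discount, 3), round(gain_advertisement, 3))
```

The feature with the larger printed value is the one that tells us more about Sales.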
Confusion matrix
A confusion matrix is used to calculate the
accuracy of a model. To calculate that, we need to
create the confusion matrix first. Let's say
we created a model that predicts weather and we
asked it to predict whether it will rain for the
next 30 days. After 30 days, we matched the
predicted values with the actual values and found
that the model predicted 25 days correctly but 5
were incorrect.
30 days          Predicted no rain   Predicted rain
Actual no rain   8                   3
Actual rain      2                   17
Here 8 is referred to as True negatives (T-), 3 as
False positives (F+), 2 as False negatives (F-) and
17 as True positives (T+). We can use these values
to calculate the accuracy of the model using the
following formula:
$$\text{accuracy} = \frac{(T+) + (T-)}{(T+) + (T-) + (F+) + (F-)}$$
So the accuracy of our model is
$$\text{accuracy} = \frac{17 + 8}{17 + 8 + 3 + 2} = \frac{25}{30} \approx 0.83$$
Similarly, we can calculate the error rate or
misclassification rate using the following formula:
$$\text{Error rate} = \frac{(F+) + (F-)}{(T+) + (T-) + (F+) + (F-)} = \frac{3 + 2}{17 + 8 + 3 + 2} = \frac{5}{30} \approx 0.17$$
It is the same as 1 - accuracy. Next, we can calculate
the precision of the model using the formula below:
$$\text{Precision} = \frac{(T+)}{(T+) + (F+)} = \frac{17}{17 + 3} = \frac{17}{20} = 0.85$$
Precision can also be defined as the share of the
model's positive predictions that are actually
correct.
We can also calculate the Recall using the formula
below:
$$\text{Recall} = \frac{(T+)}{(T+) + (F-)} = \frac{17}{17 + 2} = \frac{17}{19} \approx 0.89$$
And finally we can calculate the F-measure, which is
useful if models have high precision & low recall or
vice versa, using the formula below:
$$\text{F-measure} = \frac{2 \times \text{Recall} \times \text{Precision}}{\text{Recall} + \text{Precision}} = \frac{2 \times 0.89 \times 0.85}{0.89 + 0.85} = \frac{1.51}{1.74} \approx 0.87$$
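All of the confusion-matrix measures follow from the four counts, so they are easy to compute together. The sketch below uses the counts from the weather example; the short variable names tp, tn, fp and fn are a common convention chosen here, not something defined in this book.

```python
# Counts from the weather example: true positives, true negatives,
# false positives and false negatives.
tp, tn, fp, fn = 17, 8, 3, 2
total = tp + tn + fp + fn

accuracy = (tp + tn) / total
error_rate = (fp + fn) / total
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f_measure = 2 * recall * precision / (recall + precision)

for name, value in [
    ("accuracy", accuracy),
    ("error rate", error_rate),
    ("precision", precision),
    ("recall", recall),
    ("F-measure", f_measure),
]:
    print(f"{name:<10} {value:.2f}")
```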
Probability
Probability is the simplest way to quantify how
likely the outcome of an event is, using the
following formula:
$$P(A) = \frac{\text{Favourable outcomes}}{\text{Total outcomes}}$$
There are three different concepts required for
machine learning, i.e. Probability Density
Function, Normal distribution and Central Limit
Theorem, which belong to both statistics and
probability.
Probability Density Function
A probability density function has the following
properties:
• It is continuous over the
range
• The area between the curve and the
x-axis is equal to 1
• The probability that the variable lies
between a and b is the area under the curve
between a and b
Any variable described by such a function is
called a continuous random variable.
Normal Distribution
Variables (features) with a mean of 0 and a variance
of 1 are called standard normal random variables. A
normal distribution is an arrangement of a data set in
which most values cluster in the middle of the
range and the rest taper off symmetrically toward
either extreme. In a normal distribution the mean,
median and mode are the same. We can represent the
distribution as a bell-shaped graph centered on the
mean, where $\mu$ is the mean and $\sigma$ is the
standard deviation. The formula of the normal
distribution can be represented as:
$$y = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x - \mu)^2}{2\sigma^2}}$$
where e ≈ 2.718
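The normal distribution formula can be evaluated directly. The function below is a minimal sketch (normal_pdf is a name chosen for this example); it is compared against NormalDist from Python's statistics module, available from Python 3.8.

```python
import math
from statistics import NormalDist

def normal_pdf(x, mu=0.0, sigma=1.0):
    """Height of the normal curve at x for mean mu and standard deviation sigma."""
    coefficient = 1.0 / (sigma * math.sqrt(2.0 * math.pi))
    exponent = -((x - mu) ** 2) / (2.0 * sigma ** 2)
    return coefficient * math.exp(exponent)

# The curve is highest at the mean and symmetric around it.
peak = normal_pdf(0.0)
left, right = normal_pdf(-1.0), normal_pdf(1.0)
print(round(peak, 4), round(left, 4), round(right, 4))

# Same value from the standard library, for a non-standard mean and sigma.
print(normal_pdf(1.5, mu=1.0, sigma=2.0), NormalDist(1.0, 2.0).pdf(1.5))
```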
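Central Limit Theorem
The theorem states that the means of sufficiently
large samples drawn from a population are
approximately normally distributed around the
population mean, so on average the sample mean
equals the population mean.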
[Figure: the means of Sample 1, Sample 2 and Sample 3 compared with the population mean]
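A small simulation makes the theorem easier to believe. The sketch below assumes a made-up, deliberately non-normal population (uniform values between 0 and 100) and arbitrary sample sizes; only the standard library is used.

```python
import random
import statistics

rng = random.Random(0)  # fixed seed so the run is repeatable

# Hypothetical population: 10,000 uniformly distributed values.
population = [rng.uniform(0, 100) for _ in range(10_000)]
population_mean = statistics.mean(population)

# Take 300 samples of 50 values each and record every sample's mean.
sample_means = [
    statistics.mean(rng.sample(population, 50)) for _ in range(300)
]
mean_of_sample_means = statistics.mean(sample_means)

print(round(population_mean, 2), round(mean_of_sample_means, 2))
# The sample means are spread much more tightly than the raw values.
print(round(statistics.pstdev(population), 2), round(statistics.stdev(sample_means), 2))
```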
Types of Probability
Probability can be classified into the following
three types:
• Marginal probability, the probability of an event
without conditions, like drawing a number from
the first ten natural numbers.
• Joint probability, the probability of two events
occurring at once, like drawing a card that is both
red and a 4 from a deck of cards.
• Conditional probability, the
probability of one or more
events given a condition. The
condition may be fulfilled already or need to be
fulfilled at the moment of the event. For example,
drawing a joker card from your friend's hand, where
it may or may not be already present.
Bayes Theorem
Bayes theorem is a way of finding the probability of
an event when we know the probability of other
events or conditions. The formula is given as:
P(A|B) = P(B|A) P(A)/P(B)
Let's say P(Fire) means how often there is fire,
and P(Smoke) means how often we see smoke, then:
• P(Fire|Smoke) means how often there is fire when
we can see smoke
• P(Smoke|Fire) means how often we can see smoke
when there is fire
So we know the following probabilities:
• dangerous fire, P(Fire) = 1%
• smoke, P(Smoke) = 10%
• smoke when there is a dangerous fire, P(Smoke|Fire) = 90%
Then the probability of dangerous fire when there is
smoke is:
P(Fire|Smoke) = 1% x 90% / 10%
P(Fire|Smoke) = 9%
Using Bayes theorem we can find many
probabilities of events. There is even a Naive
Bayes model in machine learning, which we
will learn in the upcoming lessons.
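The fire and smoke example is a one-line calculation once the formula is written as a function. The function name bayes and its parameter names are chosen for this sketch.

```python
def bayes(p_b_given_a, p_a, p_b):
    """P(A|B) = P(B|A) * P(A) / P(B)."""
    return p_b_given_a * p_a / p_b

p_fire = 0.01              # P(Fire): how often there is a dangerous fire
p_smoke = 0.10             # P(Smoke): how often we see smoke
p_smoke_given_fire = 0.90  # P(Smoke|Fire): how often a fire makes smoke

p_fire_given_smoke = bayes(p_smoke_given_fire, p_fire, p_smoke)
print(f"P(Fire|Smoke) = {p_fire_given_smoke:.0%}")
```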
10 SCIKIT LEARN ALGORITHMS
• Regression algorithm
• Classification algorithm
• Clustering algorithm
Algorithms
We have already learnt about the different types of
machine learning, like supervised
learning, etc. But how will you choose the best
algorithm for your problem? For that purpose we
need to understand the algorithms we are going to
use, so we can decide which algorithm is suitable for
our problem.
Regression Algorithm
As a rule of thumb, scikit-learn's algorithms need
more than 50 data points or values. The following
visual will help you to understand how to pick a
scikit-learn regression model, which is used to
predict a quantity:
[Figure: flowchart for choosing a regression model. Labels: fewer than 100k samples; SGD Regressor (Stochastic Gradient Descent, for sparse data); "some features should be important"; a linear model that estimates sparse coefficients; a model for when there are multiple correlated features; Support Vector Regression (kernel = linear); Ordinary Least Squares]
If the Ridge regressor and SVR (kernel = linear)
don't work out, we can use ensemble regressors and
SVR (kernel = rbf) instead.
Classification Algorithm
Scikit-learn's classification algorithm works in
the following manner to predict categories when we
have labeled data:
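[Figure: flowchart for choosing a classifier with labeled data. More than 100k samples: SGD Classifier, then kernel approximation if it is not working. Fewer than 100k samples: Linear SVC; if it is not working, Naive Bayes for text data or KNeighbors Classifier otherwise, then ensemble classifiers if that is not working]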
Linear SVC (Support Vector Classification) is a
faster implementation for the case of a linear
kernel, which is why it doesn't have the kernel
parameter.
On the other hand, the SGD (Stochastic Gradient
Descent) Classifier implements regularized linear
models with stochastic gradient descent (SGD)
learning.
Clustering Algorithm
Scikit-learn's clustering algorithm works in the
following manner to predict categories when we
have unlabeled data and know the number of
categories:
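[Figure: flowchart for clustering unlabeled data with a known number of categories. More than 10k samples: MiniBatch KMeans. Otherwise: KMeans, then a Gaussian Mixture Model if it is not working]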
But if we don't know the number of categories, the
algorithm works as follows:
[Figure: flowchart for clustering unlabeled data with an unknown number of categories: MeanShift or a Variational Bayesian Gaussian mixture model]
We also have a dimension reduction algorithm to
just view data. The sole purpose of this chapter was
to understand the different algorithms, as we have
already learnt the different machine learning
types. So before solving any problem you will be able
to choose the right algorithm for it,
instead of choosing one randomly and selecting
another if the accuracy is very low.
11 IMPORTING DATA
• Importing CSV Data
• Importing JSON Data
• Importing XLS Data
Data files
Often we have data in multiple file formats, like
data on the sales of a product, number of
subscribers, etc. There are a lot of ways to import
data, but we will use the pandas library here.
Importing CSV Data
Reading data from CSV (comma separated values) is a
fundamental necessity in Data Science. Often, we get
data from various sources which can get exported to
CSV format so that they can be used by other
systems. To work with CSV files we need one first;
you can download the sample file from here [Link]
(check the Resources).
To import it, you need to move the CSV file to the
place where your Jupyter Notebook is hosted. To find
that location, use the code below.
In [32]: import os
         print(os.getcwd())
C:\Users\Rahul
Move your file there, and use the read_csv
function of pandas library to import the csv file
In [38]: import pandas as pan
dt = pan.read_csv("csv_data.csv")
dt
id name price sales brand
0 101 biscuits 5.00 227 HomeFoods
1 102 cookies 7.25 158 TBakery
2 103 cake 12.00 50 TBakery
3 104 whey_supplement 34.90 24 MusleUp
4 105 protein_bars 4.90 85 MusleUp
5 106 potato_chips 1.75 121 HomeFoods
We can access a single column of the CSV data using
its name, as with any DataFrame
In [39]: import pandas as pan
dt = pan.read_csv("csv_data.csv")
dt['sales']
Out[39]: 0 227
1 158
2 50
3 24
4 85
5 121
Name: sales, dtype: int64
We can extract two or more columns from the data
using the loc[:, [<columns>]] syntax
In [44]: import pandas as pan
         dt = pan.read_csv("csv_data.csv")
         dt.loc[:, ['name','sales']]
name sales
0 biscuits 227
1 cookies 158
2 cake 50
3 whey_supplement 24
4 protein_bars 85
5 potato_chips 121
or just with some rows
In [46]: import pandas as pan
dt = pan.read_csv("csv_data.csv")
         dt.loc[4:6, ['name','sales']]
Out[46]:
name sales
4 protein_bars 85
5 potato_chips 121
To access a single element, we can use its
row-column index with the values attribute
In [49]: import pandas as pan
dt = pan.read_csv("csv_data.csv")
dt
id name price sales brand
0 101 biscuits 5.00 227 HomeFoods
1 102 cookies 7.25 158 TBakery
2 103 cake 12.00 50 TBakery
3 104 whey_supplement 34.90 24 MusleUp
4 105 protein_bars 4.90 85 MusleUp
5 106 potato_chips 1.75 121 HomeFoods
In [50]: biscuits_sales = dt.values[0,3]
biscuits_sales
Out[50]: 227
[Figure: the table with column positions 0 to 4 marked above id, name, price, sales and brand, highlighting dt.values[0,3], the sales value 227 in the biscuits row]
The data values are stored as an ndarray, so to
access single elements we can use indexing similar
to that of DataFrames
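The column, row and single-element selections shown in this section can be tried without downloading anything. The sketch below assumes pandas is installed and uses an inline copy of the sample table (read through io.StringIO) as a stand-in for csv_data.csv. It also shows iloc, the position-based counterpart of loc, which is an addition to what this section covers.

```python
import io
import pandas as pan

# Inline stand-in for csv_data.csv so the example runs on its own.
csv_text = """id,name,price,sales,brand
101,biscuits,5.00,227,HomeFoods
102,cookies,7.25,158,TBakery
103,cake,12.00,50,TBakery
104,whey_supplement,34.90,24,MusleUp
105,protein_bars,4.90,85,MusleUp
106,potato_chips,1.75,121,HomeFoods
"""
dt = pan.read_csv(io.StringIO(csv_text))

sales = dt["sales"]                        # one column, returned as a Series
subset = dt.loc[4:5, ["name", "sales"]]    # loc is label based: rows 4 and 5 inclusive
by_position = dt.iloc[0, 3]                # iloc is position based: row 0, column 3
via_values = dt.values[0, 3]               # the same cell through the ndarray

print(subset)
print(by_position, via_values)
```

Note that loc includes the end label of a slice, unlike ordinary Python slicing.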
Importing JSON Data
A JSON file stores data as text in a human-readable
format. JSON stands for JavaScript Object
Notation. Get your sample JSON data here
json_data (check the Resources)
Pandas can read JSON files using the read_json
function.
In [2]: import pandas as pan
dt = pan.read_json("json_data.json")
dt
ID Name Price Sales Brand
0 101 Biscuits 5.00 227 HomeFoods
1 102 Cookies 7.25 158 TBakery
2 103 Cake 12.00 52 TBakery
3 104 Whey Supplement 34.90 24 MusleUp
4 105 Protein Bars 4.90 85 MusleUp
5 106 Potato Chips 1.75 121 HomeFoods
Similar to the CSV files, we can perform all the
slicing and data extraction with JSON data files
In [6]: import pandas as pan
        dt = pan.read_json("json_data.json")
        print(dt.loc[:,["ID","Name","Sales"]])
        print(dt["Name"])
        print(dt.values[5,4])
ID Name Sales
0 101 Biscuits 227
1 102 Cookies 158
2 103 Cake 52
3 104 Whey Supplement 24
4 105 Protein Bars 85
5 106 Potato Chips 121
0 Biscuits
1 Cookies
2 Cake
3 Whey Supplement
4 Protein Bars
5 Potato Chips
Name: Name, dtype: object
HomeFoods
Importing EXCEL Data
Microsoft Excel is a very widely used spreadsheet
program. Its user friendliness and appealing
features make it a very frequently used tool in
Data Science. Get your sample Excel data here
xlsx_data (check the Resources)
The read_excel function of the pandas library is
used to read the content of an Excel file into the
Python environment as a pandas DataFrame.
In [9]: import pandas as pan
dt = pan.read_excel("xlsx_data.xlsx")
dt
id name price sales brand Unnamed: 5
0 101 biscuits 5.00 227 HomeFoods NaN
1 102 cookies 7.25 158 TBakery NaN
2 103 cake 12.00 50 34 TBakery
3 104 whey_supplement 34.90 24 MusleUp NaN
4 105 protein_bars 4.90 85 MusleUp NaN
5 106 potato_chips 1.75 121 HomeFoods NaN
As Excel sheets are imported as pandas
DataFrames, we can perform all the tasks on the Excel
data as with DataFrames.
You may notice we have an Unnamed: 5 column with
NaN values [except dt.values[2,5]]. Let's clean up
our data.
First we need to remove the Unnamed: 5 column,
which we can do using the del keyword
In [10]: import pandas as pan
dt = pan.read_excel("xlsx_data.xlsx")
del dt["Unnamed: 5"]
As we have learned earlier, the del keyword
removes the whole column, so we don't need to deal
with its NaN values through data cleansing
We have removed the Unnamed: 5 column
In [11]: import pandas as pan
dt = pan.read_excel("xlsx_data.xlsx")
del dt["Unnamed: 5"]
dt
id name price sales brand
0 101 biscuits 5.00 227 HomeFoods
1 102 cookies 7.25 158 TBakery
2 103 cake 12.00 50 34
3 104 whey_supplement 34.90 24 MusleUp
4 105 protein_bars 4.90 85 MusleUp
5 106 potato_chips 1.75 121 HomeFoods
Now, we need to replace dt.values[2,4], i.e. 34,
with TBakery. We can use the replace method
In [12]: import pandas as pan
         dt = pan.read_excel("xlsx_data.xlsx")
         del dt["Unnamed: 5"]
         dt.replace({34:"TBakery"})
id name price sales brand
0 101 biscuits 5.00 227 HomeFoods
1 102 cookies 7.25 158 TBakery
2 103 cake 12.00 50 TBakery
3 104 whey_supplement 34.90 24 MusleUp
4 105 protein_bars 4.90 85 MusleUp
5 106 potato_chips 1.75 121 HomeFoods
So our data is clean with no errors. Try recapping
the chapter and attempt the Exercise, where you'll
be provided with sample data files [links] with
lots of errors. You have to perform all the data
cleansing practised in the previous lesson. This
will be a very good exercise to help you understand
more about data processing and cleansing.
12 DATA OPERATIONS
• NumPy operations
• Pandas operations
• Cleaning data
Python handles data of various formats mainly
through the two libraries, Pandas and NumPy. We
have already seen the important features of these
two libraries in the previous chapters. In this
chapter we will see some basic examples from each
of the libraries on how to operate on data and
perform different tasks like cleaning the data,
analytics, etc.
NumPy Operations
To start working with NumPy, we need to import
numpy to create NumPy arrays.
In [ ]: import numpy
Now let's create an array using the array()
function and print it.
In [2]: import numpy
ar = numpy.array([1,5,7])
print(ar)
[1 5 7]
ar is a 1-Dimensional array. We can also create a
2-Dimensional array by placing one or more
1-Dimensional arrays inside another array
In [8]: import numpy
ar = numpy.array([[1,5,7], [2,3,9]])
print(ar)
[[1 5 7]
[2 3 9]]
[Figure: each inner row, such as [1 5 7], is a 1-D array; together [[1 5 7] [2 3 9]] form a 2-D array]
We can specify the dimension of an array during
creation using the ndmin parameter
In [10]: import numpy
ar = numpy.array([1,5,7], ndmin = 2 )
print(ar)
[[1 5 7]]
Although we passed a 1-Dimensional array, it
became a 2-Dimensional array because of the
specification of the dimensions of the array in the
ndmin parameter
We created an array with integers so, let's create
arrays with strings and floats using the dtype
parameter with the same values
In [11]: import numpy
ar_str = numpy.array([1,5,7], dtype = str )
ar_flt = numpy.array([1,5,7], dtype = float)
print(ar_str)
print(ar_flt)
['1' '5' '7']
[1. 5. 7.]
ar_str is an array of string literals and ar_flt
is an array of floats. We can also change these
numbers to complex numbers the same way, using
complex as the dtype
In [13]: import numpy
ar_str = numpy.array([1,5,7], dtype = str )
ar_flt = numpy.array([1,5,7], dtype = float)
ar_cmx = numpy.array([1,5,7], dtype = complex )
print(ar_str)
print(ar_flt)
print(ar_cmx)
['1' '5' '7']
[1. 5. 7.]
[1.+0.j 5.+0.j 7.+0.j]
ar_str, ar_flt and ar_cmx are arrays created
with the same data but with different data types:
strings, floats and complex numbers respectively.
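To confirm what ndmin and dtype actually did, an array can be asked about itself. The short sketch below uses the ndim, shape and dtype attributes and the astype method, which are standard NumPy; the variable names are only illustrative.

```python
import numpy as np

ar = np.array([1, 5, 7], ndmin=2)      # forced to 2 dimensions
print(ar.ndim, ar.shape)                # number of dimensions and shape

ar_flt = np.array([1, 5, 7], dtype=float)
ar_str = ar_flt.astype(str)             # convert an existing array to strings
print(ar_flt.dtype, ar_str.dtype)
```

astype returns a converted copy, so it is handy when the array already exists and cannot be recreated with a different dtype.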
Pandas Operations
Pandas handles data mainly through Series and DataFrame
(the Panel type of older releases was removed in pandas 1.0).
We will learn to create a Series and a DataFrame.
Pandas Series
We already know what Pandas Series is. A pandas
Series can be created using the Series() function
so, let's import pandas and create a series.
In [14]: import pandas
sr = pandas.Series([1,5,7])
print(sr)
0 1
1 5
2 7
dtype: int64
As you can see, our data is indexed from 0 to 2
and the data type is printed as integer. We can
specify our own indexes in the index parameter
In [16]: import pandas
sr = pandas.Series([1,5,7], index = ['A','B','C'])
print(sr)
A 1
B 5
C 7
dtype: int64
Like ndarrays, we can also specify the data type
in pandas series using dtype parameter during
series creation
In [18]: import pandas
sr = pandas.Series([1,5,7], dtype = complex )
print(sr)
0 1.000000+0.000000j
1 5.000000+0.000000j
2 7.000000+0.000000j
dtype: complex128
We can use an ndarray to create a pandas series
In [19]: import numpy
import pandas
ar = numpy.array([1,5,7])
sr = pandas.Series( data = ar, copy = True )
# is the same as sr = pandas.Series(ar, copy = True)
print(sr)
0 1
1 5
2 7
dtype: int32
We passed the ar ndarray as the data for the
series (naming the data parameter isn't necessary,
it's just for better understanding) and also used
the copy parameter to create a copy of the data.
If you want to get the data without the indexes,
use the values attribute
In [21]: import numpy
import pandas
ar = numpy.array([1,5,7])
sr = pandas.Series(ar)
print(sr.values)
[1 5 7]
You can print a more detailed version of the
above using the array attribute
In [22]: import numpy
import pandas
ar = numpy.array([1,5,7])
sr = pandas.Series(ar)
print(sr.array)
<PandasArray>
[1, 5, 7]
Length: 3, dtype: int32
You can use the values or the array attribute according
to your needs, depending on whether you want just the
values or a summarized description of the array in that
pandas series. Also note the difference between the array
function in NumPy and the array attribute in pandas.
Pandas Data Frames
A pandas DataFrame aligns data in a tabular
fashion of rows and columns. A pandas DataFrame
can be created using the DataFrame() function; we
need to pass a dictionary as the data
In [23]: import pandas
df = pandas.DataFrame({"Product":['Cookies','Biscuits'],
"Sales":[157,227]})
print(df)
Product Sales
0 Cookies 157
1 Biscuits 227
Dictionary keys are the columns and their values
are the content of the rows of the Data Frame. We
can also use index parameter here
In [24]: import pandas
df = pandas.DataFrame({"Product":['Cookies','Biscuits'],
"Sales":[157,227]}, index = [1,2])
print(df)
Product Sales
1 Cookies 157
2 Biscuits 227
We can define the columns and their data
separately using ndarrays
In [42]: import pandas
import numpy
ar = numpy.array([[1,3],[6,2]])
df = pandas.DataFrame(data = ar,
index = ['A','B'],
columns = ['C1','C2'])
print(df)
C1 C2
A 1 3
B 6 2
The data is stored in the ndarray and the columns
are defined in the DataFrame's columns parameter.
Note that, a 2-Dimensional ndarray with 2
1-Dimensional arrays in it is passed to the data
parameter to act as the data
We can add columns to the DataFrame using the
<DataFrame>[<column_name>] = <values>
syntax
In [44]: import pandas
import numpy
ar = numpy.array([[1,3],[6,2]])
df = pandas.DataFrame(data = ar,
index = ['A','B'],
columns = ['C1','C2'])
df['C3'] = (df['C1']*5)
print(df)
C1 C2 C3
A 1 3 5
B 6 2 30
We can delete columns from the DataFrame using
the del keyword
In [45]: import pandas
import numpy
ar = numpy.array([[1,3],[6,2]])
df = pandas.DataFrame(data = ar,
index = ['A','B'],
columns = ['C1','C2'])
df['C3'] = (df['C1']*5)
del df['C2']
print(df)
C1 C3
A 1 5
B 6 30
We can print a column of the DataFrame using the
<DataFrame>[<column_name>] syntax
In [46]: import pandas
import numpy
ar = numpy.array([[1,3],[6,2]])
df = pandas.DataFrame(data = ar,
index = ['A','B'],
columns = ['C1','C2'])
print(df['C1'])
A 1
B 6
Name: C1, dtype: int32
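The column operations above can be combined in one short sketch. Selecting several columns with a list of names and removing a column with drop (which returns a new DataFrame instead of changing the original, unlike del) are standard pandas features that the examples above do not use, so treat them as an optional extension.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.array([[1, 3], [6, 2]]),
                  index=["A", "B"], columns=["C1", "C2"])

df["C3"] = df["C1"] * 5            # add a column computed from another one
two_cols = df[["C1", "C3"]]        # a list of names selects several columns
trimmed = df.drop(columns="C2")    # drop returns a new frame; df keeps C2
print(two_cols)
print(trimmed)
```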
Slicing Syntax
To get a single element from an ndarray, pandas
series or pandas dataframe, we index it with square
brackets; to get a range of elements we use the
slice syntax <array>[start:end:step(optional)].
Let's extract some elements from the arrays we
have created so far.
In [59]: import numpy as npy
ar1 = npy.array([1, 5])
ar2 = npy.array([[1, 3],
[5, 2]])
ar3 = npy.array([[[1, 3],
[5, 2]],
[[2, 4],
[4, 6]]])
# Slicing 1-Dimensional array
print(ar1[0])
# Slicing 2-Dimensional array
print(ar2[0,1])
# Slicing 3-Dimensional array
print(ar3[1,0,1])
1
3
4
We use a comma , to index further into arrays with 2 or
more dimensions. The following figure will help
you understand the slicing of the 3-Dimensional
array better
[Figure: slicing ar3 step by step]
ar3 is the full array [[[1, 3], [5, 2]], [[2, 4], [4, 6]]]
ar3[1] selects [[2, 4], [4, 6]] (first slice)
ar3[1,0] selects [2, 4] (second slice)
ar3[1,0,1] selects 4 (final slice)
Slicing may seem a bit tough for beginners because of
the dimensions; that's why I created the figure to
help you understand slicing better. If you are
confident, try solving the slicing questions in the
Exercise
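The step-by-step selection can also be checked in code. The sketch below repeats the ar3 example one index at a time and then shows a real slice with start, end and step on a 1-Dimensional array; the names block, row and value are only illustrative.

```python
import numpy as np

ar3 = np.array([[[1, 3], [5, 2]],
                [[2, 4], [4, 6]]])

block = ar3[1]          # second 2-D block: [[2, 4], [4, 6]]
row = ar3[1, 0]         # first row of that block: [2, 4]
value = ar3[1, 0, 1]    # second element of that row: 4
print(block, row, value)

ar1 = np.arange(10)     # the numbers 0 to 9
print(ar1[2:8:2])       # start at 2, stop before 8, step 2
```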
To get an element from a pandas series, we use the
<series>[<label> or <position>]
syntax
In [4]: import pandas as pan
sr = pan.Series([1, 3, 5], index = ['a','b','c'])
print(sr['a']) # indexing by label
print(sr[0]) # indexing by position (recent pandas prefers sr.iloc[0])
1
1
If your index labels are numbers, like these
In [6]: import pandas as pan
sr = pan.Series([1, 3, 5], index = [2,4,6])
and you want the second element, use the
<series>.loc[<label>] syntax to select it by its label
(the value defined in the index parameter) and the
<series>.iloc[<position>] syntax to select it by its
position (0, 1, 2, ...)
In [7]: import pandas as pan
sr = pan.Series([1, 3, 5], index = [2,4,6])
print(sr.loc[4])
print(sr.iloc[1])
3
3
We can modify or delete the elements using indexing
In [9]: import pandas as pan
sr = pan.Series([1, 3, 5], index = [2,4,6])
sr[4] = 7
print(sr)
del sr[6]
print(sr)
2 1
4 7
6 5
dtype: int64
2 1
4 7
dtype: int64
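The difference between selecting by label and by position shows most clearly when slicing. In the sketch below (same series as above; the variable names are illustrative), a label slice with loc includes the end label, while a position slice with iloc excludes the end position, as ordinary Python slicing does.

```python
import pandas as pd

sr = pd.Series([1, 3, 5], index=[2, 4, 6])

by_label = sr.loc[4]         # label 4 holds the value 3
by_position = sr.iloc[1]     # position 1 holds the same value
print(by_label, by_position)

print(sr.loc[2:4])           # labels 2 and 4: the end label is included
print(sr.iloc[0:1])          # position 0 only: the end position is excluded
```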
Let's say we have a DataFrame like this
In [27]: import pandas as pan
sr = pan.DataFrame({'Product':['Biscuit','Cookies'],
'Sales':[227,158]}, index = [1,2])
sr
Out[27]:
Product Sales
1 Biscuit 227
2 Cookies 158
If we want the Sales column only, we use the
<DataFrame>[<column_label>] syntax
In [32]: import pandas as pan
sr = pan.DataFrame({'Product':['Biscuit','Cookies'],
'Sales':[227,158]}, index = [1,2])
sr['Sales']
Out[32]: 1 227
2 158
Name: Sales, dtype: int64
To get the second row only, we use the
<DataFrame>.loc[<row_label>] syntax
In [33]: import pandas as pan
sr = pan.DataFrame({'Product':['Biscuit','Cookies'],
'Sales':[227,158]}, index = [1,2])
sr.loc[2] # You can also use sr.iloc[1]
Out[33]: Product Cookies
Sales 158
Name: 2, dtype: object
To get the sales of cookies only, we use the
<DataFrame>.values[<row>,<column>] syntax
In [37]: import pandas as pan
sr = pan.DataFrame({'Product':['Biscuit','Cookies'],
'Sales':[227,158]}, index = [1,2])
sr.values[1,1]
Out[37]: 158
The values are stored as ndarrays, that's why it
used slicing similar to that of 2-Dimensional
ndarrays
We can delete a whole column from the DataFrame
In [41]: import pandas as pan
sr = pan.DataFrame({'Product':['Biscuit','Cookies'],
'Sales':[227,158]}, index = [1,2])
sr
Out[41]:
Product Sales
1 Biscuit 227
2 Cookies 158
In [45]: del sr['Sales']
sr
Out[45]:
Product
1 Biscuit
2 Cookies
but we cannot delete a single value through values
In [47]: import pandas as pan
sr = pan.DataFrame({'Product':['Biscuit','Cookies'],
'Sales':[227,158]}, index = [1,2])
del sr.values[1,1]
ValueError Traceback (most recent call last)
<ipython-input-47-4422f0e71c59> in <module>
2 sr = pan.DataFrame({'Product':['Biscuit','Cookies'],
3 'Sales':[227,158]}, index = [1,2])
----> 4 del sr.values[1,1]
ValueError: cannot delete array elements
nor can we modify a value this way: for a DataFrame with
mixed column types, values returns a copy, so the
assignment below never reaches the DataFrame
In [48]: import pandas as pan
sr = pan.DataFrame({'Product':['Biscuit','Cookies'],
'Sales':[227,158]}, index = [1,2])
sr
Out[48]:
Product Sales
1 Biscuit 227
2 Cookies 158
In [51]: sr.values[1,1] = 162
sr.values[1,1]
Out[51]: 158
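Individual values can still be changed or removed; it just has to be done through the DataFrame itself. The sketch below is one way to do it with loc, iloc and drop, all standard pandas calls that the examples above do not use for this purpose, so take it as a suggestion.

```python
import pandas as pd

sr = pd.DataFrame({"Product": ["Biscuit", "Cookies"],
                   "Sales": [227, 158]}, index=[1, 2])

sr.loc[2, "Sales"] = 162     # row label 2, column "Sales"
sr.iloc[0, 1] = 230          # first row, second column, by position
print(sr)

without_row = sr.drop(index=1)   # a copy without the row labelled 1
print(without_row)
```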
More with ndarrays
We can reverse an ndarray using the <array>[::-1] syntax
In [56]: import numpy as npy
ar = npy.array([1,2,3,4])
ar
Out[56]: array([1, 2, 3, 4])
In [55]: ar = ar[::-1]
ar
Out[55]: array([4, 3, 2, 1])
We can sort a whole ndarray in place with the sort()
method, without doing it the long way
In [63]: import numpy as npy
ar = npy.array([5,1,3,9])
ar
Out[63]: array([5, 1, 3, 9])
In [64]: ar.sort()
ar
Out[64]: array([1, 3, 5, 9])
There are many built-in ndarray methods that are
not discussed now but will be used in future
lessons at various steps. You may go to the
documentation to find all the functions and their
roles; as we don't require every function for our
data processing and analysis, the miscellaneous
functions are not discussed in this book.
Data Cleansing
Let's consider a situation like the one below
In [71]: import pandas as pan
import numpy as npy
ar = npy.array([[1,2,3],[4,7,2],[4,9,1]])
df = pan.DataFrame( data = ar, index = ['a','c','e'],
columns = ['C1','C2','C3'])
df
Out[71]:
C1 C2 C3
a 1 2 3
c 4 7 2
e 4 9 1
In [72]: df = df.reindex(['a','b','c','d','e'])
df
Out[72]:
C1 C2 C3
a 1.0 2.0 3.0
b NaN NaN NaN
c 4.0 7.0 2.0
d NaN NaN NaN
e 4.0 9.0 1.0
The reindexed DataFrame has NaN values in the b and
d rows. This happened because there is no data for the b
and d rows. Using reindexing, we have created a
DataFrame with missing values. In the output, NaN means
Not a Number. To make detecting missing values easier
(and across different array dtypes), pandas provides
the isnull() and notnull() functions, which are also
methods on Series and DataFrame objects
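As a minimal sketch of these two methods, the code below rebuilds the reindexed DataFrame and checks the C1 column; summing the result of isnull() to count the missing values per column is an extra step added here for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.array([[1, 2, 3], [4, 7, 2], [4, 9, 1]]),
                  index=["a", "c", "e"], columns=["C1", "C2", "C3"])
df = df.reindex(["a", "b", "c", "d", "e"])

print(df["C1"].isnull())     # True where the value is missing
print(df["C1"].notnull())    # the opposite mask
print(df.isnull().sum())     # number of missing values in each column
```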
Name: C1, dtype: bool
Pandas provides various methods for cleaning the
missing values. The fillna function can fill in NaN
values with non-null data in a couple of ways, like
replacing NaN values with 0
In [74]: df.fillna(0)
Out[74]:
C1 C2 C3
a 1.0 2.0 3.0
b 0.0 0.0 0.0
c 4.0 7.0 2.0
d 0.0 0.0 0.0
e 4.0 9.0 1.0
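We can copy the value above or below the missing data using
pad or bfill in the method parameter of the fillna function
(newer pandas releases deprecate this parameter in favour of
the ffill() and bfill() methods)
In [75]: df.fillna( method = 'pad' )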
C1 C2 C3
a 1.0 2.0 3.0
b 1.0 2.0 3.0
c 4.0 7.0 2.0
d 4.0 7.0 2.0
e 4.0 9.0 1.0
We can drop the rows with missing values with the dropna
function
In [76]: df.dropna()
Out[76]:
C1 C2 C3
a 1.0 2.0 3.0
c 4.0 7.0 2.0
e 4.0 9.0 1.0
If we want to change a particular value in a
DataFrame, we can use the replace function, which
replaces every occurrence of that value
In [78]: import pandas as pan
import numpy as npy
ar = npy.array([[1,2,3],[4,7,2],[4,9,1]])
df = pan.DataFrame( data = ar, index = ['a','c','e'],
columns = ['C1','C2','C3'])
df.replace({3:13})
Out[78]:
C1 C2 C3
a 1 2 13
c 4 7 2
e 4 9 1
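The filling strategies can be compared side by side. The sketch below uses the ffill() and bfill() methods, which do the same job as method='pad' and method='bfill'; filling with the column mean is not covered above and is shown only as a common alternative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.array([[1, 2, 3], [4, 7, 2], [4, 9, 1]]),
                  index=["a", "c", "e"], columns=["C1", "C2", "C3"])
df = df.reindex(["a", "b", "c", "d", "e"])

filled = df.ffill()              # copy the value from the row above
back = df.bfill()                # copy the value from the row below
means = df.fillna(df.mean())     # fill each column with its own mean
print(filled)
print(back)
print(means)
```

Each call returns a new DataFrame, so df itself still contains the NaN rows afterwards.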
13 DATA ANALYSIS & PROCESSING
• Data Analytics
• Correlations between attributes
• Skewness of the data
As we learned in the mathematics for machine
learning lesson, we need to compute a lot of analytics or
statistics on our data to know more about it.
As we already know, the measures of central tendency, i.e.
mean, median and mode, are the basic statistics of our
data, which tell us the average of the data, the
middle (50%) value and the most frequently occurring
value in the whole data.
Likewise we will analyze our data and, as mentioned
earlier, we don't need to calculate these manually or
through formulas; there are plenty of functions
in different libraries to conduct the
analysis.
Data analytics
Before training any model we need to check the
data and its details. We will use 'trees.csv'
as our data for now. You can get the
file either by scanning the QR code or through
the link.
Make sure to move the file to the home
directory of your Jupyter Notebook and then
import the csv data. Before taking any
further action, let's have a look at
our raw data
In [2]: import pandas
dt = pandas.read_csv('trees.csv')
dt
Out[2]:
Index "Girth (in)" "Height (ft)" "Volume(ft^3)"
0 1 8.3 70 10.3
1 2 8.6 65 10.3
2 3 8.8 63 10.2
3 4 10.5 72 16.4
4 5 10.7 81 18.8
The first analysis is to find the shape of the
data, i.e. how many rows and columns are present in
the data. We can do so by using the shape
attribute of the DataFrame object
In [4]: import pandas
dt = pandas.read_csv('trees.csv')
dt.shape
Out[4]: (31, 4)
So our data has 31 rows and 4 columns, i.e. 124
values in total. If we want, we can inspect just
the first 10 rows using the head() function,
passing 10 as the argument
In [5]: import pandas
dt = pandas.read_csv('trees.csv')
dt.head(10)
Index "Girth (in)" "Height (ft)" "Volume(ft^3)"
0 1 8.3 70 10.3
1 2 8.6 65 10.3
2 3 8.8 63 10.2
3 4 10.5 72 16.4
4 5 10.7 81 18.8
5 6 10.8 83 19.7
6 7 11.0 66 15.6
7 8 11.0 75 18.2
8 9 11.1 80 22.6
9 10 11.2 75 19.9
To get a statistical overview of the whole data
we can use the describe() function, which provides
8 properties, i.e. count, mean, standard deviation,
minimum value, 25% (first quartile), 50% (median),
75% (third quartile) and maximum value
In [6]: import pandas
dt = pandas.read_csv('trees.csv')
dt.describe()
Index "Girth (in)" "Height (ft)" "Volume(ft^3)"
count 31.000000 31.000000 31.000000 31.000000
mean 16.000000 13.248387 76.000000 30.170968
std 9.092121 3.138139 6.371813 16.437846
min 1.000000 8.300000 63.000000 10.200000
25% 8.500000 11.050000 72.000000 19.400000
50% 16.000000 12.900000 76.000000 24.200000
75% 23.500000 15.250000 80.000000 37.300000
max 31.000000 20.600000 87.000000 77.000000
If you want the values rounded off to, say, 2
decimal places, we can use the pandas set_option()
function and set the display.precision option to 2
(older pandas releases also accepted the short name
'precision'). We can set a lot of options through this function
In [7]: import pandas
dt = pandas.read_csv('trees.csv')
pandas.set_option('display.precision',2)
dt.describe()
Out[7]:
Index "Girth (in)" "Height (ft)" "Volume(ft^3)"
count 31.00 31.00 31.00 31.00
mean 16.00 13.25 76.00 30.17
std 9.09 3.14 6.37 16.44
min 1.00 8.30 63.00 10.20
25% 8.50 11.05 72.00 19.40
50% 16.00 12.90 76.00 24.20
75% 23.50 15.25 80.00 37.30
max 31.00 20.60 87.00 77.00
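If trees.csv is not at hand, the same calls can be tried on a small in-memory sample. The sketch below uses the first ten rows copied from the head(10) output above, with shortened column names chosen for illustration, so its statistics differ from those of the full 31-row data. option_context is used so that the precision setting applies only inside the with block.

```python
import pandas as pd

# First ten rows of the trees data, copied from the head(10) output above.
dt = pd.DataFrame({
    "Girth": [8.3, 8.6, 8.8, 10.5, 10.7, 10.8, 11.0, 11.0, 11.1, 11.2],
    "Height": [70, 65, 63, 72, 81, 83, 66, 75, 80, 75],
    "Volume": [10.3, 10.3, 10.2, 16.4, 18.8, 19.7, 15.6, 18.2, 22.6, 19.9],
})

print(dt.shape)
with pd.option_context("display.precision", 2):
    print(dt.describe())          # displayed with 2 decimals, values unchanged

rounded = dt.describe().round(2)  # rounds the numbers themselves
print(rounded)
```

The display option only changes how numbers are printed, while round() changes the stored values; which one fits depends on whether the rounded figures are needed later.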
Correlation between attributes
The relation between two attributes (feature or
label) in a data set is called correlation. It is
important to know the relations between the
attributes. We can measure them using the corr() function
with Pearson's Correlation Coefficient. Pearson's
Correlation Coefficient can be read as follows:
• 1 represents perfect positive correlation
• 0 represents no linear relation at all
• -1 represents perfect negative correlation
In [2]: import pandas
dt = pandas.read_csv('trees.csv')
pandas.set_option('display.precision',2)
dt.corr(method='pearson')
Index "Girth (in)" "Height (ft)" "Volume(ft^3)"
Index 1.00 0.97 0.47 0.90
"Girth (in)" 0.97 1.00 0.52 0.97
"Height (ft)" 0.47 0.52 1.00 0.60
"Volume(ft^3)" 0.90 0.97 0.60 1.00
Note that we set the display precision to
2 to keep the values rounded off to 2 decimal
places. In the corr() function we specified
pearson in the method parameter.
As we already know, the Girth, Height and Volume
of a tree are correlated; that's why we get
values of around 0.5 to 1.0, which represent positive
correlation, i.e. as the Height or the Girth
increases, the Volume tends to increase as well,
and vice versa
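To see what the coefficient measures, it can be computed by hand and compared with the pandas result. The sketch below applies the standard Pearson formula (the sum of the products of the deviations, divided by the square root of the product of the summed squared deviations) to the ten Girth and Volume values shown earlier; because it uses only ten rows, the number differs from the 0.97 of the full data.

```python
import numpy as np
import pandas as pd

girth = np.array([8.3, 8.6, 8.8, 10.5, 10.7, 10.8, 11.0, 11.0, 11.1, 11.2])
volume = np.array([10.3, 10.3, 10.2, 16.4, 18.8, 19.7, 15.6, 18.2, 22.6, 19.9])

# Deviations of every value from the mean of its column.
dg = girth - girth.mean()
dv = volume - volume.mean()
r_manual = (dg * dv).sum() / np.sqrt((dg ** 2).sum() * (dv ** 2).sum())

r_pandas = pd.Series(girth).corr(pd.Series(volume), method="pearson")
print(round(r_manual, 4), round(r_pandas, 4))
```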
Skewness of the data
Skewness describes how far the distribution of the data
departs from a symmetric (normal-looking) shape by
leaning to either the left or the right. We need the
skewness of the data to correct it during its
preparation. The closer the value is to 0, the
less skewed the data is; the closer it is to -1
or 1 (or beyond), the more it is skewed to the left or
right side. Let's check the skewness of our trees data
using the skew() function
In [3]: import pandas
dt = pandas.read_csv('trees.csv')
pandas.set_option('display.precision',2)
dt.skew()
Out[3]: Index 0.00
"Girth (in)" 0.55
"Height (ft)" -0.39
"Volume(ft^3)" 1.12
dtype: float64
dtype: float64
As the Index column has evenly spaced values from 1 to 31, its
skewness is 0, i.e. no skewness at all. On the
other hand, Girth is skewed to the
right, Height is skewed to the left and
Volume is highly skewed to the right, i.e.
beyond 1. During data preparation we must consider
the skewness and keep it as close as possible
to 0
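A small experiment makes these numbers easier to read. In the sketch below the right_skewed values are made up for illustration, and the logarithm is just one common way of pulling in a long right tail; the text above does not prescribe a particular correction.

```python
import numpy as np
import pandas as pd

index_like = pd.Series(range(1, 32))                    # evenly spaced 1..31
right_skewed = pd.Series([1, 2, 2, 3, 3, 3, 4, 4, 5, 40])

print(index_like.skew())              # symmetric data
print(right_skewed.skew())            # long tail on the right
print(np.log(right_skewed).skew())    # tail shortened by the log transform
```

The transformed series is still right-skewed, only less so, which is why the skewness is usually checked again after each correction.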
Data Processing
Before feeding the data to models we need to
pre-process it, because the algorithms
depend completely on the data, so it must be
as clean and appropriate as possible. While
finding the skewness we found that our data is skewed,
i.e. its skewness needs to be closer to 0 for better
results, so let's look at some processes to prepare
our data
Scaling
Our data is spread over a wide range with
different scales, which is not suitable for training
models. We need to bring our data to a more
appropriate scale, which we can do using the
MinMaxScaler class and its fit_transform()
method from the scikit-learn library. We can scale
our data to the range 0 to 1, which suits
most algorithms
In [28]: import pandas
from sklearn import preprocessing
dt = pandas.read_csv('trees.csv')
ar = dt.values # array
# Scaler object
Sclr = preprocessing.MinMaxScaler(feature_range=(0,1))
skl_ar = Sclr.fit_transform(ar) # Scaling
# Scaled data
skl_dt = pandas.DataFrame(skl_ar,
columns=['Index','Girth','Height','Volume'])
skl_dt.round(1).loc[5:10]
Out[28]:
Index Girth Height Volume
5 0.2 0.2 0.8 0.1
6 0.2 0.2 0.1 0.1
7 0.2 0.2 0.5 0.1
8 0.3 0.2 0.7 0.2
9 0.3 0.2 0.5 0.1
10 0.3 0.2 0.7 0.2
You can compare the scaled values with the unscaled
values below (row label 5 of the scaled output holds
Index 6, because the row labels start at 0). If you
want, you can change the range to, say, 0-100
through the feature_range parameter of MinMaxScaler
during the scaler's initialization
Index Girth Height Volume
5 10.7 81 18.8
6 10.8 83 19.7
7 11.0 66 15.6
8 11.0 75 18.2
9 11.1 80 22.6
10 11.2 75 19.9
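The arithmetic behind min-max scaling is short enough to write out. The sketch below is a plain NumPy version of the usual formula, (value - column minimum) / (column maximum - column minimum), stretched to the requested range; min_max_scale is a hypothetical helper written for this illustration, not part of scikit-learn, and the four rows are taken from the unscaled values above.

```python
import numpy as np

# Girth, Height, Volume for four of the rows shown above.
ar = np.array([[10.8, 83, 19.7],
               [11.0, 66, 15.6],
               [11.0, 75, 18.2],
               [11.1, 80, 22.6]])

def min_max_scale(data, low=0.0, high=1.0):
    """Rescale every column of `data` linearly into [low, high]."""
    col_min = data.min(axis=0)
    col_max = data.max(axis=0)
    unit = (data - col_min) / (col_max - col_min)   # each column now in [0, 1]
    return unit * (high - low) + low

print(min_max_scale(ar).round(2))
print(min_max_scale(ar, 0, 100).round(1))
```

Because the minimum and maximum come from the data passed in, scaling four rows gives different numbers from scaling all 31; in practice the scaler is fitted once on the training data and reused.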
Normalization
Normalization is used to rescale each row of data
to have a length of 1. It is mainly useful for
sparse datasets, where we have lots of zeros. There
are two types of normalization, namely L1
and L2. With the L1 method, the absolute values in each
row sum up to 1. We can demonstrate this
using the Normalizer class and its transform
method. To use the L1 method specify 'l1' in the
norm parameter of the class
In [3]: import pandas
from sklearn import preprocessing
dt = pandas.read_csv('trees.csv')
ar = dt.values
Nm = preprocessing.Normalizer(norm='l1')
nm_ar = Nm.transform(ar)
print(nm_ar[:5])
# Sum of the rows
for i in [0,1,2,3,4]:
    print(sum(nm_ar[i]))
[[0.01116071 0.09263393 0.78125 0.11495536]
[0.02328289 0.10011641 0.75669383 0.11990687]
[0.03529412 0.10352941 0.74117647 0.12 ]
[0.03887269 0.10204082 0.69970845 0.15937804]
[0.04329004 0.09264069 0.7012987 0.16277056]]
1.0
1.0
1.0
0.9999999999999999
1.0
We created the Nm object of the Normalizer class
with 'l1' as the normalizing method in the norm
parameter during initialization, normalized our
ar data values with the transform method of the
Normalizer class and stored the result in the nm_ar
variable. Then we printed the first 5 rows of the
normalized data values.
We also created a for loop to print the sum of
each row of the normalized data, i.e. 1 (the 4th row
prints 0.9999999999999999 only because of floating-point
rounding). Note that we didn't perform any rounding off.
In the next method, i.e. L2 normalization, the
squares of the values in each row sum up to 1. So let's
use 'l2' in the norm parameter and check the
sums
In [12]: import pandas
from sklearn import preprocessing
dt = pandas.read_csv('trees.csv')
ar = dt.values
Nm = preprocessing.Normalizer(norm='l2')
nm_ar = Nm.transform(ar)
print('L2 Normalization\n')
print(nm_ar[:5])
# Sum of values in the rows
print('\nSum of the values in each row\n')
for i in [0,1,2]:
    print(sum(nm_ar[i]))
# Sum of the squares of the values in the rows
sm_row = '\nSum of squares of the values in each row\n'
for i in [0,1,2,3]:
    print(sm_row)
    sm_row = 0
    for val in nm_ar[i]:
        sm_row += val*val
L2 Normalization
[[0.01403589 0.11649791 0.98251251 0.1445697 ]
[0.03012017 0.12951675 0.97890567 0.1551189 ]
[0.04651593 0.13644674 0.97683459 0.15815417]
[0.05355175 0.14057333 0.96393143 0.21956216]
[0.05953254 0.12739964 0.96442719 0.22384236]]
Sum of the values in each row
1.2576160130346932
1.2936614951212533
1.3179514295489714
Sum of squares of the values in each row
1.0
1.0
1.0
The code may be a bit hard to understand because of
the for loop, but let's walk through it. We
normalized the data as we did before, but this
time we used the 'l2' method and printed the data
values. Then we printed the sum of the values of the
first three rows of the normalized data, but they
didn't sum up to 1. Next, as the L2 method states,
we printed the sum of the squared values in the
first three rows using a for loop, which turned out
to be exactly 1
Before the for loop, we created an sm_row
variable in which we will add up the squared values
of each row, but we stored a string in it at the start.
Then we created the outer loop, which gives us the
index of each row. We also entered one more number
in the list because on the first iteration the
string in sm_row is printed; after
printing it we change its value to 0 and then
run the inner loop, which performs the
addition. In each iteration of the inner loop we
add the square of one element of the row
with the += compound assignment operator. After all
the values are summed up, we return to the outer
loop, print the sum and reset the value to 0
to store the sum for the next row, until all the
sums are printed
[Figure: how the sum holder variable sm_row changes. On the
1st run sm_row holds the string '\nSum of squares of ...',
which is printed; it is then set to 0 and val*val is added
for each value of nm_ar[0], i.e. 0.01403589, 0.11649791,
0.98251251 and 0.1445697. On the 2nd run the sum, 1.0, is
printed and sm_row is set to 0 again.]
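The same checks can be written without nested loops. The sketch below is a NumPy-only version of the two definitions (divide each row by the sum of its absolute values for L1, or by the square root of the sum of its squares for L2), applied to the first three rows of the trees data shown earlier; it is an illustrative alternative, not the code used above.

```python
import numpy as np

# Index, Girth, Height, Volume for the first three rows of the trees data.
ar = np.array([[1, 8.3, 70, 10.3],
               [2, 8.6, 65, 10.3],
               [3, 8.8, 63, 10.2]])

l1 = ar / np.abs(ar).sum(axis=1, keepdims=True)
l2 = ar / np.sqrt((ar ** 2).sum(axis=1, keepdims=True))

print(l1.sum(axis=1))          # every row sums to 1
print((l2 ** 2).sum(axis=1))   # the squares of every row sum to 1
```

axis=1 makes the sums run along each row, and keepdims=True keeps them as a column so that the division lines up row by row.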
Binarization
In binarization we binarize our data, i.e. reduce
it to only two crisp values using
a threshold. For example, if we set the threshold to
10, every value in the data set up to 10 will be
converted to 0 and every value above 10 will be
converted to 1. Let's binarize our data with the
Binarizer class and its transform() method
In [21]: import pandas
from sklearn import preprocessing
dt = pandas.read_csv('trees.csv')
ar = dt.values
Nm = preprocessing.Normalizer(norm='l1')
nm_ar = Nm.transform(ar)
Bin = preprocessing.Binarizer(threshold=0.1)
bin_ar = Bin.transform(nm_ar)
bin_ar[10:16]
Out[21]: array([[0., 0., 1., 1.],
[0., 0., 1., 1.],
[1., 0., 1., 1.],
[1., 1., 1., 1.],
[1., 0., 1., 1.],
[1., 1., 1., 1.]])
As you can see, we used the L1-normalized data, whose
values range from 0 to 1, which made it easier to set
a threshold; it is specified in the threshold
parameter, i.e. 0.1.
So all the values up to 0.1 are changed to 0 and
all the values above 0.1 are changed to 1
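The rule itself is a single comparison. In the sketch below the sample values are made up for illustration; the point to notice is that a value exactly equal to the threshold becomes 0, because only values strictly above it become 1.

```python
import numpy as np

values = np.array([[0.05, 0.10, 0.70, 0.15]])   # illustrative numbers
threshold = 0.1

binary = (values > threshold).astype(float)     # True -> 1.0, False -> 0.0
print(binary)
```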
Standardization
Standardization or standard scaling is the method
of rescaling the data attributes so that each one has,
like a standard Gaussian (Normal) distribution, a
mean of 0 and a standard deviation
of 1. Let's standardize our data
using the StandardScaler class and its fit()
and transform() methods
In [14]: import pandas
import numpy
from sklearn import preprocessing
dt = pandas.read_csv('trees.csv')
ar = dt.values
# Standardizer
Std = preprocessing.StandardScaler().fit(ar)
std_ar = Std.transform(ar)
print(std_ar[0:3])
print('Mean:', round(numpy.mean(std_ar), 2))
print('Std Dev:', round(numpy.std(std_ar), 2))
[[-1.67705098 -1.60291968 -0.9572127 -1.22883711]
[-1.56524758 -1.50574137 -1.75488995 -1.22883711]
[-1.45344419 -1.44095583 -2.07396086 -1.23502119]]
Mean: -0.0
Std Dev: 1.0
While initializing the StandardScaler object
we also called the fit() function to fit the
scaler to our ar array, and then transformed the array.
If you don't call fit you'll get an error like
This StandardScaler instance is not fitted
yet. Call 'fit' with appropriate arguments
before using this estimator.
You can also use the fit and transform functions with
the previous preprocessing methods. For
demonstration purposes they aren't used in the previous
examples, but make sure to use them in real
applications.
Note that we used the mean() and std() functions of
the numpy package to calculate the mean and
standard deviation, i.e. 0 and 1
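Standardization can also be written out directly: subtract each column's mean and divide by its standard deviation. The sketch below does this with NumPy on four rows of Girth, Height and Volume taken from the head(10) output earlier, and checks the result per column rather than over the whole array; it is an illustration of the formula, not a replacement for the scaler class.

```python
import numpy as np

# Girth, Height, Volume for the first four rows of the trees data.
ar = np.array([[8.3, 70, 10.3],
               [8.6, 65, 10.3],
               [8.8, 63, 10.2],
               [10.5, 72, 16.4]])

col_mean = ar.mean(axis=0)
col_std = ar.std(axis=0)              # population standard deviation (ddof=0)
std_ar = (ar - col_mean) / col_std

print(std_ar.mean(axis=0).round(2))   # mean of every column
print(std_ar.std(axis=0).round(2))    # standard deviation of every column
```

Checking each column separately is slightly stricter than checking the whole array at once, since a single overall mean of 0 could hide columns that are individually off.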
Label encoding
In many cases our data has more labels (words)
than features (numbers), but using words (strings)
in processing limits many operations. For that
reason we need to change those labels into
numeric codes, as in the following
example
In [15]: import pandas
from sklearn import preprocessing
dt = pandas.DataFrame({'Questions':['A','B','C','D','E'],
'Answers':['True','True','False','True','False']})
dt
Out[15]:
Questions Answers
0 A True
1 B True
2 C False
3 D True
4 E False
We can use the LabelEncoder class for label encoding
In [17]: import pandas
from sklearn import preprocessing
dt = pandas.DataFrame({'Questions':['A','B','C','D','E'],
'Answers':['True','True','False','True','False']})
Enc = preprocessing.LabelEncoder()
Enc.fit(dt['Answers'])
# Encoded labels
dt['Answers'] = Enc.transform(dt['Answers'])
dt
Out[17]:
Questions Answers
0 A 1
1 B 1
2 C 0
3 D 1
4 E 0
As you can see, we had the Questions as A-E
and the Answers as True or False, and we encoded the
Answers label as 0 (False) or 1 (True).
If we want, we can get the label back for a value, i.e.
decode the 0 or 1 values, using the
inverse_transform() function
In [18]: import pandas
from sklearn import preprocessing
dt = pandas.DataFrame({'Questions':['A','B','C','D','E'],
'Answers':['True','True','False','True','False']})
Enc = preprocessing.LabelEncoder()
Enc.fit(dt['Answers'])
# Encoded labels
dt['Answers'] = Enc.transform(dt['Answers'])
# Decoding labels
print(Enc.inverse_transform([0,1]))
['False' 'True']
By encoding we can hide the true values and
perform a lot of operations on them, because they are
now numerical values. In this data we had only two
label values, i.e. True and False, but when there
are more values the codes will range from 0 to
one less than the number of distinct labels
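The idea behind label encoding can be reproduced with NumPy alone: sort the distinct labels, then replace each label by its position in that sorted list. The sketch below does this with numpy.unique and adds a made-up three-label example (sizes) to show the codes running from 0 to 2; it is an illustration of the principle, not the encoder class itself.

```python
import numpy as np

answers = np.array(["True", "True", "False", "True", "False"])
classes, codes = np.unique(answers, return_inverse=True)
print(classes)               # sorted distinct labels
print(codes)                 # position of each answer in `classes`

decoded = classes[codes]     # decoding is a lookup in the sorted labels
print(decoded)

# Hypothetical example with three labels.
sizes = np.array(["small", "large", "medium", "small"])
size_classes, size_codes = np.unique(sizes, return_inverse=True)
print(size_classes, size_codes)
```

Because the labels are sorted alphabetically, the codes carry no meaning of order or size; "large" receives 0 only because it sorts first.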
14 DATA VISUALIZATION
• Plotting data
• Univariate plots
• Multivariate plots
Python has excellent libraries for data
visualization. A combination of pandas, NumPy and
matplotlib can help in creating nearly all types
of visualization charts. We can use the visuals to
better understand our data
Plotting data
We use numpy library to create the required
numbers to be mapped for creating the chart and the
pyplot module of matplotlib which draws the
actual chart
In [11]: import numpy as npy
import matplotlib.pyplot as plt
x = npy.arange(0,10) # outputs [0 1 2 3 4 5 6 7 8 9]
y = x ** 2 # squares (in Python ^ is bitwise XOR, not a power)
plt.plot(x,y)
Out[11]: [<matplotlib.lines.Line2D at 0x26c1a1bc100>]
The arange(0,10) function creates an ndarray of
numbers from 0 to 10 (excluding 10) and the plot
function plots a simple chart of the data we
provided. We can also plot any imported data in
the same way
Editing labels and colors
As we already know, matplotlib uses MATLAB-style symbols
in format strings to customize colors and markers (refer
to page 32). Let's use them and also add labels to the plot
In [15]: import numpy as npy
import matplotlib.pyplot as plt
x = npy.arange(0,10)
y = x ** 2
# Editing labels
plt.title("Matplotlib")
plt.xlabel("X_Axis")
plt.ylabel("Y_Axis")
# Editing line style and color
plt.plot(x,y,'c')
plt.plot(x,y,'*')
Out[15]: [<matplotlib.lines.Line2D at 0x26c1a312130>]
We added the title Matplotlib to the plot using the
title function, and the X axis and Y axis labels using the
xlabel and ylabel functions respectively.
The 'c' stands for cyan, which is the color of the
line, and '*' is the marker symbol for stars.
Note that we didn't pass these values by parameter
name: they are format strings, which
matplotlib.pyplot.plot interprets
as a positional argument
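The color and the marker can also be combined in a single format string. The sketch below uses the object-style interface (a figure and an axes object) instead of calling pyplot functions directly; the Agg backend and the file name squares.png are assumptions made so that the example runs outside Jupyter, and both can be dropped inside a notebook.

```python
import matplotlib
matplotlib.use("Agg")              # draw off-screen; not needed in Jupyter
import matplotlib.pyplot as plt
import numpy as np

x = np.arange(0, 10)
y = x ** 2

fig, ax = plt.subplots()
ax.plot(x, y, "c*-", label="y = x squared")   # cyan, star markers, solid line
ax.set_title("Matplotlib")
ax.set_xlabel("X_Axis")
ax.set_ylabel("Y_Axis")
ax.legend()
fig.savefig("squares.png")         # hypothetical output file name
```

Keeping a reference to the axes object makes it easier to adjust one chart when several are open at once.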
Univariate plots
Univariate plots are plots or visuals
with a separate plot for each variable, to
understand the variables individually. We will use the
trees.csv data, which we used in the previous
lesson. First of all let's plot a histogram for each
variable using the hist() function of the pandas
library
In [1]: import pandas
dt = pandas.read_csv('trees.csv')
dt.hist()
Out[1]: array([[<matplotlib.axes._subplots.AxesSubplot object
<matplotlib.axes._subplots.AxesSubplot object
[<matplotlib.axes._subplots.AxesSubplot object
<matplotlib.axes._subplots.AxesSubplot object
dtype=object)
As you can see, we have histograms plotted for
each variable, i.e. Girth, Height, Volume and
Index (which can be ignored). By inspecting the
visuals we can get a lot of information about the
data: its distribution, maximum value, minimum
value, etc.
We can also visualize an individual variable by calling
the hist() function on the selected
column
In [2]: import pandas
dt = pandas.read_csv('trees.csv',
names=['Index','Girth','Height','Volume'])
dt.drop(0,inplace=True)
dt['Height'].hist()
[Figure: histogram of the Height column]
Note that we renamed the columns while importing
the csv data using the names parameter of the
read_csv() function, but then the original column names
are stored as the first row. We need to remove
that row; to do so we use the drop() function,
providing 0 as the index, i.e. the first row, and
True for the inplace parameter, which removes
the row from the actual data. Alternatively, passing
header=0 together with names makes read_csv discard
the original header row while reading.
Next are density plots, which are similar to
histograms but have smooth lines like the chart
below
Next are density plots, which are similar to
histograms but have smooth lines, like the chart
below.
We create density plots using the plot()
function of the pandas library and specifying
density in the kind parameter:
In [7]: import pandas
dt = pandas.read_csv('[Link]')
del dt['Index']
dt.plot(kind='density')
Out[7]: <matplotlib.axes._subplots.AxesSubplot at 0x261f2dc
If you want separate plots, specify True in the
subplots parameter:
In [8]: dt.plot(kind='density',subplots=True)
Out[8]: array([<matplotlib.axes._subplots.AxesSubplot object
<matplotlib.axes._subplots.AxesSubplot object
<matplotlib.axes._subplots.AxesSubplot object
dtype=object)
Another univariate plot, the box and whiskers
plot, can be used too. This time we specify box in
the kind parameter of the plot() function:
In [9]: import pandas
dt = pandas.read_csv('[Link]')
del dt['Index']
dt.plot(kind='box',subplots=True)
Out[9]: "Girth (in)" AxesSubplot(0.125,0.125;0.227941:
"Height (ft)" AxesSubplot(0.398529,0.125;0.227941:
"Volume(ft^3)" AxesSubplot(0.672059,0.125;0.227941:
dtype: object
The box plots have the following features:
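The numbers a box plot draws can also be computed directly. The sketch below uses a hypothetical list of heights and assumes matplotlib's default whisker rule of 1.5 times the interquartile range; the variable names are illustrative.

```python
import pandas

# Hypothetical sample values, not taken from the lesson's data set
heights = pandas.Series([63, 65, 70, 72, 75, 80, 87])

# The box spans the first to the third quartile; the line inside is the median
q1, median, q3 = heights.quantile([0.25, 0.5, 0.75])
iqr = q3 - q1

# Whiskers extend to the most extreme points still inside these limits;
# anything beyond them is drawn as an individual outlier point
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr
outliers = heights[(heights < lower_limit) | (heights > upper_limit)]
print(q1, median, q3, lower_limit, upper_limit, len(outliers))
```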
Multivariate plots
As the name suggests, multivariate plots let us
understand the relations between different
attributes or variables in the data. One of the
multivariate plots is a correlation matrix. We can
plot a correlation matrix for our data like so:
In [16]: from matplotlib import pyplot
import pandas
dt = pandas.read_csv('[Link]')
del dt['Index']
cor = dt.corr() # correlations
# Plotting correlation matrix
vis = pyplot.figure()
# Adding the color meter
ax = vis.add_subplot(111)
cax = ax.matshow(cor, vmin=-1, vmax=1)
vis.colorbar(cax)
pyplot.show()
[Figure: correlation matrix with a color bar; both axes are labelled 0, 1, 2]
We used the figure() function of the matplotlib
library. Along with the matrix we also added a
color bar to indicate the values of the colors,
using the correlations calculated with the corr()
function. We added the axes for the matrix using
the add_subplot() function with 111 as argument,
i.e. its position. Then we used the matshow()
function and passed our correlations and the min
and max values. Finally we displayed our matrix
with the color bar.
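To see what corr() puts into each cell, one entry can be recomputed by hand. The sketch below assumes the default Pearson method and uses a small hypothetical data frame in place of the lesson's file.

```python
import numpy
import pandas

# Hypothetical measurements standing in for the lesson's data set
dt = pandas.DataFrame({'Girth': [8.3, 8.6, 10.5, 11.0, 12.9],
                       'Height': [70, 65, 72, 75, 85],
                       'Volume': [10.3, 10.3, 16.4, 18.2, 33.8]})
cor = dt.corr()  # Pearson correlation for every pair of columns

# The same value for one pair, computed from the definition
x = dt['Girth'].to_numpy()
y = dt['Volume'].to_numpy()
dx, dy = x - x.mean(), y - y.mean()
r = (dx * dy).sum() / numpy.sqrt((dx ** 2).sum() * (dy ** 2).sum())

print(r, cor.loc['Girth', 'Volume'])
```

Every value lies between -1 and 1, which is why vmin=-1 and vmax=1 are passed to matshow().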
We can also label the axes with our column names,
using the set_xticks() and set_yticks()
functions to set the positions and the
set_xticklabels() and set_yticklabels() functions
to label them accordingly:
In [19]: vis = pyplot.figure()
ax = vis.add_subplot(111)
cax = ax.matshow(cor, vmin=-1, vmax=1)
vis.colorbar(cax)
ax.set_xticks([0,1,2])
ax.set_yticks([0,1,2])
ax.set_xticklabels(['Girth','Height','Volume'])
ax.set_yticklabels(['Girth','Height','Volume'])
pyplot.show()
[Figure: correlation matrix with Girth, Height and Volume labels on both axes and a color bar]
We can easily understand the relations between
the variables: Girth & Volume have a correlation
value close to 1 (fully positive), and Height &
Volume have a somewhat weaker relation than Girth
& Volume.
We can also view scatter matrix plots to
understand relations between variables or
attributes using dot graphs. We can do so using
the pandas scatter_matrix() function. We need to
use the pandas.plotting.scatter_matrix() syntax to
access the function and pass the data inside the
parentheses of the function:
In [1]: from matplotlib import pyplot
import pandas
dt = pandas.read_csv('[Link]')
del dt['Index']
pandas.plotting.scatter_matrix(dt)
Out[1]: array([[<matplotlib.axes._subplots.AxesSubplot object
<matplotlib.axes._subplots.AxesSubplot object
<matplotlib.axes._subplots.AxesSubplot object
[<matplotlib.axes._subplots.AxesSubplot object
<matplotlib.axes._subplots.AxesSubplot object
<matplotlib.axes._subplots.AxesSubplot object
[<matplotlib.axes._subplots.AxesSubplot object
<matplotlib.axes._subplots.AxesSubplot object
<matplotlib.axes._subplots.AxesSubplot object
dtype=object)
[Figure: scatter matrix of "Girth (in)", "Height (ft)" and "Volume(ft^3)"]
15 REGRESSION ALGORITHM
• What is Regression
• Regressor model
• Linear regression
What is regression?
As we already know, regression is a case of
supervised machine learning where we feed numeric
input data to the algorithm, which learns the
patterns in the data and predicts the output data.
The performance of the algorithm depends entirely
on what it learns, so we need to do our best to
feed the best data to our model. So let's create a
regressor with scikit-learn's ready-made
algorithms.
Regressor Model
Let's create a simple regressor model to predict
the weight of a person if the height is given as
input value. You can download the file from here:
[Link]. Then jump to your Jupyter notebook and
import the following modules to start the model:
In [1]: import pandas
from sklearn import linear_model
import sklearn.metrics
import matplotlib.pyplot
We need the pandas library to import our csv data,
linear_model to create the regressor model,
sklearn.metrics to evaluate the accuracy of our
model and matplotlib to visualize our data in
multiple steps.
Now we can import the csv data and have a look at
it before training the model:
In [2]: import pandas
from sklearn import linear_model
import sklearn.metrics
import matplotlib.pyplot
dt = pandas.read_csv('height_and_weight.csv')
dt.head()
Out[2]:
   Index  Height(In)  Weight(lbs)
0      1       65.78       112.99
1      2       71.52       136.49
2      3       69.40       153.03
3      4       68.22       142.34
4      5       67.79       144.30
We imported the csv data and inspected the first
five rows using the head() function (by default,
if you don't pass any value, it shows the first 5
rows). We already have indexes, so we can remove
the Index column. We can check the distribution
of the data by plotting a histogram of the data
using the pandas hist() function:
In [3]: import pandas
from sklearn import linear_model
import sklearn.metrics
import matplotlib.pyplot
dt = pandas.read_csv('height_and_weight.csv')
del dt['Index']
dt.hist()
Out[3]: [Figure: histograms of Height(In) and Weight(lbs)]
Our data is normally distributed, so we don't
need to do any preprocessing; we can move on to
training the model.
But we need to separate the values into input
data, i.e. X, and output data, i.e. y. For our
input data we will use the height data and for the
output we will use the weight data. So, moving on
to a new cell (hotkey: b), let's separate the data
into the inp_X and out_y variables:
In [4]: inp_X = dt['Height(In)']
out_y = dt['Weight(lbs)']
Now we can create our regression model RegModel
using the LinearRegression class, then call the
fit() method and pass the input values inp_X and
the output values out_y as arguments:
In [5]: inp_X = dt.drop(columns=['Weight(lbs)'])
out_y = dt.drop(columns=['Height(In)'])
RegModel = linear_model.LinearRegression()
RegModel.fit(inp_X,out_y)
Out[5]: LinearRegression()
And our model is ready! Yes, we can now ask the
model to predict the weight of a person whose
height is 60 inches. We will use the predict()
method to do so:
In [6]: RegModel.predict([[60]])
Out[6]: array([[99.93286131]])
So as you can see, our model predicted that a
person with a height of 60 inches will weigh
approximately 99.93 pounds.
We moved to a new cell and used the predict()
function to predict the value. We passed the value
as [[60]] because it is actually treated as
numpy.array([[60]]); if you pass 60 or [60] you'll
encounter an error, because the dataset used for
training is a 2-D array, so we need the same for
predictions. And if you are wondering how the
model made the prediction, just review the first
lesson.
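The 2-D requirement can be tried out on a handful of numbers. The heights and weights below are hypothetical and lie on an exact line, so this is a sketch of the input shapes rather than a reproduction of the lesson's model.

```python
import numpy
from sklearn import linear_model

# Hypothetical data: one feature column (height), so X has shape (5, 1)
heights = numpy.array([[62.0], [65.0], [68.0], [71.0], [74.0]])
weights = numpy.array([105.0, 120.0, 135.0, 150.0, 165.0])

model = linear_model.LinearRegression()
model.fit(heights, weights)

# One sample with one feature: a list inside a list
single = model.predict([[60]])

# Several samples: reshape a flat array into one column
several = model.predict(numpy.array([60, 70]).reshape(-1, 1))

# A flat list is rejected because it is 1-D
rejected = False
try:
    model.predict([60])
except ValueError as err:
    rejected = True
    print('1-D input rejected:', err)
```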
But we don't know whether the value is correct or
not, so we can't calculate the accuracy of the
model either. To do that we need to divide our
dataset into a training set, say 90% of the data,
to train the model, and a testing set of 10%,
whose input values will be given to the model to
make predictions. Then we will compare the
predictions with the real values.
In the cell where we trained the model, we will
divide the input values, i.e. Height, into 90% in
inp_X and 10% in tst_X, and do the same with the
output values:
In [1]: import pandas
from sklearn import linear_model
from sklearn.model_selection import train_test_split
import sklearn.metrics
import matplotlib.pyplot
dt = pandas.read_csv('height_and_weight.csv')
del dt['Index']
In [2]: height = dt.drop(columns=['Weight(lbs)'])
weight = dt.drop(columns=['Height(In)'])
inp_X, tst_X, out_y, tst_y = train_test_split(
    height, weight, test_size=0.1)
RegModel = linear_model.LinearRegression()
RegModel.fit(inp_X,out_y)
Out[2]: LinearRegression()
To ease the data splitting task we used the
train_test_split() function from
sklearn.model_selection. That's why we imported
the function in the first cell, where we imported
the data and the required libraries and modules.
In the second cell we divided the data into the
input height and the output weight. Then we
created four variables, inp_X, tst_X, out_y and
tst_y, to store the input for training, input for
testing, output for training and output for
comparing the predictions respectively, using the
train_test_split() function. We passed the input
data height and the output data weight as
arguments, and 0.1 (10%) to the test_size
parameter. The splitting and assigning of the data
can be understood from the figure below:
[Figure: train_test_split(height, weight, test_size=0.1) splits height into 90% and 10%, and weight into 90% and 10%]
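The sizes of the four parts can be verified on generated numbers. In this sketch X and y are made-up arrays, and random_state=0 is an assumption added only to make the shuffle repeatable.

```python
import numpy
from sklearn.model_selection import train_test_split

# Made-up data: 100 samples with one feature, and y = 2 * x
X = numpy.arange(100).reshape(-1, 1)
y = numpy.arange(100) * 2.0

inp_X, tst_X, out_y, tst_y = train_test_split(
    X, y, test_size=0.1, random_state=0)

# 90 rows for training and 10 rows for testing, for input and output alike
print(len(inp_X), len(tst_X), len(out_y), len(tst_y))

# Rows are shuffled, but every input keeps its own output
pairs_kept = bool((out_y == inp_X[:, 0] * 2.0).all())
```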
After that, as we did earlier, we create our model
and train it.
Now we can ask our RegModel to predict the
weight for the tst_X values and compare it with
the tst_y values:
In [3]: pred_y = RegModel.predict(tst_X)
cmp = pandas.DataFrame({'Predictions':pred_y.flatten(),
    'Actual values':tst_y['Weight(lbs)'].values})
cmp.plot(kind='bar',figsize=(7.5,6))
Out[3]: <matplotlib.axes._subplots.AxesSubplot at 0x2038d756c10>
We used the predict() method, passed the test
height values tst_X as the argument and stored the
predictions in pred_y. Then we created a data
frame from the pred_y values (we flattened the
array with the flatten() function to change it
from a 2-D array to a 1-D array before passing it
as values) and the actual tst_y values (we sliced
the Weight(lbs) column and extracted its values).
We plotted a grouped bar graph using the plot()
function with bar as the kind parameter, and
specified the size of the plot using the figsize
parameter.
As you can observe, the predicted values are blue
and the actual values are orange, so we can tell
visually how our model performs. Most of the bars
are close to each other, but some are much taller
or shorter than their partner, i.e. our model
performs well in general but in some cases its
prediction is far off. We can also visualize the
regression line and the data points:
In [4]: matplotlib.pyplot.scatter(tst_X, tst_y, color='black')
matplotlib.pyplot.plot(tst_X, pred_y, color='yellow')
matplotlib.pyplot.show()
[Figure: test data points (black) and the regression line (yellow)]
We used the scatter() function to plot the data
points and plot() function to plot the regression
line
To know how much error there is in the
predictions, i.e. the MAE (Mean Absolute Error),
we can use the mean_absolute_error() function in
sklearn.metrics. It returns the average of the
absolute differences between the predicted and
actual values:
In [6]: sklearn.metrics.mean_absolute_error(tst_y, pred_y)
Out[6]: 11.529538265252379
We passed our actual values tst_y and predicted
values pred_y as arguments to the function. So on
average the predictions of our current model
differ from the actual weights by about 11.5
pounds.
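The same average can be worked out by hand, which shows what the function measures. The four actual and predicted weights below are hypothetical.

```python
import numpy
from sklearn import metrics

# Hypothetical actual and predicted weights in pounds
actual = numpy.array([120.0, 135.0, 150.0, 110.0])
predicted = numpy.array([125.0, 130.0, 140.0, 118.0])

# Absolute differences are 5, 5, 10 and 8, so their mean is 7
manual = numpy.abs(actual - predicted).mean()
library = metrics.mean_absolute_error(actual, predicted)
print(manual, library)
```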
Linear Regression
As learnt earlier, regression algorithms find the
relationship between two variables and predict
values based on that relation. In linear
regression, the algorithm finds the linear
relationship between two variables, i.e. if an
independent variable is changed, the dependent
variable will be affected. For example, consider
the following dataset: we have data of different
squares with the length of one side and the area.
It is clear that if the side (independent
variable) is changed, say increased, the area
(dependent variable) will change (increase) too,
because the area is calculated from the length of
the side.
[Table: Side and Area of different squares]
Mathematically the relation can be expressed as:
Y = mX + b
where Y is the dependent variable (area), X is the
independent variable (side) we are using to make
predictions, m is the slope of the regression
line, which represents the effect X has on Y, and
b is a constant, known as the Y-intercept. If
X = 0, Y would be equal to b.
The relationship can be either positive or
negative.
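The slope m and the intercept b can be read back from a fitted model. In this sketch the data is generated from an assumed line with m = 3 and b = 2, so the fit should recover exactly these values.

```python
import numpy
from sklearn import linear_model

# Data generated from an assumed line Y = 3X + 2
X = numpy.array([[1.0], [2.0], [3.0], [4.0]])
Y = 3.0 * X[:, 0] + 2.0

model = linear_model.LinearRegression().fit(X, Y)
m = model.coef_[0]      # slope of the regression line
b = model.intercept_    # Y-intercept

# When X = 0 the prediction equals b
at_zero = model.predict([[0.0]])[0]
print(m, b, at_zero)
```

A positive m means Y grows with X; a negative m would mean Y shrinks as X grows.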
Simple Linear Regression
Simple linear regression or SLR is the type of
regression in which we predict a value using only
one feature, as we did before when we predicted
the weight of a person using the height.
[Figure: regression line of weight against height]
Multiple Linear Regression
Multiple Linear Regression or MLR is the type of
regression in which we predict a value using two
or more features, like predicting the weight of a
person using the age and height values. Here, the
regression line is calculated using the following
formula:
$$h(x_i) = b_0 + b_1 x_{i1} + \dots + b_p x_{ip}$$
where p is the number of features, h(x_i) is the
predicted value and b_0, b_1, etc. are the
regression coefficients. The data also contains
residual errors e_i, i.e. the differences between
the actual and the predicted values, which gives
rise to the following formula:
$$y_i = h(x_i) + e_i$$
where y_i is the actual value of the dependent
variable.
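Both formulas can be checked numerically on a fitted model. The sketch below assumes two hypothetical features (age and height) and made-up weights; b0, b, h and e are illustrative names for the quantities in the formulas.

```python
import numpy
from sklearn import linear_model

# Hypothetical features: column 0 = age, column 1 = height
X = numpy.array([[20, 60], [25, 65], [30, 62],
                 [35, 70], [40, 68], [45, 72]], dtype=float)
y = numpy.array([110.0, 128.0, 121.0, 149.0, 145.0, 160.0])

model = linear_model.LinearRegression().fit(X, y)
b0 = model.intercept_   # b_0
b = model.coef_         # b_1 and b_2, one coefficient per feature

h = b0 + X @ b          # h(x_i) = b_0 + b_1 x_i1 + b_2 x_i2
e = y - h               # residual errors e_i = y_i - h(x_i)
print(b0, b, e)
```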
The training process for an MLR model is the same
as before, but this time we have more features
from which predictions will be made. Let's create
a model to predict the salary of a person if we
get the age and gender as input. You can download
the dataset from here: [Link] (check the
Resources).
We need to import the same things as we did
before, together with train_test_split, and also
import the data set and inspect the first five
rows using the head() function:
In [1]: import pandas
from sklearn import linear_model, metrics
from sklearn.model_selection import train_test_split
from matplotlib import pyplot
dt = pandas.read_csv('[Link]')
dt.head()
   CustomerID   Genre  Age  Annual Income (k$)  Spending Score (1-100)
0           1    Male   19                  15                      39
1           2    Male   21                  15                      81
2           3  Female   20                  16                       6
3           4  Female   23                  16                      77
4           5  Female   31                  17                      40
Our data has 200 rows and 5 columns. As you can
see, Male and Female are not numeric values, and
the model cannot use them directly, so it is a
good time to practice our preprocessing skills. We
will encode them into numeric values. If you don't
remember how to do so, go back to the Data
analysis and processing lesson.
In [2]: import pandas
from sklearn import linear_model, metrics
from sklearn.model_selection import train_test_split
from matplotlib import pyplot
from sklearn import preprocessing
dt = pandas.read_csv('[Link]')
Enc = preprocessing.LabelEncoder()
Enc.fit(['Male','Female'])
dt['Genre'] = Enc.transform(dt['Genre'])
dt.head()
   CustomerID  Genre  Age  Annual Income (k$)  Spending Score (1-100)
0           1      1   19                  15                      39
1           2      1   21                  15                      81
2           3      0   20                  16                       6
3           4      0   23                  16                      77
4           5      0   31                  17                      40
We have encoded our Male and Female labels into 1
and 0 respectively. Now we can move on to a new
cell and split our data:
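Why Male becomes 1 and Female becomes 0 can be seen from the encoder itself: LabelEncoder sorts the labels it was fitted on and numbers them in that order. A small sketch on a hypothetical list of labels:

```python
from sklearn import preprocessing

Enc = preprocessing.LabelEncoder()
Enc.fit(['Male', 'Female'])

# The learned classes are stored in sorted order: Female first, then Male
print(list(Enc.classes_))

# Each label is replaced by its position in that sorted list
codes = Enc.transform(['Male', 'Female', 'Female'])

# inverse_transform turns the numbers back into labels
back = Enc.inverse_transform([0, 1])
print(codes, back)
```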
In [3]: Input = dt.drop(columns=['CustomerID',
    'Annual Income (k$)',
    'Spending Score (1-100)'])
Output = dt.drop(columns=['CustomerID',
    'Genre','Age',
    'Spending Score (1-100)'])
inp_X, tst_X, out_y, tst_y = train_test_split(
    Input, Output, test_size=0.1)
The input values are stored in Input and the
output values in Output. Then we split the data
for training and testing using the
train_test_split() function. Now we can create
our model and train it:
In [4]: Input = dt.drop(columns=['CustomerID',
    'Annual Income (k$)',
    'Spending Score (1-100)'])
Output = dt.drop(columns=['CustomerID',
    'Genre','Age',
    'Spending Score (1-100)'])
inp_X, tst_X, out_y, tst_y = train_test_split(
    Input, Output, test_size=0.1)
# Training Model
RegModel = linear_model.LinearRegression()
RegModel.fit(inp_X,out_y)
Out[4]: LinearRegression()
Our RegModel is ready to predict the annual
income of people if the gender and age are passed
as input. But we don't know the range of the age
here, so we can execute the following code to find
out:
In [31]: dt['Age'].max() - dt['Age'].min()
Out[31]: 52
The range of the age in the dataset is 52.
Before comparing the predictions and actual
values, we can ask the model to predict some
values, like how much a 30-year-old female earns
and how much a 42-year-old male earns:
In [5]: RegModel.predict([ [0,30], [1,42] ])
Out[5]: array([[60.5504497 ],
       [62.09356734]])
Note the way we passed the values. As Female and
Male are encoded as 0 and 1, we pass [ [0,30],
[1,42] ], and our model tells us that a
30-year-old female earns about 60.5k and a
42-year-old male earns about 62k. Now let's
predict the test input and compare the results
with the actual ones:
In [6]: pred_y = RegModel.predict(tst_X)
cmp = pandas.DataFrame({'Predicted':pred_y.flatten(),
    'Actual':tst_y['Annual Income (k$)'].values})
cmp.plot(kind='bar',figsize=(7.5,6))
Out[6]: <matplotlib.axes._subplots.AxesSubplot at 0x1402ccc89a0>
As you can observe, most of the predictions are
far from the actual values because of the
distribution of the data, so what do we do now?
There are a lot of ways to improve the performance
of a model. We can increase the amount of data: in
our case we have only 200 rows, and to predict
values precisely we would need at least 10 times
the data we have now, because the distribution
isn't normal in this case:
In [7]: cmp['Predicted'].plot(kind='density')
pyplot.show()
cmp['Actual'].plot(kind='density',color='orange')
pyplot.show()
[Figure: density plots of the predicted and the actual values]
We will learn about more methods to improve the
performance of our models in detail in the
performance and metrics lesson
16 CLASSIFICATION ALGORITHM
• Decision Tree
• Logistic regression
• Naive Bayes
We already know what a classification algorithm
is, but there are two types of classification
algorithms: lazy learners and eager learners. Lazy
learners do little work during training and more
during prediction, like the KNN algorithm, whereas
eager learners do most of their work during
training and less during prediction, like decision
trees, Naive Bayes, etc. Now let's create a
classifier using a decision tree.
Decision tree
Let's create a classifier to predict which
flavour a customer likes if the age and gender are
provided as input. To do so we will use the
decision tree, an algorithm that can perform both
classification and regression tasks. It learns the
categories in a dataset by building a decision
tree and then uses the tree to predict the
category of an input.
We will use scikit-learn's DecisionTreeClassifier
to create our model. You can download the data set
from here: [Link]. Now jump onto your Jupyter
notebook and import the packages and modules
needed in this project (train_test_split is
optional):
In [1]: import pandas
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
Now we can import our dataset and preview it
using the head() function:
In [2]: import pandas
from sklearn.tree import DecisionTreeClassifier
dt = pandas.read_csv('[Link]')
dt.head()
   Age  Gender       Flavour
0    6    Male     Chocolate
1    6  Female    Strawberry
2    7    Male     Chocolate
3    8  Female    Strawberry
4   11    Male  Butterscotch
As you can see we need to encode the Gender
labels into numeric values. So let's import the
preprocessing module and encode the labels
In [3]: import pandas
from sklearn.tree import DecisionTreeClassifier
from sklearn import preprocessing
dt = pandas.read_csv('[Link]')
Enc = preprocessing.LabelEncoder().fit(['Male',
    'Female'])
dt['Gender'] = Enc.transform(dt['Gender'])
dt.head()
Out[3]:
   Age  Gender       Flavour
0    6       1     Chocolate
1    6       0    Strawberry
2    7       1     Chocolate
3    8       0    Strawberry
4   11       1  Butterscotch
So the Male and Female labels are encoded into 1
and 0 respectively using the LabelEncoder. Note
that we fitted Male and Female during
initialization.
Now we can move on to a new cell and split our
data into input and output. Then we will create
our classifier CModel using DecisionTreeClassifier
and train the model using the input and output
values:
In [4]: inp_X = dt.drop(columns='Flavour')
out_y = dt.drop(columns=['Age','Gender'])
# Classifier
CModel = DecisionTreeClassifier()
CModel.fit(inp_X,out_y)
Out[4]: DecisionTreeClassifier()
So we trained our model using the fit() function
and our model is ready to make predictions. Let's
ask our model which flavour a 7-year-old boy and a
9-year-old girl will prefer:
In [5]: CModel.predict([ [7,1], [9,0] ])
Out[5]: array(['Chocolate', 'Strawberry'], dtype=object)
According to our model, a 7-year-old boy [7,1]
will prefer chocolate and a 9-year-old girl [9,0]
will prefer strawberry, which is pretty much
correct. We trained our model with only 20 rows of
data, yet we are getting some decent results. But
what if we ask the model to predict what a
30-year-old woman will prefer, i.e. beyond the
range of ages we provided in the data?
In [8]: CModel.predict([[30,0]])
Out[8]: array(['Coffee'], dtype=object)
According to our model, a 30-year-old woman
[30,0] will prefer coffee, which may not be
correct, because the maximum age for women
provided in the dataset is 24. So how did the
model predict that value? Let's find out how our
model performs classifications.
The decision tree, i.e. the algorithm used by the
model to classify the values, can be viewed using
the code below:
In [1]: import graphviz
from sklearn.tree import export_graphviz
# out_file=None makes export_graphviz return the DOT source as a string
data = export_graphviz(CModel,
    out_file=None, feature_names=['Age', 'Gender'],
    class_names=['Chocolate', 'Strawberry',
    'Butterscotch', 'Vanilla', 'Mango',
    'Almond_Choco', 'Coffee'],
    filled=True, rounded=True, special_characters=True)
graph = graphviz.Source(data)
graph
[Figure: the full decision tree. Root node: Age ≤ 20.5, gini = 0.825, samples = 20, value = [3, 6, 2, 3, 2, 2, 2], class = Strawberry]
Before visualizing the decision tree you need to
install graphviz using the command below in your
Anaconda prompt:
conda install -c conda-forge python-graphviz
Then import export_graphviz and graphviz. Using
export_graphviz we create a description of the
decision tree and store it in data, and using the
Source function of graphviz we view the tree.
As you can see, the tree is too big to fit on the
page, so let's understand it by breaking it down.
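If graphviz is not installed, a plain-text view of a tree is also available. The sketch below swaps in scikit-learn's export_text function, which the lesson does not use, and trains on six hypothetical rows rather than the lesson's data set.

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical miniature data: [Age, Gender] -> Flavour
X = [[6, 1], [6, 0], [7, 1], [8, 0], [25, 1], [26, 0]]
y = ['Chocolate', 'Strawberry', 'Chocolate',
     'Strawberry', 'Vanilla', 'Vanilla']

CModel = DecisionTreeClassifier(random_state=0).fit(X, y)

# One line per comparison, indented by depth in the tree
rules = export_text(CModel, feature_names=['Age', 'Gender'])
print(rules)

# An age beyond the training range simply follows the last "greater than" branch
result = CModel.predict([[7, 1], [30, 0]])
print(result)
```

This also shows why an out-of-range age still gets an answer: the tree only compares the input against thresholds it learned, so it always ends in one of the classes seen during training.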
Let's see what will happen if [7,1] (a 7-year-old
boy) is the input.
Starting from the root, the first comparison is
whether the age is less than or equal to 20.5
(Age ≤ 20.5). The age of the input is 7, so we
move on to the True side.
Next the algorithm checks whether the age is
smaller than or equal to 9. Our age is smaller, so
we move down (straight to green).
Now we check whether the gender is smaller than
or equal to 0.5, i.e. whether it is 0 or not. Here
our gender is 1, so we move to the non-green side.
Once again we compare the age: is it smaller than
or equal to 18? 7 is, so we finally stop at the
orange box (orange = True, blue = False) and
choose its class, i.e. Chocolate.
Starting from the root, we reach the conclusion
that the input has the class Chocolate, i.e. a
7-year-old boy likes chocolate.
But you may wonder what the other attributes
present there are, like gini, samples, etc. Gini
is the name of the cost function that is used to
evaluate the splits in the dataset; in its
simplest form it works with a categorical target
variable with two outcomes, "Success" or
"Failure". A perfect Gini index value is 0, and
with two classes the worst is 0.5; with more
classes it can be higher, as the 0.825 at the root
of our tree shows. The value is used to decide
whether to split further or not: you can see that
nodes with a gini score of 0 aren't split further.
samples is the number of data points in the
dataset with the respective characteristics.
Here is the full decision path for the
classification of the 7-year-old boy's
preference. If you want to look at the whole tree
in better quality, execute the code.
[Figure: the part of the decision tree followed for the input [7,1], from the root node (Age ≤ 20.5, gini = 0.825, samples = 20) down to a leaf with gini = 0.0, samples = 2, value = [2, 0, 0, 0, 0, 0, 0], class = Chocolate]
Likewise we can use the decision tree to solve
different kinds of classification problems.
But you may also ask how the tree creates those
comparisons or splits. It isn't necessary to know,
but it helps. First the algorithm calculates the
Gini index for each attribute using the formula
below:
$$\text{Gini} = 1 - (p^2 + q^2)$$
where p^2 + q^2 is the sum of the squared
probabilities of success (p) and failure (q). Then
the dataset is split into two lists of rows, given
the index of an attribute and a split value of
that attribute. Finally the algorithm finds the
best possible split by evaluating the cost (gini)
of each candidate split.
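The gini values printed in the tree can be reproduced from the value lists. The sketch below assumes the Gini impurity used by scikit-learn, i.e. one minus the sum of the squared class proportions; gini_impurity is a hypothetical helper name.

```python
def gini_impurity(counts):
    """Return 1 minus the sum of squared class proportions."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

# Root node of the lesson's tree: value = [3, 6, 2, 3, 2, 2, 2]
root = gini_impurity([3, 6, 2, 3, 2, 2, 2])

# A leaf holding a single class is perfectly pure
pure = gini_impurity([2, 0, 0, 0, 0, 0, 0])

# Two equally likely classes give the two-class worst case
even = gini_impurity([5, 5])
print(round(root, 3), pure, even)
```

The root comes out at 0.825, matching the gini shown for the root node, while the pure leaf gives 0 and the even two-class split gives 0.5.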
Logistic Regression
Logistic regression is a type of model that
predicts the outcome as Yes or No, represented by
the numeric values 1 or 0 respectively. We can use
this type of model to classify a day as rainy or
not, a person as healthy or sick, etc. There are
different types of logistic regression for
different situations.
Binomial Logistic Regression
Binomial or binary logistic regression is used to
predict exactly two outcomes, i.e. either
1 (positive) or 0 (negative).
Let's use a dataset to predict whether it will
rain or not if the temperature and humidity
percent are provided as input. You can download
the data set from here: [Link] (check the
Resources).
Let's import the modules and the dataset together.
This time we will import linear_model and
train_test_split from sklearn:
In [1]: import pandas
from sklearn import linear_model
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
We also imported the accuracy_score() function
from sklearn.metrics to calculate the accuracy
of our model.
Now we can import our dataset, and this time let's
view it as it is:
In [2]: import pandas
from sklearn import linear_model
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
dt = pandas.read_csv('Rainfall_data.csv')
dt
Out[2]:
       Unnamed: 0  Temperature  Humidity%  Rain
0               0           34       74.2   Yes
1               1           19       68.2    No
2               2           28       67.2   Yes
3               3           29       66.6   Yes
4               4           26       57.9   Yes
...           ...          ...        ...   ...
19995       19995           30       77.9   Yes
19996       19996           20       74.8   Yes
19997       19997           14       69.4    No
19998       19998           20       60.6    No
19999       19999           22       64.8    No
20000 rows x 4 columns
As you can see we have 20000 rows and 4 columns
worth of data!
Now we can move on to a new cell and perform the
splitting of the data into train-test input and
train-test output
In [3]: Input = dt.drop(columns=['Unnamed: 0','Rain'])
Output = dt['Rain']
inp_X, tst_X, out_y, tst_y = train_test_split(
    Input.values, Output, test_size=0.01)
We stored the input features, i.e. Temperature
and Humidity%, in Input and the output, i.e. Rain
(Yes or No), in Output. Then we passed these
values to the train_test_split() function and
split the data into training input, testing input,
training output and testing output, where the test
size is 0.01 (1%, i.e. 200 rows).
Now we can create our logistic regression CModel
and train it:
In [4]: Input = dt.drop(columns=['Unnamed: 0','Rain'])
Output = dt['Rain']
inp_X, tst_X, out_y, tst_y = train_test_split(
    Input.values, Output, test_size=0.01)
CModel = linear_model.LogisticRegression()
CModel.fit(inp_X,out_y)
Out[4]: LogisticRegression()
So our model is ready to make predictions. Let's
move on to a new cell and let the model predict.
Then we will compare the values and print the
accuracy score:
In [5]: from sklearn import preprocessing
pred_y = CModel.predict(tst_X)
Enc = preprocessing.LabelEncoder().fit(['Yes','No'])
cmp = pandas.DataFrame({'Predicted':Enc.transform(pred_y),
    'Actual':Enc.transform(tst_y)})
print('Accuracy Score:',accuracy_score(tst_y,pred_y))
cmp.plot(kind='density')
Accuracy Score: 0.91
Out[5]: <matplotlib.axes._subplots.AxesSubplot at 0x1fb7a395d30>
So the model has an accuracy score of 0.91, i.e.
91%, which is really good! You can also see this
in the density plot, where only 9% of the values
are predicted wrong by the model.
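The accuracy score is simply the share of predictions that match the actual labels, which can be confirmed by counting. The two lists below are hypothetical.

```python
from sklearn.metrics import accuracy_score

# Hypothetical actual and predicted labels; only the fourth pair differs
actual = ['Yes', 'No', 'Yes', 'Yes', 'No', 'No', 'Yes', 'No', 'Yes', 'Yes']
predicted = ['Yes', 'No', 'Yes', 'No', 'No', 'No', 'Yes', 'No', 'Yes', 'Yes']

# Count the matching pairs and divide by the number of samples
manual = sum(a == p for a, p in zip(actual, predicted)) / len(actual)
library = accuracy_score(actual, predicted)
print(manual, library)
```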
So how did our model predict the values, or how
does logistic regression work? To understand, we
will look at the mathematics behind the algorithm;
if you want you can move ahead, or give it a read.
The following are the steps of the linear function
of binomial logistic regression:
• We already know that the output will be either
0 (No) or 1 (Yes). For that, the linear function
is used as the input to another function g in
the following relation:
$$h_\theta(x) = g(\theta^T x), \quad 0 \le h_\theta \le 1$$
g is the logistic or sigmoid function, which is
given by the following formula:
$$g(z) = \frac{1}{1 + e^{-z}}$$
where z is θ^T x
• The sigmoid curve can be understood from the
following graph:
the classes can be divided into positive or
negative. The output is the probability of the
positive class and lies between 0 and 1. For our
implementation, we interpret the output of the
hypothesis function as positive if it is bigger
than or equal to 0.5 (≥ 0.5), otherwise negative
• We also need to define a loss function to
measure how well the algorithm performs using
the weights, represented by θ, where h is equal
to g(Xθ):
$$J(\theta) = \frac{1}{m}\left(-y^T \log(h) - (1 - y)^T \log(1 - h)\right)$$
After defining the loss function, our prime goal
is to minimize it
• This can be done by fitting the
weights, i.e. by increasing or decreasing
them. The derivatives of the
loss function with respect to each weight
tell us which parameters should
have a larger weight and which a smaller
one. The following gradient descent equation
tells us how the loss would change if we modified
the parameters:
$\frac{\partial J(\theta)}{\partial \theta_j} = \frac{1}{m} X^T \left(g(X\theta) - y\right)$
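The steps above can be sketched in a few lines of NumPy. This is only a minimal illustration, not the solver that scikit-learn uses: the toy data, the learning rate of 0.5 and the 5000 iterations are assumptions chosen for the example.

```python
import numpy as np

def sigmoid(z):
    # Logistic function g(z) = 1 / (1 + e^(-z))
    return 1.0 / (1.0 + np.exp(-z))

def loss(theta, X, y):
    # Average cross-entropy loss, with h = g(X theta)
    h = sigmoid(X @ theta)
    return float(np.mean(-y * np.log(h) - (1 - y) * np.log(1 - h)))

def gradient(theta, X, y):
    # Derivative of the loss: (1/m) * X^T (g(X theta) - y)
    m = len(y)
    return X.T @ (sigmoid(X @ theta) - y) / m

# Toy data (made up): a bias column of ones plus one feature
X = np.array([[1.0, 0.5], [1.0, 1.5], [1.0, 3.0], [1.0, 4.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])

theta = np.zeros(2)
initial_loss = loss(theta, X, y)
for _ in range(5000):
    theta -= 0.5 * gradient(theta, X, y)   # step against the gradient
final_loss = loss(theta, X, y)

# Classify as positive when the predicted probability is >= 0.5
predictions = (sigmoid(X @ theta) >= 0.5).astype(int)
print(initial_loss, final_loss, predictions)
```

Each pass moves the weights a small step in the direction that lowers the loss, so the loss after training is smaller than the loss at the starting weights.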
Multinomial Logistic Regression
As the name suggests, this time we will have to
predict one of more than two possible outputs. In multinomial
logistic regression we classify into three
or more categories. The categories can be
unordered types like Rain, Hailstorm, Snow, etc., or
ordinal like heavy rain, moderate rain or low
rainfall
Let's consider the previous situation where we
predicted whether it will rain or not,
so let's create a model to predict
whether it will rain heavily, moderately
or lightly. You can download the dataset
from here - [Link] (Check the Resources)
and import the modules as we did
while creating the model to predict
rainfall
In [1]: import pandas
from sklearn import linear_model, metrics
from sklearn.model_selection import train_test_split
Now we can import our data and preview it without
the head() function
In [2]: import pandas
from sklearn import linear_model, metrics
from sklearn.model_selection import train_test_split
dt = pandas.read_csv('[Link]')
dt
       Temperature  Humidity%  Rainfall
0               34       74.2       Low
1               19       68.2   No Rain
2               28       67.2  Moderate
3               29       66.6  Moderate
4               26       57.9       Low
...            ...        ...       ...
17996           31       89.7   No Rain
17997           21       84.7   No Rain
17998           28       74.7   No Rain
17999           30       78.2   No Rain
18000           34       80.4       Low
18001 rows x 3 columns
We have the same Temperature and Humidity%
columns, but the rain is now classified into No Rain,
Low, Moderate and High. Now we can move on to the
next step, i.e. splitting the data
In [3]: Input = [Link](columns='Rainfall')
Output = [Link](columns=['Temperature','Humidity%'])
inp_X,tst_X,out_y,tst_y = train_test_split(
Input,Output,test_size=0.1)
Next we scale our Input data. This step is optional,
but without it we may encounter an error. We will import
the preprocessing module and scale our input data.
Then we can split our data into training and
testing sets, create our model and train it
In [4]: from sklearn import preprocessing
Input = [Link](dt.drop(columns='Rainfall').values)
Output = dt['Rainfall']
inp_X,tst_X,out_y,tst_y = train_test_split(
Input,Output,test_size=0.2)
CModel = linear_model.LogisticRegression()
[Link](inp_X,out_y)
Out[4]: LogisticRegression()
Our model is trained. Now we can compare our model's
predictions with the actual values. To visualize them we
need to use the LabelEncoder and encode the
Rainfall labels into numeric values. We will also
print the accuracy of our model
In [5]: pred_y = [Link](tst_X)
Enc = [Link]().fit(['No Rain',
'Low','Moderate','High'])
cmp = [Link]({'Predicted':[Link](
pred_y),'Actual':[Link](tst_y)})
acc = metrics.accuracy_score(tst_y,pred_y)
print('Accuracy:',acc)
[Link](kind='density')
Accuracy: 0.435156900860872
Out[5]: <[Link]._subplots.AxesSubplot at 0x239fe078.
[Figure: density plot of the Predicted and Actual encoded rainfall labels]
So our model didn't perform well. Here's a
question for you: why is the accuracy of our
model below 50%? (Try to answer without reading further.)
So what do you think? Is it because we scaled the
data? If so, that's not it. The scaling
wasn't strictly necessary, and it was included to tempt you
into thinking it was the reason, but
it is in fact good practice. The actual reason is the
distribution of the data. If you observe the plotted graph of the
predicted and actual values,
you'll notice that the two distributions are really
different. Our dataset is unbalanced: there are few
'High' rainfall rows, concentrated in one place, and
a lot of 'No Rain' and 'Low' rows. We provided our model
with 18000 rows of data, but the distribution
wasn't good, i.e. we didn't have enough data for
moderate or high rainfall. The places where the
lines are together or overlapping are the
predictions our model got right, i.e. mostly
no rainfall and low rainfall. Our model didn't
predict moderate or high rainfall for any value
at all
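A quick way to spot this kind of problem before training is to count how often each label occurs. The sketch below uses only the standard library; the label counts are invented for illustration and are not the counts of the book's dataset. With pandas, dt['Rainfall'].value_counts(normalize=True) gives the same kind of summary.

```python
from collections import Counter

# Hypothetical label column; the counts are invented for illustration
labels = ['No Rain'] * 50 + ['Low'] * 35 + ['Moderate'] * 12 + ['High'] * 3

counts = Counter(labels)
total = len(labels)
shares = {label: count / total for label, count in counts.items()}

# A model that always predicts the most common class scores this accuracy
majority_label, majority_count = counts.most_common(1)[0]
baseline_accuracy = majority_count / total

for label, share in shares.items():
    print(f'{label:<10}{share:.0%}')
print('Majority-class baseline:', baseline_accuracy)
```

If a few labels hold only a small share of the rows, the model sees too few examples of them, and its accuracy should be compared against the majority-class baseline rather than against 100%.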
Encountering problems like this prepares us for
real situations. When this happens we
need to come up with different methods to improve
our model (these will be covered in the Performance
and metrics lesson) or change the algorithm, so
let's look at another classification algorithm
Naive Bayes
The Naive Bayes algorithm is based on Bayes'
theorem, which we already learned in previous
lessons. We have three types of Naive Bayes
algorithms:
• Gaussian, used when the data for each label is
assumed to be drawn from a Gaussian distribution
• Multinomial, used when the data for each label is
assumed to be drawn from a multinomial distribution
• Bernoulli, used when the features are
binary, like 0 or 1
So let's use the Naive Bayes algorithm to create
a model to predict whether it will rain or not
with the dataset used in the binomial logistic
regression. We will use the Gaussian Naive Bayes
algorithm for this model
In [1]: import pandas
from sklearn import naive_bayes, metrics
from [Link] import LabelEncoder
from sklearn.model_selection import train_test_split
dt = pandas.read_csv('Rainfall_data.csv')
This time we imported naive_bayes from sklearn
to create our model and LabelEncoder to encode
the labels while comparing the predictions. Next
we can split our data into training input, testing
input, training output and testing output
respectively using the train_test_split()
function with test_size set to 10%
In [2]: Input = [Link](columns=['Unnamed: 0','Rain'])
Output = dt['Rain']
inp_X,tst_X,out_y,tst_y = train_test_split(
Input,Output,test_size=0.1)
Now to create our model, we will use the
GaussianNB class from naive_bayes and train it
with the fit() method
In [3]: Input = [Link](columns=['Unnamed: 0','Rain'])
Output = dt['Rain']
inp_X,tst_X,out_y,tst_y = train_test_split(
Input,Output,test_size=0.1)
CModel = naive_bayes.GaussianNB()
[Link](inp_X,out_y)
Out[3]: GaussianNB()
Our CModel is ready to predict. So let's test
our model with the predict() method and compare
the answers using the density plot. We will also
print the accuracy score
In [4]: pred_y = [Link](tst_X)
Enc = LabelEncoder().fit(['Yes','No'])
cmp = [Link]({'Predicted':[Link](
pred_y),'Actual':[Link](tst_y.values)})
acc = metrics.accuracy_score(tst_y,pred_y)
print('Accuracy:',acc)
[Link](kind='density')
Accuracy: 0.9275
Out[4]: <[Link]._subplots.AxesSubplot at 0x26al30ee<
The model has an accuracy of about 93%, and the
predicted line and the actual line in the graph
almost overlap each other. Our Naive Bayes
model has performed slightly better than the logistic
regression model, which scored 91%
17 SUPPORT VECTOR MACHINE
• What is SVM
• SVM Models
• SVM Kernels
What is Support Vector Machine
Support vector machines, or SVMs, are a case of
supervised machine learning used for both
regression and classification. But the working
of SVMs can be considered a bit different. An
SVM algorithm divides a dataset into different
classes or categories using a hyperplane in
multidimensional space. The categories are divided
so as to find the maximum marginal
hyperplane, or MMH. These hyperplanes are
generated in an iterative manner to minimize
errors
In the above graph, the data points closest to
the hyperplane are called support vectors.
The line separating class A and class B is
called the hyperplane, i.e. it divides the data
into two classes.
The margin is the gap between
the two lines through the closest data points of different
classes. It can be calculated as the perpendicular
distance from the line to the support vectors.
A large margin is considered a good margin and
a small margin is considered a bad margin.
The SVM algorithm finds the maximum marginal
hyperplane to divide the datapoints into different
classes
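To make the idea of a margin concrete, the sketch below measures the perpendicular distance from a few points to a separating line and adds up the two smallest distances. The line (w and b) and the points are made up for illustration; an SVM would search for the line that makes this margin as large as possible.

```python
import numpy as np

# Hypothetical separating line w.x + b = 0 (here: x1 + x2 - 3 = 0)
w = np.array([1.0, 1.0])
b = -3.0

# Made-up points: class A lies below the line, class B above it
class_a = np.array([[0.0, 1.0], [1.0, 1.0], [0.5, 0.0]])
class_b = np.array([[2.0, 2.0], [3.0, 2.5], [2.5, 4.0]])

def distance_to_line(points):
    # Perpendicular distance |w.x + b| / ||w|| for every point
    return np.abs(points @ w + b) / np.linalg.norm(w)

dist_a = distance_to_line(class_a)
dist_b = distance_to_line(class_b)

# The closest point of each class plays the role of a support vector
support_a = class_a[dist_a.argmin()]
support_b = class_b[dist_b.argmin()]
margin = dist_a.min() + dist_b.min()
print(support_a, support_b, margin)
```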
163 SUPPORT VECTOR MACHINES
Support Vector Machine models
We can create our own SVM model using sample
datasets, so jump into your Jupyter notebook and
import the following modules:
import numpy
from matplotlib import pyplot
import seaborn
from [Link].samples_generator import make_blobs
In [1]: import numpy
from matplotlib import pyplot
import seaborn
from [Link].samples_generator\
import make_blobs
We will use numpy to modify the data, matplotlib
and seaborn to visualize our data, and
make_blobs() to create a dataset which will be
linearly separable. In recent scikit-learn releases
make_blobs is imported directly from sklearn.datasets
In [2]: X, y = make_blobs(n_samples=100,centers=2,
cluster_std=0.50)
Using the make_blobs() function we created a
dataset with 100 samples (data points), specified
in the n_samples parameter. We want data with 2
centres (concentrations), that's why we passed 2 to the centers
parameter,
and finally the standard deviation of each cluster
is 0.5, specified in the cluster_std
parameter
We can see the input and output data, but make
sure to view them in a different cell, because
every time you run the cell with the make_blobs()
function the values will be reassigned, i.e. they
will change
In [3]: X[:5],y[:5]
Out[3]: (array([[-3.04236107, -5.08307767],
[ 4.45165507, 7.04554459],
[-4.0071927 , -5.82995397],
[-3.27505396, -4.85255011],
[-3.86582416, -4.44910783]]),
array([1, 0, 1, 1, 1]))
So the input is a 2-D array whose rows are 1-D arrays with 2
elements (features), and the output values are 0 and
1, the cluster each row belongs to. Within every row,
the value at index 0 is the first feature and the value at
index 1 is the second feature.
We need to split them into different variables
to visualize them, so let's slice all the values
with index 0 and index 1 from the 1-D arrays and
store them in X1 and X2
In [4]: X1 = X[:, 0]
X2 = X[:, 1]
print(X1[:2])
print(X2[:2])
[-3.04236107 4.45165507]
[-5.08307767 7.04554459]
Now we can visualize our dataset using the
scatter plot
In [5]: X1 = X[:, 0]
X2 = X[:, 1]
[Link](X1,X2)
Out[5]: <[Link] at 0xl043f478
As you can see, the datapoints are
concentrated in two places, i.e. they form 2 clusters.
We can draw lines to separate them.
Using numpy to create the line and matplotlib to
plot the lines in the graph, we can visualize
this
In [10]: xfit = [Link](-5,5)
[Link](X1,X2,c=y)
for m, b in [(0.375,0.5),(-0.5,2)]:
    [Link](xfit, m * xfit + b, '-k')
[Link](-5,5)
Out[10]: (-5.0, 5.0)
The code may be a bit hard to understand, so let's
break it down. First of all we created the xfit
variable and stored evenly spaced values from -5 to 5 in it as
an array using the linspace() function; these are
the x values of the lines. Then we plotted a graph with our two data
clusters. Then we created a for loop to plot the
lines in the graph using the plot() function. Each
pair in the loop holds the slope m and the intercept b of
one line, and multiplying and adding gives the y value
of the line for every x value in xfit:
m * xfit + b = 0.375 * xfit + 0.5 (first line)
m * xfit + b = -0.5 * xfit + 2 (second line)
Then we limit the x-axis to -5 to
5 to view our lines stretched from the left to the
right
So we have plotted two lines that separate
the data into two classes. As we already know, the
SVM algorithm finds the maximum marginal
hyperplane (MMH). We can illustrate this for the two lines
by drawing margins of some width around them, like
so:
In [16]: xfit = [Link](-5,5)
[Link](X1,X2,c=y)
for m, b, d in [(0.375,0.5,3), (-0.5,2,6)]:
    yfit = m * xfit + b
    [Link](xfit, yfit, '-k')
    pyplot.fill_between(xfit, yfit - d, yfit + d,
        edgecolor='none',color='#AAAAAA', alpha=0.4)
[Link](-5,5)
Out[16]: (-5.0, 5.0)
As you can see, we have drawn margins around
the lines up to the nearest support vector. The
margins are very wide because the data clusters
are a bit far apart.
We did this the same way as before, plotting our graph
with the data clusters and separator lines, and then we
used the fill_between() function to create the
margins. We passed the x values and two sets of y values
to fill between them to represent our margin. We
set edgecolor to 'none', the fill color to
grey (#AAAAAA) and alpha, i.e. the transparency, to 0.4.
Now we can import the SVC support vector
classifier from sklearn.svm to create our model
In [36]: from [Link] import SVC
CModel = SVC(kernel='linear')
[Link](X, y)
Out[36]: SVC(kernel='linear')
Let's create a function to plot the maximum
marginal hyperplane and the support vectors using
our CModel
In [40]: def MMH(model, ax=None, plot_support=True):
    # 2-D graph plot
    if ax is None:
        ax = [Link]()
    xlim = ax.get_xlim()
    ylim = ax.get_ylim()
    # Creating grid
    x = [Link](xlim[0], xlim[1], 30)
    y = [Link](ylim[0], ylim[1], 30)
    Y, X = [Link](y, x)
    xy = [Link]([[Link](), [Link]()]).T
    P = model.decision_function(xy).reshape([Link])
    # Plotting boundaries and margins
    [Link](X, Y, P, colors='k',
        levels=[-1, 0, 1], alpha=0.5,
        linestyles=['--'])
    # Plotting support vectors
    if plot_support:
        [Link](model.support_vectors_[:, 0],
            model.support_vectors_[:, 1],
            s=300, linewidth=1, facecolors='none')
    ax.set_xlim(xlim)
    ax.set_ylim(ylim)
First of all we take the model, ax (axes) and
plot_support (whether to plot support vectors or not) as
parameters. Then we start off with
the 2-D graph plot, and if we don't pass the axes
we get the current ones using the gca() function. We
also find the x-axis limits and y-axis limits using
the get_xlim() and get_ylim() functions
respectively and store them in xlim and ylim
Then we create the grid, the base of our
plot, using xlim and ylim. As we did before, we
create values for the lines using the linspace()
function, create the grid using the meshgrid()
function, and then use the vstack() function to
vertically stack the arrays after they are
flattened using the ravel() function. We also call
decision_function() to get the values for
the boundaries and margins.
Next we use the data to plot the boundaries and
margins using the contour() function to draw the
lines, and specify the linestyles and the other
properties using the respective parameters.
At last, we check whether to plot the support
vectors or not and, if so, plot them using the
scatter() function with the
values at indexes 0 and 1 of the support_vectors_
array
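The grid-building step is easier to follow on a tiny example. The sketch below uses a 3 x 2 grid instead of 30 x 30; the axis values are arbitrary and chosen only so that the result is easy to read.

```python
import numpy as np

x = np.linspace(0, 2, 3)        # three x values: 0, 1, 2
y = np.linspace(10, 11, 2)      # two y values: 10, 11

# meshgrid pairs every x with every y; Y and X both have shape (3, 2)
Y, X = np.meshgrid(y, x)

# ravel() flattens each grid, vstack stacks them as two rows,
# and .T turns the result into one (x, y) pair per row
xy = np.vstack([X.ravel(), Y.ravel()]).T
print(X.shape, xy.shape)
print(xy)
```

Each row of xy is one grid point, which is the input layout that decision_function() expects: one sample per row and one feature per column.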
Finally we can plot our data clusters, call
the MMH() function and pass our SVC model
In [41]: [Link](X1,X2,c=y)
MMH(CModel)
Finally we have the maximum marginal hyperplane
plotted for our data clusters with the support
vectors
Support Vector Machine Kernels
Support vector machines are implemented with
kernels that transform the input data space into
a higher-dimensional one, which gives the
support vector machine more flexibility. In
the previous model we used the linear kernel, but there
are different types of kernels:
• Linear kernel, the dot product of two observations,
used when the classes can be separated by a straight line
• Polynomial kernel, a more generalized version
of the linear kernel, used when the input space is
non-linear
• Radial Basis Function (RBF) kernel, commonly used for
SVMs, which maps the input space into infinite
dimensions
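A kernel is simply a function that takes two observations and returns a similarity value. The sketch below writes the three kernels by hand for a single pair of points. The values of degree, coef0 and gamma are arbitrary assumptions, and the polynomial kernel is simplified by leaving out the extra gamma factor that scikit-learn applies to the dot product.

```python
import numpy as np

def linear_kernel(a, b):
    # K(a, b) = a . b
    return float(a @ b)

def polynomial_kernel(a, b, degree=2, coef0=1.0):
    # K(a, b) = (a . b + coef0) ** degree
    return float((a @ b + coef0) ** degree)

def rbf_kernel(a, b, gamma=0.5):
    # K(a, b) = exp(-gamma * ||a - b||^2)
    return float(np.exp(-gamma * np.sum((a - b) ** 2)))

a = np.array([1.0, 2.0])
b = np.array([3.0, 0.0])

print(linear_kernel(a, b))
print(polynomial_kernel(a, b))
print(rbf_kernel(a, b))
```

The RBF value depends only on the distance between the two points, which is why its decision boundaries can curve around groups of nearby points.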
This time we will use a sample dataset provided
by sklearn to understand the different
kernels. First of all we need to import our data
and prepare it. Import the following:
import numpy
from sklearn import svm, datasets
from matplotlib import pyplot
In [1]: import numpy
from sklearn import svm, datasets
from matplotlib import pyplot
We load the iris dataset (a sample dataset of iris flowers)
from the datasets module
In [2]: dt = datasets.load_iris()
X = [Link][:, :2]
y = [Link]
We imported the dataset and split it into the
input data[:, :2] (the elements at indexes 0 and 1 in
all the arrays, i.e. the first two features) and the
output target, and stored
them in X and y respectively
Now we need the data to plot the SVM boundaries,
i.e. the regions of the input space assigned to each class. To do so
we need to create a grid as we did before. To build
the grid we need the minimum and maximum values of the
two input features. Then we flatten the grid arrays
using ravel() and pass them to numpy.c_,
which stacks arrays along their last
axis after upgrading them to at least 2-D. The result is
stored in X_plot as the testing input
In [3]: x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
h = (x_max / x_min)/100
xx, yy = [Link]([Link](x_min, x_max, h),
[Link](y_min, y_max, h))
X_plot = numpy.c_[[Link](), [Link]()]
We have the required data to train the SVC
classifier, so let's create it
In [4]: SvcModel = [Link](kernel='linear',C=1.0).fit(X, y)
We create the SvcModel using the SVC()
class, passing linear to the kernel parameter
and 1.0 (float) to the regularization parameter C.
We also train it using the fit() method
Now we can predict the X_plot values and store
them in Z. We will reshape it using the reshape()
function to the shape of the xx meshgrid.
First of all we will plot the figure (base). Then
we will add a subplot and draw the filled contours
using the subplot(121) and contourf() functions.
We pass the values created with the meshgrid()
function and the Z predicted values. Now we can plot
the data points using the scatter() plot,
finally limiting the x-axis to the minimum and maximum
values of the xx meshgrid.
We can see how our dataset is divided into
different regions by the support vector classifier
with the linear kernel
In [5]: Z = [Link](X_plot)
Z = [Link]([Link])
[Link](figsize=(15, 5))
[Link](121)
[Link](xx, yy, Z, alpha=0.3)
[Link](X[:, 0], X[:, 1],c=y)
[Link]([Link](), [Link]())
Out[5]: (3.3, 8.882727272727251)
Similarly we can create a SVC using the Radial
Basis Function kernel.
In [10]: RbfSvc = [Link](kernel='rbf',C=1.0).fit(X, y)
Z = [Link](X_plot)
Z = [Link]([Link])
[Link](figsize=(15, 5))
[Link](121)
[Link](xx, yy, Z,alpha=0.3)
[Link](X[:, 0], X[:, 1],c=y)
[Link]([Link](), [Link]())
Out[10]: (3.3, 8.882727272727251)
You can compare the plots produced with the
linear and rbf kernels and notice a clear
difference: the linear kernel draws straight boundaries,
while the rbf kernel draws curved ones
18 CLUSTERING ALGORITHM
• Clustering
• K-Means algorithm
• Mean shift algorithm
• Hierarchical clustering
Clustering is a case of unsupervised machine
learning. Clustering algorithms learn
relations in the data and divide it into
groups; depending on the algorithm, the number of groups
is either provided as input or found by the algorithm
Clustering
The following are the different types of
clustering:
• Density-based, clusters are formed as dense
regions. These algorithms have good accuracy
and the capability to merge two clusters.
Examples: Density-Based Spatial Clustering of
Applications with Noise (DBSCAN), Ordering
Points To Identify the Clustering Structure (OPTICS)
• Hierarchical-based, clusters are formed as a
hierarchical tree, either Agglomerative
(bottom-up approach) or Divisive (top-down
approach). Examples: Clustering Using Representatives
(CURE), Balanced Iterative Reducing and Clustering
using Hierarchies (BIRCH)
• Partitioning, clusters are formed by partitioning
the objects into k clusters; the number of clusters is
equal to the number of partitions. Example: K-Means
• Grid, clusters are formed as a grid. This method
is fast and independent of the number of
objects. Examples: Statistical Information Grid
(STING), Clustering in Quest (CLIQUE)
176 CLUSTERING
Until now we have calculated the accuracy of supervised
learning algorithms from the predicted values and
actual values, but how can we do so for unsupervised
learning algorithms, when we are
dealing with unlabeled data?
There are some metrics that can be
used to evaluate the performance or quality of
different unsupervised learning algorithms from the
resulting clusters
Silhouette analysis is used to check the quality of a
clustering model by measuring the distance
between the clusters. It gives us a
way to assess parameters like the number of
clusters with the help of the silhouette score.
The silhouette score ranges from -1 to 1. The
different values represent the following:
• 1 is the situation when the cluster is far
away from its neighbouring cluster
• 0 is the situation when the cluster is very
close to or on the decision boundary
separating the clusters
• -1 is the situation when the clusters aren't formed
correctly
The silhouette score can be calculated using the
following formula:
$\text{Silhouette Score} = \frac{p - q}{\max(p, q)}$
where p is the mean distance to the points in the
nearest other cluster and q is the mean intra-cluster
distance to all the points of the same cluster
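The formula above can be checked by hand on a tiny dataset. The sketch below computes p and q for every point and averages the resulting scores; the four points and their labels are made up, and the function assumes that every cluster has at least two points. In practice sklearn.metrics.silhouette_score computes the same kind of score for a fitted clustering.

```python
import numpy as np

# Made-up 1-D data: two tight, well-separated clusters
points = np.array([[0.0], [1.0], [10.0], [11.0]])
labels = np.array([0, 0, 1, 1])

def silhouette(points, labels):
    scores = []
    for i in range(len(points)):
        dist = np.linalg.norm(points - points[i], axis=1)
        same = labels == labels[i]
        same[i] = False                     # exclude the point itself
        q = dist[same].mean()               # mean intra-cluster distance
        # p: mean distance to the points of the nearest other cluster
        p = min(dist[labels == k].mean()
                for k in np.unique(labels) if k != labels[i])
        scores.append((p - q) / max(p, q))
    return float(np.mean(scores))

score = silhouette(points, labels)
print(score)
```

Because the two groups are far apart compared with their own spread, the score comes out close to 1.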
Next we can use the Davies-Bouldin index to know
whether the clusters are well spaced from each
other and how dense they are. We
can calculate the DB index using the following
formula:
$DB = \frac{1}{n} \sum_{i=1}^{n} \max_{j \ne i} \left( \frac{\sigma_i + \sigma_j}{d(c_i, c_j)} \right)$
where n is the number of clusters, $\sigma_i$ is the
average distance of all points in cluster i from
the cluster centroid $c_i$, and $d(c_i, c_j)$ is the
distance between the centroids of clusters i and j.
Lower values indicate good performance, where 0
is the minimum value
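As a rough illustration, the DB index can be computed by hand. The sketch assumes the usual definition: for each cluster, take the worst ratio of (spread of cluster i + spread of cluster j) to the distance between their centroids, then average over the clusters. The data is made up, and sklearn.metrics.davies_bouldin_score offers a ready-made version.

```python
import numpy as np

# Made-up data: two compact clusters, 10 units apart
points = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])
labels = np.array([0, 0, 1, 1])

def davies_bouldin(points, labels):
    ids = np.unique(labels)
    centroids = np.array([points[labels == k].mean(axis=0) for k in ids])
    # Spread: average distance of a cluster's points to its centroid
    spread = np.array([
        np.linalg.norm(points[labels == k] - centroids[n], axis=1).mean()
        for n, k in enumerate(ids)])
    worst = []
    for i in range(len(ids)):
        ratios = [(spread[i] + spread[j])
                  / np.linalg.norm(centroids[i] - centroids[j])
                  for j in range(len(ids)) if j != i]
        worst.append(max(ratios))           # most similar other cluster
    return float(np.mean(worst))

db_index = davies_bouldin(points, labels)
print(db_index)
```

Here each cluster has a spread of 1 and the centroids are 10 apart, so the index is small, which matches the idea that lower values mean better-separated clusters.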
The Dunn index is another metric that can be used to
evaluate the performance of a clustering
algorithm. It is similar to the DB index, but the
differences are:
• It considers only the clusters closest together,
whereas the DB index considers all of the clusters
• A lower Dunn index indicates worse performance,
whereas the lower the DB index, the better the
performance of the algorithm
The Dunn index can be calculated using the
following formula:
$D = \frac{\min_{1 \le i < j \le n} P(i, j)}{\max_{1 \le k \le n} q(k)}$
where i, j and k are indices of clusters, n is the
number of clusters, P is
the inter-cluster distance and q is the
intra-cluster distance.
The Dunn index increases with the performance of
the clustering algorithm
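The Dunn index can also be worked out by hand. Several definitions of the inter-cluster and intra-cluster distance are in use; the sketch below assumes the smallest distance between points of two different clusters for P and the largest distance between two points of the same cluster for q. The data is made up.

```python
import numpy as np
from itertools import combinations

# Made-up data: two compact clusters, 10 units apart
points = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])
labels = np.array([0, 0, 1, 1])

def pairwise(a, b):
    # Euclidean distance between every row of a and every row of b
    return np.linalg.norm(a[:, None, :] - b[None, :, :], axis=2)

def dunn_index(points, labels):
    clusters = [points[labels == k] for k in np.unique(labels)]
    # P(i, j): smallest distance between points of two different clusters
    inter = min(pairwise(a, b).min() for a, b in combinations(clusters, 2))
    # q(k): largest distance between two points of the same cluster
    intra = max(pairwise(c, c).max() for c in clusters)
    return float(inter / intra)

dunn = dunn_index(points, labels)
print(dunn)
```

A large gap between clusters and a small cluster diameter both push the index up, which is why higher values indicate better clustering.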
So we can compare the performance of different
clustering algorithms (rather than measure the accuracy of one
algorithm) with the following metrics:
• Silhouette Score
• Davies-Bouldin Index
• Dunn Index
Let's see a clustering algorithm in action. We
will use the make_blobs() function to create a
dataset with clusters, as we did at the time of
SVM. So let's import the modules and functions we
need
In [1]: from [Link] import KMeans
from [Link] import make_blobs
from matplotlib import pyplot
In a new cell, create the dataset with the
make_blobs() function, passing 4 to the centers
parameter, 240 to n_samples for 240 datapoints
and 0.6 as the standard deviation. Keep the cell
untouched or the data will be re-assigned randomly
In [2]: dt,y = make_blobs(n_samples=240,centers=4,
cluster_std=0.60)
Let's visualize the data (the values at indexes 0 and 1)
using the scatter() plot
In [3]: X1 = dt[:,0]
X2 = dt[:,1]
[Link](X1,X2)
Out[3]: <[Link] at 0x259b413
Now we can create our clustering algorithm, i.e.
K-Means. Create a Clstr using KMeans and
specify 4 in n_clusters, i.e. the number of
clusters
In [4]: Clstr = KMeans(n_clusters=4)
[Link](dt)
Out[4]: KMeans(n_clusters=4)
So our clustering algorithm is ready! Let's ask
the Clstr to predict the clusters in the dt
dataset and plot them
In [5]: pred = [Link](dt)
[Link](X1,X2,c=pred)
centers = Clstr.cluster_centers_
[Link](centers[:, 0],centers[:, 1],
marker='x',c='cyan',s=80)
Out[5]: <[Link] at 0x259b3fa<
We stored the predictions to use them as the
color in the plot. We also extracted the centres
of our clusters from the Clstr using the
cluster_centers_ attribute and plotted
them as the centres of our clusters in
the dataset. As you can see, our Clstr has clustered
the data into 4 clusters
K-Means Algorithm
We already know there are different types of
clustering algorithms, and the K-Means algorithm is one
of those. The K-means clustering algorithm computes
the centroids and iterates until it finds the
optimal centroids. While using this algorithm we
always need to pass the number of clusters
(n_clusters). It is also called a flat clustering
algorithm. The number of clusters identified from
the data by the algorithm is represented by 'K' in K-means
We already used the K-Means algorithm in the
previous example. So let's see how it worked.
First of all it took K (n_clusters) datapoints
from the sample as starting centroids, 4 in our case,
and formed one cluster around each of them by
assigning every datapoint to the nearest centroid.
The choice of the starting centroids involves
randomness. Because our sample data was generated as 4
equally sized blobs, the clusters end up with
about 60 datapoints each.
Next it computes the
cluster centroid of
each of the clusters formed
Next the algorithm keeps iterating the following steps
until it finds the optimal centroids, i.e. until the
assignment of data points to the clusters
stops changing:
• Compute the sum of squared distances between the datapoints
and the centroids
• Assign each datapoint to the cluster whose centroid
is closest
• Recompute the centroid of each cluster by
taking the average of all the datapoints in it
K-means follows the Expectation-Maximization approach
to solve the problem. The Expectation step
assigns the data points to the closest
cluster and the Maximization step
computes the centroid of each cluster
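The two alternating steps can be written out in a few lines of NumPy. This is a bare-bones sketch for illustration, not the implementation behind sklearn's KMeans: the function name k_means, the fixed random seed and the toy data are all assumptions made for the example.

```python
import numpy as np

def k_means(data, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Start from k distinct data points chosen at random
    centroids = data[rng.choice(len(data), size=k, replace=False)]
    for _ in range(n_iter):
        # Expectation step: assign each point to its closest centroid
        dist = np.linalg.norm(data[:, None, :] - centroids[None, :, :], axis=2)
        labels = dist.argmin(axis=1)
        # Maximization step: move each centroid to the mean of its points
        new_centroids = np.array([
            data[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)])
        if np.allclose(new_centroids, centroids):
            break                      # the centroids have stopped moving
        centroids = new_centroids
    return labels, centroids

# Made-up data: two well-separated groups of three points
data = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                 [9.0, 9.0], [9.0, 10.0], [10.0, 9.0]])
labels, centroids = k_means(data, k=2)
print(labels)
print(centroids)
```

The cluster numbers themselves are arbitrary (which group gets 0 and which gets 1 depends on the starting centroids); what matters is that points of the same group share a label.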
Mean-Shift Algorithm
Next we have the Mean-Shift algorithm, which
assigns the datapoints to clusters iteratively
by shifting points towards the highest density of
datapoints, i.e. the cluster centroid. The number of
clusters is determined by the algorithm itself after
receiving the data and analysing it.
We can create a sample dataset using the make_blobs()
function for clustering with the MeanShift
algorithm. Let's import the modules and functions
we need
In [1]: import numpy
from [Link] import MeanShift
from matplotlib import pyplot
from [Link] import make_blobs
Then we can move on to a new cell and create our
dataset using the make_blobs() function
In [2]: dt, y = make_blobs(n_samples=270,centers=[[2,3],
[4,5],[3,10]], cluster_std = 0.5)
This time we specified the positions of the
clusters by passing the centers (x and y values) as
a 2-D array, so the number of clusters is equal to
the number of centers passed. We can view our data using
the scatter() plot
In [3]: X1 = dt[:,0]
X2 = dt[:,1]
[Link](X1,X2)
Out[3]: <[Link] at 0xldbd820e
Now we can create a clustering model using the
MeanShift algorithm and pass our dt data
In [4]: Clstr = MeanShift()
[Link](dt)
0ut[4]: MeanShift()
We created the Clstr object of the MeanShift
class and passed our data through the fit()
method. Note that we didn't specify the
number of clusters in our dataset as we did with
the K-Means algorithm
Now let's find the number of clusters predicted
by the algorithm and plot the clusters classified
by our MeanShift algorithm
In [5]: labels = Clstr.labels_
cen = Clstr.cluster_centers_
n_clusters = len([Link](labels))
print("Estimated No. of Clusters:", n_clusters)
colors = 10*['r.','g.','b.']
for i in range(len(dt)):
    [Link](dt[i][0], dt[i][1], colors[labels[i]])
[Link](cen[:,0],cen[:,1],
    marker='x',color='k',s=100,zorder=10)
Estimated No. of Clusters: 3
Out[5]: <[Link] at 0xldbd929725
As you can see, the Clstr has found 3
clusters in our dataset, which is correct. We extracted
the label of each datapoint using the labels_
attribute, which we use to plot the clusters.
Then we found the cluster centres or centroids using
the cluster_centers_ attribute of the Clstr
object, which holds their x and y values. The
number of unique labels (equal to the length of the
centres array) represents the number of clusters found
by the Clstr. Then using a for loop we plotted
the clusters (note the use of labels to color
datapoints in different clusters differently) and
at last we plotted the centers of the clusters using
the scatter() plot, specifying the zorder as
10 to draw them on top
So how did the MeanShift algorithm find the
clusters even though we didn't specify the
number of clusters beforehand?
First of all, it starts with every data point
assigned to a cluster of its own. Then the
algorithm computes the centroids and updates
their locations.
This process is repeated, moving the centroids towards the
higher density regions, until the centroids reach a
position from which they cannot move further.
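The idea can be sketched with a simplified, flat-window version of mean shift. This is an illustration under stated assumptions and not sklearn's MeanShift: the function name mean_shift, the bandwidth of 2.0 and the toy data are made up, and every candidate centroid simply moves to the mean of the points within the bandwidth until nothing moves any more.

```python
import numpy as np

def mean_shift(data, bandwidth, n_iter=50):
    # Every data point starts as its own candidate centroid
    centroids = data.copy()
    for _ in range(n_iter):
        shifted = []
        for c in centroids:
            # Move the centroid to the mean of the points within the bandwidth
            near = data[np.linalg.norm(data - c, axis=1) <= bandwidth]
            shifted.append(near.mean(axis=0))
        shifted = np.array(shifted)
        if np.allclose(shifted, centroids):
            break                      # no centroid can move any further
        centroids = shifted
    # Centroids that ended up at the same place describe one cluster
    return np.unique(np.round(centroids, 3), axis=0)

# Made-up data: two dense groups of three points each
data = np.array([[1.0, 1.0], [1.2, 0.8], [0.8, 1.2],
                 [5.0, 5.0], [5.2, 4.8], [4.8, 5.2]])
centers = mean_shift(data, bandwidth=2.0)
print(len(centers), 'clusters found')
print(centers)
```

The number of clusters is never passed in; it falls out of how many distinct places the centroids settle in, which depends on the bandwidth.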
Hierarchical clustering
As the name suggests, hierarchical clustering is
a form of clustering where the dataset is divided
into groups or categories like a tree, which may end
in individual datapoints or in a single cluster.
There are two types of hierarchical clustering, as
follows:
• Agglomerative, a type of hierarchical
clustering where each individual datapoint is
considered a cluster, and the clusters then merge or
agglomerate into clusters of 2, then 4, then 8 and so
on, until they form a single cluster of the whole
dataset
[Figure: agglomerative clustering, from individual datapoints through shapes up to the whole dataset cluster]
• Divisive, the opposite of agglomerative,
where the whole dataset is considered one cluster
and is then divided down to individual datapoints
[Figure: divisive clustering, from the whole dataset cluster through shapes (filled circle, un-filled circle, filled square, un-filled square) down to individual datapoints]
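The agglomerative (bottom-up) approach can be traced by hand on a handful of numbers. The sketch below is plain Python with made-up ages; it assumes single linkage, i.e. the distance between two clusters is the gap between their two closest members, which is only one of several possible linkage rules.

```python
# Made-up ages; each value starts as its own cluster
ages = [6, 7, 8, 15, 16, 24]
clusters = [[a] for a in ages]
merges = []

def gap(a, b):
    # Single linkage: distance between the two closest members
    return min(abs(x - y) for x in a for y in b)

while len(clusters) > 1:
    # Find the pair of clusters with the smallest gap
    pairs = [(gap(clusters[i], clusters[j]), i, j)
             for i in range(len(clusters))
             for j in range(i + 1, len(clusters))]
    distance, i, j = min(pairs)
    merged = clusters[i] + clusters[j]
    merges.append((distance, merged))
    # Replace the two clusters with their union
    clusters = [c for n, c in enumerate(clusters) if n not in (i, j)] + [merged]

for distance, merged in merges:
    print(distance, merged)
```

Reading the printed merges from top to bottom gives the tree from the leaves to the root; reading them from bottom to top corresponds to the divisive (top-down) view.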
Let's create a hierarchical model to see it in
action. We will use the flavours.csv dataset used
with the DecisionTreeClassifier (page no. 145)
classification algorithm.
Jump into your Jupyter notebook and import the
following modules
In [1]: import pandas
import numpy
from [Link] import AgglomerativeClustering
from matplotlib import pyplot
from [Link] import dendrogram
from [Link] import LabelEncoder
Before creating the model we need to encode the
flavour labels using the LabelEncoder. We will
import the data and encode it
In [2]: dt = pandas.read_csv('[Link]')
Enc = LabelEncoder().fit(dt['Flavour'])
dt['Flavour'] = [Link](dt['Flavour'])
Now we can create our agglomerative hierarchical
clustering model, Clstr. We will use the age and
flavour data
In [3]: X = [Link](columns='Gender')
Clstr = AgglomerativeClustering(distance_threshold=0,
n_clusters=None)
[Link](X)
Out[3]: AgglomerativeClustering
We passed 0 to the distance_threshold parameter (the
linkage distance threshold above which clusters
will not be merged) and None to n_clusters, which means the
model builds the full tree over the whole dataset
Now we can create a function to plot the hierarchy
In [4]: # Create linkage matrix and then plot the dendrogram
def plot_dendrogram(model, **kwargs):
    # create the counts of samples under each node
    counts = [Link](model.children_.shape[0])
    n_samples = len(model.labels_)
    for i, merge in enumerate(model.children_):
        current_count = 0
        for child_idx in merge:
            if child_idx < n_samples:
                current_count += 1 # leaf node
            else:
                current_count += counts[
                    child_idx - n_samples]
        counts[i] = current_count
    linkage_matrix = numpy.column_stack([
        model.children_, model.distances_,
        counts]).astype(float)
    # Plot the corresponding dendrogram
    dendrogram(linkage_matrix, **kwargs)
In plot_dendrogram() we count the samples under
each node and plot the dendrogram. First of all we
create the counts array using the zeros()
function, which returns an array of zeros with one
entry per row of the children_ attribute (one row
per non-leaf node). We get the total number of
samples from the length of the model.labels_
attribute (the label of each datapoint). Then, in
a for loop over enumerate(model.children_)
(enumerate pairs each element with a number
starting from 0, like (0, children_[0]),
(1, children_[1]), etc.), we add 1 for every child
that is a leaf and the stored count for every
child that is itself an earlier merge. Finally we
create the linkage matrix using the column_stack()
function and plot the hierarchy using the
dendrogram() function
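To see what the counting loop produces, here is a minimal sketch that runs the same loop on a hand-made merge list. The children list is a hypothetical stand-in for model.children_ with 4 samples; it is not the output of a real model.

```python
# Hypothetical merge list for 4 samples (indexes 0..3), standing in for
# model.children_: each row holds the two nodes merged at that step.
children = [[0, 2],   # merge 0: leaves 0 and 2 -> new node 4
            [1, 3],   # merge 1: leaves 1 and 3 -> new node 5
            [4, 5]]   # merge 2: nodes 4 and 5  -> new node 6 (everything)
n_samples = 4

counts = [0] * len(children)
for i, merge in enumerate(children):
    current_count = 0
    for child_idx in merge:
        if child_idx < n_samples:
            current_count += 1  # a leaf is one sample
        else:
            # an earlier merge: reuse the count stored for it
            current_count += counts[child_idx - n_samples]
    counts[i] = current_count

print(counts)
```

The last entry equals the number of samples, because the final merge contains the whole dataset.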
In [5]: plot_dendrogram(Clstr, truncate_mode='level')
As you can see, the values on the x-axis are the
indexes of the elements, which are grouped in
succession until a single cluster containing the
whole dataset is formed. You can check the
hierarchy: elements indexed 0 and 2 (chocolate)
are grouped together and elements indexed 1 and 3
(strawberry) are grouped together, and those two
groups are in turn grouped together because the
age range for them is 6-8, so it is a group of
flavours liked by children
    Age  Gender  Flavour
0     6  Male    Chocolate
1     6  Female  Strawberry
2     7  Male    Chocolate
3     8  Female  Strawberry
4    11  Male    Butterscotch
5    10  Female  Butterscotch
6    12  Male    Butterscotch
7    14  Female  Vanilla
8    15  Male    Mango
9    15  Female  Vanilla
10   17  Male    Mango
11   16  Female  Butterscotch
12   19  Male    Almond & Chocolate
13   18  Female  Butterscotch
14   20  Male    Almond & Chocolate
15   20  Female  Butterscotch
16   21  Male    Coffee
17   22  Female  Coffee
18   24  Male    Almond & Chocolate
19   24  Female  Coffee
19 KNN ALGORITHMS
• Finding nearest neighbours
• Regression with KNN
• Classification with KNN
KNN stands for K-Nearest Neighbors, which is a
supervised machine learning algorithm used for
both regression and classification problems
Finding nearest neighbours
KNN is a lazy learning algorithm, i.e. it doesn't
have a special training phase; instead it keeps
all the training data and uses it at the time of
prediction, whether for classification or for
regression. It is also considered non-parametric
because it doesn't assume anything about the
underlying data
The K in KNN stands for the number of nearest
datapoints considered. The algorithm uses 'feature
similarity' to predict the values of new
datapoints, which means that a new datapoint is
assigned a value based on how closely it matches
the points in the training set.
The KNN algorithm works in the following manner:
• First we need to specify K, i.e. the number of
nearest datapoints, say 3
• Then calculate the distance between the test
datapoint and each row of the training data
• Now sort the rows in ascending order of the
distance calculated above
• Finally assign a class or value to the point on
the basis of the K nearest datapoints (the most
common class for classification, the average
value for regression)
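The steps above can be sketched in plain Python. This is a from-scratch illustration, not how scikit-learn implements KNN; the helper names and the tiny [temperature, humidity] dataset are made up, and math.dist needs Python 3.8 or newer.

```python
import math
from collections import Counter

def k_nearest(train_X, train_y, point, k=3):
    """Return the labels of the k training rows closest to `point`."""
    # Distance between the new point and every training row
    dists = [(math.dist(row, point), label)
             for row, label in zip(train_X, train_y)]
    # Sort in ascending order of distance
    dists.sort(key=lambda pair: pair[0])
    # Keep only the labels of the k nearest rows
    return [label for _, label in dists[:k]]

def knn_classify(train_X, train_y, point, k=3):
    # Majority vote among the k nearest labels
    votes = Counter(k_nearest(train_X, train_y, point, k))
    return votes.most_common(1)[0][0]

# Made-up data: [temperature, humidity] -> rain label
train_X = [[30, 75], [31, 72], [29, 70], [18, 55], [20, 58]]
train_y = ['Yes', 'Yes', 'Yes', 'No', 'No']

print(knn_classify(train_X, train_y, [28, 71], k=3))
print(knn_classify(train_X, train_y, [19, 56], k=3))
```

For regression the last step would average the K nearest values instead of taking a vote.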
Regression with KNN
We will perform regression with the KNN algorithm
to create a model that predicts the weight of a
person when the height is provided as input, the
same problem we solved with the linear regression
model (page no. 130)
We can start off by importing the necessary
modules and the dataset height_and_weight.csv
In [1]: import pandas
        from sklearn.neighbors import KNeighborsRegressor
        from sklearn.model_selection import train_test_split
        dt = pandas.read_csv('height_and_weight.csv')
        dt
Out[1]:
     Index  Height(In)  Weight(lbs)
0        1       65.78       112.99
1        2       71.52       136.49
2        3       69.40       153.03
3        4       68.22       142.34
4        5       67.79       144.30
...
195    196       65.80       120.84
196    197       66.11       115.78
197    198       68.24       128.30
198    199       68.02       127.47
199    200       71.39       127.88
200 rows x 3 columns
Now we can perform the splitting of our dataset
into training and testing sets
In [2]: X = dt.drop(columns=['Index','Weight(lbs)'])
        y = dt.drop(columns=['Index','Height(In)'])
        inp_X, tst_X, out_y, tst_y = train_test_split(X, y,
                                                      test_size=0.1)
Our data is ready so let's create our KNN model
and train it
In [3]: KNN = KNeighborsRegressor()
        KNN.fit(inp_X, out_y)
Out[3]: KNeighborsRegressor()
We can specify K with the n_neighbors parameter
during the class initialization
In [3]: KNN = KNeighborsRegressor(n_neighbors=25)
        KNN.fit(inp_X, out_y)
Out[3]: KNeighborsRegressor(n_neighbors=25)
But as we didn't pass any value in our model, the
default value is used, i.e. 5
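To make the role of K concrete, here is a from-scratch sketch of KNN regression on one feature: the prediction is simply the average weight of the K rows whose height is closest to the query. The knn_regress helper and the numbers are made up for illustration and are not the book's dataset.

```python
def knn_regress(heights, weights, height, k=5):
    """Predict a weight as the mean weight of the k nearest heights."""
    # Sort the training pairs by how far their height is from the query
    nearest = sorted(zip(heights, weights),
                     key=lambda hw: abs(hw[0] - height))[:k]
    return sum(w for _, w in nearest) / len(nearest)

# Made-up heights (inches) and weights (lbs)
heights = [64, 65, 66, 67, 68, 69, 70, 71]
weights = [110, 115, 118, 125, 130, 133, 140, 145]

print(knn_regress(heights, weights, 68, k=1))  # only the closest row
print(knn_regress(heights, weights, 68, k=5))  # average of 5 neighbours
```

A small K follows individual rows closely, while a larger K smooths the prediction over more neighbours.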
Our model is ready so we can let it predict and
then compare the values
In [4]: pred_y = KNN.predict(tst_X)
        act = tst_y['Weight(lbs)'].tolist()
        prd = (pred_y.flatten()).tolist()
        cmp = pandas.DataFrame({'Predictions':prd,
                                'Actual':act})
        cmp.plot(kind='bar')
Out[4]: <matplotlib.axes._subplots.AxesSubplot at 0x1da4af9...>
Our KNN model has performed pretty well: as you
can see, the actual (orange) and predicted (blue)
bars are mostly close. We could try to improve it
by changing the number of neighbours K, i.e.
n_neighbors, during the KNN initialization, for
example doubling it to 10, but it is already
performing pretty well
Classification with KNN
Similarly we can use the KNN algorithm to create
classifiers. We can create a classifier with the
Rainfall_data.csv used in the logistic
regression model (page no. 152). So let's import
the modules and the dataset
In [1]: import pandas
        from sklearn.neighbors import KNeighborsClassifier
        from sklearn.model_selection import train_test_split
        from sklearn.preprocessing import LabelEncoder
        from sklearn.metrics import accuracy_score
        dt = pandas.read_csv('Rainfall_data.csv')
We need to change the Rain labels into numeric
values. We will use the LabelEncoder to do so
In [2]: Enc = LabelEncoder()
        Enc.fit(['Yes','No'])
        dt['Rain'] = Enc.transform(dt['Rain'])
So let's preview our data using the head()
function to view the encoding
In [3]: dt.head()
   Unnamed: 0  Temperature  Humidity%  Rain
0           0           34       74.2     1
1           1           19       68.2     0
2           2           28       67.2     1
3           3           29       66.6     1
4           4           26       57.9     1
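The codes in the Rain column follow from the way LabelEncoder works: it keeps the distinct labels in sorted order and uses each label's position as its code, so 'No' becomes 0 and 'Yes' becomes 1 regardless of the order passed to fit(). A minimal plain-Python sketch of that idea (an illustration, not scikit-learn's own code):

```python
labels = ['Yes', 'No', 'Yes', 'Yes', 'No']

# Distinct labels in sorted order; the position is the code
classes = sorted(set(labels))
code_of = {label: code for code, label in enumerate(classes)}
encoded = [code_of[label] for label in labels]

print(classes)
print(encoded)
```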
Now we can split our dataset into training input,
testing input, training output and testing output
using the train_test_split() function with the
test_size as 10%
In [4]: X = dt.drop(columns=['Unnamed: 0','Rain'])
        y = dt['Rain']
        inp_X, tst_X, out_y, tst_y = train_test_split(X, y,
                                                      test_size=0.1)
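To picture what a 10% split does, here is a rough from-scratch sketch: shuffle the rows, then hold back a fraction of them for testing. The simple_split helper is hypothetical, and its rounding and shuffling are simplifications rather than the exact behaviour of train_test_split().

```python
import random

def simple_split(rows, test_size=0.1, seed=0):
    """Shuffle the rows and hold out a fraction of them for testing."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    n_test = max(1, round(len(rows) * test_size))
    return rows[n_test:], rows[:n_test]

data = list(range(50))  # stand-in for 50 dataset rows
train, test = simple_split(data, test_size=0.1)
print(len(train), len(test))
```

Every row ends up in exactly one of the two sets, so the model is tested on rows it has never seen.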
Our data is ready so let's create our KNN
classifier and train it
In [5]: KNN = KNeighborsClassifier()
        KNN.fit(inp_X, out_y)
Out[5]: KNeighborsClassifier()
Now we can pass the test values to the KNN
classifier and print the accuracy score
In [6]: pred_y = KNN.predict(tst_X)
        acc = accuracy_score(tst_y,pred_y)
        print('Accuracy:',acc)
Accuracy: 0.9335
Our model has an accuracy of 93%, which is about
2% more than the logistic regression model. So
this is how we can create KNN models to solve
different problems of regression and
classification alike
PERFORMANCE & METRICS
• Calculating the model
• Improving the model
• Saving and loading models
So far we have created a lot of models with
different algorithms for different tasks like
regression, classification, etc. and also
evaluated their performance visually through
graphs or their accuracy score. In this lesson we
will look at the methods to calculate the
performance of algorithms
Calculating the model
In the maths for machine learning lesson we
learned about some methods to calculate error
rate, Precision, Recall and F-measure (page no.
71) using the confusion matrix. All these values
can be used to evaluate the performance of a
classifier. Let's use the KNN classifier model we
created previously and calculate its performance
First of all let's print the confusion matrix
using the confusion_matrix() function
In [8]: from sklearn import metrics
cm = metrics.confusion_matrix(tst_y,pred_y)
cm
Out[8]: array([[ 657, 62],
[ 80, 1201]], dtype=int64)
We passed the actual values followed by the
predicted values, so each row of the matrix is an
actual class and each column is a predicted class,
with class 0 (No) first. We can visualize it
better like
In [11]: from sklearn import metrics
         cm = metrics.confusion_matrix(tst_y,pred_y)
         cf = pandas.DataFrame({'Predicted -ve':cm[:,0],
                                'Predicted +ve':cm[:,1]},
                               index=['Actual -ve',
                                      'Actual +ve'])
         cf
Out[11]:
            Predicted -ve  Predicted +ve
Actual -ve            657             62
Actual +ve             80           1201
So we have 657 True Negatives (values predicted
Negative {0, 'No', etc.} that are Negative too),
62 False Positives (values predicted Positive
{1, 'Yes', etc.} that are actually Negative), 80
False Negatives (values predicted Negative that
are actually Positive) and 1201 True Positives
(values predicted Positive that are Positive too).
Using the confusion matrix we can calculate other
metrics like:
In [14]: from sklearn import metrics
         cm = metrics.confusion_matrix(tst_y,pred_y)
         cf = pandas.DataFrame({'Predicted -ve':cm[:,0],
                                'Predicted +ve':cm[:,1]},
                               index=['Actual -ve',
                                      'Actual +ve'])
         val = (tst_y,pred_y)
         acc = metrics.accuracy_score(*val)
         pre = metrics.precision_score(*val)
         rcl = metrics.recall_score(*val)
         fms = metrics.f1_score(*val)
         print('Accuracy:',acc)
         print('Precision:',pre)
         print('Recall:',rcl)
         print('F-Measure:',fms)
Accuracy: 0.929
Precision: 0.950910530482977
Recall: 0.9375487900078064
F-Measure: 0.944182389937107
We have calculated the accuracy, precision (the
share of the values predicted positive by the
model that are actually positive), recall (the
share of the actual positive values that the model
predicted as positive) and F-measure (also known
as F1 score) using the accuracy_score(),
precision_score(), recall_score() and f1_score()
functions and printed them respectively
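To see where these numbers come from, here is a from-scratch sketch. It assumes scikit-learn's layout for the matrix [[657, 62], [80, 1201]] printed earlier: rows are actual classes, columns are predicted classes and class 0 comes first. The confusion_counts helper is hypothetical and only shows how the four counts are tallied.

```python
def confusion_counts(actual, predicted):
    """Count (tn, fp, fn, tp) for labels encoded as 0 and 1."""
    tn = fp = fn = tp = 0
    for a, p in zip(actual, predicted):
        if a == 1 and p == 1:
            tp += 1
        elif a == 0 and p == 0:
            tn += 1
        elif a == 0 and p == 1:
            fp += 1
        else:
            fn += 1
    return tn, fp, fn, tp

# Tiny made-up example of the tallying
print(confusion_counts([1, 0, 1, 1, 0], [1, 0, 0, 1, 1]))

# The four counts read from the confusion matrix in the text
tn, fp, fn, tp = 657, 62, 80, 1201
accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f_measure = 2 * precision * recall / (precision + recall)
print(accuracy, precision, recall, f_measure)
```

Up to floating-point rounding, the four values agree with the ones printed by the metrics functions.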
We can print all of them together in a tabular
form using the classification_report() function
In [17]: from sklearn import metrics
         rep = metrics.classification_report(tst_y,pred_y)
         print(rep)
              precision    recall  f1-score   support
           0       0.89      0.91      0.90       719
           1       0.95      0.94      0.94      1281
    accuracy                           0.93      2000
   macro avg       0.92      0.93      0.92      2000
weighted avg       0.93      0.93      0.93      2000
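The two average rows of the report can be reproduced by hand. Here is a minimal sketch for the precision column, again assuming scikit-learn's layout of the confusion matrix [[657, 62], [80, 1201]] (rows are actual classes, columns are predicted classes):

```python
# Per-class precision and support derived from the confusion matrix
precision = {0: 657 / (657 + 80), 1: 1201 / (1201 + 62)}
support = {0: 657 + 62, 1: 80 + 1201}
total = sum(support.values())

# Macro average: plain mean of the per-class scores
macro_avg = sum(precision.values()) / len(precision)
# Weighted average: each class score weighted by its support
weighted_avg = sum(precision[c] * support[c] for c in precision) / total

print(round(macro_avg, 2), round(weighted_avg, 2))
```

The two averages differ only because class 1 has almost twice as many samples as class 0.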
The support is the number of actual samples of
each class in the testing set, here for the labels
0 and 1, i.e. No Rain and Rain. The macro average
is the plain mean of the per-class scores (0.5 x
score of class 0 + 0.5 x score of class 1 here, as
we have two classes) and the weighted average
weights each class score by its support (719/2000
x score of class 0 + 1281/2000 x score of class
1), which matters when the classes are imbalanced.
We can use these metrics to evaluate the
performance of an algorithm. Next, let's look at
the metrics to evaluate a regression model. We
will use the KNN regressor we created to predict
the weight
In [6]: from sklearn import metrics
        val = (tst_y, pred_y)
        err = metrics.max_error(*val)
        mae = metrics.mean_absolute_error(*val)
        mse = metrics.mean_squared_error(*val)
        rsq = metrics.r2_score(*val)
        print('Max Error:',err)
        print('MAE:',mae)
        print('MSE:',mse)
        print('R2:',rsq)
Max Error: 19.824399999999983
MAE: 7.167660000000001
MSE: 84.34250813599998
R2: 0.02620416439244655
We calculated the Max Error (maximum residual
error), MAE (mean absolute error, i.e. the average
absolute distance between each actual value and
its prediction), MSE (the mean of the squared
distances between the actual values and the
predictions) and R2 (explained variation / total
variation) using the max_error(),
mean_absolute_error(), mean_squared_error() and
r2_score() functions and printed them
respectively. The lower the Max Error, MAE and
MSE are, the better the model performs, whereas R2
is better the closer it is to 1.0. A constant
model that always predicts the expected value of
y, disregarding the input features, would get an
R2 score of 0.0, and our model's score of about
0.03 is close to that
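The four regression metrics are short enough to compute by hand. Below is a from-scratch sketch on made-up numbers; regression_metrics is a hypothetical helper, and R2 is computed as 1 - (squared error of the model / squared error around the mean), the same definition r2_score() uses.

```python
def regression_metrics(actual, predicted):
    """Return (max error, MAE, MSE, R2) for two equal-length lists."""
    residuals = [a - p for a, p in zip(actual, predicted)]
    n = len(residuals)
    max_err = max(abs(r) for r in residuals)
    mae = sum(abs(r) for r in residuals) / n
    mse = sum(r * r for r in residuals) / n
    mean_actual = sum(actual) / n
    total_var = sum((a - mean_actual) ** 2 for a in actual)
    r2 = 1 - sum(r * r for r in residuals) / total_var
    return max_err, mae, mse, r2

actual = [120, 130, 140, 150]      # made-up weights
predicted = [125, 128, 138, 155]   # made-up predictions
print(regression_metrics(actual, predicted))

# A constant model that always predicts the mean gets R2 = 0.0
constant = [sum(actual) / len(actual)] * len(actual)
print(regression_metrics(actual, constant)[3])
```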
Improving the model
Upon calculating the metrics of an algorithm we
can perform the following steps to improve the
performance of our models:
• Make sure to train the model with adequate data.
The dataset shouldn't have an abnormal
distribution of features or labels, like 5
samples of Yes and 95 samples of No
• After loading the data we should always apply
the best and most suitable preprocessing methods
to improve its quality, like encoding labels
• We shouldn't hold back too much data for
testing, but not too little either. For datasets
with over 10k samples, 20% or less is adequate
• In cases of very little data you can create your
own values for testing instead of splitting the
already scarce data
• You can always test different algorithms to
solve a problem, compare their metrics, choose
the best one and improve it
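The first point in the list is easy to check before training. Here is a small sketch that reports the share of each label; label_shares is a hypothetical helper and the 0.2 threshold is an arbitrary choice for illustration.

```python
from collections import Counter

def label_shares(labels):
    """Return the share of each label, most common first."""
    counts = Counter(labels)
    total = len(labels)
    return {label: n / total for label, n in counts.most_common()}

labels = ['Yes'] * 5 + ['No'] * 95   # the imbalanced case from the list
shares = label_shares(labels)
print(shares)

if min(shares.values()) < 0.2:       # arbitrary threshold
    print('Warning: the labels are heavily imbalanced')
```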
Saving and loading models
So we have created our model, tested it and even
improved it. Let's say we want to use the model
somewhere else or share it, so how do we do that?
Well, we can do so using the joblib module. So
let's save our KNN weight predicting model using
joblib
In [6]: import joblib
        joblib.dump(KNN, "WeightPred.sav")
Out[6]: ['WeightPred.sav']
We used the dump() function and passed the KNN
regressor and the "WeightPred.sav" filename as
arguments. We use the .sav extension after the
model name. As we haven't specified any specific
location, the file is stored in the folder where
the Jupyter notebook is hosted
Now we can open a new Jupyter notebook, import
joblib and load our model
In [1]: import joblib
        KNN = joblib.load("WeightPred.sav")
        KNN.predict([[70]])
Out[1]: array([[133.4852]])
We loaded our model using the load() function,
passing the saved model's filename. We also asked
the model to predict the weight of a person with
70 inches of height and it returned about 133.5
pounds
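The same save-and-load round trip can be tried without a trained model. The sketch below swaps joblib for Python's built-in pickle module, which works on the same dump/load idea, and saves a plain dictionary as a stand-in for a model; the dictionary contents and the file name are made up.

```python
import pickle
import tempfile
from pathlib import Path

# Stand-in for a trained model: a lazy learner like KNN is essentially
# its K plus the stored training data
model = {'n_neighbors': 5,
         'heights': [65.78, 71.52, 69.40],
         'weights': [112.99, 136.49, 153.03]}

path = Path(tempfile.gettempdir()) / 'DemoModel.sav'

with open(path, 'wb') as fh:   # save, comparable to joblib.dump(model, path)
    pickle.dump(model, fh)
with open(path, 'rb') as fh:   # load, comparable to joblib.load(path)
    restored = pickle.load(fh)

print(restored == model)
path.unlink()                  # clean up the temporary file
```

Only load model files you trust, because loading a pickled file can run code stored inside it.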
ML APPLICATION 1
• Movie Recommender
Problem: You have to create a model that will
suggest the movie genre a person is likely to
enjoy when the person's age, gender and previously
watched movie genre are provided as input. Here is
the dataset for sample recommendation:
[Link]
So the first step is to decide which method to
use. The task is to recommend a movie genre, that
is, to classify a person, so we will use
classification. Next, we need to decide which
algorithm to use. We aren't dealing with a huge
dataset so we can go with the decision tree
classifier
So let's start off by importing all the modules we
need
In [1]: import pandas
        from sklearn.tree import DecisionTreeClassifier
        from sklearn.preprocessing import LabelEncoder
Now we can import the dataset and preview it
using the head() function
In [2]: import pandas
        from sklearn.tree import DecisionTreeClassifier
        from sklearn.preprocessing import LabelEncoder
        dt = pandas.read_csv('[Link]')
        dt.head(3)
Out[2]:
   Age  Gender  Watched  Genre
0   19  Male    Comedy   Mystery
1   19  Female  Romance  Drama
2   19  Male    Romance  Drama
As you can see we need to encode all of the
labels into numeric values. We can use an encoder
for the Gender labels and another encoder for
Watched and Genre labels
In [3]: # Gender Encoder
gndr_enc = LabelEncoder()
gndr_enc.fit(['Male','Female'])
Out[3]: LabelEncoder()
We created the gndr_enc Gender encoder and
passed the Gender labels to the fit() method. Now
we can create the Genre encoder. But before that
we need all the unique Genre labels in both the
Watched and Genre columns
In [4]: # Unique Genre label extraction
        watched = dt['Watched'].unique()
        genre = dt['Genre'].unique()
        Genres = [*watched]
        for ele in genre:
            if ele in Genres:
                continue
            else:
                Genres.append(ele)
First of all we extracted the unique labels from
the Watched and Genre columns using the unique()
method. Then we created another list variable,
Genres, and filled it with the Watched uniques
(note that we need a single flat list, i.e. 1-D,
which is why the watched array is unpacked with
the * operator). Using the for loop we then added
the unique Genre labels that aren't already
present in the Genres list
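The same merge of two label lists can also be written in one line. This is an alternative sketch, not the book's code: dictionary keys are unique and keep their insertion order (Python 3.7+), so dict.fromkeys() drops the repeats while preserving the order. The label lists are made up.

```python
watched = ['Comedy', 'Romance', 'Horror', 'Comedy']
genre = ['Mystery', 'Drama', 'Romance', 'Fantasy']

# Keys are unique and ordered, so repeats vanish and order is kept
genres = list(dict.fromkeys([*watched, *genre]))
print(genres)
```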
Now we can create our gnre_enc Genre encoder and
fit the Genres
In [5]: # Unique Genre label extraction
        watched = dt['Watched'].unique()
        genre = dt['Genre'].unique()
        Genres = [*watched]
        for ele in genre:
            if ele in Genres:
                continue
            else:
                Genres.append(ele)
        # Genre Encoder
        gnre_enc = LabelEncoder()
        gnre_enc.fit(Genres)
Out[5]: LabelEncoder()
All the encoders are ready so let's encode the
labels in our dataset with them
In [6]: for col in ['Gender','Watched','Genre']:
            if col == 'Gender': # Gender Encode
                dt[col] = gndr_enc.transform(dt[col])
            else: # Watched & Genre Encode
                dt[col] = gnre_enc.transform(dt[col])
Now we can divide our dataset into input and
output. Let's do that in a new cell, because
re-running the cell above causes an error: the
labels are already encoded, so the encoders would
receive numbers instead of labels
In [7]: X = dt.drop(columns='Genre')
        y = dt['Genre']
        CModel = DecisionTreeClassifier()
        CModel.fit(X,y)
Out[7]: DecisionTreeClassifier()
We have now trained our CModel and can use it to
make predictions. So let's create a function to
pass the values and return the Genre label
In [8]: def recommend(age=18, gnd=0, watched=0, test=False):
            # Getting input if testing
            if test:
                age = int(input("Age:"))
                gnd = int(input("Gender:"))
                for g in Genres:
                    print(g,
                          *gnre_enc.transform([g]))
                watched = int(input("Watched:"))
            # Ask the model for recommendation
            pred = CModel.predict([[age,gnd,watched]])
            # Decoding the prediction to label
            rec = gnre_enc.inverse_transform(pred)
            return rec[0]
So we defined a recommend() function with four
parameters: age, by default 18; gnd, the gender,
by default 0 (Female); watched, the genre of the
previously watched movie, by default 0 (Comedy);
and test, by default False, which we can use
during testing to pass the input values
If we pass test as True, the function will ask
for our input and also display the encoded value
for each genre. Then the model will predict using
the input values. We take the output (an encoded
value), decode it and finally return it
So let's move onto a new cell, call our
recommend() function and pass True for the test
parameter
In [*]: recommend(test=True)
        Age:18
You can see we are shown the input prompt called
in the recommend() function. So let's pass 18 as
the age
In [*]: recommend(test=True)
        Age:18
        Gender:1
Pass the Gender as 1 (Male)
In [*]: recommend(test=True)
        Age:18
        Gender:1
        Comedy - 0
        Romance - 5
        Horror - 3
        Mystery - 4
        Drama - 1
        Fantasy - 2
        Watched:0
You can see the function has displayed all the
encoded values for each genre, so let's pass
0 (Comedy)
In [9]: recommend(test=True)
Age:18
Gender:1
Comedy - 0
Romance - 5
Horror - 3
Mystery - 4
Drama - 1
Fantasy - 2
Watched:0
Out[9]: 'Mystery'
Now we get Mystery as the recommendation for
an 18-year-old Male who has previously watched a
Comedy movie. We have very little data, so let's
check the answer visually using the dataset
Out[2]:
   Age  Gender  Watched  Genre
0   19  Male    Comedy   Mystery
1   19  Female  Romance  Drama
2   19  Male    Romance  Drama
In the second cell we printed the first three rows
of our dataset, and by looking at them we can say
that if an 18-year-old male (whose sample isn't
present in the dataset) has previously watched a
Comedy movie, he will most likely like a Mystery
movie too, along with Comedy movies
So we have created our Movie Recommender model
using a very small dataset. Now it's up to you to
test the model, or even take opinions from your
relatives and recommend movies to them using the
model!
ML APPLICATION 2
• Advertisement handling
Problem: You have to create a model to decide
whether to show an ad to a user or not and, if
yes, which one: the Car or the Insurance
advertisement. The age of the user and the user
class, i.e. a group assigned by another model
based on the user's past search results, are
provided as input. You are provided with the
following dataset
[Link]
Once again we need to decide which method to use
and clearly this is a problem of classification.
We can use the KNN classifier algorithm for this
task
So let's move onto jupyter notebook and import
the algorithm, LabelEncoder, pandas library and
the dataset
In [1]: import pandas
        from sklearn.neighbors import KNeighborsClassifier
        from sklearn.preprocessing import LabelEncoder
        dt = pandas.read_csv('[Link]')
        dt.head(3)
   Age  Search       Ad
0   18  Cars         Car
1   19  Automobiles  None
2   21  Automobiles  None
We again have very little data to work with. We
have the Age and Search as input and Ad as output.
But we need to preprocess the labels
In [2]: Enc = LabelEncoder()
        Enc.fit(['Cars','Automobiles','Health','Car',
                 'Insurance','None'])
        for col in ['Search','Ad']:
            dt[col] = Enc.transform(dt[col])
So we encoded the values in the Search and Ad
columns using the Enc encoder. Now we can move
onto the next step of data splitting, but as
mentioned earlier we don't have enough data to
split into training and testing sets. So we need
to create the testing set ourselves. Let's take a
look at the whole dataset:
    Age  Search       Ad
0    18  Cars         Car
1    19  Automobiles  None
2    21  Automobiles  None
3    22  Cars         Car
4    23  None         None
5    26  None         None
6    27  Health       Insurance
7    28  Cars         Car
8    18  Health       Insurance
9    19  None         None
10   20  None         None
11   22  Automobiles  None
12   26  None         Insurance
13   30  Health       Insurance
14   30  Automobiles  Car
15   29  Cars         Car
16   29  Health       Insurance
17   29  None         Insurance
18   32  Automobiles  Car
19   32  Health       Insurance
We have 20 rows and 3 columns worth of data. We
already know the input, i.e. Age and Search, and
the output, i.e. Ad, where the input Search will
be provided by another classifier which will
classify the user into the Cars, Automobiles,
Health and None classes on the basis of previous
searches. We can visually form an understanding
from the data: a person of age 18-28 should be
shown the Car advertisement only when the person
is in the Cars class, or the Insurance
advertisement when the person is in the Health
class. Similarly, a person of age 30 or more
should be shown the Car advertisement if the
person is in the Cars or Automobiles class, and
the Insurance advertisement if the person is in
the Health class
And a person of age 18-24 in the None class
should be shown nothing, but if the person is 25
or more the Insurance ad should be shown
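The conditions described above can be written down as a small rule function, which makes them easy to check. This is only a sketch of our reading of the data: expected_ad is a hypothetical helper, and treating every Automobiles user below 30 as "no ad" is an assumption, since the text only states the rule for 30 or more.

```python
def expected_ad(age, search):
    """Ad suggested by the hand-made rules for a user of a given class."""
    if search == 'Cars':
        return 'Car'
    if search == 'Health':
        return 'Insurance'
    if search == 'Automobiles':
        # Assumption: below 30 the Automobiles class gets no ad
        return 'Car' if age >= 30 else 'None'
    # 'None' class: nothing up to 24, Insurance from 25 onwards
    return 'Insurance' if age >= 25 else 'None'

for age, search in [(21, 'None'), (21, 'Health'), (27, 'None'),
                    (23, 'Automobiles'), (34, 'None'), (34, 'Automobiles')]:
    print(age, search, '->', expected_ad(age, search))
```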
All these assumed conditions are called
hypotheses. So using these hypotheses we can
create a testing dataset with close to accurate
outputs. So let's create it
In [3]: import numpy
        X = dt.drop(columns='Ad')
        y = dt['Ad']
        def en(val):
            v = Enc.transform([val])
            return v[0]
        tst_X = numpy.array(
            [[21,en('None')],[21,en('Health')],
             [27,en('None')],[23,en('Automobiles')],
             [34,en('None')],[34,en('Automobiles')]])
        tst_y = numpy.array(
            [[en('None')],[en('Insurance')],
             [en('Insurance')],[en('None')],
             [en('Insurance')],[en('Car')]])
First of all we imported the numpy package to
create our test data. Then we divided our dataset
into training input and training output
To create the testing set we use the en()
function, which takes a Search or Ad label and
returns the encoded value for it. Now we can
create some testing data based on our hypotheses,
like a 21-year-old person of class None
([21,en('None')]) who should be shown no
advertisement ([en('None')])
As mentioned earlier, the testing set is built
upon hypotheses. They may be correct or wrong. We
created them for the purpose of testing our
model. We can only do this in situations like
this one, where all the data we have is a small
dataset. We can visualize the testing set using a
pandas data frame
In [4]: def dec(val):
            v = Enc.inverse_transform(val)
            return v
        tst = pandas.DataFrame({'Age':tst_X[:,0],
                                'Search':dec(tst_X[:,1]),
                                'Ad':dec(tst_y[:,0])})
        tst
   Age  Search       Ad
0   21  None         None
1   21  Health       Insurance
2   27  None         Insurance
3   23  Automobiles  None
4   34  None         Insurance
5   34  Automobiles  Car
We defined the dec() function to shorten our code.
The age and class combinations in the testing set
are not present in the actual dataset. Now we can
move onto creating our KNN classifier and training
it
In [5]: KNN = KNeighborsClassifier()
        KNN.fit(X,y)
Out[5]: KNeighborsClassifier()
Now we can pass the test input to the KNN
classifier and print the accuracy
In [6]: from sklearn.metrics import accuracy_score
        pred_y = KNN.predict(tst_X)
        accuracy_score(tst_y, pred_y)
Out[6]: 0.8333333333333334
So our model has an accuracy of approx 83%, and
given that the testing set has 6 samples, our
model predicted correctly for 5 inputs and wrongly
for only one
But remember that the testing set is based upon
our hypotheses, so maybe what the model predicted
is right
Also, you may or may not have wondered about it
until now, but notice that the testing set was
based upon the hypotheses we created, i.e. we
analyzed the data, found connections between the
features and labels and created the testing set,
and the model thinks the same as us for 5 of the
inputs. The hypotheses we built are the same
patterns and relations the model uses to predict.
We did that by having a thorough look at the
data, which is unlikely to be wrong, but the model
does everything in a few milliseconds. Imagine
there were tens of thousands of rows like these!
Could you have done the same there?
I hope your understanding of 'machine' and
'learning' in machine learning is clearer now
ML APPLICATION 3
• Checking wine quality
Problem: In a wine factory, you are asked to
rate the quality of the production on a scale of 1
to 5 when different chemical properties of a
produced batch are passed as input, and then tell
whether the batch is good or not. A good rating is
more than half (2.5). You have the following
sample dataset of some 1500 samples (input and
output values) and the batch dataset
[Link]
[Link]
We can use machine learning models to solve the
problem, but the question is which algorithm to
choose
If you are thinking of using a classifier
algorithm because we need to rate the wine, then
you're wrong. The rating is to be done on a scale
of 1 to 5 where it can be 4 or 4.5 or even 4.45,
so for this problem we are going to use linear
regression
So let's import our sample dataset along with the
other necessities and preview it with the
describe() function
In [1]: import pandas
        import numpy
        from sklearn import metrics
        from sklearn.linear_model import LinearRegression
        from sklearn.model_selection import train_test_split
        dt = pandas.read_csv('wine_sample.csv')
        (dt.describe()).round(1)
Out[1]:
       fixed acidity  volatile acidity  citric acid  residual sugar  chlorides
count         1499.0            1499.0       1499.0          1499.0     1499.0
mean             8.4               0.5          0.3             2.5        0.1
std              1.7               0.2          0.2             1.4        0.0
min              4.6               0.1          0.0             0.9        0.0
25%              7.2               0.4          0.1             1.9        0.1
50%              8.0               0.5          0.3             2.2        0.1
75%              9.3               0.6          0.4             2.6        0.1
max             15.9               1.6          1.0            15.5        0.6

       free sulfur dioxide  total sulfur dioxide  density      pH  sulphates  alcohol  quality
count               1499.0                1499.0   1499.0  1499.0     1499.0   1499.0   1499.0
mean                  15.6                  46.8      1.0     3.3        0.7     10.4      5.6
std                   10.5                  33.3      0.0     0.2        0.2      1.1      0.8
min                    1.0                   6.0      1.0     2.7        0.3      8.4      3.0
25%                    7.0                  22.0      1.0     3.2        0.6      9.5      5.0
50%                   13.0                  38.0      1.0     3.3        0.6     10.1      6.0
75%                   21.0                  63.0      1.0     3.4        0.7     11.1      6.0
max                   72.0                 289.0      1.0     4.0        2.0     14.9      8.0
In this dataset we have twelve columns with about
1500 samples. The first eleven columns are
different chemical properties of wine, i.e. the
input, and quality is the rating, i.e. the output.
The minimum rating is 3 and the maximum is 8. But
we need to rate the quality of wine on a scale of
1 to 5. So we need to rescale the quality column
to the range 1 to 5, and we will do that using the
MinMaxScaler
In [2]: from sklearn.preprocessing import MinMaxScaler
        Sclr = MinMaxScaler(feature_range=(1, 5))
        qal = numpy.array(dt['quality'])
        dt['quality'] = Sclr.fit_transform(qal.reshape(-1,1))
We imported the MinMaxScaler and created our
Sclr object of the class. We passed the scale,
i.e. 1-5, in the feature_range parameter. Then we
created a numpy array of the quality column and
scaled the data using the fit_transform()
function. Note that we passed the array reshaped
with the reshape(-1,1) function, which converts
the 1-D array [5,6,7,...] to the 2-D array
[[5],[6],[7],...]
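The rescaling itself is a one-line formula: subtract the smallest value, divide by the original range, then stretch and shift to the new range. Here is a plain-Python sketch of that formula; min_max_scale is a hypothetical helper and the quality values are made up to span the 3 to 8 range seen in the dataset.

```python
def min_max_scale(values, new_min=1, new_max=5):
    """Linearly map values so that min -> new_min and max -> new_max."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min
            for v in values]

quality = [3, 5, 6, 8]   # made-up ratings on the original scale
scaled = min_max_scale(quality)
print(scaled)
```

The lowest rating of 3 lands on 1 and the highest rating of 8 lands on 5, with everything else spaced proportionally in between.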
Now we can split the data, create our linear
regressor and train it
In [3]: X = dt.drop(columns='quality')
        y = dt['quality']
        trnX,tstX,trnY,tstY = train_test_split(X,y,test_size=0.1)
        Reg = LinearRegression()
        Reg.fit(trnX,trnY)
Out[3]: LinearRegression()
Before checking the quality of the given batch we
need to test our model and find some metrics. So
let's use the testing sets and compare the model's
predictions
In [4]: predY = Reg.predict(tstX)
        mae = metrics.mean_absolute_error(tstY,predY)
        err = metrics.max_error(tstY,predY)
        cmp = pandas.DataFrame({'Predicted':predY,
                                'Actual':[Link]})
        print('MAE: ',mae,'\n','Max RE: ',err)
        cmp.plot(figsize=(7.5,6))
MAE: 0.43020099299063136
Max RE: 1.5037184643992099
Our model has an MAE (mean absolute error) of
approx 0.43, i.e. the average of the absolute
errors, and a maximum residual error (Max RE) of
approx 1.5
We have a lot of values so let's plot the graph
for comparing the values
Out[4]: <matplotlib.axes._subplots.AxesSubplot at 0x23b10ef1160>
Looking at the graph we can tell how our model is
performing. By observing it we can tell that our
model didn't rate any input as 5, whereas the
actual values contain a 5 only 3 times, which
explains it: higher values are rare in the data,
therefore predictions of higher ratings are rare
too. Still, our model is fine, so let's import the
batch dataset and pass it to the model
In [5]: batch = pandas.read_csv('wine_batch.csv')
        batch_pred = Reg.predict([Link])
        batch_pred.mean()
Out[5]: 3.135502047867703
We imported the csv file and passed the values for
predictions. On average the rating is 3.14, and if
we take the MAE (0.43) into account the rating
could also be 2.71 or 3.57. But in all of the
cases the average rating of the batch is higher
than 2.5, so it is fine!
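The arithmetic behind that conclusion is short enough to write out. Note that treating the MAE as a band around the average is a rough rule of thumb used here for illustration, not a statistical confidence interval.

```python
mean_rating = 3.135502047867703   # average prediction for the batch
mae = 0.43020099299063136         # mean absolute error on the test set
threshold = 2.5                   # a good batch rates above this

low, high = mean_rating - mae, mean_rating + mae
print(round(low, 2), round(high, 2))
print('good batch' if low > threshold else 'needs a closer look')
```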
ML APPLICATION 4
• Match play decision
Problem: You have to decide whether a match can
be played or not when the temperature, humidity
and rainfall status are provided as input. The
following dataset has over 1000 past days of
samples:
[Link]
Once again we need to decide which method to use,
and clearly this is a problem of classification.
We can use the decision tree classifier because we
need to predict Yes or No. The task is simple and
we have a wide range of data, so the tree should
be really helpful
So let's import the modules and functions we need
for this model, which are:
import pandas
from sklearn.tree import DecisionTreeClassifier
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
and our dataset, and preview it without the
head() function
In [1]: import pandas
        from sklearn.tree import DecisionTreeClassifier
        from sklearn.preprocessing import LabelEncoder
        from sklearn.metrics import classification_report
        from sklearn.model_selection import train_test_split
        dt = pandas.read_csv('play_stats.csv')
        dt
      Temperature  Humidity%  Rain  Play
0              34       74.2   Yes    No
1              19       68.2    No    No
2              28       67.2   Yes   Yes
3              29       66.6   Yes    No
4              26       57.9   Yes   Yes
...
996            28       62.0   Yes   Yes
997            27       71.4   Yes    No
998            19       54.1    No   Yes
999            32       57.4   Yes    No
1000           24       61.4   Yes    No
1001 rows x 4 columns
We need to encode the Rain and Play labels so
create an encoder and encode the labels
In [2]: Enc = LabelEncoder()
        Enc.fit(['Yes','No'])
        for col in ['Rain','Play']:
            dt[col] = Enc.transform(dt[col])
We simply created an Enc encoder using the
LabelEncoder and then fitted the 'Yes' and 'No'
values. Then using a for loop and the
transform() function we encoded the labels
Note that you may get an error if you re-run the
cell, because the labels are already encoded so
transform() will receive numbers this time
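The re-run error is easy to reproduce in miniature. The sketch below uses TinyEncoder, a made-up stand-in that mimics the fit/transform behaviour described above (it is not scikit-learn's LabelEncoder), and shows one possible guard: only encode while the column still holds text labels.

```python
class TinyEncoder:
    """Minimal stand-in for a label encoder, for illustration only."""
    def fit(self, labels):
        self.classes_ = sorted(set(labels))
        return self

    def transform(self, labels):
        unseen = [l for l in labels if l not in self.classes_]
        if unseen:
            raise ValueError(f"unseen labels: {unseen}")
        return [self.classes_.index(l) for l in labels]

enc = TinyEncoder().fit(['Yes', 'No'])
column = ['Yes', 'No', 'Yes']

column = enc.transform(column)        # first run: labels -> numbers
print(column)

try:
    column = enc.transform(column)    # second run: numbers are unseen labels
except ValueError as err:
    print('Second run failed:', err)

# Guard: only encode while the column still holds text labels
if all(isinstance(v, str) for v in column):
    column = enc.transform(column)
print(column)
```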
Now we can split the dataset into training and
testing sets and train our classifier after we
create it
In [3]: X = dt.drop(columns='Play')
        y = dt['Play']
        trnX,tstX,trnY,tstY = train_test_split(X,y,test_size=0.1)
        CModel = DecisionTreeClassifier()
        CModel.fit(trnX,trnY)
Out[3]: DecisionTreeClassifier()
Our classifier CModel is ready so now we can
test it and print the classification report
In [4]: predY = CModel.predict(tstX)
        print(classification_report(tstY,predY))
              precision    recall  f1-score   support
           0       1.00      1.00      1.00        70
           1       1.00      1.00      1.00        31
    accuracy                           1.00       101
   macro avg       1.00      1.00      1.00       101
weighted avg       1.00      1.00      1.00       101
Wow! We have created an A-grade classifier! It's
the first time I've seen that in a practical
application
So our model has performed 100% well, no
questions. Analysing the classification report we
can tell that No and Yes have roughly a 70%-30%
distribution in the testing set. Now you can
define a function like we did with our movie
recommender and get the input data from the user
(input prompt) and print Play or Can't Play
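One possible shape for such a function is sketched below. It takes the prediction function as an argument so that it can run on its own here; fake_predict is a made-up stand-in with an invented humidity rule, and in the notebook you would pass CModel.predict instead.

```python
def play_decision(predict, temperature, humidity, rain):
    """Turn a model's 0/1 prediction into 'Play' or "Can't Play"."""
    rain_code = 1 if rain == 'Yes' else 0   # same codes as the encoder
    pred = predict([[temperature, humidity, rain_code]])
    return 'Play' if pred[0] == 1 else "Can't Play"

# Hypothetical stand-in for CModel.predict with an invented rule
def fake_predict(rows):
    return [1 if humidity < 65 else 0 for _, humidity, _ in rows]

print(play_decision(fake_predict, 26, 57.9, 'Yes'))
print(play_decision(fake_predict, 34, 74.2, 'Yes'))
```

To read the values from the user, wrap the call with input() prompts as in the recommend() function of the movie recommender.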
ML APPLICATION 5
• Best striking formation (Player statistics)
Problem: You have to create (predict) a
formation for a football team with the best 3
strikers out of 5 players when their statistics
from the previous matches are provided. Here is
the sample data of the performance and rating of
each player in the previous 6 matches:
[Link]
So which method do you think we should use for
this problem? Should we use regression,
classification or something else? You may think we
should use a regressor model like the wine
rating model, and you are right. But the ratings
there had to be accurate to within even 0.01 points,
whereas here the player ratings are integers, i.e.
non-decimal values like 5 or 6, so a classifier can
predict them too. So let's use both
methods for this problem to rate the players, and
then we will define a function to create a
formation of the best three players
So let's import the sample dataset and the
following:
import pandas
from sklearn.preprocessing import LabelEncoder
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LinearRegression
In [1]: import pandas
from sklearn.preprocessing import LabelEncoder
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LinearRegression
dt = pandas.read_csv('Players_data.csv')
dt
Out[1]:
Player Possesion% Pass% Goals Shots Rating
0 Silva 55 60 2 7 9
1 Deigo 54 64 0 6 5
2 Robert 57 77 2 6 10
3 Davies 49 75 1 5 7
4 Paul 54 65 0 6 6
5 Silva 51 71 1 2 6
6 Deigo 50 74 1 2 6
7 Robert 50 70 1 5 7
8 Davies 51 68 1 5 7
9 Paul 58 61 2 5 9
10 Silva 51 70 1 2 6
11 Deigo 56 71 2 7 10
12 Robert 57 59 2 5 9
13 Davies 52 60 1 3 6
14 Paul 57 77 2 7 10
15 Silva 58 64 2 6 9
16 Deigo 51 68 1 5 7
17 Robert 59 71 2 5 10
18 Davies 52 64 1 2 5
19 Paul 52 66 1 6 7
20 Silva 54 63 0 2 4
21 Deigo 52 64 1 7 6
22 Robert 55 66 2 6 10
23 Davies 59 63 2 6 9
24 Paul 50 75 1 3 7
25 Silva 51 69 1 3 7
26 Deigo 54 60 0 5 5
27 Robert 54 64 0 2 4
28 Davies 50 72 1 5 7
29 Paul 54 65 0 2 5
We have 5 players and their performances in the
previous 6 matches, i.e. 6 samples for each player
and 30 samples in total. All the ratings are
rounded off to non-decimal values, that's why we
will create a classifier along with a regressor. But
we need to encode the player names before moving
forward
In [2]: Enc = LabelEncoder()
plyrs = ['Silva','Deigo','Robert','Davies','Paul']
Enc.fit(plyrs)
dt['Player'] = Enc.transform(dt['Player'])
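To check which number each player received, we can look at the encoder's classes_ attribute, and inverse_transform() turns the numbers back into names. A small sketch using the same player list:

```python
from sklearn.preprocessing import LabelEncoder

plyrs = ['Silva', 'Deigo', 'Robert', 'Davies', 'Paul']
Enc = LabelEncoder()
Enc.fit(plyrs)

# classes_ is sorted alphabetically; a label's position is its code
codes = {str(name): int(code) for code, name in enumerate(Enc.classes_)}
print(codes)

# inverse_transform maps codes back to the original names
names = [str(name) for name in Enc.inverse_transform([4, 1, 3, 0, 2])]
print(names)
```

Note that the codes follow alphabetical order, not the order of the plyrs list, so Davies becomes 0 and Silva becomes 4.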
So we have encoded the Player labels. Now let's
move on to splitting, where we will use the data of the
first five matches for training and the sixth match
for testing our models
In [3]: X = dt.drop(columns='Rating')
y = dt['Rating']
trnX = X[:25]
tstX = X[25:]
trnY = y[:25]
tstY = y[25:]
RegModel = LinearRegression()
CModel = DecisionTreeClassifier()
RegModel.fit(trnX,trnY)
CModel.fit(trnX,trnY)
Out[3]: DecisionTreeClassifier()
First of all we separated the data into the input, i.e.
the players along with their performance, and the
output, i.e. the rating of each player. Then we
split the data into training and testing sets.
As mentioned earlier, we use the data of the first
five matches (5 players in 5 matches {5*5=25}) for
training and the last match for testing, which we did
manually using the slice syntax
Then we created our regressor RegModel and
classifier CModel and trained them with the data of
the first five matches
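Instead of hard-coding 25, the split point can be worked out from the number of players, as long as the rows are ordered match by match like ours. A sketch using only the Player and Rating columns copied from the table above:

```python
import pandas

players = ['Silva', 'Deigo', 'Robert', 'Davies', 'Paul']
ratings = [9, 5, 10, 7, 6,   6, 6, 7, 7, 9,   6, 10, 9, 6, 10,
           9, 7, 10, 5, 7,   4, 6, 10, 9, 7,  7, 5, 4, 7, 5]
dt = pandas.DataFrame({'Player': players * 6, 'Rating': ratings})

n_players = dt['Player'].nunique()   # 5 players, so 5 rows per match
split = len(dt) - n_players          # everything before the last match

trn = dt.iloc[:split]   # matches 1 to 5
tst = dt.iloc[split:]   # match 6
print(len(trn), len(tst))
```

This gives the same 25/5 split as the slices above, and keeps working if more matches are appended to the data.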
So let's pass the sixth match data and let the
models rate the players based on what they have
learnt. We can create a data frame to compare the
values
In [4]: pred_reg = RegModel.predict(tstX)
pred_cls = CModel.predict(tstX)
cmp = pandas.DataFrame({'Regressor':pred_reg,
                        'Classifier':pred_cls,
                        'Actual':tstY.values},
                       index=plyrs)
cmp
Regressor Classifier Actual
Silva 6.404661 7 7
Deigo 4.613903 5 5
Robert 4.158141 5 4
Davies 6.821053 7 7
Paul 4.160626 5 5
We can also plot the comparison to analyse it
visually
In [5]: cmp.plot(kind='bar')
Out[5]: <matplotlib.axes._subplots.AxesSubplot at 0x18bb7249
To our surprise the classifier model has an
accuracy of 80%, i.e. it accurately rated 4 of the 5
players. But is that all? If you observe carefully,
the regressor model is lagging just a bit behind,
and to its credit, when rating Robert the
regressor was closer to the actual value than the
classifier. We can also print
some metrics to get more insights
In [6]: from sklearn import metrics
mae = metrics.mean_absolute_error(tstY,pred_reg)
err = metrics.max_error(tstY,pred_reg)
acc = metrics.accuracy_score(tstY,pred_cls)
print('MAE:',mae,f'where max error is{err}')
print('CModel Accuracy:',acc)
MAE: 0.431579401888947 where max error is0.83937
CModel Accuracy: 0.8
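To see what these metrics measure, we can recompute them by hand from the comparison table above, without sklearn. The numbers below are copied from that table, so they are rounded to six decimals.

```python
actual   = [7, 5, 4, 7, 5]
pred_reg = [6.404661, 4.613903, 4.158141, 6.821053, 4.160626]
pred_cls = [7, 5, 5, 7, 5]

# Absolute difference between each regressor prediction and the actual rating
abs_err = [abs(a - p) for a, p in zip(actual, pred_reg)]

mae = sum(abs_err) / len(abs_err)   # mean absolute error
err = max(abs_err)                  # max error (worst single prediction)

# Accuracy: share of classifier predictions that match the actual rating exactly
acc = sum(a == p for a, p in zip(actual, pred_cls)) / len(actual)

print('MAE:', round(mae, 4), 'max error:', round(err, 4))
print('CModel Accuracy:', acc)
```

The worst regressor prediction is Paul's, and the single classifier miss is Robert's.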
So we can draw the conclusion that the regressor is
a close-to model, i.e. its predictions are within
about 0.4 of the actual values on average, which can
double in the worst case,
whereas the classifier is an on-point model for its
correct predictions, but when it predicts wrong the
values can differ from the actual by 1.0 or more.
So which model do you find
preferable?
Well, you may pick the classifier because it is an
on-point model, but it comes with its risks too,
i.e. its wrong predictions differ more
from the actual values than the regressor's
max error of 0.84
In my opinion both models have their own
advantages. But we need to choose one, so let's
choose the regressor. Why the regressor and not the
classifier? Using these models always comes with the
risk of some wrong predictions, but we need to
keep in mind that we must choose the
best. The regressor has a mean absolute error
of 0.4, so it stays close to the actual values, i.e.
most of the time it is off by less than that, and when
it is off by more, it's by just a bit
The values differ from the actual values by 0.4 on
average, so what if we
round off the values and plot the graph again? We
will get something like this
In [7]: cmp['Regressor'] = round(cmp['Regressor'])
cmp.plot(kind='bar')
Out[7]: <matplotlib.axes._subplots.AxesSubplot at 0x18bb7375
Now the regressor has 3 values on-point but 2 values
with a difference of 1.0, and the classifier has 4 values
on-point but 1 value with a difference of 1.0. So when
a prediction is off by about 0.4 or less, rounding brings
it to the actual rating, but when the error is 0.6 or 0.8,
as with Silva and Paul, rounding moves it to a full 1.0
difference. Now what is your judgement? Which
model should we use?
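These counts can be checked with a few lines of plain Python. The predictions are copied from the comparison table, and the helper name summary is made up for this sketch.

```python
actual   = [7, 5, 4, 7, 5]
pred_reg = [6.404661, 4.613903, 4.158141, 6.821053, 4.160626]
pred_cls = [7, 5, 5, 7, 5]

# Round each regressor prediction to the nearest whole rating
rounded_reg = [round(p) for p in pred_reg]

def summary(pred, actual):
    """Return (number of exact hits, largest miss) for a list of predictions."""
    diffs = [abs(p - a) for p, a in zip(pred, actual)]
    return diffs.count(0), max(diffs)

print('Rounded regressor:', rounded_reg, summary(rounded_reg, actual))
print('Classifier:', pred_cls, summary(pred_cls, actual))
```

After rounding, both models miss by at most one rating point; they differ only in how often they miss.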
So far we have learned which model does what, but at
times like these we also need to choose between the
models. I would like you to choose one of
them and come up with the best formation of
strikers using your chosen model and the provided
test values:
[Link]
Hint: Remember, what are the best qualities of a
model?
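Whichever model you choose, the last step is to pick the three highest-rated players. A minimal sketch of such a function follows; the name best_formation is hypothetical, and the ratings passed in are the regressor's sixth-match predictions from earlier, used only as sample input (they are not the test values linked above).

```python
def best_formation(ratings, n=3):
    """Return the names of the n highest-rated players, best first."""
    ranked = sorted(ratings.items(), key=lambda item: item[1], reverse=True)
    return [name for name, _ in ranked[:n]]

# Sample input: the regressor's predictions for the sixth match
sample_ratings = {'Silva': 6.404661, 'Deigo': 4.613903, 'Robert': 4.158141,
                  'Davies': 6.821053, 'Paul': 4.160626}

print(best_formation(sample_ratings))
```

With whole-number ratings from the classifier, ties are possible. sorted() is stable, so tied players keep the order in which they were passed in, and you may want your own tie-break rule.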
26 RESOURCES
• Datasets
• Forum
All the datasets used in this book, whether
it is a json file, txt file, csv file, etc.,
can be found at the link below:
[Link]/datasets
[Link]
If you have any doubts or unanswered
questions, feel free to drop them in the
forum below, which was created to help you:
[Link]/community
[Link]
To separate the readers of this book from
others, you can use the credentials below to
log in, or scan the code:
Username : ml_reader
Password : ml_reader_forum_pass
What's next?
It's been quite a long journey where we learnt a
lot of things. I would like you to look back at
the day you purchased this book and at
what you've gained since. But it's only the start,
there's still more to learn! I would be really
obliged to know what you think
about the book and how your experience was, please
leave a review here
Leave your thoughts
And Best of luck on your way ahead!