0% found this document useful (0 votes)
17 views589 pages

Machine Learning Concepts and Python Guide

The document is a comprehensive guide on machine learning, covering fundamental concepts, Python programming, and various algorithms and techniques used in the field. It includes sections on data preparation, model evaluation, and real-life applications, along with practical examples and Python implementations. Additionally, it addresses challenges, limitations, and the importance of data quality in machine learning projects.

Uploaded by

Samielgasim
Copyright
© All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
17 views589 pages

Machine Learning Concepts and Python Guide

The document is a comprehensive guide on machine learning, covering fundamental concepts, Python programming, and various algorithms and techniques used in the field. It includes sections on data preparation, model evaluation, and real-life applications, along with practical examples and Python implementations. Additionally, it addresses challenges, limitations, and the importance of data quality in machine learning projects.

Uploaded by

Samielgasim
Copyright
© All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd

*

PROGRAMMING ■»
9

Machine Learning Basics Concepts + Artificial Intelligence +


Python Programming + Python Machine Learning

MACHINE LEARNING
0 6

FECHNOLOGY ENGINEERING ALGORITHM DATA ANALYTICS CLUSTERING COMPUTER SIENCE

I^TWOW

KAVISHANKAR
Programming Machine Learning : Machine
Learning Basics Concepts + Artificial Intelli­
gence + Python Programming + Python Ma­
chine Learning
(Language of Book: English)

Author
Kavishankar Panchtilak
(Professional Blogger, Author)

Publisher
Kavis Web Designer
214, Ukwa Balaghat 481105 (Madhya Pradesh)
Company Website: Kavis Web Designer
Amazon Author Profile: [Link]

Copyright © 2023 kdp copyright All rights reserved.


No part of this publication may be reproduced, distributed, or transmitted in any form or by any means,
including photocopying, recording, or other electronic or mechanical methods, without the prior written
permission of the publisher, except in the case of brief quotations embodied in reviews and certain other
non-commercial uses permitted by copyright law.

Table Of Contents

Basic Concepts

Data
Feature
Model
Training
Testing
Overfitting
Underfitting
Why & When to Make Machines Learn?
Lack of human expertise
Dynamic scenarios
Difficulty in translating expertise into computational tasks
Machine Learning Model
Task(T)
Experience (E)
Performance (P)
Python Libraries

Why Python for Data Science?


Extensive set of packages
Easy prototyping
Collaboration feature
One language for many domains
Strengths and Weaknesses of Python
Strengths
Weakness
Installing Python
Installing Python Individually
Using Pre-packaged Python Distribution: Anaconda
Components of Python ML Ecosystem
Jupyter Notebook
Installation and Execution
Types of Cells in Jupyter Notebook
NumPy
Pandas
Scikit-learn
TensorFlow
PyTorch
Keras
OpenCV
Applications
Image and Speech Recognition
Natural Language Processing
Fraud Detection
Predictive Maintenance
Healthcare
Recommendation Systems
Autonomous Vehicles
Life Cycle

Problem Definition
Data Collection
Data Preparation
Model Building
Model Evaluation
Model Deployment
Required Skills
Programming Skills
Statistics and Mathematics
Mathematical Notation
Probability Theory
Optimization Problem
Data Preprocessing
Data Visualization
Machine Learning Algorithms
Deep Learning
Natural Language Processing
Problem-solving Skills
Communication Skills
Business Acumen
Implementation

Data Collection and Preparation


Data Exploration and Visualization
Feature Selection and Engineering
Model Selection and Training
Model Evaluation
Model Tuning
Deployment and Monitoring
Choosing the Language and IDE for ML Development
Language Choice
IDEs
Challenges & Common Issues

Overfitting
Underfitting
Data Quality Issues
Imbalanced Datasets
Model Interpretability
Generalization
Scalability
Ethical Considerations
Limitations

Dependence on Data Quality


Lack of Transparency
Limited Applicability
High Computational Costs
Data Privacy Concerns
Ethical Considerations
Dependence on Experts
Lack of Creativity and Intuition
Limited Interpretability
Real-Life Examples

Fraud Detection in Banking and Finance


Healthcare Diagnosis and Treatment
Natural Language Processing in Virtual Assistants and Chatbots
Autonomous Vehicles
Personalized Recommendations in e-commerce and Entertainment
Data Structure

Commonly Used Data Structure for Machine Learning


Arrays
Lists
Dictionaries
Linked Lists
Trees
Graphs
How Data Structure is Used in Machine Learning
Storing and Accessing Data
Pre-processing Data
Creating Feature Vectors
Building Decision Trees
Building Graphs
Mathematics

Linear Algebra
Calculus
Probability
Statistics
Artificial Intelligence

What is Machine Learning?


What is Artificial Intelligence?
Machine Learning vs. Artificial Intelligence
Neural Networks

What are Neural Networks?


Machine Learning vs. Neural Networks
Deep Learning

What is Deep Learning?


Machine Learning vs. Deep Learning
Getting Datasets

Public Datasets
Data Scraping
Data Purchase
Data Collection
Strategies for Acquiring High Quality Datasets
Identify the Problem You Want to Solve
Determine the Size of the Dataset
Ensure the Data is Relevant and Accurate
Preprocess the Data
Categorical Data

What is Categorical Data?


Techniques for Handling Categorical Data
One-Hot Encoding
Example
Label Encoding
Example
Frequency Encoding
Example
Target Encoding
Example
Binary Encoding
Example
Data Loading

Consideration While Loading CSV data


File Header
Comments
Delimiter
Quotes
Various Methods of Loading a CSV Data File
Using the CSV Module
Using the Pandas Library
Using the Nurnpy Library
Using the Scipy Library
Using the Sklearn Library
Data Understanding

Understand the Data before Uploading It in ML Projects


Identify Data Quality Issues
Determine Data Relevance
Select Appropriate ML Techniques
Improve Model Performance
Data Understanding with Statistics
Looking at Raw Data
Example
Checking Dimensions of Data
Example
Getting Each Attribute’s Data Type
Example
Statistical Summary of Data
Example
Reviewing Class Distribution
Example
Reviewing Correlation between Attributes
Example
Reviewing Skew of Attribute Distribution
Example
Data Preparation
Example
Why Data Pre-processing?
Data Pre-processing Techniques
Scaling
Example
Normalization
Types of Normalization
LI Normalization
L2 Normalization
Binarization
Example
Standardization
Example
Data Labeling
What is Label Encoding?
Example
Models
Supervised Learning
Unsupervised Learning
Semi-supervised Learning
Reinforcement Learning
Supervised

Classification
Regression
Algorithms for Supervised Learning
k-Nearest Neighbours
Decision Trees
Naive Bayes
Logistic Regression
Support Vector Machines
Unsupervised
What is Unsupervised Learning?
Algorithms for Unsupervised Learning
k-means clustering
Cluster Identification
Clustering
Association
Dimensionality Reduction
Anomaly Detection
Semi-supervised
Reinforcement

Supervised vs. Unsupervised

Differences between Supervised and Unsupervised Machine Learning


Data Visualization
Univariate Plots: Understanding Attributes Independently
Multivariate Plots: Interaction Among Multiple Variables
Histograms

Python Implementation of Histograms


Example
Histogram with Multiple Data Sets
Example
Density Plots

Python Implementation of Density Plots


Example
Density Plot with Multiple Data Sets
Example
Box and Whisker Plots

Python Implementation of Box and Whisker Plots


Example
Correlation Matrix Plot
Python Implementation of Correlation Matrix Plots
Example
Scatter Matrix Plot

Python Implementation of Scatter Matrix Plot


Example
Machine Learning - Statistics
Descriptive Statistics
Example
Inferential Statistics
Example
Mean, Median, Mode

Mean
Median
Mode
Python Implementation
Output
Standard Deviation

Types of Examples
Example 1
Example 2
Percentiles

Calculating Percentiles using NumPy


Example
Calculating Percentiles using Pandas
Example
Data Distribution
Normal Distribution
Example
Skewed Distribution
Example
Uniform Distribution
Example
Bimodal Distribution
Example
Skewness and Kurtosis

Example
Output
Bias and Variance

Example
Output
Reducing Bias and Variance
Example
Example
Example
Hypothesis

Properties of a Good Hypothesis


Regression Analysis
Types of Regression Models
Building a Regressor in Python
Step 1: Importing necessary python package
Step 2: Importing dataset
Step 3: Organizing data into training & testing sets
Step 4: Model evaluation & prediction
Step 5: Plot & visualization
Step 6: Performance computation
Types of ML Regression Algorithms
Applications
Linear Regression

Positive Linear Relationship


Negative Linear Relationship
Types of Linear Regression
Assumptions
Simple Linear Regression

Python Implementation
Data Preparation
Model Training
Model Testing
Model Evaluation
Plotting the Regression Line
Complete Implementation Example
Multiple Linear Regression

Python Implementation
Example
Polynomial Regression

Python Implementation
Example
Classification Algorithms

Data Preparation
Feature Extraction/Selection
Model Selection
Model Training
Model Evaluation
Hyperparameter Tuning
Model Deployment
Types of Learners in Classification
Lazy Learners
Eager Learners
Building a Classifier in Python
Step 1: Importing necessary python package
Step 2: Importing dataset
Step 3: Organizing data into training & testing sets
Step 4: Model evaluation
Step 5: Finding accuracy
Classification Evaluation Metrics
Confusion Matrix
Various ML Classification Algorithms
Applications
Logistic Regression

Implementation in Python
Load the Dataset
Plot the Training Data
Split the Dataset
Create the Logistic Regression Model
Train the Model
Make Predictions
Evaluate the Model
Complete Implementation Example
K-Nearest Neighbors (KNN)

Working of KNN Algorithm


Implementation in Python
Example
Naive Bayes Algorithm

Types of Naive Bayes Algorithm


Implementation in Python
Complete Implementation Example
Decision Trees Algorithm

Types of Decision Tree Algorithm


Implementation in Python
Complete Implementation Example
Support Vector Machine
Working of SVM
Implementation in Python
Example
Tuning SVM Parameters
Example
Random Forest
Working of Random Forest Algorithm
Advantages of Random Forest Algorithm
Implementation of Random Forest Algorithm in Python
Step 1 - Importing the Libraries
Step 2 - Loading the Data
Step 3 - Data Preprocessing
Step 4 - Training the Model
Step 5 - Making Predictions
Step 6 - Evaluating the Model
Complete Implementation Example
Confusion Matrix

How to Implement Confusion Matrix in Python?


Stochastic Gradient Descent

Working of Stochastic Gradient Descent Algorithm


Implementation of Stochastic Gradient Descent in Python
Example
Clustering Algorithms
Cluster Formation Methods
Density-based
Hierarchical-based
Partitioning
Grid
Types of ML Clustering Algorithms
K-means Clustering
Mean-Shift Algorithm
Hierarchical Clustering
Applications of Clustering
Centroid-Based Clustering
K-means Clustering
K-Medoids Clustering
K-Means Clustering
Implementation in Python
Step 1 - Import Required Libraries
Step 2 - Generate Data
Step 3 - Initialize K-Means
Step 4 - Train the Model
Step 5 - Visualize the Clusters
Complete Implementation Example
Applications of K-Means Clustering
Image Segmentation
Customer Segmentation
Anomaly Detection
Genomic Data Analysis
K-Medoids Clustering

K-Medoids Clustering - Algorithm


Implementation in Python
Example
K-Medoids Clustering - Advantages
K-Medoids Clustering - Disadvantages
Mean-Shift Clustering

Implementation of Mean-Shift Clustering in Python


Step 1 - Import the necessary libraries
Step 2 - Generate the data
Step 3 - Estimate the bandwidth of the kernel function
Step 4 - Initialize the Mean-Shift clustering algorithm
Step 5 - Train the model
Step 6 - Visualize the results
Example
Applications of Mean-Shift Clustering
Hierarchical Clustering

Steps to Perform Agglomerative Hierarchical Clustering


Role of Dendrograms in Agglomerative Hierarchical Clustering
Example 1
Example 2
Density-Based Clustering

Popular Density-based Clustering Algorithms


DBSCAN Clustering
OPTICS Clustering
HDBSCAN Clustering
DBSCAN Clustering
Implementation in Python
Load the dataset
Perform DBSCAN clustering
Visualize the results
Example
Advantages of DBSCAN
Disadvantages of DBSCAN
OPTICS Clustering

Implementation of OPTICS in Python


Example
Advantages of OPTICS Clustering
Disadvantages of OPTICS Clustering
HDBSCAN Clustering

Working of HDBSCAN Clustering


Implementation in Python
Installation
Usage
Advantages of HDBSCAN Clustering
BIRCH Clustering

Implementation of BIRCH Clustering in Python


Example
Advantages of BIRCH Clustering
Disadvantages of BIRCH Clustering
Affinity Propagation
Implementation in Python
Example
Advantages of Affinity Propagation
Disadvantages of Affinity Propagation
Distribution-Based Clustering

Gaussian Mixture Model


Implementation in Python
Example
Advantages of Gaussian Mixture Models
Disadvantages of Gaussian Mixture Models
Agglomerative Clustering

Implementation in Python
Example
Advantages of Agglomerative Clustering
Disadvantages of Agglomerative Clustering
Dimensionality Reduction

Feature Selection
Feature Extraction
Feature Selection

Filter Methods
Wrapper Methods
Embedded Methods
Principal Component Analysis (PCA)
Recursive Feature Elimination (RFE)
Example
Feature Extraction

Example
Output
Advantages of Feature Extraction
Disadvantages of Feature Extraction
Backward Elimination

Implementation in Python
Example
Forward Feature Construction

Example
Output
High Correlation Filter

Example
Output
Advantages of High Correlation Filter
Disadvantages of High Correlation Filter
Low Variance Filter

Example
Output
Advantages of Low Variance Filter
Disadvantages of Low Variance Filter
Missing Values Ratio

Example
Output
Advantages of Missing Value Ratio
Disadvantages of Missing Value Ratio
Principal Component Analysis

Example
Output
Advantages of PC A
Disadvantages of PC A
Performance Metrics

Example
Output
Automatic Workflows
Introduction
Challenges Accompanying ML Pipelines
Quality of Data
Data Reliability
Data Accessibility
Modelling ML Pipeline and Data Preparation
Example
Modelling ML Pipeline and Feature Extraction
Example
Boost Model Performance

Example
Output
Performance Improvement with Ensembles
Sequential ensemble methods
Parallel ensemble methods
Ensemble Learning Methods
Bagging
Boosting
Voting
Bagging Ensemble Algorithms
Bagged Decision Tree
Output
Random Forest
Output
Extra Trees
Output
Boosting Ensemble Algorithms
AdaBoost
Output
Stochastic Gradient Boosting
Output
Voting Ensemble Algorithms
Output
Gradient Boosting
What is a Gradient Boosting Machine (GBM)?
How Does a Gradient Boosting Machine Work?
Example
Output
Advantages of Using Gradient Boosting Machines
Limitations of Gradient Boosting Machines
Bootstrap Aggregation (Bagging)
How Does Bagging Work?
Example
Output
Cross Validation

What is Cross-Validation?
Why is Cross-Validation Important?
Implementing Cross-Validation in Python
Example
AUC-ROC Curve
What is the AUC-ROC Curve?
Why is the AUC-ROC Curve Important?
Implementing the AUC ROC Curve in Python
Example
Grid Search

Implementation in Python
Example
Data Scaling

Example
Output
Train and Test

Example
Output
Association Rules

Example
Output
Apriori Algorithm

Example
Output
Gaussian Discriminant Analysis

Example
Output
Cost Function
Implementation in Python
Example
Bayes Theorem
Implementation in Python
Example
Precision and Recall
Implementation in Python
Example
Adversarial
Implementation in Python
Example
Stacking
Example
Output
Epoch

Implementation in Python
Example
Perceptron
Architecture of Perceptron
Training of Perceptron
Implementation of Perceptron in Python
Example
Role of Step Functions in the Training of Perceptrons
Regularization
Overfitting

Causes of Overfitting
Techniques to Prevent Overfitting
Example
Output
P-value

Implementation of P-value in Python


Output
Example
Output
MLOps
MLOps Best Practices
Python Libraries for MLOps
Data Leakage

Target Leakage
Train-test Contamination
How to Prevent Data Leakage?
Implementation in Python
Example

Basic Concepts
Machine learning, as we know, is a subset of artificial intelligence that involves training computer algo­
rithms to automatically learn patterns and relationships in data. Here are some basic concepts of machine
learning -

Data
Data is the foundation of machine learning. Without data, there would be nothing for the algorithm to
learn from. Data can come in many forms, including structured data (such as spreadsheets and databases)
and unstructured data (such as text and images). The quality and quantity of the data used to train the ma­
chine learning algorithm are crucial factors that can significantly impact its performance.

Feature
In machine learning, features are the variables or attributes used to describe the input data. The goal is to
select the most relevant and informative features that will allow the algorithm to make accurate predic­
tions or decisions. Feature selection is a crucial step in the machine learning process because the perfor­
mance of the algorithm is heavily dependent on the quality and relevance of the features used.

Model
A machine learning model is a mathematical representation of the relationship between the input data
(features) and the output (predictions or decisions). The model is created using a training dataset and then
evaluated using a separate validation dataset. The goal is to create a model that can accurately generalize
to new, unseen data.
Training
Training is the process of teaching the machine learning algorithm to make accurate predictions or
decisions. This is done by providing the algorithm with a large dataset and allowing it to learn from the
patterns and relationships in the data. During training, the algorithm adjusts its internal parameters to
minimize the difference between its predicted output and the actual output.

Testing
Testing is the process of evaluating the performance of the machine learning algorithm on a separate
dataset that it has not seen before. The goal is to determine how well the algorithm generalizes to new, un­
seen data. If the algorithm performs well on the testing dataset, it is considered to be a successful model.

Overfitting
Overfitting occurs when a machine learning model is too complex and fits the training data too closely.
This can lead to poor performance on new, unseen data because the model is too specialized to the training
dataset. To prevent overfitting, it is important to use a validation dataset to evaluate the model's perfor­
mance and to use regularization techniques to simplify the model.
Underfitting
Underfitting occurs when a machine learning model is too simple and cannot capture the patterns and
relationships in the data. This can lead to poor performance on both the training and testing datasets.
To prevent underfitting, we can use several techniques such as increasing model complexity, collect more
data, reduce regularization, and feature engineering.

It is important to note that preventing underfitting is a balancing act between model complexity and the
amount of data available. Increasing model complexity can help prevent underfitting, but if there is not
enough data to support the increased complexity, overfitting may occur instead. Therefore, it is important
to monitor the model's performance and adjust the complexity as necessary.

Why & When to Make Machines Learn?


We have already discussed the need for machine learning, but another question arises that in what scenar­
ios we must make the machine learn? There can be several circumstances where we need machines to take
data-driven decisions with efficiency and at a huge scale. The followings are some of such circumstances
where making machines learn would be more effective -

Lack of human expertise


The very first scenario in which we want a machine to learn and take data-driven decisions, can be the
domain where there is a lack of human expertise. The examples can be navigations in unknown territories
or spatial planets.

Dynamic scenarios
There are some scenarios which are dynamic in nature i.e. they keep changing over time. In case of these
scenarios and behaviors, we want a machine to learn and take data-driven decisions. Some of the examples
can be network connectivity and availability of infrastructure in an organization.

Difficulty in translating expertise into computational tasks


There can be various domains in which humans have their expertise,; however, they are unable to trans­
late this expertise into computational tasks. In such circumstances we want machine learning. The exam­
ples can be the domains of speech recognition, cognitive tasks etc.

Machine Learning Model


Before discussing the machine learning model, we must need to understand the following formal defini­
tion of ML given by professor Mitchell -

"A computer program is said to learn from experience E with respect to some class of tasks T and perfor­
mance measure P, if its performance at tasks in T, as measured by P, improves with experience E.”
The above definition is basically focusing on three parameters, also the main components of any learning
algorithm, namely Task(T), Performance(P) and experience (E). In this context, we can simplify this defi­
nition as -

ML is a field of Al consisting of learning algorithms that -

1. Improve their performance (P)


2. At executing some task (T)
3. Over time with experience (E)

Based on the above, the following diagram represents a Machine Learning Model -
Let us discuss them more in detail now -

Task(T)
From the perspective of problem, we may define the task T as the real-world problem to be solved. The
problem can be anything like finding best house price in a specific location or to find best marketing strat­
egy etc. On the other hand, if we talk about machine learning, the definition of task is different because it
is difficult to solve ML based tasks by conventional programming approach.

A task T is said to be a ML based task when it is based on the process and the system must follow for
operating on data points. The examples of ML based tasks are Classification, Regression, Structured anno­
tation, Clustering, Transcription etc.

Experience (E)
As name suggests, it is the knowledge gained from data points provided to the algorithm or model. Once
provided with the dataset, the model will run iteratively and will learn some inherent pattern. The learn­
ing thus acquired is called experience(E). Making an analogy with human learning, we can think of this
situation as in which a human being is learning or gaining some experience from various attributes like
situation, relationships etc. Supervised, unsupervised and reinforcement learning are some ways to learn
or gain experience. The experience gained by out ML model or algorithm will be used to solve the task T.

Performance (P)
An ML algorithm is supposed to perform task and gain experience with the passage of time. The measure
which tells whether ML algorithm is performing as per expectation or not is its performance (P). P is basi­
cally a quantitative metric that tells how a model is performing the task, T, using its experience, E. There
are many metrics that help to understand the ML performance, such as accuracy score, Fl score, confusion
matrix, precision, recall, sensitivity etc.

Python Libraries

Python has become one of the most popular programming languages for machine learning due to its
simplicity, versatility, and extensive ecosystem of libraries and tools. In this chapter, we will explore the
Python ecosystem for machine learning and highlight some of the most popular libraries and frameworks.

Why Python for Data Science?


Python is the fifth most important language as well as most popular language for Machine learning and
data science. The following are the features of Python that makes it the preferred choice of language for
data science -

Extensive set of packages


Python has an extensive and powerful set of packages which are ready to be used in various domains. It
also has packages like numpy, scipy, pandas, scikit-learn etc. which are required for machine learning and
data science.

Easy prototyping
Another important feature of Python that makes it the choice of language for data science is the easy and
fast prototyping. This feature is useful for developing new algorithm.

Collaboration feature
The field of data science basically needs good collaboration and Python provides many useful tools that
make this extremely.

One language for many domains


A typical data science project includes various domains like data extraction, data manipulation, data
analysis, feature extraction, modelling, evaluation, deployment and updating the solution. As Python is a
multi-purpose language, it allows the data scientist to address all these domains from a common platform.

Strengths and Weaknesses of Python


Every programming language has some strengths as well as weaknesses, so does Python too.
Strengths
According to studies and surveys, Python is the fifth most important language as well as the most popular
language for machine learning and data science. It is because of the following strengths that Python has -

Easy to learn and understand - The syntax of Python is simpler; hence it is relatively easy, even for begin­
ners also, to learn and understand the language.

Multi-purpose language - Python is a multi-purpose programming language because it supports struc­


tured programming, object-oriented programming as well as functional programming.

Huge number of modules - Python has huge number of modules for covering every aspect of program­
ming. These modules are easily available for use hence making Python an extensible language.

Support of open source community - As being open source programming language, Python is supported
by a very large developer community. Due to this, the bugs are easily fixed by the Python community. This
characteristic makes Python very robust and adaptive.

Scalability - Python is a scalable programming language because it provides an improved structure for
supporting large programs than shell-scripts.

Weakness
Although Python is a popular and powerful programming language, it has its own weakness of slow exe­
cution speed.
The execution speed of Python is slow as compared to compiled languages because Python is an inter­
preted language. This can be the major area of improvement for Python community.

Installing Python
For working in Python, we must first have to install it. You can perform the installation of Python in any of
the following two ways -

1. Installing Python individually


2. Using Pre-packaged Python distribution - Anaconda

Let us discuss these each in detail.

Installing Python Individually


If you want to install Python on your computer, then then you need to download only the binary code
applicable for your platform. Python distribution is available for Windows, Linux and Mac platforms.

The following is a quick overview of installing Python on the above-mentioned platforms -

On Unix and Linux platform

With the help of following steps, we can install Python on Unix and Linux platform -

1. First, go to [Link]/downloads/.
2. Next, click on the link to download zipped source code available for Unix/Linux.
3. Now, Download and extract files.
4. Next, we can edit the Modules/Setup file if we want to customize some options.
a. Next, write the command run ./configure script
b. make
c. make install

On Windows platform

With the help of following steps, we can install Python on Windows platform -

1. First, go to [Link]/downloads/.
2. Next, click on the link for Windows installer [Link] file. Here XYZ is the version we
wish to install.
3. Now, we must run the file that is downloaded. It will take us to the Python install wizard,
which is easy to use. Now, accept the default settings and wait until the install is finished.

On Macintosh platform

For Mac OS X, Homebrew, a great and easy to use package installer is recommended to install Python 3. In
case if you don't have Homebrew, you can install it with the help of following command -

p[Link]

It can be updated with the command below -


brew update

Now, to install Python 3 on your system, we need to run the following command -

brew install pvthon3

Using Pre-packaged Python Distribution: Anaconda


Anaconda is a packaged compilation of Python which have all the libraries widely used in Data science. We
can follow the following steps to setup Python environment using Anaconda -

1. Step 1 - First, we need to download the required installation package from Anaconda distri­
bution. The link for the same is [Link]/distribution/. You can choose from Win­
dows, Mac and Linux OS as per your requirement.
2. Step 2 - Next, select the Python version you want to install on your machine. The latest
Python version is 3.7. There you will get the options for 64-bit and 32-bit Graphical installer
both.
3. Step 3 - After selecting the OS and Python version, it will download the Anaconda installer on
your computer. Now, double click the file and the installer will install Anaconda package.
4. Step 4 - For checking whether it is installed or not, open a command prompt and type Python.

You can also check this in detailed video lecture at Python Essentials Online Training.
Components of Python ML Ecosystem
In this section, let us discuss some core Data Science libraries that form the components of Python
Machine learning ecosystem. These useful components make Python an important language for Data
Science. Though there are many such components, let us discuss some of the importance components of
Python ecosystem here -

Jupyter Notebook
Jupyter notebooks basically provides an interactive computational environment for developing Python
based Data Science applications. They are formerly known as ipython notebooks. The following are some
of the features of Jupyter notebooks that makes it one of the best components of Python ML ecosystem -

1. Jupyter notebooks can illustrate the analysis process step by step by arranging the stuff like
code, images, text, output etc. in a step by step manner.
2. It helps a data scientist to document the thought process while developing the analysis
process.
3. One can also capture the result as the part of the notebook.
4. With the help of jupyter notebooks, we can share our work with a peer also.

Installation and Execution


If you are using Anaconda distribution, then you need not install jupyter notebook separately as it is al­
ready installed with it. You just need to go to Anaconda Prompt and type the following command -

C:\> jupyter notebook

After pressing enter, it will start a notebook server at localhost:8888 of your computer. It is shown in the
following screen shot -
3 x + —ox

r -» C © localhost-SSS.-tree ☆ © © :

~ jupyter o«t Logout

Fies Running Clusters

Select items to perform actions on them upload New- C

□ 0 - ■/ Name 4- Last Mod tied File see

□ CnDObjects 2 months ago

□ C2Anaconda3 16 days ago

U 03 Contacts 2 months ago

L C3 Creative Cloud Fies an hour ago

□ D Desktop 5 days ago

□ 03 Documents 16 days ago

□ 03 Downloads 5 days ago

□ 03 Farorites 2 months ago

Q 03 Links 2 morins ago

□ 03 Music 2 months ago

□ 03 nodejnoduies 2 months ago

□ 03 OneDrnt 25 days ago

□ 03 Pictures 2 months ago

8 03 Saved Games 2 months ago

□ 03 Searches 2 months ago

* * -
Now, after clicking the New tab, you will get a list of options. Select Python 3 and it will take you to the new
notebook for start working in it. You will get a glimpse of it in the following screenshots -

3 X D Untrtitdl -ox
-> C © localhost:8888/tree Gt ☆ © O :

Cjupyter Quit Logout

Files Running Clusters

Select items to perform actions on them. Upload New ▼ 0


Notebook:
L 0 ▼ fc I Name 4*
Python 3
O CJ 3D Objects
Other:

□ CD Anaconda3 Text File

□ CD Contacts Folder
Terminal
□ CD Creative Cloud Files

□ CD Desktop 5 days ago

CD Documents 16 days ago

CD Downloads 5 days ago


Home X 0 Untitled 1 X + -OX

C 0 localhost8888/notebooks/[Link]’kernel name=python3 ☆ © O :
C® Logout
2 jupyter Untitledl Last Checkpoint a few seconds ago (unsaved changes)

File Edit View Insect Cell Kernel Widgets Help Trusted ✓ | Python 3 O

O + 2c Q 14* H Run ■ C H code


On the other hand, if you are using standard Python distribution then jupyter notebook can be installed
using popular python package installer, pip.

pip install jupyte

Types of Cells in Jupyter Notebook


The following are the three types of cells in a jupyter notebook -

Code cells - As the name suggests, we can use these cells to write code. After writing the code/content, it
will send it to the kernel that is associated with the notebook.

Markdown cells - We can use these cells for notating the computation process. They can contain the stuff
like text, images, Latex equations, HTML tags etc.

Raw cells - The text written in them is displayed as it is. These cells are basically used to add the text that
we do not wish to be converted by the automatic conversion mechanism of jupyter notebook.

For more detailed study of jupyter notebook, you can go to the link [Link]/jupyter/
[Link].

NumPy
NumPy is a fundamental library for scientific computing in Python. It provides support for large, multi­
dimensional arrays and matrices, along with a collection of mathematical functions to operate on them.
NumPy is a critical component of the Python machine learning ecosystem, as it provides the underlying
data structure and numerical operations required for many machine learning algorithms. Below is the
command to install NumPy -

pin install nump

Pandas
Pandas is a powerful library for data manipulation and analysis. It provides a range of functions for im­
porting, cleaning, and transforming data, along with powerful tools for grouping and aggregating data.

Pandas is particularly useful for data preprocessing in machine learning, as it allows for efficient data han­
dling and manipulation. Below is the command to install Pandas -

pip install pandas

Scikit-learn
Scikit-learn is a popular machine learning library in Python, providing a range of algorithms for classifica­
tion, regression, clustering, and more. It also includes tools for data preprocessing, feature selection, and
model evaluation. Scikit-learn is widely used in the machine learning community due to its ease of use,
performance, and extensive documentation.

Below is the command to install Scikit-learn -


pip install scikit-learn

TensorFlow
TensorFlow is an open-source library for machine learning developed by Google. It provides support for
building and training deep learning models, along with tools for distributed computing and deployment.
TensorFlow is a powerful tool for building complex machine learning models, particularly in the areas of
computer vision and natural language processing. Below is the command to install TensorFlow -

pin install tensorflow

PyTorch
PyTorch is another popular deep learning library in Python. Developed by Facebook, it provides a range of
tools for building and training neural networks, along with support for dynamic computation graphs and
GPU acceleration.

PyTorch is particularly useful for researchers and developers who need a flexible and powerful deep learn­
ing framework. Below is the command to install PyTorch -

pip install tore


Keras
Keras is a high-level neural network library that runs on top of TensorFlow and other lower-level frame­
works. It provides a simple and intuitive API for building and training deep learning models, making it an
excellent choice for beginners and researchers who need to quickly prototype and experiment with differ­
ent models. Below is the command to install Keras -

pip install keras

OpenCV
OpenCV is a computer vision library that provides tools for image and video processing, along with sup­
port for machine learning algorithms. It is widely used in the computer vision community for tasks such
as object detection, image segmentation, and facial recognition. Below is the command to install OpenCV -

ip install opencv-python

In addition to these libraries, there are many other tools and frameworks in the Python ecosystem for
machine learning, including XGBoost, LightGBM, spaCy, and NLTK. The Python ecosystem for machine
learning is constantly evolving, with new libraries and tools being developed all the time.
Whether you are a beginner or an experienced machine learning practitioner, Python provides a rich and
flexible environment for developing and deploying machine learning models.

Here, it is also important to note that some libraries may require additional dependencies or system­
specific requirements. In such cases, it is recommended to consult the library's documentation for instal­
lation instructions and requirements.

Applications
Machine learning has become a ubiquitous technology that has impacted many aspects of our lives, from
business to healthcare to entertainment. Here are some popular applications of machine learning -

Image and Speech Recognition


Image and speech recognition are two areas where machine learning has made significant advancements.
Machine learning algorithms are used in applications such as facial recognition, object detection, and
speech recognition to accurately identify and classify images and speech.

Natural Language Processing


Natural Language Processing (NLP) is a field of computer science that deals with the interaction between
computers and humans using natural language. NLP uses machine learning algorithms to analyze, under­
stand, and generate human language, making it possible to build chatbots, language translators, and voice
assistants.

Fraud Detection
Machine learning is widely used in the finance industry for fraud detection. Machine learning algorithms
can analyze vast amounts of transactional data to detect patterns and anomalies that may indicate fraud­
ulent activity, helping to prevent financial losses and protect customers.

Predictive Maintenance
Predictive maintenance is a process of using machine learning algorithms to predict when maintenance
will be required on a machine, such as a piece of equipment in a factory. By analyzing data from sensors
and other sources, machine learning algorithms can detect patterns that indicate when a machine is likely
to fail, enabling maintenance to be performed before the machine breaks down.

Healthcare
Machine learning has also found many applications in the healthcare industry. For example, machine
learning algorithms can be used to analyze medical images and detect diseases such as cancer, or to predict
patient outcomes based on their medical history and other factors.
Recommendation Systems
Recommendation systems are used to provide personalized recommendations to users based on their past
behavior and preferences. Machine learning algorithms are used to analyze user data and generate recom­
mendations for products, services, and content.

Autonomous Vehicles
Machine learning is a critical technology for the development of autonomous vehicles. Machine learning
algorithms are used to process data from sensors and cameras, allowing vehicles to detect and respond to
their environment in real-time.

These are just a few examples of the many applications of machine learning. As machine learning contin­
ues to evolve and improve, we can expect to see it used in more areas of our lives, improving efficiency,
accuracy, and convenience in a variety of industries.

Life Cycle

The machine learning life cycle is a process that involves several stages from problem identification to
model deployment. Here are the six stages of the machine learning life cycle -
Problem Definition
The first step in the machine learning life cycle is to identify the problem you want to solve. This stage
involves understanding the business problem, defining the problem statement, and identifying the suc­
cess criteria for the machine learning model.

Data Collection
The second stage is to collect the data that will be used to train the machine learning model. This stage
involves identifying the relevant data sources, collecting and storing the data, and cleaning and prepro­
cessing the data to prepare it for analysis.

Data Preparation
In this stage, the data is prepared for analysis by performing data exploration, feature engineering, and
feature selection. Data exploration involves visualizing and understanding the data, while feature engi­
neering involves creating new features from the existing data. Feature selection involves selecting the
most relevant features that will be used to train the machine learning model.

Model Building
In this stage, the machine learning model is built using the prepared data. The model building process
involves selecting the appropriate machine learning algorithm, tuning the hyperparameters of the algo­
rithm, and evaluating the performance of the model using cross-validation techniques.

Model Evaluation
In this stage, the performance of the machine learning model is evaluated using a set of evaluation
metrics. These metrics measure the accuracy, precision, recall, and Fl score of the model. If the model's
performance is not satisfactory, it may be necessary to return to the model building stage to improve the
model's performance.

Model Deployment
The final stage of the machine learning life cycle is to deploy the machine learning model into production.
This involves integrating the model into the production environment, testing the model in a real-world
scenario, and monitoring the model's performance to ensure that it continues to perform as expected.

The machine learning life cycle is an iterative process, and it may be necessary to revisit previous stages
to improve the model's performance or address new requirements. By following the machine learning life
cycle, data scientists can ensure that their machine learning models are effective, accurate, and meet the
business requirements.
Required Skills
Machine learning is a rapidly growing field that requires a combination of technical and soft skills to be
successful. Here are some of the key skills required for machine learning -

Programming Skills
Machine learning requires a solid foundation in programming skills, particularly in languages such as
Python, R, and Java. Proficiency in programming allows data scientists to build, test, and deploy machine
learning models.

Statistics and Mathematics


A strong understanding of statistics and mathematics is essential for machine learning. Data scientists
must be able to understand and apply statistical models, algorithms, and methods to analyze and interpret
data.

To give you a brief idea of what skills you need to acquire, let us discuss some examples -

Mathematical Notation
Most of the machine learning algorithms are heavily based on mathematics. The level of mathematics that
you need to know is probably just a beginner level. What is important is that you should be able to read the
notation that mathematicians use in their equations. For example - if you are able to read the notation and
comprehend what it means, you are ready for learning machine learning. If not, you may need to brush up
your mathematics knowledge.

7 if net — 9 > e

{ net — 9 if — e < net — 9 < e


—y if net — 9 < —e

max label^1 > label^ • ai ’ aj{x^l\x^

^X(net—0) g—X(net—0) \

( eX(net-0) _|_ e-X(net-0) J

Probability Theory
Here is an example to test your current knowledge of probability theory: Classifying with conditional
probabilities.
With these definitions, we can define the Bayesian classification rule -

• If P(c 1 |x, y) > P(c2 |x, y), the class is c 1.


• If P(cl|x, y) < P(c2|x, y), the class is c2 .

Optimization Problem
Here is an optimization function

in
max label^ • labels • a, • aj{x^'\x^

Subject to the following constraints -

m
a > [Link] ? a. • label^ = 0
t—1
If you can read and understand the above, you are all set.
Data Preprocessing
Preparing data for machine learning requires knowledge of data cleaning, data transformation, and data
normalization. This involves identifying and correcting errors, missing values, and inconsistencies in the
data.

Data Visualization
Data visualization is the process of creating graphical representations of data to help users understand and
interpret complex data sets. Data scientists must be able to create effective visualizations that communi­
cate insights from the data.

In many cases, you will need to understand the various types of visualization plots to understand your
data distribution and interpret the results of the algorithm’s output.
Besides the above theoretical aspects of machine learning, you need good programming skills to code those
algorithms.

Machine Learning Algorithms


Machine learning requires knowledge of various algorithms, such as regression, decision trees, random
forests, k-nearest neighbors, support vector machines, and neural networks. Understanding the strengths
and weaknesses of these algorithms is critical for building effective machine learning models.
Deep Learning
Deep learning is a subfield of machine learning that involves training deep neural networks to analyze
complex data sets. Deep learning requires a strong understanding of neural networks, convolutional neu­
ral networks, recurrent neural networks, and other related topics.

Natural Language Processing


Natural language processing (NLP) is a branch of artificial intelligence that focuses on the interaction
between computers and humans using natural language. NLP requires knowledge of techniques such as
sentiment analysis, text classification, and named entity recognition.

Problem-solving Skills
Machine learning requires strong problem-solving skills, including the ability to identify problems, gen­
erate hypotheses, and develop solutions. Data scientists must be able to think creatively and logically to
develop effective solutions to complex problems.

Communication Skills
Communication skills are essential for data scientists, as they must be able to explain complex technical
concepts to non-technical stakeholders. Data scientists must be able to communicate the results of their
analysis and the implications of their findings in a clear and concise manner.

Business Acumen
Machine learning is used to solve business problems, and therefore, understanding the business context
and the ability to apply machine learning to business problems is essential.

Overall, machine learning requires a broad range of skills, including technical, mathematical, and soft
skills. To be successful in this field, data scientists must be able to combine these skills to develop effective
machine learning models that solve complex business problems.

Implementation
Implementing machine learning involves several steps, which include -

Data Collection and Preparation


The first step in implementing machine learning is collecting the data that will be used to train and test the
model. The data should be relevant to the problem that the machine learning model is being built to solve.
Once the data has been collected, it needs to be preprocessed and cleaned to remove any inconsistencies or
missing values.
Data Exploration and Visualization
The next step is to explore and visualize the data to gain insights into its structure and identify any
patterns or trends. Data visualization tools such as matplotlib and seaborn can be used to create visualiza­
tions such as histograms, scatter plots, and heat maps.

Feature Selection and Engineering


The features of the data that are relevant to the problem need to be selected or engineered. Feature engi­
neering involves creating new features from existing data that can improve the accuracy of the model.

Model Selection and Training


Once the data has been prepared and features selected or engineered, the next step is to select a suitable
machine learning algorithm to train the model. This involves splitting the data into training and testing
sets and using the training set to fit the model. Various machine learning algorithms such as linear regres­
sion, logistic regression, decision trees, random forests, support vector machines, and neural networks can
be used to train the model.

Model Evaluation
After training the model, it needs to be evaluated to determine its performance. The performance of the
model can be evaluated using metrics such as accuracy, precision, recall, and Fl score. Cross-validation
techniques can also be used to test the model's performance.

Model Tuning
The performance of the model can be improved by tuning its hyperparameters. Hyperparameters are
settings that are not learned from the data, but rather set by the user. The optimal values for these hyper­
parameters can be found using techniques such as grid search and random search.

Deployment and Monitoring


Once the model has been trained and tuned, it needs to be deployed to a production environment. The
deployment process involves integrating the model into the business process or system. The model also
needs to be monitored regularly to ensure that it continues to perform well and to identify any issues that
need to be addressed.

Each of the above steps requires different tools and techniques, and successful implementation requires a
combination of technical and business skills.

Choosing the Language and IDE for ML Development


To develop ML applications, you will have to decide on the platform, the IDE and the language for develop­
ment. There are several choices available. Most of these would meet your requirements easily as all of them
provide the implementation of Al algorithms discussed so far.
If you are developing the ML algorithm on your own, the following aspects need to be understood carefully

The language of your choice - this essentially is your proficiency in one of the languages supported in ML
development.

The IDE that you use - This would depend on your familiarity with the existing IDEs and your comfort
level.

Development platform - There are several platforms available for development and deployment. Most of
these are free-to-use. In some cases, you may have to incur a license fee beyond a certain amount of usage.
Here is a brief list of choice of languages, IDEs and platforms for your ready reference.

Language Choice
Here is a list of languages that support ML development -

1. Python
2. R
3. Matlab
4. Octave
5. Julia
6. C+ +
7. C
This list is not essentially comprehensive; however, it covers many popular languages used in machine
learning development. Depending upon your comfort level, select a language for the development, develop
your models and test.

IDEs
Here is a list of IDEs which support ML development -

1. R Studio
2. Pycharm
3. iPython/Jupyter Notebook
4. Julia
5. Spyder
6. Anaconda
7. Rodeo
8. Google-Colab

The above list is not essentially comprehensive. Each one has its own merits and demerits. The reader is
encouraged to try out these different IDEs before narrowing down to a single one.
Challenges & Common Issues

Machine learning is a rapidly growing field with many promising applications. However, there are also
several challenges and issues that must be addressed to fully realize the potential of machine learning.
Some of the major challenges and common issues faced in machine learning include -

Overfitting
Overfitting occurs when a model is trained on a limited set of data and becomes too complex, leading to
poor performance when tested on new data. This can be addressed by using techniques such as cross-vali­
dation, regularization, and early stopping.

Underfitting
Underfitting occurs when a model is too simple and fails to capture the patterns in the data. This can be
addressed by using more complex models or by adding more features to the data.

Data Quality Issues


Machine learning models are only as good as the data they are trained on. Poor quality data can lead to
inaccurate models. Data quality issues include missing values, incorrect values, and outliers.
Imbalanced Datasets
Imbalanced datasets occur when one class of data is significantly more prevalent than another. This can
lead to biased models that are accurate for the majority class but perform poorly on the minority class.

Model Interpretability
Machine learning models can be very complex, making it difficult to understand how they arrive at their
predictions. This can be a challenge when explaining the model to stakeholders or regulatory bodies. Tech­
niques such as feature importance and partial dependence plots can help improve model interpretability.

Generalization
Machine learning models are trained on a specific dataset, and they may not perform well on new data
that is outside the training set. This can be addressed by using techniques such as cross-validation and
regularization.

Scalability
Machine learning models can be computationally expensive and may not scale well to large datasets.
Techniques such as distributed computing, parallel processing, and sampling can help address scalability
issues.
Ethical Considerations
Machine learning models can raise ethical concerns when they are used to make decisions that affect
people's lives. These concerns include bias, privacy, and transparency. Techniques such as fairness metrics
and explainable Al can help address ethical considerations.

Addressing these issues requires a combination of technical expertise and business knowledge, as well as
an understanding of ethical considerations. By addressing these issues, machine learning can be used to
develop accurate and reliable models that can provide valuable insights and drive business value.

Limitations

Machine learning is a powerful technology that has transformed the way we approach data analysis, but
like any technology, it has its limitations. Here are some of the key limitations of machine learning -

Dependence on Data Quality


Machine learning models are only as good as the data used to train them. If the data is incomplete, biased,
or of poor quality, the model may not perform well.
Lack of Transparency
Machine learning models can be very complex, making it difficult to understand how they arrive at their
predictions. This lack of transparency can make it challenging to explain model results to stakeholders.

Limited Applicability
Machine learning models are designed to find patterns in data, which means they may not be suitable for
all types of data or problems.

High Computational Costs


Machine learning models can be computationally expensive, requiring significant processing power and
storage.

Data Privacy Concerns


Machine learning models can sometimes collect and use personal data, which raises concerns about pri­
vacy and data security.

Ethical Considerations
Machine learning models can sometimes perpetuate biases or discriminate against certain groups, raising
ethical concerns.

Dependence on Experts
Developing and deploying machine learning models requires significant expertise in data science, statis­
tics, and programming, making it challenging for organizations without access to these skills.

Lack of Creativity and Intuition


Machine learning algorithms are good at finding patterns in data but lack creativity and intuition. This
means that they may not be able to solve problems that require creative thinking or intuition.

Limited Interpretability
Some machine learning models, such as deep neural networks, can be difficult to interpret. This means
that it may be challenging to understand how the model arrived at its predictions.

Real-Life Examples

Machine learning has been transforming various industries by automating processes, predicting out­
comes, and discovering patterns in large data sets. Here we are providing the top 5 real-life examples of
machine learning -
Fraud Detection in Banking and Finance
Machine learning algorithms are widely used in the financial industry to detect fraudulent activities.
These algorithms can analyze transaction data and identify patterns that indicate fraud. For example,
credit card companies use machine learning to identify transactions that are likely to be fraudulent and
notify customers in real-time. Banks also use machine learning to detect money laundering, identify un­
usual behavior in accounts, and analyze credit risk.

Healthcare Diagnosis and Treatment


Machine learning algorithms can analyze medical data, such as X-rays, MRI scans, and genomic data, to
assist with the diagnosis of diseases. These algorithms can also be used to identify the most effective treat­
ment for a patient based on their medical history and genetic makeup. For example, IBM's Watson for On­
cology uses machine learning to analyze medical records and recommend personalized cancer treatments.

Natural Language Processing in Virtual Assistants and Chatbots


Natural language processing (NLP) is an area of machine learning that focuses on understanding and
generating human language. NLP is used in virtual assistants and chatbots, such as Siri, Alexa, and Google
Assistant, to provide personalized and conversational experiences. Machine learning algorithms can ana­
lyze language patterns and respond to user queries in a natural and accurate way.
Autonomous Vehicles
Autonomous vehicles use machine learning algorithms to navigate and make decisions on the road. These
algorithms can analyze data from sensors and cameras to identify obstacles and make decisions about how
to respond. Autonomous vehicles are expected to revolutionize transportation by reducing accidents and
increasing efficiency. Companies such as Tesla, Waymo, and Uber are using machine learning to develop
self-driving cars.

Personalized Recommendations in e-commerce and Entertainment


E-commerce platforms, such as Amazon and Netflix, use machine learning algorithms to provide person­
alized recommendations to users based on their browsing and viewing history. These recommendations
can improve customer satisfaction and increase sales. Machine learning algorithms can analyze large
amounts of data to identify patterns and predict user preferences, enabling e-commerce platforms and en­
tertainment providers to offer a more personalized experience to their users.

In addition to these examples, machine learning is being used in many other applications, such as energy
management, social media analysis, and predictive maintenance. Machine learning is a powerful tool that
has the potential to revolutionize many industries and improve the lives of people around the world.
Data Structure
Data structure plays a critical role in machine learning as it facilitates the organization, manipulation, and
analysis of data. Data is the foundation of machine learning models, and the data structure used can sig­
nificantly impact the model's performance and accuracy.

Here we will discuss some commonly used data structures and how they are used in Machine Learning.

Commonly Used Data Structure for Machine Learning


Data structure is an essential component of machine learning, and the right data structure can help in
achieving faster processing, easier access to data, and more efficient storage. Here are some commonly
used data structures for machine learning -

Arrays
Arrays are a collection of similar data types that can be accessed using an index. They are commonly used
in machine learning for storing data in the form of tables, such as CSV files. Arrays are easy to use and offer
fast indexing, but their size is fixed, which can be a limitation when working with large datasets.

Lists
Lists are collections of heterogeneous data types that can be accessed using an iterator. They are com­
monly used in machine learning for storing complex data structures, such as nested lists, dictionaries, and
tuples. Lists offer flexibility and can handle varying data sizes, but they are slower than arrays due to the
need for iteration.

Dictionaries
Dictionaries are a collection of key-value pairs that can be accessed using the keys. They are commonly
used in machine learning for storing metadata or labels associated with data. Dictionaries offer fast access
to data and are useful for creating lookup tables, but they can be memory-intensive when dealing with
large datasets.

Linked Lists
Linked lists are collections of nodes, each containing a data element and a reference to the next node in the
list. They are commonly used in machine learning for storing and manipulating sequential data, such as
time-series data. Linked lists offer efficient insertion and deletion operations, but they are slower than ar­
rays and lists when it comes to accessing data.

Trees
Trees are hierarchical data structures that are commonly used in machine learning for decision-making
algorithms, such as decision trees and random forests. Trees offer efficient searching and sorting algo­
rithms, but they can be complex to implement and can suffer from overfitting.
Graphs
Graphs are collections of nodes and edges that are commonly used in machine learning for representing
complex relationships between data points. Graphs offer powerful algorithms for clustering, classifica­
tion, and prediction, but they can be complex to implement and can suffer from scalability issues.

In addition to the above-mentioned data structures, many machine learning libraries and frameworks
provide specialized data structures for specific use cases, such as matrices and tensors for deep learning. It
is important to choose the right data structure for the task at hand, considering factors such as data size,
processing speed, and memory usage.

How Data Structure is Used in Machine Learning


Below are some ways data structures are used in machine learning -

Storing and Accessing Data


Machine learning algorithms require large amounts of data for training and testing. Data structures such
as arrays, lists, and dictionaries are used to store and access data efficiently. For example, an array can be
used to store a set of numerical values, while a dictionary can be used to store metadata or labels associ­
ated with data.

Pre-processing Data
Before training a machine learning model, it is necessary to pre-process the data to clean, transform, and
normalize it. Data structures such as lists and arrays can be used to store and manipulate the data during
pre-processing. For example, a list can be used to filter out missing values, while an array can be used to
normalize the data.

Creating Feature Vectors


Feature vectors are a critical component of machine learning models as they represent the features that are
used to make predictions. Data structures such as arrays and matrices are commonly used to create feature
vectors. For example, an array can be used to store the pixel values of an image, while a matrix can be used
to store the frequency distribution of words in a text document.

Building Decision Trees


Decision trees are a common machine learning algorithm that uses a tree data structure to make decisions
based on a set of input features. Decision trees are useful for classification and regression problems. They
are created by recursively splitting the data based on the most informative features. The tree data structure
makes it easy to traverse the decision-making process and make predictions.

Building Graphs
Graphs are used in machine learning to represent complex relationships between data points. Data struc­
tures such as adjacency matrices and linked lists are used to create and manipulate graphs. Graphs are
used for clustering, classification, and prediction tasks.
Mathematics

Machine learning is an interdisciplinary field that involves computer science, statistics, and mathematics.
In particular, mathematics plays a critical role in developing and understanding machine learning algo­
rithms. In this article, we will discuss the mathematical concepts that are essential for machine learning,
including linear algebra, calculus, probability, and statistics.

Linear Algebra
Linear algebra is the branch of mathematics that deals with linear equations and their representation in
vector spaces. In machine learning, linear algebra is used to represent and manipulate data. In particular,
vectors and matrices are used to represent and manipulate data points, features, and weights in machine
learning models.

A vector is an ordered list of numbers, while a matrix is a rectangular array of numbers. For example, a
vector can represent a single data point, and a matrix can represent a dataset. Linear algebra operations,
such as matrix multiplication and inversion, can be used to transform and analyze data.

Calculus
Calculus is the branch of mathematics that deals with rates of change and accumulation. In machine
learning, calculus is used to optimize models by finding the minimum or maximum of a function. In par­
ticular, gradient descent, a widely used optimization algorithm, is based on calculus.

Gradient descent is an iterative optimization algorithm that updates the weights of a model based on the
gradient of the loss function. The gradient is the vector of partial derivatives of the loss function with
respect to each weight. By iteratively updating the weights in the direction of the negative gradient, gradi­
ent descent tries to minimize the loss function.

Probability
Probability is the branch of mathematics that deals with uncertainty and randomness. In machine learn­
ing, probability is used to model and analyze data that are uncertain or variable. In particular, probability
distributions, such as Gaussian and Poisson distributions, are used to model the probability of data points
or events.

Bayesian inference, a probabilistic modeling technique, is also widely used in machine learning. Bayesian
inference is based on Bayes' theorem, which states that the probability of a hypothesis given the data is
proportional to the probability of the data given the hypothesis multiplied by the prior probability of the
hypothesis. By updating the prior probability based on the observed data, Bayesian inference can make
probabilistic predictions or classifications.

Statistics
Statistics is the branch of mathematics that deals with the collection, analysis, interpretation, and pre­
sentation of data. In machine learning, statistics is used to evaluate and compare models, estimate model
parameters, and test hypotheses.

For example, cross-validation is a statistical technique that is used to evaluate the performance of a model
on new, unseen data. In cross-validation, the dataset is split into multiple subsets, and the model is trained
and evaluated on each subset. This allows us to estimate the performance of the model on new data and
compare different models.

Artificial Intelligence

Artificial intelligence and machine learning are two buzzwords that are commonly used in the world
of technology. Although they are often used interchangeably, they are not the same thing. Artificial
intelligence (Al) and machine learning (ML) are related concepts, but they have different definitions, ap­
plications, and implications. In this article, we will explore the differences between machine learning and
artificial intelligence and how they are related.

What is Machine Learning?


Machine learning is a subset of artificial intelligence that focuses on teaching machines how to learn from
data. In other words, machine learning is a process by which computers can automatically learn patterns
and relationships in data without being explicitly programmed to do so. Machine learning algorithms are
designed to detect and learn from patterns in data to make predictions or decisions.
There are three main types of machine learning: supervised learning, unsupervised learning, and rein­
forcement learning. Supervised learning is when the machine is trained on labeled data with known
outcomes. Unsupervised learning is when the machine is trained on unlabeled data and is asked to find
patterns or similarities. Reinforcement learning is when the machine learns by trial and error through in­
teractions with the environment.

Examples of machine learning include image recognition, speech recognition, recommendation systems,
fraud detection, and natural language processing.

What is Artificial Intelligence?


Artificial intelligence, on the other hand, is a broad field that encompasses the development of intelligent
machines that can perform tasks that typically require human intelligence, such as perception, reasoning,
learning, and decision-making. In simple terms, Al is the ability of machines to perform tasks that nor­
mally require human intervention or intelligence.

There are two types of Al: narrow or weak Al and general or strong Al. Narrow Al is designed to perform
specific tasks, such as speech recognition or image recognition, while general Al is designed to be able to
perform any intellectual task that a human can do. Currently, we only have narrow Al in use, but the goal
is to develop general Al that can be applied to a wide range of tasks.

Al is like a basket containing several branches, the important ones being Machine Learning (ML), Robotics,
Expert Systems, Fuzzy Logic, Neural Networks, Computer Vision, and Natural Language Processing (NLP).
NLP

Machine
Learning

Computer
Systems Vision
While we highlight the features of ML in the next section, here is a brief overview of the other important
branches of Al:
1. Robotics - Robots are primarily designed to perform repetitive and tedious tasks. Robotics is
an important branch of Al that deals with designing, developing and controlling the applica­
tion of robots.
2. Computer Vision - It is an exciting field of Al that helps computers, robots, and other digital
devices to process and understand digital images and videos, and extract vital information.
With the power of Al, Computer Vision develops algorithms that can extract, analyze and
comprehend useful information from digital images.
3. Expert Systems - Expert systems are applications specifically designed to solve complex
problems in a specific domain, with humanlike intelligence, precision, and expertise. Just like
human experts, Expert Systems excel in a specific domain in which they are trained.
4. Fuzzy Logic - We know computers take precise digital inputs like True (Yes) or False (No), but
Fuzzy Logic is a method of reasoning that helps machines to reason like human beings before
taking a decision. With Fuzzy Logic, machines can analyze all intermediate possibilities be­
tween a YES or NO, for example, "Possibly Yes", "Maybe No", etc.
5. Neural Networks - Inspired by the natural neural networks of the human brain, Artificial
Neural Networks (ANN) can be considered as a group of highly interconnected group of pro­
cessing elements (nodes) that can process information by their dynamic state response to ex­
ternal inputs. ANNs use training data to improve their efficiency and accuracy.
6. Natural Language Processing (NLP) - NLP is a field of Al that empowers intelligent systems
to communicate with humans using a natural language like English. With the power of NLP,
one can easily interact with a robot and instruct it in plain English to perform a task. NLP can
also process text data and comprehend its full meaning. It is heavily used these days in virtual
chatbots and sentiment analysis.

Examples of Al include virtual assistants, autonomous vehicles, facial recognition, natural language pro­
cessing, and decision-making systems.

Machine Learning vs. Artificial Intelligence


Now that we have a basic understanding of what machine learning and artificial intelligence are, let's dive
deeper into the differences between the two.

Firstly, machine learning is a subset of artificial intelligence, meaning that machine learning is a part of
the larger field of Al. Machine learning is a technique used to implement artificial intelligence.

Secondly, while machine learning focuses on developing algorithms that can learn from data, artificial in­
telligence focuses on developing intelligent machines that can perform tasks that normally require human
intelligence. In other words, machine learning is more focused on the process of learning from data, while
Al is more focused on the end goal of creating machines that can perform intelligent tasks.

Thirdly, machine learning algorithms are designed to learn from data and improve their accuracy over
time, while artificial intelligence systems are designed to learn and adapt to new situations and environ­
ments. Machine learning algorithms require a lot of data to be trained effectively, while Al systems can
adapt and learn from new data in real-time.
Finally, machine learning is more limited in its capabilities compared to AL Machine learning algorithms
can only learn from the data they are trained on, while Al systems can learn and adapt to new situations
and environments. Machine learning is great for solving specific problems that can be solved through
pattern recognition, while Al is better suited for complex, real-world problems that require reasoning and
decision-making.

The following table highlights the important differences between Machine Learning and Artificial Intelli­
gence -

Key Artificial Intelligence Machine Learning

Definition Al refers to the ability of a machine or a ML is a type of Al that allows a


computer system to perform tasks that would system to learn and improve from
normally require human intelligence, such as experience without being explic­
understanding language, recognizing images, itly programmed.
and making decisions. It articulates how a machine can
learn and apply its knowledge to
improve its decisions.

Concept Al revolves around making smart and intelligent ML revolves around making a
devices. machine learn/decide and improve
its results.
Goal The goal of Al is to simulate human intelligence The goal of ML is to learn from data
to solve complex problems. provided and make improvements
in machine's performance.

Includes Al has several important branches including ML training methods include


Artificial Neural Networks, Natural Language supervised learning, unsuper-
Processing, Fuzzy Logic, Robotics, Expert Sys- vised learning, and reinforcement
terns, Computer Vision, and Machine Learning. learning.

Development Al is leading to the development of such ma­ ML is helping in the development


chines which can mimic human behavior. of self-learning algorithms.

Neural Networks

Machine learning and neural networks are two important technologies in the field of artificial intelligence
(Al). While they are often used together, they are not the same thing. In this article, we will explore the
differences between machine learning and neural networks and how they are related.

We understood about machine learning in last section so let's see what neural networks are.
What are Neural Networks?
Neural networks are a type of machine learning algorithm that is inspired by the structure of the human
brain. They are designed to simulate the way the brain works by using layers of interconnected nodes, or
artificial neurons. Each neuron takes in input from the neurons in the previous layer and uses that input to
produce an output. This process is repeated for each layer until a final output is produced.

Neural networks can be used for a wide range of tasks, including image recognition, speech recognition,
natural language processing, and prediction. They are particularly well-suited to tasks that involve pro­
cessing complex data or recognizing patterns in data.

Machine Learning vs. Neural Networks


Now that we have a basic understanding of what machine learning and neural networks are, let's dive
deeper into the differences between the two.

1. Firstly, machine learning is a broad category that encompasses many different types of algo­
rithms, including neural networks. Neural networks are a specific type of machine learning
algorithm that is designed to simulate the way the brain works.
2. Secondly, while machine learning algorithms can be used for a wide range of tasks, neural
networks are particularly well-suited to tasks that involve processing complex data or recog­
nizing patterns in data. Neural networks can recognize complex patterns and relationships in
data that other machine learning algorithms may not be able to detect.
3. Thirdly, neural networks require a lot of data and processing power to train. Neural networks
typically require large datasets and powerful hardware, such as graphics processing units
(GPUs), to train effectively. Machine learning algorithms, on the other hand, can be trained on
smaller datasets and less powerful hardware.
4. Finally, neural networks can provide highly accurate predictions and decisions, but they can
be more difficult to understand and interpret than other machine learning algorithms. The
way that neural networks make decisions is not always transparent, which can make it diffi­
cult to understand how they arrived at their conclusions.

Deep Learning

In the world of artificial intelligence, two terms that are often used interchangeably are machine learning
and deep learning. While both of these technologies are used to create intelligent systems, they are not the
same thing. In this article, we will explore the differences between machine learning and deep learning
and how they are related.

We understood about machine learning in last section so let's see what deep learning is.

What is Deep Learning?


Deep learning is a type of machine learning that uses neural networks to process complex data. In other
words, deep learning is a process by which computers can automatically learn patterns and relationships
in data using multiple layers of interconnected nodes, or artificial neurons. Deep learning algorithms are
designed to detect and learn from patterns in data to make predictions or decisions.

Deep learning is particularly well-suited to tasks that involve processing complex data, such as image and
speech recognition, natural language processing, and self-driving cars. Deep learning algorithms are able
to process vast amounts of data and can learn to recognize complex patterns and relationships in that data.

Examples of deep learning include facial recognition, voice recognition, and self-driving cars.

Machine Learning vs. Deep Learning


Now that we have a basic understanding of what machine learning and deep learning are, let's dive deeper
into the differences between the two.

• Firstly, machine learning is a broad category that encompasses many different types of
algorithms, including deep learning. Deep learning is a specific type of machine learning algo­
rithm that uses neural networks to process complex data.
• Secondly, while machine learning algorithms are designed to learn from data and improve
their accuracy over time, deep learning algorithms are designed to process complex data and
recognize patterns and relationships in that data. Deep learning algorithms are able to recog­
nize complex patterns and relationships that other machine learning algorithms may not be
able to detect.
• Thirdly, deep learning algorithms require a lot of data and processing power to train. Deep
learning algorithms typically require large datasets and powerful hardware, such as graphics
processing units (GPUs), to train effectively. Machine learning algorithms, on the other hand,
can be trained on smaller datasets and less powerful hardware.
• Finally, deep learning algorithms can provide highly accurate predictions and decisions, but
they can be more difficult to understand and interpret than other machine learning algo­
rithms. Deep learning algorithms can process vast amounts of data and recognize complex
patterns and relationships in that data, but it can be difficult to understand how the algo­
rithm arrived at its conclusion.

Getting Datasets
Machine learning models are only as good as the data they are trained on. Therefore, obtaining good qual­
ity and relevant datasets is a critical step in the machine learning process. Let's see some different sources
of datasets for machine learning and how to obtain them.

Public Datasets
There are many publicly available datasets that you can use for machine learning. Some of the popular
sources of public datasets include Kaggle, UCI Machine Learning Repository, Google Dataset Search, and
AWS Public Datasets. These datasets are often used for research and are open to the public.
Data Scraping
Data scraping involves automatically extracting data from websites or other sources. It can be a useful way
to obtain data that is not available as a pre-packaged dataset. However, it is important to ensure that the
data is being scraped ethically and legally, and that the source is reliable and accurate.

Data Purchase
In some cases, it may be necessary to purchase a dataset for machine learning. Many companies sell pre­
packaged datasets that are tailored to specific industries or use cases. Before purchasing a dataset, it is im­
portant to evaluate its quality and relevance to your machine learning project.

Data Collection
Data collection involves manually collecting data from various sources. This can be time-consuming and
requires careful planning to ensure that the data is accurate and relevant to your machine learning project.
It may involve surveys, interviews, or other forms of data collection.

Strategies for Acquiring High Quality Datasets


Once you have identified the source of your dataset, it is important to ensure that the data is of good
quality and relevant to your machine learning project. Below are some Strategies for obtaining good qual-
ity datasets -

Identify the Problem You Want to Solve


Before obtaining a dataset, it is important to identify the problem you want to solve with machine learn­
ing. This will help you determine the type of data you need and where to obtain it.

Determine the Size of the Dataset


The size of the dataset depends on the complexity of the problem you are trying to solve. Generally, the
more data you have, the better your machine learning model will perform. However, it is important to en­
sure that the dataset is not too large and contains irrelevant or duplicate data.

Ensure the Data is Relevant and Accurate


It is important to ensure that the data is relevant and accurate to the problem you are trying to solve.
Ensure that the data is from a reliable source and that it has been verified.

Preprocess the Data


Preprocessing the data involves cleaning, normalizing, and transforming the data to prepare it for machine
learning. This step is critical to ensure that the machine learning model can understand and use the data
effectively.
Categorical Data

What is Categorical Data?


Categorical data in Machine Learning refers to data that consists of categories or labels, rather than numer­
ical values. These categories may be nominal, meaning that there is no inherent order or ranking between
them (e.g., color, gender), or ordinal, meaning that there is a natural ordering between the categories (e.g.,
education level, income bracket).

Categorical data is often represented using discrete values, such as integers or strings, and is frequently
encoded as one-hot vectors before being used as input to machine learning models. One-hot encoding in­
volves creating a binary vector for each category, where the vector has a 1 in the position corresponding to
the category and Os in all other positions.

Techniques for Handling Categorical Data


Handling categorical data is an important part of machine learning preprocessing, as many algorithms
require numerical input. Depending on the algorithm and the nature of the categorical data, different en­
coding techniques may be used, such as label encoding, ordinal encoding, or binary encoding etc.

In the subsequent sections of this chapter, we will discuss the different techniques for handling categorical
data in machine learning along with their implementations in Python.
One-Hot Encoding
One-hot encoding is a popular technique for handling categorical data in machine learning. It involves
creating a binary vector for each category, where each element of the vector represents the presence or
absence of the category. For example, if we have a categorical variable for color with values red, blue, and
green, one-hot encoding would create three binary vectors: [ 1,0,0], [0,1,0], and [0, 0,1 ] respectively.

Example
Below is an example of how to perform one-hot encoding in Python using the Pandas library -

import pandas as pd

data = {'color': ['red', 'green', 'blue', 'red','greenll^^^M

jone_hot_encoded = pd.get_dummies(df['color'], prefix='color')

|df = [Link]([df, one_hot_encoded], axis=l)^^^M

# Drop the original categorical variable


df = df.dropfcolor1, axis=l)

Output
This will create a one-hot encoded dataframe with three binary variables ("color_blue," "color_green," and
"colorjred") that take the value 1 if the corresponding color is present and 0 if it is not. This encoded data,
output given below, can then be used for machine learning tasks such as classification and regression.

color_blue color_green color_red


0 0 0 1
10 1 0
2 1 0 0
3 0 0 1
4 0 1 0

One-Hot Encoding technique works well for small and finite categorical variables but can be problematic
for large categorical variables as it can lead to a high number of input features.

Label Encoding
Label Encoding is another technique for handling categorical data in machine learning. It involves assign­
ing a unique numerical value to each category in a categorical variable, with the order of the values based
on the order of the categories.

For example, suppose we have a categorical variable "Size" with three categories: "small," "medium," and
"large." Using label encoding, we would assign the values 0,1, and 2 to these categories, respectively.

Example
Below is an example of how to perform label encoding in Python using the scikit-learn library -

from [Link] import LabelEncode:

[data = ['small', 'medium', 'large', 'small'/largel^^M

label [Link] 1E a c

encoded_data = label_encoder.fit_transform(data)|

irmt the encoded data


rint(encoded_data)
This will create an encoded array with the values [0,1, 2,0, 2], which correspond to the encoded categories
"small," "medium," and "large." Note that the encoding is based on the alphabetical order of the categories
by default, but you can change the order by passing a custom list to the LabelEncoder object.

Output
[2 1 020]

Label encoding can be useful when there is a natural ordering between the categories, such as in the case
of ordinal categorical variables. However, it should be used with caution for nominal categorical variables
because the numerical values may imply an order that does not actually exist. In these cases, one-hot en­
coding is a safer option.

Frequency Encoding
Frequency Encoding is another technique for handling categorical data in machine learning. It involves
replacing each category in a categorical variable with its frequency (or count) in the dataset. The idea
behind frequency encoding is that categories that appear more frequently may be more important or in­
formative for the machine learning algorithm.

Example
Below is an example of how to perform frequency encoding in Python -

import pandas as pd
# create a sample dataset with a categorical variable
data = ('color': ['red', 'green', 'blue', 'red', 'green']}

freq dfl'color'[Link]-CountsCnonnalize^TruO^^^^^^^^^B

# replace each category with its frequency


df['color_freq'] = df['color'].map(freq)

df [Link]( - ,axis

This will create an encoded dataframe with one variable ("color_freq") that represents the frequency of
each category in the original categorical variable. For example, if the original variable had two occur­
rences of "red" and three occurrences of "green," then the corresponding frequencies would be 0.4 and 0.6,
respectively.

Output
color_freq
0 0.4
1 0.4
2 0.2
3 0.4
4 0.4

Frequency encoding can be a useful alternative to one-hot encoding or label encoding, especially when
dealing with high-cardinality categorical variables (i.e., variables with a large number of categories). How­
ever, it may not always be effective, and its performance can depend on the particular dataset and machine
learning algorithm being used.

Target Encoding
Target Encoding is another technique for handling categorical data in machine learning. It involves replac­
ing each category in a categorical variable with the mean (or other aggregation) of the target variable (i.e.,
the variable you want to predict) for that category. The idea behind target encoding is that it can capture
the relationship between the categorical variable and the target variable, and therefore improve the predic­
tive performance of the machine learning model.

Example
Below is an example of how to perform target encoding in Python with the Scikit-learn library by using a
combination of a label encoder and a mean encoder -

import pandas as pd
from [Link] import LabelEncoder

data = {'color': ['red', 'green', 'blue', 'red','greenl^^^^^^^^^^^^^M

d ."J. ''

label_encoder = LabelEncoderQ^^^^^^^^^M
d-bcl a;[Link]

df['color_encoded'] - label_encoder.transform(df[bolorJ)M

mean_encoder = [Link]('color_encoded')['target'].mean().to_dict()

df['color_encoded'] - df['color_encoded'].map(mean_encoder)

rmt the encoded data


rint(df)
In this example, we first create a Pandas DataFrame df with a categorical variable 'color' and a target vari­
able 'target'. We then create a LabelEncoder object from scikit-learn and fit it to the 'color' column of df.

Next, we transform the categorical variable 'color' using the label encoder by calling the transform method
on the label encoder object and assigning the resulting encoded values to a new column 'color_encoded' in
df.

Finally, we create a mean encoder object by grouping df by the 'color_encoded' column and calculating the
mean of the 'target' column for each group. We then convert this mean encoder object to a dictionary and
map the mean encoded values to the original 'color' column of df.

Output
color target color_encoded
0 red 1 0.5
1 green 0 0.5
2 blue 1 1.0
3 red 0 0.5
4 green 1 0.5

Target encoding can be a powerful technique for improving the predictive performance of machine learn­
ing models, especially for datasets with high-cardinality categorical variables. However, it is important to
avoid overfitting by using cross-validation and regularization techniques.
Binary Encoding
Binary encoding is another technique used for encoding categorical variables in machine learning. In bi­
nary encoding, each category is assigned a binary code, where each digit represents whether the category
is present (1) or not (0). The binary codes are typically based on the position of the category in a sorted list
of all categories.

Example
Here's an example Python implementation of binary encoding using the [Link] library -

f create a sample dataset with a categorical variable


data = {'color': ['red',' PH
a:een', 'blue', 'red', 'green']}
df = [Link](data)

iinary_encoder = [Link](cols [color ])J


linory oo/cCoj or

[Link] = binary_encoder.transform(df['color'])^^^B
tr merge the encoded variable with the original datatrame
df = [Link]([df, encoded_data], axis=l)

In this example, we first create a Pandas DataFrame df with a categorical variable 'color'. We then create a
BinaryEncoder object from the category_encoders library and fit it to the 'color' column of df.

Next, we transform the categorical variable 'color' using the binary encoder by calling the transform
method on the binary encoder object and assigning the resulting encoded values to a new DataFrame
encoded_data.

Finally, we merge the encoded variable with the original DataFrame df using the concat method along the
column axis (axis=1). The resulting DataFrame should have the original 'color' column along with the en­
coded binary columns.

Output
When you run the code, it will produce the following output -

color color_0 color_l


0 red 0 1
1 green 1 0
2 blue 1 1
3 red 0 1
4 green 1 0

The binary encoding works best for categorical variables with a moderate number of categories, as it can
quickly become inefficient for variables with a large number of categories.

Data Loading

Suppose if you want to start a ML project then what is the first and most important thing you would re­
quire? It is the data that we need to load for starting any of the ML project.

In machine learning, data loading refers to the process of importing or reading data from external sources
and converting it into a format that can be used by the machine learning algorithm. The data is then
preprocessed to remove any inconsistencies, missing values, or outliers. Once the data is preprocessed, it is
split into training and testing sets, which are then used for model training and evaluation.

The data can come from various sources such as CSV files, databases, web APIs, cloud storage, etc. The most
common file formats for machine learning projects is CSV (Comma Separated Values).

Consideration While Loading CSV data


CSV is a plain text format that stores tabular data, where each row represents a record, and each column
represents a field or attribute. It is widely used because it is simple, lightweight, and can be easily read and
processed by programming languages such as Python, R, and Java.

In Python, we can load CSV data into ML projects with different ways but before loading CSV data we must
have to take care about some considerations.

In this chapter, let's understand the main parts of a CSV file, how they might affect the loading and analysis
of data, and some consideration we should take care before loading CSV data into ML projects.

File Header
This is the first row of the CSV file, and it typically contains the names of the columns in the table. When
loading CSV data into an ML project, the file header (also known as column headers or variable names) can
play an important role in data analysis and model training. Here are some considerations to keep in mind
regarding the file header -

1. Consistency - The header row should be consistent across the entire CSV file. This means that
the number of columns and their names should be the same for each row. Inconsistencies can
cause issues with parsing and analysis.
2. Meaningful names - Column names should be meaningful and descriptive. This can help
with understanding the data and building more accurate models. Avoid using generic names
like "column 1", "column2", etc.
3. Case sensitivity - Depending on the tool or library being used to load the CSV file, the column
names may be case sensitive. It's important to ensure that the case of the header row matches
the expected case sensitivity of the tool or library being used.
4. Special characters - Column names should not contain any special characters, such as spaces,
commas, or quotation marks. These characters can cause issues with parsing and analysis. In­
stead, use underscores or camelCase to separate words.
5. Missing header - If the CSV file does not have a header row, it's important to specify the
column names manually or provide a separate file or documentation that includes the col­
umn names.
6. Encoding - The encoding of the header row can affect its interpretation when loading the CSV
file. It's important to ensure that the encoding of the header row is compatible with the tool or
library being used to read the file.

Comments
These are optional lines that begin with a specified character, such as "#" or and are ignored by most
programs that read CSV files. They can be used to provide additional information or context about the data
in the file.

Comments in a CSV file are not typically used to represent data that would be used in a machine learning
project. However, if comments are present in a CSV file, it's important to consider how they might affect
the loading and analysis of the data. Here are some considerations -
1. Comment markers - In a CSV file, comments can be indicated using a specific marker, such as
"#" orIt's important to know what marker is being used, so that the loading process can
ignore comments properly.
2. Placement - Comments should be placed in a separate line from the actual data. If a comment
is included in a line with actual data, it may cause issues with parsing and analysis.
3. Consistency - If comments are used in a CSV file, it's important to ensure that the comment
marker is used consistently throughout the entire file. Inconsistencies can cause issues with
parsing and analysis.
4. Handling comments - Depending on the tool or library being used to load the CSV file, com­
ments may be ignored by default or may require a specific parameter to be set. It's important
to understand how comments are handled by the tool or library being used.
5. Effect on analysis - If comments contain important information about the data, it may be
necessary to process them separately from the data itself. This can add complexity to the load­
ing and analysis process.

Delimiter
This is the character that separates the fields in each row. While the name suggests that a comma is used
as the delimiter, other characters such as tabs, semicolons, or pipes can also be used depending on the file.

The delimiter used in a CSV file can significantly affect the accuracy and performance of a machine learn­
ing model, so it is important to consider the following while loading data into an ML project -
1. Delimiter choice - The delimiter used in a CSV file should be carefully chosen based on the
data being used. For example, if the data contains commas within the values (e.g. "New York,
NY"), then using a comma as a delimiter may cause issues.
In this case, a different delimiter, such as a tab or semicolon, may be more appropriate.
2. Consistency - The delimiter used in the CSV file should be consistent throughout the entire
file. Mixing different delimiters or using whitespace inconsistently can lead to errors and
make it difficult to parse the data accurately.
3. Encoding - The delimiter can also be affected by the encoding of the CSV file. For example, if
the CSV file uses a non-ASCII delimiter and is encoded in UTF-8, it may not be correctly read by
some machine learning libraries or tools. It is important to ensure that the encoding and de­
limiter are compatible with the machine learning tools being used.
4. Other considerations - In some cases, the delimiter may need to be customized based on the
machine learning tool being used. For example, some libraries may require a specific delimiter
or may not support certain delimiters. It is important to check the documentation of the ma­
chine learning tool being used and customize the delimiter as needed.

Quotes
These are optional characters that can be used to enclose fields that contain the delimiter character or
newlines. For example, if a field contains a comma, enclosing the field in quotes ensures that the comma is
treated as part of the field and not as a delimiter. When loading CSV data into an ML project, there are sev­
eral considerations to keep in mind regarding the use of quotes -
1. Quote character - The quote character used in a CSV file should be consistent throughout the
file. The most commonly used quote character is the double quote (") but some files may use
single quotes or other characters. It's important to make sure that the quote character used is
consistent with the tool or library being used to read the CSV file.
2. Quoted values - In some cases, values in a CSV file may be enclosed in quotes to differentiate
them from other values. For example, if a field contains a comma, it maybe enclosed in quotes
to prevent it from being interpreted as a new field. It's important to make sure that quoted val­
ues are properly handled when loading the data into an ML project.
3. Escaping quotes - If a field contains the quote character used to enclose values, it must be
escaped. This is typically done by doubling the quote character. For example, if the quote char­
acter is double quote (") and a field contains the value "John "the Hammer" Smith", it would be
enclosed in quotes and the internal quotes would be escaped like this: "John ""the Hammer""
Smith".
4. Use of quotes - The use of quotes in CSV files can vary depending on the tool or library being
used to generate the file. Some tools may use quotes around every field, while others may only
use quotes around fields that contain special characters. It's important to make sure that the
quote usage is consistent with the tool or library being used to read the file.
5. Encoding - The use of quotes can also be affected by the encoding of the CSV file. If the file is
encoded in a non-standard way, it may cause issues when loading the data into an ML project.
It's important to make sure that the encoding of the CSV file is compatible with the tool or li­
brary being used to read the file.
Various Methods of Loading a CSV Data File
While working with ML projects, the most crucial task is to load the data properly into it. As told earlier,
the most common data format for ML projects is CSV and it comes in various flavors and varying difficul­
ties to parse.

In this section, we are going to discuss some common approaches in Python to load CSV data file into
machine learning project -

Using the CSV Module


This is a built-in module in Python that provides functionality for reading and writing CSV files. You can
use it to read a CSV file into a list or dictionary object. Below is its implementation example in Python -

import
[with [Link]', 'r') as file:

| for row inreaden^^^^^^^B

This code reads a CSV file called [Link] and prints each row in the file.

Using the Pandas Library


This is a popular data manipulation library in Python that provides a read_csv() function for reading CSV
files into a pandas DataFrame object. This is a very convenient way to load data and perform various data
manipulation tasks. Below is its implementation example in Python -

data = pd.read_csv('[Link]'

This code reads a CSV file called [Link] and loads it into a pandas DataFrame object called data.

Using the Numpy Library


This is a numerical computing library in Python that provides a genfromtxt() function for loading CSV
files into a numpy array. Below is its implementation example in Python -

data = [Link]('[Link]', delimiter^,')

This code reads a CSV file called [Link] and loads it into a numpy array called 'data'.

Using the Scipy Library


This is a scientific computing library in Python that provides a loadtxt() function for loading text files,
including CSV files, into a numpy array. Below is its implementation example in Python -
from scipy import
[data = loadtxt('[Link]', delimiter =~

This code reads a CSV file called [Link] and loads it into a numpy array called 'data'.

Using the Sklearn Library


This is a popular machine learning library in Python that provides a load_iris() function for loading the iris
dataset, which is a commonly used dataset for classification tasks. Below is its implementation example in
Python -

from [Link] import loadjris

da:?.: baa

This code loads the iris dataset, which is included in the sklearn library, and loads it into a numpy array
called data.
Data Understanding

While working with machine learning projects, usually we ignore two most important parts called mathe­
matics and data. What makes data understanding a critical step in ML is its data driven approach. Our ML
model will produce only as good or as bad results as the data we provided to it.

Data understanding basically involves analyzing and exploring the data to identify any patterns or trends
that may be present.

The data understanding phase typically involves the following steps -

1. Data Collection - This involves gathering the relevant data that you will be using for your
analysis. The data can be collected from various sources such as databases, websites, and APIs.
2. Data Cleaning - This involves cleaning the data by removing any irrelevant or duplicate data,
and dealing with missing data values. The data should be formatted in a way that makes it
easy to analyze.
3. Data Exploration - This involves exploring the data to identify any patterns or trends that
may be present. This can be done using various statistical techniques such as histograms,
scatter plots, and correlation analysis.
4. Data Visualization - This involves creating visual representations of the data to help you un­
derstand it better. This can be done using tools such as graphs, charts, and maps.
5. Data Preprocessing - This involves transforming the data to make it suitable for use in ma­
chine learning algorithms. This can include scaling the data, transforming it into a different
format, or reducing its dimensionality.

Understand the Data before Uploading It in ML Projects


Understanding our data before uploading it into our ML project is important for several reasons -

Identify Data Quality Issues


By understanding your data, you can identify data quality issues such as missing values, outliers, incorrect
data types, and inconsistencies that can affect the performance of your ML model. By addressing these is­
sues, you can improve the quality and accuracy of your model.

Determine Data Relevance


You can determine if the data you have collected is relevant to the problem you are trying to solve. By
understanding your data, you can determine which features are important for your model and which ones
can be ignored.

Select Appropriate ML Techniques


Depending on the characteristics of your data, you may need to choose a particular ML technique or algo­
rithm. For example, if your data is categorical, you may need to use classification techniques, while if your
data is continuous, you may need to use regression techniques. Understanding your data can help you se­
lect the appropriate ML technique for your problem.

Improve Model Performance


By understanding your data, you can engineer new features, preprocess your data, and select the appropri­
ate ML technique to improve the performance of your model. This can result in better accuracy, precision,
recall, and Fl score.

Data Understanding with Statistics


In the previous chapter, we discussed how we can upload CSV data into our ML project, but it would be
good to understand the data before uploading it. We can understand the data by two ways, with statistics
and with visualization.

In this chapter, with the help of following Python recipes, we are going to understand ML data with
statistics.

Looking at Raw Data


The very first recipe is for looking at your raw data. It is important to look at raw data because the insight
we will get after looking at raw data will boost our chances to better pre-processing as well as handling of
data for ML projects.
Following is a Python script implemented by using head() function of Pandas DataFrame on Pima Indians
diabetes dataset to look at the first 10 rows to get better understanding of it -

Example
from pandas import
path = r"C:\[Link]"^^^^^^^^^^^^^^^^^|
headernames = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']

Output
preg plas pres skin test mass pedi age class
0 6 148 72 35 0 33.6 0.627 50 1
1 1 85 66 29 0 26.6 0.351 31 0
2 8 183 64 0 0 23.3 0.672 32 1
3 1 89 66 23 94 28.1 0.167 21 0
4 0 137 40 35 168 43.1 2.288 33 1
5 5 116 74 0 0 25.6 0.201 30 0
6 3 78 50 32 88 31.0 0.248 26 1
7 10 115 0 0 0 35.3 0.134 29 0
8 2 197 70 45 543 30.5 0.158 53 1
9 8 125 96 0 0 0.0 0.232 54 1
10 4 110 92 0 0 37.6 0.191 30 0

We can observe from the above output that first column gives the row number which can be very useful for
referencing a specific observation.

Checking Dimensions of Data


It is always a good practice to know how much data, in terms of rows and columns, we are having for our
ML project. The reasons behind are -

1. Suppose if we have too many rows and columns then it would take long time to run the algo­
rithm and train the model.
2. Suppose if we have too less rows and columns then it we would not have enough data to well
train the model.

Following is a Python script implemented by printing the shape property on Pandas Data Frame. We are
going to implement it on iris data set for getting the total number of rows and columns in it.

Example
from pandas import read_csv

data = read_csv(path)^^^B
prin (datELshapej^^^^^M
Output
(150, 4)

We can easily observe from the output that iris data set, we are going to use, is having 150 rows and 4
columns.

Getting Each Attribute’s Data Type


It is another good practice to know data type of each attribute. The reason behind is that, as per to the
requirement, sometimes we may need to convert one data type to another. For example, we may need to
convert string into floating point or int for representing categorial or ordinal values. We can have an idea
about the attribute’s data type by looking at the raw data, but another way is to use dtypes property of
Pandas DataFrame. With the help of dtypes property we can categorize each attributes data type. It can be
understood with the help of following Python script -

Example
from pandas import read_csv
path = r"C:\[Link]"^^^^M
data = read_csv(path)^^^M
print(dataxitypes)^^^^^B
Output
sepal_length float64
sepal_width float64
petal_length float64
petal_width float64
dtype: object

From the above output, we can easily get the datatypes of each attribute.

Statistical Summary of Data


We have discussed Python recipe to get the shape i.e. number of rows and columns, of data but many
times we need to review the summaries out of that shape of data. It can be done with the help of describeQ
function of Pandas DataFrame that further provide the following 8 statistical properties of each & every
data attribute -

1. Count
2. Mean
3. Standard Deviation
4. Minimum Value
5. Maximum value
6. 25%
7. Median i.e. 50%
8. 75%

Example
from pandas import
from pandas import set_option^^^^^^^^^^^^^^^^^M
path = r"C:\[Link]"^^^^^^^^^^^^^B
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
[data = read_csv(path, names^iames)^^^^^^^^^M

set_option('precision72)^^^^^^^^^^^^^^^^^^^^^B

Output
(768,9)
preg plas pres skin test mass pedi age class
count 768.00 768.00 768.00 768.00 768.00 768.00 768.00 768.00 768.00
mean 3.85 120.89 69.11 20.54 79.80 31.99 0.47 33.24 0.35
std 3.37 31.97 19.36 15.95 115.24 7.88 0.33 11.76 0.48
min 0.00 0.00 0.00 0.00 0.00 0.00 0.08 21.00 0.00
25% 1.00 99.00 62.00 0.00 0.00 27.30 0.24 24.00 0.00
50% 3.00 117.00 72.00 23.00 30.50 32.00 0.37 29.00 0.00
75% 6.00 140.25 80.00 32.00 127.25 36.60 0.63 41.00 1.00
max 17.00 199.00 122.00 99.00 846.00 67.10 2.42 81.00 1.00

From the above output, we can observe the statistical summary of the data of Pima Indian Diabetes dataset
along with shape of data.

Reviewing Class Distribution


Class distribution statistics is useful in classification problems where we need to know the balance of class
values. It is important to know class value distribution because if we have highly imbalanced class distri­
bution i.e. one class is having lots more observations than other class, then it may need special handling
at data preparation stage of our ML project. We can easily get class distribution in Python with the help of
Pandas DataFrame.

Example

names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
data = read_csv(path, names=names)^
count_class = [Link]('class').sizeQ
Output
Class
0 500
1 268
dtype: int64

From the above output, it can be clearly seen that the number of observations with class 0 are almost dou­
ble than number of observations with class 1.

Reviewing Correlation between Attributes


The relationship between two variables is called correlation. In statistics, the most common method for
calculating correlation is Pearson’s Correlation Coefficient. It can have three values as follows -

1. Coefficient value = 1 - It represents full positive correlation between variables.


2. Coefficient value = -1 - It represents full negative correlation between variables.
3. Coefficient value = 0 - It represents no correlation at all between variables.

It is always good for us to review the pairwise correlations of the attributes in our dataset before using it
into ML project because some machine learning algorithms such as linear regression and logistic regres­
sion will perform poorly if we have highly correlated attributes. In Python, we can easily calculate a corre­
lation matrix of dataset attributes with the help of corr() function on Pandas DataFrame.
Example
from pandas import
from pandas import
path = r"C:\[Link]"^^^^^^^^^^^^^B
names = ['preg', 'plash 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
data = read_csv(path, names

set_option('precision72)^^^^^^^^^^^^^^^^^^^^^B
correlations = d ata, corr (method ^'pearsonQ^^^^^^^^^^M
1

Output
preg plas pres skin test mass pedi age class
preg 1.00 0.13 0.14 -0.08 -0.07 0.02 -0.03 0.54 0.22
plas 0.13 1.00 0.15 0.06 0.33 0.22 0.14 0.26 0.47
pres 0.14 0.15 1.00 0.21 0.09 0.28 0.04 0.24 0.07
skin -0.08 0.06 0.21 1.00 0.44 0.39 0.18 -0.11 0.07
test -0.07 0.33 0.09 0.44 1.00 0.20 0.19 -0.04 0.13
mass 0.02 0.22 0.28 0.39 0.20 1.00 0.14 0.04 0.29
pedi -0.03 0.14 0.04 0.18 0.19 0.14 1.00 0.03 0.17
age 0.54 0.26 0.24 -0.11 -0.04 0.04 0.03 1.00 0.24
class 0.22 0.47 0.07 0.07 0.13 0.29 0.17 0.24 1.00

The matrix in above output gives the correlation between all the pairs of the attribute in dataset.

Reviewing Skew of Attribute Distribution


Skewness may be defined as the distribution that is assumed to be Gaussian but appears distorted or
shifted in one direction or another, or either to the left or right. Reviewing the skewness of attributes is one
of the important tasks due to following reasons -

1. Presence of skewness in data requires the correction at data preparation stage so that we can
get more accuracy from our model.
2. Most of the ML algorithms assumes that data has a Gaussian distribution i.e. either normal of
bell curved data.

In Python, we can easily calculate the skew of each attribute by using skew() function on Pandas
DataFrame.

Example
from pandas import
path = r"C:\[Link]"^^^^^^^^^^^^^B
names = ['preg', 'plas', 'pres', 'skin', 'test1, 'mass', 'pedi', 'age', 'class']
data read_csv(path, names
print([Link]())

Output
preg O.9O
plas 0.17
pres -1.84
skin 0.11
test 2.27
mass -0.43
pedi 1.92
age 1.13
class 0.64
dtype: float64

From the above output, positive or negative skew can be observed. If the value is closer to zero, then it
shows less skew.

Data Preparation

Data preparation, also known as data preprocessing, is a crucial step in machine learning. The quality of
the data you use for your model can have a significant impact on the performance of the model.

Data preparation involves cleaning, transforming, and pre-processing the data to make it suitable for
analysis and modeling. The goal of data preparation is to make sure that the data is accurate, complete, and
relevant for the analysis.

The following are some of the key steps involved in data preparation -

1. Data cleaning - This involves identifying and correcting errors, missing values, and outliers
in the data. Common techniques used for data cleaning include imputation, outlier detection
and removal, and data normalization.
2. Data transformation - This involves converting the data from its original format into a
format that is suitable for analysis. This could involve converting categorical variables into
numerical variables, or scaling the data to a certain range.
3. Feature engineering - This involves creating new features from the existing data that may
be more informative or useful for the analysis. Feature engineering can involve combining
or transforming existing features, or creating new features based on domain knowledge or
insights.
4. Data integration - This involves combining data from multiple sources into a single dataset
for analysis. This may involve matching or linking records across different datasets, or merg­
ing datasets based on common variables.
5. Data reduction - This involves reducing the size of the dataset by selecting a subset of fea­
tures or observations that are most relevant for the analysis. This can help to reduce noise and
improve the accuracy of the model.

Data preparation is a critical step in the machine learning process, and can have a significant impact on the
accuracy and effectiveness of the final model. It requires careful attention to detail and a thorough under­
standing of the data and the problem at hand.

Example
Let's check an example of data preparation using the breast cancer dataset -

data = load_breast_cancer()

X = [Link]^^^^^^^^^^M

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)


scaler = StandardScalerQ^^^^^^^^M
\ j. '.'.'.r, 'CU'!, i ,.- .\7~~nn)J

X_test s c a 1 e r. t. r a n s fo r n X

In this example, we first load the breast cancer dataset using load_breast_cancer function from scikit-
learn. Then we separate the features and target, and split the data into training and testing sets using
train_test_split function.

Finally, we normalize the data using StandardScaler from scikit-learn, which subtracts the mean and
scales the data to unit variance. This helps to bring all the features to a similar scale, which is particularly
important for models like SVM and neural networks.

Why Data Pre-processing?


After selecting the raw data for ML training, the most important task is data pre-processing. In broad
sense, data preprocessing will convert the selected data into a form we can work with or can feed to ML
algorithms. We always need to preprocess our data so that it can be as per the expectation of machine
learning algorithm.

Data Pre-processing Techniques


We have the following data preprocessing techniques that can be applied on data set to produce data for
ML algorithms -

Scaling
Most probably our dataset comprises of the attributes with varying scale, but we cannot provide such data
to ML algorithm hence it requires rescaling. Data rescaling makes sure that attributes are at same scale.
Generally, attributes are rescaled into the range of 0 and 1. ML algorithms like gradient descent and k-Near-
est Neighbors requires scaled data. We can rescale the data with the help of MinMaxScaler class of scikit-
learn Python library.

Example
In this example we will rescale the data of Pima Indians Diabetes dataset which we used earlier. First, the
CSV data will be loaded (as done in the previous chapters) and then with the help of MinMaxScaler class, it
will be rescaled in the range of 0 and 1.

The first few lines of the following script are same as we have written in previous chapters while loading
CSV data.

from pandas import read_csv^^^M


from numpy import set_printoptions|
from sklearn import preprocessing^
bath = r'[Link]'
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = read_csv(path, names=names)
a ■' r a v d a L a i'r u n i e. va; u. c s

Now, we can use MinMaxScaler class to rescale the data in the range of 0 and 1.

data_scaler = [Link](feature_range=(Q;l))
data_rescaled = data_scaler.fit_transform(array)^^^^^^^B

We can also summarize the data for output as per our choice. Here, we are setting the precision to 1 and
showing the first 10 rows in the output.

se^nntogtions(^mcisio^^^^^^^^^^|
print ("\nScaled data:\n", data_rescaled[0:10])

Output
Scaled data:

[0.4 0.7 0.6 0.4 0. 0.5 0.2 0.5 1.]


[0.1 0.4 0.5 0.3 0. 0.4 0.1 0.2 0.]
[0.5 0.9 0.5 0. 0. 0.3 0.3 0.2 1.]
[0.10.4 0.5 0.2 0.10.4 0. 0. 0.]
[0. 0.7 0.3 0.4 0.2 0.6 0.9 0.2 1.]
[0.3 0.6 0.6 0. 0. 0.4 0.1 0.2 0.]
[0.2 0.4 0.4 0.3 0.1 0.5 0.1 0.1 1. ]
[0.6 0.6 0. 0. 0. 0.5 0. 0.1 0. ]
[0.1 1. 0.6 0.5 0.6 0.5 0. 0.5 1.]
[0.5 0.6 0.8 0. 0. 0. 0.1 0.6 1.]

From the above output, all the data got rescaled into the range of 0 and 1.

Normalization
Another useful data preprocessing technique is Normalization. This is used to rescale each row of data to
have a length of 1. It is mainly useful in Sparse dataset where we have lots of zeros. We can rescale the data
with the help of Normalizer class of scikit-learn Python library.

Types of Normalization
In machine learning, there are two types of normalization preprocessing techniques as follows -

LI Normalization
It may be defined as the normalization technique that modifies the dataset values in a way that in each row
the sum of the absolute values will always be up to 1. It is also called Least Absolute Deviations.

Example
In this example, we use LI Normalize technique to normalize the data of Pima Indians Diabetes dataset
which we used earlier. First, the CSV data will be loaded and then with the help of Normalizer class it will
be normalized.

The first few lines of following script are same as we have written in previous chapters while loading CSV
data.

from pandas import read_csv^^^^^^^^^^^^^^^^^^M


from numpy import set_printoptions^^^^^^^^^^^^^M

path = r'CApima-indians-diabetesTsV^^^^^^^^^^^^^M
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = read_csv (path, names=names)^^^^^^^^^^B

Now, we can use Normalizer class with LI to normalize the data.

[Link] = Data_normalizer.transform(arra5

We can also summarize the data for output as per our choice. Here, we are setting the precision to 2 and
showing the first 3 rows in the output.

set_printoptions(precision=2)
14 rint ("\nNormalized data:\n", Data_normalized [0:3])

Output
Normalized data:

[0.02 0.43 0.21 0.1 0.0.1 0.0.14 0.]


[0. [Link].0.11 0.0.130.]
[0.03 0.59 0.21 0. 0.0.07 0.0.1 0. ]

L2 Normalization
It may be defined as the normalization technique that modifies the dataset values in a way that in each row
the sum of the squares will always be up to 1. It is also called least squares.

Example
In this example, we use L2 Normalization technique to normalize the data of Pima Indians Diabetes
dataset which we used earlier. First, the CSV data will be loaded (as done in previous chapters) and then
with the help of Normalizer class it will be normalized.

The first few lines of following script are same as we have written in previous chapters while loading CSV
data.
from pandas import
from numpy import set_printoptions^^^^^^^^^^^^^M

path = r'CApima-indians-diabetesTsv'^^^^^^^^^^^^^M
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = read_csv (path, names=names)^^^^^^^^^^M

Now, we can use Normalizer class with LI to normalize the data.

pata_normalized = Data_normalizer.transform(array)

We can also summarize the data for output as per our choice. Here, we are setting the precision to 2 and
showing the first 3 rows in the output.

print ("\nNormalized data:\n", Data_normalized [0:3])

Output
Normalized data:

[0.03 0.83 0.4 0.2 0.0.19 0.0.28 0.01]


[0.01 0.72 0.56 0.24 0. 0.22 0. 0.26 0. ]
[0.04 0.92 0.32 0. 0. 0.12 0. 0.16 0.01]

Binarization
As the name suggests, this is the technique with the help of which we can make our data binary. We can
use a binary threshold for making our data binary. The values above that threshold value will be converted
to 1 and below that threshold will be converted to 0. For example, if we choose threshold value = 0.5, then
the dataset value above it will become 1 and below this will become 0. That is why we can call it binarizing
the data or thresholding the data. This technique is useful when we have probabilities in our dataset and
want to convert them into crisp values.

We can binarize the data with the help of Binarizer class of scikit-learn Python library.

Example
In this example, we will rescale the data of Pima Indians Diabetes dataset which we used earlier. First, the
CSV data will be loaded and then with the help of Binarizer class it will be converted into binary values i.e.
0 and 1 depending upon the threshold value. We are taking 0.5 as threshold value.

The first few lines of following script are same as we have written in previous chapters while loading CSV
data.
from pandas import read_csv^^^^^^^J
from [Link] import Binarizer

names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']

Il

Now, we can use Binarize class to convert the data into binary values.

pinarizer = Binarizer(threshold=Q.5).fit(array)
Data_binarized = [Link](array)B

Here, we are showing the first 5 rows in the output.

print ("\nBinary data:\n", Data_binarized [0:5])

Output
Binary data:

[1. 1. 1. 1.0. 1. 1.1. 1.]


[1.1. 1. 1.0. 1.0. 1.0.]
[1. 1. 1.0.0. 1. 1.1. 1.]
[1. 1. 1. 1. 1. 1.0. 1.0.]
Standardization
Another useful data preprocessing technique which is basically used to transform the data attributes with
a Gaussian distribution. It differs the mean and SD (Standard Deviation) to a standard Gaussian distribu­
tion with a mean of 0 and a SD of 1. This technique is useful in ML algorithms like linear regression, logistic
regression that assumes a Gaussian distribution in input dataset and produce better results with rescaled
data. We can standardize the data (mean = 0 and SD -1) with the help of StandardScaler class of scikit-learn
Python library.

Example
In this example, we will rescale the data of Pima Indians Diabetes dataset which we used earlier. First, the
CSV data will be loaded and then with the help of StandardScaler class it will be converted into Gaussian
Distribution with mean = 0 and SD = 1.

The first few lines of following script are same as we have written in previous chapters while loading CSV
data.

from [Link] import StandardScaler|


pandas
path = r'CApima-indians-diabetesTsV^^^^^^^^^^^^^M
names = ['preg', 'plash 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']

Now, we can use StandardScaler class to rescale the data.

data_scaler = StandardScaler().fit(array)M
data_rescaled = data_scaler.transform(arra

We can also summarize the data for output as per our choice. Here, we are setting the precision to 2 and
showing the first 5 rows in the output.

print ("\nRescaled data:\n", data_rescaled [0:5])

Output
Rescaled data:

[0.64 0.85 0.15 0.91 -0.69 0.2 0.47 1.43 1.37]


[-0.84 -1.12 -0.16 0.53 -0.69 -0.68 -0.37 -0.19 -0.73]
[1.23 1.94-0.26-1.29-0.69-1.1 0.6 -0.11 1.37]
[-0.84-1. -0.16 0.15 0.12-0.49-0.92-1.04-0.73]
[-1.14 0.5 -1.5 0.91 0.77 1.41 5.48-0.02 1.37]

Data Labeling
We discussed the importance of good fata for ML algorithms as well as some techniques to pre-process
the data before sending it to ML algorithms. One more aspect in this regard is data labeling. It is also very
important to send the data to ML algorithms having proper labeling. For example, in case of classification
problems, lot of labels in the form of words, numbers etc. are there on the data.

What is Label Encoding?


Most of the sklearn functions expect that the data with number labels rather than word labels. Hence, we
need to convert such labels into number labels. This process is called label encoding. We can perform label
encoding of data with the help of LabelEncoder() function of scikit-learn Python library.

Example
In the following example, Python script will perform the label encoding.

First, import the required Python libraries as follows -

import num as n
from sklearn import preprocessing

Now, we need to provide the input labels as follows -

inputjabels = ['red','black','red','green','black','yellow','whiter

The next line of code will create the label encoder and train it.

encoder = [Link]

The next lines of script will check the performance by encoding the random ordered list -

testjabels = ['green','red',

print("\nLabels =", test_labels)^^^^^^^^^^^^^^^B


print("Encoded values =", list(encodedIvalues))^^^^^M
encoded_values = [
decodedjist = encoder.inverse_transform(encoded_values)

We can get the list of encoded values with the help of following python script -

rint(n\nDecoded labels =", list(decoded_list))


Output
Labels = ['green', 'red', 'black']
Encoded values = [1, 2, 0]
Encoded values = [3,0,4,1]
Decoded labels = ['white', 'black', 'yellow', 'green']

Models
There are four main types of machine learning models -

1. Supervised Learning
2. Unsupervised Learning
3. Semi-supervised Learning
4. Reinforcement Learning

In the next four chapters, we will discuss each of these machine learning models in detail. Here, let's have
a brief overview of these methods:

Supervised Learning
Supervised learning algorithms or methods are the most commonly used ML algorithms. This method or
learning algorithm take the data sample i.e. the training data and its associated output i.e. labels or re­
sponses with each data samples during the training process.

The main objective of supervised learning algorithms is to learn an association between input data sam­
ples and corresponding outputs after performing multiple training data instances.

Based on the ML tasks, supervised learning algorithms can be divided into following two broad classes -

1. Classification - The key objective of classification-based tasks is to predict categorial output


labels or responses for the given input data. The output will be based on what the model
has learned in the training phase. As we know that the categorial output responses means
unordered and discrete values, hence each output response will belong to a specific class or
category. We will discuss Classification and associated algorithms in detail in the upcoming
chapters also.
2. Regression - The key objective of regression-based tasks is to predict output labels or re­
sponses which are continues numeric values, for the given input data. The output will be
based on what the model has learned in its training phase. Basically, regression models use
the input data features (independent variables) and their corresponding continuous numeric
output values (dependent or outcome variables) to learn specific association between inputs
and corresponding outputs. We will discuss regression and associated algorithms in detail in
further chapters also.

Unsupervised Learning
As the name suggests, it is opposite to supervised ML methods or algorithms which means in unsu­
pervised machine learning algorithms we do not have any supervisor to provide any sort of guidance.
Unsupervised learning algorithms are handy in the scenario in which we do not have the liberty, like in
supervised learning algorithms, of having pre-labeled training data and we want to extract useful pattern
from input data.

Examples of unsupervised machine learning algorithms includes K-means clustering, K-nearest neigh­
bors etc.

Based on the ML tasks, unsupervised learning algorithms can be divided into following broad classes -

1. Clustering - Clustering methods are one of the most useful unsupervised ML methods. These
algorithms used to find similarity as well as relationship patterns among data samples and
then cluster those samples into groups having similarity based on features. The real-world ex­
ample of clustering is to group the customers by their purchasing behavior.
2. Association - Another useful unsupervised ML method is Association which is used to
analyze large dataset to find patterns which further represents the interesting relationships
between various items. It is also termed as Association Rule Mining or Market basket analy­
sis which is mainly used to analyze customer shopping patterns.
3. Dimensionality Reduction - This unsupervised ML method is used to reduce the number of
feature variables for each data sample by selecting set of principal or representative features.
4. Anomaly Detection - This unsupervised ML method is used to find out the occurrences
of rare events or observations that generally do not occur. By using the learned knowledge,
anomaly detection methods would be able to differentiate between anomalous or a normal
data point.

Semi-supervised Learning
Such kind of algorithms or methods are neither fully supervised nor fully unsupervised. They basically fall
between the two i.e. supervised and unsupervised learning methods. These kinds of algorithms generally
use small supervised learning component i.e. small amount of pre-labeled annotated data and large unsu­
pervised learning component i.e. lots of unlabeled data for training.

Reinforcement Learning
These methods are different from previously studied methods and very rarely used also. In this kind of
learning algorithms, there would be an agent that we want to train over a period of time so that it can
interact with a specific environment. The agent will follow a set of strategies for interacting with the en­
vironment and then after observing the environment it will take actions regards the current state of the
environment.

Supervised

Supervised learning algorithms or methods are the most commonly used ML algorithms. This method or
learning algorithm take the data sample i.e. training data and associated output i.e. labels or responses
with each data samples during the training process. The main objective of supervised learning algorithms
is to learn an association between input data samples and corresponding outputs after performing multi­
ple training data instances.

For example, we have -

1 .x- Input variables and


2 . Y - Output variable

Now, apply an algorithm to learn the mapping function from the input to output as follows -

Y=f(x)

Now, the main objective would be to approximate the mapping function so well that even when we have
new input data (x), we can easily predict the output variable (Y) for that new input data.

It is called supervised because the whole process of learning can be thought as it is being supervised by a
teacher or supervisor. Examples of supervised machine learning algorithms includes Decision tree, Ran­
dom Forest, KNN, Logistic Regression etc.

Based on the ML tasks, supervised learning algorithms can be divided into two broad classes - Classifica­
tion and Regression.
Classification
The key objective of classification-based tasks is to predict categorial output labels or responses for the
given input data. The output will be based on what the model has learned in its training phase.

As we know that the categorial output responses means unordered and discrete values, hence each output
response will belong to a specific class or category. We will discuss Classification and associated algo­
rithms in detail in further chapters also.

Regression
The key objective of regression-based tasks is to predict output labels or responses which are continues
numeric values, for the given input data. The output will be based on what the model has learned in train­
ing phase.

Basically, regression models use the input data features (independent variables) and their corresponding
continuous numeric output values (dependent or outcome variables) to learn specific association between
inputs and corresponding outputs. We will discuss regression and associated algorithms in detail in fur­
ther chapters also.

Algorithms for Supervised Learning


Supervised learning is one of the important models of learning involved in training machines. This chap­
ter talks in detail about the same.

There are several algorithms available for supervised learning. Some of the widely used algorithms of su­
pervised learning are as shown below -

1. k-Nearest Neighbours
2. Decision Trees
3. Naive Bayes
4. Logistic Regression
5. Support Vector Machines

As we move ahead in this chapter, let us discuss in detail about each of the algorithms.

k-Nearest Neighbours
The k-Nearest Neighbours, which is simply called kNN is a statistical technique that can be used for solving
for classification and regression problems. Let us discuss the case of classifying an unknown object using
kNN. Consider the distribution of objects as shown in the image given below -
The diagram shows three types of objects, marked in red, blue and green colors. When you run the kNN
classifier on the above dataset, the boundaries for each type of object will be marked as shown below -
Now, consider a new unknown object that you want to classify as red, green or blue. This is depicted in the
figure below.
As you see it visually, the unknown data point belongs to a class of blue objects. Mathematically, this can be
concluded by measuring the distance of this unknown point with every other point in the data set. When
you do so, you will know that most of its neighbours are of blue color. The average distance to red and green
objects would be definitely more than the average distance to blue objects. Thus, this unknown object can
be classified as belonging to blue class.

The kNN algorithm can also be used for regression problems. The kNN algorithm is available as ready-to-
use in most of the ML libraries.

Decision Trees
A simple decision tree in a flowchart format is shown below -
You would write a code to classify your input data based on this flowchart. The flowchart is self-explana­
tory and trivial. In this scenario, you are trying to classify an incoming email to decide when to read it.

In reality, the decision trees can be large and complex. There are several algorithms available to create and
traverse these trees. As a Machine Learning enthusiast, you need to understand and master these tech­
niques of creating and traversing decision trees.

Naive Bayes
Naive Bayes is used for creating classifiers. Suppose you want to sort out (classify) fruits of different kinds
from a fruit basket. You may use features such as color, size and shape of a fruit, For example, any fruit that
is red in color, is round in shape and is about 10 cm in diameter may be considered as Apple. So to train
the model, you would use these features and test the probability that a given feature matches the desired
constraints. The probabilities of different features are then combined to arrive at a probability that a given
fruit is an Apple. Naive Bayes generally requires a small number of training data for classification.

Logistic Regression
Look at the following diagram. It shows the distribution of data points in XY plane.
20

15"

3 -2 -1 -0 2 3
From the diagram, we can visually inspect the separation of red dots from green dots. You may draw a
boundary line to separate out these dots. Now, to classify a new data point, you will just need to determine
on which side of the line the point lies.

Support Vector Machines


Look at the following distribution of data. Here the three classes of data cannot be linearly separated. The
boundary curves are non-linear. In such a case, finding the equation of the curve becomes a complex job.

SVM classification plot

1.0

0.5

0.0

-0.5

-1.0

-1.5
The Support Vector Machines (SVM) comes handy in determining the separation boundaries in such
situations.

Unsupervised

What is Unsupervised Learning?


In unsupervised machine learning algorithms, we do not have any supervisor to provide any sort of guid­
ance. Unsupervised learning algorithms are handy in the scenario in which we do not have the liberty, like
in supervised learning algorithms, of having pre-labeled training data and we want to extract useful pat­
tern from input data.

Examples of unsupervised machine learning algorithms includes K-means clustering, K-nearest neigh­
bors etc.

In regression, we train the machine to predict a future value. In classification, we train the machine to
classify an unknown object in one of the categories defined by us. In short, we have been training ma­
chines so that it can predict Y for our data X. Given a huge data set and not estimating the categories, it
would be difficult for us to train the machine using supervised learning. What if the machine can look up
and analyze the big data running into several Gigabytes and Terabytes and tell us that this data contains so
many distinct categories?
As an example, consider the voter’s data. By considering some inputs from each voter (these are called
features in Al terminology), let the machine predict that there are so many voters who would vote for X
political party and so many would vote for Y, and so on. Thus, in general, we are asking the machine given
a huge set of data points X, "What can you tell me about X?”. Or it may be a question like "What are the five
best groups we can make out of X?”. Or it could be even like “What three features occur together most fre­
quently in X?”.

This is exactly the Unsupervised Learning is all about.

Algorithms for Unsupervised Learning


Let us now discuss one of the widely used algorithms for classification in unsupervised machine learning.

k-means clustering
The 2000 and 2004 Presidential elections in the United States were close — very close. The largest percent­
age of the popular vote that any candidate received was 50.7% and the lowest was 47.9%. If a percentage
of the voters were to have switched sides, the outcome of the election would have been different. There are
small groups of voters who, when properly appealed to, will switch sides. These groups may not be huge,
but with such close races, they may be big enough to change the outcome of the election. How do you find
these groups of people? How do you appeal to them with a limited budget? The answer is clustering.

Let us understand how it is done.


1. First, you collect information on people either with or without their consent: any sort of
information that might give some clue about what is important to them and what will influ­
ence how they vote.
2. Then you put this information into some sort of clustering algorithm.
3. Next, for each cluster (it would be smart to choose the largest one first) you craft a message
that will appeal to these voters.
4. Finally, you deliver the campaign and measure to see if it’s working.

Clustering is a type of unsupervised learning that automatically forms clusters of similar things. It is like
automatic classification. You can cluster almost anything, and the more similar the items are in the clus­
ter, the better the clusters are. In this chapter, we are going to study one type of clustering algorithm called
k-means. It is called k-means because it finds ‘k’ unique clusters, and the center of each cluster is the mean
of the values in that cluster.

Cluster Identification
Cluster identification tells an algorithm, “Here’s some data. Now group similar things together and tell me
about those groups.” The key difference from classification is that in classification you know what you are
looking for. While that is not the case in clustering.

Clustering is sometimes called unsupervised classification because it produces the same result as classifi­
cation does but without having predefined classes.
Based on the ML tasks, unsupervised learning algorithms can be divided into the following broad classes:
Clustering, Association, Dimensionality Reduction, and Anomaly Detection.

Clustering
Clustering methods are one of the most useful unsupervised ML methods. These algorithms used to find
similarity as well as relationship patterns among data samples and then cluster those samples into groups
having similarity based on features. The real-world example of clustering is to group the customers by
their purchasing behavior.

Association
Another useful unsupervised ML method is Association which is basically used to analyze large dataset
to find patterns which further represent the interesting relationships between various items. It is also
termed as Association Rule Mining or Market basket analysis which is mainly used to analyze customer
shopping patterns.

Dimensionality Reduction
As the name suggests, this unsupervised ML method is used to reduce the number of feature variables for
each data sample by selecting set of principal or representative features.

A question arises here is that, why we need to reduce the dimensionality? The reason behind this is
the problem of feature space complexity which arises when we start analyzing and extracting millions
of features from data samples. This problem generally refers to "curse of dimensionality". PCA (Principal
Component Analysis), K-nearest neighbors and discriminant analysis are some of the popular algorithms
for this purpose.

Anomaly Detection
This unsupervised ML method is used to find out occurrences of rare events or observations that generally
do not occur. By using the learned knowledge, anomaly detection methods would be able to differentiate
between anomalous or a normal data point.

Some of the unsupervised algorithms like clustering, KNN can detect anomalies based on the data and its
features.

Semi-supervised
Semi-supervised machine learning algorithms are neither fully supervised nor fully unsupervised. They
basically fall between the two, i.e., supervised and unsupervised learning methods.

Semi-supervised algorithms generally use small supervised learning component, i.e., small amount of
pre-labeled annotated data and large unsupervised learning component, i.e., lots of unlabeled data for
training.

We can follow any of the following approaches for implementing semi-supervised learning methods -
1. The first and simple approach is to build the supervised model based on a small labeled and
annotated data and then build the unsupervised model by applying the same to the large
amounts of unlabeled data to get more labeled samples. Now, train the model on them and re­
peat the process.
2. The second approach needs some extra efforts. In this approach, we can first use the unsuper­
vised methods to cluster similar data samples, annotate these groups and then use a combina­
tion of this information to train the model.

The algorithm is trained on a dataset that contains both labeled and unlabeled data. Semi-supervised
learning is generally used when we have a huge set of unlabeled data available. In any supervised learn­
ing algorithm, the available data has to be manually labelled which can be quite an expensive process. In
contrast, the unlabelled data used in unsupervised learning has limited applications. Hence, unsupervised
learning algorithms were developed which can provide a perfect balance between the two.

Semi-Supervised Learning algorithm find its application in text classification, image classification, speech
analysis, anomaly detection, etc. where the general goal is to classify an entity into a predefined category.
Semi-supervised algorithm assumes that the data can be divided into discrete clusters and the data points
closer to each other are more likely to share the same output label.
Reinforcement

These methods are a bit different from previously studied methods and very rarely used also. In this kind
of learning algorithms, there would be an agent that we want to train over a period of time so that it
can interact with a specific environment. The agent will follow a set of strategies for interacting with the
environment and then after observing the environment it will take actions regards the current state of the
environment.

Here are the major steps involved in reinforcement learning methods -

1. Step 1 - First, we need to prepare an agent with some initial set of strategies.
2. Step 2 - Then observe the environment and its current state.
3. Step 3 - Next, select the optimal policy regards the current state of the environment and per­
form important action.
4. Step 4 - Now, the agent can get corresponding reward or penalty as per accordance with the
action taken by it in previous step.
5. Step 5 - Now, we can update the strategies if it is required so.
6. Step 6 - At last, repeat steps 2-5 until the agent got to learn & adopt the optimal policies.

The following diagram shows what type of task is appropriate for various ML problems -
Ms
Is the data
Correlated or
Redundant?
Supervised vs. Unsupervised

Machine Learning approaches can be either Supervised or Unsupervised. If you can anticipate the expanse
of data, and if it is possible to divide the data into categories, then the best approach is to help the algo­
rithm become smarter by Supervised Learning.

If you anticipate that the amount of data is massive, and if you think that the data cannot be simply clas­
sified or labelled, then it is better to go for Unsupervised Learning approach and let the algorithms handle
predictions smartly.

Differences between Supervised and Unsupervised Machine Learning


The table below shows some key differences between supervised and unsupervised machine learning -

Supervised Technique Unsupervised Technique

Supervised machine learning algorithms are Unsupervised machine learning algorithms do


trained using both training data and its associated not require labeled data for training.
output i.e., label data.
Supervised machine learning model learns the Unsupervised machine learning model learns
association between input training data and their the pattern and relationship from the given raw
labels. data.

Supervised ML model takes feedback to check Unsupervised ML model does not take any kind
whether it is predicting the correct output or not. of feedback.

As name entails, supervised machine learning algo­ As name entails, unsupervised machine learn­
rithms needs supervision to train the model. ing algorithms does not any kind of supervision
to train the model.

We can divide supervised machine learning algo­ Clustering, Anomaly Detection, Association,
rithms in two broad classes namely Classification and Association are some of the broad classed of
and Regression. unsupervised machine learning algorithms.

In terms of computational complexity, supervised Unsupervised machine learning methods are


machine learning methods are computationally computationally complex.
simple.

Supervised machine learning methods are highly Unsupervised machine learning methods are
accurate. less accurate.

In supervised machine learning, the learning takes In unsupervised machine learning, the learning
place offline. takes place in real time.

Number of classes is already known before imple­ In unsupervised learning methods, number of
menting supervised machine learning methods. classes are not known in prior.

One of the main drawbacks of supervised learning As the data used in unsupervised learning is not
is to classify big data. labeled, getting precise information regarding
data sorting is one of the main drawbacks of it.

Some of the well-known supervised machine learn- Some of the well-known unsupervised machine
ing algorithms are KNN (k-nearest neighbors), learning algorithms are Hebbian Learning, K-
Decision tree, Logistic Regression, and Random means Clustering, and Hierarchical Clustering.
Forest.

Data Visualization

Data visualization is an important aspect of machine learning (ML) as it helps to analyze and communicate
patterns, trends, and insights in the data. Data visualization involves creating graphical representations
of the data, which can help to identify patterns and relationships that may not be apparent from the raw
data.
Here are some of the ways data visualization is used in machine learning -

1. Exploring Data - Data visualization is an essential tool for exploring and understanding data.
Visualization can help to identify patterns, correlations, and outliers, and can also help to de­
tect data quality issues such as missing values and inconsistencies.
2. Feature Selection - Data visualization can help to select relevant features for the ML model.
By visualizing the data and its relationship with the target variable, you can identify features
that are strongly correlated with the target variable and exclude irrelevant features that have
little predictive power.
3. Model Evaluation - Data visualization can be used to evaluate the performance of the ML
model. Visualization techniques such as ROC curves, precision-recall curves, and confusion
matrices can help to understand the accuracy, precision, recall, and Fl score of the model.
4. Communicating Insights - Data visualization is an effective way to communicate insights
and results to stakeholders who may not have a technical background. Visualizations such as
scatter plots, line charts, and bar charts can help to convey complex information in an easily
understandable format.

Some popular libraries used for data visualization in Python include Matplotlib, Seaborn, Plotly, and
Bokeh. These libraries provide a wide range of visualization techniques and customization options to suit
different needs and preferences.
Univariate Plots: Understanding Attributes Independently
The simplest type of visualization is single-variable or "univariate” visualization. With the help of univari­
ate visualization, we can understand each attribute of our dataset independently. The following are some
techniques in Python to implement univariate visualization -

1. Histograms
2. Density Plots
3. Box and Whisker Plots
Multivariate Plots: Interaction Among Multiple Variables
Another type of visualization is multi-variable or “multivariate” visualization. With the help of multivari­
ate visualization, we can understand interaction between multiple attributes of our dataset. The following
are some techniques in Python to implement multivariate visualization -

1. Correlation Matrix Plot


2. Scatter Matrix Plot

In the next few chapters, we will look at some of the popular and widely used visualization techniques
available in machine learning.

Histograms
A histogram is a bar graph-like representation of the distribution of a variable. It shows the frequency of
occurrences of each value of the variable. The x-axis represents the range of values of the variable, and the
y-axis represents the frequency or count of each value. The height of each bar represents the number of
data points that fall within that value range.

Histograms are useful for identifying patterns in data, such as skewness, modality, and outliers. Skewness
refers to the degree of asymmetry in the distribution of the variable. Modality refers to the number of
peaks in the distribution. Outliers are data points that fall outside of the range of typical values for the
variable.

Python Implementation of Histograms


Python provides several libraries for data visualization, such as Matplotlib, Seaborn, Plotly, and Bokeh. For
the example given below, we will use Matplotlib to implement histograms.

We will use the breast cancer dataset from the Sklearn library for this example. The breast cancer dataset
contains information about the characteristics of breast cancer cells and whether they are malignant or
benign. The dataset has 30 features and 569 samples.

Example
Let's start by importing the necessary libraries and loading the dataset -

from [Link] import load_breast_cancer


'.L.; loaa Kvast . l

Next, we will create a histogram of the mean radius feature of the dataset -

Oigure(figsiz^7\2^j^>))^|
[Link]([Link][:,O], bins=20)
[Link]('Mean Radius')

In this code, we have used the hist() function from Matplotlib to create a histogram of the mean radius
feature of the dataset. We have set the number of bins to 20 to divide the data range into 20 intervals. We
have also added labels to the x and y axes using the xlabel() and ylabel() functions.

Output
The resulting histogram shows the distribution of mean radius values in the dataset. We can see that the
data is roughly normally distributed, with a peak around 12-14.
80 -

>. 60'
c
(L>

f 40-

20 -

0 -1----- —F
10 15 20 25
Mean Radius

Histogram with Multiple Data Sets


We can also create a histogram with multiple data sets to compare their distributions. Let's create his­
tograms of the mean radius feature for both the malignant and benign samples -

Example
[Link](figsize=(7.2, 3.5))
pl^ust^dat^atafdat^arget^OjOl^ins^ZO^lpha^O^Jabel^Malignant)

[Link]

In this code, we have used the hist() function twice to create two histograms of the mean radius feature,
one for the malignant samples and one for the benign samples. We have set the transparency of the bars
to 0.5 using the alpha parameter so that they don't overlap completely. We have also added a legend to the
plot using the legend() function.

Output
On executing this code, you will get the following plot as the output -
Malignant
40 Benign

30

20

10

0
10 15 20 25
Mean Radius

The resulting histogram shows the distribution of mean radius values for both the malignant and benign
samples. We can see that the distributions are different, with the malignant samples having a higher fre­
quency of higher mean radius values.
Density Plots

A density plot is a type of plot that shows the probability density function of a continuous variable. It is
similar to a histogram, but instead of using bars to represent the frequency of each value, it uses a smooth
curve to represent the probability density function. The xaxis represents the range of values of the vari­
able, and the y-axis represents the probability density.

Density plots are useful for identifying patterns in data, such as skewness, modality, and outliers. Skew­
ness refers to the degree of asymmetry in the distribution of the variable. Modality refers to the number
of peaks in the distribution. Outliers are data points that fall outside of the range of typical values for the
variable.

Python Implementation of Density Plots


Python provides several libraries for data visualization, such as Matplotlib, Seaborn, Plotly, and Bokeh. For
our example given below, we will use Seaborn to implement density plots.

We will use the breast cancer dataset from the Sklearn library for this example. The breast cancer dataset
contains information about the characteristics of breast cancer cells and whether they are malignant or
benign. The dataset has 30 features and 569 samples.
Example
Let's start by importing the necessary libraries and loading the dataset -

import [Link] as

from [Link] import load_breast_cancer


'.L.; loac Kvast . l

Next, we will create a density plot of the mean radius feature of the dataset -

gl^gure(figsiz^7^L5))^^^^^J
[Link]([Link][:,Q], shade=True)
[Link] RadiusQ^^^^^^^M

In this code, we have used the kdeplot() function from Seaborn to create a density plot of the mean radius
feature of the dataset. We have set the shade parameter to True to shade the area under the curve. We have
also added labels to the x and y axes using the xlabelQ and ylabelQ functions.

Output
The resulting density plot shows the probability density function of mean radius values in the dataset. We
can see that the data is roughly normally distributed, with a peak around 12-14.

Density Plot with Multiple Data Sets


We can also create a density plot with multiple data sets to compare their probability density functions.
Let's create density plots of the mean radius feature for both the malignant and benign samples -

Example
[Link]([Link][[Link]==Q,0], shade=True, label='Malignant')
[Link]([Link][[Link]= = 1,0], shade=True, label='Bemgn')J
[Link]('Mean

In this code, we have used the kdeplot() function twice to create two density plots of the mean radius
feature, one for the malignant samples and one for the benign samples. We have set the shade parameter
to True to shade the area under the curve, and we have added labels to the plots using the label parameter.
We have also added a legend to the plot using the legend() function.

Output
On executing this code, you will get the following plot as the output -
The resulting density plot shows the probability density functions of mean radius values for both the
malignant and benign samples. We can see that the probability density function for the malignant sam­
ples is shifted to the right, indicating a higher mean radius value.
Box and Whisker Plots

A boxplot is a graphical representation of a dataset that displays the five-number summary of the data - the
minimum value, the first quartile, the median, the third quartile, and the maximum value.

The boxplot consists of a box with whiskers extending from the top and bottom of the box.

1. The box represents the interquartile range (IQR) of the data, which is the range between the
first and third quartiles.
2. The whiskers extend from the top and bottom of the box to the highest and lowest values that
are within 1.5 times the IQR.

Any values that fall outside this range are considered outliers and are represented as points beyond the
whiskers.

Python Implementation of Box and Whisker Plots


Now that we have a basic understanding of boxplots, let's implement them in Python. For our example, we
will be using the Iris dataset from Sklearn, which contains measurements of the sepal length, sepal width,
petal length, and petal width of 150 iris flowers, belonging to three different species - Setosa, Versicolor,
and Virginica.
To start, we need to import the necessary libraries and load the dataset.

Example
import [Link] as plt^^^M

from [Link] import load_iris

data = iris

Next, we can create a boxplot of the sepal length for each of the three iris species using the Seaborn library.

[Link](x=target, y=data[:, 0])

Output
This code will produce a boxplot of the sepal length for each of the three iris species, with the x-axis repre­
senting the species and the y-axis representing the sepal length in centimeters.
8.0

7.5

7.0

6.5

6.0

5.5

5.0

0 1
Species

From this boxplot, we can see that the setosa species has a shorter sepal length compared to the versicolor
and virginica species, which have a similar median and range of sepal lengths. Additionally, we can see
that there are no outliers in the setosa species, but there are a few outliers in the versicolor and virginica
specie.
Correlation Matrix Plot
A correlation matrix plot is a graphical representation of the pairwise correlation between variables in a
dataset. The plot consists of a matrix of scatterplots and correlation coefficients, where each scatterplot
represents the relationship between two variables, and the correlation coefficient indicates the strength of
the relationship. The diagonal of the matrix usually shows the distribution of each variable.

The correlation coefficient is a measure of the linear relationship between two variables and ranges from
-1 to 1. A coefficient of 1 indicates a perfect positive correlation, where an increase in one variable is as­
sociated with an increase in the other variable. A coefficient of -1 indicates a perfect negative correlation,
where an increase in one variable is associated with a decrease in the other variable. A coefficient of 0 indi­
cates no correlation between the variables.

Python Implementation of Correlation Matrix Plots


Now that we have a basic understanding of correlation matrix plots, let's implement them in Python. For
our example, we will be using the Iris flower dataset from Sklearn, which contains measurements of the
sepal length, sepal width, petal length, and petal width of 150 iris flowers, belonging to three different
species - Setosa, Versicolor, and Virginica.

Example
sns. set(style='

jcmap = sns.diverging_palette(220,10, as_cmap=True)^^^^M


[Link](corr, mask=mask, cmap=cmap, vmax=.3, center=0,
square=True, linewidths=.5, cbar_kws={"shrmk"-Ji})^^^^^B

Output
This code will produce a correlation matrix plot of the Iris dataset, with each square representing the cor­
relation coefficient between two variables.

sepal length (cm) r 0.3

-02

-01
sepal width (cm)

-00

- -0 .1

k -0 2

k -0.3
petal length (cm)

k -0 4
petal width (cm)

sepal length (cm) sepal width (cm) petal length (cm) petal width (cm)
From this plot, we can see that the variables 'sepal width (cm)' and 'petal length (cm)' have a moderate
negative correlation (-0.37), while the variables 'petal length (cm)' and 'petal width (cm)' have a strong
positive correlation (0.96). We can also see that the variable 'sepal length (cm)' has a weak positive correla­
tion (0.87) with the variable 'petal length (cm)'.

Scatter Matrix Plot

Scatter Matrix Plot is a graphical representation of the relationship between multiple variables. It is a use­
ful tool in machine learning for visualizing the correlation between features in a dataset. This plot is also
known as a Pair Plot, and it is used to identify the correlation between two or more variables in a dataset.

A Scatter Matrix Plot displays the scatter plot of each pair of features in a dataset. Each scatter plot repre­
sents the relationship between two variables. It is also possible to add a diagonal line to the plot that shows
the distribution of each variable.

Python Implementation of Scatter Matrix Plot


Here, we will implement the Scatter Matrix Plot in Python. For our example given below, we will be using
Sklearn's Iris dataset.
The Iris dataset is a classic dataset in machine learning. It contains four features: Sepal Length, Sepal
Width, Petal Length, and Petal Width. The dataset has 150 samples, and each sample is labeled as one of
three species: Setosa, Versicolor, or Virginica.

We will use the Seaborn library to implement the Scatter Matrix Plot. Seaborn is a Python data visualiza­
tion library that is built on top of the Matplotlib library.

Example
Below is the Python code to implement the Scatter Matrix Plot -

import seaborn as sns


limport pandas as pdl

iris = sns.load_dataset('iris')

[Link](iris, hue='species'

^Lshowfl

In this code, we first import the necessary libraries, Seaborn and Pandas. Then, we load the Iris dataset
using the sns.load_dataset() function. This function loads the Iris dataset from the Seaborn library.
Next, we create the Scatter Matrix Plot using the [Link]() function. The hue parameter is used to
specify the column in the dataset that should be used for color encoding. In this case, we use the species
column to color the points according to the species of each sample.

Finally, we use the [Link] function to display the plot.

Output
The output of this code will be a Scatter Matrix Plot that shows the scatter plots of each pair of features in
the Iris dataset.

4 6 8
&epal_k>ng1h
Notice that each scatter plot is color-coded according to the species of each sample.

Machine Learning - Statistics


Statistics is a crucial tool in machine learning because it helps us understand the underlying patterns in
the data. It provides us with methods to describe, summarize, and analyze data. Let's see some of the ba­
sics of statistics for machine learning.

Descriptive Statistics
Descriptive statistics is a branch of statistics that deals with the summary and analysis of data. It includes
measures such as mean, median, mode, variance, and standard deviation. These measures help us under­
stand the central tendency, variability, and distribution of the data.

In machine learning, descriptive statistics can be used to summarize the data, identify outliers, and detect
patterns. For example, we can use the mean and standard deviation to describe the distribution of a
dataset.

In Python, we can calculate descriptive statistics using libraries such as NumPy and Pandas. Below is an
example -

Example
import pandas as

df = [Link](data, columns=["Values"])
;..'. . .' ■ >./ r:

Output
This will output a summary of the dataset, including the count, mean, standard deviation, minimum, and
maximum values as follows -

Values
count 5.000000
mean 3.000000
std 1.581139
min 1.000000
25% 2.000000
50% 3.000000
75% 4.000000
max 5.000000
Inferential Statistics
Inferential statistics is a branch of statistics that deals with making predictions and inferences about a
population based on a sample of data. It involves using hypothesis testing, confidence intervals, and re­
gression analysis to draw conclusions about the data.

In machine learning, inferential statistics can be used to make predictions about new data based on exist­
ing data. For example, we can use regression analysis to predict the price of a house based on its features,
such as the number of bedrooms and bathrooms.

In Python, we can perform inferential statistics using libraries such as Scikit-Learn and StatsModels.
Below is an example -

Example
import [Link] as sm
im:';u iiumov

^i^rraj([1^^4^])|
= [Link]([2,4,6, 8,10])

model = [Link](y, X).fit()


11 rint([Link])

Output
This will output a summary of the regression model, including the coefficients, standard errors, t-statis-
tics, and p-values as follows -

OLS Regression Results

Dep. Variable: V R-squared: 1.000


Model: OLS Adj. R-squared: 1.000
Method: Least Squares F-statistic: 1.383e+31
Date: Wed, 26 Apr 2023 Prob (F-statistic): 4.29e-47
Time: 16:12:21 Log-Likelihood: 164.22
No. Observations: 5 AIC: -324.4
Df Residuals: 3 BIC: -325.2
Df Model: 1
Covariance Type: nonrobust

coef std err t p> 111 [0.025 0.975]

const [Link]-15 1.78e-15 0.622 0.578 -4.57e-15 6.79e-15


xl 2.0000 5.38e-16 3.72e+15 0.000 2.000 2.000

Omnibus: nan Durbin-Watson: 0.091


Prob(Omnibus): nan Jarque-Bera (JB): 0.839
Skew: -0.408 Prob(JB): 0.657
Kurtosis: 1.167 Cond. No. 8.37
In the next chapter, we will discuss various descriptive and inferential statistics measures, which are com­
monly used in machine learning, in detail along with Python implementation example.

Mean, Median, Mode


Mean, Median, and Mode are statistical measures used to describe the central tendency of a dataset. In
machine learning, these measures are used to understand the distribution of data and identify outliers.
Here, we will explore the concepts of Mean, Median, and Mode and their implementation in Python.

Mean
The "mean" is the average value of a dataset. It is calculated by adding up all the values in the dataset and
dividing by the number of observations. The mean is a useful measure of central tendency because it is
sensitive to outliers, meaning that extreme values can significantly affect the value of the mean.

In Python, we can calculate the mean using the NumPy library, which provides a function called mean().

Median
The "median" is the middle value in a dataset. It is calculated by arranging the values in the dataset in order
and finding the value that lies in the middle. If there are an even number of values in the dataset, the me­
dian is the average of the two middle values.

The median is a useful measure of central tendency because it is not affected by outliers, meaning that
extreme values do not significantly affect the value of the median.

In Python, we can calculate the median using the NumPy library, which provides a function called
median().

Mode
The "mode" is the most common value in a dataset. It is calculated by finding the value that occurs most
frequently in the dataset. If there are multiple values that occur with the same frequency, the dataset is
said to be bimodal, trimodal, or multimodal.

The mode is a useful measure of central tendency because it can identify the most common value in a
dataset. However, it is not a good measure of central tendency for datasets with a wide range of values or
datasets with no repeating values.

In Python, we can calculate the mode using the SciPy library, which provides a function called mode().

Python Implementation
Let's see an example of calculating mean, median, and mode for a salary table in Python using NumPy and
Pandas -
import pandas as

salary =

^salary^[5OOOOi65OOO^5OOO^5OOO^OOO^6OOOO^5OOO^5OOO^

j 7MinBHHnHH^HH^MUUMH^KHHHKH^KHnHHHH|m^^H9MHBHHHRH^H^HnH

mean_salary = [Link](salary['salary'])
print('Mean salary:', mean_salaiy)^^M

|median_salary = [Link](salary['salary'])
print('Median salary:', median2salary)^^M

calculate mode
mode_salary = salary['salary'].mode()[0]
print('Mode salary:', mode_salary)

Output
On executing this code, you will get the following output -

Mean salary: 59500.0


Median salary: 57500.0
Mode salary: 45000

Standard Deviation

Standard deviation is a measure of the amount of variation or dispersion of a set of data values around
their mean. In machine learning, it is an important statistical concept that is used to describe the spread or
distribution of a dataset.

Standard deviation is calculated as the square root of the variance, which is the average of the squared
differences from the mean. The formula for calculating standard deviation is as follows -
Z(x-nY/N

Where -

• ais the standard deviation

• E is the sum of

• x is the data point

• n is the mean of the dataset

• N is the total number of data points

In machine learning, standard deviation is used to understand the variability of a dataset and to detect
outliers. For example, in finance, standard deviation is used to measure the volatility of stock prices. In
image processing, standard deviation can be used to detect image noise.
Types of Examples

Example 1
In this example, we will be using the NumPy library to calculate the standard deviation -

std_dev = [Link](data)^^^^^^M

print('Standard deviation:', std_dev)

Output
It will produce the following output -

Standard deviation: 1.707825127659933

Example 2
Let's see another example in which we will calculate the standard deviation of each column in Iris flower
dataset using Python and Pandas library -
import pandas as pd

load the ins datase

iris_df = pd.read_csv('[Link]
names ['>[Link] , - ■ / 7 7 ■ , - , . ' ,

- ... d 7'

rintfStandard deviationsd)J

In this example, we load the Iris dataset from the UCI Machine Learning Repository using Pandas' read-
_csv() method. We then calculate the standard deviation of each column using the std() method of the
Pandas dataframe. Finally, we print the standard deviations for each column.

Output
On executing the code, you will get the following output -

Standard deviations:
sepal length 0.828066
sepal width 0.433594
petal length 1.764420
petal width 0.763161
dtype: float64

This example demonstrates how standard deviation can be used to understand the variability of a dataset.
In this case, we can see that the standard deviation of the 'petal length' column is much higher than that of
the other columns, which suggests that this feature may be more variable and potentially more informa­
tive for classification tasks.

Percentiles

Percentiles are a statistical concept used in machine learning to describe the distribution of a dataset. A
percentile is a measure that indicates the value below which a given percentage of observations in a group
of observations falls.

For example, the 25th percentile (also known as the first quartile) is the value below which 25% of the
observations in the dataset fall, while the 75 th percentile (also known as the third quartile) is the value
below which 75% of the observations in the dataset fall.

Percentiles can be used to summarize the distribution of a dataset and identify outliers. In machine learn­
ing, percentiles are often used in data preprocessing and exploratory data analysis to gain insights into the
data.
Python provides several libraries for calculating percentiles, including NumPy and Pandas.

Calculating Percentiles using NumPy


Below is an example of how to calculate percentiles using NumPy -

Example
import num as n

prmtC75d^ementile^^75)|

In this example, we create a sample dataset using NumPy and then calculate the 25th and 75th percentiles
using the [Link]() function.

Output
The output shows the values of the percentiles for the dataset.

25th percentile: 2.0


75th percentile: 4.0
Calculating Percentiles using Pandas
Below is an example of how to calculate percentiles using Pandas -

Example

da^^m^enes(^^^^^])

pHntC75tb^ementHef^7^

In this example, we create a Pandas series object and then calculate the 25th and 75th percentiles using the
quantile() method of the series object.

Output
The output shows the values of the percentiles for the dataset.

25th percentile: 2.0


75th percentile: 4.0
Data Distribution

In machine learning, data distribution refers to the way in which data points are distributed or spread out
across a dataset. It is important to understand the distribution of data in a dataset, as it can have a signifi­
cant impact on the performance of machine learning algorithms.

Data distribution can be characterized by several statistical measures, including mean, median, mode,
standard deviation, and variance. These measures help to describe the central tendency, spread, and shape
of the data.

Some common types of data distribution in machine learning are given below -

Normal Distribution
Normal distribution, also known as Gaussian distribution, is a continuous probability distribution that
is widely used in machine learning and statistics. It is a bell-shaped curve that describes the probability
distribution of a random variable that is symmetric around the mean. The normal distribution has two pa­
rameters, the mean (p) and the standard deviation (a).

In machine learning, normal distribution is often used to model the distribution of error terms in linear
regression and other statistical models. It is also used as a basis for various hypothesis tests and confidence
intervals.
One important property of normal distribution is the empirical rule, also known as the 68- 95-99.7 rule.
This rule states that approximately 68% of the observations fall within one standard deviation of the
mean, 95% of the observations fall within two standard deviations of the mean, and 99.7% of the observa­
tions fall within three standard deviations of the mean.

Python provides various libraries that can be used to work with normal distributions. One such library is
[Link], which provides functions for calculating the probability density function (PDF), cumulative
distribution function (CDF), percent point function (PPF), and random variables for normal distribution.

Example
Here is an example of using [Link] to generate and visualize a normal distribution -

import numi
from [Link] import norm
import [Link] as pit

sigma = 1 # Standard deviation^^^^^^^^^^^^^^^^^^^^^B


sample [Link](mu,

x = [Link](mu - 3*sigma, mu + 3*sigma, 100)


[Link](sample, bins = 30, density=True, alpha=oT)^^^^^^^^M

In this example, we first generate a random sample of 1000 values from a normal distribution with mean
0 and standard deviation 1 using [Link]. We then use [Link] to calculate the PDF for the
normal distribution and [Link] to generate an array of 100 evenly spaced values between p -3o and p.
+ 3o

Finally, we plot the histogram of the random sample using [Link] and overlay the PDF of the normal dis­
tribution using [Link].

Output
The resulting plot shows the bell-shaped curve of the normal distribution and the histogram of the ran­
dom sample that approximates the normal distribution.
0.4

0.3

0.2

0.1

0.0
-3-2-10 1 2 3

Skewed Distribution
A skewed distribution in machine learning refers to a dataset that is not evenly distributed around its
mean, or average value. In a skewed distribution, the majority of the data points tend to cluster towards
one end of the distribution, with a smaller number of data points at the other end.

There are two types of skewed distributions: left-skewed and right-skewed. A left-skewed distribution,
also known as a negative-skewed distribution, has a long tail towards the left side of the distribution, with
the majority of data points towards the right side. In contrast, a right-skewed distribution, also known as
a positive-skewed distribution, has a long tail towards the right side of the distribution, with the majority
of data points towards the left side.

Skewed distributions can occur in many different types of datasets, such as financial data, social media
metrics, or healthcare records. In machine learning, it is important to identify and handle skewed distri­
butions appropriately, as they can affect the performance of certain algorithms and models. For example,
skewed data can lead to biased predictions and inaccurate results in some cases and may require pre­
processing techniques such as normalization or data transformation to improve the performance of the
model.

Example
Here is an example of generating and plotting a skewed distribution using Python's NumPy and Matplotlib
libraries -

import numpy as np
import [Link] as pit

data = [Link](2,
Output
On executing this code, you will get the following plot as the output -

Skewed Distribution

Value
Uniform Distribution
A uniform distribution in machine learning refers to a probability distribution in which all possible out­
comes are equally likely to occur. In other words, each value in a dataset has the same probability of being
observed, and there is no clustering of data points around a particular value.

The uniform distribution is often used as a baseline for comparison with other distributions, as it repre­
sents a random and unbiased sampling of the data. It can also be useful in certain types of applications,
such as generating random numbers or selecting items from a set without bias.

In probability theory, the probability density function of a continuous uniform distribution is defined as -

fora < x <b


otherwise

where a and b are the minimum and maximum values of the


a+b
distribution, respectively, mean of a uniform distribution is ~2~ and the
(b-a)2
variance is 12

Example
In Python, the NumPy library provides functions for generating random numbers from a uniform distri­
bution, such as [Link](). These functions take as arguments the minimum and maxi­
mum values of the distribution and can be used to generate datasets with a uniform distribution.

Here is an example of generating a uniform distribution using Python's NumPy library -

import numpy as np
import [Link] as pit

uniform_data = [Link](low = 0, high= 1, size

[Link](uniform_data, bins=50, density=True)

[Link]('Uniform Distribution^]
Output
It will produce the following plot as the output -

Bimodal Distribution
In machine learning, a bimodal distribution is a probability distribution that has two distinct modes or
peaks. In other words, the distribution has two regions where the data values are most likely to occur, sep­
arated by a valley or trough where the data is less likely to occur.

Bimodal distributions can arise in various types of data, such as biometric measurements, economic indi­
cators, or social media metrics. They can represent different subpopulations within the dataset, or differ­
ent modes of behavior or trends over time.

Bimodal distributions can be identified and analyzed using various statistical methods, such as his­
tograms, kernel density estimations, or hypothesis testing. In some cases, bimodal distributions can be
fitted to specific probability distributions, such as the Gaussian mixture model, which allows for modeling
the underlying subpopulations separately.

Example
In Python, libraries such as NumPy, SciPy, and Matplotlib provide functions for generating and visualizing
bimodal distributions.

For example, the following code generates and plots a bimodal distribution -

import numpy as np
import [Link] as pit

imodal_data = [Link](([Link](loc=-2, scale= 1, size=5OOO),


[Link] , scale , size
I
Plot the histogram of the bimodal data

[Link](bimodal_data, bins=50, density=True)

pHS^ueJ^^^B

[Link]('Bimodal Distribution,)]

Output
On executing this code, you will get the following plot as the output -
Bimodal Distribution

0.200

0.175

0.150
Frequency

0.125

0.100

0.075

0.050

0.025

0.000
-6 -4 -2 0 2 4
value

Skewness and Kurtosis


Skewness and kurtosis are two important measures of the shape of a probability distribution in machine
learning.
Skewness refers to the degree of asymmetry of a distribution. A distribution is said to be skewed if it is not
symmetrical about its mean. Skewness can be positive, indicating that the tail of the distribution is longer
on the right-hand side, or negative, indicating that the tail of the distribution is longer on the left-hand
side. A skewness of zero indicates that the distribution is perfectly symmetrical.

Kurtosis refers to the degree of peakedness of a distribution. A distribution with high kurtosis has a
sharper peak and heavier tails than a normal distribution, while a distribution with low kurtosis has a
flatter peak and lighter tails. Kurtosis can be positive, indicating a higher-than-normal peak, or negative,
indicating a lower than normal peak. A kurtosis of zero indicates a normal distribution.

Both skewness and kurtosis can have important implications for machine learning algorithms, as they can
affect the assumptions of the models and the accuracy of the predictions. For example, a highly skewed
distribution may require data transformation or the use of non-parametric methods, while a highly kur-
totic distribution may require different statistical models or more robust estimation methods.

Example
In Python, the SciPy library provides functions for calculating skewness and kurtosis of a dataset. For
example, the following code calculates the skewness and kurtosis of a dataset using the skew() and kurto-
sis() functions -

from [Link] import skew, kurtosis


data = [Link](0,1,1000)

[skewness - skew(data)^^^^^^^^^^^^^^^B
/'.u .■..■-■I.-.

rint('Skewness:', skewness)

This code generates a random dataset of 1000 samples from a normal distribution with mean 0 and
standard deviation 1. It then calculates the skewness and kurtosis of the dataset using the skew() and kur-
tosis() functions from the SciPy library. Finally, it prints the results to the console.

Output
On executing this code, you will get the following output -

Skewness: -0.04119418903611285
Kurtosis:-0.1152250196054534

The resulting skewness and kurtosis values should be close to zero for a normal distribution.
Bias and Variance

Bias and variance are two important concepts in machine learning that describe the sources of error in a
model's predictions. Bias refers to the error that results from oversimplifying the underlying relationship
between the input features and the output variable, while variance refers to the error that results from
being too sensitive to fluctuations in the training data.

In machine learning, we strive to minimize both bias and variance in order to build a model that can
accurately predict on unseen data. A model with high bias may be too simplistic and underfit the training
data, while a model with high variance may overfit the training data and fail to generalize to new data.

Example
Below is an implementation example in Python that illustrates how bias and variance can be analyzed
using the Boston Housing dataset -

import pandas as
from [Link] import load_boston

boston = load_bostonQ^^^^^^^^^B
X I> ' ' ■ I/ ;
from [Link] import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,

from [Link] import mean_squared_error^^^^^^^H

Ir = LinearRegressionQ

Output
The output shows the training and testing mean squared errors (MSE) of the linear regression model. The
training MSE is 21.64 and the testing MSE is 24.29, indicating that the model has a moderate level of bias
and variance.
Training MSE: 21.641412753226312
TestingMSE: 24.291119474973456

Reducing Bias and Variance


To reduce bias, we can use more complex models that can capture non-linear relationships in the data.

Example
Let's try a polynomial regression model -

from [Link] import PolynomialFeatures

poly = PolynomialFeatures(degree=2)^^^^^^^^M

X_test_poly = [Link](X_test)^^^^^^^^^|

pr = LinearRegressionQ^^^^^^^^^^^^^^^^M

train_preds - [Link](X_tram7poly)^^^^^^^^B
train_mse = mean_squared_error(y_train, train_preds)

t e s t. _ p r e d s p r. p redid (, X _ t e s
iest_mse = mean_squared_error(y_test, test_preds)
print("Testing MSE:", test_mse)

Output
The output shows the training and testing MSE of the polynomial regression model with degree=2. The
training MSE is 5.31 and the testing MSE is 14.18, indicating that the model has a lower bias but higher
variance compared to the linear regression model.

TrainingMSE: 5.31446956670908
Testing MSE: 14.183558207567042

Example
To reduce variance, we can use regularization techniques such as ridge regression or lasso regression. In
the following example, we will be using ridge regression -

from sklearn.linear_model import Ridge

ridge = Ridge(alpha-1)^^^^^^^^^^^^^^^^M

train_preds - [Link](X_train_poly)^^^^^^M
train_mse = mean_squared_error(y_train, train_preds)
12 rint("Training MSE:", train_mse)

test_£red^^ddgy3mdicfi}^e^^Dol^)^^^^^^|
testjmse = mean_squared_error(y_test, test_preds)

Output
The output shows the training and testing MSE of the ridge regression model with alpha= 1. The training
MSE is 9.03 and the testing MSE is 13.88 compared to the polynomial regression model, indicating that the
model has a lower variance but slightly higher bias.

TrainingMSE: 9.O322O93786O839
Testing MSE: 13.882093755326755

Example
We can further tune the hyperparameter alpha to find the optimal balance between bias and variance.
Let's see an example -

from sklearn.model_selection import GridSearchCVj

ridge_cv = GridSearchCV(Ridge(), param_grid, cv=5)


idge_cv.fit(X_train_poly, y_train

Output
The output shows the training and testing MSE of the ridge regression model with the optimal alpha value.

Training MSE: 8.326082686584716


Testing MSE: 12.873907256619141

The training MSE is 8.32 and the testing MSE is 12.87, indicating that the model has a good balance be­
tween bias and variance.
Hypothesis
In machine learning, a hypothesis is a proposed explanation or solution for a problem. It is a tentative
assumption or idea that can be tested and validated using data. In supervised learning, the hypothesis is
the model that the algorithm is trained on to make predictions on unseen data.

The hypothesis is generally expressed as a function that maps input data to output labels. In other words,
it defines the relationship between the input and output variables. The goal of machine learning is to find
the best possible hypothesis that can generalize well to unseen data.

The process of finding the best hypothesis is called model training or learning. During the training process,
the algorithm adjusts the model parameters to minimize the error or loss function, which measures the
difference between the predicted output and the actual output.

Once the model is trained, it can be used to make predictions on new data. However, it is important to
evaluate the performance of the model before using it in the real world. This is done by testing the model
on a separate validation set or using cross-validation techniques.

Properties of a Good Hypothesis


The hypothesis plays a critical role in the success of a machine learning model. A good hypothesis should
have the following properties -
1. Generalization - The model should be able to make accurate predictions on unseen data.
2. Simplicity - The model should be simple and interpretable, so that it is easier to understand
and explain.
3. Robustness - The model should be able to handle noise and outliers in the data.
4. Scalability - The model should be able to handle large amounts of data efficiently.

There are many types of machine learning algorithms that can be used to generate hypotheses, including
linear regression, logistic regression, decision trees, support vector machines, neural networks, and more.

Regression Analysis

Regression is a type of supervised learning algorithm in machine learning. The key objective of regression­
based tasks is to predict output labels or responses which are continues numeric values, for the given input
data. The output will be based on what the model has learned in training phase.

Basically, regression models use the input data features (independent variables) and their corresponding
continuous numeric output values (dependent or outcome variables) to learn specific association between
inputs and corresponding outputs.
Y-Output Variables,
(dependent on Input)

X-Input Variables
(independent in nature)

Types of Regression Models


Regression models are of following two types -

Simple regression model - This is the most basic regression model in which predictions are formed from
a single, univariate feature of the data.
Multiple regression model - As name implies, in this regression model the predictions are formed from
multiple features of the data.

Building a Regressor in Python


Regressor model in Python can be constructed just like we constructed the classifier. Scikit-learn, a Python
library for machine learning can also be used to build a regressor in Python.

In the following example, we will be building basic regression model that will fit a line to the data i.e. linear
regressor. The necessary steps for building a regressor in Python are as follows -

Step 1: Importing necessary python package


For building a regressor using scikit-learn, we need to import it along with other necessary packages. We
can import the by using following script -
from sklearn import linear_model
[Link] sin^^
import [Link] aspltW

Step 2: Importing dataset


After importing necessary package, we need a dataset to build regression prediction model. We can import
it from sklearn dataset or can use other one as per our requirement. We are going to use our saved input
data. We can import it with the help of following script -

input = r'[Link]'

Next, we need to load this data. We are using [Link] function to load it.

input_data = [Link](input, delimiter


X, y = input_data[:, :-l], inpuCdata[~7-l]^M

Step 3: Organizing data into training & testing sets


As we need to test our model on unseen data hence, we will divide our dataset into two parts: a training set
and a test set. The following command will perform it -

training_samples = int(0.6 *len(X))


^stin^amgle^Uen(X^aun^mining^^^^^^^^^^|
K_train, y_train = X[:training_samples], y[:training_samples]
K_test, y_test = X[training_samples:], y[training_samples:]M

Step 4: Model evaluation & prediction


After dividing the data into training and testing we need to build the model. We will be using LineaRegres-
sion() function of Scikit-learn for this purpose. Following command will create a linear regressor object.

reg_linear = linear_model.LinearRegression()

Next, train this model with the training samples as follows -

eg_linear.fit(X_train, y_train

Now, at last we need to do the prediction with the testing data.

_test_pred = reg_linear.predict(X_test)

Step 5: Plot & visualization


After prediction, we can plot and visualize it with the help of following script -

[Link](X_test, y_test, color = 'red')


[Link](X_test, y_test_pred, color = 'black1, linewidth = 2)

Output

In the above output, we can see the regression line between the data points.

Step 6: Performance computation


We can also compute the performance of our regression model with the help of various performance met­
rics as follows.

rint("Regressor model performance:")^^^^^^^^^^^^^^^^^^^^^^^^^^^^^B


rintf'Mean absolute error(MAE) =", round(sm.mean_absolute_error(y_test,y_test_pred), 2))[
rintf'Mean squared error(MSE) =", round(sm.mean_squared_error(y_test, y_test_pred), 2)) J
rint("Median absolute error =", round(sm.median_absolute_error(y_test, y_test_pred), 2))J
rint("Explain variance score =", round(sm.explained_variance_score(y_test, y_test_pred), 2))
rini( ;R2 score ", rouncl(sm.r2_score(yj:est, y_test_pred)

Output
Regressor model performance:
Mean absolute error(MAE) = 1.78
Mean squared error(MSE) = 3.89
Median absolute error = 2.01
Explain variance score = -0.09
R2 score = -0.09

Types of ML Regression Algorithms


The most useful and popular ML regression algorithm is Linear regression algorithm which further di­
vided into two types namely -
1. Simple Linear Regression algorithm
2. Multiple Linear Regression algorithm.

We will discuss about it and implement it in Python in the next chapter.

Applications
The applications of ML regression algorithms are as follows -

Forecasting or Predictive analysis - One of the important uses of regression is forecasting or predictive
analysis. For example, we can forecast GDP, oil prices or in simple words the quantitative data that changes
with the passage of time.

Optimization - We can optimize business processes with the help of regression. For example, a store man­
ager can create a statistical model to understand the peek time of coming of customers.

Error correction - In business, taking correct decision is equally important as optimizing the business
process. Regression can help us to take correct decision as well in correcting the already implemented
decision.

Economics - It is the most used tool in economics. We can use regression to predict supply, demand, con­
sumption, inventory investment etc.

Finance - A financial company is always interested in minimizing the risk portfolio and want to know the
factors that affects the customers. All these can be predicted with the help of regression model.
Linear Regression

Linear regression may be defined as the statistical model that analyzes the linear relationship between a
dependent variable with given set of independent variables. Linear relationship between variables means
that when the value of one or more independent variables will change (increase or decrease), the value of
dependent variable will also change accordingly (increase or decrease).

Mathematically the relationship can be represented with the help of following equation -

Y=mX+b

Here,

1. Y is the dependent variable we are trying to predict


2. X is the dependent variable we are using to make predictions
3. m is the slop of the regression line which represents the effect X has on Y.
4. b is a constant, known as the Y-intercept. If X = 0, Y would be equal to b.

Furthermore, the linear relationship can be positive or negative in nature as explained below -
Positive Linear Relationship
A linear relationship will be called positive if both independent and dependent variable increases. It can be
understood with the help of following graph -

Positive Linear Relationship


Negative Linear Relationship
A linear relationship will be called positive if independent increases and dependent variable decreases. It
can be understood with the help of following graph -

Negative Linear Relationship


Linear regression is of two types, "simple linear regression" and "multiple linear regression", which we are
going to discuss in the next two chapters of this tutorial.

Types of Linear Regression


Linear regression is of the following two types -

1. Simple Linear Regression


2. Multiple Linear Regression

Assumptions
The following are some assumptions about dataset that is made by Linear Regression model -

Multi-collinearity - Linear regression model assumes that there is very little or no multi-collinearity in
the data. Basically, multi-collinearity occurs when the independent variables or features have dependency
in them.

Auto-correlation - Another assumption Linear regression model assumes is that there is very little or no
auto-correlation in the data. Basically, auto-correlation occurs when there is dependency between residual
errors.

Relationship between variables - Linear regression model assumes that the relationship between re­
sponse and feature variables must be linear.
Simple Linear Regression

Simple linear regression is a type of regression analysis in which a single independent variable (also known
as a predictor variable) is used to predict the dependent variable. In other words, it models the linear rela­
tionship between the dependent variable and a single independent variable.

Python Implementation
Given below is an example that shows how to implement simple linear regression using the Pima-Indian-
Diabetes dataset in Python. We will also plot the regression line.

Data Preparation
First, we need to import the Diabetes dataset from scikit-learn and split it into training and testing sets. We
will use 80% of the data for training the model and the remaining 20% for testing.

from sklearn.model_selection import train_test_split

diabetes = load_diabetes()B

# Split the dataset mto training and testing sets


X_train, X_test, y_train,
fl y_test - train_test_split([Link][:, 2],
[Link],. test_size=0.2, randomIstate=C)^^^^^^^^^B

<_train = X_train.reshape(-1,1)
<_test = X_test.reshape(-1,1)M

Here, we are using the third feature (column) of the dataset, which represents the mean blood pressure, as
our independent variable (predictor variable) and the target variable as our dependent variable (response
variable).

Model Training
We will use scikit-learn's LinearRegression class to train a simple linear regression model on the training
data. The code for this is as follows -

from sklearn.linear_model import LinearRegression


Here, X_train represents the input feature (mean blood pressure) of the training data and y_train repre­
sents the output variable (target variable).

Model Testing
Once the model is trained, we can use it to make predictions on the test data. The code for this is as follows

Make predictions on the test dat

y_pred = lr_model.predict(X_test)

Here, X_test represents the input feature of the test data and y_pred represents the predicted output vari­
able (target variable).

Model Evaluation
We need to evaluate the performance of the model to determine its accuracy. We will use the mean squared
error (MSE) and the coefficient of determination (RA2) as evaluation metrics. The code for this is as follows

from [Link] import mean_squared_error, r2_score


fr Calculate the coefficient ot determination

Here, y_test represents the actual output variable of the test data.

Plotting the Regression Line


We can also visualize the regression line to see how well it fits the data. The code for this is as follows -

import [Link] as pit

[Link](X_train, y_train, color=,gray')

[Link](X_train, lr_model.predict(X_train), color='red'> linewidth-2)

[Link]('Mean Blood Pressure')

Show the plot


[Link]

Here, we are using the scatter() function from the matplotlib library to plot the training data points and
the plot() function to plot the regression line. The xlabel() and ylabel() functions are used to label the x-
axis and y-axis of the plot, respectively. Finally, we use the show() function to display the plot.

Complete Implementation Example


The complete code for implementing simple linear regression in Python is as follows -

[from [Link] import load_diabetes^^^^^^^M

from [Link] import mean squared error, r2 score

diabetes = load_diabetes()B

X_train, X_test, y_train, y_test = train_test_split([Link][:, 2],


[Link], test_size = 0.2, randomZstate = C)^^^^^^^^^M

fr Reshape the input data


' i - - i r. 1 - ,

lr_model = LinearRegressionQ^^B

y_pred = lr_model.predict(X_test)B

mse - mean_squared_error(y_test, y_pred)[

r2 = r2_score(y_test7y2pred)^^^^^^^M

printfMean Squared Error:1,


printfCoefficient of Determination?7r2)^^

[Link](X_train, y_train, color^grayO^B


If Plot the regression lint
[Link](X_train, lr_model.predict(X_train), color='red', linewidth=2)

Add axis labels


[Link]('Mean Blood Pressure')
[Link]('Disease Progression')!

Show the plot


pit. show!

Output
On executing this code, you will get the following plot as the output and it will also print the Mean Squared
Error and the Coefficient of Determination on the terminal -

Mean Squared Error: 4150.680189329983


Coefficient of Determination: 0.19057346847560164
350

Disease Progression 300

250

200

150

100

50

-0.10 -0.05 0.00 0.05 0.10 0.15


Mean Blood Pressure

Multiple Linear Regression

It is basically the extension of simple linear regression that predicts a response using two or more features.
Mathematically we can explain it as follows -
Consider a dataset having n observations, p features i.e. independent variables and y as one response i.e.
dependent variable the regression line for p features can be calculated as follows -

(*^i) ^0 1 ~i~ ^2^i2 "h ’ ’ * H- bpXjp

Here,h (z,-) is the predicted response value and &i, 62- • • • bp are the
regression coefficients.

Multiple Linear Regression models always includes the errors in the


data known as residual error which changes the calculation as follows -

(•£») &0 ~"L ^1*^ il ^2^i2 d- ’ ’ ’ bp'Fjp d-

We can also write the above equation as follows -

yi = h + ei or e, = yi - h (z£)

Python Implementation
To implement multiple linear regression in Python using Scikit-Learn, we can use the same LinearRegres-
sion class as in simple linear regression, but this time we need to provide multiple independent variables
as input.

Let's consider the Boston Housing dataset from Scikit-Learn and implement multiple linear regression
using it.

Example

1..! ' 1_ I -'//.GT .//

from [Link] import mean squared error, r2 score

mat p';:i Hi\’jv8cl.

manleach a > t

X_train, X_test, y_train, y_test = train_test_split([Link],


[Link], test_size=0.2, random_state=0)^^^^^^^B

lr_model = LinearRegressionQ^l
Make predictions on the test data
y_pred = lr_model.predict(X_test)

[Link]
[Link](x, y, color='red')

In this code, we first load the Boston Housing dataset using the load_boston() function from Scikit-Learn.
We then split the dataset into training and testing sets using the train_test_split() function.

Next, we create a LinearRegression object and fit it on the training data using the fit() method. We then
make predictions on the test data using the predict() method and calculate the mean squared error and co­
efficient of determination using the mean_squared_error() and r2_score() functions, respectively.

Finally, we plot the predicted values against the actual values using the scatter() function and add a
regression line to the plot using the plot() function. We label the x-axis and y-axis using the xlabel() and
ylabel() functions and display the plot using the show() function.

Output
When you execute the program, it will produce the following plot as the output and it will print the Mean
Squared Error and the Coefficient of Determination on the terminal -

Mean Squared Error: 33.44897999767653


Coefficient of Determination: 0.5892223849182507
50 -

40 -

30 -

20-

10 -

o-
10 30 40 50
Actual Values

Polynomial Regression
Polynomial Linear Regression is a type of regression analysis in which the relationship between the
independent variable and the dependent variable is modeled as an n-th degree polynomial function. Poly­
nomial regression allows for a more complex relationship between the variables to be captured, beyond the
linear relationship in Simple and Multiple Linear Regression.
Python Implementation
Here's an example implementation of Polynomial Linear Regression using the Boston Housing dataset
from Scikit-Learn -

Example
# Transform the input data to include polynomial features
train_poly = poly.fit_transform(X_train)
,test_poly = [Link](xZtest)^^B

lr_model = LinearRegressionQM

# Make predictions on the test data


y_pred = lr_model.predict(X_test_poly)

___________________________
printfCoefficient of Determination:', r 2)|

sort_idx = X_test[:, 12].argsortQ^^^M


X_test_sorted = X_test[sort_idx]
y_test_sorted = y_test[sortjdxl

[Link]

^^“TB

Output
When you execute the program, it will produce the following plot as the output and it will print the Mean
Squared Error and the Coefficient of Determination on the terminal -

Mean Squared Error: 2 5.215 79 7 6170 518 5 5


Coefficient of Determination: O.69O3318O65831567
Classification Algorithms
Classification is a type of supervised learning technique that involves predicting a categorical target vari­
able based on a set of input features. It is commonly used to solve problems such as spam detection, fraud
detection, image recognition, sentiment analysis, and many others.
The goal of a classification model is to learn a mapping function (f) between the input features (X) and the
target variable (Y). This mapping function is often represented as a decision boundary, which separates
different classes in the input feature space. Once the model is trained, it can be used to predict the class of
new, unseen examples.

Let us now take a look at the steps involved in building a classification model -

Data Preparation
The first step is to collect and preprocess the data. This involves cleaning the data, handling missing val­
ues, and converting categorical variables to numerical values.

Feature Extraction/Selection
The next step is to extract or select relevant features from the data. This is an important step because the
quality of the features can greatly impact the performance of the model. Some common feature selection
techniques include correlation analysis, feature importance ranking, and principal component analysis.

Model Selection
Once the features are selected, the next step is to choose an appropriate classification algorithm. There
are many different algorithms to choose from, each with its own strengths and weaknesses. Some popular
algorithms include logistic regression, decision trees, random forests, support vector machines, and neu­
ral networks
Model Training
After selecting a suitable algorithm, the next step is to train the model on the labeled training data. During
training, the model learns the mapping function between the input features and the target variable. The
model parameters are adjusted iteratively to minimize the difference between the predicted outputs and
the actual outputs.

Model Evaluation
Once the model is trained, the next step is to evaluate its performance on a separate set of validation data.
This is done to estimate the model's accuracy and generalization performance. Common evaluation met­
rics include accuracy, precision, recall, Fl-score, and area under the receiver operating characteristic (ROC)
curve.

Hyperparameter Tuning
In many cases, the performance of the model can be further improved by tuning its hyperparameters.
Hyperparameters are settings that are chosen before training the model and control aspects such as the
learning rate, regularization strength, and the number of hidden layers in a neural network. Grid search,
random search, and Bayesian optimization are some common techniques used for hyperparameter tuning.
Model Deployment
Once the model has been trained and evaluated, the final step is to deploy it in a production environment.
This involves integrating the model into a larger system, testing it on realworld data, and monitoring its
performance over time.

Types of Learners in Classification


We have two types of learners in respective to classification problems -

Lazy Learners
As the name suggests, such kind of learners waits for the testing data to be appeared after storing the
training data. Classification is done only after getting the testing data. They spend less time on training
but more time on predicting. Examples of lazy learners are K-nearest neighbor and case-based reasoning.

Eager Learners
As opposite to lazy learners, eager learners construct classification model without waiting for the testing
data to be appeared after storing the training data. They spend more time on training but less time on pre­
dicting. Examples of eager learners are Decision Trees, Naive Bayes and Artificial Neural Networks (ANN).
Building a Classifier in Python
Scikit-learn, a Python library for machine learning can be used to build a classifier in Python. The steps for
building a classifier in Python are as follows -

Step 1: Importing necessary python package


For building a classifier using scikit-learn, we need to import it. We can import it by using following script

import sklearn

Step 2: Importing dataset


After importing necessary package, we need a dataset to build classification prediction model. We can
import it from sklearn dataset or can use other one as per our requirement. We are going to use sklearn’s
Breast Cancer Wisconsin Diagnostic Database. We can import it with the help of following script -

from [Link] import load_breast_cancer

The following script will load the dataset;

data = load_breast_cancer()
We also need to organize the data and it can be done with the help of following scripts -

|label_names - data['target_names']^^
geJ^ataHarged^^^^^B
feature_names = data['feature_names']

The following command will print the name of the labels, 'malignant’ and 'benign’ in case of our database.

print(label_names)

The output of the above command is the names of the labels -

['malignant' 'benign']

These labels are mapped to binary values 0 and 1. Malignant cancer is represented by 0 and Benign cancer
is represented by 1.

The feature names and feature values of these labels can be seen with the help of following commands -

pr int(feature_name s [ 0 ])

The output of the above command is the names of the features for label 0 i.e. Malignant cancer -

ean radius
Similarly, names of the features for label can be produced as follows -

pr int(feature_name s [ 1 ])

The output of the above command is the names of the features for label 1 i.e. Benign cancer -

ean texture

We can print the features for these labels with the help of following command -

print(features[O])

This will give the following output -

[1.799e+01 1.038e+01 1.228e+02 1.001e+03 1.184e-01 2.776e-01 3.001e-01


1.471e-01 2.419e-01 7.871e-02 l.O95e+OO 9.053e-01 8.589e+OO 1.534e+02
6.399e-03 4.904e-02 5.373e-O2 1.587e-02 3.OO3e-O2 6.193e-O3 2.538e+01
1.733e+01 1.846e+02 2.019e+03 1.622e-01 6.656e-01 7.119e-01 2.654e-01
4.601e-01 1.189e-01]

We can print the features for these labels with the help of following command -

>rint(features[l])
This will give the following output -

[2.057e+01 1.777e+01 1.329e+02 1.326e+03 8.474e-02 7.864e-O2 8.690e-02


7.017e-02 1.812e-01 5.667e-02 5.435e-Ol 7.339e-01 3.398e+00 7.408e+01
5.225e-03 1.308e-02 1.860e-02 1.340e-02 1.389e-02 3.532e-O3 2.499e+Ol
2.341e+01 1.588e+02 1.956e+03 1.238e-01 1.866e-01 2.416e-01 1.860e-01
2.750e-01 8.902e-02]

Step 3: Organizing data into training & testing sets


As we need to test our model on unseen data, we will divide our dataset into two parts: a training set and a
test set. We can use train_test_split() function of sklearn python package to split the data into sets. The fol­
lowing command will import the function -

from [Link] import train_test_split

Now, next command will split the data into training & testing data. In this example, we are using taking 40
percent of the data for testing purpose and 60 percent of the data for training purpose -

train_test_split(features,labels,test_size = 0.40, random_state = 42)

Step 4: Model evaluation


After dividing the data into training and testing we need to build the model. We will be using Naive Bayes
algorithm for this purpose. The following commands will import the GaussianNB module -

from sklearn.naive_bayes import GaussianNB

Now, initialize the model as follows -

;nb = GaussianNBQ

Next, with the help of following command we can train the model -

model = gnb. fit (train, trainjabels)

Now, for evaluation purpose we need to make predictions. It can be done by using predictQ function as
follows -

>reds = [Link](test)

This will give the following output -

[1001100011101010111011011111101111110
1011011111111001111100110011100110010
1111110110000011111111001001001110110
1100011100110100110001110110010110100
1111111001111111111110111011011111100
0110101111011011101001111111101111101
001101]

The above series of Os and Is in output are the predicted values for the Malignant and Benign tumor
classes.

Step 5: Finding accuracy


We can find the accuracy of the model build in previous step by comparing the two arrays namely test-la­
bels and preds. We will be using the accuracy_score() function to determine the accuracy.

from [Link] import accuracy_score


print(accuracy_score(test_labels,preds))^B

The above output shows that NaiveBayes classifier is 9 5.17% accurate.

Classification Evaluation Metrics


The job is not done even if you have finished implementation of your Machine Learning application or
model. We must have to find out how effective our model is? There can be different evaluation metrics, but
we must choose it carefully because the choice of metrics influences how the performance of a machine
learning algorithm is measured and compared.

The following are some of the important classification evaluation metrics among which you can choose
based upon your dataset and kind of problem -

Confusion Matrix
1. Confusion Matrix - It is the easiest way to measure the performance of a classification prob­
lem where the output can be of two or more type of classes.

Various ML Classification Algorithms


The followings are some important ML classification algorithms -

1. Logistic Regression
2. K-Nearest Neighbors Regression
3. Support Vector Machine (SVM)
4. Decision Tree
5. Naive Bayes
6. Random Forest
7. Stochastic Gradient Descent

We will be discussing all these classification algorithms in detail in further chapters.


Applications
Some of the most important applications of classification algorithms are as follows -

1. Speech Recognition
2. Handwriting Recognition
3. Biometric Identification
4. Document Classification

In the subsequent chapters, we will discuss some of the most popular classification algorithms in machine
learning.

Logistic Regression

Logistic regression is a popular algorithm used for binary classification problems, where the target vari­
able is categorical with two classes. It models the probability of the target variable given the input features
and predicts the class with the highest probability.

Logistic regression is a type of generalized linear model, where the target variable follows a Bernoulli
distribution. The model consists of a linear function of the input features, which is transformed using the
logistic function to produce a probability value between 0 and 1.
The linear function is basically used as an input to another function such as g in the following relation -

hg (x) = g (Qr x} where 0 < hg < 1

Here, g is the logistic or sigmoid function which can be given as follows

9^) = --------- where z = x


1 + e~z
The sigmoid curve can be represented with the help of following graph. We can see the values of y-axis lie
between 0 and 1 and crosses the axis at 0.5.
The classes can be divided into positive or negative. The output comes under the probability of positive
class if it lies between 0 and 1. For our implementation, we are interpreting the output of hypothesis func­
tion as positive if it is > 0.5, otherwise negative.

Implementation in Python
Now we will implement the above concept of logistic regression in Python. For this purpose, we are using
a multivariate flower dataset named 'iris'. The iris dataset is a well-known dataset in machine learning,
consisting of measurements of the sepal length, sepal width, petal length, and petal width of three differ­
ent species of iris flowers. We will use logistic regression to predict the species of an iris flower given its
measurements.

Let us now check the steps to implement logistic regression in Python using the iris dataset -

Load the Dataset


First, we need to load the iris dataset into our Python environment. We can use the scikitlearn library to
load the dataset, as follows -

from [Link] import loadjris

X = [Link] # inputfeatures^^^^M
V = [Link] # target variable^^^H
Plot the Training Data
This is an optional step but for more clarification about the dataset we are plotting the training data as
follows -

import [Link] as pit

[Link](X_train[:, 0], X_train[:, 1], c~y_train)

Split the Dataset


Next, we need to split the dataset into a training set and a test set. We will use 70% of the data for training
and 30% for testing.

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,


random_state=42)
Create the Logistic Regression Model
We can use the LogisticRegression class from scikit-learn to create a logistic regression model. We will use
L2 regularization and set the regularization strength to 1.

elf = LogisticRegression(penalty='12', C=1.0, random_state=42)

Train the Model


We can train the model on the training set using the fit() method.

[Link](X_train, y_train

Make Predictions
Once the model is trained, we can use it to make predictions on the test set using the predictQ method.

red = [Link](X_test)

Evaluate the Model


Finally, we can evaluate the performance of the model using metrics such as accuracy, precision, recall,
and Fl-score.
froir [Link] import accuracy score, precision score,—

print('Accuracy:', accuracy_score(y_test, y_pred))^^^^^^^^M


print('Precision:', precision_score(y_test, y_pred, average='macro'))
printfRecall:', recall_score(y_test, y_pred, average='macro'))^^^B
print('Fl-score:', fl_score(y_test, y_pred, average='macro'))^^^B

Here, we have used the average parameter with the value 'macro' to calculate the metrics for each class
separately and then take the average.

Complete Implementation Example


Give below is the complete implementation example of logistic regression in python using the iris dataset

[from [Link] import

from sklearn.linear_model import LogisticRegression^^^^^^^^^^^^^^^B


import [Link] as
from [Link] import accuracy_score, precision_score, recalLscore, fl_score
X = [Link] # input features
y = [Link] # target variable

[Link](X_train[:, 0], X_train[:, 1], c=y_train)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

elf = LogisticRegression(penalty='12', 01.0, random_state=42)

clMt(X^ram^^ram)^^^^^^H

make predictions on the test set


y_pred = [Link](X_test)

# evaluate the performance of the model


rint('Precision:', precision_score(y_test, y_pred, average='macro'))
rint('Recall:', recall_score(y_test, y_pred, average='macro'))^^^B
rint('Fl-score:', fl_score(y_test, y_pred, average='macro'))^^^M

Output
When you execute this code, it will produce the following plot as the output -

Accuracy: 1.0
Precision: 1.0
Recall: 1.0
Fl-score: 1.0
Iris Training Data

4.0 -
Sepal width (cm)

3.5 -

3.0 -

2.5 -

2.0 -
4.5 5.0 5.5 6.0 6.5 7.0 7.5
Sepal length (cm)

K-Nearest Neighbors (KNN)

KNN is a supervised learning algorithm that can be used for both classification and regression problems.
The main idea behind KNN is to find the k-nearest data points to a given test data point and use these
nearest neighbors to make a prediction. The value of k is a hyperparameter that needs to be tuned, and it
represents the number of neighbors to consider.

For classification problems, the KNN algorithm assigns the test data point to the class that appears most
frequently among the k-nearest neighbors. In other words, the class with the highest number of neighbors
is the predicted class.

For regression problems, the KNN algorithm assigns the test data point the average of the k-nearest neigh­
bors' values.

The distance metric used to measure the similarity between two data points is an essential factor that
affects the KNN algorithm's performance. The most commonly used distance metrics are Euclidean dis­
tance, Manhattan distance, and Minkowski distance.

Working of KNN Algorithm


The KNN algorithm can be summarized in the following steps -

1. Load the data - The first step is to load the dataset into memory. This can be done using vari­
ous libraries such as pandas or numpy.
2. Split the data - The next step is to split the data into training and test sets. The training set is
used to train the KNN algorithm, while the test set is used to evaluate its performance.
3. Normalize the data - Before training the KNN algorithm, it is essential to normalize the data
to ensure that each feature contributes equally to the distance metric calculation.
4. Calculate distances - Once the data is normalized, the KNN algorithm calculates the dis­
tances between the test data point and each data point in the training set.
5. Select k-nearest neighbors - The KNN algorithm selects the k-nearest neighbors based on
the distances calculated in the previous step.
6. Make a prediction - For classification problems, the KNN algorithm assigns the test data
point to the class that appears most frequently among the k-nearest neighbors. For regression
problems, the KNN algorithm assigns the test data point the average of the k-nearest neigh­
bors' values.
7. Evaluate performance - Finally, the KNN algorithm's performance is evaluated using various
metrics such as accuracy, precision, recall, and Fl-score.

Implementation in Python
Now that we have discussed the KNN algorithm's theory, let's implement it in Python using scikit-learn.
Scikit-learn is a popular library for Machine Learning in Python and provides various algorithms for clas­
sification and regression problems.

We will use the Iris dataset, which is a popular dataset in Machine Learning and contains information
about three different species of Iris flowers. The dataset has four features, including the sepal length, sepal
width, petal length, and petal width, and a target variable, which is the species of the flower.

To implement KNN in Python, we need to follow the steps mentioned earlier. Here's the Python code for
implementing KNN on the Iris dataset -
Example

from [Link] import


from [Link] import train_test_split
from [Link] import StandardScalerJ
from [Link] import KNeighborsClassifier|
from [Link] import accuracylscore^^^M

X_train, X_test, y_train, y_test = train_test_split([Link],


[Link], test_size=0.35, random_state=42)

scaler = StandardScalerQ^^^^^^^^^M
j 'i ।~

knn = KNeighborsClassifier(n_neighbors=5)

ftram the KNN algorithm


[Link](X_train, y_train

accuracy = accuracy_score(y_test, y_pred)^^^B


print("Accuracy: {:.2f}%".format(accuracy*100))|

Output
When you execute this code, it will produce the following output -

Accuracy: 98.11%

Naive Bayes Algorithm

The Naive Bayes algorithm is a classification algorithm based on Bayes' theorem. The algorithm assumes
that the features are independent of each other, which is why it is called "naive." It calculates the prob­
ability of a sample belonging to a particular class based on the probabilities of its features. For example,
a phone may be considered as smart if it has touch-screen, internet facility, good camera, etc. Even if all
these features are dependent on each other, but all these features independently contribute to the proba­
bility of that the phone is a smart phone.

In Bayesian classification, the main interest is to find the posterior probabilities i.e. the probability of a
label given some observed features, P( L I features). With the help of Bayes theorem, we can express
this in quantitative form as follows -

P (L) P (features\L)
P (L\ features) =

Here,

• P (Z| features) is the posterior probability of class.

• P(L) is the prior probability of class.

• P (features\L) is the likelihood which is the probability of


predictor given class.

• P (features) is the prior probability of predictor.

In the Naive Bayes algorithm, we use Bayes' theorem to calculate the probability of a sample belonging to
a particular class. We calculate the probability of each feature of the sample given the class and multiply
them to get the likelihood of the sample belonging to the class. We then multiply the likelihood with the
prior probability of the class to get the posterior probability of the sample belonging to the class. We repeat
this process for each class and choose the class with the highest probability as the class of the sample.

Types of Naive Bayes Algorithm


There are three types of Naive Bayes algorithm -

1. Gaussian Naive Bayes - This algorithm is used when the features are continuous variables
that follow a normal distribution. It assumes that the probability distribution of each feature
is Gaussian, which means it is a bell-shaped curve.
2. Multinomial Naive Bayes - This algorithm is used when the features are discrete variables. It
is commonly used in text classification tasks where the features are the frequency of words in
a document.
3. Bernoulli Naive Bayes - This algorithm is used when the features are binary variables. It
is also commonly used in text classification tasks where the features are whether a word is
present or not in a document.

Implementation in Python
Here we will implement the Gaussian Naive Bayes algorithm in Python. We will use the iris dataset, which
is a popular dataset for classification tasks. It contains 150 samples of iris flowers, each with four features:
sepal length, sepal width, petal length, and petal width. The flowers belong to three classes: setosa, versi­
color, and virginica.

First, we will import the necessary libraries and load the datase -

1
from [Link] import load_iris^^^^^^^M
from [Link] import train_test_split
from sklearn.naive_bayes import GaussianNB^^^H

iris : loadjnsO^^B

X_train, X_test, y_train, y_test = train_test_split([Link],


[Link], test_size=0.35, random_state=0)

We then create an instance of the Gaussian Naive Bayes classifier and train it on the training set -

” C : KI< i i ।;
We can now use the trained classifier to make predictions on the testing set -

make predictions on the testing dat


y_pred = [Link](X_test)

We can evaluate the performance of the classifier by calculating its accuracy -

accuracy = [Link](y_pred == y_test) / len(y_test) print( "Accuracy:", accuracy)

Complete Implementation Example


Given below is the complete implementation example of Naive Bayes Classification algorithm in python
using the iris dataset -

# split the dataset into training and testing sets


X_train, X_test, y_train,
fl y_test - train_test_split([Link],
[Link], test_size=0.3 5, random_state = 0)^^^^^^M

#make predictions on the testing data


y_pred = [Link](X_test)

accuracy = [Link](y_pred -- y_test) / len(y_test)

Output
When you execute this program, it will produce the following output -

Accuracy: 0.9622641509433962
Decision Trees Algorithm
The Decision Tree algorithm is a hierarchical tree-based algorithm that is used to classify or predict out­
comes based on a set of rules. It works by splitting the data into subsets based on the values of the input
features. The algorithm recursively splits the data until it reaches a point where the data in each subset
belongs to the same class or has the same value for the target variable. The resulting tree is a set of decision
rules that can be used to make predictions or classify new data.

The Decision Tree algorithm works by selecting the best feature to split the data at each node. The best
feature is the one that provides the most information gain or the most reduction in entropy. Information
gain is a measure of the amount of information gained by splitting the data at a particular feature, while
entropy is a measure of the randomness or disorder in the data. The algorithm uses these measures to de­
termine the best feature to split the data at each node.

The example of a binary tree for predicting whether a person is fit or unfit providing various information
like age, eating habits and exercise habits, is given below -
In the above decision tree, the question are decision nodes and final outcomes are leaves.
Types of Decision Tree Algorithm
There are two main types of Decision Tree algorithm -

1. Classification Tree - A classification tree is used to classify data into different classes or
categories. It works by splitting the data into subsets based on the values of the input features
and assigning each subset to a different class.
2. Regression Tree - A regression tree is used to predict numerical values or continuous vari­
ables. It works by splitting the data into subsets based on the values of the input features and
assigning each subset a numerical value.

Implementation in Python
Let's implement the Decision Tree algorithm in Python using a popular dataset for classification tasks
named Iris dataset. It contains 150 samples of iris flowers, each with four features: sepal length, sepal
width, petal length, and petal width. The flowers belong to three classes: setosa, versicolor, and virginica.

First, we will import the necessary libraries and load the dataset -

[from [Link] import loadJris^^^^^^^B


from [Link] import train_test_split
Ifrom [Link] import DecisionTreeClassifier^^B
# Load the ins dataset
iris = load_iris()

X_train, X_test, y_train, y_test = train_test_split([Link],


[Link], test_size=0.3, random_stateT)^^^^^^^B

We then create an instance of the Decision Tree classifier and train it on the training set -

dt^t(^^ram^^ram)^^^^^^H

We can now use the trained classifier to make predictions on the testing set -

Make predictions on the testing data


y_pred = [Link](X_test)

We can evaluate the performance of the classifier by calculating its accuracy -

accuracy = [Link](y_pred == y_test) / len(y_test)


print("Accuracy:", accuracy)

We can visualize the Decision Tree using Matplotlib library -

import [Link] as pit


from [Link] import plot_tree

plot_tree(dtc, filled=True, feature_names=iris.feature_names,

The plot_tree function from the [Link] module can be used to plot the Decision Tree. We can pass
in the trained Decision Tree classifier, the filled argument to fill the nodes with color, the feature_names
argument to label the features, and the class_names argument to label the target classes. We also specify
the figsize argument to set the size of the figure and call the show function to display the plot.

Complete Implementation Example


Given below is the complete implementation example of Decision Tree Classification algorithm in python
using the iris dataset -

import numpy as n
from [Link] import train_test_split
from [Link] import DecisionTreeClassifier^^B

in^ToaOnsQ^^B

X_train, X_test, y_train, y_test = train_test_split([Link], [Link], test_size=0.3, random_state=0)

d^^TecisionrreeClassifie^)^B

# Make predictions on the testing data


y_pred = [Link](X_test)

accuracy = [Link](y_pred == y_test) / len(y_test)


print("Accuracy:", accuracy)

# Visualize the Decision Tree using Matplotlib


import [Link] as pit
class_names=iris.target_names)

Output
This will create a plot of the Decision Tree that looks like this -

gin* - 0 0 ginl -00 gmi -00 ginl - 0.0


samples - 3 samples - 1 samples - 3 samples • 1
value • (0. 0. 3] value - [0. 1. 0] value - (0. 0. 3] value - [0. 1. 0]
class “ virginica class - versicolor class « virginica class - versicolor
Accuracy: 0.9777777777777777

As you can see, the plot shows the structure of the Decision Tree, with each node representing a decision
based on the value of a feature, and each leaf node representing a class or numerical value. The color of
each node indicates the majority class or value of the samples in that node, and the numbers at the bottom
indicate the number of samples that reach that node.

Support Vector Machine


Support vector machines (SVMs) are powerful yet flexible supervised machine learning algorithm which
is used for both classification and regression. But generally, they are used in classification problems. In
1960s, SVMs were first introduced but later they got refined in 1990 also. SVMs have their unique way of
implementation as compared to other machine learning algorithms. Now a days, they are extremely pop­
ular because of their ability to handle multiple continuous and categorical variables.

Working of SVM
The goal of SVM is to find a hyperplane that separates the data points into different classes. A hyperplane
is a line in 2D space, a plane in 3D space, or a higher-dimensional surface in n-dimensional space. The
hyperplane is chosen in such a way that it maximizes the margin, which is the distance between the hyper­
plane and the closest data points of each class. The closest data points are called the support vectors.
The distance between the hyperplane and a data point "x" can be calculated using the formula -

distance = (w. x + b) / ||w|

where "w" is the weight vector, "b" is the bias term, and "I |w||" is the Euclidean norm of the weight vector.
The weight vector "w" is perpendicular to the hyperplane and determines its orientation, while the bias
term "b" determines its position.

The optimal hyperplane is found by solving an optimization problem, which is to maximize the margin
subject to the constraint that all data points are correctly classified. In other words, we want to find the
hyperplane that maximizes the margin between the two classes while ensuring that no data point is mis­
classified. This is a convex optimization problem that can be solved using quadratic programming.

If the data points are not linearly separable, we can use a technique called kernel trick to map the data
points into a higher-dimensional space where they become separable. The kernel function computes the
inner product between the mapped data points without computing the mapping itself. This allows us to
work with the data points in the higherdimensional space without incurring the computational cost of
mapping them.

Let's understand it in detail with the help of following diagram -


Margin

X-Axis

Given below are the important concepts in SVM -


1. Support Vectors - Datapoints that are closest to the hyperplane is called support vectors. Sep­
arating line will be defined with the help of these data points.
2. Hyperplane - As we can see in the above diagram it is a decision plane or space which is
divided between a set of objects having different classes.
3. Margin - It may be defined as the gap between two lines on the closet data points of different
classes. It can be calculated as the perpendicular distance from the line to the support vectors.
Large margin is considered as a good margin and small margin is considered as a bad margin.

Implementation in Python
We will use the scikit-learn library to implement SVM in Python. Scikit-learn is a popular machine learn­
ing library that provides a wide range of algorithms for classification, regression, clustering, and dimen­
sionality reduction tasks.

We will use the famous Iris dataset, which contains the sepal length, sepal width, petal length, and petal
width of three species of iris flowers: Iris setosa, Iris versicolor, and Iris virginica. The goal is to classify the
flowers into their respective species based on these four features.

Example
_ __
from [Link] import train_test_split
from [Link] import accuracy_score

ins loadjnsO^^B

X_train, X_test, y_train, y_test = train_test_split([Link],


[Link], test_size=0.2, random_state=42)

# make predictions on the testing set


y_pred = [Link](X_test)

# calculate the accuracy ot the classifier


accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)
We start by importing the necessary modules from scikit-learn: load_iris to load the iris dataset,
train_test_split to split the data into training and testing sets, SVC to create an SVM classifier with a linear
kernel, and accuracy_score to calculate the accuracy of the classifier.

We load the iris dataset using load_iris and split the data into training and testing sets using train_test_s-
plit. We use a test size of 0.2, which means that 20% of the data will be used for testing and 80% for train­
ing. We set the random state to 42 to ensure reproducibility of the results.

We create an SVM classifier with a linear kernel using SVC(kernel='linear'). We then train the SVM classi­
fier on the training set using [Link](X_train, y_train).

Once the classifier is trained, we make predictions on the testing set using [Link](X_test). We then
calculate the accuracy of the classifier using accuracy_score(y_test, y_pred) and print it to the console.

Output
The output of the code should be something like this -

Accuracy: 1.0

Tuning SVM Parameters


In practice, SVMs often require tuning of their parameters to achieve optimal performance. The most
important parameters to tune are the kernel, the regularization parameter C, and the kernel-specific
parameters.
The kernel parameter determines the type of kernel to use. The most common kernel types are linear,
polynomial, radial basis function (RBF), and sigmoid. The linear kernel is used for linearly separable data,
while the other kernels are used for non-linearly separable data.

The regularization parameter C controls the trade-off between maximizing the margin and minimizing
the classification error. A higher value of C means that the classifier will try to minimize the classification
error at the expense of a smaller margin, while a lower value of C means that the classifier will try to max­
imize the margin even if it means more misclassifications.

The kernel-specific parameters depend on the type of kernel being used. For example, the polynomial
kernel has parameters for the degree of the polynomial and the coefficient of the polynomial, while the RBF
kernel has a parameter for the width of the Gaussian function.

We can use cross-validation to tune the parameters of the SVM. Cross-validation involves splitting the data
into several subsets and training the classifier on each subset while using the remaining subsets for test­
ing. This allows us to evaluate the performance of the classifier on different subsets of the data and choose
the best set of parameters.

Example
from [Link] import GridSearchCV
define the parameter grid
param_grid =
'0: [0.1,1,10,100],
'kernel': ['linear', 'poly', 'rbf', 'sigmoid'],

'coefO': [0.0,0.1, 0.5],


'gamma': ['scale', 'auto'

sv^^VCol^^^^H

;rid_search = GridSearchCV(svm, param_grid, cv=5)B

print("Best parameters:", grid_search.best_params_)


print("Best accuracy:", grid_search.best_score_)

We start by importing the GridSearchCV module from scikit-learn, which is a tool for performing grid
search on a set of parameters. We define a parameter grid that contains the possible values for each param­
eter we want to tune.

We create an SVM classifier using SVC() and then pass it to GridSearchCV along with the parameter grid
and the number of cross-validation folds (cv=5). We then call grid_search.fit(X_train, y_train) to perform
the grid search.
Once the grid search is complete, we print the best set of parameters and their accuracy using
grid_search.best_params_ and grid_search.best_score_, respectively.

Output
On executing this program, you will get the following output -

Best parameters: {'C: 0.1, 'coefO': 0.5, 'degree': 3, 'gamma': 'scale', 'kernel': 'poly'}
Best accuracy: 0.975

This means that the best set of parameters found by the grid search are: C=0.1, coef0=0.5, degree=3,
gamma=scale, and kernel=poly. The accuracy achieved by this set of parameters on the training set is
97.5%.

You can now use these parameters to create a new SVM classifier and test its performance on the testing
set.

Random Forest

Random Forest is a machine learning algorithm that uses an ensemble of decision trees to make predic­
tions. The algorithm was first introduced by Leo Breiman in 2001. The key idea behind the algorithm is to
create a large number of decision trees, each of which is trained on a different subset of the data. The pre­
dictions of these individual trees are then combined to produce a final prediction.
Working of Random Forest Algorithm
We can understand the working of Random Forest algorithm with the help of following steps -

1. Step 1 - First, start with the selection of random samples from a given dataset.
2. Step 2 - Next, this algorithm will construct a decision tree for every sample. Then it will get
the prediction result from every decision tree.
3. Step 3 - In this step, voting will be performed for every predicted result.
4. Step 4 - At last, select the most voted prediction result as the final prediction result.

The following diagram illustrates how the Random Forest Algorithm works -
Random Forest is a flexible algorithm that can be used for both classification and regression tasks. In
classification tasks, the algorithm uses the mode of the predictions of the individual trees to make the final
prediction. In regression tasks, the algorithm uses the mean of the predictions of the individual trees.

Advantages of Random Forest Algorithm


Random Forest algorithm has several advantages over other machine learning algorithms. Some of the key
advantages are -

1. Robustness to Overfitting - Random Forest algorithm is known for its robustness to overfit­
ting. This is because the algorithm uses an ensemble of decision trees, which helps to reduce
the impact of outliers and noise in the data.
2. High Accuracy - Random Forest algorithm is known for its high accuracy. This is because the
algorithm combines the predictions of multiple decision trees, which helps to reduce the im­
pact of individual decision trees that may be biased or inaccurate.
3. Handles Missing Data - Random Forest algorithm can handle missing data without the need
for imputation. This is because the algorithm only considers the features that are available for
each data point and does not require all features to be present for all data points.
4. Non-Linear Relationships - Random Forest algorithm can handle non-linear relationships
between the features and the target variable. This is because the algorithm uses decision trees,
which can model non-linear relationships.
5. Feature Importance - Random Forest algorithm can provide information about the impor­
tance of each feature in the model. This information can be used to identify the most impor­
tant features in the data and can be used for feature selection and feature engineering.

Implementation of Random Forest Algorithm in Python


Let's take a look at the implementation of Random Forest Algorithm in Python. We will be using the scikit-
learn library to implement the algorithm. The scikit-learn library is a popular machine learning library
that provides a wide range of algorithms and tools for machine learning.

Step 1 - Importing the Libraries


We will begin by importing the necessary libraries. We will be using the pandas library for data manipula­
tion, and the scikit-learn library for implementing the Random Forest algorithm.

from [Link] import RandomForestClassifier

Step 2 - Loading the Data


Next, we will load the data into a pandas dataframe. For this tutorial, we will be using the famous Iris
dataset, which is a classic dataset for classification tasks.

t Loading the iris dataset


iris = pdTead_csv('[Link] [Link]/ml/machine-learningdatabases/iris/[Link]',

[Link] = ['sepal_length', 'sepaLwidth', 'petal_length','petal_width', 'species'

Step 3 - Data Preprocessing


Before we can use the data to train our model, we need to preprocess it. This involves separating the fea­
tures and the target variable and splitting the data into training and testing sets.

Separating the features and target variable


/ .1 - ' :
/I '

X_train, X_test, y_train, y_test - train_test_split(X, y, test_size=0.35, random_state=42)

Step 4 - Training the Model


Next, we will train our Random Forest classifier on the training data.

Creating the Random Forest classifier object


rfc = RandomForestClassifier(n_estimators=100)

Step 5 - Making Predictions


Once we have trained our model, we can use it to make predictions on the test data.

Making predictions on the test data


y_pred = [Link](X_test)

Step 6 - Evaluating the Model


Finally, we will evaluate the performance of our model using various metrics such as accuracy, precision,
recall, and Fl-score.

from [Link] import accuracy score, precision score,

accuracy = accuracy_score(y_test, y_pred)^^^^^^^^^^M


precision = precision_score(y_test, y_pred, average='weighted')
[recall = recall_score(y_test, y_pred, average='weighted')
fl = fl_score(y_test, y_pred, average='weighted')^^M

print("Accuracy:", accuracy)^^^^^^^^^^^^^^B
print("Precision:",

Complete Implementation Example


Below is the complete implementation example of Random Forest Algorithm in python using the iris
dataset -

# Loading the ins dataset


iris______ =______ pd.read_csv('[Link]
header=None)

[Link] = ['sepaljength', 'sepaLwidth', 'petaLlength', 'petaLwidth', 'species']


y = [Link]:, -1

X_train, X_test, y_train, y_test = train_test_split(X, y,


test_size=0.35, random_state=42)

rfc = RandomForestClassifier(n_estimators=IOO)

y_pred = [Link](X_test)

from [Link] import accuracy_score, precision_score,


ecall_score, fl_score

t Calculating the accuracy, precision, recall, and F1 -score


accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred, average='weighted')
Recall = recall_score(y_test, y_pred, average='weightecF)^^^B
fl = fl_score(y_test,y_pred, average='weighted’)^^^^^^M
rint("Precision:", precision)

Output
This will give us the performance metrics of our Random Forest classifier as follows -

Accuracy: 0.9811320754716981
Precision: 0.9821802935010483
Recall: 0.9811320754716981
Fl-score: 0.9811157396063056

Confusion Matrix

It is the easiest way to measure the performance of a classification problem where the output can be of two
or more type of classes. A confusion matrix is nothing but a table with two dimensions viz. "Actual" and
"Predicted" and furthermore, both the dimensions have "True Positives (TP)", "True Negatives (TN)", "False
Positives (FP)", "False Negatives (FN)" as shown below -
Actual
o

i True Positives (TP) False Positives (FP)

Predicted

o False Negatives (FN) True Negatives (TN)

Explanation of the terms associated with confusion matrix are as follows -

1. True Positives (TP) - It is the case when both actual class & predicted class of data point is 1.
2. True Negatives (TN) - It is the case when both actual class & predicted class of data point is 0.
3. False Positives (FP) - It is the case when actual class of data point is 0 & predicted class of data
point is 1.
4. False Negatives (FN) - It is the case when actual class of data point is 1 & predicted class of
data point is 0.
How to Implement Confusion Matrix in Python?
To implement the confusion matrix in Python, we can use the confusion_matrix() function from the
[Link] module of the scikit-learn library. Here is an simple example of how to use the confu-
sion_matrix() function -

from [Link] import confusion matrix

>rint(cm)

In this example, we have two arrays: y_actual contains the actual values of the target variable, and y_pred
contains the predicted values of the target variable. We then call the confusion_matrix() function, pass­
ing in y_actual and y_pred as arguments. The function returns a 2D array that represents the confusion
matrix.

The output of the code above will look like this -


[[3 1]
[2 4]]

We can also visualize the confusion matrix using a heatmap. Below is how we can do that using the
heatmapQ function from the seaborn library

[Link](cm, annot=True, cmap='summer'

This will produce a heatmap that shows the confusion matrix -


0

In this heatmap, the x-axis represents the predicted values, and the y-axis represents the actual values. The
color of each square in the heatmap indicates the number of samples that fall into each category.
Stochastic Gradient Descent
Gradient Descent is a popular optimization algorithm that is used to minimize the cost function of a ma­
chine learning model. It works by iteratively adjusting the model parameters to minimize the difference
between the predicted output and the actual output. The algorithm works by calculating the gradient of
the cost function with respect to the model parameters and then adjusting the parameters in the opposite
direction of the gradient.

Stochastic Gradient Descent is a variant of Gradient Descent that updates the parameters for each training
example instead of updating them after evaluating the entire dataset. This means that instead of using the
entire dataset to calculate the gradient of the cost function, SGD only uses a single training example. This
approach allows the algorithm to converge faster and requires less memory to store the data.

Working of Stochastic Gradient Descent Algorithm


Stochastic Gradient Descent works by randomly selecting a single training example from the dataset and
using it to update the model parameters. This process is repeated for a fixed number of epochs, or until the
model converges to a minimum of the cost function.

Here's how the Stochastic Gradient Descent algorithm works -

1. Initialize the model parameters to random values.


2. For each epoch, randomly shuffle the training data.
3. For each training example -
a. Calculate the gradient of the cost function with respect to the model parameters,
b. Update the model parameters in the opposite direction of the gradient.
4. Repeat until convergence

The main difference between Stochastic Gradient Descent and regular Gradient Descent is the way that the
gradient is calculated and the way that the model parameters are updated. In Stochastic Gradient Descent,
the gradient is calculated using a single training example, while in Gradient Descent, the gradient is calcu­
lated using the entire dataset.

Implementation of Stochastic Gradient Descent in Python


Let's look at an example of how to implement Stochastic Gradient Descent in Python. We will use the
scikit-learn library to implement the algorithm on the Iris dataset which is a popular dataset used for
classification tasks. In this example we will be predicting Iris flower species using its two features namely
sepal width and sepal length -

Example

■Il
import numpy as np^^H
X_data, y_data = [Link], [Link]

# Getting the Iris dataset with only the first two attributes

X_train, X_test, y_train, y_test = train_test_split(x7yj^^^^B


lest-Size ; rand e in state

scaler = StandardScaler().fit(X_train)
X_train [Link](X_train)|
X_test = [Link](X_test)^H

# create the linear model SGDclassifier


clfmodeLSGD = SGDClassifier(alpha=0.001, max_iter=200)

clfmode^G^^t(^jram^^mm)^^B

from sklearn importmetncs^^^^^^^^M


y_train_pred = clfmodel_SGD.predict(X_train)
print ("\nThe Accuracy of SGD classifier
metrics.accuracy_score(y_train, y_train_pred)* 100)

Output
When you run this code, it will produce the following output -

The Accuracy of SGD classifier is: 77.5

Clustering Algorithms

Clustering methods are one of the most useful unsupervised ML methods. These methods used to find
similarity as well as relationship patterns among data samples and then cluster those samples into groups
having similarity based on features. Clustering is important because it determines the intrinsic grouping
among the present unlabeled data. They basically make some assumptions about data points to constitute
their similarity. Each assumption will construct different but equally valid clusters.

For example, below is the diagram which shows clustering system grouped together the similar kind of
data in different clusters -

Cluster Formation Methods


It is not necessary that clusters will be formed in spherical form. Followings are some other cluster forma­
tion methods -
1. Density-based - In these methods, the clusters are formed as the dense region. The advantage
of these methods is that they have good accuracy as well as good ability to merge two clusters.
Ex. Density-Based Spatial Clustering of Applications with Noise (DBSCAN), Ordering Points to
identify Clustering structure (OPTICS) etc.
2. Hierarchical-based - In these methods, the clusters are formed as a tree type structure based
on the hierarchy. They have two categories namely, Agglomerative (Bottom up approach) and
Divisive (Top down approach). Ex. Clustering using Representatives (CURE), Balanced itera­
tive Reducing Clustering using Hierarchies (BIRCH) etc.
3. Partitioning - In these methods, the clusters are formed by portioning the objects into k
clusters. Number of clusters will be equal to the number of partitions. Ex. Kmeans, Clustering
Large Applications based upon randomized Search (CLARANS).
4. Grid - In these methods, the clusters are formed as a grid like structure. The advantage of
these methods is that all the clustering operation done on these grids are fast and indepen­
dent of the number of data objects. Ex. Statistical Information Grid (STING), Clustering in
Quest (CLIQUE).

It is not necessary that clusters will be formed in spherical form. Followings are some other cluster forma­
tion methods -

Density-based
In these methods, the clusters are formed as the dense region. The advantage of these methods is that they
have good accuracy as well as good ability to merge two clusters. Ex. Density-Based Spatial Clustering of
Applications with Noise (DBSCAN), Ordering Points to identify Clustering structure (OPTICS) etc.

Hierarchical-based
In these methods, the clusters are formed as a tree type structure based on the hierarchy. They have two
categories namely, Agglomerative (Bottom up approach) and Divisive (Top down approach). Ex. Clustering
using Representatives (CURE), Balanced iterative Reducing Clustering using Hierarchies (BIRCH) etc.

Partitioning
In these methods, the clusters are formed by portioning the objects into k clusters. Number of clusters will
be equal to the number of partitions. Ex. K-means, Clustering Large Applications based upon randomized
Search (CLARANS).

Grid
In these methods, the clusters are formed as a grid like structure. The advantage of these methods is that
all the clustering operation done on these grids are fast and independent of the number of data objects. Ex.
Statistical Information Grid (STING), Clustering in Quest (CLIQUE).
Types of ML Clustering Algorithms
The following are the most important and useful ML clustering algorithms -

K-means Clustering
This clustering algorithm computes the centroids and iterates until we it finds optimal centroid. It as­
sumes that the number of clusters are already known. It is also called flat clustering algorithm. The num­
ber of clusters identified from data by algorithm is represented by ‘K’ in K-means.

Mean-Shift Algorithm
It is another powerful clustering algorithm used in unsupervised learning. Unlike K-means clustering, it
does not make any assumptions hence it is a non-parametric algorithm.

Hierarchical Clustering
It is another unsupervised learning algorithm that is used to group together the unlabeled data points
having similar characteristics.

We will be discussing all these algorithms in detail in the upcoming chapters.


Applications of Clustering
We can find clustering useful in the following areas -

Data summarization and compression - Clustering is widely used in the areas where we require data
summarization, compression and reduction as well. The examples are image processing and vector quan­
tization.

Collaborative systems and customer segmentation - Since clustering can be used to find similar prod­
ucts or same kind of users, it can be used in the area of collaborative systems and customer segmentation.

Serve as a key intermediate step for other data mining tasks - Cluster analysis can generate a compact
summary of data for classification, testing, hypothesis generation; hence, it serves as a key intermediate
step for other data mining tasks also.

Trend detection in dynamic data - Clustering can also be used for trend detection in dynamic data by
making various clusters of similar trends.

Social network analysis - Clustering can be used in social network analysis. The examples are generating
sequences in images, videos or audios.

Biological data analysis - Clustering can also be used to make clusters of images, videos hence it can suc­
cessfully be used in biological data analysis.
Now that you know what is clustering and how it works, let's see some of the clustering algorithms used in
machine learning, in the next few chapters.

Centroid-Based Clustering
Centroid-based clustering is a class of machine learning algorithms that aims to partition a dataset into
groups or clusters based on the proximity of data points to the centroid of each cluster.

The centroid of a cluster is the arithmetic mean of all the data points in that cluster and serves as a repre­
sentative point for that cluster.

The two most popular centroid-based clustering algorithms are -

K-means Clustering
K-Means clustering is a popular unsupervised machine learning algorithm used for clustering data. It is
a simple and efficient algorithm that can group data points into K clusters based on their similarity. The
algorithm works by first randomly selecting K centroids, which are the initial centers of each cluster. Each
data point is then assigned to the cluster whose centroid is closest to it. The centroids are then updated by
taking the mean of all the data points in the cluster. This process is repeated until the centroids no longer
move or the maximum number of iterations is reached.
K-Medoids Clustering
K-medoids clustering is a partition-based clustering algorithm that is used to cluster a set of data points
into "k" clusters. Unlike K-means clustering, which uses the mean value of the data points to represent the
center of the cluster, K-medoids clustering uses a representative data point, called a medoid, to represent
the center of the cluster. The medoid is the data point that minimizes the sum of the distances between it
and all the other data points in the cluster. This makes K-medoids clustering more robust to outliers and
noise than K-means clustering.

We will discuss these two clustering methods in the next two chapters.

K-Means Clustering

The K-Means algorithm can be summarized into the following steps -

1. Initialization - Select K random data points as the initial centroids.


2. Assignment - Assign each data point to the closest centroid.
3. Recalculation - Recalculate the centroids by taking the mean of all data points in each cluster.
4. Repeat - Repeat steps 2-3 until the centroids no longer move or the maximum number of
iterations is reached.
The K-Means algorithm is a straightforward and efficient algorithm, and it can handle large datasets.
However, it has some limitations, such as its sensitivity to the initial centroids, its tendency to converge to
local optima, and its assumption of equal variance for all clusters.

Implementation in Python
Python has several libraries that provide implementations of various machine learning algorithms, in­
cluding K-Means clustering. Let's see how to implement the K-Means algorithm in Python using the scikit-
learn library.

Step 1 - Import Required Libraries


To implement the K-Means algorithm in Python, we first need to import the required libraries. We will use
the numpy and matplotlib libraries for data processing and visualization, respectively, and the scikit-learn
library for the K-Means algorithm.

import numi
import [Link] as pit
from [Link] import KMeans

Step 2 - Generate Data


To test the K-Means algorithm, we need to generate some sample data. In this example, we will generate
300 random data points with two features. We will visualize the data also.
pl^catter(X[^]^<^^^=2O^map=^summer2;

Step 3 - Initialize K-Means


Next, we need to initialize the K-Means algorithm by specifying the number of clusters (K) and the maxi­
mum number of iterations.

kmeans = KMeans(n_clusters=3, max_iter=100)

Step 4 - Train the Model


After initializing the K-Means algorithm, we can train the model by fitting the data to the algorithm.

[Link](X

Step 5 - Visualize the Clusters


To visualize the clusters, we can plot the data points and color them based on their assigned cluster.

[Link](figsize=(7.5, 3.5))
[Link](kmeans.cluster_centers_[:,O], kmeans.cluster_centers_[:,l],

The output of the above code will be a plot with the data points colored based on their assigned cluster, and
the centroids marked with an 'x' symbol in red color.

Complete Implementation Example


Here is the complete implementation example of K-Means Clustering Algorithm in python -

:: I! • I J

import [Link] as plt^^J


from [Link] import KMeansI

pl^catter(X[^]^<^^^=2O^map=^summer2;

|kmeans - KMeans(n_clusters=3, max_iter=100)

pl^gure(figsize=(^^5))^^^^^^^^^^B
[Link](kmeans.cluster_centers_[:,O], kmeans.cluster_centers_[:,l],

Output
When you execute this code, it will produce the following plots as the output -
1.0 -

0.8 -

0.6 -

0.4 -

0.2 -

0.0 -
—i--------------------------- 1---------------------------- 1----------------------------1---------------------------- 1— ~i—
0.0 0.2 0.4 0.6 0.8 1.0

Applications of K-Means Clustering


K-Means clustering is a versatile algorithm with various applications in several fields. Here we have high­
lighted some of the important applications -

Image Segmentation
K-Means clustering can be used to segment an image into different regions based on the color or texture of
the pixels. This technique is widely used in computer vision applications, such as object recognition, image
retrieval, and medical imaging.

Customer Segmentation
K-Means clustering can be used to segment customers into different groups based on their purchasing
behavior or demographic characteristics. This technique is widely used in marketing applications, such as
customer retention, loyalty programs, and targeted advertising.

Anomaly Detection
K-Means clustering can be used to detect anomalies in a dataset by identifying data points that do not
belong to any cluster. This technique is widely used in fraud detection, network intrusion detection, and
predictive maintenance.

Genomic Data Analysis


K-Means clustering can be used to analyze gene expression data to identify different groups of genes that
are co-regulated or co-expressed. This technique is widely used in bioinformatics applications, such as
drug discovery, disease diagnosis, and personalized medicine.

K-Medoids Clustering

K-Medoids Clustering - Algorithm


The K-medoids clustering algorithm can be summarized as follows -
1. Initialize k medoids - Select k random data points from the dataset as the initial medoids.
2. Assign data points to medoids - Assign each data point to the nearest medoid.
3. Update medoids - For each cluster, select the data point that minimizes the sum of distances
to all the other data points in the cluster, and set it as the new medoid.
4. Repeat steps 2 and 3 until convergence or a maximum number of iterations is reached.

Implementation in Python
To implement K-medoids clustering in Python, we can use the scikit-learn library. The scikit-learn library
provides the KMedoids class, which can be used to perform K-medoids clustering on a dataset.

First, we need to import the required libraries -

|from sklearn_extra.cluster import KMedoids


from [Link] importmakelblobsj
impor: [Link].

Next, we generate a sample dataset using the make_blobs() function from scikit-learn -

X, y = make_blobs(n_samples=500, centers=3, random_state=42)

Here, we generate a dataset with 500 data points and 3 clusters.

Next, we initialize the KMedoids class and fit the data -


[kmedoids = KMedoids(n_clusters=3, random_state=42)

Here, we set the number of clusters to 3 and use the random_state parameter to ensure reproducibility.

Finally, we can visualize the clustering results using a scatter plot -

[Link](X[:, 0], X[:, 1], c=kmedoids.labels_, cmap='viridis')


[Link](kmedoids.cluster_centers_[^OL^^^^^^^^M
kmedoids.cluster_centers_[:, 1], marker='x', color^redQ^^I

Example
Here is the complete implementation in Python -
kmedoids = KMedoids(n_clusters=3,random_state=42)

[Link](X[:, 0], X[:, 1], c=kmedoids.labels_, cmap='viridis')


[Link](kmedoids.cluster_centers_[7oT^^^^^^^^M
|kmedoids.cluster_centers_[:, 1], marker='x', color=7ed)^^B
pit. show!

Output
Here, we plot the data points as a scatter plot and color them based on their cluster labels. We also plot the
medoids as red crosses.
10

-5

-10
-7.5 -5.0 -2.5 0.0 2.5 5.0 7.5

K-Medoids Clustering - Advantages


Here are the advantages of using K-medoids clustering -

1. Robust to outliers and noise - K-medoids clustering is more robust to outliers and noise than
K-means clustering because it uses a representative data point, called a medoid, to represent
the center of the cluster.
2. Can handle non-Euclidean distance metrics - K-medoids clustering can be used with any
distance metric, including non-Euclidean distance metrics, such as Manhattan distance and
cosine similarity.
3. Computationally efficient - K-medoids clustering has a computational complexity of
O(k*nA2), which is lower than the computational complexity of K-means clustering.

K-Medoids Clustering - Disadvantages


The disadvantages of using K-medoids clustering are as follows -

1. Sensitive to the choice of k - The performance of K-medoids clustering can be sensitive to the
choice of k, the number of clusters.
2. Not suitable for high-dimensional data - K-medoids clustering may not perform well on
high-dimensional data because the medoid selection process becomes computationally ex­
pensive.

Mean-Shift Clustering

The Mean-Shift clustering algorithm is a non-parametric clustering algorithm that works by iteratively
shifting the mean of a data point towards the densest area of the data. The densest area of the data is de­
termined by the kernel function, which is a function that assigns weights to the data points based on their
distance from the mean. The kernel function used in Mean-Shift clustering is usually a Gaussian function.
The steps involved in the Mean-Shift clustering algorithm are as follows -

1. Initialize the mean of each data point to its own value.


2. For each data point, compute the mean shift vector, which is the vector that points towards
the densest area of the data.
3. Update the mean of each data point by shifting it towards the densest area of the data.
4. Repeat steps 2 and 3 until convergence is reached.

The Mean-Shift clustering algorithm is a density-based clustering algorithm, which means that it identi­
fies clusters based on the density of the data points rather than the distance between them. In other words,
the algorithm identifies clusters based on the areas where the density of the data points is highest.

Implementation of Mean-Shift Clustering in Python


The Mean-Shift clustering algorithm can be implemented in Python programming language using the
scikit-learn library. The scikit-learn library is a popular machine learning library in Python that provides
various tools for data analysis and machine learning. The following steps are involved in implementing the
Mean-Shift clustering algorithm in Python using the scikit-learn library -

Step 1 - Import the necessary libraries


The numpy library is used for scientific computing in Python, while the matplotlib library is used for data
visualization. The [Link] library contains the MeanShift class, which is used for implementing
the Mean-Shift clustering algorithm in Python.
The estimate_bandwidth function is used to estimate the bandwidth of the kernel function, which is an
important parameter in the Mean-Shift clustering algorithm.

import numi
import [Link] as pit
from [Link] import MeanShift, estimate_bandwidth

Step 2 - Generate the data


In this step, we generate a random dataset with 5 00 data points and 2 features. We use the [Link]-
[Link] function to generate the data.

= [Link](500,2)

Step 3 - Estimate the bandwidth of the kernel function


In this step, we estimate the bandwidth of the kernel function using the estimate_bandwidth function.
The bandwidth is an important parameter in the Mean-Shift clustering algorithm, which determines the
width of the kernel function.

andwidth = estimate_bandwidth(X, quantile-0.1,n_samples=100)


Step 4 - Initialize the Mean-Shift clustering algorithm
In this step, we initialize the Mean-Shift clustering algorithm using the MeanShift class. We pass the band­
width parameter to the class to set the width of the kernel function.

ms = MeanShift(bandwidth=bandwidth, bin_seeding=True)

Step 5 - Train the model


In this step, we train the Mean-Shift clustering algorithm on the dataset using the fit method of the Mean-
Shift class.

Step 6 - Visualize the results

[labels = msdabels^^^^^^^^^^^^^^^^^B
|cluster_centers = ms.cluster_centersZ^^^^^^M
. mi
printf'Number of estimated clusters:", n_clusters_)
[Link](cluster_centers[:,O], cluster_centers[:,l], marker^'*', s=300, c='r')

In this step, we visualize the results of the Mean-Shift clustering algorithm. We extract the cluster labels
and the cluster centers from the trained model. We then print the number of estimated clusters. Finally, we
plot the data points and the centroids using the matplotlib library.

Example
Here is the complete implementation example of Mean-Shift Clustering Algorithm in python -

= [Link](500,2)

bandwidth = estimate_bandwidth(X, quantile=0.1, n_samples=100)


ms = MeanShift(bandwidth=bandwidth, bin_seeding=True)

[labels =
|cluster_centers = ms.cluster_centersZ^^^^^^M
. mi
printf'Number of estimated clusters:", n_clusters_)

[Link](X[:,O], X[:,l], c=labels, cmap='summerQ^^^^^^H


[Link](cluster_centers[:,O], cluster_centers[:,l], marker^'*',

Output
When you execute the program, it will produce the following plot as the output -
3-

2-

1-

0-

-1 -

-2 -

-3 -
0 2

Applications of Mean-Shift Clustering


The Mean-Shift clustering algorithm has several applications in various fields. Some of the applications of
Mean-Shift clustering are as follows -

1. Computer vision - Mean-Shift clustering is widely used in computer vision for object track­
ing, image segmentation, and feature extraction.
2. Image processing - Mean-Shift clustering is used for image segmentation, which is the
process of dividing an image into multiple segments based on the similarity of the pixels.
3. Anomaly detection - Mean-Shift clustering can be used for detecting anomalies in data by
identifying the areas with low density.
4. Customer segmentation - Mean-Shift clustering can be used for customer segmentation in
marketing by identifying groups of customers with similar behavior and preferences.
5. Social network analysis - Mean-Shift clustering can be used for clustering users in social net­
works based on their interests and interactions.

Hierarchical Clustering

Hierarchical clustering is another unsupervised learning algorithm that is used to group together the
unlabeled data points having similar characteristics. Hierarchical clustering algorithms falls into follow­
ing two categories -

. Agglomerative hierarchical algorithms - In agglomerative hierarchical algorithms, each


data point is treated as a single cluster and then successively merge or agglomerate (bottom-
up approach) the pairs of clusters. The hierarchy of the clusters is represented as a dendro­
gram or tree structure.
. Divisive hierarchical algorithms - On the other hand, in divisive hierarchical algorithms, all
the data points are treated as one big cluster and the process of clustering involves dividing
(Top-down approach) the one big cluster into various small clusters.
Steps to Perform Agglomerative Hierarchical Clustering
We are going to explain the most used and important Hierarchical clustering i.e. agglomerative. The steps
to perform the same is as follows -

1. Step 1 - Treat each data point as single cluster. Hence, we will be having say K clusters at start.
The number of data points will also be K at start.
2. Step 2 - Now, in this step we need to form a big cluster by joining two closet datapoints. This
will result in total of K-1 clusters.
3. Step 3 - Now, to form more clusters we need to join two closet clusters. This will result in total
of K-2 clusters.
4. Step 4 - Now, to form one big cluster repeat the above three steps until K would become 0 i.e.
no more data points left to join.
5. Step 5 - At last, after making one single big cluster, dendrograms will be used to divide into
multiple clusters depending upon the problem.

Role of Dendrograms in Agglomerative Hierarchical Clustering


As we discussed in the last step, the role of dendrogram started once the big cluster is formed. Dendrogram
will be used to split the clusters into multiple cluster of related data points depending upon our problem.
It can be understood with the help of following example -
Example 1
To understand, let's start with importing the required libraries as follows -

%matplotlib inline
import [Link] as pit
import num as

Next, we will be plotting the datapoints we have taken for this example -

5^i^rray([[7^1J12^om7i9U26^1B2J7U8775U73^5

plt.subplots_adjust(bottom=oT)^^^^^^^^^^^^^^^M

pl^nnotate(label4xy=(x/y)^{ytext=^^)Xextcooids=^oflfce^oints^Ji^nghf/va=bottom'

Output
When you execute this code, it will produce the following plot as the output -
100

80

60

40

20

10 20 30 40 50 60 70 80 90

From the above diagram, it is very easy to see we have two clusters in our datapoints but in real-world data,
there can be thousands of clusters.

Next, we will be plotting the dendrograms of our datapoints by using Scipy library -
from [Link] import dendrogram, linkage
fronwiatplotliMmpor^yplo^^lt
linked = linkage(X,
labelLis^^nge(k^^^^^^^^H

dendrogram(linked, orientation='top',labels=labelList
|distance_sort='descending',show_leaf_counts=True)|

>[Link]

It will produce the following plot -


Now, once the big cluster is formed, the longest vertical distance is selected. A vertical line is then drawn
through it as shown in the following diagram. As the horizontal line crosses the blue line at two points
hence the number of clusters would be two.

10

0
10 67851423
Next, we need to import the class for clustering and call its fit_predict method to predict the cluster. We
are importing AgglomerativeClustering class of [Link] library -

cluster = AgglomerativeClustering(n_clusters=2, affinity='euclidean'

:luster.fit_predict(X

Next, plot the cluster with the help of following code -

[Link](X[:,O],X[:,l], c=cluster.labels_, cmap='rainbow')

The following diagram shows the two clusters from our datapoints.
100

80 • •

60 - •

40 -

20 - • •

-------- 1------------ 1------------ i------------ i------------ i------------ i------------ 1------------ i------------ r


10 20 30 40 50 60 70 80 90

Example 2
As we understood the concept of dendrograms from the simple example above, let's move to another
example in which we are creating clusters of the data point in Pima Indian Diabetes Dataset by using hier­
archical clustering -
impo^natplotlibpvplo^^lt
import pandas as
'bmatplotlib
numpv iipKBBBBI

from pandas import


patl^mf'CApim^ndian^iiabete^sv^^^^^^^^^^^^^^^^^H
heademame^^pregTplasTpresTskinytest/massTpediTageTclass'
data = read_csv(path, names=headernames)^^^^^^^^^^^^^M

impo^^cim^lustenhierarch^^hc

dend = [Link]([Link](data, method='ward'))

frai^^lean^lustei^^^^Agglomera^eClmstering^^^^^^^^l^^^J^^H
cluste^^gglnme^tiveClustering(i^lusteK=4^ffinity=^euclideanUinkage=^ward2

[Link](figsize=(7.2. 5.5))
[Link](patient_data[:, 01. patient_data[:,l], c=[Link]
cmap='rainbow')

Output
When you run this code, it will produce the following two plots as the output -

3500 -

3000 -

2500 -

2000-

1500 -

1000 -
800 -

0 20 40 60 80 100

Density-Based Clustering
Density-based clustering is based on the idea that clusters are regions of high density separated by regions
of low density.
1. The algorithm works by first identifying "core" data points, which are data points that have a
minimum number of neighbors within a specified distance. These core data points form the
center of a cluster.
2. Next, the algorithm identifies "border" data points, which are data points that are not core
data points but have at least one core data point as a neighbor.
3. Finally, the algorithm identifies "noise" data points, which are data points that are not core
data points or border data points.

Popular Density-based Clustering Algorithms


Here are the most common density-based clustering algorithms -

DBSCAN Clustering
The DBSCAN (Density-Based Spatial Clustering of Applications with Noise) algorithm is one of the most
common density-based clustering algorithms. The DBSCAN algorithm requires two parameters: the mini­
mum number of neighbors (minPts) and the maximum distance between core data points (eps).

OPTICS Clustering
OPTICS (Ordering Points to Identify the Clustering Structure) is a density-based clustering algorithm that
operates by building a reachability graph of the dataset. The reachability graph is a directed graph that
connects each data point to its nearest neighbors within a specified distance threshold. The edges in the
reachability graph are weighted according to the distance between the connected data points. The algo­
rithm then constructs a hierarchical clustering structure by recursively splitting the reachability graph
into clusters based on a specified density threshold.

HDBSCAN Clustering
HDBSCAN (Hierarchical Density-Based Spatial Clustering of Applications with Noise) is a clustering algo­
rithm that is based on density clustering. It is a newer algorithm that builds upon the popular DBSCAN
algorithm and offers several advantages over it, such as better handling of clusters of varying densities and
the ability to detect clusters of different shapes and sizes.

In the next three chapters, we will discuss all the three density-based clustering algorithms in detail along
with their implementation in Python.

DBSCAN Clustering

The DBSCAN Clustering algorithm works as follows -

1. Randomly select a data point that has not been visited.


2. If the data point has at least minPts neighbors within distance eps, create a new cluster and
add the data point and its neighbors to the cluster.
3. If the data point does not have at least minPts neighbors within distance eps, mark the data
point as noise and continue to the next data point.
4. Repeat steps 1-3 until all data points have been visited.

Implementation in Python
We can implement the DBSCAN algorithm in Python using the scikit-learn library. Here are the steps to do
so -

Load the dataset


The first step is to load the dataset. We will use the make_moons function from the scikitlearn library to
generate a toy dataset with two moons.

fron^kleam^atase^mgormnak^noons^^^^^^^^^^^^|
X, y = make_moons(n_samples=200, noise=0.05, random_state=Q)

Perform DBSCAN clustering


The next step is to perform DBSCAN clustering on the dataset. We will use the DBSCAN class from the
scikit-learn library. We will set the minPts parameter to 5 and the "eps" parameter to 0.2.

from [Link] import DBSCAN^^^^M


clustering = DBSCAN(eps=0.2, min_samples=5)
Visualize the results
The final step is to visualize the results of the clustering. We will use the Matplotlib library to create a
scatter plot of the dataset colored by the cluster assignments.

import matplotlib. 35?lot as pit


[Link](X[:, 0], X[:, 1], c=clustering, labels,, cmap='rainbow')

Example
Here is the complete implementation of DBSCAN clustering in Python -

X, y = make_moons(n_samples=200, noise=0.05, random_state=0)


skkarilHuster

clusterin^^BSCAN(eps=^^nii^amples=5)
clusterm?fit(X)^^^^^^^^^^^^^^^^H

import [Link] as

[Link](X[:, 0], X[:, 1], c=clustering, labels,, cmap='rainbow')


[Link](

Output
The resulting scatter plot should show two distinct clusters, each corresponding to one of the moons in the
dataset. The noise data points should be colored black.

1.00

0.75

0.50

0.25

0.00

-0.25

-0.50

-1.0 -0.5 0.0 0.5 1.0 1.5 2.0

Advantages of DBSCAN
Following are the advantages of using DBSCAN clustering -
1. DBSCAN can handle clusters of arbitrary shape, unlike k-means, which assumes that clusters
are spherical.
2. It does not require prior knowledge of the number of clusters in the dataset, unlike k-means.
3. It can detect outliers, which are points that do not belong to any cluster. This is because
DBSCAN defines clusters as dense regions of points, and points that are far from any dense re­
gion are considered outliers.
4. It is relatively insensitive to the initial choice of parameters, such as the epsilon and min_sam-
ples parameters, unlike k-means.
5. It is scalable to large datasets, as it only needs to compute pairwise distances between neigh­
boring points, rather than all pairs of points.

Disadvantages of DBSCAN
Following are the disadvantages of using DBSCAN clustering -

1. It can be sensitive to the choice of the epsilon and min_samples parameters. If these parame­
ters are not chosen carefully, DBSCAN may fail to identify clusters or merge them incorrectly.
2. It may not work well on datasets with varying densities, as it assumes that all clusters have
the same density.
3. It may produce different results for different runs on the same dataset, due to the non-deter-
ministic nature of the algorithm.
4. It may be computationally expensive for high-dimensional datasets, as the distance computa­
tions become more expensive as the number of dimensions increases.
5. It may not work well on datasets with noise or outliers if the density of the noise or outliers is
too high. In such cases, the noise or outliers may be wrongly assigned to clusters.

OPTICS Clustering

OPTICS is like DBSCAN (Density-Based Spatial Clustering of Applications with Noise), another popular
density-based clustering algorithm. However, OPTICS has several advantages over DBSCAN, including the
ability to identify clusters of varying densities, the ability to handle noise, and the ability to produce a hi­
erarchical clustering structure.

Implementation of OPTICS in Python


To implement OPTICS clustering in Python, we can use the scikit-learn library. The scikit-learn library
provides a class called OPTICS that implements the OPTICS algorithm.

Here's an example of how to use the OPTICS class in scikit-learn to cluster a dataset -

Example

from [Link] import make_blobs


import [Link] as pit

X,y = make_blobs(n_samples=2000, centers=4, cluster_std=0.60, random_state=Q)

opdc^^OPTICS(mii^ample^^O^<^^5)

labels = optics.

pl^catter(X[^]^<[^^^=labels^map^turbo2

In this example, we first generate a sample dataset using the make_blobs function from scikit-learn. We
then instantiate an OPTICS object with the min_samples parameter set to 50 and the xi parameter set to
0.05. The min_samples parameter specifies the minimum number of samples required for a cluster to be
formed, and the xi parameter controls the steepness of the cluster hierarchy. We then fit the OPTICS object
to the dataset using the fit method. Finally, we plot the results using a scatter plot, where each data point is
colored according to its cluster label.

Output
When you execute this program, it will produce the following plot as the output -

10

-3-2-10 1 2 3 4

Advantages of OPTICS Clustering


Following are the advantages of using OPTICS clustering -

1. Ability to handle clusters of varying densities - OPTICS can handle clusters that have vary­
ing densities, unlike some other clustering algorithms that require clusters to have uniform
densities.
2. Ability to handle noise - OPTICS can identify noise data points that do not belong to any
cluster, which is useful for removing outliers from the dataset.
3. Hierarchical clustering structure - OPTICS produces a hierarchical clustering structure that
can be useful for analyzing the dataset at different levels of granularity.

Disadvantages of OPTICS Clustering


Following are some of the disadvantages of using OPTICS clustering.

1. Sensitivity to parameters - OPTICS requires careful tuning of its parameters, such as the
min_samples and xi parameters, which can be challenging.
2. Computational complexity - OPTICS can be computationally expensive for large datasets,
especially when using a high min_samples value.

HDBSCAN Clustering

Working of HDBSCAN Clustering


HDBSCAN builds a hierarchy of clusters using a mutual-reachability graph, which is a graph where each
data point is a node and the edges between them are weighted by a measure of similarity or distance. The
graph is built by connecting two points with an edge if their mutual reachability distance is below a given
threshold.

The mutual reachability distance between two points is the maximum of their reachability distances,
which is a measure of how easily one point can be reached from the other. The reachability distance
between two points is defined as the maximum of their distance and the minimum density of any point
along their path.

The hierarchy of clusters is then extracted from the mutual-reachability graph using a minimum span­
ning tree (MST) algorithm. The leaves of the MST correspond to the individual data points, while the inter­
nal nodes correspond to clusters of varying sizes and shapes.

The HDBSCAN algorithm then applies a condensed tree algorithm to the MST to extract the clusters. The
condensed tree is a compact representation of the MST that only includes the internal nodes of the tree.
The condensed tree is then cut at a certain level to obtain the clusters, with the level of the cut determined
by a user-defined minimum cluster size or a heuristic based on the stability of the clusters.

Implementation in Python
HDBSCAN is available as a Python library that can be installed using pip. The library provides an im­
plementation of the HDBSCAN algorithm along with several useful functions for data preprocessing and
visualization.

Installation
To install HDBSCAN, open a terminal window and type the following command -

pip install hdbscan

Usage
To use HDBSCAN, first import the hdbscan library -

import hdbscan

Next, we generate a sample dataset using the make_blobs() function from scikit-learn -

, y - make_blobs(n_samples= 1000, centers=3, random_state=42)

Now, create an instance of the HDBSCAN class and fit it to the data -

clusterer = [Link](min_cluster_size=10, metric='euclidean')

This will apply HDBSCAN to the dataset and assign each point to a cluster. To visualize the clustering re­
sults, you can plot the data with color each point according to its cluster label -
labels = [Link].

[colors = [Link]([x for x in 'bgrcmykbgrcmykbgrcmykbgrcmyk'])


J:k ny.../i i;kJ:I.

This code will produce a scatter plot of the data with each point colored according to its cluster label as
follows -
--- 1-------------------------- 1-------------------------- 1-------------------------- 1-------------------------- 1-------------------------- 1-------------------------- 1-------------------------- T"
-10.0 -7.5 -5.0 -2.5 0.0 2.5 5.0 7.5

HDBSCAN also provides several parameters that can be adjusted to fine-tune the clustering results -

1. min_cluster_size - The minimum size of a cluster. Points that are not part of any cluster are
labeled as noise.
2. min_samples - The minimum number of samples in a neighborhood for a point to be consid­
ered a core point.
3. cluster_selection_epsilon - The radius of the neighborhood used for cluster selection.
4. metric - The distance metric used to measure the similarity between points.
Advantages of HDBSCAN Clustering
HDBSCAN has several advantages over other clustering algorithms -

1. Better handling of clusters of varying densities - HDBSCAN can identify clusters of differ­
ent densities, which is a common problem in many datasets.
2. Ability to detect clusters of different shapes and sizes - HDBSCAN can identify clusters that
are not necessarily spherical, which is another common problem in many datasets.
3. No need to specify the number of clusters - HDBSCAN does not require the user to specify
the number of clusters, which can be difficult to determine a priori.
4. Robust to noise - HDBSCAN is robust to noisy data and can identify outliers as noise points.

BIRCH Clustering
BIRCH (Balanced Iterative Reducing and Clustering hierarchies) is a hierarchical clustering algorithm that
is designed to handle large datasets efficiently. The algorithm builds a treelike structure of clusters by re­
cursively partitioning the data into subclusters until a stopping criterion is met.

BIRCH uses two main data structures to represent the clusters: Clustering Feature (CF) and Sub-Cluster
Feature (SCF). CF is used to summarize the statistical properties of a set of data points, while SCF is used to
represent the structure of subclusters.
BIRCH clustering has three main steps -

1. Initialization - BIRCH constructs an empty tree structure and sets the maximum number of
CFs that can be stored in a node.
2. Clustering - BIRCH reads the data points one by one and adds them to the tree structure. If a
CF is already present in a node, BIRCH updates the CF with the new data point. If there is no
CF in the node, BIRCH creates a new CF for the data point. BIRCH then checks if the number of
CFs in the node exceeds the maximum threshold. If the threshold is exceeded, BIRCH creates a
new subcluster by recursively partitioning the CFs in the node.
3. Refinement - BIRCH refines the tree structure by merging the subclusters that are similar
based on a distance metric.

Implementation of BIRCH Clustering in Python


To implement BIRCH clustering in Python, we can use the scikit-learn library. The scikitlearn library pro­
vides a BIRCH class that implements the BIRCH algorithm.

Here is an example of how to use the BIRCH class to cluster a dataset -

Example
from [Link] import make_blobs
from [Link] import Birch^^^B
import [Link] as plt^^^^^B
X, y = make_blobs(n_samples= 1000, centers= 10, cluster_std=0.50,

birch = Birch(threshold=1.5, n_clusters=4)

labels [Link](X)^^^^^^^^^B

pl^catter(X[^]^<^^^=labels^map^wintef)

In this example, we first generate a sample dataset using the make_blobs function from scikit-learn. We
then cluster the dataset using the BIRCH algorithm. For the BIRCH algorithm, we instantiate a Birch object
with the threshold parameter set to 1.5 and the n_clusters parameter set to 4. We then fit the Birch object
to the dataset using the fit method and predict the cluster labels using the predict method. Finally, we plot
the results using a scatter plot.

Output
When you execute the given program, it will produce the following plot as the output -
Advantages of BIRCH Clustering
BIRCH clustering has several advantages over other clustering algorithms, including -

1. Scalability - BIRCH is designed to handle large datasets efficiently by using a treelike struc­
ture to represent the clusters.
2. Memory efficiency - BIRCH uses CF and SCF data structures to summarize the statistical
properties of the data points, which reduces the memory required to store the clusters.
3. Fast clustering - BIRCH can cluster the data points quickly because it uses an incremental
clustering approach.

Disadvantages of BIRCH Clustering


BIRCH clustering also has some disadvantages, including -

1. Sensitivity to parameter settings - The performance of BIRCH clustering can be sensitive to


the choice of parameters, such as the maximum number of CFs that can be stored in a node
and the threshold value used to create subclusters.
2. Limited ability to handle non-spherical clusters - BIRCH assumes that the clusters are
spherical, which means it may not perform well on datasets with nonspherical clusters.
3. Limited flexibility in the choice of distance metric - BIRCH uses the Euclidean distance met­
ric by default, which may not be appropriate for all datasets.

Affinity Propagation
Affinity Propagation is a clustering algorithm that identifies "exemplars" in a dataset and assigns each data
point to one of these exemplars. It is a type of clustering algorithm that does not require a pre-specified
number of clusters, making it a useful tool for exploratory data analysis. Affinity Propagation was intro­
duced by Frey and Dueck in 2007 and has since been widely used in many fields such as biology, computer
vision, and social network analysis.
The idea behind Affinity Propagation is to iteratively update two matrices: the responsibility matrix and
the availability matrix. The responsibility matrix contains information about how well-suited each data
point is to serve as an exemplar for another data point, while the availability matrix contains information
about how much each data point wants to select another data point as an exemplar. The algorithm alter­
nates between updating these two matrices until convergence is achieved. The final exemplars are chosen
based on the maximum values in the responsibility matrix.

Implementation in Python
In Python, the Scikit-learn library provides the AffinityPropagation class for implementing the Affinity
Propagation algorithm. The class takes several parameters, including the preference parameter, which
controls how many exemplars are chosen, and the damping factor, which controls the convergence speed
of the algorithm.

Here is an example of how to implement Affinity Propagation using the Scikit-learn library in Python -

Example
from [Link] import AffinityProp agation^^^^^^^B
from [Link] import make_blobs^^^^^^^^^^^B
import [Link] as

X, _ make_blobs(n_samples=100, centers=4, random_state=Q)


af = AffinityPropagation(preference

# fit the model to the dataset


[Link](X

print("Exemplars:", af.cluster_centers_mdicesZ)^^^^^^^^^^^^^^^^^^M

[Link](af.cluster_centers_[:, 0], af.cluster_centers_[:, 1], marker='x', color-'red')

In this example, we first generate a synthetic dataset using the make_blobs() function from Scikit-learn.
We then create an instance of the AffinityPropagation class with a preference value of-50 and fit the model
to the dataset using the fit() method. Finally, we print the cluster labels and the exemplars identified by the
algorithm.

Output
When you execute this code, it will produce the following plot as the output -
10 4

-1----------------------- 1--------------------------- 1------------------------- 1 i------------------------- 1------------------------- 1------------------------- r


-3-2-10 1 2 3 4

In addition, it will print the following output on the terminal -

Cluster labels: [3 03333100000000233122012313322202213020131011021313211


1002200223201123023033312220112122333111100101]
Exemplars: [9 41 51 74]

The preference parameter in Affinity Propagation controls the number of exemplars that are chosen. A
higher preference value leads to more exemplars, while a lower preference value leads to fewer exemplars.
The damping factor controls the convergence speed of the algorithm, with larger damping factors leading
to slower convergence.

Overall, Affinity Propagation is a powerful clustering algorithm that can identify the number of clusters
automatically and does not require a pre-specified number of clusters. However, it can be computationally
expensive and may not work well with very large datasets.

Advantages of Affinity Propagation


Following are the advantages of using Affinity Propagation -

1. Affinity Propagation can identify the number of clusters automatically without specifying
the number of clusters in advance.
2. It can handle clusters of arbitrary shapes and sizes.
3. It can handle datasets with noisy or incomplete data.
4. It is relatively insensitive to the choice of initial parameters.
5. It has been shown to outperform other clustering algorithms on certain types of datasets.

Disadvantages of Affinity Propagation


Following are some of the disadvantages of using Affinity Propagation -

1. It can be computationally expensive for large datasets or datasets with many features.
2. It may converge to suboptimal solutions, especially when the data has a high degree of vari­
ability or noise.
3. It can be sensitive to the choice of the damping factor, which controls the rate of convergence.
4. It may produce many small clusters or clusters with only one or a few members, which may
not be meaningful.
5. It can be difficult to interpret the resulting clusters, as the algorithm does not provide explicit
information about the meaning or characteristics of the clusters.

Distribution-Based Clustering

Distribution-based clustering algorithms, also known as probabilistic clustering algorithms, are a class of
machine learning algorithms that assume that the data points are generated from a mixture of probability
distributions. These algorithms aim to identify the underlying probability distributions that generate the
data, and use this information to cluster the data into groups with similar properties.

One common distribution-based clustering algorithm is the Gaussian Mixture Model (GMM). GMM as­
sumes that the data points are generated from a mixture of Gaussian distributions, and aims to estimate
the parameters of these distributions, including the means and covariances of each distribution. Let's see
below what is GMM in ML and how we can implement in Python programming language.
Gaussian Mixture Model
Gaussian Mixture Models (GMM) is a popular clustering algorithm used in machine learning that assumes
that the data is generated from a mixture of Gaussian distributions. In other words, GMM tries to fit a set of
Gaussian distributions to the data, where each Gaussian distribution represents a cluster in the data.

GMM has several advantages over other clustering algorithms, such as the ability to handle overlapping
clusters, model the covariance structure of the data, and provide probabilistic cluster assignments for each
data point. This makes GMM a popular choice in many applications, such as image segmentation, pattern
recognition, and anomaly detection.

Implementation in Python
In Python, the Scikit-learn library provides the GaussianMixture class for implementing the GMM algo­
rithm. The class takes several parameters, including the number of components (i.e., the number of clus­
ters to identify), the covariance type, and the initialization method.

Here is an example of how to implement GMM using the Scikit-learn library in Python -

Example
from [Link] import GaussianMixture
from [Link] import make_blobs^^B
[Link]
generate a dataset
X, _ = make_blobs(n_samples=200, centers=4, random_state=Q)

gm^^^aussianMixmm^^mmponen^^J^^I

# fit the model to the dataset


[Link](X'

labels = [Link](X)^^^^^^^^^^^^M

pl^catter(X[^]^<^^^=labels^map=Vindis2

In this example, we first generate a synthetic dataset using the make_blobs() function from Scikit-learn.
We then create an instance of the GaussianMixture class with 4 components and fit the model to the
dataset using the fit() method. Finally, we predict the cluster labels for the data points using the predictQ
method and print the resulting labels.

Output
When you execute this program, it will produce the following plot as the output -

10 -

8-

-4 -2 0 2 4

In addition, you will get the following output on the terminal -

Cluster labels: [2 013210111120021333131202232213102010


1133331201331300323023231213123002211
0300223122011200333112032132233012213
0022120313012101021021330332320222212
0333102130323220031220110333130012121
00313221300013 1]
The covariance type parameter in GMM controls the type of covariance matrix to use for the Gaussian
distributions. The available options include "full" (full covariance matrix), "tied" (tied covariance matrix
for all clusters), "diag" (diagonal covariance matrix), and "spherical" (a single variance parameter for all
dimensions). The initialization method parameter controls the method used to initialize the parameters of
the Gaussian distributions.

Advantages of Gaussian Mixture Models


Following are the advantages of using Gaussian Mixture Models -

1. Gaussian Mixture Models (GMM) can model arbitrary distributions of data, making it a flexible
clustering algorithm.
2. It can handle datasets with missing or incomplete data.
3. It provides a probabilistic framework for clustering, which can provide more information
about the uncertainty of the clustering results.
4. It can be used for density estimation and generation of new data points that follow the same
distribution as the original data.
5. It can be used for semi-supervised learning, where some data points have known labels and
are used to train the model.

Disadvantages of Gaussian Mixture Models


Following are some of the disadvantages of using Gaussian Mixture Models -
1. GMM can be sensitive to the choice of initial parameters, such as the number of clusters and
the initial values for the means and covariances of the clusters.
2. It can be computationally expensive for high-dimensional datasets, as it involves computing
the inverse of the covariance matrix, which can be expensive for large matrices.
3. It assumes that the data is generated from a mixture of Gaussian distributions, which may
not be true for all datasets.
4. It may be prone to overfitting, especially when the number of parameters is large or the
dataset is small.
5. It can be difficult to interpret the resulting clusters, especially when the covariance matrices
are complex.

Agglomerative Clustering
Agglomerative clustering is a hierarchical clustering algorithm that starts with each data point as its own
cluster and iteratively merges the closest clusters until a stopping criterion is reached. It is a bottom-up
approach that produces a dendrogram, which is a tree-like diagram that shows the hierarchical relation­
ship between the clusters. The algorithm can be implemented using the scikit-learn library in Python.
Implementation in Python
We will use the iris dataset for demonstration. The first step is to import the necessary libraries and load
the dataset.

import pandas as
import [Link] as
from [Link] import load_iris^^^^^^^^^B
from [Link] import AgglomerativeClustermg^P
from [Link] import dendrogram, linkage

iris = load_iris()
X = irisxiata^B
y = [Link]®

The next step is to create a linkage matrix that contains the distances between each pair of clusters. We can
use the linkage function from the [Link] module to create the linkage matrix.

Z = linkage(X, 'ward')

The 'ward' method is used to calculate the distances between the clusters. It minimizes the variance of the
distances between the clusters being merged.
We can visualize the dendrogram using the dendrogram function from the same module.

pltJigure(figsize=(^^5))

The resulting dendrogram (see the following plot) shows the hierarchical relationship between the clus­
ters. We can see that the algorithm has merged the closest clusters first, and the distance between the clus­
ters increases as we move up the tree.
The final step is to apply the clustering algorithm and extract the cluster labels. We can use the Agglomer-
ativeClustering class from the [Link] module to apply the algorithm.

model = AgglomerativeClustering(n_clusters=3)

The n_clusters parameter specifies the number of clusters to be extracted from the data. In this case, we
have specified n_clusters=3 because we know that the iris dataset has three classes.

We can visualize the resulting clusters using a scatter plot.

phS^SZdt^^^^^^M
[Link]('Agglomerative Clustering Results")

The resulting plot shows the three clusters identified by the algorithm. We can see that the algorithm has
successfully separated the data points into their respective classes.
Agglomerative Clustering Results
4.5 -f

4.0 -

2.0 - •
--------------------------------- 1--------------------------------------------1------------------------------------------- 1------------------------------------------- 1--------------------------------------------1------------------------------------------- 1--------------------------------------------1------------------------------------------ 1—

4.5 5.0 5.5 6.0 6.5 7.0 7.5 8.0


Sepal length

Example
Here is the complete implementation of Agglomerative Clustering in Python -

import numi
import pandas as
import [Link] as plt^^^B
from [Link] import loadjris
lroi^klea^^luste^^o^^gglomerativeClustering^|
from [Link] import dendrogram, linkage

X = [Link]^^^^B
y = [Link]^^^^B
^^mkage(X/ward9|

pl^gure(figsiz^^^^))

model = AgglomerativeClustering(n_clusters=3)^^^^B

llabels ■ [Link] 8818

[Link](figsize=(7.5, 3.5))
pl^catter(X[^)]^£^^<abels)^^^^^B

[Link]("Agglomerative Clustering Results")

Advantages of Agglomerative Clustering


Following are the advantages of using Agglomerative Clustering -

1. Produces a dendrogram that shows the hierarchical relationship between the clusters.
2. Can handle different types of distance metrics and linkage methods.
3. Allows for a flexible number of clusters to be extracted from the data.
4. Can handle large datasets with efficient implementations.

Disadvantages of Agglomerative Clustering


Following are some of the disadvantages of using Agglomerative Clustering -

1. Can be computationally expensive for large datasets.


2. Can produce imbalanced clusters if the distance metric or linkage method is not appropriate
for the data.
3. The final result may be sensitive to the choice of distance metric and linkage method used.
4. The dendrogram may be difficult to interpret for large datasets with many clusters.

Dimensionality Reduction

Dimensionality reduction in machine learning is the process of reducing the number of features or vari­
ables in a dataset while retaining as much of the original information as possible. In other words, it is a way
of simplifying the data by reducing its complexity.

The need for dimensionality reduction arises when a dataset has a large number of features or variables.
Having too many features can lead to overfitting and increase the complexity of the model. It can also
make it difficult to visualize the data and can slow down the training process.

There are two main approaches to dimensionality reduction -

Feature Selection
This involves selecting a subset of the original features based on certain criteria, such as their importance
or relevance to the target variable.

The following are some commonly used feature selection techniques -

1. Filter Methods
2. Wrapper Methods
3. Embedded Methods

Feature Extraction
Feature extraction is a process of transforming raw data into a set of meaningful features that can be
used for machine learning models. It involves reducing the dimensionality of the input data by selecting,
combining or transforming features to create a new set of features that are more useful for the machine
learning model.

Dimensionality reduction can improve the accuracy and speed of machine learning models, reduce over­
fitting, and simplify data visualization.

Feature Selection

Feature selection is an important step in machine learning that involves selecting a subset of the available
features to improve the performance of the model. The following are some commonly used feature selec­
tion techniques -

Filter Methods
This method involves evaluating the relevance of each feature by calculating a statistical measure (e.g.,
correlation, mutual information, chi-square, etc.) and ranking the features based on their scores. Features
that have low scores are then removed from the model.

To implement filter methods in Python, you can use the SelectKBest or SelectPercentile functions from the
sklearn.feature_selection module. Below is a small code snippet to implement Feature selection.

from sklearn.feature_selection import SelectPercentile, chi2


selector = SelectPercentile(chi2, percentile = 10)^^^^^^B
X_new selector.fit_transform(X^y)^^^^^^^^^^^^B

Wrapper Methods
This method involves evaluating the model's performance by adding or removing features and selecting
the subset of features that yields the best performance. This approach is computationally expensive, but it
is more accurate than filter methods.

To implement wrapper methods in Python, you can use the RFE (Recursive Feature Elimination) function
from the sklearn.feature_selection module. Below is a small code snippet to implement Wrapper method.

frai^klean^eatu]^^lectioi^^^^FE^^^^^|
from sklearn.linear_model import LogisticRegression

jestimator = LogisticRegressionQ^^^^^^^^^^M
selector RFEtotoimator, n_Jeai.ures._to
selector [Link] tor.
X_new = [Link](X)

Embedded Methods
This method involves incorporating feature selection into the model building process itself. This can be
done using techniques such as Lasso regression, Ridge regression, or Decision Trees. These methods assign
weights to each feature and features with low weights are removed from the model.

To implement embedded methods in Python, you can use the Lasso or Ridge regression functions from the
sklearn.linear_model module. Below is a small code snippet for implementing embedded methods -

from sklearn.linear_model import Lasso

[lasso = Lasso(alpha=0.1)^^^^^^^^^^^J

|coef = [Link](lasso.coef_, index = [Link])


importanLfeatures coef[coef 5

Principal Component Analysis (PCA)


This is a type of unsupervised learning method that involves transforming the original features into a
set of uncorrelated principal components that explain the maximum variance in the data. The number of
principal components can be selected based on a threshold value, which can reduce the dimensionality of
the dataset.

To implement PCA in Python, you can use the PCA function from the [Link] module. For
example, to reduce the number of features you can use PCA as given the following code -

from [Link] import PCA


pea = PCA(n_components = 3)^^^^^M
X_new o c a. fi t _ t r a n s fo r m (

Recursive Feature Elimination (RFE)


This method involves recursively eliminating the least significant features until a subset of the most im­
portant features is identified. It uses a model-based approach and can be computationally expensive, but it
can yield good results in high-dimensional datasets.

To implement RFE in Python, you can use the RFECV (Recursive Feature Elimination with Cross Valida­
tion) function from the sklearn.feature_selection module. For example, below is a small code snippet with
the help of which we can implement to use Recursive Feature Elimination -

from [Link] import DecisionTreeClassifier


[estimator = DecisionTreeClassifierQ^^^^^M
selector = RFECV(estimator, step=l,cv=S)^^B
' ■ I . -' i. _ , -

_new = [Link](X)

These feature selection techniques can be used alone or in combination to improve the performance of
machine learning models. It is important to choose the appropriate technique based on the size of the
dataset, the nature of the features, and the type of model being used.

Example
In the below example, we will implement three feature selection methods - univariate feature selection
using the chi-square test, recursive feature elimination with cross-validation (RFECV), and principal com­
ponent analysis (PCA).

We will use the Breast Cancer Wisconsin (Diagnostic) Dataset, which is included in scikit-learn. This
dataset contains 569 samples with 30 features, and the task is to classify whether a tumor is malignant or
benign based on these features.

Here is the Python code to implement these feature selection methods on the Breast Cancer Wisconsin
(Diagnostic) Dataset -

import pandas as
[from [Link] import load_diabetes^^^^^M
from sklearn.feature_selection import SelectKBest, chi2
from [Link] import LogisticRegression

Load the datasel


diabetes = pd.read_csv(r'C:\Users\Leekha\Desktop\[Link]')

| diabc. a

X_ n e w s e i e c t o r. fi t_t r a n s fo

X_train, X_test, y_train, y_test = train_test_split(X_new, y, test_size=0.3, random_state=42)

elf = LogisticRegressionQ^^^^^^^^^^^^^^^M

accuracy = [Link](X_test,
print("Accuracy using univariate feature selection: {:.2f}".format(accuracy))
estimator =
selector - RFECV(estimator, step=l,

scores = cross_val_score(LogisticRegression(), X_new, y,


print("Accuracy using RFECV feature selection: %0.2f (+/- %0.2f)" % ([Link], [Link] * 2))

scores = cross_val_score(LogisticRegression(), X_new, y,


printf Accuracy using PCA feature selection: %0.2f (+/- %0.2f)" % ([Link], [Link] * 2))

Output
When you execute this code, it will produce the following output on the terminal -

Accuracy using univariate feature selection: 0.74


Accuracy using RFECV feature selection: 0.77 (+/- 0.03)
Accuracy using PCA feature selection: 0.75 (+/- 0.07)
Feature Extraction
Feature extraction is often used in image processing, speech recognition, natural language processing, and
other applications where the raw data is high-dimensional and difficult to work with.

Example
Here is an example of how to perform feature extraction using Principal Component Analysis (PCA) on the
Iris Dataset using Python -

from [Link] import loadjrisJ


from [Link] import PCA
import [Link] as plt^^^^H

pea = PCA(n_components = 2)^^^^B


X_pca = pca.fit_transform([Link])

tr Visualize the transformed data


pl^catterQ^ca[^]^^ca^3]^^ri^arget)

In this code, we first import the necessary libraries, including sklearn for performing feature extraction
using PCA and matplotlib for visualizing the transformed data.

Next, we load the Iris Dataset using load_iris(). We then perform feature extraction using PCA with PCA()
and set the number of components to 2 (n_components=2). This reduces the dimensionality of the input
data from 4 features to 2 principal components.

We then transform the input data using fit_transform() and store the transformed data in X_pca. Finally,
we visualize the transformed data using [Link]() and color the data points based on their target value.
We label the axes as PCI and PC2, which are the first and second principal components, respectively, and
show the plot using [Link].

Output
When you execute the given program, it will produce the following plot as the output -
1.5 i

-3-2-10 1 2
PCI

Advantages of Feature Extraction


Following are the advantages of using Feature Extraction -

1. Reduced Dimensionality - Feature extraction reduces the dimensionality of the input data
by transforming it into a new set of features. This makes the data easier to visualize, process
and analyze.
2. Improved Performance - Feature extraction can improve the performance of machine learn­
ing algorithms by creating a set of more meaningful features that capture the essential infor­
mation from the input data.
3. Feature Selection - Feature extraction can be used to perform feature selection by selecting
a subset of the most relevant features that are most informative for the machine learning
model.
4. Noise Reduction - Feature extraction can also help reduce noise in the data by filtering out
irrelevant features or combining related features.

Disadvantages of Feature Extraction


Following are the disadvantages of using Feature Extraction -

1. Loss of Information - Feature extraction can result in a loss of information as it involves


reducing the dimensionality of the input data. The transformed data may not contain all the
information from the original data, and some information may be lost in the process.
2. Overfitting - Feature extraction can also lead to overfitting if the transformed features are
too complex or if the number of features selected is too high.
3. Complexity - Feature extraction can be computationally expensive and time-consuming,
especially when dealing with large datasets or complex feature extraction techniques such as
deep learning.
4. Domain Expertise - Feature extraction requires domain expertise to select and transform the
features effectively. It requires knowledge of the data and the problem at hand to choose the
right features that are most informative for the machine learning model.

Backward Elimination

Backward Elimination is a feature selection technique used in machine learning to select the most signifi­
cant features for a predictive model. In this technique, we start by considering all the features initially, and
then we iteratively remove the least significant features until we get the best subset of features that gives
the best performance.

Implementation in Python
To implement Backward Elimination in Python, you can follow these steps -

Import the necessary libraries: pandas, numpy, and [Link].

Import [Link] as sm

Load your dataset into a Pandas DataFrame. We will be using Pima-Indians-Diabetes dataset
diabetes = pd.read_csv(r'C:\Users\Leekha\Desktop\[Link]')

Define the predictor variables (X) and the target variable (y).

^atasetiloch^^hzalues
= [Link]:, -1 Lvalues!

Add a column of ones to the predictor variables to represent the intercept.

X = [Link](arr = [Link]((len(X), l)).astype(int), values = X, axis = 1)

Use the Ordinary Least Squares (OLS) method from the statsmodels library to fit the multiple linear regres­
sion model with all the predictor variables.

regressor_OLS = [Link](endog = y, exog = X_opt).fit()

Check the p-values of each predictor variable and remove the one with the highest p-value (i.e., the least
significant).

egres sor_OLS. summary(

Repeat steps 5 and 6 until all the remaining predictor variables have a p-value below the significance level
(e.g., 0.05).
|regressor_uLd - [Link]^enaog = y, exog = A_opr;.nr()
egressor_OLS.summa

The final subset of predictor variables with p-values below the significance level is the optimal set of fea­
tures for the model.

Example
Here is the complete implementation of Backward Elimination in Python -

pandas i; d
import numpv as np
import [Link] as sm

# Load the diabetes dataset


diabetes = pd.read_csv(r'C:\Users\Leekha\Desktop\[Link]')

frDehn^hepredictouvariable^XMn^h^argenzariabl^v
X = [Link][:,
J L'.i ■i ■>./h v :

#Ad^^olumiw^ne^^hepredicto^anable^^epresen^h^ntercept
X = [Link](arr = [Link]((len(X), l)).astype(int), values = X, axis=l)M

# Fit the multiple linear regression model with all the predictor variables

regressor_ujLd = [Link],enaog = y, exog = A_.[Link]

^hecl^hei^alue^feachpredictomzariabl^mc^emov^h^me

# Repeat the above step until all the remaining predictor variables!

regressoi^OL^^>nWLS(endo£^y^xoE^^^pt)^0
regressoi^OL$^^nWLS(endo: = y, exog = X_opt).fit()

regressor_OLS = [Link](endo) ; = y, exog = X_opt).fit()


egressor_OLS.summary()

regmssoi^OL$^^nWLS(endo ; = y, exog = X_opt).fit()

Output
When you execute this program, it will produce the following output
OLS Regression Results

Dep. Variable: V R-squared: 0.112


Model: OLS Adj. R-squared: 0.106
Method: Least Squares F-statistic: 19.22
Date: Thu, 27 Apr 2023 Prob (F-statistic) 4.81e-18
Time: 17:17:34 Log-Likelihood: -473.94
No. Observations 767 AIC: 959.9
Df Residuals: 761 BIC: 987.7
Of Model: 5
Covariance Type: nonrobust

coef std err t P>|t| [0.025 0.975]

const -0.0693 0.072 -0.960 0.337 -0.211 0.072


xl 0.0214 0.006 3.707 0.000 0.010 0.033
x2 -0.0003 0.001 -0.316 0.752 -0.002 0.001
x3 0.0005 0.000 3.497 0.000 0.000 0.001
X4 0.2171 0.050 4.334 0.000 0.119 0.315
x5 0.0064 8.002 3.770 0.000 0.003 0.01O|

Omnibus: 521.181 Durbin-Watson: 2.006


Prob(Omnibus): 0.000 Jarque-Bera (JB): 82.038
Skew: 0.538 Prob(JB): 1.53e-18
Kurtosis: 1.813 Cond. No. 678.
Forward Feature Construction
Forward Feature Construction is a feature selection method in machine learning where we start with an
empty set of features and iteratively add the best performing feature at each step until the desired number
of features is reached.

The goal of feature selection is to identify the most important features that are relevant for predicting the
target variable, while ignoring the less important features that add noise to the model and may lead to
overfitting.

The steps involved in Forward Feature Construction are as follows -

1. Initialize an empty set of features.


2. Set the maximum number of features to be selected.
3. Iterate until the desired number of features is reached -
a. For each remaining feature that is not already in the set of selected features, fit a
model with the selected features and the current feature, and evaluate its perfor­
mance using a validation set.
b. Select the feature that leads to the best performance and add it to the set of se­
lected features.
4. Return the set of selected features as the optimal set for the model.
The key advantage of Forward Feature Construction is that it is computationally efficient and can be used
for high-dimensional datasets. However, it may not always lead to the optimal set of features, especially
if there are highly correlated features or non-linear relationships between the features and the target
variable.

Example
Here is an example to implement Forward Feature Construction in Python -

import pandas as

from sklearn.model_selection import train_test_split


from sklearn.linear_model import LinearRegression|
from [Link] import mean_squarecTerrorM

diabetes = pd.read_csv(r'C:\Users\Leekha\Desktop\[Link]')

X = [Link][:,

# Split the data into training and testing sets


X_train, X_test, y_train, y_test - train_test_split(X, y, test_size = 0.2, random_state = 0)

s e 1 e c t e ci_fe atu res -

while len(selected_features) < max_features-^^^^H

# Set the best feature and the best score to be 0


best_feature = None
besLscore^^^^B

# Iterate over all the remaining features


for i in range(X_train.shape[

# Skip the feature if it's already selected


if i in selected_features:

X_train_selected = X_train[:, list(selected_features) + [i]]^M


X_test_selected = X_test[:, list(selected_features) + [i]]

# Update the best feature and score if the current feature performs better
if score >

al
# Add the best feature to the set of selected features

3rint the selected features and the scoret^^®®


rint('Selected Features:', list(selected_features))
^^^^^^Te^^core)HHHBHHBB|

Output
On execution, it will produce the following output -
Selected Features: [1]
Score: 0.23530716168783583
Selected Features: [0,1]
Score: 0.2923143573608237
Selected Features: [0,1, 5]
Score: 0.3164103491569179
Selected Features: [0,1, 5, 6]
Score: 0.3287368302427327
Selected Features: [0,1, 2, 5, 6]
Score: 0.334586804842275
Selected Features: [0,1, 2, 3, 5, 6]
Score: O.335626473655O455
Selected Features: [0,1, 2, 3,4, 5, 6]
Score: 0.3313166516703744
Selected Features: [0,1, 2, 3, 4, 5, 6, 7]
Score: 0.32230203252064216

High Correlation Filter

High Correlation Filter is a feature selection technique used in machine learning to identify and remove
highly correlated features from the dataset. This technique is used to improve the performance of the
model by reducing the number of features used for training the model and to avoid the problem of multi­
collinearity, which occurs when two or more predictor variables are highly correlated with each other.

The High Correlation Filter works by computing the correlation between each pair of features in the
dataset and removing one of the two features that are highly correlated with each other. This is done by
setting a threshold for the correlation coefficient between the features, and removing one of the features if
the absolute value of the correlation coefficient is greater than the threshold.

The steps involved in implementing High Correlation Filter are as follows -

1. Compute the correlation matrix for the dataset.


2. Set a threshold for the correlation coefficient between the features.
3. Find the pairs of features that have a correlation coefficient greater than the threshold.
4. Remove one of the two features from each pair of highly correlated features.
5. Use the remaining features for training the machine learning model.

The advantage of using High Correlation Filter is that it reduces the number of features used for training
the model, which in turn reduces the complexity of the model and makes it easier to interpret. Moreover, it
helps to avoid the problem of multicollinearity, which can lead to unstable and unreliable estimates of the
model parameters.

However, there are some limitations to High Correlation Filter. For example, it may not always select the
best set of features for the model, especially if there are non-linear relationships between the features and
the target variable. Also, if two features are highly correlated, removing one of them may result in the loss
of some important information that was present in the removed feature.
Example
Here is an example to implement High Correlation Filter in Python -

import pandas as
nunwv

diabetes = pd.read_csv(r'C:\Users\Leekha\Desktop\[Link]')

X = [Link][:,
JI. PC..... :■■■. i tj c ■: ; ckc:

|corr_matrix = [Link](X, rowvar=False)

thresholc^^^^^^^^^^^^^^^H

# Fmd the indices of the highly correlated features


high_corr_indices = [Link]([Link](corr_matrix) > threshold)
cs._ ro._ren:G\'e

for i, j in zip(*high_corr_indices)i^^^^^^^^^^^^^^^M
if i != j and (j, i) not in features_to_removei^^^^^^^^^B

features_to_remove = list(features_to_remove)

X_filtered = [Link](X, [j for i, j in features_to_remove], axis=l)^^^^^^M

printfShape of the filtered dataset:1, X_filtered.shape)

Output
When you execute this code, it will produce the following output -

Shape of the filtered dataset: (768, 8)


Advantages of High Correlation Filter
Following are the advantages of using High Correlation Filter -

1. Reduces multicollinearity - The High Correlation Filter can reduce multicollinearity, which
occurs when two or more features are highly correlated with each other. Multicollinearity can
negatively impact the performance of machine learning models.
2. Improves model performance - By removing highly correlated features, the High Correla­
tion Filter can improve the performance of machine learning models.
3. Simplifies the model - With fewer features, the model can be easier to interpret and under­
stand.
4. Saves computational resources - With fewer features, the computational resources required
to train machine learning models are reduced.

Disadvantages of High Correlation Filter


Following are the disadvantages of using High Correlation Filter -

1. Information loss - The High Correlation Filter can lead to information loss because it re­
moves features that may contain important information.
2. Affects non-linear relationships - The High Correlation Filter assumes that the relationships
between the features are linear. It may not work well for datasets where the relationships be­
tween the features are non-linear.
3. Impact on the dependent variable - Removing highly correlated features can sometimes
have a negative impact on the dependent variable, particularly if the features are strongly cor­
related with the dependent variable.
4. Selection bias - The High Correlation Filter may introduce selection bias if it removes fea­
tures that are important for predicting the dependent variable.

Low Variance Filter


Low Variance Filter is a feature selection technique used in machine learning to identify and remove low
variance features from the dataset. This technique is used to improve the performance of the model by
reducing the number of features used for training the model and to remove the features that have little or
no discriminatory power.

The Low Variance Filter works by computing the variance of each feature in the dataset and removing the
features that have a variance below a certain threshold. This is done because features with low variance
have little or no discriminatory power and are unlikely to be useful for predicting the target variable.

The steps involved in implementing Low Variance Filter are as follows -


1. Compute the variance of each feature in the dataset.
2. Set a threshold for the variance of the features.
3. Remove the features that have a variance below the threshold.
4. Use the remaining features for training the machine learning model.

Example
Here is an example to implement Low Variance Filter in Python -

import pandas as
nuniPV —

diabetes = pd.read_csv(r'C:\Users\Leekha\Desktop\[Link]')

X = [Link][:,

Set the threshold for the variance of the features


low_var_indices = [Link](variances < threshold)^

print('Shape of the filtered dataset:', X_filtered.shape)

Output
When you execute this code, it will produce the following output -

Shape of the filtered dataset: (768, 8)

Advantages of Low Variance Filter


Following are the advantages of using Low Variance Filter -

1. Reduces overfitting - The Low Variance Filter can help reduce overfitting by removing fea­
tures that do not contribute much to the prediction of the target variable.
2. Saves computational resources - With fewer features, the computational resources required
to train machine learning models are reduced.
3. Improves model performance - By removing low variance features, the Low Variance Filter
can improve the performance of machine learning models.
4. Simplifies the model - With fewer features, the model can be easier to interpret and under­
stand.

Disadvantages of Low Variance Filter


Following are the disadvantages of using Low Variance Filter -

1. Information loss - The Low Variance Filter can lead to information loss because it removes
features that may contain important information.
2. Affects non-linear relationships - The Low Variance Filter assumes that the relationships
between the features are linear. It may not work well for datasets where the relationships be­
tween the features are non-linear.
3. Impact on the dependent variable - Removing low variance features can sometimes have a
negative impact on the dependent variable, particularly if the features are important for pre­
dicting the dependent variable.
4. Selection bias - The Low Variance Filter may introduce selection bias if it removes features
that are important for predicting the dependent variable.
Missing Values Ratio
Missing Values Ratio is a feature selection technique used in machine learning to identify and remove
features from the dataset that have a high percentage of missing values. This technique is used to improve
the performance of the model by reducing the number of features used for training the model and to avoid
the problem of bias caused by missing values.

The Missing Values Ratio works by computing the percentage of missing values for each feature in the
dataset and removing the features that have a missing value percentage above a certain threshold. This is
done because features with a high percentage of missing values may not be useful for predicting the target
variable and can introduce bias into the model.

The steps involved in implementing Missing Values Ratio are as follows -

1. Compute the percentage of missing values for each feature in the dataset.
2. Set a threshold for the percentage of missing values for the features.
3. Remove the features that have a missing value percentage above the threshold.
4. Use the remaining features for training the machine learning model.

Example
Here is an example of how you can implement Missing Values Ratio in Python -
[diabetes = [Link](r'C:\Users\Leekha\Desktop\[Link]', delimiter=',')

Define the predictor variables (X) and the target variable (y)
= diabetes[:, :-l
= diabetes!:, -1]|

# Find the indices ot the features with a massing value percentage


above the threshol
high_missing_indices = [i for i, percentage in enumerate(missing_percentages) if percentage > threshold]

ff Remove the high massing value features from the dataset


.filtered = [Link](X, high_missing_indices, axis=l)^B

printfShape of the filtered dataset:', X_filtered.shape)


The above code performs Missing Values Ratio on the diabetes dataset and removes the features that have
a missing value percentage above the threshold.

Output
When you execute this code, it will produce the following output -

Shape of the filtered dataset: (769, 8)

Advantages of Missing Value Ratio


Following are the advantages of using Missing Value Ratio -

1. Saves computational resources - With fewer features, the computational resources required
to train machine learning models are reduced.
2. Improves model performance - By removing features with a high percentage of missing val­
ues, the Missing Value Ratio can improve the performance of machine learning models.
3. Simplifies the model - With fewer features, the model can be easier to interpret and under­
stand.
4. Reduces bias - By removing features with a high percentage of missing values, the Missing
Value Ratio can reduce bias in the model.
Disadvantages of Missing Value Ratio
Following are the disadvantages of using Missing Value Ratio -

1. Information loss - The Missing Value Ratio can lead to information loss because it removes
features that may contain important information.
2. Affects non-missing data - Removing features with a high percentage of missing values can
sometimes have a negative impact on non-missing data, particularly if the features are impor­
tant for predicting the dependent variable.
3. Impact on the dependent variable - Removing features with a high percentage of missing
values can sometimes have a negative impact on the dependent variable, particularly if the
features are important for predicting the dependent variable.
4. Selection bias - The Missing Value Ratio may introduce selection bias if it removes features
that are important for predicting the dependent variable.

Principal Component Analysis


Principal Component Analysis (PCA) is a popular unsupervised dimensionality reduction technique in
machine learning used to transform high-dimensional data into a lower-dimensional representation. PCA
is used to identify patterns and structure in data by discovering the underlying relationships between
variables. It is commonly used in applications such as image processing, data compression, and data
visualization.

PCA works by identifying the principal components (PCs) of the data, which are linear combinations of
the original variables that capture the most variation in the data. The first principal component accounts
for the most variance in the data, followed by the second principal component, and so on. By reducing the
dimensionality of the data to only the most significant PCs, PCA can simplify the problem and improve the
computational efficiency of downstream machine learning algorithms.

The steps involved in PCA are as follows -

1. Standardize the data - PCA requires that the data be standardized to have zero mean and unit
variance.
2. Compute the covariance matrix - PCA computes the covariance matrix of the standardized
data.
3. Compute the eigenvectors and eigenvalues of the covariance matrix - PCA then computes
the eigenvectors and eigenvalues of the covariance matrix.
4. Select the principal components - PCA selects the principal components based on their cor­
responding eigenvalues, which indicate the amount of variation in the data explained by each
component.
5. Project the data onto the new feature space - PCA projects the data onto the new feature
space defined by the selected principal components.
Example
Here is an example of how you can implement PC A in Python using the scikit-learn library

from [Link] import PCA

from [Link] import loadjris

►;<^standardized = (X - [Link](X, axis=O)) / [Link](X, axis=O)

pea -= PCA(n-Components =2)^^^B


X_pca = pca.fit_transform(X_standardized)

# Print the explained variance ratio of the selected components


printfExplained variance ratio:', pca.explained_variance_ratio_)

import [Link] as plt^^M


pl^catte^}^ca[^]^^cab^^=y)

In this example, we load the iris dataset, standardize the data, and create a PCA object with two compo­
nents. We then fit the PCA object to the standardized data and transform the data onto the two principal
components. We print the explained variance ratio of the selected components and plot the transformed
data using the first two principal components as the x and y axes.

Output
When you execute this code, it will produce the following plot as the output -
2 -

<n
0-

-2 -

■"T” T 1

-3 -2 0 2 3
PCI
Explained variance ratio: [0.72962445 0.22850762]

Advantages of PCA
Following are the advantages of using Principal Component Analysis -

1. Reduces dimensionality - PCA is particularly useful for high-dimensional datasets because


it can reduce the number of features while retaining most of the original variability in the
data.
2. Removes correlated features - PC A can identify and remove correlated features, which can
help improve the performance of machine learning models.
3. Improves interpretability - The reduced number of features can make it easier to interpret
and understand the data.
4. Reduces overfitting - By reducing the dimensionality of the data, PCA can reduce overfitting
and improve the generalizability of machine learning models.
5. Speeds up computation - With fewer features, the computation required to train machine
learning models is faster.

Disadvantages of PCA
Following are the disadvantages of using Principal Component Analysis -

1. Information loss - PCA reduces the dimensionality of the data by projecting it onto a lower­
dimensional space, which may lead to some loss of information.
2. Can be sensitive to outliers - PCA can be sensitive to outliers, which can have a significant
impact on the resulting principal components.
3. Interpretability may be reduced - Although PCA can improve interpretability by reducing
the number of features, the resulting principal components may be more difficult to interpret
than the original features.
4. Assumes linearity - PCA assumes that the relationships between the features are linear,
which may not always be the case.
5. Requires standardization - PCA requires that the data be standardized, which may not al­
ways be possible or appropriate.

Performance Metrics

Performance metrics in machine learning are used to evaluate the performance of a machine learning
model. These metrics provide quantitative measures to assess how well a model is performing and to
compare the performance of different models. Performance metrics are important because they help us
understand how well our model is performing and whether it is meeting our requirements. In this way, we
can make informed decisions about whether to use a particular model or not.

There are many performance metrics that can be used in machine learning, depending on the type of
problem being solved and the specific requirements of the problem. Some common performance metrics
include -

1. Accuracy - Accuracy is one of the most basic performance metrics and measures the propor­
tion of correctly classified instances in the dataset. It is calculated as the number of correctly
classified instances divided by the total number of instances in the dataset.
2. Precision - Precision measures the proportion of true positive instances out of all predicted
positive instances. It is calculated as the number of true positive instances divided by the sum
of true positive and false positive instances.
3. Recall - Recall measures the proportion of true positive instances out of all actual positive
instances. It is calculated as the number of true positive instances divided by the sum of true
positive and false negative instances.
4. Fl Score - Fl score is the harmonic mean of precision and recall. It is a balanced measure that
takes into account both precision and recall. It is calculated as 2 * (precision * recall) I (preci­
sion + recall).
5. ROC AUC Score - ROC AUC (Receiver Operating Characteristic Area Under the Curve) score is
a measure of the ability of a classifier to distinguish between positive and negative instances.
It is calculated by plotting the true positive rate against the false positive rate at different clas­
sification thresholds and calculating the area under the curve.
6. Confusion Matrix - A confusion matrix is a table that is used to evaluate the performance of a
classification model. It shows the number of true positives, true negatives, false positives, and
false negatives for each class in the dataset.

Example
Here is an example code snippet to calculate the accuracy, precision, recall, and Fl score for a binary clas­
sification problem -

from [Link] import

from sklearn.linear_model import LogisticRegression^^^^^^^^^^^^^^^B


from [Link] import accuracy_score, precision_score, recalLscore, fl_score
X = [Link]^^^^M
v in; ,i n

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LogisticRegressionQ^^^^^^^^^^^^M

# Make predictions on the test set


y_pred = [Link](X_test)

accuracy = accuracy_score(y_test, y_pred)^^^^^^^^^B


precision = precision_score(y_test, y_pred, average='macro')
recall = recall_score(y_test, y_pred, average='macro5^^^M
fl = fl score(y_test, y pred, average='macrh)^^^^^^M

print("Precision:", precision)^B
12 rint("Recall:", recall)
nnt(?Scom^l)|

Output
When you execute this code, it will produce the following output -

Accuracy: 1.0
Precision: 1.0
Recall: 1.0
Fl Score: 1.0

Automatic Workflows
Introduction
In order to execute and produce results successfully, a machine learning model must automate some stan­
dard workflows. The process of automate these standard workflows can be done with the help of Scikit-
learn Pipelines. From a data scientist’s perspective, pipeline is a generalized, but very important concept. It
basically allows data flow from its raw format to some useful information. The working of pipelines can be
understood with the help of following diagram -
The blocks of ML pipelines are as follows -

Data ingestion - As the name suggests, it is the process of importing the data for use in ML project. The
data can be extracted in real time or batches from single or multiple systems. It is one of the most challeng­
ing steps because the quality of data can affect the whole ML model.

Data Preparation - After importing the data, we need to prepare data to be used for our ML model. Data
preprocessing is one of the most important technique of data preparation.

ML Model Training - Next step is to train our ML model. We have various ML algorithms like supervised,
unsupervised, reinforcement to extract the features from data, and make predictions.
Model Evaluation - Next, we need to evaluate the ML model. In case of AutoML pipeline, ML model can be
evaluated with the help of various statistical methods and business rules.

ML Model retraining - In case of AutoML pipeline, it is not necessary that the first model is best one. The
first model is considered as a baseline model and we can train it repeatably to increase model’s accuracy.

Deployment - At last, we need to deploy the model. This step involves applying and migrating the model
to business operations for their use.

Challenges Accompanying ML Pipelines


In order to create ML pipelines, data scientists face many challenges. These challenges fall into the follow­
ing three categories -

Quality of Data
The success of any ML model depends heavily on the quality of data. If the data we are providing to ML
model is not accurate, reliable and robust, then we are going to end with wrong or misleading output.

Data Reliability
Another challenge associated with ML pipelines is the reliability of data we are providing to the ML model.
As we know, there can be various sources from which data scientist can acquire data but to get the best re­
sults, it must be assured that the data sources are reliable and trusted.
Data Accessibility
To get the best results out of ML pipelines, the data itself must be accessible which requires consolidation,
cleansing and curation of data. As a result of data accessibility property, metadata will be updated with
new tags.

Modelling ML Pipeline and Data Preparation


Data leakage, happening from training dataset to testing dataset, is an important issue for data scientist to
deal with while preparing data for ML model. Generally, at the time of data preparation, data scientist uses
techniques like standardization or normalization on entire dataset before learning. But these techniques
cannot help us from the leakage of data because the training dataset would have been influenced by the
scale of the data in the testing dataset.

By using ML pipelines, we can prevent this data leakage because pipelines ensure that data preparation like
standardization is constrained to each fold of our cross-validation procedure.

Example
The following is an example in Python that demonstrate data preparation and model evaluation workflow.
For this purpose, we are using Pima Indian Diabetes dataset from Sklearn. First, we will be creating pipe­
line that standardized the data. Then a Linear Discriminative analysis model will be created and at last the
pipeline will be evaluated using 10-fold cross validation.
First, import the required packages as follows -

from pandas import

from [Link] import StandardScaler^^^^^^^^^^^B


from [Link] importPipelme^^^^^^^^^^^^^^^^^^B
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

Now, we need to load the Pima diabetes dataset as did in previous examples -

path - r"C:\[Link]"

Next, we will create a pipeline with the help of the following code -

[estimators -
[Link](('standardize', StandardScalerQ))^^
[Link](('lda', LinearDiscriminantAnalysisQ))
model Pipeline:
At last, we are going to evaluate this pipeline and output its accuracy as follows -

[results = cross_val_score(model, X, Y, cv=kfold)

Output
0.7790148448043184

The above output is the summary of accuracy of the setup on the dataset.

Modelling ML Pipeline and Feature Extraction


Data leakage can also happen at feature extraction step of ML model. That is why feature extraction
procedures should also be restricted to stop data leakage in our training dataset. As in the case of data
preparation, by using ML pipelines, we can prevent this data leakage also. FeatureUnion, a tool provided by
ML pipelines can be used for this purpose.

Example
The following is an example in Python that demonstrates feature extraction and model evaluation work­
flow. For this purpose, we are using Pima Indian Diabetes dataset from Sklearn.
First, 3 features will be extracted with PCA (Principal Component Analysis). Then, 6 features will be
extracted with Statistical Analysis. After feature extraction, result of multiple feature selection and extrac­
tion procedures will be combined by using

FeatureUnion tool. At last, a Logistic Regression model will be created, and the pipeline will be evaluated
using 10-fold cross validation.

First, import the required packages as follows -

from pandas import read_csv^^^^^^^^^^^^M

from [Link] selection import cross val scorel


from [Link] import Pipelme^^^^^^^^M
from [Link] import FeatureUmon^^^^M
from sklearn.linear_model import LogisticRegression
from [Link] import
from sklearn.feature_selection import SelectKBest^B

Now, we need to load the Pima diabetes dataset as did in previous examples -

path = r"C:\[Link]"^^^^^^^^^^^^^^^^M
|headernames = ['preg1, 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class'
Next, feature union will be created as follows -

[features =
features. append(('pca', PCA(n_components=3)))|
[Link](('select_best', SelectKBest(k=6)))
feature_union Fe a t ureU mo n (fe a

Next, pipeline will be creating with the help of following script lines -

[estimators =
[Link](('feature_union', feature_union))
[Link](('logistic', LogisticRegression()))[
model Pipeline(estimaLors)^^^^^^^^^^^^B

At last, we are going to evaluate this pipeline and output its accuracy as follows -

[results = cross_val_score(model, X, Y, cv=kfold)

Output
O.7789811O66126855
The above output is the summary of accuracy of the setup on the dataset.

Boost Model Performance

Boosting is a popular ensemble learning technique that combines several weak learners to create a strong
learner. It works by iteratively training weak learners on subsets of the data and assigning higher weights
to the misclassified samples to increase their importance in the subsequent iterations. This process is re­
peated until the desired level of performance is achieved.

Here are some techniques to boost model performance in machine learning -

1. Feature Engineering - Feature engineering involves creating new features from the existing
features or transforming the existing features to make them more informative for the model.
This can include techniques such as one-hot encoding, scaling, normalization, and feature
selection.
2. Hyperparameter Tuning - Hyperparameters are parameters that are not learned during
training but are set by the data scientist. They control the behavior of the model, and tuning
them can significantly impact model performance. Grid search and randomized search are
common techniques for hyperparameter tuning.
3. Ensemble Learning - Ensemble learning involves combining multiple models to improve
performance. Techniques such as bagging, boosting, and stacking can be used to create en­
sembles. Random forests are an example of a bagging ensemble, while gradient boosting ma­
chines (GBMs) are an example of a boosting ensemble.
4. Regularization - Regularization is a technique that prevents overfitting by adding a penalty
term to the loss function. LI regularization (Lasso) and L2 regularization (Ridge) are common
techniques used in linear models, while dropout is a technique used in neural networks.
5. Data Augmentation - Data augmentation involves generating new data from the existing
data by applying transformations such as rotation, scaling, and flipping. This can help to re­
duce overfitting and improve model performance.
6. Model Architecture - The architecture of the model can significantly impact its performance.
Techniques such as deep learning and convolutional neural networks (CNNs) can be used to
create more complex models that are better able to learn complex patterns in the data.
7. Early Stopping - Early stopping is a technique used to prevent overfitting by stopping the
training process once the model performance stops improving on a validation set. This pre­
vents the model from continuing to learn the noise in the data and can help to improve
generalization.
8. Cross-Validation - Cross-validation is a technique used to evaluate the performance of a
model on multiple subsets of the data. This can help to identify overfitting and can be used to
select the best hyperparameters for the model.

These techniques can be implemented in Python using various machine learning libraries such as scikit-
learn, TensorFlow, and Keras. By using these techniques, data scientists can improve the performance of
their models and create more accurate predictions.
The following example below in which implement cross-validation using Scikit-learn -

Example

X = [Link]^^^^M
v i<i -

gb_clf = GradientBoostingClassifierQl

pnntCAccumcy^^^^^?^O^Hscom^nean0^com^td(^^))^^^^^B

Output
When you execute this code, it will produce the following output -

Accuracy: 0.96 (+/- 0.07)

Performance Improvement with Ensembles


Ensembles can give us boost in the machine learning result by combining several models. Basically, ensem­
ble models consist of several individually trained supervised learning models and their results are merged
in various ways to achieve better predictive performance compared to a single model. Ensemble methods
can be divided into following two groups -

Sequential ensemble methods


As the name implies, in these kind of ensemble methods, the base learners are generated sequentially. The
motivation of such methods is to exploit the dependency among base learners.

Parallel ensemble methods


As the name implies, in these kind of ensemble methods, the base learners are generated in parallel. The
motivation of such methods is to exploit the independence among base learners.

Ensemble Learning Methods


The following are the most popular ensemble learning methods i.e. the methods for combining the predic-
tions from different models -

Bagging
The term bagging is also known as bootstrap aggregation. In bagging methods, ensemble model tries to
improve prediction accuracy and decrease model variance by combining predictions of individual models
trained over randomly generated training samples. The final prediction of ensemble model will be given by
calculating the average of all predictions from the individual estimators. One of the best examples of bag­
ging methods are random forests.

Boosting
In boosting method, the main principle of building ensemble model is to build it incrementally by training
each base model estimator sequentially. As the name suggests, it basically combine several week base
learners, trained sequentially over multiple iterations of training data, to build powerful ensemble. During
the training of week base learners, higher weights are assigned to those learners which were misclassified
earlier. The example of boosting method is AdaBoost.

Voting
In this ensemble learning model, multiple models of different types are built and some simple statistics,
like calculating mean or median etc., are used to combine the predictions. This prediction will serve as the
additional input for training to make the final prediction.
Bagging Ensemble Algorithms
The following are three bagging ensemble algorithms -

Bagged Decision Tree


As we know that bagging ensemble methods work well with the algorithms that have high variance and,
in this concern, the best one is decision tree algorithm. In the following Python recipe, we are going to
build bagged decision tree ensemble model by using Baggingclassifier function of sklearn with Decision-
TreeClasifier (a classification & regression trees algorithm) on Pima Indians diabetes dataset.

First, import the required packages as follows -

from pandas import read_csv^^^^^^^^^^^^B

from sklearn.model_selection import cross_val_score


from [Link] import BaggingClassifier^^
from [Link] import DecisionTreeClassifier^^B

Now, we need to load the Pima diabetes dataset as we did in the previous examples -

path - r"C:\[Link]"
headernames = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class'
data = read_csv(path, names=headernames)
array = datavalues^^^^^^^^^^^^^M

Next, give the input for 10-fold cross validation as follows -

kfold - KFold(n_splits=10, random_state=seed)


carl r >a c i s i; a df a a a C d cad

We need to provide the number of trees we are going to build. Here we are building 150 trees -

hum_trees =150

Next, build the model with the help of following script -

model = BaggingClassifier(base_estimator=cart, n_estimators=num_trees, random_state=seed)

Calculate and print the result as follows -

[results = cross_val_score(model, X, Y, cv=kfold)


cord a.;
Output
0.7733766233766234

The output above shows that we got around 77% accuracy of our bagged decision tree classifier model.

Random Forest
It is an extension of bagged decision trees. For individual classifiers, the samples of training dataset are
taken with replacement, but the trees are constructed in such a way that reduces the correlation between
them. Also, a random subset of features is considered to choose each split point rather than greedily choos­
ing the best split point in construction of each tree.

In the following Python recipe, we are going to build bagged random forest ensemble model by using Ran-
domForestClassifier class of sklearn on Pima Indians diabetes dataset.

First, import the required packages as follows -

from pandas import read_csv^^^^^^^^^^^^^B


■ i.! • 1 ■
from sklearn.model_selection import cross_val_score|
from [Link] import RandomForestClassifier

Now, we need to load the Pima diabetes dataset as did in previous examples -
It ath = r"C:\[Link]"
headernames = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']

[array =

[i

Next, give the input for 10-fold cross validation as follows -

kfold = KFold(n_splits=10, random_state=seed)

We need to provide the number of trees we are going to build. Here we are building 150 trees with split
points chosen from 5 features -

|num_trees =150
max_features = 5

Next, build the model with the help of following script -

[model = RandomForestClassifier(n_estimators=num_treeS; max_features=max_features)

Calculate and print the result as follows -


[results = cross_val_score(model, X, Y, cv^kfold)

Output
0.7629357484620642

The output above shows that we got around 76% accuracy of our bagged random forest classifier model.

Extra Trees
It is another extension of bagged decision tree ensemble method. In this method, the random trees are
constructed from the samples of the training dataset.

In the following Python recipe, we are going to build extra tree ensemble model by using ExtraTreesClassi-
fier class of sklearn on Pima Indians diabetes dataset.

First, import the required packages as follows -

from pandas import read_csv^^^^^^^^^^^^B

from sklearn.model_selection import cross_val_score


from [Link] import ExtraTreesClassifierH
Now, we need to load the Pima diabetes dataset as did in previous examples -

path = r"C:\[Link]"^^^^^^^^^^^^^^^^^B
pheadernames = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']

array =

Next, give the input for 10-fold cross validation as follows -

kfold = KFold(n_splits=10, random_state=seed)

We need to provide the number of trees we are going to build. Here we are building 150 trees with split
points chosen from 5 features -

|num_trees = 150
max_features = 5

Next, build the model with the help of following script -

[model = ExtraTreesClassifier(n_estimators=num_trees, max_features=max_features)


Calculate and print the result as follows -

[results = cross_val_score(model, X, Y, cv=kfold)

Output
0.7551435406698566

The output above shows that we got around 75.5% accuracy of our bagged extra trees classifier model.

Boosting Ensemble Algorithms


The followings are the two most common boosting ensemble algorithms -

AdaBoost
It is one the most successful boosting ensemble algorithm. The main key of this algorithm is in the way
they give weights to the instances in dataset. Due to this the algorithm needs to pay less attention to the in­
stances while constructing subsequent models.

In the following Python recipe, we are going to build Ada Boost ensemble model for classification by using
AdaBoostClassifier class of sklearn on Pima Indians diabetes dataset.

First, import the required packages as follows -


from pandas import read_csv^^^^^^^^^^^^B
■ . ' । .■ i ' '
from sklearn.model_selection import cross_val_score
from [Link] import AdaBoostClassifier^B

Now, we need to load the Pima diabetes dataset as did in previous examples -

path = r"C:\[Link]"
headernames = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
data = read_csv(path, names=headernames)
[array = [Link]^^^^^^^^^^^^^l

Next, give the input for 10-fold cross validation as follows -

kfold = KFold(n_splits=10, random_state=seed)

We need to provide the number of trees we are going to build. Here we are building 150 trees with split
points chosen from 5 features -

num_trees = 50
Next, build the model with the help of following script -

model = AdaBoostClassifier(n_estimators=num_trees, random_state=seed)

Calculate and print the result as follows -

[results = cross_val_score(modelj X, Y, cv=kfold)

Output
0.7539473684210527

The output above shows that we got around 75% accuracy of our AdaBoost classifier ensemble model.

Stochastic Gradient Boosting


It is also called Gradient Boosting Machines. In the following Python recipe, we are going to build Sto­
chastic Gradient Boostingensemble model for classification by using GradientBoostingClassifier class of
sklearn on Pima Indians diabetes dataset.

First, import the required packages as follows -

from [Link] import KFold


froir [Link] selection impor cross val scoreMM
from [Link] import GradientBoostingClassifierl

Now, we need to load the Pima diabetes dataset as did in previous examples -

path = r"C:\[Link]"
peadernames = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class'

array =

Next, give the input for 10-fold cross validation as follows -

kfold - KFold(n_splits=10, random_state=seed)

We need to provide the number of trees we are going to build. Here we are building 150 trees with split
points chosen from 5 features -

num_trees = 50

Next, build the model with the help of following script -


model = GradientBoostingClassifier(n_estimators=num_trees, random_state=seed)

Calculate and print the result as follows -

[results = cross_val_score(model, X, Y, cv=kfold)

Output
0.7746582365003418

The output above shows that we got around 77.5% accuracy of our Gradient Boosting classifier ensemble
model.

Voting Ensemble Algorithms


As discussed, voting first creates two or more standalone models from training dataset and then a voting
classifier will wrap the model along with taking the average of the predictions of sub-model whenever
needed new data.

In the following Python recipe, we are going to build Voting ensemble model for classification by using
VotingClassifier class of sklearn on Pima Indians diabetes dataset. We are combining the predictions of lo­
gistic regression, Decision Tree classifier and SVM together for a classification problem as follows -
First, import the required packages as follows -

from pandas import read_csv^^^^^^^^^^^^M

from sklearn.linear_model import LogisticRegression


from [Link] import DecisionTreeClassifier^^M
from [Link] import
from [Link] import VotingClassifier^^H

Now, we need to load the Pima diabetes dataset as did in previous examples -

path = r"C:\[Link]"
[headernames = ['preg1, 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class'

array =

Next, give the input for 10-fold cross validation as follows -

kfold = KFold(n_splits=10, random_state=7)


Next, we need to create sub-models as follows -

MB
[Link](('logistic', modell))
<< .<12 .
|[Link](('cart', model2))^P

|[Link](('svm', model3))M

Now, create the voting ensemble model by combining the predictions of above created sub models.

ensembl^^odngClassifie^esdmatoK)^^^^^
[results = cross_val_score(ensemble, X, Y, cv=kfold)

Output
0.7382262474367738

The output above shows that we got around 74% accuracy of our voting classifier ensemble model.
Gradient Boosting

Gradient Boosting Machines (GBM) is a powerful machine learning technique that is widely used for build­
ing predictive models. It is a type of ensemble method that combines the predictions of multiple weaker
models to create a stronger and more accurate model.

GBM is a popular choice for a wide range of applications, including regression, classification, and ranking
problems. Let's understand the workings of GBM and how it can be used in machine learning.

What is a Gradient Boosting Machine (GBM)?


GBM is an iterative machine learning algorithm that combines the predictions of multiple decision trees to
make a final prediction.

The algorithm works by training a sequence of decision trees, each of which is designed to correct the
errors of the previous tree.

In each iteration, the algorithm identifies the samples in the dataset that are most difficult to predict and
focuses on improving the model's performance on these samples.

This is achieved by fitting a new decision tree that is optimized to reduce the errors on the difficult
samples. The process continues until a specified stopping criteria is met, such as reaching a certain level of
accuracy or the maximum number of iterations.
How Does a Gradient Boosting Machine Work?
The basic steps involved in training a GBM model are as follows -

1. Initialize the model - The algorithm starts by creating a simple model, such as a single deci­
sion tree, to serve as the initial model.
2. Calculate residuals - The initial model is used to make predictions on the training data, and
the residuals are calculated as the differences between the predicted values and the actual
values.
3. Train a new model - A new decision tree is trained on the residuals, with the goal of minimiz­
ing the errors on the difficult samples.
4. Update the model - The predictions of the new model are added to the predictions of the
previous model, and the residuals are recalculated based on the updated predictions.
5. Repeat - Steps 3-4 are repeated until a specified stopping criteria is met.

GBM can be further improved by introducing regularization techniques, such as LI and L2 regularization,
to prevent overfitting. Additionally, GBM can be extended to handle categorical variables, missing data,
and multi-class classification problems.

Example
Here is an example of implementing GBM using the Sklearn breast cancer dataset -
[Link]:[Link] malm mst split^M
from [Link] import GradientBoostingClassifier|
from [Link] import accuracylscore^^^^^^H

data = load_breast_cancer()^M
X fl./ d..j I

y = [Link]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = GradientBoostingClassifier(n_estimators=10Q, max_depth=3, learning_rate=O.l)

ff Make predictions on the testing set


y_pred = [Link](X_test)

accuracy = accuracy_score(y_test, y_pre


mm i mm myt, y
Output
In this example, we load the breast cancer dataset using Sklearn's load_breast_cancer function and split it
into training and testing sets. We then define the parameters for the GBM model using GradientBoosting-
Classifier, including the number of estimators (i.e., the number of decision trees), the maximum depth of
each decision tree, and the learning rate.

We train the GBM model using the fit method and make predictions on the testing set using the predict
method. Finally, we evaluate the model's accuracy using the accuracy_score function from Sklearn's met­
rics module.

When you execute this code, it will produce the following output -

Accuracy: 0.956140350877193

Advantages of Using Gradient Boosting Machines


There are several advantages of using GBM in machine learning -

1. High accuracy - GBM is known for its high accuracy, as it combines the predictions of multi­
ple weaker models to create a stronger and more accurate model.
2. Robustness - GBM is robust to outliers and noisy data, as it focuses on improving the model's
performance on the most difficult samples.
3. Flexibility - GBM can be used for a wide range of applications, including regression, classifi­
cation, and ranking problems.
4. Interpretability - GBM provides insights into the importance of different features in making
predictions, which can be useful for understanding the underlying factors driving the predic­
tions.
5. Scalability - GBM can handle large datasets and can be parallelized to accelerate the training
process.

Limitations of Gradient Boosting Machines


There are also some limitations to using GBM in machine learning -

1. Training time - GBM can be computationally expensive and may require a significant
amount of training time, especially when working with large datasets.
2. Hyperparameter tuning - GBM requires careful tuning of hyperparameters, such as the
learning rate, number of trees, and maximum depth, to achieve optimal performance.
3. Black box model - GBM can be difficult to interpret, as the final model is a combination of
multiple decision trees and may not provide clear insights into the underlying factors driving
the predictions.
Bootstrap Aggregation (Bagging)
Bagging is an ensemble learning technique that combines the predictions of multiple models to improve
the accuracy and stability of a single model. It involves creating multiple subsets of the training data by
randomly sampling with replacement. Each subset is then used to train a separate model, and the final pre­
diction is made by averaging the predictions of all models.

The main idea behind Bagging is to reduce the variance of a single model by using multiple models that are
less complex but still accurate. By averaging the predictions of multiple models, Bagging reduces the risk
of overfitting and improves the stability of the model.

How Does Bagging Work?


The Bagging algorithm works in the following steps -

1. Create multiple subsets of the training data by randomly sampling with replacement.
2. Train a separate model on each subset of the data.
3. Make predictions on the testing data using each model.
4. Combine the predictions of all models by taking the average or majority vote.
The key feature of Bagging is that each model is trained on a different subset of the training data, which
introduces diversity into the ensemble. The models are typically trained using a base model, such as a de­
cision tree, logistic regression, or support vector machine.

Example
Now let's see how we can implement Bagging in Python using the Scikit-learn library. For this example, we
will use the famous Iris dataset.

from [Link] import load_iris^^^^^^^B


[Link] . baggineCiassif^r^B
from [Link] import DecisionTreeClassifier^^B
from sklearn.model_selection import train_test_split
from [Link] import accuracyScore^^^^B

ins loadnrisO^^H

X_train, X_test, y_train, y_test = train_test_split([Link], [Link], test_size=0.2, random_state=42)

base_estimator = DecisionTreeClassifier(max_depth=3)
ff Dehne the Bagging classifier
bagging = BaggingClassifier(base_estimator=base_estimator, n_estimators= 10, random_state=42)

accuracy = accuracy_score(y_test, y_pre


print("Accuracy:!, accuracy)^^^^^B

In this example, we first load the Iris dataset using Scikit-learn's loadjris function and split it into training
and testing sets using the train_test_split function.

We then define the base estimator, which is a decision tree with a maximum depth of 3, and the Bagging
classifier, which consists of 10 decision trees.

We train the Bagging classifier using the fit method and make predictions on the testing set using the
predict method. Finally, we evaluate the model's accuracy using the accuracy_score function from Scikit-
learn's metrics module.

Output
When you execute this code, it will produce the following output -

Accuracy: 1.0

Cross Validation

Cross-validation is a powerful technique used in machine learning to estimate the performance of a model
on unseen data. It is an essential step in building a robust machine learning model, as it helps to identify
overfitting or underfitting, and helps to determine the optimal model hyperparameters.

What is Cross-Validation?
Cross-validation is a technique used to evaluate the performance of a model by partitioning the dataset
into subsets, training the model on a portion of the data, and then validating the model on the remaining
data. The basic idea behind cross-validation is to use a subset of the data to train the model and another
subset to test its performance. This allows the machine learning model to be trained on a variety of data
and to generalize better to new data.

There are different types of cross-validation techniques available, but the most commonly used technique
is k-fold cross-validation. In k-fold cross-validation, the data is partitioned into k equally sized folds. The
model is then trained on k-1 folds and tested on the remaining fold. This process is repeated k times, with
each of the k folds used once as the validation data. The final performance of the model is then averaged
over the k iterations to obtain an estimate of the model's performance.
Why is Cross-Validation Important?
Cross-validation is an essential technique in machine learning because it helps to prevent overfitting or
underfitting of a model. Overfitting occurs when the model is too complex and fits the training data too
closely, resulting in poor performance on new data. On the other hand, underfitting occurs when the
model is too simple and does not capture the underlying patterns in the data, resulting in poor perfor­
mance on both the training and test data.

Cross-validation also helps to determine the optimal model hyperparameters. Hyperparameters are the
settings that control the behavior of the model. For example, in a decision tree algorithm, the maximum
depth of the tree is a hyperparameter that determines the level of complexity of the model. By using cross-
validation to evaluate the performance of the model at different hyperparameter values, we can select the
optimal hyperparameters that maximize the model's performance.

Implementing Cross-Validation in Python


In this section, we will discuss how to implement k-fold cross-validation in Python using the Scikit-learn
library. Scikit-learn is a popular Python library for machine learning that provides a range of algorithms
and tools for data preprocessing, model selection, and evaluation.

To demonstrate how to implement cross-validation in Python, we will use the famous Iris dataset. The
Iris dataset contains measurements of the sepal length, sepal width, petal length, and petal width of three
different species of iris flowers. The goal is to build a model that can predict the species of an iris flower
based on its measurements.

First, we need to load the dataset using the Scikit-learn load_iris() function and split it into a training set
and a test set using the train_test_split() function. The training set will be used to train the model, and the
test set will be used to evaluate the performance of the model.

from [Link] import train_test_split

ins loadjrisO^^H

X_train, X_test, y_train, y_test = train_test_split([Link], [Link], test_size=0.2, random_state=42)

Next, we will create a decision tree classifier using the Scikit-learn DecisionTreeClassifierQ function.

from [Link] import DecisionTree

Create a decision tree classifier.

elf = DecisionTreeClassifier(random_state=42)
Now, we can use k-fold cross-validation to evaluate the performance of the model. We will use the cross_
val_score() function from Scikit-learn to perform k-fold cross-validation. The function takes as input the
model, the training data, the target variable, and the number of folds. It returns an array of scores, one for
each fold.

from sklearn.model_selection import cross_val_score

scores = cross_val_score(clf, X_train, y_train, cv=5)M

Here, we have specified the number of folds as 5, meaning that the data will be partitioned into 5 equally
sized folds. The cross_val_score() function will train the model on 4 folds and test it on the remaining fold.
This process will be repeated 5 times, with each fold used once as the validation data. The function returns
an array of scores, one for each fold.

Finally, we can calculate the mean and standard deviation of the scores to get an estimate of the model's
performance.

import numpy as n

mean_score = [Link](scores)^^^^^^^^^^^^^^|

11 rint("Mean cross-validation score: {:.2f}".format(mean_score))


12 rint("Standard deviation of cross-validation score: {:.2f}".format(std_score))

The output of this code will be the mean and standard deviation of the scores. The mean score represents
the average performance of the model across all folds, while the standard deviation represents the variabil­
ity of the scores.

Example
Here is the complete implementation of Cross-Validation in Python -

[from [Link] import load Jris^^^^^^^M


from [Link] import DecisionTreeClassifier^^B
from [Link] import cross_val_score

iris loacLinsO^^H

X = [Link]^^^^^^^^^^^^^^^B

elf = DecisionTreeClassifier(random_state=42)
scores = cross_val_score(clf, X, y, cv=5)

mean_score = [Link](scores)^^^^^^^^^^^^^^[

print("Mean cross-validation score: {:.2f}".format(mean_score))


print("Standard deviation of cross-validation score: {:.2f}".format(std_score))

Output
When you execute this code, it will produce the following output -

Mean cross-validation score: 0.95


Standard deviation of cross-validation score: 0.03

AUC-ROC Curve

The AUC-ROC curve is a commonly used performance metric in machine learning that is used to evaluate
the performance of binary classification models. It is a plot of the true positive rate (TPR) against the false
positive rate (FPR) at different threshold values.
What is the AUC-ROC Curve?
The AUC-ROC curve is a graphical representation of the performance of a binary classification model at
different threshold values. It plots the true positive rate (TPR) on the y-axis and the false positive rate (FPR)
on the x-axis. The TPR is the proportion of actual positive cases that are correctly identified by the model,
while the FPR is the proportion of actual negative cases that are incorrectly classified as positive by the
model.

The AUC-ROC curve is a useful metric for evaluating the overall performance of a binary classification
model because it takes into account the trade-off between TPR and FPR at different threshold values. The
area under the curve (AUC) represents the overall performance of the model across all possible threshold
values. A perfect classifier would have an AUC of 1.0, while a random classifier would have an AUC of 0.5.

Why is the AUC-ROC Curve Important?


The AUC-ROC curve is an important performance metric in machine learning because it provides a com­
prehensive measure of a model's ability to distinguish between positive and negative cases.

It is particularly useful when the data is imbalanced, meaning that one class is much more prevalent than
the other. In such cases, accuracy alone may not be a good measure of the model's performance because it
can be skewed by the prevalence of the majority class.
The AUC-ROC curve provides a more balanced view of the model's performance by taking into account
both TPR and FPR.

Implementing the AUC ROC Curve in Python


Now that we understand what the AUC-ROC curve is and why it is important, let's see how we can imple­
ment it in Python. We will use the Scikit-learn library to build a binary classification model and plot the
AUC-ROC curve.

First, we need to import the necessary libraries and load the dataset. In this example, we will be using the
breast cancer dataset from scikit-learn.

Example

import pandas as

from [Link] import train_test_split[


from sklearn.linear_model import LogisticRegression|
from [Link] import roc auc score, roc curve
mat picthb.-pv^kv

data = load_breast_cancer()
= [Link]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

Next, we will fit a logistic regression model to the training set and make predictions on the test set.

Ir = LogisticRegressionQ^^^B

y_pred = lr.predict_proba(X_test)[:, 1]

After making predictions, we can calculate the AUC-ROC score using the roc_auc_score() function from
scikit-learn.

auc_roc = roc_auc_score(y_test, y_pred)

This will output the AUC-ROC score for the logistic regression model.
Finally, we can plot the ROC curve using the roc_curve() function and matplotlib library.

^lo^h^O^urve^^^^^^^^^^^^B
fpr, tpr, thresholds = roc_curve(y_test, y_pre

plUxtleCROCCwwe^^^^^^^^^^^B

Output
When you execute this code, it will plot the ROC curve for the logistic regression model.
ROC Curve
1.0 -

0.8 -
ro
a:
d> 0.6 -
>
‘<75
<£ 0.4 -
a>

0.2 -

0.0 -
0.4 0.6
False Positive Rate

In addition, it will print the AUC-ROC score on the terminal -

AUC-ROC Score: 0.9967245332459875

Grid Search
Grid Search is a hyperparameter tuning technique in Machine Learning that helps to find the best com­
bination of hyperparameters for a given model. It works by defining a grid of hyperparameters and then
training the model with all the possible combinations of hyperparameters to find the best performing set.

In other words, Grid Search is an exhaustive search method where a set of hyperparameters are defined,
and a search is performed over all possible combinations of these hyperparameters to find the optimal val­
ues that give the best performance.

Implementation in Python
In Python, Grid Search can be implemented using the GridSearchCV class from the sklearn module. The
GridSearchCV class takes the model, the hyperparameters to tune, and a scoring function as input. It then
performs an exhaustive search over all possible combinations of hyperparameters and returns the best set
of hyperparameters that give the best score.

Here is an example implementation of Grid Search in Python using the GridSearchCV class -

Example
[Link] GridSearchCV^
from [Link] import RandomForestClassifier
from [Link] import make_classification^^B

X, y = make_classification(n_samples= 1000, n_features= 10, n_classes=2)

# Define the model and the hyperparameters to tune


model =
Ihyperparameters - {'n_estimators': [10, 50,100], 'max_depth': [None, 5,10]}

grid_search - GridSearchCV(model, hyperparameters, scoring='accuracy', cv=5)

?rint("Best hyperparameters:", grid_search.best_params_)^P


mint( Best score , g r i d _ s e a r c h. b e s t_scorej^^^^^^^^^l

In this example, we define a RandomForestClassifier model and a set of hyperparameters to tune, namely
the number of trees (n_estimators) and the maximum depth of each tree (max_depth). We then create a
GridSearchCV object and fit the data using the fit() method. Finally, we print the best set of hyperparame­
ters and the corresponding score.

Output
When you execute this code, it will produce the following output -

Best hyperparameters: {'max_depth': None, 'n_estimators': 10}


Best score: 0.953
Data Scaling
Data scaling is a pre-processing technique used in Machine Learning to normalize or standardize the range
or distribution of features in the data. Data scaling is essential because the different features in the data
may have different scales, and some algorithms may not work well with such data. By scaling the data, we
can ensure that each feature has a similar scale and range, which can improve the performance of the ma­
chine learning model.

There are two common techniques used for data scaling -

1. Normalization - Normalization scales the values of a feature between 0 and 1. This is


achieved by subtracting the minimum value of the feature from each value and dividing it by
the range of the feature (the difference between the maximum and minimum values).
2. Standardization - Standardization scales the values of a feature to have a mean of 0 and a
standard deviation of 1. This is achieved by subtracting the mean of the feature from each
value and dividing it by the standard deviation.

Example
In Python, data scaling can be implemented using the sklearn module. The [Link] sub­
module provides classes for scaling data. Below is an example implementation of data scaling in Python
using the StandardScaler class for standardization -
from [Link] import StandardScaler
from [Link] import load_iris^^^^^M

data = loaTTnsO^^B

y = [Link]

df = [Link](X, columns=data.feature_names)
print(" Before

scaler = StandardScalerQ^^^^^^I
X_scaled = scaler.fit_transform(X)^B

df_scaled = [Link](X_scaled, columns=data. feature_names)


print("After
In this example, we load the iris dataset and create a DataFrame from it. We then use the StandardScaler
class to scale the data and create a new DataFrame from the scaled data. Finally, we print the dataframes
to see the difference in the data before and after scaling. Note that we fit and transform the data using the
fit_transform() method of the scaler object.

Output
When you execute this code, it will produce the following output -

Before scaling:
sepal length (cm) sepal width (cm) petal length (cm) petal width (cm)
0 5.1 3.5 1.4 0.2
1 4.9 3.0 1.4 0.2
2 4.7 3.2 1.3 0.2
3 4.6 3.1 1.5 0.2
4 5.0 3.6 1.4 0.2
After scaling:
sepal length (cm) sepal width (cm) petal length (cm) petal width (cm)
0 -0.900681 1.019004 -1.340227 -1.315444
1 -1.143017 -0.131979 -1.340227 -1.315444
2 -1.385353 0.328414 -1.397064 -1.315444
3 -1.506521 0.098217 -1.283389 -1.315444
4 -1.021849 1.249201 -1.340227 -1.315444
Train and Test
In machine learning, the train-test split is a common technique used to evaluate the performance of a
machine learning model. The basic idea behind the train-test split is to split the available data into two
sets: a training set and a testing set. The training set is used to train the model, and the testing set is used
to evaluate the model's performance.

The train-test split is important because it allows us to test the model on data that it has not seen before.
This is important because if we evaluate the model on the same data that it was trained on, the model may
perform well on the training data but may not generalize well to new data.

Example
In Python, the train_test_split function from the sklearn.model_selection module can be used to split the
data into training and testing sets. Here is an example implementation -

data = loadjirisO^^B
y = [Link]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LogisticRegressionQ^^^^^^^^^^^^^^^^^^B

accuracy = [Link](X_test, y_test)J


print(f "Accuracy: {accuracy:.

In this example, we load the iris dataset and split it into training and testing sets using the train_test_split
function. We then create a logistic regression model and fit it to the training data. Finally, we evaluate the
model on the testing data using the score method of the model object.

The test_size parameter in the train_test_split function specifies the proportion of the data that should be
used for testing. In this example, we set it to 0.2, which means that 20% of the data will be used for testing
and 80% will be used for training. The random_state parameter ensures that the split is reproducible, so
we get the same split every time we run the code.

Output
When you execute this code, it will produce the following output -

Accuracy: 1.00

Overall, the train-test split is a crucial step in evaluating the performance of a machine learning model. By
splitting the data into training and testing sets, we can ensure that the model is not overfitting to the train­
ing data and can generalize well to new data.

Association Rules
Association rule mining is a technique used in machine learning to discover interesting patterns in large
datasets. These patterns are expressed in the form of association rules, which represent relationships
between different items or attributes in the dataset. The most common application of association rule
mining is in market basket analysis, where the goal is to identify products that are frequently purchased
together.

Association rules are expressed as a set of antecedents and a set of consequents. The antecedents represent
the conditions or items that must be present for the rule to apply, while the consequents represent the out­
comes or items that are likely to be associated with the antecedents. The strength of an association rule is
measured by two metrics: support and confidence. Support is the proportion of transactions in the dataset
that contain both the antecedent and the consequent, while confidence is the proportion of transactions
that contain the consequent given that they also contain the antecedent.
Example
In Python, the mlxtend library provides several functions for association rule mining. Here is an example
implementation of association rule mining in Python using the apriori function from mlxtend -

import pandas as

। : Il । । "

from mlxtend.frequent_patterns import apriori, association_rules

['milk1, 'bread', 'butter', 'cheese'],


Encode the dataset

te = TransactionEncoderQ

te_ary = [Link](data).transform(data)

df = [Link](te_ary, columns=te.columns_)

frequent_itemsets = apriori(df, min_support=0.5, use_colnames=True)

rules = association_rules(frequent_itemsets, metric="confidence", min_threshold=0.5)

rint("Frequent Itemsets:")J

:' rc-uLieni

rint("\nAssociation Rules:")
In this example, we create a sample dataset of shopping transactions and encode it using TransactionEn-
coder from mlxtend. We then use the apriori function to find frequent itemsets with a minimum support
of 0.5. Finally, we use the association_rules function to generate association rules with a minimum confi­
dence of 0.5.

The apriori function takes two parameters: the encoded dataset and the minimum support threshold.
The use_colnam.es parameter is set to True to use the original item names instead of Boolean values. The
association_rules function takes two parameters: the frequent itemsets and the metric and minimum
threshold for generating association rules. In this example, we use the confidence metric with a minimum
threshold of 0.5.

Output
The output of this code will show the frequent itemsets and the generated association rules. The frequent
itemsets represent the sets of items that occur together frequently in the dataset, while the association
rules represent the relationships between the items in the frequent itemsets.

Frequent Itemsets:

support itemsets

0 0.666667 (bread)

1 0.666667 (butter)

2 0.833333 (milk)
3 0.500000 (bread, butter)

4 0.500000 (bread, milk)

5 0.500000 (butter, milk)

Association Rules:

antecedents consequents antecedent support consequent support support \

0 (bread) (butter) 0.666667 0.666667 0.5

1 (butter) (bread) 0.666667 0.666667 0.5

2 (bread) (milk) 0.666667 0.833333 0.5

3 (milk) (bread) 0.833333 0.666667 0.5

4 (butter) (milk) 0.666667 0.833333 0.5

5 (milk) (butter) 0.833333 0.666667 0.5

confidence lift leverage conviction zhangs_metric

0 0.75 1.125 0.055556 1.333333 0.333333

1 0.75 1.125 0.055556 1.333333 0.333333


2 0.75 0.900 -0.055556 0.666667 -O.25OOOO

3 0.60 0.900 -0.055556 0.833333 -0.400000

4 0.75 0.900 -0.055556 0.666667 -O.25OOOO

5 0.60 0.900 -0.055556 0.833333 -0.400000

Association rule mining is a powerful technique that can be applied to many different types of datasets. It
is commonly used in market basket analysis to identify products that are frequently purchased together,
but it can also be applied to other domains such as healthcare, finance, and social media. With the help of
Python libraries such as mlxtend, it is easy to implement association rule mining and generate valuable in­
sights from large datasets.

Apriori Algorithm
Apriori is a popular algorithm used for association rule mining in machine learning. It is used to find
frequent itemsets in a transaction database and generate association rules based on those itemsets. The al­
gorithm was first introduced by Rakesh Agrawal and Ramakrishnan Srikant in 1994.

The Apriori algorithm works by iteratively scanning the database to find frequent itemsets of increasing
size. It uses a "bottom-up" approach, starting with individual items and gradually adding more items to
the candidate itemsets until no more frequent itemsets can be found. The algorithm also employs a prun­
ing technique to reduce the number of candidate itemsets that need to be checked.

Here's a brief overview of the steps involved in the Apriori algorithm -

1. Scan the database to find the support count of each item.


2. Generate a set of frequent 1-itemsets based on the minimum support threshold.
3. Generate a set of candidate 2-itemsets by combining frequent 1-itemsets.
4. Scan the database again to find the support count of each candidate 2-itemset.
5. Generate a set of frequent 2-itemsets based on the minimum support threshold and prune
any candidate 2-itemsets that are not frequent.
6. Repeat steps 3-5 to generate candidate k-itemsets and frequent k-itemsets until no more fre­
quent itemsets can be found.

Example
In Python, the mlxtend library provides an implementation of the Apriori algorithm. Below is an example
of how to use use the mlxtend library in conjunction with the sklearn datasets to implement the Apriori al­
gorithm on iris dataset.

a i. : . ' i. 1 ' 1 .J ... । ■■

from [Link] import TransactionEncoder

'.'dear. ।
Load the iris datase

iris = datasets.load_iris()

for i in range(len(iriTdata))i^^^^^^^^^^^^^^^B

[Link]('sepal_length=' + str([Link][i][O]))

[Link]('sepal_width-' + str([Link][i][l]))|

[Link]('petal_length=' + str([Link][i][2]))

[Link]('petal_width=' + str([Link][i][3]))[

[Link]('target=' + str([Link][i]))^^^B
te_ary = [Link](transactions).transform(transactions)

df = [Link](te_ary, columns=te.columns_)

frequent_itemsets = apriori(df, min_support=0.3, use_colnames=True)

In this example, we load the iris dataset from sklearn, which contains information about iris flowers. We
convert the dataset into a list of transactions, where each transaction represents a single flower and con­
tains the values for its four attributes (sepaljength, sepaLwidth, petal_length, and petaLwidth) as well as
its target label (target). We then encode the transactions using one-hot encoding and find frequent item­
sets with a minimum support of 0.3 using the apriori function from mlxtend.

The output of this code will show the frequent itemsets and their corresponding support counts. Since the
iris dataset is relatively small, we only find a single frequent itemset -

Output
support itemsets

0 0.333333 (target=O)

1 0.333333 (target=l)

2 0.333333 (target=2)

This indicates that 33% of the transactions in the dataset contain both a petaljength value of 1.4 and a
target label of 0 (which corresponds to the setosa species in the iris dataset).

The Apriori algorithm is widely used in market basket analysis to identify patterns in customer purchas­
ing behavior. For example, a retailer might use the algorithm to find frequently purchased items that can
be promoted together to increase sales. The algorithm can also be used in other domains such as health­
care, finance, and social media to identify patterns and generate insights from large datasets.

Gaussian Discriminant Analysis


Gaussian Discriminant Analysis (GDA) is a statistical algorithm used in machine learning for classification
tasks. It is a generative model that models the distribution of each class using a Gaussian distribution, and
it is also known as the Gaussian Naive Bayes classifier.

The basic idea behind GDA is to model the distribution of each class as a multivariate Gaussian distribu­
tion. Given a set of training data, the algorithm estimates the mean and covariance matrix of each class's
distribution. Once the parameters of the model are estimated, it can be used to predict the probability
of a new data point belonging to each class, and the class with the highest probability is chosen as the
prediction.

The GDA algorithm makes several assumptions about the data -

1. The features are continuous and normally distributed.


2. The covariance matrix of each class is the same.
3. The features are independent of each other given the class.

Assumption 1 means that GDA is not suitable for data with categorical or discrete features. Assumption
2 means that GDA assumes that the variance of each feature is the same across all classes. If this is not
true, the algorithm may not perform well. Assumption 3 means that GDA assumes that the features are
independent of each other given the class label. This assumption can be relaxed using a different algorithm
called Linear Discriminant Analysis (LDA).

Example
The implementation of GDA in Python is relatively straightforward. Here's an example of how to imple­
ment GDA on the Iris dataset using the scikit-learn library -

from [Link] import load_iris

from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis


from [Link] import train_test_split

Split the data into training and testing sets

X_train, X_test, y_train, y_test = train_test_split([Link], [Link], test_size=0.3, random_state=42)

Trai^^GD^nodel^^^^^^^^^^B

da = QuadraticDiscriminantAnalysisQ

[Link](X_train, y_train)

Make predictions on the testing set

y_pred = [Link](X_test)

Evaluate the model's accurac


accuracy = (y_pred == y_test).mean()

p r i n t (A c c u r a c y p accuracyj^^^^H

In this example, we first load the Iris dataset using the loadjris function from scikit-learn. We then split
the data into training and testing sets using the train_test_split function. We create a QuadraticDiscrim-
inantAnalysis object, which represents the GDA model, and train it on the training data using the fit
method. We then make predictions on the testing set using the predict method and evaluate the model's
accuracy by comparing the predicted labels to the true labels.

Output
The output of this code will show the model's accuracy on the testing set. For the Iris dataset, the GDA
model typically achieves an accuracy of around 97-99%.

Accuracy: 0.9811320754716981

Overall, GDA is a powerful algorithm for classification tasks that can handle a wide range of data types,
including continuous and normally distributed data. While it makes several assumptions about the data,
it is still a useful and effective algorithm for many real-world applications.
Cost Function
In machine learning, a cost function is a measure of how well a machine learning model is performing. It
is a mathematical function that takes in the model's predicted values and the true values of the data and
outputs a single scalar value that represents the cost or error of the model's predictions. The goal of train­
ing a machine learning model is to minimize the cost function.

The choice of cost function depends on the specific problem being solved. For example, in binary classi­
fication tasks, where the goal is to predict whether a data point belongs to one of two classes, the most
commonly used cost function is the binary cross-entropy function. In regression tasks, where the goal is to
predict a continuous value, the mean squared error function is commonly used.

Let's take a closer look at the binary cross-entropy function. Given a binary classification problem with
two classes, let's call them class 0 and class 1, and let's denote the model's predicted probability of class 1
as "p(y = 1 |x)". The true label of each data point is either 0 or 1. We can define the binary cross-entropy cost
function as follows -

J=-(
1

m
) x S(y x log(p)+(1 - y) x log( 1 -p))
* =-(l * )*S( * X # ) + (l- & )x (1- ))

where "m" is the number of data points, "y" is the true label of each data point, and "p" is the predicted
probability of class 1.

The binary cross-entropy function has several desirable properties. First, it is a convex function, which
means that it has a unique global minimum that can be found using optimization techniques. Second, it is
a strictly positive function, which means that it penalizes incorrect predictions. Third, it is a differentiable
function, which means that it can be used with gradient-based optimization algorithms.

Implementation in Python
Now let's see how to implement the binary cross-entropy function in Python using NumPy -

import num

def binary_cross_entropy(y_pred, y_true):

eps = le-15

y_pred = [Link](y_pred, eps, 1 - eps)

return -(y_true * [Link](y_pred) + (1 - y_true) * [Link](l - y_pred)).mean()


In this implementation, we first clip the predicted probabilities to avoid numerical issues with logarithms.
We then compute the binary cross-entropy loss using NumPy functions and return the mean over all data
points.

Once we have defined a cost function, we can use it to train a machine learning model using optimization
techniques such as gradient descent. The goal of optimization is to find the set of model parameters that
minimizes the cost function.

Example
Here is an example of using the binary cross-entropy function to train a logistic regression model on the
Iris dataset using scikit-learn -

from sklearn.linear_model import LogisticRegression

from sklearn.model_selection import train_test_split|

Split the data into training and testing sets


X_train, X_test, y_train, y_test = train_test_split([Link], [Link], test_size=0.3, random_state=42)

Train a logistic regression model

logreg = LogisticRegressionQ

[Link](X_tram, y_tram)

Make predictions on the testing set

y_pred = [Link](X_test)

Compute the binary cross-entropy loss

loss = binary_cross_entropy(logreg.predict_proba(X_test)[:, 1], y_test)

printfLoss:', loss)

In the above example, we first load the Iris dataset using the loadjris function from scikit-learn. We then
split the data into training and testing sets using the ' train_test _splitfunction. We train a logistic regres­
sion model on the training set using theLogisticRegressionclass from scikit-learn. We then make predic­
tions on the testing set using thepredict' method of the trained model.
To compute the binary cross-entropy loss, we use the predict_proba method of the logistic regression
model to get the predicted probabilities of class 1 for each data point in the testing set. We then extract the
probabilities for class 1 using indexing and pass them to our binary_cross_entropy function along with
the true labels of the testing set. The function computes the loss and returns it, which we display on the
terminal.

Output
When you execute this code, it will produce the following output -

Loss: 1.6312339784720309

The binary cross-entropy loss is a measure of how well the logistic regression model is able to predict the
class of each data point in the testing set. A lower loss indicates better performance, and a loss of 0 would
indicate perfect performance.

Bayes Theorem

Bayes Theorem is a fundamental concept in probability theory that has many applications in machine
learning. It allows us to update our beliefs about the probability of an event given new evidence. Actually,
it forms the basis for probabilistic reasoning and decision making.
Bayes Theorem states that the probability of an event A given evidence B is equal to the probability of
evidence B given event A, multiplied by the prior probability of event A, divided by the probability of evi­
dence B. In mathematical notation, this can be written as -

P(A\B) = P(B|A)*P(A)/P(B)
where -

• P(A\B) is the probability of event A given evidence B (the


posterior probability)

• P(B A) is the probability of evidence B given event A (the


likelihood)

• P(A) is the prior probability of event A (our initial belief about


the probability of event A)

• P(B) is the probability of evidence B (the total probability)

Bayes Theorem can be used in a wide range of applications, such as spam filtering, medical diagnosis, and
image recognition. In machine learning, Bayes Theorem is commonly used in Bayesian inference, which is
a statistical technique for updating our beliefs about the parameters of a model based on new data.
Implementation in Python
In Python, there are several libraries that implement Bayes Theorem and Bayesian inference. One of the
most popular is the scikit-learn library, which provides a range of tools for machine learning and data
analysis.

Let's consider an example of how Bayes Theorem can be implemented in Python using scikit-learn. Sup­
pose we have a dataset of emails, some of which are spam and some of which are not. Our goal is to build a
classifier that can accurately predict whether a new email is spam or not.

We can use Bayes Theorem to calculate the probability of an email being spam given its features (such as
the words in the subject line or body). To do this, we first need to estimate the parameters of the model,
which in this case are the prior probabilities of spam and non-spam emails, as well as the likelihood of each
feature given the class (spam or non-spam).

We can estimate these probabilities using maximum likelihood estimation or Bayesian inference. In our
example, we will be using the Multinomial Naive Bayes algorithm, which is a variant of the Naive Bayes al­
gorithm that is commonly used for text classification tasks.

Example
from [Link] import fetch_20newsgroups

from sklearn.feature_extraction.text import CountVectorizer


from sklearn.naive_baves import MultinomialNB

from [Link] import accuracy_score

categories = ['[Link]', '[Link]', '[Link]', '[Link].

train ~ fetch_20newsgroups(subset='train', categories=categories, shuffle=True, random_state=42)

test = fetch_20newsgroups(subset='test', categories=categories, shuffle=True, random_state=42)J

vectorizer =

X_train = vectorizer.fit_transform(traimdata)^^^^^^^^^^^^^^^^^^^^^^^^^^^^M

r ''i
Make predictions on the test set and calculate accurac

y_pred = [Link](X_test)

accuracy = accuracy_score([Link], y_pred)

print("Accuracy:", accuracy)

In the above code, we first load the 20 newsgroups dataset, which is a collection of newsgroup posts clas­
sified into different categories. We select four categories ([Link], [Link], [Link], and [Link]-
[Link]) and split the data into training and testing sets.

We then use the CountVectorizer class from scikit-learn to convert the text data into a bag-of-words repre­
sentation. This representation counts the occurrence of each word in the text and represents it as a vector.

Next, we train a Multinomial Naive Bayes classifier using the fit() method. This method estimates the prior
probabilities and the likelihood of each word given the class using maximum likelihood estimation. The
classifier can then be used to make predictions on the test set using the predict() method.

Finally, we calculate the accuracy of the classifier using the accuracy_score() function from scikit-learn.

Output
When you execute this code, it will produce the following output -

Accuracy: 0.9340878828229028
Precision and Recall
Precision and recall are two important metrics used to evaluate the performance of classification models
in machine learning. They are particularly useful for imbalanced datasets where one class has signifi­
cantly fewer instances than the other.

Precision is a measure of how many of the positive predictions made by a classifier were correct. It is
defined as the ratio of true positives (TP) to the total number of positive predictions (TP + FP). In other
words, precision measures the proportion of true positives among all positive predictions.

Precision=TP/(TP+FP)
Recall, on the other hand, is a measure of how many of the actual positive instances were correctly iden­
tified by the classifier. It is defined as the ratio of true positives (TP) to the total number of actual positive
instances (TP + FN). In other words, recall measures the proportion of true positives among all actual pos­
itive instances.

Recall=TP/(TP+FN)
To understand precision and recall, consider the problem of detecting spam emails. A classifier may label
an email as spam (positive prediction) or not spam (negative prediction). The actual label of the email can
be either spam or not spam. If the email is actually spam and the classifier correctly labels it as spam, then
it is a true positive. If the email is not spam but the classifier incorrectly labels it as spam, then it is a false
positive. If the email is actually spam but the classifier incorrectly labels it as not spam, then it is a false
negative. Finally, if the email is not spam and the classifier correctly labels it as not spam, then it is a true
negative.

In this scenario, precision measures the proportion of spam emails that were correctly identified as spam
by the classifier. A high precision indicates that the classifier is correctly identifying most of the spam
emails and is not labeling many legitimate emails as spam. On the other hand, recall measures the propor­
tion of all spam emails that were correctly identified by the classifier. A high recall indicates that the classi­
fier is correctly identifying most of the spam emails, even if it is labeling some legitimate emails as spam.

Implementation in Python
In scikit-learn, precision and recall can be calculated using the precision_score() and recall_score() func­
tions, respectively. These functions take as input the true labels and predicted labels for a set of instances,
and return the corresponding precision and recall scores.

For example, consider the following code snippet that uses the breast cancer dataset from scikit-learn to
train a logistic regression classifier and evaluate its precision and recall scores -

Example

from sklearn.linear_model import LogisticRegression


from [Link] import train_test_split

from [Link] import precision_score, recall_score

Split the data into training and testing sets

X_train, X_test, y_train, y_test - train_test_split([Link], [Link], test_size=0.2, random_state=42)

elf = LogisticRegression(random_state=42)

Make predictions on the testing set

y_pred = [Link](X_test)
precision = precision_score(y_test, y_pre

In the above example, we first load the breast cancer dataset and split it into training and testing sets. We
then train a logistic regression classifier on the training set and make predictions on the testing set using
the predict() method. Finally, we calculate the precision and recall scores using the precision_score() and
recall_score() functions.

Output
When you execute this code, it will produce the following output -

Precision: 0.9459459459459459

Recall: 0.9859154929577465
Adversarial

Adversarial machine learning is a subfield of machine learning that focuses on studying the vulnerability
of machine learning models to adversarial attacks. An adversarial attack is a deliberate attempt to fool a
machine learning model by introducing small perturbations in the input data. These perturbations are
often imperceptible to humans, but they can cause the model to make incorrect predictions with high
confidence. Adversarial attacks can have serious consequences in real-world applications, such as autono­
mous driving, security systems, and healthcare.

There are several types of adversarial attacks, including -

1. Evasion attacks - These attacks aim to manipulate the input data to cause the model to
misclassify it. Evasion attacks can be targeted, where the attacker knows the target class, or
untargeted, where the attacker only wants to cause a misclassification.
2. Poisoning attacks - These attacks aim to manipulate the training data to bias the model
towards a particular class or to reduce its overall accuracy. Poisoning attacks can be either
data poisoning, where the attacker modifies the training data, or model poisoning, where the
attacker modifies the model itself.
3. Model inversion attacks - These attacks aim to infer sensitive information about the training
data or the model itself by observing the outputs of the model.

To defend against adversarial attacks, researchers have proposed several techniques, including -
1. Adversarial training - This technique involves augmenting the training data with adversar­
ial examples to make the model more robust to adversarial attacks.
2. Defensive distillation - This technique involves training a second model on the outputs of
the first model to make it more resistant to adversarial attacks.
3. Randomization - This technique involves adding random noise to the input data or the
model parameters to make it harder for attackers to craft adversarial examples.
4. Detection and rejection - This technique involves detecting adversarial examples and reject­
ing them before they are processed by the model.

Implementation in Python
In Python, several libraries provide implementations of adversarial attacks and defenses, including -

1. CleverHans - This library provides a collection of adversarial attacks and defenses for Tensor-
Flow, Keras, and PyTorch.
2. ART (Adversarial Robustness Toolbox) - This library provides a comprehensive set of tools
to evaluate and defend against adversarial attacks in machine learning models.
3. Foolbox - This library provides a collection of adversarial attacks for PyTorch, TensorFlow,
and Keras.

In the following example, we will do implementation of Adversarial Machine Learning using the Adversar­
ial Robustness Toolbox (ART) -
First, we need to install the ART package using pip -

pip install adversarial-robustness-toolbox

Then, we can create an adversarial example using the ART library on a pre-trained model.

Example
import tensorflow as

from [Link] import Sequential^^^^^^^^^^^^^M

from [Link] import Dense, Flatten, Conv2D, MaxPooling2D

from [Link] importAdam^^^^^^^^^^^^^^B

from [Link] import to_categoncal^^^^^^^^^^^^^M

from [Link] import KerasClassifier^^^H

import tensorflow as tl
[Link].disable_eager_execution()

^oacKh^MNIST^iataset^^^^^^^^^^^^^^B

(x_train, y_train), (x_test, y_test) = mnist.load_data()

Preprocess the data

x_train = x_train.reshape(-l, 28,28, l).astype('float32') 7 255

x_test = x_test.reshape(-l, 28, 28, l).astype('float32')/ 255

vtrain = to_categorical(v train, 10)

vtest = to_categorical(y_test, 10)

model =

mode^dd(Conv2D(32^erne^ize=(3^)^ctivation=Yelu^npu^hape=(28^8^)))

s n o d e 1. a d d (Ma x P o o i i n g 2 D i p o o 1_ s i z e (

[Link](Flatten())
[Link](Dense( 10, activation='softmax'))

Compile the mode

[Link](loss='categoricaLcrossentropy1, optimizer=AdarnQr=0.001), metrics=['accuracy'])

Wrap the model with ART KerasClassifier

classifier = KerasClassifier(model=model, clip_values=(0,1), use_logits=False)

Evaluate the model on the test se

accuracy = [Link](x_test, y_test)[l

print("Accuracy on test set: %.2f%%" % (accuracy * 100))

Generate adversarial examples using the FastGradientMethod attac


attack = FastGradientMethod(estimator=classifier, eps=0.1)

x [Link]: i a c k. a a; i a r a t e; t

print("Accuracy on adversarial examples: %.2f%%" % (accuracy_adv * 100))

In this example, we first load and preprocess the MNIST dataset. Then, we define a simple convolutional
neural network (CNN) model and compile it using categorical cross-entropy loss and Adam optimizer.

We wrap the model with the ART KerasClassifier to make it compatible with ART attacks. We then train the
model for 10 epochs on the training set and evaluate it on the test set.

Next, we generate adversarial examples using the FastGradientMethod attack with a maximum perturba­
tion of 0.1. Finally, we evaluate the model on the adversarial examples.

Output
When you execute this code, it will produce the following output -

Train on 60000 samples


Epoch 1/20

60000/60000 [==============================] - 17s 277us/sample - loss: 0.3530 - accuracy: 0.9030

Epoch 2/20

60000/60000 [===== ================ ======= ==]-15s 251us/sample - loss: 0.1296 - accuracy: 0.9636

Epoch 3/20

60000/60000 [==============================] - 18s 300us/sample - loss: 0.0912 - accuracy: 0.9747

Epoch 4/20

60000/60000 [==============================] - 18s 295us/sample - loss: 0.0738 - accuracy: 0.9791

Epoch 5/20

60000/60000 [==============================] - 18s 300us/sample - loss: 0.0654 - accuracy: 0.9809

------- continue

Stacking

Stacking, also known as stacked generalization, is an ensemble learning technique in machine learning
where multiple models are combined in a hierarchical manner to improve prediction accuracy. The tech-
nique involves training a set of base models on the original training dataset, and then using the predictions
of these base models as inputs to a meta-model, which is trained to make the final predictions.

The basic idea behind stacking is to leverage the strengths of multiple models by combining them in a way
that compensates for their individual weaknesses. By using a diverse set of models that make different
assumptions and capture different aspects of the data, we can improve the overall predictive power of the
ensemble.

The stacking technique can be divided into two stages -

1. Base Model Training - In this stage, a set of base models are trained on the original training
data. These models can be of any type, such as decision trees, random forests, support vector
machines, neural networks, or any other algorithm. Each model is trained on a subset of the
training data, and produces a set of predictions for the remaining data points.
2. Meta-model Training - In this stage, the predictions of the base models are used as inputs to
a meta-model, which is trained on the original training data. The goal of the meta-model is
to learn how to combine the predictions of the base models to produce more accurate predic­
tions. The meta-model can be of any type, such as linear regression, logistic regression, or any
other algorithm. The meta-model is trained using cross-validation to avoid overfitting.

Once the meta-model is trained, it can be used to make predictions on new data points by passing the
predictions of the base models as inputs. The predictions of the base models can be combined in different
ways, such as by taking the average, weighted average, or maximum.
Example
Here is an example implementation of stacking in Python using scikit-learn -

X, y = [Link], [Link]

rf = RandomForestClassifier(n_estimators=10, random_state=42)
gb = GradientBoostingClassifier(random_state=42)

Ir = LogisticRegressionQ

stack = StackingClassifier(classifiers=[rf, gb], meta_classifier=lr)

Use cross-validation to generate predictions for the meta-mode

y_pred = cross_val_predict(stack, X, y, cv=5)

acc = accuracy_score(y, y_pred)^^^^^^^^B

In this code, we first load the iris dataset and define the base models, which are a random forest and a
gradient boosting classifier. We then define the meta-model, which is a logistic regression model.
We create a StackingClassifier object with the base models and meta-model, and use cross-validation to
generate predictions for the meta-model. Finally, we evaluate the performance of the stacked model using
the accuracy score.

Output
When you execute this code, it will produce the following output -

Accuracy: 0.9666666666666667

Epoch
In machine learning, an epoch refers to a complete iteration over the entire training dataset during the
model training process. In simpler terms, it is the number of times the algorithm goes through the entire
dataset during the training phase.

During the training process, the algorithm makes predictions on the training data, computes the loss, and
updates the model parameters to reduce the loss. The objective is to optimize the model's performance by
minimizing the loss function. One epoch is considered complete when the model has made predictions on
all the training data.
Epochs are an essential parameter in the training process as they can significantly affect the performance
of the model. Setting the number of epochs too low can result in an underfit model, while setting it too
high can lead to overfitting.

Underfitting occurs when the model fails to capture the underlying patterns in the data and performs
poorly on both the training and testing datasets. It happens when the model is too simple or not trained
enough. In such cases, increasing the number of epochs can help the model learn more from the data and
improve its performance.

Overfitting, on the other hand, happens when the model learns the noise in the training data and performs
well on the training set but poorly on the testing data. It occurs when the model is too complex or trained
for too many epochs. To avoid overfitting, the number of epochs must be limited, and other regularization
techniques like early stopping or dropout should be used.

Implementation in Python
In Python, the number of epochs is specified in the training loop of the machine learning model. For
example, when training a neural network using the Keras library, you can set the number of epochs using
the "epochs" argument in the "fit" method.

Example
from [Link] import Sequential

from [Link] import Dense

.train = [Link]( 1

train = [Link](Q, 2, size=(100,))

model =

[Link](Dense(16, input_dim= 10, activation-relu'))

[Link](Dense( ■, activation-

compile the model with binary cross-entropv loss and adam optimizer

[Link](loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

train the model with 10 epochs


[Link](X_train, y_train, epochs = 10)

In this example, we generate some random data for training and create a simple neural network model
with one input layer, one hidden layer, and one output layer. We compile the model with binary cross-en­
tropy loss and the Adam optimizer and set the number of epochs to 10 in the "fit" method.

During the training process, the model makes predictions on the training data, computes the loss, and
updates the weights to minimize the loss. After completing 10 epochs, the model is considered trained,
and we can use it to make predictions on new, unseen data.

Output
When you execute this code, it will produce an output like this -

Epoch 1/10

4/4 [====== ======= ======= ======= = ==]- 3is 2ms/step - loss: 0.7012 - accuracy: 0.4976

Epoch 2/10

4/4 [====== ======= ======= ======= = ==] - 0s ims/step - loss: 0.6995 - accuracy: 0.4390

Epoch 3/10

4/4 [====== ======= ======= ======= = ==]-0s ims/step - loss: 0.6921 - accuracy: 0.5123
Epoch 4/10

4/4 [==============================] - Os Ims/step - loss: 0.6778 - accuracy: 0.5474

Epoch 5/10

4/4 [==============================] - Os Ims/step - loss: 0.6819 - accuracy: 0.5542

Epoch 6/10

4/4 [==============================] - 0s Ims/step - loss: 0.6795 - accuracy: 0.5377

Epoch 7/10

4/4 [==============================] - 0s Ims/step - loss: 0.6840 - accuracy: 0.5303

Epoch 8/10

4/4 [==============================] - 0s Ims/step - loss: 0.6795 - accuracy: 0.5554

Epoch 9/10

4/4 [==============================] - 0s Ims/step - loss: 0.6706 - accuracy: 0.5545

Epoch 10/10

4/4 [==============================] - 0s Ims/step - loss: 0.6722 - accuracy: 0.5556


Perceptron
Perceptron is one of the oldest and simplest neural network architectures. It was invented in the 1950s by
Frank Rosenblatt. The Perceptron algorithm is a linear classifier that classifies input into one of two possi­
ble output categories. It is a type of supervised learning that trains the model by providing labeled training
data. The Perceptron algorithm is based on a threshold function that takes the weighted sum of inputs and
applies a threshold to generate a binary output.

Architecture of Perceptron
A single layer of Perceptron consists of an input layer, a weight layer, and an output layer. Each node in the
input layer is connected to each node in the weight layer with a weight assigned to each connection. Each
node in the weight layer computes a weighted sum of inputs and applies a threshold function to generate
the output.

The threshold function in Perceptron is the Heaviside step function, which returns a binary value of 1 if
the input is greater than or equal to zero, and 0 otherwise. The output of each node in the weight layer is
determined by -

i;
0;
if

w
0

w
1

w
2

2
w
n

>=0

otherwise
& ={1; &&& 0+^>l^l+«>2«>2+ ••• + >=00;

Where "y" is the output,xt,x2, ...,x_ are the input features; and wo, wi; w2,..., wn are the corresponding weights,
and > = 0 indicates the Heaviside step function.

Training of Perceptron
The training process of the Perceptron algorithm involves iteratively updating the weights until the model
converges to a set of weights that can correctly classify all training examples. Initially, the weights are set
to random values. For each training example, the predicted output is compared to the actual output, and
the weights are updated accordingly to minimize the error.
The weight update rule in Perceptron is as follows -

W
i

W
i

+ax(y-

y
/

)x

X
1

0WWx(W')xH

Where W is the weight of the i-th feature,


& is the learning rate,y is the actual output, y' is the predicted output, and x. is the i-th input feature.

Implementation of Perceptron in Python


The Perceptron algorithm is implemented in Python using the scikit-learn library. The scikit-learn library
provides a Perceptron class that can be used for binary classification problems.

Here is an example of implementing the Perceptron algorithm in Python using scikit-learn -

Example
[Link], j/Kd ■

from [Link] import load„iris^^^^^^^B

from sklearn.model_selection import train_test_split

from [Link] import accuracyScore^^^^B

i Load the iris dataset

iris = load_iris()
Split the dataset into training and testing sets

X_train, X_test, y_train, y_test = train_test_split([Link], [Link], test_size=0.3, random_state=Q)

Use the trained Perceptron to make predictions on the testing data

y_pred = [Link](X_test)

accuracy = accuracy_score(y_test, y_pred)

print (''Accuracy:'1, accuracy)^^^^^^M


Output
When you execute this code, it will produce the following output -

Accuracy: 0.8

Once the perceptron is trained, it can be used to make predictions on new input data. Given a set of input
values, the perceptron computes a weighted sum of the inputs and applies an activation function to the
sum to obtain the output value. This output value can then be interpreted as a prediction for the corre­
sponding input.

Role of Step Functions in the Training of Perceptrons


The activation function used in a perceptron can vary, but a common choice is the step function. The step
function returns 1 if the input is positive or 0 if it is negative or zero. This function is useful because it pro­
vides a binary output, which can be interpreted as a prediction for a binary classification problem.

Here is an example implementation of a perceptron in Python using the step function as the activation
function -

import numi

class Perceptron:
def_ init_ (self, learning_rate=0.1, epochs = 100):

self.learning_rate = learning_rate

~ Ml
[Link] =

selahias

return [Link](x > = 0,1,0)

n_samples, n_features = [Link]

# initialize weights and bias to 0 J

[Link] = [Link](n_features)|

[Link] = 0
# iterate over epochs and update weights and bias

for _ in range([Link]):

for i in range(n_samples):

linear_output = [Link]([Link], X[i]) + [Link]

y_pred = self.step_function(linear_output)

# update weights and bias based on error

update = self.learning_rate * (y[i] - y_pred)

[Link] + = update * X[i]

[Link] + = update^^^^B

def prodict(self,

linear_output = [Link](X, [Link]) + [Link]

y_pred = self.step_function(linear_output)

return y_pred
In this implementation, the Perceptron class takes two parameters: learning_rate and epochs. The fit
method trains the perceptron on the input data X and the corresponding target values y. The predict
method takes an input data array and returns the predicted output values.

To use this implementation, we can create an instance of the Perceptron class and call the fit method to
train the model -

jerceptron = Perceptron(learning_rate=0.1, epochs =10)

Once the model is trained, we can make predictions on new input data using the predict method -

predictions = [Link](test_data)

U 1 l': 'U ' J :

The output of this code is [1,0], which are the predicted values for the input data [[1,1], [0,1]].
Regularization
In machine learning, regularization is a technique used to prevent overfitting, which occurs when a model
is too complex and fits the training data too well, but fails to generalize to new, unseen data. Regularization
introduces a penalty term to the cost function, which encourages the model to have smaller weights and a
simpler structure, thereby reducing overfitting.

There are several types of regularization techniques commonly used in machine learning, including LI
and L2 regularization, dropout regularization, and early stopping. In this article, we will focus on LI and
L2 regularization, which are the most commonly used techniques.

LI Regularization

LI regularization, also known as Lasso regularization, is a technique that adds a penalty term to the cost
function, equal to the absolute value of the sum of the weights. The formula for the LI regularization
penalty is -

XxS|wi|
where A. is a hyperparameter that controls the strength of the regularization, and n«v«y is the i-th weight
in the model.

The effect of the LI regularization penalty is to encourage the model to have sparse weights, that is, to
eliminate the weights that have little or no impact on the output. This has the effect of simplifying the
model and reducing overfitting.

Example

To implement LI regularization in Python, we can use the Lasso class from the scikit-learn library. Here is
an example of how to use LI regularization for linear regression -

from sklearn.linear_model import Lasso

from [Link] import load_boston

from [Link] import train_test_split

from [Link] import mean_squared_error

# Load the Boston Housing dataset


boston = load_boston()

# Split the data into training and test sets

X_train, X_test, y_train, y_test = train_test_split([Link], [Link], test_size=0.2, random_s-


tate=42)

# Create a Lasso model with LI regularization

lasso = Lasso(alpha=0.1)

# Train the model on the training data

[Link](X_train, y_train)

# Make predictions on the test data

y_pred = [Link](X_test)

# Calculate the mean squared error of the predictions

mse = mean_squared_error(y_test, y_pred)


printf'Mean squared error:", mse)

In this example, we load the Boston Housing dataset, split it into training and test sets, and create a Lasso
model with L1 regularization using an alpha value of 0.1. We then train the model on the training data and
make predictions on the test data. Finally, we calculate the mean squared error of the predictions.

Output

When you execute this code, it will produce the following output -

Mean squared error: 25.155593753934173

L2 Regularization

L2 regularization, also known as Ridge regularization, is a technique that adds a penalty term to the cost
function, equal to the square of the sum of the weights. The formula for the L2 regularization penalty is -

Xx£(wi)2

where A. is a hyperparameter that controls the strength of the regularization, and wi is the ith weight in the
model.
The effect of the L2 regularization penalty is to encourage the model to have small weights, that is, to
reduce the magnitude of all the weights in the model. This has the effect of smoothing the model and re­
ducing overfitting.

Example

To implement L2 regularization in Python, we can use the Ridge class from the scikit-learn library. Here is
an example of how to use L2 regularization for linear regression -

from sklearn.linear_model import Ridge

from [Link] import train_test_split

from [Link] import mean_squared_error

from [Link] import load_boston

from [Link] import StandardScaler

import numpy as np

# load the Boston housing dataset

boston = load_boston()
# create feature and target arrays

X = [Link]

y = [Link]

# standardize the feature data

scaler = StandardScalerQ

X = scaler.fit_transform(X)

# split the data into training and testing sets

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# define the Ridge regression model with L2 regularization

model - Ridge(alpha=0.1)

# fit the model on the training data

[Link](X_train, y_train)
# make predictions on the testing data

y_pred = [Link](X_test)

# calculate the mean squared error

mse = mean_squared_error(y_test, y_pred)

print("Mean Squared Error:", mse)

In this example, we first load the Boston housing dataset and split it into training and testing sets. We then
standardize the feature data using a StandardScaler.

Next, we define the Ridge regression model and set the alpha parameter to 0.1, which controls the strength
of the L2 regularization.

We fit the model on the training data and make predictions on the testing data. Finally, we calculate the
mean squared error to evaluate the performance of the model.

Output

When you execute this code, it will produce the following output -
Mean Squared Error: 2 4.2 9 3 4 6 2 5 0 5 9 610 7

Overfitting

Overfitting occurs when a model learns the noise in the training data, rather than the underlying patterns. This
causes the model to perform well on the training data, but poorly on new data. Essentially, the model becomes too
specialized to the training data, and is unable to generalize to new data.

Overfitting is a common problem when using complex models, such as deep neural networks. These models have
many parameters, and are able to fit the training data very closely. However, this often comes at the expense of gen­
eralization performance.

Causes of Overfitting
There are several factors that can contribute to overfitting -

1. Complex models - As mentioned earlier, complex models are more likely to overfit than simpler mod­
els. This is because they have more parameters, and are able to fit the training data more closely.
2. Limited training data - When there is not enough training data, it becomes difficult for the model to
learn the underlying patterns, and it may instead learn the noise in the data.
3. Unrepresentative training data - If the training data is not representative of the problem that the
model is trying to solve, the model may learn irrelevant patterns that do not generalize well to new
data.
4. Lack of regularization - Regularization is a technique used to prevent overfitting by adding a penalty
term to the cost function. If this penalty term is not present, the model is more likely to overfit.

Techniques to Prevent Overfitting


There are several techniques that can be used to prevent overfitting in machine learning -

1. Cross-validation - Cross-validation is a technique used to evaluate a model's performance on new,


unseen data. It involves dividing the data into several subsets, and using each subset in turn as a vali­
dation set, while training on the remaining data. This helps to ensure that the model generalizes well
to new data.
2. Early stopping - Early stopping is a technique used to prevent a model from overfitting by stopping
the training process before it has converged completely. This is done by monitoring the validation
error during training, and stopping when the error stops improving.
3. Regularization - Regularization is a technique used to prevent overfitting by adding a penalty term to
the cost function. The penalty term encourages the model to have smaller weights, and helps to pre­
vent it from fitting the noise in the training data.
4. Dropout - Dropout is a technique used in deep neural networks to prevent overfitting. It involves
randomly dropping out some of the neurons during training, which forces the remaining neurons to
learn more robust features.

Example
Here is an implementation of early stopping and L2 regularization in Python using Keras -
frondcera^ayer^mpoi^Dense^^^^H
ton^era^aUback^mpor^arlyStoppin
from keras import regularizersMMMI

mode^dd(Dense(64Jnpu^inm}Oraii^hape^^ctivation=l^hfAerne^egulanze^regulanze^2(M^))

compile the mode


[Link](loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

# set up early stopping callback


early_stopping = EarlyStopping(monitor='val_loss', patience=5)

# tram the model with early stopping and L2 regularization


history = [Link](X_train, y_train, validation_split=0.2, epochs=100, batch_size=64, callbacks=[early_stoppingl)

In this code, we have used the Sequential model in Keras to define the model architecture, and we have added L2
regularization to the first two layers using the kernel_regularizer argument. We have also set up an early stopping
callback using the EarlyStopping class in Keras, which will monitor the validation loss and stop training if it stops
improving for 5 epochs.
During training, we pass in the X_train and y_train data as well as a validation split of 0.2 to monitor the validation
loss. We also set a batch size of 64 and train for a maximum of 100 epochs.

Output
When you execute this code, it will produce an output like the one shown below -

Train on 323 samples, validate on 81 samples


Epoch 1/100
323/323 [==============================] - Os 792us/sample - loss: -8.9033 - accuracy: 0.0000e+00 -
valjoss: -15.1467 - val_accuracy: 0.0000e+00
Epoch 2/100
323/323 [== = ====== = ============= = ======] - Os 46us/sample - loss: -20.4505 - accuracy: 0.0000e+00 -
val_loss: -25.7619 - val_accuracy: 0.0000e+00
Epoch 3/100
323/323 [== = ====== = ====== = ====== = ======] - Os 43us/sample - loss: -31.9206 - accuracy: 0.0000e+00 -
valjoss: -36.8155 - val_accuracy: 0.0000e+00
Epoch 4/100
323/323 [==============================] - Os 46us/sample - loss: -44.2281 - accuracy: 0.0000e+00 -
valjoss: -49.0378 - val_accuracy: 0.0000e+00
Epoch 5/100
323/323 [==============================] - Os 52us/sample - loss: -58.3326 - accuracy: 0.0000e+00 -
valjoss: -62.9369 - val_accuracy: 0.0000e+00
Epoch 6/100
323/323 [==============================] - Os 40us/sample - loss: -74.2131 - accuracy: 0.0000e+00 -
val_loss: -78.7068 - val_accuracy: 0.0000e+00
----- continue

By using early stopping and L2 regularization, we can help prevent overfitting and improve the generalization per­
formance of our model.

P-value
In machine learning, we use P-value to test the null hypothesis that there is no significant relationship between two
variables. For example, if we have a dataset of house prices and we want to determine whether there is a significant
relationship between the size of the house and its price, we can use P-value to test this hypothesis.

To understand the concept of P-value in machine learning, we need to first understand the concept of null hy­
pothesis and alternative hypothesis. The null hypothesis is the hypothesis that there is no significant relationship
between the two variables, while the alternative hypothesis is the opposite of the null hypothesis, which states that
there is a significant relationship between the two variables.

Once we have defined our null hypothesis and alternative hypothesis, we can use P-value to test the significance of
our hypothesis. The P-value is the probability of obtaining the observed result or a more extreme result, assuming
that the null hypothesis is true.

If the P-value is less than the significance level (usually set at 0.05), then we reject the null hypothesis and accept the
alternative hypothesis. This means that there is a significant relationship between the two variables. On the other
hand, if the P-value is greater than the significance level, then we fail to reject the null hypothesis and conclude that
there is no significant relationship between the two variables.

Implementation of P-value in Python


Python provides several libraries for statistical analysis and hypothesis testing. One of the most popular libraries
for statistical analysis is the scipy library. The scipy library provides a function called ttest_ind() that can be used to
calculate the P-value for two independent samples.

To demonstrate the implementation of p-value in Machine Learning, we will use the breast cancer dataset provided
by scikit-learn. The goal of this dataset is to predict whether a breast tumor is malignant or benign based on various
features such as the tumor's radius, texture, perimeter, area, smoothness, compactness, concavity, and symmetry.

First, we will load the dataset and split it into training and testing sets -

from [Link] import train_test_split

data = load_breast_cancer()

= [Link]

X train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

Next, we will use the SelectKBest class from scikit-learn to select the top k features based on their p-values. Here, we
will select the top 5 features -
from sklearn.feature_selection import SelectKBest, f_classif

^etectom^electKBest(scor^unc=^lassifJ<=k)^^B
t^miroiew^selecto^^mnsfonnQOrain/y^min)
X_test_new [Link](X_test)^^^^^^M

The SelectKBest class takes a score function as input to calculate the p-values for each feature. We use the f_classif
function, which is the ANOVA F-value between each feature and the target variable. The k parameter specifies the
number of top features to select.

After fitting the selector on the training data, we transform the data to keep only the top k features using the
fit_transform() method. We also transform the testing data to keep only the selected features using the transform()
method.

We can now train a model on the selected features and evaluate its performance -

ftnn^kleanUineaoiiodeHmpor^ogisticRegression
from [Link] import accuracy scoreHnEH

y_pred = [Link](X_test_new)

accuracy = accuracy_score(y_test, y_pred)


print(f'Accuracy: |accuracy:.2f]")
In this example, we trained a logistic regression model on the top 5 selected features and evaluated its performance
using accuracy. However, the p-value can also be used for hypothesis testing to determine whether a feature is sta­
tistically significant or not.

For example, to test the hypothesis that the mean radius feature is significant, we can use the ttest_ind() function
from the [Link] module -

from [Link] import ttest ind

print(f"P-value: {p_value:.2f}")

The ttest_ind() function takes two arrays as input and returns the t-statistic and the two-tailed p-value.

Output
We will get the following output from the above implementation -

Accuracy: 0.97
P-value: 0.00

In this example, we calculated the p-value for the mean radius feature between the malignant and benign classes.
Entropy

Entropy is a concept that originates from thermodynamics and was later applied in various fields, includ­
ing information theory, statistics, and machine learning. In machine learning, entropy is used as a mea­
sure of the impurity or randomness of a set of data. Specifically, entropy is used in decision tree algorithms
to decide how to split the data to create a more homogeneous subset. In this article, we will discuss entropy
in machine learning, its properties, and its implementation in Python.

Entropy is defined as a measure of disorder or randomness in a system. In the context of decision trees,
entropy is used as a measure of the impurity of a node. A node is considered pure if all the examples in it
belong to the same class. In contrast, a node is impure if it contains examples from multiple classes.

To calculate entropy, we need to first define the probability of each class in the data set. Let p(i) be the
probability of an example belonging to class i. If we have k classes, then the total entropy of the system, de­
noted by H(S), is calculated as follows -

[Math Processing Error]

where the sum is taken over all k classes. This equation is called the Shannon entropy.
For example, suppose we have a dataset with 100 examples, of which 60 belong to class A and 40 belong to
class B. Then the probability of class A is 0.6 and the probability of class B is 0.4. The entropy of the dataset
is then -

[Math Processing Error]


&(#)=-(0.6* 2(0.6)+0.4* &^2(0.4))=0.971

If all the examples in the dataset belong to the same class, then the entropy is 0, indicating a pure node. On
the other hand, if the examples are evenly distributed across all classes, then the entropy is high, indicat­
ing an impure node.

In decision tree algorithms, entropy is used to determine the best split at each node. The goal is to create a
split that results in the most homogeneous subsets. This is done by calculating the entropy of each possi­
ble split and selecting the split that results in the lowest total entropy.

For example, suppose we have a dataset with two features, XI and X2, and the goal is to predict the class
label, Y. We start by calculating the entropy of the entire dataset, H(S). Next, we calculate the entropy of
each possible split based on each feature. For example, we could split the data based on the value of XI or
the value of X2. The entropy of each split is calculated as follows -

[Math Processing Error]

where pt, p2, p3, and p. are the probabilities of each subset; and HCS,), H(S2), H(S3), and H(S4) are the entropies
of each subset.
We then select the split that results in the lowest total entropy, which is given by -

[Math Processing Error]


1) ( & 1)< & ( * 2); ***** ( * 2)

This split is then used to create the child nodes of the decision tree, and the process is repeated recursively
until all nodes are pure or a stopping criterion is met.

Example
Let's take an example to understand how it can be implemented in Python. Here we will use the "iris"
dataset -

from [Link] import loadjris

import numi

Extract features and target

X = [Link]
y = [Link]

Define a function to calculate entrop

def entropy(y):

n = len(y)

, counts = [Link](y, return_counts=True)

probs = counts / n

return -[Link](probs * np.log2(probs))

Calculate the entropy of the target variable


r
target_entropv = entroi

rint(f"Target entropy: {target_entro]

The above code loads the iris dataset, extracts the features and target, and defines a function to calculate
entropy. The entropyQ function takes a vector of target values and returns the entropy of the set.

The function first calculates the number of examples in the set and the count of each class. It then
calculates the proportion of each class and uses these to calculate the entropy of the set using the entropy
formula. Finally, the code calculates the entropy of the target variable in the iris dataset and prints it to the
console.

Output
When you execute this code, it will produce the following output -

Target entropy: 1.585

MLOps
MLOps (Machine Learning Operations) is a set of practices and tools that combine software engineering,
data science, and operations to enable the automated deployment, monitoring, and management of ma­
chine learning models in production environments.

MLOps addresses the challenges of managing and scaling machine learning models in production, which
include version control, reproducibility, model deployment, monitoring, and maintenance. It aims to
streamline the entire machine learning lifecycle, from data preparation and model training to deployment
and maintenance.

MLOps Best Practices


MLOps involves a number of key practices and tools, including -
1. Version control - This involves tracking changes to code, data, and models using tools like Git
to ensure reproducibility and maintain a history of all changes.
2. Continuous integration and delivery (CI/CD) - This involves automating the process of
building, testing, and deploying machine learning models using tools like Jenkins, Travis CI,
or Circled.
3. Containerization - This involves packaging machine learning models and dependencies into
containers using tools like Docker or Kubernetes, which enables easy deployment and scaling
of models in production environments.
4. Model serving - This involves setting up a server to host machine learning models and serv­
ing predictions on incoming data.
5. Monitoring and logging - This involves tracking the performance of machine learning mod­
els in production environments using tools like Prometheus or Grafana, and logging errors
and alerts to enable proactive maintenance.
6. Automated testing - This involves automating the testing of machine learning models to en­
sure they are accurate and robust.

Python Libraries for MLOps


Python has a number of libraries and tools that can be used for MLOps, including -

1. Scikit-learn - A popular machine learning library that provides tools for data preprocessing,
model selection, and evaluation.
2. TensorFlow - A widely used open-source platform for building and deploying machine learn­
ing models.
3. Keras - A high-level neural networks API that can run on top of TensorFlow.
4. PyTorch - A deep learning framework that provides tools for building and deploying neural
networks.
5. MLflow - An open-source platform for managing the machine learning lifecycle that provides
tools for tracking experiments, packaging code and models, and deploying models in produc­
tion.
6. Kubeflow - A machine learning toolkit for Kubernetes that provides tools for managing and
scaling machine learning workflows.

Data Leakage
Data leakage is a common problem in machine learning that occurs when information from outside the
training dataset is used to create or evaluate a model. This can lead to overfitting, where the model is too
closely tailored to the training data and performs poorly on new data.

There are two main types of data leakage: Target Leakage and Train-test Contamination
Target Leakage
Target leakage occurs when features that are not available during prediction are used to create the model.
For example, if we are predicting whether a customer will churn, and we include the customer's cancella­
tion date as a feature, then the model will have access to information that would not be available in prac­
tice. This can lead to unrealistically high accuracy during training and poor performance on new data.

Train-test Contamination
Train-test contamination occurs when information from the test set is inadvertently used in the training
process. For example, if we normalize the data based on the mean and standard deviation of the entire
dataset instead of just the training set, then the model will have access to information that would not be
available in practice. This can lead to overly optimistic estimates of model performance.

How to Prevent Data Leakage?


To prevent data leakage, it is important to carefully preprocess the data and ensure that no information
from the test set is used in the training process. Some strategies for preventing data leakage include -

1. Splitting the data into separate training and test sets before doing any preprocessing or fea­
ture engineering.
2. Only using features that would be available at the time of prediction.
3. Using cross-validation to evaluate model performance instead of a single train-test split.
4. Ensuring that all preprocessing steps (such as normalization or scaling) are applied to the
training set only and then using the same transformations on the test set.
5. Being aware of any potential sources of leakage, such as date or time-based features, and han­
dling them appropriately.

Implementation in Python
Here is an example in which we will be using Sklearn breast cancer dataset and ensure that no information
from the test set is leaked into the model during training -

Example

from sklearn.model_selection import train_test_split

from [Link] import Pipelme^^^^^^^M

from [Link] import StandardScalerJ

[Link]
Split the data into train and test sets

X_train, X_test, y_train, y_test - train_test_split(X, y, test_size=0.2, random_state=42)

Define the pipelint

('scaler', StandardScalerQ),

Make predictions on the test se


y_pred = [Link](X_test)

accuracy = accuracy_score(y_test, y_pre

print("Accuracy:", accuracy)^^^^^B

Output
When you execute this code, it will produce the following output -

Accuracy: 0.9824561403508771

Thank You

You might also like