Machine Learning Concepts and Python Guide
Programming Machine Learning: Machine Learning Basics Concepts + Artificial Intelligence +
Python Programming + Python Machine Learning
(Language of Book: English)
Author
Kavishankar Panchtilak
(Professional Blogger, Author)
Publisher
Kavis Web Designer
214, Ukwa Balaghat 481105 (Madhya Pradesh)
Company Website: Kavis Web Designer
Amazon Author Profile: [Link]
Table Of Contents
Basic Concepts
Data
Feature
Model
Training
Testing
Overfitting
Underfitting
Why & When to Make Machines Learn?
Lack of human expertise
Dynamic scenarios
Difficulty in translating expertise into computational tasks
Machine Learning Model
Task(T)
Experience (E)
Performance (P)
Python Libraries
Problem Definition
Data Collection
Data Preparation
Model Building
Model Evaluation
Model Deployment
Required Skills
Programming Skills
Statistics and Mathematics
Mathematical Notation
Probability Theory
Optimization Problem
Data Preprocessing
Data Visualization
Machine Learning Algorithms
Deep Learning
Natural Language Processing
Problem-solving Skills
Communication Skills
Business Acumen
Implementation
Overfitting
Underfitting
Data Quality Issues
Imbalanced Datasets
Model Interpretability
Generalization
Scalability
Ethical Considerations
Limitations
Linear Algebra
Calculus
Probability
Statistics
Artificial Intelligence
Public Datasets
Data Scraping
Data Purchase
Data Collection
Strategies for Acquiring High Quality Datasets
Identify the Problem You Want to Solve
Determine the Size of the Dataset
Ensure the Data is Relevant and Accurate
Preprocess the Data
Categorical Data
Classification
Regression
Algorithms for Supervised Learning
k-Nearest Neighbours
Decision Trees
Naive Bayes
Logistic Regression
Support Vector Machines
Unsupervised
What is Unsupervised Learning?
Algorithms for Unsupervised Learning
k-means clustering
Cluster Identification
Clustering
Association
Dimensionality Reduction
Anomaly Detection
Semi-supervised
Reinforcement
Mean
Median
Mode
Python Implementation
Output
Standard Deviation
Types of Examples
Example 1
Example 2
Percentiles
Example
Output
Bias and Variance
Example
Output
Reducing Bias and Variance
Example
Example
Example
Hypothesis
Python Implementation
Data Preparation
Model Training
Model Testing
Model Evaluation
Plotting the Regression Line
Complete Implementation Example
Multiple Linear Regression
Python Implementation
Example
Polynomial Regression
Python Implementation
Example
Classification Algorithms
Data Preparation
Feature Extraction/Selection
Model Selection
Model Training
Model Evaluation
Hyperparameter Tuning
Model Deployment
Types of Learners in Classification
Lazy Learners
Eager Learners
Building a Classifier in Python
Step 1: Importing necessary python package
Step 2: Importing dataset
Step 3: Organizing data into training & testing sets
Step 4: Model evaluation
Step 5: Finding accuracy
Classification Evaluation Metrics
Confusion Matrix
Various ML Classification Algorithms
Applications
Logistic Regression
Implementation in Python
Load the Dataset
Plot the Training Data
Split the Dataset
Create the Logistic Regression Model
Train the Model
Make Predictions
Evaluate the Model
Complete Implementation Example
K-Nearest Neighbors (KNN)
Implementation in Python
Example
Advantages of Agglomerative Clustering
Disadvantages of Agglomerative Clustering
Dimensionality Reduction
Feature Selection
Feature Extraction
Feature Selection
Filter Methods
Wrapper Methods
Embedded Methods
Principal Component Analysis (PCA)
Recursive Feature Elimination (RFE)
Example
Feature Extraction
Example
Output
Advantages of Feature Extraction
Disadvantages of Feature Extraction
Backward Elimination
Implementation in Python
Example
Forward Feature Construction
Example
Output
High Correlation Filter
Example
Output
Advantages of High Correlation Filter
Disadvantages of High Correlation Filter
Low Variance Filter
Example
Output
Advantages of Low Variance Filter
Disadvantages of Low Variance Filter
Missing Values Ratio
Example
Output
Advantages of Missing Value Ratio
Disadvantages of Missing Value Ratio
Principal Component Analysis
Example
Output
Advantages of PCA
Disadvantages of PCA
Performance Metrics
Example
Output
Automatic Workflows
Introduction
Challenges Accompanying ML Pipelines
Quality of Data
Data Reliability
Data Accessibility
Modelling ML Pipeline and Data Preparation
Example
Modelling ML Pipeline and Feature Extraction
Example
Boost Model Performance
Example
Output
Performance Improvement with Ensembles
Sequential ensemble methods
Parallel ensemble methods
Ensemble Learning Methods
Bagging
Boosting
Voting
Bagging Ensemble Algorithms
Bagged Decision Tree
Output
Random Forest
Output
Extra Trees
Output
Boosting Ensemble Algorithms
AdaBoost
Output
Stochastic Gradient Boosting
Output
Voting Ensemble Algorithms
Output
Gradient Boosting
What is a Gradient Boosting Machine (GBM)?
How Does a Gradient Boosting Machine Work?
Example
Output
Advantages of Using Gradient Boosting Machines
Limitations of Gradient Boosting Machines
Bootstrap Aggregation (Bagging)
How Does Bagging Work?
Example
Output
Cross Validation
What is Cross-Validation?
Why is Cross-Validation Important?
Implementing Cross-Validation in Python
Example
AUC-ROC Curve
What is the AUC-ROC Curve?
Why is the AUC-ROC Curve Important?
Implementing the AUC ROC Curve in Python
Example
Grid Search
Implementation in Python
Example
Data Scaling
Example
Output
Train and Test
Example
Output
Association Rules
Example
Output
Apriori Algorithm
Example
Output
Gaussian Discriminant Analysis
Example
Output
Cost Function
Implementation in Python
Example
Bayes Theorem
Implementation in Python
Example
Precision and Recall
Implementation in Python
Example
Adversarial
Implementation in Python
Example
Stacking
Example
Output
Epoch
Implementation in Python
Example
Perceptron
Architecture of Perceptron
Training of Perceptron
Implementation of Perceptron in Python
Example
Role of Step Functions in the Training of Perceptrons
Regularization
Overfitting
Causes of Overfitting
Techniques to Prevent Overfitting
Example
Output
P-value
Target Leakage
Train-test Contamination
How to Prevent Data Leakage?
Implementation in Python
Example
Basic Concepts
Machine learning is a subset of artificial intelligence that involves training computer algorithms
to automatically learn patterns and relationships in data. Here are some basic concepts of machine
learning -
Data
Data is the foundation of machine learning. Without data, there would be nothing for the algorithm to
learn from. Data can come in many forms, including structured data (such as spreadsheets and databases)
and unstructured data (such as text and images). The quality and quantity of the data used to train the
machine learning algorithm are crucial factors that can significantly impact its performance.
Feature
In machine learning, features are the variables or attributes used to describe the input data. The goal is to
select the most relevant and informative features that will allow the algorithm to make accurate
predictions or decisions. Feature selection is a crucial step in the machine learning process because the
performance of the algorithm is heavily dependent on the quality and relevance of the features used.
Model
A machine learning model is a mathematical representation of the relationship between the input data
(features) and the output (predictions or decisions). The model is created using a training dataset and then
evaluated using a separate validation dataset. The goal is to create a model that can accurately generalize
to new, unseen data.
Training
Training is the process of teaching the machine learning algorithm to make accurate predictions or
decisions. This is done by providing the algorithm with a large dataset and allowing it to learn from the
patterns and relationships in the data. During training, the algorithm adjusts its internal parameters to
minimize the difference between its predicted output and the actual output.
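The parameter adjustment described above can be sketched with gradient descent on a one-feature linear model. This is a minimal illustration, not a method prescribed by the text: the toy data, the learning rate, the step count, and the function names are all assumptions chosen for the example.

```python
# Toy training data generated from y = 2x + 1 (an assumed example relationship).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]

def mean_squared_error(w, b):
    """Average squared difference between predicted and actual outputs."""
    return sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def train(learning_rate=0.05, steps=2000):
    """Adjust the internal parameters w and b to reduce the error step by step."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        # Gradients of the mean squared error with respect to w and b.
        grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b

initial_loss = mean_squared_error(0.0, 0.0)
w, b = train()
final_loss = mean_squared_error(w, b)
print(f"w={w:.3f}, b={b:.3f}, loss {initial_loss:.2f} -> {final_loss:.6f}")
```

With these assumed settings the learned parameters approach w = 2 and b = 1, the values used to generate the data.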
Testing
Testing is the process of evaluating the performance of the machine learning algorithm on a separate
dataset that it has not seen before. The goal is to determine how well the algorithm generalizes to new,
unseen data. If the algorithm performs well on the testing dataset, the model is considered successful.
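As a rough sketch of testing on held-out data, the example below shuffles a small dataset, sets aside a quarter of it, trains a very simple threshold classifier on the rest, and measures accuracy only on the held-out rows. The dataset, the 25% test fraction, the seed, and the helper names are assumptions made for illustration.

```python
import random

# Hypothetical one-feature dataset: label 0 for small values, label 1 for large ones.
data = [(float(v), 0) for v in range(1, 11)] + [(float(v), 1) for v in range(21, 31)]

def train_test_split(rows, test_fraction=0.25, seed=0):
    """Shuffle a copy of the rows and hold out a fraction of them for testing."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

def class_mean(rows, label):
    values = [x for x, y in rows if y == label]
    return sum(values) / len(values)

def fit_threshold(train_rows):
    """Learn a decision threshold halfway between the two class means."""
    return (class_mean(train_rows, 0) + class_mean(train_rows, 1)) / 2

def accuracy(threshold, rows):
    correct = sum(1 for x, y in rows if (1 if x > threshold else 0) == y)
    return correct / len(rows)

train_rows, test_rows = train_test_split(data)
threshold = fit_threshold(train_rows)          # uses training rows only
test_accuracy = accuracy(threshold, test_rows)  # uses unseen rows only
print(len(train_rows), len(test_rows), test_accuracy)
```

The key design point is that the test rows play no part in fitting the threshold, so the reported accuracy reflects behaviour on data the model has not seen.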
Overfitting
Overfitting occurs when a machine learning model is too complex and fits the training data too closely.
This can lead to poor performance on new, unseen data because the model is too specialized to the training
dataset. To prevent overfitting, it is important to use a validation dataset to evaluate the model's
performance and to use regularization techniques to simplify the model.
Underfitting
Underfitting occurs when a machine learning model is too simple and cannot capture the patterns and
relationships in the data. This can lead to poor performance on both the training and testing datasets.
To prevent underfitting, we can use several techniques, such as increasing model complexity, collecting
more data, reducing regularization, and engineering better features.
It is important to note that preventing underfitting is a balancing act between model complexity and the
amount of data available. Increasing model complexity can help prevent underfitting, but if there is not
enough data to support the increased complexity, overfitting may occur instead. Therefore, it is important
to monitor the model's performance and adjust the complexity as necessary.
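The balance between underfitting and overfitting can be illustrated with three models of increasing flexibility fitted to the same noisy data. This is only a sketch: the made-up data, the choice of a constant predictor as the "too simple" model, and a 1-nearest-neighbour memorizer as the "too complex" model are assumptions for the example.

```python
# Training data: y = 2x plus alternating noise of +1 / -1 (made-up numbers).
train = [(float(x), 2.0 * x + (1.0 if x % 2 == 0 else -1.0)) for x in range(10)]
# Test data: noise-free points from the same underlying relationship.
test = [(x + 0.25, 2.0 * (x + 0.25)) for x in range(9)]

def fit_mean(rows):
    """Too simple: always predict the average training output."""
    mean_y = sum(y for _, y in rows) / len(rows)
    return lambda x: mean_y

def fit_line(rows):
    """Least-squares straight line, matching the true shape of the data."""
    n = len(rows)
    mean_x = sum(x for x, _ in rows) / n
    mean_y = sum(y for _, y in rows) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in rows)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in rows)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return lambda x: slope * x + intercept

def fit_memorizer(rows):
    """Too flexible: return the output of the nearest training point, noise included."""
    return lambda x: min(rows, key=lambda row: abs(row[0] - x))[1]

def mse(model, rows):
    return sum((model(x) - y) ** 2 for x, y in rows) / len(rows)

results = {}
for name, fit in [("mean", fit_mean), ("line", fit_line), ("memorizer", fit_memorizer)]:
    model = fit(train)
    results[name] = (mse(model, train), mse(model, test))
    print(f"{name:10s} train MSE={results[name][0]:.3f}  test MSE={results[name][1]:.3f}")
```

The constant model has high error on both sets (underfitting), the memorizer has zero training error but a worse test error than the line (overfitting), and the line sits between the two extremes.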
Dynamic scenarios
Some scenarios are dynamic in nature, i.e., they keep changing over time. For these scenarios and
behaviors, we want a machine to learn and make data-driven decisions. Examples include network
connectivity and the availability of infrastructure in an organization.
"A computer program is said to learn from experience E with respect to some class of tasks T and
performance measure P, if its performance at tasks in T, as measured by P, improves with experience E."
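The above definition focuses on three parameters, which are also the main components of any learning
algorithm: Task (T), Performance (P), and Experience (E). In this context, we can simplify this
definition as -
ML is a field of AI consisting of learning algorithms that improve their performance (P) at executing
some task (T) over time with experience (E).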
Based on the above, the following diagram represents a Machine Learning Model -
Let us discuss them in more detail now -
Task(T)
From the perspective of the problem, we may define the task T as the real-world problem to be solved. The
problem can be anything, such as finding the best house price in a specific location or finding the best
marketing strategy. In machine learning, however, the definition of a task is different, because ML-based
tasks are difficult to solve with a conventional programming approach.
A task T is said to be an ML-based task when it is defined by the process that the system must follow to
operate on data points. Examples of ML-based tasks include classification, regression, structured
annotation, clustering, and transcription.
Experience (E)
As the name suggests, experience is the knowledge gained from the data points provided to the algorithm or
model. Once provided with the dataset, the model runs iteratively and learns some inherent pattern. The
learning thus acquired is called experience (E). By analogy with human learning, we can think of this as a
human being learning or gaining experience from various attributes such as situations and relationships.
Supervised, unsupervised, and reinforcement learning are some of the ways to learn or gain experience. The
experience gained by our ML model or algorithm is then used to solve the task T.
Performance (P)
An ML algorithm is supposed to perform a task and gain experience with the passage of time. The measure
that tells whether the ML algorithm is performing as expected is its performance (P). P is a quantitative
metric that tells how well a model is performing the task, T, using its experience, E. Many metrics help
to assess ML performance, such as accuracy score, F1 score, confusion matrix, precision, recall, and
sensitivity.
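Several of the metrics named above can be computed directly from the four cells of a binary confusion matrix. The sketch below assumes made-up true labels and predictions, with 1 as the positive class; the variable names are illustrative only.

```python
# Made-up ground-truth labels and model predictions (1 = positive class).
y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0, 1, 0]

# Confusion-matrix cells.
tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)  # true positives
tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)  # true negatives
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)  # false positives
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)  # false negatives

accuracy = (tp + tn) / len(y_true)
precision = tp / (tp + fp)   # of the predicted positives, how many were right
recall = tp / (tp + fn)      # of the actual positives, how many were found (sensitivity)
f1 = 2 * precision * recall / (precision + recall)

print(f"TP={tp} TN={tn} FP={fp} FN={fn}")
print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f} F1={f1:.2f}")
```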
Python Libraries
Python has become one of the most popular programming languages for machine learning due to its
simplicity, versatility, and extensive ecosystem of libraries and tools. In this chapter, we will explore the
Python ecosystem for machine learning and highlight some of the most popular libraries and frameworks.
Easy prototyping
Another important feature of Python that makes it the language of choice for data science is easy and
fast prototyping. This feature is useful for developing new algorithms.
Collaboration feature
The field of data science needs good collaboration, and Python provides many useful tools that
make this extremely easy.
Easy to learn and understand - The syntax of Python is simple; hence it is relatively easy, even for
beginners, to learn and understand the language.
Huge number of modules - Python has a huge number of modules covering every aspect of programming.
These modules are easily available for use, making Python an extensible language.
Support of open source community - Being an open source programming language, Python is supported
by a very large developer community. Because of this, bugs are readily fixed by the Python community. This
characteristic makes Python very robust and adaptive.
Scalability - Python is a scalable programming language because it provides better structure for
supporting large programs than shell scripts do.
Weakness
Although Python is a popular and powerful programming language, it has one notable weakness: slow
execution speed.
Python executes slowly compared with compiled languages because it is an interpreted language. This
remains a major area of improvement for the Python community.
Installing Python
To work in Python, we must first install it. You can install Python in either of
the following two ways -
With the help of the following steps, we can install Python on Unix and Linux platforms -
1. First, go to [Link]/downloads/.
2. Next, click on the link to download zipped source code available for Unix/Linux.
3. Now, download and extract the files.
4. Next, we can edit the Modules/Setup file if we want to customize some options.
a. Next, run the ./configure script
b. make
c. make install
On Windows platform
With the help of the following steps, we can install Python on the Windows platform -
1. First, go to [Link]/downloads/.
2. Next, click on the link for Windows installer [Link] file. Here XYZ is the version we
wish to install.
3. Now, run the downloaded file. It opens the Python install wizard, which is easy to use.
Accept the default settings and wait until the installation is finished.
On Macintosh platform
For Mac OS X, Homebrew, a great and easy-to-use package installer, is recommended for installing Python 3.
If you don't have Homebrew, you can install it with the help of the following command -
p[Link]
Now, to install Python 3 on your system, we need to run the following command -
1. Step 1 - First, download the required installation package from the Anaconda distribution.
The link for the same is [Link]/distribution/. You can choose from Windows, Mac and
Linux OS as per your requirement.
2. Step 2 - Next, select the Python version you want to install on your machine. At the time of
writing, the latest Python version was 3.7. You will get options for both the 64-bit and 32-bit
graphical installers.
3. Step 3 - After you select the OS and Python version, the Anaconda installer is downloaded to
your computer. Now, double-click the file and the installer will install the Anaconda package.
4. Step 4 - To check whether it is installed, open a command prompt and type python.
Components of Python ML Ecosystem
In this section, let us discuss some core data science libraries that form the components of the Python
machine learning ecosystem. These components make Python an important language for data science.
Though there are many such components, let us discuss some of the important ones here -
Jupyter Notebook
Jupyter notebooks provide an interactive computational environment for developing Python-based
data science applications. They were formerly known as IPython notebooks. The following are some
of the features of Jupyter notebooks that make them one of the best components of the Python ML ecosystem -
1. Jupyter notebooks can illustrate the analysis process step by step by arranging items such as
code, images, text, and output in sequence.
2. They help a data scientist document the thought process while developing the analysis.
3. One can also capture the results as part of the notebook.
4. With the help of Jupyter notebooks, we can also share our work with peers.
After pressing enter, it will start a notebook server at localhost:8888 of your computer. It is shown in the
following screen shot -
[Screenshot: browser window showing the Jupyter notebook server at localhost:8888/tree]
Now, after clicking the New tab, you will get a list of options. Select Python 3 and it will take you to a
new notebook where you can start working. You will get a glimpse of it in the following screenshots -
[Screenshots: the Jupyter file browser at localhost:8888/tree, and a new notebook, Untitled1, open with
the Python 3 kernel]
Code cells - As the name suggests, we can use these cells to write code. After we write the code, the
notebook sends it to the kernel associated with the notebook.
Markdown cells - We can use these cells to annotate the computation process. They can contain
text, images, LaTeX equations, HTML tags, etc.
Raw cells - The text written in them is displayed as it is. These cells are used to add text that
we do not wish to be converted by the automatic conversion mechanism of Jupyter notebook.
For more detailed study of jupyter notebook, you can go to the link [Link]/jupyter/
[Link].
NumPy
NumPy is a fundamental library for scientific computing in Python. It provides support for large,
multidimensional arrays and matrices, along with a collection of mathematical functions to operate on them.
NumPy is a critical component of the Python machine learning ecosystem, as it provides the underlying
data structure and numerical operations required for many machine learning algorithms. Below is the
command to install NumPy -
pip install numpy
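As a brief illustration of the array operations described above, the sketch below builds a small feature matrix, applies a set of weights with a matrix-vector product, and centers each column. The numbers and variable names are made up for the example.

```python
import numpy as np

# A small feature matrix: 3 samples (rows) by 2 features (columns).
X = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])
w = np.array([0.5, -1.0])        # assumed weights of a linear model

predictions = X @ w               # matrix-vector product, one value per sample
col_means = X.mean(axis=0)        # mean of each feature column
X_centered = X - col_means        # broadcasting subtracts the means row by row

print(X.shape)
print(predictions)
print(X_centered)
```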
Pandas
Pandas is a powerful library for data manipulation and analysis. It provides a range of functions for
importing, cleaning, and transforming data, along with powerful tools for grouping and aggregating data.
Pandas is particularly useful for data preprocessing in machine learning, as it allows for efficient data
handling and manipulation. Below is the command to install Pandas -
pip install pandas
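The cleaning and aggregation features mentioned above can be sketched on a tiny, made-up table. The column names and values are assumptions; filling missing values with the column mean is just one possible cleaning choice.

```python
import pandas as pd

# Made-up readings with two missing temperatures.
df = pd.DataFrame({
    "city": ["A", "A", "B", "B", "B"],
    "temp": [20.0, None, 30.0, 32.0, None],
})

# Cleaning: replace missing values with the mean of the observed ones.
df["temp"] = df["temp"].fillna(df["temp"].mean())

# Grouping and aggregating: average temperature per city.
summary = df.groupby("city")["temp"].mean()
print(df)
print(summary)
```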
Scikit-learn
Scikit-learn is a popular machine learning library in Python, providing a range of algorithms for
classification, regression, clustering, and more. It also includes tools for data preprocessing, feature
selection, and model evaluation. Scikit-learn is widely used in the machine learning community due to its
ease of use, performance, and extensive documentation.
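A minimal sketch of the scikit-learn workflow, assuming the library is installed (for example with pip install scikit-learn): create an estimator, fit it on training data, then predict and score. The tiny dataset and the choice of a 3-nearest-neighbours classifier are assumptions for illustration.

```python
from sklearn.neighbors import KNeighborsClassifier

# Made-up one-feature training data with two well-separated classes.
X_train = [[1.0], [2.0], [3.0], [10.0], [11.0], [12.0]]
y_train = [0, 0, 0, 1, 1, 1]

model = KNeighborsClassifier(n_neighbors=3)   # estimator with one hyperparameter
model.fit(X_train, y_train)                   # learn from the training data

predictions = model.predict([[2.5], [10.5]])  # classify two new points
train_accuracy = model.score(X_train, y_train)
print(predictions, train_accuracy)
```

Most scikit-learn estimators follow the same fit / predict / score pattern, so swapping in a different algorithm usually changes only the import and the constructor line.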
TensorFlow
TensorFlow is an open-source library for machine learning developed by Google. It provides support for
building and training deep learning models, along with tools for distributed computing and deployment.
TensorFlow is a powerful tool for building complex machine learning models, particularly in the areas of
computer vision and natural language processing. Below is the command to install TensorFlow -
pip install tensorflow
PyTorch
PyTorch is another popular deep learning library in Python. Developed by Facebook, it provides a range of
tools for building and training neural networks, along with support for dynamic computation graphs and
GPU acceleration.
PyTorch is particularly useful for researchers and developers who need a flexible and powerful deep
learning framework. Below is the command to install PyTorch -
pip install torch
OpenCV
OpenCV is a computer vision library that provides tools for image and video processing, along with
support for machine learning algorithms. It is widely used in the computer vision community for tasks such
as object detection, image segmentation, and facial recognition. Below is the command to install OpenCV -
pip install opencv-python
In addition to these libraries, there are many other tools and frameworks in the Python ecosystem for
machine learning, including XGBoost, LightGBM, spaCy, and NLTK. The Python ecosystem for machine
learning is constantly evolving, with new libraries and tools being developed all the time.
Whether you are a beginner or an experienced machine learning practitioner, Python provides a rich and
flexible environment for developing and deploying machine learning models.
Here, it is also important to note that some libraries may require additional dependencies or
system-specific requirements. In such cases, it is recommended to consult the library's documentation for
installation instructions and requirements.
Applications
Machine learning has become a ubiquitous technology that has impacted many aspects of our lives, from
business to healthcare to entertainment. Here are some popular applications of machine learning -
Fraud Detection
Machine learning is widely used in the finance industry for fraud detection. Machine learning algorithms
can analyze vast amounts of transactional data to detect patterns and anomalies that may indicate
fraudulent activity, helping to prevent financial losses and protect customers.
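To give a feel for anomaly detection, the sketch below flags transaction amounts that lie far from the average using a z-score. This is a deliberately simple statistical stand-in, not how production fraud systems work; the amounts, the threshold of 2.0, and the function name are assumptions.

```python
from statistics import mean, pstdev

# Made-up transaction amounts for one customer; the last one is unusual.
amounts = [20.0, 25.0, 22.0, 19.0, 24.0, 21.0, 23.0, 500.0]

def flag_anomalies(values, z_threshold=2.0):
    """Return values whose distance from the mean exceeds z_threshold standard deviations."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return []  # all values identical, nothing stands out
    return [v for v in values if abs(v - mu) / sigma > z_threshold]

suspicious = flag_anomalies(amounts)
print(suspicious)
```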
Predictive Maintenance
Predictive maintenance is a process of using machine learning algorithms to predict when maintenance
will be required on a machine, such as a piece of equipment in a factory. By analyzing data from sensors
and other sources, machine learning algorithms can detect patterns that indicate when a machine is likely
to fail, enabling maintenance to be performed before the machine breaks down.
Healthcare
Machine learning has also found many applications in the healthcare industry. For example, machine
learning algorithms can be used to analyze medical images and detect diseases such as cancer, or to predict
patient outcomes based on their medical history and other factors.
Recommendation Systems
Recommendation systems are used to provide personalized recommendations to users based on their past
behavior and preferences. Machine learning algorithms are used to analyze user data and generate
recommendations for products, services, and content.
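One simple way to generate such recommendations is user-based collaborative filtering: find the most similar user and suggest items that user rated but the target user has not. The sketch below assumes a made-up ratings table (0 means "not rated") and uses cosine similarity; the user names, items, and helper functions are illustrative.

```python
import math

items = ["A", "B", "C", "D"]
# Made-up ratings per user, aligned with `items`; 0 means the item is unrated.
ratings = {
    "alice": [5, 4, 0, 0],
    "bob":   [5, 4, 4, 0],
    "carol": [0, 0, 5, 5],
}

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def recommend(user):
    """Suggest items rated by the most similar user that `user` has not rated."""
    others = [name for name in ratings if name != user]
    nearest = max(others, key=lambda name: cosine_similarity(ratings[user], ratings[name]))
    picks = [item for item, mine, theirs in zip(items, ratings[user], ratings[nearest])
             if mine == 0 and theirs > 0]
    return nearest, picks

nearest, picks = recommend("alice")
print(nearest, picks)
```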
Autonomous Vehicles
Machine learning is a critical technology for the development of autonomous vehicles. Machine learning
algorithms are used to process data from sensors and cameras, allowing vehicles to detect and respond to
their environment in real-time.
These are just a few examples of the many applications of machine learning. As machine learning
continues to evolve and improve, we can expect to see it used in more areas of our lives, improving
efficiency, accuracy, and convenience in a variety of industries.
Life Cycle
The machine learning life cycle is a process that involves several stages from problem identification to
model deployment. Here are the six stages of the machine learning life cycle -
Problem Definition
The first step in the machine learning life cycle is to identify the problem you want to solve. This stage
involves understanding the business problem, defining the problem statement, and identifying the
success criteria for the machine learning model.
Data Collection
The second stage is to collect the data that will be used to train the machine learning model. This stage
involves identifying the relevant data sources, collecting and storing the data, and cleaning and
preprocessing the data to prepare it for analysis.
Data Preparation
In this stage, the data is prepared for analysis by performing data exploration, feature engineering, and
feature selection. Data exploration involves visualizing and understanding the data, while feature
engineering involves creating new features from the existing data. Feature selection involves selecting
the most relevant features that will be used to train the machine learning model.
Model Building
In this stage, the machine learning model is built using the prepared data. The model building process
involves selecting the appropriate machine learning algorithm, tuning the hyperparameters of the
algorithm, and evaluating the performance of the model using cross-validation techniques.
Model Evaluation
In this stage, the performance of the machine learning model is evaluated using a set of evaluation
metrics, such as accuracy, precision, recall, and F1 score. If the model's performance is not
satisfactory, it may be necessary to return to the model building stage to improve it.
Model Deployment
The final stage of the machine learning life cycle is to deploy the machine learning model into production.
This involves integrating the model into the production environment, testing the model in a real-world
scenario, and monitoring the model's performance to ensure that it continues to perform as expected.
The machine learning life cycle is an iterative process, and it may be necessary to revisit previous stages
to improve the model's performance or address new requirements. By following the machine learning life
cycle, data scientists can ensure that their machine learning models are effective, accurate, and meet the
business requirements.
Required Skills
Machine learning is a rapidly growing field that requires a combination of technical and soft skills to be
successful. Here are some of the key skills required for machine learning -
Programming Skills
Machine learning requires a solid foundation in programming skills, particularly in languages such as
Python, R, and Java. Proficiency in programming allows data scientists to build, test, and deploy machine
learning models.
To give you a brief idea of what skills you need to acquire, let us discuss some examples -
Mathematical Notation
Most machine learning algorithms are heavily based on mathematics. The level of mathematics that
you need to know is probably just a beginner level. What is important is that you should be able to read the
notation that mathematicians use in their equations. For example - if you are able to read the notation
below and comprehend what it means, you are ready for learning machine learning. If not, you may need to
brush up your mathematics knowledge.
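$$f_{AN}(net - \theta) = \begin{cases} \gamma & \text{if } net - \theta \geq \epsilon \\ net - \theta & \text{if } -\epsilon < net - \theta < \epsilon \\ -\gamma & \text{if } net - \theta \leq -\epsilon \end{cases}$$
$$f_{AN}(net - \theta) = \frac{e^{\lambda(net - \theta)} - e^{-\lambda(net - \theta)}}{e^{\lambda(net - \theta)} + e^{-\lambda(net - \theta)}}$$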
Probability Theory
Here is an example to test your current knowledge of probability theory: Classifying with conditional
probabilities.
With these definitions, we can define the Bayesian classification rule -
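A common two-class statement of this rule is sketched below. The class labels $c_1$ and $c_2$ and the observed point $(x, y)$ are assumed notation for illustration. Bayes' theorem gives the probability of each class given the observation, and the rule picks the class with the larger posterior.

```latex
p(c_i \mid x, y) = \frac{p(x, y \mid c_i)\, p(c_i)}{p(x, y)}, \qquad i \in \{1, 2\}

\text{If } p(c_1 \mid x, y) > p(c_2 \mid x, y), \text{ the class is } c_1.

\text{If } p(c_1 \mid x, y) < p(c_2 \mid x, y), \text{ the class is } c_2.
```

Because the denominator $p(x, y)$ is the same for both classes, comparing the numerators $p(x, y \mid c_i)\, p(c_i)$ gives the same decision.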
Optimization Problem
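Here is an optimization function -
$$\max_{\alpha} \left[ \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{m} label^{(i)} \cdot label^{(j)} \cdot \alpha_i \cdot \alpha_j \langle x^{(i)}, x^{(j)} \rangle \right]$$
subject to the following constraints -
$$\alpha \geq 0, \quad \text{and} \quad \sum_{i=1}^{m} \alpha_i \cdot label^{(i)} = 0$$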
If you can read and understand the above, you are all set.
Data Preprocessing
Preparing data for machine learning requires knowledge of data cleaning, data transformation, and data
normalization. This involves identifying and correcting errors, missing values, and inconsistencies in the
data.
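Two of the steps mentioned above, handling missing values and normalization, can be sketched in plain Python. Filling gaps with the median and rescaling to the range 0 to 1 are only two of several reasonable choices; the sample values and function names are assumptions.

```python
from statistics import median

# Made-up feature column with two missing entries (None).
ages = [25, None, 35, 45, None, 55]

def impute_median(values):
    """Replace missing entries with the median of the known entries."""
    known = [v for v in values if v is not None]
    fill = median(known)
    return [fill if v is None else v for v in values]

def min_max_scale(values):
    """Rescale values linearly so the smallest becomes 0 and the largest 1."""
    low, high = min(values), max(values)
    if high == low:
        return [0.0 for _ in values]  # constant column carries no spread
    return [(v - low) / (high - low) for v in values]

filled = impute_median(ages)
scaled = min_max_scale(filled)
print(filled)
print([round(v, 3) for v in scaled])
```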
Data Visualization
Data visualization is the process of creating graphical representations of data to help users understand and
interpret complex data sets. Data scientists must be able to create effective visualizations that
communicate insights from the data.
In many cases, you will need to understand the various types of visualization plots to understand your
data distribution and interpret the results of the algorithm’s output.
Besides the above theoretical aspects of machine learning, you need good programming skills to code those
algorithms.
Problem-solving Skills
Machine learning requires strong problem-solving skills, including the ability to identify problems,
generate hypotheses, and develop solutions. Data scientists must be able to think creatively and logically
to develop effective solutions to complex problems.
Communication Skills
Communication skills are essential for data scientists, as they must be able to explain complex technical
concepts to non-technical stakeholders. Data scientists must be able to communicate the results of their
analysis and the implications of their findings in a clear and concise manner.
Business Acumen
Machine learning is used to solve business problems, and therefore, understanding the business context
and the ability to apply machine learning to business problems is essential.
Overall, machine learning requires a broad range of skills, including technical, mathematical, and soft
skills. To be successful in this field, data scientists must be able to combine these skills to develop effective
machine learning models that solve complex business problems.
Implementation
Implementing machine learning involves several steps, which include -
Model Evaluation
After training the model, it needs to be evaluated to determine its performance. The performance of the
model can be evaluated using metrics such as accuracy, precision, recall, and F1 score. Cross-validation
techniques can also be used to test the model's performance.
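Cross-validation can be sketched as follows: split the data into k folds, hold out each fold once for validation while training on the rest, and average the scores. The dataset, the nearest-centroid classifier, and the choice of k = 3 are assumptions made to keep the example small.

```python
# Made-up one-feature dataset, alternating between the two classes.
data = []
for i in range(6):
    data.append((1.0 + i, 0))
    data.append((11.0 + i, 1))

def fit_centroids(rows):
    """Train: compute the mean feature value of each class."""
    centroids = {}
    for label in (0, 1):
        values = [x for x, y in rows if y == label]
        centroids[label] = sum(values) / len(values)
    return centroids

def predict(centroids, x):
    """Predict the class whose centroid is closest to x."""
    return min(centroids, key=lambda label: abs(centroids[label] - x))

def k_fold_scores(rows, k=3):
    scores = []
    for fold in range(k):
        # Every k-th row (offset by `fold`) is held out for validation.
        validation = [row for i, row in enumerate(rows) if i % k == fold]
        training = [row for i, row in enumerate(rows) if i % k != fold]
        centroids = fit_centroids(training)
        correct = sum(1 for x, y in validation if predict(centroids, x) == y)
        scores.append(correct / len(validation))
    return scores

scores = k_fold_scores(data, k=3)
mean_score = sum(scores) / len(scores)
print(scores, mean_score)
```

Each row is used for validation exactly once, so the averaged score depends less on one lucky or unlucky split than a single hold-out evaluation does.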
Model Tuning
The performance of the model can be improved by tuning its hyperparameters. Hyperparameters are
settings that are not learned from the data, but rather set by the user. The optimal values for these
hyperparameters can be found using techniques such as grid search and random search.
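Grid search can be sketched as a loop over every combination of candidate hyperparameter values, keeping the combination with the best validation score. The example below tunes k for a hand-written k-nearest-neighbours classifier; the data (which includes two deliberately noisy labels), the candidate grid, and the use of a single validation set rather than cross-validation are assumptions.

```python
from itertools import product

# Made-up training data; the labels at x=3 and x=8 are deliberately noisy.
train = [(1, 0), (2, 0), (3, 1), (4, 0), (5, 0), (6, 1), (7, 1), (8, 0), (9, 1), (10, 1)]
validation = [(2.6, 0), (3.2, 0), (4.4, 0), (6.4, 1), (7.8, 1), (9.2, 1)]

def knn_predict(x, k):
    """Majority vote among the k training points closest to x."""
    neighbours = sorted(train, key=lambda row: abs(row[0] - x))[:k]
    votes = sum(label for _, label in neighbours)
    return 1 if 2 * votes > k else 0

def accuracy(k):
    correct = sum(1 for x, y in validation if knn_predict(x, k) == y)
    return correct / len(validation)

# Hyperparameter grid: every combination of the listed values is tried.
grid = {"k": [1, 3, 5]}

best_params, best_score = None, -1.0
for values in product(*grid.values()):
    params = dict(zip(grid.keys(), values))
    score = accuracy(**params)
    print(params, round(score, 3))
    if score > best_score:
        best_params, best_score = params, score

print("best:", best_params, best_score)
```

Here k = 1 follows the noisy labels too closely and k = 5 smooths too much, so the search settles on k = 3. Random search would sample combinations from the grid instead of trying all of them.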
Each of the above steps requires different tools and techniques, and successful implementation requires a
combination of technical and business skills.
The language of your choice - this essentially is your proficiency in one of the languages supported in ML
development.
The IDE that you use - This would depend on your familiarity with the existing IDEs and your comfort
level.
Development platform - There are several platforms available for development and deployment. Most of
these are free-to-use. In some cases, you may have to incur a license fee beyond a certain amount of usage.
Here is a brief list of choice of languages, IDEs and platforms for your ready reference.
Language Choice
Here is a list of languages that support ML development -
1. Python
2. R
3. Matlab
4. Octave
5. Julia
6. C++
7. C
This list is not exhaustive; however, it covers many popular languages used in machine
learning development. Depending upon your comfort level, select a language, develop
your models, and test them.
IDEs
Here is a list of IDEs which support ML development -
1. R Studio
2. Pycharm
3. iPython/Jupyter Notebook
4. Julia
5. Spyder
6. Anaconda
7. Rodeo
8. Google-Colab
The above list is not exhaustive. Each option has its own merits and demerits. The reader is
encouraged to try out these different IDEs before narrowing down to a single one.
Challenges & Common Issues
Machine learning is a rapidly growing field with many promising applications. However, there are also
several challenges and issues that must be addressed to fully realize the potential of machine learning.
Some of the major challenges and common issues faced in machine learning include -
Overfitting
Overfitting occurs when a model is trained on a limited set of data and becomes too complex, leading to
poor performance when tested on new data. This can be addressed by using techniques such as
cross-validation, regularization, and early stopping.
Underfitting
Underfitting occurs when a model is too simple and fails to capture the patterns in the data. This can be
addressed by using more complex models or by adding more features to the data.
Model Interpretability
Machine learning models can be very complex, making it difficult to understand how they arrive at their
predictions. This can be a challenge when explaining the model to stakeholders or regulatory bodies.
Techniques such as feature importance and partial dependence plots can help improve model interpretability.
Generalization
Machine learning models are trained on a specific dataset, and they may not perform well on new data
that is outside the training set. This can be addressed by using techniques such as cross-validation and
regularization.
Scalability
Machine learning models can be computationally expensive and may not scale well to large datasets.
Techniques such as distributed computing, parallel processing, and sampling can help address scalability
issues.
Ethical Considerations
Machine learning models can raise ethical concerns when they are used to make decisions that affect
people's lives. These concerns include bias, privacy, and transparency. Techniques such as fairness metrics
and explainable AI can help address ethical considerations.
Addressing these issues requires a combination of technical expertise and business knowledge, as well as
an understanding of ethical considerations. By addressing these issues, machine learning can be used to
develop accurate and reliable models that can provide valuable insights and drive business value.
Limitations
Machine learning is a powerful technology that has transformed the way we approach data analysis, but
like any technology, it has its limitations. Here are some of the key limitations of machine learning -
Limited Applicability
Machine learning models are designed to find patterns in data, which means they may not be suitable for
all types of data or problems.
Ethical Considerations
Machine learning models can sometimes perpetuate biases or discriminate against certain groups, raising
ethical concerns.
Dependence on Experts
Developing and deploying machine learning models requires significant expertise in data science,
statistics, and programming, making it challenging for organizations without access to these skills.
Limited Interpretability
Some machine learning models, such as deep neural networks, can be difficult to interpret. This means
that it may be challenging to understand how the model arrived at its predictions.
Real-Life Examples
Machine learning has been transforming various industries by automating processes, predicting
outcomes, and discovering patterns in large data sets. Here is a real-life example of
machine learning -
Fraud Detection in Banking and Finance
Machine learning algorithms are widely used in the financial industry to detect fraudulent activities.
These algorithms can analyze transaction data and identify patterns that indicate fraud. For example,
credit card companies use machine learning to identify transactions that are likely to be fraudulent and
notify customers in real-time. Banks also use machine learning to detect money laundering, identify
unusual behavior in accounts, and analyze credit risk.
In addition to fraud detection, machine learning is being used in many other applications, such as energy
management, social media analysis, and predictive maintenance. Machine learning is a powerful tool that
has the potential to revolutionize many industries and improve the lives of people around the world.
Data Structure
Data structure plays a critical role in machine learning as it facilitates the organization, manipulation, and
analysis of data. Data is the foundation of machine learning models, and the data structure used can
significantly impact the model's performance and accuracy.
Here we will discuss some commonly used data structures and how they are used in Machine Learning.
Arrays
Arrays are collections of elements of the same data type that can be accessed using an index. They are
commonly used in machine learning for storing tabular data, such as the contents of CSV files. Arrays are
easy to use and offer fast indexing, but their size is fixed, which can be a limitation when a dataset grows.
Lists
Lists are collections that can hold heterogeneous data types and can be accessed by index or by iteration.
They are commonly used in machine learning for storing complex data structures, such as nested lists,
dictionaries, and tuples. Lists offer flexibility and can handle varying data sizes, but they are generally
slower than arrays for numerical work because their elements are not stored as one block of same-typed values.
Dictionaries
Dictionaries are a collection of key-value pairs that can be accessed using the keys. They are commonly
used in machine learning for storing metadata or labels associated with data. Dictionaries offer fast access
to data and are useful for creating lookup tables, but they can be memory-intensive when dealing with
large datasets.
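As a minimal sketch of the lookup-table use described above, a dictionary can map class labels to integer ids and back. The labels (`cat`, `dog`, `bird`) and the variable names are hypothetical and chosen only for illustration.

```python
# Hypothetical class labels, chosen only for illustration.
labels = ["cat", "dog", "bird", "dog", "cat"]

# Build a label -> id lookup table in order of first appearance.
label_to_id = {}
for label in labels:
    if label not in label_to_id:
        label_to_id[label] = len(label_to_id)

# Invert the table so predicted ids can be translated back to labels.
id_to_label = {idx: label for label, idx in label_to_id.items()}

encoded = [label_to_id[label] for label in labels]
print(label_to_id)  # {'cat': 0, 'dog': 1, 'bird': 2}
print(encoded)      # [0, 1, 2, 1, 0]
```

Each lookup is a single key access, which is why dictionaries suit label and metadata tables.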
Linked Lists
Linked lists are collections of nodes, each containing a data element and a reference to the next node in the
list. They are commonly used in machine learning for storing and manipulating sequential data, such as
time-series data. Linked lists offer efficient insertion and deletion operations, but they are slower than
arrays and lists when it comes to accessing data.
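The trade-off above can be seen in a small sketch of a singly linked list. The `Node` class, the `to_list` helper, and the sample readings are hypothetical and written only for this illustration.

```python
class Node:
    """One element of a singly linked list."""

    def __init__(self, value, next_node=None):
        self.value = value
        self.next = next_node


def to_list(head):
    """Walk the chain from head and collect the values in order."""
    values = []
    node = head
    while node is not None:
        values.append(node.value)
        node = node.next
    return values


# Hypothetical time-series readings, linked in order.
head = Node(10.0, Node(10.5, Node(11.2)))

# Insert a new reading after the first node: only references change,
# no existing element has to be moved.
head.next = Node(10.2, head.next)

print(to_list(head))  # [10.0, 10.2, 10.5, 11.2]
```

Insertion touches only the neighbouring references, while reaching the n-th element requires walking the chain from the head.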
Trees
Trees are hierarchical data structures that are commonly used in machine learning for decision-making
algorithms, such as decision trees and random forests. Trees offer efficient searching and sorting
algorithms, but they can be complex to implement and can suffer from overfitting.
Graphs
Graphs are collections of nodes and edges that are commonly used in machine learning for representing
complex relationships between data points. Graphs offer powerful algorithms for clustering,
classification, and prediction, but they can be complex to implement and can suffer from scalability issues.
In addition to the above-mentioned data structures, many machine learning libraries and frameworks
provide specialized data structures for specific use cases, such as matrices and tensors for deep learning. It
is important to choose the right data structure for the task at hand, considering factors such as data size,
processing speed, and memory usage.
Pre-processing Data
Before training a machine learning model, it is necessary to pre-process the data to clean, transform, and
normalize it. Data structures such as lists and arrays can be used to store and manipulate the data during
pre-processing. For example, a list can be used to filter out missing values, while an array can be used to
normalize the data.
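A minimal sketch of these two pre-processing steps, assuming a hypothetical list of raw feature values in which `None` marks a missing entry. Min-max scaling is used here as one possible form of normalization.

```python
# Hypothetical raw feature values; None marks a missing entry.
raw = [4.0, None, 10.0, 7.0, None, 1.0]

# Filter out missing values with a list comprehension.
clean = [x for x in raw if x is not None]

# Min-max normalization rescales the remaining values to the range [0, 1].
lowest, highest = min(clean), max(clean)
normalized = [(x - lowest) / (highest - lowest) for x in clean]

print(clean)  # [4.0, 10.0, 7.0, 1.0]
print(normalized)
```

The largest value maps to 1.0 and the smallest to 0.0; everything else falls in between.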
Building Graphs
Graphs are used in machine learning to represent complex relationships between data points. Data
structures such as adjacency matrices and adjacency lists (often implemented with linked lists) are used
to create and manipulate graphs. Graphs are used for clustering, classification, and prediction tasks.
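The two graph representations can be sketched side by side. The edge list below is a hypothetical undirected graph with four nodes, and the variable names are illustrative.

```python
# Hypothetical undirected graph with four nodes, given as an edge list.
edges = [(0, 1), (0, 2), (1, 2), (2, 3)]
num_nodes = 4

# Adjacency matrix: entry [i][j] is 1 when nodes i and j share an edge.
matrix = [[0] * num_nodes for _ in range(num_nodes)]
# Adjacency list: each node maps to the list of its neighbours.
neighbours = {node: [] for node in range(num_nodes)}

for a, b in edges:
    matrix[a][b] = matrix[b][a] = 1
    neighbours[a].append(b)
    neighbours[b].append(a)

print(matrix[2])      # [1, 1, 0, 1]
print(neighbours[2])  # [0, 1, 3]
```

The matrix gives constant-time edge checks but stores an entry for every pair of nodes, while the list stores only the edges that exist.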
Mathematics
Machine learning is an interdisciplinary field that involves computer science, statistics, and mathematics.
In particular, mathematics plays a critical role in developing and understanding machine learning
algorithms. In this article, we will discuss the mathematical concepts that are essential for machine learning,
including linear algebra, calculus, probability, and statistics.
Linear Algebra
Linear algebra is the branch of mathematics that deals with linear equations and their representation in
vector spaces. In machine learning, linear algebra is used to represent and manipulate data. In particular,
vectors and matrices are used to represent and manipulate data points, features, and weights in machine
learning models.
A vector is an ordered list of numbers, while a matrix is a rectangular array of numbers. For example, a
vector can represent a single data point, and a matrix can represent a dataset. Linear algebra operations,
such as matrix multiplication and inversion, can be used to transform and analyze data.
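As a small illustration of these ideas, the sketch below stores a dataset as a matrix (a list of rows) and multiplies it by a weight vector. The `matvec` helper and all numbers are hypothetical; in practice a library such as NumPy would usually be used instead.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of numbers)."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


# Hypothetical dataset: 3 data points (rows) with 2 features (columns).
X = [[1.0, 2.0],
     [3.0, 4.0],
     [5.0, 6.0]]
# Hypothetical weight vector of a linear model.
w = [0.5, -1.0]

predictions = matvec(X, w)
print(predictions)  # [-1.5, -2.5, -3.5]
```

Each prediction is the dot product of one data point with the weight vector.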
Calculus
Calculus is the branch of mathematics that deals with rates of change and accumulation. In machine
learning, calculus is used to optimize models by finding the minimum or maximum of a function. In
particular, gradient descent, a widely used optimization algorithm, is based on calculus.
Gradient descent is an iterative optimization algorithm that updates the weights of a model based on the
gradient of the loss function. The gradient is the vector of partial derivatives of the loss function with
respect to each weight. By iteratively updating the weights in the direction of the negative gradient,
gradient descent tries to minimize the loss function.
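The update rule can be sketched on a one-weight toy problem. The loss function (w - 3)^2, the learning rate, and the number of steps are assumptions chosen so the result is easy to check by hand.

```python
def loss(w):
    """Toy loss with its minimum at w = 3."""
    return (w - 3.0) ** 2


def gradient(w):
    """Derivative of the toy loss with respect to w."""
    return 2.0 * (w - 3.0)


w = 0.0              # initial weight
learning_rate = 0.1  # step size (assumed value)

for step in range(100):
    # Move the weight a small step against the gradient.
    w -= learning_rate * gradient(w)

print(round(w, 4))  # 3.0
```

With more weights, the same update is applied to each one using its own partial derivative.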
Probability
Probability is the branch of mathematics that deals with uncertainty and randomness. In machine
learning, probability is used to model and analyze data that are uncertain or variable. In particular, probability
distributions, such as Gaussian and Poisson distributions, are used to model the probability of data points
or events.
Bayesian inference, a probabilistic modeling technique, is also widely used in machine learning. Bayesian
inference is based on Bayes' theorem, which states that the probability of a hypothesis given the data is
proportional to the probability of the data given the hypothesis multiplied by the prior probability of the
hypothesis. By updating the prior probability based on the observed data, Bayesian inference can make
probabilistic predictions or classifications.
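Bayes' theorem can be worked through numerically. The sketch below assumes a hypothetical spam filter; the prior and the two likelihoods are made-up numbers used only to show the calculation.

```python
def posterior(prior, likelihood, likelihood_given_not):
    """Return P(H | D) from P(H), P(D | H) and P(D | not H)."""
    # Total probability of the data across both hypotheses.
    evidence = likelihood * prior + likelihood_given_not * (1.0 - prior)
    return likelihood * prior / evidence


# Hypothetical numbers: 20% of mail is spam; the word "offer" appears
# in 60% of spam messages and in 5% of non-spam messages.
p_spam_given_word = posterior(prior=0.2, likelihood=0.6,
                              likelihood_given_not=0.05)
print(round(p_spam_given_word, 3))  # 0.75
```

Observing the word raises the probability of spam from the prior of 0.2 to a posterior of 0.75.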
Statistics
Statistics is the branch of mathematics that deals with the collection, analysis, interpretation, and
presentation of data. In machine learning, statistics is used to evaluate and compare models, estimate model
parameters, and test hypotheses.
For example, cross-validation is a statistical technique that is used to evaluate the performance of a model
on new, unseen data. In cross-validation, the dataset is split into multiple subsets (folds). The model is
trained on all folds but one and evaluated on the held-out fold, and this is repeated so that each fold is
held out once. This allows us to estimate the performance of the model on new data and
compare different models.
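The splitting step of k-fold cross-validation can be sketched without any library. The `k_fold_indices` helper is hypothetical; scikit-learn provides ready-made utilities for the same purpose.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for each of k folds."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


for train, test in k_fold_indices(n_samples=6, k=3):
    print("train:", train, "test:", test)
# train: [2, 3, 4, 5] test: [0, 1]
# train: [0, 1, 4, 5] test: [2, 3]
# train: [0, 1, 2, 3] test: [4, 5]
```

Every sample is used for testing exactly once and for training in all other rounds.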
Artificial Intelligence
Artificial intelligence and machine learning are two buzzwords that are commonly used in the world
of technology. Although they are often used interchangeably, they are not the same thing. Artificial
intelligence (AI) and machine learning (ML) are related concepts, but they have different definitions,
applications, and implications. In this article, we will explore the differences between machine learning and
artificial intelligence and how they are related.
Examples of machine learning include image recognition, speech recognition, recommendation systems,
fraud detection, and natural language processing.
There are two types of AI: narrow or weak AI and general or strong AI. Narrow AI is designed to perform
specific tasks, such as speech recognition or image recognition, while general AI is designed to be able to
perform any intellectual task that a human can do. Currently, we only have narrow AI in use, but the goal
is to develop general AI that can be applied to a wide range of tasks.
AI is like a basket containing several branches, the important ones being Machine Learning (ML), Robotics,
Expert Systems, Fuzzy Logic, Neural Networks, Computer Vision, and Natural Language Processing (NLP).
[Figure: branches of AI, including NLP, Machine Learning, Expert Systems, and Computer Vision]
While we highlight the features of ML in the next section, here is a brief overview of the other important
branches of AI:
1. Robotics - Robots are primarily designed to perform repetitive and tedious tasks. Robotics is
an important branch of AI that deals with designing, developing and controlling the
application of robots.
2. Computer Vision - It is an exciting field of AI that helps computers, robots, and other digital
devices to process and understand digital images and videos, and extract vital information.
With the power of AI, Computer Vision develops algorithms that can extract, analyze and
comprehend useful information from digital images.
3. Expert Systems - Expert systems are applications specifically designed to solve complex
problems in a specific domain, with humanlike intelligence, precision, and expertise. Just like
human experts, Expert Systems excel in a specific domain in which they are trained.
4. Fuzzy Logic - We know computers take precise digital inputs like True (Yes) or False (No), but
Fuzzy Logic is a method of reasoning that helps machines to reason like human beings before
taking a decision. With Fuzzy Logic, machines can analyze all intermediate possibilities
between a YES or NO, for example, "Possibly Yes", "Maybe No", etc.
5. Neural Networks - Inspired by the natural neural networks of the human brain, Artificial
Neural Networks (ANN) can be considered a group of highly interconnected processing
elements (nodes) that process information by their dynamic state response to
external inputs. ANNs use training data to improve their efficiency and accuracy.
6. Natural Language Processing (NLP) - NLP is a field of AI that empowers intelligent systems
to communicate with humans using a natural language like English. With the power of NLP,
one can easily interact with a robot and instruct it in plain English to perform a task. NLP can
also process text data and comprehend its full meaning. It is heavily used these days in virtual
chatbots and sentiment analysis.
Examples of AI include virtual assistants, autonomous vehicles, facial recognition, natural language
processing, and decision-making systems.
Firstly, machine learning is a subset of artificial intelligence, meaning that machine learning is a part of
the larger field of AI. Machine learning is a technique used to implement artificial intelligence.
Secondly, while machine learning focuses on developing algorithms that can learn from data, artificial
intelligence focuses on developing intelligent machines that can perform tasks that normally require human
intelligence. In other words, machine learning is more focused on the process of learning from data, while
AI is more focused on the end goal of creating machines that can perform intelligent tasks.
Thirdly, machine learning algorithms are designed to learn from data and improve their accuracy over
time, while artificial intelligence systems are designed to learn and adapt to new situations and
environments. Machine learning algorithms require a lot of data to be trained effectively, while AI systems can
adapt and learn from new data in real-time.
Finally, machine learning is more limited in its capabilities compared to AI. Machine learning algorithms
can only learn from the data they are trained on, while AI systems can learn and adapt to new situations
and environments. Machine learning is great for solving specific problems that can be solved through
pattern recognition, while AI is better suited for complex, real-world problems that require reasoning and
decision-making.
The following table highlights the important differences between Machine Learning and Artificial
Intelligence -
Basis | Artificial Intelligence (AI) | Machine Learning (ML)
Concept | AI revolves around making smart and intelligent devices. | ML revolves around making a machine learn/decide and improve its results.
Goal | The goal of AI is to simulate human intelligence to solve complex problems. | The goal of ML is to learn from the data provided and make improvements in the machine's performance.
Neural Networks
Machine learning and neural networks are two important technologies in the field of artificial intelligence
(AI). While they are often used together, they are not the same thing. In this article, we will explore the
differences between machine learning and neural networks and how they are related.
We covered machine learning in the last section, so let's see what neural networks are.
What are Neural Networks?
Neural networks are a type of machine learning algorithm that is inspired by the structure of the human
brain. They are designed to simulate the way the brain works by using layers of interconnected nodes, or
artificial neurons. Each neuron takes in input from the neurons in the previous layer and uses that input to
produce an output. This process is repeated for each layer until a final output is produced.
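The layer-by-layer computation described above can be sketched as a forward pass through a tiny network. The weights, biases, layer sizes, and the choice of a sigmoid activation are all assumptions picked by hand for illustration; a real network would learn its weights from data.

```python
import math


def sigmoid(x):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def layer(inputs, weights, biases):
    """Each neuron sums its weighted inputs, adds a bias, applies sigmoid."""
    return [sigmoid(sum(w * x for w, x in zip(neuron_w, inputs)) + b)
            for neuron_w, b in zip(weights, biases)]


# Hypothetical hand-picked parameters: 2 inputs -> 2 hidden neurons -> 1 output.
hidden_w, hidden_b = [[0.5, -0.5], [1.0, 1.0]], [0.0, -1.0]
output_w, output_b = [[1.0, -1.0]], [0.0]

x = [1.0, 1.0]
hidden = layer(x, hidden_w, hidden_b)         # output of the hidden layer
output = layer(hidden, output_w, output_b)    # final output of the network
print(hidden, output)
```

The output of each layer becomes the input of the next, which is the repetition the text refers to.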
Neural networks can be used for a wide range of tasks, including image recognition, speech recognition,
natural language processing, and prediction. They are particularly well-suited to tasks that involve
processing complex data or recognizing patterns in data.
1. Firstly, machine learning is a broad category that encompasses many different types of
algorithms, including neural networks. Neural networks are a specific type of machine learning
algorithm that is designed to simulate the way the brain works.
2. Secondly, while machine learning algorithms can be used for a wide range of tasks, neural
networks are particularly well-suited to tasks that involve processing complex data or
recognizing patterns in data. Neural networks can recognize complex patterns and relationships in
data that other machine learning algorithms may not be able to detect.
3. Thirdly, neural networks require a lot of data and processing power to train. Neural networks
typically require large datasets and powerful hardware, such as graphics processing units
(GPUs), to train effectively. Other machine learning algorithms, on the other hand, can often be
trained on smaller datasets and less powerful hardware.
4. Finally, neural networks can provide highly accurate predictions and decisions, but they can
be more difficult to understand and interpret than other machine learning algorithms. The
way that neural networks make decisions is not always transparent, which can make it
difficult to understand how they arrived at their conclusions.
Deep Learning
In the world of artificial intelligence, two terms that are often used interchangeably are machine learning
and deep learning. While both of these technologies are used to create intelligent systems, they are not the
same thing. In this article, we will explore the differences between machine learning and deep learning
and how they are related.
We covered machine learning in the last section, so let's see what deep learning is.
Deep learning is particularly well-suited to tasks that involve processing complex data, such as image and
speech recognition, natural language processing, and self-driving cars. Deep learning algorithms are able
to process vast amounts of data and can learn to recognize complex patterns and relationships in that data.
Examples of deep learning include facial recognition, voice recognition, and self-driving cars.
• Firstly, machine learning is a broad category that encompasses many different types of
algorithms, including deep learning. Deep learning is a specific type of machine learning
algorithm that uses neural networks to process complex data.
• Secondly, while machine learning algorithms are designed to learn from data and improve
their accuracy over time, deep learning algorithms are designed to process complex data and
recognize patterns and relationships in that data. Deep learning algorithms are able to
recognize complex patterns and relationships that other machine learning algorithms may not be
able to detect.
• Thirdly, deep learning algorithms require a lot of data and processing power to train. Deep
learning algorithms typically require large datasets and powerful hardware, such as graphics
processing units (GPUs), to train effectively. Other machine learning algorithms, on the other hand,
can often be trained on smaller datasets and less powerful hardware.
• Finally, deep learning algorithms can provide highly accurate predictions and decisions, but
they can be more difficult to understand and interpret than other machine learning
algorithms. Deep learning algorithms can process vast amounts of data and recognize complex
patterns and relationships in that data, but it can be difficult to understand how the
algorithm arrived at its conclusion.
Getting Datasets
Machine learning models are only as good as the data they are trained on. Therefore, obtaining
good-quality and relevant datasets is a critical step in the machine learning process. Let's see some different
sources of datasets for machine learning and how to obtain them.
Public Datasets
There are many publicly available datasets that you can use for machine learning. Some of the popular
sources of public datasets include Kaggle, UCI Machine Learning Repository, Google Dataset Search, and
AWS Public Datasets. These datasets are often used for research and are open to the public.
Data Scraping
Data scraping involves automatically extracting data from websites or other sources. It can be a useful way
to obtain data that is not available as a pre-packaged dataset. However, it is important to ensure that the
data is being scraped ethically and legally, and that the source is reliable and accurate.
Data Purchase
In some cases, it may be necessary to purchase a dataset for machine learning. Many companies sell
pre-packaged datasets that are tailored to specific industries or use cases. Before purchasing a dataset, it is
important to evaluate its quality and relevance to your machine learning project.
Data Collection
Data collection involves manually collecting data from various sources. This can be time-consuming and
requires careful planning to ensure that the data is accurate and relevant to your machine learning project.
It may involve surveys, interviews, or other forms of data collection.
Categorical Data
Categorical data is often represented using discrete values, such as integers or strings, and is frequently
encoded as one-hot vectors before being used as input to machine learning models. One-hot encoding
involves creating a binary vector for each category, where the vector has a 1 in the position corresponding to
the category and 0s in all other positions.
In the subsequent sections of this chapter, we will discuss the different techniques for handling categorical
data in machine learning along with their implementations in Python.
One-Hot Encoding
One-hot encoding is a popular technique for handling categorical data in machine learning. It involves
creating a binary vector for each category, where each element of the vector represents the presence or
absence of the category. For example, if we have a categorical variable for color with values red, blue, and
green, one-hot encoding would create three binary vectors: [1, 0, 0], [0, 1, 0], and [0, 0, 1] respectively.
Example
Below is an example of how to perform one-hot encoding in Python using the Pandas library -
import pandas as pd
Output
This will create a one-hot encoded dataframe with three binary variables ("color_blue", "color_green", and
"color_red") that take the value 1 if the corresponding color is present and 0 if it is not. This encoded data
can then be used for machine learning tasks such as classification and regression.
One-Hot Encoding technique works well for small and finite categorical variables but can be problematic
for large categorical variables as it can lead to a high number of input features.
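A minimal sketch of one-hot encoding with pandas, assuming a small hypothetical `color` column. It uses `pd.get_dummies`, which is one common way to produce the `color_blue`, `color_green`, and `color_red` columns described above.

```python
import pandas as pd

# Hypothetical sample data with one categorical column.
df = pd.DataFrame({"color": ["red", "blue", "green", "red"]})

# One column is created per category; dtype=int yields 0/1 instead of booleans.
encoded = pd.get_dummies(df, columns=["color"], dtype=int)
print(encoded)
```

Each row has exactly one 1, in the column that matches its original color.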
Label Encoding
Label Encoding is another technique for handling categorical data in machine learning. It involves
assigning a unique numerical value to each category in a categorical variable, with the order of the values based
on the order of the categories.
For example, suppose we have a categorical variable "Size" with three categories: "small," "medium," and
"large." Using label encoding, we would assign the values 0, 1, and 2 to these categories, respectively.
Example
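Below is an example of how to perform label encoding in Python using the scikit-learn library -
from sklearn.preprocessing import LabelEncoder
data = ['small', 'medium', 'large', 'small', 'large']  # sample data (assumed for illustration)
label_encoder = LabelEncoder()
encoded_data = label_encoder.fit_transform(data)
print(encoded_data)
Output
[2 1 0 2 0]
Note that scikit-learn's LabelEncoder numbers the categories in sorted (alphabetical) order, so here "large"
becomes 0, "medium" becomes 1, and "small" becomes 2.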
Label encoding can be useful when there is a natural ordering between the categories, such as in the case
of ordinal categorical variables. However, it should be used with caution for nominal categorical variables
because the numerical values may imply an order that does not actually exist. In these cases, one-hot
encoding is a safer option.
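When the categories do have a natural order, one way to make sure the codes follow it is to spell the mapping out explicitly. The sketch below assumes the "Size" example; the `size_order` dictionary is a hypothetical name.

```python
# Explicit order for an ordinal variable (mapping assumed for illustration).
size_order = {"small": 0, "medium": 1, "large": 2}

sizes = ["small", "medium", "large", "small", "large"]
encoded = [size_order[s] for s in sizes]
print(encoded)  # [0, 1, 2, 0, 2]
```

With an explicit mapping, a larger code always means a larger size, which is not guaranteed when codes are assigned alphabetically.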
Frequency Encoding
Frequency Encoding is another technique for handling categorical data in machine learning. It involves
replacing each category in a categorical variable with its frequency (or count) in the dataset. The idea
behind frequency encoding is that categories that appear more frequently may be more important or
informative for the machine learning algorithm.
Example
Below is an example of how to perform frequency encoding in Python -
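import pandas as pd
# create a sample dataset with a categorical variable
data = {'color': ['red', 'green', 'blue', 'red', 'green']}
df = pd.DataFrame(data)
# calculate the relative frequency of each category
freq = df['color'].value_counts(normalize=True)
# replace each category with its frequency, then drop the original column
df['color_freq'] = df['color'].map(freq)
df = df.drop('color', axis=1)
print(df)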
This will create an encoded dataframe with one variable ("color_freq") that represents the frequency of
each category in the original categorical variable. For example, if the original variable had two occur
rences of "red" and three occurrences of "green," then the corresponding frequencies would be 0.4 and 0.6,
respectively.
Output
color_freq
0 0.4
1 0.4
2 0.2
3 0.4
4 0.4
Frequency encoding can be a useful alternative to one-hot encoding or label encoding, especially when
dealing with high-cardinality categorical variables (i.e., variables with a large number of categories).
However, it may not always be effective, and its performance can depend on the particular dataset and machine
learning algorithm being used.
Target Encoding
Target Encoding is another technique for handling categorical data in machine learning. It involves
replacing each category in a categorical variable with the mean (or other aggregation) of the target variable (i.e.,
the variable you want to predict) for that category. The idea behind target encoding is that it can capture
the relationship between the categorical variable and the target variable, and therefore improve the
predictive performance of the machine learning model.
Example
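Below is an example of how to perform target encoding in Python with the scikit-learn library by using a
combination of a label encoder and a mean encoder -
import pandas as pd
from sklearn.preprocessing import LabelEncoder
# create a sample dataset with a categorical variable and a target variable
data = {'color': ['red', 'green', 'blue', 'red', 'green'], 'target': [1, 0, 1, 0, 1]}
df = pd.DataFrame(data)
# fit a label encoder to the 'color' column
label_encoder = LabelEncoder()
label_encoder.fit(df['color'])
# transform the 'color' column into integer labels
df['color_encoded'] = label_encoder.transform(df['color'])
# compute the mean of the target for each label and map it back
mean_encoder = df.groupby('color_encoded')['target'].mean().to_dict()
df['color_encoded'] = df['color_encoded'].map(mean_encoder)
print(df)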
Next, we transform the categorical variable 'color' using the label encoder by calling the transform method
on the label encoder object and assigning the resulting encoded values to a new column 'color_encoded' in
df.
Finally, we build a mean encoder by grouping df by the 'color_encoded' column and calculating the
mean of the 'target' column for each group. We then convert the result to a dictionary and map it onto
the 'color_encoded' column of df, replacing each label with the mean target value of its category.
Output
color target color_encoded
0 red 1 0.5
1 green 0 0.5
2 blue 1 1.0
3 red 0 0.5
4 green 1 0.5
Target encoding can be a powerful technique for improving the predictive performance of machine
learning models, especially for datasets with high-cardinality categorical variables. However, it is important to
avoid overfitting by using cross-validation and regularization techniques.
Binary Encoding
Binary encoding is another technique used for encoding categorical variables in machine learning. In
binary encoding, each category is first assigned an integer, typically based on its position in a list
of all categories, and that integer is then written as a binary code. Each binary digit (0 or 1) is stored
in its own column, so a variable with n categories needs only about log2(n) columns.
Example
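Here's an example Python implementation of binary encoding using the category_encoders library -
import pandas as pd
import category_encoders as ce
# sample data (assumed for illustration)
df = pd.DataFrame({'color': ['red', 'green', 'blue', 'red', 'green']})
binary_encoder = ce.BinaryEncoder().fit(df['color'])
encoded_data = binary_encoder.transform(df['color'])
# merge the encoded variable with the original dataframe
df = pd.concat([df, encoded_data], axis=1)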
In this example, we first create a Pandas DataFrame df with a categorical variable 'color'. We then create a
BinaryEncoder object from the category_encoders library and fit it to the 'color' column of df.
Next, we transform the categorical variable 'color' using the binary encoder by calling the transform
method on the binary encoder object and assigning the resulting encoded values to a new DataFrame
encoded_data.
Finally, we merge the encoded variable with the original DataFrame df using the concat method along the
column axis (axis=1). The resulting DataFrame should have the original 'color' column along with the
encoded binary columns.
Output
When you run the code, it will produce the following output -
The binary encoding works best for categorical variables with a moderate number of categories, as it can
quickly become inefficient for variables with a large number of categories.
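The idea behind binary encoding can also be sketched in plain Python. The `binary_encode` helper below is hypothetical and numbers the categories by their position in a sorted list; it illustrates the principle and is not necessarily the exact scheme used by the category_encoders library.

```python
def binary_encode(values):
    """Map each category to the binary digits of its 1-based sorted position."""
    categories = sorted(set(values))
    # Number of binary digits needed to write the largest position.
    width = max(1, len(categories).bit_length())
    index = {cat: i + 1 for i, cat in enumerate(categories)}
    return [[int(bit) for bit in format(index[v], f"0{width}b")]
            for v in values]


colors = ["red", "green", "blue", "red"]
print(binary_encode(colors))  # [[1, 1], [1, 0], [0, 1], [1, 1]]
```

Three categories fit into two binary columns here, whereas one-hot encoding would need three.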
Data Loading
If you want to start an ML project, what is the first and most important thing you need?
It is the data, which must be loaded before any ML project can begin.
In machine learning, data loading refers to the process of importing or reading data from external sources
and converting it into a format that can be used by the machine learning algorithm. The data is then
preprocessed to remove any inconsistencies, missing values, or outliers. Once the data is preprocessed, it is
split into training and testing sets, which are then used for model training and evaluation.
The data can come from various sources such as CSV files, databases, web APIs, cloud storage, etc. The most
common file format for machine learning projects is CSV (Comma Separated Values).
In Python, we can load CSV data into ML projects in different ways, but before loading CSV data we must
take some considerations into account.
In this chapter, let's understand the main parts of a CSV file, how they might affect the loading and analysis
of data, and some considerations we should keep in mind before loading CSV data into ML projects.
File Header
This is the first row of the CSV file, and it typically contains the names of the columns in the table. When
loading CSV data into an ML project, the file header (also known as column headers or variable names) can
play an important role in data analysis and model training. Here are some considerations to keep in mind
regarding the file header -
1. Consistency - The header row should be consistent across the entire CSV file. This means that
the number of columns and their names should be the same for each row. Inconsistencies can
cause issues with parsing and analysis.
2. Meaningful names - Column names should be meaningful and descriptive. This can help
with understanding the data and building more accurate models. Avoid using generic names
like "column1", "column2", etc.
3. Case sensitivity - Depending on the tool or library being used to load the CSV file, the column
names may be case sensitive. It's important to ensure that the case of the header row matches
the expected case sensitivity of the tool or library being used.
4. Special characters - Column names should not contain any special characters, such as spaces,
commas, or quotation marks. These characters can cause issues with parsing and analysis.
Instead, use underscores or camelCase to separate words.
5. Missing header - If the CSV file does not have a header row, it's important to specify the
column names manually or provide a separate file or documentation that includes the
column names.
6. Encoding - The encoding of the header row can affect its interpretation when loading the CSV
file. It's important to ensure that the encoding of the header row is compatible with the tool or
library being used to read the file.
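The header considerations above can be sketched with Python's built-in `csv` module. The file contents and column names below are hypothetical, and `io.StringIO` stands in for an open file.

```python
import csv
import io

# Hypothetical CSV content with a header row.
with_header = io.StringIO("sepal_length,species\n5.1,setosa\n6.2,virginica\n")
rows = list(csv.DictReader(with_header))
print(rows[0]["species"])  # setosa

# Without a header row, supply the column names explicitly.
no_header = io.StringIO("5.1,setosa\n6.2,virginica\n")
rows2 = list(csv.DictReader(no_header,
                            fieldnames=["sepal_length", "species"]))
print(rows2[1]["sepal_length"])  # 6.2
```

The `csv` module returns every field as a string, so numeric columns still need to be converted before training a model.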
Comments
These are optional lines that begin with a specified character, such as "#", and are ignored by most
programs that read CSV files. They can be used to provide additional information or context about the data
in the file.
Comments in a CSV file are not typically used to represent data that would be used in a machine learning
project. However, if comments are present in a CSV file, it's important to consider how they might affect
the loading and analysis of the data. Here are some considerations -
1. Comment markers - In a CSV file, comments can be indicated using a specific marker, such as
"#". It's important to know what marker is being used, so that the loading process can
ignore comments properly.
2. Placement - Comments should be placed in a separate line from the actual data. If a comment
is included in a line with actual data, it may cause issues with parsing and analysis.
3. Consistency - If comments are used in a CSV file, it's important to ensure that the comment
marker is used consistently throughout the entire file. Inconsistencies can cause issues with
parsing and analysis.
4. Handling comments - Depending on the tool or library being used to load the CSV file,
comments may be ignored by default or may require a specific parameter to be set. It's important
to understand how comments are handled by the tool or library being used.
5. Effect on analysis - If comments contain important information about the data, it may be
necessary to process them separately from the data itself. This can add complexity to the
loading and analysis process.
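The considerations above can be sketched with pandas, where comment handling is not on by default and has to be requested through the comment parameter of read_csv. The CSV text, names and ages below are hypothetical.

```python
import io
import pandas as pd

# Hypothetical CSV text with "#" comment lines (illustration only).
raw_csv = (
    "# hypothetical comment line describing the file\n"
    "name,age\n"
    "Asha,31\n"
    "# another comment between records\n"
    "Ravi,45\n"
)

# comment="#" makes pandas skip everything from "#" to the end of the line,
# so fully commented lines are dropped before the header is detected.
df = pd.read_csv(io.StringIO(raw_csv), comment="#")

print(df)
```

If the comment parameter is left out, the first comment line would be read as the header row, which is why the marker has to be known in advance.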
Delimiter
This is the character that separates the fields in each row. While the name suggests that a comma is used
as the delimiter, other characters such as tabs, semicolons, or pipes can also be used depending on the file.
Parsing a file with the wrong delimiter corrupts the loaded data and, in turn, any machine learning model
trained on it, so it is important to consider the following while loading data into an ML project -
1. Delimiter choice - The delimiter used in a CSV file should be carefully chosen based on the
data being used. For example, if the data contains commas within the values (e.g. "New York,
NY"), then using a comma as a delimiter may cause issues.
In this case, a different delimiter, such as a tab or semicolon, may be more appropriate.
2. Consistency - The delimiter used in the CSV file should be consistent throughout the entire
file. Mixing different delimiters or using whitespace inconsistently can lead to errors and
make it difficult to parse the data accurately.
3. Encoding - The delimiter can also be affected by the encoding of the CSV file. For example, if
the CSV file uses a non-ASCII delimiter and is encoded in UTF-8, it may not be correctly read by
some machine learning libraries or tools. It is important to ensure that the encoding and
delimiter are compatible with the machine learning tools being used.
4. Other considerations - In some cases, the delimiter may need to be customized based on the
machine learning tool being used. For example, some libraries may require a specific delimiter
or may not support certain delimiters. It is important to check the documentation of the
machine learning tool being used and customize the delimiter as needed.
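To make the delimiter choice concrete, here is a small sketch that reads semicolon-delimited text in which the values themselves contain commas. The city names and population figures are hypothetical sample values; sep is the pandas parameter that sets the delimiter.

```python
import io
import pandas as pd

# Hypothetical semicolon-delimited text; the values contain commas.
raw_csv = "city;population\nNew York, NY;8500000\nAustin, TX;960000\n"

# sep=";" tells pandas to split fields on semicolons instead of commas.
df = pd.read_csv(io.StringIO(raw_csv), sep=";")

print(df)
```

Because the delimiter is a semicolon, the comma inside "New York, NY" stays part of the value instead of starting a new field.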
Quotes
These are optional characters that can be used to enclose fields that contain the delimiter character or
newlines. For example, if a field contains a comma, enclosing the field in quotes ensures that the comma is
treated as part of the field and not as a delimiter. When loading CSV data into an ML project, there are
several considerations to keep in mind regarding the use of quotes -
1. Quote character - The quote character used in a CSV file should be consistent throughout the
file. The most commonly used quote character is the double quote (") but some files may use
single quotes or other characters. It's important to make sure that the quote character used is
consistent with the tool or library being used to read the CSV file.
2. Quoted values - In some cases, values in a CSV file may be enclosed in quotes to differentiate
them from other values. For example, if a field contains a comma, it may be enclosed in quotes
to prevent it from being interpreted as a new field. It's important to make sure that quoted
values are properly handled when loading the data into an ML project.
3. Escaping quotes - If a field contains the quote character used to enclose values, it must be
escaped. This is typically done by doubling the quote character. For example, if the quote
character is the double quote (") and a field contains the value John "the Hammer" Smith, it would
be enclosed in quotes and the internal quotes would be escaped like this: "John ""the Hammer""
Smith".
4. Use of quotes - The use of quotes in CSV files can vary depending on the tool or library being
used to generate the file. Some tools may use quotes around every field, while others may only
use quotes around fields that contain special characters. It's important to make sure that the
quote usage is consistent with the tool or library being used to read the file.
5. Encoding - The use of quotes can also be affected by the encoding of the CSV file. If the file is
encoded in a non-standard way, it may cause issues when loading the data into an ML project.
It's important to make sure that the encoding of the CSV file is compatible with the tool or
library being used to read the file.
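The quoting and escaping rules above can be checked with Python's built-in csv module. This is only a sketch: the row values are illustrative, and the text is written to and read from an in-memory buffer rather than a real file.

```python
import csv
import io

# Illustrative rows: one field contains quotes, another contains a comma.
rows = [["name", "city"], ['John "the Hammer" Smith', "New York, NY"]]

# QUOTE_MINIMAL quotes only the fields that need it; embedded quote
# characters are escaped by doubling them.
buffer = io.StringIO()
writer = csv.writer(buffer, quotechar='"', quoting=csv.QUOTE_MINIMAL)
writer.writerows(rows)
csv_text = buffer.getvalue()
print(csv_text)

# Reading the text back with the same quote character restores the values.
parsed = list(csv.reader(io.StringIO(csv_text), quotechar='"'))
print(parsed)
```

Writing and reading with the same quote character is what keeps the round trip lossless.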
Various Methods of Loading a CSV Data File
While working with ML projects, one of the most crucial tasks is to load the data properly. As mentioned
earlier, the most common data format for ML projects is CSV, and it comes in various flavors that vary in
how difficult they are to parse.
In this section, we are going to discuss some common approaches in Python to load a CSV data file into a
machine learning project -
import csv
with open('[Link]', 'r') as file:
    reader = csv.reader(file)
    for row in reader:
        print(row)
This code reads a CSV file called [Link] and prints each row in the file.
import pandas as pd
data = pd.read_csv('[Link]')
This code reads a CSV file called [Link] and loads it into a pandas DataFrame object called data.
import numpy as np
data = np.loadtxt('[Link]', delimiter=',')
This code reads a CSV file called [Link] and loads it into a numpy array called 'data'.
from sklearn.datasets import load_iris
data = load_iris().data
This code loads the iris dataset, which is included in the sklearn library, and loads it into a numpy array
called data.
Data Understanding
While working with machine learning projects, we often neglect two of the most important parts:
mathematics and data. Data understanding is a critical step because ML is data-driven: our ML model
will produce only as good or as bad results as the data we provide to it.
Data understanding basically involves analyzing and exploring the data to identify any patterns or trends
that may be present.
1. Data Collection - This involves gathering the relevant data that you will be using for your
analysis. The data can be collected from various sources such as databases, websites, and APIs.
2. Data Cleaning - This involves cleaning the data by removing any irrelevant or duplicate data,
and dealing with missing data values. The data should be formatted in a way that makes it
easy to analyze.
3. Data Exploration - This involves exploring the data to identify any patterns or trends that
may be present. This can be done using various statistical techniques such as histograms,
scatter plots, and correlation analysis.
4. Data Visualization - This involves creating visual representations of the data to help you
understand it better. This can be done using tools such as graphs, charts, and maps.
5. Data Preprocessing - This involves transforming the data to make it suitable for use in
machine learning algorithms. This can include scaling the data, transforming it into a different
format, or reducing its dimensionality.
In this chapter, with the help of following Python recipes, we are going to understand ML data with
statistics.
Example
from pandas import read_csv
path = r"C:\[Link]"
headernames = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
data = read_csv(path, names=headernames)
print(data.head(11))
Output
preg plas pres skin test mass pedi age class
0 6 148 72 35 0 33.6 0.627 50 1
1 1 85 66 29 0 26.6 0.351 31 0
2 8 183 64 0 0 23.3 0.672 32 1
3 1 89 66 23 94 28.1 0.167 21 0
4 0 137 40 35 168 43.1 2.288 33 1
5 5 116 74 0 0 25.6 0.201 30 0
6 3 78 50 32 88 31.0 0.248 26 1
7 10 115 0 0 0 35.3 0.134 29 0
8 2 197 70 45 543 30.5 0.158 53 1
9 8 125 96 0 0 0.0 0.232 54 1
10 4 110 92 0 0 37.6 0.191 30 0
We can observe from the above output that the first column gives the row number, which can be very useful
for referencing a specific observation.
It is also good practice to know how many rows and columns the data has, for the following reasons -
1. If we have too many rows and columns, it takes a long time to run the algorithm and train
the model.
2. If we have too few rows and columns, we do not have enough data to train the model
well.
The following Python script prints the shape property of a Pandas DataFrame. We run it on the iris data
set to get the total number of rows and columns in it.
Example
from pandas import read_csv
path = r"C:\[Link]"
data = read_csv(path)
print(data.shape)
Output
(150, 4)
The output shows that the iris data set we are using has 150 rows and 4 columns.
Example
from pandas import read_csv
path = r"C:\[Link]"
data = read_csv(path)
print(data.dtypes)
Output
sepal_length float64
sepal_width float64
petal_length float64
petal_width float64
dtype: object
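From the above output, we can easily get the datatype of each attribute.
The describe() function on a Pandas DataFrame provides the following 8 statistical properties of each
attribute -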
1. Count
2. Mean
3. Standard Deviation
4. Minimum Value
5. Maximum value
6. 25%
7. Median i.e. 50%
8. 75%
Example
from pandas import read_csv
from pandas import set_option
path = r"C:\[Link]"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
data = read_csv(path, names=names)
set_option('display.precision', 2)
print(data.shape)
print(data.describe())
Output
(768,9)
preg plas pres skin test mass pedi age class
count 768.00 768.00 768.00 768.00 768.00 768.00 768.00 768.00 768.00
mean 3.85 120.89 69.11 20.54 79.80 31.99 0.47 33.24 0.35
std 3.37 31.97 19.36 15.95 115.24 7.88 0.33 11.76 0.48
min 0.00 0.00 0.00 0.00 0.00 0.00 0.08 21.00 0.00
25% 1.00 99.00 62.00 0.00 0.00 27.30 0.24 24.00 0.00
50% 3.00 117.00 72.00 23.00 30.50 32.00 0.37 29.00 0.00
75% 6.00 140.25 80.00 32.00 127.25 36.60 0.63 41.00 1.00
max 17.00 199.00 122.00 99.00 846.00 67.10 2.42 81.00 1.00
From the above output, we can observe the statistical summary of the data of Pima Indian Diabetes dataset
along with shape of data.
Example
from pandas import read_csv
path = r"C:\[Link]"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
data = read_csv(path, names=names)
count_class = data.groupby('class').size()
print(count_class)
Output
class
0 500
1 268
dtype: int64
From the above output, it can be clearly seen that the number of observations with class 0 is almost
double the number of observations with class 1.
It is always good to review the pairwise correlations of the attributes in our dataset before using it in an
ML project, because some machine learning algorithms such as linear regression and logistic regression
will perform poorly if we have highly correlated attributes. In Python, we can easily calculate a
correlation matrix of dataset attributes with the help of the corr() function on a Pandas DataFrame.
Example
from pandas import read_csv
from pandas import set_option
path = r"C:\[Link]"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
data = read_csv(path, names=names)
set_option('display.precision', 2)
correlations = data.corr(method='pearson')
print(correlations)
Output
preg plas pres skin test mass pedi age class
preg 1.00 0.13 0.14 -0.08 -0.07 0.02 -0.03 0.54 0.22
plas 0.13 1.00 0.15 0.06 0.33 0.22 0.14 0.26 0.47
pres 0.14 0.15 1.00 0.21 0.09 0.28 0.04 0.24 0.07
skin -0.08 0.06 0.21 1.00 0.44 0.39 0.18 -0.11 0.07
test -0.07 0.33 0.09 0.44 1.00 0.20 0.19 -0.04 0.13
mass 0.02 0.22 0.28 0.39 0.20 1.00 0.14 0.04 0.29
pedi -0.03 0.14 0.04 0.18 0.19 0.14 1.00 0.03 0.17
age 0.54 0.26 0.24 -0.11 -0.04 0.04 0.03 1.00 0.24
class 0.22 0.47 0.07 0.07 0.13 0.29 0.17 0.24 1.00
The matrix in the above output gives the correlation between every pair of attributes in the dataset.
Reviewing the skew of each attribute is also important, for the following reasons -
1. Skewness in the data needs to be corrected at the data preparation stage so that we can
get more accuracy from our model.
2. Many ML algorithms assume that the data has a Gaussian distribution, i.e. a normal,
bell-shaped curve.
In Python, we can easily calculate the skew of each attribute by using the skew() function on a Pandas
DataFrame.
Example
from pandas import read_csv
path = r"C:\[Link]"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
data = read_csv(path, names=names)
print(data.skew())
Output
preg 0.90
plas 0.17
pres -1.84
skin 0.11
test 2.27
mass -0.43
pedi 1.92
age 1.13
class 0.64
dtype: float64
From the above output, positive or negative skew can be observed. If the value is closer to zero, then it
shows less skew.
Data Preparation
Data preparation, also known as data preprocessing, is a crucial step in machine learning. The quality of
the data you use for your model can have a significant impact on the performance of the model.
Data preparation involves cleaning, transforming, and pre-processing the data to make it suitable for
analysis and modeling. The goal of data preparation is to make sure that the data is accurate, complete, and
relevant for the analysis.
The following are some of the key steps involved in data preparation -
1. Data cleaning - This involves identifying and correcting errors, missing values, and outliers
in the data. Common techniques used for data cleaning include imputation, outlier detection
and removal, and data normalization.
2. Data transformation - This involves converting the data from its original format into a
format that is suitable for analysis. This could involve converting categorical variables into
numerical variables, or scaling the data to a certain range.
3. Feature engineering - This involves creating new features from the existing data that may
be more informative or useful for the analysis. Feature engineering can involve combining
or transforming existing features, or creating new features based on domain knowledge or
insights.
4. Data integration - This involves combining data from multiple sources into a single dataset
for analysis. This may involve matching or linking records across different datasets, or
merging datasets based on common variables.
5. Data reduction - This involves reducing the size of the dataset by selecting a subset of
features or observations that are most relevant for the analysis. This can help to reduce noise and
improve the accuracy of the model.
Data preparation is a critical step in the machine learning process, and can have a significant impact on the
accuracy and effectiveness of the final model. It requires careful attention to detail and a thorough
understanding of the data and the problem at hand.
Example
Let's check an example of data preparation using the breast cancer dataset -
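data = load_breast_cancer()
X, y = data.data, data.target
X_train, X_test, y_train, y_test = train_test_split(X, y)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)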
In this example, we first load the breast cancer dataset using the load_breast_cancer function from
scikit-learn. Then we separate the features and target, and split the data into training and testing sets
using the train_test_split function.
Finally, we normalize the data using StandardScaler from scikit-learn, which subtracts the mean and
scales the data to unit variance. This helps to bring all the features to a similar scale, which is particularly
important for models like SVM and neural networks.
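The steps described above can be put together as one self-contained script. This is a sketch rather than the original listing: the test_size and random_state values are assumptions chosen for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Load the dataset and separate the features from the target.
data = load_breast_cancer()
X, y = data.data, data.target

# Split into training and testing sets.
# test_size=0.2 and random_state=42 are assumed values, not from the text.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Fit the scaler on the training data only, then apply it to both sets.
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

print(X_train.shape, X_test.shape)
```

Fitting the scaler on the training set alone keeps information from the test set out of the preprocessing step.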
Scaling
Our dataset most likely contains attributes on varying scales, which many ML algorithms handle poorly, so
it requires rescaling. Data rescaling makes sure that attributes are on the same scale. Generally,
attributes are rescaled into the range of 0 to 1. Algorithms trained with gradient descent and distance-based
algorithms such as k-Nearest Neighbors benefit from scaled data. We can rescale the data with the help of
the MinMaxScaler class of the scikit-learn Python library.
Example
In this example we will rescale the data of Pima Indians Diabetes dataset which we used earlier. First, the
CSV data will be loaded (as done in the previous chapters) and then with the help of MinMaxScaler class, it
will be rescaled in the range of 0 and 1.
The first few lines of the following script are same as we have written in previous chapters while loading
CSV data.
Now, we can use MinMaxScaler class to rescale the data in the range of 0 and 1.
data_scaler = MinMaxScaler(feature_range=(0, 1))
data_rescaled = data_scaler.fit_transform(array)
We can also summarize the data for output as per our choice. Here, we are setting the precision to 1 and
showing the first 10 rows in the output.
set_printoptions(precision=1)
print("\nScaled data:\n", data_rescaled[0:10])
Output
Scaled data:
From the above output, all the data got rescaled into the range of 0 and 1.
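As a self-contained illustration of the same rescaling, the sketch below applies MinMaxScaler to a small hypothetical array instead of the Pima Indians Diabetes file, so that the result is easy to verify by hand.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical data: two attributes on very different scales.
array = np.array([[1.0, 100.0],
                  [2.0, 300.0],
                  [3.0, 500.0]])

# Each column is mapped to [0, 1] using (x - min) / (max - min).
data_scaler = MinMaxScaler(feature_range=(0, 1))
data_rescaled = data_scaler.fit_transform(array)

print(data_rescaled)
```

Each column is rescaled independently, so the smallest value in every column becomes 0 and the largest becomes 1.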
Normalization
Another useful data preprocessing technique is Normalization. This is used to rescale each row of data to
have a length of 1. It is mainly useful in Sparse dataset where we have lots of zeros. We can rescale the data
with the help of Normalizer class of scikit-learn Python library.
Types of Normalization
In machine learning, there are two types of normalization preprocessing techniques as follows -
L1 Normalization
It may be defined as the normalization technique that modifies the dataset values so that in each row
the sum of the absolute values is always 1. It is also called Least Absolute Deviations.
Example
In this example, we use the L1 Normalization technique to normalize the data of Pima Indians Diabetes
dataset which we used earlier. First, the CSV data will be loaded and then with the help of Normalizer class
it will be normalized.
The first few lines of following script are same as we have written in previous chapters while loading CSV
data.
from pandas import read_csv
from numpy import set_printoptions
from sklearn.preprocessing import Normalizer
path = r'C:\pima-indians-diabetes.csv'
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = read_csv(path, names=names)
array = dataframe.values
Data_normalizer = Normalizer(norm='l1').fit(array)
Data_normalized = Data_normalizer.transform(array)
We can also summarize the data for output as per our choice. Here, we are setting the precision to 2 and
showing the first 3 rows in the output.
set_printoptions(precision=2)
print("\nNormalized data:\n", Data_normalized[0:3])
Output
Normalized data:
L2 Normalization
It may be defined as the normalization technique that modifies the dataset values so that in each row
the sum of the squares is always 1. It is also called least squares.
Example
In this example, we use L2 Normalization technique to normalize the data of Pima Indians Diabetes
dataset which we used earlier. First, the CSV data will be loaded (as done in previous chapters) and then
with the help of Normalizer class it will be normalized.
The first few lines of following script are same as we have written in previous chapters while loading CSV
data.
from pandas import read_csv
from numpy import set_printoptions
from sklearn.preprocessing import Normalizer
path = r'C:\pima-indians-diabetes.csv'
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = read_csv(path, names=names)
array = dataframe.values
Data_normalizer = Normalizer(norm='l2').fit(array)
Data_normalized = Data_normalizer.transform(array)
We can also summarize the data for output as per our choice. Here, we are setting the precision to 2 and
showing the first 3 rows in the output.
Output
Normalized data:
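To compare the two normalization types side by side, here is a small sketch on a hypothetical two-row array rather than the diabetes data. It checks the defining property of each norm: absolute values summing to 1 per row for L1, and squares summing to 1 per row for L2.

```python
import numpy as np
from sklearn.preprocessing import Normalizer

# Hypothetical data: each row is normalized independently.
array = np.array([[3.0, 4.0],
                  [1.0, 1.0]])

# L1: every row is divided by the sum of its absolute values.
l1_rows = Normalizer(norm="l1").fit_transform(array)

# L2: every row is divided by its Euclidean length.
l2_rows = Normalizer(norm="l2").fit_transform(array)

print("L1 row sums of absolute values:", np.abs(l1_rows).sum(axis=1))
print("L2 row sums of squares:", (l2_rows ** 2).sum(axis=1))
```

Note that Normalizer works row by row, unlike the scalers, which work column by column.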
Binarization
As the name suggests, this is the technique with the help of which we can make our data binary. We
use a binary threshold for this: values above the threshold are converted to 1, and values equal to or
below it are converted to 0. For example, if we choose threshold value = 0.5, then a dataset value above it
becomes 1 and a value at or below it becomes 0. That is why we can call it binarizing the data or
thresholding the data. This technique is useful when we have probabilities in our dataset and want to
convert them into crisp values.
We can binarize the data with the help of Binarizer class of scikit-learn Python library.
Example
In this example, we will binarize the data of Pima Indians Diabetes dataset which we used earlier. First, the
CSV data will be loaded and then with the help of Binarizer class it will be converted into binary values i.e.
0 and 1 depending upon the threshold value. We are taking 0.5 as threshold value.
The first few lines of following script are same as we have written in previous chapters while loading CSV
data.
from pandas import read_csv
from sklearn.preprocessing import Binarizer
path = r'C:\pima-indians-diabetes.csv'
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = read_csv(path, names=names)
array = dataframe.values
Now, we can use the Binarizer class to convert the data into binary values.
binarizer = Binarizer(threshold=0.5).fit(array)
Data_binarized = binarizer.transform(array)
Output
Binary data:
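As a self-contained illustration of thresholding, the sketch below binarizes a small hypothetical array of probability-like values. It includes a value exactly equal to the threshold to show how that boundary case is treated by scikit-learn's Binarizer.

```python
import numpy as np
from sklearn.preprocessing import Binarizer

# Hypothetical probability-like values; 0.5 sits exactly on the threshold.
array = np.array([[0.2, 0.5, 0.9],
                  [0.7, 0.1, 0.51]])

# Values strictly greater than the threshold become 1, all others become 0.
binarizer = Binarizer(threshold=0.5)
data_binarized = binarizer.fit_transform(array)

print(data_binarized)
```

The value 0.5 maps to 0 because only values strictly above the threshold are set to 1.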
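Standardization
Another useful data preprocessing technique is standardization, which rescales each attribute to have a
mean of 0 and a standard deviation of 1. We can standardize the data with the help of the StandardScaler
class of the scikit-learn Python library.
Example
In this example, we will standardize the data of Pima Indians Diabetes dataset which we used earlier. First,
the CSV data will be loaded and then with the help of StandardScaler class each attribute will be rescaled to
have mean = 0 and SD = 1. Note that this shifts and scales the values but does not by itself make their
distribution Gaussian.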
The first few lines of following script are same as we have written in previous chapters while loading CSV
data.
data_scaler = StandardScaler().fit(array)
data_rescaled = data_scaler.transform(array)
We can also summarize the data for output as per our choice. Here, we are setting the precision to 2 and
showing the first 5 rows in the output.
Output
Rescaled data:
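As a self-contained illustration of StandardScaler, the sketch below uses a small hypothetical array and then checks the mean and standard deviation of every column after the transformation.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical data: two attributes on different scales.
array = np.array([[1.0, 10.0],
                  [2.0, 20.0],
                  [3.0, 30.0]])

# Each column becomes (x - column mean) / column standard deviation.
data_rescaled = StandardScaler().fit_transform(array)

print(data_rescaled)
print("Column means:", data_rescaled.mean(axis=0))
print("Column standard deviations:", data_rescaled.std(axis=0))
```

After the transformation both columns hold the same values, because they differed only in scale to begin with.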
Data Labeling
We discussed the importance of good data for ML algorithms as well as some techniques to pre-process
the data before sending it to ML algorithms. One more aspect in this regard is data labeling. It is also very
important that the data sent to ML algorithms is properly labeled. For example, in classification
problems the data carries many labels in the form of words, numbers etc., and word labels usually have to
be converted into numbers before an algorithm can use them.
Example
In the following example, Python script will perform the label encoding.
import numpy as np
from sklearn import preprocessing
input_labels = ['red','black','red','green','black','yellow','white']
The next line of code will create the label encoder and train it.
encoder = preprocessing.LabelEncoder()
encoder.fit(input_labels)
The next lines of script will check the performance by encoding the random ordered list -
test_labels = ['green','red']
encoded_values = encoder.transform(test_labels)
We can get the list of encoded values with the help of following python script -
print("Encoded values =", list(encoded_values))
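The label encoding steps can also be written as one self-contained script. This is a sketch: the test list and the integers passed to inverse_transform are chosen for illustration and are not taken from the original listing.

```python
from sklearn.preprocessing import LabelEncoder

input_labels = ["red", "black", "red", "green", "black", "yellow", "white"]

# Create the label encoder and train it on the known labels.
encoder = LabelEncoder()
encoder.fit(input_labels)

# The learned classes are stored in sorted order; the position is the code.
for index, label in enumerate(encoder.classes_):
    print(label, "-->", index)

# Encode an illustrative list of labels into integers.
encoded = encoder.transform(["green", "red", "black"])
print("Encoded values =", list(encoded))

# Decode illustrative integers back into labels.
decoded = encoder.inverse_transform([3, 0, 4, 1])
print("Decoded labels =", list(decoded))
```

The integer codes only identify the labels and do not imply any order among the colours.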
Models
There are four main types of machine learning models -
1. Supervised Learning
2. Unsupervised Learning
3. Semi-supervised Learning
4. Reinforcement Learning
In the next four chapters, we will discuss each of these machine learning models in detail. Here, let's have
a brief overview of these methods:
Supervised Learning
Supervised learning algorithms or methods are the most commonly used ML algorithms. During training,
such an algorithm takes the data samples, i.e. the training data, together with the output, i.e. the label or
response, associated with each sample.
The main objective of supervised learning algorithms is to learn an association between input data
samples and corresponding outputs from many training instances.
Based on the ML tasks, supervised learning algorithms can be divided into two broad classes -
Classification and Regression.
Unsupervised Learning
As the name suggests, it is the opposite of supervised ML methods: in unsupervised machine learning
algorithms we do not have any supervisor to provide guidance. Unsupervised learning algorithms are handy
when we do not have the luxury, as in supervised learning, of pre-labeled training data and we want to
extract useful patterns from the input data.
Examples of unsupervised machine learning algorithms include K-means clustering and association rule
mining.
Based on the ML tasks, unsupervised learning algorithms can be divided into following broad classes -
1. Clustering - Clustering methods are among the most useful unsupervised ML methods. These
algorithms are used to find similarity as well as relationship patterns among data samples and
then cluster those samples into groups having similarity based on features. A real-world
example of clustering is grouping customers by their purchasing behavior.
2. Association - Another useful unsupervised ML method is Association, which is used to
analyze large datasets to find patterns that represent interesting relationships
between various items. It is also termed Association Rule Mining or Market Basket Analysis,
and is mainly used to analyze customer shopping patterns.
3. Dimensionality Reduction - This unsupervised ML method is used to reduce the number of
feature variables for each data sample by selecting set of principal or representative features.
4. Anomaly Detection - This unsupervised ML method is used to find out the occurrences
of rare events or observations that generally do not occur. By using the learned knowledge,
anomaly detection methods would be able to differentiate between anomalous or a normal
data point.
Semi-supervised Learning
Such algorithms or methods are neither fully supervised nor fully unsupervised. They fall between
the two, i.e. supervised and unsupervised learning methods. These kinds of algorithms generally
use a small supervised learning component, i.e. a small amount of pre-labeled annotated data, and a large
unsupervised learning component, i.e. lots of unlabeled data, for training.
Reinforcement Learning
These methods are different from the previously studied methods and are less commonly used in practice.
In this kind of learning algorithm, there is an agent that we want to train over a period of time so that it
can interact with a specific environment. The agent follows a set of strategies for interacting with the
environment and, after observing the environment, takes actions with regard to its current state.
Supervised
Supervised learning algorithms or methods are the most commonly used ML algorithms. During training,
such an algorithm takes the data samples, i.e. the training data, together with the output, i.e. the label or
response, associated with each sample. The main objective of supervised learning algorithms is to learn an
association between input data samples and corresponding outputs from many training instances.
Suppose we have input variables x and an output variable Y. We apply an algorithm to learn the mapping
function from the input to the output as follows -
Y=f(x)
Now, the main objective would be to approximate the mapping function so well that even when we have
new input data (x), we can easily predict the output variable (Y) for that new input data.
It is called supervised because the whole process of learning can be thought of as being supervised by a
teacher or supervisor. Examples of supervised machine learning algorithms include Decision Tree,
Random Forest, KNN, Logistic Regression etc.
Based on the ML tasks, supervised learning algorithms can be divided into two broad classes -
Classification and Regression.
Classification
The key objective of classification-based tasks is to predict categorical output labels or responses for the
given input data. The output will be based on what the model has learned in its training phase.
Categorical output responses are unordered, discrete values, hence each output response belongs to a
specific class or category. We will discuss Classification and associated algorithms in detail in further
chapters also.
Regression
The key objective of regression-based tasks is to predict output labels or responses which are continuous
numeric values, for the given input data. The output will be based on what the model has learned in the
training phase.
Basically, regression models use the input data features (independent variables) and their corresponding
continuous numeric output values (dependent or outcome variables) to learn the specific association
between inputs and corresponding outputs. We will discuss regression and associated algorithms in detail
in further chapters also.
There are several algorithms available for supervised learning. Some of the widely used algorithms of
supervised learning are as shown below -
1. k-Nearest Neighbours
2. Decision Trees
3. Naive Bayes
4. Logistic Regression
5. Support Vector Machines
As we move ahead in this chapter, let us discuss in detail about each of the algorithms.
k-Nearest Neighbours
The k-Nearest Neighbours algorithm, simply called kNN, is a technique that can be used for solving
classification and regression problems. Let us discuss the case of classifying an unknown object using
kNN. Consider the distribution of objects as shown in the image given below -
The diagram shows three types of objects, marked in red, blue and green colors. When you run the kNN
classifier on the above dataset, the boundaries for each type of object will be marked as shown below -
Now, consider a new unknown object that you want to classify as red, green or blue. This is depicted in the
figure below.
Visually, the unknown data point belongs to the class of blue objects. Mathematically, this is concluded by
measuring the distance from the unknown point to every other point in the data set and looking at the k
closest ones. When you do so, you find that most of its nearest neighbours are blue. The average distance
to red and green objects is also larger than the average distance to blue objects. Thus, this unknown
object can be classified as belonging to the blue class.
The kNN algorithm can also be used for regression problems. The kNN algorithm is available ready to use
in most of the ML libraries.
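The classification described above can be sketched with the KNeighborsClassifier class of scikit-learn. The coordinates below are hypothetical points standing in for the red, blue and green objects of the figure, and k = 3 is an assumed choice.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical 2-D points for the three colour classes.
X = np.array([
    [1.0, 1.0], [1.5, 2.0], [2.0, 1.0],   # red
    [8.0, 8.0], [8.5, 9.0], [9.0, 8.0],   # blue
    [1.0, 8.0], [1.5, 9.0], [2.0, 8.0],   # green
])
y = np.array(["red"] * 3 + ["blue"] * 3 + ["green"] * 3)

# k = 3 is an assumed value: the three closest points vote on the class.
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X, y)

# A new unknown object located near the blue group.
unknown = np.array([[8.2, 8.4]])
predicted = knn.predict(unknown)[0]
print("Predicted class:", predicted)
```

The three nearest neighbours of the unknown point are all blue, so the majority vote assigns it to the blue class.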
Decision Trees
A simple decision tree in a flowchart format is shown below -
You would write code to classify your input data based on this flowchart. The flowchart is
self-explanatory and trivial. In this scenario, you are trying to classify an incoming email to decide when
to read it.
In reality, decision trees can be large and complex. There are several algorithms available to create and
traverse these trees. As a Machine Learning enthusiast, you need to understand and master these
techniques of creating and traversing decision trees.
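Instead of hand-writing the flowchart as code, a library can learn a tree from examples. The sketch below uses the DecisionTreeClassifier class of scikit-learn on a hypothetical email data set; the two features (from_manager, marked_urgent) and the labels are assumptions made for illustration and are not taken from the flowchart.

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical emails described by two yes/no features:
# [from_manager, marked_urgent]
X = [[1, 1], [1, 0], [0, 1], [0, 0]]
y = ["read now", "read now", "read later", "read later"]

# Learn the tree; random_state makes the result repeatable.
tree = DecisionTreeClassifier(random_state=0)
tree.fit(X, y)

# Print the learned rules in a flowchart-like text form.
print(export_text(tree, feature_names=["from_manager", "marked_urgent"]))

# Classify a new email that is from the manager but not marked urgent.
decision = tree.predict([[1, 0]])[0]
print("Decision:", decision)
```

In this toy data only the from_manager feature separates the two labels, so the learned tree needs a single split.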
Naive Bayes
Naive Bayes is used for creating classifiers. Suppose you want to sort out (classify) fruits of different kinds
from a fruit basket. You may use features such as the color, size and shape of a fruit. For example, any fruit
that is red in color, round in shape and about 10 cm in diameter may be considered an Apple. So to train
the model, you would use these features and estimate the probability that a given feature matches the
desired constraints. The probabilities of the different features are then combined to arrive at a probability
that a given fruit is an Apple. Naive Bayes generally requires only a small amount of training data for
classification.
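The fruit example can be sketched with the GaussianNB class of scikit-learn. All numbers below are hypothetical: redness and roundness are made-up scores between 0 and 1, the diameter is in centimetres, and banana is an assumed second class added so that there is something to distinguish apples from.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Hypothetical fruits described by [redness, roundness, diameter_cm].
X = np.array([
    [0.90, 0.90, 10.0], [0.80, 0.95, 9.5], [0.95, 0.85, 10.5],   # apples
    [0.10, 0.10, 18.0], [0.15, 0.20, 20.0], [0.05, 0.15, 19.0],  # bananas
])
y = np.array(["apple"] * 3 + ["banana"] * 3)

# GaussianNB models each feature per class independently and combines
# the per-feature probabilities, which is the "naive" assumption.
model = GaussianNB()
model.fit(X, y)

# A new fruit that is red, round and about 10 cm in diameter.
new_fruit = np.array([[0.85, 0.90, 10.2]])
predicted = model.predict(new_fruit)[0]
probabilities = model.predict_proba(new_fruit)[0]

print("Predicted fruit:", predicted)
print("Class order:", list(model.classes_))
```

The model works here with only three examples per class, which matches the remark that Naive Bayes needs little training data.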
Logistic Regression
Look at the following diagram. It shows the distribution of data points in XY plane.
[Figure: scatter plot of red and green data points in the XY plane]
From the diagram, we can visually inspect the separation of red dots from green dots. You may draw a
boundary line to separate out these dots. Now, to classify a new data point, you will just need to determine
on which side of the line the point lies.
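The idea of a boundary line can be sketched with the LogisticRegression class of scikit-learn. The coordinates are hypothetical stand-ins for the red and green dots of the diagram.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical red and green points in the XY plane.
X = np.array([
    [-2.0, -1.0], [-1.5, -2.0], [-1.0, -1.5],   # red
    [1.0, 1.5], [1.5, 2.0], [2.0, 1.0],         # green
])
y = np.array(["red", "red", "red", "green", "green", "green"])

# Fitting the model learns a straight boundary line between the classes.
model = LogisticRegression()
model.fit(X, y)

# The sign of decision_function tells on which side of the line a point lies.
new_point = np.array([[1.2, 0.8]])
side = model.decision_function(new_point)[0]
predicted = model.predict(new_point)[0]

print("Classes:", list(model.classes_))
print("Predicted class:", predicted)
```

A positive decision_function value corresponds to the second entry of model.classes_ and a negative value to the first.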
Support Vector Machines
Support Vector Machines (SVM) come in handy in determining the separation boundaries in such
situations.
Unsupervised
Examples of unsupervised machine learning algorithms include K-means clustering and association rule
mining.
In regression, we train the machine to predict a future value. In classification, we train the machine to
classify an unknown object into one of the categories defined by us. In short, we have been training
machines so that they can predict Y for our data X. Given a huge data set for which the categories are not
known in advance, it would be difficult for us to train the machine using supervised learning. What if the
machine could look up and analyze big data running into several Gigabytes and Terabytes and tell us that
this data contains so many distinct categories?
As an example, consider voter data. By considering some inputs from each voter (these are called
features in AI terminology), let the machine predict that there are so many voters who would vote for X
political party and so many who would vote for Y, and so on. Thus, in general, we are asking the machine,
given a huge set of data points X, "What can you tell me about X?". Or it may be a question like "What are
the five best groups we can make out of X?". Or it could even be "What three features occur together most
frequently in X?".
k-means clustering
The 2000 and 2004 Presidential elections in the United States were close, very close. The largest
percentage of the popular vote that any candidate received was 50.7% and the lowest was 47.9%. If a percentage
of the voters were to have switched sides, the outcome of the election would have been different. There are
small groups of voters who, when properly appealed to, will switch sides. These groups may not be huge,
but with such close races, they may be big enough to change the outcome of the election. How do you find
these groups of people? How do you appeal to them with a limited budget? The answer is clustering.
Clustering is a type of unsupervised learning that automatically forms clusters of similar things. It is like
automatic classification. You can cluster almost anything, and the more similar the items are in the
cluster, the better the clusters are. In this chapter, we are going to study one type of clustering algorithm called
k-means. It is called k-means because it finds 'k' unique clusters, and the center of each cluster is the mean
of the values in that cluster.
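The idea can be sketched with NumPy as an alternation between assigning points to the nearest center and moving each center to the mean of its points. The function name `k_means`, the synthetic two-blob data, and the random initialization are assumptions of this sketch, not a reference implementation.

```python
import numpy as np

def k_means(points, k, n_iter=100, seed=0):
    """Plain k-means: returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    # Start from k distinct data points chosen at random
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: index of the nearest centroid for every point
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each centroid becomes the mean of the points assigned to it
        new_centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break                      # centroids stopped moving
        centroids = new_centroids
    return centroids, labels

# Two well-separated synthetic blobs (illustrative data)
rng = np.random.default_rng(42)
blob_a = rng.normal(loc=[0, 0], scale=0.5, size=(50, 2))
blob_b = rng.normal(loc=[5, 5], scale=0.5, size=(50, 2))
points = np.vstack([blob_a, blob_b])

centroids, labels = k_means(points, k=2)
print(np.round(centroids, 1))
```

On data this cleanly separated the result does not depend much on the starting points; on real data, k-means is usually run several times with different initializations.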
Cluster Identification
Cluster identification tells an algorithm, "Here's some data. Now group similar things together and tell me
about those groups." The key difference from classification is that in classification you know what you are
looking for, whereas in clustering you do not.
Clustering is sometimes called unsupervised classification because it produces the same kind of result as
classification does, but without having predefined classes.
Based on the ML tasks, unsupervised learning algorithms can be divided into the following broad classes:
Clustering, Association, Dimensionality Reduction, and Anomaly Detection.
Clustering
Clustering methods are among the most useful unsupervised ML methods. These algorithms are used to find
similarity as well as relationship patterns among data samples and then cluster those samples into groups
based on the similarity of their features. A real-world example of clustering is grouping customers by
their purchasing behavior.
Association
Another useful unsupervised ML method is association, which is used to analyze large datasets
to find patterns that represent interesting relationships between various items. It is also
termed Association Rule Mining or Market Basket Analysis and is mainly used to analyze customer
shopping patterns.
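A minimal sketch of the counting behind market basket analysis is shown below. The baskets are invented, and `support` and `confidence` are hypothetical helper names for the two standard measures: how often two items appear together, and how often the second item appears given the first.

```python
from collections import Counter
from itertools import combinations

# Hypothetical shopping baskets
baskets = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"beer", "chips"},
    {"bread", "butter"},
    {"beer", "chips", "milk"},
]

# How many baskets contain each item, and each (sorted) pair of items
item_counts = Counter(item for basket in baskets for item in basket)
pair_counts = Counter(
    pair for basket in baskets for pair in combinations(sorted(basket), 2)
)

def support(item_a, item_b):
    """Fraction of all baskets that contain both items."""
    return pair_counts[tuple(sorted((item_a, item_b)))] / len(baskets)

def confidence(antecedent, consequent):
    """Fraction of baskets containing the antecedent that also contain the consequent."""
    both = pair_counts[tuple(sorted((antecedent, consequent)))]
    return both / item_counts[antecedent]

print(support("beer", "chips"))
print(confidence("butter", "bread"))
print(confidence("bread", "butter"))
```

Note that confidence is not symmetric: in these baskets every purchase of butter comes with bread, but not every purchase of bread comes with butter.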
Dimensionality Reduction
As the name suggests, this unsupervised ML method is used to reduce the number of feature variables for
each data sample by selecting a set of principal or representative features.
A natural question is why we need to reduce the dimensionality. The reason is
the problem of feature space complexity, which arises when we start analyzing and extracting millions
of features from data samples. This problem is generally referred to as the "curse of dimensionality". PCA (Principal
Component Analysis) and discriminant analysis are some of the popular algorithms
for this purpose.
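As an illustration, PCA can be sketched with NumPy's singular value decomposition. The synthetic data (three features, one of which nearly duplicates another) and the function name `pca` are assumptions made for this example.

```python
import numpy as np

def pca(X, n_components):
    """Project X (samples x features) onto its top principal components."""
    X_centered = X - X.mean(axis=0)
    # Rows of Vt are the principal directions, ordered by singular value
    _, singular_values, Vt = np.linalg.svd(X_centered, full_matrices=False)
    explained_variance = singular_values**2 / (len(X) - 1)
    ratio = explained_variance / explained_variance.sum()
    return X_centered @ Vt[:n_components].T, ratio[:n_components]

# Synthetic data: the third feature is almost a copy of the first
rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = rng.normal(size=200)
X = np.column_stack([a, b, a + 0.01 * rng.normal(size=200)])

X_reduced, ratio = pca(X, n_components=2)
print(X_reduced.shape, ratio.sum())
```

Because one feature is redundant, two components retain almost all of the variance of the three original features.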
Anomaly Detection
This unsupervised ML method is used to find rare events or observations that generally
do not occur. Using what they have learned from the data, anomaly detection methods can differentiate
between anomalous and normal data points.
Some algorithms, such as clustering and KNN, can detect anomalies based on the data and its
features.
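One simple distance-based variant can be sketched as follows: score every point by its average distance to its k nearest neighbours and treat the highest scores as anomalies. The data, the planted outlier, and the name `knn_anomaly_scores` are assumptions of this sketch.

```python
import numpy as np

def knn_anomaly_scores(X, k=3):
    """Score each point by its mean distance to its k nearest neighbours."""
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)            # ignore each point's distance to itself
    nearest = np.sort(dists, axis=1)[:, :k]
    return nearest.mean(axis=1)

rng = np.random.default_rng(1)
normal_points = rng.normal(loc=0.0, scale=1.0, size=(100, 2))
outlier = np.array([[8.0, 8.0]])               # one planted anomaly, stored at index 100
X = np.vstack([normal_points, outlier])

scores = knn_anomaly_scores(X, k=3)
print(scores.argmax())
```

Points in dense regions have close neighbours and low scores; an isolated point has none nearby and stands out.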
Semi-supervised
Semi-supervised machine learning algorithms are neither fully supervised nor fully unsupervised. They
fall between the two, i.e., between supervised and unsupervised learning methods.
Semi-supervised algorithms generally use a small supervised learning component, i.e., a small amount of
pre-labeled, annotated data, and a large unsupervised learning component, i.e., lots of unlabeled data, for
training.
We can follow either of the following approaches for implementing semi-supervised learning methods -
1. The first and simplest approach is to build a supervised model on the small amount of labeled,
annotated data and then apply that model to the large amount of unlabeled data to obtain
more labeled samples. Now, train the model on the enlarged labeled set and
repeat the process.
2. The second approach needs some extra effort. In this approach, we first use
unsupervised methods to cluster similar data samples, annotate these groups, and then use a
combination of this information to train the model.
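The first approach can be sketched as a label-and-retrain loop. Everything specific here is an assumption made for illustration: the synthetic two-class data, the nearest-centroid classifier standing in for "the supervised model", and the margin threshold used to decide which predictions are trusted.

```python
import numpy as np

def fit_centroids(X, y):
    """'Train' a nearest-centroid classifier: one mean vector per class."""
    classes = np.unique(y)
    return classes, np.array([X[y == c].mean(axis=0) for c in classes])

def predict_with_margin(X, classes, centroids):
    """Predict the nearest class and how much closer it is than the runner-up."""
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    order = np.sort(dists, axis=1)
    return classes[dists.argmin(axis=1)], order[:, 1] - order[:, 0]

rng = np.random.default_rng(0)
X_class0 = rng.normal([0, 0], 1.0, size=(100, 2))
X_class1 = rng.normal([4, 4], 1.0, size=(100, 2))
X_all = np.vstack([X_class0, X_class1])
y_true = np.array([0] * 100 + [1] * 100)

# Only three examples per class start out labeled
labeled_idx = np.array([0, 1, 2, 100, 101, 102])
X_lab, y_lab = X_all[labeled_idx], y_true[labeled_idx]
X_unlab = np.delete(X_all, labeled_idx, axis=0)

for _ in range(5):                        # repeat the label-and-retrain cycle
    classes, centroids = fit_centroids(X_lab, y_lab)
    if len(X_unlab) == 0:
        break
    pseudo, margin = predict_with_margin(X_unlab, classes, centroids)
    confident = margin > 2.0              # assumed confidence threshold
    # Move confidently predicted samples into the labeled set
    X_lab = np.vstack([X_lab, X_unlab[confident]])
    y_lab = np.concatenate([y_lab, pseudo[confident]])
    X_unlab = X_unlab[~confident]

classes, centroids = fit_centroids(X_lab, y_lab)
predictions, _ = predict_with_margin(X_all, classes, centroids)
accuracy = (predictions == y_true).mean()
print(len(X_lab), accuracy)
```

Only confident predictions are added as new labels; adding every prediction would feed the model's own mistakes back into its training data.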
The algorithm is trained on a dataset that contains both labeled and unlabeled data. Semi-supervised
learning is generally used when we have a huge set of unlabeled data available. In any supervised
learning algorithm, the available data has to be manually labeled, which can be quite an expensive process. In
contrast, the unlabeled data used in unsupervised learning has limited applications. Hence, semi-supervised
learning algorithms were developed to provide a balance between the two.
Semi-supervised learning algorithms find application in text classification, image classification, speech
analysis, anomaly detection, etc., where the general goal is to classify an entity into a predefined category.
A semi-supervised algorithm assumes that the data can be divided into discrete clusters and that data points
closer to each other are more likely to share the same output label.
Reinforcement
These methods are a bit different from the previously studied methods and are used less often. In this kind
of learning algorithm, there is an agent that we want to train over a period of time so that it
can interact with a specific environment. The agent follows a set of strategies for interacting with the
environment and, after observing the environment, takes actions based on the current state of the
environment.
1. Step 1 - First, we need to prepare an agent with some initial set of strategies.
2. Step 2 - Then, observe the environment and its current state.
3. Step 3 - Next, select the optimal policy with regard to the current state of the environment and
perform the corresponding action.
4. Step 4 - Now, the agent receives a reward or penalty in accordance with the
action it took in the previous step.
5. Step 5 - Now, we can update the strategies if required.
6. Step 6 - Finally, repeat steps 2-5 until the agent has learned and adopted the optimal policies.
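The steps above can be sketched with a very small example. The environment here is an assumption chosen for brevity: it has a single state and three actions with hidden reward probabilities, so Step 2 (observing the state) is trivial, and the strategy is a simple epsilon-greedy rule rather than a full reinforcement learning algorithm.

```python
import random

random.seed(0)

# Hypothetical environment: three actions with hidden reward probabilities
reward_probability = [0.2, 0.5, 0.8]

def take_action(action):
    """Environment response: reward 1 with the action's hidden probability, else 0."""
    return 1 if random.random() < reward_probability[action] else 0

# Step 1: initial strategy, i.e. no knowledge about any action
value_estimate = [0.0, 0.0, 0.0]
times_tried = [0, 0, 0]
epsilon = 0.1                      # assumed exploration rate

for _ in range(5000):              # Step 6: repeat
    # Steps 2-3: choose the action that currently looks best, exploring occasionally
    if random.random() < epsilon:
        action = random.randrange(3)
    else:
        action = max(range(3), key=lambda a: value_estimate[a])
    reward = take_action(action)   # Step 4: receive the reward (or none)
    # Step 5: update the estimate as a running average of observed rewards
    times_tried[action] += 1
    value_estimate[action] += (reward - value_estimate[action]) / times_tried[action]

best_action = max(range(3), key=lambda a: value_estimate[a])
print(best_action, [round(v, 2) for v in value_estimate])
```

The occasional random action matters: without it, the agent could settle on the first action that happens to pay off and never discover a better one.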
The following diagram shows what type of task is appropriate for various ML problems -
[Diagram: decision flowchart for choosing an ML task; one decision node reads "Is the data correlated or redundant?"]
Supervised vs. Unsupervised
Machine Learning approaches can be either Supervised or Unsupervised. If you can anticipate the expanse
of data, and if it is possible to divide the data into categories, then the best approach is to help the
algorithm become smarter through Supervised Learning.
If you anticipate that the amount of data is massive, and if you think that the data cannot be simply
classified or labelled, then it is better to go for the Unsupervised Learning approach and let the algorithms handle
predictions smartly.
Supervised Learning | Unsupervised Learning
A supervised ML model takes feedback to check whether it is predicting the correct output. | An unsupervised ML model does not take any kind of feedback.
As the name implies, supervised machine learning algorithms need supervision to train the model. | As the name implies, unsupervised machine learning algorithms do not need any kind of supervision to train the model.
Supervised machine learning algorithms can be divided into two broad classes, namely Classification and Regression. | Clustering, Association, Dimensionality Reduction, and Anomaly Detection are some of the broad classes of unsupervised machine learning algorithms.
Supervised machine learning methods are generally more accurate. | Unsupervised machine learning methods are generally less accurate.
In supervised machine learning, the learning takes place offline. | In unsupervised machine learning, the learning takes place in real time.
The number of classes is already known before implementing supervised machine learning methods. | In unsupervised learning methods, the number of classes is not known in advance.
Classifying big data is one of the main drawbacks of supervised learning. | As the data used in unsupervised learning is not labeled, the difficulty of getting precise information about data sorting is one of its main drawbacks.
Some of the well-known supervised machine learning algorithms are KNN (k-nearest neighbors), Decision Tree, Logistic Regression, and Random Forest. | Some of the well-known unsupervised machine learning algorithms are Hebbian Learning, K-means Clustering, and Hierarchical Clustering.
Data Visualization
Data visualization is an important aspect of machine learning (ML) as it helps to analyze and communicate
patterns, trends, and insights in the data. Data visualization involves creating graphical representations
of the data, which can help to identify patterns and relationships that may not be apparent from the raw
data.
Here are some of the ways data visualization is used in machine learning -
1. Exploring Data - Data visualization is an essential tool for exploring and understanding data.
Visualization can help to identify patterns, correlations, and outliers, and can also help to
detect data quality issues such as missing values and inconsistencies.
2. Feature Selection - Data visualization can help to select relevant features for the ML model.
By visualizing the data and its relationship with the target variable, you can identify features
that are strongly correlated with the target variable and exclude irrelevant features that have
little predictive power.
3. Model Evaluation - Data visualization can be used to evaluate the performance of the ML
model. Visualization techniques such as ROC curves, precision-recall curves, and confusion
matrices can help to understand the accuracy, precision, recall, and F1 score of the model.
4. Communicating Insights - Data visualization is an effective way to communicate insights
and results to stakeholders who may not have a technical background. Visualizations such as
scatter plots, line charts, and bar charts can help to convey complex information in an easily
understandable format.
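The quantities mentioned under Model Evaluation can all be derived from the four cells of a binary confusion matrix. The labels and predictions below are made up for illustration, and `confusion_counts` is a hypothetical helper name.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (TP, FP, FN, TN) for a binary problem."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    return tp, fp, fn, tn

# Made-up labels and predictions for ten samples
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
accuracy = (tp + tn) / len(y_true)          # share of all predictions that are correct
precision = tp / (tp + fp)                  # share of predicted positives that are real
recall = tp / (tp + fn)                     # share of real positives that were found
f1 = 2 * precision * recall / (precision + recall)
print(tp, fp, fn, tn, accuracy, precision, recall, f1)
```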
Some popular libraries used for data visualization in Python include Matplotlib, Seaborn, Plotly, and
Bokeh. These libraries provide a wide range of visualization techniques and customization options to suit
different needs and preferences.
Univariate Plots: Understanding Attributes Independently
The simplest type of visualization is single-variable or "univariate" visualization. With the help of
univariate visualization, we can understand each attribute of our dataset independently. The following are some
techniques in Python to implement univariate visualization -
1. Histograms
2. Density Plots
3. Box and Whisker Plots
Multivariate Plots: Interaction Among Multiple Variables
Another type of visualization is multi-variable or "multivariate" visualization. With the help of
multivariate visualization, we can understand the interaction between multiple attributes of our dataset. The following
are some techniques in Python to implement multivariate visualization -
1. Correlation Matrix Plot
2. Scatter Matrix Plot
In the next few chapters, we will look at some of the popular and widely used visualization techniques
available in machine learning.
Histograms
A histogram is a bar graph-like representation of the distribution of a variable. It shows the frequency of
occurrences of each value of the variable. The x-axis represents the range of values of the variable, and the
y-axis represents the frequency or count of each value. The height of each bar represents the number of
data points that fall within that value range.
Histograms are useful for identifying patterns in data, such as skewness, modality, and outliers. Skewness
refers to the degree of asymmetry in the distribution of the variable. Modality refers to the number of
peaks in the distribution. Outliers are data points that fall outside of the range of typical values for the
variable.
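Behind every histogram is a simple counting step, sketched below with NumPy on synthetic stand-in data (the values are random numbers, not the dataset used in the example that follows).

```python
import numpy as np

rng = np.random.default_rng(0)
values = rng.normal(loc=14.0, scale=3.5, size=569)   # synthetic stand-in data

counts, bin_edges = np.histogram(values, bins=20)

# counts[i] is the number of values between bin_edges[i] and bin_edges[i + 1]
# (the last bin also includes its right edge)
for left, right, count in zip(bin_edges[:-1], bin_edges[1:], counts):
    print(f"{left:6.2f} to {right:6.2f}: {'#' * (int(count) // 5)}")
```

Twenty bins need twenty-one edges, and the counts always add up to the number of data points.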
We will use the breast cancer dataset from the Sklearn library for this example. The breast cancer dataset
contains information about the characteristics of breast cancer cells and whether they are malignant or
benign. The dataset has 30 features and 569 samples.
Example
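Let's start by importing the necessary libraries and loading the dataset -
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
data = load_breast_cancer()
Next, we will create a histogram of the mean radius feature of the dataset -
plt.figure(figsize=(7.2, 3.5))
plt.hist(data.data[:, 0], bins=20)
plt.xlabel('Mean Radius')
plt.ylabel('Frequency')
plt.show()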
In this code, we have used the hist() function from Matplotlib to create a histogram of the mean radius
feature of the dataset. We have set the number of bins to 20 to divide the data range into 20 intervals. We
have also added labels to the x and y axes using the xlabel() and ylabel() functions.
Output
The resulting histogram shows the distribution of mean radius values in the dataset. We can see that the
data is roughly normally distributed, with a peak around 12-14.
[Figure: histogram of the mean radius feature; x-axis: Mean Radius, y-axis: Frequency]
Example
plt.figure(figsize=(7.2, 3.5))
plt.hist(data.data[data.target == 0, 0], bins=20, alpha=0.5, label='Malignant')
plt.hist(data.data[data.target == 1, 0], bins=20, alpha=0.5, label='Benign')
plt.xlabel('Mean Radius')
plt.legend()
plt.show()
In this code, we have used the hist() function twice to create two histograms of the mean radius feature,
one for the malignant samples and one for the benign samples. We have set the transparency of the bars
to 0.5 using the alpha parameter so that both histograms remain visible where they overlap. We have also added
a legend to the plot using the legend() function.
Output
On executing this code, you will get the following plot as the output -
[Figure: overlaid histograms of mean radius for the Malignant and Benign samples; x-axis: Mean Radius]
The resulting histogram shows the distribution of mean radius values for both the malignant and benign
samples. We can see that the distributions are different, with the malignant samples having a higher
frequency of higher mean radius values.
Density Plots
A density plot is a type of plot that shows the probability density function of a continuous variable. It is
similar to a histogram, but instead of using bars to represent the frequency of each value, it uses a smooth
curve to represent the probability density function. The x-axis represents the range of values of the
variable, and the y-axis represents the probability density.
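One common way to obtain such a smooth curve is kernel density estimation: place a small Gaussian bump on every data point and average the bumps. The sketch below assumes synthetic data, a hand-picked bandwidth of 1.0, and a hypothetical function name `kde_estimate`.

```python
import numpy as np

def kde_estimate(samples, grid, bandwidth):
    """Average of Gaussian bumps of width `bandwidth`, one centred on each sample."""
    z = (grid[:, None] - samples[None, :]) / bandwidth
    kernels = np.exp(-0.5 * z**2) / np.sqrt(2 * np.pi)
    return kernels.mean(axis=1) / bandwidth

rng = np.random.default_rng(0)
samples = rng.normal(loc=14.0, scale=3.5, size=500)   # synthetic stand-in data

grid = np.linspace(samples.min() - 10, samples.max() + 10, 400)
density = kde_estimate(samples, grid, bandwidth=1.0)

# A probability density integrates to 1; check with a simple Riemann sum
area = density.sum() * (grid[1] - grid[0])
print(round(area, 3))
```

A smaller bandwidth gives a bumpier curve that follows individual points; a larger one gives a smoother curve that may hide real peaks.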
Density plots are useful for identifying patterns in data, such as skewness, modality, and outliers.
Skewness refers to the degree of asymmetry in the distribution of the variable. Modality refers to the number
of peaks in the distribution. Outliers are data points that fall outside of the range of typical values for the
variable.
We will use the breast cancer dataset from the Sklearn library for this example. The breast cancer dataset
contains information about the characteristics of breast cancer cells and whether they are malignant or
benign. The dataset has 30 features and 569 samples.
Example
Let's start by importing the necessary libraries and loading the dataset -
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_breast_cancer
data = load_breast_cancer()
Next, we will create a density plot of the mean radius feature of the dataset -
plt.figure(figsize=(7.2, 3.5))
sns.kdeplot(data.data[:, 0], fill=True)
plt.xlabel('Mean Radius')
plt.ylabel('Density')
plt.show()
In this code, we have used the kdeplot() function from Seaborn to create a density plot of the mean radius
feature of the dataset. We have set the fill parameter to True to shade the area under the curve (older Seaborn
releases called this parameter shade). We have also added labels to the x and y axes using the xlabel() and
ylabel() functions.
Output
The resulting density plot shows the probability density function of mean radius values in the dataset. We
can see that the data is roughly normally distributed, with a peak around 12-14.
Example
sns.kdeplot(data.data[data.target == 0, 0], fill=True, label='Malignant')
sns.kdeplot(data.data[data.target == 1, 0], fill=True, label='Benign')
plt.xlabel('Mean Radius')
plt.legend()
plt.show()
In this code, we have used the kdeplot() function twice to create two density plots of the mean radius
feature, one for the malignant samples and one for the benign samples. We have set the fill parameter
to True to shade the area under each curve (older Seaborn releases called this parameter shade), and we have
added labels to the plots using the label parameter. We have also added a legend to the plot using the legend()
function.
Output
On executing this code, you will get the following plot as the output -
The resulting density plot shows the probability density functions of mean radius values for both the
malignant and benign samples. We can see that the probability density function for the malignant
samples is shifted to the right, indicating a higher mean radius value.
Box and Whisker Plots
A boxplot is a graphical representation of a dataset that displays the five-number summary of the data - the
minimum value, the first quartile, the median, the third quartile, and the maximum value.
The boxplot consists of a box with whiskers extending from the top and bottom of the box.
1. The box represents the interquartile range (IQR) of the data, which is the range between the
first and third quartiles.
2. The whiskers extend from the top and bottom of the box to the highest and lowest values that
lie within 1.5 times the IQR of the box edges.
Any values that fall outside this range are considered outliers and are represented as points beyond the
whiskers.
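The numbers a boxplot draws can be computed directly, as in the following sketch. The ten values are made up, and NumPy's default percentile interpolation is assumed; plotting libraries may compute quartiles slightly differently.

```python
import numpy as np

values = np.array([4.3, 4.8, 5.0, 5.1, 5.4, 5.8, 6.1, 6.4, 6.9, 9.5])  # made-up data

q1, median, q3 = np.percentile(values, [25, 50, 75])
iqr = q3 - q1                       # height of the box
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Whiskers end at the most extreme data points that are still inside the fences
inside = values[(values >= lower_fence) & (values <= upper_fence)]
whisker_low, whisker_high = inside.min(), inside.max()

# Anything beyond the fences is drawn as an individual outlier point
outliers = values[(values < lower_fence) | (values > upper_fence)]

print(q1, median, q3, whisker_low, whisker_high, outliers)
```

Here the largest value lies beyond the upper fence, so the upper whisker stops at the next-largest value and the largest is reported as an outlier.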
Example
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
iris = load_iris()
data = iris.data
Next, we can create a boxplot of the sepal length for each of the three iris species using the Seaborn library.
Output
This code will produce a boxplot of the sepal length for each of the three iris species, with the x-axis
representing the species and the y-axis representing the sepal length in centimeters.
[Figure: boxplot of sepal length for each iris species; x-axis: Species]
From this boxplot, we can see that the setosa species has a shorter sepal length compared to the versicolor
and virginica species, which have a similar median and range of sepal lengths. Additionally, we can see
that there are no outliers in the setosa species, but there are a few outliers in the versicolor and virginica
species.
Correlation Matrix Plot
A correlation matrix plot is a graphical representation of the pairwise correlation between variables in a
dataset. The plot consists of a matrix of scatterplots and correlation coefficients, where each scatterplot
represents the relationship between two variables, and the correlation coefficient indicates the strength of
the relationship. The diagonal of the matrix usually shows the distribution of each variable.
The correlation coefficient is a measure of the linear relationship between two variables and ranges from
-1 to 1. A coefficient of 1 indicates a perfect positive correlation, where an increase in one variable is
associated with an increase in the other variable. A coefficient of -1 indicates a perfect negative correlation,
where an increase in one variable is associated with a decrease in the other variable. A coefficient of 0
indicates no linear correlation between the variables.
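The coefficient itself is straightforward to compute. The sketch below uses a hand-written `pearson` helper (a hypothetical name) on tiny made-up arrays and compares it with NumPy's built-in `np.corrcoef`.

```python
import numpy as np

def pearson(x, y):
    """Pearson correlation: covariance divided by the product of the spreads."""
    xc, yc = x - x.mean(), y - y.mean()
    return (xc * yc).sum() / np.sqrt((xc**2).sum() * (yc**2).sum())

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
print(pearson(x, 2 * x + 1))      # perfectly increasing relationship
print(pearson(x, -3 * x))         # perfectly decreasing relationship

# np.corrcoef returns the full matrix; by default each row is one variable
features = np.vstack([x, 2 * x + 1, -3 * x])
corr_matrix = np.corrcoef(features)
print(corr_matrix.round(2))
```

A correlation matrix plot is essentially this matrix drawn as a grid of colored cells.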
Example
sns. set(style='
Output
This code will produce a correlation matrix plot of the Iris dataset, with each square representing the
correlation coefficient between two variables.
[Figure: correlation matrix heatmap of the Iris features; axes: sepal length (cm), sepal width (cm), petal length (cm), petal width (cm)]
From this plot, we can see that the variables 'sepal width (cm)' and 'petal length (cm)' have a moderate
negative correlation (-0.37), while the variables 'petal length (cm)' and 'petal width (cm)' have a strong
positive correlation (0.96). We can also see that the variable 'sepal length (cm)' has a strong positive
correlation (0.87) with the variable 'petal length (cm)'.
Scatter Matrix Plot
A Scatter Matrix Plot is a graphical representation of the relationship between multiple variables. It is a
useful tool in machine learning for visualizing the correlation between features in a dataset. This plot is also
known as a Pair Plot, and it is used to identify the correlation between two or more variables in a dataset.
A Scatter Matrix Plot displays the scatter plot of each pair of features in a dataset. Each scatter plot
represents the relationship between two variables. The diagonal of the matrix can be used to show
the distribution of each variable.
We will use the Seaborn library to implement the Scatter Matrix Plot. Seaborn is a Python data
visualization library that is built on top of the Matplotlib library.
Example
Below is the Python code to implement the Scatter Matrix Plot -
import matplotlib.pyplot as plt
import seaborn as sns
iris = sns.load_dataset('iris')
sns.pairplot(iris, hue='species')
plt.show()
In this code, we first import the necessary libraries, Seaborn and Matplotlib. Then, we load the Iris dataset
using the sns.load_dataset() function, which fetches it from Seaborn's online collection of example datasets.
Next, we create the Scatter Matrix Plot using the sns.pairplot() function. The hue parameter is used to
specify the column in the dataset that should be used for color encoding. In this case, we use the species
column to color the points according to the species of each sample.
Output
The output of this code will be a Scatter Matrix Plot that shows the scatter plots of each pair of features in
the Iris dataset.
[Figure: scatter matrix (pair plot) of the Iris features, colored by species]
Notice that each scatter plot is color-coded according to the species of each sample.
Descriptive Statistics
Descriptive statistics is a branch of statistics that deals with the summary and analysis of data. It includes
measures such as mean, median, mode, variance, and standard deviation. These measures help us
understand the central tendency, variability, and distribution of the data.
In machine learning, descriptive statistics can be used to summarize the data, identify outliers, and detect
patterns. For example, we can use the mean and standard deviation to describe the distribution of a
dataset.
In Python, we can calculate descriptive statistics using libraries such as NumPy and Pandas. Below is an
example -
Example
import pandas as pd
data = [1, 2, 3, 4, 5]
df = pd.DataFrame(data, columns=["Values"])
print(df.describe())
Output
This will output a summary of the dataset, including the count, mean, standard deviation, minimum,
quartiles, and maximum values as follows -
Values
count 5.000000
mean 3.000000
std 1.581139
min 1.000000
25% 2.000000
50% 3.000000
75% 4.000000
max 5.000000
Inferential Statistics
Inferential statistics is a branch of statistics that deals with making predictions and inferences about a
population based on a sample of data. It involves using hypothesis testing, confidence intervals, and
regression analysis to draw conclusions about the data.
In machine learning, inferential statistics can be used to make predictions about new data based on
existing data. For example, we can use regression analysis to predict the price of a house based on its features,
such as the number of bedrooms and bathrooms.
In Python, we can perform inferential statistics using libraries such as Scikit-Learn and StatsModels.
Below is an example -
Example
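import statsmodels.api as sm
import numpy as np
X = np.array([1, 2, 3, 4, 5])
y = np.array([2, 4, 6, 8, 10])
model = sm.OLS(y, X).fit()  # no intercept term; use sm.add_constant(X) to include one
print(model.summary())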
Output
This will output a summary of the regression model, including the coefficients, standard errors,
t-statistics, and p-values.
Mean
The "mean" is the average value of a dataset. It is calculated by adding up all the values in the dataset and
dividing by the number of observations. The mean is a useful measure of central tendency, but it is
sensitive to outliers, meaning that extreme values can significantly affect the value of the mean.
In Python, we can calculate the mean using the NumPy library, which provides a function called mean().
Median
The "median" is the middle value in a dataset. It is calculated by arranging the values in the dataset in order
and finding the value that lies in the middle. If there is an even number of values in the dataset, the
median is the average of the two middle values.
The median is a useful measure of central tendency because it is robust to outliers, meaning that
extreme values do not significantly affect the value of the median.
In Python, we can calculate the median using the NumPy library, which provides a function called
median().
Mode
The "mode" is the most common value in a dataset. It is calculated by finding the value that occurs most
frequently in the dataset. If several values share the highest frequency, the dataset is
said to be bimodal, trimodal, or multimodal.
The mode is a useful measure of central tendency because it can identify the most common value in a
dataset. However, it is not a good measure of central tendency for datasets with a wide range of values or
datasets with no repeating values.
In Python, we can calculate the mode using the SciPy library, which provides a function called mode().
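The different sensitivity of these three measures to outliers can be seen with a short sketch. It uses Python's built-in `statistics` module rather than the libraries named above, and the salary figures are made up.

```python
import statistics

salaries = [40, 42, 42, 45, 48]           # made-up values, in thousands
with_outlier = salaries + [500]           # the same data plus one extreme value

# The mean is pulled strongly towards the extreme value
print(statistics.mean(salaries), statistics.mean(with_outlier))

# The median barely moves
print(statistics.median(salaries), statistics.median(with_outlier))

# The mode is unchanged
print(statistics.mode(salaries), statistics.mode(with_outlier))

# multimode() returns every value that shares the highest frequency
print(statistics.multimode([1, 1, 2, 2, 3]))
```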
Python Implementation
Let's see an example of calculating mean, median, and mode for a salary table in Python using NumPy and
Pandas -
import numpy as np
import pandas as pd
# create a sample salary table
salary = pd.DataFrame({
    'salary': [50000, 65000, 55000, 45000, 70000, 60000, 55000, 45000]
})
# calculate mean
mean_salary = np.mean(salary['salary'])
print('Mean salary:', mean_salary)
# calculate median
median_salary = np.median(salary['salary'])
print('Median salary:', median_salary)
# calculate mode
mode_salary = salary['salary'].mode()[0]
print('Mode salary:', mode_salary)
Output
On executing this code, you will get the following output -
Standard Deviation
Standard deviation is a measure of the amount of variation or dispersion of a set of data values around
their mean. In machine learning, it is an important statistical concept that is used to describe the spread or
distribution of a dataset.
Standard deviation is calculated as the square root of the variance, which is the average of the squared
differences from the mean. The formula for calculating standard deviation is as follows -
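$$\sigma = \sqrt{\frac{\sum (x - \mu)^2}{N}}$$
Where -
• σ is the standard deviation
• Σ denotes the sum over all data values
• x is each data value
• μ is the mean of the data
• N is the number of data values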
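The formula can be checked by computing it step by step and comparing the result with NumPy. The data values below are illustrative.

```python
import numpy as np

data = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])   # illustrative values

mean = data.mean()
# Square root of the average squared difference from the mean
population_std = np.sqrt(((data - mean) ** 2).sum() / len(data))

print(population_std)
print(np.std(data))            # NumPy's default (ddof=0) divides by N
print(np.std(data, ddof=1))    # ddof=1 divides by N - 1 (sample standard deviation)
```

The distinction matters when comparing libraries: NumPy's `std()` divides by N by default, whereas the `std()` method in Pandas divides by N - 1 by default, so the two give slightly different numbers for the same data.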
In machine learning, standard deviation is used to understand the variability of a dataset and to detect
outliers. For example, in finance, standard deviation is used to measure the volatility of stock prices. In
image processing, standard deviation can be used to detect image noise.
Types of Examples
Example 1
In this example, we will be using the NumPy library to calculate the standard deviation -
std_dev = np.std(data)
Output
It will produce the following output -
Example 2
Let's see another example in which we will calculate the standard deviation of each column in Iris flower
dataset using Python and Pandas library -
import pandas as pd
iris_df = pd.read_csv('[Link]',
                      names=['sepal length', 'sepal width', 'petal length', 'petal width', 'class'])
std_devs = iris_df.std(numeric_only=True)  # skip the non-numeric class column
print('Standard deviations:')
print(std_devs)
In this example, we load the Iris dataset from the UCI Machine Learning Repository using Pandas'
read_csv() method. We then calculate the standard deviation of each numeric column using the std() method of the
Pandas dataframe. Finally, we print the standard deviations for each column.
Output
On executing the code, you will get the following output -
Standard deviations:
sepal length 0.828066
sepal width 0.433594
petal length 1.764420
petal width 0.763161
dtype: float64
This example demonstrates how standard deviation can be used to understand the variability of a dataset.
In this case, we can see that the standard deviation of the 'petal length' column is much higher than that of
the other columns, which suggests that this feature may be more variable and potentially more
informative for classification tasks.
Percentiles
Percentiles are a statistical concept used in machine learning to describe the distribution of a dataset. A
percentile is a measure that indicates the value below which a given percentage of observations in a group
of observations falls.
For example, the 25th percentile (also known as the first quartile) is the value below which 25% of the
observations in the dataset fall, while the 75th percentile (also known as the third quartile) is the value
below which 75% of the observations in the dataset fall.
Percentiles can be used to summarize the distribution of a dataset and identify outliers. In machine
learning, percentiles are often used in data preprocessing and exploratory data analysis to gain insights into the
data.
Python provides several libraries for calculating percentiles, including NumPy and Pandas.
Example
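import numpy as np
data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])  # illustrative sample data
print('25th percentile:', np.percentile(data, 25))
print('75th percentile:', np.percentile(data, 75))
In this example, we create a sample dataset using NumPy and then calculate the 25th and 75th percentiles
using the np.percentile() function.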
Output
The output shows the values of the percentiles for the dataset.
Example
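import pandas as pd
data = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])  # illustrative sample data
print('25th percentile:', data.quantile(0.25))
print('75th percentile:', data.quantile(0.75))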
In this example, we create a Pandas series object and then calculate the 25th and 75th percentiles using the
quantile() method of the series object.
Output
The output shows the values of the percentiles for the dataset.
In machine learning, data distribution refers to the way in which data points are distributed or spread out
across a dataset. It is important to understand the distribution of data in a dataset, as it can have a
significant impact on the performance of machine learning algorithms.
Data distribution can be characterized by several statistical measures, including mean, median, mode,
standard deviation, and variance. These measures help to describe the central tendency, spread, and shape
of the data.
Some common types of data distribution in machine learning are given below -
Normal Distribution
Normal distribution, also known as Gaussian distribution, is a continuous probability distribution that
is widely used in machine learning and statistics. It is a bell-shaped curve that describes the probability
distribution of a random variable that is symmetric around the mean. The normal distribution has two
parameters, the mean (μ) and the standard deviation (σ).
In machine learning, normal distribution is often used to model the distribution of error terms in linear
regression and other statistical models. It is also used as a basis for various hypothesis tests and confidence
intervals.
One important property of normal distribution is the empirical rule, also known as the 68-95-99.7 rule.
This rule states that approximately 68% of the observations fall within one standard deviation of the
mean, 95% of the observations fall within two standard deviations of the mean, and 99.7% of the
observations fall within three standard deviations of the mean.
Python provides various libraries that can be used to work with normal distributions. One such library is
scipy.stats, which provides functions for calculating the probability density function (PDF), cumulative
distribution function (CDF), percent point function (PPF), and random variables for normal distribution.
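The three percentages in the empirical rule can be reproduced with the error function from Python's standard `math` module, since for a normal distribution the probability of falling within k standard deviations of the mean equals erf(k / √2). The helper name `prob_within` is chosen for this sketch.

```python
import math

def prob_within(k):
    """P(|X - mu| < k * sigma) for a normally distributed X."""
    return math.erf(k / math.sqrt(2))

for k in (1, 2, 3):
    print(k, round(prob_within(k) * 100, 2))
```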
Example
Here is an example of using [Link] to generate and visualize a normal distribution -
import numpy as np
from scipy.stats import norm
import matplotlib.pyplot as plt
In this example, we first generate a random sample of 1000 values from a normal distribution with mean
0 and standard deviation 1 using np.random.normal. We then use norm.pdf to calculate the PDF for the
normal distribution and np.linspace to generate an array of 100 evenly spaced values between μ - 3σ and
μ + 3σ.
Finally, we plot the histogram of the random sample using plt.hist and overlay the PDF of the normal
distribution using plt.plot.
Output
The resulting plot shows the bell-shaped curve of the normal distribution and the histogram of the
random sample that approximates the normal distribution.
[Figure: histogram of the random sample with the normal PDF overlaid]
Skewed Distribution
A skewed distribution in machine learning refers to a dataset that is not evenly distributed around its
mean, or average value. In a skewed distribution, the majority of the data points tend to cluster towards
one end of the distribution, with a smaller number of data points at the other end.
There are two types of skewed distributions: left-skewed and right-skewed. A left-skewed distribution,
also known as a negative-skewed distribution, has a long tail towards the left side of the distribution, with
the majority of data points towards the right side. In contrast, a right-skewed distribution, also known as
a positive-skewed distribution, has a long tail towards the right side of the distribution, with the majority
of data points towards the left side.
Skewed distributions can occur in many different types of datasets, such as financial data, social media
metrics, or healthcare records. In machine learning, it is important to identify and handle skewed
distributions appropriately, as they can affect the performance of certain algorithms and models. For example,
skewed data can lead to biased predictions and inaccurate results in some cases and may require
preprocessing techniques such as normalization or data transformation to improve the performance of the
model.
Example
Here is an example of generating and plotting a skewed distribution using Python's NumPy and Matplotlib
libraries -
import numpy as np
import matplotlib.pyplot as plt
data = [Link](2,
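A minimal sketch of a right-skewed sample, assuming a gamma distribution with shape 2 and scale 1 (these parameters, the seed and the sample size are assumptions made for illustration). Instead of plotting, it prints three summary values that reveal the skew -

```python
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(0)
# Gamma distribution with shape 2 and scale 1 (parameters assumed for illustration).
data = rng.gamma(shape=2.0, scale=1.0, size=5000)

print("mean:    ", data.mean())
print("median:  ", np.median(data))
print("skewness:", skew(data))  # positive for a right-skewed sample
```

In a right-skewed sample the long right tail pulls the mean above the median, and the skewness is positive.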
Output
On executing this code, you will get the following plot as the output -
[Figure: histogram titled "Skewed Distribution"; x-axis: Value]
Uniform Distribution
A uniform distribution in machine learning refers to a probability distribution in which all possible
outcomes are equally likely to occur. In other words, each value in a dataset has the same probability of being
observed, and there is no clustering of data points around a particular value.
The uniform distribution is often used as a baseline for comparison with other distributions, as it
represents a random and unbiased sampling of the data. It can also be useful in certain types of applications,
such as generating random numbers or selecting items from a set without bias.
In probability theory, the probability density function of a continuous uniform distribution on the interval [a, b] is defined as -
$f(x) = \frac{1}{b - a}$ for $a \le x \le b$, and $f(x) = 0$ otherwise
Here, a and b are the minimum and maximum values of the distribution.
Example
In Python, the NumPy library provides functions for generating random numbers from a uniform
distribution, such as np.random.uniform(). These functions take as arguments the minimum and maximum
values of the distribution and can be used to generate datasets with a uniform distribution.
import numpy as np
import matplotlib.pyplot as plt
plt.title('Uniform Distribution')
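A minimal sketch of drawing a uniform sample and checking that it is spread evenly. The bounds, seed and sample size are assumptions made for illustration, and the bin counts are printed rather than plotted -

```python
import numpy as np

rng = np.random.default_rng(0)
low, high = 0.0, 1.0  # assumed minimum and maximum of the distribution
data = rng.uniform(low, high, 10000)

# With equal-width bins, each bin should hold roughly the same share of samples.
counts, _ = np.histogram(data, bins=10, range=(low, high))
shares = counts / counts.sum()
print(shares)
```

Each of the ten shares should be close to 0.1, which is what a flat histogram looks like in numbers.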
Output
It will produce the following plot as the output -
Bimodal Distribution
In machine learning, a bimodal distribution is a probability distribution that has two distinct modes or
peaks. In other words, the distribution has two regions where the data values are most likely to occur,
separated by a valley or trough where the data is less likely to occur.
Bimodal distributions can arise in various types of data, such as biometric measurements, economic
indicators, or social media metrics. They can represent different subpopulations within the dataset, or
different modes of behavior or trends over time.
Bimodal distributions can be identified and analyzed using various statistical methods, such as
histograms, kernel density estimations, or hypothesis testing. In some cases, bimodal distributions can be
fitted to specific probability distributions, such as the Gaussian mixture model, which allows for modeling
the underlying subpopulations separately.
Example
In Python, libraries such as NumPy, SciPy, and Matplotlib provide functions for generating and visualizing
bimodal distributions.
For example, the following code generates and plots a bimodal distribution -
import numpy as np
import matplotlib.pyplot as plt
plt.title('Bimodal Distribution')
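A minimal sketch of a bimodal sample and of fitting a two-component Gaussian mixture to it. The two subpopulations, N(-2, 1) and N(2, 1), as well as the seed and sample sizes, are assumptions made for illustration -

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Two subpopulations (assumed for illustration): N(-2, 1) and N(2, 1).
data = np.concatenate([rng.normal(-2, 1, 500), rng.normal(2, 1, 500)])

# Histogram counts show two peaks separated by a trough near zero.
counts, edges = np.histogram(data, bins=20)
print(counts)

# A two-component Gaussian mixture models the subpopulations separately.
gmm = GaussianMixture(n_components=2, random_state=0).fit(data.reshape(-1, 1))
means = np.sort(gmm.means_.ravel())
print("estimated component means:", means)
```

The estimated component means should land near the two peaks of the histogram.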
Output
On executing this code, you will get the following plot as the output -
[Figure: histogram titled "Bimodal Distribution"; x-axis: Value, y-axis: Frequency]
Skewness and Kurtosis
Kurtosis refers to the degree of peakedness and tail weight of a distribution. A distribution with high
kurtosis has a sharper peak and heavier tails than a normal distribution, while a distribution with low
kurtosis has a flatter peak and lighter tails. Measured as excess kurtosis (the default convention of SciPy's
kurtosis() function), the value can be positive, indicating a higher-than-normal peak and heavier tails, or
negative, indicating a lower-than-normal peak and lighter tails. An excess kurtosis of zero matches that of a
normal distribution.
Both skewness and kurtosis can have important implications for machine learning algorithms, as they can
affect the assumptions of the models and the accuracy of the predictions. For example, a highly skewed
distribution may require data transformation or the use of non-parametric methods, while a highly
kurtotic distribution may require different statistical models or more robust estimation methods.
Example
In Python, the SciPy library provides functions for calculating the skewness and kurtosis of a dataset. For
example, the following code calculates the skewness and kurtosis of a dataset using the skew() and
kurtosis() functions -
import numpy as np
from scipy.stats import skew, kurtosis
data = np.random.normal(0, 1, 1000)
skewness, kurt = skew(data), kurtosis(data)
print('Skewness:', skewness)
print('Kurtosis:', kurt)
This code generates a random dataset of 1000 samples from a normal distribution with mean 0 and
standard deviation 1. It then calculates the skewness and kurtosis of the dataset using the skew() and
kurtosis() functions from the SciPy library. Finally, it prints the results to the console.
Output
On executing this code, you will get the following output -
Skewness: -0.04119418903611285
Kurtosis: -0.1152250196054534
The resulting skewness and kurtosis values should be close to zero for a normal distribution.
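A data transformation can reduce skewness. The following is a minimal sketch, assuming a lognormal sample (chosen here because its logarithm is exactly normal); the seed and sample size are arbitrary -

```python
import numpy as np
from scipy.stats import skew, kurtosis

rng = np.random.default_rng(0)
skewed = rng.lognormal(mean=0.0, sigma=1.0, size=5000)  # right-skewed sample
transformed = np.log(skewed)  # the log of a lognormal sample is normal

print("skewness before:", skew(skewed))
print("skewness after: ", skew(transformed))
print("excess kurtosis after:", kurtosis(transformed))
```

After the log transform, both values should be close to zero, as expected for normally distributed data.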
Bias and Variance
Bias and variance are two important concepts in machine learning that describe the sources of error in a
model's predictions. Bias refers to the error that results from oversimplifying the underlying relationship
between the input features and the output variable, while variance refers to the error that results from
being too sensitive to fluctuations in the training data.
In machine learning, we strive to minimize both bias and variance in order to build a model that can
accurately predict on unseen data. A model with high bias may be too simplistic and underfit the training
data, while a model with high variance may overfit the training data and fail to generalize to new data.
Example
Below is an implementation example in Python that illustrates how bias and variance can be analyzed
using the Boston Housing dataset. Note that load_boston was removed in scikit-learn 1.2, so this listing
requires an older release -
import pandas as pd
from sklearn.datasets import load_boston
boston = load_boston()
from sklearn.model_selection import train_test_split
lr = LinearRegression()
Output
The output shows the training and testing mean squared errors (MSE) of the linear regression model. The
training MSE is 21.64 and the testing MSE is 24.29, indicating that the model has a moderate level of bias
and variance.
Training MSE: 21.641412753226312
Testing MSE: 24.291119474973456
Example
Let's try a polynomial regression model -
from sklearn.preprocessing import PolynomialFeatures
poly = PolynomialFeatures(degree=2)
X_train_poly = poly.fit_transform(X_train)
X_test_poly = poly.transform(X_test)
pr = LinearRegression()
pr.fit(X_train_poly, y_train)
train_preds = pr.predict(X_train_poly)
train_mse = mean_squared_error(y_train, train_preds)
print("Training MSE:", train_mse)
test_preds = pr.predict(X_test_poly)
test_mse = mean_squared_error(y_test, test_preds)
print("Testing MSE:", test_mse)
Output
The output shows the training and testing MSE of the polynomial regression model with degree=2. The
training MSE is 5.31 and the testing MSE is 14.18, indicating that the model has a lower bias but higher
variance compared to the linear regression model.
Training MSE: 5.31446956670908
Testing MSE: 14.183558207567042
Example
To reduce variance, we can use regularization techniques such as ridge regression or lasso regression. In
the following example, we will be using ridge regression -
from sklearn.linear_model import Ridge
ridge = Ridge(alpha=1)
ridge.fit(X_train_poly, y_train)
train_preds = ridge.predict(X_train_poly)
train_mse = mean_squared_error(y_train, train_preds)
print("Training MSE:", train_mse)
test_preds = ridge.predict(X_test_poly)
test_mse = mean_squared_error(y_test, test_preds)
print("Testing MSE:", test_mse)
Output
The output shows the training and testing MSE of the ridge regression model with alpha=1. The training
MSE is 9.03 and the testing MSE is 13.88, indicating that, compared to the polynomial regression model, this
model has a lower variance but slightly higher bias.
Training MSE: 9.03220937860839
Testing MSE: 13.882093755326755
Example
We can further tune the hyperparameter alpha to find the optimal balance between bias and variance.
Let's see an example -
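The tuning step can be sketched as follows. This is a minimal sketch on synthetic data (an assumption made so that the block is self-contained), so its numbers will differ from the values quoted in this section. The candidate alpha values and the name results are also chosen here for illustration -

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures

# Synthetic data (an assumption; not the dataset used in this chapter).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X[:, 0] ** 2 + 2 * X[:, 1] + rng.normal(scale=0.5, size=200)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

poly = PolynomialFeatures(degree=2)
X_train_poly = poly.fit_transform(X_train)
X_test_poly = poly.transform(X_test)

# Fit one ridge model per candidate alpha and record both errors.
results = {}
for alpha in [0.01, 0.1, 1, 10, 100]:
    ridge = Ridge(alpha=alpha).fit(X_train_poly, y_train)
    train_mse = mean_squared_error(y_train, ridge.predict(X_train_poly))
    test_mse = mean_squared_error(y_test, ridge.predict(X_test_poly))
    results[alpha] = (train_mse, test_mse)
    print(f"alpha={alpha}: train MSE={train_mse:.3f}, test MSE={test_mse:.3f}")

best_alpha = min(results, key=lambda a: results[a][1])
print("alpha with the lowest testing MSE:", best_alpha)
```

A larger alpha shrinks the coefficients more strongly, so the training MSE rises as alpha grows. In practice the value of alpha would be chosen on a separate validation set or with cross-validation rather than on the test set, so that the final test error remains an unbiased estimate.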
Output
The output shows the training and testing MSE of the ridge regression model with the optimal alpha value.
The training MSE is 8.32 and the testing MSE is 12.87, indicating that the model has a good balance
between bias and variance.
Hypothesis
In machine learning, a hypothesis is a proposed explanation or solution for a problem. It is a tentative
assumption or idea that can be tested and validated using data. In supervised learning, the hypothesis is
the model that the algorithm is trained on to make predictions on unseen data.
The hypothesis is generally expressed as a function that maps input data to output labels. In other words,
it defines the relationship between the input and output variables. The goal of machine learning is to find
the best possible hypothesis that can generalize well to unseen data.
The process of finding the best hypothesis is called model training or learning. During the training process,
the algorithm adjusts the model parameters to minimize the error or loss function, which measures the
difference between the predicted output and the actual output.
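The training process described above can be sketched with gradient descent on a linear hypothesis h(x) = w*x + b. This is a minimal sketch; the toy data, learning rate and number of iterations are assumptions made for illustration -

```python
import numpy as np

# Toy data generated from y = 3x + 2 plus noise (values are illustrative).
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 100)
y = 3 * x + 2 + rng.normal(scale=0.1, size=100)

w, b = 0.0, 0.0  # parameters of the hypothesis h(x) = w*x + b
learning_rate = 0.1

for _ in range(1000):
    error = (w * x + b) - y  # predicted output minus actual output
    # Gradients of the mean squared error loss with respect to w and b.
    grad_w = 2 * np.mean(error * x)
    grad_b = 2 * np.mean(error)
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b

loss = np.mean(((w * x + b) - y) ** 2)
print(f"w={w:.2f}, b={b:.2f}, loss={loss:.4f}")
```

Each iteration moves the parameters a small step in the direction that reduces the loss, so w and b approach the values used to generate the data.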
Once the model is trained, it can be used to make predictions on new data. However, it is important to
evaluate the performance of the model before using it in the real world. This is done by testing the model
on a separate validation set or using cross-validation techniques.
There are many types of machine learning algorithms that can be used to generate hypotheses, including
linear regression, logistic regression, decision trees, support vector machines, neural networks, and more.
Regression Analysis
Regression is a type of supervised learning in machine learning. The key objective of regression-based
tasks is to predict output labels or responses, which are continuous numeric values, for the given input
data. The output is based on what the model has learned in the training phase.
Basically, regression models use the input data features (independent variables) and their corresponding
continuous numeric output values (dependent or outcome variables) to learn the specific association between
inputs and corresponding outputs.
[Figure: X, the input variables (independent in nature), and Y, the output variables (dependent on the input)]
Simple regression model - This is the most basic regression model in which predictions are formed from
a single, univariate feature of the data.
Multiple regression model - As the name implies, in this regression model the predictions are formed from
multiple features of the data.
In the following example, we will build a basic regression model that fits a line to the data, i.e. a linear
regressor. The necessary steps for building a regressor in Python are as follows -
input = r'[Link]'
Next, we need to load this data. We are using the np.loadtxt function to load it.
reg_linear = linear_model.LinearRegression()
reg_linear.fit(X_train, y_train)
y_test_pred = reg_linear.predict(X_test)
Output
In the above output, we can see the regression line between the data points.
Output
Regressor model performance:
Mean absolute error(MAE) = 1.78
Mean squared error(MSE) = 3.89
Median absolute error = 2.01
Explained variance score = -0.09
R2 score = -0.09
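The metrics listed above are available in sklearn.metrics. The following is a minimal sketch with small hand-made arrays (the values are assumptions chosen so that the results are easy to verify by hand) -

```python
import numpy as np
from sklearn.metrics import (explained_variance_score, mean_absolute_error,
                             mean_squared_error, median_absolute_error,
                             r2_score)

y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.0, 5.0, 8.0, 11.0])

mae = mean_absolute_error(y_true, y_pred)
mse = mean_squared_error(y_true, y_pred)
medae = median_absolute_error(y_true, y_pred)
evs = explained_variance_score(y_true, y_pred)
r2 = r2_score(y_true, y_pred)
print(mae, mse, medae, evs, r2)

# Always predicting the mean of y_true gives an R2 score of exactly 0.
baseline = np.full_like(y_true, y_true.mean())
r2_baseline = r2_score(y_true, baseline)

# Predictions worse than that baseline give a negative R2 score.
r2_bad = r2_score(y_true, y_true[::-1])
print(r2_baseline, r2_bad)
```

A negative R2 score, such as the -0.09 reported above, therefore means that the fitted line predicts the test data slightly worse than simply predicting the mean of the test targets.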
Applications
The applications of ML regression algorithms are as follows -
Forecasting or Predictive analysis - One of the important uses of regression is forecasting or predictive
analysis. For example, we can forecast GDP, oil prices or, in simple words, any quantitative data that changes
with the passage of time.
Optimization - We can optimize business processes with the help of regression. For example, a store
manager can create a statistical model to understand the peak times of customer arrivals.
Error correction - In business, taking the correct decision is as important as optimizing the business
process. Regression can help us take the correct decision as well as correct an already implemented
decision.
Economics - It is the most used tool in economics. We can use regression to predict supply, demand,
consumption, inventory investment, etc.
Finance - A financial company is always interested in minimizing the risk of its portfolio and wants to know
the factors that affect its customers. All these can be predicted with the help of a regression model.
Linear Regression
Linear regression may be defined as a statistical model that analyzes the linear relationship between a
dependent variable and a given set of independent variables. A linear relationship between variables means
that when the value of one or more independent variables changes (increases or decreases), the value of the
dependent variable also changes accordingly (increases or decreases).
Mathematically, the relationship can be represented with the help of the following equation -
Y = mX + b
Here, Y is the dependent variable we are trying to predict, X is the independent variable we are using to make
predictions, m is the slope of the regression line, which represents the effect X has on Y, and b is a constant
known as the Y-intercept.
Furthermore, the linear relationship can be positive or negative in nature as explained below -
Positive Linear Relationship
A linear relationship is called positive if the dependent variable increases as the independent variable
increases. It can be
understood with the help of following graph -
Assumptions
The following are some assumptions about the dataset that are made by the linear regression model -
Multi-collinearity - The linear regression model assumes that there is very little or no multi-collinearity in
the data. Basically, multi-collinearity occurs when the independent variables or features are dependent on
one another.
Auto-correlation - Another assumption is that there is very little or no auto-correlation in the data.
Basically, auto-correlation occurs when there is dependency between residual errors.
Relationship between variables - The linear regression model assumes that the relationship between the
response and feature variables is linear.
Simple Linear Regression
Simple linear regression is a type of regression analysis in which a single independent variable (also known
as a predictor variable) is used to predict the dependent variable. In other words, it models the linear
relationship between the dependent variable and a single independent variable.
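For a single independent variable, the least-squares slope and intercept of the line Y = mX + b have a closed form. The following is a minimal sketch with made-up data points (an assumption for illustration), cross-checked against np.polyfit -

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.1, 5.9, 8.2, 9.8])

# Least-squares estimates for Y = mX + b:
# m = sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
# b = mean(y) - m * mean(x)
m = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b = y.mean() - m * x.mean()

# np.polyfit with degree 1 fits the same line and should agree.
m_ref, b_ref = np.polyfit(x, y, 1)
print(f"m={m:.3f}, b={b:.3f}")
```

For these points the slope works out to 19.7 / 10 = 1.97 and the intercept to 6.0 - 1.97 * 3 = 0.09.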
Python Implementation
Given below is an example that shows how to implement simple linear regression using the Diabetes dataset
bundled with scikit-learn. We will also plot the regression line.
Data Preparation
First, we need to import the Diabetes dataset from scikit-learn and split it into training and testing sets. We
will use 80% of the data for training the model and the remaining 20% for testing.
diabetes = load_diabetes()
X_train = X_train.reshape(-1, 1)
X_test = X_test.reshape(-1, 1)
Here, we are using the third feature (column) of the dataset, which represents the body mass index (BMI), as
our independent variable (predictor variable) and the target variable as our dependent variable (response
variable).
Model Training
We will use scikit-learn's LinearRegression class to train a simple linear regression model on the training
data. The code for this is as follows -
Model Testing
Once the model is trained, we can use it to make predictions on the test data. The code for this is as follows -
y_pred = lr_model.predict(X_test)
Here, X_test represents the input feature of the test data and y_pred represents the predicted output
variable (target variable).
Model Evaluation
We need to evaluate the performance of the model to determine its accuracy. We will use the mean squared
error (MSE) and the coefficient of determination (R^2) as evaluation metrics. The code for this is as follows -
Here, y_test represents the actual output variable of the test data.
Here, we are using the scatter() function from the matplotlib library to plot the training data points and
the plot() function to plot the regression line. The xlabel() and ylabel() functions are used to label the x-
axis and y-axis of the plot, respectively. Finally, we use the show() function to display the plot.
diabetes = load_diabetes()
lr_model = LinearRegression()
y_pred = lr_model.predict(X_test)
r2 = r2_score(y_test, y_pred)
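The steps above can be combined into one runnable sketch. It assumes the third column (index 2) of scikit-learn's Diabetes dataset, which scikit-learn documents as body mass index, and an arbitrary random_state. Plotting is left out, so only the metrics are printed -

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

diabetes = load_diabetes()
X = diabetes.data[:, 2].reshape(-1, 1)  # third column (index 2): body mass index
y = diabetes.target

# 80% of the data for training, 20% for testing (random_state is an assumption).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

lr_model = LinearRegression()
lr_model.fit(X_train, y_train)
y_pred = lr_model.predict(X_test)

mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print("Mean Squared Error:", mse)
print("Coefficient of Determination:", r2)
```

The exact numbers depend on the random split. The regression line can be drawn by passing X_test and y_pred to the plot() function after a scatter() of the data points.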
Output
On executing this code, you will get the following plot as the output and it will also print the Mean Squared
Error and the Coefficient of Determination on the terminal -
Multiple Linear Regression
Multiple linear regression is the extension of simple linear regression that predicts a response using two or
more features. Mathematically we can explain it as follows -
Consider a dataset having n observations, p features (independent variables) and y as the response
(dependent variable). The regression line for p features can be calculated as follows -
$h(x_i) = b_0 + b_1 x_{i1} + b_2 x_{i2} + \dots + b_p x_{ip}$
Here, $h(x_i)$ is the predicted response value and $b_0, b_1, b_2, \dots, b_p$ are the
regression coefficients.
The model also includes the error in the data, known as the residual error $e_i$, so that -
$y_i = h(x_i) + e_i$ or $e_i = y_i - h(x_i)$
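The regression coefficients can be estimated by least squares. The following is a minimal sketch on synthetic data using np.linalg.lstsq; the true coefficients, noise level and sample size are assumptions made for illustration -

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 3
X = rng.normal(size=(n, p))
true_b = np.array([1.0, 2.0, -1.0, 0.5])  # b_0 (intercept), b_1..b_p (illustrative)
y = true_b[0] + X @ true_b[1:] + rng.normal(scale=0.1, size=n)

# Prepend a column of ones so that b_0 is estimated with the other coefficients.
X_design = np.column_stack([np.ones(n), X])
b, *_ = np.linalg.lstsq(X_design, y, rcond=None)

h = X_design @ b  # predicted responses h(x_i)
e = y - h         # residual errors e_i = y_i - h(x_i)
print("estimated coefficients:", b)
print("mean residual:", e.mean())
```

Because the model contains an intercept, the least-squares residuals average to zero up to floating-point error.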
Python Implementation
To implement multiple linear regression in Python using Scikit-Learn, we can use the same
LinearRegression class as in simple linear regression, but this time we need to provide multiple independent
variables as input.
Let's consider the Boston Housing dataset from Scikit-Learn and implement multiple linear regression
using it.
Example
lr_model = LinearRegression()
# Make predictions on the test data
y_pred = lr_model.predict(X_test)
plt.scatter(y_test, y_pred)
plt.plot(x, y, color='red')
In this code, we first load the Boston Housing dataset using the load_boston() function from Scikit-Learn.
We then split the dataset into training and testing sets using the train_test_split() function.
Next, we create a LinearRegression object and fit it on the training data using the fit() method. We then
make predictions on the test data using the predict() method and calculate the mean squared error and
coefficient of determination using the mean_squared_error() and r2_score() functions, respectively.
Finally, we plot the predicted values against the actual values using the scatter() function and add a
regression line to the plot using the plot() function. We label the x-axis and y-axis using the xlabel() and
ylabel() functions and display the plot using the show() function.
Output
When you execute the program, it will produce the following plot as the output and it will print the Mean
Squared Error and the Coefficient of Determination on the terminal -
[Figure: scatter plot of predicted values against actual values; x-axis: Actual Values]
Polynomial Regression
Polynomial Linear Regression is a type of regression analysis in which the relationship between the
independent variable and the dependent variable is modeled as an n-th degree polynomial function.
Polynomial regression allows for a more complex relationship between the variables to be captured, beyond the
linear relationship in Simple and Multiple Linear Regression.
Python Implementation
Here's an example implementation of Polynomial Linear Regression using the Boston Housing dataset
from Scikit-Learn -
Example
# Transform the input data to include polynomial features
X_train_poly = poly.fit_transform(X_train)
X_test_poly = poly.transform(X_test)
lr_model = LinearRegression()
print('Coefficient of Determination:', r2)
Output
When you execute the program, it will produce the following plot as the output and it will print the Mean
Squared Error and the Coefficient of Determination on the terminal -
Let us now take a look at the steps involved in building a classification model -
Data Preparation
The first step is to collect and preprocess the data. This involves cleaning the data, handling missing
values, and converting categorical variables to numerical values.
Feature Extraction/Selection
The next step is to extract or select relevant features from the data. This is an important step because the
quality of the features can greatly impact the performance of the model. Some common feature selection
techniques include correlation analysis, feature importance ranking, and principal component analysis.
Model Selection
Once the features are selected, the next step is to choose an appropriate classification algorithm. There
are many different algorithms to choose from, each with its own strengths and weaknesses. Some popular
algorithms include logistic regression, decision trees, random forests, support vector machines, and
neural networks.
Model Training
After selecting a suitable algorithm, the next step is to train the model on the labeled training data. During
training, the model learns the mapping function between the input features and the target variable. The
model parameters are adjusted iteratively to minimize the difference between the predicted outputs and
the actual outputs.
Model Evaluation
Once the model is trained, the next step is to evaluate its performance on a separate set of validation data.
This is done to estimate the model's accuracy and generalization performance. Common evaluation
metrics include accuracy, precision, recall, F1-score, and area under the receiver operating characteristic (ROC)
curve.
Hyperparameter Tuning
In many cases, the performance of the model can be further improved by tuning its hyperparameters.
Hyperparameters are settings that are chosen before training the model and control aspects such as the
learning rate, regularization strength, and the number of hidden layers in a neural network. Grid search,
random search, and Bayesian optimization are some common techniques used for hyperparameter tuning.
Model Deployment
Once the model has been trained and evaluated, the final step is to deploy it in a production environment.
This involves integrating the model into a larger system, testing it on real-world data, and monitoring its
performance over time.
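The training, evaluation and hyperparameter tuning steps can be sketched with scikit-learn as follows. This is a minimal sketch, assuming the breast cancer dataset bundled with scikit-learn, a logistic regression model and a small grid of values for its regularization parameter C; all of these choices are assumptions made for illustration -

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Data preparation: load the data and hold out a test set.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y)

# Model selection: feature scaling followed by logistic regression.
pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Hyperparameter tuning: grid search with 5-fold cross-validation.
param_grid = {"logisticregression__C": [0.01, 0.1, 1, 10]}
search = GridSearchCV(pipeline, param_grid, cv=5)
search.fit(X_train, y_train)  # model training happens inside the search

# Model evaluation on data the model has not seen.
y_pred = search.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
print("best parameters:", search.best_params_)
print("accuracy:", accuracy, "F1-score:", f1)
```

Putting the scaler inside the pipeline means it is refitted on each cross-validation fold, so no information from the validation folds leaks into training.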
Lazy Learners
As the name suggests, these learners store the training data and wait for the testing data to appear.
Classification is done only after the testing data is received. They spend less time on training
but more time on predicting. Examples of lazy learners are K-nearest neighbor and case-based reasoning.
Eager Learners
In contrast to lazy learners, eager learners construct a classification model from the training data without
waiting for the testing data to appear. They spend more time on training but less time on
predicting. Examples of eager learners are Decision Trees, Naive Bayes and Artificial Neural Networks (ANN).
Building a Classifier in Python
Scikit-learn, a Python library for machine learning, can be used to build a classifier in Python. The steps for
building a classifier in Python are as follows -
import sklearn
from sklearn.datasets import load_breast_cancer
data = load_breast_cancer()
We also need to organize the data and it can be done with the help of following scripts -
label_names = data['target_names']
labels = data['target']
feature_names = data['feature_names']
features = data['data']
The following command will print the names of the labels, 'malignant' and 'benign' in the case of our dataset.
print(label_names)
['malignant' 'benign']
These labels are mapped to binary values 0 and 1. Malignant cancer is represented by 0 and Benign cancer
is represented by 1.
The feature names and the feature values of the samples can be seen with the help of the following commands -
print(feature_names[0])
The output of the above command is the name of the first feature -
mean radius
Similarly, the name of the second feature can be produced as follows -
print(feature_names[1])
The output of the above command is the name of the second feature -
mean texture
We can print the feature values of the first sample with the help of the following command -
print(features[0])
Similarly, we can print the feature values of the second sample -
print(features[1])
Each of these commands prints an array of 30 feature values.
The next command will split the data into training and testing sets. In this example, we are taking 40
percent of the data for testing and 60 percent of the data for training -
gnb = GaussianNB()
Next, with the help of the following command we can train the model -
Now, for evaluation purposes we need to make predictions. This can be done by using the predict() function as
follows -
preds = gnb.predict(test)
[1001100011101010111011011111101111110
1011011111111001111100110011100110010
1111110110000011111111001001001110110
1100011100110100110001110110010110100
1111111001111111111110111011011111100
0110101111011011101001111111101111101
001101]
The above series of 0s and 1s in the output are the predicted values for the Malignant and Benign tumor
classes.
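The steps above can be put together in one runnable sketch. The 40/60 split follows the text. The random_state and the variable names train, test, train_labels and test_labels are assumptions made for illustration -

```python
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

data = load_breast_cancer()
features, labels = data['data'], data['target']

# Hold out 40 percent of the samples for testing (random_state is an assumption).
train, test, train_labels, test_labels = train_test_split(
    features, labels, test_size=0.40, random_state=42)

gnb = GaussianNB()
gnb.fit(train, train_labels)  # train the model
preds = gnb.predict(test)     # predicted labels: 0 = malignant, 1 = benign

accuracy = accuracy_score(test_labels, preds)
print(preds)
print("Accuracy:", accuracy)
```

With 569 samples in the dataset, a 40 percent test split yields 228 predictions, which matches the length of the output shown above.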
The following are some of the important classification evaluation metrics among which you can choose
based upon your dataset and kind of problem -
1. Confusion Matrix - It is the easiest way to measure the performance of a classification
problem where the output can be of two or more types of classes.
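A confusion matrix can be computed with sklearn.metrics.confusion_matrix. The following is a minimal sketch with hand-made binary labels (the label values are assumptions chosen so that the counts are easy to verify) -

```python
from sklearn.metrics import accuracy_score, confusion_matrix

y_true = [0, 0, 1, 1, 1, 0, 1, 0]  # actual classes
y_pred = [0, 1, 1, 1, 0, 0, 1, 0]  # predicted classes

# Rows are actual classes and columns are predicted classes.
cm = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = cm.ravel()
accuracy = accuracy_score(y_true, y_pred)

print(cm)
print("TN, FP, FN, TP:", tn, fp, fn, tp)
print("accuracy:", accuracy)
```

Here three negatives and three positives are classified correctly, with one false positive and one false negative, so the accuracy is 6/8 = 0.75.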
Some commonly used classification algorithms are as follows -
1. Logistic Regression
2. K-Nearest Neighbors
3. Support Vector Machine (SVM)
4. Decision Tree
5. Naive Bayes
6. Random Forest
7. Stochastic Gradient Descent
Some important applications of classification are as follows -
1. Speech Recognition
2. Handwriting Recognition
3. Biometric Identification
4. Document Classification
In the subsequent chapters, we will discuss some of the most popular classification algorithms in machine
learning.
Logistic Regression
Logistic regression is a popular algorithm used for binary classification problems, where the target
variable is categorical with two classes. It models the probability of the target variable given the input features
and predicts the class with the highest probability.
Logistic regression is a type of generalized linear model, where the target variable follows a Bernoulli
distribution. The model consists of a linear function of the input features, which is transformed using the
logistic function to produce a probability value between 0 and 1.
The linear function is basically used as an input to another function such as g in the following relation -
$h_\theta(x) = g(\theta^T x)$, where $0 \le h_\theta(x) \le 1$
Here, g is the logistic or sigmoid function, $g(z) = \frac{1}{1 + e^{-z}}$, with $z = \theta^T x$.
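The logistic (sigmoid) function that turns the linear function into a probability can be sketched as follows. The parameter vector theta and the input x are made-up values for illustration, with x[0] = 1 multiplying the intercept -

```python
import numpy as np

def sigmoid(z):
    """Logistic function: maps any real number to a value between 0 and 1."""
    return 1.0 / (1.0 + np.exp(-z))

theta = np.array([0.5, -1.0, 2.0])  # illustrative parameters; theta[0] is the intercept
x = np.array([1.0, 2.0, 1.5])       # x[0] = 1 multiplies the intercept

z = theta @ x                       # linear function of the input features
probability = sigmoid(z)            # probability of the positive class
predicted_class = int(probability >= 0.5)
print(z, probability, predicted_class)
```

Here z = 0.5 - 2.0 + 3.0 = 1.5, so the predicted probability is above 0.5 and the sample is assigned to class 1.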
Implementation in Python
Now we will implement the above concept of logistic regression in Python. For this purpose, we are using
a multivariate flower dataset named 'iris'. The iris dataset is a well-known dataset in machine learning,
consisting of measurements of the sepal length, sepal width, petal length, and petal width of three
different species of iris flowers. We will use logistic regression to predict the species of an iris flower given its
measurements.
Let us now check the steps to implement logistic regression in Python using the iris dataset -
X = iris.data # input features
y = iris.target # target variable
Plot the Training Data
This is an optional step but for more clarification about the dataset we are plotting the training data as
follows -
clf.fit(X_train, y_train)
Make Predictions
Once the model is trained, we can use it to make predictions on the test set using the predict() method.
y_pred = clf.predict(X_test)
Here, we have used the average parameter with the value 'macro' to calculate the metrics for each class
separately and then take the average.
clf.fit(X_train, y_train)
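A complete, runnable sketch of these steps is given below. The test_size, random_state and max_iter values are assumptions made for illustration, and plotting is left out, so the exact scores may differ from those shown in the output -

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score)
from sklearn.model_selection import train_test_split

iris = load_iris()
X = iris.data    # input features
y = iris.target  # target variable

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

# average='macro' computes each metric per class and then takes the mean.
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred, average='macro')
recall = recall_score(y_test, y_pred, average='macro')
f1 = f1_score(y_test, y_pred, average='macro')
print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)
print("F1-score:", f1)
```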
Output
When you execute this code, it will print the following metrics and produce a plot of the training data -
Accuracy: 1.0
Precision: 1.0
Recall: 1.0
F1-score: 1.0
[Figure: scatter plot titled "Iris Training Data"; x-axis: Sepal length (cm), y-axis: Sepal width (cm)]
K-Nearest Neighbors (KNN)
KNN is a supervised learning algorithm that can be used for both classification and regression problems.
The main idea behind KNN is to find the k-nearest data points to a given test data point and use these
nearest neighbors to make a prediction. The value of k is a hyperparameter that needs to be tuned, and it
represents the number of neighbors to consider.
For classification problems, the KNN algorithm assigns the test data point to the class that appears most
frequently among the k-nearest neighbors. In other words, the class with the highest number of neighbors
is the predicted class.
For regression problems, the KNN algorithm assigns the test data point the average of the k-nearest
neighbors' values.
The distance metric used to measure the similarity between two data points is an essential factor that
affects the KNN algorithm's performance. The most commonly used distance metrics are Euclidean
distance, Manhattan distance, and Minkowski distance.
The KNN algorithm works in the following steps -
1. Load the data - The first step is to load the dataset into memory. This can be done using
various libraries such as pandas or numpy.
2. Split the data - The next step is to split the data into training and test sets. The training set is
used to train the KNN algorithm, while the test set is used to evaluate its performance.
3. Normalize the data - Before training the KNN algorithm, it is essential to normalize the data
to ensure that each feature contributes equally to the distance metric calculation.
4. Calculate distances - Once the data is normalized, the KNN algorithm calculates the
distances between the test data point and each data point in the training set.
5. Select k-nearest neighbors - The KNN algorithm selects the k-nearest neighbors based on
the distances calculated in the previous step.
6. Make a prediction - For classification problems, the KNN algorithm assigns the test data
point to the class that appears most frequently among the k-nearest neighbors. For regression
problems, the KNN algorithm assigns the test data point the average of the k-nearest
neighbors' values.
7. Evaluate performance - Finally, the KNN algorithm's performance is evaluated using various
metrics such as accuracy, precision, recall, and F1-score.
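The distance, neighbor selection and prediction steps can be sketched from scratch with NumPy. This is a minimal sketch for classification with Euclidean distance. The function name knn_predict and the toy data are hypothetical and chosen here for illustration -

```python
from collections import Counter

import numpy as np

def knn_predict(X_train, y_train, x_query, k=3):
    """Classify x_query by majority vote among its k nearest training points."""
    # Euclidean distance from the query point to every training point.
    distances = np.sqrt(np.sum((X_train - x_query) ** 2, axis=1))
    # Indices of the k smallest distances.
    nearest = np.argsort(distances)[:k]
    # The most frequent class among the k nearest neighbors wins.
    votes = Counter(y_train[nearest].tolist())
    return votes.most_common(1)[0][0]

# Toy training set with two well-separated classes.
X_train = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
                    [4.0, 4.2], [4.1, 3.9], [3.8, 4.0]])
y_train = np.array([0, 0, 0, 1, 1, 1])

print(knn_predict(X_train, y_train, np.array([1.0, 0.9])))
print(knn_predict(X_train, y_train, np.array([4.0, 4.0])))
```

The first query point lies next to the class 0 cluster and the second next to the class 1 cluster, so the two calls print 0 and 1. For regression, the vote would be replaced by the mean of the neighbors' target values.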
Implementation in Python
Now that we have discussed the KNN algorithm's theory, let's implement it in Python using scikit-learn.
Scikit-learn is a popular library for Machine Learning in Python and provides various algorithms for
classification and regression problems.
We will use the Iris dataset, which is a popular dataset in Machine Learning and contains information
about three different species of Iris flowers. The dataset has four features, including the sepal length, sepal
width, petal length, and petal width, and a target variable, which is the species of the flower.
To implement KNN in Python, we need to follow the steps mentioned earlier. Here's the Python code for
implementing KNN on the Iris dataset -
Example
scaler = StandardScaler()
knn = KNeighborsClassifier(n_neighbors=5)
Output
When you execute this code, it will produce the following output -
Accuracy: 98.11%
Naive Bayes Algorithm
The Naive Bayes algorithm is a classification algorithm based on Bayes' theorem. The algorithm assumes
that the features are independent of each other, which is why it is called "naive." It calculates the
probability of a sample belonging to a particular class based on the probabilities of its features. For example,
a phone may be considered smart if it has a touch-screen, internet facility, a good camera, etc. Even if
these features depend on each other, the algorithm treats each of them as contributing independently to the
probability that the phone is a smart phone.
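In Bayesian classification, the main interest is to find the posterior probabilities, i.e. the probability of a
label given some observed features, $P(L \mid \text{features})$. With the help of Bayes' theorem, we can express
this in quantitative form as follows -
$P(L \mid \text{features}) = \frac{P(L)\, P(\text{features} \mid L)}{P(\text{features})}$
Here, $P(L \mid \text{features})$ is the posterior probability of the label, $P(L)$ is the prior probability of the
label, $P(\text{features} \mid L)$ is the likelihood, i.e. the probability of the features given the label, and
$P(\text{features})$ is the prior probability of the features.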
In the Naive Bayes algorithm, we use Bayes' theorem to calculate the probability of a sample belonging to
a particular class. We calculate the probability of each feature of the sample given the class and multiply
them to get the likelihood of the sample belonging to the class. We then multiply the likelihood with the
prior probability of the class to get the posterior probability of the sample belonging to the class. We repeat
this process for each class and choose the class with the highest probability as the class of the sample.
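The calculation described above can be worked through by hand. The following is a minimal sketch with made-up probabilities for a two-class text example. The class names, feature names and all numbers are hypothetical -

```python
# Prior probability of each class (hypothetical values).
priors = {"spam": 0.4, "ham": 0.6}

# Probability of each feature given the class (hypothetical values).
likelihoods = {
    "spam": {"offer": 0.8, "meeting": 0.5},
    "ham": {"offer": 0.1, "meeting": 0.3},
}

observed = ["offer", "meeting"]  # features present in the sample

scores = {}
for label in priors:
    # Naive assumption: multiply the per-feature probabilities.
    likelihood = 1.0
    for feature in observed:
        likelihood *= likelihoods[label][feature]
    # Prior times likelihood is proportional to the posterior.
    scores[label] = priors[label] * likelihood

# Dividing by the sum of the scores plays the role of P(features).
total = sum(scores.values())
posteriors = {label: score / total for label, score in scores.items()}
prediction = max(posteriors, key=posteriors.get)

print(posteriors)
print("predicted class:", prediction)
```

The unnormalized scores are 0.4 * 0.8 * 0.5 = 0.16 for spam and 0.6 * 0.1 * 0.3 = 0.018 for ham, so the class with the highest posterior probability is spam, at about 0.90.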
1. Gaussian Naive Bayes - This algorithm is used when the features are continuous variables
that follow a normal distribution. It assumes that the probability distribution of each feature
is Gaussian, which means it is a bell-shaped curve.
2. Multinomial Naive Bayes - This algorithm is used when the features are discrete variables. It
is commonly used in text classification tasks where the features are the frequency of words in
a document.
3. Bernoulli Naive Bayes - This algorithm is used when the features are binary variables. It
is also commonly used in text classification tasks where the features are whether a word is
present or not in a document.
Implementation in Python
Here we will implement the Gaussian Naive Bayes algorithm in Python. We will use the iris dataset, which
is a popular dataset for classification tasks. It contains 150 samples of iris flowers, each with four features:
sepal length, sepal width, petal length, and petal width. The flowers belong to three classes: setosa,
versicolor, and virginica.
First, we will import the necessary libraries and load the dataset -
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
iris = load_iris()
We then create an instance of the Gaussian Naive Bayes classifier and train it on the training set -
gnb = GaussianNB()
gnb.fit(X_train, y_train)
We can now use the trained classifier to make predictions on the testing set -
Output
When you execute this program, it will produce the following output -
Accuracy: 0.9622641509433962
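The Gaussian Naive Bayes steps described above can be put together as a minimal sketch. The variable name gnb, the 30% test split and the random seed are assumptions, so the printed accuracy need not match the value shown above.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

iris = load_iris()

# Assumed split: 30% test data, fixed seed for reproducibility
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=0, stratify=iris.target
)

# Fit one Gaussian per feature and class, plus the class priors
gnb = GaussianNB()
gnb.fit(X_train, y_train)

# Predict the class with the highest posterior probability
y_pred = gnb.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

# Posterior probabilities for the first test sample (one value per class)
posterior = gnb.predict_proba(X_test[:1])[0]
print("Posterior:", posterior)
```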
Decision Trees Algorithm
The Decision Tree algorithm is a hierarchical tree-based algorithm that is used to classify or predict outcomes
based on a set of rules. It works by splitting the data into subsets based on the values of the input
features. The algorithm recursively splits the data until it reaches a point where the data in each subset
belongs to the same class or has the same value for the target variable. The resulting tree is a set of decision
rules that can be used to make predictions or classify new data.
The Decision Tree algorithm works by selecting the best feature to split the data at each node. The best
feature is the one that provides the most information gain or the most reduction in entropy. Information
gain is a measure of the amount of information gained by splitting the data at a particular feature, while
entropy is a measure of the randomness or disorder in the data. The algorithm uses these measures to determine
the best feature to split the data at each node.
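To make entropy and information gain concrete, here is a small sketch that computes both for two candidate splits. The toy labels and the feature names exercises and eats_pizza are hypothetical, chosen only for illustration.

```python
import numpy as np

def entropy(labels):
    """Shannon entropy (in bits) of an array of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def information_gain(labels, mask):
    """Entropy reduction obtained by splitting labels into mask / not mask."""
    n = len(labels)
    left, right = labels[mask], labels[~mask]
    weighted = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(labels) - weighted

# Hypothetical toy data: 1 = fit, 0 = unfit
fit = np.array([1, 1, 1, 1, 0, 0, 0, 0])
exercises = np.array([True, True, True, True, False, False, False, False])
eats_pizza = np.array([True, False, True, False, True, False, True, False])

gain_exercise = information_gain(fit, exercises)
gain_pizza = information_gain(fit, eats_pizza)
print("Gain when splitting on exercise:", gain_exercise)
print("Gain when splitting on pizza:   ", gain_pizza)
```

In this toy data, splitting on exercises separates the two classes perfectly, while splitting on eats_pizza leaves both subsets as mixed as before, so a decision tree would pick the first feature.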
An example of a binary tree that predicts whether a person is fit or unfit from information such as
age, eating habits and exercise habits is given below -
In the above decision tree, the questions are decision nodes and the final outcomes are leaves.
Types of Decision Tree Algorithm
There are two main types of Decision Tree algorithm -
1. Classification Tree - A classification tree is used to classify data into different classes or
categories. It works by splitting the data into subsets based on the values of the input features
and assigning each subset to a different class.
2. Regression Tree - A regression tree is used to predict numerical values or continuous variables.
It works by splitting the data into subsets based on the values of the input features and
assigning each subset a numerical value.
Implementation in Python
Let's implement the Decision Tree algorithm in Python using the Iris dataset, a popular dataset for
classification tasks. It contains 150 samples of iris flowers, each with four features: sepal length, sepal
width, petal length, and petal width. The flowers belong to three classes: setosa, versicolor, and virginica.
First, we will import the necessary libraries and load the dataset -
We then create an instance of the Decision Tree classifier and train it on the training set -
dtc.fit(X_train, y_train)
We can now use the trained classifier to make predictions on the testing set -
The plot_tree function from the sklearn.tree module can be used to plot the Decision Tree. We can pass
in the trained Decision Tree classifier, the filled argument to fill the nodes with color, the feature_names
argument to label the features, and the class_names argument to label the target classes. We also specify
the figsize argument to set the size of the figure and call the show function to display the plot.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
iris = load_iris()
dtc = DecisionTreeClassifier()
Output
This will create a plot of the Decision Tree that looks like this -
As you can see, the plot shows the structure of the Decision Tree, with each node representing a decision
based on the value of a feature, and each leaf node representing a class or numerical value. The color of
each node indicates the majority class or value of the samples in that node, and the numbers at the bottom
indicate the number of samples that reach that node.
Working of SVM
The goal of SVM is to find a hyperplane that separates the data points into different classes. A hyperplane
is a line in 2D space, a plane in 3D space, or a higher-dimensional surface in n-dimensional space. The
hyperplane is chosen in such a way that it maximizes the margin, which is the distance between the hyperplane
and the closest data points of each class. The closest data points are called the support vectors.
The distance between the hyperplane and a data point "x" can be calculated using the formula -
$$d = \frac{|w \cdot x + b|}{\lVert w \rVert}$$
where "w" is the weight vector, "b" is the bias term, and "||w||" is the Euclidean norm of the weight vector.
The weight vector "w" is perpendicular to the hyperplane and determines its orientation, while the bias
term "b" determines its position.
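As a small worked example of the point-to-hyperplane distance, the sketch below evaluates |w . x + b| / ||w|| with NumPy. The helper name distance_to_hyperplane and the numbers are hypothetical.

```python
import numpy as np

def distance_to_hyperplane(x, w, b):
    """Perpendicular distance from point x to the hyperplane w . x + b = 0."""
    return abs(np.dot(w, x) + b) / np.linalg.norm(w)

# Hypothetical hyperplane 3*x1 + 4*x2 - 5 = 0 and a point (3, 4)
w = np.array([3.0, 4.0])
b = -5.0
x = np.array([3.0, 4.0])

# |3*3 + 4*4 - 5| / sqrt(3^2 + 4^2) = 20 / 5 = 4
d = distance_to_hyperplane(x, w, b)
print("Distance:", d)
```

A point that lies on the hyperplane itself, such as (3, -1) here, has distance 0.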
The optimal hyperplane is found by solving an optimization problem, which is to maximize the margin
subject to the constraint that all data points are correctly classified. In other words, we want to find the
hyperplane that maximizes the margin between the two classes while ensuring that no data point is misclassified.
This is a convex optimization problem that can be solved using quadratic programming.
If the data points are not linearly separable, we can use a technique called the kernel trick to map the data
points into a higher-dimensional space where they become separable. The kernel function computes the
inner product between the mapped data points without computing the mapping itself. This allows us to
work with the data points in the higher-dimensional space without incurring the computational cost of
mapping them.
Implementation in Python
We will use the scikit-learn library to implement SVM in Python. Scikit-learn is a popular machine learning
library that provides a wide range of algorithms for classification, regression, clustering, and dimensionality
reduction tasks.
We will use the famous Iris dataset, which contains the sepal length, sepal width, petal length, and petal
width of three species of iris flowers: Iris setosa, Iris versicolor, and Iris virginica. The goal is to classify the
flowers into their respective species based on these four features.
Example
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score
iris = load_iris()
We load the iris dataset using load_iris and split the data into training and testing sets using train_test_split.
We use a test size of 0.2, which means that 20% of the data will be used for testing and 80% for training.
We set the random state to 42 to ensure reproducibility of the results.
We create an SVM classifier with a linear kernel using SVC(kernel='linear'). We then train the SVM classifier
on the training set by calling its fit(X_train, y_train) method.
Once the classifier is trained, we make predictions on the testing set by calling its predict(X_test) method. We then
calculate the accuracy of the classifier using accuracy_score(y_test, y_pred) and print it to the console.
Output
The output of the code should be something like this -
Accuracy: 1.0
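The walkthrough above corresponds to roughly the following program. It is a sketch assembled from the description (test size 0.2, random state 42, linear kernel); the variable name clf is an assumption.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

iris = load_iris()

# 80% of the samples for training, 20% for testing, fixed seed
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42
)

# Linear-kernel SVM: finds the maximum-margin separating hyperplanes
clf = SVC(kernel="linear")
clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

# The support vectors are the training points closest to the decision boundary
print("Support vectors per class:", clf.n_support_)
```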
The regularization parameter C controls the trade-off between maximizing the margin and minimizing
the classification error. A higher value of C means that the classifier will try to minimize the classification
error at the expense of a smaller margin, while a lower value of C means that the classifier will try to maximize
the margin even if it means more misclassifications.
The kernel-specific parameters depend on the type of kernel being used. For example, the polynomial
kernel has parameters for the degree of the polynomial and the coefficient of the polynomial, while the RBF
kernel has a parameter for the width of the Gaussian function.
We can use cross-validation to tune the parameters of the SVM. Cross-validation involves splitting the data
into several subsets (folds), training the classifier on all but one fold, and testing it on the fold that was
held out, repeating this so that each fold is used for testing once. This allows us to evaluate the performance
of the classifier on different subsets of the data and choose the best set of parameters.
Example
from sklearn.model_selection import GridSearchCV
# define the parameter grid
param_grid = {
    'C': [0.1, 1, 10, 100],
    'kernel': ['linear', 'poly', 'rbf', 'sigmoid'],
}
svm = SVC()
We start by importing the GridSearchCV class from scikit-learn, which is a tool for performing grid
search on a set of parameters. We define a parameter grid that contains the possible values for each parameter
we want to tune.
We create an SVM classifier using SVC() and then pass it to GridSearchCV along with the parameter grid
and the number of cross-validation folds (cv=5). We then call grid_search.fit(X_train, y_train) to perform
the grid search.
Once the grid search is complete, we print the best set of parameters and their accuracy using
grid_search.best_params_ and grid_search.best_score_, respectively.
Output
On executing this program, you will get the following output -
Best parameters: {'C': 0.1, 'coef0': 0.5, 'degree': 3, 'gamma': 'scale', 'kernel': 'poly'}
Best accuracy: 0.975
This means that the best set of parameters found by the grid search is: C=0.1, coef0=0.5, degree=3,
gamma=scale, and kernel=poly. The mean cross-validated accuracy achieved by this set of parameters on the
training data is 97.5%.
You can now use these parameters to create a new SVM classifier and test its performance on the testing
set.
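A grid search followed by a final check on the held-out test set might look like the sketch below. The grid values here are assumptions kept deliberately small so the search runs quickly; they are not the grid used to produce the output above.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Assumed, deliberately small parameter grid
param_grid = {
    "C": [0.1, 1, 10],
    "kernel": ["linear", "rbf"],
    "gamma": ["scale"],
}

# 5-fold cross-validation on the training data for every combination
grid_search = GridSearchCV(SVC(), param_grid, cv=5)
grid_search.fit(X_train, y_train)

print("Best parameters:", grid_search.best_params_)
print("Best cross-validated accuracy:", grid_search.best_score_)

# By default GridSearchCV refits the best model on the whole training set,
# so it can be scored directly on the untouched test set
test_accuracy = grid_search.score(X_test, y_test)
print("Test accuracy:", test_accuracy)
```

Keeping the test set out of the search means the final score is not biased by the parameter selection.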
Random Forest
Random Forest is a machine learning algorithm that uses an ensemble of decision trees to make predictions.
The algorithm was first introduced by Leo Breiman in 2001. The key idea behind the algorithm is to
create a large number of decision trees, each of which is trained on a different subset of the data. The predictions
of these individual trees are then combined to produce a final prediction.
Working of Random Forest Algorithm
We can understand the working of Random Forest algorithm with the help of following steps -
1. Step 1 - First, start with the selection of random samples from a given dataset.
2. Step 2 - Next, this algorithm will construct a decision tree for every sample. Then it will get
the prediction result from every decision tree.
3. Step 3 - In this step, voting will be performed for every predicted result.
4. Step 4 - At last, select the most voted prediction result as the final prediction result.
The following diagram illustrates how the Random Forest Algorithm works -
Random Forest is a flexible algorithm that can be used for both classification and regression tasks. In
classification tasks, the algorithm uses the mode of the predictions of the individual trees to make the final
prediction. In regression tasks, the algorithm uses the mean of the predictions of the individual trees.
1. Robustness to Overfitting - Random Forest algorithm is known for its robustness to overfitting.
This is because the algorithm uses an ensemble of decision trees, which helps to reduce
the impact of outliers and noise in the data.
2. High Accuracy - Random Forest algorithm is known for its high accuracy. This is because the
algorithm combines the predictions of multiple decision trees, which helps to reduce the impact
of individual decision trees that may be biased or inaccurate.
3. Handles Missing Data - Some Random Forest implementations can handle missing data without the need
for imputation, for example by learning at each split which branch samples with a missing value should follow.
Other implementations require missing values to be imputed first, so check the library you are using.
4. Non-Linear Relationships - Random Forest algorithm can handle non-linear relationships
between the features and the target variable. This is because the algorithm uses decision trees,
which can model non-linear relationships.
5. Feature Importance - Random Forest algorithm can provide information about the importance
of each feature in the model. This information can be used to identify the most important
features in the data and can be used for feature selection and feature engineering.
rfc = RandomForestClassifier(n_estimators=100)
rfc.fit(X_train, y_train)
y_pred = rfc.predict(X_test)
print("Accuracy:", accuracy)
print("Precision:", precision)
Output
This will give us the performance metrics of our Random Forest classifier as follows -
Accuracy: 0.9811320754716981
Precision: 0.9821802935010483
Recall: 0.9811320754716981
F1-score: 0.9811157396063056
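A complete version of the Random Forest example could look like the following sketch. The dataset (Iris), the 30% test split, the seeds and the weighted averaging of the multi-class metrics are all assumptions, so the numbers it prints need not match those listed above.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# 100 trees, each trained on a bootstrap sample of the training data
rfc = RandomForestClassifier(n_estimators=100, random_state=42)
rfc.fit(X_train, y_train)
y_pred = rfc.predict(X_test)

# Weighted averages combine the per-class scores by class frequency
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred, average="weighted")
recall = recall_score(y_test, y_pred, average="weighted")
f1 = f1_score(y_test, y_pred, average="weighted")

print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)
print("F1-score:", f1)

# Relative importance of each of the four features (sums to 1)
importances = rfc.feature_importances_
print("Feature importances:", importances)
```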
Confusion Matrix
It is the easiest way to measure the performance of a classification model whose output can be of two
or more classes. A confusion matrix is a table with two dimensions, "Actual" and "Predicted". For a
binary problem, its cells hold the counts of "True Positives (TP)", "True Negatives (TN)", "False
Positives (FP)" and "False Negatives (FN)", as shown below -
              Predicted 1    Predicted 0
Actual 1      TP             FN
Actual 0      FP             TN
1. True Positives (TP) - It is the case when both actual class & predicted class of data point is 1.
2. True Negatives (TN) - It is the case when both actual class & predicted class of data point is 0.
3. False Positives (FP) - It is the case when actual class of data point is 0 & predicted class of data
point is 1.
4. False Negatives (FN) - It is the case when actual class of data point is 1 & predicted class of
data point is 0.
How to Implement Confusion Matrix in Python?
To implement the confusion matrix in Python, we can use the confusion_matrix() function from the
sklearn.metrics module of the scikit-learn library. Here is a simple example of how to use the
confusion_matrix() function -
print(cm)
In this example, we have two arrays: y_actual contains the actual values of the target variable, and y_pred
contains the predicted values of the target variable. We then call the confusion_matrix() function, passing
in y_actual and y_pred as arguments. The function returns a 2D array that represents the confusion
matrix.
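A minimal sketch of the call described above is shown here. The values in y_actual and y_pred are made up for illustration. Note that scikit-learn orders the rows and columns by sorted label, so for labels 0 and 1 the returned array is laid out as [[TN, FP], [FN, TP]].

```python
from sklearn.metrics import confusion_matrix

# Hypothetical actual and predicted labels for ten samples
y_actual = [1, 0, 0, 1, 0, 1, 1, 0, 1, 0]
y_pred = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1]

# Rows are actual classes, columns are predicted classes
cm = confusion_matrix(y_actual, y_pred)
print(cm)

# For a binary problem the four cells can be unpacked directly
tn, fp, fn, tp = cm.ravel()
print("TP:", tp, "TN:", tn, "FP:", fp, "FN:", fn)
```

For these labels there are 4 true positives, 3 true negatives, 2 false positives and 1 false negative.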
We can also visualize the confusion matrix using a heatmap. Below is how we can do that using the
heatmap() function from the seaborn library -
In this heatmap, the x-axis represents the predicted values, and the y-axis represents the actual values. The
color of each square in the heatmap indicates the number of samples that fall into each category.
Stochastic Gradient Descent
Gradient Descent is a popular optimization algorithm that is used to minimize the cost function of a machine
learning model. It works by iteratively adjusting the model parameters to minimize the difference
between the predicted output and the actual output. The algorithm works by calculating the gradient of
the cost function with respect to the model parameters and then adjusting the parameters in the opposite
direction of the gradient.
Stochastic Gradient Descent is a variant of Gradient Descent that updates the parameters for each training
example instead of updating them after evaluating the entire dataset. This means that instead of using the
entire dataset to calculate the gradient of the cost function, SGD only uses a single training example. Each
update is therefore much cheaper and needs less memory, and on large datasets SGD often makes progress
faster, although its individual steps are noisier.
The main difference between Stochastic Gradient Descent and regular Gradient Descent is the way that the
gradient is calculated and the way that the model parameters are updated. In Stochastic Gradient Descent,
the gradient is calculated using a single training example, while in Gradient Descent, the gradient is calculated
using the entire dataset.
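The difference between the two update rules can be seen in a small NumPy sketch that fits a straight line with a squared-error cost. The synthetic data, learning rates and epoch counts are assumptions chosen only for illustration.

```python
import numpy as np

# Hypothetical data: y = 2x + 1 plus a little noise
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=200)
y = 2.0 * X + 1.0 + rng.normal(0, 0.05, size=200)

def batch_gd(X, y, lr=0.1, epochs=200):
    """Gradient Descent: one update per pass, using the whole dataset."""
    w, b = 0.0, 0.0
    for _ in range(epochs):
        err = w * X + b - y
        w -= lr * 2 * np.mean(err * X)
        b -= lr * 2 * np.mean(err)
    return w, b

def sgd(X, y, lr=0.05, epochs=20, seed=0):
    """Stochastic Gradient Descent: one update per training example."""
    order_rng = np.random.default_rng(seed)
    w, b = 0.0, 0.0
    for _ in range(epochs):
        for i in order_rng.permutation(len(X)):
            err = w * X[i] + b - y[i]
            w -= lr * 2 * err * X[i]
            b -= lr * 2 * err
    return w, b

w_gd, b_gd = batch_gd(X, y)
w_sgd, b_sgd = sgd(X, y)
print("Batch GD: w = %.3f, b = %.3f" % (w_gd, b_gd))
print("SGD:      w = %.3f, b = %.3f" % (w_sgd, b_sgd))
```

Both versions should end up close to the true slope and intercept; the SGD estimate fluctuates slightly because every step is based on a single example.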
Example
import numpy as np
X_data, y_data = iris.data, iris.target
# Getting the Iris dataset with only the first two attributes
scaler = StandardScaler().fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)
clfmodel_SGD.fit(X_train, y_train)
Output
When you run this code, it will produce the following output -
Clustering Algorithms
Clustering methods are among the most useful unsupervised ML methods. These methods are used to find
similarity and relationship patterns among data samples and then group those samples into clusters
based on the similarity of their features. Clustering is important because it determines the intrinsic grouping
among the unlabeled data. Each method makes some assumptions about what makes data points
similar, and each assumption can produce different but equally valid clusters.
For example, the diagram below shows a clustering system that has grouped similar kinds of
data into different clusters -
Clusters do not have to be spherical. The following are some other cluster formation
methods -
Density-based
In these methods, the clusters are formed as dense regions. The advantage of these methods is that they
have good accuracy as well as a good ability to merge two clusters. Ex. Density-Based Spatial Clustering of
Applications with Noise (DBSCAN), Ordering Points To Identify the Clustering Structure (OPTICS) etc.
Hierarchical-based
In these methods, the clusters are formed as a tree-type structure based on the hierarchy. They have two
categories, namely Agglomerative (bottom-up approach) and Divisive (top-down approach). Ex. Clustering
Using Representatives (CURE), Balanced Iterative Reducing and Clustering using Hierarchies (BIRCH) etc.
Partitioning
In these methods, the clusters are formed by partitioning the objects into k clusters. The number of clusters will
be equal to the number of partitions. Ex. K-means, Clustering Large Applications based upon Randomized
Search (CLARANS).
Grid
In these methods, the clusters are formed as a grid-like structure. The advantage of these methods is that
all the clustering operations done on these grids are fast and independent of the number of data objects. Ex.
Statistical Information Grid (STING), Clustering in Quest (CLIQUE).
Types of ML Clustering Algorithms
The following are the most important and useful ML clustering algorithms -
K-means Clustering
This clustering algorithm computes the centroids and iterates until it finds the optimal centroids. It assumes
that the number of clusters is already known. It is also called a flat clustering algorithm. The number
of clusters to be identified from the data is represented by 'K' in K-means.
Mean-Shift Algorithm
It is another powerful clustering algorithm used in unsupervised learning. Unlike K-means clustering, it
does not assume a fixed number of clusters or a particular cluster shape; hence it is a non-parametric algorithm.
Hierarchical Clustering
It is another unsupervised learning algorithm that is used to group together the unlabeled data points
having similar characteristics.
Data summarization and compression - Clustering is widely used in areas where we require data
summarization, compression and reduction. Examples are image processing and vector quantization.
Collaborative systems and customer segmentation - Since clustering can be used to find similar products
or the same kind of users, it can be used in the area of collaborative systems and customer segmentation.
Serve as a key intermediate step for other data mining tasks - Cluster analysis can generate a compact
summary of data for classification, testing, hypothesis generation; hence, it serves as a key intermediate
step for other data mining tasks also.
Trend detection in dynamic data - Clustering can also be used for trend detection in dynamic data by
making various clusters of similar trends.
Social network analysis - Clustering can be used in social network analysis. The examples are generating
sequences in images, videos or audios.
Biological data analysis - Clustering can also be used to make clusters of images and videos; hence it can
successfully be used in biological data analysis.
Now that you know what clustering is and how it works, let's see some of the clustering algorithms used in
machine learning, in the next few chapters.
Centroid-Based Clustering
Centroid-based clustering is a class of machine learning algorithms that aims to partition a dataset into
groups or clusters based on the proximity of data points to the centroid of each cluster.
The centroid of a cluster is the arithmetic mean of all the data points in that cluster and serves as a representative
point for that cluster.
K-means Clustering
K-Means clustering is a popular unsupervised machine learning algorithm used for clustering data. It is
a simple and efficient algorithm that can group data points into K clusters based on their similarity. The
algorithm works by first randomly selecting K centroids, which are the initial centers of each cluster. Each
data point is then assigned to the cluster whose centroid is closest to it. The centroids are then updated by
taking the mean of all the data points in the cluster. This process is repeated until the centroids no longer
move or the maximum number of iterations is reached.
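The assign-and-update loop described above can be written in a few lines of NumPy. This is a from-scratch sketch for illustration, not scikit-learn's implementation; the function name kmeans and the synthetic two-blob data are assumptions.

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    """Plain K-means: random initial centroids, then assign and update."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Assign every point to its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Move each centroid to the mean of its points (keep it if empty)
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break  # centroids no longer move
        centroids = new_centroids
    return labels, centroids

# Hypothetical data: two well-separated blobs of 50 points each
data_rng = np.random.default_rng(1)
X = np.vstack([
    data_rng.normal(loc=0.0, scale=0.5, size=(50, 2)),
    data_rng.normal(loc=5.0, scale=0.5, size=(50, 2)),
])

labels, centroids = kmeans(X, k=2)
print("Centroids:")
print(centroids)
```

On data like this the two centroids should settle near (0, 0) and (5, 5), one per blob.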
K-Medoids Clustering
K-medoids clustering is a partition-based clustering algorithm that is used to cluster a set of data points
into "k" clusters. Unlike K-means clustering, which uses the mean value of the data points to represent the
center of the cluster, K-medoids clustering uses a representative data point, called a medoid, to represent
the center of the cluster. The medoid is the data point that minimizes the sum of the distances between it
and all the other data points in the cluster. This makes K-medoids clustering more robust to outliers and
noise than K-means clustering.
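To show how a medoid differs from a centroid, here is a from-scratch sketch of a simple alternating K-medoids scheme (assign points to the nearest medoid, then pick the best medoid inside each cluster). It is a simplification for illustration, not the full PAM algorithm; the function name kmedoids and the synthetic data are assumptions.

```python
import numpy as np

def kmedoids(X, k, max_iter=100, seed=0):
    """Alternating K-medoids; returns labels and the indices of the medoids."""
    rng = np.random.default_rng(seed)
    # Pairwise Euclidean distances between all points
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    medoids = rng.choice(len(X), size=k, replace=False)
    for _ in range(max_iter):
        labels = D[:, medoids].argmin(axis=1)
        new_medoids = medoids.copy()
        for j in range(k):
            members = np.where(labels == j)[0]
            if len(members) == 0:
                continue
            # The medoid is the member with the smallest total distance
            # to all other members of the cluster
            costs = D[np.ix_(members, members)].sum(axis=1)
            new_medoids[j] = members[costs.argmin()]
        if np.array_equal(new_medoids, medoids):
            break
        medoids = new_medoids
    labels = D[:, medoids].argmin(axis=1)
    return labels, medoids

# Hypothetical data: two well-separated blobs of 50 points each
data_rng = np.random.default_rng(1)
X = np.vstack([
    data_rng.normal(loc=0.0, scale=0.5, size=(50, 2)),
    data_rng.normal(loc=5.0, scale=0.5, size=(50, 2)),
])

labels, medoids = kmedoids(X, k=2)
print("Medoid indices:", medoids)
print("Medoid coordinates:")
print(X[medoids])
```

Unlike a centroid, each medoid returned here is an actual row of X, which is why a single extreme value cannot drag the cluster center away from the data.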
We will discuss these two clustering methods in the next two chapters.
K-Means Clustering
Implementation in Python
Python has several libraries that provide implementations of various machine learning algorithms, including
K-Means clustering. Let's see how to implement the K-Means algorithm in Python using the scikit-learn
library.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
kmeans.fit(X)
plt.figure(figsize=(7.5, 3.5))
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1],
            marker='x', c='r')
The output of the above code will be a plot with the data points colored based on their assigned cluster, and
the centroids marked with an 'x' symbol in red color.
plt.scatter(X[:, 0], X[:, 1], s=20, cmap='summer')
plt.figure(figsize=(7.5, 3.5))
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1],
            marker='x', c='r')
Output
When you execute this code, it will produce the following plots as the output -
[Figure: scatter plots of the data points and of the K-Means clusters with their centroids; both axes run from 0.0 to 1.0]
Image Segmentation
K-Means clustering can be used to segment an image into different regions based on the color or texture of
the pixels. This technique is widely used in computer vision applications, such as object recognition, image
retrieval, and medical imaging.
Customer Segmentation
K-Means clustering can be used to segment customers into different groups based on their purchasing
behavior or demographic characteristics. This technique is widely used in marketing applications, such as
customer retention, loyalty programs, and targeted advertising.
Anomaly Detection
K-Means clustering can be used to detect anomalies in a dataset by identifying data points that lie
unusually far from the centroid of their nearest cluster. This technique is widely used in fraud detection,
network intrusion detection, and predictive maintenance.
K-Medoids Clustering
Implementation in Python
To implement K-medoids clustering in Python, we can use the scikit-learn-extra library, a companion package
to scikit-learn. It provides the KMedoids class (sklearn_extra.cluster.KMedoids), which can be used to perform
K-medoids clustering on a dataset.
Next, we generate a sample dataset using the make_blobs() function from scikit-learn -
Here, we set the number of clusters to 3 and use the random_state parameter to ensure reproducibility.
Example
Here is the complete implementation in Python -
kmedoids = KMedoids(n_clusters=3,random_state=42)
Output
Here, we plot the data points as a scatter plot and color them based on their cluster labels. We also plot the
medoids as red crosses.
[Figure: scatter plot of the data points colored by cluster label, with the medoids marked as red crosses]
1. Robust to outliers and noise - K-medoids clustering is more robust to outliers and noise than
K-means clustering because it uses a representative data point, called a medoid, to represent
the center of the cluster.
2. Can handle non-Euclidean distance metrics - K-medoids clustering can be used with any
distance metric, including non-Euclidean distance metrics, such as Manhattan distance and
cosine distance.
3. Computational cost - The classic K-medoids algorithm (PAM) has a per-iteration complexity of roughly
O(k*n^2), which is higher than that of K-means, whose cost grows linearly with the number of data points.
K-medoids is therefore best suited to small and medium-sized datasets.
1. Sensitive to the choice of k - The performance of K-medoids clustering can be sensitive to the
choice of k, the number of clusters.
2. Not suitable for high-dimensional data - K-medoids clustering may not perform well on
high-dimensional data because the medoid selection process becomes computationally
expensive.
Mean-Shift Clustering
The Mean-Shift clustering algorithm is a non-parametric clustering algorithm that works by iteratively
shifting the mean of a data point towards the densest area of the data. The densest area of the data is determined
by the kernel function, which is a function that assigns weights to the data points based on their
distance from the mean. The kernel function used in Mean-Shift clustering is usually a Gaussian function.
The steps involved in the Mean-Shift clustering algorithm are as follows -
The Mean-Shift clustering algorithm is a density-based clustering algorithm, which means that it identifies
clusters based on the density of the data points rather than the distance between them. In other words,
the algorithm identifies clusters based on the areas where the density of the data points is highest.
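The shifting step can be illustrated with a short NumPy sketch that moves a single starting point towards the nearest density peak using a Gaussian kernel. The helper names shift_point and find_mode, the bandwidth of 1.0 and the synthetic data are assumptions made for illustration.

```python
import numpy as np

def shift_point(x, X, bandwidth):
    """One mean-shift update: the Gaussian-weighted mean of all points around x."""
    sq_dist = np.sum((X - x) ** 2, axis=1)
    weights = np.exp(-sq_dist / (2 * bandwidth ** 2))
    return (weights[:, None] * X).sum(axis=0) / weights.sum()

def find_mode(x, X, bandwidth, tol=1e-5, max_iter=300):
    """Repeat the update until the point stops moving."""
    for _ in range(max_iter):
        x_new = shift_point(x, X, bandwidth)
        if np.linalg.norm(x_new - x) < tol:
            break
        x = x_new
    return x_new

# Hypothetical data: two blobs of 100 points each
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=0.0, scale=0.5, size=(100, 2)),
    rng.normal(loc=5.0, scale=0.5, size=(100, 2)),
])

# Start from one point in each blob and follow it to its density peak
mode_a = find_mode(X[0], X, bandwidth=1.0)
mode_b = find_mode(X[-1], X, bandwidth=1.0)
print("Mode reached from the first point:", mode_a)
print("Mode reached from the last point: ", mode_b)
```

Points whose shifts end at the same peak are placed in the same cluster, which is how the number of clusters emerges from the data instead of being chosen in advance.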
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import MeanShift, estimate_bandwidth
X = np.random.randn(500, 2)
bandwidth = estimate_bandwidth(X)
ms = MeanShift(bandwidth=bandwidth, bin_seeding=True)
ms.fit(X)
labels = ms.labels_
cluster_centers = ms.cluster_centers_
n_clusters_ = len(np.unique(labels))
print("Number of estimated clusters:", n_clusters_)
plt.scatter(cluster_centers[:, 0], cluster_centers[:, 1], marker='*', s=300, c='r')
In this step, we visualize the results of the Mean-Shift clustering algorithm. We extract the cluster labels
and the cluster centers from the trained model. We then print the number of estimated clusters. Finally, we
plot the data points and the centroids using the matplotlib library.
Example
Here is the complete implementation example of Mean-Shift Clustering Algorithm in python -
X = np.random.randn(500, 2)
labels = ms.labels_
cluster_centers = ms.cluster_centers_
n_clusters_ = len(np.unique(labels))
print("Number of estimated clusters:", n_clusters_)
Output
When you execute the program, it will produce the following plot as the output -
[Figure: scatter plot of the data points with the estimated cluster centers]
1. Computer vision - Mean-Shift clustering is widely used in computer vision for object tracking,
image segmentation, and feature extraction.
2. Image processing - Mean-Shift clustering is used for image segmentation, which is the
process of dividing an image into multiple segments based on the similarity of the pixels.
3. Anomaly detection - Mean-Shift clustering can be used for detecting anomalies in data by
identifying the areas with low density.
4. Customer segmentation - Mean-Shift clustering can be used for customer segmentation in
marketing by identifying groups of customers with similar behavior and preferences.
5. Social network analysis - Mean-Shift clustering can be used for clustering users in social networks
based on their interests and interactions.
Hierarchical Clustering
Hierarchical clustering is another unsupervised learning algorithm that is used to group together
unlabeled data points having similar characteristics. Hierarchical clustering algorithms fall into the following
two categories: Agglomerative (bottom-up approach) and Divisive (top-down approach). The steps of the
agglomerative approach are as follows -
1. Step 1 - Treat each data point as a single cluster. Hence, we will have, say, K clusters at the start.
The number of data points will also be K at the start.
2. Step 2 - Form a bigger cluster by joining the two closest data points. This
will result in a total of K-1 clusters.
3. Step 3 - To form more clusters, join the two closest clusters. This will result in a total
of K-2 clusters.
4. Step 4 - Repeat the above steps until only one big cluster remains, i.e. there are
no more clusters left to join.
5. Step 5 - Finally, after making one single big cluster, use the dendrogram to divide it into
multiple clusters depending upon the problem.
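The merging steps above can be sketched with SciPy, which records every merge in a linkage table and can then cut the resulting tree into a chosen number of clusters. The eight sample points are hypothetical, and single linkage (always joining the two closest clusters) is used here to match the steps as described.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

# Hypothetical 2-D points forming two loose groups
X = np.array([
    [1, 2], [2, 2], [1, 3], [2, 3],
    [8, 8], [9, 8], [8, 9], [9, 9],
], dtype=float)

# Each row of Z describes one merge: the two clusters joined,
# the distance between them, and the size of the new cluster
Z = linkage(X, method="single")
print("Number of merges:", Z.shape[0])

# Cut the tree so that at most two clusters remain
labels = fcluster(Z, t=2, criterion="maxclust")
print("Cluster labels:", labels)
```

With n points there are always n - 1 merges, which corresponds to going from K clusters down to a single one.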
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
Next, we will be plotting the datapoints we have taken for this example -
X = np.array([[7, 8], [12, 20], [17, 19], [26, 15], [32, 37], [87, 75], [73, 85], ...
plt.subplots_adjust(bottom=0.1)
plt.annotate(label, xy=(x, y), xytext=(-3, 3), textcoords='offset points', ha='right', va='bottom')
Output
When you execute this code, it will produce the following plot as the output -
[Figure: scatter plot of the labeled data points]
From the above diagram, it is very easy to see we have two clusters in our datapoints but in real-world data,
there can be thousands of clusters.
Next, we will be plotting the dendrograms of our datapoints by using Scipy library -
from scipy.cluster.hierarchy import dendrogram, linkage
from matplotlib import pyplot as plt
linked = linkage(X,
labelList = range(1, 11)
dendrogram(linked, orientation='top', labels=labelList,
           distance_sort='descending', show_leaf_counts=True)
plt.show()
[Figure: dendrogram of the data points]
Next, we need to import the class for clustering and call its fit_predict method to predict the cluster. We
are importing the AgglomerativeClustering class from the sklearn.cluster module -
cluster.fit_predict(X)
The following diagram shows the two clusters from our datapoints.
[Figure: scatter plot showing the two clusters]
Example 2
As we understood the concept of dendrograms from the simple example above, let's move to another
example in which we are creating clusters of the data points in the Pima Indian Diabetes Dataset by using
hierarchical clustering -
import matplotlib.pyplot as plt
import pandas as pd
%matplotlib inline
import numpy as np
import scipy.cluster.hierarchy as shc
from sklearn.cluster import AgglomerativeClustering
# 'metric' was called 'affinity' in scikit-learn versions before 1.2
cluster = AgglomerativeClustering(n_clusters=4, metric='euclidean', linkage='ward')
plt.figure(figsize=(7.2, 5.5))
plt.scatter(patient_data[:, 0], patient_data[:, 1], c=cluster.labels_,
            cmap='rainbow')
Output
When you run this code, it will produce the following two plots as the output -
[Figure: dendrogram and cluster scatter plot for the patient data]
Density-Based Clustering
Density-based clustering is based on the idea that clusters are regions of high density separated by regions
of low density.
1. The algorithm works by first identifying "core" data points, which are data points that have a
minimum number of neighbors within a specified distance. These core data points form the
center of a cluster.
2. Next, the algorithm identifies "border" data points, which are data points that are not core
data points but have at least one core data point as a neighbor.
3. Finally, the algorithm identifies "noise" data points, which are data points that are not core
data points or border data points.
DBSCAN Clustering
The DBSCAN (Density-Based Spatial Clustering of Applications with Noise) algorithm is one of the most
common density-based clustering algorithms. The DBSCAN algorithm requires two parameters: the minimum
number of neighbors (minPts) a point needs in order to count as a core point, and the radius of the neighborhood
around each point (eps).
OPTICS Clustering
OPTICS (Ordering Points to Identify the Clustering Structure) is a density-based clustering algorithm that
operates by ordering the data points so that points lying in the same dense region end up next to each
other. For every point it records a reachability distance, which measures how far the point is from the
dense region that was processed before it. Plotting these distances in the computed order gives a
reachability plot, in which clusters appear as valleys. The algorithm then extracts a hierarchical
clustering structure from this plot based on a specified density or steepness threshold.
HDBSCAN Clustering
HDBSCAN (Hierarchical Density-Based Spatial Clustering of Applications with Noise) is a clustering
algorithm that is based on density clustering. It is a newer algorithm that builds upon the popular DBSCAN
algorithm and offers several advantages over it, such as better handling of clusters of varying densities and
the ability to detect clusters of different shapes and sizes.
In the next three chapters, we will discuss all three density-based clustering algorithms in detail along
with their implementation in Python.
DBSCAN Clustering
Implementation in Python
We can implement the DBSCAN algorithm in Python using the scikit-learn library. Here are the steps to do
so -
from sklearn.datasets import make_moons
X, y = make_moons(n_samples=200, noise=0.05, random_state=0)
Example
Here is the complete implementation of DBSCAN clustering in Python -
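from sklearn.cluster import DBSCAN
clustering = DBSCAN(eps=0.3, min_samples=5)  # eps is illegible in the source; 0.3 is an assumed value
clustering.fit(X)
import matplotlib.pyplot as plt
plt.scatter(X[:, 0], X[:, 1], c=clustering.labels_, cmap='rainbow')
plt.show()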
Output
The resulting scatter plot should show two distinct clusters, each corresponding to one of the moons in the
dataset. The noise data points should be colored black.
Advantages of DBSCAN
Following are the advantages of using DBSCAN clustering -
1. DBSCAN can handle clusters of arbitrary shape, unlike k-means, which assumes that clusters
are roughly spherical.
2. It does not require prior knowledge of the number of clusters in the dataset, unlike k-means.
3. It can detect outliers, which are points that do not belong to any cluster. This is because
DBSCAN defines clusters as dense regions of points, and points that are far from any dense
region are considered outliers.
4. It does not depend on a random initialization of cluster centers, unlike k-means, although its
results still depend on the epsilon and min_samples parameters.
5. It scales to large datasets when a spatial index is used, because it then only needs the
distances between neighboring points, rather than between all pairs of points.
Disadvantages of DBSCAN
Following are the disadvantages of using DBSCAN clustering -
1. It can be sensitive to the choice of the epsilon and min_samples parameters. If these parameters
are not chosen carefully, DBSCAN may fail to identify clusters or merge them incorrectly.
2. It may not work well on datasets with varying densities, as a single epsilon value is applied
to all clusters.
3. Border points that are reachable from more than one cluster may be assigned to different
clusters depending on the order in which the data is processed. Core points and noise points
are always assigned the same way.
4. It may be computationally expensive for high-dimensional datasets, as the distance computations
become more expensive as the number of dimensions increases.
5. It may not work well on datasets with noise or outliers if the density of the noise or outliers is
too high. In such cases, the noise or outliers may be wrongly assigned to clusters.
OPTICS Clustering
OPTICS is similar to DBSCAN (Density-Based Spatial Clustering of Applications with Noise), another popular
density-based clustering algorithm. However, OPTICS has several advantages over DBSCAN, including the
ability to identify clusters of varying densities and the ability to produce a hierarchical clustering
structure. Like DBSCAN, it can also handle noise.
Here's an example of how to use the OPTICS class in scikit-learn to cluster a dataset -
Example
from sklearn.cluster import OPTICS
optics = OPTICS(min_samples=50, xi=0.05)
optics.fit(X)
labels = optics.labels_
plt.scatter(X[:, 0], X[:, 1], c=labels, cmap='turbo')
plt.show()
In this example, we first generate a sample dataset using the make_blobs function from scikit-learn. We
then instantiate an OPTICS object with the min_samples parameter set to 50 and the xi parameter set to
0.05. The min_samples parameter specifies the number of samples in a neighborhood for a point to be
considered a core point, and the xi parameter sets the minimum steepness on the reachability plot that
marks a cluster boundary. We then fit the OPTICS object to the dataset using the fit method. Finally, we
plot the results using a scatter plot, where each data point is colored according to its cluster label.
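For readers who want to run the steps end to end, here is a self-contained sketch. The make_blobs arguments (600 samples, 3 centers, a spread of 0.5) are assumptions chosen for illustration, since the text only states that make_blobs is used. The plot is left out so that the snippet also runs without a display.

```python
from sklearn.cluster import OPTICS
from sklearn.datasets import make_blobs

# Dataset parameters are assumptions; the text only says make_blobs is used
X, _ = make_blobs(n_samples=600, centers=3, cluster_std=0.5, random_state=0)

optics = OPTICS(min_samples=50, xi=0.05)
optics.fit(X)

labels = optics.labels_  # cluster label per point, -1 marks noise
# Reachability distances in processing order: the values of the reachability plot
reachability = optics.reachability_[optics.ordering_]

print("clusters found:", len(set(labels) - {-1}))
print("noise points:", int((labels == -1).sum()))
```

Plotting reachability as a bar or line chart shows each cluster as a valley, which is often a helpful way to judge whether the chosen xi value is reasonable.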
Output
When you execute this program, it will produce the following plot as the output -
Advantages of OPTICS Clustering
1. Ability to handle clusters of varying densities - OPTICS can handle clusters that have varying
densities, unlike some other clustering algorithms that require clusters to have uniform
densities.
2. Ability to handle noise - OPTICS can identify noise data points that do not belong to any
cluster, which is useful for removing outliers from the dataset.
3. Hierarchical clustering structure - OPTICS produces a hierarchical clustering structure that
can be useful for analyzing the dataset at different levels of granularity.
Disadvantages of OPTICS Clustering
1. Sensitivity to parameters - OPTICS requires careful tuning of its parameters, such as the
min_samples and xi parameters, which can be challenging.
2. Computational complexity - OPTICS can be computationally expensive for large datasets,
especially when using a high min_samples value.
HDBSCAN Clustering
HDBSCAN first computes a core distance for every point: the distance to its k-th nearest neighbor,
which is small in dense regions and large in sparse ones. The mutual reachability distance between two
points is then defined as the maximum of three values: the core distance of the first point, the core
distance of the second point, and the ordinary distance between the two points.
The hierarchy of clusters is then extracted from the mutual-reachability graph using a minimum spanning
tree (MST) algorithm. Removing the edges of the MST in order of decreasing weight produces a cluster
tree whose leaves correspond to the individual data points, while the internal nodes correspond to
clusters of varying sizes and shapes.
The HDBSCAN algorithm then condenses this cluster tree. The condensed tree is a compact representation
that keeps only the splits in which both parts contain at least a user-defined minimum cluster size;
smaller parts are treated as points falling out of the parent cluster. The final clusters are then
selected from the condensed tree using a heuristic based on the stability of the clusters.
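To make the distance used by HDBSCAN concrete, the sketch below computes it with NumPy. It uses the common definition in which the mutual reachability distance of two points is the largest of their two core distances and their ordinary distance. The function name mutual_reachability and the convention that k counts neighbors other than the point itself are assumptions for this example, and library implementations may count k differently.

```python
import numpy as np

def mutual_reachability(X, k):
    # Pairwise Euclidean distances
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    # Core distance: distance to the k-th nearest neighbor
    # (index 0 of each sorted row is the point itself)
    core = np.sort(d, axis=1)[:, k]
    # d_mreach(a, b) = max(core(a), core(b), d(a, b))
    mreach = np.maximum(d, np.maximum(core[:, None], core[None, :]))
    np.fill_diagonal(mreach, 0.0)
    return core, mreach

# Four points on a line: three close together and one far away
X = np.array([[0.0], [1.0], [2.0], [10.0]])
core, mreach = mutual_reachability(X, k=2)
print(core)          # [2. 1. 2. 9.]
print(mreach[0, 1])  # 2.0
print(mreach[2, 3])  # 9.0
```

The isolated point has a large core distance, so every distance involving it is pushed up, which is what keeps sparse points from being chained into dense clusters.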
Implementation in Python
HDBSCAN is available as a Python library that can be installed using pip. The library provides an
implementation of the HDBSCAN algorithm along with several useful functions for data preprocessing and
visualization.
Installation
To install HDBSCAN, open a terminal window and type the following command -
pip install hdbscan
Usage
To use HDBSCAN, first import the hdbscan library -
import hdbscan
Next, we generate a sample dataset using the make_blobs() function from scikit-learn -
Now, create an instance of the HDBSCAN class and fit it to the data -
This will apply HDBSCAN to the dataset and assign each point to a cluster. To visualize the clustering
results, you can plot the data and color each point according to its cluster label -
labels = clusterer.labels_  # clusterer: the fitted HDBSCAN instance (name assumed)
This code will produce a scatter plot of the data with each point colored according to its cluster label as
follows -
HDBSCAN also provides several parameters that can be adjusted to fine-tune the clustering results -
1. min_cluster_size - The minimum number of points a group must contain to be treated as a
cluster. Points that are not part of any cluster are labeled as noise.
2. min_samples - The minimum number of samples in a neighborhood for a point to be considered
a core point.
3. cluster_selection_epsilon - A distance threshold; clusters that are closer together than this
value are merged rather than split further.
4. metric - The distance metric used to measure the distance between points.
Advantages of HDBSCAN Clustering
HDBSCAN has several advantages over other clustering algorithms -
1. Better handling of clusters of varying densities - HDBSCAN can identify clusters of different
densities, a situation that is common in real datasets.
2. Ability to detect clusters of different shapes and sizes - HDBSCAN can identify clusters that
are not necessarily spherical, which is also common in real datasets.
3. No need to specify the number of clusters - HDBSCAN does not require the user to specify
the number of clusters, which can be difficult to determine a priori.
4. Robust to noise - HDBSCAN is robust to noisy data and can identify outliers as noise points.
BIRCH Clustering
BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies) is a hierarchical clustering
algorithm that is designed to handle large datasets efficiently. The algorithm builds a tree-like
structure of clusters by inserting the data points one at a time and splitting nodes when they become
too full.
BIRCH uses two main data structures to represent the clusters: the Clustering Feature (CF) and the CF
tree. A CF summarizes the statistical properties of a set of data points (their count, linear sum, and
sum of squares), while the CF tree organizes the CFs of the subclusters in a height-balanced hierarchy.
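A Clustering Feature is commonly written as the triple (N, LS, SS): the number of points, their linear sum, and the sum of their squared norms. The short sketch below shows why this summary is convenient: two CFs can be merged by simple addition, and the centroid and radius of the merged subcluster follow from the triple alone. The helper names are hypothetical and are not part of scikit-learn.

```python
import numpy as np

def clustering_feature(points):
    """Return the CF triple (N, LS, SS) for a 2-D array of points."""
    points = np.asarray(points, dtype=float)
    return len(points), points.sum(axis=0), float((points ** 2).sum())

def merge(cf_a, cf_b):
    """CFs are additive: merging two subclusters adds their entries."""
    return cf_a[0] + cf_b[0], cf_a[1] + cf_b[1], cf_a[2] + cf_b[2]

def centroid_and_radius(cf):
    n, ls, ss = cf
    centroid = ls / n
    # Radius: root mean squared distance of the points from the centroid
    radius = np.sqrt(max(ss / n - centroid @ centroid, 0.0))
    return centroid, radius

a = [[0.0, 0.0], [2.0, 0.0]]
b = [[0.0, 2.0], [2.0, 2.0]]
merged = merge(clustering_feature(a), clustering_feature(b))
centroid, radius = centroid_and_radius(merged)
print(merged[0], centroid, radius)
```

Because only the triple is stored, BIRCH never needs to revisit the original points when it updates or merges subclusters.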
BIRCH clustering has three main steps -
1. Initialization - BIRCH constructs an empty tree structure and sets the maximum number of
CFs that can be stored in a node.
2. Clustering - BIRCH reads the data points one by one and adds them to the tree structure. If a
CF is already present in a node, BIRCH updates the CF with the new data point. If there is no
CF in the node, BIRCH creates a new CF for the data point. BIRCH then checks if the number of
CFs in the node exceeds the maximum threshold. If the threshold is exceeded, BIRCH creates a
new subcluster by recursively partitioning the CFs in the node.
3. Refinement - BIRCH refines the tree structure by merging the subclusters that are similar
based on a distance metric.
Example
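from sklearn.datasets import make_blobs
from sklearn.cluster import Birch
import matplotlib.pyplot as plt
# random_state is cut off in the source; 0 is an assumed value
X, y = make_blobs(n_samples=1000, centers=10, cluster_std=0.50, random_state=0)
birch = Birch(threshold=1.5, n_clusters=4)
birch.fit(X)
labels = birch.predict(X)
plt.scatter(X[:, 0], X[:, 1], c=labels, cmap='winter')
plt.show()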
In this example, we first generate a sample dataset using the make_blobs function from scikit-learn. We
then cluster the dataset using the BIRCH algorithm. For the BIRCH algorithm, we instantiate a Birch object
with the threshold parameter set to 1.5 and the n_clusters parameter set to 4. We then fit the Birch object
to the dataset using the fit method and predict the cluster labels using the predict method. Finally, we plot
the results using a scatter plot.
Output
When you execute the given program, it will produce the following plot as the output -
Advantages of BIRCH Clustering
BIRCH clustering has several advantages over other clustering algorithms, including -
1. Scalability - BIRCH is designed to handle large datasets efficiently by using a tree-like
structure to represent the clusters.
2. Memory efficiency - BIRCH stores compact CF summaries in the CF tree instead of the individual
data points, which reduces the memory required to store the clusters.
3. Fast clustering - BIRCH can cluster the data points quickly because it uses an incremental
clustering approach.
Affinity Propagation
Affinity Propagation is a clustering algorithm that identifies "exemplars" in a dataset and assigns each data
point to one of these exemplars. It is a type of clustering algorithm that does not require a pre-specified
number of clusters, making it a useful tool for exploratory data analysis. Affinity Propagation was
introduced by Frey and Dueck in 2007 and has since been widely used in many fields such as biology, computer
vision, and social network analysis.
The idea behind Affinity Propagation is to iteratively update two matrices: the responsibility matrix and
the availability matrix. The responsibility r(i, k) expresses how well-suited point k is to serve as the
exemplar for point i compared with the other candidate exemplars, while the availability a(i, k) expresses
how appropriate it would be for point i to choose point k as its exemplar, given the support that k receives
from other points. The algorithm alternates between updating these two matrices until convergence is
achieved. The final exemplars are chosen by combining both matrices: each point i selects the point k that
maximizes the sum of availability and responsibility.
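For reference, the message updates in Frey and Dueck's formulation can be written as follows. Here s(i, k) denotes the similarity between points i and k (for example, the negative squared distance), and the diagonal entries s(k, k) hold the preference values; the notation is chosen for this summary.

```latex
\begin{align*}
r(i,k) &\leftarrow s(i,k) - \max_{k' \neq k} \bigl[ a(i,k') + s(i,k') \bigr] \\
a(i,k) &\leftarrow \min\Bigl( 0,\; r(k,k) + \sum_{i' \notin \{i,k\}} \max\bigl(0,\, r(i',k)\bigr) \Bigr), \quad i \neq k \\
a(k,k) &\leftarrow \sum_{i' \neq k} \max\bigl(0,\, r(i',k)\bigr) \\
\operatorname{exemplar}(i) &= \arg\max_{k} \bigl[ a(i,k) + r(i,k) \bigr]
\end{align*}
```

In practice each update is blended with the previous value using the damping factor, which keeps the messages from oscillating.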
Implementation in Python
In Python, the Scikit-learn library provides the AffinityPropagation class for implementing the Affinity
Propagation algorithm. The class takes several parameters, including the preference parameter, which
controls how many exemplars are chosen, and the damping factor, which controls the convergence speed
of the algorithm.
Here is an example of how to implement Affinity Propagation using the Scikit-learn library in Python -
Example
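from sklearn.cluster import AffinityPropagation
from sklearn.datasets import make_blobs
import matplotlib.pyplot as plt
# The dataset line is missing in the source; these make_blobs arguments are assumed
X, _ = make_blobs(n_samples=100, centers=4, random_state=0)
af = AffinityPropagation(preference=-50)
af.fit(X)
print("Cluster labels:", af.labels_)
print("Exemplars:", af.cluster_centers_indices_)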
In this example, we first generate a synthetic dataset using the make_blobs() function from Scikit-learn.
We then create an instance of the AffinityPropagation class with a preference value of -50 and fit the model
to the dataset using the fit() method. Finally, we print the cluster labels and the exemplars identified by the
algorithm.
Output
When you execute this code, it will produce the following plot as the output -
The preference parameter in Affinity Propagation controls the number of exemplars that are chosen. A
higher preference value leads to more exemplars, while a lower preference value leads to fewer exemplars.
The damping factor controls the convergence speed of the algorithm, with larger damping factors leading
to slower convergence.
Overall, Affinity Propagation is a powerful clustering algorithm that can identify the number of clusters
automatically and does not require a pre-specified number of clusters. However, it can be computationally
expensive and may not work well with very large datasets.
Advantages of Affinity Propagation
1. Affinity Propagation can identify the number of clusters automatically, without the number of
clusters being specified in advance.
2. It can handle clusters of different sizes.
3. It can handle datasets with noisy or incomplete data.
4. It does not need initial cluster centers to be chosen, unlike k-means.
5. It has been shown to outperform other clustering algorithms on certain types of datasets.
Disadvantages of Affinity Propagation
1. It can be computationally expensive for large datasets or datasets with many features.
2. It may converge to suboptimal solutions, especially when the data has a high degree of
variability or noise.
3. It can be sensitive to the choice of the damping factor, which controls the rate of convergence.
4. It may produce many small clusters or clusters with only one or a few members, which may
not be meaningful.
5. It can be difficult to interpret the resulting clusters, as the algorithm does not provide explicit
information about the meaning or characteristics of the clusters.
Distribution-Based Clustering
Distribution-based clustering algorithms, also known as probabilistic clustering algorithms, are a class of
machine learning algorithms that assume that the data points are generated from a mixture of probability
distributions. These algorithms aim to identify the underlying probability distributions that generate the
data, and use this information to cluster the data into groups with similar properties.
One common distribution-based clustering algorithm is the Gaussian Mixture Model (GMM). GMM
assumes that the data points are generated from a mixture of Gaussian distributions, and aims to estimate
the parameters of these distributions, including the means and covariances of each distribution. Let's see
below what GMM is and how we can implement it in the Python programming language.
Gaussian Mixture Model
The Gaussian Mixture Model (GMM) is a popular clustering algorithm used in machine learning that assumes
that the data is generated from a mixture of Gaussian distributions. In other words, GMM tries to fit a set of
Gaussian distributions to the data, where each Gaussian distribution represents a cluster in the data.
GMM has several advantages over other clustering algorithms, such as the ability to handle overlapping
clusters, model the covariance structure of the data, and provide probabilistic cluster assignments for each
data point. This makes GMM a popular choice in many applications, such as image segmentation, pattern
recognition, and anomaly detection.
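The probabilistic cluster assignments mentioned above can be inspected directly. The following sketch uses scikit-learn's GaussianMixture class; the dataset arguments and the random_state value are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=200, centers=4, random_state=0)
gmm = GaussianMixture(n_components=4, covariance_type="full", random_state=0)
gmm.fit(X)

hard_labels = gmm.predict(X)        # most likely component for each point
soft_labels = gmm.predict_proba(X)  # one probability per component, rows sum to 1
print(soft_labels[:3].round(3))
```

A point that lies between two clusters receives a split probability, which is information that a hard assignment such as the one produced by k-means does not provide.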
Implementation in Python
In Python, the Scikit-learn library provides the GaussianMixture class for implementing the GMM
algorithm. The class takes several parameters, including the number of components (i.e., the number of
clusters to identify), the covariance type, and the initialization method.
Here is an example of how to implement GMM using the Scikit-learn library in Python -
Example
from sklearn.mixture import GaussianMixture
from sklearn.datasets import make_blobs
import matplotlib.pyplot as plt
# generate a dataset
X, _ = make_blobs(n_samples=200, centers=4, random_state=0)
gmm = GaussianMixture(n_components=4)
gmm.fit(X)
labels = gmm.predict(X)
plt.scatter(X[:, 0], X[:, 1], c=labels, cmap='viridis')
plt.show()
In this example, we first generate a synthetic dataset using the make_blobs() function from Scikit-learn.
We then create an instance of the GaussianMixture class with 4 components and fit the model to the
dataset using the fit() method. Finally, we predict the cluster labels for the data points using the predict()
method and plot the data points colored by the resulting labels.
Output
When you execute this program, it will produce the following plot as the output -
Advantages of Gaussian Mixture Models
1. With enough components, a Gaussian Mixture Model (GMM) can approximate a wide range of data
distributions, making it a flexible clustering algorithm.
2. Its probabilistic framework can be extended to handle missing or incomplete data, although
scikit-learn's GaussianMixture class expects complete data.
3. It provides a probabilistic framework for clustering, which can provide more information
about the uncertainty of the clustering results.
4. It can be used for density estimation and generation of new data points that follow the same
distribution as the original data.
5. It can be used for semi-supervised learning, where some data points have known labels and
are used to train the model.
Agglomerative Clustering
Agglomerative clustering is a hierarchical clustering algorithm that starts with each data point as its own
cluster and iteratively merges the closest clusters until a stopping criterion is reached. It is a bottom-up
approach that produces a dendrogram, which is a tree-like diagram that shows the hierarchical
relationship between the clusters. The algorithm can be implemented using the scikit-learn library in Python.
Implementation in Python
We will use the iris dataset for demonstration. The first step is to import the necessary libraries and load
the dataset.
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.cluster import AgglomerativeClustering
from scipy.cluster.hierarchy import dendrogram, linkage
iris = load_iris()
X = iris.data
y = iris.target
The next step is to create a linkage matrix that records which clusters are merged at each step and the
distance between them. We can use the linkage function from the scipy.cluster.hierarchy module to create
the linkage matrix.
Z = linkage(X, 'ward')
The 'ward' method is used to calculate the distances between the clusters. At each step it merges the pair
of clusters whose merger gives the smallest increase in the total within-cluster variance.
We can visualize the dendrogram using the dendrogram function from the same module.
plt.figure(figsize=(7.5, 3.5))
dendrogram(Z)
The resulting dendrogram (see the following plot) shows the hierarchical relationship between the
clusters. We can see that the algorithm has merged the closest clusters first, and the distance between the
clusters increases as we move up the tree.
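The linkage matrix can also be cut directly into flat clusters without drawing the dendrogram. The sketch below uses SciPy's fcluster function for this; the choice of three clusters is an assumption that matches the three iris classes.

```python
from scipy.cluster.hierarchy import fcluster, linkage
from sklearn.datasets import load_iris

X = load_iris().data
Z = linkage(X, "ward")

# Each row of Z records one merge: the two cluster ids that were joined,
# the distance between them, and the size of the new cluster
print(Z.shape)

# Cut the tree so that at most three flat clusters remain
labels = fcluster(Z, t=3, criterion="maxclust")
print(sorted(set(labels)))
```

With n data points the linkage matrix has n - 1 rows, one per merge, and the last row describes the merge that joins all points into a single cluster.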
The final step is to apply the clustering algorithm and extract the cluster labels. We can use the
AgglomerativeClustering class from the sklearn.cluster module to apply the algorithm.
model = AgglomerativeClustering(n_clusters=3)
The n_clusters parameter specifies the number of clusters to be extracted from the data. In this case, we
have specified n_clusters=3 because we know that the iris dataset has three classes.
labels = model.fit_predict(X)
plt.scatter(X[:, 0], X[:, 1], c=labels)
plt.title("Agglomerative Clustering Results")
plt.show()
The resulting plot shows the three clusters identified by the algorithm. We can see that the algorithm has
largely separated the data points into their respective classes.
Example
Here is the complete implementation of Agglomerative Clustering in Python -
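import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.cluster import AgglomerativeClustering
from scipy.cluster.hierarchy import dendrogram, linkage
# Load the dataset
iris = load_iris()
X = iris.data
y = iris.target
# Build the linkage matrix and plot the dendrogram
Z = linkage(X, 'ward')
plt.figure(figsize=(7.5, 3.5))
dendrogram(Z)
plt.show()
# Fit the model and extract the cluster labels
model = AgglomerativeClustering(n_clusters=3)
labels = model.fit_predict(X)
# Plot the clusters
plt.figure(figsize=(7.5, 3.5))
plt.scatter(X[:, 0], X[:, 1], c=labels)
plt.title("Agglomerative Clustering Results")
plt.show()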
Advantages of Agglomerative Clustering
1. Produces a dendrogram that shows the hierarchical relationship between the clusters.
2. Can handle different types of distance metrics and linkage methods.
3. Allows for a flexible number of clusters to be extracted from the data.
4. Can handle moderately large datasets with efficient implementations, although memory use
grows quickly with the number of data points.
Dimensionality Reduction
Dimensionality reduction in machine learning is the process of reducing the number of features or
variables in a dataset while retaining as much of the original information as possible. In other words, it is
a way of simplifying the data by reducing its complexity.
The need for dimensionality reduction arises when a dataset has a large number of features or variables.
Having too many features can lead to overfitting and increase the complexity of the model. It can also
make it difficult to visualize the data and can slow down the training process.
Feature Selection
This involves selecting a subset of the original features based on certain criteria, such as their importance
or relevance to the target variable.
1. Filter Methods
2. Wrapper Methods
3. Embedded Methods
Feature Extraction
Feature extraction is a process of transforming raw data into a set of meaningful features that can be
used for machine learning models. It involves reducing the dimensionality of the input data by selecting,
combining or transforming features to create a new set of features that are more useful for the machine
learning model.
Dimensionality reduction can improve the accuracy and speed of machine learning models, reduce
overfitting, and simplify data visualization.
Feature Selection
Feature selection is an important step in machine learning that involves selecting a subset of the available
features to improve the performance of the model. The following are some commonly used feature
selection techniques -
Filter Methods
This method involves evaluating the relevance of each feature by calculating a statistical measure (e.g.,
correlation, mutual information, chi-square, etc.) and ranking the features based on their scores. Features
that have low scores are then removed from the model.
To implement filter methods in Python, you can use the SelectKBest or SelectPercentile classes from the
sklearn.feature_selection module. Below is a small code snippet to implement feature selection.
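A minimal sketch of a filter method is shown here, using SelectKBest with the chi-square score on the iris dataset. The dataset and the choice of k=2 are assumptions made for illustration; note that the chi-square test requires non-negative feature values.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2

X, y = load_iris(return_X_y=True)

# Keep the two features with the highest chi-square scores (k=2 is an arbitrary choice)
selector = SelectKBest(score_func=chi2, k=2)
X_new = selector.fit_transform(X, y)

print(selector.scores_)                    # one score per original feature
print(selector.get_support(indices=True))  # indices of the features that were kept
```

SelectPercentile works the same way, except that it keeps a percentage of the features instead of a fixed number.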
Wrapper Methods
This method involves evaluating the model's performance by adding or removing features and selecting
the subset of features that yields the best performance. This approach is computationally expensive, but it
is often more accurate than filter methods because it takes the model itself into account.
To implement wrapper methods in Python, you can use the RFE (Recursive Feature Elimination) class
from the sklearn.feature_selection module. Below is a small code snippet to implement a wrapper method.
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
estimator = LogisticRegression()
selector = RFE(estimator, n_features_to_select=5)  # the count is cut off in the source; 5 is assumed
selector = selector.fit(X, y)
X_new = selector.transform(X)
Embedded Methods
This method involves incorporating feature selection into the model building process itself. This can be
done using techniques such as Lasso regression, Ridge regression, or Decision Trees. These methods assign
weights to each feature, and features with low weights are removed from the model. Lasso in particular
can shrink the weights of unhelpful features exactly to zero.
To implement embedded methods in Python, you can use the Lasso or Ridge regression classes from the
sklearn.linear_model module. Below is a small code snippet for implementing embedded methods -
from sklearn.linear_model import Lasso
lasso = Lasso(alpha=0.1)
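To show how a fitted Lasso model can drive the selection, here is a short sketch that combines it with scikit-learn's SelectFromModel helper. The bundled diabetes regression dataset and the small threshold of 1e-5 are assumptions chosen for this example.

```python
from sklearn.datasets import load_diabetes
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso

X, y = load_diabetes(return_X_y=True)

lasso = Lasso(alpha=0.1)
lasso.fit(X, y)
print(lasso.coef_)  # coefficients shrunk to zero mark features the model ignores

# Keep only the features whose coefficient is (numerically) non-zero
selector = SelectFromModel(lasso, prefit=True, threshold=1e-5)
X_new = selector.transform(X)
print(X.shape, "->", X_new.shape)
```

Increasing alpha strengthens the penalty, so more coefficients are driven to zero and fewer features survive.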
To implement PCA in Python, you can use the PCA class from the sklearn.decomposition module. For
example, you can use PCA to reduce the number of features as shown in the following code -
To implement RFE with cross-validation in Python, you can use the RFECV (Recursive Feature Elimination
with Cross-Validation) class from the sklearn.feature_selection module. For example, below is a small code
snippet that applies Recursive Feature Elimination -
X_new = selector.transform(X)
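A complete RFECV run might look like the following sketch. The synthetic dataset from make_classification, the logistic regression estimator, and the five-fold cross-validation are all assumptions chosen to keep the example small and fast.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression

# Synthetic data: 10 features, of which 3 are informative (all values are assumptions)
X, y = make_classification(n_samples=300, n_features=10, n_informative=3,
                           n_redundant=0, random_state=0)

estimator = LogisticRegression(max_iter=1000)
selector = RFECV(estimator, step=1, cv=5)
selector = selector.fit(X, y)
X_new = selector.transform(X)

print("features kept:", selector.n_features_)
print("mask:", selector.support_)
```

Unlike plain RFE, RFECV does not need the number of features to keep; it chooses the number that gives the best cross-validated score.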
These feature selection techniques can be used alone or in combination to improve the performance of
machine learning models. It is important to choose the appropriate technique based on the size of the
dataset, the nature of the features, and the type of model being used.
Example
In the example below, we will implement three feature selection methods - univariate feature selection
using the chi-square test, recursive feature elimination with cross-validation (RFECV), and principal
component analysis (PCA).
We will use the Breast Cancer Wisconsin (Diagnostic) Dataset, which is included in scikit-learn. This
dataset contains 569 samples with 30 features, and the task is to classify whether a tumor is malignant or
benign based on these features.
Here is the Python code to implement these feature selection methods on the Breast Cancer Wisconsin
(Diagnostic) Dataset -
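import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, chi2, RFECV
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
# The breast cancer data is loaded to match the text (chi2 needs non-negative features)
# k, n_components, test_size and cv below are assumed values; the originals are illegible
X, y = load_breast_cancer(return_X_y=True)
# Univariate feature selection with the chi-square test
selector = SelectKBest(chi2, k=10)
X_new = selector.fit_transform(X, y)
X_train, X_test, y_train, y_test = train_test_split(X_new, y, test_size=0.3, random_state=42)
clf = LogisticRegression(max_iter=10000)
clf.fit(X_train, y_train)
accuracy = clf.score(X_test, y_test)
print("Accuracy using univariate feature selection: {:.2f}".format(accuracy))
# Recursive feature elimination with cross-validation
estimator = LogisticRegression(max_iter=10000)
selector = RFECV(estimator, step=1, cv=5)
X_new = selector.fit_transform(X, y)
X_pca = PCA(n_components=10).fit_transform(X)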
Output
When you execute this code, it will produce the following output on the terminal -
Example
Here is an example of how to perform feature extraction using Principal Component Analysis (PCA) on the
Iris Dataset using Python -
In this code, we first import the necessary libraries, including sklearn for performing feature extraction
using PCA and matplotlib for visualizing the transformed data.
Next, we load the Iris Dataset using load_iris(). We then perform feature extraction using PCA with PCA()
and set the number of components to 2 (n_components=2). This reduces the dimensionality of the input
data from 4 features to 2 principal components.
We then transform the input data using fit_transform() and store the transformed data in X_pca. Finally,
we visualize the transformed data using plt.scatter() and color the data points based on their target value.
We label the axes as PC1 and PC2, which are the first and second principal components, respectively, and
show the plot using plt.show().
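The steps described above can be sketched as follows. This is a reconstruction based on the description, not the original listing, and the printed explained variance ratio is an addition for illustration.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

iris = load_iris()
X, y = iris.data, iris.target

# Project the 4 original features onto the first 2 principal components
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)
print(pca.explained_variance_ratio_)  # share of the variance captured by each component

plt.scatter(X_pca[:, 0], X_pca[:, 1], c=y)
plt.xlabel("PC1")
plt.ylabel("PC2")
plt.show()
```

The explained variance ratio is a quick check of how much information is lost by keeping only two components.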
Output
When you execute the given program, it will produce the following plot as the output -
Advantages of Feature Extraction
1. Reduced Dimensionality - Feature extraction reduces the dimensionality of the input data
by transforming it into a new set of features. This makes the data easier to visualize, process
and analyze.
2. Improved Performance - Feature extraction can improve the performance of machine learning
algorithms by creating a set of more meaningful features that capture the essential
information from the input data.
3. Feature Selection - Feature extraction can be used to perform feature selection by selecting
a subset of the most relevant features that are most informative for the machine learning
model.
4. Noise Reduction - Feature extraction can also help reduce noise in the data by filtering out
irrelevant features or combining related features.
Backward Elimination
Backward Elimination is a feature selection technique used in machine learning to select the most
significant features for a predictive model. In this technique, we start by considering all the features
initially, and then we iteratively remove the least significant features until we get the subset of features
that gives the best performance.
Implementation in Python
To implement Backward Elimination in Python, you can follow these steps -
import statsmodels.api as sm
Load your dataset into a Pandas DataFrame. We will be using the Pima Indians Diabetes dataset.
diabetes = pd.read_csv(r'C:\Users\Leekha\Desktop\[Link]')
Define the predictor variables (X) and the target variable (y).
X = diabetes.iloc[:, :-1].values
y = diabetes.iloc[:, -1].values
Use the Ordinary Least Squares (OLS) method from the statsmodels library to fit the multiple linear
regression model with all the predictor variables.
Check the p-value of each predictor variable and remove the one with the highest p-value (i.e., the least
significant).
Repeat the previous two steps until all the remaining predictor variables have a p-value below the
significance level (e.g., 0.05).
regressor_OLS = sm.OLS(endog = y, exog = X_opt).fit()
regressor_OLS.summary()
The final subset of predictor variables with p-values below the significance level is the optimal set of
features for the model.
Example
Here is the complete implementation of Backward Elimination in Python -
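import pandas as pd
import numpy as np
import statsmodels.api as sm
# Load the dataset
diabetes = pd.read_csv(r'C:\Users\Leekha\Desktop\[Link]')
# Define the predictor variables (X) and the target variable (y)
X = diabetes.iloc[:, :-1].values
y = diabetes.iloc[:, -1].values
# Add a column of ones to the predictor variables to represent the intercept
X = np.append(arr = np.ones((len(X), 1)).astype(int), values = X, axis=1)
# Fit the multiple linear regression model with all the predictor variables
X_opt = X
regressor_OLS = sm.OLS(endog = y, exog = X_opt).fit()
# Check the p-values of each predictor variable and remove the one with the highest p-value
# Repeat the above step until all the remaining predictor variables have a p-value below 0.05
# (this loop automates the manual removal described in the steps above)
while regressor_OLS.pvalues.max() > 0.05 and X_opt.shape[1] > 1:
    X_opt = np.delete(X_opt, regressor_OLS.pvalues.argmax(), axis=1)
    regressor_OLS = sm.OLS(endog = y, exog = X_opt).fit()
print(regressor_OLS.summary())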
Output
When you execute this program, it will produce the following output
OLS Regression Results
The goal of feature selection is to identify the most important features that are relevant for predicting the
target variable, while ignoring the less important features that add noise to the model and may lead to
overfitting.
Example
Here is an example to implement Forward Feature Construction in Python -
import pandas as pd
diabetes = pd.read_csv(r'C:\Users\Leekha\Desktop\[Link]')
X = diabetes.iloc[:, :-1].values
# Update the best feature and score if the current feature performs better
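# best_score, best_feature and j are assumed names; the original lines are illegible
if score > best_score:
    best_score, best_feature = score, j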
# Add the best feature to the set of selected features
Output
On execution, it will produce the following output -
Selected Features: [1]
Score: 0.23530716168783583
Selected Features: [0,1]
Score: 0.2923143573608237
Selected Features: [0,1, 5]
Score: 0.3164103491569179
Selected Features: [0,1, 5, 6]
Score: 0.3287368302427327
Selected Features: [0,1, 2, 5, 6]
Score: 0.334586804842275
Selected Features: [0,1, 2, 3, 5, 6]
Score: 0.3356264736550455
Selected Features: [0,1, 2, 3,4, 5, 6]
Score: 0.3313166516703744
Selected Features: [0,1, 2, 3, 4, 5, 6, 7]
Score: 0.32230203252064216
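Since most of the listing above did not survive, here is a self-contained sketch of greedy forward selection. It is not the original code: it uses scikit-learn's bundled diabetes regression dataset in place of the CSV file, and it scores each candidate subset with five-fold cross-validated R-squared, both of which are assumptions. The scores it prints will therefore differ from the output shown above.

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

X, y = load_diabetes(return_X_y=True)
model = LinearRegression()

selected, history = [], []
remaining = list(range(X.shape[1]))
while remaining:
    # Score every candidate feature when it is added to the current selection
    scores = {j: cross_val_score(model, X[:, selected + [j]], y,
                                 cv=5, scoring="r2").mean()
              for j in remaining}
    best_feature = max(scores, key=scores.get)
    # Add the best feature to the set of selected features
    selected.append(best_feature)
    remaining.remove(best_feature)
    history.append((list(selected), scores[best_feature]))
    print("Selected Features:", sorted(selected))
    print("Score:", scores[best_feature])

# The subset with the highest score is the one to keep
best_subset, best_score = max(history, key=lambda item: item[1])
```

Because the score is measured on held-out folds, it can go down when a weak feature is added, which gives a natural stopping point for the search.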
High Correlation Filter
High Correlation Filter is a feature selection technique used in machine learning to identify and remove
highly correlated features from the dataset. This technique is used to improve the performance of the
model by reducing the number of features used for training the model and to avoid the problem of
multicollinearity, which occurs when two or more predictor variables are highly correlated with each other.
The High Correlation Filter works by computing the correlation between each pair of features in the
dataset and removing one of the two features that are highly correlated with each other. This is done by
setting a threshold for the correlation coefficient between the features, and removing one of the features if
the absolute value of the correlation coefficient is greater than the threshold.
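The pairwise thresholding just described can be sketched with NumPy alone. The function name high_correlation_filter, the 0.9 threshold, and the synthetic data are assumptions made for illustration; the rule of keeping the earlier column of each correlated pair is one reasonable convention, not the only one.

```python
import numpy as np

def high_correlation_filter(X, threshold=0.9):
    """Return the column indices kept after dropping one of each highly correlated pair."""
    corr = np.abs(np.corrcoef(X, rowvar=False))  # absolute Pearson correlations
    n_features = X.shape[1]
    to_remove = set()
    for i in range(n_features):
        if i in to_remove:
            continue
        for j in range(i + 1, n_features):
            if j not in to_remove and corr[i, j] > threshold:
                to_remove.add(j)  # keep the earlier column, drop the later one
    return [k for k in range(n_features) if k not in to_remove]

# Synthetic data (an assumption): column 1 is almost a multiple of column 0.
rng = np.random.default_rng(0)
a = rng.normal(size=100)
b = rng.normal(size=100)
X = np.column_stack([a, 2.0 * a + 0.01 * rng.normal(size=100), b])

kept = high_correlation_filter(X, threshold=0.9)
print("Kept columns:", kept)
```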
The advantage of using High Correlation Filter is that it reduces the number of features used for training
the model, which in turn reduces the complexity of the model and makes it easier to interpret. Moreover, it
helps to avoid the problem of multicollinearity, which can lead to unstable and unreliable estimates of the
model parameters.
However, there are some limitations to High Correlation Filter. For example, it may not always select the
best set of features for the model, especially if there are non-linear relationships between the features and
the target variable. Also, if two features are highly correlated, removing one of them may result in the loss
of some important information that was present in the removed feature.
Example
Here is an example to implement High Correlation Filter in Python -
import pandas as pd
import numpy as np
diabetes = pd.read_csv(r'C:\Users\Leekha\Desktop\[Link]')
X = [Link][:,
threshold =
for i, j in zip(*high_corr_indices):
    if i != j and (j, i) not in features_to_remove:
features_to_remove = list(features_to_remove)
Output
When you execute this code, it will produce the following output -
Advantages of High Correlation Filter
Following are the advantages of using High Correlation Filter -
1. Reduces multicollinearity - The High Correlation Filter can reduce multicollinearity, which
occurs when two or more features are highly correlated with each other. Multicollinearity can
negatively impact the performance of machine learning models.
2. Improves model performance - By removing highly correlated features, the High Correlation
Filter can improve the performance of machine learning models.
3. Simplifies the model - With fewer features, the model can be easier to interpret and understand.
4. Saves computational resources - With fewer features, the computational resources required
to train machine learning models are reduced.
Disadvantages of High Correlation Filter
Following are the disadvantages of using High Correlation Filter -
1. Information loss - The High Correlation Filter can lead to information loss because it
removes features that may contain important information.
2. Misses non-linear relationships - The correlation coefficient measures only linear relationships,
so the filter may not work well for datasets where the relationships between the features are
non-linear.
3. Impact on predictions - Removing highly correlated features can sometimes reduce how well the
model predicts the dependent variable, particularly if the removed features are strongly
correlated with the dependent variable.
4. Selection bias - The High Correlation Filter may introduce selection bias if it removes
features that are important for predicting the dependent variable.
The Low Variance Filter works by computing the variance of each feature in the dataset and removing the
features that have a variance below a certain threshold. This is done because features with low variance
have little or no discriminatory power and are unlikely to be useful for predicting the target variable.
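A minimal NumPy sketch of this thresholding follows. The function name low_variance_filter, the 0.01 threshold, and the small hand-made array are assumptions for illustration. Because variance depends on the unit of measurement, the features are assumed to be on comparable scales before the filter is applied.

```python
import numpy as np

def low_variance_filter(X, threshold=0.01):
    """Return the indices of columns whose variance exceeds the threshold."""
    variances = X.var(axis=0)  # population variance of each feature
    keep = np.where(variances > threshold)[0]
    return keep.tolist(), variances

# Hand-made data (an assumption): column 1 is constant, column 2 barely varies.
X = np.array([
    [1.0, 0.0, 10.0],
    [2.0, 0.0, 10.1],
    [3.0, 0.0, 9.9],
    [4.0, 0.0, 10.0],
])

kept, variances = low_variance_filter(X, threshold=0.01)
print("Variances:", variances)
print("Kept columns:", kept)
```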
Example
Here is an example to implement Low Variance Filter in Python -
import pandas as pd
import numpy as np
diabetes = pd.read_csv(r'C:\Users\Leekha\Desktop\[Link]')
X = [Link][:,
Output
When you execute this code, it will produce the following output -
Advantages of Low Variance Filter
Following are the advantages of using Low Variance Filter -
1. Reduces overfitting - The Low Variance Filter can help reduce overfitting by removing
features that do not contribute much to the prediction of the target variable.
2. Saves computational resources - With fewer features, the computational resources required
to train machine learning models are reduced.
3. Improves model performance - By removing low variance features, the Low Variance Filter
can improve the performance of machine learning models.
4. Simplifies the model - With fewer features, the model can be easier to interpret and understand.
Disadvantages of Low Variance Filter
Following are the disadvantages of using Low Variance Filter -
1. Information loss - The Low Variance Filter can lead to information loss because it removes
features that may contain important information.
2. Ignores relationships with the target - The Low Variance Filter examines each feature in
isolation, so it can discard a low variance feature that is nevertheless related to the target
variable, whether linearly or non-linearly.
3. Impact on predictions - Removing low variance features can sometimes reduce how well the model
predicts the dependent variable, particularly if the removed features are important for
predicting it.
4. Selection bias - The Low Variance Filter may introduce selection bias if it removes features
that are important for predicting the dependent variable.
Missing Values Ratio
Missing Values Ratio is a feature selection technique used in machine learning to identify and remove
features from the dataset that have a high percentage of missing values. This technique is used to improve
the performance of the model by reducing the number of features used for training the model and to avoid
the problem of bias caused by missing values.
The Missing Values Ratio works by computing the percentage of missing values for each feature in the
dataset and removing the features that have a missing value percentage above a certain threshold. This is
done because features with a high percentage of missing values may not be useful for predicting the target
variable and can introduce bias into the model.
1. Compute the percentage of missing values for each feature in the dataset.
2. Set a threshold for the percentage of missing values for the features.
3. Remove the features that have a missing value percentage above the threshold.
4. Use the remaining features for training the machine learning model.
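The four steps above can be sketched with NumPy as follows. The function name missing_values_ratio_filter, the 0.5 threshold, and the tiny array are assumptions for illustration; missing entries are assumed to be encoded as NaN.

```python
import numpy as np

def missing_values_ratio_filter(X, threshold=0.5):
    """Keep the columns whose fraction of missing (NaN) entries is at most the threshold."""
    missing_ratio = np.isnan(X).mean(axis=0)         # step 1: ratio per feature
    keep = np.where(missing_ratio <= threshold)[0]   # steps 2-3: apply the threshold
    return keep.tolist(), missing_ratio

nan = np.nan
X = np.array([
    [1.0, nan, 5.0],
    [2.0, nan, nan],
    [3.0, nan, 7.0],
    [4.0, 1.0, 8.0],
])

kept, ratio = missing_values_ratio_filter(X, threshold=0.5)
X_reduced = X[:, kept]  # step 4: use the remaining features for training
print("Missing ratio per feature:", ratio)
print("Kept columns:", kept)
```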
Example
Here is an example of how you can implement Missing Values Ratio in Python -
diabetes = [Link](r'C:\Users\Leekha\Desktop\[Link]', delimiter=',')
# Define the predictor variables (X) and the target variable (y)
X = diabetes[:, :-1]
y = diabetes[:, -1]
Output
When you execute this code, it will produce the following output -
Advantages of Missing Value Ratio
Following are the advantages of using Missing Value Ratio -
1. Saves computational resources - With fewer features, the computational resources required
to train machine learning models are reduced.
2. Improves model performance - By removing features with a high percentage of missing values,
the Missing Value Ratio can improve the performance of machine learning models.
3. Simplifies the model - With fewer features, the model can be easier to interpret and understand.
4. Reduces bias - By removing features with a high percentage of missing values, the Missing
Value Ratio can reduce bias in the model.
Disadvantages of Missing Value Ratio
Following are the disadvantages of using Missing Value Ratio -
1. Information loss - The Missing Value Ratio can lead to information loss because it removes
features that may contain important information.
2. Discards observed values - Removing a feature with a high percentage of missing values also
discards its non-missing values, which may still be important for predicting the dependent
variable.
3. Impact on predictions - Removing features with a high percentage of missing values can
sometimes reduce how well the model predicts the dependent variable, particularly if the
removed features are important for predicting it.
4. Selection bias - The Missing Value Ratio may introduce selection bias if it removes features
that are important for predicting the dependent variable.
PCA works by identifying the principal components (PCs) of the data, which are linear combinations of
the original variables that capture the most variation in the data. The first principal component accounts
for the most variance in the data, followed by the second principal component, and so on. By reducing the
dimensionality of the data to only the most significant PCs, PCA can simplify the problem and improve the
computational efficiency of downstream machine learning algorithms.
1. Standardize the data - PCA requires that the data be standardized to have zero mean and unit
variance.
2. Compute the covariance matrix - PCA computes the covariance matrix of the standardized
data.
3. Compute the eigenvectors and eigenvalues of the covariance matrix - PCA then computes
the eigenvectors and eigenvalues of the covariance matrix.
4. Select the principal components - PCA selects the principal components based on their
corresponding eigenvalues, which indicate the amount of variation in the data explained by each
component.
5. Project the data onto the new feature space - PCA projects the data onto the new feature
space defined by the selected principal components.
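The five steps above can be followed literally with NumPy. This is a sketch rather than a replacement for a library implementation; the function name pca and the synthetic data are assumptions for illustration.

```python
import numpy as np

def pca(X, n_components):
    """Project X onto its leading principal components, following the five steps above."""
    # 1. Standardize each feature to zero mean and unit variance.
    X_std = (X - X.mean(axis=0)) / X.std(axis=0)
    # 2. Covariance matrix of the standardized data.
    cov = np.cov(X_std, rowvar=False)
    # 3. Eigen-decomposition (eigh is suitable because cov is symmetric).
    eigenvalues, eigenvectors = np.linalg.eigh(cov)
    # 4. Sort components by decreasing eigenvalue and keep the leading ones.
    order = np.argsort(eigenvalues)[::-1]
    eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]
    explained_ratio = eigenvalues / eigenvalues.sum()
    components = eigenvectors[:, :n_components]
    # 5. Project the data onto the new feature space.
    return X_std @ components, explained_ratio[:n_components]

# Synthetic data (an assumption): two nearly collinear columns plus one independent column.
rng = np.random.default_rng(0)
t = rng.normal(size=200)
X = np.column_stack([t, 2.0 * t + 0.1 * rng.normal(size=200), rng.normal(size=200)])

X_proj, ratio = pca(X, n_components=2)
print("Projected shape:", X_proj.shape)
print("Explained variance ratio:", ratio)
```

Because the first two columns carry almost the same information, the first component should absorb roughly two thirds of the total variance here.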
Example
Here is an example of how you can implement PCA in Python using the scikit-learn library
In this example, we load the iris dataset, standardize the data, and create a PCA object with two
components. We then fit the PCA object to the standardized data and transform the data onto the two principal
components. We print the explained variance ratio of the selected components and plot the transformed
data using the first two principal components as the x and y axes.
Output
When you execute this code, it will produce the following plot as the output -
(Figure: scatter plot of the transformed Iris data, with PC1 on the x-axis and PC2 on the y-axis.)
Explained variance ratio: [0.72962445 0.22850762]
Advantages of PCA
Following are the advantages of using Principal Component Analysis -
Disadvantages of PCA
Following are the disadvantages of using Principal Component Analysis -
1. Information loss - PCA reduces the dimensionality of the data by projecting it onto a lower
dimensional space, which may lead to some loss of information.
2. Can be sensitive to outliers - PCA can be sensitive to outliers, which can have a significant
impact on the resulting principal components.
3. Interpretability may be reduced - Although PCA can improve interpretability by reducing
the number of features, the resulting principal components may be more difficult to interpret
than the original features.
4. Assumes linearity - PCA assumes that the relationships between the features are linear,
which may not always be the case.
5. Requires standardization - PCA requires that the data be standardized, which may not
always be possible or appropriate.
Performance Metrics
Performance metrics in machine learning are used to evaluate the performance of a machine learning
model. These metrics provide quantitative measures to assess how well a model is performing and to
compare the performance of different models. Performance metrics are important because they help us
understand how well our model is performing and whether it is meeting our requirements. In this way, we
can make informed decisions about whether to use a particular model or not.
There are many performance metrics that can be used in machine learning, depending on the type of
problem being solved and the specific requirements of the problem. Some common performance metrics
include -
1. Accuracy - Accuracy is one of the most basic performance metrics and measures the proportion
of correctly classified instances in the dataset. It is calculated as the number of correctly
classified instances divided by the total number of instances in the dataset.
2. Precision - Precision measures the proportion of true positive instances out of all predicted
positive instances. It is calculated as the number of true positive instances divided by the sum
of true positive and false positive instances.
3. Recall - Recall measures the proportion of true positive instances out of all actual positive
instances. It is calculated as the number of true positive instances divided by the sum of true
positive and false negative instances.
4. F1 Score - F1 score is the harmonic mean of precision and recall. It is a balanced measure that
takes into account both precision and recall. It is calculated as 2 * (precision * recall) /
(precision + recall).
5. ROC AUC Score - ROC AUC (Receiver Operating Characteristic Area Under the Curve) score is
a measure of the ability of a classifier to distinguish between positive and negative instances.
It is calculated by plotting the true positive rate against the false positive rate at different
classification thresholds and calculating the area under the curve.
6. Confusion Matrix - A confusion matrix is a table that is used to evaluate the performance of a
classification model. It shows the number of true positives, true negatives, false positives, and
false negatives for each class in the dataset.
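The first four formulas in the list above can be checked by hand with a few lines of plain Python. The function name binary_classification_metrics and the example labels are assumptions for illustration; label 1 is treated as the positive class.

```python
def binary_classification_metrics(y_true, y_pred):
    """Compute confusion-matrix counts and the metrics derived from them."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"tp": tp, "tn": tn, "fp": fp, "fn": fn,
            "accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
metrics = binary_classification_metrics(y_true, y_pred)
for name, value in metrics.items():
    print(name, value)
```

With these labels there are 3 true positives, 3 true negatives, 1 false positive and 1 false negative, so all four metrics come out to 0.75.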
Example
Here is an example code snippet to calculate the accuracy, precision, recall, and F1 score for a binary
classification problem -
model = LogisticRegression()
print("Precision:", precision)
print("Recall:", recall)
print("F1 Score:", f1)
Output
When you execute this code, it will produce the following output -
Accuracy: 1.0
Precision: 1.0
Recall: 1.0
F1 Score: 1.0
Automatic Workflows
Introduction
To execute and produce results reliably, a machine learning project must automate some standard
workflows. These standard workflows can be automated with the help of Scikit-learn Pipelines. From a
data scientist's perspective, a pipeline is a generalized but very important concept: it allows data to
flow from its raw format to useful information. The working of pipelines can be understood with the
help of the following diagram -
The blocks of ML pipelines are as follows -
Data ingestion - As the name suggests, it is the process of importing the data for use in an ML project.
The data can be extracted in real time or in batches from single or multiple systems. It is one of the most
challenging steps because the quality of data can affect the whole ML model.
Data Preparation - After importing the data, we need to prepare it for use in our ML model. Data
preprocessing is one of the most important techniques of data preparation.
ML Model Training - The next step is to train our ML model. We can choose among supervised,
unsupervised, and reinforcement learning algorithms to learn patterns from the data and make predictions.
Model Evaluation - Next, we need to evaluate the ML model. In an AutoML pipeline, the ML model can be
evaluated with the help of various statistical methods and business rules.
ML Model retraining - In an AutoML pipeline, the first model is not necessarily the best one. The
first model is treated as a baseline, and we can retrain it repeatedly to increase the model's accuracy.
Deployment - Finally, we need to deploy the model. This step involves applying and migrating the model
to business operations for their use.
Quality of Data
The success of any ML model depends heavily on the quality of data. If the data we provide to the ML
model is not accurate, reliable and robust, we are going to end up with wrong or misleading output.
Data Reliability
Another challenge associated with ML pipelines is the reliability of the data we provide to the ML model.
A data scientist can acquire data from various sources, but to get the best results, the data sources
must be reliable and trusted.
Data Accessibility
To get the best results out of ML pipelines, the data itself must be accessible, which requires
consolidation, cleansing and curation of data. As a result of making the data accessible, metadata will be
updated with new tags.
By using ML pipelines, we can prevent data leakage from the held-out data into training, because pipelines
ensure that data preparation such as standardization is constrained to the training portion of each fold of
our cross-validation procedure.
Example
The following is an example in Python that demonstrates a data preparation and model evaluation workflow.
For this purpose, we are using the Pima Indians Diabetes dataset, loaded from a CSV file. First, we will
create a pipeline that standardizes the data. Then a Linear Discriminant Analysis model will be created,
and finally the pipeline will be evaluated using 10-fold cross validation.
First, import the required packages as follows -
Now, we need to load the Pima diabetes dataset as we did in the previous examples -
path = r"C:\[Link]"
Next, we will create a pipeline with the help of the following code -
estimators = []
estimators.append(('standardize', StandardScaler()))
estimators.append(('lda', LinearDiscriminantAnalysis()))
model = Pipeline(estimators)
At last, we are going to evaluate this pipeline and output its accuracy as follows -
Output
0.7790148448043184
The above output is the summary of accuracy of the setup on the dataset.
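Because the listing above depends on a local CSV file, a self-contained variant of the same workflow may be easier to run. The sketch below is an assumption-laden substitute: it uses scikit-learn's built-in breast cancer dataset instead of the Pima data, and the fold seed of 7 is arbitrary, so the accuracy will differ from the figure quoted above.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Built-in dataset used here only as a stand-in for the Pima CSV file.
X, y = load_breast_cancer(return_X_y=True)

# The scaler lives inside the pipeline, so it is refitted on the training
# part of every fold and never sees the held-out part.
estimators = [
    ('standardize', StandardScaler()),
    ('lda', LinearDiscriminantAnalysis()),
]
model = Pipeline(estimators)

kfold = KFold(n_splits=10, shuffle=True, random_state=7)
results = cross_val_score(model, X, y, cv=kfold)
print(results.mean())
```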
Example
The following is an example in Python that demonstrates a feature extraction and model evaluation
workflow. For this purpose, we are using the Pima Indians Diabetes dataset, loaded from a CSV file.
First, 3 features will be extracted with PCA (Principal Component Analysis). Then, 6 features will be
selected with a univariate statistical test (SelectKBest). After feature extraction, the results of the
feature selection and extraction procedures will be combined by using the FeatureUnion tool. Finally, a
Logistic Regression model will be created, and the pipeline will be evaluated using 10-fold cross validation.
Now, we need to load the Pima diabetes dataset as we did in the previous examples -
path = r"C:\[Link]"
headernames = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
Next, feature union will be created as follows -
features = []
features.append(('pca', PCA(n_components=3)))
features.append(('select_best', SelectKBest(k=6)))
feature_union = FeatureUnion(features)
Next, the pipeline will be created with the help of the following script lines -
estimators = []
estimators.append(('feature_union', feature_union))
estimators.append(('logistic', LogisticRegression()))
model = Pipeline(estimators)
At last, we are going to evaluate this pipeline and output its accuracy as follows -
Output
0.7789811066126855
The above output is the summary of accuracy of the setup on the dataset.
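A self-contained variant of the FeatureUnion workflow is sketched below. Several choices are assumptions not taken from the listing above: the built-in breast cancer dataset stands in for the Pima CSV file, a StandardScaler step and max_iter=1000 are added so that logistic regression converges cleanly, and the fold seed is arbitrary.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# Combine 3 PCA components with the 6 best univariate features (9 columns in total).
feature_union = FeatureUnion([
    ('pca', PCA(n_components=3)),
    ('select_best', SelectKBest(k=6)),
])

model = Pipeline([
    ('standardize', StandardScaler()),          # added assumption, not in the original listing
    ('feature_union', feature_union),
    ('logistic', LogisticRegression(max_iter=1000)),
])

kfold = KFold(n_splits=10, shuffle=True, random_state=7)
results = cross_val_score(model, X, y, cv=kfold)
print(results.mean())
```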
Boosting is a popular ensemble learning technique that combines several weak learners to create a strong
learner. It works by iteratively training weak learners on subsets of the data and assigning higher weights
to the misclassified samples to increase their importance in the subsequent iterations. This process is
repeated until the desired level of performance is achieved.
1. Feature Engineering - Feature engineering involves creating new features from the existing
features or transforming the existing features to make them more informative for the model.
This can include techniques such as one-hot encoding, scaling, normalization, and feature
selection.
2. Hyperparameter Tuning - Hyperparameters are parameters that are not learned during
training but are set by the data scientist. They control the behavior of the model, and tuning
them can significantly impact model performance. Grid search and randomized search are
common techniques for hyperparameter tuning.
3. Ensemble Learning - Ensemble learning involves combining multiple models to improve
performance. Techniques such as bagging, boosting, and stacking can be used to create
ensembles. Random forests are an example of a bagging ensemble, while gradient boosting
machines (GBMs) are an example of a boosting ensemble.
4. Regularization - Regularization is a technique that prevents overfitting by adding a penalty
term to the loss function. L1 regularization (Lasso) and L2 regularization (Ridge) are common
techniques used in linear models, while dropout is a technique used in neural networks.
5. Data Augmentation - Data augmentation involves generating new data from the existing
data by applying transformations such as rotation, scaling, and flipping. This can help to
reduce overfitting and improve model performance.
6. Model Architecture - The architecture of the model can significantly impact its performance.
Techniques such as deep learning and convolutional neural networks (CNNs) can be used to
create more complex models that are better able to learn complex patterns in the data.
7. Early Stopping - Early stopping is a technique used to prevent overfitting by stopping the
training process once the model performance stops improving on a validation set. This
prevents the model from continuing to learn the noise in the data and can help to improve
generalization.
8. Cross-Validation - Cross-validation is a technique used to evaluate the performance of a
model on multiple subsets of the data. This can help to identify overfitting and can be used to
select the best hyperparameters for the model.
These techniques can be implemented in Python using various machine learning libraries such as scikit-
learn, TensorFlow, and Keras. By using these techniques, data scientists can improve the performance of
their models and create more accurate predictions.
The following example implements cross-validation using Scikit-learn -
Example
X = [Link]
y = [Link]
gb_clf = GradientBoostingClassifier()
print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))
Output
When you execute this code, it will produce the following output -
Bagging
Bagging is also known as bootstrap aggregation. In bagging methods, the ensemble model tries to
improve prediction accuracy and decrease model variance by combining the predictions of individual models
trained on randomly generated training samples. The final prediction of the ensemble is obtained by
averaging the predictions of the individual estimators (or, for classification, by majority vote). One of
the best-known examples of bagging methods is the random forest.
Boosting
In the boosting method, the main principle is to build the ensemble model incrementally by training
each base estimator sequentially. As the name suggests, it combines several weak base learners,
trained sequentially over multiple iterations of the training data, to build a powerful ensemble. During
the training of the weak base learners, higher weights are assigned to the samples that were misclassified
earlier. An example of a boosting method is AdaBoost.
Voting
In this ensemble learning model, multiple models of different types are built, and simple statistics
such as the mean, the median, or a majority vote are used to combine their predictions. The combined
result serves as the final prediction.
Bagging Ensemble Algorithms
The following are three bagging ensemble algorithms -
Now, we need to load the Pima diabetes dataset as we did in the previous examples -
path = r"C:\[Link]"
headernames = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
data = read_csv(path, names=headernames)
array = data.values
We need to provide the number of trees we are going to build. Here we are building 150 trees -
num_trees = 150
The output above shows that we got around 77% accuracy of our bagged decision tree classifier model.
Random Forest
It is an extension of bagged decision trees. For the individual classifiers, the samples of the training
dataset are taken with replacement, but the trees are constructed in a way that reduces the correlation
between them: a random subset of features is considered when choosing each split point, rather than greedily
choosing the best split point among all features in the construction of each tree.
In the following Python recipe, we are going to build a random forest ensemble model by using the
RandomForestClassifier class of sklearn on the Pima Indians diabetes dataset.
Now, we need to load the Pima diabetes dataset as we did in the previous examples -
path = r"C:\[Link]"
headernames = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
array = data.values
We need to provide the number of trees we are going to build. Here we are building 150 trees with split
points chosen from 5 features -
num_trees = 150
max_features = 5
Output
0.7629357484620642
The output above shows that we got around 76% accuracy of our bagged random forest classifier model.
Extra Trees
It is another extension of the bagged decision tree ensemble method. In this method, randomized trees are
constructed from the samples of the training dataset, with split thresholds drawn at random rather than
searched for the best value.
In the following Python recipe, we are going to build an extra trees ensemble model by using the
ExtraTreesClassifier class of sklearn on the Pima Indians diabetes dataset.
path = r"C:\[Link]"
headernames = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
array = data.values
We need to provide the number of trees we are going to build. Here we are building 150 trees with split
points chosen from 5 features -
num_trees = 150
max_features = 5
Output
0.7551435406698566
The output above shows that we got around 75.5% accuracy of our bagged extra trees classifier model.
AdaBoost
It is one of the most successful boosting ensemble algorithms. The key to this algorithm is the way it
assigns weights to the instances in the dataset: instances that earlier models misclassified receive higher
weights, so subsequent models pay more attention to them.
In the following Python recipe, we are going to build an AdaBoost ensemble model for classification by using
the AdaBoostClassifier class of sklearn on the Pima Indians diabetes dataset.
Now, we need to load the Pima diabetes dataset as we did in the previous examples -
path = r"C:\[Link]"
headernames = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
data = read_csv(path, names=headernames)
array = data.values
We need to provide the number of trees we are going to build. Here we are building 50 trees -
num_trees = 50
Next, build the model with the help of following script -
Output
0.7539473684210527
The output above shows that we got around 75% accuracy of our AdaBoost classifier ensemble model.
Now, we need to load the Pima diabetes dataset as we did in the previous examples -
path = r"C:\[Link]"
headernames = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
array = data.values
We need to provide the number of trees we are going to build. Here we are building 50 trees -
num_trees = 50
Output
0.7746582365003418
The output above shows that we got around 77.5% accuracy of our Gradient Boosting classifier ensemble
model.
In the following Python recipe, we are going to build a Voting ensemble model for classification by using
the VotingClassifier class of sklearn on the Pima Indians diabetes dataset. We are combining the predictions
of logistic regression, a Decision Tree classifier and an SVM together for a classification problem as follows -
First, import the required packages as follows -
Now, we need to load the Pima diabetes dataset as we did in the previous examples -
path = r"C:\[Link]"
headernames = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
array = data.values
estimators.append(('logistic', model1))
estimators.append(('cart', model2))
estimators.append(('svm', model3))
Now, create the voting ensemble model by combining the predictions of above created sub models.
ensemble = VotingClassifier(estimators)
results = cross_val_score(ensemble, X, Y, cv=kfold)
Output
0.7382262474367738
The output above shows that we got around 74% accuracy of our voting classifier ensemble model.
Gradient Boosting
Gradient Boosting Machines (GBM) is a powerful machine learning technique that is widely used for
building predictive models. It is a type of ensemble method that combines the predictions of multiple weaker
models to create a stronger and more accurate model.
GBM is a popular choice for a wide range of applications, including regression, classification, and ranking
problems. Let's understand the workings of GBM and how it can be used in machine learning.
The algorithm works by training a sequence of decision trees, each of which is designed to correct the
errors of the previous tree.
In each iteration, the algorithm identifies the samples in the dataset that are most difficult to predict and
focuses on improving the model's performance on these samples.
This is achieved by fitting a new decision tree that is optimized to reduce the errors on the difficult
samples. The process continues until a specified stopping criterion is met, such as reaching a certain level
of accuracy or the maximum number of iterations.
How Does a Gradient Boosting Machine Work?
The basic steps involved in training a GBM model are as follows -
1. Initialize the model - The algorithm starts by creating a simple model, such as a single
decision tree, to serve as the initial model.
2. Calculate residuals - The initial model is used to make predictions on the training data, and
the residuals are calculated as the differences between the actual values and the predicted
values.
3. Train a new model - A new decision tree is trained on the residuals, with the goal of
minimizing the errors on the difficult samples.
4. Update the model - The predictions of the new model are added to the predictions of the
previous model, and the residuals are recalculated based on the updated predictions.
5. Repeat - Steps 3-4 are repeated until a specified stopping criterion is met.
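The five steps above can be sketched for a one-dimensional regression problem using NumPy only. Several choices here are assumptions made for illustration: the initial model is simply the mean of the targets, each new model is a one-split decision stump, the loss is squared error, and the function names (fit_stump, fit_gbm, predict_gbm) are hypothetical.

```python
import numpy as np

def fit_stump(x, residuals):
    """Find the single split on x that best fits the residuals with two constants."""
    best = None
    for threshold in np.unique(x)[:-1]:  # skip the maximum so the right side is never empty
        left = residuals[x <= threshold]
        right = residuals[x > threshold]
        error = np.sum((left - left.mean()) ** 2) + np.sum((right - right.mean()) ** 2)
        if best is None or error < best[0]:
            best = (error, threshold, left.mean(), right.mean())
    return best[1:]  # (threshold, left_value, right_value)

def predict_stump(stump, x):
    threshold, left_value, right_value = stump
    return np.where(x <= threshold, left_value, right_value)

def fit_gbm(x, y, n_rounds=100, learning_rate=0.1):
    initial = y.mean()                                   # step 1: simple initial model
    prediction = np.full(len(y), initial, dtype=float)
    stumps = []
    for _ in range(n_rounds):                            # step 5: repeat
        residuals = y - prediction                       # step 2: actual minus predicted
        stump = fit_stump(x, residuals)                  # step 3: fit a model to the residuals
        prediction += learning_rate * predict_stump(stump, x)  # step 4: update
        stumps.append(stump)
    return initial, learning_rate, stumps

def predict_gbm(model, x):
    initial, learning_rate, stumps = model
    return initial + learning_rate * sum(predict_stump(s, x) for s in stumps)

# Synthetic data (an assumption): a noiseless sine curve.
x = np.linspace(0.0, 6.0, 100)
y = np.sin(x)

model = fit_gbm(x, y)
baseline_mse = np.mean((y - y.mean()) ** 2)
gbm_mse = np.mean((y - predict_gbm(model, x)) ** 2)
print("Baseline MSE:", baseline_mse)
print("GBM MSE:", gbm_mse)
```

The learning rate shrinks each stump's contribution, which is a common way to trade more rounds for smoother fitting.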
GBM can be further improved by introducing regularization techniques, such as L1 and L2 regularization,
to prevent overfitting. Additionally, GBM can be extended to handle categorical variables, missing data,
and multi-class classification problems.
Example
Here is an example of implementing GBM using the Sklearn breast cancer dataset -
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
data = load_breast_cancer()
X = data.data
y = data.target
We train the GBM model using the fit method and make predictions on the testing set using the predict
method. Finally, we evaluate the model's accuracy using the accuracy_score function from Sklearn's
metrics module.
When you execute this code, it will produce the following output -
Accuracy: 0.956140350877193
Advantages of GBM
Following are the advantages of using Gradient Boosting Machines -
1. High accuracy - GBM is known for its high accuracy, as it combines the predictions of
multiple weaker models to create a stronger and more accurate model.
2. Robustness - With a robust loss function (such as Huber loss), GBM can tolerate outliers and
noisy data, although its focus on the most difficult samples can otherwise make it sensitive to them.
3. Flexibility - GBM can be used for a wide range of applications, including regression,
classification, and ranking problems.
4. Feature importance - GBM provides insights into the importance of different features in making
predictions, which can be useful for understanding the underlying factors driving the
predictions.
5. Scalability - GBM can handle large datasets and can be parallelized to accelerate the training
process.
Disadvantages of GBM
Following are the disadvantages of using Gradient Boosting Machines -
1. Training time - GBM can be computationally expensive and may require a significant
amount of training time, especially when working with large datasets.
2. Hyperparameter tuning - GBM requires careful tuning of hyperparameters, such as the
learning rate, number of trees, and maximum depth, to achieve optimal performance.
3. Black box model - Beyond feature importances, GBM can be difficult to interpret, as the final
model is a combination of multiple decision trees and may not provide clear insights into how
individual predictions are made.
Bootstrap Aggregation (Bagging)
Bagging is an ensemble learning technique that combines the predictions of multiple models to improve
the accuracy and stability of a single model. It involves creating multiple subsets of the training data by
randomly sampling with replacement. Each subset is then used to train a separate model, and the final
prediction is made by averaging the predictions of all models.
The main idea behind Bagging is to reduce the variance of a single model by using multiple models that are
less complex but still accurate. By averaging the predictions of multiple models, Bagging reduces the risk
of overfitting and improves the stability of the model.
1. Create multiple subsets of the training data by randomly sampling with replacement.
2. Train a separate model on each subset of the data.
3. Make predictions on the testing data using each model.
4. Combine the predictions of all models by taking the average or majority vote.
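The four steps above can be written out with NumPy. In this sketch the base model is a very simple nearest-centroid classifier, chosen only to keep the code short; that choice, the synthetic two-cluster data, and the function names are assumptions for illustration.

```python
import numpy as np

def fit_nearest_centroid(X, y):
    """Base model: remember the mean point of each class."""
    classes = np.unique(y)
    centroids = np.array([X[y == c].mean(axis=0) for c in classes])
    return classes, centroids

def predict_nearest_centroid(model, X):
    classes, centroids = model
    distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return classes[np.argmin(distances, axis=1)]

def fit_bagging(X, y, n_estimators=10, seed=0):
    rng = np.random.default_rng(seed)
    models = []
    for _ in range(n_estimators):
        idx = rng.integers(0, len(X), size=len(X))            # step 1: bootstrap sample
        models.append(fit_nearest_centroid(X[idx], y[idx]))   # step 2: train on the sample
    return models

def predict_bagging(models, X):
    # step 3: one row of predictions per model, shape (n_estimators, n_samples)
    votes = np.array([predict_nearest_centroid(m, X) for m in models])
    # step 4: majority vote per sample (labels must be non-negative integers)
    return np.array([np.bincount(column).argmax() for column in votes.T])

# Synthetic data (an assumption): two well-separated Gaussian clusters.
rng = np.random.default_rng(42)
X_train = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(4, 1, (100, 2))])
y_train = np.repeat([0, 1], 100)
X_test = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(4, 1, (50, 2))])
y_test = np.repeat([0, 1], 50)

models = fit_bagging(X_train, y_train, n_estimators=10)
accuracy = np.mean(predict_bagging(models, X_test) == y_test)
print("Accuracy:", accuracy)
```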
The key feature of Bagging is that each model is trained on a different subset of the training data, which
introduces diversity into the ensemble. The models are typically trained using a base model, such as a
decision tree, logistic regression, or support vector machine.
Example
Now let's see how we can implement Bagging in Python using the Scikit-learn library. For this example, we
will use the famous Iris dataset.
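from sklearn.datasets import load_iris
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
iris = load_iris()
# Define the base estimator
base_estimator = DecisionTreeClassifier(max_depth=3)
# Define the Bagging classifier (base estimator passed positionally, which works across scikit-learn versions)
bagging = BaggingClassifier(base_estimator, n_estimators=10, random_state=42)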
In this example, we first load the Iris dataset using Scikit-learn's load_iris function and split it into training
and testing sets using the train_test_split function.
We then define the base estimator, which is a decision tree with a maximum depth of 3, and the Bagging
classifier, which consists of 10 decision trees.
We train the Bagging classifier using the fit method and make predictions on the testing set using the
predict method. Finally, we evaluate the model's accuracy using the accuracy_score function from Scikit-
learn's metrics module.
Output
When you execute this code, it will produce the following output -
Accuracy: 1.0
Cross Validation
Cross-validation is a powerful technique used in machine learning to estimate the performance of a model
on unseen data. It is an essential step in building a robust machine learning model, as it helps to identify
overfitting or underfitting, and helps to determine the optimal model hyperparameters.
What is Cross-Validation?
Cross-validation is a technique used to evaluate the performance of a model by partitioning the dataset
into subsets, training the model on a portion of the data, and then validating the model on the remaining
data. The basic idea behind cross-validation is to use a subset of the data to train the model and another
subset to test its performance. This allows the machine learning model to be trained on a variety of data
and to generalize better to new data.
There are different types of cross-validation techniques available, but the most commonly used technique
is k-fold cross-validation. In k-fold cross-validation, the data is partitioned into k equally sized folds. The
model is then trained on k-1 folds and tested on the remaining fold. This process is repeated k times, with
each of the k folds used once as the validation data. The final performance of the model is then averaged
over the k iterations to obtain an estimate of the model's performance.
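The k-fold procedure described above can be written out as an explicit loop. This is a minimal sketch, assuming the Iris dataset, a decision tree, k = 5, and shuffled folds with a fixed seed; none of these choices come from the text above.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Assumed settings: 5 folds, shuffled with a fixed seed for reproducibility.
kf = KFold(n_splits=5, shuffle=True, random_state=42)

fold_scores = []
for train_idx, val_idx in kf.split(X):
    model = DecisionTreeClassifier(random_state=42)
    # Train on the k-1 training folds.
    model.fit(X[train_idx], y[train_idx])
    # Validate on the single held-out fold.
    fold_scores.append(model.score(X[val_idx], y[val_idx]))

print("Scores per fold:", fold_scores)
print("Mean score:", np.mean(fold_scores))
```

Each sample is used for validation exactly once and for training k-1 times.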
Why is Cross-Validation Important?
Cross-validation is an essential technique in machine learning because it helps to detect overfitting or
underfitting of a model before the model is used on new data. Overfitting occurs when the model is too
complex and fits the training data too closely, resulting in poor performance on new data. On the other
hand, underfitting occurs when the model is too simple and does not capture the underlying patterns in
the data, resulting in poor performance on both the training and test data.
Cross-validation also helps to determine the optimal model hyperparameters. Hyperparameters are the
settings that control the behavior of the model. For example, in a decision tree algorithm, the maximum
depth of the tree is a hyperparameter that determines the level of complexity of the model. By using cross-
validation to evaluate the performance of the model at different hyperparameter values, we can select the
optimal hyperparameters that maximize the model's performance.
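As a rough illustration of choosing a hyperparameter with cross-validation, the sketch below compares a few values of max_depth for a decision tree on the Iris dataset. The candidate depths and the use of 5 folds are assumptions made for the example.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Candidate depths are an assumption chosen for illustration.
results = {}
for depth in [1, 2, 3, 5]:
    clf = DecisionTreeClassifier(max_depth=depth, random_state=42)
    # Mean accuracy over 5 folds for this hyperparameter value.
    results[depth] = cross_val_score(clf, X, y, cv=5).mean()

best_depth = max(results, key=results.get)
print(results)
print("Best max_depth:", best_depth)
```

A depth of 1 can separate at most two of the three species, so its cross-validated score is clearly lower than that of the deeper trees.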
To demonstrate how to implement cross-validation in Python, we will use the famous Iris dataset. The
Iris dataset contains measurements of the sepal length, sepal width, petal length, and petal width of three
different species of iris flowers. The goal is to build a model that can predict the species of an iris flower
based on its measurements.
First, we need to load the dataset using the Scikit-learn load_iris() function and split it into a training set
and a test set using the train_test_split() function. The training set will be used to train the model, and the
test set will be used to evaluate the performance of the model.
iris = load_iris()
Next, we will create a decision tree classifier using the Scikit-learn DecisionTreeClassifier class.
clf = DecisionTreeClassifier(random_state=42)
Now, we can use k-fold cross-validation to evaluate the performance of the model. We will use the
cross_val_score() function from Scikit-learn to perform k-fold cross-validation. The function takes as input the
model, the training data, the target variable, and the number of folds. It returns an array of scores, one for
each fold.
Here, we have specified the number of folds as 5, meaning that the data will be partitioned into 5 equally
sized folds. The cross_val_score() function will train the model on 4 folds and test it on the remaining fold.
This process will be repeated 5 times, with each fold used once as the validation data. The function returns
an array of scores, one for each fold.
Finally, we can calculate the mean and standard deviation of the scores to get an estimate of the model's
performance.
import numpy as np
mean_score = np.mean(scores)
std_score = np.std(scores)
The output of this code will be the mean and standard deviation of the scores. The mean score represents
the average performance of the model across all folds, while the standard deviation represents the
variability of the scores.
Example
Here is the complete implementation of Cross-Validation in Python -
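import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier
iris = load_iris()
X = iris.data
y = iris.target
clf = DecisionTreeClassifier(random_state=42)
scores = cross_val_score(clf, X, y, cv=5)
mean_score = np.mean(scores)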
Output
When you execute this code, it will produce the following output -
AUC-ROC Curve
The AUC-ROC curve is a commonly used performance metric in machine learning that is used to evaluate
the performance of binary classification models. It is a plot of the true positive rate (TPR) against the false
positive rate (FPR) at different threshold values.
What is the AUC-ROC Curve?
The AUC-ROC curve is a graphical representation of the performance of a binary classification model at
different threshold values. It plots the true positive rate (TPR) on the y-axis and the false positive rate (FPR)
on the x-axis. The TPR is the proportion of actual positive cases that are correctly identified by the model,
while the FPR is the proportion of actual negative cases that are incorrectly classified as positive by the
model.
The AUC-ROC curve is a useful metric for evaluating the overall performance of a binary classification
model because it takes into account the trade-off between TPR and FPR at different threshold values. The
area under the curve (AUC) represents the overall performance of the model across all possible threshold
values. A perfect classifier would have an AUC of 1.0, while a random classifier would have an AUC of 0.5.
It is particularly useful when the data is imbalanced, meaning that one class is much more prevalent than
the other. In such cases, accuracy alone may not be a good measure of the model's performance because it
can be skewed by the prevalence of the majority class.
The AUC-ROC curve provides a more balanced view of the model's performance by taking into account
both TPR and FPR.
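To make the definitions of TPR, FPR, and AUC concrete, the following sketch computes them by hand with NumPy. The labels and scores are hypothetical values invented for illustration, and the helper name tpr_fpr is not part of any library.

```python
import numpy as np

# Hypothetical labels and predicted scores, invented for illustration.
y_true = np.array([0, 0, 1, 1, 0, 1, 0, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.6, 0.9])

def tpr_fpr(y_true, y_score, threshold):
    """Return (TPR, FPR) when scores >= threshold are predicted positive."""
    y_hat = (y_score >= threshold).astype(int)
    tp = np.sum((y_hat == 1) & (y_true == 1))
    fp = np.sum((y_hat == 1) & (y_true == 0))
    fn = np.sum((y_hat == 0) & (y_true == 1))
    tn = np.sum((y_hat == 0) & (y_true == 0))
    return tp / (tp + fn), fp / (fp + tn)

# Sweep thresholds from high to low, starting above the largest score
# so that the curve begins at the point (FPR, TPR) = (0, 0).
thresholds = np.concatenate(([np.inf], np.unique(y_score)[::-1]))
points = [tpr_fpr(y_true, y_score, t) for t in thresholds]
tpr = np.array([p[0] for p in points])
fpr = np.array([p[1] for p in points])

# Area under the ROC curve using the trapezoidal rule.
auc = np.sum((fpr[1:] - fpr[:-1]) * (tpr[1:] + tpr[:-1]) / 2)
print("TPR, FPR at threshold 0.5:", tpr_fpr(y_true, y_score, 0.5))
print("AUC:", auc)
```

For this toy data, 14 of the 16 positive/negative pairs are ranked correctly, which gives an AUC of 0.875.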
First, we need to import the necessary libraries and load the dataset. In this example, we will be using the
breast cancer dataset from scikit-learn.
Example
import pandas as pd
from sklearn.datasets import load_breast_cancer
data = load_breast_cancer()
X = data.data
y = data.target
Next, we will fit a logistic regression model to the training set and make predictions on the test set.
lr = LogisticRegression()
lr.fit(X_train, y_train)
y_pred = lr.predict_proba(X_test)[:, 1]
After making predictions, we can calculate the AUC-ROC score using the roc_auc_score() function from
scikit-learn.
This will output the AUC-ROC score for the logistic regression model.
Finally, we can plot the ROC curve using the roc_curve() function and matplotlib library.
# Plot the ROC curve
fpr, tpr, thresholds = roc_curve(y_test, y_pred)
plt.plot(fpr, tpr)
plt.xlabel('False Positive Rate')
plt.title('ROC Curve')
plt.show()
Output
When you execute this code, it will plot the ROC curve for the logistic regression model.
[Figure: ROC Curve. x-axis: False Positive Rate; y-axis: True Positive Rate.]
Grid Search
Grid Search is a hyperparameter tuning technique in Machine Learning that helps to find the best
combination of hyperparameters for a given model. It works by defining a grid of hyperparameters and then
training the model with all the possible combinations of hyperparameters to find the best performing set.
In other words, Grid Search is an exhaustive search method where a set of hyperparameters are defined,
and a search is performed over all possible combinations of these hyperparameters to find the optimal
values that give the best performance.
Implementation in Python
In Python, Grid Search can be implemented using the GridSearchCV class from the sklearn.model_selection
module. The GridSearchCV class takes the model, a dictionary of hyperparameter values to try, and
optionally a scoring metric and a cross-validation setting. It evaluates every combination of
hyperparameter values using cross-validation and reports the combination that gives the best score.
Here is an example implementation of Grid Search in Python using the GridSearchCV class -
Example
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification
In this example, we define a RandomForestClassifier model and a set of hyperparameters to tune, namely
the number of trees (n_estimators) and the maximum depth of each tree (max_depth). We then create a
GridSearchCV object and fit the data using the fit() method. Finally, we print the best set of
hyperparameters and the corresponding score.
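The steps just described might look like the following sketch. The synthetic dataset size, the candidate values in param_grid, the number of folds, and the accuracy scoring are all assumptions made for illustration, so the printed result will depend on them.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic data; the sizes and grid values are assumptions for illustration.
X, y = make_classification(n_samples=200, n_features=10, random_state=42)

# Every combination of these values is tried: 2 x 2 = 4 candidates.
param_grid = {
    "n_estimators": [10, 50],
    "max_depth": [3, None],
}

grid_search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    cv=3,
    scoring="accuracy",
)
grid_search.fit(X, y)

print("Best parameters:", grid_search.best_params_)
print("Best score:", grid_search.best_score_)
```

Because the search is exhaustive, the number of model fits grows as the product of the grid sizes times the number of folds, which is worth keeping in mind before adding more values.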
Output
When you execute this code, it will produce the following output -
Example
In Python, data scaling can be implemented using the sklearn module. The sklearn.preprocessing
submodule provides classes for scaling data. Below is an example implementation of data scaling in Python
using the StandardScaler class for standardization -
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import load_iris
data = load_iris()
X = data.data
y = data.target
df = pd.DataFrame(X, columns=data.feature_names)
print("Before scaling:")
print(df.head())
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
df_scaled = pd.DataFrame(X_scaled, columns=data.feature_names)
print("After scaling:")
print(df_scaled.head())
Output
When you execute this code, it will produce the following output -
Before scaling:
sepal length (cm) sepal width (cm) petal length (cm) petal width (cm)
0 5.1 3.5 1.4 0.2
1 4.9 3.0 1.4 0.2
2 4.7 3.2 1.3 0.2
3 4.6 3.1 1.5 0.2
4 5.0 3.6 1.4 0.2
After scaling:
sepal length (cm) sepal width (cm) petal length (cm) petal width (cm)
0 -0.900681 1.019004 -1.340227 -1.315444
1 -1.143017 -0.131979 -1.340227 -1.315444
2 -1.385353 0.328414 -1.397064 -1.315444
3 -1.506521 0.098217 -1.283389 -1.315444
4 -1.021849 1.249201 -1.340227 -1.315444
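Standardization itself is a simple calculation: subtract each column's mean and divide by its standard deviation. The sketch below reproduces that calculation with NumPy on a small made-up matrix, so the numbers are not related to the Iris output above.

```python
import numpy as np

# Small made-up feature matrix (rows = samples, columns = features).
X = np.array([[1.0, 200.0],
              [2.0, 300.0],
              [3.0, 400.0]])

mean = X.mean(axis=0)
std = X.std(axis=0)  # population standard deviation (ddof=0)
X_scaled = (X - mean) / std

print(X_scaled)
```

After scaling, every column has mean 0 and standard deviation 1, so features measured on very different scales (here roughly 1 to 3 versus 200 to 400) end up directly comparable.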
Train and Test
In machine learning, the train-test split is a common technique used to evaluate the performance of a
machine learning model. The basic idea behind the train-test split is to split the available data into two
sets: a training set and a testing set. The training set is used to train the model, and the testing set is used
to evaluate the model's performance.
The train-test split is important because it allows us to test the model on data that it has not seen before.
This is important because if we evaluate the model on the same data that it was trained on, the model may
perform well on the training data but may not generalize well to new data.
Example
In Python, the train_test_split function from the sklearn.model_selection module can be used to split the
data into training and testing sets. Here is an example implementation -
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
data = load_iris()
X = data.data
y = data.target
model = LogisticRegression()
In this example, we load the iris dataset and split it into training and testing sets using the train_test_split
function. We then create a logistic regression model and fit it to the training data. Finally, we evaluate the
model on the testing data using the score method of the model object.
The test_size parameter in the train_test_split function specifies the proportion of the data that should be
used for testing. In this example, we set it to 0.2, which means that 20% of the data will be used for testing
and 80% will be used for training. The random_state parameter ensures that the split is reproducible, so
we get the same split every time we run the code.
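A compact sketch of the split described above is shown here. The seed value 42 and the max_iter setting are assumptions; test_size=0.2 follows the text.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# random_state=42 is an assumed seed; any fixed integer gives a reproducible split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(X_train.shape, X_test.shape)  # 120 training rows and 30 test rows

# max_iter is raised (an assumption) so the solver converges without warnings.
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
test_accuracy = model.score(X_test, y_test)
print("Accuracy: %.2f" % test_accuracy)
```

With 150 samples, a test_size of 0.2 leaves 120 rows for training and 30 for testing.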
Output
When you execute this code, it will produce the following output -
Accuracy: 1.00
Overall, the train-test split is a crucial step in evaluating the performance of a machine learning model. By
splitting the data into training and testing sets, we can check whether the model is overfitting to the
training data and estimate how well it generalizes to new data.
Association Rules
Association rule mining is a technique used in machine learning to discover interesting patterns in large
datasets. These patterns are expressed in the form of association rules, which represent relationships
between different items or attributes in the dataset. The most common application of association rule
mining is in market basket analysis, where the goal is to identify products that are frequently purchased
together.
Association rules are expressed as a set of antecedents and a set of consequents. The antecedents represent
the conditions or items that must be present for the rule to apply, while the consequents represent the
outcomes or items that are likely to be associated with the antecedents. The strength of an association rule is
measured by two metrics: support and confidence. Support is the proportion of transactions in the dataset
that contain both the antecedent and the consequent, while confidence is the proportion of the transactions
containing the antecedent that also contain the consequent.
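Support and confidence can be computed directly from their definitions. The following sketch uses plain Python on a handful of hypothetical shopping baskets invented for illustration; the helper names support and confidence are not from any library.

```python
# Hypothetical shopping baskets, invented for illustration.
transactions = [
    {"milk", "bread", "butter"},
    {"milk", "bread"},
    {"milk", "butter"},
    {"bread", "butter"},
    {"milk"},
    {"milk", "bread", "butter"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """support(antecedent and consequent) divided by support(antecedent)."""
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)

# Rule: bread -> butter
rule_support = support({"bread", "butter"}, transactions)
rule_confidence = confidence({"bread"}, {"butter"}, transactions)
print("support:", rule_support)        # 3 of the 6 baskets contain both items
print("confidence:", rule_confidence)  # 3 of the 4 bread baskets also contain butter
```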
Example
In Python, the mlxtend library provides several functions for association rule mining. Here is an example
implementation of association rule mining in Python using the apriori function from mlxtend -
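import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules
# data is a list of transactions, each a list of item names
te = TransactionEncoder()
te_ary = te.fit(data).transform(data)
df = pd.DataFrame(te_ary, columns=te.columns_)
frequent_itemsets = apriori(df, min_support=0.5, use_colnames=True)
print("Frequent Itemsets:")
print(frequent_itemsets)
rules = association_rules(frequent_itemsets, metric="confidence", min_threshold=0.5)
print("\nAssociation Rules:")
print(rules)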
In this example, we create a sample dataset of shopping transactions and encode it using
TransactionEncoder from mlxtend. We then use the apriori function to find frequent itemsets with a minimum support
of 0.5. Finally, we use the association_rules function to generate association rules with a minimum
confidence of 0.5.
The apriori function takes the encoded dataset and the minimum support threshold. Its use_colnames
parameter is set to True so that the itemsets are reported with the original item names instead of column
indices. The association_rules function takes the frequent itemsets together with the metric and minimum
threshold for generating association rules. In this example, we use the confidence metric with a minimum
threshold of 0.5.
Output
The output of this code will show the frequent itemsets and the generated association rules. The frequent
itemsets represent the sets of items that occur together frequently in the dataset, while the association
rules represent the relationships between the items in the frequent itemsets.
Frequent Itemsets:
support itemsets
0 0.666667 (bread)
1 0.666667 (butter)
2 0.833333 (milk)
3 0.500000 (bread, butter)
Association Rules:
Association rule mining is a powerful technique that can be applied to many different types of datasets. It
is commonly used in market basket analysis to identify products that are frequently purchased together,
but it can also be applied to other domains such as healthcare, finance, and social media. With the help of
Python libraries such as mlxtend, it is easy to implement association rule mining and generate valuable
insights from large datasets.
Apriori Algorithm
Apriori is a popular algorithm used for association rule mining in machine learning. It is used to find
frequent itemsets in a transaction database and generate association rules based on those itemsets. The
algorithm was first introduced by Rakesh Agrawal and Ramakrishnan Srikant in 1994.
The Apriori algorithm works by iteratively scanning the database to find frequent itemsets of increasing
size. It uses a "bottom-up" approach, starting with individual items and gradually adding more items to
the candidate itemsets until no more frequent itemsets can be found. The algorithm also employs a
pruning technique to reduce the number of candidate itemsets that need to be checked: because every subset
of a frequent itemset must itself be frequent, any candidate that contains an infrequent subset can be
discarded without counting it.
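The bottom-up search and the pruning step can be sketched in plain Python. This is a simplified illustration rather than an optimized implementation; the function name apriori_frequent_itemsets and the sample baskets are hypothetical.

```python
from itertools import combinations

def apriori_frequent_itemsets(transactions, min_support):
    """Return {frozenset: support} for all itemsets meeting min_support."""
    n = len(transactions)
    transactions = [frozenset(t) for t in transactions]

    def support(itemset):
        return sum(itemset <= t for t in transactions) / n

    # Level 1: frequent single items.
    items = {item for t in transactions for item in t}
    current = {frozenset([i]) for i in items}
    current = {s for s in current if support(s) >= min_support}
    frequent = {s: support(s) for s in current}

    k = 2
    while current:
        # Join step: combine frequent (k-1)-itemsets into k-item candidates.
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(sub) in frequent for sub in combinations(c, k - 1))
        }
        current = {c for c in candidates if support(c) >= min_support}
        frequent.update({c: support(c) for c in current})
        k += 1
    return frequent

# Hypothetical baskets, invented for illustration.
baskets = [
    ["milk", "bread", "butter"],
    ["milk", "bread"],
    ["milk", "butter"],
    ["bread", "butter"],
    ["milk"],
    ["milk", "bread", "butter"],
]
result = apriori_frequent_itemsets(baskets, min_support=0.5)
for itemset, sup in sorted(result.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), round(sup, 3))
```

With a minimum support of 0.5, the three single items and the three pairs are frequent, while the three-item set appears in only 2 of the 6 baskets and is dropped.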
Example
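In Python, the mlxtend library provides an implementation of the Apriori algorithm. Below is an example
of how to use the mlxtend library in conjunction with the sklearn datasets to implement the Apriori
algorithm on the iris dataset.
import pandas as pd
from sklearn import datasets
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori
# Load the iris dataset
iris = datasets.load_iris()
# Convert each flower into a transaction of "attribute=value" items
transactions = []
for i in range(len(iris.data)):
    transaction = []
    transaction.append('sepal_length=' + str(iris.data[i][0]))
    transaction.append('sepal_width=' + str(iris.data[i][1]))
    transaction.append('petal_length=' + str(iris.data[i][2]))
    transaction.append('petal_width=' + str(iris.data[i][3]))
    transaction.append('target=' + str(iris.target[i]))
    transactions.append(transaction)
te = TransactionEncoder()
te_ary = te.fit(transactions).transform(transactions)
df = pd.DataFrame(te_ary, columns=te.columns_)
# Find frequent itemsets with a minimum support of 0.3
frequent_itemsets = apriori(df, min_support=0.3, use_colnames=True)
print(frequent_itemsets)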
In this example, we load the iris dataset from sklearn, which contains information about iris flowers. We
convert the dataset into a list of transactions, where each transaction represents a single flower and
contains the values for its four attributes (sepal_length, sepal_width, petal_length, and petal_width) as well as
its target label (target). We then encode the transactions using one-hot encoding and find frequent
itemsets with a minimum support of 0.3 using the apriori function from mlxtend.
The output of this code will show the frequent itemsets and their corresponding support values. Because
the measurements are continuous, no single measurement value occurs in 30% of the flowers, so the only
frequent itemsets are the three target labels -
Output
support itemsets
0 0.333333 (target=0)
1 0.333333 (target=1)
2 0.333333 (target=2)
This indicates that each target label appears in one third of the transactions (50 of the 150 flowers per
species), and that no attribute value or combination of items reaches the minimum support of 0.3.
The Apriori algorithm is widely used in market basket analysis to identify patterns in customer
purchasing behavior. For example, a retailer might use the algorithm to find frequently purchased items that can
be promoted together to increase sales. The algorithm can also be used in other domains such as
healthcare, finance, and social media to identify patterns and generate insights from large datasets.
The basic idea behind Gaussian Discriminant Analysis (GDA) is to model the distribution of each class as a
multivariate Gaussian distribution. Given a set of training data, the algorithm estimates the mean and
covariance matrix of each class's distribution. Once the parameters of the model are estimated, it can be
used to predict the probability of a new data point belonging to each class, and the class with the highest
probability is chosen as the prediction.
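The estimation and prediction steps described above can be sketched with NumPy alone. This is a minimal illustration that fits a separate mean and covariance matrix per class; the function names and the synthetic two-class data are assumptions made for the example.

```python
import numpy as np

def fit_gda(X, y):
    """Estimate the prior, mean, and covariance matrix of each class."""
    params = {}
    for c in np.unique(y):
        Xc = X[y == c]
        params[c] = {
            "prior": len(Xc) / len(X),
            "mean": Xc.mean(axis=0),
            "cov": np.cov(Xc, rowvar=False),
        }
    return params

def log_gaussian(x, mean, cov):
    """Log density of a multivariate Gaussian at point x."""
    d = len(mean)
    diff = x - mean
    _, logdet = np.linalg.slogdet(cov)
    maha = diff @ np.linalg.solve(cov, diff)
    return -0.5 * (d * np.log(2 * np.pi) + logdet + maha)

def predict_gda(params, X):
    """Pick the class with the highest log prior plus log likelihood."""
    classes = list(params)
    scores = np.array([
        [np.log(params[c]["prior"])
         + log_gaussian(x, params[c]["mean"], params[c]["cov"])
         for c in classes]
        for x in X
    ])
    return np.array(classes)[scores.argmax(axis=1)]

# Synthetic two-class data, generated for illustration only.
rng = np.random.default_rng(0)
X0 = rng.multivariate_normal([0, 0], [[1, 0.3], [0.3, 1]], size=100)
X1 = rng.multivariate_normal([4, 4], [[1, -0.2], [-0.2, 1]], size=100)
X = np.vstack([X0, X1])
y = np.array([0] * 100 + [1] * 100)

params = fit_gda(X, y)
y_pred = predict_gda(params, X)
accuracy = np.mean(y_pred == y)
print("Training accuracy:", accuracy)
```

Working with log densities avoids numerical underflow, and comparing log prior plus log likelihood across classes is equivalent to comparing the posterior probabilities.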
Assumption 1 means that GDA is not suitable for data with categorical or discrete features. Assumption
2 means that GDA assumes that the variance of each feature is the same across all classes. If this is not
true, the algorithm may not perform well. Assumption 3 means that GDA assumes that the features are
independent of each other given the class label. This assumption can be relaxed using a different algorithm
called Linear Discriminant Analysis (LDA).
Example
The implementation of GDA in Python is relatively straightforward. Here's an example of how to
implement GDA on the Iris dataset using the scikit-learn library -
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
from sklearn.metrics import accuracy_score
# Train the GDA model
gda = QuadraticDiscriminantAnalysis()
gda.fit(X_train, y_train)
# Make predictions on the testing set and evaluate the accuracy
y_pred = gda.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)
In this example, we first load the Iris dataset using the load_iris function from scikit-learn. We then split
the data into training and testing sets using the train_test_split function. We create a
QuadraticDiscriminantAnalysis object, which represents the GDA model, and train it on the training data using the fit
method. We then make predictions on the testing set using the predict method and evaluate the model's
accuracy by comparing the predicted labels to the true labels.
Output
The output of this code will show the model's accuracy on the testing set. For the Iris dataset, the GDA
model typically achieves an accuracy of around 97-99%.
Accuracy: 0.9811320754716981
Overall, GDA is a powerful algorithm for classification tasks with continuous features that are
approximately normally distributed within each class. While it makes several assumptions about the data,
it is still a useful and effective algorithm for many real-world applications.
Cost Function
In machine learning, a cost function is a measure of how well a machine learning model is performing. It
is a mathematical function that takes in the model's predicted values and the true values of the data and
outputs a single scalar value that represents the cost or error of the model's predictions. The goal of
training a machine learning model is to minimize the cost function.
The choice of cost function depends on the specific problem being solved. For example, in binary
classification tasks, where the goal is to predict whether a data point belongs to one of two classes, the most
commonly used cost function is the binary cross-entropy function. In regression tasks, where the goal is to
predict a continuous value, the mean squared error function is commonly used.
Let's take a closer look at the binary cross-entropy function. Given a binary classification problem with
two classes, let's call them class 0 and class 1, and let's denote the model's predicted probability of class 1
as p = p(y = 1 | x). The true label of each data point is either 0 or 1. We can define the binary cross-entropy cost
function as follows -
$$J = -\frac{1}{m} \sum_{i=1}^{m} \left[ y_i \log(p_i) + (1 - y_i) \log(1 - p_i) \right]$$
where m is the number of data points, y_i is the true label of the i-th data point, and p_i is the predicted
probability of class 1 for that data point.
The binary cross-entropy function has several desirable properties. First, for models such as logistic
regression it is convex in the model parameters, which means that any local minimum is also a global
minimum and can be found using optimization techniques. Second, it is non-negative and grows as the
predicted probability moves away from the true label, so confident incorrect predictions are penalized
heavily. Third, it is differentiable, which means that it can be used with gradient-based optimization algorithms.
Implementation in Python
Now let's see how to implement the binary cross-entropy function in Python using NumPy -
import numpy as np
eps = 1e-15
Once we have defined a cost function, we can use it to train a machine learning model using optimization
techniques such as gradient descent. The goal of optimization is to find the set of model parameters that
minimizes the cost function.
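A self-contained version of such a cost function might look like the sketch below. The clipping with a small eps is a common way to avoid taking log(0); the sample labels and probabilities are hypothetical values invented for illustration.

```python
import numpy as np

def binary_cross_entropy(y_true, y_pred, eps=1e-15):
    """Mean binary cross-entropy between 0/1 labels and predicted P(y = 1)."""
    y_true = np.asarray(y_true, dtype=float)
    # Clip the probabilities so that log(0) never occurs.
    y_pred = np.clip(np.asarray(y_pred, dtype=float), eps, 1 - eps)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

# Hypothetical labels and probabilities, invented for illustration.
y_true = [1, 0, 1, 0]
confident_right = [0.9, 0.1, 0.8, 0.2]
confident_wrong = [0.1, 0.9, 0.2, 0.8]

loss_right = binary_cross_entropy(y_true, confident_right)
loss_wrong = binary_cross_entropy(y_true, confident_wrong)
print("Loss for mostly correct predictions:", loss_right)
print("Loss for mostly wrong predictions:", loss_wrong)
```

The second loss is roughly twelve times larger than the first, which shows how strongly the function penalizes confident mistakes.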
Example
Here is an example of using the binary cross-entropy function to train a logistic regression model on the
Iris dataset using scikit-learn -
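logreg = LogisticRegression()
logreg.fit(X_train, y_train)
y_pred = logreg.predict(X_test)
y_prob = logreg.predict_proba(X_test)[:, 1]
loss = binary_cross_entropy(y_test, y_prob)
print('Loss:', loss)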
In the above example, we first load the Iris dataset using the load_iris function from scikit-learn. We then
split the data into training and testing sets using the train_test_split function. We train a logistic
regression model on the training set using the LogisticRegression class from scikit-learn. We then make
predictions on the testing set using the predict method of the trained model.
To compute the binary cross-entropy loss, we use the predict_proba method of the logistic regression
model to get the predicted probabilities for each data point in the testing set. We then extract the
probabilities for class 1 using indexing and pass them to our binary_cross_entropy function along with
the true labels of the testing set. Note that the binary cross-entropy formula assumes labels of 0 or 1,
whereas Iris has three classes, so the loss is only meaningful if the target is first reduced to two classes.
The function computes the loss and returns it, which we display on the terminal.
Output
When you execute this code, it will produce the following output -
Loss: 1.6312339784720309
The binary cross-entropy loss is a measure of how well the logistic regression model is able to predict the
class of each data point in the testing set. A lower loss indicates better performance, and a loss of 0 would
indicate perfect performance.
Bayes Theorem
Bayes Theorem is a fundamental concept in probability theory that has many applications in machine
learning. It allows us to update our beliefs about the probability of an event given new evidence, and
it forms the basis for probabilistic reasoning and decision making.
Bayes Theorem states that the probability of an event A given evidence B is equal to the probability of
evidence B given event A, multiplied by the prior probability of event A, divided by the probability of
evidence B. In mathematical notation, this can be written as -
P(A|B) = P(B|A) * P(A) / P(B)
where -
P(A|B) is the probability of event A given evidence B (the posterior probability)
P(B|A) is the probability of evidence B given event A (the likelihood)
P(A) is the prior probability of event A
P(B) is the probability of evidence B
Bayes Theorem can be used in a wide range of applications, such as spam filtering, medical diagnosis, and
image recognition. In machine learning, Bayes Theorem is commonly used in Bayesian inference, which is
a statistical technique for updating our beliefs about the parameters of a model based on new data.
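A short worked calculation shows how the formula updates a belief. All of the probabilities below are hypothetical numbers invented for a spam-filtering scenario.

```python
# Hypothetical numbers for a spam filter, invented for illustration.
p_spam = 0.2              # P(A): prior probability that an email is spam
p_word_given_spam = 0.6   # P(B|A): chance that the word "offer" appears in spam
p_word_given_ham = 0.05   # P(B|not A): chance that it appears in non-spam

# P(B), computed with the law of total probability.
p_word = p_word_given_spam * p_spam + p_word_given_ham * (1 - p_spam)

# Bayes Theorem: P(A|B) = P(B|A) * P(A) / P(B)
p_spam_given_word = p_word_given_spam * p_spam / p_word
print(round(p_spam_given_word, 3))
```

Here P(B) = 0.6 * 0.2 + 0.05 * 0.8 = 0.16, so the posterior is 0.12 / 0.16 = 0.75. Seeing the word raises the probability of spam from the prior of 0.2 to 0.75.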
Implementation in Python
In Python, there are several libraries that implement Bayes Theorem and Bayesian inference. One of the
most popular is the scikit-learn library, which provides a range of tools for machine learning and data
analysis.
Let's consider an example of how Bayes Theorem can be implemented in Python using scikit-learn.
Suppose we have a dataset of emails, some of which are spam and some of which are not. Our goal is to build a
classifier that can accurately predict whether a new email is spam or not.
We can use Bayes Theorem to calculate the probability of an email being spam given its features (such as
the words in the subject line or body). To do this, we first need to estimate the parameters of the model,
which in this case are the prior probabilities of spam and non-spam emails, as well as the likelihood of each
feature given the class (spam or non-spam).
We can estimate these probabilities using maximum likelihood estimation or Bayesian inference. In our
example, we will be using the Multinomial Naive Bayes algorithm, which is a variant of the Naive Bayes
algorithm that is commonly used for text classification tasks.
Example
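from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score
# train and test are the fetch_20newsgroups subsets for the selected categories
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train.data)
X_test = vectorizer.transform(test.data)
clf = MultinomialNB()
clf.fit(X_train, train.target)
# Make predictions on the test set and calculate accuracy
y_pred = clf.predict(X_test)
accuracy = accuracy_score(test.target, y_pred)
print("Accuracy:", accuracy)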
In the above code, we first load the 20 newsgroups dataset, which is a collection of newsgroup posts
classified into different categories. We select four of these categories and split the data into training and
testing sets.
We then use the CountVectorizer class from scikit-learn to convert the text data into a bag-of-words
representation. This representation counts the occurrence of each word in the text and represents it as a vector.
Next, we train a Multinomial Naive Bayes classifier using the fit() method. This method estimates the prior
probabilities and the likelihood of each word given the class using maximum likelihood estimation. The
classifier can then be used to make predictions on the test set using the predict() method.
Finally, we calculate the accuracy of the classifier using the accuracy_score() function from scikit-learn.
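The same pipeline can be tried on a tiny corpus that needs no download. The sketch below uses a handful of hand-written messages invented for illustration, so it demonstrates the mechanics rather than realistic accuracy.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Tiny hand-written corpus, invented for illustration (1 = spam, 0 = not spam).
train_texts = [
    "win a free prize now",
    "free money claim your prize",
    "limited offer win cash",
    "meeting agenda for monday",
    "please review the project report",
    "lunch with the team on friday",
]
train_labels = [1, 1, 1, 0, 0, 0]

# Convert the text into bag-of-words count vectors.
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_texts)

# Estimate class priors and per-word likelihoods from the counts.
clf = MultinomialNB()
clf.fit(X_train, train_labels)

# New messages must be transformed with the already fitted vocabulary.
test_texts = ["claim your free cash prize", "project meeting on monday"]
X_test = vectorizer.transform(test_texts)
predictions = clf.predict(X_test)
print(predictions)
```

The first message contains only words seen in spam and the second only words seen in non-spam, so the classifier labels them 1 and 0 respectively.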
Output
When you execute this code, it will produce the following output -
Accuracy: 0.9340878828229028
Precision and Recall
Precision and recall are two important metrics used to evaluate the performance of classification models
in machine learning. They are particularly useful for imbalanced datasets where one class has
significantly fewer instances than the other.
Precision is a measure of how many of the positive predictions made by a classifier were correct. It is
defined as the ratio of true positives (TP) to the total number of positive predictions (TP + FP). In other
words, precision measures the proportion of true positives among all positive predictions.
Precision=TP/(TP+FP)
Recall, on the other hand, is a measure of how many of the actual positive instances were correctly
identified by the classifier. It is defined as the ratio of true positives (TP) to the total number of actual positive
instances (TP + FN). In other words, recall measures the proportion of true positives among all actual
positive instances.
Recall=TP/(TP+FN)
To understand precision and recall, consider the problem of detecting spam emails. A classifier may label
an email as spam (positive prediction) or not spam (negative prediction). The actual label of the email can
be either spam or not spam. If the email is actually spam and the classifier correctly labels it as spam, then
it is a true positive. If the email is not spam but the classifier incorrectly labels it as spam, then it is a false
positive. If the email is actually spam but the classifier incorrectly labels it as not spam, then it is a false
negative. Finally, if the email is not spam and the classifier correctly labels it as not spam, then it is a true
negative.
In this scenario, precision measures the proportion of emails labeled as spam by the classifier that are
actually spam. A high precision indicates that when the classifier flags an email as spam it is usually
right, so few legitimate emails are labeled as spam. Recall, on the other hand, measures the proportion
of all actual spam emails that the classifier correctly identified. A high recall indicates that the
classifier is catching most of the spam emails, even if it is labeling some legitimate emails as spam.
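The two formulas can be checked by counting outcomes directly. The labels below are hypothetical values invented for illustration, with 1 meaning spam and 0 meaning not spam.

```python
# Hypothetical labels (1 = spam, 0 = not spam), invented for illustration.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))  # spam flagged as spam
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))  # legitimate flagged as spam
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))  # spam that was missed

precision = tp / (tp + fp)
recall = tp / (tp + fn)
print("TP, FP, FN:", tp, fp, fn)
print("Precision:", precision)
print("Recall:", recall)
```

Here the classifier flags three emails, two of them correctly, so precision is 2/3. It finds only two of the four spam emails, so recall is 0.5.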
Implementation in Python
In scikit-learn, precision and recall can be calculated using the precision_score() and recall_score()
functions, respectively. These functions take as input the true labels and predicted labels for a set of instances,
and return the corresponding precision and recall scores.
For example, consider the following code snippet that uses the breast cancer dataset from scikit-learn to
train a logistic regression classifier and evaluate its precision and recall scores -
Example
clf = LogisticRegression(random_state=42)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
In the above example, we first load the breast cancer dataset and split it into training and testing sets. We
then train a logistic regression classifier on the training set and make predictions on the testing set using
the predict() method. Finally, we calculate the precision and recall scores using the precision_score() and
recall_score() functions.
Output
When you execute this code, it will produce the following output -
Precision: 0.9459459459459459
Recall: 0.9859154929577465
Adversarial
Adversarial machine learning is a subfield of machine learning that focuses on studying the vulnerability
of machine learning models to adversarial attacks. An adversarial attack is a deliberate attempt to fool a
machine learning model by introducing small perturbations in the input data. These perturbations are
often imperceptible to humans, but they can cause the model to make incorrect predictions with high
confidence. Adversarial attacks can have serious consequences in real-world applications, such as
autonomous driving, security systems, and healthcare.
1. Evasion attacks - These attacks aim to manipulate the input data to cause the model to
misclassify it. Evasion attacks can be targeted, where the attacker knows the target class, or
untargeted, where the attacker only wants to cause a misclassification.
2. Poisoning attacks - These attacks aim to manipulate the training data to bias the model
towards a particular class or to reduce its overall accuracy. Poisoning attacks can be either
data poisoning, where the attacker modifies the training data, or model poisoning, where the
attacker modifies the model itself.
3. Model inversion attacks - These attacks aim to infer sensitive information about the training
data or the model itself by observing the outputs of the model.
To defend against adversarial attacks, researchers have proposed several techniques, including -
1. Adversarial training - This technique involves augmenting the training data with adversarial
examples to make the model more robust to adversarial attacks.
2. Defensive distillation - This technique involves training a second model on the outputs of
the first model to make it more resistant to adversarial attacks.
3. Randomization - This technique involves adding random noise to the input data or the
model parameters to make it harder for attackers to craft adversarial examples.
4. Detection and rejection - This technique involves detecting adversarial examples and rejecting
them before they are processed by the model.
Implementation in Python
In Python, several libraries provide implementations of adversarial attacks and defenses, including -
1. CleverHans - This library provides a collection of adversarial attacks and defenses for
TensorFlow, Keras, and PyTorch.
2. ART (Adversarial Robustness Toolbox) - This library provides a comprehensive set of tools
to evaluate and defend against adversarial attacks in machine learning models.
3. Foolbox - This library provides a collection of adversarial attacks for PyTorch, TensorFlow,
and Keras.
In the following example, we implement adversarial machine learning using the Adversarial
Robustness Toolbox (ART) -
First, we need to install the ART package using pip -
pip install adversarial-robustness-toolbox
Then, we can create an adversarial example using the ART library on a trained model.
Example
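import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, MaxPooling2D, Flatten, Dense
from art.estimators.classification import KerasClassifier
from art.attacks.evasion import FastGradientMethod
from art.utils import load_mnist
tf.compat.v1.disable_eager_execution()
# Load the MNIST dataset
(x_train, y_train), (x_test, y_test), min_val, max_val = load_mnist()
model = Sequential()
model.add(Conv2D(32, kernel_size=(3, 3), activation='relu', input_shape=(28, 28, 1)))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Flatten())
model.add(Dense(10, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
# Wrap the model for ART, train it, then craft adversarial test examples
classifier = KerasClassifier(model=model, clip_values=(min_val, max_val))
classifier.fit(x_train, y_train, nb_epochs=10)
attack = FastGradientMethod(estimator=classifier, eps=0.1)
x_test_adv = attack.generate(x=x_test)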
In this example, we first load and preprocess the MNIST dataset. Then, we define a simple convolutional
neural network (CNN) model and compile it using categorical cross-entropy loss and Adam optimizer.
We wrap the model with the ART KerasClassifier to make it compatible with ART attacks. We then train the
model for 10 epochs on the training set and evaluate it on the test set.
Next, we generate adversarial examples using the FastGradientMethod attack with a maximum
perturbation of 0.1. Finally, we evaluate the model on the adversarial examples.
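To see what the fast gradient method does without TensorFlow or ART, here is a minimal sketch for a logistic model with fixed weights. For such a model the gradient of the cross-entropy loss with respect to the input is (p - y) * w, and the attack moves every feature by eps in the direction of that gradient's sign. The weights, the input, and all function names are hypothetical and chosen only for illustration; this is not ART's implementation.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(w, b, x):
    """Probability of class 1 under a logistic model."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


def sign(v):
    return (v > 0) - (v < 0)


def fgsm(w, b, x, y, eps):
    """One-step fast gradient sign attack: shift each feature by eps to raise the loss."""
    p = predict_proba(w, b, x)
    grad = [(p - y) * wi for wi in w]  # d(cross-entropy)/dx for a logistic model
    return [xi + eps * sign(g) for xi, g in zip(x, grad)]


w, b = [2.0, -3.0], 0.0   # hypothetical "trained" weights
x, y = [0.2, 0.1], 1      # an input whose true label is 1
x_adv = fgsm(w, b, x, y, eps=0.1)

print("clean probability of class 1:      ", round(predict_proba(w, b, x), 3))
print("adversarial probability of class 1:", round(predict_proba(w, b, x_adv), 3))
```

Each feature changes by at most 0.1, yet the predicted probability of the true class drops below 0.5, so the prediction flips. The same idea scaled up to image pixels is what makes the perturbation hard to see.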
Output
When you execute this code, it will produce the following output -
Epoch 2/20
60000/60000 [==============================] - 15s 251us/sample - loss: 0.1296 - accuracy: 0.9636
Epoch 3/20
Epoch 4/20
Epoch 5/20
------- continue
Stacking
Stacking, also known as stacked generalization, is an ensemble learning technique in machine learning
where multiple models are combined in a hierarchical manner to improve prediction accuracy. The
technique involves training a set of base models on the original training dataset, and then using the predictions
of these base models as inputs to a meta-model, which is trained to make the final predictions.
The basic idea behind stacking is to leverage the strengths of multiple models by combining them in a way
that compensates for their individual weaknesses. By using a diverse set of models that make different
assumptions and capture different aspects of the data, we can improve the overall predictive power of the
ensemble.
Stacking proceeds in two stages -
1. Base Model Training - In this stage, a set of base models is trained on the original training
data. These models can be of any type, such as decision trees, random forests, support vector
machines, neural networks, or any other algorithm. To produce inputs for the next stage, each
model is trained on a subset (fold) of the training data and produces predictions for the
held-out data points.
2. Meta-model Training - In this stage, the predictions of the base models are used as inputs to
a meta-model, which is trained against the original training labels. The goal of the meta-model is
to learn how to combine the predictions of the base models to produce more accurate predictions.
The meta-model can be of any type, such as linear regression, logistic regression, or any
other algorithm. Training it on cross-validated (held-out) base-model predictions, rather than on
predictions for data the base models have already seen, helps avoid overfitting.
Once the meta-model is trained, it can be used to make predictions on new data points by passing the
predictions of the base models as inputs. The meta-model learns how to weight the base predictions;
simpler ensembles instead combine them with a fixed rule such as the average, weighted average, or maximum.
Example
Here is an example implementation of stacking in Python using scikit-learn -
iris = load_iris()
X, y = iris.data, iris.target
rf = RandomForestClassifier(n_estimators=10, random_state=42)
gb = GradientBoostingClassifier(random_state=42)
lr = LogisticRegression()
In this code, we first load the iris dataset and define the base models, which are a random forest and a
gradient boosting classifier. We then define the meta-model, which is a logistic regression model.
We create a StackingClassifier object with the base models and meta-model, and use cross-validation to
generate predictions for the meta-model. Finally, we evaluate the performance of the stacked model using
the accuracy score.
Output
When you execute this code, it will produce the following output -
Accuracy: 0.9666666666666667
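The fragment above does not show how the stacked model is assembled, so the following is one possible complete version using scikit-learn's StackingClassifier. The train/test split, cv=5, and max_iter=1000 are assumptions made for this sketch; the source does not state its settings, so the accuracy may differ from the value printed above.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import (GradientBoostingClassifier,
                              RandomForestClassifier, StackingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

iris = load_iris()
X, y = iris.data, iris.target
# Hold out a test set (split parameters are assumed)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

# Base models: their predictions become the meta-model's input features
base_models = [
    ("rf", RandomForestClassifier(n_estimators=10, random_state=42)),
    ("gb", GradientBoostingClassifier(random_state=42)),
]

# cv=5: the meta-model is fitted on out-of-fold predictions of the base models
stack = StackingClassifier(
    estimators=base_models,
    final_estimator=LogisticRegression(max_iter=1000),
    cv=5,
)
stack.fit(X_train, y_train)

accuracy = accuracy_score(y_test, stack.predict(X_test))
print("Accuracy:", accuracy)
```

Passing cv to StackingClassifier is what keeps the meta-model from being trained on predictions the base models made for rows they had already seen.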
Epoch
In machine learning, an epoch refers to one complete pass over the entire training dataset during the
model training process. The number of epochs is therefore the number of times the algorithm goes
through the entire dataset during the training phase.
During the training process, the algorithm makes predictions on the training data, computes the loss, and
updates the model parameters to reduce the loss. The objective is to optimize the model's performance by
minimizing the loss function. One epoch is considered complete when the model has made predictions on
all the training data.
Epochs are an essential parameter in the training process as they can significantly affect the performance
of the model. Setting the number of epochs too low can result in an underfit model, while setting it too
high can lead to overfitting.
Underfitting occurs when the model fails to capture the underlying patterns in the data and performs
poorly on both the training and testing datasets. It happens when the model is too simple or not trained
enough. In such cases, increasing the number of epochs can help the model learn more from the data and
improve its performance.
Overfitting, on the other hand, happens when the model learns the noise in the training data and performs
well on the training set but poorly on the testing data. It occurs when the model is too complex or trained
for too many epochs. To avoid overfitting, the number of epochs must be limited, and other regularization
techniques like early stopping or dropout should be used.
Implementation in Python
In Python, the number of epochs is specified in the training loop of the machine learning model. For
example, when training a neural network using the Keras library, you can set the number of epochs using
the "epochs" argument in the "fit" method.
Example
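import numpy as np
from keras.models import Sequential
from keras.layers import Dense
X_train = np.random.rand(100, 10)  # random data; shapes assumed, original values are illegible
y_train = np.random.randint(2, size=(100, 1))
model = Sequential()
model.add(Dense(16, activation='relu', input_dim=10))  # hidden width assumed
model.add(Dense(1, activation='sigmoid'))
# compile the model with binary cross-entropy loss and adam optimizer
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model.fit(X_train, y_train, epochs=10)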
In this example, we generate some random data for training and create a simple neural network model
with one input layer, one hidden layer, and one output layer. We compile the model with binary
cross-entropy loss and the Adam optimizer and set the number of epochs to 10 in the "fit" method.
During the training process, the model makes predictions on the training data, computes the loss, and
updates the weights to minimize the loss. After completing 10 epochs, the model is considered trained,
and we can use it to make predictions on new, unseen data.
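The relationship between epochs, mini-batches, and weight updates can also be written out by hand. The sketch below is a plain-Python training loop for a one-feature linear model, not the Keras internals; the toy data, learning rate, and batch size of 32 are assumptions chosen so that 100 samples give 4 steps per epoch.

```python
def mse(w, b, xs, ys):
    """Mean squared error of the line w*x + b on the whole dataset."""
    return sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def train(xs, ys, epochs=10, batch_size=32, lr=0.1):
    w, b = 0.0, 0.0
    history = []
    steps = 0
    for epoch in range(epochs):                      # one epoch = one full pass over the data
        steps = 0
        for start in range(0, len(xs), batch_size):  # one step = one mini-batch update
            bx = xs[start:start + batch_size]
            by = ys[start:start + batch_size]
            grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(bx, by)) / len(bx)
            grad_b = sum(2 * (w * x + b - y) for x, y in zip(bx, by)) / len(bx)
            w -= lr * grad_w
            b -= lr * grad_b
            steps += 1
        history.append(mse(w, b, xs, ys))
        print(f"Epoch {epoch + 1}/{epochs} - steps: {steps} - loss: {history[-1]:.4f}")
    return w, b, history, steps


xs = [i / 100 for i in range(100)]   # 100 toy samples
ys = [3 * x + 2 for x in xs]         # noise-free target y = 3x + 2
initial_loss = mse(0.0, 0.0, xs, ys)
w, b, history, steps_per_epoch = train(xs, ys)
```

Every epoch performs 4 updates here, one per mini-batch, and the loss recorded at the end of each epoch is what a framework prints next to "Epoch k/10".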
Output
When you execute this code, it will produce an output like this -
Epoch 1/10
4/4 [==============================] - 1s 2ms/step - loss: 0.7012 - accuracy: 0.4976
Epoch 2/10
4/4 [==============================] - 0s 1ms/step - loss: 0.6995 - accuracy: 0.4390
Epoch 3/10
4/4 [==============================] - 0s 1ms/step - loss: 0.6921 - accuracy: 0.5123
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
Architecture of Perceptron
A single layer of Perceptron consists of an input layer, a weight layer, and an output layer. Each node in the
input layer is connected to each node in the weight layer with a weight assigned to each connection. Each
node in the weight layer computes a weighted sum of inputs and applies a threshold function to generate
the output.
The threshold function in Perceptron is the Heaviside step function, which returns a binary value of 1 if
the input is greater than or equal to zero, and 0 otherwise. The output of each node in the weight layer is
determined by -
$$
y = \begin{cases} 1, & \text{if } w_0 + w_1 x_1 + w_2 x_2 + \dots + w_n x_n \ge 0 \\ 0, & \text{otherwise} \end{cases}
$$
where y is the output; x_1, x_2, ..., x_n are the input features; w_1, w_2, ..., w_n are the corresponding
weights and w_0 is the bias term; and the condition >= 0 expresses the Heaviside step function.
Training of Perceptron
The training process of the Perceptron algorithm involves iteratively updating the weights until the model
converges to a set of weights that correctly classifies all training examples, which is guaranteed only
when the classes are linearly separable. Initially, the weights are set to zeros or small random values.
For each training example, the predicted output is compared to the actual output, and the weights are
updated accordingly to reduce the error.
The weight update rule in Perceptron is as follows -
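$$
w_i = w_i + \alpha \times (y - y') \times x_i
$$
where α is the learning rate, y is the actual output, y' is the predicted output, and x_i is the i-th input
feature.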
Example
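from sklearn.datasets import load_iris
from sklearn.linear_model import Perceptron
from sklearn.model_selection import train_test_split
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.2, random_state=42)  # split into training and testing sets; parameters assumed
perceptron = Perceptron().fit(X_train, y_train)
y_pred = perceptron.predict(X_test)
print("Accuracy:", (y_pred == y_test).mean())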
Accuracy: 0.8
Once the perceptron is trained, it can be used to make predictions on new input data. Given a set of input
values, the perceptron computes a weighted sum of the inputs and applies an activation function to the
sum to obtain the output value. This output value can then be interpreted as a prediction for the
corresponding input.
Here is an example implementation of a perceptron in Python using the step function as the activation
function -
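import numpy as np

class Perceptron:
    def __init__(self, learning_rate=0.1, epochs=100):
        self.learning_rate = learning_rate
        self.epochs = epochs
        self.weights = None
        self.bias = None

    def step_function(self, x):
        # Heaviside step: 1 if x >= 0, otherwise 0
        return np.where(x >= 0, 1, 0)

    def fit(self, X, y):
        n_samples, n_features = X.shape
        # initialize weights and bias to zero
        self.weights = np.zeros(n_features)
        self.bias = 0
        # iterate over epochs and update weights and bias
        for _ in range(self.epochs):
            for i in range(n_samples):
                linear_output = np.dot(self.weights, X[i]) + self.bias
                y_pred = self.step_function(linear_output)
                update = self.learning_rate * (y[i] - y_pred)
                self.weights += update * X[i]
                self.bias += update

    def predict(self, X):
        linear_output = np.dot(X, self.weights) + self.bias
        y_pred = self.step_function(linear_output)
        return y_pred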
In this implementation, the Perceptron class takes two parameters: learning_rate and epochs. The fit
method trains the perceptron on the input data X and the corresponding target values y. The predict
method takes an input data array and returns the predicted output values.
To use this implementation, we can create an instance of the Perceptron class and call the fit method to
train the model -
Once the model is trained, we can make predictions on new input data using the predict method -
predictions = perceptron.predict(test_data)
The output of this code is [1,0], which are the predicted values for the input data [[1,1], [0,1]].
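The training data behind this result is not shown. One dataset consistent with the predictions [1, 0] for the inputs [[1, 1], [0, 1]] is the logical AND of two binary inputs, so the dependency-free sketch below assumes it. The function names are illustrative, and the learning rate is set to 1.0 (an assumption) so that all arithmetic stays exact.

```python
def step(z):
    """Heaviside step function."""
    return 1 if z >= 0 else 0


def train_perceptron(X, y, learning_rate=1.0, epochs=100):
    weights = [0.0] * len(X[0])
    bias = 0.0
    for _ in range(epochs):
        for xi, target in zip(X, y):
            pred = step(sum(w * v for w, v in zip(weights, xi)) + bias)
            update = learning_rate * (target - pred)  # zero when the prediction is right
            weights = [w + update * v for w, v in zip(weights, xi)]
            bias += update
    return weights, bias


def predict(weights, bias, X):
    return [step(sum(w * v for w, v in zip(weights, xi)) + bias) for xi in X]


# Assumed training set: logical AND
X = [[0, 0], [0, 1], [1, 0], [1, 1]]
y = [0, 0, 0, 1]
weights, bias = train_perceptron(X, y)

test_data = [[1, 1], [0, 1]]
predictions = predict(weights, bias, test_data)
print(predictions)  # [1, 0]
```

Because AND is linearly separable, the update rule stops changing the weights after a few epochs, and every later epoch leaves them untouched.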
Regularization
In machine learning, regularization is a technique used to prevent overfitting, which occurs when a model
is too complex and fits the training data too well, but fails to generalize to new, unseen data. Regularization
introduces a penalty term to the cost function, which encourages the model to have smaller weights and a
simpler structure, thereby reducing overfitting.
There are several types of regularization techniques commonly used in machine learning, including L1
and L2 regularization, dropout regularization, and early stopping. In this article, we will focus on L1 and
L2 regularization, which are the most commonly used techniques.
L1 Regularization
L1 regularization, also known as Lasso regularization, is a technique that adds a penalty term to the cost
function, equal to the sum of the absolute values of the weights. The formula for the L1 regularization
penalty is -
$$
\lambda \times \sum_i |w_i|
$$
where λ is a hyperparameter that controls the strength of the regularization, and w_i is the i-th weight
in the model.
The effect of the L1 regularization penalty is to encourage the model to have sparse weights, that is, to
eliminate the weights that have little or no impact on the output. This has the effect of simplifying the
model and reducing overfitting.
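A small numeric sketch can make the sparsity effect visible. The first function evaluates the penalty from the formula above. The second applies soft-thresholding, the shrinkage step commonly used by solvers for L1-penalized problems, which sets small weights to exactly zero. The weight values and function names are assumptions for illustration, not taken from any particular library.

```python
def l1_penalty(weights, lam):
    """lam times the sum of absolute weights."""
    return lam * sum(abs(w) for w in weights)


def soft_threshold(w, t):
    """Shrink w toward zero by t; any weight within [-t, t] becomes exactly zero."""
    if w > t:
        return w - t
    if w < -t:
        return w + t
    return 0.0


weights = [0.05, -2.0, 0.5, -0.08]   # hypothetical model weights
lam = 0.1

print("L1 penalty:", l1_penalty(weights, lam))
sparse = [soft_threshold(w, lam) for w in weights]
print("After soft-thresholding:", sparse)
```

The two weights smaller than lam in magnitude are removed entirely, while the larger ones are only reduced by lam. This is the sense in which L1 regularization produces sparse models.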
Example
To implement L1 regularization in Python, we can use the Lasso class from the scikit-learn library. Here is
an example of how to use L1 regularization for linear regression -
lasso = Lasso(alpha=0.1)
lasso.fit(X_train, y_train)
y_pred = lasso.predict(X_test)
In this example, we load the Boston Housing dataset, split it into training and test sets, and create a Lasso
model with L1 regularization using an alpha value of 0.1. We then train the model on the training data and
make predictions on the test data. Finally, we calculate the mean squared error of the predictions.
Output
When you execute this code, it will produce the following output -
L2 Regularization
L2 regularization, also known as Ridge regularization, is a technique that adds a penalty term to the cost
function, equal to the sum of the squares of the weights. The formula for the L2 regularization penalty is -
$$
\lambda \times \sum_i w_i^2
$$
where λ is a hyperparameter that controls the strength of the regularization, and w_i is the i-th weight in the
model.
The effect of the L2 regularization penalty is to encourage the model to have small weights, that is, to
reduce the magnitude of all the weights in the model. This has the effect of smoothing the model and
reducing overfitting.
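The shrinking effect can be checked on the simplest possible case, a single weight and no intercept. Minimizing sum((y - w*x)^2) + λ*w^2 has the closed-form solution w = sum(x*y) / (sum(x^2) + λ). The sketch below evaluates it for a few values of λ; the data and function names are assumptions for illustration.

```python
def l2_penalty(weights, lam):
    """lam times the sum of squared weights."""
    return lam * sum(w * w for w in weights)


def ridge_1d(xs, ys, lam):
    """Closed-form minimizer of sum((y - w*x)^2) + lam * w^2 (one weight, no intercept)."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)


xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]   # exactly y = 2x

for lam in (0.0, 1.0, 14.0):
    print(f"lambda={lam}: w={ridge_1d(xs, ys, lam):.4f}")
```

As λ grows, the weight moves from 2.0 toward zero but never reaches it. L2 regularization shrinks weights smoothly, whereas L1 regularization can set them to exactly zero.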
Example
To implement L2 regularization in Python, we can use the Ridge class from the scikit-learn library. Here is
an example of how to use L2 regularization for linear regression -
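import numpy as np
from sklearn.datasets import load_boston  # removed in scikit-learn 1.2; needs an older release
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
boston = load_boston()
# create feature and target arrays
X = boston.data
y = boston.target
# standardize the features (fitted on all rows, as in the source; fitting on the
# training split alone would avoid train-test contamination)
scaler = StandardScaler()
X = scaler.fit_transform(X)
# split into training and testing sets (split parameters assumed)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = Ridge(alpha=0.1)
model.fit(X_train, y_train)
# make predictions on the testing data
y_pred = model.predict(X_test)
print("Mean Squared Error:", mean_squared_error(y_test, y_pred))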
In this example, we first load the Boston housing dataset and split it into training and testing sets. We then
standardize the feature data using a StandardScaler.
Next, we define the Ridge regression model and set the alpha parameter to 0.1, which controls the strength
of the L2 regularization.
We fit the model on the training data and make predictions on the testing data. Finally, we calculate the
mean squared error to evaluate the performance of the model.
Output
When you execute this code, it will produce the following output -
Mean Squared Error: 24.29346250596107
Overfitting
Overfitting occurs when a model learns the noise in the training data, rather than the underlying patterns. This
causes the model to perform well on the training data, but poorly on new data. Essentially, the model becomes too
specialized to the training data, and is unable to generalize to new data.
Overfitting is a common problem when using complex models, such as deep neural networks. These models have
many parameters, and are able to fit the training data very closely. However, this often comes at the expense of
generalization performance.
Causes of Overfitting
There are several factors that can contribute to overfitting -
1. Complex models - As mentioned earlier, complex models are more likely to overfit than simpler
models. This is because they have more parameters, and are able to fit the training data more closely.
2. Limited training data - When there is not enough training data, it becomes difficult for the model to
learn the underlying patterns, and it may instead learn the noise in the data.
3. Unrepresentative training data - If the training data is not representative of the problem that the
model is trying to solve, the model may learn irrelevant patterns that do not generalize well to new
data.
4. Lack of regularization - Regularization is a technique used to prevent overfitting by adding a penalty
term to the cost function. If this penalty term is not present, the model is more likely to overfit.
Example
Here is an implementation of early stopping and L2 regularization in Python using Keras -
from keras.layers import Dense
from keras.callbacks import EarlyStopping
from keras import regularizers
model.add(Dense(64, input_dim=X_train.shape[1], activation='relu', kernel_regularizer=regularizers.l2(0.01)))  # L2 strength assumed; the original value is illegible
In this code, we have used the Sequential model in Keras to define the model architecture, and we have added L2
regularization to the first two layers using the kernel_regularizer argument. We have also set up an early stopping
callback using the EarlyStopping class in Keras, which will monitor the validation loss and stop training if it stops
improving for 5 epochs.
During training, we pass in the X_train and y_train data as well as a validation split of 0.2 to monitor the validation
loss. We also set a batch size of 64 and train for a maximum of 100 epochs.
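The patience rule described above can be sketched independently of Keras. The function below scans a list of per-epoch validation losses and reports the epoch at which training would stop. It mirrors the idea of the EarlyStopping callback rather than its actual implementation, and the loss values are made up for illustration.

```python
def early_stopping_epoch(val_losses, patience=5):
    """Return the 1-based epoch at which training stops, or None if it never stops early."""
    best = float("inf")
    wait = 0  # epochs since the last improvement
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best = loss
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return None


# Hypothetical validation losses: best value at epoch 4, then steadily worse
val_losses = [0.90, 0.70, 0.60, 0.55, 0.56, 0.57, 0.58, 0.60, 0.61, 0.65]
stop_epoch = early_stopping_epoch(val_losses, patience=5)
print("Training stops after epoch", stop_epoch)  # 9
```

The validation loss is lowest at epoch 4, and after five epochs without improvement the loop stops at epoch 9, well before the maximum number of epochs would be reached.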
Output
When you execute this code, it will produce an output like the one shown below -
By using early stopping and L2 regularization, we can help prevent overfitting and improve the generalization
performance of our model.
P-value
In machine learning, we use the p-value to test the null hypothesis that there is no relationship between two
variables. For example, if we have a dataset of house prices and we want to determine whether there is a significant
relationship between the size of the house and its price, we can use the p-value to test this hypothesis.
To understand the p-value, we first need the concepts of the null hypothesis and the alternative hypothesis.
The null hypothesis states that there is no relationship between the two variables, while the alternative
hypothesis states that there is one.
Once we have defined the null hypothesis and the alternative hypothesis, we compute the p-value. The p-value
is the probability of obtaining the observed result or a more extreme result, assuming that the null
hypothesis is true.
If the p-value is less than the significance level (usually set at 0.05), we reject the null hypothesis in favor of
the alternative hypothesis and call the relationship statistically significant. If the p-value is greater than the
significance level, we fail to reject the null hypothesis. This means the data do not provide enough evidence of
a relationship, which is not the same as showing that no relationship exists.
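The definition of the p-value can be applied literally with a permutation test. Under the null hypothesis the group labels are interchangeable, so we enumerate every way of splitting the pooled values into two groups and count how often the difference in means is at least as extreme as the observed one. This is a different test from the t-test and is used here only to illustrate the definition; the house prices and the function name are hypothetical.

```python
from itertools import combinations


def permutation_p_value(group_a, group_b):
    """Exact two-sided permutation p-value for a difference in means."""
    pooled = group_a + group_b
    n_a = len(group_a)
    n_b = len(pooled) - n_a
    total = sum(pooled)

    def mean_diff(sum_a):
        return sum_a / n_a - (total - sum_a) / n_b

    observed = abs(mean_diff(sum(group_a)))
    extreme = 0
    n_splits = 0
    for idx in combinations(range(len(pooled)), n_a):
        n_splits += 1
        sum_a = sum(pooled[i] for i in idx)
        if abs(mean_diff(sum_a)) >= observed - 1e-12:  # as extreme or more extreme
            extreme += 1
    return extreme / n_splits


small_houses = [1.0, 2.0, 3.0, 4.0]   # hypothetical prices of small houses
large_houses = [5.0, 6.0, 7.0, 8.0]   # hypothetical prices of large houses
p_value = permutation_p_value(small_houses, large_houses)
print(f"P-value: {p_value:.4f}")
```

Only 2 of the 70 possible splits separate the values as strongly as the observed grouping, giving a p-value of about 0.0286. That is below 0.05, so the null hypothesis would be rejected at that level.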
To demonstrate the implementation of p-value in Machine Learning, we will use the breast cancer dataset provided
by scikit-learn. The goal of this dataset is to predict whether a breast tumor is malignant or benign based on various
features such as the tumor's radius, texture, perimeter, area, smoothness, compactness, concavity, and symmetry.
First, we will load the dataset and split it into training and testing sets -
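from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(data.data, data.target, test_size=0.2, random_state=42)  # split parameters assumed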
Next, we will use the SelectKBest class from scikit-learn to select the top k features based on their ANOVA
F-scores, which correspond to the smallest p-values. Here, we will select the top 5 features -
from sklearn.feature_selection import SelectKBest, f_classif
k = 5
selector = SelectKBest(score_func=f_classif, k=k)
X_train_new = selector.fit_transform(X_train, y_train)
X_test_new = selector.transform(X_test)
The SelectKBest class takes a score function as input and uses it to score each feature. We use the f_classif
function, which computes the ANOVA F-value between each feature and the target variable, together with the
corresponding p-value. The k parameter specifies the number of top-scoring features to select.
After fitting the selector on the training data, we transform the data to keep only the top k features using the
fit_transform() method. We also transform the testing data to keep only the selected features using the transform()
method.
We can now train a model on the selected features and evaluate its performance -
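from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
model = LogisticRegression()  # variable name and default settings assumed
model.fit(X_train_new, y_train)
y_pred = model.predict(X_test_new)
print(f"Accuracy: {accuracy_score(y_test, y_pred):.2f}")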
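We can also compute a p-value for an individual feature. For example, to test whether the mean radius feature
differs between the malignant and benign classes, we can use the ttest_ind() function from the scipy.stats
module -
from scipy.stats import ttest_ind
t_stat, p_value = ttest_ind(data.data[data.target == 0, 0], data.data[data.target == 1, 0])  # column 0 is mean radius
print(f"P-value: {p_value:.2f}")
The ttest_ind() function takes two arrays as input and returns the t-statistic and the two-tailed p-value.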
Output
We will get the following output from the above implementation -
Accuracy: 0.97
P-value: 0.00
In this example, we calculated the p-value for the mean radius feature between the malignant and benign classes.
Entropy
Entropy is a concept that originates from thermodynamics and was later applied in various fields,
including information theory, statistics, and machine learning. In machine learning, entropy is used as a
measure of the impurity or randomness of a set of data. Specifically, entropy is used in decision tree algorithms
to decide how to split the data to create a more homogeneous subset. In this article, we will discuss entropy
in machine learning, its properties, and its implementation in Python.
Entropy is defined as a measure of disorder or randomness in a system. In the context of decision trees,
entropy is used as a measure of the impurity of a node. A node is considered pure if all the examples in it
belong to the same class. In contrast, a node is impure if it contains examples from multiple classes.
To calculate entropy, we need to first define the probability of each class in the data set. Let p(i) be the
probability of an example belonging to class i. If we have k classes, then the total entropy of the system,
denoted by H(S), is calculated as follows -
$$
H(S) = -\sum_{i=1}^{k} p(i) \log_2 p(i)
$$
where the sum is taken over all k classes. This equation is called the Shannon entropy.
For example, suppose we have a dataset with 100 examples, of which 60 belong to class A and 40 belong to
class B. Then the probability of class A is 0.6 and the probability of class B is 0.4. The entropy of the dataset
is then -
$$
H(S) = -(0.6 \log_2 0.6 + 0.4 \log_2 0.4) \approx 0.971
$$
If all the examples in the dataset belong to the same class, then the entropy is 0, indicating a pure node. On
the other hand, if the examples are evenly distributed across all classes, then the entropy is high,
indicating an impure node.
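The worked example above can be checked with a few lines of standard-library Python. This is a minimal sketch; the function name and the label lists are illustrative.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


labels = ["A"] * 60 + ["B"] * 40     # 60 examples of class A, 40 of class B
h = entropy(labels)
print(round(h, 3))  # 0.971
```

A list containing a single class gives an entropy of 0, and a perfectly balanced two-class list gives 1 bit, which is the maximum for two classes.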
In decision tree algorithms, entropy is used to determine the best split at each node. The goal is to create a
split that results in the most homogeneous subsets. This is done by calculating the entropy of each
possible split and selecting the split that results in the lowest total entropy.
For example, suppose we have a dataset with two features, X1 and X2, and the goal is to predict the class
label, Y. We start by calculating the entropy of the entire dataset, H(S). Next, we calculate the entropy of
each possible split based on each feature. For example, we could split the data based on the value of X1 or
the value of X2. The entropy of each split is calculated as follows -
$$
H(X_1) = p_1 H(S_1) + p_2 H(S_2), \qquad H(X_2) = p_3 H(S_3) + p_4 H(S_4)
$$
where p1, p2, p3, and p4 are the probabilities of each subset (the fraction of examples that fall into it);
and H(S1), H(S2), H(S3), and H(S4) are the entropies of each subset.
We then select the split that results in the lowest total entropy, which is given by -
$$
H_{\text{split}} = \min\big(H(X_1), H(X_2)\big)
$$
This split is then used to create the child nodes of the decision tree, and the process is repeated recursively
until all nodes are pure or a stopping criterion is met.
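The split selection described above can be sketched as follows. Each candidate split is scored by the weighted entropy of its subsets, and the split with the lowest score is chosen. The two candidate splits and the function names are hypothetical and chosen so that the result is easy to check by hand.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def split_entropy(subsets):
    """Weighted entropy of a split: sum of p_j * H(S_j), where p_j = |S_j| / |S|."""
    total = sum(len(s) for s in subsets)
    return sum(len(s) / total * entropy(s) for s in subsets)


# The same 8 labels, partitioned by two hypothetical features
split_x1 = [["A", "A", "A", "A"], ["B", "B", "B", "B"]]   # separates the classes completely
split_x2 = [["A", "A", "B", "B"], ["A", "A", "B", "B"]]   # leaves both halves mixed

h_x1 = split_entropy(split_x1)
h_x2 = split_entropy(split_x2)
best = "X1" if h_x1 <= h_x2 else "X2"
print("H(X1) =", h_x1, " H(X2) =", h_x2, " best split:", best)
```

Splitting on X1 yields two pure subsets with a weighted entropy of 0, while splitting on X2 leaves the entropy at 1 bit, so X1 is selected.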
Example
Let's take an example to understand how it can be implemented in Python. Here we will use the "iris"
dataset -
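import numpy as np
from sklearn.datasets import load_iris
iris = load_iris()
X = iris.data
y = iris.target
def entropy(y):
    n = len(y)
    _, counts = np.unique(y, return_counts=True)  # count of each class
    probs = counts / n
    return -np.sum(probs * np.log2(probs))
print("Entropy:", entropy(y))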
The above code loads the iris dataset, extracts the features and target, and defines a function to calculate
entropy. The entropy() function takes a vector of target values and returns the entropy of the set.
The function first calculates the number of examples in the set and the count of each class. It then
calculates the proportion of each class and uses these to calculate the entropy of the set using the entropy
formula. Finally, the code calculates the entropy of the target variable in the iris dataset and prints it to the
console.
Output
When you execute this code, it will produce the following output -
MLOps
MLOps (Machine Learning Operations) is a set of practices and tools that combine software engineering,
data science, and operations to enable the automated deployment, monitoring, and management of
machine learning models in production environments.
MLOps addresses the challenges of managing and scaling machine learning models in production, which
include version control, reproducibility, model deployment, monitoring, and maintenance. It aims to
streamline the entire machine learning lifecycle, from data preparation and model training to deployment
and maintenance.
Python libraries and platforms commonly used in MLOps workflows include -
1. Scikit-learn - A popular machine learning library that provides tools for data preprocessing,
model selection, and evaluation.
2. TensorFlow - A widely used open-source platform for building and deploying machine learning
models.
3. Keras - A high-level neural networks API that can run on top of TensorFlow.
4. PyTorch - A deep learning framework that provides tools for building and deploying neural
networks.
5. MLflow - An open-source platform for managing the machine learning lifecycle that provides
tools for tracking experiments, packaging code and models, and deploying models in production.
6. Kubeflow - A machine learning toolkit for Kubernetes that provides tools for managing and
scaling machine learning workflows.
Data Leakage
Data leakage is a common problem in machine learning that occurs when information from outside the
training dataset is used to create or evaluate a model. This can lead to overfitting, where the model is too
closely tailored to the training data and performs poorly on new data.
There are two main types of data leakage: target leakage and train-test contamination.
Target Leakage
Target leakage occurs when features that are not available during prediction are used to create the model.
For example, if we are predicting whether a customer will churn, and we include the customer's
cancellation date as a feature, then the model will have access to information that would not be available in
practice. This can lead to unrealistically high accuracy during training and poor performance on new data.
Train-test Contamination
Train-test contamination occurs when information from the test set is inadvertently used in the training
process. For example, if we normalize the data based on the mean and standard deviation of the entire
dataset instead of just the training set, then the model will have access to information that would not be
available in practice. This can lead to overly optimistic estimates of model performance.
To prevent data leakage, it helps to follow these practices -
1. Splitting the data into separate training and test sets before doing any preprocessing or feature
engineering.
2. Only using features that would be available at the time of prediction.
3. Using cross-validation to evaluate model performance instead of a single train-test split.
4. Ensuring that all preprocessing steps (such as normalization or scaling) are fitted on the
training set only, and then applying the same fitted transformations to the test set.
5. Being aware of any potential sources of leakage, such as date or time-based features, and handling
them appropriately.
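The scaling practice in the list above can be illustrated without any library beyond the standard one. The sketch below learns the mean and standard deviation from the training values only and reuses them for the test value. For comparison, it also computes the contaminated statistics that result from including the test value. The numbers and function names are assumptions for illustration.

```python
import statistics


def fit_scaler(values):
    """Learn the mean and (population) standard deviation from the given values only."""
    return statistics.fmean(values), statistics.pstdev(values)


def transform(values, mean, std):
    return [(v - mean) / std for v in values]


train = [1.0, 2.0, 3.0, 4.0, 5.0]
test = [100.0]                       # an extreme value the model has never seen

# Correct: statistics come from the training set and are reused on the test set
mean, std = fit_scaler(train)
train_scaled = transform(train, mean, std)
test_scaled = transform(test, mean, std)

# Leaky: the test value influences the statistics used to scale the training data
leaky_mean, leaky_std = fit_scaler(train + test)

print("train-only mean/std:", mean, round(std, 4))
print("leaky mean/std:     ", round(leaky_mean, 4), round(leaky_std, 4))
```

The single test value shifts the leaky mean from 3.0 to about 19.17, which shows how strongly test-set information can reshape the features the model is trained on.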
Implementation in Python
Here is an example that uses the scikit-learn breast cancer dataset and ensures that no information
from the test set is leaked into the model during training -
Example
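data = load_breast_cancer()
X, y = data.data, data.target
# Split the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# The scaler inside the pipeline is fitted on the training data only (classifier choice assumed)
pipeline = Pipeline([('scaler', StandardScaler()), ('classifier', LogisticRegression())])
pipeline.fit(X_train, y_train)
accuracy = pipeline.score(X_test, y_test)
print("Accuracy:", accuracy)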
Output
When you execute this code, it will produce the following output -
Accuracy: 0.9824561403508771
Thank You