Handling Imbalanced Datasets in ML

The document discusses the challenges of imbalanced datasets in machine learning, particularly in predicting minority classes like fraudulent transactions. It highlights the issues that arise when models are trained on unbalanced data, leading to bias towards the majority class and poor generalization. The document also introduces techniques for handling class imbalance, such as oversampling, undersampling, and SMOTE, along with data preprocessing methods like imputation and scaling.

Uploaded by

Aditi Raghu

Imbalanced Classes

Imbalanced Dataset
● For example, suppose you have credit card transaction data and you are
supposed to predict fraudulent transactions. You'll likely have 10,000
authentic transactions for every 1 fraudulent transaction. That's quite an
imbalance!
● In machine learning terms: often you'll have a large number of
observations for one class (referred to as the majority class), and
far fewer observations for one or more other classes (referred to as the
minority classes).
Imbalanced Dataset
● The problem is that machine learning models trained on
unbalanced datasets often have poor results when they have to
generalize (predict a class or classify unseen observations).
Whichever algorithm you choose, it will be affected, although some
models are more susceptible to unbalanced data than others.
Ultimately, this means you will not end up with a good model, and
the reasons include:

○ The algorithm receives significantly more examples from one
class, prompting it to be biased towards that particular class.
○ It does not learn what makes the other class "different" and fails
to understand the underlying patterns that allow us to
distinguish classes.
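
The bias towards the majority class can be illustrated with a small sketch. The labels below are hypothetical (10 frauds in 10,000 transactions, not taken from the IEEE data), and the "model" simply predicts the majority class every time. Accuracy looks excellent while not a single fraud is caught:

```python
# Hypothetical labels: 1 = fraud, 0 = authentic (10 frauds in 10,000 transactions).
y_true = [1] * 10 + [0] * 9990

# A "model" that ignores its input and always predicts the majority class.
y_pred = [0] * len(y_true)

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Recall on the minority class: fraction of real frauds that were caught.
frauds_caught = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
recall = frauds_caught / sum(y_true)

print(f"accuracy = {accuracy:.3f}, fraud recall = {recall:.1f}")
```

This is why accuracy alone is a poor score on imbalanced data.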
Imbalanced Dataset

Can you think of any other example where there is class imbalance?
Resources for the Tutorial
Download Dataset
● Download IEEE Fraud Dataset:
[Link]
official/Datasets/blob/master/fraud_data.csv

!!! This is a huge dataset, so opening it in Excel might be difficult.
Please open it in a Python environment (Colab or Jupyter notebook).

● Read about this dataset here: [Link]
About the Dataset
IEEE Fraud Dataset
● The data is broken into two files identity and transaction, which are joined by
TransactionID
Transaction Table *
● TransactionDT: timedelta from a given reference datetime (not an actual timestamp)
● TransactionAMT: transaction payment amount in USD
● ProductCD: product code, the product for each transaction
● card1 - card6: payment card information, such as card type, card category, issue bank,
country, etc.
● addr: address of the customer
● dist: distance
● P_emaildomain and R_emaildomain: purchaser and recipient email domain
About the Dataset
● M1-M9: match, such as names on card and address, etc.

● Categorical Features:
○ ProductCD
○ card1 - card6
○ addr1, addr2
○ P_emaildomain, R_emaildomain
○ M1 - M9 (bank sensitive data)
Note: Some of the feature/variable description is not given as
About the Dataset
Identity Table *
● Variables in this table are identity information – network connection information (IP,
ISP, Proxy, etc) and digital signature (UA/browser/os/version, etc) associated with
transactions.
● They're collected by Vesta’s fraud protection system and digital security partners.
● (The field names are masked and pairwise dictionary will not be provided for privacy
protection and contract agreement)

● Categorical Features:
○ DeviceType
○ DeviceInfo
○ id_12 - id_38
Notebook Link

[Link]
notebooks-handling-imbalanced-classes

!! The dataset takes a lot of time to load on Colab.

Additional Resources

A nice article on data preprocessing:
[Link]
concepts-fa946d11c825

P.S.: Try opening it in an incognito window if it asks for a premium
subscription upgrade.
DATA PRE-PROCESSING
Data Pre-processing: Topics today
• Handling Missing Values
• Handling Imbalanced datasets, Oversampling - SMOTE
• Standardization/Normalization and transformation for data
• Explained with an example on Detecting Fraudulent Transactions in a given
dataset!
Pre-processing – what, why and how?
What?
Pre-processing is the process of preparing the data for training.
Why?
• Data is not ready-made for us:
✔ Missing values
✔ Incorrect data entries
✔ Class imbalance
✔ Different scales of data
How?
• Handling Missing Values
• Handling Imbalanced datasets, Oversampling - SMOTE
• Standardization/Normalization and transformation for data
Missing values in data

Most datasets are not perfect: they have missing values.

Missing values can be due to:
• Missed entries by participants (in case of surveys)
• Missed information by database managers
• That variable is not relevant for that data point (e.g. in a customer
dataset for banks, 'No of children' is not relevant for unmarried
customers and hence results in an NA value.)
Example dataset: IEEE Fraud Dataset - a 10% sample used for simplicity
IEEE Fraud Dataset was provided at Kaggle a year ago:

Dataset | Observations | True positives | No of variables | % of fraud cases
IEEE    | 57049        | 2005           | 180             | 3.5

⮚ Categorical Features – Transaction
▪ ProductCD – Product code
▪ card1 - card6: payment card information, such as card type, card
category, issue bank, country, etc.
▪ addr1, addr2
▪ P_emaildomain – Purchaser
▪ R_emaildomain – Recipient
▪ M1 - M9 – Match between names on card and address etc.
⮚ Categorical Features – Identity
▪ DeviceType
▪ DeviceInfo
▪ id_12 - id_38
• The TransactionDT feature is a timedelta from a given reference
datetime (not an actual timestamp).
⮚ Outcome/Target Variable – isFraud
• whether transaction is fraud or not
More about this dataset here: [Link]
detection/data
Ideas for Missingness Mechanisms

Missingness in a dataset can affect a machine learning problem in different ways:

1. Missing completely at random.
(E.g.: a participant forgot to fill in a survey field, or a value was skipped during data entry.)
2. Missingness is related to other variables (predictors) used for prediction.
(E.g.: in the fraud dataset, 'VAT number' doesn't exist because the customer is of type
'individual' and not 'business'. Here 'customer type' is another predictor variable for fraud.)
3. Missingness is related to the outcome variable being predicted and is hence
not random.
(E.g.: in the fraud dataset, 'Transaction message' doesn't exist because the customer is
perhaps doing something suspicious. Here, imputation might need advanced
techniques, which are beyond the scope of this course.)
Why fill missing
values?
• How should a model read a missing
value?
• Doesn’t it change what machine
learns if it reads it as 0?
• Mathematical models cannot
understand what a missing value
means.
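
To see why reading a missing value as 0 changes what the machine learns, here is a minimal sketch with made-up transaction amounts (None marks a missing entry). The average of the column shifts noticeably once the gaps are treated as real zeros:

```python
from statistics import mean

# Hypothetical transaction amounts; None marks a missing value.
amounts = [120.0, 80.0, None, 100.0, None]

# Option 1: ignore the missing entries.
observed = [a for a in amounts if a is not None]
mean_observed = mean(observed)

# Option 2: pretend every missing entry is a real 0.
as_zero = [0.0 if a is None else a for a in amounts]
mean_as_zero = mean(as_zero)

print(mean_observed, mean_as_zero)
```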

Source: Tim Bock on DisplayR


Techniques for filling missing values

• Single Imputation
• Regression Imputation
• Multiple Imputation
Which variables to impute:
filling missing values vs using unreal information

• Imputation (filling in unreal values) doesn't come without consequences.
• Is it worth using a variable in which 80% of the values were missing and
have been imputed?
• A typical recommendation is to use only variables with < 20%
missingness.
• The heuristic cut-off for % of missingness ranges from 10% to 20%, and
academic research hasn't reached a consensus (Schlomer, Bauman, and Card
2010).
• Conclusion: In our example, we will impute only variables with less than
20% missingness and leave out variables with missingness higher than
that.
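
The 20% rule can be sketched in a few lines. The columns and values below are invented for illustration (only the column names echo the fraud dataset), and the threshold is the one chosen on this slide. In the notebook the same check would normally be done on a DataFrame:

```python
# Hypothetical columns; None marks a missing value.
data = {
    "TransactionAMT": [50.0, 20.0, None, 75.0, 10.0, 5.0, 12.0, 99.0, 30.0, 41.0],
    "dist":           [None, None, None, None, None, None, None, None, 3.0, 7.0],
    "ProductCD":      ["W", "C", "W", "W", "H", "W", "C", "W", "W", "R"],
}

THRESHOLD = 0.20  # keep variables with less than 20% missing values

def missing_fraction(values):
    """Share of entries in a column that are missing (None)."""
    return sum(v is None for v in values) / len(values)

missingness = {name: missing_fraction(col) for name, col in data.items()}
to_impute = [name for name, frac in missingness.items() if frac < THRESHOLD]
to_drop = [name for name, frac in missingness.items() if frac >= THRESHOLD]

print(missingness)
print("impute:", to_impute, "drop:", to_drop)
```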
Simple Imputation and how it helps

Simple Imputation: replace the missing value in a variable by the
mean/mode of the variable:
• If variable is numeric: replace with mean of the variable
• If variable is categorical: replace with mode of the variable

How it helps? Imputing by mean or median keeps the centre of the
variable's distribution where it was, although the spread (variance)
shrinks a little because every filled value is the same.
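
A minimal sketch of simple imputation, using only the Python standard library. The helper name impute and the sample values are made up for illustration:

```python
from statistics import mean, mode

def impute(values):
    """Fill None entries with the mean (numeric) or mode (categorical) of the observed values."""
    observed = [v for v in values if v is not None]
    numeric = all(isinstance(v, (int, float)) for v in observed)
    fill = mean(observed) if numeric else mode(observed)
    return [fill if v is None else v for v in values]

amounts = [10.0, None, 30.0, 20.0]          # numeric column: filled with the mean
card_type = ["visa", "visa", None, "amex"]  # categorical column: filled with the mode

print(impute(amounts))
print(impute(card_type))
```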
Other Imputation Techniques

• Multiple Imputation: Multiple Imputation fills in estimates for the missing data.
But to capture the uncertainty in those estimates, MI estimates the values
multiple times.

• Example in Python: Iterative Imputer
A strategy for imputing missing values by modeling each feature with missing
values as a function of other features in a round-robin fashion.
Example of Multivariate feature imputation:
Iterative Imputer
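
The idea of "modeling a feature with missing values as a function of other features" can be sketched for the simplest case: one feature with gaps, predicted from one other feature with a straight line. This is a hand-written illustration with made-up numbers, not the scikit-learn implementation. An iterative imputer repeats a step like this for every feature with missing values, over several rounds:

```python
def fit_line(xs, ys):
    """Ordinary least-squares slope and intercept for one predictor."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar

# Hypothetical rows (feature_a, feature_b); None marks a missing feature_b.
rows = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0), (5.0, None)]

# Fit the line on the rows where both features are present.
complete = [(a, b) for a, b in rows if b is not None]
slope, intercept = fit_line([a for a, _ in complete], [b for _, b in complete])

# Replace each missing feature_b with the value predicted from feature_a.
filled = [(a, b if b is not None else slope * a + intercept) for a, b in rows]
print(filled[-1])
```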
Recommended Imputations for various Missingness Mechanisms
1. Missing completely at random - reasonably good to use single imputation techniques
(by mean for numeric variables and mode for categorical variables).
2. Missingness is related to other variables (predictors) used for prediction -
maybe use multiple imputation here.
3. Missingness is related to the outcome variable being predicted - here, imputation
might need advanced techniques. Beyond the scope of this course.
References

• Common Methods to Imputing Missing Data:
[Link]
common-methods-to-imputing-missing-data/
• Multiple Imputation:
[Link]
recommended-solutions/
• Imputation in Python using scikit-learn:
[Link]
• Schlomer, Gabriel L., Sheri Bauman, and Noel A. Card. 2010. "Best
Practices for Missing Data Management in Counseling Psychology."
Journal of Counseling Psychology.
One Hot encoding in Python
• In Python, typically machine learning models expect inputs to be
numbers.
• Hence, it is easier to convert all categorical variables to numeric.
• One such technique popularly used is one-hot encoding.
• One-hot encoding is checking
(Binary is nothing but assign 0 or 1)
• Example :
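
A minimal one-hot encoding sketch in plain Python. The helper one_hot and the DeviceType values are made up for illustration. Each category becomes its own 0/1 column:

```python
def one_hot(values):
    """Return (categories, rows) where each row has a 1 in the column of its category."""
    categories = sorted(set(values))
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, rows

device_type = ["mobile", "desktop", "mobile"]  # hypothetical DeviceType values
categories, encoded = one_hot(device_type)

print(categories)  # column order
for row in encoded:
    print(row)
```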
DEMONSTRATE THE FIRST DECISION TREE
A simple decision tree model
CLASS IMBALANCE PROBLEM
Machine Learning: Classification Vs Regression
Classification vs Regression:

• Classification
✔ Classification is the problem of assigning the outcome to one of
two or more classes
• Classification examples:
✔ Predict colour: Red/Blue/Green etc
✔ Predict if name matches/not
• Regression
✔ Regression is the problem of predicting a continuous outcome
(a numeric outcome)
• Regression examples:
✔ Predict price of house
✔ Predict age of a dog based on data of dog images!

Will this lead to bias
against women?

The problem of Class Imbalance
⮚ Classification Problem:
• Class Imbalance: the minority class constitutes a very small fraction of the
data compared to the majority class.
⮚ Why?
• Class imbalance in a classification problem means there are too few data points
of one class compared to another class we are trying to predict.
⮚ How does it affect the model?
• It leads the machine to learn too much about the dominant class and too little
about the minority class!
⮚ Examples:
• Millions of black people affected by racial bias in health-care algorithms
(Source: Nature)
• Credit card approval algorithms may be biased against women. (Source:
[Link])
Will this lead to bias
against women?

Example of Class
Imbalance
• Given a database of machine learning publications,
if the problem is to predict whether a researcher is
male or female, will the default prediction be biased
by machine learning?
When is class imbalance a problem?

• Class imbalance is a problem when there are too few minority class (fraud)
observations for the model to learn from.
• One needs to decide whether to create new minority class (fraud) observations or
remove existing majority class (normal transactions) observations.
Class Imbalance in Machine learning : in our example: Balanced Scale Data

• Majority class:
• Minority class:
How to Handle Imbalanced datasets

• Oversampling: Artificially increase the minority class (e.g. by duplicating minority
class observations or artificially creating them)
• Undersampling: Sample out observations from the majority class to reach a certain
ratio between the majority and minority classes
• SMOTE: Oversampling with synthetic (newly created) minority observations, often
used in combination with undersampling of the majority class.
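
Random oversampling and undersampling can be sketched with the standard library alone. The observations, the 95/5 split and the 2:1 target ratio below are all made up for illustration:

```python
import random

random.seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical observations as (features, label); label 1 is the minority (fraud) class.
majority = [([i], 0) for i in range(95)]
minority = [([i], 1) for i in range(5)]

# Random oversampling: draw minority rows with replacement until the classes match.
extra = random.choices(minority, k=len(majority) - len(minority))
balanced_by_oversampling = majority + minority + extra

# Random undersampling: keep a random subset of the majority,
# here 2 majority rows for every minority row.
ratio = 2
kept_majority = random.sample(majority, k=ratio * len(minority))
balanced_by_undersampling = kept_majority + minority

print(len(balanced_by_oversampling), len(balanced_by_undersampling))
```

Oversampling keeps all the data but repeats minority rows. Undersampling throws majority rows away, which is cheap but loses information.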
Delving deeper into one technique - SMOTE

• SMOTE (Synthetic Minority Over-sampling TEchnique):
Synthetically creates new minority class observations, each placed
between an existing minority observation and one of its nearest
minority-class neighbours. It is often combined with undersampling of
the majority class to get a certain ratio between the classes.
• Proposed by Chawla et al 2002. (ref)
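
The core step of SMOTE, creating a new point on the line segment between a minority observation and one of its nearest minority neighbours, can be sketched as below. This is a simplified illustration, not the imbalanced-learn implementation: the function name smote_like, the points and the choice k=2 are all assumptions made for the example.

```python
import math
import random

def smote_like(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points, each between a minority point
    and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest neighbours of `base` among the other minority points
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbour)))
    return synthetic

minority = [(1.0, 1.0), (2.0, 1.5), (1.5, 2.0), (3.0, 3.0)]  # hypothetical fraud observations
new_points = smote_like(minority, n_new=6)
print(len(minority) + len(new_points))
```

Because every synthetic point lies between two real minority points, the new observations stay inside the region the minority class already occupies instead of being exact duplicates.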
PERFORMANCE WITH AND
WITHOUT HANDLING
CLASS IMBALANCE
Performance with and without handling class imbalance

• AUC (a performance score for decision tree classifier) is slightly better using the
“SMOTE’d” data based model.
• We can play around with parameters in SMOTE and further improve the model.
• We can also use advanced machine learning models to improve further!
References

• SMOTE example in Python: [Link]


[Link]/en/stable/generated/imblearn.over_sampling.
[Link]
• SMOTE original paper:
[Link]
• Balanced Scale Data:
[Link]
scale/
• Example used to demonstrate data:
[Link]
Standardization/scaling/Normalization of
data - what, why and how?

What?
• Standardization/Scaling is bringing all variables used for building the model to the same scale
Why?
• It balances the excessive effect of variables with a higher range (see the example on the next slide)
• Sometimes, it also helps in speeding up the calculations in an algorithm.
• It is important for techniques which use distance metrics.

How?
• Scale: change the range of values without changing the shape of the distribution.
The range is often set to 0 to 1.
• Standardize: change the values so that the mean is 0 and the standard deviation is 1.
The shape of the distribution does not change, so the output is normal only if the input already was.
• Normalize: the term is used loosely for either of the above.
Why scaling- in our example

• Let's say you have two input variables, X1 and X2, where X1 has range 0.1 to
0.8 and X2 has range 3000 to 50000. Your SVM classifier will be a linear
boundary lying in the X1-X2 plane. The slope of that linear decision
boundary should not depend on the ranges of X1 and X2, but on the
distribution of the points.
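
A small sketch of why the ranges matter for anything distance-based. The two observations are made up, and only the ranges of X1 and X2 come from the example above. Before scaling, the distance is almost entirely the X2 difference. After min-max scaling, X1 gets its say:

```python
import math

# Two hypothetical observations as (X1, X2): X1 ranges over 0.1-0.8, X2 over 3000-50000.
a = (0.1, 3000.0)
b = (0.8, 3100.0)

raw_distance = math.dist(a, b)  # dominated by the X2 difference of 100

def min_max(value, low, high):
    """Rescale value from [low, high] to [0, 1]."""
    return (value - low) / (high - low)

a_scaled = (min_max(a[0], 0.1, 0.8), min_max(a[1], 3000.0, 50000.0))
b_scaled = (min_max(b[0], 0.1, 0.8), min_max(b[1], 3000.0, 50000.0))
scaled_distance = math.dist(a_scaled, b_scaled)  # now dominated by the X1 difference

print(round(raw_distance, 3), round(scaled_distance, 3))
```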
Various scaling methods

• Min-Max Scaler
• Robust Scaler
• Standard Scaler
• Normalizer
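
Two of these methods can be written out by hand to show what they compute. This is a sketch of the arithmetic behind min-max scaling and standardization with made-up values, not the scikit-learn classes themselves. The Robust Scaler and Normalizer are not shown:

```python
from statistics import mean, pstdev

values = [3000.0, 10000.0, 20000.0, 50000.0]  # hypothetical feature with a wide range

# Min-max scaling: maps the smallest value to 0 and the largest to 1.
low, high = min(values), max(values)
min_max_scaled = [(v - low) / (high - low) for v in values]

# Standardization: subtract the mean and divide by the standard deviation.
mu, sigma = mean(values), pstdev(values)
standardized = [(v - mu) / sigma for v in values]

print(min_max_scaled)
print(standardized)
```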
When to scale data?

• If you build models using scaled data, it may require scaling back to original
variables to interpret variables’ effect on outcome predicted.
Put all the pre-processing techniques together

• Handle missing values via:


⮚Imputation:
1. Single Imputation
2. Multiple Imputation
• Handle class imbalance:
1. Other techniques – Oversampling, undersampling
2. SMOTE
• Additional: Standardization/scaling/Normalization of data
References

• Why scaling is important and techniques:


[Link]
Class Notebook

[Link]
official/Machine_Learning_Bootcamp/blob/master/Data_Prepa
ration_101/Data_Preparation_101.ipynb
Learning Objectives

• What is Machine Learning?
• Machine Learning Categorization
• Classification and Regression
What is Machine Learning?
Machine Learning Categorization
Supervised Learning Algorithms
Let's talk about datasets that have both input variables and target variables
(labels for the data). Examples range from predicting the survival of a person in
the Titanic dataset, where survival is already given, to predicting house
prices from house characteristics, where the house prices are provided.

The algorithms that work on such datasets are known as Supervised Learning
Algorithms.

It is called supervised learning because the process of an algorithm learning
from the training dataset can be thought of as a teacher supervising the
learning process. We know the correct answers; the algorithm iteratively makes
predictions on the training data and is corrected by the teacher. Learning stops
when the algorithm achieves an acceptable level of performance.
Unsupervised Learning Algorithms
Unsupervised learning is where you have unlabeled data (or no target
variable) in the dataset.

The goal of Unsupervised Learning Algorithms is to find some
structure in the dataset.

These are called unsupervised learning because, unlike supervised
learning, there are no correct answers and there is no teacher.
Algorithms are left on their own to discover and present the interesting
structure in the data.
Reinforcement Learning Algorithms
A robot takes a big step forward, then falls. The next time, it takes a
smaller step and is able to hold its balance. The robot tries variations
like this many times; eventually, it learns the right size of steps to take
and walks steadily. It has succeeded.

What we see here is called reinforcement learning. The robot learns
how to walk based on reward (staying on balance) and punishment
(falling). This feedback is considered "reinforcement" for doing or not
doing an action.

Reinforcement learning, in a simplistic definition, is learning the best
actions based on reward or punishment.
Types of Supervised Learning Algorithms

Supervised Learning can be further divided into 2 types:
1. Classification
2. Regression
Classification vs Regression

In order to decide whether to use a regression or classification model, the
first question you should ask yourself is:

Does your target variable have a continuous value or is it discrete (binary
or multi-class)?
Regression

If your answer is continuous values, you're dealing with
Regression.

This means that if you're trying to predict quantities like
height, income, price, or scores, you should be using a model
that will output a continuous number.

So if your objective is to determine tomorrow's temperature,
you should use a regression model.
Classification
Now the second case: if you can clearly see that the target variable
is divided into classes, you'll be using Classification.

● When the number of classes is 2, it is known as Binary
Classification. E.g. 'Will it be hot or cold tomorrow?' is a binary
classification problem with 2 classes: Hot and Cold.

● When it is more than 2, it is known as Multi Class Classification.
E.g. classifying movies as Good, Average or Bad according to
reviews.
What is Classification?
Let’s learn with some examples:

● In Classification we assign the outcome to a class
● Examples:
○ Predict whether a transaction is fraud or not fraud
○ Predict whether to give loan or not
○ Predict whether to give college admission or not
○ Predict the grade (Grade A, B, C, D)
○ Note: Classification can involve more than two classes
What is Multi-Classification?
It is as simple as dividing waste into 4 categories - plastic, glass,
metal, paper
Classification vs Regression

In the above image, you can see that the classification line is dividing the
data into 2 parts or 2 classes - red and blue. On the other hand, the
regression line is going along the direction of data and not segregating it.
It’s important to understand the characteristics of your target variable
before you begin running models and forming predictions.
Supervised ML Algorithms -
Regression
Linear Regression: Introduction
Learning Objectives

• Dependent and Independent Variables
• Equation of a Straight Line
• Linear Regression
Dependent and Independent Variables
● So far you've been studying input and output/target variables.
Commonly, the input variable is known as the independent variable and
the target variable is known as the dependent variable.

● In a nutshell, our target variable is nothing but a dependent variable.
Why dependent? Because the values of this variable are dependent on
other variables (i.e. input variables).

● And, our input variables are known as independent variables. Here the
values of these variables are not dependent on any other variables.

Let's look at some examples to learn more about them!


Dependent and Independent Variables
● Look at the Standard Metropolitan Areas Data below. In the dataset we
might be curious to predict “crime_rate” in future, so that becomes
our target variable (dependent variable) and rest of the variables
become input variables (independent variables) for building a
machine learning model.

(Figure labels: Independent variables | Dependent variable)
Another example

● A scientist wants to see if the brightness of light has any effect
on a moth being attracted to the light.

● The brightness of the light is controlled by the scientist. This
would be the independent variable.

● How the moth reacts to the different light levels (distance to
light source) would be the dependent variable.
Equation of a Straight Line
● In algebra, a linear equation (equation of a straight line) typically takes the form y = mx
+ b, where m and b are constants, x is the independent variable, and y is the dependent
variable.

● Basically, the value of y is being calculated using x whereas x has no dependence on
the value of y.

➔ y = how far up
➔ x = how far along
➔ m = Slope or Gradient (how steep the line is)
➔ b = value of y when x=0

● How do you find "m" and "b"?
○ b is easy: just see where the line crosses the Y axis.
○ m (the Slope) needs some calculation:
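
The calculation is: pick two points on the line, then divide the change in y by the change in x. A minimal sketch, where the helper line_through and the two points are chosen for illustration:

```python
def line_through(p1, p2):
    """Return slope m and intercept b of the straight line through two points."""
    (x1, y1), (x2, y2) = p1, p2
    m = (y2 - y1) / (x2 - x1)   # change in y divided by change in x
    b = y1 - m * x1             # value of y when x = 0
    return m, b

m, b = line_through((6, 6), (10, 8))  # hypothetical points on a line
print(m, b)
```

With these two points the line is y = 0.5x + 3.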
Synonyms Recap

Too many synonyms to memorise? Let me put them all down at one
place for better understanding:

Variables = Features

Input Variables = Attributes = Predictor = Independent Variables

Target Variables = Labels = Outcomes = Dependent Variables


What is linear regression? - an example
Suppose you are thinking of selling your home. Various houses around
you, with different sizes (area in sq. ft), have sold for different
prices as listed below:

Your home is 3000 square feet. How much should you
sell it for?
Well! You have to look at the existing price patterns (data) and predict a
price for your home. This is called linear regression.
What is linear regression? - an example
Here's an easy way to do it. Plotting the 3 data points we have so far:

Each point represents one home.


What is linear regression? - an example
Now you can eyeball it and roughly draw a line that gets pretty close to
all of these points. Then look at the price shown by the line, where the
square footage is 3000:

Boom! Your home should sell for $260,000.


What is linear regression? - an example
That's all! You plot your data, make a rough line, and use the line to
make predictions. You need to make sure your line fits the data well:

But of course we don't want to roughly make a line, we want to
compute the exact line that best "fits" our data. That's where
machine learning comes into play!
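
Computing the best-fitting line can be sketched with the ordinary least-squares formulas. The three homes below are hypothetical (the slide's own table is not reproduced here) and were chosen to lie exactly on a straight line so the result is easy to check by hand:

```python
def fit_line(xs, ys):
    """Least-squares slope m and intercept b for the line y = m*x + b."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    m = sxy / sxx
    return m, y_bar - m * x_bar

# Hypothetical homes: size in square feet and sale price in dollars.
sizes = [1000, 2000, 4000]
prices = [200_000, 230_000, 290_000]

m, b = fit_line(sizes, prices)
predicted_price = m * 3000 + b  # price suggested by the line for a 3000 sq ft home
print(round(m, 2), round(b, 2), round(predicted_price, 2))
```

Real data will not sit exactly on a line, in which case the same formulas give the line with the smallest total squared error.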
What is linear regression?
● Linear regression is a linear model, i.e. a model that assumes a linear
relationship (straight-line relationship) between the input variables (x) and the
single output variable (y).

● When there is a single input variable (x), the method is referred to as simple
linear regression or just linear regression. E.g.: the Salary dataset given here. There
is only one target variable and one input variable, where we are predicting the
salary of an individual using their years of experience.

● When there are multiple input variables, it is often referred to as multiple
linear regression. E.g.: in the Standard Metropolitan Areas data set, we have multiple
input variables.
References

● [Link]
Learning Objectives

• Simple vs. Multiple Linear Regression
• Linear Regression with Single Variable
• Notebook for practice
Simple vs. Multiple Linear Regression
Linear Regression with Single Variable
Notebook for practice

[Link]
official/Data_Science_Bootcamp/blob/master/Week3/Linear
_Regression/Introduction_to_Linear_Regression.ipynb
Linear Regression with Multiple Variable
Notebook for practice

[Link]
official/Data_Science_Bootcamp/blob/master/Week3/Linear
_Regression/Multiple_Linear_Regression.ipynb
Unit 3

Regression
Supervised ML Algorithms -
Regression
Evaluating a Regression Model
Learning Objectives

• Cost
• Cost Function
• Gradient Descent
Which line is good?
Now coming back to our first example. How do you decide which line is
good? Here's a bad line:

The line drawn above is way off. For example, according to the line, a
1000 sq foot house should sell for $310,000, whereas we know it
actually sold for $200,000.
Which line is good?
Here's a better line:

This line is off by $8,333 on average (adding all the distances and
dividing by 3).

This $8,333 is called the cost of using this line.
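
"Adding all the distances and dividing by 3" can be written as a small function. The homes and the two candidate lines below are hypothetical, so the costs differ from the $8,333 on the slide. The point is only that a better line has a lower cost:

```python
def cost(m, b, sizes, prices):
    """Average absolute distance between the line y = m*x + b and the actual prices."""
    return sum(abs((m * x + b) - y) for x, y in zip(sizes, prices)) / len(sizes)

# Hypothetical homes: size in square feet and sale price in dollars.
sizes = [1000, 2000, 4000]
prices = [200_000, 230_000, 290_000]

bad_line = cost(m=10, b=300_000, sizes=sizes, prices=prices)
better_line = cost(m=35, b=160_000, sizes=sizes, prices=prices)
print(round(bad_line, 2), round(better_line, 2))
```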


Short-term Objective
What were we doing in the previous 2 examples? We plotted 2 straight lines using the
equation: y = mx + b.

If we already have the data points (x1, y1), ..., (xn, yn), the values of x and y
remain the same for every line we plot.

So what remains? What exactly are we changing to plot different lines?

Yes, m and b.

Our objective is to find the values of m and b that best fit this data.

These 2 values are called the parameters of the model: they are learned from the data.
They are not hyperparameters. In machine learning, a hyperparameter is a setting whose
value is used to control the learning process (for example, the step size in gradient
descent) and it is chosen before training. While building a machine learning model we
always try to find the optimal parameters.
Cost
The cost is how far off the line is from the real data. The best line is the one that is the least off from
the real data.

To find out what line is the best line (to find the values of m and b), we need to use a cost function.

In ML, cost functions are used to estimate how badly models are performing.

Put simply, a cost function is a measure of how wrong the model is in terms of its ability to estimate
the relationship between X and y.
Cost Function
What?

Now that we have built a model, we need to measure its performance and understand
whether it works well or not. A cost function measures the performance of a Machine
Learning model for given data. It quantifies the error between predicted values and
expected values and presents it in the form of a single real number.

Depending on the problem, the cost function can be formed in many different ways. The
purpose of this function is to be either:

● Minimized - then the returned value is usually called cost, loss or error. The goal is to
find the values of the model parameters for which the cost function returns as small a
number as possible.
● Maximized - then the value it yields is named a reward. The goal is to find values of
model parameters for which the returned number is as large as possible.
What is predicted and expected value?
● Predicted value: As the name says, it is the value predicted by your machine learning model.

● Expected value: It is the true value (or the label present in your data).

Often machine learning models are not 100% accurate or perfect; they tend to deviate from the
true value or expected value.

Explaining with an example: suppose we are predicting the age of a person based on a few input
variables or features.

● Our machine learning model predicted the age as 28 years
● However, the actual age of the person is 29 years.
● Here 28 years is the predicted value and 29 years is the expected value or true value. As data
scientists, we try to minimize the error while building models.
Cost Function

The difference between the true value and the model’s predicted value is
called residual.
Cost Function Types/ Evaluation Metrics

There are three primary metrics used to evaluate linear models (to find
how well a model is performing):

1. Mean Squared Error
2. Root Mean Squared Error
3. Mean Absolute Error
Mean Squared Error (MSE)
● MSE is simply the average of the squared differences between the true
target values and the values predicted by the regression model.

● As it squares the differences, it penalizes (gives some penalty or weight
for deviating from the objective) large errors far more than small ones,
so a few big misses can make the model look worse than it usually is.
Root Mean Squared Error (RMSE)
● It is just the square root of the mean squared error, so it is expressed
in the same units as the target variable.

● It is preferred in some cases because the errors are
squared before averaging, which puts a high penalty on large
errors. This implies that RMSE is useful when large errors are
undesired.
Mean Absolute Error (MAE)
● MAE is the average of the absolute differences between the target values
and the values predicted by the model.

● MAE does not penalize large errors as heavily as MSE, making it less
suitable for use-cases where you want to pay more attention to the
outliers.
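
The three metrics can be written out directly from their definitions. The ages below are made up for illustration, and the last prediction is deliberately far off to show how differently the metrics react to one large error:

```python
import math

def mse(y_true, y_pred):
    """Mean of the squared differences."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    """Square root of the MSE."""
    return math.sqrt(mse(y_true, y_pred))

def mae(y_true, y_pred):
    """Mean of the absolute differences."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

# Hypothetical ages: three predictions are off by 1 year, the last one by 10.
y_true = [29, 35, 41, 50]
y_pred = [28, 36, 40, 40]

print(mse(y_true, y_pred), round(rmse(y_true, y_pred), 3), mae(y_true, y_pred))
```

The single 10-year miss contributes 100 of the 103 squared-error units but only 10 of the 13 absolute-error units, which is the sense in which MSE and RMSE punish large errors harder than MAE.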
R Squared (Coefficient of determination)
● R-squared is a goodness-of-fit measure for linear regression models.
● It represents the proportion of the variation in the target values that
the model explains. Values from 0 to 1 are interpreted as
percentages.
● The higher the value is, the better the model is.

● Going by the name, you might think R2 cannot be negative. However, it
can. A negative R2 means you are doing worse than always predicting the mean value.
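
A minimal sketch of R2 computed as 1 minus (sum of squared residuals) / (total sum of squares around the mean). The target values and the three sets of predictions are made up to show a good fit, the mean-only baseline, and a negative score:

```python
def r_squared(y_true, y_pred):
    """1 - (sum of squared residuals) / (total sum of squares around the mean)."""
    y_bar = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - y_bar) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

y_true = [1.0, 2.0, 3.0, 4.0]
good = r_squared(y_true, [1.0, 2.0, 3.0, 5.0])       # close to the data
mean_only = r_squared(y_true, [2.5, 2.5, 2.5, 2.5])  # always predicts the mean
bad = r_squared(y_true, [4.0, 3.0, 2.0, 1.0])        # worse than predicting the mean

print(good, mean_only, bad)
```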
Which metrics to use when?

This is an important question and we get used to learning these
measures over time. Sharing some resources with you all so that
it helps you understand what metrics to be used in the context of
solving a regression problem.

● [Link]
machine-learning-models-part-1-a99d7d7414e4
(you may ignore “Bonus” section in the article for time being)
Note: Gradient Descent is a slightly advanced topic.
Gradient
Gradient is another word for "slope". The higher the gradient of a graph at a point, the steeper
the line is at that point. A negative gradient means that the line slopes downwards.

Finding the gradient of a straight-line graph

It is often useful or necessary to find out what the gradient of a graph is. For a straight-line
graph, pick two points on the graph. The gradient of the line = (change in y-coordinate)/(change
in x-coordinate).

In this graph, the gradient = (change in y-coordinate)/(change in x-coordinate)
= (8-6)/(10-6) = 2/4 = 1/2

We can, of course, use this to find the equation of the line. Since the line crosses
the y-axis at y = 3, the equation of this graph is y = ½x + 3.
Gradient
Finding the gradient of a curve

To find the gradient of a curve, you must draw an accurate sketch of the curve. At
the point where you need to know the gradient, draw a tangent to the curve. A
tangent is a straight line which touches the curve at one point only. You then find the
gradient of this tangent.

Example
Find the gradient of the curve y = x² at the point (3, 9).

Gradient of tangent = (change in y)/(change in x)
= (9 - 5)/(3 - 2.3)
= 5.71 (an estimate read off the sketch; the exact gradient of y = x² at x = 3 is 2 × 3 = 6)
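
Instead of drawing a tangent by hand, the gradient at a point can be estimated numerically by taking two points very close together on either side of it. A minimal sketch, where the helper gradient_at and the step size h are choices made for this illustration:

```python
def gradient_at(f, x, h=1e-6):
    """Central-difference estimate of the gradient of f at x."""
    return (f(x + h) - f(x - h)) / (2 * h)

# Gradient of y = x² at x = 3; calculus gives 2x = 6 exactly.
slope = gradient_at(lambda x: x ** 2, 3.0)
print(round(slope, 4))
```

The hand-drawn estimate of 5.71 is close to this value; the gap comes from how accurately the tangent can be sketched.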
Gradient Descent
The cost function will tell you how good those values are (i.e. it will tell
you how far off your predictions were from the actual data). But what
do we do based on that information? How do we find the values of m
and b that will draw the best line? By using gradient descent.

In a nutshell, the model uses gradient descent to update the m and b values
in order to reduce the cost function (minimizing the RMSE value) and reach
the best-fit line. The idea is to start with random m and b values and
then iteratively update them until the cost reaches its minimum.

Let's start with a simpler version of gradient descent, and then move on
to the real version.
Gradient Descent
Suppose we decide to leave b at zero. So we experiment with what value m
should be, always keeping b at 0. Now you can try various values for m, and you
will end up with different costs. You can plot all of these costs on a graph:
Gradient Descent
Here are the corresponding lines (remember, b is zero in these lines):

m = 75
m = 160

We can see that the line on the left seems to fit the data better than the
line on the right, so it makes sense that the cost of that line is lower. And
from this graph it looks like m = 75 gives us the lowest cost overall.
Gradient Descent

m = 75 is the lowest point in this graph. So with all the costs graphed out like
this, we just need to find the lowest point on the graph, and that will give us the
optimal value of m!
Gradient descent helps us find the lowest point on this graph. You start with a
value for m, and update it iteratively till you arrive at the best value. So you can
start at m = 0. Then you have to ask, should I go left or right?
Gradient Descent
Well, we want to go down, so let's go right a small step:

This is the new value for m. Again we ask, should we go left or right? At each
step, you need to head downward, till you get to a point where you're as low as
you can go:
Gradient Descent
This is gradient descent: going down bit by bit till you hit the bottom.

How do you figure out which way is down? The answer will be obvious
to calculus experts but not so obvious for the rest of us: you take the
derivative at that point.

But the important bit to know is this: if you take the current value of m and subtract a small multiple of the derivative at that point (the multiple is called the learning rate), you will go down. You just do that a bunch of times (say 1000 times) and you will hit bottom!
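The simplified version (b fixed at 0, only m changes) can be sketched in a few lines. This is an illustration under stated assumptions, not the course's implementation: the data points and the learning rate are made up, and the cost is the mean squared error, which has its minimum at the same m as RMSE. Each step moves m against the slope, i.e. it subtracts the derivative scaled by the learning rate.

```python
# Toy data roughly following y = 3x (values are made up for illustration).
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.1, 5.9, 9.2, 11.8]


def cost(m):
    """Mean squared error of the line y = m * x (b is fixed at 0)."""
    return sum((m * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def cost_derivative(m):
    """Derivative of the cost with respect to m."""
    return sum(2 * x * (m * x - y) for x, y in zip(xs, ys)) / len(xs)


m = 0.0                 # starting guess
learning_rate = 0.01    # size of each step (an assumed value)
for _ in range(1000):
    m -= learning_rate * cost_derivative(m)   # step against the slope

print(round(m, 3), round(cost(m), 4))
```

Starting from m = 0, the loop settles near m = 2.99, the value that minimizes the cost for this toy data.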
Gradient Descent

The video in the next slide explains the process of gradient descent. You only need to watch till 16:07 to gain some understanding; there is no need to go into the Python implementation explained later on.

Also, as the instructor said, there’s no need to dive into the mathematics or worry about not understanding some math right now.
Gradient Descent and Cost Function
Recap

➔ Linear regression is used to predict a value (like the sale price of a house).
➔ Given a set of data, first try to fit a line to it.
➔ The cost function tells you how good your line is.
➔ You can use gradient descent to find the best line.
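The recap can be tied together in one sketch that updates both m and b. It assumes mean squared error as the cost function; the function name `fit_line`, the learning rate, the number of steps and the data are all illustrative choices rather than anything prescribed by the slides.

```python
def fit_line(xs, ys, learning_rate=0.05, steps=5000):
    """Fit y = m * x + b by gradient descent on the mean squared error."""
    m, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        errors = [m * x + b - y for x, y in zip(xs, ys)]
        grad_m = 2 * sum(e * x for e, x in zip(errors, xs)) / n
        grad_b = 2 * sum(errors) / n
        m -= learning_rate * grad_m   # move m downhill
        b -= learning_rate * grad_b   # move b downhill
    return m, b


# Points lying exactly on y = 0.5x + 3 (made-up data).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [3.0, 3.5, 4.0, 4.5, 5.0]
m, b = fit_line(xs, ys)
print(round(m, 3), round(b, 3))
```

Because the points lie exactly on a line, the fitted values approach m = 0.5 and b = 3.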
Bias and Variance
Bias and Variance in real world
In dictionary terms:

Bias: Prejudice in favor of or against one thing, person, or group compared with another, usually in a way considered to be unfair.

Variance: The state or fact of disagreeing or quarreling.

In short, bias represents how unfairly something leans toward one outcome, and variance represents how much something changes in response to what it is given.

Confusing? Worry not. The next example will clarify all your doubts.
Example
Let’s assume you have called two weather examiners, Mr. Bishop and Mr. Varian, to predict whether it will rain or not.

Mr. Bishop loves rain a lot. And Mr. Varian is a bookworm.

Let us talk about the conditions for rain.

- It rains only if it’s a little humid.

- It does not rain if it's windy, hot or freezing.


Mr. Bishop representing Bias
You ask Mr. Bishop (despite his training, he is too biased towards rain):

Me: Sir, it's extremely hot out here, will it rain?
Mr. Bishop: Yup.

Me: Sir, it's a little windy, will it rain?
Mr. Bishop: Maybe not.

Me: Sir, it's freezing, will it rain?
Mr. Bishop: Yes, of course.

Me: Sir, it's a little humid, will it rain?
Mr. Bishop: Damn sure.

Did you notice? Mr. Bishop is highly biased towards rain. During the test, he is unable to predict most of the conditions correctly.

This condition is called underfitting.


Mr. Varian representing Variance
Now let us see your conversation with Mr. Varian (a bookworm who completely remembers the training he had):

Me: Sir, it's extremely hot out here, will it rain?
Mr. Varian: Nope.

Me: Sir, it's a little windy, will it rain?
Mr. Varian: No way.

Me: Sir, it's freezing, will it rain?
Mr. Varian: No way.

Me: Sir, it's a little humid, will it rain?
Mr. Varian: Yes it will.
Mr. Varian representing Variance
Mr. Varian successfully predicted whether it will rain or not. But being a bookworm, Mr. Varian is unaware of conditions that were not described in the book during training.

Now, we ask Mr. Varian:

Me: Sir, there is a giant sitting on the cloud who lost his candy. Will it rain?
Mr. Varian: Not sure. Since the answer is “No” to most of the conditions, there is a high possibility that it will not rain.

Although Mr. Varian's decisions follow the training conditions perfectly, he is not able to predict for a new and unseen condition (any general condition apart from the specific conditions given during training).

This condition is called overfitting, and it offers poor generalizability.
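The two examiners can be written as two tiny "models". This is a simplified sketch: here Mr. Bishop always predicts rain (in the dialogue he hesitates once), and Mr. Varian is a lookup table that memorizes the training conditions and falls back to the most common answer for anything unseen. The function names are hypothetical.

```python
# Training "book": conditions and whether it rained (from the example).
training = {
    "extremely hot": False,
    "a little windy": False,
    "freezing": False,
    "a little humid": True,
}


def bishop(condition):
    """High-bias forecaster: ignores the condition and always predicts rain."""
    return True


def varian(condition):
    """High-variance forecaster: memorizes the training book exactly and
    falls back to the most common training answer for unseen conditions."""
    if condition in training:
        return training[condition]
    answers = list(training.values())
    return answers.count(True) > answers.count(False)


def accuracy(model, data):
    return sum(model(c) == rained for c, rained in data.items()) / len(data)


bishop_train_accuracy = accuracy(bishop, training)
varian_train_accuracy = accuracy(varian, training)
print(bishop_train_accuracy, varian_train_accuracy)
print(varian("giant on a cloud lost his candy"))
```

Mr. Bishop scores poorly even on the training conditions (underfitting), while Mr. Varian is perfect on them but can only guess on a condition the book never covered (overfitting).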


High Bias or High Variance?
Then what is better, high bias (a simple model that misses patterns even in the training data) or high variance (high accuracy on training data but unreliable on new data)?

Well, the answer is neither: we want the best of both worlds. We need neither high bias nor high variance. We want our algorithm to perform well on the training set and also give good results on unseen data (the test set).

In general, high bias reduces the performance of the algorithm on the training set, while high variance reduces performance on unseen data.

Balancing the two is known as the Bias-Variance Tradeoff.


Reading Material

MUST READ
Understanding the Bias Variance Tradeoff:
[Link]
tradeoff-165e6942b229

Reference
[Link]
bias-variance-tradeoff-ec540fb13e12
Supervised ML Algorithms -
Regression
Decision Tree/ Regression Tree
Decision Trees
A decision tree is one of the most popular and powerful tools for classification and prediction.

A decision tree is a flowchart-like tree structure, where:

● each internal node denotes a test on an attribute/feature,
● each branch represents an outcome of the test, and
● each leaf node (terminal node) holds a class label (Yes and No in this case).
Decision Trees for Regression -Regression Trees
Decision Trees

Regression tree analysis is used when the predicted outcome can be considered a real number (e.g. the price of a house, or a patient's length of stay in a hospital).
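To make the idea concrete, here is a minimal sketch of the smallest possible regression tree: a single split (a "stump") on one feature, with the mean of the target as the prediction in each leaf. The names `fit_stump` and `predict`, and the house data, are made up for illustration; real libraries grow deeper trees and handle many features.

```python
def fit_stump(xs, ys):
    """Find the single split on x that minimizes the squared error,
    predicting the mean of y on each side (a depth-1 regression tree).
    Assumes all x values are distinct."""
    best = None
    points = sorted(zip(xs, ys))
    for i in range(1, len(points)):
        threshold = (points[i - 1][0] + points[i][0]) / 2
        left = [y for x, y in points if x <= threshold]
        right = [y for x, y in points if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((y - left_mean) ** 2 for y in left)
                 + sum((y - right_mean) ** 2 for y in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return threshold, left_mean, right_mean


def predict(stump, x):
    threshold, left_mean, right_mean = stump
    return left_mean if x <= threshold else right_mean


# Made-up data: house size (square metres) -> price (thousands).
sizes = [50, 60, 70, 120, 130, 140]
prices = [100, 110, 120, 300, 310, 320]
stump = fit_stump(sizes, prices)
print(stump, predict(stump, 65), predict(stump, 125))
```

The split lands between the small and the large houses, and each leaf predicts the average price of the houses that fall into it.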
Resources on Decision Tree Regressor

● Implementation of Decision Tree Regression:


[Link]
a-decision-tree-and-scikit-learn-ac98552b43d7
CART
You might encounter the term ‘CART’ while building ML models. It’s actually nothing new but the same old decision tree, which can be used for both classification and regression.

CART = Classification and Regression Trees, an umbrella term for:

Classification Trees: where the target variable is categorical and the tree is used to identify the "class" into which the target variable would most likely fall.

Regression Trees: where the target variable is continuous and the tree is used to predict its value.
Supervised ML Algorithms -
Regression
Support Vector Regressor
Support Vector Regressor

Support Vector Machines (SVMs) are well known in classification problems. However, the use of SVMs in regression is not as well documented. These types of models are known as Support Vector Regression (SVR).

SVR gives us the flexibility to define how much error is acceptable in our model and will find an appropriate line to fit the data.
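The "acceptable error" is usually expressed through a margin called epsilon: errors smaller than epsilon are ignored, and only the excess beyond it is penalized. The sketch below shows this epsilon-insensitive loss for a single prediction; the function name and the value of epsilon are illustrative assumptions.

```python
def epsilon_insensitive_loss(y_true, y_pred, epsilon=0.5):
    """Errors smaller than epsilon cost nothing; larger errors are
    penalized linearly by the amount they exceed epsilon."""
    return max(0.0, abs(y_true - y_pred) - epsilon)


small_error = epsilon_insensitive_loss(10.0, 10.3)   # inside the margin
large_error = epsilon_insensitive_loss(10.0, 12.0)   # outside the margin
print(small_error, large_error)
```

A larger epsilon tolerates more error and gives a simpler fit; a smaller epsilon forces the line to follow the data more closely.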
Support Vector Regressor
Kernels
● SVM algorithms use a set of mathematical functions known as kernels.

● The function of a kernel is to take data as input and transform it into the required form.

● Different SVM algorithms use different types of kernel functions.

● Examples include linear kernels and nonlinear kernels such as polynomial, radial basis function (RBF), and sigmoid.

● The most used type of kernel function is RBF. It is also the default kernel in scikit-learn's SVR.
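As an illustration of what a kernel computes, the sketch below implements the RBF kernel for two points. It returns a similarity between 0 and 1 that shrinks as the points move apart. The function name and the value of `gamma` are assumptions chosen for the example.

```python
import math


def rbf_kernel(a, b, gamma=0.5):
    """Radial basis function kernel: exp(-gamma * squared distance)."""
    squared_distance = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return math.exp(-gamma * squared_distance)


same_point = rbf_kernel([1.0, 2.0], [1.0, 2.0])
near_point = rbf_kernel([1.0, 2.0], [1.5, 2.0])
far_point = rbf_kernel([1.0, 2.0], [4.0, 6.0])
print(same_point, round(near_point, 3), far_point)
```

Identical points get a similarity of 1, nearby points a value close to 1, and distant points a value close to 0.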
Support Vector Regression Implementation

[Link]
vector-regression-tutorial-for-machine-learning/

What is SVM and SVR


Supervised ML Algorithms -
Regression
Random Forest/ Regression Forest
Random Forest for regression - Regression Forest
Random forest builds multiple decision trees and merges them together to get a more accurate and stable prediction.

Regression Forests (or Random Forest Regressors) are an ensemble (combination) of different regression trees (decision trees for regression). Each leaf contains a distribution for the continuous output variable(s).
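A heavily simplified sketch of the idea: train several one-split regression trees, each on a bootstrap sample (drawn with replacement) of the data, and average their predictions. All names and data here are hypothetical, and real random forests also grow deeper trees and sample a random subset of features at each split, which this sketch leaves out.

```python
import random


def fit_stump(points):
    """Depth-1 regression tree: best single split, mean prediction per side."""
    points = sorted(points)
    best = None
    for i in range(1, len(points)):
        if points[i - 1][0] == points[i][0]:
            continue  # no valid threshold between identical x values
        threshold = (points[i - 1][0] + points[i][0]) / 2
        left = [y for x, y in points if x <= threshold]
        right = [y for x, y in points if x > threshold]
        left_mean, right_mean = sum(left) / len(left), sum(right) / len(right)
        error = (sum((y - left_mean) ** 2 for y in left)
                 + sum((y - right_mean) ** 2 for y in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    if best is None:  # every sampled x was identical: predict the mean
        mean = sum(y for _, y in points) / len(points)
        return (float("inf"), mean, mean)
    return best[1:]


def predict_stump(stump, x):
    threshold, left_mean, right_mean = stump
    return left_mean if x <= threshold else right_mean


def fit_forest(points, n_trees=25, seed=0):
    """Train each tree on a bootstrap sample (drawn with replacement)."""
    rng = random.Random(seed)
    return [fit_stump([rng.choice(points) for _ in points])
            for _ in range(n_trees)]


def predict_forest(forest, x):
    """Average the predictions of all trees."""
    return sum(predict_stump(tree, x) for tree in forest) / len(forest)


# Made-up data: (house size, price).
data = [(50, 100), (60, 110), (70, 120), (120, 300), (130, 310), (140, 320)]
forest = fit_forest(data)
low, high = predict_forest(forest, 55), predict_forest(forest, 135)
print(round(low, 1), round(high, 1))
```

Each tree sees slightly different data, so individual trees disagree a little; averaging them smooths out those differences.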
Random Forest for regression - Regression Forest
Ensemble Learning

Ensemble learning makes predictions using multiple machine learning models, or the same machine learning model multiple times.
Difference between Bagging & Boosting

S.No | Bagging | Boosting
1 | The simplest way of combining predictions that belong to the same type. | A way of combining predictions that belong to the different types.
2 | Aims to decrease variance, not bias. | Aims to decrease bias, not variance.
3 | Each model receives equal weight. | Models are weighted according to their performance.
4 | Each model is built independently. | New models are influenced by the performance of previously built models.
5 | Different training data subsets are randomly drawn with replacement from the entire training dataset. | Every new subset contains the elements that were misclassified by previous models.
6 | Bagging tries to solve the over-fitting problem. | Boosting tries to reduce bias.
7 | If the classifier is unstable (high variance), then apply bagging. | If the classifier is stable and simple (high bias), then apply boosting.
8 | Example: The Random Forest model uses bagging. | Example: AdaBoost uses boosting techniques.
Ensemble Learning

Types: Boosting and Bagging


Drawbacks of Decision Tree algorithm

● Computationally Expensive

● Predictions may change considerably if the underlying data changes
Random Forest
● Supervised learning algorithm which performs both classification and
regression
● Bagging-based: the trees are built independently, so they can be trained in parallel
Random Forest

● Advantages
○ Effective method for estimating missing data and maintains
accuracy when a large proportion of data are missing
○ Runs efficiently on large datasets

● Disadvantages
○ May observe random forest overfitting for some datasets with
noisy classification/regression tasks
Regression Forest - Implementation

Random forest regression Scikit learn tutorial


Set the number of estimators (= number of trees) to 10, 50 or 100.
Supervised ML Algorithms -
Classification
Logistic Regression
Logistic Regression
• Logistic Regression is one of the basic and popular algorithms for solving binary classification problems
• For each input, logistic regression outputs a probability that this input belongs to one of the 2 classes
⮚ Set a probability threshold, and that boundary determines which class the input belongs to
• Binary classification problems (2 classes):
⮚ Emails (Spam / Not Spam)
⮚ Credit Card Transactions (Fraudulent / Not Fraudulent)
⮚ Loan Default (Yes / No)
Logistic Regression

Now, you may ask why don’t we use Linear Regression? Why do we
need a new algorithm?

Well, you would find all the answers in the video in the next slides.

The video in the next slide is a must watch: the instructor has brilliantly explained logistic regression!
Must Watch Understanding Logistic Regression
Logistic Regression
Classification
Binary - yes or no (Spam or not spam)
Multiclass - Which party will a person vote for: A, B or C?
Linear Regression vs Logistic
● Linear regression is used to solve regression problems with
continuous values
● Logistic regression is used to solve classification problems with
discrete categories
○ Binary classification (Classes 0 and 1)
○ Examples:
• Emails (Spam / Not Spam)
• Credit Card Transactions (Fraudulent / Not Fraudulent)
• Loan Default (Yes / No)
Linear Regression vs Logistic
● Let’s say a data scientist named John wants to predict whether a customer will buy insurance or not
● Remember that linear regression is used to predict a continuous value, where the output (y) may vary between -∞ (negative infinity) and +∞ (positive infinity), whereas in this case the target variable (y) takes only two discrete values: 0 (No, didn’t buy the insurance) and 1 (Yes, bought the insurance).
● John decides to extend the concepts of linear regression to fulfil his requirement. One approach is to take the output of linear regression and map it between 0 and 1. If the resulting output is below a certain threshold (say 0.5), classify it as No (didn’t buy the insurance), whereas if it is above the threshold, classify it as Yes (bought the insurance).
Linear Regression vs Logistic
● We then plot a simple linear regression line and set the threshold as 0.5
○ Negative class (Insurance = No) – Age on the left side
○ Positive class (Insurance = Yes) – Age on the right side
Imagine there is an additional outlier far to the right, which distorts the regression line.

● As we can see, a single outlier in the data distorts the whole linear regression line.
● Clearly, the fitted line is no longer able to separate the classes.
● The boundary should have been at the vertical yellow line, which divides the positive and negative classes, i.e. yes or no for insurance
Happy John! (Data Scientist)
● Well, life would be much simpler if we had an algorithm that would fit the points like below, right? It is a much better fit compared to the regression line!
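The S-shaped curve that fits such points is the sigmoid (logistic) function, which squashes any number into the range 0 to 1 so that it can be read as a probability. The sketch below is illustrative only: the weight and bias are made-up values rather than fitted ones, and `predict_buys_insurance` is a hypothetical helper.

```python
import math


def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_buys_insurance(age, weight=0.2, bias=-8.0, threshold=0.5):
    """Hypothetical model: weight and bias are made-up values, not fitted."""
    probability = sigmoid(weight * age + bias)
    return probability, probability >= threshold


for age in (25, 40, 60):
    probability, buys = predict_buys_insurance(age)
    print(age, round(probability, 3), buys)
```

Unlike a straight line, the sigmoid flattens out at both ends, so a single extreme age barely moves the point where the prediction switches from No to Yes.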
Unit 4

Classification and Performance Metrics
Supervised ML Algorithms -
Classification
Evaluating the performance of a
Classification Model
Learning Objectives

● Evaluating the Performance of Logistic Regression model
● Why not Accuracy?
● Which metrics to use when?
Evaluating the Performance of Logistic Regression model

● Model evaluation is a very important part of any analysis. It answers the following questions:

How well does the model fit the data? Which predictors are most important? Are the predictions accurate?

● Guess what, evaluating a classification model is not as simple as evaluating Linear Regression.

● But why?

● You must be wondering, ‘Can’t we just use accuracy of the model as the holy grail metric?’
Accuracy

Classification accuracy is what we usually mean when we use the term accuracy. It is the ratio of the number of correct predictions to the total number of input samples.

Actual Labels: Task at hand - separate yellow and pink balls
Predicted Labels: Colors predicted by our model

Correctly predicted = 4
Total Predictions = 6
Accuracy = 4/6 = 66.7%
Why not Accuracy?
● Accuracy is very important, but it might not be the best metric all the time. Let’s
look at why with an example -:

● Let’s say we are building a model which predicts if a transaction is fraudulent or not

● Let’s imagine we build a basic model which always predicts that a transaction is not fraudulent. Guess what the accuracy of this model would be?
~99%! (You may ask why. Well, fewer than 1% of transactions are usually fraudulent, so there is a huge class imbalance. Even a useless model that always predicts a transaction to be not fraudulent will have an accuracy of about 99%, owing to the class imbalance.)

● Impressive, right? Well, the probability of a bank buying this model is absolute
zero.

● In a problem where there is a large class imbalance, a model can predict the
value of the majority class for all predictions and achieve a high
classification accuracy.

● While our model has a stunning accuracy, this is an apt example where
accuracy is definitely not the right metric.
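This is easy to verify with a few lines of code. The sketch below uses made-up data (10 fraudulent transactions out of 1,000) and a "model" that always answers "not fraudulent", then compares its accuracy with the share of frauds it actually catches.

```python
# 1,000 made-up transactions: 10 fraudulent (1) and 990 genuine (0).
actual = [1] * 10 + [0] * 990

# A "model" that always predicts "not fraudulent".
predicted = [0] * len(actual)

correct = sum(a == p for a, p in zip(actual, predicted))
accuracy = correct / len(actual)

frauds_caught = sum(a == 1 and p == 1 for a, p in zip(actual, predicted))
recall = frauds_caught / sum(actual)

print(accuracy, recall)
```

The accuracy is 99%, yet not a single fraudulent transaction is caught, which is exactly why accuracy alone is misleading on imbalanced data.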
Why not Accuracy?
Watch till 1 min 14 secs to understand why accuracy is a bad metric for model performance
Evaluating the Performance of Logistic Regression model

Logistic Regression employs a different set of metrics than Linear Regression. Here, we deal with probabilities and categorical values.

In the following slides, we describe a few of the evaluation metrics used for Logistic Regression:
Is the confusion matrix confusing, or does it resolve the confusion?
You decide!

Confusion Matrix
A confusion matrix is a table that is often used to describe the performance of
a classification model (or "classifier") on a set of test data for which the true
values are known. The confusion matrix itself is relatively simple to
understand, but the related terminology can be confusing.

Let's start with an example confusion matrix for a binary classifier for disease
prediction (though it can easily be extended to the case of more than two
classes):
Confusion Matrix
Let's now define the most basic terms, which are whole numbers (not
rates):

● true positives (TP): These are cases in which we predicted yes (they
have the disease), and they do have the disease.

● true negatives (TN): We predicted no, and they don't have the
disease.

● false positives (FP): We predicted yes, but they don't actually have the
disease. (Also known as a "Type I error.")

● false negatives (FN): We predicted no, but they actually do have the
disease. (Also known as a "Type II error.")

I know these seem hard to memorise. One thing that has helped me remember them is rephrasing each term in plain words:

false positives = falsely classified as being positive.

Confusion Matrix

This is a list of rates that are often computed from a confusion matrix for a binary classifier:

● Precision: Correctly predicted as positives compared to total predicted as positives
Precision = TP/(TP + FP) = 100/(100+10) = 0.91

● Sensitivity/Recall: Correctly predicted as positives compared to total number of positives
Recall = TP/(TP + FN) = 100/(100+5) = 0.95

Note: There is usually a trade-off between the two: improving Precision tends to lower Recall and vice versa, so we often have to prioritise one over the other.

● Specificity: Correctly predicted as negatives compared to total number of negatives
Specificity = TN/(TN + FP) = 50/(50+10) = 0.83
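The same rates can be computed directly from the four counts. The values below (TP = 100, FP = 10, FN = 5, TN = 50) are the ones implied by the worked numbers on these slides; the variable names are just for illustration.

```python
# Counts implied by the worked numbers on these slides.
TP, FP, FN, TN = 100, 10, 5, 50

precision = TP / (TP + FP)
recall = TP / (TP + FN)            # also called sensitivity
specificity = TN / (TN + FP)
accuracy = (TP + TN) / (TP + FP + FN + TN)

print(round(precision, 2), round(recall, 2),
      round(specificity, 2), round(accuracy, 2))
```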
Understanding Precision and Recall
Think about the search box on the Amazon home page.

The precision is the proportion of relevant results (correctly predicted yes) in the list of all returned search results (total predicted yes).

The recall is the ratio of the relevant results (correctly predicted yes) returned by the search engine to the total number of relevant results that could have been returned (total actual yes).
Choosing between Sensitivity and Specificity
Often, the sensitivity and specificity of a test are inversely related. Selecting
the optimal balance of sensitivity and specificity depends on the objective of
the problem that needs to be solved.

If correctly identifying the positive class is important for us, then we should choose a model with higher Sensitivity. However, if correctly identifying the negative class is more important, then we should choose Specificity as the measurement metric.
Sensitivity or Specificity - an example
Let’s say we are predicting if a patient has cancer or not. The default probability threshold is kept at 0.5, i.e.

Class 0 (No cancer) – Below 0.5
Class 1 (Cancer) – Above 0.5
Case 1: Higher Specificity
Suppose we want to predict Class 1 (i.e. the patient has cancer) only if we are VERY confident (to avoid giving the patient a shock and to avoid unnecessary treatment).

We can instead change this threshold to 0.7. Thus, we’ll tell someone they have cancer only if we think they have a greater than or equal to 70% chance of having cancer.

Look at the graph below. Since the threshold has shifted to the right, fewer healthy people are wrongly flagged as having cancer, so the number of people correctly identified as not having cancer has increased. Thus, the specificity has increased. (We are being very specific when declaring that a patient has cancer.)
Case 2: Higher Sensitivity
Suppose we want to avoid missing too many cases of cancer (avoid false negatives). If a person with cancer is told that he’s well, it can cause a delay in treatment and affect his health badly.

In this case we can set a lower threshold, say 0.25. Even if a patient has a 25% chance of having cancer, we’ll inform him/her.

Looking at the graph, you can see that the threshold has shifted to the left. Most of the people with cancer will be detected in advance in this case. We have completely (or almost completely) eliminated False Negatives. This results in higher Sensitivity/Recall. (We are being sensitive in detecting the disease, i.e. it is a really sensitive test.)
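The effect of moving the threshold can be seen on a small example. The probabilities and labels below are made up for illustration; the function computes sensitivity and specificity at thresholds of 0.25, 0.5 and 0.7.

```python
# Made-up predicted cancer probabilities with the true labels (1 = cancer).
probabilities = [0.05, 0.20, 0.30, 0.40, 0.55, 0.60, 0.65, 0.80, 0.90, 0.95]
labels = [0, 0, 1, 0, 0, 1, 0, 1, 1, 1]


def sensitivity_specificity(threshold):
    predictions = [1 if p >= threshold else 0 for p in probabilities]
    tp = sum(p == 1 and y == 1 for p, y in zip(predictions, labels))
    tn = sum(p == 0 and y == 0 for p, y in zip(predictions, labels))
    positives = sum(labels)
    negatives = len(labels) - positives
    return tp / positives, tn / negatives


for threshold in (0.25, 0.5, 0.7):
    sens, spec = sensitivity_specificity(threshold)
    print(threshold, sens, spec)
```

On this toy data, lowering the threshold to 0.25 catches every cancer case but flags more healthy patients, while raising it to 0.7 flags no healthy patients but misses some cancer cases.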
You can watch this video from 00:58 to 5:32 explaining the Sensitivity and Specificity trade off
Confusion Matrix
Talking about accuracy, our favourite metric!

Accuracy is defined as the ratio of correctly predicted examples to the total number of examples.

● Accuracy: Overall, how often is the classifier correct?
Accuracy = (TP+TN)/total = (100+50)/165 = 0.91

● Remember, accuracy is a very useful metric when all the classes are equally important.

● But this might not be the case if we are predicting if a patient has cancer. In this example, we can probably tolerate FPs but not FNs.

● If a patient with cancer is wrongly reported as being fine, treatment can be delayed, which is not good!
Confusion Matrix
So you’ve already learnt how to calculate Precision and Recall and how changing the threshold can affect their values (similar to the Sensitivity/Specificity threshold).

But do we necessarily need to spend time on varying the threshold to get the perfect Precision and Recall? Or is there a way to choose this threshold automatically?

Let’s take 3 algorithms and try to find a metric for combining Precision and Recall.

How about taking an average of Precision and Recall? (P+R)/2

            | Precision (P) | Recall (R) | Average
Algorithm 1 | 0.5           | 0.4        | 0.45
Algorithm 2 | 0.7           | 0.1        | 0.4
Algorithm 3 | 0.02          | 1.0        | 0.51


F1 Score
The average tells us that Algorithm 3 is the best (highest value). But Algorithm 3 is a dumb model that predicts y = 1 every time and thus gets a recall of 1 (FN = 0, so TP/(TP + FN) = 1).

That means the average isn’t a good metric.

Researchers found a metric that solves our purpose: the F1 Score, the harmonic mean of Precision and Recall, F1 = 2PR/(P + R).


F1 Score
Let’s apply F1 Score to our problem:

            | Precision (P) | Recall (R) | Average | F1 Score
Algorithm 1 | 0.5           | 0.4        | 0.45    | 0.444
Algorithm 2 | 0.7           | 0.1        | 0.4     | 0.175
Algorithm 3 | 0.02          | 1.0        | 0.51    | 0.0392

The F1 score tells us that Algorithm 1 is the best (highest F1 Score).

● For the F1 Score to be large, both P and R need to be large.

● It is highest (1) when both P and R are 1.

● Accuracy can be used when the class distribution is similar, while F1 score is a better metric when there are imbalanced classes.
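The F1 Score is the harmonic mean of Precision and Recall, F1 = 2PR/(P + R). The sketch below applies that formula to the three algorithms from the table; the function name and dictionary layout are illustrative.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


algorithms = {
    "Algorithm 1": (0.5, 0.4),
    "Algorithm 2": (0.7, 0.1),
    "Algorithm 3": (0.02, 1.0),
}
scores = {name: f1_score(p, r) for name, (p, r) in algorithms.items()}
best = max(scores, key=scores.get)
print({name: round(score, 4) for name, score in scores.items()}, best)
```

Because the harmonic mean is pulled toward the smaller of the two values, an algorithm cannot score well by maximizing only one of Precision or Recall.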
ROC (Receiver Operating Characteristic) Curve
● An ROC curve is a commonly used way to visualize the performance of a binary classifier, meaning a classifier with two possible output classes.

● It shows the performance of a classification model at all threshold values.

● It plots 2 parameters:

1. True Positive Rate (TPR), also called Recall = TP/(TP + FN)

2. False Positive Rate (FPR) = FP/(FP + TN)


AUC Curve
● AUC stands for "Area under the ROC Curve." That is, AUC measures the
entire two-dimensional area underneath the entire ROC curve

● AUC provides an aggregate measure of performance across all possible classification thresholds.
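A minimal sketch of how the curve and the area can be computed by hand: sweep the threshold over the model's scores, record the (FPR, TPR) pair at each threshold, and add up the area under the resulting points with the trapezoidal rule. The labels, scores and function names are made up for illustration.

```python
def roc_points(labels, scores):
    """Return (FPR, TPR) pairs, one per threshold, from strictest to loosest."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(s >= threshold and y == 1 for s, y in zip(scores, labels))
        fp = sum(s >= threshold and y == 0 for s, y in zip(scores, labels))
        points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under the curve using the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Made-up labels (1 = positive) and model scores.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
points = roc_points(labels, scores)
print(points, auc(points))
```

An AUC of 1.0 would mean every positive is scored above every negative, while 0.5 corresponds to random guessing; this toy example gives 0.75.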
ROC and AUC Explained
Reading Material

MUST READ
An excellent article explaining Threshold, ROC and AUC in a simple
manner: [Link]
auc-curves-a05b68550b69
Which metrics to use when?

Which metrics to use when?
This is an important question, and we get used to choosing these measures over time. Here are some resources to help you understand which metrics to use in the context of the problem you are solving.

● 5 Classification Metrics every data scientist must know


[Link]
evaluation-metrics-you-must-know-aa97784ff226

● [Link]
for-evaluating-machine-learning-models-part-2-
86d5649a5428
SUPERVISED LEARNING - BUILDING YOUR FIRST
CLASSIFICATION AND REGRESSION MODELS

[Link]
upervised-ml-model-building-walkthrough
Feature Selection
In real-life data science problems, the data often consists of a large number of attributes or features.

This means selecting the most relevant features as the input to the machine learning model is absolutely vital.
Curse of Dimensionality
Refers to various phenomena that arise when analyzing and organizing data in high-dimensional spaces (often with hundreds or thousands of dimensions) that do not occur in low-dimensional settings such as the three-dimensional physical space of everyday experience.

As the dimensionality increases, the volume of the space increases so fast that the available data become sparse.

This sparsity is problematic for any method that requires statistical significance.

Therefore it is essential to quality-control the data that is being fed to the model.
Feature Selection Methods:
Correlation coefficient

Variance threshold

Recursive Feature Elimination

LASSO regularization

Random Forest Importance


Preserve variance
Linear Discriminant analysis
Principal Component Analysis
Singular value decomposition
Correlation Coefficient
● Measure of the linear relationship between two variables.
● Good variables are highly correlated with the target.
● Variables should be correlated with the target but uncorrelated among themselves.
● If two variables are correlated, we can predict one from the other. Therefore, if two features are correlated, the model only really needs one of them.
● Pearson Correlation
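The Pearson correlation coefficient can be computed in a few lines. The sketch below uses made-up housing data to show both situations described above: a feature that is strongly correlated with the target, and two features that are perfectly correlated with each other (the same size in different units), so one of them is redundant.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return covariance / (spread_x * spread_y)


# Made-up data: two redundant features and one target.
size_sq_m = [50, 60, 70, 120, 130, 140]
size_sq_ft = [x * 10.764 for x in size_sq_m]   # same information, new units
price = [100, 115, 118, 290, 315, 320]

print(round(pearson(size_sq_m, price), 3))
print(round(pearson(size_sq_m, size_sq_ft), 3))
```

The two size features have a correlation of 1 with each other, so keeping both adds no information for the model.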
Variance Threshold - Remove Low Var Features
● Removes features whose variance is below a pre-defined cutoff value.
● It is based on the notion that features that do not vary much within themselves have low predictive power.
● Weakness: it does not consider the relationship of input features with the output feature.
● Before performing variance thresholding, all features should be brought to a comparable scale (for example, by min-max scaling), because variance depends on the units of a feature. Note that standardizing to unit variance would make every variance equal to 1 and leave nothing to threshold.
● Scikit-learn provides the VarianceThreshold class to perform the variance threshold method.
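The idea can be sketched without any library. The snippet below is a pure-Python stand-in for the concept, not scikit-learn's implementation: it computes the variance of each feature and keeps only those above a cutoff. The feature names, data and cutoff are made up.

```python
def variance(values):
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def select_by_variance(columns, threshold):
    """Keep only the columns whose variance is above the threshold."""
    return [name for name, values in columns.items()
            if variance(values) > threshold]


# Made-up features, already on a comparable 0-1 scale.
columns = {
    "has_garden": [0, 1, 0, 1, 1, 0],
    "is_house": [1, 1, 1, 1, 1, 1],        # constant: carries no information
    "near_station": [0, 0, 0, 0, 0, 1],    # almost constant
}
kept = select_by_variance(columns, threshold=0.15)
print(kept)
```

Only the feature that actually varies across rows survives; note that the target variable is never consulted, which is the weakness mentioned above.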
Variance Threshold - Remove Low Var Features

● Variance Threshold Documentation: [Link]


[Link]/stable/modules/generated/sklearn.feature_sele
[Link]
Recursive feature elimination (RFE)
● Given an external estimator that assigns weights to features (e.g., the coefficients of a linear model), the goal of RFE is to select features by recursively considering smaller and smaller sets of features.
● First, the estimator is trained on the initial set of features and the importance of each feature is obtained either through a coef_ attribute or through a feature_importances_ attribute.
● The least important features are removed from the current set of features.
● That procedure is recursively repeated on the pruned set until the desired number of features to select is eventually reached.
LASSO Regularization L1
● Regularization consists of adding a penalty to the different parameters of the machine learning model to reduce the freedom of the model, i.e. to avoid overfitting.
● In linear model regularization, the penalty is applied to the coefficients that multiply each of the predictors.
● Among the different types of regularization, Lasso (L1) has the property that it can shrink some of the coefficients to exactly zero.
● Therefore, the features whose coefficients become zero can be removed from the model.
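Why L1 produces exact zeros can be seen in a special case. Assuming uncorrelated, standardized predictors (a simplification made here to keep the sketch short), the Lasso coefficient is the ordinary least-squares coefficient after "soft-thresholding": it is moved toward zero by the penalty, and set to exactly zero if it is smaller than the penalty. The coefficients and the penalty value below are made up, and the penalty is expressed on the same scale as the coefficients.

```python
def soft_threshold(coefficient, penalty):
    """Shrink a coefficient toward zero by `penalty`; set it to exactly zero
    if its magnitude is not larger than the penalty."""
    if coefficient > penalty:
        return coefficient - penalty
    if coefficient < -penalty:
        return coefficient + penalty
    return 0.0


# Made-up least-squares coefficients for three features.
ols_coefficients = {"size": 2.5, "age": -0.8, "door_colour": 0.1}
lasso_coefficients = {name: soft_threshold(value, penalty=0.3)
                      for name, value in ols_coefficients.items()}
selected = [name for name, value in lasso_coefficients.items() if value != 0.0]
print(lasso_coefficients, selected)
```

The weak feature is dropped entirely, while the stronger ones are kept with slightly smaller coefficients. In the general case with correlated features, the solution is found iteratively, but the same shrink-to-zero behaviour drives the selection.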
Random Forests Importance
● Random Forest is a kind of bagging algorithm that aggregates a specified number of decision trees.
● The tree-based strategies used by random forests naturally rank features by how well they improve the purity of a node, in other words by the decrease in impurity (Gini impurity) averaged over all trees.
● Nodes with the greatest decrease in impurity occur at the start of the trees, while nodes with the least decrease in impurity occur at the end of the trees.
● Thus, by pruning trees below a particular node, we can create a subset of the most important features.
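The "decrease in impurity" can be computed by hand for a single split. The sketch below defines the Gini impurity of a node and the weighted drop in impurity produced by splitting it; the labels and the two candidate splits are made up for illustration.

```python
from collections import Counter


def gini(labels):
    """Gini impurity: 0 for a pure node, higher for mixed nodes."""
    total = len(labels)
    return 1.0 - sum((count / total) ** 2 for count in Counter(labels).values())


def impurity_decrease(parent, left, right):
    """Weighted drop in Gini impurity achieved by splitting parent into two."""
    n = len(parent)
    weighted_children = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
    return gini(parent) - weighted_children


parent = ["yes"] * 5 + ["no"] * 5
good_split = impurity_decrease(parent, ["yes"] * 5, ["no"] * 5)
poor_split = impurity_decrease(parent, ["yes", "yes", "yes", "no", "no"],
                               ["yes", "yes", "no", "no", "no"])
print(good_split, round(poor_split, 3))
```

A feature whose splits separate the classes cleanly produces a large decrease and is ranked as important; a feature whose splits leave both children mixed produces almost none.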
Supervised ML Algorithms -
Classification
Decision Tree/ Classification Tree
Decision Trees for Classification -Classification Trees
Decision Trees
Classification tree methods (i.e., decision tree methods) are recommended when the task contains classifications or predictions of outcomes, and the goal is to generate rules that can be easily explained and translated into SQL or a natural query language.

A Classification tree labels, records, and assigns variables to discrete classes.

A Classification tree can also provide a measure of confidence that the classification is correct.
CART
You might encounter the term ‘CART’ while building ML models. It’s actually nothing new but the same old decision tree.

CART = Classification and Regression Trees, an umbrella term for:

Classification Trees: where the target variable is categorical and the tree is used to identify the "class" into which the target variable would most likely fall.

Regression Trees: where the target variable is continuous and the tree is used to predict its value.
Resources on Decision Tree Classification

● Implementation of Decision Tree Classification:


[Link]
tree-classification-python

● Notebooks:
[Link]
notebooks-intro-to-model-building
[Link]
_improvised_model
Supervised ML Algorithms -
Classification
Support Vector Machine (SVM)
SVM’s Objective
The objective of SVMs is to categorize data into two classes. They do so by finding a separating hyperplane (decision boundary) whose distance to the closest data points of both categories is maximized.

Look at how the hyperplane lies exactly between the nearby blue and red points (maximising the margin).
Simple Visual Explanation of SVM

[Link]
vector-machine-svm-a-visual-simple-explanation-part-1-
a7efa96444f2
Pros and Cons — SVM
Pros:
● It is useful for both linearly separable (hard margin) and non-linearly separable (soft margin) data.
● It is effective in high dimensional spaces.
● It is effective in cases where the number of dimensions is greater than the number of samples.
● It uses a subset of training points in the decision function (called support vectors), so it is also memory efficient.

Cons:
● Picking the right kernel and parameters can be computationally intensive.
● It doesn’t perform very well when the data set is noisy, i.e. when the target classes overlap.
● SVM doesn’t directly provide probability estimates; these are calculated using an expensive five-fold cross-validation.
Support Vector Machine (SVM)
SVM Implementation
SVM Notebook

[Link]
[Link]
Kernels
● SVM algorithms use a set of mathematical functions known as kernels.

● The function of a kernel is to take data as input and transform it into the required form.

● Different SVM algorithms use different types of kernel functions.

● Examples include linear kernels and nonlinear kernels such as polynomial, radial basis function (RBF), and sigmoid.

● The most used type of kernel function is RBF. It is also the default kernel in scikit-learn's SVC.
Supervised ML Algorithms -
Classification
Naive Bayes Classifier
Naive Bayes

● Naive Bayes is one of the simplest and fastest classification algorithms, and it is suitable for large volumes of data.

● The Naive Bayes classifier is successfully used in various applications such as spam filtering, text classification, sentiment analysis, and recommender systems.

● It uses Bayes' theorem of probability to predict the unknown class.
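A minimal sketch of the calculation for a spam filter: start from the prior probability of each class, multiply by the probability of each observed word given that class (treating words as independent, which is the "naive" assumption), and normalize. All probabilities below are made up for illustration, and `posterior_spam` is a hypothetical helper.

```python
def posterior_spam(words, priors, likelihoods):
    """Return P(spam | words) assuming words are independent given the class
    (the "naive" assumption)."""
    scores = {}
    for label, prior in priors.items():
        score = prior
        for word in words:
            score *= likelihoods[label][word]
        scores[label] = score
    return scores["spam"] / sum(scores.values())


# Made-up probabilities for illustration only.
priors = {"spam": 0.4, "ham": 0.6}
likelihoods = {
    "spam": {"free": 0.5, "meeting": 0.05},
    "ham": {"free": 0.1, "meeting": 0.4},
}
print(round(posterior_spam(["free"], priors, likelihoods), 3))
print(round(posterior_spam(["meeting"], priors, likelihoods), 3))
```

An email containing "free" comes out as probably spam, while one containing "meeting" comes out as probably not spam.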
Advantages and Disadvantages

● Advantages: This algorithm requires a small amount of training data to estimate the necessary parameters. Naive Bayes classifiers are extremely fast compared to more sophisticated methods.

● Disadvantages: Naive Bayes is known to be a bad probability estimator, so the probabilities it outputs should not be taken too literally, even when the predicted class is right.
In-depth explanation & Implementation
Titanic Survival

[Link]
ayes/14_naive_bayes_1_titanic_survival_prediction.ipynb
Multinomial Gaussian NB for spam detection

Note:
In the next video, the instructor has used the spam detection
example that involves some operations around text (NLP).

You don’t need to worry about those parts and instead just
focus on the parts revolving around Naive Bayes.
Multinomial Gaussian NB for spam detection
Count Vectorization
Email Spam filter

[Link]
.ipynb
Naive Bayes Classification using Scikit-learn

[Link]
scikit-learn
Supervised ML Algorithms -
Classification
Random Forest
Random Forest
Random forest is a flexible, easy-to-use machine learning algorithm that produces a great result most of the time, even without hyperparameter tuning.

It is also one of the most used algorithms, because of its simplicity and versatility (it can be used for both classification and regression tasks).

Random forest builds multiple decision trees and merges them together to get a more accurate and stable prediction.
Random Forest: Ensemble tree based algorithm
It is an ensemble tree-based learning algorithm. The Random Forest
Classifier is a set of decision trees, each built from a randomly selected
subset of the training set. It aggregates the votes from the different
decision trees to decide the final class of the test object.
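The idea of "many trees on random subsets, then a vote" can be sketched by hand. The example below is an illustration only: the choice of the Iris dataset, 25 trees, bootstrap sampling, and the variable names are all assumptions made for this sketch.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

rng = np.random.default_rng(0)
n_trees = 25
trees = []
for _ in range(n_trees):
    # Each tree sees a random sample (with replacement) of the training rows.
    idx = rng.integers(0, len(X_train), size=len(X_train))
    # max_features="sqrt" also randomizes the features tried at each split.
    tree = DecisionTreeClassifier(max_features="sqrt", random_state=0)
    tree.fit(X_train[idx], y_train[idx])
    trees.append(tree)

# Collect one vote per tree for every test sample: shape (n_trees, n_test).
all_votes = np.stack([tree.predict(X_test) for tree in trees])

# The final class of each test sample is the class with the most votes.
forest_pred = np.array(
    [np.bincount(votes, minlength=3).argmax() for votes in all_votes.T]
)
accuracy = (forest_pred == y_test).mean()
print(f"Accuracy of the hand-built forest: {accuracy:.2f}")
```

Note that scikit-learn's own RandomForestClassifier combines trees by averaging their predicted class probabilities rather than counting hard votes, so its results can differ slightly from this sketch.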
Random Forest - Identifying the numbers - 0-9

[Link]
_forest/11_random_forest.ipynb
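For a quick feel of the digit-recognition task, here is a minimal sketch using scikit-learn's built-in handwritten digits dataset. It is not the linked notebook; the train/test split, the number of trees, and the random seeds are assumptions chosen for illustration.

```python
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# 8x8 images of handwritten digits 0-9, flattened to 64 features each.
digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.2, random_state=42
)

# Default hyperparameters apart from a fixed seed for reproducibility.
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

test_accuracy = model.score(X_test, y_test)
print(f"Test accuracy: {test_accuracy:.3f}")
```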
Applications
● The random forest algorithm is used in many different fields, such as
banking, the stock market, medicine and e-commerce.

● In finance, for example, it is used to identify customers who are more
likely to repay their debt on time or to use a bank's services more
frequently. In this domain it is also used to detect fraudsters out to
scam the bank. In trading, the algorithm can be used to estimate
a stock's future behavior.

● In the healthcare domain, it is used to identify the correct
combination of components in medicine and to analyze a patient’s
medical history to identify diseases.

● In e-commerce, random forest is used to predict whether a
customer will actually like a product or not.
Resources on Random Forest Classification

● Implementation of Random Forest Classification:
[Link]
/random-forests-classifier-python

● Notebooks:
[Link]
andom-forest
Learning Objectives

● Feature Importance & Selection
● Benefits of Feature Selection
● Feature Selection Techniques


Problem & Solution
Problem:

Gone are the days when you had 5 variables to fit your linear
regression. Modern datasets contain many more variables/features to
choose from: a dataset can easily have 50 or more features and more
than 1 million observations.

Solution: Feature Importance and Selection

We will come to the why (the benefits) very soon!


Feature Importance and Feature Selection

What, why and how?


Feature Importance refers to techniques that assign a score to
input features based on how useful they are at predicting a
target variable.

Feature Selection is the process where you automatically or
manually select the features which contribute most to your target
variable.

In short, Feature Importance Scores are used for performing
Feature Selection.
[Diagram: Iris Dataset → Feature Importance → Feature Selection (manual or automatic selection)]
How?
Suppose we’re working on Iris classification. We’ll first create a baseline model
using Logistic Regression. Now, we want to try out Feature Selection and try to
improve our model’s performance. On plotting feature importance scores, we
obtain the graph below:

● Feature Importance Scores tell us that petal width and petal length are the top 2
features. The rest have a much lower importance score.
● We’ll select these 2 features.
● We’ll transform our existing dataset to contain only these 2 features.
● We’ll train our model on this transformed dataset.
● Finally, we’ll compare the evaluation metrics of our initial Logistic Regression
model with this new model.
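The steps above can be sketched in code. The slides do not say how the importance scores are computed, so this sketch assumes they come from a random forest's `feature_importances_`; the split, seeds, and variable names are likewise illustrative choices.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=0, stratify=iris.target
)

# Step 0: baseline Logistic Regression on all 4 features.
baseline = LogisticRegression(max_iter=1000).fit(X_train, y_train)
baseline_acc = baseline.score(X_test, y_test)

# Step 1: feature importance scores (assumption: taken from a random forest).
scorer = RandomForestClassifier(n_estimators=200, random_state=0)
scorer.fit(X_train, y_train)
importances = scorer.feature_importances_

# Step 2: select the 2 features with the highest scores.
top2 = np.argsort(importances)[::-1][:2]
top2_names = [iris.feature_names[i] for i in top2]

# Steps 3-4: keep only those columns and retrain the same kind of model.
selected = LogisticRegression(max_iter=1000).fit(X_train[:, top2], y_train)
selected_acc = selected.score(X_test[:, top2], y_test)

# Step 5: compare the two models on the same test set.
print("Top 2 features:", top2_names)
print(f"Baseline accuracy (4 features): {baseline_acc:.3f}")
print(f"Selected accuracy (2 features): {selected_acc:.3f}")
```

On a dataset as small as Iris the two accuracies are usually close; the point of the sketch is the workflow, not a guaranteed improvement.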
Why Feature Selection?
You already know a number of optimization methods by now, so you might
wonder: why reduce our data through feature selection if we can
just optimize?

There’s something known as “The curse of dimensionality”.

In machine learning,
“dimensionality” = number of features (i.e. input variables) in your dataset.

When the number of features is very large relative to
the number of observations (rows) in your dataset,
certain algorithms struggle to train effective models.
This is called the Curse of Dimensionality.

Image source: TowardsDataScience


Curse of dimensionality analogy
Let's say you have a straight line 100 yards long and you dropped a penny
somewhere on it. It wouldn't be too hard to find. You walk along the line
and it takes two minutes.

Now let's say you have a square 100 yards on each side and you dropped a
penny somewhere on it. It would be pretty hard, like searching across two
football fields stuck together. It could take days.

Now a cube 100 yards across. That's like searching a 30-story building the
size of a football stadium. Ugh.

Searching through the space gets much harder as you add
more dimensions.
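The analogy can be turned into simple arithmetic. In this small sketch, we assume the penny occupies one 1-yard cell and count how many such cells there are to search; the function name `cells_to_search` is made up for illustration.

```python
def cells_to_search(side_length: int, dimensions: int) -> int:
    """Number of unit cells in a space with the given side length and dimensions."""
    return side_length ** dimensions


# Line, square, cube, and a 10-dimensional space, all 100 yards per side.
for dims in (1, 2, 3, 10):
    print(f"{dims} dimension(s): {cells_to_search(100, dims):,} cells")
```

Every added dimension multiplies the search space by another factor of 100, which is why the same amount of data covers the space more and more thinly as features are added.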
Benefits of performing Feature Selection
By now, you might have an idea of why feature selection is required.

Feature Selection helps us with the following:

● Reduces Overfitting: Less redundant data means less opportunity
to make decisions based on noise (irrelevant data).

● Improves Model Performance: Less misleading data means our
model’s performance improves.

● Reduces Training Time: Fewer features mean that algorithms train
faster.
