Handling Imbalanced Datasets in ML

The document discusses the challenges of imbalanced datasets in machine learning, particularly in predicting minority classes like fraudulent transactions. It highlights the issues that arise when models are trained on unbalanced data, leading to bias towards the majority class and poor generalization. The document also introduces techniques for handling class imbalance, such as oversampling, undersampling, and SMOTE, along with data preprocessing methods like imputation and scaling.

Uploaded by

Aditi Raghu

Imbalanced Classes

Imbalanced Dataset
● For example, suppose you have credit card transaction data and you are
supposed to predict fraudulent transactions. You'll likely have 10,000
authentic transactions for every 1 fraudulent transaction. That's quite an
imbalance!
● In machine learning terms: Often you'll have a large amount of
data/observations for one class (referred to as the majority class), and
much fewer observations for one or more other classes (referred to as the
minority classes).
Imbalanced Dataset
● The problem is that machine learning models trained on
unbalanced datasets often have poor results when they have to
generalize (predict a class or classify unseen observations). Some
algorithms are more susceptible to unbalanced data than others, but
whichever one you choose, you are unlikely to end up with a good
model. The reasons include:

○ The algorithm receives significantly more examples from one
class, prompting it to be biased towards that particular class.
○ It does not learn what makes the other class “different” and fails
to understand the underlying patterns that allow us to
distinguish classes.
Imbalanced Dataset

Can you think of any other example where there is class imbalance?
Resources for the Tutorial
Download Dataset
● Download IEEE Fraud Dataset:
[Link]
official/Datasets/blob/master/fraud_data.csv

!!! - It is a huge dataset, so opening it in Excel might be
difficult. Please open it in a Python environment
(Colab or Jupyter notebook).

● Read about this dataset here:

[Link]
About the Dataset
IEEE Fraud Dataset
● The data is broken into two files identity and transaction, which are joined by
TransactionID
Transaction Table *
● TransactionDT: timedelta from a given reference datetime (not an actual timestamp)
● TransactionAMT: transaction payment amount in USD
● ProductCD: product code, the product for each transaction
● card1 - card6: payment card information, such as card type, card category, issue bank,
country, etc.
● addr: address of the customer
● dist: distance
● P_emaildomain and R_emaildomain: purchaser and recipient email domain
About the Dataset
● M1-M9: match, such as names on card and address, etc.

● Categorical Features:
○ ProductCD
○ card1 - card6
○ addr1, addr2
○ P_emaildomain, R_emaildomain
○ M1 - M9 (bank sensitive data)
Note: Some of the feature/variable description is not given as
About the Dataset
Identity Table *
● Variables in this table are identity information – network connection information (IP,
ISP, Proxy, etc) and digital signature (UA/browser/os/version, etc) associated with
transactions.
● They're collected by Vesta’s fraud protection system and digital security partners.
● (The field names are masked and pairwise dictionary will not be provided for privacy
protection and contract agreement)

● Categorical Features:
○ DeviceType
○ DeviceInfo
○ id_12 - id_38
Notebook Link

[Link]
notebooks-handling-imbalanced-classes

!! The dataset takes a lot of time to load on colab.

Additional Resources

A nice article on data preprocessing:

[Link]
concepts-fa946d11c825

P.S: Try opening in incognito if it is asking for a premium subscription upgrade
DATA PRE-PROCESSING
Topics today – Data Pre-processing
• Handling Missing Values
• Handling Imbalanced datasets, Oversampling - SMOTE
• Standardization/Normalization and transformation for data
• Explained with an example on Detecting Fraudulent Transactions in a given
dataset!
Pre-processing – what, why and how?
What?
Pre-processing is the process of preparing the data for training.
Why?
• Data is not ready-made for us:
✔ Missing values
✔ Wrongful Data entries
✔ Class Imbalance
✔ Different scales of data..
How?
• Handling Missing Values
• Handling Imbalanced datasets, Oversampling - SMOTE
• Standardization/Normalization and transformation for data
Missing values in data
Most datasets are not perfect; they have missing values.

Missing values can be due to:
• Missed entries by participants (in case of surveys)
• Missed information by database managers
• That variable is not relevant for that data point
(e.g. in a customer dataset for banks, ‘No of children’ is not relevant for
unmarried customers. Hence, it results in an NA value.)
Example dataset: IEEE Fraud Dataset - a 10% sample used for simplicity
IEEE Fraud Dataset was provided at Kaggle a year ago:

Dataset | Observations | True positives | No of variables | % of fraud cases
IEEE    | 57049        | 2005           | 180             | 3.5

⮚ Categorical Features – Transaction
▪ ProductCD – Product code
▪ card1 - card6 : payment card information, such as card type, card
category, issue bank, country, etc.
▪ addr1, addr2
▪ P_emaildomain – Purchaser
▪ R_emaildomain – Recipient
▪ M1 - M9 – Match between names on card and address etc.
⮚ Categorical Features - Identity
▪ DeviceType
▪ DeviceInfo
▪ id_12 - id_38
• The TransactionDT feature is a timedelta from a given reference
datetime (not an actual timestamp).
⮚ Outcome/Target Variable – isFraud –
• whether transaction is fraud or not
More about this dataset here: [Link]
detection/data
Ideas for Missingness Mechanisms

Missingness in a dataset can affect a machine learning problem in different ways:

1. Missing completely at random
(e.g. forgot to fill in a survey, missed during data entry, etc.)
2. Missingness is related to other variables (predictors) used for prediction.
(E.g. Fraud dataset: ‘VAT number’ doesn’t exist because the customer is of type ‘individual’
and not ‘business’. Here ‘customer type’ is another predictor variable for fraud.)
Ideas for Missingness Mechanisms

3. Missingness is related to the outcome variable being predicted and is hence
not random.
(E.g. Fraud dataset: ‘Transaction message’ doesn’t exist because the customer is
perhaps doing something suspicious. Here, imputation might need advanced
techniques, which are beyond the scope of this course.)
Why fill missing
values?
• How should a model read a missing
value?
• Doesn’t it change what machine
learns if it reads it as 0?
• Mathematical models cannot
understand what a missing value
means.

Source: Tim Bock on DisplayR

Techniques for filling missing values

• Single Imputation
• Regression Imputation
• Multiple Imputation
Which variables to impute-
filling missing values vs using unreal information

• Imputation or filling in unreal values doesn’t come without consequences.

• Is it worth using variables which have 80% missing values which have
been imputed?
• Typical recommendation for variables to be used are variables with < 20%
missingness
• Heuristic for % of missingness ranges from 10% to 20% and hasn’t
reached consensus in academic research. (Schlomer, Bauman, and Card
2010)
• Conclusion: In our example, we will just impute variables with less than
20% missingness and leave out variables with missingness higher than
that
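The 20% rule above can be sketched in a few lines of plain Python. The rows and their values below are made up for illustration (only the column names echo the fraud dataset), and `None` stands in for a missing value; this is a minimal sketch, not the notebook's actual code.

```python
# Hypothetical toy rows; None marks a missing value.
rows = [
    {"TransactionAMT": 50.0, "dist": None, "card4": "visa"},
    {"TransactionAMT": 20.0, "dist": None, "card4": "visa"},
    {"TransactionAMT": None, "dist": 3.0, "card4": None},
    {"TransactionAMT": 75.0, "dist": None, "card4": "mastercard"},
    {"TransactionAMT": 10.0, "dist": None, "card4": "visa"},
    {"TransactionAMT": 31.0, "dist": None, "card4": "visa"},
]

def missing_fraction(rows, column):
    """Share of rows in which `column` is missing (None)."""
    return sum(r[column] is None for r in rows) / len(rows)

THRESHOLD = 0.20  # the 20% cut-off chosen on the slide

fractions = {col: missing_fraction(rows, col) for col in rows[0]}
keep = [col for col, frac in fractions.items() if frac < THRESHOLD]
drop = [col for col, frac in fractions.items() if frac >= THRESHOLD]

print("impute:", keep)
print("leave out:", drop)
```

Columns in `keep` would go on to imputation; columns in `drop` are left out of the model.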
Simple Imputation and how it helps
Simple Imputation: replace the missing value in a variable by the
mean/mode of the variable:
• If variable is numeric: replace with mean of the variable
• If variable is categorical: replace with mode of the variable

How it helps? Imputing by mean or median keeps the centre of the variable's
distribution unchanged, although it reduces its spread.
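A minimal sketch of simple imputation, using only the standard library. The helper name `impute` and the sample values are assumptions made for illustration; `None` stands in for a missing value.

```python
from statistics import mean, mode

def impute(values, numeric):
    """Replace None by the mean (numeric) or the mode (categorical)."""
    observed = [v for v in values if v is not None]
    fill = mean(observed) if numeric else mode(observed)
    return [fill if v is None else v for v in values]

# Hypothetical columns with one missing entry each.
amounts = [10.0, None, 30.0, 20.0]
card_types = ["visa", "visa", None, "mastercard"]

amounts_filled = impute(amounts, numeric=True)
card_types_filled = impute(card_types, numeric=False)

print(amounts_filled)
print(card_types_filled)
```

The fill value is computed from the observed entries only, so the mean of the numeric column is the same before and after imputation.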
Other Imputation Techniques

• Multiple Imputation: Multiple Imputation fills in estimates for the missing data.
But to capture the uncertainty in those estimates, MI estimates the values
multiple times.

• Example In Python: Iterative Imputer

A strategy for imputing missing values by modeling each feature with missing
values as a function of other features in a round-robin fashion.
Example of Multivariate feature imputation:
Iterative Imputer
Recommended Imputations for various
Missingness Mechanisms
1. Missing fully at random - reasonably good to use single imputation techniques
(by mean for numeric variables and mode for categorical variables).
2. Missingness is related to other variables(predictors) used for prediction-
Maybe use multiple imputation here

3. Missingness is related to the outcome variable being predicted - Here, imputation
might need advanced techniques. Beyond the scope of this course.
References

• Common Methods to Imputing Missing Data

[Link]
common-methods-to-imputing-missing-data/
• Multiple Imputation:
• [Link]
recommended-solutions/
• Imputation in Python using scikit-learn:
• [Link]
• Schlomer, Gabriel L., Sheri Bauman, and Noel A. Card. 2010. “Best
Practices for Missing Data Management in Counseling Psychology.”
Journal of Counseling Psychology.
One Hot encoding in Python
• In Python, typically machine learning models expect inputs to be
numbers.
• Hence, it is easier to convert all categorical variables to numeric.
• One such technique popularly used is one-hot encoding.
• One-hot encoding is checking
(Binary is nothing but assigning 0 or 1.)
• Example :
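As a rough illustration, one-hot encoding can be written by hand: each category becomes its own 0/1 column. The function name `one_hot` and the `DeviceType` values below are assumptions; libraries such as pandas and scikit-learn provide their own encoders.

```python
def one_hot(values):
    """Return (categories, rows) with one 0/1 column per category."""
    categories = sorted(set(values))
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, rows

# Hypothetical values for the DeviceType feature.
device_type = ["mobile", "desktop", "mobile"]
categories, encoded = one_hot(device_type)

print(categories)
for row in encoded:
    print(row)
```

Every row contains exactly one 1, in the column of the category that row belongs to.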
DEMONSTRATE THE FIRST DECISION TREE
A simple decision tree model
CLASS IMBALANCE PROBLEM
Machine Learning: Classification Vs Regression
Classification vs Regression:

• Classification
✔ Classification is the problem of assigning the outcome to one of
two or more classes
• Classification examples:
✔ Predict colour: Red/Blue/Green etc
✔ Predict if name matches/not
• Regression
✔ Regression is the problem of predicting a continuous outcome
(a numeric outcome)
• Regression examples:
✔ Predict price of house
✔ Predict age of a dog based on data of dog images!
Will this lead to bias against women?

The problem of Class Imbalance
⮚ Classification Problem:
• Class Imbalance: Minority class constitutes a very minute fraction
compared to majority class.
⮚ Why?
• Class imbalance in a classification problem means too few data points of one
class compared to another class we are trying to predict.
⮚ How it affects?
• It leads the machine to learn too much about the dominant class and too little
about the minority class!
⮚ Examples:
• Millions of black people affected by racial bias in health-care algorithms
(Source: Nature)
• Credit card approval algorithms may be biased against women. (Source:
[Link])
Will this lead to bias
against women?

Example of Class
Imbalance
• Given a database of machine learning publications,
if the problem is to predict whether a researcher is
male or female, will the default prediction be biased
by machine learning?
When is class imbalance a problem?

• Class imbalance is a problem when there are too few minority class (fraud)
observations for the model to learn from.
• One needs to decide whether to create new minority class (fraud) observations or
remove existing majority class (normal transactions) observations.
Class Imbalance in Machine learning : in our example: Balanced Scale Data

• Majority class:
• Minority class:
How to Handle Imbalanced datasets

• Oversampling: Artificially increase the minority class (e.g. by duplicating minority
class observations or artificially creating them)
• Undersampling: Sample out observations from the majority class to reach a certain
ratio between majority and minority class
• SMOTE: Oversampling with synthetic minority observations, often combined with
undersampling of the majority class.
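Random oversampling and undersampling are simple enough to sketch directly. The function names, the toy rows and the target size of 20 are all assumptions for illustration; a real project would more likely use a library such as imbalanced-learn.

```python
import random

def random_oversample(minority, target_size, rng):
    """Duplicate randomly chosen minority rows until target_size is reached."""
    extra = [rng.choice(minority) for _ in range(target_size - len(minority))]
    return minority + extra

def random_undersample(majority, target_size, rng):
    """Keep a random subset of the majority rows, without replacement."""
    return rng.sample(majority, target_size)

rng = random.Random(0)  # fixed seed so the sketch is repeatable
majority = [("normal", i) for i in range(100)]
minority = [("fraud", i) for i in range(5)]

oversampled = random_oversample(minority, 20, rng)
undersampled = random_undersample(majority, 20, rng)

print(len(oversampled), len(undersampled))
```

Oversampling only repeats rows that already exist, while undersampling throws information away; that trade-off is what motivates synthetic approaches.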
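Delving deeper into one technique - SMOTE

• SMOTE (Synthetic Minority Over-sampling TEchnique):
Synthetically creates new minority class observations (oversampling)
by interpolating between existing minority observations and their
nearest minority-class neighbours. It is often combined with
undersampling of the majority class to get a certain ratio between
the classes.
• Proposed by Chawla et al 2002. (ref)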
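The core idea of SMOTE, placing each synthetic point somewhere on the line segment between a minority point and one of its nearest minority neighbours, can be sketched as follows. This is a simplified illustration, not the reference implementation: the function name `smote_sketch`, the toy points and the choice of k = 2 are assumptions.

```python
import random

def smote_sketch(minority, n_new, k, rng):
    """Create n_new synthetic points, each on the segment between a
    minority point and one of its k nearest minority neighbours."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: sq_dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic

rng = random.Random(42)  # fixed seed so the sketch is repeatable
fraud_points = [(1.0, 2.0), (1.5, 1.8), (2.0, 2.2), (1.2, 2.5)]
new_points = smote_sketch(fraud_points, n_new=6, k=2, rng=rng)

for point in new_points:
    print(point)
```

Because every synthetic point lies between two real minority points, the new observations stay inside the region the minority class already occupies instead of being exact duplicates.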
PERFORMANCE WITH AND
WITHOUT HANDLING
CLASS IMBALANCE
Performance with and without handling class imbalance

• AUC (a performance score for decision tree classifier) is slightly better using the
“SMOTE’d” data based model.
• We can play around with parameters in SMOTE and further improve the model.
• We can also use advanced machine learning models to improve further!
References

What?
• Standardization/Scaling is bringing all variables used for building a model to the same scale
Why?
• It stops variables with a larger range from dominating the model (see the example in the next slide)
• Sometimes, it also helps in speeding up the calculations in an algorithm.
• It is important for techniques which use distance metrics.

How?
• Scale: change the range of values without changing the shape of the distribution.
The range is often set to 0 to 1.
• Standardize: change the values so that the mean is 0 and the standard deviation is 1.
This also leaves the shape of the distribution unchanged; it does not make the data normally distributed.
• Normalize: the term is used loosely for either of the above.
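Both transformations are short formulas, sketched here with the standard library. The function names and the three sample values are assumptions for illustration (the values span the 3000 to 50000 range used in the next slide's example).

```python
from statistics import mean, pstdev

def min_max_scale(values):
    """Rescale to the range 0 to 1: (v - min) / (max - min)."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def standardize(values):
    """Shift to mean 0 and rescale to standard deviation 1: (v - mean) / std."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

x2 = [3000.0, 12000.0, 50000.0]  # hypothetical values of a wide-range variable
scaled = min_max_scale(x2)
standardized = standardize(x2)

print(scaled)
print(standardized)
```

In both cases the ordering and relative spacing of the values are preserved; only the units change.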
Why scaling- in our example

• Let's say you have two input vectors: X1 and X2. and let's say X1 has range(0.1 to
0.8) and X2 has range(3000 to 50000). Now your SVM classifier will be a linear
boundary lying in X1-X2 plane. My claim is that the slope of linear decision
boundary should not depend on the range of X1 and X2, but instead upon the
distribution of points.
Various scaling methods

• Min-Max Scaler
• Robust Scaler
• Standard Scaler
• Normalizer
When to scale data?

• If you build models using scaled data, it may require scaling back to original
variables to interpret variables’ effect on outcome predicted.
Put all the pre-processing techniques together

• Handle missing values via:

⮚Imputation:
1. Single Imputation
2. Multiple Imputation
• Handle class imbalance:
1. Other techniques – Oversampling, undersampling
2. SMOTE
• Additional: Standardization/scaling/Normalization of data
References

• Why scaling is important and techniques:

[Link]
Class Notebook

[Link]
official/Machine_Learning_Bootcamp/blob/master/Data_Prepa
ration_101/Data_Preparation_101.ipynb
Learning Objectives

• What is Machine Learning?
• Machine Learning Categorization
• Classification and Regression
What is Machine Learning?
Machine Learning Categorization
Supervised Learning Algorithms
Let’s talk about datasets that have both input variables and target variables
(labels for the data). Examples range from predicting the survival of a person in
the Titanic dataset, where survival is already given, to predicting the house
price from house characteristics, where the house prices are provided.

The algorithms that work on such datasets are known as Supervised Learning
Algorithms.

It is called supervised learning because the process of an algorithm learning
from the training dataset can be thought of as a teacher supervising the
learning process. We know the correct answers; the algorithm iteratively makes
predictions on the training data and is corrected by the teacher. Learning stops
when the algorithm achieves an acceptable level of performance.
Unsupervised Learning Algorithms
Unsupervised learning is where you have unlabeled data (or no target
variable) in the dataset.

The goal of Unsupervised Learning Algorithms is to find some

structure in the dataset.

These are called unsupervised learning because unlike supervised

learning, there are no correct answers and there is no teacher.
Algorithms are left to their own to discover and present the interesting
structure in the data.
Reinforcement Learning Algorithms
A robot takes a big step forward, then falls. The next time, it takes a
smaller step and is able to hold its balance. The robot tries variations
like this many times; eventually, it learns the right size of steps to take
and walks steadily. It has succeeded.

What we see here is called reinforcement learning. The robot learns

how to walk based on reward (staying on balance) and punishment
(falling). This feedback is considered “reinforcement” for doing or not
doing an action.

Reinforcement learning, in a simplistic definition, is learning best

actions based on reward or punishment.
Types of Supervised Learning Algorithms

Supervised Learning can be further divided into 2 types:

1. Classification
2. Regression
Classification vs Regression

In order to decide whether to use a regression or classification model, the
first question you should ask yourself is:

Does your target variable have a continuous value or is it discrete (binary
or multi-class)?
Regression

If your answer is continuous values, you’re dealing with

Regression.

This means that if you’re trying to predict quantities like

height, income, price, or scores, you should be using a model
that will output a continuous number.

So if your objective is to determine tomorrow’s temperature,

you should use a regression model.
Classification
Let’s come to the second case, where you can clearly see that the
target variable is divided into classes. Here you’ll be using Classification.

● When the number of classes is 2, it is known as Binary

Classification. Eg. Will it be hot or cold tomorrow is a binary
classification problem with 2 classes: Hot and Cold.

● When it is more than 2, it is known as Multi Class Classification.

Eg. Classifying movies in Good, Average or Bad according to
reviews.
What is Classification?
Let’s learn with some examples:

● In Classification we classify the outcome

● Examples:
○ Predict whether a transaction is fraud or not fraud
○ Predict whether to give loan or not
○ Predict whether to give college admission or not
○ Predict the grade (Grade A, B, C, D)
○ Note: Classification can involve more than two classes
What is Multi-Classification?
It is as simple as dividing waste into 4 categories - plastic, glass,
metal, paper
Classification vs Regression

In the above image, you can see that the classification line is dividing the
data into 2 parts or 2 classes - red and blue. On the other hand, the
regression line is going along the direction of data and not segregating it.
It’s important to understand the characteristics of your target variable
before you begin running models and forming predictions.
Supervised ML Algorithms -
Regression
Linear Regression: Introduction
Learning Objectives

• Dependent and Independent Variables
• Equation of a Straight Line
• Linear Regression
Dependent and Independent Variables
● So far you’ve been studying input and output/target variables.
Commonly, the input variable is known as independent variable and
target variable is known as dependent variable.

● In a nutshell, our target variable is nothing but a dependent variable.
Why dependent? Because the values of this variable are dependent on
other variables (i.e. input variables)

● And our input variables are known as independent variables. Here the
values of these variables are not dependent on any other variables.

Let’s look at some examples to learn more about them!

Dependent and Independent Variables
● Look at the Standard Metropolitan Areas Data below. In the dataset we
might be curious to predict “crime_rate” in future, so that becomes
our target variable (dependent variable) and rest of the variables
become input variables (independent variables) for building a
machine learning model.

Independent Dependent
variables variable
Another example

● A scientist wants to see if the brightness of light has any effect

on a moth being attracted to the light.

● The brightness of the light is controlled by the scientist. This

would be the independent variable.

● How the moth reacts to the different light levels (distance to

light source) would be the dependent variable.
Equation of a Straight Line
● In algebra, a linear equation (equation of a straight line) typically takes the form y = mx
+ b, where m and b are constants, x is the independent variable, y is the dependent
variable.

● Basically, the value of y is being calculated using x whereas x has no dependence on

value of y.

➔ y = how far up
➔ x = how far along
➔ m = Slope or Gradient (how steep the line is)
➔ b = value of y when x=0

● How do you find "m" and "b"?

○ b is easy: just see where the line crosses

the Y axis.
○ m (the Slope) needs some calculation:
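The calculation for m is "change in y divided by change in x" between any two points on the line; b then follows from either point. A small sketch, where the function name and the two sample points are chosen purely for illustration:

```python
def line_through(p1, p2):
    """Return slope m and intercept b of the line through two points."""
    (x1, y1), (x2, y2) = p1, p2
    m = (y2 - y1) / (x2 - x1)  # change in y divided by change in x
    b = y1 - m * x1            # value of y when x = 0
    return m, b

m, b = line_through((6, 6), (10, 8))
print(f"y = {m}x + {b}")
```

The two points must have different x values, otherwise the line is vertical and the slope is undefined.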
Synonyms Recap

Too many synonyms to memorise? Let me put them all down at one
place for better understanding:

Variables = Features

Input Variables = Attributes = Predictor = Independent Variables

Target Variables = Labels = Outcomes = Dependent Variables

What is linear regression? - an example
Suppose you are thinking of selling your home, and various houses around
you with different sizes (area in square feet) have sold for different
prices as listed below:

And considering, your home is 3000 square feet. How much should you
sell it for?
Well! You have to look at the existing price patterns (data) and predict a
price for your home. This is called linear regression.
What is linear regression? - an example
Here's an easy way to do it. Plotting the 3 data points we have so far:

Each point represents one home.

What is linear regression? - an example
Now you can eyeball it and roughly draw a line that gets pretty close to
all of these points. Then look at the price shown by the line, where the
square footage is 3000:

Boom! Your home should sell for $260,000.

What is linear regression? - an example
That's all! You plot your data, make a rough line, and use the line to
make predictions. You need to make sure your line fits the data well:

But of course we don't want to roughly make a line, we want to

compute the exact line that best "fits" our data. That’s where
machine learning comes into play!
What is linear regression?
● Linear regression is a linear model i.e. a model that assumes a linear
relationship (straight-line relationship) between the input variables (x) and the
single output variable (y).

● When there is a single input variable (x), the method is referred to as simple
linear regression or just linear regression. Eg: Salary dataset given here. There
is only one target variable and one input variable where we are predicting the
salary of individual using their years of experience.

● When there are multiple input variables, it is often referred to as multiple
linear regression. Eg: in the Standard Metropolitan Areas data set, we have multiple
input variables.
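For simple linear regression, the best-fitting m and b have a closed-form least-squares solution, sketched below. The experience and salary figures are invented for illustration and are not the course's salary dataset.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = m*x + b with one input variable."""
    n = len(xs)
    x_mean, y_mean = sum(xs) / n, sum(ys) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    m = sxy / sxx
    b = y_mean - m * x_mean
    return m, b

# Hypothetical data: years of experience vs. salary (in thousands).
years = [1, 2, 3, 4, 5]
salary = [35, 40, 45, 50, 55]

m, b = fit_line(years, salary)
predicted = m * 6 + b  # prediction for 6 years of experience

print(m, b, predicted)
```

With multiple input variables the same idea applies, but each input gets its own coefficient and the solution is usually computed with a library.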
References

● [Link]
Learning Objectives

Simple vs. Multiple Linear Regression

Linear Regression with Single Variable

Notebook for
practice
Simple vs. Multiple Linear Regression
Linear Regression with Single Variable
Notebook for practice

[Link]
official/Data_Science_Bootcamp/blob/master/Week3/Linear
_Regression/Introduction_to_Linear_Regression.ipynb
Linear Regression with Multiple Variable
Notebook for practice

[Link]
official/Data_Science_Bootcamp/blob/master/Week3/Linear
_Regression/Multiple_Linear_Regression.ipynb
Unit 3

Regression
Supervised ML Algorithms -
Regression
Evaluating a Regression Model
Learning Objectives

• Cost
• Cost Function
• Gradient Descent
Which line is good?
Now coming back to our first example. How do you decide what line is
good? Here's a bad line:

The line drawn above is way off. For example, according to the line, a
1000 sq foot house should sell for $310,000, whereas we know it
actually sold for $200,000.
Which line is good?
Here's a better line:

This line is off by an average of $8,333 (adding all the distances and
dividing by 3).

This $8,333 is called the cost of using this line.

Short-term Objective
What were we doing in the previous 2 examples? We plotted 2 straight lines using the
equation: y = mx+b.

If we already have the data points (x1, y1), ..., (xn, yn), it means that our values of x and y
remain the same throughout all the lines we plot.

So what remains? What exactly are we changing to plot different lines?

Yes, m and b.

Our objective is to find the values of m and b that will best fit this data.

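These 2 variables are the parameters of the model (also called coefficients or
weights). They are learned from the data. They should not be confused with
hyperparameters: in machine learning, a hyperparameter is a setting chosen before
training whose value is used to control the learning process.
And we must always try to find the optimal parameters while building a machine
learning model.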
Cost
The cost is how far off the line is from the real data. The best line is the one that is the least off from
the real data.

To find out what line is the best line (to find the values of m and b), we need to use a cost function.

In ML, cost functions are used to estimate how badly models are performing.

Put simply, a cost function is a measure of how wrong the model is in terms of its ability to estimate
the relationship between X and y.
Cost Function
What?

Now that we have built a model, we need to measure its performance and understand whether
it works well or not. A cost function measures the performance of a Machine Learning
model for given data. It quantifies the error between predicted values and expected
values and presents it in the form of a single real number.

Depending on the problem, the Cost Function can be formed in many different ways. The
purpose of this function is to be either:

● Minimized - then the returned value is usually called cost, loss or error. The goal is to
find the values of model parameters for which the Cost Function returns as small a number
as possible.
● Maximized - then the value it yields is named a reward. The goal is to find values of
model parameters for which the returned number is as large as possible.
What is predicted and expected value?
● Predicted value: As the name says is the predicted value of your machine learning model.

● Expected value: Is the true value(or the label present in your data)

Often machine learning models are not 100% accurate or perfect, they tend to deviate from the
true value or expected value.

Explaining with an example: If we are predicting the age of a person based on few input
variables or features.

● Our machine learning model predicted the age as 28 years

● However, the actual age of the person is 29 years.
● Here 28 years is predicted value and 29 years is expected value or true value. As data
scientists, we try to minimize the error while building models.
Cost Function

The difference between the true value and the model’s predicted value is
called residual.
Cost Function Types/ Evaluation Metrics

There are three primary metrics used to evaluate linear models (to find
how well a model is performing):

1. Mean Squared Error
2. Root Mean Squared Error
3. Mean Absolute Error
Mean Squared Error (MSE)
● MSE is simply the average of the squared difference between the true
target value and the value predicted by the regression model.

● As it squares the differences, it penalizes (gives some penalty or weight
for deviating from the objective) large errors far more than small ones, so a
few outliers can make the model look worse than it is on most of the data.
Root Mean Squared Error (RMSE)
● It is just the square root of the mean square error.

● It is preferred more in some cases because the errors are first

squared before averaging which poses a high penalty on large
errors. This implies that RMSE is useful when large errors are
undesired.
Mean Absolute Error(MAE)
● MAE is the average of the absolute differences between the target value and the value
predicted by the model.

● MAE weights every error in proportion to its size, so it does not penalize large errors
as heavily as MSE. This makes it less suitable for use-cases where you want to pay
more attention to the outliers.
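The three metrics differ only in how the individual errors are combined, which a short sketch makes concrete. The true and predicted ages below are made-up numbers; one prediction is deliberately far off to show how the metrics react to a large error.

```python
from math import sqrt

def mse(y_true, y_pred):
    """Mean of the squared errors."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    """Square root of the MSE, in the same units as the target."""
    return sqrt(mse(y_true, y_pred))

def mae(y_true, y_pred):
    """Mean of the absolute errors."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

# Hypothetical ages: true values and model predictions.
y_true = [29, 40, 35, 50]
y_pred = [28, 42, 35, 40]

print("MSE :", mse(y_true, y_pred))
print("RMSE:", rmse(y_true, y_pred))
print("MAE :", mae(y_true, y_pred))
```

Here the single error of 10 years accounts for most of the MSE, while in the MAE it counts only in proportion to its size.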
R Squared (Coefficient of determination)
● R-squared is a goodness-of-fit measure for linear regression models.
● It represents the proportion of the variation in the target values that
the model explains. The values from 0 to 1 are interpreted as
percentages.
● The higher the value is, the better the model is.

● Going by the name, you might think R2 cannot be negative. However, it
can. A negative R2 means the model is doing worse than always predicting the mean value.
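A small sketch shows where the 0-to-1 range and the negative values come from. It uses the usual definition, one minus the ratio of the residual sum of squares to the total sum of squares around the mean; the function name and the tiny data set are assumptions for illustration.

```python
def r_squared(y_true, y_pred):
    """1 - (residual sum of squares) / (total sum of squares around the mean)."""
    y_mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - y_mean) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

y_true = [1.0, 2.0, 3.0]
perfect = r_squared(y_true, [1.0, 2.0, 3.0])    # predictions match exactly
mean_only = r_squared(y_true, [2.0, 2.0, 2.0])  # always predict the mean
reversed_fit = r_squared(y_true, [3.0, 2.0, 1.0])  # worse than the mean

print(perfect, mean_only, reversed_fit)
```

Predicting the mean for every observation is the baseline that scores 0; anything that fits worse than that baseline scores below 0.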
Which metrics to use when?
This is an important question and we get used to learning these
measures over time. Sharing some resources with you all so that
it helps you understand what metrics to be used in the context of
solving a regression problem.

● [Link]
machine-learning-models-part-1-a99d7d7414e4
(you may ignore “Bonus” section in the article for time being)
Note: Gradient Descent is a slightly advanced topic.
Gradient
Gradient is another word for "slope". The higher the gradient of a graph at a point, the steeper
the line is at that point. A negative gradient means that the line slopes downwards.

Finding the gradient of a straight-line graph

It is often useful or necessary to find out what the gradient of a graph is. For a straight-line
graph, pick two points on the graph. The gradient of the line = (change in y-coordinate)/(change
in x-coordinate) .

In this graph, the gradient = (change in y-coordinate)/(change in x-coordinate)
= (8-6)/(10-6) = 2/4 = 1/2

We can, of course, use this to find the equation of the line. Since the line
crosses the y-axis when y = 3, the equation of this graph is y = ½x + 3 .
Gradient
Finding the gradient of a curve

To find the gradient of a curve, you must draw an accurate sketch of the curve. At
the point where you need to know the gradient, draw a tangent to the curve. A
tangent is a straight line which touches the curve at one point only. You then find the
gradient of this tangent.

Example
Find the gradient of the curve y = x² at the point (3, 9).

Gradient of tangent = (change in y)/(change in x)
= (9 - 5)/(3 - 2.3)
= 5.71 (approximately, since the values are read off a sketch; the exact
gradient of y = x² at x = 3 is 2x = 6)
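Reading a tangent off a sketch only gives an estimate (about 5.71 here). The same idea can be checked numerically by measuring the slope between two points very close to x = 3. The helper below is a generic central-difference estimate, added as an illustration rather than taken from the slides.

```python
def gradient_at(f, x, h=1e-6):
    """Central-difference estimate of the gradient of f at x."""
    return (f(x + h) - f(x - h)) / (2 * h)

slope = gradient_at(lambda x: x ** 2, 3.0)
print(slope)
```

The result is very close to 6, the exact gradient of y = x² at x = 3, which shows how much accuracy is lost when the tangent is drawn by hand.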
Gradient Descent
The cost function will tell you how good those values are (i.e. it will tell
you how far off your predictions were from the actual data). But what
do we do based on that information? How do we find the values of m
and b that will draw the best line? By using gradient descent.

In a nutshell, the model uses Gradient Descent to update the m and b values so as to
reduce the cost function (for example, minimizing the RMSE value) and arrive at the
best fit line. The idea is to start with random m and b values and
then iteratively update them until the cost reaches its minimum.

Let’s start with a simpler version of gradient descent, and then move on
to the real version.
Gradient Descent
Suppose we decide to leave b at zero. So we experiment with what value m
should be, always keeping b at 0. Now you can try various values for m, and you
will end up with different costs. You can plot all of these costs on a graph:
Gradient Descent
Here are the corresponding lines (remember, b is zero in these lines):

m = 75
m = 160

We can see that the line on the left seems to fit the data better than the
line on the right, so it makes sense that the cost of that line is lower. And
from this graph it looks like m = 75 gives us the lowest cost overall.
Gradient Descent

This is because m = 75 is the lowest point in this graph. With all the costs graphed out like
this, we just need to find the lowest point on the graph, and that will give us the
optimal value of m!
Gradient descent helps us find the lowest point on this graph. You start with a
value for m, and update it iteratively till you arrive at the best value. So you can
start at m = 0. Then you have to ask, should I go left or right?
Gradient Descent
Well, we want to go down, so lets go right a small step:

This is the new value for m. Again we ask, should we go left or right? At each
step, you need to head downward, till you get to a point where you're as low as
you can go:
Gradient Descent
This is gradient descent: going down bit by bit till you hit the bottom.

How do you figure out which way is down? The answer will be obvious
to calculus experts but not so obvious for the rest of us: you take the
derivative at that point.

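But the important bit to know is this: the derivative is the slope of the cost
curve at that point, so if you take the current value of m and subtract a small
multiple of the derivative (the multiplier is called the learning rate), you
will go down. You just do that a bunch of times (say 1000 times) and you will
hit bottom!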
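The loop described above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation from the video: the data points, the starting value and the learning rate are assumed values, and the cost used is the mean squared error (minimizing it gives the same m as minimizing the RMSE). Note that each step moves against the slope: the derivative times the learning rate is subtracted from m.

```python
# Toy data roughly following y = 3x (illustrative values, not from the slides).
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.1, 5.9, 9.2, 11.8]

def cost(m):
    """Mean squared error of the line y = m * x (b is kept at 0)."""
    return sum((m * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def gradient(m):
    """Derivative of the cost with respect to m."""
    return sum(2 * (m * x - y) * x for x, y in zip(xs, ys)) / len(xs)

m = 0.0                # starting guess
learning_rate = 0.01   # step size (assumed value)
for _ in range(1000):
    # Move against the slope, i.e. downhill on the cost curve.
    m -= learning_rate * gradient(m)

print(round(m, 4), round(cost(m), 4))
```

If the learning rate is too large, the steps overshoot the bottom and the cost can grow instead of shrink, which is why a small value is used here.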
Gradient Descent

The video in the next slide explains the process of gradient descent.
You only need to watch till 16:07 to gain some understanding and not
go into the python implementation explained later on.

Also, as the instructor said, there’s no need to dive into the

mathematics or worry about not understanding some math right now.
Gradient Descent and Cost Function
Recap

➔ Linear regression is used to predict a value (like the sale price

of a house).
➔ Given a set of data, first try to fit a line to it.
➔ The cost function tells you how good your line is.
➔ You can use gradient descent to find the best line.
Bias and Variance
Bias and Variance in real world
In dictionary terms :

Bias : Prejudice in favor of or against one thing, person, or group

compared with another, usually in a way considered to be unfair.

Variance: The state or fact of disagreeing or quarreling.

In short, bias represents how unfair something is towards others, and
variance represents how much something changes with respect to
others.

Confusing? Worry not. The next example will clarify all your doubts.
Example
Let’s assume you have called two weather examiners, Mr. Bishop and
Mr. Varian, to predict whether it will rain or not.

Mr. Bishop loves rain a lot. And Mr. Varian is a bookworm.

Let us talk about the conditions for rain.

- It rains only if it’s a little humid.

- It does not rain if it's windy, hot or freezing.

Mr. Bishop representing Bias
You ask Mr. Bishop (despite his training, he is too biased towards rain):

Me: Sir, it's extremely hot out here, will it rain?

Mr. Bishop: Yup.

Me: Sir, it's a little windy, will it rain?

Mr. Bishop: Maybe not.

Me: Sir, it's freezing, will it rain?

Mr. Bishop: Yes, of course.

Me: Sir, it's a little humid, will it rain?

Mr. Bishop: Damn sure.

Did you notice? Mr. Bishop is highly biased towards predicting rain. During the test, he
is unable to predict most of the cases correctly.

This condition is called underfitting.

Mr. Varian representing Variance
Now let us see your conversation with Mr. Varian (a bookworm who
completely remembers the training he had):

Me: Sir, it's extremely hot out here, will it rain?

Mr. Varian: Nope.

Me: Sir, it's a little windy, will it rain?

Mr. Varian: No way.

Me: Sir, it's freezing, will it rain?

Mr. Varian: No way.

Me: Sir, it's a little humid, will it rain?

Mr. Varian: Yes it will.
Mr. Varian representing Variance
Mr. Varian successfully predicted whether it will rain or not. But being a bookworm,
Mr. Varian knows nothing about conditions that were not described in the book during training.

Now, we ask Mr. Varian:

Me: Sir, there is a giant sitting on the cloud who lost his candy. Will it rain?
Mr. Varian: Not sure. Since the answer is “No” to most of the conditions, there is a
high possibility that it will not rain.

Now, although Mr. Varian's decisions follow the training conditions perfectly,
he is not able to predict for a new and unseen condition (other general conditions
apart from the specific conditions given during training).

This condition is called overfitting, and it generalizes poorly.

High Bias or High Variance?
Then what is better, high bias (a simple model that underfits) or high variance
(high accuracy on training data, but poor results on new data)?

Well, the answer is neither. We need neither high bias
nor high variance. We want our algorithm to perform well on the
training set and also give good results on unseen data (the test set).

In general, having high bias reduces the performance of the
algorithm on the training set, while having high variance reduces
performance on unseen data. Reducing one tends to increase the other.

This is known as the Bias-Variance Trade-off.

Reading Material

MUST READ
Understanding the Bias Variance Tradeoff:
[Link]
tradeoff-165e6942b229

Reference
[Link]
bias-variance-tradeoff-ec540fb13e12
Supervised ML Algorithms -
Regression
Decision Tree/ Regression Tree
Decision Trees
The decision tree is one of the most popular tools for classification and prediction.

A Decision tree is a flowchart like tree structure, where:

● each internal node denotes a test on an attribute/feature,
● each branch represents an outcome of the test, and
● each leaf node (terminal node) holds a class label (Yes and No in this case).
Decision Trees for Regression -Regression Trees
Decision Trees

Regression tree analysis is used when the

predicted outcome can be considered a real
number (e.g. the price of a house, or a patient's
length of stay in a hospital).
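To make the idea of a regression tree concrete, here is a minimal sketch of a tree with a single split (a "stump"). It is an assumption-laden illustration, not scikit-learn's algorithm: the function name best_split and the house data are hypothetical. The split is chosen to minimize the squared error, and each leaf predicts the mean of the target values that fall into it.

```python
def best_split(xs, ys):
    """Return (threshold, left_mean, right_mean) with the lowest squared error."""
    pairs = sorted(zip(xs, ys))
    best = None
    for i in range(1, len(pairs)):
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        # Sum of squared errors if each side predicts its own mean.
        sse = (sum((y - left_mean) ** 2 for y in left)
               + sum((y - right_mean) ** 2 for y in right))
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    return best[1:]

# Hypothetical data: house size (sq. m) and price (in thousands).
sizes = [50, 60, 70, 120, 130, 140]
prices = [100, 110, 105, 300, 310, 305]
threshold, small_price, large_price = best_split(sizes, prices)

def predict(size):
    """Predict a real number: the mean price of the matching leaf."""
    return small_price if size <= threshold else large_price

print(threshold, predict(55), predict(125))
```

A full regression tree repeats this split search recursively inside each leaf until a stopping rule (such as a maximum depth) is reached.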
Resources on Decision Tree Regressor

● Implementation of Decision Tree Regression:

[Link]
a-decision-tree-and-scikit-learn-ac98552b43d7
CART
You might encounter the term ‘CART’ while building ML models. It’s actually nothing new but the
same old Decision Tree since it can be used for both Classification and Regression

CART = Classification and Regression Trees, an umbrella term for:

Classification Trees: where the target variable is categorical and the tree is used to identify the
"class" into which the target variable would most likely fall.

Regression Trees: where the target variable is continuous and the tree is used to predict its value.
Supervised ML Algorithms -
Regression
Support Vector Regressor
Support Vector Regressor

Support Vector Machines (SVMs) are well known in

classification problems. However, the use of SVMs in
regression is not as well documented. These types of models
are known as Support Vector Regression (SVR).

SVR gives us the flexibility to define how much error is

acceptable in our model and will find an appropriate line to
fit the data.
Support Vector Regressor
Kernels
● SVM algorithms use a set of mathematical functions that are
defined as the kernel.

● The function of kernel is to take data as input and transform it

into the required form.

● Different SVM algorithms use different types of kernel functions.

● Examples are the linear kernel and nonlinear kernels such as polynomial,
radial basis function (RBF), and sigmoid.

● The most used type of kernel function is RBF. It is also the
default kernel in scikit-learn's SVR.
Support Vector Regression Implementation

[Link]
vector-regression-tutorial-for-machine-learning/

What is SVM and SVR

Supervised ML Algorithms -
Regression
Random Forest/ Regression Forest
Random Forest for regression - Regression Forest
Random forest builds multiple decision trees and merges them together to
get a more accurate and stable prediction.

Regression Forests (or Random forest Regressors) are an ensemble

(combination) of different regression trees (decision trees for regression).
Each leaf contains a distribution for the continuous output variable/s.
Random Forest for regression - Regression Forest
Ensemble Learning

Ensemble learning makes predictions by combining multiple machine learning
models, or the same machine learning model trained multiple times.
Difference between Bagging & Boosting

S.No. | Bagging | Boosting
1. | The simplest way of combining predictions that belong to the same type. | A way of combining predictions that belong to different types.
2. | Aim to decrease variance, not bias. | Aim to decrease bias, not variance.
3. | Each model receives equal weight. | Models are weighted according to their performance.
4. | Each model is built independently. | New models are influenced by the performance of previously built models.
5. | Different training data subsets are randomly drawn with replacement from the entire training dataset. | Every new subset contains the elements that were misclassified by previous models.
6. | Bagging tries to solve the over-fitting problem. | Boosting tries to reduce bias.
7. | If the classifier is unstable (high variance), then apply bagging. | If the classifier is stable and simple (high bias), then apply boosting.
8. | Example: The Random Forest model uses bagging. | Example: AdaBoost uses boosting techniques.
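The bagging column of the table can be sketched as follows. This is a toy illustration under stated assumptions: the "model" is deliberately trivial (it just predicts the mean of its training targets), and the target values and the number of models are made up. It shows the three bagging ingredients: subsets drawn with replacement, models built independently, and equal weights when combining.

```python
import random

def bootstrap_sample(data, rng):
    """Draw len(data) items with replacement from the training data."""
    return [rng.choice(data) for _ in data]

def mean_model(sample):
    """A deliberately trivial 'model': it predicts the mean of its sample."""
    return sum(sample) / len(sample)

rng = random.Random(0)                          # fixed seed for repeatability
targets = [10.0, 12.0, 11.0, 50.0, 9.0, 13.0]   # hypothetical training targets

# Each model is built independently on its own bootstrap sample.
models = [mean_model(bootstrap_sample(targets, rng)) for _ in range(25)]

# Every model receives equal weight in the final prediction.
bagged_prediction = sum(models) / len(models)
print(round(bagged_prediction, 2))
```

Boosting would differ in exactly these places: models are built one after another, later samples emphasize earlier mistakes, and the final vote is weighted by performance.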
Ensemble Learning

Types: Boosting and Bagging

Drawbacks of Decision Tree algorithm

● Computationally Expensive

● Unstable: predictions may deviate noticeably if the underlying data
changes
Random Forest
● Supervised learning algorithm which performs both classification and
regression
● Based on bagging: the trees are built independently, so they can be trained in parallel
Random Forest

● Advantages
○ Effective method for estimating missing data and maintains
accuracy when a large proportion of data are missing
○ Runs efficiently on large datasets

● Disadvantages
○ May observe random forest overfitting for some datasets with
noisy classification/regression tasks
Regression Forest - Implementation

Random forest regression Scikit learn tutorial

[Figure annotation: number of estimators (= number of trees) set to 10, 50 and 100]
Supervised ML Algorithms -
Classification
Logistic Regression
Logistic Regression
• Logistic Regression is one of the basic and popular algorithms for solving
binary classification problems
• For each input, logistic regression outputs the probability that the input
belongs to one of the 2 classes
⮚ A probability threshold is set, and it determines which
class the input is assigned to
• Binary classification problems (2 classes):
⮚ Emails (Spam / Not Spam)
⮚ Credit Card Transactions (Fraudulent / Not Fraudulent)
⮚ Loan Default (Yes / No)
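The two steps above (output a probability, then apply a threshold) can be sketched like this. It is a minimal illustration that assumes a single input feature x and hand-picked values for the weight and bias; in practice these parameters are learned from data.

```python
import math

def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(x, weight, bias):
    """Probability that the input belongs to class 1."""
    return sigmoid(weight * x + bias)

def predict_class(x, weight, bias, threshold=0.5):
    """Assign class 1 if the probability reaches the threshold, else class 0."""
    return 1 if predict_proba(x, weight, bias) >= threshold else 0

# Hypothetical parameters, chosen by hand for illustration only.
weight, bias = 0.2, -8.0

for x in (25, 40, 60):
    print(x, round(predict_proba(x, weight, bias), 3), predict_class(x, weight, bias))
```

Changing the threshold argument changes which inputs are assigned to class 1 without changing the probabilities themselves.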
Logistic Regression

Now, you may ask why don’t we use Linear Regression? Why do we
need a new algorithm?

Well, you would find all the answers in the video in the next slides.

The video in the next slide is a must watch, the instructor has
brilliantly explained about logistic regression!
Must Watch Understanding Logistic Regression
Logistic Regression
Classification
Binary - yes or no (Spam or no spam)
Multiclass - Which party will a person vote for: A, B or C?
Linear Regression vs Logistic
● Linear regression is used to solve regression problems with
continuous values
● Logistic regression is used to solve classification problems with
discrete categories
○ Binary classification (Classes 0 and 1)
○ Examples:
• Emails (Spam / Not Spam)
• Credit Card Transactions (Fraudulent / Not Fraudulent)
• Loan Default (Yes / No)
Linear Regression vs Logistic
● Let’s say a data scientist named John wants to predict whether a
customer will buy insurance or not
● Remember that linear regression is used to predict a continuous value,
where the output (y) may vary between -∞ (negative infinity) and +∞
(positive infinity), whereas in this case the target variable (y) takes only
two discrete values, 0 (No insurance) and 1 (Yes, got the insurance).
● John decides to extend the concepts of linear regression to fulfil his
requirement. One approach is to take the output of linear regression
and map it between 0 and 1: if the resulting output is below a certain
threshold (say 0.5), classify it as No (didn’t buy the insurance), whereas if
it is above the threshold, classify it as Yes (bought
the insurance)
Linear Regression vs Logistic
● We then plot a simple linear regression line and set the threshold as 0.5
○ Negative class (Insurance = No)– Age on the left side
○ Positive class (Insurance = Yes) – Age on the right side
Now imagine there is an outlier far to the right.
[Figure: an additional outlier that distorts the regression line]

● As we can see, a single outlier in the data distorts the whole linear regression
line.
● Clearly, the fitted line is no longer able to separate the classes
● The boundary should have been at the vertical yellow line, which divides
the positive and negative classes, i.e. yes or no for insurance
Happy John! (Data Scientist)
● Well, life would be much simpler if we had an algorithm that
would fit the points like below, right? It is a much better fit compared
to the regression line!
Unit 4

Classification and Performance

Metrics
Supervised ML Algorithms -
Classification
Evaluating the performance of a
Classification Model
Learning Objectives

● Evaluating the Performance of Logistic Regression model
● Why not Accuracy?
● Which metrics to use when?
Evaluating the Performance of Logistic Regression model

● Model Evaluation is a very important part in any analysis to answer

the following questions:

How well does the model fit the data?, Which predictors are most
important?, Are the predictions accurate?

● Guess what, evaluating a Classification model is not as simple as

Linear Regression.

● But why?

● You must be wondering ‘Can’t we just use accuracy of the model as

the holy grail metric?’
Accuracy

Classification Accuracy is what we usually mean, when we use the

term accuracy. It is the ratio of number of correct predictions to the
total number of input samples.

Actual Labels: Task at hand - Separate Yellow and

pink balls
Predicted Labels: Colors predicted by our model

Correctly predicted = 4
Total Predictions = 6
Accuracy = 4/6 = 66.7%
Why not Accuracy?
● Accuracy is very important, but it might not be the best metric all the time. Let’s
look at why with an example -:

● Let’s say we are building a model which predicts if a transaction is fraudulent

or not

● Let’s imagine, we build a basic model which always predicts that a transaction
is not fraudulent. Guess what would be the accuracy of this model?
~99% !! (You may ask why? Well, less than 1% transactions are usually
fraudulent and there is a huge class imbalance. So even if you fit a wrong
model that always predicts a transaction to be not fraudulent, the accuracy will
remain 99% owing to class imbalance)

● Impressive, right? Well, the probability of a bank buying this model is absolute
zero.

● In a problem where there is a large class imbalance, a model can predict the
value of the majority class for all predictions and achieve a high
classification accuracy.

● While our model has a stunning accuracy, this is an apt example where
accuracy is definitely not the right metric.
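The fraud example above can be checked with a few lines of Python. The numbers are assumptions chosen to match the "about 1% fraud" scenario: 1,000 transactions, 10 of them fraudulent, and a model that always predicts "not fraudulent".

```python
# Hypothetical data: 1 = fraudulent, 0 = not fraudulent (1% fraud).
labels = [1] * 10 + [0] * 990

# A "model" that always predicts the majority class.
predictions = [0] * len(labels)

correct = sum(p == y for p, y in zip(predictions, labels))
accuracy = correct / len(labels)

# How many of the actual frauds did the model catch?
frauds_caught = sum(p == 1 and y == 1 for p, y in zip(predictions, labels))
recall = frauds_caught / sum(labels)

print(accuracy, recall)
```

The accuracy looks excellent while the model catches none of the fraudulent transactions, which is exactly why accuracy alone is misleading here.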
Why not Accuracy?
Watch till 1 min 14 secs to understand why accuracy can be a bad metric for model
performance
Evaluating the Performance of Logistic Regression model

Logistic Regression employs different sets of metrics than Linear

Regression. Here, we deal with probabilities and categorical values.

In the following slides, we describe a few of the evaluation metrics

used for Logistic Regression:
Is confusion matrix confusing or it resolves the confusion?
You decide!

Confusion Matrix
A confusion matrix is a table that is often used to describe the performance of
a classification model (or "classifier") on a set of test data for which the true
values are known. The confusion matrix itself is relatively simple to
understand, but the related terminology can be confusing.

Let's start with an example confusion matrix for a binary classifier for disease
prediction (though it can easily be extended to the case of more than two
classes):
Confusion Matrix
Let's now define the most basic terms, which are whole numbers (not
rates):

● true positives (TP): These are cases in which we predicted yes (they
have the disease), and they do have the disease.

● true negatives (TN): We predicted no, and they don't have the
disease.

● false positives (FP): We predicted yes, but they don't actually have the
disease. (Also known as a "Type I error.")

● false negatives (FN): We predicted no, but they actually do have the
disease. (Also known as a "Type II error.")

I know these seem hard to memorise. One thing that has helped me
remember them is rephrasing them like this:

false positives = falsely classified as being positive.

Confusion Matrix

This is a list of rates that are often computed from a confusion matrix for
a binary classifier:

● Precision: Correctly predicted positives compared to the total
predicted as positive
Precision = TP/(TP + FP) = 100/110 = 0.91

● Sensitivity/Recall: Correctly predicted positives compared to the
total number of actual positives
Recall = TP/(TP + FN) = 100/(100 + 5) = 0.95

Note: There is usually a trade-off between the two. Pushing Precision up
tends to pull Recall down and vice versa, so we often have to prioritize one.

● Specificity: Correctly predicted negatives compared to the total
number of actual negatives
Specificity = TN/(TN + FP) = 50/(50 + 10) = 0.83
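The rates above can be reproduced directly from the four confusion-matrix counts used in the example (TP = 100, FP = 10, FN = 5, TN = 50). The variable names are just for illustration.

```python
# Counts from the disease-prediction confusion matrix in the example.
tp, fp, fn, tn = 100, 10, 5, 50

precision = tp / (tp + fp)      # of all predicted "yes", how many were right
recall = tp / (tp + fn)         # of all actual "yes", how many were found
specificity = tn / (tn + fp)    # of all actual "no", how many were found

print(round(precision, 2), round(recall, 2), round(specificity, 2))
```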
Understanding Precision and Recall
Think about the search box on Amazon home page.

The precision is the proportion of relevant results( correctly predicted yes) in

the list of all returned search results(total predicted yes).

The recall is the ratio of the relevant results( correctly predicted yes)
returned by the search engine to the total number of the relevant results
that could have been returned (total actual yes).
Choosing between Sensitivity and Specificity
Often, the sensitivity and specificity of a test are inversely related. Selecting
the optimal balance of sensitivity and specificity depends on the objective of
the problem that needs to be solved.

If correctly identifying the positive class is important for us, then we should
choose a model with higher Sensitivity. However, if correctly identifying the
negative class is more important, then we should choose a model with higher
Specificity.
Sensitivity or Specificity - an example
Let’s say we are predicting if a patient has cancer or not. The default
probability threshold is kept at 0.5 i.e

Class 0 (No cancer) – Below 0.5

Class 1 (Cancer) – Above 0.5
Case 1: Higher Specificity
Suppose we want to predict Class 1 (i.e. the patient has cancer) only if we are VERY confident.
(To avoid giving the patient a shock and to avoid unnecessary treatment)

We can instead change this threshold to 0.7. Thus, we’ll tell someone they have cancer only
if we think they have a 70% or greater chance of having cancer.

Look at the graph below. Since the threshold has shifted to the right, fewer healthy people
are wrongly flagged as having cancer (fewer false positives), so more of the people without
cancer are correctly identified as negative. Thus, the specificity has increased, at the cost
of missing some real cases. (We are being very specific when declaring that a patient has cancer.)
Case 2: Higher Sensitivity
Suppose we want to avoid missing too many cases of cancer (avoid false negatives). If a
person with cancer is told that he’s well, it can cause a delay in treatment and badly affect
his health.

In this case we can set a lower threshold, say 0.25. Even if a patient has a 25% chance of
having cancer, we’ll inform him/her.

Looking at the graph you can see that the threshold has shifted to the left. Most of the
people with cancer will be detected in advance in this case. We have completely (or almost)
eliminated False Negatives, at the cost of more False Positives. It will thus result in higher
Sensitivity/Recall. (We are being sensitive in detecting a disease, i.e. a really sensitive test.)
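The effect of moving the threshold in both cases can be sketched on a small made-up example. The predicted probabilities and true labels below are assumptions for illustration only; the thresholds 0.25, 0.5 and 0.7 are the ones discussed above.

```python
# Hypothetical predicted probabilities of cancer and the true labels (1 = cancer).
probs = [0.10, 0.20, 0.30, 0.40, 0.60, 0.65, 0.80, 0.90]
labels = [0, 0, 1, 0, 0, 1, 1, 1]

def sensitivity_specificity(threshold):
    """Return (sensitivity, specificity) when predicting 1 for prob >= threshold."""
    preds = [1 if p >= threshold else 0 for p in probs]
    tp = sum(p == 1 and y == 1 for p, y in zip(preds, labels))
    fn = sum(p == 0 and y == 1 for p, y in zip(preds, labels))
    tn = sum(p == 0 and y == 0 for p, y in zip(preds, labels))
    fp = sum(p == 1 and y == 0 for p, y in zip(preds, labels))
    return tp / (tp + fn), tn / (tn + fp)

for threshold in (0.25, 0.5, 0.7):
    print(threshold, sensitivity_specificity(threshold))
```

On this toy data, lowering the threshold raises sensitivity and lowers specificity, and raising it does the opposite.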
You can watch this video from 00:58 to 5:32 explaining the Sensitivity and Specificity trade off
Confusion Matrix
Talking about accuracy, our favourite metric!

Accuracy is defined as the ratio of correctly predicted examples by the total

examples.

● Accuracy: Overall, how often is the classifier correct?

= (TP+TN)/total = (100+50)/165 = 0.91

● Remember, accuracy is a very useful metric when all the classes are
equally important.

● But this might not be the case if we are predicting if a patient has cancer.
In this example, we can probably tolerate FPs but not FNs.

● If a cancerous patient is wrongly reported as being fine, it can result in

delaying of treatment. Which is not good!
Confusion Matrix
So you’ve already learnt how to calculate Precision and Recall and how
changing the threshold can affect their values (similar to the Sensitivity
and Specificity threshold).

But do we necessarily need to spend time on varying the threshold to get the
perfect Precision and Recall? Or is there a way to choose this threshold
automatically?

Let’s take 3 algorithms and try to find a metric for combining Precision and
Recall.

How about taking an average of Precision and Recall? (P+R)/2

            | Precision (P) | Recall (R) | Average
Algorithm 1 | 0.5           | 0.4        | 0.45
Algorithm 2 | 0.7           | 0.1        | 0.4
Algorithm 3 | 0.02          | 1.0        | 0.51

F1 Score
The average tells us that Algorithm 3 is the best (highest value). But
Algorithm 3 is a dumb model that predicts y = 1 every time and thus gets a
recall of 1 (FN = 0) with a very low precision.

That means the average isn’t a good metric.

Researchers found a metric that solves our purpose: the F1 Score! It is the
harmonic mean of Precision and Recall: F1 = 2 × P × R / (P + R)

F1 Score
Let’s apply F1 Score to our problem:

            | Precision (P) | Recall (R) | Average | F1 Score
Algorithm 1 | 0.5           | 0.4        | 0.45    | 0.444
Algorithm 2 | 0.7           | 0.1        | 0.4     | 0.175
Algorithm 3 | 0.02          | 1.0        | 0.51    | 0.0392

The F1 score tells us that Algorithm 1 is the best (highest F1 Score).

● For F1 Score to be large, both P and R need to be large.

● It’ll be highest(1) when both P and R are 1

● Accuracy can be used when the class distribution is similar while F1-
score is a better metric when there are imbalanced classes.
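The F1 values in the table can be reproduced with a short script. It assumes the usual definition of the F1 score as the harmonic mean of precision and recall, F1 = 2PR / (P + R); the function and variable names are illustrative.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0 if both are 0)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# (precision, recall) pairs from the table above.
algorithms = {
    "Algorithm 1": (0.5, 0.4),
    "Algorithm 2": (0.7, 0.1),
    "Algorithm 3": (0.02, 1.0),
}

scores = {name: f1_score(p, r) for name, (p, r) in algorithms.items()}
best = max(scores, key=scores.get)

for name, score in scores.items():
    print(name, round(score, 4))
print("Best by F1:", best)
```

Because the harmonic mean is dominated by the smaller of the two values, a model cannot get a high F1 score by maximizing only one of precision or recall.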
ROC (Receiver Operating Characteristic) Curve
● An ROC curve is a commonly used way to visualize the performance of a
binary classifier, meaning a classifier with two possible output classes.

● It shows the performance of a classification model at all threshold values.

● It plots 2 parameters:

1. True Positive Rate (TPR), also called Recall = TP/(TP + FN)

2. False Positive Rate (FPR) = FP/(FP + TN)

AUC Curve
● AUC stands for "Area under the ROC Curve." That is, AUC measures the
entire two-dimensional area underneath the entire ROC curve

● AUC provides an aggregate measure of performance across all possible

classification thresholds.
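As a rough sketch of how an ROC curve and its AUC come about, the code below sweeps the threshold over a small made-up set of predicted probabilities, records the (FPR, TPR) point at each threshold, and measures the area under the resulting curve with the trapezoid rule. The data and function names are assumptions for illustration; libraries such as scikit-learn provide ready-made versions.

```python
def roc_points(probs, labels):
    """Return (FPR, TPR) points, one per threshold, from strict to lenient."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]  # threshold above every score: nothing predicted positive
    for threshold in sorted(set(probs), reverse=True):
        tp = sum(p >= threshold and y == 1 for p, y in zip(probs, labels))
        fp = sum(p >= threshold and y == 0 for p, y in zip(probs, labels))
        points.append((fp / negatives, tp / positives))
    return points

def auc(points):
    """Area under the curve using the trapezoid rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area

# Hypothetical predicted probabilities and true labels.
probs = [0.10, 0.20, 0.30, 0.40, 0.60, 0.65, 0.80, 0.90]
labels = [0, 0, 1, 0, 0, 1, 1, 1]

points = roc_points(probs, labels)
print(points)
print(auc(points))
```

An AUC of 1.0 would mean that every positive example is scored above every negative one, while 0.5 corresponds to random ranking.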
ROC and AUC Explained
Reading Material

MUST READ
An excellent article explaining Threshold, ROC and AUC in a simple
manner: [Link]
auc-curves-a05b68550b69
Which metrics to use when?

Which metrics to use when?
This is an important question, and we get used to choosing these
measures over time. Here are some resources to help you
understand which metrics to use in the context of
solving a classification problem.

● 5 Classification Metrics every data scientist must know

[Link]
evaluation-metrics-you-must-know-aa97784ff226

● [Link]
for-evaluating-machine-learning-models-part-2-
86d5649a5428
SUPERVISED LEARNING - BUILDING YOUR FIRST
CLASSIFICATION AND REGRESSION MODELS

[Link]
upervised-ml-model-building-walkthrough
Feature Selection
In real life data science problems, often the data consist of a
large number of attributes or features.

This means selecting the most relevant features as the input to

machine learning model is absolutely vital.
Curse of Dimensionality
Refers to various phenomena that arise when analyzing and
organizing data in high-dimensional spaces (often with
hundreds or thousands of dimensions) that do not occur in
low-dimensional settings such as the three-dimensional
physical space of everyday experience.

As the dimensionality increases, the volume of the space

increases so fast that the available data become sparse.

This sparsity is problematic for any method that requires

statistical significance.

Therefore it is essential to quality control the data that is being

fed to the model.
Feature Selection Methods:
Correlation coefficient

Variance threshold

Recursive Feature Elimination

LASSO regularization

Random Forest Importance

Preserve variance
Linear Discriminant analysis
Principal Component Analysis
Singular value decomposition
Correlation Coefficient
● Measure of the linear relationship
of 2 or more variables.
● Can predict one variable from the
other.
● Good variables are highly
correlated with the target.
● Variables should be correlated
with the target but should be
uncorrelated among themselves.
● If two variables are correlated, we
can predict one from the other.
Therefore, if two features are
correlated, the model only really
needs one of them
● Pearson Correlation
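The Pearson correlation mentioned above can be computed by hand as the covariance of two variables divided by the product of their standard deviations. The sketch below uses made-up feature and target values to show the two checks from the list: correlation of a feature with the target, and correlation between two features.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

# Hypothetical data: feature_b is just feature_a doubled, so it is redundant.
feature_a = [1, 2, 3, 4, 5]
feature_b = [2, 4, 6, 8, 10]
target = [1.1, 1.9, 3.2, 3.9, 5.1]

print(round(pearson(feature_a, target), 3))     # feature vs. target
print(round(pearson(feature_a, feature_b), 3))  # feature vs. feature
```

Here both features are strongly correlated with the target, but since they are also perfectly correlated with each other, the model only really needs one of them.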
Variance Threshold - Remove Low Var Features
● Removes features whose variance falls below a pre-defined
cutoff value.
● It is based on the notion that features that do not vary
much within themselves have low predictive power.
● Weakness: it does not consider the relationship of the input
features with the output feature.
● Before performing variance thresholding, all features
should be brought to the same scale (for example, rescaled to a
common range), since variance depends on the units of a feature.
Note that standardizing to unit variance would give every feature
the same variance and make the threshold meaningless.
● Scikit-learn provides the VarianceThreshold class to
perform the variance threshold method.
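The idea behind variance thresholding can be sketched without any library. This is a simplified stand-in for illustration, not scikit-learn's VarianceThreshold itself: the function name, the column names, the data and the cutoff of 0.2 are all assumptions.

```python
from statistics import pvariance

def variance_threshold(columns, threshold):
    """Return the names of the columns whose variance is above the threshold."""
    return [name for name, values in columns.items()
            if pvariance(values) > threshold]

# Hypothetical feature columns.
columns = {
    "constant": [1, 1, 1, 1],          # variance 0
    "almost_constant": [0, 0, 0, 1],   # variance 0.1875
    "varied": [1, 5, 2, 8],            # variance 7.5
}

kept = variance_threshold(columns, threshold=0.2)
print(kept)
```

Note that the target variable never appears in this code, which is the weakness listed above: a low-variance feature is dropped even if it happens to be informative.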
Variance Threshold - Remove Low Var Features

● Variance Threshold Documentation: [Link]

[Link]/stable/modules/generated/sklearn.feature_sele
[Link]
Recursive feature elimination (RFE)
● Given an external estimator that assigns weights to
features (e.g., the coefficients of a linear model), the goal
of RFE is to select features by recursively considering
smaller and smaller sets of features.
● First, the estimator is trained on the initial set of features
and the importance of each feature is obtained either
through a coef_ attribute or through a
feature_importances_ attribute.
● The least important features are removed from the
current set of features.
● That procedure is recursively repeated on the
pruned set until the desired number of
features to select is eventually reached.
LASSO Regularization L1
● Regularization consists of adding a penalty to the
different parameters of the machine learning model to
reduce the freedom of the model, i.e. to avoid over-
fitting.
● In linear model regularization, the penalty is applied over
the coefficients that multiply each of the predictors.
● Among the different types of regularization, Lasso (L1)
has the property that it is able to shrink some of the
coefficients exactly to zero.
● Therefore, the features whose coefficients become zero can be
removed from the model.
Random Forests Importance
● Random Forests is a kind of a Bagging Algorithm that
aggregates a specified number of decision trees.
● The tree-based strategies used by random forests
naturally rank features by how well they improve the purity of the
node, or in other words by the decrease in impurity (Gini
impurity) over all trees.
● Nodes with the greatest decrease in impurity happen at
the start of the trees, while nodes with the least decrease
in impurity occur at the end of the trees.
● Thus, by pruning trees below a particular node,
we can create a subset of the most important features.
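The "decrease in impurity" used for ranking can be made concrete with a small sketch. It assumes the standard definition of Gini impurity (one minus the sum of squared class proportions) and compares a split on a useful feature with a split on a useless one; the labels and function names are made up for illustration.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 0 for a pure node, higher for a mixed one."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def impurity_decrease(parent, left, right):
    """Impurity of the parent minus the size-weighted impurity of its children."""
    n = len(parent)
    weighted = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
    return gini(parent) - weighted

parent = ["yes"] * 4 + ["no"] * 4

# A useful feature separates the classes perfectly.
good = impurity_decrease(parent, ["yes"] * 4, ["no"] * 4)

# A useless feature leaves both children as mixed as the parent.
bad = impurity_decrease(parent, ["yes", "yes", "no", "no"], ["yes", "yes", "no", "no"])

print(good, bad)
```

A random forest adds up such decreases for every feature across all of its trees, and the totals give the importance ranking.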
Supervised ML Algorithms -
Classification
Decision Tree/ Classification Tree
Decision Trees for Classification -Classification Trees
Decision Trees
Classification tree methods (i.e., decision tree
methods) are recommended when the task contains
classifications or predictions of outcomes, and the goal
is to generate rules that can be easily explained and
translated into SQL or a natural query language.

A Classification tree labels, records, and assigns

variables to discrete classes.

A Classification tree can also provide a measure of

confidence that the classification is correct.
CART
You might encounter the term ‘CART’ while building ML models. It’s actually
nothing new but the same old Decision Tree.

CART = Classification and Regression Trees, an umbrella term for:

Classification Trees: where the target variable is categorical and the tree is used
to identify the "class" into which the target variable would most likely fall.

Regression Trees: where the target variable is continuous and the tree is used to
predict its value.
Resources on Decision Tree Classification

● Implementation of Decision Tree Classification:

[Link]
tree-classification-python

● Notebooks:
[Link]
notebooks-intro-to-model-building
[Link]
_improvised_model
Supervised ML Algorithms -
Classification
Support Vector Machine (SVM)
SVM’s Objective
The objective of SVMs is to categorize data into two classes. It does so by finding a
separating hyperplane(decision boundary), where the distance between itself and the
closest data points for both categories is maximized.

Look at how the hyperplane lies exactly between the nearby blue and red points
(maximising the margin)
Simple Visual Explanation of SVM

[Link]
vector-machine-svm-a-visual-simple-explanation-part-1-
a7efa96444f2
Pros and Cons — SVM
Pros:
● It is useful for both linearly Separable (hard margin) and
Non-linearly Separable (soft margin) data.
● It is effective in high dimensional spaces.
● It is effective in cases where the number of dimensions is
greater than the number of samples.
● It uses a subset of training points in the decision function
(called support vectors), so it is also memory efficient.

Cons:
● Picking the right kernel and parameters can be
computationally intensive.
● It also doesn’t perform very well when the data set is
noisy, i.e. when the target classes overlap.
● SVM doesn’t directly provide probability estimates; these
are calculated using an expensive five-fold cross-validation.
Support Vector Machine (SVM)
SVM Implementation
SVM Notebook

[Link]
[Link]
Kernels
● SVM algorithms use a set of mathematical functions that
are defined as the kernel.

● The function of kernel is to take data as input and

transform it into the required form.

● Different SVM algorithms use different types of kernel
functions.

● Examples are the linear kernel and nonlinear kernels such as
polynomial, radial basis function (RBF), and sigmoid.

● The most used type of kernel function is RBF. It is also the
default kernel in scikit-learn's SVC.
Supervised ML Algorithms -
Classification
Naive Bayes Classifier
Naive Bayes

● Naive Bayes is the most straightforward and fast classification

algorithm, which is suitable for a large chunk of data.

● Naive Bayes classifier is successfully used in various

applications such as spam filtering, text classification,
sentiment analysis, and recommender systems.

● It uses Bayes theorem of probability for prediction of

unknown class.
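To show how Bayes' theorem turns word probabilities into a class prediction, here is a tiny spam-filter sketch. All probabilities are invented for illustration, and the "naive" assumption is that the words are independent of each other given the class.

```python
# Hypothetical prior probabilities of each class.
p_spam, p_ham = 0.4, 0.6

# Hypothetical probabilities of seeing each word in a spam / non-spam email.
p_word_given_spam = {"free": 0.5, "meeting": 0.05}
p_word_given_ham = {"free": 0.1, "meeting": 0.3}

def posterior_spam(words):
    """P(spam | words) via Bayes' theorem with the naive independence assumption."""
    spam_score, ham_score = p_spam, p_ham
    for word in words:
        spam_score *= p_word_given_spam[word]
        ham_score *= p_word_given_ham[word]
    # Normalize so that the two posteriors add up to 1.
    return spam_score / (spam_score + ham_score)

print(round(posterior_spam(["free"]), 3))
print(round(posterior_spam(["meeting"]), 3))
```

The predicted class is simply the one with the larger posterior, so an email containing "free" would be labelled spam here and one containing "meeting" would not.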
Advantages and Disadvantages

● Advantages:
This algorithm requires a small amount of training data
to estimate the necessary parameters. Naive Bayes
classifiers are extremely fast compared to more
sophisticated methods.

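● Disadvantages: Although Naive Bayes is a decent classifier, it is
known to be a bad estimator of probabilities, so its predicted
probabilities should not be taken too seriously.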
In-depth explanation & Implementation
Titanic Survival

[Link]
ayes/14_naive_bayes_1_titanic_survival_prediction.ipynb
Multinomial Gaussian NB for spam detection

Note:
In the next video, the instructor has used the spam detection
example that involves some operations around text (NLP).

You don’t need to worry about those parts and instead just
focus on the parts revolving around Naive Bayes.
Multinomial Gaussian NB for spam detection
Count Vectorization
Email Spam filter

[Link]
.ipynb
Naive Bayes Classification using Scikit-learn

[Link]
scikit-learn
Supervised ML Algorithms -
Classification
Random Forest
Random Forest
Random forest is a flexible, easy-to-use machine learning algorithm
that produces a great result most of the time, even without
hyper-parameter tuning.

It is also one of the most used algorithms, because of its simplicity and
diversity (it can be used for both classification and regression
tasks).

Random forest builds multiple decision trees and merges them

together to get a more accurate and stable prediction.
Random Forest: Ensemble tree based algorithm
It is an ensemble tree-based learning algorithm. The Random Forest
Classifier is a set of decision trees, each built from a randomly selected
subset of the training set. It aggregates the votes from the different
decision trees to decide the final class of the test object.
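The vote aggregation step can be sketched in a couple of lines. The votes below are hypothetical outputs of five individual trees for one test object; how each tree arrives at its vote is not shown.

```python
from collections import Counter

def majority_vote(votes):
    """Return the class predicted by the largest number of trees."""
    return Counter(votes).most_common(1)[0][0]

# Hypothetical predictions of five trees for a single test object.
tree_votes = ["fraud", "not fraud", "fraud", "fraud", "not fraud"]

final_class = majority_vote(tree_votes)
print(final_class)
```

For regression forests the same step is an average of the trees' numeric predictions instead of a vote.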
Random Forest: Ensemble tree based algorithm
Random Forest - Identifying the numbers - 0-9

[Link]
_forest/11_random_forest.ipynb
Applications
● The random forest algorithm is used in many different fields, such as
banking, the stock market, medicine and e-commerce.

● In finance, for example, it is used to identify customers who are more
likely to repay their debt on time or to use a bank's services more
frequently. In this domain it is also used to detect fraudsters out to
scam the bank. In trading, the algorithm can be used to predict
a stock's future behavior.

● In the healthcare domain it is used to identify the correct
combination of components in medicine and to analyze a patient’s
medical history to identify diseases.

● In e-commerce, random forest is used to predict whether a
customer will actually like a product or not.
Resources on Random Forest Classification

● Implementation of Random Forest Classification:
[Link]
/random-forests-classifier-python

● Notebooks:
[Link]
andom-forest
Learning Objectives

● Feature Importance & Selection
● Benefits of Feature Selection
● Feature Selection Techniques

Problem & Solution
Problem:

Gone are the days when you had 5 variables to fit your linear
regression. Modern datasets contain many more variables/features to
choose from: a dataset may have 50 or more features and more than 1
million observations.

Solution: Feature Importance and Selection

We will come to why/benefits very soon!

Feature Importance and Feature Selection

What, why and how?

Feature Importance and Feature Selection
Feature Importance refers to techniques that assign a score to
input features based on how useful they are at predicting a
target variable.

Feature Selection is the process where you automatically or
manually select the features that contribute most to predicting
your target variable.

In short, Feature Importance Scores are used for performing
Feature Selection.
Iris Dataset → Feature Importance → Feature Selection (manual or automatic selection)
How?
Suppose we’re working on Iris classification. We’ll first create a baseline model
using Logistic Regression. Now, we want to try out Feature Selection to
improve our model’s performance. Plotting the feature importance scores
shows the following:

● Feature Importance Scores tell us that petal width and petal length are the top 2
features. The rest have a much lower importance score.
● We’ll select these 2 features.
● We’ll transform our existing dataset to contain only these 2 features.
● We’ll train our model on this transformed dataset.
● Finally, we’ll compare the evaluation metrics of our initial Logistic Regression
model with those of this new model.
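
The steps above can be sketched as follows. The slides do not say how the importance scores were computed, so this sketch assumes they come from a random forest's `feature_importances_`; the train/test split and `random_state` values are also assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=0, stratify=iris.target
)

# Step 1: baseline Logistic Regression on all four features.
baseline = LogisticRegression(max_iter=1000).fit(X_train, y_train)
baseline_acc = baseline.score(X_test, y_test)

# Step 2: importance scores (assumption: taken from a random forest).
forest = RandomForestClassifier(n_estimators=200, random_state=0)
forest.fit(X_train, y_train)
top2 = np.argsort(forest.feature_importances_)[::-1][:2]
top2_names = [iris.feature_names[i] for i in top2]

# Steps 3-4: keep only the top 2 features and retrain.
reduced = LogisticRegression(max_iter=1000).fit(X_train[:, top2], y_train)
reduced_acc = reduced.score(X_test[:, top2], y_test)

# Step 5: compare the two models on the same held-out data.
print("Top 2 features:", top2_names)
print(f"All features: {baseline_acc:.3f}  Top 2 only: {reduced_acc:.3f}")
```

On a dataset this small the two accuracies are usually close; the point of the comparison is to check that dropping features did not hurt the model.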
Why Feature Selection?
You already know a number of optimization methods by now and might
wonder why we need to reduce our data with feature selection if we can
just optimize.

There’s something known as “The curse of dimensionality”.

In machine learning,
“dimensionality” = number of features (i.e. input variables) in your dataset.

When the number of features is very large relative to
the number of observations (rows) in your dataset,
certain algorithms struggle to train effective models.
This is called the Curse of Dimensionality.
Why Feature Selection?

Image source: TowardsDataScience

Curse of dimensionality analogy
Let's say you have a straight line 100 yards long and you dropped a penny
somewhere on it. It wouldn't be too hard to find. You walk along the line
and it takes two minutes.

Now let's say you have a square 100 yards on each side and you dropped a
penny somewhere on it. It would be pretty hard, like searching across two
football fields stuck together. It could take days.

Now a cube 100 yards across. That's like searching a 30-story building the
size of a football stadium. Ugh.

Searching through the space gets a lot harder as you add more
dimensions.
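
One way to see the same effect numerically is sketched below with NumPy. With a fixed number of random points, the nearest point to a query ends up almost as far away as the farthest one once the number of dimensions grows, so "closeness" carries less and less information. The function name, the point count of 500 and the list of dimensions are hypothetical choices made for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def nearest_to_farthest_ratio(n_points, n_dims):
    """Ratio of the nearest to the farthest distance from a random query point."""
    points = rng.random((n_points, n_dims))   # points in the unit hypercube
    query = rng.random(n_dims)
    dists = np.linalg.norm(points - query, axis=1)
    return dists.min() / dists.max()

# Same number of points each time; only the dimensionality changes.
ratios = {d: nearest_to_farthest_ratio(500, d) for d in (1, 2, 10, 100, 1000)}
for d, ratio in ratios.items():
    print(f"{d:>4} dimensions: nearest/farthest = {ratio:.3f}")
```

A ratio near 0 means some points are much closer than others; a ratio near 1 means every point is roughly equally far away, which is what makes distance-based learning hard in high dimensions.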
Benefits of performing Feature Selection
You might’ve gotten an idea of why feature selection is required by now.

Feature Selection helps us with the following:

● Reduces Overfitting: Less redundant data means less opportunity
to make decisions based on noise (irrelevant data).

● Improves Model Performance: Less misleading data means our
model’s performance improves.

● Reduces Training Time: Less data means that algorithms train
faster.

