Handling Imbalanced Datasets in ML
Imbalanced Dataset
● For example, suppose you have credit card transaction data and you are
supposed to predict fraudulent transactions. You'll likely have 10,000
authentic transactions for every 1 fraudulent transaction. That's quite an
imbalance!
● In machine learning terms: Often you'll have a large amount of
data/observations for one class (referred to as the majority class), and
much fewer observations for one or more other classes (referred to as the
minority classes).
● The problem is that machine learning models trained on
unbalanced datasets often perform poorly when they have to
generalize (predict a class or classify unseen observations). Regardless of
the algorithm you choose, unbalanced data is a risk, though some models
are more susceptible to it than others. Ultimately, this means you will not
end up with a good model, and the reasons include:
● Categorical Features:
○ ProductCD
○ card1 - card6
○ addr1, addr2
○ Pemaildomain Remaildomain
○ M1 - M9 (bank sensitive data)
Note: Some of the feature/variable description is not given as
About the Dataset
Identity Table *
● Variables in this table are identity information – network connection information (IP,
ISP, Proxy, etc) and digital signature (UA/browser/os/version, etc) associated with
transactions.
● They're collected by Vesta’s fraud protection system and digital security partners.
● (The field names are masked and pairwise dictionary will not be provided for privacy
protection and contract agreement)
● Categorical Features:
○ DeviceType
○ DeviceInfo
○ id12 - id38
Notebook Link
[Link]
notebooks-handling-imbalanced-classes
Pre-processing – what, why and how?
What?
Pre-processing is the process of preparing the data for training.
Why?
• Data is not ready-made for us:
✔ Missing values
✔ Incorrect data entries
✔ Class imbalance
✔ Different scales of data
How?
• Handling Missing Values
• Handling Imbalanced datasets, Oversampling - SMOTE
• Standardization/Normalization and transformation for data
Most datasets are not perfect: they have
missing values.
• Single Imputation
• Regression Imputation
• Multiple Imputation
Which variables to impute:
filling missing values vs using unreal information
How does it help?
Imputing by mean or median balances the data distribution
Other Imputation Techniques
• Multiple Imputation: Multiple Imputation fills in estimates for the missing data.
But to capture the uncertainty in those estimates, MI estimates the values
multiple times.
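A minimal sketch of single imputation by mean or median, using only the Python standard library. The `impute` helper and the sample amounts are hypothetical and exist only for illustration; real projects would normally use a library imputer instead.

```python
from statistics import mean, median

def impute(values, strategy="mean"):
    """Replace None entries with the mean or median of the observed values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed) if strategy == "mean" else median(observed)
    return [fill if v is None else v for v in values]

# Hypothetical transaction amounts with two missing entries.
amounts = [10.0, None, 30.0, 1000.0, None]
mean_filled = impute(amounts, "mean")      # the large value pulls the mean up
median_filled = impute(amounts, "median")  # the median is robust to it
```

Comparing the two results shows why the choice matters: a single large value shifts the mean fill far more than the median fill.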
• Classification
• Classification examples:
• Regression
• Regression examples:
Will this lead to bias
against women?
Example of Class
Imbalance
• Given a database of machine learning publications,
if the problem is to predict whether a researcher is
male or female, will the default prediction be biased
by machine learning?
When is class imbalance a problem?
• Class imbalance is a problem when there are too few minority class (fraud)
observations for the model to learn from.
• One needs to decide whether to create new minority class (fraud) observations or
remove existing majority class (normal transactions) observations.
Class Imbalance in Machine learning : in our example: Balanced Scale Data
• Majority class:
• Minority class:
How to Handle Imbalanced datasets
• SMOTE (Synthetic Minority Over-sampling
Technique): synthetically creates minority
class observations (oversampling), optionally
combined with undersampling of the majority
class, to get a certain ratio between the
classes.
• Proposed by Chawla et al., 2002. (ref)
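The core idea of SMOTE can be sketched in a few lines: pick a minority point, pick one of its nearest minority neighbours, and create a new point somewhere on the segment between them. This is a simplified, library-free illustration, not the reference implementation. The function name `smote_sketch`, its parameters and the sample points are assumptions made for this example.

```python
import math
import random

def smote_sketch(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    point and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest neighbours of `base` among the other minority points
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbour)))
    return synthetic

# Hypothetical minority-class (fraud) observations with two features.
fraud = [(1.0, 2.0), (1.5, 1.8), (2.0, 2.2), (1.2, 2.5)]
new_points = smote_sketch(fraud, n_new=6)
```

In practice one would more likely rely on an existing implementation such as the `SMOTE` class in the imbalanced-learn package, and apply it to the training split only.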
Performance with and without handling class imbalance
• AUC (the performance score used here for a decision tree classifier) is slightly better for the
model trained on the “SMOTE’d” data.
• We can play around with the parameters in SMOTE to try to improve the model further.
• We can also use more advanced machine learning models to improve further!
References
What?
• Standardization/Scaling is bringing all variables used for building the model to the same scale
Why?
• It balances the outsized effect of variables with a larger range (see the example in the next slide)
• Sometimes, it also helps in speeding up the calculations in an algorithm.
• It is important for techniques which use distance metrics.
How?
• Scale: change the range of values without changing the shape of the distribution.
The range is often set to 0 to 1.
• Standardize: change the values so that the distribution has a mean of zero and a standard
deviation of one. The shape is unchanged, so the output is close to a standard normal distribution only if the input was roughly normal.
• Normalize: the term is used loosely for either of the above.
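The two operations can be sketched with the standard library alone. The helper names `min_max_scale` and `standardize` and the sample values are hypothetical; they mirror what a min-max scaler and a standard scaler do to a single column.

```python
from statistics import mean, pstdev

def min_max_scale(values):
    """Rescale values linearly so the smallest maps to 0 and the largest to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def standardize(values):
    """Shift and rescale values to mean 0 and (population) standard deviation 1."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

x2 = [3000, 12000, 50000]        # hypothetical feature with a large range
scaled = min_max_scale(x2)       # values now lie between 0 and 1
standardized = standardize(x2)   # values now have mean 0 and std 1
```

Both transformations keep the ordering and relative spacing of the values; only the units change.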
Why scaling- in our example
• Let's say you have two input features, X1 and X2, where X1 has the range 0.1 to
0.8 and X2 has the range 3000 to 50000. A linear SVM classifier will be a linear
boundary lying in the X1-X2 plane. The claim is that the slope of the linear decision
boundary should not depend on the ranges of X1 and X2, but instead upon the
distribution of points.
Various scaling methods
• Min-Max Scaler
• Robust Scaler
• Standard Scaler
• Normalizer
When to scale data?
• If you build models using scaled data, it may require scaling back to original
variables to interpret variables’ effect on outcome predicted.
Put all the pre-processing techniques together
[Link]
official/Machine_Learning_Bootcamp/blob/master/Data_Preparation_101/Data_Preparation_101.ipynb
Learning Objectives
Classification and
Regression
What is Machine Learning?
Machine Learning Categorization
Supervised Learning Algorithms
Let’s talk about datasets that have both input variables and target variables
(labels for the data). Examples range from predicting the survival of a person in
the Titanic dataset, where the survival outcome is already given, to predicting house
prices from house characteristics, where the house prices are provided.
The algorithms that work on such datasets are known as Supervised Learning
Algorithms.
In the above image, you can see that the classification line is dividing the
data into 2 parts or 2 classes - red and blue. On the other hand, the
regression line is going along the direction of data and not segregating it.
It’s important to understand the characteristics of your target variable
before you begin running models and forming predictions.
Supervised ML Algorithms -
Regression
Linear Regression: Introduction
Learning Objectives
● Dependent and Independent Variables
● Equation of a Straight Line
● Linear Regression
Dependent and Independent Variables
● So far you’ve been studying input and output/target variables.
Commonly, the input variable is known as the independent variable and
the target variable is known as the dependent variable.
● The values of the independent variables do not depend on any other
variables, while the value of the dependent variable depends on them.
Independent Dependent
variables variable
Another example
➔ y = how far up
➔ x = how far along
➔ m = Slope or Gradient (how steep the line is)
➔ b = value of y when x=0
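Assuming these symbols refer to the usual straight-line equation y = m·x + b, a minimal sketch looks like this. The helper names and the two sample points are made up for illustration.

```python
def line(m, b):
    """Return the function y = m * x + b."""
    return lambda x: m * x + b

def slope_between(p1, p2):
    """Slope = change in y divided by change in x between two points."""
    (x1, y1), (x2, y2) = p1, p2
    return (y2 - y1) / (x2 - x1)

m = slope_between((0, 1), (2, 5))   # a rise of 4 over a run of 2
f = line(m, b=1)                    # b is the value of y when x = 0
y_at_3 = f(3)
```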
Too many synonyms to memorise? Let me put them all down at one
place for better understanding:
Variables = Features
Now suppose your home is 3000 square feet. How much should you
sell it for?
Well! You have to look at the existing price patterns (data) and predict a
price for your home. Fitting a straight line to that data to make the prediction is called linear regression.
What is linear regression? - an example
Here's an easy way to do it. Plotting the 3 data points we have so far:
● When there is a single input variable (x), the method is referred to as simple
linear regression or just linear regression. Eg: Salary dataset given here. There
is only one target variable and one input variable where we are predicting the
salary of individual using their years of experience.
● [Link]
Learning Objectives
Notebook for
practice
Simple vs. Multiple Linear Regression
Linear Regression with Single Variable
Notebook for practice
[Link]
official/Data_Science_Bootcamp/blob/master/Week3/Linear_Regression/Introduction_to_Linear_Regression.ipynb
Linear Regression with Multiple Variable
Notebook for practice
[Link]
official/Data_Science_Bootcamp/blob/master/Week3/Linear_Regression/Multiple_Linear_Regression.ipynb
Unit 3
Regression
Supervised ML Algorithms -
Regression
Evaluating a Regression Model
Learning Objectives
Gradient Descent
Which line is good?
Now coming back to our first example. How do you decide what line is
good? Here's a bad line:
This above drawn line is way off. For example, according to the line, a
1000 sq foot house should sell for $310,000, whereas we know it
actually sold for $200,000.
Here's a better line:
This line is off by an average of $8,333 (adding all the distances and
dividing by 3).
If we already have the data points (x1, y1), ..., (xn, yn), it means that our values of x and y
remain the same throughout all the lines we plot.
Our objective is to find the values of m and b that will best fit this data.
To find out what line is the best line (to find the values of m and b), we need to use a cost function.
In ML, cost functions are used to estimate how badly models are performing.
Put simply, a cost function is a measure of how wrong the model is in terms of its ability to estimate
the relationship between X and y.
Cost Function
What?
Now that we have built a model, we need to measure its performance and understand whether
it works well or not. A cost function measures the performance of a Machine Learning
model for given data. It quantifies the error between predicted values and expected
values and presents it in the form of a single real number.
Depending on the problem, the Cost Function can be formed in many different ways. The
purpose of this function is to be either:
● Minimized - then the returned value is usually called cost, loss or error. The goal is to
find the values of model parameters for which the Cost Function returns as small a number
as possible.
● Maximized - then the value it yields is named a reward. The goal is to find values of
model parameters for which the returned number is as large as possible.
What is predicted and expected value?
● Predicted value: as the name says, the value predicted by your machine learning model.
● Expected value: the true value (or the label present in your data).
Machine learning models are rarely 100% accurate; their predictions tend to deviate from the
true value or expected value.
Explaining with an example: suppose we are predicting the age of a person based on a few input
variables or features.
The difference between the true value and the model’s predicted value is
called the residual.
Cost Function Types/ Evaluation Metrics
There are three primary metrics used to evaluate linear models (to find
how well a model is performing):
● MAE does not penalize large errors as heavily as MSE, making it less
suitable for use-cases where you want to pay more attention to the
outliers.
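The list of metrics is not spelled out here, so the sketch below assumes the three are MAE, MSE and RMSE. The sample numbers are hypothetical and chosen so that both sets of predictions have the same MAE but a different MSE, which shows how squaring emphasises one large error.

```python
import math

def mae(y_true, y_pred):
    """Mean absolute error: average size of the residuals."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def mse(y_true, y_pred):
    """Mean squared error: average of the squared residuals."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    """Root mean squared error: MSE brought back to the units of y."""
    return math.sqrt(mse(y_true, y_pred))

actual = [200, 300, 400]
small_errors = [210, 290, 410]   # every prediction is off by 10
one_outlier = [200, 300, 430]    # a single prediction is off by 30

same_mae = mae(actual, small_errors) == mae(actual, one_outlier)
mse_small = mse(actual, small_errors)
mse_outlier = mse(actual, one_outlier)
```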
R Squared ( Coefficient of determination)
● R-squared is a goodness-of-fit measure for linear regression models.
● It represents the proportion of the variation in the original values that
the fitted values explain. Values from 0 to 1 are interpreted as
percentages.
● The higher the value is, the better the model fits.
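A small sketch of the usual definition, R² = 1 − SS_res / SS_tot, with hypothetical values. A perfect fit scores 1, and always predicting the mean of the data scores 0.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - residual SS / total SS."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

actual = [200, 300, 400]
perfect = r_squared(actual, [200, 300, 400])    # predictions match exactly
baseline = r_squared(actual, [300, 300, 300])   # always predict the mean
```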
Which metrics to use when?
This is an important question and we get used to learning these
measures over time. Sharing some resources with you all so that
it helps you understand what metrics to be used in the context of
solving a regression problem.
● [Link]
machine-learning-models-part-1-a99d7d7414e4
(you may ignore “Bonus” section in the article for time being)
Note: Gradient Descent is a slightly advanced topic.
Gradient
Gradient is another word for "slope". The higher the gradient of a graph at a point, the steeper
the line is at that point. A negative gradient means that the line slopes downwards.
It is often useful or necessary to find out what the gradient of a graph is. For a straight-line
graph, pick two points on the graph. The gradient of the line = (change in y-coordinate)/(change
in x-coordinate) .
To find the gradient of a curve, you must draw an accurate sketch of the curve. At
the point where you need to know the gradient, draw a tangent to the curve. A
tangent is a straight line which touches the curve at one point only. You then find the
gradient of this tangent.
Example
Find the gradient of the curve y = x² at the point (3, 9).
Gradient of tangent =
(change in y)/(change in x)
= (9 - 5)/ (3 - 2.3)
≈ 5.71 (an estimate read off the sketch; the exact value from the derivative 2x is 6)
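The tangent method can be imitated numerically by measuring the slope between two points very close to x = 3. This is a sketch using a central difference; the helper name and step size `h` are choices made for this example.

```python
def numerical_gradient(f, x, h=1e-6):
    """Central-difference estimate of the slope of f at x."""
    return (f(x + h) - f(x - h)) / (2 * h)

slope_at_3 = numerical_gradient(lambda x: x ** 2, 3.0)
```

The estimate lands very close to 6, which is what the derivative 2x gives at x = 3; a hand-drawn tangent is naturally less precise.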
Gradient Descent
The cost function will tell you how good those values are (i.e. it will tell
you how far off your predictions were from the actual data). But what
do we do based on that information? How do we find the values of m
and b that will draw the best line? By using gradient descent.
Let’s start with a simpler version of gradient descent, and then move on
to the real version.
Suppose we decide to leave b at zero. So we experiment with what value m
should be, always keeping b at 0. Now you can try various values for m, and you
will end up with different costs. You can plot all of these costs on a graph:
Here are the corresponding lines (remember, b is zero in these lines):
m = 75
m = 160
We can see that the line on the left seems to fit the data better than the
line on the right, so it makes sense that the cost of that line is lower. And
from this graph it looks like m = 75 gives us the lowest cost overall,
since it is the lowest point in the graph. So with all the costs graphed out like
this, we just need to find the lowest point on the graph, and that will give us the
optimal value of m!
Gradient descent helps us find the lowest point on this graph. You start with a
value for m, and update it iteratively till you arrive at the best value. So you can
start at m = 0. Then you have to ask, should I go left or right?
Well, we want to go down, so let's go right a small step:
This is the new value for m. Again we ask, should we go left or right? At each
step, you need to head downward, till you get to a point where you're as low as
you can go:
This is gradient descent: going down bit by bit till you hit the bottom.
How do you figure out which way is down? The answer will be obvious
to calculus experts but not so obvious for the rest of us: you take the
derivative at that point.
But the important bit to know is, if you take the current value of m and
subtract the derivative at that point (scaled by a small step size), you will go down. You just do that a
bunch of times (say 1000 times) and you will hit bottom!
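A minimal sketch of the simplified version, with b fixed at 0 and only m updated. It assumes mean squared error as the cost, and the usual update rule in which each step moves against the derivative, scaled by a small learning rate. The data, learning rate and function name are hypothetical.

```python
def gradient_descent_slope(xs, ys, learning_rate=0.01, steps=1000):
    """Fit y = m * x (b fixed at 0) by minimising mean squared error."""
    m = 0.0                       # start at m = 0
    n = len(xs)
    for _ in range(steps):
        # derivative of (1/n) * sum((m*x - y)^2) with respect to m
        grad = (2 / n) * sum((m * x - y) * x for x, y in zip(xs, ys))
        m -= learning_rate * grad   # step downhill, against the derivative
    return m

xs = [1.0, 2.0, 3.0]
ys = [75.0, 150.0, 225.0]         # hypothetical data lying exactly on y = 75x
m_best = gradient_descent_slope(xs, ys)
```

If the learning rate is too large the steps overshoot the bottom and the value of m can diverge, so the step size is something to tune rather than a fixed constant.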
Gradient Descent
The video in the next slide explains the process of gradient descent.
You only need to watch till 16:07 to gain some understanding and not
go into the python implementation explained later on.
Confusing? Worry not. The next example will clarify all your doubts.
Example
Let’s assume you have called two weather examiners, Mr. Bishop and
Mr. Varian to test if it will rain or not.
Did you notice? Mr. Bishop is highly biased towards predicting rain. During the test, he
is unable to predict most of the cases correctly.
Me: Sir, there is a giant sitting on the cloud who lost his candy. Will it rain?
Mr. Varian: Not sure. Since the answer is “No” to most of the conditions, there is a
high possibility that it will not rain.
Now, although the decisions of Mr. Varian vary perfectly with the input conditions,
he is not able to predict for new and unseen conditions (general conditions
other than the specific conditions given during training).
Well, the answer is, “Best of both worlds”. We need neither high bias
nor high variance. We want our algorithm to perform well on the
training set and also give good results on unseen data (the test set).
MUST READ
Understanding the Bias Variance Tradeoff:
[Link]
tradeoff-165e6942b229
Reference
[Link]
bias-variance-tradeoff-ec540fb13e12
Supervised ML Algorithms -
Regression
Decision Tree/ Regression Tree
Decision Trees
Decision trees are among the most popular tools for classification and prediction.
Classification Trees: where the target variable is categorical and the tree is used to identify the
"class" into which a target variable would likely fall.
Regression Trees: where the target variable is continuous and the tree is used to predict its value.
Supervised ML Algorithms -
Regression
Support Vector Regressor
Support Vector Regressor
[Link]
vector-regression-tutorial-for-machine-learning/
No. | Bagging | Boosting
1. | The simplest way of combining predictions that belong to the same type. | A way of combining predictions that belong to different types.
2. | Aims to decrease variance, not bias. | Aims to decrease bias, not variance.
3. | Each model receives equal weight. | Models are weighted according to their performance.
6. | Bagging tries to solve the over-fitting problem. | Boosting tries to reduce bias.
7. | If the classifier is unstable (high variance), then apply bagging. | If the classifier is stable and simple (high bias), then apply boosting.
● Computationally Expensive
gets changed
Random Forest
● Supervised learning algorithm which performs both classification and
regression
● Bagging-based: the trees can be trained in parallel
Random Forest
● Advantages
○ Effective method for estimating missing data and maintains
accuracy when a large proportion of data are missing
○ Runs efficiently on large datasets
● Disadvantages
○ May observe random forest overfitting for some datasets with
noisy classification/regression tasks
Regression Forest - Implementation
Now, you may ask: why don’t we use Linear Regression? Why do we
need a new algorithm?
Well, you will find all the answers in the video in the next slides.
The video in the next slide is a must-watch; the instructor has
brilliantly explained logistic regression!
Must Watch Understanding Logistic Regression
Logistic Regression
Classification
Binary - yes or no (spam or not spam)
Multiclass - which party will a person vote for: A, B or C?
Linear Regression vs Logistic
● Linear regression is used to solve regression problems with
continuous values
● Logistic regression is used to solve classification problems with
discrete categories
○ Binary classification (Classes 0 and 1)
○ Examples:
• Emails (Spam / Not Spam)
• Credit Card Transactions (Fraudulent / Not Fraudulent)
• Loan Default (Yes / No)
● Let’s say a data scientist named John wants to predict whether a
customer will buy insurance or not.
● Remember that linear regression is used to predict a continuous value
where the output (y) may vary between -∞ (negative infinity) and +∞
(positive infinity), whereas in this case, the target variable (y) takes only
two discrete values, 0 (No insurance) and 1 (Yes, got the insurance).
● John decides to extend the concepts of linear regression to fulfil his
requirement. One approach is to take the output of linear regression
and map it between 0 and 1. If the resultant output is below a certain
threshold (say 0.5), classify it as No (didn’t buy the insurance), whereas if
the resultant output is above the threshold, classify it as Yes (bought
the insurance).
● We then plot a simple linear regression line and set the threshold as 0.5
○ Negative class (Insurance = No)– Age on the left side
○ Positive class (Insurance = Yes) – Age on the right side
Imagine there is an outlier towards the right
Additional outlier that distorted the regression line
● As we can see, an outlier in the data will distort the whole linear regression
line.
● Clearly, the fitted line is unable to separate the classes
● The boundary should have been at the vertical yellow line, which is able to divide
the positive and negative classes, i.e. yes or no for insurance
Happy John! (Data Scientist)
● Well, life would be much simpler if we had an algorithm that
would fit the points like below, right? It is a much better fit compared
to the regression line!
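Assuming the better-fitting curve is the logistic (sigmoid) function used by logistic regression, a minimal sketch looks like this. The weight and bias values are invented for illustration and are not fitted to any data.

```python
import math

def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1 / (1 + math.exp(-z))

def predict_buys_insurance(age, weight=0.2, bias=-8.0, threshold=0.5):
    """Hypothetical model: the probability of buying rises with age."""
    probability = sigmoid(weight * age + bias)
    return probability, probability >= threshold

p_young, buys_young = predict_buys_insurance(25)
p_old, buys_old = predict_buys_insurance(60)
```

Because the output always stays between 0 and 1, a far-away outlier cannot drag the predictions outside the range in the way it distorts a straight line.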
Unit 4
Evaluating the Performance of Logistic Regression model
Why not Accuracy?
Which metrics to use when?
Evaluating the Performance of Logistic Regression model
How well does the model fit the data? Which predictors are most
important? Are the predictions accurate?
● But why?
Correctly predicted = 4
Total Predictions = 6
Accuracy = 4/6 = 66.7%
Why not Accuracy?
● Accuracy is very important, but it might not be the best metric all the time. Let’s
look at why with an example:
● Let’s imagine we build a basic model which always predicts that a transaction
is not fraudulent. Guess what the accuracy of this model would be?
~99%! (You may ask why. Well, less than 1% of transactions are usually
fraudulent, so there is a huge class imbalance. Even if you fit a wrong
model that always predicts a transaction to be not fraudulent, the accuracy will
remain about 99% owing to class imbalance.)
● Impressive, right? Well, the probability of a bank buying this model is absolutely
zero.
● In a problem where there is a large class imbalance, a model can predict the
value of the majority class for all predictions and achieve a high
classification accuracy.
● While our model has a stunning accuracy, this is an apt example where
accuracy is definitely not the right metric.
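The argument can be checked with a few lines of code. The dataset below is hypothetical: 1,000 transactions, of which 10 (1%) are fraudulent, scored by a "model" that always predicts not fraudulent.

```python
# 1 = fraudulent, 0 = authentic (hypothetical labels with 1% fraud)
labels = [1] * 10 + [0] * 990
predictions = [0] * len(labels)   # always predict "not fraudulent"

accuracy = sum(p == t for p, t in zip(predictions, labels)) / len(labels)
frauds_caught = sum(p == 1 and t == 1 for p, t in zip(predictions, labels))
```

The accuracy is 99%, yet not a single fraudulent transaction is caught.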
Why not Accuracy?
Watch till 1 min 14 secs to understand why accuracy can be a bad metric for model
performance
Evaluating the Performance of Logistic Regression model
Confusion Matrix
A confusion matrix is a table that is often used to describe the performance of
a classification model (or "classifier") on a set of test data for which the true
values are known. The confusion matrix itself is relatively simple to
understand, but the related terminology can be confusing.
Let's start with an example confusion matrix for a binary classifier for disease
prediction (though it can easily be extended to the case of more than two
classes):
Confusion Matrix
Let's now define the most basic terms, which are whole numbers (not
rates):
● true positives (TP): These are cases in which we predicted yes (they
have the disease), and they do have the disease.
● true negatives (TN): We predicted no, and they don't have the
disease.
● false positives (FP): We predicted yes, but they don't actually have the
disease. (Also known as a "Type I error.")
● false negatives (FN): We predicted no, but they actually do have the
disease. (Also known as a "Type II error.")
I know these seem hard to memorise. One thing that has helped me
remember them is putting them in a simpler way:
This is a list of rates that are often computed from a confusion matrix for
a binary classifier:
Note: There is usually a trade-off, so we mostly have to prioritise one over the other;
it is hard to have both high Precision and high Recall.
The recall is the ratio of the relevant results( correctly predicted yes)
returned by the search engine to the total number of the relevant results
that could have been returned (total actual yes).
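The counts and the rates derived from them can be sketched as follows. The labels are hypothetical (1 = has the disease, 0 = does not), and the rates use the standard definitions: precision = TP / (TP + FP), recall = TP / (TP + FN), specificity = TN / (TN + FP).

```python
def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)
    tn = sum(t == 0 and p == 0 for t, p in pairs)
    fp = sum(t == 0 and p == 1 for t, p in pairs)   # Type I errors
    fn = sum(t == 1 and p == 0 for t, p in pairs)   # Type II errors
    return tp, tn, fp, fn

def rates(tp, tn, fp, fn):
    """Common rates computed from the four counts."""
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),        # also called sensitivity
        "specificity": tn / (tn + fp),
    }

y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 1]
tp, tn, fp, fn = confusion_counts(y_true, y_pred)
metrics = rates(tp, tn, fp, fn)
```

With these six predictions, four are correct, so the accuracy works out to 4/6, roughly 66.7%.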
Choosing between Sensitivity and Specificity
Often, the sensitivity and specificity of a test are inversely related. Selecting
the optimal balance of sensitivity and specificity depends on the objective of
the problem that needs to be solved.
We can instead change this threshold to 0.7. Thus, we’ll tell someone they have cancer only
if we think they have a greater than or equal to 70% chance of having cancer.
Look at the graph below. Since the threshold has shifted to the right, the number of
people correctly identified as not having cancer has increased. Thus, the specificity has
increased. (We are being very specific when declaring that patients have cancer.)
Case 2: Higher Sensitivity
Suppose we want to avoid missing too many cases of cancer (avoid false negatives). If a
person with cancer is told that they are well, it can cause a delay in treatment and badly affect
their health.
In this case we can set a lower threshold, say 0.25. Even if a patient has a 25% chance of
having cancer, we’ll inform them.
Looking at the graph you can see that the threshold has shifted to the left. Most of the
people with cancer will be detected in advance in this case. We have completely (or almost)
eliminated False Negatives. This results in higher Sensitivity/Recall. (We are being
sensitive in detecting a disease, i.e. it is a really sensitive test.)
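The effect of moving the threshold can be sketched with a handful of hypothetical patients and predicted probabilities. The thresholds 0.25 and 0.7 come from the two cases discussed; everything else is invented for illustration.

```python
def sensitivity_specificity(y_true, scores, threshold):
    """Classify as positive when score >= threshold, then compute both rates."""
    preds = [1 if s >= threshold else 0 for s in scores]
    pairs = list(zip(y_true, preds))
    tp = sum(t == 1 and p == 1 for t, p in pairs)
    fn = sum(t == 1 and p == 0 for t, p in pairs)
    tn = sum(t == 0 and p == 0 for t, p in pairs)
    fp = sum(t == 0 and p == 1 for t, p in pairs)
    return tp / (tp + fn), tn / (tn + fp)

has_cancer = [1, 1, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.6, 0.3, 0.65, 0.4, 0.2, 0.1]   # hypothetical probabilities

low = sensitivity_specificity(has_cancer, scores, 0.25)   # favours sensitivity
high = sensitivity_specificity(has_cancer, scores, 0.7)   # favours specificity
```

On this toy data the low threshold catches every patient with cancer but raises false alarms, while the high threshold raises no false alarms but misses half of the patients.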
You can watch this video from 00:58 to 5:32 explaining the Sensitivity and Specificity trade off
Confusion Matrix
Talking about accuracy, our favourite metric!
● Remember, accuracy is a very useful metric when all the classes are
equally important.
● But this might not be the case if we are predicting if a patient has cancer.
In this example, we can probably tolerate FPs but not FNs.
But do we necessarily need to spend time on varying the threshold to get the
perfect Precision and Recall? Or is there a way to choose this threshold
automatically?
Let’s take 3 algorithms and try to find a metric for combining Precision and
Recall.
● Accuracy can be used when the class distribution is similar while F1-
score is a better metric when there are imbalanced classes.
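The F1-score combines the two as their harmonic mean, F1 = 2 · precision · recall / (precision + recall). The sketch below compares three hypothetical algorithms; the precision and recall values are invented for illustration.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0 when both are 0)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# (precision, recall) for three hypothetical algorithms
candidates = {"A": (0.5, 0.4), "B": (0.7, 0.1), "C": (0.02, 1.0)}
f1 = {name: f1_score(p, r) for name, (p, r) in candidates.items()}
best = max(f1, key=f1.get)
```

The harmonic mean punishes a very low value in either metric, so the balanced algorithm A comes out ahead of B (high precision, poor recall) and C (perfect recall, poor precision).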
ROC (Receiver Operating Characteristic) Curve
● An ROC curve is a commonly used way to visualize the performance of a
binary classifier, meaning a classifier with two possible output classes.
● It plots 2 parameters: the true positive rate against the false positive rate, at different classification thresholds.
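An ROC curve is conventionally drawn with the false positive rate on the x-axis and the true positive rate on the y-axis, one point per threshold. The sketch below computes those points and the area under the curve (AUC) with the trapezoidal rule. The labels, scores and function names are hypothetical.

```python
def roc_points(y_true, scores):
    """Return (FPR, TPR) pairs, one per threshold, from strictest to loosest."""
    positives = sum(y_true)
    negatives = len(y_true) - positives
    points = [(0.0, 0.0)]   # threshold above every score: nothing predicted positive
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(t == 1 and s >= threshold for t, s in zip(y_true, scores))
        fp = sum(t == 0 and s >= threshold for t, s in zip(y_true, scores))
        points.append((fp / negatives, tp / positives))
    return points

def auc(points):
    """Area under the curve by the trapezoidal rule."""
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(points, points[1:]))

y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
curve = roc_points(y_true, scores)
area = auc(curve)
```

An area of 1.0 would mean every positive is ranked above every negative, while 0.5 corresponds to random ranking.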
MUST READ
An excellent article explaining Threshold, ROC and AUC in a simple
manner: [Link]
auc-curves-a05b68550b69
Which metrics to use when?
This is an important question and we get used to learning these
measures over time. Sharing some resources with you all so that
it helps you understand what metrics to be used in the context of
solving a regression problem.
● [Link]
for-evaluating-machine-learning-models-part-2-
86d5649a5428
SUPERVISED LEARNING - BUILDING YOUR FIRST
CLASSIFICATION AND REGRESSION MODELS
[Link]
upervised-ml-model-building-walkthrough
Feature Selection
In real life data science problems, often the data consist of a
large number of attributes or features.
Variance threshold
LASSO regularization
Classification Trees: where the target variable is categorical and the tree is used
to identify the "class" into which a target variable would likely fall.
Regression Trees: where the target variable is continuous and the tree is used to
predict its value.
Resources on Decision Tree Classification
● Notebooks:
[Link]
notebooks-intro-to-model-building
[Link]
_improvised_model
Supervised ML Algorithms -
Classification
Support Vector Machine (SVM)
SVM’s Objective
The objective of SVMs is to categorize data into two classes. It does so by finding a
separating hyperplane(decision boundary), where the distance between itself and the
closest data points for both categories is maximized.
Look at how the hyperplane lies exactly between the nearby blue and red points
(maximising the margin)
Simple Visual Explanation of SVM
[Link]
vector-machine-svm-a-visual-simple-explanation-part-1-
a7efa96444f2
Pros and Cons — SVM
Pros:
● It is useful for both linearly Separable (hard margin) and
Non-linearly Separable (soft margin) data.
● It is effective in high dimensional spaces.
● It is effective in cases where the number of dimensions is
greater than the number of samples.
● It uses a subset of training points in the decision function
(called support vectors), so it is also memory efficient.
Cons:
● Picking the right kernel and parameters can be
computationally intensive.
● It also doesn’t perform very well when the data set is
noisy, i.e. the target classes are overlapping.
● SVM doesn’t directly provide probability estimates; these
are calculated using an expensive five-fold cross-validation.
Support Vector Machine (SVM)
SVM Implementation
SVM Notebook
[Link]
[Link]
Kernels
● SVM algorithms use a set of mathematical functions that
are defined as the kernel.
● Advantages:
This algorithm requires a small amount of training data
to estimate the necessary parameters. Naive Bayes
classifiers are extremely fast compared to more
sophisticated methods.
[Link]
ayes/14_naive_bayes_1_titanic_survival_prediction.ipynb
Multinomial Gaussian NB for spam detection
Note:
In the next video, the instructor has used the spam detection
example that involves some operations around text (NLP).
You don’t need to worry about those parts and instead just
focus on the parts revolving around Naive Bayes.
Multinomial Gaussian NB for spam detection
Count Vectorization
Email Spam filter
[Link]
.ipynb
Naive Bayes Classification using Scikit-learn
[Link]
scikit-learn
Supervised ML Algorithms -
Classification
Random Forest
Random Forest
Random forest is a flexible, easy-to-use machine learning algorithm
that produces a great result most of the time, even without
hyperparameter tuning.
It is also one of the most used algorithms, because of its simplicity and
versatility (it can be used for both classification and regression
tasks).
[Link]
_forest/11_random_forest.ipynb
Applications
● The random forest algorithm is used in a lot of different fields, like
banking, the stock market, medicine and e-commerce.
● Notebooks:
[Link]
andom-forest
Learning Objectives
Gone are the days when you had 5 variables to fit your linear
regression. Modern datasets contain many more variables/features to
choose from: think of a dataset with 50 or more features and more than 1
million observations.
Manual or auto
selection
How?
Suppose we’re working on the Iris Classification. We’ll first create a baseline model
using Logistic Regression. Now, we want to try out Feature Selection and try to
improve our model’s performance. On plotting feature importance scores, we
obtain the below graph:
● Feature Importance Scores tell us that petal width and petal length are the top 2
features. The rest have a much lower importance score.
● We’ll select these 2 features.
● We’ll transform our existing dataset to contain only these 2 features.
● We’ll train our model on this transformed dataset.
● Finally, we’ll compare the evaluation metrics of our initial Logistic Regression
model with this new model.
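The selection and transformation steps above can be sketched without any library. The importance scores below are invented for illustration (they are not measured values), the column names assume the standard Iris features, and the helper names are hypothetical. Training and comparing the two models is left out.

```python
def select_top_k(importances, k):
    """Return the names of the k features with the highest importance."""
    ranked = sorted(importances, key=importances.get, reverse=True)
    return ranked[:k]

def keep_features(rows, feature_names, selected):
    """Project each row (a list of values) onto the selected features."""
    idx = [feature_names.index(name) for name in selected]
    return [[row[i] for i in idx] for row in rows]

feature_names = ["sepal_length", "sepal_width", "petal_length", "petal_width"]
# Hypothetical importance scores, not measured values.
importances = {"sepal_length": 0.05, "sepal_width": 0.02,
               "petal_length": 0.45, "petal_width": 0.48}
rows = [[5.1, 3.5, 1.4, 0.2], [6.7, 3.1, 4.7, 1.5]]

selected = select_top_k(importances, k=2)
reduced = keep_features(rows, feature_names, selected)
```

The reduced rows would then be used to train the second model, whose evaluation metrics are compared with those of the baseline.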
Why Feature Selection?
You already know a number of optimization methods by now and might
think what’s the need of reducing our data by feature selection if we can
just optimize?
Now let's say you have a square 100 yards on each side and you dropped a
penny somewhere on it. It would be pretty hard, like searching across two
football fields stuck together. It could take days.
Now a cube 100 yards across. That's like searching a 30-story building the
size of a football stadium. Ugh.
The difficulty of searching through the space gets a lot harder as you have
more dimensions.
Benefits of performing Feature Selection
You might’ve gotten an idea of why feature selection is required by now.