Handling Imbalanced Data in ML
Topics Covered:
Understanding Imbalanced Data
Techniques for Handling Imbalanced Data
Evaluation Metrics for Imbalanced Data
Advanced Techniques
Real-world Applications and Case Studies
Best Practices and Considerations
Challenges and Limitations
Tools and Libraries
Techniques for Handling Imbalanced Data
Resampling Methods
Oversampling: Increasing the number of instances in the minority class.
Undersampling: Reducing the number of instances in the majority class.
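The two resampling ideas above can be sketched with the standard library alone. This is a minimal illustration for a binary problem, not a production routine; the function names, the toy dataset, and the choice to balance the classes exactly are all assumptions made for the example.

```python
import random
from collections import Counter

def random_oversample(samples, labels, minority, seed=0):
    """Duplicate randomly chosen minority samples until both classes have equal size."""
    rng = random.Random(seed)
    minority_idx = [i for i, y in enumerate(labels) if y == minority]
    majority_count = len(labels) - len(minority_idx)
    # Draw with replacement, because we need more copies than there are originals.
    extra = rng.choices(minority_idx, k=majority_count - len(minority_idx))
    keep = list(range(len(labels))) + extra
    return [samples[i] for i in keep], [labels[i] for i in keep]

def random_undersample(samples, labels, minority, seed=0):
    """Drop randomly chosen majority samples until both classes have equal size."""
    rng = random.Random(seed)
    minority_idx = [i for i, y in enumerate(labels) if y == minority]
    majority_idx = [i for i, y in enumerate(labels) if y != minority]
    # Draw without replacement, so no majority sample is kept twice.
    keep = minority_idx + rng.sample(majority_idx, k=len(minority_idx))
    return [samples[i] for i in keep], [labels[i] for i in keep]

# Toy dataset: 90 majority (0) and 10 minority (1) samples.
X = list(range(100))
y = [0] * 90 + [1] * 10

X_over, y_over = random_oversample(X, y, minority=1)
X_under, y_under = random_undersample(X, y, minority=1)
print(Counter(y_over))   # 90 samples of each class
print(Counter(y_under))  # 10 samples of each class
```

Resampling should be applied to the training split only, so that the test set keeps the real class distribution.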
Evaluation Metrics for Imbalanced Data
In imbalanced datasets, accuracy can be misleading due to the disproportionate class distribution.
Instead, evaluation metrics like precision, recall, F1-score, ROC-AUC, and PR curve provide a more
comprehensive understanding of model performance.
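To see why accuracy misleads, consider a model that always predicts the majority class. The sketch below computes accuracy, precision, recall, and F1 by hand on a made-up 95/5 split; the helper names and the data are assumptions for illustration only.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, fn, tn) for the given positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    return tp, fp, fn, tn

def precision_recall_f1(y_true, y_pred, positive=1):
    """Compute precision, recall and F1, returning 0.0 where a ratio is undefined."""
    tp, fp, fn, _ = confusion_counts(y_true, y_pred, positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# 95 negatives and 5 positives; the "model" predicts negative for everything.
y_true = [0] * 95 + [1] * 5
y_pred = [0] * 100

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Accuracy comes out at 0.95 even though the model never finds a single positive case, which is exactly what recall and F1 (both 0) expose.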
Advanced Techniques:
Ensemble methods like XGBoost, AdaBoost, or Random Forests can handle imbalanced data effectively
because they can weight samples or classes differently during training, so errors on the minority class count for more.
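One common way to weight classes is the "balanced" heuristic, weight = n_samples / (n_classes × class count), which is the rule scikit-learn documents for `class_weight="balanced"`. The sketch below computes it by hand; the function name and the 900/100 label split are assumptions for illustration.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class inversely to its frequency: n_samples / (n_classes * count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}

# Made-up labels: 900 majority (0) and 100 minority (1) samples.
labels = [0] * 900 + [1] * 100
weights = balanced_class_weights(labels)

# Per-sample weights; with them, each class contributes the same total weight.
sample_weight = [weights[y] for y in labels]
print(weights)
```

Weights like these can typically be passed to an ensemble through a per-sample weight argument, or requested directly with a class-weight setting where the library offers one.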
Finance and Fraud Detection
In finance, imbalanced data is common in fraud detection tasks, where fraudulent transactions are
relatively rare compared to legitimate ones.
Techniques like anomaly detection, oversampling the minority class, or using cost-sensitive learning
methods can be applied.
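Cost-sensitive learning can be illustrated by choosing a decision threshold that minimizes total cost rather than error count. In this sketch the fraud scores, the candidate thresholds, and the costs (a missed fraud costing 50 times a false alarm) are all invented for the example.

```python
def total_cost(y_true, scores, threshold, cost_fn=50.0, cost_fp=1.0):
    """Sum the cost of missed frauds (false negatives) and false alarms (false positives)."""
    cost = 0.0
    for truth, score in zip(y_true, scores):
        flagged = score >= threshold
        if truth == 1 and not flagged:
            cost += cost_fn
        elif truth == 0 and flagged:
            cost += cost_fp
    return cost

# Hypothetical model scores: 8 legitimate transactions (0) and 2 frauds (1).
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
scores = [0.05, 0.12, 0.15, 0.22, 0.31, 0.35, 0.45, 0.62, 0.42, 0.91]

# Try thresholds 0.1, 0.2, ..., 0.9 and keep the cheapest one.
thresholds = [k / 10 for k in range(1, 10)]
best = min(thresholds, key=lambda th: total_cost(y_true, scores, th))
print(best, total_cost(y_true, scores, best), total_cost(y_true, scores, 0.5))
```

With these numbers the cheapest threshold is lower than the usual 0.5, because accepting a couple of false alarms is cheaper than missing one fraud.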
Medical Diagnosis and Healthcare
In medical diagnosis, imbalanced data can occur when certain diseases or conditions are rare.
Handling imbalanced data here involves careful model evaluation and validation to ensure high
sensitivity (recall) while maintaining specificity.
Techniques like resampling or using specialized algorithms are employed.
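Sensitivity and specificity can be read straight off the confusion counts. The sketch below uses a made-up screening result (10 patients with the condition, 90 without); the function name and the numbers are assumptions for illustration.

```python
def sensitivity_specificity(y_true, y_pred):
    """Sensitivity = TP / (TP + FN); specificity = TN / (TN + FP)."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    return tp / (tp + fn), tn / (tn + fp)

# 10 patients with the condition (8 detected) and 90 without (9 false alarms).
y_true = [1] * 10 + [0] * 90
y_pred = [1] * 8 + [0] * 2 + [1] * 9 + [0] * 81

sensitivity, specificity = sensitivity_specificity(y_true, y_pred)
print(f"sensitivity={sensitivity:.2f} specificity={specificity:.2f}")
```

In a diagnostic setting a missed case (low sensitivity) is usually the costlier error, which is why both numbers are reported rather than a single accuracy figure.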
Oversampling techniques might lead to overfitting on the minority class. Generating synthetic samples
that are too close to existing ones may hinder the model's ability to generalize.
Handling Outliers in ML
Topics Covered:
Understanding Outliers
Identifying Outliers
Robust Algorithms
Outliers vs Anomalies
Impact of Outliers on Machine Learning Models
Real-Time Analogy on Handling Outliers
Understanding Outliers
Outliers are data points that differ significantly from the other observations in a dataset.
They can arise from measurement errors, natural variability in the data, or genuine extreme values.
Outliers can skew statistical analyses, degrade model performance, and distort visualizations.
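A small example shows how a single extreme value skews a statistic. The readings below are invented; the point is only that the mean moves a lot while the median barely changes.

```python
import statistics

readings = [10, 11, 12, 13, 14]
with_outlier = readings + [100]  # one extreme value added

mean_before = statistics.mean(readings)
mean_after = statistics.mean(with_outlier)
median_before = statistics.median(readings)
median_after = statistics.median(with_outlier)

print(f"mean:   {mean_before} -> {mean_after:.2f}")
print(f"median: {median_before} -> {median_after}")
```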
Identifying Outliers
Z-score: Identifies outliers by measuring how far a data point lies from the mean, in units of standard
deviations.
The Z-score is one of the most commonly used tools for detecting outliers. It is simply the number
of standard deviations that a data point lies away from the mean.
A value's z-score is calculated using the following formula:
z = (x - μ) / σ
Where:
z = Z-score
x = the value being evaluated
μ = the mean
σ = the standard deviation
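The z-score rule can be sketched in a few lines. The cutoff of 3 standard deviations is a widely used convention rather than something stated in this lesson, and the function name and sample values are assumptions for illustration.

```python
import statistics

def zscore_outliers(values, threshold=3.0):
    """Return the values whose |z| = |x - mean| / std exceeds the threshold."""
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values)  # population standard deviation
    if sigma == 0:
        return []  # all values identical, so nothing can be an outlier
    return [x for x in values if abs((x - mu) / sigma) > threshold]

# Made-up data: twenty values close to 50 and one extreme value.
values = [48, 49, 50, 51, 52] * 4 + [120]
print(zscore_outliers(values))
```

Note that the outlier itself inflates both the mean and the standard deviation, so on very small samples a fixed cutoff of 3 may never trigger.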
Interquartile Range (IQR): Determines outliers based on the range between the 25th and 75th
percentiles.
The Interquartile Range (IQR) is a robust statistical measure used to identify outliers in a dataset.
It is the range between the 75th percentile (Q3) and the 25th percentile (Q1) of the data.
The interquartile range is found by subtracting the first quartile value from the third quartile
value:
IQR = Q3 - Q1
To find outliers, you'll need to know your data's IQR, Q1, and Q3 values. Take these values and input them into
the equations below. Statisticians call the result of each equation an outlier gate. Using the conventional multiplier of 1.5:
Lower gate = Q1 - 1.5 × IQR
Upper gate = Q3 + 1.5 × IQR
Values below the lower gate or above the upper gate are treated as outliers.
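The outlier gates can be sketched with the standard library as follows. The 1.5 multiplier is the usual convention and is an assumption here, as are the function name and the sample data; quartiles are computed with the "inclusive" method, which interpolates linearly between data points.

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Return (q1, q3, outliers) using gates at Q1 - k*IQR and Q3 + k*IQR."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower_gate, upper_gate = q1 - k * iqr, q3 + k * iqr
    outliers = [x for x in values if x < lower_gate or x > upper_gate]
    return q1, q3, outliers

# Made-up data with one obviously extreme value.
data = [12, 14, 15, 16, 17, 18, 19, 21, 22, 60]
q1, q3, outliers = iqr_outliers(data)
print(q1, q3, outliers)  # 15.25 20.5 [60]
```

Because the quartiles ignore how extreme the tails are, this rule is less affected by the outliers themselves than the z-score approach.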
Robust Algorithms
Robust algorithms are less sensitive to outliers than conventional ones such as ordinary least squares.
Robust regression techniques like RANSAC or Huber Regression limit the influence of extreme points on the fitted model.
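As a hedged illustration of robust regression, the sketch below fits ordinary least squares and scikit-learn's `HuberRegressor` to the same synthetic line after corrupting a few points. The data, the true slope of 2, and the size of the corruption are all assumptions chosen for the example.

```python
import numpy as np
from sklearn.linear_model import HuberRegressor, LinearRegression

rng = np.random.default_rng(0)

# Synthetic data around the line y = 2x + 1 with mild noise.
X = np.linspace(0, 10, 50).reshape(-1, 1)
y = 2.0 * X.ravel() + 1.0 + rng.normal(scale=0.5, size=50)

# Corrupt the last five targets to act as outliers.
y[-5:] += 40.0

ols = LinearRegression().fit(X, y)
huber = HuberRegressor().fit(X, y)

print(f"OLS slope:   {ols.coef_[0]:.2f}")
print(f"Huber slope: {huber.coef_[0]:.2f}")
```

The Huber loss grows only linearly for large residuals, so the corrupted points pull its slope away from 2 far less than they pull the least-squares slope.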
Outliers vs Anomalies
Outliers are data points that deviate significantly from other observations,
while anomalies are points that are rare or unexpected. An anomaly is not always an outlier, and
vice versa.
Real-Time Analogy
Different domains might have specific considerations when dealing with outliers. For instance, in sensor
data, outliers might occur due to measurement errors or equipment malfunction.
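For sensor streams, one simple approach is to compare each reading with the median of its neighbours and flag large deviations. The sketch below is a minimal version of that idea; the window size, the tolerance, and the temperature readings are assumptions for illustration.

```python
import statistics

def flag_spikes(readings, window=5, tolerance=5.0):
    """Return indices whose reading differs from the local median by more than tolerance."""
    half = window // 2
    flagged = []
    for i, value in enumerate(readings):
        # The window is truncated at the edges of the series.
        lo, hi = max(0, i - half), min(len(readings), i + half + 1)
        local_median = statistics.median(readings[lo:hi])
        if abs(value - local_median) > tolerance:
            flagged.append(i)
    return flagged

# Hypothetical temperature readings with two sensor glitches.
temps = [21.0, 21.2, 21.1, 85.0, 21.3, 21.4, 21.2, -40.0, 21.5, 21.6]
print(flag_spikes(temps))  # [3, 7]
```

A flagged reading can then be dropped, replaced by the local median, or passed on for inspection, depending on whether the spike is a malfunction or a real event.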
Data Interpolation in ML
Topics Covered:
Introduction to Data Interpolation
Types of Missing Data
Common Techniques for Data Interpolation
Advanced Methods for Data Interpolation
Considerations and Best Practices
Applications and Use Cases
Data interpolation is the technique of estimating values for missing or incomplete data points within
a dataset.
In machine learning, dealing with missing data is a crucial step before building models, as most
algorithms cannot work with missing values directly.
Understanding and addressing missing data is essential to ensure the accuracy and reliability of machine
learning models.
Types of Missing Data
MCAR (Missing Completely at Random) occurs when the missingness of data points is entirely random and unrelated to any other variable.
There is no systematic reason for the absence of data.
Example: A weather station malfunction causing random data loss in recorded temperature.
MAR (Missing at Random) refers to a missingness pattern that depends on observed variables but not on the missing
values themselves.
Example: Missing income information in a survey based on the age of participants.
MNAR (Missing Not at Random) occurs when the missingness is related to the missing values themselves, so it is not random.
Example: People with high income are less likely to disclose their income in a survey.
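The three mechanisms can be simulated on synthetic survey data to see how they differ. Everything in this sketch is an assumption for illustration: the age and income distributions, the missingness rates, and the simplified deterministic MNAR rule that hides the top 20% of incomes.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 1000
age = rng.integers(20, 70, size=n)
income = rng.normal(50_000, 15_000, size=n)

# MCAR: every income value has the same 20% chance of being missing.
mcar_mask = rng.random(n) < 0.2

# MAR: missingness depends only on an observed variable (age under 30).
mar_mask = (age < 30) & (rng.random(n) < 0.6)

# MNAR: missingness depends on the income value itself (highest incomes hidden).
mnar_mask = income > np.quantile(income, 0.8)

print(f"true mean income: {income.mean():.0f}")
for name, mask in [("MCAR", mcar_mask), ("MAR", mar_mask), ("MNAR", mnar_mask)]:
    print(f"{name}: {mask.mean():.0%} missing, observed mean {income[~mask].mean():.0f}")
```

Under MNAR the mean of the observed incomes is pulled below the true mean, which is why this mechanism is the hardest to correct with simple imputation.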
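Common Techniques of Data Interpolation
Code:
import numpy as np
import pandas as pd
# Illustrative sample data (an assumption; the original values were not preserved)
data = {
    'A': [1, 2, np.nan, 4, 5],
    'B': [np.nan, 2, 3, np.nan, 5],
    'C': [1, np.nan, 3, 4, np.nan],
    'D': [1, 2, 3, 4, 5],
}
df = pd.DataFrame(data)
print("Original DataFrame:")
print(df)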
Output:
Original DataFrame:
A B C D
Mean/Median/Mode Imputation
Mean Imputation: This involves replacing missing values with the mean of the available data in that
particular column.
Code:
# Common techniques for data interpolation
# Mean Imputation: fill each column's gaps with that column's mean
df_mean = df.fillna(df.mean())
print("After Mean Imputation:")
print(df_mean)
Output:
After Mean Imputation:
A B C D
Median Imputation: Similar to mean imputation, but uses the median value instead of the
mean.
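Code:
# Median Imputation: fill each column's gaps with that column's median
median_values = df.median()
df_median = df.fillna(median_values)
print("After Median Imputation:", df_median, sep="\n")
Output:
After Median Imputation:
A B C D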
Mode Imputation: Replaces missing values with the mode (most frequently occurring value) of the
respective column.
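Code:
# Imputing missing values with mode (the first mode is used when a column has ties)
mode_values = df.mode().iloc[0]
df_mode = df.fillna(mode_values)
print("DataFrame with Mode Imputation:", df_mode, sep="\n")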
Output:
DataFrame with Mode Imputation:
A B C D
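Forward Filling: Carries the last observed non-missing value forward to fill missing values.
Code:
# Forward filling missing values
df_ffill = df.ffill()
Output:
Forward Fill:
A B C D
Backward Filling: Carries the next observed non-missing value backward to fill missing values.
Code:
# Backward filling missing values
df_bfill = df.bfill()
Output:
Backward Fill:
A B C D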
Linear Interpolation
Estimates missing values by assuming a linear relationship between adjacent observed
values.
Applicable to ordered data, time series, or datasets with continuous variables.
Code:
# Linear Interpolation
df_linear = df.interpolate(method='linear')
print("After Linear Interpolation:")
print(df_linear)
Output:
After Linear Interpolation:
A B C D
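Advanced Methods: Multiple Imputation by Chained Equations (MICE)
Code:
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (must precede the next import)
from sklearn.impute import IterativeImputer
data = np.array([[1, 2, np.nan], [4, np.nan, 6], [7, 8, 9]])  # NaN positions inferred from the output
filled_data = IterativeImputer(random_state=0).fit_transform(data)  # settings are an assumption
print("Imputed Data using MICE:", filled_data, sep="\n")
Output:
Imputed Data using MICE:
[[1. 2. 3.00203274]
[4. 4.99796925 6. ]
[7. 8. 9. ]]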
Considerations and Best Practices
When dealing with missing data, there are several considerations and best practices to keep in mind:
Each interpolation method has its strengths and weaknesses; understanding these helps in choosing a method suited to the data.
Evaluate the impact of imputation on the performance of machine learning models.
Applications and Use Cases
Data interpolation finds applications in various domains such as healthcare (patient data) and finance.
It is crucial in scenarios where missing data cannot be ignored and must be filled in accurately.
Real-Time Code
import pandas as pd
import numpy as np
from fancyimpute import KNN  # assumption: KNN(k=5) below matches fancyimpute's imputer
# Generating a synthetic healthcare dataset with missing values
np.random.seed(42)
num_patients = 100
# Cast to float so that NaN can be stored to simulate missing data
age = np.random.randint(18, 90, num_patients).astype(float)  # assumption: the age range was not preserved
blood_pressure = np.random.randint(80, 180, num_patients).astype(float)
cholesterol = np.random.randint(120, 300, num_patients).astype(float)
# Define indices to introduce missing values
missing_age_indices = np.random.choice(num_patients, size=5, replace=False)
missing_blood_pressure_indices = np.random.choice(num_patients, size=5, replace=False)
missing_cholesterol_indices = np.random.choice(num_patients, size=5, replace=False)
age[missing_age_indices] = np.nan
blood_pressure[missing_blood_pressure_indices] = np.nan
cholesterol[missing_cholesterol_indices] = np.nan
health_data = pd.DataFrame({
    'age': age,
    'blood_pressure': blood_pressure,
    'cholesterol': cholesterol
})
# Impute missing values using KNN imputation
imputer = KNN(k=5)
health_data_imputed = pd.DataFrame(imputer.fit_transform(health_data),
                                   columns=health_data.columns)
print("Healthcare data before imputation:")
print(health_data)
print("\nHealthcare data after KNN imputation:")
print(health_data_imputed)
Feature Extraction
Topics Covered:
Introduction to Feature Extraction
Types of Features
Numeric Features
Categorical Features
Feature Representation
Feature Selection vs. Feature Extraction
Practical Implementation of Feature Extraction
Role in Improving Model Performance
Enhanced Generalization: Feature extraction contributes to better generalization of machine learning
models. It helps the models perform well not only on the training data but also on new, unseen data.
Reduced Overfitting: Feature extraction can mitigate overfitting by eliminating noise and irrelevant
information. This results in models that are more robust and less likely to make predictions based on
spurious correlations.
Types of Features
Numeric Features
Numeric features are quantitative and represent measurable quantities.
Examples include:
Age
Temperature
Income
Height
Categorical Features
Categorical features represent discrete and often qualitative data. They can be further divided into nominal
and ordinal categories.
Examples include:
Nominal: Colors, Gender
Ordinal: Education level, Socioeconomic status
Feature Representation
Raw Data vs. Features
Raw Data: Raw data refers to the original, unprocessed information collected from various sources. It can
be complex and noisy, and may contain irrelevant information.
Features: Features are the transformed, meaningful representations extracted from raw data. They
capture relevant information for a particular task, making it easier for machine learning models to learn
patterns.
Tabular Data Representation
Tabular Structure: Features are often organized in a tabular structure, with each row representing an
instance or observation, and each column representing a feature. This structured format facilitates
analysis and model training.
Vectorization of Features
Vector Representation: Features are commonly represented as vectors, which are one-dimensional
arrays. This allows for mathematical operations and easy integration into machine learning algorithms.
Example: In natural language processing, words can be represented as vectors, enabling operations like
addition and subtraction to capture semantic relationships.
Feature Selection vs. Feature Extraction
Feature Selection:
Involves choosing a subset of the most relevant features from the original set.
The goal is to retain the most informative features while discarding irrelevant or redundant ones.
Use when the dataset has a large number of features, and some features are irrelevant or redundant.
Reduces computational complexity and may improve model interpretability.
Suitable when the importance of specific features is known.
Feature Extraction:
Involves transforming the original features into a new set of features.
This is done by combining or creating new features that capture the essential information in the data.
Use when the data is high-dimensional and a transformation is needed to create a more
compact and informative feature set.
Helpful when the relationships between features are complex and non-linear.
Can be applied when the underlying patterns in the data are not well understood.
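The difference between the two approaches can be sketched with NumPy alone. Here selection keeps the original columns with the highest variance (a simple unsupervised filter, used as a stand-in for "most relevant"), while extraction builds new columns with PCA computed through an SVD. The synthetic data and the choice of k = 2 are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
signal = rng.normal(size=n)

# Four synthetic features: two driven by the same signal, one nearly constant, one independent.
X = np.column_stack([
    signal + 0.1 * rng.normal(size=n),
    2.0 * signal + 0.1 * rng.normal(size=n),
    0.01 * rng.normal(size=n),
    rng.normal(size=n),
])
k = 2

# Feature selection: keep k of the ORIGINAL columns (here, those with the highest variance).
selected_idx = np.sort(np.argsort(X.var(axis=0))[-k:])
X_selected = X[:, selected_idx]

# Feature extraction: build k NEW columns as linear combinations of all features (PCA via SVD).
X_centered = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
X_extracted = X_centered @ Vt[:k].T
explained = (S ** 2)[:k].sum() / (S ** 2).sum()

print("selected columns:", selected_idx)
print(f"variance explained by {k} components: {explained:.2%}")
```

The selected features remain directly interpretable because they are untouched original columns, whereas each extracted component mixes all four inputs.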
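Practical Implementation
# Import necessary libraries
import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.decomposition import PCA
# Load the Iris dataset
iris = datasets.load_iris()
X = iris.data
y = iris.target
# Reduce the four original features to principal components
n_components = 2  # assumption: value not preserved in the source; 2 suits a 2-D plot
pca = PCA(n_components=n_components)
X_pca = pca.fit_transform(X)
# Plot the first two original features next to the PCA features
plt.figure(figsize=(10, 4))
plt.subplot(1, 2, 1)
plt.scatter(X[:, 0], X[:, 1], c=y)  # assumption: scatter calls reconstructed
plt.title('Original Features')
plt.xlabel(iris.feature_names[0])
plt.ylabel(iris.feature_names[1])
plt.subplot(1, 2, 2)
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=y)
plt.title('PCA Features')  # assumption: title not preserved in the source
plt.tight_layout()
plt.show()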
Data Encoding & Nominal Encoding
Topics Covered:
Introduction to Data Encoding
Types of Data in Machine Learning
Understanding Nominal Encoding
Common Nominal Encoding Techniques
One-Hot Encoding
Label Encoding
Binary Encoding
Practical Implementation
Nominal variables are categories with no inherent order, such as country names, colors, or animal types.
These variables cannot be used directly in many machine learning algorithms, which expect numeric input,
so they must first be encoded as numbers.
Some encoding methods, such as one-hot encoding, add one column per category; this increase in dimensionality raises the demand on
computational resources.
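The three encodings listed in the topics can be compared on a small example. This is a plain-Python sketch for illustration, not a library API; the helper names and the color list are assumptions, and categories are numbered in alphabetical order.

```python
def label_encode(values):
    """Map each category to an integer code (alphabetical order)."""
    mapping = {cat: i for i, cat in enumerate(sorted(set(values)))}
    return [mapping[v] for v in values], mapping

def one_hot_encode(values):
    """One column per category; exactly one 1 in each row."""
    categories = sorted(set(values))
    return [[int(v == cat) for cat in categories] for v in values], categories

def binary_encode(values):
    """Write each integer code in base 2, one bit per column."""
    codes, mapping = label_encode(values)
    width = max(1, (len(mapping) - 1).bit_length())
    return [[int(bit) for bit in format(code, f"0{width}b")] for code in codes]

colors = ["red", "green", "blue", "green", "yellow", "red"]
codes, mapping = label_encode(colors)
one_hot, categories = one_hot_encode(colors)
binary = binary_encode(colors)

print(codes)       # [2, 1, 0, 1, 3, 2]
print(one_hot[0])  # [0, 0, 1, 0]
print(binary[0])   # [1, 0]
```

With four categories, one-hot encoding needs four columns while binary encoding needs two; the gap widens as the number of categories grows. Label encoding uses a single column but implies an order that nominal data does not have.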
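Practical Implementation
Code:
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
# Load the Iris dataset into a DataFrame
iris = load_iris()
data = iris.data
target = iris.target
columns = iris.feature_names  # assumption: column names not preserved in the source
df = pd.DataFrame(data, columns=columns)
df["target"] = target
print("Original DataFrame:")
print(df.head())
# Label Encoding
le = LabelEncoder()
df['target_encoded'] = le.fit_transform(df['target'])
print(df.head())
# One-Hot Encoding (sparse_output replaces the removed sparse argument in scikit-learn 1.2+)
ohe = OneHotEncoder(sparse_output=False)
ohe_result = ohe.fit_transform(df[['target']])
ohe_df = pd.DataFrame(ohe_result, columns=ohe.get_feature_names_out(['target']))
df = pd.concat([df, ohe_df], axis=1)
print("DataFrame after One-Hot Encoding:")
print(df.head())
# Train/test split; using the feature columns as inputs is an assumption
X_train, X_test, y_train, y_test = train_test_split(
    df[columns], df['target_encoded'], test_size=0.2, random_state=42)
print("\nTraining Set:")
print(X_train.head())
print("\nTesting Set:")
print(X_test.head())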