
Handling Imbalanced Data in ML

The document outlines lesson plans for handling imbalanced data, outliers, and data interpolation in machine learning. It covers techniques for managing imbalanced datasets, methods for identifying and addressing outliers, and various strategies for imputing missing data. Additionally, it highlights best practices, evaluation metrics, and real-world applications related to these topics.

Uploaded by

SUDHIR NISHAD

Lesson Plan

Handling Imbalanced Data in ML

Topics to be covered:
Understanding Imbalanced Data
Techniques for Handling Imbalanced Data
Evaluation Metrics for Imbalanced Data
Advanced Techniques
Real-world Applications and Case Studies
Best Practices and Considerations
Challenges and Limitations
Tools and Libraries

Understanding Imbalanced Data


Imbalanced datasets refer to those where the distribution of classes is not uniform.
For instance, in a binary classification problem, if one class (majority class) heavily outweighs the other
(minority class), it creates an imbalance.
This can lead to biased models as algorithms tend to favor the majority class, affecting the model's ability
to predict the minority class accurately.
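As a quick illustration of the definition above, this minimal sketch counts class frequencies and an imbalance ratio. The label list and its 95/5 split are made up for demonstration and are not data from this lesson.

```python
from collections import Counter

# Hypothetical binary labels: 95 majority (0) and 5 minority (1) samples
labels = [0] * 95 + [1] * 5

counts = Counter(labels)
majority_class, majority_count = counts.most_common()[0]
minority_class, minority_count = counts.most_common()[-1]
imbalance_ratio = majority_count / minority_count

print("Class counts:", dict(counts))
print(f"Imbalance ratio (majority:minority) = {imbalance_ratio:.0f}:1")
```

A ratio far above 1:1 is a signal to look beyond plain accuracy when evaluating a model.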

Techniques for Handling Imbalanced Data
Resampling Methods
Oversampling: Increasing the number of instances in the minority class.
Undersampling: Reducing the number of instances in the majority class.
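The two resampling ideas can be sketched with NumPy alone. The dataset below is hypothetical (90 majority rows, 10 minority rows), and the variable names are illustrative only. Libraries such as imbalanced-learn offer ready-made samplers for the same purpose.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical dataset: 90 majority rows (label 0) and 10 minority rows (label 1)
X = rng.normal(size=(100, 2))
y = np.array([0] * 90 + [1] * 10)

maj_idx = np.flatnonzero(y == 0)
min_idx = np.flatnonzero(y == 1)

# Random oversampling: draw minority rows with replacement until both classes match
over_idx = np.concatenate([maj_idx, rng.choice(min_idx, size=len(maj_idx), replace=True)])
X_over, y_over = X[over_idx], y[over_idx]

# Random undersampling: keep a random subset of majority rows, without replacement
under_idx = np.concatenate([rng.choice(maj_idx, size=len(min_idx), replace=False), min_idx])
X_under, y_under = X[under_idx], y[under_idx]

print("Original:    ", np.bincount(y))
print("Oversampled: ", np.bincount(y_over))
print("Undersampled:", np.bincount(y_under))
```

Oversampling keeps all information but duplicates minority rows, while undersampling discards majority rows. Resampling is normally applied to the training split only.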


Synthetic Data Generation


Generating synthetic samples to balance the dataset, such as using the ADASYN algorithm.
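To make the idea concrete, here is a simplified ADASYN-style sketch in NumPy. It is an illustration under stated assumptions, not the reference algorithm or the imbalanced-learn implementation: the function name `adasyn_sketch`, the toy data, and the choice to sample seed points by weight (rather than computing exact per-point counts) are all assumptions made for brevity.

```python
import numpy as np

def adasyn_sketch(X, y, minority=1, k=3, seed=0):
    """Simplified ADASYN-style oversampling (illustrative only)."""
    rng = np.random.default_rng(seed)
    X_min = X[y == minority]
    n_needed = int((y != minority).sum() - len(X_min))  # samples needed to balance

    # r_i: share of majority points among the k nearest neighbours of each minority point
    dist_all = np.linalg.norm(X_min[:, None, :] - X[None, :, :], axis=2)
    nn_all = np.argsort(dist_all, axis=1)[:, 1:k + 1]   # column 0 is the point itself
    r = (y[nn_all] != minority).mean(axis=1)
    weights = r / r.sum() if r.sum() > 0 else np.full(len(X_min), 1 / len(X_min))

    # Neighbours restricted to the minority class, used for interpolation
    dist_min = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    nn_min = np.argsort(dist_min, axis=1)[:, 1:k + 1]

    synthetic = []
    # Harder-to-learn minority points (higher r_i) are picked as seeds more often
    for i in rng.choice(len(X_min), size=n_needed, p=weights):
        j = rng.choice(nn_min[i])
        gap = rng.random()
        synthetic.append(X_min[i] + gap * (X_min[j] - X_min[i]))
    return np.array(synthetic)

# Hypothetical data: 40 majority points and 8 minority points in 2D
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 1.0, size=(40, 2)),
               rng.normal(2.0, 1.0, size=(8, 2))])
y = np.array([0] * 40 + [1] * 8)

X_new = adasyn_sketch(X, y)
print("Synthetic minority samples generated:", X_new.shape[0])
```

Each synthetic point lies on the line segment between a minority point and one of its minority neighbours, so it stays inside the region already covered by the minority class.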

Evaluation Metrics for Imbalanced Data
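In imbalanced datasets, accuracy can be misleading due to the disproportionate class distribution: a model that always predicts the majority class can score high accuracy while never detecting the minority class.
Instead, evaluation metrics like precision, recall, F1-score, ROC-AUC, and the precision-recall (PR) curve provide a more comprehensive understanding of model performance.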
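The following sketch shows why accuracy alone can mislead. The labels and the two sets of predictions are invented for illustration, and `classification_metrics` is a hypothetical helper written out by hand so that each formula is visible.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall and F1 for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Hypothetical ground truth: 95 negatives followed by 5 positives
y_true = [0] * 95 + [1] * 5

# Model A always predicts the majority class
always_majority = [0] * 100

# Model B raises 2 false alarms and finds 3 of the 5 positives
some_model = [0] * 93 + [1] * 2 + [1] * 3 + [0] * 2

baseline = classification_metrics(y_true, always_majority)
model = classification_metrics(y_true, some_model)
print("Always-majority:", baseline)
print("Model B:        ", model)
```

Both models have similar accuracy, but only recall and F1 reveal that the always-majority model never finds a positive case.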

Advanced Techniques:
Ensemble methods like XGBoost, AdaBoost, or Random Forests can handle imbalanced data effectively because they can weight samples or classes differently, so that errors on the minority class count for more during training.
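One common way to weight classes is by inverse frequency. The sketch below computes such weights with NumPy using the formula n_samples / (n_classes * class_count), which is the heuristic scikit-learn documents for `class_weight="balanced"`. The label array is hypothetical.

```python
import numpy as np

# Hypothetical labels: 90 majority (0) and 10 minority (1) samples
y = np.array([0] * 90 + [1] * 10)

counts = np.bincount(y)
n_samples, n_classes = len(y), len(counts)

# Inverse-frequency class weights: rarer classes receive larger weights
class_weights = n_samples / (n_classes * counts)

# Per-sample weights, which many estimators accept via a sample_weight argument
sample_weights = class_weights[y]

print("Class weights:", class_weights)
print("Total weight per class:", np.bincount(y, weights=sample_weights))
```

With these weights each class contributes the same total weight, so the minority class is no longer drowned out by the majority.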


Real-world Applications and Case Studies


Fraud Detection in Finance

In finance, imbalanced data is common in fraud detection tasks, where fraudulent transactions are
relatively rare compared to legitimate ones.
Techniques like anomaly detection, oversampling the minority class, or using cost-sensitive learning
methods can be applied.

Medical Diagnosis and Healthcare
In medical diagnosis, imbalanced data can occur when certain diseases or conditions are rare.
Handling imbalanced data here involves careful model evaluation and validation to ensure high
sensitivity (recall) while maintaining specificity.
Techniques like resampling or using specialized algorithms are employed.
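The trade-off between sensitivity and specificity can be sketched by moving the decision threshold on predicted scores. The scores and labels below are invented for illustration, and `sensitivity_specificity` is a hypothetical helper.

```python
import numpy as np

def sensitivity_specificity(y_true, scores, threshold):
    """Return (sensitivity, specificity) when predicting positive for scores >= threshold."""
    y_pred = scores >= threshold
    tp = np.sum(y_pred & (y_true == 1))
    fn = np.sum(~y_pred & (y_true == 1))
    tn = np.sum(~y_pred & (y_true == 0))
    fp = np.sum(y_pred & (y_true == 0))
    return tp / (tp + fn), tn / (tn + fp)

# Hypothetical model scores for 10 patients (1 = has the condition)
scores = np.array([0.9, 0.6, 0.4, 0.35, 0.2, 0.1, 0.45, 0.05, 0.3, 0.15])
y_true = np.array([1,   1,   1,   0,    0,   0,   0,    0,    1,   0])

sens_05, spec_05 = sensitivity_specificity(y_true, scores, threshold=0.5)
sens_03, spec_03 = sensitivity_specificity(y_true, scores, threshold=0.3)

print(f"threshold 0.5: sensitivity={sens_05:.2f}, specificity={spec_05:.2f}")
print(f"threshold 0.3: sensitivity={sens_03:.2f}, specificity={spec_03:.2f}")
```

Lowering the threshold catches every positive case in this toy example, at the cost of more false alarms. Which balance is acceptable depends on the clinical setting.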


Best Practices and Considerations


Before applying techniques to handle imbalanced data, it's crucial to preprocess data, handle missing
values, normalize/standardize features, and perform relevant feature engineering to enhance model
performance.


Challenges and Limitations

Overfitting in Oversampling

Oversampling techniques might lead to overfitting on the minority class. Generating synthetic samples
that are too close to existing ones may hinder the model's ability to generalize.


Tools and Libraries


Libraries like “imbalanced-learn” provide various techniques for handling imbalanced data, including
resampling methods, cost-sensitive learning, and ensemble techniques tailored for imbalanced datasets.

Lesson Plan

Handling Outliers

Topics Covered:
Understanding Outliers
Identifying Outliers
Robust Algorithms
Impact of Outliers on Machine Learning Models
Outliers vs Anomalies
Real-Time Analogy on Handling Outliers

Understanding Outliers
Outliers are data points that significantly differ from other observations in a dataset.

They can arise due to measurement errors, variability in the data, or genuine extreme values.

Outliers can skew statistical analyses, affect model performance, and distort visualizations.

Identifying Outliers
Z-score: Identifies outliers by calculating the deviation of a data point from the mean in terms of standard deviations.
The Z-score is one of the most commonly used tools for detecting outliers. It is the number of standard deviations by which a data point lies above or below the mean.
The Z-score of a value is calculated with the following formula:
z = (x − μ) / σ
Where:
z = Z-score
x = the value being evaluated
μ = the mean
σ = the standard deviation
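A minimal NumPy sketch of Z-score outlier detection follows. The sample values are made up, and the cut-off of |z| > 3 is a common rule of thumb rather than something prescribed by this lesson.

```python
import numpy as np

# Hypothetical measurements with one extreme value (50)
data = np.array([10, 12, 11, 13, 12, 11, 10, 12, 11, 12,
                 13, 11, 10, 12, 11, 12, 13, 11, 12, 50], dtype=float)

mu = data.mean()
sigma = data.std()          # population standard deviation
z_scores = (data - mu) / sigma

threshold = 3.0             # assumed rule-of-thumb cut-off
z_outliers = data[np.abs(z_scores) > threshold]

print("Mean:", mu, "Std:", round(sigma, 3))
print("Outliers by Z-score:", z_outliers)
```

Because the mean and standard deviation are themselves pulled by extreme values, the Z-score method works best when outliers are few and the sample is not tiny.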

Interquartile Range (IQR): Determines outliers based on the range between the 25th and 75th percentiles.

The Interquartile Range (IQR) is a robust statistical measure used to identify outliers in a dataset.

It is the range between the 75th percentile (Q3) and the 25th percentile (Q1) of the data.

To compute it, subtract the first quartile value from the third quartile value:

IQR = Q3 − Q1

To find outliers, you'll need your data's IQR, Q1, and Q3 values. Substitute them into the expressions below. Statisticians call the result of each expression an outlier gate; values below the lower gate or above the upper gate are treated as outliers.

Q1 − 1.5 * IQR: Lower outlier gate

Q3 + 1.5 * IQR: Upper outlier gate
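The outlier gates can be computed directly with NumPy percentiles. The sample below is made up for this sketch.

```python
import numpy as np

# Hypothetical measurements with one extreme value (50)
data = np.array([10, 12, 11, 13, 12, 11, 10, 12, 11, 12,
                 13, 11, 10, 12, 11, 12, 13, 11, 12, 50], dtype=float)

q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1

lower_gate = q1 - 1.5 * iqr
upper_gate = q3 + 1.5 * iqr

iqr_outliers = data[(data < lower_gate) | (data > upper_gate)]

print(f"Q1={q1}, Q3={q3}, IQR={iqr}")
print(f"Gates: [{lower_gate}, {upper_gate}]")
print("Outliers by IQR:", iqr_outliers)
```

Unlike the mean and standard deviation, quartiles barely move when a few extreme values are present, which is why the IQR rule is considered robust.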


Robust Algorithms
Robust algorithms are less sensitive to outliers and handle them better than traditional algorithms such as ordinary least squares.

Robust regression techniques like RANSAC or Huber Regression are resilient to outliers.
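As a hedged illustration, the sketch below compares ordinary least squares with scikit-learn's `HuberRegressor` on synthetic data. The true line (slope 2, intercept 1), the noise level, and the five corrupted points are assumptions chosen for the demonstration.

```python
import numpy as np
from sklearn.linear_model import HuberRegressor, LinearRegression

rng = np.random.default_rng(0)

# Synthetic data along y = 2x + 1 with mild noise
x = np.linspace(0, 10, 50)
y = 2.0 * x + 1.0 + rng.normal(0, 0.5, size=x.size)

# Corrupt the last five observations to simulate outliers
y[-5:] += 40.0

X = x.reshape(-1, 1)
ols = LinearRegression().fit(X, y)
huber = HuberRegressor().fit(X, y)

print("True slope:  2.0")
print("OLS slope:  ", round(ols.coef_[0], 3))
print("Huber slope:", round(huber.coef_[0], 3))
```

The Huber loss grows only linearly for large residuals, so the corrupted points pull the fitted line far less than they do under squared error.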

Impact of Outliers on Machine Learning Models

Outliers can significantly influence the performance of machine learning models by skewing the results.
For instance, in linear regression, outliers can heavily impact the slope and intercept of the regression line, affecting the model's predictions.
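The effect on a regression line can be seen with a ten-point toy example. The data are made up: the points lie exactly on y = 2x + 1, and a single value is then replaced by an outlier.

```python
import numpy as np

x = np.arange(10, dtype=float)
y_clean = 2.0 * x + 1.0          # points exactly on y = 2x + 1

y_outlier = y_clean.copy()
y_outlier[-1] = 100.0            # replace the last value (19) with an outlier

# Least-squares line fits (degree-1 polynomial)
slope_clean, intercept_clean = np.polyfit(x, y_clean, 1)
slope_out, intercept_out = np.polyfit(x, y_outlier, 1)

print(f"Without outlier: slope={slope_clean:.3f}, intercept={intercept_clean:.3f}")
print(f"With outlier:    slope={slope_out:.3f}, intercept={intercept_out:.3f}")
```

A single corrupted point out of ten is enough to more than triple the fitted slope here, because squared error penalizes large residuals heavily.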

Outliers vs Anomalies
Outliers are data points that deviate significantly from other observations, while anomalies are points that are rare or unexpected. An anomaly is not always an outlier, and vice versa.
Real-Time Analogy
Different domains might have specific considerations when dealing with outliers. For instance, in sensor
data, outliers might occur due to measurement errors or equipment malfunction.
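For the sensor scenario, one possible approach is to flag spikes with a median-based (modified) Z-score and then fill the flagged readings by linear interpolation. The temperature readings are invented, and the 0.6745 constant and 3.5 cut-off follow a widely used convention for the modified Z-score. Treat this as a sketch rather than a recommended pipeline.

```python
import numpy as np

# Hypothetical temperature readings with two glitches (85.0 and -40.0)
readings = np.array([20.1, 20.3, 20.2, 20.4, 85.0, 20.3, 20.2, -40.0, 20.1, 20.3])

median = np.median(readings)
mad = np.median(np.abs(readings - median))      # median absolute deviation

# Modified Z-score based on the median and MAD instead of mean and std
modified_z = 0.6745 * (readings - median) / mad
spike = np.abs(modified_z) > 3.5

# Replace flagged readings by interpolating between valid neighbours
idx = np.arange(len(readings))
cleaned = np.interp(idx, idx[~spike], readings[~spike])

print("Flagged indices:", np.flatnonzero(spike))
print("Cleaned readings:", np.round(cleaned, 2))
```

Whether a flagged reading should be repaired, dropped, or investigated depends on the domain: a spike may be a faulty sensor or a genuine event.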
Lesson Plan

Data Interpolation in ML

Topics to be covered:
Introduction to Data Interpolation
Types of Missing Data
Common Techniques for Data Interpolation
Advanced Methods for Data Interpolation
Considerations and Best Practices
Applications and Use Cases

Introduction to Data Interpolation

Data interpolation refers to the technique of estimating values for missing or incomplete data points within a dataset.

In machine learning, dealing with missing data is a crucial step before building models, as most algorithms cannot handle missing values.

Understanding and addressing missing data is essential to ensure the accuracy and reliability of machine learning models.

Types of Missing Data

Missing Completely at Random (MCAR):
MCAR occurs when the missingness of data points is entirely random and unrelated to any other variables in the dataset.
There's no systematic reason for the absence of data.
Example: A weather station malfunction causing random data loss in recorded temperature.

Missing at Random (MAR):
MAR refers to a missingness pattern that depends on observed variables but not on the missing values themselves.
The missingness is related to other observed variables in the dataset.
Example: Missing income information in a survey based on the age of participants.

Missing Not at Random (MNAR):
MNAR occurs when the missingness is related to the missing values themselves, so it is neither random nor fully explained by the observed variables.
Example: People with high income are less likely to disclose their income in a survey.
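The three mechanisms can be simulated to see how they differ. In this sketch the survey data, the missingness probabilities, and the age and income cut-offs are all invented. It mirrors the examples above: random loss, age-dependent loss, and income-dependent loss.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Hypothetical survey: age is always observed, income may be missing
age = rng.integers(20, 70, n)
income = rng.normal(50_000, 15_000, n)

# MCAR: every income value has the same 20% chance of being missing
mcar_mask = rng.random(n) < 0.2

# MAR: missingness depends on an observed variable (age), not on income itself
mar_mask = rng.random(n) < np.where(age > 50, 0.4, 0.1)

# MNAR: missingness depends on the unobserved value (high incomes are hidden more often)
mnar_mask = rng.random(n) < np.where(income > 60_000, 0.6, 0.05)

def observed_mean(mask):
    """Mean income computed only from the values that were not masked out."""
    return income[~mask].mean()

true_mean = income.mean()
print(f"True mean income:     {true_mean:,.0f}")
print(f"Observed under MCAR:  {observed_mean(mcar_mask):,.0f}")
print(f"Observed under MAR:   {observed_mean(mar_mask):,.0f}")
print(f"Observed under MNAR:  {observed_mean(mnar_mask):,.0f}")
```

Under MNAR the observed mean is clearly biased downward, and no imputation based only on the observed incomes can fully correct for that. In this simulation income is independent of age, so MAR also leaves the mean nearly unbiased; when the two are related, MAR requires conditioning on the observed variable.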
Common Techniques for Data Interpolation
Code:
import numpy as np
import pandas as pd

# Creating a sample DataFrame with missing values
data = {
    'A': [1, 2, np.nan, 4, 5],
    'B': [1, np.nan, 3, 4, 5],
    'C': [np.nan, 2, 3, np.nan, 5],
    'D': [1, 2, 3, 4, np.nan]
}

df = pd.DataFrame(data)

print("Original DataFrame:")
print(df)

Output:
Original DataFrame:
     A    B    C    D
0  1.0  1.0  NaN  1.0
1  2.0  NaN  2.0  2.0
2  NaN  3.0  3.0  3.0
3  4.0  4.0  NaN  4.0
4  5.0  5.0  5.0  NaN

Mean/Median/Mode Imputation
Mean Imputation: This involves replacing missing values with the mean of the available data in that particular column.

Code:
# Mean imputation: fill each column's missing values with that column's mean
df_mean = df.fillna(df.mean())

print("\nAfter Mean Imputation:")
print(df_mean)

Output:
After Mean Imputation:
     A     B         C    D
0  1.0  1.00  3.333333  1.0
1  2.0  3.25  2.000000  2.0
2  3.0  3.00  3.000000  3.0
3  4.0  4.00  3.333333  4.0
4  5.0  5.00  5.000000  2.5

Median Imputation: Similar to mean imputation, but uses the median value instead of the mean.

Code:
# Median imputation: fill missing values with the respective column medians
median_values = df.median()
df_median = df.fillna(median_values)

print("DataFrame with Median Imputation:\n", df_median)

Output:
DataFrame with Median Imputation:
     A    B    C    D
0  1.0  1.0  3.0  1.0
1  2.0  3.5  2.0  2.0
2  3.0  3.0  3.0  3.0
3  4.0  4.0  3.0  4.0
4  5.0  5.0  5.0  2.5

Mode Imputation: Replaces missing values with the mode (most frequently occurring value) of the respective column.

Code:
# Mode imputation: fill missing values with the respective column modes
# (when every value is unique, the first row of df.mode() holds the smallest value per column)
mode_values = df.mode().iloc[0]
df_mode = df.fillna(mode_values)

print("DataFrame with Mode Imputation:\n", df_mode)

Output:
DataFrame with Mode Imputation:
     A    B    C    D
0  1.0  1.0  2.0  1.0
1  2.0  1.0  2.0  2.0
2  1.0  3.0  3.0  3.0
3  4.0  4.0  2.0  4.0
4  5.0  5.0  5.0  1.0

Forward Filling and Backward Filling

Forward Filling: Propagates the last observed non-missing value forward to fill missing values. A missing value in the first row stays missing because there is nothing before it to carry forward.
Code:
# Forward filling missing values
df_ffill = df.ffill()

print("Forward Fill:\n", df_ffill)

Output:
Forward Fill:
     A    B    C    D
0  1.0  1.0  NaN  1.0
1  2.0  1.0  2.0  2.0
2  2.0  3.0  3.0  3.0
3  4.0  4.0  3.0  4.0
4  5.0  5.0  5.0  4.0

Backward Filling: Carries the next observed non-missing value backward to fill missing values. A missing value in the last row stays missing because there is nothing after it to carry back.
Code:
# Backward filling missing values
df_bfill = df.bfill()

print("\nBackward Fill:\n", df_bfill)

Output:
Backward Fill:
     A    B    C    D
0  1.0  1.0  2.0  1.0
1  2.0  3.0  2.0  2.0
2  4.0  3.0  3.0  3.0
3  4.0  4.0  5.0  4.0
4  5.0  5.0  5.0  NaN

Linear Interpolation
Estimates missing values by creating a linear relationship between adjacent observed values.
Applicable to ordered data, time-series, or datasets with continuous variables.

Code:
# Linear interpolation between neighbouring observed values in each column
df_linear = df.interpolate(method='linear')

print("\nAfter Linear Interpolation:")
print(df_linear)

Output:
After Linear Interpolation:
     A    B    C    D
0  1.0  1.0  NaN  1.0
1  2.0  2.0  2.0  2.0
2  3.0  3.0  3.0  3.0
3  4.0  4.0  4.0  4.0
4  5.0  5.0  5.0  4.0

Multiple Imputation by Chained Equations (MICE)

MICE is an iterative imputation technique that generates multiple sets of imputed values by cycling through each variable, filling in missing values based on observed values in other variables.
MICE predicts the missing values in one variable from the observed values of the other variables, then repeats this for each variable over several rounds, updating each estimate based on the latest imputed values of the others.

Code:
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (must be imported to enable IterativeImputer)
from sklearn.impute import IterativeImputer

# Sample data with missing values
data = np.array([[1, 2, np.nan],
                 [4, np.nan, 6],
                 [7, 8, 9]])

# IterativeImputer is scikit-learn's MICE-style imputer; by default it returns a single imputed dataset
imputer = IterativeImputer(max_iter=10, random_state=0)

filled_data = imputer.fit_transform(data)  # Impute missing values

print("Imputed Data using MICE:")
print(filled_data)

Output:
Imputed Data using MICE:
[[1.         2.         3.00203274]
 [4.         4.99796925 6.        ]
 [7.         8.         9.        ]]
Considerations and Best Practices

When dealing with missing data, there are several considerations and best practices to keep in mind:

Understanding the Nature of Missing Data:
Analyze the patterns of missing data (MCAR, MAR, MNAR) to choose appropriate imputation methods.

Pros and Cons of Interpolation Techniques:
Each interpolation method has its strengths and weaknesses; understanding these helps in choosing the most suitable technique for the dataset.

Impact on Model Performance:
Evaluate the impact of imputation on the performance of machine learning models; imputing data may introduce biases or affect model accuracy.

Applications and Use Cases

Data interpolation finds applications in various domains like healthcare (patient data), finance (stock market data), and IoT (sensor data).

It's crucial in scenarios where missing data can't be ignored but needs to be accurately filled for analysis or modeling purposes.

Real-Time Code

import pandas as pd
import numpy as np
from fancyimpute import KNN

# Generating a synthetic healthcare dataset with missing values
np.random.seed(42)
num_patients = 100

# Generating arrays as floats so that NaN values can be inserted to simulate missing data
age = np.random.randint(20, 70, num_patients).astype(float)
blood_pressure = np.random.randint(80, 180, num_patients).astype(float)
cholesterol = np.random.randint(120, 300, num_patients).astype(float)

# Define indices to introduce missing values
missing_age_indices = np.random.choice(num_patients, size=5, replace=False)
missing_blood_pressure_indices = np.random.choice(num_patients, size=5, replace=False)
missing_cholesterol_indices = np.random.choice(num_patients, size=5, replace=False)

# Introduce missing values by setting selected indices to NaN
age[missing_age_indices] = np.nan
blood_pressure[missing_blood_pressure_indices] = np.nan
cholesterol[missing_cholesterol_indices] = np.nan

# Creating a DataFrame for healthcare data with missing values
health_data = pd.DataFrame({
    'age': age,
    'blood_pressure': blood_pressure,
    'cholesterol': cholesterol
})

# Imputing missing values using KNN imputation
imputer = KNN(k=5)
health_data_imputed = pd.DataFrame(imputer.fit_transform(health_data),
                                   columns=health_data.columns)

print("Healthcare data before imputation:")
print(health_data.head(10))

print("\nHealthcare data after KNN imputation:")
print(health_data_imputed.head(10))

Output:
Imputing row 1/100 with 0 missing, elapsed time: 0.005

Healthcare data before imputation:
    age  blood_pressure  cholesterol
0  58.0            88.0        215.0
1  48.0           167.0        299.0
2  34.0            80.0        232.0
3  62.0            87.0        181.0
4   NaN           167.0        171.0
5  40.0           142.0          NaN
6  58.0            90.0        158.0
7  38.0           160.0        249.0
8  42.0            87.0        250.0
9  30.0           114.0        232.0

Healthcare data after KNN imputation:
         age  blood_pressure  cholesterol
0  58.000000            88.0   215.000000
1  48.000000           167.0   299.000000
2  34.000000            80.0   232.000000
3  62.000000            87.0   181.000000
4  39.867528           167.0   171.000000
5  40.000000           142.0   290.999936
6  58.000000            90.0   158.000000
7  38.000000           160.0   249.000000
8  42.000000            87.0   250.000000
9  30.000000           114.0   232.000000

Lesson Plan

Feature Extraction

Topics Covered:
Introduction to Feature Extraction
Types of Features
Numeric Features
Categorical Features
Feature Representation
Feature Selection vs. Feature Extraction
Practical Implementation of Feature Extraction

Introduction to Feature Extraction

Feature extraction is a process within machine learning where relevant information or patterns are extracted from raw data to create a more manageable and informative set of features.
The goal is to transform the data into a format that is more suitable for analysis and modeling.
Why is Feature Extraction Necessary?
Dimensionality Reduction: Raw data often contains a large number of features, some of which may be irrelevant or redundant. Feature extraction helps reduce the dimensionality of the data, making it more manageable and mitigating the "curse of dimensionality."
Improved Model Performance: By focusing on the most relevant features, feature extraction can enhance model performance. It helps in capturing the essential information from the data, leading to more efficient and accurate machine learning models.

Role in Improving Model Performance
Enhanced Generalization: Feature extraction contributes to better generalization of machine learning models. It helps the models perform well not only on the training data but also on new, unseen data.
Reduced Overfitting: Feature extraction can mitigate overfitting by eliminating noise and irrelevant information. This results in models that are more robust and less likely to make predictions based on spurious correlations.

Types of Features
Numeric features are quantitative and represent measurable quantities.
Examples include:
Age
Temperature
Income
Height
Categorical features represent discrete and often qualitative data. They can be further divided into nominal and ordinal categories.
Examples include:
Nominal: Colors, Gender
Ordinal: Education level, Socioeconomic status

Feature Representation
Raw Data vs. Features
Raw Data: Raw data refers to the original, unprocessed information collected from various sources. It can be complex and noisy, and may contain irrelevant information.
Features: Features are the transformed, meaningful representations extracted from raw data. They capture relevant information for a particular task, making it easier for machine learning models to learn patterns.
Tabular Data Representation
Tabular Structure: Features are often organized in a tabular structure, with each row representing an instance or observation, and each column representing a feature. This structured format facilitates analysis and model training.
Vectorization of Features
Vector Representation: Features are commonly represented as vectors, which are one-dimensional arrays. This allows for mathematical operations and easy integration into machine learning algorithms.
Example: In natural language processing, words can be represented as vectors, enabling operations like addition and subtraction to capture semantic relationships.

Importance of a Good Feature Representation

Enhanced Model Performance: A good feature representation is critical for the model to accurately capture patterns and relationships within the data.
Facilitates Learning: Well-represented features simplify the learning process for machine learning algorithms, enabling them to generalize better to new, unseen data.

Feature Selection vs. Feature Extraction
Feature Selection:
Involves choosing a subset of the most relevant features from the original set.
The goal is to retain the most informative features while discarding irrelevant or redundant ones.
Use when the dataset has a large number of features, and some features are irrelevant or redundant.
Reduces computational complexity and may improve model interpretability.
Suitable when the importance of specific features is known.

Feature Extraction:
Involves transforming the original features into a new set of features.
This is done by combining or creating new features that capture the essential information in the data.
Use when there is a high dimensionality of data, and a transformation is needed to create a more compact and informative feature set.
Helpful when the relationships between features are complex and non-linear.
Can be applied when the underlying patterns in the data are not well understood.
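The contrast can be sketched with NumPy on made-up data. Selection keeps a subset of the original columns (here chosen by variance, one simple criterion among many), while extraction builds new columns, here with PCA computed through an SVD. The data and the choice of two output features are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: features 0 and 1 are correlated, feature 2 is nearly constant
f0 = rng.normal(0, 3.0, 100)
f1 = 0.8 * f0 + rng.normal(0, 1.0, 100)
f2 = rng.normal(0, 0.1, 100)
X = np.column_stack([f0, f1, f2])

# Feature selection: keep the 2 original columns with the highest variance
keep = np.sort(np.argsort(X.var(axis=0))[-2:])
X_selected = X[:, keep]

# Feature extraction: project centered data onto the top 2 principal directions
X_centered = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
X_extracted = X_centered @ Vt[:2].T

print("Selected original columns:", keep)
print("Selected shape: ", X_selected.shape)
print("Extracted shape:", X_extracted.shape)
```

The selected columns are unchanged copies of the originals and stay interpretable. The extracted columns are linear combinations of all three features and are uncorrelated with each other.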

Practical Implementation
# Import necessary libraries
import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.decomposition import PCA

# Load Iris dataset
iris = datasets.load_iris()
X = iris.data
y = iris.target

# Feature extraction using PCA
n_components = 2  # Number of components to keep after extraction
pca = PCA(n_components=n_components)
X_pca = pca.fit_transform(X)

# Visualize the original and extracted features
plt.figure(figsize=(10, 4))

# Plot original features (first two of the four Iris measurements)
# Note: the colormap name was lost in the source; viridis is an assumed placeholder
plt.subplot(1, 2, 1)
plt.scatter(X[:, 0], X[:, 1], c=y, edgecolor='k', cmap=plt.cm.viridis)
plt.title('Original Features')
plt.xlabel(iris.feature_names[0])
plt.ylabel(iris.feature_names[1])

# Plot extracted features after PCA
plt.subplot(1, 2, 2)
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=y, edgecolor='k', cmap=plt.cm.viridis)
plt.title('Extracted Features (PCA)')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')

plt.tight_layout()
plt.show()

Lesson Plan

Data Encoding & Nominal Encoding

Topics Covered:
Introduction to Data Encoding
Types of Data in Machine Learning
Understanding Nominal Encoding
Common Nominal Encoding Techniques
One-Hot Encoding
Label Encoding
Binary Encoding
Practical Implementation

Introduction to Data Encoding

Data encoding refers to the process of converting categorical or qualitative data into a numerical format that can be easily processed by machine learning algorithms.
It plays a crucial role in preparing and structuring data for analysis and model training.
Many machine learning algorithms, especially those based on mathematical equations, require numerical input.
Data encoding enables algorithms to process and make sense of categorical features, which are prevalent in real-world datasets.
Proper encoding improves the performance and accuracy of machine learning models.
The main types of data encoding are Nominal Encoding, Ordinal Encoding, and Binary Encoding.
The choice of encoding method depends on the nature of the data and the requirements of the machine learning algorithm.
This lesson focuses on Nominal Encoding and its significance in handling categorical variables.

Types of Data in Machine Learning

Categorical Data vs. Numerical Data
Categorical data consists of variables that can take on a limited and fixed number of values or categories.
Numerical data consists of variables that represent measurable quantities and can take on a range of values.
Many real-world datasets contain categorical variables, such as gender, color, or product type.
Without proper handling, machine learning models may misinterpret categorical data, leading to inaccurate predictions or classifications.

Understanding Nominal Encoding

Nominal encoding is a technique used to represent categorical data where no order or ranking is implied among the categories.
It transforms categories into unique numerical representations, allowing machine learning algorithms to understand and process them effectively.
Examples of nominal variables include country names, colors, and animal types.
Encoding is needed because these variables cannot be used directly in many machine learning algorithms.
Some encoding methods increase the dimensionality of the data, which raises the demand on computational resources.

Common Nominal Encoding Techniques


One-Hot Encoding
One-hot encoding is a technique used to convert categorical variables into a binary matrix format. Each category is represented by a binary vector where only one bit is 1 (hot) and the rest are 0 (cold).
For example, if you have a categorical variable "Color" with values ["Red", "Blue", "Green"], after one-hot encoding, each color becomes a binary vector such as [1, 0, 0], [0, 1, 0], and [0, 0, 1] respectively.
Pros:
Preserves all information about the categorical variable.
Works well with algorithms that do not assume any ordinal relationship between categories.
Easy to implement in most machine learning frameworks.
Cons:
Can lead to a high-dimensional sparse matrix, especially if the categorical variable has many unique categories.
May introduce multicollinearity in linear models if not handled properly.
Increases the computational complexity and memory requirement.
Examples:
Sentiment analysis where categorical variables may include emotions like "happy", "sad", or "neutral".
Customer segmentation based on demographics such as age groups, gender, and location.
Image classification tasks where labels represent different objects or classes.
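The "Color" example above can be reproduced with a few lines of plain Python. `one_hot_encode` is a hypothetical helper written for this sketch, and it orders categories by first appearance. Library encoders may order them differently, for instance alphabetically.

```python
def one_hot_encode(values):
    """Return (categories, one-hot rows), with categories in order of first appearance."""
    categories = list(dict.fromkeys(values))          # unique values, order preserved
    index = {category: i for i, category in enumerate(categories)}
    encoded = [[1 if index[v] == i else 0 for i in range(len(categories))]
               for v in values]
    return categories, encoded

colors = ["Red", "Blue", "Green", "Blue"]
categories, encoded = one_hot_encode(colors)

print("Categories:", categories)
for color, row in zip(colors, encoded):
    print(f"{color:<6} -> {row}")
```

Each row contains exactly one 1, and the number of columns equals the number of distinct categories, which is why high-cardinality variables produce wide, sparse matrices.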
Label Encoding
Label encoding is a technique used to convert categorical variables into numerical format by assigning a unique integer to each category.
Each category is replaced by its corresponding integer label. For example, ["Red", "Blue", "Green"] may be encoded as [0, 1, 2].
Because integers carry an order, label encoding is most suitable when there is an ordinal relationship among the categories. It works well for categorical variables where the categories have a natural order or hierarchy. Applied to nominal categories such as colors, the integer codes can suggest a ranking that does not exist.
For instance, "low", "medium", and "high" can be encoded as [0, 1, 2] respectively.
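When the order matters, it is safest to state it explicitly. The sketch below contrasts an explicit mapping with an alphabetical assignment, which is how scikit-learn's `LabelEncoder` assigns codes (it sorts the class labels). The sample values are made up.

```python
levels = ["low", "high", "medium", "low"]

# Explicit mapping that preserves the intended order low < medium < high
order = {"low": 0, "medium": 1, "high": 2}
encoded_ordinal = [order[v] for v in levels]

# Alphabetical assignment: sorted unique labels get 0, 1, 2, ...
alphabetical = {label: i for i, label in enumerate(sorted(set(levels)))}
encoded_alpha = [alphabetical[v] for v in levels]

print("Explicit order:     ", encoded_ordinal)
print("Alphabetical order: ", encoded_alpha, alphabetical)
```

The alphabetical codes place "high" below "low", so an explicit mapping (or an ordinal encoder given the category order) is the safer choice for ordered categories.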
Binary Encoding
Binary encoding is a technique that represents categorical variables as binary numbers.
Each category is first encoded as an integer, and then the integer is converted into binary digits.
These binary digits are used as features. For example, if a categorical variable "Color" with values ["Red", "Blue", "Green"] is binary encoded, it may result in binary representations such as [00, 01, 10].
Reduces dimensionality compared to one-hot encoding while still preserving information.
Can be more memory-efficient, especially for high cardinality categorical variables.
Works well with algorithms that cannot handle high-dimensional sparse matrices efficiently.
Helps in capturing relationships between categories in a more compact representation.
Binary encoding is often compared with one-hot encoding and label encoding. Unlike one-hot encoding, binary encoding results in fewer dimensions, making it computationally more efficient.
Compared to label encoding, binary encoding does not impose any ordinality assumptions on the categorical variables, making it more suitable for non-ordinal categorical data.
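The [00, 01, 10] example can be reproduced in plain Python. `binary_encode` is a hypothetical helper for this sketch. It numbers categories from 0 in order of first appearance, whereas some libraries start numbering at 1.

```python
def binary_encode(values):
    """Return (categories, bit rows): each category's integer code written as binary digits."""
    categories = list(dict.fromkeys(values))                 # unique values, order preserved
    n_bits = max(1, (len(categories) - 1).bit_length())      # bits needed for the largest code
    index = {category: i for i, category in enumerate(categories)}
    rows = [[(index[v] >> bit) & 1 for bit in reversed(range(n_bits))]
            for v in values]
    return categories, rows

colors = ["Red", "Blue", "Green", "Blue"]
color_categories, color_bits = binary_encode(colors)

for color, bits in zip(colors, color_bits):
    print(f"{color:<6} -> {bits}")

# 100 distinct categories need only 7 binary columns instead of 100 one-hot columns
_, wide_bits = binary_encode([f"cat_{i}" for i in range(100)])
print("Columns for 100 categories:", len(wide_bits[0]))
```

The number of columns grows with the logarithm of the number of categories, which is where the memory savings over one-hot encoding come from.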
Practical Implementation
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Load the Iris dataset
iris = load_iris()
data = iris.data
target = iris.target

# Create a DataFrame from the dataset
columns = [f"feature_{i}" for i in range(data.shape[1])]
df = pd.DataFrame(data, columns=columns)
df["target"] = target

# Display the original DataFrame
print("Original DataFrame:")
print(df.head())

# Label Encoding for the 'target' column
le = LabelEncoder()
df['target_encoded'] = le.fit_transform(df['target'])

# Display the DataFrame after Label Encoding
print("\nDataFrame after Label Encoding:")
print(df.head())

# One-Hot Encoding for the 'target' column
# sparse_output=False returns a dense array (this argument was named `sparse` before scikit-learn 1.2)
ohe = OneHotEncoder(sparse_output=False)
ohe_result = ohe.fit_transform(df[['target']])

# Create a new DataFrame for One-Hot Encoding results
ohe_df = pd.DataFrame(ohe_result,
                      columns=[f"target_{i}" for i in range(ohe_result.shape[1])])

# Concatenate the One-Hot Encoding DataFrame with the original DataFrame
df = pd.concat([df, ohe_df], axis=1)

# Display the DataFrame after One-Hot Encoding
print("\nDataFrame after One-Hot Encoding:")
print(df.head())

# Split the data into training and testing sets
# (all target-derived columns are dropped from the features to avoid leaking the label)
X_train, X_test, y_train, y_test = train_test_split(
    df.drop(['target', 'target_encoded'] + ohe_df.columns.tolist(), axis=1),
    df['target_encoded'], test_size=0.2, random_state=42)

# Display the training and testing sets
print("\nTraining Set:")
print(X_train.head())

print("\nTesting Set:")
print(X_test.head())

Output:
DataFrame after One-Hot Encoding:
   feature_0  feature_1  feature_2  feature_3  target  target_encoded  \
0        5.1        3.5        1.4        0.2       0               0
1        4.9        3.0        1.4        0.2       0               0
2        4.7        3.2        1.3        0.2       0               0
3        4.6        3.1        1.5        0.2       0               0
4        5.0        3.6        1.4        0.2       0               0

   target_0  target_1  target_2
0       1.0       0.0       0.0
1       1.0       0.0       0.0
2       1.0       0.0       0.0
3       1.0       0.0       0.0
4       1.0       0.0       0.0

Training Set:
    feature_0  feature_1  feature_2  feature_3
22        4.6        3.6        1.0        0.2
15        5.7        4.4        1.5        0.4
65        6.7        3.1        4.4        1.4
11        4.8        3.4        1.6        0.2
42        4.4        3.2        1.3        0.2

Testing Set:
     feature_0  feature_1  feature_2  feature_3
73         6.1        2.8        4.7        1.2
18         5.7        3.8        1.7        0.3
118        7.7        2.6        6.9        2.3
78         6.0        2.9        4.5        1.5
76         6.8        2.8        4.8        1.4
