
Data Preprocessing Techniques in ML

The document covers essential data preprocessing steps for machine learning, including handling missing values, outliers, noise, duplicates, and encoding categorical data. It discusses techniques for addressing class imbalance, optimizing hyperparameters, and design considerations in machine learning, such as data quality and model robustness. Additionally, it highlights limitations of machine learning and provides case studies of successful applications in various industries.

Uploaded by

Akash s
Copyright
© All Rights Reserved

Module 6 and 7

Contents
Class Imbalance
SMOTE
One class SVM
Optimization of hyperparameters
Data Preprocessing Steps:

1. Import the necessary libraries
2. Load the dataset
3. Handling missing data
4. Handling categorical data: encoding categorical features
5. Normalization and Standardization
6. Handling imbalanced data
7. Data Splitting
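As an illustration of the last step, data splitting, here is a minimal sketch using scikit-learn's train_test_split. The toy arrays, the 80/20 ratio, and the random_state value are assumptions made for the example, not values from the slides.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data (made up): 20 samples, 2 features, 25% minority class.
X = np.arange(40, dtype=float).reshape(20, 2)
y = np.array([0] * 15 + [1] * 5)

# stratify=y keeps the class ratio the same in the train and test splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)
```

Steps such as scaling or resampling are usually fitted on the training split only, so that no information from the test split leaks into the model.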
Data Cleaning
This involves identifying and correcting errors or inconsistencies
in the data, such as missing values, noise, outliers and
duplicates. Various techniques can be used for data cleaning,
such as imputation, removal or transformation.

Handling Missing Values:

Handling Outliers:

Handling Noisy data:


Binning
Handling Missing Values:

1. Fill in the missing value manually
2. Drop the row or column that contains the missing value
3. Use central tendency (mean, median, mode) to
replace the missing value through imputation

[Link]

• dropna() in pandas
• fillna() method in pandas, imputing the mean, median, or mode
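The two pandas methods above can be sketched as follows. The DataFrame, its column names, and the choice of median for the numeric column and mode for the text column are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value per column.
df = pd.DataFrame({
    "age": [25.0, np.nan, 35.0, 40.0],
    "city": ["Pune", "Delhi", None, "Delhi"],
})

# Option 1: drop every row that has at least one missing value.
dropped = df.dropna()

# Option 2: impute. Median for the numeric column, mode for the categorical one.
imputed = df.copy()
imputed["age"] = imputed["age"].fillna(imputed["age"].median())
imputed["city"] = imputed["city"].fillna(imputed["city"].mode()[0])
```

Dropping is simple but discards half of this small table; imputation keeps all rows at the cost of inventing plausible values.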
Handling Outliers

Outliers are data points that significantly differ from other observations in the data
set and can skew statistical analysis or machine learning models.

Use box plots and scatter plots to identify outliers.

Interquartile range (IQR): outliers are often defined as data points that fall below
Q1 - 1.5 * IQR or above Q3 + 1.5 * IQR.
Consider using robust machine learning models that are less sensitive to outliers,
such as support vector machines (SVM), Random Forests, or ensemble methods.
Use k-medoids clustering instead of k-means clustering if there are outliers in the
data.
One-class SVM is an outlier detection method.
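The IQR rule described above can be sketched with NumPy. The sample values are made up, and np.percentile is used with its default linear interpolation; other quartile conventions give slightly different fences.

```python
import numpy as np

# Made-up measurements with one obviously extreme value.
values = np.array([10, 12, 11, 13, 12, 11, 14, 95], dtype=float)

q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Flag points outside the [lower, upper] fences.
is_outlier = (values < lower) | (values > upper)
outliers = values[is_outlier]
cleaned = values[~is_outlier]
```

Whether flagged points should be removed, capped, or kept depends on whether they are errors or genuine rare observations.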
Handling Noise

Irrelevant or erroneous information can obscure meaningful patterns.

Noise removal techniques, such as filtering outliers or smoothing data, help
retain essential information.

Data binning or bucketing is a data preprocessing method used to minimize
the effects of small observation errors. This has a smoothing effect on the
input data and may also reduce the chances of overfitting in case of small
datasets. Equal-width binning and equal-frequency binning are two commonly
used techniques.
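A minimal sketch of the two binning techniques using pandas, assuming a made-up series of ages and four bins. pd.cut gives equal-width bins and pd.qcut gives equal-frequency bins; replacing each value by its bin mean is one common way to get the smoothing effect mentioned above.

```python
import pandas as pd

ages = pd.Series([3, 7, 15, 22, 24, 31, 45, 80])  # made-up values

# Equal-width binning: 4 bins that each span the same range of values.
equal_width = pd.cut(ages, bins=4, labels=False)

# Equal-frequency binning: 4 bins that each hold about the same number of rows.
equal_freq = pd.qcut(ages, q=4, labels=False)

# Smoothing by bin means: replace each value with the mean of its bin.
smoothed = ages.groupby(equal_freq).transform("mean")
```

With skewed data like this, equal-width bins put half of the rows in the first bin, while equal-frequency bins hold two rows each.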
Handling Duplicates

Duplicate records can skew analysis and model training by inflating
certain patterns.
To detect and handle duplicates:
• Use functions like duplicated() in pandas to identify duplicate rows based on
specific columns or the entire row.
• If duplicate records are redundant and provide no additional information, you
can remove them using the drop_duplicates() function in pandas or similar
methods in other tools.
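The duplicated() and drop_duplicates() calls mentioned above might be used as in this sketch. The DataFrame and its column names are hypothetical.

```python
import pandas as pd

# Hypothetical transactions: row 2 repeats row 1 exactly, rows 3 and 4 share an id.
df = pd.DataFrame({
    "customer_id": [1, 2, 2, 3, 3],
    "amount": [100, 250, 250, 75, 80],
})

full_row_dupes = df.duplicated()                  # True where the whole row repeats
id_dupes = df.duplicated(subset=["customer_id"])  # True where the id was seen before

# Remove exact duplicate rows, keeping the first occurrence.
deduped = df.drop_duplicates().reset_index(drop=True)
```

Checking a subset of columns flags more rows than checking the entire row, so the choice should follow what counts as "the same record" in the dataset.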
Encoding Categorical data

Why Encode Categorical Data?

Encoding is a crucial step in the data preprocessing pipeline, ensuring that the data is in
a suitable format for training and evaluation.
Machine Learning Algorithms: Most machine learning algorithms, such as linear
regression, support vector machines, and neural networks, require numerical input.
Categorical data needs to be converted into a numerical format to be used effectively.

Impact of encoding on model performance: Proper encoding can significantly impact
the performance of a machine learning model. Incorrect or suboptimal encoding can
lead to poor model performance and inaccurate predictions.
Qualitative and Quantitative Data

Types of Categorical Data


Categorical data can be broadly classified into two types:
1. Nominal Data: This type of data represents categories without any inherent order.
Examples include gender (male, female), color (red, blue, green), and country (USA,
India, UK).

2. Ordinal Data: This type of data represents categories with a meaningful order or
ranking. Examples include education level (high school, bachelor's, master's, PhD)
and customer satisfaction (low, medium, high).

•Encoding Techniques in Sklearn

• Label Encoding
• One-hot Encoding
• Ordinal Encoding
• Target Encoding

Label encoding

Label encoding is a technique for encoding categorical variables as
numeric values, with each category assigned a unique integer.

One-hot encoding
One-hot encoding represents categorical data as a set of binary (0/1)
columns, one per category. For each row, the column of its own
category is 1 and all the others are 0.
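The encodings described above can be compared on a small example. This is a sketch under assumptions: the DataFrame and its columns are made up, the category order for the ordinal column is stated by hand, and pandas get_dummies is used for one-hot encoding (scikit-learn's OneHotEncoder is an alternative).

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

df = pd.DataFrame({
    "color": ["red", "blue", "green", "blue"],          # nominal
    "satisfaction": ["low", "high", "medium", "low"],   # ordinal
})

# Label encoding: one integer per category, assigned in sorted (alphabetical) order.
color_labels = LabelEncoder().fit_transform(df["color"])

# Ordinal encoding: integers that follow an order we state explicitly.
ordinal = OrdinalEncoder(categories=[["low", "medium", "high"]])
satisfaction_codes = ordinal.fit_transform(df[["satisfaction"]]).ravel()

# One-hot encoding: one 0/1 indicator column per category.
color_onehot = pd.get_dummies(df["color"], prefix="color", dtype=int)
```

Label-encoding a nominal feature such as color imposes an arbitrary order (blue < green < red), which is why one-hot encoding is usually preferred for nominal features. scikit-learn documents LabelEncoder as intended for target labels rather than input features.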
Data Scaling
Data Scaling is a data preprocessing step for numerical features. Scaling is the process of
transforming numerical features to a common range or scale.

Two Techniques:
Standardization - StandardScaler from scikit-learn
Normalization - MinMaxScaler from scikit-learn
Standardization

Z-score normalization (standardization)


Z-score normalization (standardization) transforms features to have a mean (μ) of 0
and a standard deviation (σ) of 1. It does not require the data to be Gaussian (bell
curve), but it is most meaningful when features are roughly Gaussian. The
transformation performed by StandardScaler can be expressed mathematically as:
Z = (x - μ) / σ

StandardScaler is less distorted by outliers than min-max scaling, but it is not
robust to them: the mean and standard deviation are themselves pulled by extreme
values.

StandardScaler() from scikit-learn


Normalization

Min-max scaling is very often simply called ‘normalization.’ It transforms
features to a specified range, typically between 0 and 1. It is highly sensitive
to outliers, since the min and max are affected by extreme values.
X_normalized = (X - X_min) / (X_max - X_min)
MinMaxScaler() from scikit-learn

Robust Scaling
Robust Scaling uses the median and the interquartile range to scale the data, making it
robust to outliers.

RobustScaler() from scikit-learn
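The three scalers can be compared on a single feature that contains one extreme value. The data are made up; the comments state the formula each scaler applies with its default settings.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, RobustScaler, StandardScaler

# One feature with a single extreme value (toy data).
x = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

standardized = StandardScaler().fit_transform(x)   # (x - mean) / std
normalized = MinMaxScaler().fit_transform(x)       # (x - min) / (max - min)
robust = RobustScaler().fit_transform(x)           # (x - median) / IQR
```

In this example min-max scaling squeezes the four ordinary values into roughly the bottom 3% of the [0, 1] range, while robust scaling keeps them evenly spaced around 0 and leaves the extreme value far away.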


Class Imbalance

Class imbalance occurs when one class in a classification problem significantly
outnumbers the other class. It’s common in many machine learning problems.

A model trained on an imbalanced dataset often performs poorly on the minority class.

Class imbalance is addressed by the following techniques:

1. Random oversampling
2. Random undersampling
3. Oversampling with SMOTE
4. Undersampling with Tomek Link
5. Oversampling with ADASYN
6. Oversampling with SMOTE, then undersampling with Tomek Link (SMOTE-Tomek)
Random Oversampling

Random oversampling adds copies of randomly chosen samples from the
minority class.

Random Undersampling
Conversely, random undersampling removes existing samples from
the majority class.

imbalanced-learn is a Python package offering a number of re-sampling
techniques commonly used in datasets showing strong
between-class imbalance.
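To show what random oversampling and undersampling do to the data, here is a NumPy-only sketch. It is not the imbalanced-learn implementation (that package provides RandomOverSampler and RandomUnderSampler for the same purpose); the toy arrays and the seed are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 8 majority rows (class 0) and 2 minority rows (class 1).
X = np.arange(20, dtype=float).reshape(10, 2)
y = np.array([0] * 8 + [1] * 2)

minority_idx = np.flatnonzero(y == 1)
majority_idx = np.flatnonzero(y == 0)

# Random oversampling: draw minority rows with replacement until classes match.
extra = rng.choice(minority_idx, size=len(majority_idx) - len(minority_idx), replace=True)
over_idx = np.concatenate([majority_idx, minority_idx, extra])
X_over, y_over = X[over_idx], y[over_idx]

# Random undersampling: keep a random subset of majority rows, without replacement.
kept = rng.choice(majority_idx, size=len(minority_idx), replace=False)
under_idx = np.concatenate([kept, minority_idx])
X_under, y_under = X[under_idx], y[under_idx]
```

Oversampling keeps all information but repeats minority rows exactly, while undersampling discards majority rows, which can hurt when data are scarce.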
Oversampling with SMOTE (Synthetic Minority Oversampling
Technique)
This technique generates synthetic data for the minority class.
SMOTE works by randomly picking a point from the minority
class and computing the k-nearest neighbors for this point
within the minority class.
The synthetic points are added between the chosen point and
its neighbors.
The SMOTE algorithm works in 4 simple steps:

1. Choose a minority class as the input vector.
2. Find its k nearest neighbors (k_neighbors is specified as an
argument in the SMOTE() function).
3. Choose one of these neighbors and place a synthetic point
anywhere on the line joining the point under consideration and its
chosen neighbor.
4. Repeat the steps until the data is balanced.
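The steps above can be sketched from scratch with NumPy. This is an illustrative simplification, not the imbalanced-learn SMOTE() implementation: the function name smote_sketch, its parameters, and the toy points are hypothetical, and it assumes k_neighbors is smaller than the number of minority samples.

```python
import numpy as np

def smote_sketch(X_min, n_new, k_neighbors=3, seed=0):
    """Generate n_new synthetic rows from the minority samples X_min."""
    rng = np.random.default_rng(seed)
    # Pairwise Euclidean distances between minority samples.
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)            # a point is not its own neighbor
    neighbors = np.argsort(dists, axis=1)[:, :k_neighbors]

    synthetic = np.empty((n_new, X_min.shape[1]))
    for i in range(n_new):                     # step 4: repeat until enough points
        p = rng.integers(len(X_min))           # step 1: pick a minority point
        q = rng.choice(neighbors[p])           # steps 2-3: pick one of its k neighbors
        gap = rng.random()                     # position along the segment, in [0, 1)
        synthetic[i] = X_min[p] + gap * (X_min[q] - X_min[p])
    return synthetic

# Four minority points on the corners of a unit square (toy data).
X_min = np.array([[1.0, 1.0], [2.0, 1.0], [1.0, 2.0], [2.0, 2.0]])
new_points = smote_sketch(X_min, n_new=6)
```

To balance a dataset, n_new would be set to the majority count minus the minority count. Every synthetic point lies on a segment between two real minority points, so here all of them fall inside the square.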
Though this algorithm is quite useful, it has a few drawbacks associated with it.

If the minority class has noisy or mislabelled instances, SMOTE can amplify the noise by generating synthetic
examples based on these erroneous data points. SMOTE can therefore create a large number of noisy data points in
feature space.

Since SMOTE generates synthetic samples that are close to existing minority class samples, it can lead to overfitting,
especially if the model becomes too tailored to these synthetic examples rather than capturing the underlying patterns.
Under-Sampling with Tomek Links
A Tomek link is a pair of very close instances from opposite
classes. Removing the majority-class instance of each pair
increases the space between the two classes, facilitating the
classification process.
A Tomek link exists if the two samples belong to different classes
and are the nearest neighbors of each other.
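The definition above translates into a short NumPy sketch. The helper name tomek_links and the toy points are hypothetical; imbalanced-learn offers a TomekLinks undersampler for real use.

```python
import numpy as np

def tomek_links(X, y):
    """Return index pairs (i, j) that are mutual nearest neighbors with different labels."""
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)            # ignore distance to self
    nearest = dists.argmin(axis=1)
    return [
        (i, int(j)) for i, j in enumerate(nearest)
        if nearest[j] == i and y[i] != y[j] and i < j
    ]

# Toy data: points 0 and 1 are close together but belong to different classes.
X = np.array([[0.0, 0.0], [0.2, 0.0], [1.0, 1.0], [1.1, 1.0], [3.0, 3.0]])
y = np.array([0, 1, 0, 0, 1])                  # class 0 is the majority here

links = tomek_links(X, y)
# Undersample: drop the majority-class member of each link.
drop = {i if y[i] == 0 else j for i, j in links}
keep = np.array([i for i in range(len(y)) if i not in drop])
X_res, y_res = X[keep], y[keep]
```

Points 2 and 3 are also mutual nearest neighbors, but they share a class, so they do not form a Tomek link and are kept.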
One Class SVM

One Class Support Vector Machines (SVMs) are a type of
unsupervised outlier detection method.
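A minimal sketch of one-class SVM with scikit-learn's OneClassSVM. The training data are synthetic, and the kernel, gamma, and nu values are assumptions chosen for illustration rather than recommended settings.

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
# Synthetic "normal" data only: 200 points around the origin.
X_train = rng.normal(loc=0.0, scale=1.0, size=(200, 2))

# nu is an upper bound on the fraction of training points treated as outliers.
model = OneClassSVM(kernel="rbf", gamma="scale", nu=0.05).fit(X_train)

# One point near the training data and one far away from it.
X_new = np.array([[0.1, -0.2], [8.0, 8.0]])
pred = model.predict(X_new)   # +1 marks an inlier, -1 marks an outlier
```

The model is fitted without labels: it learns a boundary around the training data and flags points that fall outside it.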
Optimization of hyperparameters

Hyperparameter optimization is the process of tuning the settings (aka
hyperparameters) of a machine learning model to get the best performance.

Unlike model parameters (like weights in a neural network), which are learned
during training, hyperparameters are set before training. Examples include:
•Learning rate
•Number of layers/neurons
•Batch size
•Number of trees in a random forest
•Regularization strength
Common Optimization Techniques

1. Grid Search
2. Random Search
3. Bayesian Optimization
4. Evolutionary Algorithms / Genetic Algorithms

Libraries like:

1. Scikit-learn’s GridSearchCV / RandomizedSearchCV
2. Keras Tuner
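Grid search with scikit-learn's GridSearchCV might look like the sketch below. The dataset (Iris), the model, the candidate values in param_grid, and cv=3 are all assumptions chosen to keep the example small.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = load_iris(return_X_y=True)

# Candidate hyperparameter values (chosen arbitrarily for illustration).
param_grid = {"n_estimators": [10, 50], "max_depth": [2, None]}

# Every combination in the grid is scored with 3-fold cross-validation.
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    cv=3,
    scoring="accuracy",
)
search.fit(X, y)

best_params = search.best_params_   # combination with the best mean CV score
best_score = search.best_score_     # that mean cross-validated accuracy
```

Grid search tries all 4 combinations here; RandomizedSearchCV has a similar interface but samples a fixed number of combinations (n_iter), which scales better when the grid is large.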
Design constraints and considerations in machine learning

1. Data Quality and Quantity


Challenge: Machine learning algorithms are voracious consumers
of data, but they demand high-quality data. Garbage in, garbage
out — the adage rings true in the world of ML. The data used for
training machine learning models should be clean, accurate, and
representative of the problem. Data preprocessing, cleaning, and
augmentation are often required to ensure data quality.
Additionally, having a sufficient quantity of data is crucial, as
models need diverse examples to learn effectively.
Real-life Example: Healthcare providers rely on patient records
to train diagnostic models. Incomplete or inaccurate data can lead
to erroneous predictions, risking patient health.

2. Overfitting and Underfitting


Challenge: Machine learning models can overfit (become overly
complex) or underfit (too simplistic). Striking the right balance is
critical for model performance. Overfitting occurs when a model
fits the training data too closely, capturing noise instead of useful
patterns. Underfitting, on the other hand, results from overly
simplistic models that can’t capture complex relationships in the
data. Addressing these issues often involves hyperparameter
tuning and cross-validation.
Real-life Example: In stock market prediction, an overfit model
may perform exceptionally well on historical data but fail to
generalize to new market conditions, leading to poor investment
decisions.
3. Interpretability and Explainability


Challenge: Many machine learning models operate as “black
boxes,” making it challenging to comprehend the reasoning
behind their decisions. Interpretability and explainability are
crucial in applications where human lives or significant financial
decisions are at stake. Understanding why a model made a
specific decision can be essential for transparency,
accountability, and trust.

Real-life Example: In the context of autonomous vehicles,
understanding why a self-driving car made a specific decision is
vital for safety and accountability.

4. Generalization
Challenge: A successful machine learning model should perform
well on new, unseen data, and achieving this generalization is a core
challenge. Overfit models may perform well on training data but fail
to make accurate predictions on new, unseen data. Techniques such
as cross-validation and regularization are employed to measure and
improve generalization.

Real-life Example: A spam email classifier may excel in identifying
common spam, but it could falter when new, sophisticated spam
techniques emerge.

6. Computational Resources
Challenge: Training deep learning models, especially large neural
networks, requires substantial computational resources, including
powerful GPUs and TPUs. Deep learning models with millions of
parameters demand significant computing power. This not only
drives up costs but also raises concerns about energy
consumption and environmental impact.

Real-life Example: Training state-of-the-art language models like
GPT-3 demands massive computing power and energy
consumption, raising environmental concerns.

7. Model Selection
Challenge: Selecting the appropriate machine learning
algorithm or model architecture for a specific problem can be
perplexing. Choosing the wrong one may result in suboptimal
performance. The choice of model depends on the data type,
problem type (classification, regression, clustering), and the
desired output. It also involves deciding between traditional
machine learning algorithms and deep learning methods.

Real-life Example: Image recognition tasks benefit from
convolutional neural networks (CNNs), while natural language
processing tasks are commonly handled with recurrent neural
networks (RNNs) or transformer-based models.
Limitations of Machine Learning

Data Dependency
Limitation: Machine learning models are highly dependent on
the data they are trained on. They can’t provide meaningful
insights beyond their training data. Machine learning models
make predictions based on the patterns they’ve learned from
historical data. They can’t offer insights or predictions about
events or phenomena that fall outside the scope of their training
data.

Real-life Example: A sentiment analysis model trained on
restaurant reviews can’t offer insights into political sentiment.

Data Privacy and Security


Limitation: Machine learning often involves processing
sensitive data, raising concerns about privacy and security
breaches. Machine learning applications, especially in
healthcare, finance, and personalization, often require access
to sensitive data. Safeguarding this data from breaches and
ensuring compliance with data protection regulations are
significant challenges.

Real-life Example: Healthcare providers must ensure that
patient data is securely processed and stored when using
machine learning for diagnostics.

Model Robustness
Limitation: Machine learning models can be fragile. Small changes
in input data can result in incorrect predictions or cause the model
to fail. Models can be sensitive to minor changes in input data,
which is problematic when deployed in real-world environments
where data may be noisy or incomplete. Ensuring model robustness
is a continuous challenge.

Real-life Example: In an image recognition system, minor
alterations to an image may render it unrecognizable to the model.

Scalability
Limitation: Scaling machine learning solutions to handle
increased data volumes and user traffic can be challenging,
requiring substantial infrastructure investments. As
organizations grow and the volume of data and users
increases, machine learning systems may need to be
re-architected to maintain performance. This involves investing
in infrastructure and distributed computing.

Real-life Example: E-commerce websites must invest in
powerful servers and databases to scale up recommendation
systems as their customer base grows.
Case Studies

1. Google’s DeepMind: Diabetic Retinopathy Detection

Company: Google DeepMind


Problem Definition: Diabetic retinopathy is a diabetes
complication affecting the eyes and a leading cause of
blindness among working-age adults. Early detection is crucial
for effective treatment, but many patients lack access to
screening services.

Solution: DeepMind developed a machine learning model that
analyzes eye images to detect signs of diabetic retinopathy
automatically. The model uses deep learning techniques to
interpret optical coherence tomography (OCT) and fundus
photography, identifying disease markers that are often subtle
and challenging for human examiners.

2. PayPal: Fraud Detection

Company: PayPal
Problem Definition: As a major online payment platform,
PayPal faces constant threats from various forms of financial
fraud, including unauthorized transactions and identity theft,
which can result in significant financial losses.

Solution: PayPal implemented a machine learning system to
enhance its fraud detection capabilities. The system analyzes
millions of transactions in real-time, utilizing algorithms to
identify patterns and anomalies that suggest fraudulent activity.

3. Amazon: Personalized Recommendations

Company: Amazon
Problem Definition: With millions of products and a diverse
customer base, Amazon sought to improve shopping experiences
by making personalized product recommendations to its users,
aiming to increase sales and customer satisfaction.

Solution: Amazon developed a sophisticated machine learning
recommendation system that analyzes individual customer data,
including past purchases, search histories, and browsing
behaviors. This system uses collaborative filtering and deep
learning to predict and suggest products that a user is likely to
be interested in.

4. General Electric: Predictive Maintenance

Company: General Electric (GE)


Problem Definition: In industries like aviation and energy,
equipment failure can lead to massive operational disruptions
and high maintenance costs. GE needed a way to predict
equipment failures before they occurred to optimize
maintenance schedules and reduce downtime.

Solution: GE developed predictive maintenance software
using machine learning algorithms that analyze data from
machinery-embedded sensors. These algorithms predict when
equipment will likely fail by detecting anomalies and patterns
that deviate from normal operation.

5. Tesla: Autonomous Driving

Company: Tesla, Inc.


Problem Definition: Enhancing vehicle safety and efficiency
through automation is a major goal in the automotive industry.
Tesla aimed to develop a reliable self-driving system to handle
complex driving tasks and adapt to diverse driving conditions.

Solution: Tesla’s Autopilot system uses machine learning to
process data from cameras, radar, and sensors to enable
autonomous driving capabilities. The system includes
auto-steering, traffic-aware cruise control, and automatic lane changes.
