Data Preprocessing Techniques in ML

The document covers essential data preprocessing steps for machine learning, including handling missing values, outliers, noise, duplicates, and encoding categorical data. It discusses techniques for addressing class imbalance, optimizing hyperparameters, and design considerations in machine learning, such as data quality and model robustness. Additionally, it highlights limitations of machine learning and provides case studies of successful applications in various industries.

Uploaded by

Akash s

Module 6 and 7

Contents
Class Imbalance
SMOTE
One class SVM
Optimization of hyperparameters
Data Preprocessing Steps:

1. Import the necessary libraries

2. Load the dataset
3. Handle missing data
4. Encode categorical data
Encoding Categorical Features

5. Normalization and Standardization

6. Handling imbalanced data
7. Data Splitting
Data Cleaning
This involves identifying and correcting errors or inconsistencies
in the data, such as missing values, noise, outliers and
duplicates. Various techniques can be used for data cleaning,
such as imputation, removal or transformation.

Handling Missing Values:

Handling Outliers:

Handling Noisy data:

Binning
Handling Missing Values:

1. Fill missing value manually

2. Drop missing value column/row
3. Use central tendency (mean, median, mode) to
replace missing value through imputation

[Link]

• dropna() in pandas
• fillna() method in pandas by imputing mean, median, or mode
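The two pandas calls above can be sketched as follows. This is a minimal illustration on a made-up DataFrame; the column names and values are assumptions, not part of the original slides.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; column names and values are illustrative only.
df = pd.DataFrame({
    "age": [25.0, np.nan, 31.0, 40.0],
    "city": ["Pune", "Delhi", None, "Delhi"],
})

# Option 1: drop every row that contains at least one missing value.
dropped = df.dropna()

# Option 2: impute instead of dropping.
# Median for the numeric column, mode (most frequent value) for the categorical one.
imputed = df.copy()
imputed["age"] = imputed["age"].fillna(imputed["age"].median())
imputed["city"] = imputed["city"].fillna(imputed["city"].mode()[0])

print(dropped)
print(imputed)
```

Dropping is simplest but discards data; imputation keeps every row at the cost of introducing estimated values.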
Handling Outliers

Outliers are data points that significantly differ from other observations in the data
set and can skew statistical analysis or machine learning models.

Use box plots and scatter plots to identify outliers.

Interquartile range (IQR): outliers are often defined as data points that fall below
Q1 - 1.5 * IQR or above Q3 + 1.5 * IQR, where IQR = Q3 - Q1.
Consider using machine learning models that are less sensitive to outliers,
such as support vector machines (SVM), Random Forests, or other ensemble methods.
Use k-medoids clustering instead of k-means clustering if there are outliers in the
data.
One-class SVM is an outlier detection method.
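The IQR rule can be sketched in a few lines of NumPy. The sample values are made up for illustration, and the 1.5 multiplier is the conventional choice stated in the rule, not a requirement.

```python
import numpy as np

# Hypothetical measurements with one obviously extreme value.
values = np.array([10, 12, 11, 13, 12, 11, 14, 95])

# Quartiles and interquartile range.
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1

# Fences: anything outside [lower, upper] is flagged as an outlier.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

is_outlier = (values < lower) | (values > upper)
outliers = values[is_outlier]
cleaned = values[~is_outlier]

print("fences:", lower, upper)
print("outliers:", outliers)
```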
Handling Noise

Irrelevant or erroneous information can obscure meaningful patterns.

Noise removal techniques, such as filtering outliers or smoothing data, help
retain essential information.

Data binning or bucketing is a data preprocessing method used to minimize
the effects of small observation errors. This has a smoothing effect on the
input data and may also reduce the chances of overfitting in case of small
datasets. Equal-width binning and equal-frequency binning are two commonly
used techniques.
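A minimal sketch of both binning styles with pandas, assuming a made-up series of ages and four bins. `pd.cut` splits the value range into equal-width intervals, while `pd.qcut` puts roughly the same number of observations in each bin.

```python
import pandas as pd

# Hypothetical ages; the values and the choice of 4 bins are illustrative.
ages = pd.Series([18, 22, 25, 31, 38, 45, 52, 67])

# Equal-width binning: each bin spans the same range of values.
equal_width = pd.cut(ages, bins=4, labels=False)

# Equal-frequency binning: each bin holds about the same number of points.
equal_freq = pd.qcut(ages, q=4, labels=False)

# Smoothing by bin means: replace each value with the mean of its bin.
smoothed = ages.groupby(equal_freq).transform("mean")

print(pd.DataFrame({
    "age": ages,
    "equal_width": equal_width,
    "equal_freq": equal_freq,
    "smoothed": smoothed,
}))
```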
Handling Duplicates

Duplicate records can skew analysis and model training by inflating
certain patterns.
To detect and handle duplicates:
• Use functions like duplicated() in pandas to identify duplicate rows based on
specific columns or the entire row.
• If duplicate records are redundant and provide no additional information, you
can remove them using the drop_duplicates() function in pandas or similar
methods in other tools.
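The two pandas functions can be sketched as follows. The DataFrame and its column names are hypothetical and chosen only to show the difference between whole-row and per-column duplicate checks.

```python
import pandas as pd

# Hypothetical transactions; row 2 repeats row 1 exactly,
# row 4 repeats only the customer_id of row 3.
df = pd.DataFrame({
    "customer_id": [1, 2, 2, 3, 3],
    "amount": [100, 250, 250, 75, 80],
})

# Flag rows that repeat an earlier row across all columns.
full_dupes = df.duplicated()

# Flag rows that repeat an earlier row on a chosen subset of columns.
id_dupes = df.duplicated(subset=["customer_id"])

# Remove exact duplicates, or keep only the first row per customer_id.
deduped = df.drop_duplicates()
deduped_by_id = df.drop_duplicates(subset=["customer_id"], keep="first")

print(deduped)
print(deduped_by_id)
```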
Encoding Categorical data

Why Encode Categorical Data?

Encoding is a crucial step in the data preprocessing pipeline, ensuring that the data is in
a suitable format for training and evaluation.
Machine Learning Algorithms: Most machine learning algorithms, such as linear
regression, support vector machines, and neural networks, require numerical input.
Categorical data needs to be converted into a numerical format to be used effectively.

Impact of encoding on model performance: Proper encoding can significantly impact
the performance of a machine learning model. Incorrect or suboptimal encoding can
lead to poor model performance and inaccurate predictions.
Qualitative and Quantitative Data

Types of Categorical Data

Categorical data can be broadly classified into two types:
1. Nominal Data: This type of data represents categories without any inherent order.
Examples include gender (male, female), color (red, blue, green), and country (USA,
India, UK).

2. Ordinal Data: This type of data represents categories with a meaningful order or
ranking. Examples include education level (high school, bachelor's, master's, PhD)
and customer satisfaction (low, medium, high).

•Encoding Techniques in Sklearn

• Label Encoding
• One-hot Encoding
• Ordinal Encoding
• Target Encoding

Label encoding

Label encoding is a technique for encoding categorical variables as
numeric values, with each category assigned a unique integer.
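A minimal sketch of label encoding with scikit-learn's LabelEncoder, using a made-up list of colors. Note that LabelEncoder assigns integers in sorted order of the category names, so the integers carry no meaningful ranking; scikit-learn documents it as intended for target labels, and OrdinalEncoder is the usual choice for input features.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical categorical values.
colors = ["red", "blue", "green", "blue", "red"]

encoder = LabelEncoder()
codes = encoder.fit_transform(colors)

# classes_ holds the categories in sorted order; the index is the integer code.
print(list(encoder.classes_))
print(list(codes))

# The mapping is reversible.
decoded = encoder.inverse_transform(codes)
print(list(decoded))
```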

One-hot encoding
One-hot encoding is a process of representing categorical data as
a set of binary columns, one per category, where each row has a 1 in the
column of its category and 0 in all the others.
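One-hot encoding can be sketched with pandas `get_dummies`; the `color` column is a made-up example. scikit-learn's OneHotEncoder does the same job inside a preprocessing pipeline.

```python
import pandas as pd

# Hypothetical nominal feature.
df = pd.DataFrame({"color": ["red", "blue", "green", "blue"]})

# One binary column per category; dtype=int gives 0/1 instead of booleans.
one_hot = pd.get_dummies(df, columns=["color"], dtype=int)

print(one_hot)
```

Each row contains exactly one 1, so no artificial ordering is introduced between categories.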
Data Scaling
Data Scaling is a data preprocessing step for numerical features. Scaling is the process of
transforming numerical features to a common range or scale.

Two Techniques:
Standardization - StandardScaler from scikit-learn
Normalization - MinMaxScaler from scikit-learn
Standardization

Z-score normalization (standardization)

Z-score normalization (standardization) transforms features to have a mean (μ) of 0
and a standard deviation (σ) of 1. It works best when the data is roughly Gaussian
(bell curve) distributed. The transformation performed by StandardScaler can be expressed
mathematically as:
z = (x - μ) / σ

StandardScaler is less sensitive to outliers than min-max scaling, although the mean
and standard deviation it relies on are still affected by extreme values.

StandardScaler() from scikit-learn
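As a quick check of the z-score formula, the sketch below computes it by hand and compares the result with StandardScaler. The single-feature array is a made-up example; StandardScaler uses the population standard deviation (ddof=0), which is also NumPy's default.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical single-feature data, shape (n_samples, n_features).
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])

# Manual z-score: subtract the column mean, divide by the column std.
manual = (X - X.mean(axis=0)) / X.std(axis=0)

# Same transformation with scikit-learn.
scaled = StandardScaler().fit_transform(X)

print(manual.ravel())
print(scaled.ravel())
```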

Normalization

Min-max scaling is very often simply called ‘normalization.’ It transforms
features to a specified range, typically between 0 and 1. It is highly sensitive to outliers
since the min and max are affected by extreme values.
Xnormalized = (X – Xmin) / (Xmax – Xmin)
MinMaxScaler() from scikit-learn
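The min-max formula, (X - Xmin) / (Xmax - Xmin), can be checked against MinMaxScaler in the same way. The input values are illustrative.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical single-feature data.
X = np.array([[10.0], [20.0], [30.0], [50.0]])

# Manual min-max scaling to the range [0, 1].
x_min, x_max = X.min(axis=0), X.max(axis=0)
manual = (X - x_min) / (x_max - x_min)

# Same transformation with scikit-learn (default feature_range is (0, 1)).
scaled = MinMaxScaler().fit_transform(X)

print(scaled.ravel())
```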

Robust Scaling
Robust Scaling uses the median and the interquartile range to scale the data, making it
robust to outliers.

RobustScaler() from scikit-learn
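A short sketch of robust scaling, assuming RobustScaler's default settings (centering on the median and scaling by the 25th to 75th percentile range). The data, including the extreme value 100, is made up to show that the bulk of the points keeps a sensible scale.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

# Hypothetical feature with one extreme value.
X = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

# Manual version: (x - median) / IQR.
median = np.median(X, axis=0)
q1, q3 = np.percentile(X, [25, 75], axis=0)
manual = (X - median) / (q3 - q1)

# Same transformation with scikit-learn.
scaled = RobustScaler().fit_transform(X)

print(scaled.ravel())
```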

Class Imbalance

Class imbalance occurs when one class in a classification problem significantly
outnumbers the other class. It’s common in many machine learning problems.

A model trained on an imbalanced dataset can perform poorly on the minority class.

Class imbalance is addressed by the following techniques:

1. Random oversampling
2. Random undersampling
3. Oversampling with SMOTE
4. Undersampling with Tomek links
5. Oversampling with ADASYN
6. Oversample with SMOTE, then undersample with Tomek links (SMOTE-Tomek)
Random Oversampling

Random oversampling adds more copies of existing samples to the
minority class.

Random Undersampling
Conversely, random undersampling removes existing samples from
the majority class.

imbalanced-learn is a Python package offering a number of resampling
techniques commonly used in datasets showing strong
between-class imbalance.
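To show what random resampling does without requiring imbalanced-learn to be installed, here is a plain NumPy sketch on a made-up 8-vs-2 dataset. It illustrates the idea only; in practice imbalanced-learn's RandomOverSampler and RandomUnderSampler provide the same behaviour through a `fit_resample` method.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical imbalanced data: 8 majority (0) and 2 minority (1) samples.
X = np.arange(20).reshape(10, 2)
y = np.array([0] * 8 + [1] * 2)

minority_idx = np.flatnonzero(y == 1)
majority_idx = np.flatnonzero(y == 0)

# Random oversampling: duplicate minority samples (with replacement)
# until both classes have the same size.
n_extra = len(majority_idx) - len(minority_idx)
extra = rng.choice(minority_idx, size=n_extra, replace=True)
over_idx = np.concatenate([majority_idx, minority_idx, extra])
X_over, y_over = X[over_idx], y[over_idx]

# Random undersampling: keep a random subset of the majority class
# (without replacement) that matches the minority size.
kept = rng.choice(majority_idx, size=len(minority_idx), replace=False)
under_idx = np.concatenate([kept, minority_idx])
X_under, y_under = X[under_idx], y[under_idx]

print(np.bincount(y_over), np.bincount(y_under))
```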
Oversampling with SMOTE (Synthetic Minority Oversampling
Technique)
This technique generates synthetic data for the minority class.
SMOTE works by randomly picking a point from the minority
class and computing the k nearest neighbors of this point.
Synthetic points are then added between the chosen point and
its neighbors.

The SMOTE algorithm works in 4 simple steps:

1. Choose a minority-class sample as the input vector.

2. Find its k nearest neighbors (k_neighbors is specified as an
argument in the SMOTE() function).
3. Choose one of these neighbors and place a synthetic point
anywhere on the line joining the point under consideration and its
chosen neighbor.
4. Repeat the steps until the data is balanced.
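The steps above can be sketched in plain NumPy. This is a simplified illustration, not the imbalanced-learn implementation: the function name `smote_sketch`, its parameters, and the toy points are all assumptions made for the example.

```python
import numpy as np

def smote_sketch(X_min, n_new, k=3, seed=0):
    """Generate n_new synthetic points from minority samples X_min (hypothetical helper)."""
    rng = np.random.default_rng(seed)

    # Pairwise Euclidean distances between minority samples.
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbor

    # Indices of the k nearest minority neighbors of every sample.
    neighbors = np.argsort(dists, axis=1)[:, :k]

    synthetic = np.empty((n_new, X_min.shape[1]))
    for i in range(n_new):
        base = rng.integers(len(X_min))             # step 1: pick a minority sample
        neighbor = neighbors[base, rng.integers(k)]  # steps 2-3: pick one of its neighbors
        gap = rng.random()                           # position along the joining line
        synthetic[i] = X_min[base] + gap * (X_min[neighbor] - X_min[base])
    return synthetic                                 # step 4: repeated n_new times

# Hypothetical minority-class samples in 2D.
X_min = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.0], [2.5, 2.5]])
new_points = smote_sketch(X_min, n_new=6)
print(new_points)
```

Every synthetic point lies on a line segment between two real minority samples, which is why SMOTE never produces values outside the range already covered by the minority class.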

Though this algorithm is quite useful, it has a few drawbacks.

If the minority class has noisy or mislabelled instances, SMOTE can amplify the noise by generating synthetic
examples based on these erroneous data points, which can create a large number of noisy data points in feature
space.

Since SMOTE generates synthetic samples that are close to existing minority class samples, it can lead to overfitting,
especially if the model becomes too tailored to these synthetic examples rather than capturing the underlying patterns.
Under-Sampling with Tomek Links
Tomek links are pairs of very close instances of opposite
classes. Removing the majority-class instance of each pair
increases the space between the two classes, facilitating the
classification process.
A Tomek link exists if two samples from different classes are the nearest
neighbors of each other.
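The definition can be turned into a small NumPy sketch: find pairs that are mutual nearest neighbors and carry different labels, then drop the majority-class member. The helper name `tomek_links` and the one-dimensional toy data are assumptions for illustration; imbalanced-learn offers a TomekLinks class for real use.

```python
import numpy as np

def tomek_links(X, y):
    """Return index pairs (i, j) that form Tomek links (hypothetical helper)."""
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)
    nearest = dists.argmin(axis=1)  # nearest neighbor of every sample

    links = []
    for i, j in enumerate(nearest):
        # Mutual nearest neighbors with opposite labels form a Tomek link.
        if i < j and nearest[j] == i and y[i] != y[j]:
            links.append((i, int(j)))
    return links

# Hypothetical 1D data: class 0 is the majority (4 samples vs 3).
X = np.array([[0.0], [0.4], [1.0], [1.2], [3.0], [5.0], [5.5]])
y = np.array([0, 0, 0, 1, 0, 1, 1])

links = tomek_links(X, y)

# Remove the majority-class member of each link.
majority = 0
to_drop = [i if y[i] == majority else j for i, j in links]
keep = np.setdiff1d(np.arange(len(y)), to_drop)
X_clean, y_clean = X[keep], y[keep]

print(links, y_clean)
```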
One Class SVM

One-class support vector machines (SVMs) are a type of
unsupervised outlier detection method.
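A minimal sketch of outlier detection with scikit-learn's OneClassSVM. The training data is synthetic, and the `nu` and `gamma` values are illustrative assumptions rather than recommended settings. The model is fit on mostly normal data and then labels new points as +1 (inlier) or -1 (outlier).

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)

# Hypothetical "normal" data: 200 points around the origin.
X_train = rng.normal(loc=0.0, scale=1.0, size=(200, 2))

# nu is an upper bound on the fraction of training points treated as outliers.
model = OneClassSVM(kernel="rbf", gamma="scale", nu=0.05)
model.fit(X_train)

# One point near the training data, one far away from it.
X_new = np.array([[0.0, 0.1], [8.0, 8.0]])
labels = model.predict(X_new)  # +1 for inliers, -1 for outliers

print(labels)
```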
Optimization of hyperparameters

It's the process of tuning the settings (aka hyperparameters) of a machine
learning model to get the best performance.

Unlike model parameters (like weights in a neural network),
hyperparameters are set before training. Examples include:
•Learning rate
•Number of layers/neurons
•Batch size
•Number of trees in a random forest
•Regularization strength
Common Optimization Techniques

1. Grid Search
2. Random Search
3. Bayesian Optimization
4. Evolutionary Algorithms / Genetic Algorithms

Libraries like:

1. Scikit-learn’s GridSearchCV / RandomizedSearchCV

2. Keras Tuner
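Grid search can be sketched with scikit-learn's GridSearchCV. The dataset (iris), the model, and the small parameter grid are assumptions chosen to keep the example fast; RandomizedSearchCV has a similar interface but samples a fixed number of combinations instead of trying them all.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = load_iris(return_X_y=True)

# Illustrative grid: 2 x 2 = 4 combinations, each evaluated with 3-fold CV.
param_grid = {
    "n_estimators": [10, 50],
    "max_depth": [2, None],
}

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    cv=3,
    scoring="accuracy",
)
search.fit(X, y)

best_params = search.best_params_
best_score = search.best_score_
print(best_params, round(best_score, 3))
```

The cost of grid search grows multiplicatively with each added hyperparameter, which is the usual reason to switch to random search or Bayesian optimization for larger search spaces.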
Design constraints and considerations in machine learning

1. Data Quality and Quantity

Challenge: Machine learning algorithms are voracious consumers
of data, but they demand high-quality data. Garbage in, garbage
out — the adage rings true in the world of ML. The data used for
training machine learning models should be clean, accurate, and
representative of the problem. Data preprocessing, cleaning, and
augmentation are often required to ensure data quality.
Additionally, having a sufficient quantity of data is crucial, as
models need diverse examples to learn effectively.
Real-life Example: Healthcare providers rely on patient records
to train diagnostic models. Incomplete or inaccurate data can lead
to erroneous predictions, risking patient health.

2. Overfitting and Underfitting

Challenge: Machine learning models can overfit (become overly
complex) or underfit (too simplistic). Striking the right balance is
critical for model performance. Overfitting occurs when a model
fits the training data too closely, capturing noise instead of useful
patterns. Underfitting, on the other hand, results from overly
simplistic models that can’t capture complex relationships in the
data. Addressing these issues often involves hyperparameter
tuning and cross-validation.
Real-life Example: In stock market prediction, an overfit model
may perform exceptionally well on historical data but fail to
generalize to new market conditions, leading to poor investment
decisions.

3. Interpretability and Explainability

Challenge: Many machine learning models operate as “black
boxes,” making it challenging to comprehend the reasoning
behind their decisions. Interpretability and explainability are
crucial in applications where human lives or significant financial
decisions are at stake. Understanding why a model made a
specific decision can be essential for transparency,
accountability, and trust.

Real-life Example: In the context of autonomous vehicles,
understanding why a self-driving car made a specific decision is
vital for safety and accountability.

4. Generalization
Challenge: A successful machine learning model should perform
well on new, unseen data. Achieving this generalization is often
tricky. Ensuring that a model generalizes effectively is a core
challenge. Overfit models may perform well on training data but fail
to make accurate predictions on new, unseen data. Techniques such
as cross-validation and regularization are employed to improve
generalization.

Real-life Example: A spam email classifier may excel in identifying
common spam, but it could falter when new, sophisticated spam
techniques emerge.

6. Computational Resources
Challenge: Training deep learning models, especially large neural
networks, requires substantial computational resources, including
powerful GPUs and TPUs. Deep learning models with millions of
parameters demand significant computing power. This not only
drives up costs but also raises concerns about energy
consumption and environmental impact.

Real-life Example: Training state-of-the-art language models like
GPT-3 demands massive computing power and energy
consumption, raising environmental concerns.

7. Model Selection
Challenge: Selecting the appropriate machine learning
algorithm or model architecture for a specific problem can be
perplexing. Choosing the wrong one may result in suboptimal
performance. The choice of model depends on the data type,
problem type (classification, regression, clustering), and the
desired output. It also involves deciding between traditional
machine learning algorithms and deep learning methods.

Real-life Example: Image recognition tasks benefit from
convolutional neural networks (CNNs), while natural language
processing tasks have often used recurrent neural networks (RNNs)
and, more recently, transformers.
Limitations of Machine Learning

Data Dependency
Limitation: Machine learning models are highly dependent on
the data they are trained on. They can’t provide meaningful
insights beyond their training data. Machine learning models
make predictions based on the patterns they’ve learned from
historical data. They can’t offer insights or predictions about
events or phenomena that fall outside the scope of their training
data.

Real-life Example: A sentiment analysis model trained on
restaurant reviews can’t offer insights into political sentiment.

Data Privacy and Security

Limitation: Machine learning often involves processing
sensitive data, raising concerns about privacy and security
breaches. Machine learning applications, especially in
healthcare, finance, and personalization, often require access
to sensitive data. Safeguarding this data from breaches and
ensuring compliance with data protection regulations are
significant challenges.

Real-life Example: Healthcare providers must ensure that
patient data is securely processed and stored when using
machine learning for diagnostics.

Model Robustness
Limitation: Machine learning models can be fragile. Small changes
in input data can result in incorrect predictions or cause the model
to fail. Models can be sensitive to minor changes in input data,
which is problematic when deployed in real-world environments
where data may be noisy or incomplete. Ensuring model robustness
is a continuous challenge.

Real-life Example: In an image recognition system, minor
alterations to an image may render it unrecognizable to the model.

Scalability
Limitation: Scaling machine learning solutions to handle
increased data volumes and user traffic can be challenging,
requiring substantial infrastructure investments. As
organizations grow and the volume of data and users
increases, machine learning systems may need to be
re-architected to maintain performance. This involves investing
in infrastructure and distributed computing.

Real-life Example: E-commerce websites must invest in
powerful servers and databases to scale up recommendation
systems as their customer base grows.
Case Studies

1. Google’s DeepMind: Diabetic Retinopathy Detection

Company: Google DeepMind

Problem Definition: Diabetic retinopathy is a diabetes
complication affecting the eyes and a leading cause of
blindness among adults. Early detection is crucial for effective
treatment, but many patients lack access to screening
services.

Solution: DeepMind developed a machine learning model that
analyzes eye images to detect signs of diabetic retinopathy
automatically. The model uses deep learning techniques to
interpret optical coherence tomography (OCT) and fundus
photography, identifying disease markers that are often subtle
and challenging for human examiners.

2. PayPal: Fraud Detection

Company: PayPal
Problem Definition: As a major online payment platform,
PayPal faces constant threats from various forms of financial
fraud, including unauthorized transactions and identity theft,
which can result in significant financial losses.

Solution: PayPal implemented a machine learning system to
enhance its fraud detection capabilities. The system analyzes
millions of transactions in real time, utilizing algorithms to
identify patterns and anomalies that suggest fraudulent activity.

3. Amazon: Personalized Recommendations

Company: Amazon
Problem Definition: With millions of products and a diverse
customer base, Amazon sought to improve shopping experiences
by making personalized product recommendations to its users,
aiming to increase sales and customer satisfaction.

Solution: Amazon developed a sophisticated machine learning
recommendation system that analyzes individual customer data,
including past purchases, search histories, and browsing
behaviors. This system uses collaborative filtering and deep
learning to predict and suggest products that a user is likely to
be interested in.

4. General Electric: Predictive Maintenance

Company: General Electric (GE)

Problem Definition: In industries like aviation and energy,
equipment failure can lead to massive operational disruptions
and high maintenance costs. GE needed a way to predict
equipment failures before they occurred to optimize
maintenance schedules and reduce downtime.

Solution: GE developed predictive maintenance software
using machine learning algorithms that analyze data from
machinery-embedded sensors. These algorithms predict when
equipment will likely fail by detecting anomalies and patterns
that deviate from normal operation.

5. Tesla: Autonomous Driving

Company: Tesla, Inc.

Problem Definition: Enhancing vehicle safety and efficiency
through automation is a major goal in the automotive industry.
Tesla aimed to develop a reliable self-driving system to handle
complex driving tasks and adapt to diverse driving conditions.

Solution: Tesla’s Autopilot system uses machine learning to
process data from cameras, radar, and sensors to enable
autonomous driving capabilities. The system includes
auto-steering, traffic-aware cruise control, and automatic lane changes.
