Hands-On Data Preprocessing in Python

The document discusses various techniques for data preprocessing which is an essential step in machine learning projects. These techniques include data cleansing to handle missing values and outliers, feature selection to reduce complexity and improve performance, feature scaling to prepare data for algorithms, and feature engineering to better represent the problem for models. Python libraries like Scikit-learn can be used to implement these techniques such as imputation to fill missing values, recursive feature elimination for selection, and MinMaxScaler for normalization.

Uploaded by

Bongkar Taktik

In a machine learning project, real-world data is rarely ready to use
as is. The dataset may contain missing values or incorrectly typed
columns. This rawness has to be dealt with before an ML algorithm can
be applied, and it is a problem that every data professional has to
face.

The process of cleaning raw data and transforming it into a form more
suitable for modeling is called data pre-processing. It is effectively
a mandatory step in the machine learning process, for reasons such as:

• data errors: Statistical noise or missing data needs to be
corrected.

• data types: Most machine learning algorithms require input data in
the form of numbers.

• data complexity: Some data is so complex that an algorithm cannot
perform well on it. Complexity can also be a cause of overfitting in
a model.

While data pre-processing differs from case to case, some common tasks
apply to most projects:

• data cleansing

• feature selection

• data scaling

• feature engineering

• dimensionality reduction

We will explore these steps and implement them on a sample dataset
using Python libraries.

Data Cleansing: Handling missing values

One of the most common data cleansing tasks is dealing with missing
values. There are two basic ways to handle them:

1. Remove rows with missing values

2. Impute missing values

Removing rows is the simplest strategy and the easiest to execute.
Imputing missing values, by contrast, is more involved. We can impute
values using rules such as:

• A constant value that has meaning within the domain and is distinct
from the other data, like 0 or -1.

• A measure of central tendency of the data: the mean, median, or
mode.

• A predicted value estimated from other data.
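
The two strategies can be sketched with pandas on a small made-up table. This is only an illustration: the column names are borrowed from the Melbourne Housing dataset, but the values are invented, and the constant -1 is an arbitrary choice.

```python
import numpy as np
import pandas as pd

# Toy data: values are invented, only the column names follow the dataset
df = pd.DataFrame({
    "Rooms": [2, 3, 4, 3, 2],
    "BuildingArea": [79.0, np.nan, 150.0, np.nan, 96.0],
    "YearBuilt": [1900.0, 1970.0, np.nan, 2005.0, 1990.0],
})

# Strategy 1: remove every row that contains at least one missing value
dropped = df.dropna()

# Strategy 2a: impute with a constant that cannot occur as a real value
constant_filled = df.fillna(-1)

# Strategy 2b: impute each column with its own median
median_filled = df.fillna(df.median())

print(len(df), "rows before,", len(dropped), "rows after dropping")
print(median_filled)
```

Dropping rows is a one-liner, but here it discards three of the five rows, which is the usual argument for imputing instead.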

Even though most ML algorithms require a complete dataset, not all of
them fail when data is missing. Some algorithms can be made robust to
missing values, like KNN and Naive Bayes, while others, like Decision
Trees, can treat a missing value as a value of its own. Nevertheless,
the scikit-learn implementations of these algorithms have generally
not accepted missing values, so imputation is usually still needed
with that library (some recent releases add native NaN support for
certain tree-based estimators).

We are going to use the SimpleImputer class to replace all missing
values, marked as NaN, with the mean value of their column.
You can download the dataset here: Melbourne Housing Snapshot.

Four features have missing values. We will work on the features ‘Age’,
‘BuildingArea’, and ‘YearBuilt’.
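
A minimal sketch of mean imputation with SimpleImputer follows. Because the dataset file is not included here, a tiny stand-in DataFrame with invented values is used; with the downloaded CSV you would load it with `pd.read_csv` instead.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Stand-in for the real file: values are invented for illustration
df = pd.DataFrame({
    "BuildingArea": [80.0, np.nan, 150.0, 100.0],
    "YearBuilt": [1900.0, 1970.0, np.nan, 2000.0],
})
columns = ["BuildingArea", "YearBuilt"]

# Count missing values per column before imputing
print(df[columns].isna().sum())

# Replace every NaN with the mean of its column
imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
df[columns] = imputer.fit_transform(df[columns])

# After imputing, no missing values remain
print(df[columns].isna().sum())
print(imputer.statistics_)  # the per-column means that were filled in
```

Changing `strategy` to `"median"` or `"most_frequent"` switches to the other central-tendency rules listed earlier.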

Feature Selection

In a nutshell, feature selection means removing irrelevant features.
We do this to:

• reduce complexity

• produce an easy-to-understand model

• reduce computational cost

• prevent overfitting

• improve model performance

Feature selection techniques can be grouped by their underlying
algorithm:

credit: [Link]

When using statistics-based feature selection, it is important to
choose the method based on the data types of the input and output
variables. This decision tree helps decide which statistics-based
method is suitable for our data:

credit: [Link]

We are going to use the RFE method to select the most important
features from our dataset. Recursive Feature Elimination (RFE) is
popular due to its flexibility and ease of use. It repeatedly fits a
model, ranks the features by importance, and removes the least
important ones, one by one by default, until the selected number of
features is left.

The scikit-learn Python machine learning library provides an
implementation of RFE. To use it, first configure the class with the
chosen algorithm, specified via the “estimator” argument, and the
number of features to select, via the “n_features_to_select” argument.

The six most relevant features according to RFE are indicated by “Selected=True”.
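
The RFE configuration described above might look like the sketch below. The estimator (LinearRegression) and the synthetic data from `make_regression` are assumptions made for illustration, since the original code and its choice of estimator are not shown; the feature names are placeholders.

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

# Synthetic stand-in data: 10 features, of which 6 carry signal
X, y = make_regression(
    n_samples=200, n_features=10, n_informative=6, noise=0.1, random_state=0
)
feature_names = [f"feature_{i}" for i in range(X.shape[1])]  # placeholder names

# Configure RFE with the estimator and the number of features to keep
rfe = RFE(estimator=LinearRegression(), n_features_to_select=6)
rfe.fit(X, y)

# support_ marks the kept features, ranking_ is 1 for every kept feature
for name, selected, rank in zip(feature_names, rfe.support_, rfe.ranking_):
    print(f"{name}: Selected={selected}, Rank={rank}")
```

Any estimator that exposes `coef_` or `feature_importances_` after fitting can be passed as `estimator`, which is where the flexibility of RFE comes from.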

Feature Scaling

Many machine learning algorithms perform better when numerical input
variables are scaled. This includes algorithms that use a weighted sum
of the inputs, like linear regression; algorithms that use distance
measures, like k-nearest neighbors; and algorithms trained with
gradient descent.

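There are two common methods for scaling: normalization, which
rescales each feature to a fixed range (typically 0 to 1), and
standardization, which rescales each feature to zero mean and unit
variance.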

For the Melbourne Housing dataset, we are going to implement
normalization using the scikit-learn object called MinMaxScaler.

All maximum values have been scaled to 1
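
Min-max normalization maps each value x to (x - min) / (max - min) per column. A minimal sketch with MinMaxScaler follows; the small table is a stand-in with invented values, and only the column names come from the dataset.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Stand-in numeric columns with invented values
df = pd.DataFrame({
    "Rooms": [1, 2, 3, 5],
    "Landsize": [100.0, 250.0, 400.0, 700.0],
    "BuildingArea": [50.0, 80.0, 120.0, 200.0],
})

# The default feature_range is (0, 1)
scaler = MinMaxScaler()
scaled = pd.DataFrame(
    scaler.fit_transform(df), columns=df.columns, index=df.index
)

# Every column now has minimum 0 and maximum 1
print(scaled.describe().loc[["min", "max"]])
```

In a real project the scaler would typically be fitted on the training split only and then applied to the test split, so that no information leaks from the test data.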

Feature Engineering

Feature engineering is the process of transforming data so that it
better represents the underlying problem to the predictive models. It
is an iterative process that interplays with data selection and model
evaluation, again and again.

The general process of feature engineering is commonly divided by
feature type: numerical and categorical.

Examples of feature engineering, mainly for numerical features,
include:

• Feature Generation: feature 1 + feature 2, feature 1 x feature 2,
feature 1 / feature 2, etc.

• Decomposing Categorical Attributes: item_color -> is_red, is_blue;
gender -> is_male, is_female (one-hot encoding)

• Decomposing a Date-Time: datetime -> hour_of_day;
hour -> morning, night

• Reframing Numerical Quantities: weight -> above_70, below_70

• etc.
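
Several of these transformations can be sketched in a few lines of pandas. The table, its values, and the noon cut-off between "morning" and "night" are all invented for illustration.

```python
import pandas as pd

# Hypothetical table: names mirror the examples above, values are invented
df = pd.DataFrame({
    "feature_1": [10.0, 20.0, 30.0],
    "feature_2": [2.0, 4.0, 5.0],
    "weight": [65.0, 82.0, 70.0],
    "datetime": pd.to_datetime(
        ["2017-03-04 09:30", "2017-06-12 21:15", "2017-09-23 07:45"]
    ),
})

# Feature generation: arithmetic combinations of two features
df["f1_plus_f2"] = df["feature_1"] + df["feature_2"]
df["f1_times_f2"] = df["feature_1"] * df["feature_2"]
df["f1_over_f2"] = df["feature_1"] / df["feature_2"]

# Decomposing a date-time: extract the hour, then bucket it
# (treating hours before 12 as "morning" is an arbitrary cut-off)
df["hour_of_day"] = df["datetime"].dt.hour
df["part_of_day"] = df["hour_of_day"].map(
    lambda hour: "morning" if hour < 12 else "night"
)

# Reframing a numerical quantity as a binary flag
df["above_70"] = df["weight"] > 70

print(df)
```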

Tips for doing numerical feature engineering effectively:

1. Ask a domain expert

2. Discretize continuous values

3. Combine two or more features

4. Use simple descriptive statistics

Next, for handling categorical features, there are several methods,
collectively called encoding. Here are three common encoding
techniques.

Label Encoding

• Gives every category of a variable a numerical ID.

• Useful for non-linear and tree-based algorithms.

• Does not increase dimensionality.

• Useful for ordinal data.
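
For an ordinal input feature, the numerical IDs should follow the natural order of the categories. The sketch below uses scikit-learn's OrdinalEncoder with an explicit category order. That choice is an assumption rather than the author's method: scikit-learn's LabelEncoder is documented for target labels and assigns IDs in sorted order, which would not preserve small < medium < large. The `size` column is hypothetical.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical ordinal feature
df = pd.DataFrame({"size": ["small", "large", "medium", "small"]})

# List the categories in their natural order so the IDs preserve it
encoder = OrdinalEncoder(categories=[["small", "medium", "large"]])
df["size_id"] = encoder.fit_transform(df[["size"]]).ravel().astype(int)

print(df)
```

The encoded column replaces one column with one column, which is why label encoding does not increase dimensionality.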

One-Hot Encoding

• Creates a new feature for every unique value.

• Memory use depends on the number of unique categories.

• Similar to dummy encoding, except that dummy encoding generates n-1
new columns while OHE generates n, where n is the number of unique
values in the encoded feature.

Binary Encoding

• Variables -> numerical label (label encoding) -> binary
number -> split every digit into a separate column.

• Useful for features with a large number of unique values. Increases
dataset dimension only logarithmically.

• Only needs about log base 2 of n new columns, where n is the number
of unique values in the encoded feature.
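
The label -> binary -> columns pipeline can be sketched by hand with pandas. The helper `binary_encode` is a hypothetical function written for this illustration, not a library API, and it assumes the column has no missing values; dedicated third-party encoders may number the labels differently.

```python
import math
import pandas as pd

def binary_encode(series: pd.Series, prefix: str) -> pd.DataFrame:
    """Hypothetical helper: binary-encode a categorical column without NaNs."""
    # Step 1: label encoding, IDs 0 .. n-1 in order of first appearance
    codes, uniques = pd.factorize(series)
    # Number of binary digits needed to represent n distinct IDs
    n_bits = max(1, math.ceil(math.log2(len(uniques))))
    # Steps 2 and 3: read each binary digit into its own column,
    # most significant digit first
    columns = {
        f"{prefix}_bit{bit}": (codes >> bit) & 1
        for bit in reversed(range(n_bits))
    }
    return pd.DataFrame(columns, index=series.index)

colors = pd.Series(["red", "green", "blue", "yellow", "black", "red"])
encoded = binary_encode(colors, prefix="color")
print(encoded)
```

Five unique colors need only three columns here, whereas one-hot encoding would create five.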

One-hot encoding can be implemented with the pandas Python library via
the get_dummies function.
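
A minimal sketch of `get_dummies` on a stand-in table follows. The `Type` column name comes from the Melbourne Housing dataset, but the rows are invented for illustration.

```python
import pandas as pd

# Stand-in rows; "Type" is the categorical column to encode
df = pd.DataFrame({"Type": ["h", "u", "t", "h"], "Rooms": [3, 2, 4, 3]})

# One-hot encoding: one new column per unique value (n columns)
one_hot = pd.get_dummies(df, columns=["Type"], dtype=int)

# Dummy encoding: drop the first category (n-1 columns)
dummy = pd.get_dummies(df, columns=["Type"], drop_first=True, dtype=int)

print(one_hot)
print(dummy)
```

With `drop_first=True` the dropped category is implied when all remaining columns are 0, which is the n-1 versus n difference mentioned above.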

Dimensionality Reduction

More input features often make a predictive modeling task more
challenging, a problem generally referred to as the curse of
dimensionality.

Dimensionality reduction techniques are often used for data
visualization. Nevertheless, these techniques can also be used in
applied machine learning to simplify a classification or regression
dataset so that a predictive model fits it better.

One of the most popular techniques for dimensionality reduction in
machine learning is Principal Component Analysis (PCA).
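
A minimal PCA sketch with scikit-learn follows. The synthetic data from `make_classification` and the choice of three components are assumptions made for illustration. The features are standardized first because PCA is sensitive to feature scale.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in dataset with 10 input features
X, y = make_classification(
    n_samples=200, n_features=10, n_informative=4, n_redundant=2, random_state=0
)

# Standardize, then project onto the first 3 principal components
X_scaled = StandardScaler().fit_transform(X)
pca = PCA(n_components=3)
X_reduced = pca.fit_transform(X_scaled)

print(X.shape, "->", X_reduced.shape)
# Share of the total variance kept by the 3 components
print(pca.explained_variance_ratio_.sum())
```

The explained variance ratio is a common guide when choosing how many components to keep.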

Handling Outliers

Many datasets contain outliers that can heavily affect model training
results. In Python, outliers can be detected easily with a boxplot
visualization.

Both the Landsize and BuildingArea features have outliers.
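
A boxplot flags points that lie more than 1.5 times the interquartile range (IQR) beyond the quartiles, which is the default whisker rule in matplotlib. The same rule can be applied numerically, as in this sketch; the values are invented, and `iqr_outliers` is a hypothetical helper written for this illustration.

```python
import pandas as pd

# Stand-in values with one obvious outlier per column
df = pd.DataFrame({
    "Landsize": [200.0, 350.0, 420.0, 500.0, 610.0, 9000.0],
    "BuildingArea": [80.0, 95.0, 110.0, 120.0, 135.0, 900.0],
})

def iqr_outliers(series: pd.Series) -> pd.Series:
    """Return the values beyond 1.5 * IQR from the quartiles."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return series[(series < lower) | (series > upper)]

for column in df.columns:
    print(column, iqr_outliers(df[column]).tolist())
```

To draw the plot itself, `df.boxplot(column=["Landsize", "BuildingArea"])` can be used, assuming matplotlib is installed.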

We can adjust the outliers without any additional library using the
winsorization method: outlier values are replaced by fixed values
called the lower and upper bounds.
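
Winsorization can be sketched with pandas alone. The 10th and 90th percentiles used as bounds here are an assumption, since the text does not say which bounds were used, and the data is invented.

```python
import pandas as pd

# Stand-in column: 19 regular values and one extreme outlier
landsize = pd.Series(
    [float(v) for v in range(100, 2000, 100)] + [50000.0], name="Landsize"
)

def winsorize(series: pd.Series, lower_q: float = 0.10, upper_q: float = 0.90) -> pd.Series:
    """Clip values outside the chosen percentile bounds to those bounds."""
    lower, upper = series.quantile(lower_q), series.quantile(upper_q)
    return series.clip(lower=lower, upper=upper)

winsorized = winsorize(landsize)
print("max before:", landsize.max())
print("max after:", winsorized.max())
```

Unlike dropping outlier rows, winsorization keeps every row and only caps the extreme values.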

Those are several common methods for data preparation. Every project
is unique and may need a different approach to data pre-processing and
cleansing.


