Hands-On Data Preprocessing in Python

The document discusses various techniques for data preprocessing which is an essential step in machine learning projects. These techniques include data cleansing to handle missing values and outliers, feature selection to reduce complexity and improve performance, feature scaling to prepare data for algorithms, and feature engineering to better represent the problem for models. Python libraries like Scikit-learn can be used to implement these techniques such as imputation to fill missing values, recursive feature elimination for selection, and MinMaxScaler for normalization.

Uploaded by

Bongkar Taktik
Copyright
© All Rights Reserved

When working on a machine learning project, real-world data is
typically not ready to be used. The dataset we get might contain
missing values or incorrect types. This rawness has to be dealt with
first so that an ML algorithm can be applied to the data. This is a
common problem that all data-related professionals have to face.

The process of taking unclean data and transforming it into a form
more appropriate for modeling is called data pre-processing. This step
can be considered mandatory in the machine learning process for
several reasons, such as:

• data errors: Statistical noise or missing data needs to be
corrected.

• data types: Most machine learning algorithms require input
data in the form of numbers.

• data complexity: Some data might be so complex that
algorithms cannot perform well on it. Complexity can be a
reason for overfitting in a model.

While data pre-processing can differ from case to case, there are
some common tasks that can be used:

• data cleansing

• feature selection

• data scaling

• feature engineering

• dimensionality reduction

We will explore these steps and implement them on a sample dataset
using Python libraries.

Data Cleansing: Handling missing values

One of the most common data cleansing tasks is dealing with missing
values. Basically, there are two ways to handle missing values:

1. Remove rows with missing values

2. Impute missing values

Removing rows is the simplest strategy and is easy to execute.
Imputing missing values, by contrast, is more complicated. We can
impute values using rules such as:

• A constant value that has meaning within the domain and is
distinct from the other data, like 0 or -1.

• A measure of central tendency of the data: mean, median, or
mode.

• Predicted values estimated from other data.
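As a minimal sketch of the strategies above, the snippet below uses
pandas on a small made-up table. The column names and values are
illustrative assumptions and are not taken from the Melbourne dataset.
Predictive imputation is not shown.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the values are made up for illustration.
df = pd.DataFrame({
    "Rooms": [2, 3, np.nan, 4],
    "Price": [500.0, np.nan, 650.0, 900.0],
})

# Strategy 1: remove every row that contains a missing value.
dropped = df.dropna()

# Strategy 2a: impute with a constant that has meaning in the domain.
constant_filled = df.fillna(-1)

# Strategy 2b: impute with a central tendency, here the column median.
median_filled = df.fillna(df.median())

print(dropped)
print(constant_filled)
print(median_filled)
```

Dropping rows keeps only fully observed records, so it can discard a
large share of the data when missing values are spread across many rows.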

Even though most ML algorithms require a complete dataset, not all of
them fail when data is missing. Some algorithms can be made robust to
missing values, like KNN and Naive Bayes, while others, like Decision
Trees, can treat a missing value as a value of its own. Nevertheless,
the scikit-learn implementations of these algorithms have
traditionally not been robust to missing values, so missing data
usually has to be handled before training.

We are going to use the SimpleImputer class to replace all missing
values, marked as NaN, with the mean value of the column.
You can download the dataset here: Melbourne Housing Snapshot.

Four features have missing values. We will work on the features ‘Age’,
‘BuildingArea’, and ‘YearBuilt’.
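The mean imputation described above might look like the sketch below.
To keep it self-contained, it runs on a tiny made-up table instead of
the downloaded file; only the column names BuildingArea and YearBuilt
come from the text, and the values are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical stand-in for two columns of the housing data.
df = pd.DataFrame({
    "BuildingArea": [120.0, np.nan, 80.0, 100.0],
    "YearBuilt": [1990.0, 2005.0, np.nan, 1975.0],
})

# Replace every NaN with the mean of its column.
imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)

print(df.isna().sum())       # missing values per column before imputation
print(imputed.isna().sum())  # no missing values remain
```

Changing strategy to "median" or "most_frequent" switches to the other
central tendency measures mentioned earlier.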

Feature Selection

In a nutshell, feature selection means removing irrelevant features.
The reasons we need to do this are to:

• reduce complexity

• produce an easy-to-understand model

• reduce computational cost

• prevent overfitting

• improve model performance

Feature selection techniques can be grouped by their underlying
algorithm:

[Figure: feature selection techniques grouped by algorithm. Credit: [Link]]

When using statistics-based feature selection, it is important to
choose the method based on the data types of the input and output
variables. A decision tree can help decide which statistics-based
method is suitable for our data:

[Figure: decision tree for choosing a statistics-based feature selection method. Credit: [Link]]
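One possible instance of a statistics-based method, assuming numerical
inputs and a numerical target, is to score each feature with a
correlation-based F-test. The sketch below uses scikit-learn's
SelectKBest with f_regression on synthetic data; the dataset and the
choice of k=3 are assumptions made for illustration only.

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectKBest, f_regression

# Synthetic regression data: 8 numerical inputs, 3 of them informative.
X, y = make_regression(
    n_samples=200, n_features=8, n_informative=3, random_state=0
)

# Keep the 3 features with the highest F-statistic against the target.
selector = SelectKBest(score_func=f_regression, k=3)
X_selected = selector.fit_transform(X, y)

print(selector.get_support())  # boolean mask of the kept columns
print(X_selected.shape)
```

For other combinations of input and output types, a different score
function would be passed as score_func.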

We are going to use the RFE method to select the most important
features from our dataset. Recursive Feature Elimination (RFE) is
popular due to its flexibility and ease of use. It reduces model
complexity by removing features one by one until the selected number
of features is left.

The scikit-learn Python machine learning library provides an
implementation of RFE. To use it, first, the class is configured with
the chosen algorithm specified via the “estimator” argument and the
number of features to select via the “n_features_to_select” argument.
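A minimal sketch of this configuration is shown below. It assumes a
decision tree regressor as the estimator and uses synthetic data in
place of the housing dataset; both choices are illustrative, not the
setup used for the results in this article.

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import RFE
from sklearn.tree import DecisionTreeRegressor

# Synthetic data: 10 candidate features, 6 of them informative.
X, y = make_regression(
    n_samples=200, n_features=10, n_informative=6, random_state=0
)

# Configure RFE with the wrapped estimator and the number of features to keep.
rfe = RFE(estimator=DecisionTreeRegressor(random_state=0), n_features_to_select=6)
rfe.fit(X, y)

# support_ marks the kept features; ranking_ is 1 for every kept feature.
for i, (selected, rank) in enumerate(zip(rfe.support_, rfe.ranking_)):
    print(f"Column {i}: Selected={selected}, Rank={rank}")

X_selected = rfe.transform(X)
```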

Six most relevant features based on RFE are indicated by “Selected=True”

Feature Scaling

Many machine learning algorithms perform better when numerical input
variables are scaled. This includes algorithms that use a weighted sum
of the inputs, like linear regression, algorithms that use distance
measures, like k-nearest neighbors, and gradient descent-based
algorithms.

There are two common methods for scaling. For the Melbourne Housing
dataset, we are going to implement one of them, normalization, using
the scikit-learn object called MinMaxScaler.
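The normalization step could be sketched as follows. The two column
names come from the dataset as mentioned in this article, but the
values are made up so that the example runs on its own.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical values standing in for two numerical housing columns.
df = pd.DataFrame({
    "Landsize": [150.0, 600.0, 1050.0],
    "BuildingArea": [80.0, 120.0, 200.0],
})

# With default settings, MinMaxScaler maps each column to the range [0, 1].
scaler = MinMaxScaler()
scaled = pd.DataFrame(scaler.fit_transform(df), columns=df.columns)

print(scaled.describe().loc[["min", "max"]])
```

In a real project the scaler would normally be fitted on the training
split only and then applied to the test split with transform.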

All maximum values have been scaled to 1

Feature Engineering

Feature engineering is the process of transforming data to better
represent the underlying problem to the predictive models. It is an
iterative process that interplays with data selection and model
evaluation, again and again.

The general process of feature engineering is commonly divided into
numerical and categorical features.

Examples of feature engineering for numerical features include:

• Feature Generation: feature 1 + feature 2, feature 1 x
feature 2, feature 1 / feature 2, etc.

• Decomposing Categorical Attributes: item_color ->
is_red, is_blue; gender -> is_male, is_female (one-hot
encoding)

• Decomposing a Date-Time: datetime -> hour_of_day;
hour -> morning, night

• Reframing Numerical Quantities: weight -> above_70,
below_70

• etc.
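A few of the transformations listed above can be sketched with pandas.
The data is made up, and the cut-off used for "morning" (before noon)
is an assumption chosen only for this example.

```python
import pandas as pd

# Hypothetical toy data for illustration.
df = pd.DataFrame({
    "feature_1": [10.0, 20.0, 30.0],
    "feature_2": [2.0, 4.0, 5.0],
    "datetime": pd.to_datetime(
        ["2021-01-01 08:30", "2021-01-01 21:15", "2021-01-02 06:00"]
    ),
    "weight": [65.0, 82.5, 70.0],
})

# Feature generation: combine two numerical features.
df["f1_plus_f2"] = df["feature_1"] + df["feature_2"]
df["f1_div_f2"] = df["feature_1"] / df["feature_2"]

# Decomposing a date-time: datetime -> hour_of_day -> morning flag.
df["hour_of_day"] = df["datetime"].dt.hour
df["is_morning"] = df["hour_of_day"] < 12  # assumption: morning means before noon

# Reframing a numerical quantity: weight -> above_70.
df["above_70"] = df["weight"] > 70

print(df)
```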

Tips for doing numerical feature engineering effectively:

1. Ask the expert

2. Discretization

3. Combinations of 2 features or more

4. Use simple descriptive statistics

Next, categorical features are handled with a family of methods called
encoding. Here are three common encoding techniques.

Label Encoding

• Gives every category of a variable a numerical ID.

• Useful for non-linear and tree-based algorithms.

• Does not increase dimensionality.

• Useful for ordinal data type.
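For an ordinal feature, label encoding might be sketched as below. It
uses scikit-learn's OrdinalEncoder with an explicit category order so
that the numerical IDs follow the natural ranking; the feature name
"size" and its levels are hypothetical.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical ordinal feature.
df = pd.DataFrame({"size": ["small", "large", "medium", "small"]})

# Passing the order explicitly makes the IDs reflect the ranking:
# small -> 0, medium -> 1, large -> 2.
encoder = OrdinalEncoder(categories=[["small", "medium", "large"]])
df["size_id"] = encoder.fit_transform(df[["size"]]).ravel()

print(df)
```

Only one new column is created, no matter how many categories exist.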

One-Hot Encoding

• Creates a new feature for every unique value.

• Memory use depends on the number of unique categories.

• Similar to dummy encoding, except that dummy encoding
generates n-1 new columns while OHE generates n new columns,
where n is the number of unique values in the encoded feature.

Binary Encoding

• Variables -> numerical label (label encoding) -> binary
number -> split every digit into a different column.

• Useful for features with a large number of unique values.
Increases the dataset dimension only logarithmically.

• Only needs about log base 2 of n new columns, where n is the
number of unique values in the encoded feature.
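The steps above can be sketched by hand with pandas, without a
dedicated encoding library. The feature name and its values are
hypothetical, and the column naming scheme is an assumption.

```python
import pandas as pd

# Hypothetical categorical feature with 5 unique values.
colors = pd.Series(
    ["red", "green", "blue", "yellow", "red", "purple"], name="item_color"
)

# Step 1: label encoding, each unique value gets an integer ID 0..n-1.
codes, uniques = pd.factorize(colors)

# Step 2: number of binary digits needed to write the largest ID.
n_bits = max(1, (len(uniques) - 1).bit_length())

# Step 3: split every binary digit into its own column (highest bit first).
binary = pd.DataFrame(
    {f"item_color_bit{b}": (codes >> b) & 1 for b in range(n_bits - 1, -1, -1)}
)

print(binary)
```

Five unique values fit into three binary columns here, whereas one-hot
encoding would need five.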

The following code shows how to implement one-hot encoding in the
pandas Python library via the get_dummies function.
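A minimal sketch of such a call is given here. The small table is made
up for illustration, so the column names and values are assumptions.

```python
import pandas as pd

# Hypothetical toy data with one categorical column.
df = pd.DataFrame({
    "Type": ["h", "u", "t", "h"],
    "Rooms": [3, 2, 4, 3],
})

# One-hot encoding: one new column per unique value of "Type" (n columns).
one_hot = pd.get_dummies(df, columns=["Type"])

# Dummy encoding: drop the first level to get n-1 columns.
dummy = pd.get_dummies(df, columns=["Type"], drop_first=True)

print(one_hot.columns.tolist())
print(dummy.columns.tolist())
```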

Dimensionality Reduction

A larger number of input features often makes a predictive modeling
task more challenging, a problem more generally referred to as the
curse of dimensionality.

Dimensionality reduction techniques are often used for data
visualization. Nevertheless, these techniques can be used in applied
machine learning to simplify a classification or regression
dataset in order to better fit a predictive model.

One of the most popular techniques for dimensionality reduction in
machine learning is Principal Component Analysis (PCA).
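As a rough sketch, PCA can be applied with scikit-learn as shown
below. The synthetic classification dataset and the choice of three
components are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA

# Synthetic dataset with 10 input features, some of them redundant.
X, y = make_classification(
    n_samples=300, n_features=10, n_informative=4, n_redundant=3, random_state=0
)

# Project the 10 original features onto 3 principal components.
pca = PCA(n_components=3)
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)
# Share of the total variance captured by each kept component.
print(pca.explained_variance_ratio_)
```

Because PCA is sensitive to feature scale, the inputs are usually
scaled before it is applied.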

Handling Outliers

Many datasets have outliers that can heavily affect the model training
result. In Python, outliers can be easily detected using a boxplot
visualization.

Both the Landsize and BuildingArea features have outliers

We can adjust the outliers without any additional library using the
winsorization method. Outlier values are replaced by chosen values
called the upper and lower bounds.
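One way to sketch this is shown below, using only pandas. The bounds
are taken from the 1.5 x IQR rule that boxplot whiskers are commonly
based on; that choice is an assumption, and fixed percentiles are
another common option. The Landsize values are made up.

```python
import pandas as pd

# Hypothetical Landsize values with one extreme outlier.
landsize = pd.Series([200, 450, 500, 520, 600, 650, 700, 40000], name="Landsize")

# Bounds from the interquartile range (assumed rule: 1.5 x IQR).
q1 = landsize.quantile(0.25)
q3 = landsize.quantile(0.75)
iqr = q3 - q1
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Winsorize: values outside the bounds are replaced by the bounds themselves.
winsorized = landsize.clip(lower=lower, upper=upper)

print(lower, upper)
print(winsorized.tolist())
```

Unlike removing rows, this keeps every record and only caps the
extreme values.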

Those are several common methods for data preparation. Every project
is unique and may need a different approach to data pre-processing and
cleansing.
