
Data Preprocessing Steps in ML

Data preprocessing in machine learning involves 7 key steps: 1) acquiring the dataset, 2) importing relevant libraries, 3) importing the dataset, 4) identifying and handling missing values, 5) encoding categorical data, 6) splitting the dataset into training and test sets, and 7) performing feature scaling to standardize variables. These steps clean, organize, and transform raw data into a readable format suitable for building and training machine learning models.

Uploaded by

Musto
Copyright
© All Rights Reserved

Data Preprocessing in Machine Learning: 7 Easy Steps To Follow

Data preprocessing in Machine Learning is a crucial step that helps enhance the
quality of data to promote the extraction of meaningful insights from it. It refers to
the technique of preparing (cleaning and organizing) the raw data to make it suitable for
building and training Machine Learning models. In simple words, data preprocessing in
Machine Learning is a data mining technique that transforms raw data into an
understandable and readable format.

Why Data Preprocessing in Machine Learning?

When it comes to creating a Machine Learning model, data preprocessing is the first
step of the process. Typically, real-world data is incomplete, inconsistent, inaccurate
(contains errors or outliers), and often lacks specific attribute values/trends. This is
where data preprocessing enters the scenario: it helps to clean, format, and organize the
raw data, thereby making it ready-to-go for Machine Learning models. Let’s explore the
various steps of data preprocessing in machine learning.

Steps in Data Preprocessing in Machine Learning


1. Acquire the dataset:

Acquiring the dataset is the first step in data preprocessing in machine learning. To
build and develop Machine Learning models, you must first acquire the relevant dataset.
This dataset will consist of data gathered from multiple, disparate sources, which is
then combined in a proper format to form a single dataset.

2. Import all the crucial libraries:

Since Python is the most extensively used and the most preferred language among Data
Scientists around the world, we’ll show you how to import Python libraries for data
preprocessing in Machine Learning. Importing all the crucial libraries is the second step
in data preprocessing in machine learning. Predefined Python libraries can perform
specific data preprocessing jobs. The three core Python libraries used for data
preprocessing in Machine Learning are:
NumPy – NumPy is the fundamental package for scientific computing in Python.
Hence, it is used for performing mathematical operations in the code. Using NumPy,
you can also work with large multidimensional arrays and matrices in your code.
Pandas – Pandas is an excellent open-source Python library for data manipulation
and analysis. It is extensively used for importing and managing the datasets. It packs in
high-performance, easy-to-use data structures and data analysis tools for Python.
Matplotlib – Matplotlib is a Python 2D plotting library that is used to plot any type of
charts in Python. It can deliver publication-quality figures in numerous hard copy formats
and interactive environments across platforms (IPython shells, Jupyter notebook, web
application servers, etc.).
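
As a minimal sketch of this step, the three libraries can be imported under their
conventional aliases and each exercised once. The array values, the column names and the
choice of the non-interactive "Agg" backend are assumptions made for illustration; they
are not part of the original walkthrough.

```python
import numpy as np               # numerical arrays and math
import pandas as pd              # tabular data loading and manipulation
import matplotlib
matplotlib.use("Agg")            # assumption: non-interactive backend, so no display is needed
import matplotlib.pyplot as plt  # plotting

# NumPy: a 2x3 array and a vectorized operation on it (mean of each column).
matrix = np.array([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
column_means = matrix.mean(axis=0)

# Pandas: wrap the same values in a labeled table (hypothetical column names).
frame = pd.DataFrame(matrix, columns=["a", "b", "c"])

# Matplotlib: draw one column as a line chart.
fig, ax = plt.subplots()
ax.plot(frame["a"])
ax.set_title("column a")
```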

3. Import the Dataset:

In this step, you need to import the dataset/s that you have gathered for the ML
project at hand. Importing the dataset is one of the important steps in data preprocessing in
machine learning. However, before you can import the dataset/s, you must set the directory
that contains them as the current working directory (or refer to the files by their full
paths).
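
A minimal sketch of the import step, assuming the data is stored as CSV and loaded with
Pandas. The rows below are made up for illustration, and the column names (Country, Age,
Salary, Purchased) are assumptions based on the features this guide mentions. An
in-memory string stands in for a real file so that the example runs on its own.

```python
import io
import pandas as pd

# Hypothetical data standing in for a CSV file in the working directory.
csv_text = """Country,Age,Salary,Purchased
France,44,72000,No
Spain,27,48000,Yes
Germany,30,54000,No
Spain,38,,No
Germany,,61000,Yes
France,35,58000,Yes
"""

# pd.read_csv accepts a file path or a file-like object; with a real file
# you would pass its path instead of the StringIO wrapper.
dataset = pd.read_csv(io.StringIO(csv_text))

# Separate the independent variables (features) from the dependent variable (target).
X = dataset[["Country", "Age", "Salary"]]
y = dataset["Purchased"]

print(dataset.shape)         # (rows, columns)
print(dataset.isna().sum())  # number of missing values per column
```

Empty fields in the CSV are read as missing values (NaN), which is what the next step
deals with.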

4. Identifying and handling the missing values:


In data preprocessing, it is pivotal to identify and correctly handle the missing values.
If you fail to do this, you might draw inaccurate and faulty conclusions and inferences
from the data, which will hamper your ML project.

Basically, there are two ways to handle missing data:

Deleting a particular row – In this method, you remove a specific row that has a null
value for a feature, or a particular column in which more than 75% of the values are
missing. However, this method discards information, so it is recommended that you use it
only when the dataset has adequate samples. You must also ensure that deleting the data
does not introduce bias.

Calculating the mean – This method is useful for features having numeric data like
age, salary, year, etc. Here, you can calculate the mean, median, or mode of the feature
(column) that contains a missing value and substitute the result for the missing value.
Because every row is kept, the loss of data caused by deletion is avoided, so this method
often yields better results than the first method (omission of rows/columns). Note,
however, that filling many gaps with a single value reduces the variance of that feature.
Another way of approximation is to estimate the missing value from its neighboring
values. However, this works best for linear data.
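
The two approaches above can be sketched with Pandas as follows. The data is made up for
illustration, and the choice of `dropna` and `fillna` is an assumption about tooling
rather than the method prescribed by this guide.

```python
import numpy as np
import pandas as pd

# Hypothetical data with gaps (np.nan marks a missing value).
df = pd.DataFrame({
    "Age":    [44.0, 27.0, 30.0, 38.0, np.nan, 35.0],
    "Salary": [72000.0, 48000.0, 54000.0, np.nan, 61000.0, 58000.0],
})

# Identify: count the missing values in each column.
missing_per_column = df.isna().sum()

# Option 1: delete every row that contains at least one missing value.
dropped = df.dropna(axis=0)

# Option 2: replace each missing value with the mean of its column
# (df.mean() ignores missing values when computing the mean).
imputed = df.fillna(df.mean())

print(len(df), len(dropped))  # row count before and after deletion
print(imputed)
```

Replacing `df.mean()` with `df.median()` gives median imputation in the same way.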

5. Encoding the categorical data:


Categorical data refers to information that falls into specific categories within the
dataset. In the example dataset used for this walkthrough, there are two categorical
variables – country and purchased.

Machine Learning models are primarily based on mathematical equations. Thus, you
can intuitively understand that keeping the categorical data in the equations will cause
issues, since the equations can only work with numbers. The categories therefore have to
be converted into numeric values.
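
One possible way to turn the two categorical variables into numbers is sketched below,
assuming Pandas. The rows are made up, and the choice of one-hot encoding for country and
a 0/1 mapping for purchased is an illustrative assumption, not necessarily the encoding
the original walkthrough used.

```python
import pandas as pd

# Hypothetical rows containing the two categorical variables.
df = pd.DataFrame({
    "Country":   ["France", "Spain", "Germany", "Spain"],
    "Purchased": ["No", "Yes", "No", "Yes"],
})

# One-hot encode the country: one 0/1 column per category, so that no
# artificial ordering between countries is implied.
country_dummies = pd.get_dummies(df["Country"], prefix="Country", dtype=int)

# Encode the two-valued target as integers: No -> 0, Yes -> 1.
purchased = df["Purchased"].map({"No": 0, "Yes": 1})

encoded = pd.concat([country_dummies, purchased], axis=1)
print(encoded)
```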

6. Splitting the dataset:

Splitting the dataset is the next step in data preprocessing in machine learning. Every
dataset for a Machine Learning model must be split into two separate sets – a training set
and a test set.

The training set denotes the subset of a dataset that is used for training the machine
learning model. Here, you are already aware of the output. A test set, on the other hand, is
the subset of the dataset that is used for testing the machine learning model. The ML model
predicts outcomes for the test set, and these predictions are compared with the known
outputs to evaluate the model.

Usually, the dataset is split so that 80% of the data is used for training the model
and the remaining 20% is held out for testing.
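
A minimal sketch of an 80/20 split using only NumPy. The data, the fixed seed and the
variable names are assumptions for illustration.

```python
import numpy as np

# Hypothetical data: 10 samples with 2 features each, and one label per sample.
X = np.arange(20).reshape(10, 2)
y = np.array([0, 1, 0, 1, 0, 1, 0, 1, 0, 1])

rng = np.random.default_rng(seed=0)  # fixed seed so the split is reproducible
indices = rng.permutation(len(X))    # shuffle the row indices before splitting

n_test = int(round(0.2 * len(X)))    # 20% of the rows go to the test set
test_idx, train_idx = indices[:n_test], indices[n_test:]

X_train, X_test = X[train_idx], X[test_idx]
y_train, y_test = y[train_idx], y[test_idx]

print(X_train.shape, X_test.shape)
```

Libraries such as scikit-learn provide a `train_test_split` helper that performs the same
shuffle-and-split in one call.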

7. Feature scaling:

Feature scaling marks the end of data preprocessing in Machine Learning. It is a
method to standardize the independent variables of a dataset within a specific range. In
other words, feature scaling limits the range of variables so that you can compare them on
common grounds.

You can perform feature scaling in Machine Learning in two ways:


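Standardization: rescales each feature using its mean and standard deviation, so that the
result has a mean of 0 and a standard deviation of 1.

Normalization: rescales each feature to a fixed range, typically 0 to 1.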
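
Both scaling techniques can be sketched in a few lines of NumPy. The feature values are
made up, and the formulas used are the standard definitions: z = (x - mean) / standard
deviation for standardization, and x' = (x - min) / (max - min) for normalization to the
0 to 1 range.

```python
import numpy as np

# Hypothetical feature matrix: the columns are age and salary.
X = np.array([
    [44.0, 72000.0],
    [27.0, 48000.0],
    [30.0, 54000.0],
    [38.0, 61000.0],
])

# Standardization: subtract the column mean, divide by the column standard deviation.
standardized = (X - X.mean(axis=0)) / X.std(axis=0)

# Normalization (min-max): map each column onto the range 0 to 1.
normalized = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

print(standardized.mean(axis=0), standardized.std(axis=0))
print(normalized.min(axis=0), normalized.max(axis=0))
```

In practice the mean, standard deviation, minimum and maximum are usually computed on the
training set only and then reused to scale the test set, so that no information from the
test set leaks into training.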

Common questions


Standardization and normalization are two feature scaling techniques used to bring features onto a comparable scale. Standardization scales data based on the mean and standard deviation, resulting in a distribution with a mean of 0 and standard deviation of 1, and suits algorithms that assume roughly Gaussian inputs. Normalization scales data to a fixed range, typically 0 to 1, and is appropriate when features need a bounded common scale; because it depends on the minimum and maximum values, it is sensitive to outliers. Each method is chosen based on the dataset's distribution characteristics and algorithmic requirements.

Setting the directory that holds the data as the working directory ensures that file paths are correctly referenced during data import, facilitating hassle-free loading of data sources. Failing to set it accurately could lead to file-not-found errors, disrupting the preprocessing workflow, or to the wrong file being loaded if incorrect data paths are used.

The two primary methods for handling missing values are deleting rows/columns and imputing the mean, median, or mode. Deleting rows is appropriate when there are adequate samples and removing them doesn't introduce bias. Mean, median, or mode imputation is preferred for numeric features when preserving as much data as possible is crucial, although it tends to reduce the feature's variance. Approximating a missing value from its neighboring values is a further option that works best where the data is roughly linear.

Python libraries such as NumPy, Pandas, and Matplotlib offer significant advantages in data preprocessing. NumPy provides support for large multidimensional arrays and matrices, crucial for mathematical computations. Pandas facilitates data manipulation and analysis with high-performance data structures and tools. Matplotlib allows for creating publication-quality plots and visualization of relationships within the data. Together, they expedite preprocessing by providing efficient, easy-to-use interfaces and structures.

The acquisition of datasets is foundational, influencing subsequent preprocessing steps. Quality, comprehensiveness, and relevance of acquired data dictate the extent of cleaning, transformation, and enrichment required. A well-acquired dataset has fewer missing values and inconsistencies, streamlining tasks such as handling missing values, encoding, and scaling, ultimately impacting model accuracy and effectiveness.

Deleting rows with missing values benefits scenarios with abundant and diverse data where the deletion won't bias the dataset. It is justified when the presence of missing values is unrelated to any target outcome or when skewness caused by imputation could compromise the dataset's representativeness. The decision should be backed by an analysis confirming that the dataset remains representative.

Data preprocessing is crucial in machine learning because it addresses the issues of real-world data, which is often incomplete, inconsistent, and contains errors or outliers. The primary objectives of data preprocessing are to clean, format, and organize raw data, making it suitable for building and training machine learning models. It enhances the quality of data, facilitating the extraction of meaningful insights.

Omission of rows or columns can lead to biased datasets if key patterns or relationships are inadvertently removed, resulting in inaccurate models. These risks can be mitigated by checking that the deleted data is not concentrated in particular groups or value ranges and by considering imputation methods to preserve data distribution and variability as much as possible.

Encoding categorical data is significant because machine learning models require numeric inputs for mathematical computations. Without encoding, categorical variables cannot be used in the model's equations. Proper encoding transforms categories into a form that can be used in calculations, enabling the model to learn patterns from them.

Splitting a dataset into training and testing sets ensures that the model is evaluated on a distinct subset of data that it hasn't seen before. This approach tests the model's generalization ability, helps detect overfitting, and gives a realistic estimate of the model's performance on unseen data.
