Data Preprocessing

The document provides an overview of data preprocessing, detailing its importance in ensuring high-quality data for analysis and machine learning. It outlines key techniques such as data cleaning, integration, transformation, and reduction, along with the libraries commonly used for these processes. Additionally, it discusses the advantages and disadvantages of data preprocessing, as well as specific methods for data reduction in data mining.

Uploaded by

rathodnikhil19113

Unit 1

➢ Introduction:
Raw, real-world data is often incomplete, inconsistent, and dirty. For example, a
dataset might have missing values, contain errors (such as a person's age listed as
200), be noisy (containing random errors or variances), or be in a format incompatible
with many analytical models. Data preprocessing addresses these issues, ensuring
that the data is high-quality and reliable, which in turn improves the accuracy and
efficiency of any subsequent analysis or machine learning model.

➢ Definitions:

• Data Preprocessing: A series of techniques used to prepare raw data for analysis.
• Data Cleaning: The process of detecting and correcting (or removing) corrupt,
inaccurate, or irrelevant records from a dataset. Key tasks include handling missing
values and outlier detection.
• Data Transformation: The process of converting data from one format or structure to
another, often involving normalization, smoothing, and aggregation.
• Data Reduction: Techniques used to obtain a reduced representation of the data set
that is much smaller in volume but still produces the same (or nearly the same)
analytical results.

➢ Need for Data Preprocessing:

The primary need for data preprocessing stems from the "garbage in, garbage out"
principle: the quality of the output is only as good as the quality of the input. Specific
reasons include:
• Improving Data Quality: Raw data is often incomplete, inconsistent, and noisy.
Preprocessing ensures that the data is accurate and reliable.
• Enhancing Model Accuracy: Most machine learning models perform better with
clean, well-structured data.
• Increasing Efficiency: Preprocessed data can be processed faster by algorithms,
reducing computation time.
• Compatibility: Many algorithms have specific data format or scaling requirements
(e.g., numerical inputs only, data scaled to a 0-1 range), which preprocessing fulfills.

➢ Understanding Data Preprocessing:

Data preprocessing typically involves several key steps:

1. Data Cleaning:
1. Handling Missing Values: Replacing missing data with a calculated value (mean,
median, mode) or simply removing the record.
2. Handling Noisy Data: Smoothing data or removing outliers using methods like
binning, regression, or clustering.
2. Data Integration: Merging data from multiple sources (databases, files, etc.) into a
single, consistent data store.
3. Data Transformation:
1. Normalization/Scaling: Adjusting data to a common scale (e.g., Min-Max scaling or
Z-score normalization) so that no single feature dominates the model.
2. Attribute/Feature Engineering: Creating new features from existing ones to help
the model learn more effectively.
4. Data Reduction:
1. Dimensionality Reduction: Reducing the number of variables, using techniques like
Principal Component Analysis (PCA) or feature selection.
2. Numerosity Reduction: Replacing the original data with smaller alternative
representations, such as histograms.

➢ Libraries Used for Data Preprocessing:

In the field of data science, several powerful libraries facilitate the data preprocessing
steps. Common Python libraries include:
• NumPy: Provides support for large, multi-dimensional arrays and matrices, along with
a collection of mathematical functions to operate on these arrays [3]. It is fundamental
for numerical operations and data structuring.

• Pandas: Offers powerful data structures like DataFrames and tools for data
manipulation and analysis, making it the primary tool for data cleaning, handling
missing values, and data integration [3]. You can find tutorials on their official
documentation.
• Scikit-learn (sklearn): A comprehensive machine learning library with a
dedicated preprocessing module. It provides a wide range of tools for scaling,
normalization, encoding categorical variables, and dimensionality reduction
techniques like PCA. Learn more in their preprocessing documentation.
• Matplotlib: While primarily for plotting, visualization tools are often used during the
preprocessing phase for outlier detection and initial data exploration.
• Seaborn: A visualization library built on Matplotlib, useful for visually identifying
patterns, missing data, and outliers.
Unit 2

➢ Steps in Data Preprocessing:

Some key steps in data preprocessing are Data Cleaning, Data Integration, Data
Transformation, and Data Reduction.

1. Data Cleaning: It is the process of identifying and correcting errors or
inconsistencies in the dataset. It involves handling missing values, removing
duplicates, and correcting incorrect or outlier data to ensure the dataset is accurate
and reliable. Clean data is essential for effective analysis, as it improves the quality of
results and enhances the performance of data models.
• Missing Values: These occur when data is absent from a dataset. You can either ignore
the rows with missing data or fill the gaps manually, with the attribute mean, or with
the most probable value. This keeps the dataset as accurate and complete as possible
for analysis.
• Noisy Data: It refers to irrelevant or incorrect data that is difficult for machines to
interpret, often caused by errors in data collection or entry. It can be handled in
several ways:
o Binning Method: The data is sorted and divided into equal-sized segments (bins), and
each bin is smoothed by replacing its values with the bin mean or the nearest bin
boundary value.
o Regression: Data can be smoothed by fitting it to a regression function, either simple
linear or multiple linear, and using the fitted values.
o Clustering: This method groups similar data points together; points that fall outside
the clusters can be treated as outliers. These techniques help remove noise and
improve data quality.
• Removing Duplicates: It involves identifying and eliminating repeated data entries to
ensure accuracy and consistency in the dataset. This process prevents errors and
ensures reliable analysis by keeping only unique records.
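
The three cleaning tasks above can be sketched with pandas and plain Python. This is a minimal illustration, not a definitive recipe: the toy records, the column names and the bin size of three are assumptions made only for this example.

```python
import pandas as pd

# Hypothetical toy records: one duplicated row and one missing age.
df = pd.DataFrame({
    "name": ["Ann", "Bob", "Bob", "Cid", "Dee"],
    "age": [25.0, 31.0, 31.0, None, 40.0],
})

# Removing duplicates: keep only the first copy of each repeated row.
deduped = df.drop_duplicates().reset_index(drop=True)

# Missing values: fill the gap with the attribute mean.
mean_age = deduped["age"].mean()
cleaned = deduped.assign(age=deduped["age"].fillna(mean_age))

# Noisy data: smoothing by bin means over sorted values, three values per bin.
prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
sorted_prices = sorted(prices)
bin_size = 3
smoothed = []
for start in range(0, len(sorted_prices), bin_size):
    bin_values = sorted_prices[start:start + bin_size]
    bin_mean = sum(bin_values) / len(bin_values)
    smoothed.extend([bin_mean] * len(bin_values))

print(cleaned)
print(smoothed)
```

Filling with the mean is only one of the options listed above; the median or the most probable value may suit skewed data better.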

2. Data Integration: It involves merging data from various sources into a single, unified
dataset. It can be challenging due to differences in data formats, structures, and
meanings. Techniques like record linkage and data fusion help in combining data
efficiently, ensuring consistency and accuracy.
• Record Linkage is the process of identifying and matching records from different
datasets that refer to the same entity, even if they are represented differently. It helps
in combining data from various sources by finding corresponding records based on
common identifiers or attributes.
• Data Fusion involves combining data from multiple sources to create a more
comprehensive and accurate dataset. It integrates information that may be
inconsistent or incomplete from different sources, ensuring a unified and richer
dataset for analysis.
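
As a rough sketch of record linkage, the example below matches customers from two sources on a normalized email address. The tables, column names and the `normalize_email` helper are hypothetical; real linkage often needs fuzzier matching than an exact key.

```python
import pandas as pd

# Two hypothetical sources that format the same customers differently.
crm = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "email": ["Ann@Example.com", "bob@example.com ", "cid@example.com"],
})
billing = pd.DataFrame({
    "email": ["ann@example.com", "bob@example.com", "eve@example.com"],
    "balance": [120.0, 0.0, 35.5],
})

def normalize_email(series):
    """Trim whitespace and lowercase so equal addresses compare equal."""
    return series.str.strip().str.lower()

crm["email_key"] = normalize_email(crm["email"])
billing["email_key"] = normalize_email(billing["email"])

# Link records on the normalized key. The outer join keeps unmatched records
# from both sides, and indicator=True records where each row came from.
linked = crm.merge(
    billing[["email_key", "balance"]],
    on="email_key",
    how="outer",
    indicator=True,
)
print(linked[["customer_id", "email_key", "balance", "_merge"]])
```

Rows marked `left_only` or `right_only` in the `_merge` column are the records that could not be linked and may need manual review.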
3. Data Transformation: It involves converting data into a format suitable for analysis.
Common techniques include normalization, which scales data to a common range;
standardization, which adjusts data to have zero mean and unit variance; and
discretization, which converts continuous data into discrete categories. These
techniques help prepare the data for more accurate analysis.
• Data Normalization: The process of scaling data to a common range to ensure
consistency across variables.
• Discretization: Converting continuous data into discrete categories for easier analysis.
• Data Aggregation: Combining multiple data points into a summary form, such as
averages or totals, to simplify analysis.
• Concept Hierarchy Generation: Organizing data into a hierarchy of concepts to
provide a higher-level view for better understanding and analysis.
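
To make the difference between normalization and standardization concrete, here is a small NumPy sketch using made-up values. It applies the usual min-max and z-score formulas directly, with the population standard deviation.

```python
import numpy as np

values = np.array([10.0, 20.0, 30.0, 40.0, 50.0])  # hypothetical feature values

# Min-max normalization: (x - min) / (max - min), mapping values into [0, 1].
min_max = (values - values.min()) / (values.max() - values.min())

# Z-score standardization: (x - mean) / std, giving zero mean and unit variance.
z_scores = (values - values.mean()) / values.std()

print(min_max)
print(z_scores)
```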
4. Data Reduction: It reduces the dataset's size while maintaining key information.
This can be done through feature selection, which chooses the most relevant features,
and feature extraction, which transforms the data into a lower-dimensional space
while preserving important details. It uses various reduction techniques, such as:
• Dimensionality Reduction (e.g., Principal Component Analysis): A technique that
reduces the number of variables in a dataset while retaining its essential information.
• Numerosity Reduction: Reducing the number of data points by methods like sampling
to simplify the dataset without losing critical patterns.
• Data Compression: Reducing the size of data by encoding it in a more compact form,
making it easier to store and process.

➢ Uses of Data Preprocessing:

Data preprocessing is utilized across various fields to ensure that raw data is
transformed into a usable format for analysis and decision-making. Here are some key
areas where data preprocessing is applied:
1. Data Warehousing: In data warehousing, preprocessing is essential for cleaning,
integrating, and structuring data before it is stored in a centralized repository. This
ensures the data is consistent and reliable for future queries and reporting.
2. Data Mining: Data preprocessing in data mining involves cleaning and transforming
raw data to make it suitable for analysis. This step is crucial for identifying patterns
and extracting insights from large datasets.
3. Machine Learning: In machine learning, preprocessing prepares raw data for model
training. This includes handling missing values, normalizing features, encoding
categorical variables, and splitting datasets into training and testing sets to improve
model performance and accuracy.
4. Data Science: Data preprocessing is a fundamental step in data science projects,
ensuring that the data used for analysis or building predictive models is clean,
structured, and relevant. It enhances the overall quality of insights derived from the
data.
5. Web Mining: In web mining, preprocessing helps analyze web usage logs to extract
meaningful user behavior patterns. This can inform marketing strategies and improve
user experience through personalized recommendations.
6. Business Intelligence (BI): Preprocessing supports BI by organizing and cleaning
data to create dashboards and reports that provide actionable insights for decision-
makers.
7. Deep Learning: Similar to machine learning, deep learning applications
require preprocessing to normalize or enhance features of the input data, optimizing
model training processes.

➢ Advantages of Data Preprocessing:

• Improved Data Quality: Ensures data is clean, consistent, and reliable for analysis.
• Better Model Performance: Reduces noise and irrelevant data, leading to more
accurate predictions and insights.
• Efficient Data Analysis: Streamlines data for faster and easier processing.
• Enhanced Decision-Making: Provides clear and well-organized data for better
business decisions.

➢ Disadvantages of Data Preprocessing:

• Time-Consuming: Requires significant time and effort to clean, transform, and
organize data.
• Resource-Intensive: Demands computational power and skilled personnel for
complex preprocessing tasks.
• Potential Data Loss: Incorrect handling may result in losing valuable information.
• Complexity: Handling large datasets or diverse formats can be challenging.
Unit 3
INTRODUCTION:
Data reduction is a technique used in data mining to reduce the size of a
dataset while still preserving the most important information. This can be
beneficial in situations where the dataset is too large to be processed
efficiently, or where the dataset contains a large amount of irrelevant or
redundant information.
There are several different data reduction techniques that can be used in data
mining, including:
1. Data Sampling: This technique involves selecting a subset of the data to
work with, rather than using the entire dataset. This can be useful for
reducing the size of a dataset while still preserving the overall trends and
patterns in the data.
2. Dimensionality Reduction: This technique involves reducing the number
of features in the dataset, either by removing features that are not
relevant or by combining multiple features into a single feature.
3. Data Compression: This technique involves using techniques such as
lossy or lossless compression to reduce the size of a dataset.
4. Data Discretization: This technique involves converting continuous data
into discrete data by partitioning the range of possible values into
intervals or bins.
5. Feature Selection: This technique involves selecting a subset of features
from the dataset that are most relevant to the task at hand.
Note that data reduction involves a trade-off between the size of the data
and accuracy. The more aggressively the data is reduced, the greater the
risk that the resulting model is less accurate and less generalizable.
In conclusion, data reduction is an important step in data mining, as it can help
to improve the efficiency and performance of machine learning algorithms by
reducing the size of the dataset. However, it is important to be aware of the
trade-off between the size and accuracy of the data, and carefully assess the
risks and benefits before implementing it.
Methods of data reduction:
These are explained below.
1. Data Cube Aggregation:
This technique aggregates data into a simpler, summarized form. For example,
imagine the data gathered for your analysis covers the years 2012 to 2014
and records your company's revenue every three months. If you are interested
in annual sales rather than quarterly figures, the data can be aggregated so
that it gives total sales per year instead of per quarter.
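
A minimal pandas sketch of this roll-up from quarters to years follows. The revenue figures and column names are invented for illustration.

```python
import pandas as pd

# Hypothetical quarterly revenue for 2012 to 2014 (figures are made up).
quarterly = pd.DataFrame({
    "year": [2012] * 4 + [2013] * 4 + [2014] * 4,
    "quarter": ["Q1", "Q2", "Q3", "Q4"] * 3,
    "revenue": [200, 220, 250, 230, 240, 260, 300, 280, 310, 330, 360, 400],
})

# Aggregate along the time dimension: 12 quarterly rows become 3 annual rows.
annual = quarterly.groupby("year", as_index=False)["revenue"].sum()
print(annual)
```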

2. Dimension reduction:
When the data contains attributes that are only weakly relevant, we keep just
the attributes required for our analysis. This reduces data size by eliminating
irrelevant or redundant features.
• Step-wise Forward Selection -
The selection begins with an empty set of attributes. At each step, the
best of the remaining original attributes is added to the set. "Best" is
commonly judged by a test of statistical significance, for example a p-value.

Suppose the data set has the following attributes, a few of which are
redundant.
Initial attribute Set: {X1, X2, X3, X4, X5, X6}
Initial reduced attribute set: { }

Step-1: {X1}
Step-2: {X1, X2}
Step-3: {X1, X2, X5}

Final reduced attribute set: {X1, X2, X5}

• Step-wise Backward Selection -

This selection starts with a set of complete attributes in the original data
and at each point, it eliminates the worst remaining attribute in the set.

Suppose the data set has the following attributes, a few of which are
redundant.
Initial attribute Set: {X1, X2, X3, X4, X5, X6}
Initial reduced attribute set: {X1, X2, X3, X4, X5, X6 }

Step-1: {X1, X2, X3, X4, X5}

Step-2: {X1, X2, X3, X5}
Step-3: {X1, X2, X5}

Final reduced attribute set: {X1, X2, X5}

• Combination of Forward and Backward Selection -
At each step, it selects the best attribute and removes the worst from
among the remaining attributes, saving time and making the process faster.
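
The forward variant can be sketched as a greedy loop. Note the swap: instead of a p-value, this sketch scores each candidate by the gain in R² of a least-squares fit, and the `min_gain` threshold, the helper names and the synthetic data are all assumptions made for the example.

```python
import numpy as np

def r_squared(X, y):
    """R^2 of an ordinary least-squares fit with an intercept."""
    design = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    residual = y - design @ coef
    return 1.0 - residual.var() / y.var()

def forward_selection(X, y, min_gain=0.01):
    """Greedily add the attribute that raises R^2 most; stop when the gain is small."""
    selected, best_score = [], 0.0
    remaining = list(range(X.shape[1]))
    while remaining:
        scores = {j: r_squared(X[:, selected + [j]], y) for j in remaining}
        best_j = max(scores, key=scores.get)
        if scores[best_j] - best_score < min_gain:
            break
        selected.append(best_j)
        remaining.remove(best_j)
        best_score = scores[best_j]
    return selected

# Synthetic data: six attributes X1..X6, of which only X1, X2 and X5 drive y.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))
y = 3 * X[:, 0] + 2 * X[:, 1] + 1.5 * X[:, 4] + rng.normal(scale=0.1, size=200)

selected = forward_selection(X, y)
print([f"X{j + 1}" for j in selected])
```

Backward elimination would run the same loop in reverse, starting from all attributes and dropping the one whose removal costs the least.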
3. Data Compression:
The data compression technique reduces the size of the files using different
encoding mechanisms (Huffman Encoding & run-length Encoding). We can
divide it into two types based on their compression techniques.
• Lossless Compression -
Encoding techniques such as run-length encoding give a simple, modest
reduction in data size. Lossless data compression uses algorithms that can
restore the precise original data from the compressed data.
• Lossy Compression -
Methods such as the Discrete Wavelet Transform and PCA
(principal component analysis) are examples of this compression. For
example, the JPEG image format uses lossy compression, yet the decompressed
image still conveys the same meaning as the original. In lossy compression, the
decompressed data may differ from the original data but is still useful
enough to retrieve information from it.
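
To illustrate the lossless case, here is a small run-length encoding sketch in plain Python. The function names `rle_encode` and `rle_decode` are chosen for this example; the point is that decoding restores the input exactly.

```python
from itertools import groupby

def rle_encode(text):
    """Collapse each run of repeated characters into a (character, count) pair."""
    return [(char, len(list(run))) for char, run in groupby(text)]

def rle_decode(pairs):
    """Rebuild the original string from (character, count) pairs."""
    return "".join(char * count for char, count in pairs)

original = "AAAABBBCCD"
encoded = rle_encode(original)
restored = rle_decode(encoded)

print(encoded)
print(restored == original)
```

Run-length encoding only pays off when the data contains long runs; on data without repetition the encoded form can be larger than the input.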
4. Numerosity Reduction:
In this reduction technique, the actual data is replaced with a smaller
representation. Parametric methods fit a mathematical model to the data and
store only the model parameters instead of the actual data. Non-parametric
methods include clustering, histograms, and sampling.
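
A parametric example, sketched under the assumption that the data is close to linear: a thousand synthetic points are replaced by the two parameters of a fitted line. The data and noise level are invented for illustration.

```python
import numpy as np

# Hypothetical data: 1000 points that follow a straight line plus small noise.
rng = np.random.default_rng(0)
x = np.arange(1000, dtype=float)
y = 2.5 * x + 4.0 + rng.normal(scale=1.0, size=x.size)

# Fit a degree-1 polynomial and keep only its two parameters.
slope, intercept = np.polyfit(x, y, deg=1)

# The stored model can regenerate approximate values on demand.
reconstructed = slope * x + intercept
print(slope, intercept)
```

The reduction is lossy: the individual noise values are gone, and the approach only makes sense when the chosen model actually fits the data.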

5. Discretization & Concept Hierarchy Operation:

Data discretization techniques divide the range of a continuous attribute
into intervals. Many continuous values of the attribute are replaced by a
small number of interval labels, so mining results are presented in a
concise and easily understandable way.
• Top-down discretization -
If you first choose one or a few points (so-called breakpoints or
split points) to divide the whole range of the attribute and then repeat
this on the resulting intervals, the process is known as top-down
discretization, also known as splitting.
• Bottom-up discretization -
If you first consider all the continuous values as potential split points
and then discard some by merging neighboring values into intervals, the
process is called bottom-up discretization, also known as merging.
Concept Hierarchies:
It reduces the data size by collecting and then replacing the low-level concepts
(such as 43 for age) with high-level concepts (categorical variables such as
middle age or Senior).

For numeric data, the following techniques can be used:

• Binning -
Binning is the process of changing numerical variables into categorical
counterparts. The number of categorical counterparts depends on the
number of bins specified by the user.
• Histogram analysis -
Like binning, a histogram partitions the values of an attribute X into
disjoint ranges called buckets. There are several partitioning rules:

1. Equal Frequency Partitioning: Partitioning the values so that each
bucket contains roughly the same number of data points.
2. Equal Width Partitioning: Partitioning the values into ranges of a
fixed width determined by the number of bins, e.g., 0-20, 20-40 and
so on.
3. Clustering: Grouping similar data together.
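
The partitioning rules and the concept-hierarchy idea can be sketched with pandas `cut` and `qcut`. The ages, bin edges and labels below are assumptions made for illustration only.

```python
import pandas as pd

ages = pd.Series([5, 12, 18, 23, 31, 37, 43, 52, 61, 78])  # hypothetical ages

# Equal-width partitioning: four intervals, each 20 years wide.
equal_width = pd.cut(ages, bins=[0, 20, 40, 60, 80])

# Equal-frequency partitioning: five intervals with the same number of values.
equal_freq = pd.qcut(ages, q=5)

# Concept hierarchy: replace each numeric age with a higher-level label.
labels = pd.cut(
    ages,
    bins=[0, 20, 40, 60, 80],
    labels=["youth", "adult", "middle age", "senior"],
)

print(equal_width.value_counts().sort_index())
print(equal_freq.value_counts().sort_index())
print(labels.tolist())
```

With these edges an age of 43 is mapped to "middle age", mirroring the concept-hierarchy example above.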
ADVANTAGES AND DISADVANTAGES OF Data Reduction in Data Mining:
Data reduction in data mining can have a number of advantages and
disadvantages.
Advantages:
1. Improved efficiency: Data reduction can help to improve the efficiency of
machine learning algorithms by reducing the size of the dataset. This can
make it faster and more practical to work with large datasets.
2. Improved performance: Data reduction can help to improve the
performance of machine learning algorithms by removing irrelevant or
redundant information from the dataset. This can help to make the
model more accurate and robust.
3. Reduced storage costs: Data reduction can help to reduce the storage
costs associated with large datasets by reducing the size of the data.
4. Improved interpretability: Data reduction can help to improve the
interpretability of the results by removing irrelevant or redundant
information from the dataset.
Disadvantages:
1. Loss of information: Data reduction can result in a loss of information, if
important data is removed during the reduction process.
2. Impact on accuracy: Data reduction can impact the accuracy of a model,
as reducing the size of the dataset can also remove important
information that is needed for accurate predictions.
3. Impact on interpretability: Data reduction can make it harder to interpret
the results, as removing irrelevant or redundant information can also
remove context that is needed to understand the results.
4. Additional computational costs: Data reduction can add additional
computational costs to the data mining process, as it requires additional
processing time to reduce the data.

➢ Introduction to Dimensionality Reduction:

When working with machine learning models, datasets with too many features
can cause issues like slow computation and overfitting. Dimensionality
reduction helps to reduce the number of features while retaining key
information. It converts high-dimensional data into a lower-dimensional space
while preserving important details.

For example, suppose you are building a model to predict house prices with
features like bedrooms, square footage and location. If you add too many
features, such as room condition or flooring type, the dataset becomes large
and complex.

How Does Dimensionality Reduction Work?

Let's understand how dimensionality reduction is used with the help of an
example. Imagine a dataset where each data point exists in a 3D space defined
by axes X, Y and Z. If most of the data variance occurs along X and Y, then the
Z-dimension may contribute very little to understanding the structure of the
data.
Figure: Dimensionality Reduction
• Before reduction, the data exists in 3D (X, Y, Z). It has high
redundancy, and Z contributes little meaningful information.
• On the right, after reducing the dimensionality, the data is represented in
lower-dimensional spaces. The top plot (X-Y) maintains the meaningful
structure, while the bottom plot (Z-Y) shows that the Z-dimension
contributed little useful information.
This process makes data analysis more efficient, improving computation
speed and visualization while minimizing redundancy.
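
The 3D scenario described above can be reproduced with a short NumPy sketch that performs PCA through a singular value decomposition. The synthetic data, with large spread along X and Y and almost none along Z, is an assumption made to mirror the example.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
# Hypothetical 3D data: wide spread along X and Y, almost none along Z.
data = np.column_stack([
    rng.normal(scale=5.0, size=n),
    rng.normal(scale=3.0, size=n),
    rng.normal(scale=0.1, size=n),
])

# Center the data, then take the SVD; the rows of vt are the principal axes.
centered = data - data.mean(axis=0)
_, singular_values, vt = np.linalg.svd(centered, full_matrices=False)

# Share of the total variance captured by each principal component.
explained_ratio = singular_values**2 / np.sum(singular_values**2)

# Keep the two strongest components: each point is now described by 2 numbers.
reduced = centered @ vt[:2].T

print(explained_ratio)
print(reduced.shape)
```

Because almost all the variance lies in the first two components, dropping the third loses very little information in this particular data.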
Dimensionality Reduction Techniques
Dimensionality reduction techniques can be broadly divided into two
categories:
1. Feature Selection
Feature selection chooses the most relevant features from the dataset without
altering them. It helps remove redundant or irrelevant features, improving
model efficiency. Some common methods are:
• Filter methods rank the features based on their relevance to the target
variable.
• Wrapper methods use the model performance as the criteria for
selecting features.
• Embedded methods combine feature selection with the model training
process.
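
As one possible illustration of a filter method, the sketch below ranks features by the absolute value of their correlation with the target. The feature names and the synthetic data are hypothetical, and correlation is only one of several relevance scores a filter method might use.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 300
# Hypothetical features: one strongly relevant, one weaker, one pure noise.
relevant = rng.normal(size=n)
weak = rng.normal(size=n)
noise = rng.normal(size=n)
target = 2.0 * relevant + 1.0 * weak + rng.normal(scale=0.5, size=n)

features = {"relevant": relevant, "weak": weak, "noise": noise}

# Score each feature independently of any model, then rank by score.
scores = {
    name: abs(np.corrcoef(column, target)[0, 1])
    for name, column in features.items()
}
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)
```

A wrapper method would instead retrain a model for each candidate subset, which is usually more accurate but considerably slower.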

2. Feature Extraction
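Feature extraction involves creating new features by combining or transforming
the original features. These new features retain most of the dataset’s
important information in fewer dimensions. Common dimensionality reduction
methods are listed below; note that Missing Value Ratio, Backward Feature
Elimination, Forward Feature Selection and Random Forest select existing
features rather than extract new ones: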
1. Principal Component Analysis (PCA): Converts correlated variables into
uncorrelated principal components hence reducing dimensionality while
maintaining as much variance as possible enabling more efficient
analysis.
2. Missing Value Ratio: Variables with missing data beyond a set threshold
are removed, improving dataset reliability.
3. Backward Feature Elimination: Starts with all features and removes the
least significant ones in each iteration. The process continues until only
the most impactful features remain, optimizing model performance.
4. Forward Feature Selection: It begins with one feature, adds others
incrementally and keeps those improving model performance.
5. Random Forest: A random forest uses an ensemble of decision trees to
estimate feature importance, so the most relevant features can be ranked
and selected without manual inspection.
6. Factor Analysis: Models correlated variables as combinations of a smaller
number of underlying factors and keeps those factors for further analysis.
7. Independent Component Analysis (ICA): Identifies statistically
independent components, ideal for applications like ‘blind source
separation’ where traditional correlation-based methods fall short.

Real World Use Case

Dimensionality reduction plays an important role in many real-world
applications such as text categorization, image retrieval, gene expression
analysis and more. Here are a few examples:
1. Text Categorization: With vast amounts of online data, dimensionality
reduction helps classify text documents into predefined categories by
reducing the feature space (such as word or phrase features) while
maintaining accuracy.
2. Image Retrieval: As image data grows, indexing based on visual content
like color, texture and shape, rather than just text descriptions, has
become essential. This allows for better retrieval of images from large
databases.
3. Gene Expression Analysis: Dimensionality reduction accelerates gene
expression analysis and helps classify samples (for example, leukemia
samples) by identifying key features, improving both speed and accuracy.
4. Intrusion Detection: In cybersecurity, dimensionality reduction helps
analyze user activity patterns to detect suspicious behaviors and
intrusions by identifying optimal features for network monitoring.
Advantages
As seen earlier, high dimensionality makes models inefficient. Let's now
summarize the key advantages of reducing dimensionality.
• Faster Computation: With fewer features, machine learning algorithms
can process data more quickly. This results in faster model training and
testing, which is particularly useful when working with large datasets.
• Better Visualization: As we saw in the earlier figure, reducing dimensions
makes it easier to visualize data and reveal hidden patterns.
• Prevent Overfitting: With fewer features, models are less likely to
memorize the training data and overfit. This helps the model generalize
better to new, unseen data, improving its ability to make accurate
predictions.
Disadvantages
• Data Loss & Reduced Accuracy: Some important information may be lost
during dimensionality reduction, which can affect model performance.
• Choosing the Right Components: Deciding how many dimensions to
keep is difficult, as keeping too few may lose valuable information while
keeping too many can lead to overfitting.

➢ Introduction to Clustering :

Clustering is the task of dividing a dataset into several groups (clusters) such
that data points in the same cluster are more similar to each other than to
those in other clusters [3]. It is a form of unsupervised learning and can also be
used as a data reduction technique by summarizing the data through cluster
prototypes (e.g., centroids).

• K-Means Clustering: Partitions data into k distinct clusters, with each data
point assigned to the cluster with the nearest mean (centroid).
• Hierarchical Clustering: Builds a hierarchy of clusters, either
agglomeratively (bottom-up) or divisively (top-down).
• DBSCAN (Density-Based Spatial Clustering of Applications with Noise):
Groups together points that are closely packed together, marking as
outliers points that lie alone in low-density regions .
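
To show how cluster prototypes can summarize data, here is a bare-bones K-Means sketch in NumPy. It is a simplified version of the standard iterative procedure with a fixed number of iterations; the `k_means` function, its parameters and the two synthetic blobs are assumptions made for this example.

```python
import numpy as np

def k_means(points, k, n_iter=20, seed=0):
    """Alternate between assigning points to the nearest centroid and re-averaging."""
    rng = np.random.default_rng(seed)
    # Start from k distinct data points chosen at random.
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # Distance from every point to every centroid, shape (n_points, k).
        distances = np.linalg.norm(
            points[:, None, :] - centroids[None, :, :], axis=2
        )
        labels = distances.argmin(axis=1)
        for j in range(k):
            members = points[labels == j]
            if len(members) > 0:  # keep the old centroid if a cluster is empty
                centroids[j] = members.mean(axis=0)
    return centroids, labels

# Two hypothetical, well-separated blobs of 100 points each.
rng = np.random.default_rng(42)
blob_a = rng.normal(loc=0.0, scale=1.0, size=(100, 2))
blob_b = rng.normal(loc=10.0, scale=1.0, size=(100, 2))
points = np.vstack([blob_a, blob_b])

centroids, labels = k_means(points, k=2)
print(centroids)
```

Used as data reduction, the 200 points are summarized by just two centroids plus, if needed, one cluster label per point.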

➢ Introduction to Data Sampling :

Data sampling is a statistical process of selecting a representative subset of
data from a larger dataset to analyze the characteristics of the whole dataset.
This is particularly useful for very large datasets where processing all the data is
computationally prohibitive.
• Simple Random Sampling: Every individual in the population has an
equal chance of being selected.
• Stratified Sampling: The population is divided into subgroups (strata)
based on shared characteristics, and then a random sample is taken from
each subgroup.
• Cluster Sampling: The population is divided into clusters, and a random
selection of these clusters is analyzed completely.
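
Simple random and stratified sampling can be sketched with pandas as follows. The population table, the `segment` column and the 10% sampling fraction are hypothetical.

```python
import pandas as pd

# Hypothetical population: 80 records in segment A and 20 in segment B.
population = pd.DataFrame({
    "id": range(100),
    "segment": ["A"] * 80 + ["B"] * 20,
})

# Simple random sampling: every row has the same chance of being picked.
simple = population.sample(n=10, random_state=0)

# Stratified sampling: draw 10% from each segment so both are represented
# in proportion to their size.
stratified = population.groupby("segment").sample(frac=0.1, random_state=0)

print(simple["segment"].value_counts())
print(stratified["segment"].value_counts())
```

A simple random sample of this size can miss the smaller segment entirely, whereas the stratified sample always contains rows from both.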

➢ Introduction to Data Discretization :

Data discretization is the process of converting continuous data into a finite set
of intervals or bins. This makes the data easier to manage and analyze,
especially for algorithms that prefer categorical data, such as certain
classification models.
• Binning: Data is divided into several intervals (bins). Methods include
equal-width binning (bins have the same range) and equal-frequency
binning (each bin has the same number of data points).
• Histogram Analysis: Similar to binning, but often used for visualization
and density estimation.
• Feature Mapping: Transforming continuous values into categorical ones
using specific functions or rules.
Unit 4

Categorical data refers to variables that belong to distinct categories such as
labels, names or types. Since most machine learning algorithms require
numerical inputs, encoding categorical data to numerical data becomes
important. Proper encoding ensures that models can interpret categorical
variables effectively, leading to improved predictive accuracy and reduced bias.

Types of Categorical Data

1. Nominal Data: Nominal data consists of categories without any inherent
order or ranking. These are simple labels used to classify data.
• Example: 'Red', 'Blue', 'Green' (Car Color).
• Encoding Options: One-Hot Encoding or Label Encoding, depending on
the model's needs.
2. Ordinal Data: Ordinal data includes categories with a defined order or
ranking, where the relationship between values is important.
• Example: 'Low', 'Medium', 'High' (Car Engine Power).
• Encoding Options: Ordinal Encoding.
Using the right encoding techniques, we can effectively transform categorical
data for machine learning models which improves their performance and
predictive capabilities.
Techniques to perform Categorical Data Encoding
1. Label Encoding
Label Encoding assigns each category a unique integer. It is simple and
memory-efficient but may unintentionally imply an order among categories
when none exists.
• Used in tree-based models like Decision Trees or XGBoost.
• Pros: Simple and memory-efficient.
• Cons: Introduces implicit order which may be misinterpreted by non-tree
models when used with nominal data.
Let's look at the following example:
from sklearn.preprocessing import LabelEncoder

data = ['Red', 'Green', 'Blue', 'Red']

# LabelEncoder sorts the categories alphabetically before numbering them
le = LabelEncoder()
encoded_data = le.fit_transform(data)
print(f"Encoded Data: {encoded_data}")
Output:
Encoded Data: [2 1 0 2]
Here, 'Blue' becomes 0, 'Green' becomes 1 and 'Red' becomes 2, because
LabelEncoder numbers the categories in sorted order.
2. One-Hot Encoding
One-Hot Encoding converts categories into binary columns with each column
representing one category. It prevents false ordering but can lead to high
dimensionality if there are many unique values.
• Used in linear models, logistic regression and neural networks.
• Pros: Does not assume order; widely supported.
• Cons: Can cause high dimensionality and sparse data when feature has
many categories.
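Let's look at the following example:
import pandas as pd

data = ['Red', 'Blue', 'Green', 'Red']

df = pd.DataFrame(data, columns=['Color'])
# dtype=int gives 0/1 columns; recent pandas versions return True/False by default
one_hot_encoded = pd.get_dummies(df['Color'], dtype=int)
print(one_hot_encoded)
Output:
   Blue  Green  Red
0     0      0    1
1     1      0    0
2     0      1    0
3     0      0    1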
Each unique category ('Red', 'Blue', 'Green') is transformed into a separate
binary column, with 1 representing the presence of the category and 0 its
absence.
3. Ordinal Encoding
Ordinal Encoding maps categories to integers while preserving their natural
order. This works well for ordered data like ratings but is not suitable for
nominal variables.
• Used for ordered features like ratings or education levels.
• Pros: Maintains order; reduces dimensionality.
• Cons: Not suitable for nominal categories.
Let's consider the following example:
from sklearn.preprocessing import OrdinalEncoder

data = [['Low'], ['Medium'], ['High'], ['Medium'], ['Low']]

# Pass the categories explicitly so the encoder uses this order, not alphabetical order
encoder = OrdinalEncoder(categories=[['Low', 'Medium', 'High']])
encoded_data = encoder.fit_transform(data)
print(f"Encoded Ordinal Data: {encoded_data}")
Output:
Encoded Ordinal Data: [[0.]
 [1.]
 [2.]
 [1.]
 [0.]]
In this case, 'Low' is encoded as 0, 'Medium' as 1 and 'High' as 2, preserving the
natural order of the categories.
4. Target Encoding
Target Encoding also known as Mean Encoding is a technique where each
category in a feature is replaced by the mean of the target variable for that
category.
• Useful for high-cardinality features like ZIP codes or product IDs.
• Pros: Captures relationship to target variable.
• Cons: Risk of overfitting, also must apply smoothing/statistical
techniques.
Target Encoding
Let's consider the following example:
import pandas as pd
import category_encoders as ce

df = pd.DataFrame(
    {'City': ['London', 'Paris', 'London', 'Berlin'], 'Target': [1, 0, 1, 0]}
)

encoder = ce.TargetEncoder(cols=['City'])
df_tgt = encoder.fit_transform(df[['City']], df['Target'])

print(f"Encoded Target Data:\n{df_tgt}")

In this case, each city is encoded based on the mean of the target variable
for the rows where it appears. For instance, 'London' appears twice with a
target of 1 both times, so its raw mean is 1.0, while 'Paris' and 'Berlin' each
have a raw mean of 0. The encoder applies smoothing by default, so the printed
values are typically pulled toward the overall target mean (0.5) rather than
matching the raw means exactly.
5. Binary Encoding
Binary encoding represents categories as binary codes and splits them across
multiple columns. It is efficient for high-cardinality data but slightly more
complex to implement.
• Applied in high-cardinality text/NLP tasks to save memory.
• Pros: Reduces dimensionality, more memory-efficient than one-hot
encoding.
• Cons: Slightly more complex; requires careful handling of missing values.
Binary Encoding
Let's consider the following example:
import pandas as pd
import category_encoders as ce

data = ['Red', 'Green', 'Blue', 'Red']
encoder = ce.BinaryEncoder(cols=['Color'])
encoded_data = encoder.fit_transform(pd.DataFrame(data, columns=['Color']))
print(encoded_data)

Here, each category (like 'Red', 'Blue', 'Green') is first mapped to an
integer, and that integer is written in binary digits. Each binary digit is
placed in a separate column (e.g., Color_0 and Color_1), so three categories
need only two columns. The exact code assigned to each category depends on the
order in which the encoder encounters the categories.
6. Frequency Encoding
Frequency Encoding assigns categories values based on how often they occur
in the dataset. It is simple and compact but can introduce data leakage if
applied improperly.
• Effective in retail, e-commerce or clickstream data for popularity trends.
• Pros: Low computational and storage requirements.
• Cons: Can introduce data leakage if not handled properly.
Frequency Encoding
Let's consider the following example:
import pandas as pd

data = ['Red', 'Green', 'Blue', 'Red', 'Red']
series_data = pd.Series(data)
# Count how often each category occurs
frequency_encoding = series_data.value_counts()

# Replace each value by its count (cast to plain Python int for clean printing)
encoded_data = [int(frequency_encoding[x]) for x in data]

print("Encoded Data:", encoded_data)
Output:
Encoded Data: [3, 1, 1, 3, 3]
Here, 'Red' appears 3 times, so it is encoded as 3, while 'Green' and 'Blue'
appear once, so they are encoded as 1.
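One way to avoid the data leakage mentioned above is to learn the frequencies
from the training split only and reuse that mapping on new data. The sketch
below is a minimal illustration under that assumption; the function names and
the default value for unseen categories are hypothetical choices, not part of
any library.

```python
from collections import Counter

def fit_frequency_encoding(train_values):
    """Learn relative frequencies from the training values only."""
    counts = Counter(train_values)
    total = len(train_values)
    return {category: n / total for category, n in counts.items()}

def apply_frequency_encoding(values, mapping, default=0.0):
    """Encode values with a previously learned mapping.

    Categories never seen during fitting get `default` (an assumed choice).
    """
    return [mapping.get(v, default) for v in values]

train = ['Red', 'Green', 'Blue', 'Red', 'Red']
mapping = fit_frequency_encoding(train)

# 'Yellow' does not occur in the training data, so it falls back to the default
test_encoded = apply_frequency_encoding(['Red', 'Yellow'], mapping)
print(test_encoded)
```

Using relative frequencies instead of raw counts keeps the encoded values
comparable when the training and test sets differ in size.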

1. Feature Scaling
Feature scaling is a data preprocessing technique used to bring all numerical
features of a dataset onto a similar scale. In real-world datasets, features often
have very different units and ranges (e.g., age in years, income in lakhs,
distance in kilometers). If not scaled, features with larger numerical values may
dominate the learning process.
Why Feature Scaling is Important
• Prevents bias toward large-valued features
• Improves convergence speed of gradient-based algorithms
• Ensures fair distance computation in distance-based algorithms
• Enhances numerical stability
• Improves model performance and accuracy
Algorithms That Require Feature Scaling
• Gradient Descent
• K-Nearest Neighbors (KNN)
• Support Vector Machines (SVM)
• Principal Component Analysis (PCA)
• Neural Networks
• Logistic Regression
• Linear Regression (with regularization)

1.1 Normalization (Min–Max Scaling)

Normalization rescales the feature values into a fixed range, typically [0, 1].
Formula
$$X_{\text{scaled}} = \frac{X - X_{\min}}{X_{\max} - X_{\min}}$$

Characteristics
• Preserves the relative relationships between data points
• Compresses all values into a bounded range
• Sensitive to outliers
Advantages
• Useful for neural networks
• Improves model convergence
• Ensures consistent feature magnitude
Limitations
• Outliers can distort the scale
• Not ideal if future data exceeds the original min/max
Use Cases
• Image processing (pixel values)
• Deep learning models
• When features do not follow a Gaussian distribution
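
The min-max formula above can be sketched in a few lines of NumPy. This is an
illustrative implementation, not a replacement for a library scaler; the
function name and the sample values (age in years, income in lakhs) are
hypothetical.

```python
import numpy as np

def min_max_scale(X, feature_range=(0.0, 1.0)):
    """Rescale each column of X into feature_range with the min-max formula."""
    X = np.asarray(X, dtype=float)
    x_min = X.min(axis=0)
    x_max = X.max(axis=0)
    # Constant columns have zero range; use 1.0 there to avoid dividing by zero
    span = np.where(x_max > x_min, x_max - x_min, 1.0)
    low, high = feature_range
    return low + (X - x_min) / span * (high - low)

# Hypothetical data: column 0 is age in years, column 1 is income in lakhs
X = np.array([[20.0, 2.0],
              [30.0, 6.0],
              [40.0, 10.0]])

X_scaled = min_max_scale(X)
print(X_scaled)
```

In practice the minimum and maximum should be computed on the training data and
reused for new data, which is why values outside the original range can fall
outside [0, 1].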

1.2 Standardization (Z-score Normalization)

Standardization transforms data so that it has:
• Mean = 0
• Standard Deviation = 1
Formula
$$X_{\text{scaled}} = \frac{X - \mu}{\sigma}$$

Where:
• $\mu$ = mean of the feature
• $\sigma$ = standard deviation of the feature
Characteristics
• Centers data around zero
• Does not bound values to a fixed range
• Less affected by outliers compared to Min–Max scaling
Advantages
• Works well when data follows a normal (Gaussian) distribution
• Preferred for SVM, PCA, Linear Regression
• Maintains statistical properties of data
Limitations
• Still influenced by extreme outliers
• Harder to interpret scaled values
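
As a rough illustration of the z-score formula, the sketch below standardizes
each column with NumPy. The function name and sample data are hypothetical, and
the population standard deviation (ddof=0) is an assumed choice.

```python
import numpy as np

def standardize(X):
    """Center each column to mean 0 and scale it to standard deviation 1."""
    X = np.asarray(X, dtype=float)
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)  # population standard deviation (ddof=0)
    # Constant columns have sigma == 0; use 1.0 there to avoid dividing by zero
    sigma = np.where(sigma > 0, sigma, 1.0)
    return (X - mu) / sigma

# Hypothetical features on very different scales
X = np.array([[1.0, 100.0],
              [2.0, 200.0],
              [3.0, 300.0]])

Z = standardize(X)
print(Z.mean(axis=0))  # approximately 0 for every column
print(Z.std(axis=0))   # approximately 1 for every column
```

After the transformation both columns are identical, which shows that only the
relative position of each value within its feature is kept.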
Other Feature Scaling Techniques
• Robust Scaling (uses median and IQR, good for outliers)
• Log Transformation (reduces skewness)
• Power Transformation (Box-Cox, Yeo-Johnson)
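
Robust scaling, listed above, can be sketched as subtracting the median and
dividing by the interquartile range (IQR). The function name and data below are
hypothetical, and the quartiles use NumPy's default linear interpolation.

```python
import numpy as np

def robust_scale(X):
    """Scale each column using its median and interquartile range."""
    X = np.asarray(X, dtype=float)
    median = np.median(X, axis=0)
    q1, q3 = np.percentile(X, [25, 75], axis=0)
    # Guard against a zero IQR (e.g. a nearly constant column)
    iqr = np.where(q3 > q1, q3 - q1, 1.0)
    return (X - median) / iqr

# Hypothetical single feature with one extreme outlier (100)
x = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

scaled = robust_scale(x)
print(scaled.ravel())
```

The outlier stays large after scaling, but it does not compress the other
values the way it would under min-max scaling, because the median and IQR are
barely affected by it.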

2. Encoding Categorical Features

Machine learning algorithms operate on numerical data, so categorical
variables must be transformed into numeric representations without losing
meaning.
Types of Categorical Data
• Nominal: No order (e.g., color, city)
• Ordinal: Has order (e.g., low, medium, high)

2.1 Nominal Encoding (One-Hot Encoding)

One-hot encoding creates binary columns for each unique category.
Example
Color = {Red, Blue, Green}

Encoding of the value 'Red':

Red | Blue | Green
1   | 0    | 0

Advantages
• No ordinal assumption
• Prevents misleading numerical relationships
Limitations
• Increases dimensionality
• Can cause curse of dimensionality
• Memory inefficient for high-cardinality features
Use Cases
• Gender, city, product type
• Tree-based models (Random Forest, XGBoost)

2.2 Ordinal Encoding

Assigns integers based on category order.
Example
Low → 1
Medium → 2
High → 3
Advantages
• Preserves ranking information
• Simple and compact
Limitations
• Assumes equal distance between categories
• Can mislead linear models if order spacing is uneven
Use Cases
• Education level
• Ratings or severity scales

2.3 Target Encoding (Mean Encoding)

Each category is replaced with the mean of the target variable for that
category.
Advantages
• Handles high-cardinality features efficiently
• Often improves performance
Limitations
• High risk of data leakage
• Overfitting if not regularized
• Requires careful cross-validation
Best Practices
• Fit the encoding on training data only, then apply the learned mapping to
validation and test data
• Use smoothing or noise
• Combine with cross-validation
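
The first two best practices can be sketched as follows: the mapping is learned
from training rows only, and each category mean is blended with the global mean.
The blending formula (n * category_mean + m * global_mean) / (n + m) is one
common smoothing scheme; the weight m, the function names, and the sample data
are assumptions made for illustration.

```python
import pandas as pd

def fit_target_encoding(categories, target, m=2.0):
    """Learn a smoothed category -> mean(target) mapping from training data.

    m is a hypothetical smoothing weight: larger values pull rare categories
    more strongly toward the global mean.
    """
    df = pd.DataFrame({"cat": categories, "y": target})
    global_mean = df["y"].mean()
    stats = df.groupby("cat")["y"].agg(["mean", "count"])
    smoothed = (stats["count"] * stats["mean"] + m * global_mean) / (stats["count"] + m)
    return smoothed.to_dict(), global_mean

def apply_target_encoding(categories, mapping, global_mean):
    """Encode categories; unseen ones fall back to the global mean."""
    return [mapping.get(c, global_mean) for c in categories]

train_city = ["London", "Paris", "London", "Berlin"]
train_y = [1, 0, 1, 0]

mapping, prior = fit_target_encoding(train_city, train_y, m=2.0)

# 'Madrid' was never seen during fitting, so it receives the global mean
encoded_test = apply_target_encoding(["London", "Madrid"], mapping, prior)
print(mapping)
print(encoded_test)
```

With cross-validation, the same two functions would be applied per fold, fitting
on the other folds and encoding the held-out fold, so that no row is encoded
using its own target value.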

Other Encoding Techniques

• Frequency Encoding
• Binary Encoding
• Hash Encoding
• Label Encoding (use cautiously)

3. Binarization
Binarization converts continuous or discrete numerical features into binary
values (0 or 1) based on a threshold.
Rule
• Value > threshold → 1
• Value ≤ threshold → 0
Why Use Binarization
• Simplifies feature representation
• Focuses on presence/absence
• Reduces noise
• Useful for rule-based or Boolean models
Applications
• Spam detection
• Medical diagnosis (disease present or not)
• Text data (word present or not)
Limitations
• Loses magnitude information
• Threshold selection is critical
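
The thresholding rule above can be sketched with a single NumPy comparison. The
function name, the sample ages, and the threshold of 17 are hypothetical.

```python
import numpy as np

def binarize(values, threshold):
    """Return 1 where value > threshold and 0 where value <= threshold."""
    return (np.asarray(values) > threshold).astype(int)

# Hypothetical feature: age in years, binarized into "is an adult"
ages = np.array([12, 18, 25, 40, 17])

is_adult = binarize(ages, threshold=17)
print(is_adult)
```

Note that a value equal to the threshold (17 here) maps to 0, matching the
"Value ≤ threshold → 0" rule.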
4. Data Labeling
Data labeling is the process of annotating raw data with meaningful labels so
that supervised learning models can learn from it.
Importance of Data Labeling
• Determines model accuracy
• Directly impacts learning quality
• Essential for supervised and semi-supervised learning
Types of Data Labeling
• Image labeling: bounding boxes, segmentation masks
• Text labeling: sentiment, topic, intent
• Audio labeling: speech transcription, emotion tagging
• Video labeling: object tracking, action recognition
Challenges
• Time-consuming
• Expensive
• Human bias and inconsistency
• Scalability issues
Solutions
• Automated labeling
• Active learning
• Semi-supervised learning
• Crowdsourcing

5. Feature Selection
Feature selection involves choosing a subset of the most relevant features
from the original dataset to improve model efficiency and performance.
Why Feature Selection Matters
• Reduces overfitting
• Improves generalization
• Decreases training time
• Enhances interpretability
• Removes irrelevant and redundant features

5.1 Filter Methods

Use statistical techniques independent of ML algorithms.
Common Techniques
• Correlation coefficient
• Chi-square test
• ANOVA
• Mutual Information
Advantages
• Fast and scalable
• Model-independent
Limitations
• Ignores feature interactions
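
As a minimal sketch of a filter method, the code below ranks features by the
absolute Pearson correlation between each feature and the target and keeps the
top k. The function name and the synthetic data are assumptions; the target is
constructed to depend only on features 0 and 2.

```python
import numpy as np

def top_k_by_correlation(X, y, k):
    """Rank features by |Pearson correlation| with y and return the top k."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    scores = np.array(
        [abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(X.shape[1])]
    )
    order = np.argsort(scores)[::-1]  # highest score first
    return order[:k], scores

# Synthetic data: y depends on features 0 and 2; feature 1 is pure noise
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = 3.0 * X[:, 0] + 1.5 * X[:, 2] + rng.normal(scale=0.1, size=500)

selected, scores = top_k_by_correlation(X, y, k=2)
print(selected, scores)
```

Each feature is scored on its own, with no model involved, which is what makes
the approach fast but blind to interactions between features.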

5.2 Wrapper Methods

Evaluate feature subsets using a learning algorithm.
Examples
• Forward selection
• Backward elimination
• Recursive Feature Elimination (RFE)
Advantages
• High accuracy
• Considers feature interactions
Limitations
• Computationally expensive
• Prone to overfitting
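
Forward selection can be sketched as a greedy loop that, at each step, adds the
feature whose inclusion gives the lowest validation error. The version below
assumes ordinary least squares as the learning algorithm and a single hold-out
split; the function names and synthetic data are hypothetical.

```python
import numpy as np

def holdout_mse(X_train, y_train, X_val, y_val):
    """Fit least squares (with intercept) on train, return MSE on validation."""
    A_train = np.column_stack([np.ones(len(X_train)), X_train])
    A_val = np.column_stack([np.ones(len(X_val)), X_val])
    coef, *_ = np.linalg.lstsq(A_train, y_train, rcond=None)
    return float(np.mean((A_val @ coef - y_val) ** 2))

def forward_selection(X_train, y_train, X_val, y_val, k):
    """Greedily add the feature that most reduces validation MSE, k times."""
    selected = []
    remaining = list(range(X_train.shape[1]))
    while remaining and len(selected) < k:
        scores = {
            j: holdout_mse(X_train[:, selected + [j]], y_train,
                           X_val[:, selected + [j]], y_val)
            for j in remaining
        }
        best = min(scores, key=scores.get)
        selected.append(best)
        remaining.remove(best)
    return selected

# Synthetic data: only features 0 and 2 influence the target
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
y = 3.0 * X[:, 0] + 1.5 * X[:, 2] + rng.normal(scale=0.1, size=300)

selected = forward_selection(X[:200], y[:200], X[200:], y[200:], k=2)
print(selected)
```

Every candidate subset requires a model fit, which illustrates why wrapper
methods become expensive as the number of features grows.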

5.3 Embedded Methods

Feature selection is built into model training.
Examples
• L1 regularization (Lasso)
• Decision trees
• Random Forest feature importance
Advantages
• Efficient
• Balanced bias-variance tradeoff
Limitations
• Model-dependent
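
As an illustration of an embedded method, the sketch below fits scikit-learn's
Lasso and keeps the features whose coefficients are not driven to zero. The
regularization strength alpha=0.1 and the synthetic data are assumptions chosen
for the example, not recommended settings.

```python
import numpy as np
from sklearn.linear_model import Lasso

# Synthetic data: only features 0 and 2 influence the target
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))
y = 3.0 * X[:, 0] + 1.5 * X[:, 2] + rng.normal(scale=0.1, size=500)

# The L1 penalty shrinks coefficients and sets uninformative ones exactly to zero
model = Lasso(alpha=0.1)
model.fit(X, y)

# Features with a non-zero coefficient are the ones the model "selected"
selected = np.flatnonzero(model.coef_)
print("Coefficients:", model.coef_)
print("Selected features:", selected)
```

Selection happens as a side effect of training a single model, which is why
embedded methods tend to be cheaper than wrapper methods.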
Unit 5

Dimensionality Reduction Overview

Dimensionality reduction is the process of reducing the number of input
variables (features) in a dataset while preserving as much important
information as possible.
It is used to:
• Reduce computational cost
• Remove redundant/noisy features
• Improve visualization
• Reduce overfitting
• Improve model performance

Two widely used dimensionality reduction techniques are:

1. Principal Component Analysis (PCA) – linear method

2. t-Distributed Stochastic Neighbor Embedding (t-SNE) – non-linear
method
1. Principal Component Analysis (PCA)
Definition
Principal Component Analysis (PCA) is a linear dimensionality reduction
technique that transforms the original features into a new set of uncorrelated
variables called principal components, ordered by the amount of variance they
capture from the data.
Key Idea of PCA
• Identify directions (axes) of maximum variance
• Project data onto these new axes
• Keep only the top k components
• Discard less informative dimensions

Characteristics of PCA
• Linear transformation
• Unsupervised learning technique
• Components are orthogonal (uncorrelated)
• Sensitive to feature scaling
• Preserves global structure of data

Mathematical Intuition
1. Center the data (mean = 0)
2. Compute covariance matrix
3. Compute eigenvalues and eigenvectors
4. Sort eigenvectors by descending eigenvalues
5. Select top-k eigenvectors
6. Project data onto new feature space
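
The six steps above can be sketched directly in NumPy. This is an illustrative
implementation for small datasets; the function name and synthetic data are
hypothetical, and library implementations typically use an SVD instead of an
explicit covariance matrix.

```python
import numpy as np

def pca_eig(X, k):
    """PCA via eigendecomposition of the covariance matrix."""
    X = np.asarray(X, dtype=float)
    X_centered = X - X.mean(axis=0)              # 1. center the data
    cov = np.cov(X_centered, rowvar=False)       # 2. covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)       # 3. eigenvalues/eigenvectors
    order = np.argsort(eigvals)[::-1]            # 4. sort by descending eigenvalue
    components = eigvecs[:, order[:k]]           # 5. keep the top-k eigenvectors
    explained_ratio = eigvals[order[:k]] / eigvals.sum()
    projected = X_centered @ components          # 6. project onto the new axes
    return projected, components, explained_ratio

# Synthetic correlated data with 3 features
rng = np.random.default_rng(0)
mixing = np.array([[3.0, 0.0, 0.0],
                   [1.0, 1.0, 0.0],
                   [0.0, 0.0, 0.2]])
X = rng.normal(size=(100, 3)) @ mixing

Z, components, ratio = pca_eig(X, k=2)
print("Projected shape:", Z.shape)
print("Explained variance ratio:", ratio)
```

np.linalg.eigh is used because the covariance matrix is symmetric; it returns
real eigenvalues and orthonormal eigenvectors.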

Advantages of PCA
• Reduces dimensionality efficiently
• Removes multicollinearity
• Improves training speed
• Useful for noise reduction
• Works well with linear models
Limitations of PCA
• Assumes linear relationships
• Not ideal for complex non-linear data
• Reduced interpretability
• Sensitive to outliers

Applications of PCA
• Image compression
• Face recognition
• Gene expression analysis
• Feature extraction
• Visualization of high-dimensional data

PCA Implementation in Python (scikit-learn)

from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import load_iris

# Load dataset
data = load_iris()
X = data.data

# Standardize the data

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Apply PCA
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)

# Explained variance
print("Explained Variance Ratio:", pca.explained_variance_ratio_)
Explanation
• StandardScaler() → PCA is sensitive to feature scale, so the data is
standardized first
• n_components=2 → reduce to 2 dimensions
• explained_variance_ratio_ → tells how much information is retained
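
One common use of the explained variance ratio is to pick the smallest number
of components whose cumulative ratio reaches a chosen threshold. The sketch
below assumes a 95% threshold; the function name and the ratio values are
hypothetical, not output from the Iris example.

```python
import numpy as np

def components_for_variance(explained_variance_ratio, threshold=0.95):
    """Smallest k whose cumulative explained variance reaches the threshold."""
    cumulative = np.cumsum(explained_variance_ratio)
    # Index of the first cumulative value >= threshold, converted to a count
    k = int(np.searchsorted(cumulative, threshold) + 1)
    # Never report more components than exist
    return min(k, len(cumulative))

# Hypothetical ratios, sorted in descending order as PCA reports them
ratios = np.array([0.70, 0.20, 0.07, 0.03])

k = components_for_variance(ratios, threshold=0.95)
print(k)
```

Here the first two components explain about 90% of the variance and the first
three about 97%, so three components are needed to reach 95%.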

2. t-Distributed Stochastic Neighbor Embedding (t-SNE)

Definition
t-SNE is a non-linear dimensionality reduction technique mainly used for data
visualization. It preserves local relationships by mapping similar points close
together in lower-dimensional space.

Key Idea of t-SNE

• Converts distances between points into probability distributions
• Minimizes difference between high-dimensional and low-dimensional
distributions
• Focuses on local neighborhood structure

Characteristics of t-SNE
• Non-linear method
• Unsupervised
• Computationally expensive
• Best for visualization (2D or 3D)
• Not suitable for feature extraction for models
Important Parameters
• n_components: Output dimensions (usually 2 or 3)
• perplexity: Controls balance between local and global structure
• learning_rate: Affects convergence
• n_iter: Number of optimization iterations (named max_iter in recent
scikit-learn releases)

Advantages of t-SNE
• Excellent visualization quality
• Reveals clusters clearly
• Handles complex non-linear patterns

Limitations of t-SNE
• Very slow on large datasets
• Results are non-deterministic
• No inverse transform
• Poor preservation of global structure
• Not ideal for downstream ML models

Applications of t-SNE
• Visualizing embeddings
• Cluster inspection
• Exploratory data analysis
• Deep learning feature visualization
t-SNE Implementation in Python (scikit-learn)
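from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import load_iris

# Load dataset
data = load_iris()
X = data.data

# Scale data
X_scaled = StandardScaler().fit_transform(X)

# Apply t-SNE. The iteration count is left at its default (1000); the parameter
# is named n_iter in older scikit-learn releases and max_iter in newer ones.
tsne = TSNE(n_components=2, perplexity=30, learning_rate=200, random_state=42)
X_tsne = tsne.fit_transform(X_scaled)

Explanation
• perplexity=30 → the scikit-learn default
• Iterations → the default of 1000 is usually sufficient for convergence on a
small dataset such as Iris
• random_state=42 → fixes the seed so the layout is repeatable
• Output is used only for visualization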

PCA vs t-SNE (Comparison)

Feature           | PCA               | t-SNE
Type              | Linear            | Non-linear
Speed             | Fast              | Slow
Use case          | Feature reduction | Visualization
Preserves         | Global variance   | Local structure
Scalability       | High              | Low
Inverse transform | Possible          | Not possible

When to Use What

• Use PCA when:
o You want to reduce dimensions for ML models
o Speed and interpretability matter
o Data is approximately linear
• Use t-SNE when:
o You want to visualize clusters
o Data is complex and non-linear
o Interpretation is visual, not predictive
