Unit 1
➢ Introduction:
Raw, real-world data is often incomplete, inconsistent, and dirty. For example, a
dataset might have missing values, contain errors (such as a person's age listed as
200), be noisy (containing random errors or variances), or be in a format incompatible
with many analytical models. Data preprocessing addresses these issues, ensuring
that the data is high-quality and reliable, which in turn improves the accuracy and
efficiency of any subsequent analysis or machine learning model.
➢ Definitions:
• Data Preprocessing: A series of techniques used to prepare raw data for analysis.
• Data Cleaning: The process of detecting and correcting (or removing) corrupt,
inaccurate, or irrelevant records from a dataset. Key tasks include handling missing
values and outlier detection.
• Data Transformation: The process of converting data from one format or structure to
another, often involving normalization, smoothing, and aggregation.
• Data Reduction: Techniques used to obtain a reduced representation of the data set
that is much smaller in volume but still produces the same (or nearly the same)
analytical results.
➢ Need for Data Preprocessing:
The primary need for data preprocessing stems from the "garbage in, garbage out"
principle: the quality of the output is only as good as the quality of the input. Specific
reasons include:
• Improving Data Quality: Raw data is often incomplete, inconsistent, and noisy.
Preprocessing ensures that the data is accurate and reliable.
• Enhancing Model Accuracy: Most machine learning models perform better with
clean, well-structured data.
• Increasing Efficiency: Preprocessed data can be processed faster by algorithms,
reducing computation time.
• Compatibility: Many algorithms have specific data format or scaling requirements
(e.g., numerical inputs only, data scaled to a 0-1 range), which preprocessing fulfills.
➢ Understanding Data Preprocessing:
Data preprocessing typically involves several key steps:
1. Data Cleaning:
• Handling Missing Values: Replacing missing data with a calculated value (mean,
median, mode) or simply removing the record.
• Handling Noisy Data: Smoothing data or removing outliers using methods like
binning, regression, or clustering.
2. Data Integration: Merging data from multiple sources (databases, files, etc.) into a
single, consistent data store.
3. Data Transformation:
• Normalization/Scaling: Adjusting data to a common scale (e.g., Min-Max scaling or
Z-score normalization) so that no single feature dominates the model.
• Attribute/Feature Engineering: Creating new features from existing ones to help
the model learn more effectively.
4. Data Reduction:
• Dimensionality Reduction: Reducing the number of variables, using techniques like
Principal Component Analysis (PCA) or feature selection.
• Numerosity Reduction: Replacing the original data with smaller alternative
representations, such as histograms.
➢ Libraries Used for Data Preprocessing:
In the field of data science, several powerful libraries facilitate the data preprocessing
steps. Common Python libraries include:
• NumPy: Provides support for large, multi-dimensional arrays and matrices, along with
a collection of mathematical functions to operate on these arrays [3]. It is fundamental
for numerical operations and data structuring.
• Pandas: Offers powerful data structures like DataFrames and tools for data
manipulation and analysis, making it the primary tool for data cleaning, handling
missing values, and data integration [3]. Tutorials are available in the official pandas
documentation.
• Scikit-learn (sklearn): A comprehensive machine learning library with a
dedicated preprocessing module. It provides a wide range of tools for scaling,
normalization, encoding categorical variables, and dimensionality reduction
techniques like PCA. More detail is available in the scikit-learn preprocessing
documentation.
• Matplotlib: Primarily a plotting library; its visualizations are often used during the
preprocessing phase for outlier detection and initial data exploration.
• Seaborn: A visualization library built on Matplotlib, useful for visually identifying
patterns, missing data, and outliers.
Unit 2
➢ Steps in Data Preprocessing:
Some key steps in data preprocessing are Data Cleaning, Data Integration, Data
Transformation, and Data Reduction.
1. Data Cleaning: It is the process of identifying and correcting errors or
inconsistencies in the dataset. It involves handling missing values, removing
duplicates, and correcting incorrect or outlier data to ensure the dataset is accurate
and reliable. Clean data is essential for effective analysis, as it improves the quality of
results and enhances the performance of data models.
• Missing Values: These occur when data is absent from a dataset. You can either ignore
the rows with missing data or fill the gaps manually, with the attribute mean, or with
the most probable value. Filling the gaps keeps the dataset complete for analysis.
• Noisy Data: This is data containing random errors or meaningless values that are
difficult for machines to interpret, often caused by errors in data collection or entry.
It can be handled in several ways:
o Binning Method: The sorted data is divided into equal-sized segments (bins), and each
bin is smoothed by replacing its values with the bin mean or the nearest bin boundary.
o Regression: Data is smoothed by fitting it to a regression function (simple linear or
multiple regression) and using the fitted values.
o Clustering: This method groups similar data points together; values that fall outside
the clusters can be detected as outliers.
These techniques help remove noise and improve data quality.
• Removing Duplicates: It involves identifying and eliminating repeated data entries to
ensure accuracy and consistency in the dataset. This process prevents errors and
ensures reliable analysis by keeping only unique records.
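The cleaning steps above can be sketched in plain Python. This is a minimal illustration rather than a production routine: the helper names (impute_mean, smooth_by_bin_means, drop_duplicates) and the sample values are assumptions made for this example, and in practice a library such as Pandas would normally do this work.

```python
from statistics import mean

def impute_mean(values):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]

def smooth_by_bin_means(values, bin_size):
    """Sort the values, split them into equal-size bins, and replace
    every value with the mean of its bin."""
    ordered = sorted(values)
    smoothed = []
    for start in range(0, len(ordered), bin_size):
        bin_values = ordered[start:start + bin_size]
        smoothed.extend([mean(bin_values)] * len(bin_values))
    return smoothed

def drop_duplicates(records):
    """Keep the first occurrence of each record, preserving order."""
    return list(dict.fromkeys(records))

# Hypothetical sample data
ages = [25, None, 35, 30]
imputed_ages = impute_mean(ages)

prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
smoothed_prices = smooth_by_bin_means(prices, bin_size=3)

rows = [("alice", 30), ("bob", 25), ("alice", 30)]
unique_rows = drop_duplicates(rows)

print(imputed_ages)
print(smoothed_prices)
print(unique_rows)
```

Mean imputation is only one of the options listed above; the median or mode may be a better choice when the attribute is skewed or categorical.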
2. Data Integration: It involves merging data from various sources into a single, unified
dataset. It can be challenging due to differences in data formats, structures, and
meanings. Techniques like record linkage and data fusion help in combining data
efficiently, ensuring consistency and accuracy.
• Record Linkage is the process of identifying and matching records from different
datasets that refer to the same entity, even if they are represented differently. It helps
in combining data from various sources by finding corresponding records based on
common identifiers or attributes.
• Data Fusion involves combining data from multiple sources to create a more
comprehensive and accurate dataset. It integrates information that may be
inconsistent or incomplete from different sources, ensuring a unified and richer
dataset for analysis.
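Record linkage and data fusion can be illustrated with a small sketch. The two sources (crm and billing), their fields, and the normalize helper are hypothetical; real record linkage usually needs fuzzier matching than the simple key normalization assumed here.

```python
def normalize(name):
    """Build a simple matching key: lowercase, no punctuation, single spaces."""
    kept = "".join(ch for ch in name.lower() if ch.isalnum() or ch.isspace())
    return " ".join(kept.split())

# Two hypothetical sources describing the same companies in different ways
crm = [
    {"name": "Acme Corp.", "city": "Pune"},
    {"name": "Globex", "city": None},
]
billing = [
    {"name": "ACME corp", "revenue": 120},
    {"name": "globex ", "city": "Delhi", "revenue": 75},
]

# Record linkage: index one source by the normalized key, look up the other
billing_by_key = {normalize(r["name"]): r for r in billing}

fused = []
for record in crm:
    match = billing_by_key.get(normalize(record["name"]), {})
    # Data fusion: start from the matched record, then overlay every
    # non-missing value from the CRM record
    merged = dict(match)
    merged.update({k: v for k, v in record.items() if v is not None})
    fused.append(merged)

print(fused)
```

Here the CRM source wins whenever both sources provide a value, and the billing source fills the gaps. That precedence rule is a design choice for this sketch, not a general rule.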
3. Data Transformation: It involves converting data into a format suitable for analysis.
Common techniques include normalization, which scales data to a common range;
standardization, which adjusts data to have zero mean and unit variance; and
discretization, which converts continuous data into discrete categories. These
techniques help prepare the data for more accurate analysis.
• Data Normalization: The process of scaling data to a common range to ensure
consistency across variables.
• Discretization: Converting continuous data into discrete categories for easier analysis.
• Data Aggregation: Combining multiple data points into a summary form, such as
averages or totals, to simplify analysis.
• Concept Hierarchy Generation: Organizing data into a hierarchy of concepts to
provide a higher-level view for better understanding and analysis.
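Normalization and standardization can be written out directly from their definitions. The sketch below assumes a small hypothetical list of incomes and uses the population standard deviation; scikit-learn's preprocessing module offers equivalent scalers for real projects.

```python
from statistics import mean, pstdev

def min_max_scale(values):
    """Normalization: map values linearly onto the range [0, 1].
    Assumes the values are not all identical (otherwise hi - lo is 0)."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def z_score(values):
    """Standardization: shift to zero mean and scale to unit variance."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

incomes = [20, 30, 40, 50, 60]  # hypothetical values
scaled = min_max_scale(incomes)
standardized = z_score(incomes)

print(scaled)        # [0.0, 0.25, 0.5, 0.75, 1.0]
print(standardized)
```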
4. Data Reduction: It reduces the dataset's size while maintaining key information.
This can be done through feature selection, which chooses the most relevant features,
and feature extraction, which transforms the data into a lower-dimensional space
while preserving important details. Common reduction techniques include:
• Dimensionality Reduction (e.g., Principal Component Analysis): A technique that
reduces the number of variables in a dataset while retaining its essential information.
• Numerosity Reduction: Reducing the number of data points by methods like sampling
to simplify the dataset without losing critical patterns.
• Data Compression: Reducing the size of data by encoding it in a more compact form,
making it easier to store and process.
➢ Uses of Data Preprocessing:
Data preprocessing is utilized across various fields to ensure that raw data is
transformed into a usable format for analysis and decision-making. Here are some key
areas where data preprocessing is applied:
1. Data Warehousing: In data warehousing, preprocessing is essential for cleaning,
integrating, and structuring data before it is stored in a centralized repository. This
ensures the data is consistent and reliable for future queries and reporting.
2. Data Mining: Data preprocessing in data mining involves cleaning and transforming
raw data to make it suitable for analysis. This step is crucial for identifying patterns
and extracting insights from large datasets.
3. Machine Learning: In machine learning, preprocessing prepares raw data for model
training. This includes handling missing values, normalizing features, encoding
categorical variables, and splitting datasets into training and testing sets to improve
model performance and accuracy.
4. Data Science: Data preprocessing is a fundamental step in data science projects,
ensuring that the data used for analysis or building predictive models is clean,
structured, and relevant. It enhances the overall quality of insights derived from the
data.
5. Web Mining: In web mining, preprocessing helps analyze web usage logs to extract
meaningful user behavior patterns. This can inform marketing strategies and improve
user experience through personalized recommendations.
6. Business Intelligence (BI): Preprocessing supports BI by organizing and cleaning
data to create dashboards and reports that provide actionable insights for decision-
makers.
7. Deep Learning: Similar to machine learning, deep learning applications
require preprocessing to normalize or enhance features of the input data, optimizing
model training processes.
➢ Advantages of Data Preprocessing:
• Improved Data Quality: Ensures data is clean, consistent, and reliable for analysis.
• Better Model Performance: Reduces noise and irrelevant data, leading to more
accurate predictions and insights.
• Efficient Data Analysis: Streamlines data for faster and easier processing.
• Enhanced Decision-Making: Provides clear and well-organized data for better
business decisions.
➢ Disadvantages of Data Preprocessing:
• Time-Consuming: Requires significant time and effort to clean, transform, and
organize data.
• Resource-Intensive: Demands computational power and skilled personnel for
complex preprocessing tasks.
• Potential Data Loss: Incorrect handling may result in losing valuable information.
• Complexity: Handling large datasets or diverse formats can be challenging.
Unit 3
INTRODUCTION:
Data reduction is a technique used in data mining to reduce the size of a
dataset while still preserving the most important information. This can be
beneficial in situations where the dataset is too large to be processed
efficiently, or where the dataset contains a large amount of irrelevant or
redundant information.
There are several different data reduction techniques that can be used in data
mining, including:
1. Data Sampling: This technique involves selecting a subset of the data to
work with, rather than using the entire dataset. This can be useful for
reducing the size of a dataset while still preserving the overall trends and
patterns in the data.
2. Dimensionality Reduction: This technique involves reducing the number
of features in the dataset, either by removing features that are not
relevant or by combining multiple features into a single feature.
3. Data Compression: This technique involves using techniques such as
lossy or lossless compression to reduce the size of a dataset.
4. Data Discretization: This technique involves converting continuous data
into discrete data by partitioning the range of possible values into
intervals or bins.
5. Feature Selection: This technique involves selecting a subset of features
from the dataset that are most relevant to the task at hand.
Note that data reduction involves a trade-off between the size of the data
and accuracy. The more aggressively the data is reduced, the greater the
risk that the model becomes less accurate and generalizes less well.
In conclusion, data reduction is an important step in data mining, as it can help
to improve the efficiency and performance of machine learning algorithms by
reducing the size of the dataset. However, it is important to be aware of the
trade-off between the size and accuracy of the data, and carefully assess the
risks and benefits before implementing it.
Methods of data reduction:
These methods are explained below.
1. Data Cube Aggregation:
This technique aggregates data into a simpler, summarized form. For example,
suppose the data gathered for your analysis covers the years 2012 to 2014 and
records your company's revenue every three months. If you are interested in
annual sales rather than quarterly figures, the data can be summarized so that
it gives the total sales per year instead of per quarter.
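The quarterly-to-annual summary described above amounts to a group-and-sum. The revenue figures below are invented purely for illustration.

```python
from collections import defaultdict

# Hypothetical quarterly revenue records: (year, quarter, revenue)
quarterly_revenue = [
    (2012, "Q1", 200), (2012, "Q2", 250), (2012, "Q3", 300), (2012, "Q4", 350),
    (2013, "Q1", 220), (2013, "Q2", 260), (2013, "Q3", 310), (2013, "Q4", 360),
    (2014, "Q1", 240), (2014, "Q2", 270), (2014, "Q3", 330), (2014, "Q4", 380),
]

# Aggregate along the time dimension: quarter -> year
totals = defaultdict(int)
for year, _quarter, revenue in quarterly_revenue:
    totals[year] += revenue
annual_revenue = dict(totals)

print(annual_revenue)  # {2012: 1100, 2013: 1150, 2014: 1220}
```

Twelve rows are reduced to three, and the annual totals are still exact. Only the quarterly detail is lost.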
2. Dimension reduction:
When some attributes are only weakly relevant to the analysis, we keep only
the attributes that the analysis requires. This reduces data size by eliminating
irrelevant or redundant features.
• Step-wise Forward Selection -
The selection begins with an empty set of attributes. At each step, the
best of the remaining original attributes is added to the set, where "best"
is usually judged by a statistical significance test (for example, a p-value).
Suppose the data set has the following attributes, a few of which are
redundant.
Initial attribute Set: {X1, X2, X3, X4, X5, X6}
Initial reduced attribute set: { }
Step-1: {X1}
Step-2: {X1, X2}
Step-3: {X1, X2, X5}
Final reduced attribute set: {X1, X2, X5}
• Step-wise Backward Selection -
This selection starts with a set of complete attributes in the original data
and at each point, it eliminates the worst remaining attribute in the set.
Suppose the data set has the following attributes, a few of which are
redundant.
Initial attribute Set: {X1, X2, X3, X4, X5, X6}
Initial reduced attribute set: {X1, X2, X3, X4, X5, X6 }
Step-1: {X1, X2, X3, X4, X5}
Step-2: {X1, X2, X3, X5}
Step-3: {X1, X2, X5}
Final reduced attribute set: {X1, X2, X5}
• Combination of Forward and Backward Selection -
At each step, it selects the best attribute and removes the worst from the
remaining attributes, which saves time and makes the process faster.
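The step-wise forward selection described above can be sketched as a greedy loop. Everything specific here is an assumption made for illustration: the RELEVANCE scores, the toy_score function (a stand-in for a real model-quality or significance measure), and the min_gain stopping threshold.

```python
def forward_selection(candidates, score, min_gain=0.1):
    """Greedy step-wise forward selection.

    Start with an empty set and repeatedly add the attribute that raises
    score(subset) the most. Stop when the best gain is below min_gain.
    """
    selected = []
    remaining = list(candidates)
    current = score(selected)
    while remaining:
        best = max(remaining, key=lambda attr: score(selected + [attr]))
        gain = score(selected + [best]) - current
        if gain < min_gain:
            break
        selected.append(best)
        remaining.remove(best)
        current = score(selected)
    return selected

# Hypothetical per-attribute relevance values (not from any real data set)
RELEVANCE = {"X1": 0.9, "X2": 0.7, "X3": 0.05, "X4": 0.02, "X5": 0.5, "X6": 0.01}

def toy_score(subset):
    """Toy quality measure: the sum of the relevance of the chosen attributes."""
    return sum(RELEVANCE[attr] for attr in subset)

reduced = forward_selection(["X1", "X2", "X3", "X4", "X5", "X6"], toy_score)
print(reduced)  # ['X1', 'X2', 'X5']
```

Backward selection follows the same pattern in reverse: start from the full set and repeatedly drop the attribute whose removal hurts the score the least.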
3. Data Compression:
The data compression technique reduces the size of files using encoding
mechanisms such as Huffman encoding and run-length encoding. Compression
falls into two types:
• Lossless Compression -
Encoding techniques such as run-length encoding give a simple, modest
reduction in data size. With lossless compression, the exact original
data can be restored from the compressed data.
• Lossy Compression -
Methods such as the discrete wavelet transform and PCA (principal
component analysis) are examples of this type. For example, JPEG is a
lossy image format, yet the decompressed image still conveys essentially
the same content as the original. In lossy compression, the
decompressed data may differ from the original but remains useful
enough to retrieve information from.
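Run-length encoding, mentioned above as a lossless technique, is short enough to show in full. The function names and the sample string are illustrative assumptions.

```python
from itertools import groupby

def rle_encode(text):
    """Collapse each run of repeated characters into a (character, count) pair."""
    return [(ch, len(list(run))) for ch, run in groupby(text)]

def rle_decode(pairs):
    """Rebuild the original string from (character, count) pairs."""
    return "".join(ch * count for ch, count in pairs)

original = "AAAABBBCCD"
encoded = rle_encode(original)
restored = rle_decode(encoded)

print(encoded)   # [('A', 4), ('B', 3), ('C', 2), ('D', 1)]
print(restored == original)  # True: nothing was lost
```

The round trip restores the input exactly, which is what makes the scheme lossless. It only saves space when the data contains long runs of repeated values.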
4. Numerosity Reduction:
In this technique, the actual data is replaced with a smaller representation.
Parametric methods fit a mathematical model (for example, a regression model)
and store only the model parameters instead of the actual data. Non-parametric
methods include clustering, histograms, and sampling.
5. Discretization & Concept Hierarchy Operation:
Data discretization techniques divide the range of a continuous attribute into
intervals. Many continuous values of the attribute are replaced by a small
number of interval labels, so that mining results are presented in a concise
and easily understandable way.
• Top-down discretization -
If the process first chooses one or a few points (so-called breakpoints or
split points) to divide the whole attribute range and then repeats this
recursively on the resulting intervals, it is known as top-down
discretization, or splitting.
• Bottom-up discretization -
If the process first treats all the continuous values as potential split
points and then removes some of them by merging neighboring values into
intervals, it is known as bottom-up discretization, or merging.
Concept Hierarchies:
It reduces the data size by collecting and then replacing the low-level concepts
(such as 43 for age) with high-level concepts (categorical variables such as
middle age or Senior).
For numeric data, the following techniques can be used:
• Binning -
Binning is the process of changing numerical variables into categorical
counterparts. The number of categorical counterparts depends on the
number of bins specified by the user.
• Histogram analysis -
Like binning, a histogram partitions the values of an attribute X into
disjoint ranges called buckets. There are several partitioning rules:
1. Equal Frequency Partitioning: Partitioning the values so that each
bucket contains approximately the same number of values.
2. Equal Width Partitioning: Partitioning the values into ranges of a
fixed width determined by the number of bins, e.g., buckets
covering 0-20, 20-40, and so on.
3. Clustering: Grouping similar data together.
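The difference between equal-width and equal-frequency partitioning shows up clearly on skewed data. In this sketch the function names and the sample values are assumptions; each function returns a bin label (0, 1, 2, ...) for every input value.

```python
def equal_width_bins(values, n_bins):
    """Label each value by which of n_bins equally wide intervals it falls in."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    # The maximum value would land in bin n_bins, so clamp it into the last bin
    return [min(int((v - lo) / width), n_bins - 1) for v in values]

def equal_frequency_bins(values, n_bins):
    """Label each value so that every bin holds about the same number of values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    labels = [0] * len(values)
    for rank, index in enumerate(order):
        labels[index] = rank * n_bins // len(values)
    return labels

# Hypothetical skewed data: most values are small, a few are very large
values = [5, 10, 11, 13, 15, 35, 50, 55, 72, 92, 204, 215]

width_labels = equal_width_bins(values, 3)
frequency_labels = equal_frequency_bins(values, 3)

print(width_labels)      # [0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 2, 2]
print(frequency_labels)  # [0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2]
```

With equal-width bins, nine of the twelve values crowd into the first bin. Equal-frequency bins put four values in each bin, at the cost of intervals with very different widths.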
ADVANTAGES AND DISADVANTAGES OF Data Reduction in Data Mining:
Data reduction in data mining can have a number of advantages and
disadvantages.
Advantages:
1. Improved efficiency: Data reduction can help to improve the efficiency of
machine learning algorithms by reducing the size of the dataset. This can
make it faster and more practical to work with large datasets.
2. Improved performance: Data reduction can help to improve the
performance of machine learning algorithms by removing irrelevant or
redundant information from the dataset. This can help to make the
model more accurate and robust.
3. Reduced storage costs: Data reduction can help to reduce the storage
costs associated with large datasets by reducing the size of the data.
4. Improved interpretability: Data reduction can help to improve the
interpretability of the results by removing irrelevant or redundant
information from the dataset.
Disadvantages:
1. Loss of information: Data reduction can result in a loss of information, if
important data is removed during the reduction process.
2. Impact on accuracy: Data reduction can impact the accuracy of a model,
as reducing the size of the dataset can also remove important
information that is needed for accurate predictions.
3. Impact on interpretability: Data reduction can make it harder to interpret
the results, as removing irrelevant or redundant information can also
remove context that is needed to understand the results.
4. Additional computational costs: Data reduction can add additional
computational costs to the data mining process, as it requires additional
processing time to reduce the data.
➢ Introduction to Dimensionality Reduction:
When working with machine learning models, datasets with too many features
can cause issues like slow computation and overfitting. Dimensionality
reduction helps to reduce the number of features while retaining key
information. It converts high-dimensional data into a lower-dimensional space
while preserving important details.
For example, suppose you are building a model to predict house prices with
features like bedrooms, square footage and location. If you add too many
further features, such as room condition or flooring type, the dataset becomes
large and complex.
How Does Dimensionality Reduction Work?
Let's understand how dimensionality reduction works with the help of an
example. Imagine a dataset where each data point exists in a 3D space defined
by axes X, Y and Z. If most of the data variance occurs along X and Y then the Z-
dimension may contribute very little to understanding the structure of the
data.
Figure: Dimensionality Reduction
• Before reduction, the data exists in 3D (X, Y, Z). It has high
redundancy, and Z contributes little meaningful information.
• On the right, after reducing the dimensionality, the data is represented in
lower-dimensional spaces. The top plot (X-Y) maintains the meaningful
structure, while the bottom plot (Z-Y) shows that the Z-dimension
contributed little useful information.
This process makes data analysis more efficient, improving computation
speed and visualization while minimizing redundancy.
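The 3D example can be mimicked with a few lines of Python: measure how much of the total variance each axis carries and drop any axis that carries almost none. The sample points and the 1% threshold are assumptions for this sketch. Comparing raw variances is only meaningful when the axes are on comparable scales, and this is a simplification rather than PCA.

```python
from statistics import pvariance

# Hypothetical 3D points: X and Y vary a lot, Z is nearly constant
points = [
    (1.0, 2.0, 0.50),
    (2.0, 4.1, 0.51),
    (3.0, 5.9, 0.49),
    (4.0, 8.2, 0.50),
    (5.0, 9.8, 0.50),
]
axes = ("X", "Y", "Z")

# Variance along each axis, and its share of the total variance
variances = {name: pvariance([p[i] for p in points]) for i, name in enumerate(axes)}
total = sum(variances.values())
shares = {name: variances[name] / total for name in axes}

# Keep only the axes carrying at least 1% of the total variance (assumed threshold)
keep = [name for name in axes if shares[name] >= 0.01]
reduced = [tuple(p[i] for i, name in enumerate(axes) if name in keep) for p in points]

print(keep)        # ['X', 'Y']
print(reduced[0])  # (1.0, 2.0)
```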
Dimensionality Reduction Techniques
Dimensionality reduction techniques can be broadly divided into two
categories:
1. Feature Selection
Feature selection chooses the most relevant features from the dataset without
altering them. It helps remove redundant or irrelevant features, improving
model efficiency. Some common methods are:
• Filter methods rank the features based on their relevance to the target
variable.
• Wrapper methods use the model performance as the criteria for
selecting features.
• Embedded methods combine feature selection with the model training
process.
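A filter method can be as simple as ranking features by the absolute value of their correlation with the target. The house-price data below is invented for illustration, and Pearson correlation is just one possible relevance measure.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

# Hypothetical house data
features = {
    "square_feet": [800, 1000, 1200, 1500, 2000],
    "bedrooms": [2, 2, 3, 3, 4],
    "door_color_code": [3, 1, 2, 3, 1],
}
price = [80, 100, 125, 150, 205]

# Filter method: score every feature independently of any model, then rank
relevance = {name: abs(pearson(column, price)) for name, column in features.items()}
ranking = sorted(relevance, key=relevance.get, reverse=True)
top_two = ranking[:2]

print(ranking)
```

Wrapper methods would instead retrain a model for each candidate subset. Filter methods are cheaper, but they ignore interactions between features.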
2. Feature Extraction
Feature extraction involves creating new features by combining or transforming
the original features. These new features retain most of the dataset’s
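important information in fewer dimensions. The list below covers common
dimensionality reduction methods: PCA, Factor Analysis and ICA are feature
extraction methods, while the remaining entries select among existing features: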
1. Principal Component Analysis (PCA): Converts correlated variables into
uncorrelated principal components hence reducing dimensionality while
maintaining as much variance as possible enabling more efficient
analysis.
2. Missing Value Ratio: Variables with missing data beyond a set threshold
are removed, improving dataset reliability.
3. Backward Feature Elimination: Starts with all features and removes the
least significant ones in each iteration. The process continues until only
the most impactful features remain, optimizing model performance.
4. Forward Feature Selection: It begins with one feature, adds others
incrementally and keeps those improving model performance.
5. Random Forest: A random forest computes feature importance scores from
its decision trees; these scores can be used to keep only the most
relevant features.
6. Factor Analysis: Models the observed variables as combinations of a
smaller number of latent factors, grouping variables that are highly
correlated with one another.
7. Independent Component Analysis (ICA): Identifies statistically
independent components, ideal for applications like ‘blind source
separation’ where traditional correlation-based methods fall short.
Real World Use Case
Dimensionality reduction plays an important role in many real-world
applications such as text categorization, image retrieval, gene expression
analysis and more. Here are a few examples:
1. Text Categorization: With vast amounts of online data, dimensionality
reduction helps classify text documents into predefined categories by
reducing the feature space (such as word or phrase features) while
maintaining accuracy.
2. Image Retrieval: As image data grows, indexing based on visual content
such as color, texture and shape, rather than just text descriptions, has
become essential. This allows for better retrieval of images from large
databases.
3. Gene Expression Analysis: Dimensionality reduction accelerates gene
expression analysis and helps classify samples (for example, leukemia
samples) by identifying key features, improving both speed and accuracy.
4. Intrusion Detection: In cybersecurity, dimensionality reduction helps
analyze user activity patterns to detect suspicious behaviors and
intrusions by identifying optimal features for network monitoring.
Advantages
As seen earlier high dimensionality makes models inefficient. Let's now
summarize the key advantages of reducing dimensionality.
• Faster Computation: With fewer features machine learning algorithms
can process data more quickly. This results in faster model training and
testing which is particularly useful when working with large datasets.
• Better Visualization: As we saw in the earlier figure, reducing dimensions
makes it easier to visualize data and reveal hidden patterns.
• Prevent Overfitting: With fewer features, models are less likely to
memorize the training data and overfit. This helps the model generalize
better to new, unseen data, improving its ability to make accurate
predictions.
Disadvantages
• Data Loss & Reduced Accuracy: Some important information may be lost
during dimensionality reduction and affect model performance.
• Choosing the Right Components: Deciding how many dimensions to
keep is difficult, as keeping too few may lose valuable information while
keeping too many can lead to overfitting.
➢ Introduction to Clustering :
Clustering is the task of dividing a dataset into several groups (clusters) such
that data points in the same cluster are more similar to each other than to
those in other clusters [3]. It is a form of unsupervised learning and can also be
used as a data reduction technique by summarizing the data through cluster
prototypes (e.g., centroids).
• K-Means Clustering: Partitions data into k distinct clusters, with each
data point assigned to the cluster with the nearest mean (centroid).
• Hierarchical Clustering: Builds a hierarchy of clusters, either
agglomeratively (bottom-up) or divisively (top-down).
• DBSCAN (Density-Based Spatial Clustering of Applications with Noise):
Groups together points that are closely packed, and marks points that
lie alone in low-density regions as outliers.
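The k-means idea (assign each point to the nearest centroid, then move each centroid to the mean of its points) can be sketched for one-dimensional data. The function name, the fixed iteration count, the sample values, and the hand-picked starting centroids are all assumptions; library implementations such as scikit-learn's choose starting points and stopping rules more carefully.

```python
def k_means_1d(values, centroids, iterations=10):
    """Minimal 1D k-means: alternate assignment and centroid update."""
    clusters = []
    for _ in range(iterations):
        # Assignment step: each value joins the cluster of its nearest centroid
        clusters = [[] for _ in centroids]
        for v in values:
            nearest = min(range(len(centroids)), key=lambda i: abs(v - centroids[i]))
            clusters[nearest].append(v)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid)
        centroids = [
            sum(cluster) / len(cluster) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return centroids, clusters

values = [1, 2, 3, 10, 11, 12]
centroids, clusters = k_means_1d(values, centroids=[1.0, 12.0])

print(centroids)  # [2.0, 11.0]
print(clusters)   # [[1, 2, 3], [10, 11, 12]]
```

Used as data reduction, the six values could then be summarized by the two centroids plus the cluster sizes.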
➢ Introduction to Data Sampling :
Data sampling is a statistical process of selecting a representative subset of
data from a larger dataset to analyze the characteristics of the whole dataset.
This is particularly useful for very large datasets where processing all the data is
computationally prohibitive.
• Simple Random Sampling: Every individual in the population has an
equal chance of being selected.
• Stratified Sampling: The population is divided into subgroups (strata)
based on shared characteristics, and then a random sample is taken from
each subgroup.
• Cluster Sampling: The population is divided into clusters, and a random
selection of these clusters is analyzed completely.
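Simple random and stratified sampling can be compared on a small synthetic population. The population, the 80/20 group split, the 10% sampling fraction, and the stratified_sample helper are assumptions made for this sketch. The random seed is fixed only to make runs repeatable.

```python
import random

rng = random.Random(42)  # fixed seed so the sketch is repeatable

# Hypothetical population: 80 members of group A and 20 of group B
population = [{"id": i, "group": "A" if i < 80 else "B"} for i in range(100)]

# Simple random sampling: every record has the same chance of selection
simple_sample = rng.sample(population, 10)

def stratified_sample(rows, key, fraction, rng):
    """Sample the same fraction from every stratum defined by rows[key]."""
    strata = {}
    for row in rows:
        strata.setdefault(row[key], []).append(row)
    sample = []
    for members in strata.values():
        size = max(1, round(len(members) * fraction))
        sample.extend(rng.sample(members, size))
    return sample

stratified = stratified_sample(population, "group", 0.1, rng)
group_counts = {
    group: sum(1 for row in stratified if row["group"] == group)
    for group in ("A", "B")
}

print(group_counts)  # {'A': 8, 'B': 2}
```

The stratified sample always mirrors the 80/20 split, while a simple random sample of the same size can over- or under-represent the smaller group by chance.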
➢ Introduction to Data Discretization :
Data discretization is the process of converting continuous data into a finite set
of intervals or bins. This makes the data easier to manage and analyze,
especially for algorithms that prefer categorical data, such as certain
classification models.
• Binning: Data is divided into several intervals (bins). Methods include
equal-width binning (bins have the same range) and equal-frequency
binning (each bin has the same number of data points).
• Histogram Analysis: Similar to binning, but often used for visualization
and density estimation.
• Feature Mapping: Transforming continuous values into categorical ones
using specific functions or rules.
Unit 4
Categorical data refers to variables that belong to distinct categories such as
labels, names or types. Since most machine learning algorithms require
numerical inputs, encoding categorical data to numerical data becomes
important. Proper encoding ensures that models can interpret categorical
variables effectively, leading to improved predictive accuracy and reduced bias.
Types of Categorical Data
1. Nominal Data: Nominal data consists of categories without any inherent
order or ranking. These are simple labels used to classify data.
• Example: 'Red', 'Blue', 'Green' (Car Color).
• Encoding Options: One-Hot Encoding or Label Encoding, depending on
the model's needs.
2. Ordinal Data: Ordinal data includes categories with a defined order or
ranking, where the relationship between values is important.
• Example: 'Low', 'Medium', 'High' (Car Engine Power).
• Encoding Options: Ordinal Encoding.
Using the right encoding techniques, we can effectively transform categorical
data for machine learning models which improves their performance and
predictive capabilities.
Techniques to perform Categorical Data Encoding
1. Label Encoding
Label Encoding assigns each category a unique integer. It is simple and
memory-efficient but may unintentionally imply an order among categories
when none exists.
• Used in tree-based models like Decision Trees or XGBoost.
• Pros: Simple and memory-efficient.
• Cons: Introduces implicit order which may be misinterpreted by non-tree
models when used with nominal data.
Let's look at the following example:
from sklearn.preprocessing import LabelEncoder
data = ['Red', 'Green', 'Blue', 'Red']
le = LabelEncoder()
# Classes are sorted alphabetically before integers are assigned
encoded_data = le.fit_transform(data)
print(f"Encoded Data: {encoded_data}")
Output:
Encoded Data: [2 1 0 2]
LabelEncoder sorts the classes alphabetically, so 'Blue' becomes 0, 'Green'
becomes 1 and 'Red' becomes 2.
2. One-Hot Encoding
One-Hot Encoding converts categories into binary columns with each column
representing one category. It prevents false ordering but can lead to high
dimensionality if there are many unique values.
• Used in linear models, logistic regression and neural networks.
• Pros: Does not assume order; widely supported.
• Cons: Can cause high dimensionality and sparse data when feature has
many categories.
Let's look at the following example:
import pandas as pd
data = ['Red', 'Blue', 'Green', 'Red']
df = pd.DataFrame(data, columns=['Color'])
# dtype=int gives 0/1 columns (recent pandas versions default to True/False)
one_hot_encoded = pd.get_dummies(df['Color'], dtype=int)
print(one_hot_encoded)
Output:
   Blue  Green  Red
0     0      0    1
1     1      0    0
2     0      1    0
3     0      0    1
Each unique category ('Red', 'Blue', 'Green') is transformed into a separate
binary column, with 1 representing the presence of the category and 0 its
absence.
3. Ordinal Encoding
Ordinal Encoding maps categories to integers while preserving their natural
order. This works well for ordered data like ratings but is not suitable for
nominal variables.
• Used for ordered features like ratings or education levels.
• Pros: Maintains order; reduces dimensionality.
• Cons: Not suitable for nominal categories.
Let's consider the following example:
from sklearn.preprocessing import OrdinalEncoder
data = [['Low'], ['Medium'], ['High'], ['Medium'], ['Low']]
# The categories list fixes the order: Low < Medium < High
encoder = OrdinalEncoder(categories=[['Low', 'Medium', 'High']])
encoded_data = encoder.fit_transform(data)
print(f"Encoded Ordinal Data: {encoded_data}")
Output:
Encoded Ordinal Data: [[0.]
 [1.]
 [2.]
 [1.]
 [0.]]
In this case, 'Low' is encoded as 0, 'Medium' as 1 and 'High' as 2, preserving the
natural order of the categories.
4. Target Encoding
Target Encoding also known as Mean Encoding is a technique where each
category in a feature is replaced by the mean of the target variable for that
category.
• Useful for high-cardinality features like ZIP codes or product IDs.
• Pros: Captures relationship to target variable.
• Cons: Risk of overfitting; smoothing or similar statistical techniques are
usually needed.
Let's consider the following example:
import pandas as pd
import category_encoders as ce
df = pd.DataFrame(
{'City': ['London', 'Paris', 'London', 'Berlin'], 'Target': [1, 0, 1, 0]}
)
encoder = ce.TargetEncoder(cols=['City'])
# Pass the feature as a one-column DataFrame and the target as a Series
df_tgt = encoder.fit_transform(df[['City']], df['Target'])
print(f"Encoded Target Data:\n{df_tgt}")
In this case, each city is encoded using the mean of the target variable for
that city. The raw means are 1.0 for 'London' and 0.0 for 'Paris' and 'Berlin'.
Because TargetEncoder applies smoothing, the printed values are pulled toward
the overall target mean of 0.5, and the exact numbers depend on the encoder's
smoothing settings and the library version.
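To make the role of smoothing concrete, here is a hand-written sketch of target encoding with simple additive smoothing. This is not the formula used by the category_encoders library; the function name and the smoothing weight m are assumptions chosen for illustration.

```python
from collections import defaultdict

def smoothed_target_encode(categories, targets, m=1.0):
    """Encode each category as a blend of its own target mean and the
    global target mean. m acts like m extra 'virtual' rows at the global
    mean, so rare categories are pulled harder toward it."""
    global_mean = sum(targets) / len(targets)
    sums, counts = defaultdict(float), defaultdict(int)
    for category, target in zip(categories, targets):
        sums[category] += target
        counts[category] += 1
    encoding = {
        category: (sums[category] + m * global_mean) / (counts[category] + m)
        for category in counts
    }
    return [encoding[c] for c in categories], encoding

cities = ["London", "Paris", "London", "Berlin"]
target = [1, 0, 1, 0]

encoded, encoding = smoothed_target_encode(cities, target)
print(encoding)
```

With m = 1, 'London' (two rows, raw mean 1.0) is encoded as 2.5 / 3, roughly 0.83, while 'Paris' and 'Berlin' (one row each, raw mean 0.0) are encoded as 0.25. To limit overfitting, the encoding should be computed from training data only and then applied to test data.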
5. Binary Encoding
Binary encoding represents categories as binary codes and splits them across
multiple columns. It is efficient for high-cardinality data but slightly more
complex to implement.
• Applied in high-cardinality text/NLP tasks to save memory.
• Pros: Reduces dimensionality, more memory-efficient than one-hot
encoding.
• Cons: Slightly more complex; requires careful handling of missing values.
Let's consider the following example:
import pandas as pd
import category_encoders as ce
data = ['Red', 'Green', 'Blue', 'Red']
encoder = ce.BinaryEncoder(cols=['Color'])
encoded_data = encoder.fit_transform(pd.DataFrame(data, columns=['Color']))
print(encoded_data)
Output:
(output not reproduced)
Here, each category ('Red', 'Green', 'Blue') is first mapped to an integer, and
that integer is then written in binary, for example '01', '10' and '11'. Which
code a category receives depends on the order in which the encoder encounters
the categories. Each binary digit is placed in a separate column (e.g., Color_0
and Color_1).
6. Frequency Encoding
Frequency Encoding assigns categories values based on how often they occur
in the dataset. It is simple and compact but can introduce data leakage if
applied improperly.
• Effective in retail, e-commerce or clickstream data for popularity trends.
• Pros: Low computational and storage requirements.
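• Cons: Can introduce data leakage if frequencies are computed on the full
dataset rather than on the training split; categories with equal counts become
indistinguishable.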
Frequency Encoding
Let's consider the following example:
import pandas as pd
data = ['Red', 'Green', 'Blue', 'Red', 'Red']
series_data = pd.Series(data)
frequency_encoding = series_data.value_counts()
encoded_data = [frequency_encoding[x] for x in data]
print("Encoded Data:", encoded_data)
Output:
Encoded Data: [np.int64(3), np.int64(1), np.int64(1), np.int64(3), np.int64(3)]
Here, 'Red' appears 3 times, so it is encoded as 3, while 'Green' and 'Blue'
appear once, so they are encoded as 1.
1. Feature Scaling
Feature scaling is a data preprocessing technique used to bring all numerical
features of a dataset onto a similar scale. In real-world datasets, features often
have very different units and ranges (e.g., age in years, income in lakhs,
distance in kilometers). If not scaled, features with larger numerical values may
dominate the learning process.
Why Feature Scaling is Important
• Prevents bias toward large-valued features
• Improves convergence speed of gradient-based algorithms
• Ensures fair distance computation in distance-based algorithms
• Enhances numerical stability
• Improves model performance and accuracy
Algorithms That Require Feature Scaling
• Gradient Descent
• K-Nearest Neighbors (KNN)
• Support Vector Machines (SVM)
• Principal Component Analysis (PCA)
• Neural Networks
• Logistic Regression
• Linear Regression (with regularization)
1.1 Normalization (Min–Max Scaling)
Normalization rescales the feature values into a fixed range, typically [0, 1].
Formula
$$X_{\text{scaled}} = \frac{X - X_{\min}}{X_{\max} - X_{\min}}$$
Characteristics
• Preserves the relative relationships between data points
• Compresses all values into a bounded range
• Sensitive to outliers
Advantages
• Useful for neural networks
• Improves model convergence
• Ensures consistent feature magnitude
Limitations
• Outliers can distort the scale
• Not ideal if future data exceeds the original min/max
Use Cases
• Image processing (pixel values)
• Deep learning models
• When features do not follow a Gaussian distribution
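The Min–Max formula above can be sketched in a few lines of NumPy. This is only an illustration: the sample values and the helper name min_max_scale are assumptions made for this example, and scikit-learn's MinMaxScaler provides the same transformation in practice.

```python
import numpy as np

def min_max_scale(X):
    """Rescale each column of X to the range [0, 1]."""
    X = np.asarray(X, dtype=float)
    x_min = X.min(axis=0)
    x_max = X.max(axis=0)
    span = x_max - x_min
    # Constant columns would divide by zero; map them to 0 instead.
    span[span == 0] = 1.0
    return (X - x_min) / span

# Hypothetical data: columns are age (years) and income
X = np.array([[20, 30000.0],
              [30, 50000.0],
              [40, 90000.0]])
X_scaled = min_max_scale(X)
print(X_scaled)
```

Both columns now lie in [0, 1], so income no longer dominates age purely because of its larger units.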
1.2 Standardization (Z-score Normalization)
Standardization transforms data so that it has:
• Mean = 0
• Standard Deviation = 1
Formula
$$X_{\text{scaled}} = \frac{X - \mu}{\sigma}$$
Where:
• μ = mean
• σ = standard deviation
Characteristics
• Centers data around zero
• Does not bound values to a fixed range
• Less affected by outliers compared to Min–Max scaling
Advantages
• Works well when data follows a normal (Gaussian) distribution
• Preferred for SVM, PCA, Linear Regression
• Maintains statistical properties of data
Limitations
• Still influenced by extreme outliers
• Harder to interpret scaled values
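As a rough sketch of the Z-score formula, the snippet below standardizes each column of a small made-up matrix. The data and the helper name standardize are assumptions for illustration; it uses the population standard deviation, which is also what scikit-learn's StandardScaler uses.

```python
import numpy as np

def standardize(X):
    """Center each column to mean 0 and scale it to standard deviation 1."""
    X = np.asarray(X, dtype=float)
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)      # population standard deviation (ddof=0)
    sigma[sigma == 0] = 1.0    # leave constant columns at 0 after centering
    return (X - mu) / sigma

# Hypothetical data: columns are age (years) and income
X = np.array([[20, 30000.0],
              [30, 50000.0],
              [40, 90000.0]])
X_std = standardize(X)
print("Column means:", X_std.mean(axis=0))
print("Column standard deviations:", X_std.std(axis=0))
```

Unlike Min–Max scaling, the results are not confined to a fixed range; they are simply expressed in units of standard deviations from the mean.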
Other Feature Scaling Techniques
• Robust Scaling (uses median and IQR, good for outliers)
• Log Transformation (reduces skewness)
• Power Transformation (Box-Cox, Yeo-Johnson)
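To make the first two techniques in this list concrete, here is a small sketch of robust scaling and a log transformation on a made-up feature containing one outlier. The values are illustrative assumptions; scikit-learn's RobustScaler and PowerTransformer cover these cases more fully.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 100.0])  # 100 is an outlier

# Robust scaling: subtract the median and divide by the interquartile range,
# so the outlier does not influence the center or the scale.
median = np.median(x)
q1, q3 = np.percentile(x, [25, 75])
x_robust = (x - median) / (q3 - q1)

# Log transformation: log(1 + x) compresses large values and reduces skewness.
x_log = np.log1p(x)

print("Robust scaled:", x_robust)
print("Log transformed:", x_log)
```

The typical points stay in a small range around zero after robust scaling, while the outlier remains clearly separated instead of squeezing the other values together.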
2. Encoding Categorical Features
Machine learning algorithms operate on numerical data, so categorical
variables must be transformed into numeric representations without losing
meaning.
Types of Categorical Data
• Nominal: No order (e.g., color, city)
• Ordinal: Has order (e.g., low, medium, high)
2.1 Nominal Encoding (One-Hot Encoding)
One-hot encoding creates binary columns for each unique category.
Example
Color = {Red, Blue, Green}
Color    Red   Blue   Green
Red      1     0      0
Blue     0     1      0
Green    0     0      1
Advantages
• No ordinal assumption
• Prevents misleading numerical relationships
Limitations
• Increases dimensionality
• Can cause curse of dimensionality
• Memory inefficient for high-cardinality features
Use Cases
• Gender, city, product type
• Tree-based models (Random Forest, XGBoost)
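One simple way to produce such columns is pandas' get_dummies, sketched below on a made-up Color column. The DataFrame contents are an assumption for illustration; scikit-learn's OneHotEncoder is the usual choice inside a modelling pipeline.

```python
import pandas as pd

# Hypothetical nominal feature
df = pd.DataFrame({"Color": ["Red", "Blue", "Green", "Red"]})

# One binary column per unique category; dtype=int gives 0/1 instead of booleans
one_hot = pd.get_dummies(df, columns=["Color"], dtype=int)
print(one_hot)
```

Each row has exactly one 1, and the number of new columns equals the number of unique categories, which is why dimensionality grows quickly for high-cardinality features.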
2.2 Ordinal Encoding
Assigns integers based on category order.
Example
Low → 1
Medium → 2
High → 3
Advantages
• Preserves ranking information
• Simple and compact
Limitations
• Assumes equal distance between categories
• Can mislead linear models if order spacing is uneven
Use Cases
• Education level
• Ratings or severity scales
2.3 Target Encoding (Mean Encoding)
Each category is replaced with the mean of the target variable for that
category.
Advantages
• Handles high-cardinality features efficiently
• Often improves performance
Limitations
• High risk of data leakage
• Overfitting if not regularized
• Requires careful cross-validation
Best Practices
• Apply only on training data
• Use smoothing or noise
• Combine with cross-validation
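The first two practices can be sketched with plain pandas. This is a minimal illustration under stated assumptions: the helper fit_target_encoding, the smoothing weight m, and the toy data are all hypothetical, and the additive smoothing shown here is just one common scheme. Out-of-fold encoding with cross-validation would be a further step and is not shown.

```python
import pandas as pd

def fit_target_encoding(train, col, target, m=2.0):
    """Learn a smoothed target mean per category from training data only."""
    global_mean = train[target].mean()
    stats = train.groupby(col)[target].agg(["mean", "count"])
    # Smoothing: categories with few rows are pulled toward the global mean.
    smoothed = (stats["count"] * stats["mean"] + m * global_mean) / (stats["count"] + m)
    return smoothed.to_dict(), global_mean

train = pd.DataFrame({"City": ["London", "Paris", "London", "Berlin"],
                      "Target": [1, 0, 1, 0]})
test = pd.DataFrame({"City": ["Paris", "Madrid"]})

# Fit on the training split, then apply the same mapping to both splits.
mapping, prior = fit_target_encoding(train, "City", "Target", m=2.0)
train["City_enc"] = train["City"].map(mapping)
# Categories never seen in training fall back to the global mean.
test["City_enc"] = test["City"].map(mapping).fillna(prior)

print(train)
print(test)
```

Because the mapping is learned from the training rows alone, the test targets never influence the encoding, which is the point of the "apply only on training data" rule.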
Other Encoding Techniques
• Frequency Encoding
• Binary Encoding
• Hash Encoding
• Label Encoding (use cautiously)
3. Binarization
Binarization converts continuous or discrete numerical features into binary
values (0 or 1) based on a threshold.
Rule
• Value > threshold → 1
• Value ≤ threshold → 0
Why Use Binarization
• Simplifies feature representation
• Focuses on presence/absence
• Reduces noise
• Useful for rule-based or Boolean models
Applications
• Spam detection
• Medical diagnosis (disease present or not)
• Text data (word present or not)
Limitations
• Loses magnitude information
• Threshold selection is critical
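The threshold rule above can be sketched as follows. The helper name binarize and the word-count data are assumptions for illustration; scikit-learn's Binarizer applies the same "greater than threshold" rule.

```python
import numpy as np

def binarize(values, threshold):
    """Return 1 where value > threshold, otherwise 0."""
    values = np.asarray(values, dtype=float)
    return (values > threshold).astype(int)

# Hypothetical word counts in five documents: convert to present/absent
word_counts = np.array([0, 3, 1, 0, 7])
present = binarize(word_counts, threshold=0)
print(present)  # [0 1 1 0 1]
```

Note that counts of 3 and 7 both become 1, which illustrates the loss of magnitude information listed under Limitations.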
4. Data Labeling
Data labeling is the process of annotating raw data with meaningful labels so
that supervised learning models can learn from it.
Importance of Data Labeling
• Determines model accuracy
• Directly impacts learning quality
• Essential for supervised and semi-supervised learning
Types of Data Labeling
• Image labeling: bounding boxes, segmentation masks
• Text labeling: sentiment, topic, intent
• Audio labeling: speech transcription, emotion tagging
• Video labeling: object tracking, action recognition
Challenges
• Time-consuming
• Expensive
• Human bias and inconsistency
• Scalability issues
Solutions
• Automated labeling
• Active learning
• Semi-supervised learning
• Crowdsourcing
5. Feature Selection
Feature selection involves choosing a subset of the most relevant features
from the original dataset to improve model efficiency and performance.
Why Feature Selection Matters
• Reduces overfitting
• Improves generalization
• Decreases training time
• Enhances interpretability
• Removes irrelevant and redundant features
5.1 Filter Methods
Use statistical techniques independent of ML algorithms.
Common Techniques
• Correlation coefficient
• Chi-square test
• ANOVA
• Mutual Information
Advantages
• Fast and scalable
• Model-independent
Limitations
• Ignores feature interactions
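As an illustrative sketch of a filter method, the snippet below scores each Iris feature with the ANOVA F-test and keeps the two highest-scoring ones. The choice of dataset and of k=2 is an assumption made for this example.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_iris(return_X_y=True)

# Score every feature independently against the class label (ANOVA F-test)
selector = SelectKBest(score_func=f_classif, k=2)
X_selected = selector.fit_transform(X, y)

print("ANOVA F-scores:", selector.scores_)
print("Selected feature indices:", selector.get_support(indices=True))
```

No model is trained here: each feature is scored on its own, which is what makes filter methods fast and also why they miss interactions between features.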
5.2 Wrapper Methods
Evaluate feature subsets using a learning algorithm.
Examples
• Forward selection
• Backward elimination
• Recursive Feature Elimination (RFE)
Advantages
• High accuracy
• Considers feature interactions
Limitations
• Computationally expensive
• Prone to overfitting
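A wrapper method can be sketched with Recursive Feature Elimination. In this example the estimator (logistic regression), the dataset (Iris), and the target of two features are all assumptions chosen for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)

# RFE repeatedly fits the model and drops the weakest feature
# until only n_features_to_select remain.
estimator = LogisticRegression(max_iter=1000)
rfe = RFE(estimator=estimator, n_features_to_select=2)
rfe.fit(X_scaled, y)

print("Selected mask:", rfe.support_)
print("Feature ranking (1 = selected):", rfe.ranking_)
```

The model is refitted at every elimination step, which is the source of the computational cost noted above.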
5.3 Embedded Methods
Feature selection is built into model training.
Examples
• L1 regularization (Lasso)
• Decision trees
• Random Forest feature importance
Advantages
• Efficient
• Balanced bias-variance tradeoff
Limitations
• Model-dependent
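An embedded method can be illustrated with Lasso, where the L1 penalty drives the coefficients of unhelpful features to exactly zero during training. The synthetic data below (only the first two of five features affect the target) and the value alpha=0.1 are assumptions made for this sketch.

```python
import numpy as np
from sklearn.linear_model import Lasso

# Synthetic data: y depends only on features 0 and 1
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

# Training and feature selection happen in the same step
lasso = Lasso(alpha=0.1)
lasso.fit(X, y)

selected = np.flatnonzero(lasso.coef_)  # indices of non-zero coefficients
print("Coefficients:", lasso.coef_)
print("Selected feature indices:", selected)
```

A larger alpha would remove more features, so the strength of the penalty acts as the selection knob.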
Unit 5
Dimensionality Reduction Overview
Dimensionality reduction is the process of reducing the number of input
variables (features) in a dataset while preserving as much important
information as possible.
It is used to:
• Reduce computational cost
• Remove redundant/noisy features
• Improve visualization
• Reduce overfitting
• Improve model performance
Two widely used dimensionality reduction techniques are:
1. Principal Component Analysis (PCA) – linear method
2. t-Distributed Stochastic Neighbor Embedding (t-SNE) – non-linear
method
1. Principal Component Analysis (PCA)
Definition
Principal Component Analysis (PCA) is a linear dimensionality reduction
technique that transforms the original features into a new set of uncorrelated
variables called principal components, ordered by the amount of variance they
capture from the data.
Key Idea of PCA
• Identify directions (axes) of maximum variance
• Project data onto these new axes
• Keep only the top k components
• Discard less informative dimensions
Characteristics of PCA
• Linear transformation
• Unsupervised learning technique
• Components are orthogonal (uncorrelated)
• Sensitive to feature scaling
• Preserves global structure of data
Mathematical Intuition
1. Center the data (mean = 0)
2. Compute covariance matrix
3. Compute eigenvalues and eigenvectors
4. Sort eigenvectors by descending eigenvalues
5. Select top-k eigenvectors
6. Project data onto new feature space
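The six steps above can be sketched directly in NumPy. The helper name pca_eig and the synthetic data are assumptions for illustration; library implementations such as scikit-learn's PCA typically use an SVD instead of an explicit covariance matrix, but the result is equivalent up to the sign of each component.

```python
import numpy as np

def pca_eig(X, k):
    """PCA via eigen-decomposition of the covariance matrix."""
    X = np.asarray(X, dtype=float)
    X_centered = X - X.mean(axis=0)                 # 1. center the data
    cov = np.cov(X_centered, rowvar=False)          # 2. covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)          # 3. eigenvalues and eigenvectors
    order = np.argsort(eigvals)[::-1]               # 4. sort by descending eigenvalue
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    components = eigvecs[:, :k]                     # 5. select the top-k eigenvectors
    X_projected = X_centered @ components           # 6. project onto the new axes
    explained_ratio = eigvals[:k] / eigvals.sum()
    return X_projected, explained_ratio

# Synthetic data: the second column is almost a multiple of the first
rng = np.random.default_rng(0)
t = rng.normal(size=100)
X = np.column_stack([t,
                     2 * t + rng.normal(scale=0.1, size=100),
                     rng.normal(scale=0.1, size=100)])

X_proj, ratio = pca_eig(X, k=2)
print("Projected shape:", X_proj.shape)
print("Explained variance ratio:", ratio)
```

Because the first two columns are strongly correlated, almost all of the variance ends up in the first principal component.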
Advantages of PCA
• Reduces dimensionality efficiently
• Removes multicollinearity
• Improves training speed
• Useful for noise reduction
• Works well with linear models
Limitations of PCA
• Assumes linear relationships
• Not ideal for complex non-linear data
• Reduced interpretability
• Sensitive to outliers
Applications of PCA
• Image compression
• Face recognition
• Gene expression analysis
• Feature extraction
• Visualization of high-dimensional data
PCA Implementation in Python (scikit-learn)
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import load_iris
# Load dataset
data = load_iris()
X = data.data
# Standardize the data
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Apply PCA
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)
# Explained variance
print("Explained Variance Ratio:", pca.explained_variance_ratio_)
Explanation
• StandardScaler() → PCA is sensitive to feature scale, so features are
standardized first
• n_components=2 → reduce to 2 dimensions
• explained_variance_ratio_ → fraction of the total variance captured by each
retained component
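A common follow-up is to pick the number of components from the cumulative explained variance rather than fixing it in advance. The sketch below does this for the standardized Iris data; the 95% threshold is an assumed, conventional choice and not a rule from these notes.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_scaled = StandardScaler().fit_transform(load_iris().data)

# Fit PCA with all components, then accumulate the explained variance
pca_full = PCA().fit(X_scaled)
cumulative = np.cumsum(pca_full.explained_variance_ratio_)

# Smallest number of components whose cumulative variance reaches 95%
k = int(np.argmax(cumulative >= 0.95) + 1)
print("Cumulative explained variance:", cumulative)
print("Components needed for 95%:", k)
```

scikit-learn also accepts a fraction directly, e.g. PCA(n_components=0.95), which performs the same selection internally.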
2. t-Distributed Stochastic Neighbor Embedding (t-SNE)
Definition
t-SNE is a non-linear dimensionality reduction technique mainly used for data
visualization. It preserves local relationships by mapping similar points close
together in lower-dimensional space.
Key Idea of t-SNE
• Converts distances between points into probability distributions
• Minimizes difference between high-dimensional and low-dimensional
distributions
• Focuses on local neighborhood structure
Characteristics of t-SNE
• Non-linear method
• Unsupervised
• Computationally expensive
• Best for visualization (2D or 3D)
• Not suitable for feature extraction for models
Important Parameters
• n_components: Output dimensions (usually 2 or 3)
• perplexity: Controls balance between local and global structure
• learning_rate: Affects convergence
• n_iter: Number of optimization iterations (renamed max_iter in scikit-learn
1.5 and later)
Advantages of t-SNE
• Excellent visualization quality
• Reveals clusters clearly
• Handles complex non-linear patterns
Limitations of t-SNE
• Very slow on large datasets
• Results are non-deterministic
• No inverse transform
• Poor preservation of global structure
• Not ideal for downstream ML models
Applications of t-SNE
• Visualizing embeddings
• Cluster inspection
• Exploratory data analysis
• Deep learning feature visualization
t-SNE Implementation in Python (scikit-learn)
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import load_iris
# Load dataset
data = load_iris()
X = data.data
# Scale data
X_scaled = StandardScaler().fit_transform(X)
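# Apply t-SNE
# Note: max_iter was called n_iter before scikit-learn 1.5
tsne = TSNE(n_components=2, perplexity=30, learning_rate=200, max_iter=1000)
X_tsne = tsne.fit_transform(X_scaled)
Explanation
• perplexity=30 → the scikit-learn default
• max_iter=1000 (n_iter in older versions) → usually sufficient for
convergence on a small dataset such as Iris
• Output is used only for visualization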
PCA vs t-SNE (Comparison)
Feature             PCA                 t-SNE
Type                Linear              Non-linear
Speed               Fast                Slow
Use case            Feature reduction   Visualization
Preserves           Global variance     Local structure
Scalability         High                Low
Inverse transform   Possible            Not possible
When to Use What
• Use PCA when:
o You want to reduce dimensions for ML models
o Speed and interpretability matter
o Data is approximately linear
• Use t-SNE when:
o You want to visualize clusters
o Data is complex and non-linear
o Interpretation is visual, not predictive