MODULE 2
Data Preprocessing in Data Mining
Real-world data is often incomplete, noisy, and inconsistent, which can lead to incorrect results
if used directly. Data preprocessing in data mining is the process of cleaning and preparing raw
data so it can be used effectively for analysis and model building.
Real data contains missing and incorrect values
Data may come from multiple sources
Large datasets often have irrelevant information
Clean data gives better mining results
Steps in Data Preprocessing
Some key steps in data preprocessing are:
1. Data Cleaning
It is the process of identifying and correcting errors or inconsistencies in the dataset. Its common
tasks include:
Handling missing values
Removing duplicate records
Correcting wrong or inconsistent data
Handling Outliers
Techniques used:
Mean Imputation: Replaces missing values with the average of the attribute.
Median Imputation: Replaces missing values with the middle value, useful when outliers
exist.
Mode Imputation: Replaces missing values with the most frequent value.
Deletion Method: Removes records that contain missing values.
Interquartile Range (IQR): Detects outliers using the range between Q1 and Q3. Values below
Q1 - 1.5*IQR or above Q3 + 1.5*IQR are commonly flagged as outliers.
Z-Score Method: Identifies outliers based on standard deviation from the mean.
Binning: Smooths noisy data by grouping values into bins.
Regression Smoothing: Uses regression to predict and smooth noisy values.
Duplicate Detection: Identifies and removes repeated records.
Example:
Replacing missing age values with the average age
Removing repeated rows in a dataset
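The cleaning techniques listed above can be sketched in a few lines of Python. This is only an
illustrative sketch, not a prescribed implementation: the function names, the sample ages and
rows, and the use of None to mark a missing value are all assumptions made for the example.

```python
from statistics import mean, quantiles

def impute_mean(values):
    """Replace missing entries (None) with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]

def drop_duplicates(rows):
    """Remove repeated rows, keeping the first occurrence of each.

    Rows must be hashable (for example tuples) so they can be stored in a set.
    """
    seen = set()
    unique = []
    for row in rows:
        if row not in seen:
            seen.add(row)
            unique.append(row)
    return unique

def iqr_outliers(values, k=1.5):
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _q2, q3 = quantiles(values, n=4)   # quartile cut points
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    return [v for v in values if v < low or v > high]

# Hypothetical data: ages with gaps, and rows containing one repeat.
filled_ages = impute_mean([20, None, 40])
unique_rows = drop_duplicates([("ann", 22), ("bob", 31), ("ann", 22)])
```

Median or mode imputation follows the same pattern with a different fill value. Note that
quartile estimates differ slightly between conventions, so other tools may flag borderline
values differently.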
2. Data Integration
It involves merging data from various sources into a single, unified dataset. It can be challenging
due to differences in data formats, structures, and meanings.
Used when data comes from databases, files, or APIs
Removes redundancy between datasets
Resolves conflicts in data values
Techniques used:
Schema Matching: Aligns attributes from different data sources.
Entity Resolution: Identifies records that refer to the same real-world entity.
Correlation Analysis: Finds and removes redundant attributes.
Data Conflict Resolution: Resolves inconsistencies in units or data values.
Duplicate Elimination: Removes overlapping records after integration.
Example: Merging customer data from sales and marketing databases
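As a rough sketch of the merging example, the Python code below combines records from two
sources into one record per customer. The field names (email, total_spend, campaign), the
choice of a normalized email address as the matching key, and the "keep the first non-empty
value" conflict rule are all assumptions chosen for illustration; real entity resolution is
usually more involved.

```python
def normalize_key(email):
    """Very simple entity resolution: treat emails as equal ignoring case and spaces."""
    return email.strip().lower()

def integrate(sales, marketing):
    """Merge records from two sources into one record per customer."""
    merged = {}
    for source in (sales, marketing):
        for record in source:
            key = normalize_key(record["email"])
            target = merged.setdefault(key, {"email": key})
            for field, value in record.items():
                if field == "email":
                    continue
                # Conflict rule (an assumption): keep the first non-empty value seen.
                if target.get(field) in (None, ""):
                    target[field] = value
    return list(merged.values())

# Hypothetical records from a sales database and a marketing database.
sales_rows = [{"email": "Ann@Example.com ", "total_spend": 120.0}]
marketing_rows = [
    {"email": "ann@example.com", "campaign": "spring"},
    {"email": "bob@example.com", "campaign": "summer"},
]
customers = integrate(sales_rows, marketing_rows)
```

Because both sources are keyed the same way after normalization, the two records for the same
customer collapse into one, which also removes the overlap between the datasets.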
3. Data Transformation
Data transformation converts data into a suitable form so that data mining algorithms can work
effectively.
Bring data into a common format
Improve mining efficiency
Make data suitable for modeling
Techniques used:
Min-Max Normalization: Scales data into a fixed range, usually 0 to 1.
Z-Score Normalization: Transforms data using mean and standard deviation.
Decimal Scaling: Normalizes data by moving the decimal point.
Log Transformation: Reduces data skewness using logarithmic scaling.
One-Hot Encoding: Converts categories into binary columns.
Label Encoding: Assigns numeric labels to categorical values.
Aggregation: Combines detailed data into summarized form.
Example:
Converting salary values into a fixed range (0–1)
Changing text labels like Male/Female into numeric values
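The two examples above, together with z-score normalization and one-hot encoding, might look
like this in Python. It is a minimal sketch: the function names are made up for illustration,
categories are numbered in alphabetical order (an arbitrary choice), and the population
standard deviation is assumed for the z-score.

```python
from statistics import mean, pstdev

def min_max(values, new_min=0.0, new_max=1.0):
    """Rescale values linearly so the smallest maps to new_min and the largest to new_max."""
    lo, hi = min(values), max(values)
    if hi == lo:                       # constant column: nothing to scale
        return [new_min for _ in values]
    return [new_min + (v - lo) * (new_max - new_min) / (hi - lo) for v in values]

def z_score(values):
    """Centre values on the mean and express them in standard deviations."""
    mu = mean(values)
    sigma = pstdev(values)             # population standard deviation (an assumption)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]

def label_encode(categories):
    """Map each category to an integer code; codes follow alphabetical order."""
    codes = {cat: i for i, cat in enumerate(sorted(set(categories)))}
    return [codes[c] for c in categories], codes

def one_hot(categories):
    """Turn each category into a row of 0/1 flags, one column per distinct category."""
    levels = sorted(set(categories))
    rows = [[1 if c == level else 0 for level in levels] for c in categories]
    return rows, levels

# Hypothetical salary and gender columns.
scaled_salaries = min_max([30000, 45000, 60000])
gender_codes, mapping = label_encode(["Male", "Female", "Male"])
```

Label encoding implies an ordering between categories that may not exist, which is why one-hot
encoding is often preferred for nominal attributes.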
4. Data Reduction
It reduces the dataset's size while maintaining key information. This can be done through feature
selection which chooses the most relevant features and feature extraction which transforms the
data into a lower-dimensional space while preserving important details.
Improves processing speed
Saves storage space
Makes analysis easier
Techniques used:
Principal Component Analysis (PCA): Reduces dimensions by projecting data onto
principal components.
Linear Discriminant Analysis (LDA): Reduces dimensions while maximizing class
separation.
Filter Methods: Select features based on statistical measures.
Wrapper Methods: Select features using model performance.
Embedded Methods: Perform feature selection during model training.
Simple Random Sampling: Selects data points randomly from the dataset.
Stratified Sampling: Samples data proportionally from each class.
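To make the difference between the two sampling techniques concrete, here is a small Python
sketch of stratified sampling. The function name, the label_of callback, and the fixed seed
are assumptions for illustration; taking at least one row per class is also a design choice of
this sketch, not a rule.

```python
import random
from collections import defaultdict

def stratified_sample(rows, label_of, fraction, seed=0):
    """Sample roughly the same fraction of rows from every class."""
    rng = random.Random(seed)              # fixed seed so the sketch is repeatable
    groups = defaultdict(list)
    for row in rows:
        groups[label_of(row)].append(row)  # split the rows into strata by class label
    sample = []
    for members in groups.values():
        k = max(1, round(len(members) * fraction))
        sample.extend(rng.sample(members, k))  # simple random sampling within a stratum
    return sample

# Hypothetical dataset: 80 rows of class "a" and 20 rows of class "b".
rows = [("a", i) for i in range(80)] + [("b", i) for i in range(20)]
subset = stratified_sample(rows, label_of=lambda row: row[0], fraction=0.1)
```

With a 10% fraction the subset keeps the original 80/20 class balance (8 rows of "a" and 2 of
"b"), whereas simple random sampling over the whole dataset could by chance miss the smaller
class.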
Benefits of Data Preprocessing
Improves data quality
Increases accuracy of mining results
Reduces errors in models
Makes data easier to understand
Advantages
Improved Data Quality: Ensures data is clean, consistent, and reliable for analysis.
Better Model Performance: Reduces noise and irrelevant data, leading to more accurate
predictions and insights.
Efficient Data Analysis: Streamlines data for faster and easier processing.
Enhanced Decision-Making: Provides clear and well-organized data for better business
decisions.
Data Cleaning in Data Mining
Data cleaning is a key stage of the data mining process: it ensures that the data being mined
is free of errors and contains all the necessary information. Its tasks include handling
errors, deleting invalid or duplicate records, and managing missing or incomplete records.
Thorough data cleaning is necessary before data mining, because conclusions drawn from dirty
data can be misleading or simply wrong. This makes it an important exercise for anyone
handling large datasets, as it lays the groundwork for accurate and usable outcomes.
What is Data Cleaning in Data Mining?
Data cleaning in data mining is the process of identifying and then correcting or removing
errors and inconsistencies in data so that analysis is accurate. Raw data is usually full of
inaccuracies, outliers, missing entries, duplicates, and noise that degrade the results if
not handled correctly. In data cleaning, the raw data is pre-processed into a format suitable
for mining activities such as pattern recognition and predictive modelling. The aim is to
raise the overall quality of the data so that the data mining process yields meaningful
conclusions.
Characteristics of Data Cleaning:
Accuracy: The recorded values must be correct, because inaccurate data leads to wrong
conclusions.
Coherence: Related data elements and datasets should be logically consistent with one another.
Validity: Values must conform to the rules of the data schema and to business rules, including
expected ranges and formats.
Uniformity: Data formats and measurement units should be consistent throughout the dataset. For
instance, all dates should use one format, and all values of a numerical attribute should use
the same unit.
Data Verification: Data is compared against trusted sources or known formulas to confirm its
accuracy. Verification helps uncover errors that would otherwise go unnoticed during data
preparation.
Clean Data Backflow: Once data has been cleaned, the cleaned values should be written back to
the source system to replace the erroneous data. This keeps data quality consistent over time
and prevents the same mistakes from recurring in later analyses.
Steps for Cleaning Data
1. Remove duplicates & irrelevant data
Delete repeated entries and data that is not useful for your analysis.
2. Fix structural errors
Correct spelling mistakes, inconsistent names, and formatting issues
(e.g., “N/A” and “Not Applicable” should be the same).
3. Remove or check outliers
Identify unusual values.
Remove them only if they are errors; keep them if they are meaningful.
4. Handle missing data
Fill missing values carefully or remove them, since many tools can’t handle missing data.
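Step 2 (fixing structural errors) can be illustrated with a short Python sketch that maps the
different spellings of "missing" to a single marker and tidies stray whitespace. The set of
markers and the function names are assumptions for this example; a real dataset would need its
own list.

```python
# Spellings treated as "missing" in this sketch (an assumed, non-exhaustive list).
MISSING_MARKERS = {"", "n/a", "na", "not applicable", "none", "null"}

def standardize(value):
    """Return None for any missing-value marker, otherwise the tidied text."""
    if value is None:
        return None
    text = " ".join(str(value).split())    # trim and collapse repeated whitespace
    if text.lower() in MISSING_MARKERS:
        return None
    return text

def clean_column(values):
    """Apply the same standardization to every entry of a column."""
    return [standardize(v) for v in values]

# Hypothetical city column with inconsistent entries.
cities = clean_column(["N/A", "Not Applicable", " New  York ", None])
```

After this step "N/A" and "Not Applicable" are represented identically, so the later
missing-data step can treat them the same way.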
What is noisy data?
Noisy data is a data set that contains extra meaningless data. Almost all data sets will contain a
certain amount of unwanted noise. Noisy data can be filtered and processed into a higher quality
data set.
Noisy data unnecessarily increases the amount of storage space required and can adversely affect
the results of any data mining analysis. Statistical analysis can use information gleaned from
historical data to weed out noisy data and facilitate data mining.
Noisy data can be caused by hardware failures, programming errors, and gibberish input from
speech or optical character recognition programs. Spelling errors, industry abbreviations and
slang can also impede machine reading. Natural fluctuations in sensors and measurement can add
extra noise to readings. Gathering too broad a data set can also make it hard to analyze.
Types of noisy data
Random noise is extra information that has no correlation to the underlying data that is somehow
introduced into the measurements or data set. It may also be called white noise.
Misclassified data is information that is incorrectly labeled or sorted in a data set. This can be
caused by human error or as a fault during data importing.
Uncontrolled variables are extra factors that affect the data but are not accounted for. They can
make the data look random when it is not or introduce patterns that aren't there.
Superfluous data is extra information that is completely unrelated to the information being
examined. There may be so much extra information that what you are looking for is completely
hidden.
How to clean noisy data
There are many methods to remove noise and produce the cleanest possible data. The exact
methods and implementations will depend on the data being worked on and the end goals.
Filtering is removing unwanted data. This can be as simple as removing certain categories or
types of data from the analysis. Analysts may also filter out outliers, such as unusually high or
low readings or ones very far from the mean data set.
Data binning is where the data is sorted into groups or categories to remove some of the random
variance between entries.
Linear regression fits a line that describes the relationship between one variable and one or
more other variables. Replacing noisy values with the fitted values smooths the data, and the
quality of the fit shows how closely the inputs are related to the output.
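Data binning as described above is often carried out as "smoothing by bin means": the values
are sorted, split into equal-sized bins, and each value is replaced by the mean of its bin.
The Python sketch below assumes equal-frequency bins and uses made-up price values; the
function name is hypothetical.

```python
from statistics import mean

def smooth_by_bin_means(values, bin_size):
    """Sort the values, group them into bins of bin_size, and replace each by its bin mean.

    The result is returned in sorted order, not in the original order of the input.
    """
    ordered = sorted(values)
    smoothed = []
    for start in range(0, len(ordered), bin_size):
        bin_values = ordered[start:start + bin_size]
        smoothed.extend([mean(bin_values)] * len(bin_values))
    return smoothed

# Hypothetical prices: the bins are [4, 8, 15], [21, 21, 24] and [25, 28, 34].
prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
smoothed_prices = smooth_by_bin_means(prices, bin_size=3)
# smoothed_prices is [9, 9, 9, 22, 22, 22, 29, 29, 29]
```

Larger bins smooth more aggressively but also hide more of the genuine variation in the data.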
Data Transformation
Data Transformation involves converting raw data from multiple heterogeneous sources into a
clean, standardized and analysis-ready format before loading it into the data warehouse.
Transformation ensures that the data stored in the warehouse is consistent, accurate and aligned
with business rules, ultimately improving the quality of reporting and decision-making.
The transformations can be divided into two categories:
Simple Data Transformations involve basic tasks like cleansing, standardization, aggregation and
filtering used to prepare data for analysis or reporting through straightforward manipulation
techniques.
Complex Data Transformations involve advanced tasks like integration, migration, replication
and enrichment. They require techniques such as data modeling, mapping and validation and are
used to prepare data for machine learning, advanced analytics or data warehousing.
Applications of Data Transformation
Applications for data transformation are found in a number of industries:
Business Intelligence (BI): Transforming data for use in real-time reporting and
decision-making with BI tools.
Healthcare: Ensuring interoperability across various healthcare systems by standardization of
medical records.
Financial Services: Compiling and de-identifying financial information for reporting and
compliance needs.
Retail: Improving customer experience through data transformation into an analytics-ready
format and customer behavior analysis.
Customer Relationship Management (CRM): By converting customer data, firms may obtain
insights into consumer behavior, tailor marketing strategies and increase customer satisfaction.
Challenges in Data Transformation
While transformation improves data quality, several challenges must be addressed:
Handling Large Volumes: High data volumes may slow down transformation or require
scaling.
Maintaining Data Accuracy: Incorrect transformation logic can introduce new errors.
Performance Optimization: Complex transformations may increase ETL execution time.
Managing Changing Business Rules: As business needs evolve, transformation logic must be
updated regularly.
Data Integration:
Data integration is a type of data transformation. As part of the data transformation process, a
data pipeline may bring data in from multiple sources. It then combines this data to provide a
single, unified view of a data set across the enterprise.
For example, data about customers in a retail business may be split across multiple systems:
e-commerce sales, marketing emails, website analytics, advertising campaigns, web search, etc.
Bringing this data together and presenting it as a unified customer record can help businesses
answer questions such as where they acquire their highest-spend customers, which types of users
are most likely to refer their friends, and more.
Types of data integration
There are several ways to perform data integration:
Data warehousing. The traditional approach in which engineers physically move data into a
single location and a minimal number of tables. Data warehouses use data modeling techniques
that differ from those of transactional relational databases and that enable faster querying
for BI use cases.
Virtualized integration. Virtualized integration provides access to numerous data sources from
within a single location. This approach misses out on some of the performance benefits provided
by data warehousing (though it may utilize techniques such as caching to speed up subsequent
access). The upside is that business users can access data where it currently lives without waiting
for engineers to import it into the warehouse.
Data mesh. Bringing data into a data warehouse typically relies on a centralized data
engineering team to create new data pipelines. This can create a bottleneck as the team's queue
fills up with new requests. A data mesh architecture addresses this by modeling data as a set
of interconnected domains. It provides a self-service data platform that makes it easier for
teams to create their own data pipelines and for business users to find and use the data sets
they produce.
Data Reduction in Data Mining
Data reduction is a technique used in data mining to reduce the size of a dataset while still
preserving the most important information. This can be beneficial in situations where the dataset
is too large to be processed efficiently, or where the dataset contains a large amount of irrelevant
or redundant information.
There are several different data reduction techniques that can be used in data mining, including:
1. Data Sampling: This technique involves selecting a subset of the data to work with, rather
than using the entire dataset. This can be useful for reducing the size of a dataset while still
preserving the overall trends and patterns in the data.
2. Dimensionality Reduction: This technique involves reducing the number of features in the
dataset, either by removing features that are not relevant or by combining multiple features
into a single feature.
3. Data Compression: This technique involves using techniques such as lossy or lossless
compression to reduce the size of a dataset.
4. Data Discretization: This technique involves converting continuous data into discrete data by
partitioning the range of possible values into intervals or bins.
5. Feature Selection: This technique involves selecting a subset of features from the dataset that
are most relevant to the task at hand.
Note that data reduction involves a trade-off between the size of the data and accuracy. The
more aggressively the data is reduced, the greater the risk that the model becomes less
accurate and less generalizable.
Methods of data reduction:
These are explained below.
1. Data Cube Aggregation:
This technique aggregates data into a simpler, summarized form. For example, suppose the data
gathered for your analysis covers the years 2012 to 2014 and includes your company's revenue
for every quarter. If you are interested in annual sales rather than quarterly figures, the
data can be summarized so that it holds the total sales per year instead of per quarter.
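The quarterly-to-annual roll-up can be sketched in Python as follows. The revenue figures are
invented purely for illustration, and the (year, quarter) key layout and function name are
assumptions.

```python
from collections import defaultdict

def annual_totals(quarterly):
    """Aggregate {(year, quarter): revenue} into {year: total revenue}."""
    totals = defaultdict(float)
    for (year, _quarter), revenue in quarterly.items():
        totals[year] += revenue            # drop the quarter dimension by summing over it
    return dict(totals)

# Made-up quarterly revenue for 2012 to 2014 (12 values).
quarterly_revenue = {
    (2012, 1): 10.0, (2012, 2): 12.0, (2012, 3): 11.0, (2012, 4): 15.0,
    (2013, 1): 13.0, (2013, 2): 14.0, (2013, 3): 12.0, (2013, 4): 18.0,
    (2014, 1): 16.0, (2014, 2): 15.0, (2014, 3): 17.0, (2014, 4): 20.0,
}
yearly_revenue = annual_totals(quarterly_revenue)
# yearly_revenue is {2012: 48.0, 2013: 57.0, 2014: 68.0}
```

Twelve values are reduced to three, at the cost of no longer being able to answer questions
about individual quarters.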
2. Dimension reduction:
When some attributes contribute little to the analysis, we keep only the attributes that the
analysis requires. This reduces the data size by eliminating outdated or redundant features.
3. Data Compression:
The data compression technique reduces the size of the files using encoding mechanisms such as
Huffman encoding and run-length encoding. Compression can be divided into two types, depending
on whether the original data can be recovered exactly.
Lossless Compression -
Encoding techniques such as run-length encoding give a simple, modest reduction in data size.
Lossless compression allows the exact original data to be restored from the compressed data.
Lossy Compression -
Methods such as the discrete wavelet transform and PCA (principal component analysis) are
examples of this type of compression. For example, the JPEG image format is lossy, yet the
decompressed image still conveys essentially the same content as the original. In lossy
compression, the decompressed data may differ from the original data but remains useful
enough to retrieve information from.
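Run-length encoding, mentioned above as a lossless technique, is simple enough to sketch in
full. The Python code below stores each run of repeated characters as a (character, count)
pair; the function names and the sample string are chosen only for illustration.

```python
from itertools import groupby

def rle_encode(text):
    """Encode a string as a list of (character, run length) pairs."""
    return [(char, len(list(run))) for char, run in groupby(text)]

def rle_decode(pairs):
    """Rebuild the original string from (character, run length) pairs."""
    return "".join(char * count for char, count in pairs)

encoded = rle_encode("AAAABBBCCD")
# encoded is [("A", 4), ("B", 3), ("C", 2), ("D", 1)]
restored = rle_decode(encoded)
# restored is "AAAABBBCCD", identical to the input, which is what makes the method lossless
```

The encoding only saves space when the data contains long runs; on data with few repeats the
list of pairs can be larger than the input.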
4. Numerosity Reduction:
In this technique, the actual data is replaced with a smaller representation. Parametric
methods fit a mathematical model to the data and store only the model parameters instead of
the data itself. Non-parametric methods include clustering, histograms, and sampling.
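As one possible illustration of the parametric case, the Python sketch below fits a straight
line by ordinary least squares and keeps only the slope and intercept. The data points are
made up and lie exactly on a line, which real data rarely does; the function names are
hypothetical.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

def predict(model, x):
    """Approximate a y value from the stored parameters alone."""
    slope, intercept = model
    return slope * x + intercept

# Made-up points on the line y = 2x + 1.
xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]
model = fit_line(xs, ys)          # two numbers now stand in for the whole y column
```

When the data does not follow the model closely, the stored parameters reproduce it only
approximately, so this is a lossy form of reduction.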
5. Discretization & Concept Hierarchy Operation:
Data discretization techniques divide the range of a continuous attribute into intervals. The
many continuous values of the attribute are then replaced by a small number of interval
labels, so that mining results are presented in a concise, easily understandable way.
Top-down discretization -
If one or a few points (so-called breakpoints or split points) are first chosen to divide the
entire attribute range, and the process is then repeated on the resulting intervals, it is
known as top-down discretization, or splitting.
Bottom-up discretization -
If all the continuous values are first treated as potential split points, and some are then
removed by merging neighbouring values into intervals, the process is called bottom-up
discretization, or merging.
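One simple way to pick split points is equal-width binning, sketched below in Python. It is
only one of several possible discretization schemes and is not specifically the top-down or
bottom-up procedure described above; the function name and the sample ages are assumptions.

```python
def equal_width_bins(values, n_bins):
    """Replace each value by the index (0 .. n_bins - 1) of its equal-width interval."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    if width == 0:                          # all values identical: a single bin
        return [0 for _ in values]
    labels = []
    for v in values:
        index = int((v - lo) / width)
        labels.append(min(index, n_bins - 1))   # the maximum value belongs to the last bin
    return labels

# Hypothetical ages; with 3 bins the intervals are [18, 38), [38, 58) and [58, 78].
ages = [18, 25, 33, 41, 58, 78]
age_bins = equal_width_bins(ages, n_bins=3)
# age_bins is [0, 0, 0, 1, 2, 2]
```

The bin indices can then be mapped to readable labels (for example "young", "middle-aged",
"senior") to form one level of a concept hierarchy.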
Advantages and Disadvantages of Data Reduction in Data Mining:
Data reduction in data mining can have a number of advantages and disadvantages.
Advantages:
1. Improved efficiency: Data reduction can help to improve the efficiency of machine learning
algorithms by reducing the size of the dataset. This can make it faster and more practical to
work with large datasets.
2. Improved performance: Data reduction can help to improve the performance of machine
learning algorithms by removing irrelevant or redundant information from the dataset. This
can help to make the model more accurate and robust.
3. Reduced storage costs: Data reduction can help to reduce the storage costs associated with
large datasets by reducing the size of the data.
4. Improved interpretability: Data reduction can help to improve the interpretability of the results
by removing irrelevant or redundant information from the dataset.
Disadvantages:
1. Loss of information: Data reduction can result in a loss of information, if important data is
removed during the reduction process.
2. Impact on accuracy: Data reduction can impact the accuracy of a model, as reducing the size
of the dataset can also remove important information that is needed for accurate predictions.
3. Impact on interpretability: Data reduction can make it harder to interpret the results, as
removing irrelevant or redundant information can also remove context that is needed to
understand the results.
4. Additional computational costs: Data reduction can add additional computational costs to the
data mining process, as it requires additional processing time to reduce the data.
In conclusion, data reduction has both advantages and disadvantages. It can improve the
efficiency and performance of machine learning algorithms by reducing the size of the dataset.
However, it can also result in a loss of information and make the results harder to interpret.
It is important to weigh the pros and cons of data reduction and carefully assess the risks
and benefits before implementing it.