Data Preprocessing Techniques Explained

Data preprocessing is essential for transforming raw data into a clean, consistent, and usable format for data mining, addressing issues like noise, missing values, and inconsistencies. Key steps include data cleaning, integration, reduction, and transformation, each aimed at improving data quality and mining efficiency. Effective preprocessing enhances the accuracy of mining algorithms and ensures reliable insights from data analysis.

Uploaded by

nandusweety1101
Copyright
© All Rights Reserved

UNIT 2

DATA PRE- PROCESSING


Data Preprocessing:
Data is often called the "new oil," but just like crude oil, raw data is rarely useful in
its original state. It is often noisy, incomplete, inconsistent, or simply too large to
be processed directly. To get accurate and meaningful results from data mining, we
must first prepare the data properly. This preparation process is called data
preprocessing.
Data preprocessing plays a critical role in ensuring that the knowledge discovered
from data is reliable and valid. If we try to apply data mining techniques on raw
data without cleaning and transforming it, we may end up with wrong conclusions.
For example, imagine a retail database where some customers have missing
addresses, some transactions have been recorded twice, and some prices are
wrongly entered. Mining such data would give misleading results about customer
behavior or sales patterns. Hence, preprocessing is the foundation for successful
data mining.
1. Importance of Data Preprocessing:
Before we go into the steps, it is important to understand why preprocessing is
required:
• Data quality issues - Real-world data may have errors, duplicate entries,
missing values, and inconsistencies.
• Large volume - Data warehouses and big databases often contain terabytes
or petabytes of records, which are too big for direct mining.
• Heterogeneity - Data may come from different sources such as spreadsheets,
databases, sensors, or the web, each with different formats.
• Improving accuracy - Well-prepared data improves the performance of
mining algorithms, leading to better classification, clustering, or prediction
results.
• Efficiency - Preprocessing reduces data size and complexity, so mining
becomes faster and more scalable.
Thus, preprocessing acts like the cleaning and organizing stage before
analysis, making sure the data is consistent, accurate, and ready for
knowledge discovery.


Steps in Data Preprocessing:


The key steps in data preprocessing are Data Cleaning, Data Integration,
Data Reduction, and Data Transformation.

Major Tasks In Data Pre-Processing:


1. Data Cleaning
2. Data Integration
3. Data Reduction
4. Data Transformation
1. Data Cleaning:
The first major task in preprocessing is data cleaning, which deals with missing
values, noisy values, and inconsistencies.
(a) Missing Values
Data often has blank or unknown fields. For example, a customer's age might not
be recorded, or income data may be absent. Methods to handle missing values
include:
➢ Ignoring the record (only if the dataset is large enough).
➢ Filling with a global constant (like "Unknown").
➢ Replacing with the mean, median, or mode of the attribute.
➢ Using predictive models (regression, decision trees, or k-nearest neighbor) to
estimate the missing value.

Method             Description                          Example
Ignore Record      Remove tuples with missing values.   Drop student record with missing grade.
Fill Constant      Replace with fixed value.            Missing city → 'Unknown'.
Mean/Median/Mode   Statistical replacement.             Income replaced with mean salary.
Predictive Models  Use ML models for estimation.        Predict missing age using regression.
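
The constant and mean/median/mode strategies can be sketched in a few lines of Python. This is a minimal illustration using only the standard library; the function name fill_missing, the use of None to mark a missing value, and the sample incomes and cities are assumptions made for the example, not part of the notes.

```python
from statistics import mean, median, mode


def fill_missing(values, strategy="mean", constant=None):
    """Return a copy of values in which None entries are replaced.

    None is used here as the marker for a missing value (an assumption).
    """
    observed = [v for v in values if v is not None]
    if strategy == "mean":
        fill = mean(observed)
    elif strategy == "median":
        fill = median(observed)
    elif strategy == "mode":
        fill = mode(observed)
    elif strategy == "constant":
        fill = constant
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return [fill if v is None else v for v in values]


# Hypothetical data: two customers have no recorded income, one has no city.
incomes = [30000, None, 50000, 40000, None]
cities = ["Pune", None, "Delhi"]

incomes_filled = fill_missing(incomes, "mean")
cities_filled = fill_missing(cities, "constant", constant="Unknown")
```

Mean replacement is sensitive to outliers, so the median is often the safer choice for skewed attributes such as income.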

(b) Noisy Data


Noise refers to random errors or variations in data. Example: typing mistakes,
wrong measurements, or outliers. Methods to reduce noise include:

[Figure: techniques for handling noisy data: binning, regression, clustering]

• Binning: Sorting data into bins and replacing values by the bin mean, median, or
boundary.
• Regression: Fitting the data to a regression function and using the fitted values
to smooth it.
• Clustering: Grouping similar records and identifying outliers as noise.
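
As a rough illustration of binning, the sketch below sorts the values, splits them into equal-size bins, and replaces each value with the mean of its bin. The function name smooth_by_bin_means, the bin size, and the sample prices are assumptions chosen for the example.

```python
from statistics import mean


def smooth_by_bin_means(values, bin_size):
    """Sort values, cut them into bins of bin_size, replace each by its bin mean."""
    ordered = sorted(values)
    smoothed = []
    for start in range(0, len(ordered), bin_size):
        bin_values = ordered[start:start + bin_size]
        # Every member of the bin is replaced by the same bin mean.
        smoothed.extend([mean(bin_values)] * len(bin_values))
    return smoothed


# Illustrative prices, three values per bin.
prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
smoothed_prices = smooth_by_bin_means(prices, bin_size=3)
```

Smoothing by bin median or bin boundary follows the same pattern with a different replacement value per bin.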

(c) Inconsistencies
Data from different sources may have different formats or spellings. For example,
"Male/Female" vs. "M/F," or different currency notations. Cleaning detects and
corrects such conflicts to maintain consistency.

2. Data Integration
The second task is data integration, which combines data from multiple sources
like databases, flat files, or data cubes.
Key problems in integration include:
• Entity identification - Matching records that refer to the same entity but have
different names (e.g., "cust_ID" vs. "customer_no").
• Schema integration - Combining attributes with different names but the same
meaning.
• Redundancy removal - If the same information is stored in two sources, it
must be detected and merged. Correlation analysis is often used to detect
redundancy.
• Value conflict resolution - If two sources record different values for the same
attribute, a strategy is needed to resolve conflicts.
Integration provides a unified view of data, which is essential for data
mining in large organizations where data comes from many departments and
systems.
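
The sketch below shows one way the schema-integration and value-conflict steps might look in Python. The unified column name customer_id, the rule that the primary source wins a conflict, and the sample records are all assumptions for illustration; a real project would choose its own resolution strategy.

```python
# Hypothetical mapping from source-specific column names to one unified schema.
COLUMN_MAP = {"cust_ID": "customer_id", "customer_no": "customer_id"}


def to_unified(record):
    """Rename source-specific keys to the unified schema."""
    return {COLUMN_MAP.get(key, key): value for key, value in record.items()}


def integrate(primary, secondary):
    """Merge records from two sources into one dict keyed by customer_id.

    Conflict rule (an assumption): a non-missing value from the primary
    source overrides the secondary source.
    """
    merged = {}
    # Secondary records are applied first so that primary values win.
    for record in secondary + primary:
        row = to_unified(record)
        target = merged.setdefault(row["customer_id"], {})
        for key, value in row.items():
            if value is not None:
                target[key] = value
    return merged


sales = [{"cust_ID": 1, "city": "Pune"}, {"cust_ID": 2, "city": None}]
support = [
    {"customer_no": 1, "city": "Poona", "email": "a@example.com"},
    {"customer_no": 2, "city": "Delhi"},
]
unified = integrate(sales, support)
```

Customer 1 keeps the city from the primary source, while customer 2 falls back to the secondary source because the primary value is missing.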

3. Data Reduction

Since real-world data is often huge, data reduction techniques are applied to make
the dataset smaller but still representative of the original. The goal is to reduce the
volume while preserving essential patterns.
Methods of Data Reduction:
1. Data Cube Aggregation - Summarizing data at higher abstraction levels. For
example, instead of storing daily sales, we can keep monthly or yearly totals.
2. Dimensionality Reduction - Reducing the number of attributes using
techniques such as Principal Component Analysis (PCA) or Singular Value
Decomposition (SVD).
3. Attribute Subset Selection - Choosing only the most relevant attributes while
discarding irrelevant or redundant ones. Techniques like decision tree induction
and stepwise regression help in this.
4. Numerosity Reduction - Replacing large datasets with models. Example: using
regression equations or clusters to represent the data.
5. Sampling - Selecting a representative sample of the data for analysis instead of
the entire dataset.
6. Data Compression - Using encoding schemes like wavelet transforms to store
data in a compact format.
Reduction improves efficiency, saves storage, and speeds up mining algorithms
without losing much accuracy.
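
Of the methods above, sampling is the simplest to sketch. The example below draws a simple random sample without replacement using the standard library; the function name, the 5% fraction, and the fixed seed are assumptions made so that the example is reproducible.

```python
import random


def simple_random_sample(records, fraction, seed=0):
    """Draw a simple random sample without replacement.

    A fixed seed is used only to make the illustration reproducible.
    """
    rng = random.Random(seed)
    sample_size = max(1, round(len(records) * fraction))
    return rng.sample(records, sample_size)


# Hypothetical dataset of 1,000 transaction IDs, reduced to a 5% sample.
transactions = list(range(1, 1001))
sample = simple_random_sample(transactions, fraction=0.05)
```

When some groups are rare, a stratified sample (sampling within each group separately) is usually a better fit than a simple random sample.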

4. Data Transformation
Data transformation is the process of changing the structure, format, or values of
data to make it more suitable for analysis, modeling, or downstream processes.
This step can be simple (like renaming columns) or complex (like converting data
types or creating new features).
Methods of Data Transformation
1. Smoothing
• What It Is: Smoothing removes noise from the data to make patterns and
trends easier to identify.
• Why It’s Important: It highlights key features and helps in making
predictions.
Examples:
• Stock Market: Use moving averages to smooth out daily fluctuations and
reveal trends.
• Temperature Data: Smoothing temperature readings over time to reduce
minor variances caused by sensor inaccuracies.
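• Original Data: [72, 74, 73, 75, 74, 76] Smoothed Data (using a centered 3-day
moving average): [n/a, 73, 74, 74, 75, n/a]. The first and last days have no full
3-day window, so they have no smoothed value.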
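
A moving average can be computed as sketched below. This version assumes a centered window, so positions without a full window on both sides are returned as None; the function name is an assumption for the example.

```python
def centered_moving_average(values, window=3):
    """Average each value with its neighbours inside a centered window.

    Positions that lack a full window on either side are returned as None.
    """
    if window % 2 == 0:
        raise ValueError("window must be odd for a centered average")
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        if i < half or i >= len(values) - half:
            smoothed.append(None)
        else:
            chunk = values[i - half:i + half + 1]
            smoothed.append(sum(chunk) / window)
    return smoothed


readings = [72, 74, 73, 75, 74, 76]
smoothed_readings = centered_moving_average(readings, window=3)
```

A trailing window (each value averaged with the ones before it) is the usual alternative when smoothing must not look ahead, as in stock price charts.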

2. Aggregation
• What It Is: Aggregation summarizes or combines data from multiple
sources to generate higher-level information.
• Why It’s Important: It ensures that insights are based on a comprehensive
view of data.

Examples:
• Sales Data: Aggregating daily sales into monthly or yearly totals for performance
evaluation. Daily Sales: [500, 700, 600, 800, 650] Aggregated Total: 3,250
• Website Traffic: Summarizing hourly visits into daily totals for trend analysis.
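
A minimal sketch of aggregation in Python: daily sales keyed by an ISO date string are summed into monthly totals. The dates are hypothetical, and grouping on the first seven characters of the date ("YYYY-MM") is an assumption about the key format.

```python
from collections import defaultdict


def monthly_totals(daily_sales):
    """Sum daily amounts into monthly totals, keyed by 'YYYY-MM'."""
    totals = defaultdict(int)
    for day, amount in daily_sales.items():
        totals[day[:7]] += amount  # 'YYYY-MM-DD' -> 'YYYY-MM'
    return dict(totals)


# Hypothetical dates attached to the daily sales figures from the example.
daily_sales = {
    "2024-01-30": 500,
    "2024-01-31": 700,
    "2024-02-01": 600,
    "2024-02-02": 800,
    "2024-02-03": 650,
}
by_month = monthly_totals(daily_sales)
```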
3. Discretization
What It Is: Continuous data is divided into smaller intervals or bins, reducing
complexity and size.
Why It’s Important: It makes data easier to interpret and analyze.
Examples:
Age Ranges: Instead of using precise ages, group people into categories: Original:
[23, 25, 30, 45, 55, 67] Discretized into the categories “Young”, “Middle-aged”, and “Senior”.
Time Intervals: Group exact exercise times into bins: Original: [10 mins, 20 mins,
35 mins] Discretized: [0–15 mins, 15–30 mins, 30–45 mins]
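
The time-interval example can be reproduced with a short function that maps each value to a fixed-width interval. Treating each interval as half-open, so that a value exactly on a boundary falls into the upper bin, is an assumption the notes do not specify.

```python
def to_interval(value, width):
    """Map a value to the half-open interval [low, low + width) containing it."""
    low = (value // width) * width
    return (low, low + width)


exercise_minutes = [10, 20, 35]
intervals = [to_interval(m, width=15) for m in exercise_minutes]
```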
4. Attribute Construction
What It Is: Create new attributes (features) from the existing data to make analysis
or mining more efficient.
Why It’s Important: New attributes can simplify the data and improve the
quality of insights.
Examples:
E-commerce: Create a "Customer Lifetime Value (CLV)" attribute based on
past purchases.
Weather Data: Generate a new attribute, "Feels Like Temperature," based on
temperature, humidity, and wind speed.
5. Generalization
What It Is: Converts specific, low-level data into broader, high-level
categories using a concept hierarchy.
Why It’s Important: It simplifies data and makes patterns easier to detect.
Examples:
Ages: Convert exact ages into categories: Original: [18, 22, 30, 55]
Generalized into categories such as “Teenager”, “Young Adult”, and “Middle-Aged”.
Addresses: Convert house addresses into broader categories: Original: [“221B
Baker Street, London”] Generalized: [“London, UK”]

Normalization
Normalization adjusts data values to ensure they fall within a specific range or
follow a standard distribution, enabling fair comparisons between variables.
Why It’s Needed:
1. Attributes with larger ranges may dominate analysis if not scaled.
2. It improves the performance of machine learning models by
reducing bias introduced by differing scales.

Normalization Techniques and Formulas


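1. Min-Max Normalization
This method scales data to fit within a specific range $[new\_min, new\_max]$.
Formula: $v' = \frac{v - min_A}{max_A - min_A}\,(new\_max - new\_min) + new\_min$,
where $min_A$ and $max_A$ are the minimum and maximum values of attribute $A$.
2. Z-Score Normalization
This standardizes data using the mean ($\mu$) and standard deviation ($\sigma$) of the attribute.
Formula: $v' = \frac{v - \mu}{\sigma}$
3. Decimal Scaling
This method normalizes data by shifting the decimal point based on the maximum
absolute value of the attribute.
Formula: $v' = \frac{v}{10^{j}}$, where $j$ is the smallest integer such that $\max(|v'|) < 1$.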
Practical Applications
1. Min-Max Normalization:
o Rescaling exam scores (0-100) into a range of 0 to 1 for
model inputs.
2. Z-Score Normalization:
o Standardizing survey responses (e.g., satisfaction ratings) for
comparison.
3. Decimal Scaling:
o Scaling sales revenue (e.g., $1M, $2M, $3M) into
manageable decimals for easier analysis.
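
The three normalization techniques can be sketched as follows, using only the standard library. The function names and sample values are assumptions; the z-score version uses the population standard deviation, which is one of two common choices, and neither it nor min-max handles an attribute whose values are all identical.

```python
from statistics import mean, pstdev


def min_max(values, new_min=0.0, new_max=1.0):
    """Rescale values linearly into [new_min, new_max].

    Assumes the values are not all equal (otherwise the range is zero).
    """
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min for v in values]


def z_score(values):
    """Standardize values to mean 0 using the population standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]


def decimal_scaling(values):
    """Divide by the smallest power of 10 that makes every |value| < 1."""
    j = 0
    while max(abs(v) for v in values) / 10 ** j >= 1:
        j += 1
    return [v / 10 ** j for v in values]


scores = [0, 50, 100]                 # exam scores on a 0-100 scale
scaled_scores = min_max(scores)       # rescaled into [0, 1]
standardized = z_score(scores)
scaled_small = decimal_scaling([-986, 917])
```

Min-max keeps the original shape of the distribution but is sensitive to outliers, since a single extreme value stretches the range for everything else.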
Discretization and Concept Hierarchy
Transforms continuous attributes into discrete ones and replaces low-level data
with high-level concepts.
Discretization:
Convert numerical values into categories or intervals.
Example: Age (23, 45, 67) → Categories (Youth, Middle-aged, Senior).
Concept Hierarchy:
Replace specific values with broader categories.
Example: Replace “43” for age with "Middle-aged".
Techniques:
(a) Top-down Discretization:
Start with the entire range and split it into smaller intervals.
Example: Age range (0–100) → Split into (0–25, 26–50, 51–75, 76–100).

(b) Bottom-up Discretization:


Start with individual values and merge them into intervals.
Example: Merge ages 22, 23, 24 → (20–25).
(c) Binning:
Create bins based on specified ranges.
Example: Income ($20K, $40K, $60K) → Low, Medium, High.
(d) Histogram Analysis:
Partition data into intervals based on frequency or width.
Example: Grades (A, B, C, D) → Frequency bins based on number of students.
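
The difference between partitioning by width and by frequency can be seen in a short sketch. Both function names and the sample values are assumptions for illustration: equal-width binning returns interval edges, while equal-frequency binning returns groups holding roughly the same number of values.

```python
def equal_width_edges(values, bins):
    """Return bins + 1 edges that split [min, max] into equal-width intervals."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins
    return [lo + i * width for i in range(bins + 1)]


def equal_frequency_bins(values, bins):
    """Split the sorted values into bins holding (nearly) the same count."""
    ordered = sorted(values)
    size, extra = divmod(len(ordered), bins)
    result, start = [], 0
    for i in range(bins):
        # The first `extra` bins take one additional value each.
        end = start + size + (1 if i < extra else 0)
        result.append(ordered[start:end])
        start = end
    return result


ages = [0, 23, 45, 67, 100]
age_edges = equal_width_edges(ages, bins=4)
value_bins = equal_frequency_bins([5, 1, 9, 3, 7, 11], bins=3)
```

Equal-width bins are easy to read but can leave some bins nearly empty when the data is skewed; equal-frequency bins avoid that at the cost of uneven interval widths.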
Applications of Data Reduction
Data Analysis: Reduces computational load by summarizing data.
Machine Learning: Focuses on the most relevant features for better predictions.
Data Storage: Saves storage space by compressing files or summarizing data.
Visualization: Simplifies complex data for easier interpretation.

Concept Hierarchies
Definition: Represent data at multiple levels of abstraction by grouping or
summarizing concepts.
Example: Low-level concepts: “TV,” “CD Player,” “VCR.”
High-level concept: “Home Entertainment.”

Use Case: Summarize sales data by high-level categories instead of individual
products.
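
The use case can be sketched as a roll-up over a concept hierarchy stored as a dictionary. The sales figures, the extra "Toaster" product, and the "Other" fallback category are assumptions added for illustration.

```python
from collections import defaultdict

# Concept hierarchy from the example: product -> higher-level category.
CATEGORY_OF = {
    "TV": "Home Entertainment",
    "CD Player": "Home Entertainment",
    "VCR": "Home Entertainment",
}


def roll_up(sales_by_product, hierarchy):
    """Sum product-level sales into the categories given by the hierarchy.

    Products missing from the hierarchy are grouped under 'Other'.
    """
    totals = defaultdict(int)
    for product, amount in sales_by_product.items():
        totals[hierarchy.get(product, "Other")] += amount
    return dict(totals)


# Hypothetical sales figures per product.
product_sales = {"TV": 1200, "CD Player": 300, "VCR": 150, "Toaster": 80}
category_sales = roll_up(product_sales, CATEGORY_OF)
```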

Data Mining Task Primitives:

Each user will have a data mining task in mind, that is, some form of data analysis
that he or she would like to have performed.
• A data mining task can be specified in the form of a data mining query, which is input to the data
mining system.
• A data mining query is defined in terms of data mining task primitives. These primitives allow the
user to interactively communicate with the data mining system during discovery in order to direct
the mining process, or examine the findings from different angles or depths.
Main Data Mining Primitives
Data mining primitives are the basic elements used to define and control the data mining process. They tell the
system what data to mine, what knowledge to find, and how to present the results.
1. *Task-Relevant Data*

This specifies the portions of the database or the set of data in which the user is interested. This
includes the database attributes or data warehouse dimensions of interest (referred to as the
relevant attributes or dimensions).
Includes database, tables, attributes, and conditions.
Example: Student data with marks > 60.

2. *Kind of Knowledge to be Mined*

This specifies the data mining functions to be performed, such as characterization,
discrimination, association or correlation analysis, classification, prediction, clustering, outlier
analysis, or evolution analysis.
Examples:
* Association
* Classification
* Clustering
* Prediction
* Characterization
3. *Background Knowledge*

This knowledge about the domain to be mined is useful for guiding the knowledge discovery
process and for evaluating the patterns found. Concept hierarchies are a popular form of
background knowledge, which allow data to be mined at multiple levels of abstraction. An example of a
concept hierarchy for the attribute (or dimension) age is shown in Figure
Includes concept hierarchies.
Example: City → State → Country.

4. *Interestingness Measures*

They may be used to guide the mining process or, after discovery, to evaluate the discovered
patterns. Different kinds of knowledge may have different interestingness measures. For example,
interestingness measures for association rules include support and confidence. Rules whose support
and confidence values are below user-specified thresholds are considered uninteresting.
Examples: Support, Confidence, Lift.
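
Support, confidence, and lift for a single association rule can be computed as sketched below. The market-basket transactions and function names are assumptions for illustration; the formulas are the standard definitions, with lift being the rule's confidence divided by the support of its consequent.

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Estimate P(consequent | antecedent) from the transactions."""
    return support(antecedent | consequent, transactions) / support(antecedent, transactions)


def lift(antecedent, consequent, transactions):
    """Confidence of the rule relative to the baseline support of the consequent."""
    return confidence(antecedent, consequent, transactions) / support(consequent, transactions)


# Hypothetical market-basket data.
baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]
rule_support = support({"bread", "milk"}, baskets)
rule_confidence = confidence({"bread"}, {"milk"}, baskets)
rule_lift = lift({"bread"}, {"milk"}, baskets)
```

A lift below 1, as in this toy data, suggests that buying bread makes milk slightly less likely than its baseline, so the rule would not be considered interesting even if it passed the support and confidence thresholds.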

5. *Presentation of Discovered Knowledge*

This refers to the form in which discovered patterns are to be displayed, which may include rules,
tables, charts, graphs, decision trees, and cubes.

Examples: Tables, graphs, charts, rules, trees.
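
To make the five primitives concrete, the sketch below bundles them into one query object. The class name MiningQuery, its field names, and the sample values are hypothetical and do not correspond to any real query language or library; the point is only that a mining request needs all five pieces of information.

```python
from dataclasses import dataclass, field


@dataclass
class MiningQuery:
    """Hypothetical container for the five task primitives (not a real DMQL)."""
    task_relevant_data: dict        # database, tables, attributes, conditions
    knowledge_kind: str             # e.g. "association", "classification"
    background_knowledge: dict = field(default_factory=dict)  # concept hierarchies
    interestingness: dict = field(default_factory=dict)       # thresholds
    presentation: str = "table"     # rules, tables, charts, graphs, trees


query = MiningQuery(
    task_relevant_data={
        "table": "students",
        "attributes": ["name", "marks"],
        "condition": "marks > 60",
    },
    knowledge_kind="association",
    background_knowledge={"location": ["City", "State", "Country"]},
    interestingness={"min_support": 0.05, "min_confidence": 0.7},
    presentation="rules",
)
```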


Data Mining Architecture

• Data mining is the process of selecting, exploring, and modelling large amounts of data,
using automated or semi-automated methods, to discover previously unknown regularities or
relationships and produce clear and valuable findings for the owner of the data. A data
mining architecture describes the components of a system that carries out this process
and how they interact.
• The primary components of any data mining system are the Data source, data
warehouse server, data mining engine, pattern assessment module, graphical user
interface, and knowledge base.

Basic Working:
• When a user requests data mining queries, these requests are sent to data mining
engines for pattern analysis.
• These software applications use the existing database to try to discover a solution to the
query.
• The retrieved metadata is then transmitted to the data mining engine for suitable
processing, which may interact with pattern assessment modules to decide the
outcome.
• The result is finally delivered to the front end in a user-friendly format via an
appropriate interface.

Components Of Data Mining Architecture
• Data Sources
• Database Server
• Data Mining Engine
• Pattern Evaluation Modules
• Graphic User Interface
• Knowledge Base

Data Sources
• These sources provide the data in plain text, spreadsheets, or other media such as images or
videos. Data sources include databases, the World Wide Web (WWW), and data
warehouses.

Database Server
• The real data is stored on the database server and is ready to be processed. Its job is to
handle data retrieval in response to the user's request.

Data Mining Engine:


• It is one of the most important parts of the data mining architecture since it conducts many
data mining techniques such as association, classification, characterisation, clustering,
prediction, and so on.

Pattern Evaluation Modules:


• They are responsible for identifying interesting patterns in data and, on occasion,
interacting with database servers to provide the results of user queries.

Graphic User Interface:


• Because the user cannot completely comprehend the complexities of the data mining
process, a graphical user interface assists the user in efficiently communicating with the
data mining system.

Knowledge Base:
• The knowledge base is an essential component of the data mining architecture that guides the
search for result patterns. It may also provide input to the data mining engine, and it can
include information gathered from past user experience. Its goal is to improve the accuracy
and reliability of the results.
Languages for Data Mining

a. Data Mining Query Languages (DMQLs):


Concept: Historically, there have been proposals for dedicated Data Mining Query Languages
(DMQLs), inspired by SQL. The idea was to have a declarative language where users could
specify what they want to mine, rather than how to mine it.

Han, Kamber, and Pei's DMQL: A prominent conceptual DMQL defined by the authors (Han,
Kamber, and Pei) in their widely-used textbook specifies data mining tasks using "primitives" (see
Data Mining Task Primitives above). These primitives cover:

Task-relevant data: Specifying the data source and subset.

Kind of knowledge to be mined: The data mining functionality (e.g., classification, association).

Background knowledge: Concept hierarchies, domain knowledge.

Interestingness measures: Thresholds (e.g., minimum support/confidence for association rules).

Expected output format: How results should be presented (e.g., rules, tables, visualizations).

Status: While DMQLs were proposed, a single, universally adopted standard like SQL for relational
databases doesn't exist for data mining. However, the principles of DMQL (specifying tasks using
primitives) remain a useful way to describe what a data mining tool needs to be told before it can run.

b. General-Purpose Programming Languages with Libraries:


In practice, much of modern data mining is done using general-purpose programming languages
coupled with powerful libraries. These offer flexibility, extensibility, and the ability to integrate with
other data science tasks.
Python:
Strengths: Highly popular, easy to learn, vast ecosystem of libraries.
Key Libraries:
Pandas: For data manipulation and analysis (data cleaning, integration, transformation).
NumPy: For numerical computing.
Scikit-learn: Comprehensive machine learning library for classification, regression, clustering,
dimensionality reduction, etc.
Matplotlib, Seaborn: For data visualization.
TensorFlow, PyTorch, Keras: For deep learning tasks, which are increasingly relevant in advanced
data mining.
NLTK, spaCy: For natural language processing (text mining).
R:
Strengths: Excellent for statistical analysis, data visualization, and specialized machine learning
tasks. Strong community and a rich collection of statistical packages.
Key Packages: dplyr, ggplot2, caret, e1071, randomForest, arules.
Scala (with Apache Spark):
Strengths: Designed for big data processing, integrates well with Spark's distributed computing
capabilities.
Key Libraries: Spark MLlib (Spark's machine learning library).
Java:
Strengths: Mature, robust, widely used in enterprise systems. Many early data mining tools
were built in Java (e.g., Weka, Apache Mahout).

SQL (for Data Preparation):

While not a data mining language itself, SQL is crucial for the data preprocessing stage, especially
for selecting, filtering, aggregating, and joining data in relational databases and data warehouses
before it's fed into mining algorithms.
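
A small sketch of SQL-based preparation, run from Python through the standard sqlite3 module so that it is self-contained. The table names, columns, and rows are hypothetical; the query filters out a missing amount, joins orders to customers, and aggregates per city before the result would be handed to a mining algorithm.

```python
import sqlite3

# In-memory database with two hypothetical tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, city TEXT);
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
    INSERT INTO customers VALUES (1, 'Pune'), (2, 'Delhi');
    INSERT INTO orders VALUES (10, 1, 500), (11, 1, 700), (12, 2, 600), (13, 2, NULL);
""")

# Select, filter, join, and aggregate in one statement.
rows = conn.execute("""
    SELECT c.city, COUNT(*) AS n_orders, SUM(o.amount) AS total
    FROM orders AS o
    JOIN customers AS c ON c.customer_id = o.customer_id
    WHERE o.amount IS NOT NULL
    GROUP BY c.city
    ORDER BY c.city
""").fetchall()
conn.close()
```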
c. Domain-Specific Languages/Tools:
Many commercial and open-source data mining tools provide their own visual interfaces or scripting
languages that abstract away the underlying code, making them accessible to business users.
Examples:
KNIME: Node-based visual workflow editor.
Orange: Visual programming tool for data analysis and machine learning.
RapidMiner: Integrated environment for predictive analytics, text mining, and machine learning.
IBM SPSS Modeler: Drag-and-drop interface for building predictive models.
SAS Enterprise Miner: Comprehensive suite for statistical analysis and data mining.
Weka: Collection of machine learning algorithms for data mining tasks, with a GUI.
