Chapter 2: Data Preprocessing
Need of preprocessing the Data
Data Cleaning
Data Integration and Transformation
Data Reduction
Discretization and Concept Hierarchy Generation
Need of preprocessing the Data
Data preprocessing is an essential step in data mining because raw data
collected from various sources is often incomplete, inconsistent, noisy, or
redundant. Such data can lead to inaccurate analysis or poor model
performance if used directly.
Definition: Data preprocessing is the process of improving the quality of data
by handling errors, missing values, duplicate records, and irrelevant
information, so that the data becomes clean and suitable for analysis.
The main goal of data preprocessing is to improve the quality of the data
so that it is more suitable for data mining tasks.
Data Quality:
Data quality means how good, correct, and useful the data is for analysis or
decision-making. If the data is poor (wrong, incomplete, outdated), then even
the best model or visualization will give wrong results.
1. Accuracy
Accuracy means the data correctly reflects the real-world values or facts it
represents. Inaccurate data can lead to wrong decisions or conclusions.
Example: If an employee’s actual salary is ₹30,000 but it is recorded as ₹35,000
in the database, the data is inaccurate.
2. Completeness
Completeness means that all required data is present and no important
information is missing. Missing data can make analysis incomplete or
unreliable.
Example: A customer record without an email address or phone number is an
example of incomplete data.
3. Consistency
Consistency means that the same data is uniform and matches across different
databases or systems. Inconsistent data creates confusion and errors in
processing.
Example: If a product’s price is ₹500 in one database and ₹550 in another, the
data is inconsistent.
4. Timeliness
Timeliness means that data is up-to-date and available when needed. Old or
delayed data reduces the usefulness of information.
Example: A weather app showing yesterday’s temperature instead of today’s is
not timely.
5. Believability
Believability means that data is trustworthy and comes from a reliable and
trusted source. Unreliable data may lead to poor decision-making.
Example: Data from a government health report is more believable than data
from an unverified blog.
6. Interpretability
Interpretability means that data is easy to understand, with clear meaning,
proper format, and well-defined labels. If users cannot understand what the data
represents, it loses value.
Example: A column named “Student_Name” is easy to understand, but a
column named “STN_NM_01” is confusing and less interpretable.
Quality Dimension   Meaning                     Real-World Example
Accuracy            Correct and true data       Student’s age entered as 20 (not 200)
Completeness        No missing values           Every student has marks and attendance
Consistency         Same across sources         “BCA” in all databases
Timeliness          Updated and current         Today’s weather data, not old
Believability       Comes from trusted source   HR database, not random site
Interpretability    Easy to understand          Clear names and formats
Data Preprocessing Tasks /Techniques:
1. Data Cleaning
Removing errors, duplicates, and missing or inconsistent values from data to
make it accurate and reliable.
2. Data Integration
Data integration means combining data from multiple sources into a single,
consistent dataset.
This is useful when data is stored in different databases or formats.
3. Data Transformation
Data transformation means converting data into a suitable format or structure
for analysis.
4. Data Reduction
Reducing the size or volume of data while keeping its important information
for faster analysis.
Data Cleaning
Data cleaning is the process of detecting and correcting errors, removing
duplicates, and filling missing or wrong values in a dataset to make it accurate
and ready for analysis.
Example: In a customer database:
Some phone numbers are missing.
Some names are repeated twice.
Some email addresses are written incorrectly.
How to Handle Missing Values in Data Cleaning:
1. Ignore the Tuple
Meaning: Remove the entire row (record) that has a missing value.
When to Use:
o When the dataset is large and only a few values are missing.
o When the missing value is in a class label (for classification tasks).
Example: If in a sales dataset, 2 out of 10,000 customer records have missing
income values, we can safely delete those two rows. It won’t affect the overall
analysis.
Drawback: If many records have missing data, removing them may cause loss
of valuable information.
2. Fill in the Missing Value Manually
Meaning: Manually enter the correct or estimated value by examining
other available data.
When to Use:
o When the dataset is small.
o When domain knowledge is available.
Example: In a small hospital database, if a patient’s weight is missing, the
nurse or doctor can manually check the patient file and enter it.
Drawback: Time-consuming and impractical for large datasets.
3. Use a Global Constant to Fill in the Missing Value
Meaning: Replace all missing values with the same constant (like
“Unknown,” “Not Available,” or –1).
When to Use:
o When you just want to indicate that a value is missing.
Example: In a customer database, if city is missing, replace it with
“Unknown.”
Drawback: The mining algorithm may treat all “Unknown” as a single
category, which can lead to bias or incorrect groupings.
4. Use a Measure of Central Tendency (Mean, Median, or Mode)
Meaning: Replace the missing value with a mean, median, or mode of
that attribute’s existing values.
When to Use:
o When data is numerical and distribution is known.
Examples:
If income data is normally distributed, fill missing income with the mean
of the known incomes (₹54,000 in the table below).
Before:
Customer_ID   Name    Income (₹)
1             Asha    50,000
2             Vivek   Missing
3             Reena   58,000
After (mean of 50,000 and 58,000 = 54,000):
Customer_ID   Name    Income (₹)
1             Asha    50,000
2             Vivek   54,000
3             Reena   58,000
If data is skewed (e.g., some people earn very high amounts), use the
median instead of the mean.
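The mean and median fills described above can be sketched in Python as follows. This is a minimal illustration, not a prescribed implementation: the function name fill_missing and the use of None to mark a missing value are assumptions, and the skewed income list is hypothetical.

```python
from statistics import mean, median

def fill_missing(values, strategy="mean"):
    """Replace None entries with the mean or median of the known values."""
    known = [v for v in values if v is not None]
    fill = mean(known) if strategy == "mean" else median(known)
    return [fill if v is None else v for v in values]

# Incomes of Asha, Vivek (missing) and Reena from the table above.
incomes = [50_000, None, 58_000]
print(fill_missing(incomes, "mean"))      # Vivek's income is filled with 54,000

# Hypothetical skewed data: one very high earner pulls the mean upward.
skewed = [20_000, 25_000, None, 30_000, 500_000]
print(fill_missing(skewed, "mean")[2])    # mean fill: 143,750
print(fill_missing(skewed, "median")[2])  # median fill: 27,500
```

In the skewed list the median fill stays close to what most people earn, which is why the median is preferred when a few extreme values are present.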
5. Use Class-specific Mean / Median (Use the Attribute Mean or Median
for All Samples Belonging to the Same Class)
Meaning: Instead of using the overall mean, use the mean or median
specific to each class or group.
When to Use:
o When data belongs to different classes or categories.
Example: In a bank dataset, if we are predicting credit risk:
For customers labeled as “High Credit Risk,” fill missing income with
the average income of High Risk customers only.
For “Low Credit Risk,” use the average income of that class.
Before:
Customer_ID   Credit_Risk   Income (₹)
1             High          30,000
2             Low           70,000
3             High          Missing
After:
Customer_ID   Credit_Risk   Income (₹)
1             High          30,000
2             Low           70,000
3             High          30,000
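A class-specific fill can be sketched in Python as follows, using the three bank records from the table above. The function name fill_by_class, the dictionary keys, and the use of None for a missing value are assumptions made for illustration.

```python
from collections import defaultdict
from statistics import mean

def fill_by_class(records, class_key, value_key):
    """Fill a missing value with the mean of the same class only."""
    groups = defaultdict(list)
    for rec in records:
        if rec[value_key] is not None:
            groups[rec[class_key]].append(rec[value_key])
    # A class with no known values at all would raise KeyError below.
    class_means = {cls: mean(vals) for cls, vals in groups.items()}
    return [
        {**rec, value_key: class_means[rec[class_key]]}
        if rec[value_key] is None else rec
        for rec in records
    ]

customers = [
    {"id": 1, "risk": "High", "income": 30_000},
    {"id": 2, "risk": "Low", "income": 70_000},
    {"id": 3, "risk": "High", "income": None},
]
filled = fill_by_class(customers, "risk", "income")
print(filled[2]["income"])  # customer 3 receives the High-risk mean, 30,000
```

The overall mean of the known incomes would be 50,000, so filling by class keeps customer 3 in line with other High-risk customers.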
6. Use the Most Probable Value to Fill in the Missing Value
Meaning: Predict the missing value using advanced models such as
Regression, Decision Trees, or Bayesian inference.
When to Use:
o When you want the most accurate estimation.
o When relationships exist between attributes.
Example: In a retail dataset, if a customer’s income is missing, we can use
their education, occupation, and spending pattern to predict the most likely
income using a regression model or decision tree.
Before:
Customer_ID   Education   Occupation   Income (₹)
1             Graduate    Engineer     60,000
2             Graduate    Teacher      Missing
3             12th Pass   Clerk        35,000
After:
Customer_ID   Education   Occupation   Income (₹)
1             Graduate    Engineer     60,000
2             Graduate    Teacher      45,000
3             12th Pass   Clerk        35,000
Drawback: This requires more computation and assumes relationships
between variables.
How to Handle Noisy Data in Data Cleaning:
Data Smoothing is a process used to remove noise (random errors or
fluctuations) from data to make patterns and trends more visible and
meaningful.
Noisy data means data that contains errors, inconsistencies, or random
variations which do not represent the true values.
1. Binning Method: Group (bin) the data into small ranges, then smooth the
data by replacing values within each bin using:
o Bin mean
o Bin median
o Bin boundaries
Example:
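As an illustration, smoothing by bin means, bin medians, and bin boundaries can be sketched in Python as follows. The nine values are illustrative and are not taken from these notes; the bin size of 3 and the function names are likewise assumptions.

```python
from statistics import mean, median

def make_bins(values, bin_size):
    """Sort the values and cut them into bins of equal size."""
    ordered = sorted(values)
    return [ordered[i:i + bin_size] for i in range(0, len(ordered), bin_size)]

def smooth_by(summary, bins):
    """Replace every value in a bin by one summary value (mean or median)."""
    return [[summary(b)] * len(b) for b in bins]

def smooth_by_boundaries(bins):
    """Replace each value by the nearer of its bin's minimum and maximum."""
    return [
        [b[0] if v - b[0] <= b[-1] - v else b[-1] for v in b]
        for b in bins
    ]

prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]   # illustrative values
bins = make_bins(prices, 3)    # [[4, 8, 15], [21, 21, 24], [25, 28, 34]]
print(smooth_by(mean, bins))          # bin means are 9, 22 and 29
print(smooth_by(median, bins))        # bin medians are 8, 21 and 28
print(smooth_by_boundaries(bins))     # e.g. 8 moves to 4, 28 moves to 25
```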
2. Regression: Fit a mathematical model (like a line or curve) to the data. Points
far from the line are treated as noise.
Example: If we plot Age vs. Income, and one record shows a 10-year-old
earning ₹1,00,000/month, it is a noisy record because it does not fit the
general pattern.
3. Clustering: Group similar data points together. Points that don’t belong to
any cluster are treated as outliers (noise).
Use statistical techniques to find data points that are very different from
others.
Example: In a customer dataset:
Most customers have monthly spending between ₹5,000 and ₹30,000.
One customer shows ₹5,00,000; this point is noise, as it does not fit
any group.
Data Cleaning as a Process:
Step 1: Raw Data (Input)
Step 2: Discrepancy Detection → Find errors, missing values, inconsistencies
Step 3: Data Transformation → Correct or standardize data (fill, convert, fix)
Step 4: Verification → Recheck data quality after cleaning
Step 5: Clean Data Output (Ready for Mining)
Data Integration
Data Integration is the process of combining data from different sources
into a single, unified view.
In data mining, before we analyze data, we often need to collect and merge
data from multiple databases, files, or systems — that’s where data
integration helps.
Example: Imagine a university
The Student Database has student names, roll numbers, and courses.
The Attendance System has attendance records.
The Exam Department has marks.
To analyze overall student performance, we need to combine all three into one
table. This combining process = Data Integration.
Why Is Data Integration Needed?
Data is often spread across multiple sources.
To get a complete picture for analysis or decision-making, we must merge it.
It helps in data consistency, avoiding redundancy, and better insights.
Example: All Social Media Platforms.
Challenges/Issues in Data Integration:
1. Schema Integration
Different databases may have different structures or column names.
Example:
o Table 1: Stu_ID, Stu_Name
o Table 2: StudentNo, Name
These refer to the same fields — they must be matched and
standardized.
2. Data Value Conflicts
Same data may be stored in different formats or units.
Example:
o In one system, salary is in ₹, in another, it’s in $.
o Dates: “2025-10-03” vs “03/10/2025”.
We need to convert and standardize the format.
One system:
Emp_ID   Salary    Join_Date
201      ₹50,000   2025-10-03
Another system:
Emp_ID   Salary    Join_Date
202      $600      03/10/2025
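A minimal sketch of standardizing the two date formats shown above. It assumes that "03/10/2025" is written as day/month/year, so that both rows refer to 3 October 2025; the function name to_iso_date and the tuple of accepted formats are illustrative assumptions. Currency conversion is left out because it needs an exchange rate that these notes do not give.

```python
from datetime import datetime

def to_iso_date(text, known_formats=("%Y-%m-%d", "%d/%m/%Y")):
    """Try each known format and return the date as YYYY-MM-DD."""
    for fmt in known_formats:
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue  # not this format, try the next one
    raise ValueError(f"Unrecognized date format: {text!r}")

print(to_iso_date("2025-10-03"))  # 2025-10-03
print(to_iso_date("03/10/2025"))  # 2025-10-03 (assuming day/month/year)
```

If a source actually used month/day/year, the same text would mean 10 March 2025, so the format of each source has to be confirmed before converting.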
3. Redundant Data
Some data may appear more than once across sources.
Example: A student record present in both “Student Database” and
“Library Database.”
→ We must remove duplicates.
Student Database:
Student_ID   Name          Course
301          Asha Nair     BCA
302          Rahul Joshi   [Link].
Library Database:
Student_ID   Name          Course
301          Asha Nair     BCA
303          Kiran Rao     BBA
4. Data Inconsistency
Same attribute may have different values in two sources.
Example:
o In System A: Student address = “Hubballi”
o In System B: Student address = “Hubli”
→ We must resolve conflicts and choose the correct value.
Entity Identification Problem:
The Entity Identification Problem occurs when we need to match and identify
which records from different databases refer to the same real-world object or
person, even though their names or IDs differ.
When we collect data from different sources, the same real-world entity
(person, object, etc.) may be represented differently in each source.
The Entity Identification Process helps in recognizing and merging records
that refer to the same entity across different databases — removing
duplication and improving data quality.
Example: University Database -- Let’s say a college has two databases:
Table 1: Student_Info
Stu_ID Name Phone
S001 Riya P 9876543210
S002 Arjun R 9123456789
Table 2: Library_Records
Library_No Student_Name Contact_No
L1001 Riya Patel 9876543210
L1002 A. R. 9123456789
Problem:
The same student Riya P (S001) appears as Riya Patel (L1001).
Arjun R (S002) appears as A. R. (L1002).
Their names and IDs differ, but phone numbers match.
The system must identify that:
S001 in Student_Info = L1001 in Library_Records
S002 in Student_Info = L1002 in Library_Records
This is the Entity Identification Problem.
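The matching in this example can be sketched in Python as follows. Using the phone number as the matching key is the assumption here; it works for these two tables but would fail if numbers were shared, changed, or formatted differently. The function name match_by_phone is illustrative.

```python
student_info = [
    {"Stu_ID": "S001", "Name": "Riya P", "Phone": "9876543210"},
    {"Stu_ID": "S002", "Name": "Arjun R", "Phone": "9123456789"},
]
library_records = [
    {"Library_No": "L1001", "Student_Name": "Riya Patel", "Contact_No": "9876543210"},
    {"Library_No": "L1002", "Student_Name": "A. R.", "Contact_No": "9123456789"},
]

def match_by_phone(students, library):
    """Map each Stu_ID to the Library_No that has the same phone number."""
    by_phone = {rec["Contact_No"]: rec["Library_No"] for rec in library}
    return {
        s["Stu_ID"]: by_phone[s["Phone"]]
        for s in students
        if s["Phone"] in by_phone
    }

print(match_by_phone(student_info, library_records))
# S001 is matched with L1001 and S002 with L1002
```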
Data Transformation
Data Transformation is the process of converting data into a suitable format
for analysis or mining. It ensures that data from different sources becomes
consistent, compatible, and ready to use.
Strategies of Data Transformation:
The data are stored in different formats, scales, or different structures. So,
transformation makes all the data uniform and comparable.
1. Smoothing: Remove noise (irregular or random variations) from data.
Example: Daily sales data = [100, 102, 98, 150, 101]
The value 150 looks unusual (noise).
--We can replace it using binning or moving average to make data
smoother.
2. Attribute/Feature Construction: Create new useful attributes from
existing ones.
Example: From date of birth (DOB), we can create a new attribute Age.
→ Age = Current Year − Birth Year.
--This helps improve model performance.
3. Aggregation: Summarize or combine data.
Example: Instead of storing daily sales, we can store monthly sales totals.
Monthly totals can in turn be aggregated into a yearly total:
Monthly sales:
Month        Sales
Jan 2024     1000
Feb 2024     1200
March 2024   3000
Aggregated:
Year   Sales
2024   5200
4. Normalization: Normalization is a technique to make all data values fall
in a common range, usually 0 to 1 or −1 to +1, so that no single
feature dominates the others when analyzing data.
Scale data values to a specific range (commonly 0–1).
Helps when attributes have different scales.
Example:
Student Marks Attendance
A 85 90
B 45 70
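Here marks and attendance are both recorded on a 0–100 scale, so they are
already comparable. Normalization matters most when attributes use different
scales (for example, income in lakhs and age in years); it rescales each
attribute to the 0–1 range.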
5. Discretization: Convert continuous data into discrete bins or categories.
Example: Marks (0–100) → Categories
0–35 → Fail
36–60 → Average
61–85 → Good
86–100 → Excellent
6. Concept Hierarchy Generation: Replace low-level data with high-level
concepts.
Example:
City → “Hubli”
State → “Karnataka”
Country → “India”
When generalized: Hubli → Karnataka → India
Normalization Techniques:
Normalization means scaling data so that all values fall within a small,
specified range (commonly 0 to 1 or -1 to 1).
This is useful because:
Some attributes have large values (e.g., income in lakhs)
Some have small values (e.g., age in years)
If not normalized, large-valued attributes dominate the analysis.
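1. Min–Max Normalization:
v' = (v − min) / (max − min)
where min and max are the smallest and largest values of the attribute. The
result lies in the range 0 to 1.
Example: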
Student Marks Attendance
A 30 80
B 50 60
C 90 70
For Marks: min = 30, max = 90
Student Marks Normalized Marks
A 30 (30–30)/(90–30)=0
B 50 (50–30)/(90–30)=20/60=0.33
C 90 (90–30)/(90–30)=1
Normalized Marks: 0, 0.33, 1
For Attendance: min = 60, max = 80
Student Attendance Normalized Attendance
A 80 (80–60)/(80–60)=1
B 60 (60–60)/(80–60)=0
C 70 (70–60)/(80–60)=10/20=0.5
Normalized Attendance: 1, 0, 0.5
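The worked example above can be reproduced with a short Python sketch. The function name min_max_normalize is an assumption, and the optional new_min and new_max parameters (a target range other than 0 to 1) go beyond what the example uses.

```python
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Rescale values linearly so the smallest maps to new_min, the largest to new_max."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("all values are equal; min-max scaling is undefined")
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min for v in values]

marks = [30, 50, 90]
attendance = [80, 60, 70]
print([round(x, 2) for x in min_max_normalize(marks)])  # [0.0, 0.33, 1.0]
print(min_max_normalize(attendance))                    # [1.0, 0.0, 0.5]
```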
2. Z-Score Normalization:
v' = (v − μ) / σ
where μ is the mean and σ is the standard deviation of the attribute.
Example:
Student Marks Attendance
A 30 80
B 50 60
C 90 70
For Marks:
Mean (μ) = (30+50+90)/3 = 170/3 = 56.67
Standard deviation (σ) = √[( (30–56.67)² + (50–56.67)² + (90–56.67)² ) / 3]
= √[(711.1 + 44.4 + 1111.1)/3]
= √(622.2) = 24.94
Student Marks Normalized Marks
A 30 (30–56.67)/24.94 = -1.07
B 50 (50–56.67)/24.94 = -0.27
C 90 (90–56.67)/24.94 = 1.34
For Attendance:
Mean = (80+60+70)/3 = 70
SD = √[( (80–70)² + (60–70)² + (70–70)² ) / 3] = √(200/3) = 8.16
Student Attendance Normalized Attendance
A 80 (80–70)/8.16 = 1.22
B 60 (60–70)/8.16 = -1.22
C 70 (70–70)/8.16 = 0
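A short Python sketch of the same calculation. It uses the population standard deviation (dividing by the number of values, 3), because that is what the worked example does; the function name z_score_normalize is an assumption.

```python
from statistics import mean, pstdev

def z_score_normalize(values):
    """Subtract the mean and divide by the population standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)  # divides by N, matching the worked example
    if sigma == 0:
        raise ValueError("all values are equal; z-scores are undefined")
    return [(v - mu) / sigma for v in values]

marks = [30, 50, 90]
attendance = [80, 60, 70]
print([round(z, 2) for z in z_score_normalize(marks)])       # [-1.07, -0.27, 1.34]
print([round(z, 2) for z in z_score_normalize(attendance)])  # [1.22, -1.22, 0.0]
```

Many libraries default to the sample standard deviation (dividing by N − 1), which would give slightly smaller z-scores for the same data.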
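3. Decimal Scaling Normalization:
v' = v / 10^j
where j is the smallest integer such that max(|v'|) < 1. The value of j is
not a constant; it depends on the largest absolute value of the attribute.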
Example:
Student Marks Attendance
A 30 80
B 50 60
C 90 70
For Marks: max = 90 → divide by 100 (10²)
For Attendance: max = 80 → divide by 100 (10²)
Student Marks Attendance Dec-Scaled Marks Dec-Scaled Attendance
A 30 80 0.30 0.80
B 50 60 0.50 0.60
C 90 70 0.90 0.70
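Decimal scaling can be sketched in Python as follows. The function name decimal_scale and the choice to return j alongside the scaled values are assumptions for illustration.

```python
def decimal_scale(values):
    """Divide by 10**j, where j is the smallest integer giving max(|v'|) < 1."""
    max_abs = max(abs(v) for v in values)
    j = 0
    while max_abs / 10 ** j >= 1:
        j += 1
    return [v / 10 ** j for v in values], j

print(decimal_scale([30, 50, 90]))  # ([0.3, 0.5, 0.9], 2)
print(decimal_scale([80, 60, 70]))  # ([0.8, 0.6, 0.7], 2)
```

For an attribute whose largest absolute value is 986, the loop would stop at j = 3, which shows why j differs from attribute to attribute.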
Data Reduction
Data reduction is the process of reducing the volume of data while
producing the same (or nearly the same) analytical results and preserving
data integrity.
It helps make data mining faster, more efficient, and cost-effective by
keeping only the most relevant information.
Why is Data Reduction Needed?
Big data systems generate huge amounts of data.
Processing all of it consumes time, memory, and computational power.
Therefore, we compress or summarize data without losing its essential
meaning.
Strategies of Data Reduction:
Data reduction strategies include
1. Dimensionality reduction
2. Numerosity reduction
3. Data compression
1. Dimensionality Reduction: Reducing the number of attributes or features
in the dataset.
Techniques:
o Principal Component Analysis (PCA)
o Feature Selection
o Wavelet Transform
Example: Instead of using 10 exam subject scores, we take only 3 key
subjects that best represent performance.
- Keeps only important features, removes redundant or correlated ones.
2. Numerosity Reduction: Replacing the original data with a smaller model or
representation that approximates it.
Techniques:
o Regression models
o Histograms
o Clustering
o Data cube aggregation
o Sampling
Example:
o Instead of storing 1 million sales transactions, store a linear
regression model showing trend between sales and time.
o Or store sampled data representing the whole population.
- Saves storage and still preserves data patterns.
1. Regression Models
Use a mathematical equation to represent data instead of storing all data
points.
Example:
X Y
1 2
2 4
3 6
4 8
Instead of storing all records, we can represent it as:
Y = 2X
This single equation replaces the entire dataset.
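A least-squares line fit is one way to obtain such an equation from the data. The sketch below is an illustration with an assumed function name, fit_line; it recovers Y = 2X from the four records above.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

slope, intercept = fit_line([1, 2, 3, 4], [2, 4, 6, 8])
print(slope, intercept)  # 2.0 0.0
```

Only the two parameters need to be stored. With real data the points rarely lie exactly on the line, so the model approximates the records rather than reproducing them.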
2. Histograms
Data is divided into intervals (bins), and only the frequency of values
in each bin is stored.
Example:
Raw data: 5, 7, 9, 10, 12, 14, 16, 18
Histogram (bin size = 5):
Interval Count
5–9 3
10–14 3
15–19 2
Now, we store intervals + counts instead of all 8 values.
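The counts above can be reproduced with a short Python sketch. It assumes integer values and labels each bin by its first and last integer, as the table does; the function name histogram is illustrative.

```python
from collections import Counter

def histogram(values, start, width):
    """Count integer values per bin; bins are labelled (first, last) inclusive."""
    counts = Counter((v - start) // width for v in values)
    return {
        (start + k * width, start + (k + 1) * width - 1): counts[k]
        for k in sorted(counts)
    }

raw = [5, 7, 9, 10, 12, 14, 16, 18]
print(histogram(raw, start=5, width=5))
# {(5, 9): 3, (10, 14): 3, (15, 19): 2}
```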
3. Clustering
Group similar data points into clusters and represent each cluster by its
centroid (average value).
Example:
Data: (1,2), (2,1), (10,11), (11,10)
We can form two clusters:
Cluster 1 → (1,2), (2,1) → Centroid = (1.5,1.5)
Cluster 2 → (10,11), (11,10) → Centroid = (10.5,10.5)
So, only two centroids represent the four data points.
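The centroids can be checked with a small Python sketch. The cluster membership is taken as given from the example; deciding which points belong together is the job of a clustering algorithm and is not shown here. The function name centroid is an assumption.

```python
def centroid(points):
    """Average each coordinate over all points in the cluster."""
    n = len(points)
    return tuple(sum(coord) / n for coord in zip(*points))

cluster_1 = [(1, 2), (2, 1)]
cluster_2 = [(10, 11), (11, 10)]
print(centroid(cluster_1))  # (1.5, 1.5)
print(centroid(cluster_2))  # (10.5, 10.5)
```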
4. Data Cube Aggregation: Summarizing data in multiple
dimensions.
Example: Suppose you have sales data by city, month, and product.
You can aggregate data to get total sales by state instead of city (higher-
level summary).
City Month Sales
Hubli Jan 1,000
Dharwad Jan 1,500
→ Karnataka (Aggregate) Jan 2,500
- Less detailed data, but same overall pattern.
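The city-to-state summary above can be sketched in Python as follows. The mapping CITY_TO_STATE and the function name roll_up_to_state are assumptions for illustration; the product dimension is omitted because the table does not list it.

```python
from collections import defaultdict

CITY_TO_STATE = {"Hubli": "Karnataka", "Dharwad": "Karnataka"}

sales = [
    ("Hubli", "Jan", 1000),
    ("Dharwad", "Jan", 1500),
]

def roll_up_to_state(rows, city_to_state):
    """Sum sales per (state, month) instead of per (city, month)."""
    totals = defaultdict(int)
    for city, month, amount in rows:
        totals[(city_to_state[city], month)] += amount
    return dict(totals)

print(roll_up_to_state(sales, CITY_TO_STATE))
# {('Karnataka', 'Jan'): 2500}
```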
5. Sampling:
Sampling is a data reduction technique in which a small representative
subset of a large dataset is selected for analysis.
The goal is to get similar analytical results as with the full dataset but with
less time and computation.
Types / Techniques of Sampling:
Let’s understand each with simple example using a “Student Marks”
database.
Student   Marks
A         85
B         75
C         90
D         60
E         70
F         95
a) Simple Random Sampling without Replacement (SRSWOR)
Each record has an equal chance of being selected.
Once selected, it cannot be chosen again.
Example: Select 3 students randomly without repeating.
Sample: {A, D, F} -- Each student appears only once.
b) Simple Random Sampling with Replacement (SRSWR)
Each record has an equal chance of being selected, and after selection, it
goes back into the pool.
So, the same record can appear more than once.
Example: Select 3 students randomly with replacement.
Sample: {C, F, C} -- Here, “C” is selected twice because replacement is
allowed.
c) Stratified Sampling
The dataset is divided into groups (strata) based on some attribute, and
then sampling is done within each group.
This ensures balanced representation from all categories.
Example: Group students based on Marks Category:
Category         Students   Marks
High (≥85)       A, C, F    85, 90, 95
Medium (70–84)   B, E       75, 70
Low (<70)        D          60
Now, select 1 sample from each category:
Sample: {F (High), B (Medium), D (Low)} -- Ensures that every group is
represented in the sample.
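The three sampling techniques can be sketched with Python's random module, using the Student Marks data above. The function names and the fixed seed are assumptions; the seed is there only so that repeated runs give the same samples, and the students actually drawn will generally differ from the ones listed in the examples.

```python
import random

students = {"A": 85, "B": 75, "C": 90, "D": 60, "E": 70, "F": 95}
names = list(students)
rng = random.Random(42)  # fixed seed only for repeatable runs

def srswor(population, n):
    """Simple random sample WITHOUT replacement: no record repeats."""
    return rng.sample(population, n)

def srswr(population, n):
    """Simple random sample WITH replacement: a record may repeat."""
    return rng.choices(population, k=n)

def category(marks):
    """Marks categories used in the stratified example."""
    if marks >= 85:
        return "High"
    if marks >= 70:
        return "Medium"
    return "Low"

def stratified(per_stratum=1):
    """Draw the same number of students from every marks category."""
    strata = {}
    for name, marks in students.items():
        strata.setdefault(category(marks), []).append(name)
    return {cat: rng.sample(members, per_stratum) for cat, members in strata.items()}

print(srswor(names, 3))   # three different students
print(srswr(names, 3))    # three draws, repeats possible
print(stratified())       # one student each from High, Medium and Low
```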
3. Data Compression: Encoding data in fewer bits without losing essential
information.
Techniques:
o Lossless compression (no data loss)
o Lossy compression (some data removed)
Example:
o In images, JPEG compression reduces file size.
o In text, run-length encoding stores “AAAA” as “A×4”.
- Useful for storing large multimedia or sensor data.
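Run-length encoding, the lossless technique mentioned above, can be sketched in Python as follows. The function names and the list-of-pairs representation are assumptions; the pair ("A", 4) corresponds to "A×4".

```python
from itertools import groupby

def rle_encode(text):
    """Store each run of repeated characters as (character, run length)."""
    return [(ch, len(list(run))) for ch, run in groupby(text)]

def rle_decode(pairs):
    """Rebuild the original text exactly, which is what makes RLE lossless."""
    return "".join(ch * count for ch, count in pairs)

encoded = rle_encode("AAAABBC")
print(encoded)              # [('A', 4), ('B', 2), ('C', 1)]
print(rle_decode(encoded))  # AAAABBC
```

The saving depends on the data: text with few repeated characters can become longer after encoding.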
Discretization and Concept Hierarchy Generation
Discretization is the process of converting continuous data (numerical
values) into a finite number of intervals (bins) or categories.
Purpose:
Simplifies data
Reduces number of values
Makes analysis (like classification) easier
Example:
Student Marks
A 35
B 55
C 68
D 78
E 90
Instead of using raw marks, we can discretize them into ranges:
Student Marks Grade (Discretized)
A 35 Low
B 55 Medium
C 68 Medium
D 78 High
E 90 High
Result: Continuous “Marks” → Categorical “Grade”. This reduces data
complexity and helps in decision-making.
Types of Discretization Techniques:
1. Equal-Width Binning
Description: Divides the range of values into equal-sized intervals.
Example: Range = 0–100 → bins (0–33), (34–66), (67–100)
2. Equal-Frequency Binning
Description: Each bin contains approximately the same number of values.
Example: 10 students → 3 bins → ~3 students per bin
3. Clustering-based Discretization
Description: Groups data using clustering algorithms (like K-Means).
Example: Marks grouped as per similarity (e.g., 30–50, 51–70, 71–100)
4. Supervised Discretization
Description: Considers class labels during binning.
Example: If students passed/failed → split marks accordingly
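The first two techniques can be sketched in Python as follows. The function names are assumptions, the ten marks are hypothetical, and the equal-width bins are computed on a continuous scale, so their edges fall at about 33.3 and 66.7 rather than at the whole-number limits shown in the example.

```python
def equal_width_bin(value, low, high, k):
    """Return the bin index (0 to k-1) of value for k equal-width bins."""
    width = (high - low) / k
    index = int((value - low) // width)
    return min(index, k - 1)  # the top edge belongs to the last bin

def equal_frequency_bins(values, k):
    """Sort the values and split them into k bins of nearly equal size."""
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[i * n // k:(i + 1) * n // k] for i in range(k)]

print([equal_width_bin(m, 0, 100, 3) for m in (10, 35, 90, 100)])  # [0, 1, 2, 2]

marks = [35, 42, 48, 55, 61, 68, 74, 78, 85, 90]   # hypothetical marks of 10 students
print([len(b) for b in equal_frequency_bins(marks, 3)])  # [3, 3, 4]
```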
Concept Hierarchy is a process of organizing attributes or values in levels
from low-level (detailed) to high-level (generalized) concepts.
Purpose:
Used for data abstraction and summarization
Helps in data generalization for reporting or OLAP (Online Analytical
Processing)
Example: Location Hierarchy
Level Example
City Hubli
District Dharwad
State Karnataka
Country India
Hierarchy: City → District → State → Country. Higher levels give
summarized information.
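The location hierarchy above can be sketched in Python as a simple parent lookup. The dictionary PARENT and the function names are assumptions for illustration; a real hierarchy would also need to keep apart a city and a district that share a name.

```python
# Each value points to the next higher level in the hierarchy.
PARENT = {
    "Hubli": "Dharwad",       # city -> district
    "Dharwad": "Karnataka",   # district -> state
    "Karnataka": "India",     # state -> country
}

def generalize(value, levels_up=1):
    """Climb the hierarchy by the given number of levels."""
    for _ in range(levels_up):
        value = PARENT[value]
    return value

def path_to_top(value):
    """List every level from the given value up to the top of the hierarchy."""
    path = [value]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path

print(generalize("Hubli", 2))  # Karnataka
print(path_to_top("Hubli"))    # ['Hubli', 'Dharwad', 'Karnataka', 'India']
```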