Data Preprocessing Techniques Overview

Data preprocessing involves preparing raw data for analysis through cleaning, transforming, and organizing it to enhance quality and ensure accuracy. Key steps include data cleaning, integration, transformation, and reduction, each addressing specific issues like missing values and inconsistencies. While it improves data quality and analysis efficiency, it can be time-consuming and resource-intensive.

UNIT-II

Data Preprocessing
Data Preprocessing: An Overview

Data preprocessing is the process of preparing raw data for analysis or mining by cleaning,
transforming, and organizing it into a format that mining algorithms can use.
 The goal is to improve the quality of the data.
 It helps in handling missing values, removing duplicates, and normalizing data.
 It ensures the accuracy and consistency of the dataset.
Steps in Data Preprocessing
Some key steps in data preprocessing are Data Cleaning, Data Integration, Data Transformation, and
Data Reduction.

1. Data Cleaning

Data cleaning refers to detecting and correcting errors or inconsistencies in a dataset. The goal is to
improve the accuracy and reliability of data.

Key Steps:

 Handling Missing Values


o Occurs when some data entries are absent.
o Methods:
 Ignore rows with missing values
 Fill with attribute mean/median
 Use the most probable value (e.g., prediction-based filling)
 Handling Noisy Data
o Noisy data = data containing random errors, outliers, or otherwise incorrect values.
o Methods:
 Binning Method: Sort data into segments and replace with mean/boundary
values
 Regression: Fit data into a regression function to smooth values
 Clustering: Group similar points, treat outliers separately
 Removing Duplicates
o Detect and eliminate repeated entries
o Ensures consistency and prevents analysis errors

2. Data Integration

Data integration combines data from multiple sources into a unified dataset. Challenges include
differences in formats, structures, and meanings.

Techniques:

 Record Linkage
Identifies and matches records referring to the same entity across datasets, even if represented
differently.
 Data Fusion
Merges information from multiple sources, resolving inconsistencies to create a richer dataset.

3. Data Transformation

Data transformation converts data into suitable formats for analysis.

Techniques:

 Normalization
Scales data to a common range (e.g., 0–1).
 Standardization
Adjusts data to have mean = 0 and variance = 1.
 Discretization
Converts continuous values into discrete categories.
 Data Aggregation
Summarizes multiple values (e.g., totals, averages).
 Concept Hierarchy Generation
Organizes data into hierarchical levels for higher-level analysis.

4. Data Reduction

Reduces dataset size while preserving important information, improving efficiency.

Techniques:

 Dimensionality Reduction (PCA, etc.)
Reduces the number of variables while retaining most of the essential information.
 Numerosity Reduction
Reduces number of records using methods like sampling.
 Data Compression
Encodes data compactly to save storage and processing power.

Uses of Data Preprocessing


Data preprocessing is widely applied in multiple domains:

1. Data Warehousing – Cleaning and integrating data before storing in repositories.
2. Data Mining – Preparing raw data to discover patterns and insights.
3. Machine Learning – Ensuring quality inputs for training and testing models.
4. Data Science – Structuring data for analysis and predictive modeling.
5. Web Mining – Processing web logs for user behavior analysis.
6. Business Intelligence (BI) – Preparing reliable data for dashboards and decision-making.
7. Deep Learning – Normalizing and enhancing data for effective neural network training.

Advantages of Data Preprocessing

 Improved Data Quality – Ensures accuracy, consistency, and reliability.
 Better Model Performance – Reduces noise and irrelevant data.
 Efficient Data Analysis – Streamlined data enables faster processing.
 Enhanced Decision-Making – Provides clear, structured insights.

Disadvantages of Data Preprocessing

 Time-Consuming – Requires effort for cleaning and organizing.
 Resource-Intensive – Needs computational power and skilled professionals.
 Potential Data Loss – Improper handling may remove useful information.
 Complexity – Difficult to manage with large or diverse datasets.

Data Cleaning
1. Introduction to Data Cleaning

 Definition:
Data Cleaning (or data cleansing) is the process of identifying and correcting (or removing)
corrupt, inaccurate, incomplete, duplicated, or irrelevant data in a dataset.
 Purpose:
Ensures the dataset is consistent, accurate, and ready for analysis or training machine learning
models.
 Importance:
o Removes noise and inconsistencies.
o Improves data quality → better decisions.
o Prevents errors in analysis and model training.

2. Causes of Dirty Data

1. Human Errors: Typing mistakes, misspellings, wrong entries.
2. Missing Data: Values not recorded during data collection.
3. Duplicates: Same records stored multiple times.
4. Inconsistent Data: Different formats (e.g., date 01-02-2025 vs 2025/02/01).
5. Outliers: Extreme values outside normal range.
6. Integration Issues: Combining data from multiple sources with different standards.

3. Common Problems in Data


 Missing Values
 Noisy Data (errors or outliers)
 Inconsistent Data (format mismatch, unit mismatch)
 Duplicate Records
 Irrelevant Data (unnecessary features/columns)

4. Techniques of Data Cleaning

A. Handling Missing Values

 Ignore the record: If missing values are few and dataset is large.
 Fill manually: When dataset is small.
 Mean/Median/Mode imputation: Replace missing values with average or most frequent value.
 Predictive imputation: Use ML models (e.g., regression, k-NN) to estimate missing values.
 Special value: Fill with Unknown or 0 if meaningful.
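
As a minimal sketch of mean/median/mode imputation, the helper below replaces missing entries (represented here as None, which is an assumption of this example) with a summary of the observed values. The function name impute and the sample ages are illustrative only; in practice pandas fillna or scikit-learn imputers would normally be used.

```python
from statistics import mean, median, mode

def impute(values, strategy="mean"):
    """Return a copy of `values` with None entries replaced by a summary statistic."""
    observed = [v for v in values if v is not None]
    if strategy == "mean":
        fill = mean(observed)
    elif strategy == "median":
        fill = median(observed)
    elif strategy == "mode":
        fill = mode(observed)
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return [fill if v is None else v for v in values]

# Hypothetical ages with one missing entry.
ages = [25, None, 31, 25, 40]
mean_filled = impute(ages, "mean")      # missing age becomes 30.25
median_filled = impute(ages, "median")  # missing age becomes 28.0
mode_filled = impute(ages, "mode")      # missing age becomes 25
```

The median is usually the safer choice when the attribute contains outliers, because a single extreme value can pull the mean far from the typical value.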

B. Handling Noisy Data

 Smoothing techniques:
o Binning: Group values into bins and smooth.
o Regression: Fit a regression line and replace noisy values.
o Moving Average: Replace value with average of neighbors.
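
The two simplest smoothing techniques above can be sketched as follows. This is an illustration under stated assumptions: bins are equal-frequency bins of a fixed size, the moving average is centered and shrinks its window at the edges, and the sample values are made up.

```python
def smooth_by_bin_means(values, bin_size):
    """Sort the values, split them into bins of `bin_size`, and replace
    each value by the mean of its bin. The result is in sorted order."""
    ordered = sorted(values)
    smoothed = []
    for start in range(0, len(ordered), bin_size):
        bin_values = ordered[start:start + bin_size]
        bin_mean = sum(bin_values) / len(bin_values)
        smoothed.extend([bin_mean] * len(bin_values))
    return smoothed

def moving_average(values, window=3):
    """Replace each value by the average of itself and its neighbours."""
    half = window // 2
    result = []
    for i in range(len(values)):
        neighbours = values[max(0, i - half): i + half + 1]
        result.append(sum(neighbours) / len(neighbours))
    return result

prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
binned = smooth_by_bin_means(prices, bin_size=3)
averaged = moving_average([1, 2, 9, 2, 1])  # the spike at 9 is flattened
```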

C. Handling Inconsistent Data

 Convert into a standard format (e.g., USD vs US Dollar).


 Standardize units (e.g., kg vs lbs).
 Apply business rules (e.g., age should not be negative).

D. Removing Duplicate Data

 Use primary key / unique identifier to detect duplicates.


 Apply string similarity (Levenshtein distance, cosine similarity) to detect near-duplicates.
 Keep the latest or most reliable record.
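
Near-duplicate detection with Levenshtein distance might look like the sketch below. The threshold of 2 edits, the case-insensitive comparison, and the sample names are assumptions made for the example, not recommended settings.

```python
def levenshtein(a, b):
    """Edit distance between two strings (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]

def near_duplicates(names, max_distance=2):
    """Return pairs of names whose edit distance is at most max_distance."""
    pairs = []
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            if levenshtein(names[i].lower(), names[j].lower()) <= max_distance:
                pairs.append((names[i], names[j]))
    return pairs

names = ["Ramesh", "Rameshh", "Keerthi"]
candidates = near_duplicates(names)
```

Comparing every pair is quadratic in the number of records, so on large datasets the comparison is usually restricted to records that already share some key (for example the same city or phone prefix).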

E. Handling Outliers

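 Statistical methods:
o Z-Score method (flag values more than 3 standard deviations from the mean).
o IQR method (flag values below Q1 − 1.5×IQR or above Q3 + 1.5×IQR).
 Transformation: Apply log or square root transformation to reduce the influence of extreme values.
 Clipping/Trimming: Cap extreme values at a chosen limit (clipping) or remove them (trimming).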
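
The two statistical rules can be sketched with the standard library as shown below. The function names and sample ages are illustrative; the z-score uses the population standard deviation, and quartiles come from statistics.quantiles with its default method, so exact cut-offs may differ slightly from other tools.

```python
from statistics import mean, pstdev, quantiles

def zscore_outliers(values, threshold=3.0):
    """Values whose distance from the mean exceeds `threshold` standard deviations."""
    mu, sigma = mean(values), pstdev(values)
    return [v for v in values if abs(v - mu) / sigma > threshold]

def iqr_outliers(values, k=1.5):
    """Values below Q1 - k*IQR or above Q3 + k*IQR."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

ages = [22, 25, 25, 27, 30, 31, 33, 250]
by_iqr = iqr_outliers(ages)
by_z3 = zscore_outliers(ages)                 # default threshold of 3
by_z2 = zscore_outliers(ages, threshold=2.0)  # looser threshold
```

On this small sample the IQR rule flags 250, but the z-score rule with a threshold of 3 does not: the extreme value inflates the standard deviation enough to hide itself. This is one reason the IQR rule is often preferred for small or heavily skewed samples.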

F. Feature Cleaning

 Remove irrelevant attributes: Columns not useful for analysis.


 Standardization: Convert values to a standard scale (e.g., min-max normalization, z-score
normalization).
 Consistent representation: Example: Gender → {Male, Female} instead of {M, F, Male,
Female}.

5. Tools for Data Cleaning

 Excel / Google Sheets (basic cleaning).


 Python Libraries:
o pandas (dropna, fillna, drop_duplicates).
o numpy (handling missing values).
o scikit-learn (imputation, scaling).
 ETL Tools: Talend, Informatica.
 Data Preparation Platforms: Trifacta, OpenRefine.

6. Benefits of Data Cleaning

 Improves accuracy of analysis.


 Increases efficiency of processing.
 Reduces storage costs (by removing duplicates).
 Builds trust in data-driven decisions.
 Essential for machine learning (models depend on quality data).

7. Example

Suppose a dataset of customers:

ID  Name     Age        City        Phone
1   Ramesh   25         Mumbai      9876543210
2   Keerthi  (missing)  Hyd         9123456789
3   Ramesh   250        Bhimavaram  9876543210
4   Ram      25         Hyderabad   NaN

Problems:

 Missing Age (row 2).
 Age = 250 (outlier, row 3).
 City names inconsistent (Hyd vs Hyderabad).
 Duplicate record for Ramesh (rows 1 and 3 share the same name and phone number).
 Missing phone number (row 4).

Cleaning Actions:

 Fill missing age with mean/median.


 Remove/adjust outlier age 250.
 Standardize city to Hyderabad.
 Remove duplicate.
 Fill missing phone as Unknown.
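
The cleaning actions above could be applied to the example table roughly as follows. Several choices here are assumptions of this sketch rather than part of the notes: ages outside 0 to 120 are treated as missing, duplicates are detected by matching name and phone, the first record of a duplicate pair is kept, and the alias table contains only the one abbreviation seen in the data.

```python
from statistics import median

customers = [
    {"id": 1, "name": "Ramesh",  "age": 25,   "city": "Mumbai",     "phone": "9876543210"},
    {"id": 2, "name": "Keerthi", "age": None, "city": "Hyd",        "phone": "9123456789"},
    {"id": 3, "name": "Ramesh",  "age": 250,  "city": "Bhimavaram", "phone": "9876543210"},
    {"id": 4, "name": "Ram",     "age": 25,   "city": "Hyderabad",  "phone": None},
]
CITY_ALIASES = {"Hyd": "Hyderabad"}  # assumed alias table

def clean(rows):
    cleaned, seen = [], set()
    for row in rows:
        row = dict(row)  # do not modify the caller's records
        row["city"] = CITY_ALIASES.get(row["city"], row["city"])
        if row["age"] is not None and not 0 <= row["age"] <= 120:
            row["age"] = None  # implausible age is treated as missing
        key = (row["name"], row["phone"])
        if row["phone"] is not None and key in seen:
            continue  # duplicate: keep the first record only
        seen.add(key)
        cleaned.append(row)
    fill_age = median(r["age"] for r in cleaned if r["age"] is not None)
    for row in cleaned:
        if row["age"] is None:
            row["age"] = fill_age
        if row["phone"] is None:
            row["phone"] = "Unknown"
    return cleaned

result = clean(customers)
```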
Data Integration
1. Introduction

 Data Integration is the process of combining data from multiple heterogeneous sources into a
coherent data store.
 Goal: Provide a unified view of data for analysis, decision-making, or building a data
warehouse.
 Important in data preprocessing, because raw data usually comes from different sources like
databases, files, web logs, sensors, etc.

2. Need for Data Integration

 Data is often spread across multiple sources:


o Different formats (SQL, CSV, JSON, XML, NoSQL).
o Different schemas (naming conflicts, data type mismatches).
o Different semantics (e.g., "salary" in one dataset vs. "income" in another).
 Integration ensures:
o Consistency → no duplicate/conflicting records.
o Completeness → all relevant attributes are combined.
o Improved quality → reliable data for mining/analysis.

3. Issues in Data Integration

When integrating, several problems can arise:

1. Schema Integration
o Merging schemas from multiple sources into a unified schema.
o Problems:
 Naming Conflicts → same attribute name but different meanings (e.g., “ID” in
student vs. employee), or different names for the same meaning (e.g., Salary vs. Income).
 Structural Conflicts → same data represented differently (e.g., date stored as
dd/mm/yyyy vs. yyyy-mm-dd).
2. Entity Identification Problem
o Figuring out whether two records refer to the same real-world entity.
o Example: “Ramesh D” in one dataset and “Ramesh Durga” in another.
3. Redundancy & Duplicate Data
o Multiple records representing the same entity.
o Leads to overestimation in analysis if not handled.
4. Data Value Conflicts
o Same attribute, different values.
o Example: Employee salary = 50,000 in one source, 48,000 in another.
5. Data Quality Issues
o Missing values, noisy data, inconsistent formats.

4. Approaches to Data Integration

There are different strategies for integrating data:


1. Data Consolidation (Physical Integration)
o Data from multiple sources is extracted, transformed, and loaded (ETL) into a single data
repository (e.g., Data Warehouse).
o Example: Pulling sales data from MySQL, customer data from Oracle, merging into one
warehouse.
2. Data Federation (Virtual Integration)
o Data remains in original sources but is accessed via a unified virtual view.
o Uses middleware or query systems.
o Example: A query fetches real-time data from multiple databases without moving them.
3. Data Propagation
o Data is copied across systems using synchronization techniques.
o Example: Master–slave database replication.

5. Techniques in Data Integration

1. Schema Matching and Mapping


o Aligning attributes with same meaning but different names.
o Example: Cust_ID → CustomerNumber.
2. Entity Resolution (Record Linkage / Deduplication)
o Identifying and merging records referring to the same entity.
o Techniques: String matching, probabilistic matching.
3. Data Transformation
o Standardizing data types, units, formats.
o Example: Converting currency from USD to INR.
4. Data Cleaning
o Handling missing values, correcting inconsistencies before/after integration.
5. Data Reduction
o Eliminating duplicates, redundant attributes.

6. Example

Suppose two datasets:

Table 1: Employee (Oracle)

Emp_ID  Name      Salary  Dept
101     Ramesh D  50,000  IT

Table 2: Staff (MySQL)

StaffNo  FullName           Income  Department
E101     Ramesh Doddipatla  48,000  IT Dept

 Issues:
o Schema conflict → Emp_ID vs StaffNo.
o Naming conflict → Salary vs Income.
o Value conflict → 50,000 vs 48,000.
o Entity Identification → "Ramesh D" = "Ramesh Doddipatla".
After integration →

Unified Employee Table

Employee_ID  Name      Salary  Department
101          Ramesh D  49,000  IT

(Conflict resolved by averaging or using trusted source.)
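
A rough sketch of this integration is shown below. It assumes that the two records can be matched on the ID once the "E" prefix is stripped, that the Oracle table is the trusted source for the name, and that salary conflicts are resolved by averaging. The mapping dictionaries and function names are hypothetical.

```python
employee = [{"Emp_ID": "101", "Name": "Ramesh D", "Salary": 50000, "Dept": "IT"}]
staff = [{"StaffNo": "E101", "FullName": "Ramesh Doddipatla",
          "Income": 48000, "Department": "IT Dept"}]

# Schema mapping: source column name -> unified column name.
EMPLOYEE_MAP = {"Emp_ID": "Employee_ID", "Name": "Name",
                "Salary": "Salary", "Dept": "Department"}
STAFF_MAP = {"StaffNo": "Employee_ID", "FullName": "Name",
             "Income": "Salary", "Department": "Department"}

def to_unified(row, mapping):
    """Rename columns and standardize ID and department formats."""
    unified = {mapping[column]: value for column, value in row.items()}
    unified["Employee_ID"] = unified["Employee_ID"].lstrip("E")
    unified["Department"] = unified["Department"].replace(" Dept", "")
    return unified

def integrate(primary, secondary):
    """Merge on Employee_ID; average conflicting salaries, keep primary names."""
    merged = {row["Employee_ID"]: dict(row) for row in primary}
    for row in secondary:
        key = row["Employee_ID"]
        if key in merged:
            merged[key]["Salary"] = (merged[key]["Salary"] + row["Salary"]) / 2
        else:
            merged[key] = dict(row)
    return list(merged.values())

unified = integrate([to_unified(r, EMPLOYEE_MAP) for r in employee],
                    [to_unified(r, STAFF_MAP) for r in staff])
```

When no shared identifier exists, the matching step would have to rely on name similarity instead, which is the entity identification problem described earlier.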

7. Applications of Data Integration

 Data Warehousing → combining multiple sources for OLAP.


 Business Intelligence (BI) → unified reports for decision making.
 ETL (Extract, Transform, Load) → core in data pipelines.
 Big Data Analytics → integrating logs, sensor data, social media, etc.
 Cloud & IoT → merging sensor, device, and user data.

Data Reduction
1. Introduction

 Large databases can hold very large and complex data.


 Running data mining algorithms directly on such data is time-consuming and may require
excessive memory/CPU resources.
 Data Reduction techniques are applied to reduce the size of data while maintaining its integrity
and analytical value.

👉 Goal: Obtain a reduced representation of the dataset that is smaller in volume but produces similar
results when analyzed.

2. Objectives of Data Reduction

1. Reduce data size – minimize memory/storage requirements.
2. Improve performance – make mining and analysis faster.
3. Preserve data quality – retain important patterns and information.
4. Avoid overfitting – by removing irrelevant/noisy data.

3. Data Reduction Techniques

3.1 Data Cube Aggregation

 Data cubes are used in OLAP (Online Analytical Processing) systems.


 Instead of storing detailed raw data, aggregated data (e.g., totals, averages) is stored.
 Example:
o Sales data of each product per day → can be aggregated into monthly sales or quarterly
sales.
 Reduces data size while keeping important trends.
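
The daily-to-monthly roll-up can be illustrated with a small sketch. The sales rows are invented for the example, and dates are assumed to be ISO strings (yyyy-mm-dd) so that the first seven characters identify the month.

```python
from collections import defaultdict

# (date, product, amount) rows; values are made up for illustration.
daily_sales = [
    ("2025-01-05", "pen", 120),
    ("2025-01-20", "pen", 80),
    ("2025-02-03", "pen", 50),
    ("2025-01-11", "book", 300),
]

def monthly_totals(rows):
    """Aggregate daily amounts into one total per (month, product)."""
    totals = defaultdict(int)
    for date, product, amount in rows:
        month = date[:7]  # "yyyy-mm"
        totals[(month, product)] += amount
    return dict(totals)

monthly = monthly_totals(daily_sales)
```

Four detail rows collapse into three aggregated cells here; on real data with many transactions per day the reduction is far larger, at the cost of losing the day-level detail.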
3.2 Dimension Reduction

 Involves reducing the number of attributes/features in the dataset.


 Helps remove irrelevant, redundant, or highly correlated attributes.

Methods:

1. Feature Selection (Attribute Subset Selection)


o Select a subset of relevant attributes.
o Methods:
 Stepwise forward selection
 Stepwise backward elimination
 Decision-tree-based selection
o Example: For predicting student performance, "Study Hours" is more important than
"Favorite Color".
2. Feature Extraction
o Create new features from original ones.
o Methods:
 Principal Component Analysis (PCA) – transforms correlated features into a
smaller set of uncorrelated variables (principal components).
 Singular Value Decomposition (SVD) – reduces dimensionality in text/data
mining.
o Example: Instead of using 10 correlated economic indicators, reduce them to 2–3
principal components.

3.3 Data Compression

 Encodes data into a more compact form.


 Helps reduce storage space and transmission time.

Types:

1. Lossless Compression – No information loss.


o Example: Huffman encoding, Run-length encoding.
o Useful for categorical/textual data.
2. Lossy Compression – Some information is lost, but data is still useful.
o Example: JPEG image compression, audio/video compression.
o Useful for multimedia data.
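
Run-length encoding is simple enough to show in full. The sketch below stores each run of repeated characters as a (character, count) pair and shows that decoding restores the original exactly, which is what makes it lossless. The function names are illustrative.

```python
from itertools import groupby

def rle_encode(text):
    """Collapse each run of identical characters into (char, run_length)."""
    return [(char, len(list(group))) for char, group in groupby(text)]

def rle_decode(pairs):
    """Expand (char, run_length) pairs back into the original string."""
    return "".join(char * count for char, count in pairs)

encoded = rle_encode("AAABBC")
decoded = rle_decode(encoded)
```

Run-length encoding only saves space when the data contains long runs, for example a sorted categorical column; on data without repeats the encoded form is larger than the input.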

3.4 Numerosity Reduction

 Replaces large data volumes with smaller representations (models/approximations).

Methods:

1. Parametric Methods
o Assume data fits a statistical model.
o Store only model parameters instead of raw data.
o Example: Linear regression, Log-normal distribution.
2. Non-Parametric Methods
o No assumptions about data distribution.
o Techniques:
 Histograms – approximate data distribution by dividing data into intervals
(bins).
 Clustering – group similar data points and represent each cluster by its centroid.
 Sampling – select a small but representative portion of the data.
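
Two of the non-parametric methods, histograms and sampling, are sketched below. The histogram uses equal-width bins and assumes the values are not all identical; the sample is a simple random sample without replacement, with a fixed seed only so that the example is repeatable. Names and data are illustrative.

```python
import random

def equal_width_histogram(values, bins):
    """Return (bin edges, counts) for `bins` intervals of equal width."""
    low, high = min(values), max(values)
    width = (high - low) / bins
    counts = [0] * bins
    for v in values:
        index = min(int((v - low) / width), bins - 1)  # max value goes in last bin
        counts[index] += 1
    edges = [low + i * width for i in range(bins + 1)]
    return edges, counts

def simple_random_sample(values, n, seed=0):
    """Pick n values without replacement."""
    return random.Random(seed).sample(values, n)

values = [1, 2, 2, 3, 5, 8, 9, 10, 10, 11]
edges, counts = equal_width_histogram(values, bins=2)
sample = simple_random_sample(values, 3)
```

The histogram replaces ten values with three edges and two counts; the sample replaces them with three of the original values.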

3.5 Discretization and Concept Hierarchy Generation

 Reduces continuous attributes into discrete intervals.


 Simplifies data representation and analysis.

Examples:

1. Age (continuous variable) → {0–12 = Child, 13–19 = Teen, 20–59 = Adult, 60+ = Senior}.
2. Income values → {Low, Medium, High}.

👉 Often used in decision tree algorithms and OLAP hierarchies.

4. Advantages of Data Reduction

 Faster data mining and query processing.


 Lower storage and memory requirements.
 Easier visualization and understanding of data.
 Improves efficiency without losing important patterns.

5. Disadvantages of Data Reduction

 Information loss (especially in lossy methods).


 Reduced data may not always capture all patterns.
 Incorrect feature selection may lead to poor model performance.

6. Applications of Data Reduction

 Big Data Analytics – reducing terabytes of data into manageable size.


 Data Warehousing – aggregation and summarization.
 Machine Learning – dimensionality reduction can improve model accuracy and speed.
 Image/Video Processing – compression for storage/transmission.

Data Transformation
1. Introduction

 Data Transformation is a crucial step in data preprocessing in Data Warehousing and Data
Mining.
 It converts raw data into a suitable format for analysis, ensuring consistency, efficiency, and
better mining results.
 Transformation makes heterogeneous data compatible for integration and mining.

2. Objectives of Data Transformation

 Improve data quality.


 Reduce data redundancy and noise.
 Standardize values across different sources.
 Make data more comprehensible and efficient for mining algorithms.
 Enable meaningful comparisons across datasets.

3. Techniques of Data Transformation

3.1 Smoothing

 Removes noise from data.


 Methods:
o Binning: Sorting data into bins and replacing values with mean/median of the bin.
o Regression: Fitting data to a regression function (e.g., linear regression).
o Clustering: Detecting outliers and smoothing by cluster center values.

3.2 Aggregation

 Summarizing or combining data.


 Example:
o Daily sales → Weekly sales → Monthly sales.
 Used in data cubes and OLAP operations.

3.3 Generalization

 Replacing low-level (detailed) data with higher-level concepts.


 Example:
o "19 years old" → "Young"
o "Bhimavaram" → "Andhra Pradesh" → "India".
 Achieved using concept hierarchies.

3.4 Normalization

 Scaling numeric attributes to a specific range.


 Makes data comparable.
 Methods:
1. Min-Max Normalization
 Maps values to a new range [new_min, new_max], most commonly [0,1].
 Formula:

$$v' = \frac{v - \min(A)}{\max(A) - \min(A)} (new\_max - new\_min) + new\_min$$
2. Z-score Normalization (Standardization)
 Uses mean and standard deviation.
 Formula:

$$v' = \frac{v - \mu}{\sigma}$$

3. Decimal Scaling
 Moves the decimal point of values: v' = v / 10^j, where j is the smallest integer such that max(|v'|) < 1.
 Example: 985 → 0.985 (j = 3)
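
The three normalization methods can be sketched as plain functions. This is an illustration only: the function names and sample values are assumptions, the z-score uses the population standard deviation, and min-max assumes the attribute is not constant (otherwise the denominator is zero).

```python
from statistics import mean, pstdev

def min_max(values, new_min=0.0, new_max=1.0):
    """Rescale linearly so min(values) -> new_min and max(values) -> new_max."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min for v in values]

def z_score(values):
    """Subtract the mean and divide by the standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

def decimal_scaling(values):
    """Divide by the smallest power of 10 that makes every |value| < 1."""
    j = 0
    while max(abs(v) for v in values) / 10 ** j >= 1:
        j += 1
    return [v / 10 ** j for v in values]

salaries = [25000, 98000, 52000]
scaled = min_max(salaries)            # smallest -> 0.0, largest -> 1.0
standardized = z_score(salaries)      # mean 0, standard deviation 1
shifted = decimal_scaling([985, -120, 45])
```

Min-max scaling is sensitive to outliers because a single extreme value fixes one end of the range; z-score normalization is usually preferred when the minimum and maximum are unknown or unreliable.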

3.5 Attribute/Feature Construction

 Creating new attributes from existing ones to make mining more effective.
 Example:
o From "Date of Birth" → create "Age".
o From "Marks in subjects" → create "Average Marks".
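
Both constructed attributes can be derived in a few lines. The sketch below assumes dates are datetime.date objects and marks are held in a dictionary keyed by subject; the function names and sample values are hypothetical.

```python
from datetime import date

def age_on(date_of_birth, today):
    """Age in completed years on the given day."""
    had_birthday = (today.month, today.day) >= (date_of_birth.month, date_of_birth.day)
    return today.year - date_of_birth.year - (0 if had_birthday else 1)

def average_marks(marks):
    """Mean of the marks across all subjects."""
    return sum(marks.values()) / len(marks)

age_before_birthday = age_on(date(2006, 3, 15), date(2025, 3, 14))
age_on_birthday = age_on(date(2006, 3, 15), date(2025, 3, 15))
avg = average_marks({"maths": 80, "physics": 70, "chemistry": 90})
```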

3.6 Discretization

 Converting continuous attributes into discrete ones.


 Helps in categorical data mining.
 Example:
o Age = 1–100 → {Child (1–12), Teen (13–19), Adult (20–59), Senior (60+)}

3.7 Encoding

 Converting categorical data into numerical format.


 Methods:
o One-Hot Encoding: Each category → binary attribute.
o Label Encoding: Assign numeric labels to categories.
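
A minimal sketch of both encodings follows. Categories are ordered alphabetically here, which is an arbitrary choice of this example; library encoders may order them differently.

```python
def label_encode(values):
    """Map each category to an integer; returns (codes, mapping)."""
    categories = sorted(set(values))
    mapping = {category: index for index, category in enumerate(categories)}
    return [mapping[v] for v in values], mapping

def one_hot_encode(values):
    """One binary column per category; returns (rows, column order)."""
    categories = sorted(set(values))
    rows = [[1 if v == category else 0 for category in categories] for v in values]
    return rows, categories

cities = ["Bhimavaram", "Hyderabad", "Vijayawada", "Hyderabad"]
codes, mapping = label_encode(cities)
rows, columns = one_hot_encode(cities)
```

Label encoding implies an order (0 < 1 < 2) that does not exist for city names, so it can mislead algorithms that treat the codes as magnitudes; one-hot encoding avoids this but adds one column per category.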

3.8 Data Reduction through Transformation

 Dimensionality reduction using PCA (Principal Component Analysis) or Attribute Selection.


 Produces smaller datasets without losing essential information.

4. Benefits of Data Transformation

 Increases accuracy of mining algorithms.


 Handles inconsistencies across multiple sources.
 Reduces computational complexity.
 Provides standardized, clean data for warehousing and OLAP.

5. Example

Suppose we have a dataset:


Customer Age Salary Location Purchase

A 19 25000 Bhimavaram Yes

B 45 98000 Hyderabad No

C 30 52000 Vijayawada Yes

After Transformation:

 Age → Category (Young, Adult, Senior)


 Salary → Normalized [0–1]
 Location → Generalized to state (Bhimavaram, Vijayawada → "Andhra Pradesh"; Hyderabad → "Telangana")
 Purchase → Encoded (Yes=1, No=0)

Data Discretization
1. Introduction to Data Discretization

 Definition:
Data discretization is the process of transforming continuous attributes (numeric data) into a finite
set of intervals (or categories).
 Purpose:
o To reduce the number of possible values.
o To improve efficiency and performance of mining algorithms.
o To make data more interpretable and suitable for certain algorithms (like decision trees,
association rules).

2. Importance of Discretization

 Continuous attributes can take a very large (in principle infinite) number of distinct values, which can make data mining inefficient.
 Some algorithms (e.g., association rule mining, some decision tree learners) require or work better with categorical data.
 Helps reduce data size without losing important patterns.
 Improves data quality, pattern discovery, and interpretability.

3. Types of Data Discretization

(A) Supervised Discretization

 Uses class label information to create intervals.


 Goal: maximize the information gain or minimize entropy within each interval.
 Example: Decision tree split (ID3, C4.5).

(B) Unsupervised Discretization

 Does not use class labels.


 Only considers distribution of attribute values.
 Example: Equal-width, equal-frequency binning.

4. Methods of Discretization

4.1 Binning

 Simple discretization method by grouping values into bins.


 Types:
1. Equal-width binning: Divides range into N intervals of equal size.
 Example: 0–100 split into 5 bins of width 20 → [0–20), [20–40), …, [80–100].
2. Equal-frequency binning: Each bin has approximately the same number of data values.
 Example: Sort values and assign equal counts to each bin.
 Advantages: Easy to implement.
 Disadvantages: May lose information; equal-width binning in particular is sensitive to outliers.
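
The difference between the two binning schemes is easiest to see on data with an outlier. In the sketch below each function returns a bin index per value; the function names and sample ages are illustrative, and equal-width binning assumes the values are not all identical.

```python
def equal_width_bins(values, n_bins):
    """Bin index (0..n_bins-1) per value, using intervals of equal width."""
    low, high = min(values), max(values)
    width = (high - low) / n_bins
    return [min(int((v - low) / width), n_bins - 1) for v in values]

def equal_frequency_bins(values, n_bins):
    """Bin index per value so that each bin holds about the same number of values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, i in enumerate(order):
        bins[i] = rank * n_bins // len(values)
    return bins

ages = [5, 18, 22, 25, 31, 40, 47, 95]
by_width = equal_width_bins(ages, 3)
by_frequency = equal_frequency_bins(ages, 3)
```

With equal widths, the single value 95 stretches the range so that five of the eight ages fall into the first bin; equal-frequency binning spreads the same values as 3, 3, and 2 per bin.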

4.2 Histogram Analysis

 Uses frequency distribution to define bins.


 Intervals are chosen based on the data distribution, not fixed.
 Example: More bins in dense regions, fewer bins in sparse regions.

4.3 Clustering

 Groups similar data points into clusters and treats each cluster as a bin.
 Example: Using k-means to discretize numeric attributes.
 Advantage: Finds natural groupings in data.

4.4 Decision Tree Analysis

 Uses top-down splitting to form intervals based on class labels.
 Example: ID3, C4.5, and CART algorithms.
 Intervals chosen to maximize information gain (ID3, C4.5) or reduce the Gini index (CART).

4.5 Correlation-based Discretization

 Uses correlation analysis between attributes and class labels.


 Splits values in a way that preserves attribute–class relationship.

4.6 Entropy-based Discretization

 Selects boundaries that minimize class entropy (i.e., maximize the purity of each interval).


 Works in supervised learning.
 Example: If splitting at value "X" results in the lowest entropy, choose it as boundary.
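
A single step of entropy-based discretization can be sketched as follows: every midpoint between adjacent distinct values is tried as a boundary, and the one with the lowest weighted class entropy is kept. This is a simplified illustration with made-up data; a full method would apply the split recursively and use a stopping criterion.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return sum((c / total) * log2(total / c) for c in Counter(labels).values())

def best_split(values, labels):
    """Return (boundary, weighted_entropy) of the best binary split."""
    pairs = sorted(zip(values, labels))
    best = None
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no boundary between equal values
        boundary = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [label for _, label in pairs[:i]]
        right = [label for _, label in pairs[i:]]
        weighted = (len(left) * entropy(left) + len(right) * entropy(right)) / len(pairs)
        if best is None or weighted < best[1]:
            best = (boundary, weighted)
    return best

ages = [19, 22, 25, 41, 45, 52]
bought = ["no", "no", "no", "yes", "yes", "yes"]
boundary, weighted_entropy = best_split(ages, bought)
```

Here the boundary falls at 33.0, halfway between 25 and 41, because both resulting intervals contain a single class and the weighted entropy drops to zero.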

5. Discretization vs. Concept Hierarchy

 Concept hierarchy generation: Organizing attributes into multiple levels (e.g., age → {young,
middle, old}).
 Discretization: Assigning continuous values into bins (e.g., 20–29 = young).
 Discretization is often a step in building concept hierarchies.

6. Advantages of Discretization

 Simplifies data representation.


 Improves performance of algorithms.
 Reduces memory usage.
 Makes patterns more understandable for humans.

7. Limitations of Discretization

 May lead to information loss.


 Choice of number of bins and boundaries is critical.
 Over-discretization may reduce accuracy.

8. Applications

 Decision tree learning.


 Association rule mining.
 Classification & Clustering.
 Data compression & summarization.

**THE END**
