Data Preprocessing Techniques Explained

Data preprocessing is essential for transforming raw data into a clean, consistent, and usable format for data mining, addressing issues like noise, missing values, and inconsistencies. Key steps include data cleaning, integration, reduction, and transformation, each aimed at improving data quality and mining efficiency. Effective preprocessing enhances the accuracy of mining algorithms and ensures reliable insights from data analysis.

Uploaded by

nandusweety1101

UNIT 2

DATA PRE-PROCESSING

Data Preprocessing:
Data is often called the "new oil," but just like crude oil, raw data is rarely useful in
its original state. It is often noisy, incomplete, inconsistent, or simply too large to
be processed directly. To get accurate and meaningful results from data mining, we
must first prepare the data properly. This preparation process is called data
preprocessing.
Data preprocessing plays a critical role in ensuring that the knowledge discovered
from data is reliable and valid. If we try to apply data mining techniques on raw
data without cleaning and transforming it, we may end up with wrong conclusions.
For example, imagine a retail database where some customers have missing
addresses, some transactions have been recorded twice, and some prices are
wrongly entered. Mining such data would give misleading results about customer
behavior or sales patterns. Hence, preprocessing is the foundation for successful
data mining.
1. Importance of Data Preprocessing:
Before we go into the steps, it is important to understand why preprocessing is
required:
• Data quality issues - Real-world data may have errors, duplicate entries,
missing values, and inconsistencies.
• Large volume - Data warehouses and big databases often contain terabytes
or petabytes of records, which are too big for direct mining.
• Heterogeneity - Data may come from different sources such as spreadsheets,
databases, sensors, or the web, each with different formats.
• Improving accuracy - Well-prepared data improves the performance of
mining algorithms, leading to better classification, clustering, or prediction
results.
• Efficiency - Preprocessing reduces data size and complexity, so mining
becomes faster and more scalable.
Thus, preprocessing acts like the cleaning and organizing stage before
analysis, making sure the data is consistent, accurate, and ready for
knowledge discovery.

Steps in Data Preprocessing:

Some key steps in data preprocessing are Data Cleaning, Data Integration,
Data Transformation, and Data Reduction.

Major Tasks In Data Pre-Processing:

1. Data Cleaning
2. Data Integration
3. Data Reduction
4. Data Transformation
1. Data Cleaning:
The first major task in preprocessing is data cleaning, which deals with missing
values, noisy values, and inconsistencies.
(a) Missing Values
Data often has blank or unknown fields. For example, a customer's age might not
be recorded, or income data may be absent. Methods to handle missing values
include:
➢ Ignoring the record (only if the dataset is large enough).
➢ Filling with a global constant (like "Unknown").
➢ Replacing with the mean, median, or mode of the attribute.
➢ Using predictive models (regression, decision trees, or k-nearest neighbor) to
estimate the missing value.

Method            | Description                         | Example
Ignore Record     | Remove tuples with missing values.  | Drop student record with missing grade.
Fill Constant     | Replace with fixed value.           | Missing city → 'Unknown'.
Mean/Median/Mode  | Statistical replacement.            | Income replaced with mean salary.
Predictive Models | Use ML models for estimation.       | Predict missing age using regression.
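
The first three strategies in the table can be sketched in a few lines of Python. This is a minimal illustration using only the standard library; the records, field names, and helper functions below are hypothetical and are not taken from any particular dataset or tool.

```python
from statistics import mean

# Hypothetical customer records; None marks a missing value.
records = [
    {"city": "Pune", "income": 40000},
    {"city": None, "income": 52000},
    {"city": "Delhi", "income": None},
    {"city": "Pune", "income": 46000},
]

def drop_incomplete(rows):
    """Ignore the record: keep only rows with no missing field."""
    return [r for r in rows if all(v is not None for v in r.values())]

def fill_constant(rows, field, constant="Unknown"):
    """Fill a missing field with a global constant."""
    return [{**r, field: constant if r[field] is None else r[field]} for r in rows]

def fill_statistic(rows, field, stat=mean):
    """Fill a missing numeric field with a statistic of the observed values."""
    observed = [r[field] for r in rows if r[field] is not None]
    fill = stat(observed)
    return [{**r, field: fill if r[field] is None else r[field]} for r in rows]

complete = drop_incomplete(records)           # two fully populated rows remain
with_city = fill_constant(records, "city")    # missing city becomes "Unknown"
with_income = fill_statistic(records, "income", mean)  # missing income becomes 46000
```

Passing `statistics.median` or `statistics.mode` as `stat` gives the other two statistical replacements. Dropping rows is the simplest option but discards the observed fields of every incomplete record, which is why it suits only large datasets.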

(b) Noisy Data

Noise refers to random errors or variations in data. Example: typing mistakes,
wrong measurements, or outliers. Methods to reduce noise include:

(Figure: methods for handling noisy data: binning, regression, clustering.)
Binning: Sorting data into bins and replacing values by bin mean, median, or
boundary.
Regression: Fitting data into a regression function and smoothing values.
Clustering: Grouping similar records and identifying outliers as noise.
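
As an illustration of binning, the sketch below smooths values by bin means: the data is sorted, split into equal-size bins, and each value is replaced by the mean of its bin. The function name and the sample prices are illustrative assumptions.

```python
def smooth_by_bin_means(values, bin_size):
    """Sort the values, partition them into bins of bin_size, and replace
    every value with the mean of the bin it falls into."""
    ordered = sorted(values)
    smoothed = []
    for start in range(0, len(ordered), bin_size):
        bin_values = ordered[start:start + bin_size]
        bin_mean = sum(bin_values) / len(bin_values)
        smoothed.extend([bin_mean] * len(bin_values))
    return smoothed

prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
smoothed = smooth_by_bin_means(prices, 3)
# smoothed == [9.0, 9.0, 9.0, 22.0, 22.0, 22.0, 29.0, 29.0, 29.0]
```

Smoothing by bin medians or bin boundaries follows the same pattern with a different replacement value per bin.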

(c) Inconsistencies.
Data from different sources may have different formats or spellings. For example,
"Male/Female" vs. "M/F," or different currency notations. Cleaning detects and
corrects such conflicts to maintain consistency.

2. Data Integration
The second task is data integration, which combines data from multiple sources
like databases, flat files, or data cubes.
Key problems in integration include:
• Entity identification - Matching records that refer to the same entity but have
different names (e.g., "cust_ID" vs. "customer_no").
• Schema integration - Combining attributes with different names but same
meaning.
• Redundancy removal - If the same information is stored in two sources, it
must be detected and merged. Correlation analysis is often used to detect
redundancy.
• Value conflict resolution - If two sources record different values for the same
attribute, a strategy is needed to resolve conflicts.
Integration provides a unified view of data, which is essential for data
mining in large organizations where data comes from many departments and
systems.
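
A small sketch of these integration problems, under stated assumptions: two hypothetical sources name the customer key differently, the conflict strategy is simply "the primary source wins", and redundancy is checked with a hand-written Pearson correlation. None of the names or data below come from a real system.

```python
from math import sqrt

# Two hypothetical sources that identify the same customers with different key names.
crm = [{"cust_ID": 1, "name": "Asha", "city": "Pune"},
       {"cust_ID": 2, "name": "Ravi", "city": "Delhi"}]
billing = [{"customer_no": 1, "city": "Pune", "total": 1200.0},
           {"customer_no": 2, "city": "New Delhi", "total": 800.0}]

def integrate(primary, secondary, primary_key, secondary_key):
    """Match rows on their keys; where both sources hold an attribute,
    keep the primary source's value (one possible conflict strategy)."""
    by_key = {row[secondary_key]: row for row in secondary}
    merged = []
    for row in primary:
        other = by_key.get(row[primary_key], {})
        combined = {k: v for k, v in other.items() if k != secondary_key}
        combined.update(row)  # primary values overwrite conflicting ones
        merged.append(combined)
    return merged

def pearson(xs, ys):
    """Pearson correlation coefficient; values near +1 or -1 suggest redundancy."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

unified = integrate(crm, billing, "cust_ID", "customer_no")
# The same price stored in dollars and in cents is perfectly correlated.
r = pearson([1.0, 2.5, 4.0], [100, 250, 400])
```

Here `unified[1]["city"]` is "Delhi" because the CRM value overrides the conflicting billing value, and `r` is 1 up to rounding, which flags one of the two price attributes as redundant.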

3. Data Reduction

Since real-world data is often huge, data reduction techniques are applied to make
the dataset smaller but still representative of the original. The goal is to reduce the
volume while preserving essential patterns.
Methods of Data Reduction:
1. Data Cube Aggregation - Summarizing data at higher abstraction levels. For
example, instead of storing daily sales, we can keep monthly or yearly totals.
2. Dimensionality Reduction - Reducing the number of attributes using
techniques such as Principal Component Analysis (PCA) or Singular Value
Decomposition (SVD).
3. Attribute Subset Selection - Choosing only the most relevant attributes while
discarding irrelevant or redundant ones. Techniques like decision tree induction
and stepwise regression help in this.
4. Numerosity Reduction - Replacing large datasets with models. Example: using
regression equations or clusters to represent the data.
5. Sampling - Selecting a representative sample of the data for analysis instead of
the entire dataset.
6. Data Compression - Using encoding schemes like wavelet transforms to store
data in a compact format.
Reduction improves efficiency, saves storage, and speeds up mining algorithms
without losing much accuracy.
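
Of the methods above, sampling is the easiest to sketch. The example below shows a simple random sample without replacement and a stratified sample that keeps each group's share; the row layout, the "segment" attribute, and the 10% fraction are assumptions made for illustration.

```python
import random
from collections import defaultdict

def simple_random_sample(rows, n, seed=0):
    """Draw n rows without replacement; the seed makes the draw repeatable."""
    rng = random.Random(seed)
    return rng.sample(rows, n)

def stratified_sample(rows, key, fraction, seed=0):
    """Sample the same fraction from every group (stratum) so that small
    groups are still represented in the reduced dataset."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for row in rows:
        strata[row[key]].append(row)
    sample = []
    for group in strata.values():
        k = max(1, round(len(group) * fraction))
        sample.extend(rng.sample(group, k))
    return sample

# Hypothetical dataset: 900 retail rows and 100 wholesale rows.
rows = [{"id": i, "segment": "retail" if i % 10 else "wholesale"} for i in range(1000)]
srs = simple_random_sample(rows, 100)
strat = stratified_sample(rows, "segment", 0.1)
```

A simple random sample may under-represent a rare segment by chance, whereas the stratified version always returns 90 retail and 10 wholesale rows for this data.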

4. Data Transformation
Data transformation is the process of changing the structure, format, or values of
data to make it more suitable for analysis, modeling, or downstream processes.
This step can be simple (like renaming columns) or complex (like converting data
types or creating new features).
Methods of Data Transformation
1. Smoothing
• What It Is: Smoothing removes noise from the data to make patterns and
trends easier to identify.
• Why It’s Important: It highlights key features and helps in making
predictions.
Examples:
• Stock Market: Use moving averages to smooth out daily fluctuations and
reveal trends.
• Temperature Data: Smoothing temperature readings over time to reduce
minor variances caused by sensor inaccuracies.
• Original Data: [72, 74, 73, 75, 74, 76] Smoothed Data (using a centered 3-day
moving average; the first and last days have no full window): [—, 73, 74, 74, 75, —]
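
The temperature example can be reproduced with a centered moving average, sketched below. The function name is illustrative, and the code assumes an odd window size so that each window has a well-defined center.

```python
def centered_moving_average(values, window=3):
    """Average each value with its neighbours; positions without a full
    window on both sides are returned as None. Assumes window is odd."""
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        if i < half or i + half >= len(values):
            smoothed.append(None)
        else:
            chunk = values[i - half:i + half + 1]
            smoothed.append(sum(chunk) / window)
    return smoothed

temps = [72, 74, 73, 75, 74, 76]
smoothed_temps = centered_moving_average(temps, 3)
# smoothed_temps == [None, 73.0, 74.0, 74.0, 75.0, None]
```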

2. Aggregation
• What It Is: Aggregation summarizes or combines data from multiple
sources to generate higher-level information.
• Why It’s Important: It ensures that insights are based on a comprehensive
view of data.

Examples:
• Sales Data: Aggregating daily sales into monthly or yearly totals for performance
evaluation. Daily Sales: [500, 700, 600, 800, 650] Aggregated Total: 3,250
• Website Traffic: Summarize hourly visits into daily totals for trend analysis.
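
A minimal sketch of rolling daily sales up to monthly totals follows. The dates are invented for illustration; only the five sales amounts come from the example above.

```python
from collections import defaultdict
from datetime import date

# Hypothetical dates attached to the five daily sales figures.
daily_sales = [
    (date(2024, 1, 30), 500),
    (date(2024, 1, 31), 700),
    (date(2024, 2, 1), 600),
    (date(2024, 2, 2), 800),
    (date(2024, 2, 3), 650),
]

def monthly_totals(rows):
    """Group daily amounts by (year, month) and sum each group."""
    totals = defaultdict(int)
    for day, amount in rows:
        totals[(day.year, day.month)] += amount
    return dict(totals)

by_month = monthly_totals(daily_sales)
# by_month == {(2024, 1): 1200, (2024, 2): 2050}
```

The two monthly totals add up to 3,250, the same overall sum as the daily figures, so aggregation changes the level of detail without changing the total.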
3. Discretization
What It Is: Continuous data is divided into smaller intervals or bins, reducing
complexity and size.
Why It’s Important: It makes data easier to interpret and analyze.
Examples:
Age Ranges: Instead of using precise ages, group people into categories. Original:
[23, 25, 30, 45, 55, 67] Discretized (assuming cut-offs at 35 and 60): [“Young”,
“Young”, “Young”, “Middle-aged”, “Middle-aged”, “Senior”]
Time Intervals: Group exact exercise times into bins: Original: [10 mins, 20 mins,
35 mins] Discretized: [0–15 mins, 15–30 mins, 30–45 mins]
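
Discretization with fixed cut points can be sketched with the standard `bisect` module. The cut-offs of 35 and 60 are assumptions chosen for illustration; the notes do not specify where the age categories begin and end.

```python
from bisect import bisect_right

def discretize(value, cut_points, labels):
    """Return the label of the interval that value falls into.
    cut_points must be sorted, and labels needs one more entry than cut_points."""
    return labels[bisect_right(cut_points, value)]

AGE_CUTS = [35, 60]                          # assumed boundaries
AGE_LABELS = ["Young", "Middle-aged", "Senior"]

ages = [23, 25, 30, 45, 55, 67]
age_groups = [discretize(a, AGE_CUTS, AGE_LABELS) for a in ages]
# age_groups == ["Young", "Young", "Young", "Middle-aged", "Middle-aged", "Senior"]
```

With `bisect_right`, a value equal to a cut point falls into the upper interval, so an age of exactly 35 is labelled "Middle-aged".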
4. Attribute Construction
What It Is: Create new attributes (features) from the existing data to make analysis
or mining more efficient.
Why It’s Important: New attributes can simplify the data and improve the
quality of insights.
Examples:
E-commerce: Create a "Customer Lifetime Value (CLV)" attribute based on
past purchases.
Weather Data: Generate a new attribute, "Feels Like Temperature," based on
temperature, humidity, and wind speed.
5. Generalization
What It Is: Converts specific, low-level data into broader, high-level
categories using a concept hierarchy.
Why It’s Important: It simplifies data and makes patterns easier to detect.
Examples:
Ages: Convert exact ages into categories: Original: [18, 22, 30, 55]
Generalized into categories such as “Teenager”, “Young Adult”, and “Middle-Aged”.
Addresses: Convert house addresses into broader categories: Original: [“221B
Baker Street, London”] Generalized: [“London, UK”]

6. Normalization
Normalization adjusts data values to ensure they fall within a specific range or
follow a standard distribution, enabling fair comparisons between variables.
Why It’s Needed:
1. Attributes with larger ranges may dominate analysis if not scaled.
2. It improves the performance of machine learning models by
reducing bias introduced by differing scales.

Normalization Techniques and Formulas

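1. Min-Max Normalization
This method scales data to fit within a specific range [new_min, new_max].
Formula:
$$v' = \frac{v - \min_A}{\max_A - \min_A}\,(new\_max - new\_min) + new\_min$$
where min_A and max_A are the minimum and maximum values of attribute A.
2. Z-Score Normalization
This standardizes data using the mean (μ) and standard deviation (σ) of the attribute.
Formula:
$$v' = \frac{v - \mu}{\sigma}$$
3. Decimal Scaling
This method normalizes data by shifting the decimal point based on the maximum
absolute value of the attribute.
Formula:
$$v' = \frac{v}{10^{j}}$$
where j is the smallest integer such that max(|v'|) < 1.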
Practical Applications
1. Min-Max Normalization:
• Rescaling exam scores (0-100) into a range of 0 to 1 for
model inputs.
2. Z-Score Normalization:
• Standardizing survey responses (e.g., satisfaction ratings) for
comparison.
3. Decimal Scaling:
• Scaling sales revenue (e.g., $1M, $2M, $3M) into
manageable decimals for easier analysis.
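
The three normalization techniques can be sketched as follows. This is a minimal standard-library version; the function names and sample values are illustrative, the z-score uses the population standard deviation, and none of the functions guard against a constant attribute (which would cause a division by zero).

```python
from statistics import mean, pstdev

def min_max(values, new_min=0.0, new_max=1.0):
    """Linearly rescale values into [new_min, new_max]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min for v in values]

def z_score(values):
    """Centre on the mean and divide by the population standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

def decimal_scaling(values):
    """Divide by the smallest power of ten that brings every |value| below 1."""
    max_abs = max(abs(v) for v in values)
    j = 0
    while max_abs / 10 ** j >= 1:
        j += 1
    return [v / 10 ** j for v in values]

scores = [0, 50, 100]
scaled = min_max(scores)                 # [0.0, 0.5, 1.0]
standardized = z_score(scores)           # symmetric around 0
shifted = decimal_scaling([-986, 917])   # j = 3, so values are divided by 1000
```

Min-max keeps the original shape of the distribution but is sensitive to outliers, since a single extreme value stretches the range; z-score normalization is often preferred when the minimum and maximum are unknown or unreliable.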
Discretization and Concept Hierarchy
Transforms continuous attributes into discrete ones and replaces low-level data
with high-level concepts.
Discretization:
Convert numerical values into categories or intervals.
Example: Age (23, 45, 67) → Categories (Youth, Middle-aged, Senior).
Concept Hierarchy:
Replace specific values with broader categories.
Example: Replace “43” for age with "Middle-aged".
Techniques:
(a) Top-down Discretization:
Start with the entire range and split it into smaller intervals.
Example: Age range (0–100) → Split into (0–25, 26–50, 51–75, 76–100).

(b) Bottom-up Discretization:

Start with individual values and merge them into intervals.
Example: Merge ages 22, 23, 24 → (20–25).
(c) Binning:
Create bins based on specified ranges.
Example: Income ($20K, $40K, $60K) → Low, Medium, High.
(d) Histogram Analysis:
Partition data into intervals based on frequency or width.
Example: Grades (A, B, C, D) → Frequency bins based on number of students.
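
The difference between partitioning by width and by frequency can be sketched as below. Both functions return a bin index per value; the function names and the income figures (in thousands of dollars) are assumptions for illustration.

```python
def equal_width_bins(values, k):
    """Split the value range into k intervals of equal width and return
    the bin index (0 to k - 1) of each value."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    # The maximum sits on the last edge, so clamp it into bin k - 1.
    return [min(int((v - lo) / width), k - 1) for v in values]

def equal_frequency_bins(values, k):
    """Assign bin indices so that each bin holds roughly the same number of values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, i in enumerate(order):
        bins[i] = rank * k // len(values)
    return bins

incomes = [20, 22, 25, 40, 41, 60, 90, 95, 100]  # hypothetical incomes in $K
by_width = equal_width_bins(incomes, 3)
by_frequency = equal_frequency_bins(incomes, 3)
# by_width     == [0, 0, 0, 0, 0, 1, 2, 2, 2]
# by_frequency == [0, 0, 0, 1, 1, 1, 2, 2, 2]
```

Equal-width bins are simple but can be unbalanced when the data is skewed, as the five values in the first bin show; equal-frequency bins keep the bin sizes even at the cost of uneven interval widths.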
Applications of Data Reduction
Data Analysis: Reduces computational load by summarizing data.
Machine Learning: Focuses on the most relevant features for better predictions.
Data Storage: Saves storage space by compressing files or summarizing data.
Visualization: Simplifies complex data for easier interpretation.

Concept Hierarchies
Definition: Represent data at multiple levels of abstraction by grouping or
summarizing concepts.
Example: Low-level concepts: “TV,” “CD Player,” “VCR.”
High-level concept: “Home Entertainment.”

Use Case: Summarize sales data by high-level categories instead of individual
products.
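
A concept hierarchy can be represented as a simple mapping from low-level to high-level concepts, as in the sketch below. The mapping reuses the products from the example above; the sales quantities, the "Blender" product, and the "Other" fallback category are hypothetical.

```python
from collections import Counter

# One level of a concept hierarchy: product -> category.
PRODUCT_TO_CATEGORY = {
    "TV": "Home Entertainment",
    "CD Player": "Home Entertainment",
    "VCR": "Home Entertainment",
}

def roll_up(rows, hierarchy, default="Other"):
    """Replace each low-level item with its high-level concept and sum
    the quantities per concept."""
    totals = Counter()
    for item, quantity in rows:
        totals[hierarchy.get(item, default)] += quantity
    return dict(totals)

sales = [("TV", 3), ("VCR", 1), ("CD Player", 2), ("Blender", 4)]
category_sales = roll_up(sales, PRODUCT_TO_CATEGORY)
# category_sales == {"Home Entertainment": 6, "Other": 4}
```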

Data Mining Task Primitives:

Each user will have a data mining task in mind, that is, some form of data analysis
that he or she would like to have performed.
• A data mining task can be specified in the form of a data mining query, which is input to the data
mining system.
• A data mining query is defined in terms of data mining task primitives. These primitives allow the
user to interactively communicate with the data mining system during discovery in order to direct
the mining process, or examine the findings from different angles or depths.
Main Data Mining Primitives
Data mining primitives are the basic elements used to define and control the data mining process. They tell the
system what data to mine, what knowledge to find, and how to present the results.
1. Task-Relevant Data

This specifies the portions of the database or the set of data in which the user is interested. This
includes the database attributes or data warehouse dimensions of interest (referred to as the
relevant attributes or dimensions).
Includes database, tables, attributes, and conditions.
Example: Student data with marks > 60.

2. Kind of Knowledge to be Mined

This specifies the data mining functions to be performed, such as characterization,
discrimination, association or correlation analysis, classification, prediction, clustering, outlier
analysis, or evolution analysis.
Examples:
• Association
• Classification
• Clustering
• Prediction
• Characterization
3. Background Knowledge

This knowledge about the domain to be mined is useful for guiding the knowledge discovery
process and for evaluating the patterns found. Concept hierarchies are a popular form of background
knowledge, which allow data to be mined at multiple levels of abstraction. An example of a
concept hierarchy for the attribute (or dimension) age is shown in Figure
Includes concept hierarchies.
Example: City → State → Country.

4. Interestingness Measures

They may be used to guide the mining process or, after discovery, to evaluate the discovered
patterns. Different kinds of knowledge may have different interestingness measures. For example,
interestingness measures for association rules include support and confidence. Rules whose support
and confidence values are below user-specified thresholds are considered uninteresting.
Examples: Support, Confidence, Lift.
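
Support, confidence, and lift for an association rule can be computed directly from a list of transactions. The sketch below uses a tiny hypothetical basket dataset; the function names are illustrative.

```python
# Hypothetical market-basket transactions.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

def support(itemset, txns):
    """Fraction of transactions that contain every item in itemset."""
    return sum(itemset <= t for t in txns) / len(txns)

def confidence(antecedent, consequent, txns):
    """Of the transactions containing the antecedent, the fraction that
    also contain the consequent."""
    return support(antecedent | consequent, txns) / support(antecedent, txns)

def lift(antecedent, consequent, txns):
    """Confidence relative to the consequent's baseline support; values
    above 1 indicate a positive association."""
    return confidence(antecedent, consequent, txns) / support(consequent, txns)

s = support({"bread", "milk"}, transactions)       # 2 of 4 transactions = 0.5
c = confidence({"bread"}, {"milk"}, transactions)  # 0.5 / 0.75, about 0.667
l = lift({"bread"}, {"milk"}, transactions)        # about 0.889
```

A rule such as bread → milk would be reported only if `s` and `c` meet the user-specified minimum thresholds. In this toy data the lift is below 1, so buying bread makes milk slightly less likely than its baseline.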

5. Presentation of Discovered Knowledge

This refers to the form in which discovered patterns are to be displayed, which may include rules,
tables, charts, graphs, decision trees, and cubes.

Examples: Tables, graphs, charts, rules, trees.

Data Mining Architecture

• Data mining is the process of selecting, exploring, and modelling large amounts of data,
using automated or semi-automated methods, to discover previously unknown regularities
or relationships and turn them into clear and valuable findings for the database owner.
A data mining architecture describes the components of a system that carries out this
process and how they interact.
• The primary components of any data mining system are the Data source, data
warehouse server, data mining engine, pattern assessment module, graphical user
interface, and knowledge base.

Basic Working:
• When a user requests data mining queries, these requests are sent to data mining
engines for pattern analysis.
• These software applications use the existing database to try to discover a solution to the
query.
• The retrieved metadata is then transmitted to the data mining engine for suitable
processing, which may interact with pattern assessment modules to decide the
outcome.
• The result is finally delivered to the front end in a user-friendly format via an
appropriate interface.
Components Of Data Mining Architecture
• Data Sources
• Database Server
• Data Mining Engine
• Pattern Evaluation Modules
• Graphic User Interface
• Knowledge Base

Data Sources
• These sources provide the data in plain text, spreadsheets, or other media such as images or
videos. Data sources include databases, the World Wide Web (WWW), and data
warehouses.

Database Server
• The real data is stored on the database server and is ready to be processed. Its job is to
handle data retrieval in response to the user's request.

Data Mining Engine:

• It is one of the most important parts of the data mining architecture since it conducts many
data mining techniques such as association, classification, characterisation, clustering,
prediction, and so on.

Pattern Evaluation Modules:

• They are responsible for identifying interesting patterns in data and, on occasion,
interacting with database servers to provide the results of user queries.

Graphic User Interface:

• Because the user cannot completely comprehend the complexities of the data mining
process, a graphical user interface assists the user in efficiently communicating with the
data mining system.

Knowledge Base:
• The Knowledge Base is an essential component of the data mining engine that aids in the
search for outcome patterns. Occasionally, the knowledge base may also provide input to
the data mining engine. This knowledge base might include information gleaned from user
experiences. The knowledge base's goal is to improve the accuracy and reliability of the
outcome.
Languages for Data Mining

a. Data Mining Query Languages (DMQLs):

Concept: Historically, there have been proposals for dedicated Data Mining Query Languages
(DMQLs), inspired by SQL. The idea was to have a declarative language where users could
specify what they want to mine, rather than how to mine it.

Han, Kamber, and Pei's DMQL: A prominent conceptual DMQL defined by the authors (Han,
Kamber, and Pei) in their widely-used textbook specifies data mining tasks using "primitives" (as
discussed above). These primitives cover:

Task-relevant data: Specifying the data source and subset.

Kind of knowledge to be mined: The data mining functionality (e.g., classification, association).

Background knowledge: Concept hierarchies, domain knowledge.

Interestingness measures: Thresholds (e.g., minimum support/confidence for association rules).

Expected output format: How results should be presented (e.g., rules, tables, visualizations).

Status: While DMQLs were proposed, no single, universally adopted standard comparable to SQL for
relational databases exists for data mining. However, the principles of DMQL (specifying tasks using
primitives) still underlie how most data mining tools are used.

b. General-Purpose Programming Languages with Libraries:

In practice, much of modern data mining is done using general-purpose programming languages
coupled with powerful libraries. These offer flexibility, extensibility, and the ability to integrate with
other data science tasks.
Python:
Strengths: Highly popular, easy to learn, vast ecosystem of libraries.
Key Libraries:
Pandas: For data manipulation and analysis (data cleaning, integration, transformation).
NumPy: For numerical computing.
Scikit-learn: Comprehensive machine learning library for classification, regression, clustering,
dimensionality reduction, etc.
Matplotlib, Seaborn: For data visualization.
TensorFlow, PyTorch, Keras: For deep learning tasks, which are increasingly relevant in advanced
data mining.
NLTK, spaCy: For natural language processing (text mining).
R:
Strengths: Excellent for statistical analysis, data visualization, and specialized machine learning
tasks. Strong community and a rich collection of statistical packages.
Key Packages: dplyr, ggplot2, caret, e1071, randomForest, arules.
Scala (with Apache Spark):
Strengths: Designed for big data processing, integrates well with Spark's distributed computing
capabilities.
Key Libraries: Spark MLlib (Spark's machine learning library).
Java:
Strengths: Mature, robust, widely used in enterprise systems. Many early data mining tools
were built in Java (e.g., Weka, Apache Mahout).

SQL (for Data Preparation):

While not a data mining language itself, SQL is crucial for the data preprocessing stage, especially
for selecting, filtering, aggregating, and joining data in relational databases and data warehouses
before it's fed into mining algorithms.
c. Domain-Specific Languages/Tools:
Many commercial and open-source data mining tools provide their own visual interfaces or scripting
languages that abstract away the underlying code, making data mining accessible to business users.
Examples:
KNIME: Node-based visual workflow editor.
Orange: Visual programming tool for data analysis and machine learning.
RapidMiner: Integrated environment for predictive analytics, text mining, and
machine learning.
IBM SPSS Modeler: Drag-and-drop interface for building predictive models.
SAS Enterprise Miner: Comprehensive suite for statistical analysis and data mining.
Weka: Collection of machine learning algorithms for data mining tasks, with a GUI.
