CS06504
Data Mining
Lecture # 7
Data Preprocessing
(Ch # 3)
Data Preprocessing
Why preprocess the data?
Data cleaning
Data integration
Data reduction
Data Transformation and
Discretization
Summary
Why Data
Preprocessing?
Data in the real world is dirty
incomplete: lacking attribute values, lacking certain
attributes of interest, or containing only aggregate data
noisy: containing errors or outliers
inconsistent: containing discrepancies in codes or names
Data quality is a major concern in Data Mining and
Knowledge Discovery tasks.
Why: Almost all Data Mining algorithms induce
knowledge strictly from data.
No quality data, no quality mining results!
Quality decisions must be based on quality data
No quality data, inefficient mining process!
Complete, noise-free, and consistent data means faster
algorithms
The quality of knowledge extracted highly depends on
the quality of data
Effect of Noisy Data on Results Accuracy
Training data (one income value is missing):

age      income     student   buys_computer
<=30     high       yes       yes
<=30     high       no        yes
>40      medium     yes       no
>40      medium     no        no
>40      low        yes       yes
31…40    (missing)  no        yes
31…40    medium     yes       yes

Data Mining: discover only those rules which have support (frequency) >= 2
• If ‘age <= 30’ and income = ‘high’ then buys_computer = ‘yes’
• If ‘age > 40’ and income = ‘medium’ then buys_computer = ‘no’

Testing data or actual data:

age      income     student   buys_computer
<=30     high       no        ?
>40      medium     yes       ?
31…40    medium     yes       ?

Due to the missing value in the training dataset, the accuracy of prediction decreases and becomes “66.7%”.
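The example above can be sketched in a few lines of Python. This is only an illustration under assumptions that the slide does not state: rules are taken to have the form (age, income) -> buys_computer, a record with a missing antecedent value contributes to no rule's support, and the 66.7% figure is read as the fraction of test records for which some discovered rule fires. The name `mine_rules` is made up for this sketch.

```python
from collections import Counter

# Training data from the slide: (age, income, student, buys_computer).
# None marks the missing income value.
TRAIN = [
    ("<=30", "high", "yes", "yes"),
    ("<=30", "high", "no", "yes"),
    (">40", "medium", "yes", "no"),
    (">40", "medium", "no", "no"),
    (">40", "low", "yes", "yes"),
    ("31…40", None, "no", "yes"),
    ("31…40", "medium", "yes", "yes"),
]
# Test data: (age, income, student); the class label is unknown.
TEST = [
    ("<=30", "high", "no"),
    (">40", "medium", "yes"),
    ("31…40", "medium", "yes"),
]

def mine_rules(rows, min_support=2):
    """Return {(age, income): label} for combinations seen >= min_support times.

    Assumption: rows with a missing age or income are skipped entirely.
    """
    counts = Counter(
        (age, income, label)
        for age, income, _student, label in rows
        if age is not None and income is not None
    )
    return {(a, i): lab for (a, i, lab), n in counts.items() if n >= min_support}

rules = mine_rules(TRAIN)
covered = sum((age, income) in rules for age, income, _student in TEST)
coverage = covered / len(TEST)
print(rules)
print(f"{coverage:.1%}")
```

Under these assumptions only two rules reach the support threshold, so the third test record is matched by no rule. If the missing income were known to be `medium`, the combination (31…40, medium) would also reach support 2.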
Major Tasks in Data
Preprocessing
Data cleaning
Fill in missing values, smooth noisy data,
identify or remove outliers, and resolve
inconsistencies
Data integration
Integration of multiple databases, data cubes,
or files
Data reduction
Obtains reduced representation in volume but
produces the same or similar analytical results
Data transformation
Normalization and aggregation
Data discretization
Part of data reduction but with particular
Forms of data
preprocessing
Data Cleaning
Data cleaning tasks
Fill in missing values
Smooth noisy data
Correct inconsistent data
Missing Data
Data is not always available
E.g., many tuples have no recorded value for
several attributes, such as customer income in
sales data
Missing data may be due to
equipment malfunction
inconsistent with other recorded data and thus
deleted
data not entered due to misunderstanding
certain data may not be considered important at
the time of entry
history or changes of the data were not registered
Missing data may need to be inferred.
Methods of Treating Missing Data
Ignoring and discarding data:- There are two main ways to
discard data with missing values.
Discard all records that have missing data, also called
discard case analysis. Usually done when the class label is
missing.
Discard only those attributes that have a high level of
missing data.
Fill in the missing value manually: tedious + infeasible?
Use a global constant to fill in the missing value: e.g.,
“unknown”, a new class.
Imputation using Mean, Median or Mode:- One of the
most frequently used methods (statistical technique).
Use the attribute mean to fill in the missing value
Use the attribute mean for all samples belonging to the same
class to fill in the missing value: smarter
Replace missing values of numeric (continuous) attributes
using the mean or median (the median is robust against noise).
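The mean, median, mode and class-wise mean strategies listed above can be sketched with the Python standard library. The function names `impute` and `impute_by_class` and the sample income figures are hypothetical and chosen only for illustration.

```python
from statistics import mean, median, mode

def impute(values, strategy=mean):
    """Replace None entries with a statistic of the observed values."""
    observed = [v for v in values if v is not None]
    fill = strategy(observed)
    return [fill if v is None else v for v in values]

def impute_by_class(values, labels, strategy=mean):
    """Replace None entries with a statistic computed within the same class."""
    out = list(values)
    for cls in set(labels):
        idx = [i for i, lab in enumerate(labels) if lab == cls]
        observed = [values[i] for i in idx if values[i] is not None]
        if not observed:
            continue  # nothing observed for this class; leave its gaps as-is
        fill = strategy(observed)
        for i in idx:
            if out[i] is None:
                out[i] = fill
    return out

# Hypothetical incomes (in thousands) with one extreme value and two gaps.
income = [30, 32, None, 34, 200, None]
label = ["no", "no", "no", "yes", "yes", "yes"]

print(impute(income, mean))                  # attribute mean
print(impute(income, median))                # attribute median
print(impute_by_class(income, label, mean))  # mean within the same class
print(impute(["a", "b", None, "a"], mode))   # mode for a categorical attribute
```

In this made-up data the single extreme value 200 pulls the attribute mean up to 74, while the median stays at 33, which is the robustness the slide refers to.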
Methods of Treating Missing Data
Replace missing values using prediction/
classification model:-
Use the most probable value to fill in the missing
value: inference-based such as Bayesian formula or
decision tree
Advantage:- it considers the relationship among the known
attribute values and the missing values, so the
imputation accuracy can be very high.
Disadvantage:- if no correlation exists between the
missing attribute values and the known attribute values,
the imputation can’t be performed.
(Alternative approach):- Use a hybrid combination of
prediction/classification model and mean/median/mode.
• First try to impute the missing value using the
prediction/classification model; if that is not possible,
fall back to the mean, median or mode.
We will study more about this topic in Association
Methods of Treating Missing Data
K-Nearest Neighbor (k-NN) approach (Best
approach):-
k-NN imputes the missing attribute values on the
basis of the K nearest neighbors. Neighbors are
determined on the basis of a distance measure.
Once the K neighbors are determined, the missing value
is imputed by taking the mean, median or mode of
the neighbors’ known values for the missing attribute.
[Figure: the missing-value record shown among the other dataset records]
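A minimal sketch of k-NN imputation follows. It assumes numeric attributes on comparable scales, Euclidean distance, and neighbor records that are themselves complete; none of these choices is fixed by the slide. The function name `knn_impute` and the sample records are hypothetical.

```python
import math
from statistics import mean

def knn_impute(target, others, missing_idx, k=3):
    """Fill target[missing_idx] with the mean of that attribute over the
    k records in `others` closest to `target`.

    Distance is Euclidean over the attributes other than missing_idx.
    """
    known = [i for i in range(len(target)) if i != missing_idx]

    def dist(rec):
        return math.sqrt(sum((rec[i] - target[i]) ** 2 for i in known))

    neighbors = sorted(others, key=dist)[:k]
    filled = list(target)
    filled[missing_idx] = mean(rec[missing_idx] for rec in neighbors)
    return filled

# Hypothetical complete records: (height_cm, weight_kg, age)
records = [
    (170, 65, 30),
    (172, 68, 32),
    (168, 63, 28),
    (190, 95, 55),
    (150, 45, 12),
]
query = (171, 66, None)  # age is missing
print(knn_impute(query, records, missing_idx=2, k=3))
```

For a categorical attribute the call to `mean` would be replaced by a mode, and in practice the attributes are usually normalized first so that no single attribute dominates the distance.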
Imputation of Missing Data
(Basic)
Imputation denotes a procedure that replaces the
missing values in a dataset with plausible values,
i.e. values chosen by considering the relationships
among correlated attribute values in the dataset.
Attribute 1   Attribute 2   Attribute 3   Attribute 4
20            cool          high          false
(missing)     cool          high          true
20            cool          high          true
20            mild          low           false
30            cool          normal        false
10            mild          high          true

If we consider only {attribute#2}, the value “cool” appears in 3 other records.
Probability of Imputing value (20) = 66.7%
Probability of Imputing value (30) = 33.3%

Imputation of Missing Data
(Basic)
For {attribute#4} (same table), the value “true” appears in 2 other records.
Probability of Imputing value (20) = 50%
Probability of Imputing value (10) = 50%

For {attribute#2, attribute#3} (same table), the value {“cool”, “high”} appears in only 2 other records.
Probability of Imputing value (20) = 100%
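The three cases above amount to counting, among the complete records that agree with the incomplete record on the chosen attributes, how often each Attribute 1 value occurs. A short sketch, with the hypothetical helper name `candidate_probs`:

```python
from collections import Counter

# Table from the slides; None marks the missing Attribute 1 value.
ROWS = [
    (20, "cool", "high", "false"),
    (None, "cool", "high", "true"),
    (20, "cool", "high", "true"),
    (20, "mild", "low", "false"),
    (30, "cool", "normal", "false"),
    (10, "mild", "high", "true"),
]

def candidate_probs(rows, target_col, cond_cols):
    """For the first row whose target_col is None, return
    {candidate value: relative frequency} among complete rows that agree
    with it on every column in cond_cols. An empty dict means no row matches.
    """
    incomplete = next(r for r in rows if r[target_col] is None)
    matches = [
        r[target_col]
        for r in rows
        if r[target_col] is not None
        and all(r[c] == incomplete[c] for c in cond_cols)
    ]
    counts = Counter(matches)
    return {value: n / len(matches) for value, n in counts.items()}

print(candidate_probs(ROWS, 0, [1]))     # condition on Attribute 2
print(candidate_probs(ROWS, 0, [3]))     # condition on Attribute 4
print(candidate_probs(ROWS, 0, [1, 2]))  # condition on Attributes 2 and 3
```

Column indices are zero-based, so index 1 is Attribute 2. Conditioning on more attributes narrows the set of matching records, which is why the last case leaves 20 as the only candidate.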
Noisy Data
Noise: random error or variance in a
measured variable
Incorrect attribute values may be due
to
faulty data collection instruments
data entry problems
data transmission problems
technology limitation
inconsistency in naming convention
Other data problems that require
data cleaning
duplicate records
incomplete data
Removing Noise
Data Smoothing (rounding, averaging
within a window).
Data smoothing by Binning method:
• first sort data and partition into (equi-depth) bins
• then one can smooth by bin means, smooth by bin
median, smooth by bin boundaries, etc.
Smoothing by Regression
• smooth by fitting the data to regression functions
Clustering/merging and Detecting
outliers.
detect and remove outliers
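As a small illustration of "averaging within a window", the sketch below replaces each value with the mean of a centered window. The window size, the truncation at the ends of the sequence, and the sample readings are assumptions made for this example.

```python
def moving_average(values, window=3):
    """Smooth a sequence by replacing each value with the mean of the
    values in a centered window (the window is truncated at both ends)."""
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        lo, hi = max(0, i - half), min(len(values), i + half + 1)
        chunk = values[lo:hi]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed

# Hypothetical sensor readings with one spike.
readings = [10, 11, 10, 30, 11, 10, 12]
print(moving_average(readings, window=3))
```

The spike is spread over its neighbors rather than removed, so a wider window smooths more strongly but also blurs genuine changes in the data.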
Smoothing by Binning Method
Equal-width (distance) partitioning:
It divides the range into N intervals of equal size:
uniform grid
if A and B are the lowest and highest values of the
attribute, the width of the intervals will be: W = (B-A)/N,
where N is the number of bins.
The most straightforward approach
But outliers may dominate presentation
Skewed data is not handled well.
Equal-depth (frequency) partitioning:
It divides the range into N intervals, each containing
approximately the same number of samples
Good data scaling
Managing categorical attributes can be tricky.
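Both partitioning schemes can be sketched as follows, using the sorted price list from the lecture's worked example. The function names are hypothetical. The equal-width version treats intervals as closed on the right, which is an assumption: the slides do not state which bin a value lying exactly on a boundary belongs to. It also assumes the attribute has at least two distinct values, so that the width is not zero.

```python
import math

def equal_width_bins(values, n_bins):
    """Split values into n_bins intervals of width W = (B - A) / n_bins.

    Assumption: intervals are closed on the right, so a value equal to
    A + i*W goes to the lower of the two adjacent bins.
    """
    a, b = min(values), max(values)
    width = (b - a) / n_bins
    bins = [[] for _ in range(n_bins)]
    for v in sorted(values):
        idx = math.ceil((v - a) / width) - 1
        idx = min(max(idx, 0), n_bins - 1)  # clamp the minimum and rounding noise
        bins[idx].append(v)
    return bins

def equal_depth_bins(values, n_bins):
    """Split the sorted values into n_bins groups of (almost) equal size."""
    s = sorted(values)
    n = len(s)
    return [s[i * n // n_bins:(i + 1) * n // n_bins] for i in range(n_bins)]

prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
print(equal_width_bins(prices, 3))
print(equal_depth_bins(prices, 3))
```

With three bins the width is (34 - 4) / 3 = 10, so equal-width bins hold 3, 4 and 5 values, whereas equal-depth bins hold 4 values each.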
Binning Methods for Data
Smoothing
* Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26,
28, 29, 34
* Partition into (equi-width) bins with boundaries A+w, A+2w,… (here A = 4 and w = (34 - 4)/3 = 10, so the boundaries are 14 and 24):
- Bin 1: 4, 8, 9
- Bin 2: 15, 21, 21, 24
- Bin 3: 25, 26, 28, 29, 34
* Smoothing by bin means (rounded to the nearest integer):
- Bin 1: 7, 7, 7
- Bin 2: 20, 20, 20, 20
- Bin 3: 28, 28, 28, 28, 28
* Smoothing by bin boundaries:
- Bin 1: 4, 9, 9
- Bin 2: 15, 24, 24, 24
- Bin 3: 25, 25, 25, 25, 34
Binning Methods for Data
Smoothing
* Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26,
28, 29, 34
* Partition into (equi-depth) bins:
- Bin 1: 4, 8, 9, 15
- Bin 2: 21, 21, 24, 25
- Bin 3: 26, 28, 29, 34
* Smoothing by bin means (rounded to the nearest integer):
- Bin 1: 9, 9, 9, 9
- Bin 2: 23, 23, 23, 23
- Bin 3: 29, 29, 29, 29
* Smoothing by bin boundaries:
- Bin 1: 4, 4, 4, 15
- Bin 2: 21, 21, 25, 25
- Bin 3: 26, 26, 26, 34
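The two smoothing steps can be sketched as below, applied to the equi-depth bins listed above. The function names are hypothetical. Bin boundaries are taken to be the smallest and largest value in each bin, and a value exactly halfway between them goes to the lower boundary, which is an arbitrary choice made for this sketch.

```python
def smooth_by_means(bins):
    """Replace every value in a bin with the bin mean, rounded to an integer."""
    return [[round(sum(b) / len(b))] * len(b) for b in bins]

def smooth_by_boundaries(bins):
    """Replace every value with the closer of the bin's minimum and maximum.

    Assumption: ties go to the lower boundary.
    """
    out = []
    for b in bins:
        lo, hi = min(b), max(b)
        out.append([lo if v - lo <= hi - v else hi for v in b])
    return out

bins = [[4, 8, 9, 15], [21, 21, 24, 25], [26, 28, 29, 34]]
print(smooth_by_means(bins))
print(smooth_by_boundaries(bins))
```

Smoothing by bin medians would follow the same pattern, with `statistics.median` in place of the mean.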
Regression Method for smoothing the
data
Regression is a technique that conforms data values to a function. Linear
regression involves finding the “best” line to fit two attributes (or
variables) so that one attribute can be used to predict the other.
[Figure: the line y = x + 1 on x and y axes; at X1, the observed value Y1 is replaced by the fitted value Y1’ on the line]
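A minimal sketch of smoothing by linear regression: fit a least-squares line to two attributes and replace each observed y with the value on the line. The function names and the sample points (scattered around y = x + 1) are hypothetical, and ordinary least squares is assumed as the meaning of the “best” line.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx

def smooth(xs, ys):
    """Replace each observed y with the value predicted by the fitted line."""
    slope, intercept = fit_line(xs, ys)
    return [slope * x + intercept for x in xs]

# Hypothetical noisy observations scattered around y = x + 1.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 2.9, 4.2, 4.8, 6.0]
slope, intercept = fit_line(xs, ys)
print(slope, intercept)
print(smooth(xs, ys))
```

The fit requires at least two distinct x values; otherwise `sxx` is zero and the slope is undefined.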
Detecting Outliers (Clustering)
Outliers may be detected by clustering, where
similar values are organized into groups or
“clusters”.
Values that fall outside of the set of clusters
may be considered outliers.
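The idea can be illustrated with a deliberately simple stand-in for a clustering algorithm: sort one-dimensional values, start a new group whenever the gap to the previous value is too large, and report values in very small groups as outliers. This is not a full clustering method such as k-means. The names `gap_clusters` and `find_outliers`, the thresholds and the sample ages are all assumptions made for this sketch.

```python
def gap_clusters(values, max_gap):
    """Group sorted 1-D values: a new cluster starts whenever the gap to
    the previous value exceeds max_gap."""
    clusters = []
    for v in sorted(values):
        if clusters and v - clusters[-1][-1] <= max_gap:
            clusters[-1].append(v)
        else:
            clusters.append([v])
    return clusters

def find_outliers(values, max_gap, min_size=2):
    """Report values that sit in clusters smaller than min_size."""
    return [
        v
        for cluster in gap_clusters(values, max_gap)
        if len(cluster) < min_size
        for v in cluster
    ]

# Hypothetical ages with one implausible entry.
ages = [22, 23, 25, 24, 41, 43, 42, 120]
print(gap_clusters(ages, max_gap=3))
print(find_outliers(ages, max_gap=3))
```

Whether a flagged value is an error or a rare but valid observation still has to be judged from domain knowledge.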