DATA ANALYTICS
UE23CS342AA2
UNIT-1
Lecture 2, 3: Data Preprocessing –
Cleaning - Dealing with missing data
Gowri Srinivasa
Department of Computer Science and Engineering
Slides collated by:
Nishanth M S, PESU-2023, Department of CSE, PESU
nishanthmsathish.23@[Link]
Harshitha Srikanth, PESU-2024, PES University
harshithasrikanth13@[Link]
Karthik Namboori, VII Sem, PESU, Department of CSE
namkarthik2003@[Link]
Slides excerpted from: Data Mining : Concepts and Techniques by Han, Kamber and Pei, 3rd Edition
With grateful thanks for contribution of slides to:
Dr. Mamatha H R, Professor at the Department of CSE, PESU
Data Preprocessing
• Analysis can only be as good as the data it is run
on. Low-quality data leads to low-quality
analysis.
• Real-world databases are highly susceptible to
noisy, missing and inconsistent data owing to their
huge size and multiple heterogeneous sources.
• Data preprocessing techniques, applied before
analysis, can substantially improve the overall
quality of the analysis and/or reduce the time
required for the actual analysis.
Measures of Data Quality
1) Accuracy : Data must not contain errors or a lot of noise.
Example of inaccurate data : Date = 30/02/2002.
Reasons for inaccurate data :
• Data collection instruments may be faulty.
• Human errors occur during data entry.
• Disguised missing data : Users may purposefully submit incorrect data
values for mandatory fields when they don’t want to share their
personal information. Example : Choosing the default value of January
1st for date of birth.
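The two accuracy problems above (an impossible date and a disguised default date) can be screened for mechanically. The following is a minimal sketch, assuming dates are stored as dd/mm/yyyy strings; the function names and the sample records are hypothetical, and a January 1st birthday is only flagged as suspicious, since some people really are born on that day.

```python
from datetime import datetime

def is_valid_date(text, fmt="%d/%m/%Y"):
    """Return True if text is a real calendar date in the given format."""
    try:
        datetime.strptime(text, fmt)
        return True
    except ValueError:
        # Raised for impossible dates such as 30 February.
        return False

def looks_like_default_dob(text, fmt="%d/%m/%Y"):
    """Flag a valid date of birth that falls on January 1st (a common form default)."""
    if not is_valid_date(text, fmt):
        return False
    parsed = datetime.strptime(text, fmt)
    return parsed.day == 1 and parsed.month == 1

# Hypothetical sample values for a date-of-birth column.
records = ["30/02/2002", "14/07/1999", "01/01/1990"]
invalid = [d for d in records if not is_valid_date(d)]
suspicious = [d for d in records if looks_like_default_dob(d)]
```

Here `invalid` collects entries that cannot be dates at all, while `suspicious` collects entries that should be reviewed rather than rejected outright.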
Measures of Data Quality
2) Completeness : Data must not lack attribute values. It must contain
attributes of interest and relevance to the problem at hand.
Reasons for incompleteness :
• Attributes of interest were not considered important at the time of entry.
• Data might not be recorded due to equipment malfunction resulting in
missing data.
3) Consistency : Should not contain any discrepancies in the data or the naming
convention of the attributes.
Examples of inconsistency:
• Age is recorded as 50 but Date of Birth = 03/04/2005.
• In the result column of students’ marks, a few entries are in GPA format
and the rest are in percentage.
• Discrepancies can exist between duplicate records.
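A consistency rule such as "recorded age must agree with date of birth" can be written as a small check. This is only a sketch: the reference date, the one-year tolerance and the function names are assumptions made for illustration.

```python
from datetime import date

def age_on(dob, today):
    """Age in completed years on the reference date."""
    years = today.year - dob.year
    # Subtract one year if the birthday has not yet occurred this year.
    if (today.month, today.day) < (dob.month, dob.day):
        years -= 1
    return years

def is_age_consistent(recorded_age, dob, today, tolerance=1):
    """True if the recorded age is within `tolerance` years of the age implied by dob."""
    return abs(age_on(dob, today) - recorded_age) <= tolerance

# Assumed analysis date; the record is the slide's example (Age 50, DOB 03/04/2005).
reference_day = date(2025, 1, 1)
implied_age = age_on(date(2005, 4, 3), reference_day)
consistent = is_age_consistent(50, date(2005, 4, 3), reference_day)
```

With the assumed reference date, the implied age is far from the recorded age of 50, so the record is flagged as inconsistent.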
Measures of Data Quality
4) Timeliness : The data must be updated in a timely fashion. For example, for
an analysis run on the first day of every month, the previous month’s data must
be up to date for the analysis to be accurate.
5) Interpretability : The data must be easily understood. If the attributes of the
data aren’t easily understandable, the analysis is going to be hindered.
6) Believability : The data and its source must be trusted by the users. If this
data or the source caused problems in the past, current users will find it
hard to trust it.
NOTE : The quality of data is subjective and depends on the intended use of
the data. The data needs of each problem are different.
Major Tasks in Data Preprocessing
• Data Cleaning
• Data Integration
• Data Reduction
• Data Transformation
Data Cleaning
Data cleaning entails :
• Filling in missing values
• Smoothing noisy data
• Identifying and removing outliers
If users believe the data is dirty, they are
unlikely to trust the outcome of the analysis.
Missing Data
Data is not always complete. Missing data may be due to :
• Equipment malfunction
• Data that was inconsistent with other recorded data and was therefore deleted
• Data not recorded due to a misunderstanding
• Certain data not being considered useful at the time of entry
Missing data may need to be inferred.
Handling missing data
1. Ignore the tuple : Usually done when the class label is missing (for a
classification task). This is not effective when the percentage of missing
values per attribute varies considerably.
2. Fill in the missing value manually : Time consuming and infeasible for a large
data set.
3. Fill it with a global constant : Replace it with a global constant like the word
“Unknown”. The downside is that the model might learn patterns with
respect to the occurrence of the word “Unknown”.
4. Fill it with a central tendency : For symmetric data distributions, replace it
with the mean and for skewed data distributions, replace it with the median.
5. Fill it with a class-wise central tendency : A smarter way is to use the attribute
mean or median (based on the distribution) of all samples belonging to the
same class.
6. Use the most probable value : Use models like regression, decision trees or
inference based on a Bayesian formalism to infer the missing value.
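Strategies 1, 3, 4 and 5 above can be sketched in a few lines of plain Python. The data set, the attribute name (income) and the helper names are hypothetical; None stands for a missing value. The class-wise helper assumes every class has at least one observed value.

```python
from statistics import mean, median

# Hypothetical records: (class_label, income); None marks a missing value.
rows = [
    ("low", 20.0), ("low", None), ("low", 30.0),
    ("high", 90.0), ("high", 110.0), ("high", None),
    (None, 50.0),
]

# Strategy 1: ignore tuples whose class label is missing.
labelled = [r for r in rows if r[0] is not None]

def fill_constant(values, constant="Unknown"):
    """Strategy 3: replace missing entries with a global constant."""
    return [constant if v is None else v for v in values]

def impute_overall(records, stat=mean):
    """Strategy 4: fill with one central-tendency value for the whole attribute."""
    fill = stat([v for _, v in records if v is not None])
    return [(c, fill if v is None else v) for c, v in records]

def impute_by_class(records, stat=mean):
    """Strategy 5: fill with the statistic computed within the same class."""
    by_class = {}
    for c, v in records:
        if v is not None:
            by_class.setdefault(c, []).append(v)
    fills = {c: stat(vals) for c, vals in by_class.items()}
    return [(c, fills[c] if v is None else v) for c, v in records]

overall_filled = impute_overall(labelled)            # mean for symmetric data
median_filled = impute_overall(labelled, median)     # median for skewed data
class_filled = impute_by_class(labelled)
```

In this toy data the overall mean sits between the two classes, so it is a poor guess for either one, while the class-wise fill stays close to the observed values of each class.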
Types of Missing Values
1. Missing Completely At Random (MCAR)
• The missing data is independent of the observed and unobserved data. In other
words, no systematic differences exist between records with missing data and
those with complete data.
• For example : A weighing scale running out of batteries. This does not depend on
the person being weighed, and the probability of it happening is the same for everyone.
• Assuming the data is MCAR is a strong and often unrealistic assumption as “true
randomness” is rare in the real world.
• MCAR data doesn’t add bias to the analysis.
• Ways to deal with it :
▪ Delete the records : If the records with missing values are a small fraction of the data
▪ Delete the attributes : If the attributes with missing values are a small fraction of the attributes
▪ Mean imputation
▪ Pairwise deletion : Compute each statistic (mean, variance, covariance) from all
records in which the variables it involves are available.
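The difference between deleting records and pairwise deletion is easiest to see on a small table. The sketch below uses made-up measurements (height, weight, age) and hypothetical helper names; it computes a sample covariance using every record in which both variables of the pair are present.

```python
from statistics import mean

# Hypothetical records (height_cm, weight_kg, age); None marks a missing value.
data = [
    (150.0, 50.0, 20.0),
    (160.0, None, 30.0),
    (170.0, 70.0, None),
    (180.0, 80.0, 40.0),
]

def listwise(records):
    """Delete every record that has at least one missing value."""
    return [row for row in records if None not in row]

def available_pairs(records, i, j):
    """Keep the records in which both column i and column j are observed."""
    return [(r[i], r[j]) for r in records if r[i] is not None and r[j] is not None]

def pairwise_cov(records, i, j):
    """Sample covariance of columns i and j using all available pairs."""
    pairs = available_pairs(records, i, j)
    mean_i = mean([a for a, _ in pairs])
    mean_j = mean([b for _, b in pairs])
    return sum((a - mean_i) * (b - mean_j) for a, b in pairs) / (len(pairs) - 1)

complete = listwise(data)                      # only 2 of 4 records survive
cov_height_weight = pairwise_cov(data, 0, 1)   # uses 3 records
cov_height_age = pairwise_cov(data, 0, 2)      # uses a different set of 3 records
```

Pairwise deletion keeps more data per statistic, but each statistic is then based on a different subset of records, which is worth keeping in mind when the results are combined.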
Types of Missing Values
2. Missing At Random (MAR)
• MAR assumes that the probability of a value being missing depends only on the other
observed data, not on the missing value itself. Within groups defined by the observed
data, the missingness is still random.
• Example : Employed people are less likely to answer all questions of a survey when
compared to unemployed people. Data is MAR if the likelihood of completing the
survey depends on the employment status but not on the topic of the survey.
• Almost always produces a bias in the analysis if the missingness is ignored.
• MCAR implies MAR but the converse isn’t true.
• Ways to deal with it :
▪ Regression imputation : unbiased if it considers the factor which influences the
missingness.
▪ Last observation carried forward (LOCF) and Baseline observation carried
forward (BOCF) : Yield biased estimates. Must be used only if the underlying
assumptions are scientifically justifiable.
▪ Use of multiple imputation (packages mice and Amelia in R)
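Regression imputation and LOCF from the list above can be sketched as follows. This is an illustration under simplifying assumptions: a single observed predictor x that drives the missingness, an ordinary least-squares line fitted on the complete pairs, and hypothetical function names. Multiple imputation is not shown.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def regression_impute(xs, ys):
    """Fill missing y values with predictions from a line fitted on observed pairs."""
    observed = [(x, y) for x, y in zip(xs, ys) if y is not None]
    slope, intercept = fit_line([x for x, _ in observed], [y for _, y in observed])
    return [slope * x + intercept if y is None else y for x, y in zip(xs, ys)]

def locf(series):
    """Last observation carried forward; leading gaps stay missing."""
    filled, last = [], None
    for value in series:
        if value is not None:
            last = value
        filled.append(last)
    return filled

# Hypothetical data: x is always observed, y has one gap.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, None, 8.0, 10.0]
y_regression = regression_impute(x, y)
y_locf = locf(y)
```

On this trending series, LOCF repeats the previous value and so understates the missing point, which illustrates why it is described above as yielding biased estimates.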
Types of Missing Values
3. Missing Not At Random (MNAR)
• The missingness of the data depends on the value of the data itself. The mechanism
behind the missingness may be known, yet the values can’t be effectively inferred.
• Examples :
▪ Censored data
▪ People belonging to certain income brackets might not wish to disclose their
assets.
▪ A weighing machine can only measure weights in a particular range.
• Ways to deal with this :
▪ One must model the missingness explicitly, jointly modelling the response and
missingness.
▪ Generally, the data is assumed to be MAR whenever feasible to avoid this
situation.
NOTE : There is no statistical way to determine which category your missing data
falls under.
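The weighing-machine example shows why MNAR data biases an analysis that simply ignores the missing values. The sketch below uses made-up weights and an assumed upper limit for the scale; in practice the true values would of course not be available for comparison.

```python
from statistics import mean

# Hypothetical true weights in kg (unknown to the analyst in practice).
true_weights = [55.0, 62.0, 70.0, 85.0, 98.0, 110.0, 120.0]

# Assumed upper limit of the weighing machine.
SCALE_MAX = 100.0

# Readings above the limit are lost: missingness depends on the value itself.
observed = [w if w <= SCALE_MAX else None for w in true_weights]
present = [w for w in observed if w is not None]

true_mean = mean(true_weights)
observed_mean = mean(present)
bias = observed_mean - true_mean   # negative: the heaviest people are missing
```

Because only large values go missing, the mean of the observed readings underestimates the true mean, and no amount of extra data collected with the same scale removes that bias.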
Types of Missing Values-A Quick Glance
Missing Completely at Random, MCAR, means there is no relationship between the
missingness of the data and any values, observed or missing. Those missing data points are a
random subset of the data. There is nothing systematic going on that makes some data more
likely to be missing than others.
Missing at Random, MAR, means there is a systematic relationship between the propensity
of missing values and the observed data, but not the missing data.
Whether an observation is missing has nothing to do with the missing values, but it does
have to do with the values of an individual’s observed variables. So, for example, if men are
more likely to tell you their weight than women, weight is MAR.
Missing Not at Random, MNAR, means there is a relationship between the propensity of a
value to be missing and the value itself.
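One way to keep the three definitions apart is to write each mechanism as a rule giving the probability that a value goes missing. The sketch below does this for the weight example; the probabilities, the threshold of 80 and all names are illustrative assumptions, not estimates from any data.

```python
import random

# Hypothetical survey rows: sex is always observed, weight may go missing.
people = [
    {"sex": "M", "weight": 82.0},
    {"sex": "F", "weight": 58.0},
    {"sex": "F", "weight": 91.0},
    {"sex": "M", "weight": 67.0},
]

def p_missing_mcar(row):
    # Same probability for everyone: unrelated to any value.
    return 0.2

def p_missing_mar(row):
    # Depends only on an observed variable (sex), not on the weight itself.
    return 0.4 if row["sex"] == "F" else 0.1

def p_missing_mnar(row):
    # Depends on the value that would go missing (the weight).
    return 0.5 if row["weight"] > 80 else 0.1

def simulate_mask(rows, p_missing, seed=0):
    """Return True where the weight is dropped; seeded for repeatability."""
    rng = random.Random(seed)
    return [rng.random() < p_missing(row) for row in rows]

mar_probabilities = [p_missing_mar(r) for r in people]
mnar_probabilities = [p_missing_mnar(r) for r in people]
mcar_mask = simulate_mask(people, p_missing_mcar)
```

Only the inputs each rule reads differ: nothing for MCAR, an observed variable for MAR, and the possibly missing value itself for MNAR.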
An Interesting Thought
• Imagine you are collecting some information from your classmates. For many
reasons, not everyone will answer every question of yours. And that is okay!
• Well, the next step is replacing the missing values, right? We can use any one of
the methods we have discussed till now after some analysis of the data.
• But wait! Don’t you think the fact that they did not answer is itself some kind of
information which can be beneficial to our analysis?
• So the next time you build a model, before dealing with the missing values,
create an additional variable (preferably a binary variable) in which you
store whether the particular student answered or not.
• This may (or may not!) help you gain more insights about the population or
improve the analytics model you are building!
[Link]
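The indicator-variable idea can be sketched as follows. The names, the question (a score out of 10) and the choice of mean imputation for the fill value are all hypothetical; the point is only that the indicator is recorded before the gap is filled.

```python
from statistics import mean

# Hypothetical survey answers: (student, answer); None means no answer given.
answers = [("Asha", 7.5), ("Ravi", None), ("Meera", 6.0), ("John", None)]

def add_missing_indicator(records, fill_value):
    """Return (name, value, answered) triples, where answered is 1 or 0."""
    return [
        (name, fill_value if value is None else value, 0 if value is None else 1)
        for name, value in records
    ]

# Record who answered first, then fill the gaps (here with the observed mean).
observed_mean = mean([v for _, v in answers if v is not None])
with_indicator = add_missing_indicator(answers, observed_mean)
answered_flags = [row[2] for row in with_indicator]
```

The `answered` column can then be used as an extra feature, so a model can still distinguish imputed values from genuine answers.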
Test your understanding!
• Which of these is not a method to deal with noisy data?
a) Binning
b) Regression
c) Principal Component Analysis
d) Clustering
Solution
c) Principal Component Analysis
• Outliers need to be removed in every dataset, regardless of the problem
statement.
Solution
False
• Mean imputation can be done for which type of missing data?
Solution
MCAR
Test your understanding!
• The statement “Most of the people missing from work are the sickest people” denotes
what type of missingness?
MNAR
• Which type of missingness is called “non-ignorable”?
MNAR
Because the missing data mechanism itself has to be modelled as you deal with the
missing data. You have to include some model for why the data are missing and what the
likely values are.
References
• Data Mining : Concepts and Techniques by Han, Kamber and Pei, The
Morgan Kaufmann Series in Data Management Systems, 3rd Edition,
Chapter 3.1-3.2
• [Link]
• [Link]
• [Link]
THANK YOU
Dr. Gowri Srinivasa
Professor, Department of Computer Science and
Engineering, PES University, Bengaluru
Email: gsrinivasa@[Link]