Data Preprocessing Techniques in Analytics

The document discusses data preprocessing techniques, focusing on cleaning and handling missing data. It outlines measures of data quality, types of missing values (MCAR, MAR, MNAR), and methods for dealing with missing data. The importance of high-quality data for effective analysis is emphasized, along with various strategies for improving data completeness and accuracy.

DATA ANALYTICS

UE23CS342AA2
UNIT-1
Lecture 2, 3: Data Preprocessing –
Cleaning - Dealing with missing data

Gowri Srinivasa
Department of Computer Science and Engineering
Slides collated by:
Nishanth M S, PESU-2023, Department of CSE, PESU (nishanthmsathish.23@[Link])
Harshitha Srikanth, PESU-2024, PES University (harshithasrikanth13@[Link])
Karthik Namboori, VII Sem, PESU, Department of CSE (namkarthik2003@[Link])
Slides excerpted from: Data Mining: Concepts and Techniques by Han, Kamber and Pei, 3rd Edition
With grateful thanks for contribution of slides to:
Dr. Mamatha H R, Professor at the Department of CSE, PESU
Data Preprocessing
• Analysis can only be as good as the data itself. Low-quality data will lead to low-quality analysis.

• Real-world databases are highly susceptible to noisy, missing and inconsistent data owing to their huge size and multiple heterogeneous sources.

• Data preprocessing techniques, when applied before analysis, can substantially improve the overall quality of the analysis and/or reduce the time required for the actual analysis.
Measures of Data Quality

1) Accuracy: Data must not contain errors or excessive noise.

Example of inaccurate data: Date = 30/02/2002 (February never has 30 days).
Reasons for inaccurate data:

• Data collection instruments may be faulty.
• Human errors occur during data entry.
• Disguised missing data: Users may purposefully submit incorrect values for mandatory fields when they don’t want to share their personal information. Example: choosing the default value of January 1st for date of birth.
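
The two accuracy problems above (an impossible date, and a suspicious default date) can be checked mechanically. The following is a minimal sketch, assuming dates arrive as DD/MM/YYYY strings; the helper names `parse_date` and `looks_like_default` are hypothetical and not part of the slides. A date of 1 January is not necessarily wrong, so the second check only flags a record for review.

```python
from datetime import date, datetime


def parse_date(text):
    """Return a date for a DD/MM/YYYY string, or None if it is not a real calendar date."""
    try:
        return datetime.strptime(text, "%d/%m/%Y").date()
    except ValueError:
        # strptime rejects impossible dates such as 30 February
        return None


def looks_like_default(d):
    """Flag 1 January, a common default choice that may hide a missing value."""
    return d is not None and d.month == 1 and d.day == 1


# Illustrative inputs (assumed, not from the slides apart from 30/02/2002)
entries = ["30/02/2002", "01/01/1990", "17/08/1995"]
report = []
for text in entries:
    parsed = parse_date(text)
    report.append(
        {
            "raw": text,
            "valid": parsed is not None,
            "possible_disguised_missing": looks_like_default(parsed),
        }
    )

for row in report:
    print(row)
```

Flagged records would still need a human or a domain rule to decide whether the value is genuine.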
2) Completeness: Data must not lack attribute values. It must contain the attributes of interest and relevance to the problem at hand.
Reasons for incompleteness:
• Attributes of interest were not considered important at the time of entry.
• Data might not be recorded due to equipment malfunction, resulting in missing data.

3) Consistency: The data should not contain discrepancies, either in the values or in the naming convention of the attributes.
Examples of inconsistency:
• Age is recorded as 50 but Date of Birth = 03/04/2005.
• In the result column of students’ marks, a few entries are in GPA format and the rest are in percentage.
• Discrepancies can exist between duplicate records.
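
The age versus date-of-birth example above can be turned into a simple cross-field check. This is a sketch under stated assumptions: the reference date of 1 January 2025, the one-year tolerance and the function names are all hypothetical choices made for illustration, and the date of birth is read as 3 April 2005.

```python
from datetime import date


def age_on(dob, today):
    """Age in completed years on the given reference date."""
    years = today.year - dob.year
    # Subtract one year if the birthday has not yet occurred this year
    if (today.month, today.day) < (dob.month, dob.day):
        years -= 1
    return years


def is_age_consistent(recorded_age, dob, today, tolerance=1):
    """True if the recorded age is within `tolerance` years of the age implied by dob."""
    return abs(age_on(dob, today) - recorded_age) <= tolerance


reference = date(2025, 1, 1)   # assumed date on which the check is run
dob = date(2005, 4, 3)         # Date of Birth = 03/04/2005 from the example

print(age_on(dob, reference))                 # age implied by the date of birth
print(is_age_consistent(50, dob, reference))  # recorded Age = 50
```

A tolerance is used because a recorded age can lag the true age by up to a year if the record was entered before the last birthday.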
4) Timeliness: The data must be updated in a timely fashion. For example, for an analysis run on the first day of every month, the previous month’s data must be up to date for the analysis to be accurate.

5) Interpretability: The data must be easily understood. If the attributes of the data aren’t easily understandable, the analysis will be hindered.

6) Believability: The data and its source must be trusted by the users. If the data or its source caused problems in the past, current users will find it hard to trust it.

NOTE: The quality of data is subjective and depends on the intended use of the data. The data needs of each problem are different.
Major Tasks in Data Preprocessing

• Data Cleaning
• Data Integration
• Data Reduction
• Data Transformation
Data Cleaning

Data cleaning entails:

• Filling in missing values
• Smoothing noisy data
• Identifying and removing outliers

If users believe the data is dirty, they are unlikely to trust the outcome of the analysis.
Missing Data

Data is not always complete. Missing data may be due to:

• Equipment malfunction
• Data that was inconsistent with other recorded data and was therefore deleted
• Data not recorded due to a misunderstanding
• Certain data not being considered useful at the time of entry

Missing data may need to be inferred.

Handling missing data
1. Ignore the tuple: Usually done when the class label is missing (for a classification task). This is not effective when the percentage of missing values per attribute varies considerably, because ignoring a tuple also discards its remaining attribute values.
2. Fill in the missing value manually: Time-consuming and infeasible for a large data set.
3. Fill it with a global constant: Replace every missing value with the same constant, such as the word “Unknown”. The downside is that the model might treat “Unknown” as a meaningful value and learn patterns from its occurrence.
4. Fill it with a measure of central tendency: For symmetric data distributions, replace it with the mean; for skewed data distributions, replace it with the median.
5. Fill it with the class-wise central tendency: A smarter way is to use the attribute mean or median (based on the distribution) of all samples belonging to the same class as the tuple with the missing value.
6. Use the most probable value: Use models such as regression, decision trees or inference-based Bayesian formalism to infer the missing value.
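
Strategies 4 and 5 above can be sketched in a few lines of standard-library Python. This is an illustration only: missing values are assumed to be encoded as `None`, and the function names and toy data are hypothetical.

```python
from statistics import mean, median


def impute_central(values, use_median=False):
    """Strategy 4: fill None with the overall mean (or the median for skewed data)."""
    observed = [v for v in values if v is not None]
    fill = median(observed) if use_median else mean(observed)
    return [fill if v is None else v for v in values]


def impute_by_class(values, labels, use_median=False):
    """Strategy 5: fill None with the mean/median of samples from the same class.

    Assumes every class has at least one observed value.
    """
    groups = {}
    for v, c in zip(values, labels):
        if v is not None:
            groups.setdefault(c, []).append(v)
    stat = median if use_median else mean
    fills = {c: stat(vs) for c, vs in groups.items()}
    return [fills[c] if v is None else v for v, c in zip(values, labels)]


# Toy data (assumed): one numeric attribute and a class label per record
incomes = [30, None, 34, 90, None, 110]
classes = ["low", "low", "low", "high", "high", "high"]

overall = impute_central(incomes)
per_class = impute_by_class(incomes, classes)
print(overall)
print(per_class)
```

In this toy data the overall mean sits between the two classes, so it is a poor guess for either one, which is the motivation for the class-wise variant.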
Types of Missing Values
1. Missing Completely At Random (MCAR)
• The missingness is independent of both the observed and the unobserved data. In other words, no systematic differences exist between records with missing data and those with complete data.
• For example: A weighing scale running out of batteries. This does not depend on the person being weighed, and the probability of it happening is the same for everyone.
• Assuming the data is MCAR is a strong and often unrealistic assumption, as “true randomness” is rare in the real world.
• MCAR data doesn’t add bias to the analysis.
• Ways to deal with it:
▪ Delete the records: if only a small fraction of the records is affected
▪ Delete the attributes: if only a small fraction of the attributes is affected
▪ Mean imputation
▪ Pairwise deletion: compute each statistic (mean, variance, covariance) from all the records in which the variables it needs are available.
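
The difference between deleting whole records and pairwise deletion can be made concrete with a small sketch. The data, the `None` encoding for missing values and the function names are assumptions made for illustration.

```python
from statistics import mean

# Toy records (assumed); None marks a missing value
rows = [
    {"height": 170.0, "weight": 65.0},
    {"height": 160.0, "weight": None},
    {"height": None, "weight": 80.0},
    {"height": 180.0, "weight": 75.0},
]


def listwise(rows):
    """Delete every record that has at least one missing value."""
    return [r for r in rows if all(v is not None for v in r.values())]


def pairwise_mean(rows, col):
    """Mean of one variable over all records where that variable is present."""
    return mean(r[col] for r in rows if r[col] is not None)


def pairwise_cov(rows, a, b):
    """Sample covariance over the records where both variables are present."""
    pairs = [(r[a], r[b]) for r in rows if r[a] is not None and r[b] is not None]
    mean_a = mean(p[0] for p in pairs)
    mean_b = mean(p[1] for p in pairs)
    return sum((x - mean_a) * (y - mean_b) for x, y in pairs) / (len(pairs) - 1)


complete = listwise(rows)
height_mean = pairwise_mean(rows, "height")
cov_hw = pairwise_cov(rows, "height", "weight")
print(len(complete), height_mean, cov_hw)
```

Record deletion keeps only two of the four records, whereas the pairwise mean of height uses three. The trade-off is that different statistics end up being computed on different subsets of the data.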
2. Missing At Random (MAR)
• MAR assumes that the probability of a value being missing depends only on other observed data, not on the missing value itself. Among records with the same observed values, the missingness is still random.
• Example: Employed people are less likely to answer all questions of a survey when compared to unemployed people. Data is MAR if the likelihood of completing the survey depends on the employment status but not on the topic of the survey.
• Can produce a bias in the analysis if the missingness is simply ignored, for example by deleting the incomplete records.
• MCAR implies MAR but the converse isn’t true.
• Ways to deal with it:
▪ Regression imputation: unbiased if it considers the factor which influences the missingness.
▪ Last observation carried forward (LOCF) and baseline observation carried forward (BOCF): These yield biased estimates in general. They must be used only if the underlying assumptions are scientifically justifiable.
▪ Multiple imputation (packages mice and Amelia in R)
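
Regression imputation, listed above, can be sketched with a single predictor and ordinary least squares. This is a minimal illustration rather than a recommended implementation: the data, the `None` encoding and the function names are assumptions, and real use would typically involve several predictors and a dedicated library.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def regression_impute(xs, ys):
    """Fill missing y values (None) with predictions from the fully observed pairs."""
    observed = [(x, y) for x, y in zip(xs, ys) if y is not None]
    slope, intercept = fit_line([x for x, _ in observed], [y for _, y in observed])
    return [slope * x + intercept if y is None else y for x, y in zip(xs, ys)]


# Toy data (assumed): years of experience is always observed, salary sometimes missing
years = [1, 2, 3, 4, 5]
salary = [30.0, 40.0, None, 60.0, 70.0]

filled = regression_impute(years, salary)
print(filled)
```

The predictor here plays the role of the observed factor that drives the missingness. If that factor were left out of the model, the imputed values could still be biased.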
3. Missing Not At Random (MNAR)
• The missingness of the data depends on the value of the data itself. Even when the reason why the data is missing is known, the values can’t be effectively inferred from the observed data alone.
• Examples:
▪ Censored data
▪ People belonging to certain income brackets might not wish to disclose their assets.
▪ A weighing machine can only measure weights in a particular range.
• Ways to deal with this:
▪ One must model the missingness explicitly, jointly modelling the response and the missingness.
▪ Generally, the data is assumed to be MAR whenever feasible to avoid this situation.
NOTE: The observed data alone cannot establish which category your missing data falls under. In particular, MAR cannot be distinguished from MNAR without additional assumptions or information.
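
The three mechanisms described above are often written compactly in probabilistic notation. The symbols below are introduced here for illustration and do not appear in the slides: R is the indicator of which entries are missing, Y_obs the observed values and Y_mis the values that are missing.

```latex
\begin{align*}
\text{MCAR:}\quad & P(R \mid Y_{\text{obs}}, Y_{\text{mis}}) = P(R) \\
\text{MAR:}\quad  & P(R \mid Y_{\text{obs}}, Y_{\text{mis}}) = P(R \mid Y_{\text{obs}}) \\
\text{MNAR:}\quad & P(R \mid Y_{\text{obs}}, Y_{\text{mis}}) \text{ depends on } Y_{\text{mis}}
\end{align*}
```

Read this way, MCAR is the special case of MAR in which the missingness does not depend on the observed values either, which is why MCAR implies MAR but not the other way round.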
Types of Missing Values-A Quick Glance

Missing Completely at Random, MCAR, means there is no relationship between the missingness of the data and any values, observed or missing. Those missing data points are a random subset of the data. There is nothing systematic going on that makes some data more likely to be missing than others.

Missing at Random, MAR, means there is a systematic relationship between the propensity of missing values and the observed data, but not the missing data. Whether an observation is missing has nothing to do with the missing values, but it does have to do with the values of an individual’s observed variables. So, for example, if men are more likely to tell you their weight than women, weight is MAR.

Missing Not at Random, MNAR, means there is a relationship between the propensity of a value to be missing and its values.
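
A small simulation can illustrate why the mechanism matters. The sketch below generates synthetic weights and removes values in two ways: completely at random, and only above a cut-off (mimicking a weighing machine with a limited range). All numbers, including the 30% missing rate, the 85 kg limit and the random seed, are assumptions chosen for illustration.

```python
import random
from statistics import mean

rng = random.Random(0)  # fixed seed so the run is repeatable

# Synthetic "true" weights in kg (assumed distribution)
weights = [rng.gauss(70, 15) for _ in range(10_000)]

# MCAR: each value goes missing with the same probability, whatever its size
mcar_observed = [w for w in weights if rng.random() > 0.3]

# MNAR: the scale cannot record anything above 85 kg, so missingness
# depends on the very value that is missing
mnar_observed = [w for w in weights if w <= 85]

true_mean = mean(weights)
mcar_mean = mean(mcar_observed)
mnar_mean = mean(mnar_observed)

print(f"true mean      : {true_mean:.2f}")
print(f"mean under MCAR: {mcar_mean:.2f}")
print(f"mean under MNAR: {mnar_mean:.2f}")
```

Under MCAR the mean of the remaining values stays close to the true mean, while under MNAR it is pulled downward, because the heaviest people are exactly the ones whose values are lost.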
An Interesting Thought

• Imagine you are collecting some information from your classmates. For many reasons, not everyone will answer every question. And that is okay!
• The next step is replacing the missing values, right? We can use any of the methods discussed so far, after some analysis of the data.
• But wait! Don’t you think the fact that they did not answer is itself information that could benefit our analysis?
• So the next time you build a model, before dealing with the missing values, create an additional variable (preferably a binary variable) that stores whether the particular student answered or not.
• This may (or may not!) help you gain more insight into the population or improve the analytics model you are building!
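
The idea of recording missingness as its own variable can be sketched as follows. The record layout, the field name `hours_studied` and the helper name are hypothetical. The indicator is created first, so the information survives whatever imputation is applied afterwards.

```python
def add_missing_indicator(records, field):
    """Return copies of the records with a binary '<field>_answered' column.

    1 means the student answered the question, 0 means the value is missing (None).
    """
    result = []
    for record in records:
        row = dict(record)  # copy so the input is left unchanged
        row[f"{field}_answered"] = 0 if record.get(field) is None else 1
        result.append(row)
    return result


# Toy survey responses (assumed)
survey = [
    {"student": "A", "hours_studied": 12},
    {"student": "B", "hours_studied": None},
    {"student": "C", "hours_studied": 7},
]

flagged = add_missing_indicator(survey, "hours_studied")
for row in flagged:
    print(row)
```

After this step the missing values can be imputed with any of the methods discussed, and the model can still use `hours_studied_answered` as a feature.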

[Link]
Test your understanding!
• Which of these is not a method to deal with noisy data?
a) Binning
b) Regression
c) Principal Component Analysis
d) Clustering
Solution
c) Principal Component Analysis
• Outliers need to be removed in every dataset, regardless of the problem statement.
Solution
False
• Mean imputation can be done for which type of missing data?
Solution
MCAR

• The statement “Most of the people missing from work are the sickest people” denotes what type of missingness?
Solution
MNAR

• Which type of missingness is called “non-ignorable”?
Solution
MNAR
Because the missing data mechanism itself has to be modelled as you deal with the missing data. You have to include some model for why the data are missing and what the likely values are.
References

• Data Mining: Concepts and Techniques by Han, Kamber and Pei, The Morgan Kaufmann Series in Data Management Systems, 3rd Edition, Chapters 3.1-3.2
• [Link]
• [Link]
• [Link]
THANK YOU
Dr. Gowri Srinivasa
Professor, Department of Computer Science and
Engineering, PES University, Bengaluru
Email: gsrinivasa@[Link]
