Data Exploration and Missing Data Analysis

The document outlines the process of exploratory data analysis, emphasizing the importance of understanding data structures, completeness, and relationships. It discusses various datasets, such as the Boston House Prices and Iris datasets, and highlights the significance of handling missing data through imputation methods. The document also categorizes missing data into three classes: MCAR, MAR, and NMAR, and warns against simply discarding rows with missing values due to potential biases and inconsistencies.

Uploaded by

Pratham Choubey

Learning Objectives

● Exploratory Data Analysis
● Hands-on Code
● Understanding missing data

Introduction

Where are we?

[Figure: data science workflow; stage labels: Problem Formulation, Data Collection & Processing, Exploratory Data Analysis, Data Modeling, Insight/Prediction, Presentation]

● Raw data collection & pre-processing
● Data collection and preprocessing take up ~80% of the time

● Raw data preprocessing tools
● Data query language (SQL) for searching and updating a relational DBMS
● Storing semi-structured data in XML and JSON formats

Introduction: Real world scenario

[Figure: collected raw data, split into structured (table; RDBMS; SQL), semi-structured (XML, JSON; NRDBMS; NoSQL), and unstructured data; preprocessing; exploratory data analysis. Small public datasets: CSV; R/Python.]

Data Exploration

● The goal of data exploration is to learn about the data.
● The data scientist wants to know the basic characteristics of the data, e.g.,
○ the structure,
○ the size,
○ the completeness (or rather where data is missing), and
○ the relationships between different parts of the data.
● The exploration is usually a semi-automated, interactive process in which data scientists use many different tools to consider different aspects of the data.
● These tools allow the data scientist to inspect raw or preprocessed data, e.g., comma-separated values (CSV) files.
● In this course we will use tools available in Python:
○ Statistical measures
○ Visualizations

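A first look at structure, size, completeness, and relationships might be sketched as follows. This is a minimal illustration that assumes pandas is installed; the CSV text, column names, and values are hypothetical and are not taken from any dataset in these slides.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for a real file on disk.
CSV_TEXT = """city,rooms,price
Boston,5,24.0
Boston,6,
Cambridge,4,21.6
Cambridge,,34.7
Newton,7,33.4
"""

df = pd.read_csv(io.StringIO(CSV_TEXT))

# Structure: column names and inferred types.
print(df.dtypes)

# Size: (number of rows, number of columns).
print(df.shape)

# Completeness: count of missing values per column.
missing_per_column = df.isna().sum()
print(missing_per_column)

# Statistical measures for the numeric columns (missing values are skipped).
print(df.describe())

# Relationships: pairwise correlation between the numeric columns.
correlations = df[["rooms", "price"]].corr()
print(correlations)
```

With a real file, `pd.read_csv` would take the file path instead of the in-memory buffer used here.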
Examples

Boston House Prices Dataset

This dataset contains information collected by the U.S. Census Service concerning housing in the area of Boston, Massachusetts (1978).

We will focus on exploring two of its variables, MEDV and CRIM.

Examples: Boston House Prices / MEDV

[Figure: histogram of MEDV]
[Figure: density plot of MEDV]
[Figure: density plot of MEDV with rug]
[Figure: density plot and histogram of MEDV with rug]

Examples: Boston House Prices / CRIM

[Figure: density plot and histogram of CRIM with rug]

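What the histogram, density, and rug views compute can be sketched with NumPy alone. The values below are synthetic and only stand in for a positive, right-skewed variable; they are not the Boston data. The bandwidth uses the normal-reference rule of thumb, which is one common default and an assumption of this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-in for a positive, right-skewed variable; NOT the Boston data.
values = rng.gamma(shape=6.0, scale=3.7, size=500)

# Histogram: count observations per bin (density=True scales the total area to 1).
heights, edges = np.histogram(values, bins=20, density=True)


def gaussian_kde(samples, grid, bandwidth):
    """Kernel density estimate: the average of Gaussian bumps centered on each sample."""
    z = (grid[:, None] - samples[None, :]) / bandwidth
    bumps = np.exp(-0.5 * z**2) / np.sqrt(2.0 * np.pi)
    return bumps.mean(axis=1) / bandwidth


# Normal-reference rule of thumb for the bandwidth (an assumed default).
bandwidth = 1.06 * values.std(ddof=1) * len(values) ** (-1 / 5)
grid = np.linspace(values.min() - 3 * bandwidth, values.max() + 3 * bandwidth, 400)
density = gaussian_kde(values, grid, bandwidth)

# Rug: one tick per observation, i.e. simply the sorted raw values.
rug = np.sort(values)
```

Plotting libraries such as seaborn (`histplot`, `kdeplot`, `rugplot`) draw these three views directly; the sketch only shows the quantities behind them.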
Examples

Iris Dataset

The Iris flower data set is a multivariate data set introduced by the British statistician and biologist Ronald Fisher in his 1936 paper.

The data set consists of 50 samples from each of three species of Iris (Iris setosa, Iris virginica, and Iris versicolor). Four features were measured from each sample: the length and the width of the sepals and petals, in centimeters.

Examples: IRIS

[Figure: box plot]

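The numbers a box plot draws can be computed directly. The sketch below uses hypothetical measurements (not actual Iris rows) and assumes the common convention of Tukey's 1.5 × IQR rule for the whiskers; plotting libraries may use slightly different quantile definitions.

```python
import numpy as np

# Hypothetical measurements in centimeters, chosen for illustration only.
sepal_width = np.array([2.0, 2.8, 2.9, 3.0, 3.0, 3.1, 3.2, 3.4, 3.5, 4.4])

# The box spans the first to third quartile, with a line at the median.
q1, median, q3 = np.percentile(sepal_width, [25, 50, 75])
iqr = q3 - q1

# Tukey's rule: points beyond 1.5 * IQR from the box are drawn as outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
is_inside = (sepal_width >= lower_fence) & (sepal_width <= upper_fence)
outliers = sepal_width[~is_inside]

# Whiskers extend to the most extreme points that are still inside the fences.
whisker_low = sepal_width[is_inside].min()
whisker_high = sepal_width[is_inside].max()

print("box:", q1, median, q3)
print("whiskers:", whisker_low, whisker_high)
print("outliers:", outliers)
```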
Examples: Trend

Air Passengers Dataset

The classic Box & Jenkins airline data: monthly totals of international airline passengers, 1949 to 1960.

[Figure: histogram]
[Figure: scatter plot]
[Figure: line plot]

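A trend in a monthly series can also be exposed numerically. The sketch below assumes pandas and uses the first 24 months of the airline passenger series (in thousands), typed in by hand for illustration; verify the values against your own copy of the data before relying on them.

```python
import pandas as pd

# First 24 months of the airline passenger series (thousands of passengers),
# entered by hand for this sketch.
passengers = [
    112, 118, 132, 129, 121, 135, 148, 148, 136, 119, 104, 118,  # 1949
    115, 126, 141, 135, 125, 149, 170, 170, 158, 133, 114, 140,  # 1950
]
months = pd.date_range("1949-01-01", periods=len(passengers), freq="MS")
series = pd.Series(passengers, index=months, name="passengers")

# A 12-month rolling mean averages out the within-year seasonality,
# leaving the slower movement that a line plot shows as the trend.
# The first 11 entries are NaN because the window is not yet full.
trend = series.rolling(window=12).mean()

# Yearly averages give a coarser view of the same trend.
yearly_mean = series.groupby(series.index.year).mean()
print(yearly_mean)
```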
Missing Data

[Figure: example of missing data]

● Any occurrence where data for a variable has not been recorded for some observation is considered missing from that observation.

Can’t we just drop the rows with missing data?

Not a good idea. Why?

It is wasteful.

● We may end up discarding a large portion of the data.
● A relatively small amount of missing data can have a big impact, because a single missing value discards the entire row.

[Figure: missing data versus discarded data]

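How a small share of missing cells turns into a large share of discarded rows can be sketched with pandas. The table below is hypothetical: 100 rows, 10 columns, and each column missing 5 values in rows that do not overlap with the other columns.

```python
import numpy as np
import pandas as pd

n_rows, n_cols = 100, 10
df = pd.DataFrame(
    np.ones((n_rows, n_cols)),
    columns=[f"x{j}" for j in range(n_cols)],
)

# Hypothetical pattern: every column is missing 5 values, each in different rows.
for j in range(n_cols):
    df.iloc[5 * j : 5 * j + 5, j] = np.nan

missing_cells = df.isna().to_numpy().mean()  # fraction of all cells that are missing
kept_rows = len(df.dropna()) / n_rows        # fraction of rows that survive dropna()

print(f"missing cells: {missing_cells:.0%}, rows kept: {kept_rows:.0%}")
# prints "missing cells: 5%, rows kept: 50%"
```

In this constructed case only 5% of the cells are missing, yet dropping incomplete rows discards half of the data.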
It creates inconsistency.

● It is difficult to compare models that do not use the same variables, because each model ends up fitted on a different subset of rows.

It may create bias.

● Suppose each row represents a country and one of the features is GDP. Poor countries may not report GDP, so their GDP shows up as missing. Dropping rows with missing values then removes those poor countries, and the data becomes biased toward the rich countries!

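The GDP example can be made concrete with a few invented numbers. Everything in this sketch is hypothetical: the countries, the GDP figures, and the assumption that countries below a cutoff of 5.0 do not report.

```python
import pandas as pd

# Hypothetical GDP-per-capita figures (thousands of USD), for illustration only.
countries = pd.DataFrame({
    "country": ["A", "B", "C", "D", "E", "F"],
    "gdp_true": [60.0, 45.0, 30.0, 4.0, 2.0, 1.0],
})

# Assumption of this sketch: countries below 5.0 do not report their GDP.
countries["gdp_reported"] = countries["gdp_true"].where(countries["gdp_true"] >= 5.0)

# Mean over all countries versus mean after dropping rows with missing GDP.
true_mean = countries["gdp_true"].mean()
complete_rows = countries.dropna(subset=["gdp_reported"])
mean_after_drop = complete_rows["gdp_reported"].mean()

print(f"true mean: {true_mean:.1f}, mean after dropping rows: {mean_after_drop:.1f}")
```

Dropping the rows leaves only the three richest countries, so the average after dropping is far above the true average.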
Missing Data
Missing data comes in three classes
● MCAR: Missing Completely At Random
● MAR: Missing At Random
● NMAR: Not Missing At Random


MCAR: Missing Completely At Random

● ‘Some of the data will be missing simply because of bad luck.’
● ‘This effectively implies that causes of the missing data are unrelated to the data.’
● Imagine tracking the number of cars at an intersection over time using a webcam. The Wi-Fi on your laptop fails occasionally, and you cannot record cars during the outage. The fact that the counts are missing has nothing to do with the cars, so the missing car counts are MCAR.

MAR: Missing At Random

● If the chance that a value is missing can be determined entirely by other observed variables in the dataset, then the data is missing at random.
● Say the webcam is known to shut down every night from 1am to 5am to save power. These missing car counts are MAR, because the recorded time of day fully explains which counts are missing.

NMAR: Not Missing At Random (also written MNAR, Missing Not At Random)

● If data is NMAR, the chance that a value of a given variable is missing depends on data that is itself missing.
● People who do not live in permanent homes are much more likely to have missing data in a census because they are less likely to be found by pollsters.

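The three classes can be simulated on the car-counting example to see how each affects a simple average. This is a sketch under invented assumptions: the traffic model (about 5 cars per interval at night, 40 during the day), the 10% outage rate, and the NMAR mechanism in which busy intervals are more likely to be lost are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 20_000
hour = rng.integers(0, 24, size=n)  # fully observed variable
is_night = (hour >= 1) & (hour < 5)

# Hypothetical traffic model: fewer cars at night than during the day.
cars = rng.poisson(np.where(is_night, 5, 40)).astype(float)

# MCAR: every value has the same 10% chance of being lost (the Wi-Fi outage).
mcar = np.where(rng.random(n) < 0.10, np.nan, cars)

# MAR: missingness is fully explained by the observed hour (camera off 1am to 5am).
mar = np.where(is_night, np.nan, cars)

# NMAR: the chance of a missing value depends on the unobserved count itself;
# as an assumed mechanism, busy intervals are more likely to be lost.
nmar = np.where(rng.random(n) < np.clip(cars / 80.0, 0.0, 1.0), np.nan, cars)

true_mean = cars.mean()
for name, observed in [("MCAR", mcar), ("MAR", mar), ("NMAR", nmar)]:
    print(f"{name}: observed mean {np.nanmean(observed):.2f} vs true mean {true_mean:.2f}")
```

Under MCAR the observed mean stays close to the true mean. Under MAR the naive overall mean is off, but the cause is visible in the recorded hour and can be accounted for. Under NMAR the mean is off and nothing in the observed data reveals why.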
Missing Data

Imputation is the act of filling in missing data.
● Missing data can be filled with predefined values (e.g. 0).
● It can also be filled with predictions of what the values should be.

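Both kinds of imputation can be sketched with pandas. The car counts below are hypothetical, and the two "prediction" strategies shown (the mean of the observed values and linear interpolation between neighbors) are simple stand-ins chosen for this sketch, not the only options.

```python
import numpy as np
import pandas as pd

# Hypothetical hourly car counts with two missing readings.
counts = pd.Series([12.0, 15.0, np.nan, 21.0, np.nan, 30.0], name="cars")

# Option 1: fill with a predefined value.
filled_zero = counts.fillna(0)

# Option 2: fill with a simple prediction, here the mean of the observed values.
filled_mean = counts.fillna(counts.mean())

# Option 3: predict from the neighbors by linear interpolation between known points.
filled_interp = counts.interpolate(method="linear")

print(pd.DataFrame({"zero": filled_zero, "mean": filled_mean, "interp": filled_interp}))
```

Filling with 0 is only sensible when 0 is a plausible value; here it would suggest empty roads during hours that were simply not recorded.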
● Typically, imputation is considered when less than 20% of the data is missing. The quality of the imputation depends on both the proportion of data that is missing and the pattern, if any, of the missingness.
● Imputation is only as reliable and valid as the data it draws from. It isn't a magic method that makes real information out of nothing.