Learning Objectives
● Exploratory Data Analysis
● Hands-on Code
● Understanding missing data
Introduction
Where are we?
[Diagram: stages of the data science workflow: Problem Formulation, Data Collection & Processing, Exploratory Data Analysis, Data Modeling, Insight/Prediction, Presentation]
● Raw data collection & pre-processing
○ Data collection and preprocessing take up ~80% of the time
● Raw data preprocessing tools
○ Data query language (SQL) to search and update a relational DBMS
○ Storing semi-structured data in XML and JSON formats
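Introduction: Real world scenario
[Diagram: collected raw data may be structured (tables in an RDBMS, queried with SQL), semi-structured (XML, JSON in an NRDBMS, queried with NoSQL), or unstructured; it is preprocessed before exploratory data analysis. Small public datasets (CSV) are explored with R/Python.]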
Data Exploration
● The goal of the data exploration is to learn
about the data.
● The data scientist wants to know the basic
characteristics of the data, e.g.,
○ the structure,
○ the size,
○ the completeness (or rather where data is
missing), and
○ the relationships between different parts of
the data.
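The four characteristics listed above can be checked with a few lines of Python. The following is a minimal sketch using only the standard library; the inline CSV text and the column names (`rooms`, `price`, `age`) are hypothetical and stand in for whatever file is being explored.

```python
import csv
import io
from math import sqrt

# Hypothetical sample data; an empty field marks a missing value.
RAW = """rooms,price,age
6,24.0,65
6,21.6,
7,34.7,61
7,33.4,45
,36.2,54
6,28.7,58
"""

def load_rows(text):
    """Parse CSV text into a list of dicts, mapping empty fields to None."""
    reader = csv.DictReader(io.StringIO(text))
    return [
        {key: (float(value) if value != "" else None) for key, value in row.items()}
        for row in reader
    ]

def missing_counts(rows):
    """Count missing (None) values per column."""
    return {col: sum(r[col] is None for r in rows) for col in rows[0]}

def pearson(rows, x, y):
    """Pearson correlation over the rows where both columns are present."""
    pairs = [(r[x], r[y]) for r in rows if r[x] is not None and r[y] is not None]
    n = len(pairs)
    mean_x = sum(a for a, _ in pairs) / n
    mean_y = sum(b for _, b in pairs) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in pairs)
    var_x = sum((a - mean_x) ** 2 for a, _ in pairs)
    var_y = sum((b - mean_y) ** 2 for _, b in pairs)
    return cov / sqrt(var_x * var_y)

rows = load_rows(RAW)
columns = list(rows[0])
print("structure:", columns)                                  # the structure
print("size:", len(rows), "rows x", len(columns), "columns")  # the size
print("missing per column:", missing_counts(rows))            # the completeness
print("corr(rooms, price):", round(pearson(rows, "rooms", "price"), 3))  # a relationship
```

In practice a data frame library would do the same job with less code; the point here is only to show what each of the four questions amounts to.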
● The exploration is usually a semi-automated
interactive process in which data scientists
use many different tools to consider
different aspects of the data.
● These tools allow the data scientist to
inspect raw data or preprocessed data, e.g.,
comma-separated values (CSV) files
● In this course we will use tools available in
Python:
○ Statistical measures
○ Visualizations
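As a small illustration of the statistical measures mentioned above, the sketch below uses Python's built-in `statistics` module. The values are a hypothetical sample, not taken from any dataset in these slides.

```python
import statistics

# Hypothetical sample of a numeric feature (not taken from any real dataset).
values = [21.0, 22.5, 19.8, 24.1, 50.0, 23.3, 20.7, 18.9, 22.0, 23.9]

summary = {
    "count": len(values),
    "min": min(values),
    "max": max(values),
    "mean": statistics.mean(values),
    "median": statistics.median(values),
    "stdev": statistics.stdev(values),  # sample standard deviation
}
for name, value in summary.items():
    print(f"{name:>6}: {value:.2f}")
```

The single large value (50.0) pulls the mean above the median, which is one reason to look at both measures rather than only one.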
Examples
Boston House Prices
Dataset
This dataset contains
information collected by the
U.S. Census Service concerning
housing in the area of Boston,
Mass. (1978).
We will focus on exploring two of its variables, MEDV and CRIM.
Examples: Boston House Prices / MEDV
Plots: histogram; density; density + rug; density + histogram + rug
Examples: Boston House Prices / CRIM
Plots: density + histogram + rug
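To make the three plot types concrete, here is a rough sketch of what each one computes, written with the standard library only. The sample is synthetic and is NOT the real MEDV or CRIM data; the bin count and the bandwidth are arbitrary assumptions. A plotting library would normally draw these for you.

```python
import math
import random

random.seed(0)
# Synthetic stand-in for a right-skewed variable; NOT the real MEDV or CRIM values.
sample = [random.lognormvariate(3.0, 0.4) for _ in range(200)]

def histogram(data, bins):
    """Count how many values fall into each of `bins` equal-width intervals."""
    lo, hi = min(data), max(data)
    width = (hi - lo) / bins
    counts = [0] * bins
    for x in data:
        index = min(int((x - lo) / width), bins - 1)  # the maximum goes in the last bin
        counts[index] += 1
    edges = [lo + i * width for i in range(bins + 1)]
    return counts, edges

def gaussian_kde(data, bandwidth):
    """Return a function estimating the density as an average of Gaussian bumps."""
    norm = 1.0 / (len(data) * bandwidth * math.sqrt(2 * math.pi))
    def density(x):
        return norm * sum(math.exp(-0.5 * ((x - d) / bandwidth) ** 2) for d in data)
    return density

counts, edges = histogram(sample, bins=10)
density = gaussian_kde(sample, bandwidth=2.0)

# Histogram bars as text, with the density estimate at each bin midpoint.
for count, left, right in zip(counts, edges, edges[1:]):
    midpoint = (left + right) / 2
    print(f"{left:6.1f}-{right:6.1f} | {'#' * count:<60} density={density(midpoint):.4f}")

# A rug plot marks every individual observation along the axis.
rug = sorted(round(x, 1) for x in sample)
print("rug (first 10 marks):", rug[:10])
```

The histogram depends on the choice of bins and the density curve on the choice of bandwidth, so it is worth trying a few settings before reading much into the shape.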
Examples
Iris Dataset
The Iris flower data set is a multivariate
data set introduced by the British
statistician and biologist Ronald Fisher in
his 1936 paper.
The data set consists of 50 samples from
each of three species of Iris (Iris setosa, Iris
virginica, and Iris versicolor). Four features
were measured from each sample: the
length and the width of the sepals and
petals, in centimeters.
Examples: IRIS
Box plot
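A box plot is a picture of a handful of summary numbers. The sketch below computes them with the standard library, assuming the common 1.5 × IQR rule for the whiskers. The measurements are illustrative values, not the actual Iris data.

```python
import statistics

# Illustrative sepal lengths in cm (hypothetical values, not the actual Iris data).
sepal_length = [4.4, 4.6, 4.6, 4.7, 4.9, 4.9, 5.0, 5.0, 5.1, 5.4, 7.9]

# The box spans the first to the third quartile, with a line at the median.
q1, median, q3 = statistics.quantiles(sepal_length, n=4, method="inclusive")
iqr = q3 - q1

# Assumed convention: points beyond 1.5 * IQR from the box are drawn as outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
inside = [x for x in sepal_length if lower_fence <= x <= upper_fence]
outliers = [x for x in sepal_length if x < lower_fence or x > upper_fence]

print(f"box: Q1={q1:.2f}, median={median:.2f}, Q3={q3:.2f}")
print(f"whiskers: {min(inside):.2f} to {max(inside):.2f}")
print("outliers:", outliers)
```

Drawing one such box per species side by side is what makes the plot useful for comparing groups.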
Examples: Trend
Air Passengers
Dataset
The classic Box & Jenkins airline
data. Monthly totals of
international airline
passengers, 1949 to 1960.
Plots: histogram, scatter plot, line plot
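One simple way to expose a trend in a monthly series is a 12-month moving average, which smooths out the seasonal swings. The sketch below assumes a plain list of monthly totals; the numbers are illustrative and should not be read as the exact dataset values.

```python
def moving_average(series, window):
    """Average each run of `window` consecutive values."""
    return [
        sum(series[i:i + window]) / window
        for i in range(len(series) - window + 1)
    ]

# Illustrative monthly passenger totals for two consecutive years (assumed values).
monthly = [
    112, 118, 132, 129, 121, 135, 148, 148, 136, 119, 104, 118,
    115, 126, 141, 135, 125, 149, 170, 170, 158, 133, 114, 140,
]

trend = moving_average(monthly, window=12)
for value in trend:
    print(f"{value:6.1f}")
```

The raw series rises and falls within each year, while the smoothed values climb steadily, which is the pattern a line plot makes visible at a glance.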
Missing Data
[Figure: example of missing data]
● Any occurrence where data for a
variable has not been recorded for
some observation is considered
missing from that observation.
Can’t we just drop the row with
missing data?
Not a good idea. Why?
It is wasteful.
● May end up discarding a large portion
of data
● A relatively small amount of missing
data can have a big impact
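A quick simulation shows how a small missing rate per cell adds up when whole rows are dropped. This is a sketch under stated assumptions: a hypothetical table with 10 columns, each cell missing independently with probability 5%.

```python
import random

random.seed(1)
# Hypothetical dataset shape and per-cell missing rate.
N_ROWS, N_COLS, P_MISSING = 10_000, 10, 0.05

# Each cell is independently missing with probability P_MISSING (None marks missing).
table = [
    [None if random.random() < P_MISSING else 1.0 for _ in range(N_COLS)]
    for _ in range(N_ROWS)
]

# Dropping rows keeps only those with no missing cell at all.
complete = [row for row in table if None not in row]
kept_fraction = len(complete) / N_ROWS

# With independent cells, a row survives with probability (1 - p) ** k.
expected_fraction = (1 - P_MISSING) ** N_COLS

print(f"rows kept (simulated): {kept_fraction:.1%}")
print(f"rows kept (expected):  {expected_fraction:.1%}")
```

Under these assumptions only about 60% of the rows survive, even though just 5% of the individual values are missing.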
It creates inconsistency.
● It is difficult to compare models that may
not use the same variables.
It may create bias.
● Suppose each row represents a country and
one of the features is GDP. Poor countries
may not report GDP, so it shows up as missing
data. Dropping rows with missing data then
removes those poor countries, and the data
becomes biased toward the rich countries!
Missing data comes in three classes
● MCAR: Missing Completely At Random
● MAR: Missing At Random
● NMAR: Not Missing At Random
MCAR: Missing Completely At Random
● 'Some of the data will be missing simply because of bad luck.'
● 'This effectively implies that causes of the missing data are
unrelated to the data.'
● Imagine tracking the number of cars at an intersection over
time using a webcam. But the Wi-Fi on your laptop fails
occasionally, and you cannot record cars during the outage.
The fact that the counts are missing has nothing to do with
the cars. The missing car counts are MCAR.
MAR: Missing At Random
● If the chance that a value is missing
can be determined entirely by other
variables in the dataset, then the
data is missing at random.
● Say the webcam is known to shut
down every night from 1am to 5am
to save power.
These missing car counts are MAR.
NMAR: Not Missing At Random
● If data is NMAR, the chance that any
value for the given variable is missing
depends on data which is itself
missing.
● People who do not live in permanent
homes are much more likely to have
missing data in a census because
they are less likely to be found by
pollsters.
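The three classes can be compared with a small simulation in the spirit of the car-counting example. Everything here is an assumption made for illustration: the traffic levels, the 20% loss rate, and the rule that busy periods are lost more often in the NMAR case.

```python
import random
import statistics

random.seed(42)
N = 20_000

# Hypothetical data: hour of day (always observed) and a car count that is higher in daytime.
hours = [random.randrange(24) for _ in range(N)]
counts = [random.gauss(50 if 6 <= h < 22 else 10, 5) for h in hours]

def observed_mean(values, is_missing):
    """Mean of the values that were actually recorded."""
    return statistics.mean(v for v, m in zip(values, is_missing) if not m)

# MCAR: every value has the same 20% chance of being lost, independent of everything.
mcar = [random.random() < 0.2 for _ in range(N)]
# MAR: values are lost between 1am and 5am, which the observed hour fully explains.
mar = [1 <= h < 5 for h in hours]
# NMAR: the chance of loss depends on the unrecorded count itself (busy periods are lost more often).
nmar = [random.random() < (0.6 if c > 40 else 0.05) for c in counts]

true_mean = statistics.mean(counts)
print(f"all data : {true_mean:.1f}")
for name, mask in (("MCAR", mcar), ("MAR", mar), ("NMAR", nmar)):
    print(f"{name:<9}: {observed_mean(counts, mask):.1f}")
```

In this setup the MCAR mean stays close to the true mean. The MAR mean shifts, but the shift can be accounted for because the hour is observed. The NMAR mean shifts in a way the recorded data alone cannot reveal.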
Imputation is the act of filling in missing
data.
● Missing data can be filled with predefined
values (e.g., 0).
● It can be filled with predictions of what
the values should be.
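Both options can be sketched in a few lines. The example below fills a hypothetical column either with a fixed value or with the mean of the observed values, the mean being about the simplest possible prediction. The function names are made up for this illustration.

```python
import statistics

# Hypothetical column with missing entries marked as None.
ages = [34.0, None, 29.0, 41.0, None, 38.0]

def impute_constant(values, fill=0.0):
    """Replace every missing entry with a predefined value."""
    return [fill if v is None else v for v in values]

def impute_mean(values):
    """Replace every missing entry with the mean of the observed entries."""
    observed = [v for v in values if v is not None]
    fill = statistics.mean(observed)
    return [fill if v is None else v for v in values]

print("constant:", impute_constant(ages))
print("mean:    ", impute_mean(ages))
```

Filling with 0 is only sensible when 0 is a plausible value for the variable; for something like age it distorts the data, which is why a predicted value is often preferred.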
● Typically, imputation is considered when less
than 20% of the data is missing. The quality of
the imputation depends on both the
proportion of data that is missing, and the
pattern, if any, to the missingness.
● Imputation is only as reliable and valid as the
data it draws from. It isn't a magic method
that makes real information out of nothing.
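The dependence on the proportion of missing data can be seen in a small experiment. This sketch assumes hypothetical normally distributed data, values hidden completely at random, and mean imputation as the filling method; other methods would behave differently.

```python
import random
import statistics

random.seed(7)
# Hypothetical complete data, used only to measure what imputation loses.
true_values = [random.gauss(100, 15) for _ in range(5_000)]

def mean_impute_with_loss(values, missing_rate):
    """Hide a random share of values (MCAR), then fill the gaps with the observed mean."""
    hidden = [random.random() < missing_rate for _ in values]
    observed = [v for v, h in zip(values, hidden) if not h]
    fill = statistics.mean(observed)
    return [fill if h else v for v, h in zip(values, hidden)]

spread = {}
for rate in (0.05, 0.2, 0.5):
    filled = mean_impute_with_loss(true_values, rate)
    spread[rate] = statistics.stdev(filled)
    print(f"missing {rate:.0%}: stdev after imputation = {spread[rate]:.1f}")

print(f"stdev of the complete data = {statistics.stdev(true_values):.1f}")
```

The more values are replaced by the same number, the more the spread of the variable shrinks, so the filled column looks less variable than the real data was.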