Exploratory Data Analysis Overview
Exploratory data analysis (EDA)
• EDA analyzes data to summarize their main characteristics,
using statistical graphics and data visualization methods.
• Data scientists use EDA to determine the best way to handle
data sources to get the answers they need.
Initial Data Analysis (IDA)
• IDA focuses on examining the structure, quality, and basic
characteristics of a dataset.
• Purposes:
• Understand the basic structure and quality of the dataset.
• Detect and fix problems (e.g., missing values, incorrect types,
inconsistencies).
• Ensure the data is clean, valid, and ready for further analysis.
• IDA makes sure that the data is in good shape, while EDA
explores well-prepared data to uncover insights.
Common tasks in EDA (and IDA)
Data quality
Data types
Data preprocessing
Data distribution
Outliers
Pattern discovery
Data collection: Graph datasets
• The Internet, social networks, molecular structures
Data collection: Ordered datasets
• Sequential data: transaction sequences, genetic sequences
• Video data, temporal data, time-series data, etc.
Data objects
• A data object depicts an entity, serving as the building block
for a dataset.
• Similar terms: sample, example, instance, data point, and tuple
University database
Attributes
• An attribute shows some characteristic of a data object.
• Similar terms: dimension, feature, and variable
• E.g., a Customer object has 3 attributes {id, name, address}
• Observation: an observed value for a given attribute
• Feature vector: a set of attributes used to describe an object
Attribute types: Nominal
• Qualitative, values do not have any meaningful order
• Enumerations: categories, states, or “names of things”
• Examples: occupation, weather, colors
Attribute types: Ordinal
• Qualitative, values have a meaningful order (ranking) but
magnitude between successive values is not known
Attribute types: Binary
• Nominal attribute with only 2 states
• Symmetric binary: both outcomes equally important
• Examples: a light switch (on and off), day and night, male and female
Attributes: Discrete vs. Continuous
• There are many ways to organize attribute types, which are
not mutually exclusive.
• Discrete attribute
• Only a finite or countably infinite set of values
• The values are sometimes represented as integers.
• Binary attributes are a special case of discrete attributes.
• Continuous attribute
• Real numbers of continuous domains
• The values are usually represented using a finite number of digits
→ floating-point variables
Quiz 01: Data types
1. For each of the following pairs of data types, give an example that
contrasts the characteristics of the two types.
• Nominal data vs. Ordinal data
• Symmetric binary data vs. Asymmetric binary data
• Interval numeric data vs. Ratio numeric data
2. How do you check the data type of a variable in Python? Show the data
type of each of the three variables x, y, and z below.
x = 42
y = "Hello"
z = [1, 2, 3]
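One possible way to approach the second question is sketched below, using the built-in `type()` and `isinstance()` functions on the three variables from the quiz. The dictionary name `type_names` is only an illustrative choice.

```python
# Variables from the quiz statement.
x = 42
y = "Hello"
z = [1, 2, 3]

# type() returns the class of an object; __name__ is the class's short name.
type_names = {
    name: type(value).__name__
    for name, value in (("x", x), ("y", y), ("z", z))
}
print(type_names)

# isinstance() checks whether an object belongs to a given type.
print(isinstance(x, int), isinstance(y, str), isinstance(z, list))
```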
Central tendency: Arithmetic mean
• Consider the score records of John and Kelly.
Component    John   Kelly
Homework     92     100
Quiz         74     82
Lab          83     95
Test         76     70
Final exam   88     76
• The (non-weighted) mean scores are $\mu_{John} = 82.6$ and $\mu_{Kelly} = 84.6$.
• We now have the course grade distribution.
Component    Weight
Homework     15 %
Quiz         10 %
Lab          20 %
Test         25 %
Final exam   30 %
• The weighted mean scores are $\mu^{w}_{John} = 83.2$ and $\mu^{w}_{Kelly} = 82.5$.
4 14 19 20 22 24 25 26 26 99
Trimmed mean (after dropping the smallest and the largest value): 22
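The three kinds of mean can be reproduced with a short sketch. The helper names `weighted_mean` and `trimmed_mean` are illustrative, and trimming exactly one value from each end is an assumption: the slide does not state the trimming amount, but this choice reproduces its result of 22.

```python
from statistics import mean

scores = {
    "John": {"Homework": 92, "Quiz": 74, "Lab": 83, "Test": 76, "Final exam": 88},
    "Kelly": {"Homework": 100, "Quiz": 82, "Lab": 95, "Test": 70, "Final exam": 76},
}
weights = {"Homework": 0.15, "Quiz": 0.10, "Lab": 0.20, "Test": 0.25, "Final exam": 0.30}


def weighted_mean(record, weights):
    """Sum of score * weight, divided by the total weight."""
    total_weight = sum(weights[name] for name in record)
    return sum(record[name] * weights[name] for name in record) / total_weight


def trimmed_mean(values, k):
    """Mean after dropping the k smallest and k largest values (assumed rule)."""
    ordered = sorted(values)
    if 2 * k >= len(ordered):
        raise ValueError("k is too large for this data set")
    return mean(ordered[k:len(ordered) - k])


for student, record in scores.items():
    plain = mean(record.values())
    weighted = weighted_mean(record, weights)
    print(student, round(plain, 1), round(weighted, 1))

trimmed = trimmed_mean([4, 14, 19, 20, 22, 24, 25, 26, 26, 99], k=1)
print(trimmed)
```

Note how the single extreme value 99 pulls the plain mean of the last data set up, while the trimmed mean stays close to the bulk of the values.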
Central tendency: Mode
• Mode is the value that occurs most frequently in the data,
defined for both qualitative and quantitative attributes.
• If each data value occurs only once, then there is no mode
Central tendency: Median
• Suppose that the given set of 𝑁 observations is sorted.
• Median is the middle value of the ordered set.
• If 𝑁 is odd, pick the exact middle value; otherwise, take the average of
the two middlemost values.
• Midrange is the average of the largest and smallest values
in the set.
4 4 4 9 15 15 15 27 37 48
mean = 17.8 – mode: 4 and 15 – midrange = 26, median = (15+15)/2 = 15
3 3 6 9 15 15 15 27 27 37 48
mean = 18.636 – mode: 15 – midrange = 25.5, median = 15
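The first worked example above can be checked with Python's standard `statistics` module. This is a minimal sketch; `midrange` is a hypothetical helper because the standard library has no such function.

```python
from statistics import mean, median, multimode


def midrange(values):
    """Average of the smallest and largest values."""
    return (min(values) + max(values)) / 2


data = [4, 4, 4, 9, 15, 15, 15, 27, 37, 48]

summary = {
    "mean": mean(data),
    "median": median(data),
    # multimode() returns every most-frequent value. When each value occurs
    # only once it returns all of them, unlike the "no mode" convention above.
    "modes": multimode(data),
    "midrange": midrange(data),
}
print(summary)
```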
Symmetric data vs. Skewed data
[Figure: a symmetric distribution contrasted with skewed distributions]
Data dispersion: Quantiles
• Let 𝑥1 , 𝑥2 , … , 𝑥𝑁 be a set of 𝑁 observations sorted in
increasing order for a numeric attribute 𝑋.
• Quantiles are points taken at regular intervals of a data
distribution, dividing it into equal-sized consecutive sets.
• The kth q-quantile (0 < k < q, k ∈ ℕ*) is a value 𝑥 such that at most a
fraction k/q of the data values are less than 𝑥 and at most a fraction
(q − k)/q of them are greater than 𝑥.
• There are 𝑞 − 1 q-quantiles.
Data dispersion: Quantiles
• Quartiles (4-quantiles) split the data distribution into four equal
parts.
Data dispersion: Interquartile range
• Interquartile range (IQR) is the distance between the first
and third quartiles.
𝐼𝑄𝑅 = 𝑄3 − 𝑄1
30 36 47 50 52 52 56 60 63 70 70
Q1 = 47, Q2 (median) = 52, Q3 = 63, so IQR = 63 − 47 = 16
How to determine the quartile?
• Use the median to divide the ordered set into two halves.
• If the original set has an even number of points, split it exactly in half
• Otherwise, do not include the median in either half.
• 𝑄1 and 𝑄3 are the medians of the lower and upper halves,
respectively.
6 7 15 36 39 40 41 42 43 47 49
Q1 = 15, Q2 = 40, Q3 = 43
7 15 36 39 40 41
Q1 = 15, Q2 = 37.5, Q3 = 40
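The median-of-halves rule described above can be sketched as follows. The function name `quartiles` is an illustrative choice. Libraries such as pandas and NumPy interpolate between data points by default, so they may return slightly different quartiles for the same data.

```python
from statistics import median


def quartiles(values):
    """Return (Q1, Q2, Q3) using the median-of-halves rule."""
    ordered = sorted(values)
    n = len(ordered)
    if n < 2:
        raise ValueError("need at least two observations")
    half = n // 2
    lower = ordered[:half]
    # For an odd count, the median itself is excluded from both halves.
    upper = ordered[half + 1:] if n % 2 else ordered[half:]
    return median(lower), median(ordered), median(upper)


odd_case = quartiles([6, 7, 15, 36, 39, 40, 41, 42, 43, 47, 49])
even_case = quartiles([7, 15, 36, 39, 40, 41])
print(odd_case)   # (15, 40, 43)
print(even_case)  # (15, 37.5, 40)
```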
Quiz 03: Quantiles
1. You are given the following dataset representing the scores of 15
students in a math exam, already sorted.
45, 48, 52, 55, 62, 67, 70, 72, 75, 77, 80, 85, 87,
90, 95
Compute the first quartile (Q1), second quartile (Q2), and third quartile
(Q3), and IQR.
2. Identify whether pandas supports the calculation of quartiles and IQR.
For each available function, show the result for the above data.
Data dispersion: Boxplot
• A five-number summary of a distribution includes
• The median (𝑸𝟐 ), the quartiles 𝑸𝟏 and 𝑸𝟑 ,
• The smallest (𝑴𝒊𝒏) and largest (𝑴𝒂𝒙) individual values.
Data dispersion: Boxplot
• The two whiskers extend to the smallest and largest values within
[𝑸𝟏 − 1.5 × 𝐼𝑄𝑅, 𝑸𝟑 + 1.5 × 𝐼𝑄𝑅].
• Outliers: points that fall outside the above range, plotted individually
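The numbers behind a boxplot can be computed before anything is drawn. The sketch below assumes the median-of-halves quartile rule, and the sample `prices` list is hypothetical: it reuses the IQR example values with one extra value, 110, added to produce an outlier.

```python
from statistics import median


def box_stats(values):
    """Five-number summary, whisker ends, and outliers for a boxplot."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    q1 = median(ordered[:half])
    q3 = median(ordered[half + 1:] if n % 2 else ordered[half:])
    iqr = q3 - q1
    low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    inside = [v for v in ordered if low_fence <= v <= high_fence]
    return {
        "min": ordered[0],
        "q1": q1,
        "median": median(ordered),
        "q3": q3,
        "max": ordered[-1],
        # Whiskers stop at the most extreme values that are still inside the fences.
        "whiskers": (inside[0], inside[-1]),
        "outliers": [v for v in ordered if v < low_fence or v > high_fence],
    }


# Hypothetical data: the IQR example values plus one large value.
prices = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]
stats = box_stats(prices)
print(stats)
```

Here the upper whisker ends at 70 rather than at the maximum, because 110 lies above Q3 + 1.5 × IQR and is reported as an outlier.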
Data dispersion: Boxplot
[Figure: Boxplot for the unit price data for items sold at four branches of
AllElectronics during a given time period]
• For Branch 1, the median price of items sold is $80, 𝑄1 is $60, and 𝑄3 is
$100. Notice that two outlying observations, 175 and 202, were plotted
individually because they lie more than 1.5 × IQR (here IQR = $40) above 𝑄3.
Quiz 04: Draw a box plot
1. Consider the following 1D data series, which includes 15 data points
sorted in ascending order.
21, 25, 27, 29, 32, 36, 36, 48, 67, 70, 74, 75,
79, 150, 197
• Define the five-number summary for the above data.
• Draw the boxplot representing the above five-number summary.
Note the vertical axis and all the values.
Data dispersion: Variance
• The (population) variance is defined as
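$$\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}\left(x_i - \bar{x}\right)^2 = \left(\frac{1}{N}\sum_{i=1}^{N} x_i^2\right) - \bar{x}^2$$
• The standard deviation $\sigma$ is the square root of the variance $\sigma^2$.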
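The two equivalent forms of the population variance can be checked numerically. This is a minimal sketch with illustrative function names; the data are borrowed from the earlier median example rather than from the quiz below.

```python
from math import sqrt


def population_variance(values):
    """Mean of the squared deviations from the mean."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / n


def population_variance_shortcut(values):
    """Mean of the squares minus the square of the mean."""
    n = len(values)
    mean = sum(values) / n
    return sum(x * x for x in values) / n - mean ** 2


data = [4, 4, 4, 9, 15, 15, 15, 27, 37, 48]
variance = population_variance(data)
shortcut = population_variance_shortcut(data)
std_dev = sqrt(variance)
print(round(variance, 2), round(shortcut, 2), round(std_dev, 2))
```

The shortcut form needs only one pass over the data, but it can lose precision when the mean is large relative to the spread, so the first form is usually the safer default.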
Quiz 05: Variance and standard deviation
1. Consider the following 1D data series, which includes 15 data points
sorted in ascending order.
21, 25, 27, 29, 32, 36, 36, 48, 67, 80, 84,
85, 89, 92, 97
Compute the variance and standard deviation.
Basic data visualization
Why data visualization?
• Gain insight into an information space by mapping data onto
graphical primitives
• Provide qualitative overview of large datasets
• Search for patterns, trends, irregularities, relationships
• Help find interesting regions and suitable parameters for
further quantitative analysis
• Provide a visual proof of computer representations derived
Bar chart
• A bar chart presents nominal data by using rectangular
bars with heights proportional to the values represented.
Histogram
• The range of values for a numeric attribute 𝑋 is partitioned
into disjoint consecutive subranges, called buckets or bins.
• Each bar corresponds to one subrange, and its height represents the
number of items that fall within that subrange.
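The binning step behind a histogram can be sketched without a plotting library. Equal-width bins are an assumption here, since the slide only requires disjoint consecutive subranges, and `histogram_counts` is a hypothetical helper name.

```python
def histogram_counts(values, num_bins):
    """Split [min, max] into equal-width bins and count the items in each."""
    if num_bins < 1 or not values:
        raise ValueError("need at least one bin and one value")
    low, high = min(values), max(values)
    width = (high - low) / num_bins
    counts = [0] * num_bins
    for v in values:
        if width == 0:
            index = 0  # all values are identical
        else:
            # The last bin is closed on the right so the maximum is counted.
            index = min(int((v - low) / width), num_bins - 1)
        counts[index] += 1
    edges = [low + k * width for k in range(num_bins + 1)]
    return edges, counts


edges, counts = histogram_counts([4, 4, 4, 9, 15, 15, 15, 27, 37, 48], num_bins=4)
print(edges)
print(counts)
```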
Histogram: An example
Histogram over boxplot
• The two following histograms may have the same boxplot.
• However, they represent rather different data distributions.
Quantile plot
• A quantile plot displays quantile information for a
univariate data distribution.
• It allows us to assess both the overall behavior and unusual occurrences.
• Let 𝑥1 , 𝑥2 , … , 𝑥𝑁 be the data observations sorted in
increasing order for some ordinal or numeric attribute 𝑋.
• Each value $x_i$ is paired with $f_i = \frac{i - 0.5}{N}$, indicating that
approximately $f_i \times 100\%$ of the data are below $x_i$.
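The pairing of each sorted observation with its fraction can be sketched as below. The function name `quantile_plot_points` and the sample data are illustrative only. Plotting the resulting pairs, with the fraction on the horizontal axis, gives the quantile plot.

```python
def quantile_plot_points(values):
    """Pair each sorted observation x_i with f_i = (i - 0.5) / N."""
    ordered = sorted(values)
    n = len(ordered)
    return [((i - 0.5) / n, x) for i, x in enumerate(ordered, start=1)]


points = quantile_plot_points([15, 4, 48, 9, 27, 4, 37, 15, 4, 15])
for fraction, value in points:
    print(f"{fraction:.2f}  {value}")
```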
Quantile plot: An example
Quantile-Quantile plot
• A quantile-quantile plot draws the quantiles of one univariate
distribution against the corresponding quantiles of another.
Scatter plot
• A scatter plot looks at the bivariate data to see clusters of
points or outliers
• Each pair of values is treated as a pair of coordinates and plotted as
points in the plane.
[Figure: scatter plot showing the correlation between unit price and items sold]
Scatter plot: Data correlation
[Figure: scatter plots of negatively correlated, positively correlated, and uncorrelated data]
Quiz 06: Scatter plot
1. Consider the following data table, in which there are five tuples of two
attributes, A and B.
No.   A    B
1     19   16
2     25   10
3     13   26
4     12   29
5     16   20
Draw the scatter plot, whose horizontal axis denotes attribute A and whose
vertical axis represents attribute B.
2. How do you draw a scatter plot using a Python library? Draw the
scatter plot for the above data.
Data proximity measures
Similarity and Dissimilarity
Similarity
• A numerical measure of how alike two data objects, 𝑖 and 𝑗, are
• Values often fall in the range [0, 1]: 0 means completely unalike, 1 means identical
Dissimilarity (distance)
• A numerical measure of how different two data objects are
• It works in an opposite direction to some similarity measure
• The lower bound is often 0, while the upper limit varies
Proximity
• This refers to either similarity or dissimilarity
Feature matrix vs. Dissimilarity matrix
• Feature matrices are essential to most machine learning tasks.
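Feature matrix (𝑛 objects × 𝑝 attributes):
$$\begin{bmatrix}
x_{11} & \cdots & x_{1f} & \cdots & x_{1p} \\
\vdots & & \vdots & & \vdots \\
x_{i1} & \cdots & x_{if} & \cdots & x_{ip} \\
\vdots & & \vdots & & \vdots \\
x_{n1} & \cdots & x_{nf} & \cdots & x_{np}
\end{bmatrix}$$
Dissimilarity matrix (𝑛 × 𝑛, lower triangular):
$$\begin{bmatrix}
0 \\
d(2,1) & 0 \\
d(3,1) & d(3,2) & 0 \\
\vdots & \vdots & \vdots & \ddots \\
d(n,1) & d(n,2) & \cdots & \cdots & 0
\end{bmatrix}$$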
Measures for numeric attributes
• Minkowski distance with ℎ = 1: Manhattan (city block, 𝐿1 norm) distance
$$d(i, j) = |x_{i1} - x_{j1}| + |x_{i2} - x_{j2}| + \cdots + |x_{ip} - x_{jp}|$$
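The Manhattan distance, and the lower-triangular dissimilarity matrix built from it, can be sketched as follows. The four two-dimensional points are hypothetical, and `dissimilarity_matrix` is an illustrative helper, not a library function.

```python
def manhattan(x, y):
    """Sum of absolute coordinate differences (L1 norm)."""
    if len(x) != len(y):
        raise ValueError("objects must have the same number of attributes")
    return sum(abs(a - b) for a, b in zip(x, y))


def dissimilarity_matrix(rows, distance):
    """Lower-triangular matrix: row i holds d(i, 0), ..., d(i, i)."""
    return [
        [distance(rows[i], rows[j]) for j in range(i + 1)]
        for i in range(len(rows))
    ]


# Hypothetical feature matrix: four objects with two numeric attributes.
features = [(1, 2), (3, 5), (2, 0), (4, 5)]
matrix = dissimilarity_matrix(features, manhattan)
for row in matrix:
    print(row)
```

Only the lower triangle is stored because the distance is symmetric and every object has distance 0 to itself.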
Cosine similarity
• A document can be represented by a vector of thousands of attributes,
each recording the frequency of a particular keyword in the document.
Cosine similarity
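• Let $d_1$ and $d_2$ be two vectors (e.g., term-frequency vectors).
• Cosine similarity (a non-metric measure): $sim(d_1, d_2) = \frac{d_1 \cdot d_2}{\|d_1\| \, \|d_2\|}$
• where $\cdot$ is the vector dot product and $\|d\|$ is the length (Euclidean norm) of vector $d$
• sim = 0 means no match (the vectors are orthogonal), while sim = 1 means a complete match (the vectors point in the same direction).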
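A small sketch of cosine similarity on term-frequency vectors follows. The two document vectors are hypothetical, and the function raises an error for an all-zero vector, where the measure is undefined.

```python
from math import sqrt


def cosine_similarity(d1, d2):
    """Dot product divided by the product of the Euclidean norms."""
    if len(d1) != len(d2):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(d1, d2))
    norm1 = sqrt(sum(a * a for a in d1))
    norm2 = sqrt(sum(b * b for b in d2))
    if norm1 == 0 or norm2 == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm1 * norm2)


# Hypothetical term-frequency vectors over the same ten keywords.
doc1 = [5, 0, 3, 0, 2, 0, 0, 2, 0, 0]
doc2 = [3, 0, 2, 0, 1, 1, 0, 1, 0, 1]
similarity = cosine_similarity(doc1, doc2)
print(round(similarity, 2))  # 0.94
```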
Measures for ordinal attributes
• The range of a numeric attribute can be mapped to an
ordinal attribute 𝑓 having $M_f$ states.
• E.g., temperature: cold (−30°C to −10°C), moderate (−10°C to 10°C), and
warm (10°C to 30°C)
• Let $M_f$ represent the number of possible ordered states,
which define the ranking $1, \ldots, M_f$
• Replace each $x_{if}$ by its corresponding rank, $r_{if} \in \{1, \ldots, M_f\}$
• Replace the rank $r_{if}$ of the $i$-th object by $z_{if} = \frac{r_{if} - 1}{M_f - 1}$, which maps the ranks onto [0, 1]
• Continue with any measure for numeric attributes, applied to the $z_{if}$ values
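The rank normalization above can be sketched as follows, using the temperature states from the example. The function name `ordinal_to_unit_interval` is illustrative, and taking the absolute difference of the normalized values is just one possible numeric measure.

```python
def ordinal_to_unit_interval(value, ordered_states):
    """Map an ordinal value to z = (rank - 1) / (M - 1), with ranks 1..M."""
    m = len(ordered_states)
    if m < 2:
        raise ValueError("need at least two ordered states")
    rank = ordered_states.index(value) + 1
    return (rank - 1) / (m - 1)


states = ["cold", "moderate", "warm"]
z_values = {state: ordinal_to_unit_interval(state, states) for state in states}
print(z_values)

# Dissimilarity between two objects on this attribute (one possible choice).
distance = abs(z_values["warm"] - z_values["moderate"])
print(distance)
```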
Measures for ordinal attributes
Measures for attributes of mixed types
• Suppose that the dataset has 𝑝 attributes of mixed type.
• The distance between objects 𝑖 and 𝑗 is
$$d(i, j) = \frac{\sum_{f=1}^{p} \delta_{ij}^{(f)} d_{ij}^{(f)}}{\sum_{f=1}^{p} \delta_{ij}^{(f)}}$$
• $\delta_{ij}^{(f)} = 0$ if (1) $x_{if}$ or $x_{jf}$ is missing, or (2) $x_{if} = x_{jf} = 0$ and attribute
𝑓 is asymmetric binary. Otherwise, $\delta_{ij}^{(f)} = 1$.
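The weighted combination can be sketched as a function over per-attribute pairs of indicator and dissimilarity. The name `mixed_distance` and the input layout are assumptions made for illustration; the per-attribute dissimilarities themselves are assumed to be computed beforehand with a measure suited to each attribute type.

```python
def mixed_distance(per_attribute):
    """Combine (delta, d) pairs: sum(delta * d) / sum(delta)."""
    pairs = list(per_attribute)
    total_delta = sum(delta for delta, _ in pairs)
    if total_delta == 0:
        raise ValueError("no attribute contributes to the distance")
    return sum(delta * d for delta, d in pairs) / total_delta


# Three attributes, all contributing (delta = 1), with per-attribute
# dissimilarities 1, 0.50, and 0.45.
all_present = mixed_distance([(1, 1.0), (1, 0.50), (1, 0.45)])
print(round(all_present, 2))  # 0.65

# If the second attribute were missing for one object, its delta is 0
# and it is ignored in both the numerator and the denominator.
one_missing = mixed_distance([(1, 1.0), (0, 0.50), (1, 0.45)])
print(round(one_missing, 3))  # 0.725
```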
Measures for attributes of mixed types
[Figure: dissimilarity matrices of test-1 and test-2]
• $\delta_{ij}^{(f)} = 1$ for each attribute 𝑓
• $d(3, 1) = \frac{1(1) + 1(0.50) + 1(0.45)}{3} = 0.65$
Quiz 07: Jaccard coefficient
1. Calculate the similarity between these two observations, in which all
the attributes are binary asymmetric.
Correlation analysis
χ²-test for correlation analysis
• Suppose attribute 𝐴 has 𝑐 distinct values and attribute 𝐵 has 𝑟
distinct values. There are 𝑛 data tuples.
• Let (𝐴𝑖 , 𝐵𝑗 ) denote the joint event that 𝐴 = 𝑎𝑖 and 𝐵 = 𝑏𝑗 .
• The χ²-test checks the null hypothesis that 𝐴 and 𝐵 are independent.
$$\chi^2 = \sum_{i=1}^{c}\sum_{j=1}^{r}\frac{(o_{ij} - e_{ij})^2}{e_{ij}}$$
• $o_{ij}$: observed frequency (i.e., actual count) of (𝐴 = 𝑎𝑖 , 𝐵 = 𝑏𝑗 )
• $e_{ij}$: expected frequency of (𝐴𝑖 , 𝐵𝑗 ), $e_{ij} = \frac{count(A = a_i) \times count(B = b_j)}{n}$
• The larger the χ² value, the more likely the variables are related.
χ²-test: A numerical example
• Consider the contingency table below.
Expected frequency: An example
• Consider the contingency table below.
               Yes   No   Row Total
Male           30    20   50
Female         40    10   50
Column Total   70    30   100
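For this table, the expected frequencies and the χ² statistic can be computed with a short sketch. The function name `chi_square` is illustrative, and the code assumes every row and column total is positive so that no expected count is zero.

```python
def chi_square(observed):
    """Return (expected counts, chi-square statistic, degrees of freedom)."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    # Expected count of a cell = row total * column total / grand total.
    expected = [[r * c / n for c in col_totals] for r in row_totals]
    statistic = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(observed, expected)
        for o, e in zip(obs_row, exp_row)
    )
    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return expected, statistic, dof


# Rows: Male, Female. Columns: Yes, No.
expected, statistic, dof = chi_square([[30, 20], [40, 10]])
print(expected)
print(round(statistic, 3), dof)
```

Every expected count here (35, 15, 35, 15) is at least 5, so the usual rule of thumb for applying the test is met.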
Expected frequency: Notes
• The χ²-test relies on an approximation that works best with
sufficiently large expected counts.
• Classical rule: Each expected frequency should be ≥ 5.
• Modern guideline: For larger tables, ≥ 1 is acceptable if no
more than 20% of cells are below 5.
χ²-test: Contingency table
• A contingency table (or crosstab) displays the frequency
distribution of two or more categorical variables.
• It helps to analyze the relationship between variables.
[Figure: a contingency table with its marginal totals and grand total labeled]
χ²-test: Contingency table
• Is a contingency table able to represent categorical variables
that have more than two values?
χ²-test: Contingency table
• Can a contingency table represent the relationship of more
than two categorical variables?
χ²-test: An example
• The test is carried out at a chosen significance level, here with
1 degree of freedom (DOF).
• If the null hypothesis is rejected, 𝐴 and 𝐵 are statistically correlated.
χ²-test: Degrees of freedom (DOF)
• DOF is the number of independent values that can vary in a
statistical calculation without breaking constraints.
• It can be calculated from the contingency table as follows.
𝑫𝑶𝑭 = (𝒓 − 𝟏) ∙ (𝒄 − 𝟏)
• 𝑟 is the number of rows and 𝑐 is the number of columns.
• E.g., in a table with 3 rows and 2 columns, DOF = (3 – 1)(2 – 1) = 2.
χ²-test: Significance levels
• The significance level 𝛼 is the probability threshold to decide
whether to reject the null hypothesis.
• It represents the risk of rejecting a true null hypothesis.
Quiz 08: χ²-test
1. Consider the data that relate the sex of children in families who have
two children. Apply the χ² statistic at the 0.001 significance level.
Pearson correlation coefficient
• Consider two numeric attributes 𝐴 and 𝐵, and a set of 𝑛
observations $(a_1, b_1), \ldots, (a_n, b_n)$.
• Pearson's product-moment coefficient
$$r_{A,B} = \frac{\sum_{i=1}^{n}(a_i - \bar{A})(b_i - \bar{B})}{n\,\sigma_A \sigma_B} = \frac{\sum_{i=1}^{n} a_i b_i - n\bar{A}\bar{B}}{n\,\sigma_A \sigma_B}$$
• $\bar{A}, \bar{B}, \sigma_A, \sigma_B$: means and standard deviations of 𝐴 and 𝐵, respectively
• $\sum a_i b_i$: sum of the 𝐴𝐵 cross-product
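The second form of the coefficient translates directly into code. This is a minimal sketch that uses population standard deviations (dividing by 𝑛), matching the formula above; the sample data are hypothetical.

```python
from math import sqrt


def pearson(a, b):
    """Pearson correlation using population standard deviations."""
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("need two non-empty sequences of equal length")
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    sigma_a = sqrt(sum((x - mean_a) ** 2 for x in a) / n)
    sigma_b = sqrt(sum((y - mean_b) ** 2 for y in b) / n)
    if sigma_a == 0 or sigma_b == 0:
        raise ValueError("correlation is undefined when a variance is zero")
    cross = sum(x * y for x, y in zip(a, b))
    return (cross - n * mean_a * mean_b) / (n * sigma_a * sigma_b)


# Hypothetical data: b rises (or falls) exactly linearly with a.
rising = pearson([1, 2, 3, 4, 5], [2, 4, 6, 8, 10])
falling = pearson([1, 2, 3, 4, 5], [10, 8, 6, 4, 2])
print(round(rising, 6), round(falling, 6))
```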
Pearson correlation coefficient
[Figure: Several sets of (x, y) points, with the Pearson correlation coefficient of x and y for each set.
The correlation reflects the noisiness and direction of a linear relationship (top row), but
not the slope of that relationship (middle), nor many aspects of nonlinear relationships
(bottom). N.B.: the figure in the center has a slope of 0 but in that case the correlation
coefficient is undefined because the variance of Y is zero. (Wikipedia)]
Covariance analysis
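• The covariance between 𝑨 and 𝑩 is defined as
$$Cov(A, B) = \frac{\sum_{i=1}^{n}(a_i - \bar{A})(b_i - \bar{B})}{n} = E(A \cdot B) - \bar{A}\bar{B}$$
• where $E(A) = \bar{A} = \frac{\sum_{i=1}^{n} a_i}{n}$ and $E(B) = \bar{B} = \frac{\sum_{i=1}^{n} b_i}{n}$ are the expected
values of 𝐴 and 𝐵
• Covariance vs. correlation: $r_{A,B} = \frac{Cov(A, B)}{\sigma_A \sigma_B}$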
Covariance analysis: An example
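• If the stocks are affected by the
same industry trends, will their
prices rise or fall together?
• $E(AllElectronics) = \frac{6 + 5 + 4 + 3 + 2}{5} = \frac{20}{5} = \$4$
• $E(HighTech) = \frac{20 + 10 + 14 + 5 + 5}{5} = \frac{54}{5} = \$10.80$
• $Cov(AllElectronics, HighTech) = \frac{6 \times 20 + 5 \times 10 + 4 \times 14 + 3 \times 5 + 2 \times 5}{5} - 4 \times 10.80 = 7$
• The covariance is positive, so the two stock prices tend to rise together.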
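The covariance calculation for the two stocks can be verified with a few lines. The price lists are read off the sums in the calculation above, and `covariance` is an illustrative helper that uses the shortcut form E(A·B) − ĀB̄.

```python
def covariance(a, b):
    """Population covariance: mean of the products minus product of the means."""
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("need two non-empty sequences of equal length")
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    return sum(x * y for x, y in zip(a, b)) / n - mean_a * mean_b


all_electronics = [6, 5, 4, 3, 2]
high_tech = [20, 10, 14, 5, 5]

cov = covariance(all_electronics, high_tech)
print(round(cov, 2))  # 7.0
```

The result is rounded before printing because floating-point subtraction can leave a tiny error in the last digits.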
References
• Jiawei Han, Micheline Kamber, and Jian Pei, 2011. Data Mining:
Concepts and Techniques (3rd ed.). Morgan Kaufmann Publishers Inc.
Chapter 2 and Chapter 3.