Exploratory Data Analysis Overview

The document provides an overview of Exploratory Data Analysis (EDA) and Initial Data Analysis (IDA), emphasizing their roles in summarizing data characteristics and ensuring data quality. It covers data objects, attributes, types, and basic statistical descriptions, including measures of central tendency and dispersion. Additionally, it discusses data visualization techniques and the importance of data cleaning and preprocessing.

Uploaded by

ndkhoa232

Exploratory Data Analysis (P1)

Nguyen Ngoc Thao

nnthao@[Link]
Content outline
• Data objects and Attributes
• Basic statistical data descriptions
• Basic data visualization
• Data proximity measures
• Data correlation analysis

2
Exploratory data analysis (EDA)
• EDA analyzes data to summarize their main characteristics,
using statistical graphics and data visualization methods.
• Data scientists use EDA to determine the best way to handle
data sources to get the answers they need.

3
Initial Data Analysis (IDA)
• IDA focuses on examining the structure, quality, and basic
characteristics of a dataset.
• Purposes:
• Understand the basic structure and quality of the dataset.
• Detect and fix problems (e.g., missing values, incorrect types,
inconsistencies).
• Ensure the data is clean, valid, and ready for further analysis.

• IDA makes sure that the data is in good shape, while EDA
explores well-prepared data to uncover insights.

4
Common tasks in EDA (and IDA)

Data quality
Data types
Data preprocessing

Data distribution
Outliers

Pattern discovery

Correlation Data visualization

5
Data objects and Attributes
Data collection: Record datasets
• Relational / transactional tuples
• Term-frequency vectors, numerical matrices, crosstabs

7
Data collection: Graph datasets
• The Internet, social networks, molecular structures

8
Data collection: Ordered datasets
• Sequential data: transaction sequences, genetic sequences
• Video data, temporal data, time-series data, etc.

9
Data objects
• A data object depicts an entity, serving as the building block
for a dataset.
• Similar terms: sample, example, instance, data point, and tuple

Sales database Medical database

University database

• Data objects are described by attributes.

• In a database: rows → data objects, columns → attributes

10
Attributes
• An attribute shows some characteristic of a data object.
• Similar terms: dimension, feature, and variable
• E.g., a Customer object has 3 attributes {id, name, address}
• Observation: an observed value for a given attribute
• Feature vector: a set of attributes used to describe an object

11
Attribute types: Nominal
• Qualitative, values do not have any meaningful order
• Enumerations: categories, states, or “names of things”

• E.g., occupation, day and night, weather, colors
12
Attribute types: Ordinal
• Qualitative, values have a meaningful order (ranking) but
magnitude between successive values is not known

• Useful for subjective assessments of qualities that cannot be
measured objectively
• E.g., customer satisfaction

13
Attribute types: Binary
• Nominal attribute with only 2 states
• Symmetric binary: both outcomes equally important

• E.g., light switch (on and off), day and night, male and female

• Asymmetric binary: outcomes not equally important

• Convention: assign 1 to the most important outcome (e.g., a positive HIV test)
• E.g., a medical test, where a positive result is more significant; Rh blood type, where Rh positive is more common
14
Attribute types: Numeric
Interval numeric attribute
• Measured on a scale of equal-sized units
• Values have order (e.g., temperature in °C or °F, calendar dates)
• No true zero-point: we can compute the difference between two values, but we
cannot speak of one value as being a multiple of another
• E.g., "20°C is five degrees higher than 15°C" is valid, but "10°C is twice as warm
as 5°C" is not

Ratio numeric attribute
• Inherent zero-point
• A value can be expressed as a multiple (or ratio) of another value
• E.g., temperature in Kelvin (10 K is twice as high as 5 K), monetary quantities (you
are 100 times richer with $100 than with $1), measurements (height, weight)
15
Image credit: Google Sites

16
Attributes: Discrete vs. Continuous
• There are many ways to organize attribute types, which are
not mutually exclusive.
• Discrete attribute
• Only a finite or countably infinite set of values
• The values are sometimes represented as integers.
• Binary attributes are a special case of discrete attributes.
• Continuous attribute
• Real numbers of continuous domains
• The values are usually represented using a finite number of digits
→ floating-point variables

17
Quiz 01: Data types
1. For each of the following pairs of data types, give an example to
contrast the characteristics of the data types.
• Nominal data vs. Ordinal data
• Symmetric binary data vs. Asymmetric binary data
• Interval numeric data vs. Ratio numeric data

2. How to check the data type of a variable in Python?
Show the data type of the three variables, x, y, z, shown below.
x = 42
y = "Hello"
z = [1, 2, 3]

3. How to check the schema of a pandas DataFrame?
Check the schema of the DataFrame shown below.
df = [Link]({
    "name": ["Alice", "Bob"],
    "age": [25, 30],
    "salary": [50000.0, 60000.0]
})
18
Basic statistical data descriptions
Central tendency: Arithmetic mean
• Let $x_1, x_2, \dots, x_N$ be a set of $N$ values or observations for
some numeric attribute $X$.
• The arithmetic mean is defined as
$$\mu = \frac{1}{N} \sum_{i=1}^{N} x_i$$
• The weighted arithmetic mean is written as
$$\mu_w = \frac{\sum_{i=1}^{N} w_i x_i}{\sum_{i=1}^{N} w_i}$$
• where $w_i$ is the weight associated with $x_i$.

• The mean is the most common and effective numeric measure of the center of a set of data.

20
Central tendency: Arithmetic mean
• Consider the score records of John and Kelly.

Component     John's record   Kelly's record
Homework      92              100
Quiz          74              82
Lab           83              95
Test          76              70
Final exam    88              76

• The (non-weighted) mean scores are $\mu_{John} = 82.6$ and $\mu_{Kelly} = 84.6$.
• We now have the course grade distribution.

Component     Weight
Homework      15 %
Quiz          10 %
Lab           20 %
Test          25 %
Final exam    30 %

• The weighted mean scores are $\mu^w_{John} = 83.2$ and $\mu^w_{Kelly} = 82.5$. For example,
$$\mu^w_{John} = \frac{0.15 \times 92 + 0.1 \times 74 + 0.2 \times 83 + 0.25 \times 76 + 0.3 \times 88}{0.15 + 0.1 + 0.2 + 0.25 + 0.3} = 83.2$$
21
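As a quick check of the two definitions, the sketch below recomputes the plain and weighted means from the score and weight tables of this slide. The helper names `mean` and `weighted_mean` and the dictionary layout are illustrative choices, not something the slides prescribe.

```python
# Scores and weights copied from the slide; variable names are illustrative.
john = {"Homework": 92, "Quiz": 74, "Lab": 83, "Test": 76, "Final exam": 88}
kelly = {"Homework": 100, "Quiz": 82, "Lab": 95, "Test": 70, "Final exam": 76}
weights = {"Homework": 0.15, "Quiz": 0.10, "Lab": 0.20, "Test": 0.25, "Final exam": 0.30}


def mean(values):
    """Arithmetic mean: sum of the values divided by their count."""
    values = list(values)
    return sum(values) / len(values)


def weighted_mean(scores, weights):
    """Weighted mean: sum(w_i * x_i) / sum(w_i)."""
    total = sum(weights[k] * scores[k] for k in scores)
    return total / sum(weights[k] for k in scores)


john_mean, kelly_mean = mean(john.values()), mean(kelly.values())
john_wmean = weighted_mean(john, weights)
kelly_wmean = weighted_mean(kelly, weights)
print(round(john_mean, 1), round(kelly_mean, 1))    # 82.6 84.6
print(round(john_wmean, 1), round(kelly_wmean, 1))  # 83.2 82.5
```

Kelly has the higher plain mean but the lower weighted mean, because her weakest scores fall on the two most heavily weighted components. In NumPy, `numpy.average(values, weights=...)` computes the same weighted mean.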
Central tendency: Arithmetic mean
• Means are highly sensitive to extreme values (e.g., outliers).
• Trimmed mean: chop extreme values before calculating the
regular mean

4 14 19 20 22 24 25 26 26 99

• Typical mean: 27.9
• Trimmed mean (remove 10% of the observations from each side): 22

22
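The trimming step can be sketched in a few lines, assuming the trimmed proportion is given per side, as on the slide. The function name `trimmed_mean` is hypothetical.

```python
def trimmed_mean(values, proportion):
    """Drop `proportion` of the observations from each end, then average."""
    ordered = sorted(values)
    k = int(len(ordered) * proportion)  # observations to drop per side
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)


data = [4, 14, 19, 20, 22, 24, 25, 26, 26, 99]
plain = sum(data) / len(data)
trimmed = trimmed_mean(data, 0.10)
print(plain, trimmed)  # 27.9 22.0
```

Dropping one value from each side removes both 4 and 99, so the single outlier no longer pulls the mean upward. SciPy offers a comparable helper, `scipy.stats.trim_mean`.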
Central tendency: Mode
• Mode is the value that occurs most frequently in the data,
defined for both qualitative and quantitative attributes.
• If each data value occurs only once, then there is no mode

23
Central tendency: Median
• Suppose that the given set of 𝑁 observations is sorted.
• Median is the middle value of the ordered set.
• 𝑁 is odd: pick the exact middle value; otherwise, take the average of
the two middlemost values.
• Midrange is the average of the largest and smallest values
in the set.
4 4 4 9 15 15 15 27 37 48
mean = 17.8 – mode: 4 and 15 – midrange = 26, median = (15+15)/2 = 15

3 3 6 9 15 15 15 27 27 37 48
mean = 18.636 – mode: 15 – midrange = 25.5, median = 15
24
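The two worked examples on this slide can be reproduced with Python's standard `statistics` module. The wrapper `summarize` is an illustrative name; `statistics.multimode` is used so that both modes of the first series are reported.

```python
import statistics


def summarize(values):
    """Return mean, median, all modes, and midrange of a numeric sample."""
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "modes": statistics.multimode(values),  # every most-frequent value
        "midrange": (min(values) + max(values)) / 2,
    }


first = summarize([4, 4, 4, 9, 15, 15, 15, 27, 37, 48])
second = summarize([3, 3, 6, 9, 15, 15, 15, 27, 27, 37, 48])
print(first)
print(second)
```

The first series is bimodal (4 and 15), while the second has the single mode 15.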
Symmetric data vs. Skew data

symmetric

positively skewed negatively skewed

• For moderately skewed unimodal numeric data, the empirical formula is

𝑚𝑒𝑎𝑛 − 𝑚𝑜𝑑𝑒 ≈ 3 × (𝑚𝑒𝑎𝑛 − 𝑚𝑒𝑑𝑖𝑎𝑛)
25
Quiz 02: Mean, mode, and Midrange
1. Consider the following 1D data series, which includes 13 data points.
31, 40, 19, 45, 5, 18, 30, 5, 33, 33, 25, 5, 20
Compute the following values: arithmetic mean, midrange, median,
and mode.

2. For each of the following values, identify whether pandas provides a
function to calculate the value.
• Arithmetic mean
• Weighted arithmetic mean
• Median, mode
• Midrange
For each available function, show the result for the above data.

26
Data dispersion: Quantiles
• Let 𝑥1 , 𝑥2 , … , 𝑥𝑁 be a set of 𝑁 observations sorted in
increasing order for a numeric attribute 𝑋.
• Quantiles are points taken at regular intervals of a data
distribution, dividing it into equal-sized consecutive sets.
• kth q-quantile ($0 < k < q$, $k \in \mathbb{N}^*$): a value $x$ such that at most
$k/q$ of the data values are less than $x$ and at most $(q - k)/q$ of the data values are greater than $x$.
• There are 𝑞 − 1 q-quantiles.

27
Data dispersion: Quantiles
• Quartiles (4-quantiles) split the data distribution into four equal
parts.

• Percentiles (100-quantiles): 100 equal-sized consecutive sets.

• 2-quantile is the median that splits the distribution into halves.

28
Data dispersion: Interquartile range
• Interquartile range (IQR) is the distance between the first
and third quartiles.
𝐼𝑄𝑅 = 𝑄3 − 𝑄1

• Range is the difference between the largest and smallest
values in the set.

30 36 47 50 52 52 56 60 63 70 70

• Here 𝑄1 = 47, 𝑄2 (median) = 52, and 𝑄3 = 63, so IQR = 63 − 47 = 16 and range = 70 − 30 = 40.
29
How to determine the quartile?
• Use the median to divide the ordered set into two halves.
• If the original set has an even number of points, split it exactly in half
• Otherwise, do not include the median in either half.
• 𝑄1 and 𝑄3 are the medians of the lower and upper halves,
respectively.

6 7 15 36 39 40 41 42 43 47 49
𝑄1 = 15, 𝑄2 = 40, 𝑄3 = 43

7 15 36 39 40 41
𝑄1 = 15, 𝑄2 = 37.5, 𝑄3 = 40
30
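The median-split rule above can be sketched as follows. The function name `quartiles` is hypothetical, and the code follows this slide's rule rather than any particular library's definition.

```python
import statistics


def quartiles(values):
    """Q1, Q2, Q3 by the median-split rule described on the slide."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]           # excludes the median when n is odd
    upper = ordered[half + n % 2:]
    return (statistics.median(lower),
            statistics.median(ordered),
            statistics.median(upper))


print(quartiles([6, 7, 15, 36, 39, 40, 41, 42, 43, 47, 49]))  # (15, 40, 43)
print(quartiles([7, 15, 36, 39, 40, 41]))                     # (15, 37.5, 40)
```

Library routines such as `numpy.quantile` or pandas `Series.quantile` use linear interpolation by default, so their quartiles may differ from the values obtained with this rule.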
Quiz 03: Quantiles
1. You are given the following dataset representing the scores of 15
students in a math exam, already sorted.

45, 48, 52, 55, 62, 67, 70, 72, 75, 77, 80, 85, 87,
90, 95
Compute the first quartile (Q1), second quartile (Q2), and third quartile
(Q3), and IQR.
2. Identify whether pandas supports the calculation of quartiles and IQR.
For each available function, show the result for the above data.

31
Data dispersion: Boxplot
• A five-number summary of a distribution includes
• The median (𝑸𝟐 ), the quartiles 𝑸𝟏 and 𝑸𝟑 ,
• The smallest (𝑴𝒊𝒏) and largest (𝑴𝒂𝒙) individual values.

• This summary is presented by a boxplot.

32
Data dispersion: Boxplot

• The two whiskers extend to the smallest and largest values within
$[Q_1 - 1.5 \times IQR,\ Q_3 + 1.5 \times IQR]$.
• Outliers: points that fall outside the above range, plotted individually
33
Data dispersion: Boxplot
Boxplot for the unit price data for items
sold at four branches of AllElectronics
during a given time period

• For Branch 1, the median price of items sold is $80, 𝑄1 is $60, and 𝑄3 is
$100. Notice that two outlying observations, 175 and 202, were plotted
individually as they lie more than 1.5 × IQR (here, IQR = 40) above 𝑄3.
34
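A minimal sketch of the outlier rule for Branch 1, assuming only the quartiles quoted on the slide. The price list is made up for illustration, except for the two outlying values 175 and 202; `whisker_fences` is a hypothetical helper name.

```python
def whisker_fences(q1, q3, k=1.5):
    """Lower and upper fences Q1 - k*IQR and Q3 + k*IQR."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Branch 1 quartiles come from the slide; the other prices are made up.
low, high = whisker_fences(60, 100)
prices = [40, 55, 60, 80, 100, 120, 175, 202]
outliers = [p for p in prices if p < low or p > high]
print(low, high, outliers)  # 0.0 160.0 [175, 202]
```

With IQR = 40 the upper fence is 100 + 1.5 × 40 = 160, which is why 175 and 202 are drawn as individual points.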
Quiz 04: Draw a box plot
1. Consider the following 1D data series, which includes 15 data points
sorted in ascending order.
21, 25, 27, 29, 32, 36, 36, 48, 67, 70, 74, 75,
79, 150, 197
• Define the five-number summary for the above data.
• Draw the boxplot representing the above five-number summary.
Note the vertical axis and all the values.

2. How to draw a boxplot in Python?

3. Can scikit-learn be used to draw a boxplot? If yes, how?

35
Data dispersion: Variance
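• The (population) variance is defined as
$$\sigma^2 = \frac{1}{N} \sum_{i=1}^{N} (x_i - \bar{x})^2 = \left( \frac{1}{N} \sum_{i=1}^{N} x_i^2 \right) - \bar{x}^2$$
• where $\bar{x}$ is the mean of the observations.

• The standard deviation $\sigma$ is the square root of the variance.

• Low $\sigma$ → the data tends to be very close to the mean.
• High $\sigma$ → the data spreads out over a large range of values.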

[𝜇 − 𝜎, 𝜇 + 𝜎] [𝜇 − 2𝜎, 𝜇 + 2𝜎] [𝜇 − 3𝜎, 𝜇 + 3𝜎]

36
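The sketch below checks numerically that the two forms of the population variance agree. The sample is an illustrative choice (the series from the trimmed-mean slide), and the variable names are assumptions.

```python
import math
import statistics

data = [4, 14, 19, 20, 22, 24, 25, 26, 26, 99]  # illustrative sample
n = len(data)
mean = sum(data) / n

var_definition = sum((x - mean) ** 2 for x in data) / n
var_shortcut = sum(x * x for x in data) / n - mean ** 2
std = math.sqrt(var_definition)

# statistics.pvariance also divides by N (population variance).
assert math.isclose(var_definition, statistics.pvariance(data))
assert math.isclose(var_definition, var_shortcut)
print(round(var_definition, 2), round(std, 2))  # 602.69 24.55
```

Note that pandas `Series.var()` and `Series.std()` divide by N − 1 by default (`ddof=1`, the sample variance); passing `ddof=0` gives the population form defined on this slide.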
Image credit: ResearchGate

Box plot and probability density function of a normal distribution.
37

Types of outliers

38
Quiz 05: Variance and standard deviation
1. Consider the following 1D data series, which includes 15 data points
sorted in ascending order.
21, 25, 27, 29, 32, 36, 36, 48, 67, 80, 84,
85, 89, 92, 97
Compute the variance and standard deviation.

2. Does pandas provide the variance and standard deviation for a

dataframe? If yes, how?
For each available function, show the result for the above data.

39
Basic data visualization
Why data visualization?
• Gain insight into an information space by mapping data onto
graphical primitives
• Provide qualitative overview of large datasets
• Search for patterns, trends, irregularities, relationships
• Help find interesting regions and suitable parameters for
further quantitative analysis
• Provide a visual proof of computer representations derived

41
Bar chart
• A bar chart presents nominal data by using rectangular
bars with heights proportional to the values represented.

42
Histogram
• The range of values for a numeric attribute 𝑋 is partitioned
into disjoint consecutive subranges, called buckets or bins.
• Each bar is for a subrange such that its height represents
the total items within the subrange.

Equal-width: equal bucket range

Equal-frequency: equal bucket depth

43
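The difference between the two bucketing schemes can be sketched as below. The price list and both function names are made up for illustration, and the equal-width helper assumes the values are not all identical.

```python
def equal_width_counts(values, bins):
    """Count items per bucket when every bucket spans the same range."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins
    counts = [0] * bins
    for v in values:
        idx = min(int((v - lo) / width), bins - 1)  # max value goes in last bin
        counts[idx] += 1
    return counts


def equal_frequency_buckets(values, bins):
    """Split the sorted values into buckets holding (almost) equally many items."""
    ordered = sorted(values)
    size, extra = divmod(len(ordered), bins)
    buckets, start = [], 0
    for b in range(bins):
        end = start + size + (1 if b < extra else 0)
        buckets.append(ordered[start:end])
        start = end
    return buckets


prices = [1, 1, 5, 5, 5, 8, 8, 10, 10, 12, 14, 15, 18, 20, 21, 25, 28, 30]
width_counts = equal_width_counts(prices, 3)
depth_sizes = [len(b) for b in equal_frequency_buckets(prices, 3)]
print(width_counts)  # [9, 5, 4]
print(depth_sizes)   # [6, 6, 6]
```

Equal-width buckets keep the ranges fixed and let the counts vary, while equal-frequency buckets fix the counts and let the ranges vary.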
Histogram: An example

44
Histogram over boxplot
• The two following histograms may have the same boxplot.
• However, they represent rather different data distributions.

45
Quantile plot
• A quantile plot displays quantile information for a
univariate data distribution.
• It allows us to assess both overall behavior and unusual occurrences.
• Let $x_1, x_2, \dots, x_N$ be the data observations sorted in
increasing order for some ordinal or numeric attribute $X$.
• Each value $x_i$ is paired with $f_i = \frac{i - 0.5}{N}$, indicating that
approximately $f_i \times 100\%$ of the data are $\le x_i$.

46
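The pairing of each sorted value with its fraction f_i can be sketched as follows. The sample values are made up, and `quantile_plot_points` is a hypothetical name; plotting the resulting pairs (f on the horizontal axis, x on the vertical axis) would give the quantile plot.

```python
def quantile_plot_points(values):
    """Pair each sorted value x_i with f_i = (i - 0.5) / N, for i = 1..N."""
    ordered = sorted(values)
    n = len(ordered)
    return [((i - 0.5) / n, x) for i, x in enumerate(ordered, start=1)]


points = quantile_plot_points([75, 40, 47, 43, 74])  # made-up unit prices
for f, x in points:
    print(f"{f:.1f} -> {x}")
```

With five observations the fractions are 0.1, 0.3, 0.5, 0.7, and 0.9, so the middle value 47 sits at f = 0.5, the median.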
Quantile plot: An example

Quantile plot for the unit price data

47
Quantile-Quantile plot
• A quantile-quantile plot draws the quantiles of one univariate
distribution against the corresponding quantiles of another.

Is there a shift in going from one distribution to another?

At Q1, the unit prices of items sold at Branch 1 tend to be lower than those at Branch 2

48
Scatter plot
• A scatter plot looks at the bivariate data to see clusters of
points or outliers
• Each pair of values is treated as a pair of coordinates and plotted as
points in the plane.
(Figure: the correlation between unit price and items sold)

49
Scatter plot: Data correlation

(Figure panels: positively correlated, negatively correlated, and uncorrelated data)
50
Quiz 06: Scatter plot
1. Consider the following data table, in which there are five tuples of two
attributes, A and B.

No.   A    B
1     19   16
2     25   10
3     13   26
4     12   29
5     16   20

Draw the scatter plot whose horizontal axis denotes attribute A and whose
vertical axis represents attribute B.

2. How to draw a scatter plot using some Python library? Draw the
scatter plot for the above data.

51
Data proximity measures
Similarity and Dissimilarity
Similarity
• A numerical measure of how alike two data objects, 𝑖 and 𝑗, are
• Values often fall in the range [0,1]: 0 – unalike → 1 – identical

Dissimilarity (distance)
• A numerical measure of how different two data objects are
• It works in an opposite direction to some similarity measure
• The lower bound is often 0, while the upper limit varies

Proximity
• This refers to either similarity or dissimilarity
53
Feature matrix vs. Dissimilarity matrix
• Feature matrices are essential to most machine learning tasks.

Feature matrix
$$\begin{bmatrix}
x_{11} & \cdots & x_{1f} & \cdots & x_{1p} \\
\vdots & & \vdots & & \vdots \\
x_{i1} & \cdots & x_{if} & \cdots & x_{ip} \\
\vdots & & \vdots & & \vdots \\
x_{n1} & \cdots & x_{nf} & \cdots & x_{np}
\end{bmatrix}$$
• $n$ data points with $p$ dimensions
• Object-by-attribute structure

Dissimilarity matrix
$$\begin{bmatrix}
0 & & & & \\
d(2,1) & 0 & & & \\
d(3,1) & d(3,2) & 0 & & \\
\vdots & \vdots & \vdots & \ddots & \\
d(n,1) & d(n,2) & \cdots & \cdots & 0
\end{bmatrix}$$
• A collection of distances for all pairs of $n$ objects
• Object-by-object structure

• Many nearest-neighbor algorithms use dissimilarity matrices.

54
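To make the two structures concrete, the sketch below derives a lower-triangular dissimilarity matrix from a small feature matrix. The three objects are made up, and Euclidean distance is just one possible choice for `distance`; both function names are hypothetical.

```python
import math


def euclidean(a, b):
    """Straight-line distance between two equal-length feature rows."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def dissimilarity_matrix(features, distance):
    """Lower-triangular matrix d(i, j) for every pair of rows in `features`."""
    n = len(features)
    return [[distance(features[i], features[j]) for j in range(i + 1)]
            for i in range(n)]


features = [[0, 0], [3, 4], [6, 8]]  # 3 made-up objects, 2 attributes each
matrix = dissimilarity_matrix(features, euclidean)
for row in matrix:
    print(row)
```

Only the lower triangle is stored because d(i, j) = d(j, i) and d(i, i) = 0.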
Measures for nominal attributes
• Let the number of states of a nominal attribute be $M$
• Method 1: Simple matching
$$d(i, j) = \frac{p - m}{p}$$
• $m$: the number of attributes for which $i$ and $j$ are in the same state,
• $p$: the total number of attributes describing the objects

• Method 2: Create a binary attribute for each of the $M$ states

• Measures of similarity
$$sim(i, j) = 1 - d(i, j) = \frac{m}{p}$$
55
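Simple matching (Method 1) can be sketched as below. The two objects and their three nominal attributes are made up for illustration, and `simple_matching_distance` is a hypothetical name.

```python
def simple_matching_distance(obj_i, obj_j):
    """d(i, j) = (p - m) / p for two objects with nominal attributes."""
    p = len(obj_i)                                      # number of attributes
    m = sum(1 for a, b in zip(obj_i, obj_j) if a == b)  # matching states
    return (p - m) / p


# Made-up objects described by (color, weather, occupation).
a = ("red", "sunny", "teacher")
b = ("red", "rainy", "teacher")
d = simple_matching_distance(a, b)
sim = 1 - d
print(round(d, 2), round(sim, 2))  # 0.33 0.67
```

Two of the three attributes match, so m = 2, p = 3, and sim = m / p.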
Measures for binary attributes
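• Contingency table for two objects $i$ and $j$ described by binary attributes:
$q$ = number of attributes equal to 1 for both objects, $r$ = number equal to 1 for $i$ and 0 for $j$,
$s$ = number equal to 0 for $i$ and 1 for $j$, $t$ = number equal to 0 for both objects

• Symmetric binary variable
$$d(i, j) = \frac{r + s}{q + r + s + t}$$
• Asymmetric binary variable
$$d(i, j) = \frac{r + s}{q + r + s}$$
• Jaccard coefficient:
$$sim(i, j) = 1 - d(i, j) = \frac{q}{q + r + s}$$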
56
Measures for binary attributes

• Gender is symmetric binary, the remaining attributes are asymmetric

• Let the values Y and P be 1 and the value N be 0.
• Suppose that the distance between objects (patients) is computed based
only on the asymmetric attributes
$$d(Jack, Jim) = \frac{1 + 1}{1 + 1 + 1} = 0.67$$
$$d(Jack, Mary) = \frac{0 + 1}{2 + 0 + 1} = 0.33$$
$$d(Jim, Mary) = \frac{1 + 2}{1 + 1 + 2} = 0.75$$
57
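The three distances can be recomputed with a short sketch. The patient table itself did not survive in this text, so the 0/1 vectors below are an assumption: they encode six asymmetric attributes (fever, cough, and four tests, with Y/P = 1 and N = 0) and were chosen to be consistent with the q, r, s counts that appear in the formulas above.

```python
def asymmetric_binary_distance(x, y):
    """d = (r + s) / (q + r + s); 0-0 matches (t) are ignored."""
    q = sum(1 for a, b in zip(x, y) if a == 1 and b == 1)
    r = sum(1 for a, b in zip(x, y) if a == 1 and b == 0)
    s = sum(1 for a, b in zip(x, y) if a == 0 and b == 1)
    return (r + s) / (q + r + s)


# Assumed encoding of (fever, cough, test-1, test-2, test-3, test-4).
patients = {
    "Jack": [1, 0, 1, 0, 0, 0],
    "Jim":  [1, 1, 0, 0, 0, 0],
    "Mary": [1, 0, 1, 0, 1, 0],
}
d_jack_jim = asymmetric_binary_distance(patients["Jack"], patients["Jim"])
d_jack_mary = asymmetric_binary_distance(patients["Jack"], patients["Mary"])
d_jim_mary = asymmetric_binary_distance(patients["Jim"], patients["Mary"])
print(round(d_jack_jim, 2), round(d_jack_mary, 2), round(d_jim_mary, 2))
# 0.67 0.33 0.75
```

The function is undefined when two objects share no 1-valued attribute at all (q + r + s = 0); a real implementation would need to decide how to handle that case.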
Measures for numeric attributes
• Consider two $p$-dimensional data points
$i = (x_{i1}, x_{i2}, \dots, x_{ip})$ and $j = (x_{j1}, x_{j2}, \dots, x_{jp})$
• Minkowski distance ($L_h$ norm)
$$d(i, j) = \sqrt[h]{|x_{i1} - x_{j1}|^h + |x_{i2} - x_{j2}|^h + \dots + |x_{ip} - x_{jp}|^h}$$
• where $h \ge 1$ is the order

58
Measures for numeric attributes
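• $h = 1$: Manhattan (city block, $L_1$ norm) distance
$$d(i, j) = |x_{i1} - x_{j1}| + |x_{i2} - x_{j2}| + \dots + |x_{ip} - x_{jp}|$$

• $h = 2$: Euclidean ($L_2$ norm) distance
$$d(i, j) = \sqrt{(x_{i1} - x_{j1})^2 + (x_{i2} - x_{j2})^2 + \dots + (x_{ip} - x_{jp})^2}$$

• $h \to \infty$: “supremum” ($L_{max}$ / $L_\infty$ norm, Chebyshev) distance
$$d(i, j) = \lim_{h \to \infty} \left( \sum_{f=1}^{p} |x_{if} - x_{jf}|^h \right)^{1/h} = \max_{f = 1, \dots, p} |x_{if} - x_{jf}|$$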

59
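The three special cases can be compared on a pair of points. The 2-D points below are made up, and both function names are illustrative; the last line uses a large finite order to suggest how the Minkowski distance approaches the supremum distance.

```python
def minkowski(a, b, h):
    """L_h distance between two equal-length points (h >= 1)."""
    return sum(abs(x - y) ** h for x, y in zip(a, b)) ** (1 / h)


def supremum(a, b):
    """Limit of the Minkowski distance as h grows: the largest coordinate gap."""
    return max(abs(x - y) for x, y in zip(a, b))


i, j = (1, 2), (3, 5)  # made-up 2-D points
print(minkowski(i, j, 1))              # 5.0  (Manhattan)
print(round(minkowski(i, j, 2), 2))    # 3.61 (Euclidean)
print(supremum(i, j))                  # 3    (Chebyshev)
print(round(minkowski(i, j, 50), 2))   # 3.0, already close to the supremum
```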
Cosine similarity
• A document can be represented by a term-frequency vector with thousands of
attributes, each recording the frequency of a particular keyword in the document.

𝑠𝑖𝑚(𝑑1, 𝑑2) = 0.94

60
Cosine similarity
• Let $d_1$ and $d_2$ be two vectors (e.g., term-frequency vectors).
• Cosine similarity is non-metric:
$$sim(d_1, d_2) = \frac{d_1 \cdot d_2}{\|d_1\| \, \|d_2\|}$$
• where $\cdot$ is the vector dot product and $\|d\|$ is the length of vector $d$
• sim = 0 means no match, while sim = 1 means a complete match.

61
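A minimal sketch of the formula follows. The document table behind the value 0.94 is not present in this text, so the two term-frequency vectors are an assumption, chosen so that the result agrees with that figure.

```python
import math


def cosine_similarity(d1, d2):
    """Dot product of the vectors divided by the product of their lengths."""
    dot = sum(x * y for x, y in zip(d1, d2))
    norm1 = math.sqrt(sum(x * x for x in d1))
    norm2 = math.sqrt(sum(y * y for y in d2))
    return dot / (norm1 * norm2)


# Assumed term-frequency vectors over ten keywords.
d1 = [5, 0, 3, 0, 2, 0, 0, 2, 0, 0]
d2 = [3, 0, 2, 0, 1, 1, 0, 1, 0, 1]
similarity = cosine_similarity(d1, d2)
print(round(similarity, 2))  # 0.94
```

Here the dot product is 25 and the vector lengths are √42 and √17, giving 25 / √714 ≈ 0.94.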
Measures for ordinal attributes
• The range of a numeric attribute can be mapped to an
ordinal attribute $f$ having $M_f$ states.
• E.g., temperature: cold (−30°C to −10°C), moderate (−10°C to 10°C), and
warm (10°C to 30°C)
• Let $M_f$ represent the number of possible ordered states,
which define the ranking $1, \dots, M_f$
• Replace each $x_{if}$ by its corresponding rank, $r_{if} \in \{1, \dots, M_f\}$
• Replace the rank $r_{if}$ of the $i$-th object by
$$z_{if} = \frac{r_{if} - 1}{M_f - 1}$$
• Continue with any measure for numeric attributes
62
Measures for ordinal attributes

• test-2 = {fair, good, excellent}, i.e., 𝑀𝑓 = 3

• The ranks of four objects are 3, 1, 2, and 3, respectively
• Map the rank 1 to 0.0, rank 2 to 0.5, and rank 3 to 1.0
• Dissimilarity matrix using Euclidean distance

63
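The steps of this example can be sketched as follows, using the ranks stated on the slide. The function name `normalize_ranks` is hypothetical; with a single attribute, the Euclidean distance reduces to an absolute difference.

```python
def normalize_ranks(ranks, num_states):
    """Map rank r in {1..M_f} onto [0, 1] via z = (r - 1) / (M_f - 1)."""
    return [(r - 1) / (num_states - 1) for r in ranks]


ranks = [3, 1, 2, 3]           # test-2 ranks of the four objects
z = normalize_ranks(ranks, 3)  # [1.0, 0.0, 0.5, 1.0]

# With one attribute, Euclidean distance is simply |z_i - z_j|.
matrix = [[abs(z[i] - z[j]) for j in range(i + 1)] for i in range(len(z))]
for row in matrix:
    print(row)
```

Objects 1 and 4 share the rank 3, so their dissimilarity is 0, while objects 1 and 2 are maximally dissimilar (1.0).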
Measures for attributes of mixed types
• Suppose that the dataset has $p$ attributes of mixed type.
• The distance between objects $i$ and $j$ is
$$d(i, j) = \frac{\sum_{f=1}^{p} \delta_{ij}^{(f)} d_{ij}^{(f)}}{\sum_{f=1}^{p} \delta_{ij}^{(f)}}$$
• $\delta_{ij}^{(f)} = 0$ if (1) $x_{if}$ or $x_{jf}$ is missing, or (2) $x_{if} = x_{jf} = 0$ and attribute
$f$ is asymmetric binary. Otherwise, $\delta_{ij}^{(f)} = 1$
• If $f$ is numeric: $d_{ij}^{(f)} = \frac{|x_{if} - x_{jf}|}{\max_h x_{hf} - \min_h x_{hf}}$, where $h$ runs over all
nonmissing objects for attribute $f$
• If $f$ is nominal or binary: $d_{ij}^{(f)} = 0$ if $x_{if} = x_{jf}$; otherwise, $d_{ij}^{(f)} = 1$
• If $f$ is ordinal: compute $r_{if}$ and treat $z_{if} = \frac{r_{if} - 1}{M_f - 1}$ as numeric

64
Measures for attributes of mixed types
(Figures: dissimilarity matrices of test-1, test-2, and test-3)

• $\delta_{ij}^{(f)} = 1$ for each attribute $f$
• $d(3, 1) = \frac{1(1) + 1(0.50) + 1(0.45)}{3} = 0.65$

• The resulting dissimilarity matrix

65
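The combination step for d(3, 1) can be sketched from the three per-attribute dissimilarities quoted above. The function name `mixed_distance` and the second call, which pretends that the middle attribute is missing, are illustrative assumptions.

```python
def mixed_distance(per_attribute, indicators=None):
    """d(i, j) = sum(delta_f * d_f) / sum(delta_f) over all attributes f."""
    if indicators is None:
        indicators = [1] * len(per_attribute)  # every attribute contributes
    weighted = sum(delta * d for delta, d in zip(indicators, per_attribute))
    return weighted / sum(indicators)


# Per-attribute dissimilarities between objects 3 and 1, as quoted on the slide.
d_31 = mixed_distance([1, 0.50, 0.45])
print(round(d_31, 2))  # 0.65

# Hypothetical case: the second attribute is missing, so its indicator is 0.
d_31_missing = mixed_distance([1, 0.50, 0.45], indicators=[1, 0, 1])
print(round(d_31_missing, 3))  # 0.725
```

An attribute with indicator 0 drops out of both the numerator and the denominator, so missing values do not bias the distance toward 0.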
Quiz 07: Jaccard coefficient
1. Calculate the similarity between these two observations, in which all
the attributes are asymmetric binary.

ID   fever   cough   breathing difficulty   fatigue   headache   loss of taste   sore throat
1    1       1       1                      0         0          1               1
2    0       1       0                      0         1          1               1

2. Explore the distance and similarity metrics supported in scikit-learn.

For each available metric, calculate the distance/similarity between the
two observations above.

66
Correlation analysis
𝜒²-test for correlation analysis
• Suppose attribute $A$ has $c$ distinct values and attribute $B$ has $r$
distinct values. There are $n$ data tuples.
• Let $(A_i, B_j)$ denote the joint event that $A = a_i$ and $B = b_j$.
• The $\chi^2$-test checks the null hypothesis that $A$ and $B$ are independent
$$\chi^2 = \sum_{i=1}^{c} \sum_{j=1}^{r} \frac{(o_{ij} - e_{ij})^2}{e_{ij}}$$
• $o_{ij}$: observed frequency (i.e., actual count) of $(A = a_i, B = b_j)$
• $e_{ij}$: expected frequency of $(A_i, B_j)$, $e_{ij} = \frac{count(A = a_i) \times count(B = b_j)}{n}$

• The larger the $\chi^2$ value, the more likely the variables are related.

68
𝜒²-test: A numerical example
• Consider the contingency table below.

(Numbers in parentheses are expected counts calculated
based on the data distribution in the two categories)

• Are gender and preferred_reading correlated?
$$\chi^2 = \frac{(250 - 90)^2}{90} + \frac{(50 - 210)^2}{210} + \frac{(200 - 360)^2}{360} + \frac{(1000 - 840)^2}{840} = 507.93$$
• The two attributes are (strongly) correlated for the given group of people
• However, correlation does not imply causality.
• # of hospitals and # of car thefts in a city are correlated
• However, both are causally linked to a third variable: population.
69
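The statistic can be recomputed from the four observed counts that appear in the formula. The table layout (rows for fiction and non-fiction, columns for male and female) is an assumption inferred from the expected counts; `chi_square` is a hypothetical helper.

```python
def chi_square(observed):
    """Pearson chi-square statistic and expected counts for a 2-D table."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    expected = [[r * c / n for c in col_totals] for r in row_totals]
    stat = sum((o - e) ** 2 / e
               for obs_row, exp_row in zip(observed, expected)
               for o, e in zip(obs_row, exp_row))
    return stat, expected


# Rows: fiction, non-fiction; columns: male, female (layout assumed).
observed = [[250, 200],
            [50, 1000]]
stat, expected = chi_square(observed)
print(expected)        # [[90.0, 360.0], [210.0, 840.0]]
print(round(stat, 2))  # 507.94
```

The unrounded sum is about 507.937; the slide's 507.93 results from rounding each of the four terms before adding. With SciPy, `scipy.stats.chi2_contingency(observed, correction=False)` should give the same statistic; by default it applies Yates' continuity correction to 2 × 2 tables.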
Expected frequency
• Expected frequency is the theoretical count in a cell of a
contingency table if the two variables are independent.
$$e_{ij} = \frac{(\text{Row total}_i) \times (\text{Column total}_j)}{\text{Total}}$$
• $e_{ij}$: expected frequency for the cell in row $i$ and column $j$
• $\text{Row total}_i$: count for row $i$, $\text{Column total}_j$: count for column $j$
• $\text{Total}$: total number of observations in the entire table

• If the two variables are independent, every row should show the same
distribution across the columns as the column totals do (and likewise for every column).

70
Expected frequency: An example
• Consider the contingency table below.

               Yes   No   Row Total
Male           30    20   50
Female         40    10   50
Column Total   70    30   100

• Compute the expected frequency for each cell:
1. $E_{Male-Yes} = \frac{50 \times 70}{100} = 35$
2. $E_{Female-Yes} = \frac{50 \times 70}{100} = 35$
3. $E_{Male-No} = \frac{50 \times 30}{100} = 15$
4. $E_{Female-No} = \frac{50 \times 30}{100} = 15$

71
Expected frequency: Notes
• 𝜒 2 -test relies on an approximation that works best with
sufficiently large expected counts.
• Classical rule: Each expected frequency should be ≥ 5.
• Modern guideline: For larger tables, ≥ 1 is acceptable if no
more than 20% of cells are below 5.

72
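The expected counts of the Male/Female example and the classical "at least 5" check can be sketched together. The function name `expected_counts` and the nested-list layout are illustrative choices.

```python
def expected_counts(table):
    """e_ij = row_total_i * column_total_j / total for a 2-D list of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    return [[r * c / total for c in col_totals] for r in row_totals]


table = [[30, 20],   # Male:   Yes, No
         [40, 10]]   # Female: Yes, No
expected = expected_counts(table)
print(expected)  # [[35.0, 15.0], [35.0, 15.0]]

# Classical rule of thumb: every expected count should be at least 5.
rule_ok = all(e >= 5 for row in expected for e in row)
print(rule_ok)   # True
```

Both rows have the same total (50), so their expected counts coincide even though the observed counts differ.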
𝜒²-test: Contingency table
• A contingency table (or crosstab) displays the frequency
distribution of two or more categorical variables.
• It helps to analyze the relationship between variables.

(Figure: a contingency table annotated with the values of the first variable, the values of the second variable, the marginal totals, and the grand total)
73
𝜒²-test: Contingency table
• Is a contingency table able to represent categorical variables
that have more than two values?

               18-29   30-49   50+   Row Total
Coffee         30      40      20    90
Tea            20      35      25    80
Juice          25      15      10    50
Column Total   75      90      55    220

74
𝜒²-test: Contingency table
• Can a contingency table represent the relationship of more
than two categorical variables?

75
𝜒²-test: An example
• The test is based on a significance
level with a DOF of 1.
• If the null hypothesis is rejected, 𝐴 and
𝐵 are statistically correlated.

76
𝜒²-test: Degree of freedom (DOF)
• DOF is the number of independent values that can vary in a
statistical calculation without breaking constraints.
• It can be calculated from the contingency table as follows.
𝑫𝑶𝑭 = (𝒓 − 𝟏) ∙ (𝒄 − 𝟏)
• 𝑟 is the number of rows and 𝑐 is the number of columns.
• E.g., in a table with 3 rows and 2 columns, DOF = (3 – 1)(2 – 1) = 2.

77
𝜒²-test: Significance levels
• The significance level 𝛼 is the probability threshold to decide
whether to reject the null hypothesis.
• It represents the risk of rejecting a true null hypothesis.

• Common significance levels

• 0.05 (5%): The most common choice, indicating a 5% risk of wrongly
rejecting the null hypothesis.
• 0.01 (1%): Used for stricter criteria, implying a 1% risk.
• 0.10 (10%): Sometimes used in exploratory studies where a higher
risk is acceptable.

78
Quiz 08: 𝜒²-test
1. Consider the data that relate the sex of children in families who have
two children. Apply the 𝜒² statistic at the 0.001 significance level.

                        First child
                        Male   Female   Total
Second child   Male     114    131      245
               Female   132    123      255
Total                   246    254      500

2. How to compute the 𝜒² statistic using some library in Python?

79
Pearson correlation coefficient
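• Consider two numeric attributes $A$ and $B$, and a set of $n$
observations $(a_1, b_1), \dots, (a_n, b_n)$.
• Pearson’s product moment coefficient
$$r_{A,B} = \frac{\sum_{i=1}^{n} (a_i - \bar{A})(b_i - \bar{B})}{n \sigma_A \sigma_B} = \frac{\sum_{i=1}^{n} a_i b_i - n \bar{A} \bar{B}}{n \sigma_A \sigma_B}$$
• $\bar{A}, \bar{B}, \sigma_A, \sigma_B$: means and standard deviations of $A$ and $B$, respectively
• $\sum a_i b_i$: sum of the $AB$ cross-product

• $r_{A,B} \to -1$: negative correlation
• $r_{A,B} = 0$: no linear correlation between $A$ and $B$ (independent attributes have $r_{A,B} = 0$, but $r_{A,B} = 0$ alone does not prove independence)
• $r_{A,B} \to 1$: positive correlation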

80
Pearson correlation coefficient
(Figure: several sets of (x, y) points, with the Pearson correlation coefficient of x and y for each set.)
The correlation reflects the noisiness and direction of a linear relationship (top row), but
not the slope of that relationship (middle), nor many aspects of nonlinear relationships
(bottom). N.B.: the figure in the center has a slope of 0 but in that case the correlation
coefficient is undefined because the variance of Y is zero. (Wikipedia)
81
Covariance analysis
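• The covariance between $A$ and $B$ is defined as
$$Cov(A, B) = \frac{\sum_{i=1}^{n} (a_i - \bar{A})(b_i - \bar{B})}{n} = E(A \cdot B) - \bar{A} \bar{B}$$
• where $E(A) = \bar{A} = \frac{\sum_{i=1}^{n} a_i}{n}$ and $E(B) = \bar{B} = \frac{\sum_{i=1}^{n} b_i}{n}$ are the expected
values of $A$ and $B$

• $Cov(A, B) > 0$: positive covariance
• $Cov(A, B) < 0$: negative covariance
• $Cov(A, B) = 0$: holds whenever $A$ and $B$ are independent (the converse is not true in general)

• Covariance vs. correlation: $r_{A,B} = \frac{Cov(A, B)}{\sigma_A \sigma_B}$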

82
Covariance analysis: An example
• If the stocks are affected by the
same industry trends, will their
prices rise or fall together?

• $E(AllElectronics) = \frac{6 + 5 + 4 + 3 + 2}{5} = \frac{20}{5} = \$4$
• $E(HighTech) = \frac{20 + 10 + 14 + 5 + 5}{5} = \frac{54}{5} = \$10.80$
• $Cov(AllElectronics, HighTech) = \frac{6 \times 20 + 5 \times 10 + 4 \times 14 + 3 \times 5 + 2 \times 5}{5} - 4 \times 10.80 = 7$

• Therefore, the positive covariance indicates that the stock prices of both
companies rise together
83
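The stock example can be reproduced with the population formulas from the previous slides, which also yields the Pearson coefficient for the same data. The function names are illustrative, and the prices are those that appear in the calculation above.

```python
import math


def covariance(a, b):
    """Population covariance: E(A*B) - mean(A) * mean(B)."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    return sum(x * y for x, y in zip(a, b)) / n - mean_a * mean_b


def pearson(a, b):
    """r = Cov(A, B) / (sigma_A * sigma_B), with population standard deviations."""
    sigma_a = math.sqrt(covariance(a, a))  # Cov(A, A) is the variance of A
    sigma_b = math.sqrt(covariance(b, b))
    return covariance(a, b) / (sigma_a * sigma_b)


all_electronics = [6, 5, 4, 3, 2]
high_tech = [20, 10, 14, 5, 5]
cov = covariance(all_electronics, high_tech)
r = pearson(all_electronics, high_tech)
print(round(cov, 2))  # 7.0
print(round(r, 3))    # 0.867
```

Note that pandas `DataFrame.cov()` normalizes by N − 1 by default, so it would report 8.75 for these data instead of 7; the correlation coefficient is unaffected by that choice.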
Quiz 09: Correlation tests
1. Consider the following data table, in which there are five tuples of two
attributes, A and B.

No.   A    B
1     19   16
2     25   10
3     13   26
4     12   29
5     16   20

Calculate the Pearson correlation coefficient and the covariance between
A and B.

2. How to compute the above metrics using pandas?
Show the results for the above data.

84
References
• Jiawei Han, Micheline Kamber, and Jian Pei, 2011. Data Mining:
Concepts and Techniques (3rd ed.). Morgan Kaufmann Publishers Inc.
Chapter 2 and Chapter 3.
