
Exploratory Data Analysis Overview

The document provides an overview of Exploratory Data Analysis (EDA) and Initial Data Analysis (IDA), emphasizing their roles in summarizing data characteristics and ensuring data quality. It covers data objects, attributes, types, and basic statistical descriptions, including measures of central tendency and dispersion. Additionally, it discusses data visualization techniques and the importance of data cleaning and preprocessing.

Uploaded by

ndkhoa232

Exploratory Data Analysis (P1)

Nguyen Ngoc Thao


nnthao@[Link]
Content outline
• Data objects and Attributes
• Basic statistical data descriptions
• Basic data visualization
• Data proximity measures
• Data correlation analysis

Exploratory data analysis (EDA)
• EDA analyzes data to summarize their main characteristics,
using statistical graphics and data visualization methods.
• Data scientists use EDA to determine the best way to handle
data sources to get the answers they need.

Initial Data Analysis (IDA)
• IDA focuses on examining the structure, quality, and basic
characteristics of a dataset.
• Purposes:
• Understand the basic structure and quality of the dataset.
• Detect and fix problems (e.g., missing values, incorrect types,
inconsistencies).
• Ensure the data is clean, valid, and ready for further analysis.

• IDA makes sure that the data is in good shape, while EDA
explores well-prepared data to uncover insights.

Common tasks in EDA (and IDA)

• Data quality
• Data types
• Data preprocessing
• Data distribution
• Outliers
• Pattern discovery
• Correlation
• Data visualization

Data objects and Attributes
Data collection: Record datasets
• Relational / transactional tuples
• Term-frequency vectors, numerical matrices, crosstabs

Data collection: Graph datasets
• The Internet, social networks, molecular structures

Data collection: Ordered datasets
• Sequential data: transaction sequences, genetic sequences
• Video data, temporal data, time-series data, etc.

Data objects
• A data object depicts an entity, serving as the building block
for a dataset.
• Similar terms: sample, example, instance, data point, and tuple

• Examples of data sources: a sales database, a medical database, a university database
• Data objects are described by attributes.
• In a database: rows → data objects, columns → attributes

Attributes
• An attribute shows some characteristic of a data object.
• Similar terms: dimension, feature, and variable
• E.g., a Customer object has 3 attributes {id, name, address}
• Observation: an observed value for a given attribute
• Feature vector: a set of attributes used to describe an object

Attribute types: Nominal
• Qualitative, values do not have any meaningful order
• Enumerations: categories, states, or “names of things”

• Examples: occupation, day and night, weather, colors

Attribute types: Ordinal
• Qualitative, values have a meaningful order (ranking), but the
magnitude between successive values is not known
• Useful for subjective assessments of qualities that cannot be
measured objectively
• E.g., customer satisfaction

Attribute types: Binary
• Nominal attribute with only 2 states
• Symmetric binary: both outcomes are equally important
• E.g., a light switch (on and off), day and night, male and female
• Asymmetric binary: the outcomes are not equally important
• Convention: assign 1 to the most important outcome, which is usually the rarer one (e.g., a positive HIV test)
• E.g., a positive test result is more significant than a negative one; Rh-positive blood is more common than Rh-negative

Attribute types: Numeric
Interval numeric attribute
• Measured on a scale of equal-sized units
• Values have order (e.g., temperature in °C or °F, calendar dates)
• No true zero-point: differences can be computed, but one value cannot be
described as a multiple of another
• E.g., "20°C is five degrees higher than 15°C" is valid, but "10°C is twice as warm
as 5°C" is not

Ratio numeric attribute
• Inherent zero-point
• A value can be described as a multiple (or ratio) of another value
• E.g., temperature in kelvins (10 K is twice as high as 5 K), money (you are 100
times richer with $100 than with $1), measurements (height, weight)
Attributes: Discrete vs. Continuous
• There are many ways to organize attribute types, which are
not mutually exclusive.
• Discrete attribute
• Only a finite or countably infinite set of values
• The values are sometimes represented as integers.
• Binary attributes are a special case of discrete attributes.
• Continuous attribute
• Real numbers of continuous domains
• The values are usually represented using a finite number of digits
→ floating-point variables

Quiz 01: Data types
1. For each of the following pairs of data types, give an example to
contrast the characteristics of the two types.
• Nominal data vs. Ordinal data
• Symmetric binary data vs. Asymmetric binary data
• Interval numeric data vs. Ratio numeric data

2. How to check the data type of a variable in Python?
Show the data type of the three variables x, y, and z below.
    x = 42
    y = "Hello"
    z = [1, 2, 3]

3. How to check the schema of a pandas DataFrame?
Check the schema of the DataFrame below.
    df = [Link]({
        "name": ["Alice", "Bob"],
        "age": [25, 30],
        "salary": [50000.0, 60000.0]
    })

Basic statistical data descriptions
Central tendency: Arithmetic mean
• Let $x_1, x_2, \dots, x_N$ be a set of $N$ values or observations for
some numeric attribute $X$.
• The arithmetic mean is defined as
$$\mu = \frac{1}{N} \sum_{i=1}^{N} x_i$$
• The weighted arithmetic mean is written as
$$\mu_w = \frac{\sum_{i=1}^{N} w_i x_i}{\sum_{i=1}^{N} w_i}$$
• where $w_i$ is the weight associated with $x_i$.
• The mean is the most common and effective numeric measure of the center of a set of data.

Central tendency: Arithmetic mean
• Consider the score records of John and Kelly, together with the course grade distribution (weights).

Component    Weight   John   Kelly
Homework     15 %     92     100
Quiz         10 %     74     82
Lab          20 %     83     95
Test         25 %     76     70
Final exam   30 %     88     76

• The (non-weighted) mean scores are $\mu_{John} = 82.6$ and $\mu_{Kelly} = 84.6$.
• The weighted mean scores are $\mu^w_{John} = 83.2$ and $\mu^w_{Kelly} = 82.5$. For example,
$$\mu^w_{John} = \frac{0.15 \times 92 + 0.1 \times 74 + 0.2 \times 83 + 0.25 \times 76 + 0.3 \times 88}{0.15 + 0.1 + 0.2 + 0.25 + 0.3} = 83.2$$

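The two means in this example can be reproduced with a few lines of plain Python. This is a minimal sketch; the function names `mean` and `weighted_mean` are illustrative and not taken from the slides.

```python
def mean(values):
    """Arithmetic mean: sum of the values divided by their count."""
    return sum(values) / len(values)


def weighted_mean(values, weights):
    """Weighted arithmetic mean: sum(w_i * x_i) / sum(w_i)."""
    return sum(w * x for w, x in zip(weights, values)) / sum(weights)


# Scores in the order: homework, quiz, lab, test, final exam.
john = [92, 74, 83, 76, 88]
kelly = [100, 82, 95, 70, 76]
weights = [0.15, 0.10, 0.20, 0.25, 0.30]

print(round(mean(john), 1), round(mean(kelly), 1))
print(round(weighted_mean(john, weights), 1),
      round(weighted_mean(kelly, weights), 1))
```

Kelly has the higher plain mean, but John comes out ahead once the components are weighted, because his stronger scores fall on the heavily weighted test and final exam.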
Central tendency: Arithmetic mean
• Means are highly sensitive to extreme values (e.g., outliers).
• Trimmed mean: chop extreme values before calculating the
regular mean
• E.g., for the data 4, 14, 19, 20, 22, 24, 25, 26, 26, 99:
• Typical mean: 27.9
• Trimmed mean, after removing 10% of the observations from each end (here, 4 and 99): 22

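A trimmed mean can be sketched as follows, assuming the trimming proportion is applied to each end of the sorted data and rounded down to a whole number of observations. The helper name `trimmed_mean` is illustrative.

```python
def trimmed_mean(values, proportion):
    """Mean after dropping `proportion` of the observations from each end."""
    ordered = sorted(values)
    k = int(len(ordered) * proportion)  # observations cut per side
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)


data = [4, 14, 19, 20, 22, 24, 25, 26, 26, 99]
print(sum(data) / len(data))     # regular mean
print(trimmed_mean(data, 0.10))  # mean without the values 4 and 99
```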
Central tendency: Mode
• Mode is the value that occurs most frequently in the data,
defined for both qualitative and quantitative attributes.
• If each data value occurs only once, then there is no mode

Central tendency: Median
• Suppose that the given set of 𝑁 observations is sorted.
• Median is the middle value of the ordered set.
• If $N$ is odd, pick the exact middle value; otherwise, take the average of
the two middlemost values.
• Midrange is the average of the largest and smallest values
in the set.
• E.g., 4, 4, 4, 9, 15, 15, 15, 27, 37, 48: mean = 17.8, modes = 4 and 15, midrange = 26, median = (15 + 15)/2 = 15
• E.g., 3, 3, 6, 9, 15, 15, 15, 27, 27, 37, 48: mean ≈ 18.636, mode = 15, midrange = 25.5, median = 15

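The two worked examples can be checked with Python's standard `statistics` module. This is a sketch: `midrange` is a small helper written here because the module has no midrange function, and `multimode` requires Python 3.8 or later.

```python
from statistics import mean, median, multimode


def midrange(values):
    """Average of the smallest and largest values."""
    return (min(values) + max(values)) / 2


first = [4, 4, 4, 9, 15, 15, 15, 27, 37, 48]
second = [3, 3, 6, 9, 15, 15, 15, 27, 27, 37, 48]

for data in (first, second):
    # multimode returns every value that ties for the highest frequency.
    print(mean(data), median(data), multimode(data), midrange(data))
```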
Symmetric data vs. Skew data

[Figure: symmetric, positively skewed, and negatively skewed data distributions]

• For moderately skewed unimodal numeric data, the empirical relation is
$$\text{mean} - \text{mode} \approx 3 \times (\text{mean} - \text{median})$$

Quiz 02: Mean, mode, and Midrange
1. Consider the following 1D data series, which includes 13 data points.
31, 40, 19, 45, 5, 18, 30, 5, 33, 33, 25, 5, 20
Compute the following values: arithmetic mean, midrange, median,
and mode.

2. For each of the following values, identify whether pandas provides a
function to calculate the value.
• Arithmetic mean
• Weighted arithmetic mean
• Median, mode
• Midrange
For each available function, show the result for the above data.

Data dispersion: Quantiles
• Let $x_1, x_2, \dots, x_N$ be a set of $N$ observations sorted in
increasing order for a numeric attribute $X$.
• Quantiles are points taken at regular intervals of a data
distribution, dividing it into equal-sized consecutive sets.
• The $k$th $q$-quantile ($0 < k < q$, $k \in \mathbb{N}^*$) is a value $x$ such that at most
$k/q$ of the data values are less than $x$ and at most $(q - k)/q$ of them are greater than $x$.
• There are $q - 1$ $q$-quantiles.

Data dispersion: Quantiles
• Quartiles (4-quantiles) split the data distribution into four equal
parts.
• Percentiles (100-quantiles) split it into 100 equal-sized consecutive sets.
• The 2-quantile is the median, which splits the distribution into halves.

Data dispersion: Interquartile range
• Interquartile range (IQR) is the distance between the first
and third quartiles.
$$IQR = Q_3 - Q_1$$
• Range is the difference between the largest and smallest
values in the set.
• E.g., for 30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70: $Q_1 = 47$, $Q_2$ (median) $= 52$, $Q_3 = 63$, so $IQR = 16$ and the range is 40.

How to determine the quartile?
• Use the median to divide the ordered set into two halves.
• If the original set has an even number of points, split it exactly in half.
• Otherwise, do not include the median in either half.
• $Q_1$ and $Q_3$ are the medians of the lower and upper halves,
respectively.
• E.g., 6, 7, 15, 36, 39, 40, 41, 42, 43, 47, 49 (odd count): $Q_1 = 15$, $Q_2 = 40$, $Q_3 = 43$
• E.g., 7, 15, 36, 39, 40, 41 (even count): $Q_1 = 15$, $Q_2 = 37.5$, $Q_3 = 40$

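The median-of-halves rule described above can be sketched in a few lines. The function name `quartiles` is illustrative, and the code assumes at least two observations.

```python
from statistics import median


def quartiles(values):
    """Return (Q1, Q2, Q3) using the median-of-halves rule."""
    ordered = sorted(values)
    half = len(ordered) // 2
    lower = ordered[:half]
    # Odd count: leave the median out of both halves.
    # Even count: split the data exactly in half.
    upper = ordered[half + 1:] if len(ordered) % 2 else ordered[half:]
    return median(lower), median(ordered), median(upper)


print(quartiles([6, 7, 15, 36, 39, 40, 41, 42, 43, 47, 49]))
print(quartiles([7, 15, 36, 39, 40, 41]))
```

Library routines may use a different convention. For instance, pandas `Series.quantile` interpolates linearly between data points by default, so its quartiles can differ slightly from the values this rule produces.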
Quiz 03: Quantiles
1. You are given the following dataset representing the scores of 15
students in a math exam, already sorted.
45, 48, 52, 55, 62, 67, 70, 72, 75, 77, 80, 85, 87, 90, 95
Compute the first quartile (Q1), second quartile (Q2), third quartile
(Q3), and IQR.
2. Identify whether pandas supports the calculation of quartiles and IQR.
For each available function, show the result for the above data.

Data dispersion: Boxplot
• A five-number summary of a distribution includes
• The median (𝑸𝟐 ), the quartiles 𝑸𝟏 and 𝑸𝟑 ,
• The smallest (𝑴𝒊𝒏) and largest (𝑴𝒂𝒙) individual values.

• This summary is presented by a boxplot.

Data dispersion: Boxplot

• The two whiskers extend to the smallest and largest values within
$[Q_1 - 1.5 \times IQR,\; Q_3 + 1.5 \times IQR]$.
• Outliers: points that fall outside the above range, plotted individually
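As a sketch of how the whiskers and outliers follow from the quartiles, the code below applies the 1.5 × IQR rule to the ten values used earlier for the trimmed mean. The function name `boxplot_stats` and its dictionary keys are illustrative, and the quartiles use the median-of-halves rule, which is only one of several conventions.

```python
from statistics import median


def boxplot_stats(values, k=1.5):
    """Quartiles, whisker ends, and outliers under the k * IQR rule."""
    ordered = sorted(values)
    half = len(ordered) // 2
    q1 = median(ordered[:half])
    q3 = median(ordered[half + 1:] if len(ordered) % 2 else ordered[half:])
    iqr = q3 - q1
    low_fence, high_fence = q1 - k * iqr, q3 + k * iqr
    inside = [x for x in ordered if low_fence <= x <= high_fence]
    outliers = [x for x in ordered if x < low_fence or x > high_fence]
    return {
        "q1": q1,
        "median": median(ordered),
        "q3": q3,
        "whiskers": (inside[0], inside[-1]),  # extreme values within the fences
        "outliers": outliers,
    }


stats = boxplot_stats([4, 14, 19, 20, 22, 24, 25, 26, 26, 99])
print(stats)
```

Here the fences are 8.5 and 36.5, so the whiskers stop at 14 and 26, and the values 4 and 99 are drawn as individual outlier points.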
Data dispersion: Boxplot
[Figure: boxplot for the unit price data for items sold at four branches of AllElectronics during a given time period]

• For Branch 1, the median price of items sold is $80, Q1 is $60, and Q3 is
$100. Notice that two outlying observations, 175 and 202, are plotted
individually because they lie more than 1.5 × IQR beyond the quartiles.
Quiz 04: Draw a box plot
1. Consider the following 1D data series, which includes 15 data points
sorted in ascending order.
21, 25, 27, 29, 32, 36, 36, 48, 67, 70, 74, 75, 79, 150, 197
• Define the five-number summary for the above data.
• Draw the boxplot representing the above five-number summary.
Note the vertical axis and all the values.

2. How to draw a boxplot in Python?

3. Can scikit-learn be used to draw a boxplot? If yes, how?

Data dispersion: Variance
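• The (population) variance is defined as
$$\sigma^2 = \frac{1}{N} \sum_{i=1}^{N} (x_i - \bar{x})^2 = \left( \frac{1}{N} \sum_{i=1}^{N} x_i^2 \right) - \bar{x}^2$$
• The standard deviation $\sigma$ is the square root of the variance.
• Low $\sigma$ → the data tends to be very close to the mean.
• High $\sigma$ → the data spreads out over a large range of values.

[Figure: the intervals $[\mu - \sigma, \mu + \sigma]$, $[\mu - 2\sigma, \mu + 2\sigma]$, and $[\mu - 3\sigma, \mu + 3\sigma]$]
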
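The two forms of the population variance can be compared numerically. This sketch reuses the ten values from the median example, and the variable names are illustrative.

```python
from math import sqrt
from statistics import pvariance

data = [4, 4, 4, 9, 15, 15, 15, 27, 37, 48]
n = len(data)
mean = sum(data) / n

# Definition: average squared deviation from the mean.
var_definition = sum((x - mean) ** 2 for x in data) / n
# Shortcut: mean of the squares minus the square of the mean.
var_shortcut = sum(x * x for x in data) / n - mean ** 2
std = sqrt(var_definition)

print(var_definition, var_shortcut, pvariance(data), std)
```

Note that pandas `Series.var()` and `Series.std()` divide by N − 1 by default (sample statistics); passing `ddof=0` gives the population versions defined here.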
[Figure: box plot and probability density function of a normal distribution. Image credit: ResearchGate]

[Figure: types of outliers]

Quiz 05: Variance and standard deviation
1. Consider the following 1D data series, which includes 15 data points
sorted in ascending order.
21, 25, 27, 29, 32, 36, 36, 48, 67, 80, 84, 85, 89, 92, 97
Compute the variance and standard deviation.

2. Does pandas provide the variance and standard deviation for a
DataFrame? If yes, how?
For each available function, show the result for the above data.

Basic data visualization
Why data visualization?
• Gain insight into an information space by mapping data onto
graphical primitives
• Provide qualitative overview of large datasets
• Search for patterns, trends, irregularities, relationships
• Help find interesting regions and suitable parameters for
further quantitative analysis
• Provide a visual proof of computer representations derived

Bar chart
• A bar chart presents nominal data by using rectangular
bars with heights proportional to the values represented.

Histogram
• The range of values for a numeric attribute 𝑋 is partitioned
into disjoint consecutive subranges, called buckets or bins.
• Each bar corresponds to a subrange, and its height represents
the number of items within the subrange.
• Equal-width: each bucket covers an equal range of values
• Equal-frequency: each bucket holds roughly the same number of items (equal depth)

Histogram: An example

Histogram over boxplot
• The two following histograms may have the same boxplot.
• However, they represent rather different data distributions.

Quantile plot
• A quantile plot displays the quantile information of a
univariate data distribution.
• It allows us to assess both overall behavior and unusual occurrences.
• Let $x_1, x_2, \dots, x_N$ be the data observations sorted in
increasing order for some ordinal or numeric attribute $X$.
• Each value $x_i$ is paired with $f_i = \frac{i - 0.5}{N}$, indicating that
approximately $f_i \times 100\%$ of the data are $\le x_i$.

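The pairing of each sorted value with its $f_i$ can be sketched as below. The five input values are hypothetical and only serve to show the computation; plotting the resulting pairs ($f_i$ on the horizontal axis, $x_i$ on the vertical axis) gives the quantile plot.

```python
def quantile_plot_points(values):
    """Pair each sorted value x_i with f_i = (i - 0.5) / N, for i = 1..N."""
    ordered = sorted(values)
    n = len(ordered)
    return [((i - 0.5) / n, x) for i, x in enumerate(ordered, start=1)]


# Hypothetical unit prices, not taken from the slides.
points = quantile_plot_points([40, 60, 50, 80, 70])
for f, x in points:
    print(f, x)
```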
Quantile plot: An example

[Figure: quantile plot for the unit price data]

Quantile-Quantile plot
• A quantile-quantile plot draws the quantiles of one univariate
distribution against the corresponding quantiles of another.

• It shows whether there is a shift in going from one distribution to another.
• E.g., at Q1, the unit prices of items sold at Branch 1
tend to be lower than those at Branch 2.

Scatter plot
• A scatter plot displays bivariate data to reveal clusters of
points or outliers.
• Each pair of values is treated as a pair of coordinates and plotted as
a point in the plane.
[Figure: scatter plot showing the correlation between unit price and items sold]

Scatter plot: Data correlation

[Figure: scatter plots of positively correlated, negatively correlated, and uncorrelated data]

Quiz 06: Scatter plot
1. Consider the following data table, in which there are five tuples of two
attributes, A and B.

No.   A    B
1     19   16
2     25   10
3     13   26
4     12   29
5     16   20

Draw the scatter plot whose horizontal axis denotes attribute A
and whose vertical axis represents attribute B.

2. How to draw a scatter plot using some Python library? Draw the
scatter plot for the above data.

Data proximity measures
Similarity and Dissimilarity
Similarity
• A numerical measure of how alike two data objects, 𝑖 and 𝑗, are
• Values often fall in the range [0, 1]: 0 means completely unalike, 1 means identical

Dissimilarity (distance)
• A numerical measure of how different two data objects are
• It works in an opposite direction to some similarity measure
• The lower bound is often 0, while the upper limit varies

Proximity
• This refers to either similarity or dissimilarity
Feature matrix vs. Dissimilarity matrix
• Feature matrices are essential to most machine learning tasks.

Feature matrix
$$\begin{bmatrix} x_{11} & \cdots & x_{1f} & \cdots & x_{1p} \\ \vdots & & \vdots & & \vdots \\ x_{i1} & \cdots & x_{if} & \cdots & x_{ip} \\ \vdots & & \vdots & & \vdots \\ x_{n1} & \cdots & x_{nf} & \cdots & x_{np} \end{bmatrix}$$
• $n$ data points with $p$ dimensions
• Object-by-attribute structure

Dissimilarity matrix
$$\begin{bmatrix} 0 & & & & \\ d(2,1) & 0 & & & \\ d(3,1) & d(3,2) & 0 & & \\ \vdots & \vdots & \vdots & \ddots & \\ d(n,1) & d(n,2) & \cdots & \cdots & 0 \end{bmatrix}$$
• A collection of distances for all pairs of $n$ objects
• Object-by-object structure

• Many nearest-neighbor algorithms use dissimilarity matrices.

Measures for nominal attributes
• Let the number of states of a nominal attribute be $M$.
• Method 1: Simple matching
$$d(i, j) = \frac{p - m}{p}$$
• $m$: the number of attributes for which $i$ and $j$ are in the same state
• $p$: the total number of attributes describing the objects
• Method 2: Create a binary attribute for each of the $M$ states
• Measure of similarity
$$sim(i, j) = 1 - d(i, j) = \frac{m}{p}$$

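Simple matching is short enough to sketch directly. The two objects below are hypothetical examples with three nominal attributes each, and the function name is illustrative.

```python
def simple_matching_distance(obj_i, obj_j):
    """d(i, j) = (p - m) / p for two equal-length tuples of nominal values."""
    p = len(obj_i)                                       # total attributes
    m = sum(1 for a, b in zip(obj_i, obj_j) if a == b)   # matching attributes
    return (p - m) / p


# Hypothetical objects described by (color, occupation, weather).
a = ("red", "engineer", "sunny")
b = ("red", "teacher", "sunny")

d = simple_matching_distance(a, b)
print(d, 1 - d)  # dissimilarity and the corresponding similarity m / p
```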
Measures for binary attributes
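• Contingency table for objects $i$ and $j$: $q$ is the number of attributes equal to 1 for both objects, $r$ the number equal to 1 for $i$ and 0 for $j$, $s$ the number equal to 0 for $i$ and 1 for $j$, and $t$ the number equal to 0 for both.
• Symmetric binary variable
$$d(i, j) = \frac{r + s}{q + r + s + t}$$
• Asymmetric binary variable
$$d(i, j) = \frac{r + s}{q + r + s}$$
• Jaccard coefficient (similarity for asymmetric binary variables)
$$sim(i, j) = 1 - d(i, j) = \frac{q}{q + r + s}$$
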
Measures for binary attributes

• Gender is symmetric binary; the remaining attributes are asymmetric binary.
• Let the values Y and P be 1 and the value N be 0.
• Suppose that the distance between objects (patients) is computed based
only on the asymmetric attributes.
$$d(Jack, Jim) = \frac{1 + 1}{1 + 1 + 1} = 0.67, \quad d(Jack, Mary) = \frac{0 + 1}{2 + 0 + 1} = 0.33$$
$$d(Jim, Mary) = \frac{1 + 2}{1 + 1 + 2} = 0.75$$

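The three distances above can be reproduced with a small sketch. The patient table itself is not reproduced in the text, so the 0/1 vectors below are assumptions chosen only to be consistent with the counts that appear in the fractions; the number of attributes and their order are likewise assumed.

```python
def binary_counts(x, y):
    """Return (q, r, s, t) for two equal-length 0/1 vectors."""
    q = sum(1 for a, b in zip(x, y) if a == 1 and b == 1)
    r = sum(1 for a, b in zip(x, y) if a == 1 and b == 0)
    s = sum(1 for a, b in zip(x, y) if a == 0 and b == 1)
    t = sum(1 for a, b in zip(x, y) if a == 0 and b == 0)
    return q, r, s, t


def asymmetric_binary_distance(x, y):
    """d = (r + s) / (q + r + s); joint absences (t) are ignored."""
    q, r, s, _ = binary_counts(x, y)
    return (r + s) / (q + r + s)


# Assumed encodings of the asymmetric attributes (Y/P -> 1, N -> 0).
jack = [1, 0, 1, 0, 0, 0]
mary = [1, 0, 1, 0, 1, 0]
jim = [1, 1, 0, 0, 0, 0]

print(round(asymmetric_binary_distance(jack, jim), 2))
print(round(asymmetric_binary_distance(jack, mary), 2))
print(round(asymmetric_binary_distance(jim, mary), 2))
```

The Jaccard similarity is one minus each of these distances.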
Measures for numeric attributes
• Consider two $p$-dimensional data points
$i = (x_{i1}, x_{i2}, \dots, x_{ip})$ and $j = (x_{j1}, x_{j2}, \dots, x_{jp})$.
• Minkowski distance ($L_h$ norm)
$$d(i, j) = \sqrt[h]{|x_{i1} - x_{j1}|^h + |x_{i2} - x_{j2}|^h + \cdots + |x_{ip} - x_{jp}|^h}$$
• where $h$ is the order (a real number with $h \ge 1$)

Measures for numeric attributes
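• $h = 1$: Manhattan (city block, $L_1$ norm) distance
$$d(i, j) = |x_{i1} - x_{j1}| + |x_{i2} - x_{j2}| + \cdots + |x_{ip} - x_{jp}|$$
• $h = 2$: Euclidean ($L_2$ norm) distance
$$d(i, j) = \sqrt{(x_{i1} - x_{j1})^2 + (x_{i2} - x_{j2})^2 + \cdots + (x_{ip} - x_{jp})^2}$$
• $h \to \infty$: "supremum" ($L_{max}$ / $L_\infty$ norm, Chebyshev) distance
$$d(i, j) = \lim_{h \to \infty} \left( \sum_{f=1}^{p} |x_{if} - x_{jf}|^h \right)^{1/h} = \max_{f} |x_{if} - x_{jf}|$$
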
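A minimal sketch of the three special cases, using two hypothetical 2-D points. The function names are illustrative; a large finite `h` is included to show the Minkowski distance approaching the supremum distance.

```python
def minkowski(x, y, h):
    """Minkowski distance of order h (h >= 1) between two points."""
    return sum(abs(a - b) ** h for a, b in zip(x, y)) ** (1 / h)


def chebyshev(x, y):
    """Supremum distance: the largest per-attribute difference."""
    return max(abs(a - b) for a, b in zip(x, y))


# Hypothetical points, not taken from the slides.
i, j = (1, 2), (3, 5)

print(minkowski(i, j, 1))   # Manhattan: |1-3| + |2-5|
print(minkowski(i, j, 2))   # Euclidean: sqrt(2^2 + 3^2)
print(chebyshev(i, j))      # supremum: max(2, 3)
print(minkowski(i, j, 50))  # already very close to the supremum distance
```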
Cosine similarity
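• A document can be represented by a term-frequency vector with thousands of
attributes, each recording the frequency of a keyword in the document.
• Example result for two documents: $sim(d_1, d_2) = 0.94$
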
Cosine similarity
• Let $d_1$ and $d_2$ be two vectors (e.g., term-frequency vectors).
• Cosine similarity (a non-metric measure):
$$sim(d_1, d_2) = \frac{d_1 \cdot d_2}{\|d_1\| \, \|d_2\|}$$
• where $\cdot$ is the vector dot product and $\|d\|$ is the length of vector $d$
• sim = 0 means no match, while sim = 1 means a complete match.

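The formula can be sketched as below. The two term-frequency vectors are assumptions: the slide's document table is not reproduced in the text, and these values were chosen because they give the similarity of 0.94 reported in the example.

```python
from math import sqrt


def cosine_similarity(d1, d2):
    """Dot product of d1 and d2 divided by the product of their lengths."""
    dot = sum(a * b for a, b in zip(d1, d2))
    norm1 = sqrt(sum(a * a for a in d1))
    norm2 = sqrt(sum(b * b for b in d2))
    return dot / (norm1 * norm2)


# Assumed term-frequency vectors over ten keywords.
d1 = [5, 0, 3, 0, 2, 0, 0, 2, 0, 0]
d2 = [3, 0, 2, 0, 1, 1, 0, 1, 0, 1]

print(round(cosine_similarity(d1, d2), 2))
```

The function divides by zero if either vector contains only zeros, so a real implementation would need to handle empty documents separately.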
Measures for ordinal attributes
• The range of a numeric attribute can be mapped to an
ordinal attribute $f$ having $M_f$ states.
• E.g., temperature: cold (−30°C to −10°C), moderate (−10°C to 10°C), and
warm (10°C to 30°C)
• Let $M_f$ represent the number of possible ordered states,
which define the ranking $1, \dots, M_f$.
• Replace each $x_{if}$ by its corresponding rank, $r_{if} \in \{1, \dots, M_f\}$.
• Replace the rank $r_{if}$ of the $i$th object by
$$z_{if} = \frac{r_{if} - 1}{M_f - 1}$$
• Continue with any measure for numeric attributes.

Measures for ordinal attributes

• test-2 = {fair, good, excellent}, i.e., $M_f = 3$
• The ranks of the four objects are 3, 1, 2, and 3, respectively.
• Map rank 1 to 0.0, rank 2 to 0.5, and rank 3 to 1.0.
• Compute the dissimilarity matrix using the Euclidean distance.

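The steps for this example can be sketched as follows. The function names are illustrative; with a single attribute, the Euclidean distance reduces to the absolute difference of the normalized ranks.

```python
def normalize_ranks(ranks, num_states):
    """Map rank r in 1..M_f to z = (r - 1) / (M_f - 1)."""
    return [(r - 1) / (num_states - 1) for r in ranks]


def dissimilarity_matrix(values):
    """Lower-triangular matrix of |z_i - z_j| for one-dimensional values."""
    return [[abs(values[i] - values[j]) for j in range(i + 1)]
            for i in range(len(values))]


z = normalize_ranks([3, 1, 2, 3], num_states=3)
matrix = dissimilarity_matrix(z)

print(z)
for row in matrix:
    print(row)
```

Objects 1 and 4 share the rank "excellent", so their distance is 0, while objects 1 and 2 are at opposite ends of the scale and have the maximum distance of 1.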
Measures for attributes of mixed types
• Suppose that the dataset has $p$ attributes of mixed type.
• The distance between objects $i$ and $j$ is
$$d(i, j) = \frac{\sum_{f=1}^{p} \delta_{ij}^{(f)} d_{ij}^{(f)}}{\sum_{f=1}^{p} \delta_{ij}^{(f)}}$$
• $\delta_{ij}^{(f)} = 0$ if (1) $x_{if}$ or $x_{jf}$ is missing, or (2) $x_{if} = x_{jf} = 0$ and attribute
$f$ is asymmetric binary. Otherwise, $\delta_{ij}^{(f)} = 1$.
• If $f$ is numeric: $d_{ij}^{(f)} = \frac{|x_{if} - x_{jf}|}{\max_h x_{hf} - \min_h x_{hf}}$, where $h$ runs over all
nonmissing objects for attribute $f$
• If $f$ is nominal or binary: $d_{ij}^{(f)} = 0$ if $x_{if} = x_{jf}$; otherwise, $d_{ij}^{(f)} = 1$
• If $f$ is ordinal: compute $r_{if}$ and treat $z_{if} = \frac{r_{if} - 1}{M_f - 1}$ as numeric

Measures for attributes of mixed types
[Figure: dissimilarity matrices of test-1, test-2, and test-3, and the resulting combined dissimilarity matrix]

• $\delta_{ij}^{(f)} = 1$ for each attribute $f$
• E.g., $d(3, 1) = \frac{1(1) + 1(0.50) + 1(0.45)}{3} = 0.65$

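The combination step for $d(3, 1)$ can be sketched as a weighted average of per-attribute distances. The function name and its optional `indicators` argument are illustrative; the per-attribute distances 1, 0.50, and 0.45 are the ones used in the example.

```python
def mixed_distance(per_attribute_distances, indicators=None):
    """d(i, j) = sum(delta_f * d_f) / sum(delta_f) over the p attributes."""
    if indicators is None:
        # Every attribute contributes, as in the example above.
        indicators = [1] * len(per_attribute_distances)
    numerator = sum(delta * d
                    for delta, d in zip(indicators, per_attribute_distances))
    return numerator / sum(indicators)


d_3_1 = mixed_distance([1, 0.50, 0.45])
print(round(d_3_1, 2))

# Hypothetical variation: if the third value were missing for one object,
# its indicator would be 0 and only the first two attributes would count.
print(mixed_distance([1, 0.50, 0.45], indicators=[1, 1, 0]))
```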
Quiz 07: Jaccard coefficient
1. Calculate the similarity between these two observations, in which all
the attributes are asymmetric binary.

ID   fever   cough   breathing difficulty   fatigue   headache   loss of taste   sore throat
1    1       1       1                      0         0          1               1
2    0       1       0                      0         1          1               1

2. Explore the distance and similarity metrics supported in scikit-learn.
For each available metric, calculate the distance/similarity between the
two observations above.

Correlation analysis
χ²-test for correlation analysis
• Suppose attribute $A$ has $c$ distinct values and attribute $B$ has $r$
distinct values. There are $n$ data tuples.
• Let $(A_i, B_j)$ denote the joint event that $A = a_i$ and $B = b_j$.
• The $\chi^2$-test checks the null hypothesis that $A$ and $B$ are independent.
$$\chi^2 = \sum_{i=1}^{c} \sum_{j=1}^{r} \frac{(o_{ij} - e_{ij})^2}{e_{ij}}$$
• $o_{ij}$: observed frequency (i.e., actual count) of $(A = a_i, B = b_j)$
• $e_{ij}$: expected frequency of $(A_i, B_j)$
$$e_{ij} = \frac{count(A = a_i) \times count(B = b_j)}{n}$$
• The larger the $\chi^2$ value, the more likely the variables are related.

χ²-test: A numerical example
• Consider a contingency table of gender and preferred_reading. (Numbers in parentheses are
expected counts calculated based on the data distribution
in the two categories.)
• Are gender and preferred_reading correlated?
$$\chi^2 = \frac{(250 - 90)^2}{90} + \frac{(50 - 210)^2}{210} + \frac{(200 - 360)^2}{360} + \frac{(1000 - 840)^2}{840} = 507.93$$
• The two attributes are (strongly) correlated for the given group of people.
• However, correlation does not imply causality.
• E.g., the number of hospitals and the number of car thefts in a city are correlated.
• However, both are causally linked to a third variable: population.

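The statistic can be recomputed from the observed counts alone, since the expected counts follow from the row and column totals. In this sketch the function name is illustrative, and the arrangement of the four observed counts into rows and columns is an assumption (the slide's table is not reproduced in the text); it was chosen so that the expected counts come out as 90, 210, 360, and 840.

```python
def chi_square(observed):
    """Return (chi-square statistic, degrees of freedom) for a 2-D table."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(observed):
        for j, o in enumerate(row):
            e = row_totals[i] * col_totals[j] / n  # expected count
            chi2 += (o - e) ** 2 / e
    dof = (len(observed) - 1) * (len(observed[0]) - 1)
    return chi2, dof


# Assumed layout: one row per reading preference, one column per gender.
observed = [[250, 200],
            [50, 1000]]

chi2, dof = chi_square(observed)
print(round(chi2, 2), dof)
```

Computed without intermediate rounding, the statistic is about 507.94; the value 507.93 above comes from summing the four terms after rounding each to two decimals. For real analyses, `scipy.stats.chi2_contingency` performs the same computation and also returns a p-value, though by default it applies a continuity correction to 2 × 2 tables, so its statistic will differ unless that is turned off.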
Expected frequency
• Expected frequency is the theoretical count in a cell of a
contingency table if the two variables are independent.
$$e_{ij} = \frac{(\text{Row total}_i) \times (\text{Column total}_j)}{\text{Total}}$$
• $e_{ij}$: expected frequency for the cell in row $i$ and column $j$
• $\text{Row total}_i$: count for row $i$; $\text{Column total}_j$: count for column $j$
• $\text{Total}$: total number of observations in the entire table
• If the two variables are independent, the proportions across the columns should be
the same in every row (and the proportions across the rows the same in every column).

Expected frequency: An example
• Consider the contingency table below.

               Yes   No   Row total
Male           30    20   50
Female         40    10   50
Column total   70    30   100

• Compute the expected frequency for each cell:
$$e_{\text{Male, Yes}} = \frac{50 \times 70}{100} = 35, \quad e_{\text{Male, No}} = \frac{50 \times 30}{100} = 15$$
$$e_{\text{Female, Yes}} = \frac{50 \times 70}{100} = 35, \quad e_{\text{Female, No}} = \frac{50 \times 30}{100} = 15$$

Expected frequency: Notes
• 𝜒 2 -test relies on an approximation that works best with
sufficiently large expected counts.
• Classical rule: Each expected frequency should be ≥ 5.
• Modern guideline: For larger tables, ≥ 1 is acceptable if no
more than 20% of cells are below 5.

χ²-test: Contingency table
• A contingency table (or crosstab) displays the frequency
distribution of two or more categorical variables.
• It helps to analyze the relationship between variables.
[Figure: layout of a contingency table, showing the values of the first variable, the values of the second variable, the marginal totals, and the grand total]

χ²-test: Contingency table
• Is a contingency table able to represent categorical variables
that have more than two values?

               18-29   30-49   50+   Row total
Coffee         30      40      20    90
Tea            20      35      25    80
Juice          25      15      10    50
Column total   75      90      55    220

χ²-test: Contingency table
• Can a contingency table represent the relationship of more
than two categorical variables?

χ²-test: An example
• The test is carried out at a chosen significance
level with a DOF of 1.
• If the null hypothesis is rejected, $A$ and
$B$ are statistically correlated.

χ²-test: Degrees of freedom (DOF)
• DOF is the number of independent values that can vary in a
statistical calculation without breaking constraints.
• It can be calculated from the contingency table as follows.
𝑫𝑶𝑭 = (𝒓 − 𝟏) ∙ (𝒄 − 𝟏)
• 𝑟 is the number of rows and 𝑐 is the number of columns.
• E.g., in a table with 3 rows and 2 columns, DOF = (3 – 1)(2 – 1) = 2.

χ²-test: Significance levels
• The significance level $\alpha$ is the probability threshold used to decide
whether to reject the null hypothesis.
• It represents the risk of rejecting a true null hypothesis.
• Common significance levels
• 0.05 (5%): The most common choice, indicating a 5% risk of wrongly
rejecting the null hypothesis.
• 0.01 (1%): Used for stricter criteria, implying a 1% risk.
• 0.10 (10%): Sometimes used in exploratory studies where a higher
risk is acceptable.

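For the special case of one degree of freedom (a 2 × 2 table), the decision against a significance level can be sketched with the standard library alone. This relies on the fact that a χ² variable with DOF = 1 is the square of a standard normal variable, so its upper-tail probability equals erfc(√(χ²/2)). The function names are illustrative, and tables with more degrees of freedom need a general χ² distribution routine instead.

```python
from math import erfc, sqrt


def chi2_p_value_dof1(chi2):
    """P(X >= chi2) for a chi-square variable with 1 degree of freedom."""
    return erfc(sqrt(chi2 / 2))


def reject_null(chi2, alpha):
    """Reject independence when the p-value falls below alpha."""
    return chi2_p_value_dof1(chi2) < alpha


print(chi2_p_value_dof1(3.841))    # close to 0.05
print(reject_null(507.93, 0.001))  # very large statistic
print(reject_null(1.0, 0.05))      # small statistic
```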
Quiz 08: χ²-test
1. Consider the data that relate the sex of children in families who have
two children. Apply the χ² statistic at the 0.001 significance level.

                        First child
                        Male   Female   Total
Second child   Male     114    131      245
               Female   132    123      255
Total                   246    254      500

2. How to compute the χ² statistic using some library in Python?

Pearson correlation coefficient
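• Consider two numeric attributes $A$ and $B$, and a set of $n$
observations $(a_1, b_1), \dots, (a_n, b_n)$.
• Pearson's product moment coefficient
$$r_{A,B} = \frac{\sum_{i=1}^{n} (a_i - \bar{A})(b_i - \bar{B})}{n \sigma_A \sigma_B} = \frac{\sum_{i=1}^{n} a_i b_i - n \bar{A} \bar{B}}{n \sigma_A \sigma_B}$$
• $\bar{A}$, $\bar{B}$, $\sigma_A$, $\sigma_B$: means and standard deviations of $A$ and $B$, respectively
• $\sum a_i b_i$: sum of the $AB$ cross-product
• $r_{A,B} \to -1$: negative correlation
• $r_{A,B} = 0$: no linear correlation between $A$ and $B$ (this alone does not establish independence)
• $r_{A,B} \to 1$: positive correlation
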
Pearson correlation coefficient
[Figure: several sets of (x, y) points, with the Pearson correlation coefficient of x and y for each set]
The correlation reflects the noisiness and direction of a linear relationship (top row), but
not the slope of that relationship (middle), nor many aspects of nonlinear relationships
(bottom). N.B.: the figure in the center has a slope of 0 but in that case the correlation
coefficient is undefined because the variance of Y is zero. (Wikipedia)
Covariance analysis
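• The covariance between $A$ and $B$ is defined as
$$Cov(A, B) = \frac{\sum_{i=1}^{n} (a_i - \bar{A})(b_i - \bar{B})}{n} = E(A \cdot B) - \bar{A}\bar{B}$$
• where $E(A) = \bar{A} = \frac{\sum_{i=1}^{n} a_i}{n}$ and $E(B) = \bar{B} = \frac{\sum_{i=1}^{n} b_i}{n}$ are the expected
values of $A$ and $B$
• $Cov(A, B) > 0$: positive covariance
• $Cov(A, B) < 0$: negative covariance
• $Cov(A, B) = 0$: holds when $A$ and $B$ are independent (the converse is not true in general)
• Covariance vs. correlation:
$$r_{A,B} = \frac{Cov(A, B)}{\sigma_A \sigma_B}$$
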
Covariance analysis: An example
• If the stocks are affected by the
same industry trends, will their
prices rise or fall together?
$$E(AllElectronics) = \frac{6 + 5 + 4 + 3 + 2}{5} = \frac{20}{5} = \$4$$
$$E(HighTech) = \frac{20 + 10 + 14 + 5 + 5}{5} = \frac{54}{5} = \$10.80$$
$$Cov(AllElectronics, HighTech) = \frac{6 \times 20 + 5 \times 10 + 4 \times 14 + 3 \times 5 + 2 \times 5}{5} - 4 \times 10.80 = 7$$
• The covariance is positive, which indicates that the stock prices of both
companies tend to rise together.

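The stock example can be checked, and extended to the Pearson coefficient, with a short sketch. The function names are illustrative; both functions use the population (divide by n) definitions given in the slides.

```python
from math import sqrt


def covariance(a, b):
    """Population covariance: mean product of deviations from the means."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    return sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / n


def pearson(a, b):
    """r = Cov(A, B) / (sigma_A * sigma_B), with Var(A) = Cov(A, A)."""
    return covariance(a, b) / sqrt(covariance(a, a) * covariance(b, b))


all_electronics = [6, 5, 4, 3, 2]
high_tech = [20, 10, 14, 5, 5]

print(covariance(all_electronics, high_tech))
print(round(pearson(all_electronics, high_tech), 3))
```

The covariance of 7 depends on the price units, while the correlation of roughly 0.87 is unit-free. Note that pandas `DataFrame.cov()` divides by n − 1 by default, so it reports a larger covariance for the same data; `DataFrame.corr()` is unaffected by that choice.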
Quiz 09: Correlation tests
1. Consider the following data table, in which there are five tuples of two
attributes, A and B.

No.   A    B
1     19   16
2     25   10
3     13   26
4     12   29
5     16   20

Calculate the Pearson correlation coefficient and the covariance between
A and B.

2. How to compute the above metrics using pandas?
Show the results for the above data.

References
• Jiawei Han, Micheline Kamber, and Jian Pei, 2011. Data Mining:
Concepts and Techniques (3rd ed.). Morgan Kaufmann Publishers Inc.
Chapter 2 and Chapter 3.
