0% found this document useful (0 votes)
6 views44 pages

Module 5 Eda

Module 5 covers descriptive and inferential statistics, emphasizing their roles in data analysis. It explains key concepts such as population vs. sample, parameters vs. statistics, and various measures of central tendency and variability. The module also demonstrates how to summarize data using frequency tables, descriptive statistics, and Excel tools for practical applications in business intelligence.

Uploaded by

alskdus950922
Copyright
© All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
6 views44 pages

Module 5 Eda

Module 5 covers descriptive and inferential statistics, emphasizing their roles in data analysis. It explains key concepts such as population vs. sample, parameters vs. statistics, and various measures of central tendency and variability. The module also demonstrates how to summarize data using frequency tables, descriptive statistics, and Excel tools for practical applications in business intelligence.

Uploaded by

alskdus950922
Copyright
© All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd

Module 5:

Descriptive Statistics
RBMI 1020 Business Intelligence for Data-Driven Decisions (Spring 2026)
Jason Ho

1
Descriptive statistics
• Descriptive statistics fall into the category of descriptive analysis (from
Module 1) among all data analytical methods
Find out what has happened in the past or present based on some
observed data
Summarize particular characteristics/features of the data in an
organized and meaningful way based on a numerical approach
May combine with using statistical graphics and other data
visualization methods
Serve as a fundamental step in any data analysis, providing a basic
understanding before deeper exploration
2
Inferential statistics
• Inferential statistics fall into the category of predictive analysis (from
Module 1) among all data analytical methods
Find out what is likely to happen in the future based on patterns or
relationships from some past data
Make educated guesses/predictions about a larger group (called a
population) that includes the past data (called a sample) as a
subset
Consist of making inference, testing hypotheses, and drawing
conclusions about a population based on a sample
Combine with descriptive statistics as the 2 major methodologies
in Statistics
3
Population VS Sample
• Population – all entities of interest for a particular decision or
investigation
e.g., all married women
e.g., all subscribers to Netflix in Hong Kong
• Sample – a subset of the population obtained from sampling
e.g., a group of HKUST alumni who are married women
e.g., subscribers to Netflix within HKUST
• The purpose of sampling is to obtain relevant and sufficient
information to draw a valid conclusion about a population

4
Parameter VS Statistic
• Parameter – a representative measure (i.e., a value) computed from all
entities of interest for a particular decision or investigation
Denoted by Greek letters, such as
e.g., average amount of insurance premium paid by all married women
e.g., proportion of all HK subscribers to Netflix who have watched a recent
comedy
• Statistic – a value computed from a sample
e.g., average amount of insurance premium paid by a group of HKUST alumni
who are married women
e.g., proportion of all HKUST subscribers to Netflix who have watched a recent
comedy
5
Inferential statistics: a general set-up with 4
components
Parameter(s)
describe a
characteristic of Population Find and
a population
compute
Sample Sample
Statistic(s)
to estimate
the parameter

Infer the population


using the statistic(s)

6 6
Frequency tables for summarizing a categorical
variable
• Recall in Module 4, Example 2: a frequency table is a standard way to
summarize data of a categorical variable in Statistics
 Possibly show relative frequency, or percent frequency, rather than
frequency [Link]
 Produced as a PivotTable in
With
Relative Frequency Frequency of Category i
=
of Category i # of data values
• In cell C4, type formula: =B4/$B$8, and
• copy down in column C
7
Descriptive statistics for a categorical variable
To summarize data of a categorical variable of interest, the key statistics
(also called summary statistics or descriptive statistics) are the
proportions of different categories
defined as fractions of all data that belong to a certain category
(i.e., have a certain characteristic)
equivalent to the relative frequency
• e.g., refer to previous Slide, cells C4:C7 give sample proportions of all
4 categories (which must sum up to 1) of the categorical variable,
Income Group, in [Link]

8
Descriptive statistics for a quantitative variable
Once data (i.e., a column of data values or observations) of a
quantitative variable of interest are available, it is imperative to
compute a handful of statistics from the data in view of
summarizing all values, and
understanding different aspects of the variable
as among the data (i.e., in the column),
there are way too many values
most data values are distinct from one another
some of the distinct values appear more often than others

9
Central tendency

Id e a Me a s u r e s
Central tendency Mean, Median,
characterizes a data set Mode, 90% Trimmed
using the middle or
Average
“typical” value

We define measures (including the proportions, these measures of central tendency,


and others to be introduced in the sequel) from sample/data. There exist counterparts
defined in the population, e.g., there exist both population mean and sample mean
10
Measures of central tendency – definitions
Based on data of a quantitative variable, central tendency is described by
Measure Definition function
The mean, sum of all values, divided by # of added AVERAGE(data range)
or average values
The median the middle value such that same # of MEDIAN(data range)
values are less than and greater than it
The mode the value that occurs the most frequently MODE(data range)
A 90% the average of the middle 90% of the TRIMMEAN(data range,
trimmed values (i.e., all values except the largest percent removed)
average 5% values and the smallest 5% values)
11
Example 1: mean
[Link]
Based on cost of all orders in cell
B2:B95 of [Link], the
average of cost per order is computed:
• By definition in cell B99:
= B96/B97 = SUM(B2:B95) /
COUNT(B2:B95)
= 2,471,760/94
= 26,295.32
• Directly by function in cell B101:
= AVERAGE(B2:B95)
12
Example 2: median
After sorting cost of all 94 orders in cell B2:B95 of [Link],
from smallest to largest, the median cost per order is computed:
• By definition in cell D50:
= Average of the 47th and
the 48th ranked observations
= (D48 + D49) / 2
= (15,562.50 + 15,750.00) / 2
= 15,656.25
• Directly by function in cell D52:
= MEDIAN(B2:B95)
13
Example 3: a 90% trimmed average
After sorting cost of all 94 orders in cell
B2:B95 of [Link], from smallest
to largest, a 90% trimmed average cost per
order is computed:
• By definition in cell D97:
= Average of all observations without the smallest
4 and the largest 4 observations
= AVERAGE( B6:B91 )
= 23,357.40
• Directly by function in cell D99:
= TRIMMEAN(B2:B95, 0.1)
14
Variability, variation, and dispersion

Id e a Me a s u r e s
Measure of how much Range, Variance,
data values for a Standard deviation (SD),
quantitative variable
coefficient of variation
differ from each other

15
Measures of variation/dispersion – definitions
Based on data of a quantitative variable, variation/variability/dispersion
is described by
Measure Definition function
The range difference between the largest and the MAX(data range)
smallest values; - MIN(data range)
the simplest measure of variation
The variance average of the squared deviations between VAR.S(data range)
data values and the mean, indicating the
dispersion of values around the mean
The standard square root of the variance; STDEV.S(data range)
deviation a more interpretable measure of variation
16
Example 4: range
Refer to Example 2, sorting cost of all 94 orders in cell B2:B95 of
[Link] yields
• Minimum in cell B2:
= 68.75
• Maximum in cell B95:
= 127,500.00
• Range = 127,500 - 68.75 = 127,431.25
= MAX(B2:B95) – MIN(B2:B95) in

17
Example 5: variance and
standard deviation (1/2)

18
Example 5: variance and
standard deviation (2/2)
• In cell D101, using directly function yields the sample variance as
= VAR.S(B2:B95) = 890,594,572.82
• The sample standard deviation/sample SD can be obtained
by taking square root of the sample variance:
[ = sqrt(D101) ] = 29,842.8312
directly by function :
= STDEV.S(B2:B95) = 29,842.8312

19
Example 6: population variance and
standard deviation
In a population (i.e., assuming all available values form a population of
interest),
• the population variance is computed
by definition:

directly by = VAR.P(data range)


function :
• the population SD is obtained
by definition:
directly by function : = STDEV.P(data range)
20
Coefficient of variation (CV)
• Another measure of variation, called the coefficient of variation (CV), is
defined by

provides a relative measure of dispersion in data relative to the mean


allows for comparison in terms of variation across different variables with
different units or scales (i.e., possibly with different means)
• In financial risk analysis, reward-to-risk ratio is often easier for
interpretation
The Sharpe ratio is a related measure in Finance
21
Example 7: coefficient of variation (CV)
[Link]
• Comparison the 4 stocks’
performances by SD as a
measure of risk misleads us
that IBM is the most risky due to
its largest SD
• But one should compare CVs to
conclude that INTC is the most
risky due to its largest CV

22
Percentiles and quartiles
Additional insights about a quantitative variable can be conveyed by its
data values at some particular positions (other than the “middle”)
1. The kth percentile (k = 1,2,…,99) is a value below which k% of the data
lie, and all of them together break all sorted data into 100 groups
e.g., below the 12th percentile, there are 12% of data
 function: [Link](data range, k), with 0 ≤ k ≤ 1
2. The first, second, and third quartiles (Q1, Q2, and Q3) correspond to
the 25th, 50th, and 75th percentiles, respectively
Break all sorted data into 4 groups
 function: [Link](data range, quart), with quart = 1,2,3,4
23
Example 8: percentiles and quartiles
Based on cost of all orders in cell B2:B95 of
[Link],
7th [Link](B2:B95, 0.07) 3,002.475
percentile
Q1 [Link](B2:B95,1) 6,757.82
Q2 [Link](B2:B95,2) 15,656.25
Q3 [Link](B2:B95,3) 27,593.75
75th [Link](B2:B95, 0.75) 27,593.75
percentile
24
Standardized values or z-scores
From data of a variable, a standardized value called a z-score
• provides a relative measure of the distance a data value is away from
the mean of all the data
• is independent of the unit or the scale of the data, allowing for
comparison across data from different variables or datasets
• is calculated as, for any data value ,

 and are mean and SD computed by all the data


 function: STANDARDIZE(x, mean, standard_dev)
25
Standardized values or z-scores: properties

• The numerator: distance of the data value from the mean


negative value: lies to the left of, or less than, the mean
positive value: lies to the right of, or larger than, the mean
• Division by scales and expresses the distance from the mean in
number of SD
• e.g.: for data of a variable, data value 1 and value 2 have
1. z-score of 1.0: the data value 1 is 1 SD to the right of the mean of all data
2. z-score of -2.5: the data value 2 is 2.5 x SD to the left of the mean of all data
26
Example 9: standardized values
Based on cost of all orders in cell
B2:B95 of [Link], z-scores
are computed
• by definition:

e.g., in cell C3: = (B3 - $B$97)/$B$98


• directly by function, with Mean and
SD:
= AVERAGE(B2:B95)
e.g., in cell C4: = STDEV.S(B2:B95)
= STANDARDIZE(C3,$B$97,$B$98)
27
Excel descriptive statistics tool
This tool provides a list of
summary statistics or descriptive
statistics for data of quantitative
variables
1. In Data tab, select Data
Analysis, and then Descriptive
Statistics
2. Select Input Range:, check
Summary statistics and/or
Labels in First Row
28
Example 10: descriptive statistics
In [Link], follow the 2 steps from previous slide to obtain
summary statistics of the 4 quantitative variables in columns E:H

29
Summarizing a quantitative variable grouped
by a categorical variable
• Dealing with a dataset with 2 or more variables, it is a common
practice to study data of a quantitative variable across each category
of a categorical variable via some aggregation/summarization
methods (including all measures defined in this Module, e.g., sum,
average, maximum, SD, quartiles) to gain some insights
 e.g., Module 3, Example 17, demonstrated
how to obtain Sum of Revenue by Region in
[Link] using PivotTable

30
Example 11: descriptive statistics in PivotTable
(1/4)

• Instead of simply getting descriptive statistics of Cost per order in


[Link] as in Example 10, we may be interested in how
our order costs vary across different suppliers

• Produce the next PivotTable following Module 3, Example 15, with


Drag Supplier to Rows
Drag Cost per order to Values
• By default in PivotTables, the
summarization on Cost per order is Sum

31
Example 11: descriptive statistics in PivotTable
(2/4)

To switch to another
summarization, say,
average
• Click the arrow next
to Sum of Cost per
order in Values, and
select Value Field
Settings
• In the pop-up dialog,
choose Average
32
Example 11: descriptive statistics in PivotTable
(3/4)

The PivotTable is
updated to show
Average of Cost per
order in column B
• Right-click any value
there, and select Sort
In view of cutting costs,
we can follow up on
why the average order
cost at supplier Alum
Sheeting is the largest
33
Example 11: descriptive statistics in PivotTable
(4/4)

From the same pop-up


dialog after clicking the
arrow next to any field
in Values to select
Value Field Settings
(back in Slide 32), click
Show Values As tab to
show various kinds of
percentages or apply
other computations

34
Bar/column charts (for 1 categorical variable &
1 quantitative variable)
• Instead of using a PivotTable, we can visualize summary of data of a
quantitative variable across each category of a categorical variable in
bar/column charts
• e.g., based on the PivotTable
in Example 11, in PivotChart
Analyse tab, select
PivotChart, and then choose
either Column or Bar

35
Association between variables
• Two variables have a statistical relationship, or an association, if 1
variable changes, the other variable tends to change in a certain
manner
• When 2 variables appear to be related/associated based on some
observed data, one might suspect a cause-and-effect relationship – 1
variable causes the other to change
 However, statistical relationships exist even though a change in 1
variable is not caused by a change in the other
 Never conclude any cause-and-effect relationship simply based on
observational data
36
Measure of association – covariance
• Covariance is a measure of the linear association/relationship
between 2 quantitative variables, X and Y, denoted by cov(X,Y),
Defined as average of products of deviations of each pair of observation (X,Y)
from their respective means
• Like variance, different formulas are used for populations and
samples, e.g., in :
Population covariance: =COVARIANCE.S(array1, array2)
Sample covariance: =COVARIANCE.P(array1, array2)
array1 and array2, of the same length/size, contain data of X and data of Y,
where their values at any corresponding position form a pair of observations
37 (X,Y) observed from the same entity
Example 12: computing covariance (1/2)
Consider [Link]
[below], which include data from 49
colleges/universities
• A scatterplot of Graduation % and
Median SAT reveals a linear relationship

38
cov(X,Y): average of products of
deviations of each pair

Example 12: computing covariance (2/2) of observation (X,Y) from


their respective means

= SUM(F2:F50)
= COUNT(F2:F50)
= F51/(F52-1)

A covariance > 0 corresponds to positive


linear relationship: 1 variable changes, the
39 other tends to change in the same direction = COVARIANCE.S(F2:F50)
Strength of association – correlation
To describe the strength of linear relationship (i.e., how high/strong the
tendency to change is), it is necessary to “descale” the covariance to
obtain
corr(X,Y)
• Correlation coefficient is a “standardized” measure of the linear
relationship between 2 quantitative variables, X and Y
Unit-free (i.e., not depending on units/scales of data values)
Ranges between -1 and 1
Variables tend to change in the same direction when >0; otherwise in opposite
direction when <0
40 The closer to -1 or 1, the higher the tendency to change
Interpretation of correlation

Higher tendency Higher tendency


Y tends to change in opposite Y tends to change in the same
direction as how X changes direction as how X changes
Negative correlation Positive correlation

Correlation coefficient

41
Example 13: computing correlation

= COVARIANCE.S(F2:F50)
= F53/B52/C52

A correlation of 0.56, mid-way between 0


and 1, corresponds to a moderately positive
42 linear relationship between the 2 variables = CORREL(F2:F50)
A correlation matrix
• Correlations of every pair of variables in a collection of >2 quantitative
variables can be visualized by a correlation matrix, e.g., obtained
below by 5 variables from [Link]

• In :
In Data tab, select Data Analysis  Correlation
43 Insert Input Range:
Identifying outliers
• No standard definition of what constitutes an outlier
• Typical rules of thumb to classify a data value as a “potential” outlier:
1. z-score too far away from 0, e.g., >3 or < −3
2. Value > Q3 + 1.5 x IQR or < Q1 - 1.5 x IQR [rules used in boxplots]
IQR = Q3 – Q1 is the interquartile range, another possible measure of
dispersion
• NEVER simply remove any data value upon satisfying any rule
 See if there is any clerical error
 Study why it is atypical from other values
44

You might also like