Statistics and Econometrics
Lesson 3: Central Tendency and Dispersion
Zhejin ZHAO
Statistical Analysis for Business & Economics: Spring 2011
Introduction
Recall Lesson 2, where we used graphical techniques to
describe data:
[Figure: histogram of waiting time. Horizontal axis: waiting time in bins 0~5, 5~10, 10~15, 15~20, 20~25, 25~30, 30~35; vertical axis: frequency.]
While this histogram provides some new insight, other
interesting questions (e.g., what is the average waiting time?)
remain unanswered.
Introduction: Summarizing Distributions
3 Measures of Central Tendency
Statistic | Formula | Excel Formula | Pro | Con
Mean | Average of all the data | =AVERAGE(Data) | Familiar and uses all the data information. | Influenced by extreme values.
Median | Middle value in sorted array | =MEDIAN(Data) | Robust when extreme data values exist. | May not be influenced by extreme values.
3 Measures of Central Tendency (cont’d)
E.g. 1 2 2 2 2 4 7
Mean? Median? Mode?
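The three measures for this small data set can be checked with a short Python sketch. It assumes Python's standard `statistics` module; the variable names are only illustrative.

```python
import statistics

# Data from the slide: 1 2 2 2 2 4 7
data = [1, 2, 2, 2, 2, 4, 7]

mean = statistics.mean(data)      # sum of values / number of values = 20 / 7
median = statistics.median(data)  # middle value of the sorted data
mode = statistics.mode(data)      # most frequently occurring value

print(f"mean = {mean:.2f}, median = {median}, mode = {mode}")
# prints: mean = 2.86, median = 2, mode = 2
```

The single large value 7 pulls the mean above the median and mode, which both stay at 2.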
Notations
When referring to the number of observations in a
population, we use the uppercase letter N.
When referring to the number of observations in a
sample, we use the lowercase letter n.
The mean for a population is denoted by \(\mu\):
parameter or statistic?
The mean for a sample is denoted by \(\bar{x}\):
parameter or statistic?
Characteristics of the Mean
Note: the mean is very sensitive to
exceptionally large or small
observations, called outliers.
Examples
As soon as a billionaire
moves into a neighborhood,
the average household
income increases beyond what
it was previously!
Characteristics of the Mean
Imagine Ming Yao were in this class: what would happen to
the mean height of the class?
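The effect of one very tall student on the class mean can be illustrated with a small sketch. The class heights below are hypothetical values chosen for illustration and are not from the slides; the added height of about 229 cm is the player's approximate height.

```python
import statistics

# Hypothetical heights in cm for a small class (illustrative values only).
class_heights = [160, 165, 170, 172, 175, 178]

# Add one exceptionally tall person (about 229 cm).
with_outlier = class_heights + [229]

mean_before = statistics.mean(class_heights)
mean_after = statistics.mean(with_outlier)

print(f"mean without the outlier: {mean_before:.1f} cm")
print(f"mean with the outlier:    {mean_after:.1f} cm")
```

Under these assumed values, a single extreme observation raises the mean by several centimetres even though no other student's height changed.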
Median
Defined as the value below which 50% of the observations
in the data set fall and above which the other 50% fall.
The median separates the upper and lower halves of the sorted
observations.
If n is odd, the median is the middle observation in the
data array.
If n is even, the median is the average of the middle two
observations in the data array.
Median (cont’d)
Example 1
Consider the following n = 6 data values:
11 12 15 17 21 32
What is the median?
For even n, Median = \(\dfrac{x_{n/2} + x_{(n/2+1)}}{2}\)
n/2 = 6/2 = 3 and n/2+1 = 6/2 + 1 = 4
M = \((x_3 + x_4)/2\) = (15 + 17)/2 = 16
11 12 15 [16] 17 21 32 (median shown in brackets between the two middle observations)
Median (cont’d)
Example 2
Consider the following n = 7 data values:
12 23 23 25 27 34 41
What is the median?
For odd n, Median = \(x_{(n+1)/2}\)
(n+1)/2 = (7+1)/2 = 8/2 = 4
M = \(x_4\) = 25
12 23 23 [25] 27 34 41 (median shown in brackets)
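The odd and even rules from the two examples can be sketched as one small function. This is a minimal illustration rather than a library implementation; the function name `median` is an assumption made for this sketch.

```python
def median(values):
    """Return the median using the odd/even rules from the slides."""
    ordered = sorted(values)  # the rules apply to the sorted data array
    n = len(ordered)
    if n == 0:
        raise ValueError("median of an empty data set is undefined")
    if n % 2 == 1:
        # Odd n: middle observation x_{(n+1)/2} (1-based) is index (n-1)//2 (0-based).
        return ordered[(n - 1) // 2]
    # Even n: average of x_{n/2} and x_{n/2+1} (1-based positions).
    return (ordered[n // 2 - 1] + ordered[n // 2]) / 2


print(median([11, 12, 15, 17, 21, 32]))      # Example 1 (n = 6): 16.0
print(median([12, 23, 23, 25, 27, 34, 41]))  # Example 2 (n = 7): 25
```

Sorting inside the function means the caller does not need to pre-sort the data.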
Median (cont’d)
• The median is insensitive to extreme data values.
• For example, consider the following quiz scores for 3 students:
Adrian’s scores:
20, 40, 70, 75, 80 Mean = 57, Median = 70, Total = 285
Dustin’s scores:
60, 65, 70, 90, 95 Mean = 76, Median = 70, Total = 380
Josh’s scores:
50, 65, 70, 75, 90 Mean = 70, Median = 70, Total = 350
• In the above case, is the median informative?
Mode
Defined as the most frequently occurring data value.
A data set may have multiple modes or no mode.
Cf. the mean and median, each of which is unique for a data set.
Mode (cont’d)
An example
Consider the following quiz scores for 4 students:
Scarlett’s scores:
60, 70, 70, 70, 80 Mean = 70, Median = 70, Mode = 70
Johan’s scores:
45, 45, 70, 90, 100 Mean = 70, Median = 70, Mode = 45
Tim’s scores:
50, 60, 70, 80, 90 Mean = 70, Median = 70, Mode = none
Ryan’s scores:
50, 50, 70, 90, 90 Mean = 70, Median = 70, Modes = 50,90
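The four cases above (one mode, a mode away from the center, no mode, two modes) can be reproduced with a short sketch. The helper `modes` is a hypothetical name, and it follows the slide's convention that a data set with no repeated value has no mode.

```python
from collections import Counter


def modes(values):
    """Return all modes, or an empty list when no value repeats."""
    counts = Counter(values)
    if not counts:
        return []
    highest = max(counts.values())
    if highest == 1:
        # Slide convention: every value occurring once means "no mode".
        return []
    return sorted(v for v, c in counts.items() if c == highest)


scores = {
    "Scarlett": [60, 70, 70, 70, 80],
    "Johan": [45, 45, 70, 90, 100],
    "Tim": [50, 60, 70, 80, 90],
    "Ryan": [50, 50, 70, 90, 90],
}

for name, values in scores.items():
    print(name, modes(values))
```

Note that Python's built-in `statistics.multimode` would return every value for Tim's scores, so the "no mode" case is handled explicitly here.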
Mean, Median, Mode
If a distribution is symmetric, the mean, median and
mode may coincide.
[Figure: symmetric distribution with the median, mode, and mean marked.]
Mean, Median, Mode
If a distribution is asymmetrical, say skewed to the left or
to the right, the three measures may differ. E.g.:
[Figure: skewed distribution with the median, mode, and mean marked.]
Practice
1, 2, 3, 4, 2, 2, 3, 4, 5, 2
1. Calculate mean, median and mode of the data set.
2. Add 2 to each observation and recalculate mean,
median and mode.
3. Multiply each original observation by 2 and recalculate
mean, median and mode.
4. Describe how each measure of center changed with the
different operations.
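One way to check answers to this exercise is a short script. It is a sketch that assumes Python's standard `statistics` module; the helper name `summarize` is illustrative.

```python
import statistics


def summarize(values):
    """Return (mean, median, mode) for a list of numbers."""
    return (
        statistics.mean(values),
        statistics.median(values),
        statistics.mode(values),
    )


data = [1, 2, 3, 4, 2, 2, 3, 4, 5, 2]
shifted = [x + 2 for x in data]  # step 2: add 2 to each observation
scaled = [x * 2 for x in data]   # step 3: multiply each observation by 2

for label, values in [("original", data), ("plus 2", shifted), ("times 2", scaled)]:
    mean, median, mode = summarize(values)
    print(f"{label:>8}: mean = {mean}, median = {median}, mode = {mode}")
```

Adding a constant shifts all three measures by that constant, and multiplying by a constant scales all three by the same factor.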
Tips: Which Measures to Use?
For ordinal and nominal data the calculation of the mean is
not valid.
Median is appropriate for ordinal data.
For nominal data, the mode is useful for
identifying the category with the highest frequency.
Outline of Dispersion
Dispersion is the “spread” of data points about the center
of the distribution.
Measures of dispersion:
• Range
• Variance
• Standard deviation
• Coefficient of variation
Range
The difference between the largest and smallest
observation.
Range = \(x_{\max} - x_{\min}\)
An example: Tina’s homework scores
85 98 87 83 84 7 86
Range = 98 – 7 = 91
Drawback: determined by only the two extreme values
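The drawback can be seen by computing the range with and without the single low score. This is a small sketch using Tina's scores from the slide; the variable names are illustrative.

```python
# Tina's homework scores from the slide.
scores = [85, 98, 87, 83, 84, 7, 86]

full_range = max(scores) - min(scores)
print(full_range)  # 98 - 7 = 91

# Drop the one extreme low score and recompute.
without_low = [s for s in scores if s != 7]
trimmed_range = max(without_low) - min(without_low)
print(trimmed_range)  # 98 - 83 = 15
```

Removing one observation changes the range from 91 to 15, which shows how strongly it depends on the two extreme values.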
Variance
The population variance is defined as the sum of squared deviations around
the mean divided by the population size:
\[ \sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N} \]
For the sample variance, we divide by n – 1 instead of n; otherwise it would tend
to underestimate the unknown population variance:
\[ s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1} \]
Note: the denominator is the sample size (n) minus one!
Drawback: its units are the square of the data's units, so it is hard to interpret.
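The two definitions differ only in the mean used and the denominator, which a direct implementation makes visible. This is a minimal sketch; the function names are assumptions, and the data are Tina's homework scores from the Range slide, treated first as a population and then as a sample purely for illustration.

```python
def population_variance(values):
    """Sum of squared deviations around the mean, divided by N."""
    n = len(values)
    mu = sum(values) / n
    return sum((x - mu) ** 2 for x in values) / n


def sample_variance(values):
    """Sum of squared deviations around the sample mean, divided by n - 1."""
    n = len(values)
    if n < 2:
        raise ValueError("sample variance needs at least two observations")
    x_bar = sum(values) / n
    return sum((x - x_bar) ** 2 for x in values) / (n - 1)


scores = [85, 98, 87, 83, 84, 7, 86]
print(f"divide by N:     {population_variance(scores):.2f}")
print(f"divide by n - 1: {sample_variance(scores):.2f}")
```

For the same data, dividing by n – 1 always gives the larger of the two values.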
Standard Deviation
The square root of the variance.
Explains how individual values in a data set vary from the
mean.
Units of measure are the same as X.
Population standard deviation:
\[ \sigma = \sqrt{\frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}} \]
Sample standard deviation:
\[ s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}} \]
Standard Deviation (cont’d)
Excel’s built-in functions are:
Statistic | Excel population formula | Excel sample formula
Variance | =VARP(Array) | =VAR(Array)
Standard deviation | =STDEVP(Array) | =STDEV(Array)
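For readers working outside Excel, Python's standard `statistics` module offers the same four calculations. The sketch below pairs each function with the Excel formula it corresponds to, using Scarlett's quiz scores from the Mode example as illustrative data.

```python
import statistics

# Scarlett's quiz scores from the Mode example.
scores = [60, 70, 70, 70, 80]

pop_var = statistics.pvariance(scores)  # corresponds to =VARP(Array)
samp_var = statistics.variance(scores)  # corresponds to =VAR(Array)
pop_sd = statistics.pstdev(scores)      # corresponds to =STDEVP(Array)
samp_sd = statistics.stdev(scores)      # corresponds to =STDEV(Array)

print(f"population variance = {pop_var}, sample variance = {samp_var}")
print(f"population s.d. = {pop_sd:.2f}, sample s.d. = {samp_sd:.2f}")
```

Here the squared deviations sum to 200, so the population variance is 200/5 = 40 and the sample variance is 200/4 = 50.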
Calculating a Standard Deviation
Consider the following five quiz scores for Stephanie.
(Table 412)
Now, calculate the sample standard deviation:
\[ s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}} = \sqrt{\frac{2380}{5 - 1}} = \sqrt{595} \approx 24.39 \]
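The arithmetic in this calculation can be verified in a few lines. The sketch takes the sum of squared deviations (2380) and the sample size (5) as given on the slide; the individual quiz scores are not reproduced here.

```python
import math

sum_sq_dev = 2380  # sum of squared deviations around the mean, from the slide
n = 5              # number of quiz scores

sample_variance = sum_sq_dev / (n - 1)
s = math.sqrt(sample_variance)

print(sample_variance)  # 595.0
print(round(s, 2))      # 24.39
```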
Calculating a Standard Deviation
The standard deviation is nonnegative because
deviations around the mean are squared.
When every observation is exactly equal to the mean,
the standard deviation is zero.
Standard deviations can be large or small, depending
on the units of measure.
Compare standard deviations only for data sets
measured in the same unit.
Coefficient of Variation
The coefficient of variation of a set of observations is
the standard deviation of the observations divided by
their mean, that is:
• Population coefficient of variation: \( CV = \sigma / \mu \)
• Sample coefficient of variation: \( CV = s / \bar{x} \)
This coefficient provides a proportionate measure of
variation that is free of units. It is often multiplied by 100 and reported as a percentage.
It measures relative dispersion.
Coefficient of Variation (cont’d)
Example 1
data sets with the same units
• Let’s compare two stocks you’re interested in:
“Penny” Stock (A): \(\bar{x}\) = $2.00, s = $0.04
Regular Stock (B): \(\bar{x}\) = $100.00, s = $2.00
• Question: Which stock is more unstable?
Coefficient of Variation (cont’d)
• In terms of absolute dispersion, stock B seems to be more
unstable. However, to compare dispersion fairly, we often
scale it by the mean. Using the CV, we have:
Stock A: CV = ($0.04/$2.00) × 100 = 2%
Stock B: CV = ($2.00/$100.00) × 100 = 2%
• In relative terms, the two stocks are equally variable.
Statistic | Formula | Excel | Pro
Coefficient of variation (CV) | \(100 \times s / \bar{x}\) | None | Measures relative variation in percent, so data sets can be compared even with different units (unit free).
Coefficient of Variation (cont’d)
Example 2
data sets with different units
• Compare variability for the following two variables.
National SAT mean = 1000 vs. GPA mean = 2.5
National SAT s.d. = 200 vs. GPA s.d. = 1.2
• Again, the CV is a useful tool for comparison. Because the two
standard deviations are in different units, they cannot be compared directly without it.
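The comparison can be worked through with a small sketch that applies the percentage form of the CV (100 × s.d. / mean) to the summary figures above. The function name `coefficient_of_variation` is an assumption made for this illustration.

```python
def coefficient_of_variation(mean, sd):
    """Return the CV as a percentage: 100 * sd / mean."""
    if mean == 0:
        raise ValueError("CV is undefined when the mean is zero")
    return 100 * sd / mean


sat_cv = coefficient_of_variation(mean=1000, sd=200)
gpa_cv = coefficient_of_variation(mean=2.5, sd=1.2)

print(f"SAT CV = {sat_cv:.0f}%")  # 20%
print(f"GPA CV = {gpa_cv:.0f}%")  # 48%
```

Although the SAT standard deviation is numerically far larger, GPA shows more than twice the relative variation.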
Measures of dispersion
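Measure | Population | Sample
Size | N | n
Mean | \(\mu\) | \(\bar{x}\)
Variance | \(\sigma^2\) | \(s^2\)
Standard deviation | \(\sigma\) | \(s\)
Coefficient of variation | \(CV = \sigma / \mu\) | \(CV = s / \bar{x}\)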
Tips: Which Measures to Use?
If the data are symmetric, with no serious outliers,
use the range and standard deviation.
If comparing variation across two data sets, use the
coefficient of variation.
The measures of variability introduced in this section
can be used only for numerical data.