Chapter 3
Describing Data: Numerical
Chap 3-1
Chapter Goals
After completing this chapter, you should be able to:
- Compute and interpret the mean, median, and mode for a set of data
- Find the range, variance, standard deviation, and coefficient of variation, and know what these values mean
- Apply the empirical rule to describe the variation of population values around the mean
- Explain the weighted mean and when to use it
- Explain how a least squares regression line estimates a linear relationship between two variables
Chapter Topics
- Measures of central tendency, variation, and shape
  - Mean, median, mode, geometric mean
  - Quartiles
  - Range, interquartile range, variance and standard deviation, coefficient of variation
  - Symmetric and skewed distributions
- Population summary measures
  - Mean, variance, and standard deviation
  - The empirical rule and Bienaymé-Chebyshev rule
Chapter Topics
(continued)
- Five number summary and box-and-whisker plots
- Covariance and coefficient of correlation
- Pitfalls in numerical descriptive measures and ethical considerations
Describing Data Numerically
- Central Tendency: Arithmetic Mean, Median, Mode
- Variation: Range, Interquartile Range, Variance, Standard Deviation, Coefficient of Variation
Measures of Central Tendency
Overview
Central Tendency:
- Mean: the arithmetic average
  \[ \bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} \]
- Median: the midpoint of the ranked values
- Mode: the most frequently observed value
Arithmetic Mean
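- The arithmetic mean (mean) is the most common measure of central tendency
- For a population of N values:
  \[ \mu = \frac{\sum_{i=1}^{N} x_i}{N} = \frac{x_1 + x_2 + \cdots + x_N}{N} \]
  where the x_i are the population values and N is the population size
- For a sample of size n:
  \[ \bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} = \frac{x_1 + x_2 + \cdots + x_n}{n} \]
  where the x_i are the observed values and n is the sample size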
Arithmetic Mean
(continued)
- The most common measure of central tendency
- Mean = sum of values divided by the number of values
- Affected by extreme values (outliers)
[Figure: two dot plots on a 0 to 10 axis]
  Data 1, 2, 3, 4, 5: Mean = (1 + 2 + 3 + 4 + 5)/5 = 15/5 = 3
  Data 1, 2, 3, 4, 10: Mean = (1 + 2 + 3 + 4 + 10)/5 = 20/5 = 4
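The effect of a single extreme value on the mean can be checked directly. The sketch below uses the two five-value data sets from the slide; the function name `arithmetic_mean` is chosen here for illustration only.

```python
def arithmetic_mean(values):
    """Return the sum of the values divided by the number of values."""
    return sum(values) / len(values)


without_outlier = [1, 2, 3, 4, 5]
with_outlier = [1, 2, 3, 4, 10]  # same data, largest value replaced by 10

mean_without = arithmetic_mean(without_outlier)
mean_with = arithmetic_mean(with_outlier)

print(mean_without)  # 3.0
print(mean_with)     # 4.0
```

Changing one value out of five moves the mean from 3 to 4, which is the sensitivity the slide describes.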
Median
- In an ordered list, the median is the “middle” number (50% above, 50% below)
[Figure: two dot plots on a 0 to 10 axis, each with Median = 3]
- Not affected by extreme values
Finding the Median
- The location of the median:
  \[ \text{Median position} = \frac{n+1}{2} \text{ position in the ordered data} \]
  - If the number of values is odd, the median is the middle number
  - If the number of values is even, the median is the average of the two middle numbers
- Note that \( \frac{n+1}{2} \) is not the value of the median, only the position of the median in the ranked data
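The position rule can be sketched in a few lines. The function name `median_by_position` and the even-sized example are illustrative assumptions, not part of the slides.

```python
def median_by_position(values):
    """Find the median via the (n + 1) / 2 position in the ranked data."""
    ranked = sorted(values)
    n = len(ranked)
    position = (n + 1) / 2  # 1-based position, may end in .5
    if n % 2 == 1:
        # Odd n: the position is a whole number, so take that value.
        return ranked[int(position) - 1]
    # Even n: the position falls halfway between two ranked values.
    lower = ranked[n // 2 - 1]
    upper = ranked[n // 2]
    return (lower + upper) / 2


odd_example = [1, 2, 3, 4, 10]   # n = 5, position 3
even_example = [4, 1, 3, 2]      # hypothetical data, n = 4, position 2.5

print(median_by_position(odd_example))   # 3
print(median_by_position(even_example))  # 2.5
```

The standard library's `statistics.median` follows the same odd/even convention.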
Mode
- A measure of central tendency
- Value that occurs most often
- Not affected by extreme values
- Used for either numerical or categorical data
- There may be no mode
- There may be several modes
[Figure: two dot plots. On the 0 to 14 axis: Mode = 9. On the 0 to 6 axis: No Mode.]
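A small sketch shows the three cases: one mode, several modes, and no mode. It assumes the convention that a data set has no mode when every distinct value occurs equally often; the function name `modes` and all three data sets are hypothetical.

```python
from collections import Counter


def modes(values):
    """Return every most-frequent value, or [] when there is no mode.

    Assumed convention: if all distinct values occur equally often
    (and there is more than one distinct value), report no mode.
    """
    counts = Counter(values)
    highest = max(counts.values())
    if len(counts) > 1 and all(c == highest for c in counts.values()):
        return []
    return sorted(v for v, c in counts.items() if c == highest)


print(modes([2, 3, 3, 5, 9, 9, 9]))                      # [9]
print(modes([1, 2, 3, 4, 5, 6]))                         # []
print(modes(["red", "blue", "red", "blue", "green"]))    # ['blue', 'red']
```

The last call illustrates that the mode also works for categorical data.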
Review Example
- Five houses on a hill by the beach
House Prices:
  $2,000,000
  $500,000
  $300,000
  $100,000
  $100,000
Review Example:
Summary Statistics
House Prices: $2,000,000; $500,000; $300,000; $100,000; $100,000 (Sum = $3,000,000)
- Mean: $3,000,000/5 = $600,000
- Median: middle value of ranked data = $300,000
- Mode: most frequent value = $100,000
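The three summary statistics for the five house prices can be reproduced with Python's standard `statistics` module, as a quick check of the hand calculation:

```python
import statistics

# House prices from the review example, in dollars.
prices = [2_000_000, 500_000, 300_000, 100_000, 100_000]

mean_price = statistics.mean(prices)
median_price = statistics.median(prices)
mode_price = statistics.mode(prices)

print(mean_price)    # 600000
print(median_price)  # 300000
print(mode_price)    # 100000
```

The single $2,000,000 house pulls the mean to twice the median, which is why the choice of measure matters for data like these.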
Which measure of location
is the “best”?
- Mean is generally used, unless extreme values (outliers) exist
- Then median is often used, since the median is not sensitive to extreme values
  - Example: Median home prices may be reported for a region because they are less sensitive to outliers
Shape of a Distribution
- Describes how data are distributed
- Measures of shape
  - Symmetric or skewed
[Figure: three distribution shapes]
  Left-Skewed: Mean < Median
  Symmetric: Mean = Median
  Right-Skewed: Median < Mean
Measures of Variability
Variation: Range, Interquartile Range, Variance, Standard Deviation, Coefficient of Variation
- Measures of variation give information on the spread or variability of the data values.
[Figure: two distributions with the same center but different variation]
Range
- Simplest measure of variation
- Difference between the largest and the smallest observations:
  Range = x_largest - x_smallest
Example:
[Figure: dot plot on a 0 to 14 axis]
  Range = 14 - 1 = 13
Disadvantages of the Range
- Ignores the way in which data are distributed
  [Figure: two dot plots on a 7 to 12 axis with different distributions; Range = 12 - 7 = 5 for both]
- Sensitive to outliers
  1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,3,3,3,3,4,5
  Range = 5 - 1 = 4
  1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,3,3,3,3,4,120
  Range = 120 - 1 = 119
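The outlier sensitivity is easy to reproduce. This sketch rebuilds the two 25-value data sets from the slide; the helper name `data_range` is an illustrative choice.

```python
def data_range(values):
    """Range = largest observation minus smallest observation."""
    return max(values) - min(values)


# The first 24 values are shared by both data sets on the slide.
shared = [1] * 11 + [2] * 8 + [3] * 4 + [4]

typical = shared + [5]
with_outlier = shared + [120]

print(data_range(typical))       # 4
print(data_range(with_outlier))  # 119
```

Only one of 25 values changed, yet the range grew from 4 to 119.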
Interquartile Range
- Can eliminate some outlier problems by using the interquartile range
- Eliminate high- and low-valued observations and calculate the range of the middle 50% of the data
- Interquartile range = 3rd quartile - 1st quartile
  IQR = Q3 - Q1
Interquartile Range
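Example:
[Figure: box-and-whisker plot with marked points Q1 - 1.5 IQR, Q1, Median (Q2), Q3, and Q3 + 1.5 IQR; each of the four segments holds 25% of the data]
  Q1 = 30, Median (Q2) = 45, Q3 = 57
  Interquartile range = 57 - 30 = 27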
Quartiles
- Quartiles split the ranked data into 4 segments with an equal number of values per segment
  [Diagram: 25% | Q1 | 25% | Q2 | 25% | Q3 | 25%]
- The first quartile, Q1, is the value for which 25% of the observations are smaller and 75% are larger
- Q2 is the same as the median (50% are smaller, 50% are larger)
- Only 25% of the observations are greater than the third quartile, Q3
Quartile Formulas
Find a quartile by determining the value in the appropriate position in the ranked data, where
  First quartile position: Q1 position = 0.25(n+1)
  Second quartile position: Q2 position = 0.50(n+1) (the median position)
  Third quartile position: Q3 position = 0.75(n+1)
and n is the number of observed values. These formulas give positions in the ranked data, not the quartile values themselves.
Quartiles
- Example: Find the first quartile
  Sample Ranked Data (n = 9): 11 12 13 16 16 17 18 21 22
  Q1 is in the 0.25(9+1) = 2.5 position of the ranked data, so use the value halfway between the 2nd and 3rd values (12 and 13): Q1 = 12.5
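The position formulas can be turned into a short routine. This is a sketch under one assumption the slides do not spell out: fractional positions other than .5 are also handled by linear interpolation between the two neighboring ranked values. The function names are illustrative.

```python
def value_at_position(ranked, position):
    """Value at a 1-based, possibly fractional, position in ranked data.

    Assumption: fractional positions are resolved by linear interpolation.
    """
    lower_index = int(position)        # whole part of the position
    fraction = position - lower_index  # fractional part, 0 <= fraction < 1
    lower_value = ranked[lower_index - 1]
    if fraction == 0:
        return lower_value
    upper_value = ranked[lower_index]
    return lower_value + fraction * (upper_value - lower_value)


def quartiles(values):
    """Return (Q1, Q2, Q3) using positions 0.25, 0.50, 0.75 times (n + 1)."""
    ranked = sorted(values)
    n = len(ranked)
    if n < 3:
        raise ValueError("need at least 3 values for these position formulas")
    return tuple(value_at_position(ranked, q * (n + 1)) for q in (0.25, 0.50, 0.75))


sample = [11, 12, 13, 16, 16, 17, 18, 21, 22]
q1, q2, q3 = quartiles(sample)
iqr = q3 - q1

print(q1, q2, q3)  # 12.5 16 19.5
print(iqr)         # 7.0
```

Other quartile conventions exist, so statistical software may report slightly different values for the same data.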
Population Variance
- Average of squared deviations of values from the mean
- Population variance:
  \[ \sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N} \]
  where μ = population mean
        N = population size
        x_i = ith value of the variable x
Sample Variance
- Average (approximately) of squared deviations of values from the mean
- Sample variance:
  \[ s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1} \]
  where \bar{x} = arithmetic mean
        n = sample size
        x_i = ith value of the variable x
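The population and sample variances differ only in the divisor: N for a population, n - 1 for a sample. The sketch below makes that explicit and cross-checks against the standard `statistics` module; the data set is hypothetical and the function names are chosen for illustration.

```python
import statistics


def population_variance(values):
    """Sum of squared deviations from the mean, divided by N."""
    mu = sum(values) / len(values)
    return sum((x - mu) ** 2 for x in values) / len(values)


def sample_variance(values):
    """Sum of squared deviations from the mean, divided by n - 1."""
    x_bar = sum(values) / len(values)
    return sum((x - x_bar) ** 2 for x in values) / (len(values) - 1)


data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical data, mean 5, squared deviations sum to 32

print(population_variance(data))  # 4.0
print(sample_variance(data))      # 32 / 7, about 4.571

# The standard library uses the same two definitions.
print(statistics.pvariance(data), statistics.variance(data))
```

The sample variance is always the larger of the two for the same data, because it divides the same sum by a smaller number.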
Population Standard Deviation
- Most commonly used measure of variation
- Shows variation about the mean
- Has the same units as the original data
- Population standard deviation:
  \[ \sigma = \sqrt{\frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}} \]
Sample Standard Deviation
- Most commonly used measure of variation
- Shows variation about the mean
- Has the same units as the original data
- Sample standard deviation:
  \[ s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}} \]
Calculation Example:
Sample Standard Deviation
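Sample data (x_i): 10 12 14 15 17 18 18 24
n = 8, Mean = \bar{x} = 16
\[ s = \sqrt{\frac{(10 - \bar{x})^2 + (12 - \bar{x})^2 + (14 - \bar{x})^2 + \cdots + (24 - \bar{x})^2}{n - 1}} \]
\[ = \sqrt{\frac{(10 - 16)^2 + (12 - 16)^2 + (14 - 16)^2 + \cdots + (24 - 16)^2}{8 - 1}} \]
\[ = \sqrt{\frac{36 + 16 + 4 + 1 + 1 + 4 + 4 + 64}{7}} = \sqrt{\frac{130}{7}} \approx 4.3095 \]
A measure of the “average” scatter around the mean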
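The hand calculation can be checked step by step. This sketch uses the eight sample values from the slide and prints each intermediate quantity so that the arithmetic can be compared line by line.

```python
import math

data = [10, 12, 14, 15, 17, 18, 18, 24]  # sample values from the slide

n = len(data)
mean = sum(data) / n

# Squared deviation of each value from the sample mean.
squared_deviations = [(x - mean) ** 2 for x in data]
sum_of_squares = sum(squared_deviations)

# Sample standard deviation: divide by n - 1, then take the square root.
s = math.sqrt(sum_of_squares / (n - 1))

print("n =", n)
print("mean =", mean)
print("squared deviations =", squared_deviations)
print("sum of squared deviations =", sum_of_squares)
print("s =", round(s, 4))
```

`statistics.stdev(data)` from the standard library returns the same sample standard deviation in one call.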
Measuring variation
[Figure: two distributions, one with a small standard deviation and one with a large standard deviation]
Comparing Standard Deviations
[Figure: three dot plots on an 11 to 21 axis with the same mean but different spread]
  Data A: Mean = 15.5, s = 3.338
  Data B: Mean = 15.5, s = 0.926
  Data C: Mean = 15.5, s = 4.570
Advantages of Variance and
Standard Deviation
- Each value in the data set is used in the calculation
- Values far from the mean are given extra weight (because deviations from the mean are squared)
Chebyshev’s Theorem
- For any population with mean μ and standard deviation σ, and k > 1, the percentage of observations that fall within the interval
  \[ [\mu - k\sigma,\ \mu + k\sigma] \]
  is at least
  \[ 100\left[1 - \frac{1}{k^2}\right]\% \]
Chebyshev’s Theorem
(continued)
- Regardless of how the data are distributed, at least (1 - 1/k^2) of the values will fall within k standard deviations of the mean (for k > 1)
- Examples:
  At least                 within
  (1 - 1/1^2) = 0%         k = 1 (μ ± 1σ)
  (1 - 1/2^2) = 75%        k = 2 (μ ± 2σ)
  (1 - 1/3^2) ≈ 89%        k = 3 (μ ± 3σ)
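The bound is simple enough to tabulate, and it can be compared with the share of values that actually falls inside the interval for a concrete data set. In this sketch the function names are illustrative, and the 25-value list (with one extreme value of 120) is treated as a population so that the population standard deviation applies.

```python
import statistics


def chebyshev_lower_bound(k):
    """Minimum share of values within k standard deviations of the mean."""
    if k < 1:
        raise ValueError("the bound is only informative for k >= 1")
    return 1 - 1 / k ** 2


def fraction_within(values, k):
    """Observed share of values inside [mu - k*sigma, mu + k*sigma]."""
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values)  # population standard deviation
    inside = [x for x in values if abs(x - mu) <= k * sigma]
    return len(inside) / len(values)


for k in (1, 2, 3):
    print(k, round(chebyshev_lower_bound(k), 4))  # 0.0, 0.75, 0.8889

# A strongly skewed data set still respects the bound.
skewed = [1] * 11 + [2] * 8 + [3] * 4 + [4, 120]
print(fraction_within(skewed, 2))  # 0.96, at least 0.75
```

Because the theorem makes no assumption about shape, the guaranteed share is usually well below what a particular data set shows.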
The Empirical Rule
- If the data distribution is bell-shaped, then the interval:
  - μ ± 1σ contains about 68% of the values in the population or the sample
[Figure: bell curve centered at μ with the central 68% (μ ± 1σ) marked]
The Empirical Rule
- μ ± 2σ contains about 95% of the values in the population or the sample
- μ ± 3σ contains about 99.7% of the values in the population or the sample
[Figure: two bell curves, one with the central 95% (μ ± 2σ) marked and one with the central 99.7% (μ ± 3σ) marked]
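The 68%, 95%, and 99.7% figures match the normal distribution, the standard model for bell-shaped data. The sketch below assumes an exactly normal distribution and uses `statistics.NormalDist` (Python 3.8 or later); real bell-shaped data will only approximate these shares.

```python
from statistics import NormalDist

# Standard normal distribution: mean 0, standard deviation 1.
standard_normal = NormalDist(mu=0, sigma=1)


def share_within(k):
    """Share of a normal distribution within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)


for k in (1, 2, 3):
    print(k, round(share_within(k), 4))  # 0.6827, 0.9545, 0.9973
```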
Coefficient of Variation
- Measures relative variation
- Always in percentage (%)
- Shows variation relative to mean
- Can be used to compare two or more sets of data measured in different units
  \[ CV = \left(\frac{s}{\bar{x}}\right) \times 100\% \]
Comparing Coefficient
of Variation
- Stock A:
  - Average price last year = $50
  - Standard deviation = $5
  \[ CV_A = \left(\frac{s}{\bar{x}}\right) \times 100\% = \frac{\$5}{\$50} \times 100\% = 10\% \]
- Stock B:
  - Average price last year = $100
  - Standard deviation = $5
  \[ CV_B = \left(\frac{s}{\bar{x}}\right) \times 100\% = \frac{\$5}{\$100} \times 100\% = 5\% \]
Both stocks have the same standard deviation, but stock B is less variable relative to its price.
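The stock comparison reduces to one line of arithmetic per stock. A minimal sketch, with the function name chosen here for illustration:

```python
def coefficient_of_variation(std_dev, mean):
    """Standard deviation as a percentage of the mean."""
    if mean == 0:
        raise ValueError("the coefficient of variation is undefined for a zero mean")
    return 100 * std_dev / mean


cv_a = coefficient_of_variation(std_dev=5, mean=50)    # Stock A
cv_b = coefficient_of_variation(std_dev=5, mean=100)   # Stock B

print(cv_a)  # 10.0
print(cv_b)  # 5.0
```

Because the dollar units cancel, the two percentages can be compared directly even when the price levels differ.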
Weighted Mean
- The weighted mean of a set of data is
  \[ \bar{x} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i} = \frac{w_1 x_1 + w_2 x_2 + \cdots + w_n x_n}{\sum_{i=1}^{n} w_i} \]
  - where w_i is the weight of the ith observation
  - Use when data are already grouped into n classes, with w_i values in the ith class
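A short sketch of the weighted mean follows. The example data (course grades weighted by credit hours) and the function name are hypothetical and serve only to illustrate the formula.

```python
def weighted_mean(values, weights):
    """Sum of weight * value, divided by the sum of the weights."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    return sum(w * x for w, x in zip(weights, values)) / total_weight


# Hypothetical example: grade points weighted by credit hours.
grade_points = [4.0, 3.0, 2.0]
credit_hours = [3, 4, 1]

print(weighted_mean(grade_points, credit_hours))  # (12 + 12 + 2) / 8 = 3.25
```

With equal weights the result reduces to the ordinary arithmetic mean.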
Approximations for Grouped Data
Suppose a data set contains values m_1, m_2, ..., m_K, occurring with frequencies f_1, f_2, ..., f_K
- For a population of N observations, the mean is
  \[ \mu = \frac{\sum_{i=1}^{K} f_i m_i}{N} \quad \text{where } N = \sum_{i=1}^{K} f_i \]
- For a sample of n observations, the mean is
  \[ \bar{x} = \frac{\sum_{i=1}^{K} f_i m_i}{n} \quad \text{where } n = \sum_{i=1}^{K} f_i \]
Approximations for Grouped Data
Suppose a data set contains values m_1, m_2, ..., m_K, occurring with frequencies f_1, f_2, ..., f_K
- For a population of N observations, the variance is
  \[ \sigma^2 = \frac{\sum_{i=1}^{K} f_i (m_i - \mu)^2}{N} \]
- For a sample of n observations, the variance is
  \[ s^2 = \frac{\sum_{i=1}^{K} f_i (m_i - \bar{x})^2}{n - 1} \]
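The grouped-data formulas for the mean and variance can be sketched together. The values and frequencies below are hypothetical, and the function names are chosen for illustration.

```python
def grouped_mean(values, frequencies):
    """Mean of grouped data: sum of f_i * m_i divided by the total frequency."""
    total = sum(frequencies)
    return sum(f * m for f, m in zip(frequencies, values)) / total


def grouped_variance(values, frequencies, sample=True):
    """Variance of grouped data.

    Divides by n - 1 for a sample and by N for a population.
    """
    total = sum(frequencies)
    mean = grouped_mean(values, frequencies)
    sum_of_squares = sum(f * (m - mean) ** 2 for f, m in zip(frequencies, values))
    return sum_of_squares / (total - 1 if sample else total)


# Hypothetical grouped data: values m_i with frequencies f_i (10 observations).
m = [5, 15, 25, 35]
f = [2, 5, 2, 1]

print(grouped_mean(m, f))                    # 170 / 10 = 17.0
print(grouped_variance(m, f, sample=False))  # 760 / 10 = 76.0
print(grouped_variance(m, f, sample=True))   # 760 / 9, about 84.44
```

When the m_i are class midpoints rather than exact values, these results approximate the mean and variance of the underlying raw data.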
Chapter Summary
- Described measures of central tendency
  - Mean, median, mode
- Illustrated the shape of the distribution
  - Symmetric, skewed
- Described measures of variation
  - Range, interquartile range, variance and standard deviation, coefficient of variation
- Discussed measures of grouped data