Chapter 3
Summarizing Data Sets
24 January 2024
Statistics - Spring semester 2023-2024
Outline
- Numerical data may be summarized according to several characteristics.
We will work with each in turn and then all of them together.
Measures of Location
- Measures of central tendency: Mean; Median; Mode
- Measures of noncentral tendency - Quantiles
Quartiles; Quintiles; Percentiles
Measures of Dispersion
- Range
- Interquartile range
- Variance
- Standard Deviation
- Coefficient of Variation
Measures of Shape
Skewness
Five Number Summary (Boxplot)
Scatter Plots and Correlation
Measures of Location
Measures of Central Tendency
• Measures of location place the data set on the scale of real numbers.
• Measures of central tendency (i.e., central location) help find the approximate
center of the dataset.
• These include the mean, the median, and the mode.
The Mean: The mean of a set of data is the sum of all values in a data set
divided by the number of values in the set.
1. Arithmetic mean: the mean of a set of n numbers x1, x2, …, xn is denoted by x̄ (x bar) and defined as:
$$ \bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{1}{n}\sum_{i=1}^{n} x_i $$
Example-1: Eight measurements of the diameter of a cylinder were recorded by a
scientist as 38.8, 40.9, 39.2, 39.7, 40.2, 39.2, 39.8, and 40.6 millimeters. Find the
arithmetic mean.
Solution:
mean = (38.8+40.9+39.2+39.7+40.2+39.2+39.8+40.6)/8 = 318.4/8 = 39.8 mm
Example-2: For the data 1, 2, 2, 3, 51, calculate the mean. Note: n = 5 (five
observations).
Solution: ∑Xi = 1 + 2 + 2 + 3 + 51 = 59
X̄ = 59 / 5 = 11.8
Here we see that the mean is affected by extreme values: four of the five
observations are at most 3, yet the mean is 11.8.
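The two examples above can be checked with a few lines of Python. This is only a minimal sketch; the function name `arithmetic_mean` and the variable names are illustrative and not part of the lecture.

```python
def arithmetic_mean(values):
    """Sum of the values divided by the number of values."""
    return sum(values) / len(values)

diameters_mm = [38.8, 40.9, 39.2, 39.7, 40.2, 39.2, 39.8, 40.6]  # Example-1
with_outlier = [1, 2, 2, 3, 51]                                   # Example-2

print(round(arithmetic_mean(diameters_mm), 2))  # 39.8
print(arithmetic_mean(with_outlier))            # 11.8
# Dropping the extreme value 51 shows how strongly it pulls the mean upward.
print(arithmetic_mean(with_outlier[:-1]))       # 2.0
```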
2. Weighted mean: Sometimes we associate with the numbers x1, x2, …, xn certain
weighting factors (weights) w1, w2, …, wn. The weighted mean is then:
$$ \bar{x} = \frac{w_1 x_1 + w_2 x_2 + \cdots + w_n x_n}{w_1 + w_2 + \cdots + w_n} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i} $$
If the numbers x1, x2, …, xn occur f1, f2, …, fn times respectively, then the mean
is:
$$ \bar{x} = \frac{f_1 x_1 + f_2 x_2 + \cdots + f_n x_n}{f_1 + f_2 + \cdots + f_n} = \frac{\sum_{i=1}^{n} f_i x_i}{\sum_{i=1}^{n} f_i} $$
Example-1: The marks of a student were as follows: math (85), statistics (92),
computer (90), and English (88). The weights for these subjects were 4, 2, 2, and 3
respectively. Find the weighted mean mark. (Ans. X̄ = 88)
Example-2: If 5, 8, 6, and 2 occur with frequencies 3, 2, 4, and 1, find the mean.
(Ans. X̄ = 5.7)
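Both answers can be reproduced with one helper, since a frequency-based mean is simply a weighted mean whose weights are the frequencies. The sketch below is illustrative; `weighted_mean` is a name chosen here, not one used in the lecture.

```python
def weighted_mean(values, weights):
    """Sum of w_i * x_i divided by the sum of the weights."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    return sum(w * x for x, w in zip(values, weights)) / sum(weights)

# Example-1: marks weighted by subject weight
marks = [85, 92, 90, 88]
subject_weights = [4, 2, 2, 3]
print(weighted_mean(marks, subject_weights))  # 88.0

# Example-2: values weighted by how often they occur
numbers = [5, 8, 6, 2]
frequencies = [3, 2, 4, 1]
print(weighted_mean(numbers, frequencies))    # 5.7
```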
The Median: The median is the middle value of the ordered data.
➢ To get the median, we must first rearrange the data into an ordered array (in
ascending or descending order). Generally, we order the data from the
lowest value to the highest value.
➢ Therefore, the median is the data value such that half of the observations
are larger and half are smaller. It is also the 50th percentile (we will be
learning about percentiles in a bit).
To find the median:
• Case 1: When the data is a simple list of numbers arranged in ascending order,
the median is the middle value (if there is an odd number of values) or the
average of the middle two values (if there is an even number of values).
Example-1: 43, 47, 48, 50, 51, 51, 75; median = 50
Example-2: 1, 2, 2, 2, 3, 3, 4, 6; median = (2+3)/2 = 2.5
o The mean and the median are unique for a given set of data. There will be
exactly one mean and one median.
o Unlike the mean, the median is not affected by extreme values.
Q: What happens to the median if we change the 75 in Example-1 to 7500?
Ans: The median will still be 50.
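The Case 1 rule and the question about replacing 75 with 7500 can be tried directly. This is a small sketch of the rule as stated above; the function name `median` is chosen here for illustration.

```python
def median(values):
    """Middle value of the ordered data (average of the two middle values if n is even)."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

print(median([43, 47, 48, 50, 51, 51, 75]))    # 50
print(median([1, 2, 2, 2, 3, 3, 4, 6]))        # 2.5
# Replacing 75 by 7500 changes the mean a lot, but not the median.
print(median([43, 47, 48, 50, 51, 51, 7500]))  # 50
```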
• Case 2: When the data is arranged in a frequency distribution, calculate the
cumulative frequency and use the following formula:
$$ \text{Median} = L_1 + \frac{\frac{N}{2} - F}{f_i} \cdot C $$
Where:
• L1 = lower boundary of the median class.
• F = cumulative frequency of the class before the median class.
• fi = frequency of the median class.
• C = length of the classes.
• N = total number of observations (the sum of the frequencies).
• The Median Class is the interval in which the middle item (N/2) lies.
Example: Find the median for the frequency distribution shown below:
[Table: frequency distribution with the median class marked]
Solution: N/2 = 50
The Mode: The mode is the most frequent value.
Example-1: Find the mode of each data set.
A: 3, 5, 2, 6, 5, 9, 5, 2, 8, 6; mode = 5
B: 51, 48, 59, 49, 47; no mode (no value repeats)
Example-2: 5, 5, 5, 6, 8, 10, 10, 10.
The modes are 5 and 10.
There are two modes. This is a bi-modal dataset.
▪ The mode is different from the mean and the median in that those measures
always exist and are always unique. For any numeric data set there will be one
mean and one median.
▪ The mode may not exist. (Example-1 B)
▪ The mode may not be unique. (Example-2)
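The three situations (one mode, no mode, two modes) can be sketched as follows. The convention that a data set has no mode when no value repeats is taken from Example-1 B; the function name `modes` is illustrative.

```python
from collections import Counter

def modes(values):
    """Return every most-frequent value, or an empty list if no value repeats."""
    counts = Counter(values)
    highest = max(counts.values())
    if highest == 1:          # every value occurs once: no mode
        return []
    return sorted(v for v, c in counts.items() if c == highest)

print(modes([3, 5, 2, 6, 5, 9, 5, 2, 8, 6]))  # [5]
print(modes([51, 48, 59, 49, 47]))            # []
print(modes([5, 5, 5, 6, 8, 10, 10, 10]))     # [5, 10]
```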
Finding the mode for a frequency distribution:
$$ \text{Mode} = L_1 + \frac{D_1}{D_1 + D_2} \cdot C $$
Where:
▪ Mode class = the class that has the largest frequency.
▪ L1 = lower boundary of the mode class.
▪ D1 = the difference between the frequency of the mode class and the frequency of the previous class.
▪ D2 = the difference between the frequency of the mode class and the frequency of the next class.
▪ C = class length.
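The quantities defined above combine in the usual grouped-data estimate, mode ≈ L1 + D1/(D1 + D2) · C. The sketch below assumes that formula; the function name and the input numbers are hypothetical and are not taken from the lecture's table.

```python
def grouped_mode(lower_boundary, f_modal, f_previous, f_next, class_length):
    """Estimate the mode of grouped data from the modal class and its two neighbours."""
    d1 = f_modal - f_previous   # D1: modal frequency minus previous class frequency
    d2 = f_modal - f_next       # D2: modal frequency minus next class frequency
    if d1 + d2 == 0:
        raise ValueError("modal class does not stand out from its neighbours")
    return lower_boundary + d1 / (d1 + d2) * class_length

# Hypothetical modal class 60-65 with frequency 12; neighbouring classes have 8 and 10.
estimate = grouped_mode(lower_boundary=60, f_modal=12, f_previous=8, f_next=10, class_length=5)
print(round(estimate, 2))
```

Because D1 is larger than D2 in this made-up case, the estimate lands in the upper part of the modal class, closer to the neighbour with the higher frequency.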
Example: Find the mode for frequency distribution shown below:
Solution:
Mode can also be obtained from a histogram.
• Step 1: Identify the modal class and the bar representing it.
• Step 2: Draw two cross lines as shown in the diagram.
• Step 3: Drop a perpendicular from the intersection of the two lines until it
touches the horizontal axis.
• Step 4: Read the mode from the horizontal axis.
Measures of Noncentral Tendency
➢ Measures of non-central location are also used to summarize a set of data.
➢ Examples of commonly used quantiles:
• Quartiles
• Quintiles
• Deciles
• Percentiles
Quartiles
• Quartiles split a set of ordered data into four parts.
• Imagine cutting a chocolate bar into four equal pieces… How many cuts
would you make? (yes, 3)
• Q1 is the First Quartile: 25% of the observations are smaller than Q1 and 75% of
the observations are larger.
• Q2 is the Second Quartile: 50% of the observations are smaller than Q2 and 50%
of the observations are larger. Same as the Median. It is also the 50th percentile.
• Q3 is the Third Quartile: 75% of the observations are smaller than Q3 and 25% of
the observations are larger.
• A quartile, like the median, either takes the value of one of the observations, or
the value halfway between two observations.
• If n/4 is an integer, the first quartile (Q1) has the value halfway between the
(n/4)th observation and the next observation.
• If n/4 is not an integer, the first quartile has the value of the observation
whose position corresponds to the next highest integer.
Example-1: Computer Sales (n = 12 salespeople)
Original Data: 3, 10, 2, 5, 9, 8, 7, 12, 10, 0, 4, 6
Compute the mean, median, mode, quartiles.
First order the data:
0, 2, 3, 4, 5, 6, 7, 8, 9, 10, 10, 12
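∑Xi = 76
X̄ = 76 / 12 = 6.33 computers sold
Median = (6 + 7)/2 = 6.5 computers
Mode = 10 computers
Q1 = 3.5 computers (n/4 = 3 is an integer, so Q1 is halfway between the 3rd and 4th observations)
Q3 = 9.5 computers (3n/4 = 9 is an integer, so Q3 is halfway between the 9th and 10th observations)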
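The positional rule used in this chapter (halfway between two observations when the position is an integer, otherwise the observation at the next highest integer position) can be sketched as below. Statistical software often uses other interpolation conventions, so library results may differ slightly. The function name `quantile` and its numerator/denominator arguments are choices made for this illustration.

```python
def quantile(values, numerator, denominator):
    """Quantile at fraction numerator/denominator (0 < fraction < 1), chapter's positional rule."""
    ordered = sorted(values)
    n = len(ordered)
    scaled = numerator * n
    if scaled % denominator == 0:
        k = scaled // denominator            # 1-based position is an integer
        return (ordered[k - 1] + ordered[k]) / 2
    k = scaled // denominator + 1            # otherwise round the position up
    return ordered[k - 1]

sales = [3, 10, 2, 5, 9, 8, 7, 12, 10, 0, 4, 6]
print(quantile(sales, 1, 4))  # 3.5  (Q1)
print(quantile(sales, 2, 4))  # 6.5  (median)
print(quantile(sales, 3, 4))  # 9.5  (Q3)
```

The same function gives deciles or percentiles by changing the fraction, for example `quantile(sales, 9, 10)` for the 9th decile.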
Other Quantiles
• Similar to what we just learned about quartiles, where 3 quartiles split the data
into 4 equal parts,
• There are 9 deciles dividing the distribution into 10 equal portions
(tenths).
• There are four quintiles dividing the population into 5 equal portions.
• There are 99 percentiles dividing the data set into 100 equal portions.
• In all these cases, the convention is the same. The point, be it a quartile, decile,
or percentile, takes the value of one of the observations or it has a value
halfway between two adjacent observations. It is never necessary to split the
difference between two observations more finely.
➢ Percentiles are used in analyzing the results of standardized exams. For
instance, a score of 40 on a standardized test might seem like a terrible grade,
but if it is the 99th percentile, don’t worry about telling your parents.
➢ Which percentile is Q1? Q2 (the median)? Q3?
Example-2: For the following data:
1, 1, 2, 2, 2, 2, 3, 3, 4, 4, 5, 5, 6, 7, 8, 10
Compute the mean, median, mode, quartiles.
Solution: n=16
1 1 2 2 ┋ 2 2 3 3 ┋ 4 4 5 5 ┋ 6 7 8 10
Mean = 65/16 = 4.06
Median = 3.5
Mode = 2
Q1 = 2, Q2 = Median = 3.5, Q3 = 5.5
Example-3: Data: number of absences
0, 5, 3, 2, 1, 2, 4, 3, 1, 0, 0, 6, 12
Compute the mean, median, mode, quartiles.
Solution: First order the data (n=13):
0, 0, 0, 1, 1, 2, 2, 3, 3, 4, 5, 6, 12
Mean = 39/13 = 3.0 absences
Median = 2 absences
Mode = 0 absences
Q1 = 1 absence (n/4 = 3.25 is not an integer, so Q1 is the 4th observation)
Q3 = 4 absences (3n/4 = 9.75 is not an integer, so Q3 is the 10th observation)
Example: For the frequency distribution table, find:
1- First quartile (Q1).
2- 6th decile.
3- 80th percentile.
$$ Q_1 = L_1 + \frac{\frac{\Sigma f}{4} - F_{Q_1 - 1}}{f_{Q_1}} \cdot c $$
L1 = lower boundary of the first quartile class
F(Q1-1) = cumulative frequency of the class before the first quartile class
f(Q1) = frequency of the first quartile class
c = length of the classes
Note: the First Quartile Class is the interval in which the first quartile item (N/4) lies.
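Solution:
1- Q1:
$$ \frac{N}{4} = \frac{\Sigma f}{4} = \frac{100}{4} = 25 $$
The first quartile class is the class that contains the 25th item.
$$ Q_1 = 66 + \frac{25 - 23}{42} \cdot 3 = 66.143 $$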
2- 6th decile:
$$ \frac{6}{10} \cdot 100 = 60 $$
The 6th decile class is the class that contains the 60th item.
$$ \text{6th decile} = 66 + \frac{60 - 23}{42} \cdot 3 = 68.643 $$
3- 80th percentile:
$$ \frac{80}{100} \cdot 100 = 80 $$
The 80th percentile class is the class that contains the 80th item.
$$ \text{80th percentile} = 69 + \frac{80 - 65}{27} \cdot 3 = 70.667 $$
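All three results follow one pattern: locate the position of the item (a fraction of N), then interpolate inside the class that contains it. The sketch below reuses only the class values that appear in the worked solution (lower boundaries 66 and 69, cumulative frequencies 23 and 65, class frequencies 42 and 27, class length 3); the function name `grouped_quantile` is illustrative. Using position N/2 in the same function gives the grouped median.

```python
def grouped_quantile(position, lower_boundary, cum_freq_before, class_freq, class_length):
    """Interpolate a quantile inside the class that contains the given item position."""
    return lower_boundary + (position - cum_freq_before) / class_freq * class_length

N = 100  # total frequency in the worked example

q1 = grouped_quantile(N / 4, 66, 23, 42, 3)          # position 25
d6 = grouped_quantile(6 * N / 10, 66, 23, 42, 3)     # position 60
p80 = grouped_quantile(80 * N / 100, 69, 65, 27, 3)  # position 80

print(round(q1, 3), round(d6, 3), round(p80, 3))  # 66.143 68.643 70.667
```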
Measures of Dispersion
• Dispersion is the amount of spread, or variability, in a set of data.
• Why do we need to look at measures of dispersion?
• Consider this example:
A company is about to buy computer chips that must have an average life of 10
years. The company has a choice of two suppliers. Whose chips should they
buy? They take a sample of 10 chips from each of the suppliers and test them.
See the data.
Supplier A chips    Supplier B chips
(life in years)     (life in years)
11                  170
11                  1
10                  1
10                  160
11                  2
11                  150
11                  150
11                  170
10                  2
12                  140
Summary             Supplier A      Supplier B
Mean                10.8 years      94.6 years
Median              11 years        145 years
s                   0.63 years      80.6 years
Range               2 years         169 years
We see that supplier B’s chips have a longer average life.
However, what if the company offers a 3-year warranty?
Then, computers manufactured using the chips from supplier A will have no returns, while using
supplier B will result in 4/10 or 40% returns.
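The summary figures for the two suppliers can be recomputed with Python's standard `statistics` module. This is a sketch for checking the numbers; the helper name `summarize` and the `warranty_years` parameter are introduced here for illustration.

```python
import statistics

supplier_a = [11, 11, 10, 10, 11, 11, 11, 11, 10, 12]
supplier_b = [170, 1, 1, 160, 2, 150, 150, 170, 2, 140]

def summarize(lifetimes, warranty_years=3):
    """Location, dispersion, and the share of chips failing before the warranty ends."""
    failures = sum(1 for years in lifetimes if years < warranty_years)
    return {
        "mean": statistics.mean(lifetimes),
        "median": statistics.median(lifetimes),
        "s": statistics.stdev(lifetimes),          # sample standard deviation
        "range": max(lifetimes) - min(lifetimes),
        "return_rate": failures / len(lifetimes),
    }

summary_a = summarize(supplier_a)
summary_b = summarize(supplier_b)
print(summary_a)
print(summary_b)
```

Supplier B wins on the mean and median, but its standard deviation and range are far larger, which is what the 40% return rate reflects.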
• We will study these five measures of dispersion
▪ Range
▪ Interquartile Range
▪ Standard Deviation
▪ Variance
▪ Coefficient of Variation
▪ Range: The range is the difference between the largest and smallest
values in a set of data. (Range = Largest Value – Smallest Value)
Example: The mean of the data in each of sections A and B is 19.
Determine the range for both sections.
Section A : 6 9 11 13 15 21 23 28 29 35
Section B: 15 16 16 17 18 19 20 21 23 25
The range for Section A = 35 - 6 = 29
The range for Section B = 25 – 15 = 10
• The range is simple to use and to explain to others.
• One problem with the range is that it is influenced by extreme values at
either end.
▪ Inter-Quartile Range (IQR): The Inter-Quartile Range is the difference between
the values of third and first quartile. (IQR = Q3 – Q1)
• Example (n = 15):
0, 0, 2, 3, 4, 7, 9, 12, 17, 18, 20, 22, 45, 56, 98
Q1 = 3, Q3 = 22
IQR = 22 – 3 = 19 (Range = 98)
• This is basically the range of the central 50% of the observations in the
distribution.
• Problem: The interquartile range does not take into account the
variability of the total data (only the central 50%). We are “throwing
out” half of the data.
▪ Standard Deviation
• The standard deviation, s, measures a kind of “average” deviation about the
mean. It is not really the “average” deviation, even though we may think of it
that way.
• Why can’t we simply compute the average deviation about the mean, if that’s
what we want?
$$ \frac{\sum_{i=1}^{n} (X_i - \bar{X})}{n} $$
• If you take a simple mean, and then add up the deviations about the mean, as
above, this sum will be equal to 0. Therefore, a measure of “average deviation”
will not work.
• Instead, we use:
$$ s = \sqrt{\frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}} $$
• This is the “definitional formula” for standard deviation.
• The standard deviation has lots of nice properties, including:
o By squaring the deviation, we eliminate the problem of the deviations summing
to zero.
o In addition, this sum is a minimum. No other value subtracted from X and
squared will result in a smaller sum of the deviation squared. This is called the
“least squares property.”
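Both properties can be seen numerically. The sketch below uses the Section B data from the range example (mean 19): the deviations about the mean add up to zero, and the sum of squared deviations is smaller about the mean than about nearby values. The helper name `sum_of_squares_about` is illustrative.

```python
section_b = [15, 16, 16, 17, 18, 19, 20, 21, 23, 25]
mean = sum(section_b) / len(section_b)

# Deviations about the mean cancel out, so their plain average is useless.
deviations = [x - mean for x in section_b]
print(sum(deviations))  # 0.0

def sum_of_squares_about(values, center):
    """Sum of squared deviations of the values about an arbitrary center."""
    return sum((x - center) ** 2 for x in values)

# Least squares property: the sum is smallest when the center is the mean.
about_mean = sum_of_squares_about(section_b, mean)
about_18 = sum_of_squares_about(section_b, 18)
about_20 = sum_of_squares_about(section_b, 20)
print(about_mean, about_18, about_20)  # 96.0 106 106
```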
Estimation of Variance and Standard Deviation from a Sample
When we calculate the results from a sample (i.e., a part of the population) we
do not usually know the population mean, so we must find a way to use the
sample mean, which we can calculate.
Note that the population standard deviation σ is:
$$ \sigma = \sqrt{\frac{\sum_{i=1}^{N} (X_i - \mu)^2}{N}} $$
The sample standard deviation S is:
$$ S = \sqrt{\frac{\sum_{i=1}^{N} (x_i - \bar{x})^2}{N - 1}} $$
For data organized in a frequency distribution table:
$$ S = \sqrt{\frac{\sum_{i=1}^{N} f_i (x_i - \bar{x})^2}{\sum_{i=1}^{N} f_i - 1}} $$
where xi represents the class mid-point in the frequency distribution table.
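Formulas for Faster Calculation
$$ S = \sqrt{\frac{\sum x_i^2 - \frac{\left(\sum x_i\right)^2}{N}}{N - 1}} $$
For data organized in a frequency distribution table:
$$ S = \sqrt{\frac{\sum f_i x_i^2 - \frac{\left(\sum f_i x_i\right)^2}{\sum f_i}}{\sum f_i - 1}} $$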
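The faster formula is an algebraic rearrangement of the definitional one, so both should give the same result up to rounding. The sketch below checks this on the Section A data from the range example; the function names are chosen here for illustration.

```python
import math

def stdev_definitional(values):
    """Sample standard deviation from squared deviations about the mean."""
    n = len(values)
    mean = sum(values) / n
    return math.sqrt(sum((x - mean) ** 2 for x in values) / (n - 1))

def stdev_computational(values):
    """Sample standard deviation from the sum of squares and the squared sum."""
    n = len(values)
    sum_x = sum(values)
    sum_x2 = sum(x * x for x in values)
    return math.sqrt((sum_x2 - sum_x ** 2 / n) / (n - 1))

section_a = [6, 9, 11, 13, 15, 21, 23, 28, 29, 35]
s_def = stdev_definitional(section_a)
s_comp = stdev_computational(section_a)
print(round(s_def, 2), round(s_comp, 2))
print(math.isclose(s_def, s_comp))  # True
```

The computational form needs only two running totals, which is why it is convenient by hand; in floating-point code it can lose precision when the mean is very large relative to the spread.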
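Example: Two data sets, X and Y. Which of the two data sets has greater
variability? Calculate the standard deviation for each.
We note that both sets of data have the same mean:
$$ \bar{X} = 3, \qquad \bar{Y} = 3 $$
$$ S_X = \sqrt{\frac{10}{4}} = 1.58 $$
$$ S_Y = \sqrt{\frac{80}{4}} = 4.47 $$
Y therefore has the greater variability.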
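Variance: The variance, s2, is the standard deviation (s) squared.
Conversely, s = √variance.
$$ S^2 = \frac{\sum_{i=1}^{N} (x_i - \bar{x})^2}{N - 1} \quad \text{or} \quad S^2 = \frac{\sum x_i^2 - \frac{\left(\sum x_i\right)^2}{N}}{N - 1} $$
For data organized in a frequency distribution table:
$$ S^2 = \frac{\sum_{i=1}^{N} f_i (x_i - \bar{x})^2}{\sum_{i=1}^{N} f_i - 1} \quad \text{or} \quad S^2 = \frac{\sum f_i x_i^2 - \frac{\left(\sum f_i x_i\right)^2}{\sum f_i}}{\sum f_i - 1} $$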
Coefficient of Variation: A dimensionless quantity, the coefficient of variation is
the ratio between the standard deviation and the mean for the same set of data,
expressed as a percentage. This can be either (σ / μ) or (s / 𝑥ҧ ), whichever is
appropriate, multiplied by 100%.
$$ COV = \frac{\sigma}{\mu} \times 100 \quad \text{(for population)} $$
$$ COV = \frac{S}{\bar{x}} \times 100 \quad \text{(for sample)} $$
• The problem with s2 and s is that they are both, like the mean, in the “original”
units.
• This makes it difficult to compare the variability of two data sets that are in
different units or where the magnitude of the numbers is very different in the
two sets. For example,
• Suppose you wish to compare two stocks and one is in dollars and the other
is in ID; if you want to know which one is more volatile, you should use the
coefficient of variation.
• It is also not appropriate to compare two stocks of vastly different prices
even if both are in the same units.
• The standard deviation for a stock that sells for around $300 is going to be
different from one with a price of around $25.
• The coefficient of variation will be a better measure of dispersion in these cases
than the standard deviation.
COV is in terms of a percent. What we are in effect calculating is what percent
of the sample mean is the standard deviation. If COV is 100%, this indicates that
the sample mean is equal to the sample standard deviation. This would
demonstrate that there is a great deal of variability in the data set. 200% would
obviously be even worse.
Example: Stock Prices
Which stock is more volatile?
Closing prices over the last 8 months:
Month    Stock A    Stock B
JAN      $1.00      $180
FEB      1.50       175
MAR      1.90       182
APR      .60        186
MAY      3.00       188
JUN      .40        190
JUL      5.00       200
AUG      .20        210
Mean     $1.70      $188.88
s2       2.61       128.41
s        $1.62      $11.33
COVA = (1.62/1.7) x 100% = 95.3%
COVB = (11.33/188.88) x 100% = 6.0%
Answer: The standard deviation of B is higher than for A, but A is more volatile.
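The comparison can be reproduced from the closing prices. This is a sketch using the sample formula (s divided by the mean); the function name `cov_percent` is illustrative. Working from unrounded values gives roughly 95.1% for stock A, slightly below the 95.3% obtained from the rounded s = 1.62, and roughly 6.0% for stock B.

```python
import statistics

stock_a = [1.00, 1.50, 1.90, 0.60, 3.00, 0.40, 5.00, 0.20]
stock_b = [180, 175, 182, 186, 188, 190, 200, 210]

def cov_percent(values):
    """Sample coefficient of variation: standard deviation as a percentage of the mean."""
    return statistics.stdev(values) / statistics.mean(values) * 100

cov_a = cov_percent(stock_a)
cov_b = cov_percent(stock_b)
print(f"COV A = {cov_a:.1f}%, COV B = {cov_b:.1f}%")

# B has the larger standard deviation in dollars, but A varies far more relative to its price.
print(statistics.stdev(stock_b) > statistics.stdev(stock_a))  # True
print(cov_a > cov_b)                                          # True
```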
Example: Test Scores
Data (n=10): 0, 0, 40, 50, 50, 60, 70, 90, 100, 100
Compute the mean, median, mode, quartiles (Q1, Q2, Q3), range,
interquartile range, variance, standard deviation, and coefficient of variation.
We shall refer to all these as the descriptive (or summary) statistics for a set
of data.
Solution: First order the data:
0, 0, 40, 50, 50 ┋ 60, 70, 90, 100, 100
• Mean: ∑Xi = 560 and n = 10, so X̄ = 560/10 = 56
• Median = Q2 = 55
• Q1 = 40; Q3 = 90
• Mode = 0, 50, 100
• Range = 100 – 0 = 100
• IQR = 90 – 40 = 50
• s2 = 11,840/9 = 1315.6
• s = √1315.6 = 36.27
• COV = (36.27/56) x 100% = 64.8%
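The full set of descriptive statistics for the test scores can be gathered in one pass. The sketch below uses Python's standard `statistics` module (`multimode` needs Python 3.8 or later) together with a small `quartile` helper, written here for illustration, that follows this chapter's positional rule rather than a library interpolation method.

```python
import statistics

def quartile(values, k):
    """k-th quartile (k = 1, 2, 3) using the chapter's positional rule."""
    ordered = sorted(values)
    n = len(ordered)
    if (k * n) % 4 == 0:
        pos = k * n // 4                      # integer position: average with the next value
        return (ordered[pos - 1] + ordered[pos]) / 2
    pos = k * n // 4 + 1                      # otherwise take the next highest position
    return ordered[pos - 1]

scores = [0, 0, 40, 50, 50, 60, 70, 90, 100, 100]
mean = statistics.mean(scores)
s = statistics.stdev(scores)

summary = {
    "mean": mean,
    "median": statistics.median(scores),
    "modes": statistics.multimode(scores),
    "Q1": quartile(scores, 1),
    "Q3": quartile(scores, 3),
    "range": max(scores) - min(scores),
    "IQR": quartile(scores, 3) - quartile(scores, 1),
    "variance": round(statistics.variance(scores), 2),
    "std_dev": round(s, 2),
    "COV_percent": round(s / mean * 100, 1),
}
for name, value in summary.items():
    print(f"{name}: {value}")
```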
Example: Compute the variance and standard deviation for the
data shown in the frequency distribution table.
Solution:
$$ S^2 = \frac{\sum f_i x_i^2 - \frac{\left(\sum f_i x_i\right)^2}{\sum f_i}}{\sum f_i - 1} $$
$$ S^2 = \frac{197306 - \frac{2537.5^2}{33}}{33 - 1} = 68.37 $$
$$ S = \sqrt{68.37} = 8.27 $$
Measures of Shape
• A third important property of data – after location and dispersion - is its
shape.
• Shape can be described by the degree of asymmetry (i.e., skewness):
• mean > median: positive or right-skewness
• mean = median: symmetric or zero-skewness
• mean < median: negative or left-skewness
• Positive skewness can arise when the mean is increased by some unusually
high values.
[Figure: positive skew]
• Symmetric or zero-skewness can arise when the values of the mean, median,
and mode are equal. There is an equal number of values on both sides of the
mean, and values at equal distances from the mean occur with equal frequency.
[Figure: symmetric or zero-skewness]
• Negative skewness can arise when the mean is decreased by some unusually
low values.
[Figure: negative skew]
Example: The following data represent the number of hours to complete a
task. Describe the shape of the data.
Data (for n=12 employees):
2 3 8 ┋ 8 9 10 ┋ 10 12 15 ┋ 18 22 63
(The employee at 63 hours took a VERY long time!)
X̄ = 180/12 = 15 hours
Median = 10 hours
The mean is greater than the median. The (extremely slow) employee who took
63 hours to complete the task skewed the entire distribution to the right.
Five Number Summary
• When examining a distribution for shape, sometimes the five number summary is
useful:
Smallest | Q1 | Median | Q3 | Largest
• Example: The data from the previous example.
Data (for n=12 employees):
2 3 8 ┋ 8 9 10 ┋ 10 12 15 ┋ 18 22 63
X̄ = 15, Q1 = 8, median = 10, Q3 = 16.5
5-number summary: 2 | 8 | 10 | 16.5 | 63
This data is right-skewed.
In right-skewed distributions, the distance from Q3 to Xlargest (16.5 to 63) is
significantly greater than the distance from Xsmallest to Q1 (2 to 8).
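The five-number summary and the two skewness checks used above (mean versus median, and the lengths of the two tails) can be sketched as follows. The `quartile` helper follows this chapter's positional rule; both function names are illustrative.

```python
def quartile(values, k):
    """k-th quartile (k = 1, 2, 3) using the chapter's positional rule."""
    ordered = sorted(values)
    n = len(ordered)
    if (k * n) % 4 == 0:
        pos = k * n // 4
        return (ordered[pos - 1] + ordered[pos]) / 2
    pos = k * n // 4 + 1
    return ordered[pos - 1]

def five_number_summary(values):
    """Smallest, Q1, median, Q3, largest."""
    ordered = sorted(values)
    return (ordered[0], quartile(ordered, 1), quartile(ordered, 2),
            quartile(ordered, 3), ordered[-1])

hours = [2, 3, 8, 8, 9, 10, 10, 12, 15, 18, 22, 63]
smallest, q1, med, q3, largest = five_number_summary(hours)
print(smallest, q1, med, q3, largest)  # 2 8.0 10.0 16.5 63

mean = sum(hours) / len(hours)
upper_tail = largest - q3   # 46.5
lower_tail = q1 - smallest  # 6.0
# Mean above the median and a much longer upper tail both point to right skew.
print(mean > med, upper_tail > lower_tail)  # True True
```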
Boxplot: The boxplot is a way to graphically portray a distribution of data by
means of its five-number summary.
• Vertical line drawn within the box is the median
• Vertical line at the left side of box is Q1
• Vertical line at the right side of box is Q3
• Line on left connects left side of box with Xsmallest (lower 25% of data)
• Line on right connects right side of box with Xlargest (upper 25% of data)
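[Figure: boxplot for the previous example]
A boxplot can be drawn horizontally or vertically.
• A “bell-shaped” symmetric data distribution would
look like this:
[Figure: boxplot of a symmetric distribution, with the median centered in the box and whiskers of equal length]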
Scatter Plots and Correlation
▪ A scatter plot is a graph of the ordered pairs (x, y) of numbers
consisting of the independent variable x and the dependent
variable y.
• A scatter plot is used to determine whether a relationship exists
between the two variables.
Analyzing the Scatter Plot
1. A positive linear relationship exists when the points fall
approximately in an ascending straight line from left to right
and both the x and y values increase at the same time.
2. A negative linear relationship exists when the points fall
approximately in a descending straight line from left to right.
3. A nonlinear relationship exists when the points fall in a
curved line.
4. It is said that no relationship exists when there is no
discernable pattern of the points.
Problem: The table shows the depth (in cm) of the foundations of a large building.
Construct the frequency, relative frequency, cumulative frequency, and cumulative
relative frequency distributions, and find:
a) the number of foundations with depth less than 70 cm.
b) the percentage of foundations with depth less than 70 cm.
c) the number of foundations with depth more than 60 cm.
d) the number of foundations with depth more than 60 cm and less than 80 cm.
Depth class (cm)   Frequency (f)
50-55 6
55-60 4
60-65 8
65-70 8
70-75 6
75-80 6
80-85 12
total 50
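One way to approach this problem is to build the four distributions in a loop and then add up the frequencies of the classes that satisfy each condition. The sketch below assumes that every threshold in parts (a) to (d) falls exactly on a class limit, which holds for this table; the variable names are illustrative.

```python
# (lower limit, upper limit, frequency) for each depth class in cm
classes = [(50, 55, 6), (55, 60, 4), (60, 65, 8), (65, 70, 8),
           (70, 75, 6), (75, 80, 6), (80, 85, 12)]

total = sum(f for _, _, f in classes)

# Frequency, relative, cumulative, and cumulative relative frequency per class
rows = []
cumulative = 0
for lower, upper, f in classes:
    cumulative += f
    rows.append((f"{lower}-{upper}", f, f / total, cumulative, cumulative / total))

for label, f, rel, cum, cum_rel in rows:
    print(f"{label:>6}  f={f:<3} rel={rel:.2f}  cum={cum:<3} cum_rel={cum_rel:.2f}")

less_than_70 = sum(f for _, upper, f in classes if upper <= 70)                      # a)
percent_less_than_70 = 100 * less_than_70 / total                                    # b)
more_than_60 = sum(f for lower, _, f in classes if lower >= 60)                      # c)
between_60_and_80 = sum(f for lower, upper, f in classes if lower >= 60 and upper <= 80)  # d)

print(less_than_70, percent_less_than_70, more_than_60, between_60_and_80)  # 26 52.0 40 28
```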
Example: Draw the scatter plot for the following data of wet
bike accidents and then state the type of the relationship.