Chapter 2
Strategies for understanding
the meanings of Data
2-1. Organizing Data
Raw data are difficult to interpret, so they are organized into
a frequency distribution.
Each value is placed into a class (i.e. a quantitative or qualitative
category).
The frequency of a class is the number of data
values contained in it.
Class limits are the smallest and largest data values that can belong to a class.
Frequency Distribution is the organization of
raw data in table form, using classes and
frequencies.
Types of Frequency Distributions
Categorical Freq. Distribution: used for
data that is nominal or ordinal (discrete classes).
Grouped Freq. Distribution: used for
numerical data, where classes are more than
one unit wide (for large data ranges).
Ungrouped Freq. Distribution: numerical
data with narrow range (single value instead of
classes).
Categorical Frequency Distribution
Twenty-five army inductees were given a blood
test to determine their blood type.
Raw Data: A,B,B,AB,O,O,O,B,AB,B, B,B,O,A,O
A,O,O,O,AB,AB,A,O,B,A
Construct a frequency distribution for the data.
Categorical Frequency Distribution-
Steps to Take
Step 1: Draw a table containing Class; Tally
(i.e. Count); Frequency; Percent (optional).
Step 2: Tally the data.
Step 3: Count Tallies => Frequency
Step 4: Find percentage value of each class
(%=f/n*100) or Relative Freq (f/n).
Step 5: Calculate totals to confirm results.
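The five steps can be sketched in code. This is a minimal illustration using the blood-type data from the example; the variable names (`raw`, `freq`, `percent`) are assumptions made for the sketch, and `collections.Counter` stands in for the manual tally.

```python
from collections import Counter

# Raw blood-type data from the example (25 inductees).
raw = ("A,B,B,AB,O,O,O,B,AB,B,B,B,O,A,O,"
       "A,O,O,O,AB,AB,A,O,B,A").split(",")

n = len(raw)
freq = Counter(raw)                                   # Steps 2-3: tally, then count
percent = {c: 100 * f / n for c, f in freq.items()}   # Step 4: % = f/n * 100

for c in ("A", "B", "O", "AB"):
    print(f"{c:<4}{freq[c]:>3}{percent[c]:>6.0f}%")

# Step 5: the totals should equal n and 100%.
print("Total", sum(freq.values()), sum(percent.values()))
```

The frequencies must add up to n = 25; if they do not, a value was missed or counted twice during the tally.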
Categorical Frequency Distribution
Class   Tally         Frequency   Percent
A       IIIII         5           20%
B       IIIII II      7           28%
O       IIIII IIII    9           36%
AB      IIII          4           16%
Total                 25          100%
Grouped Frequency Distribution
Grouped frequency distributions are
used when the range of the data is large.
The smallest and largest possible data values
in a class are the lower and upper class
limits. Class boundaries separate the
classes.
To find a class boundary, average the upper
class limit of one class and the lower class
limit of the next class.
Grouped Frequency Distribution
The class width can be calculated by subtracting
◦ successive lower class limits (or boundaries)
◦ successive upper class limits (or boundaries)
◦ upper and lower class boundaries
The class midpoint Xm can be calculated by averaging
◦ upper and lower class limits (or boundaries)
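A short sketch of these calculations for two adjacent classes. The limits 100-104 and 105-109 are illustrative values chosen for the sketch, and the variable names are assumptions.

```python
# Two adjacent classes given by their class limits (illustrative values).
lower1, upper1 = 100, 104
lower2, upper2 = 105, 109

# Class boundary: average of one class's upper limit and the next class's lower limit.
boundary = (upper1 + lower2) / 2

# Lower boundary of the first class, found by stepping back half the gap between classes.
lower_boundary1 = lower1 - (lower2 - upper1) / 2

# Class width: difference of successive lower limits, or of the two boundaries of a class.
width_from_limits = lower2 - lower1
width_from_boundaries = boundary - lower_boundary1

# Class midpoint Xm: average of the limits (or, equivalently, of the boundaries).
midpoint_from_limits = (lower1 + upper1) / 2
midpoint_from_boundaries = (lower_boundary1 + boundary) / 2

print(boundary, width_from_limits, midpoint_from_limits)
```

Note that the width is not upper limit minus lower limit of the same class (104 - 100 = 4); that subtraction undercounts by one unit.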
The range (R).
It is the difference between the largest
and the smallest observation in the data
set.
The Width of the interval (w).
Class intervals should be of the same width.
If we want k intervals, then w is
$w \ge \dfrac{R}{k}$
where k is the number of classes.
Assume that the number of observations is n = 100;
then
$k = 1 + 3.322 \log_{10} 100 = 1 + 3.322(2) = 7.644 \approx 8$
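The formula for k (commonly known as Sturges' rule) and the width condition w ≥ R/k can be sketched as follows. The function names are hypothetical, and rounding k up to the next whole number is an assumption consistent with the example, where 7.6 becomes 8.

```python
import math

def number_of_classes(n):
    """k = 1 + 3.322 * log10(n), rounded up to a whole number of classes."""
    return math.ceil(1 + 3.322 * math.log10(n))

def minimum_width(data_range, k):
    """Smallest width w satisfying w >= R / k."""
    return data_range / k

k = number_of_classes(100)      # 1 + 3.322 * 2 = 7.644, rounded up
w_min = minimum_width(34, 7)    # a range of 34 split into 7 classes

print(k, round(w_min, 2))
```

In practice the width is then rounded up to a convenient whole number, so that k classes of width w cover the full range.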
Rules for Classes in Grouped Frequency
Distributions
1. There should be 5-20 classes.
2. The class width should be an odd
number.
3. The classes must be mutually exclusive.
4. The classes must be continuous.
5. The classes must be equal in width
(except in open-ended distributions).
Constructing a Grouped Frequency
Distribution
The following data represent the record
high temperatures for each of the 50
states. Construct a grouped frequency
distribution for the data using 7 classes.
112 100 127 120 134 118 105 110 109 112
110 118 117 116 118 122 114 114 105 109
107 112 114 115 118 117 118 122 106 110
116 108 110 121 113 120 119 111 104 111
120 113 120 117 105 110 118 112 114 114
Constructing a Grouped Frequency
Distribution
STEP 1 Determine the classes.
Find the class width by dividing the range by
the number of classes, 7.
Range = High – Low
= 134 – 100 = 34
Width = Range/7 = 34/7 ≈ 4.86, which rounds up to 5
Rounding Rule: Always round up.
Constructing a Grouped Frequency Distribution
STEP 4 Find the cumulative frequencies by
keeping a running total of the frequencies.
Class Limits   Class Boundaries   Frequency   Cumulative Frequency
100 - 104      99.5 - 104.5       2           0 + 2 = 2
105 - 109      104.5 - 109.5      8           2 + 8 = 10
110 - 114      109.5 - 114.5      18          10 + 18 = 28
115 - 119      114.5 - 119.5      13          28 + 13 = 41
120 - 124      119.5 - 124.5      7           41 + 7 = 48
125 - 129      124.5 - 129.5      1           48 + 1 = 49
130 - 134      129.5 - 134.5      1           49 + 1 = 50
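The whole construction can be reproduced in a few lines. This is a sketch under two assumptions: the data are whole numbers (so boundaries sit 0.5 below and above the limits), and the first class starts at the smallest observation. Variable names are illustrative.

```python
import math
from itertools import accumulate

# Record high temperatures for the 50 states (from the example).
temps = [112, 100, 127, 120, 134, 118, 105, 110, 109, 112,
         110, 118, 117, 116, 118, 122, 114, 114, 105, 109,
         107, 112, 114, 115, 118, 117, 118, 122, 106, 110,
         116, 108, 110, 121, 113, 120, 119, 111, 104, 111,
         120, 113, 120, 117, 105, 110, 118, 112, 114, 114]

k = 7
width = math.ceil((max(temps) - min(temps)) / k)      # 34/7, always rounded up
lowers = [min(temps) + i * width for i in range(k)]   # lower class limits
classes = [(lo, lo + width - 1) for lo in lowers]     # (lower, upper) limits
freqs = [sum(lo <= t <= hi for t in temps) for lo, hi in classes]
cum_freqs = list(accumulate(freqs))                   # running total

for (lo, hi), f, cf in zip(classes, freqs, cum_freqs):
    print(f"{lo}-{hi}   {lo - 0.5}-{hi + 0.5}   {f:>2}   {cf:>2}")
```

The last cumulative frequency must equal the number of observations (50), which is a quick check that every value landed in exactly one class.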
The Cumulative Frequency:
It can be computed by adding successive frequencies.
The Relative Frequency:
It is used to know proportions. It can be computed by
R.F. = Freq/n
The Cumulative Relative Frequency:
It can be computed by adding successive relative frequencies.
Example
Class      Mid-       Frequency   Cumulative   Relative          Cumulative Relative
interval   interval   Freq (f)    Frequency    Frequency (R.f)   Frequency
30 – 39    34.5       11          11           0.0582            0.0582
40 – 49    44.5       46          57           0.2434            0.3016
50 – 59    54.5       70          127          0.3704            0.6720
60 – 69    64.5       45          172          0.2381            0.9101
70 – 79    74.5       16          188          0.0847            0.9948
80 – 89    84.5       1           189          0.0053            1
Total                 189                      1
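The columns of this table can be derived from the frequency column alone. A minimal sketch follows; the variable names are assumptions.

```python
from itertools import accumulate

freqs = [11, 46, 70, 45, 16, 1]        # frequency column of the example
n = sum(freqs)

cum = list(accumulate(freqs))          # cumulative frequency
rel = [f / n for f in freqs]           # relative frequency = f / n
cum_rel = [c / n for c in cum]         # cumulative relative frequency

for f, c, r, cr in zip(freqs, cum, rel, cum_rel):
    print(f"{f:>3} {c:>4} {r:.4f} {cr:.4f}")
```

Here the cumulative relative frequency is computed from the cumulative counts rather than by adding rounded relative frequencies, so a late entry can differ in the fourth decimal (188/189 prints as 0.9947, while adding the rounded values gives 0.9948).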
Histograms, Frequency Polygons, and
Ogives
3 Most Common Graphs in Research
1. Histogram
2. Frequency Polygon
3. Cumulative Frequency Polygon
(Ogive)
Histograms, Frequency Polygons, and
Ogives
The histogram is a graph that displays the data by using
vertical bars of various heights to represent the
frequencies of the classes.
The class boundaries are represented on the horizontal
x-axis (this ensures bars are adjacent).
Frequencies are represented by vertical bars on y-axis
(equal in width).
Histograms
Construct a histogram to represent the
data for the record high temperatures for
each of the 50 states
112 100 127 120 134 118 105 110 109 112
110 118 117 116 118 122 114 114 105 109
107 112 114 115 118 117 118 122 106 110
116 108 110 121 113 120 119 111 104 111
120 113 120 117 105 110 118 112 114 114
Histograms
Histograms use class boundaries and
frequencies of the classes.
Class Limits   Class Boundaries   Frequency
100 - 104      99.5 - 104.5       2
105 - 109      104.5 - 109.5      8
110 - 114      109.5 - 114.5      18
115 - 119      114.5 - 119.5      13
120 - 124      119.5 - 124.5      7
125 - 129      124.5 - 129.5      1
130 - 134      129.5 - 134.5      1
Histograms, Frequency Polygons, and
Ogives
The frequency polygon is a graph that displays the data
by using lines that connect points plotted for the
frequencies at the class midpoints.
The frequencies are represented by the heights of the
points.
The class midpoints are represented on the horizontal
axis.
Frequency Polygons
Frequency polygons use class midpoints and
frequencies of the classes.
A frequency polygon
is anchored on the
x-axis before the first
class and after the
last class.
Ogives
The Ogive is a graph that represents the cumulative
frequencies for the classes in a frequency distribution.
The upper class boundaries are represented on the
horizontal axis.
Cumulative Frequencies are represented by the height
of the plotted points.
The ogive starts at a cumulative frequency of zero, plotted at the lower boundary of the first class.
Ogives
Ogives use upper class boundaries and
cumulative frequencies of the classes.
Class Boundaries   Cumulative Frequency
Less than 104.5    2
Less than 109.5    10
Less than 114.5    28
Less than 119.5    41
Less than 124.5    48
Less than 129.5    49
Less than 134.5    50
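For example, to find out how many record
high temperatures are less than 114.5 °F, locate 114.5 on the x-axis,
move up to the ogive, and read the cumulative frequency on the y-axis: 28.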
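Reading a value off an ogive amounts to looking up (or interpolating between) the plotted points. The sketch below uses the upper class boundaries and cumulative frequencies of the temperature example, plus the starting point at the first lower boundary; `ogive_value` is a hypothetical helper, and the straight-line interpolation between points is an assumption that mirrors how the graph is drawn.

```python
# (x, cumulative frequency) points of the ogive for the temperature data.
points = [(99.5, 0), (104.5, 2), (109.5, 10), (114.5, 28),
          (119.5, 41), (124.5, 48), (129.5, 49), (134.5, 50)]

def ogive_value(x):
    """Cumulative frequency read from the ogive at x (linear interpolation)."""
    if x <= points[0][0]:
        return 0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if x <= x1:
            return y0 + (x - x0) / (x1 - x0) * (y1 - y0)
    return points[-1][1]

print(ogive_value(114.5))   # value at a class boundary is exact
print(ogive_value(112))     # value inside a class is only an estimate
```

At a class boundary the reading is an exact count; between boundaries it is an estimate, because the grouped table no longer records where values fall inside a class.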
Histograms, Polygons & Ogives –
Summary Table
Graph Type      X-axis             Y-axis           Displays
Histogram       Boundaries         Frequency        Adjacent Bars
Freq. Polygon   Midpoints          Frequency        Dotted Lines
Ogive           Upper Boundaries   Cum. Frequency   Dotted Lines
Relative Frequency Graphs.
If proportions are used instead of frequencies,
the graphs are called relative frequency
graphs.
Relative frequency graphs are used when the
proportion of data values that fall into a given
class is more important than the actual number
of data values that fall into that class.
Construct a histogram, frequency polygon, and ogive using relative
frequencies for the distribution of the miles that 20 randomly selected
runners ran during a given week.
Divide each frequency by the total frequency to get the relative frequency.
Class Boundaries   Frequency   Relative Frequency
5.5 - 10.5         1           1/20 = 0.05
10.5 - 15.5        2           2/20 = 0.10
15.5 - 20.5        3           3/20 = 0.15
20.5 - 25.5        5           5/20 = 0.25
25.5 - 30.5        4           4/20 = 0.20
30.5 - 35.5        3           3/20 = 0.15
35.5 - 40.5        2           2/20 = 0.10
                   Σf = 20     Σrf = 1.00
Histograms
Use the class boundaries and the
relative frequencies of the classes.
Frequency Polygons
The following is a frequency distribution of
miles run per week by 20 selected runners.
Class Boundaries   Class Midpoints   Relative Frequency
5.5 - 10.5         8                 0.05
10.5 - 15.5        13                0.10
15.5 - 20.5        18                0.15
20.5 - 25.5        23                0.25
25.5 - 30.5        28                0.20
30.5 - 35.5        33                0.15
35.5 - 40.5        38                0.10
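The points of the relative frequency polygon follow directly from this table. A minimal sketch; the variable names are assumptions, and the two extra points at height zero implement the anchoring on the x-axis one class width before the first midpoint and after the last.

```python
boundaries = [(5.5, 10.5), (10.5, 15.5), (15.5, 20.5), (20.5, 25.5),
              (25.5, 30.5), (30.5, 35.5), (35.5, 40.5)]
rel_freqs = [0.05, 0.10, 0.15, 0.25, 0.20, 0.15, 0.10]

# Class midpoint = average of the lower and upper boundaries.
midpoints = [(lo + hi) / 2 for lo, hi in boundaries]
width = boundaries[0][1] - boundaries[0][0]

# Anchor the polygon on the x-axis before the first class and after the last.
points = ([(midpoints[0] - width, 0.0)]
          + list(zip(midpoints, rel_freqs))
          + [(midpoints[-1] + width, 0.0)])

for x, y in points:
    print(x, y)
```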
Frequency Polygons
Use the class midpoints and the
relative frequencies of the classes.
Ogives
The following is a frequency distribution of
miles run per week by 20 selected runners.
Class Boundaries   Frequency   Cumulative Frequency   Cum. Rel. Frequency
5.5 - 10.5         1           1                      1/20 = 0.05
10.5 - 15.5        2           3                      3/20 = 0.15
15.5 - 20.5        3           6                      6/20 = 0.30
20.5 - 25.5        5           11                     11/20 = 0.55
25.5 - 30.5        4           15                     15/20 = 0.75
30.5 - 35.5        3           18                     18/20 = 0.90
35.5 - 40.5        2           20                     20/20 = 1.00
                   Σf = 20
Ogives
Ogives use upper class boundaries and
cumulative frequencies of the classes.
Class Boundaries   Cum. Rel. Frequency
Less than 10.5     0.05
Less than 15.5     0.15
Less than 20.5     0.30
Less than 25.5     0.55
Less than 30.5     0.75
Less than 35.5     0.90
Less than 40.5     1.00
Ogives
Use the upper class boundaries and the
cumulative relative frequencies.
Other Types of Graphs
Bar Graphs
The Bar Graph
Represents data by using vertical or horizontal
bars, whereby the height or length represents the
frequencies.
Used for qualitative or categorical data.
Horizontal Bar Graph: Place frequencies on the
X-axis; categories on Y-axis.
Vertical Bar Graph: Place frequencies on Y-axis;
categories on X-axis.
Bars should not be adjacent
The Time Series Graph
Used to represent data that occur over a
specific period of time.
Plots the time periods on the X-axis and
the data values on the Y-axis; displayed as
dotted lines.
Indicates the evolution of the variable
over time.
Compound Time Series Graph: compares
the evolution of two datasets over time.
Time Series Graphs
The Pie Graph
It’s a circle divided into sections according to
the percentage of frequencies for each
category.
Used for categorical or nominal data.
Shows the relationship of the parts to the
whole, through visual comparison.
Pie Graphs
The Pie Graph- Steps to Draw
Convert frequencies to degrees:
$d = \dfrac{f}{n} \times 360^\circ$ => use to divide the sections.
Convert frequencies to percentages to
display on the sections: $\% = \dfrac{f}{n} \times 100$
Use a protractor and a compass to draw the sections;
display category names & percentages.
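As an illustration, the two conversions can be applied to the blood-type frequencies from the categorical example (A = 5, B = 7, O = 9, AB = 4). Reusing that data set here is a choice made for this sketch; the variable names are assumptions.

```python
freqs = {"A": 5, "B": 7, "O": 9, "AB": 4}
n = sum(freqs.values())

# Degrees of each sector: d = f/n * 360 (multiplying first keeps the arithmetic exact).
degrees = {c: f * 360 / n for c, f in freqs.items()}

# Percentage shown on each sector: % = f/n * 100.
percents = {c: f * 100 / n for c, f in freqs.items()}

for c in freqs:
    print(f"{c:<3} {degrees[c]:6.1f} degrees  {percents[c]:5.1f}%")
```

The degrees should add up to 360 and the percentages to 100, which serves as a check before drawing.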
Stem and Leaf Plots
A stem and leaf plot is a data plot that uses part
of a data value as the stem and part of the data
value as the leaf to form groups or classes.
It has the advantage over grouped frequency
distribution of retaining the actual data while
showing them in graphic form.
Uses leading digit (or more) as stem & trailing
digit as leaf.
Stem & Leaf Plot- Steps
Step 1: Arrange the data in increasing
order.
Step 2: Separate data according to the
first digit(s).
Step 3: Use leading digit(s) as stem and
trailing digit as leaf.
At an outpatient testing center, the number
of cardiograms performed each day for 20
days is shown. Construct a stem and leaf
plot for the data.
25 31 20 32 13
14 43 02 57 23
36 32 33 32 44
32 52 44 51 45
Unordered Stem Plot        Ordered Stem Plot
0 | 2                      0 | 2
1 | 3 4                    1 | 3 4
2 | 5 0 3                  2 | 0 3 5
3 | 1 2 6 2 3 2 2          3 | 1 2 2 2 2 3 6
4 | 3 4 4 5                4 | 3 4 4 5
5 | 7 2 1                  5 | 1 2 7
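The ordered stem plot can be generated programmatically. A minimal sketch for two-digit data, where the tens digit is the stem and the units digit is the leaf; the variable names are assumptions.

```python
from collections import defaultdict

# Number of cardiograms performed on each of 20 days (from the example).
data = [25, 31, 20, 32, 13, 14, 43, 2, 57, 23,
        36, 32, 33, 32, 44, 32, 52, 44, 51, 45]

stems = defaultdict(list)
for value in sorted(data):            # Step 1: arrange in increasing order
    stem, leaf = divmod(value, 10)    # Steps 2-3: leading digit, trailing digit
    stems[stem].append(leaf)

for stem in sorted(stems):
    print(stem, "|", " ".join(str(leaf) for leaf in stems[stem]))
```

Because every leaf is kept, the original 20 values can be read back from the plot, which is the advantage over a grouped frequency distribution.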
Descriptive Statistics
Measures of Central
Tendency
The Statistic and The Parameter
A Statistic:
It is a descriptive measure computed from the
data of a sample.
A Parameter:
It is a descriptive measure computed from
the data of a population.
Since it is difficult to measure a parameter from the
population, a sample of size n is drawn, whose values
are $x_1, x_2, \ldots, x_n$. From these data, we compute the
statistic.
Measures of Central Tendency
A measure of central tendency is a measure
which indicates where the middle of the data
is.
The three most commonly used measures of
central tendency are:
The Mean, the Median, and the Mode.
The Mean:
It is the average of the data.
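The Population Mean:
$\mu = \dfrac{\sum_{i=1}^{N} x_i}{N}$, where N is the population size.
It is usually unknown, so we use the
sample mean to estimate or approximate it.
The Sample Mean:
$\bar{x} = \dfrac{\sum_{i=1}^{n} x_i}{n}$
Example:
Here is a random sample of size 10 of ages, where
$x_1 = 42$, $x_2 = 28$, $x_3 = 28$, $x_4 = 61$, $x_5 = 31$,
$x_6 = 23$, $x_7 = 50$, $x_8 = 34$, $x_9 = 32$, $x_{10} = 37$.
$\bar{x} = (42 + 28 + \ldots + 37)/10 = 366/10 = 36.6$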
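The sample mean of the ten ages can be checked in code. A minimal sketch using Python's standard `statistics` module; the variable names are assumptions.

```python
from statistics import mean

ages = [42, 28, 28, 61, 31, 23, 50, 34, 32, 37]

total = sum(ages)          # sum of the observations
n = len(ages)              # sample size
x_bar = mean(ages)         # same as total / n

print(total, n, x_bar)
```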
Properties of the Mean:
Uniqueness. For a given set of data there is one and only
one mean.
Simplicity. It is easy to understand and to compute.
Affected by extreme values, since all values enter
into the computation.
Example: Assume the values of physicians' charges are $115, $110,
$119, $117, $121 and $126. The mean = $118.
But assume that the values are $75, $75, $80, $80 and $280.
The mean = $118, a value that is not representative of the set of
data as a whole. The single value $280 inflated the mean.
The Median:
When the data are ordered, it is the observation that
divides the set of observations into two equal parts
such that half of the data are before it and the other
half are after it.
* If n is odd, the median will be the middle
observation. It will be the (n+1)/2 th ordered
observation.
When n = 11, then the median is the 6th observation.
* If n is even, there are two middle observations. The
median will be the mean of these two middle
observations. It will be the (n+1)/2 th ordered
observation.
When n = 12, then the median is the 6.5th observation,
which is an observation halfway between the 6th and
7th ordered observation.
Example:
For the same random sample, the ordered
observations will be as:
23, 28, 28, 31, 32, 34, 37, 42, 50, 61.
Since n = 10, then the median is the 5.5th observation,
i.e. = (32+34)/2 = 33.
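The same result can be obtained in code. A minimal sketch using the standard `statistics` module, with the position (n+1)/2 computed explicitly; the variable names are assumptions.

```python
from statistics import median

ages = [42, 28, 28, 61, 31, 23, 50, 34, 32, 37]

ordered = sorted(ages)
n = len(ordered)
position = (n + 1) / 2                 # 1-based position of the median

# For even n the position is halfway between two ordered observations.
lower, upper = ordered[n // 2 - 1], ordered[n // 2]
by_hand = (lower + upper) / 2

print(ordered, position, by_hand, median(ages))
```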
Properties of the Median:
Uniqueness. For a given set of data there is
one and only one median.
Simplicity. It is easy to calculate.
It is not affected by extreme values
as is the mean.
The Mode:
It is the value which occurs most frequently.
If all values are different there is no mode.
Sometimes, there is more than one mode.
Example:
For the same random sample, the value 28 is
repeated two times, so it is the mode.
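The three cases (one mode, several modes, no mode) can be sketched with a small helper. `modes_or_none` is a hypothetical function written for this illustration; it follows the convention stated here that data with all values different have no mode.

```python
from collections import Counter

def modes_or_none(values):
    """Return the list of modes; an empty list means there is no mode."""
    counts = Counter(values)
    top = max(counts.values())
    if top == 1:                       # all values different -> no mode
        return []
    return [v for v, c in counts.items() if c == top]

ages = [42, 28, 28, 61, 31, 23, 50, 34, 32, 37]

print(modes_or_none(ages))                          # a single mode
print(modes_or_none(["A", "B", "B", "O", "O"]))     # two modes, qualitative data
print(modes_or_none([1, 2, 3]))                     # no mode
```

The second call shows that the mode, unlike the mean and median, can be used with qualitative data.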
Properties of the Mode:
Sometimes, it is not unique.
It may be used for describing qualitative
data.
Skewness:
Data distributions may be classified on the basis of
whether they are symmetric or asymmetric.
If a distribution is symmetric, the left half of its graph
(histogram or frequency polygon) will be a mirror
image of its right half.
When the left half and right half of the graph of a
distribution are not mirror images of each other, the
distribution is asymmetric.
Shapes of Distributions
Negatively skewed
Positively skewed
Kurtosis
Kurtosis is a measure of the degree to
which a distribution is “peaked” or flat in
comparison to a normal distribution whose
graph is characterized by a bell-shaped
appearance
Descriptive Statistics
Measures of Dispersion
Descriptive Statistics – Measures
of Dispersion:
A measure of dispersion conveys information regarding the
amount of variability present in a set of data.
Note:
1. If all the values are the same
→ There is no dispersion.
2. If all the values are different
→ There is dispersion.
3. If the values are close to each other
→ The amount of dispersion is small.
4. If the values are widely scattered
→ The dispersion is greater.
Measures of Dispersions
1. Range: R
2. Variance: $s^2$
3. Standard deviation: s
4. Coefficient of variation: C.V
1. The Range (R):
Range = Largest value - Smallest value
Data:
43,66,61,64,65,38,59,57,57,50.
Find the Range
Range = 66-38 = 28
2. The Variance:
It measures dispersion relative to the scatter of the values
about their mean.
Sample Variance:
$s^2 = \dfrac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$
Find the sample variance of the ages above, where $\bar{x} = 56$.
Solution:
$s^2 = [(43-56)^2 + (66-56)^2 + \ldots + (50-56)^2]/(10-1)$
$= 810/9 = 90$
Population Variance:
$\sigma^2 = \dfrac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}$
where $\mu$ is the population mean and N is the population size.
3. The Standard Deviation:
The standard deviation is useful as a measure of
variation within a given set of data.
It is the square root of the variance: $\sqrt{\text{Variance}}$
Sample Standard Deviation: $s = \sqrt{s^2}$
Population Standard Deviation: $\sigma = \sqrt{\sigma^2}$
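The variance and standard deviation of the ten values used for the range example (43, 66, 61, ..., 50) can be checked with the standard `statistics` module. Treating the same ten values as a whole population in the last two lines is only an assumption made to show the divisor N instead of n - 1.

```python
import statistics

ages = [43, 66, 61, 64, 65, 38, 59, 57, 57, 50]

x_bar = statistics.mean(ages)
sum_sq = sum((x - x_bar) ** 2 for x in ages)   # sum of squared deviations

s2 = statistics.variance(ages)      # sample variance, divisor n - 1
s = statistics.stdev(ages)          # sample standard deviation, sqrt(s2)

sigma2 = statistics.pvariance(ages) # population variance, divisor N
sigma = statistics.pstdev(ages)     # population standard deviation

print(x_bar, sum_sq, s2, round(s, 2), sigma2, sigma)
```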
4. The Coefficient of Variation
It is a measure used to compare the dispersion
in two sets of data, and it is independent of the
unit of measurement.
$C.V. = \dfrac{s}{\bar{x}} (100)\%$
where s: Sample standard deviation.
$\bar{x}$: Sample mean
Example
Suppose two samples of human males yield the
following data:
                      Sample 1      Sample 2
Age                   25 years      11 years
Mean weight           145 pounds    80 pounds
Standard deviation    10 pounds     10 pounds
We wish to know which is more variable.
Solution:
C.V. (Sample 1) = (10/145)*100 = 6.9%
C.V. (Sample 2) = (10/80)*100 = 12.5%
The weights of the 11-year-olds (Sample 2) therefore show more
relative variation.
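The comparison can be written as a one-line function. `coefficient_of_variation` is a hypothetical name used for this sketch.

```python
def coefficient_of_variation(s, x_bar):
    """C.V. = (s / x_bar) * 100, expressed as a percentage."""
    return s / x_bar * 100

cv1 = coefficient_of_variation(10, 145)   # Sample 1: 25-year-olds
cv2 = coefficient_of_variation(10, 80)    # Sample 2: 11-year-olds

print(round(cv1, 1), round(cv2, 1))
```

Both samples have the same standard deviation (10 pounds), so the difference comes entirely from the means: the same spread is relatively larger around a smaller mean.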
General Rounding Rule
To round to two decimal places, look at the 3rd digit after the decimal point:
1. If it is > 5, increase the digit before it by 1
0.218495 ≈ 0.22
2. If it is < 5, the digit before it does not change
0.2113 ≈ 0.21
3. If it is exactly 5 with no nonzero digits after it, look at the digit before it
◦ If it is odd, increase it by 1
0.215 ≈ 0.22
◦ If it is even, don't change it
0.345 ≈ 0.34
DEFINITION
Given a set of n observations x1,
x2,…xn, the pth percentile P is the
value of X such that p% or less of the
observations are less than P and
(100-p) percent or less of the
observations are greater than P.
Percentiles and Quartiles
The nth percentile is denoted by Pn.
The 10th percentile is P10
The first Quartile denoted by Q1 is the 25th
percentile. Q1 = P25.
The second Quartile is the 50th percentile
Q2 = P50 is the median
The third quartile Q3 is the 75th percentile.
Q3 = P75
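When we wish to find the quartiles for a
set of n ordered observations, the following formulas are
used:
$Q_1 = \dfrac{n+1}{4}$th ordered observation
$Q_2 = \dfrac{2(n+1)}{4} = \dfrac{n+1}{2}$th ordered observation
$Q_3 = \dfrac{3(n+1)}{4}$th ordered observation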
Interquartile Range
The range provides a simple measure of the
variability present in a set of data.
A disadvantage of the range is the fact that it
is computed from only two values, the
largest and the smallest.
A similar measure that reflects the variability
among the middle 50 percent of the
observations in a data set, is the interquartile
range or IQR
DEFINITION
The interquartile range (IQR) is the
difference between the third and
first quartiles:
IQR = Q3 – Q1
IQR
A large IQR indicates a large amount of
variability among the middle 50 percent of
the relevant observations, and
a small IQR indicates a small amount of
variability.
It is more informative to compare the
interquartile range with the range of the
entire data set.
100 (IQR/R) tells us what percent the
IQR is of the overall range.
Box-and-Whisker Plots
1. Represent the variable of interest on the horizontal axis.
2. Draw a box above the horizontal axis where the left end of the
box aligns with Q1 and the right end aligns with Q3
3. Divide the box into two parts by a vertical line that aligns with
the median Q2
4. Draw a horizontal line called a whisker from the left end of the
box to a point that aligns with the smallest measurement in the
data set.
5. Draw another horizontal line, or whisker, from the right end of
the box to a point that aligns with the largest measurement in
the data set.
Examination of a box-and-whisker plot for a set of data reveals
information regarding the amount of spread, location of
concentration, and symmetry of the data.
Evans et al. examined the effect of velocity on ground
reaction forces (GRF) in dogs with lameness from a
torn cranial cruciate ligament. The dogs were walked
and trotted over a force platform, and the GRF was
recorded during a certain phase of their
performance.
The following table contains 20 measurements of
force where each value shown is the mean of five
force measurements per dog when trotting.
Smallest measurement = 14.6
Largest measurement = 44
$Q_1 = \dfrac{20+1}{4} = 5.25$th measurement,
which is 27.2 + (0.25)(27.4 - 27.2) = 27.25
The median $Q_2 = \dfrac{2(20+1)}{4} = 10.5$th measurement,
or 30.7 + (0.5)(31.5 - 30.7) = 31.1
$Q_3 = \dfrac{3(20+1)}{4} = 15.75$th measurement,
which is 33.3 + (0.75)(33.6 - 33.3) = 33.525
IQR = 33.525 – 27.25 = 6.275
Range = 44 – 14.6 = 29.4
100(IQR/R) = 100(6.275/29.4) = 21%, so the IQR is 21% of the range
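The position-and-interpolate procedure used here can be sketched as a small function. Because the table of 20 force measurements is not reproduced in this text, the sketch is applied to the ten ages from the central-tendency example instead; `value_at_position` and `quartiles` are hypothetical helper names, and the code assumes at least three observations.

```python
import math

def value_at_position(ordered, pos):
    """Value at a 1-based, possibly fractional, position in ordered data."""
    lower = math.floor(pos)
    frac = pos - lower
    if frac == 0:
        return ordered[lower - 1]
    # Move the fraction of the way from one ordered value to the next.
    return ordered[lower - 1] + frac * (ordered[lower] - ordered[lower - 1])

def quartiles(data):
    """Q1, Q2, Q3 at positions (n+1)/4, 2(n+1)/4 and 3(n+1)/4."""
    ordered = sorted(data)
    n = len(ordered)
    return tuple(value_at_position(ordered, i * (n + 1) / 4) for i in (1, 2, 3))

ages = [42, 28, 28, 61, 31, 23, 50, 34, 32, 37]
q1, q2, q3 = quartiles(ages)      # positions 2.75, 5.5 and 8.25
iqr = q3 - q1
pct_of_range = 100 * iqr / (max(ages) - min(ages))

print(q1, q2, q3, iqr, round(pct_of_range, 1))
```

Other software may use slightly different position rules for quartiles, so results can differ a little from this method on the same data.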
Many statistical software packages have
the capability of constructing box-and-whisker
plots.
Box and Whiskers
Outliers
The asterisks alert us to the fact that the data set
contains one unusually large and one unusually small
value, called outliers.
DEFINITION
An outlier is an observation whose value, x,
either exceeds the value of the third quartile
by a magnitude greater than 1.5(IQR) or is
less than the value of the first quartile by a
magnitude greater than 1.5(IQR).
That is, an observation of x > Q3 + 1.5(IQR)
or an observation of x < Q1 - 1.5(IQR) is
called an outlier.
Outlier
We may use the previously computed values of
Q1, Q3, and IQR to determine how large or
how small a value would have to be in order to
be considered an outlier. The calculations are as
follows:
X < 27.25 – 1.5(6.275)= 17.8375 and
X > 33.525 + 1.5(6.275) = 42.9375
An observed value smaller than 17.8375 or
larger than 42.9375 would be considered an
outlier.
The Boxplot: Checking for outliers
Calculate the Inner Fences, given by
ILF =Q1- 1.5*IQR & IUF = Q3+ 1.5*IQR.
The Whiskers are the smallest & largest values within the
Inner Fences.
If there are outliers, compute the Outer Fences, given by
OLF =Q1- 3*IQR & OUF= Q3+ 3*IQR.
A mild outlier is defined as one that lies between the
inner fences and outer fences; whereas an extreme
outlier is one that lies outside the outer fences.
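The fence calculations can be sketched with the quartiles computed earlier for the force measurements (Q1 = 27.25, Q3 = 33.525). `classify` is a hypothetical helper; values lying exactly on a fence are treated as not being outliers, in line with the strict inequalities in the definition.

```python
q1, q3 = 27.25, 33.525
iqr = q3 - q1

inner = (q1 - 1.5 * iqr, q3 + 1.5 * iqr)   # inner lower / upper fences
outer = (q1 - 3 * iqr, q3 + 3 * iqr)       # outer lower / upper fences

def classify(x):
    """Label a value as an extreme outlier, a mild outlier, or neither."""
    if x < outer[0] or x > outer[1]:
        return "extreme outlier"
    if x < inner[0] or x > inner[1]:
        return "mild outlier"
    return "not an outlier"

for x in (14.6, 30.7, 44):
    print(x, classify(x))
```

With these fences, the smallest (14.6) and largest (44) measurements both fall between the inner and outer fences, so each is a mild outlier.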