Data Interpretation and Frequency Distribution

Chapter 2 discusses strategies for understanding data through organization into frequency distributions, including categorical, grouped, and ungrouped types. It outlines steps for constructing frequency distributions and visual representations such as histograms, frequency polygons, and ogives. The chapter emphasizes the importance of class limits, cumulative frequencies, and the use of relative frequency graphs for data interpretation.

Uploaded by

Razan Ismail
Chapter 2

Strategies for understanding the meanings of Data
2-1. Organizing Data
• Raw data is difficult to interpret, so it is organized into a frequency distribution.
• Each value is placed into a class (i.e. a quantitative or qualitative category).
• Frequency of a Class is the number of data values contained in it.
• Class Limits are the smallest and largest data values that can belong to a class.
• Frequency Distribution is the organization of raw data in table form, using classes and frequencies.
Types of Frequency Distributions
• Categorical Freq. Distribution: used for data that is nominal or ordinal (discrete classes).

• Grouped Freq. Distribution: used for numerical data, where classes are more than one unit wide (for large data ranges).

• Ungrouped Freq. Distribution: used for numerical data with a narrow range (single values instead of classes).
Categorical Frequency Distribution

Twenty-five army inductees were given a blood test to determine their blood type.
Raw Data: A,B,B,AB,O,O,O,B,AB,B,B,B,O,A,O,
A,O,O,O,AB,AB,A,O,B,A

Construct a frequency distribution for the data.
Categorical Frequency Distribution - Steps to Take
• Step 1: Draw a table containing Class; Tally (i.e. Count); Frequency; Percent (optional).
• Step 2: Tally the data.
• Step 3: Count the tallies to get the frequency.
• Step 4: Find the percentage value of each class (% = f/n × 100) or the relative frequency (f/n).
• Step 5: Calculate totals to confirm results.
Categorical Frequency Distribution
Twenty-five army inductees were given a blood test to
determine their blood type.
Raw Data: A,B,B,AB,O,O,O,B,AB,B,B,B,O,A,O
A,O,O,O,AB, AB,A,O,B,A

Class   Tally        Frequency   Percent
A       IIII         5           20%
B       IIII II      7           28%
O       IIII IIII    9           36%
AB      IIII         4           16%
Total                25          100%
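
The five steps above can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `categorical_distribution` is an assumption introduced here, and `collections.Counter` does the tallying of Steps 2 and 3.

```python
from collections import Counter

# Blood types of the 25 inductees, copied from the raw data above.
raw = ("A,B,B,AB,O,O,O,B,AB,B,B,B,O,A,O,"
       "A,O,O,O,AB,AB,A,O,B,A").split(",")


def categorical_distribution(values):
    """Return {class: (frequency, percent)} for categorical data."""
    counts = Counter(values)          # Steps 2-3: tally and count
    n = len(values)
    # Step 4: percent = f / n * 100
    return {cls: (f, 100 * f / n) for cls, f in counts.items()}


table = categorical_distribution(raw)
for cls in ("A", "B", "O", "AB"):
    f, pct = table[cls]
    print(f"{cls:<3} {f:>2} {pct:.0f}%")

# Step 5: the totals should equal n and 100%.
total_f = sum(f for f, _ in table.values())
total_pct = sum(pct for _, pct in table.values())
```

The totals (`total_f` and `total_pct`) give a quick check that no observation was lost while tallying.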
Grouped Frequency Distribution
 Grouped frequency distributions are
used when the range of the data is large.
 The smallest and largest possible data values
in a class are the lower and upper class
limits. Class boundaries separate the
classes.
 To find a class boundary, average the upper
class limit of one class and the lower class
limit of the next class.

Grouped Frequency Distribution
• The class width can be calculated by subtracting
◦ successive lower class limits (or boundaries)
◦ successive upper class limits (or boundaries)
◦ upper and lower class boundaries

• The class midpoint Xm can be calculated by averaging
◦ upper and lower class limits (or boundaries)
The range (R).
It is the difference between the largest and the smallest observation in the data set.
The Width of the interval (w).
Class intervals should be of the same width.
If we want k intervals, then w is
$w \ge \dfrac{R}{k}$
where k is the number of classes. The number of classes can be chosen with Sturges' rule, $k = 1 + 3.322 \log_{10} n$.
Assume that the number of observations = 100, then
$k = 1 + 3.322 \log_{10}(100) = 1 + 3.322(2) = 7.644 \approx 8$ (rounded up).
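
As a small sketch of the two formulas above, the helpers below (hypothetical names `sturges_classes` and `class_width`, introduced only for illustration) compute k and the smallest whole-number width that satisfies w ≥ R/k. Rounding up in both places is an assumption that matches the worked example.

```python
import math


def sturges_classes(n):
    """Number of classes k = 1 + 3.322 * log10(n), rounded up."""
    return math.ceil(1 + 3.322 * math.log10(n))


def class_width(data_range, k):
    """Smallest whole-number width w that satisfies w >= R / k."""
    return math.ceil(data_range / k)


k = sturges_classes(100)     # 1 + 3.322 * 2 = 7.644, rounded up
w = class_width(34, 7)       # a range of 34 split into 7 classes
```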
Rules for Classes in Grouped Frequency
Distributions
1. There should be 5-20 classes.
2. The class width should be an odd
number.
3. The classes must be mutually exclusive.
4. The classes must be continuous.
5. The classes must be equal in width
(except in open-ended distributions).

Constructing a Grouped Frequency
Distribution
The following data represent the record
high temperatures for each of the 50
states. Construct a grouped frequency
distribution for the data using 7 classes.
112 100 127 120 134 118 105 110 109 112
110 118 117 116 118 122 114 114 105 109
107 112 114 115 118 117 118 122 106 110
116 108 110 121 113 120 119 111 104 111
120 113 120 117 105 110 118 112 114 114

Constructing a Grouped Frequency
Distribution
STEP 1 Determine the classes.
Find the class width by dividing the range by the number of classes, 7.
Range = High – Low
= 134 – 100 = 34
Width = Range/7 = 34/7 ≈ 4.86, which rounds up to 5

Rounding Rule: Always round up
Constructing a Grouped Frequency Distribution

STEP 4 Find the cumulative frequencies by keeping a running total of the frequencies.

Class Limits   Class Boundaries   Frequency   Cumulative Frequency
100 - 104      99.5 - 104.5       2           0 + 2 = 2
105 - 109      104.5 - 109.5      8           2 + 8 = 10
110 - 114      109.5 - 114.5      18          10 + 18 = 28
115 - 119      114.5 - 119.5      13          28 + 13 = 41
120 - 124      119.5 - 124.5      7           41 + 7 = 48
125 - 129      124.5 - 129.5      1           48 + 1 = 49
130 - 134      129.5 - 134.5      1           49 + 1 = 50
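
The table above can be reproduced with a short sketch. The function name `grouped_distribution` and the dictionary keys are assumptions made for this illustration; the sketch also assumes whole-number data, so that consecutive class limits are one unit apart and each boundary lies 0.5 beyond a limit.

```python
from itertools import accumulate

# Record high temperatures for the 50 states, copied from the example.
temps = [
    112, 100, 127, 120, 134, 118, 105, 110, 109, 112,
    110, 118, 117, 116, 118, 122, 114, 114, 105, 109,
    107, 112, 114, 115, 118, 117, 118, 122, 106, 110,
    116, 108, 110, 121, 113, 120, 119, 111, 104, 111,
    120, 113, 120, 117, 105, 110, 118, 112, 114, 114,
]


def grouped_distribution(data, lowest, width, k):
    """Build k classes of equal width starting at `lowest` (integer data)."""
    rows = []
    for i in range(k):
        lower = lowest + i * width
        upper = lower + width - 1            # upper class limit
        rows.append({
            "limits": (lower, upper),
            "boundaries": (lower - 0.5, upper + 0.5),
            "frequency": sum(lower <= x <= upper for x in data),
        })
    # Cumulative frequency is a running total of the frequencies.
    running = accumulate(row["frequency"] for row in rows)
    for row, cum in zip(rows, running):
        row["cumulative"] = cum
    return rows


rows = grouped_distribution(temps, lowest=100, width=5, k=7)
for row in rows:
    print(row["limits"], row["boundaries"], row["frequency"], row["cumulative"])
```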
The Cumulative Frequency:
It can be computed by adding successive frequencies.

The Relative Frequency:
It is used to know proportions. It can be computed by
R.F. = Freq/n

The Cumulative Relative Frequency:
It can be computed by adding successive relative frequencies.
Example
Class      Midpoint   Frequency   Cumulative   Relative          Cumulative
interval              (f)         Frequency    Frequency (R.f)   Relative Frequency
30 – 39    34.5       11          11           0.0582            0.0582
40 – 49    44.5       46          57           0.2434            0.3016
50 – 59    54.5       70          127          0.3704            0.6720
60 – 69    64.5       45          172          0.2381            0.9101
70 – 79    74.5       16          188          0.0847            0.9948
80 – 89    84.5       1           189          0.0053            1
Total                 189                      1
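
A minimal sketch of how the last three columns can be computed from the frequency column. The variable names are illustrative only. Here the cumulative relative frequency is obtained by dividing each cumulative frequency by n, which avoids accumulating rounding error; for the 70 – 79 class this gives 188/189 ≈ 0.9947, whereas adding the already-rounded relative frequencies gives the 0.9948 shown in the table.

```python
from itertools import accumulate

# Frequencies of the six classes 30-39, ..., 80-89 from the example table.
freqs = [11, 46, 70, 45, 16, 1]

n = sum(freqs)                              # total number of observations
cum = list(accumulate(freqs))               # cumulative frequency
rel = [f / n for f in freqs]                # relative frequency R.F. = f / n
cum_rel = [c / n for c in cum]              # cumulative relative frequency

for f, c, r, cr in zip(freqs, cum, rel, cum_rel):
    print(f"{f:>3} {c:>4} {r:.4f} {cr:.4f}")
```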

Histograms, Frequency Polygons, and
Ogives

3 Most Common Graphs in Research
1. Histogram
2. Frequency Polygon
3. Cumulative Frequency Polygon (Ogive)
Histograms, Frequency Polygons, and Ogives
• The histogram is a graph that displays the data by using vertical bars of various heights to represent the frequencies of the classes.
• The class boundaries are represented on the horizontal x-axis (this ensures bars are adjacent).
• Frequencies are represented on the y-axis by the heights of the vertical bars, which are equal in width.
Histograms
Construct a histogram to represent the
data for the record high temperatures for
each of the 50 states

112 100 127 120 134 118 105 110 109 112
110 118 117 116 118 122 114 114 105 109
107 112 114 115 118 117 118 122 106 110
116 108 110 121 113 120 119 111 104 111
120 113 120 117 105 110 118 112 114 114

Histograms
Histograms use class boundaries and
frequencies of the classes.
Class Limits   Class Boundaries   Frequency
100 - 104      99.5 - 104.5       2
105 - 109      104.5 - 109.5      8
110 - 114      109.5 - 114.5      18
115 - 119      114.5 - 119.5      13
120 - 124      119.5 - 124.5      7
125 - 129      124.5 - 129.5      1
130 - 134      129.5 - 134.5      1

Histograms, Frequency Polygons, and Ogives
• The frequency polygon is a graph that displays the data by using lines that connect points plotted for the frequencies at the class midpoints.
• The frequencies are represented by the heights of the points.
• The class midpoints are represented on the horizontal axis.

Frequency Polygons
Frequency polygons use class midpoints and
frequencies of the classes.

A frequency polygon is anchored on the x-axis before the first class and after the last class.

Ogives
• The ogive is a graph that represents the cumulative frequencies for the classes in a frequency distribution.
• The upper class boundaries are represented on the horizontal axis.
• Cumulative frequencies are represented by the heights of the plotted points.

Start the graph on the x-axis (cumulative frequency zero) at the first lower class boundary.

Ogives
Ogives use upper class boundaries and
cumulative frequencies of the classes.
Class Boundaries   Cumulative Frequency
Less than 104.5    2
Less than 109.5    10
Less than 114.5    28
Less than 119.5    41
Less than 124.5    48
Less than 129.5    49
Less than 134.5    50
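
The points of the ogive, and the way a value is read off it, can be sketched as follows. The function name `read_ogive` is hypothetical. Reading values between two plotted boundaries by straight-line interpolation is an assumption of this sketch: it mirrors reading the graph by eye and only estimates the true count.

```python
from bisect import bisect_right

# Ogive points: the first lower boundary (cumulative frequency 0),
# then each upper class boundary with its cumulative frequency.
boundaries = [99.5, 104.5, 109.5, 114.5, 119.5, 124.5, 129.5, 134.5]
cum_freq = [0, 2, 10, 28, 41, 48, 49, 50]


def read_ogive(x):
    """Estimate how many observations are less than x from the ogive."""
    if x <= boundaries[0]:
        return 0.0
    if x >= boundaries[-1]:
        return float(cum_freq[-1])
    i = bisect_right(boundaries, x) - 1      # segment containing x
    x0, x1 = boundaries[i], boundaries[i + 1]
    y0, y1 = cum_freq[i], cum_freq[i + 1]
    return y0 + (y1 - y0) * (x - x0) / (x1 - x0)


below_114_5 = read_ogive(114.5)   # exactly on a plotted point
below_112 = read_ogive(112.0)     # interpolated between two points
```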


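For example, to find out how many record high temperatures are less than 114.5 °F, locate 114.5 on the x-axis, move up to the ogive, and read the cumulative frequency on the y-axis: 28 of the 50 record highs are below 114.5 °F.
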
Histograms, Polygons & Ogives –
Summary Table

Graph Type      X-axis             Y-axis           Displays
Histogram       Boundaries         Frequency        Adjacent Bars
Freq. Polygon   Midpoints          Frequency        Dotted Lines
Ogive           Upper Boundaries   Cum. Frequency   Dotted Lines

Relative Frequency Graphs.

If proportions are used instead of frequencies, the graphs are called relative frequency graphs.

Relative frequency graphs are used when the proportion of data values that fall into a given class is more important than the actual number of data values that fall into that class.

Construct a histogram, frequency polygon, and ogive using relative
frequencies for the distribution of the miles that 20 randomly selected
runners ran during a given week.

Class Boundaries   Frequency   Relative Frequency
5.5 - 10.5         1           1/20 = 0.05
10.5 - 15.5        2           2/20 = 0.10
15.5 - 20.5        3           3/20 = 0.15
20.5 - 25.5        5           5/20 = 0.25
25.5 - 30.5        4           4/20 = 0.20
30.5 - 35.5        3           3/20 = 0.15
35.5 - 40.5        2           2/20 = 0.10
                   Σf = 20     Σrf = 1.00

Divide each frequency by the total frequency to get the relative frequency.
Histograms
Use the class boundaries and the
relative frequencies of the classes.

Frequency Polygons
The following is a frequency distribution of
miles run per week by 20 selected runners.
Class Boundaries   Class Midpoints   Relative Frequency
5.5 - 10.5         8                 0.05
10.5 - 15.5        13                0.10
15.5 - 20.5        18                0.15
20.5 - 25.5        23                0.25
25.5 - 30.5        28                0.20
30.5 - 35.5        33                0.15
35.5 - 40.5        38                0.10

Frequency Polygons
Use the class midpoints and the
relative frequencies of the classes.

Ogives
The following is a frequency distribution of
miles run per week by 20 selected runners.
Class Boundaries   Frequency   Cumulative Frequency   Cum. Rel. Frequency
5.5 - 10.5         1           1                      1/20 = 0.05
10.5 - 15.5        2           3                      3/20 = 0.15
15.5 - 20.5        3           6                      6/20 = 0.30
20.5 - 25.5        5           11                     11/20 = 0.55
25.5 - 30.5        4           15                     15/20 = 0.75
30.5 - 35.5        3           18                     18/20 = 0.90
35.5 - 40.5        2           20                     20/20 = 1.00
                   Σf = 20
Ogives
Ogives use upper class boundaries and
cumulative frequencies of the classes.
Class Boundaries   Cum. Rel. Frequency
Less than 10.5     0.05
Less than 15.5     0.15
Less than 20.5     0.30
Less than 25.5     0.55
Less than 30.5     0.75
Less than 35.5     0.90
Less than 40.5     1.00

Ogives
Use the upper class boundaries and the
cumulative relative frequencies.

Other Types of Graphs
Bar Graphs

The Bar Graph
• Represents data by using vertical or horizontal bars, whereby the height or length represents the frequencies.
• Used for qualitative or categorical data.
• Horizontal Bar Graph: Place frequencies on the X-axis; categories on Y-axis.
• Vertical Bar Graph: Place frequencies on Y-axis; categories on X-axis.
• Bars should not be adjacent.

The Time Series Graph
 Used to represent data that occur over a
specific period of time.
 Plots the time periods on the X-axis and
the data values on the Y-axis; displayed as
dotted lines.
 Indicates the evolution of the variable
over time.
 Compound Time Series Graph: compares
the evolution of two datasets over time.

Time Series Graphs

The Pie Graph
• It’s a circle divided into sections according to the percentage of frequencies for each category.
• Used for categorical or nominal data.
• Shows the relationship of the parts to the whole, through visual comparison.
Pie Graphs

The Pie Graph - Steps to Draw
• Convert frequencies to degrees: $d = \dfrac{f}{n} \times 360$, and use the degrees to divide the sections.
• Convert frequencies to percentages to display on the sections: $\% = \dfrac{f}{n} \times 100$.
• Use a compass to draw the sections; display category names & percentages.
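
As an illustration of the two conversions, the sketch below applies them to the blood-type frequencies from the earlier categorical example. The function name `pie_sections` is an assumption introduced here, not part of the original material.

```python
# Frequencies from the blood-type example (n = 25).
blood_types = {"A": 5, "B": 7, "O": 9, "AB": 4}


def pie_sections(freqs):
    """Return {category: (degrees, percent)} for a pie graph."""
    n = sum(freqs.values())
    # degrees = f / n * 360 and percent = f / n * 100
    return {c: (360 * f / n, 100 * f / n) for c, f in freqs.items()}


sections = pie_sections(blood_types)
for category, (degrees, percent) in sections.items():
    print(f"{category:<3} {degrees:6.1f} degrees {percent:5.1f}%")

total_degrees = sum(d for d, _ in sections.values())
```

The degrees of all sections should add up to 360, which is a convenient check before drawing.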

Stem and Leaf Plots

A stem and leaf plot is a data plot that uses part of a data value as the stem and part of the data value as the leaf to form groups or classes.
It has the advantage over a grouped frequency distribution of retaining the actual data while showing them in graphic form.
It uses the leading digit (or digits) as the stem and the trailing digit as the leaf.

Stem & Leaf Plot - Steps
• Step 1: Arrange the data in increasing order.
• Step 2: Separate data according to the first digit(s).
• Step 3: Use leading digit(s) as stem and trailing digit as leaf.
At an outpatient testing center, the number
of cardiograms performed each day for 20
days is shown. Construct a stem and leaf
plot for the data.

25 31 20 32 13
14 43 02 57 23
36 32 33 32 44
32 52 44 51 45

25 31 20 32 13
14 43 02 57 23
36 32 33 32 44
32 52 44 51 45

Unordered Stem Plot        Ordered Stem Plot
0 | 2                      0 | 2
1 | 3 4                    1 | 3 4
2 | 5 0 3                  2 | 0 3 5
3 | 1 2 6 2 3 2 2          3 | 1 2 2 2 2 3 6
4 | 3 4 4 5                4 | 3 4 4 5
5 | 7 2 1                  5 | 1 2 7
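
The ordered plot can be built with a short sketch that follows the three steps. The function name `stem_and_leaf` is hypothetical, and the sketch assumes non-negative whole numbers with a single trailing digit as the leaf; the value written as 02 in the data is entered as 2.

```python
from collections import defaultdict

# Number of cardiograms performed on each of the 20 days.
cardiograms = [25, 31, 20, 32, 13, 14, 43, 2, 57, 23,
               36, 32, 33, 32, 44, 32, 52, 44, 51, 45]


def stem_and_leaf(values):
    """Return {stem: [leaves]} with leaves in increasing order."""
    plot = defaultdict(list)
    for value in sorted(values):             # Step 1: order the data
        stem, leaf = divmod(value, 10)       # Steps 2-3: split the digits
        plot[stem].append(leaf)
    return dict(plot)


plot = stem_and_leaf(cardiograms)
for stem in sorted(plot):
    print(stem, "|", " ".join(str(leaf) for leaf in plot[stem]))
```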

Descriptive Statistics
Measures of Central Tendency
The Statistic and The Parameter
• A Statistic:
It is a descriptive measure computed from the data of a sample.
• A Parameter:
It is a descriptive measure computed from the data of a population.
Since it is difficult to measure a parameter from the population, a sample of size n is drawn, whose values are $x_1, x_2, \ldots, x_n$. From this data, we measure the statistic.

Measures of Central Tendency
A measure of central tendency is a measure
which indicates where the middle of the data
is.
The three most commonly used measures of
central tendency are:
The Mean, the Median, and the Mode.
The Mean:
It is the average of the data.

The Population Mean:
$\mu = \dfrac{\sum_{i=1}^{N} x_i}{N}$
which is usually unknown, so we use the sample mean to estimate or approximate it.

The Sample Mean:
$\bar{x} = \dfrac{\sum_{i=1}^{n} x_i}{n}$

Example:
Here is a random sample of size 10 of ages, where
$x_1 = 42,\ x_2 = 28,\ x_3 = 28,\ x_4 = 61,\ x_5 = 31,$
$x_6 = 23,\ x_7 = 50,\ x_8 = 34,\ x_9 = 32,\ x_{10} = 37.$

$\bar{x} = (42 + 28 + \ldots + 37) / 10 = 36.6$


Properties of the Mean:
• Uniqueness. For a given set of data there is one and only one mean.
• Simplicity. It is easy to understand and to compute.
• Affected by extreme values, since all values enter into the computation.

Example: Assume the values of physicians' charges are $115, $110, $119, $117, $121 and $126. The mean = $118.
But assume that the values are $75, $75, $80, $80 and $280.
The mean = $118, a value that is not representative of the set of data as a whole. The single value $280 inflated the mean.

The Median:
When the data are ordered, it is the observation that divides the set of observations into two equal parts, such that half of the data are before it and the other half are after it.
* If n is odd, the median will be the middle observation. It will be the (n+1)/2 th ordered observation.
When n = 11, then the median is the 6th observation.
* If n is even, there are two middle observations. The median will be the mean of these two middle observations. It will be the (n+1)/2 th ordered observation.
When n = 12, then the median is the 6.5th observation, which is an observation halfway between the 6th and 7th ordered observations.
Example:
For the same random sample, the ordered
observations will be as:
23, 28, 28, 31, 32, 34, 37, 42, 50, 61.
Since n = 10, then the median is the 5.5th observation,
i.e. = (32+34)/2 = 33.
Properties of the Median:
 Uniqueness. For a given set of data there is
one and only one median.
 Simplicity. It is easy to calculate.
 It is not affected by extreme values
as is the mean.

The Mode:
It is the value which occurs most frequently.
If all values are different there is no mode.
Sometimes, there is more than one mode.
Example:
For the same random sample, the value 28 is repeated two times, so it is the mode.
Properties of the Mode:
• Sometimes, it is not unique.
• It may be used for describing qualitative data.
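
The three measures can be checked with Python's standard `statistics` module. This is a sketch rather than part of the original material; `statistics.multimode` (available from Python 3.8) returns every most-frequent value, which fits the remark that the mode is not always unique. The second data set repeats the physicians' charges example to show how one extreme value moves the mean but not the median.

```python
import statistics

# The random sample of 10 ages used in the examples.
ages = [42, 28, 28, 61, 31, 23, 50, 34, 32, 37]

mean_age = statistics.mean(ages)          # sum of the values divided by n
median_age = statistics.median(ages)      # mean of the 5th and 6th ordered values
modes_age = statistics.multimode(ages)    # list of the most frequent value(s)

# Charges with one extreme value ($280).
charges = [75, 75, 80, 80, 280]
mean_charge = statistics.mean(charges)        # pulled up by the extreme value
median_charge = statistics.median(charges)    # unaffected by the extreme value

print(mean_age, median_age, modes_age)
print(mean_charge, median_charge)
```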

Skewness:
• Data distributions may be classified on the basis of whether they are symmetric or asymmetric.
• If a distribution is symmetric, the left half of its graph (histogram or frequency polygon) will be a mirror image of its right half.
• When the left half and right half of the graph of a distribution are not mirror images of each other, the distribution is asymmetric.

Shapes of Distributions

Negatively skewed
Positively skewed

Kurtosis
• Kurtosis is a measure of the degree to which a distribution is “peaked” or flat in comparison to a normal distribution, whose graph is characterized by a bell-shaped appearance.
Descriptive Statistics
Measures of Dispersion
A measure of dispersion conveys information regarding the amount of variability present in a set of data.
Note:
1. If all the values are the same → there is no dispersion.
2. If all the values are different → there is dispersion.
3. If the values are close to each other → the amount of dispersion is small.
4. If the values are widely scattered → the dispersion is greater.

Measures of Dispersion
1. Range: R
2. Variance: s²
3. Standard deviation: s
4. Coefficient of variation: C.V

1. The Range (R):
• Range = Largest value - Smallest value

• Data:
43, 66, 61, 64, 65, 38, 59, 57, 57, 50.

• Find the Range

Range = 66 - 38 = 28

2. The Variance:
It measures dispersion relative to the scatter of the values about their mean.
Sample Variance:
$s^2 = \dfrac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$

Find the sample variance of the ages in the data above, for which $\bar{x} = 56$.

Solution:
$s^2 = [(43-56)^2 + (66-56)^2 + \ldots + (50-56)^2] / (10-1)$
$= 810/9 = 90$

• Population Variance:
$\sigma^2 = \dfrac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}$
where µ is the population mean and N is the population size.

3. The Standard Deviation:

The standard deviation is useful as a measure of variation within a given set of data.

• It is the square root of the variance: $\sqrt{\text{Variance}}$

Sample Standard Deviation = $s = \sqrt{s^2}$

Population Standard Deviation = $\sigma = \sqrt{\sigma^2}$
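
A brief sketch of the variance and standard deviation formulas, applied to the ten values from the range example. The helper name `sample_variance` is introduced here for illustration; the standard-library functions `statistics.variance` and `statistics.pvariance` are shown alongside as a cross-check. Treating the same ten values as a whole population in the last lines is purely for illustration.

```python
import math
import statistics

data = [43, 66, 61, 64, 65, 38, 59, 57, 57, 50]


def sample_variance(xs):
    """s^2 = sum((x - mean)^2) / (n - 1)."""
    n = len(xs)
    mean = sum(xs) / n
    return sum((x - mean) ** 2 for x in xs) / (n - 1)


s2 = sample_variance(data)          # 810 / 9
s = math.sqrt(s2)                   # sample standard deviation

# The same quantities from the standard library.
s2_lib = statistics.variance(data)        # divides by n - 1
sigma2_lib = statistics.pvariance(data)   # divides by N (population formula)
sigma = math.sqrt(sigma2_lib)

print(s2, round(s, 4), sigma2_lib, sigma)
```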

4. The Coefficient of Variation
It is a measure used to compare the dispersion in two sets of data, and it is independent of the unit of measurement.
$C.V = \dfrac{s}{\bar{x}} (100)\%$
where s: sample standard deviation.
$\bar{x}$: sample mean.

Example
• Suppose two samples of human males yield the following data:

                     Sample 1      Sample 2
Age                  25 years      11 years
Mean weight          145 pounds    80 pounds
Standard deviation   10 pounds     10 pounds

• We wish to know which is more variable.

Solution:
• C.V (Sample 1) = (10/145) × 100 = 6.9%
• C.V (Sample 2) = (10/80) × 100 = 12.5%
• Hence the weights of the 11-year-olds (Sample 2) show more variation relative to their mean.
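
The comparison can be reproduced with a one-line function. The name `coefficient_of_variation` is an assumption made for this sketch.

```python
def coefficient_of_variation(s, mean):
    """C.V = (s / mean) * 100, expressed as a percentage."""
    return s / mean * 100


cv_sample1 = coefficient_of_variation(10, 145)   # 25-year-olds
cv_sample2 = coefficient_of_variation(10, 80)    # 11-year-olds

more_variable = "Sample 2" if cv_sample2 > cv_sample1 else "Sample 1"
print(round(cv_sample1, 1), round(cv_sample2, 1), more_variable)
```

Both samples have the same standard deviation (10 pounds), so the difference comes entirely from the means.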

General Rounding Rule
To round to two decimal places, look at the 3rd value after the decimal point:
1. If it is > 5 → increase the number before it by 1
0.218495 ≈ 0.22

2. If it is < 5 → the number before it does not change
0.2113 ≈ 0.21

3. If it is = 5 → look at the number before it
◦ If it is odd → increase by 1
0.21548 ≈ 0.22
◦ If it is even → don’t change
0.34598 ≈ 0.34

DEFINITION
• Given a set of n observations x1, x2, …, xn, the pth percentile P is the value of X such that p percent or less of the observations are less than P and (100 − p) percent or less of the observations are greater than P.
Percentiles and Quartiles
• The nth percentile is denoted by Pn. The 10th percentile is P10.
• The first quartile, denoted by Q1, is the 25th percentile: Q1 = P25.
• The second quartile is the 50th percentile: Q2 = P50, which is the median.
• The third quartile Q3 is the 75th percentile: Q3 = P75.
• When we wish to find the quartiles for a set of data, the following formulas are used:
$Q_1 = \dfrac{n+1}{4}$th ordered observation
$Q_2 = \dfrac{2(n+1)}{4}$th ordered observation
$Q_3 = \dfrac{3(n+1)}{4}$th ordered observation
Interquartile Range
 The range provides a simple measure of the
variability present in a set of data.
 A disadvantage of the range is the fact that it
is computed from only two values, the
largest and the smallest.
 A similar measure that reflects the variability
among the middle 50 percent of the
observations in a data set, is the interquartile
range or IQR
DEFINITION
 The interquartile range (IQR) is the
difference between the third and
first quartiles:

IQR = Q3 – Q1
IQR
• A large IQR indicates a large amount of variability among the middle 50 percent of the relevant observations,
• a small IQR indicates a small amount of variability.
• It is more informative to compare the interquartile range with the range of the entire data set.
• 100 (IQR/R) tells us what percent the IQR is of the overall range.
Box-and-Whisker Plots
1. Represent the variable of interest on the horizontal axis.
2. Draw a box above the horizontal axis where the left end of the
box aligns with Q1 and the right end aligns with Q3
3. Divide the box into two parts by a vertical line that aligns with
the median Q2
4. Draw a horizontal line called a whisker from the left end of the
box to a point that aligns with the smallest measurement in the
data set.
5. Draw another horizontal line, or whisker, from the right end of
the box to a point that aligns with the largest measurement in
the data set.
Examination of a box-and-whisker plot for a set of data reveals
information regarding the amount of spread, location of
concentration, and symmetry of the data.
 Evans et al. examined the effect of velocity on ground
reaction forces (GRF) in dogs with lameness from a
torn cranial cruciate ligament. The dogs were walked
and trotted over a force platform, and the GRF was
recorded during a certain phase of their
performance.
 The following table contains 20 measurements of
force where each value shown is the mean of five
force measurements per dog when trotting.
• Smallest measurement = 14.6
• Largest measurement = 44
• $Q_1 = \dfrac{20+1}{4} = 5.25$th measurement,
which is 27.2 + (0.25)(27.4 − 27.2) = 27.25
• The median $Q_2 = \dfrac{2(20+1)}{4} = 10.5$th measurement,
or 30.7 + (0.5)(31.5 − 30.7) = 31.1
• $Q_3 = \dfrac{3(20+1)}{4} = 15.75$th measurement,
which is 33.3 + (0.75)(33.6 − 33.3) = 33.525
• IQR = 33.525 – 27.25 = 6.275
• Range = 44 − 14.6 = 29.4
• 100(IQR/R) = 100(6.275/29.4) ≈ 21%, so the IQR is about 21% of the range
• Many statistical software packages have the capability of constructing box-and-whisker plots.
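
The quartile positions used above can be sketched in code. Because the table of 20 force measurements is not reproduced here, the sketch uses the sample of 10 ages from the earlier examples instead; the function name `quartile` is hypothetical. The position is q(n+1)/4, and a fractional position is resolved by interpolating between the two neighbouring ordered observations, as in the worked calculation.

```python
def quartile(ordered, q):
    """Value at the q*(n+1)/4-th ordered observation (q = 1, 2 or 3)."""
    n = len(ordered)
    position = q * (n + 1) / 4          # 1-based position, may be fractional
    lower = int(position)               # whole part of the position
    fraction = position - lower
    if lower < 1:                       # position before the first value
        return float(ordered[0])
    if lower >= n:                      # position at or beyond the last value
        return float(ordered[-1])
    low, high = ordered[lower - 1], ordered[lower]
    return low + fraction * (high - low)


ages = sorted([42, 28, 28, 61, 31, 23, 50, 34, 32, 37])
q1, q2, q3 = (quartile(ages, q) for q in (1, 2, 3))
iqr = q3 - q1
data_range = ages[-1] - ages[0]
iqr_percent_of_range = 100 * iqr / data_range

print(q1, q2, q3, iqr, round(iqr_percent_of_range, 1))
```

Note that `q2` agrees with the median found earlier for the same sample.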
Box and Whiskers
Outliers
 The asterisks alert us to the fact that the data set
contains one unusually large and one unusually small
value, called outliers.
DEFINITION
 An outlier is an observation whose value, x,
either exceeds the value of the third quartile
by a magnitude greater than 1.5(IQR) or is
less than the value of the first quartile by a
magnitude greater than 1.5(IQR).
 That is, an observation of x > Q3 + 1.5(IQR)
or an observation of x < Q1 - 1.5(IQR) is
called an outlier.
Outlier
 We may use the previously computed values of
Q1, Q3, and IQR to determine how large or
how small a value would have to be in order to
be considered an outlier. The calculations are as
follows:
X < 27.25 – 1.5(6.275)= 17.8375 and
X > 33.525 + 1.5(6.275) = 42.9375
 An observed value smaller than 17.8375 or
larger than 42.9375 would be considered an
outlier.
The Boxplot: Checking for outliers
Calculate the Inner Fences, given by
ILF = Q1 - 1.5*IQR & IUF = Q3 + 1.5*IQR.
The Whiskers are the smallest & largest values within the Inner Fences.

If there are outliers, compute the Outer Fences, given by
OLF = Q1 - 3*IQR & OUF = Q3 + 3*IQR.

A mild outlier is defined as one that lies between the inner fences and outer fences; whereas an extreme outlier is one that lies outside the outer fences.
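
The fence calculations can be sketched as follows, using the values Q1 = 27.25 and Q3 = 33.525 computed earlier. The function names `fences` and `classify` are assumptions introduced for this illustration.

```python
def fences(q1, q3):
    """Return the inner and outer fences as (lower, upper) pairs."""
    iqr = q3 - q1
    return {
        "inner": (q1 - 1.5 * iqr, q3 + 1.5 * iqr),
        "outer": (q1 - 3 * iqr, q3 + 3 * iqr),
    }


def classify(x, q1, q3):
    """Label x as an extreme outlier, a mild outlier, or neither."""
    f = fences(q1, q3)
    if x < f["outer"][0] or x > f["outer"][1]:
        return "extreme outlier"
    if x < f["inner"][0] or x > f["inner"][1]:
        return "mild outlier"
    return "not an outlier"


q1, q3 = 27.25, 33.525
limits = fences(q1, q3)

# The smallest (14.6) and largest (44) measurements from the example.
smallest_label = classify(14.6, q1, q3)
largest_label = classify(44, q1, q3)
print(limits, smallest_label, largest_label)
```

Under these fences both the smallest and the largest measurement fall between the inner and outer fences, which is consistent with the two outliers flagged in the box-and-whisker plot.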
