Introduction to Descriptive Statistics
Descriptive vs. Inferential
Population: All items
Sample: Items selected from population
Population
• All items of interest
• Census: a study that involves the whole population
• Parameter: summary measure of the whole population
Population parameter: Numerical value that describes a characteristic of a population
Calculation: Calculated using data from the entire population
Purpose: Used to describe the entire population
Example: The population mean income of all households in a country
Sample
• Portion of population
• Sample survey: a study that involves a subgroup (or sample) of the population
• Statistic: summary measure computed from sample data
Sample statistic: Numerical value that describes a characteristic of a sample
Calculation: Calculated using data from a sample of the population
Purpose: Used to estimate the population parameter
Example: The sample mean of income in a randomly selected group of households
Primary data
• First hand data
• Collected by the investigator
• Eg. Interview respondents, survey, experiment
• Advantage – more accurate and consistent
• Able to explain how the data are collected and their limitations
• Disadvantage – requires more time, manpower and high cost
Secondary data
• Taken from other investigator’s collection of figures
• Data collected from other parties
• Eg. Bank Negara, Statistics Department
• Advantage – easily accessible from the internet, journals, books, annual reports etc.; inexpensive and takes less time to collect
• Disadvantage – may lack accuracy because the method of data collection is not explained, and may be biased because the original purpose of data collection is not known
TYPES OF VARIABLES
Qualitative
• Nominal: categorical (names)
• Ordinal: nominal, plus can be ranked (order)
Quantitative
• Interval: ordinal, plus intervals are consistent
• Ratio: interval, plus ratios are consistent, true zero
i. NOMINAL DATA
• Represent observations that can be categorized, do not have a
meaningful numeric value
• Examples: Gender, Religion, Nationality, Favorite colour, Number
on a football jersey
Properties:
1. Observations of a qualitative variable can only be classified and
counted.
2. There is no particular order to the labels.
Note:
• The values cannot be compared to see if one is larger than
the other
• Cannot calculate the MEAN
ii. ORDINAL DATA
• Represent observations that can be categorized and rank ordered
• The values can be compared to see if one is larger or smaller than the other
• Examples:
o Consumer satisfaction ratings,
o Military rank - Private, Lieutenant, Captain, General
o Class ranking - Grade (A, B, C, D, E, F)
Properties:
1. Data classifications are represented by sets of labels or names
(high, medium, low) that have relative values.
2. Because of the relative values, the data classified can be ranked
or ordered.
Note:
• cannot assume the differences between adjacent scale
values are equal
iii. INTERVAL DATA
• Represent observations that can be categorized, rank ordered, and have
a unit of measure
• A unit of measure implies that the difference between any two successive
values is identical
• Examples: Fahrenheit temperature scale
Properties:
1. Interval data are ordered, can be continuous or discrete
2. The degree of difference between items is meaningful (their
intervals are equal)
Note:
o Can be added or subtracted (cannot be multiplied or divided)
o Interval data can be negative
o No true zero point (A value of zero on an interval scale does not
mean the absence of the variable)
iv. RATIO DATA
• Highest and most informative scale
• Observations that can be categorized, rank ordered, have a unit of
measure and have a true zero (an absolute zero point)
• The true zero implies that a value of zero represents the complete
absence of the variable
• Examples:
- amount of money – zero money indicates the absence of money
- time
Properties:
1. Data classifications are ordered according to the amount of the
characteristics they possess.
2. Equal differences in the characteristic are represented by equal
differences in the numbers assigned to the classifications.
3. The zero point is the absence of the characteristic and the ratio
between two numbers is meaningful.
◦ Note:
• Can be multiplied or divided
Levels of measurement, from highest scale (strongest form of measurement) to lowest:
Ratio
Interval
Ordinal
Nominal
iv. State the variable (s) and measurement scale from this study.
v. What is the most suitable data collection method?
Give ONE (1) advantage and ONE (1) disadvantage of this method
REVIEW EXERCISE
▪ eg. Census
▪ Disadvantage
It can be time-consuming when you’re dealing with a large population
size.
Resource use (e.g. cost) can be higher to develop these types of
samples.
Greater expertise and knowledge of the subject matter is needed to
determine what type of sampling approach is most appropriate.
PROBABILITY SAMPLING
i. SIMPLE RANDOM SAMPLING (SRS)
▪ Characteristic of SRS:
o Target population must be homogeneous
o Must have complete sampling frame
▪ Advantage:
o Easy to conduct
o Every element has equal chance to be selected
▪ Disadvantage
o Difficult to obtain sampling frame
o A particular sample can still be biased by chance
o Sometimes no assurance of representativeness
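A minimal Python sketch of simple random sampling, assuming the sampling frame is a complete list of element numbers. The frame of 100 elements, the sample size of 10 and the seed are illustrative assumptions, not values from the slides.

```python
import random

# Hypothetical sampling frame: 100 numbered population elements.
sampling_frame = list(range(1, 101))

rng = random.Random(42)  # fixed seed only so the sketch is repeatable

# random.sample draws without replacement, so every element of the
# frame has the same chance of being selected.
srs_sample = rng.sample(sampling_frame, k=10)
print(sorted(srs_sample))
```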
ii. SYSTEMATIC SAMPLING
▪ The first element is selected randomly from a list or from sequential files, and then
every kth element is selected.
▪ Advantage:
o Researchers can create, analyze, and conduct samples easily when using this
method because of its structure.
o Systematic sampling makes it easy to check whether every kth number or
name has been selected.
▪ Disadvantage
o Systematic sampling ignores all persons between every kth element chosen.
o In systematic sampling, the sampling error increases if the list is arranged in a
particular order.
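The "random start, then every kth element" rule can be sketched in Python as follows. The list of 100 units, the sample size and the helper name systematic_sample are hypothetical, chosen only for illustration.

```python
import random

def systematic_sample(frame, sample_size, rng):
    """Pick a random start among the first k units, then take every kth unit."""
    k = len(frame) // sample_size      # sampling interval
    start = rng.randrange(k)           # random index 0..k-1
    return frame[start::k][:sample_size]

frame = list(range(1, 101))            # hypothetical list of 100 units
sys_sample = systematic_sample(frame, 10, random.Random(1))
print(sys_sample)
```

With 100 units and a sample of 10, the interval is k = 10, so consecutive selected units are always 10 positions apart.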
iii. STRATIFIED SAMPLING
▪ When to use?
o Stratified sampling is beneficial in cases where the population has diverse
subgroups, and researchers want to be sure that the sample includes all of them.
o When the group means are different, and the goal of the study is to understand
these differences.
▪ Advantage:
o Stratified random sampling is more accurate than other sampling techniques
because it divides the population into smaller groups, or strata, based on
important characteristics.
o Studies can become less expensive and more practical when the researchers
divide a large population into smaller groups containing similar members.
▪ Disadvantage
o Researchers must then have sufficient information to assign subjects to the
correct strata.
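A rough Python sketch of proportionate stratified sampling: the population is split into strata and a random sample is drawn from each one. The faculty names, the 60/30/10 split and the 10% sampling fraction are assumptions made up for this illustration.

```python
import random
from collections import Counter, defaultdict

def stratified_sample(units, stratum_of, fraction, rng):
    """Draw the same fraction at random from every stratum."""
    strata = defaultdict(list)
    for unit in units:
        strata[stratum_of(unit)].append(unit)
    sample = []
    for members in strata.values():
        n_stratum = max(1, round(len(members) * fraction))
        sample.extend(rng.sample(members, n_stratum))
    return sample

# Hypothetical population of (id, faculty) pairs.
population = ([(i, "Business") for i in range(60)]
              + [(i, "Science") for i in range(60, 90)]
              + [(i, "Law") for i in range(90, 100)])

strat_sample = stratified_sample(population, lambda u: u[1], 0.1, random.Random(7))
per_stratum = Counter(faculty for _, faculty in strat_sample)
print(per_stratum)
```

Every stratum appears in the sample, which is the property the slide highlights.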
iv. CLUSTER SAMPLING
▪ This method is useful when it is difficult or costly to develop a complete list of the
population members or when the population elements are widely dispersed
geographically.
A group of researchers plans to survey all families in Alam Maju.
Suppose they divide the people who live in Alam Maju into 6 villages.
In order to save cost, they decide to survey only 2 villages.
Using simple random sampling or systematic random sampling, they select 2 of the 6
villages and then sample every element in those 2 villages.
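The Alam Maju example can be sketched in Python. The village and family labels (and the five families per village) are invented for illustration; only the "select 2 of 6 clusters, then take every element" logic comes from the example.

```python
import random

# Hypothetical frame: 6 villages (clusters), each listing its families.
villages = {f"Village {v}": [f"V{v}-family{j}" for j in range(1, 6)]
            for v in range(1, 7)}

rng = random.Random(3)
chosen = rng.sample(sorted(villages), k=2)   # randomly select 2 of the 6 clusters

# Every family in a selected village enters the sample.
cluster_sample = [family for v in chosen for family in villages[v]]
print(chosen, len(cluster_sample))
```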
▪ Advantage:
o Cluster sampling is relatively easy to implement and cost effective. Eg: it is more
economical to observe clusters of units in a population than randomly selected
units scattered throughout the state.
o Cluster sampling is particularly useful when dealing with large and widely
dispersed populations.
▪ Disadvantage
o The participants within each cluster may not be representative of the entire
population. Therefore, it might not be possible to apply its findings to another
area.
PROBABILITY SAMPLING
SUMMARY
EXERCISE
Name the following sampling technique.
i. The population is divided into groups. Samples are collected randomly from
each group.
ii. A sample is drawn in such a way that each element of the population has
the same chance of being selected.
iii. One member is randomly selected from the first k units. Then every kth
member starting with the first selected number is included in the sample.
iv. The population is divided into groups. All elements from the randomly
selected groups are taken as the sample.
EXERCISE
For each of the following statements, identify the sampling technique used.
i. To check the accuracy of a machine that is used for filling detergent containers,
every 20th bottle is selected and weighed.
ii. In a large school district, a researcher numbers all the full-time teachers and
then randomly selects 30 teachers to be interviewed.
iv. Out of 10 hospitals in a city, a researcher selects 2 hospitals and collects records
for a 24-hour period on the types of emergencies that were treated there.
NON-PROBABILITY SAMPLING
Non-probability Sampling
Convenient Sampling
Judgmental Sampling
Snowball Sampling
Quota Sampling
▪ The process of selecting a sample from a population without using statistical
probability, i.e. the chance of being in the sample is unknown.
▪ Advantage
Obtaining the sample can be easier and less costly. Little research is required
prior to surveying as the researcher simply seeks out those easily within reach.
▪ Disadvantage
Difficult to make valid inferences about the entire population because the
sample selected is not representative.
Sampling bias. For example, a researcher may only select people they feel
comfortable with.
i. CONVENIENT SAMPLING
Example: Asking people who live in your area to take a
survey for your project.
ii. JUDGMENTAL SAMPLING (PURPOSIVE SAMPLING)
iii. SNOWBALL SAMPLING
Example: Let’s say you want to collect responses from
patients who suffer from a rare type of cancer. In this case, other
sampling techniques might prove inadequate for gathering relevant
subjects—you cannot just walk into the hospital and request patients’
contact information or medical records. What you can do, however,
is put out a call to speak with one or two patients with the condition,
and then ask them to refer you to other potential subjects who might
be willing to participate in your study.
iv. QUOTA SAMPLING
Let us assume that we need to know about the career goals of university students.
More particularly, the differences in career goals among freshers, juniors and seniors
are to be examined. Suppose the university concerned contains 10,000 students, which
can be taken as our population.
Now, we have to divide our population of 10,000 students into categories such as
freshers, juniors and seniors. Suppose we find that there are 4,500 freshers (45%), 3,000
junior students (30%) and 2,500 senior students (25%).
Our sample must have these proportions 45%-30%-25%. It means that if we sample 1000
students, then we must consider 450 freshers, 300 juniors and 250 seniors.
Lastly, we may start collecting samples from these students based on our proportion by
using convenient or judgmental sampling.
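The quota arithmetic in this example can be checked with a short Python sketch. The dictionary and variable names are illustrative; the group sizes and sample size are those given above.

```python
strata_sizes = {"freshers": 4500, "juniors": 3000, "seniors": 2500}
sample_size = 1000

population_size = sum(strata_sizes.values())   # 10,000 students

# Each quota keeps the group's share of the population.
# Integer division is exact here; other figures may need rounding rules.
quotas = {group: sample_size * size // population_size
          for group, size in strata_sizes.items()}
print(quotas)  # {'freshers': 450, 'juniors': 300, 'seniors': 250}
```

Filling each quota is then done by convenient or judgmental selection, which is what makes the method non-probability sampling.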
Difference between non-probability sampling and probability sampling:
Non-probability sampling | Probability sampling
Not everyone has an equal chance to participate. | Everyone in the population has an equal chance of getting selected.
The researcher does not consider sampling bias. | Used when sampling bias has to be reduced.
Useful when the population has similar traits. | Useful when the population is diverse.
The sample does not accurately represent the population. | Used to create an accurate sample.
Finding respondents is easy. | Finding the right respondents is not easy.
QUESTIONNAIRE DESIGN
Over the years, a lot of thought has been put into the science of the design of
survey questions. Key design principles:
▪ Of course your data collection strategy depends on other factors as well, such as
the amount of time that you have to collect it, money available in your budget
and the complexity or nature of the questions.
Interviews:
▪ Face-to-face
▪ Telephone
Questionnaire:
▪ Direct (multiple-choice, yes/no)
▪ Indirect (open-ended)
Direct observation
Others:
▪ Internet e-mail and online surveys, video record
The following table sets out the main components of each method of
collection and their advantages and disadvantages.
EXERCISE
The Public Service Department (JPA) wants to carry out a survey on
students studying overseas under its sponsorship. The objective of the
study is to collect information on the problems faced by them.
a) State the most suitable data collection method for this study.
b) Explain why the method you chose is better than the others.
WEEK 2
Bluman Chapter 2
LEARNING OBJECTIVES
QUALITATIVE DATA
QUANTITATIVE DATA
Copyright © 2015 The McGraw-Hill Companies, Inc. Permission required for reproduction or display.
PRESENTING QUALITATIVE DATA
▪ After data is collected, it will be processed, organized and presented.
▪ In order to enhance the presentation, some charts, tables and graphs can
be used.
1. FREQUENCY DISTRIBUTION
▪ Data collected in original form is called raw data.
▪ A frequency distribution is the organization of raw data in table form, using classes and frequencies.
Example 1
Twenty-five army inductees were given a blood test to determine their blood type. The data set is:
A B B AB O
O O B AB B
B B O A O
A O O O AB
AB A O B A
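As a sketch of how the frequency distribution for Example 1 could be tallied, the following Python snippet counts each blood type and its percentage. The variable names are illustrative; the data are the 25 values listed above.

```python
from collections import Counter

blood_types = ("A B B AB O  O O B AB B  B B O A O  "
               "A O O O AB  AB A O B A").split()

counts = Counter(blood_types)          # class -> frequency
n = len(blood_types)                   # total number of inductees

print("Class  Frequency  Percent")
for cls in ("A", "B", "O", "AB"):
    print(f"{cls:<6} {counts[cls]:>9}  {counts[cls] / n:>6.0%}")
```

The tally gives frequencies of 5 (A), 7 (B), 9 (O) and 4 (AB), which sum to 25.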
2. GRAPHICAL METHOD: PIE CHART
▪ Pie chart can be used to represent categorical data.
▪ It is a circle that is divided into sectors.
▪ The sectors show the percentage of frequencies of each category of the distribution.
Note: If possible, construct the pie chart so that %s are either in ascending or descending order
(helps in the interpretation of the data).
2. GRAPHICAL METHOD: BAR CHART
▪ A graph of bars whose heights represent the frequencies of respective categories.
Example 2
NUMBER OF STUDENTS
A 450 600
B 1200 1500
C 800 1100
D 300 400
E 650 800
3. CROSS TABULATION/ CONTINGENCY TABLE
▪ A cross tabulation (often abbreviated as cross tab) or cross-classification table is often used to examine the
categorical response in terms of two qualitative variables simultaneously.
▪ Some data can be grouped according to two or more criteria of classification or variables.
Urban 15 12 8 35
Suburban 8 15 9 32
Rural 6 8 7 21
Total 29 35 24 88
Example 3:
A group of researchers surveyed 530 staff working with Company Y. Out of 145 professional staff, 40
are women whereas 140 non-professional staff are men. Present this data in the form of 2 x 2 table.
Answer:
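One way to fill in the 2 x 2 table for Example 3 is to derive the missing cells from the totals. The Python sketch below does that arithmetic; the variable names are illustrative, and the four input figures are the ones stated in the example.

```python
total_staff = 530
professional = 145
professional_women = 40
non_professional_men = 140

professional_men = professional - professional_women              # 145 - 40
non_professional = total_staff - professional                     # 530 - 145
non_professional_women = non_professional - non_professional_men  # 385 - 140

rows = [
    ("Professional", professional_men, professional_women),
    ("Non-professional", non_professional_men, non_professional_women),
]
print(f"{'':<17}{'Men':>6}{'Women':>7}{'Total':>7}")
for label, men, women in rows:
    print(f"{label:<17}{men:>6}{women:>7}{men + women:>7}")
total_men = professional_men + non_professional_men
total_women = professional_women + non_professional_women
print(f"{'Total':<17}{total_men:>6}{total_women:>7}{total_men + total_women:>7}")
```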
PRESENTING QUANTITATIVE DATA
▪ Quantitative data is information about quantities; that is, information that can be
measured and written down with numbers.
a. Focuses on numbers
1. FREQUENCY DISTRIBUTION: UNGROUPED DATA
▪ Frequency is the number of times a value occurs. By counting frequencies, we can make a
frequency distribution table.
▪ The frequency distribution is a table that contains a list of data values and its frequency.
▪ Ungrouped data is defined as the data given as individual points (i.e. values or numbers) such as
15, 63, 34, 20, 25, and so on.
Example:
These are the numbers of newspapers sold at a local shop over the last 10 days:
22, 20, 18, 23, 20, 22, 20, 18, 20
Class midpoint, Xm: The middle value of a class interval, found by averaging the upper limit and
lower limit, or the upper boundary and lower boundary.
Example | Class limits | Class boundaries | Midpoint of first class
Example 1 | 30 – 50, 50 – 70 | 30 – 50, 50 – 70 | 40
Example 2 | 30 – 49, 50 – 69 | 29.5 – 49.5, 49.5 – 69.5 | 39.5
Example 3 | 30 - < 50, 50 - < 70 | 30 - 50, 50 - 70 | 40
1. FREQUENCY DISTRIBUTION: GROUPED DATA
Constructing a Grouped Frequency Distribution:
1. Find the range (maximum value minus minimum value).
2. Find the class width by dividing the range by the number of classes and rounding up. The number
of classes is usually between 5 and 20.
3. Pick a suitable starting point less than or equal to the minimum value.
The subsequent lower class limits are found by adding the width to the previous lower class
limits.
4. To find the upper limit of the first class, subtract one from the lower limit of the second class.
Then continue to add the class width to this upper limit to find the rest of the upper limits.
5. Find the boundaries by subtracting 0.5 units from the lower limits and adding 0.5 units to the
upper limits (if necessary).
6. Tally the data.
7. Find the cumulative frequencies. Depending on what you're trying to accomplish, it may not
be necessary to find the cumulative frequencies.
8. If necessary, find the relative frequencies and/or relative cumulative frequencies.
Example 4: Constructing a Grouped Frequency Distribution
The following data represent the record high temperatures for each of the 50 states. Construct a
grouped frequency distribution for the data using 7 classes.
112 100 127 120 134 118 105 110 109 112
110 118 117 116 118 122 114 114 105 109
107 112 114 115 118 117 118 122 106 110
116 108 110 121 113 120 119 111 104 111
120 113 120 117 105 110 118 112 114 114
Step 1: Compute range = max – min = 134 – 100 = 34
Step 2: Find class width (number of classes = 7) = 34/7 ≈ 4.86, rounded up to 5
Step 3: Starting point of 1st class = 100. The subsequent lower-class limits are found
by adding the width to the previous lower-class limits.
Step 4: Upper limit of 1st class = 105 – 1 = 104. The subsequent upper-class limits are found
by adding the width to the previous upper-class limits.
Step 5: The class boundary is midway between an upper-class limit and a
subsequent lower-class limit.
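The steps for Example 4 can be traced with a short Python sketch. It assumes whole-number data (so boundaries sit 0.5 below and above the limits) and a starting point equal to the minimum value; the variable names are illustrative.

```python
import math

temps = [112, 100, 127, 120, 134, 118, 105, 110, 109, 112,
         110, 118, 117, 116, 118, 122, 114, 114, 105, 109,
         107, 112, 114, 115, 118, 117, 118, 122, 106, 110,
         116, 108, 110, 121, 113, 120, 119, 111, 104, 111,
         120, 113, 120, 117, 105, 110, 118, 112, 114, 114]

num_classes = 7
data_range = max(temps) - min(temps)             # Step 1: 134 - 100
width = math.ceil(data_range / num_classes)      # Step 2: 34/7 rounded up
start = min(temps)                               # Step 3: starting point

table = []
for i in range(num_classes):
    lower = start + i * width                    # lower class limit
    upper = lower + width - 1                    # Step 4: upper class limit
    freq = sum(lower <= t <= upper for t in temps)
    table.append((lower, upper, lower - 0.5, upper + 0.5, freq))  # Step 5

for lower, upper, lb, ub, freq in table:
    print(f"{lower} - {upper}   {lb} - {ub}   {freq}")
```

The resulting frequencies are 2, 8, 18, 13, 7, 1 and 1, which add up to the 50 states.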
Class Limits Class Boundaries
1st class
▪ Steps:
Step 1 Draw and label the x and y axes.
Step 2 Choose a suitable scale for the frequencies or cumulative frequencies, and
label it on the y axis. (Do not label the y axis with numbers in the cumulative
frequency)
Step 3 Represent the class boundaries for the histogram or ogive, or the midpoint for
the frequency polygon, on the x axis.
Step 4 Plot the points and then draw the bars or lines.
Class Limits   Class Boundaries   Frequency
100 - 104      99.5 - 104.5
105 - 109      104.5 - 109.5
110 - 114      109.5 - 114.5
115 - 119      114.5 - 119.5
120 - 124      119.5 - 124.5
125 - 129      124.5 - 129.5
130 - 134      129.5 - 134.5
2. GRAPHICAL METHOD: POLYGON
▪ The frequency polygon is a graph that displays the data by using lines that
connect points plotted for the frequencies at the class midpoints. The
frequencies are represented by the heights of the points.
▪ The frequency polygon plots frequency on y-axis and class midpoint along
the x-axis.
Class Limits   Class Midpoint (Xm)   Frequency
100 - 104
105 - 109
110 - 114
115 - 119
120 - 124
125 - 129
130 - 134
2. GRAPHICAL METHOD: OGIVE
▪ An ogive, sometimes called a cumulative frequency polygon, is a type of
frequency polygon that shows cumulative frequencies.
▪ An ogive graph plots cumulative frequency on the y-axis and upper class
boundaries along the x-axis.
5.5 - 10.5 1
10.5 - 15.5 2
15.5 - 20.5 3
20.5 - 25.5 5
25.5 - 30.5 4
30.5 - 35.5 3
35.5 - 40.5 2
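For an ogive, each upper class boundary is paired with the running total of the frequencies. A minimal Python sketch for the table above (variable names are illustrative):

```python
from itertools import accumulate

upper_boundaries = [10.5, 15.5, 20.5, 25.5, 30.5, 35.5, 40.5]
frequencies = [1, 2, 3, 5, 4, 3, 2]

# Running totals of the class frequencies.
cumulative = list(accumulate(frequencies))

# Points to plot: (upper class boundary, cumulative frequency).
ogive_points = list(zip(upper_boundaries, cumulative))
for boundary, cum_freq in ogive_points:
    print(f"less than {boundary}: {cum_freq}")
```

The last cumulative frequency equals the total number of observations (20).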
ANSWER: 1. HISTOGRAM
Class Boundaries   Frequency   Relative Frequency
5.5 - 10.5 1
10.5 - 15.5 2
15.5 - 20.5 3
20.5 - 25.5 5
25.5 - 30.5 4
30.5 - 35.5 3
35.5 - 40.5 2
ANSWER: 2. POLYGON
5.5 - 10.5
10.5 - 15.5
15.5 - 20.5
20.5 - 25.5
25.5 - 30.5
30.5 - 35.5
35.5 - 40.5
ANSWER: 3. OGIVE
2. GRAPHICAL METHOD: STEM AND LEAF
▪ A stem-and-leaf plot is a data plot that uses part of a data value as the stem
and part of the data value as the leaf to form groups or classes.
▪ It has the advantage over grouped frequency distribution of retaining the
actual data while showing them in graphic form.
Example: a stem of 3 with a leaf of 9 represents the number 39.
2. GRAPHICAL METHOD: STEM AND LEAF
Example 6
At an outpatient testing center, the number of cardiograms performed each
day for 20 days is shown. Construct a stem and leaf plot for the data.
25 31 20 32 13
14 43 02 57 23
36 32 33 32 44
32 52 44 51 45
Answer
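A possible way to build the plot for Example 6 in Python, assuming the tens digit is used as the stem and the units digit as the leaf (the value 02 is entered as 2):

```python
from collections import defaultdict

cardiograms = [25, 31, 20, 32, 13,
               14, 43, 2, 57, 23,
               36, 32, 33, 32, 44,
               32, 52, 44, 51, 45]

stems = defaultdict(list)
for value in sorted(cardiograms):
    stems[value // 10].append(value % 10)   # tens digit = stem, units digit = leaf

for stem in range(min(stems), max(stems) + 1):
    leaves = " ".join(str(leaf) for leaf in stems[stem])
    print(f"{stem} | {leaves}")
```

Sorting first puts the leaves of each stem in ascending order, and every original value can still be read back from the plot.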
Example 7
The following data shows the ages for the CEOs of the 30 top-ranked small
companies in a country.
59 38 47 53 60 69
44 50 56 63 40 48
53 61 41 44 49 55
62 43 55 61 61 53
48 48 55 62 43 48
Construct:
(a) a stem-and-leaf diagram with one line per stem.
(b) a stem-and-leaf diagram with two lines per stem.
WEEK 3
LEARNING OBJECTIVES
1. MEASURES OF CENTRAL TENDENCY
The extent to which all the data values group around a typical or central value
2. MEASURES OF VARIATION
The amount of dispersion, or scattering, of values
3. MEASURES OF SKEWNESS
The pattern of the distribution of values from the lowest value to the highest value
MEASURES OF CENTRAL TENDENCY : THE MEAN
▪ The mean is the sum of the values, divided by the total number of values.
▪ Useful in comparing two or more populations.
UNGROUPED DATA
\[ \bar{X} = \frac{\sum_{i=1}^{n} X_i}{n} = \frac{X_1 + X_2 + \cdots + X_n}{n} \]
Where:
X_i = observed values
n = sample size
Figure: two dot plots on a 0 to 10 axis.
(1 + 2 + 3 + 4 + 5)/5 = 15/5 = 3
(1 + 2 + 3 + 4 + 10)/5 = 20/5 = 4
GROUPED DATA
\[ \bar{X} = \frac{\sum f \cdot X_m}{n} \]
Where:
f = frequency
X_m = midpoint of each class
n = sample size (total frequency)
THE MEAN: UNGROUPED DATA
Example 3-1 (Days off per year)
The data represent the number of days off per year for a sample of
individuals selected from nine different countries. Find the mean.
Answer
Interpretation:
THE MEAN: GROUPED DATA
Example 3-3 (Miles Run per Week)
Below is a frequency distribution of miles run per week. Find the mean.
Formula: \[ \bar{X} = \frac{\sum f \cdot X_m}{n} \]
Class Boundaries   Frequency
5.5 - 10.5         1
10.5 - 15.5        2
15.5 - 20.5        3
20.5 - 25.5        5
25.5 - 30.5        4
30.5 - 35.5        3
35.5 - 40.5        2
Answer:
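The grouped-mean formula can be checked for this table with a short Python sketch; each midpoint is taken as the average of the class boundaries, and the variable names are illustrative.

```python
boundaries = [(5.5, 10.5), (10.5, 15.5), (15.5, 20.5), (20.5, 25.5),
              (25.5, 30.5), (30.5, 35.5), (35.5, 40.5)]
frequencies = [1, 2, 3, 5, 4, 3, 2]

midpoints = [(low + high) / 2 for low, high in boundaries]   # X_m for each class
n = sum(frequencies)                                         # total frequency
sum_f_xm = sum(f * xm for f, xm in zip(frequencies, midpoints))

grouped_mean = sum_f_xm / n
print(midpoints, sum_f_xm, grouped_mean)
```

Here the sum of f·X_m is 490 and n is 20, so the mean is 24.5 miles per week.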
MEASURES OF CENTRAL TENDENCY : THE MEDIAN
▪ In an ordered array, the median is the “middle” number (50% above, 50% below)
▪ Not affected by extreme values
Figure: two dot plots on a 0 to 10 axis; Median = 3 in both.
UNGROUPED DATA
Step 1 Arrange the data values in ascending order.
Step 2 Median position = \( \frac{n+1}{2} \)
Step 3
a. If the number of values is odd, the median is the middle number
b. If the number of values is even, the median is the average of the two middle numbers
*Note that \( \frac{n+1}{2} \) is the position of the median in the ranked data
GROUPED DATA
1) Find the median class: compute n/2 and locate the first class whose cumulative frequency
is greater than or equal to n/2.
2) \[ \text{Median} = \tilde{x} = L_m + \left( \frac{\frac{\sum f}{2} - \sum f_{m-1}}{f_m} \right) \cdot C \]
L_m = lower class boundary of the median class
\sum f_{m-1} = cumulative frequency of all class intervals before the median class
f_m = frequency of the median class
C = width of the median class boundaries
THE MEDIAN: UNGROUPED DATA
Example 3-4 (Hotel Rooms)
The number of rooms in the seven hotels in downtown Pittsburgh is 713, 300,
618, 595, 311, 401, and 292. Find the median.
Solution:
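The three steps for ungrouped data can be sketched in Python for the hotel-room figures; the variable names are illustrative.

```python
rooms = [713, 300, 618, 595, 311, 401, 292]

ordered = sorted(rooms)                 # Step 1: ascending order
n = len(ordered)
position = (n + 1) / 2                  # Step 2: 1-based position of the median

if n % 2 == 1:                          # Step 3a: odd count, take the middle value
    median = ordered[int(position) - 1]
else:                                   # Step 3b: even count, average the two middle values
    median = (ordered[n // 2 - 1] + ordered[n // 2]) / 2

print(ordered, position, median)
```

With seven values the median position is 4, and the fourth ordered value is 401 rooms.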
THE MEDIAN: UNGROUPED DATA
Example 3-6 (Tornadoes)
The number of tornadoes that have occurred in the United States over an 8-
year period follows. Find the median.
Solution:
THE MEDIAN: GROUPED DATA
Example 3-3 (Miles Run)
Below is a frequency distribution of miles run per week. Find the median.
Solution:
Class boundaries   Frequency
5.5 – 10.5         1
10.5 – 15.5        2
15.5 – 20.5        3
20.5 – 25.5        5
25.5 – 30.5        4
30.5 – 35.5        3
35.5 – 40.5        2
1) Find the median class:
2) Use formula:
\[ \text{Median} = \tilde{x} = L_m + \left( \frac{\frac{\sum f}{2} - \sum f_{m-1}}{f_m} \right) \cdot C \]
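A Python sketch of the grouped-median calculation for this table. It takes the median class to be the first class whose cumulative frequency reaches n/2; the variable names are illustrative.

```python
# (lower boundary, upper boundary, frequency) for each class
classes = [(5.5, 10.5, 1), (10.5, 15.5, 2), (15.5, 20.5, 3), (20.5, 25.5, 5),
           (25.5, 30.5, 4), (30.5, 35.5, 3), (35.5, 40.5, 2)]

n = sum(f for _, _, f in classes)
half = n / 2

cumulative_before = 0                      # cumulative frequency before the current class
for lower, upper, f in classes:
    if cumulative_before + f >= half:      # first class that reaches n/2
        median_class = (lower, upper)
        grouped_median = lower + (half - cumulative_before) / f * (upper - lower)
        break
    cumulative_before += f

print(median_class, grouped_median)
```

Here n/2 = 10, the median class is 20.5 – 25.5 (cumulative frequency 6 before it), and the median is 20.5 + (10 − 6)/5 × 5 = 24.5.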
Interpretation:
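MEASURES OF CENTRAL TENDENCY : THE MODE
▪ Value that occurs most often
▪ Not affected by extreme values
▪ Used for either numerical or categorical (nominal) data
▪ There may be no mode
▪ There may be several modes
UNGROUPED DATA
Step 1 Arrange the data values in ascending order.
Step 2 See which number appears the most often.
GROUPED DATA
1) Find the modal class
2) \[ \text{Mode} = \hat{x} = L_{mo} + \left( \frac{\Delta_1}{\Delta_1 + \Delta_2} \right) \cdot C \]
L_{mo} = lower class boundary of the modal class
\Delta_1 = frequency of the modal class minus the frequency of the class before it
\Delta_2 = frequency of the modal class minus the frequency of the class after it
C = width of the modal class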
Solution
You may find it easier to sort first
10 10 10 11.3 12.4 14.0 18.0 34.5
Find the mode for the number of coal employees per county for 10
selected counties in southwestern Pennsylvania.
Solution
Conclusion:
THE MODE: UNGROUPED DATA
Example 3-11 (Licensed Nuclear Reactors)
The data show the number of licensed nuclear reactors in the US for a
recent 15-year period. Find the mode
104 104 104 104 104 107 109 109 109 110 109 111 112 111 109
Solution
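A quick Python check of the mode for this data set, using the standard library's statistics.multimode, which returns every value tied for the highest frequency:

```python
from collections import Counter
from statistics import multimode

reactors = [104, 104, 104, 104, 104, 107, 109, 109, 109, 110,
            109, 111, 112, 111, 109]

frequency = Counter(reactors)      # value -> number of occurrences
modes = multimode(reactors)        # all values sharing the highest count
print(frequency.most_common(3), modes)
```

Both 104 and 109 occur five times, so the data set is bimodal.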
THE MODE: GROUPED DATA
Example 3-12 (Miles Run per Week)
Find the modal class for the frequency distribution of miles that 20
runners ran in one week.
Solution:
Class Boundaries   Frequency
5.5 - 10.5         1
10.5 - 15.5        2
15.5 - 20.5        3
20.5 - 25.5        5
25.5 - 30.5        4
30.5 - 35.5        3
35.5 - 40.5        2
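A Python sketch for the modal class and the grouped mode, using the 20-runner miles-run table from the earlier grouped examples. It assumes the usual definitions Δ1 = modal frequency minus the frequency of the class before it and Δ2 = modal frequency minus the frequency of the class after it; these definitions and the variable names are assumptions for illustration.

```python
# (lower boundary, upper boundary, frequency) for each class
classes = [(5.5, 10.5, 1), (10.5, 15.5, 2), (15.5, 20.5, 3), (20.5, 25.5, 5),
           (25.5, 30.5, 4), (30.5, 35.5, 3), (35.5, 40.5, 2)]

freqs = [f for _, _, f in classes]
i = freqs.index(max(freqs))                      # position of the modal class
lower, upper, f_mo = classes[i]

f_prev = freqs[i - 1] if i > 0 else 0            # class before (0 if none)
f_next = freqs[i + 1] if i < len(freqs) - 1 else 0
delta1 = f_mo - f_prev
delta2 = f_mo - f_next

grouped_mode = lower + delta1 / (delta1 + delta2) * (upper - lower)
print((lower, upper), round(grouped_mode, 2))
```

The modal class is 20.5 - 25.5 (frequency 5), and with Δ1 = 2 and Δ2 = 1 the mode is about 23.83.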
MEASURES OF CENTRAL UNGROUPED DATA
GROUPED DATA
2, 3, 6, 8, 4, 1
Solution
MEASURES OF CENTRAL TENDENCY :
WHICH MEASURE TO CHOOSE?
o The mean is generally used, unless extreme values (outliers) exist.
o The median is often used, since the median is not sensitive to extreme
values. For example, median home prices may be reported for a
region; it is less sensitive to outliers.
o In some situations it makes sense to report both the mean and the
median.
TUTORIAL
Based on the grouped data below, find the mean, median and
mode.
Same center,
different variation
MEASURES OF VARIATION : THE RANGE
▪ Simplest measure of variation
▪ Difference between the largest and the smallest value.
UNGROUPED DATA
Range = Xlargest – Xsmallest
Example:
16 24 22 25 26 27 28 23
Range: 28 – 16 = 12
▪ Disadvantage of using range:
➢ Based on two values only. All other values in a dataset are ignored. (Ignores
the way in which data are distributed)
Figure: two dot plots on a 7 to 12 axis; Range = 12 - 7 = 5 in both.
➢ Sensitive to outliers
1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,3,3,3,3,4,5 Range = 5 - 1 = 4
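THE VARIANCE
UNGROUPED DATA
Population variance, \[ \sigma^2 = \frac{1}{N} \left[ \sum X^2 - \frac{(\sum X)^2}{N} \right] \]
Sample variance, \[ S^2 = \frac{1}{n-1} \left[ \sum X^2 - \frac{(\sum X)^2}{n} \right] \]
GROUPED DATA
Population variance, \[ \sigma^2 = \frac{1}{N} \left[ \sum f \cdot X_m^2 - \frac{(\sum f \cdot X_m)^2}{N} \right] \]
*Sample variance, \[ S^2 = \frac{1}{n-1} \left[ \sum f \cdot X_m^2 - \frac{(\sum f \cdot X_m)^2}{n} \right] \]
THE STANDARD DEVIATION
▪ Notation
➢ Population standard deviation, σ
➢ Sample standard deviation, s
▪ The lower the standard deviation, the closer the values are to the mean and the less
variability there is.
▪ The higher the standard deviation, the farther the values are spread from the mean
and the more variability there is.
GROUPED DATA
Population standard deviation, \[ \sigma = \sqrt{ \frac{1}{N} \left[ \sum f \cdot X_m^2 - \frac{(\sum f \cdot X_m)^2}{N} \right] } \]
*Sample standard deviation, \[ S = \sqrt{ \frac{1}{n-1} \left[ \sum f \cdot X_m^2 - \frac{(\sum f \cdot X_m)^2}{n} \right] } \]
*X_m is the midpoint (Please check & compare the formula in Appendix)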
THE VARIANCE & STANDARD DEVIATION:
UNGROUPED DATA
Example
Sample Data (Xi) : 10 12 14 15 17 18 18 24
Solution:
Variance, \[ S^2 = \frac{1}{n-1} \left[ \sum X^2 - \frac{(\sum X)^2}{n} \right] \]
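The sample variance for this data set can be worked through in Python with the shortcut formula S² = [ΣX² − (ΣX)²/n]/(n − 1), cross-checked against the standard library. The variable names are illustrative.

```python
import math
import statistics

data = [10, 12, 14, 15, 17, 18, 18, 24]

n = len(data)
sum_x = sum(data)                          # ΣX
sum_x2 = sum(x * x for x in data)          # ΣX²

sample_variance = (sum_x2 - sum_x ** 2 / n) / (n - 1)
sample_sd = math.sqrt(sample_variance)
print(sum_x, sum_x2, round(sample_variance, 3), round(sample_sd, 3))

# statistics.variance also divides by n - 1, so the two results should agree.
assert math.isclose(sample_variance, statistics.variance(data))
```

Here ΣX = 128 and ΣX² = 2178, so S² = (2178 − 2048)/7 ≈ 18.571 and S ≈ 4.309.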
THE VARIANCE & STANDARD DEVIATION:
UNGROUPED DATA
Example 3-23 (European Auto Sales)
Find the variance and standard deviation for the amount of European auto
sales for a sample of 6 years. The data are in millions of dollars.
Figure: three dot plots on a common axis from 11 to 21, all with mean 15.5.
Data A: Mean = 15.5, S = 3.338
Data B: Mean = 15.5, S = 0.926 (smaller standard deviation)
Third data set: Mean = 15.5, S = 4.570
COMPARING STANDARD DEVIATION
Example 3-21 (Outdoor Paint)
The average for both brands is the same, but the range for Brand A is much
greater than the range for Brand B. Which brand would you buy?
Brand A   Brand B
10        35
60        45
50        30
30        35
40        40
20        25
Brand A: ΣX = 210, ΣX² = 9100, N = 6; μ = ΣX/N = 210/6 = 35; R = 60 − 10 = 50
Brand B: ΣX = 210, ΣX² = 7600, N = 6; μ = ΣX/N = 210/6 = 35; R = 45 − 25 = 20
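To compare the two brands beyond the range, the population standard deviation can be computed from ΣX and ΣX². The Python sketch below treats each brand's six values as a population (as the N = 6 and μ notation suggests); the helper name population_sd is hypothetical.

```python
import math

def population_sd(values):
    """Population standard deviation via the shortcut formula."""
    big_n = len(values)
    sum_x = sum(values)
    sum_x2 = sum(x * x for x in values)
    variance = (sum_x2 - sum_x ** 2 / big_n) / big_n
    return math.sqrt(variance)

brand_a = [10, 60, 50, 30, 40, 20]
brand_b = [35, 45, 30, 35, 40, 25]

sd_a = population_sd(brand_a)
sd_b = population_sd(brand_b)
print(round(sd_a, 1), round(sd_b, 1))
```

Brand A comes out at roughly 17.1 and Brand B at roughly 6.5, so Brand B is the more consistent of the two.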
THE VARIANCE & STANDARD DEVIATION:
GROUPED DATA
Example
Find the variance and the standard deviation for the frequency distribution of
miles that 20 runners ran in one week.
Distance (in
one week.. Number of 1 σ 𝑓. 𝑋𝑚 2
2-5 2
6-9 4
10 - 13 7
14 - 17 15
18 - 21 5
MEASURES OF VARIATION: SUMMARY CHARACTERISTICS
▪ The more the data are spread out, the greater the range, variance, and
standard deviation.
▪ The more the data are concentrated, the smaller the range, variance, and
standard deviation.
▪ If the values are all the same (no variation), all these measures will be zero.
▪ None of these measures are ever negative.
▪ Symbol:
Stock A:
Average price last year = $50
Standard deviation = $5
Stock B:
Average price last year = $100
Standard deviation = $5
THE COEFFICIENT OF VARIATION
Example 3-25 (Sales of Automobiles)
The mean of the number of sales of cars over a 3-month period is 87, and the
standard deviation is 5. The mean of the commissions is $5225, and the
standard deviation is $773. Compare the variations of the two.
Solution:
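A Python sketch for this comparison, assuming the usual definition of the coefficient of variation, CVar = (standard deviation / mean) × 100%. The function name is illustrative.

```python
def coefficient_of_variation(std_dev, mean):
    """Standard deviation expressed as a percentage of the mean."""
    return std_dev / mean * 100

cv_sales = coefficient_of_variation(5, 87)             # number of cars sold
cv_commissions = coefficient_of_variation(773, 5225)   # commissions in dollars
print(f"Sales: {cv_sales:.1f}%  Commissions: {cv_commissions:.1f}%")
```

The commissions have the larger coefficient of variation (about 14.8% against about 5.7%), so they are relatively more variable than the sales, even though the two are measured in different units.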
MEASURES OF SKEWNESS : THE SKEWNESS
▪ Skewness is a measurement of the shape of a distribution.
▪ This measurement is widely used whenever we want to identify whether
the data is normal or skewed.
▪ In general, the shape of a distribution can be normal, positively skewed/ skewed to
the right, or negatively skewed/ skewed to the left.
i. PEARSON COEFFICIENT OF SKEWNESS
\[ \text{Skewness} = \frac{\text{mean} - \text{mode}}{\text{standard deviation}} \]
or
\[ \text{Skewness} = \frac{3(\text{mean} - \text{median})}{\text{standard deviation}} \]
Skewness = 0: Symmetrical/ normal
Skewness = positive value: Skewed to the right
Skewness = negative value: Skewed to the left
Calculate the Skewness for the data and comment on the shape of
the distribution.
THE SKEWNESS
Class | Skewness = (mean − mode)/std dev | Skewness = 3(mean − median)/std dev
Melati
Lily
Mawar