SUMMARIZING DATA
Types of variables
• Based on effect: Independent, Dependent
• Based on measurement scale: Quantitative, Qualitative
• Based on type of numbers: Discrete, Continuous
Qualitative vs quantitative variables
• Qualitative variables (nominal or categorical): variables measured on a nominal or ordinal scale. Examples: sex, marital status, eye color, educational level.
• Quantitative variables (numerical): variables measured on an interval or ratio scale. Examples: height, weight, number of siblings.
Dependent vs Independent Variables
An independent variable, sometimes called an experimental or predictor variable, is a variable that is
manipulated in an experiment in order to observe its effect on a dependent variable, sometimes called an
outcome variable.
Discrete vs Continuous Variables
Both are quantitative variables.
• Discrete variables are the results of counting. Examples: number of car accidents, number of vaccinated children.
• Continuous variables are measured, and can be measured to arbitrary accuracy. Examples: weight, height.
What is data?
• Characteristic: observation
• Variable: such as age, sex, marital status
• Value: the values of the observations are called DATA
• Data: a collection of observations
Data may be primary or secondary
• Primary data: data is collected firsthand by a researcher.
• Secondary data: data is readily available and was collected by someone else (a person, an organization, etc.), for example journals and newspapers.
Sources of primary data
• Personal investigation: The researcher conducts the experiment or survey himself/herself and collects data from it.
• Through investigators: Trained (experienced) investigators are employed to collect the required data.
• Through questionnaire: The required information is obtained by sending a questionnaire to the selected individuals to fill in and return to the investigator.
• Through local sources: Organization, telephone or internet.
It is important to go through the primary data and locate any inconsistent observations
before it is given a statistical treatment.
Sources of secondary data
Data is readily available and was collected by someone else. Sources include:
• Government organizations
• Semi-government organizations
• Teaching and research organizations
• Research journals and newspapers
• Internet
Descriptive summarization of data
SUMMARIZING DATA
MEASURES OF CENTRAL TENDENCY
1- The mean
2- The median
3- The mode
MEASURES OF VARIATION (DISPERSION)
1- Range
2- Variance
3- Standard deviation
4- Quartile
5- Inter-quartile range
The idea is to compute a single value that can represent all the
elements of the set.
MEASURES OF CENTRAL TENDENCY
• In statistics, measures of central tendency are a set
of “middle” values representative of the data
points.
Central tendency
• Measures of central tendency
help us locate the center point
of a dataset where the
observations tend to gather
Measures of Central Tendency
This type of statistic is a descriptive statistic.
The three common methods of calculating central tendency are:
1. The MEAN
2. The MEDIAN
3. The MODE
1- THE MEAN (average)
The mean is considered the balancing point or fulcrum of a
distribution of observations.
It's calculated by dividing the sum of observations by the total
number of observations
The mean is the balance point of a distribution.
[Figure: the values 1 to 7 on a number line balanced at the mean, 4. The deviations from the mean (x − x̄) below the mean sum to −6 and those above it sum to +6.]
Mean = (1 + 2 + 3 + 4 + 5 + 6 + 7) / 7 = 4
Mean = sum of the values / number of values
Suppose that the observations in a sample are x1, x2, …, xn.
• The mean, denoted by x̄ (x bar), is $\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n}$
• xi are the individual values of a sample of size n.
Example
Samy has been working on programming and updating a Web site for his
company for the past 15 months. The following numbers represent the number
of hours Samy has worked on this Web site for each of the past 7 months:
24, 25, 33, 50, 53, 66, 78
What is the mean (average) number of hours that Samy worked on this Web site
each month?
STEP ONE: Add the numbers to determine the total number of hours he worked.
24 + 25 + 33 + 50 + 53 + 66 + 78 = 329
STEP TWO: Divide the total by the number of months: 329 / 7 = 47
Example
Ahmad operates a Web site service that employs 8 people. Find the
mean age of his workers if the ages of the employees are as follows:
55, 63, 34, 59, 29, 46, 51, 41
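The calculation for this example can be sketched in a few lines of Python. This is a minimal illustration only; the function name `mean` and the variable names are chosen here and are not part of the slides.

```python
def mean(values):
    """Arithmetic mean: sum of the observations divided by their count."""
    if not values:
        raise ValueError("the mean of an empty data set is undefined")
    return sum(values) / len(values)


# Ages of the 8 employees from the example above.
ages = [55, 63, 34, 59, 29, 46, 51, 41]
mean_age = mean(ages)
print(mean_age)  # 378 / 8 = 47.25
```

The same function reproduces the previous example: the seven monthly totals that sum to 329 give a mean of 47.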
Mean of grouped Data
• In Tim's office, there are 25 employees. Each employee travels to work every morning in his or her own car.
• The distribution of the driving times (in minutes) from home to work for the employees is shown in the table.
• Calculate the mean of the driving times.

Driving time   Frequency (f)
0-             3
10-            10
20-            6
30-            4
40-            2
50-59          0
Total          n = 25
If data is collected in grouped form, then the calculation of the mean is a little different.
1- Calculate the midpoint (m) for each class.
Mid-point of the first class = (0 + 10)/2 = 5
Mid-point of the second class = (10 + 20)/2 = 15
Mid-point of the third class = (20 + 30)/2 = 25
2- Multiply column (m) * (f)
3- Find the sum of (f * m)
4- Divide it by the total number in the data set.

Driving time   Frequency (f)   Midpoint (m)   f*m
0-             3               5              15
10-            10              15             150
20-            6               25             150
30-            4               35             140
40-            2               45             90
50-59          0               -              0
Total          n = 25                         545

Mean = Σ(f m) / n = 545 / 25 = 21.8
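The four steps for grouped data can be sketched as follows. This is an illustrative sketch: the tuple layout and the name `grouped_mean` are assumptions made here, and the last class (50-59) is given an upper limit of 60 only so that the same midpoint rule applies; its frequency is 0, so that choice does not affect the result.

```python
# Each class is (lower limit, upper limit, frequency).
# Assumption: the upper limit of a class is the lower limit of the next one.
classes = [
    (0, 10, 3),
    (10, 20, 10),
    (20, 30, 6),
    (30, 40, 4),
    (40, 50, 2),
    (50, 60, 0),
]


def grouped_mean(classes):
    """Mean of grouped data: sum of f*m divided by the total frequency n."""
    total_fm = 0.0
    n = 0
    for lower, upper, f in classes:
        m = (lower + upper) / 2  # step 1: class midpoint
        total_fm += f * m        # steps 2 and 3: multiply and accumulate
        n += f
    if n == 0:
        raise ValueError("total frequency is zero")
    return total_fm / n          # step 4: divide by n


print(grouped_mean(classes))  # 545 / 25 = 21.8
```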
ADVANTAGES AND DISADVANTAGES OF THE MEAN
Advantages
• Considers all observations: All data points are included in the calculation.
• Easy to calculate: A straightforward computation.
• Widely understood: A commonly used measure of central tendency.
• Reliable for large datasets without outliers: Most reliable when the dataset is large and doesn't contain extreme values.
Disadvantages
• Sensitive to outliers: Affected by extreme values in the dataset.
• Decimal values: The result may be a decimal value that is not meaningful in certain contexts, such as an average family size of 5.1 persons.
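• Sensitive to outliers: the mean of 1, 2, 3, 4, 5, 6, 7 is 4, but the mean of 1, 2, 3, 4, 5, 6, 28 is 49/7 = 7.
• Decimal values: for a number of accidents, the mean of 1, 2, 3, 4, 5, 6, 8 is 4.14, while the mean of 1, 2, 3, 4, 5, 6, 0 is 3.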
2- THE MEDIAN
• The median is the middle value in a dataset when the values are
arranged in order.
• It divides the data into two equal halves, with the same number of
observations above and below it.
• The median is used to describe ordinal data.
Given that the observations in a sample are x1, x2, x3, …, xn,
arranged in increasing order of magnitude, the sample median will be:
$\tilde{x} = \{x_{n/2} + x_{n/2+1}\}/2$ if n is even
$\tilde{x} = x_{(n+1)/2}$ if n is odd
Calculate the median for the data set below.
EVEN data
45, 36, 36, 28, 24, 19, 16, 16, 12, 7, 3, 3, 3, 1
• As the data set is arranged in a descending order, we will proceed to the second step.
• Identify the order of the median: $\tilde{x} = \{x_{n/2} + x_{n/2+1}\}/2$ if n is even.
• The number of observations is even (14).
• The orders of the median values are (n/2) = 14/2 = 7, the 7th value, and (n/2) + 1 = (14/2) + 1 = 8, the 8th value.
• Median = (7th value + 8th value)/2 = (16 + 16)/2 = 16
Example
4 3 7 8 4 5 12 4 5 3 2 3
Put the numbers in increasing order:
2 3 3 3 4 4 4 5 5 7 8 12
The number of observations is even (12), so the median is the average of the two middle numbers (the 6th and 7th values):
Median = (4 + 4)/2 = 4
ODD data
If the number of observations is odd, as in the following data set:
36, 36, 28, 24, 19, 16, 16, 12, 7, 3, 3, 3, 1
• Also, as the data is arranged in descending order, we will
calculate the order of the median directly.
• As the number of observations is odd (13), the order of the median is (n + 1)/2 = (13 + 1)/2 = 7.
• As the order of the median value is 7, the median is the 7th value, which is 16.
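The even and odd cases above can be sketched as a single function. This is a minimal sketch; the name `median` and the variable names are chosen here for illustration. The data is sorted first, so it does not matter whether the input arrives in ascending or descending order.

```python
def median(values):
    """Middle value of the ordered data (average of the two middle values if n is even)."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("the median of an empty data set is undefined")
    mid = n // 2
    if n % 2 == 1:
        # Odd n: the value at 1-based position (n + 1) / 2.
        return ordered[mid]
    # Even n: average of the values at 1-based positions n/2 and n/2 + 1.
    return (ordered[mid - 1] + ordered[mid]) / 2


even_data = [45, 36, 36, 28, 24, 19, 16, 16, 12, 7, 3, 3, 3, 1]  # n = 14
odd_data = [36, 36, 28, 24, 19, 16, 16, 12, 7, 3, 3, 3, 1]       # n = 13

print(median(even_data))  # (16 + 16) / 2 = 16.0
print(median(odd_data))   # 7th value = 16
```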
The main differences between the MEAN and the MEDIAN
Data set 1: 50, 28, 24, 19, 16, 16, 13, 12, 7, 3, 3, 3, 1
Data set 2: 180, 28, 24, 19, 16, 16, 13, 12, 7, 3, 3, 3, 1
(x̄ = mean, x̃ = median)

Data set 1                          Data set 2
Observation   x − x̄   x − x̃        Observation   x − x̄   x − x̃
50            35      37           180           155     167
28            13      15           28            3       15
24            9       11           24            -1      11
19            4       6            19            -6      6
16            1       3            16            -9      3
16            1       3            16            -9      3
13            -2      0            13            -12     0
12            -3      -1           12            -13     -1
7             -8      -6           7             -18     -6
3             -12     -10          3             -22     -10
3             -12     -10          3             -22     -10
3             -12     -10          3             -22     -10
1             -14     -12          1             -24     -12
Sum = 195     0       26           Sum = 325     0       156
MEAN = 15, MEDIAN = 13             MEAN = 25, MEDIAN = 13
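The comparison in the table can be checked with Python's standard `statistics` module. This is a small sketch using the two data sets from the table; the variable names are chosen here for illustration.

```python
from statistics import mean, median

data_1 = [50, 28, 24, 19, 16, 16, 13, 12, 7, 3, 3, 3, 1]
data_2 = [180] + data_1[1:]  # same data, with the largest value replaced by 180

for data in (data_1, data_2):
    m = mean(data)
    med = median(data)
    # Deviations from the mean always cancel out; deviations from the median need not.
    dev_from_mean = sum(x - m for x in data)
    dev_from_median = sum(x - med for x in data)
    print(m, med, dev_from_mean, dev_from_median)
```

The single extreme value moves the mean from 15 to 25, while the median stays at 13 in both data sets.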
Advantages of the median:
• Simplicity: Easy to understand and calculate, often determined by inspection.
• Insensitivity to outliers: Not affected by extreme values in the data.
Disadvantages:
• Tedious calculation: Can be time-consuming for large datasets.
• Less representative: May not be as representative as the mean because it doesn't
consider all data points.
3- MODE
• The mode is the value that appears most often in a dataset
What is the mode of the following data set?
45, 36, 36, 28, 24, 19, 16, 16,12, 7, 3, 3, 3, 1
• The value that is most frequent is (3) as it is repeated three times
• As it is the only value repeated that often, the data set is unimodal.
• If we delete one of the three 3s, the data set becomes multimodal, because the
values 3, 16 and 36 are then each repeated two times.
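The unimodal and multimodal cases above can be sketched with a frequency count. This is an illustrative sketch; the helper name `modes` is chosen here, and it returns every value that shares the highest frequency.

```python
from collections import Counter


def modes(values):
    """Return all values that occur with the highest frequency, in ascending order."""
    counts = Counter(values)
    highest = max(counts.values())
    return sorted(v for v, c in counts.items() if c == highest)


data = [45, 36, 36, 28, 24, 19, 16, 16, 12, 7, 3, 3, 3, 1]
print(modes(data))  # [3]: unimodal

reduced = data.copy()
reduced.remove(3)   # delete one of the three 3s
print(modes(reduced))  # [3, 16, 36]: multimodal
```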
ADVANTAGES AND DISADVANTAGES OF THE MODE
Advantages of the Mode
• Describes distribution shape: Indicates the overall shape of the distribution,
such as unimodal or bimodal.
• Quick and easy: No calculations are required.
Disadvantages of the mode
• Limited mathematical value: The modal value might represent extreme values
in the dataset.
• It does not consider the spread of values within the data set.
When summarizing a data set using measures of central tendency:
• For quantitative data with a symmetric distribution (normally
distributed), the mean is the proper measure of center.
• For quantitative data with a skewed distribution, the
median is a good choice for the measure of center.
• For a qualitative variable, the mode is the
appropriate measure of center.
SUMMARIZING DATA (2)
MEASURES OF VARIATION (DISPERSION)
MEASURES OF DISPERSION
(Variations)
• Measures of average such as the median and mean represent the
typical value for a dataset.
Suppose that the observations in a sample are x1, x2, x3, …, xn. The sample
mean, denoted by x̄, is
$\bar{x} = \frac{x_1 + x_2 + x_3 + \cdots + x_n}{n}$
Write out the formula for calculating the mean, calculate the mean of the following
data set.
45, 36, 36, 28, 24, 19, 16, 16, 12, 7, 3, 3, 3, 1.
• Within the dataset the actual values usually differ from one to another
and from the average value itself.
For the data set 45, 36, 36, 28, 24, 19, 16, 16, 12, 7, 3, 3, 3, 1, the mean is 249/14 ≈ 17.79.
• The extent to which the median and mean are good representatives of the values in the
population data depends upon the variability or dispersion in the population data.
• Datasets are said to have high dispersion when they contain values considerably higher and lower
than the mean value.
High variability: 100, 200, 300, 10, 25, 31, 7, 9, 21, 36 (mean = 73.9)
Low variability: 3, 3, 5, 5, 4, 4, 6, 6, 1, 1 (mean = 3.8)
Distributions with Equal Means and Unequal Dispersions
• The mean describes where the probability distribution is centered.
• By itself, however, the mean does not give an adequate description of the shape of the
distribution. We also need to characterize the variability in the distribution.
• In the Figure below, we have the histograms of two discrete probability distributions that have
the same mean = 2, but differ considerably in variability, or the dispersion of their observations
about the mean.
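So, it is also important to know whether the observations tend to
be quite similar (homogeneous) or whether they vary
considerably (heterogeneous).
[Figure: histograms of two discrete probability distributions with the same mean (2) and different dispersion; one spans the values 1 to 3, the other 0 to 4]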
The most common measures of variability are:
1. The range
2. Variance
3. The standard deviation
4. The quartiles
5. Interquartile range
1- The Range
The range is the most obvious measure of dispersion and is defined as the
difference between the lowest and highest values in a dataset.
Range = x max – x min
Below are the scores of individual students in the examination and coursework
components of a module.
Students A B C D E F G H I J K L M N
Coursework marks 27 44 39 23 41 48 37 34 40 43 30 43 29 27
Examination marks 12 47 26 25 38 45 35 35 41 39 32 25 18 30
• To find the range in marks the highest and lowest values need to be found from the table.
• The highest coursework mark was 48 and the lowest was 23 giving a range of 25.
• In the examination, the highest mark was 47 and the lowest 12 producing a range of 35.
• This indicates that there was wider variation in the students’ performance in
the examination than in the coursework for this module.
Advantages of the Range
• Easy to calculate: The range is straightforward to compute, requiring only the maximum and minimum values.
• Describes data spread: It provides a quick overview of the distribution of values in a dataset.
• Complementary to the mean: When combined with the mean, it helps illustrate the distribution of observations
around the central tendency.
Disadvantages of the Range
• Relies only on two values: The range is solely based on the maximum and minimum values, ignoring the majority
of the data points. This can be problematic, especially when dealing with outliers.
• Sensitive to outliers: Outliers can significantly affect the range, potentially distorting the overall picture of the
data's dispersion.
• Increases with sample size: As the sample size grows, the range tends to increase, which can make it difficult to
compare the spread of different datasets.
• For example, imagine that in the example above one student failed to hand in any coursework and was
awarded a mark of zero. The range for the coursework marks would now become 48 (48-0), rather than 25.
• However, the new range is not typical of the dataset as a whole and is distorted by the outlier in the
coursework marks.
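The range calculation for the marks table, and the effect of the zero mark, can be sketched as follows. The function name `value_range` is chosen here for illustration, and the zero mark is modelled as one additional student, which is an assumption; replacing an existing mark other than the highest with zero gives the same range.

```python
def value_range(values):
    """Range = largest value minus smallest value."""
    return max(values) - min(values)


# Marks for students A to N from the table above.
coursework = [27, 44, 39, 23, 41, 48, 37, 34, 40, 43, 30, 43, 29, 27]
examination = [12, 47, 26, 25, 38, 45, 35, 35, 41, 39, 32, 25, 18, 30]

print(value_range(coursework))   # 48 - 23 = 25
print(value_range(examination))  # 47 - 12 = 35

# Assumed outlier: one student awarded a coursework mark of zero.
print(value_range(coursework + [0]))  # 48 - 0 = 48
```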
2- The quartiles
The quartiles are values that divide the data into quarters.
[Figure: number line divided into four quarters of 25% each (1st, 2nd, 3rd, 4th), separated by Q1, the median and Q3]
The rule for calculating the quartiles uses the rule for the median.
As quartiles divide numbers according to where their position is on the number
line, you have to put the numbers in an ascending order before you can figure out
where the quartiles are.
3 Steps to calculate the quartiles, Q1 and Q3
Step 1: Arrange the observations in increasing order and locate
the median in the ordered list of observations.
Step 2: The first quartile Q1 is the median of the observations whose position
in the ordered list is to the left of the location of the overall median.
Step 3: The third quartile Q3 is the median of the observations whose position
in the ordered list is to the right of the location of the overall median.
Odd number of observations (n = 15):
5 10 10 10 10 12 15 20 20 25 30 30 40 40 60
Q1 = 10, Median = 20, Q3 = 30
Q3 - Q1 = 30 - 10 = 20
Even number of observations (n = 20):
5 10 10 15 15 15 15 20 20 20 | 25 30 30 40 40 45 60 60 65 85
Q1 = 15, Median = 22.5, Q3 = 42.5
Q3 - Q1 = 42.5 - 15 = 27.5
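The median-of-halves rule used in these two examples can be sketched as below. This is a sketch of that specific rule, with helper names chosen here for illustration: for an odd number of observations the overall median is left out of both halves. Statistical libraries often use interpolation rules instead and can return slightly different quartiles for the same data.

```python
def _median_sorted(ordered):
    """Median of a list that is already in ascending order."""
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


def quartiles(values):
    """Return (Q1, median, Q3) using the median-of-halves rule."""
    ordered = sorted(values)
    n = len(ordered)
    if n < 2:
        raise ValueError("need at least two observations")
    half = n // 2
    lower = ordered[:half]    # observations to the left of the median
    upper = ordered[n - half:]  # observations to the right of the median
    return _median_sorted(lower), _median_sorted(ordered), _median_sorted(upper)


odd_set = [5, 10, 10, 10, 10, 12, 15, 20, 20, 25, 30, 30, 40, 40, 60]
even_set = [5, 10, 10, 15, 15, 15, 15, 20, 20, 20,
            25, 30, 30, 40, 40, 45, 60, 60, 65, 85]

print(quartiles(odd_set))   # (10, 20, 30)
print(quartiles(even_set))  # (15.0, 22.5, 42.5)
```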
What is an Upper Quartile?
• The upper quartile (sometimes called Q3) is the number dividing the third and fourth
quarters of the ordered data.
• The upper quartile can also be thought of as the median of the upper half of the
numbers.
• The upper quartile is also called the 75th percentile; it splits the lowest 75% of data from
the highest 25%
• The quartiles are not affected by extreme observations.
5 10 10 10 10 12 15 20 20 25 30 30 40 40 60
Q1 Median Q3
5 10 10 10 10 12 15 20 20 25 30 30 40 40 600
Q1 Median Q3
• For example, Q3 would still be 30 if the outlier were 600 rather
than 60.
3- Interquartile range
• The interquartile range is the difference between the upper and lower
quartiles (Q3–Q1).
• Used as a measure of the spread and the dispersion of a data.
• The interquartile range spans 50% of a data set
• It eliminates the influence of outliers because, in effect, the
highest and lowest quarters are removed.
Interquartile range = difference between upper quartile (Q3) and lower quartile
(Q1)
An example
34, 47, 1, 15, 57, 24, 20, 11, 19, 50, 28, 37
First arrange the data set in order:
1, 11, 15, 19, 20, 24, 28, 34, 37, 47, 50, 57
Use the above data set to calculate:
1. The median = 26
2. The range = 56
3. The lower and upper quartiles: Q1 = 17, Q3 = 42
4. The interquartile range = 25
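The four answers in this example can be reproduced with a short script. It is a sketch that applies the median-of-halves rule for the quartiles, using `statistics.median` on each half; the variable names are chosen here for illustration.

```python
from statistics import median

data = [34, 47, 1, 15, 57, 24, 20, 11, 19, 50, 28, 37]

ordered = sorted(data)            # first arrange the data in order
half = len(ordered) // 2          # size of the lower and upper halves

med = median(ordered)             # 1. the median
data_range = ordered[-1] - ordered[0]  # 2. the range
q1 = median(ordered[:half])       # 3. lower quartile: median of the lower half
q3 = median(ordered[-half:])      #    upper quartile: median of the upper half
iqr = q3 - q1                     # 4. the interquartile range

print(med, data_range, q1, q3, iqr)
```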
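4- The variance (mean of the squared deviations)
• The variance (s2) is a measure of how far each value in the data set is from the mean.
Example: the values 18, 16, 8 and 6 have a mean of 12, so their deviations from the mean are +6, +4, −4 and −6.
Sum of the deviations from the mean: 6 + 4 − 4 − 6 = 0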
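The variance, or S2, is computed by squaring each deviation
from the mean, adding them up, and dividing their sum by
(n-1), where n is the sample size.
Why do we square the differences?
For the values 18 (+6), 16 (+4), 8 (−4) and 6 (−6), with mean 12, the unsquared deviations cancel out:
$\frac{6 + 4 - 4 - 6}{4 - 1} = 0$
Squaring makes every term positive:
$\frac{6^2 + 4^2 + (-4)^2 + (-6)^2}{4 - 1} = \frac{104}{3} \approx 34.7$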
To calculate the variance:
1. Calculate the mean.
2. Subtract the mean from each value in the data. This gives you
a measure of the distance of each value from the mean.
3. Square each of these distances (so that they are all
positive values) and add all of the squares together.
4. Divide the sum of the squares by the number of values in
the data set minus 1 (n-1).
Sample variance (S2)
$S^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1}$
S2 = variance
Σ = sum
xi = observations in data set
x̄ = sample mean
n = sample size
As an example, calculate the variance for the following data sets:
data set 1: 3, 4, 4, 5, 6, 8
data set 2: 1, 2, 4, 5, 7, 11
Data set 1: 3, 4, 4, 5, 6, 8
1. Calculate the mean: (3 + 4 + 4 + 5 + 6 + 8)/6 = 30/6 = 5
2. Subtract the mean from each value in the data: −2, −1, −1, 0, 1, 3
3. Square each of these distances and add all squares together: 4 + 1 + 1 + 0 + 1 + 9 = 16
4. Divide the sum of the squares by (n-1): 16/(6-1) = 3.2
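The four steps can be sketched as a function and applied to both data sets. This is a minimal sketch of the sample variance with the (n-1) divisor; the name `sample_variance` is chosen here for illustration.

```python
def sample_variance(values):
    """Sample variance: sum of squared deviations from the mean, divided by n - 1."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two observations")
    mean = sum(values) / n                           # step 1
    deviations = [x - mean for x in values]          # step 2
    sum_of_squares = sum(d ** 2 for d in deviations)  # step 3
    return sum_of_squares / (n - 1)                  # step 4


data_set_1 = [3, 4, 4, 5, 6, 8]
data_set_2 = [1, 2, 4, 5, 7, 11]

print(sample_variance(data_set_1))  # 16 / 5 = 3.2
print(sample_variance(data_set_2))  # 66 / 5 = 13.2
```

Both data sets have the same mean (5), but data set 2 has the larger variance because its values lie further from that mean.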
5- Standard deviation
• It is the most widely used measure of variation, represented by the symbol s (lowercase),
σ, or SD.
• The standard deviation is used in conjunction with the mean to summarize
continuous data.
Standard deviation (s)
$s = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n - 1}}$
s = standard deviation
Σ = sum
xi = observations in data set
x̄ = sample mean
n = sample size
• The standard deviation, in combination with the mean, tells you roughly where most of your
observations lie.
• For example, suppose the mean weight of a sample is 150 pounds and the standard deviation is 99
pounds.
• This means that many of the observed weights lie between 51 pounds (mean - SD) and
249 pounds (mean + SD); for roughly normal data, about two thirds of the observations fall in this interval.
STEP ONE: Calculate the mean
STEP TWO: Subtract the mean from each value in the data.
STEP THREE: Square each of these distances and add all squares together.
STEP FOUR: Divide the sum of the squares by (n-1): 16/(6-1) = 3.2
STEP FIVE: Calculate the square root of the variance: s = $\sqrt{3.2} \approx 1.79$
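The five steps can be sketched as a function and cross-checked against `statistics.stdev` from the Python standard library, which also uses the (n-1) divisor. The name `sample_sd` is chosen here for illustration.

```python
import math
from statistics import stdev


def sample_sd(values):
    """Sample standard deviation: square root of the sample variance (n - 1 divisor)."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two observations")
    mean = sum(values) / n                                    # step one
    sum_of_squares = sum((x - mean) ** 2 for x in values)     # steps two and three
    variance = sum_of_squares / (n - 1)                       # step four
    return math.sqrt(variance)                                # step five


data_set_1 = [3, 4, 4, 5, 6, 8]
print(round(sample_sd(data_set_1), 2))  # 1.79
print(round(stdev(data_set_1), 2))      # same value from the standard library
```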
• Both variance and SD are measures of variation in a data
set.
• The larger they are, the more heterogeneous the
distribution.
• So, if we are comparing two samples, the sample with the
smaller standard deviation has observations that
are more homogeneous.
Measures of Shape
Measures of symmetry (Skewness)
Measures of flatness (kurtosis)
Measures of symmetry (Skewness)
• Symmetrical: the mean, the median and the mode are equal.
• Skewed to the right (positive skewness): the mean is to the right of the median (the value of the mean is the largest).
• Skewed to the left (negative skewness): the mean is to the left of the median (the value of the mean is the smallest).
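The rule above, comparing the position of the mean with the median, can be sketched as a rough check. This is only the rule of thumb from the slide, not a formal skewness coefficient; the function name and the example data sets are chosen here for illustration, and equal mean and median do not by themselves prove a distribution is symmetrical.

```python
from statistics import mean, median


def skew_direction(values):
    """Rule-of-thumb skew direction from the position of the mean relative to the median."""
    m = mean(values)
    med = median(values)
    if m > med:
        return "skewed to the right (positive)"
    if m < med:
        return "skewed to the left (negative)"
    return "symmetrical"


print(skew_direction([1, 2, 3, 4, 5, 6, 7]))        # mean 4, median 4
print(skew_direction([1, 2, 3, 4, 5, 6, 28]))       # mean 7, median 4
print(skew_direction([1, 23, 24, 25, 26, 27, 28]))  # mean 22, median 25
```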
Measures of flatness (kurtosis)
• Leptokurtic: peak is high
• Platykurtic: peak is flat
• Mesokurtic: normal distribution