DESCRIPTIVE STATISTICS
• EXAMPLES OF DESCRIPTIVE STATISTICS
INCLUDE:
– RATIOS E.G. MEASURES OF MORBIDITY,
MORTALITY AND NATALITY.
– MEASURES OF CENTRAL TENDENCY E.G. MEAN,
MODE AND MEDIAN.
– MEASURES OF DISPERSION E.G. RANGE,
INTERQUARTILE RANGE AND STANDARD
DEVIATION
MEASURES OF CENTRAL
TENDENCY
•MEAN
•MEDIAN
•MODE
• For quantitative variables, two summary
measures are commonly used:
– Measures of central tendency
– Measures of dispersion
• Three measures of central tendency
exist:
1. Arithmetic Mean
2. Median
3. Mode
INTRODUCTION
• Measure of central tendency: a single number
that is most representative of the entire data set.
1. MEAN: the sum of the numbers divided by
n, the number of values.
2. MEDIAN: the middle number when the
numbers are ordered. If the set has an even
number of values, the median is the average
of the two middle numbers.
3. MODE: the most frequent number. A data set
can be bimodal or trimodal.
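The three definitions above can be checked on a small data set. The following is a minimal Python sketch using the standard `statistics` module; the scores are hypothetical values invented for illustration.

```python
import statistics

# Hypothetical scores, invented for illustration (even number of values).
scores = [2, 3, 3, 5, 7, 10]

# Mean: the sum of the numbers divided by n.
mean = sum(scores) / len(scores)

# Median: with an even count, the average of the two middle values (3 and 5).
median = statistics.median(scores)

# Mode: multimode returns every most-frequent value, so a bimodal
# data set yields two values.
modes = statistics.multimode(scores)
bimodal = statistics.multimode([1, 1, 2, 2, 3])

print(mean, median, modes, bimodal)
```

Here the mean is 5.0, the median is 4.0, the mode is 3, and the second list has the two modes 1 and 2.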
• The appropriate measure of central tendency
depends on the data itself.
• Continuous data, e.g. height: use the mean,
e.g. 'the mean height is 32.5 cm'. The mode is
not a good measure here because it may not exist.
• Discrete data, e.g. number of children: use the
mode or median; this avoids statements such as
'the mean is 2.3 children'!
• Categorical data, e.g. colour of houses sold:
use the mode, for example, 'White is the most
common house colour'.
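As a rough illustration of matching the measure to the type of data, the sketch below uses Python's standard `statistics` module. Both lists are hypothetical and were made up for this example.

```python
import statistics

# Hypothetical data, invented for illustration.
children = [1, 2, 2, 2, 3, 4]                            # discrete counts
colours = ["White", "Cream", "White", "Grey", "White"]   # categorical labels

# The mean of a count is usually not a possible value (about 2.33 here).
mean_children = sum(children) / len(children)

# The median and mode stay on values that can actually occur.
median_children = statistics.median(children)
mode_children = statistics.mode(children)

# For categories only the mode is defined.
mode_colour = statistics.mode(colours)

print(round(mean_children, 2), median_children, mode_children, mode_colour)
```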
MEAN
• The arithmetic mean is the most common
measure of central tendency.
• The symbol "μ" is used for the mean of a
population. The symbol "M" is used for
the mean of a sample. The formulas are
shown below, where ΣX is the sum of all
values, N is the population size and n is
the sample size:
• μ = ΣX/N (population mean)
• M = ΣX/n (sample mean)
• The mean, presented along with the variance
and the standard deviation, is the "best"
measure of central tendency for continuous
data.
• In some situations the mean is not the "best"
measure of central tendency and the median is
the preferred measure, e.g.:
• When data distribution is skewed
• When you believe that a distribution might
be skewed
• When you have a small number of subjects
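The effect of skew on the mean can be seen in a short Python sketch. The values are hypothetical: six similar scores and one extreme score.

```python
import statistics

# Hypothetical right-skewed data: one extreme value among similar ones.
values = [20, 22, 23, 25, 27, 30, 200]

mean = statistics.mean(values)      # pulled upward by the extreme value
median = statistics.median(values)  # middle value of the ordered data

# Only the extreme value lies above the mean, so the mean no longer
# describes a "typical" score; the median still does.
above_mean = [v for v in values if v > mean]

print(round(mean, 1), median, above_mean)
```

With these numbers the mean is about 49.6 while the median is 25, which is much closer to most of the scores.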
MEDIAN
• The median is also a frequently used
measure of central tendency. The median
is the midpoint of a distribution: the
same number of scores lies above the
median as below it. When the number of
values is odd, the median is the middle
value of the ordered data. When the number
of values is even, the median is the mean
of the two middle numbers.
• The median can also be thought of as the
50th percentile.
MODE
• The mode is the most frequently
occurring value
• With continuous data measured to
many decimals, the frequency of each
value is one since no two scores will be
exactly the same.
• Therefore the mode of continuous data
is normally computed from a grouped
frequency distribution.
Mode of continuous data
Grouped frequency distribution.
Wt (kg)      Frequency
500-600      2
600-700      3
700-800      6
800-900      5
900-1000     4
1000-1100    0
700-800 is the most frequent group, so the MODE is
the midpoint of that class interval, i.e. 750 kg
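The same calculation can be sketched in Python. The class intervals and frequencies are taken from the grouped frequency distribution above; the variable names are only illustrative.

```python
# (lower limit, upper limit, frequency) for each weight class in kg,
# copied from the grouped frequency distribution.
classes = [
    (500, 600, 2),
    (600, 700, 3),
    (700, 800, 6),
    (800, 900, 5),
    (900, 1000, 4),
    (1000, 1100, 0),
]

# The modal class is the class with the largest frequency.
modal_class = max(classes, key=lambda c: c[2])
lower, upper, frequency = modal_class

# The mode is taken as the midpoint of the modal class.
mode_estimate = (lower + upper) / 2

print(modal_class, mode_estimate)
```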
• The mode is not often used because
the most frequent score might not lie
near the center of the distribution. The
only situation in which the mode may be
preferred over the other two measures of
central tendency is when describing
categorical data. The mode is preferred
in this situation because the most
frequent response is what matters when
describing categorical data.
MEASURES OF SPREAD
Introduction
• A measure of spread (dispersion) is
used to describe the variability in a
sample or population. It is usually used
in conjunction with a measure of central
tendency, such as the mean or median,
to provide an overall description of a set
of data.
• Why is it important to measure the
spread of data?
• It gives us an idea of how well the mean
represents a data set. The mean is not a
good summary of a data set with a large
spread but is appropriate if the spread of
the data is small. A large spread indicates
high variability between individual scores,
which does not augur well in research.
Types of measures of spread
• Range
• Quartiles
• Variance
• Absolute deviation and
• Standard deviation.
Range
• The range is the difference between the
highest and lowest scores in a data set
and is the simplest measure of spread.
• Range = maximum value - minimum
value
• NB: unlike with the median, the data need
not be ordered; however, ordered data
makes it easier to quickly see the
minimum and maximum values.
• The range delineates the boundaries of a
data set. The importance of this is seen
if you are measuring a variable that has
upper and lower limits that should not be
exceeded. The range can be used to detect
errors made when entering data. E.g., if
you are recording the ages of school
children, you quickly note a mistake if
your range is 7 to 118 years!
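A minimal Python sketch of this data-entry check follows. The ages are hypothetical, and the limits of 4 and 20 years are an assumption chosen only for illustration; they are not taken from the slides.

```python
# Hypothetical ages of school children, with one data-entry error (118).
ages = [7, 9, 8, 118, 10, 11]

# Range = maximum value - minimum value (no ordering needed).
data_range = max(ages) - min(ages)

# Assumed plausible limits for a school child's age, for illustration only.
MIN_AGE, MAX_AGE = 4, 20

# Any value outside the limits is flagged for checking.
suspect = [a for a in ages if not MIN_AGE <= a <= MAX_AGE]

print(min(ages), max(ages), data_range, suspect)
```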
Quartiles and Interquartile Range
• Quartiles measure the spread of a data set
by breaking the ordered data into quarters.
The quartiles correspond to percentiles:
the 1st quartile is the 25th percentile, the
2nd quartile is the 50th percentile (the
median), the 3rd quartile is the 75th
percentile, and the 4th quartile ends at the
100th percentile (the maximum value).
• When an ordered data set has an even number
of values (for example, 100 values), the
quartiles are calculated as follows:
• First quartile (Q1) = (25th value + 26th value)/2
• Second quartile (Q2) = (50th value + 51st value)/2
• Third quartile (Q3) = (75th value + 76th value)/2
• If the data set has an odd number of values
(for example, 99 values), Q1, Q2 and Q3 are the
values in the 25th, 50th and 75th positions of
the ordered set
• Quartiles are much less affected by
outliers or a skewed data set than the
equivalent measures of mean and
standard deviation. For this reason,
quartiles are often reported along with
the median as the best choice of
measure of spread and central
tendency, respectively, when dealing
with skewed data and/or data with outliers.
• A common way of expressing quartiles is
as an interquartile range. The
interquartile range describes the
difference between the third quartile
(Q3) and the first quartile (Q1). It gives
the range of the middle half of the
scores in the distribution, i.e.
• Formula: IQR = Q3 - Q1
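The quartile positions and the interquartile range can be sketched in Python as below. The helper `quartiles` is a hypothetical function written for this example; it takes the median of each half of the ordered data, which is one of several conventions in use, so statistical software may give slightly different quartiles. It assumes at least two values.

```python
import statistics

def quartiles(values):
    """Return (Q1, Q2, Q3) as the medians of the lower half, the whole
    ordered data set, and the upper half."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]
    # With an odd count the median itself is left out of both halves.
    upper = ordered[half:] if n % 2 == 0 else ordered[half + 1:]
    return (statistics.median(lower),
            statistics.median(ordered),
            statistics.median(upper))

# The values 1..100 and 1..99 make each value equal to its position.
even_q = quartiles(range(1, 101))   # 100 values
odd_q = quartiles(range(1, 100))    # 99 values

# Hypothetical scores with one outlier (80).
scores = [1, 2, 3, 4, 5, 6, 7, 80]
q1, q2, q3 = quartiles(scores)
iqr = q3 - q1                            # IQR = Q3 - Q1
data_range = max(scores) - min(scores)   # inflated by the outlier

print(even_q, odd_q, iqr, data_range)
```

For 100 values the quartiles fall halfway between positions 25 and 26, 50 and 51, and 75 and 76; for 99 values they fall exactly on positions 25, 50 and 75. In the last data set the outlier stretches the range to 79 while the IQR stays at 4.0.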
Absolute Deviation, Variance and standard deviation (Variations)
• Quartiles do not take into account every
score in our group of data. To take into
account the actual values of each score
in a data set and get the spread we use
the ABSOLUTE DEVIATION,
VARIANCE & STANDARD DEVIATION.
• Any of the three variations can be
used in research.
Variance
• Another method for calculating the deviation
of a group of scores from the mean is to use
the variance. Unlike the absolute deviation,
which replaces negative results with the
absolute value of the deviation, the variance
achieves positive values by squaring each of
the deviations. Adding up these squared
deviations gives us the sum of squares, which
we then divide by the total number of
scores in the group of data to obtain the variance.
• As a measure of variability, the variance is
useful. If the scores in a data set are
spread out, the variance will be a large
number. Conversely, if the scores are
spread closely around the mean, the
variance will be a smaller number.
However, there are two potential
problems with the variance. First, because
the deviations of scores from the mean are
'squared', this gives more weight to
extreme scores.
• If our data contains outliers this can give
undue weight to these scores. Secondly,
the variance is not in the same units as the
scores in our data set: variance is
measured in the units squared. This means
we cannot place it on our frequency
distribution and cannot directly relate its
value to the values in our data set.
Calculating the standard deviation rather
than the variance rectifies this problem.
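The steps described above can be sketched in Python. The scores are hypothetical, and the variance is divided by the total number of scores, as in the description above.

```python
import math

# Hypothetical scores, invented for illustration.
scores = [2, 4, 4, 4, 5, 5, 7, 9]
n = len(scores)
mean = sum(scores) / n

# Deviation of each score from the mean (these sum to zero).
deviations = [x - mean for x in scores]

# Absolute deviation: drop the signs, then average.
mean_abs_dev = sum(abs(d) for d in deviations) / n

# Variance: square the deviations, add them up (sum of squares),
# then divide by the total number of scores.
sum_of_squares = sum(d ** 2 for d in deviations)
variance = sum_of_squares / n

# The square root brings the result back to the original units.
std_dev = math.sqrt(variance)

print(mean, mean_abs_dev, sum_of_squares, variance, std_dev)
```

If the scores were in kg, the variance here would be 4.0 kg squared, while the standard deviation would be 2.0 kg.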
Standard Deviation
Introduction
• The standard deviation is a measure of
the spread of scores within a set of data.
Usually, the standard deviation of a
population is preferred. However, as
researchers often deal with data from a
sample only, the population standard
deviation can be estimated from the sample
standard deviation.
When to use the sample or population standard deviation
• Knowing the population standard deviation is
more important because the population
contains all the values the researchers are
interested in. Therefore, the population
standard deviation is used if: (1) the entire
population is available or (2) a sample of a
larger population is available but the interest
is in the sample only and the researchers do not
wish to generalize their findings to the
population. If the sample is used to generalize
to the population, the sample standard
deviation is used instead.
What type of data should you use when you
calculate a standard deviation?
• The standard deviation is used in
conjunction with the mean to
summarise continuous data, NOT
categorical data. In addition, the
standard deviation, like the mean, is
normally only appropriate when the
continuous data is not significantly
skewed and has no outliers.
What are the formulas for the population and sample standard deviation?
• The sample standard deviation formula is:
s = √[Σ(X − M)² / (n − 1)]
• The population standard deviation formula
is:
σ = √[Σ(X − μ)² / N]
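The difference between the two standard deviations can be sketched in Python. The sketch assumes the usual definitions: the population formula divides the sum of squared deviations by N, and the sample formula divides by n − 1. The scores are hypothetical, and the results are cross-checked against `pstdev` and `stdev` from the standard `statistics` module.

```python
import math
import statistics

# Hypothetical scores, invented for illustration.
scores = [2, 4, 4, 4, 5, 5, 7, 9]
n = len(scores)
mean = sum(scores) / n

# Sum of squared deviations from the mean.
sum_of_squares = sum((x - mean) ** 2 for x in scores)

# Population standard deviation: divide by N.
population_sd = math.sqrt(sum_of_squares / n)

# Sample standard deviation: divide by n - 1.
sample_sd = math.sqrt(sum_of_squares / (n - 1))

# Cross-check against the standard library.
assert math.isclose(population_sd, statistics.pstdev(scores))
assert math.isclose(sample_sd, statistics.stdev(scores))

print(population_sd, round(sample_sd, 3))
```

The sample value is slightly larger than the population value because of the smaller divisor.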