TOPIC 2: DESCRIPTIVE STATISTICS
Descriptive statistics are used to examine the basic features of data. They provide simple
summaries about the sample. The focus is on the immediate data (the sample) and does not extend
beyond it. This is the difference between descriptive and inferential statistics: inferential
statistics extend beyond the immediate data (the sample) to what it means for the population.
As the name suggests, descriptive statistics describe the data: they use quantitative
descriptions to depict characteristics, demographics, sizes and other facts about the data.
Descriptive statistics are therefore tabular, graphical, and numerical summaries of data that
serve not only to provide a basic understanding but also to prepare the data for further analysis.
The purpose of descriptive statistics is to facilitate the presentation and interpretation of data.
Univariate methods of descriptive statistics use data to enhance the understanding of a single
variable; multivariate methods focus on using statistics to understand the relationships among
two or more variables.
There are three major categories of descriptive statistics:
Frequency distribution
Measures of central tendency
Measures of dispersion/variability
Frequency Distribution
In statistics, the frequency (or absolute frequency) of an event is the number of times the
event occurred or was recorded. These frequencies are often depicted graphically or in
tabular form.
The cumulative frequency is the total of the absolute frequencies of all events at or below a
certain point in an ordered list of events, i.e. the running sum of the frequencies.
The relative frequency (or empirical probability) of an event is the absolute frequency
normalized by the total number of events.
A frequency distribution shows us a summarized grouping of data divided into mutually
exclusive classes and the number of occurrences in each class.
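The three kinds of frequency defined above can be illustrated with a short Python sketch. The
list `responses` is made-up data used only for illustration; it does not come from these notes.

```python
from collections import Counter
from itertools import accumulate

# Hypothetical observations (illustrative data only)
responses = ["B", "A", "C", "A", "B", "A", "D", "B", "A", "C"]

# Absolute frequency: number of times each event was recorded
absolute = Counter(responses)

# Put the events in order before accumulating
events = sorted(absolute)
counts = [absolute[e] for e in events]

# Cumulative frequency: running sum of the absolute frequencies
cumulative = list(accumulate(counts))

# Relative frequency: absolute frequency divided by the total number of events
total = sum(counts)
relative = [c / total for c in counts]

for e, c, cum, rel in zip(events, counts, cumulative, relative):
    print(f"{e}: frequency={c}, cumulative={cum}, relative={rel:.2f}")
```

The last cumulative frequency always equals the total number of events, and the relative
frequencies always add up to 1.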
a. Frequency Distribution Table
The most commonly used tabular summary of data for a single variable is a frequency
distribution.
A frequency distribution shows the number of data values in each of several nonoverlapping
classes.
Another tabular summary, called a relative frequency distribution, shows the fraction,
or percentage, of data values in each class.
The most common tabular summary of data for two variables is a cross tabulation, a two-
variable analogue of a frequency distribution.
For a qualitative variable, a frequency distribution shows the number of data values in each
qualitative category.
For instance, suppose a variable gender has two categories: male and female. A frequency
distribution for gender would then have two nonoverlapping classes showing the number of males
and the number of females.
A relative frequency distribution for this variable would show the fraction of individuals that are
male and the fraction of individuals that are female.
Constructing a frequency distribution for a quantitative variable requires more care in defining
the classes and the division points between adjacent classes.
For instance, if age data ranges from 22 to 78 years, the following six nonoverlapping classes
could be used: 20–29, 30–39, 40–49, 50–59, 60–69, and 70–79.
A frequency distribution would show the number of data values in each of these classes, and a
relative frequency distribution would show the fraction of data values in each.
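As a rough sketch of the age example, the six classes can be written as lower and upper bounds
and each value counted into its class. The `ages` list is hypothetical and only chosen to span
22 to 78 years.

```python
# Hypothetical ages between 22 and 78 (illustrative data only)
ages = [22, 25, 31, 34, 38, 41, 47, 52, 55, 58, 63, 66, 71, 78]

# The six nonoverlapping classes from the text, as (lower, upper) bounds
classes = [(20, 29), (30, 39), (40, 49), (50, 59), (60, 69), (70, 79)]

# Frequency: number of ages that fall inside each class
frequency = {
    f"{lo}-{hi}": sum(1 for age in ages if lo <= age <= hi)
    for lo, hi in classes
}

# Relative frequency: fraction of all ages in each class
relative = {label: count / len(ages) for label, count in frequency.items()}

for label in frequency:
    print(f"{label}: {frequency[label]} ({relative[label]:.2f})")
```

Because the classes do not overlap and together cover every age, each value is counted exactly
once.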
Constructing a frequency distribution table
1. Decide the number of classes. Too many or too few classes might not reveal the
basic shape of the data set, and the resulting frequency distribution will be difficult
to interpret.
2. Calculate the range of the data (Range = Max – Min) by finding the minimum and
maximum data values. The range will be used to determine the class interval or class width.
3. Decide the width of the classes (assuming the class intervals are the same for all
classes).
4. Generally the class interval or class width is the same for all classes. The classes all
taken together must cover at least the distance from the lowest value (minimum) in the
data to the highest (maximum) value.
5. Equal class intervals are preferred in a frequency distribution, while unequal class intervals
(for example logarithmic intervals) may be necessary in certain situations to produce a
good spread of observations between the classes and avoid a large number of empty, or
almost empty, classes.
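The steps above can be sketched in Python as follows. The data, the choice of four classes and
the rule of rounding the width up to a whole number are all assumptions made for illustration;
the notes do not prescribe a particular rule.

```python
import math

# Hypothetical measurements (illustrative data only)
data = [12, 15, 17, 21, 22, 24, 28, 30, 33, 35, 38, 41, 44, 47, 50]

num_classes = 4                              # step 1: chosen number of classes (assumed)
data_range = max(data) - min(data)           # step 2: Range = Max - Min
width = math.ceil(data_range / num_classes)  # step 3: equal class width, rounded up

# Step 4: build classes that together cover the minimum through the maximum.
# Each class is the half-open interval [lower, upper); the last class also
# includes its upper bound so that the maximum value is always counted.
table = []
for i in range(num_classes):
    lower = min(data) + i * width
    upper = lower + width
    is_last = i == num_classes - 1
    count = sum(1 for x in data if lower <= x < upper or (is_last and x == upper))
    table.append((lower, upper, count))

for lower, upper, count in table:
    print(f"{lower} to under {upper}: {count}")
```

Rounding the width up guarantees that the classes taken together reach at least the maximum
value, which is the requirement stated in step 4.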
Example 1
Frequency distribution table A: a frequency distribution table is an arrangement of the values
that one or more variables take in a sample. Each entry in the table contains the frequency or
count of the occurrences of values within a particular group or interval, and in this way, the
table summarizes the distribution of values in the sample. This is an example of a univariate
(single variable) frequency table.
Rank   Degree of agreement   Number
1      Strongly agree        22
2      Agree somewhat        30
3      Not sure              20
4      Disagree somewhat     15
5      Strongly disagree     15
Example 2
Frequency distribution table B aggregates values into bins such that each bin encompasses a
range of values. For example, the heights of maize plants in a class could be organized into the
following frequency table.
Height range          Number of plants   Cumulative number
less than 5.0 feet    25                 25
5.0–5.5 feet          35                 60
5.5–6.0 feet          20                 80
6.0–6.5 feet          20                 100
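The cumulative column of the Example 2 table can be reproduced from the counts with a running
sum. The sketch below uses the counts from the table; the variable names and the extra
cumulative percentage column are additions for illustration.

```python
from itertools import accumulate

# Class labels and counts taken from the Example 2 table
height_ranges = ["less than 5.0 feet", "5.0-5.5 feet", "5.5-6.0 feet", "6.0-6.5 feet"]
plants = [25, 35, 20, 20]

# Cumulative number: running total of the counts
cumulative = list(accumulate(plants))

# Cumulative percentage: share of all plants at or below each class
total = cumulative[-1]
cumulative_pct = [100 * c / total for c in cumulative]

for label, n, cum, pct in zip(height_ranges, plants, cumulative, cumulative_pct):
    print(f"{label:<20} {n:>3} {cum:>4} {pct:>6.1f}%")
```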
Cross tabulation
A cross tabulation is a two-way table with the rows of the table representing the classes of one
variable and the columns of the table representing the classes of another variable.
To construct a cross tabulation using the variables gender and age, gender could be shown with
two rows, male and female, and age could be shown with six columns corresponding to the age
classes 20–29, 30–39, 40–49, 50–59, 60–69, and 70–79.
The entry in each cell of the table would specify the number of data values with the gender given
by the row heading and the age given by the column heading. Such a cross tabulation could be
helpful in understanding the relationship between gender and age. (Assignment: complete a
cross tabulation as described)
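One possible way to build such a table in Python is sketched below. The ten records are
invented for illustration, and the helper `age_class` is a hypothetical name; a real exercise
would use the actual survey data.

```python
from collections import Counter

# Hypothetical (gender, age) records, illustrative only
records = [("male", 24), ("female", 31), ("male", 37), ("female", 45),
           ("female", 28), ("male", 52), ("female", 67), ("male", 71),
           ("male", 33), ("female", 58)]

age_classes = [(20, 29), (30, 39), (40, 49), (50, 59), (60, 69), (70, 79)]

def age_class(age):
    """Return the label of the age class that contains `age`."""
    for lo, hi in age_classes:
        if lo <= age <= hi:
            return f"{lo}-{hi}"
    raise ValueError(f"age {age} is outside the defined classes")

# Each cell counts the records with a given (row, column) combination
cells = Counter((gender, age_class(age)) for gender, age in records)

# Print the table: one row per gender, one column per age class
columns = [f"{lo}-{hi}" for lo, hi in age_classes]
print("gender".ljust(8) + "".join(c.rjust(7) for c in columns))
for gender in ("male", "female"):
    row = "".join(str(cells[(gender, c)]).rjust(7) for c in columns)
    print(gender.ljust(8) + row)
```

Combinations that never occur in the data are shown as 0, so the table always has two rows and
six columns.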
Graphical methods
Some of the graphs that can be used with frequency distributions are histograms, line charts,
bar charts and pie charts. Frequency distributions are used for both qualitative and quantitative
data.
Bar graph
A bar graph is a graphical device for depicting qualitative data that have been summarized in a
frequency distribution. Labels for the categories of the qualitative variable are shown on the
horizontal axis of the graph. A bar above each label is constructed such that the height of each
bar is proportional to the number of data values in the category.
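A quick text-only sketch of the idea, using the counts from the Example 1 table. The bars are
drawn horizontally here purely for convenience in a terminal; a plotting library would normally
be used to draw them vertically as described.

```python
# Counts taken from the Example 1 table
categories = ["Strongly agree", "Agree somewhat", "Not sure",
              "Disagree somewhat", "Strongly disagree"]
counts = [22, 30, 20, 15, 15]

# One '#' per response, so each bar's length is proportional to its count
bars = ["#" * c for c in counts]

for label, bar, c in zip(categories, bars, counts):
    print(f"{label:<18} {bar} {c}")
```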
Pie chart
A pie chart is another graphical device for summarizing qualitative data.
The size of each slice of the pie is proportional to the number of data values in the
corresponding class.
Histogram
A histogram is the most common graphical presentation of quantitative data that have been
summarized in a frequency distribution.
The values of the quantitative variable are shown on the horizontal axis. A rectangle is drawn
above each class such that the base of the rectangle is equal to the width of the class interval
and its height is proportional to the number of data values in the class.
Measures of Central Tendency
Measures of central tendency seek to find the single "middle" or "average" value in a dataset.
These measures include:
a. Mean
The mean is a measure of the central location of the data. Often called the average, it is
computed by adding all the data values for a variable and dividing the sum by the number of
data values. The mean is strongly affected by large outliers in the data. In this case, the
mean might not correctly indicate the "middle", as it could be pushed upwards or downwards by
a single very large or very small value that is far away from the rest.
b. Median
The median is the middle value in a dataset that is ordered from least to greatest. It is
another measure of central location but, unlike the mean, the median is not affected by
extremely large or extremely small data values. When determining the median, the data values are first
ranked in order from the smallest value to the largest value. If there is an odd number of data
values, the median is the middle value; if there is an even number of data values, the median is
the average of the two middle values.
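The effect of an outlier on the mean and the median can be seen in a small sketch using
Python's standard `statistics` module. The income figures are hypothetical.

```python
import statistics

# Hypothetical monthly incomes (illustrative data only)
incomes = [30, 32, 35, 36, 40]
with_outlier = incomes + [500]   # add one extreme value

mean_before = statistics.mean(incomes)           # sum 173 divided by 5 values
median_before = statistics.median(incomes)       # middle of 5 ordered values
mean_after = statistics.mean(with_outlier)       # pulled upwards by the value 500
median_after = statistics.median(with_outlier)   # average of the two middle values

print(f"mean:   {mean_before:.1f} -> {mean_after:.1f}")
print(f"median: {median_before} -> {median_after}")
```

The single extreme value more than triples the mean, while the median moves only from 35 to
35.5.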
c. Mode
The mode, the third measure of central tendency, is the data value that occurs with the
greatest frequency in the dataset.
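A minimal sketch with the standard `statistics` module, using made-up shoe sizes. The second
call shows one way to handle a dataset in which several values are tied for most frequent.

```python
import statistics

# Hypothetical shoe sizes (illustrative data only)
sizes = [38, 40, 40, 41, 42, 40, 39, 42]

# The single most frequent value
mode = statistics.mode(sizes)

# Every value tied for most frequent, returned as a list
all_modes = statistics.multimode([1, 1, 2, 2, 3])

print(mode, all_modes)
```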
d. Percentiles
Percentiles provide an indication of how the data values are spread over the interval from the
smallest value to the largest value.
Approximately p percent of the data values fall below the pth percentile, and roughly 100
− p percent of the data values are above the pth percentile. Percentiles are reported, for
example, on most standardized tests.
e. Quartiles
Quartiles divide the data values into four parts; the first quartile is the 25th percentile, the
second quartile is the 50th percentile (also the median), and the third quartile is the 75th
percentile.
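Quartiles and percentiles can be computed with `statistics.quantiles` from the Python standard
library, as sketched below on hypothetical data. Several interpolation conventions exist; the
`"inclusive"` method is used here as an assumption, so other software may report slightly
different cut points for the same data.

```python
import statistics

# Hypothetical ordered data (illustrative only)
data = [2, 4, 6, 8, 10, 12, 14, 16, 18]

# Three cut points dividing the data into four parts: Q1, Q2, Q3
quartiles = statistics.quantiles(data, n=4, method="inclusive")

# Ninety-nine cut points; index p - 1 holds the p-th percentile
percentiles = statistics.quantiles(data, n=100, method="inclusive")
p50 = percentiles[49]

print(quartiles, p50, statistics.median(data))
```

The second quartile, the 50th percentile and the median all coincide, as stated above.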
Measures of Variability (or Dispersion)
Measures of variability describe how "spread out" or scattered the data points are around the
center.
Measures of variability include:
a. Range
The range, the difference between the largest value and the smallest value, is the simplest
measure of variability in the data. The range is determined by only the two extreme data
values.
b. Variance
The variance is the average of the squared differences from the mean. For a sample, the
deviation (difference) of each data value from the sample mean is computed and squared. The
squared deviations are then summed and divided by n − 1 to provide the sample variance.
c. Standard deviation
The standard deviation is a measure of the typical amount by which data points deviate from
the mean. It is the square root of the variance. Because the unit of measure for the standard
deviation is the same as the unit of measure for the data, many individuals prefer to use the
standard deviation as the descriptive measure of variability.
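The three measures can be worked through step by step in Python. The five data values are
hypothetical; the calculation follows the sample formulas described above (dividing by n − 1)
and is cross-checked against the standard `statistics` module.

```python
import math
import statistics

# Hypothetical sample (illustrative data only)
data = [4, 8, 6, 5, 7]

n = len(data)
mean = sum(data) / n

# Range: largest value minus smallest value
data_range = max(data) - min(data)

# Sample variance: sum of squared deviations from the mean, divided by n - 1
squared_deviations = [(x - mean) ** 2 for x in data]
sample_variance = sum(squared_deviations) / (n - 1)

# Standard deviation: square root of the variance, in the same unit as the data
sample_sd = math.sqrt(sample_variance)

print(data_range, sample_variance, round(sample_sd, 3))
print(statistics.variance(data), round(statistics.stdev(data), 3))  # cross-check
```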
TOPIC 3: MULTIVARIATE PARAMETRIC STATISTICAL TECHNIQUES
Parametric tests are a family of tests that impose certain requirements on the data in order
to produce reliable results (maximize prediction/minimize error). Parametric tests are only
applied to continuous data (interval and ratio scale). The tests require that the data are
normally distributed with no significant outliers (normality) and that the groups being
compared have similar variances (homogeneity of variance). The observations or data points
should also be independent of each other (independence).
Parametric tests have greater statistical power than non-parametric tests when these
requirements are met. They have also found wide use, hence are widely accepted. Their greater
power means they can detect true differences even with smaller samples.
Multivariate parametric tests focus on two or more variables while univariate parametric tests
focus on a single variable. The following are examples of multivariate parametric tests:
a. Correlation Analysis
Correlation analysis examines the association between two variables. Where the test statistic
is not 0, there is some relationship between the variables in the data; a value of 0 indicates
no linear relationship.
Besides confirming the existence of a relationship, correlation analysis also identifies
the type of relationship:
i. Positive or negative relationship/correlation
ii. Linear or non-linear
Positive correlation
If the values of the two variables move in the same direction, i.e. if an increase (or
decrease) in the values of one variable results, on average, in a corresponding increase (or
decrease) in the values of the other variable, the correlation is said to be positive.
Example 1: An increase in taxes correlates with an increase in the cost of goods.
Example 2: Heights and weights tend to be positively correlated.
Example 3: As a child grows, weight is normally positively correlated with age in
months.
Example 4: Amount of rainfall and yield.
Example 5: Demand and supply of commodities (see table below).
            Demand   Supply
January     0        0
February    4        4
March       8        8
April       12       12
Demand and supply are expected to follow a positive correlation.
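The strength of the relationship in the demand and supply table can be checked numerically with
the coefficient of correlation r, which is discussed later in this topic. The sketch below
assumes Pearson's product-moment formula, which these notes do not spell out; `pearson_r` is a
hypothetical helper name.

```python
import math

# Values taken from the demand and supply table (January to April)
demand = [0, 4, 8, 12]
supply = [0, 4, 8, 12]

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of products of paired deviations from the two means
    co_deviation = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    # Sums of squared deviations for each variable
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return co_deviation / math.sqrt(ss_x * ss_y)

r = pearson_r(demand, supply)
print(r)
```

Because supply equals demand in every month, the points lie exactly on a rising straight line
and r comes out as +1, a perfect positive correlation.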
Line of best fit
One way of depicting a correlation is to use a line of best fit. Each pair of corresponding
values of the two variables is plotted as a point where the two values meet on the x and y axes.
A straight line is then drawn through the data points, beginning from the bottom left and
passing as close as possible to each data point.
A positive correlation is shown by data points that begin in the bottom left corner
and head to the top right corner, as shown in the figure below.
A perfect positive correlation has no outliers (data points that do not lie on the straight
line). In a perfect correlation every data point falls exactly on the line, as in the demand
and supply table above, where the two scores are the same throughout.
Negative correlation
Correlation between variables is said to be negative where the two move in opposite
directions.
This means that an increase in the value of one variable results, on average, in a
corresponding decrease in the value of the other variable. In the same manner, if one
variable decreases, the other increases.
There are many examples of negatively correlated variables:
Taxes and job creation
Food production levels and inflation
Linear correlation
Linear correlation means that y changes by a constant amount for each unit increase in x. It
means that the relationship between the two variables over the entire period can be expressed
as a constant, predictable rate of change of y with respect to x.
Ideally the relationship follows the equation:
y = a + bx, where a is the intercept, b is the gradient and x is the independent variable.
Assignment: Read about the interpretation of the gradient and intercept.
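As a rough illustration of the linear equation, the sketch below estimates a and b from data
using least squares. Least squares is a standard fitting method assumed here, since the notes
do not name one, and the data are hypothetical values that lie exactly on the line y = 2 + 3x.
Reading a as the intercept and b as the gradient is the usual convention for y = a + bx.

```python
# Hypothetical data lying exactly on the line y = 2 + 3x
x = [0, 1, 2, 3, 4]
y = [2, 5, 8, 11, 14]

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n

# Least-squares estimate of the gradient b
numerator = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
denominator = sum((xi - mean_x) ** 2 for xi in x)
b = numerator / denominator

# The fitted line passes through the point of means, which gives the intercept a
a = mean_y - b * mean_x

# In a linear relationship y changes by the same amount for each unit step in x
steps = [y[i + 1] - y[i] for i in range(n - 1)]

print(a, b, steps)
```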
Non-linear correlation
Variables whose values change together, but not at a constant rate, are said to have a
non-linear correlation.
Example of an equation: y = a + bx + cx²
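A short sketch shows what "not at a constant rate" means for the quadratic equation. The
coefficient values are hypothetical and chosen only to keep the arithmetic simple.

```python
# Hypothetical coefficients for y = a + b*x + c*x**2
a, b, c = 1, 2, 1

def y(x):
    """Evaluate the quadratic relationship at x."""
    return a + b * x + c * x ** 2

xs = [0, 1, 2, 3, 4]
ys = [y(x) for x in xs]

# Change in y for each unit increase in x
steps = [ys[i + 1] - ys[i] for i in range(len(ys) - 1)]

print(ys, steps)
```

Each unit increase in x produces a larger change in y than the one before, so the relationship
cannot be described by a single constant gradient.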
Coefficient of Correlation (r)
The test statistic for correlation analysis is the coefficient of correlation (r).
The coefficient of correlation is widely used in data analysis to give an estimate of the
strength and direction of the relationship between two variables.
r thus quantifies the relationship besides indicating its direction, i.e. it is a measure of
the degree of relationship.
r takes values between −1 and +1.
If r is negative (−), there is a negative relationship.
Where r is positive (+), there is a positive relationship.
Where the magnitude of r is lower than 0.5, the relationship is said to be weak, while where
it is equal to or greater than 0.6, it is deemed strong.
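The interpretation rules can be collected into a small helper. This is only a sketch:
`describe_r` is a hypothetical name, and the cut-offs are read as 0.5 and 0.6 on the scale of
r, applied to its magnitude, which is an assumption because r cannot exceed 1. Values between
the two cut-offs are not classified by these notes and are reported as such.

```python
def describe_r(r):
    """Return (direction, strength) for a correlation coefficient r."""
    if not -1 <= r <= 1:
        raise ValueError("r must lie between -1 and +1")

    # Direction comes from the sign of r
    if r > 0:
        direction = "positive"
    elif r < 0:
        direction = "negative"
    else:
        direction = "none"

    # Strength comes from the magnitude of r (assumed cut-offs 0.5 and 0.6)
    size = abs(r)
    if size >= 0.6:
        strength = "strong"
    elif size < 0.5:
        strength = "weak"
    else:
        strength = "between weak and strong"  # band not classified in the notes

    return direction, strength

for value in (0.8, -0.3, 0.55):
    print(value, describe_r(value))
```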