Descriptive Statistics
Research Methodology
Faculty of Dentistry
University of Malaya
Semester 1 Session 2023/2024
‘Abqariyah Yahya, PhD
abqariyahyahya@[Link]
Learning outcomes
• Methods of summarizing data
Dr 'Abqariyah Yahya, Semester 1 Session 2023/2024 2
Type of Statistics
Inferential Statistics
• Information about the sample is used to infer to the population
• E.g. of statistics used: p value, 95% CI (hypothesis testing and interval estimation)
• Extrapolation is part & parcel of the process
Descriptive Statistics
• Describes information of the sample
• Statistics presented should be interpreted with regards to the sample
• Limited to measures of location, central tendency and dispersion
• Extrapolation is judgmental
(a) Measures of central tendency
Parameter | Definition | Example
Mean | The sum of the values divided by the number of cases: $\bar{X} = \sum X_i / n$ | (4+7+5+9+5)/5 = 6
Median | The middle value of the sorted data, at position (n+1)/2 | Ordered: 4, 5, 5, 7, 9; (n+1)/2 = (5+1)/2 = 3; third value, median = 5
Mode | The value with the largest frequency | Mode = 5 (appears twice)
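The worked example in the table can be checked with a few lines of Python. This is a minimal sketch using the standard-library `statistics` module; the variable names are illustrative and not part of the lecture.

```python
import statistics

# The five observations used in the worked example.
data = [4, 7, 5, 9, 5]

mean = statistics.mean(data)      # sum of the values / number of cases
median = statistics.median(data)  # middle value of the sorted data
mode = statistics.mode(data)      # value with the largest frequency

print(f"mean={mean}, median={median}, mode={mode}")
```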
Using Measures of Central Tendency
• Which measure is the best to use?
• Depends on two factors:
i. The scale of the measurement (ordinal or numerical)
ii. The shape of the distribution
• If outlying observations occur in one direction only, the distribution is skewed
Using Measures of Central Tendency
• Some guidelines:
• The mean is used for numerical data with a symmetrical (not skewed) distribution.
• The median is used for ordinal data or for numerical data if the distribution is
skewed.
• The mode is used primarily for bimodal distributions.
• The geometric mean is used primarily for observations measured on a
logarithmic scale.
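As a rough illustration of these guidelines, the sketch below uses two small hypothetical data sets (not from the lecture): one with a single extreme value, where the mean and median disagree, and one spread over a logarithmic scale, where the geometric mean is the natural summary. `statistics.geometric_mean` requires Python 3.8 or later.

```python
import math
import statistics

# Hypothetical right-skewed sample: one large value pulls the mean upward.
skewed = [4, 5, 5, 7, 9, 60]
mean_skewed = statistics.mean(skewed)      # 15
median_skewed = statistics.median(skewed)  # 6.0

# Hypothetical values spread over a logarithmic scale.
log_scale = [1, 10, 100]
geo_mean = statistics.geometric_mean(log_scale)
# Equivalent definition: exponentiate the mean of the logs.
geo_mean_via_logs = math.exp(statistics.mean(math.log(x) for x in log_scale))

print(mean_skewed, median_skewed, round(geo_mean, 6))
```

Here the median stays near the bulk of the skewed data while the mean does not, and the geometric mean of 1, 10 and 100 sits at the middle value on the log scale.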
(b) Measures of variation/ spread
• Do all the observations tend to be quite similar and therefore lie close
to the center, or are they spread out across a broad range of values?
Measures of Dispersion
• The extent to which values in a distribution deviate from their central
tendency.
Two distributions with identical means,
medians and modes.
Measures of dispersion
Parameter | Definition | Example
1. Range | Difference between the maximum and the minimum values | 4, 5, 5, 7, 9: Max - Min = 9 - 4 = 5
2. Variance | The average of the squared deviations from the mean (with divisor n - 1) | $S^2 = \sum (X_i - \bar{X})^2 / (n - 1)$
3. Standard deviation | Square root of the variance; measures the variability around the mean in the same measurement unit as the data | $S = \sqrt{S^2}$
4. Coefficient of variation | The ratio of the standard deviation to the mean; describes the dispersion of the variable in a way that does not depend on the measurement unit | $CV = S / \bar{X}$
5. Inter-quartile range | The spread of the “middle fifty” percent of a data set | $Q_3 - Q_1$
6. Percentile | The value at or below which a given percentage of the distribution falls | 95th percentile of weight = 12 kg: in the data, 95% weigh 12 kg or less
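The measures in this table can be computed for the same five observations (4, 5, 5, 7, 9) with the standard-library `statistics` module. This is a sketch only: quartiles are defined by several slightly different conventions, so the Q1 and Q3 returned here may differ from what other software reports.

```python
import math
import statistics

data = [4, 5, 5, 7, 9]

value_range = max(data) - min(data)   # 9 - 4 = 5
variance = statistics.variance(data)  # sample variance, divisor n - 1
sd = statistics.stdev(data)           # square root of the variance
cv = sd / statistics.mean(data)       # unit-free ratio SD / mean

# Three cut points split the ordered data into four intervals.
q1, q2, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1

print(f"range={value_range}, variance={variance}, SD={sd}, CV={cv:.3f}, IQR={iqr}")
```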
Summary of quantiles
No. of intervals | Quantile name | No. of quantiles | Description in ordered set
2 | Median | 1 | 50% of observations both above and below the median
4 | Quartiles | 3 | 25% of observations below 1st, above 3rd and between successive quartiles
5 | Quintiles | 4 | 20% of observations below 1st, above 4th and between successive quintiles
10 | Deciles | 9 | 10% of observations below 1st, above 9th and between successive deciles
100 | Percentiles | 99 | 1% of observations below 1st, above 99th and between successive percentiles
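The pattern in this summary (k intervals need k - 1 cut points) can be confirmed with `statistics.quantiles`. The data set below is hypothetical, chosen only to have enough values for percentiles.

```python
import statistics

values = list(range(1, 101))  # hypothetical ordered data: 1, 2, ..., 100

quantile_types = [("median", 2), ("quartiles", 4), ("quintiles", 5),
                  ("deciles", 10), ("percentiles", 100)]

cut_point_counts = {}
for name, intervals in quantile_types:
    # quantiles() returns the cut points that split the data into n intervals.
    cut_points = statistics.quantiles(values, n=intervals)
    cut_point_counts[name] = len(cut_points)
    print(f"{name}: {intervals} intervals, {len(cut_points)} cut points")
```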
Measures of skewness
Pearson's second skewness coefficient: $SK = 3(\bar{X} - Md) / S$, where Md is the median and S is the standard deviation.
Value | Interpretation
SK = 0 | Symmetry
0 < SK < 3 | Skewed to the right
-3 < SK < 0 | Skewed to the left
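A minimal sketch of this coefficient in Python, applied to the observations 4, 5, 5, 7, 9 used earlier (mean 6, median 5, SD 2). The function name `pearson_skewness` is illustrative, not a library function.

```python
import statistics

def pearson_skewness(values):
    """Pearson's second skewness coefficient: 3 * (mean - median) / SD."""
    mean = statistics.mean(values)
    median = statistics.median(values)
    sd = statistics.stdev(values)  # sample SD (divisor n - 1)
    return 3 * (mean - median) / sd

sk = pearson_skewness([4, 5, 5, 7, 9])
print(sk)  # 1.5, i.e. skewed to the right
```

A perfectly symmetric sample such as 1, 2, 3, 4, 5 has equal mean and median, so the same function returns 0.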
Using different measures of dispersion
• Some guidelines:
• The SD is used when the mean is used, i.e. with symmetric numerical data.
• The range is used with numerical data when the purpose
is to emphasize extreme values
• The CV is used when the intent is to compare numerical
distributions measured on different scales.
Using different measures of dispersion
• Some guidelines:
• Percentiles and the IQR are used in two situations:
i. When the median is used, e.g. reported as median (IQR);
ii. When the mean is used but the objective is to
compare individual observations with a set of norms
2. Summarizing/ displaying nominal & ordinal
data with numbers
a) Meaningful statistic:
i. Proportion
ii. Percentage
iii. Ratio
iv. Rate
i. Proportion
• A proportion is the number a (with a given characteristic) divided by
the total number of observations, a+b:
$\text{Proportion} = \dfrac{a}{a+b}$
• A part divided by the whole
• Useful for ordinal, nominal and numerical data
ii. Percentage
• Percentage is the proportion multiplied by 100%
$\text{Percentage} = \dfrac{a}{a+b} \times 100\%$
iii. Ratio
• The quantitative relation between two amounts, showing the number
of times one value contains or is contained within the other.
• A ratio says how much of one thing there is compared to another
thing.
• There are 3 blue squares to 1 yellow square
• Ratios can be shown in different ways:
• Using the ":" to separate the values: 3 : 1
• Instead of the ":" we can use the word "to": 3 to 1
• Or write it like a fraction: 3/1
iv. Rate
• Rates are used by people every day, such as when they work 40 hours
a week or earn interest every year at a bank.
• When rates are expressed as a quantity of 1, such as 2 feet per
second or 5 miles per hour, they are called unit rates.
• Similar to proportions except that a multiplier (e.g. 1000, 10,000 or
100,000) is used, and they are computed over a specified period of
time
• The multiplier is called the base
$\text{Rate} = \dfrac{a}{a+b} \times \text{Base}$
Example of rate
• If the aspirin study among physicians had lasted exactly 1 year, the
rate of myocardial infarction (MI) per 10,000 physicians taking aspirin
per year would be (139/11,037) × 10,000 = 0.0126 × 10,000 ≈ 126 per
10,000 physicians per year
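The four statistics (proportion, percentage, ratio, rate) can be computed side by side from the aspirin figures quoted in this example. This is a sketch: the variable names are illustrative, and the 1-year study duration is the same assumption the example makes.

```python
# Figures from the aspirin example: 139 MIs among 11,037 physicians on aspirin.
a = 139         # physicians with an MI
total = 11_037  # a + b, all physicians taking aspirin
b = total - a   # physicians without an MI

proportion = a / total         # a part divided by the whole
percentage = proportion * 100  # the proportion expressed per 100
ratio = a / b                  # one part compared with the other part
base = 10_000                  # multiplier chosen for the rate
rate = proportion * base       # events per 10,000 per year (1-year study assumed)

print(f"proportion={proportion:.4f}, percentage={percentage:.2f}%, "
      f"ratio=1:{b / a:.1f}, rate={rate:.0f} per {base:,} per year")
```

Note that the proportion, percentage and rate all share the denominator a+b, while the ratio compares a with b directly.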
2. Summarizing/ displaying nominal & ordinal
data with numbers
b) Tables & Graphs
i. Simple frequency table
ii. Contingency table
iii. Bar chart
iv. Pie chart
i. Simple frequency table
• Summarizing data using tables
sbp interval   |      Freq.     Percent        Cum.
---------------+-----------------------------------
 94.0 - 113.2  |         21       21.00       21.00
113.2 - 132.4  |         36       36.00       57.00
132.4 - 151.6  |         29       29.00       86.00
151.6 - 170.8  |         12       12.00       98.00
170.8 - 190.0  |          2        2.00      100.00
---------------+-----------------------------------
Total          |        100      100.00
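The Percent and Cum. columns follow directly from the frequencies. The sketch below rebuilds them from the counts in the sbp table; the list names are illustrative.

```python
from itertools import accumulate

# Class upper bounds and frequencies copied from the sbp table.
upper_bounds = [113.2, 132.4, 151.6, 170.8, 190.0]
freqs = [21, 36, 29, 12, 2]

total = sum(freqs)
percents = [100 * f / total for f in freqs]  # relative frequency in %
cumulative = list(accumulate(percents))      # running total of the percents

for bound, f, p, c in zip(upper_bounds, freqs, percents, cumulative):
    print(f"<= {bound:6.1f} | {f:5d} {p:8.2f} {c:8.2f}")
print(f"{'Total':>9} | {total:5d} {sum(percents):8.2f}")
```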
ii. Contingency table
Table #: Favorite way to eat ice cream, by gender
gender | cup | cone | sundae | sandwich | other | total
male | 592 | 300 | 204 | 24 | 80 | 1200
female | 410 | 335 | 180 | 20 | 55 | 1000
total | 1002 | 635 | 384 | 44 | 135 | 2200
ii. Contingency table
Table #: Birth weight by medical condition of mother
Medical condition of mother | Birth weight less than 2.5 kg | Birth weight equal to or greater than 2.5 kg
Yes | 26 (44.8%) | 32 (55.2%)
No | 103 (10.0%) | 927 (90.0%)
Total | 129 (11.9%) | 959 (88.1%)
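The percentages in this table are row percentages: each count divided by its row total. A minimal sketch of that calculation from the counts in the table follows; the helper name `row_percentages` is illustrative.

```python
# Counts from the birth-weight table: (less than 2.5 kg, 2.5 kg or more).
counts = {"Yes": (26, 32), "No": (103, 927)}

def row_percentages(row):
    """Return each cell as a percentage of its row total, to 1 decimal place."""
    total = sum(row)
    return tuple(round(100 * cell / total, 1) for cell in row)

totals = tuple(sum(col) for col in zip(*counts.values()))  # column totals
table = {label: row_percentages(row) for label, row in counts.items()}
table["Total"] = row_percentages(totals)

for label, pcts in table.items():
    print(label, pcts)
```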
Summarizing data using graphs
Displaying Qualitative Data
Bar Chart
[Figure: bar chart "Birth Order of Spring 1998 Stat 250 Students"; horizontal axis: Birth Order (Middle, Oldest, Only, Youngest); vertical axis: Percent; n = 92 students]
Bar Chart
• Summarizes categorical data.
• Horizontal axis represents categories, while vertical axis represents
either counts (“frequencies”) or percentages (“relative frequencies”).
• Used to illustrate the differences in percentages (or counts) between
categories.
Pie Chart
Displaying Quantitative Data
Histogram
• Divide the measurement scale into equal-width categories (bins).
• Determine number (or percentage) of measurements falling into each
category.
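The two steps can be sketched in plain Python. The readings below are hypothetical, and the function `histogram_counts` is an illustration rather than a library routine; changing `n_bins` shows how too few or too many categories alter the picture.

```python
def histogram_counts(values, n_bins):
    """Count how many values fall into each of n_bins equal-width categories."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for v in values:
        # The maximum value is placed in the last bin rather than a new one.
        index = min(int((v - lo) / width), n_bins - 1)
        counts[index] += 1
    return counts

# Hypothetical systolic blood pressure readings (mmHg).
sbp = [94, 101, 108, 115, 118, 121, 124, 127, 130, 133, 139, 146, 152, 168, 190]
five_bins = histogram_counts(sbp, 5)
one_bin = histogram_counts(sbp, 1)  # too few categories: all shape is lost
print(five_bins, one_bin)
```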
Too few categories
Too many categories
Stem-and-Leaf Plot
Stem-and-leaf of Shoes N = 139 Leaf Unit = 1.0
12 0 223334444444
63 0 555555555555566666666677777778888888888888999999999
(33) 1 000000000000011112222233333333444
43 1 555555556667777888
25 2 0000000000023
12 2 5557
8 3 0023
4 3
4 4 00
2 4
2 5 0
1 5
1 6
1 6
1 7
1 7 5
Stem-and-Leaf Plot
• Summarizes measurement data.
• Each data point is broken down into a “stem” and a “leaf.”
• First, “stems” are aligned in a column.
• Then, “leaves” are attached to the stems.
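These steps can be sketched as follows, assuming a leaf unit of 1 so that the tens digit is the stem and the units digit is the leaf. The data are hypothetical, and unlike the Minitab-style output shown earlier this sketch prints one row per stem and no cumulative counts.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Group values by stem (tens digit) with leaves (units digit) in order."""
    stems = defaultdict(list)
    for v in sorted(values):
        stems[v // 10].append(v % 10)  # split each data point into stem and leaf
    return dict(stems)

# Hypothetical numbers of pairs of shoes owned.
shoes = [2, 5, 8, 10, 12, 13, 15, 20, 23, 30, 75]
plot = stem_and_leaf(shoes)
for stem, leaves in sorted(plot.items()):
    print(f"{stem} | {''.join(map(str, leaves))}")
```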
Box Plot
Box Plot
• Summarizes measurement data.
• Vertical (or horizontal) axis represents measurement scale.
• Lines in box represent the 25th percentile (“first quartile”), the 50th
percentile (“median”), and the 75th percentile (“third quartile”),
respectively.
Box Plot
• Roughly speaking:
• The “25th percentile” is the number such that 25% of the data points fall
below the number.
• The “median” or “50th percentile” is the number such that half of the data
points fall below the number.
• The “75th percentile” is the number such that 75% of the data points fall
below the number.
Using Box Plots to Compare
Quantile-Quantile Plots (QQ Plots)
• Numerical data
• Visually compare collected data with a known distribution – normal
distribution (Normal QQ plots)
• We check to see whether the sample follows a normal distribution
• Note:
• In a normal QQ plot, if your scatter plot “hugs” the line, there is good reason
to believe that your data is normally distributed.
What if your data is not normally distributed?
• Perform transformation
• For right / positively skewed data
• Log or;
• Square root
• For left/ negatively skewed data
• Exponential or;
• Square
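As an illustration of the right-skew case, the sketch below applies the log and square-root transformations to a small hypothetical sample and compares Pearson's second skewness coefficient, 3 × (mean - median) / SD, before and after. The data and function name are illustrative only.

```python
import math
import statistics

def pearson_skewness(values):
    """Pearson's second skewness coefficient: 3 * (mean - median) / SD."""
    return (3 * (statistics.mean(values) - statistics.median(values))
            / statistics.stdev(values))

# Hypothetical right-skewed measurements (all positive, so the log is defined).
raw = [1, 2, 2, 3, 4, 5, 8, 20, 50]
logged = [math.log(x) for x in raw]
rooted = [math.sqrt(x) for x in raw]

sk_raw = pearson_skewness(raw)
sk_log = pearson_skewness(logged)
sk_sqrt = pearson_skewness(rooted)
print(f"raw={sk_raw:.2f}, sqrt={sk_sqrt:.2f}, log={sk_log:.2f}")
```

For this sample both transformations pull the coefficient toward 0, with the log acting more strongly than the square root.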
Histograms vs Boxplots vs QQ plots
Advantages:
• Histograms: with properly sized bins, can summarize any shape of the data (modes, skew, quantiles, outliers).
• Boxplots: no need to tweak “graphical” parameters (e.g. bin size in histograms); summarize skew, quantiles, and outliers; can compare several measurements side-by-side.
• QQ plots: can identify whether the data came from a certain distribution; no need to tweak “graphical” parameters (e.g. bin size in histograms); summarize quantiles.
Histograms vs Boxplots vs QQ plots
Disadvantages:
• Histograms: difficult to compare side-by-side (take up too much space in a plot); interpretation may differ depending on the size of the bins.
• Boxplots: cannot distinguish modes!
• QQ plots: difficult to compare side-by-side; difficult to distinguish skews, modes, and outliers.
Dotplot
• A dotplot is a type of graphic display used to compare
frequency counts within categories or groups.
Dotplot
• Each dot represents a specific number of observations from a set of
data. (Unless otherwise indicated, assume that each dot represents
one observation. If a dot represents more than one observation, that
should be explicitly noted on the plot.)
Which graph to use when?
• Stem-and-leaf plots and dotplots are good for small data sets, while
histograms and box plots are good for large data sets.
• Boxplots and dotplots are good for comparing two groups.
Which graph to use when?
• Boxplots are good for identifying outliers.
• Histograms, boxplots and dotplots are good for identifying the “shape” of
data.
Scatter Plots
Scatter Plots
• Summarizes the relationship between two measurement variables.
• Horizontal axis represents one variable and vertical axis represents
second variable.
• Plot one point for each pair of measurements.
No relationship
Which graph to use?
• Depends on type of data
• Depends on what you want to illustrate
• Boxplots are good for identifying outliers.
• Histograms and boxplots are good for identifying “shape” of data.
• Depends on statistical software
Descriptive Statistics using SPSS
Introduction
• All statistical analyses will be conducted using the same dataset, called
[Link]. A brief description of the dataset follows:
• The dataset contains 5,811 observations. It comes from a study conducted at a
university on the pattern of anxiety and resilience among university
students during the COVID-19 pandemic. The main aim of the study is to
measure the association between anxiety and resilience in this group.
Summarizing Data
Frequencies
Saving syntax
We can customize the syntax file by adding
more commands, from opening a dataset
onwards.
You can also save the Syntax file for later use.
Please give a meaningful name and save it in an
appropriate folder.
Frequencies
Descriptives
Explore
Normality Test
Thank you