Introduction to
Statistics 2
Before we continue
Have we considered the quality of measurement?
• You should ensure that the data collected are of good quality
• By the person collecting data
• Environment where data is collected
• Data collection tool
Your Questionnaire
How Valid? How Reliable?
Validity: The extent to which the tool measures what it is intended to
measure. For a tool that classifies people (e.g. a screening questionnaire), it
can be assessed by its sensitivity and specificity against a reference standard
Reliability: How close repeated measurements are to one another. It is also
called reproducibility, repeatability, or precision
How to Ensure Reliability and
Validity
• Get the right questions: Adopt or adapt a (validated) tool, construct
from existing literature
• Pretest your tool
• Expert review: is a question necessary? Is it useful but not essential?
Is it essential?
• Cronbach's alpha reliability test: Perhaps the most common measure of
reliability (internal consistency). Compute it for each scale separately, not
for the entire questionnaire. A value of 0.7 and above indicates acceptable
internal consistency
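As a rough illustration of the Cronbach's alpha bullet, the sketch below computes alpha for one scale using the usual formula alpha = k/(k − 1) × (1 − sum of item variances / variance of total scores). The function name `cronbach_alpha` and the scores are hypothetical, made up only for this example.

```python
from statistics import variance

def cronbach_alpha(items):
    """items: k lists (one per question), each with one score per respondent."""
    k = len(items)
    sum_item_variances = sum(variance(scores) for scores in items)
    # Total score of each respondent across the k items
    totals = [sum(answers) for answers in zip(*items)]
    return (k / (k - 1)) * (1 - sum_item_variances / variance(totals))

# Hypothetical 3-item scale answered by 5 respondents (1-5 Likert scores)
scale = [
    [4, 3, 5, 2, 4],
    [5, 3, 4, 2, 4],
    [4, 2, 5, 3, 5],
]
alpha = cronbach_alpha(scale)
print(round(alpha, 2))  # 0.89, above the 0.7 threshold
```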
Measures of Central Tendency
• They are measures by which the different values of a variable in a
population can be summarized in a single numerical value
• MCT gives that value in any given set of numerical data that is
assumed to be typical of all other values
• It is the “average” value of any set of observations indicating the
central position or location or usual value representative of all other
values of the data
• Simply put, it is the value among the set of observed data that can be
used to adequately represent the entire set of observations.
Examples of Measures of Central
Tendency
• Mean
• Arithmetic mean
• Harmonic mean
• Geometric mean
• Median
• Mode
Mean
• Calculated as the ratio of the sum of all observations to the number of
observations
• Best when data are normally distributed, because the mean is sensitive to
extreme values/outliers
• Usually reported along with the SD
• Used in conjunction with the SD in the determination of test statistics
for some parametric tests
Arithmetic Mean
• It is the numerical value each individual unit or contributor will have
when the total observations or values are shared equally among all
contributors
• It is considered the average value of all observations
• It is calculated by adding all observations and dividing the sum by the
number of observations
E.g.: The ages in years of 8 final-year medical students are given as 22,
23, 25, 24, 24, 22, 21, 23. Calculate the mean age.
Mean = (22 + 23 + 25 + 24 + 24 + 22 + 21 + 23)/8 = 184/8 = 23 years
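The same calculation can be checked in a few lines of Python. This is only a sketch using the standard `statistics` module; the variable names are illustrative.

```python
from statistics import mean

ages = [22, 23, 25, 24, 24, 22, 21, 23]  # ages from the example above

mean_manual = sum(ages) / len(ages)  # 184 / 8
mean_age = mean(ages)                # same result from the library

print(mean_manual, mean_age)
```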
Harmonic Mean
• This is the number of observations divided by the sum of the
reciprocal of the values of the set of observations
• It is the reciprocal of the mean of the reciprocals
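• For example, the systolic blood pressures (in mmHg) of four
hypertensive patients are 150, 125, 200 and 150. The harmonic mean of
the SBP is:
Reciprocals: 1/150 ≈ 0.00667, 1/125 = 0.008, 1/200 = 0.005, 1/150 ≈ 0.00667; sum ≈ 0.02633
HM = 4/(1/150 + 1/125 + 1/200 + 1/150) = 4/0.02633 ≈ 152 mmHg
(Rounding each reciprocal to three decimal places gives a sum of 0.027 and
HM ≈ 148, so carry extra digits in the intermediate steps.)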
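A short sketch of the harmonic mean for the four SBP values, using the standard `statistics` module. It also shows how sensitive the answer is to rounding: with full precision the result is about 151.9 mmHg, while rounding each reciprocal to three decimal places first gives about 148.1 mmHg.

```python
from statistics import harmonic_mean

sbp = [150, 125, 200, 150]  # systolic blood pressures from the example

# n divided by the sum of reciprocals, at full precision
hm_manual = len(sbp) / sum(1 / x for x in sbp)
hm_library = harmonic_mean(sbp)

# Same formula, but with each reciprocal rounded to 3 decimal places first
hm_rounded = len(sbp) / sum(round(1 / x, 3) for x in sbp)

print(round(hm_manual, 1), round(hm_library, 1), round(hm_rounded, 1))
# 151.9 151.9 148.1
```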
Geometric Mean
• Multiply the numbers/values in a data set and take the nth root of the
multiplied numbers, where n is the total number of data values.
• For example, the ages (in years) of 3 newly admitted children in the
paediatric ward are 2, 4, and 8. Calculate the GM of ages
GM = ∛(2 × 4 × 8) = ∛64 = 4 years
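The geometric mean of the three ages can be verified as follows. This is a minimal sketch; floating-point arithmetic may return a value a hair below 4, so the comparison uses a tolerance.

```python
import math
from statistics import geometric_mean

ages = [2, 4, 8]  # ages from the example

gm_manual = math.prod(ages) ** (1 / len(ages))  # cube root of 64
gm_library = geometric_mean(ages)

# Both are 4 up to floating-point rounding
print(math.isclose(gm_manual, 4), math.isclose(gm_library, 4))
```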
Median
• Determined by picking the middle value when observations are
sorted in ascending/descending order (with an even number of observations,
it is the mean of the two middle values)
• Preferred when data are skewed, because it is not affected by extreme values
• Usually reported along with the range or interquartile range
• Used in non-parametric tests
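To see why the median suits skewed data, the sketch below reuses the eight student ages from the arithmetic mean example and then adds one hypothetical outlier (a 60-year-old student, invented for illustration). The median stays put while the mean is pulled upward.

```python
from statistics import mean, median

ages = [22, 23, 25, 24, 24, 22, 21, 23]
# Sorted: 21, 22, 22, 23, 23, 24, 24, 25 -> two middle values are 23 and 23
median_age = median(ages)

# Hypothetical outlier added to show its effect
ages_with_outlier = ages + [60]
median_with_outlier = median(ages_with_outlier)
mean_with_outlier = mean(ages_with_outlier)

print(median_age, median_with_outlier, round(mean_with_outlier, 1))
```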
Mode
• The observation or figure in a series of observations that occurs most
frequently
• E.g. weights (in kg) of 6 obese adult patients are 86, 93, 85, 93, 91, 89
• The mode is 93
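A quick check of the mode with the standard `statistics` module. `multimode` is included because a data set can have more than one most frequent value; here there is only one.

```python
from statistics import mode, multimode

weights = [86, 93, 85, 93, 91, 89]  # weights in kg from the example

most_common = mode(weights)      # the single most frequent value
all_modes = multimode(weights)   # list of every most frequent value

print(most_common, all_modes)  # 93 [93]
```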
Measures of Dispersion
• Also called measures of variation, spread or scatter
• These measures are usually reported alongside measures of central
tendency
• They indicate how far the other values of the data are from
the central value, such as the arithmetic mean
• They show whether the individual observations lie close together or widely
apart.
Examples of Measures of
Dispersion
• Standard deviation: The most commonly used measure of dispersion. It is the
square root of the variance and is reported alongside the mean.
• Variance: The mean of the squared deviations of each
observation from the arithmetic mean (for a sample, the sum of squared
deviations is divided by n − 1 rather than n)
• Mean absolute deviation
• Range
• Interquartile range
• Coefficient of variation
• ratio of the SD to the mean
• Expressed as a percentage
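The measures listed above can be sketched for the eight student ages from the arithmetic mean example. Note the assumptions: `variance` and `stdev` use the sample formula (divide by n − 1), and `quantiles` uses Python's default "exclusive" method, so other software may report slightly different quartiles.

```python
from statistics import mean, quantiles, stdev, variance

ages = [22, 23, 25, 24, 24, 22, 21, 23]

m = mean(ages)                              # 23
var = variance(ages)                        # sum of squared deviations / (n - 1) = 12 / 7
sd = stdev(ages)                            # square root of the variance
mad = mean(abs(x - m) for x in ages)        # mean absolute deviation
data_range = max(ages) - min(ages)          # 25 - 21
q1, q2, q3 = quantiles(ages, n=4)           # quartiles (default "exclusive" method)
iqr = q3 - q1                               # interquartile range
cv = sd / m * 100                           # coefficient of variation, in percent

print(round(var, 2), round(sd, 2), mad, data_range, iqr, round(cv, 1))
```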
Normal Distribution
• The most famous probability distribution
• Describes continuous variables
• Also called the Gaussian distribution
• Symmetrical about its mean
• The mean divides the curve into 2 equal halves
• The total area under the normal curve is 1
• Mean, median and mode are equal
• Major role is in statistical inference
Normal Distribution
• Handles probability problems for quantitative continuous outcomes
when the mean and standard deviation are known
• It is possible to make probability statements about observing a range
of values of the quantitative continuous outcome
• Given a mean BP and its SD what is the chance of observing a BP
greater than a certain value, lower than a value OR between two
specified values
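The three kinds of probability statement above can be sketched with `statistics.NormalDist`. The mean of 120 mmHg and SD of 10 mmHg are hypothetical values chosen only to make the arithmetic easy to follow.

```python
from statistics import NormalDist

# Hypothetical population: mean BP 120 mmHg, SD 10 mmHg
bp = NormalDist(mu=120, sigma=10)

p_above_140 = 1 - bp.cdf(140)            # greater than a value (2 SD above the mean)
p_below_110 = bp.cdf(110)                # lower than a value (1 SD below the mean)
p_between = bp.cdf(130) - bp.cdf(110)    # between two values (within 1 SD)

# About 0.023, 0.159 and 0.683
print(round(p_above_140, 3), round(p_below_110, 3), round(p_between, 3))
```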
Normal Distribution
Curve
• About 68%, 95% and 99.7% of values lie within ±1, 2 and 3 SD of the mean, respectively
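These percentages can be reproduced from the standard normal distribution (mean 0, SD 1). A minimal sketch; the dictionary name `coverage` is illustrative.

```python
from statistics import NormalDist

z = NormalDist()  # standard normal: mean 0, SD 1

# Share of values lying within k standard deviations of the mean
coverage = {k: z.cdf(k) - z.cdf(-k) for k in (1, 2, 3)}

for k, share in coverage.items():
    print(f"within {k} SD: {share:.1%}")
```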
STOP!
Before you select your
statistical tests
Is that data normally
distributed or not?
Maybe we should do a Normality
Test
• Shapiro-Wilk test
• Kolmogorov-Smirnov test
• Check the p-value (p > 0.05 means no significant departure from normality)
and the charts (histogram, box-and-whisker plot, scatter diagram)
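In practice these tests are run in a statistical package, which also reports the p-value. Purely to show the idea behind the Kolmogorov-Smirnov test, the sketch below computes only its test statistic D, the largest gap between the sample's cumulative distribution and a normal curve fitted to the same sample. The function name `ks_statistic` is hypothetical, and because the mean and SD are estimated from the data, standard Kolmogorov-Smirnov tables do not strictly apply to this D.

```python
from statistics import NormalDist

def ks_statistic(data):
    """Largest gap between the empirical CDF and a normal CDF fitted to the data."""
    xs = sorted(data)
    n = len(xs)
    fitted = NormalDist.from_samples(xs)  # mean and SD estimated from the sample
    d = 0.0
    for i, x in enumerate(xs, start=1):
        f = fitted.cdf(x)
        # Gap just after and just before the step of the empirical CDF at x
        d = max(d, i / n - f, f - (i - 1) / n)
    return d

ages = [22, 23, 25, 24, 24, 22, 21, 23]  # student ages from the mean example
d = ks_statistic(ages)
print(round(d, 2))  # about 0.15; small D means the sample tracks the normal curve
```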
Parameter Versus Statistic
• Parameters are quantities or values (e.g. mean, SD) derived from the
whole population. It is usually impossible or difficult to get data from
the whole population, so the values of parameters are rarely known
• Parameters are usually fixed
• Statistics are quantities or values derived from a sample. A statistic is
an estimate of a parameter.
• Statistics will typically be different as different samples are drawn
from the population
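A small simulation can make the distinction concrete. Everything here is hypothetical: a made-up population of 10,000 SBP values, a fixed random seed so the run is repeatable, and three samples of 50. The population mean (the parameter) is a single fixed number, while each sample mean (a statistic) comes out a little different.

```python
import random
from statistics import mean

rng = random.Random(42)  # fixed seed so the example is repeatable

# Hypothetical population of systolic blood pressures
population = [rng.gauss(120, 15) for _ in range(10_000)]
parameter = mean(population)  # fixed for this population

# Three different samples of 50 people give three different statistics
sample_means = [mean(rng.sample(population, 50)) for _ in range(3)]

print(round(parameter, 1), [round(s, 1) for s in sample_means])
```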
Parametric Vs Non-Parametric
Test
| Parametric Tests | Non-Parametric Tests |
| --- | --- |
| Rely on the assumption that the distribution of the population is known, usually the normal distribution | Rely on no assumption about the population distribution; they are called distribution-free tests |
| Use the mean as the measure of central tendency | Use the median as the measure of central tendency |
| Used for quantitative data (interval or ratio scale) | Used for ordinal or nominal data, and for quantitative data that are not normally distributed |
| Probability distribution is normal | Probability distribution is arbitrary |
| Population knowledge required | Population knowledge not required |
| Example: Pearson correlation | Example: Spearman correlation |
Parametric and Non-Parametric
Equivalents
Parametric: Student t-test/Independent t-test/Two-sample t-test
Non-Parametric: Wilcoxon rank-sum test/Mann-Whitney U test
• Compare means between two distinct/independent groups
• For example: Is the mean systolic blood pressure (at baseline) for
patients assigned to placebo different from the mean for patients
assigned to the treatment group?
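To show what the independent t-test computes, here is a sketch of the Student t statistic with pooled variance (it assumes equal variances in the two groups). The function name `independent_t` and the blood pressure values are hypothetical. Statistical software would also give the p-value; by hand, the statistic is compared with the critical value from a t table.

```python
from math import sqrt
from statistics import mean, variance

def independent_t(a, b):
    """Student's t statistic with pooled variance; returns (t, degrees of freedom)."""
    na, nb = len(a), len(b)
    pooled = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    t = (mean(a) - mean(b)) / sqrt(pooled * (1 / na + 1 / nb))
    return t, na + nb - 2

# Hypothetical baseline SBP (mmHg) in two independent groups
placebo = [150, 148, 155, 160, 152]
treatment = [145, 150, 149, 158, 153]

t, df = independent_t(placebo, treatment)
print(round(t, 2), df)  # 0.66 8
```

With 8 degrees of freedom the two-sided 5% critical value is about 2.31, so a t of 0.66 gives no evidence that the baseline means differ.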
Parametric and Non-Parametric
Equivalents
Parametric: Paired t-test
Non-Parametric: Wilcoxon signed-rank test
• Compare two quantitative measurements taken from the same
individual
• For example, was there a significant change in systolic blood pressure
between baseline and the six-month follow-up measurement in the
treatment group?
• Give your Research example
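The paired t-test works on the within-person differences. The sketch below uses hypothetical baseline and six-month SBP values for five patients; `paired_t` is an illustrative name, not a library function.

```python
from math import sqrt
from statistics import mean, stdev

def paired_t(before, after):
    """Paired t statistic on within-person differences; returns (t, degrees of freedom)."""
    diffs = [b - a for b, a in zip(before, after)]
    n = len(diffs)
    t = mean(diffs) / (stdev(diffs) / sqrt(n))
    return t, n - 1

# Hypothetical SBP (mmHg) for the same five patients at two time points
baseline = [150, 160, 145, 155, 165]
six_months = [140, 152, 143, 148, 155]

t, df = paired_t(baseline, six_months)
print(round(t, 2), df)  # 5.04 4
```

With 4 degrees of freedom the two-sided 5% critical value is about 2.78, so a t of 5.04 would indicate a significant fall in SBP in this made-up data.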
Parametric and Non-Parametric
Equivalents
Parametric: Analysis of variance (ANOVA)
Non-Parametric: Kruskal-Wallis test
• Compare means between three or more distinct/independent groups.
• For example, if our experiment had three groups (e.g., placebo, new
drug 1, new drug 2), we might want to know whether the mean
systolic blood pressure at baseline differed among the three groups
• Give your Research example
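One-way ANOVA compares the variation between group means with the variation within groups. The sketch below computes the F statistic by hand for three hypothetical groups; `one_way_anova_f` is an illustrative name, and a statistics package would normally supply the p-value.

```python
from statistics import mean

def one_way_anova_f(*groups):
    """Returns (F, df_between, df_within) for a one-way ANOVA."""
    all_values = [x for g in groups for x in g]
    grand_mean = mean(all_values)
    k, n = len(groups), len(all_values)
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    f = (ss_between / (k - 1)) / (ss_within / (n - k))
    return f, k - 1, n - k

# Hypothetical baseline SBP (mmHg) in three independent groups
placebo = [150, 148, 155, 160, 152]
drug1 = [145, 150, 149, 158, 153]
drug2 = [140, 146, 150, 144, 145]

f, df_between, df_within = one_way_anova_f(placebo, drug1, drug2)
print(round(f, 2), df_between, df_within)  # 4.44 2 12
```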
Parametric and Non-Parametric
Equivalents
Parametric: Pearson coefficient of correlation
Non-Parametric: Spearman’s rank correlation
• Assesses the degree of association between two quantitative
variables.
• For example, is systolic blood pressure associated with the patient’s
age?
• Give your Research example
Correlation
• In general, correlation quantifies the degree to which two
variables vary together.
• If two variables are independent, then the value of one has no
relationship with the other.
• If they are correlated, then the value for one is related to the value of
the other, either high when the other is high (Direct, positive) or high
when the other is low (Inverse/Indirect/negative).
• The study of association between two variables is called correlation
analysis.
Correlation
• The correlation coefficient is measured on a scale that runs from +1
through 0 to −1 and is denoted by “r”. A perfect positive correlation
between two variables is expressed as +1, and a perfect negative
correlation as −1.
• Absence of linear correlation is represented by 0.
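Both correlation coefficients can be sketched in plain Python. Pearson's r uses the raw values; Spearman's coefficient is Pearson's r applied to the ranks. The helper names (`pearson_r`, `ranks`, `spearman_rho`) and the age and SBP values are hypothetical.

```python
from math import sqrt
from statistics import mean

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def ranks(values):
    """Ranks starting at 1; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for pos in range(i, j + 1):
            result[order[pos]] = average_rank
        i = j + 1
    return result

def spearman_rho(x, y):
    """Spearman rank correlation: Pearson's r computed on the ranks."""
    return pearson_r(ranks(x), ranks(y))

# Hypothetical ages (years) and SBP (mmHg) of five patients
age = [30, 40, 50, 60, 70]
sbp = [118, 125, 130, 142, 150]

r = pearson_r(age, sbp)
rho = spearman_rho(age, sbp)
r_negative = pearson_r(age, [-a for a in age])  # perfect inverse relationship

print(round(r, 3), rho, r_negative)  # 0.991 1.0 -1.0
```

Here SBP rises with every increase in age, so the rank correlation is exactly +1, while Pearson's r is slightly below 1 because the points do not fall on a perfectly straight line.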
Also Note
Qualitative data measured before and after in the same individuals: use the McNemar test
Prediction of an outcome variable: linear regression (quantitative outcome), logistic regression (binary outcome)
Thank You