Advanced Biostatistics
Kedir Hussein Abegaz
Biostatistics and Health informatics (Asst Professor)
Madda Walabu University
Tel: +251-913-012630
Email: kedir6300@[Link]
1. Introduction
1.1 Statistics
Defined in two senses
1. As statistical data: a numerical
representation of things
2. As statistical method: a field of study
that deals with the collection, organization, presentation,
analysis, and inference of data (COPAID).
It helps us understand the object under study better
– Statistical methods include:
1. Designing studies
2. Collecting data
3. Presenting data
4. Summarizing data
5. Drawing inferences
What is Biostatistics?
• It is the application of statistical methods to
the biological and life sciences.
Rationale of statistics
– Public health and medicine are becoming
increasingly quantitative
• Statistics is the language for assembling and handling
this quantitative material.
– The planning, conduct, and interpretation of
much of medical research are becoming
increasingly reliant on statistical technology
Rationale…
Among the rationale
– To enlarge our knowledge of complex phenomena
– To present facts in a definite and precise form
– Data reduction, Comparison, Estimation
– Measuring the magnitude of variations in data
– Testing and formulating of hypothesis
– Studying the relationship between variables
– Forecasting future events
Limitations of statistics
As a science, statistics has its own limitations
– Deals with only quantitative information
– Deals with aggregates of facts, not with individual data
items
– Data are approximately, not mathematically, correct
– Statistics can be easily misused
1.2 Data
• Data — The Key Component of a Study
– More important than the methods
used in the analysis are the use of the
appropriate study design and the proper
definition and measurement of the study
variables.
• No good study without good data!
1.3 Design
• Design — The Road to Relevant Data
• Obtaining relevant data requires a
carefully drawn plan that identifies
– the population of interest
– the procedure used to select study units
– the process used in the measurement of
the attributes of interest.
Design…
• Standard methods of data collection:
1. Surveys: deals with ways to select a
random sample that is representative of
the population of interest and from which
a valid inference can be made.
2. Experiment: involves the creation of a
plan for determining whether or not there
are differences between groups.
3. Records: provide ready-made data for
routine and continuous information.
Design…
• Sometimes we also analyze data that
were already collected.
• In this case, we need to understand
how the data were collected
– in order to determine the appropriate
methods of analysis
1.4 Replication
• Replication — Part of the Scientific
Method
– Statistical analysis of data may demonstrate
that there is a high probability of an
association between two variables.
– However, a single study rarely provides proof
that such an association exists.
– Results must be replicated by additional studies
that eliminate other factors that could have
accounted for the relationship observed between
the study variables.
1.5 Applying Statistical Methods
• It requires more than the ability to use
statistical software, deriving formula...
– It is understanding the context for the use
of statistical procedures (study’s goal, the
data, and how data collected and
measured)
• Think instead of simply memorizing
formulas, and statistical software
1.6 Scales of measurement
• Observations and variables:
– A variable is a characteristic under study that assumes
different values for different elements, e.g. blood
pressure, age, sex, …
– In statistics, we observe or measure
characteristics, called variables, of study
subjects, called observational units.
• The main divisions are qualitative (categorical) and
quantitative (numerical) variables.
Scales of measurement…
• Qualitative variable: a variable that can’t be
measured in quantitative form but can only
be identified by name or category
– E.g. place of birth, types of drug, stages of breast
cancer (I, II, III, or IV), degree of pain (minimal,
moderate, severe), …
Scales of measurement…
• Quantitative variable: a variable that can be
measured and expressed numerically. It
can be of two types (discrete or continuous).
– The values of a discrete variable are usually whole
numbers, e.g. the number of episodes of diarrhea in
the first five years of life.
– A continuous variable is a measurement on a
continuous scale, e.g. weight, height, blood
pressure, age, etc.
Types of measurement scales
• Nominal
– Data that represent categories or names
– There is no implied order to the categories of
nominal data.
– No arithmetic or relational operations can be
applied.
– E.g.
• Blood type (A, B, O and AB)
• Eye color (brown, black, blue, etc.)
• Sex (Male, Female)
Types of measurement scales…
• Ordinal
– Categories that can be ranked, but differences
between ranks are not meaningful
– Arithmetic operations are not applicable, but
relational operations are.
– Ordering is the sole property of ordinal scale.
– E.g.
• Degree of pain (minimal, moderate, severe)
• Rating scales (Excellent, Very good, Good, Fair, poor)
• Letter grade (A, B, C, D and F)
Types of measurement scales…
• Interval
– Data that can be ranked and differences are
meaningful. However, there is no meaningful
zero, so ratios are meaningless.
– All arithmetic operations except division are applicable, and
relational operations are also possible.
– E.g.
• IQ
• Temperature in degrees Fahrenheit (30 °F is not
twice as warm as 15 °F)
Types of measurement scales…
• Ratio
– Data can be ranked, differences are
meaningful, and there is a true zero.
– All arithmetic and relational operations are
applicable.
– E.g.
• Age (a 30-year-old is twice as old as a 15-year-old)
• Weight (0 kg means no weight)
• Number of drugs (0 means no drug)
1.7 Sources of data
Two sources: primary and secondary
1. Primary data: data collected by the user
directly from the source.
– Methods of collection
• Personal interview
(telephone, face-to-face…)
• Focus group discussion (FGD)
• Mail questionnaires
• Door-to-door survey
• New product registration
Sources of data…
2. Secondary data: data gathered or compiled
from published and unpublished sources.
– From journals, reports, government publications,
publications of professionals and research
organizations.
– E.g. - CSA: Central Statistical Agency
- DHS: Demographic and Health Survey
- HDS: Health and Demographic Surveillance
1.8 Division of statistics
Depending on how data can be used
• Descriptive statistics (Exploratory): is concerned with
summary calculations, graphs, charts and tables…
of a given data set.
• Inferential statistics (Confirmatory): is a method used
to generalize from a sample to a population.
– sometimes called analytical statistics
1.9 Stages in statistical investigation
Five stages
• Collection of data: the process of measuring,
gathering, and assembling the raw data upon which
the investigation is to be based.
• Organization of data: summarization of data in
some meaningful way, e.g. table form
Stages in…
• Presentation of the data: the process of
reorganization, classification, compilation… of data
to present it in a meaningful form.
• Analysis of data: the process of extracting
relevant information from the summarized data
• Inference of data: the interpretation and further
observation of the various statistical measures
obtained through the analysis of the data
– and the application of those methods by which
conclusions are formed and inferences made.
1.10 Types of questions
1. Open-ended questions
• Permit free responses
• Offer no list of possible answers to
choose from.
• Mostly used for investigation of
– Facts with which the researcher is not familiar
– Opinions, attitudes, and suggestions of
informants
– Sensitive issues
Types of questions…Example
• Can you describe exactly what the traditional
birth attendant did when your labor started?
• What sensations did you experience during
your cataract surgery?
• How do you feel when your baby’s diarrhea
does not stop?
Types of questions…Example
2. Close-ended Questions
• Offer a list of possible options/answers
• When designing closed questions you should
try to:
• Make the lists of options complete and mutually exclusive
(no two options can apply at the same time)
• Keep the number of options as few as possible
• It is useful if the range of possible responses
is known
Types of questions…Example
• What is your marital status?
1. Single
2. Married/living together
3. Separated/divorced/widowed
• Have you ever gone to the local village
health worker for treatment?
1. Yes
2. No
Steps in designing questionnaire
1. Content
Decide what questions will be needed to
measure your variables and reach objectives
2. Formulating Questions
Specific and precise enough that respondents
do not interpret them differently
Steps…
3. Sequencing of Questions
Better to be logical for the respondent
4. Formatting the Questionnaire
The questionnaire should be not only consumer friendly but also user friendly
5. Translation
If the interview will be conducted in one or
more local languages, translate the questionnaire
2. Data presentation
• Having collected and edited the data, the next
step is to organize it.
• That is, to present it in a clear,
condensed form
• The presentation of data is classified into two forms:
1. Tabulation
2. Diagrammatic
Tabular presentation
• Frequency distribution: is the organization of
raw data in table form using classes and
frequencies
• There are three basic types of frequency
distributions
• Categorical frequency distribution
• Ungrouped frequency distribution
• Grouped frequency distribution
Categorical frequency distribution
• Used for data that can be placed in specific
categories, such as nominal or ordinal data.
E.g. a researcher collected the following
data on marital status for 25 patients.
(M=married, S=single, W=widowed and
D=divorced)
Question: Present the given data in table form
M S D W D
S S M M M
W D S M M
W D D S S
S W W D D
Solution
Make a table as shown
Class Tally Frequency Percent
M ////// 6 24%
S /////// 7 28%
D /////// 7 28%
W ///// 5 20%
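The tallying above can also be checked by machine. The following is a minimal Python sketch, not part of the original slides; the function name `categorical_frequency` is illustrative, and the data are the 25 codes from the example read row by row.

```python
from collections import Counter

# Marital status codes of the 25 patients, read row by row from the example
raw = ("M S D W D "
       "S S M M M "
       "W D S M M "
       "W D D S S "
       "S W W D D").split()

def categorical_frequency(values):
    """Return {category: (frequency, percent)} for a list of category labels."""
    counts = Counter(values)
    n = len(values)
    return {cat: (freq, 100 * freq / n) for cat, freq in counts.items()}

table = categorical_frequency(raw)
for cat in "MSDW":
    freq, pct = table[cat]
    print(f"{cat}  {freq:2d}  {pct:.0f}%")
```

The frequencies add up to the number of patients, which is a quick check that no observation was missed.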
Ungrouped frequency distribution
• Is a table of all the potential raw score values
• Often constructed for small data sets or for data on a
discrete variable.
E.g. The following data represent the Weight
of 12 clients in nutrition consulting clinic.
80 76 90
70 60 62
63 60 63
76 70 70
Construct ungrouped frequency distribution
Solution
Make a table as shown
Weight Tally Frequency
60 // 2
62 / 1
63 // 2
70 /// 3
76 // 2
80 / 1
90 / 1
Grouped frequency Distribution
• When the range of the data is large, the data
must be grouped into classes that are more
than one unit in width
Example: Construct a frequency distribution
for the following data.
11 29 6 33 14 31 22 27 19 20
18 17 22 38 23 21 26 34 39 27
N.B: After many steps… next page
(Reading Assignment on steps to do)
Solution
Make table as follows
Class limit   Class boundary   Class mark   Freq.   CF(<)   CF(>)
6-11          5.5-11.5         8.5          2       2       20
12-17         11.5-17.5        14.5         2       4       18
18-23         17.5-23.5        20.5         7       11      16
24-29         23.5-29.5        26.5         4       15      9
30-35         29.5-35.5        32.5         3       18      5
36-41         35.5-41.5        38.5         2       20      2
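As a rough check of the table, the sketch below rebuilds it in Python. It is an illustration under stated assumptions: the lowest class limit (6), the class width (6) and the number of classes (6) are read off the table rather than derived by the reading-assignment steps, the data are whole numbers, and the function name `grouped_frequency` is hypothetical.

```python
# The 20 observations from the example
data = [11, 29, 6, 33, 14, 31, 22, 27, 19, 20,
        18, 17, 22, 38, 23, 21, 26, 34, 39, 27]

def grouped_frequency(values, start, width, n_classes):
    """Return rows of (limits, boundaries, class mark, freq, CF<, CF>)."""
    rows = []
    cum_less = 0              # running "less than" cumulative frequency
    remaining = len(values)   # running "more than" cumulative frequency
    for k in range(n_classes):
        lower = start + k * width
        upper = lower + width - 1
        freq = sum(lower <= v <= upper for v in values)
        cum_less += freq
        rows.append(((lower, upper),
                     (lower - 0.5, upper + 0.5),
                     (lower + upper) / 2,
                     freq, cum_less, remaining))
        remaining -= freq
    return rows

rows = grouped_frequency(data, start=6, width=6, n_classes=6)
for limits, bounds, mark, freq, cf_less, cf_more in rows:
    print(limits, bounds, mark, freq, cf_less, cf_more)
```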
Diagrammatic and Graphic
presentation
• presenting data in visual displays
• Importance
– They have greater attraction.
– They facilitate comparison
– They are easily understandable
• The commonly used diagrammatic
presentation for discrete as well as
qualitative data are:
– Pie charts, Bar charts, Pictogram, map…
Pie chart
• A pie chart is a circle that is divided in to
sections according to the percentage of
frequencies in each category of the
distribution.
Example: Draw a pie chart to represent the
following OPD Patients of the year 2018 in the
given hospital.
Men Women Girls Boys
2500 2000 4000 1500
Solution
• First make a table like:
• Then, draw
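The table for this example is not reproduced in these notes. As a minimal sketch (the function name `pie_table` is illustrative), the percentage share and the sector angle of each group can be computed from the counts, since each sector takes the same fraction of 360 degrees as its group takes of the total.

```python
# OPD patients of the year 2018, from the example
patients = {"Men": 2500, "Women": 2000, "Girls": 4000, "Boys": 1500}

def pie_table(counts):
    """Return {group: (percent, sector angle in degrees)}."""
    total = sum(counts.values())
    return {group: (100 * n / total, 360 * n / total)
            for group, n in counts.items()}

for group, (percent, angle) in pie_table(patients).items():
    print(f"{group:6s} {percent:5.1f}% {angle:6.1f} degrees")
```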
Bar chart
• is the most widely used graphical method for
describing qualitative data.
• A set of bars representing some magnitude
over time or space.
• The common types of bar chart
– Simple
– Multiple
– Component … etc
Simple bar chart
E.g. Distribution of Decayed teeth among
children of a primary school
Multiple bar chart
E.g. Distribution of marital status by sex
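[Figure: multiple bar chart of marital status (Single, Married, Divorced, Widowed) on the x-axis against percent (0 to 60) on the y-axis, with separate bars for Male and Female]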
Graphical presentation of data
• The commonly used graphs for
continuous data are
– histogram,
– Frequency polygon
– Ogive (CF graph)…
Histogram
• A graph which displays the data by using
vertical bars of various heights to represent
frequencies.
• Class boundaries are placed along the horizontal
axes.
• Example: Construct a histogram to represent
the previous data
– i.e., the example on grouped frequency distribution
Solution
[Figure: histogram with class boundaries 5.5, 11.5, 17.5, 23.5, 29.5, 35.5, 41.5 on the x-axis and frequency on the y-axis]
Frequency Polygon:
• it is a line graph where,
– The frequency is placed along the vertical axis and
Class marks at horizontal axis
• Example: draw a line graph for the above
example on histogram
Solution
class marks are in the x-axis
3. MCT and MV
MCT (Measures of central tendency)
• useful in data editing as well as in aiding our
understanding of the data
• Sometimes called Average
• Objectives
• To understand the data easily
• To facilitate comparison
• To make further statistical analysis
Types of MCT
• The Mean (Arithmetic, Geometric and
Harmonic)
• The Mode
• The Median
• Quantiles (Quartiles, deciles and percentiles)
• The choice among these averages depends upon
which best fits the property under discussion.
The Mean ($\bar{X}$)
• The Arithmetic Mean:
• Is defined as the sum of the magnitude of the
items divided by the number of items
• The mean of X1, X2, X3, …, Xn is denoted by
A.M., m, or $\bar{X}$ and is given by:
$\bar{X} = \frac{X_1 + X_2 + \dots + X_n}{n}$, or $\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i$
Mean for Ungrouped data
• Example: Obtain the mean age of the
following ages of children under Pedi clinic
2, 7, 8, 2, 7, 3, 7
• Solution: $\bar{X} = \frac{2+7+8+2+7+3+7}{7} = \frac{36}{7} \approx 5.14$
Mean for grouped data
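• If data are given in the form of a continuous
frequency distribution, then the mean is
$\bar{X} = \frac{\sum_{i=1}^{k} f_i X_i}{\sum_{i=1}^{k} f_i}$, where $X_i$ is the class mark and $f_i$ the frequency of class i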
Example: calculate the mean for the
following data
Class Frequency
6- 10 35
11- 15 23
16- 20 15
21- 25 12
26- 30 9
31- 35 6
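The worked solution is not reproduced in these notes. The sketch below is one way to carry it out, assuming (as is usual for grouped data) that each class is represented by its class mark, the midpoint of its limits; the function name `grouped_mean` is illustrative.

```python
# (class limits, frequency) pairs from the example
classes = [((6, 10), 35), ((11, 15), 23), ((16, 20), 15),
           ((21, 25), 12), ((26, 30), 9), ((31, 35), 6)]

def grouped_mean(class_freqs):
    """Mean of grouped data: sum(f * class mark) / sum(f)."""
    total_f = sum(f for _, f in class_freqs)
    total_fx = sum(f * (lo + hi) / 2 for (lo, hi), f in class_freqs)
    return total_fx / total_f

mean = grouped_mean(classes)
print(mean)  # 1575 / 100 = 15.75
```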
The Mode ( X̂)
• Mode is a value which occurs most frequently in
a set of values
• The mode may not exist and even if it does
exist, it may not be unique.
• In case of discrete distribution the value having
the maximum frequency is the modal value.
• The mode of a set of numbers X1, X2, X3,…Xn is
usually denoted by: X̂
Examples:
1. Find the mode of 5, 3, 5, 8, 9
Mode =5
2. Find the mode of 8, 9, 9, 7, 8, 2, and 5.
It is a bimodal Data: 8 and 9
3. Find the mode of 4, 12, 3, 6, and 7.
No mode for this data.
The Median ($\tilde{X}$)
• In a distribution, the median is the value of the
variable which divides it into two equal
halves.
• In an ordered series of data, the median is the
observation lying exactly in the middle of the
series.
Example:
Find the median of the following numbers.
a) 6, 5, 2, 8, 9, 4
b) 2, 1, 8, 3, 5
Solution:
a) First order the data: 2, 4, 5, 6, 8, 9
Here n=6, which is even, so the median is the mean of the
two middle values: (5+6)/2 = 5.5
b) Order the data: 1, 2, 3, 5, 8
Here n=5, which is odd, so the median is the middle
(third) value: 3
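The mode and median examples can be reproduced with a short Python sketch. The helper `modes` is hypothetical and follows the convention used in these notes: when every value occurs equally often, the data are reported as having no mode (an empty list here).

```python
from collections import Counter
from statistics import median

def modes(values):
    """Return the list of modes; empty if all values are equally frequent."""
    counts = Counter(values)
    top = max(counts.values())
    if len(counts) > 1 and all(c == top for c in counts.values()):
        return []  # no value stands out: no mode
    return sorted(v for v, c in counts.items() if c == top)

print(modes([5, 3, 5, 8, 9]))         # [5]
print(modes([8, 9, 9, 7, 8, 2, 5]))   # [8, 9]  (bimodal)
print(modes([4, 12, 3, 6, 7]))        # []      (no mode)

print(median([6, 5, 2, 8, 9, 4]))     # 5.5 (n even: mean of two middle values)
print(median([2, 1, 8, 3, 5]))        # 3   (n odd: middle value)
```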
MV: Measures of variation
• The spread of items of a distribution is known
as dispersion or variation.
• In other words, the degree to which numerical
data tend to spread about an average value is
called dispersion or variation of the data.
Objectives of measures of variation
• To judge the reliability of MCT
• To control variability itself
• To compare two or more groups of numbers in terms
of their variability
• To make further statistical analysis
Types of Measures of Dispersion
• The most commonly used measures of
dispersions are:
– Range and relative range
– Standard deviation and coefficient of
variation
– Quartile deviation and coefficient of
Quartile deviation
The Range
• The range is the largest score minus the
smallest score.
• It is a quick and dirty measure of variability.
• It is greatly affected by extreme scores.
• R=L-S, L=Largest and S=Smallest
Example: 32 35 36 42 42 43 43 45
Range is 45-32=13
Mean Deviation
• Is the arithmetic mean of the values of the
absolute deviations from a given average
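• Depending upon the type of average used
we have different mean deviations
• Mean deviation about an average A, for raw data and for a frequency
distribution respectively:
$MD = \frac{\sum_{i=1}^{n} |X_i - A|}{n}$ and $MD = \frac{\sum_{i=1}^{k} f_i |X_i - A|}{\sum_{i=1}^{k} f_i}$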
The variance and standard deviation
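Population Variance:
• If we divide the sum of squared deviations from the mean by the number of
values in the population, we get the
population variance:
$\sigma^2 = \frac{\sum_{i=1}^{N} (X_i - \mu)^2}{N}$
• This variance is the "average squared
deviation from the mean"
• And for a frequency distribution:
$\sigma^2 = \frac{\sum_{i=1}^{k} f_i (X_i - \mu)^2}{N}$, where $N = \sum_{i=1}^{k} f_i$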
Sample Variance
• One might simply take the population variance with the
population mean replaced by the sample
mean.
• However, one of the major uses of a sample statistic
is to estimate the corresponding parameter, and this
version tends to underestimate the population variance.
• To counteract this, the sum of the squares of
the deviations is divided by one less than the
sample size (n − 1)
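Sample variance formula
For raw data:
$S^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n-1}$, or
$S^2 = \frac{\sum_{i=1}^{n} X_i^2 - n\bar{X}^2}{n-1}$ (shorthand formula)
For frequency distribution:
$S^2 = \frac{\sum_{i=1}^{k} f_i (X_i - \bar{X})^2}{n-1}$, or
$S^2 = \frac{\sum_{i=1}^{k} f_i X_i^2 - n\bar{X}^2}{n-1}$ (shorthand formula), where $n = \sum_{i=1}^{k} f_i$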
Standard deviation
• It is the square root of the variance
• Population standard deviation: $\sigma = \sqrt{\sigma^2}$
• Sample standard deviation: $S = \sqrt{S^2}$
Examples:
• Find the variance and standard deviation of
the following sample data
1. 5, 17, 12, 10.
2. The data is given in the form of frequency
distribution
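The worked solutions are not reproduced in these notes. For the first data set, a minimal Python sketch (the helper name `sample_variance` is illustrative) divides the sum of squared deviations by n − 1 and cross-checks the result against the standard library.

```python
from math import sqrt
from statistics import mean, stdev, variance

sample = [5, 17, 12, 10]

def sample_variance(values):
    """Sum of squared deviations from the sample mean, divided by n - 1."""
    m = mean(values)
    return sum((x - m) ** 2 for x in values) / (len(values) - 1)

s2 = sample_variance(sample)   # mean is 11, squared deviations sum to 74
s = sqrt(s2)
print(round(s2, 2), round(s, 2))          # 24.67 4.97
print(variance(sample), stdev(sample))    # library cross-check
```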
Coefficient of Variation (C.V)
• Is defined as the ratio of the standard deviation
to the mean, usually expressed as a percentage: $C.V = \frac{S}{\bar{X}} \times 100\%$
• The distribution having the smaller C.V is said to be
less variable or more consistent.
Example:
• An analysis of the monthly wages paid to
workers in two dep’t Pedi (A) and Ortho (B)
belonging to the same campus gives the
following results
Value Dep’t A Dep’t B
Mean wage 52.5 47.5
Variance 100 121
In which dep’t is there greater variability in
individual wages?
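Cont…
• Dep’t A: $C.V_A = \frac{\sqrt{100}}{52.5} \times 100\% \approx 19.05\%$
• Dep’t B: $C.V_B = \frac{\sqrt{121}}{47.5} \times 100\% \approx 23.16\%$
• Since the C.V of dep’t B is larger, in dep’t B there is greater variability in
individual wages.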
Standard Scores (Z-scores)
• If X is a measurement from a distribution with
mean $\bar{X}$ and standard deviation S, then its
value in standard units is
$Z = \frac{X - \bar{X}}{S}$
Cont…
• Z gives the deviations from the mean in units
of standard deviation
• Z gives the number of standard deviations a
particular observation lies above or below the
mean.
• It is used to compare two observations
coming from different groups
Examples:
1. Two sections were given Biostatistics
examinations. The following information was
given.
Value HO (Sec1) Nursing (Sec2)
Mean 78 90
Sd 6 5
• Student A from section 1 scored 90 and
student B from section 2 scored 95.
Relatively speaking who performed better?
Solutions:
• Calculate the standard score of both students:
$Z_A = \frac{90 - 78}{6} = 2$ and $Z_B = \frac{95 - 90}{5} = 1$
• Student A performed better relative to his
section because the score of student A is
2 SD above the mean score of his section,
while the score of student B is only 1 SD
above the mean score of his section.
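The comparison of the two students can be sketched in a few lines of Python; the function name `z_score` is illustrative, and the means and standard deviations are those given in the table for the two sections.

```python
def z_score(x, mean, sd):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return (x - mean) / sd

z_a = z_score(90, mean=78, sd=6)   # student A, section 1
z_b = z_score(95, mean=90, sd=5)   # student B, section 2
print(z_a, z_b)                    # 2.0 1.0
print("A" if z_a > z_b else "B", "performed better relative to the section")
```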
Measures of shape
• Measures of skewness
– Skewed to the right
– Skewed to the left
– Symmetric
• Measures of kurtosis
– Leptokurtic
– Mesokurtic
– Platykurtic
• Reading Assignment
4. Probability and its distribution
What is probability?
• It is the chance of an outcome in an experiment.
• It is the measure of how likely an outcome is to occur.
• It helps us to cope with uncertainty
• Probability is 0 if an event cannot occur, and it is 1 if an
event is certain to occur
A physician may say that a patient has a 50-50 chance of
surviving a given operation.
Another physician may say that she is 95 percent
certain that a patient has a particular disease.
Important definitions
• Experiment: Any process of observation or measurement
which generates a well defined outcome.
– E.g. the parasite counts of malaria patients entering
hospital
• Probability Experiment:
– It is an experiment that can be repeated a number of
times under similar conditions
– It is possible to enumerate the total number of
outcomes without predicting an individual outcome.
Example: If a fair die is rolled once
– it is possible to list all the possible outcomes
• i.e.{1, 2, 3, 4, 5, 6}
– but it is not possible to predict which outcome will
occur.
Important def…
• Outcome :The result of a single trial of a random
experiment
• Sample space (S): The set of all possible outcomes of an
experiment , for example, {H,T}.
• Event: Any subset of the sample space for example, {H} or
{T} or {H,T}
• Empty set (∅): contains no elements.
• Equally Likely Events: Events which have the same chance
of occurring.
• Complement of an Event: the complement of an event A
means non-occurrence of A
Important def…
• Elementary Event: an event having only a single
element or sample point.
• Mutually Exclusive Events: Two events which cannot
happen at the same time.
• Independent Events: Two events are independent if the
occurrence of one does not affect the probability of
the other occurring.
• Dependent Events: Two events are dependent if the
occurrence of the first event changes the probability
of the second event.
Approaches to measure probability
• There are four approaches to the study of probability
theory:
• The classical approach
• The frequentist approach
• The axiomatic approach
• The subjective approach
Classical approach
• If a random experiment with N equally likely
outcomes is conducted and, out of these, NA outcomes
are favorable to the event A,
• then the probability that event A occurs, denoted P(A), is
defined as:
$P(A) = \frac{N_A}{N} = \frac{n(A)}{n(S)}$
Example
• A fair die is tossed once. What is the probability of
getting
a) Number 4?
b) An odd number?
c) An even number?
d) Number 8?
Solutions:
• First identify the sample space,
S={1,2,3,4,5,6}; N=n(S)=6
a) Let A be the event of number 4; A={4}
NA= n(A)=1;
P(A)=n(A)/n(S)=1/6
b) Let A be the event of odd numbers; A={1,3,5}
NA= n(A)=3;
P(A)=n(A)/n(S)=3/6=0.5
c) and d): calculate these yourself
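To check your answers, the classical definition can be sketched in Python by counting favorable outcomes. The function name `classical_probability` is illustrative; outcomes outside the sample space are simply not counted as favorable.

```python
from fractions import Fraction

sample_space = {1, 2, 3, 4, 5, 6}   # one roll of a fair die

def classical_probability(event, space):
    """P(A) = n(A) / n(S) for equally likely outcomes."""
    favorable = event & space       # ignore outcomes that cannot occur
    return Fraction(len(favorable), len(space))

p_four = classical_probability({4}, sample_space)
p_odd = classical_probability({1, 3, 5}, sample_space)
p_even = classical_probability({2, 4, 6}, sample_space)
p_eight = classical_probability({8}, sample_space)
print(p_four, p_odd, p_even, p_eight)   # 1/6 1/2 1/2 0
```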
Addition Rule
• The probabilities of all mutually exclusive and
exhaustive outcomes sum to 1
– For any events A and B:
P(A∪B) = P(A) + P(B) − P(A∩B) … addition rule
– For mutually exclusive events:
P(A∪B) = P(A) + P(B)
– For two independent events A and B:
P(A∩B) = P(A)·P(B) … multiplication rule
– For two dependent events:
P(A∩B) = P(A)·P(B/A) … where P(B/A) is a conditional probability
Example
• Out of 200 seniors at a certain college, 98 are women, 34
are majoring in Health Officer, and 20 Health officer
majors are women. If one student is chosen at random
from the senior class, what is the probability that the
choice will be either a Health Officer major or a woman?
Solution:
P(HO major or woman) = P(HO major) + P(woman) − P(HO major
and woman)
= 34/200 + 98/200 − 20/200 = 112/200 = 0.56
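The same calculation as a minimal Python sketch; the function name `p_union` is illustrative and exact fractions are used to avoid rounding.

```python
from fractions import Fraction

def p_union(p_a, p_b, p_a_and_b):
    """Addition rule: P(A or B) = P(A) + P(B) - P(A and B)."""
    return p_a + p_b - p_a_and_b

n = 200                                  # seniors at the college
p_ho = Fraction(34, n)                   # Health Officer majors
p_woman = Fraction(98, n)                # women
p_both = Fraction(20, n)                 # Health Officer majors who are women

p = p_union(p_ho, p_woman, p_both)
print(p, float(p))                       # 14/25 0.56
```

Subtracting P(A and B) avoids counting the 20 women Health Officer majors twice.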
Conditional probability
• The conditional probability of an event A given that B
has already occurred, denoted P(A/B), is:
$P(A/B) = \frac{P(A \cap B)}{P(B)}$, provided $P(B) > 0$
Example
• Suppose in country X; the chance that an infant lives
to age 25 is 0.95, whereas the chance that he lives to
age 65 is 0.65. What is the chance that a person 25
years of age survives to age 65?
– Hint: it is clear that to survive to age 65 means to survive
both from birth to age 25 and from age 25 to 65.
Solution
A: survive from birth to age 25; P(A) = 0.95
B: survive to age 65, i.e. both from birth to age 25 and from age 25 to 65; P(A∩B) = P(B) = 0.65
B/A: survive from age 25 to 65 given survival to age 25; P(B/A) = ?
Then, P(B/A) = P(A∩B)/P(A)
= 0.65/0.95
= 0.684
That is, a person aged 25 has a 68.4 percent chance
of living to age 65
Probability distribution
Random Variable
• It is a numerical description of the outcomes of the
experiment or
• It is a numerical valued function defined on sample space
• usually denoted by capital letters
• A random variable takes a possible outcome and assigns
a number to it
Cont…
• Example: Toss a coin two times and let X be the number of
heads in two tosses
• S={(HH), (HT), (TH), (TT)}
– X(HH)=2,
– X(HT)=X(TH)=1,
– X(TT)=0
• X=0, 1, 2
• X assumes a specific number of values with some
probabilities.
RV are of two types:
1. Discrete random variable:
– are variables which can assume only a specific
number of values.
– These values will be a finite or countably
infinite number of outcomes
Examples:
– Number of bacteria per two cubic centimeters of water.
– Number of drugs sold per week in a pharmacy
– Number of children in a family.
Cont…
2. Continuous random variable: a variable that can
assume all values between any two given values
Examples:
– Length of time required to complete a kidney transplant
surgery.
– Life time of Drugs.
– Height of students at certain college.
Discrete Probability Distribution
• It may be a table, graph or formula consisting of the values
that an RV assumes and the corresponding probabilities of
those values
or
• A probability function maps the possible values of x
against their respective probabilities of occurrence, p(x)
Properties of Discrete Probability Distribution
1. 0 ≤ P(X = x) ≤ 1
2. ∑ P(X = x) = 1
3. P(X < b) = P(X ≤ b−1)
4. P(a ≤ X ≤ b) = P(X ≤ b) – P(X ≤ a−1)
(Properties 3 and 4 assume X takes integer values.)
Example:
• Consider the experiment of tossing a single coin two
times. Let X be the number of heads. Construct the
probability distribution of X.
• S={(HH), (HT), (TH), (TT)}
– X(HH)=2,
– X(HT)=X(TH)=1,
– X(TT)=0
• X=0, 1, 2
X 0 1 2
P(X=x) 1/4 2/4 1/4
Example 2:
• The following data show the number of diagnostic
services a patient receives, with their
respective probabilities:
x     P(X=x)
0     0.671
1     0.229
2     0.053
3     0.031
4     0.010
5     0.006
Solution
• What is the probability that a patient receives exactly 3
diagnostic services?
P(X=3) = 0.031
• What is the probability that a patient receives at most one
diagnostic service?
P (X≤1) = P(X = 0) + P(X = 1)
= 0.671 + 0.229
= 0.900
• What is the probability that a patient receives at least four
diagnostic services?
P (X≥4) = P(X = 4) + P(X = 5)
= 0.010 + 0.006
= 0.016
Probability distributions can also be displayed using a
graph
[Figure: bar graph of the probability P(X = x) (y-axis, 0 to 0.8) against the number of diagnostic services x = 0, 1, …, 5 (x-axis)]
Intro. to Expectation [ E(X) ]
• Let a discrete random variable X assume the values X1,
X2,…Xn with the probabilities P(X1), P(X2), ….,P(Xn)
respectively.
• Then the expected value of X, denoted E(X), is
defined as:
$E(X) = \sum_{i=1}^{n} X_i P(X_i)$
Example
• What is the expected value of the RV X, the number of
diagnostic services a patient receives (Example 2)?
Solution
E(X)=X1P(X1) + X2P(X2)+ …. +X6P(X6)
=(0*0.671)+(1*0.229)+(2*0.053)+(3*0.031)+(4*
0.010)+(5*0.006)
=0+0.229+0.106+0.093+0.04+0.03=0.498
• The expected number of diagnostic services on average is
0.498
Mean and Variance of RV
Let X be given random variable.
• Mean of x is E(X)
• Variance of X is: Var(X) = E(X²) – [E(X)]²
where $E(X^2) = \sum_{i=1}^{n} X_i^2 P(X_i)$
Some general rules for expectation
Let X and Y be random variables and k be a constant.
1. E(k) = k
2. Var(k) = 0
3. E(kX) = kE(X)
4. Var(kX) = k² Var(X)
5. E(X+Y) = E(X)+E(Y)
Example:
• Calculate Mean and Variance of the above Example
(Example on the number of diagnostic services a
patient receives)
Solution
Mean= E(X)= X1P(X1) + X2P(X2)+ …. +X6P(X6)
=(0*0.671)+(1*0.229)+(2*0.053)+(3*0.031)+(4*0.010)+
(5*0.006)
=0+0.229+0.106+0.093+0.04+0.03
=0.498
Solution cont…
• Variance: Var(X) = E(X²) – [E(X)]²
First calculate E(X²)
= (0²*0.671)+(1²*0.229)+(2²*0.053)+(3²*0.031)+(4²
*0.010)+(5²*0.006)
= 0+0.229+0.212+0.279+0.160+0.150
= 1.030
Then, Var(X) = E(X²) – [E(X)]²
= 1.030 - [0.498]²
= 1.030 - 0.248
= 0.782
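The mean and variance can be recomputed directly from the Example 2 table (x = 0 to 5 with probabilities 0.671, 0.229, 0.053, 0.031, 0.010, 0.006). This is a minimal sketch; the helper name `expectation` and its `power` parameter are illustrative.

```python
# Probability distribution of the number of diagnostic services (Example 2)
dist = {0: 0.671, 1: 0.229, 2: 0.053, 3: 0.031, 4: 0.010, 5: 0.006}

def expectation(distribution, power=1):
    """E(X**power) = sum of x**power * P(X = x) over all values x."""
    return sum((x ** power) * p for x, p in distribution.items())

mean = expectation(dist)            # E(X)
ex2 = expectation(dist, power=2)    # E(X^2)
var = ex2 - mean ** 2               # Var(X) = E(X^2) - [E(X)]^2

print(round(mean, 3), round(ex2, 3), round(var, 3))   # 0.498 1.03 0.782
```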
Discrete probability distribution
The common Discrete probability
distributions are
• Binomial Distribution
• Poisson Distribution
Binomial Distribution
• A binomial experiment is a probability experiment
which satisfies the following four requirements
• Assumptions of a binomial distribution.
1. The experiment consists of “n” identical trials
2. Each trial has only one of the two possible mutually
exclusive outcomes, success or a failure
3. The probability of each outcome does not change
from trial to trial, and
4. The trials are independent, thus we must sample
with replacement
Examples
• Registering a newly produced drug product as
defective or non-defective.
• Asking 100 people if they favor the ruling party.
• Asking 200 people if they watch BBC/EBC news.
• Tossing a coin 20 times to see how many tails
occur.
• Rolling a die to see if a 5 appears.
Binomial Distribution…
• The outcomes of the binomial experiment and the
corresponding probabilities are called Binomial
Distribution.
• Then the probability of getting “x” successes in “n” trials
becomes:
$P(X = x) = \binom{n}{x} p^x q^{n-x}$, for $x = 0, 1, \dots, n$
Where, p = probability of success
q = 1 − p = probability of failure
x = number of successes desired
n = number of trials
Example
• What is the probability of getting three female children
from four births (not twin or more) in a family?
Solution
– Let X be the number of females in four single births in a
family.
– n=4, p=0.5
Solution cont…
$P(X = 3) = \binom{4}{3} (0.5)^3 (0.5)^1 = 4 \times 0.0625 = 0.25$
• Therefore, in a given family the probability of getting
three female children from four births is 0.25
Poisson Distribution
• A random variable X is said to have a Poisson
distribution if its probability distribution is given by:
$P(X = x) = \frac{\lambda^x e^{-\lambda}}{x!}$, for $x = 0, 1, 2, \dots$
• Where $\lambda$ is the average number of occurrences
• The Poisson distribution depends only on the
average number of occurrences per unit of time or
space.
Poisson distribution…
• The Poisson distribution is used as a
distribution of rare events like;
– Number of misused drugs.
– Car Accidents…
– Natural disasters like earth quake.
Example
• If 1.6 accidents can be expected at an intersection on any
given day, what is the probability that there will be 3
accidents on any given day?
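The worked solution is not reproduced in these notes. As a minimal sketch, the standard binomial and Poisson probability formulas can be evaluated in Python; the function names `binomial_pmf` and `poisson_pmf` are illustrative, and the accident example is treated as Poisson with an average of 1.6 per day.

```python
from math import comb, exp, factorial

def binomial_pmf(x, n, p):
    """P(X = x) for n independent trials with success probability p."""
    return comb(n, x) * p ** x * (1 - p) ** (n - x)

def poisson_pmf(x, lam):
    """P(X = x) when the average number of occurrences is lam."""
    return lam ** x * exp(-lam) / factorial(x)

# Three female children in four single births, with p = 0.5
p_three_girls = binomial_pmf(3, n=4, p=0.5)
print(p_three_girls)                      # 0.25

# Three accidents on a day when 1.6 are expected on average
p_three_accidents = poisson_pmf(3, lam=1.6)
print(round(p_three_accidents, 4))        # 0.1378
```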
Normal distribution
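• A random variable X is said to have a normal
distribution if its probability density function is given
by:
$f(x) = \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2}$, for $-\infty < x < \infty$
Where:
• $\mu$ (the mean) and $\sigma$ (the standard deviation, $\sigma > 0$) are the parameters of the normal distribution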
Properties of Normal distribution
• It is bell shaped and is symmetrical about its mean and
it is mesokurtic
• It is unimodal; It is a family of curves
• It is a continuous distribution
• Total area under the curve sums to 1
• Mean=Median=Mode
• The probability that a random variable will have a value
between any two points is equal to the area under the
curve between those points
Standard Normal distribution
• The transformed value of a normal distribution
by standardization of the RV X:
$Z = \frac{X - \mu}{\sigma}$
where,
– Z has mean zero and variance one
Standard Normal distribution…
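• Given a normally distributed random variable X
with mean $\mu$ and standard deviation $\sigma$:
$P(a < X < b) = P\left(\frac{a-\mu}{\sigma} < Z < \frac{b-\mu}{\sigma}\right)$
• Note that:
$P(a < X < b) = P(a \le X < b) = P(a < X \le b) = P(a \le X \le b)$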
Example
• Find the area under the standard normal distribution
which lies
a. Between Z=0 and Z=0.96
b. Between Z= -1.45 and Z=0
c. To the right of Z= -0.35
d. Between Z=-1.45 and Z=0.96
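Areas such as these are normally read from a standard normal table. As a rough cross-check, the sketch below uses `statistics.NormalDist` from the Python standard library; the helper name `area_between` is illustrative. Parts a to c are shown, and part d can be checked the same way.

```python
from statistics import NormalDist

Z = NormalDist()   # standard normal: mean 0, standard deviation 1

def area_between(a, b):
    """Area under the standard normal curve between a and b (a < b)."""
    return Z.cdf(b) - Z.cdf(a)

area_a = area_between(0, 0.96)     # between Z = 0 and Z = 0.96
area_b = area_between(-1.45, 0)    # between Z = -1.45 and Z = 0
area_c = 1 - Z.cdf(-0.35)          # to the right of Z = -0.35

print(round(area_a, 4), round(area_b, 4), round(area_c, 4))
# 0.3315 0.4265 0.6368
```

By symmetry, the area between −1.45 and 0 equals the area between 0 and 1.45, which is how a one-sided table is used.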
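Solution
a) P(0 < Z < 0.96) = 0.3315
b) P(−1.45 < Z < 0) = P(0 < Z < 1.45) = 0.4265
c) P(Z > −0.35) = 0.5 + P(0 < Z < 0.35) = 0.5 + 0.1368 = 0.6368
Solution…
d) Exercise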
5. Sampling
Definitions
• Population: is the complete set of possible measurements
for which inferences are to be made
• Sample: is the set of measurements that are collected in
the course of an investigation.
• Parameter: Characteristic or measure obtained from a
population.
• Statistic: Characteristic or measure obtained from a sample
Definitions…
• Sampling: The process or method of sample selection
from the population
• Sampling unit: the ultimate unit to be sampled or
elements of the population to be sampled
• Sample size: the number of elements or observation to
be included in the sample.
The main concern in sampling
• To ensure that the sample represents the population,
• The findings can be generalized
• Researchers are
– Not interested in the sample itself; But in what can
be learned from the sample and
– How this information can be applied to the entire
population.
Reasons for Sampling
– Reduced cost
– Greater speed
– Greater accuracy
– Greater scope
– The only option when the population is infinite or
simply too large to handle
Sampling Techniques
There are two types of sampling techniques
• Probability sampling
– Simple random sampling
– Stratified random sampling
– Cluster sampling
– Systematic sampling
– Multi-stage sampling
• Non probability sampling
– Judgment sampling
– Convenience sampling
– Quota sampling
Probability sampling
• Is a method of sampling in which;
– all elements in the population have a pre-
assigned non zero probability to be
included in to the sample.
Simple Random Sampling(SRS)
• Is a method of selecting items from a population
– such that every possible sample of a specific size has an
equal chance of being selected.
– In this case, sampling may be with or without
replacement.
• All elements in the population have the same pre-
assigned non zero probability to be included in to
the sample.
• SRS can be done either using the lottery method
or table of random numbers.
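In practice the lottery method is usually replaced by a pseudo-random number generator. A minimal sketch, assuming a hypothetical sampling frame of 100 numbered units and an arbitrary seed chosen only to make the draw reproducible:

```python
import random

rng = random.Random(2024)               # arbitrary seed, for reproducibility only
sampling_frame = list(range(1, 101))    # hypothetical population: units numbered 1..100

# SRS without replacement: no unit can be selected twice
srs_without = rng.sample(sampling_frame, k=10)

# SRS with replacement: the same unit may be selected more than once
srs_with = rng.choices(sampling_frame, k=10)

print("without replacement:", sorted(srs_without))
print("with replacement:   ", sorted(srs_with))
```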
Stratified Random Sampling
• The population is divided into non-overlapping but
exhaustive groups called strata.
• A SRS is then chosen from each stratum.
• Elements within the same stratum should be more or less
homogeneous, while elements in different strata should differ.
• It is applied if the population is heterogeneous.
• Some of the criteria for dividing a population into strata
are:
– Sex (male, female);
– Age (under 18, 18 to 28, 29 to 39);
– Occupation (blue-collar, professional, other).
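A stratified draw can be sketched as one SRS per stratum. The example below assumes proportional allocation (each stratum contributes in proportion to its size), which is one common choice and is not stated on the slide; the strata, their sizes and the function name `stratified_sample` are hypothetical.

```python
import random

def stratified_sample(strata, total_n, rng):
    """Draw a SRS from each stratum, with sizes proportional to stratum size."""
    population_size = sum(len(units) for units in strata.values())
    sample = {}
    for name, units in strata.items():
        # Proportional allocation; rounding may shift the total by a unit or two
        n_h = round(total_n * len(units) / population_size)
        sample[name] = rng.sample(units, n_h)
    return sample

rng = random.Random(7)                      # arbitrary seed
strata = {                                  # hypothetical population stratified by sex
    "male": [f"M{i}" for i in range(600)],
    "female": [f"F{i}" for i in range(400)],
}
sample = stratified_sample(strata, total_n=50, rng=rng)
print({name: len(units) for name, units in sample.items()})
```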
Cluster Sampling
• The population is divided into non-overlapping groups
called clusters.
• A SRS of clusters is chosen and all the
sampling units in the selected clusters are surveyed.
• Clusters are formed in a way that elements within a
cluster are heterogeneous
– i.e. observations in each cluster should be more or less
dissimilar.
• Cluster sampling is useful when it is difficult or costly to
generate a SRS.
Systematic Sampling:
• A complete list of all elements within the population
(sampling frame) is required.
• The procedure starts by determining the first element
to be included in the sample, usually at random from the first k elements.
• Then every kth item after it is taken from the
sampling frame (k is the sampling interval, commonly N/n).
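The procedure can be sketched in a few lines. This is only an illustration under stated assumptions: the sampling interval is taken as k = N // n, the start is drawn at random from the first k units, and the frame of 1000 numbered units is hypothetical.

```python
import random

def systematic_sample(frame, n, rng):
    """Take every k-th unit from the frame after a random start."""
    k = len(frame) // n            # sampling interval (assumes N is at least n)
    start = rng.randrange(k)       # random start among the first k units (0-based index)
    return frame[start::k][:n]

rng = random.Random(1)             # arbitrary seed
frame = list(range(1, 1001))       # hypothetical sampling frame, N = 1000
sample = systematic_sample(frame, 50, rng)
print(sample[:5])
```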
Non probability sampling
• It is a sampling technique in which the choice of
individuals for a sample depends on;
– convenience,
– personal choice or
– interest, etc.
Types of Non probability sampling
• Judgment Sampling
– The person taking the sample has direct or indirect
control over which items are selected for the sample.
• Convenience Sampling
– The decision maker selects a sample from the
population in a manner that is relatively easy and
convenient.
• Quota Sampling
– The decision maker requires the sample to contain a
certain number of items with a given characteristic.
Many political polls are, in part, quota sampling.
Errors in sample survey:
There are two types of errors
a) Sampling error:
• Is the discrepancy between the population value and the sample
value.
• May arise due to inappropriate sampling techniques
b) Non-sampling errors:
• are errors due to procedural bias such as:
– Incorrect responses
– Measurement errors
– Errors at different stages in processing the data.
Sample size determination
How many subjects should a researcher study?
Decide how many people need to be studied in order to
answer the study objectives
It is much better to increase the accuracy of data
collection than to increase the sample size after a
certain point.
Sample size determination…
Describe how the sample size is determined
Too small a sample:
– May fail to detect important effects
– May estimate effects too imprecisely
– Results may have no practical use
Too large a sample:
– Waste of resources
– Data quality may be compromised
Sample size determination…
When deciding on sample size:
Sample size is a balance between PRECISION and COST:
a larger sample size gives more precision but also costs more.
Sample size determination…
Example:
• A prevalence of 10% from a sample size of 20
– would have a 95% CI of about 0% to 23% (normal approximation,
lower limit truncated at 0),
– which is not very precise or informative.
• But, a prevalence of 10% from a sample of
size 400
– would have a 95% CI of about 7% to 13%,
– which may be considered sufficiently accurate.
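The effect of sample size on precision can be reproduced with the normal-approximation (Wald) interval, p ± z·sqrt(p(1 − p)/n). This is a sketch under that assumption; exact or Wilson intervals give somewhat different limits, especially for n = 20 where the approximation is crude. The function name `wald_ci` is introduced here for illustration.

```python
import math

def wald_ci(p_hat, n, z=1.96):
    """Normal-approximation confidence interval for a proportion."""
    se = math.sqrt(p_hat * (1 - p_hat) / n)   # standard error of the sample proportion
    return p_hat - z * se, p_hat + z * se

low_20, high_20 = wald_ci(0.10, 20)
low_400, high_400 = wald_ci(0.10, 400)

# A negative lower limit is reported as 0 in practice
print(f"n = 20 : {max(low_20, 0):.3f} to {high_20:.3f}")
print(f"n = 400: {low_400:.3f} to {high_400:.3f}")
```

The interval for n = 400 is far narrower because the standard error shrinks in proportion to 1/sqrt(n).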
Sample size determination depends on the
• Objective of the study
• Design of the study
• Descriptive/Analytic
• Accuracy of the measurements to be made (margin of
error)
• Degree of precision required for generalization
• Plan for statistical analysis
• Degree of confidence with which to conclude
• The feasible sample size is also determined by the
availability of resources
• Time, manpower, transport, available facility and money
Sample Size for estimating a Single Proportion
$$n = \frac{(z_{\alpha/2})^2 \, p q}{d^2}$$
Where, p = proportion
q= 1-p
d= the degree of precision
Zα/2= the standard normal value for the chosen confidence level, 1 − α (e.g. 1.96 for 95% confidence)
This formula works for a large population!
Example:
• Suppose that you are interested to know the proportion of
HIV infected adult patients who developed Tuberculosis in
Goba Referral Hospital. Suppose that in this hospital the
proportion (p) of Tuberculosis was found to be 20%. What
sample size is required to estimate the true proportion
within ±3% with 95% confidence level.
• Given: p=0.20, d=0.03, α=5%
$$n = \frac{(z_{\alpha/2})^2 \, p q}{d^2} = \frac{(1.96)^2 (0.2)(0.8)}{(0.03)^2} \approx 683$$
Example
• If the sample is to be taken from a relatively small
population (<10,000); the above formula needs some
adjustment.
Final sample size (with finite population correction, fpc) = n / (1 + n/N)
• Suppose in the above example that the total population of
patients is 5000. What sample size will be needed to
conduct the study?
n / (1 + n/N) = 683 / (1 + 683/5000) ≈ 601
Example
• Suppose there is no prior information about the proportion (p) of
HIV infected adult patients who developed TB.
If you don’t have any information about p, take it as 50%,
which gives the maximum value of p*q, namely ¼ (0.25).
• Assume p=q=0.5 (most conservative)
• Then, the required sample size will be
$$n = \frac{(z_{\alpha/2})^2 \, p q}{d^2} = \frac{(1.96)^2 (0.5)(0.5)}{(0.03)^2} \approx 1068$$
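The three worked examples for a single proportion can be reproduced with a short script. This is a sketch: the function names are introduced here for illustration, and rounding up to the next whole subject is an assumption (the usual convention in sample size work).

```python
import math

def n_single_proportion(p, d, z=1.96):
    """Sample size for estimating one proportion within +/- d, large population."""
    return math.ceil(z ** 2 * p * (1 - p) / d ** 2)

def fpc_adjust(n, population_size):
    """Finite population correction: n / (1 + n/N), rounded up."""
    return math.ceil(n / (1 + n / population_size))

n_tb = n_single_proportion(p=0.20, d=0.03)              # p known from earlier data
n_tb_small_pop = fpc_adjust(n_tb, population_size=5000)  # population below 10,000
n_no_prior = n_single_proportion(p=0.50, d=0.03)         # no prior information on p

print(n_tb, n_tb_small_pop, n_no_prior)
```

Setting p = 0.5 gives the largest value of p·q, so it yields the most conservative (largest) sample size for a given precision.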
Sample Size for Estimating a Single Mean
$$n = \frac{(z_{\alpha/2})^2 \, \sigma^2}{d^2}$$
Where,
n = sample size
σ = standard deviation
d = desired precision (margin of error) = half of the
width of the confidence interval (width w = 2d)
Example
• Suppose that for a certain group of cancer patients, we
are interested in estimating the mean age at diagnosis.
We would like a 95% confidence interval of 5 years wide.
If the population standard deviation is 12 years, how
large should our sample be?
$$n = \frac{(z_{\alpha/2})^2 \, \sigma^2}{d^2} = \frac{(1.96)^2 (144)}{(2.5)^2} = 88.5 \approx 89$$
Example
• Suppose d=1
• Then the sample size increases!
$$n = \frac{(z_{\alpha/2})^2 \, \sigma^2}{d^2} = \frac{(1.96)^2 (144)}{1^2} = 553.2 \approx 554$$
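The two calculations for a single mean can be checked as follows. This is a sketch; the function name is illustrative, and d is taken as half the desired interval width, so a 5-year-wide interval gives d = 2.5.

```python
import math

def n_single_mean(sigma, d, z=1.96):
    """Sample size for estimating a mean within +/- d, rounded up."""
    return math.ceil(z ** 2 * sigma ** 2 / d ** 2)

n_width_5 = n_single_mean(sigma=12, d=2.5)   # 95% CI that is 5 years wide
n_d_1 = n_single_mean(sigma=12, d=1)         # tighter precision, d = 1 year

print(n_width_5, n_d_1)
```

Because d enters the formula squared, reducing d from 2.5 to 1 multiplies the required sample size by 6.25.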
But, the population σ2 is most of the time unknown!
As a result, it has to be estimated from:
• Pilot or preliminary sample:
– Select a pilot sample and estimate σ2 with
the sample variance, s2
• Previous or similar studies
6. Statistical Estimation and
Hypothesis Testing
Inference
• Inference is the process of making a conclusion
from sample data for the totality of the
population.
• It is only the sample data that is ready for
inference.
• In statistics there are two ways through which
inference can be made.
– Statistical estimation
– Statistical hypothesis testing
Inference…
• Data analysis is the process of extracting relevant
information from the summarized data.
Statistical estimation
• This is one way of making inference about the
population parameter where;
• the investigator does not have any prior notion about
values or characteristics of the population parameter.
• Two ways of estimation
– Point Estimation
– Interval Estimation
Statistical estimation…
Point Estimation
• It is a procedure that results in a single value as an
estimate for a parameter
Sample statistic | Population parameter
X̄ (sample mean) | μ (population mean)
S2 (sample variance) | σ2 (population variance)
S (sample SD) | σ (population SD)
p (sample proportion) | P or π (population proportion)
Interval estimation
• It is the procedure that results in the interval of
values as an estimate for a parameter,
• which is interval that contains the likely values of a
parameter.
Confidence Interval
• How confident can we be that the value of the statistic
falls within a certain "distance" of the parameter?
• Or, what is the probability that the parameter's value is
within a certain range of the statistic's value?
• This range is the confidence interval.
Estimator and Estimate
• Estimator is the rule or random variable that
helps us to approximate a population parameter.
• But estimate is the different possible values
which an estimator can assume.
Example: The sample mean $\bar{X}$ is an estimator for
the population mean μ, and an observed value $\bar{x}$ is an estimate, which
is one of the possible values of $\bar{X}$.
Properties of best estimator
The following are some qualities of an estimator
• It should be unbiased
• It should be consistent
• It should be relatively efficient
Estimator…
• To explain these properties let $\hat{\theta}$ be an estimator
of θ
– Unbiased Estimator: An estimator whose expected
value is the value of the parameter being estimated.
i.e. $E(\hat{\theta}) = \theta$
– Consistent Estimator: An estimator which gets closer
to the value of the parameter as the sample size
increases. i.e. $\hat{\theta}$ gets closer to θ as the sample size
increases.
Cont….
• Relatively Efficient Estimator:
– The estimator for a parameter with the smallest
variance.
– This actually compares two or more estimators for
one parameter.
Hypothesis Testing
• This is also one way of making inference about
population parameter, where
• the investigator has prior notion about the value of the
parameter.
Definitions:
• Statistical hypothesis: is a statement about the
population whose acceptability is to be evaluated on
the basis of the sample data.
• Test statistic: is a statistic whose value serves to
determine whether to reject or accept the hypothesis to
be tested.
– It is a random variable.
Two types of hypothesis
Null hypothesis (H0):
• It is the hypothesis to be tested.
• It is the hypothesis of equality or the hypothesis of no
difference.
Alternative hypothesis(H1) or (HA):
• It is the hypothesis available when the null hypothesis
has to be rejected.
• It is the hypothesis of difference.
Types of errors:
Two types of errors in hypothesis testing
• Type I error(α): Rejecting the null hypothesis when it is
true.
– Its probability, α, is called the level of significance.
• Type II error (β): Failing to reject the null hypothesis
when it is false.
General steps in hypothesis testing
1. The first step in hypothesis testing is to specify the
null and alternative hypothesis
2. The next step is to select a significance level, α
3. Identify the sampling distribution of the estimator.
4. Calculate a statistic analogous to the parameter
specified by the null hypothesis.
5. Identify the critical region.
6. Making decision.
7. Summarization of the result.
Example:
• It is known in a pharmacological experiment that rats
fed with a particular diet over a certain period gain an
average of 40 gms in weight. A new diet was tried on a
sample of 20 rats yielding a weight gain of 43 gms
with variance 7 gms2 .
– Test the hypothesis that the new diet is an
improvement assuming normality.
Solution
Given:
Steps
Conclusion:
• Reject Ho and conclude that the new diet is an
improvement, i.e. it increases the mean weight gain of the rats.
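The calculation behind this conclusion can be sketched as a one-sample, one-sided t test of H0: μ = 40 against H1: μ > 40. The 5% significance level is an assumption made here (the example does not state α), and the critical value 1.729 is the tabulated t value for 19 degrees of freedom at that level.

```python
import math

mu0 = 40.0        # mean weight gain under the old diet (H0)
n = 20            # number of rats on the new diet
xbar = 43.0       # sample mean weight gain
s2 = 7.0          # sample variance

# Test statistic: t = (xbar - mu0) / sqrt(s^2 / n), with n - 1 degrees of freedom
t_stat = (xbar - mu0) / math.sqrt(s2 / n)

t_crit = 1.729    # t table: one-sided, alpha = 0.05 (assumed), df = 19
reject_h0 = t_stat > t_crit

print(f"t = {t_stat:.3f}, critical value = {t_crit}, reject H0: {reject_h0}")
```

The t distribution is used rather than the normal because the variance is estimated from a small sample.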
Test of Association
• Suppose we have a population consisting of
observations having two attributes; say A and B.
• If the attributes are independent then the probability
of possessing both A and B is PA*PB. Where,
– PA is the probability that a member has attribute A.
– PB is the probability that a member has attribute B.
• Suppose
– A has r mutually exclusive and exhaustive classes.
– B has c mutually exclusive and exhaustive classes
Test of association…
• The entire set of data can be represented using an r × c
contingency table
Test of association examples
• Whether the presence or absence of hypertension is
independent of smoking habit or not.
• Whether the size of the family is independent of the
level of education attained by the mothers.
• Whether there is association between father and son
regarding boldness
The Chi-square Test
Hypothesis
• The null and alternative hypothesis may be
stated as:
– H0 : There is no association between A and B
– H1 : not H0 (There is an association between A and B)
Decision Rule:
Reject H0 at the α level of significance if the calculated
value of χ2 exceeds the tabulated value of χ2 with
(r − 1)(c − 1) degrees of freedom
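The decision rule can be illustrated with the usual Pearson statistic, the sum over all cells of (observed − expected)² / expected, where each expected count is (row total × column total) / N. The 2 × 2 counts below are hypothetical and are not the data of the slide's example; the critical value 3.841 is the tabulated χ2 value for 1 degree of freedom at the 5% level.

```python
def chi_square_statistic(table):
    """Pearson chi-square statistic and degrees of freedom for an r x c table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of the two attributes
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    return chi2, df

observed = [[30, 10],     # hypothetical counts: attribute A present
            [20, 40]]     # hypothetical counts: attribute A absent
chi2, df = chi_square_statistic(observed)

critical = 3.841          # chi-square table: df = 1, alpha = 0.05
print(f"chi-square = {chi2:.3f}, df = {df}, reject H0: {chi2 > critical}")
```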
Example
Solution:
Solution…
Solution…
Conclusion
• At 5% level of significance we have evidence to say
there is association between father and son regarding
boldness, based on this sample data.
Exercise
• Attack rates among the vaccinated and unvaccinated
against measles are given in the Table below.
• Prove the protective value of vaccination by χ2 test.
7. Correlation and Linear regression
Correlation
• It is the quantification of the degree to which
two random variables are related, provided
that the relationship is linear.
• Used to investigate the relationships that can
exist among continuous variables
• We will discuss on
– Two way scatter plot
– Pearson’s correlation coefficient
Two way scatter plot
• Example: Percentage of children immunized against DPT and
under-five mortality rate for 20 countries, 1992
Not surprisingly, the mortality rate tends
to decrease as the % of children
immunized increases.
Pearson's Correlation Coefficient
Scatter plots showing possible relationships between X and Y
Pearson's Correlation…
• The correlation between the random variables X and Y
is denoted by the Greek letter ρ (rho); its sample estimate is denoted by r.
• The correlation quantifies the strength of the linear
relationship between the outcomes x and y.
• It can be thought of as the average of the product of
the standard normal deviates of X and Y
Pearson's Correlation…
The correlation between X and Y may be one of the
following
1. Perfect positive (r=1)
2. Positive(r between 0 and 1)
3. No correlation (r=0)
4. Negative(r between -1 and 0)
5. Perfect negative(r=-1)
Pearson's Correlation…
• The presence of correlation between two variables
may be due to three reasons:
– One variable being the cause of the other.
– Both variables being the result of a common cause
– Chance
Pearson's Correlation…
• In the previous example, on immunization and U5
mortality.
• There is a strong linear relationship between the percentage
of children immunized against DPT in a specified country and
its under-five mortality rate;
• the correlation coefficient is fairly close to its minimum
value of -1. Since r is negative, mortality rate decreases in
magnitude as percentage of immunization increases
Simple Linear Regression
• It refers to the linear relationship between two
continuous variables
– We usually denote the dependent variable by Y and the
independent variable by X.
• A simple regression line is the line fitted to the points
plotted in the scatter diagram which would describe
the average relationship between the two variables.
– Therefore, to see the type of relationship, it is advisable
to prepare a scatter plot before fitting the model.
Simple Linear Regression…
• The linear model is :
Simple Linear Regression…
• The above model is estimated by ordinary least squares (OLS):
$$\hat{Y} = a + bX$$
– Where “a” is a constant which gives the value of Y when
X=0. It is called the Y intercept.
– And “b” is a constant indicating the slope of the
regression line, and it gives a measure of the change in Y
for a unit change in X. It is also called the regression coefficient of
Y on X
– The calculation formulas for “a” and “b” are:
$$b = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2}, \qquad a = \bar{y} - b\,\bar{x}$$
Example:
The data on Resting metabolic rate (RMR) in (kcal/24
hrs) and body weight in (kg) for 10 Nutrition clinic
clients
BW 57.6 64.9 59.2 60.0 72.8 77.1 82.0 86.2 91.6 99.8
RMR 1325 1365 1342 1316 1382 1439 1536 1466 1519 1639
a. Plot the scatter diagram to view the relationship
b. Calculate a simple correlation coefficient and interpret
c. Fit a regression line of RMR on BW using least square
estimates.
d. Predict the value of RMR if the BW is 85.
Solution
a. The scatter diagram looks: it seems linear
relationship
Solution…
b. We can use the “r” calculating formula
$$r = \frac{13{,}510.72}{\sqrt{(1953.56)(102{,}424.9)}} = 0.955$$
Interpretation
• There is a strong positive linear relationship between BW
and RMR
Solution…
c. First we have to calculate a and b using the formula
Thus the regression line is given by y= 913.3729+ 6.91596x
Solution…
d. Simply replace 85 in the place of x in the fitted model y=
913.3729+ 6.91596x.
Thus, y= 913.3729+6.91596*85
y=1501.2295
Interpretation
For a client with a BW of 85 kg, the predicted RMR is about 1501.2 kcal/24 hrs
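Parts b to d of this example can be reproduced directly from the ten data pairs. The sketch below uses the standard sums-of-squares formulas for Pearson's r and the least-squares line; the variable names are chosen here for illustration.

```python
import math

bw = [57.6, 64.9, 59.2, 60.0, 72.8, 77.1, 82.0, 86.2, 91.6, 99.8]   # body weight (kg)
rmr = [1325, 1365, 1342, 1316, 1382, 1439, 1536, 1466, 1519, 1639]  # RMR (kcal/24 hrs)

n = len(bw)
mean_x = sum(bw) / n
mean_y = sum(rmr) / n

# Corrected sums of squares and cross-products
sxx = sum((x - mean_x) ** 2 for x in bw)
syy = sum((y - mean_y) ** 2 for y in rmr)
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(bw, rmr))

r = sxy / math.sqrt(sxx * syy)     # b. Pearson correlation coefficient
b = sxy / sxx                      # c. slope of RMR on BW
a = mean_y - b * mean_x            # c. intercept
rmr_at_85 = a + b * 85             # d. predicted RMR at BW = 85 kg

print(f"r = {r:.3f}, a = {a:.4f}, b = {b:.5f}, prediction = {rmr_at_85:.1f}")
```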
Multiple linear regression
• is a model with two or more regressors having a
linear relationship with a response variable Y.
• The multiple regression model is
$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_p X_p + \varepsilon$$
• Where:
– Y = response variable
– Xs = regressors
– β0 = intercept
– β1, …, βp = slopes
– ε = random error component
Example: As a research question
• Do number of cigarettes (IV1) and exercise (IV2),
predict CHD mortality (DV)?
• Cigarettes, Exercise → CHD Mortality
…the SPSS output
…cont
• From the 1st table we can see the correlation
between Cig and CHD
• From the 2nd table again we can see the
ANOVA table
• We are interested with the 3rd table
– We will focus on the Unstandardized predicted and
residual values.
– The model “CHD=27.08+0.45Cig–5.92Exercise”
…cont
The interpretation looks
• The model “CHD=27.08+0.45Cig–5.92Exercise”
• Smoking and exercise are significant predictors of CHD mortality
The conclusion will be
• “In the given 21 countries, a 1 cigarette increase in
smoking raises the CHD mortality by 0.45, holding exercise constant” and “when the
number of exercise sessions per week is decreased by 1, the CHD
mortality increases by 5.92, holding smoking constant”
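Each slope in a multiple regression is the change in the predicted response for a one-unit change in that regressor while the others are held fixed. A small sketch using the fitted model quoted above; the input values 10 and 2 are arbitrary illustrations, and the function name is introduced here.

```python
def predict_chd(cig, exercise):
    """Predicted CHD mortality from the fitted model quoted on the slide."""
    return 27.08 + 0.45 * cig - 5.92 * exercise

# One more cigarette, exercise held fixed: the prediction changes by the Cig slope
effect_of_cig = predict_chd(11, 2) - predict_chd(10, 2)

# One less exercise session, smoking held fixed: changes by minus the Exercise slope
effect_of_less_exercise = predict_chd(10, 1) - predict_chd(10, 2)

print(round(effect_of_cig, 2), round(effect_of_less_exercise, 2))
```

The intercept 27.08 is the predicted value when both regressors are zero; it is not part of either effect.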
8. Logistic Regression
Logistic Regression
• Is a method for examining the relationship
between a Categorical (Dichotomous)DV with
one or more IVs.
– Simple Logistic Regression
• 1 dichotomous DV and 1IV
– Multiple Logistic Regression
• 1 dichotomous DV and >1IVs
…cont
• Logistic regression is used to predict a
categorical (usually dichotomous) variable from a
set of predictor variables.
• Logit analysis is usually employed if all of the
predictors are categorical; and
• logistic regression is often chosen if the predictor
variables are a mix of continuous and categorical
variables.
• Logistic regression has been especially popular
with medical research in which the dependent
variable is whether or not a patient has a disease.
Cont…
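• The model;
$$\ln\left(\frac{\hat{p}}{1-\hat{p}}\right) = \beta_0 + \beta_1 X_1 + \dots + \beta_k X_k$$
• where $\hat{p}$ is the predicted probability of the
event which is coded with 1.
• Odds Ratio (OR) = e^β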
Assumptions of Binary Logistic Regression
1. The DV must be binary (e.g. 0 and 1)
2. Since the model estimates p(y=1), the DV should be coded
accordingly
3. The model should be fitted correctly (neither
over- nor under-fitted)
– Only meaningful variables should be included
4. Error terms need to be independent: each
observation should be independent, and the IVs should not be highly collinear
5. Linearity between the independent variables and the log
odds
6. It requires quite large sample size
– at least 30 cases for each parameter to be
estimated
Example
• Factors associated with physician agreement
on causes of death.
• DV;
– Physician Agreement (1=Agree Vs 0=Disagree)
• IV;
– AgeCat1
– OccupDeceaCataaa, EducDeceasedCat
– RespAge
– DeceaSex
Cont…
• See the output from the SPSS output window
• In the final model “OccupDecea” and
“DecesSex” were the factors that affect the
physician agreement
Cont…
• The interpretation looks
– Physicians were 28% less likely (in terms of odds) to
agree on the cause of death when the deceased was female
than when the deceased was male
(OR = 0.72, 95% CI: 0.525-0.988).
– Physicians had 2.18 times higher odds of agreeing
when the deceased was ineligible for
any work than when the deceased was a
worker (OR = 2.18, 95% CI: 1.43-3.32).
– OR = e^β
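The link between a logistic coefficient, its odds ratio and the "percent more or less likely" wording can be sketched as below. The function names are illustrative, and the standard error passed to `odds_ratio_with_ci` is a hypothetical value, since the SPSS output is not reproduced here.

```python
import math

def odds_ratio_with_ci(beta, se, z=1.96):
    """Odds ratio and 95% CI from a logistic coefficient and its standard error."""
    return math.exp(beta), math.exp(beta - z * se), math.exp(beta + z * se)

def percent_change_in_odds(odds_ratio):
    """Percent change in the odds relative to the reference group."""
    return (odds_ratio - 1) * 100

change_female = percent_change_in_odds(0.72)     # OR below 1: odds are lower
change_not_working = percent_change_in_odds(2.18)  # OR above 1: odds are higher

beta_female = math.log(0.72)                     # coefficient implied by OR = 0.72
or_female, ci_low, ci_high = odds_ratio_with_ci(beta_female, se=0.16)  # se is hypothetical

print(round(change_female), round(change_not_working), round(or_female, 2))
```

An odds ratio whose confidence interval excludes 1 corresponds to a coefficient whose interval excludes 0.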
Example 2
• The DV;
– Decision about research (0=stop and 1=continue)
• The IV;
– gender (0=F and 1=M)
• The model and the output looks
Meaning???
9. Survival analysis
What is Survival Analysis?
• Survival analysis refers to statistical methods
for analyzing survival data
• Survival data could be derived from laboratory
studies of animals or from clinical and
epidemiologic studies
• Survival data could relate to outcomes for studying
acute or chronic diseases
What is Survival Time?
• Survival time refers to a variable which measures
the time from a particular starting time (e.g., time
treatment was initiated) to a particular endpoint of
interest (e.g., attaining certain functional abilities)
• It is important to note that for some subjects in the
study a complete survival time may not be available
due to censoring
Censored Data
Some patients may still be alive or in remission
at the end of the study period
The exact survival times of these subjects are
unknown
These are called censored observations or
censored times and can also occur when
individuals are lost to follow-up after a period of
study
Random Right Censoring
• Suppose 4 patients with acute leukemia enter a
clinical study for three years
• Remission times of the four patients are recorded as
10, 15+, 35 and 40 months
• 15+ indicates that for one patient the remission time
is greater than 15 months but the actual value is
unknown
Important Areas of Application
• Clinical Trials (e.g., Recovery Time after heart
surgery)
• Longitudinal or Cohort Studies (e.g., Time to
observing the event of interest)
• Life Insurance (e.g., Time to file a claim)
• Quality Control & Reliability in Manufacturing (e.g.,
The amount of force needed to damage a part such
that it is not useable)
Survival Function or Curve
Let T denote the survival time
S(t) = P(surviving longer than time t )
= P(T > t)
The function S(t) is also known as the cumulative
survival function. 0 ≤ S(t) ≤ 1
$$\hat{S}(t) = \frac{\text{number of patients surviving longer than } t}{\text{total number of patients in the study}}$$
E.g: Four patients’ survival time are 10, 20, 35
and 40 months. Estimate the survival function.
[Figure: estimated survival curve for the four patients; y-axis: % Surviving, x-axis: Month (0 to 50)]
Example: Four patients’ survival data are 10,
15+, 35 and 40 months. Estimate the survival
function
[Figure: estimated survival curve with one censored observation (15+); y-axis: % Surviving, x-axis: Month (0 to 50)]
In 1958, Product-Limit (P-L) method was
introduced by Kaplan and Meier (K-M)
• As you move from left to right in estimation of the
survival curve first assign equal weights to each
observation. Do not jump at the censored observations
• Redistribute equally the pre-assigned weight to the
censored observations to all observations to the right of
each censored observation
• Median survival is a point of time when S(t) is 0.5
• Mean is equal to the area under the survival curve
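The product-limit idea can be sketched in a few lines: at each observed event time the current survival estimate is multiplied by (1 − deaths / number at risk), and censored subjects simply leave the risk set without causing a drop. This is a minimal illustration for the two four-patient examples above, not a replacement for a statistical package; the function name and the True/False event coding are assumptions made here.

```python
def kaplan_meier(times, events):
    """Product-limit estimate. events[i] is True for an observed event,
    False for a censored time. Returns [(time, S(time))] at event times."""
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = sum(1 for tt, e in data if tt == t and e)
        leaving = sum(1 for tt, _ in data if tt == t)   # events plus censored at t
        if deaths > 0:
            survival *= 1 - deaths / n_at_risk          # the curve steps down only at events
            curve.append((t, survival))
        n_at_risk -= leaving
        i += leaving
    return curve

# All four survival times observed: 10, 20, 35, 40 months
curve_complete = kaplan_meier([10, 20, 35, 40], [True, True, True, True])

# Second observation censored: 10, 15+, 35, 40 months
curve_censored = kaplan_meier([10, 15, 35, 40], [True, False, True, True])

print(curve_complete)
print(curve_censored)
```

With the censored data the estimate is 0.75 after month 10 and 0.375 after month 35, which matches the redistribute-to-the-right description: the weight of the 15+ observation is shared equally between the two later patients.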
A few critical features of P-L or K-M
Estimator
• The PL method assumes that censoring is
independent of the survival times
• K-M estimates are limited to the time interval in
which the observations fall
• If the largest observation is uncensored, the PL
estimate at that time equals zero
Comparison Of Two Survival Curves
• Let S1(t) and S2(t) be the survival functions of the
two groups.
• The null hypothesis is
H0: S1(t) =S2(t), for all t > 0
• The alternative hypothesis is:
H1: S1(t) ≠ S2(t), for some t > 0
The Logrank Test
• SPSS, Stata, SAS, S-Plus and many other statistical
software packages have the capability of analyzing
survival data
• Logrank Test can be used to compare two survival
curves
• A p-value of less than the alpha level (0.05) based on the
Logrank test indicates a difference between the two
survival curves
EXAMPLE
• Survival time of 30 patients with Acute
Myeloid Leukemia (AML)
• Two possible prognostic factors
Age = 1 if Age of the patient ≥ 50
Age = 0 if Age of the patient < 50
Cellularity = 1 if cellularity of marrow clot section
is 100%
Cellularity =0 otherwise
Format of the DATA
Survival Times and Data of Two Possible
Prognostic Factors of 30 AML Patients
* Censored = 1 if Lost to follow-up
Censored = 0 if Data is Complete
Comparing the survival curves by
Age Groups using Logrank Test
Comparing the survival curves by
Cellularity using Logrank Test
Hazard Function
• The hazard function h(t) of survival time T gives the
conditional failure rate
• The hazard function is also known as the
instantaneous failure rate, force of mortality, and
age-specific failure rate
• The hazard function gives the risk of failure per unit
time during the aging process
Multivariate Analysis: (CPHM)
Cox's Proportional Hazards Model
• CPHM is a technique for investigating the
relationship between survival time and
independent variables
• A PHM possesses the property that different
individuals have hazard functions that are
proportional to one another
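The proportionality property is usually written in the standard form of the Cox model shown below. The notation is an assumption introduced here, since the slide gives no formula: h_0(t) is an unspecified baseline hazard and x_{i1}, …, x_{ip} are the covariates of individual i (in the AML example these would be the Age and Cellularity indicators).

```latex
h_i(t) = h_0(t)\,\exp\left(\beta_1 x_{i1} + \beta_2 x_{i2} + \dots + \beta_p x_{ip}\right)
\qquad\Longrightarrow\qquad
\frac{h_i(t)}{h_j(t)} = \exp\left(\sum_{k=1}^{p} \beta_k \,(x_{ik} - x_{jk})\right)
```

The ratio of two individuals' hazards does not depend on t, which is what "proportional hazards" means, and e^{β_k} is the hazard ratio for a one-unit increase in covariate k with the other covariates held fixed.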
Comparing the survival curves by Age
Groups after Adjusting Cellularity using
CPHM
Comparing the survival curves by
Cellularity Groups after Adjusting Age
using CPHM
10. Study Designs in Epidemiology
Case report and case series
• Are the most basic types of observational study designs.
• These studies describe the experiences of a single person
(case report) or a group of people (case series) who have a
specific disease or condition.
• Case reports and case series typically describe previously
unrecognized diseases or unusual variants of a known
disease process.
• Consequently, data from these studies are particularly
useful for alerting the health community to the presence of
a new disease and for generating hypotheses regarding
possible causes.
Cross sectional study
• Cross-sectional studies are a type of observational
study in which the exposure and outcome are
measured simultaneously.
• Concurrent measurement of potential risk factors and
a disease outcome implies that there is no follow-up
time in cross sectional studies.
Cohort study
• Are observational studies that compare the
incidence of disease among different exposure
groups.
• The cohort study design separates potential risk
factors from the development of disease over
time to demonstrate temporal associations.
• Cohort studies are conducted in three
fundamental steps:
1. Identify a group of people who are initially free of the
disease outcome
2. Measure the exposure(s) of interest to create cohorts
3. Follow the cohorts over time to determine the
incidences of disease
Cohort…
• Design of cohort study
Case control study
• Are observational studies that begin by targeting a
disease or condition of interest and then work
backward to determine associations with previous
exposures.
• The case-control study design is ideally suited for
examining potential risk factors for rare diseases.
Randomized trial
• A randomized trial is a prospective study in
humans that evaluates the benefits and
harms of an intervention against control
procedures
overview of common research study designs
11. Measurement error and bias
Bias
• Epidemiological studies measure characteristics of
populations. These parameters may be
– a disease rate
– the prevalence of an exposure
– The association between an exposure and disease.
• Because studies are carried out on people and
have all the attendant practical and ethical
constraints, they are almost invariably subject
to bias.
Bias…
• Bias is a systematic tendency to under or
overestimate the parameter of interest
because of a deficiency in the design or
execution of a study.
• Two main sources of bias here:
– Selection bias and information bias
Bias…
• Selection bias occurs when the subjects
studied are not representative of the target
population about which conclusions are to be
drawn
– The possibility of selection bias should always be
considered when defining a study sample
• Information bias: arises from errors in
measuring exposure or disease.
Bias…
• Bias cannot usually be totally eliminated from
epidemiological studies.
• The aim, therefore, must be
– to keep it to a minimum
– to identify those biases that cannot be avoided
– to assess their potential impact, and
– to take this into account when interpreting results
• The motto of the epidemiologist could well be
“dirty hands but a clean mind”
Measurement error
• As indicated above, errors in measuring exposure
or disease can be an important source of bias in
epidemiological studies.
• In conducting studies, therefore, it is important to
assess the quality of measurements
• Sometimes a reliable standard is available
against which the validity of a survey method can
be assessed
– E.g. the validity of a mammographic diagnosis of
breast cancer can be tested by biopsy. More often,
however, there is no sure reference standard
Measurement error…
• Measurements of disease in life are often
incapable of full validation.
• In practice, therefore, validity may have to be
assessed indirectly
• Two techniques of measurement
1. Survey method
2. Standard reference test
Analyzing validity
• When a survey technique or test is used to
dichotomize subjects its validity may be
analyzed by classifying subjects as positive
or negative
– firstly by the survey method and secondly
according to a standard reference test
• Four important measures for analyzing validity: sensitivity,
specificity, systematic error, and predictive
value
Analyzing validity…
Comparison of a survey test with a reference test
Analyzing validity…
• Sensitivity—A sensitive test detects a high proportion
of the true cases, and this quality is measured here by
a/(a + c).
• Specificity—A specific test has few false positives, and
this quality is measured by d/(b + d).
• Systematic error—For epidemiological rates it is
particularly important for the test to give the right
total count of cases.
– This is measured by the ratio of the total numbers positive to the
survey and the reference tests, or (a + b)/(a + c).
• Predictive value—This is the proportion of positive test
results that are truly positive, or a/(a + b).
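The four measures can be computed together from the 2 × 2 comparison table. The cell layout assumed here is the one implied by the formulas above: a = positive on both tests, b = survey positive but reference negative, c = survey negative but reference positive, d = negative on both. The counts are hypothetical.

```python
def validity_measures(a, b, c, d):
    """Validity measures for a survey test compared with a reference test.
    a: both positive, b: false positives, c: false negatives, d: both negative."""
    return {
        "sensitivity": a / (a + c),          # share of true cases detected
        "specificity": d / (b + d),          # share of non-cases correctly negative
        "systematic_error": (a + b) / (a + c),  # survey positives / reference positives
        "predictive_value": a / (a + b),     # share of survey positives that are true cases
    }

measures = validity_measures(a=90, b=10, c=10, d=90)   # hypothetical counts
for name, value in measures.items():
    print(f"{name}: {value:.2f}")
```

A systematic error ratio of 1 means the survey test gives the right total count of cases even though individual subjects may be misclassified.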
Analyzing validity…
• For example,
• The sensitivity of mammography for detecting breast
cancer is 90%. This value is interpreted as “90% of
women who have biopsy-proven breast cancer will have a
positive mammogram.”
• The specificity of mammography for detecting breast
cancer is also about 90%. This value is interpreted as
“90% of women who have biopsy-proven absence of
breast cancer will have a negative mammogram.”
Sensitive or specific? A matter of choice
Repeatability
• Examining repeatability is helpful when there is
no satisfactory standard against which to assess
the validity of a measurement technique
• Consistent findings do not necessarily imply that
the technique is valid:
– a laboratory test may yield persistently false positive
results, or a very repeatable psychiatric questionnaire
may be an insensitive measure of, for example,
“stress”.
Repeatability…
• Repeatability can be tested
– within observer (that is, the same
observer performing the measurement on two
separate occasions) and also
– between observers (comparing measurements
made by different observers on the same subject or
specimen).
The total variability can be dissected into four components
• Within observer variation—Discovering one’s own
inconsistency
• Between observer variation—This includes the first
component but adds to it an extra and systematic
component due to individual differences in techniques
and criteria
• Random subject variation—When measured repeatedly
in the same person, physiological variables like blood
pressure tend to show a roughly normal distribution
around the subject’s mean.
• Biased (systematic) subject variation—Blood pressure
is much influenced by the temperature of the
examination room, as well as by less readily
standardized emotional factors.
Analyzing repeatability
• For continuous numerical variables
• Calculated by SD or CV(standard deviation ÷
mean)
• Scatter plot will show the extent and pattern
of observer variation
– to plot the difference between each pair of
measurements against their mean.
– E.g. Blood pressure
Analyzing repeatability…
• For qualitative attributes
• κ statistic, which measures the level
of agreement over and above what would be
expected from the prevalence of the attribute.
• The proportion of the total in cells a and d, is
the level of agreement
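For two observers classifying the same subjects as positive or negative, the κ statistic can be sketched as (observed agreement − expected agreement) / (1 − expected agreement). The layout assumed here is a = both positive, d = both negative, with b and c the two kinds of disagreement; the counts are hypothetical.

```python
def cohen_kappa(a, b, c, d):
    """Kappa for a 2 x 2 agreement table between two observers.
    a: both positive, d: both negative, b and c: the observers disagree."""
    n = a + b + c + d
    observed = (a + d) / n                   # proportion of the total in cells a and d
    # Agreement expected by chance, from each observer's marginal totals
    expected = ((a + b) * (a + c) + (c + d) * (b + d)) / n ** 2
    return (observed - expected) / (1 - expected)

kappa = cohen_kappa(a=40, b=10, c=5, d=45)   # hypothetical counts
print(f"kappa = {kappa:.2f}")
```

Here the observers agree on 85% of subjects, but half of that agreement would be expected by chance, so κ is well below 0.85.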
12. Validity and Reliability
Validity
• The ability of a test, data
collection mechanism or process,
to accurately measure the variable
of interest:
– to distinguish diseased from non-diseased
subjects
– To measure the presence or absence of a
particular risk factor (exposure)
– To measure the magnitude of disease or risk
in a population
Threats to Validity
• Bias
• Confounding
• Chance
Reliability
• The degree to which results are
consistently measured by any type
of data collection instrument
– medical test
– medical record
– observation
– study questionnaire
Example:
• In a study assessing the association between
alcohol intake (high vs. low) and high blood
pressure the investigator calculated the
following results:
RR = 2.13 (95% C.I.: 1.05 - 12.10) p = 0.01
• CONCLUSIONS?
• PROBLEMS?
Conclusions:
RR = 2.13 (1.05 - 12.10) p = 0.01
• P-value:
We have observed an association that is significantly
different from the null hypothesis (RR=1): if there were truly no
association, the probability of observing an effect at least this large
by chance would be 1 in 100.
• Confidence Interval:
If we did this study 100 times (took 100 different
samples from the target population) approximately
95% of the time the interval would cover the true
population measure.
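The repeated-sampling reading of a confidence interval can be checked with a small simulation. This is a sketch under stated assumptions: it uses a normally distributed measurement and a 95% interval for a mean rather than a relative risk, and the population values (mean 120, SD 15), sample size and seed are all hypothetical.

```python
import random
import statistics

def ci_coverage(n_studies=2000, n=50, true_mean=120.0, sd=15.0, seed=1):
    """Repeat a 'study' many times and return the fraction of 95% confidence
    intervals (mean +/- 1.96 * standard error) that cover the true mean."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_studies):
        sample = [rng.gauss(true_mean, sd) for _ in range(n)]
        mean = statistics.mean(sample)
        se = statistics.stdev(sample) / n ** 0.5
        if mean - 1.96 * se <= true_mean <= mean + 1.96 * se:
            hits += 1
    return hits / n_studies

coverage = ci_coverage()
print(f"Fraction of intervals covering the true mean: {coverage:.3f}")
```

The printed fraction should be close to 0.95; any single interval either covers the true value or it does not.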
Conclusions/Problems
RR = 2.13 (1.05 - 12.10) p = 0.01
• How sure can we really be that the true RR is 2.13?
• Why is this Confidence Interval so wide?
• What are the sources of error which may have affected
precision (wide CI) and internal validity?
Possible errors in every study
should be carefully considered
Error
• “Epidemiology can be considered an exercise in
measurement and estimation”
– Sander Greenland
• Epidemiologic studies attempt to approximate the “real
world” by evaluating the relationship between exposure and
outcome in a sample of people.
• Therefore, error is inevitable!
– Both in sample selection: It is seldom desirable, necessary,
or possible to study everyone.
– And in measurement: Can never measure exposure and
outcome perfectly.
Truth and Approximation
Truth in the “real world”                            Approximation in the study
• Actual people to whom findings apply               Study population
  (TARGET POPULATION)
• Actual exposure and outcome                        Measurements collected
• Actual relationship between exposure and outcome   Results from a given study
Difference between truth and approximation = error
Sources of Error in a Study
• Lack of study precision (RELIABILITY)
  – Random Error
• Lack of study validity (INTERNAL VALIDITY)
  – Systematic Error
    · Selection Bias
    · Information Bias
    · Confounding
Types of Error in Epidemiologic Research
• Random error
– reflects fluctuations around the true value of a
parameter.
– is essentially attributable to sampling variation,
the extent of which may depend on aspects of the
study design (e.g. sample size) and statistical
characteristics of the estimator (e.g. its variance).
• Systematic or non-random error
– leads to BIAS
– reflects a deviation of results or inferences from
the truth.
– the processes leading to such deviation can be
introduced at any point in an investigation.
Errors and Study Size
[Figure: Errors and Study Size (label: BIAS)]
Effect of Bias
• Bias will result in an estimate that is not the same as
the true value.
• Directions of bias:
– Away from the null:
• study RR=8, true RR=2
• study RR=0.5, true RR=0.9
– Towards the null:
• study RR=1.3, true RR=5.0
• Study RR=0.9, true RR=0.4
– “Switchover”:
• Study RR=0.5, true RR=2.0
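The three directions can be told apart by comparing the study and true relative risks on the log scale, where the null value RR = 1 becomes 0. The function below is an illustrative sketch; its name and return labels are invented for this example.

```python
import math

def bias_direction(study_rr, true_rr):
    """Classify bias by comparing log(RR) values; log(1) = 0 is the null."""
    log_study, log_true = math.log(study_rr), math.log(true_rr)
    if log_study * log_true < 0:
        return "switchover"            # estimates fall on opposite sides of RR = 1
    if abs(log_study) > abs(log_true):
        return "away from the null"    # study exaggerates the true effect
    if abs(log_study) < abs(log_true):
        return "towards the null"      # study understates the true effect
    return "no bias"

# The (study RR, true RR) pairs listed above
for study_rr, true_rr in [(8, 2), (0.5, 0.9), (1.3, 5.0), (0.9, 0.4), (0.5, 2.0)]:
    print(study_rr, true_rr, bias_direction(study_rr, true_rr))
```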
Internal vs. External Validity
• Bias undermines internal validity, which is the
ability to measure what the study sets out to
measure.
– It requires proper selection of study subjects and
lack of error in measurement.
• External Validity concerns inferences to an
external population beyond the study’s
restricted interest.
– Such inferences require generalization based on
judgmental aspects, such as findings from other
studies and existing knowledge about the biology of
the disease.
• In this course, we limit the discussion of
validity to internal validity.
Sources of error
• Random error affects precision (RELIABILITY)
• Systematic error (bias) affects VALIDITY
[Figure: scatter of X marks illustrating random versus systematic error]
Aday, 1996
How can the relative risk or odds ratio be wrong?
• Lack of study precision (RELIABILITY)
  – Random Error
• Lack of study validity (INTERNAL VALIDITY)
  – Systematic Error
    · Selection Bias
    · Information Bias
    · Confounding
Sources of error
(random or systematic)
Error can be introduced by the…
Study observer/investigator
Study participant
Study instrument
During the process of…
Selection of study subjects
Measurement of disease and/or exposure
Analysis or interpretation of findings
How do we prevent threats to validity
(systematic error) in our research?
1) Study design: Minimize Bias
(more on this in upcoming lectures)
2) Study implementation:
Quality Assurance & Quality Control
3) Use “validated tools” (best if validated in your
population)
Assessing the Quality of a
Measurement Tool
• Accuracy (validity):
– sensitivity/specificity with a gold standard
• Either validity or reliability:
– Mean difference
– Kappa/% agreement
– Correlation
– Regression
• Reliability
– Correlation coefficient (ICC)
– Coefficient of variation (CV)
– Bland-Altman/limit of agreement (LOA)
Sensitivity/Specificity
• The ability of a test to distinguish
diseased from non-diseased subjects
Two by Two Table
            Disease Yes   Disease No   Total
Test Pos    TP            FP           TP + FP
Test Neg    FN            TN           TN + FN
Total       TP + FN       TN + FP
Sensitivity
• percentage of all true cases (diseased) that test positive
[TP / (TP + FN)] × 100
Specificity
• percentage of all true non-cases (non-diseased) that test negative
[TN / (TN + FP)] × 100
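As a small worked example, sensitivity and specificity can be computed directly from the four cells of the two-by-two table. The counts below are hypothetical and the function names are chosen only for this sketch.

```python
def sensitivity(tp, fn):
    """Percentage of truly diseased subjects who test positive."""
    return tp / (tp + fn) * 100

def specificity(tn, fp):
    """Percentage of truly non-diseased subjects who test negative."""
    return tn / (tn + fp) * 100

# Hypothetical screening results compared against a gold standard
tp, fp, fn, tn = 90, 30, 10, 870

sens = sensitivity(tp, fn)
spec = specificity(tn, fp)
print(f"Sensitivity = {sens:.1f}%, Specificity = {spec:.1f}%")
```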
Two by Two Table
            Disease Yes   Disease No   Total
Test Pos    TP            FP           TP + FP
Test Neg    FN            TN           TN + FN
Total       TP + FN       TN + FP
            Sensitivity   Specificity
False positive & negative results
• False positives
– burden on the health care system
– unnecessary anxiety
– labeling
• False negatives
– delay treatment
– false sense of “security” regarding risk
behaviors
Improving sensitivity and/or specificity
• Sequential testing
– initial test positives examined using other
method
– improves specificity
• Simultaneous tests
– multiple variables assessed at the same time
– improves sensitivity
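The trade-off between the two strategies can be illustrated numerically. The sketch below assumes the two tests err independently given true disease status (a simplifying assumption that often does not hold in practice), that sequential testing calls a subject positive only if both tests are positive, and that simultaneous testing calls a subject positive if either test is positive. All values are hypothetical.

```python
def sequential(se1, sp1, se2, sp2):
    """Net sensitivity and specificity when positive means positive on BOTH tests."""
    net_se = se1 * se2
    net_sp = 1 - (1 - sp1) * (1 - sp2)
    return net_se, net_sp

def simultaneous(se1, sp1, se2, sp2):
    """Net sensitivity and specificity when positive means positive on EITHER test."""
    net_se = 1 - (1 - se1) * (1 - se2)
    net_sp = sp1 * sp2
    return net_se, net_sp

# Hypothetical test characteristics, expressed as proportions
se1, sp1, se2, sp2 = 0.80, 0.90, 0.90, 0.95

seq_se, seq_sp = sequential(se1, sp1, se2, sp2)
sim_se, sim_sp = simultaneous(se1, sp1, se2, sp2)
print(f"Sequential:   Se = {seq_se:.3f}, Sp = {seq_sp:.3f}")
print(f"Simultaneous: Se = {sim_se:.3f}, Sp = {sim_sp:.3f}")
```

Under these assumptions sequential testing raises specificity at the cost of sensitivity, and simultaneous testing does the reverse.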
Measure of Yield
Predictive value positive (PVP) is the
proportion of people with a positive test who
actually have the disease
PVP = TP / (TP + FP)
Influences on PVP
PVP influenced by:
– sensitivity and specificity of the test used,
especially specificity
– prevalence of disease in the population
being tested
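The dependence of PVP on prevalence follows from Bayes' theorem. The sketch below uses a hypothetical test with 90% sensitivity and 90% specificity; the function name is chosen only for this example.

```python
def predictive_value_positive(sens, spec, prevalence):
    """PVP = true positives / all positives, with inputs given as proportions."""
    true_pos = sens * prevalence
    false_pos = (1 - spec) * (1 - prevalence)
    return true_pos / (true_pos + false_pos)

# Same hypothetical test applied in populations with different prevalence
pvp_low = predictive_value_positive(0.90, 0.90, 0.01)
pvp_high = predictive_value_positive(0.90, 0.90, 0.50)
print(f"PVP at  1% prevalence: {pvp_low:.3f}")
print(f"PVP at 50% prevalence: {pvp_high:.3f}")
```

With the same test, most positives are false positives when the disease is rare, which is why prevalence matters so much for yield.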
Increasing Reliability :
(the precision and reproducibility of data collected)
1. Reduce intra-subject variability
-Repeated Measurements
-Standardized data collection times
2. Reduce inter-observer variability
-Standardized diagnostic criteria, tests,
and instruments
3. Increase sample size
Assessing Reliability
– Inter-rater
• % agreement, kappa statistic
– Internal consistency
• Kuder-Richardson 20 (KR-20), Cronbach’s coefficient
alpha
– Test-retest
• Quantified by correlation coefficient
*See Szklo book for more examples*
Assessing agreement between observers,
instruments, etc.
• Percent (observed) agreement
– proportion of measurements that have the same
results by two (or more) methods, expressed as a
percentage
% agreement =(a+d) / (a+b+c+d)
• Kappa measure: the extent to which 2 measures agree,
taking into account their agreement expected by
chance alone (ex: agreement if two assessors rated
responses at random)
Calculating % (observed) agreement
            Test 1 +   Test 1 -   Total
Test 2 +      140         52       192
Test 2 -       69        725       794
Total         209        777       986
% agreement = (140 + 725) / 986
% agreement = 0.877 or 87.7%
Kappa
            Test 1 +   Test 1 -   Total
Test 2 +      140         52       192
Test 2 -       69        725       794
Total         209        777       986
1) Calculate % (observed) agreement
   (140 + 725) / 986 = 0.877
2) Calculate % chance agreement
   Expected value for Test 1+/Test 2+ (cell a) = (209 × 192) / 986 = 40.7
   Expected value for Test 1-/Test 2- (cell d) = (777 × 794) / 986 = 625.7
   % chance agreement = (40.7 + 625.7) / 986 = 0.676
3) Kappa = (% observed agreement - % chance agreement) / (1 - % chance agreement)
\[
\kappa = \frac{\dfrac{140 + 725}{986} - \dfrac{40.7 + 625.7}{986}}{1 - \dfrac{40.7 + 625.7}{986}}
       = \frac{0.877 - 0.676}{1 - 0.676} = 0.62
\]
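The same calculation can be written as a short function, which makes it easy to repeat for other tables. This is a sketch for a 2 × 2 table only; the cell labels a, b, c, d follow the layout used above (a and d are the agreeing cells) and the function name is chosen for this example.

```python
def kappa_2x2(a, b, c, d):
    """Return (observed agreement, chance agreement, kappa) for a 2x2 table
    with agreeing cells a (+/+) and d (-/-) and disagreeing cells b and c."""
    n = a + b + c + d
    observed = (a + d) / n
    # Expected counts in the agreeing cells: (row total * column total) / n
    expected_a = (a + b) * (a + c) / n
    expected_d = (c + d) * (b + d) / n
    chance = (expected_a + expected_d) / n
    return observed, chance, (observed - chance) / (1 - chance)

# Counts from the table above
observed, chance, kappa = kappa_2x2(a=140, b=52, c=69, d=725)
print(f"observed = {observed:.3f}, chance = {chance:.3f}, kappa = {kappa:.2f}")
```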
Evaluation of Kappa
Values of kappa range from –1 to 1:
– If kappa = 0, observed agreement same as chance
alone
– If kappa < 0, observed agreement worse than by
chance alone
– If kappa = 1, observed agreement = 100% (perfect!)
In medical research:
κ > 0.75 excellent
0.40 < κ < 0.75 good
0 < κ < 0.40 marginal/poor
Correlation coefficient
– Pearson for normally distributed data (actual values)
– Spearman for non-normally distributed data (ranks)
– Both measure the degree to which a scatter diagram
between 2 readings approaches a straight line (if one
goes up, the other goes up; if one goes down, the
other goes down)
– If replicate measurements show reliability, they will
be highly POSITIVELY correlated
• A negative correlation is NOT what you want in
reliability or validity studies
Correlation Coefficients
[Figure: three scatter plots, panels A, B and C]
• All 3 r=1.0
– A. Both observers get same exact value
– B and C. Systematic differences between
observers, but very reliable differences
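The point that a perfect correlation can hide systematic differences is easy to demonstrate. In the sketch below the readings are hypothetical, and the constant offset and proportional difference are only assumed stand-ins for the kinds of systematic difference the panels illustrate.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient computed from first principles."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

observer_1 = [110, 120, 130, 140, 150]
identical = list(observer_1)                 # second observer gets the same values
offset = [v + 10 for v in observer_1]        # second observer reads 10 units higher
scaled = [1.2 * v for v in observer_1]       # second observer reads 20% higher

r_identical = pearson_r(observer_1, identical)
r_offset = pearson_r(observer_1, offset)
r_scaled = pearson_r(observer_1, scaled)
mean_diff_offset = sum(b - a for a, b in zip(observer_1, offset)) / len(observer_1)

print(r_identical, r_offset, r_scaled)
print("mean difference with offset:", mean_diff_offset)
```

All three correlations equal 1 (up to rounding), yet the mean difference exposes the disagreement, which is why correlation alone does not establish agreement.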
Intraclass Correlation Coefficient (ICC)
• Also known as the reliability coefficient (RC)
• (more than one way to calculate)
• The fraction of the total measurement variability that
is due to variation between patients
• High ICC indicates little variability due to the
technologist, and large variability due only to the
patient…high is Good!
• Variation between patients / (variation between
patients + variation due to error)
• Therefore, substantially affected by amount of
variation between patients (particularly important
when comparing studies)
• It is the counterpart of the kappa statistic for
continuous data; as a proportion of variance it ranges from 0 to 1
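Since there is more than one way to calculate the ICC, the sketch below shows only one of them: the one-way random-effects form, in which between-patient and error variation are estimated from one-way ANOVA mean squares. The data are hypothetical (three patients, each measured twice) and the function name is chosen for this example.

```python
def icc_oneway(ratings):
    """One-way random-effects ICC for a list of patients, each with the same
    number k of repeated measurements.
    ICC = (MSB - MSW) / (MSB + (k - 1) * MSW)"""
    n = len(ratings)
    k = len(ratings[0])
    grand_mean = sum(sum(row) for row in ratings) / (n * k)
    patient_means = [sum(row) / k for row in ratings]
    # Between-patient and within-patient (error) mean squares
    msb = k * sum((m - grand_mean) ** 2 for m in patient_means) / (n - 1)
    msw = sum((x - m) ** 2
              for row, m in zip(ratings, patient_means)
              for x in row) / (n * (k - 1))
    return (msb - msw) / (msb + (k - 1) * msw)

# Hypothetical data: large differences between patients, small measurement error
ratings = [[10, 11], [20, 21], [30, 29]]
icc = icc_oneway(ratings)
print(f"ICC = {icc:.3f}")
```

Because between-patient variation sits in the numerator, the same measurement error yields a lower ICC in a more homogeneous group of patients.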
Now that I know the reliability/validity for my
study, what do I do?
• If good – feel more confident in results
• If one measure better than others – use that one
• If not so good or systematically biased – may be able
to correct
– Simple or complex
• If not so good and can’t correct – need to mention in
discussion
Relationship between reliability and
validity
• A measure can, and often does, have reliability
without validity
Remember!
“If all appears to be going well in an epidemiologic study,
you have forgotten something”
– Kahn and Sempos
Thank you!