Introduction to Biostatistics Basics
What does the word statistics bring to mind? To most people, it suggests
numerical facts or data, such as unemployment figures, farm prices, or the
number of marriages and divorces. Biostatistics deals with the design and
execution of scientific experiments on living creatures, the acquisition and
analysis of data from those experiments, and the interpretation and
presentation of the results of those analyses.
As such, statistics is the art of learning from data. It is concerned with the
collection of data, their subsequent description, and their analysis, which often
leads to the drawing of conclusions.
Types of Statistics
Broadly speaking, applied statistics can be divided into two areas: descriptive
statistics and inferential statistics.
Descriptive Statistics
Suppose we have information on the test scores of students enrolled in a
statistics class. In statistical terminology, the whole set of numbers that
represents the scores of students is called a data set, the name of each student
is called an element, and the score of each student is called an observation.
A data set in its original form is usually very large. Consequently, such a data
set is not very helpful in drawing conclusions or making decisions. It is easier
to draw conclusions from summary tables and diagrams than from the original
version of a data set. So, we reduce data to a manageable size by constructing
tables, drawing graphs, or calculating summary measures such as averages.
The portion of statistics that helps us do this type of statistical analysis is
called descriptive statistics.
Descriptive statistics consists of methods for organizing, displaying, and
describing data by using tables, graphs, and summary measures.
Inferential Statistics
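In statistics, the collection of all elements of interest is called a population.
The selection of a few elements from this population is called a sample.
Inferential statistics consists of methods that use sample results to help make
decisions or predictions about a population.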
Basic Terms
In statistics, we are interested in obtaining information about a total collection
of elements, which we will refer to as the population. The population is often too
large for us to examine each of its members. For instance, we might have all
the residents of a given state, or all the television sets produced in the last year
by a particular manufacturer, or all the households in a given community. In
such cases, we try to learn about the population by choosing and then
examining a subgroup of its elements. This subgroup of a population is called a
sample.
Definition: The total collection of all the elements that we are interested in is
called a population.
A subgroup of the population that will be studied in detail is called a sample.
Table 1.1 gives information on the 2007 charitable givings (in millions of KeS) by
six retail companies. We can call this group of companies a sample of six
companies. Each company listed in this table is called an element or a
member of the sample. Table 1.1 contains information on six elements. Note
that elements are also called observational units.
The 2007 charitable givings in our example is called a variable. The 2007
charitable givings is a characteristic of companies that we are investigating or
studying.
Variable
A variable is a characteristic under study that assumes different values for
different elements. In contrast to a variable, the value of a constant is fixed.
Other examples of variables are the incomes of households, the number of
houses built in a city per month during the past year, the makes of cars owned
by people, the gross profits of companies, and the number of insurance policies
sold by a salesperson per day during the past month. In general, a variable
assumes different values for different elements, as does the 2007 charitable
givings of the six companies in Table 1.1. For some elements in a data set,
however, the values of the variable may be the same.
Definition
Observation or Measurement: The value of a variable for an element is called
an observation or measurement.
From the table, the 2007 charitable givings of Carrefour were KeS 337.9
million. The value KeS 337.9 million is an observation or a measurement. Table 1.1
contains six observations, one for each of the six retail companies. The
information given in the table on the 2007 charitable givings of companies is called
the data or a data set.
Definition
Data Set: A data set is a collection of observations on one or more variables.
Types of Variables
A variable is a characteristic under investigation that assumes different
values for different elements. The incomes of families, heights of persons, gross
sales of companies, prices of college textbooks, makes of cars owned by
families, number of accidents, and status (freshman, sophomore, junior, or
senior) of students enrolled at a university are examples of variables.
Quantitative Variables
Some variables (such as the price of a home) can be measured numerically,
whereas others (such as hair color) cannot. The first is an example of a
quantitative variable and the second that of a qualitative variable.
Definition
Quantitative Variable: A variable that can be measured numerically is called a
quantitative variable. The data collected on a quantitative variable are called
quantitative data. Incomes, heights, gross sales, prices of homes, number of
cars owned, and number of accidents are examples of quantitative variables
because each of them can be expressed numerically.
For instance, the income of a family may be KeS 81,520.75 per year, the gross
sales for a company may be KeS 567 million for the past year, and so forth.
Such quantitative variables may be classified as either discrete variables or
continuous variables.
Discrete Variables
The values that a certain quantitative variable can assume may be countable or
noncountable. For example, we can count the number of cars owned by a
family, but we cannot count the height of a family member. A variable that
assumes countable values is called a discrete variable. Note that there are no
possible intermediate values between consecutive values of a discrete variable.
Definition
Discrete Variable: A variable whose values are countable is called a discrete
variable. In other words, a discrete variable can assume only certain values
with no intermediate values.
For example, the number of cars sold on any day at a car dealership is a
discrete variable because the number of cars sold must be 0, 1, 2, 3,... and we
can count it. The number of cars sold cannot be between 0 and 1, or between 1
and 2. Other examples of discrete variables are the number of people visiting a
bank on any day, the number of cars in a parking lot, the number of cattle
owned by a farmer, and the number of students in a class.
Continuous Variables
Some variables cannot be counted, and they can assume any numerical value
between two numbers. Such variables are called continuous variables.
Definition
Continuous Variable: A variable that can assume any numerical value over a
certain interval or intervals is called a continuous variable.
Definition
Qualitative or Categorical Variable: A variable that cannot assume a
numerical value but can be classified into two or more nonnumeric categories
is called a qualitative or categorical variable. The data collected on such a
variable are called qualitative data.
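The distinction between discrete, continuous, and qualitative variables can be illustrated with a small sketch. It assumes, purely for illustration, that counts are stored as integers, measurements as floating-point numbers, and categories as text; the function name variable_type and the example record are hypothetical and are not part of these notes.

```python
def variable_type(value):
    """Classify a recorded value using a simple storage-type heuristic.

    Assumption for illustration only: counts are stored as int,
    measurements as float, and categories as str.
    """
    if isinstance(value, str):
        return "qualitative"
    if isinstance(value, bool):
        # bool is a subclass of int in Python; treat yes/no answers as categories.
        return "qualitative"
    if isinstance(value, int):
        return "quantitative (discrete)"
    if isinstance(value, float):
        return "quantitative (continuous)"
    raise TypeError(f"unsupported value: {value!r}")


# Hypothetical record for one family member (one element).
record = {"cars_owned": 2, "height_m": 1.68, "hair_color": "black"}
types = {name: variable_type(value) for name, value in record.items()}
for name, kind in types.items():
    print(name, "->", kind)
```

In practice the storage type is only a hint: a numeric code such as 1 = freshman, 2 = sophomore is still a qualitative variable, so the classification should follow what the variable measures.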
Sources of Data
There are two sources of data: primary and secondary.
Primary data: data that are freshly collected, i.e. for the first time. They are
original in character, i.e. they are first-hand information collected, compiled
and published for some purpose. They have not undergone any statistical
treatment.
Secondary data: second-hand information obtained mainly from published
sources such as statistical abstracts, books, encyclopedias, periodicals, media
reports (e.g. census reports), CD-ROMs and other electronic devices, and the
internet. They are not original in character and have undergone some
statistical treatment at least once.
Demerits
(iii) Independent variables are not manipulated.
(iv) Co-operation of the organization is often difficult to obtain.
(v) Data is likely to contain unknown sampling bias.
(vi) The dross rate (proportion of irrelevant data) may be high in such studies.
(vii) Measurement is not as precise as in a laboratory because of the influence of
confounding variables.
b) Census. A census is a study that obtains data from every member of a
population (totality of individuals /items pertaining to certain characteristics).
In most studies, a census is not practical, because of the cost and/or time
required.
c) Sample survey. A sample survey is a study that obtains data from a subset
of a population, in order to estimate population attributes/ characteristics.
Surveys of human populations and institutions are common in government,
health, social science and marketing research.
d) Case study. This is a method of intensively exploring and analyzing the life of a
single social unit, be it a person, a family, an institution, a cultural group or even
an entire community. In this method no attempt is made to exercise
experimental or statistical control, and phenomena related to the unit are
studied in their natural setting. The researcher has considerable discretion in
gathering information from a variety of sources such as diaries, letters,
autobiographies, office records, files or personal interviews.
Merits:
(i) The method is less expensive than other methods.
(ii) It is very intensive in nature: it aims at studying a few units rather than
several.
(iii) Data collection is flexible since the researcher is free to approach the
problem from any angle
(iv) Data is collected from natural settings.
Demerits
(i) It lacks internal validity which is basic to scientific evidence.
(ii) Only one unit of the defined population is studied. Hence the findings of a
case study cannot be used as a basis for generalization about a large
population. They lack external validity.
(iii) Case studies are more time consuming than other methods.
Sampling Frames
For probability sampling, we must have a list of all the individuals (units) in
the population. This list or sampling frame is the basis for the selection process
of the sample. “A [sampling] frame is a clear and concise description of the
population under study, by virtue of which the population units can be
identified unambiguously and contacted, if desired, for the purpose of the
survey” (Hedayat and Sinha, 1991).
Based on the sampling frame, the sampling design can also be classified as:
Individual surveys: used if a list of individuals is available or when the size
of the population is small.
Special population
Household surveys: based on a census of households, and used if
individual-level information is unlikely to be available. In practice, this is
limited to small geographical areas and is known as an “area sampling frame”.
Example: Demographic and Health Surveys (DHS).
Institutional surveys: based on a census of, say, hospital or clinic lists, e.g.
i) 1990 National Hospital Discharge Survey
ii) National Ambulatory Medical Care Survey
Sampling
Sampling is the statistical process of selecting a representative sample. There
are two broad approaches: probability sampling and non-probability sampling.
Probability sampling involves a mathematical chance of selecting each respondent:
every unit in the population has a known chance, greater than zero, of being
selected in the sample. This makes it possible to produce unbiased estimates.
Probability sampling methods include:
(i) Simple random sampling
(ii) Systematic sampling
(iii) Stratified sampling
(iv) Cluster sampling
(v) Multi-stage sampling
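The first two methods in the list can be sketched in a few lines. This is a minimal illustration rather than a prescribed procedure: the function names, the seed parameter, and the frame of 100 numbered units are assumptions made for the example, and the systematic version assumes the frame size is at least the sample size.

```python
import random


def simple_random_sample(frame, n, seed=None):
    """Draw n units without replacement; every unit has the same chance."""
    rng = random.Random(seed)
    return rng.sample(frame, n)


def systematic_sample(frame, n, seed=None):
    """Pick a random start in the first interval, then take every k-th unit."""
    rng = random.Random(seed)
    k = len(frame) // n        # sampling interval (assumes len(frame) >= n)
    start = rng.randrange(k)   # random position within the first interval
    return [frame[start + i * k] for i in range(n)]


# Hypothetical sampling frame: unit IDs 1..100.
frame = list(range(1, 101))
srs = simple_random_sample(frame, 10, seed=1)
sys_sample = systematic_sample(frame, 10, seed=1)
print("simple random:", sorted(srs))
print("systematic:   ", sys_sample)
```

Both functions need the complete sampling frame as input, which is the practical requirement that separates probability sampling from the non-probability methods.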
Sampling Procedure
Sampling involves two tasks:
(i) How to select the elements?
(ii) How to estimate the population characteristics from the sampling units?
We employ some randomization process for sample selection so that there is no
preferential treatment in selection, which could introduce selectivity bias.
Stratified Sampling
In stratified sampling the population is partitioned into groups, called strata,
and sampling is performed separately within each stratum.
This sampling technique is used when:
i) population groups may have different values for the responses of interest;
ii) we want to improve our estimation for each group separately;
iii) we want to ensure an adequate sample size for each group.
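A minimal sketch of stratified selection follows: the population is partitioned into strata and a simple random sample is drawn separately within each one. The function name, the urban/rural strata, and the per-stratum sample sizes are hypothetical and chosen only to make the example runnable.

```python
import random
from collections import defaultdict


def stratified_sample(units, stratum_of, n_per_stratum, seed=None):
    """Partition units into strata, then sample separately within each stratum."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for unit in units:
        strata[stratum_of(unit)].append(unit)
    # One independent simple random sample per stratum.
    return {name: rng.sample(members, n_per_stratum[name])
            for name, members in strata.items()}


# Hypothetical population: 60 urban and 40 rural units.
units = [("urban", i) for i in range(60)] + [("rural", i) for i in range(40)]
sample = stratified_sample(units, lambda u: u[0], {"urban": 6, "rural": 4}, seed=0)
for name, chosen in sample.items():
    print(name, len(chosen))
```

Because each stratum is sampled on its own, every group is guaranteed its planned sample size, which a single simple random sample of the whole population would not guarantee.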
Advantages
(i) Provides an opportunity to study stratum variations: estimates can be
made for each stratum
(ii) Disproportionate sample may be selected from each stratum
(iii) The precision is likely to increase as variance may be smaller than simple
random case with same sample size
(iv) Field works can be organized using the strata (e.g., by geographical areas or
regions)
(v) Reduce survey costs.
Disadvantages
(i) Sampling frame is needed for each stratum
(ii) Analysis method is complex
(iii) Correct variance estimation is more involved
(iv) Data analysis should take sampling “weight” into account for
disproportionate sampling of strata
(v) Sample size estimation is difficult in practice
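The point about sampling weights can be made concrete with a small sketch. It assumes the standard stratified estimator of the mean, in which each stratum's sample mean is weighted by that stratum's share of the population; the stratum sizes and sample values below are made up for illustration.

```python
def stratified_mean(strata):
    """Weighted mean: sum over strata of (N_h / N) * (sample mean of stratum h).

    strata is a list of (N_h, sample_values) pairs, where N_h is the
    population size of stratum h.
    """
    total = sum(size for size, _ in strata)
    return sum(size / total * (sum(ys) / len(ys)) for size, ys in strata)


# Hypothetical design: the small stratum (100 units) is deliberately oversampled.
strata = [
    (900, [10, 12, 14]),              # large stratum, 3 sampled, mean 12
    (100, [50, 54, 52, 48, 46, 50]),  # small stratum, 6 sampled, mean 50
]
weighted = stratified_mean(strata)

pooled = [y for _, ys in strata for y in ys]
unweighted = sum(pooled) / len(pooled)

print(round(weighted, 2))    # about 15.8
print(round(unweighted, 2))  # about 37.33
```

Ignoring the weights lets the oversampled small stratum dominate the estimate, which is why the analysis of a disproportionate sample has to account for them.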
Allocation of Stratified Sampling
The major task of stratified sampling design is the appropriate allocation of
samples to different strata.
Types of allocation methods:
(i) Equal allocation
(ii) Proportional to stratum size
(iii) Cost based sample allocation
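The first two allocation methods reduce to simple arithmetic, sketched below. The stratum names and sizes are hypothetical, and cost-based allocation is left out because it needs per-stratum cost figures that these notes do not give.

```python
def equal_allocation(n, stratum_sizes):
    """Give every stratum the same share of the total sample size n."""
    share = n // len(stratum_sizes)
    return {h: share for h in stratum_sizes}


def proportional_allocation(n, stratum_sizes):
    """Allocate n in proportion to each stratum's population size N_h."""
    total = sum(stratum_sizes.values())
    # Rounding can make the parts sum to slightly more or less than n.
    return {h: round(n * size / total) for h, size in stratum_sizes.items()}


# Hypothetical strata with population sizes N_h.
sizes = {"A": 5000, "B": 3000, "C": 2000}
equal = equal_allocation(90, sizes)
proportional = proportional_allocation(100, sizes)
print(equal)
print(proportional)
```

With these sizes, equal allocation of 90 gives 30 units per stratum, while proportional allocation of 100 gives 50, 30, and 20.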
c. Cluster Sampling
In many practical situations the population elements are grouped into a
number of clusters. A list of clusters can be constructed as the sampling frame
but a complete list of elements is often unavailable, or too expensive to
construct. In this case it is necessary to use cluster sampling where a random
sample of clusters is taken and some or all elements in the selected clusters
are observed. Cluster sampling is also preferable in terms of cost, because it is
much cheaper, easier and quicker to collect data from adjoining elements than
elements chosen at random. On the other hand, cluster sampling is less
informative and less efficient per element in the sample, due to similarities of
elements within the same cluster. The loss of efficiency, however, can often be
compensated by increasing the overall sample size. Thus, in terms of unit cost,
the cluster sampling plan is efficient.
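A minimal sketch of one-stage cluster sampling is given below: a random sample of clusters is drawn from the list of clusters, and all elements in the selected clusters are observed. The village and household names are hypothetical.

```python
import random


def cluster_sample(clusters, n_clusters, seed=None):
    """Select whole clusters at random and keep every element inside them."""
    rng = random.Random(seed)
    chosen = rng.sample(sorted(clusters), n_clusters)  # the frame lists clusters only
    return {name: list(clusters[name]) for name in chosen}


# Hypothetical frame: 8 villages (clusters) of 5 households each.
clusters = {
    f"village_{i}": [f"household_{i}_{j}" for j in range(5)]
    for i in range(8)
}
sample = cluster_sample(clusters, 3, seed=2)
for village, households in sample.items():
    print(village, households)
```

Only the list of clusters is needed up front; the elements are enumerated after the clusters have been chosen, which is what makes the method workable when no complete list of elements exists.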
e. Multi-Stage Samples
Here the respondents are chosen through a process of defined stages. For
example, residents within Kibera (Nairobi) may have been chosen for a survey
through the following process. From the whole country (Kenya), Nairobi may have
been selected at random (stage 1); within Nairobi, Langata (constituency) is
selected, again at random (stage 2); Kibera is then selected within Langata
(stage 3), then polling stations from Kibera (stage 4), and then individuals
from the electoral voters’ register (stage 5). As demonstrated, five stages were
gone through before the final respondents were selected from the electoral
voters’ register.
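The staged selection can be sketched as a walk down a nested structure, choosing at random at each level. Everything other than the names Nairobi, Langata, and Kibera is a hypothetical placeholder, and picking a single unit at each of the first four stages is a simplification that mirrors the example; real designs usually select several units per stage.

```python
import random


def multi_stage_sample(frame, n_final, seed=None):
    """Choose one unit at random per stage, then sample individuals at the last stage."""
    rng = random.Random(seed)
    path, node = [], frame
    while isinstance(node, dict):        # stages 1-4: region, constituency, area, station
        choice = rng.choice(sorted(node))
        path.append(choice)
        node = node[choice]
    # Final stage: individuals from the selected station's register.
    return path, rng.sample(node, min(n_final, len(node)))


def stations(area):
    """Build two hypothetical polling stations with five registered voters each."""
    return {f"{area} station {s}": [f"{area} voter {s}-{v}" for v in range(1, 6)]
            for s in (1, 2)}


frame = {
    "Nairobi": {
        "Langata": {"Kibera": stations("Kibera"), "Area X": stations("Area X")},
        "Constituency Y": {"Area Z": stations("Area Z")},
    },
    "Region W": {"Constituency V": {"Area U": stations("Area U")}},
}
path, respondents = multi_stage_sample(frame, 3, seed=4)
print(" > ".join(path))
print(respondents)
```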
(vi) Mathematical statistics and probability can be applied to analyze and
interpret the data
Non-probability Sampling
Social research is often conducted in situations where a researcher cannot
select the kinds of probability samples used in large-scale social surveys. For
example, say you wanted to study homelessness – there is no list of homeless
individuals nor are you likely to create such a list. However, you need to get
some kind of a sample of respondents in order to conduct your research. To
gather such a sample, you would likely use some form of non-probability
sampling.
There are four primary types of non-probability sampling methods:
a) Convenience Sampling
This is a method of choosing subjects who are available or easy to find. The
method is also sometimes referred to as haphazard, accidental, or availability
sampling. Its primary advantage is that it is very easy to carry out, relative
to other methods.
Demerit
One can never be certain what population the participants in the study
represent. The population is unknown. The method is haphazard, and the cases
studied probably do not represent any population you could come up with.
However, it is very useful for pilot studies.
Advantages of convenience sample
(i) It’s very easy to carry out with few rules governing how the sample should
be collected.
(ii) The relative cost and time required to carry out a convenience sample are
small in comparison to probability sampling techniques. This enables you to
achieve the sample size you want in a relatively fast and inexpensive way.
(iii) The convenience sample may help you gather useful data and information
that would not have been possible using probability sampling techniques, which
require more formal access to lists of populations.
probability sampling technique such as simple random sampling or systematic
random sampling.
However, the manager has managed to secure permission for you to spend two
days in the organisation to collect as many questionnaire responses as
possible. You decide to spend the two days at the entrance of the organisation
where all employees have to pass through to get to their desks. Whilst a
probability sampling technique would have been preferred, the convenience
sample was the only sampling technique that you could use to collect data.
Irrespective of the disadvantages of convenience sampling, discussed below,
without the use of this sampling technique, you may not have been able to get
access to any data on employee satisfaction in the organisation.
It may be that the organisation has multiple sites, with employee satisfaction
varying considerably between these sites. By conducting the survey at the
headquarters of the organisation, we may have missed the differences in
employee satisfaction amongst those at different sites, including non-office
workers. We also do not know why some employees agreed to take part in the
survey, whilst others did not. Was it because some employees were simply too
busy? Did they not trust the intentions of the survey? Did others take part out
of kindness or because they had a particular grievance with the
organisation? These types of biases are quite typical in convenience sampling.
Since the sampling frame is not known, and the sample is not chosen at
random, the inherent bias in convenience sampling means that the sample is
unlikely to be representative of the population being studied. This undermines
your ability to make generalisations from your sample to the population you are
studying.
b) Quota Sampling
Quota sampling is designed to overcome the most obvious flaw of availability
sampling. Rather than taking just anyone, you set quotas to ensure that the
sample you get represents certain characteristics in proportion to their
prevalence in the population. Note that for this method, you have to know
something about the characteristics of the population ahead of time. Say you
want to make sure you have a sample proportional to the population in terms
of gender - you have to know what percentage of the population is male and
female, then collect a sample until yours matches. Marketing studies are
particularly fond of this form of research design.
The primary problem with this form of sampling is that even when we know
that a quota sample is representative of the particular characteristics for which
quotas have been set, we have no way of knowing if the sample is representative in
terms of any other characteristics. If we set quotas for gender and age, we are
likely to attain a sample with good representativeness on age and gender, but
one that may not be very representative in terms of income and education or
other factors.
Moreover, because researchers can set quotas for only a small fraction of the
characteristics relevant to a study, quota sampling is really not much better
than availability sampling. To reiterate, you must know the characteristics of
the entire population to set quotas; otherwise there's not much point to setting
up quotas. Finally, interviewers often introduce bias when allowed to self-select
respondents, which is usually the case in this form of research. In choosing
males 18-25, interviewers are more likely to choose those that are better-
dressed, seem more approachable or less threatening. That may be
understandable from a practical point of view, but it introduces bias into
research findings.
Imagine that a researcher wants to understand more about the career goals of
students at a single university. Let’s say that the university has roughly
10,000 students. Suppose we were interested in comparing the differences in
career goals between male and female students at the single university. If this
was the case, we would want to ensure that the sample we selected had a
proportional number of male and female students relative to the population.
To create a quota sample, there are three steps:
(i) Choose the relevant grouping characteristic (here, gender) and divide the
population accordingly.
(ii) Calculate a quota (the number of units that should be included) for each
group.
(iii) Continue to invite units until the quota for each group is met.
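The three steps above can be sketched as follows, using the figures from the university example in these notes (a sample of 100 with 60% male students). The function name and the stream of arriving students are assumptions made for illustration.

```python
def quota_sample(arrivals, group_of, quotas):
    """Accept units in order of arrival until every group's quota is filled."""
    counts = {group: 0 for group in quotas}
    sample = []
    for unit in arrivals:
        group = group_of(unit)
        if counts.get(group, 0) < quotas.get(group, 0):
            sample.append(unit)
            counts[group] += 1
        if counts == quotas:   # step (iii): stop once all quotas are met
            break
    return sample


# Steps (i) and (ii): group by gender and turn population shares into quotas.
sample_size = 100
population_share_pct = {"male": 60, "female": 40}
quotas = {g: sample_size * pct // 100 for g, pct in population_share_pct.items()}

# Hypothetical stream of students in the order they happen to be approached.
arrivals = [("female" if i % 3 == 0 else "male", i) for i in range(500)]
sample = quota_sample(arrivals, lambda unit: unit[0], quotas)
print(quotas, len(sample))
```

Units are taken in the order they happen to arrive, not at random, so the sample matches the population on gender only; nothing in the procedure controls any other characteristic.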
does not require a sampling frame and the strict use of random sampling
techniques.
iii) The quota sample improves the representation of particular strata (groups)
within the population, as well as ensuring that these strata are not over-
represented. For example, it would ensure that we have sufficient male
students taking part in the research (60% of our sample size of 100; hence, 60
male students). It would also make sure we did not have more than 60 male
students, which would result in an over-representation of male students in our
research.
iv) It allows comparison of groups.
c) Purposive Sampling
Purposive sampling is a sampling method in which elements are chosen based
on the purpose of the study.
Purposive sampling may involve studying the entire population of some limited
group or a subset of a population. As with other non-probability sampling
methods, purposive sampling does not produce a sample that is representative
of a larger population, but it can be exactly what is needed in some cases, such
as the study of an organization, a community, or some other clearly defined and
relatively limited group.
the sample that is being studied, whether such generalisations are theoretical,
analytic and/or logical in nature. However, since each of these types of
purposive sampling differs in terms of the nature and ability to make
generalisations, you should read the articles on each of these purposive
sampling techniques to understand their relative advantages.
iii) Qualitative research designs can involve multiple phases, with each phase
building on the previous one. In such instances, different types of sampling
technique may be required at each phase.
Purposive sampling is useful in these instances because it provides a wide
range of non-probability sampling techniques for the researcher to draw on.
For example, critical case sampling may be used to investigate whether a
phenomenon is worth investigating further, before adopting an expert sampling
approach to examine specific issues further.
d) Snowball Sampling
Snowball sampling is a method in which a researcher identifies one member of
some population of interest, speaks to him/her, and then asks that person to
identify others in the population that the researcher might speak to. This
person is then asked to refer the researcher to yet another person, and so on.
Snowball sampling is very good for cases where members of a special
population are difficult to locate.
Examples include populations that are subject to social stigma and
marginalisation, such as sufferers of HIV/AIDS, as well as individuals engaged
in illicit or illegal activities, including prostitution and drug use. Snowball
sampling is useful in such scenarios because:
The method creates a sample with questionable representativeness. A
researcher is not sure who is in the sample. In effect snowball sampling often
leads the researcher into a realm he/she knows little about. It can be difficult
to determine how a sample compares to a larger population. There is also the
issue of whom respondents refer you to: friends refer friends, and are less
likely to refer people they do not like or fear.
Snowball sampling is a useful choice of sampling strategy when the population
you are interested in studying is hidden or hard-to-reach.
(iii) Sampling bias is also a concern for researchers using this sampling
technique. Initial subjects tend to nominate people that they know well. Because
of this, it is highly possible that the subjects share the same traits and
characteristics, so the sample that the researcher obtains may be only a small
subgroup of the entire population.
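The referral process behind snowball sampling can be sketched as a walk over a network of contacts, starting from one known member. The referral network below is entirely hypothetical, and processing referrals in breadth-first order is an assumption made for the example.

```python
from collections import deque


def snowball_sample(referrals, seeds, max_size):
    """Follow referrals outward from the seed respondents, breadth first."""
    sample = []
    seen = set(seeds)
    queue = deque(seeds)
    while queue and len(sample) < max_size:
        person = queue.popleft()
        sample.append(person)
        for contact in referrals.get(person, []):
            if contact not in seen:   # each person is recruited only once
                seen.add(contact)
                queue.append(contact)
    return sample


# Hypothetical referral network: who each respondent is willing to name.
referrals = {
    "A": ["B", "C"],
    "B": ["D"],
    "C": ["D", "E"],
    "E": ["F"],
    "G": ["H"],   # G and H belong to the population but nobody refers to them
}
sample = snowball_sample(referrals, ["A"], max_size=10)
print(sample)
```

Starting from A, the walk reaches B through F but never G or H, because nobody in A's circle names them. The sample can therefore end up as one connected subgroup of the population.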
1.4.6 Characteristics of Good sampling
A good sample should:
a) Meet the requirements of the study objectives
b) Provide reliable results
c) Be clearly understandable
d) Be manageable and realistic, so that it can be implemented
e) Be reasonable and timely with respect to time
f) Be economical with respect to cost
g) Allow accurate, representative interpretation
h) Be acceptable
NB: i) These sample sizes reflect the number of obtained responses, and not
necessarily the number of surveys mailed or interviews planned (this number is
often increased to compensate for non-response).
ii) The sample sizes in Table 2.2 presume that the attributes being measured
are distributed normally or nearly so. If this assumption cannot be met, then
the entire population may need to be surveyed.
RAW DATA
When data are collected, the information obtained from each member of a
population or sample is recorded in the sequence in which it becomes
available. This sequence of data recording is random and unranked. Such data,
before they are grouped or ranked, are called raw data.
Raw Data: Data recorded in the sequence in which they are collected and
before they are processed or ranked are called raw data.
Table 1
21 19 24 25 29 34 26 27 37 33
18 20 19 22 19 19 25 22 25 23
25 19 31 19 23 18 23 19 23 26
22 28 21 20 22 22 21 20 19 21
25 23 18 37 27 23 21 25 21 24
Suppose we ask the same 50 students about their student status. The
responses of the students are recorded in Table 2. In this table, F, SO, J, and
SE are the abbreviations for freshman, sophomore, junior, and senior,
respectively. This is an example of qualitative (or categorical) raw data.
Table 2
J F SO SE J J SE J J J
F F J F F F SE SO SE J
J F SE SO SO F J F SE SE
SO SE J SO SO J J SO F SO
SE SE F SE J SO F J SO SO
The data presented in Tables 1 and 2 are also called ungrouped data. An
ungrouped data set contains information on each member of a sample or
population individually.
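As a first step from ungrouped data towards a summary, the values in Tables 1 and 2 can be tallied. The sketch below copies the two tables as they appear above and counts how often each value occurs; the variable names are illustrative, and Python's collections.Counter is just one convenient way to do the tally.

```python
from collections import Counter

# Table 1: quantitative raw data, copied in the order recorded.
table_1 = """
21 19 24 25 29 34 26 27 37 33
18 20 19 22 19 19 25 22 25 23
25 19 31 19 23 18 23 19 23 26
22 28 21 20 22 22 21 20 19 21
25 23 18 37 27 23 21 25 21 24
"""

# Table 2: qualitative raw data (F, SO, J, SE = freshman, sophomore, junior, senior).
table_2 = """
J F SO SE J J SE J J J
F F J F F F SE SO SE J
J F SE SO SO F J F SE SE
SO SE J SO SO J J SO F SO
SE SE F SE J SO F J SO SO
"""

values = [int(x) for x in table_1.split()]
statuses = table_2.split()

value_counts = Counter(values)      # frequency of each numeric value
status_counts = Counter(statuses)   # frequency of each category

for status in ("F", "SO", "J", "SE"):
    count = status_counts[status]
    print(status, count, f"{count / len(statuses):.0%}")

print("smallest:", min(values), "largest:", max(values))
print("mean:", sum(values) / len(values))
```

The tallies are the raw material for the frequency tables and graphs used to organize data.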
Organizing and Graphing Quantitative Data