
Introduction to sample size calculation

Article in Education in Medicine Journal · June 2013


DOI: 10.5959/eimj.v5i2.130


1 author:

Wan Nor Arifin


Universiti Sains Malaysia

All content following this page was uploaded by Wan Nor Arifin on 12 October 2015.



EDUCATIONAL RESOURCE
Volume 5 Issue 2 2013
DOI: 10.5959/eimj.v5i2.130

Introduction to sample size calculation

Wan Nor Arifin

Unit of Biostatistics and Research Methodology, School of Medical Sciences, Universiti Sains Malaysia.

ARTICLE INFO

Received: 06/11/2012
Accepted: 26/12/2012
Published: 01/06/2012

KEYWORD
Sample size
Steps
Single mean
Single proportion

ABSTRACT

One of the most common reasons why researchers seek help from a statistician is sample size calculation. However, despite the common belief that it only involves formulas and calculation, researchers often ignore other aspects of research design that lead to proper sample size calculation. In this article, the author outlines basic steps toward sample size calculation. The author also introduces the logic behind sample size calculation for single mean and single proportion in simplified and less intimidating forms for those not statistically inclined.

© Medical Education Department, School of Medical Sciences, Universiti Sains Malaysia. All rights reserved.

CORRESPONDING AUTHOR: Dr. Wan Nor Arifin, Unit of Biostatistics and Research Methodology, School
of Medical Sciences, Universiti Sains Malaysia, 16150 Kubang Kerian, Kelantan, Malaysia.
E-mail: wnarifin@[Link]

Education in Medicine Journal (ISSN 2180-1932) © [Link] | e89


Introduction

In my experience of statistical consultation, one of the most common reasons why researchers seek help from a statistician is sample size calculation. However, one of the least common reasons is to discuss the general planning and design of a research project, including what statistical analysis to use. Of note, a discussion with a statistician at the initial stage of a study would clarify many issues with regard to the general conduct of the research, and even more so the issues related to sample size calculation.

Sample size is very much related to other parts of a research project (1) and it is not a stand-alone entity. As such, to handle the problem of sample size calculation, the other parts of a study should be taken into account. Under the misconception that the process only involves formulas and calculation, researchers often ignore other aspects of a research project that lead to proper sample size calculation. Most often, researchers come for consultation sessions with standard deviations or percentages from related journal articles and expect to calculate sample sizes for their studies with only that information. It is important to note that sample size calculation requires a number of preliminary steps which are related to general aspects of research planning rather than plain formulas and calculation.

In this article, I suggest basic steps to obtain a sample size. I do not include study design among them, as the steps outlined are meant to be applied in a general sense. In this introductory article, I also go through sample size calculations for single mean and single proportion. I would like to show that sample size calculation is not unreasonable, nor meant to complicate the planning of a research project and the writing of a research proposal, but is rather a logical and important part of quantitative research.

Basic steps toward sample size calculation

Step one: Objective

A clear (2) and achievable objective must be specified; most importantly, it has to be quantitatively achievable. In the context of quantitative research, the outcome (dependent variable) and predictor (independent variable) stated in an objective must be something that can be measured or counted, not abstract or subjective concepts.

For example, the objective “To determine the mean systolic blood pressure among staff in XYZ University” is clear and the outcome, “systolic blood pressure”, is measurable (i.e. measured using a sphygmomanometer in mmHg). Likewise, the objective “To determine the prevalence of HIV positive among drug addicts in XYZ district” is clear and the outcome is countable (i.e. the number of HIV positive drug addicts and the number of drug addicts, which constitute the numerator and denominator of the prevalence, are countable).

On the other hand, the objective “To look into the perception of medical personnel on the importance of sample size calculation”, although it looks appealing, is not clear as to how “perception”, a subjective concept, is measured. Restated as “To determine the mean score of perception of medical personnel on the importance of sample size calculation using XYZ inventory”, the objective becomes clear and quantifiable, as “perception” is measured by the “XYZ inventory”.

As for the objective “To determine associated factors of smoking among school teenagers”, it is a well stated objective because the outcome is countable: we categorize the school teenagers as smokers or non-smokers (outcome) and count the number in each category, given that the factors (predictors) are also quantifiable.

When an objective is stated in general form, for example “To determine the association between systolic blood pressure and demographic factors”, split the general objective into smaller specific objectives, such as “To determine the association between systolic blood pressure and age”, “To determine the association between systolic blood pressure and house income” and so on. Likewise, the objective “To determine the associated factors of smoking among school teenagers” has to be restated in the form of specific objectives, such as “To determine the association between socioeconomic status and smoking status of teenagers”, “To determine the association between gender and smoking status of teenagers”, and so on. This process of restating the general objective as smaller specific objectives makes the objectives clear so as to facilitate sample size calculation (although it is not necessarily required in a proposal).



Step two: Hypothesis testing or estimation

After clarifying the objective, be clear about whether the objective requires the use of a statistical test or only descriptive statistics. In other words, we are either testing a hypothesis (using a statistical test) or estimating (using a confidence interval), which are the two approaches of inferential statistics (3).

This dichotomy is reflected in the objective. For estimation, the objective would be stated in the form of determining or measuring the outcome of interest. As an example, for the objective “To determine the mean systolic blood pressure among staff in XYZ university”, the outcome of interest is “systolic blood pressure”, and it aims to estimate the mean systolic blood pressure in the population. Notice the absence of a predictor in the objective. It is also helpful to preview the way the result would be presented. For example, if a researcher wishes to know the prevalence of a particular disease in a population, it would be presented as a percentage followed by its confidence interval. Thus, it falls under the category of estimation.

For hypothesis testing, the objective would typically consist of an outcome and a predictor. The choice of a suitable statistical test for the hypothesis depends on the relationship of the outcome to its predictor, so it should be decided accordingly. For example, the objective is “To compare the means of systolic blood pressure between staff in University A and University B”. The outcome is “systolic blood pressure” and the independent variable is the staff category (University A or B). Comparison between the staff of the universities is done by comparing the means of systolic blood pressure. It is hypothesized that there is no difference between the populations, and testing this hypothesis requires a suitable statistical test, which is the independent t-test. Another indicator that the objective falls into the category of hypothesis testing is that the presentation of the result would include the p-value of the respective statistical analysis. It should be stressed that for an objective involving hypothesis testing, it is important to decide on the statistical test to use, as the sample size calculation depends on the test used.

Step three: Sample size formula

After deciding whether the objective falls into the estimation category or the hypothesis testing category, we can decide on the appropriate formula to use to calculate the sample size.

For example, continuing from the previous example, a researcher wishes to know the prevalence of a particular disease in a population. He visualizes his result in the form of ##.#% (95% confidence interval: ##.##%, ##.##%). In other words, he wants to estimate the prevalence of the disease of interest in a population based on the sample he collected. There are a number of pieces of information that we can extract from the preceding sentences: 1. Prevalence is essentially a proportion. 2. Estimate with 95% confidence. 3. A single sample proportion is involved. Even if you are not familiar with sample size calculation, by going through commonly used sample size formulas you would be able to guess that the most appropriate formula for this objective is the single proportion formula. Following the preceding two steps, it is easy to decide which sample size formula is appropriate for the objective.

Next, a researcher wishes to compare the means of systolic blood pressure between population A and population B. He decides to use the independent t-test to compare samples from these two populations. Essentially, he wishes to test his hypothesis that the populations are different (or not different) in terms of mean systolic blood pressure. From these sentences, we can extract: 1. Two means are to be compared. 2. Two sample means are involved. 3. Hypothesis testing is involved. Again, it is clear that the two means formula is appropriate to calculate the sample size for the objective.

After deciding on the appropriate sample size formula, in both cases it is then only a matter of deciding on and finding applicable values to put in the formula.

In the next part, I introduce the reader to the basis of the sample size formulas for single mean and single proportion. It should be a good introduction toward understanding how a sample size is obtained and why it is an important part of research planning.

Sample size calculation for estimation

Single mean

For single mean, the objective of a study is to estimate the mean of an outcome of interest in a population from data obtained from a sample, of



which the outcome is measured on a numerical continuous scale.

For example, a researcher is interested to know the mean weight of young children aged 10 to 12 years old in Malaysia. He wishes to estimate the mean weight of the population by taking a sample representative of the population with 95% confidence. He previews that the result would be presented in the form of xx.x (95% confidence interval: xx.xx, xx.xx). He wants to know the sample size that he should take to achieve his study objective.

Let us say that, based on a hypothetical literature search, in one Asian country it was estimated that the mean weight among children aged 10 to 12 years old was 20.0 kg (95% CI: 19.75, 20.25), based on a sample of 250 children. The standard deviation of the weight was 2.00 kg. Looking at the result, it was precise to plus and minus 0.25 kg.

Let us trace the result back to the basic formula used to obtain the 95% confidence interval and the precision. Basically, a confidence interval consists of a lower confidence limit and an upper confidence limit; in our example the limits were 19.75 and 20.25 respectively. The lower confidence limit was obtained by subtracting the precision from the mean: 20.0 kg minus 0.25 kg equals 19.75 kg. The upper confidence limit was obtained by adding the precision to the mean: 20.0 kg plus 0.25 kg equals 20.25 kg. As such, the confidence interval formula for a mean is given by,

$$\text{mean} \pm \text{precision}$$

as applied to our example,

$$20.0 \pm 0.25$$

$$19.75,\ 20.25$$

So, where are the components of 95% confidence and the standard deviation? Precision can be further deconstructed into (3),

$$\text{precision} = \text{reliability coefficient} \times \text{standard error}$$

The value of the reliability coefficient depends on our preset level of confidence; for example, the reliability coefficient for 95% confidence is 1.96. The reliability coefficients for other confidence levels are 1.645 for 90% confidence and 2.58 for 99% confidence (see note A).

As for the standard error, you can view it as an adjustment to the standard deviation when we are dealing with a sample instead of a population. To be exact, it is the standard deviation of the sampling distribution. The interested reader can read further under the topic of sampling distributions in statistics textbooks, for example (3). The standard error consists of,

$$\frac{\text{standard deviation}}{\sqrt{\text{sample size}}}$$

with the sample standard deviation and the sample size used in the formula.

So, from our hypothetical literature search, the mean weight was 20.0 kg, the standard deviation was 2.00 kg and the sample size was 250. Putting everything together with 95% confidence,

$$\text{mean} \pm \text{reliability coefficient} \times \frac{\text{standard deviation}}{\sqrt{\text{sample size}}}$$

$$20.0 \pm 1.96 \times \frac{2.00}{\sqrt{250}}$$

$$20.0 \pm 0.25$$

$$19.75,\ 20.25$$

or presented in the form of 20.0 kg (95% CI: 19.75, 20.25), which we already encountered in our hypothetical literature search.
_____________________________________________________________________
A. The reliability coefficient is the corresponding z-value of the standard normal distribution at a particular probability value. For example, a 95% confidence level corresponds to covering 95% of the area under the standard normal distribution, leaving only 5% of the area for our lack of confidence (usually denoted as α, typically known as the type I error or significance level). As we want to divide our uncertainty into two parts (lower and upper limits), we allocate 2.5% of the area to the lower region of the distribution (left most) and 2.5% of the area to the upper region of the distribution (right most), so that our confidence area lies in the middle. If you look up any standard normal distribution table, you will find that a z-value of -1.96 corresponds to 0.025 cumulative probability (area), and a z-value of 1.96 corresponds to 0.975 cumulative probability. As the z-values differ only in direction, it is easier for us to just find the z-value for the upper limit, usually written as $z_{(1-\alpha/2)}$, also because the plus and minus sign in our confidence interval formula already accommodates both directions.
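As a rough illustration of note A and of the interval reconstructed above, the reliability coefficient can also be computed instead of being looked up in a table. The sketch below uses only Python's standard library; the function names `reliability_coefficient` and `mean_confidence_interval` are illustrative choices and are not part of the original article.

```python
from math import sqrt
from statistics import NormalDist


def reliability_coefficient(confidence):
    """Return the upper z-value, z(1 - alpha/2), for a two-sided confidence level."""
    alpha = 1.0 - confidence
    # inv_cdf gives the z-value at a given cumulative probability
    return NormalDist().inv_cdf(1.0 - alpha / 2.0)


def mean_confidence_interval(mean, sd, n, confidence=0.95):
    """Return (lower, upper) limits from mean ± z * sd / sqrt(n)."""
    precision = reliability_coefficient(confidence) * sd / sqrt(n)
    return mean - precision, mean + precision


# Hypothetical literature values from the text: mean 20.0 kg, SD 2.00 kg, n = 250
lower, upper = mean_confidence_interval(20.0, 2.00, 250)
print(f"z for 95% confidence: {reliability_coefficient(0.95):.2f}")
print(f"95% CI: {lower:.2f}, {upper:.2f}")
```

With the values from the hypothetical literature search, the limits round to 19.75 and 20.25, matching the interval quoted in the text.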



After all, what does this calculation of the confidence interval mean to us? In the calculation, we did not pre-specify the precision (0.25); instead, we just put in the values of the respective components of precision obtained from the literature. So, what if we pre-specify the precision that we deem acceptable and would like to know the sample size required to achieve that level of precision? By algebraic manipulation we obtain,

$$\text{precision} = \text{reliability coefficient} \times \frac{\text{standard deviation}}{\sqrt{\text{sample size}}}$$

$$\sqrt{\text{sample size}} = \frac{\text{reliability coefficient} \times \text{standard deviation}}{\text{precision}}$$

$$\text{sample size} = \frac{\text{reliability coefficient}^2 \times \text{standard deviation}^2}{\text{precision}^2}$$

To recapitulate the objective put forward by the researcher in our example, he wishes to estimate the mean weight of young children aged 10 to 12 years old from a representative sample with a 95% confidence level. Additionally, he wishes to estimate with a precision of 0.5 kg, which he deems acceptable. From the literature, it was found that the standard deviation of weight in that age group was 2.00 kg. The sample size to achieve his objective is,

$$\text{sample size} = \frac{1.96^2 \times 2.00^2}{0.5^2} = 61.47 \approx 62\ \text{children}$$

We always round up the sample size when dealing with human beings, as we cannot sample only part of a person just to match our calculated sample size exactly. We may also add additional subjects to the calculated sample size to accommodate possible dropouts,

$$\frac{\text{sample size} + \text{dropouts}}{\text{calculated sample size}} = \frac{100\%\ \text{subjects}}{100\%\ \text{subjects} - \%\ \text{dropouts}}$$

$$\frac{\text{sample size} + \text{dropouts}}{\text{calculated sample size}} = \frac{1}{1 - \text{proportion of dropouts}}$$

$$\text{sample size} + \text{dropouts} = \frac{\text{calculated sample size}}{1 - \text{proportion of dropouts}}$$

with, let us say, a 10% dropout rate,

$$\text{sample size} + \text{dropouts} = \frac{62}{1 - 0.1} \approx 69\ \text{children}$$

thus the required sample size for his study is 69 children after accommodating a 10% dropout rate.

Single proportion

For single proportion, the objective of a study is to estimate the proportion or percentage of an outcome of interest in a population from data obtained from a sample, in which the outcome consists of two categories (dichotomous).

For example, a researcher is interested to know the percentage (or prevalence) of obesity among young children aged 10 to 12 years old in Malaysia. He wishes to estimate the percentage of obesity in the population by taking a sample representative of the population with 95% confidence. He previews that the result would be presented in the form of a percentage: xx.x% (95% confidence interval: xx.xx%, xx.xx%). He wants to know the sample size that he should take to achieve his study objective.

Again, let us say that, based on a hypothetical literature search, in one nearby Asian country it was found that the percentage of obesity among children aged 10 to 12 years old was 30.0% (95% CI: 25.00%, 35.00%), based on a sample of 320 children. Note that the result was precise to 5% (or 0.05 in the form of a proportion).

Before going into the sample size calculation, we need to go through the details of the confidence interval calculation for the proportion given by the literature. The confidence interval formula for a proportion is given by,

$$\text{proportion of a factor} \pm \text{precision}$$

or simply,

$$\text{proportion} \pm \text{precision}$$

in our example,

$$0.3 \pm 0.05$$

$$0.25,\ 0.35 \quad \text{or} \quad 25.00\%,\ 35.00\%$$

As for the precision, it is given by,

$$\text{precision} = \text{reliability coefficient} \times \text{standard error}$$

which is similar to the precision for a mean in terms of the basic formula. For the reliability coefficient,



we still use the same z-value that corresponds to our confidence level. However, the standard error part needs minor changes to the formula. In general, the standard error is given by,

$$\frac{\text{standard deviation}}{\sqrt{\text{sample size}}}$$

which looks similar to that of single mean. But notice that for a proportion, we are not presented with a standard deviation in the literature. This does not mean that there is no standard deviation for a proportion, only that it is not commonly presented in articles. The standard deviation of a proportion can be easily obtained by,

$$\sqrt{\text{proportion with outcome} \times \text{proportion without outcome}}$$

in other words,

$$\sqrt{\text{proportion with outcome} \times (1 - \text{proportion with outcome})}$$

By putting our standard deviation for a proportion into our standard error formula, it becomes,

$$\frac{\text{standard deviation}}{\sqrt{\text{sample size}}} = \frac{\sqrt{\text{proportion with outcome} \times \text{proportion without outcome}}}{\sqrt{\text{sample size}}}$$

$$= \frac{\sqrt{\text{proportion with outcome} \times (1 - \text{proportion with outcome})}}{\sqrt{\text{sample size}}}$$

or simply,

$$= \sqrt{\frac{\text{proportion} \times (1 - \text{proportion})}{\text{sample size}}}$$

Thus, reconstructing the confidence interval given in the literature,

$$\text{proportion} \pm \text{reliability coefficient} \times \sqrt{\frac{\text{proportion} \times (1 - \text{proportion})}{\text{sample size}}}$$

$$0.3 \pm 1.96 \times \sqrt{\frac{0.3 \times 0.7}{320}}$$

$$0.3 \pm 0.05$$

$$0.25,\ 0.35$$

Having understood the process of calculating the confidence interval, similar to what we did for the precision formula of single mean, by algebraic manipulation we derive the sample size for single proportion,

$$\text{precision} = \text{reliability coefficient} \times \sqrt{\frac{\text{proportion} \times (1 - \text{proportion})}{\text{sample size}}}$$

$$\text{precision} = \frac{\text{reliability coefficient} \times \sqrt{\text{proportion} \times (1 - \text{proportion})}}{\sqrt{\text{sample size}}}$$

$$\sqrt{\text{sample size}} = \frac{\text{reliability coefficient} \times \sqrt{\text{proportion} \times (1 - \text{proportion})}}{\text{precision}}$$

$$\text{sample size} = \frac{\text{reliability coefficient}^2 \times \text{proportion} \times (1 - \text{proportion})}{\text{precision}^2}$$

Recall the objective of our researcher, in which he wishes to estimate the percentage of obesity among young children aged 10 to 12 years old in Malaysia from a representative sample with a 95% confidence level and a precision of 1%. From the literature, it was found that the prevalence was 30.0%. The sample size to achieve his objective is,

$$\text{sample size} = \frac{1.96^2 \times 0.3 \times 0.7}{0.01^2} = 8067.36 \approx 8068\ \text{children}$$

As you can see, with a very narrow precision (1%), the sample size is inflated to 8068 children, as compared to the study from the literature with a sample size of only 320 children. The researcher may need to relax the precision to, for example, 2% or 3% if he feels that it is impossible for him to collect a sample that large, possibly due to budget constraints or other considerations pertaining to the conduct of the research. After deciding on an optimal sample size, he can inflate the sample size further to adjust for the expected dropout rate.

Conclusion

In this short article, we have gone through the basic steps of sample size calculation. We have also gone through the basis of the sample size formulas used to estimate true values (parameters) of a population, specifically the single mean and single proportion formulas. I intentionally showed the derivation of the sample size formulas so that you can appreciate why sample size is so important in the planning of a research project, and that it is not meant to complicate the process of conducting research. I did not cover sample size calculation for two means and two proportions in this article, as it is more suitable to be discussed in another



article which would follow this introductory note on sample size. Throughout this article, formulas are written in full sentences or words instead of Greek letters or symbols, to foster understanding of the formulas and to avoid the unnecessary fear of statistical notation commonly encountered while reading statistics textbooks. The formulas in their commonly used forms are included in the Appendix for those who are statistically inclined.

Conflict of interest

None to be declared.

Reference

1. Lachin JM. Introduction to sample size determination and power analysis for clinical trials. Controlled Clinical Trials. 1981;2(2):93-113.
2. Lwanga S, Lemeshow S. Sample size determination in health studies: a practical manual. England: World Health Organization; 1991.
3. Daniel WW. Biostatistics: A foundation for analysis in the health sciences. 6th ed. USA: John Wiley & Sons, Inc; 1995.

Further reading

1. Machin D, Campbell MJ, Beng TS, Tan SH. Sample size tables for clinical studies. Singapore: Wiley-Blackwell; 2009.
2. Naing N. A practical guide on the determination of sample size in health sciences research. Kota Bharu, Malaysia: Pustaka Aman Press; 2010.



Appendix

Confidence interval for single mean:

$$\bar{x} \pm d$$

$$\bar{x} \pm z_{(1-\alpha/2)} \times \sigma_{\bar{x}}$$

$$\bar{x} \pm z_{(1-\alpha/2)} \times \frac{\sigma}{\sqrt{n}}$$

Precision for single mean:

$$d = z_{(1-\alpha/2)} \times \frac{\sigma}{\sqrt{n}}$$

Single mean formula:

z21 α / 2 σ 2
n=
d2
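The single mean formula, together with the dropout adjustment described in the main text, can be sketched in a few lines of Python. This is a minimal sketch rather than a definitive implementation: the function names `sample_size_single_mean` and `inflate_for_dropout` are hypothetical, and z is fixed at the article's rounded value of 1.96 for 95% confidence.

```python
from math import ceil


def sample_size_single_mean(sd, precision, z=1.96):
    """n = z^2 * sd^2 / d^2, rounded up to a whole subject."""
    return ceil((z ** 2) * (sd ** 2) / (precision ** 2))


def inflate_for_dropout(n, dropout_proportion):
    """Divide the calculated n by (1 - dropout proportion) and round up."""
    return ceil(n / (1.0 - dropout_proportion))


# Worked example from the text: SD 2.00 kg, precision 0.5 kg, 10% dropout
n = sample_size_single_mean(sd=2.00, precision=0.5)
print(n)                             # 62
print(inflate_for_dropout(n, 0.10))  # 69
```

Rounding up with `ceil` mirrors the article's point that part of a person cannot be sampled.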

Confidence interval for single proportion:

$$p \pm d$$

$$p \pm z_{(1-\alpha/2)} \times \sigma_{\hat{p}}$$

$$p \pm z_{(1-\alpha/2)} \times \sqrt{\frac{p(1-p)}{n}}$$
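To check the interval quoted from the hypothetical literature search (30.0% obesity in a sample of 320 children), the confidence interval for a single proportion can be reconstructed as below. The function name `proportion_confidence_interval` is an illustrative assumption, and z is again taken as 1.96.

```python
from math import sqrt


def proportion_confidence_interval(p, n, z=1.96):
    """Return (lower, upper) limits from p ± z * sqrt(p * (1 - p) / n)."""
    standard_error = sqrt(p * (1.0 - p) / n)
    precision = z * standard_error
    return p - precision, p + precision


# Hypothetical literature values from the text: p = 0.30, n = 320
lower, upper = proportion_confidence_interval(0.30, 320)
print(f"{lower:.2f}, {upper:.2f}")  # 0.25, 0.35
```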

Precision for single proportion:

p1  p 
d = z1α / 2 
n

Single proportion formula:

z21α / 2  p1  p 
n=
d2
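The effect of the chosen precision on the single proportion sample size, as discussed in the main text for the obesity example, can be explored with a short sketch. The function name `sample_size_single_proportion` is hypothetical, and the precision of 5% is an added comparison point that is not calculated in the article.

```python
from math import ceil


def sample_size_single_proportion(p, precision, z=1.96):
    """n = z^2 * p * (1 - p) / d^2, rounded up to a whole subject."""
    return ceil((z ** 2) * p * (1.0 - p) / (precision ** 2))


# Expected prevalence of 30% from the text, at several candidate precisions
for d in (0.01, 0.02, 0.03, 0.05):
    n = sample_size_single_proportion(0.30, d)
    print(f"precision {d:.2f}: n = {n}")
```

At a precision of 1% this gives the 8068 children calculated in the text. Because the precision is squared in the denominator, doubling it to 2% cuts the required sample to roughly a quarter.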

Symbols:

$\bar{x}$ – mean
$d$ – precision
$z_{(1-\alpha/2)}$ – reliability coefficient
$\sigma$ – standard deviation
$p$ – proportion
$\sigma_{\bar{x}}$ – standard error (single mean)
$\sigma_{\hat{p}}$ – standard error (single proportion)
$n$ – sample size
