Sampling errors and Non-sampling errors
Dr. M. Musa Khan
Associate Professor
Dept. of Statistics
PrSTU
Sampling error
Sampling error occurs solely as a result of using a sample from a population, rather than conducting
a census (complete enumeration) of the population. It refers to the difference between an estimate
for a population based on data from a sample and the 'true' value for that population which would
result if a census were taken. Sampling errors do not occur in a census, as the census values are
based on the entire population.
Sources of Sampling Error
Sampling error arises because we study a sample instead of the entire population. The main sources
are:
Natural variability of the population: When population units differ widely from each other,
different samples are likely to give different results.
Sample size: A small sample tends to produce larger sampling error, while a larger sample reduces
sampling error.
Method of sampling: Inefficient or inappropriate sampling methods (e.g., poor stratification,
cluster sampling with high intra-cluster correlation) increase sampling error.
Random selection of units: Even with proper random sampling, chance variation in the selected
units leads to sampling error.
Unequal probability of selection: If some units have higher chances of being selected than others,
sampling variability may increase.
Sample design and allocation: Improper allocation of sample sizes among strata or clusters
contributes to higher sampling error.
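The effect of sample size on sampling error can be illustrated with a small simulation. The sketch below is only an illustration under assumed values: the population (10,000 values with mean about 50 and s.d. about 10), the seeds, and the function name spread_of_sample_means are hypothetical choices, not part of these notes.

```python
import random
import statistics

def spread_of_sample_means(population, n, repeats=500, seed=1):
    """Empirical s.d. of the sample mean over repeated simple random samples."""
    rng = random.Random(seed)
    means = [statistics.mean(rng.sample(population, n)) for _ in range(repeats)]
    return statistics.stdev(means)

# Hypothetical population: 10,000 values with mean about 50 and s.d. about 10.
_rng = random.Random(0)
population = [_rng.gauss(50, 10) for _ in range(10_000)]

small = spread_of_sample_means(population, n=25)
large = spread_of_sample_means(population, n=400)
print(f"n=25:  spread of sample means = {small:.2f}")
print(f"n=400: spread of sample means = {large:.2f}")
```

The spread for n = 400 should come out at roughly a quarter of the spread for n = 25, since the standard error falls in proportion to 1/√n.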
Measurement of Sampling Error
Sampling error refers to the difference between a population parameter and its corresponding
sample statistic. It is measured using statistical tools that describe the variability of sample
estimates.
1. Standard Error (S.E.)
Standard error is the most widely used measure of sampling error.
S.E. = √Var (sample statistic)
For sample mean: S.E. (x̄) = σ/√n or s/√n
For proportion: S.E.(p) = √(pq/n)
2. Variance of the Sampling Distribution
The variance of the sampling distribution of a statistic measures the extent of sampling error.
Higher variance indicates greater sampling error.
3. Confidence Limits / Margin of Error
Sampling error can be expressed through the margin of error used in confidence intervals.
Margin of Error = z × S.E.
4. Coefficient of Variation (CV)
It measures the relative size of sampling error.
CV = (S.E. / Estimate) × 100
5. Sampling Distribution
The dispersion of the sampling distribution reflects the magnitude of sampling error. Narrower
distributions indicate smaller sampling error.
Sampling error is unavoidable but can be minimized by increasing sample size and using efficient
sampling designs.
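The measures above can be computed directly from their formulas. The following is a minimal sketch; the function names and the survey figures (n = 100, mean 50, s.d. 10) are hypothetical, and z = 1.96 is the usual normal value for about 95% confidence.

```python
import math

def se_mean(s, n):
    """Standard error of the sample mean: s / sqrt(n)."""
    return s / math.sqrt(n)

def se_proportion(p, n):
    """Standard error of a sample proportion: sqrt(p * q / n), with q = 1 - p."""
    return math.sqrt(p * (1 - p) / n)

def margin_of_error(se, z=1.96):
    """Margin of error = z * S.E."""
    return z * se

def cv_percent(se, estimate):
    """Coefficient of variation of an estimate, in percent."""
    return se / estimate * 100

# Hypothetical survey result: n = 100, sample mean 50, sample s.d. 10.
se = se_mean(s=10, n=100)           # 10 / 10 = 1.0
moe = margin_of_error(se)            # 1.96 * 1.0
cv = cv_percent(se, estimate=50)     # (1.0 / 50) * 100, i.e. about 2 percent
print(se, moe, cv)
```

With these figures the 95% confidence limits would be about 50 ± 1.96.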
Types of Sampling Error
1. Random Sampling Error/Variation: This error occurs due to natural chance variation among
different samples drawn from the same population. Even with proper random sampling,
different samples yield different results.
2. Population-specific Error: This error occurs when the researcher does not clearly define or
understand the exact population to be surveyed (e.g., surveying all adults when the target
population is homeowners).
3. Selection Error/Bias or Systematic Error: Selection error occurs when participants select
themselves into the survey, so that only those who are interested in it respond to the
questions. Researchers can attempt to overcome selection error by finding ways to
encourage participation.
4. Sample Frame Error: A sample frame error occurs when the sample is selected from a frame
(list of units) that does not match the target population.
How to limit sampling errors (they cannot be eliminated completely)
➢ True representation of population
➢ Population specification
➢ Sampling design
➢ Sample accuracy- Homogeneity
➢ Sampling method
➢ Sample selection
➢ Increase sample size
Non-sampling error
Non-sampling error refers to all sources of error that are unrelated to sampling. Non-sampling
errors are present in all types of survey, including censuses and administrative data. They arise for
a number of reasons: the frame may be incomplete, some respondents may not accurately report
data, data may be missing for some respondents, etc.
Non-sampling errors can be classified into two groups: i) random errors and ii) systematic errors.
• Random errors are errors whose effects approximately cancel out if a large enough sample
is used; they increase the variability of the estimates rather than biasing them.
• Systematic errors are errors that tend to go in the same direction, and thus accumulate
over the entire sample leading to a bias in the final results. Unlike random errors, this bias
is not reduced by increasing the sample size. Systematic errors are the principal cause of
concern in terms of a survey’s data quality. Unfortunately, non-sampling errors are often
extremely difficult, if not impossible, to measure.
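The contrast between the two groups can be seen in a short simulation. This is a sketch under assumed values only: the bias of 2 units, the noise s.d. of 5, the seed, and the function name mean_error are hypothetical.

```python
import random
import statistics

def mean_error(n, bias, noise_sd, seed=0):
    """Average measurement error over n responses: a fixed bias plus random noise."""
    rng = random.Random(seed)
    errors = [bias + rng.gauss(0, noise_sd) for _ in range(n)]
    return statistics.mean(errors)

for n in (100, 10_000):
    random_only = mean_error(n, bias=0.0, noise_sd=5.0)
    systematic = mean_error(n, bias=2.0, noise_sd=5.0)
    print(f"n={n}: random only = {random_only:+.2f}, "
          f"with systematic error = {systematic:+.2f}")
```

As n grows, the average of the purely random errors moves towards zero, while the average in the systematic case stays close to the bias of 2.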
Types of non-sampling error
Non-sampling error can occur in all aspects of the survey process, and can be classified into the
following categories:
a) coverage error
b) measurement error
c) nonresponse error and
d) processing error
a) Coverage error
Coverage error consists of omissions (under-coverage) and of erroneous inclusions, duplications
and misclassifications (over-coverage) of units in the survey frame. Since coverage errors affect
every estimate produced by the survey, they are one of the most important types of error. In the
case of a census, coverage error may be the main source of error. Coverage error can have both
spatial and temporal dimensions, and may cause bias in the estimates. The effect can vary for
different subgroups of the population.
This error tends to be systematic and is usually due to under coverage, which is why it’s important
to reduce it as much as possible.
b) Measurement error
Measurement error, also called response error, is the difference between measured values and true
values. It consists of bias and variance, and it results when data are incorrectly requested, provided,
received or recorded. These errors may occur because of inefficiencies with the questionnaire, the
interviewer, the respondent or the survey process.
• Poor questionnaire designs
It is essential that sample survey or census questions are worded carefully in order to avoid
introducing bias. If questions are misleading or confusing, then the responses may end up
being distorted.
• Interviewer bias
An interviewer can influence how a respondent answers the survey questions. This may
occur when the interviewer is too friendly or aloof or prompts the respondent. To prevent
this, interviewers must be trained to remain neutral throughout the interview. They must
also pay close attention to the way they ask each question. If an interviewer changes the
way a question is worded, it may impact the respondent’s answer.
• Respondent error
Respondents can also provide incorrect answers. Faulty recollections, tendencies to
exaggerate or underplay events, and inclinations to give answers that appear more socially
acceptable are several reasons why a respondent may provide a false answer.
• Problems with the survey process
Errors can also occur because of a problem with the actual survey process. Using proxy
responses, meaning taking answers from someone other than the respondent, or lacking
control over the survey procedures are just a few ways of increasing the risk of response
errors.
c) Non-response error
Estimates obtained after nonresponse has been observed and imputation has been used to deal with
this nonresponse are usually not equivalent to the estimates that would have been obtained had all
the desired values been observed without error. The difference between these two types of
estimates is called the nonresponse error. There are two types of non-response errors: total and
partial.
• Total nonresponse error occurs when all or almost all data for a sampling unit are missing.
This can happen if the respondent is unavailable or temporarily absent, the respondent is
unable to participate or refuses to participate in the survey, or if the dwelling is vacant. If a
significant number of sampled units do not respond to a survey, then the results may be
biased since the characteristics of the non-respondents may differ from those who have
participated.
• Partial nonresponse error occurs when respondents provide incomplete information. Some
respondents may find certain questions difficult to understand, or may refuse or forget to
answer a question. A poorly designed questionnaire or poor interviewing technique can also
result in partial nonresponse error. To reduce this form of error, care should
be taken in designing and testing questionnaires. Adequate interviewer training and
appropriate edit and imputation strategies will also help minimize this error.
d) Processing error
Processing error occurs during data processing. It includes all data processing activities after
collection and prior to estimation, such as errors in data capture, coding, editing and tabulation of
the data as well as in the assignment of survey weights.
• Coding errors occur when different coders code the same answer differently, which can
be caused by poor training, incomplete instructions, variance in coder performance (e.g.
tiredness, illness), data entry errors, or machine malfunction (some processing errors are
caused by errors in the computer programs).
• Data capture errors result when data are not entered into the computer exactly as they
appear on the questionnaire. This can be caused by the complexity of alphanumeric data
and by the lack of clarity in the answer provided. The physical layout of the questionnaire
itself or the coding documents can cause data capture errors. The method of data capture,
manual or automated (for example, using an optical scanner), can also result in errors.
• Editing and imputation errors can be caused by the poor quality of the original data or
by its complex structure. When the editing and imputation processes are automated, errors
can also be the result of faulty programs that were insufficiently tested. The choice of an
inappropriate imputation method can introduce bias. Errors can also result from incorrectly
changing data that were found to be in error, or by erroneously changing correct data.
Sources of non-sampling error
a) Inattention to the work
b) Lack of knowledge
c) Bias in data analysis
When can non-sampling error occur?
➢ Lack of proper specification of the domain of study and scope of the investigation,
➢ Incomplete coverage of the population or sample,
➢ Faulty definitions and objective setting,
➢ Inappropriate sampling frame,
➢ Defective methods of data collection,
➢ Errors in the measuring instrument (measurement error),
➢ Errors in data processing: collection, entry, coding, tabulation, analysis, etc.
How to limit non-sampling error?
➢ Specification of area of research and defined objectives
➢ Appropriate research design
➢ Appropriate sampling frame
➢ Complete data collection
➢ Minimizing errors in instruments
➢ Use of suitable data collection methods
➢ Allow adequate time for data collection
➢ Careful data processing
➢ Pre-testing or Pilot study
Techniques for adjustment of non-response
Non-response occurs when information is not obtained from some units selected in the sample. To
reduce the bias caused by non-response, several statistical techniques are used. Important
techniques are discussed below.
1. Hansen and Hurwitz Technique
Concept:
This technique is used when non-response occurs at the first attempt, but information can be
obtained from a sub-sample of non-respondents by making extra efforts (such as personal
interviews).
Procedure
1. Draw a random sample of size n from the population.
2. Divide the sample into:
o Respondents (n₁)
o Non-respondents (n₂)
3. From the non-respondents, select a sub-sample of size n₂′ = n₂/k (k > 1) using random
sampling.
4. Collect data from this sub-sample with greater effort.
5. Combine information from respondents and sub-sampled non-respondents.
Estimator of Population Mean
ȳHH = (n₁/n) ȳ₁ + (n₂/n) ȳ₂′
where,
ȳ₁ = mean of respondents
ȳ₂′ = mean of sub-sampled non-respondents
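The estimator can be sketched in a few lines of code. The function name hansen_hurwitz_mean and the data below are hypothetical and serve only to illustrate the weighting by n₁/n and n₂/n.

```python
def hansen_hurwitz_mean(respondents, subsample, n2):
    """Hansen-Hurwitz estimate: (n1/n) * ybar1 + (n2/n) * ybar2'.

    respondents -- values from the n1 units that responded at the first attempt
    subsample   -- values from the followed-up sub-sample of non-respondents
    n2          -- total number of non-respondents in the original sample
    """
    n1 = len(respondents)
    n = n1 + n2
    ybar1 = sum(respondents) / n1
    ybar2_sub = sum(subsample) / len(subsample)
    return (n1 / n) * ybar1 + (n2 / n) * ybar2_sub

# Hypothetical data: n = 10, with n1 = 6 respondents and n2 = 4 non-respondents,
# of whom n2' = 2 are followed up (k = 2).
respondents = [10, 12, 11, 9, 13, 11]   # mean 11
subsample = [18, 22]                     # mean 20
estimate = hansen_hurwitz_mean(respondents, subsample, n2=4)
print(round(estimate, 2))
```

Here the estimate is 0.6 × 11 + 0.4 × 20 = 14.6, whereas the respondents alone would give 11. Note that the sub-sample mean is weighted by n₂/n, not by n₂′/n, because it stands in for all non-respondents.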
Merits
• Reduces bias due to non-response
• Statistically sound and widely used
• Cost-effective compared to full follow-up
Demerits
• More time and effort required
• Assumes sub-sample represents all non-respondents
Applications
• Government censuses
• Large-scale socio-economic surveys
Variance of the Hansen–Hurwitz Estimator
Setup and Notation:
Let a population be divided into respondents and non-respondents.
n = n₁ + n₂
n₂′ = n₂ / k, k > 1
ȳ₁ = mean of respondents
ȳ₂′ = mean of sub-sampled non-respondents
S₁² = variance of respondents
S₂² = variance of non-respondents
Step 1: Hansen–Hurwitz Estimator
ȳHH = (n₁/n) ȳ₁ + (n₂/n) ȳ₂′
Step 2: Variance Formula
V(ȳHH) = (n₁/n) ² V(ȳ₁) + (n₂/n) ² V(ȳ₂′)
Step 3: Variance of Sample Means
V(ȳ₁) = S₁² / n₁
V(ȳ₂′) = S₂² / n₂′ = kS₂² / n₂
Step 4: Substitution
V(ȳHH) = (n₁/n²) S₁² + (k n₂/n²) S₂²
Step 5: Rearrangement
V(ȳHH) = (n₁S₁² + n₂S₂²)/n² + (k−1) n₂S₂²/n²
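Final Result
The first term is the pooled within-group contribution. Once the random split of the sample into
n₁ and n₂ is also taken into account, it becomes the usual S²/n (ignoring the finite population
correction), so that
V(ȳHH) ≈ S²/n + (k−1) n₂S₂²/n²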
Interpretation
The first term represents usual sampling variance.
The second term represents additional variance due to non-response.
If all non-respondents are surveyed (k=1), extra variance = 0.
Larger sub-sample of non-respondents ⇒ smaller variance.
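The interpretation can be checked numerically. The sketch below simply evaluates the final formula for a few values of k; the function name hh_variance and the values S² = 25, S₂² = 16, n = 100, n₂ = 40 are hypothetical.

```python
def hh_variance(S2, S2_nonresp, n, n2, k):
    """V(ybar_HH) = S^2/n + (k - 1) * n2 * S2^2 / n^2, as in the final result above."""
    sampling_part = S2 / n
    nonresponse_part = (k - 1) * n2 * S2_nonresp / n**2
    return sampling_part + nonresponse_part

# Hypothetical values: S^2 = 25, S2^2 = 16, n = 100, n2 = 40.
for k in (1, 2, 4):
    print(k, hh_variance(S2=25, S2_nonresp=16, n=100, n2=40, k=k))
```

With k = 1 only the usual term 25/100 = 0.25 remains. Each unit increase in k adds 40 × 16 / 100² = 0.064, so a smaller sub-sample (larger k) gives a larger variance.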
Proof that V(ȳ) = S² / n (Step-by-Step)
Step 1: Population Setup
Consider a finite population of size N with values y₁, y₂, … , yN.
Population mean: Ȳ = (1/N) Σ yᵢ
Population variance: S² = (1/(N−1)) Σ (yᵢ − Ȳ)²
Step 2: Sample Mean
Draw a simple random sample with replacement of size n.
Sample mean: ȳ = (1/n) Σ yᵢ
Step 3: Variance of the Sample Mean
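V(ȳ) = V[(1/n) Σ yᵢ] = (1/n²) Σ V(yᵢ), since draws made with replacement are independent and the
covariance terms vanish.
Step 4: Variance of Each Observation
Under sampling with replacement, each observation has variance σ² = (1/N) Σ (yᵢ − Ȳ)² =
((N−1)/N) S², which is practically equal to S² when N is large.
So, V(yᵢ) = σ² ≈ S²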
Step 5: Substitution
V(ȳ) = (1/n²) × nS²
Step 6: Simplification
V(ȳ) = S² / n (Sampling with replacement)
Conclusion
Hence, under simple random sampling with replacement, the variance of the sample mean equals
the population variance divided by the sample size.
For sampling without replacement (SRSWOR):
V(ȳ) = (S² / n) (1 − n/N)
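Both results can be verified exactly on a very small population by listing every possible sample. The sketch below uses a hypothetical population of N = 5 values with n = 2, and the function name variance_of_sample_mean is an assumption. It keeps two variances apart: σ² with divisor N and S² with divisor N − 1.

```python
from fractions import Fraction
from itertools import combinations, product

def variance_of_sample_mean(samples):
    """Exact variance of the sample mean over an equally likely list of samples."""
    means = [Fraction(sum(s), len(s)) for s in samples]
    centre = sum(means) / len(means)
    return sum((m - centre) ** 2 for m in means) / len(means)

population = [2, 4, 6, 8, 10]   # hypothetical population, N = 5
N, n = len(population), 2
Y_bar = Fraction(sum(population), N)
sigma2 = sum((y - Y_bar) ** 2 for y in population) / N        # divisor N
S2 = sum((y - Y_bar) ** 2 for y in population) / (N - 1)      # divisor N - 1

# All ordered samples with replacement, and all samples without replacement.
v_wr = variance_of_sample_mean(list(product(population, repeat=n)))
v_wor = variance_of_sample_mean(list(combinations(population, n)))

print(v_wr, sigma2 / n)                            # with replacement
print(v_wor, (S2 / n) * (1 - Fraction(n, N)))      # without replacement
```

For this population σ² = 8 and S² = 10. The with-replacement variance equals σ²/n = 4, which is close to S²/n only when N is large, and the without-replacement variance equals (S²/n)(1 − n/N) = 3.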
2. Politz–Simmons Technique
Concept
This technique is mainly used in survey research, especially when non-response is related to
availability of respondents rather than refusal.
Basic Idea
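Respondents are weighted according to how likely they were to be found at home. The
probability of response is assumed to be related to availability.
Procedure
1. Make a single call on each selected unit during the interviewing hours; no call-backs are made.
2. Ask each respondent on how many of the previous five evenings he or she was at home at the
same time of day.
3. Classify respondents into groups by the number of evenings at home (0 to 5), and estimate the
probability of contact for a group with t evenings at home as (t + 1)/6.
4. Assign weights inversely proportional to the probability of contact.
5. Compute weighted estimates.
Assumption
Respondents who are seldom at home are similar to the non-respondents who could not be
contacted, and respondents report their past availability accurately.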
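The weighting step can be sketched as follows. The sketch assumes the commonly described form of the technique, in which each respondent reports on how many of the previous five evenings he or she was at home (t), and the probability of contact is estimated as (t + 1)/6. The function name politz_simmons_mean and the data are hypothetical.

```python
def politz_simmons_mean(values, nights_home, nights_asked=5):
    """Weighted mean with weights inversely proportional to the estimated
    probability of finding the respondent at home.

    nights_home[i] is how many of the previous `nights_asked` evenings
    respondent i reports having been at home; the interview evening itself is
    counted too, so the estimated probability is (t + 1) / (nights_asked + 1).
    """
    weights = [(nights_asked + 1) / (t + 1) for t in nights_home]
    weighted_total = sum(w * y for w, y in zip(weights, values))
    return weighted_total / sum(weights)

# Hypothetical data: people who are rarely at home report higher values.
values = [10, 10, 20, 30]
nights_home = [5, 5, 2, 0]   # weights 1, 1, 2, 6
print(politz_simmons_mean(values, nights_home))
```

In this example the weighted mean is 240/10 = 24, against an unweighted mean of 17.5, because the respondent who is rarely at home represents many similar people who were missed.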
Merits
• Simple to apply in practice
• No need to contact non-respondents again
• Useful in opinion and market surveys
Demerits
• Strong assumptions may not hold
• Not suitable when refusal is the main cause of non-response
Applications
• Labour force surveys
• Household income and expenditure surveys
3. Randomized Response Technique (RRT)
Concept
Developed to handle non-response and false response in surveys involving sensitive questions
(e.g., income, drug use, tax evasion).
Basic Idea
The respondent answers one of two questions based on a random device (coin, dice, random
number table), ensuring privacy and anonymity.
Procedure (unrelated-question model, a variant of Warner’s Model)
1. Respondent uses a random device:
o With probability p, answers the sensitive question.
o With probability 1−p, answers a neutral question.
2. The interviewer does not know which question was answered.
3. Population proportion is estimated statistically.
Estimation
If
• π = proportion having the sensitive attribute
• R = observed proportion of “Yes” responses
Then,
π = [R − (1 − p) π₀] / p
where π₀ is the known proportion of “Yes” answers for the neutral question.
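The formula follows from the fact that a “Yes” can arise in two ways, so that R = pπ + (1 − p)π₀, which is then solved for π. A minimal sketch with hypothetical figures (p = 0.7, π₀ = 0.5, R = 0.29) and a hypothetical function name rrt_estimate:

```python
def rrt_estimate(R, p, pi0):
    """Estimate pi from R = p * pi + (1 - p) * pi0, solved for pi."""
    if not 0 < p <= 1:
        raise ValueError("p must lie in (0, 1]")
    return (R - (1 - p) * pi0) / p

# Hypothetical survey: the device selects the sensitive question with p = 0.7,
# the neutral question has a known 'Yes' proportion pi0 = 0.5,
# and 29% of all answers are 'Yes'.
pi_hat = rrt_estimate(R=0.29, p=0.7, pi0=0.5)
print(round(pi_hat, 3))
```

Here the estimate is (0.29 − 0.15)/0.7 = 0.2. Dividing by p inflates the sampling variability of R, which is why the technique has a larger variance than direct questioning.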
Merits
• Reduces refusal rate
• Encourages truthful responses
• Protects respondent privacy
Demerits
• More complex design
• Larger variance than direct questioning
Applications
• Crime statistics
• Drug abuse surveys
• Tax compliance studies
Comparative Summary
Technique            | Type of non-response | Cost   | Bias reduction
Hansen–Hurwitz       | Unit non-response    | High   | Very good
Politz–Simmons       | Non-contact          | Medium | Moderate
Randomized Response  | Sensitive questions  | Medium | High