Question 2: What are sampling and non-sampling errors? Distinguish between probability and
non-probability sampling methods with examples.
Answer:
Sampling Error vs. Non-Sampling Error:
Sampling Error is the natural, random difference between a sample statistic (like a sample mean) and
the true population parameter. It occurs because we study only a subset of the population. This error is
unavoidable but can be reduced by increasing the sample size. For example, if you take two different
random samples of students to estimate average height, the two averages will differ slightly from each
other and from the true population average; the gap between a sample average and the true value is the
sampling error.
Non-Sampling Error refers to all other errors not related to sampling. These are often systematic and
can occur even in a full census. They include mistakes in data collection (biased questions), processing
errors, respondent errors (lying or misunderstanding), and non-response bias. Increasing sample
size does not reduce these errors.
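The effect of sample size on sampling error can be illustrated with a small simulation. This is a minimal sketch under stated assumptions: the population of 10,000 heights (mean 170 cm, standard deviation 10 cm), the sample sizes, and the function name typical_sampling_error are all hypothetical choices made for illustration.

```python
import random
import statistics

random.seed(42)  # fixed seed so the sketch is reproducible

# Hypothetical population: heights (cm) of 10,000 students.
population = [random.gauss(170, 10) for _ in range(10_000)]
true_mean = statistics.mean(population)


def typical_sampling_error(sample_size, repeats=500):
    """Average absolute gap between a sample mean and the true mean."""
    gaps = []
    for _ in range(repeats):
        sample = random.sample(population, sample_size)  # simple random sample
        gaps.append(abs(statistics.mean(sample) - true_mean))
    return statistics.mean(gaps)


error_small = typical_sampling_error(25)
error_large = typical_sampling_error(400)
print(f"n = 25:  typical sampling error {error_small:.2f} cm")
print(f"n = 400: typical sampling error {error_large:.2f} cm")
```

The larger samples land closer to the true mean on average. A non-sampling error, such as a ruler that reads 2 cm too high, would shift every sample by the same amount no matter how large the sample is.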
Probability vs. Non-Probability Sampling Methods:
Probability Sampling: Every member of the population has a known, non-zero chance of being selected. This
allows for statistical inference.
1. Simple Random Sampling: Pure random selection (e.g., drawing names from a hat, using a random
number generator for a list of employees).
2. Systematic Sampling: Selecting every kth individual after a random start (e.g., choosing every 10th
house on a street list starting from house #3).
3. Stratified Sampling: Dividing the population into homogeneous groups (strata) and randomly sampling
from each (e.g., dividing students by grade level—freshman, sophomore, etc.—and then randomly
selecting a few from each grade).
4. Cluster Sampling: Dividing the population into natural, heterogeneous clusters and randomly selecting
entire clusters to study (e.g., randomly selecting 5 city blocks from a town and surveying all households
on those blocks).
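The first two methods in the list above can be sketched in a few lines. This is only an illustration: the list of 100 employees, the interval k = 10, and the helper name systematic_sample are assumptions, not part of the original answer.

```python
import random

random.seed(1)  # fixed seed so the sketch is reproducible

# Hypothetical sampling frame: a numbered list of 100 employees.
employees = [f"employee_{i}" for i in range(1, 101)]

# Simple random sampling: every subset of 10 employees is equally likely.
simple_random = random.sample(employees, 10)


def systematic_sample(frame, k):
    """Pick a random start among the first k items, then take every k-th item."""
    start = random.randrange(k)  # 0-based position of the first pick
    return frame[start::k]


systematic = systematic_sample(employees, 10)
print(simple_random)
print(systematic)
```

Both samples contain 10 employees, but the systematic sample is fully determined once the random start is drawn.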
Non-Probability Sampling: Members are selected based on judgment, convenience, or quotas; the
probability of selection is unknown. It is not suitable for statistical inference about the whole population.
1. Convenience Sampling: Choosing readily available people (e.g., interviewing shoppers at a single
mall).
2. Judgment/Purposive Sampling: Researcher selects who they think are most appropriate (e.g., a
journalist interviewing "experts" they select on a topic).
3. Quota Sampling: Setting quotas for different groups (e.g., surveying 50 men and 50 women, but
choosing them conveniently).
4. Snowball Sampling: Existing subjects recruit future subjects from their network (e.g., studying a
hidden community by having initial contacts refer others).
Question 4: Write down the following terms and concepts:
a) Type II Error
b) Null Hypothesis
c) Central Limit Theorem
d) Perfect Negative Correlation
e) Coefficient of Determination
Answer:
a) Type II Error: A Type II error occurs when a researcher incorrectly fails to reject a false null hypothesis. In
simpler terms, it is the mistake of concluding "there is no effect or difference" when, in reality, there is one. It is
also called a "false negative."
b) Null Hypothesis: The null hypothesis (𝐻₀) is a statement that assumes there is no effect, no difference, or no
relationship between variables in a study. It is the hypothesis that researchers test against, aiming to reject it in
favor of the alternative hypothesis.
c) Central Limit Theorem: This theorem states that if you take large enough random samples from a
population with finite variance (regardless of the shape of its distribution), the distribution of the sample
means will be approximately normal (bell-shaped). This allows us to use normal probability rules to make
inferences about population means.
d) Perfect Negative Correlation: This is a relationship between two variables where they move in exactly
opposite directions in a perfectly linear way. When one variable increases, the other decreases at a constant rate.
It is represented by a correlation coefficient (𝑟) of −1.
e) Coefficient of Determination: Denoted as 𝑅², this is a statistical measure that shows how well the
independent variable(s) explain the variation in the dependent variable. It ranges from 0 to 1 (or 0% to 100%).
For example, an 𝑅² of 0.75 means 75% of the variation in the dependent variable is explained by the model.
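Items d) and e) can be checked numerically. The sketch below computes Pearson's r by hand for a made-up, perfectly linear decreasing relationship and then squares it. The data and the function name pearson_r are hypothetical. The step from r to R² relies on the fact that R² equals r² in simple linear regression with one independent variable.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient computed from its definition."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical data: y falls by exactly 3 units for every 1-unit rise in x.
hours = [1, 2, 3, 4, 5]
remaining = [20 - 3 * h for h in hours]

r = pearson_r(hours, remaining)
r_squared = r ** 2  # valid as R^2 only for a one-predictor linear model

print(f"r = {r:.2f}, R^2 = {r_squared:.2f}")
```

A perfect negative correlation still gives an R² of 1, because R² measures how much variation is explained, not the direction of the relationship.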
Question 5: Why do researchers take samples in conducting research?
Answer:
Researchers take samples instead of studying an entire population for practical and efficient reasons, which
include:
1. Cost Reduction: Studying an entire population is often very expensive, whereas sampling reduces costs
significantly.
2. Time Efficiency: Collecting data from a sample is much faster than from a whole population, allowing
for quicker results and analysis.
3. Practicality: Some populations are too large, inaccessible, or spread out to study entirely (e.g., all
consumers in a country, all trees in a forest).
4. Destructive Nature of Testing: In cases where the testing process destroys or alters the item (e.g.,
testing light bulbs until they burn out), sampling is necessary.
5. Accuracy and Manageability: A well-chosen, representative sample can provide sufficiently accurate
results without the complexity of handling an entire population.
Question 15: Define the following Terms with Example: Poisson Distribution, p-value, Standard Error,
Systematic Sampling, Dependent Sample.
Answer:
1. Poisson Distribution:
A Poisson distribution is a probability distribution used to model the number of times an event occurs
within a fixed interval of time or space, assuming events occur independently at a constant average rate.
Example: The number of customers arriving at a bank in an hour, when the average is 10 per hour.
2. p-value:
The p-value is the probability of obtaining test results at least as extreme as the observed results,
assuming the null hypothesis is true. It helps determine statistical significance.
Example: If the p-value is 0.03, results at least as extreme as those observed would occur about 3% of
the time if the null hypothesis were true.
3. Standard Error (SE):
The standard error measures the precision of a sample statistic (like the sample mean) as an estimate of
the true population parameter. It is the standard deviation of the statistic's sampling distribution.
Example: If the sample mean height is 170 cm and the SE is 5 cm, sample means from repeated samples
would typically fall within about 5 cm of the true population mean.
4. Systematic Sampling:
Systematic sampling is a probability sampling method where sample members are selected at regular
intervals from a list after a random starting point.
Example: From a list of 500 students, randomly pick a starting number (e.g., 7) and then select every
10th student (7th, 17th, 27th, …).
5. Dependent Sample:
Dependent samples (or paired/matched samples) occur when measurements in one sample are related to
measurements in another sample, often from the same subjects or matched pairs.
Example: Measuring the blood pressure of the same patients before and after a treatment.
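The Poisson example above (an average of 10 customers per hour) can be turned into concrete probabilities using the standard Poisson probability formula P(X = k) = e^(−λ) λ^k / k!. This is a minimal sketch; the function name poisson_pmf and the two questions asked of it are illustrative choices.

```python
import math


def poisson_pmf(k, rate):
    """P(X = k) for a Poisson distribution with mean `rate`."""
    return math.exp(-rate) * rate ** k / math.factorial(k)


rate = 10  # average arrivals per hour, taken from the bank example

p_exactly_10 = poisson_pmf(10, rate)
p_at_most_5 = sum(poisson_pmf(k, rate) for k in range(6))  # k = 0, 1, ..., 5

print(f"P(exactly 10 customers) = {p_exactly_10:.4f}")
print(f"P(at most 5 customers)  = {p_at_most_5:.4f}")
```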
Question 16(a): What are the differences between Point Estimation and Interval Estimation?
Answer:
Point Estimation and Interval Estimation are two methods used in inferential statistics to estimate population
parameters based on sample data.
Point Estimation
It provides a single, specific value as the best estimate of an unknown population parameter (such as the
mean or proportion).
This single value is called a point estimate and is usually derived from sample statistics (e.g., sample
mean, sample proportion).
Example: If we take a sample of 50 students and find their average height is 165 cm, then 165 cm is the
point estimate for the population mean height.
Limitation: It does not give any information about the accuracy, reliability, or variability of the
estimate. It simply gives one number without any measure of uncertainty.
Interval Estimation
It provides a range (or interval) of values within which the population parameter is expected to fall,
along with a certain level of confidence (e.g., 95%, 99%).
This range is called a confidence interval.
Example: Using the same sample, we might say: “We are 95% confident that the true population mean
height is between 162 cm and 168 cm.”
Advantage: It not only gives an estimate but also communicates the precision and reliability of the
estimate through the confidence level and the width of the interval.
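The two kinds of estimate can be computed side by side. The sketch below uses the sample of 50 students with mean 165 cm from the example. The sample standard deviation of 10.8 cm is an assumption, chosen only so that the result lands close to the 162 cm to 168 cm interval quoted above. It also uses the normal (z) approximation, which is reasonable for n = 50.

```python
import math
from statistics import NormalDist


def mean_confidence_interval(sample_mean, sample_sd, n, confidence=0.95):
    """Large-sample (z-based) confidence interval for a population mean."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)  # e.g. about 1.96 for 95%
    margin = z * sample_sd / math.sqrt(n)
    return sample_mean - margin, sample_mean + margin


point_estimate = 165  # cm, the sample mean itself
low, high = mean_confidence_interval(point_estimate, sample_sd=10.8, n=50)

print(f"Point estimate: {point_estimate} cm")
print(f"95% interval:   {low:.1f} cm to {high:.1f} cm")
```

Raising the confidence level or shrinking the sample widens the interval, while the point estimate stays the same.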
Question 17: What factors should be considered for determining the appropriate sample size?
Factors to consider for determining appropriate sample size:
1. Population Variability (Standard Deviation):
A more variable population requires a larger sample to accurately estimate parameters.
2. Desired Margin of Error (Precision):
The acceptable difference between the sample estimate and the true population value. A smaller margin
of error requires a larger sample.
3. Confidence Level:
The probability that the confidence interval contains the true population parameter (e.g., 95%, 99%). A
higher confidence level requires a larger sample.
4. Population Size:
For small populations, a larger proportion may need to be sampled. For very large populations, the
sample size needed stabilizes.
5. Sampling Method:
Complex designs (like stratified or cluster sampling) may require different sample size calculations
compared to simple random sampling.
6. Budget and Resources:
Practical constraints like cost, time, and manpower can limit how large a sample can be.
7. Expected Response Rate:
If non-response is anticipated, the initial sample size should be increased to achieve the desired number
of responses.
8. Purpose of the Study:
Exploratory studies may use smaller samples, while conclusive or high-stakes research requires larger,
more reliable samples.
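Factors 1, 2, 3 and 7 combine in the standard textbook formula for estimating a mean under simple random sampling, n = (z·σ / E)², where E is the margin of error. The formula is not stated in the answer above, and the numbers used (σ = 10, E = 2, an 80% response rate) are hypothetical. Treat this as a sketch rather than a complete sample-size procedure.

```python
import math
from statistics import NormalDist


def sample_size_for_mean(sigma, margin_of_error, confidence=0.95):
    """Smallest n with z * sigma / sqrt(n) <= margin_of_error."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return math.ceil((z * sigma / margin_of_error) ** 2)


def inflate_for_nonresponse(n, response_rate):
    """Number of people to contact so that about n of them respond."""
    return math.ceil(n / response_rate)


n_95 = sample_size_for_mean(sigma=10, margin_of_error=2)
n_99 = sample_size_for_mean(sigma=10, margin_of_error=2, confidence=0.99)
n_to_contact = inflate_for_nonresponse(n_95, response_rate=0.8)

print(n_95, n_99, n_to_contact)
```

Moving from 95% to 99% confidence increases the required sample, as does higher variability or a tighter margin of error.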
Question 18 (b): Distinguish between Stratified and Clustered Sampling with example.
Answer:
Stratified Sampling and Cluster Sampling are both probability sampling methods, but they differ in their
approach and purpose.
Stratified Sampling
Purpose: To ensure representation from all key subgroups (strata) of the population.
Process:
1. Divide the population into homogeneous subgroups called strata based on a relevant
characteristic (e.g., age, income, department).
2. Then, take a random sample from each stratum.
Example: A university wants to survey student satisfaction. The population is divided into strata:
Freshmen, Sophomores, Juniors, and Seniors. From each group, 50 students are randomly selected.
Key Feature: Strata are internally similar, but different from each other. Ensures all subgroups are
included.
Cluster Sampling
Purpose: To reduce cost and increase efficiency when the population is large and geographically
scattered.
Process:
1. Divide the population into heterogeneous subgroups called clusters, usually based on natural or
geographical boundaries (e.g., schools, city blocks).
2. Randomly select entire clusters and include all members of chosen clusters in the sample (or
take a random sample within selected clusters).
Example: A health organization wants to study vaccination rates in a city. The city is divided into 100
neighborhoods (clusters). 10 neighborhoods are randomly selected, and all households in those 10
neighborhoods are surveyed.
Key Feature: Clusters are internally diverse and resemble the whole population. Sampling is done at
the cluster level.
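The two examples above can be sketched as follows. The population sizes (400 students per class year, 20 households per neighborhood) and all identifiers are made up for illustration. Only the sample sizes of 50 per stratum and 10 out of 100 neighborhoods come from the examples.

```python
import random

random.seed(7)  # fixed seed so the sketch is reproducible

# Hypothetical university: 400 students in each of four class years (strata).
students = {
    year: [f"{year}_{i}" for i in range(400)]
    for year in ["Freshman", "Sophomore", "Junior", "Senior"]
}

# Stratified sampling: draw a random sample of 50 from EVERY stratum.
stratified = {
    year: random.sample(members, 50) for year, members in students.items()
}

# Hypothetical city: 100 neighborhoods (clusters) with 20 households each.
neighborhoods = {
    f"neighborhood_{n}": [f"household_{n}_{h}" for h in range(20)]
    for n in range(100)
}

# Cluster sampling: randomly pick 10 WHOLE neighborhoods, survey every household.
chosen = random.sample(sorted(neighborhoods), 10)
cluster_sample = [hh for name in chosen for hh in neighborhoods[name]]

print({year: len(picked) for year, picked in stratified.items()})
print(len(chosen), "clusters,", len(cluster_sample), "households")
```

In the stratified design randomness operates inside every group. In the cluster design it operates on the groups themselves, so 90 of the 100 neighborhoods contribute no households at all.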
Question 19 (a): What do you mean by testing of a hypothesis? Discuss different steps involved in testing
of a hypothesis.
Answer:
Hypothesis testing is a statistical procedure used to make decisions or draw conclusions about a population
parameter based on sample data. It involves testing an assumption (hypothesis) about a population parameter to
determine whether there is enough evidence in the sample data to reject that assumption.
Steps Involved in Hypothesis Testing:
1. State the Hypotheses:
o Null Hypothesis (𝐻₀): A statement of no effect, no difference, or no change. It is the assumption
we test.
o Alternative Hypothesis (𝐻₁ or 𝐻ₐ): A statement that contradicts the null hypothesis. It
represents the effect, difference, or change we aim to detect.
2. Select the Significance Level (𝛼):
o Choose the probability of rejecting the null hypothesis when it is actually true (Type I error).
Common levels are 0.05, 0.01, or 0.10.
3. Choose the Appropriate Test Statistic:
o Based on the data type, sample size, and parameter being tested, select a test (e.g., z-test, t-test,
chi-square test).
4. Collect Sample Data and Compute Test Statistic:
o Gather the sample data and calculate the value of the test statistic (e.g., z-score, t-score).
5. Determine the Critical Value(s) or p-value:
o Find the critical value from the statistical tables corresponding to 𝛼 and the test distribution, or
compute the p-value (the probability, under 𝐻₀, of a test statistic at least as extreme as the one
observed).
6. Make a Decision:
o Using Critical Value: If the test statistic falls in the rejection region (beyond the critical value),
reject 𝐻₀; otherwise, fail to reject 𝐻₀.
o Using p-value: If p-value ≤ 𝛼, reject 𝐻₀; otherwise, fail to reject 𝐻₀.
7. Draw a Conclusion:
o Interpret the decision in the context of the original research question. State whether there is
sufficient evidence to support the alternative hypothesis.
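The seven steps can be traced in a short worked example. This is a sketch of a two-sided one-sample z-test with hypothetical numbers: a claimed mean of 50, a known population standard deviation of 6, and a sample of 36 with mean 52. The function name one_sample_z_test is likewise an illustrative choice.

```python
import math
from statistics import NormalDist


def one_sample_z_test(sample_mean, mu_0, sigma, n, alpha=0.05):
    """Two-sided z-test of H0: mu = mu_0 against H1: mu != mu_0."""
    standard_error = sigma / math.sqrt(n)
    z = (sample_mean - mu_0) / standard_error         # step 4: test statistic
    critical = NormalDist().inv_cdf(1 - alpha / 2)    # step 5: critical value
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))      # step 5: p-value
    reject = p_value <= alpha                         # step 6: decision
    return z, critical, p_value, reject


# Steps 1-3: H0: mu = 50, H1: mu != 50, alpha = 0.05, z-test (sigma known).
z, critical, p_value, reject = one_sample_z_test(
    sample_mean=52, mu_0=50, sigma=6, n=36
)

# Step 7: state the conclusion in context.
print(f"z = {z:.2f}, critical value = ±{critical:.2f}, p-value = {p_value:.4f}")
print("Reject H0" if reject else "Fail to reject H0")
```

The critical-value rule and the p-value rule always agree for the same test: |z| exceeds the critical value exactly when the p-value is below 𝛼.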
Question 20: Define the following Terms with Example:
1. Level of Significance and p-value
Level of Significance (𝛼): The probability of rejecting the null hypothesis when it is actually true (Type
I error). It is set by the researcher before the test (e.g., 0.05 or 5%).
Example: If 𝛼 = 0.05, there is a 5% risk of concluding that a new drug works when it actually doesn’t.
p-value: The probability of obtaining a test statistic at least as extreme as the one observed, assuming
the null hypothesis is true.
Example: A p-value of 0.03 means that, if the null hypothesis were true, results at least as extreme as
those observed would occur about 3% of the time. If 𝛼 = 0.05, we reject 𝐻₀ because 0.03 < 0.05.
2. Paired Sample
Two sets of observations that are related or matched in some way, often from the same subjects at two different
times or under two conditions.
Example: Measuring the blood pressure of the same patients before and after taking a medication.
3. CPI (Consumer Price Index)
A measure that tracks changes in the price level of a basket of consumer goods and services over time, used as
an indicator of inflation.
Example: If the CPI increases from 100 to 105 in a year, it means the average price level has risen by 5%.
4. Symmetric Distribution
A probability distribution where the left and right sides are mirror images of each other, so the mean and
median are equal (and, if the distribution has a single peak, the mode as well) and data are evenly spread
around the center.
Example: The normal distribution (bell curve) is symmetric.
5. Confidence Interval
A range of values, derived from sample data, within which the true population parameter is expected to lie with
a certain level of confidence.
Example: “We are 95% confident that the average height of students is between 160 cm and 170 cm.”
Question: Distinguish between Random and Systematic Sampling with example. What are the differences
between Small Sample and Large Sample Test of Hypothesis?
1. Random Sampling vs. Systematic Sampling
Random Sampling
Each member of the population has an equal and independent chance of being selected.
Selection is entirely by chance, often using random number generators or lottery methods.
Example: Assigning each student in a school a number and using a computer to randomly pick 50
numbers. Those students are selected.
Systematic Sampling
Selection follows a fixed interval after a random starting point.
The first member is chosen randomly, then every kth member is selected from the list.
Example: From a list of 1000 employees, randomly choose a starting point (e.g., 12th person) and then
select every 20th person thereafter (12th, 32nd, 52nd, …).
Key Difference:
Random sampling is entirely chance-based with no pattern.
Systematic sampling uses a fixed interval after one random start, making it easier and faster but
potentially less random if the list has a hidden pattern.
2. Differences between Small Sample and Large Sample Test of Hypothesis
Aspect | Small Sample Test | Large Sample Test
Sample Size (n) | Usually n < 30 | Usually n ≥ 30
Distribution Used | t-distribution | z-distribution (normal)
Assumption | Population is normally distributed (if n is small) | Central Limit Theorem applies; normality not required
Variance | Population variance often unknown; sample variance used | Population variance known or estimated accurately
Example Test | One-sample t-test, Paired t-test | One-sample z-test, Proportion test
Accuracy | Less precise, wider confidence intervals | More precise, narrower confidence intervals
When Used | When data is limited, e.g., pilot studies, medical trials | Surveys, market research, large-scale studies
Example:
Small sample: Testing a new drug on 15 patients using a t-test.
Large sample: Surveying 500 customers about satisfaction using a z-test for proportions.
Question: Explain the underlying concept of the ‘Central Limit Theorem’ with example.
Answer:
The Central Limit Theorem (CLT) is a fundamental concept in statistics that explains how the distribution of
sample means behaves, regardless of the shape of the population distribution. The underlying idea is that if you
take large enough random samples from any population (whether normal, skewed, uniform, etc.), the
distribution of the sample means will approximate a normal (bell-shaped) distribution.
Key Concepts of CLT:
1. Sample Means are Approximately Normally Distributed:
Even if the population is not normal, the sampling distribution of the sample mean becomes
approximately normal as the sample size increases (𝑛 ≥ 30 is a common rule of thumb, though strongly
skewed populations may need more).
2. Mean of Sample Means Equals Population Mean:
The mean of all possible sample means is equal to the population mean (𝜇).
3. Standard Error:
The standard deviation of the sample means (called the standard error) is:
$$\text{Standard Error} = \frac{\sigma}{\sqrt{n}}$$
where 𝜎 is the population standard deviation and 𝑛 is the sample size.
4. Larger Sample → Better Approximation:
The larger the sample size, the closer the sampling distribution gets to a normal distribution.
Example:
Imagine a factory produces light bulbs with a lifespan that follows a skewed distribution (most last around
800 hours, but some last much longer). The population mean 𝜇 = 800 hours and standard deviation 𝜎 =
100 hours.
Step 1: Take many random samples of size 𝑛 = 40 bulbs each.
Step 2: For each sample, calculate the mean lifespan.
Step 3: Plot these sample means.
Result: Even though the original bulb lifespans are skewed, the distribution of the sample means will be
approximately normal, centered at 800 hours, with a standard error of:
$$\text{SE} = \frac{100}{\sqrt{40}} \approx 15.81 \text{ hours}$$
This allows us to use normal probability rules to make inferences about the average lifespan of all bulbs, even
without knowing the original population’s shape.
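The light bulb example can be simulated. The original does not say which skewed distribution the lifespans follow. The sketch below assumes a shifted exponential (700 hours plus an exponential with mean 100), chosen only because it is strongly right-skewed and has mean 800 and standard deviation 100. The number of repeated samples (5,000) is also an arbitrary choice.

```python
import math
import random
import statistics

random.seed(0)  # fixed seed so the sketch is reproducible


def bulb_lifespan():
    """Hypothetical skewed lifespan: mean 800 h, standard deviation 100 h."""
    return 700 + random.expovariate(1 / 100)


n = 40  # bulbs per sample, as in the example

# Steps 1 and 2: draw many samples and record each sample mean.
sample_means = [
    statistics.mean([bulb_lifespan() for _ in range(n)])
    for _ in range(5_000)
]

# Step 3 (summarised numerically instead of plotted).
center = statistics.mean(sample_means)
spread = statistics.stdev(sample_means)
standard_error = 100 / math.sqrt(n)
within = sum(
    abs(m - 800) <= 1.96 * standard_error for m in sample_means
) / len(sample_means)

print(f"Mean of sample means: {center:.1f} hours")
print(f"SD of sample means:   {spread:.2f} hours (theory: {standard_error:.2f})")
print(f"Share within ±1.96 SE of 800: {within:.3f}")
```

If the sample means are close to normal, roughly 95% of them should fall within 1.96 standard errors of 800 hours, even though individual lifespans are far from normal.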
Question 26 (a): Why do we convert data into index number?
Answer:
We convert data into index numbers to simplify, standardize, and compare complex data over time or across
categories. An index number expresses data relative to a base period (set as 100), making trends and changes
easier to interpret.
Main reasons for using index numbers:
1. Measure Relative Change:
Index numbers show percentage changes in variables (like price, quantity, or value) compared to a base
period.
Example: A price index of 115 means prices have increased by 15% compared to the base year.
2. Compare Diverse Items:
They allow comparison of different items measured in different units (e.g., comparing the cost of food,
housing, and transportation in a single index like CPI).
3. Simplify Large Numbers:
Index numbers convert large and unwieldy data into simple, understandable figures (e.g., converting
sales values from millions into an index of 100, 105, 110, etc.).
4. Identify Trends and Patterns:
They help in tracking economic or business trends over time, such as inflation, production growth, or
market performance.
5. Facilitate Decision-Making:
Governments, businesses, and researchers use indices (like Consumer Price Index, GDP deflator) to
adjust wages, pensions, prices, and policies.
6. Standardize Comparisons:
Indices enable fair comparison across different time periods, regions, or groups by using a common
base.
Example:
If the average price of a basket of goods was $200 in 2020 (base year) and $230 in 2024, the price index for
2024 is:
$$\frac{230}{200} \times 100 = 115$$
This indicates a 15% increase in prices since 2020.
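The same calculation extends to a whole series of prices. In this small sketch the 2020 and 2024 basket prices come from the example, while the 2022 price of $214 is a made-up intermediate value added for illustration.

```python
def price_index(current_price, base_price):
    """Index of current_price relative to base_price (base period = 100)."""
    return current_price * 100 / base_price


# Basket prices by year; 2020 is the base year. The 2022 figure is hypothetical.
basket_prices = {2020: 200, 2022: 214, 2024: 230}
base = basket_prices[2020]

index_series = {year: price_index(p, base) for year, p in basket_prices.items()}
index_2024 = index_series[2024]

for year, value in index_series.items():
    print(f"{year}: index {value:.1f} ({value - 100:+.1f}% vs. base year)")
```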
Question 30: Why must we not attribute causality in a relationship even when there is strong correlation
between the variables or events? Why should we be cautious in using the past data to predict future
trend?
Answer:
1. Correlation Does Not Imply Causation
Even when there is a strong correlation between two variables, we cannot assume one causes the other.
Reasons include:
Third Variable (Confounding Factor):
Both correlated variables may be influenced by a third unseen factor.
Example: Ice cream sales and drowning incidents are highly correlated in summer, but both are caused
by hot weather, not by each other.
Direction of Causality Ambiguity:
It may be unclear which variable is the cause and which is the effect.
Example: Higher education and higher income are correlated, but does education cause higher income,
or do richer families afford more education?
Coincidence:
Sometimes correlation occurs purely by chance, especially with large datasets.
Reverse Causality:
The effect may actually be the cause.
Example: People with more stress may sleep less, but poor sleep could also cause more stress.
Conclusion: Only a carefully designed experimental study (with control groups, randomization) can establish
causality, not correlation alone.
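The ice cream and drowning example can be reproduced with simulated data in which temperature drives both variables and neither affects the other. Every number below is invented. The partial correlation formula used to "hold temperature fixed" is the standard first-order one, which the answer above does not mention.

```python
import math
import random

random.seed(3)  # fixed seed so the sketch is reproducible


def pearson_r(xs, ys):
    """Pearson correlation coefficient computed from its definition."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical daily data: temperature is the only common cause.
temperature = [random.uniform(10, 35) for _ in range(365)]
ice_cream_sales = [20 * t + random.gauss(0, 60) for t in temperature]
drownings = [0.3 * t + random.gauss(0, 1.5) for t in temperature]

r_raw = pearson_r(ice_cream_sales, drownings)

# Partial correlation: sales vs. drownings with temperature held fixed.
r_st = pearson_r(ice_cream_sales, temperature)
r_dt = pearson_r(drownings, temperature)
r_partial = (r_raw - r_st * r_dt) / math.sqrt((1 - r_st ** 2) * (1 - r_dt ** 2))

print(f"Raw correlation:                {r_raw:.2f}")
print(f"Correlation given temperature:  {r_partial:.2f}")
```

The raw correlation is strong, yet it largely disappears once temperature is accounted for, which is what a confounding factor looks like in data.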
2. Caution in Using Past Data to Predict Future Trends
Past data is useful but limited for forecasting because:
Changing Conditions:
Economic, social, technological, or environmental factors change over time, making past patterns
unreliable.
Example: Stock market trends pre-pandemic may not hold post-pandemic.
Structural Breaks:
Sudden events (recessions, policy changes, natural disasters) can disrupt historical trends.
Overfitting:
Models built too closely on past data may capture noise rather than true patterns, failing in future
predictions.
Limited Data Range:
Past data may not include all possible scenarios or rare events (e.g., a once-in-a-century crisis).
Human Behavior Changes:
Consumer preferences, cultural shifts, and innovations evolve, making past behavior an imperfect guide.
Conclusion: While past data provides a baseline, predictions should account for uncertainty, external
changes, and new information.
Question 36: How are the time series data different from panel data? Briefly explain different
components of time series analysis.
Answer:
Difference Between Time Series Data and Panel Data
Aspect | Time Series Data | Panel Data
Definition | Data collected for one entity over multiple time periods. | Data collected for multiple entities over multiple time periods.
Dimension | Two-dimensional: Time × Variables (for one unit). | Three-dimensional: Entities × Time × Variables.
Example | Monthly sales data of a single company from 2010–2023. | Annual GDP data for 10 countries from 2000–2022.
Also Known As | Longitudinal data (single unit). | Cross-sectional time series; longitudinal data (multiple units).
Purpose | Analyze trends, seasonality, cycles over time for one unit. | Compare behavior across units and over time.
Components of Time Series Analysis
A time series can be decomposed into four components:
1. Trend (T):
The long-term movement in the data, showing a general upward, downward, or stable pattern over
time.
Example: Gradual increase in annual smartphone sales over a decade.
2. Seasonality (S):
Regular, predictable fluctuations that occur at fixed intervals within a year (monthly, quarterly, etc.).
Example: Ice cream sales peak every summer.
3. Cyclical (C):
Long-term oscillations that are not of fixed period, often influenced by economic conditions (e.g.,
business cycles lasting several years).
Example: Periods of economic boom and recession.
4. Irregular/Random (I):
Unpredictable, random variations due to unexpected events or noise. This component is what remains
after removing trend, seasonality, and cyclical effects.
Additive Model: 𝑌 = 𝑇 + 𝑆 + 𝐶 + 𝐼
Multiplicative Model: 𝑌 = 𝑇 × 𝑆 × 𝐶 × 𝐼
Time series analysis helps in forecasting, detecting patterns, and understanding underlying behavior over
time.
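The additive model can be made concrete with a toy series. In this sketch the quarterly series is built from a known linear trend plus a repeating seasonal pattern, with the cyclical and irregular components left out for clarity. A centered moving average, a common smoothing technique not named in the answer above, then recovers the trend. All numbers and the helper name centered_moving_average are hypothetical.

```python
# Hypothetical quarterly series from the additive model Y = T + S
# (C and I are set to zero here to keep the illustration exact).
trend = [100 + 2 * t for t in range(16)]
seasonal_pattern = [5, -3, 8, -10]  # one year of effects, summing to zero
seasonal = [seasonal_pattern[t % 4] for t in range(16)]
series = [trend[t] + seasonal[t] for t in range(16)]


def centered_moving_average(values, period=4):
    """Centered moving average for an even period (half weight on both ends)."""
    half = period // 2
    result = {}
    for t in range(half, len(values) - half):
        window = values[t - half : t + half + 1]
        total = 0.5 * window[0] + sum(window[1:-1]) + 0.5 * window[-1]
        result[t] = total / period
    return result


estimated_trend = centered_moving_average(series)
estimated_seasonal = {t: series[t] - estimated_trend[t] for t in estimated_trend}

for t in estimated_trend:
    print(t, series[t], estimated_trend[t], estimated_seasonal[t])
```

Averaging over exactly one seasonal period cancels the seasonal effects, so what remains is the trend. Subtracting that trend from the series then isolates the seasonal component.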
Question 38 (a): What are the four possible outcomes for a test of hypothesis? Show these outcomes by
writing a table. Briefly describe the Type I and Type II errors.
Answer:
Four Possible Outcomes of Hypothesis Testing
In hypothesis testing, decisions about the null hypothesis (𝐻₀) are made based on sample data. The four possible
outcomes depend on whether 𝐻₀ is true or false, and whether we reject or fail to reject it.
Table of Outcomes:
Decision | 𝐻₀ is TRUE (No effect/difference) | 𝐻₀ is FALSE (Effect/difference exists)
Reject 𝐻₀ | Type I Error (False Positive) | Correct Decision (True Positive)
Fail to Reject 𝐻₀ | Correct Decision (True Negative) | Type II Error (False Negative)
Type I Error (α – False Positive):
Meaning: Rejecting the null hypothesis when it is actually true.
Probability: Denoted by α (level of significance).
Example: Concluding a new drug is effective when it actually is not.
Consequence: Wasting resources, implementing ineffective treatments, or making false claims.
Type II Error (β – False Negative):
Meaning: Failing to reject the null hypothesis when it is actually false.
Probability: Denoted by β.
Example: Concluding a new drug has no effect when it actually works.
Consequence: Missing out on beneficial innovations or interventions.
Power of the Test (1 – β):
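The probability of correctly rejecting 𝐻₀ when it is false. For a fixed sample size, lowering α raises β, so a good
test holds α at the chosen level while making β as small as possible (for example, by increasing the sample size).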
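The relationship between α, β and power can be computed directly for a simple case. This is a sketch for an upper-tailed one-sample z-test with hypothetical numbers: null mean 50, true mean 52, and known standard deviation 6. The function name power_one_sided_z is an illustrative choice.

```python
import math
from statistics import NormalDist


def power_one_sided_z(mu_0, mu_true, sigma, n, alpha=0.05):
    """Power and beta of the upper-tailed z-test of H0: mu = mu_0."""
    se = sigma / math.sqrt(n)
    # Reject H0 when the sample mean exceeds this cut-off.
    critical_mean = mu_0 + NormalDist().inv_cdf(1 - alpha) * se
    # Beta: chance the sample mean stays below the cut-off when mu = mu_true.
    beta = NormalDist(mu_true, se).cdf(critical_mean)
    return 1 - beta, beta


power_36, beta_36 = power_one_sided_z(50, 52, sigma=6, n=36)
power_100, beta_100 = power_one_sided_z(50, 52, sigma=6, n=100)
power_strict, beta_strict = power_one_sided_z(50, 52, sigma=6, n=36, alpha=0.01)

print(f"n = 36,  alpha = 0.05: beta = {beta_36:.3f}, power = {power_36:.3f}")
print(f"n = 100, alpha = 0.05: beta = {beta_100:.3f}, power = {power_100:.3f}")
print(f"n = 36,  alpha = 0.01: beta = {beta_strict:.3f}, power = {power_strict:.3f}")
```

A larger sample raises power at the same α, whereas tightening α from 0.05 to 0.01 with the same sample makes a Type II error more likely.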