
Inference in Quantitative Research Methods

The lecture focuses on inference in quantitative research, particularly the importance of understanding confidence intervals and sampling distributions. It covers the distribution of random variables, population and sampling, point and interval estimation, and the properties of estimators. Key concepts include the Central Limit Theorem, bias, efficiency, consistency, and asymptotic normality of estimators.


SOCI2030: Quantitative Research Methods [1]

Lecture 5: Inference

Lai Wei
Department of Sociology
The University of Hong Kong

Mar 4, 2025

[1] I thank Satoshi Araki and Brandon Stewart for sharing their slides.
1 / 57
Last Week

1. Transforming hypotheses to variables
2. Types of survey questions/variables
   • Continuous
   • Discrete (Binary; Nominal; Ordinal)
   • Open-ended
3. Pitfalls in survey design (wording/ordering)
4. Getting sensitive information
   • List experiment
   • Randomized response
5. Survey experiment
   • Priming effect
   • Vignette/factorial design

2 / 57
What is inference?

Inference: given what we observe in the data, what is our best guess about the entire population?

• We often read statements like this in the news:
  • "The research, which analyzed survey data from over 2,000 adults across the country, found that regular volunteers experienced, on average, a 10 percent increase in life satisfaction. Researchers explained that the reported effect comes with a 95 percent confidence interval ranging from 5 percent to 15 percent."
• Goal of the class:
  • Understand what a confidence interval means
  • Know how to calculate it

3 / 57
Why do we need inference?

Suppose you want to study the average life satisfaction of HKU students.
... If you interview 10 students and get an average life satisfaction of 3.5 (on a scale of 1-7)
... If you interview 100 students and get an average life satisfaction of 3.5 (on a scale of 1-7)
The two studies contain different amounts of information, even though the averages are identical.
A confidence interval helps summarize that information; in general, a larger sample leads to a narrower confidence interval.

4 / 57
This Week

1. Distribution of Random Variables


2. Population and Sampling
3. Point Estimation
4. Interval Estimation

5 / 57
Discrete Distribution

• Inference concerns probabilistic statements
  • "There is a 95 percent chance that the values are between ... and ..."
• The distribution bridges variables and probabilities
• The distribution of a random variable specifies the probability of all events associated with that random variable.
• For discrete distributions, the random variable X takes on a finite (or countably infinite) number of values.
• A probability mass function (PMF) and a cumulative distribution function (CDF) are two common ways to define the probability distribution of a discrete random variable.
• Probability mass functions provide a compact way to represent information about how likely various outcomes are.

6 / 57
PMF of Discrete Variable
Suppose we want to know students' average life satisfaction. We ask 26 students to rate it on a scale of 1-8.
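The original figure is not reproduced here, so the sketch below shows how a PMF and a CDF of this kind could be tabulated. The 26 responses are invented for illustration (they are not the lecture's data), and `pmf_and_cdf` is a hypothetical helper name.

```python
from collections import Counter

# Hypothetical responses from 26 students on a 1-8 scale (made up for illustration).
responses = [1, 2, 2, 3, 3, 3, 4, 4, 4, 4, 5, 5, 5, 5, 5,
             6, 6, 6, 6, 7, 7, 7, 8, 8, 4, 5]

def pmf_and_cdf(values, support):
    """Return two dicts: P(X = x) and P(X <= x) for each x in support."""
    counts = Counter(values)
    n = len(values)
    pmf = {x: counts.get(x, 0) / n for x in support}
    cdf = {}
    running = 0.0
    for x in support:
        running += pmf[x]   # the CDF accumulates the PMF from left to right
        cdf[x] = running
    return pmf, cdf

pmf, cdf = pmf_and_cdf(responses, range(1, 9))
for x in range(1, 9):
    print(f"x={x}  PMF={pmf[x]:.3f}  CDF={cdf[x]:.3f}")
```

The PMF values sum to 1, and the CDF rises step by step until it reaches 1 at the largest value on the scale.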

7 / 57
CDF of Discrete Variable

8 / 57
Distribution of Continuous Variable

• Continuous random variables take on an uncountably infinite number of values.
  • Example: human height, time
• Variables that take many distinct values (e.g. test scores from 0 to 100) are often treated as continuous too, as an approximation.
• A probability density function (PDF) and a cumulative distribution function (CDF) are two common ways to define the distribution of a continuous random variable.
• They are similar to the discrete case, with a few subtle differences.

9 / 57
PDF of Continuous Variable

10 / 57
CDF of Continuous Variable

Figure: Height in China

11 / 57
Summaries of Distributions

Distributions have all kinds of wonky shapes. How do we characterize what they look like?

12 / 57
Expectation as a Measure of Central Tendency
The expected value of a random variable X is denoted by E[X] and is a measure of the central tendency of X.

An expected value is a weighted average of all of the values, weighted by their probability of occurrence.

In a lot of cases it is synonymous with the mean or average.

The expected value of a discrete random variable X is defined as:

$$E[X] = \sum_{\text{all } x} x \cdot p_X(x)$$

The expected value of a continuous random variable X is defined as:

$$E[X] = \int_{-\infty}^{\infty} x \cdot f_X(x)\,dx$$
13 / 57
Variance: A Measure of Dispersion

Expectation told us about the central tendency of a random variable, but what about dispersion?

The variance of a random variable X is given by:

$$\mathrm{Var}(X) = E\big[(X - E[X])^2\big]$$

It is the expectation of the squared distance from the mean.

Its square root is called the standard deviation: $SD(X) = \sqrt{\mathrm{Var}(X)}$
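As a small worked example of the expectation and variance formulas, the sketch below applies them to a fair six-sided die. The die is a standard textbook case chosen for illustration and is not part of the lecture; the function names are hypothetical.

```python
from fractions import Fraction
import math

# PMF of a fair six-sided die (illustrative example, not from the lecture).
pmf = {x: Fraction(1, 6) for x in range(1, 7)}

def expectation(pmf):
    """E[X] = sum over all x of x * p(x)."""
    return sum(x * p for x, p in pmf.items())

def variance(pmf):
    """Var(X) = E[(X - E[X])^2]."""
    mu = expectation(pmf)
    return sum((x - mu) ** 2 * p for x, p in pmf.items())

mean = expectation(pmf)   # 7/2
var = variance(pmf)       # 35/12
sd = math.sqrt(var)       # about 1.708
print(mean, var, round(sd, 3))
```

Exact fractions are used so that the weighted averages come out as 7/2 and 35/12 rather than rounded decimals.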

14 / 57
A Visual Illustration

15 / 57
This Week

1. Distribution of Random Variables


2. Population and Sampling
3. Point Estimation
4. Interval Estimation

16 / 57
Population

• Typically, we want to learn about a population of interest
  • The students in this classroom
  • All students at HKU
  • All students in HK
  • All college students worldwide
• Always be clear and explicit about the population of interest
• The validity of a quantitative analysis can only be evaluated if the population of interest is explicit
  • Students in this class may be a fair representation of FOSS, but not of HKU
  • Students at HKU may be reasonably representative of students in HK, but definitely not of students worldwide

17 / 57
Why Sample?

• The population can be so large that it is impossible to survey everyone
• We ask a representative group from the population and get an estimate
• We express our uncertainty around that estimate, since we cannot reach the entire population
• Terminology side note: a sample refers to the fixed group of individuals that we interviewed for our study; an observation refers to a specific individual in our sample

18 / 57
Basic Intuitions behind Inference

• Where does the uncertainty come from?
• It has to do with the discrepancy between our sample and the population, or the discrepancy among different samples!
  • For each sample we take, the answer we get may be slightly different
• So how do we quantify this uncertainty?
  • We report how our estimate differs from sample to sample
  • This is called the sampling distribution, which captures how our estimate differs from sample to sample

19 / 57
Differentiating Three Distributions

It is important to differentiate the following three concepts:

• Population Distribution: the height distribution among all HKU students
  • We cannot observe this directly due to budget constraints
  • We want to learn about this using a sample
• Empirical Distribution: the observed distribution of heights within the sample we collect
  • The information we actually gathered
  • We want to learn about the population distribution using this
• Sampling Distribution: the distribution of average heights from sample to sample
  • Can only be truly observed if we draw repeated samples from the population
  • With some statistical tricks we can learn about it without actually sampling repeatedly
  • This contains the information we need about the uncertainty in our estimate

20 / 57
Population Distribution
10,000 units, mean 50, sd 10

21 / 57
Empirical Distribution
One draw, 30 observations

22 / 57
Sampling Distribution
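The figures for these three slides are not reproduced here. The sketch below simulates the three distributions using the numbers quoted on the slides (10,000 units, mean 50, SD 10, draws of 30 observations). The normal shape of the population, the seed, and the 2,000 repeated draws are assumptions made for illustration.

```python
import random
import statistics

random.seed(2030)  # arbitrary seed so the run is reproducible

# Population distribution: 10,000 units with mean 50 and SD 10 (as on the slide);
# the normal shape is an assumption made for this sketch.
population = [random.gauss(50, 10) for _ in range(10_000)]

# Empirical distribution: one draw of 30 observations.
sample = random.sample(population, 30)

# Sampling distribution: the sample mean across many repeated draws of 30.
sample_means = [statistics.mean(random.sample(population, 30))
                for _ in range(2_000)]

print("population mean, SD:  ", statistics.mean(population), statistics.pstdev(population))
print("one sample mean, SD:  ", statistics.mean(sample), statistics.stdev(sample))
print("mean of sample means: ", statistics.mean(sample_means))
print("SD of sample means:   ", statistics.stdev(sample_means))
```

The sample means should center near 50 but spread much less than the population itself, with an SD close to 10/√30 ≈ 1.83.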

23 / 57
Estimand, Estimator, Estimate

The goal of statistical inference is to learn about the unobserved population distribution of a variable. Throughout, suppose that we are only interested in the mean of a variable.
• Estimand: the parameter that we aim to estimate, denoted with Greek letters (µ, θ, etc.), e.g. the average height of all HKU students
• Estimator: a function of the sample data which we use to learn about the estimand, denoted with a "hat" (µ̂, θ̂, etc.), e.g. the average height of a sample
• Estimate: the particular value of an estimator that is realized in a given sample (e.g. the sample mean computed from our data)

24 / 57
Estimator is a random variable!

An estimator is a random variable

Because sampling is random
For each sample we take, the output is different
Its distribution covers the range of all possible estimates across different samples
Will be made clearer later

Let us regroup:
• We are interested in the average of a population variable, say the average height of HKU students; this is called the estimand
• We have an estimator for the estimand, say the sample average height of 10 people; we collect data and then have an estimate, say 170cm
• We want to express the uncertainty around our estimate, hence the study of the sampling distribution
25 / 57
Overview

26 / 57
This Week

1. Distribution of Random Variables


2. Population and Sampling
3. Point Estimation
4. Interval Estimation

27 / 57
Notation for Sampling Distribution

Suppose we take a simple random sample of size n from the population.

We say that $X_1, X_2, \dots, X_n$ are independently and identically distributed (i.i.d.) draws from a population distribution with mean $E[X_1] = \mu$ and variance $\mathrm{Var}(X_1) = \sigma^2$
This can be expressed as $X_1, X_2, \dots, X_n \sim\ ?(\mu, \sigma^2)$, where "?" stands for an unspecified distribution
Note that for every individual draw (like $X_1$), its distribution is just the distribution of the variable of interest (X)
Point estimation: estimating just one number (in contrast to interval estimation, which tries to find an interval that the number may lie in)

28 / 57
Describing the Sampling Distribution of the Mean

We will focus on inference about the mean (µ), as this is the most common task we will be facing.
With a sample $X_1, X_2, \dots, X_n$, one natural estimator is the sample mean:

$$\bar{X}_n = \frac{1}{n} \sum_{i=1}^{n} X_i$$

We want to understand the distribution of $\bar{X}_n$

Remember $\bar{X}_n$ is a random variable. To describe this sampling distribution, we can start from the most important properties of a random variable
Expectation: $E[\bar{X}_n]$
Variance: $\mathrm{Var}(\bar{X}_n)$

29 / 57
Expectation and Variance of the Sampling Distribution

$$E[\bar{X}_n] = \mu$$
$$\mathrm{Var}(\bar{X}_n) = \frac{\sigma^2}{n}$$
Intuitively: the expectation of $\bar{X}_n$ is our parameter of interest, the population mean; the variance of $\bar{X}_n$ is smaller than the variance of the variable of interest (whenever n > 1)
More intuition: the larger the sample, the more precise the estimate; as the sample size grows without bound, the estimator collapses to a constant
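For readers who want the missing steps, both results follow from the i.i.d. assumption. This is a short sketch of the standard argument, using only linearity of expectation and the fact that variances of independent variables add:

$$
\begin{aligned}
E[\bar{X}_n] &= E\left[\frac{1}{n}\sum_{i=1}^{n} X_i\right] = \frac{1}{n}\sum_{i=1}^{n} E[X_i] = \frac{1}{n} \cdot n\mu = \mu \\
\mathrm{Var}(\bar{X}_n) &= \mathrm{Var}\left(\frac{1}{n}\sum_{i=1}^{n} X_i\right) = \frac{1}{n^2}\sum_{i=1}^{n} \mathrm{Var}(X_i) = \frac{1}{n^2} \cdot n\sigma^2 = \frac{\sigma^2}{n}
\end{aligned}
$$

The second line uses independence: without it, covariance terms between the draws would appear in the sum.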

30 / 57
The Entire Sampling Distribution

We may want to know the entire distribution of $\bar{X}_n$, not just its expectation and variance.
What form do sampling distributions follow?
Central Limit Theorem: when the sample is large enough, the sampling distribution of the mean is approximately normal, regardless of the distribution of the variable of interest (provided that its variance is finite).
If n is large enough, $\bar{X}_n \overset{\text{approx}}{\sim} N\left(\mu, \frac{\sigma^2}{n}\right)$
↑ This is one of the most PROFOUND discoveries in the entire history of science
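One way to see the theorem at work is by simulation. The sketch below draws repeated samples from two clearly non-normal distributions, standardizes each sample mean, and checks how often it lands within ±1.96, which a normal sampling distribution would do about 95 percent of the time. The distributions, sample sizes, seed, and the helper name `share_within_1_96` are all choices made for illustration.

```python
import math
import random

random.seed(5)  # arbitrary seed for reproducibility

def share_within_1_96(draw, mu, sigma, n, reps=4_000):
    """Share of standardized sample means that fall inside (-1.96, 1.96)."""
    inside = 0
    for _ in range(reps):
        xbar = sum(draw() for _ in range(n)) / n
        z = (xbar - mu) / (sigma / math.sqrt(n))
        if -1.96 < z < 1.96:
            inside += 1
    return inside / reps

# Uniform(0, 1): mean 1/2, variance 1/12.
uniform_share = share_within_1_96(random.random, 0.5, math.sqrt(1 / 12), n=30)
# Exponential with rate 1 (right-skewed): mean 1, variance 1.
expo_share = share_within_1_96(lambda: random.expovariate(1.0), 1.0, 1.0, n=50)

print(f"uniform:     {uniform_share:.3f}")
print(f"exponential: {expo_share:.3f}")
```

Both shares should come out close to 0.95 even though neither underlying distribution looks like a bell curve.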

31 / 57
Bernoulli (Coin Flip) Distribution

32 / 57
Poisson (Event Count) Distribution

33 / 57
Uniform Distribution

34 / 57
Properties of Estimators

The CLT, $\bar{X}_n \overset{\text{approx}}{\sim} N\left(\mu, \frac{\sigma^2}{n}\right)$, is a powerful result and will prove useful for a lot of purposes.
Now let us try to answer a question: is the sample mean, $\bar{X}_n$, a good estimator for µ?
To answer it, we first define what makes an estimator good, that is, the properties of estimators

35 / 57
Desirable Properties of Estimators

• Unbiasedness: We’d like an estimator that gets the right answer on average.
• Efficiency: We’d like an estimator that doesn’t change much from sample to sample.
• Consistency: We’d like an estimator that gets closer to the right answer
(probabilistically) as the sample size increases.
• Asymptotic Normality: We’d like an estimator that has a known sampling
distribution (approximately) when the sample size is large.

36 / 57
1. Bias (Not Getting the Right Answer on Average)

Definition of Bias
Bias(µ̂) = E [µ̂] − µ

Bias is the difference between the expectation of our estimator and the true value
It is NOT the difference between one particular estimate and the true value!
An estimator µ̂ is unbiased if Bias(µ̂) = 0

37 / 57
1. Bias (Not Getting the Right Answer on Average)

Which of the following are unbiased for the population mean?

1. $\hat{\mu} = \bar{X}_n$
2. $\hat{\mu} = X_1$
3. $\hat{\mu} = X_1 + 1$
4. $\hat{\mu} = \frac{1}{2}(X_1 + X_2)$

38 / 57
2. Efficiency (does not change much from sample to sample)

How should we choose between unbiased estimators?

We prefer the estimator with the smaller variance
We say that one estimator is more efficient than another if it has a smaller variance

Which estimator is more efficient?

1. $\hat{\mu} = \bar{X}_n$
2. $\hat{\mu} = X_1$
3. $\hat{\mu} = X_1 + 1$
4. $\hat{\mu} = \frac{1}{2}(X_1 + X_2)$
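The bias and efficiency of these four estimators can be explored by simulation. The sketch below repeatedly draws samples and records what each estimator returns, then compares the simulated bias and variance. The normal population with mean 50 and SD 10, the sample size of 30, and the seed are assumptions chosen for illustration.

```python
import random
import statistics

random.seed(38)  # arbitrary seed for reproducibility
MU, SIGMA, N, REPS = 50.0, 10.0, 30, 10_000  # illustrative values

estimators = {
    "sample mean": lambda x: sum(x) / len(x),
    "X1": lambda x: x[0],
    "X1 + 1": lambda x: x[0] + 1,
    "(X1 + X2) / 2": lambda x: (x[0] + x[1]) / 2,
}

draws = {name: [] for name in estimators}
for _ in range(REPS):
    sample = [random.gauss(MU, SIGMA) for _ in range(N)]
    for name, estimator in estimators.items():
        draws[name].append(estimator(sample))

results = {}
for name, values in draws.items():
    bias = statistics.mean(values) - MU   # simulated E[estimator] - mu
    var = statistics.variance(values)     # simulated Var(estimator)
    results[name] = (bias, var)
    print(f"{name:14s} bias ~ {bias:6.2f}   variance ~ {var:7.2f}")
```

Under these assumptions the theoretical variances are σ²/n ≈ 3.33 for the sample mean, σ²/2 = 50 for the two-observation average, and σ² = 100 for both X1 and X1 + 1; only X1 + 1 has a nonzero bias (equal to 1).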

39 / 57
3. Consistency (closer to truth as sample size increases)

Unbiasedness and efficiency are finite-sample properties of estimators, which hold regardless of sample size
Consistency and Asymptotic Normality (CAN) are asymptotic properties: they describe the behavior of the sampling distribution as the sample size grows to infinity
Definition of Consistency
An estimator $\hat{\theta}_n$ is consistent if $\hat{\theta}_n$ converges in probability to θ as the sample size n grows to infinity

Often seen as a minimal requirement for estimators

A consistent estimator may still perform badly in small samples
How to check it (a sufficient condition): $E[\hat{\theta}_n] \to \theta$ and $\mathrm{Var}(\hat{\theta}_n) \to 0$ as $n \to \infty$
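As a worked check, the sample mean satisfies both conditions, since $E[\bar{X}_n] = \mu$ for every n and $\mathrm{Var}(\bar{X}_n) = \sigma^2/n \to 0$. The link to convergence in probability can be sketched with Chebyshev's inequality (a standard result that the slides do not state): for any fixed $\varepsilon > 0$,

$$
P\left(|\bar{X}_n - \mu| \ge \varepsilon\right) \le \frac{\mathrm{Var}(\bar{X}_n)}{\varepsilon^2} = \frac{\sigma^2}{n\,\varepsilon^2} \longrightarrow 0 \quad \text{as } n \to \infty
$$

So the probability that the sample mean misses µ by more than any fixed margin shrinks to zero, which is what consistency requires.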

40 / 57
4. Asymptotic Normality (known asymptotic distribution)

We are also interested in the shape of the sampling distribution of an estimator as the
sample size increases.
The sampling distributions of many estimators converge towards a normal distribution.
For example, we’ve seen that the sampling distribution of the sample mean converges to
the normal distribution.
Way to calculate asymptotic distribution for more complex estimators: the Delta method

41 / 57
This Week

1. Distribution of Random Variables


2. Population and Sampling
3. Point Estimation
4. Interval Estimation

42 / 57
What is Interval Estimation?

• A point estimator θ̂ estimates a population parameter θ with a single number.
• However, because we are dealing with a random sample, we might also want to report the uncertainty in our estimate.
• An interval estimator for θ takes the following form:

$$[\hat{\theta}_{\text{lower}}, \hat{\theta}_{\text{upper}}]$$

where $\hat{\theta}_{\text{lower}}$ and $\hat{\theta}_{\text{upper}}$ are random quantities that vary from sample to sample.
• The interval represents the range of values within which we estimate the true value of θ to fall
• An interval estimate is a realized value of an interval estimator. The estimated interval typically forms what we call a confidence interval.

43 / 57
Normal Population with Known σ 2

Suppose we have an i.i.d. random sample of size n, $X_1, X_2, \dots, X_n$, from $X \sim N(\mu, 1)$
From the CLT, we know that the sampling distribution of the sample mean is (exactly so here, because the population itself is normal):

$$\bar{X}_n \sim N(\mu, \sigma^2/n) = N(\mu, 1/n)$$

Therefore, the standardized sample average is distributed as follows:

$$\frac{\bar{X}_n - \mu}{1/\sqrt{n}} \sim N(0, 1)$$

This implies

$$P\left(-1.96 < \frac{\bar{X}_n - \mu}{1/\sqrt{n}} < 1.96\right) = 0.95$$

Why?
44 / 57
CDF of the standard normal distribution
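The figure is not reproduced here, but the 0.95 can be read off the standard normal CDF numerically. The sketch below uses `NormalDist` from Python's standard `statistics` module; the variable names are illustrative.

```python
from statistics import NormalDist

Z = NormalDist(mu=0, sigma=1)  # standard normal

upper = Z.cdf(1.96)     # P(Z <= 1.96), about 0.975
lower = Z.cdf(-1.96)    # P(Z <= -1.96), about 0.025
middle = upper - lower  # P(-1.96 < Z < 1.96), about 0.95

print(round(upper, 4), round(lower, 4), round(middle, 4))
```

The probability between -1.96 and 1.96 is the difference of the two CDF values, which leaves about 2.5 percent in each tail.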

45 / 57
Constructing a Confidence Interval with Known σ 2

So we know that:

$$P\left(-1.96 < \frac{\bar{X}_n - \mu}{1/\sqrt{n}} < 1.96\right) = 0.95$$

Rearranging yields:

$$P\left(\bar{X}_n - 1.96/\sqrt{n} < \mu < \bar{X}_n + 1.96/\sqrt{n}\right) = 0.95$$

This implies that the following interval estimator

$$\left[\bar{X}_n - 1.96/\sqrt{n},\ \bar{X}_n + 1.96/\sqrt{n}\right]$$

contains the true population mean µ with probability 0.95
We call this interval estimator a 95% confidence interval for µ
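The rearrangement can be spelled out step by step. Each line below describes the same event, so the probability stays at 0.95 throughout:

$$
\begin{aligned}
& -1.96 < \frac{\bar{X}_n - \mu}{1/\sqrt{n}} < 1.96 \\
\iff\ & -\frac{1.96}{\sqrt{n}} < \bar{X}_n - \mu < \frac{1.96}{\sqrt{n}} && \text{(multiply by } 1/\sqrt{n} > 0\text{)} \\
\iff\ & -\bar{X}_n - \frac{1.96}{\sqrt{n}} < -\mu < -\bar{X}_n + \frac{1.96}{\sqrt{n}} && \text{(subtract } \bar{X}_n\text{)} \\
\iff\ & \bar{X}_n - \frac{1.96}{\sqrt{n}} < \mu < \bar{X}_n + \frac{1.96}{\sqrt{n}} && \text{(multiply by } -1 \text{, which flips the inequalities)}
\end{aligned}
$$

Note that µ is fixed; the randomness in the final statement sits in the two endpoints, which depend on $\bar{X}_n$.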

46 / 57
Normal Population with Unknown σ 2

In practice, it is rarely the case that we somehow know the true value of $\sigma^2$
Suppose now that we have an i.i.d. sample of size n, $X_1, X_2, \dots, X_n$, from $X \sim N(\mu, \sigma^2)$, where $\sigma^2$ is unknown. Then, as before, from the CLT:

$$\bar{X}_n \sim N(\mu, \sigma^2/n)$$

$$\frac{\bar{X}_n - \mu}{\sigma/\sqrt{n}} \sim N(0, 1)$$

With σ known, the same steps as before would give the 95 percent interval:

$$\left[\bar{X}_n - 1.96\sigma/\sqrt{n},\ \bar{X}_n + 1.96\sigma/\sqrt{n}\right]$$

But now we cannot use this directly, because we do not know σ
Therefore, we need an estimator for σ, $\hat{\sigma}$.
47 / 57
Estimators for Population Variance

Population variance estimator:

$$S_n^2 = \frac{1}{n-1} \sum_{i=1}^{n} (X_i - \bar{X}_n)^2$$

Through some algebra, we can show:

Unbiasedness: $E[S_n^2] = \sigma^2$
Consistency: $S_n^2 \to \sigma^2$ (in probability, as n grows)
This estimator, $S_n^2$, is commonly known as the sample variance.
So now we can plug this estimator into our 95 percent confidence interval:

$$\left[\bar{X}_n - 1.96 S_n/\sqrt{n},\ \bar{X}_n + 1.96 S_n/\sqrt{n}\right]$$
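The plug-in interval can be computed in a few lines. The sketch below is a minimal illustration: the ten heights are invented, and `confidence_interval_95` is a hypothetical helper name. `statistics.stdev` divides by n - 1, so it matches the square root of $S_n^2$ as defined above.

```python
import math
import statistics

def confidence_interval_95(data):
    """Normal-approximation 95% CI for the mean: xbar +/- 1.96 * S_n / sqrt(n)."""
    n = len(data)
    xbar = statistics.mean(data)
    s_n = statistics.stdev(data)   # square root of S_n^2 (divides by n - 1)
    margin = 1.96 * s_n / math.sqrt(n)
    return xbar - margin, xbar + margin

# Hypothetical heights in cm, made up for illustration.
heights = [168, 172, 165, 180, 175, 170, 169, 174, 171, 166]
low, high = confidence_interval_95(heights)
print(f"mean = {statistics.mean(heights):.1f}, 95% CI = [{low:.2f}, {high:.2f}]")
```

For these made-up numbers the sample mean is 171.0 and the interval is roughly [168.21, 173.79].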

48 / 57
Let Us Regroup

With a sample of size n of a random variable of interest X with mean µ and variance $\sigma^2$, from the CLT we know that $\bar{X}_n \overset{\text{approx}}{\sim} N(\mu, \sigma^2/n)$
By checking the CDF of the normal distribution, we know that

$$P\left(-1.96 < \frac{\bar{X}_n - \mu}{\sigma/\sqrt{n}} < 1.96\right) = 0.95$$

Rearranging this gives $P(\bar{X}_n - 1.96\sigma/\sqrt{n} < \mu < \bar{X}_n + 1.96\sigma/\sqrt{n}) = 0.95$
We call this interval the 95 percent interval, $[\bar{X}_n - 1.96\sigma/\sqrt{n},\ \bar{X}_n + 1.96\sigma/\sqrt{n}]$
Since we do not know σ, we replace it with our estimate of σ, $S_n$. This gives the interval estimator: $[\bar{X}_n - 1.96 S_n/\sqrt{n},\ \bar{X}_n + 1.96 S_n/\sqrt{n}]$
Both $\bar{X}_n$ and $S_n$ are random: they vary across samples
Once the data are observed, nothing is random

49 / 57
Interpreting the Confidence Interval

• Does a 95 percent confidence interval mean that there is a 95 percent probability that the truth is in our estimated interval?
• NO!
• Once we get an interval from our specific sample, the truth is either in it or not, so the probability is either 1 or 0.
• However, if we were to sample repeatedly and calculate a confidence interval each time, 95 percent of those confidence intervals would cover the truth
• The 95 percent expresses confidence in our calculation process, not confidence in our particular estimate
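The repeated-sampling interpretation can be checked by simulation. The sketch below draws many samples from a population whose mean is known, builds a 95 percent interval from each, and counts how many intervals cover the true mean. The normal population (mean 50, SD 10), the sample size of 100, the 2,000 repetitions, and the seed are assumptions for illustration.

```python
import math
import random
import statistics

random.seed(50)  # arbitrary seed for reproducibility
MU, SIGMA, N, REPS = 50.0, 10.0, 100, 2_000  # illustrative values

covered = 0
for _ in range(REPS):
    sample = [random.gauss(MU, SIGMA) for _ in range(N)]
    xbar = statistics.mean(sample)
    margin = 1.96 * statistics.stdev(sample) / math.sqrt(N)
    # Each individual interval either contains MU or it does not.
    if xbar - margin < MU < xbar + margin:
        covered += 1

coverage = covered / REPS
print(f"share of intervals covering the true mean: {coverage:.3f}")
```

The share should land close to 0.95. It describes the procedure across many samples, not any single interval.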

50 / 57
Interpreting the Confidence Interval

51 / 57
Is 95 percent all there is?

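Our 95 percent CI is: $\bar{X}_n \pm 1.96 S_n/\sqrt{n}$

Remember where 1.96 comes from: $P\left(-1.96 < \frac{\bar{X}_n - \mu}{\sigma/\sqrt{n}} < 1.96\right) = 0.95$
What if we want a different percentage, α?

$$P\left(-z < \frac{\bar{X}_n - \mu}{\sigma/\sqrt{n}} < z\right) = \alpha\%$$

How do we find z?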

52 / 57
Normal PDF

We know that z comes from the probability in the tails of the standard normal distribution.
When α = 95, we want to pick z so that 2.5% of the probability is in each tail
This gives us a value of 1.96 for z

53 / 57
Normal PDF

What if we want a 50 percent confidence interval?
We want to pick z so that 25 percent of the probability is in each tail
This gives us a value of 0.67 for z
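The same tail logic gives z for any confidence level. The sketch below uses the inverse CDF of the standard normal from Python's `statistics` module. `critical_value` is a hypothetical helper name, and it takes the level as a proportion (0.95) rather than as a percentage (95) as on the slides.

```python
from statistics import NormalDist

def critical_value(level):
    """z such that P(-z < Z < z) = level for a standard normal Z (0 < level < 1)."""
    tail = (1 - level) / 2   # probability left in each tail
    return NormalDist().inv_cdf(1 - tail)

for level in (0.50, 0.90, 0.95, 0.99):
    print(f"{level:.0%} interval: z = {critical_value(level):.3f}")
```

This reproduces the 1.96 used for 95 percent and the 0.67 used for 50 percent, and shows that higher confidence levels need a larger z and therefore a wider interval.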

54 / 57
The Problem with Small Sample

The CLT only kicks in when the sample is large

If the sample is small, we cannot credibly say that the sampling distribution is normal
Also, when the sample size is small, our estimate of σ, $S_n$, can be quite off
Solution: the t distribution
In small samples, it is more credible than the normal distribution; in large samples, it converges to the normal distribution
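To make the small-sample problem concrete, the sketch below repeats the coverage experiment with samples of only 5 observations and compares the normal critical value 1.96 with the t critical value for 4 degrees of freedom. The value 2.776 is taken from a standard t table. The normal population, the repetitions, and the seed are assumptions for illustration.

```python
import math
import random
import statistics

random.seed(55)  # arbitrary seed for reproducibility
MU, SIGMA, N, REPS = 50.0, 10.0, 5, 4_000  # deliberately tiny samples
Z_CRIT = 1.96    # normal critical value
T_CRIT = 2.776   # 97.5th percentile of t with N - 1 = 4 degrees of freedom (table value)

hits = {"normal": 0, "t": 0}
for _ in range(REPS):
    sample = [random.gauss(MU, SIGMA) for _ in range(N)]
    xbar = statistics.mean(sample)
    se = statistics.stdev(sample) / math.sqrt(N)
    for name, crit in (("normal", Z_CRIT), ("t", T_CRIT)):
        # The interval xbar +/- crit * se covers MU exactly when this holds.
        if abs(xbar - MU) < crit * se:
            hits[name] += 1

coverage = {name: count / REPS for name, count in hits.items()}
print(coverage)
```

With n = 5 the interval built with 1.96 should cover the true mean noticeably less often than 95 percent of the time, while the wider t-based interval should come close to 95 percent.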

55 / 57
The t distribution

56 / 57
This Week

1. Distributions
2. Population and Sampling
3. Point Estimation
4. Interval Estimation

57 / 57
