Point Estimation in Statistics

Chapter 2 of MATH 3423 focuses on point estimation, detailing methods for estimating unknown parameters using statistics, particularly through Maximum Likelihood Estimation (MLE). It introduces the concept of likelihood and explains how to find MLE, emphasizing its desirable properties such as being asymptotically unbiased and normally distributed. The chapter also discusses the importance of understanding the parameter space and provides examples of finding MLE in different statistical contexts.

Uploaded by

Zexi Tang

MATH 3423 Statistical Inference | Dr. CW YU

Chapter 2: Point Estimation


There are two parts for this chapter:
Part I: Finding estimators in general cases and Evaluating Estimators, and
Part II: Best Unbiased Estimator (UMVUE).

1 PART I: INTRODUCTION
For part I, we will first study how to estimate a parameter(s) in general cases by a methodical
estimation principle and then discuss its performance.

Point Estimation:
The idea of point estimation is so simple that we just use a statistic 𝑇(𝒙) to estimate the
unknown parameter, say 𝜃, where 𝒙 = (𝑥1 , … , 𝑥𝑛 )′ is a realization of the random sample 𝑿 =
(𝑋1 , … , 𝑋𝑛 )′ or {𝑋𝑖 : 𝑖 = 1, … , 𝑛} of size 𝑛 from a population with a pdf 𝑓(⋅ |𝜃) or pmf 𝑝(⋅ |𝜃)
and 𝜃 is in the parameter space Θ.

In some cases, there is an obvious or natural point estimator of an unknown parameter. For
instance, the sample mean of a random sample is a natural point estimator of the population mean.
However, when we leave such simple cases, we need a more methodical estimation technique
that will at least give us a reasonable candidate for consideration. In Section 2, we will study
one of the most commonly used estimation approaches in statistics: Maximum Likelihood Estimation.

Remark:

[Parameter of interest] Most often, the parameter(s) of interest to be estimated (called the
estimand) is a function of the unknown distribution parameter(s) 𝜃, say 𝑔(𝜃). For instance, we
may be interested in 𝜇², instead of 𝜇, or 𝜎/𝜇, instead of 𝜇 or 𝜎 only, etc.

[Estimator? Estimate?] An estimator is a function of the random sample 𝑿, while an estimate is
the realized value of the estimator that is obtained when a sample of data is actually taken.

[Why is it called ‘point’ estimation?] Note that the statistic 𝑇 is indeed a ‘point’ in 𝑅ᵏ, where
𝑘 ≥ 1 represents the number of unknown parameters to be estimated. We use it to estimate
𝑔(𝜃), which is also a ‘point’ in 𝑅ᵏ. That is why 𝑇(𝒙) is called a point estimate of 𝑔(𝜃).

Caution: We ONLY estimate an UNKNOWN parameter(s). For any KNOWN parameter, there is
no point in estimating it!!!


2 GENERAL METHOD OF FINDING ESTIMATORS


In practice, there are many estimation techniques that can be used to estimate an unknown
parameter(s), but we only detail one method --- maximum likelihood estimation, mainly
because it is the most popular in statistics and data science, and has some desirable properties,
such as being asymptotically unbiased and asymptotically normal.

2.1 MAXIMUM LIKELIHOOD ESTIMATION (MLE)


The method of maximum likelihood was popularized in mathematical statistics by Ronald
Aylmer Fisher in 1922. Nowadays, a lot of research still studies its properties.

Ronald Aylmer Fisher (1890-1962)

Fisher is one of the most prominent statisticians of the 19th-20th century. Other examples of
his contributions are sufficiency, consistency, efficiency, Fisher information, genetical
statistics, etc.

More details about him can be found in the article “How Ronald Fisher became a mathematical
statistician” by Stephen M. Stigler.

Before showing how to find MLE, let’s first understand what the ‘likelihood’ is.

WHAT IS ‘LIKELIHOOD’?
The likelihood function quantifies how likely our observed data are to occur.
Definition: Consider a r.s. of size 𝑛 from a population with a pdf 𝑓(⋅ |𝜃) or pmf 𝑝(⋅ |𝜃). After
collection, we have the realization 𝒙 = (𝑥1 , … , 𝑥𝑛 )′ . The likelihood function is defined by
$$L(\theta) = L(\theta_1, \theta_2, \ldots, \theta_k \mid \boldsymbol{x}) = \prod_{i=1}^{n} f(x_i \mid \theta)$$
for continuous cases and
$$L(\theta) = \prod_{i=1}^{n} p(x_i \mid \theta)$$
for discrete cases.
Remark that 𝐿(𝜃) is a function of 𝜃, with 𝒙 held fixed.
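To make the definition concrete, here is a minimal Python sketch (not part of the original notes) that evaluates 𝐿(𝜃) and log 𝐿(𝜃) for a small made-up sample, assuming the population is 𝑁(𝜃, 1). The data values and function names are hypothetical and chosen only for illustration.

```python
import math

def normal_pdf(x, theta):
    """Density of N(theta, 1) evaluated at x."""
    return math.exp(-0.5 * (x - theta) ** 2) / math.sqrt(2.0 * math.pi)

def likelihood(theta, data):
    """L(theta): product of f(x_i | theta), with the data held fixed."""
    value = 1.0
    for x in data:
        value *= normal_pdf(x, theta)
    return value

def log_likelihood(theta, data):
    """l(theta) = log L(theta): sum of log f(x_i | theta)."""
    return sum(math.log(normal_pdf(x, theta)) for x in data)

data = [1.2, 0.7, 2.1, 1.5]  # hypothetical realization x; its mean is 1.375

# The same data give different likelihood values for different theta.
for theta in (0.0, 1.0, 1.375, 2.0):
    print(theta, likelihood(theta, data), log_likelihood(theta, data))
```

Among the four candidate values, the likelihood is largest at 𝜃 = 1.375, the sample mean of the made-up data.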

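[Handwritten note] Statistical belief: the data we actually observed, 𝑥1 , … , 𝑥𝑛 , had the
highest probability of occurring.

Discrete cases: the likelihood is the joint pmf, i.e. the joint probability of getting the data,
$$P(X_1 = x_1, \ldots, X_n = x_n \mid \theta) = \prod_{i=1}^{n} p(x_i \mid \theta),$$
where 𝑝 is the common pmf of the 𝑛 copies of 𝑋.

Continuous cases: the probability of getting exactly the observed values is 0, so it cannot be
used as the likelihood. Instead, we can consider
$$P(x_i \le X_i \le x_i + dx_i,\ i = 1, \ldots, n \mid \theta) = \prod_{i=1}^{n} P(x_i \le X_i \le x_i + dx_i \mid \theta) \approx \prod_{i=1}^{n} f(x_i \mid \theta)\, dx_i,$$
where the product form holds because the 𝑋𝑖 are independent and identically distributed. Since
the 𝑑𝑥𝑖 do not depend on 𝜃, the likelihood is taken to be the product of the 𝑓(𝑥𝑖 |𝜃).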

The basic principle of the maximum likelihood estimation:


It comes from the statistical belief that the particular set of data we observed is the one that
had the highest chance of occurring. For each realization 𝒙, we look for a value of 𝜃, denoted
by 𝜃̂, in Θ at which 𝐿(𝜃) attains its maximum. In other words, this value --- the maximum
likelihood estimate (MLE) --- makes our observed data most likely to occur.

More formally, we have the following definition of MLE.

Definition (MLE): The maximum likelihood estimate is $\hat{\theta} = \arg\max_{\theta \in \Theta} L(\theta)$, which means
$$L(\hat{\theta}) = \max_{\theta \in \Theta} L(\theta),$$
where $\max_{\theta \in \Theta}$ means the maximum over the parameter space Θ.

We also use the abbreviation MLE for the maximum likelihood estimator when we study the
properties of this “maximum likelihood” estimation method.
In some cases, especially when differentiation is used, it is easier to work with the natural
logarithm of 𝐿(𝜃), i.e. 𝑙(𝜃) = log 𝐿(𝜃), called the log likelihood, than it is to work with 𝐿(𝜃)
directly. This is possible because the log function is strictly increasing, which implies that
𝐿(𝜃) and 𝑙(𝜃) attain their maxima at the same values of 𝜃.

Remark:
MLE may not exist or may not be unique in Θ.

The MLE is often biased, and most often there is no closed form of the MLE.

[Invariance property] If 𝜃̂𝑖 is the MLE for 𝜃𝑖 for 𝑖 = 1, … , 𝑘, then ℎ(𝜃̂1 , 𝜃̂2 , … , 𝜃̂𝑘 ) is the
MLE for ℎ(𝜃1 , 𝜃2 , … , 𝜃𝑘 ), where ℎ is a known function.

For 𝜃 ∈ 𝑅ᵏ, the MLE 𝜃̂𝑛 is consistent, asymptotically unbiased, asymptotically efficient and
asymptotically normally distributed. To be more precise, under regularity assumptions,
we have
$$\sqrt{n}\,(\hat{\theta}_n - \theta) \xrightarrow{d} N_k\big(\mathbf{0},\, I_X^{-1}(\theta)\big),$$
where 𝐼𝑋 (𝜃) is known as the Fisher information matrix (more details about this matrix will
be discussed in part II). It is a 𝑘 × 𝑘 matrix with the (𝑖, 𝑗)th entry defined as
$$E\left[\left(\frac{\partial}{\partial \theta_i} \log f_X(X \mid \theta)\right)\left(\frac{\partial}{\partial \theta_j} \log f_X(X \mid \theta)\right)\right]$$
for 𝑖 = 1, … , 𝑘 and 𝑗 = 1, … , 𝑘.
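As an illustration of the Fisher information entry above with 𝑘 = 1, the following Python sketch (an added example, not from the notes) computes 𝐸[(∂/∂𝜆 log 𝑝(𝑋|𝜆))²] for a Poisson(𝜆) population by summing the series numerically. The choice of the Poisson model, the truncation point `upper`, and all function names are assumptions made for this example.

```python
import math

def poisson_pmf(x, lam):
    """p(x | lambda) for Poisson(lambda), computed on the log scale."""
    return math.exp(-lam + x * math.log(lam) - math.lgamma(x + 1))

def score(x, lam):
    """d/d(lambda) of log p(x | lambda), which equals x / lambda - 1."""
    return x / lam - 1.0

def fisher_information(lam, upper=200):
    """E[score^2], with the infinite sum truncated after `upper` terms."""
    return sum(score(x, lam) ** 2 * poisson_pmf(x, lam) for x in range(upper))

lam = 2.5
info = fisher_information(lam)
print(info, 1.0 / lam)  # the two values should agree closely
```

For this model the information works out to 1/𝜆, so the limiting variance 𝐼⁻¹(𝜆) = 𝜆 matches the variance of √𝑛(𝑥̅ − 𝜆), where 𝑥̅ is the Poisson MLE.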


There are three standard approaches to finding an MLE. Our job is to find a global maximum!!!
(i) If the parameter space Θ contains finitely many points, then an MLE can always be
obtained by simply comparing the finitely many values of (log) 𝐿(𝜃), for all 𝜃 ∈ Θ.

(ii) If 𝐿(𝜃) is differentiable on the interior of Θ, then one possible way of finding an MLE
is to consider the values of 𝜃 = (𝜃1 , 𝜃2 , … , 𝜃𝑘 )′ in the interior that solve the
first-order (likelihood or log likelihood) equations
$$\frac{\partial}{\partial \theta_i} L(\theta) = 0 \quad \text{or} \quad \frac{\partial}{\partial \theta_i} l(\theta) = 0, \qquad \text{for } i = 1, \ldots, k.$$
However, this is just a necessary condition for a maximum (or minimum), not a
sufficient condition. To be more precise, the solutions to the above equations are
just the critical points, which may or may not be extrema. Furthermore, the zeros of
the first derivative only locate the critical points in the interior of the domain of
(log) 𝐿(𝜃). If the maximum occurs on the boundary, the first derivative may not be
zero there. Thus, the boundary must be checked separately for an MLE.
[A special case for one-parameter cases] When 𝑘 = 1, there is a case in which we can
get a global maximum easily. If there is a unique critical point and (log) 𝐿(𝜃) has a
negative second derivative at it, then it must be a global maximum. Note that in
this case, we do not have to check any boundary point!!

Example: Consider a random sample of size 𝑛 from 𝑁(𝜃, 1). Then,
$$L(\theta) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi}}\, e^{-\frac{(x_i - \theta)^2}{2}} = \frac{1}{(2\pi)^{n/2}}\, e^{-\frac{\sum (x_i - \theta)^2}{2}}.$$
Setting the first derivative of log 𝐿(𝜃) to 0 gives
$$0 = \frac{d}{d\theta} l(\theta) = \frac{d}{d\theta}\left[-\frac{n}{2} \log(2\pi)\right] + \frac{d}{d\theta}\left[-\frac{\sum (x_i - \theta)^2}{2}\right] = \sum (x_i - \theta),$$
which yields the solution 𝜃̂ = 𝑥̅ . To verify that it is, in fact, a global
maximum of log 𝐿(𝜃) (or 𝐿(𝜃)), we first note that it is the unique solution
to the first-order equation. Second, we can check that
$$l''(\hat{\theta}) = \frac{d^2}{d\theta^2} l(\theta)\Big|_{\theta = \bar{x}} = -n < 0.$$
Therefore, 𝜃̂ = 𝑥̅ is a global maximum --- MLE.
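The conclusion of this example can be checked numerically. The Python sketch below (an added illustration with made-up data) evaluates the 𝑁(𝜃, 1) log likelihood on a fine grid of 𝜃 values and compares the grid maximizer with the sample mean. The grid range and spacing are arbitrary assumptions.

```python
def log_likelihood(theta, data):
    """l(theta) for N(theta, 1), dropping the constant -(n/2) log(2*pi)."""
    return -0.5 * sum((x - theta) ** 2 for x in data)

data = [2.3, -0.4, 1.1, 0.8, 1.7]  # hypothetical sample
x_bar = sum(data) / len(data)

# Brute-force search: evaluate l(theta) on a grid from -5 to 5 in steps of 0.001.
grid = [i / 1000.0 for i in range(-5000, 5001)]
theta_grid = max(grid, key=lambda t: log_likelihood(t, data))

print(x_bar, theta_grid)  # the grid maximizer should sit next to the sample mean
```

A grid search only finds the best grid point, so agreement is up to the grid spacing.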

[Handwritten note] Note that the parameter space is the whole real line, Θ = (−∞, ∞), and 𝑥̅ can
be any real value in it.

(iii) Another way to find an MLE is to abandon differentiation and proceed with a direct
maximization. One general technique is to find a global upper bound on (log) 𝐿(𝜃)
and then establish that there is a unique point at which the upper bound is
attained.

Example (cont’): Instead of using calculus, we can also show algebraically that 𝜃̂ = 𝑥̅ is the
MLE. Note that ∑(𝑥𝑖 − 𝜃)² ≥ ∑(𝑥𝑖 − 𝑥̅ )² for any 𝜃, where
they are equal if and only if 𝜃 = 𝑥̅ . Thus, for any 𝜃 ∈ Θ,
$$L(\theta) \le L(\bar{x})$$
with equality if and only if 𝜃 = 𝑥̅ . Hence, the MLE for 𝜃 is 𝑥̅ .
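The inequality used in this example follows from a standard decomposition of the sum of squares, written out here as a worked step (added for completeness; the cross term vanishes because the deviations from the sample mean sum to zero):

```latex
\begin{aligned}
\sum_{i=1}^{n} (x_i - \theta)^2
  &= \sum_{i=1}^{n} \big[(x_i - \bar{x}) + (\bar{x} - \theta)\big]^2 \\
  &= \sum_{i=1}^{n} (x_i - \bar{x})^2
     + 2(\bar{x} - \theta) \sum_{i=1}^{n} (x_i - \bar{x})
     + n(\bar{x} - \theta)^2 \\
  &= \sum_{i=1}^{n} (x_i - \bar{x})^2 + n(\bar{x} - \theta)^2
  \;\ge\; \sum_{i=1}^{n} (x_i - \bar{x})^2 ,
\end{aligned}
```

with equality if and only if $\theta = \bar{x}$. Since $L(\theta)$ is a decreasing function of $\sum (x_i - \theta)^2$, this gives $L(\theta) \le L(\bar{x})$.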

Remark that the global-maximum problem in the above case can be solved in a wide range of
situations when some regularity conditions hold (see MATH 5423 Advanced Statistical Inference).
More details can be found in the book “Theory of Point Estimation” written by
E. L. Lehmann and George Casella.

Examples for finding MLE

1. Consider a r.s. of size 𝑛 from 𝑁(𝜇0 , 𝜎²), where 𝜎² ∈ (0, ∞) is unknown and 𝜇0 is known.
Use the MLE to estimate 𝜎².

2. Consider a r.s. with size 𝑛 of 𝑋 ∼ 𝑃𝑜𝑖𝑠𝑠𝑜𝑛(𝜆), where 𝜆 ∈ (0, ∞). Find the MLE of 𝜆.

3. Consider a r.s. with size 𝑛 of 𝑋 ∼ 𝐵𝑖𝑛𝑜𝑚𝑖𝑎𝑙(1, 𝜃), where 𝜃 ∈ [0,1]. Find the MLE of 𝜃,
and then get the MLE when 𝑛 = 10 and $\sum_{i=1}^{n} x_i = 4$.

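4. Consider a r.s. of size 𝑛 of 𝑋 from 𝐺𝑎𝑚𝑚𝑎(𝛼, 𝛽), where 0 < 𝛼 < ∞ and 0 < 𝛽 < ∞.
Find MLE when
i. 𝛽 is unknown but 𝛼 is known, say 𝛼0 .
ii. 𝛼 is unknown but 𝛽 is known, say 𝛽0 .
iii. Both are unknown.
iv. Both are unknown, but the mean is known to be 𝐾0 .

[Handwritten solutions to Examples 1 to 3]

Example 1. Since 𝜇0 is known, we do not need to estimate it. With 𝑥1 , … , 𝑥𝑛 the data from
𝑁(𝜇0 , 𝜎²),
$$L(\sigma^2) = \prod_{i=1}^{n} f(x_i \mid \sigma^2) = (2\pi\sigma^2)^{-n/2} \exp\left(-\frac{\sum (x_i - \mu_0)^2}{2\sigma^2}\right).$$
Taking logs,
$$l(\sigma^2) = -\frac{n}{2}\ln(2\pi) - \frac{n}{2}\ln \sigma^2 - \frac{\sum (x_i - \mu_0)^2}{2\sigma^2},$$
and taking the first derivative with respect to 𝜎²,
$$0 = l'(\sigma^2) = -\frac{n}{2\sigma^2} + \frac{\sum (x_i - \mu_0)^2}{2\sigma^4},$$
which has the unique solution $\hat{\sigma}^2 = \frac{1}{n}\sum (x_i - \mu_0)^2$, with a negative second
derivative there. Note that 𝜎̂² = 0 only if all 𝑥𝑖 are equal to 𝜇0 , which has zero probability,
so 𝜎̂² lies in the parameter space (0, ∞).

Example 2.
$$L(\lambda) = \prod_{i=1}^{n} \frac{e^{-\lambda}\lambda^{x_i}}{x_i!} = \frac{1}{\prod x_i!}\, e^{-n\lambda} \lambda^{\sum x_i},$$
where $1/\prod x_i!$ is not a function of 𝜆. Taking logs and the first derivative,
$$l(\lambda) = -n\lambda + \sum x_i \ln \lambda - \sum \ln(x_i!), \qquad 0 = l'(\lambda) = -n + \frac{\sum x_i}{\lambda},$$
which gives the unique critical point 𝜆̂ = 𝑥̅ , with $l''(\lambda) = -\sum x_i / \lambda^2 < 0$ when ∑𝑥𝑖 > 0.
So the MLE exists and equals 𝑥̅ when not all 𝑥𝑖 are 0. When all observations are zero (which
has positive probability), 𝑙(𝜆) = −𝑛𝜆 is strictly decreasing and its supremum is approached
as 𝜆 → 0. The value 𝜆̂ = 0 could be used, but it is NOT reasonable because 𝜆 > 0; the MLE
does NOT exist in (0, ∞) in this case.

Example 3.
$$L(\theta) = \prod_{i=1}^{n} \theta^{x_i}(1-\theta)^{1-x_i} = \theta^{\sum x_i}(1-\theta)^{n - \sum x_i}, \qquad l(\theta) = \sum x_i \ln\theta + \Big(n - \sum x_i\Big)\ln(1-\theta).$$
(i) If ∑𝑥𝑖 = 0, then 𝑙(𝜃) = 𝑛 ln(1 − 𝜃) is decreasing, so the MLE is 0 = 𝑥̅ .
(ii) If ∑𝑥𝑖 = 𝑛, then 𝑙(𝜃) = 𝑛 ln 𝜃 is increasing, so the MLE is 1 = 𝑥̅ .
(iii) Otherwise, 𝑙′(𝜃) = 0 has the unique solution 𝜃 = 𝑥̅ , with a negative second derivative.
Thus, the MLE for 𝜃 in [0,1] is 𝑥̅ . When 𝑛 = 10 and ∑𝑥𝑖 = 4, the MLE is 4/10 = 0.4.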

5. Consider a r.s. of size 𝑛 from 𝑁(𝜇, 𝜎 2 ), where 𝜎 2 ∈ (0, ∞) is unknown and 𝜇 ∈ (−∞, ∞)
is unknown. Find the MLE of 𝜃 = (𝜇, 𝜎 2 )′ .

6. Suppose 𝑋 is from 𝑈[0, 𝜃], where 0 < 𝜃 < ∞. Find the MLE of 𝜃 if a r.s. with size 𝑛 of 𝑋
is considered.

7. Suppose 𝑋 is from 𝑈[𝜃 − 1, 𝜃 + 1], where −∞ < 𝜃 < ∞. Find the MLE of 𝜃 if a r.s. with
size 𝑛 of 𝑋 is considered.

8. Consider a r.s. of size 𝑛 from 𝑁(𝜇, 1), where 0 ≤ 𝜇 < ∞ is unknown. Find the MLE of 𝜇.
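For the examples whose MLE has a closed form, the estimates can be computed directly. The Python sketch below is an added illustration (function names are hypothetical) and uses the standard closed-form answers for Examples 1, 2, 3, 6 and 8. In particular, the answer max(0, 𝑥̅ ) for Example 8 is the usual boundary-aware result and is stated here as an assumption rather than taken from the notes.

```python
def mle_normal_variance(data, mu0):
    """Example 1: N(mu0, sigma^2), mu0 known -> mean of (x_i - mu0)^2."""
    return sum((x - mu0) ** 2 for x in data) / len(data)

def mle_poisson(data):
    """Example 2: sample mean; no MLE exists in (0, inf) if every x_i is 0."""
    x_bar = sum(data) / len(data)
    if x_bar == 0:
        raise ValueError("all observations are 0: no MLE exists in (0, inf)")
    return x_bar

def mle_bernoulli(data):
    """Example 3: Binomial(1, theta) with theta in [0, 1] -> sample mean."""
    return sum(data) / len(data)

def mle_uniform_upper(data):
    """Example 6: U[0, theta] -> largest observation x_(n)."""
    return max(data)

def mle_normal_mean_nonnegative(data):
    """Example 8: N(mu, 1) with mu >= 0 -> x-bar, or the boundary 0 if x-bar < 0."""
    return max(0.0, sum(data) / len(data))

# Example 3 with n = 10 and sum of x_i equal to 4 (hypothetical 0/1 data).
bernoulli_sample = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
print(mle_bernoulli(bernoulli_sample))
```

Example 8 shows why the boundary must be checked: when 𝑥̅ < 0 the first-order equation has no solution inside the parameter space, and the maximum sits at 𝜇 = 0.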

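[Handwritten solutions to Examples 4 to 7]

Example 4. (i) 𝛽 is unknown but 𝛼 = 𝛼0 is known, where 𝛼0 > 0. With the Gamma pdf written as
$f(x \mid \alpha, \beta) = \frac{\beta^{\alpha}}{\Gamma(\alpha)} x^{\alpha - 1} e^{-\beta x}$,
$$L(\beta) = \prod_{i=1}^{n} \frac{\beta^{\alpha_0}}{\Gamma(\alpha_0)} x_i^{\alpha_0 - 1} e^{-\beta x_i},$$
$$l(\beta) = n\alpha_0 \ln\beta - n\ln\Gamma(\alpha_0) + (\alpha_0 - 1)\sum \ln x_i - \beta \sum x_i.$$
Then 𝑙′(𝛽) = 𝑛𝛼0 /𝛽 − ∑𝑥𝑖 = 0 yields the unique critical point 𝛽̂ = 𝛼0 /𝑥̅ , with a negative second
derivative $l''(\beta) = -n\alpha_0/\beta^2 < 0$. So the MLE of 𝛽 is 𝛼0 /𝑥̅ .

(ii) 𝛼 is unknown but 𝛽 = 𝛽0 is known, where 𝛽0 > 0. Similarly, we have
$$l(\alpha) = n\alpha \ln\beta_0 - n\ln\Gamma(\alpha) + (\alpha - 1)\sum \ln x_i - \beta_0 \sum x_i.$$
However, there is NO closed form of the solution of 𝑙′(𝛼) = 0. Thus, we can only get the
estimate by numerical methods. The same holds for (iii) and (iv).

Example 5. For 𝜃 = (𝜇, 𝜎²)′ with 𝜇 ∈ (−∞, ∞) and 𝜎² ∈ (0, ∞),
$$l(\mu, \sigma^2) = -\frac{n}{2}\ln(2\pi) - \frac{n}{2}\ln\sigma^2 - \frac{1}{2\sigma^2}\sum (x_i - \mu)^2,$$
and we want to maximize 𝑙(𝜇, 𝜎²) over Θ to get the MLE. Recall that
∑(𝑥𝑖 − 𝜇)² ≥ ∑(𝑥𝑖 − 𝑥̅ )² for any 𝜇. In other words, 𝑙(𝜇, 𝜎²) ≤ 𝑙(𝑥̅ , 𝜎²) for any 𝜎² > 0.
Note that
$$l(\bar{x}, \sigma^2) = -\frac{n}{2}\ln(2\pi) - \frac{n}{2}\ln\sigma^2 - \frac{1}{2\sigma^2}\sum (x_i - \bar{x})^2$$
is maximized at $\hat{\sigma}^2 = \frac{1}{n}\sum (x_i - \bar{x})^2$, which equals 0 if and only if all 𝑥𝑖 are equal
(zero probability). Thus, the MLE is $(\bar{x},\ \frac{1}{n}\sum (x_i - \bar{x})^2)'$.

Example 6.
$$L(\theta) = \prod_{i=1}^{n} \frac{1}{\theta}\, I(0 \le x_i \le \theta) = \theta^{-n}\, I(0 \le x_{(1)} \le x_{(n)} \le \theta).$$
As a function of 𝜃, 𝐿(𝜃) = 𝜃⁻ⁿ for 𝜃 ≥ 𝑥(𝑛) and 0 otherwise. Since 𝜃⁻ⁿ is decreasing, the MLE
is 𝜃̂ = 𝑥(𝑛) , the largest observation (𝑥(𝑛) = 0 has zero probability).

Example 7.
$$L(\theta) = \prod_{i=1}^{n} \frac{1}{2}\, I(\theta - 1 \le x_i \le \theta + 1) = 2^{-n}\, I(x_{(n)} - 1 \le \theta \le x_{(1)} + 1).$$
𝐿(𝜃) is constant on this interval, so there are infinitely many MLEs: any 𝜃 in
[𝑥(𝑛) − 1, 𝑥(1) + 1] is an MLE.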

2.2 FINDING MLE WITH R


Except for a few cases, typically we are only able to write down (log) 𝐿(𝜃) but cannot maximize
it analytically because there are no explicit solutions to the likelihood equation. However, there
is still some hope of maximizing it numerically by R or other statistical packages and, hence,
finding MLE. Note that when this is done, there is still always the question of whether a local or
global maximum is found.

PRINCIPLE OF THE NUMERICAL SOLUTION TO LIKELIHOOD EQUATIONS


Example: Consider a r.s. with size 𝑛 of 𝑋 ∼ 𝐶𝑎𝑢𝑐ℎ𝑦(𝜃). Find an MLE of 𝜃.

First, try to get (log) 𝐿(𝜃). Since the pdf of 𝑋 is $f_X(x \mid \theta) = \pi^{-1}[1 + (x - \theta)^2]^{-1}$, the likelihood
is $L(\theta) = \pi^{-n} \prod_{i=1}^{n} [1 + (x_i - \theta)^2]^{-1}$ and
$$l(\theta) = -n \log \pi - \sum_{i=1}^{n} \log[1 + (x_i - \theta)^2].$$
Setting
$$l'(\theta) = \frac{d}{d\theta} l(\theta) = \sum_{i=1}^{n} \frac{2(x_i - \theta)}{1 + (x_i - \theta)^2} = 0$$
yields the MLE. (Again, we then also have to check whether it is a global maximum.) Note that
the solution (MLE) cannot be obtained explicitly in this case, but we can approximate it by a
numerical method such as the Newton-Raphson algorithm.
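According to Taylor's expansion, we have the following result:
$$0 = \frac{1}{n} l'(\hat{\theta}) \approx \frac{1}{n} l'(\theta) + (\hat{\theta} - \theta)\, \frac{1}{n} l''(\theta).$$
Thus,
$$\hat{\theta} \approx \theta - l'(\theta)\,[l''(\theta)]^{-1}.$$
Newton-Raphson Algorithm:
$$\theta_{j+1} = \theta_j - l'(\theta_j)\,[l''(\theta_j)]^{-1}, \qquad j = 0, 1, 2, \ldots,$$
starting from an initial guess 𝜃0 .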
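The Newton-Raphson iteration for the Cauchy example can be sketched in a few lines of Python (an added illustration, not the notes' own listing). The data are made up, the sample median is used as the initial guess by assumption, and the tolerance and iteration cap are arbitrary choices. The second derivative used is $l''(\theta) = \sum 2[(x_i - \theta)^2 - 1] / [1 + (x_i - \theta)^2]^2$.

```python
import statistics

def score(theta, data):
    """l'(theta) for the Cauchy(theta) log likelihood."""
    return sum(2.0 * (x - theta) / (1.0 + (x - theta) ** 2) for x in data)

def second_derivative(theta, data):
    """l''(theta) for the Cauchy(theta) log likelihood."""
    return sum(
        2.0 * ((x - theta) ** 2 - 1.0) / (1.0 + (x - theta) ** 2) ** 2
        for x in data
    )

def newton_raphson(data, theta0, tol=1e-10, max_iter=100):
    """Iterate theta <- theta - l'(theta) / l''(theta) until the step is tiny."""
    theta = theta0
    for _ in range(max_iter):
        step = score(theta, data) / second_derivative(theta, data)
        theta -= step
        if abs(step) < tol:
            return theta
    raise RuntimeError("Newton-Raphson did not converge")

data = [-1.2, 0.3, 0.9, 1.4, 2.0, 2.6, 9.5]  # hypothetical sample
theta_hat = newton_raphson(data, theta0=statistics.median(data))

# At a local maximum the score is (numerically) zero and l'' is negative.
print(theta_hat, score(theta_hat, data), second_derivative(theta_hat, data))
```

The iteration only finds a critical point near the starting value. Because the Cauchy log likelihood can have several local maxima, trying a few different initial guesses and comparing the resulting 𝑙(𝜃) values is a sensible safeguard.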

R CORNER
We use the R package maxLik to maximize (log) 𝐿(𝜃) in the following. Other R functions
such as optim can also be used.
[Case 1: One unknown parameter]


[Case 2: Multi unknown parameters]

Next, we give an example of a normal distribution with two unknown parameters

Note that for this case par is defined to be a vector.


3 ESTIMATOR EVALUATION
In addition to the MLE, many other estimators can be used to estimate the
parameter(s) of interest. Thus, our next problem in point estimation is how to evaluate
the goodness of an estimator, so that we can compare different estimators and then pick the
best estimator in a class of estimators under consideration.
3.1 MEAN SQUARE ERROR (MSE)
For the evaluation of the goodness of an estimator, we consider the “closeness” of an estimator
𝜃̂(𝑿), or simply 𝜃̂, to the true unknown parameter 𝜃. (Note that 𝜃̂ used in this Section 3
represents any estimator; it need not be the MLE.)

So, it is reasonable to use a distance function to measure the closeness. Here we consider the
squared error (squared 𝐿2 distance), (𝜃̂ − 𝜃)², because of its easy calculation and nice properties.

Note that (𝜃̂ − 𝜃)² is random, so we need to find a way to remove its randomness to get a
numerical quantity for the comparison of different estimators. Conventionally, we fix this
problem by taking the expectation.

More precisely, we have

Definition (MSE): The mean squared error (MSE) of 𝜃̂ for 𝜃 is defined by $E(\hat{\theta}(\boldsymbol{X}) - \theta)^2$.

Note that MSE is a function of 𝜃. For any two estimators, say 𝜃̂1 and 𝜃̂2 , if for all 𝜃 ∈ Θ,
$$E(\hat{\theta}_1(\boldsymbol{X}) - \theta)^2 \le E(\hat{\theta}_2(\boldsymbol{X}) - \theta)^2,$$
and the inequality is strict for at least one 𝜃, then 𝜃̂1 is uniformly better than 𝜃̂2 . Consider a
class 𝑀 of all estimators for 𝜃. If there exists an estimator 𝜃̂** in 𝑀 that is uniformly better than
any other estimator in 𝑀, then 𝜃̂** is said to be a uniform minimum MSE estimator for 𝜃 in 𝑀.

However, such an estimator 𝜃̂** in general does not exist because (i) we are too Greedy in asking
for a uniformly ‘best’ estimator over all 𝜃, and (ii) we are too Generous in considering too many
(all) estimators for 𝜃, even though some of them are poor or not reasonable (like 𝜃̂ = 3423).

For (i), to remove the dependence of MSE on 𝜃, we can

1) Replace MSE by its maximum over 𝜃, and then compare estimators by looking at their
respective maximum MSE, naturally preferring the one with the smallest maximum MSE over 𝑀.
Such an estimator is said to be minimax.

2) Average out 𝜃, just as we average out the dependence on samples when going from
(𝜃̂ − 𝜃)² to $E(\hat{\theta}(\boldsymbol{X}) - \theta)^2$. Then, a natural question is
“How should 𝜃 be averaged out?”
The answer is based on “Bayesian statistics”.

For (ii), we can restrict ourselves to a particular class of estimators. Is it reasonable?

Yes, it is, because sometimes a “very poor” estimator can be a locally best estimator. For
instance, 𝜃̂ = 3423 is undoubtedly a poor estimator because no information from the data is used,
i.e. 3423 is always used to estimate an unknown parameter 𝜃 no matter what the observed
data are. However, it is the best if the true value of 𝜃 is really equal to 3423. Thus, at least, we
have to shrink the class of estimators to kick such a poor estimator out.
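The point about the constant estimator can be seen by comparing MSE functions directly. The Python sketch below is an added illustration: it assumes a population with mean 𝜃 and variance `sigma2`, and compares the sample mean (unbiased, MSE 𝜎²/𝑛) with the constant estimator 3423 (zero variance, bias 3423 − 𝜃). The particular values of `sigma2` and `n` are arbitrary.

```python
def mse_sample_mean(theta, sigma2, n):
    """MSE of X-bar: it is unbiased, so MSE = Var = sigma^2 / n for every theta."""
    return sigma2 / n

def mse_constant(theta, c=3423.0):
    """MSE of the constant estimator c: variance 0, bias c - theta."""
    return (c - theta) ** 2

# Neither estimator is uniformly better: the ranking depends on the true theta.
for theta in (0.0, 3422.0, 3423.0):
    print(theta, mse_sample_mean(theta, 1.0, 25), mse_constant(theta))
```

At 𝜃 = 3423 the constant estimator has MSE 0 and cannot be beaten, while everywhere else it is far worse than the sample mean. This is why no estimator can be uniformly best over the class of all estimators.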

Obviously, we want to keep estimators with some nice properties. In this course, we keep
mean-unbiased estimators.

Definition (Unbiasedness): If an estimator 𝜃̂ satisfies 𝐸(𝜃̂) = 𝜃 for all 𝜃 ∈ Θ, then it is said to
be mean-unbiased or unbiased for 𝜃; otherwise, it is biased.

Interpretation:
The statement “𝜃̂ is unbiased for 𝜃" means that in repeated sampling, 𝜃̂ equals 𝜃 on average.
That is, in the long run, the amounts by which 𝜃̂ overestimates and underestimates 𝜃 will
balance.

Note that $E(\hat{\theta}(\boldsymbol{X}) - \theta)^2 = Var(\hat{\theta}(\boldsymbol{X})) + bias^2$, where $bias = E(\hat{\theta}(\boldsymbol{X})) - \theta$.
So, if 𝜃̂ is unbiased for 𝜃, then its MSE is just its variance! In other words, we fix the bias to be
zero, and then look for an estimator with the smallest variance (or the most efficient!!). Such an
estimator is called a UMVUE --- uniform minimum variance unbiased estimator.
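The decomposition of the MSE into variance plus squared bias follows by adding and subtracting $E(\hat{\theta})$ inside the square. The worked step below is added for completeness; the cross term vanishes because $E[\hat{\theta} - E(\hat{\theta})] = 0$ and $E(\hat{\theta}) - \theta$ is a constant:

```latex
\begin{aligned}
E(\hat{\theta} - \theta)^2
  &= E\big[(\hat{\theta} - E(\hat{\theta})) + (E(\hat{\theta}) - \theta)\big]^2 \\
  &= E\big(\hat{\theta} - E(\hat{\theta})\big)^2
     + 2\big(E(\hat{\theta}) - \theta\big)\, E\big(\hat{\theta} - E(\hat{\theta})\big)
     + \big(E(\hat{\theta}) - \theta\big)^2 \\
  &= Var(\hat{\theta}) + bias^2 .
\end{aligned}
```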

In part II, we will learn how to ‘catch’ the UMVUE in some special cases.

[Handwritten note] Two issues arise for the UMVUE: (i) existence: it may not exist, because an
unbiased estimator may not exist; (ii) uniqueness.

Remarks:
1) According to Lemma 1 in Chapter 1, we know that the 𝑘th sample moment (about 0) is
unbiased for the 𝑘th population moment (about 0), and the sample variance $S_{n-1}^2$ is
unbiased for 𝜎², but $S_n^2$ is not.

2) MLE is often biased.

3) A biased estimator is NOT always bad, because a biased estimator can have a smaller
MSE than an unbiased estimator.

4) It is possible to have infinitely many different unbiased estimators for 𝜃, or NO unbiased
estimator at all.

a. [Infinitely many] Consider a r.s. of size 𝑛 from a distribution with a finite mean 𝜃.
All estimators of the form $\frac{\sum_{i=1}^{n} a_i X_i}{\sum_{i=1}^{n} a_i}$ are unbiased for 𝜃, where 𝑎1 , … , 𝑎𝑛 ∈ 𝑅 and
$\sum_{i=1}^{n} a_i \ne 0$.
b. [No] Suppose that we have a r.s. from Binomial(1, 𝜃) with $g(\theta) = \frac{\theta}{1 - \theta}$ as the
parameter to be estimated. There does not exist an unbiased estimator
for 𝑔(𝜃).

5) An unbiased estimator may be a poor estimator.

For instance, consider a situation in which a telephonist has to leave the switchboard
for a short time. Let 𝑋 be the number of telephone calls per 10 minutes. Suppose that
𝑋 ∼ 𝑃𝑜𝑖𝑠𝑠𝑜𝑛(𝜃) and the telephonist is absent for 20 minutes. We want to estimate the
probability that there are no calls during his absence, i.e. $e^{-2\theta}$.

If we can only observe the number of calls received in the preceding 10 minutes, say 𝑋1,
and want to use a function ℎ of 𝑋1 to get an unbiased estimator for $e^{-2\theta}$, then we
would have the result that $h(X_1) = (-1)^{X_1}$. Note that if 𝑥1 is any odd integer, then −1
is obtained as the estimate of the above probability $e^{-2\theta}$. It is very Poor!!!

6) Unbiasedness does not have an invariance property. That is, if 𝜃̂ is unbiased for 𝜃, then
ℎ(𝜃̂) may NOT be unbiased for ℎ(𝜃). For instance, 𝑋̅ is unbiased for 𝜇, but 𝑋̅² is not
unbiased for 𝜇² when 𝜎 > 0.
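Remarks 3) and 5) can both be checked with short calculations. The Python sketch below is an added illustration. For Remark 3) it assumes a normal sample, for which the MSEs of the two variance estimators have the closed forms $2\sigma^4/(n-1)$ for $S_{n-1}^2$ and $(2n-1)\sigma^4/n^2$ for $S_n^2$. For Remark 5) it sums the Poisson series for $E[(-1)^{X_1}]$ numerically. Function names and the truncation point `upper` are hypothetical.

```python
import math

def mse_variance_estimators(n, sigma2=1.0):
    """Exact MSEs of S^2_{n-1} and S^2_n, assuming a normal sample of size n."""
    mse_unbiased = 2.0 * sigma2 ** 2 / (n - 1)           # variance only, bias 0
    mse_biased = (2.0 * n - 1.0) * sigma2 ** 2 / n ** 2  # variance + bias^2
    return mse_unbiased, mse_biased

def expected_h(theta, upper=200):
    """E[(-1)^X] for X ~ Poisson(theta), truncating the series at `upper` terms."""
    total = 0.0
    for x in range(upper):
        pmf = math.exp(-theta + x * math.log(theta) - math.lgamma(x + 1))
        total += (-1) ** x * pmf
    return total

# Remark 3): the biased S^2_n has the smaller MSE under normality.
for n in (2, 5, 30):
    print(n, mse_variance_estimators(n))

# Remark 5): (-1)^X is unbiased for exp(-2 * theta), yet only takes values -1 and 1.
theta = 1.5
print(expected_h(theta), math.exp(-2.0 * theta))
```

The estimator $(-1)^{X_1}$ is unbiased on average but never produces a value inside (0, 1], which is where the target probability must lie.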

[Handwritten note] $E(\bar{X}^2) = Var(\bar{X}) + [E(\bar{X})]^2 = \frac{\sigma^2}{n} + \mu^2 \ne \mu^2$ when 𝜎 > 0.
[Handwritten summary of Part I]

Data 𝒙 = (𝑥1 , … , 𝑥𝑛 )′ from 𝑓(⋅ |𝜃) or 𝑝(⋅ |𝜃) give a statistic 𝑇(𝒙) in 𝑅ᵏ, where 𝑘 is the number
of unknown parameters to be estimated. 𝑇(𝒙), or 𝜃̂(𝒙), is an estimate (a number).

Goodness of an estimator: closeness is measured by (𝜃̂(𝑿) − 𝜃)², which is random, so we take
MSE = 𝐸(𝜃̂(𝑿) − 𝜃)², which depends on 𝜃.

How about the best MSE estimator over the class 𝑀 of all estimators for 𝜃? It does NOT exist.
Remedy: use a smaller class of estimators satisfying some criterion.
Criterion: unbiasedness, 𝐸(𝜃̂) = 𝜃 for any possible value of 𝜃 in Θ.

MSE = Var(𝜃̂) + bias² = Var(𝜃̂) when unbiasedness is imposed. So the best MSE means the minimum
variance over the class of unbiased estimators, for any 𝜃 ∈ Θ: the Uniform Minimum Variance
Unbiased Estimator (Part II).

Large-sample properties of a sequence of estimators 𝜃̂𝑛 , 𝑛 = 1, 2, …:
(i) Asymptotic unbiasedness: 𝐸(𝜃̂𝑛 ) → 𝜃 as 𝑛 → ∞, for any 𝜃.
(ii) Consistency: 𝜃̂𝑛 → 𝜃 in probability as 𝑛 → ∞.
(iii) Asymptotic normality: √𝑛(𝜃̂𝑛 − 𝜃) → 𝑁(0, 𝜎𝜃²) in distribution, with asymptotic variance 𝜎𝜃²/𝑛.

3.2 LARGE-SAMPLE PROPERTIES OF A POINT ESTIMATOR


We consider the large-sample (large-𝑛, asymptotic) properties of a sequence of
estimators when it is difficult (or impossible) to evaluate them in the finite-sample case.

1. Asymptotic Unbiasedness

Unbiasedness is nice, but in many cases our estimator, say the MLE, may not be unbiased. Luckily,
when we assess the performance of a sequence of estimators asymptotically, we find that the
biases of most biased estimators disappear. If the bias of an estimator tends to 0 as 𝑛 → ∞,
then the estimator is said to be asymptotically unbiased. More formally, we have

Definition (Asymptotic Unbiasedness): A sequence of estimators, {𝜃̂𝑛 : 𝑛 = 1,2, … }, based on a
r.s. of size 𝑛 is said to be asymptotically unbiased if $\lim_{n \to \infty} E(\hat{\theta}_n) = \theta$, for all 𝜃 ∈ Θ.

Note that $S_n^2$, the MLE of 𝜎², is asymptotically unbiased, although it is biased in a finite-sample
case. That is, in the asymptotic sense, $S_n^2$ can also enjoy a nice property.
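The vanishing bias of $S_n^2$ can be illustrated by simulation. The Python sketch below (an added example) draws repeated standard normal samples, averages $S_n^2$ over the replications, and compares the average with the theoretical mean $(n-1)\sigma^2/n$. The seed, the number of replications and the sample sizes are arbitrary assumptions.

```python
import random

def biased_sample_variance(sample):
    """S_n^2: sum of squared deviations from the sample mean, divided by n."""
    n = len(sample)
    mean = sum(sample) / n
    return sum((x - mean) ** 2 for x in sample) / n

def average_sn2(n, reps=20000, seed=3423):
    """Monte Carlo estimate of E(S_n^2) for N(0, 1) samples of size n."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(reps):
        sample = [rng.gauss(0.0, 1.0) for _ in range(n)]
        total += biased_sample_variance(sample)
    return total / reps

# With sigma^2 = 1, E(S_n^2) = (n - 1) / n, which approaches 1 as n grows.
for n in (5, 50):
    print(n, average_sn2(n, reps=5000), (n - 1) / n)
```

The simulated averages are subject to Monte Carlo error, so they match the theoretical values only approximately.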

2. Consistency ---- convergence in probability

For consistency, 𝜃̂𝑛 has to be arbitrarily close to 𝜃 with a high probability when 𝑛 is large,
i.e. $\hat{\theta}_n \xrightarrow{p} \theta$.

Definition (Consistency): A sequence of estimators, {𝜃̂𝑛 : 𝑛 = 1,2, … }, based on a r.s. of size 𝑛 is
said to be consistent if, for any 𝜖 > 0, $\lim_{n \to \infty} P(|\hat{\theta}_n - \theta| \le \epsilon) = 1$, for all 𝜃 ∈ Θ.

Note that although asymptotic unbiasedness and consistency look similar, neither implies the
other in general. However, an estimator is consistent if it is asymptotically
unbiased AND its variance 𝑉𝑎𝑟(𝜃̂𝑛 ) tends to 0 as 𝑛 → ∞.
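The sufficient condition for consistency stated above can be justified by Markov's inequality applied to the squared error, together with the variance-bias decomposition of the MSE (a worked step added here for completeness). For any $\epsilon > 0$,

```latex
P\big(|\hat{\theta}_n - \theta| > \epsilon\big)
  \;\le\; \frac{E(\hat{\theta}_n - \theta)^2}{\epsilon^2}
  \;=\; \frac{Var(\hat{\theta}_n) + \big(E(\hat{\theta}_n) - \theta\big)^2}{\epsilon^2}
  \;\longrightarrow\; 0
  \quad \text{as } n \to \infty ,
```

so that $P(|\hat{\theta}_n - \theta| \le \epsilon) \to 1$ whenever both the variance and the bias of $\hat{\theta}_n$ tend to 0.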

3. Asymptotic Normality ---- convergence in distribution to normal

This property is important when we want to use 𝜃̂𝑛 to make further statistical inference about 𝜃,
say finding a confidence interval or doing a hypothesis test.

Definition (Asymptotic Normality): A sequence of estimators, {𝜃̂𝑛 : 𝑛 = 1,2, … }, based on a r.s.
of size 𝑛 is said to be asymptotically normal if $\sqrt{n}(\hat{\theta}_n - \theta) \xrightarrow{d} N(0, \sigma_\theta^2)$.

Note that the asymptotic variance of 𝜃̂𝑛 is $\sigma_\theta^2 / n$.

Recall that the MLE is consistent, asymptotically unbiased, and asymptotically normal. The proofs
will be covered in part II.
