Point Estimation in Statistics
CW YU
1 PART I: INTRODUCTION
In Part I, we first study how to estimate one or more parameters in general cases using a
methodical estimation principle, and then discuss the performance of the resulting estimators.
Point Estimation:
The idea of point estimation is simple: we use a statistic 𝑇(𝒙) to estimate the
unknown parameter, say 𝜃, where 𝒙 = (𝑥1 , … , 𝑥𝑛 )′ is a realization of the random sample 𝑿 =
(𝑋1 , … , 𝑋𝑛 )′ or {𝑋𝑖 : 𝑖 = 1, … , 𝑛} of size 𝑛 from a population with a pdf 𝑓(⋅ |𝜃) or pmf 𝑝(⋅ |𝜃)
and 𝜃 is in the parameter space Θ.
In some cases, there is an obvious or natural point estimator of an unknown parameter. For
instance, the sample mean of a random sample is a natural point estimator of the population mean.
However, when we leave such a simple case, we need a more methodical estimation technique
that will at least give us a reasonable candidate for consideration. In Section 2, we will study
one of the most commonly used estimation approaches in statistics: Maximum Likelihood Estimation.
Remark:
[Parameter of interest] Most often, the parameter(s) of interest to be estimated (called the
estimand) is a function of the unknown distribution parameter(s) 𝜃, say 𝑔(𝜃). For instance, we
may be interested in 𝜇², instead of 𝜇, or 𝜎/𝜇, instead of 𝜇 or 𝜎 only, etc.
[Estimator? Estimate?] An estimator is a function of the random sample 𝑿, while an estimate is
the realized value of the estimator that is obtained when a sample of data is actually taken.
[Why is it called ‘point’ estimation?] Note that the statistic 𝑇 is indeed a ‘point’ in ℝᵏ, where
𝑘 ≥ 1 represents the number of unknown parameters to be estimated. We use it to estimate
𝑔(𝜃), which is also a ‘point’ in ℝᵏ. That is why 𝑇(𝒙) is called a point estimate of 𝑔(𝜃).
Caution: We ONLY estimate an UNKNOWN parameter(s). For any KNOWN parameter, there is
no point for us to estimate it!!!
MATH 3423 Statistical Inference | Dr. CW YU
Before showing how to find MLE, let’s first understand what the ‘likelihood’ is.
WHAT IS ‘LIKELIHOOD’?
The likelihood function quantifies how likely our observed data are to occur under a given value of 𝜃.
Definition: Consider a r.s. of size 𝑛 from a population with a pdf 𝑓(⋅ |𝜃) or pmf 𝑝(⋅ |𝜃). After
collection, we have the realization 𝒙 = (𝑥1 , … , 𝑥𝑛 )′ . The likelihood function is defined by
$$L(\theta) = L(\theta_1, \theta_2, \dots, \theta_k \mid \boldsymbol{x}) = \prod_{i=1}^{n} f(x_i \mid \theta)$$
for continuous cases and
$$L(\theta) = \prod_{i=1}^{n} p(x_i \mid \theta)$$
for discrete cases.
Remark that 𝐿(𝜃) is a function of 𝜃, with 𝒙 held fixed.
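As a small illustration of viewing 𝐿(𝜃) as a function of 𝜃 with the data held fixed, the following Python sketch evaluates a Bernoulli likelihood on a coarse grid. The data summary (𝑛 = 10 with four successes), the grid, and the function name are illustrative assumptions, not part of the notes.

```python
def bernoulli_likelihood(theta, n, successes):
    """L(theta) = theta^s * (1 - theta)^(n - s) for a Bernoulli sample."""
    return theta ** successes * (1 - theta) ** (n - successes)


# Hypothetical data summary: n = 10 observations, 4 of them equal to 1.
n, successes = 10, 4

# Evaluate L(theta) on the grid 0.0, 0.1, ..., 1.0 with the data held fixed.
grid = [i / 10 for i in range(11)]
values = [bernoulli_likelihood(t, n, successes) for t in grid]

# The grid point with the largest likelihood value.
best_theta = grid[values.index(max(values))]

for t, v in zip(grid, values):
    print(f"theta = {t:.1f}  L(theta) = {v:.6f}")
print("grid maximiser:", best_theta)
```

Different values of 𝜃 give different likelihood values for the same data; on this grid the largest value occurs at 𝜃 = 0.4.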
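Statistical belief: the data we observed are the ones with the highest probability of occurring.
Discrete cases: the joint pmf (joint probability) of getting the data is
$$P(X_1 = x_1, \dots, X_n = x_n \mid \theta) = \prod_{i=1}^{n} p(x_i \mid \theta),$$
where 𝑝 is the common pmf of the 𝑋𝑖 (we are getting 𝑛 independent copies of 𝑋). The likelihood 𝐿(𝜃) is defined to be this product.
Continuous cases: we cannot use the joint probability directly, because 𝑃(𝑋 = 𝑥) = 0 for a continuous variate. So we can consider
$$P(x_i \le X_i \le x_i + dx_i,\ i = 1, \dots, n \mid \theta) = \prod_{i=1}^{n} P(x_i \le X_i \le x_i + dx_i \mid \theta) \approx \prod_{i=1}^{n} f(x_i \mid \theta)\, dx_i,$$
where the first equality holds by independence and the approximation uses the fact that the 𝑋𝑖 are identically distributed with pdf 𝑓. Since the 𝑑𝑥𝑖 do not involve 𝜃, we work with $\prod_{i=1}^{n} f(x_i \mid \theta)$.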
Definition (MLE): The maximum likelihood estimate is $\hat\theta = \arg\max_{\theta \in \Theta} L(\theta)$, which means
that $L(\hat\theta) \ge L(\theta)$ for all 𝜃 ∈ Θ.
We also use the abbreviation MLE for the maximum likelihood estimator when we study the
properties of this “maximum likelihood” estimation method.
In some cases, especially when differentiation is used, it is easier to work with the natural
logarithm of 𝐿(𝜃), i.e. 𝑙(𝜃) = log 𝐿(𝜃), called the log likelihood, than it is to work with 𝐿(𝜃)
directly. This is possible because the log function is strictly increasing, which implies that the
maxima of 𝐿(𝜃) and 𝑙(𝜃) coincide.
where 𝐼𝑋(𝜃) is known as the Fisher information matrix (more details about this matrix will
be discussed in Part II); it is a 𝑘 × 𝑘 matrix whose (𝑖, 𝑗)th entry is defined as
$$E\left[\left(\frac{\partial}{\partial\theta_i} \log f_X(X \mid \theta)\right)\left(\frac{\partial}{\partial\theta_j} \log f_X(X \mid \theta)\right)\right]$$
for 𝑖 = 1, … , 𝑘 and 𝑗 = 1, … , 𝑘.
There are three standard approaches to finding an MLE. Our job is to find a global maximum!!!
(i) If the parameter space Θ contains finitely many points, then an MLE can always be
obtained by simply comparing the finitely many values of (log) 𝐿(𝜃), for all 𝜃 ∈ Θ.
(ii) If 𝐿(𝜃) is differentiable on the interior of Θ, then one possible way of finding an MLE
is to consider the values of 𝜃 = (𝜃1 , 𝜃2 , … , 𝜃𝑘 )′ in the interior that solve the
first-order (likelihood or log-likelihood) equations
$$\frac{\partial}{\partial\theta_i} L(\theta) = 0 \quad\text{or}\quad \frac{\partial}{\partial\theta_i} l(\theta) = 0, \quad\text{for } i = 1, \dots, k.$$
However, this is just a necessary condition for a maximum (or minimum), not a
sufficient condition. To be more precise, the solutions to the above equations are
just the critical points, which may or may not be extrema. Furthermore, the zeros of
the first derivative only locate the critical points in the interior of the domain of the
(log) 𝐿(𝜃). If the maximum occurs on the boundary, the first derivative may not be
zero. Thus, the boundary must be checked separately when looking for the MLE.
[A special case for one-parameter cases] When 𝑘 = 1, there is a case that we can
get a global maximum easily. If there is a unique critical point and it has a negative
second derivative of (log) 𝐿(𝜃), then it must be a global maximum. Note that for
this case, we do not have to check any boundary point!!
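Example: With the 𝑁(𝜃, 1) density for each observation, the likelihood is
$$L(\theta) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi}}\, e^{-\frac{(x_i-\theta)^2}{2}} = \frac{1}{(2\pi)^{n/2}}\, e^{-\frac{\sum (x_i-\theta)^2}{2}}.$$
Setting $l'(\theta) = \sum_{i=1}^{n}(x_i - \theta) = 0$ gives the unique critical point 𝜃 = 𝑥̅, and
$$l''(\theta) = \frac{d^2}{d\theta^2}\, l(\theta)\Big|_{\theta=\bar{x}} = -n < 0,$$
so 𝜃̂ = 𝑥̅ is the MLE.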
(iii) Another way to find an MLE is to abandon differentiation and proceed with a direct
maximization. One general technique is to find a global upper bound on (log) 𝐿(𝜃)
and then establish that there is a unique point for which the upper bound is
attained.
Example (cont’): Instead of using calculus, we can also show algebraically that 𝜃̂ = 𝑥̅ is the MLE.
Note that ∑(𝑥𝑖 − 𝜃)² ≥ ∑(𝑥𝑖 − 𝑥̅)² for any 𝜃, with
equality if and only if 𝜃 = 𝑥̅. Thus, for any 𝜃 ∈ Θ,
𝐿(𝜃) ≤ 𝐿(𝑥̅).
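The inequality used in the algebraic argument follows from a standard decomposition of the sum of squares about the sample mean. The step-by-step expansion below is a worked sketch added for completeness:

```latex
\begin{align*}
\sum_{i=1}^{n}(x_i-\theta)^2
  &= \sum_{i=1}^{n}\bigl[(x_i-\bar{x}) + (\bar{x}-\theta)\bigr]^2 \\
  &= \sum_{i=1}^{n}(x_i-\bar{x})^2
     + 2(\bar{x}-\theta)\sum_{i=1}^{n}(x_i-\bar{x})
     + n(\bar{x}-\theta)^2 \\
  &= \sum_{i=1}^{n}(x_i-\bar{x})^2 + n(\bar{x}-\theta)^2
  \;\ge\; \sum_{i=1}^{n}(x_i-\bar{x})^2 ,
\end{align*}
```

since $\sum_{i=1}^{n}(x_i-\bar{x}) = 0$. Equality holds if and only if $\theta = \bar{x}$.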
Remark that the problem of finding a global maximum in the above case can be solved in a large
class of situations when some regularity conditions are satisfied. More details can be found in
the book “Theory of Point Estimation” written by E. L. Lehmann and George Casella, and in
MATH 5423 Advanced Statistical Inference.
2. Consider a r.s. with size 𝑛 of 𝑋 ∼ 𝑃𝑜𝑖𝑠𝑠𝑜𝑛(𝜆), where 𝜆 ∈ (0, ∞). Find the MLE of 𝜆.
3. Consider a r.s. with size 𝑛 of 𝑋 ∼ 𝐵𝑖𝑛𝑜𝑚𝑖𝑎𝑙(1, 𝜃), where 𝜃 ∈ [0,1]. Find the MLE of 𝜃,
and then get the MLE when 𝑛 = 10 and $\sum_{i=1}^{n} x_i = 4$.
4. Consider a r.s. of size 𝑛 of 𝑋 from 𝐺𝑎𝑚𝑚𝑎(𝛼, 𝛽), where 0 < 𝛼 < ∞ and 0 < 𝛽 < ∞.
Find MLE when
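Solution notes:
[Normal case with known mean 𝜇 and unknown 𝜎² ∈ (0, ∞)] Since 𝜇 is known, we do not need to estimate it. The likelihood is
$$L(\sigma^2) = (2\pi\sigma^2)^{-n/2} \exp\left\{-\frac{1}{2\sigma^2}\sum_{i=1}^{n}(x_i-\mu)^2\right\}.$$
Take logs to get
$$l(\sigma^2) = -\frac{n}{2}\log(2\pi) - \frac{n}{2}\log\sigma^2 - \frac{1}{2\sigma^2}\sum_{i=1}^{n}(x_i-\mu)^2,$$
and set the first derivative with respect to 𝜎² to zero:
$$0 = l'(\sigma^2) = -\frac{n}{2\sigma^2} + \frac{1}{2\sigma^4}\sum_{i=1}^{n}(x_i-\mu)^2.$$
This has the unique solution $\hat\sigma^2 = \frac{1}{n}\sum_{i=1}^{n}(x_i-\mu)^2$. Since 𝑘 = 1, the critical point is unique, and the second derivative there is negative, $\hat\sigma^2$ is the MLE. Note that $\hat\sigma^2 = 0$ if and only if all 𝑥𝑖 are equal to 𝜇, which has zero probability, so $\hat\sigma^2$ lies in the parameter space (0, ∞).
[Example 2: Poisson(𝜆)] The likelihood is
$$L(\lambda) = \prod_{i=1}^{n} \frac{e^{-\lambda}\lambda^{x_i}}{x_i!} = \frac{e^{-n\lambda}\lambda^{\sum x_i}}{\prod_{i=1}^{n} x_i!},$$
where the factor $1/\prod x_i!$ is not a function of 𝜆. Take logs to get
$$l(\lambda) = -n\lambda + \Bigl(\sum_{i=1}^{n} x_i\Bigr)\log\lambda - \sum_{i=1}^{n}\log(x_i!),$$
and set the first derivative to zero:
$$0 = l'(\lambda) = -n + \frac{1}{\lambda}\sum_{i=1}^{n} x_i, \quad\text{so}\quad \hat\lambda = \bar{x}.$$
When $\sum x_i > 0$, this is a unique critical point (𝑘 = 1) with negative second derivative $l''(\lambda) = -\sum x_i/\lambda^2$, so the MLE exists and equals 𝑥̅. When all observations are zero, which happens with probability $e^{-n\lambda}$, we have 𝑙(𝜆) = −𝑛𝜆, which is strictly decreasing in 𝜆. Its supremum, 0, is approached as 𝜆 → 0 but is not attained because 𝜆 > 0, so the MLE does not exist in (0, ∞).
[Example 3: Binomial(1, 𝜃)] The likelihood is
$$L(\theta) = \prod_{i=1}^{n}\theta^{x_i}(1-\theta)^{1-x_i} = \theta^{\sum x_i}(1-\theta)^{n-\sum x_i},$$
and
$$l(\theta) = \Bigl(\sum_{i=1}^{n} x_i\Bigr)\log\theta + \Bigl(n - \sum_{i=1}^{n} x_i\Bigr)\log(1-\theta).$$
(i) If $0 < \sum x_i < n$, then 𝑙′(𝜃) = 0 gives the unique critical point 𝜃 = 𝑥̅ in (0, 1), with negative second derivative, so the MLE is 𝑥̅.
(ii) If $\sum x_i = 0$, then 𝐿(𝜃) = (1 − 𝜃)ⁿ is maximized at 𝜃 = 0 = 𝑥̅.
(iii) If $\sum x_i = n$, then 𝐿(𝜃) = 𝜃ⁿ is maximized at 𝜃 = 1 = 𝑥̅.
Thus the MLE for 𝜃 ∈ [0, 1] is 𝑥̅. When 𝑛 = 10 and $\sum x_i = 4$, the MLE is 4/10 = 2/5.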
5. Consider a r.s. of size 𝑛 from 𝑁(𝜇, 𝜎²), where 𝜎² ∈ (0, ∞) is unknown and 𝜇 ∈ (−∞, ∞)
is unknown. Find the MLE of 𝜃 = (𝜇, 𝜎²)′.
6. Suppose 𝑋 is from 𝑈[0, 𝜃], where 0 < 𝜃 < ∞. Find the MLE of 𝜃 if a r.s. with size 𝑛 of 𝑋
is considered.
7. Suppose 𝑋 is from 𝑈[𝜃 − 1, 𝜃 + 1], where −∞ < 𝜃 < ∞. Find the MLE of 𝜃 if a r.s. with
size 𝑛 of 𝑋 is considered.
8. Consider a r.s. of size 𝑛 from 𝑁(𝜇, 1), where 0 ≤ 𝜇 < ∞ is unknown. Find the MLE of 𝜇.
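Solution notes:
[Example 4: Gamma(𝛼, 𝛽)]
(i) 𝛽 is unknown but 𝛼 = 𝛼₀ is known, where 𝛼₀ > 0. In the likelihood $L(\beta) = \prod_{i=1}^{n} f(x_i \mid \beta)$, the factors that are not a function of 𝛽 can be dropped. Taking logs and solving 𝑙′(𝛽) = 0 yields a unique critical point with negative second derivative, so this critical point is the MLE of 𝛽.
(ii) 𝛼 is unknown but 𝛽 = 𝛽₀ is known, where 𝛽₀ > 0. Similarly we can write down 𝑙(𝛼). However, there is no closed form of the solution of 𝑙′(𝛼) = 0. Thus we can only get the estimate by numerical methods. The same holds for cases (iii) and (iv).
[Example 5: 𝑁(𝜇, 𝜎²) with both parameters unknown] For 𝜃 = (𝜇, 𝜎²)′, where 𝜇 ∈ (−∞, ∞) and 𝜎² ∈ (0, ∞),
$$l(\theta) = -\frac{n}{2}\log(2\pi) - \frac{n}{2}\log\sigma^2 - \frac{1}{2\sigma^2}\sum_{i=1}^{n}(x_i-\mu)^2,$$
and we want to maximize 𝑙(𝜃) over 𝜃 ∈ Θ to get the MLE. Recall that $\sum(x_i-\mu)^2 \ge \sum(x_i-\bar{x})^2$ for any 𝜇. In other words, $l(\mu, \sigma^2) \le l(\bar{x}, \sigma^2)$ for any 𝜎² ∈ (0, ∞). Note that
$$l(\bar{x}, \sigma^2) = -\frac{n}{2}\log(2\pi) - \frac{n}{2}\log\sigma^2 - \frac{1}{2\sigma^2}\sum_{i=1}^{n}(x_i-\bar{x})^2$$
will be maximized at $\hat\sigma^2 = S_n^2 = \frac{1}{n}\sum_{i=1}^{n}(x_i-\bar{x})^2$. Here $S_n^2 = 0$ if and only if all 𝑥𝑖 are equal, which has zero probability. Hence we have $(\hat\mu, \hat\sigma^2) = (\bar{x}, S_n^2)$.
[Example 6: 𝑈[0, 𝜃]] With 𝑥₍₁₎ and 𝑥₍ₙ₎ denoting the smallest and largest observations,
$$L(\theta) = \prod_{i=1}^{n}\frac{1}{\theta}\, I(0 \le x_i \le \theta) = \theta^{-n}\, I(0 \le x_{(1)})\, I(x_{(n)} \le \theta).$$
As a function of 𝜃, 𝐿(𝜃) equals $\theta^{-n}$ for 𝜃 ≥ 𝑥₍ₙ₎ and 0 otherwise, and $\theta^{-n}$ is decreasing in 𝜃. Thus the MLE is 𝑥₍ₙ₎. (The event 𝑥₍ₙ₎ = 0 has zero probability.)
[Example 7: 𝑈[𝜃 − 1, 𝜃 + 1]]
$$L(\theta) = \prod_{i=1}^{n}\frac{1}{2}\, I(\theta-1 \le x_i \le \theta+1) = 2^{-n}\, I(x_{(n)} - 1 \le \theta \le x_{(1)} + 1).$$
The likelihood is constant on the interval [𝑥₍ₙ₎ − 1, 𝑥₍₁₎ + 1], so any value in this interval is an MLE; the MLE is not unique.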
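First, try to get (log) 𝐿(𝜃). Since the pdf of 𝑋 is $f_X(x \mid \theta) = \pi^{-1}[1 + (x-\theta)^2]^{-1}$, the likelihood
is $L(\theta) = \pi^{-n}\prod_{i=1}^{n}[1 + (x_i-\theta)^2]^{-1}$ and
$$l(\theta) = -n\log\pi - \sum_{i=1}^{n}\log[1 + (x_i-\theta)^2].$$
Setting
$$l'(\theta) = \frac{d}{d\theta}\, l(\theta) = \sum_{i=1}^{n}\frac{2(x_i-\theta)}{1 + (x_i-\theta)^2} = 0$$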
yields the MLE. (Again, we then also have to check that it is a global maximum.) Note that the
solution (the MLE) cannot be obtained explicitly in this case, but we can approximate it by a
numerical method such as the Newton-Raphson algorithm.
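By a first-order Taylor expansion of 𝑙′ about 𝜃, we have the following result:
$$0 = \frac{1}{n}\, l'(\hat\theta) \approx \frac{1}{n}\, l'(\theta) + (\hat\theta - \theta)\,\frac{1}{n}\, l''(\theta).$$
Thus,
$$\hat\theta \approx \theta - l'(\theta)\,[l''(\theta)]^{-1}.$$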
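Newton-Raphson Algorithm: starting from an initial guess $\theta_0$, iterate
$$\theta_{j+1} = \theta_j - l'(\theta_j)\,[l''(\theta_j)]^{-1}, \quad j = 0, 1, 2, \dots,$$
until successive iterates are sufficiently close.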
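The Newton-Raphson iteration for the Cauchy location model can be sketched in Python using only the standard library. The data values, the choice of the sample median as the initial guess, the tolerance, and all function names below are illustrative assumptions, not part of the notes. The iteration is only locally convergent and can fail from a poor starting value, and a converged root of 𝑙′ still has to be checked for being a global maximum.

```python
import statistics


def score(theta, xs):
    """l'(theta) for the Cauchy location model with unit scale."""
    return sum(2 * (x - theta) / (1 + (x - theta) ** 2) for x in xs)


def score_derivative(theta, xs):
    """l''(theta): derivative of the score with respect to theta."""
    return sum(
        2 * ((x - theta) ** 2 - 1) / (1 + (x - theta) ** 2) ** 2 for x in xs
    )


def newton_raphson(xs, theta0, tol=1e-10, max_iter=100):
    """Iterate theta <- theta - l'(theta) / l''(theta) until the step is tiny."""
    theta = theta0
    for _ in range(max_iter):
        curvature = score_derivative(theta, xs)
        if curvature == 0:
            raise ZeroDivisionError("l''(theta) is zero; try another start")
        step = score(theta, xs) / curvature
        theta -= step
        if abs(step) < tol:
            return theta
    raise RuntimeError("Newton-Raphson did not converge")


# Hypothetical sample; the sample median is used as the initial guess.
xs = [-1.2, 0.3, 0.8, 1.1, 2.5, 7.0]
theta_hat = newton_raphson(xs, statistics.median(xs))
print("approximate MLE:", theta_hat)
print("score at the solution:", score(theta_hat, xs))
```

A negative value of `score_derivative(theta_hat, xs)` indicates a local maximum at the returned point. Plotting or evaluating 𝑙(𝜃) over a range of 𝜃 is one way to check that it is also the global one.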
R CORNER
We would use the R package maxLik to maximize (log) 𝐿(𝜃) in the following. Other R functions
like optim can also be used.
[Case 1: One unknown parameter]
3 ESTIMATOR EVALUATION
In addition to using MLE, we can also have a whole bunch of other estimators to estimate the
parameter(s) of interest. Thus, our next problem about the point estimation is how to evaluate
the goodness of the estimator, so that we can compare different estimators and then get the
best estimator in a class of estimators under consideration.
3.1 MEAN SQUARE ERROR (MSE)
For the evaluation of the goodness of an estimator, we consider the “closeness” of an estimator
𝜃̂(𝑿), or simply 𝜃̂, to the true unknown parameter 𝜃. (Note that 𝜃̂ used in this section 3
represents any estimator, it is not necessary to be MLE.)
So, it is reasonable to use a distance function to measure the closeness. Here we consider the
squared error norm (or 𝐿2 norm), (𝜃̂ − 𝜃)², because of its easy calculation and nice properties.
Note that (𝜃̂ − 𝜃)² is random, so we need to find a way to remove its randomness to get a
numerical quantity for the comparison of different estimators. Conventionally, we fix this
problem by taking the expectation, which gives the mean square error (MSE), 𝐸(𝜃̂(𝑿) − 𝜃)².
Note that the MSE is a function of 𝜃. For any two estimators, say 𝜃̂1 and 𝜃̂2, if for all 𝜃 ∈ Θ,
$$E(\hat\theta_1(\boldsymbol{X}) - \theta)^2 \le E(\hat\theta_2(\boldsymbol{X}) - \theta)^2,$$
and the inequality is strict for at least one 𝜃, then 𝜃̂1 is uniformly better than 𝜃̂2. Consider a
class 𝑀 of all estimators for 𝜃. If there exists an estimator 𝜃̂∗∗ in 𝑀 that is uniformly better than
any other estimator in 𝑀, then 𝜃̂∗∗ is said to be a uniform minimum MSE estimator for 𝜃 in 𝑀.
However, such an estimator 𝜃̂∗∗ in general does not exist because (i) we are too Greedy to get a
uniform ‘best’ estimator over all 𝜃, and (ii) we are too Generous to consider too many (all)
estimators for 𝜃, even though some of them are poor or not reasonable (like 𝜃̂ = 3423).
2) Average out 𝜃, just as we average out the dependence on samples when going from
(𝜃̂ − 𝜃)² to 𝐸(𝜃̂(𝑿) − 𝜃)². Then, a natural question is
“How should 𝜃 be averaged out?”
The answer is based on “Bayesian statistics”.
Obviously, we want to keep estimators with some nice properties. In this course, we keep
mean-unbiased estimators.
Interpretation:
The statement “𝜃̂ is unbiased for 𝜃” means that in repeated sampling, 𝜃̂ equals 𝜃 on average.
That is, in the long run, the amounts by which 𝜃̂ overestimates and underestimates 𝜃 will
balance.
Note that 𝐸(𝜃̂(𝑿) − 𝜃)² = 𝑉𝑎𝑟(𝜃̂(𝑿)) + 𝑏𝑖𝑎𝑠², where 𝑏𝑖𝑎𝑠 = 𝐸(𝜃̂(𝑿)) − 𝜃.
So, if 𝜃̂ is unbiased for 𝜃, then its MSE is just its variance! In other words, we fix the bias to be
zero, and then look for an estimator with the smallest variance (or the most efficient!!). Such an
estimator is called a UMVUE --- uniform minimum variance unbiased estimator.
In Part II, we will learn how to ‘catch’ the UMVUE in some special cases.
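For completeness, the decomposition of the MSE into variance plus squared bias can be verified by adding and subtracting $E(\hat\theta)$. This is a standard derivation, written here as a short sketch:

```latex
\begin{align*}
E(\hat\theta - \theta)^2
  &= E\bigl[(\hat\theta - E\hat\theta) + (E\hat\theta - \theta)\bigr]^2 \\
  &= E(\hat\theta - E\hat\theta)^2
     + 2\,(E\hat\theta - \theta)\,E(\hat\theta - E\hat\theta)
     + (E\hat\theta - \theta)^2 \\
  &= Var(\hat\theta) + bias^2 ,
\end{align*}
```

because $E(\hat\theta - E\hat\theta) = 0$ and $E\hat\theta - \theta$ is a constant.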
Remarks:
1) According to Lemma 1 in Chapter 1, we know that the 𝑘th sample moment (about 0) is
unbiased for the 𝑘th population moment (about 0), and the sample variance $S_{n-1}^2$ is
unbiased for $\sigma^2$, but $S_n^2$ is not.
3) A biased estimator is NOT always bad, because a biased estimator can have a smaller
MSE than an unbiased estimator.
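One well-known instance of this remark is the comparison of the two variance estimators for a normal sample. The Python sketch below assumes a r.s. from a normal population, with $S_{n-1}^2$ dividing the sum of squares about the sample mean by 𝑛 − 1 and $S_n^2$ dividing it by 𝑛. It uses the standard result that $Var(S_{n-1}^2) = 2\sigma^4/(n-1)$ under normality. The function names are hypothetical.

```python
from fractions import Fraction


def mse_unbiased_variance(n, sigma2):
    """MSE of S^2_{n-1}: zero bias, variance 2*sigma^4/(n-1) under normality."""
    return Fraction(2) * sigma2 ** 2 / (n - 1)


def mse_mle_variance(n, sigma2):
    """MSE of S^2_n = ((n-1)/n) * S^2_{n-1}, computed as variance + bias^2."""
    bias = -Fraction(sigma2) / n
    variance = Fraction(2 * (n - 1)) * sigma2 ** 2 / n ** 2
    return variance + bias ** 2


sigma2 = 1  # illustrative value of the true variance
for n in (5, 10, 50):
    unbiased = mse_unbiased_variance(n, sigma2)
    biased = mse_mle_variance(n, sigma2)
    print(f"n = {n}: MSE(S^2_(n-1)) = {unbiased}, MSE(S^2_n) = {biased}")
```

Under these assumptions the biased estimator $S_n^2$ has MSE $(2n-1)\sigma^4/n^2$, which is smaller than $2\sigma^4/(n-1)$ for every 𝑛 ≥ 2.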
If we can only observe the number of calls received in the preceding 10 minutes, say 𝑋1,
and want to use a function ℎ of 𝑋1 to get an unbiased estimator for $e^{-2\theta}$, then we
would have the result that $h(X_1) = (-1)^{X_1}$. Note that if 𝑥1 is any odd integer, then −1
will be obtained as an estimate of the above probability $e^{-2\theta}$. It is very Poor!!!
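The unbiasedness claim can be checked numerically. The Python sketch below assumes that 𝑋1 follows a Poisson(𝜃) distribution (an assumption here, since the model statement is not shown in this part of the notes) and sums the series for $E[(-1)^{X_1}]$. The function name, the truncation length, and the 𝜃 values are illustrative.

```python
import math


def expected_h(theta, terms=100):
    """Approximate E[(-1)^X] for X ~ Poisson(theta) by a truncated series."""
    total = 0.0
    pmf = math.exp(-theta)  # P(X = 0)
    for x in range(terms):
        total += (-1) ** x * pmf
        pmf *= theta / (x + 1)  # P(X = x + 1) from P(X = x)
    return total


for theta in (0.5, 1.0, 2.0):
    print(theta, expected_h(theta), math.exp(-2 * theta))
```

The two printed columns agree, which is consistent with $E[(-1)^{X_1}] = e^{-\theta}\sum_{x}(-\theta)^x/x! = e^{-2\theta}$. The estimator is unbiased even though it only ever takes the values 1 and −1.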
6) Unbiasedness does not have an invariance property. That is, if 𝜃̂ is unbiased for 𝜃, then
ℎ(𝜃̂) may NOT be unbiased for ℎ(𝜃). For instance, 𝑋̅ is unbiased for 𝜇, but 𝑋̅² is not
unbiased for 𝜇² when 𝜎 > 0, because
$$E(\bar{X}^2) = Var(\bar{X}) + [E(\bar{X})]^2 = \frac{\sigma^2}{n} + \mu^2.$$
Recap: The data 𝒙 = (𝑥1 , … , 𝑥𝑛 )′ are a realization of a r.s. from 𝑓(𝑥|𝜃) or 𝑝(𝑥|𝜃). A statistic 𝑇 maps ℝⁿ to ℝᵏ, where 𝑘 is the number of unknown parameters to be estimated. The MSE of an estimator depends on 𝜃.
How about the best MSE estimator in 𝑀, the class of all estimators for 𝜃? The best MSE estimator does NOT exist in general.
Remedy: use a smaller class of estimators satisfying some criterion.
Criterion: unbiasedness, that is, 𝐸(𝜃̂) = 𝜃 for any possible value of 𝜃 in Θ. Within the class of unbiased estimators,
MSE = 𝑉𝑎𝑟(𝜃̂) + 𝑏𝑖𝑎𝑠² = 𝑉𝑎𝑟(𝜃̂), since the bias is 0 when unbiasedness is used.
Large-sample properties of a sequence of estimators 𝜃̂𝑛, 𝑛 = 1, 2, …:
(i) Asymptotic unbiasedness: 𝐸(𝜃̂𝑛) → 𝜃 as 𝑛 → ∞, for any 𝜃.
(ii) Consistency: 𝜃̂𝑛 → 𝜃 in probability as 𝑛 → ∞.
(iii) Asymptotic normality: √𝑛 (𝜃̂𝑛 − 𝜃) converges in distribution, as 𝑛 → ∞, to a normal distribution with mean 0; the variance of this limiting distribution is the asymptotic variance.
1. Asymptotic Unbiasedness
Unbiasedness is nice, but in many cases our estimator, say the MLE, may not be unbiased. Luckily,
when we assess the performance of a sequence of estimators asymptotically, we find that the
biases of most biased estimators disappear. If the bias of an estimator tends to 0 as 𝑛 → ∞,
then the estimator is said to be asymptotically unbiased. More formally, we have
𝐸(𝜃̂𝑛) → 𝜃 as 𝑛 → ∞, for any 𝜃 ∈ Θ.
Note that $S_n^2$, the MLE of $\sigma^2$, is asymptotically unbiased, although it is biased in a finite-sample
case. That is, in the asymptotic sense, $S_n^2$ can also enjoy a nice property.
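The claim about the variance MLE can be made explicit. Assuming the notation $S_n^2 = \frac{1}{n}\sum_{i=1}^{n}(X_i-\bar{X})^2$ and $S_{n-1}^2 = \frac{1}{n-1}\sum_{i=1}^{n}(X_i-\bar{X})^2$, a short worked calculation is:

```latex
\begin{align*}
E(S_n^2) &= \frac{n-1}{n}\,E(S_{n-1}^2) = \frac{n-1}{n}\,\sigma^2, \\
bias(S_n^2) &= E(S_n^2) - \sigma^2 = -\frac{\sigma^2}{n}
  \;\longrightarrow\; 0 \quad \text{as } n \to \infty .
\end{align*}
```

So the bias is nonzero for every finite $n$ but vanishes in the limit.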
Note that although asymptotic unbiasedness and consistency look similar, neither implies
the other in general. However, an estimator will be consistent if it is asymptotically
unbiased AND its variance 𝑉𝑎𝑟(𝜃̂𝑛 ) tends to 0 as 𝑛 → ∞.
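The sufficient condition for consistency can be justified with Markov's inequality applied to the squared error. The following is a brief sketch of that standard argument, for any fixed $\varepsilon > 0$:

```latex
\begin{align*}
P\bigl(|\hat\theta_n - \theta| > \varepsilon\bigr)
  &\le \frac{E(\hat\theta_n - \theta)^2}{\varepsilon^2}
   = \frac{Var(\hat\theta_n) + \bigl[E(\hat\theta_n) - \theta\bigr]^2}{\varepsilon^2}
  \;\longrightarrow\; 0 \quad \text{as } n \to \infty ,
\end{align*}
```

since both the variance and the bias tend to 0. Hence $\hat\theta_n$ converges to $\theta$ in probability.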
This property (asymptotic normality) is important when we want to use 𝜃̂𝑛 to make further
statistical inference about 𝜃, say to find a confidence interval or to conduct a hypothesis test.
Recall that MLE is consistent, asymptotically unbiased, and asymptotically normal. The proofs
will be covered in part II.