Estimation Theory Lecture Notes
Estimation Theory Lecture Notes
Ofer Shayevitz
Tel Aviv University
1 Introduction 3
1.1 The Estimation Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
2 Notation 13
5 Linear Models 45
5.1 Linear Models in AWGN . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
6 Sucient Statistics 59
6.1 Sucient Statistics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
7 Exponential Families 69
7.1 Motivation and Denition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
7.3 Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
1
2 CONTENTS
8 The ML Estimator 79
8.1 Preliminaries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
8.2 ML Asymptotic Optimality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
9 Condence Intervals 85
10 Regularization and the Bayesian Approach 91
10.1 Regularization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
10.2 Regularization in the Linear Model . . . . . . . . . . . . . . . . . . . . . . . . . . 92
10.3 Bayesian Interpretation of Regularization . . . . . . . . . . . . . . . . . . . . . . 96
10.4 Penelty Function or Prior That Encourages Sparsity . . . . . . . . . . . . . . . . 99
Physical Process
Observations x∈X
(Nature)
where s(t) is a known transmitted signal, τ is an unknown delay parameter that we want
to estimate, and W (t) is some additive noise process.
2. Detection: Here there is a nite number of possible hypotheses about the statistical
behavior of the world (and hence of our observations), and we would like to determine
which hypothesis is true. This can be though of as a special case of inference, where the
parameter of interest is discrete. For example: target detection/communication:
s n + Wn target exists (rst hypothesis)
Xn = (1.2)
Wn no target (second hypothesis)
noise
} |{z}
parameter
| {z
where sn is a known signal sequence and Wn is some additive noise process. There are two
hypotheses, which correspond to a binary parameter.
3
4 CHAPTER 1. INTRODUCTION
t0 and timet1 > t0 , i.e., Y = price(t1 ) − price(t0 ). We would like to nd the conditional
distribution pY |X (y|x) from a set of past samples we have, so that we can predict new
values of Y from new values of X . For example, it might be reasonable to model the
possible relations between X and Y by, say,
Y = θX + Z (1.4)
where θ is some unknown parameter, and Z is some r.v. representing noise or uncer-
tainty. Plotting our past observations in the x−y plane, we can try to t a straight line to
the noisy data (this is called linear regression). Here, the parameter itself is not necessarily
what we're interested in; we want to learn the joint probability distribution itself (which is
a function of the parameter of course) so that we can use it for prediction.
y=∆ price(t0 → t1 )
# bids(t0 )
x= # asks(t0 )
Physical
Process x∈X
pX|Θ (x | θ)
θ∈Θ
First, we need to provide a model of the physical process producing our samples. In this course,
we will assume that our observations are drawn from a parametric family
X ∼ p(x | θ) (1.6)
for some unknown parameter θ ∈ Θ. How is the parametric family chosen? It typically comes
from the understanding of the specic setup we are dealing with. The parametric family shouldn't
be too large, to avoid over-tting our data, nor should it be too small, in which case we may not
have a good description of the data within the family. Here, we will assume the model is given
to us.
1.2. THE PROBLEM SETUP 5
Remark 1.1. Throughout this course, we will assume that the data is indeed generated by
p(x | θ) for some θ∈Θ in the parametric family. We will make no guarantees otherwise.
1.2.2 Estimators
An estimator is a function
θ̂ : X → Θ (1.7)
from the observation space to the parameter space. This estimator is sometimes called a point
estimator, as it returns a single parameter as its decision. We will mostly limit our discussion
to point estimators in this course, but will also mention other estimators that return subsets of
the parameter space (which we will call condence intervals). Note that an estimator θ̂ cannot
depend on θ itself, as it is an unknown parameter. However, the performance of an estimator
will depend on the true underlying θ.
l : Θ × Θ → R+ , (1.8)
which assigns a nonnegative number `(θ̂, θ) to the event where we estimate the parameter to
be θ̂ when the true parameter is in fact θ (typically, we will have `(θ, θ) = 0, but this is not
essential). The larger this number, the less happy we are with our estimation. Of course since
our observations are random, and hence our estimator is also random, the loss will be a random
variable `(θ̂(X), θ). We will therefore be concerned with the expected loss
h i
Lθ (θ̂) = Ep(x | θ) l(θ̂(X), θ) . (1.9)
Note that the expectation is taken w.r.t. the distribution p(x | θ) induced by the true (unknown)
underlying parameter. Note also that the expected loss Lθ (θ̂) will generally depend on the
underlying θ.
Here are the two most common examples of loss functions:
• 0-1 loss:
l(θ̂, θ) = 1 θ̂ 6= θ (1.10)
This loss is also called hit-or-miss, and is adequate only for a discrete parameter space Θ.
The associated expected loss is the error probability of our estimator:
Lθ (θ̂) = Pr θ̂(X) 6= θ . (1.11)
• Quadratic loss:
h i
= E kθ̂(X) − θk22 (1.14)
Now that we have a performance criterion, how do we nd the best estimator? A natural
rst guess is to seek the estimator that minimizes the expected loss:
The problem is that this estimator will generally depend on θ, and hence will not be realizable.
For example, suppose θ ∈ R qith a quadratic loss function. Then the above estimator will simply
return θ̂ = θ itself, which yields zero expected loss, but of course cannot be realized. So, we
cannot hope to nd a single estimator that will beat all other estimators regardless of the value
of θ. For example, the estimator θ̂(X) = 5 is clearly the best possible when θ = 5, but is very
bad e.g. when θ = 100.
How do we proceed then? There are several possible ways to circumvent this problem.
Can we choose F such that θ̂F does not depend on θ, and F is rich enough? We will
discuss this in the h i Specically, we will consider the family of unbiased estimators,
sequel.
i.e., where Ep(x | θ) θ̂(X) = θ for any θ. Note that this choice eliminates the constant
• Naturally appealing.
• Law of large numbers (LLN): Given N independent unbiased estimators, we have that
N
1 X
θ̂j → θ with probability 1. (1.17)
N
j=1
• We will be able to nd a non-trivial lower bound for this family, that will be sometimes
achievable.
This estimator is realizable, as it does not depend on the actual θ, and we can still look at
its performance in terms of θ. Note that for the 0-1 loss, the optimal Bayesian estimator is
what you may know of as the Maximum A-Posteriori (MAP) decision rule, which simply
1.3. AN INTRODUCTORY EXAMPLE 7
chooses the parameter value that maximizes the posterior distribution of the observations.
For quadratic loss, this optimal Bayesian estimator is what you may know as the Minimum
Mean Squared Error (MMSE) estimator, which returns the posterior expectation of the
parameter given the observations.
One problem with Bayesian estimation is that it may be unclear what distribution p(θ) we
should choose. Furthermore, even if a reasonable choice exists, are we really interested in
the average performance with respect to p(θ), or with the performance for each and every
θ? We will later see that a prior p(θ) can arise naturally from a process called regularization
of the parameters, which is sometimes employed in order to control over-tting.
We will later see that interestingly, in some cases the minimax approach can be interpreted
as taking the worst case prior in the Bayesian setting, in which case the minimax estimator
can be more easily computed. In general, it may be dicult to nd.
In this course we will be generally interested in the fundamental limits of estimation, i.e., in
what can be said about the best we can do in the settings above. For instance, we would like to
get lower bounds of the form:
That hold for all θ̂ under consideration. We would like these bounds to be non-trivial, and
hopefully achievable (or almost so).
Other things we will briey touch upon include
• Data issues:
Overtting: Too many parameters and not enough data. How do we handle this? Pos-
sible solutions include Bayesian, regularization, Minimum Description Length (MDL).
• Algorithmic issues: Is there some turn-the crank ecient way to come of with good
estimators in a general setting. We will see that to some extent there is, at least in limited
cases or in the limit of large data sets.
i.i.d.
Let us consider the simplest setting, where X ∼ Bern(θ) where θ ∈ [0, 1] is an unknown
parameter. Namely, our parameter space is the unit interval Θ = [0, 1], and for any x ∈ {0, 1}n
n
Y
p(x | θ) = p(xk | θ) (1.21)
k=1
kxkH
=θ (1 − θ)n−kxkH , (1.22)
n
P
where kxkH = xk is the Hamming weight (norm) of x. Given a sample vector X, we would
k=1
like to estimate the underlying θ. To measure our success, we need to rst decide on a loss
function. Let us choose the quadratic loss:
What would be a good estimator in the quadratic loss sense? A natural suggestion: the sample
mean!
n
1X
θ̂(X) = Xk (1.24)
n
k=1
h i 1X n
E θ̂(X) = E [Xk ] = E [X1 ] = θ. (1.25)
n
k=1
= θ − θ2 (1.29)
h i2 h i
E θ̂ − θ = Var θ̂ (1.31)
" n #
1X
= Var Xk (1.32)
n
k=1
1
= · nVar [X1 ] (1.33)
n2
θ(1 − θ)
= . (1.34)
n
We will later show that this is the best possible MSE that can be attained by an unbiased
estimator, uniformly in θ.
Arriving at the sample average estimator θ̂ suggested above was based on intuition. Is there
an underlying principle from which this estimator can be derived, and thereby from which other
optimal unbiased estimators can be obtained in other settings? A naturally appealing idea is to
1.3. AN INTRODUCTORY EXAMPLE 9
choose θ̂ as the parameter whose associated distribution assigns the highest probability to the
observations. This is called the Maximum Likelihood (ML) estimator :
∂
p(x | θ) = dθd−1 (1 − θ)n−d − (n − d)θd (1 − θ)n−d−1 , (1.37)
∂θ
Equating to zero, dividing by θd−1 (1 − θ)d , and solving for θ, we obtain
We have implicitly assumed that the optimizing θ 6= 0, 1 in the derivation above. It is easy to
see that this will always be the case unless xn is the all-zeros (resp. all-ones) sequence, in which
case θ̂ML = 0 (resp. 1), which is also in agreement with the above. We will later see that the
ML approach does not always work so perfectly, but we will also show that it is generally rather
good.
Note that ML can be thought of as putting a uniform prior θ ∼ [0, 1], and then nding the
parameter value that maximizes the posterior distribution (the so-called MAP rule):
p(x | θ)p(θ)
p(θ | x) = (1.40)
p(x)
p(x | θ)
= , (1.41)
p(x)
where (1.40) follows from Bayes rule and (1.41) uses the uniform prior. Therefore in this case
we get:
But if we believe the uniform prior, why use the MAP rule (which is optimal for the 0-1 loss)
and not perform MMSE estimation, which is optimal for the quadratic loss?
h i
θ̂MMSE = argmin Ep(θ)p(x|θ) (θ̂(X) − θ)2 (1.45)
θ̂
= E [θ | X] . (1.46)
n
X
α= xk + 1 (1.47)
k=1
10 CHAPTER 1. INTRODUCTION
n
X
β= (1 − xk ) + 1 = n − α + 2. (1.48)
k=1
α
Moreover, the expectation of a Beta(α, β) distribution is simply
α+β , hence
θ̂MMSE = E [θ | X] (1.49)
α
= (1.50)
α+β
Pn
Xk + 1
k=1
= . (1.51)
n+2
We note that
h i n·θ+1
E θ̂MMSE | θ = (1.52)
n+1
1 − 2θ
=θ+ , (1.53)
n+2
| {z }
bias toward 12
hence the MMSE estimator gains (on average w.r.t. the uniform prior) over the (unbiased)
sample mean estimator by introducing a small bias toward 1/2. The reason should be intuitively
clear: Under the uniform prior, and before we observe any data, the MMSE estimator guesses
that θ is 1/2. Samples that are now obtained attempt to convince us that the true value of
the parameter is the sample mean, and we need to properly weigh our prior belief with the data
we observe. The more samples se get, the less the prior belief (which favors 1/2) will aect our
guess.
We will later see that even in a non-Bayesian setting, it is sometimes helpful, (uniformly in
θ) to introduce a small bias.
√
n θ̂ − θ → N (0, θ(1 − θ)) (1.54)
in distribution. In fact, a stronger version of the CLT (called the Berry-Esseen CLT) bounds the
1
deviation of the tail function in this case by √ , i.e., for any
n
ε > 0,
√ !
n|θ̂ − θ| 1
Pr p > ε ≤ 2 Q(ε) + √ , (1.55)
θ(1 − θ) n
X = 101010101010 · · · (1.58)
1
Our sample mean estimator would return θ̂ = 2 . If we try to predict based on this, it seems
that perhaps we will be wrong half the time (if this behavior indeed continues). Any reasonable
person would have guessed that the next bit is ipped, so why did this happen? The reason of
course is that we have limited our model, i.e. the process that is assumed to have generated our
observations, to be i.i.d. and hence in particular memoryless. What can we do? If we believe
there is some dependence between consecutive time instances, we may try to assume a rst order
Markov model, where say X1 ∼ Bern(1/2), and the transition from Xk to Xk+1 follows
1 − θ0 1 − θ1
θ1
0 1
θ0
In this case, it is easy to check that the ML estimator for the alternating sequence above yields
since under this choice of parameters, the observed sequences has probability 1/2, which is the
highest possible.
But why stop at rst order? If we increase the order of the model then at some point, our
model can become as complicated as the observations themselves. In such a case we will do a
great job describing what we have seen, but typically this will not generalize to predict future
observations. This situation is commonly called over-tting". Intuitively, a good model should
strike a reasonable balance between the eective number of parameters and the associated loss.
We will not explicitly discuss model selection in this course, but we will talk about ways to
counter over-tting eects via a simple process called regularization.
12 CHAPTER 1. INTRODUCTION
2. Notation
Let X be our sample space, associated with a sigma-algebra F. In general, a parametric family
considered in these notes is a collection of probability distributions µθ indexed by θ over F, that
are all absolutely continuous w.r.t. a single measure µ over F. We denote the Radon-Nikodym
dµθ
derivative of µθ (·) w.r.t. µ. by p(x|θ) , dµ (x). Most of the results hold under this general
framework. However, in this exposition we will usually assume that either
• X is discrete , F = 2X , and µ(A) = |A| is the counting measure. In this case, p(x | θ) are
all simply probability mass functions (p.m.f.s) over X.
• X = Rn , F is the Borel sigma-algebra of the open sets induced by the Euclidean metric,
and µ is the Lebesgue measure. In this case, p(x | θ) are all simply probability density
n
functions (p.d.f.s) over R .
We will usually denote the samples by x (also called observations), and sometimes refer to
p(x | θ) as the likelihood function. Generally, x can be a scalar or a vector unless otherwise
restricted, and will be assumed to take values in the sample space X, unless otherwise stated.
We use boldface x to emphasize we mean a vector, in which case we usually mean that x =
(x1 , x2 , . . . , xn )T is an n-dimensional column vector. For the most part, capital letters X (resp.
X ) are used for random variables (resp. random vectors), and the corresponding small letters x
(resp. x) will be used for specic values. One exception is the parameter θ (or parameter vector
θ ), which is always written in small letters (even in the Bayesian case). A vector of parameters
θ = (θ1 , . . . , θd )T is usually taken to be a d-dimensional column vector.
Probabilities, expectations, and variances are denoted by the operators Pr(·), E [·] and Var[ · ]
respectively, and are typically be taken w.r.t to p(x | θ), excluding the Bayesian setting or unless
otherwise stated. This specically means that the expected values, variances, MSEs, covariance
matrices, etc., will typically depend on θ. We sometimes specically emphasize this by writing
θ as a subscript, e.g., MSEθ , but in many cases this dependence is omitted. The interior of a
subset S ⊆ Rd , denoted here by interior(S), is the set of all points s ∈ S for which there exists
a Euclidean ball B of some radius ε > 0 centered at s, such that B ⊆ S .
13
14 CHAPTER 2. NOTATION
3. Minimum Variance Unbiased Estima-
tion
In this chapter, we discuss estimation problems of a real-valued scalar parameter under the
quadratic loss, i.e., where our parametric family is {p(x | θ)}θ∈Θ with Θ ⊆ R, and `(θ̂, θ) =
(θ̂ − θ)2 . Note that x can be a scalar or a vector, and can be discrete or continuous. We focus
on the performance of a restricted set of estimators unbiased estimators.
h i
MSEθ (θ̂) , E (θ̂(X) − θ)2 . (3.1)
where the expectation is taken w.r.t p(x | θ). We say that an estimator θ̂1 dominates another
estimator θ̂2 (in the MSE sense), if
for any θ ∈ Θ (a similar denition holds for any other loss function). We have already seen
that there is no single estimator that dominates all other estimators, for the simple reason that
the constant estimators cannot be dominated. Moreover, there is no nontrivial lower bound
on the MSE attained by a general estimator; the best we can say is that MSEθ (θ̂) ≥ 0, where
again the constant estimators are to blame. We would therefore like to restrict our attention to
some reasonable sub-family of estimators for which we can provide nontrivial lower bounds, and
perhaps even nd an optimal estimator, i.e, once that dominates all the other estimators in the
family.
Let us rst break up the expression for the MSE of an estimator θ̂ as follows:
2
MSEθ (θ̂) = E θ̂ − E[θ̂] + E[θ̂] − θ (3.3)
2 h i
= Var[ θ̂ ] + E[θ̂] − θ + 2 E θ̂ − E[θ̂] E[θ̂] − θ (3.4)
| {z } | {z }
,Bias[ θ̂ ] =0
2
= Var[ θ̂ ] + Bias[ θ̂ ] , (3.5)
where the bias function of an estimator has been dened above. We see that the MSE of an
estimator is composed of two parts: The variance of the estimator, and the square of its bias.
The constant estimators θ̂ = c have zero variance, but their bias function Bias[ θ̂ ] = (c − θ)2 is
unbounded. Getting them and their likes out of the way in the strongest sense possible, we will
15
16 CHAPTER 3. MINIMUM VARIANCE UNBIASED ESTIMATION
restrict our attention in this chapter to the opposite (but much more interesting) extreme, and
consider unbiased estimators. Explicitly, an estimator θ̂ : X → R is called unbiased if Bias[ θ̂ ] = 0
identically, i.e., if
h i
E θ̂(X) = θ (3.6)
for any θ ∈ Θ. Note that here we have allowed the estimator to return any real value, not
necessarily in Θ. We will later see that if we restrict the estimator to be θ̂ : X → Θ, and in case
Θ 6= R, then it may be the case that no unbiased estimator exists.
Example 3.1 .
(Constant in Additive Noise) Suppose that we observe a sequence
Xk = θ + Wk , k = 1, . . . , n (3.7)
where {Wk }nk=1 is an i.i.d sequence of real-valued [Link] with E [Wk ] = 0 and Var [Wk ] = σ 2 (and σ2
is either known, or unknown but not to be estimated). We would like to estimate the unknown
parameter θ ∈ R. Thinking of our observations as a random vector X ∈ Rn , the parametric
family under consideration is
n
Y
p(x | θ) = p(xk | θ) (3.8)
k=1
Yn
= pW (xk − θ) (3.9)
k=1
n
1X
θ̂(x) = xk (3.10)
n
k=1
, Mean[ x ]. (3.11)
Let us nd the expected value of this estimator as a function of the underlying parameter θ:
h i 1X n
E θ̂(X) = E [Xk ] (3.12)
n
k=1
n
1X
= θ (3.13)
n
k=1
= θ. (3.14)
This holds for any θ, hence the suggested estimator is unbiased. Let us compute the expected
quadratic loss incurred by this estimator, i.e., its MSE, as a function of the underlying parameter
θ.
Var [X1 ]
= (3.18)
n
σ2
= . (3.19)
n
Where (3.15) follows since the estimator is unbiased, (3.17) holds since the samples are indepen-
dent, (3.18) holds since the samples are identically distributed, and (3.19) holds by assumption.
Note that the MSE in this case does not depend on the underlying parameter θ; this is typically
not the case in general.
In what follows, we will give special attention to the case of additive white Gaussian noise
i.i.d i.i.d
(AWGN), i.e., where {Wk }nk=1 ∼ N (0, σ 2 ). {Xk }nk=1 ∼ N (θ, σ 2 )
In this case clearly and
n
Y 1 1 2
p(x | θ) = √ exp − (xk − θ) . (3.20)
2πσ 2 2
k=1
We will soon see that the MSE attained by the sample mean is the best possible for unbiased
estimators in the AWGN case, uniformly in θ.
Can we uniformly improve the MSE of the sample mean estimator by introducing a little bit
of bias, i.e., nd a biased estimator that dominates it? A simple attempt to do so is by applying
constant shrinkage, i.e.,
for some 0 < α < 1. Let us nd the best shrinkage factor α:
2
MSEθ [ θ̂ ] = Var[ θ̂ ] + Bias[ θ̂ ] (3.22)
σ2
= α2 · + (1 − α)2 θ2 . (3.23)
n
Taking the derivative w.r.t α and equating to zero, we get:
∂ σ2
MSEθ [ θ̂ ] = 2α · − 2(1 − α)θ2 = 0. (3.24)
∂α n
Solving for α we nd the the optimum is obtained for
θ2
αopt = σ2
. (3.25)
θ2 + n
and indeed we see that αopt < 1. Unfortunately, αopt depends on the unknown parameter θ and
is hence unrealizable. It is also easy to check that taking any xed value of α will never result
in an estimator that dominates the sample mean (though it will be better in a certain range of
parameters, depending on α). Later in the course, we will see that in some (higher-dimensional)
cases a (smarter) shrinkage approach can in fact generate biased estimators that dominate the
unbiased ones.
Still, there are a few things to note. First, αopt → 1 as n → ∞ for any xed value of θ.
2
This can be thought of as a high SNR regime, since in this case the `signal energy θ is much
2
larger than the eective noise variance σ /n. Following this interpretation, we note that α opt
looks like the MMSE estimation Wiener coecient. Indeed, in the Gaussian Bayesian case, the
θ2 E θ2
unknown factor is replaced with the computable under the Gaussian prior over the
parameter. Moreover, if for some reason we know or believe that it makes little sense for θ to be
very large, we may still want to use αopt by plugging in some large value of θ above which would
not tend to trust the estimation. Such a process is called regularization of the estimation, and
we will discuss it in more detail later. From the above discussion we can already suspect that
regularization is similar to the Bayesian approach; this will be made precise in the sequel.
18 CHAPTER 3. MINIMUM VARIANCE UNBIASED ESTIMATION
Clearly, in order to drive this estimator to innity we need to corrupt at least half the samples,
hence its breakdown point is 1/2. How much did we pay for this great improvement in robustness?
It can be shown that for the median estimator (assuming now again there are no outliers..)
σ2
MSEθ [θ̂] ≈ α · (3.27)
n
where
1
α= (3.28)
4σ 2 p2W (0)
π
and where pW (0) is the density of the noise at zero. In the AWGN case we get α= 2 ≈ 1.57 . . .,
which is not too bad. Note that α < 1 is also possible; for example, if the noise is Laplace
distributed with zero mean and variance σ2, then α = 1/2, hence the median estimator beats
the sample mean.
Other ways to achieve some level of robustness is to rst try to detect and remove outliers
before applying our estimator. In the aforementioned case of a constant in additive noise, we may
want to rst remove a fraction γ of the largest samples (in absolute value) and then apply our
sample mean estimator to the samples that remain (properly scaled to maintain unbiasedness).
This is called the trimmed mean estimator, and it clearly has a breakdown point of γ ; we can make
the estimator more robust at the expense of using less samples (and hence suering in MSE). In
higher dimensions, designing a good outlier rejection procedure becomes more challenging.
Example 3.2. We observe a single sample X ∼ Unif [0, 1/θ], where θ > 0 is some unknown
parameter, namely
h i h i
Var θ̂ Var θ̂
θ̂2 θ̂2
θ̂1
θ̂1
No
MVU θ̂0
MVU
θ θ
Suppose θ̂(x) is an unbiased estimator of θ, and assume for simplicity that it is absolutely
continuous in x (an assumption that can be dropped). Then
1
h i Zθ
E θ̂(X) = θ θ̂(x)dx (3.30)
0
= θ, (3.31)
where (3.31) follows from the unbiasedness assumption. The above must hold for any θ > 0, and
hence
1
Zθ
θ̂(x)dx = 1. (3.32)
Za
θ̂(x)dx = 1. (3.33)
for any a > 0. Since this is a functional identity, we can take the derivatives on both sides, to
obtain
Za
θ̂(x)dx = a · θ̂(0), (3.36)
for any a > 0, in contradiction to (3.33). Hence, there cannot be an unbiased estimator for θ in
this case.
20 CHAPTER 3. MINIMUM VARIANCE UNBIASED ESTIMATION
i Z 1
h θ dx
E θ̂ML (X) = (3.40)
0 x
=∞ (3.41)
where W ∼ N (0, σ 2 ) and σ2 is known. The sample mean estimator in this case is trivial:
θ̂(x) = x. (3.43)
√
Intuitively, why is the MSE equal to σ2? We might heuristically interpret MSE as the
resolution of our ability to guess the value of θ. Following this line of argument, we may ask
ourselves how small can δ > 0 be so that we can reasonably separate between the parameter
values of θ and θ + δ. In the case of constant in AWGN, the distribution of the sample under
these two values of the parameter clearly changes more drastically the smaller σ2 is, as depicted
in Figure 3.2. More accurately, when we perturb the parameter from θ to θ + δ, the probabilities
of being inside any given interval slightly changes, an eect that we might hope to recognize (in
the discrete case, the change is in the probability of a symbol). To heuristically quantify this, let
us nd by what (multiplicative) factor these probabilities change (why not additive?), say for a
small interval [a, a + ]:
Pr (a ≤ X ≤ a + | θ + δ) · p(a | θ + δ)
≈ (3.45)
Pr (a ≤ X ≤ a + | θ) · p(a | θ)
3.2. SCORE AND FISHER INFORMATION 21
p(x | θ) p(x | θ)
Large σ
2 Small σ2
x x
θ θ+δ θ θ+δ
(b)
(a)
Figure 3.2: (a) The pdf is less sensitive to variation in θ (b) pdf is more sensitive
.
The logarithm log(·) above (and throughout the course) is taken w.r.t the natural base e. We
can see that the quantity that determines the sensitivity of the model to perturbations in the
parameter at a given θ and x is a quantity sometimes called the score function
∂
score [x | θ] , log p(x | θ) (3.49)
∂θ
p(x | θ) is typically called the likelihood function, and hence the score function is the derivative of
the log-likelihood function w.r.t the parameter. Of course, in reality we observe a random value
of the score, i.e., we see the r.v score [X | θ], where X ∼ p(x | θ). This random value of the score
function is typically called simply the score.
Intuitively, if we knew the distribution of the score, we should expect to be able to say
something about the performance of an estimator. For example, if the score tends to be small
(in absolute value), we expect that our ability to estimate the value of the parameter will not
be so good. Conversely, if the score tends to be large, then in principle we might be able to
estimate the parameter rather well.
Let us nd the distribution of the score in the case of constant in AWGN (n = 1). The
likelihood function is given by
1 1
p(x | θ) = √ exp − 2 (x − θ)2 (3.50)
2πσ 2 2σ
and the log-likelihood function is thus
1 1
log p(x | θ) = − log 2πσ 2 − 2 (x − θ)2 (3.51)
2 2σ
Taking the derivative of the above w.r.t the parameter, we nd the score function:
∂ log p(x | θ)
score[ x | θ ] = (3.52)
∂θ
22 CHAPTER 3. MINIMUM VARIANCE UNBIASED ESTIMATION
x−θ
= (3.53)
σ2
Since X ∼ N (θ, σ 2 ), we see that the score (as a r.v) is
score[ X | θ ] ∼ N 0, 1/σ 2 .
(3.54)
E [score[ X | θ ]] = 0, (3.55)
a property which will turn out to hold in more general setups, and the variance of the score is
1
Var [score[ X | θ ]] = . (3.56)
σ2
We see that the meaningful quantity here is the variance of the score, which is inversely propor-
tional to the noise level. This makes intuitive sense, since a higher noise level implies a smaller
score, which should mean poorer resolution. In fact, we see that the MSE (or variance) attained
by the sample mean estimator (which we claimed would be shown to dominate all unbiased
estimators) is precisely the reciprocal of the variance of the score. As it turns out, this is no
coincidence. In much more generality, the MSE (or variance) attained by any unbiased estima-
tor will be lower bounded by the reciprocal of the variance of the associated score (which will
typically depend on the value of the parameter, unlike the example above).
Let us now return to the general case of an arbitrary parametric family with a scalar real-
valued parameter. The variance of the score, due to its ostensible importance, is known by
the name the Fisher information. Traditionally, the Fisher information is dened as the second
moment of the score, which is equal to its variance when the expected score is zero, and the
latter holds under some regularity conditions (which are necessary anyway). Namely, the Fisher
information associated with the parametric family is
h i
I(θ) , E (score [X | θ])2 (3.57)
" #
∂ log p(X | θ) 2
=E . (3.58)
∂θ
Please note that p(x | θ) can be a p.d.f or a p.m.f, and that x can be a scalar or a vector.
In order for the Fisher information to play its role in lower bounding the MSE in unbiased
estimation, the associated parametric family should satisfy some mild regularity conditions.
Specically, we will say that our parametric family is m-regular, if the following conditions are
satised:
(i) The parameter space Θ⊆R is a union of a nite number of (possibly innite ) intervals.
(ii) The Fisher information exists and is nite for any θ ∈ interior(Θ).
∂k ∂ k p(x | θ)
Z Z
t(x)p(x | θ)dx = t(x) dx (3.59)
∂θk ∂θk
for k ∈ {1, . . . , m} and any θ ∈ interior(Θ), wherever the right hand side (r.h.s) exists and
is nite.
3.2. SCORE AND FISHER INFORMATION 23
Most of our results will be for 2-regular families, which will be referred to simply as regular.
We note that some of our results also hold under slightly weaker conditions, e.g., for 1-regular
families. For simplicity of exposition we do not discuss these issues. The following lemma (which
we do not prove here) provides sucient conditions for the third regularity condition to hold.
Lemma 3.3. Any one of the following conditions implies the regularity condition (3.59):
(i) p(x | θ) is a p.d.f that is an m times continuously dierentiable in θ, and has a with bounded
support in x that does not depend on θ .
(ii) p(x | θ) is a p.d.f with innite support that is m times continuously dierentiable in θ, and
the integral on the r.h.s of (3.59) converges uniformly for any θ ∈ Θ.
Lemma 3.4. For a regular parametric family, the following properties hold:
(i) I(θ) ≥ 0 (this holds also without regularity, whenever the Fisher information exists).
E [score[ X | θ ]] = 0 (3.60)
Proof. The rst property follows trivially from the denition. For the second property:
Z ∞
∂
E [score[ X | θ ]] = p(x | θ) · log p(x | θ)dx (3.64)
−∞ ∂θ
Z ∞ ∂
∂θ p(x | θ)
= p(x | θ) · dx (3.65)
p(x | θ)
Z−∞
∞
∂
= p(x | θ)dx (3.66)
−∞ ∂θ
Z ∞
∂
= p(x | θ)dx (3.67)
∂θ −∞
∂
= 1 (3.68)
∂θ
= 0, (3.69)
where (3.67) holds due to the regularity property. We see that as in the AWGN example, the
expected score is indeed zero. The variance of the score is therefore equal to its second moment,
which by denition is the Fisher information. To prove the third claim, write:
" #
∂
∂ 2 log p(x | θ) ∂ ∂θ p(x | θ)
2
= (3.70)
∂θ ∂θ p(x | θ)
24 CHAPTER 3. MINIMUM VARIANCE UNBIASED ESTIMATION
∂2
∂
2
p(x | θ) · ∂θ2
p(x | θ) − ∂θ p(x | θ)
= (3.71)
p2 (x | θ)
#2
∂2
"
∂
∂θ2
p(x | θ) ∂θ p(x | θ)
= − (3.72)
p(x | θ) p(x | θ)
∂2 2
p(x | θ)
∂θ2 ∂ log p(x | θ)
= − . (3.73)
p(x | θ) ∂θ
Z ∞ 2
∂ 2 log p(x | θ)
∂
−E = I(θ) − p(x | θ)dx (3.74)
∂θ2 −∞ ∂θ 2
Z ∞
∂2
= I(θ) − 2 p(x | θ)dx (3.75)
∂θ −∞
∂2
= I(θ) − 2 1 (3.76)
∂θ
= I(θ), (3.77)
Theorem 3.5 (Cramer-Rao) . Let {p(x | θ)}θ∈Θ⊆R be a regular parametric family with
Fisher information I(θ). Let θ̂ : X → R be any measurable unbiased estimator for θ.
Then:
1
MSEθ ( θ̂ ) = Var[ θ̂(X) ] ≥ . (3.78)
I(θ)
In addition, an unbiased estimator attaining the bound above with equality for all θ exists,
if and only if the score function can be written as
for some functions g(x) and C(θ). Moreover, if (3.79) holds then C(θ) = I(θ), and
θ̂(x) = g(x) is an MVU estimator with the variance function 1/I(θ).
Remark 3.6. Clearly, if some unbiased estimator achieves the CRLB, then this estimator is
MVU. However, it is possible that an MVU estimator exists but does not achieve the CRLB (in
which case the CRLB is clearly loose). In any case, and even if no MVU exists, the CRLB is
still a valid lower bound for the MSE of unbiased estimators.
Remark 3.7. Note that in the CRLB theorem we have allowed the estimator to return any real
value, not necessarily in Θ. This is of course generally unreasonable, but it can be shown that in
many cases, when Θ 6= R, then requiring the estimator to return values in Θ only may result in
the non-existence of any unbiased estimators. One such case is discussed later in Example 3.11.
3.3. CRAMER-RAO LOWER BOUND (CRLB) 25
Proof. We assume below that p(x | θ) is a p.d.f, but the proof holds with the trivial modications
for a p.m.f. For simplicity, we write S = score[ X | θ ] for the score r.v in this proof. As we have
seen, E [S] = 0 and Var [S] = I(θ). Now, let θ̂(x) be any estimator (not necessarily unbiased).
Appealing to the CauchySchwarz inequality for the covariance inner-product, the covariance
between the score and the estimator can be upper bounded by
2 h i
Cov[ S, θ̂(X) ] ≤ Var [S] · Var θ̂(X) (3.80)
h i
= I(θ) · Var θ̂(X) , (3.81)
2
h i Cov[ S, θ̂(X) ]
Var θ̂(X) ≥ (3.82)
I(θ)
h i
Cov[ S, θ̂(X) ] = E S · θ̂(X) − E[S] · E[θ̂(X)] (3.83)
h i
= E S · θ̂(X) (3.84)
Z ∞ ∂
p(x | θ)
= p(x | θ) · ∂θ · θ̂(x)dx (3.85)
−∞ p(x | θ)
Z ∞
∂
= θ̂(x) p(x | θ)dx (3.86)
−∞ ∂θ
Z ∞
∂
= θ̂(x)p(x | θ)dx (3.87)
∂θ −∞
∂ h i
= E θ̂(X) (3.88)
∂θ
∂
= (θ + Bias[ θ̂ ]) (3.89)
∂θ
∂
=1+ Bias[ θ̂ ], (3.90)
∂θ
where (3.87) holds due to the regularity property. Combining (3.81) and (3.90) we obtain
h i 2
MSEθ θ̂ = Var[ θ̂ ] + Bias[ θ̂ ]) (3.91)
h i2
∂
1 + ∂θ Bias[ θ̂ ] 2
≥ + Bias[ θ̂ ]) . (3.92)
I(θ)
Specically, for an unbiased θ̂ we have Bias[ θ̂ ], hence proving the lower bound in the Theorem.
Note that the above bound is more general, lower bounding the MSE of all estimators with a
given bias function (this is not too useful outside the unbiased case, however).
To prove the additional claims, note that the CauchySchwarz inequality is tight if and only
if the two random variables (with means removed) are co-linear, i.e., in our case (noting that the
constant can depend on θ):
h i
S − ES = C(θ) θ̂(X) − E θ̂(X) , (3.93)
26 CHAPTER 3. MINIMUM VARIANCE UNBIASED ESTIMATION
and hence
score[ X | θ ] = C(θ) θ̂(X) − θ (3.94)
holds with probability 1 for some C(θ). Furthermore, using the second derivative characterization
of the Fisher information, we get
∂2
I(θ) = −E log p(X | θ) (3.95)
∂θ2
∂
= −E score[ X | θ ] (3.96)
∂θ
∂C(θ)
= −E − (θ̂(X) − θ) − C(θ) (3.97)
∂θ
= C(θ). (3.98)
Example 3.8 .
(constant in AWGN) Recall the model Xk = θ + Wk for k = 1, . . . , n, where
i.i.d
{Wk } ∼ N (0, σ 2 ). The likelihood function is
n
Y 1 1 2
p(x | θ) = √ exp − 2 (xk − θ) . (3.99)
2πσ 2 2σ
k=1
n
n 1 X
log p(x | θ) = − log(2πσ 2 ) − 2 (xk − θ)2 . (3.100)
2 2σ
k=1
∂ log p(x | θ)
score[ x | θ ] = (3.101)
∂θ
n
1 X
= 2 (xk − θ) (3.102)
σ
k=1
n
= 2 (Mean[ x ] − θ) , (3.103)
σ
where Mean[ x ] is the sample mean. We note that Mean[ X ] ∼ N (θ, σ 2 /n), hence the distribution
of the score is
n2 σ 2
score[ X | θ ] ∼ N 0, 4 (3.104)
σ n
n
= N 0, 2 (3.105)
σ
Hence, the Fisher information is
n
I(θ) = , (3.106)
σ2
and the CRLB implies that for any unbiased estimator θ̂:
1
≥ (3.108)
I(θ)
σ2
= . (3.109)
n
This is achieved in equality by the sample mean estimator θ̂(x) = Mean[ x ] (which in this case
is also the ML estimator). Hence, the sample mean is the MVU estimator.
In fact, there was an easier way to gure this out without even knowing about the sample
mean estimator in the rst place. We have seen that the score function is
n
score[ x | θ ] = (Mean[ x ] − θ) , (3.110)
σ2
which is exactly of the MVU form discussed in the CRLB theorem. Once that is observed, we
n
can immediately deduce the I(θ) = σ2
and that Mean[ x ] is the MVU estimator.
Example 3.9 (Biased coin). Recall that here the i.i.d observations are drawn from the Bernoulli
i.i.d
distribution with parameter θ, i.e., {X1 , . . . , Xn } ∼ Bern(θ). The likelihood function is
n
Y
p(x | θ) = p(xk | θ) (3.111)
k=1
kxk
=θ · (1 − θ)n−kxkH (3.112)
where kxkH is the Hamming weight of the vector x ∈ {0, 1}n . The log-likelihood function is
∂ log p(x | θ)
score[ x | θ ] = (3.114)
∂θ
kxkH n − kxkH
= − (3.115)
θ 1−θ
kxkH (1 − θ) − (n − kxkH )θ
= (3.116)
θ(1 − θ)
kxkH − nθ
= (3.117)
θ(1 − θ)
n Mean[ x ] −θ .
= (3.118)
θ(1 − θ) | {z }
| {z } MVU
I(θ)
1
≥ (3.120)
I(θ)
θ(1 − θ)
= (3.121)
n
and the sample mean is the MVU estimator. Note that we could also have looked at the distri-
1
bution of the score, which is clearly Binom(n, θ) that is centered and then scaled by
θ(1−θ) . This
would also immediately yield the expression for the Fisher information.
28 CHAPTER 3. MINIMUM VARIANCE UNBIASED ESTIMATION
Lemma 3.10. Let {p(x | θ)}θ∈Θ be a regular parametric family with Θ = R, and assume that
I(θ) > 0 for any θ ∈ R. If an ecient estimator exists, then the ML estimator is ecient.
where the last step holds since we are assuming that an ecient estimator exists, in which case
g(x) is ecient. Equating the above to zero and recalling that I(θ) > 0, we obtain that
∂
log p(x | θ) = I(θ)[g(x) − θ] (3.126)
∂θ |{z}
>0
0
θ = g(x)
= > 0 θ < g(x) (3.127)
< 0 θ > g(x).
Thus θ̂ML = g(x) is a unique global maximum of the likelihood function, hence g(x) is the ML
estimator.
The reason we have required the parameter space to be the entire real line is obvious from the
proof; if Θ is a strict subset of R, then it may be that g(x) 6∈ Θ, in which case the maximum of
the likelihood function is attained at some boundary point of Θ. In such a case, the ML estimator
will be biased, as will be any other estimator that is constrained to return values inside Θ. This
is demonstrated in the following simple example.
Example 3.11. Consider the constant in AWGN model with a single sample, i.e., X = θ + W ,
where W ∼ N (0, σ 2 ). However, this time let us restrict the parameter space to be (say) the unit
interval Θ = [0, 1]. Then it should be clear that
x x ∈ [0, 1]
θ̂ML (x) = 0 x < 0 (3.128)
1 x>1
h i Z1
E θ̂ML (X) = x · p(x | θ)dx + Pr(X > 1) (3.129)
0
3.5. MORE ON THE FISHER INFORMATION 29
can be easily veried to be biased. For example, it always holds that E[θ̂ML (X)] > 0, even when
θ = 0.
In fact, a far more general statement holds in this setup: any estimator θ̂ : R → [0, 1] is
necessarily biased. To see this, note that if θ̂ is unbiased, then in particular when θ = 0 it must
hold that E[θ̂(X)] = 0. But since we have restricted θ̂(x) to give values in [0, 1] only, then clearly
the only way for E[θ̂(X)] = 0 to hold is if θ̂(X) = 0 with probability 1. This means that θ̂ is a
constant estimator, hence is obviously biased. Note that the estimator θ̂(x) = x is of course still
unbiased; it achieves this by sometimes returning values that are outside the [0, 1] parameter
space.
One might wonder why the ML estimator for in biased coin example, where Θ = [0, 1] is not
the entire real line, is ecient. From the discussion above, the following immediate corollary
claries this point.
Corollary 3.12. Let {p(x | θ)}θ∈Θ be a regular parametric family, and assume that I(θ) > 0 for
any θ ∈ interior(Θ). Suppose the score function can be written as score[ x | θ ] = C(θ) · (g(x) − θ).
If Pr(g(X) ∈ Θ) = 1, then the ML estimator is ecient (and is given by g(x)).
p(X)
D(pkq) , EX∼p(x) log (3.130)
q(X)
Z
p(x)
= p(x) log dx (3.131)
q(x)
In other words, D(pkq) is the expected log-likelihood ratio, where the expectation is taken with
respect to p.
(i) D(pkq) ≥ 0, with equality if and only if p(x) = q(x) (almost everywhere).
(iii) Data Processing Inequality (DPI): Let pX × pY |X and qX × pY |X be two joint distributions
over the same sigma-algebra of the product sample space X × Y , and let pY and qY be the
associated Y -marginals, respectively. Then
pX pY
pY |X
qX qY
pY |X
Theorem 3.14. Let {p(x | θ)}θ∈Θ⊆R be a 3-regular family with Fisher information I(θ),
and assume that
∂ 3 log p(x | θ)
sup E < ∞. (3.133)
θ∈Θ ∂θ3
(i) The Fisher information is the curvature of the KL divergence w.r.t. translations.
Namely, for any θ, δ such that [θ, θ + δ] ⊆ interior(Θ),
δ2
I(θ) + O |δ 3 | .
D (p(x | θ) k p(x | θ + δ)) = (3.134)
2
(ii) DPI for the Fisher information: Let X ∼ p(x | θ), and let Y | X = x ∼ p(y | x). Then
, assuming that the associated p(y|θ) satises the same regularity conditions,
where I Y (θ) is the Fisher information pertaining to the parametric family p(y | θ).
0 ∂ p(X | θ)
g (0) = E log (3.138)
∂δ p(X | θ + δ)
δ=0
∂ p(X | θ)
=E log (3.139)
∂δ p(X | θ + δ)
δ=0
3.5. MORE ON THE FISHER INFORMATION 31
∂
= −E log p(X | θ + δ) (3.140)
∂δ
δ=0
∂
= −E log p(X | θ) (3.141)
∂θ
= −E [score[ X | θ ]] (3.142)
= 0, (3.143)
∂2
g 00 (0) = −E log p(X | θ + δ) (3.144)
∂δ 2
δ=0
2
∂
= −E log p(X | θ) (3.145)
∂θ2
= I(θ). (3.146)
∂3
(3) 0
|g (δ )| = E log p(X | θ + δ) (3.147)
∂δ 3
δ=δ 0
3
∂
= E log p(X | θ) (3.148)
∂θ3
θ=θ+δ 0
hence by the assumption there is some universal C<∞ such that |g (3) (δ 0 )| < C for any θ
0
and δ . Writing the Taylor series with the remainder term, we have that for some δ 0 ∈ [0, δ]
1 g (3) (δ 0 ) 3
g(δ) = g(0) + δg 0 (0) + δ 2 g 00 (0) + ·δ , (3.149)
2 6
C 3
and since the remainder term is bounded in absolute value by
6 |δ |, the claim follows.
(ii) Writing the KL-divergence using the Fisher information for the family p(x | θ) we have that
δ2
D(p(x | θ)kp(x | θ + δ)) = I(θ) + O(|δ|3 ), (3.150)
2
and similarly for the family p(y | θ) therefore,
δ2 Y
D(p(y | θ) k p(y | θ + δ)) = I (θ) + O(|δ|3 ). (3.151)
2
Using the DPI for the KL divergence, we have that
δ2 δ2
I(θ) + O(|δ|3 ) ≥ I Y (θ) + O(|δ|3 ). (3.152)
2 2
Dividing by δ 2 /2 yields
Remark 3.15. The DPI for the Fisher information is intuitively appealing, claiming that it
can never be increased by processing the samples. In cases where the CRLB is tight this is
immediately clear, but it turns out to hold in more generality.
Theorem 3.16 (Additivity of I(θ) for independent samples). Let {Xk }nk=1 be independent sam-
ples drawn from possibly distinct models controlled by the same parameters, i.e., X ∼ pk (x | θ),
where pk (x | θ) are all regular families. Let I
(n) (θ) be the Fisher information pertaining to all the
samples, and let Ik (θ) be the Fisher information pertaining to the sample Xk . Then p(x | θ) is
regular, and
n
X
(n)
I (θ) = Ik (θ). (3.154)
k=1
n
!
Y
log p(x | θ) = log pk (xk | θ) (3.155)
k=1
n
X
= log pk (xk | θ). (3.156)
k=1
∂
score[ x | θ ] = log p(x | θ) (3.157)
∂θ
n
X ∂
= log pk (xk | θ) (3.158)
∂θ
k=1
n
X
= score[ xk | θ ]. (3.159)
k=1
The expectation of the score is clearly zero. The Fisher information, which is the variance of the
score, is given by
Xn
= Var [score[ Xk | θ ]] (3.161)
k=1
n
X
= Ik (θ). (3.162)
k=1
where in (3.161) we have used the fact that the samples are independent.
Remark 3.18. Together with the CRLB, the additivity property implies that the variance of
an unbiased estimator for a regular family cannot decrease faster than Ω(1/n) in the number of
i.i.d samples. We will later see that in non-regular cases, a faster decay is sometimes possible.
3.6. CRLB FOR GENERAL SIGNALS IN AWGN 33
Remark 3.19. What about dependent samples? One might guess that the Fisher information
Pn
is sub-additive, namely that I (n) (θ) ≤ k=1 Ik (θ). However, this is not the case. For example,
take the following dependent two-sample model
X1 = θ + W (3.164)
X2 = θ − W (3.165)
where W ∼ N (0, 1) . The marginal Fisher information for both samples is clearly identical and
equal to I1 (θ) = I2 (θ) = 1. However, for the two samples together, the Fisher information must
be
Xk = gk (θ) + Wk , k = 1, . . . , n, (3.167)
i.i.d
where {Wk } ∼ N (0, σ 2 ) and gk (θ) is a function (time series) parameterized by θ. Before, we
solved the case of a constant in AWGN, where the function gk (θ) = θ was the identity function,
independent of the time index. The associated likelihood function is given by
n
( )
1 1 X
p(x | θ) = n exp − 2 (xk − gk (θ))2 . (3.168)
(2πσ 2 ) 2 2σ
k=1
∂ log p(x | θ)
score[ x | θ ] = (3.169)
∂θ
n
1 X ∂gk (θ)
= 2 (xk − gk (θ)) · . (3.170)
σ ∂θ
k=1
h i σ2
Var θ̂(X) ≥ n 2 . (3.175)
P ∂gk (θ)
∂θ
k=1
34 CHAPTER 3. MINIMUM VARIANCE UNBIASED ESTIMATION
The denominator above can be though of as the sum of sensitivities of the signal to perturbation
of the parameter. Note that for the simple case of gk (θ) = θ, we get the σ 2 /n we have already
found before.
in AWGN, and wish to estimate f. The amplitude a and also the phase (set to zero here) are
assumed to be known. Write
∂gk (f )
= 2πka · sin(2πf k) (3.177)
∂f
h i σ2
Var fˆ ≥ n . (3.178)
(2πa)2 k 2 sin2 (2πf k)
P
k=1
a2
Dening SNR , σ2
, we can also write
" n
#−1
h i X 2
Var fˆ ≥ SNR · (2πk sin(2πf k)) , (3.179)
k=1
Note that later samples (large k) can be much more inuential than earlier ones (small k ). This
might seem weird at rst sight, but the reason is simple: If we are trying to decide between f
and f +δ for some small deviation δ, the dierence between the resulting sinusoids becomes
more evident as time progresses.
Theorem 3.21. Let {p(x | θ)}θ∈Θ⊆R be a regular parametric family with Fisher information
I(θ). Let ĝ(x) be an unbiased estimator, i.e., one such that E [ĝ(X)] = g(θ) for any θ ∈ Θ, for
a continuously dierentiable function g(θ) over Θ. Then
Proof. Let us dene a new parameter φ = g(θ), and apply the CRLB to φ. Without loss of
generality, we will assume that g is invertible. If it is not, then we can restrict the discussion to
a small enough neighborhood of θ, in which g is either invertible or constant (and the constant
case is not interesting, since the bound then says the variance is nonnegative). It is not dicult
3.7. TRANSFORMATION OF PARAMETERS 35
to show that the parametric family when parameterized by φ is also regular. Thus any unbiased
estimator φ̂(x) of φ (and hence of g(θ)) must satisfy
1
Var[ φ̂ ] ≥ (3.182)
I(φ)
e
where we wrote e for the Fisher information pertaining to the new parametric family pe(x | φ) =
I(φ)
p(x | θ) where θ = g −1 (φ). Thus, we only need to relate this Fisher information to the Fisher
information I(θ) of the original family. To that end, the score function pertaining to φ is given
by
where we have used the invertibility assumption in (3.185). The claim now follows since the
Fisher information is the second moment of the score.
Alternatively, we can prove this by returning to the CRLB proof, and looking at the covariance
between the score and ĝ(X). On the one hand, we have that
in light of the CauchySchwarz ineuqality. On the other hand, using regularity we get that
∂
Cov[ score[ X | θ ], ĝ(X) ] = E [ĝ(X)] (3.188)
∂θ
∂g(θ)
= (3.189)
∂θ
where we have used the assumption that the estimator is unbiased in the last transition. Com-
bining the two expressions, the bound follows.
Remark 3.22. Note that the factor of the derivative squared makes sense. It means that the
more sensitive φ is to variations in θ, the more we expect to be penalized in MSE when estimating
φ. The derivative is in fact just a scaling factor, which locally captures the change of scale between
θ and φ.
But how about using a plug-in approach? Namely, if θ̂ is an MVU / ecient estimator for θ,
why not just use g(θ̂)?
Example 3.23 .
(Constant in AWGN) Recall the model Xk = θ + Wk where
i.i.d
{Wk }nk=1 ∼
N (0, σ 2 ). We have seen that the sample mean estimator, which was also the ML estimator, is
ecient (and hence also MVU). Suppose now that we want to estimate φ = θ2 (and assume we
restrict out attention to θ>0 only). We can use the plug-in
2
estimator φ̂ = (Mean[ X ]) . Its
expected value is
h i
E φ̂ = E (Mean[ X ])2
(3.190)
36 CHAPTER 3. MINIMUM VARIANCE UNBIASED ESTIMATION
σ2
= + θ2 (3.192)
n
6= θ2 . (3.193)
This is in fact the case in general the unbiasedness of an estimator (and hence its eciency)
is generally violated by the application of a function. This is of course not to say that the plug-
in estimator is bad; it is simply biased. The only exception is ane transformations, for which
unbiasedness and eciency (had it been satised) are maintained. Let θ̂ be an unbiased estimator
for θ. Consider the ane transformation φ = aθ + b, and let us evaluate the plug-in estimator
φ̂(x) = aθ̂(x) + b.
h i h i
E φ̂(X) = aE θ̂(X) + b (3.194)
= aθ + b (3.195)
= φ. (3.196)
Hence, in this case the plug-in estimator is also unbiased. What about the MSE?
h i h i
Var φ̂(X) = a2 Var θ̂(X) (3.197)
2
∂φ h i
= · Var θ̂(X) . (3.198)
∂θ
h i ∂g(θ) 2 h i
Var g(θ̂(X)) ≈ Var θ̂(X) . (3.203)
∂θ
which is approximately ecient. We will handle this topic more accurately later on, when we
discuss the ML asymptotic.
4. Vector MVU Estimation
Suppose we have a vector of parameters θ ∈ Rd , and a parametric family {p(x | θ)} where x is
either a scalar or a vector, discrete or continuous.
d
We say that an estimator θ̂ : X → R is called
unbiased if
h i
E θ̂(X) = θ. (4.1)
How would we generalize the MVU property? First, a quick reminder from linear algebra.
v T Av ≥ 0 (4.2)
Lemma 4.1. Let A be a real-valued, symmetric, square matrix. Then the following state-
ments are equivalent:
(i) A 0.
Lemma 4.2. The following are some operations that preserve the p.s.d property:
Below we write A0 to mean that the associated inequalities are strict, in which case
A is said to be positive denite (p.d.).
Since our estimators are now multidimensional, it may not be immediately clear how to
compare them. Recall the covariance matrix of a random column vector U is
Cov [U ] , E (U − E [U ]) · (U − E [U ])T .
(4.3)
We say that an unbiased estimator θ̂1 (x) dominates another unbiased estimator θ̂2 (x), if
h i h i
Cov θ̂2 (X) Cov θ̂1 (X) . (4.4)
This clearly reduces to the denition of domination for scalar unbiased estimators. But what
does multidimensional domination mean? It is in fact a very strong requirement.
37
38 CHAPTER 4. VECTOR MVU ESTIMATION
Lemma 4.3. Let θ̂1 (x) and θ̂2 (x) be two unbiased estimators. Then θ̂1 (x) dominates θ̂2 (x) if
and only if
h i h i
Var v T θ̂1 (X) ≤ Var v T θ̂2 (X) (4.5)
h i h i
ˆ
v T Cov θ1 (X) ˆ
v ≤ v T Cov θ2 (X) v (4.9)
h i h i
holds for any v, which is synonyms with the matrix Cov θˆ2 (X) − Cov θˆ1 (X) being p.s.d,
h hi i
or equivalently with Cov θˆ2 (X) Cov θˆ1 (X) . This completes the proof.
∂ log p(x | θ)
scorei [x | θ] , (4.11)
∂θi
Note that scorei [ x | θ ] generally depends on all the parameters, not only the ith one. It captures
the multiplicative sensitivity of the distribution to perturbation of the ith parameter. The score
vector is score[ X | θ ], d-dimensional random vector obtained
the by drawing the x argument
score vector function according to p(x | θ). .
The Fisher information matrix I(θ) is dened to be the second moments matrix of the score
vector, i.e.,
h i
I(θ) , E score [X | θ] · score [X | θ]T . (4.12)
∂ log p(X | θ) ∂ log p(X | θ)
=E · . (4.14)
∂θi ∂θj
A vector parametric family is called regular, if the following conditions are satised:
(i) The parameter space Θ⊆R is a union of a nite number of convex sets.
(ii) The Fisher information matrix exists, is nite, and is full rank for any θ ∈ interior(Θ).
(iv) Integration and dierentiation can be exchanged up to second order partial derivatives,
whenever the integrals exist.
Lemma 4.4. For a regular vector parametric family, the following properties hold:
(i) I(θ) 0 (this holds also without regularity, whenever the Fisher information matrix exists).
E [score[ X | θ ]] = 0 (4.15)
Proof. The rst claim follows since I(θ) is a covariance matrix, which is always p.s.d. Proving
the other two follows the same line of argument as in the scalar case; we leave the details to the
reader.
Theorem 4.5 (Vector CRLB) . {p(x | θ)}θ∈Θ⊆Rd be a regular parametric family with
Let
Fisher information matrix I(θ). Let θ̂ : X → Rd be any unbiased estimator for θ . Then:
h i
Cov θ̂(X) I −1 (θ). (4.19)
In addition, an unbiased estimator attaining the bound above with equality for all θ exists,
if and only if the score vector function can be written as
for some vector functiong(x) and p.s.d matrix C(θ). Moreover, if (4.20) holds then
C(θ)h = I(θ)
i , and θ̂(x) = g(x) is an MVU estimator with the covariance matrix
The proof of this theorem is similar to the scalar case; the interested reader can try to ll in
the details.
Remark 4.6. If θ̂ is ecient, i.e., is unbiased and achieves the CRLB with equality for all θ,
then it is MVU. But even there is no ecient estimator, an MVU might still exist.
40 CHAPTER 4. VECTOR MVU ESTIMATION
As in the scalar case, we also have the following connection between ML and eciency. The
proof is left to the reader note that one must use the fact that I(θ) is p.s.d.
Lemma 4.7. Consider a regular family with Θ = Rd and suppose I(θ) 0 for any θ ∈ Rd . If
there exists an ecient estimator, then ML is the ecient estimator.
Corollary 4.8. The CRLB lower bound is equivalent to the statement that
h i
Var v T θ̂(X) − θ ≥ v T I −1 (θ) v (4.21)
holds for any v ∈ Rd . Namely, we have a lower bound on the variance of the projection of the
error in any direction.
The vector CRLB provides in particular a lower bound on the variance of any unbiased
estimator for any one of the parameters:
Proof. One way to see this is to pick v to have vi = 1 and zero otherwise,
h i and then appeal to
Corollary 4.8. Alternatively, by the vector CRLB we have that Cov θ̂(X) − I −1 (θ) 0. The
result follows by recalling that a p.s.d matrix must have non-negative diagonal entries.
Remark 4.10. We note that by the scalar CRLB, it also holds that
Var[ θˆi ] ≥ 1/{I(θ)}ii . (4.23)
Contrasting this with Corollary 4.9, we note that the lower bound (4.23) is weaker than (4.22),
namely that {I −1 (θ)}ii ≥ 1/{I(θ)}ii (you will show this in a homework exercise), and the
inequality can be strict. The intuitive reason is that the scalar CRLB (4.23) implicitly assumes
that all the parameters except θi are known, whereas the lower bound (4.22) assumes that
all the parameters are unknown. It is generally more dicult to estiamte θi when the other
parameters are unknown, as we will see in the following example. In some special cases the
bounds coincide; for example, in the case of constant in AWGN with unknown variance, where
the ecient estimator for the constant does not depend on the noise variance (see the SNR
estimation example below).
Example 4.11 .
(Scalar linear regression) Consider the two dimensional parametric model
Xk = α + βk + Wk , (4.24)
i.i.d
where {Wk }n−1 2
k=0 ∼ N (0, σ ). Let us denote the vector of parameters by θ = [α, β]T . Our
likelihood function is
n−1
( )
1 1 X
p(x | θ) = n exp − 2 (xk − α − βk)2 . (4.25)
(2πσ 2 ) 2 2σ
k=0
n−1
∂ log p(x | θ) 1 X
= 2 (xk − α − βk) (4.26)
∂α σ
k=0
4.1. VECTOR CRLB 41
xk
0 1 n
n−1
∂ log p(x | θ) 1 X
= 2 (xk − α − βk)k (4.27)
∂β σ
k=0
n−1
∂ 2 log p(x | θ) 1 X n
2
= − 2
1 = − 2. (4.28)
∂α σ σ
k=0
n−1
∂ 2 log p(x | θ) 1 X n(n − 1)
=− 2 k=− . (4.29)
∂α∂β σ 2σ 2
k=0
n−1
∂ 2 log p(x | θ) 1 X 2 n(n − 1)(2n − 1)
2
= − 2
k =− . (4.30)
∂β σ 6σ 2
k=0
" #
n(n−1)
1 n 2
I(θ) = 2 n(n−1) n(n−1)(2n−1) (4.31)
σ 2 6
" #
2(2n−1) 6
− n(n+1)
I −1 (θ) = σ 2 n(n+1)
6 12 . (4.32)
− n(n+1) n(n2 −1)
T
Hence for any unbiased estimator θ̂ = α̂ β̂ it holds that
" #
2(2n−1) 6
h i − n(n+1)
Cov θ̂(X) σ 2 n(n+1)
6 12 . (4.33)
− n(n+1) n(n2 −1)
2 2(2n − 1) 2 1
Var [α̂] ≥ σ · =σ ·Ω (4.34)
n(n + 1) n
h i
2 12 2 1
Var β̂ ≥ σ · 2
=σ ·Ω . (4.35)
n(n − 1) n3
42 CHAPTER 4. VECTOR MVU ESTIMATION
We see that the CRLB for β decays much faster with n than the CRLB for α, hence if there is
an ecient estimator (and we will see later there is) then β can be estimated with much better
accuracy than α. This makes sense, since we expect the slope to be much more sensitive to
perturbation than the intercept; a perturbation to the intercept incurs an additive change of
to all the samples, whereas a perturbation in the slope incurs a linearly increasing additive
change of k to the k th sample.
Note further that when β is known, the problem reduces to a constant in AWGN, in which
case
σ2
Var [α̂] ≥ (4.36)
n
for any unbiased estimator, which is strictly smaller than what we got in (4.34) for any n > 1.
This makes sense, since we expect that α could be better estimated when β is known.
Theorem 4.12. Let {p(x | θ)}θ∈Θ⊆Rd be a regular parametric family with Fisher information
matrix I(θ). Let ĝ(x) be an unbiased estimator, i.e., one such that E [ĝ(X)] = g(θ) for any
θ ∈ Θ, for a continuously dierentiable function g : Θ → Rr . Then
∂g(θ) −1 ∂g(θ)T
Cov [ĝ(X)] · I (θ) · (4.37)
∂θ ∂θ
where
∂g1 (θ) ∂g1 (θ)
···
∂g(θ) ∂θ. 1 ∂θd
.
= . . (4.38)
. .
∂θ ∂gr (θ) ∂gr (θ)
∂θ1 ··· ∂θd r×d
Example 4.13 .
(CRLB for SNR Estimation) Consider the model
Xk = a + Wk (4.39)
i.i.d
where {Wk }nk=1 ∼ N (0, σ 2 ), and where both a and σ 2 are unknown. We are however not
2 a2
interested in a and σ individually, but rather want to estimate SNR , . Denoting the
σ2
T 2 T
parameters vector by θ = [θ1 , θ2 ] = [a, σ ] , we want to estimate the vector function
θ12
SNR = g(θ) = . (4.40)
θ2
Then X ∼ N ([θ0 , . . . , θ1 ], θ2 · In×n ), hence
n
n n 1 X
log p(x | θ) = − log(2π) − log θ2 − (xk − θ1 )2 (4.41)
2 2 2θ2
k=1
4.2. TRANSFORMATION OF PARAMETERS 43
n
∂ log p(x | θ) 1 X
= (xk − θ1 ) (4.42)
∂θ1 θ2
k=1
n
∂ log p(x | θ) n 1 X
=− + 2 (xk − θ1 )2 (4.43)
∂θ2 2θ2 2θ2
k=1
∂ 2 log p(x | θ) n
2 =− , (4.44)
∂θ1 θ2
hence
∂ 2 log p(X | θ)
n
−E 2 = 2 (4.45)
∂θ1 σ
Now
n
∂ 2 log p(x | θ) 1 X
=− 2 (xk − θ1 ), (4.46)
∂θ1 ∂θ2 θ2
k=1
hence,
∂ 2 log p(X | θ)
−E = 0. (4.47)
∂θ1 ∂θ2
n
∂ 2 log p(x | θ) n 1 X
= − (xk − θ1 )2 , (4.48)
∂θ22 2θ22 θ23
k=1
∂ 2 log p(X | θ)
n nθ2 n n
−E 2 =− 2 + 3 = 2 = 4 (4.49)
∂θ2 2θ2 θ2 2θ2 2σ
n
σ2
0
I(θ) = n . (4.50)
0 2σ 4
It is worth noting at this point that I(θ) is diagonal, which means that the scalar CRLB for each
2
of a and σ separately coincides with their vector CRLB. Namely, in this specic problem, the
2 2
knowledge of σ (resp. a) does not change the CRLB for a (resp. σ ). Indeed, we have already
seen that the CRLB is met with equality for constant in known AWGN by an estimator (the
sample mean) that does not depend on the variance.
Getting back to the SNR estimation, let us nd the Jacobian matrix associated with g(θ):
Hence,
−1
∂g(θ) −1 ∂g(θ) T h 2a 2
i n
σ2
0 2a
σ22
I (θ) = σ2 − σa4 n (4.54)
∂θ ∂θ 0 2σ 4 − σa4
4a2 σ 2 a4 2σ 4
= · + · (4.55)
σ4 n σ8 n
4a2 2a4
= + (4.56)
nσ 2 nσ 4
4SNR + 2SNR2
= . (4.57)
n
Thus,
2
d ≥ 4SNR + 2SNR .
h i
Var SNR (4.58)
n
for any unbiased SNR estimator.
It can be shown by direct calculation that the (i, j)-th entry of the Fisher information matrix is
given by
T
∂µ(θ) −1 ∂µ(θ) 1 −1 ∂C(θ) −1 ∂C(θ)
{I(θ)}ij = C (θ) + Tr C (θ) C (θ) , (4.60)
∂θi ∂θj 2 ∂θi ∂θj
∂C(θ) ∂{C(θ)}kl
, (4.62)
∂θi kl ∂θi
5. Linear Models
So far, we have restricted the family of estimators under consideration to be unbiased. We now
further restrict the model we consider (namely, our parametric family) to a special family called
linear models, where linearity here means that the expectation of the signal is a linear function of
the parameter vector. The additional structure of this family will allow us to derive and analyze
very explicit estimators and gain useful insights. As we shall see, this is a very diverse and useful
family of models.
More explicitly, our parametric family {p(x | θ)}θ∈Θ⊆Rd if
X = Hθ + W , (5.1)
d
X
E [X] ≈ θ̂i hi , (5.2)
i=1
where θ̂i θi , and hi is the ith column of H , i.e., H = h1 h2 · · · hd . We
is some estimator of
note that the observation vector X is sometimes referred to in the literature as the response. We
assume throughout that the predictors (columns of H ) are linearly independent; this is clearly
without loss of generality, since if one of the predictors is a linear combination of the others, it
can be dropped. In other words, we assume that n ≥ d, i.e., that the number of parameters does
not exceed the number of observations, and that H is full rank, i.e., rank(H) = d. The following
simple lemmas will be found useful in the sequel.
Lemma 5.1. Let H ∈ Rn×d be full rank with n ≥ d. Then H T H ∈ Rd×d is invertible.
where the inequality follows since the columns of H are linearly independent. Thus the symmetric
T
matrix H H is positive-denite (all its eigenvalues are real and strictly positive), hence it is
invertible.
45
46 CHAPTER 5. LINEAR MODELS
d T
(i) dθ θ v = v.
d T d T
(ii) dθ θ Aθ = (A + AT )θ . Specically, if A is symmetric then dθ θ Aθ = 2Aθ
∂ log p(x | θ)
score [x | θ] = (5.5)
∂θ
∂ 2 n 1 T
= − log(2πσ ) 2 − 2 (x − Hθ) (x − Hθ) (5.6)
∂θ 2σ
1 ∂
xT x − 2θ T H T x + θ T H T Hθ
=− 2 (5.7)
2σ ∂θ
1 T
= 2 H x − H T Hθ .
(5.8)
σ
Recalling that by our assumptions are in light of Lemma 5.2, the matrix HT H is invertible, we
can write:
1 T T −1 T
score [x | θ] = H H (H H) H x −θ . (5.9)
2
|σ {z } | {z }
I(θ) θ̂ML (x)
1
We therefore conclude that the Fisher information matrix is given by I(θ) = σ2
H T H , and that
the ML estimator
Remark 5.3. The ML estimator does not depend on σ2, hence we can further assume the
variance is unknown and still be MVU w.r.t θ.
Remark 5.4. The matrix (H T H)−1 H T is known as the Moore-Penrose pseudo inverse of H.
It can be though of as a stable way of inverting a non-square full rank matrix; note that
(H T H)−1 H T · H = I and that the pseudo-inverse equals H −1 whenever H is invertible.
= σ 2 (H T H)−1 . (5.12)
5.1. LINEAR MODELS IN AWGN 47
θ̂ML ∼ N θ, σ 2 (H T H)−1 .
(5.13)
Theorem 5.5. For the linear model with AWGN and observation matrix H, the ML
estimator is ecient and hence MVU. It is given by
Moreover,
Example 5.6 (Stock Market). Let X be the daily (or weekly, or monthly) return of a given stock
(in percentage), and let Y be the daily market (or monthly, or yearly) return of the market (as
measured by some major index, e.g., S&P500). We would like to understand the relation between
the market returns and the individual stock returns, i.e., how much of the stock movement is a
result of market uctuations, and how much is idiosyncratic to the stock. Namely, we would like
to nd the conditional distribution pX|Y (x | y), should such a thing be dened. More specically,
suppose we believe in the following simple model:
X = α + βY + W (5.16)
where W ∼ N (0, σ 2 In×n ) is independent of Y. Note that although Y is random, we are only
p(x | y), and hence we perform our analysis
interested in estimating the conditional distribution
in a probability space induced given the market returns Y . Hence, the matrix H above can be
though of as being xed (given the market returns). Note also that since the noise vector W is
assumed independent of Y , it remains i.i.d Gaussian given Y .
48 CHAPTER 5. LINEAR MODELS
We can now nd the ML estimator for the coecients α β , which in this case is ecient
and
and MVU. Without loss of generality, let us assume now that Mean[ Y ] = 0 (if it is not, we can
remove the mean by moving it into the constant α). Then
hT1
T
H H = T h1 h2 (5.18)
h2
T
h1 h1 hT1 h2
= T (5.19)
h2 h1 hT2 h2
n
P
n Y k
= k=1 (5.20)
n
P 2
Yk kY k
k=1
n 0
= . (5.21)
0 kY k2
−1
T −1 T n 0 1 ··· 1
(H H) H = (5.22)
0 kY k2 Y1 · · · Yn
" #
1/n · · · 1/n
= Y1 · · · kYYnk2 . (5.23)
kY k2
XT Y
β̂ML = (5.27)
kY k2
XT Y kXk
= · (5.28)
kXk kY k kY k
kXk
= ρ(X, Y ) · . (5.29)
kY k
where ρ(X, Y ) is the empirical correlation between the vectors X and Y (which converges to
σ2
the true correlation as n → ∞). The variance of our estimators is given by Var[ α̂ML ] = n and
σ2
Var[ β̂ML ] = kY k2
.
Example 5.7 .
(Polynomial curve tting) We are given data points from some experiment, e.g.
in the form of a sequence {X(tk )}nk=1 , where t1 < t2 < · · · < tn are deterministic sample times
(or any other set of variables). We would like to t a simple curve to the data points. We may
want to do that in order to:
5.1. LINEAR MODELS IN AWGN 49
(ii) To compress the data, i.e., represent it approximately using a simple function of much less
then n parameters.
(iii) To interpolate or extrapolate, i.e., to guess the value of the process at unobserved time
points X(t).
d−1
X
= θi tik + error, (5.31)
i=0
where dn is set to avoid over-tting. We now write this as a linear model:
T
X = X(t1 ) X(t2 ) · · · X(tn ) (5.32)
T
θ = θ0 θ1 · · · θd−1 . (5.33)
1 t1 t21 · · · t1d−1
1 t2 t2 · · · td−1
2 2
H = . . . , (5.34)
.
.. .. .
.
.
.
2
1 tn tn · · · tnd−1
where the rst column is a constant predictor, the second column is a linear predictor, the third
column is a quadratic predictor, and so on. Note that our stock market example is essentially
polynomial tting with only the constant and the linear predictors, i.e., with d = 2. If d < n and
since all the sampling time are distinct, then H is a Vandermonde matrix and hence full rank.
Hence we get the linear model
X = Hθ + W , (5.35)
where W is tting noise. Assuming the tting noise is white Gaussian W ∼ N (0, σ 2 In×n ), we
have the ecient MVU estimator
−1
θ̂ML (x) = H T H H T x, (5.36)
which is Gaussian with mean θ and covariance matrix σ 2 · (H T H)−1 , where H is given in (5.34).
Example 5.8 (Fourier analysis). Suppose that our observation vector X is a linear combination
of sinusoids with unknown amplitudes, corrupted by AWGN:
M M
X 2πmk X 2πmk
Xk = am cos + bm sin + Wk (5.37)
n n
m=1 m=1
i.i.d
where {Wk }n−1 2
k=0 ∼ N (0, σ ). We didn't use the constant (DC) here, but it can be added easily.
This can be represented as a linear model with parameter vector
T
θ = a1 a2 · · · aM b1 b2 · · · bM (5.38)
50 CHAPTER 5. LINEAR MODELS
x(tk )
t
t1 t2 t3 ··· tn
n
The matrix H is of dimensions n × 2M , hence we must assume that M < 2 (which makes
sense). Writing hi for the ith column of H , we note that due to the known orthogonality of
these sinusoids we have hTi hj = n2 · 1 (i = j). Hence
−1 T
θ̂ML (x) = H T H H x (5.43)
2 T
= H x (5.44)
n
hT1 x
2
= · ... . (5.45)
n
hT2M x
Namely,
n−1
2X 2πmk
âm,ML = xk · cos (5.46)
n n
k=0
5.2. LEAST SQUARE INTERPRETATION 51
n−1
2X 2πmk
b̂m,ML = xk · sin , (5.47)
n n
k=0
which are exactly the DFT coecient of the observation sequence x. The ML estimator is normal
with mean θ and a diagonal covariance matrix
h i −1
Cov θ̂ML (X) = σ 2 H T H (5.48)
2σ 2
= · I2M ×2M . (5.49)
n
The amplitude estimators are thus Gaussian and statistically independent. Note that the ML /
DFT approach assumes nothing about the frequency respose of the observation sequence, except
that it is a combination of a give set of sinusoids. In case the frequency response is known to
have some further properties (smoothness, sparseness, etc.) then using ML / DFT is typically
not the best choice. We will discuss how to handle such situations when we get to the topic of
regularization.
x ≈ Hθ. (5.50)
for some choice of θ. What value ofθ should be guess? One approach, going back to Gauss,
is by look at the residual error x − Hθ , and nd the value of θ that minimizes its quadratic
norm. The general approach of minimizing the quadratic norm of the residual is commonly called
(ordinary) least squares (LS), and when the approximation is a linear function of the parameters,
the problem itself is sometimes referred to as linear regression. The LS estimator for the linear
regression problem is thus dened to be the solution to the following optimization problem:
We can immediately note that the solution to this problem is exactly the ML estimator for the
linear model in AWGN, simply since the likelihood function is monotonic in the quadratic norm
of the residual. Hence, we immediately have:
Proposition 5.9. The LS estimator for the linear regression probelm with a full rank observation
matrix H coincides with the ML estimator for the corresponding linear model with AWGN, and
is given by
−1
θ̂LS (x) = H T H H T x. (5.52)
orthogonal projection P , we say that P projects onto the subspace Im {P }. This is intuitively
clear, since applying P once gets us inside Im {P }, and applying it again does nothing.
52 CHAPTER 5. LINEAR MODELS
Lemma 5.10. Let P be an orthogonal projection, and let d = dim (Im {P }). Then
(iii) P has d eigenvalues all equal to 1, and n−d eigenvalues all equal to 0.
Proof. All these properties stem from noting the following. Any vector can be written as
x = P x + (I − P )x. (5.53)
hP x, (I − P )xi = xT P T (I − P )x (5.54)
T
= x P (I − P )x (5.55)
= xT (P − P 2 )x (5.56)
T
= x (P − P )x (5.57)
=0 (5.58)
−1
Let us go back to the LS estimator θ̂LS = H T H H T x. The approximation of x that it
provides is simply
x̂ = H · θ̂LS (5.59)
−1
= H HT H HT x (5.60)
| {z }
P
The matrix P is commonly called the hat matrix, since it maps x → x̂. Let us show that P is
an orthogonal projection:
Proof.
h −1 T iT
P T = H HT H H (5.61)
h −1 T T
i
= H HT H H (5.62)
−1 T
= H HT H H (5.63)
= P. (5.64)
−1 −1
P 2 = H HT H HT · H HT H HT (5.65)
−1
= H HT H HT (5.66)
= P. (5.67)
5.2. LEAST SQUARE INTERPRETATION 53
Namely, P x is a linear combination of the columns of H . Hence, P is projecting onto the column
space of H , and the LS estimator nds the coecients θ such that Hθ is exactly equal to the
projection of x onto this subspace, which is the subspace spanned by the predictors. This is
depicted in Figure 5.2.
Following this interpretation, a common measure for the goodness of t of the LS solution to
the linear regression problem is the so-called R-square, dened by
kx − H θ̂LS k2
R2 , 1 − (5.70)
kxk2
Example 5.11 .
(Stock market) Recall our stock market Example 5.6. The ML estimators for
α, β , which are also their LS estimators, were given by α̂LS = Mean[ x ] and
kxk
β̂LS = ρ(x, y) · , (5.71)
kyk
xT y
where ρ = ρ(x, y) = kxkkyk . Assume for simplicity that Mean[ x ] = 0, and let us nd the R2
associated with this problem. The energy of the residual is given by
2
2 kxk
x − β̂LS y = x−ρ y (5.72)
kyk
kxk kxk2 kyk2
= kxk2 − 2ρ · xT y + ρ2 (5.73)
kyk kyk2
= kxk2 − 2ρ2 kxk2 + ρ2 kxk2 (5.74)
= kxk2 (1 − ρ2 ), (5.75)
and hence
In the linear regression case, g(θ) = Hθ and we found a closed-form solution. In the general case,
there is usually no closed-form solution, and we need to optimize numerically or solve iteratively.
The LS estimator for nonlinear regression again assumes no statistical model, but as before it
can be interpreted as an ML estimator for the Gaussian model
X = g(θ) + W (5.78)
54 CHAPTER 5. LINEAR MODELS
residual
error
H θ̂ L
S
Im {H}
Figure 5.2: The LS estimator as a projection onto the space spanned by the predictors
where
Unlike the linear case, the ML estimator here is not necessarily ecient nor MVU; in fact, it is
typically biased.
X = Hθ + W (5.80)
where now the noise is W ∼ N (0, C) where C is a known covariance matrix, assumed to be
strictly positive denite. Instead of attacking this problem directly, we can whiten the noise
using a linear transformation, and then use the results from the AWGN case. Since C is positive
denite, it is invertible and its inverse is also positive denite. Moreover, being positive denite,
the inverse can be written as C −1 = QT Q for another positive denite matrix Q. To see this,
we can use the eigenvalue decomposition C
−1 = V T DV where V is unitary and D is diagonal
−1
√ √ √
with positive diagonal entries, then write C = (V T D) · ( DV ), and set Q = DV .
Let
X
f , QX (5.81)
= QHθ + QW (5.82)
= Hθ
e +Wf (5.83)
where H
e and W
f have been implicitly dened. Now, W
f is clearly a zero mean Gaussian vector,
with a covariance matrix
h i
f T = QE W W T QT
E WfW (5.84)
−1 T
= Q QT Q Q (5.85)
5.4. BEST LINEAR UNBIASED ESTIMATOR (BLUE) 55
−1
= QQ−1 QT QT (5.86)
= I. (5.87)
We can now solve this equivalent linear model in AWGN using the tools we have developed.
Specically,
−1
θ̂ML = HeTH e HeTx e (5.88)
−1 T T
= H T QT QH H Q Qx (5.89)
−1
= H T C −1 H H T C −1 x.
(5.90)
Remark 5.13. Note that the ML estimator above is insensitive to scaling of the noise covariance
matrix C → αC . In addition, it can be interpreted as a weighted LS solution (see homework
exercise).
for some matrix A ∈ Rd×n . A linear unbiased estimator that dominates (in the MSE sense)
any other unbiased linear estimator is called the Best Linear Unbiased Estimator (BLUE), and
denoted here by θ̂BLUE .
X = Hθ + W , (5.94)
where E [W ] = 0 and Cov [W ] = C for some given covariance matrix C (which does not
depend on the parameters), and where W is otherwise arbitrarily distributed (can depend
on the parameters). Then the BLUE for θ is given by
−1
θ̂BLUE (x) = H T C −1 H H T C −1 x (5.95)
and
h i
Cov θ̂BLUE = (H T C −1 H)−1 . (5.96)
56 CHAPTER 5. LINEAR MODELS
Proof. This can be proved directly (see homework exercise). Let us prove this in a smarter
roundabout way. Let θ̂(x) = Ax, and write:
h i
E θ̂(X) = E [AX] (5.97)
= E [A(Hθ + AW )] (5.98)
= AHθ (5.99)
=θ (5.100)
where the last equality follows from the unbiasedness requirement. Therefore, it must hold that
AH = I. (5.101)
h i h i
Cov θ̂ = E (θ̂ − θ)(θ̂ − θ)T (5.102)
h i h i h i
= E θ̂ θ̂ T − E θ̂θ T − E θ θ̂ T + θθ T (5.103)
h i
= E θ̂ θ̂ T − θθ T , (5.104)
h i
E θ̂ θ̂ T = E AXX T AT
(5.105)
= AE (Hθ + W )(Hθ + W )T AT
(5.106)
where we have used the fact that AH = I must hold. Hence, the covariance of the estimator
h i
Cov θ̂ = ACAT (5.109)
depends only on the choice of A and on the noise covariance C, but not on any other statistics
of the noise W. Thus, assuming that W ∼ N (0, C) is Gaussian will not aect the covariance
of the estimator. But for the Gaussian case, we already know that the ML estimator θ̂ML (x) =
−1 T −1
H T C −1 H H C x is linear and dominates any other unbiased estimator, in particular linear
unbiased estimators. Hence it must coincide with the BLUE.
Remark 5.15. Examining the proof, we note that we can use the BLUE even when the covari-
ance matrix depends on the parameters, as long as the dependence is only in the scaling of the
covariance. Namely, if
for some strictly positive function g(θ) 6= 0 and known C , then the same BLUE still works. This
is since now, Cov[ θ̂ ] = g(θ)ACAT therefore
which is independent of θ.
5.4. BEST LINEAR UNBIASED ESTIMATOR (BLUE) 57
Remark 5.16. The BLUE is not MVU in general (unless the noise is Gaussian), and may not
even exists (e.g., when there is no unbiased estimator, see homework exercise).
i.i.d
{Wk }nk=1 ∼ − 2θ , 2θ
where Unif . Note that E [Wk ] = 0 and that
θ2
Cov [W ] = In×n , (5.113)
12
and so in light of Remark 5.15 we can use the BLUE. We have that
−1 4
HT H = (5.114)
n
and hence
−1 T
θ̂BLUE (x) = H T H H x (5.115)
n
2X
= xk (5.116)
n
k=1
= 2 Mean[ x ], (5.117)
We will later nd the MVU for this problem via other methods, and see that this estimator is
not MVU. To see intuitively whet this estimator is not so good, consider the case where say
X1 = X2 = ... = X99 = 0 and X100 = 50. In this case θ̂BLUE = 2Mean[ X ] = 1. However, this is
a gross underestimate since obviously θ ≥ 50.
58 CHAPTER 5. LINEAR MODELS
6. Sucient Statistics
As we have seen, the ML estimator attains the CRLB essentially whenever this is possible, and
is hence MVU in these cases. Sometimes however, an ecient estimator may not exist at all,
either since the CRLB is loose or because the CRLB does not apply.
Example 6.1 .
(Uniform) Recall the model where
i.i.d
{Xk }nk=1 ∼ Unif(0, θ). We have seen that
the BLUE in this case is
We argued intuitively that this is probably not the best estimator. For future reference, let us
compute its variance:
h i
MSE(θ̂BLUE ) = Var θ̂BLUE (6.2)
n
4 X
= Var [Xk ] (6.3)
n2
k=1
4
= Var [X1 ] (6.4)
n
4
= E[X12 ] − (E[X1 ])2 (6.5)
n
θ
Z 2
4 1 θ
= x2 dx − (6.6)
n θ 2
0
4 θ2 θ2
= − (6.7)
n 3 4
θ 2
= , (6.8)
3n
where (6.3) and (6.4) hold since the samples are i.i.d.
Let us nd the ML estimator for θ.
n
Y
p(x | θ) = p(xk | θ) (6.9)
k=1
n
n Y
1
= 1 (xk ∈ [0, θ]) . (6.10)
θ
k=1
59
60 CHAPTER 6. SUFFICIENT STATISTICS
n Y n
1
= argmax 1 (xk ∈ [0, θ]) (6.12)
θ θ
k=1
| {z }
∈{0,1}
This sounds reasonable, since it is clear that θ is at least as large as the maximal sample. But
for the exact same reason, it seems like an underestimate. It should thus be intuitively clear that
θ̂ML is biased (toward zero). We will compute the ML bias when we return to this example later
on, and see this is indeed the case. Since an existence of an ecient estimator implies ML is
ecient, there cannot be an ecient estimator in this case (note that we have proved this only
when the parameter space is the entire real line, but the argument works here as well, e.g. by
considering negative θ values as well). In fact, the CRLB regularity conditions do not hold in
this case, since the support of the distribution depends on the parameter, so the CRLB is not
even valid.
In such cases, we still want some procedure to nd an MVU estimator (if one exists) or to
at least nd a good unbiased estimator. In this course we explore two ways to approach this
problem: One via the notion of sucient statistics (now), and one via the asymptotics of ML
(later).
θ − T (X) − X (6.16)
forms a Markov chain. This intuitively means that T (X) summarizes everything that is useful
for estimating the parameter.
n
Y
p(x | θ) = θxk (1 − θ)1−xk (6.17)
k=1
kxkH
=θ (1 − θ)n−kxkH , (6.18)
n
P
where kxkH , xk is the Hamming weight of the sample. A natural candidate for a sucient
k=1
statistic is
Let us check if it is indeed sucient. Writing T = T (X) and noting that T (X) ∼ Binom(n, θ),
we have
p(x | θ) · Pr(T = t | X = x, θ)
Pr(X = x | θ, T = t) = (6.20)
Pr(T = t | θ)
θT (x) (1 − θ)n−T (x) 1 (T (x) = t)
= n t
n−t
(6.21)
t θ (1 − θ)
θt (1 − θ)n−t 1 (T (x) = t)
= n t
n−t
(6.22)
t θ (1 − θ)
1 (kxkH = t)
= n
, (6.23)
t
which is simply the uniform distribution over all length-n binary vectors of Hamming weight t
(this should have been obvious from the outset!). It does not depend on θ anymore. Hence, T (X)
is a sucient statistic. Equivalently, the sample mean Mean[ X ] is also a sucient statistic.
Theorem 6.3 (Fisher-Neyman Factorization). Let X be drawn from the parametric family
p(x | θ). Then T (X) is a sucient statistic for θ if and only if
Proof. For simplicity we prove only for discrete alphabet samples, i.e., where X ∼ p(x | θ) has
some nite cardinality. The parameter θ can be anything.
Suppose p(x | θ) = h(x) · g(T (x), θ). Let us show that p(x | θ, T (x)) = p(x | T (x)).
All we need to show is that g(t, θ) can be canceled out. Let us expand the denominator:
X
Pr(T (x) = t | θ) = p(x | θ) (6.28)
x:T (x)=t
X
= h(x)g(T (x), θ) (6.29)
x:T (x)=t
X
= h(x)g(t, θ) (6.30)
x:T (x)=t
62 CHAPTER 6. SUFFICIENT STATISTICS
X
= g(t, θ) · h(x) (6.31)
x:T (x)=t
h(x)g(t, θ) · 1 (T (x) = t)
p(x | θ, T (x) = t) = P (6.32)
g(t, θ) · h(x)
x:T (x)=t
h(x)
= P · 1 (T (x) = t) (6.33)
h(x)
x:T (x)=t
which is not a function of θ, hence the above must be p(x | T (x) = t). Thus, T (X) is a sucient
statistic for θ.
Conversely, assume that T (X) is a sucient statistic for θ, i.e., that p(x | T (x), θ) = p(x | T (x))
holds. Then
p(x | θ) · 1 (T (x) = t)
p(x | T (x) = t, θ) = (6.34)
Pr(T (x) = t | θ)
= q(x, t). (6.35)
Dene g(t, θ) , Pr(T (X) = t | θ) and choose t = T (x). The factorization now follows from
Example 6.4 .
(Coin toss) Recall the biased coin toss experiment where
i.i.d
X ∼ Bern(θ). Write
Therefore the sample mean T (x) = Mean[ x ] (or the Hamming weight) is a sucient statistic.
Example 6.5 .
(Constant in AWGN) Recall the model
Xk = θ + Wk , k = 1, . . . , n (6.41)
i.i.d
where W ∼ N (0, σ 2 ) and σ2 is known. Write
n
( )
1 1 X
p(x | θ) = n exp − 2 (xk − θ)2 (6.42)
(2πσ 2 ) 2 2σ
k=1
n n
( !)
1 1 X
2
X
2
= n exp − 2 xk − 2θ xk + nθ (6.43)
(2πσ 2 ) 2 2σ
k=1 k=1
n
( )
1 1 X
2 nθ
= n exp − 2· xk exp − 2 (θ − 2Mean[ x ]) . (6.44)
(2πσ 2 ) 2 2σ 2σ
k=1
| {z }| {z }
g(Mean[ x ],θ)
h(x)
Hence, the sample mean T (x) = Mean[ x ] is a sucient statistic in this case as well.
6.3. RAO-BLACKWELL PROCEDURE 63
In these two examples, the sucient statistic also happened to be the MVU. We will see
tha more generally, identifying a sucient statistic can be used to derive or to get close to an
MVU.
We say that the loss function is convex, if `(θ̂, θ) is a convex function of θ̂ θ ∈ Θ. The
for any
MSE is obtained as the expected loss associated with the convex loss function `(θ̂, θ) = (θ̂ − θ)2 .
Other common examples of convex loss functions include `(θ̂, θ) = |θ̂ − θ|p for any p ≥ 1.
Theorem 6.6 (Rao-Blackwell). Let θ̂(x) be any estimator for θ (not necessarily unbiased),
and suppose that T (X) is a sucient statistic for θ. Dene
h i
θ̂RB (T (X)) , E θ̂(X) | T (X), θ (6.46)
Then θ̂RB is a feasible estimator (does not depend on the parameter) with the same bias
function, i.e., satisfying E[θ̂RB ] = E[θ̂]. Furthermore, the expected loss associated with any
convex loss function satises
In particular,
Lemma 6.10 (Law of total gap). Let X, Y be jointly distributed real-valued r.v.s., and φ : R → R
a measurable function. Then the φ-gap of Y is the sum of the expected conditional φ-gap and the
gap of the conditional expectation (given X ), i.e.,
Proof. Write
Lemma 6.11 (Jensen's Inequality) . Let φ:R→R be a convex function. Then δφ (Y ) ≥ 0, and
Pr(δφ (Y | X) ≥ 0) = 1.
Proof. Since φ(t) is convex, then at any t0 there exists some a such that φ(t) ≥ a · (t − t0 ) + φ(t0 )
(a is unique when φ is dierentiable at t0 , in which case the lower bound is the associated tangent
line). Picking t0 = E[Y ] we have
The second claim follows since the above similarly holds for any given X = x.
A simple application of the above is obtained for φ(t) = t2 , in which case δφ (Y ) = Var [Y ] and
δφ (Y | X) = Var [Y | X]. This yields the so-called law of total variance, stating that the variance
of Y can be written as the sum of the variance of its MMSE approximation E [Y | X], and the
expected variance of the approximation error E [Var[ Y | X ]].
Corollary 6.12 (Law of total variance) . Let X, Y be jointly distributed real-valued r.v.s. Then
Proof of Theorem 6.6. First, let us make sure this is a feasible estimator (depends only on data).
Recall that since T (X) is a sucient statistic, the distribution of X given T (X) does not depend
on θ. Hence
h i
θ̂RB = E θ̂(X) | T (X), θ (6.60)
h i
= E θ̂(X) | T (X) , (6.61)
= E[θ̂]. (6.63)
where (6.63) is the law of total expectation. Let us now proceed to show the expected loss does
not increase. Fix any value of θ, and dene the function φ(t) = `(t, θ), which is convex in t by
assumption. Now, be denition for any estimator θ̂ it holds that
h i
Lθ (θ̂) = E `(θ̂(X), θ) (6.64)
6.4. COMPLETE SUFFICIENT STATISTICS AND THE LEHMANN-SCHEFFÉ THEOREM65
h i
= E φ(θ̂(X)) (6.65)
Since both the estimators θ̂ and θ̂RB have the same expectation (as a function of θ), it is enough
to show that the associated φ-gap (which is the variance of the estimator in the quadratic loss
case) does not increase. By the law of total gap, we have that
h i
δφ (θ̂) = E δφ (θ̂ | T (X)) +δφ (E[θ̂ | T (X)]) (6.67)
| {z } | {z }
≥0 θ̂RB
≥ δφ (θ̂RB ). (6.68)
where non-negativity follows from the convexity of φ. This completes the proof.
h i
Remark 6.13. Note that we can clearly write θ̂RB (T (X)) , E θ̂(X) | T (X) without the con-
ditioning on θ, and that θ̂RB is indeed a function of T (X) only (and not the whole X ).
In other words, T (X) is complete if g(T (x)) being zero in expectation for all θ implies g(T (X)) =
0 with probability 1 for all θ. This odd looking denition has the following simple interpretation.
Suppose we have two estimators using the statistic, θ̂1 (T (X)) and θ̂2 (T (X)), and assume they
have the exact same bias function. Then clearly
h i
E θ̂1 (T (X)) − θ̂2 (T (X)) | θ = 0, ∀θ (6.71)
and hence if T (X) is complete, then θ̂1 and θ̂2 (T (X)) are identical with probability 1. Specically,
this immediately implies:
Corollary 6.14. If T (X) is complete, there can be at most one unbiased estimator of θ that is
a function of T (X).
For the purpose of stating the next Theorem, we will say that an estimator is minimum loss
unbiased for some loss function `, if it dominates every other unbiased estimator in the expected
loss (an MVU estimator is the special case obtained for quadratic loss).
66 CHAPTER 6. SUFFICIENT STATISTICS
is a minimum loss unbiased estimator for θ for any convex loss function, and it is
unique with probability 1 (in particular, for quadratic loss it is MVU).
(ii) If there exist an unbiased estimator θ̂(·) for θ that is only a function of T (X), i.e.,
h i
E θ̂(T (X)) = θ ∀θ, (6.73)
then θ̂ is a minimum loss unbiased estimator for θ for any convex loss function, and
it is unique with probability 1 (in particular, for quadratic loss it is MVU).
Remark 6.16. Note that the theorem holds verbatim if we replace θ with a function ϕ(θ)
everywhere.
Proof. First, we note that if θ̂ is unbiased, then θ̂RB is also unbiased, and is a function of T only,
hence it satises the conditions for claim (ii). Thus, it is enough to prove claim (ii).
Let θ̂(T (x)) be unbiased, and let θ̃(x) be any other unbiased estimator. Consider the Rao-
Blaockwell version of θ̃:
h i
θ̃RB = E θ̃(X) | T (X) (6.74)
h i
which is unbiased E θ̃RB = θ, has expected loss Lθ (θ̃RB ) ≤ Lθ (θ̃), and is a function of T (X)
only. Thus,
h i
E θ̂(T (X)) − θ̃RB (T (X)) = θ − θ = 0 (6.75)
for any θ, which by completeness means that θ̂(T (X)) = θ̃RB (T (X)) with probability 1, and
hence also that Lθ (θ̂) = Lθ (θ̃RB ) ≤ Lθ (θ̃).
Example 6.17 .
(Coin toss) Recall the biased coin toss problem with θ ∈ (0, 1). We saw that
T (X) = kXkH is a sucient statistic. Let us show it is complete. For any function g we have
that
n
X n t
E [g(T (X))] = g(t) θ (1 − θ)n−t (6.76)
t
t=0
n t
n
X n θ
= (1 − θ) g(t) . (6.77)
t 1−θ
t=0
θ
Dening z, 1−θ we have that E [g(T (X))] = 0 for any θ ∈ [0, 1] if and only if
n
X n t
g(t) z = 0, ∀z ∈ [0, ∞). (6.78)
t
t=0
6.4. COMPLETE SUFFICIENT STATISTICS AND THE LEHMANN-SCHEFFÉ THEOREM67
But the left hand side above is an nth degree polynomial, which can have at most n roots unless
it is identically zero. Thus we conclude that it must be that g(t) = 0 identically, hence T (X) is
complete.
Now let us demonstrate the Rao-Blackwell procedure and the Lehmann-Scheé Theorem.
Suppose we start with the very naive unbiased estimator
ˆ = x1
θ(x) (6.79)
E[θ̃(X)] = θ. (6.80)
Clearly, this estimator will not be MVU for n > 1. Let us perform the Rao-Blackwell procedure,
which by the Lehmann-Scheé Therem, since T (X) is a complete sucient statistic, will be the
unique MVU.
h i
θ̂RB (t) = E θ̂ | T (X) = t (6.81)
Pr(X1 = 1, T (X) = t)
= (6.84)
Pr(T (X) = t)
n−1 t
θ (1 − θ)n−t
= t−1
n t
n−t
(6.85)
t θ (1 − θ)
(n − 1)! (n − t)!t!
= · (6.86)
(n − t)!(t − 1)! n!
t
= . (6.87)
n
Therefore,
T (X)
θ̂RB (T (X)) = (6.88)
n
kXkH
= (6.89)
n
= Mean[ X ]. (6.90)
which must be MVU, as we already know from CRLB considerations. We now also know further
that Mean[ X ] is the minimum loss unbiased estimator for any convex loss function.
Example 6.18 .
(Constant in AWGN) Recall the model
i.i.d
{Xk }nk=1 ∼ N (θ, σ 2 ), where σ 2 is
n
P
known. In the previous chapter we have seen that T (X) = Xk is a sucient statistic for θ.
k=1
Let us show that is also complete. Note that T (X) ∼ N (nθ, nσ 2 ). Suppose that E [g(T (X))] = 0
for some function g and all θ, i.e.,
Z∞
1 (t−nθ)2
√ g(t)e− 2nσ 2 dt = 0, ∀θ. (6.91)
2πnσ 2
−∞
Since this is a functional identity, the derivatives with respect to θ of both sides are also identical,
namely:
Z∞
2 (t−nθ)2
√ g(t)(t − nθ)e− 2nσ 2 dt = 0, ∀θ. (6.92)
8πn3 σ 6
−∞
68 CHAPTER 6. SUFFICIENT STATISTICS
where we have assume that we move the derivative inside the integral (otherwise, approximation
arguments can be used). The above immediately implies that
= 0, (6.94)
Repeating the same argument for the function tg(t) and iterating it, we nd that
h i
E T k g(T ) = 0 (6.95)
∞
" #
X
2 k
E g (T ) = E g(T ) · gk T (6.96)
k=0
∞
" #
X
=E gk T k g(T ) (6.97)
k=0
∞
X h i
= gk E T k g(T ) (6.98)
k=0
= 0, (6.99)
where we have further assumed absolute convergence in (6.98). Hence g(T ) = 0 with probability
one, and hence T (X) is complete. By the Lehmann-Scheé Theorem, if we can nd an function
θ̂(t) such that
h i
E θ̂(T (X)) = θ (6.100)
t
for any θ, then θ̂(T (X)) is the unique MVU estimator. Clearly, θ̂(t) = n satises this require-
ment. We further conclude that the sample mean in this case is a minimum loss unbiased for
any convex loss function.
7. Exponential Families
∂
score [x | θ] = log p(x | θ) = I(θ)[T (x) − θ], (7.1)
∂θ
for some function T (X). Taking the integral on both sides w.r.t θ, we obtain
Z
log p(x | θ) = I(θ)[T (x) − θ]dθ + c(x) (7.2)
Z Z
= T (x) · I(θ)dθ − I(θ) · θdθ + c(x) (7.3)
R R
where η(θ) , I(θ)dθ and a(θ) , I(θ) · θdθ. Taking the exponent of both sides and dening
h(x) , ec(x) , the likelihood function is of the form
A family of distributions of this form is called a (scalar) exponential family. The exponential
family is determined by the choice of functions h(x), T (x), η(θ) and a(θ), which can be arbitrary
as long as they result in a proper p.d.f. Note that the function a(θ), sometimes called the
log-partition function, is in fact uniquely determined by the choice of the other functions; it is
simply a normalization that makes sure that p(x | θ) integrates (or sums) to one. The function
η(θ) is sometimes called the natural parameter of the exponential family. Note that the general
exponential family form (when the functions h, T, η are general) does not imply that T (X) is
ecient for θ nor for η(θ), but there is always some parameterization for which T (X) is ecient,
as we will discuss in the sequel.
T
More generally, we say that p(x | θ) is a (vector) exponential family for θ = θ1 · · · θd ,
if
where
T
T (x) = T1 (x) · · · Td (x) (7.7)
69
70 CHAPTER 7. EXPONENTIAL FAMILIES
An exponential family is called full rank if the d functions {Ti (x)}di=1 in the sucient statistic
are linearly independent (hence specically, any nontrivial scalar case is always full rank). The
following can be shown:
Lemma 7.1. Consider a full rank exponential family where the natural parameter space η(Θ)
has a non-empty interior. Then
(iii) Any function of T (X) that is an unbiased estimator for ϕ(θ), is also the unique MVU
estimator for ϕ (with probability 1).
Proof. We have already seen that T (X) is sucient; proofs of completeness and minimality
can be found in the literature. The nal claim follows immediately from the completeness and
suciency via the Lehmann-Scheé Theorem.
Lemma 7.2. Consider a full-rank exponential family dened by (h(x), η(θ), T (x), a(θ)), where
Θ ⊆ Rd is a convex set. The following claims hold, assuming that all the derivatives exists and
the Jacobian matrices are invertible for all θ ∈ Θ.
(i) The ML estimator is either attained on the boundary of Θ, or is one of the solutions for
the equation
!−1
dη(θ) T da(θ) T
φ(θ) , · = T (x). (7.9)
dθ dθ
(ii) The sucient statistic itself T (x) is an ecient estimator for the parameter φ(θ) dened
above.
Proof. Let ϕ(θ) be some general invertible re-parametrization. The associated score vector is
d log p(x | θ)
score[ x | ϕ ] = (7.10)
dϕ
d
= (η(θ)T T (x) − a(θ)) (7.11)
dϕ
dη dθ T da dθ T
= T (x) − (7.12)
dθ dϕ dθ dϕ
7.2. BASIC PROPERTIES 71
" T −1 #
dθ T dη T da T
dη
= T (x) − (7.13)
dϕ dθ dθ dθ
T T
dθ dη
= [T (x) − φ(θ)] (7.14)
dϕ dθ
Both claims now follow by choosing ϕ = φ. Note the the Fisher information matrix for Var[ φ ]
has also been revealed above.
The log-partition function of the exponential family can be used to derive the expectation
and the variance of the sucient statistic, which is sometimes simpler than computing directly.
Try to prove this at home:
∂a
E [T (X)] = (7.15)
∂η
and
∂2a
Cov [T (X)] = (7.16)
∂η∂η T
We note that similarly, higher central moments of the sucient statistic can be computed
using the corresponding higher order derivatives of the log-partition function.
Another important property of exponential families is their behavior under i.i.d samples.
Lemma 7.4. Let X = (X1 , . . . , Xn )T be an i.i.d sample from an exponential family. Then the
distribution of X is also an exponential family, with a sucient statistic given by
n
X
T (X) = T (Xk ). (7.17)
k=1
n n
( )
Y X
p(x | θ) = h(xk ) · exp η(θ) T (xk ) − na(θ) . (7.18)
k=1 k=1
This nice property means in particular that the dimension of the sucient statistic of an
exponential family does not grow with the number of i.i.d samples; we can summarize our entire
set of observations using only d numbers where d is the number of parameters. Unfortunately,
this property does not typically extend beyond exponential families. The proof of the following
theorem can be found in the literature.
7.3 Examples
Example 7.6 (Gaussian with unknown mean, known / unknown variance). X ∼ N (µ, σ 2 ) where
σ2 is either known or unknown (this is our constant in AWGN example).
(x − µ)2
2 1
p(x | µ, σ ) = √ exp − (7.19)
2πσ 2 2σ 2
x2 µ2
1 µ
=√ exp − 2 + x 2 − 2 . (7.20)
2πσ 2 2σ σ 2σ
x2 µ2
1 µ
p(x | µ) = √ exp − 2 exp · x − . (7.21)
2πσ 2 2σ σ2 2σ 2
hence this is an exponential family with sucient statistic is T (x) = x, natural parameter is
µ µ2
η= σ2
, and the log-partition function a(µ) = 2σ 2
. The sucient statistic of an i.i.d sample is
thus the sample mean (or sample sum for multiple i.i.d. samples) as we already know. Moreover,
da/dµ µ/σ 2
T (X) = X is an ecient (and ML) estimator for the parameter
dη/dµ = 1/σ 2
= µ, as we already
know. Alternatively, we can see that X is MVU since it is unbiased an a function of a complete
sucient statistic.
What happens when the variance is also unknown? In this case we can write (7.20) as
x2 µ2
2 1 µ 1 2
p(x | µ, σ ) = √ exp − 2 + x 2 − 2 − log σ . (7.22)
2π 2σ σ 2σ 2
hence this is an exponential family with sucient statistic
T
T (x) = x, x2 ,
(7.23)
natural parameters
T
µ2 1
η(µ, σ ) = 2
,− 2 , (7.24)
σ 2σ
µ2 1
and log-partition function a(µ, σ 2 ) = 2σ 2
+ 2 log σ 2 . The complete sucient statistic of an i.i.d
sample is therefore
T
T (X) = Mean [{Xk }] , Mean {Xk2 }
. (7.25)
are unbiased, and since they are functions of a complete sucient statistic, they are the unique
MVU estimators.
Let us demonstrate the use of the log-partition function to compute the expectation of the
sucient statistic. First, for simplicity, let us write the log-partition function as a function of
the natural parameters:
η12 1
a(η) = − + log(−2/η2 ) (7.28)
4η2 2
7.3. EXAMPLES 73
∂a(µ, σ 2 )
= (7.30)
∂η1
η1
=− (7.31)
2η2
= µ. (7.32)
and
E [T2 (X)] = E X 2
(7.33)
∂a(µ, σ 2 )
= (7.34)
∂η2
η12 1
= 2 − (7.35)
4η2 2η2
= µ2 + σ 2 . (7.36)
Example 7.7 .
(Bernoulli) Let X ∼ Bern(θ) where θ ∈ [0, 1] is an unknown parameter. Then
Therefore, this is an exponential family with T (X) = X as the complete sucient statistic
θ
(not surprisingly), the log-likelihood ratio η(θ) = log
1−θ as the natural parameter (hence θ =
eη i.i.d
1+eη ), anda(θ) = − log(1−θ) = log(1+eη ) as the log-partition function. Now, if X ∼ Bern(θ),
then we clearly have that the Hamming weight (or sample mean) T (X) = kXkH is the complete
η
sucient statistic, and a(θ) = −n log(1 − θ) = n log(1 + e ) is the log-partition function (the
natural parameter remains the same). Hence
∂a
E [T (X)] = (7.39)
∂η
neη
= (7.40)
1 + eη
= nθ, (7.41)
and
∂2a
Var[ T (X) ] = (7.42)
∂η 2
n(eη (1 + eη ) − e2η )
= (7.43)
(1 + eη )2
neη
= (7.44)
(1 + eη )2
= nθ(1 − θ). (7.45)
Example 7.8 .
(Discrete Distributions) Suppose X is a discrete r.v with
unknown symbol prob-
M
abilities {pi }i=1 . The vector of parameters is (say) θ = p1 · · · pM −1 , where we have removed
74 CHAPTER 7. EXPONENTIAL FAMILIES
the last symbol probability since it is a function of the others. Assuming without loss of generality
that X is taking the value i with probability i, we have the p.m.f:
M
pi · 1 (x = i)
X
p(x | θ) = (7.46)
i=1
M
( !)
1 (x = i) · pi
X
= exp log (7.47)
i=1
(M )
1 (x = i) · log pi
X
= exp (7.48)
i=1
M −1
( )
p
1 (x = i) · log i
X
= exp log pM + (7.49)
pM
i=1
where (7.48) follows by noting that only one summand can be nonzero. Dening
1 (x = 1) , 1 (x = 2) , · · · 1 (x = M − 1)
T
T (x) = (7.50)
h θ
iT
θ
η(θ) = log 1−PM1 −1 θ log PθM2 −1 · · · log PMM−1
−1 , (7.51)
i=1 i 1− i=1 θi 1− i=1 θi
M −1
!
X
a(θ) = − log 1 − θi (7.52)
i=1
we have that
Hence this is in the exponential family with a complete sucient statistic T (X) that is the
indicator of event that happened.
For an i.i.d sample from a discrete distribution the complete sucient statistic is therefore
n
P
T (Xi ), which is eectively the empirical distribution of the samples:
i=1
n n n n
T
1X 1
1 (Xi = 1) 1
1 (Xi = 2) · · · 1
1 (Xi = M − 1) .
P P P
T (Xi ) = n n n (7.54)
n i=1 i=1 i=1
i=1
n
1 P
Note that E n T (Xi ) = p1 · · · pM −1 . Hence, the Lehmann-Scheé Theorem implies
i=1
that the empirical distribution is the unique MVU estimator of the underlying distribution.
There are many other known families of distributions that are exponential families, includ-
ing: Exponential, Poisson, Chi-squared, Laplace, Beta, Gamma, and log-normal. On the other
hand, there are many others that are not exponential families, including uniform, Cauchy, and
hypergeometric.
Let us give an example that is not exponential family, but nevertheless has a small complete
sucient statistic and an MVU estimator.
Example 7.9 .
(Uniform) Let us go back to the model where
i.i.d
{Xk }nk=1 ∼ Unif([0, θ]) where
θ2
θ>0 is unknown. Recall we have seen that θ̂BLUE = 2Mean[ X ] has variance Var[ θ̂BLUE ] = 3n
and that θ̂ML = maxk Xk is biased.
7.3. EXAMPLES 75
n
1 (xk ∈ [0, θ])
Y
p(x | θ) = θ−n (7.55)
k=1
=θ −n
· 1 max xk ∈ [0, θ] (7.56)
k
where h(x) = 1 and g(t, θ) = θ−n · 1 (t ∈ [0, θ]). By the Neyman-Fisher factorization Theorem
we conclude that T (X) = max Xk is a sucient statistic. This makes sense, since once we know
k
the maximal sample the smaller ones intuitively give us no further information regarding the
support. It is also clear that this is not an exponential family setup, since the likelihood cannot
be written as a function of some product η(θ) · max Xk .
k
Let us now show that this sucient statistic is also complete. First, we need to nd the p.d.f
of T = T (X). For any t ∈ [0, θ] we have
n
Y
Pr (T ≤ t | θ) = Pr (Xk ≤ t) (7.58)
k=1
n
t
= ·. (7.59)
θ
∂
p(t | θ) = Pr(T ≤ t) (7.60)
∂t
= θ−n ntn−1 · 1 (t ∈ [0, θ]) (7.61)
Now suppose g(t) is such that E [g(T )] = 0 identically. Then for any θ>0
Zθ
E [g(T )] = g(t)θ−n ntn−1 dt = 0, (7.62)
hence also
Zθ
g(t)tn−1 dt = 0. (7.63)
Taking the derivative of both sides of this identity w.r.t. θ we nd that for any θ>0
g(θ)θn−1 − 0 = 0, (7.64)
Zθ
E [ϕ̂(T )] = ϕ̂(t) · θ−n ntn−1 dt (7.65)
0
76 CHAPTER 7. EXPONENTIAL FAMILIES
= ϕ(θ), (7.66)
or equivalently, that
Zθ
n ϕ̂(t)tn−1 dt = θn ϕ(θ). (7.67)
t
ϕ̂(t) = ϕ(t) + · ϕ0 (t). (7.69)
n
We conclude that the estimator
maxk Xk 0
ϕ̂MVU = ϕ(max Xk ) + · g (max Xk ) (7.70)
k n k
is the unique MVU estimator for ϕ(θ). Specically, for ϕ(θ) = θ we get the MVU estimator
n+1
θ̂MVU = · max Xk . (7.71)
n k
n+1
Note that θ̂MVU = n · θ̂ML , hence the MVU estimation in this case is just a bias-corrected ML.
What is its variance?
" 2 #
h i n+1
Var θ̂MVU (T ) = E max Xk − θ2 (7.72)
n k
n + 1 2 2
= E T − θ2 (7.73)
n
Zθ
n+1 2
= θ−n ntn−1 · t2 dt − θ2 (7.74)
n
0
2 θ
tn+2
n+1 −n
= θ n − θ2 (7.75)
n n+2
0
n+1 2 n
= · θ2 − θ2 (7.76)
n n+2
(n + 1)2 2
= · θ − θ2 (7.77)
n(n + 2)
n2 + 2n + 2 − n2 − 2n 2
= θ (7.78)
n(n + 2)
θ2
= . (7.79)
n(n + 2)
h i
θ2
This is much better than Var θ̂BLUE = 3n . In fact, note that the variance in this case decays
as O(1/n2 ), which is order-wise better than the Ω(1/n) implied by the CRLB. There is no
contradiction however, since in this case the regularity conditions for the CRLB are not satised
7.3. EXAMPLES 77
(the support size depends on the parameter). Moreover, in this case although Unif([0, θ]) is not
in an exponential family, we have found a scalr sucient statistics. This does not contradict the
Pitman-Koopman-Darius Theorem for the same reason, namely since the distribution support
depends on the parameter.
Can we gain uniformly in MSE by introducing bias? Let θ̂ = α · max xk = α · T . We have
k
seen that:
n
E [T ] = ·θ (7.80)
n+1
n 2
E T2 =
θ . (7.81)
n+2
Thus
= α2 E T 2 − 2αθE [T ] + θ2
(7.83)
n 2n
= θ2 α2 − α+1 . (7.84)
n+2 n+1
n+2 n+1 n+2
This is minimized by α= n+1 (and not by n ). Hence, θ̂biased = n+1 · max Xk we have
k
θ2 θ2
MSE(θ̂biased ) = < = MSE(θ̂MVU ). (7.85)
(n + 1)2 n(n + 2)
This turns out to be true in many cases: introducing a small bias can often uniformly help (but
usually not by much).
In general, the dimension of a sucient statistic grows with the number of samples outside
the exponential families. The example above was the exception rather than the rule; let us now
see that a simple generalization of it to higher dimensions already fails to maintain a constant
dimension of a sucient statistic.
In the worst case we will need to keep all of the sample, e.g., if we are unlucky and say always
get points on the boundary of S. But this is very unlikely, so we expect that most of the time,
Xk will already be in the convex hull of {X1 , . . . , Xk−1 }. We may then wish to see how fast
the size of T (X) grows with high probability as a function of n. The rough answer is that with
high probability, the number of points in T (X) is O(log n), which is much nicer than n but still
grows unbounded. We will not prove this rigorously here, but to see why it is the case, consider
a very similar one-dimensional situation. Suppose the samples are uniform between [0, θ], and
we want to count the number of times that Xk is larger than all the samples the precede it (a
new maximum). This value is
" n # n
1 max Xj < Xk =
X X
E Pr max Xj < Xk (7.87)
j≤k−1 j≤k−1
k=1 k=1
78 CHAPTER 7. EXPONENTIAL FAMILIES
x1 x5
x4
x3
x2
n
X
= E E max Xj < Xk | Xk (7.88)
j≤k−1
k=1
n
"
#
X Xk k−1
= E (7.89)
θ
k=1
Xn Z θ
= θ−(k−1) θ−1 · xk−1 dx (7.90)
k=1 0
n
X θk
= θ−(k−1) θ · (7.91)
k
k=1
n
1X
= (7.92)
k
k=1
Z n
dx
≤1+ (7.93)
1 x
= log n + 1, (7.94)
We conclude this chapter by giving an example where no complete sucient statistic exists.
− 12
=0 (7.97)
but Pr(sin(2πX) = 0) = 0 (and not 1). In other words, X + sin(2πX) is also unbiased, and a
function of a sucient statistic. Thus X is not complete, and we can not conclude that θ̂ = x is
an MVU estimator. The same argument applies to all other sucient statistics here, which are
clearly simply all invertible function of X, showing that none of them is complete.
8. The ML Estimator
8.1 Preliminaries
We begin with a short discussion of known denitions and results concerning the convergence of
r.v sequences. Let {Xk }∞
k=1 be a sequence of real-valued [Link], and let X be a real-valued r.v. We
D
say that Xn convergences in distribution to X, denoted Xn −
→ X, if
at any point x ∈ R where the c.d.f Pr(X ≤ x) is continuous. We say that Xn converges in
P
probability to X , Xn −
→ X, if for all ε>0
Here are some useful results relating the dierent modes of convergence.
P D
(i) Xn −
→X implies that Xn −
→ X.
D P
(ii) Xn −
→c for some constant c, implies that Xn −
→ c.
D D
1. Xn −
→X implies g(Xn ) −
→ g(X).
P P
2. Xn −
→X implies g(Xn ) −
→ g(X)
D
1. Xn + Yn −
→ X + c.
D
2. Xn · Yn −
→ X · c.
X D X
3. Y n −
→ c (for c 6= 0).
n
D
Proof. It can be easily shown that (Xn , Yn ) −
→ (X, c). The result then follows the CMT by using
g(x, y) = x + y, x · y, xy .
79
80 CHAPTER 8. THE ML ESTIMATOR
E [Mean[ X ]] = µ (8.4)
n
" #
1X
Var [Mean[ X ]] = Var Xk (8.5)
n
k=1
1
= · nσ 2 (8.6)
n2
σ2
= (8.7)
n
Using Chebyshev's inequality, we have that for any ε > 0
Var [Mean[ X ]]
Pr (|Mean[ X ] − µ| ≥ ε) ≤ (8.8)
ε2
σ 2
= 2 (8.9)
nε
−−−→ 0. (8.10)
n→∞
We note that under these conditions the convergence is in fact also with probability 1 (which is
a stronger mode of convergence).
(iii) | log p(x | θ)| ≤ h(x) for some function h(x) such that
Theorem 8.7 (Asymptotic properties of ML). Let {p(x | θ)}θ∈Θ be ML-regular, and X=
[X1 , . . . , Xn ]T be drawn i.i.d from p(x | θ). Let
(n)
θ̂ML = argmax p(X | θ). (8.15)
θ
(n)
(ii) (asymptotic eciency) lim Cov[ θ̂ML ] · nI(θ) − Id×d = 0.
n→∞
√ (n)
D
→ N 0, I −1 (θ) .
(iii) (asymptotic normality) n θ̂ML − θ −
Remark 8.8. We note that the claims in the theorem above are rather weak, as they provide
no information about the rate of convergece. Namely, it is entirely unclear how large n should
be in a certain problem so that we can safely think of the ML estimator as having converged
in the sense of having a small MSE, or having a Gaussian distribution in a region close to the
parameter, etc. Establishing nite sample bounds for the convergence of ML is in general a
challenging problem, which can sometimes be handled in specic settings (we will see one such
case in the next chapter).
Proof. For simplicity, we prove for a scalar parameter. Denote the true parameter by θ0 ∈
interior(Θ). For any other θ, dene the (normalized) log-likelihood ratio (LLR)
1 p(x | θ0 )
Ln (x, θ) , log (8.16)
n p(x | θ)
n
1X p(xk | θ0 )
= log . (8.17)
n p(xk | θ)
k=1
(n)
θ̂ML (x) = argmax log p(x | θ) (8.18)
θ
82 CHAPTER 8. THE ML ESTIMATOR
, Dθ (8.24)
where we have used the fact that the samples are i.i.d, then the WLLN, and the denition of
the KL divergence. Recall that Dθ ≥ 0, and by our identiability assumption this inequality is
strict for any θ 6= θ0 . We conclude that
P > 0 θ 6= θ0
Ln (X, θ) −
→ Dθ = . (8.25)
0 θ = θ0
Since our estimator picks the value of θ that minimizes the LLR, it is intuitively clear that it
would eventually favor the true parameter θ0 over any other xed θ, with high probability as
n grows large. However, we need this to happen uniformly in θ, and not just for each single θ,
which is where the UWLLN comes into the picture.
Let us proceed more rigorously. Dene
= inf Dθ (8.27)
|θ−θ0 |>ε
Note that D > 0 due to identiability, the continuity of the likelihood function, and the fact
that Θ is a closed set. Let us also write Ln (θ) for Ln (X, θ), suppressing the dependence on X
for brevity. We would like to evaluate the probability that our estimator would prefer some θ
that is more than ε away from the true parameter θ0 :
(n)
Pr θ̂ML − θ0 > ε = Pr inf Ln (θ) < Ln (θ0 ) (8.28)
|θ−θ0 |>ε
= Pr inf Ln (θ) < 0 (8.29)
|θ−θ0 |>ε
" #
≤ Pr sup |Ln (θ) − dθ | > D (8.30)
|θ−θ0 |>ε
−−−→ 0. (8.31)
n→∞
(n) P
where (8.31) follows from the UWLLN. Hence we conclude that θ̂ML −
→ θ0 .
Let us prove asymptotic normality. To that end, recall that the Mean Value Theorem says
that if f (·) is continuous on [a, b] and dierentiable on (a, b), then there exists c ∈ (a, b) such
that
f (b) − f (a)
f 0 (c) = (8.32)
b−a
8.2. ML ASYMPTOTIC OPTIMALITY 83
or equivalently
∂
score [X | θ] = log p(X | θ) (8.34)
∂θ
n
X ∂
= log p(Xk | θ). (8.35)
∂θ
k=1
h i
(n)
score X | θ̂ML = 0. (8.36)
Since we assumed the score is dierentiable w.r.t θ, we can use the Mean Value Theorem to write
0
h i
score [X | θ] = score [X | θ0 ] + (θ − θ0 )score X | θ̃ (8.37)
(n)
for some θ̃ between θ0 and θ. Substituting θ = θ̂ML and using (8.36), we obtain
h i 0
h i
(n) (n)
score X | θ̂ML = score [X | θ0 ] + (θ̂ML − θ0 )score X | θ̃ (8.38)
=0 (8.39)
(n)
where θ̃ is now dependent on θ̂ML . Hence,
√
√ (n) n · score [X | θ0 ]
n(θ̂ML − θ0 ) = − h i (8.40)
score0 X | θ̃
√1 score [X | θ0 ]
n
= h i. (8.41)
− n1 score0 X | θ̃
n
1 1 X
√ score [X | θ0 ] = √ score [Xk | θ0 ] (8.42)
n n
k=1
D
−
→ N (E [score [X | θ0 ]] , Var [score [X | θ0 ]]) (8.43)
where X ∼ p(x | θ0 ), and we have used the CLT. Let us examine the denominator:
n
1 0
h i 1 X ∂2
− score X | θ̃ = − log p(X | θ̃). (8.45)
n n ∂θ2
k=1
∂2
−E log p(X | θ0 ) = I(θ0 ). (8.46)
∂θ2
84 CHAPTER 8. THE ML ESTIMATOR
However, we already know that the ML estimate converges to the correct θ0 , hence θ̃ is with
high probability sandwiched inside a vanishingly small interval. Making this precise, we write:
1 0
h i
Pr − score X | θ̃ − I(θ0 ) > ε (8.47)
n
1 0
h i
= Pr − score X | θ̃ − I(θ0 ) > ε | θ̃ − θ0 < δ Pr θ̃ − θ0 < δ (8.48)
n
1 0
h i
+ Pr − score X | θ̃ − I(θ0 ) > ε | θ̃ − θ0 ≥ δ Pr θ̃ − θ0 ≥ δ (8.49)
n
1 0
h i
≤ Pr − score X | θ̃ − I(θ0 ) > ε | θ̃ − θ0 < δ + Pr θ̃ − θ0 ≥ δ (8.50)
n
!
1 0
(n)
≤ Pr sup − score [X | θ] − I(θ0 ) > ε + Pr θ̂ML − θ0 ≥ δ (8.51)
θ:|θ−θ0 |<δ n
(n)
where (8.51) holds sinceθ̃ is squeezed between θ0 and θ̂ML . The second term in (8.51) tends to
zero with n for any xed δ > 0, due to the consistency of the ML estimator. Let us pick δ > 0
ε
small enough such that | − E [score[ X | θ ]] − I(θ0 )| <
2 for any θ such that |θ − θ0 | < δ , which
is possible due to the smoothness of the score's derivative. For such a choice, the rst term also
tends to zero with n, due to the UWLNN. We therefore conclude that
1 0
h i
P
− score X | θ̃ −→ I(θ0 ) (8.52)
n
1 D
as n → ∞. Recalling that √
n
score [X | θ0 ] −
→ N (0, I(θ0 )), then applying Slutsky's theorem
to (8.41) yields
√ (n) √1 score [X | θ0 ]
n
n θ̂ML − θ0 = h i (8.53)
− n1 score0 X | θ̃
D N (0, I(θ0 ))
−
→ (8.54)
I(θ0 )
= N 0, I −1 (θ0 ) ,
(8.55)
• Try to nd a pivotal quantity for θ, which is any function g(x, θ) such that the distribution
of g(X, θ) does not depend on the parameter.
• Given a pivotal quantity, set a condence level c and try to nd the set of all θ values that
satisfy
where A is some nice set choosing A as an interval is good when g is continuous and
invertible in θ.
Example 9.1 .
(Constant in AWGN) Consider the model
i.i.d
X ∼ N (θ, σ 2 ) where σ2 is known.
Recall that the MVU estimator is θ̂MVU = Mean[ X ] and that
σ2
Mean [X] ∼ N (θ, ). (9.2)
n
Then, the function
Mean [X] − θ
g(X, θ) = √ ∼ N (0, 1), (9.3)
σ/ n
85
86 CHAPTER 9. CONFIDENCE INTERVALS
also called as the z-score, is a pivotal quantity. From the symmetry of the problem, it makes
sense to choose A = [−t, t], which yields on the one hand
c = Pr (g(X, θ) ∈ A) (9.4)
= 1 − 2Q(t) (9.6)
t = Q−1 1−c
Namely,
2 . Solving for θ yields
σ σ
|g(X, θ)| ≤ t ⇔ θ ∈ Mean[ X ] − t · √ , Mean[ X ] + t · √ . (9.7)
n n
Hence,
1−c σ
l(X) = Mean[ X ] − Q−1 · √ (9.8)
2 n
−1 1 − c σ
r(X) = Mean[ X ] + Q · √ (9.9)
2 n
Example 9.2 .
(Linear model in AWGN) Consider the linear model X = Hθ + W , where
W ∼ N (0, σ 2 I) and σ2 is known. Let us nd a CI for θ1 and treat θ2 , ..., θd as nuisance
parameters. Recall that
θ̂1,MVU (X) − θ1
g(X, θ1 ) = p ∼ N (0, 1), (9.11)
σ {(H T H)−1 }ii
and we get
1−c
q
−1 T −1
Pr θ − θ̂1,ML < Q · σ {(H H) }ii = c, (9.12)
2
for any condence level c. More generally, we can look at condence regions for the entire
parameter vector (or parts of it). In this case, we can use
θ̂ML − θ ∼ N 0, σ 2 (H T H)−1
(9.13)
In the next example we show that we can also work with quantities that are asymptotically
pivotal, namely converge to oa distribution that does not depend on the parameters, assuming
we can get concrete bounds on the convergence.
87
Example 9.3 .
(Constant in general noise) Consider the model Xk = θ + Wk where Wk nk=1 is
i.i.d with zero mean and known variance Var [Wk ] = σ 2 . By by the CLT, it holds that the z-score
satises
Mean [X] − θ D
g(X, θ) = √ −
→ N (0, 1). (9.14)
σ/ n
This statement is however not practically useful, as it gives no information on the distribution for
nite n, hence not good enough for the purpose of computing a condence intervals. However, if
E |Wk |3 ≤ η
we assume that for some known η, then the nite sample deviation of the z-score
from Gaussian can be bounded using the Berry-Esseen CLT:
Mean[ X ] − θ η
c = Pr √ ≤ t ≥ 1 − 2 Q(t) + 3 √ . (9.15)
σ/ n σ n
η√
Repeating the AWGN derivations with the small correction term yields the CI
σ3 n
2η
1− c− √
· √σ
σ 3 n
l(X) = Mean[ X ] − Q−1 (9.16)
2 n
1 − c − σ32η
√
r(X) = Mean[ X ] + Q−1
n
· √σ (9.17)
2 n
Example 9.4 (Constant in AWGN with unknown variance). Consider the model X i.i.d
∼ N (µ, σ 2 )
where µ, σ 2 are both unknown. Suppose we only care about µ and treat σ2 as a nuisance
parameter (a very common situation). We have seen that p(x | µ, σ 2 ) is exponential family with
T
X 2 , and
complete sucient statistics T (X) = X
ˆ
σ 2 MVU = Mean[ {X12 , . . . Xn2 } ] − (Mean[ X ])2 (9.19)
n
1 X
= (Xk − Mean[ X ])2 . (9.20)
n−1
k=1
µ̂MVU − µ
√ ∼ N (0, 1). (9.21)
σ/ n
This is however not a pivotal quantity for µ, since it also involves σ2 which is unknown. Nev-
ertheless, there is a way around this problem. Recall we have seen the the Fisher information
matrix is diagonal in this case, which means that the estimators µ̂MVU and σˆ2 MVU are uncorre-
lated. As it turns out, it can be shown that they are also statistically independent. Moreover, it
can be shown (using e.g. Cochran's Theorem) that
σˆ2 MVU
(n − 1) · ∼ χ2n−1 , (9.22)
σ2
88 CHAPTER 9. CONFIDENCE INTERVALS
where χ2n−1 is the Chi-square distribution with n−1 degrees of freedom, i.e., the distribution of
the sum of squares of n−1 i.i.d standard Gaussian [Link]. Hence, we can consider the following
quantity, called the t-statistic (abbreviated t-stat):
√ µ̂MVU −µ
µ̂ −µ √ σ/ n
qMVU = n−1· q (9.23)
ˆ2
σˆ2 MVU /n (n − 1) · σ σMVU
2
√ N (0, 1)
∼ n−1· q , (9.24)
2
Xm−1
√
by which the distribution of a r.v obtained by
q n − 1 times the ratio of two independent [Link]
with distributions N (0, 1) and
2
Xm−1 . This distribution is known as the student's t-distribution
with n − 1 degrees of freedom, and its p.d.f is given by
− n2
Γ n2
t2
p(t) = p · 1+ . (9.25)
(n − 1)πΓ n−1 2
n−1
µ̂ −µ
The t-stat q MVU is a pivotal quantity for µ with the above distribution, which can be easily
σˆ2 MVU /n
used to obtain condence intervals. Special cases of the t-stat's distribution include:
1
p(t) = . (9.26)
π(1 + t2 )
1
p(t) = 3 . (9.27)
(2 + t2 ) 2
1 t2
p(t) = √ e− 2 . (9.28)
2π
For a small sample size the distinction between student's t-distribution and the normal distribu-
tion are is signicant. As n grows large this dierence becomes smaller. As a rule of thumb, for
a sample of size n = 100 it is already quite accurate to use the normal distribution in lieu of the
the t-distribution in order to compute condence intervals with condence level as high as 99%,
but this is insucient for a condence level of say 99.9%. The t-stat can be generalized to any
linear model.
2
E θ̂ML − θ
≤ (9.30)
t2
1
≈ (9.31)
t2 nI(θ)
1
≈ , (9.32)
t2 nI(θ̂ML )
where (9.30) holds due to Markov's inequality. If we set a condence level of Pr θ̂ML − θ > t =
1 − c, we get
s
1
t≈ (9.33)
n · (1 − c) · I(θ̂ML )
" s s #
1 1
θ ∈ θ̂ML − , θ̂ML + (9.34)
n · (1 − c) · I(θ̂ML ) n · (1 − c) · I(θ̂ML )
with condence level approximately c or higher. Note that Markov's inequality is generally
quite loose in this case, since we expect the distribution of the ML estimator to be close to
normal. Thus, this approximate CI is typically on the conservative side.
90 CHAPTER 9. CONFIDENCE INTERVALS
10. Regularization and the Bayesian Ap-
proach
10.1 Regularization
The following (related) issues often arise in various estimation problems:
• Overtting: The model is too complicated relative to number of samples, e.g., there are
too many parameters and not enough samples. In this case, while we can come up with
(typically many) estimators that seem to explain the data well, these estimators in fact
explain the noise rather than the signal. They are often very noisy and do not generalize
well beyond the sample we have seen.
• Ill-posed problems: There may be many dierent θ's that explain the data almost equally
well, hence inverting the problem might be numerically unstable. For example, in a linear
model we have
• Prior information: We may know something about what θ should look like, and want
to incorporate it into the estimation.
Recall the famous Occam's razor: Entities are not to be multiplied without necessity .
Namely, if there are many ways to explain a phenomenon which are all similarly eective, choose
the simplest one. Loosely speaking, regularization is a process in which we penalize compli-
cated explanations, so that choosing a more complex solution to our estimation problem would
require this solution to explain the data signicantly better than simpler solutions do. This is
useful in order to avoid over-tting noise and stabilize the problem.
In order to make this more precise, we need to dene what we mean by explaining the
data, and by the complexity of the solution to an estimation problem. To these ends, the
standard approach is slightly dierent than what we have seen so far. Let L : X ×Θ → R be
a loss function, where L(x, θ) measures how well θ explains the data x. Note that this is not
the loss function Lθ (θ̂) we discussed before, which measures how close the estimator is to the
real θ . The loss function L(x, θ) is in some sense more general, as it does not directly assume
any statistical model, nor it assumes there is a true parameter. Each parameter value provides
some explanation to the data, whose accuracy is measured by L(x, θ). However, we can connect
this point of view to our usual parametric family setup by using (say) the log-likelihood loss
function, i.e.,
91
92 CHAPTER 10. REGULARIZATION AND THE BAYESIAN APPROACH
Hence, the estimator that minimizes the loss in this case is simply the ML estimator for the
parametric family
We will restrict ourselves to this type of choice (which is in fact quite general).
Now, we further dene a penalty function or a regularization function R : Θ → R+ , where
R(θ) gives higher values to models (i.e., parameters θ) that are more complicated, in the suitable
sense for the specic problem. Now, we can solve the regularized problem where we try to
minimize the loss under the constraint that the model is not too complex. One way to do that
is to solve the optimization problem
where λ is the regularization parameter, the determines the exchange rate between loss and
complexity. This is the Lagrangian version of (10.4), hence if we can solve (10.5) for any λ
then we can essentially use it to solve (10.4), say by gradient descent over λ, until we meet the
constraint. We therefore focus on (10.5), which is easier to work with.
X = Hθ + W , (10.6)
where W ∼ N (0, C) and C is a known covariance matrix. Let Q be a p.s.d matrix such that
QT Q = C −1 . The associated parametric family is
1
p(x | θ) = const · exp − (Hθ − x)T C −1 (Hθ − x) (10.7)
2
1 2
= const · exp − kQ(Hθ − x)k . (10.8)
2
1
L(x, θ) = kQ(Hθ − x)k2 . (10.9)
2
As we have seen, the estimator that minimizes this quadratic loss (or the energy of the residual
in the linear regression lingo) is
Let us add some regularization. The simplest regularization function to analyze is also quadratic
like the loss:
1
θ̂REG (x) , argmin kQ(Hθ − x)k2 + λ kΓθk2 (10.12)
θ 2
= argmin kQ(Hθ − x)k2 + 2λ kΓθk2 (10.13)
θ
• Choosing Γ = Id×d to be the identity matrix (which also known as ridge regression ), we
have
d
X
R(θ) = θi2 . (10.14)
i=1
This corresponds to the case where we do not believe that the parameters can be too large,
or perhaps we even believe that the true model is sparse, namely only few of the parameters
are active at the same time (in this case, there are better regularization functions, as we
we will later see).
This choice can be though of as penalizing by the energy of the discrete derivative of θ
along the index axis, and can be used e.g. when there is a physical reason the believe that
hte parameter vector is smooth.
= θ T H T C −1 H + 2λΓT Γ θ − 2θ T H T C −1 x + kQxk2 .
(10.19)
94 CHAPTER 10. REGULARIZATION AND THE BAYESIAN APPROACH
−1 T −1
θ̂REG (x) = H T C −1 H + 2λΓT Γ H C x. (10.20)
−1 T
θ̂REG (x) = σ −2 · σ −2 H T H + 2λΓT Γ H x= (10.21)
−1
= H T H + 2λσ 2 · ΓT Γ HT x
(10.22)
Typically, one absorbs the variance and constants into the regularization parameter, i.e., using
λ instead of 2λσ 2 .
This optimization problem cab be explicitly solved (try at home), yielding the estimator
Note that this is reminiscent of the standard LS solution, where the right pseudo-inverse is used
instead of the left pseudo-inverse.
As it turns out, the minimum norm estimator can also be viewed through the lens of regu-
larization. It can be shown (try at home) that
Lemma 10.1. Let A, B ∈ Rn×n be symmetric, and λmin (A), λmin (B) their minimal eigenvalues,
respectively. Then
and
Proof. Since A is symmetric, there exists an orthonormal basis {ui }ni=1 of eigenvectors of A, with
n
respective real-valued eigenvalues λmin (A) = λ1 ≤ λ2 ≤ · · · ≤ λn . Any unit vector v ∈ R can
Pn P 2
be written as v = i=1 αi ui , where αi = 1. Then
n
X
T
min v Av = min αi2 λi (10.29)
kvk=1 αi : α2i =1
P
i=1
= λ1 , (10.30)
Suppose now that in the linear model with AWGN, the matrix HT H is close to singular, i.e.,
λmin (H T H) ≈ 0, (10.34)
or is perhaps even zero exactly, say when there are more parameters than samples. This makes
the ML/MVU solution very unstable. Using a Tikhonov regularization with any non-singular
matrix Γ (and absorbing 2σ 2 into λ for readability), yields
which can be made suciently large (hence stable) for a large enough λ > 0. Of course, increasing
λ too much is also problematic, as the solution will tend to minimize only the regularization
constraint and hardly take the loss function into account.
For Γ = I, we can further interpret the solution using the singular value decomposition
(SVD).
A = U DV T (10.36)
form some is n×n unitary matrix U, d × d unitary matrix V , and n × d diagonal matrix D with
nonnegative diagonal entries (called the singular values of A).
Note that
AT A = V DT U T U DV T = V DT DV T (10.37)
96 CHAPTER 10. REGULARIZATION AND THE BAYESIAN APPROACH
and similarly AAT = U DDT U T , hence the eigenvalues of the p.s.d matrix AT A (and also AAT )
are the squared singular values of A.
Let us write the SVD of the observation matrix H as H = U DV T . Then for Γ = I and
C= σ2 ·I (absorbing the constants again) we have
−1
θ̂REG = H T H + λI HT x (10.38)
−1
= V DT DV T + λI V DT U T x (10.39)
−1
= V DT DV T + λV V T V DT U T x (10.40)
−1
= V DT D + λI V T V DT U T x
(10.41)
T −1 T T T
= V (D D + λI) V V D U x (10.42)
−1 T T
= V DT D + λI D U x. (10.43)
| {z }
−1
,Dλ ∈Rd×n
Let us write D = diag {σ1 , ..., σd } (we are assuming here that d ≤ n, but this is not necessary).
Then
where
σ1 σd
Dλ−1 = diag 2 , ..., 2 . (10.45)
σ1 + λ σd + λ
log kΓθk2
regularization cost
small λ
best λ
large λ
e−λR(θ) dθ
R
where (10.49) holds since is assumed to be nite. This is the maximum a-posteriori
(MAP) estimate of θ, based on the prior
e−λR(θ)
p(θ) = R . (10.55)
e−λR(θ) dx
b. Find the posterior p(θ | x) given the observation, using Bayes rule.
h i
Ep(x,θ) l(θ̂(X), θ) = Pr θ̂(X) − θ > δ (10.56)
= 1 − Pr θ̂(X) − θ ≤ δ (10.57)
h i
= 1 − EX Pr θ̂(X) − θ ≤ δ | X (10.58)
h i
∼
= 1 − EX 2δ · pθ | X (θ̂(X) | X) (10.59)
98 CHAPTER 10. REGULARIZATION AND THE BAYESIAN APPROACH
l(θ̂, θ)
±δ
θ̂
θ
Figure 10.2
≥ 1 − 2δEX max p(θ | X) , (10.60)
θ
to small variance of the prior, i.e., a stronger prior (as expected). Conversely, when λ → 0
there is essentially an almost uniform prior (very wide distribution). Recall we had
−1 T −1
θ̂REG = H T C −1 H T + 2λΓT Γ H C x. (10.66)
−1
θ̂MAP = H T C −1 H T + Cθ−1 H T C −1 x. (10.67)
Suppose now we proceed with the Bayesian interpretation, but wish to use the quadratic loss
function, i.e., the MMSE estimator. Recall that θ̂MMSE = E [θ | X] and that θ̂MAP = Mode(θ | x),
which are generally dierent. But for any Gaussian r.v., Mode = Mean. Hence
Cov [θ, X] = E θ · X T
(10.70)
= E θ θT H T + W T
(10.71)
T
= Cθ · H (10.72)
h i
Cov [X, X] = E (Hθ + W ) (Hθ + W )T (10.73)
= HCθ H T + C. (10.74)
Hence
−1
θ̂MMSE = Cθ · H T HCθ H T + C x. (10.75)
This looks dierent from θ̂MAP . In a homework exercise, you will prove they are the same using
the matrix inversion lemma. The covariance of the error (in both cases) is given by (verify!)
h i −1
Cov θ̂, θ̂ = H T C −1 H + Cθ−1 (10.76)
−1
= Cθ − Cθ H T HCθ H T + C HCθ (10.77)
where (10.76) is the MAP version and (10.77) is the MMSE version. Using the MAP interpreta-
tion, we can see that
and
kCθ k→∞
θ̂MMSE = θ̂MAP −−−−−−→ θ̂ML (10.79)
as we expect.
The is however a dicult problem since we need to iterate over all possible subset selections,
which is exponential in d. Therefore, a common relaxation is to use the L1 norm, i.e.,
X
R(θ) = kθk1 = |θi | (10.81)
i
and solve
n o
min kHθ − xk2 + λ kθk1 (10.82)
The problem with this relaxation is that there no explicit solution; however, ecient algorithms
do exists (we will discuss one later on). Also, as we shall later see this method encourages a
sparse solution. Note that the associated prior in this case is Laplacian:
In Bayesian estimation, we impose a prior distributions p(θ) on the parameter θ, which induces
a joint distribution on the p(x, θ) = p(θ)p(x | θ) on parameter and the samples. Thus, given the
samples we can in principle compute the posterior distribution
p(θ)p(x | θ)
p(θ | x) = (11.1)
p(x)
p(θ)p(x | θ)
=R (11.2)
p(θ0 )p(x | θ0 )dθ0
of the parameter given the samples, and then pick an estimator θ̂ that minimizes the expected
value of some loss function of our choosing (quadratic, hit-or-miss, absolute deviation, etc.).
However, the computation of the posterior distribution can sometimes be very dicult, since
computing the marginal distribution p(x) requires computing the integral above that in many
cases does not admit a closed-form solution. In some interesting special cases however, this
computation is very simple.
Let {p(x | θ)}θ∈Θ be a parametric family, and let {M (µ)} be a family of distribution on
θ, parameterizes by some hyper-parameter µ (e.g., N (0, µ)). Suppose that we choose some
p(θ) ∈ M (µ) as a prior on the parameter θ, and then observe the sample x and compute the
posterior p(θ | x). If for any p(θ) ∈ M (µ), the corresponding posterior p(θ | x) ∈ M (µ) is also in
the same family of distributions on θ , then we say that M (µ) is a family of conjugate priors for
p(x | θ). When structured families of conjugate priors exist, there is often a simple update rule
for the hyper-parameters.
Example 11.1 .
(Random Constant in AWGN) Consider the model X = θ+W where W ∼
2 ),
N (0, σW where
2
σW is unknown. What are the natural conjugate priors here? We are looking
for a family of distributions on θ such that, when adding an independent Gaussian to θ, we
remain inside the family. Obviously, the answer is also Gaussian the conjugate priors are
N (µ, σθ2 ) with µ, σθ2 as hyper-parameters. What is the update rule? If p(θ) ∼ N (µ, σθ2 ) then
σθ2 2
σW 2
=N µ+ 2 2 · (x − µ) , σ 2 + σ 2 · σθ . (11.4)
σθ + σW θ W
Let us now impose some prior p(θ) on the bias parameter θ. We have
p(x | θ) · p(θ)
p(θ | x) = R (11.6)
p(x | θ0 )p(θ0 )dθ0
101
102 CHAPTER 11. CONJUGATE PRIORS
P P
xk
= c(x) · p(θ) · θ (1 − θ)n− xk
. (11.7)
What would be a choice of prior that would result in a similarly looking posterior? A reasonable
idea is to set p(θ) ∝ θa−1 (1 − θ)b−1 (where the −1 is for consistency with the literature), which
yields
P P
p(θ | x) ∝ θa−1+ xk
(1 − θ)b−1+n− xk
(11.8)
P P
where ã = a + xk and b̃ = b + n − xk . Therefore, the following family of distributions
θa−1 (1 − θ)b−1
p(θ) = (11.10)
R1
θa−1 (1 − θ)b−1 dθ
0
is a conjugate prior to the i.i.d Bernoulli trials (and hence also to the Binomial distribution). This
is known as the Beta(a, b) distribution where a, b are hyper-parameters. The hyper-parameters
update rule is very simple:
n n
!
X X
(a, b) → a+ xk , b + (1 − xk ) . (11.11)
k=1 k=1
n
P n
P
Note that xk is the number of successes and (1 − xk ) is the number of failures. Thus,
k=1 k=1
we can informally think of a−1 and b−1 as the number of successes and failures that have
occurred before we started observing the data, i.e., our prior belief as to how the sample might
behave. Note that the special case where a=b=1 pertains to observing zero successes and
failures before we start, which intuitively means that we are completely ignorant; and indeed,
Beta(1, 1) = Unif([0, 1]) is the uniform distribution over the parameter space.
Once we have obtained the posterior distribution, we can apply the optimal estimator per-
taining to our loss function. For example, if our loss function is quadratic, then we are interested
in the MMSE estimator, which is the posterior conditional expectation. It can be shown that
a
E [Beta(a, b)] = a+b , hence in this case
n n
" !#
X X
θ̂MMSE (x) = E Beta a+ xk , b + (1 − xk ) (11.12)
k=1 k=1
n
P
a+ xk
k=1
= n n (11.13)
P P
a+ xk + b + (1 − xk )
k=1 k=1
n
P
xk + a
k=1
= (11.14)
n+a+b
n a+b
= · Mean[ x ] + · E [Beta(a, b)] , (11.15)
n+a+b n+a+b
which is a certain weighted average of the prior mean and the sample mean, converging to the
sample mean as n → ∞. If our loss function is hit-or-miss, then we are interested in the MAP
103
a−1
estimator, which is the posterior mode. It can be shown that Mode(Beta(a, b)) = a+b−2 , hence
in this case
n n
!!
X X
θ̂MAP = Mode Beta a+ xk , b + (1 − xk ) (11.16)
k=1 k=1
P
xk + a − 1
= n (11.17)
P P
xk + a + (1 − xk ) + b − 2
k=1
n
P
xk + a − 1
k=1
= (11.18)
a+b+n−2
n a+b−2
= · Mean[ x ] + · Mode(Beta(a, b)). (11.19)
n+a+b−2 n+a+b−2
which is weighted sum of the prior mode and the sample mean. For the fully ignorant prior
a = b = 1, we get
which coincides with the MVU (and also ML) estimator for the case where no prior is assumed.
In the two examples we have seen, the parametric families have been exponential families.
This is not a coincidence it turns out that every exponential family admits a suitable family of
p(x | θ) = h(x) · exp η T (θ) · T (x) − a(θ)
conjugate priors. To see this, let be some exponential
family. Then
A reasonable choice for a conjugate prior would be similar to the exponential family, i.e.,
where the vector τ and the scalar ν are the hyper-parameters, and g is the proper normalization.
Let us check that this is indeed a conjugate prior, and nd the update rule for the hyper-
parameters:
This is clearly also in the same family. The update rule is therefore
τ 0 = τ + T (x) (11.25)
0
ν = ν + 1. (11.26)
More generally, if we are given n i.i.d. observations, the update rule is simply
n
X
τ0 = τ + T (xk ) (11.27)
k=1
ν 0 = ν + n. (11.28)
Namely, the hyper-parameter vector τ simply accumulates the sucient statistic, and the scalar
hyper-parameter ν essentially simply counts the number of samples.
104 CHAPTER 11. CONJUGATE PRIORS
n
X
0
τ =τ+ xk (11.29)
k=1
ν 0 = ν + n. (11.30)
How is this related to a and b that we had? It is a simple change of variables, where τ =a and
ν =a+b (verify at home also that we get Beta distribution indeed).
Example 11.3 .
(Poisson distribution) Suppsoe that X ∼ Poisson(θ) for some θ > 0, i.e.,
θx −θ
p(x | θ) = e x ∈ {0, 1, . . .} (11.31)
x!
Let us write it in exponential family format:
1
p(x | θ) = exp {x · log θ − θ} . (11.32)
x!
Hence the sucient statistic is T (x) = x, the natural parameter is η = log θ, and the log-partition
function is a(θ) = θ. The conjugate prior is therefore
= θτ e−νθ (11.34)
Remark 11.4. We have seen that for exponential family likelihoods, we get exponential family
conjugate priors. The set of priors for which simple update rules exist can sometimes be extended
beyond exponential families, by considering mixture distribution over the exponential family
members. Namely, we can set some additional hyper-prior that chooses the hyper-parameters
of the exponential family prior. Since the space of the distribution is closed under convex
combinations, the update formula would simply update each member separately. For example,
consider the coin toss model with a Beta(a, b) prior (which is a conjugate prior as we have seen).
Set some distribution p(a, b) over the hyper-parameters (a, b). Then (informally)
Z
p(θ) = Beta(a, b) · p(a, b)dadb (11.35)
n n
Z !
X X
p(θ | x) = Beta a+ xk , b + (1 − xk ) · p(a, b)dadb. (11.36)
k=1 k=1
p(θ) is not an exponential family, yet tracking the Beta distribution hyper-parameters is easy,
and the integral p(θ | x) may be easy to compute. For example, if p(a, b) is a nite mixture then
the integral becomes a simple summation.
12. Hypothesis Testing
In this chapter we discuss the problem of hypothesis testing, which is an estimation problem
with a discrete parameter, i.e., where the parameter space Θ has nite cardinality. Each value
of the parameter corresponds to a hypothesis about the way out samples behave, and our goal
will be to decide which hypothesis is the correct one. We will restrict ourselves here to the
binary hypothesis testing setup, where our parameter space Θ = {H0 , H1 } contains only two
hypotheses.
Below, we will suppress the parameter θ as its value has no numerical meaning.
H0 is traditionally called the null hypothesis, and it usually represents the normal state of
nature, namely where no interesting phenomenon is being observed. H1 is traditionally called
the alternative hypothesis, and represents the state of nature where an interesting phenomenon
is present.
What are we going to use as an optimality criterion? Clearly, the quadratic loss is meaningless
in this discrete setting. Let us think of our estimator θ̂ ∈ {H0 , H1 } as a decision rule, namely a
decision of whether we believe in H0 or in H1 after having observed the sample. There are four
possible events:
H0 true H1 true
type II error
accept H0 miss detection
great!
θ̂ = H0 false negative
type I error
accept H1 false alarm great!
θ̂ = H1 false positive
As the table suggests, there are several common names for the error events. Here we will use
the false-alarm vs. mis-detection notation, as it is in line with our intuitive interpretation of the
null and alternative hypotheses. The associated error probabilities are
PFA (θ̂) = Pr θ̂ = H1 | H0 (12.3)
105
106 CHAPTER 12. HYPOTHESIS TESTING
PMD (θ̂) = Pr θ̂ = H0 | H1 . (12.4)
There is an inherent tension between the false-alarm and mis-detect probabilities. Clearly, for
example, we can attain PFA = 0 by always deciding θ̂ = H0 , but that would result in the worst
possible PMD = 1 (and vice versa). More generally, it is intuitively clear that if we want to
improve one these error probabilities, we would have to pay in the other one. Our goal now
would be to characterize the best possible tradeo between the two, i.e., to x one and minimize
the other.
Some typical examples for a binary hypothesis testing problems are:
• Blood test: Suppose that X represents the numerical value of some blood test result testing
for some disease. If this test is to be useful, we expect to have (signicantly) dierent
distributions of X in healthy vs. sick people. In this case we dene
Given the test result X, we would like to devise a test with a low mis-detection (false
negative) probability. However, aiming at a very low mis-detect may result in a high
false-alarm (false positive) probability, which may cause panic and result in unnecessary
additional test and procedures. In many practical cases, the operating point on the mis-
detect vs. false-alarm curve is determined by balancing the expected medical costs (which
is not necessarily what we as patients would want to optimize).
• The radar problem (signal detection): Suppose we are trying to acquire a target via radar.
We transmit a signal that hits the target, if it exists, and gets reected back to us. Letting
X represent the radar measurement, we expect X to be pure noise if there is no target,
and to be signal plus noise in case a target exists. Thus we dene
H0 : X = W (noise) (12.7)
H1 : X = S + W (signal+noise). (12.8)
where W is the ambient noise (say Gaussian etc.) and S is say a known constant (or a
signal with a known distribution). Given X, we would like to determine if we see a target.
Again, we would like to devise a decision rule with a low mis-detect to avoid intrusion, but
also a low false-alarm to avoid unnecessary sirens going o etc.
• Testing the signicance of an estimator. Suppose that we are given X ∼ p(x | θ) where
θ is a continuous parameter, and we came up with some estimator θ̂(x), say under an
MSE criterion or similar. We would now like to decide whether this estimator is worth
anything. Once approach can be to test the estimator against some null model θ0 , i.e.,
to consider the binary hypothesis testing problem
H0 : X ∼ p(x | θ0 ) (12.9)
For example, in the linear model X = Hθ + W in AWGN, we may want to test the MVU
estimator θ̂MVU = (H T H)−1 H T X against the null model where θ0 = 0, i.e., where we are
observing only noise X = W .
What is the best tradeo between PMD and PFA ? Consider the achievable region in R2
containing all the pairs (PFA , PMD ) ∈ [0, 1]2 that can be attained by some decision rule. The
12.1. THE NEYMAN-PEARSON SETTING 107
p0 (x)
p1 (x)
x
H0 H1
Figure 12.1
PMD
1
acheivable
region
Figure 12.2
interesting part of the achievable region is its boundary, which consists of the set of pairs such
that we cannot improve (decrease) both PFA and PMD . This boundary must be a convex curve,
since we can always achieve any point along a straight line connecting any two points on the
boundary, by simply randomizing between any the two associated decision rules (to achieve all
the points on the curve we sometimes need to randomize, e.g. in the discrete sample case). Note
in particular that the straight line PMD = 1 − PFA can always be attained. e.g. by a random
decision independent of the samples.
In light of the above discussion, our problem is
for some xed desired false-alarm level p. A decision rule achieving this minimum is called in the
statistics lingo the most powerful test at level p. We will nd it more convenient to equivalently
think of a decision rule in terms of its decision regions, i.e., the regions of samples where the rule
decides in favor of a certain hypothesis. In the binary setup, we usually think of the acceptance
region
A = {x ∈ X : θ̂(x) = H0 } (12.12)
where the rule accepts the null hypothesis, and the rejection region Then in this notation
Ā = {x ∈ X : θ̂(x) = H1 } (12.13)
108 CHAPTER 12. HYPOTHESIS TESTING
where the rule rejects the null hypothesis (and accepts the alternative). Identifying the rule with
its acceptance region, we write with some abuse of notation
PFA (A) = Pr X ∈ Ā | H0 = P0 (Ā) (12.14)
where Pi (B) is the probability that X ∈B under the distribution pi (x). We can now rewrite
our optimization problem (12.16) in terms of the acceptance region:
Theorem 12.1 (Neyman-Pearson Lemma) . Let τ ≥ 0 be some threshold, and consider the
decision rule with the acceptance region
p0 (x)
A, x∈X : >τ . (12.17)
p1 (x)
then
Proof. The cases where τ =0 and τ =∞ are trivial, hence we assume that 0 < τ < ∞. Let
1A (x) and 1B (x) be the indicator functions of the regions A, B respectively. Then for all x
x∈A x∈B
= P0 (A) − τ P1 (A) − P0 (B) + τ P1 (B) (12.23)
≥ 0, (12.28)
where the last inequality follows from our assumption that PFA (B) ≤ PFA (A). Recalling that
0 < τ < ∞, the result follows.
The following corollary, which follows from the Neyman-Pearson Lemma by noting that
the mis-detect probability decreases as the threshold increases, provides the solution to our
optimization problem (12.16), of nding the optimal acceptance region at a certain desired level
of false-alarm probability. The same result (with roles switched) holds when we x a mis-
detect error probability and minimize the false-alarm. For brevity, we will sometimes denote the
p0 (x)
likelihood-ratio by Λ(x) , p1 (x) .
Corollary 12.2. The solution to (12.16), i.e., the acceptance region minimizing PMD under the
constraint that PFA ≤ p, is given by
where
( Z )
τ = sup 0 ≤ t ≤ ∞ : p0 (x)dx ≤ p . (12.30)
x:Λ(x)≤t
whenever the supremum is attained as a maximum. Otherwise, use the rule Λ(x) ≥ τ for the
acceptance region.
The curve delineated by the above corollary may be discontinuous (e.g., in discrete cases).
A smooth convex characterization of the optimal curve can be more elegantly given if we allow
randomized decision rules (as we have already discussed).
Corollary 12.3. Consider the family of randomized decision rules, parameterized by τ ∈R and
λ ∈ [0, 1], that decides in favor of Hθ̂(x) where
0 Λ(x) > τ
θ̂(x) = Bern(λ) Λ(x) = τ (12.31)
1 Λ(x) < τ.
Then the boundary of the (convex hull of ) the false-alarm vs. mis-detect achievable region is
exhausted by this family. Explicitly, for any 0 ≤ p ≤ 1, the rule with parameters τ, λ satisfying
attains PFA = p while minimizing PMD among all randomized decision rules that attain this level
of PFA .
Remark 12.4. Note that the solution to (12.32) is unique in τ , and also unique in λ whenever
P0 (Λ(x) = τ ) 6= 0 (otherwise, λ is irrelevant).
Example 12.5 .
(Constant in AWGN) Consider the signal detection example
H0 : Xk = Wk (12.33)
H1 : Xk = s + Wk (12.34)
110 CHAPTER 12. HYPOTHESIS TESTING
i.i.d
{Wk }nk=1 ∼ N 0, σ 2 s, σ 2
where , and are known. Assume without loss of generality that s > 0.
Then
n
( )
1 1 X 2
p0 (x) = n exp − 2 xk (12.35)
(2πσ 2 ) 2 2σ
k=1
n
( )
1 1 X
2
p1 (x) = n exp − 2 (xk − s) . (12.36)
(2πσ 2 ) 2 2σ
k=1
p0 (x)
Λ(x) = >τ (12.37)
p1 (x)
log p0 (x) − log p1 (x) > log τ (12.38)
" n n
#
1 X 2 X 2
− 2 xk − (xk − s) > log τ (12.39)
2σ
k=1 k=1
n
" #
1 X
2
− 2 2s xk − s n > log τ (12.40)
2σ
k=1
Xn
2s xk − s2 n < −2σ 2 log τ, (12.41)
k=1
H1 s σ2
Mean[ x ] ≷ − log τ (12.42)
H0 |2 ns
{z }
,th
given by
τ =1
p0 (x) p1 (x)
0 s
p0 (x)
p1 (x)τ
0 θ̂
s
Figure 12.3
> 0. (12.51)
Z
p0 (x)
H1 : E [log Λ(X) | H1 ] = p1 (x) log dx (12.52)
p1 (x)
= −D(p1 kp0 ) (12.53)
< 0. (12.54)
So, we expect the LLR to be centered around a positive (resp. negative) number, when H0
(resp. H1 ) is the true hypothesis. This provides some insight as to why testing the LLR against
a threshold makes sense. When we have multiple i.i.d samples, this intuition becomes more
accurate since, by the WLLN, the normalized LLR of the sample converges in probability to the
expected value, i.e.,
(
1 P D(p0 kp1 ) H0 true
log Λ(X) −
→ (12.55)
n −D(p1 kp0 ) H1 true
112 CHAPTER 12. HYPOTHESIS TESTING
log(th)
log Λ(x)
−D(p1 kp0 ) 0 D(p0 kp1 )
Figure 12.4
asn → ∞. In this case, both PFA and PMD → 0 simultaneously for any suitable threshold in the
interval( −D(p1 k p0 ), D( p0 k p1 )) applied to the LLR. In this case it makes more sense to talk
about the tradeos in the exponential decay of PFA and PMD , i.e.,
1
α , lim − log PFA (12.56)
n→∞ n
1
β , lim − log PMD . (12.57)
n→∞ n
A typical tradeo between these exponents is depicted in Figure 12.5. This tradeo can be
precisely characterized using large deviation techniques (out of scope).
Before we give an example, we need the following simple claim.
Lemma 12.6. Suppose T (X) is a sucient for the two hypotheses, i.e., that the distribution of
X given T (X) does not depend on the hypothesis. Then the likelihood ratio is given by
p(T (x) | H0 )
Λ(x) = . (12.58)
p(T (x) | H1 )
Proof.
p(x | H0 )
Λ(x) = (12.59)
p(x | H1 )
p(x | H0 )p(t | x, H0 )
= (12.60)
p(x | H1 )p(t | x, H1 )
p(x, t | H0 )
= (12.61)
p(x, t | H1 )
p(t | H0 )p(x | t, H0 )
= (12.62)
p(t | H1 )p(x | t, H1 )
p(t | H0 )p(x | t)
= (12.63)
p(t | H1 )p(x | t)
p(t | H0 )
= . (12.64)
p(t | H1 )
Example 12.7 (Coin toss). Consider the hypothesis testing problem in which we view a sequence
of n i.i.d coin tosses, where the Bernoulli parameter is either p or q 6= p. Namely:
i.i.d
H0 : X ∼ Bern(p) (12.65)
12.2. BAYESIAN HYPOTHESIS TESTING 113
α β
H1 H0
log(th)
−D(p1 kp0 ) D(p0 kp1 )
Figure 12.5
i.i.d
H1 : X ∼ Bern(q). (12.66)
As the previous lemma suggests, it suces to look at the LLR of a sucient statistic, which
n
P
in this case is the Hamming weight kxkH = xk . Let us compute the LLR for x where
k=1
kxkH = m:
Pr(X = x | H0 )
log Λ(x) = log (12.67)
Pr(X = x | H1 )
Pr (kXkH = m | H0 )
= log (12.68)
Pr (kXkH = m | H1 )
!
n m n−m
m p (1 − p)
= log n m n−m
(12.69)
m q (1 − q)
= m log p + (n − m) log(1 − p) − [m log q + (n − m) log(1 − q)] (12.70)
m m hm m i
=n· log p + 1 − log(1 − p) − log q + 1 − log(1 − q) (12.71)
n n n n
m p m (1 − p) m q m (1 − q)
=n· log + 1− log − log + 1− log
n m/n n 1 − m/n n m/n n 1 − m/n
(12.72)
H1
D (Bern(m/n) k Bern(p)) ≷ th + D (Bern(m/n) k Bern(q)) . (12.74)
H0
th.
for some threshold parameter This is very intuitive, if we interpret D (Bern(m/n)kBern(p))
(resp. D (Bern(m/n)kBern(q))) as a measure of closeness of the empirical distribution to
Bern(p) (resp. Bern(q)).
What is the optimal decision rule? Clearly, the loss function here is hit-or-miss (a.k.a. 0/1 loss)
hence we expect that MAP would be optimal. Let us show this again. Suppose θ̂ is an arbitrary
decision rule. Then its success probability is
h i
Pr(θ̂(X) = θ) = E Pr(θ̂(X) = θ | X) (12.76)
This is the MAP decision rule. Let us write it in terms of the prior and the model:
H0
p(x | 0)p(0) ≷ p(x | 1)p(1) (12.82)
H1
or
H0 p(1)
Λ(x) ≷ =τ (12.83)
H1 p(0)
Hence, the optimal Bayesian decision rule is also in the Neyman-Pearson form, where the thresh-
old is set by the prior likelihood ratio. In fact, scanning over all the prior distributions will take
us along the entire Neyman-Pearson trade-o curve.
H1 : Xk = s + Wk , (12.85)
i.i.d
{Wk }nk=1 ∼ N 0, σ 2 σ2, s > 0
where , and are known. Suppose θ∼ Bern(p) for some known
p. We already saw that
n
" #
1 X
log Λ(x) = 2 ns2 − 2s xk . (12.86)
2σ
k=1
n
" #
1 X H0 p
2
ns − 2s xk ≷ log (12.87)
2σ 2 H1 1−p
k=1
12.3. RELATION TO TOTAL VARIATION DISTANCE 115
2σ 2
H1 p
2
2sMean[ x ] − s ≷ − log (12.88)
H0 n 1−p
2
H1 s σ p
Mean[ x ] ≷ − log . (12.89)
H0 2 ns 1−p
1
For p= 2 we get
H0 s
Mean[ x ] ≷ . (12.90)
H1 2
When p↓0 then th → ∞, which means we always decide in favor of H0 , as expected. Check
other limit cases for intuition.
to minimize E 1 θ̂ 6= θ . We can consider more general loss functions `(θ̂, θ) that give
dierent weights to dierent error events: and look for a decision rule that minimizes the
expected loss:
h i
θ̂ = argmin E ` θ̂(X), θ . (12.91)
θ̂(x)
You will nd this optimal decision rule explicitly (for the binary case) in a homework
exercise.
2. Multiple hypotheses: In this setup again θ ∼ p(θ), and we have M hypotheses we wish
to distinguish between:
{Hi : X ∼ p(x | i) }M
i=1 . (12.92)
It is easy to show (homework again) that for multiple hypotheses, the MAP rule is again
optimal under the 0/1 loss, i.e., it minimizes the average error probability.
Z
1
TV(p, q) = |p(x) − q(x)|dx. (12.95)
2
It is easy to check that 0 ≤ TV(p, q) ≤ 1, which is one reason for the half-factor. One important
interpretation for the total variation distance is given by relating it to the two types of error
probabilities in simple binary hypothesis testing.
116 CHAPTER 12. HYPOTHESIS TESTING
H0 H1
s
2 s
Figure 12.6
Proof. The upper bound is trivial. For the lower bound, let us pick a uniform prior on θ, which
translates to an optimal threshold of τ = 1. In this case, the average error probability of any
estimator, which is also the arithmetic mean of its false-alarm and mis-detect probabilities, is
lower bounded by that of the optimal estimator, i.e.,.
Z Z !
1 1
(PFA + PMD ) ≥ p1 (x)dx + p0 (x)dx (12.97)
2 2 x:p0 (x)>p1 (x) x:p0 (x)≤p1 (x)
Z
1
= min (p0 (x), p1 (x)) dx (12.98)
2
Z
1
= [min (p0 (x), p1 (x)) + max (p0 (x), p1 (x)) − max (p0 (x), p1 (x))] dx (12.99)
2
Z Z
1 1
= (p0 (x) + p1 (x)) dx − max (p0 (x), p1 (x)) dx (12.100)
2 2
Z
1
=1− max (p0 (x), p1 (x)) dx (12.101)
2
Summing up the (identical) terms in (12.98) and (12.101), we obtain
Z
1
PFA + PMD ≥1− [max (p0 (x), p1 (x)) − min (p0 (x), p1 (x))] dx (12.102)
2
Z
1
=1− |p0 (x) − p1 (x)|dx (12.103)
2
= 1 − TV(p0 , p1 ). (12.104)
Clearly, this performance is achieved by the optimal Bayesian rule designed for a uniform prior
(regardless of whether the setup is Bayesian on not).
Remark 12.10. In fact, Lemma 12.9 implies that the total variation distance is the largest
possible gap between the probabilities assigned to an event by both p and q, i.e.,
where the supremum is taken over all (measurable) subsets of the sample space X (try to show
this!). In the literature (12.105) is sometimes taken as the denition of the total variation
distance, and then (12.95) is proved as a consequence.
Namely, we are interested only in a binary function of the paramter θ, i.e., just to know if it
belongs to Θ0 or Θ1 .
The Neyman-Pearson result is not applicable here since we do not know which distribution
was chosen from the class. What can we do? Before we discuss some general approaches, we
note that in some very special cases we can nevertheless go back to Neyman and Pearson.
Example 12.11 .
(Constant in AWGN) Consider the following hypotheses
H0 : Xk = Wk (12.108)
H1 : Xk = s + Wk (12.109)
i.i.d
{Wk }nk=1 ∼ N 0, σ 2 i.i.d with a known σ 2 , but now the signal s > 0 is unknown. Recall
where
that, had s ≥ 0 been known, the optimal Neyman-Pearson likelihood ratio test would have been
H1
Mean[ x ] ≷ τ (12.110)
H0
for some τ. For any choice of τ, this decision rule can be implemented without knowledge of
s. Moreover, since H0 is simple, we can set τ = √σn Q−1 (PFA ) to attain any desired value of
PFA , while minimizing PMD . Thus, we get a uniformly most powerful test at any desired level of
PFA . The resulting PMD will be a function of the underlying unknown s, but knowing s would
not improve our detector. This phenomenon occurs in more a more general setting where the
likelihood-ratio is monotone in some sucient statistic.
H0 : X ∼ p(x | θ0 ) (12.111)
Suppose that
p(x | θ0 )
Λ(x) = (12.113)
p(x | θ)
is a monotone non-increasing function of some scalar sucient statistic T (x), for any θ > θ0 .
H1
Then the threshold decision rule T (x) ≷ τ is uniformly most powerful at any level of PFA (τ ).
H0
118 CHAPTER 12. HYPOTHESIS TESTING
You will prove (a generalization of ) this simple result in a homework exercise. The intuition
behind it is that the larger T (x), the more likely H1 is. Note that in AWGN example, the log-
likelihood ratio is monotonic in Mean[ x ] which is a sucient statistic. This is not a coincidence,
and happens more generally for some other exponential families.
and recall that T (x) is a sucient statistic. Consider testing θ = θ0 against θ > θ0 . The
log-likelihood ratio is
We see that if µ(θ) > µ(θ0 ) for any θ > θ0 , then the threshold test
H1
T (x) ≷ τ (12.116)
H0
n
P H1
is uniformly most powerful. For i.i.d samples, we can use the test T (xk ) ≷ τ .
k=1 H0
In general however, there is no uniformly most power test even when the null hypothesis is
simple.
Example 12.14 (Constant in AWGN, unknown polarity). consider the setup of Example 12.11,
but where now s 6= 0 is unknown (can be negative too). Here there can be no uniformly most
pwoer test, since the optimal Neyman-Pearson rule depends on the sign of s. What could be a
good decision rule nevertheless? One reasonable idea is to use
H1
(Mean[ x ])2 ≷ τ (12.117)
H0
This seems a very reasonable solution in this case, and would yield very good performance. But
what could be an approach that would lead us to such nice solutions in the more general case?
Here are two common approaches, with no general claim for any optimality (there are many
other methods):
• Bayesian reduction: Set two distributions p0 (θ), p1 (θ) over each hypotheses class. Then
marginalize the dependence on the specic choice of the parameter inside the class:
Z
p0 (x) = p(x | θ)p0 (θ)dθ (12.118)
θ⊆Θ0
Z
p1 (x) = p(x | θ)p1 (θ)dθ, (12.119)
θ⊆Θ1
and we are now back at the simple hypothesis testing setup, where we can use Neyman-
Pearson. Of course, it is not necessarily clear which priors to use.
• Generalized likelihood ratio test (GLRT): The likelihood ratio cannot be computed
due to the lack of knowledge regarding the underlying value of the parameter under each
hypothesis. But under each hypothesis, one could reasonably set the parameter to be the
12.4. COMPOSITE HYPOTHESIS TESTING 119
H0 H1 H1 H0
H0 H1 H1 H0
Vs.
th A A th
Figure 12.7
ML estimator of θ, and then use that estimator as the true value of the parameter in a
Neyman-Pearson decision rule. Namely, compute
sup p(x | θ)
θ⊆Θ0 p(x | θ̂0 )
ΛG (x) , = , (12.120)
sup p(x | θ) p(x | θ̂1 )
θ⊆Θ1
where θ̂i is the ML estimator for θ under Hi . Then use a test of the form
H0
ΛG (x) ≷ τ (12.121)
H1
for a suitable τ (that will determine the trade-o between PFA and PMD ).
Example 12.15 (Constant in AWGN, unknown polarity) Let us return to Example 12.14..
According to GLRT approach, under H1 (which is the only composite hypothesis here) we assume
that the parameter value is ŝML = Mean[ x ]. Thus,
n
1
x2k
P
−
2σ 2
e k=1
log ΛG (x) = log n
(12.122)
1 P 2
− (xk −Mean[ x ])
2σ 2
e k=1
n
1 n(Mean[ x ])2 − 2Mean[ x ]
X
= 2 xk (12.123)
2σ
k=1
| {z }
nMean[ x ]
n
= − 2 (Mean[ x ])2 (12.124)
2σ
Hence, the GLRT takes the form of
H1
(Mean[ x ])2 ≷ τ (12.125)
H0
which is exactly what we have intuitively guessed as a good solution to the polarity problem.
In a homework exercise, you will verify the Bayes reduction with a Gaussian prior, as well as
another approach called the Wald test, yield the same decision rule.
120 CHAPTER 12. HYPOTHESIS TESTING
13. Minimax Estimation
Let X ∼ p(x | θ) be drawn from a parametric family, and let `(θ̂, θ) be some nonnegative loss
(a.k.a. risk) function, (e.g., quadratic (θ̂ −θ)2 ). Recall the denition of the expected loss attained
by an estimator θ̂(x) for a given value of θ :
h i
Lθ (θ̂) , EX∼p(x | θ) ` θ̂(X, θ) . (13.1)
We have already seen that one cannot generally nd an estimator θ̂ that minimizes the expected
loss for any θ. We have seen two approaches to handle this problem: One was to restrict our
attention to unbiased estimators (and quadratic loss), then derive lower bounds on the expected
loss and look for an MVU estimator; however, an MVU does not always exist, and biased
estimators sometimes perform uniformly better than unbiased ones. The other was the Bayesian
approach, where we impose a prior on the parameter and then nd the estimator that minimizes
the expected loss; it is however not necessarily clear how to chose the prior, and whether we are
indeed interested in the average performance over the choice of parameter.
We now discuss a third approach, which does not require any assumptions on the estimators
or the parameter: The idea is to rank the estimators according to their worst-case loss over the
parameter space, and then seek the estimator whose worst-case loss is minimal. This is referred
to as the minimax (MM) estimator. Explicitly:
Note that min and max should be replaced with inf and sup as necessary, throughput this chapter.
Note that we are implicitly assuming that the minimum is achieved by some estimator; if this is
not the case we can look at a sequence of estimators approaching the inmum.
It is generally dicult to nd the minimax estimator analytically, but is sometimes less
dicult to nd upper and lower bounds on the minimax loss. To get upper bounds, we can
evaluate the worst case expected loss attained by any estimator. Namely, we can pick some
estimator θ̃ that we like (which is, say, reasonable / easy to handle) and evaluate its worst case
expected loss, which will serve as an upper bound on the loss of the minimax estimator, i.e.,
Obtaining lower bounds on the minimax loss is trickier. Before we proceed, we need the following
well-known inequality.
121
122 CHAPTER 13. MINIMAX ESTIMATION
The claim follows by maximizing the right-hand-side lower bound over the choice of y0 ∈ Y .
Now using the minimax inequality, we can obtain the following lower bound:
This follows since we can always choose θ̂ = θ on the max-min side, and yields a trivial lower
bound. Nevertheless, there is a way to get a nontrivial lower bound via the minimax inequality,
using a Bayesian trick. To that end, we need some denitions. Consider a prior Π(θ) over the
parameter θ and let
h i
LΠ (θ̂) , Eθ∼Π Lθ (θ̂) (13.7)
be the expected loss attained by an estimator θ̂. Let θ̂Π (x) be the associated Bayes estimator:
We will refer to the expected loss LΠ (θ̂Π ) attained by the Bayes estimator as the Bayes loss
(sometimes referred to in the literature as the Bayes risk).
Lemma 13.2. The minimax loss is lower bounded by the Bayes loss under any prior. Namely,
for any prior Π on the parameters:
Proof. The idea is to replace the maximization over individual θ by a maximization over distri-
butions on θ. Then, after switching the min and the max, it is not possible to trivially minimize
the loss anymore (this approach is called mixed strategies in game theory).
(13.11) follows since max upper-bounds average, and we can always choose Π to be a deterministic
distribution attaining the max. (13.12) follows from the minimax inequality, (13.13) is by
denition, and (13.14) follows by plugging in any specic prior Π.
A prior Π∗ is called least favorable if it maximizes the Bayes loss among all priors, i.e.,
Corollary 13.3 (best Bayesian lower bound) . The minimax loss is lower-bounded by the Bayes
loss of the least favorable prior.
In some cases the least favorable Bayes lower bound is tight; this happens when the minimax
inequality holds with equality, which generally happens under suitable convexity conditions de-
termined by the well-known Minimax Theorem (out of scope). However, there are simpler cases
where this happens, which we discuss next.
Lemma 13.4. Let Π be a prior on θ, with θ̂Π and Lθ (θ̂Π ) its corresponding Bayes estimator and
Bayes loss. If
for any θ ∈ Θ, then Π is least favorable and θ̂Π is the minimax estimator.
Proof. Let us rst show that θ̂Π is minimax. To that end, let θ̂ be any other estimator. We now
show that the worst case loss attained by θ̂Π is at least as large as the one attained by θ̂:
where (13.17) holds since expected value is upper bounded by the supremum, (13.18) holds since
θ̂Π is by denition the Bayes estimator and (13.19) holds under the assumption in the theorem
(since the inequality holds for any θ). Note that in fact, the last inequality must hold with
equality.
We now show that Π is also least favorable. Let Π̃ be any other prior. Then
where (13.20) holds by denition, (13.21) holds since expected value is upper bounded by the
supremum, and (13.22) again holds by assumption.
Corollary 13.5. Suppose θ̂Π has constant expected loss, i.e., Lθ (θ̂Π ) does not depend on θ. Then
Π is least favorable and θ̂Π is the minimax estimator.
We note Lemma 13.4 is slightly stronger than Corollary 13.5, as it may allow us to establish
the minimax estimator when the least favorable prior does not have full support (e.g., is discrete).
124 CHAPTER 13. MINIMAX ESTIMATION
Example 13.6 .
(Coin toss) Consider the observations
i.i.d
X ∼ Bern(θ). What is the minimax
estimator under the quadratic loss `(θ̂, θ) = (θ̂ − θ)2 ? We have seen that the Beta distribution
is a conjugate prior in this case, hence it makes sense to try and see if it is less favorable for
some choice of hyper-parametres. Hence, let Π ∼ Beta(a, b) and recall that the Bayes estimator
in this case is
n
P
xk + a
k=1
θ̂Π = . (13.24)
n+a+b
Let us try to nd a, b such that the expected loss is constant.
n
P 2
Xk + a
k=1
Lθ (θ̂Π ) = E − θ (13.25)
n + a + b
!2
n
= (n + a + b)−2 ·E
X
xk + a − nθ − aθ − bθ (13.26)
| {z }
k=1
,c
!2
n
X
= c · E (Xk − θ) + (a − θ(a + b)) (13.27)
k=1
n √
xk + 2n
P
Lemma 13.7. Let θ̂ be some estimator, and suppose that exists a sequence of priors {Πm }∞ m=1
whose Bayes losses are asymptotically upper bounded by the worst-case expected loss of θ̂ , i.e.,
The proof is similar to that of Lemma 13.4, and is left as an exercise to the reader.
125
Example 13.8 . W ∼ N 0, σ 2 ,
(Constant in AWGN) Consider the model X = θ+W where
where σ2 is known (a similar claim works for n i.i.d. samples). Let Πm ∼ N (0, m). The
associated Bayes loss is
" 2 #
m
LΠm θ̂Πm = E X −θ (13.34)
m + σ2
" 2 #
m
=E (θ + W ) − θ (13.35)
m + σ2
" 2 #
m σ2
=E W − θ (13.36)
m + σ2 m + σ2
m2 σ 2 mσ 4
= + (13.37)
(m + σ 2 )2 (m + σ 2 )2
mσ 2 (m + σ 2 )
= (13.38)
(m + σ 2 )2
mσ 2
= (13.39)
m + σ2
σ2
= 2 . (13.40)
1 + σm
h i
Lθ (θ̂) = E (X − θ)2 (13.42)
= E W2
(13.43)
2
=σ . (13.44)
Hence, the estimator θ̂(x) = x is minimax. Similarly, one can show that the sample mean
θ̂(x) = Mean[ x ] is the minimax estimator for the i.i.d. version of this problem.