0% found this document useful (0 votes)
13 views126 pages

Estimation Theory Lecture Notes

The document discusses estimation theory, focusing on the estimation problem, problem setup, and various types of estimators and performance criteria. It outlines the importance of understanding the physical process generating observations and introduces concepts such as unbiased estimators, loss functions, and optimality in estimation. The content is structured into multiple sections covering topics like minimum variance unbiased estimation, linear models, sufficient statistics, and hypothesis testing.

Uploaded by

Orel Mazon
Copyright
© All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
13 views126 pages

Estimation Theory Lecture Notes

The document discusses estimation theory, focusing on the estimation problem, problem setup, and various types of estimators and performance criteria. It outlines the importance of understanding the physical process generating observations and introduces concepts such as unbiased estimators, loss functions, and optimality in estimation. The content is structured into multiple sections covering topics like minimum variance unbiased estimation, linear models, sufficient statistics, and hypothesis testing.

Uploaded by

Orel Mazon
Copyright
© All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd

Estimation Theory

Ofer Shayevitz
Tel Aviv University

Last revision: June 4, 2019

Scribed by Mor Oren Loberman


Contents

1 Introduction 3
1.1 The Estimation Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3

1.2 The Problem Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4

1.3 An Introductory Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7

2 Notation 13

3 Minimum Variance Unbiased Estimation 15


3.1 Unbiased Estimators . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15

3.2 Score and Fisher Information . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20

3.3 Cramer-Rao Lower Bound (CRLB) . . . . . . . . . . . . . . . . . . . . . . . . . . 24

3.4 CRLB and Maximum Likelihood . . . . . . . . . . . . . . . . . . . . . . . . . . . 28

3.5 More on the Fisher Information . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29

3.6 CRLB for General Signals in AWGN . . . . . . . . . . . . . . . . . . . . . . . . . 33

3.7 Transformation of Parameters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34

4 Vector MVU Estimation 37


4.1 Vector CRLB . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38

4.2 Transformation of Parameters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42

4.3 CRLB for general Gaussian Signals . . . . . . . . . . . . . . . . . . . . . . . . . . 44

5 Linear Models 45
5.1 Linear Models in AWGN . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46

5.2 Least Square Interpretation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51

5.3 Linear Model for Correlated Additive Gaussian Noise . . . . . . . . . . . . . . . . 54

5.4 Best Linear Unbiased Estimator (BLUE) . . . . . . . . . . . . . . . . . . . . . . . 55

6 Sucient Statistics 59
6.1 Sucient Statistics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60

6.2 Fisher-Neyman Factorization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61

6.3 Rao-Blackwell Procedure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63

6.4 Complete Sucient Statistics and the Lehmann-Scheé Theorem . . . . . . . . . 65

7 Exponential Families 69
7.1 Motivation and Denition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69

7.2 Basic Properties . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70

7.3 Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72

1
2 CONTENTS

8 The ML Estimator 79
8.1 Preliminaries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
8.2 ML Asymptotic Optimality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81

9 Condence Intervals 85
10 Regularization and the Bayesian Approach 91
10.1 Regularization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
10.2 Regularization in the Linear Model . . . . . . . . . . . . . . . . . . . . . . . . . . 92
10.3 Bayesian Interpretation of Regularization . . . . . . . . . . . . . . . . . . . . . . 96
10.4 Penelty Function or Prior That Encourages Sparsity . . . . . . . . . . . . . . . . 99

11 Conjugate Priors 101


12 Hypothesis Testing 105
12.1 The Neyman-Pearson Setting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
12.2 Bayesian Hypothesis testing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
12.3 Relation to Total Variation Distance . . . . . . . . . . . . . . . . . . . . . . . . . 115
12.4 Composite Hypothesis Testing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117

13 Minimax Estimation 121


1. Introduction

1.1 The Estimation Problem


In a general estimation problem, we observe samples of the world, and our high-level goal is to
learn something meaningful about the mechanism generating the observations.

Physical Process
Observations x∈X
(Nature)

There are several distinct tasks we may be interested in:

1. Inference: Here we are interested in recovering some (typically continuous) parameter of


the physical process, with good accuracy. For example, the parameter could be the time
of arrival of a certain transmitted signal (e.g., in GPS):

X(t) = s(t − τ ) + W (τ ), (1.1)

where s(t) is a known transmitted signal, τ is an unknown delay parameter that we want
to estimate, and W (t) is some additive noise process.

2. Detection: Here there is a nite number of possible hypotheses about the statistical
behavior of the world (and hence of our observations), and we would like to determine
which hypothesis is true. This can be though of as a special case of inference, where the
parameter of interest is discrete. For example: target detection/communication:


s n + Wn target exists (rst hypothesis)
Xn = (1.2)
Wn no target (second hypothesis)

= sn · 1(target exists) + Wn . (1.3)

noise
} |{z}
parameter
| {z

where sn is a known signal sequence and Wn is some additive noise process. There are two
hypotheses, which correspond to a binary parameter.

3. Prediction: (X, Y ) and we would like to understand how Y is related


The observations are
to X. For example, X t0 , say the ratio between the
can be some data on a stock at time
number bids and the number of asks, and Y can be the stock price dierence between time

3
4 CHAPTER 1. INTRODUCTION

t0 and timet1 > t0 , i.e., Y = price(t1 ) − price(t0 ). We would like to nd the conditional
distribution pY |X (y|x) from a set of past samples we have, so that we can predict new
values of Y from new values of X . For example, it might be reasonable to model the
possible relations between X and Y by, say,

Y = θX + Z (1.4)

where θ is some unknown parameter, and Z is some r.v. representing noise or uncer-
tainty. Plotting our past observations in the x−y plane, we can try to t a straight line to
the noisy data (this is called linear regression). Here, the parameter itself is not necessarily
what we're interested in; we want to learn the joint probability distribution itself (which is
a function of the parameter of course) so that we can use it for prediction.

y=∆ price(t0 → t1 )

# bids(t0 )
x= # asks(t0 )

1.2 The Problem Setup


1.2.1 Model  Parametric Family

Physical

Process x∈X
pX|Θ (x | θ)
θ∈Θ

First, we need to provide a model of the physical process producing our samples. In this course,
we will assume that our observations are drawn from a parametric family

{p(x | θ)}θ∈Θ (1.5)

of conditional probability distributions (can be probability mass functions ([Link]) or probability


density functions ([Link])), where Θ is the parameter space (discrete, R, Rd ..). Namely, we assume
that we observe

X ∼ p(x | θ) (1.6)

for some unknown parameter θ ∈ Θ. How is the parametric family chosen? It typically comes
from the understanding of the specic setup we are dealing with. The parametric family shouldn't
be too large, to avoid over-tting our data, nor should it be too small, in which case we may not
have a good description of the data within the family. Here, we will assume the model is given
to us.
1.2. THE PROBLEM SETUP 5

Remark 1.1. Throughout this course, we will assume that the data is indeed generated by
p(x | θ) for some θ∈Θ in the parametric family. We will make no guarantees otherwise.

1.2.2 Estimators
An estimator is a function

θ̂ : X → Θ (1.7)

from the observation space to the parameter space. This estimator is sometimes called a point
estimator, as it returns a single parameter as its decision. We will mostly limit our discussion
to point estimators in this course, but will also mention other estimators that return subsets of
the parameter space (which we will call condence intervals). Note that an estimator θ̂ cannot
depend on θ itself, as it is an unknown parameter. However, the performance of an estimator
will depend on the true underlying θ.

1.2.3 Performance Criteria and Optimality


In order to compare dierent estimators and to possibly nd an optimal estimator, we need
to choose some performance criterion. To that end, we dene a loss function (also called cost
function, or risk)

l : Θ × Θ → R+ , (1.8)

which assigns a nonnegative number `(θ̂, θ) to the event where we estimate the parameter to
be θ̂ when the true parameter is in fact θ (typically, we will have `(θ, θ) = 0, but this is not
essential). The larger this number, the less happy we are with our estimation. Of course since
our observations are random, and hence our estimator is also random, the loss will be a random
variable `(θ̂(X), θ). We will therefore be concerned with the expected loss
h i
Lθ (θ̂) = Ep(x | θ) l(θ̂(X), θ) . (1.9)

Note that the expectation is taken w.r.t. the distribution p(x | θ) induced by the true (unknown)
underlying parameter. Note also that the expected loss Lθ (θ̂) will generally depend on the
underlying θ.
Here are the two most common examples of loss functions:

• 0-1 loss:
 
l(θ̂, θ) = 1 θ̂ 6= θ (1.10)

This loss is also called hit-or-miss, and is adequate only for a discrete parameter space Θ.
The associated expected loss is the error probability of our estimator:
 
Lθ (θ̂) = Pr θ̂(X) 6= θ . (1.11)

• Quadratic loss:

l(θ̂, θ) = kθ̂ − θk22 (1.12)

Here we assume that θ ∈ Rd is a continuous scalar/vector in Euclidean space. The associ-


ated expected loss is thus the mean squared error of the estimator:

Lθ (θ̂) = MSEθ (θ̂) (1.13)


6 CHAPTER 1. INTRODUCTION

h i
= E kθ̂(X) − θk22 (1.14)

The quadratic loss will play a central role in this course.

Now that we have a performance criterion, how do we nd the best estimator? A natural
rst guess is to seek the estimator that minimizes the expected loss:

θ̂ = argmin Lθ (θ̂). (1.15)


θ̂

The problem is that this estimator will generally depend on θ, and hence will not be realizable.
For example, suppose θ ∈ R qith a quadratic loss function. Then the above estimator will simply
return θ̂ = θ itself, which yields zero expected loss, but of course cannot be realized. So, we
cannot hope to nd a single estimator that will beat all other estimators regardless of the value
of θ. For example, the estimator θ̂(X) = 5 is clearly the best possible when θ = 5, but is very
bad e.g. when θ = 100.
How do we proceed then? There are several possible ways to circumvent this problem.

1. Restrict the family of estimators:


θ̂F = argmin Lθ (θ̂). (1.16)
θ̂∈F

Can we choose F such that θ̂F does not depend on θ, and F is rich enough? We will
discuss this in the h i Specically, we will consider the family of unbiased estimators,
sequel.

i.e., where Ep(x | θ) θ̂(X) = θ for any θ. Note that this choice eliminates the constant

estimators, since e.g. θ̂ = 5, which is optimal in case θ = 5, is clearly biased when θ 6= 5.


Why do we like unbiased estimators?

• Naturally appealing.

• Law of large numbers (LLN): Given N independent unbiased estimators, we have that

N
1 X
θ̂j → θ with probability 1. (1.17)
N
j=1

• We will be able to nd a non-trivial lower bound for this family, that will be sometimes
achievable.

However, we will see that unbiased estimators do not always exist.

2. The Bayesian approach:


Assume that some probability distribution (typically called a prior) over the parameter
space, i.e., assume that θ ∼ p(θ) for some p(θ). Now, we can eliminate the dependence of
the expected loss on the underlying parameter, by taking another expectation w.r.t. the
prior p(θ). Now we can seek the estimator that minimizes this expected loss, i.e.,
h i
θ̂ = argmin Ep(θ) Lθ (θ̂) . (1.18)
θ̂

This estimator is realizable, as it does not depend on the actual θ, and we can still look at
its performance in terms of θ. Note that for the 0-1 loss, the optimal Bayesian estimator is
what you may know of as the Maximum A-Posteriori (MAP) decision rule, which simply
1.3. AN INTRODUCTORY EXAMPLE 7

chooses the parameter value that maximizes the posterior distribution of the observations.
For quadratic loss, this optimal Bayesian estimator is what you may know as the Minimum
Mean Squared Error (MMSE) estimator, which returns the posterior expectation of the
parameter given the observations.

One problem with Bayesian estimation is that it may be unclear what distribution p(θ) we
should choose. Furthermore, even if a reasonable choice exists, are we really interested in
the average performance with respect to p(θ), or with the performance for each and every
θ? We will later see that a prior p(θ) can arise naturally from a process called regularization
of the parameters, which is sometimes employed in order to control over-tting.

3. The Minimax approach:


Here, in order to eliminate the dependence of the expected loss on the parameter, we seek
the estimator that minimizes the worst-case expected loss:

θ̂ = argmin max Lθ (θ̂). (1.19)


θ∈Θ
θ̂

We will later see that interestingly, in some cases the minimax approach can be interpreted
as taking the worst case prior in the Bayesian setting, in which case the minimax estimator
can be more easily computed. In general, it may be dicult to nd.

In this course we will be generally interested in the fundamental limits of estimation, i.e., in
what can be said about the best we can do in the settings above. For instance, we would like to
get lower bounds of the form:

Lθ (θ̂) ≥ L.B.(θ). (1.20)

That hold for all θ̂ under consideration. We would like these bounds to be non-trivial, and
hopefully achievable (or almost so).
Other things we will briey touch upon include

• Data issues:

 Overtting: Too many parameters and not enough data. How do we handle this? Pos-
sible solutions include Bayesian, regularization, Minimum Description Length (MDL).

 Robustness: Is our estimator robust to inaccuracies in the model, or outliers? How


can we robustify it? And what is the cost?

• Algorithmic issues: Is there some turn-the crank ecient way to come of with good
estimators in a general setting. We will see that to some extent there is, at least in limited
cases or in the limit of large data sets.

1.3 An Introductory Example


Consider the case where our observations are a (column) vector of n binary random variables
(r.v.s) X = (X1 , . . . , Xn )T ∈ {0, 1}n , perhaps corresponding to some binary events, e.g., stock
market (up/down), signal level (high/low) etc. We want to explain how X was generated,
e.g. so that perhaps we can predict it in the future. What is a reasonable model? This strongly
depends on the story behind these observations.
8 CHAPTER 1. INTRODUCTION

i.i.d.
Let us consider the simplest setting, where X ∼ Bern(θ) where θ ∈ [0, 1] is an unknown
parameter. Namely, our parameter space is the unit interval Θ = [0, 1], and for any x ∈ {0, 1}n
n
Y
p(x | θ) = p(xk | θ) (1.21)
k=1
kxkH
=θ (1 − θ)n−kxkH , (1.22)

n
P
where kxkH = xk is the Hamming weight (norm) of x. Given a sample vector X, we would
k=1
like to estimate the underlying θ. To measure our success, we need to rst decide on a loss
function. Let us choose the quadratic loss:

l(θ, θ̂) = (θ̂ − θ)2 . (1.23)

What would be a good estimator in the quadratic loss sense? A natural suggestion: the sample
mean!

n
1X
θ̂(X) = Xk (1.24)
n
k=1

Note that this estimator is unbiased:

h i 1X n
E θ̂(X) = E [Xk ] = E [X1 ] = θ. (1.25)
n
k=1

What is the expected loss (MSE)? Let us start with n = 1:


h i2 h h ii2
E θ̂ − θ = E θ̂ − E θ̂ (1.26)
h i
= Var θ̂ (1.27)

= Var [X1 ] (1.28)

= θ − θ2 (1.29)

= θ(1 − θ). (1.30)

For a general n we get that

h i2 h i
E θ̂ − θ = Var θ̂ (1.31)
" n #
1X
= Var Xk (1.32)
n
k=1
1
= · nVar [X1 ] (1.33)
n2
θ(1 − θ)
= . (1.34)
n
We will later show that this is the best possible MSE that can be attained by an unbiased
estimator, uniformly in θ.
Arriving at the sample average estimator θ̂ suggested above was based on intuition. Is there
an underlying principle from which this estimator can be derived, and thereby from which other
optimal unbiased estimators can be obtained in other settings? A naturally appealing idea is to
1.3. AN INTRODUCTORY EXAMPLE 9

choose θ̂ as the parameter whose associated distribution assigns the highest probability to the
observations. This is called the Maximum Likelihood (ML) estimator :

θ̂ML (x) = argmax p(x | θ) (1.35)


θ

Specically, writing d = kxkH , we have that

p(x | θ) = θd (1 − θ)n−d . (1.36)

Taking the derivative w.r.t. θ, we get


p(x | θ) = dθd−1 (1 − θ)n−d − (n − d)θd (1 − θ)n−d−1 , (1.37)
∂θ
Equating to zero, dividing by θd−1 (1 − θ)d , and solving for θ, we obtain

d(1 − θ) = (n − d)θ (1.38)


n
d kxkH 1 X
θ̂ML = = = xk . (1.39)
n n n
k=1

We have implicitly assumed that the optimizing θ 6= 0, 1 in the derivation above. It is easy to
see that this will always be the case unless xn is the all-zeros (resp. all-ones) sequence, in which
case θ̂ML = 0 (resp. 1), which is also in agreement with the above. We will later see that the
ML approach does not always work so perfectly, but we will also show that it is generally rather
good.
Note that ML can be thought of as putting a uniform prior θ ∼ [0, 1], and then nding the
parameter value that maximizes the posterior distribution (the so-called MAP rule):

p(x | θ)p(θ)
p(θ | x) = (1.40)
p(x)
p(x | θ)
= , (1.41)
p(x)

where (1.40) follows from Bayes rule and (1.41) uses the uniform prior. Therefore in this case
we get:

θ̂MAP = argmax p(θ | x) (1.42)


θ
= argmax p(x | θ) (1.43)
θ
= θ̂ML . (1.44)

But if we believe the uniform prior, why use the MAP rule (which is optimal for the 0-1 loss)
and not perform MMSE estimation, which is optimal for the quadratic loss?
h i
θ̂MMSE = argmin Ep(θ)p(x|θ) (θ̂(X) − θ)2 (1.45)
θ̂
= E [θ | X] . (1.46)

For a uniform prior it can be shown that θ | X = x ∼ Beta(α, β) where,

n
X
α= xk + 1 (1.47)
k=1
10 CHAPTER 1. INTRODUCTION

n
X
β= (1 − xk ) + 1 = n − α + 2. (1.48)
k=1

α
Moreover, the expectation of a Beta(α, β) distribution is simply
α+β , hence

θ̂MMSE = E [θ | X] (1.49)
α
= (1.50)
α+β
Pn
Xk + 1
k=1
= . (1.51)
n+2
We note that

h i n·θ+1
E θ̂MMSE | θ = (1.52)
n+1
1 − 2θ
=θ+ , (1.53)
n+2
| {z }
bias toward 12

hence the MMSE estimator gains (on average w.r.t. the uniform prior) over the (unbiased)
sample mean estimator by introducing a small bias toward 1/2. The reason should be intuitively
clear: Under the uniform prior, and before we observe any data, the MMSE estimator guesses
that θ is 1/2. Samples that are now obtained attempt to convince us that the true value of
the parameter is the sample mean, and we need to properly weigh our prior belief with the data
we observe. The more samples se get, the less the prior belief (which favors 1/2) will aect our
guess.
We will later see that even in a non-Bayesian setting, it is sometimes helpful, (uniformly in
θ) to introduce a small bias.

1.3.1 Condence Intervals


In some cases we prefer not to commit to a point estimator θ̂, but rather give a range of values in
which we believe θ lies, with some given probability (independent of θ). For example, consider the
n
1 P
case of Bern(θ) i.i.d samples discussed above., and recall our sample mean estimator θ̂ = Xk .
n
k=1
According to central limit theorem (CLT) we

√  
n θ̂ − θ → N (0, θ(1 − θ)) (1.54)

in distribution. In fact, a stronger version of the CLT (called the Berry-Esseen CLT) bounds the
1
deviation of the tail function in this case by √ , i.e., for any
n
ε > 0,

√ !  
n|θ̂ − θ| 1
Pr p > ε ≤ 2 Q(ε) + √ , (1.55)
θ(1 − θ) n

where Q(·) is the Gaussian Q-function. Sinceθ(1 − θ) ≤ 14 we have that


   
ε 1
Pr |θ̂ − θ| > √ ≤ 2 Q(ε) + √ , (1.56)
2 n n
1.3. AN INTRODUCTORY EXAMPLE 11

Hence, we conclude that regardless of the value of θ, we know that


 
ε ε
θ ∈ θ̂ − √ , θ̂ + √ (1.57)
2 n 2 n
 
1 1 −a2 /2
with probability at least 1 − 2 Q(ε) + √ . Note that since Q(a) ≤ e , then the above
n 2

q 
log n
approximation is useful for  ≈ log n, yielding a condence interval of size O with
n
 
1
probability 1 − O √ .
n

1.3.2 Doubting the Model


Suppose we get

X = 101010101010 · · · (1.58)

1
Our sample mean estimator would return θ̂ = 2 . If we try to predict based on this, it seems
that perhaps we will be wrong half the time (if this behavior indeed continues). Any reasonable
person would have guessed that the next bit is ipped, so why did this happen? The reason of
course is that we have limited our model, i.e. the process that is assumed to have generated our
observations, to be i.i.d. and hence in particular memoryless. What can we do? If we believe
there is some dependence between consecutive time instances, we may try to assume a rst order
Markov model, where say X1 ∼ Bern(1/2), and the transition from Xk to Xk+1 follows

1 − θ0 1 − θ1

θ1

0 1

θ0

Now, we have a vector of parameters:

θ = (θ0 , θ1 )T ∈ [0, 1]2 . (1.59)

In this case, it is easy to check that the ML estimator for the alternating sequence above yields

θ̂ML = (1, 1)T , (1.60)

since under this choice of parameters, the observed sequences has probability 1/2, which is the
highest possible.
But why stop at rst order? If we increase the order of the model then at some point, our
model can become as complicated as the observations themselves. In such a case we will do a
great job describing what we have seen, but typically this will not generalize to predict future
observations. This situation is commonly called over-tting". Intuitively, a good model should
strike a reasonable balance between the eective number of parameters and the associated loss.
We will not explicitly discuss model selection in this course, but we will talk about ways to
counter over-tting eects via a simple process called regularization.
12 CHAPTER 1. INTRODUCTION
2. Notation
Let X be our sample space, associated with a sigma-algebra F. In general, a parametric family
considered in these notes is a collection of probability distributions µθ indexed by θ over F, that
are all absolutely continuous w.r.t. a single measure µ over F. We denote the Radon-Nikodym
dµθ
derivative of µθ (·) w.r.t. µ. by p(x|θ) , dµ (x). Most of the results hold under this general
framework. However, in this exposition we will usually assume that either

• X is discrete , F = 2X , and µ(A) = |A| is the counting measure. In this case, p(x | θ) are
all simply probability mass functions (p.m.f.s) over X.

• X = Rn , F is the Borel sigma-algebra of the open sets induced by the Euclidean metric,
and µ is the Lebesgue measure. In this case, p(x | θ) are all simply probability density
n
functions (p.d.f.s) over R .

We will usually denote the samples by x (also called observations), and sometimes refer to
p(x | θ) as the likelihood function. Generally, x can be a scalar or a vector unless otherwise
restricted, and will be assumed to take values in the sample space X, unless otherwise stated.
We use boldface x to emphasize we mean a vector, in which case we usually mean that x =
(x1 , x2 , . . . , xn )T is an n-dimensional column vector. For the most part, capital letters X (resp.
X ) are used for random variables (resp. random vectors), and the corresponding small letters x
(resp. x) will be used for specic values. One exception is the parameter θ (or parameter vector
θ ), which is always written in small letters (even in the Bayesian case). A vector of parameters
θ = (θ1 , . . . , θd )T is usually taken to be a d-dimensional column vector.
Probabilities, expectations, and variances are denoted by the operators Pr(·), E [·] and Var[ · ]
respectively, and are typically be taken w.r.t to p(x | θ), excluding the Bayesian setting or unless
otherwise stated. This specically means that the expected values, variances, MSEs, covariance
matrices, etc., will typically depend on θ. We sometimes specically emphasize this by writing
θ as a subscript, e.g., MSEθ , but in many cases this dependence is omitted. The interior of a
subset S ⊆ Rd , denoted here by interior(S), is the set of all points s ∈ S for which there exists
a Euclidean ball B of some radius ε > 0 centered at s, such that B ⊆ S .

13
14 CHAPTER 2. NOTATION
3. Minimum Variance Unbiased Estima-
tion
In this chapter, we discuss estimation problems of a real-valued scalar parameter under the
quadratic loss, i.e., where our parametric family is {p(x | θ)}θ∈Θ with Θ ⊆ R, and `(θ̂, θ) =
(θ̂ − θ)2 . Note that x can be a scalar or a vector, and can be discrete or continuous. We focus
on the performance of a restricted set of estimators  unbiased estimators.

3.1 Unbiased Estimators


Recall that the expected quadratic loss of an estimator θ̂, or its MSE, is given by

h i
MSEθ (θ̂) , E (θ̂(X) − θ)2 . (3.1)

where the expectation is taken w.r.t p(x | θ). We say that an estimator θ̂1 dominates another
estimator θ̂2 (in the MSE sense), if

MSEθ (θ̂1 ) ≤ MSEθ (θ̂2 ) (3.2)

for any θ ∈ Θ (a similar denition holds for any other loss function). We have already seen
that there is no single estimator that dominates all other estimators, for the simple reason that
the constant estimators cannot be dominated. Moreover, there is no nontrivial lower bound
on the MSE attained by a general estimator; the best we can say is that MSEθ (θ̂) ≥ 0, where
again the constant estimators are to blame. We would therefore like to restrict our attention to
some reasonable sub-family of estimators for which we can provide nontrivial lower bounds, and
perhaps even nd an optimal estimator, i.e, once that dominates all the other estimators in the
family.
Let us rst break up the expression for the MSE of an estimator θ̂ as follows:

   2
MSEθ (θ̂) = E θ̂ − E[θ̂] + E[θ̂] − θ (3.3)

2 h i 
= Var[ θ̂ ] + E[θ̂] − θ + 2 E θ̂ − E[θ̂] E[θ̂] − θ (3.4)
| {z } | {z }
,Bias[ θ̂ ] =0
 2
= Var[ θ̂ ] + Bias[ θ̂ ] , (3.5)

where the bias function of an estimator has been dened above. We see that the MSE of an
estimator is composed of two parts: The variance of the estimator, and the square of its bias.
The constant estimators θ̂ = c have zero variance, but their bias function Bias[ θ̂ ] = (c − θ)2 is
unbounded. Getting them and their likes out of the way in the strongest sense possible, we will

15
16 CHAPTER 3. MINIMUM VARIANCE UNBIASED ESTIMATION

restrict our attention in this chapter to the opposite (but much more interesting) extreme, and
consider unbiased estimators. Explicitly, an estimator θ̂ : X → R is called unbiased if Bias[ θ̂ ] = 0
identically, i.e., if
h i
E θ̂(X) = θ (3.6)

for any θ ∈ Θ. Note that here we have allowed the estimator to return any real value, not
necessarily in Θ. We will later see that if we restrict the estimator to be θ̂ : X → Θ, and in case
Θ 6= R, then it may be the case that no unbiased estimator exists.

Example 3.1 .
(Constant in Additive Noise) Suppose that we observe a sequence

Xk = θ + Wk , k = 1, . . . , n (3.7)

where {Wk }nk=1 is an i.i.d sequence of real-valued [Link] with E [Wk ] = 0 and Var [Wk ] = σ 2 (and σ2
is either known, or unknown but not to be estimated). We would like to estimate the unknown
parameter θ ∈ R. Thinking of our observations as a random vector X ∈ Rn , the parametric
family under consideration is

n
Y
p(x | θ) = p(xk | θ) (3.8)
k=1
Yn
= pW (xk − θ) (3.9)
k=1

where θ ∈ Θ = R. A natural choice for an estimator is the sample mean:

n
1X
θ̂(x) = xk (3.10)
n
k=1
, Mean[ x ]. (3.11)

Let us nd the expected value of this estimator as a function of the underlying parameter θ:
h i 1X n
E θ̂(X) = E [Xk ] (3.12)
n
k=1
n
1X
= θ (3.13)
n
k=1
= θ. (3.14)

This holds for any θ, hence the suggested estimator is unbiased. Let us compute the expected
quadratic loss incurred by this estimator, i.e., its MSE, as a function of the underlying parameter
θ.

MSEθ [ θ̂ ] = Var[ θ̂ ] (3.15)


" n #
1X
= Var Xk (3.16)
n
k=1
n
1 X
= 2 Var [Xk ] (3.17)
n
k=1
3.1. UNBIASED ESTIMATORS 17

Var [X1 ]
= (3.18)
n
σ2
= . (3.19)
n
Where (3.15) follows since the estimator is unbiased, (3.17) holds since the samples are indepen-
dent, (3.18) holds since the samples are identically distributed, and (3.19) holds by assumption.
Note that the MSE in this case does not depend on the underlying parameter θ; this is typically
not the case in general.
In what follows, we will give special attention to the case of additive white Gaussian noise
i.i.d i.i.d
(AWGN), i.e., where {Wk }nk=1 ∼ N (0, σ 2 ). {Xk }nk=1 ∼ N (θ, σ 2 )
In this case clearly and

n  
Y 1 1 2
p(x | θ) = √ exp − (xk − θ) . (3.20)
2πσ 2 2
k=1
We will soon see that the MSE attained by the sample mean is the best possible for unbiased
estimators in the AWGN case, uniformly in θ.
Can we uniformly improve the MSE of the sample mean estimator by introducing a little bit
of bias, i.e., nd a biased estimator that dominates it? A simple attempt to do so is by applying
constant shrinkage, i.e.,

θ̂(x) = α · Mean[ x ]. (3.21)

for some 0 < α < 1. Let us nd the best shrinkage factor α:
 2
MSEθ [ θ̂ ] = Var[ θ̂ ] + Bias[ θ̂ ] (3.22)

σ2
= α2 · + (1 − α)2 θ2 . (3.23)
n
Taking the derivative w.r.t α and equating to zero, we get:

∂ σ2
MSEθ [ θ̂ ] = 2α · − 2(1 − α)θ2 = 0. (3.24)
∂α n
Solving for α we nd the the optimum is obtained for

θ2
αopt = σ2
. (3.25)
θ2 + n
and indeed we see that αopt < 1. Unfortunately, αopt depends on the unknown parameter θ and
is hence unrealizable. It is also easy to check that taking any xed value of α will never result
in an estimator that dominates the sample mean (though it will be better in a certain range of
parameters, depending on α). Later in the course, we will see that in some (higher-dimensional)
cases a (smarter) shrinkage approach can in fact generate biased estimators that dominate the
unbiased ones.
Still, there are a few things to note. First, αopt → 1 as n → ∞ for any xed value of θ.
2
This can be thought of as a high SNR regime, since in this case the `signal energy θ is much
2
larger than the eective noise variance σ /n. Following this interpretation, we note that α opt
looks like the MMSE estimation Wiener coecient. Indeed, in the Gaussian Bayesian case, the
θ2 E θ2
 
unknown factor is replaced with the computable under the Gaussian prior over the
parameter. Moreover, if for some reason we know or believe that it makes little sense for θ to be
very large, we may still want to use αopt by plugging in some large value of θ above which would
not tend to trust the estimation. Such a process is called regularization of the estimation, and
we will discuss it in more detail later. From the above discussion we can already suspect that
regularization is similar to the Bayesian approach; this will be made precise in the sequel.
18 CHAPTER 3. MINIMUM VARIANCE UNBIASED ESTIMATION

3.1.1 A Word About Robustness


How sensitive is an estimator to outliers, i.e., to corrupted samples or samples that for some
reason do not follow our model assumptions? One way to quantify this is to ask the following:
what is the smallest fraction of samples that, if corrupted, can drive our estimator to arbitrarily
large values? This fraction is called the breakdown point of the estimator. For our sample mean
estimator, it is easy to see that corrupting even a single sample can drive the estimator to innity,
hence its breakdown point is zero, the worst possible.
What could be a more robust (and not too bad) unbiased estimator? One intuitive suggestion
is to use

θ̂(x) = Median(x). (3.26)

Clearly, in order to drive this estimator to innity we need to corrupt at least half the samples,
hence its breakdown point is 1/2. How much did we pay for this great improvement in robustness?
It can be shown that for the median estimator (assuming now again there are no outliers..)

σ2
MSEθ [θ̂] ≈ α · (3.27)
n
where

1
α= (3.28)
4σ 2 p2W (0)
π
and where pW (0) is the density of the noise at zero. In the AWGN case we get α= 2 ≈ 1.57 . . .,
which is not too bad. Note that α < 1 is also possible; for example, if the noise is Laplace
distributed with zero mean and variance σ2, then α = 1/2, hence the median estimator beats
the sample mean.
Other ways to achieve some level of robustness is to rst try to detect and remove outliers
before applying our estimator. In the aforementioned case of a constant in additive noise, we may
want to rst remove a fraction γ of the largest samples (in absolute value) and then apply our
sample mean estimator to the samples that remain (properly scaled to maintain unbiasedness).
This is called the trimmed mean estimator, and it clearly has a breakdown point of γ ; we can make
the estimator more robust at the expense of using less samples (and hence suering in MSE). In
higher dimensions, designing a good outlier rejection procedure becomes more challenging.

3.1.2 Minimum Variance Unbiased (MVU) Estimators


So, we have restricted the discussion to unbiased estimators, and now we can reasonably hope
to nd an optimal one. Specically, an estimator will be called Minimum Variance Unbiased
(MVU) if it is unbiased, and dominates all other unbiased estimators (in terms of MSE). It is
not clear that indeed an MVU exists in a certain setup; Figure 3.1 loosely illustrates the situation
where an MVU exists and does not exist. In fact, in many cases no MVU estimator exists, see
e.g. Example 2.3 in Kay's book. More surprisingly perhaps, it turns out that sometimes, not
even a single unbiased estimator exists at all:

Example 3.2. We observe a single sample X ∼ Unif [0, 1/θ], where θ > 0 is some unknown
parameter, namely

p(x | θ) = θ · 1 (x ∈ (0, 1/θ)) (3.29)


3.1. UNBIASED ESTIMATORS 19

h i h i
Var θ̂ Var θ̂

θ̂2 θ̂2

θ̂1
θ̂1
No
MVU θ̂0
MVU
θ θ

Figure 3.1: Existence and non-existence of an MVU estimator.

Suppose θ̂(x) is an unbiased estimator of θ, and assume for simplicity that it is absolutely
continuous in x (an assumption that can be dropped). Then

1
h i Zθ
E θ̂(X) = θ θ̂(x)dx (3.30)

0
= θ, (3.31)

where (3.31) follows from the unbiasedness assumption. The above must hold for any θ > 0, and
hence

1

θ̂(x)dx = 1. (3.32)

for any θ > 0, or equivalently

Za
θ̂(x)dx = 1. (3.33)

for any a > 0. Since this is a functional identity, we can take the derivatives on both sides, to
obtain

θ̂(a) − θ̂(0) = 0 (3.34)

for any a > 0, and therefore

θ̂(x) = θ̂(0). (3.35)

This means that

Za
θ̂(x)dx = a · θ̂(0), (3.36)

for any a > 0, in contradiction to (3.33). Hence, there cannot be an unbiased estimator for θ in
this case.
20 CHAPTER 3. MINIMUM VARIANCE UNBIASED ESTIMATION

It is instructive to examine the ML estimator in this case:

θ̂ML = argmax p(x | θ) (3.37)


θ∈Θ
= argmax θ · 1 (x ∈ (0, 1/θ)) (3.38)
θ
1
= . (3.39)
x
This simply follows since the largest value of θ for which x is in the support of the distribution
is exactly 1/x. While this estimator makes sense, we note that is not unbiased; in fact, its
expectation is innite:

i Z 1
h θ dx
E θ̂ML (X) = (3.40)
0 x
=∞ (3.41)

for any θ > 0.

3.2 Score and Fisher Information


In the next section, we will derive a lower bound for the MSE attainable (as a function of θ)
by any unbiased estimator, and specically, an MVU estimator should one exist. Even if an
MVU estimator does not exist, this bound would still be meaningful and useful. To develop this
bound, we will try to nd a natural quantity pertaining to the parametric family at hand, that
will capture how sensitive the observations are to variations in the parameter value. Intuitively,
the higher the sensitively the better we should be able to estimate the true parameter.
Before proceeding to the formal denitions in the general case, we nd it instructive to
develop our intuition via the example of a constant contaminated by AWGN, in conjunction
with a heuristic discussion. Recall the model (say with a scalar observation, n = 1):
X = θ + W, (3.42)

where W ∼ N (0, σ 2 ) and σ2 is known. The sample mean estimator in this case is trivial:

θ̂(x) = x. (3.43)

Recall that this estimator is unbiased, and its MSE is

MSEθ [ θ̂ ] = Var[ X ] = σ 2 , (3.44)


Intuitively, why is the MSE equal to σ2? We might heuristically interpret MSE as the
resolution of our ability to guess the value of θ. Following this line of argument, we may ask
ourselves how small can δ > 0 be so that we can reasonably separate between the parameter
values of θ and θ + δ. In the case of constant in AWGN, the distribution of the sample under
these two values of the parameter clearly changes more drastically the smaller σ2 is, as depicted
in Figure 3.2. More accurately, when we perturb the parameter from θ to θ + δ, the probabilities
of being inside any given interval slightly changes, an eect that we might hope to recognize (in
the discrete case, the change is in the probability of a symbol). To heuristically quantify this, let
us nd by what (multiplicative) factor these probabilities change (why not additive?), say for a
small interval [a, a + ]:
Pr (a ≤ X ≤ a +  | θ + δ)  · p(a | θ + δ)
≈ (3.45)
Pr (a ≤ X ≤ a +  | θ)  · p(a | θ)
3.2. SCORE AND FISHER INFORMATION 21

p(x | θ) p(x | θ)
Large σ
2 Small σ2

x x
θ θ+δ θ θ+δ
(b)
(a)

Figure 3.2: (a) The pdf is less sensitive to variation in θ (b) pdf is more sensitive
.

= exp {log p(a | θ + δ) − log p(a | θ)} (3.46)


 
log p(a | θ + δ) − log p(a | θ)
= exp δ · (3.47)
δ
 

≈ exp δ · log p(a | θ) (3.48)
∂θ

The logarithm log(·) above (and throughout the course) is taken w.r.t the natural base e. We
can see that the quantity that determines the sensitivity of the model to perturbations in the
parameter at a given θ and x is a quantity sometimes called the score function


score [x | θ] , log p(x | θ) (3.49)
∂θ
p(x | θ) is typically called the likelihood function, and hence the score function is the derivative of
the log-likelihood function w.r.t the parameter. Of course, in reality we observe a random value
of the score, i.e., we see the r.v score [X | θ], where X ∼ p(x | θ). This random value of the score
function is typically called simply the score.
Intuitively, if we knew the distribution of the score, we should expect to be able to say
something about the performance of an estimator. For example, if the score tends to be small
(in absolute value), we expect that our ability to estimate the value of the parameter will not
be so good. Conversely, if the score tends to be large, then in principle we might be able to
estimate the parameter rather well.
Let us nd the distribution of the score in the case of constant in AWGN (n = 1). The
likelihood function is given by
 
1 1
p(x | θ) = √ exp − 2 (x − θ)2 (3.50)
2πσ 2 2σ
and the log-likelihood function is thus

1 1
log p(x | θ) = − log 2πσ 2 − 2 (x − θ)2 (3.51)
2 2σ
Taking the derivative of the above w.r.t the parameter, we nd the score function:

∂ log p(x | θ)
score[ x | θ ] = (3.52)
∂θ
22 CHAPTER 3. MINIMUM VARIANCE UNBIASED ESTIMATION

x−θ
= (3.53)
σ2
Since X ∼ N (θ, σ 2 ), we see that the score (as a r.v) is

score[ X | θ ] ∼ N 0, 1/σ 2 .

(3.54)

Namely, the expectation of the score is zero (independent of θ)

E [score[ X | θ ]] = 0, (3.55)

a property which will turn out to hold in more general setups, and the variance of the score is

1
Var [score[ X | θ ]] = . (3.56)
σ2
We see that the meaningful quantity here is the variance of the score, which is inversely propor-
tional to the noise level. This makes intuitive sense, since a higher noise level implies a smaller
score, which should mean poorer resolution. In fact, we see that the MSE (or variance) attained
by the sample mean estimator (which we claimed would be shown to dominate all unbiased
estimators) is precisely the reciprocal of the variance of the score. As it turns out, this is no
coincidence. In much more generality, the MSE (or variance) attained by any unbiased estima-
tor will be lower bounded by the reciprocal of the variance of the associated score (which will
typically depend on the value of the parameter, unlike the example above).
Let us now return to the general case of an arbitrary parametric family with a scalar real-
valued parameter. The variance of the score, due to its ostensible importance, is known by
the name the Fisher information. Traditionally, the Fisher information is dened as the second
moment of the score, which is equal to its variance when the expected score is zero, and the
latter holds under some regularity conditions (which are necessary anyway). Namely, the Fisher
information associated with the parametric family is
h i
I(θ) , E (score [X | θ])2 (3.57)
"  #
∂ log p(X | θ) 2
=E . (3.58)
∂θ

Please note that p(x | θ) can be a p.d.f or a p.m.f, and that x can be a scalar or a vector.
In order for the Fisher information to play its role in lower bounding the MSE in unbiased
estimation, the associated parametric family should satisfy some mild regularity conditions.
Specically, we will say that our parametric family is m-regular, if the following conditions are
satised:

(i) The parameter space Θ⊆R is a union of a nite number of (possibly innite ) intervals.

(ii) The Fisher information exists and is nite for any θ ∈ interior(Θ).

(iii) log p(x | θ) is m-times dierentiable in θ for any θ ∈ interior(Θ) and x ∈ X.

(iv) Integration and dierentiation can be exchanged

∂k ∂ k p(x | θ)
Z  Z
t(x)p(x | θ)dx = t(x) dx (3.59)
∂θk ∂θk

for k ∈ {1, . . . , m} and any θ ∈ interior(Θ), wherever the right hand side (r.h.s) exists and
is nite.
3.2. SCORE AND FISHER INFORMATION 23

Most of our results will be for 2-regular families, which will be referred to simply as regular.
We note that some of our results also hold under slightly weaker conditions, e.g., for 1-regular
families. For simplicity of exposition we do not discuss these issues. The following lemma (which
we do not prove here) provides sucient conditions for the third regularity condition to hold.

Lemma 3.3. Any one of the following conditions implies the regularity condition (3.59):

(i) p(x | θ) is a p.d.f that is an m times continuously dierentiable in θ, and has a with bounded
support in x that does not depend on θ .

(ii) p(x | θ) is a p.d.f with innite support that is m times continuously dierentiable in θ, and
the integral on the r.h.s of (3.59) converges uniformly for any θ ∈ Θ.

(iii) p(x | θ) is a p.m.f that is m times continuously dierentiable in θ.


Next, we prove a few useful properties of the score and the Fisher information.

Lemma 3.4. For a regular parametric family, the following properties hold:

(i) I(θ) ≥ 0 (this holds also without regularity, whenever the Fisher information exists).

(ii) The score satises

E [score[ X | θ ]] = 0 (3.60)

Var [score[ X | θ ]] = I(θ). (3.61)

(iii) The Fisher information can alternatively be written as


 
∂ score[ x | θ ]
I(θ) = −E (3.62)
∂θ
 2 
∂ log p(x | θ)
= −E . (3.63)
∂θ2

Proof. The rst property follows trivially from the denition. For the second property:
Z ∞

E [score[ X | θ ]] = p(x | θ) · log p(x | θ)dx (3.64)
−∞ ∂θ
Z ∞ ∂
∂θ p(x | θ)
= p(x | θ) · dx (3.65)
p(x | θ)
Z−∞


= p(x | θ)dx (3.66)
−∞ ∂θ
Z ∞

= p(x | θ)dx (3.67)
∂θ −∞

= 1 (3.68)
∂θ
= 0, (3.69)

where (3.67) holds due to the regularity property. We see that as in the AWGN example, the
expected score is indeed zero. The variance of the score is therefore equal to its second moment,
which by denition is the Fisher information. To prove the third claim, write:
" #

∂ 2 log p(x | θ) ∂ ∂θ p(x | θ)
2
= (3.70)
∂θ ∂θ p(x | θ)
24 CHAPTER 3. MINIMUM VARIANCE UNBIASED ESTIMATION

∂2
 ∂
2
p(x | θ) · ∂θ2
p(x | θ) − ∂θ p(x | θ)
= (3.71)
p2 (x | θ)
#2
∂2
"

∂θ2
p(x | θ) ∂θ p(x | θ)
= − (3.72)
p(x | θ) p(x | θ)
∂2 2
p(x | θ)

∂θ2 ∂ log p(x | θ)
= − . (3.73)
p(x | θ) ∂θ

Taking the expectation:

Z ∞ 2
∂ 2 log p(x | θ)
 

−E = I(θ) − p(x | θ)dx (3.74)
∂θ2 −∞ ∂θ 2
Z ∞
∂2
= I(θ) − 2 p(x | θ)dx (3.75)
∂θ −∞
∂2
= I(θ) − 2 1 (3.76)
∂θ
= I(θ), (3.77)

where (3.75) holds due to the regularity property.

3.3 Cramer-Rao Lower Bound (CRLB)


We are now ready to state the main result of this chapter.

Theorem 3.5 (Cramer-Rao) . Let {p(x | θ)}θ∈Θ⊆R be a regular parametric family with
Fisher information I(θ). Let θ̂ : X → R be any measurable unbiased estimator for θ.
Then:

1
MSEθ ( θ̂ ) = Var[ θ̂(X) ] ≥ . (3.78)
I(θ)

In addition, an unbiased estimator attaining the bound above with equality for all θ exists,
if and only if the score function can be written as

score[ x | θ ] = C(θ) · (g(x) − θ) (3.79)

for some functions g(x) and C(θ). Moreover, if (3.79) holds then C(θ) = I(θ), and
θ̂(x) = g(x) is an MVU estimator with the variance function 1/I(θ).

Before we prove the theorem, a few remarks are in order:

Remark 3.6. Clearly, if some unbiased estimator achieves the CRLB, then this estimator is
MVU. However, it is possible that an MVU estimator exists but does not achieve the CRLB (in
which case the CRLB is clearly loose). In any case, and even if no MVU exists, the CRLB is
still a valid lower bound for the MSE of unbiased estimators.

Remark 3.7. Note that in the CRLB theorem we have allowed the estimator to return any real
value, not necessarily in Θ. This is of course generally unreasonable, but it can be shown that in
many cases, when Θ 6= R, then requiring the estimator to return values in Θ only may result in
the non-existence of any unbiased estimators. One such case is discussed later in Example 3.11.
3.3. CRAMER-RAO LOWER BOUND (CRLB) 25

Proof. We assume below that p(x | θ) is a p.d.f, but the proof holds with the trivial modications
for a p.m.f. For simplicity, we write S = score[ X | θ ] for the score r.v in this proof. As we have
seen, E [S] = 0 and Var [S] = I(θ). Now, let θ̂(x) be any estimator (not necessarily unbiased).
Appealing to the CauchySchwarz inequality for the covariance inner-product, the covariance
between the score and the estimator can be upper bounded by

2 h i
Cov[ S, θ̂(X) ] ≤ Var [S] · Var θ̂(X) (3.80)
h i
= I(θ) · Var θ̂(X) , (3.81)

and therefore the variance of the estimator can be lower bounded by

2
h i Cov[ S, θ̂(X) ]
Var θ̂(X) ≥ (3.82)
I(θ)

To complete the proof, let us expand the covariance directly:

h i
Cov[ S, θ̂(X) ] = E S · θ̂(X) − E[S] · E[θ̂(X)] (3.83)
h i
= E S · θ̂(X) (3.84)
Z ∞ ∂
p(x | θ)
= p(x | θ) · ∂θ · θ̂(x)dx (3.85)
−∞ p(x | θ)
Z ∞

= θ̂(x) p(x | θ)dx (3.86)
−∞ ∂θ
Z ∞

= θ̂(x)p(x | θ)dx (3.87)
∂θ −∞
∂ h i
= E θ̂(X) (3.88)
∂θ

= (θ + Bias[ θ̂ ]) (3.89)
∂θ

=1+ Bias[ θ̂ ], (3.90)
∂θ
where (3.87) holds due to the regularity property. Combining (3.81) and (3.90) we obtain

h i  2
MSEθ θ̂ = Var[ θ̂ ] + Bias[ θ̂ ]) (3.91)
h i2

1 + ∂θ Bias[ θ̂ ]  2
≥ + Bias[ θ̂ ]) . (3.92)
I(θ)

Specically, for an unbiased θ̂ we have Bias[ θ̂ ], hence proving the lower bound in the Theorem.
Note that the above bound is more general, lower bounding the MSE of all estimators with a
given bias function (this is not too useful outside the unbiased case, however).
To prove the additional claims, note that the CauchySchwarz inequality is tight if and only
if the two random variables (with means removed) are co-linear, i.e., in our case (noting that the
constant can depend on θ):
 h i
S − ES = C(θ) θ̂(X) − E θ̂(X) , (3.93)
26 CHAPTER 3. MINIMUM VARIANCE UNBIASED ESTIMATION

and hence
 
score[ X | θ ] = C(θ) θ̂(X) − θ (3.94)

holds with probability 1 for some C(θ). Furthermore, using the second derivative characterization
of the Fisher information, we get

∂2
 
I(θ) = −E log p(X | θ) (3.95)
∂θ2
 

= −E score[ X | θ ] (3.96)
∂θ
 
∂C(θ)
= −E − (θ̂(X) − θ) − C(θ) (3.97)
∂θ
= C(θ). (3.98)

where we have used the fact that θ̂ is unbiased.

Example 3.8 .
(constant in AWGN) Recall the model Xk = θ + Wk for k = 1, . . . , n, where
i.i.d
{Wk } ∼ N (0, σ 2 ). The likelihood function is
n  
Y 1 1 2
p(x | θ) = √ exp − 2 (xk − θ) . (3.99)
2πσ 2 2σ
k=1

The log likelihood is

n
n 1 X
log p(x | θ) = − log(2πσ 2 ) − 2 (xk − θ)2 . (3.100)
2 2σ
k=1

The score function is

∂ log p(x | θ)
score[ x | θ ] = (3.101)
∂θ
n
1 X
= 2 (xk − θ) (3.102)
σ
k=1
n
= 2 (Mean[ x ] − θ) , (3.103)
σ
where Mean[ x ] is the sample mean. We note that Mean[ X ] ∼ N (θ, σ 2 /n), hence the distribution
of the score is

n2 σ 2
 
score[ X | θ ] ∼ N 0, 4 (3.104)
σ n
 n
= N 0, 2 (3.105)
σ
Hence, the Fisher information is

n
I(θ) = , (3.106)
σ2
and the CRLB implies that for any unbiased estimator θ̂:

MSEθ [ θ̂ ] = Var[ θ̂ ] (3.107)


3.3. CRAMER-RAO LOWER BOUND (CRLB) 27

1
≥ (3.108)
I(θ)
σ2
= . (3.109)
n
This is achieved in equality by the sample mean estimator θ̂(x) = Mean[ x ] (which in this case
is also the ML estimator). Hence, the sample mean is the MVU estimator.
In fact, there was an easier way to gure this out without even knowing about the sample
mean estimator in the rst place. We have seen that the score function is

n
score[ x | θ ] = (Mean[ x ] − θ) , (3.110)
σ2
which is exactly of the MVU form discussed in the CRLB theorem. Once that is observed, we
n
can immediately deduce the I(θ) = σ2
and that Mean[ x ] is the MVU estimator.

Example 3.9 (Biased coin). Recall that here the i.i.d observations are drawn from the Bernoulli
i.i.d
distribution with parameter θ, i.e., {X1 , . . . , Xn } ∼ Bern(θ). The likelihood function is

n
Y
p(x | θ) = p(xk | θ) (3.111)
k=1
kxk
=θ · (1 − θ)n−kxkH (3.112)

where kxkH is the Hamming weight of the vector x ∈ {0, 1}n . The log-likelihood function is

log p(x | θ) = kxkH log(θ) + (n − kxkH ) log(1 − θ). (3.113)

The score function is

∂ log p(x | θ)
score[ x | θ ] = (3.114)
∂θ
kxkH n − kxkH
= − (3.115)
θ 1−θ
kxkH (1 − θ) − (n − kxkH )θ
= (3.116)
θ(1 − θ)
kxkH − nθ
= (3.117)
θ(1 − θ)
 
n Mean[ x ] −θ .
= (3.118)
θ(1 − θ) | {z }
| {z } MVU
I(θ)

Namely, for any unbiased estimator θ̂

MSEθ [ θ̂ ] = Var[ θ̂ ] (3.119)

1
≥ (3.120)
I(θ)
θ(1 − θ)
= (3.121)
n
and the sample mean is the MVU estimator. Note that we could also have looked at the distri-
1
bution of the score, which is clearly Binom(n, θ) that is centered and then scaled by
θ(1−θ) . This
would also immediately yield the expression for the Fisher information.
28 CHAPTER 3. MINIMUM VARIANCE UNBIASED ESTIMATION

3.4 CRLB and Maximum Likelihood


In the two examples above we saw that the CRLB can be achieved with equality. Such unbiased
estimators that achieve the CRLB with equality are called ecient. We have already seen that
at least in the biased coin example, the ecient estimator is also the ML estimator (this also
holds in the AWGN example, check!). In fact:

Lemma 3.10. Let {p(x | θ)}θ∈Θ be a regular parametric family with Θ = R, and assume that
I(θ) > 0 for any θ ∈ R. If an ecient estimator exists, then the ML estimator is ecient.

Proof. The ML estimator is the solution to the following optimization problem:

θML (x) = argmax p(x | θ) (3.122)


θ∈Θ
= argmax log p(x | θ) (3.123)
θΘ

To solve this, we can take the derivative w.r.t θ:



log p(x | θ) = score[ x | θ ] (3.124)
∂θ
= I(θ)[g(x) − θ] (3.125)

where the last step holds since we are assuming that an ecient estimator exists, in which case
g(x) is ecient. Equating the above to zero and recalling that I(θ) > 0, we obtain that


log p(x | θ) = I(θ)[g(x) − θ] (3.126)
∂θ |{z}
>0

0
 θ = g(x)
= > 0 θ < g(x) (3.127)

< 0 θ > g(x).

Thus θ̂ML = g(x) is a unique global maximum of the likelihood function, hence g(x) is the ML
estimator.

The reason we have required the parameter space to be the entire real line is obvious from the
proof; if Θ is a strict subset of R, then it may be that g(x) 6∈ Θ, in which case the maximum of
the likelihood function is attained at some boundary point of Θ. In such a case, the ML estimator
will be biased, as will be any other estimator that is constrained to return values inside Θ. This
is demonstrated in the following simple example.

Example 3.11. Consider the constant in AWGN model with a single sample, i.e., X = θ + W ,
where W ∼ N (0, σ 2 ). However, this time let us restrict the parameter space to be (say) the unit
interval Θ = [0, 1]. Then it should be clear that


x x ∈ [0, 1]

θ̂ML (x) = 0 x < 0 (3.128)

1 x>1

h i Z1
E θ̂ML (X) = x · p(x | θ)dx + Pr(X > 1) (3.129)

0
3.5. MORE ON THE FISHER INFORMATION 29

can be easily veried to be biased. For example, it always holds that E[θ̂ML (X)] > 0, even when
θ = 0.
In fact, a far more general statement holds in this setup: any estimator θ̂ : R → [0, 1] is
necessarily biased. To see this, note that if θ̂ is unbiased, then in particular when θ = 0 it must
hold that E[θ̂(X)] = 0. But since we have restricted θ̂(x) to give values in [0, 1] only, then clearly
the only way for E[θ̂(X)] = 0 to hold is if θ̂(X) = 0 with probability 1. This means that θ̂ is a
constant estimator, hence is obviously biased. Note that the estimator θ̂(x) = x is of course still
unbiased; it achieves this by sometimes returning values that are outside the [0, 1] parameter
space.

One might wonder why the ML estimator for in biased coin example, where Θ = [0, 1] is not
the entire real line, is ecient. From the discussion above, the following immediate corollary
claries this point.

Corollary 3.12. Let {p(x | θ)}θ∈Θ be a regular parametric family, and assume that I(θ) > 0 for
any θ ∈ interior(Θ). Suppose the score function can be written as score[ x | θ ] = C(θ) · (g(x) − θ).
If Pr(g(X) ∈ Θ) = 1, then the ML estimator is ecient (and is given by g(x)).

3.5 More on the Fisher Information


Let p(x), q(x) be two distributions over the same alphabet. Below we will write things assuming
that these are scalar [Link], but the same denitions and results hold in case these sare [Link], and
in the multidimensional case, by replacing the integrals with sums or with multiple integrals. The
Kullback-Leibler (KL) divergence (a.k.a relative entropy) is a useful measure of disparity between
distributions. It is dened to be

  
p(X)
D(pkq) , EX∼p(x) log (3.130)
q(X)
Z  
p(x)
= p(x) log dx (3.131)
q(x)

In other words, D(pkq) is the expected log-likelihood ratio, where the expectation is taken with
respect to p.

Lemma 3.13. The following are properties of the KL divergence:

(i) D(pkq) ≥ 0, with equality if and only if p(x) = q(x) (almost everywhere).

(ii) D(pkq) 6= D(qkp) In general.

(iii) Data Processing Inequality (DPI): Let pX × pY |X and qX × pY |X be two joint distributions
over the same sigma-algebra of the product sample space X × Y , and let pY and qY be the
associated Y -marginals, respectively. Then

D(pY kqY ) ≤ D(pX kqX ) (3.132)

This is depicted in Figure 3.3

You will prove this lemma as a homework exercise.


30 CHAPTER 3. MINIMUM VARIANCE UNBIASED ESTIMATION

pX pY
pY |X

qX qY
pY |X

D(pY kqY ) ≤ D(pX kqX )

Figure 3.3: Data Processing Inequality for the KL divergence

Theorem 3.14. Let {p(x | θ)}θ∈Θ⊆R be a 3-regular family with Fisher information I(θ),
and assume that

∂ 3 log p(x | θ)
 
sup E < ∞. (3.133)
θ∈Θ ∂θ3

Then the following properties hold:

(i) The Fisher information is the curvature of the KL divergence w.r.t. translations.
Namely, for any θ, δ such that [θ, θ + δ] ⊆ interior(Θ),

δ2
I(θ) + O |δ 3 | .

D (p(x | θ) k p(x | θ + δ)) = (3.134)
2

(ii) DPI for the Fisher information: Let X ∼ p(x | θ), and let Y | X = x ∼ p(y | x). Then
, assuming that the associated p(y|θ) satises the same regularity conditions,

I Y (θ) ≤ I(θ), (3.135)

where I Y (θ) is the Fisher information pertaining to the parametric family p(y | θ).

Proof. (i) Let

g(δ) , D (p(x | θ) k p(x | θ + δ)) . (3.136)

We will nd the Taylor expansion of g around δ = 0. First,

g(0) = D(p(x | θ) k p(x | θ)) = E [log 1] = 0. (3.137)

The rst derivative is

  
0 ∂ p(X | θ)
g (0) = E log (3.138)
∂δ p(X | θ + δ)
δ=0
  
∂ p(X | θ)
=E log (3.139)
∂δ p(X | θ + δ)
δ=0
3.5. MORE ON THE FISHER INFORMATION 31

 

= −E log p(X | θ + δ) (3.140)
∂δ
δ=0
 

= −E log p(X | θ) (3.141)
∂θ
= −E [score[ X | θ ]] (3.142)

= 0, (3.143)

where (3.139) holds due to regularity. Similarly, the second derivative is

∂2
 
g 00 (0) = −E log p(X | θ + δ) (3.144)
∂δ 2
δ=0
 2 

= −E log p(X | θ) (3.145)
∂θ2
= I(θ). (3.146)

Now, using the 3-regularity now, the third derivative at any 0 ≤ δ0 ≤ δ is

∂3
 
(3) 0
|g (δ )| = E log p(X | θ + δ) (3.147)
∂δ 3
δ=δ 0
 3 

= E log p(X | θ) (3.148)
∂θ3
θ=θ+δ 0

hence by the assumption there is some universal C<∞ such that |g (3) (δ 0 )| < C for any θ
0
and δ . Writing the Taylor series with the remainder term, we have that for some δ 0 ∈ [0, δ]

1 g (3) (δ 0 ) 3
g(δ) = g(0) + δg 0 (0) + δ 2 g 00 (0) + ·δ , (3.149)
2 6
C 3
and since the remainder term is bounded in absolute value by
6 |δ |, the claim follows.

(ii) Writing the KL-divergence using the Fisher information for the family p(x | θ) we have that

δ2
D(p(x | θ)kp(x | θ + δ)) = I(θ) + O(|δ|3 ), (3.150)
2
and similarly for the family p(y | θ) therefore,

δ2 Y
D(p(y | θ) k p(y | θ + δ)) = I (θ) + O(|δ|3 ). (3.151)
2
Using the DPI for the KL divergence, we have that

δ2 δ2
I(θ) + O(|δ|3 ) ≥ I Y (θ) + O(|δ|3 ). (3.152)
2 2
Dividing by δ 2 /2 yields

I(θ) + O(|δ|) ≥ I Y (θ) + O(|δ|). (3.153)

The result now follows by taking the limit δ → 0.


32 CHAPTER 3. MINIMUM VARIANCE UNBIASED ESTIMATION

Remark 3.15. The DPI for the Fisher information is intuitively appealing, claiming that it
can never be increased by processing the samples. In cases where the CRLB is tight this is
immediately clear, but it turns out to hold in more generality.

Theorem 3.16 (Additivity of I(θ) for independent samples). Let {Xk }nk=1 be independent sam-
ples drawn from possibly distinct models controlled by the same parameters, i.e., X ∼ pk (x | θ),
where pk (x | θ) are all regular families. Let I
(n) (θ) be the Fisher information pertaining to all the

samples, and let Ik (θ) be the Fisher information pertaining to the sample Xk . Then p(x | θ) is
regular, and

n
X
(n)
I (θ) = Ik (θ). (3.154)
k=1

Proof. The log-likelihood function is additive:

n
!
Y
log p(x | θ) = log pk (xk | θ) (3.155)
k=1
n
X
= log pk (xk | θ). (3.156)
k=1

Thus, the score function is also additive:


score[ x | θ ] = log p(x | θ) (3.157)
∂θ
n
X ∂
= log pk (xk | θ) (3.158)
∂θ
k=1
n
X
= score[ xk | θ ]. (3.159)
k=1

The expectation of the score is clearly zero. The Fisher information, which is the variance of the
score, is given by

I (n) (θ) = Var [score[ X | θ ]] (3.160)

Xn
= Var [score[ Xk | θ ]] (3.161)
k=1
n
X
= Ik (θ). (3.162)
k=1

where in (3.161) we have used the fact that the samples are independent.

Corollary 3.17. Let


i.i.d
{Xk }nk=1 ∼ p(x | θ), where p(x | θ) is a regular family with Fisher infor-
mation I(θ). Then

I (n) (θ) = nI(θ). (3.163)

Remark 3.18. Together with the CRLB, the additivity property implies that the variance of
an unbiased estimator for a regular family cannot decrease faster than Ω(1/n) in the number of
i.i.d samples. We will later see that in non-regular cases, a faster decay is sometimes possible.
3.6. CRLB FOR GENERAL SIGNALS IN AWGN 33

Remark 3.19. What about dependent samples? One might guess that the Fisher information
Pn
is sub-additive, namely that I (n) (θ) ≤ k=1 Ik (θ). However, this is not the case. For example,
take the following dependent two-sample model

X1 = θ + W (3.164)

X2 = θ − W (3.165)

where W ∼ N (0, 1) . The marginal Fisher information for both samples is clearly identical and
equal to I1 (θ) = I2 (θ) = 1. However, for the two samples together, the Fisher information must
be

I (2) (θ) = ∞, (3.166)

since θ̂(x) = (x1 + x2 )/2 recovers θ exactly, with zero MSE.

3.6 CRLB for General Signals in AWGN


Consider the case of observing a deterministic function of the parameter, contaminated by
AWGN. Namely, the observations are given by

Xk = gk (θ) + Wk , k = 1, . . . , n, (3.167)

i.i.d
where {Wk } ∼ N (0, σ 2 ) and gk (θ) is a function (time series) parameterized by θ. Before, we
solved the case of a constant in AWGN, where the function gk (θ) = θ was the identity function,
independent of the time index. The associated likelihood function is given by

n
( )
1 1 X
p(x | θ) = n exp − 2 (xk − gk (θ))2 . (3.168)
(2πσ 2 ) 2 2σ
k=1

The score function is

∂ log p(x | θ)
score[ x | θ ] = (3.169)
∂θ
n
1 X ∂gk (θ)
= 2 (xk − gk (θ)) · . (3.170)
σ ∂θ
k=1

And the Fisher information is

I(θ) = E (score[ X | θ ])2


 
(3.171)
n i  ∂g (θ) 2
1 X h 2 k
= 4 E (Xk − gk (θ)) (3.172)
σ ∂θ
k=1
n
1 X  2  ∂gk (θ) 2
 
= 4 E Wk (3.173)
σ ∂θ
k=1
| {z }
σ2
n  2
1 X ∂gk (θ)
= . (3.174)
σ2 ∂θ
k=1

Thus, for any unbiased estimator θ̂ we have that

h i σ2
Var θ̂(X) ≥ n  2 . (3.175)
P ∂gk (θ)
∂θ
k=1
34 CHAPTER 3. MINIMUM VARIANCE UNBIASED ESTIMATION

The denominator above can be though of as the sum of sensitivities of the signal to perturbation
of the parameter. Note that for the simple case of gk (θ) = θ, we get the σ 2 /n we have already
found before.

Example 3.20 (Frequency .


Estimation) Suppose we are observing a sinusoid of unknown fre-
1
quency f ∈ 0, 2

gk (f ) = a cos(2πf k), k = 1, ..., n (3.176)

in AWGN, and wish to estimate f. The amplitude a and also the phase (set to zero here) are
assumed to be known. Write

∂gk (f )
= 2πka · sin(2πf k) (3.177)
∂f

and hence for any unbiased estimator fˆ we have that

h i σ2
Var fˆ ≥ n . (3.178)
(2πa)2 k 2 sin2 (2πf k)
P
k=1

a2
Dening SNR , σ2
, we can also write

" n
#−1
h i X 2
Var fˆ ≥ SNR · (2πk sin(2πf k)) , (3.179)
k=1

Note that later samples (large k) can be much more inuential than earlier ones (small k ). This
might seem weird at rst sight, but the reason is simple: If we are trying to decide between f
and f +δ for some small deviation δ, the dierence between the resulting sinusoids becomes
more evident as time progresses.

3.7 Transformation of Parameters


Suppose our parametric family is given to us parameterized by θ, but we want to nd an un-
biased estimator for g(θ) instead of θ, where g :R→R is some given function. What is the
corresponding CRLB? The following is a simple extension of the standard CRLB.

Theorem 3.21. Let {p(x | θ)}θ∈Θ⊆R be a regular parametric family with Fisher information
I(θ). Let ĝ(x) be an unbiased estimator, i.e., one such that E [ĝ(X)] = g(θ) for any θ ∈ Θ, for
a continuously dierentiable function g(θ) over Θ. Then

MSEθ (ĝ(X)) = Var [ĝ(X)] (3.180)


 2
∂g(θ)
∂θ
≥ . (3.181)
I(θ)

Proof. Let us dene a new parameter φ = g(θ), and apply the CRLB to φ. Without loss of
generality, we will assume that g is invertible. If it is not, then we can restrict the discussion to
a small enough neighborhood of θ, in which g is either invertible or constant (and the constant
case is not interesting, since the bound then says the variance is nonnegative). It is not dicult
3.7. TRANSFORMATION OF PARAMETERS 35

to show that the parametric family when parameterized by φ is also regular. Thus any unbiased
estimator φ̂(x) of φ (and hence of g(θ)) must satisfy

1
Var[ φ̂ ] ≥ (3.182)
I(φ)
e

where we wrote e for the Fisher information pertaining to the new parametric family pe(x | φ) =
I(φ)
p(x | θ) where θ = g −1 (φ). Thus, we only need to relate this Fisher information to the Fisher
information I(θ) of the original family. To that end, the score function pertaining to φ is given
by

∂ log pe(x | φ) ∂ log p(x | θ)


= (3.183)
∂φ ∂φ
∂ log p(x | θ) ∂θ
= · (3.184)
∂θ ∂φ
 −1
∂φ
= score[ x | θ ] · (3.185)
∂θ
∂g(θ) −1
 
= score[ x | θ ] · . (3.186)
∂θ

where we have used the invertibility assumption in (3.185). The claim now follows since the
Fisher information is the second moment of the score.
Alternatively, we can prove this by returning to the CRLB proof, and looking at the covariance
between the score and ĝ(X). On the one hand, we have that

|Cov[ score[ X | θ ], ĝ(X) ]|2 ≤ Var [score[ X | θ ]] · Var[ ĝ(X) ] (3.187)

in light of the CauchySchwarz ineuqality. On the other hand, using regularity we get that


Cov[ score[ X | θ ], ĝ(X) ] = E [ĝ(X)] (3.188)
∂θ
∂g(θ)
= (3.189)
∂θ
where we have used the assumption that the estimator is unbiased in the last transition. Com-
bining the two expressions, the bound follows.

Remark 3.22. Note that the factor of the derivative squared makes sense. It means that the
more sensitive φ is to variations in θ, the more we expect to be penalized in MSE when estimating
φ. The derivative is in fact just a scaling factor, which locally captures the change of scale between
θ and φ.

But how about using a plug-in approach? Namely, if θ̂ is an MVU / ecient estimator for θ,
why not just use g(θ̂)?

Example 3.23 .
(Constant in AWGN) Recall the model Xk = θ + Wk where
i.i.d
{Wk }nk=1 ∼
N (0, σ 2 ). We have seen that the sample mean estimator, which was also the ML estimator, is
ecient (and hence also MVU). Suppose now that we want to estimate φ = θ2 (and assume we
restrict out attention to θ>0 only). We can use the plug-in
2
estimator φ̂ = (Mean[ X ]) . Its
expected value is
h i
E φ̂ = E (Mean[ X ])2
 
(3.190)
36 CHAPTER 3. MINIMUM VARIANCE UNBIASED ESTIMATION

= Var [Mean[ X ]] + E (Mean[ X ])2


 
(3.191)

σ2
= + θ2 (3.192)
n
6= θ2 . (3.193)

This is in fact the case in general  the unbiasedness of an estimator (and hence its eciency)
is generally violated by the application of a function. This is of course not to say that the plug-
in estimator is bad; it is simply biased. The only exception is ane transformations, for which
unbiasedness and eciency (had it been satised) are maintained. Let θ̂ be an unbiased estimator
for θ. Consider the ane transformation φ = aθ + b, and let us evaluate the plug-in estimator
φ̂(x) = aθ̂(x) + b.
h i h i
E φ̂(X) = aE θ̂(X) + b (3.194)

= aθ + b (3.195)

= φ. (3.196)

Hence, in this case the plug-in estimator is also unbiased. What about the MSE?
h i h i
Var φ̂(X) = a2 Var θ̂(X) (3.197)
 2
∂φ h i
= · Var θ̂(X) . (3.198)
∂θ

Now, in case θ̂ is ecient, we obtain


 2
∂φ
∂θ
h i
Var φ̂(X) = , (3.199)
I(θ)
hence φ̂ is ecient as well.
We can use the above claim approximate preservation of eciency for large data sets.
Suppose that θ̂(x) is ecient for θ, where x is a vector of i.i.d samples. Consider φ = g(θ) and
assume that the second derivative of g is uniformly boudned. Evaluating the Taylor expansion
of g around θ at θ̂(x), we have
 
∂g(θ)   2
g(θ̂(x)) = g(θ) + θ̂(x) − θ + O θ̂(x) − θ . (3.200)
∂θ
Taking the expectation of both sides, we obtain
h  i ∂g(θ)  h i   h i
E ĝ θ̂(X) = g(θ) + E θ̂(X) − θ +O Var θ̂(X) (3.201)
∂θ | {z }
=0
 
1
= g(θ) + O , (3.202)
nI(θ)
where we have used the i.i.d and eciency assumptions. Hence, the plug in estimator is asymp-
totically unbiased. Computing the variance, and assuming heuristically the that contribution of
the third term is negligible, we get

h i  ∂g(θ) 2 h i
Var g(θ̂(X)) ≈ Var θ̂(X) . (3.203)
∂θ
which is approximately ecient. We will handle this topic more accurately later on, when we
discuss the ML asymptotic.
4. Vector MVU Estimation

Suppose we have a vector of parameters θ ∈ Rd , and a parametric family {p(x | θ)} where x is
either a scalar or a vector, discrete or continuous.
d
We say that an estimator θ̂ : X → R is called
unbiased if
h i
E θ̂(X) = θ. (4.1)

How would we generalize the MVU property? First, a quick reminder from linear algebra.

A real-valued symmetric square matrix A is called positive semi-denite (p.s.d) if

v T Av ≥ 0 (4.2)

for any 6 0. In this case


v= we write A  0. We also write AB to mean that A − B  0,
i.e., that A − B is p.s.d.

Lemma 4.1. Let A be a real-valued, symmetric, square matrix. Then the following state-
ments are equivalent:

(i) A  0.

(ii) All the eigenvalues of A are nonnegative.

(iii) A = QT Q for some real-valued matrix Q. There is a unique such p.s.d. Q.

Lemma 4.2. The following are some operations that preserve the p.s.d property:

(i) If A0 and B  0, then αA + βB  0 for any α ≥ 0, β ≥ 0.

(ii) If A is invertible and A  0, then A−1  0.

Below we write A0 to mean that the associated inequalities are strict, in which case
A is said to be positive denite (p.d.).

Since our estimators are now multidimensional, it may not be immediately clear how to
compare them. Recall the covariance matrix of a random column vector U is

Cov [U ] , E (U − E [U ]) · (U − E [U ])T .
 
(4.3)

We say that an unbiased estimator θ̂1 (x) dominates another unbiased estimator θ̂2 (x), if
h i h i
Cov θ̂2 (X)  Cov θ̂1 (X) . (4.4)

This clearly reduces to the denition of domination for scalar unbiased estimators. But what
does multidimensional domination mean? It is in fact a very strong requirement.

37
38 CHAPTER 4. VECTOR MVU ESTIMATION

Lemma 4.3. Let θ̂1 (x) and θ̂2 (x) be two unbiased estimators. Then θ̂1 (x) dominates θ̂2 (x) if
and only if
h i h i
Var v T θ̂1 (X) ≤ Var v T θ̂2 (X) (4.5)

for any column vector v ∈ Rd .


In other words, domination means that no matter in which direction we project our estima-
tors, the one that dominates will have lower variance in that direction (i.e., will dominate in the
scalar sense). Thus, with some minor abuse of terms, we say that a vector estimator θ̂(x) is
MVU if it is unbiased, and dominates any other unbiased estimator in the sense dened above.

Proof. The following conditions are equivalent:


i h h i
ˆ T
T
Var v θ1 (X) ≤ Var v θ̂2 (X) (4.6)
h  i h  i
Var v T θˆ1 (X) − θ ≤ Var v T θ̂2 (X) − θ (4.7)
   T     T 
T ˆ ˆ T ˆ
E v θ1 (X) − θ θ1 (X) − θ v ≤ E v θ2 (X) − θ θ2 (X) − θ v ˆ (4.8)

h i h i
ˆ
v T Cov θ1 (X) ˆ
v ≤ v T Cov θ2 (X) v (4.9)

Hence, condition (4.6) holds for any v if and only if


 h i h i
v T Cov θˆ2 (X) − Cov θˆ1 (X) v ≥ 0. (4.10)

 h i h i
holds for any v, which is synonyms with the matrix Cov θˆ2 (X) − Cov θˆ1 (X) being p.s.d,
h hi i
or equivalently with Cov θˆ2 (X)  Cov θˆ1 (X) . This completes the proof.

4.1 Vector CRLB


Consider a d-dimensional parametric family {p(x | θ)}θ∈Θ where Θ ⊆ Rd .
x is either a
As usual,
scalar or a vector, discrete or continuous. We can dene the vector score function score[ x | θ ] =
∂ log p(x | θ)
to contain the score functions of each parameter, i.e., a vector of functions whose ith
∂θ
entry is

∂ log p(x | θ)
scorei [x | θ] , (4.11)
∂θi
Note that scorei [ x | θ ] generally depends on all the parameters, not only the ith one. It captures
the multiplicative sensitivity of the distribution to perturbation of the ith parameter. The score
vector is score[ X | θ ], d-dimensional random vector obtained
the by drawing the x argument
score vector function according to p(x | θ). .
The Fisher information matrix I(θ) is dened to be the second moments matrix of the score
vector, i.e.,
h i
I(θ) , E score [X | θ] · score [X | θ]T . (4.12)

More explicitly, the (i, j)-th entry of I(θ) is

{I(θ)}ij , E [scorei [X | θ] · scorej [X | θ]] (4.13)


4.1. VECTOR CRLB 39

 
∂ log p(X | θ) ∂ log p(X | θ)
=E · . (4.14)
∂θi ∂θj
A vector parametric family is called regular, if the following conditions are satised:

(i) The parameter space Θ⊆R is a union of a nite number of convex sets.

(ii) The Fisher information matrix exists, is nite, and is full rank for any θ ∈ interior(Θ).

(iii) log p(x | θ) is twice continuously dierentiable in θ for any θ ∈ interior(Θ).

(iv) Integration and dierentiation can be exchanged up to second order partial derivatives,
whenever the integrals exist.

Lemma 4.4. For a regular vector parametric family, the following properties hold:

(i) I(θ)  0 (this holds also without regularity, whenever the Fisher information matrix exists).

(ii) The score vector satises

E [score[ X | θ ]] = 0 (4.15)

Cov [score[ X | θ ]] = I(θ). (4.16)

(iii) The Fisher information matrix can alternatively be written as


 
∂ scorei [ x | θ ]
{I(θ)}ij = −E (4.17)
∂θj
 2 
∂ log p(x | θ)
= −E . (4.18)
∂θi θj

Proof. The rst claim follows since I(θ) is a covariance matrix, which is always p.s.d. Proving
the other two follows the same line of argument as in the scalar case; we leave the details to the
reader.

Theorem 4.5 (Vector CRLB) . {p(x | θ)}θ∈Θ⊆Rd be a regular parametric family with
Let
Fisher information matrix I(θ). Let θ̂ : X → Rd be any unbiased estimator for θ . Then:
h i
Cov θ̂(X)  I −1 (θ). (4.19)

In addition, an unbiased estimator attaining the bound above with equality for all θ exists,
if and only if the score vector function can be written as

score[ x | θ ] = C(θ) · (g(x) − θ) (4.20)

for some vector functiong(x) and p.s.d matrix C(θ). Moreover, if (4.20) holds then
C(θ)h = I(θ)
i , and θ̂(x) = g(x) is an MVU estimator with the covariance matrix

Cov θ̂(X) = I −1 (θ).

The proof of this theorem is similar to the scalar case; the interested reader can try to ll in
the details.

Remark 4.6. If θ̂ is ecient, i.e., is unbiased and achieves the CRLB with equality for all θ,
then it is MVU. But even there is no ecient estimator, an MVU might still exist.
40 CHAPTER 4. VECTOR MVU ESTIMATION

As in the scalar case, we also have the following connection between ML and eciency. The
proof is left to the reader  note that one must use the fact that I(θ) is p.s.d.

Lemma 4.7. Consider a regular family with Θ = Rd and suppose I(θ)  0 for any θ ∈ Rd . If
there exists an ecient estimator, then ML is the ecient estimator.

An alternative formulation of the vector CRLB is the following:

Corollary 4.8. The CRLB lower bound is equivalent to the statement that
h  i
Var v T θ̂(X) − θ ≥ v T I −1 (θ) v (4.21)

holds for any v ∈ Rd . Namely, we have a lower bound on the variance of the projection of the
error in any direction.

The vector CRLB provides in particular a lower bound on the variance of any unbiased
estimator for any one of the parameters:

Corollary 4.9. For a regular parametric family, it holds that

Var[ θˆi ] ≥ {I −1 (θ)}ii . (4.22)

Proof. One way to see this is to pick v to have vi = 1 and zero otherwise,
h i and then appeal to
Corollary 4.8. Alternatively, by the vector CRLB we have that Cov θ̂(X) − I −1 (θ)  0. The

result follows by recalling that a p.s.d matrix must have non-negative diagonal entries.

Remark 4.10. We note that by the scalar CRLB, it also holds that
Var[ θˆi ] ≥ 1/{I(θ)}ii . (4.23)

Contrasting this with Corollary 4.9, we note that the lower bound (4.23) is weaker than (4.22),
namely that {I −1 (θ)}ii ≥ 1/{I(θ)}ii (you will show this in a homework exercise), and the
inequality can be strict. The intuitive reason is that the scalar CRLB (4.23) implicitly assumes
that all the parameters except θi are known, whereas the lower bound (4.22) assumes that
all the parameters are unknown. It is generally more dicult to estiamte θi when the other
parameters are unknown, as we will see in the following example. In some special cases the
bounds coincide; for example, in the case of constant in AWGN with unknown variance, where
the ecient estimator for the constant does not depend on the noise variance (see the SNR
estimation example below).

Example 4.11 .
(Scalar linear regression) Consider the two dimensional parametric model

Xk = α + βk + Wk , (4.24)

i.i.d
where {Wk }n−1 2
k=0 ∼ N (0, σ ). Let us denote the vector of parameters by θ = [α, β]T . Our
likelihood function is

n−1
( )
1 1 X
p(x | θ) = n exp − 2 (xk − α − βk)2 . (4.25)
(2πσ 2 ) 2 2σ
k=0

Let us nd the second derivatives of the log-likelihood function.

n−1
∂ log p(x | θ) 1 X
= 2 (xk − α − βk) (4.26)
∂α σ
k=0
4.1. VECTOR CRLB 41

xk
0 1 n

Figure 4.1: Line tting problem.

n−1
∂ log p(x | θ) 1 X
= 2 (xk − α − βk)k (4.27)
∂β σ
k=0

n−1
∂ 2 log p(x | θ) 1 X n
2
= − 2
1 = − 2. (4.28)
∂α σ σ
k=0

n−1
∂ 2 log p(x | θ) 1 X n(n − 1)
=− 2 k=− . (4.29)
∂α∂β σ 2σ 2
k=0

n−1
∂ 2 log p(x | θ) 1 X 2 n(n − 1)(2n − 1)
2
= − 2
k =− . (4.30)
∂β σ 6σ 2
k=0

" #
n(n−1)
1 n 2
I(θ) = 2 n(n−1) n(n−1)(2n−1) (4.31)
σ 2 6

" #
2(2n−1) 6
− n(n+1)
I −1 (θ) = σ 2 n(n+1)
6 12 . (4.32)
− n(n+1) n(n2 −1)

 T
Hence for any unbiased estimator θ̂ = α̂ β̂ it holds that

" #
2(2n−1) 6
h i − n(n+1)
Cov θ̂(X)  σ 2 n(n+1)
6 12 . (4.33)
− n(n+1) n(n2 −1)

In particular, for any such estimator we have that

 
2 2(2n − 1) 2 1
Var [α̂] ≥ σ · =σ ·Ω (4.34)
n(n + 1) n
 
h i
2 12 2 1
Var β̂ ≥ σ · 2
=σ ·Ω . (4.35)
n(n − 1) n3
42 CHAPTER 4. VECTOR MVU ESTIMATION

We see that the CRLB for β decays much faster with n than the CRLB for α, hence if there is
an ecient estimator (and we will see later there is) then β can be estimated with much better
accuracy than α. This makes sense, since we expect the slope to be much more sensitive to
perturbation than the intercept; a perturbation  to the intercept incurs an additive change of
 to all the samples, whereas a perturbation  in the slope incurs a linearly increasing additive
change of k to the k th sample.
Note further that when β is known, the problem reduces to a constant in AWGN, in which
case

σ2
Var [α̂] ≥ (4.36)
n
for any unbiased estimator, which is strictly smaller than what we got in (4.34) for any n > 1.
This makes sense, since we expect that α could be better estimated when β is known.

4.2 Transformation of Parameters


Suppose we wish to estimate some vector function g(θ) of the parameters. We have the following
equivalent of the scalar case.

Theorem 4.12. Let {p(x | θ)}θ∈Θ⊆Rd be a regular parametric family with Fisher information
matrix I(θ). Let ĝ(x) be an unbiased estimator, i.e., one such that E [ĝ(X)] = g(θ) for any
θ ∈ Θ, for a continuously dierentiable function g : Θ → Rr . Then
∂g(θ) −1 ∂g(θ)T
Cov [ĝ(X)]  · I (θ) · (4.37)
∂θ ∂θ
where
 
∂g1 (θ) ∂g1 (θ)
···
∂g(θ)  ∂θ. 1 ∂θd
.

=  . . (4.38)
 . .

∂θ ∂gr (θ) ∂gr (θ)

∂θ1 ··· ∂θd r×d

is the Jacobian matrix of g.


Again, using the plug-in estimator g(θ̂) where θ̂ is unbiased, generally results in a biased
estimator unless g(θ̂) = Aθ̂ + b is an ane transformation. For large data records we have
approximate eciency of g(θ̂) (if θ̂ is ecient).

Example 4.13 .
(CRLB for SNR Estimation) Consider the model

Xk = a + Wk (4.39)

i.i.d
where {Wk }nk=1 ∼ N (0, σ 2 ), and where both a and σ 2 are unknown. We are however not
2 a2
interested in a and σ individually, but rather want to estimate SNR , . Denoting the
σ2
T 2 T
parameters vector by θ = [θ1 , θ2 ] = [a, σ ] , we want to estimate the vector function

θ12
SNR = g(θ) = . (4.40)
θ2
Then X ∼ N ([θ0 , . . . , θ1 ], θ2 · In×n ), hence

n
n n 1 X
log p(x | θ) = − log(2π) − log θ2 − (xk − θ1 )2 (4.41)
2 2 2θ2
k=1
4.2. TRANSFORMATION OF PARAMETERS 43

Let us now compute the second derivatives of the log-likelihood:

n
∂ log p(x | θ) 1 X
= (xk − θ1 ) (4.42)
∂θ1 θ2
k=1
n
∂ log p(x | θ) n 1 X
=− + 2 (xk − θ1 )2 (4.43)
∂θ2 2θ2 2θ2
k=1

∂ 2 log p(x | θ) n
2 =− , (4.44)
∂θ1 θ2

hence

∂ 2 log p(X | θ)
 
n
−E 2 = 2 (4.45)
∂θ1 σ

Now

n
∂ 2 log p(x | θ) 1 X
=− 2 (xk − θ1 ), (4.46)
∂θ1 ∂θ2 θ2
k=1

hence,

∂ 2 log p(X | θ)
 
−E = 0. (4.47)
∂θ1 ∂θ2

n
∂ 2 log p(x | θ) n 1 X
= − (xk − θ1 )2 , (4.48)
∂θ22 2θ22 θ23
k=1

∂ 2 log p(X | θ)
 
n nθ2 n n
−E 2 =− 2 + 3 = 2 = 4 (4.49)
∂θ2 2θ2 θ2 2θ2 2σ

n 
σ2
0
I(θ) = n . (4.50)
0 2σ 4

It is worth noting at this point that I(θ) is diagonal, which means that the scalar CRLB for each
2
of a and σ separately coincides with their vector CRLB. Namely, in this specic problem, the
2 2
knowledge of σ (resp. a) does not change the CRLB for a (resp. σ ). Indeed, we have already
seen that the CRLB is met with equality for constant in known AWGN by an estimator (the
sample mean) that does not depend on the variance.
Getting back to the SNR estimation, let us nd the Jacobian matrix associated with g(θ):

∂g(θ) h ∂g(θ) ∂g(θ) i


= ∂θ1 ∂θ2
(4.51)
∂θ
θ12
h i
= 2θ
θ2
1
− θ22 (4.52)
h i
2
= σ2a2 − σa4 . (4.53)
44 CHAPTER 4. VECTOR MVU ESTIMATION

Hence,

−1 
∂g(θ) −1 ∂g(θ) T h 2a 2
i n
σ2
0 2a 
σ22
I (θ) = σ2 − σa4 n (4.54)
∂θ ∂θ 0 2σ 4 − σa4
4a2 σ 2 a4 2σ 4
= · + · (4.55)
σ4 n σ8 n
4a2 2a4
= + (4.56)
nσ 2 nσ 4
4SNR + 2SNR2
= . (4.57)
n
Thus,

2
d ≥ 4SNR + 2SNR .
h i
Var SNR (4.58)
n
for any unbiased SNR estimator.

4.3 CRLB for general Gaussian Signals


The general Gaussian model is where our samples have a general mean µ(θ) and a general
covariance matrix C(θ) that depend on the parameter vector in some arbitrary given way, i.e.,

X ∼ N (µ(θ), C(θ)) . (4.59)

It can be shown by direct calculation that the (i, j)-th entry of the Fisher information matrix is
given by

 T    
∂µ(θ) −1 ∂µ(θ) 1 −1 ∂C(θ) −1 ∂C(θ)
{I(θ)}ij = C (θ) + Tr C (θ) C (θ) , (4.60)
∂θi ∂θj 2 ∂θi ∂θj

where Tr (·) is the matrix trace operator, and

∂µ(θ) h ∂µ1 (θ) ∂µn (θ)


iT
, ∂θi ··· ∂θi
(4.61)
∂θi

   
∂C(θ) ∂{C(θ)}kl
, (4.62)
∂θi kl ∂θi
5. Linear Models
So far, we have restricted the family of estimators under consideration to be unbiased. We now
further restrict the model we consider (namely, our parametric family) to a special family called
linear models, where linearity here means that the expectation of the signal is a linear function of
the parameter vector. The additional structure of this family will allow us to derive and analyze
very explicit estimators and gain useful insights. As we shall see, this is a very diverse and useful
family of models.
More explicitly, our parametric family {p(x | θ)}θ∈Θ⊆Rd if

X = Hθ + W , (5.1)

where H ∈ Rn×d is a known n×d real-valued matrix, and W ∈ Rn is an n-dimensional random


vector whose distribution is known and does not depend on the parameters, where we assume
throughout without loss of generality that E [W ] = 0. Stated dierently, in a linear model only
the mean of the observation vector depends on the parameters, and it is linear combination of
the parameters E [X] = Hθ .
The matrix H is sometimes referred to in the literature as the design matrix or the observation
matrix. The columns of H are sometimes called predictors, since their linear combination is used
to predict the expected value of X :

d
X
E [X] ≈ θ̂i hi , (5.2)
i=1
 
where θ̂i θi , and hi is the ith column of H , i.e., H = h1 h2 · · · hd . We
is some estimator of
note that the observation vector X is sometimes referred to in the literature as the response. We
assume throughout that the predictors (columns of H ) are linearly independent; this is clearly
without loss of generality, since if one of the predictors is a linear combination of the others, it
can be dropped. In other words, we assume that n ≥ d, i.e., that the number of parameters does
not exceed the number of observations, and that H is full rank, i.e., rank(H) = d. The following
simple lemmas will be found useful in the sequel.

Lemma 5.1. Let H ∈ Rn×d be full rank with n ≥ d. Then H T H ∈ Rd×d is invertible.

Proof. Note that for any nonzero y ∈ Rd ,

y T H T Hy = kHyk2 > 0, (5.3)

where the inequality follows since the columns of H are linearly independent. Thus the symmetric
T
matrix H H is positive-denite (all its eigenvalues are real and strictly positive), hence it is
invertible.

Lemma 5.2. Let v ∈ Rd and A ∈ Rd×d . Then

45
46 CHAPTER 5. LINEAR MODELS

d T
(i) dθ θ v = v.
d T d T
(ii) dθ θ Aθ = (A + AT )θ . Specically, if A is symmetric then dθ θ Aθ = 2Aθ

Try to prove this at home.

5.1 Linear Models in AWGN


In this section we consider the important special case where the additive noise is white Gaussian,
i.e., W ∼ N (0, σ 2 In×n ) for some known σ2, where In×n is the n × n identity matrix. In other
words, we are assuming that our observations are X ∼ N (Hθ, σ 2 In×n ). Let us develop the
CRLB for this case. The likelihood function is
 
1 1 2
p(x | θ) = n exp − 2 kx − Hθk . (5.4)
(2πσ 2 ) 2 2σ

and the score vector function is

∂ log p(x | θ)
score [x | θ] = (5.5)
 ∂θ 
∂ 2 n 1 T
= − log(2πσ ) 2 − 2 (x − Hθ) (x − Hθ) (5.6)
∂θ 2σ
1 ∂
xT x − 2θ T H T x + θ T H T Hθ
 
=− 2 (5.7)
2σ ∂θ
1  T
= 2 H x − H T Hθ .

(5.8)
σ
Recalling that by our assumptions are in light of Lemma 5.2, the matrix HT H is invertible, we
can write:
 
1 T  T −1 T
score [x | θ] = H H (H H) H x −θ  . (5.9)

2
|σ {z } | {z }
I(θ) θ̂ML (x)

1
We therefore conclude that the Fisher information matrix is given by I(θ) = σ2
H T H , and that
the ML estimator

θ̂ML (x) = (H T H)−1 H T x. (5.10)

is ecient (achieves the CRLB with equality) hence also MVU.

Remark 5.3. The ML estimator does not depend on σ2, hence we can further assume the
variance is unknown and still be MVU w.r.t θ.

Remark 5.4. The matrix (H T H)−1 H T is known as the Moore-Penrose pseudo inverse of H.
It can be though of as a stable way of inverting a non-square full rank matrix; note that
(H T H)−1 H T · H = I and that the pseudo-inverse equals H −1 whenever H is invertible.

Since the ML estimator is ecient, its covariance matrix is


h i
Cov θ̂ML = (I(θ))−1 (5.11)

= σ 2 (H T H)−1 . (5.12)
5.1. LINEAR MODELS IN AWGN 47

Moreover, since θ̂ML is a linear combination of Gaussian [Link], then

θ̂ML ∼ N θ, σ 2 (H T H)−1 .

(5.13)

We summarize the above it a Theorem.

Theorem 5.5. For the linear model with AWGN and observation matrix H, the ML
estimator is ecient and hence MVU. It is given by

θ̂ML (x) = (H T H)−1 H T x. (5.14)

Moreover,

θ̂ML (x) ∼ N θ, σ 2 (H T H)−1 .



(5.15)

Next, we go through some examples.

Example 5.6 (Stock Market). Let X be the daily (or weekly, or monthly) return of a given stock
(in percentage), and let Y be the daily market (or monthly, or yearly) return of the market (as
measured by some major index, e.g., S&P500). We would like to understand the relation between
the market returns and the individual stock returns, i.e., how much of the stock movement is a
result of market uctuations, and how much is idiosyncratic to the stock. Namely, we would like
to nd the conditional distribution pX|Y (x | y), should such a thing be dened. More specically,
suppose we believe in the following simple model:

X = α + βY + W (5.16)

where W ∼ N (0, σ 2 ) that is statistically independent of Y. In practice, the coecient α is


eectively zero (why?). The coecient β measures how much market movements impact the
stock, and the noise W represents the part of the stock return that cannot be explained by the
market, hence is specic to the individual stock. Intuitively, |β| is a measure of risk; the larger it
is the more sensitive is our investment in the stock to market movements. The risk represented
by β is in some sense inherent and cannot be diversied by multiple dierent investments. The
risk coming from the unexplained idiosyncratic part W can in principle be diversied by mul-
tiple independent investments. This simplied model is called the Capital Asset Pricing Model
(CAPM). In CAPM the risk free interest rate also comes into play; here we have assumed it is
zero.
Suppose we are given a sample {Xk , Yk }nk=1 of independent samples from p(x | y), taken over
dierent days (or weeks, or months). We can write the relation between the vectors X and Y
as follows:
     
X1 1 Y1 W1
 X2  1 Y2   W2 
   
 α
 ..  =  .. + . . (5.17)
  
. 
 .  . .  β  .. 
.
|{z}
Xn 1 Yn θ Wn
| {z } | {z } | {z }
X H W

where W ∼ N (0, σ 2 In×n ) is independent of Y. Note that although Y is random, we are only
p(x | y), and hence we perform our analysis
interested in estimating the conditional distribution
in a probability space induced given the market returns Y . Hence, the matrix H above can be
though of as being xed (given the market returns). Note also that since the noise vector W is
assumed independent of Y , it remains i.i.d Gaussian given Y .
48 CHAPTER 5. LINEAR MODELS

We can now nd the ML estimator for the coecients α β , which in this case is ecient
and
and MVU. Without loss of generality, let us assume now that Mean[ Y ] = 0 (if it is not, we can
remove the mean by moving it into the constant α). Then

hT1 
 
T

H H = T h1 h2 (5.18)
h2
 T
h1 h1 hT1 h2

= T (5.19)
h2 h1 hT2 h2
n
 
P
 n Y k
= k=1 (5.20)
n

P 2
Yk kY k
k=1
 
n 0
= . (5.21)
0 kY k2

Hence the pseudo-inverse of H is

 −1  
T −1 T n 0 1 ··· 1
(H H) H = (5.22)
0 kY k2 Y1 · · · Yn
" #
1/n · · · 1/n
= Y1 · · · kYYnk2 . (5.23)
kY k2

The ML estimator is now given by

θ̂ML = (H T H)−1 H T X (5.24)


 
" # X1
1/n · · · 1/n  . 
= Y1 . (5.25)
kY k2
· · · kYYnk2  . 
Xn
" #
Mean[ X ]
= XT Y . (5.26)
kY k2

Thus, α̂ML = Mean[ X ] as one would expect, and

XT Y
β̂ML = (5.27)
kY k2
XT Y kXk
= · (5.28)
kXk kY k kY k
kXk
= ρ(X, Y ) · . (5.29)
kY k

where ρ(X, Y ) is the empirical correlation between the vectors X and Y (which converges to
σ2
the true correlation as n → ∞). The variance of our estimators is given by Var[ α̂ML ] = n and
σ2
Var[ β̂ML ] = kY k2
.

Example 5.7 .
(Polynomial curve tting) We are given data points from some experiment, e.g.
in the form of a sequence {X(tk )}nk=1 , where t1 < t2 < · · · < tn are deterministic sample times
(or any other set of variables). We would like to t a simple curve to the data points. We may
want to do that in order to:
5.1. LINEAR MODELS IN AWGN 49

(i) To nd a simple model that explains our data.

(ii) To compress the data, i.e., represent it approximately using a simple function of much less
then n parameters.

(iii) To interpolate or extrapolate, i.e., to guess the value of the process at unobserved time
points X(t).

One idea is to t a low degree polynomial to the data, i.e.,

X(tk ) = θ0 + θ1 tk + θ2 t2k + ... + θd−1 td−1


k + error (5.30)

d−1
X
= θi tik + error, (5.31)
i=0

where dn is set to avoid over-tting. We now write this as a linear model:

 T
X = X(t1 ) X(t2 ) · · · X(tn ) (5.32)
 T
θ = θ0 θ1 · · · θd−1 . (5.33)

1 t1 t21 · · · t1d−1
 
1 t2 t2 · · · td−1 
2 2 
H = . . . , (5.34)

.
 .. .. .
.
. 
.
2
1 tn tn · · · tnd−1

where the rst column is a constant predictor, the second column is a linear predictor, the third
column is a quadratic predictor, and so on. Note that our stock market example is essentially
polynomial tting with only the constant and the linear predictors, i.e., with d = 2. If d < n and
since all the sampling time are distinct, then H is a Vandermonde matrix and hence full rank.
Hence we get the linear model

X = Hθ + W , (5.35)

where W is tting noise. Assuming the tting noise is white Gaussian W ∼ N (0, σ 2 In×n ), we
have the ecient MVU estimator

−1
θ̂ML (x) = H T H H T x, (5.36)

which is Gaussian with mean θ and covariance matrix σ 2 · (H T H)−1 , where H is given in (5.34).

Example 5.8 (Fourier analysis). Suppose that our observation vector X is a linear combination
of sinusoids with unknown amplitudes, corrupted by AWGN:

M   M  
X 2πmk X 2πmk
Xk = am cos + bm sin + Wk (5.37)
n n
m=1 m=1

i.i.d
where {Wk }n−1 2
k=0 ∼ N (0, σ ). We didn't use the constant (DC) here, but it can be added easily.
This can be represented as a linear model with parameter vector

 T
θ = a1 a2 · · · aM b1 b2 · · · bM (5.38)
50 CHAPTER 5. LINEAR MODELS

x(tk )

t
t1 t2 t3 ··· tn

Figure 5.1: Curve tting problem

and observation matrix


 
1 ··· 1 0  ··· 0
2π 2πM 2π 2πM
  
 cos
 n ···
cos n sin n sin n ··· 

H= . . . . . (5.39)
. . . .

. . . . 
        
2π(n−1) 2πM (n−1) 2π(n−1) 2πM (n−1)
cos n · · · cos n sin n · · · sin n

n
The matrix H is of dimensions n × 2M , hence we must assume that M < 2 (which makes
sense). Writing hi for the ith column of H , we note that due to the known orthogonality of
these sinusoids we have hTi hj = n2 · 1 (i = j). Hence

hT1 h1 hT1 h2 hT1 h2M


 
···
H T H =  ... .
.
.
. (5.40)
 
. . 
T T T
h2M h1 h2M h2 · · · h2M h2M
n 
2 0 ··· 0
0 n · · · 0
2
= . . . (5.41)
 
.
 .. .. .. .
.
0 0 · · · n2
n
= I2M ×2M . (5.42)
2
We conclude that the ML estimator, which is ecient and MVU, is given by

−1 T
θ̂ML (x) = H T H H x (5.43)

2 T
= H x (5.44)
n 
hT1 x

2 
= ·  ...  . (5.45)

n
hT2M x

Namely,

n−1  
2X 2πmk
âm,ML = xk · cos (5.46)
n n
k=0
5.2. LEAST SQUARE INTERPRETATION 51

n−1  
2X 2πmk
b̂m,ML = xk · sin , (5.47)
n n
k=0

which are exactly the DFT coecient of the observation sequence x. The ML estimator is normal
with mean θ and a diagonal covariance matrix
h i −1
Cov θ̂ML (X) = σ 2 H T H (5.48)

2σ 2
= · I2M ×2M . (5.49)
n
The amplitude estimators are thus Gaussian and statistically independent. Note that the ML /
DFT approach assumes nothing about the frequency respose of the observation sequence, except
that it is a combination of a give set of sinusoids. In case the frequency response is known to
have some further properties (smoothness, sparseness, etc.) then using ML / DFT is typically
not the best choice. We will discuss how to handle such situations when we get to the topic of
regularization.

5.2 Least Square Interpretation


Suppose that we do not want to assume any statistics on our observation vector, but we do
believe that it can be approximated rather well using a linear combination of a given set of
predictors. Namely, we believe that

x ≈ Hθ. (5.50)

for some choice of θ. What value ofθ should be guess? One approach, going back to Gauss,
is by look at the residual error x − Hθ , and nd the value of θ that minimizes its quadratic
norm. The general approach of minimizing the quadratic norm of the residual is commonly called
(ordinary) least squares (LS), and when the approximation is a linear function of the parameters,
the problem itself is sometimes referred to as linear regression. The LS estimator for the linear
regression problem is thus dened to be the solution to the following optimization problem:

θ̂LS (x) = argmin kx − Hθk2 . (5.51)


θ

We can immediately note that the solution to this problem is exactly the ML estimator for the
linear model in AWGN, simply since the likelihood function is monotonic in the quadratic norm
of the residual. Hence, we immediately have:

Proposition 5.9. The LS estimator for the linear regression probelm with a full rank observation
matrix H coincides with the ML estimator for the corresponding linear model with AWGN, and
is given by
−1
θ̂LS (x) = H T H H T x. (5.52)

5.2.1 Geometrical Interpretation


LS estimation admits a simple geometrical interpretation. Before we discuss it, we recall a few
more facts from linear algebra. The image of a matrix A, denoted Im {A}, is the linear subspace
containing all the vectors of the form Av , and the kernel of A, denoted ker {A}, is the linear
subspace of all the vectors v satisfying Av = 0. A real valued square matrix P is called an
orthogonal projection, if it is symmetric, i.e., P
T = P , and idempotent, i.e., P 2 = P . For an

orthogonal projection P , we say that P projects onto the subspace Im {P }. This is intuitively
clear, since applying P once gets us inside Im {P }, and applying it again does nothing.
52 CHAPTER 5. LINEAR MODELS

Lemma 5.10. Let P be an orthogonal projection, and let d = dim (Im {P }). Then

(i) The subspaces Im {P } and ker {P } are orthogonal, and Im {P } ⊕ ker {P } = Rn .


d
vk vkT {vk }dk=1
P
(ii) P = where is any orthonormal basis for Im {P }.
k=1

(iii) P has d eigenvalues all equal to 1, and n−d eigenvalues all equal to 0.

Proof. All these properties stem from noting the following. Any vector can be written as

x = P x + (I − P )x. (5.53)

where P x ∈ Im {P }, and (I − P )x ∈ ker {P } since P · (I − P )x = (P − P 2 )x = 0. Moreover,

hP x, (I − P )xi = xT P T (I − P )x (5.54)
T
= x P (I − P )x (5.55)

= xT (P − P 2 )x (5.56)
T
= x (P − P )x (5.57)

=0 (5.58)

The remaining details of the proof are left to the reader.

−1
Let us go back to the LS estimator θ̂LS = H T H H T x. The approximation of x that it
provides is simply

x̂ = H · θ̂LS (5.59)
−1
= H HT H HT x (5.60)
| {z }
P

The matrix P is commonly called the hat matrix, since it maps x → x̂. Let us show that P is
an orthogonal projection:

Proof.

h −1 T iT
P T = H HT H H (5.61)
h −1 T T
i
= H HT H H (5.62)
−1 T
= H HT H H (5.63)

= P. (5.64)

−1 −1
P 2 = H HT H HT · H HT H HT (5.65)
−1
= H HT H HT (5.66)

= P. (5.67)
5.2. LEAST SQUARE INTERPRETATION 53

Which subspace is the hat matrix P projecting onto? For any x,


h −1 T i
Px = H HT H H x (5.68)
| {z }
A
= H · (Ax). (5.69)

Namely, P x is a linear combination of the columns of H . Hence, P is projecting onto the column
space of H , and the LS estimator nds the coecients θ such that Hθ is exactly equal to the
projection of x onto this subspace, which is the subspace spanned by the predictors. This is
depicted in Figure 5.2.
Following this interpretation, a common measure for the goodness of t of the LS solution to
the linear regression problem is the so-called R-square, dened by

kx − H θ̂LS k2
R2 , 1 − (5.70)
kxk2

R2 is simply the fraction of signal energy that is explained by the predictors.

Example 5.11 .
(Stock market) Recall our stock market Example 5.6. The ML estimators for
α, β , which are also their LS estimators, were given by α̂LS = Mean[ x ] and

kxk
β̂LS = ρ(x, y) · , (5.71)
kyk
xT y
where ρ = ρ(x, y) = kxkkyk . Assume for simplicity that Mean[ x ] = 0, and let us nd the R2
associated with this problem. The energy of the residual is given by

2
2 kxk
x − β̂LS y = x−ρ y (5.72)
kyk
kxk kxk2 kyk2
= kxk2 − 2ρ · xT y + ρ2 (5.73)
kyk kyk2
= kxk2 − 2ρ2 kxk2 + ρ2 kxk2 (5.74)

= kxk2 (1 − ρ2 ), (5.75)

and hence

R2 = ρ2 (x, y). (5.76)

5.2.2 Least squares for nonlinear regression


The LS approach can be easily generalized beyond linear regression. Suppose we believe that
x ≈ g(θ) for some function g. Then we can dene

θ̂LS = argmin kx − g(θ)k2 . (5.77)


θ̂

In the linear regression case, g(θ) = Hθ and we found a closed-form solution. In the general case,
there is usually no closed-form solution, and we need to optimize numerically or solve iteratively.
The LS estimator for nonlinear regression again assumes no statistical model, but as before it
can be interpreted as an ML estimator for the Gaussian model

X = g(θ) + W (5.78)
54 CHAPTER 5. LINEAR MODELS

residual
error

H θ̂ L
S

Im {H}

Figure 5.2: The LS estimator as a projection onto the space spanned by the predictors

where

W ∼ N (0, σ 2 In×n ). (5.79)

Unlike the linear case, the ML estimator here is not necessarily ecient nor MVU; in fact, it is
typically biased.

5.3 Linear Model for Correlated Additive Gaussian Noise


Consider the linear model

X = Hθ + W (5.80)

where now the noise is W ∼ N (0, C) where C is a known covariance matrix, assumed to be
strictly positive denite. Instead of attacking this problem directly, we can whiten the noise
using a linear transformation, and then use the results from the AWGN case. Since C is positive
denite, it is invertible and its inverse is also positive denite. Moreover, being positive denite,
the inverse can be written as C −1 = QT Q for another positive denite matrix Q. To see this,
we can use the eigenvalue decomposition C
−1 = V T DV where V is unitary and D is diagonal
−1
√ √ √
with positive diagonal entries, then write C = (V T D) · ( DV ), and set Q = DV .
Let

X
f , QX (5.81)

= QHθ + QW (5.82)

= Hθ
e +Wf (5.83)

where H
e and W
f have been implicitly dened. Now, W
f is clearly a zero mean Gaussian vector,
with a covariance matrix
h i
f T = QE W W T QT
 
E WfW (5.84)
−1 T
= Q QT Q Q (5.85)
5.4. BEST LINEAR UNBIASED ESTIMATOR (BLUE) 55

−1
= QQ−1 QT QT (5.86)

= I. (5.87)

We can now solve this equivalent linear model in AWGN using the tools we have developed.
Specically,
 −1
θ̂ML = HeTH e HeTx e (5.88)
−1 T T
= H T QT QH H Q Qx (5.89)
−1
= H T C −1 H H T C −1 x.

(5.90)

This estimator is ecient and MVU for θ. To summarize:

Theorem 5.12. X = Hθ + W where X ∈ Rn , H ∈ Rn×d with n ≥ d is full rank and


Suppose
known, and W ∼ N (µ, C) where µ ∈ Rn and the positive denite C ∈ Rn×n are known. Then
the ML estimator
−1
θ̂ML = H T C −1 H H T C −1 (x − µ) (5.91)

is ecient and hence MVU, and has a covariance matrix


h i
Cov θ̂ML = (H T C −1 H)−1 . (5.92)

Remark 5.13. Note that the ML estimator above is insensitive to scaling of the noise covariance
matrix C → αC . In addition, it can be interpreted as a weighted LS solution (see homework
exercise).

5.4 Best Linear Unbiased Estimator (BLUE)


In some cases the distribution of the noise in the linear model may not be Gaussian, in which
case nding an ecient / MVU estimator may be challenging. Sometimes, the distribution of
the noise is now even exactly known. What can we do? One approach is to limit the discussion
to linear unbiased estimators, i.e., an unbiased estimator of the form

θ̂(x) = Ax, (5.93)

for some matrix A ∈ Rd×n . A linear unbiased estimator that dominates (in the MSE sense)
any other unbiased linear estimator is called the Best Linear Unbiased Estimator (BLUE), and
denoted here by θ̂BLUE .

Theorem 5.14 (Gauss-Markov) . Consider the linear model

X = Hθ + W , (5.94)

where E [W ] = 0 and Cov [W ] = C for some given covariance matrix C (which does not
depend on the parameters), and where W is otherwise arbitrarily distributed (can depend
on the parameters). Then the BLUE for θ is given by
−1
θ̂BLUE (x) = H T C −1 H H T C −1 x (5.95)

and
h i
Cov θ̂BLUE = (H T C −1 H)−1 . (5.96)
56 CHAPTER 5. LINEAR MODELS

Proof. This can be proved directly (see homework exercise). Let us prove this in a smarter
roundabout way. Let θ̂(x) = Ax, and write:

h i
E θ̂(X) = E [AX] (5.97)

= E [A(Hθ + AW )] (5.98)

= AHθ (5.99)

=θ (5.100)

where the last equality follows from the unbiasedness requirement. Therefore, it must hold that

AH = I. (5.101)

The covariance matrix of the estimator is given by

h i h i
Cov θ̂ = E (θ̂ − θ)(θ̂ − θ)T (5.102)
h i h i h i
= E θ̂ θ̂ T − E θ̂θ T − E θ θ̂ T + θθ T (5.103)
h i
= E θ̂ θ̂ T − θθ T , (5.104)

where we have used the unbiasedness again. Now,

h i
E θ̂ θ̂ T = E AXX T AT
 
(5.105)

= AE (Hθ + W )(Hθ + W )T AT
 
(5.106)

= AH · θθ T · (AH)T + ACAT (5.107)


T T
= θθ + ACA , (5.108)

where we have used the fact that AH = I must hold. Hence, the covariance of the estimator

h i
Cov θ̂ = ACAT (5.109)

depends only on the choice of A and on the noise covariance C, but not on any other statistics
of the noise W. Thus, assuming that W ∼ N (0, C) is Gaussian will not aect the covariance
of the estimator. But for the Gaussian case, we already know that the ML estimator θ̂ML (x) =
−1 T −1
H T C −1 H H C x is linear and dominates any other unbiased estimator, in particular linear
unbiased estimators. Hence it must coincide with the BLUE.

Remark 5.15. Examining the proof, we note that we can use the BLUE even when the covari-
ance matrix depends on the parameters, as long as the dependence is only in the scaling of the
covariance. Namely, if

Cov [W ] = g(θ)C (5.110)

for some strictly positive function g(θ) 6= 0 and known C , then the same BLUE still works. This
is since now, Cov[ θ̂ ] = g(θ)ACAT therefore

argmin g(θ)ACAT = argmin ACAT . (5.111)


A:AH=I A:AH=I

which is independent of θ.
5.4. BEST LINEAR UNBIASED ESTIMATOR (BLUE) 57

Remark 5.16. The BLUE is not MVU in general (unless the noise is Gaussian), and may not
even exists (e.g., when there is no unbiased estimator, see homework exercise).

Example 5.17. Let {Xk }nk=1 i.i.d


∼ Unif(0, θ) and suppose we want to estimate the unknown θ.
We can write this as a linear model:
1
2
X =  ...  θ + W (5.112)
 
1
2

i.i.d
{Wk }nk=1 ∼ − 2θ , 2θ

where Unif . Note that E [Wk ] = 0 and that

θ2
Cov [W ] = In×n , (5.113)
12
and so in light of Remark 5.15 we can use the BLUE. We have that

−1 4
HT H = (5.114)
n
and hence

−1 T
θ̂BLUE (x) = H T H H x (5.115)
n
2X
= xk (5.116)
n
k=1
= 2 Mean[ x ], (5.117)

We will later nd the MVU for this problem via other methods, and see that this estimator is
not MVU. To see intuitively whet this estimator is not so good, consider the case where say
X1 = X2 = ... = X99 = 0 and X100 = 50. In this case θ̂BLUE = 2Mean[ X ] = 1. However, this is
a gross underestimate since obviously θ ≥ 50.
58 CHAPTER 5. LINEAR MODELS
6. Sucient Statistics

As we have seen, the ML estimator attains the CRLB essentially whenever this is possible, and
is hence MVU in these cases. Sometimes however, an ecient estimator may not exist at all,
either since the CRLB is loose or because the CRLB does not apply.

Example 6.1 .
(Uniform) Recall the model where
i.i.d
{Xk }nk=1 ∼ Unif(0, θ). We have seen that
the BLUE in this case is

θ̂BLUE (x) = 2 · Mean[ x ]. (6.1)

We argued intuitively that this is probably not the best estimator. For future reference, let us
compute its variance:

h i
MSE(θ̂BLUE ) = Var θ̂BLUE (6.2)

n
4 X
= Var [Xk ] (6.3)
n2
k=1
4
= Var [X1 ] (6.4)
n
4 
= E[X12 ] − (E[X1 ])2 (6.5)
n
 θ 
Z  2
4 1 θ 
= x2 dx − (6.6)
n θ 2
0
4 θ2 θ2
 
= − (6.7)
n 3 4
θ 2
= , (6.8)
3n

where (6.3) and (6.4) hold since the samples are i.i.d.
Let us nd the ML estimator for θ.
n
Y
p(x | θ) = p(xk | θ) (6.9)
k=1
n
 n Y
1
= 1 (xk ∈ [0, θ]) . (6.10)
θ
k=1

θ̂ML = argmax p(x | θ) (6.11)


θ

59
60 CHAPTER 6. SUFFICIENT STATISTICS

 n Y n
1
= argmax 1 (xk ∈ [0, θ]) (6.12)
θ θ
k=1
| {z }
∈{0,1}

= argmin s.t. xk ∈ [0.θ], ∀k (6.13)


θ
= max xk . (6.14)
k

This sounds reasonable, since it is clear that θ is at least as large as the maximal sample. But
for the exact same reason, it seems like an underestimate. It should thus be intuitively clear that
θ̂ML is biased (toward zero). We will compute the ML bias when we return to this example later
on, and see this is indeed the case. Since an existence of an ecient estimator implies ML is
ecient, there cannot be an ecient estimator in this case (note that we have proved this only
when the parameter space is the entire real line, but the argument works here as well, e.g. by
considering negative θ values as well). In fact, the CRLB regularity conditions do not hold in
this case, since the support of the distribution depends on the parameter, so the CRLB is not
even valid.

In such cases, we still want some procedure to nd an MVU estimator (if one exists) or to
at least nd a good unbiased estimator. In this course we explore two ways to approach this
problem: One via the notion of sucient statistics (now), and one via the asymptotics of ML
(later).

6.1 Sucient Statistics


A statistic of a sample is simply any function of the sample, e.g., the sample mean, max, variance,
median, etc. The statistic can be a scalar or a vector function. It can be the entire sample, but
typically is a much smaller entity that tries to summarize the data in an informative way.
Suppose X ∼ p(x | θ) is either a scalar or a vector and θ is some unknown parameter. A statistic
T (x) is called sucient statistic if given its value, the sample X is no longer dependent on the
parameter, i.e., if

p(x | θ, T (x)) = p(x | T (x)). (6.15)

In other words, if we impose a distribution p(θ) on θ, then

θ − T (X) − X (6.16)

forms a Markov chain. This intuitively means that T (X) summarizes everything that is useful
for estimating the parameter.

Example 6.2 (Coin toss) Let . {Xk }nk=1 ∼


i.i.d
Bern(θ). Then

n
Y
p(x | θ) = θxk (1 − θ)1−xk (6.17)
k=1
kxkH
=θ (1 − θ)n−kxkH , (6.18)

n
P
where kxkH , xk is the Hamming weight of the sample. A natural candidate for a sucient
k=1
statistic is

T (x) = kxkH . (6.19)


6.2. FISHER-NEYMAN FACTORIZATION 61

Let us check if it is indeed sucient. Writing T = T (X) and noting that T (X) ∼ Binom(n, θ),
we have

p(x | θ) · Pr(T = t | X = x, θ)
Pr(X = x | θ, T = t) = (6.20)
Pr(T = t | θ)
θT (x) (1 − θ)n−T (x) 1 (T (x) = t)
= n t

n−t
(6.21)
t θ (1 − θ)
θt (1 − θ)n−t 1 (T (x) = t)
= n t

n−t
(6.22)
t θ (1 − θ)
1 (kxkH = t)
= n
 , (6.23)
t

which is simply the uniform distribution over all length-n binary vectors of Hamming weight t
(this should have been obvious from the outset!). It does not depend on θ anymore. Hence, T (X)
is a sucient statistic. Equivalently, the sample mean Mean[ X ] is also a sucient statistic.

6.2 Fisher-Neyman Factorization


We have seen an example where an interesting (namely, not the entire sample) sucient statistic
can be found directly. In this section we derive a general necessary and sucient conditions for
T (X) to be a sucient statistic.

Theorem 6.3 (Fisher-Neyman Factorization). Let X be drawn from the parametric family
p(x | θ). Then T (X) is a sucient statistic for θ if and only if

p(x | θ) = h(x)g(T (x), θ) (6.24)

for some functions h(·) and g(·, ·).

Proof. For simplicity we prove only for discrete alphabet samples, i.e., where X ∼ p(x | θ) has
some nite cardinality. The parameter θ can be anything.
Suppose p(x | θ) = h(x) · g(T (x), θ). Let us show that p(x | θ, T (x)) = p(x | T (x)).

p(x | θ) · Pr(T (x) = t | X = x, θ)


p(x | θ, T (x) = t) = (6.25)
Pr(T (x) = t | θ)
h(x) · g(T (x), θ) · 1 (T (x) = t)
= (6.26)
Pr(T (x) = t | θ)
h(x) · g(t, θ) · 1 (T (x) = t)
= . (6.27)
Pr(T (x) = t | θ)

All we need to show is that g(t, θ) can be canceled out. Let us expand the denominator:

X
Pr(T (x) = t | θ) = p(x | θ) (6.28)
x:T (x)=t
X
= h(x)g(T (x), θ) (6.29)
x:T (x)=t
X
= h(x)g(t, θ) (6.30)
x:T (x)=t
62 CHAPTER 6. SUFFICIENT STATISTICS

X
= g(t, θ) · h(x) (6.31)
x:T (x)=t

Substituting back into (6.27), we obtain

h(x)g(t, θ) · 1 (T (x) = t)
p(x | θ, T (x) = t) = P (6.32)
g(t, θ) · h(x)
x:T (x)=t
h(x)
= P · 1 (T (x) = t) (6.33)
h(x)
x:T (x)=t

which is not a function of θ, hence the above must be p(x | T (x) = t). Thus, T (X) is a sucient
statistic for θ.
Conversely, assume that T (X) is a sucient statistic for θ, i.e., that p(x | T (x), θ) = p(x | T (x))
holds. Then

p(x | θ) · 1 (T (x) = t)
p(x | T (x) = t, θ) = (6.34)
Pr(T (x) = t | θ)
= q(x, t). (6.35)

for some function q(x, t). Therefore,

p(x | θ) · 1 (T (x) = t | θ) = q(x, t) · Pr(T (x) = t | θ). (6.36)

Dene g(t, θ) , Pr(T (X) = t | θ) and choose t = T (x). The factorization now follows from

p(x | θ) = q(x, T (x)) · g(T (x), θ) (6.37)

= h(x) · g(T (x), θ), (6.38)

where we have further dened h(x) , q(x, T (x)).

Example 6.4 .
(Coin toss) Recall the biased coin toss experiment where
i.i.d
X ∼ Bern(θ). Write

p(x | θ) = θnMean[ x ] (1 − θ)n(1−Mean[ x ]) (6.39)

= g(Mean[ x ], θ). (6.40)

Therefore the sample mean T (x) = Mean[ x ] (or the Hamming weight) is a sucient statistic.

Example 6.5 .
(Constant in AWGN) Recall the model

Xk = θ + Wk , k = 1, . . . , n (6.41)

i.i.d
where W ∼ N (0, σ 2 ) and σ2 is known. Write

n
( )
1 1 X
p(x | θ) = n exp − 2 (xk − θ)2 (6.42)
(2πσ 2 ) 2 2σ
k=1
n n
( !)
1 1 X
2
X
2
= n exp − 2 xk − 2θ xk + nθ (6.43)
(2πσ 2 ) 2 2σ
k=1 k=1
n
( )  
1 1 X
2 nθ
= n exp − 2· xk exp − 2 (θ − 2Mean[ x ]) . (6.44)
(2πσ 2 ) 2 2σ 2σ
k=1
| {z }| {z }
g(Mean[ x ],θ)
h(x)

Hence, the sample mean T (x) = Mean[ x ] is a sucient statistic in this case as well.
6.3. RAO-BLACKWELL PROCEDURE 63

In these two examples, the sucient statistic also happened to be the MVU. We will see
tha more generally, identifying a sucient statistic can be used to derive or to get close to an
MVU.

6.3 Rao-Blackwell Procedure


We now show how a sucient statistic can be used to improve an estimator. We state this result
for more general loss functions, not only the quadratic loss. Recall that a loss function `(θ̂, θ)
assigns a nonnegative number to the event where we estimate the parameter to be θ̂ when the
true parameter is θ. Recall further that we are interested in the expected loss
h i
Lθ (θ̂) = Ep(x | θ) l(θ̂(X), θ) . (6.45)

We say that the loss function is convex, if `(θ̂, θ) is a convex function of θ̂ θ ∈ Θ. The
for any
MSE is obtained as the expected loss associated with the convex loss function `(θ̂, θ) = (θ̂ − θ)2 .
Other common examples of convex loss functions include `(θ̂, θ) = |θ̂ − θ|p for any p ≥ 1.

Theorem 6.6 (Rao-Blackwell). Let θ̂(x) be any estimator for θ (not necessarily unbiased),
and suppose that T (X) is a sucient statistic for θ. Dene
h i
θ̂RB (T (X)) , E θ̂(X) | T (X), θ (6.46)

Then θ̂RB is a feasible estimator (does not depend on the parameter) with the same bias
function, i.e., satisfying E[θ̂RB ] = E[θ̂]. Furthermore, the expected loss associated with any
convex loss function satises

Lθ (θ̂RB ) ≤ Lθ (θ̂). (6.47)

In particular,

MSEθ (θ̂RB ) ≤ MSEθ (θ̂). (6.48)

Corollary 6.7. If θ̂ is unbiased, then so is θ̂RB , and Var[ θ̂RB ] ≤ Var[ θ̂ ].


Remark 6.8. Recall that X itself is always trivially a sucient statistic. In this case, the
Rao-Blackwell theorem is trivial, since θ̂RB = θ̂.
Remark 6.9. Note that the theorem works similarly if we replace θ with and function ϕ(θ)
everywhere, and compute the expected loss with respect to ϕ(θ).
Before we proceed to the proof, we need the following technical result. Let φ : R → R be
a measurable function. For any jointly distributed real-valued [Link]. X, Y , dene theφ-gap and
conditional φ-gap to be

δφ (Y ) , E [φ(Y )] − φ(E [Y ]) (6.49)

δφ (Y | X) , E [φ(Y ) | X] − φ(E [Y | X]). (6.50)

Lemma 6.10 (Law of total gap). Let X, Y be jointly distributed real-valued r.v.s., and φ : R → R
a measurable function. Then the φ-gap of Y is the sum of the expected conditional φ-gap and the
gap of the conditional expectation (given X ), i.e.,

δφ (Y ) = E [δφ (Y | X)] + δφ (E [Y | X]). (6.51)


64 CHAPTER 6. SUFFICIENT STATISTICS

whenever the r.h.s. exists.

Proof. Write

δφ (Y ) = E [φ(Y )] − φ(E [Y ]) (6.52)

= E [E [φ(Y ) | X]] − φ(E [E [Y | X]]) (6.53)

= E [δφ (Y | X) + φ(E [Y | X])] − φ(E [E [Y | X]]) (6.54)

= E [δφ (Y | X)] + E [φ(E [Y | X])] − φ(E [E [Y | X]]) (6.55)

= E [δφ (Y | X)] + δφ (E [Y | X]). (6.56)

The above is especially interesting when φ is a convex function.

Lemma 6.11 (Jensen's Inequality) . Let φ:R→R be a convex function. Then δφ (Y ) ≥ 0, and
Pr(δφ (Y | X) ≥ 0) = 1.
Proof. Since φ(t) is convex, then at any t0 there exists some a such that φ(t) ≥ a · (t − t0 ) + φ(t0 )
(a is unique when φ is dierentiable at t0 , in which case the lower bound is the associated tangent
line). Picking t0 = E[Y ] we have

E [φ(Y )] ≥ E [a · (Y − E[Y ]) + φ(E[Y ])] (6.57)

= φ(E[Y ]). (6.58)

The second claim follows since the above similarly holds for any given X = x.

A simple application of the above is obtained for φ(t) = t2 , in which case δφ (Y ) = Var [Y ] and
δφ (Y | X) = Var [Y | X]. This yields the so-called law of total variance, stating that the variance
of Y can be written as the sum of the variance of its MMSE approximation E [Y | X], and the
expected variance of the approximation error E [Var[ Y | X ]].

Corollary 6.12 (Law of total variance) . Let X, Y be jointly distributed real-valued r.v.s. Then

Var[ Y ] = Var[ E[Y | X] ] + E[Var[ Y | X ]]. (6.59)

Proof of Theorem 6.6. First, let us make sure this is a feasible estimator (depends only on data).
Recall that since T (X) is a sucient statistic, the distribution of X given T (X) does not depend
on θ. Hence
h i
θ̂RB = E θ̂(X) | T (X), θ (6.60)
h i
= E θ̂(X) | T (X) , (6.61)

which is only a function of T (X). Let us compute the expectation:


h h ii
E[θ̂RB (X)] = E E θ̂ | T (X) (6.62)

= E[θ̂]. (6.63)

where (6.63) is the law of total expectation. Let us now proceed to show the expected loss does
not increase. Fix any value of θ, and dene the function φ(t) = `(t, θ), which is convex in t by
assumption. Now, be denition for any estimator θ̂ it holds that
h i
Lθ (θ̂) = E `(θ̂(X), θ) (6.64)
6.4. COMPLETE SUFFICIENT STATISTICS AND THE LEHMANN-SCHEFFÉ THEOREM65

h i
= E φ(θ̂(X)) (6.65)

= δφ (θ̂(X)) + φ(E[θ̂(X)]) (6.66)

Since both the estimators θ̂ and θ̂RB have the same expectation (as a function of θ), it is enough
to show that the associated φ-gap (which is the variance of the estimator in the quadratic loss
case) does not increase. By the law of total gap, we have that

h i
δφ (θ̂) = E δφ (θ̂ | T (X)) +δφ (E[θ̂ | T (X)]) (6.67)
| {z } | {z }
≥0 θ̂RB

≥ δφ (θ̂RB ). (6.68)

where non-negativity follows from the convexity of φ. This completes the proof.

h i
Remark 6.13. Note that we can clearly write θ̂RB (T (X)) , E θ̂(X) | T (X) without the con-

ditioning on θ, and that θ̂RB is indeed a function of T (X) only (and not the whole X ).

6.4 Complete Sucient Statistics and the Lehmann-Scheé The-


orem
We have seen that the Rao-Blackwell procedure can improve (or at least not hurt) the MSE
while maintaining the same bias function. Suppose we start with an unbiased estimator, identify
a sucient statistic, and apply the Rao-Blackwell procedure. When do we get an MVU? To
answer this question, we rst need a denition. A statistic T (x) is called complete if for any
measurable function g that satises

E [g(T (X)) | θ] = 0, ∀θ (6.69)

it holds that, in fact,

Pr(g(T (X)) = 0 | θ) = 1, ∀θ. (6.70)

In other words, T (X) is complete if g(T (x)) being zero in expectation for all θ implies g(T (X)) =
0 with probability 1 for all θ. This odd looking denition has the following simple interpretation.
Suppose we have two estimators using the statistic, θ̂1 (T (X)) and θ̂2 (T (X)), and assume they
have the exact same bias function. Then clearly

h i
E θ̂1 (T (X)) − θ̂2 (T (X)) | θ = 0, ∀θ (6.71)

and hence if T (X) is complete, then θ̂1 and θ̂2 (T (X)) are identical with probability 1. Specically,
this immediately implies:

Corollary 6.14. If T (X) is complete, there can be at most one unbiased estimator of θ that is
a function of T (X).

For the purpose of stating the next Theorem, we will say that an estimator is minimum loss
unbiased for some loss function `, if it dominates every other unbiased estimator in the expected
loss (an MVU estimator is the special case obtained for quadratic loss).
66 CHAPTER 6. SUFFICIENT STATISTICS

Theorem 6.15 (Lehmann-Scheé) . Let X ∼ p(x | θ) and suppose T (X) is a complete


sucient statistic. The following two claims hold:

(i) If θ̂ is any unbiased estimator for θ, then


h i
θ̂RB = E θ̂ | T (X) (6.72)

is a minimum loss unbiased estimator for θ for any convex loss function, and it is
unique with probability 1 (in particular, for quadratic loss it is MVU).

(ii) If there exist an unbiased estimator θ̂(·) for θ that is only a function of T (X), i.e.,
h i
E θ̂(T (X)) = θ ∀θ, (6.73)

then θ̂ is a minimum loss unbiased estimator for θ for any convex loss function, and
it is unique with probability 1 (in particular, for quadratic loss it is MVU).

Remark 6.16. Note that the theorem holds verbatim if we replace θ with a function ϕ(θ)
everywhere.

Proof. First, we note that if θ̂ is unbiased, then θ̂RB is also unbiased, and is a function of T only,
hence it satises the conditions for claim (ii). Thus, it is enough to prove claim (ii).
Let θ̂(T (x)) be unbiased, and let θ̃(x) be any other unbiased estimator. Consider the Rao-
Blaockwell version of θ̃:
h i
θ̃RB = E θ̃(X) | T (X) (6.74)

h i
which is unbiased E θ̃RB = θ, has expected loss Lθ (θ̃RB ) ≤ Lθ (θ̃), and is a function of T (X)
only. Thus,
h i
E θ̂(T (X)) − θ̃RB (T (X)) = θ − θ = 0 (6.75)

for any θ, which by completeness means that θ̂(T (X)) = θ̃RB (T (X)) with probability 1, and
hence also that Lθ (θ̂) = Lθ (θ̃RB ) ≤ Lθ (θ̃).

Example 6.17 .
(Coin toss) Recall the biased coin toss problem with θ ∈ (0, 1). We saw that
T (X) = kXkH is a sucient statistic. Let us show it is complete. For any function g we have
that

n  
X n t
E [g(T (X))] = g(t) θ (1 − θ)n−t (6.76)
t
t=0
n   t
n
X n θ
= (1 − θ) g(t) . (6.77)
t 1−θ
t=0
 
θ
Dening z, 1−θ we have that E [g(T (X))] = 0 for any θ ∈ [0, 1] if and only if

n  
X n t
g(t) z = 0, ∀z ∈ [0, ∞). (6.78)
t
t=0
6.4. COMPLETE SUFFICIENT STATISTICS AND THE LEHMANN-SCHEFFÉ THEOREM67

But the left hand side above is an nth degree polynomial, which can have at most n roots unless
it is identically zero. Thus we conclude that it must be that g(t) = 0 identically, hence T (X) is
complete.
Now let us demonstrate the Rao-Blackwell procedure and the Lehmann-Scheé Theorem.
Suppose we start with the very naive unbiased estimator

ˆ = x1
θ(x) (6.79)

E[θ̃(X)] = θ. (6.80)

Clearly, this estimator will not be MVU for n > 1. Let us perform the Rao-Blackwell procedure,
which by the Lehmann-Scheé Therem, since T (X) is a complete sucient statistic, will be the
unique MVU.
h i
θ̂RB (t) = E θ̂ | T (X) = t (6.81)

= E [X1 | T (X) = t] (6.82)

= Pr(X1 = 1 | T (X) = t) (6.83)

Pr(X1 = 1, T (X) = t)
= (6.84)
Pr(T (X) = t)
n−1 t
θ (1 − θ)n−t

= t−1
n t

n−t
(6.85)
t θ (1 − θ)
(n − 1)! (n − t)!t!
= · (6.86)
(n − t)!(t − 1)! n!
t
= . (6.87)
n
Therefore,

T (X)
θ̂RB (T (X)) = (6.88)
n
kXkH
= (6.89)
n
= Mean[ X ]. (6.90)

which must be MVU, as we already know from CRLB considerations. We now also know further
that Mean[ X ] is the minimum loss unbiased estimator for any convex loss function.

Example 6.18 .
(Constant in AWGN) Recall the model
i.i.d
{Xk }nk=1 ∼ N (θ, σ 2 ), where σ 2 is
n
P
known. In the previous chapter we have seen that T (X) = Xk is a sucient statistic for θ.
k=1
Let us show that is also complete. Note that T (X) ∼ N (nθ, nσ 2 ). Suppose that E [g(T (X))] = 0
for some function g and all θ, i.e.,

Z∞
1 (t−nθ)2
√ g(t)e− 2nσ 2 dt = 0, ∀θ. (6.91)
2πnσ 2
−∞

Since this is a functional identity, the derivatives with respect to θ of both sides are also identical,
namely:

Z∞
2 (t−nθ)2
√ g(t)(t − nθ)e− 2nσ 2 dt = 0, ∀θ. (6.92)
8πn3 σ 6
−∞
68 CHAPTER 6. SUFFICIENT STATISTICS

where we have assume that we move the derivative inside the integral (otherwise, approximation
arguments can be used). The above immediately implies that

E [T g(T )] = nθE [g(T )] (6.93)

= 0, (6.94)

Repeating the same argument for the function tg(t) and iterating it, we nd that

h i
E T k g(T ) = 0 (6.95)

for any nonnegative integer k. Let us assume that g(t) is


Panalytick over the entire real line (if not,
approximation arguments can be used), hence g(t) = ∞ k=0 gk t pointwise in t. Then we have
that


" #
X
2 k
 
E g (T ) = E g(T ) · gk T (6.96)
k=0

" #
X
=E gk T k g(T ) (6.97)
k=0

X h i
= gk E T k g(T ) (6.98)
k=0
= 0, (6.99)

where we have further assumed absolute convergence in (6.98). Hence g(T ) = 0 with probability
one, and hence T (X) is complete. By the Lehmann-Scheé Theorem, if we can nd an function
θ̂(t) such that

h i
E θ̂(T (X)) = θ (6.100)

t
for any θ, then θ̂(T (X)) is the unique MVU estimator. Clearly, θ̂(t) = n satises this require-
ment. We further conclude that the sample mean in this case is a minimum loss unbiased for
any convex loss function.
7. Exponential Families

7.1 Motivation and Denition


Which parametric families attain the CRLB with equality? Let us look at the scalar parameter
case p(x | θ) for simplicity. A necessary and sucient condition to attain the CRLB for θ is


score [x | θ] = log p(x | θ) = I(θ)[T (x) − θ], (7.1)
∂θ
for some function T (X). Taking the integral on both sides w.r.t θ, we obtain
Z
log p(x | θ) = I(θ)[T (x) − θ]dθ + c(x) (7.2)
Z Z
= T (x) · I(θ)dθ − I(θ) · θdθ + c(x) (7.3)

= T (x) · η(θ) − a(θ) + c(x), (7.4)

R R
where η(θ) , I(θ)dθ and a(θ) , I(θ) · θdθ. Taking the exponent of both sides and dening
h(x) , ec(x) , the likelihood function is of the form

p(x | θ) = h(x) · exp {η(θ)T (x) − a(θ)} . (7.5)

A family of distributions of this form is called a (scalar) exponential family. The exponential
family is determined by the choice of functions h(x), T (x), η(θ) and a(θ), which can be arbitrary
as long as they result in a proper p.d.f. Note that the function a(θ), sometimes called the
log-partition function, is in fact uniquely determined by the choice of the other functions; it is
simply a normalization that makes sure that p(x | θ) integrates (or sums) to one. The function
η(θ) is sometimes called the natural parameter of the exponential family. Note that the general
exponential family form (when the functions h, T, η are general) does not imply that T (X) is
ecient for θ nor for η(θ), but there is always some parameterization for which T (X) is ecient,
as we will discuss in the sequel.
 T
More generally, we say that p(x | θ) is a (vector) exponential family for θ = θ1 · · · θd ,
if

p(x | θ) = h(x) exp η(θ)T · T (x) − a(θ)



(7.6)

where
 T
T (x) = T1 (x) · · · Td (x) (7.7)

and h(x), the natural parameters η : Rd → Rd and the log-partition-function a : Rd → R are


arbitrary (as long as they result in a proper distribution). Note that x can be a scalar of a vector
both for the scalar and vector exponential family.

69
70 CHAPTER 7. EXPONENTIAL FAMILIES

7.2 Basic Properties


In this section we overview some of the properties that make exponential families appealing.
We will see additional properties later, when we discuss conjugate priors in the Bayesian setup.
There are also many other results and applications of exponential families which we do not
discuss here, including maximum entropy problems and generalized linear models.
We state some of the results in the vector exponential family form, as it is more general; the
special simpler scalar case easily follows. First, we note that T (X) is a sucient statistic. This
follows immediately from the factorization theorem, since

p(x | θ) = h(x) exp η(θ)T · T (x) − a(θ) .



(7.8)
| {z }
g(T (X),θ)

An exponential family is called full rank if the d functions {Ti (x)}di=1 in the sucient statistic
are linearly independent (hence specically, any nontrivial scalar case is always full rank). The
following can be shown:

Lemma 7.1. Consider a full rank exponential family where the natural parameter space η(Θ)
has a non-empty interior. Then

(i) T (X) is a complete sucient statistic.

(ii) T (X) is minimal, i.e., it is a function of any other sucient statistics.

(iii) Any function of T (X) that is an unbiased estimator for ϕ(θ), is also the unique MVU
estimator for ϕ (with probability 1).

Proof. We have already seen that T (X) is sucient; proofs of completeness and minimality
can be found in the literature. The nal claim follows immediately from the completeness and
suciency via the Lehmann-Scheé Theorem.

Lemma 7.2. Consider a full-rank exponential family dened by (h(x), η(θ), T (x), a(θ)), where
Θ ⊆ Rd is a convex set. The following claims hold, assuming that all the derivatives exists and
the Jacobian matrices are invertible for all θ ∈ Θ.

(i) The ML estimator is either attained on the boundary of Θ, or is one of the solutions for
the equation

!−1
dη(θ) T da(θ) T
φ(θ) , · = T (x). (7.9)
dθ dθ

(ii) The sucient statistic itself T (x) is an ecient estimator for the parameter φ(θ) dened
above.

Proof. Let ϕ(θ) be some general invertible re-parametrization. The associated score vector is

d log p(x | θ)
score[ x | ϕ ] = (7.10)

d
= (η(θ)T T (x) − a(θ)) (7.11)

dη dθ T da dθ T
   
= T (x) − (7.12)
dθ dϕ dθ dϕ
7.2. BASIC PROPERTIES 71

"  T −1 #
dθ T dη T da T
 

= T (x) − (7.13)
dϕ dθ dθ dθ
 T T
dθ dη
= [T (x) − φ(θ)] (7.14)
dϕ dθ

Both claims now follow by choosing ϕ = φ. Note the the Fisher information matrix for Var[ φ ]
has also been revealed above.

The log-partition function of the exponential family can be used to derive the expectation
and the variance of the sucient statistic, which is sometimes simpler than computing directly.
Try to prove this at home:

Lemma 7.3. For a full rank exponential family, it holds that

∂a
E [T (X)] = (7.15)
∂η

and

∂2a
Cov [T (X)] = (7.16)
∂η∂η T

We note that similarly, higher central moments of the sucient statistic can be computed
using the corresponding higher order derivatives of the log-partition function.

Another important property of exponential families is their behavior under i.i.d samples.

Lemma 7.4. Let X = (X1 , . . . , Xn )T be an i.i.d sample from an exponential family. Then the
distribution of X is also an exponential family, with a sucient statistic given by

n
X
T (X) = T (Xk ). (7.17)
k=1

Proof. Follows directly from

n n
( )
Y X
p(x | θ) = h(xk ) · exp η(θ) T (xk ) − na(θ) . (7.18)
k=1 k=1

This nice property means in particular that the dimension of the sucient statistic of an
exponential family does not grow with the number of i.i.d samples; we can summarize our entire
set of observations using only d numbers where d is the number of parameters. Unfortunately,
this property does not typically extend beyond exponential families. The proof of the following
theorem can be found in the literature.

Theorem 7.5 (Pitman-Koopman-Darius) . Of all parametric families of probability distribution


whose support does not depend on the parameters, only exponential families have sucient statis-
tics whose dimension does not grow with the (i.i.d) sample size.
72 CHAPTER 7. EXPONENTIAL FAMILIES

7.3 Examples
Example 7.6 (Gaussian with unknown mean, known / unknown variance). X ∼ N (µ, σ 2 ) where
σ2 is either known or unknown (this is our constant in AWGN example).

(x − µ)2
 
2 1
p(x | µ, σ ) = √ exp − (7.19)
2πσ 2 2σ 2
x2 µ2
 
1 µ
=√ exp − 2 + x 2 − 2 . (7.20)
2πσ 2 2σ σ 2σ

If σ2 is known, then this can be written as

x2 µ2
   
1 µ
p(x | µ) = √ exp − 2 exp · x − . (7.21)
2πσ 2 2σ σ2 2σ 2
hence this is an exponential family with sucient statistic is T (x) = x, natural parameter is
µ µ2
η= σ2
, and the log-partition function a(µ) = 2σ 2
. The sucient statistic of an i.i.d sample is
thus the sample mean (or sample sum for multiple i.i.d. samples) as we already know. Moreover,
da/dµ µ/σ 2
T (X) = X is an ecient (and ML) estimator for the parameter
dη/dµ = 1/σ 2
= µ, as we already

know. Alternatively, we can see that X is MVU since it is unbiased an a function of a complete
sucient statistic.
What happens when the variance is also unknown? In this case we can write (7.20) as

x2 µ2
 
2 1 µ 1 2
p(x | µ, σ ) = √ exp − 2 + x 2 − 2 − log σ . (7.22)
2π 2σ σ 2σ 2
hence this is an exponential family with sucient statistic

T
T (x) = x, x2 ,

(7.23)

natural parameters

 T
µ2 1
η(µ, σ ) = 2
,− 2 , (7.24)
σ 2σ
µ2 1
and log-partition function a(µ, σ 2 ) = 2σ 2
+ 2 log σ 2 . The complete sucient statistic of an i.i.d
sample is therefore

T
T (X) = Mean [{Xk }] , Mean {Xk2 }
 
. (7.25)

It is easy to check that

µ̂ = Mean [{Xk }] (7.26)

σˆ2 = Mean {X 2 } − (Mean [{Xk }])2


 
k (7.27)

are unbiased, and since they are functions of a complete sucient statistic, they are the unique
MVU estimators.
Let us demonstrate the use of the log-partition function to compute the expectation of the
sucient statistic. First, for simplicity, let us write the log-partition function as a function of
the natural parameters:

η12 1
a(η) = − + log(−2/η2 ) (7.28)
4η2 2
7.3. EXAMPLES 73

E [T1 (X)] = E [X] (7.29)

∂a(µ, σ 2 )
= (7.30)
∂η1
η1
=− (7.31)
2η2
= µ. (7.32)

and

E [T2 (X)] = E X 2
 
(7.33)

∂a(µ, σ 2 )
= (7.34)
∂η2
η12 1
= 2 − (7.35)
4η2 2η2
= µ2 + σ 2 . (7.36)

Example 7.7 .
(Bernoulli) Let X ∼ Bern(θ) where θ ∈ [0, 1] is an unknown parameter. Then

p(x | θ) = θx (1 − θ)1−x (7.37)


   
θ
= exp x log + log(1 − θ) . (7.38)
1−θ

Therefore, this is an exponential family with T (X) = X as the complete sucient statistic
θ
(not surprisingly), the log-likelihood ratio η(θ) = log
1−θ as the natural parameter (hence θ =
eη i.i.d
1+eη ), anda(θ) = − log(1−θ) = log(1+eη ) as the log-partition function. Now, if X ∼ Bern(θ),
then we clearly have that the Hamming weight (or sample mean) T (X) = kXkH is the complete
η
sucient statistic, and a(θ) = −n log(1 − θ) = n log(1 + e ) is the log-partition function (the
natural parameter remains the same). Hence

∂a
E [T (X)] = (7.39)
∂η
neη
= (7.40)
1 + eη
= nθ, (7.41)

and

∂2a
Var[ T (X) ] = (7.42)
∂η 2
n(eη (1 + eη ) − e2η )
= (7.43)
(1 + eη )2
neη
= (7.44)
(1 + eη )2
= nθ(1 − θ). (7.45)

This is as expected, since T (X) ∼ Binom(n, θ).

Example 7.8 .
(Discrete Distributions) Suppose X is a discrete r.v with
 unknown symbol prob-
M
abilities {pi }i=1 . The vector of parameters is (say) θ = p1 · · · pM −1 , where we have removed
74 CHAPTER 7. EXPONENTIAL FAMILIES

the last symbol probability since it is a function of the others. Assuming without loss of generality
that X is taking the value i with probability i, we have the p.m.f:

M
pi · 1 (x = i)
X
p(x | θ) = (7.46)
i=1
M
( !)
1 (x = i) · pi
X
= exp log (7.47)
i=1
(M )
1 (x = i) · log pi
X
= exp (7.48)
i=1
M −1
( )
p
1 (x = i) · log i
X
= exp log pM + (7.49)
pM
i=1

where (7.48) follows by noting that only one summand can be nonzero. Dening

1 (x = 1) , 1 (x = 2) , · · · 1 (x = M − 1)
 T
T (x) = (7.50)
h θ
iT
θ
η(θ) = log 1−PM1 −1 θ log PθM2 −1 · · · log PMM−1
−1 , (7.51)
i=1 i 1− i=1 θi 1− i=1 θi
M −1
!
X
a(θ) = − log 1 − θi (7.52)
i=1

we have that

p(x | θ) = exp η(θ)T · T (x) − a(θ) .



(7.53)

Hence this is in the exponential family with a complete sucient statistic T (X) that is the
indicator of event that happened.
For an i.i.d sample from a discrete distribution the complete sucient statistic is therefore
n
P
T (Xi ), which is eectively the empirical distribution of the samples:
i=1

n  n n n
T
1X 1
1 (Xi = 1) 1
1 (Xi = 2) · · · 1
1 (Xi = M − 1) .
P P P
T (Xi ) = n n n (7.54)
n i=1 i=1 i=1
i=1

n
 
1 P  
Note that E n T (Xi ) = p1 · · · pM −1 . Hence, the Lehmann-Scheé Theorem implies
i=1
that the empirical distribution is the unique MVU estimator of the underlying distribution.

There are many other known families of distributions that are exponential families, includ-
ing: Exponential, Poisson, Chi-squared, Laplace, Beta, Gamma, and log-normal. On the other
hand, there are many others that are not exponential families, including uniform, Cauchy, and
hypergeometric.
Let us give an example that is not exponential family, but nevertheless has a small complete
sucient statistic and an MVU estimator.

Example 7.9 .
(Uniform) Let us go back to the model where
i.i.d
{Xk }nk=1 ∼ Unif([0, θ]) where
θ2
θ>0 is unknown. Recall we have seen that θ̂BLUE = 2Mean[ X ] has variance Var[ θ̂BLUE ] = 3n
and that θ̂ML = maxk Xk is biased.
7.3. EXAMPLES 75

Let us try to nd a sucient statistic. Write:

n
1 (xk ∈ [0, θ])
Y
p(x | θ) = θ−n (7.55)
k=1
 
=θ −n
· 1 max xk ∈ [0, θ] (7.56)
k

= h(x) · gθ (max xk , θ), (7.57)


k

where h(x) = 1 and g(t, θ) = θ−n · 1 (t ∈ [0, θ]). By the Neyman-Fisher factorization Theorem
we conclude that T (X) = max Xk is a sucient statistic. This makes sense, since once we know
k
the maximal sample the smaller ones intuitively give us no further information regarding the
support. It is also clear that this is not an exponential family setup, since the likelihood cannot
be written as a function of some product η(θ) · max Xk .
k
Let us now show that this sucient statistic is also complete. First, we need to nd the p.d.f
of T = T (X). For any t ∈ [0, θ] we have

n
Y
Pr (T ≤ t | θ) = Pr (Xk ≤ t) (7.58)
k=1
 n
t
= ·. (7.59)
θ

Hence, taking a derivative we nd the p.d.f


p(t | θ) = Pr(T ≤ t) (7.60)
∂t
= θ−n ntn−1 · 1 (t ∈ [0, θ]) (7.61)

Now suppose g(t) is such that E [g(T )] = 0 identically. Then for any θ>0


E [g(T )] = g(t)θ−n ntn−1 dt = 0, (7.62)

hence also


g(t)tn−1 dt = 0. (7.63)

Taking the derivative of both sides of this identity w.r.t. θ we nd that for any θ>0

g(θ)θn−1 − 0 = 0, (7.64)

Hence g(θ) = 0 and T is complete.


Now, let us try to nd the MVU estimator, if it exists. In fact, let us even try to nd the
MVU estimator for any function ϕ(θ), if possible. To that end, it is necessary and sucient to
nd an estimator ϕ̂(T (X)) that is unbiased, i.e., such that for any θ>0


E [ϕ̂(T )] = ϕ̂(t) · θ−n ntn−1 dt (7.65)

0
76 CHAPTER 7. EXPONENTIAL FAMILIES

= ϕ(θ), (7.66)

or equivalently, that


n ϕ̂(t)tn−1 dt = θn ϕ(θ). (7.67)

Taking the derivative with respect to θ yields,

nϕ̂(θ)θn−1 − 0 = nθn−1 ϕ(θ) + θn ϕ0 (θ). (7.68)

Solving for ϕ̂ and substituting θ → t, we have

t
ϕ̂(t) = ϕ(t) + · ϕ0 (t). (7.69)
n
We conclude that the estimator

maxk Xk 0
ϕ̂MVU = ϕ(max Xk ) + · g (max Xk ) (7.70)
k n k

is the unique MVU estimator for ϕ(θ). Specically, for ϕ(θ) = θ we get the MVU estimator

n+1
θ̂MVU = · max Xk . (7.71)
n k

n+1
Note that θ̂MVU = n · θ̂ML , hence the MVU estimation in this case is just a bias-corrected ML.
What is its variance?
" 2 #
h i n+1
Var θ̂MVU (T ) = E max Xk − θ2 (7.72)
n k

n + 1 2  2
 
= E T − θ2 (7.73)
n
 Zθ
n+1 2

= θ−n ntn−1 · t2 dt − θ2 (7.74)
n
0
2 θ
tn+2

n+1 −n
= θ n − θ2 (7.75)
n n+2
0
n+1 2 n
 
= · θ2 − θ2 (7.76)
n n+2
(n + 1)2 2
= · θ − θ2 (7.77)
n(n + 2)
n2 + 2n + 2 − n2 − 2n 2
= θ (7.78)
n(n + 2)
θ2
= . (7.79)
n(n + 2)
h i
θ2
This is much better than Var θ̂BLUE = 3n . In fact, note that the variance in this case decays
as O(1/n2 ), which is order-wise better than the Ω(1/n) implied by the CRLB. There is no
contradiction however, since in this case the regularity conditions for the CRLB are not satised
7.3. EXAMPLES 77

(the support size depends on the parameter). Moreover, in this case although Unif([0, θ]) is not
in an exponential family, we have found a scalr sucient statistics. This does not contradict the
Pitman-Koopman-Darius Theorem for the same reason, namely since the distribution support
depends on the parameter.
Can we gain uniformly in MSE by introducing bias? Let θ̂ = α · max xk = α · T . We have
k
seen that:

n
E [T ] = ·θ (7.80)
n+1
n 2
E T2 =
 
θ . (7.81)
n+2
Thus

MSE(θ̂) = E (αT − θ)2


 
(7.82)

= α2 E T 2 − 2αθE [T ] + θ2
 
(7.83)
 
n 2n
= θ2 α2 − α+1 . (7.84)
n+2 n+1
n+2 n+1 n+2
This is minimized by α= n+1 (and not by n ). Hence, θ̂biased = n+1 · max Xk we have
k

θ2 θ2
MSE(θ̂biased ) = < = MSE(θ̂MVU ). (7.85)
(n + 1)2 n(n + 2)

This turns out to be true in many cases: introducing a small bias can often uniformly help (but
usually not by much).

In general, the dimension of a sucient statistic grows with the number of samples outside
the exponential families. The example above was the exception rather than the rule; let us now
see that a simple generalization of it to higher dimensions already fails to maintain a constant
dimension of a sucient statistic.

Example 7.10 (Convex bodies) Let. {Xk }nk=1 ∼


i.i.d
Unif(S) where S ⊆ Rk is some unknown
convex set. What is the smallest possible sucient statistic here? It is easy to be convinced that
it is both necessary and sucient to keep only the points that generate the convex hull of the
samples, i.e., T (X) is the minimal set of points such that

Convex hull (X) = Convex hull (T (X)) . (7.86)

In the worst case we will need to keep all of the sample, e.g., if we are unlucky and say always
get points on the boundary of S. But this is very unlikely, so we expect that most of the time,
Xk will already be in the convex hull of {X1 , . . . , Xk−1 }. We may then wish to see how fast
the size of T (X) grows with high probability as a function of n. The rough answer is that with
high probability, the number of points in T (X) is O(log n), which is much nicer than n but still
grows unbounded. We will not prove this rigorously here, but to see why it is the case, consider
a very similar one-dimensional situation. Suppose the samples are uniform between [0, θ], and
we want to count the number of times that Xk is larger than all the samples the precede it (a
new maximum). This value is

" n   # n  
1 max Xj < Xk =
X X
E Pr max Xj < Xk (7.87)
j≤k−1 j≤k−1
k=1 k=1
78 CHAPTER 7. EXPONENTIAL FAMILIES

x1 x5
x4
x3
x2

Figure 7.1: Example of a convex body.

n
X   
= E E max Xj < Xk | Xk (7.88)
j≤k−1
k=1
n
"
 #
X Xk k−1
= E (7.89)
θ
k=1
Xn Z θ
= θ−(k−1) θ−1 · xk−1 dx (7.90)
k=1 0
n
X θk
= θ−(k−1) θ · (7.91)
k
k=1
n
1X
= (7.92)
k
k=1
Z n
dx
≤1+ (7.93)
1 x
= log n + 1, (7.94)

where log n is the natural logarithm.

We conclude this chapter by giving an example where no complete sucient statistic exists.

Example 7.11 . W ∼ Unif − 12 , 21 . X


 
(No complete sucient statistic) Let X = θ + W, where
itself is trivially a sucient statistic. Also, it is an unbiased estimator E [X] = θ. So, if it is
complete, it will be MVU. However, note that for example,

E [sin(2πX)] = E [sin(2πθ + 2πW )] (7.95)


1
Z 2

= sin(2πθ + 2πw)dw (7.96)

− 12

=0 (7.97)

but Pr(sin(2πX) = 0) = 0 (and not 1). In other words, X + sin(2πX) is also unbiased, and a
function of a sucient statistic. Thus X is not complete, and we can not conclude that θ̂ = x is
an MVU estimator. The same argument applies to all other sucient statistics here, which are
clearly simply all invertible function of X, showing that none of them is complete.
8. The ML Estimator

8.1 Preliminaries
We begin with a short discussion of known denitions and results concerning the convergence of
r.v sequences. Let {Xk }∞
k=1 be a sequence of real-valued [Link], and let X be a real-valued r.v. We
D
say that Xn convergences in distribution to X, denoted Xn −
→ X, if

lim Pr(Xn ≤ x) = Pr(X ≤ x) (8.1)


n→∞

at any point x ∈ R where the c.d.f Pr(X ≤ x) is continuous. We say that Xn converges in
P
probability to X , Xn −
→ X, if for all ε>0

lim Pr(|Xn − X| ≥ ε) = 0. (8.2)


n→∞

Here are some useful results relating the dierent modes of convergence.

Lemma 8.1. The following claims hold:

P D
(i) Xn −
→X implies that Xn −
→ X.
D P
(ii) Xn −
→c for some constant c, implies that Xn −
→ c.

Theorem 8.2 (Continuous mapping theorem (CMT)) . Let g : Rm → Rn be a continuous


function. Then

D D
1. Xn −
→X implies g(Xn ) −
→ g(X).
P P
2. Xn −
→X implies g(Xn ) −
→ g(X)

Theorem 8.3 (Slutsky's theorem) . Suppose Xn −


→X
D
and Yn −
→c
P
where c is some constant.
Then

D
1. Xn + Yn −
→ X + c.
D
2. Xn · Yn −
→ X · c.
X D X
3. Y n −
→ c (for c 6= 0).
n

D
Proof. It can be easily shown that (Xn , Yn ) −
→ (X, c). The result then follows the CMT by using
g(x, y) = x + y, x · y, xy .

We note turn to asymptotic convergence laws for sums of i.i.d [Link].

79
80 CHAPTER 8. THE ML ESTIMATOR

Lemma 8.4 (Weak Law of Large Numbers (WLLN)) . Let {Xk }∞


k=1 be a sequence of i.i.d [Link]
with a nite mean µ and nite variance. Then
n
1X P
Xk −
→ µ. (8.3)
n
k=1

Proof. Let Var [Xk ] = σ 2 and write X = [X1 , . . . , Xn ]T . Then we have

E [Mean[ X ]] = µ (8.4)
n
" #
1X
Var [Mean[ X ]] = Var Xk (8.5)
n
k=1
1
= · nσ 2 (8.6)
n2
σ2
= (8.7)
n
Using Chebyshev's inequality, we have that for any ε > 0

Var [Mean[ X ]]
Pr (|Mean[ X ] − µ| ≥ ε) ≤ (8.8)
ε2
σ 2
= 2 (8.9)

−−−→ 0. (8.10)
n→∞

We note that under these conditions the convergence is in fact also with probability 1 (which is
a stronger mode of convergence).

It is easy to see that if {Xk } is i.i.d, then


n
1X P
g(Xk ) −
→ E[g(X1 )]. (8.11)
n
k=1
whenever the expectation above is nite. It is sometimes desirable to have the WLLN apply to
multiple functions at the same time. Clearly, if the number of functions g we are about is nite,
we can use the union bound to that end. However, if we want to control an innite number of
functions at the same time, we need to add more regularity conditions.

Lemma 8.5 (Uniform WLLN (UWLLN)) . Let {Xk }∞


k=1 Θ ⊆ Rd a closed and
be i.i.d [Link],
bounded set, and g(x, θ) a real-valued function that is continuous in θ ∈ Θ for any x, and
measurebale in x. Suppose that |g(x, θ)| ≤ h(x) for some function h(x) satisfying E[h(X1 )] < ∞.
Then
n
1X P
sup g(Xk , θ) − E [g(X1 , θ)] −
→ 0. (8.12)
θ∈Θ n
k=1
The WLLN tells us that the empirical mean of an i.i.d sequence converges in probability to
its expectation. The Central Limit Theorem (CLT) zooms in close on the random uctuations
of the empirical mean around the expectation.

Theorem 8.6 (Central Limit Theorem (CLT)) . Let {Xk }∞


k=1 be a sequence of i.i.d [Link] with a
nite mean µ 2
and nite variance σ . Then
n
1 X D
√ → N (0, σ 2 ).
(Xk − µ) − (8.13)
n
k=1
Proof. Use the characteristic function and a Taylor approximation.
8.2. ML ASYMPTOTIC OPTIMALITY 81

8.2 ML Asymptotic Optimality


We have already seen that in manhy cases, when an ecient estimator exists, then ML is ecient.
We now prove a more general result, showing that ML is asymptotically optimal. To that end, we
need some regularity conditions: We say that the parametric family {p(x | θ)}θ∈Θ is ML-regular,
if

(i) {p(x | θ)}θ∈Θ is 2-regular.

(ii) Θ ⊆ Rd is a closed and bounded set.

(iii) | log p(x | θ)| ≤ h(x) for some function h(x) such that

sup EX∼p(x | θ) [h(X)] < ∞. (8.14)


θ∈Θ

(iv) D (p(x | θ0 )kp(x | θ1 )) 6= 0 for any θ0 6= θ1 (identiability) .

Theorem 8.7 (Asymptotic properties of ML). Let {p(x | θ)}θ∈Θ be ML-regular, and X=
[X1 , . . . , Xn ]T be drawn i.i.d from p(x | θ). Let

(n)
θ̂ML = argmax p(X | θ). (8.15)
θ

Then for any θ ∈ interior(Θ):


h i
(n) P (n)
(i) (consistency) θ̂ML −
→ θ. Specically, lim E θ̂ML = θ .
n→∞

(n)
(ii) (asymptotic eciency) lim Cov[ θ̂ML ] · nI(θ) − Id×d = 0.
n→∞
√  (n) 
D
→ N 0, I −1 (θ) .

(iii) (asymptotic normality) n θ̂ML − θ −

Remark 8.8. We note that the claims in the theorem above are rather weak, as they provide
no information about the rate of convergece. Namely, it is entirely unclear how large n should
be in a certain problem so that we can safely think of the ML estimator as having converged
in the sense of having a small MSE, or having a Gaussian distribution in a region close to the
parameter, etc. Establishing nite sample bounds for the convergence of ML is in general a
challenging problem, which can sometimes be handled in specic settings (we will see one such
case in the next chapter).

Proof. For simplicity, we prove for a scalar parameter. Denote the true parameter by θ0 ∈
interior(Θ). For any other θ, dene the (normalized) log-likelihood ratio (LLR)
 
1 p(x | θ0 )
Ln (x, θ) , log (8.16)
n p(x | θ)
n  
1X p(xk | θ0 )
= log . (8.17)
n p(xk | θ)
k=1

The ML estimator can be written as

(n)
θ̂ML (x) = argmax log p(x | θ) (8.18)
θ
82 CHAPTER 8. THE ML ESTIMATOR

= argmin Ln (x, θ) (8.19)


θ

Let us therefore examine the behavior of the LLR as an r.v:


 
1 p(X | θ0 )
Ln (X, θ) = log (8.20)
n p(X | θ)
n  
1X p(Xk | θ0 )
= log (8.21)
n p(Xk | θ)
k=1
  
P p(X | θ0 )

→ EX∼p(x | θ0 ) log (8.22)
p(X | θ)
= D (p(x | θ0 ) || p(x | θ)) (8.23)

, Dθ (8.24)

where we have used the fact that the samples are i.i.d, then the WLLN, and the denition of
the KL divergence. Recall that Dθ ≥ 0, and by our identiability assumption this inequality is
strict for any θ 6= θ0 . We conclude that

P > 0 θ 6= θ0
Ln (X, θ) −
→ Dθ = . (8.25)
0 θ = θ0

Since our estimator picks the value of θ that minimizes the LLR, it is intuitively clear that it
would eventually favor the true parameter θ0 over any other xed θ, with high probability as
n grows large. However, we need this to happen uniformly in θ, and not just for each single θ,
which is where the UWLLN comes into the picture.
Let us proceed more rigorously. Dene

D, inf D (p(x | θ0 ) || p(x | θ)) (8.26)


|θ−θ0 |>ε

= inf Dθ (8.27)
|θ−θ0 |>ε

Note that D > 0 due to identiability, the continuity of the likelihood function, and the fact
that Θ is a closed set. Let us also write Ln (θ) for Ln (X, θ), suppressing the dependence on X
for brevity. We would like to evaluate the probability that our estimator would prefer some θ
that is more than ε away from the true parameter θ0 :
   
(n)
Pr θ̂ML − θ0 > ε = Pr inf Ln (θ) < Ln (θ0 ) (8.28)
|θ−θ0 |>ε
 
= Pr inf Ln (θ) < 0 (8.29)
|θ−θ0 |>ε
" #
≤ Pr sup |Ln (θ) − dθ | > D (8.30)
|θ−θ0 |>ε

−−−→ 0. (8.31)
n→∞

(n) P
where (8.31) follows from the UWLLN. Hence we conclude that θ̂ML −
→ θ0 .
Let us prove asymptotic normality. To that end, recall that the Mean Value Theorem says
that if f (·) is continuous on [a, b] and dierentiable on (a, b), then there exists c ∈ (a, b) such
that

f (b) − f (a)
f 0 (c) = (8.32)
b−a
8.2. ML ASYMPTOTIC OPTIMALITY 83

or equivalently

f (b) = f (a) + (b − a)f 0 (c). (8.33)

Consider the score of n samples


score [X | θ] = log p(X | θ) (8.34)
∂θ
n
X ∂
= log p(Xk | θ). (8.35)
∂θ
k=1

By denition of the ML, we have that

h i
(n)
score X | θ̂ML = 0. (8.36)

Since we assumed the score is dierentiable w.r.t θ, we can use the Mean Value Theorem to write
0
h i
score [X | θ] = score [X | θ0 ] + (θ − θ0 )score X | θ̃ (8.37)

(n)
for some θ̃ between θ0 and θ. Substituting θ = θ̂ML and using (8.36), we obtain

h i 0
h i
(n) (n)
score X | θ̂ML = score [X | θ0 ] + (θ̂ML − θ0 )score X | θ̃ (8.38)

=0 (8.39)

(n)
where θ̃ is now dependent on θ̂ML . Hence,


√ (n) n · score [X | θ0 ]
n(θ̂ML − θ0 ) = − h i (8.40)
score0 X | θ̃
√1 score [X | θ0 ]
n
= h i. (8.41)
− n1 score0 X | θ̃

Let us examine the numerator:

n
1 1 X
√ score [X | θ0 ] = √ score [Xk | θ0 ] (8.42)
n n
k=1
D

→ N (E [score [X | θ0 ]] , Var [score [X | θ0 ]]) (8.43)

= N (0, I(θ0 )) . (8.44)

where X ∼ p(x | θ0 ), and we have used the CLT. Let us examine the denominator:

n
1 0
h i 1 X ∂2
− score X | θ̃ = − log p(X | θ̃). (8.45)
n n ∂θ2
k=1

If θ̃ = θ0 then by the WLNN we would get convergence in probability

∂2
 
−E log p(X | θ0 ) = I(θ0 ). (8.46)
∂θ2
84 CHAPTER 8. THE ML ESTIMATOR

However, we already know that the ML estimate converges to the correct θ0 , hence θ̃ is with
high probability sandwiched inside a vanishingly small interval. Making this precise, we write:

 
1 0
h i
Pr − score X | θ̃ − I(θ0 ) > ε (8.47)
n
  
1 0
h i 
= Pr − score X | θ̃ − I(θ0 ) > ε | θ̃ − θ0 < δ Pr θ̃ − θ0 < δ (8.48)
n
  
1 0
h i 
+ Pr − score X | θ̃ − I(θ0 ) > ε | θ̃ − θ0 ≥ δ Pr θ̃ − θ0 ≥ δ (8.49)
n
 
1 0
h i  
≤ Pr − score X | θ̃ − I(θ0 ) > ε | θ̃ − θ0 < δ + Pr θ̃ − θ0 ≥ δ (8.50)
n
!
1 0

(n)

≤ Pr sup − score [X | θ] − I(θ0 ) > ε + Pr θ̂ML − θ0 ≥ δ (8.51)
θ:|θ−θ0 |<δ n

(n)
where (8.51) holds sinceθ̃ is squeezed between θ0 and θ̂ML . The second term in (8.51) tends to
zero with n for any xed δ > 0, due to the consistency of the ML estimator. Let us pick δ > 0
ε
small enough such that | − E [score[ X | θ ]] − I(θ0 )| <
2 for any θ such that |θ − θ0 | < δ , which
is possible due to the smoothness of the score's derivative. For such a choice, the rst term also
tends to zero with n, due to the UWLNN. We therefore conclude that

1 0
h i
P
− score X | θ̃ −→ I(θ0 ) (8.52)
n
1 D
as n → ∞. Recalling that √
n
score [X | θ0 ] −
→ N (0, I(θ0 )), then applying Slutsky's theorem

to (8.41) yields

√  (n)  √1 score [X | θ0 ]
n
n θ̂ML − θ0 = h i (8.53)
− n1 score0 X | θ̃
D N (0, I(θ0 ))

→ (8.54)
I(θ0 )
= N 0, I −1 (θ0 ) ,

(8.55)

proving asymptotic normality.


Asymptotic eciency is easily implied by asymptotic normality. Namely, from (8.53) we can
conclude that
h√  i
(n)
lim Var n θ̂ML − θ0 = I −1 (θ0 ) (8.56)
n→∞
h i
(n)
lim n · Var θ̂ML = I −1 (θ0 ) (8.57)
n→∞
 
h
(n)
i 1 1
Var θ̂ML = +o . (8.58)
nI(θ0 ) n
9. Condence Intervals
So far we have discussed point estimators, namely estimators that return a guess for the exact
value of the parameter, and their performance has been measured in terms of their MSE. Another
approach is that of condence intervals, where instead of a point estimator accompanied by
its MSE, we provide a condence set S = S(X) ⊂ Θ of possible values for the parameter,
accompanied with a guarantee on the probability Pr(θ ∈ S), called the condence level, that this
set will contain the true parameter. Note that the parameter here is xed and unknown with no
probability distribution associated with it; the probability here is that the set S obtained from
a random sample will contain the parameter. Clearly, once the sample is given this probability
is either zero or one. So perhaps a clearer interpretation is via the LLN, namely that if the
experiment is to be repeated many times, the fraction of instances where the set S contains the
parameter converges to the condence level. We would of course want the condence set to be as
small as possible. In this chapter we discuss the case where the parameter of interest is scalar,
and limit the subset S to be an interval. The generalization to higher dimensions and other sets
is straightforward in principle, and is partially addressed in a homework exercise.
Let X ∼ p(x | θ) be a sample from a parametric family with θ ∈ Θ ⊆ R. A condence
interval (CI) for θ is a pair of [Link] [`(X), r(X)] such that Pr (θ ∈ [`(X), r(X)]), the condence
level associated with the CI, does not depend on the parameter. It may not be immediately clear
that non-trivial condence intervals can even be computed in some setup. The following general
approach is often useful:

• Try to nd a pivotal quantity for θ, which is any function g(x, θ) such that the distribution
of g(X, θ) does not depend on the parameter.

• Given a pivotal quantity, set a condence level c and try to nd the set of all θ values that
satisfy

c = Pr(g(X, θ) ∈ A), (9.1)

where A is some nice set  choosing A as an interval is good when g is continuous and
invertible in θ.

Example 9.1 .
(Constant in AWGN) Consider the model
i.i.d
X ∼ N (θ, σ 2 ) where σ2 is known.
Recall that the MVU estimator is θ̂MVU = Mean[ X ] and that

σ2
Mean [X] ∼ N (θ, ). (9.2)
n
Then, the function

Mean [X] − θ
g(X, θ) = √ ∼ N (0, 1), (9.3)
σ/ n

85
86 CHAPTER 9. CONFIDENCE INTERVALS

also called as the z-score, is a pivotal quantity. From the symmetry of the problem, it makes
sense to choose A = [−t, t], which yields on the one hand

c = Pr (g(X, θ) ∈ A) (9.4)

= Pr (|g(X, θ)| ≤ t) (9.5)

= 1 − 2Q(t) (9.6)

t = Q−1 1−c

Namely,
2 . Solving for θ yields

 
σ σ
|g(X, θ)| ≤ t ⇔ θ ∈ Mean[ X ] − t · √ , Mean[ X ] + t · √ . (9.7)
n n

Hence,

 
1−c σ
l(X) = Mean[ X ] − Q−1 · √ (9.8)
2 n
 
−1 1 − c σ
r(X) = Mean[ X ] + Q · √ (9.9)
2 n

is a CI with condence level c.

Sometimes we wish to obtain a CI for a single parameter (say θ1 ) in a multi-parameter


problem. In such a case the other parameters (say θ2 , . . . , θ d ) are sometimes referred to as
nuisance parameters. A pivotal quantity is dened in the same way, and its distribution should
not depend on any of the parameters.

Example 9.2 .
(Linear model in AWGN) Consider the linear model X = Hθ + W , where
W ∼ N (0, σ 2 I) and σ2 is known. Let us nd a CI for θ1 and treat θ2 , ..., θd as nuisance
parameters. Recall that

θ̂MVU = (H T H)−1 H T X ∼ N θ, σ 2 (H T H)−1 .



(9.10)

Again we use z-score pivotal quantity, namely,

θ̂1,MVU (X) − θ1
g(X, θ1 ) = p ∼ N (0, 1), (9.11)
σ {(H T H)−1 }ii

and we get

   
1−c
q
−1 T −1
Pr θ − θ̂1,ML < Q · σ {(H H) }ii = c, (9.12)
2

for any condence level c. More generally, we can look at condence regions for the entire
parameter vector (or parts of it). In this case, we can use

θ̂ML − θ ∼ N 0, σ 2 (H T H)−1

(9.13)

as a pivotal quantity (see homework exercise).

In the next example we show that we can also work with quantities that are asymptotically
pivotal, namely converge to oa distribution that does not depend on the parameters, assuming
we can get concrete bounds on the convergence.
87

Example 9.3 .
(Constant in general noise) Consider the model Xk = θ + Wk where Wk nk=1 is
i.i.d with zero mean and known variance Var [Wk ] = σ 2 . By by the CLT, it holds that the z-score
satises

Mean [X] − θ D
g(X, θ) = √ −
→ N (0, 1). (9.14)
σ/ n
This statement is however not practically useful, as it gives no information on the distribution for
nite n, hence not good enough for the purpose of computing a condence intervals. However, if
E |Wk |3 ≤ η
 
we assume that for some known η, then the nite sample deviation of the z-score
from Gaussian can be bounded using the Berry-Esseen CLT:
   
Mean[ X ] − θ η
c = Pr √ ≤ t ≥ 1 − 2 Q(t) + 3 √ . (9.15)
σ/ n σ n
η√
Repeating the AWGN derivations with the small correction term yields the CI
σ3 n
  

1− c− √
 · √σ
σ 3 n
l(X) = Mean[ X ] − Q−1  (9.16)
2 n
   
1 − c − σ32η

r(X) = Mean[ X ] + Q−1 
n
 · √σ (9.17)
2 n

with condence level at least c.


The following is a basic and important example of nding a CI for the mean of a Gaussian
distribution, when the variance is unknown.

Example 9.4 (Constant in AWGN with unknown variance). Consider the model X i.i.d
∼ N (µ, σ 2 )
where µ, σ 2 are both unknown. Suppose we only care about µ and treat σ2 as a nuisance
parameter (a very common situation). We have seen that p(x | µ, σ 2 ) is exponential family with
T
X 2 , and

complete sucient statistics T (X) = X

µ̂MVU = Mean[ X ] (9.18)

ˆ
σ 2 MVU = Mean[ {X12 , . . . Xn2 } ] − (Mean[ X ])2 (9.19)
n
1 X
= (Xk − Mean[ X ])2 . (9.20)
n−1
k=1

Note that µ̂MVU ∼ N (µ, σ 2 ), and hence the z-score

µ̂MVU − µ
√ ∼ N (0, 1). (9.21)
σ/ n

This is however not a pivotal quantity for µ, since it also involves σ2 which is unknown. Nev-
ertheless, there is a way around this problem. Recall we have seen the the Fisher information
matrix is diagonal in this case, which means that the estimators µ̂MVU and σˆ2 MVU are uncorre-
lated. As it turns out, it can be shown that they are also statistically independent. Moreover, it
can be shown (using e.g. Cochran's Theorem) that

σˆ2 MVU
(n − 1) · ∼ χ2n−1 , (9.22)
σ2
88 CHAPTER 9. CONFIDENCE INTERVALS

where χ2n−1 is the Chi-square distribution with n−1 degrees of freedom, i.e., the distribution of
the sum of squares of n−1 i.i.d standard Gaussian [Link]. Hence, we can consider the following
quantity, called the t-statistic (abbreviated t-stat):

√ µ̂MVU −µ
µ̂ −µ √ σ/ n
qMVU = n−1· q (9.23)
ˆ2
σˆ2 MVU /n (n − 1) · σ σMVU
2

√ N (0, 1)
∼ n−1· q , (9.24)
2
Xm−1

by which the distribution of a r.v obtained by
q n − 1 times the ratio of two independent [Link]
with distributions N (0, 1) and
2
Xm−1 . This distribution is known as the student's t-distribution
with n − 1 degrees of freedom, and its p.d.f is given by

− n2
Γ n2

t2

p(t) = p  · 1+ . (9.25)
(n − 1)πΓ n−1 2
n−1

µ̂ −µ
The t-stat q MVU is a pivotal quantity for µ with the above distribution, which can be easily
σˆ2 MVU /n
used to obtain condence intervals. Special cases of the t-stat's distribution include:

• For n=2 we get Cauchy distribution

1
p(t) = . (9.26)
π(1 + t2 )

• For n=3 we have

1
p(t) = 3 . (9.27)
(2 + t2 ) 2

• For n=∞ we get (not surprisingly) the normal distribution

1 t2
p(t) = √ e− 2 . (9.28)

For a small sample size the distinction between student's t-distribution and the normal distribu-
tion are is signicant. As n grows large this dierence becomes smaller. As a rule of thumb, for
a sample of size n = 100 it is already quite accurate to use the normal distribution in lieu of the
the t-distribution in order to compute condence intervals with condence level as high as 99%,
but this is insucient for a condence level of say 99.9%. The t-stat can be generalized to any
linear model.

We conclude by discussing approximated asymptotic condence intervals via the ML estima-


tor.

Example 9.5. Let X i.i.d


∼ p(x | θ), and let θ̂ML be the ML estimator. We know that θ̂ML converges
1
in probability to θ and that its MSE is roughly
nI(θ) . Let us consider the quantity θ̂ML − θ and
try to approximate its tail probability:
 
  2
Pr θ̂ML − θ > t = Pr θ̂ML − θ > t2 (9.29)
89

 
2
E θ̂ML − θ
≤ (9.30)
t2
1
≈ (9.31)
t2 nI(θ)
1
≈ , (9.32)
t2 nI(θ̂ML )
 
where (9.30) holds due to Markov's inequality. If we set a condence level of Pr θ̂ML − θ > t =
1 − c, we get

s
1
t≈ (9.33)
n · (1 − c) · I(θ̂ML )

and hence we obtain the approximate CI

" s s #
1 1
θ ∈ θ̂ML − , θ̂ML + (9.34)
n · (1 − c) · I(θ̂ML ) n · (1 − c) · I(θ̂ML )

with condence level approximately c or higher. Note that Markov's inequality is generally
quite loose in this case, since we expect the distribution of the ML estimator to be close to
normal. Thus, this approximate CI is typically on the conservative side.
90 CHAPTER 9. CONFIDENCE INTERVALS
10. Regularization and the Bayesian Ap-
proach

10.1 Regularization
The following (related) issues often arise in various estimation problems:

• Overtting: The model is too complicated relative to number of samples, e.g., there are
too many parameters and not enough samples. In this case, while we can come up with
(typically many) estimators that seem to explain the data well, these estimators in fact
explain the noise rather than the signal. They are often very noisy and do not generalize
well beyond the sample we have seen.

• Ill-posed problems: There may be many dierent θ's that explain the data almost equally
well, hence inverting the problem might be numerically unstable. For example, in a linear
model we have

θ̂MVU = (H T H)−1 H T x, (10.1)

but if HT H has some very small eigenvalues, then it is an issue to invert.

• Prior information: We may know something about what θ should look like, and want
to incorporate it into the estimation.

Recall the famous Occam's razor:  Entities are not to be multiplied without necessity .
Namely, if there are many ways to explain a phenomenon which are all similarly eective, choose
the simplest one. Loosely speaking, regularization is a process in which we penalize compli-
cated explanations, so that choosing a more complex solution to our estimation problem would
require this solution to explain the data signicantly better than simpler solutions do. This is
useful in order to avoid over-tting noise and stabilize the problem.
In order to make this more precise, we need to dene what we mean by explaining the
data, and by the complexity of the solution to an estimation problem. To these ends, the
standard approach is slightly dierent than what we have seen so far. Let L : X ×Θ → R be
a loss function, where L(x, θ) measures how well θ explains the data x. Note that this is not
the loss function Lθ (θ̂) we discussed before, which measures how close the estimator is to the
real θ . The loss function L(x, θ) is in some sense more general, as it does not directly assume
any statistical model, nor it assumes there is a true parameter. Each parameter value provides
some explanation to the data, whose accuracy is measured by L(x, θ). However, we can connect
this point of view to our usual parametric family setup by using (say) the log-likelihood loss
function, i.e.,

L(x, θ) = − log p(x | θ). (10.2)

91
92 CHAPTER 10. REGULARIZATION AND THE BAYESIAN APPROACH

Hence, the estimator that minimizes the loss in this case is simply the ML estimator for the
parametric family

θ̂ML = argmin L(x, θ). (10.3)


θ

We will restrict ourselves to this type of choice (which is in fact quite general).
Now, we further dene a penalty function or a regularization function R : Θ → R+ , where
R(θ) gives higher values to models (i.e., parameters θ) that are more complicated, in the suitable
sense for the specic problem. Now, we can solve the regularized problem where we try to
minimize the loss under the constraint that the model is not too complex. One way to do that
is to solve the optimization problem

θ̂ = argmin L(x, θ), such that R(θ) ≤ c (10.4)


θ

where c is the maximal complexity level. Alternatively, we can solve

θ̂ = argmin L(x, θ) + λR(θ) (10.5)


θ

where λ is the regularization parameter, the determines the exchange rate between loss and
complexity. This is the Lagrangian version of (10.4), hence if we can solve (10.5) for any λ
then we can essentially use it to solve (10.4), say by gradient descent over λ, until we meet the
constraint. We therefore focus on (10.5), which is easier to work with.

10.2 Regularization in the Linear Model


Recall the linear model in AWGN

X = Hθ + W , (10.6)

where W ∼ N (0, C) and C is a known covariance matrix. Let Q be a p.s.d matrix such that
QT Q = C −1 . The associated parametric family is

 
1
p(x | θ) = const · exp − (Hθ − x)T C −1 (Hθ − x) (10.7)
2
 
1 2
= const · exp − kQ(Hθ − x)k . (10.8)
2

Hence the log-likelihood loss function is

1
L(x, θ) = kQ(Hθ − x)k2 . (10.9)
2
As we have seen, the estimator that minimizes this quadratic loss (or the energy of the residual
in the linear regression lingo) is

θ̂ML = θ̂MVU = (H T C −1 H)−1 H T C −1 x. (10.10)

Let us add some regularization. The simplest regularization function to analyze is also quadratic
like the loss:

R(θ) = kΓθk2 , (10.11)


10.2. REGULARIZATION IN THE LINEAR MODEL 93

where Γ is some r×d regularization matrix. Hence, we now want to solve

1
θ̂REG (x) , argmin kQ(Hθ − x)k2 + λ kΓθk2 (10.12)
θ 2
= argmin kQ(Hθ − x)k2 + 2λ kΓθk2 (10.13)
θ

where the term


1
2 kQ(Hθ − x)k2 is the loss function, λ is the regularization parameter, and kΓθk2
measures the complexity of the model. This choice of regularization is called the Tikhonov
regularization, and loosely speaking, the choice of Γ controls what type of complexity we are
trying to penalize. Here are some typical choices:

• Choosing Γ = Id×d to be the identity matrix (which also known as ridge regression ), we
have

d
X
R(θ) = θi2 . (10.14)
i=1

This corresponds to the case where we do not believe that the parameters can be too large,
or perhaps we even believe that the true model is sparse, namely only few of the parameters
are active at the same time (in this case, there are better regularization functions, as we
we will later see).

• Consider the choice


 
1 −1 0 ··· 0
0 1 −1 0 · · · 0 
 
 0 1 −1 · · · 0 
Γ = . . . (10.15)
 
.. ..
 .. . . . 
.
 
 1 −1
0 ··· 0 1

In this case we have


X
kΓθk2 = (θi+1 − θi )2 (10.16)

This choice can be though of as penalizing by the energy of the discrete derivative of θ
along the index axis, and can be used e.g. when there is a physical reason the believe that
hte parameter vector is smooth.

• Consider the r×d regularization matrix Γ with entries


 
j2πfk `
Γk,` = exp − . (10.17)
d

where here j = −1, and fk ∈ [−1, 1] are some xed frequencies. This regularization
function penalizes solutions that have too much power in certain frequency bands (again,
thinking of the parameter vector as a function of the index axis).

Returning to our regularization problem (10.13), we have

kQ(Hθ − x)k2 + 2λ kΓθk2 = θ T H T QT QHθ − 2θ T H T QT Qx + kQxk2 + 2λθ T ΓT Γθ (10.18)

= θ T H T C −1 H + 2λΓT Γ θ − 2θ T H T C −1 x + kQxk2 .

(10.19)
94 CHAPTER 10. REGULARIZATION AND THE BAYESIAN APPROACH

This yields the solution

−1 T −1
θ̂REG (x) = H T C −1 H + 2λΓT Γ H C x. (10.20)

Specically, for AWGN C = σ 2 In×n we have

−1 T
θ̂REG (x) = σ −2 · σ −2 H T H + 2λΓT Γ H x= (10.21)
−1
= H T H + 2λσ 2 · ΓT Γ HT x

(10.22)

Typically, one absorbs the variance and constants into the regularization parameter, i.e., using
λ instead of 2λσ 2 .

10.2.1 Over-parameterization: The Minimum Norm Estimator


As we mentioned, one of the reasons to use regularization is to handle over-parameterization,
when the number of parameters is large w.r.t. the number of samples, in which case there may
be many solutions and we are prone to over-tting eects. A common approach in such an over-
parameterized regime is to choose the solution with the minimal possible (say quadratic) norm.
Specically, consider the linear model (10.6) where now the number of parameters p is (possibly
much) larger than the number of samples n. In this case there is a subspace of dimension p − n of
exact solutions in the parameter space the maximize the likelihood (regardless of the covariance
matrix Q). The the minimum norm estimator is given by

θ̂min−norm = argmin kθk2 s.t. Hθ = x (10.23)


θ

This optimization problem cab be explicitly solved (try at home), yielding the estimator

θ̂min−norm = H T (HH T )−1 x. (10.24)

Note that this is reminiscent of the standard LS solution, where the right pseudo-inverse is used
instead of the left pseudo-inverse.
As it turns out, the minimum norm estimator can also be viewed through the lens of regu-
larization. It can be shown (try at home) that

θ̂min−norm = lim argmin kHθ − xk2 + λ kθk2 (10.25)


λ↓0 θ
= lim(H H + λI)−1 H T x.
T
(10.26)
λ↓0

Namely, the minimum-norm estimator is asymptotically equal to the ridge-regularized LS es-


timator in the limit of vanishing regularization parameter. This makes intuitive sense; in the
over-parameterized regime there are many solutions that give zero-error, hence penalizing the
solution norm gives preference to lower-norm solutions, and making the penalty arbitrarily small
(but nonzero) makes sure the solution remains almost zero-error. As an exercise, try to see what
kind of minimum-norm estimator is obtained from the more general Tikhonov regularization in
the limit of vanishing λ.

10.2.2 Stabilization Interpretation


Recall the following result.
10.2. REGULARIZATION IN THE LINEAR MODEL 95

Lemma 10.1. Let A, B ∈ Rn×n be symmetric, and λmin (A), λmin (B) their minimal eigenvalues,
respectively. Then

λmin (A) = min v T Av. (10.27)


v∈Rn :kvk=1

and

λmin (A + B) ≥ λmin (A) + λmin (B). (10.28)

Proof. Since A is symmetric, there exists an orthonormal basis {ui }ni=1 of eigenvectors of A, with
n
respective real-valued eigenvalues λmin (A) = λ1 ≤ λ2 ≤ · · · ≤ λn . Any unit vector v ∈ R can
Pn P 2
be written as v = i=1 αi ui , where αi = 1. Then
n
X
T
min v Av = min αi2 λi (10.29)
kvk=1 αi : α2i =1
P
i=1
= λ1 , (10.30)

proving the rst claim. Now, write

λmin (A + B) = min v T (A + B)v (10.31)


kvk=1

≥ min v T Av + min v T Bv (10.32)


kvk=1 kvk=1

= λmin (A) + λmin (B). (10.33)

Suppose now that in the linear model with AWGN, the matrix HT H is close to singular, i.e.,

λmin (H T H) ≈ 0, (10.34)

or is perhaps even zero exactly, say when there are more parameters than samples. This makes
the ML/MVU solution very unstable. Using a Tikhonov regularization with any non-singular
matrix Γ (and absorbing 2σ 2 into λ for readability), yields

λmin (H T H + λΓT Γ) ≥ λmin (H T H) + λ · λmin (ΓT Γ) (10.35)

which can be made suciently large (hence stable) for a large enough λ > 0. Of course, increasing
λ too much is also problematic, as the solution will tend to minimize only the regularization
constraint and hardly take the loss function into account.
For Γ = I, we can further interpret the solution using the singular value decomposition
(SVD).

Lemma 10.2 (SVD) . Let A be any n×d real matrix. Then

A = U DV T (10.36)

form some is n×n unitary matrix U, d × d unitary matrix V , and n × d diagonal matrix D with
nonnegative diagonal entries (called the singular values of A).

Note that

AT A = V DT U T U DV T = V DT DV T (10.37)
96 CHAPTER 10. REGULARIZATION AND THE BAYESIAN APPROACH

and similarly AAT = U DDT U T , hence the eigenvalues of the p.s.d matrix AT A (and also AAT )
are the squared singular values of A.
Let us write the SVD of the observation matrix H as H = U DV T . Then for Γ = I and
C= σ2 ·I (absorbing the constants again) we have

−1
θ̂REG = H T H + λI HT x (10.38)
−1
= V DT DV T + λI V DT U T x (10.39)
−1
= V DT DV T + λV V T V DT U T x (10.40)
−1
= V DT D + λI V T V DT U T x
 
(10.41)
T −1 T T T
= V (D D + λI) V V D U x (10.42)
−1 T T
= V DT D + λI D U x. (10.43)
| {z }
−1
,Dλ ∈Rd×n

Let us write D = diag {σ1 , ..., σd } (we are assuming here that d ≤ n, but this is not necessary).
Then

θ̂REG = V · Dλ−1 U T x. (10.44)

where
 
σ1 σd
Dλ−1 = diag 2 , ..., 2 . (10.45)
σ1 + λ σd + λ

When λ = 0 we get Dλ−1 = D−1 . In this case if some σi ≈ 0 then


1
σi is very large, and the
inversion is unstable. We can see how the regularization stabilizes the singular values, making
the diagonal of Dλ−1 bounded. The diagonal expression looks a bit like a Wiener coecient. We
will see that indeed there is a connection to MMSE estimation, though it is a bit hidden.

10.2.3 Tuning the Regularization Parameter


How do we nd the right λ for out problem? One very heuristic way is to plot the loss vs.
the regularization cost (usually in log-scale) which typically results in an L-shaped curve (see
Figure 10.1). A good value of λ is arguably where the loss starts growing fast while the
regularization cost does not decrease much anymore. Other ways include cross-validation and
various information criteria.

10.3 Bayesian Interpretation of Regularization


In general, we wanted to solve

θ̂REG (x) = argmin {− log p(x | θ) + λR(θ)} (10.46)


θ
n  o
= argmin − log p(x | θ) − log e−λR(θ) (10.47)
θ
n  o
= argmin − log p(x | θ) · e−λR(θ) (10.48)
θ
( !)
e−λR(θ)
= argmin − log p(x | θ) · R −λR(θ) (10.49)
θ e dθ
| {z }
, p(θ)
10.3. BAYESIAN INTERPRETATION OF REGULARIZATION 97

log kΓθk2
regularization cost

small λ

best λ
large λ

log kHθ − xk2


loss

Figure 10.1: L-curve for regularization parameter tuning

= argmin {− log (p(x | θ) · p(θ))} (10.50)


θ
= argmin {− log (p(θ | x) · p(x))} (10.51)
θ
= argmin {− log p(θ | x)} (10.52)
θ
= argmax {p(θ | x)} (10.53)
θ
= θ̂MAP (x). (10.54)

e−λR(θ) dθ
R
where (10.49) holds since is assumed to be nite. This is the maximum a-posteriori
(MAP) estimate of θ, based on the prior

e−λR(θ)
p(θ) = R . (10.55)
e−λR(θ) dx

Namely, we could have equivalently:

a. Induce some prior belief p(θ) on θ.

b. Find the posterior p(θ | x) given the observation, using Bayes rule.

c. Find θ that maximizes the posterior.

Hence, regularization is seen to be equivalent to a Bayesian framework under MAP estimation,


for a suitable choice of prior.
Which loss function (in the standard parameter estimation form) is the MAP estimator opti-
 
mizing? Consider the hit-or-miss (or the 0/1) parameter loss function `(θ̂, θ) = 1 |θ̂ − θ| > δ
for some small δ > 0. Then the expected loss of an estimator θ̂ under a prior p(θ) is

h i  
Ep(x,θ) l(θ̂(X), θ) = Pr θ̂(X) − θ > δ (10.56)
 
= 1 − Pr θ̂(X) − θ ≤ δ (10.57)
h  i
= 1 − EX Pr θ̂(X) − θ ≤ δ | X (10.58)
h i

= 1 − EX 2δ · pθ | X (θ̂(X) | X) (10.59)
98 CHAPTER 10. REGULARIZATION AND THE BAYESIAN APPROACH

l(θ̂, θ)

±δ

θ̂
θ

Figure 10.2

 
≥ 1 − 2δEX max p(θ | X) , (10.60)
θ

where the last inequality is attained with equality for

θ̂(x) = θ̂MAP (x) = argmax p(θ | x). (10.61)


θ

However, if we are already convinced that regularization is equivalent to a Bayesian approach,


we could use other perhaps more reasonable loss functions, such as the absolute loss, quadratic
loss, etc. For the quadratic loss, the optimal estimator is the MMSE estimator, or the posterior
conditional expectation.
Let us get back to the linear model X = Hθ + W with Tikhonov regularization. What is
the distribution induced by the regularization matrix and parameter?

R(θ) = kΓθk2 = θ T ΓT Γθ (10.62)

p(θ) ∝ exp{−λR(θ)} (10.63)


T T
= exp{−λθ Γ Γθ} (10.64)
 
1
= exp − θ T (2λΓT Γ)θ (10.65)
2

This is a Gaussian distribution with covariance matrix


 Cθ , (2λΓT Γ)−1 . Thus, Tikhonov regu-
−1 
larization induces the prior θ ∼ N 0, 2λΓT Γ . We see that a large value of λ corresponds

to small variance of the prior, i.e., a stronger prior (as expected). Conversely, when λ → 0
there is essentially an almost uniform prior (very wide distribution). Recall we had

−1 T −1
θ̂REG = H T C −1 H T + 2λΓT Γ H C x. (10.66)

Hence we immediately deduce that

−1
θ̂MAP = H T C −1 H T + Cθ−1 H T C −1 x. (10.67)

Suppose now we proceed with the Bayesian interpretation, but wish to use the quadratic loss
function, i.e., the MMSE estimator. Recall that θ̂MMSE = E [θ | X] and that θ̂MAP = Mode(θ | x),
which are generally dierent. But for any Gaussian r.v., Mode = Mean. Hence

θ̂MMSE = θ̂MAP . (10.68)


10.4. PENELTY FUNCTION OR PRIOR THAT ENCOURAGES SPARSITY 99

Let us check this the hard way:

θ̂MMSE = Cov [θ, X] · Cov−1 [X, X] · x. (10.69)

Cov [θ, X] = E θ · X T
 
(10.70)

= E θ θT H T + W T
 
(10.71)
T
= Cθ · H (10.72)
h i
Cov [X, X] = E (Hθ + W ) (Hθ + W )T (10.73)

= HCθ H T + C. (10.74)

Hence

−1
θ̂MMSE = Cθ · H T HCθ H T + C x. (10.75)

This looks dierent from θ̂MAP . In a homework exercise, you will prove they are the same using
the matrix inversion lemma. The covariance of the error (in both cases) is given by (verify!)
h i −1
Cov θ̂, θ̂ = H T C −1 H + Cθ−1 (10.76)
−1
= Cθ − Cθ H T HCθ H T + C HCθ (10.77)

where (10.76) is the MAP version and (10.77) is the MMSE version. Using the MAP interpreta-
tion, we can see that

−1 kCθ k→∞


H T C −1 H + Cθ−1 H T C −−−−−−→ (H T C −1 H)−1 H T C (10.78)

and

kCθ k→∞
θ̂MMSE = θ̂MAP −−−−−−→ θ̂ML (10.79)

as we expect.

10.4 Penelty Function or Prior That Encourages Sparsity


Consider the linear model and suppose we know or believe that the true parameter vector θ ∈ Rd
should be sparse, namely has only a small number of nonzero entries. This is a common situation
that arises e.g. when high-dimensional data is known to lie in a low dimensional space. How
do we incorporate this into our estimation procedure? One might think that using Tikhonov
2
regularization with Γ = I , i.e., requiring that kθk is small, should do that trick. This is however
not true; it is almost always the case that such a regularization (with any nite value of λ) will
result in an estimator whose entries are all nonzero. An intuitive explanation for this is that the
loss function is typically linear in θi in the neighborhood of θi ≈ 0, whereas the regularization
function is quadratic in θi . Thus, for any solution where θi = 0, one can slightly perturb the
value of θi to linearly decrease the loss function while only quadratically increasing the penalty
function, which for a small enough perturbation would decrease their linear combination.
A brute force solution to the sparsity problem is to use the L0 norm, i.e., R(θ) = kθk0 =
1 (θi 6= 0), hence solve
P
i

kHθ − xk2 + λ kθk0 . (10.80)


100 CHAPTER 10. REGULARIZATION AND THE BAYESIAN APPROACH

The is however a dicult problem since we need to iterate over all possible subset selections,
which is exponential in d. Therefore, a common relaxation is to use the L1 norm, i.e.,

X
R(θ) = kθk1 = |θi | (10.81)
i

and solve
n o
min kHθ − xk2 + λ kθk1 (10.82)

The problem with this relaxation is that there no explicit solution; however, ecient algorithms
do exists (we will discuss one later on). Also, as we shall later see this method encourages a
sparse solution. Note that the associated prior in this case is Laplacian:

p(θ) ∝ e−λkθk1 . (10.83)


11. Conjugate Priors

In Bayesian estimation, we impose a prior distributions p(θ) on the parameter θ, which induces
a joint distribution on the p(x, θ) = p(θ)p(x | θ) on parameter and the samples. Thus, given the
samples we can in principle compute the posterior distribution

p(θ)p(x | θ)
p(θ | x) = (11.1)
p(x)
p(θ)p(x | θ)
=R (11.2)
p(θ0 )p(x | θ0 )dθ0

of the parameter given the samples, and then pick an estimator θ̂ that minimizes the expected
value of some loss function of our choosing (quadratic, hit-or-miss, absolute deviation, etc.).
However, the computation of the posterior distribution can sometimes be very dicult, since
computing the marginal distribution p(x) requires computing the integral above that in many
cases does not admit a closed-form solution. In some interesting special cases however, this
computation is very simple.
Let {p(x | θ)}θ∈Θ be a parametric family, and let {M (µ)} be a family of distribution on
θ, parameterizes by some hyper-parameter µ (e.g., N (0, µ)). Suppose that we choose some
p(θ) ∈ M (µ) as a prior on the parameter θ, and then observe the sample x and compute the
posterior p(θ | x). If for any p(θ) ∈ M (µ), the corresponding posterior p(θ | x) ∈ M (µ) is also in
the same family of distributions on θ , then we say that M (µ) is a family of conjugate priors for
p(x | θ). When structured families of conjugate priors exist, there is often a simple update rule
for the hyper-parameters.

Example 11.1 .
(Random Constant in AWGN) Consider the model X = θ+W where W ∼
2 ),
N (0, σW where
2
σW is unknown. What are the natural conjugate priors here? We are looking
for a family of distributions on θ such that, when adding an independent Gaussian to θ, we
remain inside the family. Obviously, the answer is also Gaussian  the conjugate priors are
N (µ, σθ2 ) with µ, σθ2 as hyper-parameters. What is the update rule? If p(θ) ∼ N (µ, σθ2 ) then

p(θ | x) ∼ N (E [θ | X = x] , Var [θ | X = x]) (11.3)

σθ2 2
 
σW 2
=N µ+ 2 2 · (x − µ) , σ 2 + σ 2 · σθ . (11.4)
σθ + σW θ W

Example 11.2 (Coin toss) Let. {Xk }nk=1 ∼


i.i.d
Bern(θ), hence

p(x | θ) = θkxkH (1 − θ)n−kxkH . (11.5)

Let us now impose some prior p(θ) on the bias parameter θ. We have

p(x | θ) · p(θ)
p(θ | x) = R (11.6)
p(x | θ0 )p(θ0 )dθ0

101
102 CHAPTER 11. CONJUGATE PRIORS

P P
xk
= c(x) · p(θ) · θ (1 − θ)n− xk
. (11.7)

What would be a choice of prior that would result in a similarly looking posterior? A reasonable
idea is to set p(θ) ∝ θa−1 (1 − θ)b−1 (where the −1 is for consistency with the literature), which
yields

P P
p(θ | x) ∝ θa−1+ xk
(1 − θ)b−1+n− xk
(11.8)

= θã−1 (1 − θ)b̃−1 (11.9)

P P
where ã = a + xk and b̃ = b + n − xk . Therefore, the following family of distributions

θa−1 (1 − θ)b−1
p(θ) = (11.10)
R1
θa−1 (1 − θ)b−1 dθ
0

is a conjugate prior to the i.i.d Bernoulli trials (and hence also to the Binomial distribution). This
is known as the Beta(a, b) distribution where a, b are hyper-parameters. The hyper-parameters
update rule is very simple:

n n
!
X X
(a, b) → a+ xk , b + (1 − xk ) . (11.11)
k=1 k=1

n
P n
P
Note that xk is the number of successes and (1 − xk ) is the number of failures. Thus,
k=1 k=1
we can informally think of a−1 and b−1 as the number of successes and failures that have
occurred before we started observing the data, i.e., our prior belief as to how the sample might
behave. Note that the special case where a=b=1 pertains to observing zero successes and
failures before we start, which intuitively means that we are completely ignorant; and indeed,
Beta(1, 1) = Unif([0, 1]) is the uniform distribution over the parameter space.
Once we have obtained the posterior distribution, we can apply the optimal estimator per-
taining to our loss function. For example, if our loss function is quadratic, then we are interested
in the MMSE estimator, which is the posterior conditional expectation. It can be shown that
a
E [Beta(a, b)] = a+b , hence in this case

n n
" !#
X X
θ̂MMSE (x) = E Beta a+ xk , b + (1 − xk ) (11.12)
k=1 k=1
n
P
a+ xk
k=1
= n n (11.13)
P P
a+ xk + b + (1 − xk )
k=1 k=1
n
P
xk + a
k=1
= (11.14)
n+a+b
n a+b
= · Mean[ x ] + · E [Beta(a, b)] , (11.15)
n+a+b n+a+b

which is a certain weighted average of the prior mean and the sample mean, converging to the
sample mean as n → ∞. If our loss function is hit-or-miss, then we are interested in the MAP
103

a−1
estimator, which is the posterior mode. It can be shown that Mode(Beta(a, b)) = a+b−2 , hence
in this case

n n
!!
X X
θ̂MAP = Mode Beta a+ xk , b + (1 − xk ) (11.16)
k=1 k=1
P
xk + a − 1
= n (11.17)
P P
xk + a + (1 − xk ) + b − 2
k=1
n
P
xk + a − 1
k=1
= (11.18)
a+b+n−2
n a+b−2
= · Mean[ x ] + · Mode(Beta(a, b)). (11.19)
n+a+b−2 n+a+b−2
which is weighted sum of the prior mode and the sample mean. For the fully ignorant prior
a = b = 1, we get

θ̂MAP = Mean[ x ], (11.20)

which coincides with the MVU (and also ML) estimator for the case where no prior is assumed.

In the two examples we have seen, the parametric families have been exponential families.
This is not a coincidence  it turns out that every exponential family admits a suitable family of
p(x | θ) = h(x) · exp η T (θ) · T (x) − a(θ)

conjugate priors. To see this, let be some exponential
family. Then

p(θ | x) = c(x) · p(x | θ) · p(θ) (11.21)

= c(x) · h(x) · exp η T (θ) · T (x) − a(θ) · p(θ).



(11.22)

A reasonable choice for a conjugate prior would be similar to the exponential family, i.e.,

p(θ) = g(τ , ν) · exp η T (θ) · τ − νa(θ)



(11.23)

where the vector τ and the scalar ν are the hyper-parameters, and g is the proper normalization.
Let us check that this is indeed a conjugate prior, and nd the update rule for the hyper-
parameters:

p(θ | x) = c(x)h(x)g(τ , ν) · exp η T (θ) · (τ + T (x)) − (ν + 1)a(θ) .



(11.24)

This is clearly also in the same family. The update rule is therefore

τ 0 = τ + T (x) (11.25)
0
ν = ν + 1. (11.26)

More generally, if we are given n i.i.d. observations, the update rule is simply

n
X
τ0 = τ + T (xk ) (11.27)
k=1
ν 0 = ν + n. (11.28)

Namely, the hyper-parameter vector τ simply accumulates the sucient statistic, and the scalar
hyper-parameter ν essentially simply counts the number of samples.
104 CHAPTER 11. CONJUGATE PRIORS

Getting back to the coin toss example, we have that

n
X
0
τ =τ+ xk (11.29)
k=1
ν 0 = ν + n. (11.30)

How is this related to a and b that we had? It is a simple change of variables, where τ =a and
ν =a+b (verify at home also that we get Beta distribution indeed).

Example 11.3 .
(Poisson distribution) Suppsoe that X ∼ Poisson(θ) for some θ > 0, i.e.,

θx −θ
p(x | θ) = e x ∈ {0, 1, . . .} (11.31)
x!
Let us write it in exponential family format:

1
p(x | θ) = exp {x · log θ − θ} . (11.32)
x!
Hence the sucient statistic is T (x) = x, the natural parameter is η = log θ, and the log-partition
function is a(θ) = θ. The conjugate prior is therefore

p(θ) ∝ exp {τ log θ − νθ} (11.33)

= θτ e−νθ (11.34)

which is the Gamma distribution.

Remark 11.4. We have seen that for exponential family likelihoods, we get exponential family
conjugate priors. The set of priors for which simple update rules exist can sometimes be extended
beyond exponential families, by considering mixture distribution over the exponential family
members. Namely, we can set some additional hyper-prior that chooses the hyper-parameters
of the exponential family prior. Since the space of the distribution is closed under convex
combinations, the update formula would simply update each member separately. For example,
consider the coin toss model with a Beta(a, b) prior (which is a conjugate prior as we have seen).
Set some distribution p(a, b) over the hyper-parameters (a, b). Then (informally)

Z
p(θ) = Beta(a, b) · p(a, b)dadb (11.35)

n n
Z !
X X
p(θ | x) = Beta a+ xk , b + (1 − xk ) · p(a, b)dadb. (11.36)
k=1 k=1

p(θ) is not an exponential family, yet tracking the Beta distribution hyper-parameters is easy,
and the integral p(θ | x) may be easy to compute. For example, if p(a, b) is a nite mixture then
the integral becomes a simple summation.
12. Hypothesis Testing
In this chapter we discuss the problem of hypothesis testing, which is an estimation problem
with a discrete parameter, i.e., where the parameter space Θ has nite cardinality. Each value
of the parameter corresponds to a hypothesis about the way out samples behave, and our goal
will be to decide which hypothesis is the correct one. We will restrict ourselves here to the
binary hypothesis testing setup, where our parameter space Θ = {H0 , H1 } contains only two
hypotheses.

12.1 The Neyman-Pearson Setting


In this section, we discuss the classical binary hypothesis testing setup where there is no prior
distribution of the hypotheses, and where each hypothesis is simple, i.e., uniquely determines the
distribution of the samples. Later, we will discuss the Bayesian setup that imposes a prior on
the hypotheses, and the composite setup, where the distribution under a hypothesis may not be
fully known.
Here, we assume the following model for the samples under each hypothesis:

H0 : X ∼ p0 (x) = p(x | θ = H0 ) (12.1)

H1 : X ∼ p1 (x) = p(x | θ = H1 ). (12.2)

Below, we will suppress the parameter θ as its value has no numerical meaning.
H0 is traditionally called the null hypothesis, and it usually represents the normal state of
nature, namely where no interesting phenomenon is being observed. H1 is traditionally called
the alternative hypothesis, and represents the state of nature where an interesting phenomenon
is present.
What are we going to use as an optimality criterion? Clearly, the quadratic loss is meaningless
in this discrete setting. Let us think of our estimator θ̂ ∈ {H0 , H1 } as a decision rule, namely a
decision of whether we believe in H0 or in H1 after having observed the sample. There are four
possible events:

H0 true H1 true
type II error
accept H0 miss detection
great!
θ̂ = H0 false negative
type I error
accept H1 false alarm great!
θ̂ = H1 false positive

As the table suggests, there are several common names for the error events. Here we will use
the false-alarm vs. mis-detection notation, as it is in line with our intuitive interpretation of the
null and alternative hypotheses. The associated error probabilities are
 
PFA (θ̂) = Pr θ̂ = H1 | H0 (12.3)

105
106 CHAPTER 12. HYPOTHESIS TESTING

 
PMD (θ̂) = Pr θ̂ = H0 | H1 . (12.4)

There is an inherent tension between the false-alarm and mis-detect probabilities. Clearly, for
example, we can attain PFA = 0 by always deciding θ̂ = H0 , but that would result in the worst
possible PMD = 1 (and vice versa). More generally, it is intuitively clear that if we want to
improve one these error probabilities, we would have to pay in the other one. Our goal now
would be to characterize the best possible tradeo between the two, i.e., to x one and minimize
the other.
Some typical examples for a binary hypothesis testing problems are:

• Blood test: Suppose that X represents the numerical value of some blood test result testing
for some disease. If this test is to be useful, we expect to have (signicantly) dierent
distributions of X in healthy vs. sick people. In this case we dene

H0 : person is healthy X ∼ p0 (x) (12.5)

H1 : person is sick X ∼ p1 (x). (12.6)

Given the test result X, we would like to devise a test with a low mis-detection (false
negative) probability. However, aiming at a very low mis-detect may result in a high
false-alarm (false positive) probability, which may cause panic and result in unnecessary
additional test and procedures. In many practical cases, the operating point on the mis-
detect vs. false-alarm curve is determined by balancing the expected medical costs (which
is not necessarily what we as patients would want to optimize).

• The radar problem (signal detection): Suppose we are trying to acquire a target via radar.
We transmit a signal that hits the target, if it exists, and gets reected back to us. Letting
X represent the radar measurement, we expect X to be pure noise if there is no target,
and to be signal plus noise in case a target exists. Thus we dene

H0 : X = W (noise) (12.7)

H1 : X = S + W (signal+noise). (12.8)

where W is the ambient noise (say Gaussian etc.) and S is say a known constant (or a
signal with a known distribution). Given X, we would like to determine if we see a target.
Again, we would like to devise a decision rule with a low mis-detect to avoid intrusion, but
also a low false-alarm to avoid unnecessary sirens going o etc.

• Testing the signicance of an estimator. Suppose that we are given X ∼ p(x | θ) where
θ is a continuous parameter, and we came up with some estimator θ̂(x), say under an
MSE criterion or similar. We would now like to decide whether this estimator is worth
anything. Once approach can be to test the estimator against some null model θ0 , i.e.,
to consider the binary hypothesis testing problem

H0 : X ∼ p(x | θ0 ) (12.9)

H1 : X ∼ p(x | θ̂(X)). (12.10)

For example, in the linear model X = Hθ + W in AWGN, we may want to test the MVU
estimator θ̂MVU = (H T H)−1 H T X against the null model where θ0 = 0, i.e., where we are
observing only noise X = W .

What is the best tradeo between PMD and PFA ? Consider the achievable region in R2
containing all the pairs (PFA , PMD ) ∈ [0, 1]2 that can be attained by some decision rule. The
12.1. THE NEYMAN-PEARSON SETTING 107

tradeo: PFA PMD

p0 (x)
p1 (x)

x
H0 H1

Figure 12.1

PMD
1
acheivable
region

non pareto optimal


acheivable boson decay
region
PFA
1

Figure 12.2

interesting part of the achievable region is its boundary, which consists of the set of pairs such
that we cannot improve (decrease) both PFA and PMD . This boundary must be a convex curve,
since we can always achieve any point along a straight line connecting any two points on the
boundary, by simply randomizing between any the two associated decision rules (to achieve all
the points on the curve we sometimes need to randomize, e.g. in the discrete sample case). Note
in particular that the straight line PMD = 1 − PFA can always be attained. e.g. by a random
decision independent of the samples.
In light of the above discussion, our problem is

min PMD (θ̂) s.t. PFA (θ̂) ≤ p. (12.11)


θ̂:X →{H0 ,H1 }

for some xed desired false-alarm level p. A decision rule achieving this minimum is called in the
statistics lingo the most powerful test at level p. We will nd it more convenient to equivalently
think of a decision rule in terms of its decision regions, i.e., the regions of samples where the rule
decides in favor of a certain hypothesis. In the binary setup, we usually think of the acceptance
region

A = {x ∈ X : θ̂(x) = H0 } (12.12)

where the rule accepts the null hypothesis, and the rejection region Then in this notation

Ā = {x ∈ X : θ̂(x) = H1 } (12.13)
108 CHAPTER 12. HYPOTHESIS TESTING

where the rule rejects the null hypothesis (and accepts the alternative). Identifying the rule with
its acceptance region, we write with some abuse of notation


PFA (A) = Pr X ∈ Ā | H0 = P0 (Ā) (12.14)

PMD (A) = Pr (X ∈ A | H1 ) = P1 (A), (12.15)

where Pi (B) is the probability that X ∈B under the distribution pi (x). We can now rewrite
our optimization problem (12.16) in terms of the acceptance region:

min PMD (A) s.t. PFA (A) ≤ p. (12.16)


A⊆X

12.1.1 Optimal Decision Rule


The following result characterizes the solutions to this optimization problem (up to possible
additional randomization), or the boundary of the achievable region, and shows that the optimal
decision rules compare the likelihood ratio to a threshold.

Theorem 12.1 (Neyman-Pearson Lemma) . Let τ ≥ 0 be some threshold, and consider the
decision rule with the acceptance region
 
p0 (x)
A, x∈X : >τ . (12.17)
p1 (x)

Then for any other rule with acceptance region B ⊆ X, if

PFA (B) ≤ PFA (A) (12.18)

then

PMD (B) ≥ PMD (A). (12.19)

Proof. The cases where τ =0 and τ =∞ are trivial, hence we assume that 0 < τ < ∞. Let
1A (x) and 1B (x) be the indicator functions of the regions A, B respectively. Then for all x

[1A (x) − 1B (x)] · [p0 (x) − τ · p1 (x)] ≥ 0. (12.20)


| {z } | {z }
L R

To verify: if x ∈ A then R ≥ 0 and L ∈ {0, 1} ≥ 0, if x 6∈ A then R ≤ 0 and L ∈ {0, −1} ≤ 0.


Integrating over x yields
Z
0 ≤ [1A (x) − 1B (x)] · [p0 (x) − τ · p1 (x)] dx (12.21)
Z Z
= (p0 (x) − τ p1 (x)) dx − (p0 (x) − τ p1 (x)) dx (12.22)

x∈A x∈B
= P0 (A) − τ P1 (A) − P0 (B) + τ P1 (B) (12.23)

= τ (P1 (B) − P1 (A)) − (P0 (B) − P0 (A)) (12.24)



= τ (P1 (B) − P1 (A)) − P0 (Ā) − P0 (B̄) (12.25)

where (12.25) holds since P0 (A) = 1 − P0 (Ā). Thus we get

τ [P1 (B) − P1 (A)] ≥ P0 (Ā) − P0 (B̄), (12.26)


12.1. THE NEYMAN-PEARSON SETTING 109

which in false-alarm and mis-detect notation is

τ [PMD (B) − PMD (A)] ≥ PFA (A) − PFA (B) (12.27)

≥ 0, (12.28)

where the last inequality follows from our assumption that PFA (B) ≤ PFA (A). Recalling that
0 < τ < ∞, the result follows.

The following corollary, which follows from the Neyman-Pearson Lemma by noting that
the mis-detect probability decreases as the threshold increases, provides the solution to our
optimization problem (12.16), of nding the optimal acceptance region at a certain desired level
of false-alarm probability. The same result (with roles switched) holds when we x a mis-
detect error probability and minimize the false-alarm. For brevity, we will sometimes denote the
p0 (x)
likelihood-ratio by Λ(x) , p1 (x) .

Corollary 12.2. The solution to (12.16), i.e., the acceptance region minimizing PMD under the
constraint that PFA ≤ p, is given by

A , {x ∈ X : Λ(x) > τ } . (12.29)

where
( Z )
τ = sup 0 ≤ t ≤ ∞ : p0 (x)dx ≤ p . (12.30)
x:Λ(x)≤t

whenever the supremum is attained as a maximum. Otherwise, use the rule Λ(x) ≥ τ for the
acceptance region.

The curve delineated by the above corollary may be discontinuous (e.g., in discrete cases).
A smooth convex characterization of the optimal curve can be more elegantly given if we allow
randomized decision rules (as we have already discussed).

Corollary 12.3. Consider the family of randomized decision rules, parameterized by τ ∈R and
λ ∈ [0, 1], that decides in favor of Hθ̂(x) where


 0 Λ(x) > τ
θ̂(x) = Bern(λ) Λ(x) = τ (12.31)
1 Λ(x) < τ.

Then the boundary of the (convex hull of ) the false-alarm vs. mis-detect achievable region is
exhausted by this family. Explicitly, for any 0 ≤ p ≤ 1, the rule with parameters τ, λ satisfying

p = P0 (Λ(x) < τ ) + λ · P0 (Λ(x) = τ ) (12.32)

attains PFA = p while minimizing PMD among all randomized decision rules that attain this level
of PFA .
Remark 12.4. Note that the solution to (12.32) is unique in τ , and also unique in λ whenever
P0 (Λ(x) = τ ) 6= 0 (otherwise, λ is irrelevant).

Example 12.5 .
(Constant in AWGN) Consider the signal detection example

H0 : Xk = Wk (12.33)

H1 : Xk = s + Wk (12.34)
110 CHAPTER 12. HYPOTHESIS TESTING

i.i.d
{Wk }nk=1 ∼ N 0, σ 2 s, σ 2

where , and are known. Assume without loss of generality that s > 0.
Then
n
( )
1 1 X 2
p0 (x) = n exp − 2 xk (12.35)
(2πσ 2 ) 2 2σ
k=1
n
( )
1 1 X
2
p1 (x) = n exp − 2 (xk − s) . (12.36)
(2πσ 2 ) 2 2σ
k=1

The likelihood ratio test decides in favor of H0 if

p0 (x)
Λ(x) = >τ (12.37)
p1 (x)
log p0 (x) − log p1 (x) > log τ (12.38)
" n n
#
1 X 2 X 2
− 2 xk − (xk − s) > log τ (12.39)

k=1 k=1
n
" #
1 X
2
− 2 2s xk − s n > log τ (12.40)

k=1
Xn
2s xk − s2 n < −2σ 2 log τ, (12.41)
k=1

hence the decision rule can be compactly written as

H1 s σ2
Mean[ x ] ≷ − log τ (12.42)
H0 |2 ns
{z }
,th

where recall we have assumed that s > 0. Namely,


  the optimal test compares the sample mean
2
to a threshold. Noting that Mean[ X ] ∼ N 0, σn under H0 , the false-alarm probability is

PFA = Pr (Mean[ X ] > th | H0 ) (12.43)


 
th
=Q √ , (12.44)
σ/ n
from which we can compute the required threshold as a function of the desired PFA :
σ
th = √ Q−1 (PFA ). (12.45)
n
 2
Noting that Mean[ X ] ∼ N s, σn under H1 , the associated minimal mis-detect probability is

given by

PMD = Pr (Mean[ X ] < th | H1 ) (12.46)


 
th − s
=1−Q √ (12.47)
σ/ n
 
−1 s
= 1 − Q Q (PFA ) − √ . (12.48)
σ/ n
As a sanity check, we note that when s=0 we obtain PMD = 1 − PFA , since in this case the
samples are distributed the same way under both hypotheses. Observe that when PFA decreases
Q−1 (PFA ) increases, hence so does PMD . When σ decreases, or when n increases, then the
threshold th decreases such that PMD is decreased for same level of PFA .
12.1. THE NEYMAN-PEARSON SETTING 111

τ =1

p0 (x) p1 (x)

0 s

τ <1 PFA PMD

p0 (x)
p1 (x)τ

0 θ̂
s

Figure 12.3

12.1.2 Neyman-Pearson for i.i.d Samples


The optimal decision rule in the Neyman-Pearson problem is comparing the likelihood ratio,
or equivalently the log-likelihood-ratio (LLR), to a suitable threshold. Let us see how the LLR
behaves on average under each hypothesis:
Z  
p0 (x)
H0 : E [log Λ(X) | H0 ] = p0 (x) log dx (12.49)
p1 (x)
= D(p0 kp1 ) (12.50)

> 0. (12.51)
Z  
p0 (x)
H1 : E [log Λ(X) | H1 ] = p1 (x) log dx (12.52)
p1 (x)
= −D(p1 kp0 ) (12.53)

< 0. (12.54)

So, we expect the LLR to be centered around a positive (resp. negative) number, when H0
(resp. H1 ) is the true hypothesis. This provides some insight as to why testing the LLR against
a threshold makes sense. When we have multiple i.i.d samples, this intuition becomes more
accurate since, by the WLLN, the normalized LLR of the sample converges in probability to the
expected value, i.e.,
(
1 P D(p0 kp1 ) H0 true
log Λ(X) −
→ (12.55)
n −D(p1 kp0 ) H1 true
112 CHAPTER 12. HYPOTHESIS TESTING

log(th)

log Λ(x)
−D(p1 kp0 ) 0 D(p0 kp1 )

Figure 12.4

asn → ∞. In this case, both PFA and PMD → 0 simultaneously for any suitable threshold in the
interval( −D(p1 k p0 ), D( p0 k p1 )) applied to the LLR. In this case it makes more sense to talk
about the tradeos in the exponential decay of PFA and PMD , i.e.,

1
α , lim − log PFA (12.56)
n→∞ n
1
β , lim − log PMD . (12.57)
n→∞ n

A typical tradeo between these exponents is depicted in Figure 12.5. This tradeo can be
precisely characterized using large deviation techniques (out of scope).
Before we give an example, we need the following simple claim.

Lemma 12.6. Suppose T (X) is a sucient for the two hypotheses, i.e., that the distribution of
X given T (X) does not depend on the hypothesis. Then the likelihood ratio is given by

p(T (x) | H0 )
Λ(x) = . (12.58)
p(T (x) | H1 )
Proof.

p(x | H0 )
Λ(x) = (12.59)
p(x | H1 )
p(x | H0 )p(t | x, H0 )
= (12.60)
p(x | H1 )p(t | x, H1 )
p(x, t | H0 )
= (12.61)
p(x, t | H1 )
p(t | H0 )p(x | t, H0 )
= (12.62)
p(t | H1 )p(x | t, H1 )
p(t | H0 )p(x | t)
= (12.63)
p(t | H1 )p(x | t)
p(t | H0 )
= . (12.64)
p(t | H1 )

Example 12.7 (Coin toss). Consider the hypothesis testing problem in which we view a sequence
of n i.i.d coin tosses, where the Bernoulli parameter is either p or q 6= p. Namely:

i.i.d
H0 : X ∼ Bern(p) (12.65)
12.2. BAYESIAN HYPOTHESIS TESTING 113

α β

H1 H0

log(th)
−D(p1 kp0 ) D(p0 kp1 )

Figure 12.5

i.i.d
H1 : X ∼ Bern(q). (12.66)

As the previous lemma suggests, it suces to look at the LLR of a sucient statistic, which
n
P
in this case is the Hamming weight kxkH = xk . Let us compute the LLR for x where
k=1
kxkH = m:
 
Pr(X = x | H0 )
log Λ(x) = log (12.67)
Pr(X = x | H1 )
 
Pr (kXkH = m | H0 )
= log (12.68)
Pr (kXkH = m | H1 )
!
n m n−m

m p (1 − p)
= log n m n−m
(12.69)
m q (1 − q)
= m log p + (n − m) log(1 − p) − [m log q + (n − m) log(1 − q)] (12.70)
m  m hm  m i
=n· log p + 1 − log(1 − p) − log q + 1 − log(1 − q) (12.71)
n n n  n 
m p  m (1 − p) m q  m (1 − q)
=n· log + 1− log − log + 1− log
n m/n n 1 − m/n n m/n n 1 − m/n
(12.72)

= n (−D (Bern(m/n) k Bern(p)) + D (Bern(m/n) k Bern(q))) (12.73)

Thus, the optimal decision rule is of the form

H1
D (Bern(m/n) k Bern(p)) ≷ th + D (Bern(m/n) k Bern(q)) . (12.74)
H0

th.
for some threshold parameter This is very intuitive, if we interpret D (Bern(m/n)kBern(p))
(resp. D (Bern(m/n)kBern(q))) as a measure of closeness of the empirical distribution to
Bern(p) (resp. Bern(q)).

12.2 Bayesian Hypothesis testing


Instead of dealing with the tradeo between PFA and PMD (or other error events in the case of
more than two hypotheses), the Bayesian approach is to simply put a prior distribution p(θ) on
114 CHAPTER 12. HYPOTHESIS TESTING

the hypotheses, and optimize the average error probability:

Pr(θ̂(X) 6= θ) = Pr(θ̂(X) = 1 | θ = 0) Pr(θ = 0) + Pr(θ̂(X) = 0 | θ = 1) Pr(θ = 1). (12.75)


| {z } | {z }
PFA PMD

What is the optimal decision rule? Clearly, the loss function here is hit-or-miss (a.k.a. 0/1 loss)
hence we expect that MAP would be optimal. Let us show this again. Suppose θ̂ is an arbitrary
decision rule. Then its success probability is
h i
Pr(θ̂(X) = θ) = E Pr(θ̂(X) = θ | X) (12.76)

≤ E [max {Pr (θ = 0 | X) , Pr (θ = 1 | X)}] , (12.77)

which is attained with equality if

θ̂(x) = argmax p(θ | x). (12.78)


θ

This is the MAP decision rule. Let us write it in terms of the prior and the model:

θ̂MAP = argmax p(θ | x) (12.79)


θ
p(x | θ)p(θ)
= argmax (12.80)
θ p(x)
= argmax p(x | θ)p(θ). (12.81)
θ

Since we are in a binary setup, we can further simplify:

H0
p(x | 0)p(0) ≷ p(x | 1)p(1) (12.82)
H1

or

H0 p(1)
Λ(x) ≷ =τ (12.83)
H1 p(0)

Hence, the optimal Bayesian decision rule is also in the Neyman-Pearson form, where the thresh-
old is set by the prior likelihood ratio. In fact, scanning over all the prior distributions will take
us along the entire Neyman-Pearson trade-o curve.

Example 12.8 (Constant in AWGN) .


H0 : Xk = Wk (12.84)

H1 : Xk = s + Wk , (12.85)

i.i.d
{Wk }nk=1 ∼ N 0, σ 2 σ2, s > 0

where , and are known. Suppose θ∼ Bern(p) for some known
p. We already saw that

n
" #
1 X
log Λ(x) = 2 ns2 − 2s xk . (12.86)

k=1

Hence the MAP rule is

n
" #  
1 X H0 p
2
ns − 2s xk ≷ log (12.87)
2σ 2 H1 1−p
k=1
12.3. RELATION TO TOTAL VARIATION DISTANCE 115

2σ 2
 
H1 p
2
2sMean[ x ] − s ≷ − log (12.88)
H0 n 1−p
2
 
H1 s σ p
Mean[ x ] ≷ − log . (12.89)
H0 2 ns 1−p

1
For p= 2 we get

H0 s
Mean[ x ] ≷ . (12.90)
H1 2

When p↓0 then th → ∞, which means we always decide in favor of H0 , as expected. Check
other limit cases for intuition.

Generalizations of the simple setup we considered are:

1. Other loss functions


h  : Here
i we used the 0/1 (hit-or-miss) loss function, i.e., we wanted

to minimize E 1 θ̂ 6= θ . We can consider more general loss functions `(θ̂, θ) that give

dierent weights to dierent error events: and look for a decision rule that minimizes the
expected loss:

h  i
θ̂ = argmin E ` θ̂(X), θ . (12.91)
θ̂(x)

You will nd this optimal decision rule explicitly (for the binary case) in a homework
exercise.

2. Multiple hypotheses: In this setup again θ ∼ p(θ), and we have M hypotheses we wish
to distinguish between:

{Hi : X ∼ p(x | i) }M
i=1 . (12.92)

It is easy to show (homework again) that for multiple hypotheses, the MAP rule is again
optimal under the 0/1 loss, i.e., it minimizes the average error probability.

θ̂MAP = argmax p(θ | x) (12.93)


θ
= argmax p(x | θ)p(θ). (12.94)
θ

12.3 Relation to Total Variation Distance


The total variation distance between two distributions, also known as the statistical distance, is
a natural and useful measure of dissimilarity between the distributions. It is dened as half the
L1 distance between the distributions, i.e.,

Z
1
TV(p, q) = |p(x) − q(x)|dx. (12.95)
2

It is easy to check that 0 ≤ TV(p, q) ≤ 1, which is one reason for the half-factor. One important
interpretation for the total variation distance is given by relating it to the two types of error
probabilities in simple binary hypothesis testing.
116 CHAPTER 12. HYPOTHESIS TESTING

H0 H1

s
2 s

Figure 12.6

Lemma 12.9. Any decision rule to distinguish between p0 and p1 satises

1 − TV(p0 , p1 ) ≤ PFA + PMD ≤ 1. (12.96)

Moreover, the lower bound can be achieved with equality.

Proof. The upper bound is trivial. For the lower bound, let us pick a uniform prior on θ, which
translates to an optimal threshold of τ = 1. In this case, the average error probability of any
estimator, which is also the arithmetic mean of its false-alarm and mis-detect probabilities, is
lower bounded by that of the optimal estimator, i.e.,.

Z Z !
1 1
(PFA + PMD ) ≥ p1 (x)dx + p0 (x)dx (12.97)
2 2 x:p0 (x)>p1 (x) x:p0 (x)≤p1 (x)
Z
1
= min (p0 (x), p1 (x)) dx (12.98)
2
Z
1
= [min (p0 (x), p1 (x)) + max (p0 (x), p1 (x)) − max (p0 (x), p1 (x))] dx (12.99)
2
Z Z
1 1
= (p0 (x) + p1 (x)) dx − max (p0 (x), p1 (x)) dx (12.100)
2 2
Z
1
=1− max (p0 (x), p1 (x)) dx (12.101)
2
Summing up the (identical) terms in (12.98) and (12.101), we obtain
Z
1
PFA + PMD ≥1− [max (p0 (x), p1 (x)) − min (p0 (x), p1 (x))] dx (12.102)
2
Z
1
=1− |p0 (x) − p1 (x)|dx (12.103)
2
= 1 − TV(p0 , p1 ). (12.104)

Clearly, this performance is achieved by the optimal Bayesian rule designed for a uniform prior
(regardless of whether the setup is Bayesian on not).

Remark 12.10. In fact, Lemma 12.9 implies that the total variation distance is the largest
possible gap between the probabilities assigned to an event by both p and q, i.e.,

TV(p, q) = sup |P (A) − Q(A)| (12.105)


A
12.4. COMPOSITE HYPOTHESIS TESTING 117

where the supremum is taken over all (measurable) subsets of the sample space X (try to show
this!). In the literature (12.105) is sometimes taken as the denition of the total variation
distance, and then (12.95) is proved as a consequence.

12.4 Composite Hypothesis Testing


In this setup, the distributions of the samples under each hypothesis are not given exactly, but
are only known to belong to some family of distributions. One way to formally represent this is by
partitioning some underlying parameter space Θ to a disjoint union of two classes Θ = Θ0 ∪ Θ1 :

H0 : X ∼ p(x | θ), θ ∈ Θ0 (12.106)

H1 : X ∼ p(x | θ), θ ∈ Θ1 . (12.107)

Namely, we are interested only in a binary function of the paramter θ, i.e., just to know if it
belongs to Θ0 or Θ1 .
The Neyman-Pearson result is not applicable here since we do not know which distribution
was chosen from the class. What can we do? Before we discuss some general approaches, we
note that in some very special cases we can nevertheless go back to Neyman and Pearson.

Example 12.11 .
(Constant in AWGN) Consider the following hypotheses

H0 : Xk = Wk (12.108)

H1 : Xk = s + Wk (12.109)

i.i.d
{Wk }nk=1 ∼ N 0, σ 2 i.i.d with a known σ 2 , but now the signal s > 0 is unknown. Recall

where
that, had s ≥ 0 been known, the optimal Neyman-Pearson likelihood ratio test would have been

H1
Mean[ x ] ≷ τ (12.110)
H0

for some τ. For any choice of τ, this decision rule can be implemented without knowledge of
s. Moreover, since H0 is simple, we can set τ = √σn Q−1 (PFA ) to attain any desired value of
PFA , while minimizing PMD . Thus, we get a uniformly most powerful test at any desired level of
PFA . The resulting PMD will be a function of the underlying unknown s, but knowing s would
not improve our detector. This phenomenon occurs in more a more general setting where the
likelihood-ratio is monotone in some sucient statistic.

Theorem 12.12 (Karlin-Rubin) . Consider the hypothesis testing problem

H0 : X ∼ p(x | θ0 ) (12.111)

H1 : X ∼ p(x | θ), θ > θ0 . (12.112)

Suppose that

p(x | θ0 )
Λ(x) = (12.113)
p(x | θ)

is a monotone non-increasing function of some scalar sucient statistic T (x), for any θ > θ0 .
H1
Then the threshold decision rule T (x) ≷ τ is uniformly most powerful at any level of PFA (τ ).
H0
118 CHAPTER 12. HYPOTHESIS TESTING

You will prove (a generalization of ) this simple result in a homework exercise. The intuition
behind it is that the larger T (x), the more likely H1 is. Note that in AWGN example, the log-
likelihood ratio is monotonic in Mean[ x ] which is a sucient statistic. This is not a coincidence,
and happens more generally for some other exponential families.

Example 12.13 (scalar exponential family) Let .


p(x | θ) = h(x) exp {η(θ)T (x) − a(θ)} (12.114)

and recall that T (x) is a sucient statistic. Consider testing θ = θ0 against θ > θ0 . The
log-likelihood ratio is

log Λ(x) = (µ(θ0 ) − µ(θ) T (x) + func(θ0 , θ). (12.115)

We see that if µ(θ) > µ(θ0 ) for any θ > θ0 , then the threshold test

H1
T (x) ≷ τ (12.116)
H0

n
P H1
is uniformly most powerful. For i.i.d samples, we can use the test T (xk ) ≷ τ .
k=1 H0

In general however, there is no uniformly most power test even when the null hypothesis is
simple.

Example 12.14 (Constant in AWGN, unknown polarity). consider the setup of Example 12.11,
but where now s 6= 0 is unknown (can be negative too). Here there can be no uniformly most
pwoer test, since the optimal Neyman-Pearson rule depends on the sign of s. What could be a
good decision rule nevertheless? One reasonable idea is to use

H1
(Mean[ x ])2 ≷ τ (12.117)
H0

This seems a very reasonable solution in this case, and would yield very good performance. But
what could be an approach that would lead us to such nice solutions in the more general case?
Here are two common approaches, with no general claim for any optimality (there are many
other methods):

• Bayesian reduction: Set two distributions p0 (θ), p1 (θ) over each hypotheses class. Then
marginalize the dependence on the specic choice of the parameter inside the class:
Z
p0 (x) = p(x | θ)p0 (θ)dθ (12.118)

θ⊆Θ0
Z
p1 (x) = p(x | θ)p1 (θ)dθ, (12.119)

θ⊆Θ1

and we are now back at the simple hypothesis testing setup, where we can use Neyman-
Pearson. Of course, it is not necessarily clear which priors to use.

• Generalized likelihood ratio test (GLRT): The likelihood ratio cannot be computed
due to the lack of knowledge regarding the underlying value of the parameter under each
hypothesis. But under each hypothesis, one could reasonably set the parameter to be the
12.4. COMPOSITE HYPOTHESIS TESTING 119

H0 H1 H1 H0

H0 H1 H1 H0

Vs.

th A A th

Figure 12.7

ML estimator of θ, and then use that estimator as the true value of the parameter in a
Neyman-Pearson decision rule. Namely, compute

sup p(x | θ)
θ⊆Θ0 p(x | θ̂0 )
ΛG (x) , = , (12.120)
sup p(x | θ) p(x | θ̂1 )
θ⊆Θ1

where θ̂i is the ML estimator for θ under Hi . Then use a test of the form

H0
ΛG (x) ≷ τ (12.121)
H1

for a suitable τ (that will determine the trade-o between PFA and PMD ).

Example 12.15 (Constant in AWGN, unknown polarity) Let us return to Example 12.14..
According to GLRT approach, under H1 (which is the only composite hypothesis here) we assume
that the parameter value is ŝML = Mean[ x ]. Thus,
 n

1
x2k
P

2σ 2
 e k=1 
log ΛG (x) = log  n
 (12.122)
 1 P 2

− (xk −Mean[ x ])
2σ 2
e k=1
 
 n 
1 n(Mean[ x ])2 − 2Mean[ x ]
 X 
= 2 xk  (12.123)
2σ  
 k=1 
| {z }
nMean[ x ]
n
= − 2 (Mean[ x ])2 (12.124)

Hence, the GLRT takes the form of

H1
(Mean[ x ])2 ≷ τ (12.125)
H0

which is exactly what we have intuitively guessed as a good solution to the polarity problem.
In a homework exercise, you will verify the Bayes reduction with a Gaussian prior, as well as
another approach called the Wald test, yield the same decision rule.
120 CHAPTER 12. HYPOTHESIS TESTING
13. Minimax Estimation

Let X ∼ p(x | θ) be drawn from a parametric family, and let `(θ̂, θ) be some nonnegative loss
(a.k.a. risk) function, (e.g., quadratic (θ̂ −θ)2 ). Recall the denition of the expected loss attained
by an estimator θ̂(x) for a given value of θ :

h  i
Lθ (θ̂) , EX∼p(x | θ) ` θ̂(X, θ) . (13.1)

We have already seen that one cannot generally nd an estimator θ̂ that minimizes the expected
loss for any θ. We have seen two approaches to handle this problem: One was to restrict our
attention to unbiased estimators (and quadratic loss), then derive lower bounds on the expected
loss and look for an MVU estimator; however, an MVU does not always exist, and biased
estimators sometimes perform uniformly better than unbiased ones. The other was the Bayesian
approach, where we impose a prior on the parameter and then nd the estimator that minimizes
the expected loss; it is however not necessarily clear how to chose the prior, and whether we are
indeed interested in the average performance over the choice of parameter.
We now discuss a third approach, which does not require any assumptions on the estimators
or the parameter: The idea is to rank the estimators according to their worst-case loss over the
parameter space, and then seek the estimator whose worst-case loss is minimal. This is referred
to as the minimax (MM) estimator. Explicitly:

θ̂MM = argmin max Lθ (θ̂). (13.2)


θ∈Θ
θ̂

Note that min and max should be replaced with inf and sup as necessary, throughput this chapter.
Note that we are implicitly assuming that the minimum is achieved by some estimator; if this is
not the case we can look at a sequence of estimators approaching the inmum.
It is generally dicult to nd the minimax estimator analytically, but is sometimes less
dicult to nd upper and lower bounds on the minimax loss. To get upper bounds, we can
evaluate the worst case expected loss attained by any estimator. Namely, we can pick some
estimator θ̃ that we like (which is, say, reasonable / easy to handle) and evaluate its worst case
expected loss, which will serve as an upper bound on the loss of the minimax estimator, i.e.,

min max Lθ (θ̂) ≤ max Lθ (θ̃) (13.3)


θ̂ θ∈Θ θ∈Θ

Obtaining lower bounds on the minimax loss is trickier. Before we proceed, we need the following
well-known inequality.

Lemma 13.1 (minimax inequality) . For any function f : X × Y → R,

min max f (x, y) ≥ max min f (x, y). (13.4)


x∈X y∈Y y∈Y x∈X

121
122 CHAPTER 13. MINIMAX ESTIMATION

Proof. For any y0 ∈ Y it trivially holds that

min max f (x, y) ≥ min f (x, y 0 ). (13.5)


x∈X y∈Y x∈X

The claim follows by maximizing the right-hand-side lower bound over the choice of y0 ∈ Y .

Now using the minimax inequality, we can obtain the following lower bound:

min max Lθ (θ̂) ≥ max min Lθ (θ̂) = 0. (13.6)


θ̂ θ∈Θ θ∈Θ θ̂

This follows since we can always choose θ̂ = θ on the max-min side, and yields a trivial lower
bound. Nevertheless, there is a way to get a nontrivial lower bound via the minimax inequality,
using a Bayesian trick. To that end, we need some denitions. Consider a prior Π(θ) over the
parameter θ and let
h i
LΠ (θ̂) , Eθ∼Π Lθ (θ̂) (13.7)

be the expected loss attained by an estimator θ̂. Let θ̂Π (x) be the associated Bayes estimator:

θ̂Π = argmin LΠ (θ̂). (13.8)


θ̂

For example, if `(θ̂, θ) = (θ̂ − θ)2 is the quadratic loss, then

θ̂Π = E [θ | X] = θ̂MMSE . (13.9)

We will refer to the expected loss LΠ (θ̂Π ) attained by the Bayes estimator as the Bayes loss
(sometimes referred to in the literature as the Bayes risk).

Lemma 13.2. The minimax loss is lower bounded by the Bayes loss under any prior. Namely,
for any prior Π on the parameters:

min max Lθ (θ̂) ≥ LΠ (θ̂Π ) (13.10)


θ̂ θ∈Θ

Proof. The idea is to replace the maximization over individual θ by a maximization over distri-
butions on θ. Then, after switching the min and the max, it is not possible to trivially minimize
the loss anymore (this approach is called mixed strategies in game theory).

min max Lθ (θ̂) = min max LΠ (θ̂) (13.11)


θ̂ θ∈Θ θ̂ Π

≥ max min LΠ (θ̂) (13.12)


Π θ̂
= max LΠ (θ̂Π ) (13.13)
Π
≥ LΠ (θ̂Π ). (13.14)

(13.11) follows since max upper-bounds average, and we can always choose Π to be a deterministic
distribution attaining the max. (13.12) follows from the minimax inequality, (13.13) is by
denition, and (13.14) follows by plugging in any specic prior Π.

A prior Π∗ is called least favorable if it maximizes the Bayes loss among all priors, i.e.,

LΠ∗ (θ̂Π∗ ) = max LΠ (θ̂Π ). (13.15)


Π

We immediately conclude the following:


123

Corollary 13.3 (best Bayesian lower bound) . The minimax loss is lower-bounded by the Bayes
loss of the least favorable prior.

In some cases the least favorable Bayes lower bound is tight; this happens when the minimax
inequality holds with equality, which generally happens under suitable convexity conditions de-
termined by the well-known Minimax Theorem (out of scope). However, there are simpler cases
where this happens, which we discuss next.

Lemma 13.4. Let Π be a prior on θ, with θ̂Π and Lθ (θ̂Π ) its corresponding Bayes estimator and
Bayes loss. If

Lθ (θ̂Π ) ≤ LΠ (θ̂Π ), (13.16)

for any θ ∈ Θ, then Π is least favorable and θ̂Π is the minimax estimator.

Proof. Let us rst show that θ̂Π is minimax. To that end, let θ̂ be any other estimator. We now
show that the worst case loss attained by θ̂Π is at least as large as the one attained by θ̂:

max Lθ (θ̂) ≥ LΠ (θ̂) (13.17)


θ
≥ LΠ (θ̂Π ) (13.18)

≥ max Lθ (θ̂Π ). (13.19)


θ

where (13.17) holds since expected value is upper bounded by the supremum, (13.18) holds since
θ̂Π is by denition the Bayes estimator and (13.19) holds under the assumption in the theorem
(since the inequality holds for any θ). Note that in fact, the last inequality must hold with
equality.
We now show that Π is also least favorable. Let Π̃ be any other prior. Then

LΠ̃ (θ̂Π̃ ) ≤ LΠ̃ (θ̂Π ) (13.20)

≤ max Lθ (θ̂Π ) (13.21)


θ
≤ LΠ (θ̂Π ) (13.22)

where (13.20) holds by denition, (13.21) holds since expected value is upper bounded by the
supremum, and (13.22) again holds by assumption.

The following corollary is a useful special case of the above.

Corollary 13.5. Suppose θ̂Π has constant expected loss, i.e., Lθ (θ̂Π ) does not depend on θ. Then
Π is least favorable and θ̂Π is the minimax estimator.

Proof. In this case clearly

max Lθ (θ̂Π ) = LΠ (θ̂Π ) (13.23)


θ

and the result follows from Lemma 13.4.

We note Lemma 13.4 is slightly stronger than Corollary 13.5, as it may allow us to establish
the minimax estimator when the least favorable prior does not have full support (e.g., is discrete).
124 CHAPTER 13. MINIMAX ESTIMATION

Example 13.6 .
(Coin toss) Consider the observations
i.i.d
X ∼ Bern(θ). What is the minimax
estimator under the quadratic loss `(θ̂, θ) = (θ̂ − θ)2 ? We have seen that the Beta distribution
is a conjugate prior in this case, hence it makes sense to try and see if it is less favorable for
some choice of hyper-parametres. Hence, let Π ∼ Beta(a, b) and recall that the Bayes estimator
in this case is

n
P
xk + a
k=1
θ̂Π = . (13.24)
n+a+b
Let us try to nd a, b such that the expected loss is constant.

n
P 2 
 Xk + a
 k=1  
Lθ (θ̂Π ) = E  − θ   (13.25)
 n + a + b  

 !2 
n
= (n + a + b)−2 ·E 
X
xk + a − nθ − aθ − bθ  (13.26)
| {z }
k=1
,c
 !2 
n
X
= c · E (Xk − θ) + (a − θ(a + b))  (13.27)
k=1

= c · nθ(1 − θ) + (a − θ(a + b))2


 
(13.28)

= c · nθ − nθ2 + a2 − 2θa(a + b) + θ2 (a + b)2 .


 
(13.29)

In order for the last expression not to depend on θ, we need that


(
2a(a + b) = n
(13.30)
(a + b)2 = n
√ √ √
n
This implies that
√ a+b = n, hence 2a n = n, which nally yields a=b= 2 . Thus, the prior
√ 
n n
Beta
2 , 2 is least favorable, and its associated estimator

n √
xk + 2n
P

θ̂Π (x) = k=1 √ (13.31)


n+ n
Mean[ x ] + 2√1 n
= (13.32)
1 + √1n

is the minimax estimator.

Lemma 13.7. Let θ̂ be some estimator, and suppose that exists a sequence of priors {Πm }∞ m=1
whose Bayes losses are asymptotically upper bounded by the worst-case expected loss of θ̂ , i.e.,

max Lθ (θ̂) ≤ lim LΠm (θ̂Πm ). (13.33)


θ m→∞

Then θ̂ is the minimax estimator.

The proof is similar to that of Lemma 13.4, and is left as an exercise to the reader.
125

Example 13.8 . W ∼ N 0, σ 2 ,

(Constant in AWGN) Consider the model X = θ+W where
where σ2 is known (a similar claim works for n i.i.d. samples). Let Πm ∼ N (0, m). The
associated Bayes loss is

" 2 #
  m
LΠm θ̂Πm = E X −θ (13.34)
m + σ2
" 2 #
m
=E (θ + W ) − θ (13.35)
m + σ2
" 2 #
m σ2
=E W − θ (13.36)
m + σ2 m + σ2
m2 σ 2 mσ 4
= + (13.37)
(m + σ 2 )2 (m + σ 2 )2
mσ 2 (m + σ 2 )
= (13.38)
(m + σ 2 )2
mσ 2
= (13.39)
m + σ2
σ2
= 2 . (13.40)
1 + σm

Hence, we have that

lim LΠm (θ̂Πm ) = σ 2 . (13.41)


m→∞

Now, consider the estimator θ̂(x) = x. Then

h i
Lθ (θ̂) = E (X − θ)2 (13.42)

= E W2
 
(13.43)
2
=σ . (13.44)

Hence, the estimator θ̂(x) = x is minimax. Similarly, one can show that the sample mean
θ̂(x) = Mean[ x ] is the minimax estimator for the i.i.d. version of this problem.

You might also like