Lecture 12
Bayesian Inference.
1 Frequentist and Bayesian Paradigms
According to frequentist theory, the unknown parameter θ is some fixed number or vector. Given the
parameter value θ, we observe sample data X from the distribution f(·|θ). To estimate θ, we introduce
an estimator T(X), which is a statistic, i.e. a function of the data. This statistic is a random
variable, since X is random. That is, the randomness here comes from the randomness of sampling.
Different properties of estimators characterize this randomness. For example, unbiasedness means that
if we observe many samples from the distribution f(·|θ), then the average of T(X) over these samples
will be close to the true parameter value θ, i.e. E_θ[T(X)] = θ. Consistency means that T(X) converges
in probability to the true parameter value θ as the sample size grows, i.e. when samples are large,
T(X) will be in a small neighborhood of θ in most observed samples.
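The two properties can be illustrated by simulation. The following is a minimal sketch, assuming Bernoulli(p) data and the sample average as T(X); the true value 0.3, the sample sizes, the tolerance 0.05, and the seed are all hypothetical choices made only for illustration.

```python
import random

def sample_mean(n, p, rng):
    """Draw n Bernoulli(p) observations and return their average T(X)."""
    return sum(rng.random() < p for _ in range(n)) / n

rng = random.Random(0)       # fixed seed so the run is reproducible
p_true = 0.3                 # hypothetical true parameter value
n_small, n_large = 10, 1000  # two sample sizes to compare
reps = 2000                  # number of independent samples drawn

est_small = [sample_mean(n_small, p_true, rng) for _ in range(reps)]
est_large = [sample_mean(n_large, p_true, rng) for _ in range(reps)]

# Unbiasedness: the average of T(X) across samples is near p, even for small n.
avg_small = sum(est_small) / reps

# Consistency: the share of samples with |T(X) - p| > 0.05 shrinks as n grows.
far_small = sum(abs(t - p_true) > 0.05 for t in est_small) / reps
far_large = sum(abs(t - p_true) > 0.05 for t in est_large) / reps

print(avg_small, far_small, far_large)
```

With small samples the estimator is already centered at p but often far from it; with large samples it is rarely far from p.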
In contrast, Bayesian theory assumes that θ is a random variable, and we are interested in the
realization of this random variable. This realization, say θ0, is thought of as the true parameter
value. It is assumed that we know the distribution of θ, or at least an approximation to it. This
distribution is called the prior. It usually comes from our subjective beliefs based on past
experience. Once θ0 is realized, we observe a random sample X = (X1, ..., Xn) from the distribution
f(·|θ0). Once we have the data, the best thing we can do is to calculate the conditional distribution
of θ given X1, ..., Xn. This conditional distribution is called the posterior. The posterior is used to
create an estimate of θ. Since we condition on the observations Xi, we treat them as given. The
randomness in the posterior reflects uncertainty about θ. The Bayesian approach is often used in
learning theory.
2 Bayesian Updating
Let π(θ) denote our prior. In other words, parameter of interest θ is a random variable with distribution
π(·). Our model is f (·|θ). In other words, once we have a realization of a parameter value θ, the observed
sample data X are drawn from the conditional distribution f(·|θ). Then the joint pdf of θ and X is
f(x, θ) = π(θ)f(x|θ). The prior predictive distribution is
$m(x) = \int \pi(\tilde\theta) f(x|\tilde\theta)\, d\tilde\theta$. Thus, the prior predictive
distribution is the marginal pdf of our sample data X. Once we observe X = x, we can calculate the
posterior distribution for θ as
\[
\pi(\theta|X = x) = \frac{f(x, \theta)}{m(x)}
  = \frac{\pi(\theta) f(x|\theta)}{\int \pi(\tilde\theta) f(x|\tilde\theta)\, d\tilde\theta}.
\]
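When the integral m(x) has no closed form, the updating formula can be approximated numerically. The sketch below replaces the integral by a Riemann sum on a grid; the function name `grid_posterior`, the grid size, and the small Bernoulli data set with a uniform prior are hypothetical choices for illustration, not part of the lecture.

```python
def grid_posterior(prior, likelihood, grid):
    """Approximate pi(theta | x) on an equally spaced grid.

    prior(theta) and likelihood(theta) are evaluated pointwise; the integral
    m(x) is replaced by a Riemann sum, so the result is only approximate.
    """
    step = grid[1] - grid[0]
    joint = [prior(t) * likelihood(t) for t in grid]   # pi(theta) f(x | theta)
    m_x = sum(joint) * step                            # approximates m(x)
    posterior = [j / m_x for j in joint]               # joint / m(x)
    return posterior, m_x, step

# Hypothetical example: five Bernoulli observations, uniform prior on [0, 1].
x = [1, 0, 1, 1, 0]
s, n = sum(x), len(x)
grid = [(i + 0.5) / 1000 for i in range(1000)]         # midpoints of 1000 cells
posterior, m_x, step = grid_posterior(
    lambda p: 1.0,                                     # uniform prior density
    lambda p: p**s * (1 - p) ** (n - s),               # Bernoulli likelihood
    grid,
)
post_mean = sum(p * d for p, d in zip(grid, posterior)) * step
print(m_x, post_mean)
```

For this data set the exact values are m(x) = B(4, 3) = 1/60 and a posterior mean of 4/7, so the printed numbers can be checked against them.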
Example Let $X = (X_1, ..., X_n)$ be a random sample from a Bernoulli($p$) distribution. Then the
joint pdf of the data is $f(x|p) = p^{\sum_i x_i}(1 - p)^{n - \sum_i x_i}$, where
$x = (x_1, ..., x_n)$. In classical theory, we would consider some estimator of $p$. For example, we
can take $T(X) = \bar{X}_n$. Then we have already seen that $E_p[T(X)] = p$ and $T(X) \to_p p$. In
Bayesian theory we need some prior distribution for $p$. For example, suppose we believe that all
values of $p$ are equally likely. Then we have a uniform prior, i.e. $\pi(p) = 1$ if $p \in [0, 1]$
and 0 otherwise. Then the joint pdf of $p$ and $X$ is
$f(x, p) = p^{\sum_i x_i}(1 - p)^{n - \sum_i x_i} I(0 \le p \le 1)$. The prior predictive distribution is
\[
m(x) = \int_0^1 p^{\sum_i x_i}(1 - p)^{n - \sum_i x_i}\, dp
     = B\Big(\sum_{i=1}^n x_i + 1,\; n - \sum_{i=1}^n x_i + 1\Big),
\]
where $B(x, y) = \int_0^1 t^{x-1}(1 - t)^{y-1}\, dt$ is the Beta function. The posterior distribution is
\[
\pi(p|X = x) = \frac{p^{\sum_i x_i}(1 - p)^{n - \sum_i x_i}}
                    {B\big(\sum_{i=1}^n x_i + 1,\; n - \sum_{i=1}^n x_i + 1\big)}\, I(0 \le p \le 1).
\]
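This distribution is called the Beta distribution $B(\alpha, \beta)$ with parameters
$\alpha = \sum_{i=1}^n x_i + 1$ and $\beta = n - \sum_{i=1}^n x_i + 1$. Its mean is
\[
E[p|X = x] = \frac{\alpha}{\alpha + \beta} = \frac{\sum_{i=1}^n x_i + 1}{n + 2}.
\]
Its variance is
\[
V(p|X = x) = \frac{\alpha\beta}{(\alpha + \beta)^2(\alpha + \beta + 1)}
           = \frac{\big(\sum_{i=1}^n x_i + 1\big)\big(n - \sum_{i=1}^n x_i + 1\big)}{(n + 2)^2(n + 3)}.
\]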
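The mean and variance formulas above are easy to evaluate exactly. The sketch below uses Python's `fractions` module for exact arithmetic; the helper names and the ten-observation sample are hypothetical.

```python
from fractions import Fraction

def uniform_prior_posterior(x):
    """Return (alpha, beta) of the Beta posterior for Bernoulli data x
    under the uniform prior on [0, 1]."""
    s, n = sum(x), len(x)
    return s + 1, n - s + 1

def beta_mean_var(alpha, beta):
    """Exact mean and variance of a Beta(alpha, beta) distribution."""
    total = alpha + beta
    mean = Fraction(alpha, total)
    var = Fraction(alpha * beta, total**2 * (total + 1))
    return mean, var

x = [1, 1, 0, 1, 0, 1, 1, 0, 1, 1]    # hypothetical sample: 7 successes in 10
alpha, beta = uniform_prior_posterior(x)
mean, var = beta_mean_var(alpha, beta)
print(alpha, beta, mean, var)          # prints: 8 4 2/3 2/117
```

Note that the posterior mean 2/3 differs from the sample average 7/10: the uniform prior pulls the estimate slightly toward 1/2.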
2.1 How to calculate posterior distribution
Here we consider a trick that may help to calculate the posterior analytically. It does not always
work, though.
Let $X = (X_1, ..., X_n)$ be a random sample from an $N(\mu, \sigma^2)$ distribution. Suppose
$\sigma^2$ is known. Let $N(\mu_0, \tau^2)$ be the prior distribution for $\mu$. Then the posterior
distribution is $\pi(\mu|X = x) = \pi(\mu)f(x|\mu)/m(x)$, where $\pi(\mu)$ denotes the prior
distribution and $m(x)$ denotes the prior predictive distribution. Note that $m(x)$ does not depend on
$\mu$. So $m(x)$ is just a constant that normalizes the posterior so that
$\int \pi(\mu|X = x)\, d\mu = 1$. Therefore, when we calculate $\pi(\mu)f(x|\mu)$, we can denote all
multiplicative terms which do not contain $\mu$ as some constant $C$ instead of keeping track of all
these terms. Once we have an expression for $\pi(\mu)f(x|\mu)$ as a function of $\mu$, we can
integrate it in order to find the normalizing constant $m(x)$. In our case,
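\[
\pi(\mu|X = x) = C\pi(\mu)f(x|\mu)
  = C\exp\left\{-\frac{\sum_{i=1}^n (x_i - \mu)^2}{2\sigma^2} - \frac{(\mu - \mu_0)^2}{2\tau^2}\right\},
\]
where $C$ stands for all multiplicative terms which do not contain $\mu$ (its value may differ from
one expression to the next). Thus,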
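\[
\begin{aligned}
\pi(\mu|X = x)
&\propto \exp\left\{\frac{\mu\sum_{i=1}^n x_i}{\sigma^2} - \frac{n\mu^2}{2\sigma^2}
         - \frac{\mu^2}{2\tau^2} + \frac{\mu\mu_0}{\tau^2}\right\} \\
&\propto \exp\left\{-\mu^2\left(\frac{n}{2\sigma^2} + \frac{1}{2\tau^2}\right)
         + 2\mu\left(\frac{\sum_{i=1}^n x_i}{2\sigma^2} + \frac{\mu_0}{2\tau^2}\right)\right\} \\
&\propto \exp\left\{-\frac{1}{2}\left(\frac{n}{\sigma^2} + \frac{1}{\tau^2}\right)
         \left[\mu^2 - 2\mu\,\frac{\sum_{i=1}^n x_i/\sigma^2 + \mu_0/\tau^2}{n/\sigma^2 + 1/\tau^2}\right]\right\} \\
&\propto \exp\left\{-\frac{1}{2}\left(\frac{n}{\sigma^2} + \frac{1}{\tau^2}\right)[\mu - \tilde\mu]^2\right\} \\
&\propto \exp\left\{-\frac{(\mu - \tilde\mu)^2}{2\tilde\sigma^2}\right\},
\end{aligned}
\]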
where $\tilde\mu = \big(\sum_{i=1}^n x_i/\sigma^2 + \mu_0/\tau^2\big)/(n/\sigma^2 + 1/\tau^2)$ and
$\tilde\sigma^2 = 1/(n/\sigma^2 + 1/\tau^2)$. Note that, via some abuse of notation, the symbol
$\propto$ stands for "proportional as a function of $\mu$"; thus different lines need different
normalizing constants. Thus, the conditional distribution of $\mu$ given $X = x$ is
$N(\tilde\mu, \tilde\sigma^2)$. Now it is easy to find the constant in the last expression: the
missing constant is $C = 1/(\sqrt{2\pi}\,\tilde\sigma)$.
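The result of completing the square can be checked numerically. The sketch below computes the closed-form posterior mean and variance and compares them with the mean and variance of the unnormalized product π(µ)f(x|µ) evaluated on a grid; the function names, the data, and the prior parameters are hypothetical.

```python
import math

def normal_posterior(x, sigma2, mu0, tau2):
    """Posterior N(mu_tilde, sigma_tilde2) for the mean of N(mu, sigma2) data
    with known sigma2 and prior N(mu0, tau2)."""
    n = len(x)
    precision = n / sigma2 + 1 / tau2                # equals 1 / sigma_tilde^2
    mu_tilde = (sum(x) / sigma2 + mu0 / tau2) / precision
    return mu_tilde, 1 / precision

def unnormalized(mu, x, sigma2, mu0, tau2):
    """pi(mu) f(x | mu) up to a multiplicative constant C."""
    ss = sum((xi - mu) ** 2 for xi in x)
    return math.exp(-ss / (2 * sigma2) - (mu - mu0) ** 2 / (2 * tau2))

# Hypothetical inputs.
x = [1.2, 0.7, 2.1, 1.5]
sigma2, mu0, tau2 = 1.0, 0.0, 4.0
mu_tilde, sigma_tilde2 = normal_posterior(x, sigma2, mu0, tau2)

# Grid check: the constant C cancels when we divide by the sum of the weights.
step = 0.001
grid = [-5 + i * step for i in range(12001)]         # covers [-5, 7]
w = [unnormalized(m, x, sigma2, mu0, tau2) for m in grid]
grid_mean = sum(m * wi for m, wi in zip(grid, w)) / sum(w)
grid_var = sum((m - grid_mean) ** 2 * wi for m, wi in zip(grid, w)) / sum(w)

print(mu_tilde, grid_mean)
print(sigma_tilde2, grid_var)
```

The grid never needs the normalizing constant, which is the point of the trick: only the dependence on µ matters.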
Note that in the example above, the posterior mean $\tilde\mu$ is a weighted average of the sample
average $\bar{x}_n$ and the prior mean $\mu_0$, i.e. $\tilde\mu = \omega_1\bar{x}_n + \omega_2\mu_0$
with $\omega_1 + \omega_2 = 1$, where $\omega_1 = (n/\sigma^2)/(n/\sigma^2 + 1/\tau^2)$ and
$\omega_2 = (1/\tau^2)/(n/\sigma^2 + 1/\tau^2)$. Here $1/\tau^2$ may be interpreted as the precision
of the initial information. If the initial information is very precise, i.e. $1/\tau^2$ is large (or
$\tau^2$ is small), then the prior mean gets almost all the weight, and $\tilde\mu$ is close to
$\mu_0$. Thus, we almost ignore the new information in the form of the sample $X_1, ..., X_n$. If the
initial information is poor, i.e. $1/\tau^2$ is small, the prior mean gets almost no weight, and
$\tilde\mu$ is close to $\bar{x}_n$. Note that as $n \to \infty$, information from the sample
dominates the prior information and $\tilde\mu - \bar{x}_n \to 0$. Moreover, at least in our example,
$\tilde\sigma^2 \to 0$ as $n \to \infty$. Thus, as the sample size increases, the posterior
distribution converges to a degenerate distribution concentrated at the true parameter value.
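To see how the weights respond to the prior precision and the sample size, one can tabulate them. This is a small sketch; the function name and the values of σ², τ² and n are hypothetical.

```python
def posterior_weights(n, sigma2, tau2):
    """Weights (omega1, omega2) placed on the sample average and the prior mean."""
    data_precision = n / sigma2          # n / sigma^2
    prior_precision = 1 / tau2           # 1 / tau^2
    total = data_precision + prior_precision
    return data_precision / total, prior_precision / total

sigma2 = 1.0
for tau2 in (0.01, 1.0, 100.0):          # precise, moderate, and vague prior
    for n in (1, 10, 1000):
        w1, w2 = posterior_weights(n, sigma2, tau2)
        print(f"tau2={tau2:<6} n={n:<5} omega1={w1:.4f} omega2={w2:.4f}")
```

With τ² = 0.01 and a single observation, almost all weight sits on the prior mean; with n = 1000 the sample average dominates even under that precise prior.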
Once we have the posterior distribution, we can construct an estimator of the parameter of interest.
Common examples include posterior mean and posterior mode (posterior mode denotes the point with the
greatest pdf value on the posterior distribution).
An important problem with Bayesian estimation is that if a prior distribution puts zero probability mass
on the true parameter value, then no matter how large our sample is, posterior distribution will put zero
mass on the true parameter value as well. You will see examples of this phenomenon in 14.384.
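A discrete prior makes this phenomenon easy to see. In the following sketch the prior puts mass on three candidate values of a Bernoulli parameter but assigns zero probability to the value that best fits the data; the candidate values and the data are hypothetical.

```python
def discrete_posterior(prior, s, n):
    """Posterior over a finite set of candidate p values for Bernoulli data
    with s successes in n trials. prior maps p -> prior probability."""
    joint = {p: w * p**s * (1 - p) ** (n - s) for p, w in prior.items()}
    total = sum(joint.values())          # prior predictive probability m(x)
    return {p: j / total for p, j in joint.items()}

# Hypothetical setup: the value 0.5 receives zero prior probability.
prior = {0.2: 0.5, 0.5: 0.0, 0.8: 0.5}
post = discrete_posterior(prior, s=100, n=200)   # data typical of p = 0.5
print(post)
```

However many observations are added, the posterior probability of 0.5 stays exactly zero, because it is the prior probability multiplied by the likelihood.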
2.2 Conjugate Classes
Let F be a class of distributions indexed by θ. Let P be a class of prior distributions for θ. Then we
say that P is conjugate to F if, whenever the data are distributed according to a member of F and the
prior distribution is from P, the posterior distribution is from P as well. For example, we have
already seen that the class of normal distributions is conjugate to the class of normal distributions
with known (fixed) variance. It is also known that the class of Beta distributions is conjugate to the
class of binomial distributions. The concept of conjugate classes is introduced because of its
mathematical convenience. It is relatively easy to calculate the posterior when the prior lies in the
conjugate class. Conjugate priors were almost the only priors used for a long time, as the others tend
not to be analytically tractable.
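Conjugacy means that updating reduces to arithmetic on the prior's parameters. For a Beta(a, b) prior and Bernoulli data with s successes in n trials, the posterior is Beta(a + s, b + n − s). The sketch below illustrates this; the function name, the prior parameters, and the two data batches are hypothetical.

```python
def beta_update(a, b, x):
    """Update a Beta(a, b) prior on p with Bernoulli observations x.
    The posterior is Beta(a + s, b + n - s), where s is the number of successes."""
    s, n = sum(x), len(x)
    return a + s, b + n - s

prior = (2, 2)                         # hypothetical Beta(2, 2) prior
batch1, batch2 = [1, 0, 1], [1, 1, 0, 0]

# Updating in two steps, using the first posterior as the next prior...
step1 = beta_update(*prior, batch1)
step2 = beta_update(*step1, batch2)

# ...gives the same posterior as a single update on the pooled data.
pooled = beta_update(*prior, batch1 + batch2)
print(step1, step2, pooled)            # prints: (4, 3) (6, 5) (6, 5)
```

Because the posterior stays in the same family, it can serve directly as the prior when new data arrive.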
2.3 Simulation Techniques
Nowadays, there are numerical algorithms (MCMC, Markov Chain Monte Carlo) that allow one to calculate
the posterior for priors outside the conjugate family. MCMC will be discussed in 14.384 and 14.385.
The goal of a typical MCMC algorithm is to obtain a sequence of random draws from the posterior
distribution. That is, one constructs a numerical algorithm that produces draws
$\theta_1, ..., \theta_B$ as if simulated from $\pi(\theta|X = x)$ (the draws are not always
independent, but they may be stationary and satisfy a Law of Large Numbers). This allows one to make
inferences of any sort about $\theta$ or any function of $\theta$. Suppose you wish to use the
posterior mean as an estimator. Then $\frac{1}{B}\sum_{b=1}^B \theta_b$ is an estimator for $\theta$,
while $\frac{1}{B}\sum_{b=1}^B g(\theta_b)$ is an estimator for the parameter $\tau = g(\theta)$.
If you prefer to use the median of the posterior as an estimate, then you would sort the draws
$\theta_b$ in increasing order and take the middle one, $\theta_{([B/2])}$, as an estimate for
$\theta$, etc.
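As a rough illustration of how such draws are used, the sketch below runs a random-walk Metropolis sampler, one standard MCMC algorithm chosen here only for illustration (the notes do not specify a particular algorithm). It targets the posterior of a Bernoulli parameter under a uniform prior, where the exact answer is known, so the output can be checked. The step size, seed, number of draws, burn-in length, and data are hypothetical.

```python
import math
import random

def log_post(p, s, n):
    """Log of the unnormalized posterior for Bernoulli data, uniform prior."""
    if p <= 0.0 or p >= 1.0:
        return -math.inf                 # zero posterior density outside (0, 1)
    return s * math.log(p) + (n - s) * math.log(1 - p)

def metropolis(s, n, draws, step=0.1, seed=0):
    """Random-walk Metropolis sampler; returns a list of correlated draws."""
    rng = random.Random(seed)
    p = 0.5                              # starting value
    chain = []
    for _ in range(draws):
        proposal = p + rng.gauss(0.0, step)
        diff = log_post(proposal, s, n) - log_post(p, s, n)
        # Accept with probability min(1, posterior ratio).
        if rng.random() < math.exp(min(0.0, diff)):
            p = proposal
        chain.append(p)
    return chain

s, n = 7, 10                                        # hypothetical data
chain = metropolis(s, n, draws=50000)[5000:]        # drop the burn-in draws
B = len(chain)
post_mean = sum(chain) / B                          # estimates E[p | x]
g_mean = sum(p * (1 - p) for p in chain) / B        # estimates E[g(p) | x], g(p) = p(1 - p)
post_median = sorted(chain)[B // 2]                 # middle order statistic
print(post_mean, g_mean, post_median)
```

Here the exact posterior is Beta(8, 4), with mean 2/3 and E[p(1 − p)|x] = 8/39, so the simulated averages should land close to these values.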
2.4 Credible Intervals
The posterior distribution contains all the information that a researcher can get from the data. However,
it is often impractical to report a posterior distribution, as it might be intractable or it might not have
analytic form at all. Therefore, it is a common practice to report only some characteristics of a posterior
distribution, such as its mean, variance, and mode. Mean and mode of the posterior serve as point estimators
of the parameter while variance shows the precision of the posterior distribution. A better way to show the
precision of the posterior distribution is the concept of credible intervals. Let X be our data, θ ∈ Θ
our parameter, and π(θ|X = x) the posterior distribution. Then for any α ∈ [0, 1], a set C(x) ⊂ Θ is
called (1 − α)-credible if π(θ ∈ C(x)|X = x) ≥ 1 − α. In words, the set C(x) contains the true
parameter value θ with posterior probability of at least 1 − α. Of course, the whole parameter space Θ
is (1 − α)-credible. But, obviously, reporting Θ as a (1 − α)-credible set is not useful at all. The
smallest (1 − α)-credible set contains only points with the highest posterior density.
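As an example, let $X = (X_1, ..., X_n)$ be a random sample from $N(\mu, \sigma^2)$ with $\sigma^2$
known and let $N(\mu_0, \tau^2)$ be the prior distribution of $\mu$. We have already seen that the
conditional distribution of $\mu$ given $X = x$ is $N(\tilde\mu, \tilde\sigma^2)$ with $\tilde\mu$ and
$\tilde\sigma^2$ defined as above. Then, conditional on $X = x$,
$(\mu - \tilde\mu)/\tilde\sigma \sim N(0, 1)$. Let $z_\gamma$ be the $\gamma$-quantile of the standard
normal distribution. Then
$\pi\{(\mu - \tilde\mu)/\tilde\sigma \in [z_{\alpha/2}, -z_{\alpha/2}]\,|\,X = x\} = 1 - \alpha$ or,
equivalently,
$\pi\{\tilde\mu + z_{\alpha/2}\tilde\sigma \le \mu \le \tilde\mu - z_{\alpha/2}\tilde\sigma\,|\,X = x\} = 1 - \alpha$.
Since the pdf $\phi(x)$ of the standard normal distribution is decreasing on $x \ge 0$ and increasing
on $x \le 0$, $[\tilde\mu + z_{\alpha/2}\tilde\sigma,\; \tilde\mu - z_{\alpha/2}\tilde\sigma]$ is the
shortest $(1 - \alpha)$-credible interval.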
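The interval is straightforward to compute once the posterior mean and standard deviation are known. The sketch below uses `statistics.NormalDist` from the Python standard library; the function name and the numerical inputs are hypothetical.

```python
from statistics import NormalDist

def normal_credible_interval(mu_tilde, sigma_tilde, alpha):
    """Shortest (1 - alpha)-credible interval for a N(mu_tilde, sigma_tilde^2)
    posterior. z is the alpha/2-quantile of N(0, 1), so z < 0 for alpha < 1."""
    z = NormalDist().inv_cdf(alpha / 2)
    return mu_tilde + z * sigma_tilde, mu_tilde - z * sigma_tilde

# Hypothetical posterior: mean 1.0, standard deviation 0.5.
lo, hi = normal_credible_interval(mu_tilde=1.0, sigma_tilde=0.5, alpha=0.05)

# Posterior probability that mu lies in [lo, hi].
posterior = NormalDist(1.0, 0.5)
coverage = posterior.cdf(hi) - posterior.cdf(lo)
print(lo, hi, coverage)
```

The interval is symmetric around the posterior mean, which is also the posterior mode, so it collects the points of highest posterior density.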
3 Large Sample Properties of Bayes' Procedures
There is a mathematical theorem which states that, under some regularity conditions, the influence of
the prior vanishes asymptotically and we get essentially MLE inferences.
Theorem 1. Under appropriate regularity conditions, the posterior is approximately normal with mean
$\hat\theta_{ML}$ and variance $1/(nI_1(\hat\theta_{ML}))$ when $n$ is large. In particular, the
asymptotic frequentist $1 - \alpha$ interval
\[
C = \left[\hat\theta_{ML} + z_{\alpha/2}\sqrt{\frac{1}{nI_1(\hat\theta_{ML})}},\;
          \hat\theta_{ML} - z_{\alpha/2}\sqrt{\frac{1}{nI_1(\hat\theta_{ML})}}\right]
\]
is also an approximate Bayesian credible set at level $1 - \alpha$:
\[
P(\theta \in C|X = x) \to 1 - \alpha \quad \text{as } n \to \infty.
\]
The theorem says that the posterior concentrates around the asymptotic limit of the frequentist MLE,
and in any "reasonable" situation the Bayes estimate in large samples will be close to the MLE.
However, you should be cautious about this theorem!
Cautions:
• The theorem is about asymptotics. However, the prior can influence inferences in finite samples.
• One of the regularity assumptions is the identification condition. If the model is not identified,
then where the Bayesian estimator converges depends on your prior.
• The prior should not restrict the parameter space unnecessarily.
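The first caution, together with the large-sample claim of the theorem, can be illustrated in the Bernoulli model. The sketch below compares posterior summaries under two Beta priors with the MLE and its asymptotic standard deviation. It assumes that $I_1$ denotes the Fisher information of a single observation, which for Bernoulli($p$) equals $1/(p(1 - p))$; the two priors and the data (30% successes) are hypothetical.

```python
import math

def beta_posterior_summary(a, b, s, n):
    """Mean and standard deviation of the Beta(a + s, b + n - s) posterior
    obtained from a Beta(a, b) prior and s successes in n Bernoulli trials."""
    alpha, beta = a + s, b + n - s
    total = alpha + beta
    mean = alpha / total
    sd = math.sqrt(alpha * beta / (total**2 * (total + 1)))
    return mean, sd

def mle_summary(s, n):
    """MLE of p and its asymptotic standard deviation sqrt(1 / (n I_1(p_hat))),
    using I_1(p) = 1 / (p (1 - p)) for the Bernoulli model."""
    p_hat = s / n
    return p_hat, math.sqrt(p_hat * (1 - p_hat) / n)

priors = {"uniform Beta(1, 1)": (1, 1), "informative Beta(20, 5)": (20, 5)}
gaps = {}                                  # n -> distance between posterior means
for n in (10, 100, 10000):
    s = 3 * n // 10                        # hypothetical data: 30% successes
    p_hat, asy_sd = mle_summary(s, n)
    means = []
    for name, (a, b) in priors.items():
        mean, sd = beta_posterior_summary(a, b, s, n)
        means.append(mean)
        print(f"n={n:<6} {name:<24} mean={mean:.4f} sd={sd:.4f} "
              f"(MLE {p_hat:.4f}, asymptotic sd {asy_sd:.4f})")
    gaps[n] = abs(means[0] - means[1])
```

At n = 10 the two priors give very different posterior means; at n = 10000 both are close to the MLE and the posterior standard deviation is close to the asymptotic one.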
The strongest criticism of, and objection to, Bayesian inference concerns where priors come from.
Priors are subjective by definition: they summarize one's subjective opinion or belief about possible
values of θ. There is a general desire for non-informative priors, but constructing them is a hard
problem.
3.1 Bayesian Testing
Let π(θ|X = x) be our posterior distribution. One can test the hypothesis H0 : θ ∈ Θ0 against the
hypothesis H1 : θ ∈ Θ1 by looking at the posterior odds and accepting H0 if and only if
P(θ ∈ Θ0|X = x) ≥ P(θ ∈ Θ1|X = x).
It is interesting that the hypotheses are treated in a symmetric way here. You would face a difficulty
if you wanted to test a specific parameter value, for example H0 : θ = θ0, as a continuous posterior
puts zero weight on any single point. In such a case one has to use a prior that puts positive point
mass on this specific value in order to test it.
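For a one-sided hypothesis and a normal posterior, the decision rule amounts to comparing two normal probabilities. The sketch below tests H0 : µ ≤ threshold against H1 : µ > threshold; the function name and the posterior parameters are hypothetical.

```python
from statistics import NormalDist

def accept_h0(mu_tilde, sigma_tilde, threshold):
    """Test H0: mu <= threshold against H1: mu > threshold under a
    N(mu_tilde, sigma_tilde^2) posterior. Returns (accept, P(H0 | x))."""
    p_h0 = NormalDist(mu_tilde, sigma_tilde).cdf(threshold)   # P(mu <= threshold | x)
    p_h1 = 1.0 - p_h0                                         # P(mu > threshold | x)
    return p_h0 >= p_h1, p_h0

# Hypothetical posterior N(0.4, 0.5^2); test whether mu <= 0.
accept, p_h0 = accept_h0(mu_tilde=0.4, sigma_tilde=0.5, threshold=0.0)
print(accept, round(p_h0, 4))
```

Since the two hypotheses partition the parameter space here, the rule accepts H0 exactly when its posterior probability is at least 1/2.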
MIT OpenCourseWare
14.381 Statistical Method in Economics
Fall 2018