Bayesian Inference and Modeling Techniques
Bayesian Inference and Modeling Techniques
1
Recall: Bayesian Statistics
The main idea is that 𝜃 is now a random variable and the information brought by
the sample x is combined with a prior information specified by a prior distribution
𝜋(𝜃), to form the posterior distribution 𝜋 (𝜃| 𝒙) by using Bayes formula:
𝑓(𝒙 | 𝜃) 𝜋(𝜃)
𝑓(𝜃| 𝒙) = ,
𝑚(𝒙)
where where likely O is
It is useful to think that 𝜃 being random represents our uncertainty about its value.
Before seeing the data, our uncertainty is represented by 𝜋(𝜃) and after that by
𝑓(𝜃| 𝒙).
Conjugate Priors
Binomials normal model values of difpiors.
Posterior ends up with
⇒ 𝜃 | 𝑥 ∼ Beta(𝛼 + 𝑥, 𝑛 + 𝛽 − 𝑥)
not relevant
Confidence
·
2.0
called credible regions, with
highest posterior density:
1.5
f(x)
𝐶(𝒙) = { 𝜃: 𝑓(𝜃 | 𝒙) ≥ 𝑘}
1.0
where k is such that
0.5
𝑃 𝜃 ∈ 𝐶(𝒙 𝒙 = 𝛾
= 𝑃 𝑓 𝜃 𝒙 ≥ 𝑘 | 𝒙)
0.0
normal are
probability -
---------
distribution on a
single variable, the
HDI of mass M is the
narrowest possible
interval of that mass.
Improper Priors Hwkind of 92 not exam
.
-
,
∫ 𝜋(𝜃) 𝑑 𝜃 = න 𝑐 𝑑𝜃 = ∞ and b.
−∞
under curve.
∞
Area
Definition: When ∫−∞ 𝜋(𝜃) 𝑑𝜃 = ∞ the prior is called improper.
Flataspossible not pafs ,
-
?
Computing the Posterior
[computer ; not solve integral by hand]
-gridbetween
I
• Numerical integration: Grid approximation, Riemann integration, trapezoid
rule, Gaussian quadrature, ... However, they don't work well in high
dimensions. How to deal with infinite intervals ?
Aside: How
-
to deal with integrals over infinite intervals?
∞
∫−∞ 𝑓
>
One approach is to change of variables, like this one:
𝑥 𝑑𝑥 =
1
𝑓
𝑡
∫−1 1−𝑡 2
1+𝑡 2
1−𝑡 2 2
𝑑𝑡
like o of variables -
⑧
• Monte Carlo (MC) methods: simple and work equally well in any dimension.
We will concentrate on this third option and study it in more details.
Monte Carlo Simulations .
:
↓
and estimate E[h(𝜃)] with ?
𝑁
can
you obtain samples
1 O2 depends on O
,
ℎ = ℎ(𝜃𝑖 )
𝑁 Oz -
Q2 - ·
𝑖=1 .
𝑎.𝑠.
Then by the SLLN, ℎ ℎ(𝜃), as 𝑁 → ∞. That is, ℎ is a
consistent estimator of E[h(𝜃)].
Of course, the above result assumes E[h(𝜃)] < ∞. If we assume that the second
moment also exists, then we also have results about the rate of convergence in
the form of the CLT.
MCMC version: If obtaining an iid sample is not an option, then we can employ
correlated and non-identically distribute sample in the form of a Markov chain,
which leads to the Markov chain Monte Carlo methods (MCMC), studied later.
-
0.15
• Let X ~ IG(𝑎, 𝑏) ↑
↓ dinvgamma(x, 1, 3)
1
• Then ~ Γ 𝑎, 𝑏
0.10
𝑋 inverse
𝑏
• 𝐸(𝑋) = , for a > 1
gamma
.
𝑎−1
𝑏2 .
• 𝑉𝑎𝑟 𝑋 =
L 𝑎−1 2 (𝑎−2)
, for a > 2 properties
0.05
𝑏
𝑏𝑎
• 𝑓 𝑥 = Γ(𝑎) 𝑥 −𝑎−1𝑒 − 𝑥 , 𝑥 > 0
13
Normal model with 𝜎2 unknown
Suppose our model is:
𝑋1 , … , 𝑋𝑛 |𝜃, 𝜎 2 ~ 𝑁 𝜃, 𝜎 2
2
𝜎
- get
improper𝜃|𝜎 2 ~ 𝑁 𝜇0 , ① Posterior of 0 Normal
:
smaller 𝜅0
going priors ② Posterior of 62 Gamma
keep 2 :
.
𝜈0 𝜈0 𝜎0
usesmala 𝜎 ~ 𝐼𝐺 2 , 2
2 =
&
Endupwith conjugate prior
.
2
𝜎
𝜃|𝜎 2 , 𝑥1 , … , 𝑥𝑛 ~ 𝑁 𝜇𝑛 ,
𝜅𝑛 14
Step 1: Likelihood
Suppose that
𝑋1 , … , 𝑋𝑛 |𝜃, 𝜎 2 ~ N 𝜃, 𝜎 2
Then:
𝑛 𝑛 2
1 1 𝑥𝑖 −𝜃
2 2 −
𝑓 𝑥1 , … , 𝑥𝑛 𝜃, 𝜎 = ෑ 𝑓 𝑥𝑖 𝜃, 𝜎 =ෑ 𝑒 2 𝜎
𝑖=1 𝑖=1
2𝜋𝜎 2
2
𝑛 − 1 σ𝑛 𝑥𝑖 −𝜃
= 2𝜋𝜎 2 − 2 𝑒 2 𝑖=1 𝜎 group
terms with O and no .
0
Notice that
𝑛 2 𝑛 𝑛
2
𝑥𝑖 − 𝜃 1 𝜃 𝜃
= 2 𝑥𝑖2 − 2 2 𝑥𝑖 + 𝑛 2
𝜎 𝜎 𝜎 𝜎
𝑖=1 𝑖=1 𝑖=1
Proof:
𝑛 𝑛
𝑥𝑖 − 𝜃 2 = 𝑥𝑖 − 𝑥ҧ + 𝑥ҧ − 𝜃 2
𝑖=1 𝑖=1
𝑛 𝑛 𝑛
= 𝑥𝑖 − 𝑥ҧ 2 + 2 𝑥𝑖 − 𝑥ҧ 𝑥ҧ − 𝜃 + 𝑥ҧ − 𝜃 2
𝑖=1 𝑖=1
𝑛
= 𝑥𝑖 − 𝑥ҧ 2 + 2 𝑥ҧ − 𝜃 × 0 + 𝑛 𝜃 − 𝑥ҧ 2
𝑖=1
16
Step 2: combine with prior
Suppose we want to find a conjugate prior distribution for 𝜋(𝜃|𝜎2). We have that
1 𝑛
− σ 𝑥 −𝜃 2
𝑓 𝜃 𝑥1 , … , 𝑥𝑛 , 𝜎2 ∝𝜋 𝜃 𝜎2 ×𝑒 2𝜎2 𝑖=1 𝑖
*outcreaboutThis sewith O
1
− 𝜃−𝑥ҧ 2
∝𝜋 𝜃 𝜎2
×𝑒 2𝜎2
This means that for 𝜋(𝜃|𝜎2) to be conjugate it must include quadratic terms like
in them
2
𝑒 𝑐1 𝜃−𝑐2 . One choice is
𝜎 2
𝜃 | 𝜎2 ~𝑁 𝜇0 ,
𝜅0 /Da=
Then using the previous result:
𝜋 𝜃 𝜎 2 × 𝑓 𝑥1 , … , 𝑥𝑛 𝜃, 𝜎 2
𝑓 𝜃 𝑥1 , … , 𝑥𝑛 , 𝜎 2 =
𝑓 𝑥1 , … , 𝑥𝑛 𝜎 2
𝜅 𝑛
− 02 𝜃−𝜇0 2 − 2 𝜃−𝑥ҧ 2
∝ 𝑒 2𝜎 × 𝑒 2𝜎
- Similar structures.
Multiplied by dif quantities -
>
17
Step 2: combine with prior
𝜋 𝜃 𝜎 2 × 𝑓 𝑥1 , … , 𝑥𝑛 𝜃, 𝜎 2
𝑓 𝜃 𝑥1 , … , 𝑥𝑛 , 𝜎 2 =
𝑓 𝑥1 , … , 𝑥𝑛 𝜎 2
𝜅 1 1
− 02 𝜃−𝜇0 2 − 2 𝜃−𝑥ҧ 2 − 2 𝜅0 𝜃−𝜇0 2 +𝑛 𝜃−𝑥ҧ 2
∝ 𝑒 2𝜎 × 𝑒 2𝜎 = 𝑒 2𝜎
I
This shows that (after completing the square as an exercise):
−
𝜅𝑛
𝜃−𝜇𝑛 2 Normaldensity property
.
𝑓 𝜃 𝑥1 , … , 𝑥𝑛 , 𝜎 2 ∝ 𝑒 2𝜎2 >
Jean 𝜎 2
𝜃|𝑥1 , … , 𝑥𝑛 , 𝜎 2 ~ 𝑁 𝜇𝑛 ,
𝜅𝑛
𝜅0 𝜇0 +𝑛𝑥ҧ
where 𝜅𝑛 = 𝜅0 + 𝑛, 𝜇𝑛 = 𝜅𝑛
𝜈𝑛 𝜈𝑛 𝜎𝑛2
𝜎 2 |𝑥1 , … , 𝑥𝑛 ~ 𝐼𝐺 ,
2 2
1 𝜅0 𝑛
where 𝜈𝑛 = 𝜈0 + 𝑛 and 𝜎𝑛2 = 𝜈 𝜈0 𝜎02 + σ𝑛𝑖=1 𝑥𝑖 − 𝑥ҧ 2 + 𝜅𝑛
𝑥ҧ − 𝜇0 2
𝑛
21
Time to pause and think
chain .
• How did we obtain these results? used Markov
• Deriving that
𝜎 2
𝜃|𝑥1 , … , 𝑥𝑛 , 𝜎 2 ~ 𝑁 𝜇𝑛 ,
𝜅𝑛
required a lot of algebra and completion of the square technique.
• Deriving
𝜈 𝜈 𝜎 2
𝑛 𝑛 𝑛
𝜎 2 |𝑥1 , … , 𝑥𝑛 ~ 𝐼𝐺 ,
2 2
required the algebra, completion of square, followed by integration!
54135
How does it work?
Contract: King Markov must visit each island
in proportion to its population size.
Imagine the king is too lazy to keep track of schedules and wants a simple algorithm
using only the current location and nearby islands’ population sizes. Here is how he
does it:
Propose a move
Compare to current state
Decide whether to jump
Repeat until “convergence”
Yoly)
=
GEO
# -
computetogether ;
both
Height :
.
𝜏2 = ?
Recall: Hierarchical Data Structure
Consider the normal model:
𝑋1𝑗 , … , 𝑋𝑛𝑗 𝑗 |𝜃𝑗 , 𝜎 2 ~ 𝑁 𝜃𝑗 , 𝜎 2 , 𝑗 = 1, … , 𝑚
𝜃𝑗 |𝜇, 𝜏 2 ~ 𝑁 𝜇, 𝜏 2
𝜎 2 ~ IG
𝜇 ~𝑁 𝜇0 , 𝛾02
𝜏2~ ?
What should be the value of 𝜏 2 which represents knowledge
about how similar are the group means?
0 𝜏2 = ?
0 𝜏2 = ? ∞
0
𝜏 2~ 𝜋 𝜏 2 ∞
Clusters are unrelated
All clusters are the same
Aka “no pooling”
Aka “complete pooling”
Multistage Models
The basic Bayesian model has two stages:
Stage 1: 𝑓(𝑥|𝜃, 𝜂)
Stage 2: 𝑓(𝜃|𝜂)
where 𝜂 is a vector of hyperparameters. For example, in the Beta-Binomial
model 𝜂 = 𝛼, 𝛽 .
If we are not sure about the specific value of the hyperparameter 𝜂, then we
can have a third stage by placing a hyperprior on it:
Stage 3: 𝜂 ∼ 𝑔 𝜂 , 𝜂 ∈ 𝑌
Then the posterior is calculated as:
𝑓(𝑥, 𝜃) ∫ 𝑓(𝑥, 𝜃, 𝜂) 𝑑𝜂
𝑓(𝜃 | 𝑥) = =
𝑚(𝑥) 𝑥(𝑓 , 𝜃, 𝜂) 𝑑𝜂 𝑑𝜃
∫ 𝑓(𝑥 | 𝜃, 𝜂) 𝜋 (θ | 𝜂) 𝑔(𝜂) 𝑑𝜂
=
𝜃 | 𝑥(𝑓 , 𝜂) 𝜋(𝜃 | 𝜂) 𝑔(𝜂) 𝑑𝜂 𝑑𝜃
These type of models usually result in intractable posteriors and need
approximate solutions.
Hierarchical Models: Why?
Consider the school dataexample again. How would you model the data without
hierarchy?
At one extreme, we could consider that each school has their own average rate
and estimate the means individually per each school. This is known as the no
pooling of data approach.
Advantage: Unbiased individual estimates.
Disadvantage: Low accuracy (high variance).
At the other extreme, we can consider all schools similar enough and just
estimate a single pooled estimate (aka complete pooling of data).
Hierarchical estimates are a compromise between these two extremes. One lets
the data decide on the degree of pooling (judged by the Stage 2 estimated
variance).
Example: Multistage Logistic Regression
Goal: estimate death rates 𝜃j, j = A, … , L after undergoing cardiac surgery in 12
hospitals A, … , L. Data are mortality counts xj out of mj surgeries, j = A, … , L.
Stage 0: 𝑓 𝑥𝑗 𝜃𝑗 , 𝜂 ~ Binomial 𝑚𝑗 , 𝜃𝑗 , 𝑗 = 𝐴, … , 𝐿
Stage 1: 𝜃𝐴 , … , 𝜃𝐿 |𝜂 ~ 𝑓 𝜃 𝜂
Stage 2: 𝜂 ∼ 𝜋 𝜂 , 𝜂 ∈ 𝑌
𝜃𝑗
For Stage 1, the model is reparametrized with 𝛽𝑗 = log 1−𝜃 and assume the
𝑗
betas are independent with
𝛽𝑗 |𝜇, 𝜎 2 ~ 𝑁 𝜇, 𝜎 2
For Stage 2, assume 𝜇 and 𝜎2 are independent with
𝜇 ~ 𝑁 0, 𝑐 2
𝜎 2 ~ IG(𝑎, 𝑏)
where IG is the inverse gamma prior distribution.
Let a = b = 0.001 and c = 103 so the priors have little impact on the posterior.
Example (continued)
The joint density has the form: ς𝐿𝑗=𝐴 𝑓 𝑥𝑗 𝛽𝑗 , 𝜂 × 𝑓 𝛽𝑗 𝜂 × 𝑝𝑟𝑖𝑜𝑟, or
𝑓 𝜷, 𝜇, 𝜎 2 𝒙
𝑚𝑗 𝑒 𝑥𝛽𝑗 1 −
1
𝛽𝑗 −𝜇
2
= ෑ 𝑥𝑗 𝑚𝑗 1𝑒
2𝜎 2
× 𝜋 𝜇 × 𝜋 𝜎2
𝑗 1 + 𝑒 𝛽𝑗 2𝜋𝜎 2 2
where
1 1 𝜇2
−2 2
𝜋 𝜇 = 𝑒 𝑐
2𝜋𝑐 2
𝑏 𝑎 𝑏
2 2 −𝑎−1 − 2
𝜋 𝜎 = 𝜎 𝑒 𝜎
Γ 𝑎
To run a multi-stage Gibbs sampler, we need to be able to sample from the full
conditional distributions of 𝜇, 𝜎2, and 𝛽j, j = A, … , L
Exercise: Show that:
𝜇|𝜎 2 , 𝛽𝐴 , … , 𝛽𝐿 , 𝒙 ~ 𝑁(? , ? )
𝜎 2 |𝜇, 𝛽𝐴 , … , 𝛽𝐿 , 𝒙 ~ 𝐼𝐺 ? , ?
However, the conditional distribution of 𝛽j | 𝜇, 𝜎2, 𝛽-j, x is unknown. You can use a
rejection sampler to simulate from it or a Metropolis step within Gibbs.
Example: Tadpole
predation
• data(reedfrogs)