0% found this document useful (0 votes)
12 views50 pages

Bayesian Inference and Modeling Techniques

The document discusses Bayesian statistics, highlighting the concept of posterior distribution formed by combining prior information with sample data. It explains conjugate priors, particularly the Beta-Binomial model, and introduces confidence regions, also known as credible regions. Additionally, it covers methods for computing posterior distributions, including Monte Carlo simulations and the use of improper priors.

Uploaded by

tanyalim
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
12 views50 pages

Bayesian Inference and Modeling Techniques

The document discusses Bayesian statistics, highlighting the concept of posterior distribution formed by combining prior information with sample data. It explains conjugate priors, particularly the Beta-Binomial model, and introduces confidence regions, also known as credible regions. Additionally, it covers methods for computing posterior distributions, including Monte Carlo simulations and the use of improper priors.

Uploaded by

tanyalim
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd

STAT 5703

Statistical Inference and Modeling


for Data Science
Dobrin Marchev

1
Recall: Bayesian Statistics
The main idea is that 𝜃 is now a random variable and the information brought by
the sample x is combined with a prior information specified by a prior distribution
𝜋(𝜃), to form the posterior distribution 𝜋 (𝜃| 𝒙) by using Bayes formula:

𝑓(𝒙 | 𝜃) 𝜋(𝜃)
𝑓(𝜃| 𝒙) = ,
𝑚(𝒙)
where where likely O is

𝑚(𝒙) = න 𝑓(𝒙| 𝜃) 𝜋(𝜃) 𝑑𝜃

is the marginal distribution of X.

It is useful to think that 𝜃 being random represents our uncertainty about its value.
Before seeing the data, our uncertainty is represented by 𝜋(𝜃) and after that by
𝑓(𝜃| 𝒙).
Conjugate Priors
Binomials normal model values of difpiors.
Posterior ends up with

The computational drawbacks used to be so severe that in the


past many researchers worked only with the so-called conjugate
priors.

Definition: A family of probability distributions ℱ = { 𝜋(𝜃) } is


conjugate for the model 𝑓(𝑥| 𝜃) if
𝑓 𝜃|𝑥 ∈ℱ

Note: The Gaussian example was conjugate, because the prior


had normal distribution, and the posterior also had a normal
distribution.
Example: Beta-Binomial Model
Let
𝑥 | 𝜃 ∼ Binom(𝑛, 𝜃)
𝜃 ∼ Beta(𝛼, 𝛽)
That is,
𝑛 𝑥
𝑓(𝑥 | 𝜃) = 𝑃(𝑋 = 𝑥 | 𝜃) = 𝜃 1 − 𝜃 𝑛−𝑥
𝑥
Γ 𝛼+𝛽
𝜋 𝜃 = 𝜃 𝛼−1 1 − 𝜃 𝛽−1
Γ 𝛼 Γ 𝛽
𝑛 Γ 𝛼+𝛽
⇒ 𝑓(𝑥, 𝜃) = 𝜃 𝛼+𝑥−1 1 − 𝜃 𝑛+𝛽−𝑥−1
𝑥 Γ 𝛼 Γ 𝛽

⇒ 𝜃 | 𝑥 ∼ Beta(𝛼 + 𝑥, 𝑛 + 𝛽 − 𝑥)

Therefore, prior is Beta distributed, and posterior is Beta distributed,


meaning the Beta distribution is a conjugate prior.

Note: See R examples.


if posterior distribution is post. density
symmetric highestthat
,

not relevant
Confidence
·

skewed better approach which horizontal


But if : have :

level results in projections where it intersect


Regions ↓ a = 2, b = 5
the curve with 95% error .

Any other interval is guaranteed


to
The Bayesian analog of CI's are wider
be
the confidence regions, also
.

2.0
called credible regions, with
highest posterior density:

1.5
f(x)
𝐶(𝒙) = { 𝜃: 𝑓(𝜃 | 𝒙) ≥ 𝑘}

1.0
where k is such that

0.5
𝑃 𝜃 ∈ 𝐶(𝒙 𝒙 = 𝛾
= 𝑃 𝑓 𝜃 𝒙 ≥ 𝑘 | 𝒙)
0.0

This leads to a different


0.0 0.2 0.4 0.6 0.8 1.0
computational problem,
x
namely solving
𝑓(𝜃 | 𝒙) = 𝑘
We will compute the intervals with
the function HPDI from rethinking.
G
HDI: Explained
skewed or

stickto this don't know where posterior


is
if
you symmetric
HDI of unimodal what if want more area
you
distribution is above ?
shortest interval why use middle # 1 96
.
-
The algorithms for
computing the HDI
of an MCMC sample
rely on a crucial
property: shortest interval for
1 645-
For a unimodal -
-
.

normal are
probability -

---------
distribution on a
single variable, the
HDI of mass M is the
narrowest possible
interval of that mass.
Improper Priors Hwkind of 92 not exam
.

-
,

Sometimes we may wish that 𝜋(𝜃) contains no information about 𝜃. For


1
example, if Θ = [𝑎, 𝑏], then 𝜋 𝜃 = 𝑏−𝑎 ∼ 𝑈(𝑎, 𝑏) is a natural choice
when Θ is bounded.
flat.
distribution)
Tempting
as
parameter (e g normal
.
. .
to use prior
f

However, if, say, Θ = ℝ, and 𝜋 (𝜃) = 𝑐, then


∞ fit uniform distribution btw a

∫ 𝜋(𝜃) 𝑑 𝜃 = න 𝑐 𝑑𝜃 = ∞ and b.

−∞
under curve.


Area
Definition: When ∫−∞ 𝜋(𝜃) 𝑑𝜃 = ∞ the prior is called improper.
Flataspossible not pafs ,
-

If 𝜋(𝜃) is improper, then Bayesian inference is still possible, if 𝑓(𝜃 | 𝑥) is a


proper density. For example, if 𝜋 (𝜃) = 𝑐 , then we must have that
∫ 𝑓(𝑥| 𝜃) 𝑑 𝜃 = 𝑘 < ∞. In such cases
𝑓(𝑥 | 𝜃) 𝜋(𝜃) 𝑐𝑓(𝑥 | 𝜃) 𝑓(𝑥 | 𝜃)
𝑓(𝜃 | 𝑥) = = =
∫ 𝑓(𝑥 | 𝜃 ) 𝜋 (𝜃) 𝑑 𝜃 ∫ 𝑐𝑓(𝑥 | 𝜃 ) 𝑑 𝜃 𝑘
How to :

?
Computing the Posterior
[computer ; not solve integral by hand]

• Q: What do we do when the integrals are


not available in closed form?

• A: There are three major alternatives to


exact (calculus) solutions.

• Numerical integration: Riemann


integration, trapezoid rule, Gaussian
quadrature, ... However, they don't work
well in high dimensions.
• Analytical approach, like Laplace
approximation, but too complicated and
still doesn't work well, especially with
hierarchical (multistage) models.
• Monte Carlo (MC) methods: simple and
work equally well in any dimension. We
will concentrate on the third option and
study it in more details.
Non-exact Estimation
what

-gridbetween

I
• Numerical integration: Grid approximation, Riemann integration, trapezoid
rule, Gaussian quadrature, ... However, they don't work well in high
dimensions. How to deal with infinite intervals ?

Aside: How
-
to deal with integrals over infinite intervals?


∫−∞ 𝑓
>
One approach is to change of variables, like this one:
𝑥 𝑑𝑥 =
1
𝑓
𝑡
∫−1 1−𝑡 2
1+𝑡 2
1−𝑡 2 2
𝑑𝑡
like o of variables -

• Analytical approach, like Laplace approximation, but too complicated and


still doesn't work well, especially with hierarchical (multistage) models.
our
focus .


• Monte Carlo (MC) methods: simple and work equally well in any dimension.
We will concentrate on this third option and study it in more details.
Monte Carlo Simulations .
:

calculate model parameter


Monte Carlo Estimation
Basic idea: If we want to find 𝐸 ℎ 𝜃 = ∫ ℎ(𝜃) 𝑓 𝜃 𝑥 𝑑𝜃 for some ℎ(⋅) and
some 𝜃 ∼ 𝑓(𝜃) (within the Bayesian framework we typically have 𝑓(𝜃) =
𝑓(𝜃 | 𝑥) and ℎ(𝜃) = 𝜃), then we can obtain an iid sample 𝜃1 , … , 𝜃𝑁 from 𝑓 𝜃


and estimate E[h(𝜃)] with ?
𝑁
can
you obtain samples

1 O2 depends on O
,
ℎ = ෍ ℎ(𝜃𝑖 )
𝑁 Oz -
Q2 - ·

𝑖=1 .
𝑎.𝑠.
Then by the SLLN, ℎ෠ ℎ(𝜃), as 𝑁 → ∞. That is, ℎ෠ is a
consistent estimator of E[h(𝜃)].

Of course, the above result assumes E[h(𝜃)] < ∞. If we assume that the second
moment also exists, then we also have results about the rate of convergence in
the form of the CLT.

MCMC version: If obtaining an iid sample is not an option, then we can employ
correlated and non-identically distribute sample in the form of a Markov chain,
which leads to the Markov chain Monte Carlo methods (MCMC), studied later.
-

The LLN and CLT still hold!


Recall: Normal model
• Normal model with population variance 𝜎 2 known:
𝑋1 , … , 𝑋𝑛 |𝜃 ~ 𝑁 𝜃, 𝜎 2
𝜃 ∼ 𝑁(𝜇, 𝜏 2 ) r

Can be shown that the posterior is


2
𝜃|𝑥1 , … , 𝑥𝑛 ~ 𝑁 𝜇𝑝𝑜𝑠𝑡 , 𝜎𝑝𝑜𝑠𝑡
where
𝜇 𝑥ҧ 1 𝑛
+
𝜏 2 𝜎 2 /𝑛 𝜏 2 𝜎 2
𝜇𝑝𝑜𝑠𝑡 = 𝐸 𝜃 𝒙 = = 𝜇+ 𝑥ҧ
1 𝑛 1 𝑛 1 𝑛
+ + +
𝜏2 𝜎2 𝜏2 𝜎2 𝜏2 𝜎2
2
1
𝜎𝑝𝑜𝑠𝑡 = 𝑉𝑎𝑟 𝜃 𝒙 =
1 𝑛
+
𝜏2 𝜎2
Q: What if we have to estimate 𝜎 2 along with the mean 𝜃?
Example: Normal model with 𝜎2 unknown
𝑋1 , … , 𝑋𝑛 |𝜃 ~ 𝑁 𝜃, 𝜎 2
𝜃, 𝜎 2 ∼ ? ? ?
The posterior distribution is:
2 2
𝑓 𝑥1 , … , 𝑥𝑛 𝜃, 𝜎 𝜋 𝜃, 𝜎
𝑓 𝜃, 𝜎 2 𝑥1 , … , 𝑥𝑛 =
𝑓 𝑥1 , … , 𝑥𝑛
We will find some simple conjugate prior. Specifying a joint prior
distribution on parameters with different parameter spaces is
very challenging, so usually it is done sequentially: Bivariate
can choose dif types
𝜋 𝜃, 𝜎 2 = 𝜋 𝜃 𝜎 2 × 𝜋 𝜎 2 .

Bivariate pot of Marginal & Conditional-


2
as a
𝜎2
For 𝜋 𝜃 𝜎 we can choose a normal prior with variance
𝜅0
For σ2 we need a family of prior distributions that has support
on (0, ∞). Unfortunately, the Gamma family is not conjugate for
the normal variance. However, the inverse gamma family does
turn out to be a conjugate class of densities for σ2. 12
IG(1,3)

Aside: Inverse Gamma


in stat
Distribution
use

0.15
• Let X ~ IG(𝑎, 𝑏) ↑

↓ dinvgamma(x, 1, 3)
1
• Then ~ Γ 𝑎, 𝑏

0.10
𝑋 inverse
𝑏
• 𝐸(𝑋) = , for a > 1
gamma
.

𝑎−1
𝑏2 .
• 𝑉𝑎𝑟 𝑋 =
L 𝑎−1 2 (𝑎−2)
, for a > 2 properties

0.05
𝑏
𝑏𝑎
• 𝑓 𝑥 = Γ(𝑎) 𝑥 −𝑎−1𝑒 − 𝑥 , 𝑥 > 0

• It is available in R with the


0.00

dinvgamma function from the


package invgamma. 0 5 10 15 20 25

13
Normal model with 𝜎2 unknown
Suppose our model is:
𝑋1 , … , 𝑋𝑛 |𝜃, 𝜎 2 ~ 𝑁 𝜃, 𝜎 2
2
𝜎
- get
improper𝜃|𝜎 2 ~ 𝑁 𝜇0 , ① Posterior of 0 Normal
:

smaller 𝜅0
going priors ② Posterior of 62 Gamma
keep 2 :
.

𝜈0 𝜈0 𝜎0
usesmala 𝜎 ~ 𝐼𝐺 2 , 2
2 =

&
Endupwith conjugate prior
.

The prior parameters (𝜇 E 0, 𝜅0 ) can be interpreted as the sample


mean and sample size of prior observations on the mean.
The prior parameters (𝜎02 , 𝜈0 ) can be interpreted as the sample
variance and sample size of prior observations on variance.
The posterior distribution can be similarly decomposed:
𝑓 𝜃, 𝜎 2 𝑥1 , … , 𝑥𝑛 = 𝑓 𝜃 𝜎 2 , 𝑥1 , … , 𝑥𝑛 × 𝑓 𝜎 2 𝑥1 , … , 𝑥𝑛
You can show that specify prior slope !
on

2
𝜎
𝜃|𝜎 2 , 𝑥1 , … , 𝑥𝑛 ~ 𝑁 𝜇𝑛 ,
𝜅𝑛 14
Step 1: Likelihood
Suppose that
𝑋1 , … , 𝑋𝑛 |𝜃, 𝜎 2 ~ N 𝜃, 𝜎 2
Then:
𝑛 𝑛 2
1 1 𝑥𝑖 −𝜃
2 2 −
𝑓 𝑥1 , … , 𝑥𝑛 𝜃, 𝜎 = ෑ 𝑓 𝑥𝑖 𝜃, 𝜎 =ෑ 𝑒 2 𝜎
𝑖=1 𝑖=1
2𝜋𝜎 2
2
𝑛 − 1 σ𝑛 𝑥𝑖 −𝜃
= 2𝜋𝜎 2 − 2 𝑒 2 𝑖=1 𝜎 group
terms with O and no .
0

Notice that
𝑛 2 𝑛 𝑛
2
𝑥𝑖 − 𝜃 1 𝜃 𝜃
෍ = 2 ෍ 𝑥𝑖2 − 2 2 ෍ 𝑥𝑖 + 𝑛 2
𝜎 𝜎 𝜎 𝜎
𝑖=1 𝑖=1 𝑖=1

This shows that the pair σ𝑛𝑖=1 𝑥𝑖2 , σ𝑛𝑖=1 𝑥𝑖 is a two-dimensional


sufficient statistic. 15
Aside (from a previous lecture)
Show that
𝑛 𝑛
2 2 2
෍ 𝑥𝑖 − 𝜃 = ෍ 𝑥𝑖 − 𝑥ҧ + 𝑛 𝜃 − 𝑥ҧ
𝑖=1 𝑖=1

Proof:
𝑛 𝑛
෍ 𝑥𝑖 − 𝜃 2 = ෍ 𝑥𝑖 − 𝑥ҧ + 𝑥ҧ − 𝜃 2

𝑖=1 𝑖=1
𝑛 𝑛 𝑛
= ෍ 𝑥𝑖 − 𝑥ҧ 2 + 2 ෍ 𝑥𝑖 − 𝑥ҧ 𝑥ҧ − 𝜃 + ෍ 𝑥ҧ − 𝜃 2

𝑖=1 𝑖=1 𝑖=1


𝑛 𝑛
= ෍ 𝑥𝑖 − 𝑥ҧ 2 + 2 𝑥ҧ − 𝜃 ෍ 𝑥𝑖 − 𝑥ҧ + 𝑛 𝜃 − 𝑥ҧ 2

𝑖=1 𝑖=1
𝑛
= ෍ 𝑥𝑖 − 𝑥ҧ 2 + 2 𝑥ҧ − 𝜃 × 0 + 𝑛 𝜃 − 𝑥ҧ 2

𝑖=1
16
Step 2: combine with prior
Suppose we want to find a conjugate prior distribution for 𝜋(𝜃|𝜎2). We have that
1 𝑛
− σ 𝑥 −𝜃 2
𝑓 𝜃 𝑥1 , … , 𝑥𝑛 , 𝜎2 ∝𝜋 𝜃 𝜎2 ×𝑒 2𝜎2 𝑖=1 𝑖

*outcreaboutThis sewith O
1
− 𝜃−𝑥ҧ 2
∝𝜋 𝜃 𝜎2
×𝑒 2𝜎2

This means that for 𝜋(𝜃|𝜎2) to be conjugate it must include quadratic terms like
in them
2
𝑒 𝑐1 𝜃−𝑐2 . One choice is
𝜎 2
𝜃 | 𝜎2 ~𝑁 𝜇0 ,
𝜅0 /Da=
Then using the previous result:
𝜋 𝜃 𝜎 2 × 𝑓 𝑥1 , … , 𝑥𝑛 𝜃, 𝜎 2
𝑓 𝜃 𝑥1 , … , 𝑥𝑛 , 𝜎 2 =
𝑓 𝑥1 , … , 𝑥𝑛 𝜎 2
𝜅 𝑛
− 02 𝜃−𝜇0 2 − 2 𝜃−𝑥ҧ 2
∝ 𝑒 2𝜎 × 𝑒 2𝜎
- Similar structures.
Multiplied by dif quantities -

>

17
Step 2: combine with prior
𝜋 𝜃 𝜎 2 × 𝑓 𝑥1 , … , 𝑥𝑛 𝜃, 𝜎 2
𝑓 𝜃 𝑥1 , … , 𝑥𝑛 , 𝜎 2 =
𝑓 𝑥1 , … , 𝑥𝑛 𝜎 2
𝜅 1 1
− 02 𝜃−𝜇0 2 − 2 𝜃−𝑥ҧ 2 − 2 𝜅0 𝜃−𝜇0 2 +𝑛 𝜃−𝑥ҧ 2
∝ 𝑒 2𝜎 × 𝑒 2𝜎 = 𝑒 2𝜎

To finish we need to rewrite as a complete square in 𝜃:


𝜅0 𝜃 − 𝜇0 2 + 𝑛 𝜃 − 𝑥ҧ 2
= 𝜅0 𝜃 2 − 2𝜅0 𝜇0 𝜃 + 𝜅0 𝜇02 + 𝑛𝜃 2 − 2𝑛𝑥𝜃ҧ + 𝑛𝑥ҧ 2 = ⋯

I
This shows that (after completing the square as an exercise):

𝜅𝑛
𝜃−𝜇𝑛 2 Normaldensity property
.

𝑓 𝜃 𝑥1 , … , 𝑥𝑛 , 𝜎 2 ∝ 𝑒 2𝜎2 >

which proves that -


Jean 𝜎 2
𝜃|𝑥1 , … , 𝑥𝑛 , 𝜎 2 ~ 𝑁 𝜇𝑛 ,
𝜅𝑛
𝜅0 𝜇0 +𝑛𝑥ҧ
where 𝜅𝑛 = 𝜅0 + 𝑛, 𝜇𝑛 = 𝜅𝑛

Not exam &2 18


Posterior Distribution of 𝜎2 | x1, …, xn
The posterior distribution of σ2 can be obtained by performing an
integration over the unknown value of θ.
𝑓 𝜎 2 𝑥1 , … , 𝑥𝑛 ∝ 𝜋 𝜎 2 𝑓 𝑥1 , … , 𝑥𝑛 𝜎 2
= 𝜋 𝜎 2 න 𝑓 𝑥1 , … , 𝑥𝑛 𝜃, 𝜎 2 𝜋 𝜃 𝜎 2 𝑑𝜃
𝜈0
𝜈0 𝜎02 2 𝜈0 𝜎02
2 𝜈
− 20 −1 − 𝜎22 −
𝑛 − 1 𝜃−𝑥ҧ 2
= 𝜎 2 𝑒 න 𝜋 𝜃 𝜎2 𝜎 2 2 𝑒 2𝜎2 𝑑𝜃
𝜈
Γ 20
= … (exercise)
We conclude that
2
𝜈𝑛 𝜈𝑛 𝜎𝑛
𝜎 2 |𝑥1 , … 𝑥𝑛 ~ 𝐼𝐺 ,
2 2
1 𝜅0 𝑛
where 𝜈𝑛 = 𝜈0 + 𝑛 and 𝜎𝑛2 = 𝜈 𝜈0 𝜎02 + σ𝑛𝑖=1 𝑥𝑖 − 𝑥ҧ 2 + 𝜅𝑛
𝑥ҧ − 𝜇0 2
𝑛
Monte Carlo Sampling
What if wanted to calculate E(𝜃|x1, … , xn)?
The only thing we know is that
2
𝜎
𝜃|𝜎 2 , 𝑥1 , … , 𝑥𝑛 ~ 𝑁 𝜇𝑛 ,
𝜅𝑛
but we don’t know the marginal posterior distribution of 𝜃 (without
conditioning on 𝜎 2 ).
Instead of trying to derive it with calculus, we can resort to the
Monte Carlo method to generate:
2 2(1)
𝜈𝑛 𝜈𝑛 𝜎𝑛 𝜎
𝜎 2(1) |𝑥1 , … 𝑥𝑛 ~ 𝐼𝐺 , , 𝑡ℎ𝑒𝑛 𝜃 1 ~𝑁 𝜇𝑛 ,
2 2 𝜅𝑛

2
𝜈𝑛 𝜈𝑛 𝜎𝑛 𝜎 2(𝑆)
𝜎 2(𝑆) |𝑥1 , … 𝑥𝑛 ~ 𝐼𝐺 , , 𝑡ℎ𝑒𝑛 𝜃 𝑆
~𝑁 𝜇𝑛 ,
2 2 𝜅𝑛
Recap
Suppose our model for the data is:
𝑋1 , … , 𝑋𝑛 |𝜃, 𝜎 2 ~ 𝑁 𝜃, 𝜎 2
Assume the following priors:
𝜎 2
𝜃|𝜎 2 ~ 𝑁 𝜇0 ,
𝜅0
2
𝜈0 𝜈0 𝜎0
𝜎 2 ~ 𝐼𝐺 ,
2 2
Then we showed that
2
𝜎2
𝜃|𝜎 , 𝑥1 , … , 𝑥𝑛 ~ 𝑁 𝜇𝑛 ,
𝜅𝑛
𝜅0 𝜇0 +𝑛𝑥ҧ
where 𝜅𝑛 = 𝜅0 + 𝑛, 𝜇𝑛 = 𝜅𝑛
and

𝜈𝑛 𝜈𝑛 𝜎𝑛2
𝜎 2 |𝑥1 , … , 𝑥𝑛 ~ 𝐼𝐺 ,
2 2
1 𝜅0 𝑛
where 𝜈𝑛 = 𝜈0 + 𝑛 and 𝜎𝑛2 = 𝜈 𝜈0 𝜎02 + σ𝑛𝑖=1 𝑥𝑖 − 𝑥ҧ 2 + 𝜅𝑛
𝑥ҧ − 𝜇0 2
𝑛
21
Time to pause and think
chain .
• How did we obtain these results? used Markov
• Deriving that
𝜎 2
𝜃|𝑥1 , … , 𝑥𝑛 , 𝜎 2 ~ 𝑁 𝜇𝑛 ,
𝜅𝑛
required a lot of algebra and completion of the square technique.

• Deriving
𝜈 𝜈 𝜎 2
𝑛 𝑛 𝑛
𝜎 2 |𝑥1 , … , 𝑥𝑛 ~ 𝐼𝐺 ,
2 2
required the algebra, completion of square, followed by integration!

• Do we want to ever derive it again?


• Answer: No!
• We could have sampled from 𝜎 2 |𝜃, 𝑥1 , … , 𝑥𝑛 instead of 𝜎 2 |𝑥1 , … , 𝑥𝑛 and
thus avoiding the integration part (algebra remains though).
• What?! Is this legit?? Do we know its distribution?
• Can we, please, also avoid the algebra part?
• Yes!
MCMC Methods

• Metropolis ,1953: Granddaddy


of them all
• Metropolis-Hastings (MH),
1970: More general
• Gibbs sampling (GS), 1984:
Efficient version of MH
• Hamiltonian Monte Carlo
(HMC), 1996
• Newer 21st century algorithms
much better
• New methods being
developed
Metropolis and MCMC

54135
How does it work?
Contract: King Markov must visit each island
in proportion to its population size.

Imagine the king is too lazy to keep track of schedules and wants a simple algorithm
using only the current location and nearby islands’ population sizes. Here is how he
does it:
Propose a move
Compare to current state
Decide whether to jump
Repeat until “convergence”

Yoly)
=
GEO
# -

computetogether ;

chance of moving / know the ratio of


:

Histogram Likelihood of LBR .


-

both
Height :
.

TLDR Converge to sample from Posterior


.
:
How does this relate
to Bayesian analysis?
• Usual use is to draw samples from a
posterior distribution
• “Islands”: parameter values
• “Population size”: proportional to
posterior probability
• Works for any number of dimensions
(parameters)
• Works for continuous as well as
discrete parameters
• Initially implemented on the MANIAC
(Mathematical Analyzer, Numerical
Integrator, and Computer) developed
by John von Neumann.
Sitting is Nicholas Metropolis, the
grandfather of the MCMC.
MANIAC

• Stanislaw Ulam invented the


Monte Carlo method by thinking
about predicting solitaire
outcome.
[Link]
m75a/how-solitaire-inspired-the-
worlds-most-useful-simulation-tool
Metropolis-Hastings Algorithm
• Very general method for constructing a Markov chain with a specific
invariant density f(), which is the target for the simulation.
• Suppose that q(y|x) is any conditional density, called proposal density.
• Algorithm:
o Step 0: Start at any X = x0
o Step 1: Generate Yi ~ q(y|xi-1)
𝑌 with probability 𝛼(𝑥𝑖−1 , 𝑦𝑖 )
o Step 2: Set 𝑋𝑖 = ቊ 𝑖
𝑥𝑖−1 , o/w
𝑓 𝑦 𝑞 𝑥𝑦
where 𝛼 𝑥, 𝑦 = min 𝑓 𝑥 𝑞 𝑦 𝑥 , 1 is called the acceptance probability
o Go to Step 1 and repeat
Notes:
o When q(y|x) is symmetric, the algorithm is known as Metropolis.
o When q(y|x) does not depend on x, the algorithm is known as
Independent Metropolis-Hastings.
o When q(y|x) = g(y – x), the algorithm is known as Random Walk
Metropolis-Hastings (see next slide)
Metropolis-Hastings Algorithm

• When q(y|x) is symmetric, the


algorithm is known as
Metropolis.
• When q(y|x) does not depend on
x, the algorithm is known as
Independent Metropolis-
Hastings.
• When q(y|x) = g(y – x), the
algorithm is known as Random
Walk Metropolis-Hastings and
usually has a tuning parameters,
controlling the variance of q(y|x)
• The choice of that tuning
parameters results in very
different behavior (represented
with red circle on the graph)
Gibbs Sampler

• The Gibbs’ sampler is an MCMC method to efficiently


generate a draw from any multivariate distribution
f(x1, …, xp)
(Think of this as p-dimensional posterior.)
• It is used when draws from the joint distribution are hard
to compute directly.
• Draws from the conditional distributions are easy to obtain
f(xj | x1, …, xj-1, xj+1, ... , xp)

• Start with some initial guess values x1(0), …, xp(0)


• Iterate many times to obtain a Markov chain that
converges to the target distribution f(x1, …, xp).
Iteration Steps
• Given the current values x1(t), …, xp(t), obtain new
values:

• x1(t+1) ~ f(x1 | x2(t), x3(t), …, xp(t))


• x2(t+1) ~ f(x2 | x1(t+1), x3(t), …, xp(t))
• x3(t+1) ~ f(x3 | x1(t+1), x2(t+1), …, xp(t))

• xp(t+1) ~ f(xp | x1(t+1), x2(t+1), …, xp-1(t+1))

• Under mild regularity conditions it can be shown


that (x1(t), x2(t), x3(t), …, xp(t)) converges to a draw
from the target density f.
Gibbs sampler example: Bivariate Normal
Hamiltonian
Monte Carlo

• Problem with Gibbs sampling (GS)


• Models with many parameters
usually have lots of highly correlated
parameters
• GS gets stuck, degenerates
towards random walk
• Hamiltonian dynamics to the
rescue
• Implement with ulam function
from rethinking package.
Back to the School Example
Suppose now that the data are on the amount of time (in hours)
students from 3 different high schools spent on studying or
homework for a particular subject during an exam one-week
period.

Q: How do we estimate the averages in each group? Should the


schools be treated as independent (like ANOVA) or identical (one
large sample disregarding the school info)?
Hierarchical Data Structure
• Suppose we have m groups and data (X1, … , Xm) from each
of them, where each 𝑿𝑗 = 𝑋1𝑗 , … , 𝑋𝑛𝑗 𝑗 .
• The random variables 𝑋1𝑗 , … , 𝑋𝑛𝑗 𝑗 should not be
independent, because they all belong to the same cluster.
• Students within each school perform similarly, because
they have the same professors, use same textbooks, etc.
• If I know Timmy’s reading score and know that Tommy is in
the same school as Timmy, I already know a little bit about
Tommy’s score too
• The unique contribution of each student is reduced
because some information is shared by knowing which
school they are in.
Hierarchical Data Structure

• Why do we have to account for clustering?


• The total sample size appears in the
denominator of formulas for calculating
precision in single-level models
• Using the total sample size assumes that Timmy Tommy
each individual contributes unique
information
Timmy Tommy
• This is not the case with clustering, so the
denominator will be too large.
• In English – ignoring the clustering makes
the model appear more precise than it is School
Hierarchical Data Structure
Consider the normal model for grouped data:

𝑋1𝑗 , … , 𝑋𝑛𝑗 𝑗 |𝜃𝑗 , 𝜎 2 ~ 𝑁 𝜃𝑗 , 𝜎 2 , 𝑗 = 1, … , 𝑚


𝜃𝑗 |𝜇, 𝜏 2 ~ 𝑁 𝜇, 𝜏 2
𝜎 2 ~ IG
𝜇 ~𝑁 𝜇0 , 𝛾02
𝜏2~ ?
Key question: How should we represent our information about
the group mean parameters 𝜃1 , … , 𝜃𝑚 ?
What should be the value of 𝜏 2 which represents our knowledge
about how similar are the group means are?

𝜏2 = ?
Recall: Hierarchical Data Structure
Consider the normal model:
𝑋1𝑗 , … , 𝑋𝑛𝑗 𝑗 |𝜃𝑗 , 𝜎 2 ~ 𝑁 𝜃𝑗 , 𝜎 2 , 𝑗 = 1, … , 𝑚
𝜃𝑗 |𝜇, 𝜏 2 ~ 𝑁 𝜇, 𝜏 2
𝜎 2 ~ IG
𝜇 ~𝑁 𝜇0 , 𝛾02
𝜏2~ ?
What should be the value of 𝜏 2 which represents knowledge
about how similar are the group means?

0 𝜏2 = ?

All clusters are the same


Aka “complete pooling”
Recall: Hierarchical Data Structure
Consider the normal model:
𝑋1𝑗 , … , 𝑋𝑛𝑗 𝑗 |𝜃𝑗 , 𝜎 2 ~ 𝑁 𝜃𝑗 , 𝜎 2 , 𝑗 = 1, … , 𝑚
𝜃𝑗 |𝜇, 𝜏 2 ~ 𝑁 𝜇, 𝜏 2
𝜎 2 ~ IG
𝜇 ~𝑁 𝜇0 , 𝛾02
𝜏2~ ?
What should be the value of 𝜏 2 which represents knowledge
about how similar are the group means?

0 𝜏2 = ? ∞

All clusters are the same Clusters are unrelated


Aka “complete pooling” Aka “no pooling”
Recall: Hierarchical Data Structure
Consider the normal model:
𝑋1𝑗 , … , 𝑋𝑛𝑗 𝑗 |𝜃𝑗 , 𝜎 2 ~ 𝑁 𝜃𝑗 , 𝜎 2 , 𝑗 = 1, … , 𝑚
𝜃𝑗 |𝜇, 𝜏 2 ~ 𝑁 𝜇, 𝜏 2
𝜎 2 ~ IG
𝜇 ~𝑁 𝜇0 , 𝛾02
𝜏2~ 𝜋 𝜏2
What should be the value of 𝜏 2 which represents knowledge
about how similar are the group means?
A: Use a prior and learn 𝜏 2 from the data!

0
𝜏 2~ 𝜋 𝜏 2 ∞
Clusters are unrelated
All clusters are the same
Aka “no pooling”
Aka “complete pooling”
Multistage Models
The basic Bayesian model has two stages:
Stage 1: 𝑓(𝑥|𝜃, 𝜂)
Stage 2: 𝑓(𝜃|𝜂)
where 𝜂 is a vector of hyperparameters. For example, in the Beta-Binomial
model 𝜂 = 𝛼, 𝛽 .

If we are not sure about the specific value of the hyperparameter 𝜂, then we
can have a third stage by placing a hyperprior on it:
Stage 3: 𝜂 ∼ 𝑔 𝜂 , 𝜂 ∈ 𝑌
Then the posterior is calculated as:
𝑓(𝑥, 𝜃) ∫ 𝑓(𝑥, 𝜃, 𝜂) 𝑑𝜂
𝑓(𝜃 | 𝑥) = =
𝑚(𝑥) ‫𝑥(𝑓 ׭‬, 𝜃, 𝜂) 𝑑𝜂 𝑑𝜃
∫ 𝑓(𝑥 | 𝜃, 𝜂) 𝜋 (θ | 𝜂) 𝑔(𝜂) 𝑑𝜂
=
‫𝜃 | 𝑥(𝑓 ׭‬, 𝜂) 𝜋(𝜃 | 𝜂) 𝑔(𝜂) 𝑑𝜂 𝑑𝜃
These type of models usually result in intractable posteriors and need
approximate solutions.
Hierarchical Models: Why?
Consider the school dataexample again. How would you model the data without
hierarchy?

At one extreme, we could consider that each school has their own average rate
and estimate the means individually per each school. This is known as the no
pooling of data approach.
Advantage: Unbiased individual estimates.
Disadvantage: Low accuracy (high variance).

At the other extreme, we can consider all schools similar enough and just
estimate a single pooled estimate (aka complete pooling of data).

Advantage: High accuracy. Disadvantage: No individual estimates (bias).

Hierarchical estimates are a compromise between these two extremes. One lets
the data decide on the degree of pooling (judged by the Stage 2 estimated
variance).
Example: Multistage Logistic Regression
Goal: estimate death rates 𝜃j, j = A, … , L after undergoing cardiac surgery in 12
hospitals A, … , L. Data are mortality counts xj out of mj surgeries, j = A, … , L.

Stage 0: 𝑓 𝑥𝑗 𝜃𝑗 , 𝜂 ~ Binomial 𝑚𝑗 , 𝜃𝑗 , 𝑗 = 𝐴, … , 𝐿
Stage 1: 𝜃𝐴 , … , 𝜃𝐿 |𝜂 ~ 𝑓 𝜃 𝜂
Stage 2: 𝜂 ∼ 𝜋 𝜂 , 𝜂 ∈ 𝑌

𝜃𝑗
For Stage 1, the model is reparametrized with 𝛽𝑗 = log 1−𝜃 and assume the
𝑗
betas are independent with
𝛽𝑗 |𝜇, 𝜎 2 ~ 𝑁 𝜇, 𝜎 2
For Stage 2, assume 𝜇 and 𝜎2 are independent with
𝜇 ~ 𝑁 0, 𝑐 2
𝜎 2 ~ IG(𝑎, 𝑏)
where IG is the inverse gamma prior distribution.
Let a = b = 0.001 and c = 103 so the priors have little impact on the posterior.
Example (continued)
The joint density has the form: ς𝐿𝑗=𝐴 𝑓 𝑥𝑗 𝛽𝑗 , 𝜂 × 𝑓 𝛽𝑗 𝜂 × 𝑝𝑟𝑖𝑜𝑟, or
𝑓 𝜷, 𝜇, 𝜎 2 𝒙
𝑚𝑗 𝑒 𝑥𝛽𝑗 1 −
1
𝛽𝑗 −𝜇
2
= ෑ 𝑥𝑗 𝑚𝑗 1𝑒
2𝜎 2
× 𝜋 𝜇 × 𝜋 𝜎2
𝑗 1 + 𝑒 𝛽𝑗 2𝜋𝜎 2 2

where
1 1 𝜇2
−2 2
𝜋 𝜇 = 𝑒 𝑐
2𝜋𝑐 2
𝑏 𝑎 𝑏
2 2 −𝑎−1 − 2
𝜋 𝜎 = 𝜎 𝑒 𝜎
Γ 𝑎
To run a multi-stage Gibbs sampler, we need to be able to sample from the full
conditional distributions of 𝜇, 𝜎2, and 𝛽j, j = A, … , L
Exercise: Show that:
𝜇|𝜎 2 , 𝛽𝐴 , … , 𝛽𝐿 , 𝒙 ~ 𝑁(? , ? )
𝜎 2 |𝜇, 𝛽𝐴 , … , 𝛽𝐿 , 𝒙 ~ 𝐼𝐺 ? , ?
However, the conditional distribution of 𝛽j | 𝜇, 𝜎2, 𝛽-j, x is unknown. You can use a
rejection sampler to simulate from it or a Metropolis step within Gibbs.
Example: Tadpole
predation
• data(reedfrogs)

• Data on lab experiments on the density and size


dependent predation rate of an African reed frog

• Numbers of surviving tadpoles in different tanks


• Different densities
• With and without predators
• We will only run the model that focuses on
variation across tanks (like the hospital data)

You might also like