0% found this document useful (0 votes)

12 views50 pages

Bayesian Inference and Modeling Techniques

The document discusses Bayesian statistics, highlighting the concept of posterior distribution formed by combining prior information with sample data. It explains conjugate priors, particularly the Beta-Binomial model, and introduces confidence regions, also known as credible regions. Additionally, it covers methods for computing posterior distributions, including Monte Carlo simulations and the use of improper priors.

Uploaded by

tanyalim

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

12 views50 pages

Bayesian Inference and Modeling Techniques

Uploaded by

tanyalim

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

STAT 5703

Statistical Inference and Modeling

for Data Science
Dobrin Marchev

1
Recall: Bayesian Statistics
The main idea is that 𝜃 is now a random variable and the information brought by
the sample x is combined with a prior information specified by a prior distribution
𝜋(𝜃), to form the posterior distribution 𝜋 (𝜃| 𝒙) by using Bayes formula:

𝑓(𝒙 | 𝜃) 𝜋(𝜃)
𝑓(𝜃| 𝒙) = ,
𝑚(𝒙)
where where likely O is

𝑚(𝒙) = න 𝑓(𝒙| 𝜃) 𝜋(𝜃) 𝑑𝜃

is the marginal distribution of X.

It is useful to think that 𝜃 being random represents our uncertainty about its value.
Before seeing the data, our uncertainty is represented by 𝜋(𝜃) and after that by
𝑓(𝜃| 𝒙).
Conjugate Priors
Binomials normal model values of difpiors.
Posterior ends up with

The computational drawbacks used to be so severe that in the

past many researchers worked only with the so-called conjugate
priors.

Definition: A family of probability distributions ℱ = { 𝜋(𝜃) } is

conjugate for the model 𝑓(𝑥| 𝜃) if
𝑓 𝜃|𝑥 ∈ℱ

Note: The Gaussian example was conjugate, because the prior

had normal distribution, and the posterior also had a normal
distribution.
Example: Beta-Binomial Model
Let
𝑥 | 𝜃 ∼ Binom(𝑛, 𝜃)
𝜃 ∼ Beta(𝛼, 𝛽)
That is,
𝑛 𝑥
𝑓(𝑥 | 𝜃) = 𝑃(𝑋 = 𝑥 | 𝜃) = 𝜃 1 − 𝜃 𝑛−𝑥
𝑥
Γ 𝛼+𝛽
𝜋 𝜃 = 𝜃 𝛼−1 1 − 𝜃 𝛽−1
Γ 𝛼 Γ 𝛽
𝑛 Γ 𝛼+𝛽
⇒ 𝑓(𝑥, 𝜃) = 𝜃 𝛼+𝑥−1 1 − 𝜃 𝑛+𝛽−𝑥−1
𝑥 Γ 𝛼 Γ 𝛽

⇒ 𝜃 | 𝑥 ∼ Beta(𝛼 + 𝑥, 𝑛 + 𝛽 − 𝑥)

Therefore, prior is Beta distributed, and posterior is Beta distributed,

meaning the Beta distribution is a conjugate prior.

Note: See R examples.

if posterior distribution is post. density
symmetric highestthat
,

not relevant
Confidence
·

skewed better approach which horizontal

But if : have :

level results in projections where it intersect

Regions ↓ a = 2, b = 5
the curve with 95% error .

Any other interval is guaranteed

to
The Bayesian analog of CI's are wider
be
the confidence regions, also
.

2.0
called credible regions, with
highest posterior density:

1.5
f(x)
𝐶(𝒙) = { 𝜃: 𝑓(𝜃 | 𝒙) ≥ 𝑘}

1.0
where k is such that

0.5
𝑃 𝜃 ∈ 𝐶(𝒙 𝒙 = 𝛾
= 𝑃 𝑓 𝜃 𝒙 ≥ 𝑘 | 𝒙)
0.0

This leads to a different

0.0 0.2 0.4 0.6 0.8 1.0
computational problem,
x
namely solving
𝑓(𝜃 | 𝒙) = 𝑘
We will compute the intervals with
the function HPDI from rethinking.
G
HDI: Explained
skewed or

stickto this don't know where posterior

is
if
you symmetric
HDI of unimodal what if want more area
you
distribution is above ?
shortest interval why use middle # 1 96
.
-
The algorithms for
computing the HDI
of an MCMC sample
rely on a crucial
property: shortest interval for
1 645-
For a unimodal -
-
.

normal are
probability -

---------
distribution on a
single variable, the
HDI of mass M is the
narrowest possible
interval of that mass.
Improper Priors Hwkind of 92 not exam
.

-
,

Sometimes we may wish that 𝜋(𝜃) contains no information about 𝜃. For

1
example, if Θ = [𝑎, 𝑏], then 𝜋 𝜃 = 𝑏−𝑎 ∼ 𝑈(𝑎, 𝑏) is a natural choice
when Θ is bounded.
flat.
distribution)
Tempting
as
parameter (e g normal
.
. .
to use prior
f

However, if, say, Θ = ℝ, and 𝜋 (𝜃) = 𝑐, then

∞ fit uniform distribution btw a

∫ 𝜋(𝜃) 𝑑 𝜃 = න 𝑐 𝑑𝜃 = ∞ and b.

−∞
under curve.

∞
Area
Definition: When ∫−∞ 𝜋(𝜃) 𝑑𝜃 = ∞ the prior is called improper.
Flataspossible not pafs ,
-

If 𝜋(𝜃) is improper, then Bayesian inference is still possible, if 𝑓(𝜃 | 𝑥) is a

?
Computing the Posterior
[computer ; not solve integral by hand]

• Q: What do we do when the integrals are

not available in closed form?

• A: There are three major alternatives to

exact (calculus) solutions.

• Numerical integration: Riemann

integration, trapezoid rule, Gaussian
quadrature, ... However, they don't work
well in high dimensions.
• Analytical approach, like Laplace
approximation, but too complicated and
still doesn't work well, especially with
hierarchical (multistage) models.
• Monte Carlo (MC) methods: simple and
work equally well in any dimension. We
will concentrate on the third option and
study it in more details.
Non-exact Estimation
what

-gridbetween

I
• Numerical integration: Grid approximation, Riemann integration, trapezoid
rule, Gaussian quadrature, ... However, they don't work well in high
dimensions. How to deal with infinite intervals ?

Aside: How
-
to deal with integrals over infinite intervals?

∞
∫−∞ 𝑓
>
One approach is to change of variables, like this one:
𝑥 𝑑𝑥 =
1
𝑓
𝑡
∫−1 1−𝑡 2
1+𝑡 2
1−𝑡 2 2
𝑑𝑡
like o of variables -

• Analytical approach, like Laplace approximation, but too complicated and

still doesn't work well, especially with hierarchical (multistage) models.
our
focus .

⑧
• Monte Carlo (MC) methods: simple and work equally well in any dimension.
We will concentrate on this third option and study it in more details.
Monte Carlo Simulations .
:

calculate model parameter

Monte Carlo Estimation
Basic idea: If we want to find 𝐸 ℎ 𝜃 = ∫ ℎ(𝜃) 𝑓 𝜃 𝑥 𝑑𝜃 for some ℎ(⋅) and
some 𝜃 ∼ 𝑓(𝜃) (within the Bayesian framework we typically have 𝑓(𝜃) =
𝑓(𝜃 | 𝑥) and ℎ(𝜃) = 𝜃), then we can obtain an iid sample 𝜃1 , … , 𝜃𝑁 from 𝑓 𝜃

↓
and estimate E[h(𝜃)] with ?
𝑁
can
you obtain samples
෠
1 O2 depends on O
,
ℎ = ෍ ℎ(𝜃𝑖 )
𝑁 Oz -
Q2 - ·

𝑖=1 .
𝑎.𝑠.
Then by the SLLN, ℎ෠ ℎ(𝜃), as 𝑁 → ∞. That is, ℎ෠ is a
consistent estimator of E[h(𝜃)].

Of course, the above result assumes E[h(𝜃)] < ∞. If we assume that the second
moment also exists, then we also have results about the rate of convergence in
the form of the CLT.

MCMC version: If obtaining an iid sample is not an option, then we can employ
correlated and non-identically distribute sample in the form of a Markov chain,
which leads to the Markov chain Monte Carlo methods (MCMC), studied later.
-

The LLN and CLT still hold!

Recall: Normal model
• Normal model with population variance 𝜎 2 known:
𝑋1 , … , 𝑋𝑛 |𝜃 ~ 𝑁 𝜃, 𝜎 2
𝜃 ∼ 𝑁(𝜇, 𝜏 2 ) r

Can be shown that the posterior is

2
𝜃|𝑥1 , … , 𝑥𝑛 ~ 𝑁 𝜇𝑝𝑜𝑠𝑡 , 𝜎𝑝𝑜𝑠𝑡
where
𝜇 𝑥ҧ 1 𝑛
+
𝜏 2 𝜎 2 /𝑛 𝜏 2 𝜎 2
𝜇𝑝𝑜𝑠𝑡 = 𝐸 𝜃 𝒙 = = 𝜇+ 𝑥ҧ
1 𝑛 1 𝑛 1 𝑛
+ + +
𝜏2 𝜎2 𝜏2 𝜎2 𝜏2 𝜎2
2
1
𝜎𝑝𝑜𝑠𝑡 = 𝑉𝑎𝑟 𝜃 𝒙 =
1 𝑛
+
𝜏2 𝜎2
Q: What if we have to estimate 𝜎 2 along with the mean 𝜃?
Example: Normal model with 𝜎2 unknown
𝑋1 , … , 𝑋𝑛 |𝜃 ~ 𝑁 𝜃, 𝜎 2
𝜃, 𝜎 2 ∼ ? ? ?
The posterior distribution is:
2 2
𝑓 𝑥1 , … , 𝑥𝑛 𝜃, 𝜎 𝜋 𝜃, 𝜎
𝑓 𝜃, 𝜎 2 𝑥1 , … , 𝑥𝑛 =
𝑓 𝑥1 , … , 𝑥𝑛
We will find some simple conjugate prior. Specifying a joint prior
distribution on parameters with different parameter spaces is
very challenging, so usually it is done sequentially: Bivariate
can choose dif types
𝜋 𝜃, 𝜎 2 = 𝜋 𝜃 𝜎 2 × 𝜋 𝜎 2 .

Bivariate pot of Marginal & Conditional-

2
as a
𝜎2
For 𝜋 𝜃 𝜎 we can choose a normal prior with variance
𝜅0
For σ2 we need a family of prior distributions that has support
on (0, ∞). Unfortunately, the Gamma family is not conjugate for
the normal variance. However, the inverse gamma family does
turn out to be a conjugate class of densities for σ2. 12
IG(1,3)

Aside: Inverse Gamma

in stat
Distribution
use

0.15
• Let X ~ IG(𝑎, 𝑏) ↑

↓ dinvgamma(x, 1, 3)
1
• Then ~ Γ 𝑎, 𝑏

0.10
𝑋 inverse
𝑏
• 𝐸(𝑋) = , for a > 1
gamma
.

𝑎−1
𝑏2 .
• 𝑉𝑎𝑟 𝑋 =
L 𝑎−1 2 (𝑎−2)
, for a > 2 properties

0.05
𝑏
𝑏𝑎
• 𝑓 𝑥 = Γ(𝑎) 𝑥 −𝑎−1𝑒 − 𝑥 , 𝑥 > 0

• It is available in R with the

0.00

dinvgamma function from the

package invgamma. 0 5 10 15 20 25

13
Normal model with 𝜎2 unknown
Suppose our model is:
𝑋1 , … , 𝑋𝑛 |𝜃, 𝜎 2 ~ 𝑁 𝜃, 𝜎 2
2
𝜎
- get
improper𝜃|𝜎 2 ~ 𝑁 𝜇0 , ① Posterior of 0 Normal
:

smaller 𝜅0
going priors ② Posterior of 62 Gamma
keep 2 :
.

𝜈0 𝜈0 𝜎0
usesmala 𝜎 ~ 𝐼𝐺 2 , 2
2 =

&
Endupwith conjugate prior
.

The prior parameters (𝜇 E 0, 𝜅0 ) can be interpreted as the sample

mean and sample size of prior observations on the mean.
The prior parameters (𝜎02 , 𝜈0 ) can be interpreted as the sample
variance and sample size of prior observations on variance.
The posterior distribution can be similarly decomposed:
𝑓 𝜃, 𝜎 2 𝑥1 , … , 𝑥𝑛 = 𝑓 𝜃 𝜎 2 , 𝑥1 , … , 𝑥𝑛 × 𝑓 𝜎 2 𝑥1 , … , 𝑥𝑛
You can show that specify prior slope !
on

2
𝜎
𝜃|𝜎 2 , 𝑥1 , … , 𝑥𝑛 ~ 𝑁 𝜇𝑛 ,
𝜅𝑛 14
Step 1: Likelihood
Suppose that
𝑋1 , … , 𝑋𝑛 |𝜃, 𝜎 2 ~ N 𝜃, 𝜎 2
Then:
𝑛 𝑛 2
1 1 𝑥𝑖 −𝜃
2 2 −
𝑓 𝑥1 , … , 𝑥𝑛 𝜃, 𝜎 = ෑ 𝑓 𝑥𝑖 𝜃, 𝜎 =ෑ 𝑒 2 𝜎
𝑖=1 𝑖=1
2𝜋𝜎 2
2
𝑛 − 1 σ𝑛 𝑥𝑖 −𝜃
= 2𝜋𝜎 2 − 2 𝑒 2 𝑖=1 𝜎 group
terms with O and no .
0

Notice that
𝑛 2 𝑛 𝑛
2
𝑥𝑖 − 𝜃 1 𝜃 𝜃
෍ = 2 ෍ 𝑥𝑖2 − 2 2 ෍ 𝑥𝑖 + 𝑛 2
𝜎 𝜎 𝜎 𝜎
𝑖=1 𝑖=1 𝑖=1

This shows that the pair σ𝑛𝑖=1 𝑥𝑖2 , σ𝑛𝑖=1 𝑥𝑖 is a two-dimensional

sufficient statistic. 15
Aside (from a previous lecture)
Show that
𝑛 𝑛
2 2 2
෍ 𝑥𝑖 − 𝜃 = ෍ 𝑥𝑖 − 𝑥ҧ + 𝑛 𝜃 − 𝑥ҧ
𝑖=1 𝑖=1

Proof:
𝑛 𝑛
෍ 𝑥𝑖 − 𝜃 2 = ෍ 𝑥𝑖 − 𝑥ҧ + 𝑥ҧ − 𝜃 2

𝑖=1 𝑖=1
𝑛 𝑛 𝑛
= ෍ 𝑥𝑖 − 𝑥ҧ 2 + 2 ෍ 𝑥𝑖 − 𝑥ҧ 𝑥ҧ − 𝜃 + ෍ 𝑥ҧ − 𝜃 2

𝑖=1 𝑖=1 𝑖=1

𝑛 𝑛
= ෍ 𝑥𝑖 − 𝑥ҧ 2 + 2 𝑥ҧ − 𝜃 ෍ 𝑥𝑖 − 𝑥ҧ + 𝑛 𝜃 − 𝑥ҧ 2

𝑖=1 𝑖=1
𝑛
= ෍ 𝑥𝑖 − 𝑥ҧ 2 + 2 𝑥ҧ − 𝜃 × 0 + 𝑛 𝜃 − 𝑥ҧ 2

𝑖=1
16
Step 2: combine with prior
Suppose we want to find a conjugate prior distribution for 𝜋(𝜃|𝜎2). We have that
1 𝑛
− σ 𝑥 −𝜃 2
𝑓 𝜃 𝑥1 , … , 𝑥𝑛 , 𝜎2 ∝𝜋 𝜃 𝜎2 ×𝑒 2𝜎2 𝑖=1 𝑖

*outcreaboutThis sewith O
1
− 𝜃−𝑥ҧ 2
∝𝜋 𝜃 𝜎2
×𝑒 2𝜎2

This means that for 𝜋(𝜃|𝜎2) to be conjugate it must include quadratic terms like
in them
2
𝑒 𝑐1 𝜃−𝑐2 . One choice is
𝜎 2
𝜃 | 𝜎2 ~𝑁 𝜇0 ,
𝜅0 /Da=
Then using the previous result:
𝜋 𝜃 𝜎 2 × 𝑓 𝑥1 , … , 𝑥𝑛 𝜃, 𝜎 2
𝑓 𝜃 𝑥1 , … , 𝑥𝑛 , 𝜎 2 =
𝑓 𝑥1 , … , 𝑥𝑛 𝜎 2
𝜅 𝑛
− 02 𝜃−𝜇0 2 − 2 𝜃−𝑥ҧ 2
∝ 𝑒 2𝜎 × 𝑒 2𝜎
- Similar structures.
Multiplied by dif quantities -

17
Step 2: combine with prior
𝜋 𝜃 𝜎 2 × 𝑓 𝑥1 , … , 𝑥𝑛 𝜃, 𝜎 2
𝑓 𝜃 𝑥1 , … , 𝑥𝑛 , 𝜎 2 =
𝑓 𝑥1 , … , 𝑥𝑛 𝜎 2
𝜅 1 1
− 02 𝜃−𝜇0 2 − 2 𝜃−𝑥ҧ 2 − 2 𝜅0 𝜃−𝜇0 2 +𝑛 𝜃−𝑥ҧ 2
∝ 𝑒 2𝜎 × 𝑒 2𝜎 = 𝑒 2𝜎

To finish we need to rewrite as a complete square in 𝜃:

𝜅0 𝜃 − 𝜇0 2 + 𝑛 𝜃 − 𝑥ҧ 2
= 𝜅0 𝜃 2 − 2𝜅0 𝜇0 𝜃 + 𝜅0 𝜇02 + 𝑛𝜃 2 − 2𝑛𝑥𝜃ҧ + 𝑛𝑥ҧ 2 = ⋯

I
This shows that (after completing the square as an exercise):
−
𝜅𝑛
𝜃−𝜇𝑛 2 Normaldensity property
.

𝑓 𝜃 𝑥1 , … , 𝑥𝑛 , 𝜎 2 ∝ 𝑒 2𝜎2 >

which proves that -

↓

Jean 𝜎 2
𝜃|𝑥1 , … , 𝑥𝑛 , 𝜎 2 ~ 𝑁 𝜇𝑛 ,
𝜅𝑛
𝜅0 𝜇0 +𝑛𝑥ҧ
where 𝜅𝑛 = 𝜅0 + 𝑛, 𝜇𝑛 = 𝜅𝑛

Not exam &2 18

Posterior Distribution of 𝜎2 | x1, …, xn
The posterior distribution of σ2 can be obtained by performing an
integration over the unknown value of θ.
𝑓 𝜎 2 𝑥1 , … , 𝑥𝑛 ∝ 𝜋 𝜎 2 𝑓 𝑥1 , … , 𝑥𝑛 𝜎 2
= 𝜋 𝜎 2 න 𝑓 𝑥1 , … , 𝑥𝑛 𝜃, 𝜎 2 𝜋 𝜃 𝜎 2 𝑑𝜃
𝜈0
𝜈0 𝜎02 2 𝜈0 𝜎02
2 𝜈
− 20 −1 − 𝜎22 −
𝑛 − 1 𝜃−𝑥ҧ 2
= 𝜎 2 𝑒 න 𝜋 𝜃 𝜎2 𝜎 2 2 𝑒 2𝜎2 𝑑𝜃
𝜈
Γ 20
= … (exercise)
We conclude that
2
𝜈𝑛 𝜈𝑛 𝜎𝑛
𝜎 2 |𝑥1 , … 𝑥𝑛 ~ 𝐼𝐺 ,
2 2
1 𝜅0 𝑛
where 𝜈𝑛 = 𝜈0 + 𝑛 and 𝜎𝑛2 = 𝜈 𝜈0 𝜎02 + σ𝑛𝑖=1 𝑥𝑖 − 𝑥ҧ 2 + 𝜅𝑛
𝑥ҧ − 𝜇0 2
𝑛
Monte Carlo Sampling
What if wanted to calculate E(𝜃|x1, … , xn)?
The only thing we know is that
2
𝜎
𝜃|𝜎 2 , 𝑥1 , … , 𝑥𝑛 ~ 𝑁 𝜇𝑛 ,
𝜅𝑛
but we don’t know the marginal posterior distribution of 𝜃 (without
conditioning on 𝜎 2 ).
Instead of trying to derive it with calculus, we can resort to the
Monte Carlo method to generate:
2 2(1)
𝜈𝑛 𝜈𝑛 𝜎𝑛 𝜎
𝜎 2(1) |𝑥1 , … 𝑥𝑛 ~ 𝐼𝐺 , , 𝑡ℎ𝑒𝑛 𝜃 1 ~𝑁 𝜇𝑛 ,
2 2 𝜅𝑛
…
2
𝜈𝑛 𝜈𝑛 𝜎𝑛 𝜎 2(𝑆)
𝜎 2(𝑆) |𝑥1 , … 𝑥𝑛 ~ 𝐼𝐺 , , 𝑡ℎ𝑒𝑛 𝜃 𝑆
~𝑁 𝜇𝑛 ,
2 2 𝜅𝑛
Recap
Suppose our model for the data is:
𝑋1 , … , 𝑋𝑛 |𝜃, 𝜎 2 ~ 𝑁 𝜃, 𝜎 2
Assume the following priors:
𝜎 2
𝜃|𝜎 2 ~ 𝑁 𝜇0 ,
𝜅0
2
𝜈0 𝜈0 𝜎0
𝜎 2 ~ 𝐼𝐺 ,
2 2
Then we showed that
2
𝜎2
𝜃|𝜎 , 𝑥1 , … , 𝑥𝑛 ~ 𝑁 𝜇𝑛 ,
𝜅𝑛
𝜅0 𝜇0 +𝑛𝑥ҧ
where 𝜅𝑛 = 𝜅0 + 𝑛, 𝜇𝑛 = 𝜅𝑛
and

𝜈𝑛 𝜈𝑛 𝜎𝑛2
𝜎 2 |𝑥1 , … , 𝑥𝑛 ~ 𝐼𝐺 ,
2 2
1 𝜅0 𝑛
where 𝜈𝑛 = 𝜈0 + 𝑛 and 𝜎𝑛2 = 𝜈 𝜈0 𝜎02 + σ𝑛𝑖=1 𝑥𝑖 − 𝑥ҧ 2 + 𝜅𝑛
𝑥ҧ − 𝜇0 2
𝑛
21
Time to pause and think
chain .
• How did we obtain these results? used Markov
• Deriving that
𝜎 2
𝜃|𝑥1 , … , 𝑥𝑛 , 𝜎 2 ~ 𝑁 𝜇𝑛 ,
𝜅𝑛
required a lot of algebra and completion of the square technique.

• Deriving
𝜈 𝜈 𝜎 2
𝑛 𝑛 𝑛
𝜎 2 |𝑥1 , … , 𝑥𝑛 ~ 𝐼𝐺 ,
2 2
required the algebra, completion of square, followed by integration!

• Do we want to ever derive it again?

• Answer: No!
• We could have sampled from 𝜎 2 |𝜃, 𝑥1 , … , 𝑥𝑛 instead of 𝜎 2 |𝑥1 , … , 𝑥𝑛 and
thus avoiding the integration part (algebra remains though).
• What?! Is this legit?? Do we know its distribution?
• Can we, please, also avoid the algebra part?
• Yes!
MCMC Methods

• Metropolis ,1953: Granddaddy

of them all
• Metropolis-Hastings (MH),
1970: More general
• Gibbs sampling (GS), 1984:
Efficient version of MH
• Hamiltonian Monte Carlo
(HMC), 1996
• Newer 21st century algorithms
much better
• New methods being
developed
Metropolis and MCMC

54135
How does it work?
Contract: King Markov must visit each island
in proportion to its population size.

Imagine the king is too lazy to keep track of schedules and wants a simple algorithm
using only the current location and nearby islands’ population sizes. Here is how he
does it:
Propose a move
Compare to current state
Decide whether to jump
Repeat until “convergence”

Yoly)
=
GEO
# -

computetogether ;

chance of moving / know the ratio of

Histogram Likelihood of LBR .

both
Height :
.

TLDR Converge to sample from Posterior

.
:
How does this relate
to Bayesian analysis?
• Usual use is to draw samples from a
posterior distribution
• “Islands”: parameter values
• “Population size”: proportional to
posterior probability
• Works for any number of dimensions
(parameters)
• Works for continuous as well as
discrete parameters
• Initially implemented on the MANIAC
(Mathematical Analyzer, Numerical
Integrator, and Computer) developed
by John von Neumann.
Sitting is Nicholas Metropolis, the
grandfather of the MCMC.
MANIAC

• Stanislaw Ulam invented the

Monte Carlo method by thinking
about predicting solitaire
outcome.
[Link]
m75a/how-solitaire-inspired-the-
worlds-most-useful-simulation-tool
Metropolis-Hastings Algorithm
• Very general method for constructing a Markov chain with a specific
invariant density f(), which is the target for the simulation.
• Suppose that q(y|x) is any conditional density, called proposal density.
• Algorithm:
o Step 0: Start at any X = x0
o Step 1: Generate Yi ~ q(y|xi-1)
𝑌 with probability 𝛼(𝑥𝑖−1 , 𝑦𝑖 )
o Step 2: Set 𝑋𝑖 = ቊ 𝑖
𝑥𝑖−1 , o/w
𝑓 𝑦 𝑞 𝑥𝑦
where 𝛼 𝑥, 𝑦 = min 𝑓 𝑥 𝑞 𝑦 𝑥 , 1 is called the acceptance probability
o Go to Step 1 and repeat
Notes:
o When q(y|x) is symmetric, the algorithm is known as Metropolis.
o When q(y|x) does not depend on x, the algorithm is known as
Independent Metropolis-Hastings.
o When q(y|x) = g(y – x), the algorithm is known as Random Walk
Metropolis-Hastings (see next slide)
Metropolis-Hastings Algorithm

• When q(y|x) is symmetric, the

algorithm is known as
Metropolis.
• When q(y|x) does not depend on
x, the algorithm is known as
Independent Metropolis-
Hastings.
• When q(y|x) = g(y – x), the
algorithm is known as Random
Walk Metropolis-Hastings and
usually has a tuning parameters,
controlling the variance of q(y|x)
• The choice of that tuning
parameters results in very
different behavior (represented
with red circle on the graph)
Gibbs Sampler

• The Gibbs’ sampler is an MCMC method to efficiently

generate a draw from any multivariate distribution
f(x1, …, xp)
(Think of this as p-dimensional posterior.)
• It is used when draws from the joint distribution are hard
to compute directly.
• Draws from the conditional distributions are easy to obtain
f(xj | x1, …, xj-1, xj+1, ... , xp)

• Start with some initial guess values x1(0), …, xp(0)

• Iterate many times to obtain a Markov chain that
converges to the target distribution f(x1, …, xp).
Iteration Steps
• Given the current values x1(t), …, xp(t), obtain new
values:

• x1(t+1) ~ f(x1 | x2(t), x3(t), …, xp(t))

• x2(t+1) ~ f(x2 | x1(t+1), x3(t), …, xp(t))
• x3(t+1) ~ f(x3 | x1(t+1), x2(t+1), …, xp(t))
…
• xp(t+1) ~ f(xp | x1(t+1), x2(t+1), …, xp-1(t+1))

• Under mild regularity conditions it can be shown

that (x1(t), x2(t), x3(t), …, xp(t)) converges to a draw
from the target density f.
Gibbs sampler example: Bivariate Normal
Hamiltonian
Monte Carlo

• Problem with Gibbs sampling (GS)

• Models with many parameters
usually have lots of highly correlated
parameters
• GS gets stuck, degenerates
towards random walk
• Hamiltonian dynamics to the
rescue
• Implement with ulam function
from rethinking package.
Back to the School Example
Suppose now that the data are on the amount of time (in hours)
students from 3 different high schools spent on studying or
homework for a particular subject during an exam one-week
period.

Q: How do we estimate the averages in each group? Should the

schools be treated as independent (like ANOVA) or identical (one
large sample disregarding the school info)?
Hierarchical Data Structure
• Suppose we have m groups and data (X1, … , Xm) from each
of them, where each 𝑿𝑗 = 𝑋1𝑗 , … , 𝑋𝑛𝑗 𝑗 .
• The random variables 𝑋1𝑗 , … , 𝑋𝑛𝑗 𝑗 should not be
independent, because they all belong to the same cluster.
• Students within each school perform similarly, because
they have the same professors, use same textbooks, etc.
• If I know Timmy’s reading score and know that Tommy is in
the same school as Timmy, I already know a little bit about
Tommy’s score too
• The unique contribution of each student is reduced
because some information is shared by knowing which
school they are in.
Hierarchical Data Structure

• Why do we have to account for clustering?

• The total sample size appears in the
denominator of formulas for calculating
precision in single-level models
• Using the total sample size assumes that Timmy Tommy
each individual contributes unique
information
Timmy Tommy
• This is not the case with clustering, so the
denominator will be too large.
• In English – ignoring the clustering makes
the model appear more precise than it is School
Hierarchical Data Structure
Consider the normal model for grouped data:

𝑋1𝑗 , … , 𝑋𝑛𝑗 𝑗 |𝜃𝑗 , 𝜎 2 ~ 𝑁 𝜃𝑗 , 𝜎 2 , 𝑗 = 1, … , 𝑚

𝜃𝑗 |𝜇, 𝜏 2 ~ 𝑁 𝜇, 𝜏 2
𝜎 2 ~ IG
𝜇 ~𝑁 𝜇0 , 𝛾02
𝜏2~ ?
Key question: How should we represent our information about
the group mean parameters 𝜃1 , … , 𝜃𝑚 ?
What should be the value of 𝜏 2 which represents our knowledge
about how similar are the group means are?

𝜏2 = ?
Recall: Hierarchical Data Structure
Consider the normal model:
𝑋1𝑗 , … , 𝑋𝑛𝑗 𝑗 |𝜃𝑗 , 𝜎 2 ~ 𝑁 𝜃𝑗 , 𝜎 2 , 𝑗 = 1, … , 𝑚
𝜃𝑗 |𝜇, 𝜏 2 ~ 𝑁 𝜇, 𝜏 2
𝜎 2 ~ IG
𝜇 ~𝑁 𝜇0 , 𝛾02
𝜏2~ ?
What should be the value of 𝜏 2 which represents knowledge
about how similar are the group means?

0 𝜏2 = ?

All clusters are the same

Aka “complete pooling”
Recall: Hierarchical Data Structure
Consider the normal model:
𝑋1𝑗 , … , 𝑋𝑛𝑗 𝑗 |𝜃𝑗 , 𝜎 2 ~ 𝑁 𝜃𝑗 , 𝜎 2 , 𝑗 = 1, … , 𝑚
𝜃𝑗 |𝜇, 𝜏 2 ~ 𝑁 𝜇, 𝜏 2
𝜎 2 ~ IG
𝜇 ~𝑁 𝜇0 , 𝛾02
𝜏2~ ?
What should be the value of 𝜏 2 which represents knowledge
about how similar are the group means?

0 𝜏2 = ? ∞

All clusters are the same Clusters are unrelated

Aka “complete pooling” Aka “no pooling”
Recall: Hierarchical Data Structure
Consider the normal model:
𝑋1𝑗 , … , 𝑋𝑛𝑗 𝑗 |𝜃𝑗 , 𝜎 2 ~ 𝑁 𝜃𝑗 , 𝜎 2 , 𝑗 = 1, … , 𝑚
𝜃𝑗 |𝜇, 𝜏 2 ~ 𝑁 𝜇, 𝜏 2
𝜎 2 ~ IG
𝜇 ~𝑁 𝜇0 , 𝛾02
𝜏2~ 𝜋 𝜏2
What should be the value of 𝜏 2 which represents knowledge
about how similar are the group means?
A: Use a prior and learn 𝜏 2 from the data!

0
𝜏 2~ 𝜋 𝜏 2 ∞
Clusters are unrelated
All clusters are the same
Aka “no pooling”
Aka “complete pooling”
Multistage Models
The basic Bayesian model has two stages:
Stage 1: 𝑓(𝑥|𝜃, 𝜂)
Stage 2: 𝑓(𝜃|𝜂)
where 𝜂 is a vector of hyperparameters. For example, in the Beta-Binomial
model 𝜂 = 𝛼, 𝛽 .

If we are not sure about the specific value of the hyperparameter 𝜂, then we
can have a third stage by placing a hyperprior on it:
Stage 3: 𝜂 ∼ 𝑔 𝜂 , 𝜂 ∈ 𝑌
Then the posterior is calculated as:
𝑓(𝑥, 𝜃) ∫ 𝑓(𝑥, 𝜃, 𝜂) 𝑑𝜂
𝑓(𝜃 | 𝑥) = =
𝑚(𝑥) ‫𝑥(𝑓 ׭‬, 𝜃, 𝜂) 𝑑𝜂 𝑑𝜃
∫ 𝑓(𝑥 | 𝜃, 𝜂) 𝜋 (θ | 𝜂) 𝑔(𝜂) 𝑑𝜂
=
‫𝜃 | 𝑥(𝑓 ׭‬, 𝜂) 𝜋(𝜃 | 𝜂) 𝑔(𝜂) 𝑑𝜂 𝑑𝜃
These type of models usually result in intractable posteriors and need
approximate solutions.
Hierarchical Models: Why?
Consider the school dataexample again. How would you model the data without
hierarchy?

At one extreme, we could consider that each school has their own average rate
and estimate the means individually per each school. This is known as the no
pooling of data approach.
Advantage: Unbiased individual estimates.
Disadvantage: Low accuracy (high variance).

At the other extreme, we can consider all schools similar enough and just
estimate a single pooled estimate (aka complete pooling of data).

Advantage: High accuracy. Disadvantage: No individual estimates (bias).

Hierarchical estimates are a compromise between these two extremes. One lets
the data decide on the degree of pooling (judged by the Stage 2 estimated
variance).
Example: Multistage Logistic Regression
Goal: estimate death rates 𝜃j, j = A, … , L after undergoing cardiac surgery in 12
hospitals A, … , L. Data are mortality counts xj out of mj surgeries, j = A, … , L.

Stage 0: 𝑓 𝑥𝑗 𝜃𝑗 , 𝜂 ~ Binomial 𝑚𝑗 , 𝜃𝑗 , 𝑗 = 𝐴, … , 𝐿
Stage 1: 𝜃𝐴 , … , 𝜃𝐿 |𝜂 ~ 𝑓 𝜃 𝜂
Stage 2: 𝜂 ∼ 𝜋 𝜂 , 𝜂 ∈ 𝑌

𝜃𝑗
For Stage 1, the model is reparametrized with 𝛽𝑗 = log 1−𝜃 and assume the
𝑗
betas are independent with
𝛽𝑗 |𝜇, 𝜎 2 ~ 𝑁 𝜇, 𝜎 2
For Stage 2, assume 𝜇 and 𝜎2 are independent with
𝜇 ~ 𝑁 0, 𝑐 2
𝜎 2 ~ IG(𝑎, 𝑏)
where IG is the inverse gamma prior distribution.
Let a = b = 0.001 and c = 103 so the priors have little impact on the posterior.
Example (continued)
The joint density has the form: ς𝐿𝑗=𝐴 𝑓 𝑥𝑗 𝛽𝑗 , 𝜂 × 𝑓 𝛽𝑗 𝜂 × 𝑝𝑟𝑖𝑜𝑟, or
𝑓 𝜷, 𝜇, 𝜎 2 𝒙
𝑚𝑗 𝑒 𝑥𝛽𝑗 1 −
1
𝛽𝑗 −𝜇
2
= ෑ 𝑥𝑗 𝑚𝑗 1𝑒
2𝜎 2
× 𝜋 𝜇 × 𝜋 𝜎2
𝑗 1 + 𝑒 𝛽𝑗 2𝜋𝜎 2 2

where
1 1 𝜇2
−2 2
𝜋 𝜇 = 𝑒 𝑐
2𝜋𝑐 2
𝑏 𝑎 𝑏
2 2 −𝑎−1 − 2
𝜋 𝜎 = 𝜎 𝑒 𝜎
Γ 𝑎
To run a multi-stage Gibbs sampler, we need to be able to sample from the full
conditional distributions of 𝜇, 𝜎2, and 𝛽j, j = A, … , L
Exercise: Show that:
𝜇|𝜎 2 , 𝛽𝐴 , … , 𝛽𝐿 , 𝒙 ~ 𝑁(? , ? )
𝜎 2 |𝜇, 𝛽𝐴 , … , 𝛽𝐿 , 𝒙 ~ 𝐼𝐺 ? , ?
However, the conditional distribution of 𝛽j | 𝜇, 𝜎2, 𝛽-j, x is unknown. You can use a
rejection sampler to simulate from it or a Metropolis step within Gibbs.
Example: Tadpole
predation
• data(reedfrogs)

• Data on lab experiments on the density and size

dependent predation rate of an African reed frog

• Numbers of surviving tadpoles in different tanks

• Different densities
• With and without predators
• We will only run the model that focuses on
variation across tanks (like the hospital data)

Introduction to Bayesian Inference
No ratings yet
Introduction to Bayesian Inference
34 pages
Bayesian Inference and Modeling Techniques
No ratings yet
Bayesian Inference and Modeling Techniques
39 pages
Bayesian Inference and Updating Explained
No ratings yet
Bayesian Inference and Updating Explained
6 pages
Conjugate Priors for Normal Distribution
No ratings yet
Conjugate Priors for Normal Distribution
6 pages
Bayesian Inference for Gaussian Models
No ratings yet
Bayesian Inference for Gaussian Models
11 pages
Inference Problems in Statistics
No ratings yet
Inference Problems in Statistics
15 pages
Introduction to Bayesian Methods
No ratings yet
Introduction to Bayesian Methods
53 pages
Bayesian Estimation Fundamentals
No ratings yet
Bayesian Estimation Fundamentals
6 pages
Bayesian Priors: Gamma and Beta Distributions
No ratings yet
Bayesian Priors: Gamma and Beta Distributions
19 pages
Posterior Distributions in Bayesian Inference
No ratings yet
Posterior Distributions in Bayesian Inference
7 pages
Bayesian Time Series Econometrics Guide
No ratings yet
Bayesian Time Series Econometrics Guide
72 pages
Understanding Bayesian Inference
No ratings yet
Understanding Bayesian Inference
7 pages
Understanding Bayesian Priors in Statistics
No ratings yet
Understanding Bayesian Priors in Statistics
51 pages
Bayesian Analysis in Empirical Economics
No ratings yet
Bayesian Analysis in Empirical Economics
18 pages
Bayesian Estimation Overview
No ratings yet
Bayesian Estimation Overview
4 pages
Bayesian Priors: Hyperparameters Explained
No ratings yet
Bayesian Priors: Hyperparameters Explained
16 pages
Bayesian Priors and Hyperparameters Explained
No ratings yet
Bayesian Priors and Hyperparameters Explained
13 pages
Probability Distributions Overview
No ratings yet
Probability Distributions Overview
86 pages
Introduction to Maximum Likelihood Estimation
No ratings yet
Introduction to Maximum Likelihood Estimation
31 pages
5.1-5.5 Bayesian Computation
No ratings yet
5.1-5.5 Bayesian Computation
14 pages
5.1-5.5 Bayesian Computation
No ratings yet
5.1-5.5 Bayesian Computation
14 pages
Bayesian Inference for Gaussian Models
No ratings yet
Bayesian Inference for Gaussian Models
15 pages
General Insurance Credibility Techniques
No ratings yet
General Insurance Credibility Techniques
8 pages
Topic6-2-Method For Posterior Inference
No ratings yet
Topic6-2-Method For Posterior Inference
46 pages
Bayesian Statistics: Laplace & Monte Carlo Techniques
No ratings yet
Bayesian Statistics: Laplace & Monte Carlo Techniques
64 pages
Conjugate Priors for Poisson Distributions
No ratings yet
Conjugate Priors for Poisson Distributions
5 pages
Conjugate Families of Distributions in Bayesian Analysis
No ratings yet
Conjugate Families of Distributions in Bayesian Analysis
67 pages
Bayesian Linear Regression Insights
No ratings yet
Bayesian Linear Regression Insights
9 pages
Conjugate Priors in Bayesian Statistics
No ratings yet
Conjugate Priors in Bayesian Statistics
11 pages
Ba Yes I An Regular Ized Regression
No ratings yet
Ba Yes I An Regular Ized Regression
32 pages
MCMC and Bayesian Modeling Overview
No ratings yet
MCMC and Bayesian Modeling Overview
27 pages
Chapter 2: Weakly Informative Priors
No ratings yet
Chapter 2: Weakly Informative Priors
36 pages
Lecture 2
No ratings yet
Lecture 2
35 pages
Introduction to Bayesian Statistics
No ratings yet
Introduction to Bayesian Statistics
6 pages
Bayes Theorem: Pros and Cons Explained
No ratings yet
Bayes Theorem: Pros and Cons Explained
33 pages
Bayesian Models in Machine Learning
No ratings yet
Bayesian Models in Machine Learning
126 pages
Cramer-Rao Bound and Estimators in Statistics
No ratings yet
Cramer-Rao Bound and Estimators in Statistics
3 pages
Conjugate Priors in Exponential Families
No ratings yet
Conjugate Priors in Exponential Families
6 pages
Bayesian Analysis in Econometrics
No ratings yet
Bayesian Analysis in Econometrics
37 pages
Introduction to Bayesian Statistics
No ratings yet
Introduction to Bayesian Statistics
16 pages
Bayesian Inference Course Overview
No ratings yet
Bayesian Inference Course Overview
73 pages
BS Lec05
No ratings yet
BS Lec05
41 pages
Bayesian and Classical Estimations of GE Distribution
No ratings yet
Bayesian and Classical Estimations of GE Distribution
15 pages
Conjugate, Unt 2 Nyc
No ratings yet
Conjugate, Unt 2 Nyc
57 pages
Stat 111: Bayesian Inference Overview
No ratings yet
Stat 111: Bayesian Inference Overview
7 pages
Gaussian Distribution Cheat Sheet
100% (1)
Gaussian Distribution Cheat Sheet
8 pages
Statistical Inference and Bayesian Models
No ratings yet
Statistical Inference and Bayesian Models
5 pages
Understanding Beta Distributions in Bayesian Analysis
No ratings yet
Understanding Beta Distributions in Bayesian Analysis
4 pages
Generalized Likelihood Ratio Test Explained
No ratings yet
Generalized Likelihood Ratio Test Explained
5 pages
Bayesian Inference and Monte Carlo Methods
No ratings yet
Bayesian Inference and Monte Carlo Methods
39 pages
Bayesian Estimation and Analysis
No ratings yet
Bayesian Estimation and Analysis
3 pages
Generalized Linear Proportional Hazards Model Analysis
No ratings yet
Generalized Linear Proportional Hazards Model Analysis
9 pages
Bayesian Inference Lecture Notes
No ratings yet
Bayesian Inference Lecture Notes
11 pages
Bayesian Estimation Fundamentals
No ratings yet
Bayesian Estimation Fundamentals
8 pages
Parametric Models: MLE & Bayesian Estimation
No ratings yet
Parametric Models: MLE & Bayesian Estimation
33 pages
Applied Probability and Statistics Examples
No ratings yet
Applied Probability and Statistics Examples
13 pages
Fundamentals of Statistics Review Notes
No ratings yet
Fundamentals of Statistics Review Notes
20 pages
Advanced Statistical Modelling Strategies
No ratings yet
Advanced Statistical Modelling Strategies
72 pages
Ancient Wonders of the World Explained
No ratings yet
Ancient Wonders of the World Explained
1 page
An Inter-Comparison of Inverse Models For Estimati
No ratings yet
An Inter-Comparison of Inverse Models For Estimati
32 pages
Software Test Estimation Techniques
No ratings yet
Software Test Estimation Techniques
5 pages
Introduction to Computer Programming
No ratings yet
Introduction to Computer Programming
3 pages
Python Basics for Class 11 Students
No ratings yet
Python Basics for Class 11 Students
51 pages
Debt Elimination Framework for Entrepreneurs
No ratings yet
Debt Elimination Framework for Entrepreneurs
8 pages
Binance Trading Strategies Explained
No ratings yet
Binance Trading Strategies Explained
4 pages
Brock's Biology of Microorganisms Review
No ratings yet
Brock's Biology of Microorganisms Review
2 pages
Understanding "The Road Not Taken"
No ratings yet
Understanding "The Road Not Taken"
37 pages
SAIHK NANN Account Transactions Summary
No ratings yet
SAIHK NANN Account Transactions Summary
4 pages
PV Newsletter - Volume 2012 Issue 1
No ratings yet
PV Newsletter - Volume 2012 Issue 1
7 pages
2021 Jeep Wrangler 46
100% (1)
2021 Jeep Wrangler 46
464 pages
Pythagorean Adventure Escape Challenge
No ratings yet
Pythagorean Adventure Escape Challenge
1 page
Chicuron: Unique Chicken Turon Concept
No ratings yet
Chicuron: Unique Chicken Turon Concept
13 pages
Probability Practice Paper Level-II
No ratings yet
Probability Practice Paper Level-II
2 pages
Understanding Creation and Plasma Technology
No ratings yet
Understanding Creation and Plasma Technology
33 pages
Binary Phase Diagrams and Solubility Limits
No ratings yet
Binary Phase Diagrams and Solubility Limits
21 pages
Trends in India's Capital Expenditure
No ratings yet
Trends in India's Capital Expenditure
2 pages
Grouting Inspection and Test Plan
No ratings yet
Grouting Inspection and Test Plan
1 page
Dynamics of People in Organizations
50% (4)
Dynamics of People in Organizations
29 pages
Yalehandy Ratchet Lever Hoist Specs
No ratings yet
Yalehandy Ratchet Lever Hoist Specs
1 page
Reinforced Concrete Deep Beams With Web Openings
No ratings yet
Reinforced Concrete Deep Beams With Web Openings
269 pages
Medical Microbiology MCQs Exam Guide
No ratings yet
Medical Microbiology MCQs Exam Guide
8 pages
Organic Syntheses Collective Volume 6
100% (4)
Organic Syntheses Collective Volume 6
1,299 pages
Assessment of Mixed Pain Types
No ratings yet
Assessment of Mixed Pain Types
25 pages
Certificate of Analysis for XGS-0709 Ink
No ratings yet
Certificate of Analysis for XGS-0709 Ink
1 page
Tender for Real-time PCR Cycler
No ratings yet
Tender for Real-time PCR Cycler
66 pages
Coordination Compounds Overview for Class 12
No ratings yet
Coordination Compounds Overview for Class 12
2 pages
Data Center Earthing Best Practices
No ratings yet
Data Center Earthing Best Practices
6 pages
Government Accounting Manual Overview
No ratings yet
Government Accounting Manual Overview
3 pages