Bilkent University Spring 2022
MATH 255 - Probability and Statistics
Final Exam Solutions
Problem 1) Let
( (
θe−θx if x ≥ 0 αe−αθ if θ ≥ 0
fX|Θ (x|θ) = and fΘ (θ) =
0 if x < 0 0 if θ < 0
Find the MAP and LMS estimates of θ for a single observation X = x.
R∞ R∞
Hint: For exponential random variables, we have 0 λe−λx dx = 1 and 0 x2 λe−λx dx = 2
λ2
.
We can first find the posterior distribution (2pt):
fX|Θ (x|θ)fΘ (θ)
fΘ|X (θ|x) =
fX (x)
αθe−(α+x)θ
= R∞ −(α+x)θ dθ
0 αθe
αθe−(α+x)θ
= α
(α+x)2
= (α + x)2 θe−(α+x)θ .
Then, the MAP estimate is the peak of the posterior (1pt):
θbMAP (x) = arg max{log fX|Θ (x|θ) + log fΘ (θ)}
θ
= arg max{−θx + log θ − αθ + log α}.
θ
1
Taking the derivative and setting it to zero yields that −x + θ − α = 0. This implies that (1pt)
1
θbMAP (x) = .
α+x
On the other hand, the LMS estimate is the conditional expectation of the posterior (1pt):
θbLMS (x) = EΘ|X [Θ|X = x]
Z ∞
= θfΘ|X (θ|x)dθ
Z0 ∞
= θ(α + x)2 θe−(α+x)θ dθ
0
Z ∞
= (α + x) θ2 · (α + x)e−(α+x)θ dθ
|0 {z }
2nd moment of exponential distribution
2
= (α + x) · .
(α + x)2
1
Therefore, we have (1pt)
2
θbLMS (x) = .
α+x
Problem 2) Erdös-Rényi graph is one of the simplest models for social networks. In this
graph, vertices and edges, respectively, correspond to users and the (undirected) connectivity
between them. Specifically, each edge has a fixed probability of being present or absent,
independently of the other edges. We can represent a graph through an adjacency matrix X
with binary entries. The ith row and the jth column entry, denoted by Xij , indicates an edge
connecting nodes i and j. The following figure illustrates an example of Erdös-Rényi graph
and the associated adjacency matrix.
User 1
User 1 User 2 User 3 User 4
User 1 0 1 1 0
User 2
User 2 1 0 1 0
User 4
User 3 1 1 0 1
User 4 0 0 1 0
User 3
Since the presence and absence of an edge is binary and random, we can model Xij for
each (i, j) and i 6= j as a Bernoulli random variable
Xij ∼ Ber(p),
with a success probability p. In other words, for i 6= j, Xij = 1 with probability p and Xij = 0
with probability 1 − p. Furthermore, Xii = 0 for all i.
Suppose that we are given a realization of the adjacency matrix for an Erdös-Rényi graph
with n vertices.
Hint: Each edge can be present or absent, independently of the other edges, but Xij = Xji for all i, j.
The problem is not much different from an i.i.d. sequence of Bernoulli random variables!
(a) (4pt) Find the log-likelihood function of p.
Notice that Xij = Xji for all i, j and Xii = 0 for all i. Hence, we can focus on the triangle
Pn−1 Pn
∆ := {(i, j) : i ∈ [1, n − 1] and j ∈ [i + 1, n]}. Given the realization, let cx := i=1 j=i+1 xi,j
denote the number of successes in the indices in ∆ since xij = 1 if it succeeds. Correspondingly,
Pn−1 Pn
dx := i=1 j=i+1 (1 − xi,j ) denotes the number of failures in the indices in ∆ since xij = 0, and
therefore, (1 − xij ) = 1 if it fails. Then, the likelihood function of p is given by (2pt)
fX (x; p) = pcx (1 − p)dx .
2
Hence, the log-likelihood function is given by (2pt)
log fX (x; p) = cx log p + dx log(1 − p)
n−1
X X n n−1
X X n
= xi,j log p + (1 − xi,j ) log(1 − p) .
i=1 j=i+1 i=1 j=i+1
(b) (4pt) Find the ML estimate of p.
The ML estimate is given by (1pt)
pbML = arg max log fX (x; p)
p
= arg max{cx log p + dx log(1 − p)}.
p
Taking the derivative, we obtain (1pt)
d d
fX (x; p) = (cx log p + dx log(1 − p))
dp dp
1 1
= cx − dx .
p 1−p
cx dx
Setting it to zero yields that p = 1−p . This implies that (2pt)
n−1 n
cx 2 X X
pbML = = 2 xi,j
cx + dx n −n
i=1 j=i+1
since cx + dx = (n2 − n)/2, which is the number of elements in the triangle ∆.
Problem 3) A sample x of a random variable X is observed. There are the following two
hypotheses regarding the PDF of X:
(
1 if 0 ≤ x ≤ 1
H0 : fX (x; H0 ) =
0 otherwise,
(
2x if 0 ≤ x ≤ 1
H1 : fX (x; H1 ) =
0 otherwise.
Consider using a likelihood ratio test (LRT)
fX (x; H1 )
L(x) = ≥γ
fX (x; H0 )
for choosing between the two hypotheses.
(a) (4pt) Determine the parameter γ so that the type-I error probability α is 0.2. Compute the
resulting type-II error probability β for this value of γ.
The LRT is given by
fX (x; H1 )
L(x) = = 2x ≥ γ
fX (x; H0 )
3
So, we have (1pt)
H x ≥ γ/2,
1
Θ̂(x) =
H x < γ/2
0
Type-I error probability is (1pt)
Z 1
1, γ < 0,
α = P (X ≥ γ/2; H0 ) = fX (x; H0 )dx = (1 − γ/2), 0 ≤ γ ≤ 2,
γ/2
0,
γ>2
For α = 0.2, we should choose 1 − γ/2 = 0.2 or γ = 1.6 (0.5pt). When γ = 1.6, we have (1pt)
Z 0.8
β = P (X < 1.6/2; H1 ) = 2x dx = x2 |00.8 ,
0
and β = 0.64 (0.5pt).
(b) (4pt) Generalize the result of part (a) and determine β as a function of α. Make sure to
specify β explicitly as a function of α for all 0 ≤ α ≤ 1. Make a neat sketch of β vs. α. Label
the axes carefully.
In general, we have
1, γ < 0,
α = (1 − γ/2), 0 ≤ γ ≤ 2,
0,
γ>2
and
Z γ/2
0, γ < 0,
β = P (X < γ/2; H1 ) = 2x dx = (γ/2)2 , 0 ≤ γ ≤ 2,
0
1,
γ>2
All possible pairs of (α, β) are generated by threshold values in the range 0 ≤ γ ≤ 2. Over this
range, we have the parametric relations (2pt)
α = 1 − γ/2, β = (γ/2)2 .
Since γ = 2(1 − α), we obtain (1pt)
β = (2(1 − α)/2)2 = (1 − α)2
as the relation between α and β.
β
1
1
α
(1pt)
4
Problem 4)
(a) (2pt) Explain the difference between convergence in probability and convergence with prob-
ability 1. Does any of them imply the other?
We say that Yn converges to a as n → ∞ in probability provided that (0.5pt)
lim P (|Yn − a|> ) = 0, ∀ > 0.
n→∞
On the other hand, we say that Yn converges to a as n → ∞ with probability 1 provided that (0.5pt):
P lim Yn = a = 1.
n→∞
The latter implies the former (1pt).
(b) (2pt) Explain the weak and strong law of large numbers.
Given an i.i.d. (1pt) sequence of random variables {Yn }n , the WLLN says that the sample mean
Mn := Y1 +...+Y
n
n
converges in probability to the expected value E[Yi ] (0.5pt), and the SLLN says
that the sample mean Mn converges with probability 1 to the expected value E[Yi ] (0.5pt).
(c) (2pt) Explain the difference between Bayesian and classical statistical inference.
In the Bayesian approach, we consider the unknown information as a realization of a random variable
with known distribution (1pt). In the classical approach, we consider that the unknown information
is fixed constant (1pt).
(d) (2pt) Explain the difference between the MAP and the ML estimation. When would they
lead to the same result?
The MAP rule maximizes the posterior distribution, e.g., pΘ|X (θ|x) or fΘ|X (θ|x) (0.5pt). On the
other hand, the ML rule maximizes the likelihood function, e.g., pX (x; θ) or fX (x; θ) (0.5pt). They
would lead to the same estimate if the prior distribution of Θ is uniform (1pt).