CS440/ECE448 Lecture 21:
Parameter Learning for Bayesian Networks
By Mark Hasegawa-Johnson, 3/2022
License: CC-BY 4.0
You may redistribute or remix if you cite the
source.
Parameter Learning for Bayesian Networks
• From observed data: Maximum likelihood
• From observed data: Laplace smoothing
• From partially observed data: Expectation maximization
Flying cows
The scenario:
Central Illinois has recently had a
problem with flying cows.
Farmers have called the university
to complain that their cows flew
away.
Flying cows
The university dispatched a team
of expert vaccavolatologists. They
determined that almost all flying
cows were explained by one or
both of the following causes:
• Smart cows. The cows learned
how to fly, on their own, without
help.
• Alien intervention. UFOs taught
the cows how to fly.
Flying cows
The vaccavolatologists created a
Bayes net, to help them predict
any future instances of cow flying: A S
• P(A) = Probability that aliens
teach the cow.
• P(S) = Probability that a cow is
smart enough to figure out how F
to fly on its own.
• P(F|S,A) = Probability that a cow
learns to fly.
A S
Flying cows F Day A S F
They went out to watch a nearby 1
pasture for ten days. 2 T T
• They reported the number of 3
days on which A, S, and/or F 4 T T T
occurred.
5 T
• Their results are shown in the
table at left (True is marked as 6
“T”; False is shown with a blank). 7 T T
8
9 T
10
A S
Flying cows F Day A S F
The vaccavolatologists now wish to 1
estimate the parameters of their 2 T T
Bayes net
3
• P(A)
4 T T T
• P(S)
• P(F|S,A) 5 T
6
…so that they will be better able to 7 T T
testify before Congress about the 8
relative dangers of aliens versus
smart cows. 9 T
10
Maximum Likelihood A S
Estimation F Day A S F
Suppose we have n training 1
examples, 1 ≤ 𝑖 ≤ 𝑛, with known 2 T T
values for each of the random 3
variables:
4 T T T
• 𝑎! = 𝑇 or 𝑎! = 𝐹
5 T
• 𝑠! = 𝑇 or 𝑠! = 𝐹
6
• 𝑓! = 𝑇 or 𝑓! = 𝐹
7 T T
8
9 T
10
Maximum Likelihood A S
Estimation F Day A S F
We can estimate model parameters 1
to be the values that maximize the 2 T T
likelihood of the observations, 3
subject to the constraints that
4 T T T
𝑃 𝐴 =𝑇 +𝑃 𝐴 =𝐹 =1 5 T
𝑃 𝑆 =𝑇 +𝑃 𝑆 =𝐹 =1 6
𝑃 𝐹 = 𝑇|𝑆, 𝐴 + 𝑃 𝐹 = 𝐹|𝑆, 𝐴 = 1 7 T T
8
9 T
10
Maximum Likelihood A S
Estimation F Day A S F
The maximum likelihood parameters are 1
2 T T
# days on which 𝑎! = 𝑇 3
𝑃 𝐴=𝑇 =
# days total 4 T T T
# days on which 𝑠! = 𝑇 5 T
𝑃 𝑆=𝑇 = 6
# days total
7 T T
# days (A=𝑎,S=𝑠, F=T) 8
𝑃 𝐹 = 𝐹|𝑠, 𝑎 =
# days (A=𝑎,S=𝑠)
9 T
10
Maximum Likelihood A S
Estimation F Day A S F
The maximum likelihood parameters 1
are 2 T T
3 2
𝑃 𝐴=𝑇 = , 𝑃 𝑆=𝑇 = 3
10 10
4 T T T
a s 𝑃 𝐹 = 𝑇|𝑠, 𝒂 5 T
F F 1/6 6
F T 1 7 T T
T F 1/2 8
T T 1 9 T
10
Conclusions: maximum likelihood estimation
• Smart cows are far more dangerous than aliens.
• Maximum likelihood estimation is very easy to use, IF you have
training data in which the values of ALL variables are observed.
Parameter Learning for Bayesian Networks
• From observed data: Maximum likelihood
• From observed data: Laplace smoothing
• From partially observed data: Expectation maximization
A S
Laplace smoothing F
Laplace smoothing adds an extra count of 𝑘 to both categories.
Unlike naïve Bayes, we assume that we know the cardinality of each RV in
advance, so the denominator uses 𝑘× the known cardinality (no OOVs).
# days on which 𝑎! = 𝑇 + 𝑘
𝑃 𝐴=𝑇 =
(# days total)+2k
# days on which 𝑠! = 𝑇 + 𝑘
𝑃 𝑆=𝑇 =
# days total + 2𝑘
(# days (A=𝑎,S=𝑠, F=T))+k
𝑃 𝐹 = 𝐹|𝑠, 𝑎 =
# days (A=𝑎,S=𝑠) + 2𝑘
A S
Laplace smoothing F Day A S F
Laplace-smoothed parameters: 1
3+𝑘 2 T T
𝑃 𝐴=𝑇 = ,
10 + 2𝑘 3
2+𝑘 4 T T T
𝑃 𝑆=𝑇 =
10 + 2𝑘 5 T
a s 𝑃 𝐹 = 𝑇|𝑠, 𝒂 6
F F (1+k)/(6+2k) 7 T T
F T (1+k)/(1+2k) 8
T F (1+k)/(2+2k) 9 T
T T (1+k)/(1+2k) 10
Conclusions: Laplace smoothing
Just like in naïve Bayes:
• Laplace smoothing makes it possible for things to happen in the test data
that never happened in the training data. For example, maximum likelihood
resulted in 𝑃 𝐹 = 𝐹|𝑆 = 𝑇, 𝐴 = 𝑇 = 0, but with Laplace smoothing, we
"
smooth that parameter to 𝑃 𝐹 = 𝐹|𝑆 = 𝑇, 𝐴 = 𝑇 =
#$%"
• This smoothing improves generalization from training data to test data.
Conclusions: Laplace smoothing
Unlike naïve Bayes:
• In Bayesian networks, we assume that we know the cardinality of each random
variable in advance, so no extra probability mass is kept aside for OOV events.
(# observations of (H=ℎ, X=𝑥))+k
𝑃 𝑋 = 𝑥|𝐻 = ℎ =
# observations of (H=ℎ) + 𝑘 H (# distinct values of 𝑋)
Parameter Learning for Bayesian Networks
• From observed data: Maximum likelihood
• From observed data: Laplace smoothing
• From partially observed data: Expectation maximization
Partially observed data
• Maximum likelihood estimation is very easy to use, IF you have
training data in which the values of ALL variables are observed.
• …but what if some of the variables can’t be observed?
• For example: after the 6th day, the cows decide to stop responding to
written surveys. Therefore, it’s impossible to observe, on any given
day, how smart the cows are. We don’t know if 𝑠! = 𝑇 or 𝑠! = 𝐹…
A S
F
Partially observed data Day A S F
Suppose that we have the 1
following observations: 2 T T
• We know whether A=True or 3
False. 4 T T T
• We know whether F=True or 5 T
False.
6
• After the 6th day, we don’t know
whether S is True or False (shown 7 T ? T
as ”?”). 8 ?
9 ? T
10 ?
Expectation Maximization (EM): Main idea
Remember that maximum likelihood estimation counts examples:
# days A=a, #$%, '$(
𝑃 𝐹 = 𝑇 𝐴 = 𝑎, 𝑆 = 𝑠 =
# days #$%, )$*
Expectation maximization is similar, but using “expected counts” instead of
actual counts:
𝐸[# days 𝐴 = 𝑎, 𝑆 = 𝑠, 𝐹 = 𝑇]
𝑃 𝐹 = 𝑇 𝐴 = 𝑎, 𝑆 = 𝑠 =
𝐸[# days 𝐴 = 𝑎, 𝑆 = 𝑠]
Where E[X] means “expected value of X”.
Definition of Expectation
The expected value of a random variable is its weighted average value,
with weights equal to the probabilities.
𝐸 #days 𝐴 = 𝑎, 𝑆 = 𝑠, 𝐹 = 𝑇 = O 𝑃 𝐴! = 𝑎, 𝑆! = 𝑠, 𝐹! = 𝑇
!∈'()*
A S
F
Expectation Day A S F
𝐸 #days 𝐴 = 𝐹, 𝑆 = 𝑇, 𝐹 = 𝑇 1
2 T T
#,
3
= O 𝑃 𝐴! = 𝐹, 𝑆! = 𝑇, 𝐹! = 𝑇 4 T T T
!+#
5 T
6
7 T ? T
8 ?
9 ? T
10 ?
A S
F
Expectation Day A S F
𝐸 #days 𝐴 = 𝐹, 𝑆 = 𝑇, 𝐹 = 𝑇 1
2 T T
#,
𝑃 𝐴! = 𝐹, 𝐹! = 𝑇 × 3
=O 4 T T T
𝑃 𝑆! = 𝑇|𝐴! = 𝐹, 𝐹! = 𝑇
!+#
5 T
6
𝑃 𝐴! = 𝐹, 𝐹! = 𝑇 is either 0 or 1,
depending on whether the event 7 T ? T
certainly occurred (days 2 and 9) 8 ?
or certainly did not occur (every 9 ? T
other day).
10 ?
A S
F
Expectation Day A S F
𝐸 #days 𝐴 = 𝐹, 𝑆 = 𝑇, 𝐹 = 𝑇 1
2 T T
#,
𝑃 𝐴! = 𝐹, 𝐹! = 𝑇 × 3
=O 4 T T T
𝑃 𝑆! = 𝑇|𝐴! = 𝐹, 𝐹! = 𝑇
!+#
5 T
6
• 𝑃 𝑆! = 𝑇|𝐴! = 𝐹, 𝐹! = 𝑇 = 1
on day 2, because the event 7 T ? T
certainly occurred. 8 ?
• 𝑃 𝑆! = 𝑇|𝐴! = 𝐹, 𝐹! = 𝑇 is 9 ? T
unknown on day 9 10 ?
A S
F
Expectation Day A S F
𝐸 #days 𝐴 = 𝐹, 𝑆 = 𝑇, 𝐹 = 𝑇 1
2 T T
= 1 + 𝑃 𝑆- = 𝑇|𝐴- = 𝐹, 𝐹- = 𝑇 3
4 T T T
• How can we compute 𝑃(𝑆. = 5 T
𝑇|𝐴. = 𝐹, 𝐹. = 𝑇)? 6
• In order to compute it, we need 7 T ? T
the model parameters
8 ?
• The model parameters are the
9 ? T
thing we’re trying to estimate!
10 ?
Expectation Maximization (EM) is iterative
INITIALIZE: guess the model parameters.
ITERATE until convergence:
1. E-Step: 𝐸[# days 𝑆 = 𝑠, 𝐴 = 𝑎, 𝐹 = 𝑓] = ∑!:0!+0,2!+2 𝑃 𝑆 = 𝑠|𝑎, 𝑓
3[# days 6+7,8+0,9+2]
2. M-Step: 𝑃 𝐹 = 𝑓 𝑆 = 𝑠, 𝐴 = 𝑎 =
3[# days 6+7, 8+0]
Continue the iteration, shown above, until the model parameters stop
changing.
A S
F
Example: Initialize
Marilyn Modigliani is a professional vaccavolatologist. She gives us
these initial guesses about the possible model parameters (her guesses
are probably not quite right, but they are as good a guess as anybody
else’s):
1 1
𝑃 𝐴=𝑇 = , 𝑃 𝑆=𝑇 =
4 4
a s 𝑃 𝐹 = 𝑇|𝑠, 𝒂
F F 0
F T 1/2
T F 1/2
T T 1
A S
F
E-Step Day A S F
Based on Marilyn’s model, we 1
calculate 𝑃 𝑆 = 𝑇|𝑎! , 𝑓! for each 2 T T
of the missing days, as shown in
3
the table at right.
4 T T T
5 T
6
7 T 2/5 T
8 1/7
9 1 T
10 1/7
A S
E-Step F
The expected counts are
𝐸[# days 𝑆 = 𝑠, 𝐴 = 𝑎, 𝐹 = 𝑓] = O 𝑃 𝑆 = 𝑠|𝑎, 𝑓
!:0!+0,2!+2
a f 𝑬[# 𝒅𝒂𝒚𝒔 𝑺 = 𝑻|𝑎, 𝑓] 𝑬[# 𝒅𝒂𝒚𝒔 𝑺 = 𝑭|𝑎, 𝑓]
F F 1 1 2 6 6 33
0+0+0+ + = 1+1+1+ + =
7 7 7 7 7 7
F T 1+1=2 0+0=0
T F 0 1
T T 2 7 3 3
1+ = 0+ =
5 5 5 5
a f 𝑬[# 𝒅𝒂𝒚𝒔 𝑺 = 𝑻|𝑎, 𝑓] 𝑬[# 𝒅𝒂𝒚𝒔 𝑺 = 𝑭|𝑎, 𝑓]
F F 1 1 2 6 6 33
0+0+0+ + = 1+1+1+ + =
7 7 7 7 7 7
F T 1+1=2 0+0=0
T F 0 1
T T 2 7 3 3
1+ = 0+ =
5 5 5 5
M-Step
Now let’s re-estimate the model parameters. For example,
𝐸[# days 𝑆 = 𝐹, 𝐴 = 𝐹, 𝐹 = 𝑇]
𝑃 𝐹 = 𝑇 𝑆 = 𝐹, 𝐴 = 𝐹 =
𝐸[# days 𝑆 = 𝐹, 𝐴 = 𝐹]
0
= =0
33
+0
7
a f 𝑬[# 𝒅𝒂𝒚𝒔 𝑺|𝑎, 𝑓] 𝑬[# 𝒅𝒂𝒚𝒔 ¬𝑺|𝑎, 𝑓]
F F 1 1 2 6 6 33
0+0+0+ + = 1+1+1+ + =
7 7 7 7 7 7
F T 1+1=2 0+0=0
T F 0 1
T T 2 7 3 3
1+ = 0+ =
5 5 5 5
M-Step
Now let’s re-estimate the model parameters. For example,
𝐸[# days 𝑆 = 𝑇, 𝐴 = 𝐹, 𝐹 = 𝑇]
𝑃 𝐹 = 𝑇 𝑆 = 𝑇, 𝐴 = 𝐹 =
𝐸[# days 𝑆 = 𝑇, 𝐴 = 𝐹]
2 7
= =
2
+2 8
7
A S
M-Step F
The re-estimated probabilities are
a s 𝑃 𝐹 𝑆 = 𝑠, 𝐴 = 𝑎
# days 𝐴 = 𝑇 3 F F 0
=0
𝑃 𝐴=𝑇 = = 33
# days total 10 7
+0
F T 2 7
2 7 =
2
𝐸 #days 𝑆 = 𝑇 7
+2+0+
5 +2 8
7
𝑃 𝑆=𝑇 = =
# days total 10 T F 3/5 3
94 =
3 8
= 1+
5
350
T T 7/5
=1
0 + 7/5
Expectation Maximization (EM): review
INITIALIZE: guess the model parameters.
ITERATE until convergence:
1. E-Step: 𝐸[# days 𝑆 = 𝑠, 𝐴 = 𝑎, 𝐹 = 𝑓] = ∑!:0!+0,2!+2 𝑃 𝑆 = 𝑠|𝑎, 𝑓
3[# days 6+7,8+0,9+2]
2. M-Step: 𝑃 𝐹 = 𝑓 𝑆 = 𝑠, 𝐴 = 𝑎 =
3[# days 6+7, 8+0]
Continue the iteration, shown above, until the model parameters stop
changing.
Properties of the EM algorithm
• It always converges.
• The parameters it converges to (P(A), P(S), and P(F|A,S)):
• are guaranteed to be at least as good as your initial guess, but
• They depend on your initial guess. Different initial guesses may
result in different results, after the algorithm converges.
• For example, Marilyn’s initial guess was 𝑃 𝐹 = 𝑇|𝑆 = 𝐹, 𝐴 = 𝐹 =
𝟎. Notice that we ended up with the same value! According to
the fully observed data we saw earlier, that might not be the best
possible parameter for these data.
Parameter Learning for Bayesian Networks
• Maximum Likelihood (ML):
# days (A=𝑎, S=𝑠, F=T)
𝑃 𝐹 = 𝑇|𝑆 = 𝑠, 𝐴 = 𝑎 =
# days (A=𝑎, S=𝑠)
• Laplace Smoothing:
# days (A=𝑎, S=𝑠, F=T) + 𝑘
𝑃 𝐹 = 𝑇|𝑆 = 𝑠, 𝐴 = 𝑎 =
# days (A=𝑎, S=𝑠) + 2𝑘
• Expectation Maximization (EM):
𝐸[# days 𝐴 = 𝑎, 𝑆 = 𝑠, 𝐹 = 𝑇]
𝑃 𝐹 = 𝑇 𝑆 = 𝑠, 𝐴 = 𝑎 =
𝐸[# days 𝐴 = 𝑎, 𝑆 = 𝑠]