Artificial Intelligence: A Modern
Approach
Fourth Edition
Chapter 20
Learning Probabilistic Models
1
Copyright © 2021 Pearson Education, Inc. All Rights Reserved
Outline
♦ Statistical Learning
♦ Learning with Complete Data
♦ Learning with Hidden Variables: The EM Algorithm
Chapter 20, Sections 1–3
2
Introduction
♦ Agents can handle uncertainty by using the methods of probability and
decision theory.
♦ However, they need to learn their probabilistic theories of the world form
experience.
♦ This chapter focuses on that.
♦ Focus on Data and Hypotheses
♦ Data are evidence: instantiations of some or all of the random variables
describing the domain.
♦ The hypotheses in this chapter are probabilistic theories of how the domain
works, including logical theories as a special case.
Chapter 20, Sections 1–3
3
Statistical Learning Example
Suppose there are five kinds of bags of candies:
h1: 100% cherry candies
h2: 75% cherry candies + 25% lime candies
h3: 50% cherry candies + 50% lime candies
h4: 25% cherry candies + 75% lime candies
h5: 100% lime candies
Then we observe candies drawn from some bag:
What kind of bag is it? What flavour will the next candy be?
• Given a new bag of candy, the random variable H (for hypothesis) denotes
the type of the bag, with possible values h1 through h5.
• H is not directly observable, of course. As the pieces of candy are opened
and inspected, data are revealed—D1, D2,..., DN, where each Di is a random
variable with possible values cherry and lime.
• Agent’s Goal: Predict what flavor is the next candy.
Full Bayesian learning
• Bayesian learning simply calculates the probability of each hypothesis, given the
data, and makes predictions on that basis.
• predictions are made by using all the hypotheses, weighted by their probabilities,
rather than by using just a single “best” hypothesis. In this way, learning is
reduced to probabilistic inference.
• Let D represent all the data, with observed value d. The key quantities in the
Bayesian approach are the hypothesis prior, P(hi), and the likelihood of the data
under each hypothesis, Likelihood P(d|hi).
Full Bayesian learning
View learning as Bayesian updating of a probability distribution over the
hypothesis space
H is the hypothesis variable, values h1, h2, . . ., prior P(H)
jth observation dj gives the outcome of random variable D j
training data d = d1 , . . . , dN
Given the data so far, each hypothesis has a posterior probability:
P (hi |d) = αP (d|hi )P (hi ) (Bayes’ rule)
where P (d|hi ) is called the likelihood
Predictions use a likelihood-weighted average over the hypotheses:
P(X|d) = Σ i P(X|h i )P (hi|d)
where each hypothesis determines a probability distribution over X.
No need to pick one best-guess hypothesis!
Example
Suppose there are five kinds of bags of candies
10% are h1: 100% cherry candies
20% are h2: 75% cherry candies + 25% lime candies
40% are h3: 50% cherry candies + 50% lime candies
20% are h4: 25% cherry candies + 75% lime candies
10% are h5: 100% lime candies
Then we observe candies drawn from some bag:
What kind of bag is it? What flavour will the next candy be?
Assumption: observations are independent and identically distributed (i.i.d) then the
likelihood of the data is:
Posterior probability of hypotheses
1
P(h1 | d)
Posterior probability of hypothesis
P(h2 |
d) P(h3
0.8 | d)
P(h4 | d)
P(h5 |
0.6 d)
0.4
0.2
0
0 2 4 6 8 10
Number of samples in d
Chapter 20, Sections 1–3
8
© 2021 Pearson Education Ltd.
© 2021 Pearson Education Ltd.
Prediction probability
0.9
P(next candy is lime | d)
0.8
0.7
0.6
0.5
0.4
0 2 4 6 8 10
Number of samples in d
11
M A P approximation
• Bayesian predictions are optimal regardless of the dataset size.
• However, the hypotheses space is very large or infinite.
• Summing over the hypothesis space is often intractable, so we need to resort
to approximate or simplified methods. (e.g., 18,446,744,073,709,551,616
Boolean functions of 6 attributes)
• Common approach: Make prediction based on a single most probable
hypothesis.
• Maximum a posteriori (MAP) learning: choose hMAP maximizing P (hi|d)
I.e., maximize P (d|hi)P (hi) or minimizing -log2 P (d|hi) - log2 P
(hi)
• For deterministic hypotheses, P (d|hi) is 1 if consistent, 0 otherwise (e.g. h1
in our example)
⇒ MAP = simplest consistent hypothesis
12
ML approximation
For large data sets, prior becomes irrelevant
Maximum likelihood (ML) learning: choose hML maximizing P (d|hi )
I.e., simply get the best fit to the data; identical to MAP for uniform
prior (which is reasonable if all hypotheses are of the same complexity)
ML is the “standard” (non-Bayesian) statistical learning method
13
Learning with Complete Data
• The task of learning a probability model, given data that are assumed to
be generated from that model, is called density estimation.
• Density estimation is a form of unsupervised learning.
• We will assume the data is complete. That is. Each data point contains
values for every variable in the probability model being learned.
14
Bayesian Network Model
(a) Bayesian network model for the case of candies with an unknown
proportion of cherries and limes.
(b) Model for the case where the wrapper color depends (probabilistically) on
the candy flavor.
15
ML parameter learning in Bayes nets
Bag from a new manufacturer; fraction θ of cherry P(F=cherry )
candies?
Any θ is possible: continuum of hypotheses hθ
θ is a parameter for this simple (binomial) family of Flavor
models we unwrap N candies, c cherries and l = N − c limes
Suppose
These are i.i.d. (independent, identically distributed) observations, so
INI c
P (d|hθ) = ∏ P (dj |h ) = θ · (1 l −
j =1 θ
θ)
Maximize this w.r.t. θ—which is easier for the log-likelihood:
N
L(d|hθ) = log P (d|hθ) = ∑ log P (dj|hθ) = c log θ + l log(1
j =1
− θ) θ)
dL(d|h c € c
= − =0 ⇒ θ c
d θ 1− θ c + l= N
=
θ
Seems sensible, but causes problems with 0
counts!
16
Multiple parameters
Red/green wrapper depends probabilistically on P(F=cherry )
flavor:
Likelihood for, e.g., cherry candy in green wrapper:
Flavor
P (F = cherry , W = green | F P(W=red | F)
h= P2 )(F = cherry |h
θ,θ1,θ )P (W = green |F = cherry , h cherry 1
θ,θ1 , θ,θ1 ,
) θ 2 θ 2 lime 2
= θ · (1 − θ1 ) Wrapper
N candies, rc red-wrapped cherry candies, etc.:
Take the logarithm:
17
Multiple parameters contd.
Derivatives of L contain only the relevant parameter:
With complete data, parameters can be learned separately
18
Naive Bayes Models
Assuming Boolean variables, the parameters are
With observed attribute values x1, ... , xn, the probability of each class is given
by
A deterministic prediction can be obtained by choosing the most likely class
The method learns fairly well but not as well as decision-tree learning; this is
presumably because the true hypothesis-which is a decision tree-is not
representable exactly using a naive Bayes model.
Naive Bayes learning turns out to do surprisingly well in a wide range of
applications; the boosted version
Scales well to large problems with n Boolean attributes there are just 2n + 1
parameters
19
Naive Bayes Models
The learning curve for naive Bayes learning applied to the restaurant
problem from chapter 18. Compared with decision tree learning.
20
Example: linear Gaussian model
1
0.8
P(y |x)
4 0.6
3.
5
y
3
2. 0.4
5
2
1. 1 0.2
5 0.8
010 0.2 0.6
0.4 y
0. 0.4 0.6 0.2
5 0.8 0 0
x 1
0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1
Assume data are generated as follows: x
Maximize the conditional likelihood
with respect to θ1, θ2.
LN 2
= minimizing E = (yj − (θ1 xj +
j =1
θ2))
That is, minimizing the sum of squared errors gives the ML
solution for a linear fit assuming Gaussian noise of fixed
variance 21
Naive Bayes Models
Bayesian parameter learning
Hypothesis prior: Bayesian approach to parameter learning starts by defining
a prior probability distribution over the possible hypotheses
In the Bayesian view, is the (unknown) value of a random variable that
defines the hypothesis space
The hypothesis prior is just the prior distribution P().
Thus, P( = ) is the prior probability that the bag has a fraction of cherry
candies.
if we don’t anything about the possible values of then we can use :
P() = Uniform[0,1](), uniform density is part of beta distributions.
Each beta distribution is defined by two hyperparameters a and b such that
22
Naive Bayes Models
Bayesian parameter learning
Suppose we observe a cherry, then we have:
23
Naive Bayes Models
Bayesian parameter learning
Assuming Parameter independence
A Bayesian network that corresponds to a Bayesian learning process.
Posterior distributions for the parameter variables , 1, and 2 can be
inferred from their prior distributions and the evidence in the Flavori and
Wrapperi variables.
24
Naive Bayes Models
Density estimation with nonparametric models
• It is possible to learn a probability model without making any assumptions
about its structure and parameterization by adopting the nonparametric
methods.
Two possibilities:
• k-nearest neighbors (use them for density estimation rather than regression
or classification)
to estimate the unknown probability density at a query point x, we can simply
measure the density of the data points in the neighborhood of x.
• Use kernel functions
a) A 3D plot of the mixture of Gaussians
b) A 128- point sample of points from the mixture, together with two query points
(small squares) and their IO-nearest-neighborhoods (medium and large circles).
Naive Bayes Models
Density estimation using k-nearest-neighbors, for k= 3, 10, and 40 respectively.
k = 3 is too spiky, 40 is too smooth, and 10 is just about right.
The best value for k can be chosen by cross-validation
26
Naive Bayes Models
Kernel density estimation using Gaussian kernels with = 0.02, 0.07, and 0.20
respectively. w = 0.07 is about right.
27
Learning With Hidden Variables: The EM Algorithm
Unsupervised clustering: Learning mixtures of Gaussians
• Unsupervised clustering is the problem of discerning multiple categories in a collection of
Unsupervised clustering objects. The problem is unsupervised because the category labels are not
given.
• Unsupervised clustering begins with Data.
• Next, we need to understand what kind of probability distribution might have generated the data.
• Clustering presumes that the data are generated from a mixture distribution, P. Such distribution has
k components, each of which is a distribution in its own.
• Let the random variable C denote the component, with values 1,...,k; then the mixture distribution is
given by
x refers to the values of the attributes for a data point
• For continuous data, a natural choice is mixture of Gaussians (multivariate Gaussian) .The
parameters of a mixture of Gaussians Mixture of Gaussians are wi =P(C=i) (the weight of each
component), µi (the mean of each component), and Σi (the covariance of each component).
28
Learning With Hidden Variables: The EM Algorithm
Many real-world problems have hidden variables (sometimes called latent variables)
which are not observable in the data.
Expectation-maximization helps with hidden variables
• Pretend to know the parameters for the model, then to infer the probability that each
data point belongs to each component.
• refit the components to the data, where each component is fitted to the entire data set
with each point weighted by the probability that it belongs to that component.
• The process iterates until convergence
Essentially, we are “completing” the data by inferring probability distributions over the
hidden variables—which component each data point belongs to—based on the
current model.
29
Learning With Hidden Variables: The EM Algorithm
Unsupervised clustering: Learning mixtures of Gaussians
• For the mixture of Gaussians, initialize the mixture-model parameters arbitrarily and iterate the E-
step (Expectation Step) & M-step (Maximization Step)
• E-Step: Compute the probabilities pij =P(C=i|xj), the probability that datum xj was generated by
component i. By Bayes’ rule, we have p ij =αP(xj |C=i)P(C=i). The term P(xj |C=i) is just the
probability at xj of the ith Gaussian, and the term P(C=i) is just the weight parameter for the i th
Gaussian. Define ni = ∑j pij, the effective number of data points currently assigned to
component i.
• M-Step: Compute the new mean, covariance, and component weights using the following steps
in sequence: (finds the new values of the parameters that maximize the log likelihood of the
data, given the expected values of the hidden indicator variables)
where N is the total number of data points.
• Observations: the log likelihood for the final learned model slightly exceeds
that of the original model & EM increases the log likelihood of the data at
every iteration.
30
Learning With Hidden Variables: The EM Algorithm
Learning Bayesian networks with hidden variables
To learn a Bayesian network with hidden variables, we apply the same insights
that worked for mixtures of Gaussians.
Example: a situation in which there are two bags of candies that have been
mixed together.
• Candies are described by three features: in addition to the Flavor and the
Wrapper, some candies have a Hole in the middle and some do not.
• The distribution of candies in each bag is described by a naive Bayes
model: the features are independent, given the bag, but the conditional
probability distribution for each feature depends on the bag.
• A 1000 samples were generated from a mode with true parameters:
31
Learning With Hidden Variables: The EM Algorithm
Learning Bayesian networks with hidden variables
Start with arbitrary values for model parameters.
The expected count of ( Bag = l) is the sum, over all candies, of the probability that the
candy came from bag 1:
using Bayes' rule and applying conditional independence
The expected count of cherry candies from bag 1 is given by
32
Learning With Hidden Variables: The EM Algorithm
Learning Bayesian networks with hidden variables
(a) A mixture model for candy. The proportions of different flavors, wrappers,
presence of holes depend on the bag, which is not observed.
(b) Bayesian network for a Gaussian mixture. The mean and covariance of the
observable variables X depend on the component C.
33
Learning With Hidden Variables: The EM Algorithm
Graphs showing the log likelihood of the data, L , as a function of the EM
iteration (a) Graph for Gaussian mixture model (b) Graph for the Bayesian
network
34
Learning With Hidden Variables: The EM Algorithm
Learning hidden Markov models
One application of EM involves learning the transition probabilities in hidden
Markov models (HMMs).
A hidden Markov model can be represented by a dynamic Bayes net with a
single discrete state variable
Each data point consists of an observation sequence of finite length
transition probability from state i to state j,
• calculate the expected proportion of times that the system undergoes a
transition to state j when in state i:
35
Summary
Bayesian learning methods formulate learning as a form of probabilistic
inference, using the observations to update a prior distribution over hypotheses.
Maximum a posteriori (MAP) learning selects a single most likely hypothesis
given the data.
Maximum-likelihood learning simply selects the hypothesis that maximizes the
likelihood of the data; it is equivalent to MAP learning with a uniform prior.
Naive Bayes learning is a particularly effective technique that scales
When some variables are hidden, local maximum likelihood solutions can be
found using the EM algorithm
Nonparametric models represent a distribution using the collection of data
points.
36