0% found this document useful (0 votes)

20 views36 pages

Chapter20 4e

This chapter discusses learning probabilistic models from data. It covers learning with complete data by estimating parameters of Bayesian network models using maximum likelihood. The likelihood function is derived for cases with single and multiple parameters, and parameters can be learned separately with complete data.

Uploaded by

Anas Hamdan

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PPTX, PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

20 views36 pages

Chapter20 4e

Uploaded by

Anas Hamdan

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PPTX, PDF, TXT or read online on Scribd

Artificial Intelligence: A Modern

Approach
Fourth Edition

Chapter 20

Learning Probabilistic Models

1
Copyright © 2021 Pearson Education, Inc. All Rights Reserved
Outline
♦ Statistical Learning

♦ Learning with Complete Data

♦ Learning with Hidden Variables: The EM Algorithm

Chapter 20, Sections 1–3

2
Introduction
♦ Agents can handle uncertainty by using the methods of probability and
decision theory.
♦ However, they need to learn their probabilistic theories of the world form
experience.
♦ This chapter focuses on that.
♦ Focus on Data and Hypotheses
♦ Data are evidence: instantiations of some or all of the random variables
describing the domain.
♦ The hypotheses in this chapter are probabilistic theories of how the domain
works, including logical theories as a special case.

Chapter 20, Sections 1–3

3
Statistical Learning Example
Suppose there are five kinds of bags of candies:
h1: 100% cherry candies
h2: 75% cherry candies + 25% lime candies
h3: 50% cherry candies + 50% lime candies
h4: 25% cherry candies + 75% lime candies
h5: 100% lime candies

Then we observe candies drawn from some bag:

What kind of bag is it? What flavour will the next candy be?
• Given a new bag of candy, the random variable H (for hypothesis) denotes
the type of the bag, with possible values h1 through h5.
• H is not directly observable, of course. As the pieces of candy are opened
and inspected, data are revealed—D1, D2,..., DN, where each Di is a random
variable with possible values cherry and lime.
• Agent’s Goal: Predict what flavor is the next candy.
Full Bayesian learning
• Bayesian learning simply calculates the probability of each hypothesis, given the
data, and makes predictions on that basis.
• predictions are made by using all the hypotheses, weighted by their probabilities,
rather than by using just a single “best” hypothesis. In this way, learning is
reduced to probabilistic inference.
• Let D represent all the data, with observed value d. The key quantities in the
Bayesian approach are the hypothesis prior, P(hi), and the likelihood of the data
under each hypothesis, Likelihood P(d|hi).
Full Bayesian learning
View learning as Bayesian updating of a probability distribution over the
hypothesis space
H is the hypothesis variable, values h1, h2, . . ., prior P(H)

jth observation dj gives the outcome of random variable D j

training data d = d1 , . . . , dN
Given the data so far, each hypothesis has a posterior probability:
P (hi |d) = αP (d|hi )P (hi ) (Bayes’ rule)

where P (d|hi ) is called the likelihood

Predictions use a likelihood-weighted average over the hypotheses:
P(X|d) = Σ i P(X|h i )P (hi|d)
where each hypothesis determines a probability distribution over X.
No need to pick one best-guess hypothesis!
Example

Suppose there are five kinds of bags of candies

10% are h1: 100% cherry candies
20% are h2: 75% cherry candies + 25% lime candies
40% are h3: 50% cherry candies + 50% lime candies
20% are h4: 25% cherry candies + 75% lime candies
10% are h5: 100% lime candies

Then we observe candies drawn from some bag:

What kind of bag is it? What flavour will the next candy be?
Assumption: observations are independent and identically distributed (i.i.d) then the
likelihood of the data is:
Posterior probability of hypotheses

1
P(h1 | d)
Posterior probability of hypothesis
P(h2 |
d) P(h3
0.8 | d)
P(h4 | d)
P(h5 |
0.6 d)

0.4

0.2

0
0 2 4 6 8 10
Number of samples in d

Chapter 20, Sections 1–3

8
© 2021 Pearson Education Ltd.
© 2021 Pearson Education Ltd.
Prediction probability

0.9
P(next candy is lime | d)

0.8

0.7

0.6

0.5

0.4
0 2 4 6 8 10
Number of samples in d

11
M A P approximation
• Bayesian predictions are optimal regardless of the dataset size.
• However, the hypotheses space is very large or infinite.
• Summing over the hypothesis space is often intractable, so we need to resort
to approximate or simplified methods. (e.g., 18,446,744,073,709,551,616
Boolean functions of 6 attributes)
• Common approach: Make prediction based on a single most probable
hypothesis.
• Maximum a posteriori (MAP) learning: choose hMAP maximizing P (hi|d)

I.e., maximize P (d|hi)P (hi) or minimizing -log2 P (d|hi) - log2 P

(hi)
• For deterministic hypotheses, P (d|hi) is 1 if consistent, 0 otherwise (e.g. h1
in our example)
⇒ MAP = simplest consistent hypothesis

12
ML approximation
For large data sets, prior becomes irrelevant
Maximum likelihood (ML) learning: choose hML maximizing P (d|hi )
I.e., simply get the best fit to the data; identical to MAP for uniform
prior (which is reasonable if all hypotheses are of the same complexity)
ML is the “standard” (non-Bayesian) statistical learning method

13
Learning with Complete Data
• The task of learning a probability model, given data that are assumed to
be generated from that model, is called density estimation.
• Density estimation is a form of unsupervised learning.
• We will assume the data is complete. That is. Each data point contains
values for every variable in the probability model being learned.

14
Bayesian Network Model

(a) Bayesian network model for the case of candies with an unknown
proportion of cherries and limes.
(b) Model for the case where the wrapper color depends (probabilistically) on
the candy flavor.

15
ML parameter learning in Bayes nets
Bag from a new manufacturer; fraction θ of cherry P(F=cherry )
candies? 
Any θ is possible: continuum of hypotheses hθ
θ is a parameter for this simple (binomial) family of Flavor
models we unwrap N candies, c cherries and l = N − c limes
Suppose
These are i.i.d. (independent, identically distributed) observations, so
INI c
P (d|hθ) = ∏ P (dj |h ) = θ · (1 l −
j =1 θ
θ)
Maximize this w.r.t. θ—which is easier for the log-likelihood:
N
L(d|hθ) = log P (d|hθ) = ∑ log P (dj|hθ) = c log θ + l log(1
j =1
− θ) θ)
dL(d|h c € c
= − =0 ⇒ θ c
d θ 1− θ c + l= N
=
θ
Seems sensible, but causes problems with 0
counts!

16
Multiple parameters
Red/green wrapper depends probabilistically on P(F=cherry )

flavor:
Likelihood for, e.g., cherry candy in green wrapper:
Flavor
P (F = cherry , W = green | F P(W=red | F)

h= P2 )(F = cherry |h
θ,θ1,θ )P (W = green |F = cherry , h cherry  1
θ,θ1 , θ,θ1 ,
) θ 2 θ 2 lime  2
= θ · (1 − θ1 ) Wrapper
N candies, rc red-wrapped cherry candies, etc.:

Take the logarithm:

17
Multiple parameters contd.
Derivatives of L contain only the relevant parameter:

With complete data, parameters can be learned separately

18
Naive Bayes Models

Assuming Boolean variables, the parameters are

With observed attribute values x1, ... , xn, the probability of each class is given
by

A deterministic prediction can be obtained by choosing the most likely class

The method learns fairly well but not as well as decision-tree learning; this is
presumably because the true hypothesis-which is a decision tree-is not
representable exactly using a naive Bayes model.

Naive Bayes learning turns out to do surprisingly well in a wide range of

applications; the boosted version

Scales well to large problems with n Boolean attributes there are just 2n + 1
parameters

19
Naive Bayes Models

The learning curve for naive Bayes learning applied to the restaurant
problem from chapter 18. Compared with decision tree learning.

20
Example: linear Gaussian model
1

0.8
P(y |x)
4 0.6
3.
5

y
3
2. 0.4
5
2
1. 1 0.2
5 0.8
010 0.2 0.6
0.4 y
0. 0.4 0.6 0.2
5 0.8 0 0
x 1
0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1
Assume data are generated as follows: x

Maximize the conditional likelihood

with respect to θ1, θ2.

LN 2
= minimizing E = (yj − (θ1 xj +
j =1
θ2))
That is, minimizing the sum of squared errors gives the ML
solution for a linear fit assuming Gaussian noise of fixed
variance 21
Naive Bayes Models

Bayesian parameter learning

Hypothesis prior: Bayesian approach to parameter learning starts by defining

a prior probability distribution over the possible hypotheses

In the Bayesian view, is the (unknown) value of a random variable that

defines the hypothesis space

The hypothesis prior is just the prior distribution P().

Thus, P( = ) is the prior probability that the bag has a fraction of cherry
candies.

if we don’t anything about the possible values of then we can use :

P() = Uniform[0,1](), uniform density is part of beta distributions.

Each beta distribution is defined by two hyperparameters a and b such that

22
Naive Bayes Models

Bayesian parameter learning

Suppose we observe a cherry, then we have:

23
Naive Bayes Models

Bayesian parameter learning

Assuming Parameter independence

A Bayesian network that corresponds to a Bayesian learning process.

Posterior distributions for the parameter variables , 1, and 2 can be
inferred from their prior distributions and the evidence in the Flavori and
Wrapperi variables.
24
Naive Bayes Models

Density estimation with nonparametric models

• It is possible to learn a probability model without making any assumptions

about its structure and parameterization by adopting the nonparametric
methods.
Two possibilities:
• k-nearest neighbors (use them for density estimation rather than regression
or classification)
to estimate the unknown probability density at a query point x, we can simply
measure the density of the data points in the neighborhood of x.

• Use kernel functions

a) A 3D plot of the mixture of Gaussians

b) A 128- point sample of points from the mixture, together with two query points
(small squares) and their IO-nearest-neighborhoods (medium and large circles).
Naive Bayes Models

Density estimation using k-nearest-neighbors, for k= 3, 10, and 40 respectively.

k = 3 is too spiky, 40 is too smooth, and 10 is just about right.
The best value for k can be chosen by cross-validation

26
Naive Bayes Models

Kernel density estimation using Gaussian kernels with = 0.02, 0.07, and 0.20
respectively. w = 0.07 is about right.

27
Learning With Hidden Variables: The EM Algorithm
Unsupervised clustering: Learning mixtures of Gaussians
• Unsupervised clustering is the problem of discerning multiple categories in a collection of
Unsupervised clustering objects. The problem is unsupervised because the category labels are not
given.
• Unsupervised clustering begins with Data.
• Next, we need to understand what kind of probability distribution might have generated the data.
• Clustering presumes that the data are generated from a mixture distribution, P. Such distribution has
k components, each of which is a distribution in its own.
• Let the random variable C denote the component, with values 1,...,k; then the mixture distribution is
given by
x refers to the values of the attributes for a data point

• For continuous data, a natural choice is mixture of Gaussians (multivariate Gaussian) .The
parameters of a mixture of Gaussians Mixture of Gaussians are wi =P(C=i) (the weight of each
component), µi (the mean of each component), and Σi (the covariance of each component).

28
Learning With Hidden Variables: The EM Algorithm

Many real-world problems have hidden variables (sometimes called latent variables)
which are not observable in the data.

Expectation-maximization helps with hidden variables

• Pretend to know the parameters for the model, then to infer the probability that each
data point belongs to each component.
• refit the components to the data, where each component is fitted to the entire data set
with each point weighted by the probability that it belongs to that component.
• The process iterates until convergence

Essentially, we are “completing” the data by inferring probability distributions over the
hidden variables—which component each data point belongs to—based on the
current model.

29
Learning With Hidden Variables: The EM Algorithm
Unsupervised clustering: Learning mixtures of Gaussians
• For the mixture of Gaussians, initialize the mixture-model parameters arbitrarily and iterate the E-
step (Expectation Step) & M-step (Maximization Step)
• E-Step: Compute the probabilities pij =P(C=i|xj), the probability that datum xj was generated by
component i. By Bayes’ rule, we have p ij =αP(xj |C=i)P(C=i). The term P(xj |C=i) is just the
probability at xj of the ith Gaussian, and the term P(C=i) is just the weight parameter for the i th
Gaussian. Define ni = ∑j pij, the effective number of data points currently assigned to
component i.
• M-Step: Compute the new mean, covariance, and component weights using the following steps
in sequence: (finds the new values of the parameters that maximize the log likelihood of the
data, given the expected values of the hidden indicator variables)

where N is the total number of data points.

• Observations: the log likelihood for the final learned model slightly exceeds
that of the original model & EM increases the log likelihood of the data at
every iteration.
30
Learning With Hidden Variables: The EM Algorithm

Learning Bayesian networks with hidden variables

To learn a Bayesian network with hidden variables, we apply the same insights
that worked for mixtures of Gaussians.

Example: a situation in which there are two bags of candies that have been
mixed together.
• Candies are described by three features: in addition to the Flavor and the
Wrapper, some candies have a Hole in the middle and some do not.
• The distribution of candies in each bag is described by a naive Bayes
model: the features are independent, given the bag, but the conditional
probability distribution for each feature depends on the bag.

• A 1000 samples were generated from a mode with true parameters:

31
Learning With Hidden Variables: The EM Algorithm

Learning Bayesian networks with hidden variables

Start with arbitrary values for model parameters.

The expected count of ( Bag = l) is the sum, over all candies, of the probability that the
candy came from bag 1:

using Bayes' rule and applying conditional independence

The expected count of cherry candies from bag 1 is given by

32
Learning With Hidden Variables: The EM Algorithm

Learning Bayesian networks with hidden variables

(a) A mixture model for candy. The proportions of different flavors, wrappers,
presence of holes depend on the bag, which is not observed.
(b) Bayesian network for a Gaussian mixture. The mean and covariance of the
observable variables X depend on the component C.

33
Learning With Hidden Variables: The EM Algorithm

Graphs showing the log likelihood of the data, L , as a function of the EM

iteration (a) Graph for Gaussian mixture model (b) Graph for the Bayesian
network

34
Learning With Hidden Variables: The EM Algorithm

Learning hidden Markov models

One application of EM involves learning the transition probabilities in hidden

Markov models (HMMs).

A hidden Markov model can be represented by a dynamic Bayes net with a

single discrete state variable

Each data point consists of an observation sequence of finite length

transition probability from state i to state j,

• calculate the expected proportion of times that the system undergoes a
transition to state j when in state i:

35
Summary
Bayesian learning methods formulate learning as a form of probabilistic
inference, using the observations to update a prior distribution over hypotheses.

Maximum a posteriori (MAP) learning selects a single most likely hypothesis

given the data.

Maximum-likelihood learning simply selects the hypothesis that maximizes the

likelihood of the data; it is equivalent to MAP learning with a uniform prior.

Naive Bayes learning is a particularly effective technique that scales

When some variables are hidden, local maximum likelihood solutions can be
found using the EM algorithm

Nonparametric models represent a distribution using the collection of data

points.

Probabilistic Models in AI Learning
No ratings yet
Probabilistic Models in AI Learning
31 pages
Ai Notes (Unit 3)
No ratings yet
Ai Notes (Unit 3)
33 pages
Bayesian Learning and ML Methods
No ratings yet
Bayesian Learning and ML Methods
3 pages
Statistical Learning Methods
No ratings yet
Statistical Learning Methods
28 pages
Bayesian Learning in Probabilistic Models
No ratings yet
Bayesian Learning in Probabilistic Models
66 pages
Naïve Bayesian Classifier Overview
No ratings yet
Naïve Bayesian Classifier Overview
13 pages
Bayesian Learning Overview
No ratings yet
Bayesian Learning Overview
51 pages
Bayesian Decision Theory in ML
0% (1)
Bayesian Decision Theory in ML
15 pages
Bayesian Learning
No ratings yet
Bayesian Learning
22 pages
Understanding Bayesian Learning Methods
No ratings yet
Understanding Bayesian Learning Methods
42 pages
Introduction to Bayesian Reasoning
No ratings yet
Introduction to Bayesian Reasoning
9 pages
Bayesian Classification in Machine Learning
No ratings yet
Bayesian Classification in Machine Learning
19 pages
Understanding Bayesian Classifiers
No ratings yet
Understanding Bayesian Classifiers
58 pages
Introduction To Probabilistic Models
No ratings yet
Introduction To Probabilistic Models
52 pages
Bayesian Learning and Naïve Bayes Classifier
No ratings yet
Bayesian Learning and Naïve Bayes Classifier
35 pages
Naive Bayes Classifier Overview
No ratings yet
Naive Bayes Classifier Overview
13 pages
Bayesian Classification Methods Overview
No ratings yet
Bayesian Classification Methods Overview
21 pages
Unsupervised Learning Foundations
No ratings yet
Unsupervised Learning Foundations
9 pages
Naïve Bayes Classification Overview
No ratings yet
Naïve Bayes Classification Overview
29 pages
Bayesian Learning in Machine Learning
No ratings yet
Bayesian Learning in Machine Learning
20 pages
Machine Learning Algorithms Overview
No ratings yet
Machine Learning Algorithms Overview
33 pages
ML PDF
No ratings yet
ML PDF
151 pages
21-25 ML Assignment NB
No ratings yet
21-25 ML Assignment NB
37 pages
Learn Naive Bayes in Python & R
No ratings yet
Learn Naive Bayes in Python & R
6 pages
Naïve Bayes Classifiers Overview
No ratings yet
Naïve Bayes Classifiers Overview
21 pages
Gaussian Mean and Variance MAP Analysis
No ratings yet
Gaussian Mean and Variance MAP Analysis
36 pages
Understanding Logistic Regression Basics
No ratings yet
Understanding Logistic Regression Basics
36 pages
Naïve Bayes Classifier Overview
No ratings yet
Naïve Bayes Classifier Overview
80 pages
Bayesian Decision Theory Overview
No ratings yet
Bayesian Decision Theory Overview
56 pages
Naive Bayes Classifier Explained
No ratings yet
Naive Bayes Classifier Explained
6 pages
Supervised Learning-1
100% (1)
Supervised Learning-1
37 pages
Naive Bayes Algorithm Course Overview
No ratings yet
Naive Bayes Algorithm Course Overview
8 pages
Introduction to Probabilistic Models
No ratings yet
Introduction to Probabilistic Models
18 pages
ML IB 04 - 05 NaiveBayes
No ratings yet
ML IB 04 - 05 NaiveBayes
131 pages
Probabilistic Classification Methods
No ratings yet
Probabilistic Classification Methods
78 pages
Bayesian Learning in Machine Learning
No ratings yet
Bayesian Learning in Machine Learning
24 pages
Understanding Bayesian Classification Techniques
No ratings yet
Understanding Bayesian Classification Techniques
27 pages
Understanding Naive Bayes Algorithm
No ratings yet
Understanding Naive Bayes Algorithm
11 pages
Bayesian Classification Techniques
No ratings yet
Bayesian Classification Techniques
16 pages
Bayesian Learning and Probabilistic Models
No ratings yet
Bayesian Learning and Probabilistic Models
28 pages
Naive Bayes Classification Overview
No ratings yet
Naive Bayes Classification Overview
21 pages
Overview of Machine Learning Techniques
No ratings yet
Overview of Machine Learning Techniques
44 pages
Materi Naive Bayes
No ratings yet
Materi Naive Bayes
17 pages
Choosing Machine Learning Approaches
No ratings yet
Choosing Machine Learning Approaches
31 pages
Naïve Bayes Classification Explained
No ratings yet
Naïve Bayes Classification Explained
23 pages
Bayesian Learning and Probability Theory
No ratings yet
Bayesian Learning and Probability Theory
40 pages
Understanding Naïve Bayes Classifiers
No ratings yet
Understanding Naïve Bayes Classifiers
34 pages
Naive Bayes Classification Explained
No ratings yet
Naive Bayes Classification Explained
17 pages
Bayes Theorem in Medical Diagnosis
No ratings yet
Bayes Theorem in Medical Diagnosis
44 pages
RGPV CS-601 Machine Learning Notes
No ratings yet
RGPV CS-601 Machine Learning Notes
12 pages
Understanding Probability and Bayes' Theorem
No ratings yet
Understanding Probability and Bayes' Theorem
26 pages
Understanding Graphical Models in ML
No ratings yet
Understanding Graphical Models in ML
15 pages
Learning Models with Hidden Variables
No ratings yet
Learning Models with Hidden Variables
11 pages
Bayesian Approaches in Machine Learning
No ratings yet
Bayesian Approaches in Machine Learning
26 pages
Comprehensive Machine Learning Notes
No ratings yet
Comprehensive Machine Learning Notes
71 pages
Machine Learning: Logistic Regression & Classifiers
No ratings yet
Machine Learning: Logistic Regression & Classifiers
104 pages
Bayesian Statistics in Machine Learning
No ratings yet
Bayesian Statistics in Machine Learning
7 pages
NBayes Log Reg
No ratings yet
NBayes Log Reg
18 pages
Naïve Bayes Machine Learning Overview
No ratings yet
Naïve Bayes Machine Learning Overview
48 pages
Powerbridge Installation and Wiring Guide
No ratings yet
Powerbridge Installation and Wiring Guide
1 page
Quality Management in Tourism Strategy
No ratings yet
Quality Management in Tourism Strategy
26 pages
CBT for Music Performance Anxiety
No ratings yet
CBT for Music Performance Anxiety
33 pages
Components of Communicative Competence
No ratings yet
Components of Communicative Competence
18 pages
Account Statement: A/c XXXXXXX18
No ratings yet
Account Statement: A/c XXXXXXX18
2 pages
Air Compressor Trouble Shooting
No ratings yet
Air Compressor Trouble Shooting
24 pages
Multicultural Leadership in Minang-Nias
No ratings yet
Multicultural Leadership in Minang-Nias
14 pages
Buzz Retail India Pvt Ltd Invoice
No ratings yet
Buzz Retail India Pvt Ltd Invoice
3 pages
AutoCAD Applications in Electrical Design
No ratings yet
AutoCAD Applications in Electrical Design
5 pages
Grade 7 Math: Polygons Week 3 DLL
No ratings yet
Grade 7 Math: Polygons Week 3 DLL
6 pages
Class 10 Physics Practical: Light Refraction
No ratings yet
Class 10 Physics Practical: Light Refraction
2 pages
Understanding %lu in C Programming
No ratings yet
Understanding %lu in C Programming
50 pages
Understanding Safety and Hazards Basics
No ratings yet
Understanding Safety and Hazards Basics
23 pages
Common Fixed Points in Metric Spaces
No ratings yet
Common Fixed Points in Metric Spaces
7 pages
Delegate List for Adam Davie Academy
No ratings yet
Delegate List for Adam Davie Academy
10 pages
Punjabi University Admission Handbook 2017-18
No ratings yet
Punjabi University Admission Handbook 2017-18
224 pages
Halliburton SRP Back Pressure Valve Guide
No ratings yet
Halliburton SRP Back Pressure Valve Guide
24 pages
Category Overview and Performance Metrics
No ratings yet
Category Overview and Performance Metrics
6 pages
PS/2 Core for Altera DE-Series Boards
No ratings yet
PS/2 Core for Altera DE-Series Boards
8 pages
Wastewater Treatment Process Modeling
No ratings yet
Wastewater Treatment Process Modeling
7 pages
E - Mulazim Approved
No ratings yet
E - Mulazim Approved
3 pages
Arctic LNG 2 Exception Request RFC
No ratings yet
Arctic LNG 2 Exception Request RFC
3 pages
Neural Measurement Techniques Overview
No ratings yet
Neural Measurement Techniques Overview
26 pages
Opening The Door To Immortality - C H Harvey
100% (3)
Opening The Door To Immortality - C H Harvey
83 pages
Grade 4 Integrated Science Term 2 Plan
100% (1)
Grade 4 Integrated Science Term 2 Plan
6 pages
Andhra Pradesh Social Empowerment Fellowship 2019-20
No ratings yet
Andhra Pradesh Social Empowerment Fellowship 2019-20
8 pages
VMware vSAN 8 Training Course Details
No ratings yet
VMware vSAN 8 Training Course Details
3 pages
Hse+3704+curriculum+development+workbook++section+a +introduction Docx 2
No ratings yet
Hse+3704+curriculum+development+workbook++section+a +introduction Docx 2
14 pages
Impact of Management Methods on Firm Performance
No ratings yet
Impact of Management Methods on Firm Performance
17 pages
PUEHLER G Recycling Press Overview
No ratings yet
PUEHLER G Recycling Press Overview
8 pages

Chapter20 4e

Uploaded by

Chapter20 4e

Uploaded by

Artificial Intelligence: A Modern

Learning Probabilistic Models

♦ Learning with Complete Data

♦ Learning with Hidden Variables: The EM Algorithm

Chapter 20, Sections 1–3

Chapter 20, Sections 1–3

Then we observe candies drawn from some bag:

jth observation dj gives the outcome of random variable D j

where P (d|hi ) is called the likelihood

Suppose there are five kinds of bags of candies

Then we observe candies drawn from some bag:

Chapter 20, Sections 1–3

I.e., maximize P (d|hi)P (hi) or minimizing -log2 P (d|hi) - log2 P

Take the logarithm:

With complete data, parameters can be learned separately

Assuming Boolean variables, the parameters are

A deterministic prediction can be obtained by choosing the most likely class

Naive Bayes learning turns out to do surprisingly well in a wide range of

Maximize the conditional likelihood

Bayesian parameter learning

Hypothesis prior: Bayesian approach to parameter learning starts by defining

In the Bayesian view, is the (unknown) value of a random variable that

The hypothesis prior is just the prior distribution P().

if we don’t anything about the possible values of then we can use :

Each beta distribution is defined by two hyperparameters a and b such that

Bayesian parameter learning

Suppose we observe a cherry, then we have:

Bayesian parameter learning

Assuming Parameter independence

A Bayesian network that corresponds to a Bayesian learning process.

Density estimation with nonparametric models

• It is possible to learn a probability model without making any assumptions

• Use kernel functions

a) A 3D plot of the mixture of Gaussians

Density estimation using k-nearest-neighbors, for k= 3, 10, and 40 respectively.

Expectation-maximization helps with hidden variables

where N is the total number of data points.

Learning Bayesian networks with hidden variables

• A 1000 samples were generated from a mode with true parameters:

Learning Bayesian networks with hidden variables

using Bayes' rule and applying conditional independence

The expected count of cherry candies from bag 1 is given by

Learning Bayesian networks with hidden variables

Graphs showing the log likelihood of the data, L , as a function of the EM

Learning hidden Markov models

One application of EM involves learning the transition probabilities in hidden

A hidden Markov model can be represented by a dynamic Bayes net with a

Each data point consists of an observation sequence of finite length

transition probability from state i to state j,

Maximum a posteriori (MAP) learning selects a single most likely hypothesis

Maximum-likelihood learning simply selects the hypothesis that maximizes the

Naive Bayes learning is a particularly effective technique that scales

Nonparametric models represent a distribution using the collection of data

You might also like