0% found this document useful (0 votes)
7 views11 pages

Bayesian Learning and Decision Theory

The document discusses Bayesian learning, focusing on Bayesian Decision Theory (BDT) and its applications in machine learning, such as the Naïve Bayes Classifier and Bayesian Belief Networks. It explains key concepts like prior and posterior probabilities, the Bayes Optimal Classifier, and the EM Algorithm for learning with unobservable variables. The document emphasizes the practical utility of Bayesian methods in various domains, including medical diagnosis and text classification.

Uploaded by

code.alentech
Copyright
© All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
7 views11 pages

Bayesian Learning and Decision Theory

The document discusses Bayesian learning, focusing on Bayesian Decision Theory (BDT) and its applications in machine learning, such as the Naïve Bayes Classifier and Bayesian Belief Networks. It explains key concepts like prior and posterior probabilities, the Bayes Optimal Classifier, and the EM Algorithm for learning with unobservable variables. The document emphasizes the practical utility of Bayesian methods in various domains, including medical diagnosis and text classification.

Uploaded by

code.alentech
Copyright
© All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd

LESSON 7

BAYESIAN LEARNING

Bayesian learning algorithm is a method of calculating probabilities for hypothesis

It is one of the most practical approaches to certain type of learning problems

Bayesian Decision Theory

The BDTcame long before Version Spaces, Decision Tree Learning and Neural Networks. It was
studied in the field of Statistical Theory and more specifically, in the field of Pattern
Recognition.

Bayesian Decision Theory is at the basis of important learning schemes such as the Naïve Bayes
Classifier, Learning Bayesian Belief Networks and the EM (Expectation
Maximization)Algorithm.

The BDT is also useful as it provides a framework within which many non-Bayesian classifiers
can be studied.

Bayes Theorem

The above can be written as

Where:

P(h) = prior probability of hypothesis h.

P(D) = prior probability of training data D.

P(h|D) = probability of h given D.

P(D|h) = probability of D given h.


Goal: To determine the most probable hypothesis, given the data D plus any initial knowledge
about the prior probabilities of the various hypotheses in H.

P(h) - the prior probability of a hypothesis h. Reflects background knowledge; before data is
observed. If no information - uniform distribution.

P(D) - The probability that this sample of the Data is observed. (No knowledge of the
hypothesis)

P(D|h): The probability of observing the sample D, given hypothesis h

P(h|D): The posterior probability of h. The probability of h given that D has been observed.

Why Bayesian?

It provides practical learning algorithms for example the Naïve Bayes. This is because Prior
knowledge and observed data can be combined

It is also a generative (model based) approach, which offers a useful conceptual framework

- E.g. sequences could also be classified, based on a probabilistic model specification


- And, Any kind of objects can be classified, based on a probabilistic model specification

Bayes Optimal Classifier

One great advantage of Bayesian Decision Theory is that it gives a lower bound on the
classification error that can be obtained for a given problem.

Bayes Optimal Classification: The most probable classification of a new instance is obtained
by combining the predictions of all hypotheses, weighted by their posterior probabilities:

argmaxvjVhi HP(vh|hi)P(hi|D)

where V is the set of all the values a classification can take and vj is one possible such
classification.

Unfortunately, Bayes Optimal Classifier is usually too costly to apply! ==> Naïve Bayes
Classifier

Example
Does patient have cancer or not?

A patient takes a lab test and the result comes back positive. It is known that the test returns a
correct positive result in only 98% of the cases and a correct negative result in only 97% of the
cases. Furthermore, only 0.008 of the entire population has this disease.

1. What is the probability that this patient has cancer?

2. What is the probability that he does not have cancer?

3. What is the diagnosis?

MAP Learner

For each hypothesis h in H, calculate the posterior probability

Output the hypothesis hmap with the highest posterior probability

Choosing Hypotheses
Maximum Likelihood hypothesis:

Generally we want the most probable hypothesis given training data. This is the maximum a
posteriori hypothesis:

Useful observation: it does not depend on the denominator P(d)

Now we compute the diagnosis

To find the Maximum Likelihood hypothesis, we evaluate P(d|h) for the data d, which is the
positive lab test and chose the hypothesis (diagnosis) that maximises it:

To find the Maximum A Posteriori hypothesis, we evaluate P(d|h)P(h) for the data d, which is
the positive lab test and chose the hypothesis (diagnosis) that maximises it. This is the same as
choosing the hypotheses gives the higher posterior probability.

Some Results from the Analysis of Learners in a Bayesian Framework

If P(h)=1/|H| and if P(D|h)=1 if D is consistent with h, and 0 otherwise, then every hypothesis in
the version space resulting from D is a MAP hypothesis.

Under certain assumptions regarding noise in the data, minimizing the mean squared error (what
common neural nets do) corresponds to computing the maximum likelihood hypothesis.

When using a certain representation for hypotheses, choosing the smallest hypotheses
corresponds to choosing MAP hypotheses (An attempt at justifying Occam’s razor)
Naïve Bayes Classifier

An important, special and simple of a Bayes optimal classifier, where

– hypothesis = classification

– all attributes are independent given the class

All the attributes belong to the same class.

What can we do if our data d has several attributes?

Naïve Bayes assumption: Attributes that describe data instances are conditionally independent
given the classification hypothesis

It is a simplifying assumption, obviously it may be violated in reality and in spite of that, it


works well in practice

The Bayesian classifier that uses the Naïve Bayes assumption and computes the MAP hypothesis
is called Naïve Bayes classifier

One of the most practical learning methods

Has been successful applied in the following areas:

 Medical Diagnosis

 Text classification

Example

Given the following Training data (Play Tennis)


Solution by Naïve Bayes is to Classify any new datum instance x=(a1,…aT) as:

Based on training examples, we need to estimate the parameters from the training examples:

For each target value (hypothesis) h

For each attribute value at of each datum instance

Based on the examples in the table, classify the following datum x:

x=(Outl=Sunny, Temp=Cool, Hum=High, Wind=strong)

That means we need to determine either Play tennis or not?


Computation:

x’=(Outlook=Sunny,Temperature=Cool,Humidity=High, Wind=Strong)

P(Outlook=Sunny|Play=Yes) = 2/9

P(Temperature=Cool|Play=Yes) = 3/9

P(Huminity=High|Play=Yes) = 3/9

P(Wind=Strong|Play=Yes) = 3/9

P(Play=Yes) = 9/14

P(Outlook=Sunny|Play=No) = 3/5

P(Temperature=Cool|Play==No) = 1/5

P(Huminity=High|Play=No) = 4/5

P(Wind=Strong|Play=No) = 3/5
P(Play=No) = 5/14

P(Yes|x’)

≈[P(Sunny|Yes)P(Cool|Yes)P(High|Yes)P(Strong|Yes)] P(Play=Yes)

=2/9*3/9*3/9*3/9*9/14

= 0.0053

P(No|x’)

≈[P(Sunny|No)P(Cool|No)P(High|No)P(Strong|No)] P(Play=No)

=3/5*1/5*4/5*3/5*5/14

= 0.0206

Given the fact P(Yes|x’) < P(No|x’), then the new datum is label x’ to be “No-class”.

Naive Bayes Classifier is a simple but effective Bayesian classifier for vector data (i.e. data with
several attributes) that assumes that attributes are independent given the class.

Let each instance x of a training set D be described by a conjunction of n attribute values


<a1,a2,..,an> and let f(x), the target function, be such that f(x)  V, a finite set.

Bayesian Approach:

vMAP = argmaxvj V P(vj|a1,a2,..,an)

= argmaxvj V [P(a1,a2,..,an|vj) P(vj)/P(a1,a2,..,an)]

= argmaxvj V [P(a1,a2,..,an|vj) P(vj)


Naïve Bayesian Approach: We assume that the attribute values are conditionally independent
so that P(a1,a2,..,an|vj) =i P(a1|vj) [and not too large a data set is required.] Naïve Bayes
Classifier:

vNB = argmaxvj V P(vj) i P(ai|vj)

Bayesian Belief Networks

The Bayes Optimal Classifier is often too costly to apply.

The Naïve Bayes Classifier uses the conditional independence assumption to defray these costs.
However, in many cases, such an assumption is overly restrictive.

Bayesian belief networks provide an intermediate approach which allows stating conditional
independence assumptions that apply to subsets of the variable.

Conditional Independence

We say that X is conditionally independent of Y given Z if the probability distribution governing


X is independent of the value of Y given a value for Z.

i.e., (xi,yj,zk) P(X=xi|Y=yj,Z=zk)=P(X=xi|Z=zk) or, P(X|Y,Z)=P(X|Z)

This definition can be extended to sets of variables as well: we say that the set of variables
X1…Xl is conditionally independent of the set of variables Y1…Ym given the set of variables
Z1…Zn , if

P(X1…Xl|Y1…Ym,Z1…Zn(=P(X1…Xl|Z1…Zn)

Representation in Bayesian Belief Networks


Associated with each node is a conditional probability table, which specifies the conditional
distribution for the variable, given its immediate parents in the graph

Each node is asserted to be conditionally independent of its non-descendants, given its


immediate parents

Inference in Bayesian Belief Networks

A Bayesian Network can be used to compute the probability distribution for any subset of
network variables given the values or distributions for any subset of the remaining variables.

Unfortunately, exact inference of probabilities in general for an arbitrary Bayesian Network is


known to be NP-hard.

In theory, approximate techniques (such as Monte Carlo Methods) can also be NP-hard, though
in practice, many such methods were shown to be useful.

Learning Bayesian Belief Networks

3 Cases:

1. The network structure is given in advance and all the variables are fully observable in the
training examples. ==> Trivial Case: just estimate the conditional probabilities.

2. The network structure is given in advance but only some of the variables are observable in the
training data. ==> Similar to learning the weights for the hidden units of a Neural Net: Gradient
Ascent Procedure

3. The network structure is not known in advance. ==> Use a heuristic search or constraint-based
technique to search through potential structures.

The EM Algorithm: Learning with unobservable relevant variables.

Example:Assume that data points have been uniformly generated from k distinct Gaussian with
the same known variance. The problem is to output a hypothesis h=<1, 2 ,.., k> that
describes the means of each of the k distributions. In particular, we are looking for a maximum
likelihood hypothesis for these means.

We extend the problem description as follows: for each point xi, there are k hidden variables
zi1,..,zik such that zil=1 if xi was generated by normal distribution l and ziq= 0 for all ql.
An arbitrary initial hypothesis h=<1, 2 ,.., k> is chosen.

The EM Algorithm iterates over two steps:

Step 1 (Estimation, E): Calculate the expected value E[zij] of each hidden variable zij, assuming
that the current hypothesis h=<1, 2 ,.., k> holds.

Step 2 (Maximization, M): Calculate a new maximum likelihood hypothesis h’=<1’, 2’ ,..,
k’>, assuming the value taken on by each hidden variable zij is its expected value E[zij]
calculated in step 1. Then replace the hypothesis h=<1, 2 ,.., k> by the new hypothesis
h’=<1’, 2’ ,.., k’> and iterate.

The EM Algorithm can be applied to more general problems

You might also like