0% found this document useful (0 votes)
12 views59 pages

Bayesian Learning in Machine Learning

Module 4 covers Bayesian Learning, introducing Bayes theorem and its application in machine learning, including concepts like Maximum a Posteriori (MAP) and Maximum Likelihood (ML) hypotheses. It discusses the advantages and challenges of Bayesian methods, such as the need for prior probabilities and computational costs. The module also explores practical examples, including medical diagnosis and probabilistic predictions in various scenarios.

Uploaded by

nidhimaa1820
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
12 views59 pages

Bayesian Learning in Machine Learning

Module 4 covers Bayesian Learning, introducing Bayes theorem and its application in machine learning, including concepts like Maximum a Posteriori (MAP) and Maximum Likelihood (ML) hypotheses. It discusses the advantages and challenges of Bayesian methods, such as the need for prior probabilities and computational costs. The module also explores practical examples, including medical diagnosis and probabilistic predictions in various scenarios.

Uploaded by

nidhimaa1820
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd

ARTIFICIAL INTELLIGENCE

AND MACHINE LEARNING


18CS71

Module 4
Module 4

Bayesian Learning: Introduction, Bayes theorem,


Bayes theorem and concept learning, ML
and LS error hypothesis, ML for predicting, MDL
principle, Bates optimal classifier, Gibbs
algorithm, Navie Bayes classifier, BBN, EM Algorithm
Texbook1: Chapter 6
Bayesian Learning: Introduction
• Bayesian methods provide the basis for probabilistic learning methods that
accommodate (and require) knowledge about the prior probabilities of
alternative hypotheses and about the probability of observing various data given
the hypothesis.
• Bayesian methods allow assigning a posterior probability to each candidate
hypothesis, based on these assumed priors and the observed data.
Bayesian learning methods are relevant to study of machine learning for two
different reasons.
• First,Bayesian learning algorithms that calculate explicit probabilities for
hypotheses, such as the naive Bayes classifier, are among the most practical
approaches to certain types of learning problems
• The second reason is that they provide a useful perspective for
understanding many learning algorithms that do not explicitly manipulate
probabilities.
Features of Bayesian Learning Methods
• Each observed training example can incrementally decrease or increase the
estimated probability that a hypothesis is correct. This provides a more flexible
approach to learning than algorithms that completely eliminate a hypothesis if it
is found to be inconsistent with any single example
• Prior knowledge can be combined with observed data to determine the final
probability of a hypothesis. In Bayesian learning, prior knowledge is provided by
asserting (1) a prior probability for each candidate hypothesis, and (2) a
probability distribution over observed data for each possible hypothesis.
• Bayesian methods can accommodate hypotheses that make probabilistic
predictions
• New instances can be classified by combining the predictions of multiple
hypotheses, weighted by their probabilities.
• Even in cases where Bayesian methods prove computationally intractable, they
can provide a standard of optimal decision making against which other practical
methods can be measured.
Practical difficulty in applying Bayesian methods

• One practical difficulty in applying Bayesian methods is that they


typically require initial knowledge of many probabilities. When
these probabilities are not known in advance they are often
estimated based on background knowledge, previously available
data, and assumptions about the form of the underlying
distributions.
• A second practical difficulty is the significant computational cost
required to determine the Bayes optimal hypothesis in the general
case. In certain specialized situations, this computational cost can
be significantly reduced.
BAYES THEOREM
• Bayes theorem provides a way to calculate the probability of a
hypothesis based on its prior probability, the probabilities of
observing various data given the hypothesis, and the observed data
itself.
• Notations
• P(h) prior probability of h, reflects any background knowledge
about the chance that h is correct
• P(D) prior probability of D, probability that D will be observed
• P(D|h) probability of observing D given a world in which h holds
• P(h|D) posterior probability of h, reflects confidence that h holds
after D has been observed
Bayes theorem is the cornerstone of Bayesian learning methods
because it provides a way to calculate the posterior probability
P(h|D), from the prior probability P(h), together with P(D) and
P(D(h).

P(h|D) increases with P(h) and with P(D|h) according to Bayes


theorem.
P(h|D) decreases as P(D) increases, because the more probable it is
that D will be observed independent of h, the less evidence D
provides in support of h.
Maximum a Posteriori (MAP) Hypothesis
• In many learning scenarios, the learner considers some set of candidate
hypotheses H and is interested in finding the most probable hypothesis h ∈
H given the observed data D. Any such maximally probable hypothesis is
called a maximum a posteriori (MAP) hypothesis.
• Bayes theorem to calculate the posterior probability of each candidate
hypothesis is hMAP is a MAP hypothesis provided

• P(D) can be dropped, because it is a constant independent of h


Maximum Likelihood (ML) Hypothesis

In some cases, it is assumed that every hypothesis in H is equally probable a priori


(P(hi) = P(hj) for all hi and hj in H).
In this case the below equation can be simplified and need only consider the term
P(D|h) to find the most probable hypothesis.

P(D|h) is often called the likelihood of the data D given h, and any hypothesis that
maximizes P(D|h) is called a maximum likelihood (ML) hypothesis
The avaiiable data is from a particular laboratory test with two
possible outcomes: $ (positive) and 8 (negative). We have prior
knowledge that over the entire population of people only .008 have
this disease. Furthermore, the lab test is only an imperfect indicator
of the disease. The test returns a correct positive result in only 98% of
the cases in which the disease is actually present and a correct
negative result in only 97% of the cases in which the disease is not
present. In other cases, the test returns the opposite result.

Suppose we now observe a new patient for whom the lab test returns
a positive result. Should we diagnose the patient as having cancer or
not?
Example
Consider a medical diagnosis problem in which there are two alternative hypotheses
• The patient has a particular form of cancer (denoted by cancer)
• The patient does not (denoted by ¬ cancer)

The available data is from a particular laboratory with two possible outcomes: +
(positive) and - (negative)
• Suppose a new patient is observed for whom the lab test returns a positive (+)
result.
• Should we diagnose the patient as having cancer or not?
Chance that Team 0 wins: (Y0)→95%
Probability that Team 0 wins: P(Y0)→0.95
Probability that Team 1 wins: P(Y1)→(1-0.95)= 0.05

Probability that Team 1 hosts the match & wins: (75%)-P(X1|Y1)→0.75


Probability that Team 1 hosts the match & Team 0 wins: (30%)
→P(X1|Y0)→0.30
To find: P(Y0|X1) =?
P(Y1|X1) =? Using Bayes Theorem
P(Y1|X1) = (P(X1|Y1) * P (Y1)) / P(X1)
= (0.75 * 0.05) / (P(X1|Y1) * P(Y1) + P(X1|Y0) * P(Y0))
= (0.75 * 0.05) / ( 0.75 * 0.05) + (0.30 * 0.95)
= 0.1162

P(Y0|X1) = 1- 0.1162 = 0.8838

Team 0 is more probable of Winning than team 1


Another Method:
P(Y1|X1) = (P(X1|Y1) * P (Y1))
= 0.0375
P(Y0|X1) = (P(X1|Y0) * P (Y0))
= 0.285

Team 0 is more probable of Winning than team 1


Let us assume dangerous fire are rare
(1%) but smoke is fairly common (10%)
due to barbeques and 90% of the
dangerous fires makes smoke. Find the
probability of dangerous fire when there
is smoke
Let us assume dangerous fire are rare (1%) but smoke is fairly
common (10%) due to barbeques and 90% of the dangerous
fires makes smoke. Find the probability of dangerous fire
when there is smoke
BAYES T H EO R E M A N D C O N C E P T L EA R N I N G

What is the relationship between Bayes theorem and


the problem of concept learning?

Since Bayes theorem provides a principled way to calculate


the posterior probability of each hypothesis given the
training data, and can use it as the basis for a straightforward
learning algorithm that calculates the probability for each
possible hypothesis, then outputs the most probable.
Brute-Force Bayes Concept Learning

We can design a straightforward concept learning algorithm to output the maximum


a posteriori hypothesis, based on Bayes theorem, as follows:
In order specify a learning problem for the BRUTE-FORCE MAP LEARNING
algorithm we must specify what values are to be used for P(h) and for P(D|h) ?

Lets choose P(h) and for P(D|h) to be consistent with the following assumptions:
• The training data D is noise free (i.e., di = c(xi))
• The target concept c is contained in the hypothesis space H
• We have no a priori reason to believe that any hypothesis is more probable than any other.
What values should we specify for P(h)?
• Given no prior knowledge that one hypothesis is more likely than another, it is
reasonable to assign the same prior probability to every hypothesis h in H.
• Assume the target concept is contained in H and require that these prior
probabilities sum to 1.
What choice shall we make for P(D|h)?

• P(D|h) is the probability of observing the target values D = (d1 . . .dm) for the
fixed set of instances (x1 . . . xm), given a world in which hypothesis h holds
• Since we assume noise-free training data, the probability of observing
classification di given h is just 1 if di = h(xi) and 0 if di # h(xi). Therefore,
Given these choices for P(h) and for P(D|h) we now have a fully-defined
problem for the above BRUTE-FORCE MAP LEARNING algorithm.

In a first step, we have to determine the probabilities for P(h|D)


To summarize, Bayes theorem implies that the posterior probability P(h|D)
under our assumed P(h) and P(D|h) is

where |VSH,D| is the number of hypotheses from H consistent with D


The below figure shows the evaluation of posterior probabilities. Initially all
hypotheses have the same probability (in figure 6.1 a). As training data
accumulates (Figures 6.1 b and 6. lc), the posterior probability for inconsistent
hypotheses becomes zero while the total probability summing to one is
shared equally among the remaining consistent hypotheses. The above
analysis implies that under our choice for P(h) and P(Dlh), every consistent
hypothesis has posterior probability (1 /I V SH, I), and every inconsistent
hypothesis has posterior probability 0. Every consistent hypothesis is,
therefore, a MAP hypothesis.
MAP Hypotheses and Consistent Learners

• Learning algorithm is a consistent learner provided it outputs a hypothesis


that commits zero errors over the training examples.
• Using Bayes theorem and posterior probability, we can conclude that every
consistent learner outputs a MAP hypothesis, if we assume a uniform prior
probability distribution over H (i.e., P(hi) = P(hj) for all i, j), and if we assume
deterministic, noise free training data (i.e., P(D Ih) = 1 if D and h are
consistent, and 0 otherwise).
• SECOND IA PORTIONS WILL START FROM NEXT SLIDE
MAXIMUM LIKELIHOOD AND LEAST-SQUARED
ERROR HYPOTHESES

Consider the problem of learning a continuous-valued target function such as neural


network learning, linear regression, and polynomial curve fitting

A straightforward Bayesian analysis will show that under certain assumptions any
learning algorithm that minimizes the squared error between the output hypothesis
predictions and the training data will output a maximum likelihood (ML) hypothesis
Learning A Continuous-Valued Target Function

• Learner L considers an instance space X and a hypothesis space H consisting of some class of real-
valued functions defined over X, i.e., (∀ h ∈ H)[ h : X → R] and training examples of the form
<xi,di>
• The problem faced by L is to learn an unknown target function f : X → R
• A set of m training examples is provided, where the target value of each example is corrupted by
random noise drawn according to a Normal probability distribution with zero mean (di = f(xi) + ei)
• Each training example is a pair of the form (xi ,di ) where di = f (xi ) + ei .
– Here f(xi) is the noise-free value of the target function and ei is a random variable representing
the noise.
–It is assumed that the values of the ei are drawn independently and that they are distributed
according to a Normal distribution with zero mean.
• The task of the learner is to output a maximum likelihood hypothesis, or, equivalently, a MAP
hypothesis assuming all hypotheses are equally probable a priori.
Learning A Linear Function

• The target function f corresponds to the solid


line.
• The training examples (xi , di ) are assumed to
have Normally distributed noise ei with zero
mean added to the true target value f (xi ).
• The dashed line corresponds to the hypothesis
hML with least-squared training error, hence the
maximum likelihood hypothesis.
• Notice that the maximum likelihood hypothesis is
not necessarily identical to the correct
hypothesis, f, because it is inferred from only a
limited sample of noisy training data
Normal Probability Distribution (Gaussian Distribution)

A Normal distribution is a smooth, bell-shaped distribution that can be completely


characterized by its mean μ and its standard deviation σ
Using the previous
definition of h we have ML

Assuming training examples are mutually independent given h, we can write P(D|h) as the product of
the various (di|h)

Given the noise ei obeys a Normal distribution with zero mean and unknown variance σ2 , each di
must also obey a Normal distribution around the true targetvalue f(xi). Because we are writing the
expression for P(D|h), we assume h is the correct description of f. Hence, µ = f(xi) = h(xi)
MAXIMUM LIKELIHOOD HYPOTHESES FOR
PREDICTING PROBABILITIES
Consider the setting in which we wish to learn a nondeterministic (probabilistic)
function f : X → {0, 1}, which has two discrete output values.

We want a function approximator whose output is the probability that f(x) = 1


In other words , learn the target function
f ’ : X → [0, 1] such that f ’ (x) = P(f(x) = 1)

How can we learn f' using a neural network?


Use of brute force way would be to first collect the observed frequencies of 1's and
0's for each possible value of x and to then train the neural network to output the
target frequency for each x.

Deepak D, Asst. Prof., Dept. of CSE, Canara Engg. College 41


What criterion should we optimize in order to find a maximum likelihood
hypothesis for f' in this setting?

• First obtain an expression for P(D|h)


• Assume the training data D is of the form D = {(x , d ) . . . (x
1 1 m, dm)}, where di is the
observed 0 or 1 value for f (xi).
• Both x and d as random variables, and assuming that each training example is
i i
drawn independently, we can write P(D|h) as

Applying the product rule


The probability P(di|h, xi)

Re-express it in a more mathematically manipulable form, as

Equation (4) to substitute for P(di |h, xi) in Equation (5) to obtain
We write an expression for the maximum likelihood hypothesis

The last term is a constant independent of h, so it can be dropped

It easier to work with the log of the likelihood, yielding

Equation (7) describes the quantity that must be maximized in order to obtain the maximum
likelihood hypothesis in our current problem setting
Gradient Search to Maximize Likelihood in a Neural Net
Derive a weight-training rule for neural network learning that seeks to maximize G(h, D) using
gradient ascent
• The gradient of G(h, D) is given by the vector of partial derivatives of G(h, D) with respect to the
various network weights that define the hypothesis h represented by the learned network
• In this case, the partial derivative of G(h, D) with respect to weight wjk from input k to unit j is
Suppose our neural network is constructed from a single layer of sigmoid units. Then,

where xijk is the kth input to unit j for the ith training example, and d(x) is the derivative of the sigmoid
squashing function.
Finally, substituting this expression into Equation (1), we obtain a simple expression for the
derivatives that constitute the gradient
Because we seek to maximize rather than minimize P(D|h),
we perform gradient ascent rather than gradient descent
search. On each iteration of the search the weight vector is
adjusted in the direction of the gradient, using the weight
update rule

where η is a small positive constant that determines the step size of the i gradient ascent search
It is interesting to compare this weight-update rule to the weight-update rule used by the
BACKPROPAGATION algorithm to minimize the sum of squared errors between predicted and
observed network outputs.
The BACKPROPAGATION update rule for output unit weights, re-expressed using our current
notation, is
MINIMUM DESCRIPTION LENGTH PRINCIPLE
• A Bayesian perspective on Occam’s razor
• Motivated by interpreting the definition of hMAP in the light of basic concepts from information
theory.

which can be equivalently expressed in terms of maximizing the log2

or alternatively, minimizing the negative of this quantity

• This equation can be interpreted as a statement that short hypotheses are preferred, assuming a
particular representation scheme for encoding hypotheses and data
• Consider the problem of designing a code to transmit messages drawn at random
• i is the message
• The probability of encountering message i is pi
• Interested in the most compact code; that is, interested in the code that minimizes
the
expected number of bits we must transmit in order to encode a message drawn at
random
• To minimize the expected code length we should assign shorter codes to messages
that are more probable
• Shannon and Weaver (1949) showed that the optimal code (i.e., the code that
minimizes the expected message length) assigns - log, pi bitst to encode message
i.
• The number of bits required to encode message i using code C as the description
length
of message i with respect to C, which we denote by Lc(i).
Interpreting the equation

• -log2P(h): the description length of h under the optimal encoding for the hypothesis space H
LCH (h) = −log2P(h), where CH is the optimal code for hypothesis space H.
• -log2P(D | h): the description length of the training data D given hypothesis h, under the
optimal encoding fro the hypothesis space H: LCH (D|h) = −log2P(D| h) , where C D|h is the
optimal code for describing data D assuming that both the sender and receiver know the
hypothesis h.

Rewrite Equation (1) to show that hMAP is the hypothesis h that minimizes the sum given by the
description length of the hypothesis plus the description length of the data given the hypothesis.

where CH and CD|h are the optimal encodings for H and for D given h
The Minimum Description Length (MDL) principle recommends choosing the hypothesis that
minimizes the sum of these two description lengths of equ.

Minimum Description Length principle:

Where, codes C1 and C2 to represent the hypothesis and the data given the hypothesis

The above analysis shows that if we choose C1 to be the optimal encoding of hypotheses CH, and if
we choose C2 to be the optimal encoding CD|h, then hMDL = hMAP
Application to Decision Tree Learning
Apply the MDL principle to the problem of learning decision trees from some training data.
What should we choose for the representations C1 and C2 of hypotheses and data?
• For C1: C1 might be some obvious encoding, in which the description length grows with the
number of nodes and with the number of edges
• For C2: Suppose that the sequence of instances (x1 . . .xm) is already known to both the transmitter
and receiver, so that we need only transmit the classifications (f (x1) . . . f (xm)).
Now if the training classifications (f (x1) . . .f(xm)) are identical to the predictions of the
hypothesis, then there is no need to transmit any information about these examples. The
description length of the classifications given the hypothesis ZERO
If examples are misclassified by h, then for each misclassification we need to transmit a message
that identifies which example is misclassified as well as its correct classification
The hypothesis hMDL under the encoding C1 and C2 is just the one that minimizes the sum of these
description lengths.
• MDL principle provides a way for trading off hypothesis complexity for the number of errors
committed by the hypothesis
• MDL provides a way to deal with the issue of overfitting the data.
• Short imperfect hypothesis may be selected over a long perfect hypothesis.
BAYES OPTIMAL CLASSIFIER

• The Bayes optimal classifier combines the predictions of all alternative


hypotheses, weighted by their posterior probabilities, to calculate the most
probable classification of each new instance
BAYES OPTIMAL CLASSIFIER
GIBBS ALGORITHM
THE EM ALGORITHM

• The EM algorithm (Dempster et al. 1977), a widely used


approach to learning in the presence of unobserved
variables.
• The EM algorithm can be used even for variables whose
value is never directly observed, provided the general
form of the probability distribution governing these
variables is known.
• Estimating Means of k Gaussians
• Consider a problem in which the data D is a set of
instances generated by a probability distribution that is
a mixture of k distinct Normal distributions.
The above fig is for the case where k = 2 and where the instances are the
points shown along the x axis.
Each instance is generated using a two-step process.
First, one of the k Normal distributions is selected at random.
Second, a single random instance xi is generated according to this
selected distribution.
This process is repeated to generate a set of data points as shown in the
figure.
• To simplify our discussion, we consider the special case where the selection of the
single Normal distribution at each step is based on choosing each with uniform
probability, where each of the k Normal distributions has the same variance a2, and
where a2 is known. The learning task is to output a hypothesis h = (FI, . . . pk) that
describes the means of each of the k distributions.

You might also like