
Understanding Bayesian Learning Methods

Bayesian Learning methods in machine learning utilize explicit probabilities for hypotheses, making them practical for various learning problems, such as classification. The Naïve Bayes Classifier, a supervised learning algorithm based on Bayes' theorem, assumes feature independence and is effective in text classification. However, it faces challenges like the zero-frequency problem and the violation of independence assumptions, which can be addressed through techniques like Laplace smoothing.

Bayesian Learning

Why use Bayesian learning methods in machine learning?

- Bayesian learning algorithms calculate explicit probabilities for hypotheses.

- These methods are among the most practical approaches to certain types of learning problems.
  E.g., a Bayes classifier calculates the probability of each hypothesis.

- They provide a useful perspective for understanding many learning algorithms that do not explicitly manipulate probabilities.
Features of Bayesian Learning
• Instead of eliminating a hypothesis outright, each training example incrementally increases or decreases the estimated probability that the hypothesis is correct.

• The method provides hypotheses that make probabilistic predictions.

• New instances can be classified by combining the predictions of multiple hypotheses, weighted by their probabilities.
Features of Bayesian Learning
• Even when Bayesian methods prove computationally intractable, they provide a standard of decision making against which other practical methods can be measured.

• Prior knowledge can be combined with observed data to determine the final probability of a hypothesis.

• Bayesian methods can accommodate hypotheses that make probabilistic predictions.
Issues in Bayesian Methods

1. They require initial knowledge of many probabilities.

When these are not known in advance, they are estimated from background knowledge, previously available data, and assumptions about the form of the underlying distribution.

2. The computational cost of determining the Bayes optimal hypothesis is high.
- In some special cases, this cost can be significantly reduced.
What is Bayes' Theorem?

• In probability theory, it relates the conditional and marginal probabilities of two events.

• It determines the conditional probability of an event A given that event B has already occurred.

• Bayes' theorem calculates the probability of a hypothesis given the observed evidence.

• Bayes' theorem allows us to update the predicted probability of an event by incorporating new information.
Bayes Theorem
• Determines the best hypothesis from some space H, i.e., the one with the highest probability given the observed training data and the initial knowledge about the dataset.

• Calculates the probability of a hypothesis based on its prior probability, the probabilities of observing various data given the hypothesis, and the observed data itself.
Bayes Theorem
• Given a hypothesis h and data D that bears on the hypothesis:
$$P(h \mid D) = \frac{P(D \mid h)\, P(h)}{P(D)}$$
• P(h): prior probability of h. The probability that h is true before we have observed the training data.

• P(D): marginal probability of D, also called the probability of the evidence. The probability of observing D regardless of which hypothesis holds.

• P(D|h): likelihood. The conditional probability of observing data D given that hypothesis h holds.

• P(h|D): posterior probability. The conditional probability that h is true after we have seen the training data D.
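As a quick numeric illustration of the formula, here is a minimal Python sketch for a binary hypothesis. The function name `posterior` and all three input numbers are assumptions made up for illustration; they do not come from these slides. P(D) is obtained from the law of total probability over h and not-h.

```python
def posterior(prior_h, likelihood_d_given_h, likelihood_d_given_not_h):
    """Return P(h | D) for a binary hypothesis using Bayes' theorem."""
    # P(D) = P(D | h) P(h) + P(D | not h) P(not h)
    evidence = (likelihood_d_given_h * prior_h
                + likelihood_d_given_not_h * (1.0 - prior_h))
    return likelihood_d_given_h * prior_h / evidence

# Hypothetical numbers, chosen only for illustration.
p = posterior(prior_h=0.01,
              likelihood_d_given_h=0.9,
              likelihood_d_given_not_h=0.05)
print(round(p, 4))  # 0.1538
```

Even with a high likelihood P(D|h), a small prior keeps the posterior modest, which is the kind of update the theorem formalizes.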
Maximum A Posteriori Hypothesis
• Given a set of candidate hypotheses H and observed training data D, we are interested in finding the most probable hypothesis h ∈ H.
• Any such maximally probable hypothesis is called a maximum a posteriori (MAP) hypothesis.
• Using Bayes' theorem, we can compute the MAP hypothesis for the data.
$$
\begin{aligned}
h_{MAP} &\equiv \arg\max_{h \in H} P(h \mid D) \\
&= \arg\max_{h \in H} \frac{P(D \mid h)\, P(h)}{P(D)} \\
&= \arg\max_{h \in H} P(D \mid h)\, P(h)
\end{aligned}
$$

P(D) is dropped because it is a constant that does not depend on h.

H: the set of all hypotheses.
Maximum Likelihood

• Maximum likelihood is a special case of MAP in which all candidate hypotheses have the same prior probability.

• Assume that all hypotheses are equally probable a priori, i.e., P(hi) = P(hj) for all hi, hj in H.

• This is called assuming a uniform prior. It simplifies the MAP computation to:
$$h_{ML} = \arg\max_{h \in H} P(D \mid h)$$

• P(h) is dropped because it is the same constant for every hypothesis.

• The hypothesis that maximizes P(D|h) is called the maximum likelihood hypothesis.
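The difference between the MAP and maximum likelihood hypotheses can be sketched in a few lines of Python. The hypothesis names, priors, and likelihoods below are hypothetical values chosen so that the two criteria disagree; `map_hypothesis` and `ml_hypothesis` are illustrative helper names, not part of any library.

```python
def map_hypothesis(priors, likelihoods):
    """argmax over h of P(D | h) * P(h); P(D) is omitted because it is constant in h."""
    return max(priors, key=lambda h: likelihoods[h] * priors[h])

def ml_hypothesis(likelihoods):
    """argmax over h of P(D | h), i.e. MAP under a uniform prior."""
    return max(likelihoods, key=likelihoods.get)

# Hypothetical hypothesis space with made-up probabilities.
priors = {"h1": 0.7, "h2": 0.2, "h3": 0.1}        # P(h)
likelihoods = {"h1": 0.1, "h2": 0.3, "h3": 0.5}   # P(D | h)

h_map = map_hypothesis(priors, likelihoods)
h_ml = ml_hypothesis(likelihoods)
print(h_map, h_ml)  # h1 h3
```

Here the strong prior on h1 outweighs its low likelihood, so MAP picks h1 while maximum likelihood picks h3. With a uniform prior the two functions return the same hypothesis.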
Desirable Properties of Bayes Classifier

• Incrementality: with each training example, the prior and the likelihood can be updated dynamically, which makes the classifier flexible and robust to errors.

• Combines prior knowledge and observed data: the prior probability of a hypothesis is multiplied by the probability of the training data given the hypothesis.

• Probabilistic hypotheses: it outputs not only a classification but also a probability distribution over all classes.
Naïve Bayes Classifier Algorithm

• It is a supervised learning algorithm that is based on Bayes' theorem and is used for solving classification problems.

• It is mainly used in text classification, which involves high-dimensional training datasets.

• It is a probabilistic classifier, which means it predicts on the basis of the probability that an object belongs to each class.
Naïve Bayes Classifier Algorithm

The name of the Naïve Bayes algorithm combines two words, Naïve and Bayes, which can be described as follows:
• Naïve: It is called Naïve because it assumes that the occurrence of a certain feature is independent of the occurrence of other features.
• Bayes: It is called Bayes because it depends on the principle of Bayes' theorem.
Naïve Bayes Classifier Example
Suppose we have a dataset of animal details and a corresponding target variable “Pet”. Classify pet and non-pet animals using the given feature set.
Naïve Bayes Classifier Example
Assumptions of Naive Bayes
• All the variables are independent.
That is, knowing that the animal is a Dog tells us nothing about whether “Size” is Medium or Small.

• All the predictors have an equal effect on the outcome.
That is, the animal being a dog does not carry more importance in deciding whether we can pet it. All the features have equal importance.
Naïve Bayes Classifier Example
To solve this problem, we follow these steps:
• Convert the given dataset into frequency tables.
• Generate the likelihood table by computing the conditional probabilities of the given features.
• Use Bayes' theorem to calculate the posterior probability.
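The three steps can be sketched in plain Python. The six rows below are a hypothetical toy dataset invented for this sketch; they are not the 14-row table used in the example, so the resulting numbers will differ from it. The helper names `likelihood` and `score` are likewise illustrative.

```python
from collections import Counter, defaultdict

# Hypothetical toy data (animal, size, color, pet); NOT the example's table.
rows = [
    ("Dog", "Medium", "Black", "Yes"),
    ("Dog", "Big", "White", "No"),
    ("Cow", "Big", "White", "Yes"),
    ("Cow", "Medium", "Black", "Yes"),
    ("Rat", "Small", "Brown", "No"),
    ("Rat", "Small", "Black", "No"),
]

# Step 1: frequency tables, per class and per (feature index, class).
class_counts = Counter(row[-1] for row in rows)
feature_counts = defaultdict(Counter)
for *features, label in rows:
    for i, value in enumerate(features):
        feature_counts[(i, label)][value] += 1

# Step 2: likelihood table entry P(x_i = value | y).
def likelihood(i, value, label):
    return feature_counts[(i, label)][value] / class_counts[label]

# Step 3: unnormalized posterior P(y) * prod_i P(x_i | y).
def score(instance, label):
    result = class_counts[label] / len(rows)
    for i, value in enumerate(instance):
        result *= likelihood(i, value, label)
    return result

instance = ("Cow", "Medium", "Black")
scores = {label: score(instance, label) for label in class_counts}
prediction = max(scores, key=scores.get)
print(scores, prediction)
```

In this toy data no "Cow" row has the label "No", so the "No" score is exactly zero. That is the zero-frequency problem discussed later in the deck.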
Naïve Bayes Classifier Example
We need to find P(xi|yj) for each xi in X and each yj in Y.
Naïve Bayes Classifier Example
We also need the probabilities (P(y)), which are calculated in
the table below. For example, P(Pet Animal = NO) = 6/14.
Naïve Bayes Classifier Example
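Now suppose the test instance is (Cow, Medium, Black).

Probability of petting the animal:
$$P(\text{Yes} \mid \text{Test}) \propto P(\text{Cow} \mid \text{Yes})\, P(\text{Medium} \mid \text{Yes})\, P(\text{Black} \mid \text{Yes})\, P(\text{Yes})$$

And the probability of not petting the animal:
$$P(\text{No} \mid \text{Test}) \propto P(\text{Cow} \mid \text{No})\, P(\text{Medium} \mid \text{No})\, P(\text{Black} \mid \text{No})\, P(\text{No})$$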
Naïve Bayes Classifier Example
According to the axioms of probability,
P(Yes|Test) + P(No|Test) = 1
so we normalize the two results by dividing each by their sum.

We see that P(Yes|Test) > P(No|Test), so the prediction for whether we can pet this animal is “Yes”.
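The normalization step amounts to dividing each class score by the sum of all class scores. A minimal sketch follows; the raw scores are hypothetical placeholders, not the values computed in the example, and `normalize` is an illustrative helper name.

```python
def normalize(scores):
    """Scale unnormalized class scores so that they sum to 1."""
    total = sum(scores.values())
    return {label: value / total for label, value in scores.items()}

# Hypothetical unnormalized scores P(y) * prod_i P(x_i | y).
raw = {"Yes": 0.03, "No": 0.01}
probs = normalize(raw)
prediction = max(probs, key=probs.get)
print(probs, prediction)
```

Normalization does not change which class has the largest score, so it matters only when calibrated probabilities are needed and not just the predicted label.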
Advantages of Naïve Bayes Classifier

• It does not require large amounts of training data.
• It can be used for binary as well as multi-class classification.
• It often performs well in multi-class prediction compared with other algorithms.
• It converges more quickly than discriminative models.
• It scales well with the number of data points and predictors.
• It can handle both continuous and categorical data.
• It is not sensitive to irrelevant features, and it often works well even when the independence assumption it makes does not hold.
Disadvantages of Naïve Bayes Classifier

• The Naive Bayes algorithm has trouble with the ‘zero-frequency problem’: it assigns zero probability to a categorical value that appears in the test data but was not observed in the training dataset. Applying a smoothing method overcomes this problem.

• It assumes that all features are independent or unrelated, so it cannot learn relationships between features.

• Dependencies among the features cannot be modeled by a Naive Bayesian classifier.
Relevant Issues

• Violation of Independence Assumption


• Zero conditional probability problem
Violation of Independence Assumption
• Naive Bayesian classifiers assume that the effect of an attribute value on a given class is independent of the values of the other attributes.

• This assumption is called class conditional independence.

• It is made to simplify the computations involved and, in this sense, is considered “naive.”
Improvement
• Bayesian belief networks are graphical models which, unlike naive Bayesian classifiers, allow the representation of dependencies among subsets of attributes.

• Bayesian belief networks can also be used for classification.


Zero Conditional Probability Problem
• If a given class and feature value never occur together in the training set, then the frequency-based probability estimate will be zero.

• This is problematic since it will wipe out all information in the other probabilities when they are multiplied.

• It is therefore often desirable to incorporate a small-sample correction in all probability estimates such that no probability is ever set to be exactly zero.
Correction
• To eliminate zeros, we use add-one or Laplace smoothing,
which simply adds 1 to each count.
Example
Suppose,
• for the class buys_computer = yes in some training database, D, containing 1000 tuples,
• we have 0 tuples with income = low,
• 990 tuples with income = medium, and
• 10 tuples with income = high.
• The probabilities of these events, without the Laplacian correction, are 0, 0.990 (from 990/1000), and 0.010 (from 10/1000), respectively.
Example
• Using the Laplacian correction for the three quantities, we pretend that we have 1 more tuple for each income-value pair. In this way, we instead obtain the following probabilities: 1/1003 = 0.001, 991/1003 = 0.988, and 11/1003 = 0.011.

• The “corrected” probability estimates are close to their “uncorrected” counterparts, yet the zero probability value is avoided.
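The add-one correction for this example can be checked with a short Python sketch. The counts are the ones stated in the example; the helper name `laplace_estimates` and its `alpha` parameter (which generalizes "add 1" to "add alpha") are assumptions introduced here for illustration.

```python
def laplace_estimates(counts, alpha=1):
    """Add alpha to every count, then renormalize over all values."""
    total = sum(counts.values()) + alpha * len(counts)
    return {value: (count + alpha) / total for value, count in counts.items()}

# Income counts among the 1000 tuples of the class in the example.
income_counts = {"low": 0, "medium": 990, "high": 10}
smoothed = laplace_estimates(income_counts)
for value, p in smoothed.items():
    print(value, round(p, 3))
# low 0.001
# medium 0.988
# high 0.011
```

The denominator grows by the number of distinct values (here 3), which is why the corrected estimates still sum to 1.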
Types of Naïve Bayes Model

Optimal Naive Bayes


• Optimal Naive Bayes selects the class that has the greatest posterior probability.
• As the name suggests, it is optimal, but it goes through all the possibilities, which is very slow and time-consuming.

Gaussian Naive Bayes

• It is a straightforward algorithm used when the attributes are continuous.
• The attributes are assumed to follow a Gaussian (normal) distribution within each class.
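For a continuous attribute, the Gaussian variant replaces the frequency-based likelihood with a normal density whose mean and variance are estimated per class. The sketch below uses only the standard library; the class labels "A" and "B" and the attribute values are hypothetical, and the helper names are illustrative.

```python
import math

def gaussian_pdf(x, mean, var):
    """Density of N(mean, var) at x, used as P(x_i | y) for a continuous attribute."""
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def fit_gaussian(values):
    """Estimate the mean and (population) variance of one attribute within one class."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, var

# Hypothetical attribute values observed for two classes.
params = {
    "A": fit_gaussian([1.0, 2.0, 3.0]),
    "B": fit_gaussian([6.0, 7.0, 8.0]),
}
x = 2.5
likelihoods = {label: gaussian_pdf(x, mean, var) for label, (mean, var) in params.items()}
best = max(likelihoods, key=likelihoods.get)
print(best)  # A
```

With several attributes, the per-attribute densities would be multiplied together with the class prior, exactly as in the categorical case.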
Types of Naïve Bayes Model

Bernoulli Naive Bayes


• Bernoulli Naive Bayes is useful for data that has binary or boolean attributes.
• Each attribute takes one of two values, such as yes or no, useful or not, granted or rejected.
• It is popular for document classification.

Multinomial Naive Bayes

• The Multinomial Naïve Bayes classifier is used when the data follows a multinomial distribution.
• It is primarily used for document classification problems.
• The features used by this model are the frequencies of the words extracted from each document.
Application of Naïve Bayes Classifier
Bayesian Network
"A Bayesian network is a probabilistic graphical model which represents a set of variables and their conditional dependencies using a directed acyclic graph."

It simplifies the representation of probabilistic relationships between random variables.

It is also called a Bayes network, belief network, decision network, or Bayesian model.
Bayesian Network
• Bayesian networks model conditional dependence by representing it as edges in a directed graph.

• These relationships make it possible to conduct inference on the random variables in the graph.

• These networks satisfy the local Markov property, which allows the joint distribution to be factored into a smaller form. This helps minimize the amount of computation needed in bigger networks.

• The task of prediction is to calculate a probability distribution over one or more variables whose values we want to know, given prior information or evidence about some other variables.
Bayesian Network
How to create a Bayesian network?
To create a Bayesian network, the following must be defined:

1. The variables that exist in the problem to be solved, including the main variable of interest.

2. The conditional relationships between all the variables, i.e., the structure of the network.

3. The probability distributions for each variable, i.e., the probability rules for the relationships between variables.
Example:
Harry installed a new burglar alarm at his home to detect burglaries. The alarm responds reliably to a burglary but also responds to minor earthquakes. Harry has two neighbors, David and Sophia, who have taken responsibility for informing Harry at work when they hear the alarm. David always calls Harry when he hears the alarm, but he sometimes confuses the phone ringing with the alarm and calls then too. Sophia, on the other hand, likes to listen to loud music, so she sometimes fails to hear the alarm. Here we would like to compute probabilities involving the burglar alarm.

Problem:
Calculate the probability that the alarm has sounded, but neither a burglary nor an earthquake has occurred, and David and Sophia have both called Harry.
Solution:
The Bayesian network for the above problem is given below.

• The network structure shows that burglary and earthquake are the parent nodes of the alarm and directly affect the probability of the alarm going off, whereas David's and Sophia's calls depend only on the alarm.

• The network encodes our assumptions that the neighbors do not directly perceive the burglary, do not notice the minor earthquake, and do not confer before calling.
List of all events occurring in this network:

Burglary (B)
Earthquake (E)
Alarm (A)
David calls (D)
Sophia calls (S)
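Given the structure described above (B and E are parents of A, and A is the only parent of D and S), the joint distribution factors as P(B) P(E) P(A | B, E) P(D | A) P(S | A). The sketch below evaluates the quantity asked for in the problem. Every probability value in it is a hypothetical placeholder, because the conditional probability tables are not reproduced in the text; substitute the real table entries to get the example's answer.

```python
# Hypothetical conditional probability values (placeholders, not from the text).
p_burglary = 0.002                        # P(B = true)
p_earthquake = 0.001                      # P(E = true)
p_alarm_given = {                         # P(A = true | B, E), keyed by (B, E)
    (True, True): 0.94, (True, False): 0.95,
    (False, True): 0.31, (False, False): 0.001,
}
p_david_given_alarm = {True: 0.91, False: 0.05}    # P(D = true | A)
p_sophia_given_alarm = {True: 0.75, False: 0.02}   # P(S = true | A)

def bernoulli(p_true, value):
    """Probability of a boolean outcome given the probability of True."""
    return p_true if value else 1.0 - p_true

def joint(b, e, a, d, s):
    """P(B, E, A, D, S) = P(B) P(E) P(A | B, E) P(D | A) P(S | A)."""
    return (bernoulli(p_burglary, b)
            * bernoulli(p_earthquake, e)
            * bernoulli(p_alarm_given[(b, e)], a)
            * bernoulli(p_david_given_alarm[a], d)
            * bernoulli(p_sophia_given_alarm[a], s))

# Alarm sounded, no burglary, no earthquake, both neighbors called.
p = joint(b=False, e=False, a=True, d=True, s=True)
print(p)
```

Only five local probabilities are multiplied per query, instead of looking up one entry in a full joint table over 2^5 outcomes. This is the saving that the local Markov property provides.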
What are Bayesian networks used for?

1. Medical diagnosis
They can be used to infer the probable disease that a patient is suffering from, based on the observed symptoms. A doctor can enter the observed symptoms into a program, which computes the probabilities of multiple diseases given those symptoms.

2. Testing hypotheses
Bayesian networks help in understanding the causal relationships between various features. They can help determine whether the effect of a new feature is desirable.
What are Bayesian networks used for?

3. Environmental modeling
These networks can be used to model animal population trends, with particular attention paid to environmental stressors.

4. Forecasting traffic
Bayesian networks can be used to forecast traffic flows and learn from them.
Markov Model

Markov Chain models:


• A Markov chain is a model that tells us something about the probabilities of sequences of random states/variables.
• A Markov chain makes the very strong assumption that, to predict the future in the sequence, all that matters is the current state.
• The states before the current state have no impact on the future except via the current state.

• It can be viewed as a finite state machine with probabilistic state transitions.

• It makes the Markov assumption that the next state depends only on the current state and is independent of the previous history.

Markov Model

Markov Chain models:


• Below are the specified components of Markov Chains :

Markov Model

Markov Chain models:


Say you have a sequence. Something like this:

Sunny, Rainy, Cloudy, Cloudy, Sunny, Sunny, Sunny, Rainy

So, the weather for any given day can be in any of the three states.

Markov Model

Now using the data that we have, we can construct the following
state diagram with the labelled probabilities.

Markov Model

To compute the probability of any given day’s weather given N previous observations, we use the Markov property.

The Markov property states that the distribution of a future random variable depends only on the current state, and none of the earlier states have any impact on the future states.
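Transition probabilities can be estimated from the eight-day weather sequence by counting consecutive pairs of states. The Python sketch below does this and then scores a new sequence under the Markov property. These are simple frequency estimates from that short sequence and may not match the labelled state diagram; treating the first state as given (`initial_prob=1.0`) is an assumption of this sketch.

```python
from collections import Counter, defaultdict

sequence = ["Sunny", "Rainy", "Cloudy", "Cloudy", "Sunny", "Sunny", "Sunny", "Rainy"]

# Count each observed transition (current state -> next state).
transition_counts = defaultdict(Counter)
for current, nxt in zip(sequence, sequence[1:]):
    transition_counts[current][nxt] += 1

# Normalize each row to get P(next | current).
transition_probs = {
    current: {nxt: count / sum(counts.values()) for nxt, count in counts.items()}
    for current, counts in transition_counts.items()
}

def sequence_probability(states, initial_prob=1.0):
    """P(s_1..s_n) = P(s_1) * prod_t P(s_t | s_{t-1}) under the Markov property."""
    prob = initial_prob
    for current, nxt in zip(states, states[1:]):
        prob *= transition_probs.get(current, {}).get(nxt, 0.0)
    return prob

print(transition_probs["Sunny"])  # {'Rainy': 0.5, 'Sunny': 0.5}
print(sequence_probability(["Sunny", "Sunny", "Rainy", "Cloudy"]))  # 0.25
```

Transitions never seen in the data (for example Rainy to Sunny) get probability 0 here. A smoothing correction like the Laplace one described earlier would avoid that.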
What are Hidden Markov Models?
• A Hidden Markov Model (HMM) is a probabilistic model that consists of a sequence of hidden states, each of which generates an observation.
• The hidden states are usually not directly observable, and a common goal with an HMM is to estimate the sequence of hidden states from a sequence of observations.
• An HMM is a statistical model in which the system being modeled is assumed to be a Markov process with unobserved (hidden) states.
• A Markov process is a memoryless process: given the present state, the future states are independent of the past states.
• The Markov assumption is that a hidden variable depends only on the previous hidden state.
• Mathematically, the probability of being in a state at time t depends only on the state at time t-1.
What are Hidden Markov Models?
A hidden Markov model consists of five important components:
1. Initial probability distribution: a probability distribution over the states at time t=0, which defines the initial hidden state.
2. One or more hidden states.

3. Transition probability distribution: a transition probability matrix in which each aij represents the probability of moving from hidden state i to hidden state j.

4. A sequence of observations.

5. Emission probabilities: a sequence of observation likelihoods, also called emission probabilities, each expressing the probability of an observation Ot being generated from a state i.
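The five components map directly onto a small data structure, and the most probable hidden state sequence can then be estimated with the Viterbi algorithm. Below is a minimal sketch. The state names, observation symbols, and every probability are hypothetical values chosen only to make it runnable, and `viterbi` is an illustrative function, not an API from any library.

```python
# All parameters below are hypothetical, chosen only to make the sketch runnable.
states = ["Hot", "Cold"]                        # hidden states
initial = {"Hot": 0.6, "Cold": 0.4}             # initial probability distribution
transition = {                                  # a_ij = P(next state j | state i)
    "Hot": {"Hot": 0.7, "Cold": 0.3},
    "Cold": {"Hot": 0.4, "Cold": 0.6},
}
emission = {                                    # P(observation | state)
    "Hot": {"1": 0.2, "2": 0.4, "3": 0.4},
    "Cold": {"1": 0.5, "2": 0.4, "3": 0.1},
}

def viterbi(observations):
    """Return (probability, path) of the most probable hidden state sequence."""
    # best[s] holds the best (probability, path) among paths ending in state s.
    best = {s: (initial[s] * emission[s][observations[0]], [s]) for s in states}
    for obs in observations[1:]:
        new_best = {}
        for j in states:
            # Pick the predecessor i that maximizes the path probability into j.
            prob, path = max(
                ((best[i][0] * transition[i][j], best[i][1]) for i in states),
                key=lambda item: item[0],
            )
            new_best[j] = (prob * emission[j][obs], path + [j])
        best = new_best
    return max(best.values(), key=lambda item: item[0])

prob, path = viterbi(["3", "1", "3"])            # the sequence of observations
print(path, prob)
```

Viterbi keeps only the best path into each state at every step, so its cost grows linearly with the length of the observation sequence, where enumerating all hidden paths would grow exponentially.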
HMM Example-
Markov Networks Vs Bayesian Networks
• Bayesian networks are probabilistic graphical models that represent sets of random variables and their conditional dependencies by means of directed acyclic graphs (DAGs).

• A Markov network, an undirected graphical model, is a set of random variables whose Markov property is described by an undirected graph.

• Bayesian networks are directed and acyclic, whereas Markov networks are undirected and may be cyclic, i.e., Markov networks can also represent cyclic dependencies.
