Bayesian Learning in Machine Learning

The document discusses Bayesian learning in machine learning, focusing on the principles of probability-based learning and the Naïve Bayes model. It explains key concepts such as prior, likelihood, and posterior probabilities, along with various classification algorithms derived from Bayes' theorem. Additionally, it covers techniques for handling continuous attributes and different types of Naïve Bayes classifiers, including Gaussian, Bernoulli, and Multinomial Naïve Bayes.

MACHINE LEARNING (BCS602)

MODULE 4
CHAPTER 8
BAYESIAN LEARNING
8.1 INTRODUCTION TO PROBABILITY-BASED LEARNING
Probability-based learning is a valuable paradigm in machine learning and AI, particularly
when dealing with real-world problems that involve uncertainty and incomplete information.
It allows for more principled and robust decision-making in situations where deterministic
models may fall short.

8.2 FUNDAMENTALS OF BAYES THEOREM


The Naïve Bayes model relies on Bayes' theorem, which works with three kinds of
probabilities: prior probability, likelihood probability, and posterior
probability.
Prior Probability
It is the general probability of an uncertain event before an observation is seen or
some evidence is collected. It is the initial probability that is believed before any
new information is collected.
Likelihood Probability
Likelihood probability is the relative probability of the observation occurring for
each class, that is, the sampling density for the evidence given the hypothesis. It
is stated as P(Evidence | Hypothesis), which denotes how likely the evidence is to
occur if the hypothesis holds.
Posterior Probability
It is the updated or revised probability of an event, taking into account the
observations from the training data. P(Hypothesis | Evidence) is the posterior
distribution representing the belief about the hypothesis, given the evidence from
the training data. Informally,
Posterior probability = prior probability + new evidence
Here "+" means "updated with", not arithmetic addition: by Bayes' theorem the
posterior is proportional to the likelihood multiplied by the prior.
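
As a small numerical illustration of how a prior is revised into a posterior, the sketch below applies Bayes' theorem to one hypothesis and one piece of evidence. The spam/word scenario and all three probabilities are made-up assumptions, not values from the notes.

```python
def posterior(prior: float, likelihood: float, likelihood_given_not: float) -> float:
    """Return P(H | E) from P(H), P(E | H) and P(E | not H) via Bayes' theorem."""
    # Marginal probability of the evidence (law of total probability).
    evidence = likelihood * prior + likelihood_given_not * (1.0 - prior)
    return likelihood * prior / evidence

# Hypothetical numbers chosen only for illustration.
p_spam = 0.2               # prior: P(spam)
p_word_given_spam = 0.6    # likelihood: P(word | spam)
p_word_given_ham = 0.05    # P(word | not spam)

p_spam_given_word = posterior(p_spam, p_word_given_spam, p_word_given_ham)
print(round(p_spam_given_word, 3))
```

With these assumed numbers the belief in the hypothesis rises from the prior 0.2 to a posterior of 0.75 once the evidence is observed.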

Deepa S, [Link] CSE,RNSIT 1



8.3 CLASSIFICATION USING BAYES MODEL


Naive Bayes classifiers are a collection of classification algorithms based on Bayes’
theorem. Naive Bayes is not a single algorithm but a family of algorithms that share a
common principle: every pair of features is assumed to be independent of each other, given
the class.
Bayes’ theorem finds the probability of an event occurring given the probability of another
event that has already occurred. Bayes’ theorem is stated mathematically as the following
equation:

$$P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)} \qquad (8.1)$$

P(A|B) is Posterior probability: Probability of hypothesis A given the observed event B.

P(B|A) is Likelihood probability: Probability of the evidence B given that hypothesis A
is true.

P(A) is Prior Probability: Probability of hypothesis before observing the evidence.

P(B) is Marginal Probability: Probability of Evidence.

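OR, writing h for the hypothesis and E for the evidence,

$$P(h \mid E) = \frac{P(E \mid h)\, P(h)}{P(E)}$$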


Maximum A Posteriori (MAP) Hypothesis, hMAP


Given a set of candidate hypotheses, the hypothesis with the maximum posterior probability
is considered the most probable hypothesis. This most probable hypothesis is called the
Maximum A Posteriori hypothesis, hMAP. Bayes' theorem, Eq. (8.1), can be used to find hMAP.
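
Written out, the definition can be sketched as follows. The symbol H for the set of candidate hypotheses and E for the evidence are notational assumptions, chosen to match the P (E | h) used for the ML hypothesis.

```latex
\begin{aligned}
h_{MAP} &= \arg\max_{h \in H} P(h \mid E) \\
        &= \arg\max_{h \in H} \frac{P(E \mid h)\, P(h)}{P(E)} \\
        &= \arg\max_{h \in H} P(E \mid h)\, P(h)
\end{aligned}
```

P(E) can be dropped in the last step because it does not depend on h and so does not change which hypothesis attains the maximum.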

Maximum Likelihood (ML) Hypothesis, hML


Given a set of candidate hypotheses, if every hypothesis is equally probable a priori, only
P(E | h) is needed to find the most probable hypothesis. The hypothesis that maximizes the
likelihood P(E | h) is called the Maximum Likelihood (ML) hypothesis, hML.
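
With H denoting the set of candidate hypotheses (a symbol assumed here), a uniform prior P(h) is a constant factor in P(E | h) P(h) and drops out of the maximization, which leaves:

```latex
h_{ML} = \arg\max_{h \in H} P(E \mid h)
```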

8.3.1 NAÏVE BAYES ALGORITHM


The Naïve Bayes algorithm is a probabilistic classification algorithm based on Bayes'
theorem. It is a supervised learning algorithm.
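
A minimal sketch of the algorithm for categorical features is given below, assuming plain frequency counts as probability estimates. The function names and the toy weather-style dataset are hypothetical and only serve the illustration.

```python
from collections import Counter, defaultdict

def train_naive_bayes(rows, labels):
    """Estimate class counts and per-feature conditional counts from categorical data."""
    class_counts = Counter(labels)
    # cond_counts[(feature_index, class)][value] = number of matching training rows
    cond_counts = defaultdict(Counter)
    for row, label in zip(rows, labels):
        for i, value in enumerate(row):
            cond_counts[(i, label)][value] += 1
    return class_counts, cond_counts

def predict(row, class_counts, cond_counts):
    """Return the class maximizing P(class) * prod_i P(feature_i = value | class)."""
    total = sum(class_counts.values())
    scores = {}
    for label, n_label in class_counts.items():
        score = n_label / total                                  # prior P(class)
        for i, value in enumerate(row):
            score *= cond_counts[(i, label)][value] / n_label    # likelihood term
        scores[label] = score
    return max(scores, key=scores.get), scores

# Hypothetical toy dataset: (outlook, windy) -> play
rows = [("sunny", "no"), ("sunny", "yes"), ("rainy", "yes"),
        ("overcast", "no"), ("rainy", "no"), ("overcast", "yes")]
labels = ["yes", "no", "no", "yes", "yes", "yes"]

class_counts, cond_counts = train_naive_bayes(rows, labels)
label, scores = predict(("sunny", "no"), class_counts, cond_counts)
print(label, scores)
```

In this toy data the class "no" never occurs with windy = "no", so its score collapses to exactly zero however the other feature behaves.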
Zero Probability Error

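A zero-probability error occurs when a feature value never appears with a class in the
training data: its estimated conditional probability is zero, so the whole product of
probabilities for that class becomes zero. Zero-probability error can be solved by applying
a smoothing technique called Laplace correction. For example, given 1000 data instances in
the training dataset, if there are zero instances for a particular value of a feature, we can
add 1 instance for each attribute-value pair of that feature. This makes little difference
over 1000 data instances, and the overall probability does not become zero.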
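
The correction can be sketched as a one-line estimator. The function name, the `alpha` parameter, and the counts are assumptions for illustration; `alpha = 1` corresponds to adding one instance per attribute value as described above.

```python
def smoothed_probability(count: int, class_total: int, n_values: int,
                         alpha: float = 1.0) -> float:
    """Laplace-corrected estimate of P(feature = value | class).

    count       : training rows of this class with the given feature value
    class_total : training rows of this class
    n_values    : number of distinct values the feature can take
    alpha       : pseudo-count added per value (1 gives the Laplace correction)
    """
    return (count + alpha) / (class_total + alpha * n_values)

# Hypothetical counts: a feature with 3 possible values in a class of 1000 rows.
counts = {"low": 0, "medium": 400, "high": 600}
probs = {v: smoothed_probability(c, 1000, len(counts)) for v, c in counts.items()}
print(probs)
```

The unseen value "low" now gets probability 1/1003 instead of 0, while the other estimates barely move and the three probabilities still sum to 1.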


8.3.2 Brute Force Bayes Algorithm


Applying Bayes' theorem, the Brute Force Bayes algorithm relies on the idea of concept
learning: given a hypothesis space H for the training dataset T, the algorithm computes the
posterior probability of every hypothesis hi ∈ H. Then the Maximum A Posteriori (MAP)
hypothesis, hMAP, is output as the hypothesis with maximum posterior probability. The
algorithm is expensive because it requires a computation for every hypothesis. Although
computing all the posterior probabilities is inefficient, the same idea underlies various
other algorithms.
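
The procedure can be sketched for a tiny, explicitly enumerated hypothesis space. The three hypotheses and their priors and likelihoods are made-up assumptions; real hypothesis spaces are usually far too large to enumerate this way.

```python
def brute_force_map(priors, likelihoods):
    """Compute P(h | T) for every hypothesis and return (h_MAP, posteriors).

    priors      : dict hypothesis -> P(h)
    likelihoods : dict hypothesis -> P(T | h)
    """
    unnormalized = {h: likelihoods[h] * priors[h] for h in priors}
    evidence = sum(unnormalized.values())          # P(T), summed over all of H
    posteriors = {h: p / evidence for h, p in unnormalized.items()}
    return max(posteriors, key=posteriors.get), posteriors

# Hypothetical three-hypothesis space with made-up probabilities.
priors = {"h1": 0.7, "h2": 0.2, "h3": 0.1}
likelihoods = {"h1": 0.1, "h2": 0.5, "h3": 0.6}

h_map, posteriors = brute_force_map(priors, likelihoods)
h_ml = max(likelihoods, key=likelihoods.get)       # ignores the prior
print(h_map, h_ml)
```

Here hMAP is h2 while the maximum likelihood choice is h3, which shows how the prior can change the selected hypothesis.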

8.3.3 Bayes Optimal Classifier


The Bayes optimal classifier is a probabilistic model that uses Bayes' theorem to find the
most probable classification for a new instance, given the training data, by combining the
predictions of all hypotheses weighted by their posterior probabilities. This is different
from the Maximum A Posteriori (MAP) hypothesis, hMAP, which chooses the single most probable
hypothesis.
Here, a new instance can be classified to a possible classification value Ci by the following
Eq. (8.4):

$$\arg\max_{C_i} \sum_{h_j \in H} P(C_i \mid h_j)\, P(h_j \mid T) \qquad (8.4)$$
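
The posterior-weighted combination of predictions can be sketched as below. The posteriors and per-hypothesis predictions are assumed toy values, picked so that the combined answer differs from the one given by the single most probable hypothesis.

```python
def bayes_optimal(posteriors, predictions):
    """Return the class with the largest posterior-weighted vote.

    posteriors  : dict hypothesis -> P(h | T)
    predictions : dict hypothesis -> dict class -> P(class | h)
    """
    votes = {}
    for h, p_h in posteriors.items():
        for c, p_c in predictions[h].items():
            votes[c] = votes.get(c, 0.0) + p_c * p_h
    return max(votes, key=votes.get), votes

# Hypothetical posteriors; each hypothesis predicts one class with certainty.
posteriors = {"h1": 0.4, "h2": 0.3, "h3": 0.3}
predictions = {"h1": {"yes": 1.0, "no": 0.0},
               "h2": {"yes": 0.0, "no": 1.0},
               "h3": {"yes": 0.0, "no": 1.0}}

label, votes = bayes_optimal(posteriors, predictions)
h_map = max(posteriors, key=posteriors.get)
map_label = max(predictions[h_map], key=predictions[h_map].get)
print(label, map_label)
```

The MAP hypothesis h1 predicts "yes", but the combined vote is 0.6 for "no" against 0.4 for "yes", so the Bayes optimal classification is "no".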

8.3.4 Gibbs Algorithm


The main drawback of the Bayes optimal classifier is that it computes the posterior
probability for all hypotheses in the hypothesis space and then combines the predictions to
classify a new instance, which is computationally expensive.
The Gibbs algorithm is a sampling technique that randomly selects one hypothesis from the
hypothesis space according to the posterior probability distribution and uses it to classify a
new instance. Under certain conditions, the expected prediction error of the Gibbs algorithm
is at most twice that of the Bayes optimal classifier.
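
A minimal sketch of the sampling step follows, assuming the posteriors are already known. The threshold-rule hypotheses and the seeded random generator are illustrative assumptions.

```python
import random

def gibbs_classify(posteriors, predict_fns, instance, rng):
    """Draw one hypothesis with probability P(h | T) and classify with it alone."""
    hypotheses = list(posteriors)
    weights = [posteriors[h] for h in hypotheses]
    chosen = rng.choices(hypotheses, weights=weights, k=1)[0]
    return chosen, predict_fns[chosen](instance)

# Hypothetical hypotheses: threshold rules on a single number.
posteriors = {"h1": 0.4, "h2": 0.3, "h3": 0.3}
predict_fns = {"h1": lambda x: "yes" if x > 1 else "no",
               "h2": lambda x: "yes" if x > 5 else "no",
               "h3": lambda x: "yes" if x > 9 else "no"}

rng = random.Random(0)   # seeded only to make the sketch repeatable
chosen, label = gibbs_classify(posteriors, predict_fns, 3, rng)
print(chosen, label)
```

Only one hypothesis is consulted per prediction, which is what makes the method cheaper than combining the predictions of every hypothesis.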

8.4 NAÏVE BAYES ALGORITHM FOR CONTINUOUS ATTRIBUTES


There are two ways to predict with the Naive Bayes algorithm for continuous attributes:
1. Discretize the continuous feature into a discrete feature.
2. Apply a Normal (Gaussian) distribution to the continuous feature.
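
For the first option, one simple discretization scheme is equal-width binning. This is an assumed choice for illustration (the notes do not name a specific scheme), and the `ages` values are hypothetical.

```python
def equal_width_bins(values, n_bins):
    """Map each continuous value to a bin index 0..n_bins-1 using equal-width intervals."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    if width == 0:                       # all values identical: single bin
        return [0 for _ in values]
    # min(...) keeps the maximum value inside the last bin
    return [min(int((v - lo) / width), n_bins - 1) for v in values]

ages = [18, 22, 25, 31, 40, 58]          # hypothetical continuous feature
print(equal_width_bins(ages, 3))
```

The resulting bin indices can then be treated as values of an ordinary discrete feature.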

Gaussian Naive Bayes Algorithm


In Gaussian Naive Bayes, the values of continuous features are assumed to be sampled from a
Gaussian distribution.
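
A sketch of this idea for a single continuous feature: each class gets a mean and standard deviation estimated from its training values, and the Gaussian density stands in for P(feature | class). The function names and data are hypothetical, and the sample standard deviation is an assumed estimator.

```python
import math
from statistics import mean, stdev

def gaussian_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) at x, used as P(feature = x | class)."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

def fit(values_by_class):
    """Estimate (prior, mean, standard deviation) of one feature for each class."""
    total = sum(len(v) for v in values_by_class.values())
    return {c: (len(v) / total, mean(v), stdev(v)) for c, v in values_by_class.items()}

def predict(x, params):
    """Pick the class maximizing prior * Gaussian likelihood of x."""
    scores = {c: prior * gaussian_pdf(x, mu, sigma)
              for c, (prior, mu, sigma) in params.items()}
    return max(scores, key=scores.get)

# Hypothetical single-feature training data, grouped by class.
values_by_class = {"A": [4.9, 5.1, 5.0, 5.2], "B": [6.8, 7.1, 7.0, 6.9]}
params = fit(values_by_class)
print(predict(5.3, params), predict(6.7, params))
```

With several features, the per-feature densities would be multiplied together for each class, following the independence assumption.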


8.5 OTHER POPULAR TYPES OF NAIVE BAYES CLASSIFIERS

Bernoulli Naive Bayes Classifier


Bernoulli Naive Bayes works with discrete features. In this algorithm, the features used for
making predictions are Boolean variables that take only two values, 'yes' or 'no'. This is
particularly useful for text classification where every feature is binary, indicating
whether a word occurs in the document or not.
Multinomial Naive Bayes Classifier
This algorithm is a generalization of the Bernoulli Naive Bayes model that works for
categorical data, or particularly integer features. This classifier is useful for text
classification where each feature has an integer value that represents the frequency of
occurrence of a word.
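
The difference between the Bernoulli and Multinomial models shows up in how a document is encoded and scored. The sketch below uses an assumed three-word vocabulary and made-up per-class parameters for a single class; it is not a full classifier.

```python
import math
from collections import Counter

vocabulary = ["free", "offer", "meeting"]            # hypothetical vocabulary
document = "free free offer".split()

counts = Counter(document)
multinomial_features = [counts[w] for w in vocabulary]          # word frequencies
bernoulli_features = [int(counts[w] > 0) for w in vocabulary]   # word present or not

def bernoulli_log_likelihood(x, p):
    """log P(x | class) where p[i] = P(word i occurs | class); absent words count too."""
    return sum(math.log(pi if xi else 1.0 - pi) for xi, pi in zip(x, p))

def multinomial_log_likelihood(x, theta):
    """log P(x | class) up to a class-independent constant; theta[i] = P(word i | class)."""
    return sum(xi * math.log(ti) for xi, ti in zip(x, theta))

# Hypothetical per-class parameters for a 'spam' class.
p_spam = [0.8, 0.6, 0.1]
theta_spam = [0.5, 0.4, 0.1]
print(multinomial_features, bernoulli_features)
print(bernoulli_log_likelihood(bernoulli_features, p_spam))
print(multinomial_log_likelihood(multinomial_features, theta_spam))
```

The Bernoulli score also penalizes the absence of "meeting", while the multinomial score counts "free" twice and ignores words that do not occur.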
Multi-class Naïve Bayes Classifier
This algorithm is useful for classification problems with more than two classes, where the
target feature contains multiple classes and a test instance has to be assigned to the class
it belongs to.

