Bayesian Theory & Naïve Bayes Classifiers
Course 4232: Machine Learning
Dept. of Computer Science
Faculty of Science and Technology
Bayesian Classifier
A statistical classifier: performs probabilistic
prediction, i.e., predicts class membership
probabilities
Foundation: Based on Bayes’ Theorem.
Performance: A basic Bayesian classifier, naïve
Bayesian classifier, has comparable performance
with decision tree and selected neural network
classifiers
Standard: Even when Bayesian methods are
computationally intractable, they can provide a
standard of optimal decision making against
which other methods can be measured
Bayes’ Theorem: Basics
Bayes’ Theorem:
P( H | X) P(X | H ) P( H ) P(X | H )P( H ) / P(X)
P(X)
Let X be a data sample (“evidence”): class label is unknown
Let H be a hypothesis that X belongs to class C
Classification is to determine P(H|X), (i.e., posteriori
probability): the probability that the hypothesis holds given
the observed data sample X
P(H) (prior probability): the initial probability
E.g., X will buy computer, regardless of age, income, …
P(X): probability that sample data is observed
P(X|H) (likelihood): the probability of observing the sample X,
given that the hypothesis holds
E.g., Given that X will buy computer, the prob. that X is
31..40, medium income
Prediction Based on Bayes’ Theorem
Given training data X, posteriori probability of a
hypothesis H, P(H|X), follows the Bayes’ theorem
P(H | X) P(X | H ) P(H ) P(X | H )P(H ) / P(X)
P(X)
Informally, this can be viewed as
posteriori = likelihood x prior/evidence
Predicts X belongs to Ci iff the probability P(Ci|X) is
the highest among all the P(Ck|X) for all the k
classes
Practical difficulty: It requires initial knowledge of
many probabilities, involving significant
Classification Is to Derive the Maximum Posteriori
Let D be a training set of tuples and their associated class
labels, and each tuple is represented by an n-D attribute
vector X = (x1, x2, …, xn)
Suppose there are m classes C1, C2, …, Cm.
Classification is to derive the maximum posteriori, i.e., the
maximal P(Ci|X)
This can be derived fromP(XBayes’
| C )P(C theorem
)
P(C | X) i i
i P(X)
Since P(X) is constant for all classes, only
P(C | X) P(X | C )P(C )
i i i
needs to be maximized
Applying Bayes’ rule:
A basic Example
A doctor knows that the disease meningitis causes the patient to have a stiff neck, say,
70% of the time. The doctor also knows some unconditional facts: the prior probability that a
patient has meningitis is 1/50,000, and the prior probability that any patient has a stiff
neck is 1%.
Letting
s be the proposition that the patient has a stiff neck and
m be the proposition that the patient has meningitis, we have
Bayesian Methods
Learning and classification methods based
on probability theory.
Bayes theorem plays a critical role in
probabilistic learning and classification.
Uses prior probability of each category
given no information about an item.
Categorization produces a posterior
probability distribution over the possible
categories given a description of an item.
Bayes Classifiers
Assumption: training set consists of instances of different
classes described cj as conjunctions of attributes values
Task: Classify a new instance d based on a tuple of
attribute values into one of the classes cj C
Key idea: assign the most probable classc MAP using
Bayes Theorem.
cMAP argmax P(c j | x1 , x2 , , xn )
c j C
P ( x1 , x2 , , xn | c j ) P (c j )
argmax
c j C P ( x1 , x2 , , xn )
argmax P( x1 , x2 , , xn | c j ) P(c j )
c j C
Naïve Bayes Classifier
A simplified assumption: attributes are
conditionally independent (i.e., no dependence
relation between n
attributes):
P( X | )
Ci P( | ) P( | ) P( | ) ... P( | )
x k Ci x 1 Ci x 2 Ci x n Ci
k 1
This greatly reduces the computation cost: Only
counts the class distribution
If A is categorical, P(x |C ) is the # of tuples in C
k k i i
having value xk for Ak divided by |Ci, D| (# of tuples
of Ci in D)
If A is continous-valued, P(xk1|Ci) ( xis )
usually
2
k
g ( x, , ) distribution
2
2
computed based on Gaussian e with a
2
mean μ and standard deviation σ
P ( X | C i ) g ( x k , Ci , C i )
Parameters estimation
P(cj)
Can be estimated from the frequency of classes
in the training examples.
P(x1,x2,…,xn|cj)
O(|X|n•|C|) parameters
Could only be estimated if a very, very large
number of training examples was available.
Independence Assumption: attribute values are
conditionally independent given the target
value: naïve Bayes.
P ( x1 , x 2 , , x n | c j ) P ( xi | c j )
i
c NB arg max P (c j ) P ( xi | c j )
c j C i
Naïve Bayes Classifier: Training Dataset
age income studentcredit_rating
buys_compu
<=30 high no fair no
<=30 high no excellent no
Class:
31…40 high no fair yes
C1:buys_computer = ‘yes’ >40 medium no fair yes
C2:buys_computer = ‘no’ >40 low yes fair yes
>40 low yes excellent no
Data to be classified: 31…40 low yes excellent yes
X = (age <=30, <=30 medium no fair no
Income = medium, <=30 low yes fair yes
>40 medium yes fair yes
Student = yes
<=30 medium yes excellent yes
Credit_rating = Fair) 31…40 medium no excellent yes
31…40 high yes fair yes
>40 medium no excellent no
Naïve Bayes Classifier: An Example age income studentcredit_rating
buys_comp
<=30 high no fair no
<=30 high no excellent no
31…40 high no fair yes
P(C ): P(buys_computer = “yes”) = 9/14 = 0.643 >40 medium no fair yes
i >40 low yes fair yes
>40 low yes excellent no
P(buys_computer = “no”) = 5/14= 0.357 31…40
<=30
low
medium
yes excellent
no fair
yes
no
Compute P(X|C ) for each class <=30
>40
low yes fair
medium yes fair
yes
yes
i <=30 medium yes excellent yes
P(age = “<=30” | buys_computer = “yes”) = 2/9 = 0.22231…40
31…40
medium
high
no excellent
yes fair
yes
yes
P(age = “<= 30” | buys_computer = “no”) = 3/5 = 0.6 >40 medium no excellent no
P(income = “medium” | buys_computer = “yes”) = 4/9 = 0.444
P(income = “medium” | buys_computer = “no”) = 2/5 = 0.4
P(student = “yes” | buys_computer = “yes) = 6/9 = 0.667
P(student = “yes” | buys_computer = “no”) = 1/5 = 0.2
P(credit_rating = “fair” | buys_computer = “yes”) = 6/9 = 0.667
P(credit_rating = “fair” | buys_computer = “no”) = 2/5 = 0.4
X = (age <= 30 , income = medium, student = yes, credit_rating =
fair)
P(X|Ci) : P(X|buys_computer = “yes”) = 0.222 x 0.444 x 0.667 x 0.667 =
0.044
P(X|buys_computer = “no”) = 0.6 x 0.4 x 0.2 x 0.4 = 0.019
P(X|Ci)*P(Ci) : P(X|buys_computer = “yes”) * P(buys_computer = “yes”) =
0.028
P(X|buys_computer = “no”) * P(buys_computer = “no”) =
0.007
What is discriminant Function
For classification problem, for each class,
define a function such that we
choose Ci if
gi ( x), i 1,..., K
gi ( x) max g k ( x)
k
K=2 Classes
Dichotomizer (K=2) vs Polychotomizer
(K>2)
g(x) = g1(x) – g2(x)
C 1 if g x 0
choose
C 2 otherwise
P C 1 | x
Log odds:log
P C 2 | x
Discriminant Functions
chooseC i if gi x maxk gk x gi x, i 1, , K
R i | x
gi x P C i | x
p x | C P C
i i
K decision regions R1,...,RK
R i x | gi x maxk gk x
Properties
Estimating
P ( xi | c j ) instead , x2 , , xn | c j )
P( x1of
greatly reduces the number of
parameters (and the data sparseness).
TheP(learning
xi | c j ) step in Naïve Bayes consists
of estimating and based
on the frequenciesP(in c j )the training data
An unseen instance is classified by
computing the class that maximizes the
posterior
When conditioned independence is
satisfied, Naïve Bayes corresponds to MAP
classification.
Maximum A Posterior
Based on Bayes Theorem, we can compute the
Maximum A Posterior (MAP) hypothesis for the data
We are interested in the best hypothesis for some
space H given observed training data D.
hMAP argmax P ( h | D )
hH
P ( D | h) P ( h)
argmax
hH P( D)
argmax P( D | h) P(h)
hH
H: set of all hypothesis.
Note that we can drop P(D) as the probability of the data
is constant (and independent of the hypothesis).
Desirable Properties of Bayes
Classifier
Incrementality: with each training example,
the prior and the likelihood can be updated
dynamically: flexible and robust to errors.
Combines prior knowledge and observed
data: prior probability of a hypothesis
multiplied with probability of the hypothesis
given the training data
Probabilistic hypothesis: outputs not only a
classification, but a probability distribution
over all classes
Naïve Bayes Classifier: Comments
Advantages
Easy to implement
Good results obtained in most of the cases
Disadvantages
Assumption: class conditional independence,
therefore loss of accuracy
Practically, dependencies exist among variables
E.g., hospitals: patients: Profile: age, family history, etc.
Symptoms: fever, cough etc., Disease: lung
cancer, diabetes, etc.
Dependencies among these cannot be modeled by Naïve
Bayes Classifier