0% found this document useful (0 votes)
11 views19 pages

Naïve Bayes Classifier Overview

The document discusses Bayesian Theory and Naïve Bayes Classifiers, which are statistical classifiers based on Bayes' Theorem that predict class membership probabilities. It explains the principles of Bayes' Theorem, the process of classification using maximum posteriori probabilities, and the assumptions of conditional independence in Naïve Bayes classifiers. Additionally, it highlights the advantages and disadvantages of using Naïve Bayes in practical applications.

Uploaded by

nihar44203
Copyright
© All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PPTX, PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
11 views19 pages

Naïve Bayes Classifier Overview

The document discusses Bayesian Theory and Naïve Bayes Classifiers, which are statistical classifiers based on Bayes' Theorem that predict class membership probabilities. It explains the principles of Bayes' Theorem, the process of classification using maximum posteriori probabilities, and the assumptions of conditional independence in Naïve Bayes classifiers. Additionally, it highlights the advantages and disadvantages of using Naïve Bayes in practical applications.

Uploaded by

nihar44203
Copyright
© All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PPTX, PDF, TXT or read online on Scribd

Bayesian Theory & Naïve Bayes Classifiers

Course 4232: Machine Learning

Dept. of Computer Science


Faculty of Science and Technology
Bayesian Classifier
A statistical classifier: performs probabilistic
prediction, i.e., predicts class membership
probabilities
 Foundation: Based on Bayes’ Theorem.
 Performance: A basic Bayesian classifier, naïve
Bayesian classifier, has comparable performance
with decision tree and selected neural network
classifiers
 Standard: Even when Bayesian methods are
computationally intractable, they can provide a
standard of optimal decision making against
which other methods can be measured
Bayes’ Theorem: Basics

 Bayes’ Theorem:

P( H | X) P(X | H ) P( H ) P(X | H )P( H ) / P(X)


P(X)
 Let X be a data sample (“evidence”): class label is unknown
 Let H be a hypothesis that X belongs to class C
 Classification is to determine P(H|X), (i.e., posteriori
probability): the probability that the hypothesis holds given
the observed data sample X
 P(H) (prior probability): the initial probability
 E.g., X will buy computer, regardless of age, income, …
 P(X): probability that sample data is observed
 P(X|H) (likelihood): the probability of observing the sample X,
given that the hypothesis holds
 E.g., Given that X will buy computer, the prob. that X is
31..40, medium income
Prediction Based on Bayes’ Theorem

 Given training data X, posteriori probability of a


hypothesis H, P(H|X), follows the Bayes’ theorem

P(H | X) P(X | H ) P(H ) P(X | H )P(H ) / P(X)


P(X)
 Informally, this can be viewed as

posteriori = likelihood x prior/evidence


 Predicts X belongs to Ci iff the probability P(Ci|X) is

the highest among all the P(Ck|X) for all the k


classes
 Practical difficulty: It requires initial knowledge of
many probabilities, involving significant
Classification Is to Derive the Maximum Posteriori

 Let D be a training set of tuples and their associated class


labels, and each tuple is represented by an n-D attribute
vector X = (x1, x2, …, xn)
 Suppose there are m classes C1, C2, …, Cm.
 Classification is to derive the maximum posteriori, i.e., the
maximal P(Ci|X)
 This can be derived fromP(XBayes’
| C )P(C theorem
)
P(C | X)  i i
i P(X)

 Since P(X) is constant for all classes, only


P(C | X) P(X | C )P(C )
i i i

needs to be maximized
Applying Bayes’ rule:
A basic Example

A doctor knows that the disease meningitis causes the patient to have a stiff neck, say,
70% of the time. The doctor also knows some unconditional facts: the prior probability that a
patient has meningitis is 1/50,000, and the prior probability that any patient has a stiff
neck is 1%.

Letting
s be the proposition that the patient has a stiff neck and
m be the proposition that the patient has meningitis, we have
Bayesian Methods
Learning and classification methods based
on probability theory.
Bayes theorem plays a critical role in
probabilistic learning and classification.
Uses prior probability of each category
given no information about an item.
Categorization produces a posterior
probability distribution over the possible
categories given a description of an item.
Bayes Classifiers
Assumption: training set consists of instances of different
classes described cj as conjunctions of attributes values
Task: Classify a new instance d based on a tuple of
attribute values into one of the classes cj  C
Key idea: assign the most probable classc MAP using
Bayes Theorem.

cMAP argmax P(c j | x1 , x2 ,  , xn )


c j C

P ( x1 , x2 ,  , xn | c j ) P (c j )
argmax
c j C P ( x1 , x2 ,  , xn )
argmax P( x1 , x2 ,  , xn | c j ) P(c j )
c j C
Naïve Bayes Classifier

A simplified assumption: attributes are


conditionally independent (i.e., no dependence
relation between n
attributes):
P( X | ) 
Ci P( | ) P( | ) P( | ) ... P( | )
 x k Ci x 1 Ci x 2 Ci x n Ci
k 1

 This greatly reduces the computation cost: Only


counts the class distribution
 If A is categorical, P(x |C ) is the # of tuples in C
k k i i
having value xk for Ak divided by |Ci, D| (# of tuples
of Ci in D)
 If A is continous-valued, P(xk1|Ci)  ( xis  )
usually
2

k
g ( x,  ,  ) distribution
2
2
computed based on Gaussian e with a
2 
mean μ and standard deviation σ
P ( X | C i )  g ( x k ,  Ci ,  C i )
Parameters estimation
P(cj)
 Can be estimated from the frequency of classes
in the training examples.
P(x1,x2,…,xn|cj)
 O(|X|n•|C|) parameters
 Could only be estimated if a very, very large
number of training examples was available.
 Independence Assumption: attribute values are
conditionally independent given the target
value: naïve Bayes.
P ( x1 , x 2 ,  , x n | c j )  P ( xi | c j )
i

c NB arg max P (c j ) P ( xi | c j )
c j C i
Naïve Bayes Classifier: Training Dataset
age income studentcredit_rating
buys_compu
<=30 high no fair no
<=30 high no excellent no
Class:
31…40 high no fair yes
C1:buys_computer = ‘yes’ >40 medium no fair yes
C2:buys_computer = ‘no’ >40 low yes fair yes
>40 low yes excellent no
Data to be classified: 31…40 low yes excellent yes
X = (age <=30, <=30 medium no fair no
Income = medium, <=30 low yes fair yes
>40 medium yes fair yes
Student = yes
<=30 medium yes excellent yes
Credit_rating = Fair) 31…40 medium no excellent yes
31…40 high yes fair yes
>40 medium no excellent no
Naïve Bayes Classifier: An Example age income studentcredit_rating
buys_comp
<=30 high no fair no
<=30 high no excellent no
31…40 high no fair yes
 P(C ): P(buys_computer = “yes”) = 9/14 = 0.643 >40 medium no fair yes
i >40 low yes fair yes
>40 low yes excellent no
P(buys_computer = “no”) = 5/14= 0.357 31…40
<=30
low
medium
yes excellent
no fair
yes
no

 Compute P(X|C ) for each class <=30


>40
low yes fair
medium yes fair
yes
yes
i <=30 medium yes excellent yes
P(age = “<=30” | buys_computer = “yes”) = 2/9 = 0.22231…40
31…40
medium
high
no excellent
yes fair
yes
yes

P(age = “<= 30” | buys_computer = “no”) = 3/5 = 0.6 >40 medium no excellent no

P(income = “medium” | buys_computer = “yes”) = 4/9 = 0.444


P(income = “medium” | buys_computer = “no”) = 2/5 = 0.4
P(student = “yes” | buys_computer = “yes) = 6/9 = 0.667
P(student = “yes” | buys_computer = “no”) = 1/5 = 0.2
P(credit_rating = “fair” | buys_computer = “yes”) = 6/9 = 0.667
P(credit_rating = “fair” | buys_computer = “no”) = 2/5 = 0.4
 X = (age <= 30 , income = medium, student = yes, credit_rating =
fair)
P(X|Ci) : P(X|buys_computer = “yes”) = 0.222 x 0.444 x 0.667 x 0.667 =
0.044
P(X|buys_computer = “no”) = 0.6 x 0.4 x 0.2 x 0.4 = 0.019
P(X|Ci)*P(Ci) : P(X|buys_computer = “yes”) * P(buys_computer = “yes”) =
0.028
P(X|buys_computer = “no”) * P(buys_computer = “no”) =
0.007
What is discriminant Function
For classification problem, for each class,
define a function such that we
choose Ci if
gi ( x), i 1,..., K

gi ( x) max g k ( x)
k
K=2 Classes
Dichotomizer (K=2) vs Polychotomizer
(K>2)
g(x) = g1(x) – g2(x)
C 1 if g x  0
choose
C 2 otherwise

P C 1 | x
Log odds:log
P C 2 | x
Discriminant Functions
chooseC i if gi x  maxk gk x gi x, i 1, , K

 R i | x

gi x  P C i | x
p x | C P C 
 i i

K decision regions R1,...,RK

R i x | gi x  maxk gk x


Properties
Estimating
P ( xi | c j ) instead , x2 , , xn | c j )
P( x1of
greatly reduces the number of
parameters (and the data sparseness).
TheP(learning
xi | c j ) step in Naïve Bayes consists
of estimating and based
on the frequenciesP(in c j )the training data
An unseen instance is classified by
computing the class that maximizes the
posterior
When conditioned independence is
satisfied, Naïve Bayes corresponds to MAP
classification.
Maximum A Posterior
Based on Bayes Theorem, we can compute the
Maximum A Posterior (MAP) hypothesis for the data
We are interested in the best hypothesis for some
space H given observed training data D.

hMAP argmax P ( h | D )
hH

P ( D | h) P ( h)
argmax
hH P( D)
argmax P( D | h) P(h)
hH
H: set of all hypothesis.
Note that we can drop P(D) as the probability of the data
is constant (and independent of the hypothesis).
Desirable Properties of Bayes
Classifier
Incrementality: with each training example,
the prior and the likelihood can be updated
dynamically: flexible and robust to errors.
Combines prior knowledge and observed
data: prior probability of a hypothesis
multiplied with probability of the hypothesis
given the training data
Probabilistic hypothesis: outputs not only a
classification, but a probability distribution
over all classes
Naïve Bayes Classifier: Comments
 Advantages
Easy to implement
Good results obtained in most of the cases
 Disadvantages
Assumption: class conditional independence,
therefore loss of accuracy
Practically, dependencies exist among variables
 E.g., hospitals: patients: Profile: age, family history, etc.
Symptoms: fever, cough etc., Disease: lung
cancer, diabetes, etc.
 Dependencies among these cannot be modeled by Naïve
Bayes Classifier

You might also like