0% found this document useful (0 votes)
13 views21 pages

Bayesian Classification Methods Overview

The document discusses Bayesian classification methods, including the Naïve Bayes classifier and Bayesian belief networks, highlighting their principles, advantages, and limitations. It explains the application of Bayes' theorem in classification, the importance of conditional probabilities, and the challenges of dependency among variables. Additionally, it outlines scenarios for training Bayesian networks and the computational aspects involved.

Uploaded by

Nedia Ben Ammar
Copyright
© All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
13 views21 pages

Bayesian Classification Methods Overview

The document discusses Bayesian classification methods, including the Naïve Bayes classifier and Bayesian belief networks, highlighting their principles, advantages, and limitations. It explains the application of Bayes' theorem in classification, the importance of conditional probabilities, and the challenges of dependency among variables. Additionally, it outlines scenarios for training Bayesian networks and the computational aspects involved.

Uploaded by

Nedia Ben Ammar
Copyright
© All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd

Università degli Studi di Milano

Master Degree in Computer Science

Information Management
course
Teacher: Alberto Ceselli

Lecture 19: 10/12/2015


Data Mining:
Concepts and Techniques
(3rd ed.)

— Chapter 8, 9 —
Jiawei Han, Micheline Kamber, and Jian Pei
University of Illinois at Urbana-Champaign &
Simon Fraser University
©2011 Han, Kamber & Pei. All rights reserved.
2
Classification methods
 Classification: Basic Concepts
 Decision Tree Induction
 Bayes Classification Methods
 Support Vector Machines
 Model Evaluation and Selection
 Rule-Based Classification
 Techniques to Improve Classification
Accuracy: Ensemble Methods
3
Bayesian Classification:
Why?
 A statistical classifier: performs probabilistic prediction,
i.e., predicts class membership probabilities
 Foundation: Based on Bayes’ Theorem.
 Performance: A simple Bayesian classifier, naïve
Bayesian classifier, has comparable performance with
decision tree and selected neural network classifiers
 Incremental: Each training example can incrementally
increase/decrease the probability that a hypothesis is
correct — prior knowledge can be combined with
observed data
 Standard: Even when Bayesian methods are
computationally intractable, they can provide a standard
of optimal decision making against which other methods
can be measured
4
Bayesian Classification Rationale:
conditional probability
income student credit buys
Class: high no fair no
C1:buys_computer = ‘yes’ high no excellent no
C2:buys_computer = ‘no’ high no fair yes
medium no fair yes
low yes fair yes
low yes excellent no
P(C1)? low yes excellent yes
P(C1|student = yes)? medium no fair no
low yes fair yes
medium yes fair yes
medium yes excellent yes
medium no excellent yes
high yes fair yes
medium no excellent no
5
Bayesian Classification Rationale
 Let P(Ci|X) be the conditional probability of observing
class Ci provided the set of attributes values of my
element is X
 Final aim: obtaining (an estimation of) P(Ci|X) for each i
and for each X (classification model is the set of these
values)
 P(Ci|X) = P(Ci ∩ X) / P(X)
 How to compute P(X)?
 We would need a sufficient number of elements in

the training set whose attribute values are X


 … and therefore some elements for each possible

combination of the attribute values (unrealistic)


 How to compute P(Ci ∩ X)? Same problems
6
Bayesian Theorem: Basics
 Let X be an evidence (data sample): unkn. class label
 Let H be a hypothesis on the class X belongs
(say “potential” class)
 Classification is to find P(H|X)
a posteriori probability: the probability that the
hypothesis holds given the observed data sample X
 We can estimate:
P(H) (a priori probability), an initial “blind” probability
 E.g., X buys computer, regardless of age, income

P(X): probability that a certain data sample is observed


P(X|H) (likelyhood), the probability of observing the
sample X, given that the hypothesis H holds
7
Bayesian classification: defs
age income student credit_rating PC
<=30 high no fair no • Evidence X =
<=30 high no excellent no (age = 31..40;
31…40 high no fair yes
income = medium;
>40 medium no fair yes
>40 low yes fair yes
student = no;
>40 low yes excellent no rating = excellent)
31…40 low yes excellent yes • Hypotesis H =
<=30 medium no fair no (PC = yes)
<=30 low yes fair yes • A priori Probability
>40 medium yes fair yes P(H) = 9/14
<=30 medium yes excellent yes
• Likelihood
31…40 medium no excellent yes
31…40 high yes fair yes P(X|H) = 1/9
>40 medium no excellent no • A posteriori Probability
P(H|X) = ???

8
Bayesian Theorem
 Given training data X, posteriori probability of a
hypothesis H, P(H|X), follows the Bayes theorem
P ( X ∣H ) P ( H )
P ( H ∣ X )= = P ( X ∣H )× P ( H )/ P ( X )
P( X )
 Informally, this can be written as
posteriori = likelihood x priori/evidence
 Predicts that X belongs to Ci iff the probability
P(Ci|X) is the highest among all the P(Ck|X) for all
the k classes
 Practical difficulty: require initial knowledge of
many probabilities, significant computational cost 9
Bayesian Classification
 Let D be a training set of tuples and their
associated class labels, and each tuple is
represented by an n-D attribute vector
X = (x1, x2, …, xn)
 Suppose there are m classes C1, C2, …, Cm.
 Classification is to derive the maximum posteriori,
i.e., the maximal P(Ci|X)
 This can be derived from Bayes’ theorem
P ( X ∣C i ) P (C i )
P (C i∣X )=
P( X )
 Since P(X) is constant for all classes, only max

P (C i∣X )=P ( X ∣C i ) P (C i )
needs to be found (Maximum A Posteriori method) 10
The “Optimal” Bayesian
Classifier
 From a theoretical point of view, the Bayesian MAP
classifier is optimal: no classifier can exist
achieving a smaller error rate
 In order to compute

P (C i∣X )=P ( X ∣C i ) P (C i )
we need
P (C i )
→ “easy”: just scan the DB once
and
P ( X ∣C i )
→ if we have k classes and m attributes, each
taking n possible values: k*nm probability values!
11
Derivation of Naïve Bayes
Classifier
 A simplified assumption: attributes are conditionally
independent (i.e., no dependence relation between
attributes) and identically distributed (iid):
n
P ( X ∣C i )=∏ P ( x k∣C i )= P ( x 1∣C i )× P ( x 2∣C i )×. ..× P ( x n∣C i )
k =1

 This greatly reduces the computation cost: Only counts


the class distribution (k*n*m probabilities)

 If Ak is categorical, P(xk|Ci) is the # of tuples in Ci having


value xk for Ak divided by |Ci, D| (# of tuples of Ci in D)
 If Ak is continuous-valued, P(xk|Ci) is usually computed
based on Gaussian distribution with a mean μ and
standard deviation σ 2
( x− μ)

1 2σ 2
g( x , μ , σ )= e P ( X∣C i )=g ( x k , μ C ,σ C )
√ 2π σ i i
12
Training a Naïve Bayesian
Classifier (example)
Training:
age income student credit_rating PC
• P(PC = yes) = 9/14
<=30 high no fair no • P(PC = no) = 5/14
<=30 high no excellent no • P(age = “<=30” | PC = yes) = 2/9
31…40 high no fair yes • P(age = “<=30” | PC = no) = 3/5
>40 medium no fair yes • P(incm. = “med” | PC = yes) = 4/9
>40 low yes fair yes • P(incm. = “med” | PC = no) = 2/5
>40 low yes excellent no • P(student = “yes” | PC = yes) = 6/9
31…40 low yes excellent yes
• P(student = “yes” | PC = no) = 1/5
<=30 medium no fair no
<=30 low yes fair yes
• P(credit = “fair” | PC = “yes”) = 6/9
>40 medium yes fair yes • P(credit = “fair” | PC = “no”) = 2/5
<=30 medium yes excellent yes • P( all other combinations )
31…40 medium no excellent yes …
31…40 high yes fair yes Using:
>40 medium no excellent no •X = (“<=30”;“med”;“yes”;“fair”)
•P(X|PC = yes) →
P(age = “<=30” | PC = yes) *
P(incm. = “med” | PC = yes) *
P(student = “yes” | PC = yes) *
P(credit = “fair” | PC = “yes”) → 0.044
•P(X|PC = no) → 0.019
•P(PC = yes | X)→π*P(X|PC = yes)*P(PC = yes)→π*0.028
•P(PC = no | X)→π*P(X|PC = no)*P(PC = no)→π*0.007
PREDICT “PC = yes”!!! 13
Avoiding the Zero-Probability
Problem
 Naïve Bayesian prediction requires each
conditional prob. be non-zero. Otherwise, the
predicted prob. will be zero
n
P ( X ∣C i )=∏ P ( x k ∣C i )
k =1
 Ex. Suppose a dataset with 1000 tuples,
income=low (0), income= medium (990), and
income = high (10)
 Use Laplacian correction (or Laplacian
estimator)
 Adding 1 to each case
Prob(income = low) = 1/1003
Prob(income = medium) = 991/1003
Prob(income = high) = 11/1003 15
Naïve Bayesian Classifier:
Comments
 Advantages
 Easy to implement and computationally efficient

 Good results obtained in most of the cases

 Disadvantages
 Assumption: class conditional independence,

therefore loss of accuracy


 Practically, dependencies exist among variables


E.g., hospitals: patients: Profile: age, family
history, etc.
Symptoms: fever, cough etc., Disease: lung
cancer, diabetes, etc.

Dependencies among these cannot be modeled
by Naïve Bayesian Classifier
 How to deal with these dependencies?
→ Bayesian Belief Networks 16
Bayesian Belief Networks
 Bayesian belief networks (also known as
Bayesian networks, probabilistic networks):
allow class conditional independencies between
subsets of variables
 A (directed acyclic) graphical model of causal
relationships
 Represents dependency among the variables

 Gives a specification of joint probability

distribution

17
Bayesian Belief Networks

Nodes: random variables

Links: dependency

X and Y are the parents of Z, and Y is the
parent of P

No dependency between Z and P

Has no loops/cycles

X Y

Z
P
18
Bayesian Belief Network: An
Example
Family CPT: Conditional Probability
Smoker (S)
History (FH) Table for variable LungCancer:
(FH, S) (FH, ~S) (~FH, S) (~FH, ~S)

LC 0.8 0.5 0.7 0.1


LungCancer
Emphysema ~LC 0.2 0.5 0.3 0.9
(LC)

shows the conditional probability


for each possible combination of its
parents
PositiveXRay Dyspnea Derivation of the probability of a
particular combination of values
of X, from CPT:
Bayesian Belief Network n
P ( x 1 , .. . , x n )= ∏ P ( x i∣Parents ( x i ))
i=1
19
Training Bayesian Networks:
Several Scenarios
 Scenario 1: Given both the network structure and all
variables observable: compute only the CPT entries
 Scenario 2: Network structure known, some variables hidden:
gradient descent (greedy hill-climbing) method, i.e., search for
a solution along the steepest descent of a criterion function
 Weights are initialized to random probability values

 At each iteration, it moves towards what appears to be the

best solution at the moment, w.o. backtracking


 Weights are updated at each iteration & converge to local

optimum

20
Training Bayesian Networks:
Several Scenarios
 Scenario 3: Network structure unknown, all variables
observable: search through the model space to reconstruct
network topology
 Scenario 4: Unknown structure, all hidden variables: No good
algorithms known for this purpose
 D. Heckerman. A Tutorial on Learning with Bayesian Networks
. In Learning in Graphical Models, M. Jordan, ed.. MIT Press,
1999.

21
Bayesian Belief Networks:
Comments
 Advantages
 Computationally heavier than naïve classifier, but

still tractable
 Handle (approximating) dependencies

 Very good results (provided a meaningful network

is designed & tuned)


 Disadvantages
 Need expert problem knowledge or external

mining algorithms for designing the network

22

You might also like