Università degli Studi di Milano
Master Degree in Computer Science
Information Management
course
Teacher: Alberto Ceselli
Lecture 19: 10/12/2015
Data Mining:
Concepts and Techniques
(3rd ed.)
— Chapter 8, 9 —
Jiawei Han, Micheline Kamber, and Jian Pei
University of Illinois at Urbana-Champaign &
Simon Fraser University
©2011 Han, Kamber & Pei. All rights reserved.
2
Classification methods
Classification: Basic Concepts
Decision Tree Induction
Bayes Classification Methods
Support Vector Machines
Model Evaluation and Selection
Rule-Based Classification
Techniques to Improve Classification
Accuracy: Ensemble Methods
3
Bayesian Classification:
Why?
A statistical classifier: performs probabilistic prediction,
i.e., predicts class membership probabilities
Foundation: Based on Bayes’ Theorem.
Performance: A simple Bayesian classifier, naïve
Bayesian classifier, has comparable performance with
decision tree and selected neural network classifiers
Incremental: Each training example can incrementally
increase/decrease the probability that a hypothesis is
correct — prior knowledge can be combined with
observed data
Standard: Even when Bayesian methods are
computationally intractable, they can provide a standard
of optimal decision making against which other methods
can be measured
4
Bayesian Classification Rationale:
conditional probability
income student credit buys
Class: high no fair no
C1:buys_computer = ‘yes’ high no excellent no
C2:buys_computer = ‘no’ high no fair yes
medium no fair yes
low yes fair yes
low yes excellent no
P(C1)? low yes excellent yes
P(C1|student = yes)? medium no fair no
low yes fair yes
medium yes fair yes
medium yes excellent yes
medium no excellent yes
high yes fair yes
medium no excellent no
5
Bayesian Classification Rationale
Let P(Ci|X) be the conditional probability of observing
class Ci provided the set of attributes values of my
element is X
Final aim: obtaining (an estimation of) P(Ci|X) for each i
and for each X (classification model is the set of these
values)
P(Ci|X) = P(Ci ∩ X) / P(X)
How to compute P(X)?
We would need a sufficient number of elements in
the training set whose attribute values are X
… and therefore some elements for each possible
combination of the attribute values (unrealistic)
How to compute P(Ci ∩ X)? Same problems
6
Bayesian Theorem: Basics
Let X be an evidence (data sample): unkn. class label
Let H be a hypothesis on the class X belongs
(say “potential” class)
Classification is to find P(H|X)
a posteriori probability: the probability that the
hypothesis holds given the observed data sample X
We can estimate:
P(H) (a priori probability), an initial “blind” probability
E.g., X buys computer, regardless of age, income
P(X): probability that a certain data sample is observed
P(X|H) (likelyhood), the probability of observing the
sample X, given that the hypothesis H holds
7
Bayesian classification: defs
age income student credit_rating PC
<=30 high no fair no • Evidence X =
<=30 high no excellent no (age = 31..40;
31…40 high no fair yes
income = medium;
>40 medium no fair yes
>40 low yes fair yes
student = no;
>40 low yes excellent no rating = excellent)
31…40 low yes excellent yes • Hypotesis H =
<=30 medium no fair no (PC = yes)
<=30 low yes fair yes • A priori Probability
>40 medium yes fair yes P(H) = 9/14
<=30 medium yes excellent yes
• Likelihood
31…40 medium no excellent yes
31…40 high yes fair yes P(X|H) = 1/9
>40 medium no excellent no • A posteriori Probability
P(H|X) = ???
8
Bayesian Theorem
Given training data X, posteriori probability of a
hypothesis H, P(H|X), follows the Bayes theorem
P ( X ∣H ) P ( H )
P ( H ∣ X )= = P ( X ∣H )× P ( H )/ P ( X )
P( X )
Informally, this can be written as
posteriori = likelihood x priori/evidence
Predicts that X belongs to Ci iff the probability
P(Ci|X) is the highest among all the P(Ck|X) for all
the k classes
Practical difficulty: require initial knowledge of
many probabilities, significant computational cost 9
Bayesian Classification
Let D be a training set of tuples and their
associated class labels, and each tuple is
represented by an n-D attribute vector
X = (x1, x2, …, xn)
Suppose there are m classes C1, C2, …, Cm.
Classification is to derive the maximum posteriori,
i.e., the maximal P(Ci|X)
This can be derived from Bayes’ theorem
P ( X ∣C i ) P (C i )
P (C i∣X )=
P( X )
Since P(X) is constant for all classes, only max
P (C i∣X )=P ( X ∣C i ) P (C i )
needs to be found (Maximum A Posteriori method) 10
The “Optimal” Bayesian
Classifier
From a theoretical point of view, the Bayesian MAP
classifier is optimal: no classifier can exist
achieving a smaller error rate
In order to compute
P (C i∣X )=P ( X ∣C i ) P (C i )
we need
P (C i )
→ “easy”: just scan the DB once
and
P ( X ∣C i )
→ if we have k classes and m attributes, each
taking n possible values: k*nm probability values!
11
Derivation of Naïve Bayes
Classifier
A simplified assumption: attributes are conditionally
independent (i.e., no dependence relation between
attributes) and identically distributed (iid):
n
P ( X ∣C i )=∏ P ( x k∣C i )= P ( x 1∣C i )× P ( x 2∣C i )×. ..× P ( x n∣C i )
k =1
This greatly reduces the computation cost: Only counts
the class distribution (k*n*m probabilities)
If Ak is categorical, P(xk|Ci) is the # of tuples in Ci having
value xk for Ak divided by |Ci, D| (# of tuples of Ci in D)
If Ak is continuous-valued, P(xk|Ci) is usually computed
based on Gaussian distribution with a mean μ and
standard deviation σ 2
( x− μ)
−
1 2σ 2
g( x , μ , σ )= e P ( X∣C i )=g ( x k , μ C ,σ C )
√ 2π σ i i
12
Training a Naïve Bayesian
Classifier (example)
Training:
age income student credit_rating PC
• P(PC = yes) = 9/14
<=30 high no fair no • P(PC = no) = 5/14
<=30 high no excellent no • P(age = “<=30” | PC = yes) = 2/9
31…40 high no fair yes • P(age = “<=30” | PC = no) = 3/5
>40 medium no fair yes • P(incm. = “med” | PC = yes) = 4/9
>40 low yes fair yes • P(incm. = “med” | PC = no) = 2/5
>40 low yes excellent no • P(student = “yes” | PC = yes) = 6/9
31…40 low yes excellent yes
• P(student = “yes” | PC = no) = 1/5
<=30 medium no fair no
<=30 low yes fair yes
• P(credit = “fair” | PC = “yes”) = 6/9
>40 medium yes fair yes • P(credit = “fair” | PC = “no”) = 2/5
<=30 medium yes excellent yes • P( all other combinations )
31…40 medium no excellent yes …
31…40 high yes fair yes Using:
>40 medium no excellent no •X = (“<=30”;“med”;“yes”;“fair”)
•P(X|PC = yes) →
P(age = “<=30” | PC = yes) *
P(incm. = “med” | PC = yes) *
P(student = “yes” | PC = yes) *
P(credit = “fair” | PC = “yes”) → 0.044
•P(X|PC = no) → 0.019
•P(PC = yes | X)→π*P(X|PC = yes)*P(PC = yes)→π*0.028
•P(PC = no | X)→π*P(X|PC = no)*P(PC = no)→π*0.007
PREDICT “PC = yes”!!! 13
Avoiding the Zero-Probability
Problem
Naïve Bayesian prediction requires each
conditional prob. be non-zero. Otherwise, the
predicted prob. will be zero
n
P ( X ∣C i )=∏ P ( x k ∣C i )
k =1
Ex. Suppose a dataset with 1000 tuples,
income=low (0), income= medium (990), and
income = high (10)
Use Laplacian correction (or Laplacian
estimator)
Adding 1 to each case
Prob(income = low) = 1/1003
Prob(income = medium) = 991/1003
Prob(income = high) = 11/1003 15
Naïve Bayesian Classifier:
Comments
Advantages
Easy to implement and computationally efficient
Good results obtained in most of the cases
Disadvantages
Assumption: class conditional independence,
therefore loss of accuracy
Practically, dependencies exist among variables
E.g., hospitals: patients: Profile: age, family
history, etc.
Symptoms: fever, cough etc., Disease: lung
cancer, diabetes, etc.
Dependencies among these cannot be modeled
by Naïve Bayesian Classifier
How to deal with these dependencies?
→ Bayesian Belief Networks 16
Bayesian Belief Networks
Bayesian belief networks (also known as
Bayesian networks, probabilistic networks):
allow class conditional independencies between
subsets of variables
A (directed acyclic) graphical model of causal
relationships
Represents dependency among the variables
Gives a specification of joint probability
distribution
17
Bayesian Belief Networks
●
Nodes: random variables
●
Links: dependency
●
X and Y are the parents of Z, and Y is the
parent of P
●
No dependency between Z and P
●
Has no loops/cycles
X Y
Z
P
18
Bayesian Belief Network: An
Example
Family CPT: Conditional Probability
Smoker (S)
History (FH) Table for variable LungCancer:
(FH, S) (FH, ~S) (~FH, S) (~FH, ~S)
LC 0.8 0.5 0.7 0.1
LungCancer
Emphysema ~LC 0.2 0.5 0.3 0.9
(LC)
shows the conditional probability
for each possible combination of its
parents
PositiveXRay Dyspnea Derivation of the probability of a
particular combination of values
of X, from CPT:
Bayesian Belief Network n
P ( x 1 , .. . , x n )= ∏ P ( x i∣Parents ( x i ))
i=1
19
Training Bayesian Networks:
Several Scenarios
Scenario 1: Given both the network structure and all
variables observable: compute only the CPT entries
Scenario 2: Network structure known, some variables hidden:
gradient descent (greedy hill-climbing) method, i.e., search for
a solution along the steepest descent of a criterion function
Weights are initialized to random probability values
At each iteration, it moves towards what appears to be the
best solution at the moment, w.o. backtracking
Weights are updated at each iteration & converge to local
optimum
20
Training Bayesian Networks:
Several Scenarios
Scenario 3: Network structure unknown, all variables
observable: search through the model space to reconstruct
network topology
Scenario 4: Unknown structure, all hidden variables: No good
algorithms known for this purpose
D. Heckerman. A Tutorial on Learning with Bayesian Networks
. In Learning in Graphical Models, M. Jordan, ed.. MIT Press,
1999.
21
Bayesian Belief Networks:
Comments
Advantages
Computationally heavier than naïve classifier, but
still tractable
Handle (approximating) dependencies
Very good results (provided a meaningful network
is designed & tuned)
Disadvantages
Need expert problem knowledge or external
mining algorithms for designing the network
22