Bayesian Classification
Dr. Navneet Goyal BITS, Pilani
Bayesian Classification
What are Bayesian Classifiers?
- Statistical classifiers
- Predict class membership probabilities
- Based on Bayes Theorem
Naive Bayesian Classifier
- Computationally simple
- Performance comparable with decision tree (DT) and neural network (NN) classifiers
Bayesian Classification
- Probabilistic learning: calculates explicit probabilities for hypotheses, and is among the most practical approaches to certain types of learning problems.
- Incremental: each training example can incrementally increase or decrease the probability that a hypothesis is correct.
- Prior knowledge can be combined with observed data.
Bayes Theorem
- Let X be a data sample whose class label is unknown
- Let H be the hypothesis that X belongs to a class C
- For classification, determine P(H|X)
- P(H|X) is the probability that H holds given the observed data sample X
- P(H|X) is the posterior probability of H conditioned on X
Bayes Theorem
Example:
- Sample space: all fruits
- X is round and red
- H = hypothesis that X is an apple
- P(H|X) is our confidence that X is an apple given that X is round and red
- P(H) is the prior probability of H, i.e., the probability that any given data sample is an apple regardless of how it looks
- P(H|X) is based on more information than P(H)
- Note that P(H) is independent of X
Bayes Theorem
Example:
- Sample space: all fruits
- P(X|H)? It is the probability that X is round and red given that we know X is an apple
- P(X) is the prior probability of X, i.e., the probability that a data sample from our set of fruits is red and round
Estimating Probabilities
P(X), P(H), and P(X|H) may be estimated from given data.
Bayes Theorem:

$$P(H \mid X) = \frac{P(X \mid H)\, P(H)}{P(X)}$$

Use of Bayes Theorem in the Naive Bayesian Classifier!
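As a minimal sketch, Bayes Theorem can be written as a one-line function. The fruit counts below are hypothetical (they do not appear in the slides) and only illustrate how P(H), P(X), and P(X|H) might be estimated from data for the apple example.

```python
def posterior(p_x_given_h, p_h, p_x):
    """Bayes Theorem: P(H|X) = P(X|H) * P(H) / P(X)."""
    if p_x <= 0:
        raise ValueError("P(X) must be positive")
    return p_x_given_h * p_h / p_x


# Hypothetical counts, assumed for illustration only:
# 100 fruits, 30 apples, 40 round-and-red fruits, 24 round-and-red apples.
n_fruits, n_apples, n_round_red, n_round_red_apples = 100, 30, 40, 24

p_h = n_apples / n_fruits                    # P(H): prior that a fruit is an apple
p_x = n_round_red / n_fruits                 # P(X): prior that a fruit is round and red
p_x_given_h = n_round_red_apples / n_apples  # P(X|H): round and red, given apple

p_h_given_x = posterior(p_x_given_h, p_h, p_x)
print(round(p_h_given_x, 3))  # 0.6, i.e. 24 of the 40 round-and-red fruits are apples
```

With these assumed counts the prior P(H) = 0.3 rises to a posterior of 0.6 once the sample is known to be round and red.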
Naive Bayesian Classification
- Also called Simple BC
- Why Naive/Simple?
- Class Conditional Independence: the effect of an attribute value on a given class is independent of the values of the other attributes
- This assumption simplifies computations
Naive Bayesian Classification
Steps Involved
1. Each data sample is of the form X = (x1, x2, ..., xn), where xi is the value of X for attribute Ai, i = 1, ..., n.
2. Suppose there are m classes Ci, i = 1, ..., m. Then X ∈ Ci iff P(Ci|X) > P(Cj|X) for 1 ≤ j ≤ m, j ≠ i, i.e., the BC assigns X to the class Ci having the highest posterior probability conditioned on X.
Naive Bayesian Classification
The class for which P(Ci|X) is maximized is called the maximum posterior hypothesis. From Bayes Theorem:

$$P(C_i \mid X) = \frac{P(X \mid C_i)\, P(C_i)}{P(X)}$$
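3. P(X) is constant for all classes, so only P(X|Ci) P(Ci) need be maximized.
   - If the class prior probabilities are not known, assume all classes to be equally likely and maximize P(X|Ci).
   - Otherwise maximize P(X|Ci) P(Ci), with P(Ci) = Si/S, where Si is the number of training samples of class Ci and S is the total number of training samples.
   - Problem: computing P(X|Ci) directly is infeasible! (Find out how you would compute it and why it is infeasible.)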
Naive Bayesian Classification
4. Naive assumption: attribute independence given the class, so that

$$P(X \mid C_i) = P(x_1, \ldots, x_n \mid C_i) = \prod_{k=1}^{n} P(x_k \mid C_i)$$

5. In order to classify an unknown sample X, evaluate P(X|Ci) P(Ci) for each class Ci. Sample X is assigned to the class Ci iff P(X|Ci) P(Ci) > P(X|Cj) P(Cj) for 1 ≤ j ≤ m, j ≠ i.
Naive Bayesian Classification
EXAMPLE

Age     Income  Student  Credit_rating  Class: Buys_comp
<=30    HIGH    N        FAIR           N
<=30    HIGH    N        EXCELLENT      N
31..40  HIGH    N        FAIR           Y
>40     MEDIUM  N        FAIR           Y
>40     LOW     Y        FAIR           Y
>40     LOW     Y        EXCELLENT      N
31..40  LOW     Y        EXCELLENT      Y
<=30    MEDIUM  N        FAIR           N
<=30    LOW     Y        FAIR           Y
>40     MEDIUM  Y        FAIR           Y
<=30    MEDIUM  Y        EXCELLENT      Y
31..40  MEDIUM  N        EXCELLENT      Y
31..40  HIGH    Y        FAIR           Y
>40     MEDIUM  N        EXCELLENT      N
Naive Bayesian Classification
EXAMPLE
X = (<=30, MEDIUM, Y, FAIR, ???)
We need to maximize P(X|Ci) P(Ci) for i = 1, 2.
P(Ci) is computed from the training sample:
  P(buys_comp=Y) = 9/14 = 0.643
  P(buys_comp=N) = 5/14 = 0.357
How to calculate P(X|Ci) P(Ci) for i = 1, 2?

$$P(X \mid C_i) = P(x_1, x_2, x_3, x_4 \mid C_i) = \prod_{k=1}^{4} P(x_k \mid C_i)$$
Naive Bayesian Classification
EXAMPLE
  P(age<=30 | buys_comp=Y) = 2/9 = 0.222
  P(age<=30 | buys_comp=N) = 3/5 = 0.600
  P(income=medium | buys_comp=Y) = 4/9 = 0.444
  P(income=medium | buys_comp=N) = 2/5 = 0.400
  P(student=Y | buys_comp=Y) = 6/9 = 0.667
  P(student=Y | buys_comp=N) = 1/5 = 0.200
  P(credit_rating=FAIR | buys_comp=Y) = 6/9 = 0.667
  P(credit_rating=FAIR | buys_comp=N) = 2/5 = 0.400
Naive Bayesian Classification
EXAMPLE
  P(X | buys_comp=Y) = 0.222 * 0.444 * 0.667 * 0.667 = 0.044
  P(X | buys_comp=N) = 0.600 * 0.400 * 0.200 * 0.400 = 0.019
  P(X | buys_comp=Y) P(buys_comp=Y) = 0.044 * 0.643 = 0.028
  P(X | buys_comp=N) P(buys_comp=N) = 0.019 * 0.357 = 0.007
CONCLUSION: X buys computer
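The worked example can be reproduced with a short counting script. This is a sketch, not a reference implementation: the function names `train` and `scores` are illustrative, and no smoothing is applied, exactly as in the hand calculation.

```python
from collections import Counter, defaultdict

# Training tuples from the example: (age, income, student, credit_rating, buys_comp)
DATA = [
    ("<=30", "HIGH", "N", "FAIR", "N"),
    ("<=30", "HIGH", "N", "EXCELLENT", "N"),
    ("31..40", "HIGH", "N", "FAIR", "Y"),
    (">40", "MEDIUM", "N", "FAIR", "Y"),
    (">40", "LOW", "Y", "FAIR", "Y"),
    (">40", "LOW", "Y", "EXCELLENT", "N"),
    ("31..40", "LOW", "Y", "EXCELLENT", "Y"),
    ("<=30", "MEDIUM", "N", "FAIR", "N"),
    ("<=30", "LOW", "Y", "FAIR", "Y"),
    (">40", "MEDIUM", "Y", "FAIR", "Y"),
    ("<=30", "MEDIUM", "Y", "EXCELLENT", "Y"),
    ("31..40", "MEDIUM", "N", "EXCELLENT", "Y"),
    ("31..40", "HIGH", "Y", "FAIR", "Y"),
    (">40", "MEDIUM", "N", "EXCELLENT", "N"),
]


def train(rows):
    """Count class frequencies and, per class, the frequency of each attribute value."""
    class_counts = Counter(row[-1] for row in rows)
    value_counts = defaultdict(Counter)  # (attribute index, class) -> value counts
    for *attrs, label in rows:
        for k, value in enumerate(attrs):
            value_counts[(k, label)][value] += 1
    return class_counts, value_counts


def scores(x, class_counts, value_counts):
    """Return P(X|Ci) * P(Ci) for every class Ci (relative frequencies, no smoothing)."""
    total = sum(class_counts.values())
    result = {}
    for label, n_label in class_counts.items():
        score = n_label / total  # P(Ci)
        for k, value in enumerate(x):
            score *= value_counts[(k, label)][value] / n_label  # P(xk|Ci)
        result[label] = score
    return result


class_counts, value_counts = train(DATA)
x = ("<=30", "MEDIUM", "Y", "FAIR")
s = scores(x, class_counts, value_counts)
prediction = max(s, key=s.get)
print({label: round(value, 3) for label, value in s.items()}, prediction)
# The scores round to 0.028 for Y and 0.007 for N, so the prediction is Y.
```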
Naive Bayes Classifier: Issues
- Probability values ZERO! Recall what you observed in WEKA!
- If Ak is continuous valued! Recall what you observed in WEKA!
- If there are no tuples in the training set corresponding to students for the class buys_comp=N, then P(student=Y | buys_comp=N) = 0
  - Implications?
  - Solution?
Naive Bayes Classifier: Issues
Laplacian Correction or Laplace Estimator
- Philosophy: we assume that the training data set is so large that adding one to each count that we need would make only a negligible difference in the estimated probability value.
- Example: D (1000 tuples), class: buys_comp=Y
    income=low      0 tuples
    income=medium   990 tuples
    income=high     10 tuples
- Without Laplacian correction the probabilities are 0, 0.990, and 0.010.
- With Laplacian correction they are 1/1003 = 0.001, 991/1003 = 0.988, and 11/1003 = 0.011, respectively.
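The Laplacian correction above can be sketched in a few lines. The function name `laplace_estimates` is illustrative. The point of the correction is that a single zero factor would otherwise force the whole product P(X|Ci) to zero, whatever the other attribute values say.

```python
def laplace_estimates(counts):
    """Add one to each value count; the denominator grows by the number of values."""
    total = sum(counts.values()) + len(counts)
    return {value: (n + 1) / total for value, n in counts.items()}


# Counts from the example: 1000 tuples of class buys_comp=Y
income_counts = {"low": 0, "medium": 990, "high": 10}

n = sum(income_counts.values())
raw = {value: count / n for value, count in income_counts.items()}
smoothed = laplace_estimates(income_counts)

print({v: round(p, 3) for v, p in raw.items()})       # low 0.0, medium 0.99, high 0.01
print({v: round(p, 3) for v, p in smoothed.items()})  # low 0.001, medium 0.988, high 0.011
```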
Naive Bayes Classifier: Issues
- Continuous variable: needs more work than categorical attributes!
- It is typically assumed to have a Gaussian distribution with a mean μ and a standard deviation σ.
- Do it yourself! And cross-check with WEKA!
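One way to handle a continuous attribute under the Gaussian assumption is sketched below. The numeric ages are hypothetical (the slides only give age ranges), and the sample standard deviation from `statistics.stdev` is one possible estimator, chosen here as an assumption. The density value takes the place of P(xk|Ci) in the product.

```python
import math
from statistics import mean, stdev


def gaussian_pdf(x, mu, sigma):
    """Density of a normal distribution with mean mu and standard deviation sigma."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)


# Hypothetical ages of the training tuples in one class (assumed for illustration)
ages_in_class = [25, 29, 32, 35, 38, 41, 45, 47, 52]

mu = mean(ages_in_class)
sigma = stdev(ages_in_class)  # sample standard deviation

# Used in place of P(age = 35 | class) when multiplying the per-attribute terms
likelihood = gaussian_pdf(35, mu, sigma)
print(round(mu, 2), round(sigma, 2), likelihood)
```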
Naive Bayes (Summary)
- Robust to isolated noise points
- Handles missing values by ignoring the instance during probability estimate calculations
- Robust to irrelevant attributes
- Independence assumption may not hold for some attributes
  - Use other techniques such as Bayesian Belief Networks (BBN)
Probability Calculations
- No. of attributes = 4
- Distinct values = 3, 3, 3, 3
- No. of classes = 2
- Total no. of probability calculations in NBC = 4 * 3 * 2 = 24
- What if conditional independence was not assumed? O(k^p) for p k-valued attributes. Multiply by m classes.

Age     Income  Student  Credit_rating  Class: Buys_comp
<=30    HIGH    N        FAIR           N
<=30    HIGH    N        EXCELLENT      N
31..40  HIGH    N        FAIR           Y
>40     MEDIUM  N        FAIR           Y
>40     LOW     Y        GOOD           Y
>40     LOW     Y        EXCELLENT      N
31..40  LOW     ?        EXCELLENT      ?
<=30    MEDIUM  ?        FAIR           ?
<=30    LOW     Y        GOOD           Y
>40     MEDIUM  Y        FAIR           Y
<=30    MEDIUM  ?        EXCELLENT      ?
31..40  MEDIUM  ?        EXCELLENT      ?
31..40  HIGH    Y        FAIR           Y
>40     MEDIUM  N        EXCELLENT      N
(? = value missing in the source)
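The two counts can be compared directly. This is a rough sketch in which the function names are illustrative: with the naive assumption one estimate P(xk|Ci) is needed per attribute value per class, while without it one estimate is needed per combination of attribute values per class.

```python
def naive_bayes_estimates(values_per_attribute, n_classes):
    """One P(xk|Ci) per attribute value and class: (k1 + ... + kp) * m."""
    return sum(values_per_attribute) * n_classes


def full_joint_estimates(values_per_attribute, n_classes):
    """One P(x1,...,xp|Ci) per value combination and class: (k1 * ... * kp) * m."""
    combinations = 1
    for k in values_per_attribute:
        combinations *= k
    return combinations * n_classes


values = [3, 3, 3, 3]  # 4 attributes with 3 distinct values each
print(naive_bayes_estimates(values, 2))  # 24
print(full_joint_estimates(values, 2))   # 162
```

The gap widens quickly: the first count grows linearly in the number of attributes p, the second as k^p.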
Bayesian Belief Networks
- Naive BC assumes class conditional independence
- This assumption simplifies computations
- When this assumption holds true, Naive BC is the most accurate compared to all other classifiers
- In real problems, dependencies do exist between variables
- 2 methods to overcome this limitation of NBC:
  - Bayesian networks, which combine Bayesian reasoning with causal relationships between attributes
  - Decision trees, which reason on one attribute at a time, considering the most important attributes first
Conditional Independence
- Let X, Y, and Z denote three sets of random variables. The variables in X are said to be conditionally independent of Y, given Z, if
    P(X|Y,Z) = P(X|Z)
- Relationship between a person's arm length and his/her reading skills!
  - One might observe that people with longer arms tend to have higher levels of reading skills
  - How do you explain this relationship?
Conditional Independence
- Can be explained through a confounding factor, AGE
- A young child tends to have short arms and lacks the reading skills of an adult
- If the age of a person is fixed, then the observed relationship between arm length and reading skills disappears
- We can thus conclude that arm length and reading skills are conditionally independent when the age variable is fixed
    P(reading skills | long arms, age) = P(reading skills | age)
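A small numeric check may make the arm-length example concrete. All probabilities below are hypothetical, and the joint distribution is built so that arm length and reading skill each depend only on age. The sketch shows that the two look strongly related when age is ignored, and that the relation vanishes once age is fixed.

```python
from itertools import product

# Hypothetical numbers, assumed for illustration only
P_AGE = {"child": 0.5, "adult": 0.5}
P_LONG_ARMS = {"child": 0.1, "adult": 0.9}   # P(long arms | age)
P_READS_WELL = {"child": 0.2, "adult": 0.9}  # P(reads well | age)


def bernoulli(p_true, value):
    """Probability of a boolean value when True has probability p_true."""
    return p_true if value else 1.0 - p_true


# Joint distribution over (long_arms, reads_well, age)
joint = {
    (arms, reads, age): P_AGE[age]
    * bernoulli(P_LONG_ARMS[age], arms)
    * bernoulli(P_READS_WELL[age], reads)
    for arms, reads, age in product([True, False], [True, False], P_AGE)
}


def prob(pred):
    """Total probability of all outcomes satisfying pred(arms, reads, age)."""
    return sum(p for key, p in joint.items() if pred(*key))


def cond(event, given):
    """Conditional probability P(event | given)."""
    return prob(lambda a, r, g: event(a, r, g) and given(a, r, g)) / prob(given)


reads_given_long = cond(lambda a, r, g: r, lambda a, r, g: a)
reads_given_short = cond(lambda a, r, g: r, lambda a, r, g: not a)
reads_given_long_adult = cond(lambda a, r, g: r, lambda a, r, g: a and g == "adult")
reads_given_adult = cond(lambda a, r, g: r, lambda a, r, g: g == "adult")

print(round(reads_given_long, 2), round(reads_given_short, 2))         # 0.83 0.27
print(round(reads_given_long_adult, 2), round(reads_given_adult, 2))   # 0.9 0.9
```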
Conditional Independence
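$$
\begin{aligned}
P(X, Y \mid Z) &= \frac{P(X, Y, Z)}{P(Z)} \\
               &= \frac{P(X, Y, Z)}{P(Y, Z)} \times \frac{P(Y, Z)}{P(Z)} \\
               &= P(X \mid Y, Z) \times P(Y \mid Z) \\
               &= P(X \mid Z) \times P(Y \mid Z)
\end{aligned}
$$

This explains the Naive Bayesian:

$$P(X \mid C_i) = P(x_1, x_2, x_3, \ldots, x_n \mid C_i) = \prod_{k=1}^{n} P(x_k \mid C_i)$$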
Bayesian Belief Networks
Also known as:
- Belief Networks
- Bayesian Networks
- Probabilistic Networks
Bayesian Belief Networks
- Conditional Independence (CI) assumption made by NBC may be too rigid
  - Especially for classification problems in which attributes are somewhat correlated
- We need a more flexible approach for modeling the class conditional probabilities
    P(X|Ci) = P(x1, x2, x3, ..., xn|Ci)
- Instead of requiring that all the attributes be CI given the class, BBN allows us to specify which pairs of attributes are CI
Bayesian Belief Networks
A Belief Network has 2 components:
- Directed Acyclic Graph (DAG)
- Conditional Probability Table (CPT)
Bayesian Belief Networks
A node in BBN is CI of its non-descendants, if its parents are known
Bayesian Belief Networks
[Figure: belief network over the variables FamilyHistory, Smoker, LungCancer, Emphysema, PositiveXRay, and Dyspnea]

The conditional probability table for the variable LungCancer:

       (FH, S)  (FH, ~S)  (~FH, S)  (~FH, ~S)
LC     0.8      0.5       0.7       0.1
~LC    0.2      0.5       0.3       0.9
Bayesian Belief Networks
- 6 Boolean variables
- Arcs allow representation of causal knowledge
- Having lung cancer is influenced by family history and smoking
- PositiveXRay is independent of whether the patient has a family history (FH) or is a smoker, given that we know that the patient has lung cancer
  - Once we know the outcome of LungCancer, FH and Smoker do not provide any additional information about PositiveXRay
Bayesian Belief Networks
- LungCancer is CI of Emphysema, given its parents, FH and Smoker
- BBN has a Conditional Probability Table (CPT) for each variable in the DAG
- CPT for a variable Y specifies the conditional distribution P(Y|Parents(Y))
CPT for LungCancer:

       (FH, S)  (FH, ~S)  (~FH, S)  (~FH, ~S)
LC     0.8      0.5       0.7       0.1
~LC    0.2      0.5       0.3       0.9

P(LC=Y | FH=Y, S=Y) = 0.8
P(LC=N | FH=N, S=N) = 0.9
Bayesian Belief Networks
- Let X = (x1, x2, ..., xn) be a tuple described by the variables or attributes Y1, Y2, ..., Yn, respectively
- Each variable is CI of its non-descendants given its parents
- This allows the DAG to provide a complete representation of the existing joint probability distribution by:

$$P(x_1, x_2, x_3, \ldots, x_n) = \prod_{i=1}^{n} P(x_i \mid Parents(Y_i))$$

  where P(x1, x2, x3, ..., xn) is the probability of a particular combination of values of X, and the values for P(xi|Parents(Yi)) correspond to the entries in the CPT for Yi
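The factorization can be tried on the part of the network formed by FamilyHistory, Smoker, and LungCancer. In this sketch the LungCancer CPT is the one given in the slides, while the priors for FamilyHistory and Smoker are hypothetical, and both are assumed to be nodes without parents.

```python
# CPT from the slides: P(LC = yes | FH, S), keyed by (family_history, smoker)
P_LC_YES = {
    (True, True): 0.8,
    (True, False): 0.5,
    (False, True): 0.7,
    (False, False): 0.1,
}

# Hypothetical priors for the parentless nodes (not given in the slides)
P_FH_YES = 0.2
P_S_YES = 0.3


def p_bool(p_yes, value):
    """Probability of a boolean value when 'yes' has probability p_yes."""
    return p_yes if value else 1.0 - p_yes


def joint(fh, s, lc):
    """P(FH, S, LC) = P(FH) * P(S) * P(LC | FH, S), one factor per node."""
    return p_bool(P_FH_YES, fh) * p_bool(P_S_YES, s) * p_bool(P_LC_YES[(fh, s)], lc)


p = joint(True, True, True)  # 0.2 * 0.3 * 0.8
total = sum(
    joint(fh, s, lc)
    for fh in (True, False)
    for s in (True, False)
    for lc in (True, False)
)
print(round(p, 3), round(total, 3))  # 0.048 1.0
```

Each factor is a single CPT lookup, which is why storing one CPT per node is enough to recover any entry of the joint distribution.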
Bayesian Belief Networks
- A node within the network can be selected as an output node, representing a class label attribute
- There may be more than one output node
- Rather than returning a single class label, the classification process can return a probability distribution that gives the probability of each class
Training BBN!!
Training BBN
- A number of scenarios are possible
- Network topology may be given in advance or inferred from data
- Variables may be observable or hidden (missing or incomplete data) in all or some of the training tuples
- Many algorithms exist for learning the network topology from the training data given observable attributes
- If the network topology is known and the variables are observable, training is straightforward (just compute the CPT entries)
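For the simplest scenario (known topology, all variables observable), computing a CPT comes down to relative frequencies. The sketch below estimates P(LC = yes | FH, S) from a handful of hypothetical training tuples. The records and the function name `estimate_cpt` are made up for illustration, and no smoothing is applied.

```python
from collections import Counter

# Hypothetical fully observed tuples: (family_history, smoker, lung_cancer)
records = [
    (True, True, True), (True, True, True), (True, True, False),
    (True, False, True), (True, False, False),
    (False, True, True), (False, True, False), (False, True, True),
    (False, False, False), (False, False, False),
]


def estimate_cpt(rows):
    """Estimate P(LC = yes | FH, S) as a relative frequency per parent configuration."""
    parent_counts = Counter((fh, s) for fh, s, _ in rows)
    positive_counts = Counter((fh, s) for fh, s, lc in rows if lc)
    return {
        parents: positive_counts[parents] / n
        for parents, n in parent_counts.items()
    }


cpt = estimate_cpt(records)
for parents, p_yes in sorted(cpt.items(), reverse=True):
    print(parents, round(p_yes, 3), round(1 - p_yes, 3))  # P(LC | parents), P(~LC | parents)
```

A parent configuration that never occurs in the data gets no entry here, and one that occurs only with LC = no gets probability 0, so in practice a correction such as the Laplace estimator would likely be applied to these counts as well.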
Training BBNs
- Topology given, but some variables are hidden
  - Gradient Descent (self study)
  - Falls under the class of algorithms called Adaptive Probabilistic Networks
- BBNs are computationally expensive
- BBNs provide an explicit representation of causal structure
- Domain experts can provide prior knowledge to the training process in the form of topology and/or conditional probability values
  - This leads to significant improvement in the learning process