Naïve Bayes Classifier Overview
Naïve Bayes Classifier Overview
— Chapter 8 —
2
Chapter 8. Classification: Basic Concepts
3/12/23 4
Naïve Bayes Classifier
P(Money|N) = 0.006
P(Dear|N) = 0.47
P(Friend|N) = 0.29
P(Lunch|N) = 0.18
P(Money|N) = 0.006
P(Dear|S) = 0.29
P(Friend|S) = 0.14
P(Lunch|S) = 0
P(Money|S) = 0.57
P(Dear|N) = 0.47
P(Friend|N) = 0.29
Prior Probability
P(Lunch|N) = 0.18
P(Money|N) = 0.006 P(N) = 0.67
P(Dear|S) = 0.29
P(Friend|S) = 0.14
P(Lunch|S) = 0
P(Money|S) = 0.57
P(S)=4/12=0.33
2023-03-12 MGSC5126: Data Mining 8
Naïve Bayes Classifier
P(N)=8/12=0.67
P(N)*P(Lunch|N)*P(money|N)*
P(Dear|S) = 0.29
P(money|N)*P(money|N)
P(Friend|S) = 0.14
=0.33*0*0.006*0.006*0.006
P(Lunch|S) = 0
=0
P(Money|S) = 0.57
P(S)=4/12=0.33
• Bayes’ Theorem:
P(H | X) = P(X | H ) P(H ) = P(X | H )´ P(H ) / P(X)
P(X)
o Let X be a data sample (“evidence”): class label is unknown
o Let H be a hypothesis that X belongs to class C
o Classification is to determine P(H|X)
3/12/23 14
Prediction Based on Bayes’ Theorem
3/12/23 15
Classification Is to Derive the Maximum Posteriori
• Let D be a training set of tuples and their associated class labels,
and each tuple is represented by an n-D attribute vector X = (x1,
x2, …, xn)
• Suppose there are m classes C1, C2, …, Cm.
• Classification is to derive the maximum posteriori, i.e., the
maximal P(Ci|X)
16
Classification Is to Derive the Maximum Posteriori
• Let D be a training set of tuples and their associated class labels,
and each tuple is represented by an n-D attribute vector X = (x1,
x2, …, xn)
• Suppose there are m classes C1, C2, …, Cm.
• Classification is to derive the maximum posteriori, i.e., the
maximal P(Ci|X)
• This can be derived from Bayes’ theorem
P(X | C )P(C )
P(C | X) = i i
i P(X)
n
P(X | C i) = Õ P( x | C i) = P( x | C i) ´ P( x | C i) ´ ... ´ P( x | C i)
k 1 2 n
k =1
This greatly reduces the computation cost: Only counts the class
distribution
18
Naïve Bayes Classifier: Training Dataset
age income studentcredit_rating
buys_computer
<=30 high no fair no
Class: <=30 high no excellent no
C1:buys_computer = ‘yes’ 31…40 high no fair yes
C2:buys_computer = ‘no’ >40 medium no fair yes
>40 low yes fair yes
>40 low yes excellent no
Data to be classified: 31…40 low yes excellent yes
X = (age <=30, <=30 medium no fair no
Income = medium, <=30 low yes fair yes
Student = yes >40 medium yes fair yes
Credit_rating = Fair) <=30 medium yes excellent yes
31…40 medium no excellent yes
P(C1|X) 31…40 high yes fair yes
P(C2|X) >40 medium no excellent no
3/12/23 19
Naïve Bayes Classifier: Training Dataset
3/12/23
Naïve Bayes Classifier: An Example
age income studentcredit_rating
buys_compute
<=30 high no fair no
• P(Ci): P(buys_computer = “yes”) = 9/14 = 0.643 <=30 high no excellent no
31…40 high no fair yes
P(buys_computer = “no”) = 5/14= 0.357 >40 medium no fair yes
>40 low yes fair yes
• Compute P(X|Ci) for each class >40 low yes excellent no
P(age = “<=30” | buys_computer = “yes”) = 2/9 = 0.222 31…40 low yes excellent yes
<=30 medium no fair no
P(age = “<= 30” | buys_computer = “no”) = 3/5 = 0.6
<=30 low yes fair yes
P(income = “medium” | buys_computer = “yes”) = 4/9 = 0.444>40 medium yes fair yes
P(income = “medium” | buys_computer = “no”) = 2/5 = 0.4 <=30 medium yes excellent yes
31…40 medium no excellent yes
P(student = “yes” | buys_computer = “yes) = 6/9 = 0.667 31…40 high yes fair yes
P(student = “yes” | buys_computer = “no”) = 1/5 = 0.2 >40 medium no excellent no
P(credit_rating = “fair” | buys_computer = “yes”) = 6/9 = 0.667
P(credit_rating = “fair” | buys_computer = “no”) = 2/5 = 0.4
• X = (age <= 30 , income = medium, student = yes, credit_rating = fair)
P(X|Ci) : P(X|buys_computer = “yes”) = 0.222 x 0.444 x 0.667 x 0.667 = 0.044
P(X|buys_computer = “no”) = 0.6 x 0.4 x 0.2 x 0.4 = 0.019
P(X|Ci)*P(Ci) : P(X|buys_computer = “yes”) * P(buys_computer = “yes”) = 0.028
P(X|buys_computer = “no”) * P(buys_computer = “no”) = 0.007
3/12/23 Therefore, X belongs to class (“buys_computer = yes”) 21
Naïve Bayes Classifier
• If Ak is categorical, P(xk|Ci) is the # of tuples in Ci having value xk
for Ak divided by |Ci, D| (# of tuples of Ci in D)
• If Ak is continous-valued, P(xk|Ci) is usually computed based on
Gaussian distribution with a mean μ and standard deviation σ
and P(xk|Ci) is
( x-µ )2
1 -
g ( x, µ , s ) = e 2s 2
2p s
P ( X | C i ) = g ( x k , µ C i , s Ci )
22
Avoiding the Zero-Probability Problem
• Naïve Bayesian prediction requires each conditional prob. be
non-zero. Otherwise, the predicted prob. will be zero
n
P( X | C i) = Õ P( xk | C i)
k =1
• Ex. Suppose a dataset with 1000 tuples, income=low (0),
income= medium (990), and income = high (10)
• Use Laplacian correction (or Laplacian estimator)
• Adding 1 to each case
Prob(income = low) = 1/1003
Prob(income = medium) = 991/1003
Prob(income = high) = 11/1003
• The “corrected” prob. estimates are close to their
“uncorrected” counterparts
23 23
Bayesian Classification: Why?
3/12/23 24
Naïve Bayes Classifier: Comments
• Advantages
o Easy to implement
o Good results obtained in most of the cases
• Disadvantages
o Assumption: class conditional independence, therefore loss of
accuracy
o Practically, dependencies exist among variables
E.g., hospitals: patients: Profile: age, family history, etc.
Symptoms: fever, cough etc., Disease: lung cancer,
diabetes, etc.
Dependencies among these cannot be modeled by Naïve Bayes
Classifier
• How to deal with these dependencies? Bayesian Belief Networks
(Chapter 9)
3/12/23 25
Chapter 8. Classification: Basic Concepts
3/12/23 27
Rule-based Classification
3/12/23 28
Rule-based Classification
Points to remember:
• The IF part of the rule is called rule antecedent or precondition.
• The THEN part of the rule is called rule consequent.
• The antecedent part the condition consist of one or more attribute tests
and these tests are logically ANDed.
• The consequent part consists of class prediction.
3/12/23 29
Rule-based Classification
Antecedent Consequent
3/12/23 30
Rule Extraction
• NeroFuzzy Systems
R1 age?
<=30 R5
31..40 >40
no yes no yes
3/12/23 33
Assessment of a rule: coverage and accuracy
• coverage
ncovers = # of tuples covered by R
• Accuracy
3/12/23 34
Assessment of a rule: coverage and accuracy
Quality of a classification rule can be evaluated age income studentcredit_rating
buys_compute
by: <=30 high no fair no
<=30 high no excellent no
• Coverage: fraction of records that satisfy 31…40 high no fair yes
the antecedent of a rule (LHS) >40 medium no fair yes
>40 low yes fair yes
>40 low yes excellent no
Coverage(r) = 31…40 low yes excellent yes
<=30 medium no fair no
<=30 low yes fair yes
• Accuracy: fraction of records covered by >40 medium yes fair yes
the rule that belong to the predicted
class (LHS and RHS) <=30 medium yes excellent yes
31…40 medium no excellent yes
∩ 31…40 high yes fair yes
Accuracy(r) =
>40 medium no excellent no
3/12/23 35
Rule Induction: Sequential Covering Method
3/12/23 36
Rule Induction: Sequential Covering Method
Rule induction creates a model built from if–then–else.
Steps:
• Rules are learned one at a time
• Each time a rule is learned, the tuples covered by the rules are
removed
• Repeat the process on the remaining tuples until termination
condition, e.g., when no more training examples or when the quality
of a rule returned is below a user-specified threshold
Examples covered
Examples covered by Rule 2
by Rule 1 Examples covered
by Rule 3
Positive
examples
3/12/23 38
How to Learn-One-Rule?
39 39
Example 1
ID A B C Y If ? Then Y=y1
1 a1 b1 c1 y1
A=a1 Accuracy(A=a1,Y=y1)=ncorrect/ncover=2/3
2 a1 b2 c1 y1 A=a2 Accuracy(A=a2,Y=y1)=ncorrect/ncover=0/2
3 a1 b2 c2 y2 A=a3 Accuracy(A=a3,Y=y1)=ncorrect/ncover=2/5
4 a2 b1 c1 y2 B=b1 Accuracy(B=b1,Y=y1)=ncorrect/ncover=3/5
5 a2 b2 c3 y2 B=b2 Accuracy(B=b2,Y=y1)=ncorrect/ncover=1/5
C=c1 Accuracy(C=c1,Y=y1)=ncorrect/ncover=3/5
6 a3 b1 c2 y2 C=c2 Accuracy(C=c2,Y=y1)=ncorrect/ncover=0/2
7 a3 b1 c1 y1 C=c3 Accuracy(C=c3,Y=y1)=ncorrect/ncover=1/3
8 a3 b2 c1 y2
9 a3 b2 c3 y2 If A=a1 Then Y=y1
10 a3 b1 c3 y1
2023-03-12 MGSC5126: Data Mining 40
Example 2
If (Age <= 30 and income = Medium and credit-rating = Fair) Then Yes
46
Classifier Evaluation Metrics: Confusion Matrix
Confusion Matrix:
Actual class\Predicted class C1 ¬ C1
C1 True Positives (TP) False Negatives (FN)
¬ C1 False Positives (FP) True Negatives (TN)
47
Classifier Evaluation Metrics: Confusion Matrix
48
Classifier Evaluation Metrics: Accuracy, Error
Rate
A\P C ¬C
C TP FN P
¬C FP TN N
P’ N’ All
49
Classifier Evaluation Metrics: Accuracy, Error
Rate, Sensitivity and Specificity
A\P C ¬C
C TP FN P
¬C FP TN N
P’ N’ All
50
Class Imbalance Problem
A\P C ¬C
C TP FN P
¬C FP TN N
P’ N’ All
positive class
51
Classifier Evaluation Metrics:
Precision and Recall, and F-measures
• Precision: exactness – what % of tuples that the classifier
labeled as positive are actually positive
52
Classifier Evaluation Metrics: Example
53
Holdout & Cross-Validation Methods
Holdout method
54
Holdout & Cross-Validation Methods
Cross-validation (k-fold, where k = 10 is most popular)
55
Bootstrap Method
Bootstrap
• Works well with small data sets
• Samples the given training tuples uniformly with
replacement
• i.e., each time a tuple is selected, it is equally
likely to be selected again and re-added to the
training set
56
Bootstrap Method
Several bootstrap methods, and a common one is .632 boostrap
• A data set with d tuples is sampled d times, with
replacement, resulting in a training set of d samples. The
data tuples that did not make it into the training set end up
forming the test set. About 63.2% of the original data end
up in the bootstrap, and the remaining 36.8% form the test
set (since (1 – 1/d)d ≈ e-1 = 0.368)
• Repeat the sampling procedure k times, overall accuracy of
the model:
57
Estimating Confidence Intervals:
Classifier Models M1 vs. M2
• These mean error rates are just estimates of error on the true population of
future data cases
• What if the difference between the 2 error rates is just attributed to chance?
o Use a test of statistical significance
o Obtain confidence limits for our error estimates
58
Estimating Confidence Intervals:
Null Hypothesis
• Assume samples follow a t distribution with k–1 degrees of freedom (here, k=10)
• Use t-test (or Student’s t-test)
• Null Hypothesis: M1 & M2 are the same
59
Estimating Confidence Intervals: t-test
• If only 1 test set available: pairwise comparison
o For ith round of 10-fold cross-validation, the same cross partitioning is used to
obtain err(M1)i and err(M2)i
o Average over 10 rounds to get and
o t-test computes t-statistic with k-1 degrees of freedom:
where
where k1 & k2 are # of cross-validation samples used for M1 & M2, resp.
60
Estimating Confidence Intervals:
Table for t-distribution
• Symmetric
• Significance level,
e.g., sig = 0.05 or 5%
means M1 & M2 are
significantly different
for 95% of
population
• Confidence limit, z =
sig/2
61
Estimating Confidence Intervals:
Statistical Significance
62
Model Selection: ROC Curves
• ROC (Receiver Operating
Characteristics) curves: for visual
comparison of classification models
• Originated from signal detection theory
• Shows the trade-off between the true
positive rate and the false positive rate
• The area under the ROC curve is a n Vertical axis
measure of the accuracy of the model represents the true
• Rank the test tuples in decreasing positive rate
order: the one that is most likely to n Horizontal axis rep.
belong to the positive class appears at the false positive rate
the top of the list n The plot also shows a
• The closer to the diagonal line (i.e., the diagonal line
closer the area is to 0.5), the less n A model with perfect
accurate is the model accuracy will have an
area of 1.0
63
Predictor Error Measures
Measure predictor accuracy: measure how far off the predicted value is from the actual
known value
• Loss function: measures the error betw. yi and the predicted value yi’
• Absolute error: | yi – yi’|
• Squared error: (yi – yi’)2
• Test error (generalization error): the average loss over the test set d
d
å(y - yi ' ) 2
• Mean absolute error:å | yi - yi ' | i
i =1
Mean squared error: i =1
d
d
d d
å | y - y '|
i =1
i i
å ( yi - yi ' ) 2
• Relative absolute error: | y - y |
d
Relative squared error: i =1
å d
å ( y - y)
i
2
i =1
i
i =1
64
Issues Affecting Model Selection
• Accuracy
• classifier accuracy: predicting class label
• Speed
• time to construct the model (training time)
• time to use the model (classification/prediction time)
• Robustness: handling noise and missing values
• Scalability: efficiency in disk-resident databases
• Interpretability
• understanding and insight provided by the model
• Other measures, e.g., goodness of rules, such as decision tree size or
compactness of classification rules
65
Chapter 8. Classification: Basic Concepts
Ensemble methods
• Use a combination of models to increase accuracy
• Combine a series of k learned models, M1, M2, …, Mk, with
the aim of creating an improved model M*
New Data
Data Sample
Classifier 1
Combined Class
Classifier 2
votes Prediction
.
.
Classifier T
67
Ensemble Methods: Increasing the Accuracy
68
Bagging: Boostrap Aggregation
• Analogy: Diagnosis based on multiple doctors’ majority vote
• Training
o Given a set D of d tuples, at each iteration i, a training set Di of d tuples is sampled with
replacement from D (i.e., bootstrap)
o A classifier model Mi is learned for each training set Di
• Classification: classify an unknown sample X
o Each classifier Mi returns its class prediction
o The bagged classifier M* counts the votes and assigns the class with the most votes to X
• Prediction: can be applied to the prediction of continuous values by taking the average value
of each prediction for a given test tuple
• Accuracy
o Often significantly better than a single classifier derived from D
o For noise data: not considerably worse, more robust
o Proved improved accuracy in prediction
69
Boosting
• Analogy: Consult several doctors, based on a combination of weighted
diagnoses—weight assigned based on the previous diagnosis accuracy
• How boosting works?
o Weights are assigned to each training tuple
o A series of k classifiers is iteratively learned
o After a classifier Mi is learned, the weights are updated to allow the
subsequent classifier, Mi+1, to pay more attention to the training
tuples that were misclassified by Mi
o The final M* combines the votes of each individual classifier, where
the weight of each classifier's vote is a function of its accuracy
• Boosting algorithm can be extended for numeric prediction
• Comparing with bagging: Boosting tends to have greater accuracy, but it
also risks overfitting the model to misclassified data
70
Adaboost (Freund and Schapire, 1997)
1 - error ( M i )
log
• The weight of classifier Mi’s vote is error ( M i )
71
Random Forest (Breiman 2001)
Random Forest:
• Each classifier in the ensemble is a decision tree classifier and is
generated using a random selection of attributes at each node to
determine the split
• During classification, each tree votes and the most popular class is
returned
72
Random Forest (Breiman 2001)
Two Methods to construct Random Forest:
73
Classification of Class-Imbalanced Data Sets
74
Classification of Class-Imbalanced Data Sets
Typical methods for imbalance data in 2-class classification:
75
Summary (I)
76
Summary (II)
• Significance tests and ROC curves are useful for model selection.
77
References (1)
• C. Apte and S. Weiss. Data mining with decision trees and decision rules. Future
Generation Computer Systems, 13, 1997
• C. M. Bishop, Neural Networks for Pattern Recognition. Oxford University Press,
1995
• L. Breiman, J. Friedman, R. Olshen, and C. Stone. Classification and Regression Trees.
Wadsworth International Group, 1984
• C. J. C. Burges. A Tutorial on Support Vector Machines for Pattern Recognition. Data
Mining and Knowledge Discovery, 2(2): 121-168, 1998
• P. K. Chan and S. J. Stolfo. Learning arbiter and combiner trees from partitioned data
for scaling machine learning. KDD'95
• H. Cheng, X. Yan, J. Han, and C.-W. Hsu, Discriminative Frequent Pattern Analysis for
Effective Classification, ICDE'07
• H. Cheng, X. Yan, J. Han, and P. S. Yu, Direct Discriminative Pattern Mining for
Effective Classification, ICDE'08
• W. Cohen. Fast effective rule induction. ICML'95
• G. Cong, K.-L. Tan, A. K. H. Tung, and X. Xu. Mining top-k covering rule groups for
gene expression data. SIGMOD'05
78
References (2)
• A. J. Dobson. An Introduction to Generalized Linear Models. Chapman & Hall, 1990.
• G. Dong and J. Li. Efficient mining of emerging patterns: Discovering trends and
differences. KDD'99.
• R. O. Duda, P. E. Hart, and D. G. Stork. Pattern Classification, 2ed. John Wiley, 2001
• U. M. Fayyad. Branching on attribute values in decision tree generation. AAAI’94.
• Y. Freund and R. E. Schapire. A decision-theoretic generalization of on-line learning and
an application to boosting. J. Computer and System Sciences, 1997.
• J. Gehrke, R. Ramakrishnan, and V. Ganti. Rainforest: A framework for fast decision tree
construction of large datasets. VLDB’98.
• J. Gehrke, V. Gant, R. Ramakrishnan, and W.-Y. Loh, BOAT -- Optimistic Decision Tree
Construction. SIGMOD'99.
• T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning: Data
Mining, Inference, and Prediction. Springer-Verlag, 2001.
• D. Heckerman, D. Geiger, and D. M. Chickering. Learning Bayesian networks: The
combination of knowledge and statistical data. Machine Learning, 1995.
• W. Li, J. Han, and J. Pei, CMAR: Accurate and Efficient Classification Based on Multiple
Class-Association Rules, ICDM'01.
79
References (3)
• T.-S. Lim, W.-Y. Loh, and Y.-S. Shih. A comparison of prediction accuracy, complexity,
and training time of thirty-three old and new classification algorithms. Machine
Learning, 2000.
• J. Magidson. The Chaid approach to segmentation modeling: Chi-squared automatic
interaction detection. In R. P. Bagozzi, editor, Advanced Methods of Marketing
Research, Blackwell Business, 1994.
• M. Mehta, R. Agrawal, and J. Rissanen. SLIQ : A fast scalable classifier for data mining.
EDBT'96.
• T. M. Mitchell. Machine Learning. McGraw Hill, 1997.
• S. K. Murthy, Automatic Construction of Decision Trees from Data: A Multi-
Disciplinary Survey, Data Mining and Knowledge Discovery 2(4): 345-389, 1998
• J. R. Quinlan. Induction of decision trees. Machine Learning, 1:81-106, 1986.
• J. R. Quinlan and R. M. Cameron-Jones. FOIL: A midterm report. ECML’93.
• J. R. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann, 1993.
• J. R. Quinlan. Bagging, boosting, and c4.5. AAAI'96.
80
References (4)
• R. Rastogi and K. Shim. Public: A decision tree classifier that integrates building and
pruning. VLDB’98.
• J. Shafer, R. Agrawal, and M. Mehta. SPRINT : A scalable parallel classifier for data
mining. VLDB’96.
• J. W. Shavlik and T. G. Dietterich. Readings in Machine Learning. Morgan Kaufmann,
1990.
• P. Tan, M. Steinbach, and V. Kumar. Introduction to Data Mining. Addison Wesley, 2005.
• S. M. Weiss and C. A. Kulikowski. Computer Systems that Learn: Classification and
Prediction Methods from Statistics, Neural Nets, Machine Learning, and Expert
Systems. Morgan Kaufman, 1991.
• S. M. Weiss and N. Indurkhya. Predictive Data Mining. Morgan Kaufmann, 1997.
• I. H. Witten and E. Frank. Data Mining: Practical Machine Learning Tools and
Techniques, 2ed. Morgan Kaufmann, 2005.
• X. Yin and J. Han. CPAR: Classification based on predictive association rules. SDM'03
• H. Yu, J. Yang, and J. Han. Classifying large data sets using SVM with hierarchical
clusters. KDD'03.
81