0% found this document useful (0 votes)

12 views80 pages

Naïve Bayes Classifier Overview

Chapter 8 of 'Data Mining: Concepts and Techniques' discusses classification methods, focusing on Bayesian classification and the Naïve Bayes classifier. It explains how to calculate conditional probabilities using Bayes' Theorem and demonstrates the classifier's application through examples. The chapter also addresses the performance and advantages of the Naïve Bayes classifier in comparison to other classification methods.

Uploaded by

ajithbaby06

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

12 views80 pages

Naïve Bayes Classifier Overview

Uploaded by

ajithbaby06

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

Data Mining:

Concepts and Techniques

(3rd ed.)

— Chapter 8 —

Jiawei Han, Micheline Kamber, and Jian Pei

University of Illinois at Urbana-Champaign &
Simon Fraser University
©2011 Han, Kamber & Pei. All rights reserved.

2
Chapter 8. Classification: Basic Concepts

• Classification: Basic Concepts

• Decision Tree Induction
• Bayes Classification Methods
• Rule-Based Classification
• Model Evaluation and Selection
• Techniques to Improve Classification Accuracy:
Ensemble Methods
• Summary
3
Bayesian Classification: Why?

• A statistical classifier: performs probabilistic prediction, i.e., predicts

class membership probabilities

• Foundation: Based on Bayes’ Theorem.

3/12/23 4
Naïve Bayes Classifier

Imagine we receive normal messages from family and friends

And we also receive spams

From friends or spam?

2023-03-12 MGSC5126: Data Mining 5

Naïve Bayes Classifier
Total number of words = 17
Normal emails From friends
Total number of word Dear = 8
P(Dear|N) = 0.47
P(Friend|N) = 0.29
P(Lunch|N) = 0.18

P(Money|N) = 0.006

Spam Dear Friend Lunch Money

2023-03-12 MGSC5126: Data Mining 6

Naïve Bayes Classifier

P(Dear|N) = 0.47
P(Friend|N) = 0.29
P(Lunch|N) = 0.18
P(Money|N) = 0.006

P(Dear|S) = 0.29
P(Friend|S) = 0.14
P(Lunch|S) = 0
P(Money|S) = 0.57

2023-03-12 MGSC5126: Data Mining 7

Naïve Bayes Classifier

P(Dear|N) = 0.47
P(Friend|N) = 0.29
Prior Probability
P(Lunch|N) = 0.18
P(Money|N) = 0.006 P(N) = 0.67

P(N)=8/12=0.67 P(S)= 0.33

P(Dear|S) = 0.29
P(Friend|S) = 0.14
P(Lunch|S) = 0
P(Money|S) = 0.57
P(S)=4/12=0.33
2023-03-12 MGSC5126: Data Mining 8
Naïve Bayes Classifier

What is the probability that the new mail is normal, given it

starts with Dear Friend?

What is the probability that the new mail is spam, given it

starts with Dear Friend?

2023-03-12 MGSC5126: Data Mining 9

P(N)=8/12=0.67

P(Dear|S) = 0.29 P(S)P(Dear|S)P(Friend|S)

P(Friend|S) = 0.14 =0.33*0.29*0.14=0.01
P(Lunch|S) = 0
P(Money|S) = 0.57 0.09 > 0.01
P(S)=4/12=0.33 à the new email is a Normal message
2023-03-12 MGSC5126: Data Mining 10
Naïve Bayes Classifier
What is the probability that the new mail
P(Dear|N) = 0.47 is normal, given it starts contains Lunch
P(Friend|N) = 0.29 money money money?

P(Lunch|N) = 0.18 P(N)P(Lunch|N)P(money|N)*

P(Money|N) = 0.006 P(money|N)*P(money|N)
=0.67*0.18*0.006*0.006*0.006
P(N)=8/12=0.67 = 2.60496e-08

2023-03-12 MGSC5126: Data Mining 11

Naïve Bayes Classifier

2023-03-12 MGSC5126: Data Mining 12

Naïve Bayes Classifier

Give attributes, x1, x2, …, xn , and classes c1, c2, …, ck

The goal of Naïve Bayes classifier is to calculate the conditional probability:
P(Ck | x1, x2, …, xn )

for each of K possible outcomes or classes Ck.

2023-03-12 MGSC5126: Data Mining 13

Bayes’ Theorem: Basics

• Bayes’ Theorem:
P(H | X) = P(X | H ) P(H ) = P(X | H )´ P(H ) / P(X)
P(X)
o Let X be a data sample (“evidence”): class label is unknown
o Let H be a hypothesis that X belongs to class C
o Classification is to determine P(H|X)

o P(H) (prior probability): the initial probability

E.g., X will buy computer, regardless of age, income, …
o P(X): probability that sample data is observed
o P(X|H) (likelihood): the probability of observing the sample X, given that
the hypothesis holds
E.g., Given that X will buy computer, the prob. that X is 31..40,
medium income

3/12/23 14
Prediction Based on Bayes’ Theorem

• Given training data X, posteriori probability of a hypothesis H,

P(H|X), follows the Bayes’ theorem

P(H | X) = P(X | H ) P( H ) = P(X | H )´ P(H ) / P(X)

P(X)
• Informally, this can be viewed as
posteriori = likelihood x prior/evidence
• Predicts X belongs to Ci iff the probability P(Ci|X) is the highest
among all the P(Ck|X) for all the k classes
• Practical difficulty: It requires initial knowledge of many
probabilities, involving significant computational cost

3/12/23 15
Classification Is to Derive the Maximum Posteriori
• Let D be a training set of tuples and their associated class labels,
and each tuple is represented by an n-D attribute vector X = (x1,
x2, …, xn)
• Suppose there are m classes C1, C2, …, Cm.
• Classification is to derive the maximum posteriori, i.e., the
maximal P(Ci|X)

P(Ck | x1, x2, …, xn )

16
Classification Is to Derive the Maximum Posteriori
• Let D be a training set of tuples and their associated class labels,
and each tuple is represented by an n-D attribute vector X = (x1,
x2, …, xn)
• Suppose there are m classes C1, C2, …, Cm.
• Classification is to derive the maximum posteriori, i.e., the
maximal P(Ci|X)
• This can be derived from Bayes’ theorem
P(X | C )P(C )
P(C | X) = i i
i P(X)

• Since P(X) is constant for all classes, only

P(C | X) = P(X | C )P(C )

needs to be maximized i i i
17
Naïve Bayes Classifier

A simplified assumption: attributes are conditionally independent (i.e.,

no dependence relation between attributes):

This greatly reduces the computation cost: Only counts the class
distribution

18
Naïve Bayes Classifier: Training Dataset
age income studentcredit_rating
buys_computer
<=30 high no fair no
Class: <=30 high no excellent no
C1:buys_computer = ‘yes’ 31…40 high no fair yes
C2:buys_computer = ‘no’ >40 medium no fair yes
>40 low yes fair yes
>40 low yes excellent no
Data to be classified: 31…40 low yes excellent yes
X = (age <=30, <=30 medium no fair no
Income = medium, <=30 low yes fair yes
Student = yes >40 medium yes fair yes
Credit_rating = Fair) <=30 medium yes excellent yes
31…40 medium no excellent yes
P(C1|X) 31…40 high yes fair yes
P(C2|X) >40 medium no excellent no
3/12/23 19
Naïve Bayes Classifier: Training Dataset

age income student credit_rating

buys_computer
Assumption: <=30 high no fair no
<=30 high no excellent no
age, income, and credit_rating,
31…40 high no fair yes
student are conditionally independent >40 medium no fair yes
>40 low yes fair yes
>40 low yes excellent no
31…40 low yes excellent yes
<=30 medium no fair no
<=30 low yes fair yes
>40 medium yes fair yes
<=30 medium yes excellent yes
31…40 medium no excellent yes
31…40 high yes fair yes
>40 medium no excellent no

3/12/23
Naïve Bayes Classifier: An Example
age income studentcredit_rating
buys_compute
<=30 high no fair no
• P(Ci): P(buys_computer = “yes”) = 9/14 = 0.643 <=30 high no excellent no
31…40 high no fair yes
P(buys_computer = “no”) = 5/14= 0.357 >40 medium no fair yes
>40 low yes fair yes
• Compute P(X|Ci) for each class >40 low yes excellent no
P(age = “<=30” | buys_computer = “yes”) = 2/9 = 0.222 31…40 low yes excellent yes
<=30 medium no fair no
P(age = “<= 30” | buys_computer = “no”) = 3/5 = 0.6
<=30 low yes fair yes
P(income = “medium” | buys_computer = “yes”) = 4/9 = 0.444>40 medium yes fair yes
P(income = “medium” | buys_computer = “no”) = 2/5 = 0.4 <=30 medium yes excellent yes
31…40 medium no excellent yes
P(student = “yes” | buys_computer = “yes) = 6/9 = 0.667 31…40 high yes fair yes
P(student = “yes” | buys_computer = “no”) = 1/5 = 0.2 >40 medium no excellent no
P(credit_rating = “fair” | buys_computer = “yes”) = 6/9 = 0.667
P(credit_rating = “fair” | buys_computer = “no”) = 2/5 = 0.4
• X = (age <= 30 , income = medium, student = yes, credit_rating = fair)
P(X|Ci) : P(X|buys_computer = “yes”) = 0.222 x 0.444 x 0.667 x 0.667 = 0.044
P(X|buys_computer = “no”) = 0.6 x 0.4 x 0.2 x 0.4 = 0.019
P(X|Ci)*P(Ci) : P(X|buys_computer = “yes”) * P(buys_computer = “yes”) = 0.028
P(X|buys_computer = “no”) * P(buys_computer = “no”) = 0.007
3/12/23 Therefore, X belongs to class (“buys_computer = yes”) 21
Naïve Bayes Classifier
• If Ak is categorical, P(xk|Ci) is the # of tuples in Ci having value xk
for Ak divided by |Ci, D| (# of tuples of Ci in D)
• If Ak is continous-valued, P(xk|Ci) is usually computed based on
Gaussian distribution with a mean μ and standard deviation σ
and P(xk|Ci) is
( x-µ )2
1 -
g ( x, µ , s ) = e 2s 2

2p s

P ( X | C i ) = g ( x k , µ C i , s Ci )

22
Avoiding the Zero-Probability Problem
• Naïve Bayesian prediction requires each conditional prob. be
non-zero. Otherwise, the predicted prob. will be zero
n
P( X | C i) = Õ P( xk | C i)
k =1
• Ex. Suppose a dataset with 1000 tuples, income=low (0),
income= medium (990), and income = high (10)
• Use Laplacian correction (or Laplacian estimator)
• Adding 1 to each case
Prob(income = low) = 1/1003
Prob(income = medium) = 991/1003
Prob(income = high) = 11/1003
• The “corrected” prob. estimates are close to their
“uncorrected” counterparts
23 23
Bayesian Classification: Why?

• Performance: naïve Bayesian classifier, has comparable performance

with decision tree and selected neural network classifiers
• Incremental: Each training example can incrementally
increase/decrease the probability that a hypothesis is correct — prior
knowledge can be combined with observed data
• Standard: Even when Bayesian methods are computationally
intractable, they can provide a standard of optimal decision making
against which other methods can be measured

3/12/23 24
Naïve Bayes Classifier: Comments
• Advantages
o Easy to implement
o Good results obtained in most of the cases
• Disadvantages
o Assumption: class conditional independence, therefore loss of
accuracy
o Practically, dependencies exist among variables
E.g., hospitals: patients: Profile: age, family history, etc.
Symptoms: fever, cough etc., Disease: lung cancer,
diabetes, etc.
Dependencies among these cannot be modeled by Naïve Bayes
Classifier
• How to deal with these dependencies? Bayesian Belief Networks
(Chapter 9)

3/12/23 25
Chapter 8. Classification: Basic Concepts

• Classification: Basic Concepts

• Rule-based classifier makes the class decision by using of a

set of IF-THEN rules.

• We can express a rule in the following from:

IF condition THEN conclusion

• The rules are easily interpretable and thus these classifiers

are generally used to generate descriptive models

3/12/23 27
Rule-based Classification

Let us consider a rule R1,

R1: IF age = youth AND student = yes THEN buy_computer = yes

Note − We can also write rule R1 as follows:

R1: (age = youth) ^ (student = yes)) -> (buys computer = yes)

3/12/23 28
Rule-based Classification

R1: IF age = youth AND student = yes THEN buy_computer = yes

Points to remember:
• The IF part of the rule is called rule antecedent or precondition.
• The THEN part of the rule is called rule consequent.
• The antecedent part the condition consist of one or more attribute tests
and these tests are logically ANDed.
• The consequent part consists of class prediction.

3/12/23 29
Rule-based Classification

R1: IF age = youth AND student = yes THEN buy_computer = yes

Antecedent Consequent

R1: IF income >= 30K THEN buy_computer = yes

3/12/23 30
Rule Extraction

There are several ways to build a rule-base classifier:

• Extract rules from Decision Trees

• Sequential Covering Method

• NeroFuzzy Systems

2023-03-12 MGSC5126: Data Mining 31

Rule Extraction from a Decision Tree

Rules are easier to understand than large trees.

To extract a rule from a decision tree:

• One rule is created for each path from the root to the leaf node.

• To form a rule antecedent, each splitting criterion is logically ANDed.

• The leaf node holds the class prediction, forming the rule consequent.

• Rules are mutually exclusive and exhaustive.

2023-03-12 MGSC5126: Data Mining 32

Rule Extraction from a Decision Tree
R2

R1 age?

<=30 R5
31..40 >40

student? credit rating?

yes

no yes excellent fair

no yes no yes

R1: IF age = young AND student = no THEN buys_computer = no

R2: IF age = young AND student = yes THEN buys_computer = yes
R3: IF age = mid-age THEN buys_computer = yes
R4: IF age = old AND credit_rating = excellent THEN buys_computer = no
R5: IF age = old AND credit_rating = fair THEN buys_computer = yes

3/12/23 33
Assessment of a rule: coverage and accuracy

• coverage
ncovers = # of tuples covered by R

coverage(R) = ncovers /|D| /* D: training data set */

• Accuracy

ncorrect = # of tuples correctly classified by R

accuracy(R) = ncorrect / ncovers

3/12/23 34
Assessment of a rule: coverage and accuracy
Quality of a classification rule can be evaluated age income studentcredit_rating
buys_compute
by: <=30 high no fair no
<=30 high no excellent no
• Coverage: fraction of records that satisfy 31…40 high no fair yes
the antecedent of a rule (LHS) >40 medium no fair yes
>40 low yes fair yes
>40 low yes excellent no
Coverage(r) = 31…40 low yes excellent yes
<=30 medium no fair no
<=30 low yes fair yes
• Accuracy: fraction of records covered by >40 medium yes fair yes
the rule that belong to the predicted
class (LHS and RHS) <=30 medium yes excellent yes
31…40 medium no excellent yes
∩ 31…40 high yes fair yes
Accuracy(r) =
>40 medium no excellent no

(n is the number of records in our sample) (student = yes) → computer-buy = No

Coverage = 50%, Accuracy = 14%

3/12/23 35
Rule Induction: Sequential Covering Method

• Sequential Covering Algorithm can be used to extract IF-THEN rules

form the training data.
• Generating a decision tree is not required.
• Rules are learned sequentially, each for a given class Ci will cover
many tuples of Ci but none (or few) of the tuples of other classes
• Typical sequential covering algorithms: FOIL, AQ, CN2, RIPPER

3/12/23 36
Rule Induction: Sequential Covering Method
Rule induction creates a model built from if–then–else.

Steps:
• Rules are learned one at a time
• Each time a rule is learned, the tuples covered by the rules are
removed
• Repeat the process on the remaining tuples until termination
condition, e.g., when no more training examples or when the quality
of a rule returned is below a user-specified threshold

Note: In Decision Tree Induction rules res learned simultaneously

but in Sequential Covering Method rules are learned one at a time
3/12/23 37
Sequential Covering Algorithm

while (enough target tuples left)

generate a rule
remove positive target tuples satisfying this rule

Examples covered
Examples covered by Rule 2
by Rule 1 Examples covered
by Rule 3

Positive
examples

3/12/23 38
How to Learn-One-Rule?

• Start with the most general rule possible: condition =

empty
• Adding new attributes by adopting a greedy depth-first
strategy
• Picks the one that most improves the rule quality
• Rule-Quality measures: consider both coverage and
accuracy

39 39
Example 1

ID A B C Y If ? Then Y=y1
1 a1 b1 c1 y1
A=a1 Accuracy(A=a1,Y=y1)=ncorrect/ncover=2/3
2 a1 b2 c1 y1 A=a2 Accuracy(A=a2,Y=y1)=ncorrect/ncover=0/2
3 a1 b2 c2 y2 A=a3 Accuracy(A=a3,Y=y1)=ncorrect/ncover=2/5
4 a2 b1 c1 y2 B=b1 Accuracy(B=b1,Y=y1)=ncorrect/ncover=3/5
5 a2 b2 c3 y2 B=b2 Accuracy(B=b2,Y=y1)=ncorrect/ncover=1/5
C=c1 Accuracy(C=c1,Y=y1)=ncorrect/ncover=3/5
6 a3 b1 c2 y2 C=c2 Accuracy(C=c2,Y=y1)=ncorrect/ncover=0/2
7 a3 b1 c1 y1 C=c3 Accuracy(C=c3,Y=y1)=ncorrect/ncover=1/3
8 a3 b2 c1 y2
9 a3 b2 c3 y2 If A=a1 Then Y=y1
10 a3 b1 c3 y1
2023-03-12 MGSC5126: Data Mining 40
Example 2

ID A B C Y If A=a1 Then Y=y1

1 a1 b1 c1 y1
2 a1 b2 c1 y1 B=b1 Accuracy(A=a1 and B=b1,Y=y1)=ncorrect/ncover=1/1
B=b2 Accuracy(A=a1 and B=b2,Y=y1)=ncorrect/ncover=1/2
3 a1 b2 c2 y2 C=c1 Accuracy(A=a1 and C=c1,Y=y1)=ncorrect/ncover=2/2
4 a2 b1 c1 y2 C=c2 Accuracy(A=a1 and C=c2,Y=y1)=ncorrect/ncover=0/1
C=c3 Accuracy(A=a1 and C=c3,Y=y1)=ncorrect/ncover=0/0
5 a2 b2 c3 y2
6 a3 b1 c2 y2
7 a3 b1 c1 y1
If (A=a1 and C=c1) Then Y=y1
8 a3 b2 c1 y2
9 a3 b2 c3 y2
10 a3 b1 c3 y1
2023-03-12 MGSC5126: Data Mining 41
Conflict Resolution

If (Age <= 30 and income = Medium and credit-rating = Fair) Then Yes

If (Age <= 30 and income = Medium) Then No

2023-03-12 MGSC5126: Data Mining 42

Conflict Resolution

If more than one rule are triggered, need conflict resolution.

Conflict resolution can be done by:

• Size ordering: assign the highest priority to the triggering

rules that has the “toughest” requirement (i.e., with the
most attribute tests)
• Class-based ordering: decreasing order of prevalence or
misclassification cost per class (give priority to classes)
• Rule-based ordering (decision list): rules are organized
into one long priority list, according to some measure of
rule quality or by experts
3/12/23 43
Rule Pruning

The rule is pruned is due to the following reason:

• The rule may perform well on training data but less well on subsequent
data. That's why the rule pruning is required.

• The rule is pruned by removing conjunct. The rule R is pruned, if pruned

version of R has greater quality than what was assessed on an independent
set of tuples.

3/12/23 Data Mining: Concepts and Techniques 44

Chapter 8. Classification: Basic Concepts

• Classification: Basic Concepts

• Classification accuracy measures:

o Confusion matrix
o Accuracy and error rate
o Sensitivity and Specificity
• Methods for comparing classifiers:
o Confidence intervals
o Cost-benefit analysis and ROC Curves

46
Classifier Evaluation Metrics: Confusion Matrix

Confusion Matrix:
Actual class\Predicted class C1 ¬ C1
C1 True Positives (TP) False Negatives (FN)
¬ C1 False Positives (FP) True Negatives (TN)

• Given m classes, an entry, CMi,j in a confusion matrix indicates #

of tuples in class i that were labeled by the classifier as class j
• May have extra rows/columns to provide totals

47
Classifier Evaluation Metrics: Confusion Matrix

Example of Confusion Matrix:

Actual class\Predicted buy_computer = buy_computer = Total

class yes no
buy_computer = yes 6954 46 7000
buy_computer = no 412 2588 3000
Total 7366 2634 10000

48
Classifier Evaluation Metrics: Accuracy, Error
Rate

A\P C ¬C

C TP FN P

¬C FP TN N

P’ N’ All

• Classifier Accuracy, or recognition rate: percentage of test set tuples that

are correctly classified
Accuracy = (TP + TN)/All
• Error rate: 1 – accuracy, or
Error rate = (FP + FN)/All

49
Classifier Evaluation Metrics: Accuracy, Error
Rate, Sensitivity and Specificity
A\P C ¬C
C TP FN P
¬C FP TN N
P’ N’ All

Sensitivity: True Positive recognition rate

Sensitivity = TP/P
Specificity: True Negative recognition rate
Specificity = TN/N

50
Class Imbalance Problem

A\P C ¬C

C TP FN P

¬C FP TN N

P’ N’ All

Class Imbalance Problem:

n One class may be rare, e.g. fraud, or HIV-positive

n Significant majority of the negative class and minority of the

positive class

51
Classifier Evaluation Metrics:
Precision and Recall, and F-measures
• Precision: exactness – what % of tuples that the classifier
labeled as positive are actually positive

• Recall: completeness – what % of positive tuples did the

classifier label as positive?
• Perfect score is 1.0
• Inverse relationship between precision & recall
• F measure (F1 or F-score): harmonic mean of precision and
recall,

• Fß: weighted measure of precision and recall

o assigns ß times as much weight to recall as to precision

52
Classifier Evaluation Metrics: Example

Actual Class\Predicted class cancer = yes cancer = no Total Recognition(%)

cancer = yes 90 210 300 30.00 (sensitivity
cancer = no 140 9560 9700 98.56 (specificity)
Total 230 9770 10000 96.40 (accuracy)

• Precision = 90/230 = 39.13% Recall = 90/300 = 30.00%

53
Holdout & Cross-Validation Methods

Holdout method

• Given data is randomly partitioned into two independent

sets
o Training set (e.g., 2/3) for model construction
o Test set (e.g., 1/3) for accuracy estimation

• Random sampling: a variation of holdout

o Repeat holdout k times, accuracy = avg. of the accuracies obtained

54
Holdout & Cross-Validation Methods
Cross-validation (k-fold, where k = 10 is most popular)

• Randomly partition the data into k mutually exclusive subsets,

each approximately equal size

• At i-th iteration, use Di as test set and others as training set

• Leave-one-out: k folds where k = # of tuples, for small sized

data

• Stratified cross-validation: folds are stratified so that class

dist. in each fold is approx. the same as that in the initial data

55
Bootstrap Method
Bootstrap
• Works well with small data sets
• Samples the given training tuples uniformly with
replacement
• i.e., each time a tuple is selected, it is equally
likely to be selected again and re-added to the
training set

56
Bootstrap Method
Several bootstrap methods, and a common one is .632 boostrap
• A data set with d tuples is sampled d times, with
replacement, resulting in a training set of d samples. The
data tuples that did not make it into the training set end up
forming the test set. About 63.2% of the original data end
up in the bootstrap, and the remaining 36.8% form the test
set (since (1 – 1/d)d ≈ e-1 = 0.368)
• Repeat the sampling procedure k times, overall accuracy of
the model:

57
Estimating Confidence Intervals:
Classifier Models M1 vs. M2

• Suppose we have 2 classifiers, M1 and M2, which one is better?

• Use 10-fold cross-validation to obtain and

• These mean error rates are just estimates of error on the true population of
future data cases

• What if the difference between the 2 error rates is just attributed to chance?
o Use a test of statistical significance
o Obtain confidence limits for our error estimates

58
Estimating Confidence Intervals:
Null Hypothesis

• Perform 10-fold cross-validation

• Assume samples follow a t distribution with k–1 degrees of freedom (here, k=10)
• Use t-test (or Student’s t-test)
• Null Hypothesis: M1 & M2 are the same

• If we can reject null hypothesis, then

o we conclude that the difference between M1 & M2 is statistically significant
o Chose model with lower error rate

59
Estimating Confidence Intervals: t-test
• If only 1 test set available: pairwise comparison
o For ith round of 10-fold cross-validation, the same cross partitioning is used to
obtain err(M1)i and err(M2)i
o Average over 10 rounds to get and
o t-test computes t-statistic with k-1 degrees of freedom:
where

• If two test sets available: use non-paired t-test

where

where k1 & k2 are # of cross-validation samples used for M1 & M2, resp.

60
Estimating Confidence Intervals:
Table for t-distribution

• Symmetric
• Significance level,
e.g., sig = 0.05 or 5%
means M1 & M2 are
significantly different
for 95% of
population
• Confidence limit, z =
sig/2

61
Estimating Confidence Intervals:
Statistical Significance

• Are M1 & M2 significantly different?

o Compute t. Select significance level (e.g. sig = 5%)
o Consult table for t-distribution: Find t value corresponding to k-1 degrees of
freedom (here, 9)
o t-distribution is symmetric: typically upper % points of distribution shown →
look up value for confidence limit z=sig/2 (here, 0.025)
o If t > z or t < -z, then t value lies in rejection region:
Reject null hypothesis that mean error rates of M1 & M2 are same
Conclude: statistically significant difference between M1 & M2
o Otherwise, conclude that any difference is chance

62
Model Selection: ROC Curves
• ROC (Receiver Operating
Characteristics) curves: for visual
comparison of classification models
• Originated from signal detection theory
• Shows the trade-off between the true
positive rate and the false positive rate
• The area under the ROC curve is a n Vertical axis
measure of the accuracy of the model represents the true
• Rank the test tuples in decreasing positive rate
order: the one that is most likely to n Horizontal axis rep.
belong to the positive class appears at the false positive rate
the top of the list n The plot also shows a
• The closer to the diagonal line (i.e., the diagonal line
closer the area is to 0.5), the less n A model with perfect
accurate is the model accuracy will have an
area of 1.0
63
Predictor Error Measures

Measure predictor accuracy: measure how far off the predicted value is from the actual
known value
• Loss function: measures the error betw. yi and the predicted value yi’
• Absolute error: | yi – yi’|
• Squared error: (yi – yi’)2
• Test error (generalization error): the average loss over the test set d
d
å(y - yi ' ) 2
• Mean absolute error:å | yi - yi ' | i

i =1
Mean squared error: i =1

d
d
d d
å | y - y '|
i =1
i i
å ( yi - yi ' ) 2
• Relative absolute error: | y - y |
d
Relative squared error: i =1
å d

å ( y - y)
i
2
i =1
i
i =1

The mean squared-error exaggerates the presence of outliers

Popularly use (square) root mean-square error, similarly, root relative squared error

64
Issues Affecting Model Selection
• Accuracy
• classifier accuracy: predicting class label
• Speed
• time to construct the model (training time)
• time to use the model (classification/prediction time)
• Robustness: handling noise and missing values
• Scalability: efficiency in disk-resident databases
• Interpretability
• understanding and insight provided by the model
• Other measures, e.g., goodness of rules, such as decision tree size or
compactness of classification rules

65
Chapter 8. Classification: Basic Concepts

• Classification: Basic Concepts

Ensemble methods
• Use a combination of models to increase accuracy
• Combine a series of k learned models, M1, M2, …, Mk, with
the aim of creating an improved model M*

New Data
Data Sample
Classifier 1
Combined Class
Classifier 2
votes Prediction
.
.
Classifier T

67
Ensemble Methods: Increasing the Accuracy

Popular ensemble methods

• Bagging: averaging the prediction over a collection of

classifiers
• Boosting: weighted vote with a collection of classifiers
• Ensemble: combining a set of heterogeneous classifiers

68
Bagging: Boostrap Aggregation
• Analogy: Diagnosis based on multiple doctors’ majority vote
• Training
o Given a set D of d tuples, at each iteration i, a training set Di of d tuples is sampled with
replacement from D (i.e., bootstrap)
o A classifier model Mi is learned for each training set Di
• Classification: classify an unknown sample X
o Each classifier Mi returns its class prediction
o The bagged classifier M* counts the votes and assigns the class with the most votes to X
• Prediction: can be applied to the prediction of continuous values by taking the average value
of each prediction for a given test tuple
• Accuracy
o Often significantly better than a single classifier derived from D
o For noise data: not considerably worse, more robust
o Proved improved accuracy in prediction

69
Boosting
• Analogy: Consult several doctors, based on a combination of weighted
diagnoses—weight assigned based on the previous diagnosis accuracy
• How boosting works?
o Weights are assigned to each training tuple
o A series of k classifiers is iteratively learned
o After a classifier Mi is learned, the weights are updated to allow the
subsequent classifier, Mi+1, to pay more attention to the training
tuples that were misclassified by Mi
o The final M* combines the votes of each individual classifier, where
the weight of each classifier's vote is a function of its accuracy
• Boosting algorithm can be extended for numeric prediction
• Comparing with bagging: Boosting tends to have greater accuracy, but it
also risks overfitting the model to misclassified data

70
Adaboost (Freund and Schapire, 1997)

• Given a set of d class-labeled tuples, (X1, y1), …, (Xd, yd)

• Initially, all the weights of tuples are set the same (1/d)
• Generate k classifiers in k rounds. At round i,
• Tuples from D are sampled (with replacement) to form a training set Di
of the same size
• Each tuple’s chance of being selected is based on its weight
• A classification model Mi is derived from Di
• Its error rate is calculated using Di as a test set
• If a tuple is misclassified, its weight is increased, o.w. it is decreased
• Error rate: err(Xj) is the misclassification error of tuple Xj. Classifier Mi
error rate is the sum of the weightsd of the misclassified tuples:
error ( M i ) = å w j ´ err ( X j )
j

1 - error ( M i )
log
• The weight of classifier Mi’s vote is error ( M i )
71
Random Forest (Breiman 2001)

Random Forest:
• Each classifier in the ensemble is a decision tree classifier and is
generated using a random selection of attributes at each node to
determine the split

• During classification, each tree votes and the most popular class is
returned

72
Random Forest (Breiman 2001)
Two Methods to construct Random Forest:

• Forest-RI (random input selection): Randomly select, at each

node, F attributes as candidates for the split at the node. The CART
methodology is used to grow the trees to maximum size

• Forest-RC (random linear combinations): Creates new attributes

(or features) that are a linear combination of the existing
attributes (reduces the correlation between individual classifiers)

73
Classification of Class-Imbalanced Data Sets

• Class-imbalance problem: Rare positive example but numerous negative

ones, e.g., medical diagnosis, fraud, oil-spill, fault, etc.

• Traditional methods assume a balanced distribution of classes and equal

error costs: not suitable for class-imbalanced data

74
Classification of Class-Imbalanced Data Sets
Typical methods for imbalance data in 2-class classification:

• Oversampling: re-sampling of data from positive class

• Under-sampling: randomly eliminate tuples from negative class
• Threshold-moving: moves the decision threshold, t, so that the rare class
tuples are easier to classify, and hence, less chance of costly false negative
errors
• Ensemble techniques: Ensemble multiple classifiers introduced above

Still difficult for class imbalance problem on multiclass tasks

75
Summary (I)

• Classification is a form of data analysis that extracts models describing

important data classes.
• Effective and scalable methods have been developed for decision tree
induction, Naive Bayesian classification, rule-based classification, and
many other classification methods.
• Evaluation metrics include: accuracy, sensitivity, specificity, precision,
recall, F measure, and Fß measure.
• Stratified k-fold cross-validation is recommended for accuracy estimation.
Bagging and boosting can be used to increase overall accuracy by learning
and combining a series of individual models.

76
Summary (II)

• Significance tests and ROC curves are useful for model selection.

• There have been numerous comparisons of the different

classification methods; the matter remains a research topic
• No single method has been found to be superior over all others
for all data sets
• Issues such as accuracy, training time, robustness, scalability, and
interpretability must be considered and can involve trade-offs,
further complicating the quest for an overall superior method

77
References (1)

• C. Apte and S. Weiss. Data mining with decision trees and decision rules. Future
Generation Computer Systems, 13, 1997
• C. M. Bishop, Neural Networks for Pattern Recognition. Oxford University Press,
1995
• L. Breiman, J. Friedman, R. Olshen, and C. Stone. Classification and Regression Trees.
Wadsworth International Group, 1984
• C. J. C. Burges. A Tutorial on Support Vector Machines for Pattern Recognition. Data
Mining and Knowledge Discovery, 2(2): 121-168, 1998
• P. K. Chan and S. J. Stolfo. Learning arbiter and combiner trees from partitioned data
for scaling machine learning. KDD'95
• H. Cheng, X. Yan, J. Han, and C.-W. Hsu, Discriminative Frequent Pattern Analysis for
Effective Classification, ICDE'07
• H. Cheng, X. Yan, J. Han, and P. S. Yu, Direct Discriminative Pattern Mining for
Effective Classification, ICDE'08
• W. Cohen. Fast effective rule induction. ICML'95
• G. Cong, K.-L. Tan, A. K. H. Tung, and X. Xu. Mining top-k covering rule groups for
gene expression data. SIGMOD'05

78
References (2)
• A. J. Dobson. An Introduction to Generalized Linear Models. Chapman & Hall, 1990.
• G. Dong and J. Li. Efficient mining of emerging patterns: Discovering trends and
differences. KDD'99.
• R. O. Duda, P. E. Hart, and D. G. Stork. Pattern Classification, 2ed. John Wiley, 2001
• U. M. Fayyad. Branching on attribute values in decision tree generation. AAAI’94.
• Y. Freund and R. E. Schapire. A decision-theoretic generalization of on-line learning and
an application to boosting. J. Computer and System Sciences, 1997.
• J. Gehrke, R. Ramakrishnan, and V. Ganti. Rainforest: A framework for fast decision tree
construction of large datasets. VLDB’98.
• J. Gehrke, V. Gant, R. Ramakrishnan, and W.-Y. Loh, BOAT -- Optimistic Decision Tree
Construction. SIGMOD'99.
• T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning: Data
Mining, Inference, and Prediction. Springer-Verlag, 2001.
• D. Heckerman, D. Geiger, and D. M. Chickering. Learning Bayesian networks: The
combination of knowledge and statistical data. Machine Learning, 1995.
• W. Li, J. Han, and J. Pei, CMAR: Accurate and Efficient Classification Based on Multiple
Class-Association Rules, ICDM'01.

79
References (3)

• T.-S. Lim, W.-Y. Loh, and Y.-S. Shih. A comparison of prediction accuracy, complexity,
and training time of thirty-three old and new classification algorithms. Machine
Learning, 2000.
• J. Magidson. The Chaid approach to segmentation modeling: Chi-squared automatic
interaction detection. In R. P. Bagozzi, editor, Advanced Methods of Marketing
Research, Blackwell Business, 1994.
• M. Mehta, R. Agrawal, and J. Rissanen. SLIQ : A fast scalable classifier for data mining.
EDBT'96.
• T. M. Mitchell. Machine Learning. McGraw Hill, 1997.
• S. K. Murthy, Automatic Construction of Decision Trees from Data: A Multi-
Disciplinary Survey, Data Mining and Knowledge Discovery 2(4): 345-389, 1998
• J. R. Quinlan. Induction of decision trees. Machine Learning, 1:81-106, 1986.
• J. R. Quinlan and R. M. Cameron-Jones. FOIL: A midterm report. ECML’93.
• J. R. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann, 1993.
• J. R. Quinlan. Bagging, boosting, and c4.5. AAAI'96.
80
References (4)
• R. Rastogi and K. Shim. Public: A decision tree classifier that integrates building and
pruning. VLDB’98.
• J. Shafer, R. Agrawal, and M. Mehta. SPRINT : A scalable parallel classifier for data
mining. VLDB’96.
• J. W. Shavlik and T. G. Dietterich. Readings in Machine Learning. Morgan Kaufmann,
1990.
• P. Tan, M. Steinbach, and V. Kumar. Introduction to Data Mining. Addison Wesley, 2005.
• S. M. Weiss and C. A. Kulikowski. Computer Systems that Learn: Classification and
Prediction Methods from Statistics, Neural Nets, Machine Learning, and Expert
Systems. Morgan Kaufman, 1991.
• S. M. Weiss and N. Indurkhya. Predictive Data Mining. Morgan Kaufmann, 1997.
• I. H. Witten and E. Frank. Data Mining: Practical Machine Learning Tools and
Techniques, 2ed. Morgan Kaufmann, 2005.
• X. Yin and J. Han. CPAR: Classification based on predictive association rules. SDM'03
• H. Yu, J. Yang, and J. Han. Classifying large data sets using SVM with hierarchical
clusters. KDD'03.

Naïve Bayes Classifier Explained
No ratings yet
Naïve Bayes Classifier Explained
19 pages
Bayesian Classification
No ratings yet
Bayesian Classification
11 pages
Machine Learning: Classification Techniques
No ratings yet
Machine Learning: Classification Techniques
37 pages
Naïve Bayes Classifier Explained
No ratings yet
Naïve Bayes Classifier Explained
9 pages
Mod 3 Aiml
No ratings yet
Mod 3 Aiml
84 pages
Naïve Bayesian Classification Explained
No ratings yet
Naïve Bayesian Classification Explained
17 pages
Naive Bayesian Classification Overview
No ratings yet
Naive Bayesian Classification Overview
15 pages
Bayes' Theorem in Data Classification
No ratings yet
Bayes' Theorem in Data Classification
10 pages
Bayesian Classification Techniques
No ratings yet
Bayesian Classification Techniques
16 pages
Naïve Bayes Classifier Explained
No ratings yet
Naïve Bayes Classifier Explained
22 pages
Bayesian Classification Techniques Explained
No ratings yet
Bayesian Classification Techniques Explained
22 pages
Understanding Bayesian Classification Techniques
No ratings yet
Understanding Bayesian Classification Techniques
15 pages
Naive Bayes Classification Explained
No ratings yet
Naive Bayes Classification Explained
18 pages
Classification Algorithms Overview
No ratings yet
Classification Algorithms Overview
169 pages
Naïve Bayes Classifier Overview
No ratings yet
Naïve Bayes Classifier Overview
27 pages
Naïve Bayesian Classification Overview
No ratings yet
Naïve Bayesian Classification Overview
47 pages
Supervised Learning: Classification Methods
No ratings yet
Supervised Learning: Classification Methods
79 pages
Bayes Classification Explained
No ratings yet
Bayes Classification Explained
41 pages
Bayes Classification Techniques Overview
No ratings yet
Bayes Classification Techniques Overview
30 pages
Naive Bayesian Classifier Overview
No ratings yet
Naive Bayesian Classifier Overview
21 pages
Bayesian Classification in Data Mining
No ratings yet
Bayesian Classification in Data Mining
6 pages
Naive Bayesian Classification Overview
No ratings yet
Naive Bayesian Classification Overview
16 pages
Bayesian Classification Explained
No ratings yet
Bayesian Classification Explained
18 pages
10 Naive Bayesian Classifier
No ratings yet
10 Naive Bayesian Classifier
20 pages
Naïve Bayes Classifier Overview
No ratings yet
Naïve Bayes Classifier Overview
21 pages
Data Mining Notes
No ratings yet
Data Mining Notes
18 pages
Bayesian Classification in Data Mining
No ratings yet
Bayesian Classification in Data Mining
15 pages
Data-Mining-lecture-Eight-Naïve Bayes Algorithm
No ratings yet
Data-Mining-lecture-Eight-Naïve Bayes Algorithm
24 pages
Understanding Classification Techniques
No ratings yet
Understanding Classification Techniques
111 pages
Naive Bayesian Classifiers Overview
No ratings yet
Naive Bayesian Classifiers Overview
43 pages
Bayesian Classification Methods Overview
No ratings yet
Bayesian Classification Methods Overview
21 pages
Bayesian Classification Methods Explained
No ratings yet
Bayesian Classification Methods Explained
46 pages
Bayesian Classifiers and Data Mining Techniques
No ratings yet
Bayesian Classifiers and Data Mining Techniques
7 pages
Naïve Bayes Classifier Overview
No ratings yet
Naïve Bayes Classifier Overview
19 pages
Bayesian Classification Overview
No ratings yet
Bayesian Classification Overview
66 pages
Naïve Bayesian Classifier Overview
No ratings yet
Naïve Bayesian Classifier Overview
13 pages
K-Nearest Neighbors Overview
No ratings yet
K-Nearest Neighbors Overview
35 pages
Naïve Bayes Classification Explained
No ratings yet
Naïve Bayes Classification Explained
23 pages
Classification and Prediction Methods
No ratings yet
Classification and Prediction Methods
46 pages
Bayesian Classification Techniques Explained
No ratings yet
Bayesian Classification Techniques Explained
24 pages
Classification Algorithms Overview
No ratings yet
Classification Algorithms Overview
23 pages
Data Mining: Classification Techniques
No ratings yet
Data Mining: Classification Techniques
28 pages
Naive Bayes Algorithm Course Overview
No ratings yet
Naive Bayes Algorithm Course Overview
8 pages
Understanding Bayesian Classification Techniques
No ratings yet
Understanding Bayesian Classification Techniques
40 pages
Understanding Bayesian Classification Techniques
No ratings yet
Understanding Bayesian Classification Techniques
23 pages
Naive Bayes Classifier Overview and Applications
No ratings yet
Naive Bayes Classifier Overview and Applications
23 pages
Classification Techniques Overview
No ratings yet
Classification Techniques Overview
43 pages
Understanding Naïve Bayes Classifiers
No ratings yet
Understanding Naïve Bayes Classifiers
31 pages
ML Lec06 Supervised Learning NB
No ratings yet
ML Lec06 Supervised Learning NB
16 pages
Understanding Bayesian Classifiers
No ratings yet
Understanding Bayesian Classifiers
58 pages
Bayesian Classification of Computer Buyers
No ratings yet
Bayesian Classification of Computer Buyers
4 pages
Bayesian Concept Learning Overview
No ratings yet
Bayesian Concept Learning Overview
40 pages
Bayesian Classification in Data Mining
No ratings yet
Bayesian Classification in Data Mining
46 pages
Supervised Learning: Naïve Bayes & kNN
No ratings yet
Supervised Learning: Naïve Bayes & kNN
29 pages
Understanding Naïve Bayesian Classifier
No ratings yet
Understanding Naïve Bayesian Classifier
13 pages
Machine Learning: Classification & Clustering
No ratings yet
Machine Learning: Classification & Clustering
44 pages
Bayesian Concept Learning Overview
No ratings yet
Bayesian Concept Learning Overview
40 pages
Understanding Bayesian Classification Techniques
No ratings yet
Understanding Bayesian Classification Techniques
49 pages
Naïve Bayes Classification Explained
No ratings yet
Naïve Bayes Classification Explained
22 pages
Human Stress Detection with ML
No ratings yet
Human Stress Detection with ML
42 pages
Sklearn Modules and APIs Quick Reference
No ratings yet
Sklearn Modules and APIs Quick Reference
11 pages
Conditional Random Fields in ML
No ratings yet
Conditional Random Fields in ML
57 pages
Fake News Detection with Machine Learning
No ratings yet
Fake News Detection with Machine Learning
57 pages
Probability Analysis in Bayesian Networks
No ratings yet
Probability Analysis in Bayesian Networks
30 pages
Multinomial Naive Bayes Explained
No ratings yet
Multinomial Naive Bayes Explained
4 pages
Naïve Bayes Classifier Overview
No ratings yet
Naïve Bayes Classifier Overview
26 pages
Aerial Threat Classifier Framework
No ratings yet
Aerial Threat Classifier Framework
14 pages
Fake News Detection with ML & NLP
No ratings yet
Fake News Detection with ML & NLP
10 pages
Machine Learning Techniques Question Bank
No ratings yet
Machine Learning Techniques Question Bank
10 pages
Generative AI in Executive PG Program
No ratings yet
Generative AI in Executive PG Program
27 pages
Wine Quality Prediction Using Vivino Data
No ratings yet
Wine Quality Prediction Using Vivino Data
19 pages
Glucose Level Monitoring Arduino Glucometer
No ratings yet
Glucose Level Monitoring Arduino Glucometer
15 pages
Machine Learning for Toxic Comment Detection
No ratings yet
Machine Learning for Toxic Comment Detection
6 pages
Machine Learning Concepts and Applications
100% (1)
Machine Learning Concepts and Applications
44 pages
Probabilistic Reasoning Q&A Guide
No ratings yet
Probabilistic Reasoning Q&A Guide
5 pages
Enhanced Spam Detection with Naïve Bayes
No ratings yet
Enhanced Spam Detection with Naïve Bayes
6 pages
Subdomains of AI and Machine Learning Explained
No ratings yet
Subdomains of AI and Machine Learning Explained
16 pages
Overview of Naïve Bayes Classifier
No ratings yet
Overview of Naïve Bayes Classifier
8 pages
AI Multiple Choice Questions Overview
No ratings yet
AI Multiple Choice Questions Overview
13 pages
Patient-Centric Drug Recommendation System
No ratings yet
Patient-Centric Drug Recommendation System
55 pages
Naive Bayes and Gaussian Naive Bayes
No ratings yet
Naive Bayes and Gaussian Naive Bayes
10 pages
Medicare Fraud Detection with Big Data
No ratings yet
Medicare Fraud Detection with Big Data
25 pages
Kuleshov's Generative Models Lecture 2
No ratings yet
Kuleshov's Generative Models Lecture 2
31 pages
Understanding Logistic Regression Basics
No ratings yet
Understanding Logistic Regression Basics
44 pages
Fake Job Post Detection with ML Techniques
No ratings yet
Fake Job Post Detection with ML Techniques
9 pages
Naive Bayes Classifier: Pros and Cons
No ratings yet
Naive Bayes Classifier: Pros and Cons
7 pages
Movie Recommender System Report
No ratings yet
Movie Recommender System Report
27 pages
AIP-210 Midterm Exam Guide
No ratings yet
AIP-210 Midterm Exam Guide
125 pages
Election Data Analysis and Modeling
100% (1)
Election Data Analysis and Modeling
18 pages

Naïve Bayes Classifier Overview

Uploaded by

Naïve Bayes Classifier Overview

Uploaded by

Data Mining:

Concepts and Techniques

Jiawei Han, Micheline Kamber, and Jian Pei

• Classification: Basic Concepts

• A statistical classifier: performs probabilistic prediction, i.e., predicts

• Foundation: Based on Bayes’ Theorem.

Imagine we receive normal messages from family and friends

And we also receive spams

From friends or spam?

2023-03-12 MGSC5126: Data Mining 5

Spam Dear Friend Lunch Money

2023-03-12 MGSC5126: Data Mining 6

2023-03-12 MGSC5126: Data Mining 7

P(N)=8/12=0.67 P(S)= 0.33

What is the probability that the new mail is normal, given it

What is the probability that the new mail is spam, given it

2023-03-12 MGSC5126: Data Mining 9

P(Dear|S) = 0.29 P(S)*P(Dear|S)*P(Friend|S)

P(Lunch|N) = 0.18 P(N)*P(Lunch|N)*P(money|N)*

2023-03-12 MGSC5126: Data Mining 11

2023-03-12 MGSC5126: Data Mining 12

Give attributes, x1, x2, …, xn , and classes c1, c2, …, ck

for each of K possible outcomes or classes Ck.

2023-03-12 MGSC5126: Data Mining 13

o P(H) (prior probability): the initial probability

• Given training data X, posteriori probability of a hypothesis H,

P(H | X) = P(X | H ) P( H ) = P(X | H )´ P(H ) / P(X)

P(Ck | x1, x2, …, xn )

• Since P(X) is constant for all classes, only

P(C | X) = P(X | C )P(C )

A simplified assumption: attributes are conditionally independent (i.e.,

age income student credit_rating

• Performance: naïve Bayesian classifier, has comparable performance

• Classification: Basic Concepts

• Rule-based classifier makes the class decision by using of a

• We can express a rule in the following from:

• The rules are easily interpretable and thus these classifiers

Let us consider a rule R1,

R1: IF age = youth AND student = yes THEN buy_computer = yes

Note − We can also write rule R1 as follows:

R1: (age = youth) ^ (student = yes)) -> (buys computer = yes)

R1: IF age = youth AND student = yes THEN buy_computer = yes

R1: IF age = youth AND student = yes THEN buy_computer = yes

R1: IF income >= 30K THEN buy_computer = yes

There are several ways to build a rule-base classifier:

• Extract rules from Decision Trees

• Sequential Covering Method

2023-03-12 MGSC5126: Data Mining 31

Rules are easier to understand than large trees.

To extract a rule from a decision tree:

• To form a rule antecedent, each splitting criterion is logically ANDed.

• Rules are mutually exclusive and exhaustive.

2023-03-12 MGSC5126: Data Mining 32

student? credit rating?

no yes excellent fair

R1: IF age = young AND student = no THEN buys_computer = no

coverage(R) = ncovers /|D| /* D: training data set */

ncorrect = # of tuples correctly classified by R

accuracy(R) = ncorrect / ncovers

(n is the number of records in our sample) (student = yes) → computer-buy = No

• Sequential Covering Algorithm can be used to extract IF-THEN rules

Note: In Decision Tree Induction rules res learned simultaneously

while (enough target tuples left)

• Start with the most general rule possible: condition =

ID A B C Y If A=a1 Then Y=y1

If (Age <= 30 and income = Medium) Then No

2023-03-12 MGSC5126: Data Mining 42

If more than one rule are triggered, need conflict resolution.

• Size ordering: assign the highest priority to the triggering

The rule is pruned is due to the following reason:

• The rule is pruned by removing conjunct. The rule R is pruned, if pruned

3/12/23 Data Mining: Concepts and Techniques 44

• Classification: Basic Concepts

• Classification accuracy measures:

• Given m classes, an entry, CMi,j in a confusion matrix indicates #

Example of Confusion Matrix:

P(Dear|S) = 0.29 P(S)P(Dear|S)P(Friend|S)

P(Lunch|N) = 0.18 P(N)P(Lunch|N)P(money|N)*

• Stratified cross-validation: folds are stratified so that class