0% found this document useful (0 votes)
12 views80 pages

Naïve Bayes Classifier Overview

Chapter 8 of 'Data Mining: Concepts and Techniques' discusses classification methods, focusing on Bayesian classification and the Naïve Bayes classifier. It explains how to calculate conditional probabilities using Bayes' Theorem and demonstrates the classifier's application through examples. The chapter also addresses the performance and advantages of the Naïve Bayes classifier in comparison to other classification methods.

Uploaded by

ajithbaby06
Copyright
© All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
12 views80 pages

Naïve Bayes Classifier Overview

Chapter 8 of 'Data Mining: Concepts and Techniques' discusses classification methods, focusing on Bayesian classification and the Naïve Bayes classifier. It explains how to calculate conditional probabilities using Bayes' Theorem and demonstrates the classifier's application through examples. The chapter also addresses the performance and advantages of the Naïve Bayes classifier in comparison to other classification methods.

Uploaded by

ajithbaby06
Copyright
© All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd

Data Mining:

Concepts and Techniques


(3rd ed.)

— Chapter 8 —

Jiawei Han, Micheline Kamber, and Jian Pei


University of Illinois at Urbana-Champaign &
Simon Fraser University
©2011 Han, Kamber & Pei. All rights reserved.

2
Chapter 8. Classification: Basic Concepts

• Classification: Basic Concepts


• Decision Tree Induction
• Bayes Classification Methods
• Rule-Based Classification
• Model Evaluation and Selection
• Techniques to Improve Classification Accuracy:
Ensemble Methods
• Summary
3
Bayesian Classification: Why?

• A statistical classifier: performs probabilistic prediction, i.e., predicts


class membership probabilities

• Foundation: Based on Bayes’ Theorem.

3/12/23 4
Naïve Bayes Classifier

Imagine we receive normal messages from family and friends

And we also receive spams

From friends or spam?

2023-03-12 MGSC5126: Data Mining 5


Naïve Bayes Classifier
Total number of words = 17
Normal emails From friends
Total number of word Dear = 8
P(Dear|N) = 0.47
P(Friend|N) = 0.29
P(Lunch|N) = 0.18

P(Money|N) = 0.006

Spam Dear Friend Lunch Money

2023-03-12 MGSC5126: Data Mining 6


Naïve Bayes Classifier

P(Dear|N) = 0.47
P(Friend|N) = 0.29
P(Lunch|N) = 0.18
P(Money|N) = 0.006

P(Dear|S) = 0.29
P(Friend|S) = 0.14
P(Lunch|S) = 0
P(Money|S) = 0.57

2023-03-12 MGSC5126: Data Mining 7


Naïve Bayes Classifier

P(Dear|N) = 0.47
P(Friend|N) = 0.29
Prior Probability
P(Lunch|N) = 0.18
P(Money|N) = 0.006 P(N) = 0.67

P(N)=8/12=0.67 P(S)= 0.33

P(Dear|S) = 0.29
P(Friend|S) = 0.14
P(Lunch|S) = 0
P(Money|S) = 0.57
P(S)=4/12=0.33
2023-03-12 MGSC5126: Data Mining 8
Naïve Bayes Classifier

What is the probability that the new mail is normal, given it


starts with Dear Friend?

What is the probability that the new mail is spam, given it


starts with Dear Friend?

2023-03-12 MGSC5126: Data Mining 9


Naïve Bayes Classifier
What is the probability that the new mail
P(Dear|N) = 0.47 is normal, given it starts with Dear Friend?
P(Friend|N) = 0.29
P(Lunch|N) = 0.18
P(N)*P(Dear|N)*P(Friend|N)
P(Money|N) = 0.006 =0.67*0.47*0.29=0.09

P(N)=8/12=0.67

P(Dear|S) = 0.29 P(S)*P(Dear|S)*P(Friend|S)


P(Friend|S) = 0.14 =0.33*0.29*0.14=0.01
P(Lunch|S) = 0
P(Money|S) = 0.57 0.09 > 0.01
P(S)=4/12=0.33 à the new email is a Normal message
2023-03-12 MGSC5126: Data Mining 10
Naïve Bayes Classifier
What is the probability that the new mail
P(Dear|N) = 0.47 is normal, given it starts contains Lunch
P(Friend|N) = 0.29 money money money?

P(Lunch|N) = 0.18 P(N)*P(Lunch|N)*P(money|N)*


P(Money|N) = 0.006 P(money|N)*P(money|N)
=0.67*0.18*0.006*0.006*0.006
P(N)=8/12=0.67 = 2.60496e-08

P(N)*P(Lunch|N)*P(money|N)*
P(Dear|S) = 0.29
P(money|N)*P(money|N)
P(Friend|S) = 0.14
=0.33*0*0.006*0.006*0.006
P(Lunch|S) = 0
=0
P(Money|S) = 0.57
P(S)=4/12=0.33

2023-03-12 MGSC5126: Data Mining 11


Naïve Bayes Classifier

2023-03-12 MGSC5126: Data Mining 12


Naïve Bayes Classifier

Give attributes, x1, x2, …, xn , and classes c1, c2, …, ck


The goal of Naïve Bayes classifier is to calculate the conditional probability:
P(Ck | x1, x2, …, xn )

for each of K possible outcomes or classes Ck.

2023-03-12 MGSC5126: Data Mining 13


Bayes’ Theorem: Basics

• Bayes’ Theorem:
P(H | X) = P(X | H ) P(H ) = P(X | H )´ P(H ) / P(X)
P(X)
o Let X be a data sample (“evidence”): class label is unknown
o Let H be a hypothesis that X belongs to class C
o Classification is to determine P(H|X)

o P(H) (prior probability): the initial probability


E.g., X will buy computer, regardless of age, income, …
o P(X): probability that sample data is observed
o P(X|H) (likelihood): the probability of observing the sample X, given that
the hypothesis holds
E.g., Given that X will buy computer, the prob. that X is 31..40,
medium income

3/12/23 14
Prediction Based on Bayes’ Theorem

• Given training data X, posteriori probability of a hypothesis H,


P(H|X), follows the Bayes’ theorem

P(H | X) = P(X | H ) P( H ) = P(X | H )´ P(H ) / P(X)


P(X)
• Informally, this can be viewed as
posteriori = likelihood x prior/evidence
• Predicts X belongs to Ci iff the probability P(Ci|X) is the highest
among all the P(Ck|X) for all the k classes
• Practical difficulty: It requires initial knowledge of many
probabilities, involving significant computational cost

3/12/23 15
Classification Is to Derive the Maximum Posteriori
• Let D be a training set of tuples and their associated class labels,
and each tuple is represented by an n-D attribute vector X = (x1,
x2, …, xn)
• Suppose there are m classes C1, C2, …, Cm.
• Classification is to derive the maximum posteriori, i.e., the
maximal P(Ci|X)

P(Ck | x1, x2, …, xn )

16
Classification Is to Derive the Maximum Posteriori
• Let D be a training set of tuples and their associated class labels,
and each tuple is represented by an n-D attribute vector X = (x1,
x2, …, xn)
• Suppose there are m classes C1, C2, …, Cm.
• Classification is to derive the maximum posteriori, i.e., the
maximal P(Ci|X)
• This can be derived from Bayes’ theorem
P(X | C )P(C )
P(C | X) = i i
i P(X)

• Since P(X) is constant for all classes, only

P(C | X) = P(X | C )P(C )


needs to be maximized i i i
17
Naïve Bayes Classifier

A simplified assumption: attributes are conditionally independent (i.e.,


no dependence relation between attributes):

n
P(X | C i) = Õ P( x | C i) = P( x | C i) ´ P( x | C i) ´ ... ´ P( x | C i)
k 1 2 n
k =1

This greatly reduces the computation cost: Only counts the class
distribution

18
Naïve Bayes Classifier: Training Dataset
age income studentcredit_rating
buys_computer
<=30 high no fair no
Class: <=30 high no excellent no
C1:buys_computer = ‘yes’ 31…40 high no fair yes
C2:buys_computer = ‘no’ >40 medium no fair yes
>40 low yes fair yes
>40 low yes excellent no
Data to be classified: 31…40 low yes excellent yes
X = (age <=30, <=30 medium no fair no
Income = medium, <=30 low yes fair yes
Student = yes >40 medium yes fair yes
Credit_rating = Fair) <=30 medium yes excellent yes
31…40 medium no excellent yes
P(C1|X) 31…40 high yes fair yes
P(C2|X) >40 medium no excellent no
3/12/23 19
Naïve Bayes Classifier: Training Dataset

age income student credit_rating


buys_computer
Assumption: <=30 high no fair no
<=30 high no excellent no
age, income, and credit_rating,
31…40 high no fair yes
student are conditionally independent >40 medium no fair yes
>40 low yes fair yes
>40 low yes excellent no
31…40 low yes excellent yes
<=30 medium no fair no
<=30 low yes fair yes
>40 medium yes fair yes
<=30 medium yes excellent yes
31…40 medium no excellent yes
31…40 high yes fair yes
>40 medium no excellent no

3/12/23
Naïve Bayes Classifier: An Example
age income studentcredit_rating
buys_compute
<=30 high no fair no
• P(Ci): P(buys_computer = “yes”) = 9/14 = 0.643 <=30 high no excellent no
31…40 high no fair yes
P(buys_computer = “no”) = 5/14= 0.357 >40 medium no fair yes
>40 low yes fair yes
• Compute P(X|Ci) for each class >40 low yes excellent no
P(age = “<=30” | buys_computer = “yes”) = 2/9 = 0.222 31…40 low yes excellent yes
<=30 medium no fair no
P(age = “<= 30” | buys_computer = “no”) = 3/5 = 0.6
<=30 low yes fair yes
P(income = “medium” | buys_computer = “yes”) = 4/9 = 0.444>40 medium yes fair yes
P(income = “medium” | buys_computer = “no”) = 2/5 = 0.4 <=30 medium yes excellent yes
31…40 medium no excellent yes
P(student = “yes” | buys_computer = “yes) = 6/9 = 0.667 31…40 high yes fair yes
P(student = “yes” | buys_computer = “no”) = 1/5 = 0.2 >40 medium no excellent no
P(credit_rating = “fair” | buys_computer = “yes”) = 6/9 = 0.667
P(credit_rating = “fair” | buys_computer = “no”) = 2/5 = 0.4
• X = (age <= 30 , income = medium, student = yes, credit_rating = fair)
P(X|Ci) : P(X|buys_computer = “yes”) = 0.222 x 0.444 x 0.667 x 0.667 = 0.044
P(X|buys_computer = “no”) = 0.6 x 0.4 x 0.2 x 0.4 = 0.019
P(X|Ci)*P(Ci) : P(X|buys_computer = “yes”) * P(buys_computer = “yes”) = 0.028
P(X|buys_computer = “no”) * P(buys_computer = “no”) = 0.007
3/12/23 Therefore, X belongs to class (“buys_computer = yes”) 21
Naïve Bayes Classifier
• If Ak is categorical, P(xk|Ci) is the # of tuples in Ci having value xk
for Ak divided by |Ci, D| (# of tuples of Ci in D)
• If Ak is continous-valued, P(xk|Ci) is usually computed based on
Gaussian distribution with a mean μ and standard deviation σ
and P(xk|Ci) is
( x-µ )2
1 -
g ( x, µ , s ) = e 2s 2

2p s

P ( X | C i ) = g ( x k , µ C i , s Ci )

22
Avoiding the Zero-Probability Problem
• Naïve Bayesian prediction requires each conditional prob. be
non-zero. Otherwise, the predicted prob. will be zero
n
P( X | C i) = Õ P( xk | C i)
k =1
• Ex. Suppose a dataset with 1000 tuples, income=low (0),
income= medium (990), and income = high (10)
• Use Laplacian correction (or Laplacian estimator)
• Adding 1 to each case
Prob(income = low) = 1/1003
Prob(income = medium) = 991/1003
Prob(income = high) = 11/1003
• The “corrected” prob. estimates are close to their
“uncorrected” counterparts
23 23
Bayesian Classification: Why?

• Performance: naïve Bayesian classifier, has comparable performance


with decision tree and selected neural network classifiers
• Incremental: Each training example can incrementally
increase/decrease the probability that a hypothesis is correct — prior
knowledge can be combined with observed data
• Standard: Even when Bayesian methods are computationally
intractable, they can provide a standard of optimal decision making
against which other methods can be measured

3/12/23 24
Naïve Bayes Classifier: Comments
• Advantages
o Easy to implement
o Good results obtained in most of the cases
• Disadvantages
o Assumption: class conditional independence, therefore loss of
accuracy
o Practically, dependencies exist among variables
E.g., hospitals: patients: Profile: age, family history, etc.
Symptoms: fever, cough etc., Disease: lung cancer,
diabetes, etc.
Dependencies among these cannot be modeled by Naïve Bayes
Classifier
• How to deal with these dependencies? Bayesian Belief Networks
(Chapter 9)

3/12/23 25
Chapter 8. Classification: Basic Concepts

• Classification: Basic Concepts


• Decision Tree Induction
• Bayes Classification Methods
• Rule-Based Classification
• Model Evaluation and Selection
• Techniques to Improve Classification Accuracy:
Ensemble Methods
• Summary
26
Rule-based Classification

• Rule-based classifier makes the class decision by using of a


set of IF-THEN rules.

• We can express a rule in the following from:


IF condition THEN conclusion

• The rules are easily interpretable and thus these classifiers


are generally used to generate descriptive models

3/12/23 27
Rule-based Classification

Let us consider a rule R1,

R1: IF age = youth AND student = yes THEN buy_computer = yes

Note − We can also write rule R1 as follows:

R1: (age = youth) ^ (student = yes)) -> (buys computer = yes)

3/12/23 28
Rule-based Classification

R1: IF age = youth AND student = yes THEN buy_computer = yes

Points to remember:
• The IF part of the rule is called rule antecedent or precondition.
• The THEN part of the rule is called rule consequent.
• The antecedent part the condition consist of one or more attribute tests
and these tests are logically ANDed.
• The consequent part consists of class prediction.

3/12/23 29
Rule-based Classification

R1: IF age = youth AND student = yes THEN buy_computer = yes

Antecedent Consequent

R1: IF income >= 30K THEN buy_computer = yes

3/12/23 30
Rule Extraction

There are several ways to build a rule-base classifier:

• Extract rules from Decision Trees

• Sequential Covering Method

• NeroFuzzy Systems

2023-03-12 MGSC5126: Data Mining 31


Rule Extraction from a Decision Tree

Rules are easier to understand than large trees.

To extract a rule from a decision tree:


• One rule is created for each path from the root to the leaf node.

• To form a rule antecedent, each splitting criterion is logically ANDed.


• The leaf node holds the class prediction, forming the rule consequent.

• Rules are mutually exclusive and exhaustive.

2023-03-12 MGSC5126: Data Mining 32


Rule Extraction from a Decision Tree
R2

R1 age?

<=30 R5
31..40 >40

student? credit rating?


yes

no yes excellent fair

no yes no yes

R1: IF age = young AND student = no THEN buys_computer = no


R2: IF age = young AND student = yes THEN buys_computer = yes
R3: IF age = mid-age THEN buys_computer = yes
R4: IF age = old AND credit_rating = excellent THEN buys_computer = no
R5: IF age = old AND credit_rating = fair THEN buys_computer = yes

3/12/23 33
Assessment of a rule: coverage and accuracy

• coverage
ncovers = # of tuples covered by R

coverage(R) = ncovers /|D| /* D: training data set */

• Accuracy

ncorrect = # of tuples correctly classified by R

accuracy(R) = ncorrect / ncovers

3/12/23 34
Assessment of a rule: coverage and accuracy
Quality of a classification rule can be evaluated age income studentcredit_rating
buys_compute
by: <=30 high no fair no
<=30 high no excellent no
• Coverage: fraction of records that satisfy 31…40 high no fair yes
the antecedent of a rule (LHS) >40 medium no fair yes
>40 low yes fair yes
>40 low yes excellent no
Coverage(r) = 31…40 low yes excellent yes
<=30 medium no fair no
<=30 low yes fair yes
• Accuracy: fraction of records covered by >40 medium yes fair yes
the rule that belong to the predicted
class (LHS and RHS) <=30 medium yes excellent yes
31…40 medium no excellent yes
∩ 31…40 high yes fair yes
Accuracy(r) =
>40 medium no excellent no

(n is the number of records in our sample) (student = yes) → computer-buy = No


Coverage = 50%, Accuracy = 14%

3/12/23 35
Rule Induction: Sequential Covering Method

• Sequential Covering Algorithm can be used to extract IF-THEN rules


form the training data.
• Generating a decision tree is not required.
• Rules are learned sequentially, each for a given class Ci will cover
many tuples of Ci but none (or few) of the tuples of other classes
• Typical sequential covering algorithms: FOIL, AQ, CN2, RIPPER

3/12/23 36
Rule Induction: Sequential Covering Method
Rule induction creates a model built from if–then–else.

Steps:
• Rules are learned one at a time
• Each time a rule is learned, the tuples covered by the rules are
removed
• Repeat the process on the remaining tuples until termination
condition, e.g., when no more training examples or when the quality
of a rule returned is below a user-specified threshold

Note: In Decision Tree Induction rules res learned simultaneously


but in Sequential Covering Method rules are learned one at a time
3/12/23 37
Sequential Covering Algorithm

while (enough target tuples left)


generate a rule
remove positive target tuples satisfying this rule

Examples covered
Examples covered by Rule 2
by Rule 1 Examples covered
by Rule 3

Positive
examples

3/12/23 38
How to Learn-One-Rule?

• Start with the most general rule possible: condition =


empty
• Adding new attributes by adopting a greedy depth-first
strategy
• Picks the one that most improves the rule quality
• Rule-Quality measures: consider both coverage and
accuracy

39 39
Example 1

ID A B C Y If ? Then Y=y1
1 a1 b1 c1 y1
A=a1 Accuracy(A=a1,Y=y1)=ncorrect/ncover=2/3
2 a1 b2 c1 y1 A=a2 Accuracy(A=a2,Y=y1)=ncorrect/ncover=0/2
3 a1 b2 c2 y2 A=a3 Accuracy(A=a3,Y=y1)=ncorrect/ncover=2/5
4 a2 b1 c1 y2 B=b1 Accuracy(B=b1,Y=y1)=ncorrect/ncover=3/5
5 a2 b2 c3 y2 B=b2 Accuracy(B=b2,Y=y1)=ncorrect/ncover=1/5
C=c1 Accuracy(C=c1,Y=y1)=ncorrect/ncover=3/5
6 a3 b1 c2 y2 C=c2 Accuracy(C=c2,Y=y1)=ncorrect/ncover=0/2
7 a3 b1 c1 y1 C=c3 Accuracy(C=c3,Y=y1)=ncorrect/ncover=1/3
8 a3 b2 c1 y2
9 a3 b2 c3 y2 If A=a1 Then Y=y1
10 a3 b1 c3 y1
2023-03-12 MGSC5126: Data Mining 40
Example 2

ID A B C Y If A=a1 Then Y=y1


1 a1 b1 c1 y1
2 a1 b2 c1 y1 B=b1 Accuracy(A=a1 and B=b1,Y=y1)=ncorrect/ncover=1/1
B=b2 Accuracy(A=a1 and B=b2,Y=y1)=ncorrect/ncover=1/2
3 a1 b2 c2 y2 C=c1 Accuracy(A=a1 and C=c1,Y=y1)=ncorrect/ncover=2/2
4 a2 b1 c1 y2 C=c2 Accuracy(A=a1 and C=c2,Y=y1)=ncorrect/ncover=0/1
C=c3 Accuracy(A=a1 and C=c3,Y=y1)=ncorrect/ncover=0/0
5 a2 b2 c3 y2
6 a3 b1 c2 y2
7 a3 b1 c1 y1
If (A=a1 and C=c1) Then Y=y1
8 a3 b2 c1 y2
9 a3 b2 c3 y2
10 a3 b1 c3 y1
2023-03-12 MGSC5126: Data Mining 41
Conflict Resolution

If (Age <= 30 and income = Medium and credit-rating = Fair) Then Yes

If (Age <= 30 and income = Medium) Then No

2023-03-12 MGSC5126: Data Mining 42


Conflict Resolution

If more than one rule are triggered, need conflict resolution.


Conflict resolution can be done by:

• Size ordering: assign the highest priority to the triggering


rules that has the “toughest” requirement (i.e., with the
most attribute tests)
• Class-based ordering: decreasing order of prevalence or
misclassification cost per class (give priority to classes)
• Rule-based ordering (decision list): rules are organized
into one long priority list, according to some measure of
rule quality or by experts
3/12/23 43
Rule Pruning

The rule is pruned is due to the following reason:


• The rule may perform well on training data but less well on subsequent
data. That's why the rule pruning is required.

• The rule is pruned by removing conjunct. The rule R is pruned, if pruned


version of R has greater quality than what was assessed on an independent
set of tuples.

3/12/23 Data Mining: Concepts and Techniques 44


Chapter 8. Classification: Basic Concepts

• Classification: Basic Concepts


• Decision Tree Induction
• Bayes Classification Methods
• Rule-Based Classification
• Model Evaluation and Selection
• Techniques to Improve Classification Accuracy:
Ensemble Methods
• Summary
45
Model Evaluation and Selection

• Classification accuracy measures:


o Confusion matrix
o Accuracy and error rate
o Sensitivity and Specificity
• Methods for comparing classifiers:
o Confidence intervals
o Cost-benefit analysis and ROC Curves

46
Classifier Evaluation Metrics: Confusion Matrix

Confusion Matrix:
Actual class\Predicted class C1 ¬ C1
C1 True Positives (TP) False Negatives (FN)
¬ C1 False Positives (FP) True Negatives (TN)

• Given m classes, an entry, CMi,j in a confusion matrix indicates #


of tuples in class i that were labeled by the classifier as class j
• May have extra rows/columns to provide totals

47
Classifier Evaluation Metrics: Confusion Matrix

Example of Confusion Matrix:

Actual class\Predicted buy_computer = buy_computer = Total


class yes no
buy_computer = yes 6954 46 7000
buy_computer = no 412 2588 3000
Total 7366 2634 10000

48
Classifier Evaluation Metrics: Accuracy, Error
Rate

A\P C ¬C

C TP FN P

¬C FP TN N

P’ N’ All

• Classifier Accuracy, or recognition rate: percentage of test set tuples that


are correctly classified
Accuracy = (TP + TN)/All
• Error rate: 1 – accuracy, or
Error rate = (FP + FN)/All

49
Classifier Evaluation Metrics: Accuracy, Error
Rate, Sensitivity and Specificity
A\P C ¬C
C TP FN P
¬C FP TN N
P’ N’ All

Sensitivity: True Positive recognition rate


Sensitivity = TP/P
Specificity: True Negative recognition rate
Specificity = TN/N

50
Class Imbalance Problem

A\P C ¬C

C TP FN P

¬C FP TN N

P’ N’ All

Class Imbalance Problem:


n One class may be rare, e.g. fraud, or HIV-positive

n Significant majority of the negative class and minority of the

positive class

51
Classifier Evaluation Metrics:
Precision and Recall, and F-measures
• Precision: exactness – what % of tuples that the classifier
labeled as positive are actually positive

• Recall: completeness – what % of positive tuples did the


classifier label as positive?
• Perfect score is 1.0
• Inverse relationship between precision & recall
• F measure (F1 or F-score): harmonic mean of precision and
recall,

• Fß: weighted measure of precision and recall


o assigns ß times as much weight to recall as to precision

52
Classifier Evaluation Metrics: Example

Actual Class\Predicted class cancer = yes cancer = no Total Recognition(%)


cancer = yes 90 210 300 30.00 (sensitivity
cancer = no 140 9560 9700 98.56 (specificity)
Total 230 9770 10000 96.40 (accuracy)

• Precision = 90/230 = 39.13% Recall = 90/300 = 30.00%

53
Holdout & Cross-Validation Methods

Holdout method

• Given data is randomly partitioned into two independent


sets
o Training set (e.g., 2/3) for model construction
o Test set (e.g., 1/3) for accuracy estimation

• Random sampling: a variation of holdout


o Repeat holdout k times, accuracy = avg. of the accuracies obtained

54
Holdout & Cross-Validation Methods
Cross-validation (k-fold, where k = 10 is most popular)

• Randomly partition the data into k mutually exclusive subsets,


each approximately equal size

• At i-th iteration, use Di as test set and others as training set

• Leave-one-out: k folds where k = # of tuples, for small sized


data

• *Stratified cross-validation*: folds are stratified so that class


dist. in each fold is approx. the same as that in the initial data

55
Bootstrap Method
Bootstrap
• Works well with small data sets
• Samples the given training tuples uniformly with
replacement
• i.e., each time a tuple is selected, it is equally
likely to be selected again and re-added to the
training set

56
Bootstrap Method
Several bootstrap methods, and a common one is .632 boostrap
• A data set with d tuples is sampled d times, with
replacement, resulting in a training set of d samples. The
data tuples that did not make it into the training set end up
forming the test set. About 63.2% of the original data end
up in the bootstrap, and the remaining 36.8% form the test
set (since (1 – 1/d)d ≈ e-1 = 0.368)
• Repeat the sampling procedure k times, overall accuracy of
the model:

57
Estimating Confidence Intervals:
Classifier Models M1 vs. M2

• Suppose we have 2 classifiers, M1 and M2, which one is better?

• Use 10-fold cross-validation to obtain and

• These mean error rates are just estimates of error on the true population of
future data cases

• What if the difference between the 2 error rates is just attributed to chance?
o Use a test of statistical significance
o Obtain confidence limits for our error estimates

58
Estimating Confidence Intervals:
Null Hypothesis

• Perform 10-fold cross-validation

• Assume samples follow a t distribution with k–1 degrees of freedom (here, k=10)
• Use t-test (or Student’s t-test)
• Null Hypothesis: M1 & M2 are the same

• If we can reject null hypothesis, then


o we conclude that the difference between M1 & M2 is statistically significant
o Chose model with lower error rate

59
Estimating Confidence Intervals: t-test
• If only 1 test set available: pairwise comparison
o For ith round of 10-fold cross-validation, the same cross partitioning is used to
obtain err(M1)i and err(M2)i
o Average over 10 rounds to get and
o t-test computes t-statistic with k-1 degrees of freedom:
where

• If two test sets available: use non-paired t-test


where

where k1 & k2 are # of cross-validation samples used for M1 & M2, resp.

60
Estimating Confidence Intervals:
Table for t-distribution

• Symmetric
• Significance level,
e.g., sig = 0.05 or 5%
means M1 & M2 are
significantly different
for 95% of
population
• Confidence limit, z =
sig/2

61
Estimating Confidence Intervals:
Statistical Significance

• Are M1 & M2 significantly different?


o Compute t. Select significance level (e.g. sig = 5%)
o Consult table for t-distribution: Find t value corresponding to k-1 degrees of
freedom (here, 9)
o t-distribution is symmetric: typically upper % points of distribution shown →
look up value for confidence limit z=sig/2 (here, 0.025)
o If t > z or t < -z, then t value lies in rejection region:
Reject null hypothesis that mean error rates of M1 & M2 are same
Conclude: statistically significant difference between M1 & M2
o Otherwise, conclude that any difference is chance

62
Model Selection: ROC Curves
• ROC (Receiver Operating
Characteristics) curves: for visual
comparison of classification models
• Originated from signal detection theory
• Shows the trade-off between the true
positive rate and the false positive rate
• The area under the ROC curve is a n Vertical axis
measure of the accuracy of the model represents the true
• Rank the test tuples in decreasing positive rate
order: the one that is most likely to n Horizontal axis rep.
belong to the positive class appears at the false positive rate
the top of the list n The plot also shows a
• The closer to the diagonal line (i.e., the diagonal line
closer the area is to 0.5), the less n A model with perfect
accurate is the model accuracy will have an
area of 1.0
63
Predictor Error Measures

Measure predictor accuracy: measure how far off the predicted value is from the actual
known value
• Loss function: measures the error betw. yi and the predicted value yi’
• Absolute error: | yi – yi’|
• Squared error: (yi – yi’)2
• Test error (generalization error): the average loss over the test set d
d
å(y - yi ' ) 2
• Mean absolute error:å | yi - yi ' | i

i =1
Mean squared error: i =1

d
d
d d
å | y - y '|
i =1
i i
å ( yi - yi ' ) 2
• Relative absolute error: | y - y |
d
Relative squared error: i =1
å d

å ( y - y)
i
2
i =1
i
i =1

The mean squared-error exaggerates the presence of outliers


Popularly use (square) root mean-square error, similarly, root relative squared error

64
Issues Affecting Model Selection
• Accuracy
• classifier accuracy: predicting class label
• Speed
• time to construct the model (training time)
• time to use the model (classification/prediction time)
• Robustness: handling noise and missing values
• Scalability: efficiency in disk-resident databases
• Interpretability
• understanding and insight provided by the model
• Other measures, e.g., goodness of rules, such as decision tree size or
compactness of classification rules

65
Chapter 8. Classification: Basic Concepts

• Classification: Basic Concepts


• Decision Tree Induction
• Bayes Classification Methods
• Rule-Based Classification
• Model Evaluation and Selection
• Techniques to Improve Classification Accuracy:
Ensemble Methods
• Summary
66
Ensemble Methods: Increasing the Accuracy

Ensemble methods
• Use a combination of models to increase accuracy
• Combine a series of k learned models, M1, M2, …, Mk, with
the aim of creating an improved model M*

New Data
Data Sample
Classifier 1
Combined Class
Classifier 2
votes Prediction
.
.
Classifier T

67
Ensemble Methods: Increasing the Accuracy

Popular ensemble methods

• Bagging: averaging the prediction over a collection of


classifiers
• Boosting: weighted vote with a collection of classifiers
• Ensemble: combining a set of heterogeneous classifiers

68
Bagging: Boostrap Aggregation
• Analogy: Diagnosis based on multiple doctors’ majority vote
• Training
o Given a set D of d tuples, at each iteration i, a training set Di of d tuples is sampled with
replacement from D (i.e., bootstrap)
o A classifier model Mi is learned for each training set Di
• Classification: classify an unknown sample X
o Each classifier Mi returns its class prediction
o The bagged classifier M* counts the votes and assigns the class with the most votes to X
• Prediction: can be applied to the prediction of continuous values by taking the average value
of each prediction for a given test tuple
• Accuracy
o Often significantly better than a single classifier derived from D
o For noise data: not considerably worse, more robust
o Proved improved accuracy in prediction

69
Boosting
• Analogy: Consult several doctors, based on a combination of weighted
diagnoses—weight assigned based on the previous diagnosis accuracy
• How boosting works?
o Weights are assigned to each training tuple
o A series of k classifiers is iteratively learned
o After a classifier Mi is learned, the weights are updated to allow the
subsequent classifier, Mi+1, to pay more attention to the training
tuples that were misclassified by Mi
o The final M* combines the votes of each individual classifier, where
the weight of each classifier's vote is a function of its accuracy
• Boosting algorithm can be extended for numeric prediction
• Comparing with bagging: Boosting tends to have greater accuracy, but it
also risks overfitting the model to misclassified data

70
Adaboost (Freund and Schapire, 1997)

• Given a set of d class-labeled tuples, (X1, y1), …, (Xd, yd)


• Initially, all the weights of tuples are set the same (1/d)
• Generate k classifiers in k rounds. At round i,
• Tuples from D are sampled (with replacement) to form a training set Di
of the same size
• Each tuple’s chance of being selected is based on its weight
• A classification model Mi is derived from Di
• Its error rate is calculated using Di as a test set
• If a tuple is misclassified, its weight is increased, o.w. it is decreased
• Error rate: err(Xj) is the misclassification error of tuple Xj. Classifier Mi
error rate is the sum of the weightsd of the misclassified tuples:
error ( M i ) = å w j ´ err ( X j )
j

1 - error ( M i )
log
• The weight of classifier Mi’s vote is error ( M i )
71
Random Forest (Breiman 2001)

Random Forest:
• Each classifier in the ensemble is a decision tree classifier and is
generated using a random selection of attributes at each node to
determine the split

• During classification, each tree votes and the most popular class is
returned

72
Random Forest (Breiman 2001)
Two Methods to construct Random Forest:

• Forest-RI (random input selection): Randomly select, at each


node, F attributes as candidates for the split at the node. The CART
methodology is used to grow the trees to maximum size

• Forest-RC (random linear combinations): Creates new attributes


(or features) that are a linear combination of the existing
attributes (reduces the correlation between individual classifiers)

73
Classification of Class-Imbalanced Data Sets

• Class-imbalance problem: Rare positive example but numerous negative


ones, e.g., medical diagnosis, fraud, oil-spill, fault, etc.

• Traditional methods assume a balanced distribution of classes and equal


error costs: not suitable for class-imbalanced data

74
Classification of Class-Imbalanced Data Sets
Typical methods for imbalance data in 2-class classification:

• Oversampling: re-sampling of data from positive class


• Under-sampling: randomly eliminate tuples from negative class
• Threshold-moving: moves the decision threshold, t, so that the rare class
tuples are easier to classify, and hence, less chance of costly false negative
errors
• Ensemble techniques: Ensemble multiple classifiers introduced above

Still difficult for class imbalance problem on multiclass tasks

75
Summary (I)

• Classification is a form of data analysis that extracts models describing


important data classes.
• Effective and scalable methods have been developed for decision tree
induction, Naive Bayesian classification, rule-based classification, and
many other classification methods.
• Evaluation metrics include: accuracy, sensitivity, specificity, precision,
recall, F measure, and Fß measure.
• Stratified k-fold cross-validation is recommended for accuracy estimation.
Bagging and boosting can be used to increase overall accuracy by learning
and combining a series of individual models.

76
Summary (II)

• Significance tests and ROC curves are useful for model selection.

• There have been numerous comparisons of the different


classification methods; the matter remains a research topic
• No single method has been found to be superior over all others
for all data sets
• Issues such as accuracy, training time, robustness, scalability, and
interpretability must be considered and can involve trade-offs,
further complicating the quest for an overall superior method

77
References (1)

• C. Apte and S. Weiss. Data mining with decision trees and decision rules. Future
Generation Computer Systems, 13, 1997
• C. M. Bishop, Neural Networks for Pattern Recognition. Oxford University Press,
1995
• L. Breiman, J. Friedman, R. Olshen, and C. Stone. Classification and Regression Trees.
Wadsworth International Group, 1984
• C. J. C. Burges. A Tutorial on Support Vector Machines for Pattern Recognition. Data
Mining and Knowledge Discovery, 2(2): 121-168, 1998
• P. K. Chan and S. J. Stolfo. Learning arbiter and combiner trees from partitioned data
for scaling machine learning. KDD'95
• H. Cheng, X. Yan, J. Han, and C.-W. Hsu, Discriminative Frequent Pattern Analysis for
Effective Classification, ICDE'07
• H. Cheng, X. Yan, J. Han, and P. S. Yu, Direct Discriminative Pattern Mining for
Effective Classification, ICDE'08
• W. Cohen. Fast effective rule induction. ICML'95
• G. Cong, K.-L. Tan, A. K. H. Tung, and X. Xu. Mining top-k covering rule groups for
gene expression data. SIGMOD'05

78
References (2)
• A. J. Dobson. An Introduction to Generalized Linear Models. Chapman & Hall, 1990.
• G. Dong and J. Li. Efficient mining of emerging patterns: Discovering trends and
differences. KDD'99.
• R. O. Duda, P. E. Hart, and D. G. Stork. Pattern Classification, 2ed. John Wiley, 2001
• U. M. Fayyad. Branching on attribute values in decision tree generation. AAAI’94.
• Y. Freund and R. E. Schapire. A decision-theoretic generalization of on-line learning and
an application to boosting. J. Computer and System Sciences, 1997.
• J. Gehrke, R. Ramakrishnan, and V. Ganti. Rainforest: A framework for fast decision tree
construction of large datasets. VLDB’98.
• J. Gehrke, V. Gant, R. Ramakrishnan, and W.-Y. Loh, BOAT -- Optimistic Decision Tree
Construction. SIGMOD'99.
• T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning: Data
Mining, Inference, and Prediction. Springer-Verlag, 2001.
• D. Heckerman, D. Geiger, and D. M. Chickering. Learning Bayesian networks: The
combination of knowledge and statistical data. Machine Learning, 1995.
• W. Li, J. Han, and J. Pei, CMAR: Accurate and Efficient Classification Based on Multiple
Class-Association Rules, ICDM'01.

79
References (3)

• T.-S. Lim, W.-Y. Loh, and Y.-S. Shih. A comparison of prediction accuracy, complexity,
and training time of thirty-three old and new classification algorithms. Machine
Learning, 2000.
• J. Magidson. The Chaid approach to segmentation modeling: Chi-squared automatic
interaction detection. In R. P. Bagozzi, editor, Advanced Methods of Marketing
Research, Blackwell Business, 1994.
• M. Mehta, R. Agrawal, and J. Rissanen. SLIQ : A fast scalable classifier for data mining.
EDBT'96.
• T. M. Mitchell. Machine Learning. McGraw Hill, 1997.
• S. K. Murthy, Automatic Construction of Decision Trees from Data: A Multi-
Disciplinary Survey, Data Mining and Knowledge Discovery 2(4): 345-389, 1998
• J. R. Quinlan. Induction of decision trees. Machine Learning, 1:81-106, 1986.
• J. R. Quinlan and R. M. Cameron-Jones. FOIL: A midterm report. ECML’93.
• J. R. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann, 1993.
• J. R. Quinlan. Bagging, boosting, and c4.5. AAAI'96.
80
References (4)
• R. Rastogi and K. Shim. Public: A decision tree classifier that integrates building and
pruning. VLDB’98.
• J. Shafer, R. Agrawal, and M. Mehta. SPRINT : A scalable parallel classifier for data
mining. VLDB’96.
• J. W. Shavlik and T. G. Dietterich. Readings in Machine Learning. Morgan Kaufmann,
1990.
• P. Tan, M. Steinbach, and V. Kumar. Introduction to Data Mining. Addison Wesley, 2005.
• S. M. Weiss and C. A. Kulikowski. Computer Systems that Learn: Classification and
Prediction Methods from Statistics, Neural Nets, Machine Learning, and Expert
Systems. Morgan Kaufman, 1991.
• S. M. Weiss and N. Indurkhya. Predictive Data Mining. Morgan Kaufmann, 1997.
• I. H. Witten and E. Frank. Data Mining: Practical Machine Learning Tools and
Techniques, 2ed. Morgan Kaufmann, 2005.
• X. Yin and J. Han. CPAR: Classification based on predictive association rules. SDM'03
• H. Yu, J. Yang, and J. Han. Classifying large data sets using SVM with hierarchical
clusters. KDD'03.

81

You might also like