Classification-
Decision Tree
Classification by Decision Tree
Induction
• Decision tree
– A flow-chart-like tree structure
– Internal node denotes a test on an attribute
– Branch represents an outcome of the test
– Leaf nodes represent class labels or class distribution
– The topmost node in the tree is the root node.
• Decision tree generation consists of two phases
– Tree construction
• At start, all the training examples are at the root
• Partition examples recursively based on selected attributes
– Tree pruning
• Identify and remove branches that reflect noise or outliers
• Use of decision tree: Classifying an unknown sample
– Test the attribute values of the sample against the decision tree
Decision Tree for PlayTennis
Decision trees classify instances or examples by starting at the root of the tree
and moving through it until a leaf node.
Outlook (1) Which to start? (root)
Sunny Overcast Rain
Humidity Each internal node (2) Which node to
tests an attribute proceed?
High Normal Each branch corresponds to an
attribute value node
No Yes Each leaf node assigns a classification
(3) When to stop/ come to conclusion?
When to consider Decision Trees
• Instances describable by attribute-value pairs
• Target function is discrete valued
• Disjunctive hypothesis may be required
• Possibly noisy training data
• Missing attribute values
• Examples:
– Medical diagnosis
– Credit risk analysis
– Object classification for robot manipulator (Tan
1993)
Basket ball data
What we know ?
• The game will be away, at 9pm, and that Joe will play
center on offense…
• A classification problem
• Generalizing the learned rule to new examples
• What you don't know, of course, is who will win this game.
• Of course, it is reasonable to assume that this future game will
resemble the past games. Note, however, there are no previous games
that match these specific values -- ie, no previous game was exactly
[Where=Away, When=9pm, FredStarts=No, JoeOffense=Center,
JoeDefends=Forward, OppC=Tall].
We therefore need to generalize -- by using the known examples to infer
the likely outcome of this new situation. But how?
Use a Decision Tree to determine who should win the game
As we did not indicate the outcome of this game we call this an
"unlabeled instance"; the goal of a classifier is finding the class label for
such unlabeled instances.
An instance that also includes the outcome is called a "labeled instance" ---
eg, the first row of the table
corresponds to the labeled instance
Decision Trees
In general, a decision tree is a tree structure; see left-hard
figure below.
Example of a Decision Tree
Tid Refund Marital Taxable
Status Income Cheat
Refund
1 Yes Single 125K No
Yes No
2 No Married 100K No
3 No Single 70K No NO MarSt
4 Yes Married 120K No Single, Divorced Married
5 No Divorced 95K Yes
TaxInc NO
6 No Married 60K No
< 80K > 80K
7 Yes Divorced 220K No
8 No Single 85K Yes NO YES
9 No Married 75K No
10 No Single 90K Yes
10
Training Data Model: Decision Tree
Apply Model to Test Data
Test Data
Start at the root of tree Refund Marital Taxable
Status Income Cheat
No Married 80K ?
Refund 10
Yes No
NO MarSt
Single, Divorced Married
TaxInc NO
< 80K > 80K
NO YES
Apply Model to Test Data
Test Data
Refund Marital Taxable
Status Income Cheat
No Married 80K ?
Refund 10
Yes No
NO MarSt
Single, Divorced Married
TaxInc NO
< 80K > 80K
NO YES
Apply Model to Test Data
Test Data
Refund Marital Taxable
Status Income Cheat
No Married 80K ?
Refund 10
Yes No
NO MarSt
Single, Divorced Married
TaxInc NO
< 80K > 80K
NO YES
Apply Model to Test Data
Test Data
Refund Marital Taxable
Status Income Cheat
No Married 80K ?
Refund 10
Yes No
NO MarSt
Single, Divorced Married
TaxInc NO
< 80K > 80K
NO YES
Apply Model to Test Data
Test Data
Refund Marital Taxable
Status Income Cheat
No Married 80K ?
Refund 10
Yes No
NO MarSt
Single, Divorced Married
TaxInc NO
< 80K > 80K
NO YES
Apply Model to Test Data
Test Data
Refund Marital Taxable
Status Income Cheat
No Married 80K ?
Refund 10
Yes No
NO MarSt
Single, Divorced Married Assign Cheat to “No”
TaxInc NO
< 80K > 80K
NO YES
Decision Tree Algorithm
Principle
‒ Basic algorithm (adopted by ID3, C4.5 and CART): a greedy algorithm
‒ Tree is constructed in a top-down recursive divide-and-conquer manner
‒ Attributes are categorical (if continuous-valued, they are discretized in
advance)
‒ Choose the best attribute(s) to split the remaining instances and make
that attribute a decision node
Iterations
‒ At start, all the training tuples are at the root
‒ Tuples are partitioned recursively based on selected attributes
‒ Test attributes are selected on the basis of a heuristic or statistical
measure (e.g., information gain)
Stopping conditions
‒ All samples for a given node belong to the same class
‒ There are no remaining attributes for further partitioning – majority
voting is employed for classifying the leaf
‒ There are no samples left
Example
Example
Example
Three Possible Partition Scenarios
Attribute Selection Measures
How to choose An Attribute?
• An attribute selection measure is a heuristic for selecting the splitting
criterion that “best” separates a given data partition, D, of class
labeled training tuples into individual classes.
Ideally
‒ Each resulting partition would be pure
‒ A pure partition is a partition containing tuples that all belong to the
same class
• Attribute selection measures (splitting rules)
‒ Determine how the tuples at a given node are to be split
‒ Provide ranking for each attribute describing the tuples
‒ The attribute with highest score is chosen
‒ Determine a split point or a splitting subset
• Methods
– Information gain (ID3 (Iterative Dichotomiser 3) /C4.5)
– Gain ratio
– Gini Index (IBM IntelligentMiner)
Information Gain
• Information Gain: Information Gain is the
decrease/increase in Entropy value when
the node is split.
• An attribute should have the highest
information gain to be selected for splitting.
• Based on the computed values of Entropy
and Information Gain, we choose the best
attribute at any particular step.
22
ID3 (Iterative Dichotomiser 3)
— This uses entropy and information gain as metric.
1st approach: Information Gain Approach
Information Gain Approach
Information Gain Approach
Assume there are two classes, P and N
Let the set of examples D contain p elements of class P
and n elements of class N
The amount of information, needed to decide if an
arbitrary example in D belongs to P or N is defined as
p p n n
Info(D) = I ( p, n) log2 log2
pn pn pn pn
log2x=log10x/log102
Info(D): Example
Information Gain in Attribute
Information Gain in Attribute
• Assume that using attribute A a set D will be
partitioned into sets {D1, D2 , …, Dv}
– If D contains pi examples of P and ni examples of N, the
entropy, or the expected information needed to classify
objects in all subtrees Si is
pi ni
E ( A) I ( pi , ni )
i 1 p n
• The encoding information that would be gained by
branching on A
Gain ( A ) I ( p , n ) E ( A )
Infoage(D): Example
Information Gain in Attribute
Infoage(D): Example
Extracting Classification Rules from
Trees
• Represent the knowledge in the form of IF-THEN rules
• One rule is created for each path from the root to a leaf
• Each attribute-value pair along a path forms a conjunction
• The leaf node holds the class prediction
• Rules are easier for humans to understand
• Example
IF age = “<=30” AND student = “no” THEN buys_computer = “no”
IF age = “<=30” AND student = “yes” THEN buys_computer = “yes”
IF age = “31…40” THEN buys_computer = “yes”
IF age = “>40” AND credit_rating = “excellent” THEN buys_computer
= “yes”
IF age = “>40” AND credit_rating = “fair” THEN buys_computer = “no”
Avoid Overfitting in
Classification
• The generated tree may overfit the training data
– Too many branches, some may reflect anomalies
due to noise or outliers
– Result is in poor accuracy for unseen samples
• Two approaches to avoid overfitting
– Prepruning: Halt tree construction early—do not
split a node if this would result in the goodness
measure falling below a threshold
• Difficult to choose an appropriate threshold
– Postpruning: Remove branches from a “fully grown”
tree—get a sequence of progressively pruned trees
• Use a set of data different from the training data
to decide which is the “best pruned tree”
Approaches to Determine the Final
Tree Size
• Separate training (2/3) and testing (1/3) sets
• Use cross validation, e.g., 10-fold cross
validation
• Use all the data for training
– but apply a statistical test (e.g., chi-square) to
estimate whether expanding or pruning a node
may improve the entire distribution
• Use minimum description length (MDL) principle:
– halting growth of the tree when the encoding is
Enhancements to basic decision
tree induction
• Allow for continuous-valued attributes
– Dynamically define new discrete-valued attributes that
partition the continuous attribute value into a discrete
set of intervals
• Handle missing attribute values
– Assign the most common value of the attribute
– Assign probability to each of the possible values
• Attribute construction
– Create new attributes based on existing ones that are
sparsely represented
– This reduces fragmentation, repetition, and replication
Exercise: For the following Medical Diagnosis Data, create a
decision tree.
Sore Throat Fever Swollen Glands Congestion Headache Diagnosis
YES YES YES YES YES Strep Throat
NO NO NO YES YES Allergy
YES YES NO YES NO Cold
YES NO YES NO NO Strep Throat
NO YES NO YES NO Cold
NO NO NO YES NO Allergy
NO NO YES NO NO Strep Throat
YES NO NO YES YES Allergy
NO YES NO YES YES Cold
YES YES NO YES YES Cold
p p n n
(i ).I ( p , n ) lo g 2 lo g 2 ,
S S S S
w here, S ( p n )
v
pi ni
( ii ) . E ( A )
i 1 p n
I ( pi , ni )
( iii ) .G a in ( A ) I ( p i , n i ) E ( A )
S=Strep Throat (3)+Allergy(3)+Cold(4)=10
3 3 3 3 4 4
InfoGain log 2 log 2 log 2
10 10 10 10 10 10
0.3 log 2 (0.3) 2 0.4 log 2 (0.4)
log 10 (0.3) log 10 (0.4)
0.6 0.4
log 10 2 log 10 2
( 0.522) ( 0.397)
0.6 0.4
0.301 0.301
0.6(1.73) 0.4(1. 318)
1.038 0.5272 1.562
Info(S)=1.562
Finding Splitting Attribute
• Select Attribute with highest Gain
Sore Throat= Strep
Allergy Cold
Throat
YES 2 1 2 Information Gain x P
+ = Entropy
NO 1 2 2 Information Gain x P
Sore Throat= 2 2 1 1 2 2
In fo ( Y E S ) lo g 2 lo g 2 lo g 2
5 5 5 5 5 5
In fo ( Y E S ) 1 .5 2
1 1 2 2 2 2
Info ( N O ) log 2 log 2 log 2
5 5 5 5 5 5
Info ( N O ) 1.52
Entropy (E(Sore Throat)= P(YES)x1.52 + P(NO)x1.52
= (5/10)x1.52 + (5/10)x1.52 = 1.52
Gain (Sore Throat)= Info(S)-E(Sore Throat)
= 1.562-1.52 = 0.05
• Gain for each Attribute
Attribute Gain
Sore Throat 0.05
Decision Tree
Fever 0.72
Swallen Glands 0.88 Swallen
Glands
Congestion 0.45 No Yes
Headache 0.05
Diagnosis=Strep Throat
Fever
No Yes
Diagnosis=Allergy Diagnosis=Cold
IF Swallen Glands = “YES”, THEN Diagnosis=Strep Throat
IF Swallen Glands = “NO” AND Fever = “YES”, THEN Diagnosis=Cold
IF Swallen Glands = “NO” AND Fever = “NO”, THEN Diagnosis=Allergy