0% found this document useful (0 votes)
6 views43 pages

Lecture 3

The document provides an overview of Decision Trees, a machine learning tool used for decision support that models decisions and their potential outcomes. It discusses the structure of decision trees, the process of learning from data, and methods for selecting the best attributes for splitting data. Additionally, it covers concepts such as information gain and the ID3 algorithm for constructing decision trees.

Uploaded by

q842163675
Copyright
© All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
6 views43 pages

Lecture 3

The document provides an overview of Decision Trees, a machine learning tool used for decision support that models decisions and their potential outcomes. It discusses the structure of decision trees, the process of learning from data, and methods for selecting the best attributes for splitting data. Additionally, it covers concepts such as information gain and the ID3 algorithm for constructing decision trees.

Uploaded by

q842163675
Copyright
© All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd

Machine Learning

Decision Trees

Zhibin Yu
2021.08.25
Contents
Introduction to Decision Tree

Algorithm

Overfitting

Summary
Part I: Introduction to Decision Tree
What is Decision Tree
● decision tree is a decision support tool that uses a tree-like
model of decisions and their possible consequences, including
chance event outcomes, resource costs, and utility. It is one way
to display an algorithm that only contains conditional control
statements.
A possible decision tree
● Each internal node: test one attribute 𝑋𝑖
● Each branch from a node: selects one value for 𝑋𝑖
● Each leaf node: predict 𝑌 or 𝑝(𝑌 | 𝑥 ∈ 𝑙𝑒𝑎𝑓 )
Sample Dataset
● Columns denote features Xi
● Rows denote labeled instances < 𝒙𝑖, 𝑦𝑖 >
● Class label denotes whether a tennis game was played

< 𝒙𝑖, 𝑦𝑖 >


Function Approximation
● Problem Setting
● Set of possible instances X
● Set of possible labels Y
● Unknown target function 𝑓: 𝑋 → 𝑌
● Set of function hypotheses 𝐻 = {h h: 𝑋 → 𝑌}

● Input: Training examples of unknown target function 𝑓


{⟨𝒙𝑖, 𝑦𝑖⟩} ={⟨𝒙1, 𝑦1⟩,…, ⟨𝒙𝑛, 𝑦𝑛⟩}
n
𝑖=1
● Output: Hypothesis h ∈ 𝐻 that best approximates 𝑓
Decision Tree
● A possible decision tree for the data:

• What prediction would we make for


<outlook=sunny, temperature=hot, humidity=high, wind=weak> ?
Decision Tree
● If features are continuous, internal nodes can test the value of a
feature against a threshold
Decision Tree Learning
● Problem Setting:
● Set of possible instances X
● each instance x in X is a feature vector
● e.g., <Humidity=low, Wind=weak, Outlook=rain, Temp=hot>
● Unknown target function 𝑓: 𝑋 → 𝑌
● Y is discrete valued
● Set of function hypotheses 𝐻 = {h h: 𝑋 → 𝑌 }
● each hypothesis h is a decision tree
● trees sorts x to leaf, which assigns y
Stages of (Batch) Machine Learning
𝑋, 𝑌 = {⟨𝒙𝒊, 𝒚𝒊⟩}
𝑛
●Given: labeled training data 𝑖=1
● Assumes each 𝑥𝑖~𝐷(𝑥) with 𝑦𝑖 = 𝑓𝑡𝑎𝑟𝑔𝑒𝑡(𝑥𝑖)
● Train the model:
● model ← [Link](X, Y )

● Apply the model to new data:


● Given: new unlabeled instance x ⇠ D(X )
● 𝑦𝑝𝑟𝑒𝑑𝑖𝑐𝐼𝑜𝑛 ← 𝑚𝑜𝑑𝑒𝑙 . 𝑝𝑟𝑒𝑑𝑖𝑐𝑡(𝑥)
Example Application:A Tree to Predict
Caesarean Section Risk
Decision Tree Induced Partition
Decision Tree – Decision Boundary
● Decision trees divide the feature space into axis- parallel
(hyper-)rectangles
● Each rectangular region is labeled with one label
● or a probability distribution over labels
Expressiveness
● Decision trees can represent any Boolean function of the
input attributes
● In the worst case, the tree will require exponentially many
nodes

Truth table row →path to leaf


Expressiveness
● Decision trees have a variable-sized hypothesis space
● As the #nodes (or depth) increases, the hypothesis space grows
● Depth 1 (“decision stump”): can represent any Boolean function of one
feature
● Depth 2: any Boolean fn of two features; some involving three features (e.g.,
(𝑥1 ∩ 𝑥2) ∪ ( ¬𝑥1 ∩ ¬𝑥3))
Another Example: Restaurant
● Model a patron’s decision of whether to wait for a table at a
restaurant
Another Example: Restaurant

Is this the best decision tree?


Preference bias: Ockham’s Razor
● Principle stated by William of Ockham (1285-1347)
● entities are not to be multiplied beyond necessity
● AKA Occam’s Razor, Law of Economy, or Law of Parsimony

Idea: The simplest consistent explanation is the


●best!
Therefore, the smallest decision tree that correctly classifies all
of the training examples is best
● Finding the provably smallest decision tree is NP-hard
● ...So instead of constructing the absolute smallest tree consistent
with the training examples, construct one that is pretty small
ID-3 Decision Tree
● node = root of decision tree
● Main loop:
● A ← the “best” decision attribute for the next node.
● Assign A as decision attribute for node.
● For each value of A, create a new descendant of node.
● Sort training examples to leaf nodes.
● If training examples are perfectly classified, stop. Else, recurse
over new leaf nodes.

How do we choose which attribute is best?


Choosing the Best Attribute
● Key problem: choosing which attribute to split a given set of examples
● Random: Select any attribute at random
● Least-Values: Choose the attribute with the smallest number of possible
values
● Most-Values: Choose the attribute with the largest number of possible
values
● Max-Gain: Choose the attribute that has the largest expected information
gain
● i.e., attribute that results in smallest expected size of subtrees rooted at its
children
● The ID3 algorithm uses the Max-Gain method of selecting the best
attribute
Choosing an Attribute
● Idea: a good attribute splits the examples into subsets that are
(ideally) “all positive” or “all negative”
● Which split is more informative: Patrons? or Type?
ID3-induced Decision Tree
Compare the Two Decision Trees
Information Gain
● Which test is more informative?
Split over whether Split over whether
Balance exceeds 50K applicant is employed

Less or equal 50K Over 50K Unemployed Employed


Information Gain
● Impurity/Entropy (informal)
● Measures the level of impurity in a group of examples
Information Gain
● Impurity/Entropy (informal)
● Measures the level of impurity in a group of examples
Information Gain
# of
● Entropy H(X) of a random variable X
possible
values for X

● H(X) is the expected number of bits needed to encode a


randomly drawn value of X (under most efficient code)
● Why? Information theory:
● Most efficient code assigns −𝑙𝑜𝑔2𝑃 (𝑋 = 𝑖)bits to encode the message
𝑋=𝑖
● So, expected number of bits to code one random X is:
2-Class Cases:
● What is the entropy of a group in which all examples belong
to the same class?
● 𝑒𝑛𝑡𝑟𝑜𝑝𝑦 = − 1log21 = 0
● not a good training set for learning

● What is the entropy of a group with 50% in either class?


● 𝑒𝑛𝑡𝑟𝑜𝑝𝑦 = − 0 . 5log20 . 5–0 . 5log20 . 5 = 1
● good training set for learning
Sample Entropy
Information Gain
● We want to determine which attribute in a given set of training feature
vectors is most useful for discriminating between the classes to be
learned.

● Information gain tells us how important a given attribute of the feature


vectors is.

● We will use it to decide the ordering of attributes in the nodes of a


decision tree.
From Entropy to Information Gain
● Entropy H(X) of a random variable X

● Specific conditional entropy H(X|Y=v) of X given Y=v

● Conditional entropy H(X|Y) of X given Y :

● Mutual information (aka Information Gain) of X and Y :


Information Gain
● Information Gain is the mutual information between input attribute A
and target variable Y

● Information Gain is the expected reduction in entropy of target


variable Y for data sample S, due to sorting on variable A
Calculating Information Gain

Information Gain= 0.996 - 0.615 = 0.38


Entropy-Based Automatic Decision Tree
Construction

● Quinlan suggested information gain in his ID3


system and later the gain ratio, both based on
entropy.
Example
Selecting the next attribute
Selecting the next attribute
Slide by Tom Mitchell
Which Tree Should We Output?
● ID3 performs heuristic search through space of
decision trees
● It stops at smallest acceptable tree. Why?

Occam’s razor: prefer the


simplest hypothesis that
fits the data
Which Tree Should We Output?
• ID3 performs heuristic
search through space of
decision trees
• It stops at smallest
acceptable tree. Why?

Occam’s razor: prefer the


simplest hypothesis that
fits the data
ID-3 Decision Tree Pseudo Code
● ID3 (Examples, Target_Attribute, Attributes)
● Create a root node for the tree
● If all examples are positive, Return the single-node tree Root, with label = +
● If all examples are negative, Return the single-node tree Root, with label = -.
● If number of predicting attributes is empty, then Return the single node tree Root,
● with label = most common value of the target attribute in the examples.
● Otherwise Begin
● A ← The Attribute that best classifies examples.
● Decision Tree attribute for Root = A.
● For each possible value, vi, of A,
● Add a new tree branch below Root, corresponding to the test A = vi.
● Let Examples(vi) be the subset of examples that have the value vi for A
● If Examples(vi) is empty
● Then below this new branch add a leaf node with label = most common target value in the examples
● Else below this new branch add the subtree ID3 (Examples(vi), Target_Attribute, Attributes – {A})
● End
● Return Root
Questions?

You might also like