DECISION TREE BASED
LEARNING
CSC 518
Lecture notes – Chapter 5
LEARNING OUTCOMES
On completion of this topic, learners should be able to: -
a) Identify Decision trees
b) Construct a Decision tree
c) Handle missing values
d) Choose the appropriate attribute(s), Information gain and Entropy,
Gini, index, Gain ratio
e) Identify the termination Criteria / Pruning Decision trees
f) Induce Decision trees – Decision tree algorithms, CART, C4.5, Greedy
Decision trees.
g) Identify benefits of Decision trees, Specialized trees, Oblique trees,
Random forests, Evolutionary trees, Hellinger trees
h) Implement Decision trees Using Mahout, Using R, Using Spark, Using
Python (scikit-learn), Using Julia
DECISION TREES
Decision trees are known to be one of the most powerful and widely used
modeling techniques in the field of Machine learning. Decision trees
naturally induce rules that can be used in data classification and prediction.
Following is an example of a rule definition derived from building a
Decision tree:
If (laptop model is x) and (manufactured by y) and (is z years old) and (with
some owners being k) then (the battery life is n hours).
TERMINOLOGY
Decision trees classify instances by representing in a tree structure starting from the
root to a leaf. Most importantly, at a high level, there are two representations of a
Decision tree—a node and an arc that connects nodes. To make a decision, the flow
starts at the root nodes, navigates to the arcs until it has reached a leaf node, and then
makes a decision. Each node of the tree denotes testing of an attribute, and the
branches denote the possible values that the attribute can take.
Some of the characteristics of a Decision tree representation:
▪ Every non-leaf node (for example, a decision node) denotes a representation of the
attribute value
▪ Every branch denotes the rest of the value representation
▪ Every leaf (or terminal) node represents the value of the target attribute
▪ The starting node is called the root node
PURPOSE AND USES
Decision trees are used for classification and regression.
Two types of trees are used in this context:
1. Classification trees
2. Regression trees
❑ Classification trees are used to classify the given data set into categories. To
use classification trees, the response of the target variable needs to be a
categorical value such as yes/no, true/false.
❑ On the other hand, regression trees are used to address prediction
requirements and are always used when the target or response variable is a
numeric or discrete value such as stock value, commodity and price.
CONSTRUCTING A DECISION TREE
Core rules for building Decision trees:
➢ We usually start building Decision trees with one attribute, split the data based on the
attribute, and continue with the same process for other attributes.
➢ There can be many Decision trees for the given problem.
➢ The depth of the tree is directly proportional to the number of attributes chosen.
➢ There needs to be a Termination Criteria that will determine when to stop further
building the tree. In the case of no termination criteria, the model will result in the
over-fitting of the data.
➢ Finally, the output is always in the form of simple rule(s) that can be stored and
applied to different datasets for classification and/or prediction.
One of the reasons why Decision trees are preferred in the field of Machine learning is
because of their robustness to errors; they can be used when there are some unknown
values in the training datasets too (for example, the data for income is not available for all
the records).
HANDLING MISSING VALUES
One of the interesting ways of assigning values to some unknowns is to see that the most common value in terms
of occurrence is assigned and in some cases they can belong to the same class, if possible we should bring it closer
to accuracy.
There is another probabilistic way of doing this where the prediction is distributed proportionately:
Assign a probability pi for every value vi of x.
Now, assign the fraction pi of x to each of the descendants. These probabilities can be estimated again based on
the observed frequencies of the various values for A, among the examples at node n.
For example, let's consider a Boolean attribute A. Let there be 10 values for A out of which three have a value of True and
the rest 7 have a value of False. So, the probability of A(x) = True is 0.3, and the probability that A(x) = False is 0.7.
➢ A fractional 0.3 of this is distributed down the branch for A = True, and a fractional 0.7 is distributed down the
other.
➢ These probability values are used for computing the information gain, and can be used if a second missing
attribute value needs to be tested.
➢ The same methodology can be applied in the case of learning when we need to fill any unknowns for the new
branches.
➢ The C4.5 algorithm uses this mechanism for filling the missing values.
CONSIDERATIONS FOR CONSTRUCTING DECISION TREES
The key to constructing Decision trees is knowing where to split them. To do this, we need to be clear on the following:
1. Which attribute to start and which attribute to apply subsequently?
2. When do we stop building the Decision tree (that is avoid over-fitting)?
Choosing the appropriate attribute(s)
There are three different ways to identify the best-suited attributes:
1. Information Gain and Entropy
2. Gini index
3. Gain ratio
Information gain and Entropy
This entity is used in an algorithm known as C4.5. Entropy is a measure of uncertainty in the data. Let us take an intuitive approach to understand the concepts
of Information gain and Entropy. In the original version of the C5.0 and C4.5 algorithms (ID3), a root node was chosen on the basis of how much of the total
Entropy was reduced if this node was chosen. This is called information gain.
Information gain = Entropy of the system before split - Entropy of the system after split
Gini index
Gini index is a general splitting criterion. Gini Index is used to measure the probability of two random items belonging to the same class. In the case of a real
dataset, this probability value is 1. The Gini measure of a node is the sum of the squares of the proportions of the classes.
A node with two classes each has a score of 0.52 + 0.52 = 0.5. This is because the probability of picking the same class at random is 1 out of 2.
Gain ratio
Another improvement in C4.5 compared to ID3 is that the factor that decides the attribute is the gain ratio. The gain ratio is the ratio of information gain and
information content. The attribute that gives the maximum amount of gain ratio is the attribute that is used to split it. The gain ratio requires measure for
Information content. Information content is defined as -f i log2 f i . Note that here, we do not take the value of the dependent variable into account.
TERMINATION CRITERIA / PRUNING DECISION TREES
There are many ways of avoiding over-fitting in Decision tree learning.
The Following are the two different cases:
1. One case where the Decision tree is terminated for growth way before a perfect classification of the training data is done
2. Another case where the over-fitting of data is done and then the tree is pruned to recover
Following are a couple of approaches to find the correct tree size:
1. Identify a separate and different dataset to that of the target training data set to be used, and evaluate the correctness of post-
pruning nodes in the tree. This is a common approach and is called training and validation set approach.
2. Instead of having a subset of data in the training set, use up all the data in the training set, and apply probabilistic methods to
check if pruning a particular node has any likelihood to produce any improvement over and above the training dataset. Use all the
available data for training. For example, the chi-square test can be used to check this probability. Reduced-Error-Pruning (D): We
prune at a node by removing the subtree that is rooted at the node. We make that node a leaf.
Following are the steps of the Rule Post-Pruning process:
1. Construct a Decision tree from the training set by growing it until there is an obvious over-fitting seen.
2. Generate rules from the constructed Decision tree with every path, starting from the root node to a particular leaf node mapping to
a rule.
3. Apply pruning to each rule for removing identified preconditions and help improve the probabilistic accuracy.
4. Next, use the pruned rules in the order of their increased accuracy on the subsequent instances.
Following are the advantages of rule-based pruning and its need for converting into rules:
➢ Improving the readability of the rules
➢ A consistent testing can be done at both the root and leaf level nodes
➢ There is a clear decision that can be made of either removing the decision node or retaining it
INDUCING DECISION TREES – DECISION TREE ALGORITHMS
There are many Decision tree inducing methods. Among all the methods, C4.5 and CART are the most adopted or popular
ones.
CART
▪ CART stands for Classification and Regression Trees (Breiman et al., 1984).
▪ CART creates binary trees. This means there are always two branches that can emerge from a given node. The philosophy of
the CART algorithm is to follow a goodness criterion, which is all about choosing the best possible partition.
▪ Moreover, as the tree grows, a cost-complexity pruning mechanism is adopted. CART uses the Gini index to select
appropriate attributes or the splitting criteria.
▪ Using CART, the prior probability distribution can be provided. We can generate Regression trees using CART that in turn
help in predicting real numbers against a class. The prediction is done by applying the weighted mean for the node. CART
identifies splits that minimize the prediction squared error (that is, the least-squared deviation).
C4.5
➢ Similar to CART, C4.5 is a Decision tree algorithm with a primary difference that it can generate more than binary trees,
which means support for multiway splits.
➢ For attribute selection, C4.5 uses the information gain measure. As explained in the previous section, an attribute with the
largest information gain (or the lowest Entropy reduction) value helps to achieve closer to accurate classification with the
least quantity of data. One of the key drawbacks of C4.5 is the need for large memory and CPU capacity for generating rules.
➢ The C5.0 algorithm is a commercial version of C4.5 that was presented in 1997. C4.5 is an evolution of the ID3 algorithm.
The gain ratio measure is used for identifying the splitting criteria.
➢ The splitting process stops when the number of splits reaches a boundary condition definition that acts as a threshold. Post
this growing phase of the tree, pruning is done, and an error-based pruning method is followed.
GREEDY DECISION TREES
A vital characteristic of Decision trees is that they are Greedy! A greedy algorithm targets achieving optimal
solutions globally by achieving local optimums at every stage. Though the global optimum is not always
guaranteed, the local optimums help in achieving global optimum to a maximum extent.
Every node is greedily searched to reach the local optimum, and the possibility of getting stuck at achieving
local optima is high. Most of the time, targeting local optima might help in providing a good enough
solution.
Benefits of Decision trees
Some of the advantages of using Decision trees are listed here:
➢ Decision trees are fast and easy to build and require little experimentation
➢ They are robust
➢ They are easy to understand and interpret
➢ Decision trees do not require complex data preparation
➢ They can handle both categorical and numerical data
➢ They are supported using statistical models for validation
➢ They can handle highly dimensional data and also operate large datasets
SPECIALIZED TREES
1. Oblique trees
12
Oblique trees are used in cases where the data is extremely complex. If the attributes are x1, x2, AND x3…xn, then the C4.5 and CART tests the criteria as
x1>some value or x2< some other value, and so on. The goal in such cases is to find an attribute to test at each node.
2. Random forests
▪ These specialized trees are used when there are too many dimensions. The basic premise of the curse of dimensionality is that high dimensional data
brings in complexity. With more dimensions and features, the possibility of errors is also high. In the case of Random forests, the application of boosting
is about how single tree methods are brought together to see a boost in the result regarding accuracy. A Random forest extends Decision trees by
including more number of Decision trees. These Decision trees are built by a combination of random selection of data (samples) and a random selection
of a subset of attributes. Another variable input required for the making of multiple Decision trees are random subsets of the attributes.
▪ Since each tree is built using random dataset and random variable set, these trees are called Random trees. Moreover, many such Random trees define a
Random forest. The result of a Random tree is based on two radical beliefs. One is that each of the trees make an accurate prediction for maximum part
of the data. Second, mistakes are encountered at different places. So, on an average, a poll of results is taken across the Decision trees to conclude a result.
There are not enough observations to get good estimates, which leads to sparsity issues. There are two important causes for exponential increase on spatial
density, one, is increase in dimensionality and the other is increase in the equidistant points in data. Most of the data is in the tails. Random forests are a
vital extension of the Decision trees that are very simple to understand and are extremely efficient, particularly when one is dealing with high
dimensional spaces. For prediction, a new sample is pushed down the tree. A new label of the training sample is assigned to the terminal node, where it
ends up. This procedure is iterated over all the trees in the group, and the average vote of all trees is reported as the Random forest prediction.
3. Evolutionary trees
When achieving the global optima seems almost impossible, Evolutionary trees are used. As you learned, Decision trees are greedy. So sometimes, we may be
constructing much bigger trees just because we are stuck in local optima. So, if your tree length is just too much, try oblique trees or evolutionary trees. The
concept of evolutionary trees is originated from a very exciting concept called genetic algorithms. Instead of mathematically computing the best attribute at
every node, an Evolutionary tree randomly picks a node at each point and creates a tree. It then iterates and creates a collection of trees (forest). Now, it
identifies the best trees in the forest for the data. It then creates the next generation of the forest by combining these trees randomly. Evolutionary trees, on
the other hand, choose a radically different top node and produce a much shorter tree, which has the same efficiency. Evolutionary algorithms take more time
to compute.
4. Hellinger trees
There have been attempts to identify impurity measures that are less sensitive to the distribution of dependent variable values than Entropy or Gini index. A
very recent paper suggested Hellinger distance as a measure of impurity that does not depend on the distribution of the target variable.
IMPLEMENTING DECISION TREES
Using Mahout
Refer to the folder .../mahout/chapter5/decisiontreeexample/.
Refer to the folder.../mahout/chapter5/randomforestexample/.
Using R
Refer to the folder .../r/chapter5/decisiontreeexample/.
Refer to the folder .../r/chapter5/randomforestexample/.
Using Spark
Refer to the folder .../spark/chapter5/decisiontreeexample/.
Refer to the folder .../spark/chapter5/randomforestexample/.
Using Python (scikit-learn)
Refer to the folder .../python scikitlearn/chapter5/decisiontreeexample/.
Refer to the folder .../python scikitlearn/chapter5/randomforestexample/.
Using Julia
Refer to the folder .../julia/chapter5/decisiontreeexample/.
Refer to the folder .../julia/chapter5/randomforestexample/
THANK YOU