Ensemble Methods, Decision Tree
and Random forest
Prof. Gajendra P.S. Raghava
Head, Center for Computational Biology
Web Site: [Link]
These slides were created with using various resources so
no claim of authorship on any slide
Ensemble learning
Ensembles of Classifiers
Idea
Combine the classifiers to improve the performance
Ensembles of Classifiers
Combine the classification results from different
classifiers to produce the final output
Unweighted voting
Weighted voting
Build Ensemble Classifiers
• Basic idea:
Build different “experts”, and let them vote
• Advantages:
Improve predictive performance
Other types of classifiers can be directly included
Easy to implement
No too much parameter tuning
• Disadvantages:
The combined classifier is not so transparent (black box)
Not a compact representation
Ensemble-based methods
Classifier combination/Aggregator is a general concept
Creation of large set of classifiers, here called the “base
learners”
Building large number of base classifiers
These sets and classifies should be diverse
Aim to maximize accuracy using combination
These methods will be:
Bagging techniques (Random Forest)
Boosting techniques (AdaBoost)
Ensemble Techniques
Bagging Techniques Boosting Techniques
Bootstrap Aggregation • AdaBoost (Adaptive Boosting)
• Random Forest • Gradient Boosting Machines
• Extra-Tree • XGBoost (Extreme Gradient Boosting)
(Extremely Randomized Trees) • CatBoost (Category Boosting)
• LightGBM
Bagging techniques are a type of ensemble learning method that involves training multiple models in
parallel and then combining their predictions to improve the overall performance and robustness of the
predictive model.
Boosting is an ensemble learning technique that works by sequentially training a series of weak models,
each trying to correct the errors made by its predecessors.
Bias/Variance Tradeoff
Minimize variance Low Variance High Variance
– Bagging .
– Random Forests Low Bias ... . . . .
• Minimize bias
.
– Functional Gradient Descent .
High Bias .... .
– Boosting .
– Ensemble Selection
Bias: Measure closeness between predicted and actual value, higher difference higher
biasness.
Variance: Predicted values are scattered or dense; variance in predicted values
Bagging
Bagging = Bootstrap + aggregating
It uses bootstrap resampling to generate L different training sets
On the L training sets it trains L base learners
During testing it aggregates the L learners by taking their average
In case of classification, it uses majority voting
The diversity or complementarity of the base learners is not controlled
The ensemble model is almost always better than the unique base learners
if the base learners are unstable (which means that a small change in the
training dataset may cause a large change in the result of the training)
Bootstrap resampling
Suppose we have a training set with n samples
Create L different training sets
Bootstrap resampling takes random samples
Randomness is required to obtain different sets for L rounds
Allowing replacement is required
As the L training sets are different, the result of the training over
these set will be different
Works better with unstable learners (e.g. neural nets, decision
trees)
Not really effective with stable learners (e.g. k-NN, SVM)
Aggregation methods
There are several methods to combine (aggregate) the outputs of the various classifiers
When the output is a class label:
Majority voting
Weighted majority voting (e.g we can weight each classifier
by ts reliability (which also has to be estimated somehow, of course…)
When the output is numeric (e.g. a probability
estimate for each class ci):
We can combine the dj scores by taking their (weighted)
mean, product, minimum, maximum, …
Stacking
Instead of using the above simple aggregation rules, we can
train yet another classifier on the output values of the base
classifiers
Decision Tree for PlayTennis
Outlook
Sunny Overcast Rain
Humidity Each internal node tests an attribute
High Normal Each branch corresponds to an
attribute value node
No Yes Each leaf node assigns a classification
16
When to consider Decision Trees
Instances describable by attribute-value pairs
Target function is discrete valued
Disjunctive hypothesis may be required
Possibly noisy training data
Missing attribute values
Examples:
Medical diagnosis
Credit risk analysis
Object classification for robot manipulator (Tan 1993) 17
Converting a Tree to Rules
Outlook
Sunny Overcast Rain
Humidity Yes Wind
High Normal Strong Weak
No Yes No Yes
R1: If (Outlook=Sunny) (Humidity=High) Then PlayTennis=No
R2: If (Outlook=Sunny) (Humidity=Normal) Then PlayTennis=Yes
R3: If (Outlook=Overcast) Then PlayTennis=Yes
R4: If (Outlook=Rain) (Wind=Strong) Then PlayTennis=No
R5: If (Outlook=Rain) (Wind=Weak) Then PlayTennis=Yes 18
Basic Concept
A Decision Tree is an important data structure known to
solve many computational problems
Binary Decision Tree
A B C f
0 0 0 m0
0 0 1 m1
0 1 0 m2
0 1 1 m3
1 0 0 m4
1 0 1 m5
1 1 0 m6
1 1 1 m7
19
Basic Concept
In last example, we have considered a decision tree where
values of any attribute if binary only. Decision tree is also
possible where attributes are of continuous data type
Decision Tree with numeric data
20
Some Characteristics
Decision tree may be n-array, n ≥ 2.
There is a special node called root node.
All nodes drawn with circle (ellipse) are called internal nodes.
All nodes drawn with rectangle boxes are called terminal nodes or leaf nodes.
Edges of a node represent the outcome for a value of the node.
In a path, a node with same label is never repeated.
Decision tree is not unique, as different ordering of internal nodes can give
different decision tree.
21
Decision Tree and Classification Task
Decision tree helps us to classify data.
Internal nodes are some attribute
Edges are the values of attributes
External nodes are the outcome of classification
Such a classification is, in fact, made by posing questions starting from the root
node to each terminal node.
22
Decision Tree and Classification Task
Vertebrate Classification
Name Body Temperature Skin Cover Gives Birth Aquatic Aerial Has Legs Hibernates Class
Creature Creature
Human Warm hair yes no no yes no Mammal
Python Cold scales no no no no yes Reptile
Salmon Cold scales no yes no no no Fish
Whale Warm hair yes yes no no no Mammal
Frog Cold none no semi no yes yes Amphibian
Komodo Cold scales no no no yes no Reptile
Bat Warm hair yes no yes yes yes Mammal
Pigeon Warm feathers no no yes yes no Bird
Cat Warm fur yes no no yes no Mammal
Leopard Cold scales yes yes no no no Fish
Turtle Cold scales no semi no yes no Reptile
Penguin Warm feathers no semi no yes no Bird
Porcupine Warm quills yes no no yes yes Mammal
Eel Cold scales no yes no no no Fish
Salamander Cold none no semi no yes yes Amphibian
23
What are the class label of Dragon and Shark?
Decision Tree and Classification Task
Vertebrate Classification
Suppose, a new species is discovered as follows.
Name Body Skin Gives Aquatic Aerial Has Hibernates Class
Temperature Cover Birth Creature Creature Legs
Gila Monster cold scale no no no yes yes
?
Decision Tree that can be inducted based on the data is as follows.
24
Building Decision Tree
Many decision tree can be constructed from a dataset
Some of the tree may not be optimum
Some of them may give inaccurate result
Two approaches are known
Greedy strategy
A top-down recursive divide-and-conquer
Modification of greedy strategy
ID3
C4.5
CART, etc.
Unsupervised decision trees are structurally similar to hierarchical clustering
25
Built Decision Tree Algorithm
Algorithm BuiltDT
Input: D : Training data set
Output: T : Decision tree
Steps
1. If all tuples in D belongs to the same class Cj
Add a leaf node labeled as Cj
Return // Termination condition
2. Select an attribute Ai (so that it is not selected twice in the same branch)
3. Partition D = { D1, D2, …, Dp} based on p different values of Ai in D
4. For each Dk ϵ D
Create a node and add an edge between D and Dk with label as the Ai’s attribute value in Dk
5. For each Dk ϵ D
BuildTD(Dk) // Recursive call
6. Stop 26
Node Splitting in BuildDT Algorithm
BuildDT algorithm must provides a method for expressing an attribute test condition and
corresponding outcome for different attribute type
Case: Binary attribute
This is the simplest case of node splitting
The test condition for a binary attribute generates only two outcomes
27
Node Splitting in BuildDT Algorithm
Case: Nominal attribute
Since a nominal attribute can have many values, its test condition can be expressed
in two ways:
A multi-way split
A binary split
Muti-way split: Outcome depends on the number of distinct values for the
corresponding attribute
Binary splitting by grouping attribute values
CS 40003: Data Analytics 28
Node Splitting in BuildDT Algorithm
Case: Ordinal attribute
It also can be expressed in two ways:
A multi-way split
A binary split
Muti-way split: It is same as in the case of nominal attribute
Binary splitting attribute values should be grouped maintaining the order property of the attribute
values
29
Node Splitting in BuildDT Algorithm
Case: Numerical attribute
For numeric attribute (with discrete or continuous values), a test condition can be expressed
as a comparison set
Binary outcome: A >v or A ≤ v
In this case, decision tree induction must consider all possible split positions
Range query : vi ≤ A < vi+1 for i = 1, 2, …, q (if q number of ranges are chosen)
Here, q should be decided a priori
For a numeric attribute, decision tree induction is a combinatorial optimization problem
30
Illustration : BuildDT Algorithm
Person Gender Height Class Attributes:
1 F 1.6 S Gender = {Male(M), Female (F)} // Binary attribute
2 M 2.0 M Height = {1.5, …, 2.5} // Continuous
3 F 1.9 M attribute
4 F 1.88 M
5 F 1.7 S Class = {Short (S), Medium (M), Tall (T)}
6 M 1.85 M
7 F 1.6 S
8 M 1.7 S
9 M 2.2 T Given a person, we are to test in which class s/he
10 M 2.1 T belongs
11 F 1.8 M
12 M 1.95 M
13 F 1.9 M
14 F 1.8 M
15 F 1.75 S
31
Illustration : BuildDT Algorithm
To built a decision tree, we can select an attribute in two different orderings:
<Gender, Height> or <Height, Gender>
Further, for each ordering, we can choose different ways of splitting
Different instances are shown in the following.
Approach 1 : <Gender, Height>
CS 40003: Data Analytics 32
Illustration : BuildDT Algorithm
33
Illustration : BuildDT Algorithm
Approach 2 : <Height, Gender>
34
Concept of Entropy
More ordered Less ordered
less entropy higher entropy
More organized or Less organized or
ordered (less probable) disordered (more probable)
35
Information Gain
Examples (Gini Index) Examples (Entropy)
Let a bag have 4 red balls Let a bag have 4 red balls
Gini Index (Impurity) = 1 – 1 = 0 Entropy = -1 log2 (1) = 0
Purity is 100% Let a bag have 2 red and 2 blue balls
Let a bag have 2 red and 2 blue balls Entropy = -1/2 log2 (1/2) - 1/2 log2 (1/2)
Gini Index = 1 – (1/4 + 1/4) = 0.5 =1
Let a bag have 3 red and 1 blue balls Let a bag have 3 red and 1 blue balls
Gini Index = 1 – (1/16 + 9/16) = 6/16 Entropy = 0.811
ID3: Decision Tree Induction Algorithms
Iterative Dichotomizer 3 for decision trees,
In ID3, each node corresponds to a splitting attribute and each arc is a possible value of that
attribute.
At each node, the splitting attribute is selected to be the most informative
In ID3, entropy is used to measure how informative a node is.
It is observed that splitting on any attribute has the property that average entropy of the resulting training subsets
will be less than or equal to that of the previous training set.
ID3 algorithm defines a measurement of a splitting called Information Gain to determine the goodness of a
split.
The attribute with the largest value of information gain is chosen as the splitting attribute and
it partitions into a number of smaller training sets based on the distinct values of attribute under split.
37
CART Algorithm
It is observed that information gain measure used in ID3 is biased towards test with many
outcomes, that is, it prefers to select attributes having a large number of values.
L. Breiman, J. Friedman, R. Olshen and C. Stone in 1984 proposed an algorithm to build a
binary decision tree also called CART decision tree.
CART stands for Classification and Regression Tree
In fact, invented independently at the same time as ID3 (1984).
ID3 and CART are two cornerstone algorithms spawned a flurry of work on decision tree induction.
CART is a technique that generates a binary decision tree; That is, unlike ID3, in CART, for
each node only two children is created.
ID3 uses Information gain as a measure to select the best attribute to be splitted, whereas CART
do the same but using another measurement called Gini index . It is also known as Gini Index
of Diversity and is denote as 𝛾.
38
Gini Index of Diversity
Gini Index
Suppose, D is a training set with size |D| and 𝐶 = 𝑐1, 𝑐2, … , 𝑐𝑘 be the set of k classifications and 𝐴 =
𝑎1 , 𝑎2 , … , 𝑎𝑚 be any attribute with m different values of it. Like entropy measure in ID3, CART proposes
Gini Index (denoted by G) as the measure of impurity of D. It can be defined as follows.
𝑘
𝐺 𝐷 = 1 − 𝑝𝑖2
𝑖=1
where 𝑝𝑖 is the probability that a tuple in D belongs to class 𝑐𝑖 and 𝑝𝑖 can be estimated as
|𝐶𝑖 , 𝐷|
𝑝𝑖 =
𝐷
where |𝐶𝑖 , 𝐷| denotes the number of tuples in D with class 𝑐𝑖 .
39
Algorithm C 4.5
J. Ross Quinlan, a researcher in machine learning developed a decision tree induction algorithm
in 1984 known as ID3 (Iterative Dichotometer 3).
Quinlan later presented C4.5, a successor of ID3, addressing some limitations in ID3.
ID3 uses information gain measure, which is, in fact biased towards splitting attribute having a
large number of outcomes.
For example, if an attribute has distinct values for all tuples, then it would result in a large
number of partitions, each one containing just one tuple.
In such a case, note that each partition is pure, and hence the purity measure of the partition, that is
𝐸𝐴 𝐷 = 0
40
Algorithm C4.5 : Introduction
Limitation of ID3
In the following, each tuple belongs to a unique class. The splitting on A is shown.
𝑛 𝑛
𝐷𝑗 1
𝐸𝐴 𝐷 = . 𝐸 𝐷𝑗 = .0 = 0
𝐷 𝐷
𝑗=1 𝑗=1
Thus, 𝛼 𝐴, 𝐷 = 𝐸 𝐷 − 𝐸𝐴 𝐷 is maximum in such a situation. 41
Algorithm: C 4.5 : Introduction
Although, the previous situation is an extreme case, intuitively, we can infer
that ID3 favours splitting attributes having a large number of values
compared to other attributes, which have a less variations in their values.
Such a partition appears to be useless for classification.
This type of problem is called overfitting problem.
Note:
Decision Tree Induction Algorithm ID3 may suffer from overfitting problem.
42
Algorithm: C 4.5 : Introduction
The overfitting problem in ID3 is due to the measurement of information gain.
In order to reduce the effect of the use of the bias due to the use of information gain, C4.5 uses a
different measure called Gain Ratio, denoted as 𝛽.
Gain Ratio is a kind of normalization to information gain using a split information.
Notes on Decision Tree Induction algorithms
1. Optimal Decision Tree: Finding an optimal decision tree is an NP-complete problem. Hence, decision tree induction
algorithms employ a heuristic based approach to search for the best in a large search space. Majority of the algorithms
follow a greedy, top-down recursive divide-and-conquer strategy to build decision trees.
2. Missing data and noise: Decision tree induction algorithms are quite robust to the data set with missing values and
presence of noise. However, proper data pre-processing can be followed to nullify these discrepancies.
3. Redundant Attributes: The presence of redundant attributes does not adversely affect the accuracy of decision trees.
It is observed that if an attribute is chosen for splitting, then another attribute which is redundant is unlikely to chosen
for splitting.
4. Computational complexity: Decision tree induction algorithms are computationally inexpensive, in particular, when
the sizes of training sets are large, Moreover, once a decision tree is known, classifying a test record is extremely fast,
with a worst-case time complexity of O(d), where d is the maximum depth of the tree.
44
RANDOM FOREST
➢Random forest is a supervised learning algorithm can
be used for classification as well as regression.
➢It is mainly used for classification problems.
➢The "forest" it builds is an ensemble of decision trees,
usually trained with the “bagging” method.
➢The general idea of the bagging method is to
increases the overall performance.
Working of Random Forest Algorithm
Step 1: First, start with the selection of random samples from a
given dataset.
Step 2: Next, this algorithm will construct a decision tree for
every sample. Then it will get the prediction result from every
decision tree.
Step 3: In this step, voting will be performed for every predicted
result.
Step 4: At last, select the most voted prediction result as the
final prediction result.
Random Forest Classifier
Training Data
M features
N examples
Random Forest Classifier
M features
N examples
....…
Random Forest Classifier
Construct a decision tree
M features
N examples
....…
Random Forest Classifier
At each node in choosing the split feature
choose only among m<M features
M features
N examples
....…
Random Forest Classifier
Create decision tree
from each bootstrap sample
M features
N examples
....…
....…
Random Forest Classifier
M features
N examples
Take he
majority
vote
....…
....…
Important Points
➢Random forest has nearly the same hyperparameters as a decision tree or
a bagging classifier.
➢Random forest adds additional randomness to the model, while growing
the trees. Instead of searching for the most important feature while splitting
a node, it searches for the best feature among a random subset of
features. This results in a wide diversity that generally results in a better
model.
➢Therefore, in random forest, only a random subset of the features is taken
into consideration by the algorithm for splitting a node. You can even make
trees more random by additionally using random thresholds for each
feature rather than searching for the best possible thresholds.
Important Points
Random forest is based decision trees that are created by randomly
splitting the data.
Generating decision trees is also known as forest.
Each decision tree is formed using feature selection indicators like
information gain.
Each tree is dependent on an independent sample.
Considering it to be a classification problem, then each tree computes
votes and the highest votes class is chosen.
If it's regression, the average of all the tree's outputs is declared as
the result.
It is the most powerful algorithm compared to all others.
FEATURE IMPORTANCE
It is very easy to measure the relative importance of each feature on the
prediction.
Sklearn provides a great tool for this that measures a feature's importance.
Tree nodes that use that feature reduce impurity across all trees in the forest.
By looking at the feature importance you can decide which features to possibly
drop
This is important because a general rule in machine learning is that the more
features you have the more likely your model will suffer from overfitting and vice
versa.
Random forests make use of Gini importance or MDI (Mean decrease impurity)
to compute the importance of each attribute.
The amount of total decrease in node impurity is also called Gini importance.
This is the method through which accuracy or model fit decreases when there is a
drop of feature.
Features and Advantages
The advantages of random forest are:
One of the most accurate learning algorithms available.
Runs efficiently on large databases.
It can handle thousands of input variables.
Identification of importance of variables in classification
Highly effective in estimating missing data.
It has methods for balancing error in class population unbalanced data sets.
Generated forests can be saved for future use on other data.
It computes proximities between pairs of cases (clustering, locating outliers)
The capabilities can be used for unsupervised learning
It offers an experimental method for detecting variable interactions.
58/14
Implementation in Python
import numpy as np
import [Link] as plt
import pandas as pd
from sklearn.model_selection import train_test_split X_train, X_test,
y_train, y_test = train_test_split(X, y, test_size = 0.30)
from [Link] import RandomForestClassifier
classifier = RandomForestClassifier(n_estimators = 50)
[Link](X_train, y_train)
y_pred = [Link](X_test)
Thank You