0% found this document useful (0 votes)
10 views60 pages

Week 10

The document discusses ensemble methods in machine learning, focusing on decision trees and random forests. It explains the concepts of bagging and boosting techniques, their advantages and disadvantages, and how they improve predictive performance. Additionally, it covers the structure and building of decision trees, including algorithms like ID3 and CART, along with their applications in classification tasks.

Uploaded by

smishra8094
Copyright
© All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
10 views60 pages

Week 10

The document discusses ensemble methods in machine learning, focusing on decision trees and random forests. It explains the concepts of bagging and boosting techniques, their advantages and disadvantages, and how they improve predictive performance. Additionally, it covers the structure and building of decision trees, including algorithms like ID3 and CART, along with their applications in classification tasks.

Uploaded by

smishra8094
Copyright
© All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd

Ensemble Methods, Decision Tree

and Random forest

Prof. Gajendra P.S. Raghava


Head, Center for Computational Biology

Web Site: [Link]

These slides were created with using various resources so


no claim of authorship on any slide
Ensemble learning
Ensembles of Classifiers

 Idea
 Combine the classifiers to improve the performance
 Ensembles of Classifiers
 Combine the classification results from different
classifiers to produce the final output
 Unweighted voting
 Weighted voting
Build Ensemble Classifiers
• Basic idea:
Build different “experts”, and let them vote
• Advantages:
Improve predictive performance
Other types of classifiers can be directly included
Easy to implement
No too much parameter tuning
• Disadvantages:
The combined classifier is not so transparent (black box)
Not a compact representation
Ensemble-based methods
 Classifier combination/Aggregator is a general concept
 Creation of large set of classifiers, here called the “base
learners”
 Building large number of base classifiers
 These sets and classifies should be diverse
 Aim to maximize accuracy using combination
 These methods will be:
 Bagging techniques (Random Forest)
 Boosting techniques (AdaBoost)
Ensemble Techniques
Bagging Techniques Boosting Techniques

Bootstrap Aggregation • AdaBoost (Adaptive Boosting)


• Random Forest • Gradient Boosting Machines
• Extra-Tree • XGBoost (Extreme Gradient Boosting)
(Extremely Randomized Trees) • CatBoost (Category Boosting)
• LightGBM

Bagging techniques are a type of ensemble learning method that involves training multiple models in
parallel and then combining their predictions to improve the overall performance and robustness of the
predictive model.
Boosting is an ensemble learning technique that works by sequentially training a series of weak models,
each trying to correct the errors made by its predecessors.
Bias/Variance Tradeoff
 Minimize variance Low Variance High Variance
– Bagging .

– Random Forests Low Bias ... . . . .

• Minimize bias
.
– Functional Gradient Descent .
High Bias .... .
– Boosting .
– Ensemble Selection
Bias: Measure closeness between predicted and actual value, higher difference higher
biasness.
Variance: Predicted values are scattered or dense; variance in predicted values
Bagging
 Bagging = Bootstrap + aggregating
 It uses bootstrap resampling to generate L different training sets
 On the L training sets it trains L base learners
 During testing it aggregates the L learners by taking their average
 In case of classification, it uses majority voting
 The diversity or complementarity of the base learners is not controlled
 The ensemble model is almost always better than the unique base learners
if the base learners are unstable (which means that a small change in the
training dataset may cause a large change in the result of the training)
Bootstrap resampling
 Suppose we have a training set with n samples
 Create L different training sets
 Bootstrap resampling takes random samples
 Randomness is required to obtain different sets for L rounds
 Allowing replacement is required
 As the L training sets are different, the result of the training over
these set will be different
 Works better with unstable learners (e.g. neural nets, decision
trees)
 Not really effective with stable learners (e.g. k-NN, SVM)
Aggregation methods
 There are several methods to combine (aggregate) the outputs of the various classifiers
 When the output is a class label:
 Majority voting
 Weighted majority voting (e.g we can weight each classifier
by ts reliability (which also has to be estimated somehow, of course…)
 When the output is numeric (e.g. a probability
estimate for each class ci):
 We can combine the dj scores by taking their (weighted)
mean, product, minimum, maximum, …
 Stacking
 Instead of using the above simple aggregation rules, we can
train yet another classifier on the output values of the base
classifiers
Decision Tree for PlayTennis

Outlook

Sunny Overcast Rain

Humidity Each internal node tests an attribute

High Normal Each branch corresponds to an


attribute value node
No Yes Each leaf node assigns a classification
16
When to consider Decision Trees
 Instances describable by attribute-value pairs
 Target function is discrete valued
 Disjunctive hypothesis may be required
 Possibly noisy training data
 Missing attribute values
 Examples:
 Medical diagnosis
 Credit risk analysis
 Object classification for robot manipulator (Tan 1993) 17
Converting a Tree to Rules
Outlook

Sunny Overcast Rain

Humidity Yes Wind

High Normal Strong Weak


No Yes No Yes

R1: If (Outlook=Sunny)  (Humidity=High) Then PlayTennis=No


R2: If (Outlook=Sunny)  (Humidity=Normal) Then PlayTennis=Yes
R3: If (Outlook=Overcast) Then PlayTennis=Yes
R4: If (Outlook=Rain)  (Wind=Strong) Then PlayTennis=No
R5: If (Outlook=Rain)  (Wind=Weak) Then PlayTennis=Yes 18
Basic Concept
 A Decision Tree is an important data structure known to
solve many computational problems
Binary Decision Tree
A B C f
0 0 0 m0
0 0 1 m1
0 1 0 m2
0 1 1 m3
1 0 0 m4
1 0 1 m5
1 1 0 m6
1 1 1 m7

19
Basic Concept
 In last example, we have considered a decision tree where
values of any attribute if binary only. Decision tree is also
possible where attributes are of continuous data type
Decision Tree with numeric data

20
Some Characteristics
 Decision tree may be n-array, n ≥ 2.

 There is a special node called root node.


 All nodes drawn with circle (ellipse) are called internal nodes.
 All nodes drawn with rectangle boxes are called terminal nodes or leaf nodes.
 Edges of a node represent the outcome for a value of the node.
 In a path, a node with same label is never repeated.
 Decision tree is not unique, as different ordering of internal nodes can give
different decision tree.

21
Decision Tree and Classification Task
 Decision tree helps us to classify data.
 Internal nodes are some attribute

 Edges are the values of attributes

 External nodes are the outcome of classification


 Such a classification is, in fact, made by posing questions starting from the root
node to each terminal node.

22
Decision Tree and Classification Task
Vertebrate Classification
Name Body Temperature Skin Cover Gives Birth Aquatic Aerial Has Legs Hibernates Class
Creature Creature

Human Warm hair yes no no yes no Mammal


Python Cold scales no no no no yes Reptile
Salmon Cold scales no yes no no no Fish
Whale Warm hair yes yes no no no Mammal
Frog Cold none no semi no yes yes Amphibian
Komodo Cold scales no no no yes no Reptile
Bat Warm hair yes no yes yes yes Mammal
Pigeon Warm feathers no no yes yes no Bird
Cat Warm fur yes no no yes no Mammal
Leopard Cold scales yes yes no no no Fish
Turtle Cold scales no semi no yes no Reptile
Penguin Warm feathers no semi no yes no Bird
Porcupine Warm quills yes no no yes yes Mammal
Eel Cold scales no yes no no no Fish
Salamander Cold none no semi no yes yes Amphibian

23
What are the class label of Dragon and Shark?
Decision Tree and Classification Task
Vertebrate Classification
 Suppose, a new species is discovered as follows.
Name Body Skin Gives Aquatic Aerial Has Hibernates Class
Temperature Cover Birth Creature Creature Legs

Gila Monster cold scale no no no yes yes


?
 Decision Tree that can be inducted based on the data is as follows.

24
Building Decision Tree
 Many decision tree can be constructed from a dataset
 Some of the tree may not be optimum
 Some of them may give inaccurate result
 Two approaches are known
 Greedy strategy
 A top-down recursive divide-and-conquer
 Modification of greedy strategy
 ID3
 C4.5
 CART, etc.
Unsupervised decision trees are structurally similar to hierarchical clustering
25
Built Decision Tree Algorithm
 Algorithm BuiltDT
 Input: D : Training data set
 Output: T : Decision tree
Steps
1. If all tuples in D belongs to the same class Cj
Add a leaf node labeled as Cj
Return // Termination condition
2. Select an attribute Ai (so that it is not selected twice in the same branch)
3. Partition D = { D1, D2, …, Dp} based on p different values of Ai in D
4. For each Dk ϵ D
Create a node and add an edge between D and Dk with label as the Ai’s attribute value in Dk
5. For each Dk ϵ D
BuildTD(Dk) // Recursive call
6. Stop 26
Node Splitting in BuildDT Algorithm
 BuildDT algorithm must provides a method for expressing an attribute test condition and
corresponding outcome for different attribute type

 Case: Binary attribute


 This is the simplest case of node splitting
 The test condition for a binary attribute generates only two outcomes

27
Node Splitting in BuildDT Algorithm
 Case: Nominal attribute
 Since a nominal attribute can have many values, its test condition can be expressed
in two ways:
 A multi-way split
 A binary split

 Muti-way split: Outcome depends on the number of distinct values for the
corresponding attribute

 Binary splitting by grouping attribute values

CS 40003: Data Analytics 28


Node Splitting in BuildDT Algorithm
 Case: Ordinal attribute
 It also can be expressed in two ways:
 A multi-way split
 A binary split

 Muti-way split: It is same as in the case of nominal attribute

 Binary splitting attribute values should be grouped maintaining the order property of the attribute
values

29
Node Splitting in BuildDT Algorithm
 Case: Numerical attribute
 For numeric attribute (with discrete or continuous values), a test condition can be expressed
as a comparison set

 Binary outcome: A >v or A ≤ v

 In this case, decision tree induction must consider all possible split positions

 Range query : vi ≤ A < vi+1 for i = 1, 2, …, q (if q number of ranges are chosen)

 Here, q should be decided a priori

 For a numeric attribute, decision tree induction is a combinatorial optimization problem

30
Illustration : BuildDT Algorithm
Person Gender Height Class Attributes:
1 F 1.6 S Gender = {Male(M), Female (F)} // Binary attribute
2 M 2.0 M Height = {1.5, …, 2.5} // Continuous
3 F 1.9 M attribute
4 F 1.88 M
5 F 1.7 S Class = {Short (S), Medium (M), Tall (T)}
6 M 1.85 M
7 F 1.6 S
8 M 1.7 S
9 M 2.2 T Given a person, we are to test in which class s/he
10 M 2.1 T belongs
11 F 1.8 M
12 M 1.95 M
13 F 1.9 M
14 F 1.8 M
15 F 1.75 S
31
Illustration : BuildDT Algorithm
 To built a decision tree, we can select an attribute in two different orderings:
<Gender, Height> or <Height, Gender>

 Further, for each ordering, we can choose different ways of splitting

 Different instances are shown in the following.

 Approach 1 : <Gender, Height>

CS 40003: Data Analytics 32


Illustration : BuildDT Algorithm

33
Illustration : BuildDT Algorithm
 Approach 2 : <Height, Gender>

34
Concept of Entropy

More ordered Less ordered


less entropy higher entropy
More organized or Less organized or
ordered (less probable) disordered (more probable)

35
Information Gain

Examples (Gini Index) Examples (Entropy)

Let a bag have 4 red balls Let a bag have 4 red balls
Gini Index (Impurity) = 1 – 1 = 0 Entropy = -1 log2 (1) = 0
Purity is 100% Let a bag have 2 red and 2 blue balls
Let a bag have 2 red and 2 blue balls Entropy = -1/2 log2 (1/2) - 1/2 log2 (1/2)
Gini Index = 1 – (1/4 + 1/4) = 0.5 =1
Let a bag have 3 red and 1 blue balls Let a bag have 3 red and 1 blue balls
Gini Index = 1 – (1/16 + 9/16) = 6/16 Entropy = 0.811
ID3: Decision Tree Induction Algorithms

 Iterative Dichotomizer 3 for decision trees,

 In ID3, each node corresponds to a splitting attribute and each arc is a possible value of that
attribute.

 At each node, the splitting attribute is selected to be the most informative


 In ID3, entropy is used to measure how informative a node is.
 It is observed that splitting on any attribute has the property that average entropy of the resulting training subsets
will be less than or equal to that of the previous training set.

 ID3 algorithm defines a measurement of a splitting called Information Gain to determine the goodness of a
split.
 The attribute with the largest value of information gain is chosen as the splitting attribute and

 it partitions into a number of smaller training sets based on the distinct values of attribute under split.

37
CART Algorithm
 It is observed that information gain measure used in ID3 is biased towards test with many
outcomes, that is, it prefers to select attributes having a large number of values.

 L. Breiman, J. Friedman, R. Olshen and C. Stone in 1984 proposed an algorithm to build a


binary decision tree also called CART decision tree.
 CART stands for Classification and Regression Tree
 In fact, invented independently at the same time as ID3 (1984).
 ID3 and CART are two cornerstone algorithms spawned a flurry of work on decision tree induction.

 CART is a technique that generates a binary decision tree; That is, unlike ID3, in CART, for
each node only two children is created.

 ID3 uses Information gain as a measure to select the best attribute to be splitted, whereas CART
do the same but using another measurement called Gini index . It is also known as Gini Index
of Diversity and is denote as 𝛾.

38
Gini Index of Diversity
Gini Index

Suppose, D is a training set with size |D| and 𝐶 = 𝑐1, 𝑐2, … , 𝑐𝑘 be the set of k classifications and 𝐴 =
𝑎1 , 𝑎2 , … , 𝑎𝑚 be any attribute with m different values of it. Like entropy measure in ID3, CART proposes
Gini Index (denoted by G) as the measure of impurity of D. It can be defined as follows.
𝑘
𝐺 𝐷 = 1 − ෍ 𝑝𝑖2
𝑖=1
where 𝑝𝑖 is the probability that a tuple in D belongs to class 𝑐𝑖 and 𝑝𝑖 can be estimated as
|𝐶𝑖 , 𝐷|
𝑝𝑖 =
𝐷
where |𝐶𝑖 , 𝐷| denotes the number of tuples in D with class 𝑐𝑖 .

39
Algorithm C 4.5
 J. Ross Quinlan, a researcher in machine learning developed a decision tree induction algorithm
in 1984 known as ID3 (Iterative Dichotometer 3).

 Quinlan later presented C4.5, a successor of ID3, addressing some limitations in ID3.

 ID3 uses information gain measure, which is, in fact biased towards splitting attribute having a
large number of outcomes.

 For example, if an attribute has distinct values for all tuples, then it would result in a large
number of partitions, each one containing just one tuple.
 In such a case, note that each partition is pure, and hence the purity measure of the partition, that is
𝐸𝐴 𝐷 = 0

40
Algorithm C4.5 : Introduction
Limitation of ID3

In the following, each tuple belongs to a unique class. The splitting on A is shown.

𝑛 𝑛
𝐷𝑗 1
𝐸𝐴 𝐷 = ෍ . 𝐸 𝐷𝑗 = ෍ .0 = 0
𝐷 𝐷
𝑗=1 𝑗=1

Thus, 𝛼 𝐴, 𝐷 = 𝐸 𝐷 − 𝐸𝐴 𝐷 is maximum in such a situation. 41


Algorithm: C 4.5 : Introduction
 Although, the previous situation is an extreme case, intuitively, we can infer
that ID3 favours splitting attributes having a large number of values
 compared to other attributes, which have a less variations in their values.

 Such a partition appears to be useless for classification.

 This type of problem is called overfitting problem.

Note:
Decision Tree Induction Algorithm ID3 may suffer from overfitting problem.

42
Algorithm: C 4.5 : Introduction
 The overfitting problem in ID3 is due to the measurement of information gain.

 In order to reduce the effect of the use of the bias due to the use of information gain, C4.5 uses a
different measure called Gain Ratio, denoted as 𝛽.

 Gain Ratio is a kind of normalization to information gain using a split information.


Notes on Decision Tree Induction algorithms
1. Optimal Decision Tree: Finding an optimal decision tree is an NP-complete problem. Hence, decision tree induction
algorithms employ a heuristic based approach to search for the best in a large search space. Majority of the algorithms
follow a greedy, top-down recursive divide-and-conquer strategy to build decision trees.

2. Missing data and noise: Decision tree induction algorithms are quite robust to the data set with missing values and
presence of noise. However, proper data pre-processing can be followed to nullify these discrepancies.

3. Redundant Attributes: The presence of redundant attributes does not adversely affect the accuracy of decision trees.
It is observed that if an attribute is chosen for splitting, then another attribute which is redundant is unlikely to chosen
for splitting.

4. Computational complexity: Decision tree induction algorithms are computationally inexpensive, in particular, when
the sizes of training sets are large, Moreover, once a decision tree is known, classifying a test record is extremely fast,
with a worst-case time complexity of O(d), where d is the maximum depth of the tree.

44
RANDOM FOREST
➢Random forest is a supervised learning algorithm can
be used for classification as well as regression.
➢It is mainly used for classification problems.
➢The "forest" it builds is an ensemble of decision trees,
usually trained with the “bagging” method.
➢The general idea of the bagging method is to
increases the overall performance.
Working of Random Forest Algorithm
 Step 1: First, start with the selection of random samples from a
given dataset.
 Step 2: Next, this algorithm will construct a decision tree for
every sample. Then it will get the prediction result from every
decision tree.
 Step 3: In this step, voting will be performed for every predicted
result.
 Step 4: At last, select the most voted prediction result as the
final prediction result.
Random Forest Classifier
Training Data

M features
N examples
Random Forest Classifier

M features
N examples

....…
Random Forest Classifier
Construct a decision tree

M features
N examples

....…
Random Forest Classifier
At each node in choosing the split feature
choose only among m<M features

M features
N examples

....…
Random Forest Classifier
Create decision tree
from each bootstrap sample

M features
N examples

....…
....…
Random Forest Classifier

M features
N examples

Take he
majority
vote

....…
....…
Important Points
➢Random forest has nearly the same hyperparameters as a decision tree or
a bagging classifier.
➢Random forest adds additional randomness to the model, while growing
the trees. Instead of searching for the most important feature while splitting
a node, it searches for the best feature among a random subset of
features. This results in a wide diversity that generally results in a better
model.
➢Therefore, in random forest, only a random subset of the features is taken
into consideration by the algorithm for splitting a node. You can even make
trees more random by additionally using random thresholds for each
feature rather than searching for the best possible thresholds.
Important Points
 Random forest is based decision trees that are created by randomly
splitting the data.
 Generating decision trees is also known as forest.
 Each decision tree is formed using feature selection indicators like
information gain.
 Each tree is dependent on an independent sample.
 Considering it to be a classification problem, then each tree computes
votes and the highest votes class is chosen.
 If it's regression, the average of all the tree's outputs is declared as
the result.
 It is the most powerful algorithm compared to all others.
FEATURE IMPORTANCE
 It is very easy to measure the relative importance of each feature on the
prediction.
 Sklearn provides a great tool for this that measures a feature's importance.
 Tree nodes that use that feature reduce impurity across all trees in the forest.
 By looking at the feature importance you can decide which features to possibly
drop
 This is important because a general rule in machine learning is that the more
features you have the more likely your model will suffer from overfitting and vice
versa.
 Random forests make use of Gini importance or MDI (Mean decrease impurity)
to compute the importance of each attribute.
 The amount of total decrease in node impurity is also called Gini importance.
 This is the method through which accuracy or model fit decreases when there is a
drop of feature.
Features and Advantages
The advantages of random forest are:
 One of the most accurate learning algorithms available.
 Runs efficiently on large databases.
 It can handle thousands of input variables.
 Identification of importance of variables in classification
 Highly effective in estimating missing data.
 It has methods for balancing error in class population unbalanced data sets.
 Generated forests can be saved for future use on other data.
 It computes proximities between pairs of cases (clustering, locating outliers)
 The capabilities can be used for unsupervised learning
 It offers an experimental method for detecting variable interactions.

58/14
Implementation in Python
 import numpy as np
 import [Link] as plt
 import pandas as pd
 from sklearn.model_selection import train_test_split X_train, X_test,
y_train, y_test = train_test_split(X, y, test_size = 0.30)
 from [Link] import RandomForestClassifier
 classifier = RandomForestClassifier(n_estimators = 50)
 [Link](X_train, y_train)
 y_pred = [Link](X_test)
Thank You

You might also like