CPE316 - Introduction to
Machine Learning
Week 8
Decision Trees
Assoc. Prof. Dr. Caner ÖZCAN
When you’ve seen beyond
yourself, then you may
find, peace of mind is
waiting there.
~ George Harrison
Machine Learning Glossary
• [Link]
• [Link]
3
Decision Tree
• Decision Trees are Machine Learning algorithms that can perform both
classification and regression tasks, and even multioutput tasks.
• A decision tree is a hierarchical data structure implementing the divide-and-
conquer strategy.
• They are very powerful algorithms, capable of fitting complex datasets.
• Decision Trees are also the fundamental components of Random Forests, which are
among the most powerful Machine Learning algorithms available today.
• In a Decision tree, there are two nodes, which are the Decision Node and Leaf
Node.
• Decision nodes are used to make any decision and have multiple branches,
whereas Leaf nodes are the output of those decisions and do not contain any
further branches.
4
Decision Tree
• In order to build a tree, we use the CART algorithm, which stands for Classification and Regression
Tree algorithm.
• A decision tree simply asks a question and based on the answer (Yes/No), it further split the tree
into subtrees.
• Below diagram explains the general structure of a decision tree:
A decision tree can contain
categorical data (YES/NO) as well as
numeric data.
5
Decision Tree Terminologies
• Root Node: Root node is from where the decision tree starts. It represents the
entire dataset, which further gets divided into two or more homogeneous sets.
• Leaf Node: Leaf nodes are the final output node, and the tree cannot be
segregated further after getting a leaf node.
• Splitting: Splitting is the process of dividing the decision node/root node into sub-
nodes according to the given conditions.
• Branch/Sub Tree: A tree formed by splitting the tree.
• Pruning: Pruning is the process of removing the unwanted branches from the tree.
• Parent/Child node: The root node of the tree is called the parent node, and other
nodes are called the child nodes.
6
Decision Tree Algorithm Steps
• In a decision tree, for predicting the class of the given dataset, the algorithm starts from
the root node of the tree.
• This algorithm compares the values of root attribute with the record (real dataset)
attribute and based on the comparison, follows the branch and jumps to the next node.
• For the next node, the algorithm again compares the attribute value with the other sub-
nodes and move further.
• It continues the process until it reaches the leaf node of the tree.
• The complete process can be better understood using the below algorithm:
• Step-1: Begin the tree with the root node, says S, which contains the complete dataset.
• Step-2: Find the best attribute in the dataset using Attribute Selection Measure (ASM).
• Step-3: Divide the S into subsets that contains possible values for the best attributes.
• Step-4: Generate the decision tree node, which contains the best attribute.
• Step-5: Recursively make new decision trees using the subsets of the dataset created in Step-3.
Continue this process until a stage is reached where you cannot further classify the nodes and called
the final node as a leaf node.
7
Decision Tree
• Each decision node m implements a test function 𝑓𝑚 (𝑥) with discrete outcomes
labeling the branches.
• Given an input, at each node, a test is applied and one of the branches is taken
depending on the outcome.
• This process starts at the root and is repeated recursively until a leaf node is hit, at
which point the value written in the leaf constitutes the output.
• Example of a dataset and the corresponding decision
tree.
• Oval nodes are the decision nodes and rectangles are
leaf nodes.
• The univariate decision node splits along one axis,
and successive splits are orthogonal to each other.
• After the first split, {x|x1 < w10} is pure and is not split
8
further.
Decision Tree Example
• Suppose there is a candidate who has a job offer and wants to decide whether he should accept the
offer or Not. So, to solve this problem, the decision tree starts with the root node.
• The root node splits further into the next decision node (distance from the office) and one leaf
node based on the corresponding labels.
• The next decision node further gets split into one decision node (Cab facility) and one leaf node.
• Finally, the decision node splits into two leaf nodes (Accepted offers and Declined offer).
• Consider the below diagram:
9
Decision Tree Example (in Turkish)
10
Decision Tree Attribute Selection Measures
• While implementing a Decision tree, the main issue arises that how to select the best attribute for
the root node and for sub-nodes.
• So, to solve such problems there is a technique which is called as Attribute selection measure or
ASM.
• By this measurement, we can easily select the best attribute for the nodes of the tree.
• There are two popular techniques for ASM, which are:
• Information Gain
• Gini Index
11
Decision Tree Attribute Selection Measures
Information Gain
• Information gain is the measurement of changes in entropy after the segmentation of a dataset based on an
attribute.
• It calculates how much information a feature provides us about a class.
• According to the value of information gain, we split the node and build the decision tree.
• A decision tree algorithm always tries to maximize the value of information gain, and a node/attribute having the
highest information gain is split first.
• It can be calculated using the below formula:
Information Gain= Entropy(S)− [(Weighted Avg) ∗Entropy(each feature)
• Entropy: Entropy is a metric to measure the impurity in a given attribute. It specifies randomness in data. Entropy can
be calculated as:
Entropy(S)= −P(yes) log2 P(yes)− P(no) log2 P(no)
where,
S : Total number of samples
P(yes) : probability of yes
P(no) : probability of no
12
Decision Tree Attribute Selection Measures
Gini Index
• Gini index is a measure of impurity or purity used while creating a decision tree in the
CART(Classification and Regression Tree) algorithm.
• An attribute with the low Gini index should be preferred as compared to the high Gini index.
• It only creates binary splits, and the CART algorithm uses the Gini index to create binary splits.
• Gini index can be calculated using the below formula:
Gini index = 1− σ𝑗 𝑝𝑗2
• 𝑝𝑗 is the probability of occurrence of class 𝑗. It is calculated for each class and the sum of the
squares of the results is subtracted.
• The Gini value gets a result between 0 and 1, and the closer the result is to 0, the better the
discrimination.
13
Gini Index Example (in Turkish)
• Elimizde Kırmızı ve Mavi renkte 4 topumuz olduğunu varsayalım;
14
Pruning: Getting an Optimal Decision Tree
• Pruning is a process of deleting the unnecessary nodes from a tree in order to get
the optimal decision tree.
• A too-large tree increases the risk of overfitting, and a small tree may not capture
all the important features of the dataset.
• A technique that decreases the size of the learning tree without reducing accuracy
is known as Pruning.
• There are mainly two types of tree pruning technology used:
• Cost Complexity Pruning
• Reduced Error Pruning
15
Training and Visualizing a Decision Tree
• To understand Decision Trees, let’s just build one and take a look at how it makes predictions.
• The following code trains a DecisionTreeClassifier on the iris dataset.
from [Link] import load_iris
from [Link] import DecisionTreeClassifier
iris = load_iris()
X = [Link][:, 2:] # petal length and width
y = [Link]
tree_clf = DecisionTreeClassifier(max_depth=2)
tree_clf.fit(X, y)
• You can visualize the trained Decision Tree by first using the export_graphviz() method to output a
graph definition file called iris_tree.dot:
from [Link] import export_graphviz
export_graphviz( You can convert this .dot file to a variety of formats such as
tree_clf, PDF or PNG using the dot command-line tool from the
out_file=image_path("iris_tree.dot"), graphviz package. This command line converts
feature_names=iris.feature_names[2:],
class_names=iris.target_names, the .dot file to a .png image file:
rounded=True, $ dot -Tpng iris_tree.dot -o iris_tree.png
filled=True 16
)
Decision Tree
• Your first decision tree looks like:
Iris Decision Tree 17
Making Predictions in Decision Tree
• Let’s see how the tree represented in figure makes
predictions.
• Suppose you find an iris flower and you want to classify it.
• You start at the root node (depth 0, at the top): this node
asks whether the flower’s petal length is smaller than 2.45
cm.
• If it is, then you move down to the root’s left child node
(depth 1, left).
• In this case, it is a leaf node (i.e., it does not have any
children nodes), so it does not ask any questions: you can
simply look at the predicted class for that node and the
Decision Tree predicts that your flower is an Iris-Setosa
(class=setosa).
18
Making Predictions in Decision Tree
• Now suppose you find another flower, but this time the
petal length is greater than 2.45 cm.
• You must move down to the root’s right child node (depth 1,
right), which is not a leaf node, so it asks another question:
is the petal width smaller than 1.75 cm?
• If it is, then your flower is most likely an Iris-Versicolor
(depth 2, left).
• If not, it is likely an Iris-Virginica (depth 2, right). It’s really
that simple.
One of the many qualities of Decision Trees is that
they require very little data preparation.
In particular, they don’t require feature scaling or
centering at all.
19
Making Predictions in Decision Tree
• A node’s samples attribute counts how many training instances it applies to.
• For example, 100 training instances have a petal length greater than 2.45 cm (depth 1,
right), among which 54 have a petal width smaller than 1.75 cm (depth 2, left).
• A node’s value attribute tells you how many training instances of each class this node
applies to: for example, the bottom-right node applies to 0 Iris-Setosa, 1 Iris-Versicolor, and
45 Iris-Virginica.
• Finally, a node’s gini attribute measures its impurity: a node is “pure” (gini=0) if all training
instances it applies to belong to the same class.
• For example, since the depth-1 left node applies only to Iris-Setosa training instances, it is
pure and its gini score is 0.
• Equation shows how the training algorithm computes the gini score Gi of the ith node.
• For example, the depth-2 left node has a gini score equal to 1 – (0/54)2 – (49/54)2 –
(5/54)2 ≈ 0.168.
Pi,k is the ratio of class k instances among the
training instances in the ith node. 20
Making Predictions in Decision Tree
• Figure shows this Decision Tree’s decision boundaries.
• The thick vertical line represents the decision boundary
of the root node (depth 0): petal length = 2.45 cm.
• Since the left area is pure (only Iris-Setosa), it cannot be
split any further.
• However, the right area is impure, so the depth-1 right
node splits it at petal width = 1.75 cm (represented by
the dashed line).
• Since max_depth was set to 2, the Decision Tree stops
right there.
• However, if you set max_depth to 3, then the two depth- Decision Tree decision boundaries
2 nodes would each add another decision boundary
(represented by the dotted lines).
21
Implementation of Decision Tree
• Now we will implement the Decision Tree using Python.
• For this, we will use the dataset "user_data.csv," which we have used in previous classification
models ([Link]
• By using the same dataset, we can compare the Decision tree classifier with other classification
models such as KNN SVM, Logistic Regression, etc.
• Steps will also remain the same, which are given below:
• Data Pre-processing step
• Fitting a Decision-Tree algorithm to the Training set
• Predicting the test result
• Test accuracy of the result(Creation of Confusion matrix)
• Visualizing the test set result.
22
Implementation of Decision Tree
1) Data Pre-processing step:
• In this step, we will pre-process/prepare the data so that we can use it efficiently in our code. It is
similar as we did in data-pre-processing.
import numpy as nm
import [Link] as mtp
import pandas as pd
#importing datasets
data_set= pd.read_csv('Social_Network_Ads.csv')
#Extracting Independent and dependent Variable
x= data_set.iloc[:, [2,3]].values
y= data_set.iloc[:, 4].values
# Splitting the dataset into training and test set.
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test =
train_test_split(x, y, test_size= 0.25, random_state=0)
#feature Scaling
from [Link] import StandardScaler
st_x= StandardScaler()
x_train= st_x.fit_transform(x_train) 23
x_test= st_x.transform(x_test)
Implementation of Decision Tree
2) Fitting a Decision-Tree Algorithm to the Training set:
• Now we will fit the model to the training set. For this, we will import the DecisionTreeClassifier class
from [Link] library. Below is the code for it:
#Fitting Decision Tree classifier to the training set
from [Link] import DecisionTreeClassifier
classifier= DecisionTreeClassifier(criterion='entropy', random_state=0)
[Link](x_train, y_train)
• In the above code, we have created a classifier object, in which we have passed two main
parameters;
• "criterion='entropy': Criterion is used to
measure the quality of split, which is
calculated by information gain given by entropy.
• random_state=0: For generating the random
states.
24
Implementation of Decision Tree
3) Prediction of the test set result:
• Now we will predict the test set result. For this, we
will create a new predictor variable y_pred and
will use the predict function to make the predictions.
# Predicting the Test set results
y_pred = [Link](x_test)
• The output shows the result for prediction vector
y_pred and real vector y_test.
• We can see that some predications are different from
the real values, which are the incorrect predictions.
25
Implementation of Decision Tree
4) Creating Confusion Matrix:
• In the above output, we have seen that there were some incorrect predictions, so if we want to
know the number of correct and incorrect predictions, we need to use the confusion matrix. Below
is the code for it:
# Making the Confusion Matrix
from [Link] import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
• As we can see in the confusion matrix output, there are
6+3= 9 incorrect predictions, and 62+29=91 correct predictions.
26
Implementation of Decision Tree
4) Visualizing the Training Set Result:
• To visualize the training set result we will plot a graph for the decision tree classifier. The classifier will predict
yes or No for the users who have either Purchased or Not purchased the SUV car.
#Visulaizing the trianing set result
from [Link] import ListedColormap
x_set, y_set = x_train, y_train
x1, x2 = [Link]([Link](start = x_set[:, 0].min() - 1, stop = x_set[:, 0].max() + 1, step =0.01),
[Link](start = x_set[:, 1].min() - 1, stop = x_set[:, 1].max() + 1, step = 0.01))
[Link](x1, x2, [Link]([Link]([[Link](), [Link]()]).T).reshape([Link]),
alpha = 0.75, cmap = ListedColormap(('purple','green' )))
[Link]([Link](), [Link]())
[Link]([Link](), [Link]())
for i, j in enumerate([Link](y_set)):
[Link](x_set[y_set == j, 0], x_set[y_set == j, 1],
c = ListedColormap(('purple', 'green'))(i), label = j)
[Link]('Decision Tree Algorithm (Training set)')
[Link]('Age')
[Link]('Estimated Salary')
[Link]()
[Link]() 27
Implementation of Decision Tree
4) Visualizing the Test Set Result:
• Visualization of test set result will be similar to the visualization of the training set except that the training set
will be replaced with the test set.
#Visulaizing the test set result
from [Link] import ListedColormap
x_set, y_set = x_test, y_test
x1, x2 = [Link]([Link](start = x_set[:, 0].min() - 1, stop = x_set[:, 0].max() + 1, step =0.01),
[Link](start = x_set[:, 1].min() - 1, stop = x_set[:, 1].max() + 1, step = 0.01))
[Link](x1, x2, [Link]([Link]([[Link](), [Link]()]).T).reshape([Link]),
alpha = 0.75, cmap = ListedColormap(('purple','green' )))
[Link]([Link](), [Link]())
[Link]([Link](), [Link]())
for i, j in enumerate([Link](y_set)):
[Link](x_set[y_set == j, 0], x_set[y_set == j, 1],
c = ListedColormap(('purple', 'green'))(i), label = j)
[Link]('Decision Tree Algorithm(Test set)')
[Link]('Age')
[Link]('Estimated Salary')
[Link]()
[Link]() 28
CPE316 - Introduction to
Machine Learning
Week 8
LAB.
Assoc. Prof. Dr. Caner ÖZCAN
Decision Tree Example
• Run Decision Tree Python
notebook and analyze the
code.
30
• With the tumor dataset, After the data preprocess part we also need to
standard scaling. Then, we can fit the model to the dataset.
• Splitting is done randomly in the decision tree, so we need to use random
state parameter.
• After that, we can make prediction with the test data and look at the
prediction results.
• The accuracy results show that most of the test data predicted truly.
• The diagonal of the confusion matrix says that 26 sample from two
class predicted wrongly.
• Also we can visualize the train and test set, and compare the sample
distribution.
References
• Ethem Apaydin, Introduction to Machine Learning, 3e. The MIT Press, 2014.
• Christopher M. Bishop, Pattern Recognition and Machine Learning, Springer, 2011.
• Tom Mitchell, Machine Learning, McGraw Hill, 1997.
• Russell, S., and P. Norvig. 2009. Artificial Intelligence: A Modern Approach, 3rd ed. New
York: Prentice Hall.
• “Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow: Concepts, Tools,
and Techniques to Build Intelligent Systems”, Aurélien Géron, O'Reilly Media (2019).
• [Link]
• [Link]
• [Link]
makine-%C3%B6%C4%9Frenmesi-serisi-3-a03f3ff00ba5