Decision Tree Classifier Implementation

The document summarizes machine learning experiments on a dataset using decision tree algorithms like ID3 and CART. It performs data cleaning, splits the data into training and test sets, builds decision tree models and evaluates their accuracy. It also visualizes the decision tree using scikit-learn and pydotplus to get a clean visualization. The conclusion states that decision trees are simple yet effective tools for discrimination problems that can be easily explained to non-experts without complex math.

Uploaded by

Suprit D. Shrestha

Lab DA1

Machine Learning Lab

Name: Suprit Darshan Shrestha
[Link]:19BCE2584
Data cleaning

Combined columns with close or similar values to decrease the number of
columns.

Find-S algorithm:
Code:
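import csv

# Load the training instances, skipping the header row.
a = []
with open('[Link]', 'r') as csvfile:
    next(csvfile)
    for row in csv.reader(csvfile):
        a.append(row)
    print(a)

print("\nThe total number of training instances are : ", len(a))

# The last column holds the class label; the rest are attributes.
num_attribute = len(a[0]) - 1

print("\nThe initial hypothesis is : ")

hypothesis = ['0'] * num_attribute
print(hypothesis)

# A separate flag marks the first positive instance, so a genuine
# attribute value of '0' is not mistaken for "not yet set".
seen_positive = False

for i in range(0, len(a)):

    # This script treats the label "0" as the positive class.
    if a[i][num_attribute] == "0":
        print("\nInstance ", i+1, "is", a[i], " and is Positive Instance")
        if not seen_positive:
            # First positive instance: adopt its attribute values as-is.
            hypothesis = a[i][:num_attribute]
            seen_positive = True
        else:
            for j in range(0, num_attribute):
                # Generalize any attribute that disagrees with this instance.
                if hypothesis[j] != a[i][j]:
                    hypothesis[j] = '?'
        print("The hypothesis for the training instance", i+1, " is: ", hypothesis, "\n")

    if a[i][num_attribute] == "1":
        print("\nInstance ", i+1, "is", a[i], " and is Negative Instance Hence Ignored")
        print("The hypothesis for the training instance", i+1, " is: ", hypothesis, "\n")

print("\nThe Maximally specific hypothesis for the training instance is ", hypothesis)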
Output:
Decision Tree:
Code:
import pandas as pd
dataset = pd.read_csv('[Link]')
X = dataset.iloc[:, :-1].values   # every column except the last is a feature
y = dataset.iloc[:, -1].values    # the last column is the class label
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 0)
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)   # fit the scaler on the training split only
X_test = sc.transform(X_test)         # reuse the training statistics on the test split

from sklearn.tree import DecisionTreeClassifier

classifier = DecisionTreeClassifier(criterion = 'entropy', random_state = 0)
classifier.fit(X_train, y_train)

# Notebook echo of the fitted estimator:
# DecisionTreeClassifier(class_weight=None, criterion='entropy', max_depth=None,
#                        max_features=None, max_leaf_nodes=None,
#                        min_samples_leaf=1, min_samples_split=2,
#                        min_weight_fraction_leaf=0.0,
#                        random_state=0, splitter='best')

y_pred = classifier.predict(X_test)
from sklearn import metrics  # scikit-learn metrics module for accuracy calculation

print("Accuracy:", metrics.accuracy_score(y_test, y_pred))
Output:

ID3:
Code:
import pandas as pd
import math
import numpy as np

data = pd.read_csv('[Link]')
features = [feat for feat in data]
features.remove("Diagnosis")   # "Diagnosis" is the target column, not a feature

class Node:
    def __init__(self):
        self.children = []
        self.value = ""
        self.isLeaf = False
        self.pred = ""

def entropy(examples):
    # Count the two classes; Diagnosis == 0 is counted as "pos".
    pos = 0.0
    neg = 0.0
    for _, row in examples.iterrows():
        if row["Diagnosis"] == 0:
            pos += 1
        else:
            neg += 1
    if pos == 0.0 or neg == 0.0:
        return 0.0
    else:
        p = pos / (pos + neg)
        n = neg / (pos + neg)
        return -(p * math.log(p, 2) + n * math.log(n, 2))

def info_gain(examples, attr):
    uniq = np.unique(examples[attr])
    print("\n", uniq)
    # Start from the parent entropy and subtract each subset's weighted entropy.
    gain = entropy(examples)
    for u in uniq:
        subdata = examples[examples[attr] == u]
        sub_e = entropy(subdata)
        gain -= (float(len(subdata)) / float(len(examples))) * sub_e
    return gain

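def ID3(examples, attrs):
    root = Node()

    # Pick the attribute with the highest information gain.
    max_gain = 0
    max_feat = ""
    for feature in attrs:
        gain = info_gain(examples, feature)
        if gain > max_gain:
            max_gain = gain
            max_feat = feature

    # No attribute left, or none with positive gain: stop with a
    # majority-class leaf instead of indexing on an empty column name.
    if max_feat == "":
        root.isLeaf = True
        root.pred = examples["Diagnosis"].mode().values
        return root

    root.value = max_feat
    uniq = np.unique(examples[max_feat])
    for u in uniq:
        subdata = examples[examples[max_feat] == u]
        if entropy(subdata) == 0.0:
            # Pure subset: create a leaf holding the class label.
            newNode = Node()
            newNode.isLeaf = True
            newNode.value = u
            newNode.pred = np.unique(subdata["Diagnosis"])
            root.children.append(newNode)
        else:
            # Mixed subset: recurse on the remaining attributes.
            dummyNode = Node()
            dummyNode.value = u
            new_attrs = attrs.copy()
            new_attrs.remove(max_feat)
            child = ID3(subdata, new_attrs)
            dummyNode.children.append(child)
            root.children.append(dummyNode)
    return root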

def printTree(root: Node, depth=0):
    # Indent one tab per tree level.
    for i in range(depth):
        print("\t", end="")
    print(root.value, end="")
    if root.isLeaf:
        print(" -> ", root.pred)
    print()
    for child in root.children:
        printTree(child, depth + 1)

root = ID3(data, features)

printTree(root)
Output:
Clean visualization:
Code:
feature_cols = ['area','peri','ECC','solidity','orient_wbc','nuc,AVG','entropy_cyt','AXIS','Diagnosis']

Common questions

What potential benefits are there in using 'entropy' as the criterion for decision tree classifiers compared to other criteria?

With criterion='entropy', each split is chosen to maximize information gain, that is, the reduction in class entropy from the parent node to its children. This is the measure ID3 uses, so it keeps the scikit-learn model consistent with the hand-written ID3 implementation. The default alternative, 'gini', measures impurity in a closely related way, and the two criteria usually produce similar trees with similar accuracy; entropy is slightly more expensive to compute because it involves a logarithm. Neither criterion guarantees a more balanced, more interpretable, or more accurate tree, so the choice is best settled by validation on the dataset at hand.
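To make the comparison concrete, the following minimal sketch evaluates both impurity measures on a few class distributions. The function names `entropy_impurity` and `gini_impurity` are hypothetical helpers written for this illustration, not scikit-learn functions.

```python
import math

def entropy_impurity(probs):
    """Shannon entropy, in bits, of a class probability distribution."""
    return sum(p * math.log2(1 / p) for p in probs if p > 0)

def gini_impurity(probs):
    """Gini impurity of a class probability distribution."""
    return 1.0 - sum(p * p for p in probs)

# Both measures are zero for a pure node and largest for an even split.
for probs in ([1.0, 0.0], [0.9, 0.1], [0.5, 0.5]):
    print(probs, round(entropy_impurity(probs), 3), round(gini_impurity(probs), 3))
```

Both measures rank these three distributions in the same order, which is one reason the two criteria tend to choose similar splits.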

In the context of machine learning, why is it beneficial to visualize a decision tree, such as transforming it into a 'diabetes.png' image?

Visualizing a decision tree makes the model's decision process easier to follow. The diagram shows which features are tested near the root, the sequence of tests along each path, and how combinations of feature values map to a predicted class. This improves transparency, because non-specialist stakeholders can trace a prediction from root to leaf without working through any formulas. In the given context, exporting the tree to a 'diabetes.png' image gives a single picture that can be used to explain the predictive paths taken by the model.

How does the entropy function in the ID3 algorithm calculate disorder, and why is this metric crucial for building decision trees?

The entropy function measures how mixed the class labels are within a set of examples. It uses the formula -Σ(p(x) * log2(p(x))), where p(x) is the proportion of examples belonging to class x. An entropy of 0 means every example has the same class; for two classes the maximum is 1 bit, reached when the classes are split 50/50. Lower entropy therefore indicates greater homogeneity. The metric matters because it guides attribute selection: ID3 splits on the attribute that reduces entropy the most, producing branches with purer subsets at each step.
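As a small worked illustration of the formula, the sketch below computes entropy directly from a list of class labels. `label_entropy` is a hypothetical helper, and the label lists are made-up examples rather than rows from the lab dataset.

```python
import math
from collections import Counter

def label_entropy(labels):
    """Entropy in bits of the class labels in a non-empty list."""
    n = len(labels)
    # Each class contributes p * log2(1/p), where p = count / n.
    return sum((c / n) * math.log2(n / c) for c in Counter(labels).values())

print(label_entropy([0, 0, 0, 0]))  # one class only
print(label_entropy([0, 0, 1, 1]))  # even two-class split
print(label_entropy([0, 0, 0, 1]))  # three-to-one split
```

The three-to-one case works out to 0.75 * log2(4/3) + 0.25 * log2(4), roughly 0.811 bits, which sits between the pure and evenly mixed cases.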

What challenges might arise from combining similar columns during data cleaning, and how can these affect a machine learning model?

Combining similar columns during data cleaning can discard subtle but informative differences between those features. If the merged columns carried distinct signal about the class, the model loses that signal and may become less accurate. Merging can also conflate features that only look alike, creating associations that do not exist in the raw data. The step is a form of dimensionality reduction, so it should be checked against model performance to confirm that the information that matters has been preserved.

Why is it important to specify the test_size parameter during the train_test_split operation, as seen in the example?

The 'test_size' parameter of train_test_split sets the proportion of the dataset held out for testing; the remainder is used for training. For example, test_size=0.25 reserves 25% of the data for testing. Because the held-out rows are never seen during training, accuracy on them estimates how the model performs on unseen data. A held-out set does not prevent overfitting by itself, but it exposes it: a model that scores well on the training data and poorly on the test data has overfit.
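The idea of a hold-out split can be sketched in a few lines of standard-library Python. This is not scikit-learn's implementation: `holdout_split` is a hypothetical helper, and rounding the test count up is an assumption of this sketch.

```python
import math
import random

def holdout_split(rows, test_size=0.25, seed=0):
    """Shuffle a copy of rows and split off a test fraction.

    Returns (train, test). A fixed seed makes the split reproducible,
    in the same spirit as random_state in the lab code.
    """
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = math.ceil(test_size * len(shuffled))  # assumption: round up
    return shuffled[n_test:], shuffled[:n_test]

train, test = holdout_split(range(20), test_size=0.25, seed=0)
print(len(train), len(test))
```

Every row lands in exactly one of the two parts, so the test rows stay unseen during training.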

What role does feature scaling play in the training of the DecisionTreeClassifier, and why is it applied in the given code example?

The given code standardizes the features with StandardScaler before training the DecisionTreeClassifier, so each feature has zero mean and unit variance on the training split. For a decision tree this step is not required: each split compares a single feature against a threshold, so only the ordering of values within a feature matters, and standardization preserves that ordering. Scaling therefore should not change which rows fall on each side of a split. It is harmless here and becomes useful if the same preprocessing is reused with scale-sensitive models such as k-nearest neighbours, SVMs, or gradient-based methods.
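The invariance of threshold splits to scaling can be checked on a single feature. The sketch below is a minimal illustration with made-up values: `best_threshold_partition` is a hypothetical helper that finds the lowest-entropy threshold split, and the standardization uses the population standard deviation, as StandardScaler does.

```python
import math
from statistics import mean, pstdev

def entropy(labels):
    """Entropy in bits of a non-empty list of class labels."""
    n = len(labels)
    return sum((labels.count(c) / n) * math.log2(n / labels.count(c))
               for c in set(labels))

def best_threshold_partition(values, labels):
    """Return the set of row indices sent left by the lowest-entropy split."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    best_score, best_left = None, None
    for k in range(1, len(order)):
        if values[order[k - 1]] == values[order[k]]:
            continue  # no threshold fits between equal values
        left, right = order[:k], order[k:]
        score = (len(left) * entropy([labels[i] for i in left])
                 + len(right) * entropy([labels[i] for i in right])) / len(order)
        if best_score is None or score < best_score:
            best_score, best_left = score, frozenset(left)
    return best_left

values = [12.0, 15.0, 11.0, 30.0, 28.0, 35.0]   # made-up feature values
labels = [0, 0, 0, 1, 1, 1]
mu, sigma = mean(values), pstdev(values)
scaled = [(v - mu) / sigma for v in values]     # standardized copy

print(best_threshold_partition(values, labels)
      == best_threshold_partition(scaled, labels))
```

The same rows go left in both cases because standardization is a strictly increasing transformation and the split scores depend only on the labels on each side.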

How does the ID3 algorithm decide when to create a leaf node during the tree construction process?

ID3 creates a leaf node when every instance in a subset belongs to the same class. This is the case when the subset's entropy is zero, so there is no ambiguity left to resolve, and the leaf is labeled with that class. ID3 also creates a leaf when no attributes remain to split on; the leaf is then labeled with the majority class of the subset. In both cases further splitting would add nothing, so the classification at that node is final.
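The two stopping conditions can be written as a small decision function. This is a sketch under the assumption that the label list is non-empty; `leaf_label` and the attribute name "area" in the usage lines are illustrative only.

```python
from collections import Counter

def leaf_label(labels, remaining_attrs):
    """Return the class for a leaf, or None if the node should be split further."""
    if len(set(labels)) == 1:
        # Pure subset: entropy is zero, so the single class is the label.
        return labels[0]
    if not remaining_attrs:
        # Nothing left to split on: fall back to the majority class.
        return Counter(labels).most_common(1)[0][0]
    return None

print(leaf_label([1, 1, 1], ["area"]))  # pure subset
print(leaf_label([0, 1, 1], []))        # attributes exhausted
print(leaf_label([0, 1, 1], ["area"]))  # mixed subset, keep splitting
```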

What is the significance of the 'splitter' parameter in the DecisionTreeClassifier, and how does its setting affect the model?

The 'splitter' parameter of DecisionTreeClassifier sets the strategy used to choose the split at each node. With 'best', the value shown in the fitted estimator above, the classifier evaluates the candidate splits and selects the one that scores best under the chosen criterion (here 'entropy'). With 'random', it chooses the best among randomly drawn splits, which is cheaper to compute and adds randomness to the tree, at the possible cost of accuracy for a single tree. The choice therefore trades split quality against training time.

Why is it important to differentiate between positive and negative instances when forming the hypothesis in a machine learning algorithm?

In the Find-S procedure used here, the hypothesis is built from positive instances only. Each positive instance generalizes the hypothesis just enough to cover it: attribute values that agree are kept, and values that differ are replaced with '?'. Negative instances are ignored, as the printed output states. The distinction matters because generalizing on a negative instance would force the hypothesis to cover an example it is supposed to exclude, so the learned concept would be too general.
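The update rule can be seen on a tiny, made-up dataset. The sketch below assumes the class label is the last element of each row; `find_s` and the toy rows are hypothetical and are not taken from the lab data.

```python
def find_s(instances, positive_label):
    """Return the maximally specific hypothesis covering the positive instances."""
    hypothesis = None
    for *attrs, label in instances:
        if label != positive_label:
            continue  # negative instances are ignored
        if hypothesis is None:
            hypothesis = list(attrs)  # first positive: copy it as-is
        else:
            # Keep agreeing values, generalize disagreeing ones to '?'.
            hypothesis = [h if h == a else '?' for h, a in zip(hypothesis, attrs)]
    return hypothesis

toy = [
    ("sunny", "warm", "normal", "yes"),
    ("sunny", "warm", "high", "yes"),
    ("rainy", "cold", "high", "no"),
    ("sunny", "warm", "high", "yes"),
]
print(find_s(toy, positive_label="yes"))
```

Only the third attribute differs among the positive rows, so it is the only one generalized; the negative row leaves the hypothesis untouched.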

How does entropy contribute to the calculation of information gain in the ID3 algorithm?

Entropy measures the impurity of a set of examples with respect to the target label. To score an attribute, ID3 computes the entropy of the current set and the entropy of each subset produced by splitting on that attribute's values. Information gain is the entropy of the current set minus the weighted sum of the subset entropies, where each weight is the fraction of examples falling in that subset. The attribute with the highest information gain is selected, because it separates the classes most effectively.
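The calculation can be followed on a four-row example. This is a minimal sketch using plain dictionaries instead of a pandas DataFrame; the attribute names "shape" and "size" and the rows themselves are invented for illustration, with "Diagnosis" as the target as in the lab code.

```python
import math
from collections import Counter, defaultdict

def entropy(labels):
    """Entropy in bits of a non-empty list of class labels."""
    n = len(labels)
    return sum((c / n) * math.log2(n / c) for c in Counter(labels).values())

def info_gain(rows, attr, target):
    """Parent entropy minus the weighted entropy of the subsets split on attr."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[attr]].append(row[target])
    parent = entropy([row[target] for row in rows])
    weighted = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return parent - weighted

rows = [
    {"shape": "round", "size": "small", "Diagnosis": 0},
    {"shape": "round", "size": "large", "Diagnosis": 0},
    {"shape": "irregular", "size": "small", "Diagnosis": 1},
    {"shape": "irregular", "size": "large", "Diagnosis": 1},
]
print(info_gain(rows, "shape", "Diagnosis"))
print(info_gain(rows, "size", "Diagnosis"))
```

Here "shape" separates the classes perfectly and so earns the full parent entropy as gain, while "size" leaves both subsets evenly mixed and earns none, so ID3 would split on "shape".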
