Decision Tree Classifier Implementation

The document summarizes machine learning experiments on a dataset using decision tree algorithms like ID3 and CART. It performs data cleaning, splits the data into training and test sets, builds decision tree models and evaluates their accuracy. It also visualizes the decision tree using scikit-learn and pydotplus to get a clean visualization. The conclusion states that decision trees are simple yet effective tools for discrimination problems that can be easily explained to non-experts without complex math.

Lab DA1

Machine Learning Lab


Name: Suprit Darshan Shrestha
[Link]:19BCE2584
Data cleaning

Columns with close or similar values were combined to decrease the
number of columns.
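
The report does not say which columns were merged or how. As a minimal sketch only, assuming pandas and two hypothetical columns (`axis_major` and `axis_minor` are invented names, as are the values), closely related columns could be replaced by their row-wise mean:

```python
import pandas as pd

# Hypothetical data: the column names and values are illustrative only.
df = pd.DataFrame({
    "axis_major": [10.0, 12.0, 11.0],
    "axis_minor": [9.0, 11.0, 10.0],
    "Diagnosis": [0, 1, 0],
})

# Replace two closely related columns with their row-wise mean.
similar_cols = ["axis_major", "axis_minor"]
df["AXIS"] = df[similar_cols].mean(axis=1)
df = df.drop(columns=similar_cols)

# Keep the target as the last column, as the later code expects.
df = df[["AXIS", "Diagnosis"]]
print(df)
```

Averaging is only one possible way to merge columns; whether it is appropriate depends on what the columns measure.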

S_algo:
Code:
import csv

# Read every data row of the training file, skipping the header line.
a = []
with open('[Link]', 'r') as csvfile:
    next(csvfile)
    for row in csv.reader(csvfile):
        a.append(row)
print(a)

print("\nThe total number of training instances are : ", len(a))

# The last column holds the class label; the rest are attributes.
num_attribute = len(a[0]) - 1

print("\nThe initial hypothesis is : ")
hypothesis = ['0'] * num_attribute
print(hypothesis)

for i in range(0, len(a)):
    # A label of "0" marks a positive instance: generalize the hypothesis to cover it.
    if a[i][num_attribute] == "0":
        print("\nInstance ", i+1, "is", a[i], " and is Positive Instance")
        for j in range(0, num_attribute):
            if hypothesis[j] == '0' or hypothesis[j] == a[i][j]:
                hypothesis[j] = a[i][j]
            else:
                hypothesis[j] = '?'
        print("The hypothesis for the training instance", i+1, " is: ", hypothesis, "\n")

    # A label of "1" marks a negative instance, which Find-S ignores.
    if a[i][num_attribute] == "1":
        print("\nInstance ", i+1, "is", a[i], " and is Negative Instance Hence Ignored")
        print("The hypothesis for the training instance", i+1, " is: ", hypothesis, "\n")

print("\nThe Maximally specific hypothesis for the training instance is ", hypothesis)
Output:
Decision Tree:
Code:
import pandas as pd
from sklearn import metrics  # scikit-learn metrics module for accuracy calculation
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Load the data: every column except the last is a feature, the last is the label.
dataset = pd.read_csv('[Link]')
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, -1].values

# Hold out 25% of the rows for testing.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# Standardize the features; the scaler is fitted on the training split only.
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

# Train a decision tree that splits on information gain (entropy).
classifier = DecisionTreeClassifier(criterion='entropy', random_state=0)
classifier.fit(X_train, y_train)
# fit() echoed the fitted estimator:
# DecisionTreeClassifier(class_weight=None, criterion='entropy', max_depth=None,
#                        max_features=None, max_leaf_nodes=None,
#                        min_samples_leaf=1, min_samples_split=2,
#                        min_weight_fraction_leaf=0.0,
#                        random_state=0, splitter='best')

# Evaluate on the held-out test split.
y_pred = classifier.predict(X_test)
print("Accuracy:", metrics.accuracy_score(y_test, y_pred))
Output:

ID3:
Code:
import pandas as pd
import math
import numpy as np

data = pd.read_csv('[Link]')
# Every column except the target "Diagnosis" is a candidate attribute.
features = [feat for feat in data]
features.remove("Diagnosis")

class Node:
    def __init__(self):
        self.children = []   # child nodes
        self.value = ""      # attribute name, or attribute value on a branch
        self.isLeaf = False
        self.pred = ""       # predicted class (leaves only)

def entropy(examples):
    # Binary entropy of the "Diagnosis" column, in bits.
    pos = 0.0
    neg = 0.0
    for _, row in examples.iterrows():
        if row["Diagnosis"] == 0:
            pos += 1
        else:
            neg += 1
    if pos == 0.0 or neg == 0.0:
        return 0.0
    else:
        p = pos / (pos + neg)
        n = neg / (pos + neg)
        return -(p * math.log(p, 2) + n * math.log(n, 2))

def info_gain(examples, attr):
    # Entropy of the set minus the weighted entropy of each subset split on attr.
    uniq = np.unique(examples[attr])
    print("\n", uniq)
    gain = entropy(examples)
    for u in uniq:
        subdata = examples[examples[attr] == u]
        sub_e = entropy(subdata)
        gain -= (float(len(subdata)) / float(len(examples))) * sub_e
    return gain

def ID3(examples, attrs):
    root = Node()

    # Pick the attribute with the highest information gain.
    max_gain = 0
    max_feat = ""
    for feature in attrs:
        gain = info_gain(examples, feature)
        if gain > max_gain:
            max_gain = gain
            max_feat = feature
    if max_feat == "":
        # No attribute is left or none gives a positive gain:
        # stop with a majority-class leaf instead of indexing an empty name.
        root.isLeaf = True
        root.pred = examples["Diagnosis"].mode()[0]
        return root
    root.value = max_feat

    # Create one branch per distinct value of the chosen attribute.
    uniq = np.unique(examples[max_feat])
    for u in uniq:
        subdata = examples[examples[max_feat] == u]
        if entropy(subdata) == 0.0:
            # Pure subset: make a leaf that predicts its single class.
            newNode = Node()
            newNode.isLeaf = True
            newNode.value = u
            newNode.pred = np.unique(subdata["Diagnosis"])
            root.children.append(newNode)
        else:
            # Mixed subset: recurse on the remaining attributes.
            dummyNode = Node()
            dummyNode.value = u
            new_attrs = attrs.copy()
            new_attrs.remove(max_feat)
            child = ID3(subdata, new_attrs)
            dummyNode.children.append(child)
            root.children.append(dummyNode)
    return root

def printTree(root: Node, depth=0):
    # Print the tree with one tab of indentation per level.
    for i in range(depth):
        print("\t", end="")
    print(root.value, end="")
    if root.isLeaf:
        print(" -> ", root.pred)
    print()
    for child in root.children:
        printTree(child, depth + 1)

root = ID3(data, features)
printTree(root)
Output:
Clean visualization:
Code:
from io import StringIO

import pydotplus
from IPython.display import Image
from sklearn.tree import DecisionTreeClassifier, export_graphviz

# Column names of the dataset; the last one, 'Diagnosis', is the target.
columns = ['area','peri','ECC','solidity','orient_wbc','nuc,AVG','entropy_cyt','AXIS','Diagnosis']
# export_graphviz needs exactly one name per feature column of X, so drop the target.
feature_cols = columns[:-1]

# Fit a tree on the training split from the Decision Tree section.
clf = DecisionTreeClassifier()
clf = clf.fit(X_train, y_train)

# Export the tree in Graphviz DOT format and render it as a PNG.
dot_data = StringIO()
export_graphviz(clf, out_file=dot_data,
                filled=True, rounded=True,
                special_characters=True, feature_names=feature_cols, class_names=['0', '1'])
graph = pydotplus.graph_from_dot_data(dot_data.getvalue())
graph.write_png('[Link]')
Image(graph.create_png())
Output:
Conclusion:
Decision trees give a direct answer to a discrimination (classification) problem, and they are one of the few
data-processing approaches that can be presented quickly to a non-specialist audience without getting lost in
difficult mathematical formulas.

Common questions

Using 'entropy' as the criterion means that each split is chosen to maximize information gain, that is, the reduction in class entropy from the parent node to its children; this is the measure ID3 uses. The 'gini' criterion measures node impurity in a similar way with a different formula, and in practice the two usually produce similar trees. Neither criterion guarantees a more balanced, more interpretable or more accurate tree, so the choice is best checked empirically on the dataset at hand.
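
To make the comparison concrete, the following minimal sketch computes both impurity measures for a two-class node. The helper names `entropy` and `gini` are illustrative and are not part of scikit-learn's public API.

```python
import math

def entropy(proportions):
    """Shannon entropy in bits for a list of class proportions."""
    return sum(p * math.log2(1.0 / p) for p in proportions if p > 0)

def gini(proportions):
    """Gini impurity for a list of class proportions."""
    return 1.0 - sum(p * p for p in proportions)

# Compare the two measures for a balanced, a skewed and a pure node.
for p in (0.5, 0.9, 1.0):
    dist = [p, 1.0 - p]
    print(f"p={p:.1f}  entropy={entropy(dist):.3f}  gini={gini(dist):.3f}")
```

Both measures are zero for a pure node and largest for a balanced one, which is why they tend to rank candidate splits similarly.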

Visualizing a decision tree makes it easier to interpret and understand the decision-making process of the model. It shows which features are used near the root, the sequence of decisions, and how input features map to outcomes. This can enhance transparency by allowing non-specialist stakeholders to grasp the model's workings without delving into complex mathematical formulas. In the given context, exporting the tree to a 'diabetes.png' image simplifies the explanation of the predictive paths taken by the model.
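
When Graphviz and pydotplus are not available, scikit-learn's `export_text` gives a plain-text view of the same structure. This is a sketch on toy data with hypothetical values, not the lab's dataset; only the feature names are borrowed from the report.

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Toy data with two features; the values are invented for illustration.
X = [[1.0, 5.0], [2.0, 4.0], [3.0, 7.0], [6.0, 1.0], [7.0, 3.0], [8.0, 2.0]]
y = [0, 0, 0, 1, 1, 1]

clf = DecisionTreeClassifier(criterion="entropy", random_state=0).fit(X, y)

# One line per node: the split condition or the predicted class.
rules = export_text(clf, feature_names=["area", "peri"])
print(rules)
```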

The entropy function in the ID3 algorithm measures disorder from the distribution of classes within a dataset. It uses the formula -Σ p(x) log2 p(x), where p(x) is the proportion of the dataset belonging to class x. The result quantifies the uncertainty or impurity of the classification: it is 0 for a subset containing a single class and 1 for two equally frequent classes. This metric is crucial because it guides the selection of attributes for splitting, aiming to reduce entropy with each step and thereby create branches with purer subgroups.
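
As a worked instance of the formula, assume a hypothetical subset of 14 examples, 9 of one class and 5 of the other (these counts are illustrative and not taken from the lab's dataset):

```latex
H(S) = -\sum_{x} p(x)\,\log_2 p(x)
     = -\tfrac{9}{14}\log_2\tfrac{9}{14} \;-\; \tfrac{5}{14}\log_2\tfrac{5}{14}
     \approx 0.940 \text{ bits}
```

The value is close to 1 because the two classes are nearly balanced.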

Combining similar columns during data cleaning can discard subtle but important variation between those features. This may erase nuances that help distinguish between classes, potentially leading to a less accurate model. Merging features that are not truly equivalent can also blur distinct signals and weaken the model's predictive power. The reduction in dimensionality therefore has to be balanced against preserving critical information.

The 'test_size' parameter of train_test_split determines the proportion of the dataset held out for testing rather than training. Evaluating the model on rows it has never seen gives a more reliable estimate of its performance. For example, test_size=0.25 reserves 25% of the data for testing, which helps to detect overfitting, where the model performs well on the training data but poorly on unseen data.
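
The effect of `test_size` on the split sizes can be checked directly. The sketch below uses a synthetic array of 100 rows as a stand-in for the lab's dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the lab's dataset: 100 rows, 3 features.
X = np.arange(300).reshape(100, 3)
y = np.array([0, 1] * 50)

# Same call as in the report: 25% of the rows go to the test split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)
print(X_train.shape, X_test.shape)
```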

Feature scaling with StandardScaler is applied in the given code to standardize each feature to zero mean and unit variance before training the DecisionTreeClassifier. For scale-sensitive algorithms this stops features with large ranges from dominating those with small ranges. Decision trees, however, split on a threshold of one feature at a time, so they are unaffected by this kind of rescaling. Scaling is therefore not necessary here; it is mainly useful for keeping the preprocessing consistent when the same data is also fed to other algorithms.
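
The scale invariance can be illustrated on a toy problem. This sketch, using invented one-feature data, trains one tree on raw values and one on standardized values and compares their predictions.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Toy one-feature data: class 0 for small values, class 1 for large ones.
X_train = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [6.0], [7.0], [8.0]])
y_train = np.array([0, 0, 0, 0, 1, 1, 1, 1])
X_test = np.array([[2.5], [6.5]])

# Tree trained on the raw feature values.
raw_tree = DecisionTreeClassifier(criterion="entropy", random_state=0)
raw_tree.fit(X_train, y_train)

# Tree trained on standardized values; the scaler is fitted on the training data only.
sc = StandardScaler()
scaled_tree = DecisionTreeClassifier(criterion="entropy", random_state=0)
scaled_tree.fit(sc.fit_transform(X_train), y_train)

raw_pred = raw_tree.predict(X_test)
scaled_pred = scaled_tree.predict(sc.transform(X_test))
print(raw_pred, scaled_pred)
```

Both trees separate the classes at the same place in the data; only the numeric threshold is expressed in different units.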

The ID3 algorithm creates a leaf node when all the instances in a subset belong to the same class. This occurs when the subset's entropy is zero, indicating no ambiguity in classification, and the leaf is labeled with that class. In the general algorithm, a leaf labeled with the majority class is also created when no attributes remain to split on. In both cases no further splitting is needed and the classification at that node is final.

The 'splitter' parameter, set to 'best' in the DecisionTreeClassifier, decides how each node is split. With 'best', the classifier evaluates the candidate thresholds of the features it considers and selects the split that best separates the classes under the chosen criterion (e.g., 'entropy'). With 'random', a random threshold is drawn for each candidate feature and the best of these random splits is used, which is faster but can give a larger or less accurate tree. The choice of splitter therefore affects both the model's performance and its training cost.
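
The two settings can be compared on the same data. The following is a sketch on synthetic data (the data-generating rule is an assumption made for illustration); it reports tree size and training accuracy for each splitter.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Synthetic data: two features, label depends on whether their sum exceeds 1.
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(200, 2))
y = (X[:, 0] + X[:, 1] > 1.0).astype(int)

results = {}
for splitter in ("best", "random"):
    clf = DecisionTreeClassifier(criterion="entropy", splitter=splitter, random_state=0)
    clf.fit(X, y)
    results[splitter] = {
        "depth": clf.get_depth(),
        "leaves": clf.get_n_leaves(),
        "train_accuracy": clf.score(X, y),
    }
    print(splitter, results[splitter])
```

Training accuracy alone does not distinguish the two settings, since both trees are grown until their leaves are pure; a held-out test split is needed to compare them fairly.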

Differentiating between positive and negative instances is crucial in the S_algo (Find-S) code because only positive instances shape the hypothesis. Each positive instance generalizes the hypothesis just enough to cover it, replacing every attribute value that does not match with '?'. Negative instances are ignored and leave the hypothesis unchanged. If the labels were not distinguished, negative instances would also generalize the hypothesis, and it would no longer describe the positive class.
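
The update rule can be packaged as a small function. This is a sketch that mirrors the report's conventions ("0" as the positive label and '0' as the initial hypothesis value); the function name `find_s` and the toy rows are hypothetical.

```python
def find_s(rows, positive_label="0"):
    """Return the maximally specific hypothesis covering the positive rows.

    Each row is a list of attribute values followed by the class label.
    """
    num_attributes = len(rows[0]) - 1
    hypothesis = ["0"] * num_attributes  # "0" means "no value accepted yet"
    for row in rows:
        if row[-1] != positive_label:
            continue  # negative instances are ignored
        for j in range(num_attributes):
            if hypothesis[j] == "0" or hypothesis[j] == row[j]:
                hypothesis[j] = row[j]
            else:
                hypothesis[j] = "?"  # generalize: any value is accepted
    return hypothesis

# Hypothetical toy data; the last column is the label and "0" marks a positive instance.
rows = [
    ["large", "round", "dark", "0"],
    ["large", "oval", "dark", "0"],
    ["small", "round", "light", "1"],
    ["large", "round", "dark", "0"],
]
hypothesis = find_s(rows)
print(hypothesis)
```

Only the second attribute differs among the positive rows, so it is the only one generalized to '?'.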

Entropy measures the impurity or disorder in a set of examples, and so indicates how well an attribute separates instances with respect to the target label. The ID3 algorithm calculates the entropy of the whole dataset and of the subsets generated by each value of an attribute. It then computes the information gain by subtracting the weighted entropies of those subsets from the entropy of the whole dataset. The attribute with the highest information gain is selected because it splits the data most effectively.
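
The calculation can be sketched without pandas. The rows and the attribute names `shape` and `size` below are hypothetical; only the target name `Diagnosis` comes from the report.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy in bits of a list of class labels."""
    total = len(labels)
    return sum((c / total) * math.log2(total / c) for c in Counter(labels).values())

def info_gain(rows, attr, target):
    """Entropy of the target minus the weighted entropy after splitting on attr."""
    gain = entropy([r[target] for r in rows])
    for value in {r[attr] for r in rows}:
        subset = [r[target] for r in rows if r[attr] == value]
        gain -= len(subset) / len(rows) * entropy(subset)
    return gain

# Hypothetical toy rows: "shape" separates the classes perfectly, "size" not at all.
rows = [
    {"shape": "round", "size": "big", "Diagnosis": 0},
    {"shape": "round", "size": "small", "Diagnosis": 0},
    {"shape": "oval", "size": "big", "Diagnosis": 1},
    {"shape": "oval", "size": "small", "Diagnosis": 1},
]
gains = {attr: info_gain(rows, attr, "Diagnosis") for attr in ("shape", "size")}
best = max(gains, key=gains.get)
print(gains, best)
```

ID3 would place the attribute stored in `best` at the root and then repeat the same calculation inside each branch.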
