Unit 3 Machine Learning
Autonomous
(Approved by AICTE, New Delhi, Accredited by NBA (CIV, ECE, MECH, CSE), NAAC with ‘A’ grade
& Permanently Affiliated to JNTU-GV Vizianagaram)
Dakamarri, Bheemunipatnam Mandal, Visakhapatnam Dist. – 531 162 (A.P.)
Ph: +91-8922-248001, 248002 Fax: + 91-8922-248011
e-mail: principal@[Link] website: [Link]
UNIT-3
Supervised Learning 2: Regularization, Logistic Regression, Squashing function, KNN, Support
Vector Machine.
Decision Tree Learning: Representing concepts as decision trees, entropy and recursive induction
of decision trees, picking the best splitting attribute: information gain, searching for simple
trees and computational complexity, Occam's razor, overfitting, noisy data, and pruning.
Decision Trees – ID3 – CART – Error bounds.
What is Regularization?
Sometimes a machine learning model performs well on the training data but poorly on the test
data. This means the model has fitted the noise in the training data and cannot predict the output
for unseen data; such a model is said to be overfitted. This problem can be dealt with using a
regularization technique.
Regularization keeps all the variables or features in the model but reduces the magnitude of their
coefficients. Hence, it maintains accuracy as well as the generalization of the model.
It mainly regularizes or shrinks the coefficients of the features toward zero. In simple words, "In
the regularization technique, we reduce the magnitude of the feature coefficients while keeping
the same number of features."
Regularization works by adding a penalty or complexity term to the cost function of the model.
Let's consider the linear regression equation:
$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \cdots + \beta_n x_n$$
In the above equation, y represents the value to be predicted, x1, x2, …, xn are the features,
β1, β2, …, βn are the weights attached to the features, and β0 is the intercept (bias) of the model.
Linear regression models optimize β0, β1, …, βn to minimize the cost function, so that the model
can predict the value of y accurately. The loss function for linear regression is called RSS, or
residual sum of squares. For M training samples with predictions ŷ_i it is:
$$\text{RSS} = \sum_{i=1}^{M} (y_i - \hat{y}_i)^2 = \sum_{i=1}^{M} \Big( y_i - \beta_0 - \sum_{j=1}^{n} \beta_j x_{ij} \Big)^2$$
Techniques of Regularization
There are mainly two types of regularization techniques, which are given below:
o Ridge Regression
o Lasso Regression
Ridge Regression
o Ridge regression is a type of linear regression in which a small amount of bias is
introduced so that we can get better long-term predictions.
o Ridge regression is a regularization technique, which is used to reduce the complexity of
the model. It is also called L2 regularization.
o In this technique, the cost function is altered by adding a penalty term to it. The amount
of bias added to the model is called the Ridge Regression penalty. It is calculated by
multiplying λ by the sum of the squared weights of the individual features.
o The equation for the cost function in ridge regression will be:
$$\text{Cost} = \sum_{i=1}^{M} (y_i - \hat{y}_i)^2 + \lambda \sum_{j=1}^{n} \beta_j^2$$
Here ŷ_i is the prediction for the i-th of M training samples and β_j is the weight of feature j.
o In the above equation, the penalty term regularizes the coefficients of the model, and
hence ridge regression reduces the magnitudes of the coefficients, which decreases the
complexity of the model.
o As we can see from the above equation, if λ tends to zero, the equation becomes the cost
function of the linear regression model. Hence, for very small values of λ, the model will
resemble the linear regression model.
o A general linear or polynomial regression will fail if there is high collinearity between the
independent variables, so to solve such problems, Ridge regression can be used.
o It helps to solve the problems if we have more parameters than samples.
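The effect of λ on the ridge cost can be checked with a small sketch. It assumes the simplest possible case, a single feature and no intercept, for which the minimiser has the closed form β = Σxy / (Σx² + λ). The function names and the toy data are illustrative and are not part of the original notes.

```python
def ridge_cost(xs, ys, beta, lam):
    """Residual sum of squares plus the L2 penalty lam * beta**2."""
    rss = sum((y - beta * x) ** 2 for x, y in zip(xs, ys))
    return rss + lam * beta ** 2


def ridge_fit_1d(xs, ys, lam):
    """Closed-form minimiser of ridge_cost for one feature and no intercept."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)


# Toy data lying exactly on y = 2x (an assumption for illustration)
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]

for lam in (0.0, 14.0, 100.0):
    beta = ridge_fit_1d(xs, ys, lam)
    print(lam, round(beta, 4), round(ridge_cost(xs, ys, beta, lam), 4))
```

With λ = 0 the fit is the ordinary least-squares slope; larger λ pulls the slope toward zero but never makes it exactly zero.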
Lasso Regression:
o Lasso regression is another regularization technique to reduce the complexity of the
model. It stands for Least Absolute Shrinkage and Selection Operator.
o It is similar to Ridge Regression except that the penalty term contains the absolute values
of the weights instead of the squares of the weights.
o Since it takes absolute values, it can shrink a slope to exactly 0, whereas Ridge
Regression can only shrink it close to 0.
o It is also called L1 regularization. The equation for the cost function of Lasso
regression will be:
$$\text{Cost} = \sum_{i=1}^{M} (y_i - \hat{y}_i)^2 + \lambda \sum_{j=1}^{n} |\beta_j|$$
o Features whose coefficients shrink to 0 are completely neglected in the model.
o Hence, Lasso regression can help us to reduce overfitting in the model as well as to
perform feature selection.
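The claim that Lasso can shrink a slope to exactly 0 while Ridge only shrinks it close to 0 can be illustrated with a minimal sketch. It again assumes one feature and no intercept, with the costs Σ(y − βx)² + λ|β| for Lasso and Σ(y − βx)² + λβ² for Ridge; under that assumption the Lasso minimiser is a soft-thresholding formula. The names and data below are illustrative only.

```python
def lasso_fit_1d(xs, ys, lam):
    """Minimise sum((y - beta*x)**2) + lam*|beta| via soft-thresholding."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    shrunk = max(abs(sxy) - lam / 2.0, 0.0)   # clipped at zero
    sign = 1.0 if sxy >= 0 else -1.0
    return sign * shrunk / sxx


def ridge_fit_1d(xs, ys, lam):
    """Minimise sum((y - beta*x)**2) + lam*beta**2 (for comparison)."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)


xs = [1.0, 2.0, 3.0]   # toy data on y = 2x, an assumption for illustration
ys = [2.0, 4.0, 6.0]

for lam in (0.0, 28.0, 56.0, 100.0):
    print(lam, lasso_fit_1d(xs, ys, lam), round(ridge_fit_1d(xs, ys, lam), 4))
```

Once λ is large enough the Lasso slope is exactly 0.0, which is why Lasso doubles as a feature selector, while the Ridge slope stays small but non-zero.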
S.No | Linear Regression | Logistic Regression
1 | Linear regression is used to predict a continuous dependent variable using a given set of independent variables. | Logistic regression is used to predict a categorical dependent variable using a given set of independent variables.
5 | The least squares estimation method is used to estimate the model parameters. | The maximum likelihood estimation method is used to estimate the model parameters.
6 | The output must be a continuous value, such as price, age, etc. | The output must be a categorical value, such as 0 or 1, Yes or No, etc.
8 | There may be collinearity between the independent variables. | There should not be collinearity between the independent variables.
Terminologies involved in Logistic Regression:
Here are some common terms involved in logistic regression:
• Independent variables: The input characteristics or predictor factors applied to the
dependent variable’s predictions.
• Dependent variable: The target variable in a logistic regression model, which we are trying
to predict.
• Logistic function: The formula used to represent how the independent and dependent
variables relate to one another. The logistic function transforms the input variables into a
probability value between 0 and 1, which represents the likelihood of the dependent variable
being 1 or 0.
• Odds: The ratio of something occurring to something not occurring. It is different from
probability, as probability is the ratio of something occurring to everything that could
possibly occur.
• Log-odds: The log-odds, also known as the logit function, is the natural logarithm of the
odds. In logistic regression, the log odds of the dependent variable are modeled as a linear
combination of the independent variables and the intercept.
• Coefficient: The logistic regression model's estimated parameters, which show how the
independent and dependent variables relate to one another.
• Intercept: A constant term in the logistic regression model, which represents the log odds
when all independent variables are equal to zero.
• Maximum likelihood estimation: The method used to estimate the coefficients of the
logistic regression model, which maximizes the likelihood of observing the data given the
model.
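The difference between probability, odds and log-odds in the list above can be made concrete with a few lines of Python. This is a minimal sketch; the function names are chosen for illustration.

```python
import math


def odds(p):
    """Ratio of the probability of an event to the probability of no event."""
    return p / (1.0 - p)


def log_odds(p):
    """Natural logarithm of the odds (the logit)."""
    return math.log(odds(p))


def probability_from_odds(o):
    """Inverse of odds(): recover the probability from the odds."""
    return o / (1.0 + o)


for p in (0.2, 0.5, 0.8):
    print(p, round(odds(p), 3), round(log_odds(p), 3))
```

A probability of 0.5 corresponds to odds of 1 and log-odds of 0; probabilities below 0.5 give negative log-odds and probabilities above 0.5 give positive log-odds.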
How does Logistic Regression work?
The logistic regression model transforms the continuous output of the linear regression function
into a categorical output using a sigmoid function, which maps any real-valued input into a value
between 0 and 1. This function is known as the logistic function.
Let the independent input features be X and the dependent variable be Y, which takes only binary
values, i.e., 0 or 1.
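As a rough sketch of this squashing step, the code below computes a linear score z = b + w·x, passes it through the sigmoid, and thresholds the resulting probability at 0.5. The weights and bias are hypothetical values picked for illustration; they are not fitted to any dataset.

```python
import math


def sigmoid(z):
    """Squash any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(weights, bias, x):
    """Probability that Y = 1 for the feature vector x."""
    z = bias + sum(w * xi for w, xi in zip(weights, x))
    return sigmoid(z)


def predict(weights, bias, x, threshold=0.5):
    """Turn the probability into a class label, 0 or 1."""
    return 1 if predict_proba(weights, bias, x) >= threshold else 0


weights, bias = [2.0, -1.0], 0.5      # hypothetical, not learned from data

print(predict_proba(weights, bias, [1.0, 0.5]), predict(weights, bias, [1.0, 0.5]))
print(predict_proba(weights, bias, [-1.0, 1.0]), predict(weights, bias, [-1.0, 1.0]))
```

In a real model the weights and bias would be chosen by maximum likelihood estimation rather than set by hand.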
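from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load the dataset and split it into training and test sets
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=23)

# LogisticRegression
clf = LogisticRegression(random_state=0)
clf.fit(X_train, y_train)

# Prediction
y_pred = clf.predict(X_test)
acc = accuracy_score(y_test, y_pred)
print("Logistic Regression model accuracy (in %):", acc * 100)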
Output:
Logistic Regression model accuracy (in %): 95.6140350877193
In Multinomial Logistic Regression, the output variable can have more than two possible
discrete outputs. Consider the Digit Dataset.
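from sklearn import datasets, linear_model, metrics
from sklearn.model_selection import train_test_split

# Load the digits dataset: features in data, class labels (0-9) in target
digits = datasets.load_digits()
X = digits.data
y = digits.target
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=1)

# Fit the model and predict on the test set
reg = linear_model.LogisticRegression()
reg.fit(X_train, y_train)
y_pred = reg.predict(X_test)
print("Logistic Regression model accuracy(in %):",
      metrics.accuracy_score(y_test, y_pred)*100)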
Output:
Logistic Regression model accuracy(in %): 96.52294853963839
Suppose there are two categories, Category A and Category B, and we have a new data point x1.
Which of these categories does this data point belong to? To solve this type of problem, we need
the K-NN algorithm. With the help of K-NN, we can easily identify the category or class of a
particular data point. Consider the below diagram:
The K-NN working can be explained on the basis of the below algorithm:
Suppose we have a new data point and we need to put it in the required category. Consider the
below image:
o Firstly, we will choose the number of neighbors, so we will choose the k=5.
o Next, we will calculate the Euclidean distance between the new data point and the existing
data points. The Euclidean distance between two points A(x1, y1) and B(x2, y2), which we
have already studied in geometry, can be calculated as:
$$d(A, B) = \sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2}$$
o By calculating the Euclidean distance we get the five nearest neighbors: three nearest
neighbors in category A and two nearest neighbors in category B. Consider the below
image:
o As the majority (3 of 5) of the nearest neighbors are from category A, this new data point
is assigned to category A.
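The steps above can be sketched in plain Python as follows. The training points are made up so that, for k=5, three of the nearest neighbors fall in category A and two in category B, mirroring the example; the function names are illustrative and this is not the scikit-learn implementation.

```python
import math
from collections import Counter


def euclidean(a, b):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


def knn_predict(train_points, train_labels, query, k=5):
    """Label the query point by majority vote among its k nearest neighbors."""
    by_distance = sorted(zip(train_points, train_labels),
                         key=lambda pair: euclidean(pair[0], query))
    nearest_labels = [label for _, label in by_distance[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


# Hypothetical training data for the two categories
points = [(2, 2), (1, 2), (2, 1), (0, 0), (3, 3), (3, 4), (6, 6), (7, 7)]
labels = ["A", "A", "A", "A", "B", "B", "B", "B"]

print(knn_predict(points, labels, (2.5, 2.5), k=5))   # 3 votes for A, 2 for B
```

Every prediction computes the distance to every training point, which is the source of the high computation cost of K-NN on large training sets.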
Below are some points to remember while selecting the value of K in the K-NN algorithm:
o There is no particular way to determine the best value for "K", so we need to try some
values to find the best among them. A commonly used value for K is 5.
o A very low value for K, such as K=1 or K=2, can be noisy and make the model sensitive to
outliers.
o Large values for K reduce the effect of noise, but they blur the boundary between the
classes and increase the computation.
Advantages of the K-NN Algorithm:
o It is simple to implement.
o It is robust to noisy training data.
o It can be more effective if the training data is large.
Disadvantages of the K-NN Algorithm:
o We always need to determine the value of K, which may be complex at times.
o The computation cost is high because the distance to every training sample must be
calculated.
To do the Python implementation of the K-NN algorithm, we will use the same problem and
dataset which we have used in Logistic Regression. But here we will improve the performance of
the model. Below is the problem description:
Problem for K-NN Algorithm: There is a car manufacturer company that has manufactured a
new SUV car. The company wants to show ads to the users who are interested in buying that
SUV. For this problem, we have a dataset that contains information about multiple users of a
social network. The dataset contains lots of information, but we will use Estimated Salary and
Age as the independent variables and Purchased as the dependent variable. Below is the
dataset:
The Data Pre-processing step will remain exactly the same as Logistic Regression. Below is the
code for it:
# importing libraries
import numpy as nm
import matplotlib.pyplot as mtp
import pandas as pd

# importing datasets
data_set = pd.read_csv('user_data.csv')

# Extracting independent and dependent variables
x = data_set.iloc[:, [2, 3]].values
y = data_set.iloc[:, 4].values

# Splitting the dataset into training and test set
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25, random_state=0)

# Feature scaling: fit on the training set only, then apply the same scaling to the test set
from sklearn.preprocessing import StandardScaler
st_x = StandardScaler()
x_train = st_x.fit_transform(x_train)
x_test = st_x.transform(x_test)
By executing the above code, our dataset is imported to our program and well pre-processed.
After feature scaling our test dataset will look like:
From the above output image, we can see that our data is successfully scaled.
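What the feature scaling step does can be sketched without any library: each column is shifted by the mean and divided by the (population) standard deviation computed from the training set, and the same two numbers are then reused for the test set. The rows below are hypothetical Age and Estimated Salary values, not taken from user_data.csv, and the function names are illustrative.

```python
import math


def fit_standardizer(rows):
    """Return per-column means and standard deviations of the training rows."""
    columns = list(zip(*rows))
    means = [sum(col) / len(col) for col in columns]
    stds = [math.sqrt(sum((v - m) ** 2 for v in col) / len(col))
            for col, m in zip(columns, means)]
    return means, stds


def transform(rows, means, stds):
    """Standardize rows using statistics computed elsewhere (the training set)."""
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]


train = [[20, 20000], [30, 40000], [40, 60000]]   # hypothetical rows
test = [[30, 40000], [50, 80000]]

means, stds = fit_standardizer(train)
print(transform(train, means, stds))
print(transform(test, means, stds))
```

After scaling, Age and Estimated Salary are on comparable scales, so the salary column no longer dominates distance-based methods such as K-NN.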
And then we will fit the classifier to the training data. Below is the code for it:
Output: By executing the above code, we will get the output as:
Out[10]:
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
metric_params=None, n_jobs=None, n_neighbors=5, p=2,
weights='uniform')
o Predicting the Test Result: To predict the test set result, we will create a y_pred vector
as we did in Logistic Regression. Below is the code for it:
Output:
In above code, we have imported the confusion_matrix function and called it using the variable
cm.
Output: By executing the above code, we will get the matrix as below:
In the above image, we can see there are 64+29= 93 correct predictions and 3+4= 7 incorrect
predictions, whereas, in Logistic Regression, there were 11 incorrect predictions. So we can say
that the performance of the model is improved by using the K-NN algorithm.
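The counting behind a confusion matrix can be sketched in a few lines. The label vectors below are made up for illustration. The last line reuses the counts quoted above (64 and 29 correct, 3 and 4 incorrect); which off-diagonal cell holds the 3 and which the 4 is an assumption, and the accuracy does not depend on it.

```python
def confusion_matrix(y_true, y_pred):
    """2x2 matrix for labels 0/1: rows are actual classes, columns are predicted."""
    matrix = [[0, 0], [0, 0]]
    for actual, predicted in zip(y_true, y_pred):
        matrix[actual][predicted] += 1
    return matrix


def accuracy(matrix):
    """Share of predictions on the diagonal (correct predictions)."""
    correct = matrix[0][0] + matrix[1][1]
    total = sum(sum(row) for row in matrix)
    return correct / total


y_true = [0, 0, 0, 1, 1, 1, 1, 0]   # hypothetical actual labels
y_pred = [0, 0, 1, 1, 1, 0, 1, 0]   # hypothetical predictions

cm = confusion_matrix(y_true, y_pred)
print(cm, accuracy(cm))
print(accuracy([[64, 4], [3, 29]]))  # counts from the text, cell placement assumed
```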
Output:
o As we can see the graph is showing the red point and green points. The green
points are for Purchased(1) and Red Points for not Purchased(0) variable.
o The graph is showing an irregular boundary instead of showing any straight line or
any curve because it is a K-NN algorithm, i.e., finding the nearest neighbor.
o The graph has classified users in the correct categories as most of the users who
didn't buy the SUV are in the red region and users who bought the SUV are in the
green region.
o The graph shows a good result, but there are still some green points in the red
region and red points in the green region. This is not a big issue, as tolerating a
few such errors keeps the model from overfitting.
o Hence our model is well trained.
o Visualizing the Test set result:
After the training of the model, we will now test the result by putting a new dataset, i.e.,
Test dataset. Code remains the same except some minor changes: such as x_train and
y_train will be replaced by x_test and y_test.
Below is the code for it:
The above graph shows the output for the test data set. As we can see in the graph, the
predicted output is good, as most of the red points are in the red region and most of the green
points are in the green region.
However, there are a few green points in the red region and a few red points in the green region.
These are the incorrect observations that we saw in the confusion matrix (7 incorrect
outputs).
Support Vector Machine Algorithm
Support Vector Machine or SVM is one of the most popular Supervised Learning algorithms,
which is used for Classification as well as Regression problems. However, primarily, it is used for
Classification problems in Machine Learning.
The goal of the SVM algorithm is to create the best line or decision boundary that can segregate
n-dimensional space into classes so that we can easily put the new data point in the correct
category in the future. This best decision boundary is called a hyperplane.
SVM chooses the extreme points/vectors that help in creating the hyperplane. These extreme
cases are called support vectors, and hence the algorithm is termed Support Vector Machine.
Consider the below diagram in which there are two different categories that are classified using a
decision boundary or hyperplane:
Example: SVM can be understood with the example that we have used in the KNN classifier.
Suppose we see a strange cat that also has some features of dogs, and we want a model that can
accurately identify whether it is a cat or a dog. Such a model can be created by using the SVM
algorithm. We will first train our model with lots of images of cats and dogs so that it can learn
the different features of cats and dogs, and then we test it with this strange creature. The SVM
creates a decision boundary between the two classes (cat and dog) using the extreme cases
(support vectors) of each class. On the basis of the support vectors, it will classify the creature
as a cat. Consider the below diagram:
Types of SVM
o Linear SVM: Linear SVM is used for linearly separable data, which means if a dataset
can be classified into two classes by using a single straight line, then such data is termed
linearly separable data, and the classifier used is called a Linear SVM classifier.
o Non-linear SVM: Non-linear SVM is used for non-linearly separable data, which means
if a dataset cannot be classified by using a straight line, then such data is termed
non-linear data, and the classifier used is called a Non-linear SVM classifier.
The dimension of the hyperplane depends on the number of features in the dataset: if there are
2 features (as shown in image), then the hyperplane will be a straight line, and if there are 3
features, then the hyperplane will be a 2-dimensional plane.
We always create the hyperplane that has the maximum margin, which means the maximum
distance between the hyperplane and the nearest data points of each class.
Support Vectors:
The data points or vectors that are the closest to the hyperplane and which affect the position of
the hyperplane are termed as Support Vector. Since these vectors support the hyperplane, hence
called a Support vector.
Linear SVM:
The working of the SVM algorithm can be understood by using an example. Suppose we have a
dataset that has two tags (green and blue), and the dataset has two features x1 and x2. We want a
classifier that can classify the pair(x1, x2) of coordinates in either green or blue. Consider the
below image:
Since this is a 2-d space, we can easily separate these two classes with a straight line. But
there can be multiple lines that can separate these classes. Consider the below image:
Hence, the SVM algorithm helps to find the best line or decision boundary; this best boundary or
region is called a hyperplane. The SVM algorithm finds the points of both classes that lie closest
to the boundary. These points are called support vectors. The distance between the support
vectors and the hyperplane is called the margin, and the goal of SVM is to maximize this margin.
The hyperplane with maximum margin is called the optimal hyperplane.
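To make the idea of the margin concrete, the sketch below compares two candidate lines that both separate a small made-up dataset and measures, for each, the distance from the line to its closest point. It only evaluates given candidates using the distance formula |w·x + b| / ||w||; it is not an SVM solver, and the points and lines are assumptions for illustration.

```python
import math


def distance_to_hyperplane(w, b, x):
    """Perpendicular distance from point x to the hyperplane w.x + b = 0."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return abs(score) / math.sqrt(sum(wi * wi for wi in w))


def margin(w, b, points):
    """Distance from the hyperplane to the closest point of the dataset."""
    return min(distance_to_hyperplane(w, b, p) for p in points)


green = [(1, 1), (2, 1), (1, 3)]    # hypothetical points of the first class
blue = [(4, 4), (5, 3), (4, 6)]     # hypothetical points of the second class

# Two lines that both separate the classes, written as (w, b)
candidates = {
    "x1 = 3": ((1.0, 0.0), -3.0),
    "x1 + x2 = 6": ((1.0, 1.0), -6.0),
}

margins = {name: margin(w, b, green + blue) for name, (w, b) in candidates.items()}
best = max(margins, key=margins.get)
print(margins, best)
```

Of the two candidates, the line with the larger margin is the better boundary in the SVM sense; a real SVM searches over all separating hyperplanes rather than a fixed list.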
Non-Linear SVM:
If data is linearly arranged, then we can separate it by using a straight line, but for non-linear data,
we cannot draw a single straight line. Consider the below image:
So to separate these data points, we need to add one more dimension. For linear data, we have
used two dimensions x and y, so for non-linear data, we will add a third dimension z. It can be
calculated as:
$$z = x^2 + y^2$$
By adding the third dimension, the sample space will become as below image:
So now, SVM will divide the datasets into classes in the following way. Consider the below
image:
Since we are in 3-d space, the boundary looks like a plane parallel to the x-y plane (a plane of
constant z). If we convert it to 2-d space with z=1, then it will become as:
Hence we get a circle of radius 1 in case of non-linear data.
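The effect of the extra dimension can be checked numerically. In this minimal sketch, with made-up points, no straight line in the x-y plane separates the inner points from the surrounding outer points, but after computing z = x² + y² a single threshold z = 1 does.

```python
def lift(x, y):
    """Third dimension added to each 2-d point."""
    return x * x + y * y


def classify(x, y, threshold=1.0):
    """Separate the classes with the plane z = threshold."""
    return "inner" if lift(x, y) < threshold else "outer"


inner = [(0.0, 0.5), (0.3, -0.4), (-0.5, 0.2)]    # hypothetical points near the origin
outer = [(1.5, 0.0), (-1.2, 1.1), (0.0, -2.0)]    # hypothetical points around them

for point in inner + outer:
    print(point, round(lift(*point), 2), classify(*point))
```

Back in two dimensions, the plane z = 1 corresponds to the circle x² + y² = 1.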
o Now we will implement the SVM algorithm using Python. Here we will use the same
dataset user_data, which we have used in Logistic regression and KNN classification.
Data Pre-processing step
Till the Data pre-processing step, the code will remain the same. Below is the code:
After executing the above code, we will pre-process the data. The code will give the dataset as:
The scaled output for the test set will be:
Now the training set will be fitted to the SVM classifier. To create the SVM classifier, we will
import the SVC class from the sklearn.svm library. Below is the code for it:
In the above code, we have used kernel='linear', as here we are creating SVM for linearly
separable data. However, we can change it for non-linear data. And then we fitted the classifier to
the training dataset(x_train, y_train)
Output:
Out[8]:
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
decision_function_shape='ovr', degree=3, gamma='auto_deprecated',
kernel='linear', max_iter=-1, probability=False, random_state=0,
shrinking=True, tol=0.001, verbose=False)
The model performance can be altered by changing the value of C(Regularization factor),
gamma, and kernel.
After getting the y_pred vector, we can compare the result of y_pred and y_test to check the
difference between the actual value and predicted value.
Output: Below is the output for the prediction of the test set:
Output:
As we can see in the above output image, there are 66+24= 90 correct predictions and 8+2= 10
incorrect predictions. Therefore we can say that our SVM model improved as compared to the
Logistic regression model.
As we can see, the above output is appearing similar to the Logistic regression output. In the
output, we got the straight line as hyperplane because we have used a linear kernel in the
classifier. And we have also discussed above that for the 2d space, the hyperplane in SVM is a
straight line.
Output:
There are various algorithms in Machine learning, so choosing the best algorithm for the given
dataset and problem is the main point to remember while creating a machine learning model.
Below are the two reasons for using the Decision tree:
o Decision Trees usually mimic human thinking ability while making a decision, so it is easy
to understand.
o The logic behind the decision tree can be easily understood because it shows a tree-like
structure.
Root Node: Root node is from where the decision tree starts. It represents the entire dataset,
which further gets divided into two or more homogeneous sets.
Leaf Node: Leaf nodes are the final output node, and the tree cannot be segregated further
after getting a leaf node.
Splitting: Splitting is the process of dividing the decision node/root node into sub-nodes
according to the given conditions.
Branch/Sub Tree: A tree formed by splitting the tree.
Pruning: Pruning is the process of removing the unwanted branches from the tree.
Parent/Child node: A node that is divided into sub-nodes is called the parent node of those
sub-nodes, and the sub-nodes are called its child nodes.
In a decision tree, for predicting the class of the given dataset, the algorithm starts from the root
node of the tree. This algorithm compares the values of root attribute with the record (real dataset)
attribute and, based on the comparison, follows the branch and jumps to the next node.
For the next node, the algorithm again compares the attribute value with the other sub-nodes and
move further. It continues the process until it reaches the leaf node of the tree. The complete
process can be better understood using the below algorithm:
o Step-1: Begin the tree with the root node, say S, which contains the complete dataset.
o Step-2: Find the best attribute in the dataset using an Attribute Selection Measure (ASM).
o Step-3: Divide S into subsets that contain the possible values of the best attribute.
o Step-4: Generate the decision tree node that contains the best attribute.
o Step-5: Recursively make new decision trees using the subsets of the dataset created in
Step-3. Continue this process until a stage is reached where the nodes cannot be classified
further; such a final node is called a leaf node.
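The five steps above can be sketched as a short recursive function in the style of ID3, using information gain as the attribute selection measure. This is a minimal illustration for categorical attributes only: the job-offer rows are invented, the names are hypothetical, and the sketch does no pruning and cannot classify attribute values it never saw during training.

```python
import math
from collections import Counter


def entropy(labels):
    """Entropy of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain(rows, labels, attr):
    """Entropy of the labels minus the weighted entropy after splitting on attr."""
    remainder = 0.0
    for value in set(row[attr] for row in rows):
        subset = [lab for row, lab in zip(rows, labels) if row[attr] == value]
        remainder += len(subset) / len(labels) * entropy(subset)
    return entropy(labels) - remainder


def build_tree(rows, labels, attrs):
    # Leaf node: all labels agree, or no attribute is left to split on
    if len(set(labels)) == 1 or not attrs:
        return Counter(labels).most_common(1)[0][0]
    best = max(attrs, key=lambda a: information_gain(rows, labels, a))   # Step 2
    node = {"attribute": best, "branches": {}}                            # Step 4
    for value in set(row[best] for row in rows):                          # Step 3
        keep = [i for i, row in enumerate(rows) if row[best] == value]
        node["branches"][value] = build_tree(                             # Step 5
            [rows[i] for i in keep], [labels[i] for i in keep],
            [a for a in attrs if a != best])
    return node


def classify(tree, row):
    """Follow the branches from the root until a leaf label is reached."""
    while isinstance(tree, dict):
        tree = tree["branches"][row[tree["attribute"]]]
    return tree


rows = [  # hypothetical job-offer records
    {"salary": "low", "distance": "near", "cab": "yes"},
    {"salary": "low", "distance": "far", "cab": "no"},
    {"salary": "high", "distance": "near", "cab": "no"},
    {"salary": "high", "distance": "near", "cab": "yes"},
    {"salary": "high", "distance": "far", "cab": "yes"},
    {"salary": "high", "distance": "far", "cab": "no"},
]
labels = ["decline", "decline", "accept", "accept", "accept", "decline"]

tree = build_tree(rows, labels, ["salary", "distance", "cab"])   # Step 1
print(tree)
print(classify(tree, {"salary": "high", "distance": "far", "cab": "yes"}))
```

On this toy data the root node tests the salary attribute, because it has the highest information gain.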
Example: Suppose there is a candidate who has a job offer and wants to decide whether he should
accept the offer or Not. So, to solve this problem, the decision tree starts with the root node
(Salary attribute by ASM). The root node splits further into the next decision node (distance from
the office) and one leaf node based on the corresponding labels. The next decision node further
gets split into one decision node (Cab facility) and one leaf node. Finally, the decision node splits
into two leaf nodes (Accepted offers and Declined offer). Consider the below diagram:
While implementing a Decision tree, the main issue arises that how to select the best attribute for
the root node and for sub-nodes. So, to solve such problems there is a technique which is called
as Attribute selection measure or ASM. By this measurement, we can easily select the best
attribute for the nodes of the tree. There are two popular techniques for ASM, which are:
o Information Gain
o Gini Index
1. Information Gain:
o Information gain is the measurement of changes in entropy after the segmentation of a
dataset based on an attribute.
o It calculates how much information a feature provides us about a class.
o According to the value of information gain, we split the node and build the decision tree.
o A decision tree algorithm always tries to maximize the value of information gain, and a
node/attribute having the highest information gain is split first. It can be calculated using
the below formula:
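Information Gain = Entropy(S) − [(Weighted Avg) × Entropy(each feature)]
Entropy: Entropy is a metric to measure the impurity in a given attribute. It specifies randomness
in data. For a two-class (yes/no) problem, entropy can be calculated as:
$$\text{Entropy}(S) = -P(\text{yes}) \log_2 P(\text{yes}) - P(\text{no}) \log_2 P(\text{no})$$
Where,
o S is the set of samples
o P(yes) is the proportion of samples in S labelled yes
o P(no) is the proportion of samples in S labelled no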
2. Gini Index:
o Gini index is a measure of impurity or purity used while creating a decision tree in the
CART (Classification and Regression Tree) algorithm.
o An attribute with a low Gini index should be preferred over one with a high Gini
index.
o The CART algorithm creates only binary splits, and it uses the Gini index to choose
them.
o Gini index can be calculated using the below formula:
$$\text{Gini Index} = 1 - \sum_{j} P_j^2$$
where P_j is the proportion of samples at the node that belong to class j.
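As a small worked example of the Gini index, the sketch below computes the impurity of a node as 1 − Σ P_j² and the weighted impurity of a binary split, which is the quantity a CART-style learner would try to make small. The function names and label lists are illustrative.

```python
from collections import Counter


def gini(labels):
    """Gini index of a node: 1 minus the sum of squared class proportions."""
    total = len(labels)
    return 1.0 - sum((n / total) ** 2 for n in Counter(labels).values())


def gini_of_split(left, right):
    """Weighted Gini index of the two child nodes of a binary split."""
    total = len(left) + len(right)
    return len(left) / total * gini(left) + len(right) / total * gini(right)


print(gini(["yes", "yes", "no", "no"]))                  # evenly mixed node
print(gini_of_split(["yes", "yes"], ["no", "no"]))       # split into pure children
print(gini_of_split(["yes", "no"], ["yes", "no"]))       # split that separates nothing
```

A pure node has a Gini index of 0, and an evenly mixed two-class node has a Gini index of 0.5, so the split with the lower weighted value is preferred.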
Pruning is a process of deleting the unnecessary nodes from a tree in order to get the optimal
decision tree.
A too-large tree increases the risk of overfitting, and a small tree may not capture all the important
features of the dataset. Therefore, a technique that decreases the size of the learning tree without
reducing accuracy is known as pruning. There are mainly two types of tree pruning techniques in
use.
Now we will implement the Decision tree using Python. For this, we will use the dataset
"user_data.csv," which we have used in previous classification models. By using the same
dataset, we can compare the Decision tree classifier with other classification models such
as KNN, SVM, Logistic Regression, etc.
The steps will also remain the same, and are given below:
# importing libraries
import numpy as nm
import matplotlib.pyplot as mtp
import pandas as pd

# importing datasets
data_set = pd.read_csv('user_data.csv')

# Extracting independent and dependent variables
x = data_set.iloc[:, [2, 3]].values
y = data_set.iloc[:, 4].values

# Splitting the dataset into training and test set
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25, random_state=0)

# Feature scaling: fit on the training set only, then apply the same scaling to the test set
from sklearn.preprocessing import StandardScaler
st_x = StandardScaler()
x_train = st_x.fit_transform(x_train)
x_test = st_x.transform(x_test)
In the above code, we have pre-processed the data. Where we have loaded the dataset, which is
given as:
2. Fitting a Decision-Tree algorithm to the Training set
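Now we will fit the model to the training set. For this, we will import
the DecisionTreeClassifier class from the sklearn.tree library. Below is the code for it:
In the above code, we have created a classifier object, in which we have passed two main
parameters:
o criterion='entropy': the criterion measures the quality of a split; here it is calculated from
the information gain given by entropy.
o random_state=0: fixes the seed used for generating the random states.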
Out[8]:
DecisionTreeClassifier(class_weight=None, criterion='entropy', max_depth=None,
max_features=None, max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, presort=False,
random_state=0, splitter='best')
3. Predicting the test result
Now we will predict the test set result. We will create a new prediction vector y_pred. Below is
the code for it:
Output:
In the below output image, the predicted output and real test output are given. We can clearly see
that there are some values in the prediction vector, which are different from the real vector values.
These are prediction errors.
In the above output, we have seen that there were some incorrect predictions, so if we want to
know the number of correct and incorrect predictions, we need to use the confusion matrix. Below
is the code for it:
Output:
In the above output image, we can see the confusion matrix, which has 6+3= 9 incorrect
predictions and 62+29= 91 correct predictions. Therefore, we can say that compared to other
classification models, the Decision Tree classifier made a good prediction.
Here we will visualize the training set result. To visualize the training set result we will plot a
graph for the decision tree classifier. The classifier will predict yes or No for the users who have
either Purchased or Not purchased the SUV car as we did in Logistic Regression. Below is the
code for it:
The above output is completely different from the other classification models. It has both
vertical and horizontal lines that split the dataset according to the age and estimated salary
variables.
As we can see, the tree is trying to capture every data point, which is a case of overfitting.
Visualization of test set result will be similar to the visualization of the training set except that the
training set will be replaced with the test set.
Output:
As we can see in the above image that there are some green data points within the purple region
and vice versa. So, these are the incorrect predictions which we have discussed in the confusion
matrix.
Entropy:
Entropy is the measure of the degree of randomness or uncertainty in the dataset. In the case of
classification, it measures the randomness based on the distribution of class labels in the
dataset.
The entropy for a subset S of the original dataset, having K classes, at the ith node can
be defined as:
$$H(S) = -\sum_{k=1}^{K} p(k) \log_2 p(k)$$
Where,
• S is the dataset sample.
• k is the particular class from K classes
• p(k) is the proportion of the data points that belong to class k to the total number of data
points in S.
The same quantity is often written with p_i, the proportion of elements in the set that belong
to the ith category.
Information Gain:
Information gain measures the reduction in entropy or variance that results from splitting a
dataset based on a specific property. It is used in decision tree algorithms to determine the
usefulness of a feature by partitioning the dataset into more homogeneous subsets with respect
to the class labels or target variable. The higher the information gain, the more valuable the
feature is in predicting the target variable.
The information gain of an attribute A, with respect to a dataset S, is calculated as follows:
$$\text{Gain}(S, A) = H(S) - \sum_{v \in \text{Values}(A)} \frac{|S_v|}{|S|} H(S_v)$$
where
• A is the specific attribute being tested
• H(S) is the entropy of dataset sample S
• |S_v| is the number of instances in the subset S_v of S that have the value v for attribute A
Information gain measures the reduction in entropy or variance achieved by partitioning the
dataset on attribute A. The attribute that maximizes information gain is chosen as the splitting
criterion for building the decision tree.
Information gain is used in both classification and regression decision trees. In classification,
entropy is used as the measure of impurity, while in regression, variance is used as the measure
of impurity. The information gain calculation remains the same in both cases, except that
variance takes the place of entropy in the formula for regression.
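A short numerical example may help with the entropy and information gain definitions. The sketch below works from class counts rather than raw records. The counts are an assumption chosen for illustration: a parent node with 9 samples of one class and 5 of the other, split by a three-valued attribute.

```python
import math


def entropy(class_counts):
    """Entropy of a node given the number of samples in each class."""
    total = sum(class_counts)
    return -sum(c / total * math.log2(c / total) for c in class_counts if c > 0)


def information_gain(parent_counts, child_counts_list):
    """Parent entropy minus the size-weighted entropy of the child nodes."""
    total = sum(parent_counts)
    weighted = sum(sum(child) / total * entropy(child) for child in child_counts_list)
    return entropy(parent_counts) - weighted


parent = [9, 5]                        # hypothetical class counts at the parent node
children = [[2, 3], [4, 0], [3, 2]]    # counts in the subset for each attribute value

print(round(entropy(parent), 3))
print(round(information_gain(parent, children), 3))
```

The attribute that yields the largest such gain among all candidates is chosen for the split.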
How does the Decision Tree algorithm Work?
The decision tree operates by analyzing the data set to predict its classification. It commences
from the tree’s root node, where the algorithm views the value of the root attribute compared to
the attribute of the record in the actual data set. Based on the comparison, it proceeds to follow
the branch and move to the next node.
The algorithm repeats this action for every subsequent node by comparing its attribute values
with those of the sub-nodes and continuing the process further. It repeats until it reaches the leaf
node of the tree. The complete mechanism can be better explained through the algorithm given
below.
• Step-1: Begin the tree with the root node, say S, which contains the complete dataset.
• Step-2: Find the best attribute in the dataset using an Attribute Selection Measure (ASM).
• Step-3: Divide S into subsets that contain the possible values of the best attribute.
• Step-4: Generate the decision tree node that contains the best attribute.
• Step-5: Recursively make new decision trees using the subsets of the dataset created in
Step-3. Continue this process until a stage is reached where the nodes cannot be classified
further; such a final node is called a leaf node.
Advantages of the Decision Tree:
1. It is simple to understand, as it follows the same process that a human follows while making
any decision in real life.
2. It can be very useful for solving decision-related problems.
3. It helps to think about all the possible outcomes for a problem.
4. It requires less data cleaning than other algorithms.
Disadvantages of the Decision Tree:
1. A decision tree can contain many layers, which makes it complex.
2. It may overfit the training data; this can be mitigated by using the Random Forest algorithm.
3. With more class labels, the computational complexity of the decision tree may increase.
What are appropriate problems for Decision tree learning?
Although a variety of decision tree learning methods have been developed with somewhat
differing capabilities and requirements, decision tree learning is generally best suited to
problems with the following characteristics:
1. Instances are represented by attribute-value pairs:
In decision tree learning, instances are commonly represented as attribute-value pairs. Each
instance is described by a fixed set of attributes, such as temperature, and their corresponding
values, such as hot. Ideally, each attribute takes one of a small, finite set of distinct values,
like hot, mild, or cold. This makes it easy to construct decision trees. However, more advanced
versions of the algorithm can accommodate attributes with continuous numerical values, such as
representing temperature with a numerical scale.
2. The target function has discrete output values:
The target function takes a discrete set of output values. Decision trees are most commonly
used to classify examples into Boolean categories, such as yes or no. Decision tree methods
extend easily to learning functions with more than two possible output values. A more
substantial extension allows learning target functions with numeric outputs, although the use
of decision trees in this setting is comparatively rare.
3. Disjunctive descriptions may be required:
Decision trees naturally represent disjunctive expressions.
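To see why, note that each root-to-leaf path is a conjunction of attribute tests, and the set of paths ending in a given class is a disjunction of those conjunctions. The sketch below lists these paths for a small hand-written tree; the tree, its attribute names, and the helper rules_for are hypothetical and serve only as an illustration.

```python
def rules_for(tree, label, path=()):
    """Return every root-to-leaf path (a conjunction of tests) ending in `label`."""
    if not isinstance(tree, dict):
        return [path] if tree == label else []
    rules = []
    for value, subtree in tree["branches"].items():
        rules += rules_for(subtree, label, path + ((tree["attribute"], value),))
    return rules

# Hypothetical hand-written tree, for illustration only.
tree = {"attribute": "outlook", "branches": {
    "sunny": {"attribute": "humidity", "branches": {"high": "no", "normal": "yes"}},
    "overcast": "yes",
    "rain": {"attribute": "wind", "branches": {"strong": "no", "weak": "yes"}},
}}

conjunctions = rules_for(tree, "yes")
print(" OR ".join(
    "(" + " AND ".join(f"{a} = {v}" for a, v in rule) + ")"
    for rule in conjunctions
))
```

The printed expression joins one parenthesised conjunction per "yes" leaf with OR, which is the disjunctive description the tree encodes.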
4. The training data may contain errors:
Decision tree learning methods are robust to errors, both errors in the classification of the
training examples and errors in the attribute values that describe these examples.
5. The training data may contain missing attribute values:
In some cases, the training data may have missing attribute values. Decision tree methods can
still be used even when some training examples have unknown values. For instance, if humidity
is one of the attributes, its value may be known for only some of the training examples.
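One simple way to handle such gaps, which the text above does not prescribe and is shown here only as an assumption, is to replace each missing value with the most common known value of that attribute before training. The helper fill_missing and the toy rows are hypothetical.

```python
from collections import Counter

def fill_missing(rows, attribute, missing=None):
    """Replace missing values of `attribute` with its most common known value."""
    known = [row[attribute] for row in rows if row[attribute] is not missing]
    most_common = Counter(known).most_common(1)[0][0]
    # Build new rows so the original data is left unchanged.
    return [
        {**row, attribute: most_common} if row[attribute] is missing else row
        for row in rows
    ]

# Hypothetical toy data: humidity is unknown for the last example.
rows = [
    {"humidity": "high", "play": "no"},
    {"humidity": "high", "play": "no"},
    {"humidity": "normal", "play": "yes"},
    {"humidity": None, "play": "yes"},
]
filled = fill_missing(rows, "humidity")
print(filled[3])
```

Other strategies exist, such as using the most common value among examples of the same class; which one works best depends on the data.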
Practical issues in learning decision trees include:
• Determining how deeply to grow the decision tree,
• Handling continuous attributes,
• Choosing an appropriate attribute selection measure,
• Handling training data with missing attribute values,
• Handling attributes with differing costs, and
• Improving computational efficiency.
To build the decision tree, the CART (Classification and Regression Tree) algorithm is used. It
works by selecting the best split at each node based on metrics such as Gini impurity or
information gain. Here are the basic steps of the CART algorithm:
1. The root node of the tree is taken to be the complete training dataset.
2. Determine the impurity of the data based on each feature present in the dataset. Impurity can
be measured using metrics like the Gini index or entropy for classification and mean squared
error, mean absolute error, friedman_mse, or half Poisson deviance for regression.
3. Select the feature that results in the highest information gain or impurity reduction
when splitting the data.
4. For each possible value of the selected feature, split the dataset into two subsets (left and
right), one where the feature takes on that value, and another where it does not. The split
should be designed to create subsets that are as pure as possible with respect to the target
variable.
5. Based on the target variable, determine the impurity of each resulting subset.
6. For each subset, repeat steps 2–5 recursively until a stopping condition is met. For example,
the stopping condition could be a maximum tree depth, a minimum number of samples
required to make a split, or a minimum impurity threshold.
7. Assign the majority class label for classification tasks or the mean value for regression tasks
for each terminal node (leaf node) in the tree.
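The CART steps above can be sketched for classification as follows. This is a simplified illustration under stated assumptions, not the algorithm as implemented in any particular library: features are numeric, candidate thresholds are midpoints between consecutive distinct values, impurity is the Gini index, and the names best_split and grow, the parameter defaults, and the toy rows are hypothetical.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    total = len(labels)
    return 1.0 - sum((c / total) ** 2 for c in Counter(labels).values())

def best_split(rows, features, target):
    """Return (weighted impurity, feature, threshold) of the best binary split."""
    best = None
    n = len(rows)
    for feature in features:
        values = sorted({row[feature] for row in rows})
        # Candidate thresholds: midpoints between consecutive distinct values.
        for low, high in zip(values, values[1:]):
            threshold = (low + high) / 2
            left = [r[target] for r in rows if r[feature] <= threshold]
            right = [r[target] for r in rows if r[feature] > threshold]
            score = len(left) / n * gini(left) + len(right) / n * gini(right)
            if best is None or score < best[0]:
                best = (score, feature, threshold)
    return best

def grow(rows, features, target, depth=0, max_depth=3, min_samples_split=2):
    labels = [r[target] for r in rows]
    split = best_split(rows, features, target)
    # Stopping conditions: pure node, depth limit, too few samples, no valid split.
    if (len(set(labels)) == 1 or depth >= max_depth
            or len(rows) < min_samples_split or split is None):
        return Counter(labels).most_common(1)[0][0]  # leaf: majority class
    _, feature, threshold = split
    left = [r for r in rows if r[feature] <= threshold]
    right = [r for r in rows if r[feature] > threshold]
    return {
        "feature": feature,
        "threshold": threshold,
        "left": grow(left, features, target, depth + 1, max_depth, min_samples_split),
        "right": grow(right, features, target, depth + 1, max_depth, min_samples_split),
    }

# Hypothetical toy data, invented for illustration only.
rows = [
    {"hours": 1, "result": "fail"},
    {"hours": 2, "result": "fail"},
    {"hours": 3, "result": "pass"},
    {"hours": 4, "result": "pass"},
]
tree = grow(rows, ["hours"], "result")
print(tree)
```

For a regression tree, the same structure would apply with variance or mean squared error in place of gini and the mean target value at each leaf.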
Classification and Regression Tree algorithm for Classification
Let the data available at node m be Qm, containing nm samples, and let tm be the threshold for
node m. Then the classification and regression tree algorithm for classification can be written as:
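The expression itself is not reproduced at this point. A commonly used form, consistent with the definitions that follow, is sketched here; the left and right superscripts, marking the two subsets of Qm produced by the threshold tm and their sample counts, are notation assumed for this sketch:

$$
G(Q_m, t_m) = \frac{n_m^{left}}{n_m}\, H\!\left(Q_m^{left}(t_m)\right) + \frac{n_m^{right}}{n_m}\, H\!\left(Q_m^{right}(t_m)\right)
$$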
Here,
• H is the measure of impurity of the left and right subsets at node m. It can be entropy or
Gini impurity.
• nm is the number of instances in the left and right subsets at node m.
To select the parameter, we can write: