
Unit 3 Machine Learning

The document discusses key concepts in supervised learning, focusing on regularization techniques like Ridge and Lasso regression to prevent overfitting in machine learning models. It also explains logistic regression, a classification algorithm used to predict probabilities for binary and categorical outcomes, along with its mathematical foundations and applications. Additionally, it provides Python code examples for implementing logistic regression using popular datasets.

Uploaded by

pandiakshaya123

RAGHU ENGINEERING COLLEGE

Autonomous
(Approved by AICTE, New Delhi, Accredited by NBA (CIV, ECE, MECH, CSE), NAAC with ‘A’ grade
& Permanently Affiliated to JNTU-GV Vizianagaram)
Dakamarri, Bheemunipatnam Mandal, Visakhapatnam Dist. – 531 162 (A.P.)
Ph: +91-8922-248001, 248002 Fax: + 91-8922-248011
e-mail: principal@[Link] website: [Link]

DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING


(Common to AI&ML, DATA SCIENCE)

UNIT-3
Supervised Learning 2: Regularization, Logistic Regression, Squashing function, KNN, Support
Vector Machine.
Decision Tree Learning – Representing concepts as decision trees,
entropy and recursive induction of decision trees, picking the best splitting attribute:
information gain, searching for simple trees and computational complexity, Occam's razor,
overfitting, noisy data, and pruning. Decision Trees – ID3 – CART – Error bounds.

Regularization in Machine Learning

What is Regularization?

Regularization is one of the most important concepts in machine learning. It is a technique to
prevent the model from overfitting by adding extra information to it.

Sometimes a machine learning model performs well on the training data but poorly on the test
data. This means the model cannot predict the output correctly for unseen data, because it has
fitted the noise in the training data; such a model is called overfitted. This problem can be
dealt with using a regularization technique.

Regularization lets us keep all the variables or features in the model while reducing the
magnitude of their coefficients. Hence, it maintains accuracy as well as the generalization of
the model.

It mainly regularizes, or shrinks, the coefficients of the features toward zero. In simple words,
"In the regularization technique, we reduce the magnitude of the feature coefficients while
keeping the same number of features."

How does Regularization Work?

Regularization works by adding a penalty or complexity term to the cost function of the model.
Let's consider the simple linear regression equation:

$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \cdots + \beta_n x_n + b$$

In the above equation, y represents the value to be predicted.

x1, x2, …, xn are the features for y.

β1, …, βn are the weights or magnitudes attached to the features, respectively. Here β0 represents
the bias of the model, and b represents the intercept; both are constant terms, so together they
act as a single offset.

Linear regression models try to optimize the coefficients β and b to minimize the cost function.
The loss function for linear regression is called the RSS, or residual sum of squares, where ŷi is
the model's prediction for the i-th of M training samples:

$$\text{RSS} = \sum_{i=1}^{M} (y_i - \hat{y}_i)^2$$

Regularization adds a penalty term to this loss, and the parameters are then optimized so that
the model can predict the value of y accurately.
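
As a small illustration of the residual sum of squares described above, the following sketch computes it with NumPy. The function name `rss` and the toy values are assumptions made for this example only; they do not come from the notes.

```python
import numpy as np

def rss(y_true, y_pred):
    """Residual sum of squares: the sum of squared prediction errors."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sum((y_true - y_pred) ** 2))

# Hypothetical toy values, chosen only for illustration.
y_true = [3.0, 5.0, 7.0]
y_pred = [2.5, 5.0, 8.0]

# Squared errors are 0.25, 0.0 and 1.0, so the total is 1.25.
total_error = rss(y_true, y_pred)
print(total_error)
```

A regularized model minimizes this quantity plus a penalty on the size of the coefficients, rather than this quantity alone.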

Techniques of Regularization

There are mainly two types of regularization techniques, which are given below:

o Ridge Regression
o Lasso Regression

Ridge Regression
o Ridge regression is a type of linear regression in which a small amount of bias is
introduced so that we can get better long-term predictions.
o Ridge regression is a regularization technique used to reduce the complexity of
the model. It is also called L2 regularization.
o In this technique, the cost function is altered by adding a penalty term to it. The amount
of bias added to the model is called the Ridge Regression penalty. We calculate it by
multiplying lambda (λ) by the sum of the squared weights of the individual features.
o The equation for the cost function in ridge regression, where ŷi is the prediction for the
i-th of M samples and βj are the weights of the n features, will be:

$$\sum_{i=1}^{M} (y_i - \hat{y}_i)^2 + \lambda \sum_{j=1}^{n} \beta_j^2$$

o In the above equation, the penalty term regularizes the coefficients of the model, and
hence ridge regression reduces the magnitudes of the coefficients, which decreases the
complexity of the model.
o As we can see from the above equation, if the value of λ tends to zero, the equation
becomes the cost function of the linear regression model. Hence, for very small
values of λ, the model will resemble the linear regression model.
o A general linear or polynomial regression can become unstable if there is high collinearity
between the independent variables; Ridge regression can be used to solve such problems.
o It also helps when we have more parameters than samples.
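
To make the effect of λ concrete, here is a minimal NumPy sketch of ridge regression. It assumes a model without an intercept and uses the standard closed-form solution β = (XᵀX + λI)⁻¹Xᵀy, which is not stated in these notes; the function name `ridge_fit` and the synthetic data are also assumptions for illustration.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge solution (no intercept): (X^T X + lam*I)^-1 X^T y."""
    n_features = X.shape[1]
    A = X.T @ X + lam * np.eye(n_features)
    return np.linalg.solve(A, X.T @ y)

# Synthetic data, generated only for illustration.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
true_beta = np.array([2.0, -1.0, 0.5])
y = X @ true_beta + 0.1 * rng.normal(size=50)

lambdas = [0.0, 1.0, 10.0, 100.0]
norms = []
for lam in lambdas:
    beta = ridge_fit(X, y, lam)
    norms.append(float(np.linalg.norm(beta)))
    # With lam = 0 this is ordinary least squares; larger lam shrinks beta.
    print(lam, np.round(beta, 3), round(norms[-1], 3))
```

The printed coefficient norms should get smaller as λ grows, while all three coefficients stay non-zero, which matches the description of ridge regression above.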
Lasso Regression:
o Lasso regression is another regularization technique to reduce the complexity of the
model. It stands for Least Absolute Shrinkage and Selection Operator.
o It is similar to Ridge Regression, except that the penalty term contains the absolute
values of the weights instead of their squares.
o Because it uses absolute values, it can shrink a slope exactly to 0, whereas Ridge
Regression can only shrink it close to 0.
o It is also called L1 regularization. The equation for the cost function of Lasso
regression, where ŷi is the prediction for the i-th of M samples and βj are the weights of
the n features, will be:

$$\sum_{i=1}^{M} (y_i - \hat{y}_i)^2 + \lambda \sum_{j=1}^{n} |\beta_j|$$

o Because some coefficients become exactly zero, the corresponding features are dropped
from the model entirely.
o Hence, Lasso regression helps to reduce overfitting in the model and also performs
feature selection.
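
The difference between shrinking to exactly zero and shrinking close to zero can be seen in a simplified special case. The sketch below assumes the features are orthonormal (XᵀX = I), in which case each coefficient can be solved separately: for the cost "RSS + λ Σ|βj|" the lasso solution is a soft-thresholding of the least-squares coefficient by λ/2, and for "RSS + λ Σβj²" the ridge solution divides it by (1 + λ). These formulas are standard results for that special case, not taken from the notes, and the coefficient values are hypothetical.

```python
import numpy as np

def lasso_orthonormal(b_ols, lam):
    """Lasso solution for orthonormal features: soft-threshold by lam / 2."""
    return np.sign(b_ols) * np.maximum(np.abs(b_ols) - lam / 2.0, 0.0)

def ridge_orthonormal(b_ols, lam):
    """Ridge solution for orthonormal features: uniform shrinkage by 1 / (1 + lam)."""
    return b_ols / (1.0 + lam)

# Hypothetical least-squares coefficients, chosen for illustration.
b_ols = np.array([3.0, -0.4, 0.2])
lam = 1.0

lasso_coef = lasso_orthonormal(b_ols, lam)
ridge_coef = ridge_orthonormal(b_ols, lam)

print("lasso:", lasso_coef)  # the two small coefficients are set to zero
print("ridge:", ridge_coef)  # every coefficient is halved, none is zero
```

In this example lasso keeps only the first feature, while ridge keeps all three with smaller weights.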

Key Difference between Ridge Regression and Lasso Regression


o Ridge regression is mostly used to reduce overfitting in the model, and it keeps all
the features present in the model. It reduces the complexity of the model by shrinking the
coefficients.
o Lasso regression helps to reduce overfitting in the model and also performs feature
selection, because it can shrink some coefficients exactly to zero.

Logistic Regression in Machine Learning


Logistic regression is a supervised machine learning algorithm mainly used for classification
tasks, where the goal is to predict the probability that an instance belongs to a given class.
It is a statistical algorithm that analyzes the relationship between a set of independent
variables and a binary dependent variable. It is a powerful tool for decision-making, for
example deciding whether an email is spam or not.
Logistic Regression
Although it is used for classification, it is called regression because it takes the output of
the linear regression function as input and uses a sigmoid function to estimate the probability
for the given class. The difference between linear regression and logistic regression is that
the linear regression output is a continuous value that can be anything, while logistic
regression predicts the probability that an instance belongs to a given class.
Logistic Regression:
It is used for predicting a categorical dependent variable using a given set of independent
variables.
• Logistic regression predicts the output of a categorical dependent variable. Therefore the
outcome must be a categorical or discrete value.
• It can be either Yes or No, 0 or 1, True or False, etc., but instead of giving the exact value
0 or 1, it gives probabilistic values which lie between 0 and 1.
• Logistic Regression is very similar to Linear Regression except in how it is used.
Linear Regression is used for solving regression problems, whereas Logistic Regression is
used for solving classification problems.
• In Logistic Regression, instead of fitting a regression line, we fit an “S” shaped logistic
function, whose output is bounded by the two extreme values 0 and 1.
• The curve from the logistic function indicates the likelihood of something, such as whether
cells are cancerous or not, or whether a mouse is obese or not based on its weight.
• Logistic Regression is a significant machine learning algorithm because it can provide
probabilities and classify new data using both continuous and discrete features.
• Logistic Regression can be used to classify observations using different types of data and
can easily determine the most effective variables for the classification.
Logistic Function (Sigmoid Function):
• The sigmoid function is a mathematical function used to map the predicted values to
probabilities.
• It maps any real value into another value within the range 0 to 1.
• The output of logistic regression must be between 0 and 1 and cannot go beyond this limit,
so it forms a curve like the “S” form.
• The S-form curve is called the Sigmoid function or the logistic function.
• In logistic regression, we use the concept of a threshold value, which decides whether a
probability is mapped to 0 or 1: values above the threshold are assigned to 1, and values
below the threshold are assigned to 0.
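
The mapping and thresholding described in these points can be sketched in a few lines of NumPy. The formula 1 / (1 + e^(-z)) is the standard definition of the sigmoid; the threshold of 0.5 and the sample inputs are assumptions chosen for illustration.

```python
import numpy as np

def sigmoid(z):
    """Map any real value to the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical raw scores (outputs of a linear function).
z = np.array([-5.0, 0.0, 5.0])
probs = sigmoid(z)

# Apply an assumed threshold of 0.5 to turn probabilities into class labels.
threshold = 0.5
labels = (probs >= threshold).astype(int)

print(np.round(probs, 4))  # small for negative z, exactly 0.5 at z = 0, large for positive z
print(labels)
```

Changing `threshold` shifts the point at which a probability is converted into class 1, without changing the probabilities themselves.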
Types of Logistic Regression:
On the basis of the categories, Logistic Regression can be classified into three types:
1. Binomial: In binomial Logistic Regression, there can be only two possible types of the
dependent variable, such as 0 or 1, Pass or Fail, etc.
2. Multinomial: In multinomial Logistic Regression, there can be 3 or more possible unordered
types of the dependent variable, such as “cat”, “dog”, or “sheep”.
3. Ordinal: In ordinal Logistic Regression, there can be 3 or more possible ordered types of
the dependent variable, such as “Low”, “Medium”, or “High”.
S.No | Linear Regression | Logistic Regression
1 | Used to predict a continuous dependent variable using a given set of independent variables. | Used to predict a categorical dependent variable using a given set of independent variables.
2 | Used for solving regression problems. | Used for solving classification problems.
3 | We predict the value of continuous variables. | We predict the values of categorical variables.
4 | We find the best-fit line. | We find the S-curve.
5 | The least squares method is used to estimate the coefficients. | The maximum likelihood method is used to estimate the coefficients.
6 | The output must be a continuous value, such as price, age, etc. | The output must be a categorical value, such as 0 or 1, Yes or No, etc.
7 | It requires a linear relationship between the dependent and independent variables. | It does not require a linear relationship between the dependent and independent variables.
8 | There may be collinearity between the independent variables. | There should not be collinearity between the independent variables.
Terminologies involved in Logistic Regression:
Here are some common terms involved in logistic regression:
• Independent variables: The input characteristics or predictor factors used to predict the
dependent variable.
• Dependent variable: The target variable in a logistic regression model, which we are trying
to predict.
• Logistic function: The formula used to represent how the independent and dependent
variables relate to one another. The logistic function transforms the input variables into a
probability value between 0 and 1, which represents the likelihood of the dependent variable
being 1 or 0.
• Odds: The ratio of something occurring to something not occurring. It is different from
probability, as probability is the ratio of something occurring to everything that could
possibly occur.
• Log-odds: The log-odds, also known as the logit function, is the natural logarithm of the
odds. In logistic regression, the log-odds of the dependent variable are modeled as a linear
combination of the independent variables and the intercept.
• Coefficient: The logistic regression model’s estimated parameters, which show how the
independent and dependent variables relate to one another.
• Intercept: A constant term in the logistic regression model, which represents the log-odds
when all independent variables are equal to zero.
• Maximum likelihood estimation: The method used to estimate the coefficients of the
logistic regression model, which maximizes the likelihood of observing the data given the
model.
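
The odds and log-odds terms above can be made concrete with a short numerical check. This is a minimal sketch; the function names and the probability 0.75 are hypothetical choices for illustration.

```python
import math

def odds(p):
    """Ratio of the event occurring to the event not occurring."""
    return p / (1.0 - p)

def log_odds(p):
    """Natural logarithm of the odds (the logit)."""
    return math.log(odds(p))

def prob_from_log_odds(z):
    """Invert the logit with the logistic (sigmoid) function."""
    return 1.0 / (1.0 + math.exp(-z))

p = 0.75                             # hypothetical probability of the event
event_odds = odds(p)                 # 0.75 / 0.25, i.e. three to one
z = log_odds(p)                      # natural log of 3
p_back = prob_from_log_odds(z)       # recovers the original probability

print(event_odds, z, p_back)
```

The round trip from probability to log-odds and back shows why modeling the log-odds as a linear function is equivalent to passing that linear function through the sigmoid.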
How does Logistic Regression work?
The logistic regression model transforms the continuous output of the linear regression function
into a probability using a sigmoid function, which maps any real-valued input into a value
between 0 and 1. This function is known as the logistic function. Applying a threshold to the
probability then gives the categorical output.
Let the independent input features be X = (x_1, x_2, …, x_m) and the dependent variable be Y,
which has only binary values, i.e. 0 or 1.

Then apply the multi-linear function to the input variables X:

$$z = \left( \sum_{i=1}^{m} w_i x_i \right) + b$$

Here x_i is the i-th input feature of an observation of X, w = [w_1, w_2, …, w_m] are the weights
or coefficients, and b is the bias term, also known as the intercept. More compactly, this can be
written as the dot product of the weights and the inputs, plus the bias:

$$z = w \cdot X + b$$

Everything discussed so far is the same as in linear regression.


Sigmoid Function
Now we pass z through the sigmoid function to obtain a probability between 0 and 1, i.e. the
predicted y:

$$\sigma(z) = \frac{1}{1 + e^{-z}}$$

[Figure: Sigmoid function]
The sigmoid function converts continuous input into a probability, i.e. a value between 0 and 1.
• σ(z) tends towards 1 as z → ∞
• σ(z) tends towards 0 as z → −∞
• σ(z) is always bounded between 0 and 1
where the probability of belonging to a class can be measured as:

$$P(y = 1) = \sigma(z), \qquad P(y = 0) = 1 - \sigma(z)$$

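Logistic Regression Equation

The odds are the ratio of something occurring to something not occurring. This is different from
probability, as probability is the ratio of something occurring to everything that could
possibly occur. With p(x) denoting the predicted probability that y = 1 and z = w · X + b, the
odds will be:

$$\frac{p(x)}{1 - p(x)} = e^{z}$$

Applying the natural log to the odds, the log-odds will be:

$$\log\left[ \frac{p(x)}{1 - p(x)} \right] = z = w \cdot X + b$$

Then the final logistic regression equation will be:

$$p(X; b, w) = \frac{e^{w \cdot X + b}}{1 + e^{w \cdot X + b}} = \frac{1}{1 + e^{-(w \cdot X + b)}}$$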

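Likelihood function for Logistic Regression

The predicted probabilities will be p(X;b,w) = p(x) for y = 1, and 1 − p(X;b,w) = 1 − p(x) for
y = 0. For n independent observations (x_i, y_i), the likelihood is:

$$L(b, w) = \prod_{i=1}^{n} p(x_i)^{y_i} \, \big(1 - p(x_i)\big)^{1 - y_i}$$

Taking natural logs on both sides:

$$\log L(b, w) = \sum_{i=1}^{n} \Big[ y_i \log p(x_i) + (1 - y_i) \log\big(1 - p(x_i)\big) \Big]$$

Gradient of the log-likelihood function

To find the maximum likelihood estimates, we differentiate w.r.t. w, where x_{ij} is the j-th
feature of the i-th observation:

$$\frac{\partial \log L(b, w)}{\partial w_j} = \sum_{i=1}^{n} \big( y_i - p(x_i; b, w) \big) \, x_{ij}$$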
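
Maximum likelihood estimation for logistic regression can be sketched with plain gradient ascent. The sketch relies on the standard result that the gradient of the log-likelihood with respect to the weights is the sum over observations of (y_i − p(x_i)) times the feature values. The learning rate, iteration count, function names, and synthetic data are all assumptions for illustration; library implementations typically use more sophisticated solvers.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_likelihood(X, y, w, b):
    """Sum of y*log(p) + (1-y)*log(1-p) over all observations."""
    p = sigmoid(X @ w + b)
    eps = 1e-12  # guards against log(0)
    return float(np.sum(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps)))

def fit_logistic(X, y, lr=0.1, n_iter=500):
    """Gradient ascent on the log-likelihood (averaged over samples)."""
    w = np.zeros(X.shape[1])
    b = 0.0
    n = len(y)
    for _ in range(n_iter):
        p = sigmoid(X @ w + b)
        grad_w = X.T @ (y - p)      # derivative w.r.t. each weight
        grad_b = np.sum(y - p)      # derivative w.r.t. the bias
        w += lr * grad_w / n
        b += lr * grad_b / n
    return w, b

# Synthetic one-feature data: the label is 1 when a noisy copy of x is positive.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 1))
y = (X[:, 0] + 0.3 * rng.normal(size=200) > 0).astype(float)

ll_before = log_likelihood(X, y, np.zeros(1), 0.0)
w, b = fit_logistic(X, y)
ll_after = log_likelihood(X, y, w, b)
print(ll_before, ll_after, w, b)
```

The log-likelihood after training should be higher than at the all-zero starting point, and the learned weight should be positive because larger x makes class 1 more likely in this data.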

Assumptions for Logistic Regression


The assumptions for Logistic Regression are as follows:
• Independent observations: Each observation is independent of the others, meaning that no
observation is derived from or correlated with another observation.
• Binary dependent variable: It assumes that the dependent variable is binary or
dichotomous, meaning it can take only two values. For more than two categories,
the softmax function is used.
• Linear relationship between independent variables and log-odds: The relationship
between the independent variables and the log-odds of the dependent variable should be
linear.
• No outliers: There should be no outliers in the dataset.
• Large sample size: The sample size should be sufficiently large.
Types of Logistic Regression
Based on the number of categories, Logistic regression can be classified as:
Binomial Logistic regression:
The target variable can have only 2 possible types, “0” or “1”, which may represent “win” vs
“loss”, “pass” vs “fail”, “dead” vs “alive”, etc. In this case, the sigmoid function is used, as
already discussed above.
# import the necessary libraries
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# load the breast cancer dataset
X, y = load_breast_cancer(return_X_y=True)

# split the train and test dataset
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=23)

# LogisticRegression
clf = LogisticRegression(random_state=0)
clf.fit(X_train, y_train)

# Prediction
y_pred = clf.predict(X_test)
acc = accuracy_score(y_test, y_pred)

print("Logistic Regression model accuracy (in %):", acc*100)

Output:
Logistic Regression model accuracy (in %): 95.6140350877193

Multinomial Logistic Regression


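The target variable can have 3 or more possible types which are not ordered (i.e. the types have
no quantitative significance), like “disease A” vs “disease B” vs “disease C”.
In this case, the softmax function is used in place of the sigmoid function. The softmax function
for K classes will be:

$$\text{softmax}(z_i) = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}}$$

Then the probability of class c, with a weight vector w_c and bias b_c per class, will be:

$$\Pr(Y = c \mid X = x) = \frac{e^{w_c \cdot x + b_c}}{\sum_{k=1}^{K} e^{w_k \cdot x + b_k}}$$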

In Multinomial Logistic Regression, the output variable can have more than two possible
discrete outputs. Consider the Digit Dataset.
from sklearn.model_selection import train_test_split
from sklearn import datasets, linear_model, metrics

# load the digit dataset
digits = datasets.load_digits()

# defining feature matrix (X) and response vector (y)
X = digits.data
y = digits.target

# splitting X and y into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=1)

# create logistic regression object
reg = linear_model.LogisticRegression()

# train the model using the training sets
reg.fit(X_train, y_train)

# making predictions on the testing set
y_pred = reg.predict(X_test)

# comparing actual response values (y_test)
# with predicted response values (y_pred)
print("Logistic Regression model accuracy(in %):",
      metrics.accuracy_score(y_test, y_pred)*100)

Output:
Logistic Regression model accuracy(in %): 96.52294853963839
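
The softmax function used by multinomial logistic regression can be sketched directly in NumPy. Subtracting the maximum score before exponentiating is a common numerical-stability trick that does not change the result; the class scores below are hypothetical.

```python
import numpy as np

def softmax(z):
    """Convert K real-valued scores into K probabilities that sum to 1."""
    z = np.asarray(z, dtype=float)
    shifted = z - np.max(z, axis=-1, keepdims=True)  # avoids overflow in exp
    exp_z = np.exp(shifted)
    return exp_z / np.sum(exp_z, axis=-1, keepdims=True)

# Hypothetical scores for three classes (one linear function per class).
scores = np.array([2.0, 1.0, 0.1])
class_probs = softmax(scores)
predicted_class = int(np.argmax(class_probs))

print(np.round(class_probs, 3), predicted_class)
```

The class with the highest score also gets the highest probability, and with only two classes softmax reduces to the sigmoid applied to the difference of the two scores.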

Ordinal Logistic Regression


It deals with target variables with ordered categories. For example, a test score can be
categorized as: “very poor”, “poor”, “good”, or “very good”. Here, each category can be given a
score like 0, 1, 2, or 3.
Applying steps in logistic regression modeling:
The following are the steps involved in logistic regression modeling:
• Define the problem: Identify the dependent variable and independent variables and
determine if the problem is a binary classification problem.
• Data preparation: Clean and preprocess the data, and make sure the data is suitable for
logistic regression modeling.
• Exploratory Data Analysis (EDA): Visualize the relationships between the dependent and
independent variables, and identify any outliers or anomalies in the data.
• Feature Selection: Choose the independent variables that have a significant relationship with
the dependent variable, and remove any redundant or irrelevant features.
• Model Building: Train the logistic regression model on the selected independent variables
and estimate the coefficients of the model.
• Model Evaluation: Evaluate the performance of the logistic regression model using
appropriate metrics such as accuracy, precision, recall, F1-score, or AUC-ROC.
• Model improvement: Based on the results of the evaluation, fine-tune the model by
adjusting the independent variables, adding new features, or using regularization techniques
to reduce overfitting.
• Model Deployment: Deploy the logistic regression model in a real-world scenario and make
predictions on new data.
Logistic Regression Model Thresholding
Logistic regression becomes a classification technique only when a decision threshold is
brought into the picture. The setting of the threshold value is a very important aspect of Logistic
regression and is dependent on the classification problem itself.
The decision for the value of the threshold value is majorly affected by the values of precision
and recall. Ideally, we want both precision and recall to be 1, but this seldom is the case.
In the case of a Precision-Recall tradeoff, we use the following arguments to decide upon the
threshold:
1. Low Precision/High Recall: In applications where we want to reduce the number of false
negatives without necessarily reducing the number of false positives, we choose a decision
value that has a low value of Precision or a high value of Recall. For example, in a cancer
diagnosis application, we do not want any affected patient to be classified as not affected,
even if that means some patients are wrongly diagnosed with cancer. This is because the
absence of cancer can be confirmed by further medical tests, but the presence of the disease
cannot be detected in a candidate who has already been rejected.
2. High Precision/Low Recall: In applications where we want to reduce the number of false
positives without necessarily reducing the number of false negatives, we choose a decision
value that has a high value of Precision or a low value of Recall. For example, if we are
classifying customers whether they will react positively or negatively to a personalized
advertisement, we want to be absolutely sure that the customer will react positively to the
advertisement because otherwise, a negative reaction can cause a loss of potential sales from
the customer.
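
The precision-recall tradeoff described above can be seen by applying several thresholds to the same predicted probabilities. This is a minimal sketch; the labels, probabilities, and function name are hypothetical and chosen only to make the arithmetic easy to follow.

```python
import numpy as np

def precision_recall(y_true, probs, threshold):
    """Precision and recall of the positive class at a given decision threshold."""
    y_pred = (probs >= threshold).astype(int)
    tp = int(np.sum((y_pred == 1) & (y_true == 1)))
    fp = int(np.sum((y_pred == 1) & (y_true == 0)))
    fn = int(np.sum((y_pred == 0) & (y_true == 1)))
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Hypothetical true labels and predicted probabilities of class 1.
y_true = np.array([0, 0, 1, 0, 1, 1, 0, 1])
probs = np.array([0.1, 0.3, 0.35, 0.4, 0.6, 0.7, 0.8, 0.9])

results = {t: precision_recall(y_true, probs, t) for t in (0.3, 0.5, 0.85)}
for t, (p, r) in results.items():
    print(f"threshold={t}: precision={p:.2f}, recall={r:.2f}")
```

With these values, the low threshold catches every positive case at the cost of more false positives, while the high threshold makes no false positive predictions but misses most positive cases.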
K-Nearest Neighbor(KNN) Algorithm for Machine Learning

o K-Nearest Neighbour is one of the simplest Machine Learning algorithms based on the
Supervised Learning technique.
o The K-NN algorithm assumes similarity between the new case/data and the available cases,
and puts the new case into the category that is most similar to the available categories.
o The K-NN algorithm stores all the available data and classifies a new data point based on
similarity. This means that when new data appears, it can be easily classified into a
well-suited category by using the K-NN algorithm.
o The K-NN algorithm can be used for Regression as well as for Classification, but mostly it
is used for Classification problems.
o K-NN is a non-parametric algorithm, which means it does not make any assumption about the
underlying data.
o It is also called a lazy learner algorithm because it does not learn from the training set
immediately; instead it stores the dataset and, at the time of classification, performs an
action on the dataset.
o At the training phase the KNN algorithm just stores the dataset, and when it gets new data,
it classifies that data into the category that is most similar to the new data.
o Example: Suppose we have an image of a creature that looks similar to both a cat and a dog,
and we want to know whether it is a cat or a dog. For this identification, we can use the KNN
algorithm, as it works on a similarity measure. Our KNN model will compare the features of
the new image with those of the cat and dog images and, based on the most similar
features, put it in either the cat or the dog category.

Why do we need a K-NN Algorithm?

Suppose there are two categories, i.e., Category A and Category B, and we have a new data point
x1, and we need to decide which of these categories it belongs to. To solve this type of problem,
we need a K-NN algorithm. With the help of K-NN, we can easily identify the category or class of
a particular data point. Consider the below diagram:

How does K-NN work?

The K-NN working can be explained on the basis of the below algorithm:

o Step-1: Select the number K of neighbors.
o Step-2: Calculate the Euclidean distance from the new data point to every training point.
o Step-3: Take the K nearest neighbors as per the calculated Euclidean distances.
o Step-4: Among these K neighbors, count the number of data points in each category.
o Step-5: Assign the new data point to the category with the largest number of
neighbors.
o Step-6: Our model is ready.

Suppose we have a new data point and we need to put it in the required category. Consider the
below image:

o Firstly, we will choose the number of neighbors, so we will choose k=5.
o Next, we will calculate the Euclidean distance between the data points. The Euclidean
distance is the straight-line distance between two points, which we have already studied in
geometry. For two points (x1, y1) and (x2, y2), it can be calculated as:

$$d = \sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2}$$

o By calculating the Euclidean distances we get the nearest neighbors: three nearest
neighbors in category A and two nearest neighbors in category B. Consider the below
image:
o As we can see, 3 of the 5 nearest neighbors are from category A, hence this new data point
is assigned to category A.
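
The steps above can be sketched from scratch in NumPy. The training points below are hypothetical and were chosen so that, as in the example, the five nearest neighbors of the new point contain three points from category A and two from category B; the function name `knn_predict` is also an assumption.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=5):
    """Classify x_new by majority vote among its k nearest training points."""
    # Step 2: Euclidean distance from the new point to every training point.
    distances = np.sqrt(np.sum((X_train - x_new) ** 2, axis=1))
    # Step 3: indices of the k smallest distances.
    nearest = np.argsort(distances)[:k]
    # Steps 4 and 5: count the labels among the neighbors and pick the majority.
    votes = Counter(str(y_train[i]) for i in nearest)
    return votes.most_common(1)[0][0], votes

# Hypothetical 2-D training data with two categories.
X_train = np.array([[1, 1], [1, 2], [2, 1],           # category A
                    [5, 5], [5, 6], [6, 5], [6, 6]],  # category B
                   dtype=float)
y_train = np.array(["A", "A", "A", "B", "B", "B", "B"])
x_new = np.array([3.0, 3.0])

label, votes = knn_predict(X_train, y_train, x_new, k=5)
print(label, dict(votes))
```

Note that there is no training step beyond storing `X_train` and `y_train`, which is why K-NN is described as a lazy learner.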

How to select the value of K in the K-NN Algorithm?

Below are some points to remember while selecting the value of K in the K-NN algorithm:

o There is no particular way to determine the best value for "K", so we need to try several
values and keep the one that performs best. A commonly used default value for K is 5.
o A very low value for K, such as K=1 or K=2, can be noisy and makes the model sensitive to
outliers.
o Large values for K smooth out the effect of noise, but they increase the computation and
can blur the boundary between categories.
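
Trying several values of K can be sketched as a simple search. The example below uses leave-one-out evaluation (each point is classified using all the other points) on synthetic two-class data; the helper names, the candidate values of K, and the data are assumptions for illustration, and other validation schemes would work as well.

```python
import numpy as np

def knn_predict(X_train, y_train, x_new, k):
    """Majority vote among the k nearest neighbors (integer class labels)."""
    distances = np.linalg.norm(X_train - x_new, axis=1)
    nearest = np.argsort(distances)[:k]
    return int(np.bincount(y_train[nearest]).argmax())

def loo_accuracy(X, y, k):
    """Leave-one-out accuracy: classify each point using all the others."""
    correct = 0
    for i in range(len(X)):
        mask = np.arange(len(X)) != i
        if knn_predict(X[mask], y[mask], X[i], k) == y[i]:
            correct += 1
    return correct / len(X)

# Synthetic data: two Gaussian clusters, generated only for illustration.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, size=(40, 2)),
               rng.normal(2.5, 1.0, size=(40, 2))])
y = np.array([0] * 40 + [1] * 40)

# Odd values of K avoid tied votes in a two-class problem.
scores = {k: loo_accuracy(X, y, k) for k in (1, 3, 5, 7, 9)}
best_k = max(scores, key=scores.get)
print(scores, best_k)
```

The best K depends on the dataset, so the value selected here should not be read as a general recommendation.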

Advantages of KNN Algorithm:

o It is simple to implement.
o It is robust to noisy training data, provided K is not too small.
o It can be more effective if the training data is large.

Disadvantages of KNN Algorithm:

o We always need to determine the value of K, which can sometimes be complex.
o The computation cost is high, because the distance from the new data point to every
training sample must be calculated.

Python implementation of the KNN algorithm

For the Python implementation of the K-NN algorithm, we will use the same problem and
dataset which we used in Logistic Regression, but here we will try to improve the performance
of the model. Below is the problem description:

Problem for K-NN Algorithm: A car manufacturer has produced a new SUV. The company wants to
show ads to the users who are interested in buying that SUV. For this problem, we have a
dataset that contains information about multiple users collected through a social network. The
dataset contains a lot of information, but we will use Estimated Salary and Age as the
independent variables and Purchased as the dependent variable. Below is the dataset:

Steps to implement the K-NN algorithm:

o Data Pre-processing step
o Fitting the K-NN algorithm to the Training set
o Predicting the test result
o Test accuracy of the result (Creation of Confusion matrix)
o Visualizing the test set result.

Data Pre-Processing Step:

The Data Pre-processing step will remain exactly the same as Logistic Regression. Below is the
code for it:

# importing libraries
import numpy as nm
import matplotlib.pyplot as mtp
import pandas as pd

# importing datasets
data_set = pd.read_csv('user_data.csv')

# extracting independent and dependent variables
x = data_set.iloc[:, [2, 3]].values
y = data_set.iloc[:, 4].values

# splitting the dataset into training and test set
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25, random_state=0)

# feature scaling
from sklearn.preprocessing import StandardScaler
st_x = StandardScaler()
x_train = st_x.fit_transform(x_train)
x_test = st_x.transform(x_test)

By executing the above code, our dataset is imported to our program and well pre-processed.
After feature scaling our test dataset will look like:

From the above output image, we can see that our data is successfully scaled.
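
Feature scaling matters for K-NN because distances would otherwise be dominated by the feature with the largest numeric range (salary, in this dataset). The sketch below reproduces by hand what standardization does: subtract the training mean and divide by the training standard deviation, then reuse those same statistics for the test data. The small arrays are hypothetical values, not rows from user_data.csv.

```python
import numpy as np

# Hypothetical (age, estimated salary) rows, for illustration only.
x_train_raw = np.array([[20.0, 20000.0],
                        [30.0, 40000.0],
                        [40.0, 60000.0]])
x_test_raw = np.array([[30.0, 50000.0]])

# Fit: compute per-column mean and (population) standard deviation on training data.
mean = x_train_raw.mean(axis=0)
std = x_train_raw.std(axis=0)

# Transform: apply the training statistics to both sets.
x_train_scaled = (x_train_raw - mean) / std
x_test_scaled = (x_test_raw - mean) / std

print(x_train_scaled)
print(x_test_scaled)
```

After scaling, both columns of the training data have mean 0 and standard deviation 1, so age and salary contribute on a comparable scale to the Euclidean distance.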

o Fitting K-NN classifier to the Training data:
Now we will fit the K-NN classifier to the training data. To do this we will import
the KNeighborsClassifier class from the sklearn.neighbors module. After importing the class,
we will create the classifier object. The parameters of this class will be:
o n_neighbors: The number of neighbors the algorithm uses. The default
is 5.
o metric='minkowski': This is the default metric, and it decides how the distance
between the points is measured.
o p=2: With the Minkowski metric, this is equivalent to the standard Euclidean metric.
And then we will fit the classifier to the training data. Below is the code for it:

# Fitting K-NN classifier to the training set
from sklearn.neighbors import KNeighborsClassifier
classifier = KNeighborsClassifier(n_neighbors=5, metric='minkowski', p=2)
classifier.fit(x_train, y_train)

Output: By executing the above code, we will get the output as:

Out[10]:
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
metric_params=None, n_jobs=None, n_neighbors=5, p=2,
weights='uniform')
o Predicting the Test Result: To predict the test set result, we will create a y_pred vector
as we did in Logistic Regression. Below is the code for it:

# Predicting the test set result
y_pred = classifier.predict(x_test)

Output:

The output for the above code will be:

o Creating the Confusion Matrix:
Now we will create the Confusion Matrix for our K-NN model to see the accuracy of the
classifier. Below is the code for it:

# Creating the Confusion matrix
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)

In the above code, we have imported the confusion_matrix function and stored its result in the
variable cm.

Output: By executing the above code, we will get the matrix as below:
In the above matrix, we can see there are 64+29 = 93 correct predictions and 3+4 = 7 incorrect
predictions, whereas in Logistic Regression there were 11 incorrect predictions. So we can say
that the performance of the model is improved by using the K-NN algorithm.
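
The accuracy implied by these counts can be checked with a short calculation. The sketch below rebuilds a 2x2 matrix from the numbers quoted above; the correct predictions (64 and 29) go on the diagonal, while the placement of the two off-diagonal counts (4 and 3) is an assumption, since the original output image is not available. The accuracy does not depend on that placement.

```python
import numpy as np

# Rows: actual class, columns: predicted class (off-diagonal order is assumed).
cm = np.array([[64, 4],
               [3, 29]])

correct = int(np.trace(cm))        # predictions on the diagonal are correct
total = int(cm.sum())
incorrect = total - correct
accuracy = correct / total

print(correct, incorrect, accuracy)
```

With 93 correct predictions out of 100 test samples, the accuracy is 93 percent, compared with 89 percent for the 11 errors reported for Logistic Regression on the same test set.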

o Visualizing the Training set result:
Now, we will visualize the training set result for the K-NN model. The code will remain the
same as we did in Logistic Regression, except the name of the graph. Below is the code for it:

# Visualizing the training set result
from matplotlib.colors import ListedColormap
x_set, y_set = x_train, y_train
x1, x2 = nm.meshgrid(nm.arange(start=x_set[:, 0].min() - 1, stop=x_set[:, 0].max() + 1, step=0.01),
                     nm.arange(start=x_set[:, 1].min() - 1, stop=x_set[:, 1].max() + 1, step=0.01))
mtp.contourf(x1, x2, classifier.predict(nm.array([x1.ravel(), x2.ravel()]).T).reshape(x1.shape),
             alpha=0.75, cmap=ListedColormap(('red', 'green')))
mtp.xlim(x1.min(), x1.max())
mtp.ylim(x2.min(), x2.max())
for i, j in enumerate(nm.unique(y_set)):
    mtp.scatter(x_set[y_set == j, 0], x_set[y_set == j, 1],
                color=ListedColormap(('red', 'green'))(i), label=j)
mtp.title('K-NN Algorithm (Training set)')
mtp.xlabel('Age')
mtp.ylabel('Estimated Salary')
mtp.legend()
mtp.show()

Output:

By executing the above code, we will get the below graph:


The output graph is different from the graph which we obtained in Logistic Regression. It
can be understood through the below points:

o The graph shows red points and green points. The green points are for the
Purchased(1) and the red points for the not Purchased(0) variable.
o The graph shows an irregular boundary instead of a straight line or a smooth
curve, because K-NN classifies each point by its nearest neighbors.
o The graph has classified most users in the correct categories, as most of the users who
didn't buy the SUV are in the red region and users who bought the SUV are in the
green region.
o The graph shows a good result, but there are still some green points in the red
region and red points in the green region. This is not a big issue, because not fitting
every training point exactly helps prevent the model from overfitting.
o Hence our model is well trained.
o Visualizing the Test set result:
After the training of the model, we will now test the result on a new dataset, i.e., the
Test dataset. The code remains the same except for some minor changes: x_train and
y_train will be replaced by x_test and y_test.
Below is the code for it:

# Visualizing the test set result
from matplotlib.colors import ListedColormap
x_set, y_set = x_test, y_test
x1, x2 = nm.meshgrid(nm.arange(start=x_set[:, 0].min() - 1, stop=x_set[:, 0].max() + 1, step=0.01),
                     nm.arange(start=x_set[:, 1].min() - 1, stop=x_set[:, 1].max() + 1, step=0.01))
mtp.contourf(x1, x2, classifier.predict(nm.array([x1.ravel(), x2.ravel()]).T).reshape(x1.shape),
             alpha=0.75, cmap=ListedColormap(('red', 'green')))
mtp.xlim(x1.min(), x1.max())
mtp.ylim(x2.min(), x2.max())
for i, j in enumerate(nm.unique(y_set)):
    mtp.scatter(x_set[y_set == j, 0], x_set[y_set == j, 1],
                color=ListedColormap(('red', 'green'))(i), label=j)
mtp.title('K-NN algorithm (Test set)')
mtp.xlabel('Age')
mtp.ylabel('Estimated Salary')
mtp.legend()
mtp.show()
Output:

The above graph shows the output for the test dataset. The predictions are good: most of the
red points are in the red region and most of the green points are in the green region.

However, a few green points lie in the red region and a few red points lie in the green region.
These are the incorrect predictions that we observed in the confusion matrix (7 incorrect
outputs).
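The irregular boundary in the K-NN plots comes from the way each point is labelled by a vote among its nearest neighbors. The following is a minimal pure-Python sketch of that vote, not the scikit-learn implementation used above; the function name knn_predict, the six data points and k=3 are illustrative assumptions.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Pair every training point's Euclidean distance to the query with its label
    distances = sorted(
        (math.dist(p, query), label) for p, label in zip(train_points, train_labels)
    )
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

# Made-up, already-scaled (age, salary) points: 0 = not purchased, 1 = purchased
train_points = [(-1.0, -1.0), (-0.8, -0.5), (-0.5, -1.2),
                (1.0, 1.0), (0.8, 1.2), (1.2, 0.6)]
train_labels = [0, 0, 0, 1, 1, 1]

print(knn_predict(train_points, train_labels, (-0.7, -0.9)))  # 0
print(knn_predict(train_points, train_labels, (0.9, 0.9)))    # 1
```

Because the predicted class can change whenever the set of nearest neighbors changes, the boundary follows the local arrangement of the training points instead of a single line.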
Support Vector Machine Algorithm

Support Vector Machine (SVM) is one of the most popular supervised learning algorithms. It is
used for both classification and regression problems, but primarily for classification
problems in machine learning.

The goal of the SVM algorithm is to create the best line or decision boundary that segregates
n-dimensional space into classes, so that new data points can easily be put in the correct
category in the future. This best decision boundary is called a hyperplane.

SVM chooses the extreme points/vectors that help in creating the hyperplane. These extreme
cases are called support vectors, and hence the algorithm is termed Support Vector Machine.
Consider the below diagram, in which two different categories are classified using a
decision boundary or hyperplane:
Example: SVM can be understood with the example that we used in the KNN classifier.
Suppose we see a strange cat that also has some features of dogs, and we want a model that can
accurately identify whether it is a cat or a dog. Such a model can be created using the SVM
algorithm. We first train the model with many images of cats and dogs so that it learns
their different features, and then we test it with this strange creature. The SVM creates a
decision boundary between the two classes (cat and dog) using the extreme cases (support
vectors) of each class. On the basis of the support vectors, it will classify the creature as
a cat. Consider the below diagram:

The SVM algorithm can be used for face detection, image classification, text categorization, etc.

Types of SVM

SVM can be of two types:

o Linear SVM: Linear SVM is used for linearly separable data. If a dataset can be classified
into two classes using a single straight line, the data is termed linearly separable, and
the classifier used is called a Linear SVM classifier.
o Non-linear SVM: Non-linear SVM is used for data that is not linearly separable. If a
dataset cannot be classified using a straight line, the data is termed non-linear, and the
classifier used is called a Non-linear SVM classifier.

Hyperplane and Support Vectors in the SVM algorithm:

Hyperplane: There can be multiple lines/decision boundaries that segregate the classes in
n-dimensional space, but we need to find the best decision boundary for classifying the
data points. This best boundary is known as the hyperplane of SVM.

The dimension of the hyperplane depends on the number of features in the dataset. If there are
2 features (as shown in the image), the hyperplane is a straight line; if there are 3 features,
the hyperplane is a 2-dimensional plane.
We always create the hyperplane that has the maximum margin, that is, the maximum distance
between the hyperplane and the nearest data points of each class.

Support Vectors:

The data points or vectors that are closest to the hyperplane and that affect its position
are termed support vectors. Since these vectors support the hyperplane, they are called
support vectors.

How does SVM work?

Linear SVM:

The working of the SVM algorithm can be understood using an example. Suppose we have a
dataset that has two tags (green and blue) and two features, x1 and x2. We want a
classifier that can classify the pair (x1, x2) of coordinates as either green or blue. Consider
the below image:

Since this is a 2-d space, we can easily separate the two classes with a straight line. But
there can be multiple lines that separate these classes. Consider the below image:

Hence, the SVM algorithm helps to find the best line or decision boundary; this best boundary
is called a hyperplane. The SVM algorithm finds the points from both classes that lie closest
to the boundary. These points are called support vectors. The distance between the support
vectors and the hyperplane is called the margin, and the goal of SVM is to maximize this margin.
The hyperplane with maximum margin is called the optimal hyperplane.
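The margin can be made concrete with a little arithmetic. The sketch below assumes a hypothetical separating line x1 + x2 - 6 = 0 and six made-up points (neither comes from the dataset in these notes). It computes each point's perpendicular distance to the line, takes the smallest distance as the margin, and reports the points at that distance as the support vectors.

```python
import math

# Hypothetical 2-D separating line: w[0]*x1 + w[1]*x2 + b = 0, i.e. x1 + x2 - 6 = 0
w = (1.0, 1.0)
b = -6.0

# Made-up training points: label -1 (green) and +1 (blue)
points = [
    ((1.0, 2.0), -1), ((2.0, 1.0), -1), ((2.0, 2.0), -1),
    ((4.0, 4.0), +1), ((5.0, 4.0), +1), ((4.0, 6.0), +1),
]

def signed_distance(x, w, b):
    """Signed perpendicular distance from point x to the line w.x + b = 0."""
    norm = math.hypot(w[0], w[1])
    return (w[0] * x[0] + w[1] * x[1] + b) / norm

# Distance of every point from the line, together with the point and its label
distances = [(abs(signed_distance(x, w, b)), x, label) for x, label in points]

# The margin is the distance to the closest point; those closest points are the support vectors
margin = min(d for d, _, _ in distances)
support_vectors = [x for d, x, _ in distances if math.isclose(d, margin)]

print(margin)           # about 1.414
print(support_vectors)  # [(2.0, 2.0), (4.0, 4.0)]
```

Here the line lies midway between (2, 2) and (4, 4), so both are at the same distance from it. An SVM searches over w and b for the line that makes this smallest distance as large as possible.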

Non-Linear SVM:

If data is linearly arranged, we can separate it using a straight line, but for non-linear data
we cannot draw a single straight line. Consider the below image:

To separate these data points, we need to add one more dimension. For linear data we used
two dimensions, x and y; for non-linear data we add a third dimension z, calculated as:

$z = x^2 + y^2$

By adding the third dimension, the sample space becomes as shown in the below image:
Now SVM divides the dataset into classes in the following way. Consider the below
image:

Since we are in 3-d space, the boundary looks like a plane parallel to the x-y plane. If we
convert it back to 2-d space with z = 1, it becomes:
Hence, for this non-linear data, we get a circle of radius 1 as the decision boundary.
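The effect of the extra dimension can be checked numerically. The sketch below is only an illustration under assumed data: it generates two hypothetical rings of points (radii 0.5 and 1.5, chosen arbitrarily), adds z as the sum of the squares of x and y, and then separates the classes with the flat plane z = 1.

```python
import math
import random

random.seed(0)

def ring(radius, n, label):
    """Return n (x, y, label) points lying on a circle of the given radius."""
    pts = []
    for _ in range(n):
        angle = random.uniform(0.0, 2.0 * math.pi)
        pts.append((radius * math.cos(angle), radius * math.sin(angle), label))
    return pts

# Inner ring (class 0) surrounded by an outer ring (class 1): not separable by a line in 2-D
data = ring(0.5, 50, 0) + ring(1.5, 50, 1)

# Add the third dimension z = x^2 + y^2
lifted = [(x, y, x * x + y * y, label) for x, y, label in data]

# In 3-D the flat plane z = 1 separates the two classes
predictions = [1 if z > 1.0 else 0 for _, _, z, _ in lifted]
accuracy = sum(p == label for p, (_, _, _, label) in zip(predictions, lifted)) / len(lifted)
print(accuracy)  # 1.0
```

The inner ring maps to z of about 0.25 and the outer ring to z of about 2.25, so the plane z = 1 lies between them. Projected back to 2-d, that plane is the circle of radius 1.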

Python Implementation of Support Vector Machine

o Now we will implement the SVM algorithm using Python. Here we will use the same
dataset user_data, which we have used in Logistic regression and KNN classification.
Data Pre-processing step

Till the Data pre-processing step, the code will remain the same. Below is the code:

# Data pre-processing step

# Importing libraries
import numpy as nm
import matplotlib.pyplot as mtp
import pandas as pd

# Importing the dataset
data_set = pd.read_csv('user_data.csv')

# Extracting the independent variables (columns 2 and 3) and the dependent variable (column 4)
x = data_set.iloc[:, [2, 3]].values
y = data_set.iloc[:, 4].values

# Splitting the dataset into training and test sets
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25, random_state=0)

# Feature scaling: fit the scaler on the training set only, then apply it to both sets
from sklearn.preprocessing import StandardScaler
st_x = StandardScaler()
x_train = st_x.fit_transform(x_train)
x_test = st_x.transform(x_test)

After executing the above code, we will pre-process the data. The code will give the dataset as:
The scaled output for the test set will be:

Fitting the SVM classifier to the training set:

Now the training set will be fitted to the SVM classifier. To create the SVM classifier, we
import the SVC class from the sklearn.svm library. Below is the code for it:

from sklearn.svm import SVC  # "Support vector classifier"
classifier = SVC(kernel='linear', random_state=0)
classifier.fit(x_train, y_train)

In the above code, we have used kernel='linear', as here we are creating an SVM for linearly
separable data. We can change the kernel for non-linear data. We then fitted the classifier
to the training dataset (x_train, y_train).

Output:

Out[8]:
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
decision_function_shape='ovr', degree=3, gamma='auto_deprecated',
kernel='linear', max_iter=-1, probability=False, random_state=0,
shrinking=True, tol=0.001, verbose=False)

The model performance can be altered by changing the value of C(Regularization factor),
gamma, and kernel.
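As a rough illustration of how the kernel choice changes the result, the sketch below fits SVC with a linear kernel and with an RBF kernel on synthetic ring-shaped data. The make_circles dataset and the values C=1.0 and gamma='scale' are assumptions for the example; they are not the user_data.csv settings used in these notes.

```python
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Synthetic data that is not linearly separable (an assumption; not user_data.csv)
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

scores = {}
for kernel in ("linear", "rbf"):
    # C controls regularization strength; gamma only affects non-linear kernels such as rbf
    model = SVC(kernel=kernel, C=1.0, gamma="scale", random_state=0)
    model.fit(X, y)
    scores[kernel] = model.score(X, y)  # accuracy on the training data

print(scores)
```

On data like this, a straight-line boundary cannot separate the inner ring from the outer ring, so the RBF kernel scores noticeably higher than the linear kernel.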

o Predicting the test set result:


Now, we will predict the output for test set. For this, we will create a new vector y_pred.
Below is the code for it:

# Predicting the test set result
y_pred = classifier.predict(x_test)

After getting the y_pred vector, we can compare the result of y_pred and y_test to check the
difference between the actual value and predicted value.

Output: Below is the output for the prediction of the test set:

o Creating the confusion matrix:


Now we will check the performance of the SVM classifier, that is, how many incorrect
predictions it makes compared to the Logistic Regression classifier. To create the
confusion matrix, we import the confusion_matrix function from the sklearn.metrics library.
After importing the function, we call it and store the result in a new variable cm. The
function takes two main parameters: y_true (the actual values) and y_pred (the predicted
values returned by the classifier). Below is the code for it:

# Creating the confusion matrix
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)

Output:

As we can see in the above output image, there are 66+24 = 90 correct predictions and 8+2 = 10
incorrect predictions. Therefore, we can say that our SVM model improved as compared to the
Logistic Regression model.
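The counts quoted above (66 and 24 on the diagonal, 8 and 2 off it) can be turned into an accuracy figure with a few lines of arithmetic. In this small sketch the placement of 8 and 2 within the off-diagonal cells is an assumption, since the text states only the totals.

```python
# Confusion matrix counts quoted in the text. Rows are actual classes, columns are
# predicted classes. Which off-diagonal cell holds 8 and which holds 2 is assumed.
cm = [[66, 2],
      [8, 24]]

correct = cm[0][0] + cm[1][1]      # diagonal: predictions that match the actual class
incorrect = cm[0][1] + cm[1][0]    # off-diagonal: predictions that do not match
accuracy = correct / (correct + incorrect)

print(correct, incorrect, accuracy)  # 90 10 0.9
```

The accuracy does not depend on the assumed placement, because only the sums of the diagonal and off-diagonal cells enter the calculation.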

o Visualizing the training set result:


Now we will visualize the training set result, below is the code for it:

from matplotlib.colors import ListedColormap
x_set, y_set = x_train, y_train
# Build a fine grid that covers the range of both scaled features
x1, x2 = nm.meshgrid(nm.arange(start=x_set[:, 0].min() - 1, stop=x_set[:, 0].max() + 1, step=0.01),
                     nm.arange(start=x_set[:, 1].min() - 1, stop=x_set[:, 1].max() + 1, step=0.01))
# Color every grid point by its predicted class to draw the decision regions
mtp.contourf(x1, x2, classifier.predict(nm.array([x1.ravel(), x2.ravel()]).T).reshape(x1.shape),
             alpha=0.75, cmap=ListedColormap(('red', 'green')))
mtp.xlim(x1.min(), x1.max())
mtp.ylim(x2.min(), x2.max())
# Overlay the actual observations, one color per class
for i, j in enumerate(nm.unique(y_set)):
    mtp.scatter(x_set[y_set == j, 0], x_set[y_set == j, 1],
                color=ListedColormap(('red', 'green'))(i), label=j)
mtp.title('SVM classifier (Training set)')
mtp.xlabel('Age')
mtp.ylabel('Estimated Salary')
mtp.legend()
mtp.show()
Output:

By executing the above code, we will get the output as:

As we can see, the above output looks similar to the Logistic Regression output. We got a
straight line as the hyperplane because we used a linear kernel in the classifier. As
discussed above, for 2-d space the hyperplane in SVM is a straight line.

o Visualizing the test set result:

# Visualizing the test set result
from matplotlib.colors import ListedColormap
x_set, y_set = x_test, y_test
# Build a fine grid that covers the range of both scaled features
x1, x2 = nm.meshgrid(nm.arange(start=x_set[:, 0].min() - 1, stop=x_set[:, 0].max() + 1, step=0.01),
                     nm.arange(start=x_set[:, 1].min() - 1, stop=x_set[:, 1].max() + 1, step=0.01))
# Color every grid point by its predicted class to draw the decision regions
mtp.contourf(x1, x2, classifier.predict(nm.array([x1.ravel(), x2.ravel()]).T).reshape(x1.shape),
             alpha=0.75, cmap=ListedColormap(('red', 'green')))
mtp.xlim(x1.min(), x1.max())
mtp.ylim(x2.min(), x2.max())
# Overlay the actual observations, one color per class
for i, j in enumerate(nm.unique(y_set)):
    mtp.scatter(x_set[y_set == j, 0], x_set[y_set == j, 1],
                color=ListedColormap(('red', 'green'))(i), label=j)
mtp.title('SVM classifier (Test set)')
mtp.xlabel('Age')
mtp.ylabel('Estimated Salary')
mtp.legend()
mtp.show()

Output:

By executing the above code, we will get the output as:


As we can see in the above output image, the SVM classifier has divided the users into two
regions (Purchased or Not purchased). Users who purchased the SUV are in the green region with
the green scatter points, and users who did not purchase the SUV are in the red region with
the red scatter points, matching the ('red', 'green') colormap applied to classes 0 and 1 in
the code. The hyperplane separates the Purchased and Not purchased classes.

Decision Tree Learning


o Decision Tree is a supervised learning technique that can be used for both classification
and regression problems, but it is mostly preferred for solving classification problems. It
is a tree-structured classifier, where internal nodes represent the features of a dataset,
branches represent the decision rules, and each leaf node represents the outcome.
o In a decision tree, there are two types of nodes: the Decision Node and the Leaf
Node. Decision nodes are used to make a decision and have multiple branches, whereas
leaf nodes are the output of those decisions and do not contain any further branches.
o The decisions or tests are performed on the basis of the features of the given dataset.
o It is a graphical representation for getting all the possible solutions to a
problem/decision based on given conditions.
o It is called a decision tree because, similar to a tree, it starts with the root node, which
expands on further branches and constructs a tree-like structure.
o In order to build a tree, we use the CART algorithm, which stands for Classification and
Regression Tree algorithm.
o A decision tree simply asks a question and, based on the answer (Yes/No), further splits
the tree into subtrees.
o Below diagram explains the general structure of a decision tree:
Why use Decision Trees?

There are various algorithms in machine learning, so choosing the best algorithm for the given
dataset and problem is the main point to remember while creating a machine learning model.
Below are two reasons for using the decision tree:

o Decision trees usually mimic human thinking while making a decision, so they are easy
to understand.
o The logic behind the decision tree can be easily followed because it is shown as a
tree-like structure.

Decision Tree Terminologies

o Root Node: The root node is where the decision tree starts. It represents the entire dataset,
which further gets divided into two or more homogeneous sets.
o Leaf Node: Leaf nodes are the final output nodes; the tree cannot be split further
after reaching a leaf node.
o Splitting: Splitting is the process of dividing the decision node/root node into sub-nodes
according to the given conditions.
o Branch/Sub Tree: A subtree formed by splitting the tree.
o Pruning: Pruning is the process of removing unwanted branches from the tree.
o Parent/Child node: A node that is split into sub-nodes is called the parent node of those
sub-nodes, and the sub-nodes are called its child nodes.

How does the Decision Tree algorithm Work?

In a decision tree, to predict the class of a given record, the algorithm starts from the root
node of the tree. It compares the value of the root attribute with the corresponding attribute
of the record (real dataset) and, based on the comparison, follows the branch and jumps to the
next node.

At the next node, the algorithm again compares the attribute value with the sub-nodes and
moves further. It continues the process until it reaches a leaf node of the tree. The complete
process can be better understood using the below algorithm:

o Step-1: Begin the tree with the root node, say S, which contains the complete dataset.
o Step-2: Find the best attribute in the dataset using an Attribute Selection Measure (ASM).
o Step-3: Divide S into subsets that contain the possible values of the best attribute.
o Step-4: Generate the decision tree node, which contains the best attribute.
o Step-5: Recursively make new decision trees using the subsets of the dataset created in
Step-3. Continue this process until a stage is reached where the nodes cannot be classified
further; such a final node is called a leaf node.
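The five steps above can be sketched as a short recursive function. This is a minimal illustration, not the scikit-learn implementation: it uses information gain as the attribute selection measure and makes one branch per attribute value (an ID3-style multiway split rather than CART's binary splits). The function names and the five-row job-offer dataset are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy (base 2) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def best_attribute(rows, labels, attributes):
    """Step 2: pick the attribute with the highest information gain."""
    def gain(attr):
        groups = {}
        for row, label in zip(rows, labels):
            groups.setdefault(row[attr], []).append(label)
        remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
        return entropy(labels) - remainder
    return max(attributes, key=gain)

def build_tree(rows, labels, attributes):
    """Steps 1-5: return a nested dict for a decision node, or a label for a leaf."""
    # Leaf: every record has the same label, or no attribute is left to split on
    if len(set(labels)) == 1 or not attributes:
        return Counter(labels).most_common(1)[0][0]
    best = best_attribute(rows, labels, attributes)
    node = {"attribute": best, "children": {}}               # Step 4
    remaining = [a for a in attributes if a != best]
    for value in {row[best] for row in rows}:                # Step 3
        pairs = [(r, l) for r, l in zip(rows, labels) if r[best] == value]
        sub_rows = [r for r, _ in pairs]
        sub_labels = [l for _, l in pairs]
        node["children"][value] = build_tree(sub_rows, sub_labels, remaining)  # Step 5
    return node

# Hypothetical job-offer records
rows = [
    {"salary": "low", "distance": "near"},
    {"salary": "low", "distance": "far"},
    {"salary": "high", "distance": "near"},
    {"salary": "high", "distance": "far"},
    {"salary": "low", "distance": "near"},
]
labels = ["Declined", "Declined", "Accepted", "Declined", "Declined"]

tree = build_tree(rows, labels, ["salary", "distance"])
print(tree)
```

On this toy data the salary attribute has the higher information gain, so it becomes the root; the "low" branch is a leaf, and the "high" branch is split again on distance.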

Example: Suppose a candidate has a job offer and wants to decide whether to accept it or not.
To solve this problem, the decision tree starts with the root node (the Salary attribute,
chosen by ASM). The root node splits into the next decision node (distance from the office)
and one leaf node, based on the corresponding labels. The next decision node splits into one
more decision node (Cab facility) and one leaf node. Finally, that decision node splits
into two leaf nodes (Accepted offer and Declined offer). Consider the below diagram:

Attribute Selection Measures

While implementing a decision tree, the main issue is how to select the best attribute for
the root node and for the sub-nodes. To solve this problem, there is a technique called the
Attribute Selection Measure, or ASM. With this measure, we can easily select the best
attribute for the nodes of the tree. There are two popular techniques for ASM, which are:

o Information Gain
o Gini Index

1. Information Gain:
o Information gain is the measurement of the change in entropy after the segmentation of a
dataset based on an attribute.
o It calculates how much information a feature provides us about a class.
o According to the value of information gain, we split the node and build the decision tree.
o A decision tree algorithm always tries to maximize the value of information gain, and the
node/attribute having the highest information gain is split first. It can be calculated using
the below formula:
Information Gain = Entropy(S) - [(Weighted Avg) * Entropy(each feature)]

Entropy: Entropy is a metric that measures the impurity in a given attribute. It specifies the
randomness in the data. Entropy can be calculated as:

$\text{Entropy}(S) = -P(\text{yes}) \log_2 P(\text{yes}) - P(\text{no}) \log_2 P(\text{no})$

Where,

o S = the complete set of samples
o P(yes) = probability of yes
o P(no) = probability of no
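The entropy and information gain formulas above can be sketched in a few lines of Python. The helper names entropy and information_gain and the ten-sample toy split are illustrative assumptions.

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy (base 2) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def information_gain(parent, subsets):
    """Entropy of the parent minus the size-weighted average entropy of the subsets."""
    total = len(parent)
    weighted = sum(len(s) / total * entropy(s) for s in subsets)
    return entropy(parent) - weighted

# Toy data: 5 "yes" and 5 "no", split by a hypothetical attribute into two groups
parent = ["yes"] * 5 + ["no"] * 5
left = ["yes"] * 4 + ["no"] * 1
right = ["yes"] * 1 + ["no"] * 4

print(entropy(parent))                          # 1.0
print(information_gain(parent, [left, right]))  # about 0.278
```

The parent is evenly mixed, so its entropy is 1. Each subset is purer (entropy about 0.722), and the drop of about 0.278 is the information gain of this split.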

2. Gini Index:
o Gini index is a measure of impurity or purity used while creating a decision tree in the
CART (Classification and Regression Tree) algorithm.
o An attribute with a low Gini index should be preferred over one with a high Gini
index.
o The CART algorithm uses the Gini index to create splits, and it creates only binary
splits.
o Gini index can be calculated using the below formula:

$\text{Gini Index} = 1 - \sum_j P_j^2$
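As a small illustration of the Gini formula, the sketch below computes the Gini index of a set of labels and the size-weighted Gini index of two candidate splits. The helper names and the toy splits are assumptions made for the example.

```python
from collections import Counter

def gini_index(labels):
    """Gini index: 1 minus the sum of squared class proportions."""
    total = len(labels)
    return 1.0 - sum((n / total) ** 2 for n in Counter(labels).values())

def weighted_gini(subsets):
    """Size-weighted average Gini index of the subsets produced by a split."""
    total = sum(len(s) for s in subsets)
    return sum(len(s) / total * gini_index(s) for s in subsets)

# Two hypothetical binary splits of the same 10 samples (5 "yes", 5 "no")
split_a = [["yes"] * 4 + ["no"] * 1, ["yes"] * 1 + ["no"] * 4]
split_b = [["yes"] * 3 + ["no"] * 2, ["yes"] * 2 + ["no"] * 3]

print(gini_index(["yes"] * 5 + ["no"] * 5))  # 0.5
print(weighted_gini(split_a))                # about 0.32
print(weighted_gini(split_b))                # about 0.48
```

Split A yields the lower weighted Gini index, so it would be preferred over split B.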

Pruning: Getting an Optimal Decision tree

Pruning is the process of deleting unnecessary nodes from a tree in order to get the optimal
decision tree.

A tree that is too large increases the risk of overfitting, while a small tree may not capture
all the important features of the dataset. A technique that decreases the size of the learned
tree without reducing accuracy is known as pruning. There are mainly two types of tree pruning
techniques used:

o Cost Complexity Pruning


o Reduced Error Pruning.
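Cost complexity pruning is available in scikit-learn through the ccp_alpha parameter of DecisionTreeClassifier. The sketch below is only an illustration under assumed conditions: it uses synthetic data from make_classification rather than user_data.csv, and it picks a mid-range alpha arbitrarily instead of tuning it by validation.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic, noisy classification data (an assumption; not the dataset from these notes)
X, y = make_classification(n_samples=300, n_features=5, flip_y=0.1, random_state=0)

# Fully grown tree, no pruning
full_tree = DecisionTreeClassifier(random_state=0).fit(X, y)

# Candidate alpha values at which successive subtrees would be pruned away
path = full_tree.cost_complexity_pruning_path(X, y)
alpha = max(path.ccp_alphas[len(path.ccp_alphas) // 2], 0.0)  # arbitrary mid-range choice

# Larger ccp_alpha removes more of the tree
pruned_tree = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha).fit(X, y)

print(full_tree.get_n_leaves(), pruned_tree.get_n_leaves())
```

In practice the alpha value would be chosen by comparing accuracy on held-out data for several candidates from the pruning path.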

Advantages of the Decision Tree

o It is simple to understand, as it follows the same process that a human follows while
making any decision in real life.
o It can be very useful for solving decision-related problems.
o It helps to think about all the possible outcomes for a problem.
o There is less requirement of data cleaning compared to other algorithms.

Disadvantages of the Decision Tree

o The decision tree contains lots of layers, which makes it complex.


o It may have an overfitting issue, which can be resolved using the Random Forest
algorithm.
o For more class labels, the computational complexity of the decision tree may increase.

Python Implementation of Decision Tree

Now we will implement the decision tree using Python. For this, we will use the dataset
"user_data.csv", which we have used in previous classification models. By using the same
dataset, we can compare the Decision Tree classifier with other classification models such
as KNN, SVM, Logistic Regression, etc.

Steps will also remain the same, which are given below:

o Data Pre-processing step


o Fitting a Decision-Tree algorithm to the Training set
o Predicting the test result
o Test accuracy of the result(Creation of Confusion matrix)
o Visualizing the test set result.

1. Data Pre-Processing Step:

Below is the code for the pre-processing step:

# Importing libraries
import numpy as nm
import matplotlib.pyplot as mtp
import pandas as pd

# Importing the dataset
data_set = pd.read_csv('user_data.csv')

# Extracting the independent variables (columns 2 and 3) and the dependent variable (column 4)
x = data_set.iloc[:, [2, 3]].values
y = data_set.iloc[:, 4].values

# Splitting the dataset into training and test sets
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25, random_state=0)

# Feature scaling: fit the scaler on the training set only, then apply it to both sets
from sklearn.preprocessing import StandardScaler
st_x = StandardScaler()
x_train = st_x.fit_transform(x_train)
x_test = st_x.transform(x_test)

In the above code, we have pre-processed the data after loading the dataset, which is
given as:
2. Fitting a Decision-Tree algorithm to the Training set

Now we will fit the model to the training set. For this, we import the
DecisionTreeClassifier class from the sklearn.tree library. Below is the code for it:

# Fitting the Decision Tree classifier to the training set
from sklearn.tree import DecisionTreeClassifier
classifier = DecisionTreeClassifier(criterion='entropy', random_state=0)
classifier.fit(x_train, y_train)

In the above code, we have created a classifier object, to which we have passed two main
parameters:

o criterion='entropy': The criterion is used to measure the quality of a split, which here
is calculated by the information gain given by entropy.
o random_state=0: Seeds the random number generator so that the results are reproducible.

Below is the output for this:

Out[8]:
DecisionTreeClassifier(class_weight=None, criterion='entropy', max_depth=None,
max_features=None, max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, presort=False,
random_state=0, splitter='best')
3. Predicting the test result

Now we will predict the test set result. We will create a new prediction vector y_pred. Below is
the code for it:

# Predicting the test set result
y_pred = classifier.predict(x_test)

Output:

In the below output image, the predicted output and real test output are given. We can clearly see
that there are some values in the prediction vector, which are different from the real vector values.
These are prediction errors.

4. Test accuracy of the result (Creation of Confusion matrix)

In the above output, we have seen that there were some incorrect predictions, so if we want to
know the number of correct and incorrect predictions, we need to use the confusion matrix. Below
is the code for it:

# Creating the confusion matrix
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)

Output:
In the above output image, we can see the confusion matrix, which has 6+3 = 9 incorrect
predictions and 62+29 = 91 correct predictions. Therefore, we can say that compared to other
classification models, the Decision Tree classifier made a good prediction.

5. Visualizing the training set result:

Here we will visualize the training set result. To visualize the training set result we will plot a
graph for the decision tree classifier. The classifier will predict yes or No for the users who have
either Purchased or Not purchased the SUV car as we did in Logistic Regression. Below is the
code for it:

# Visualizing the training set result
from matplotlib.colors import ListedColormap
x_set, y_set = x_train, y_train
# Build a fine grid that covers the range of both scaled features
x1, x2 = nm.meshgrid(nm.arange(start=x_set[:, 0].min() - 1, stop=x_set[:, 0].max() + 1, step=0.01),
                     nm.arange(start=x_set[:, 1].min() - 1, stop=x_set[:, 1].max() + 1, step=0.01))
# Color every grid point by its predicted class to draw the decision regions
mtp.contourf(x1, x2, classifier.predict(nm.array([x1.ravel(), x2.ravel()]).T).reshape(x1.shape),
             alpha=0.75, cmap=ListedColormap(('purple', 'green')))
mtp.xlim(x1.min(), x1.max())
mtp.ylim(x2.min(), x2.max())
# Overlay the actual observations, one color per class
for i, j in enumerate(nm.unique(y_set)):
    mtp.scatter(x_set[y_set == j, 0], x_set[y_set == j, 1],
                color=ListedColormap(('purple', 'green'))(i), label=j)
mtp.title('Decision Tree Algorithm (Training set)')
mtp.xlabel('Age')
mtp.ylabel('Estimated Salary')
mtp.legend()
mtp.show()
Output:

The above output is completely different from that of the other classification models. It has
both vertical and horizontal lines that split the dataset according to the age and estimated
salary variables.

As we can see, the tree is trying to capture every data point, which is a sign of overfitting.
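One way to check for overfitting numerically is to compare training and test accuracy for an unrestricted tree and a depth-limited tree. The sketch below is only an illustration: it uses noisy synthetic two-feature data from make_classification (an assumption, not user_data.csv), and max_depth=3 is an arbitrary choice.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Noisy synthetic data with two features (an assumption; not user_data.csv)
X, y = make_classification(n_samples=400, n_features=2, n_informative=2,
                           n_redundant=0, flip_y=0.15, random_state=0)
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

results = {}
for depth in (None, 3):  # None lets the tree grow until every leaf is pure
    tree = DecisionTreeClassifier(criterion='entropy', max_depth=depth, random_state=0)
    tree.fit(x_train, y_train)
    # Store (training accuracy, test accuracy) for this depth setting
    results[depth] = (tree.score(x_train, y_train), tree.score(x_test, y_test))

for depth, (train_acc, test_acc) in results.items():
    print(depth, round(train_acc, 3), round(test_acc, 3))
```

A large gap between the training and test accuracy of the unrestricted tree indicates that it has memorized the training points rather than learned a general rule.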

6. Visualizing the test set result:

Visualization of test set result will be similar to the visualization of the training set except that the
training set will be replaced with the test set.

# Visualizing the test set result
from matplotlib.colors import ListedColormap
x_set, y_set = x_test, y_test
# Build a fine grid that covers the range of both scaled features
x1, x2 = nm.meshgrid(nm.arange(start=x_set[:, 0].min() - 1, stop=x_set[:, 0].max() + 1, step=0.01),
                     nm.arange(start=x_set[:, 1].min() - 1, stop=x_set[:, 1].max() + 1, step=0.01))
# Color every grid point by its predicted class to draw the decision regions
mtp.contourf(x1, x2, classifier.predict(nm.array([x1.ravel(), x2.ravel()]).T).reshape(x1.shape),
             alpha=0.75, cmap=ListedColormap(('purple', 'green')))
mtp.xlim(x1.min(), x1.max())
mtp.ylim(x2.min(), x2.max())
# Overlay the actual observations, one color per class
for i, j in enumerate(nm.unique(y_set)):
    mtp.scatter(x_set[y_set == j, 0], x_set[y_set == j, 1],
                color=ListedColormap(('purple', 'green'))(i), label=j)
mtp.title('Decision Tree Algorithm (Test set)')
mtp.xlabel('Age')
mtp.ylabel('Estimated Salary')
mtp.legend()
mtp.show()

Output:
As we can see in the above image, there are some green data points within the purple region
and vice versa. These are the incorrect predictions that we discussed in the confusion
matrix.

Entropy:
Entropy is the measure of the degree of randomness or uncertainty in the dataset. In the case
of classification, it measures the randomness based on the distribution of class labels in the
dataset.
The entropy for a subset of the original dataset having K classes, at the ith node, can
be defined as:

$$H_i = -\sum_{k \in K} p(i,k) \log_2 p(i,k)$$

Where,
• S is the dataset sample.
• k is a particular class from the K classes.
• p(i,k) is the proportion of data points in the dataset sample S at the ith node that belong
to class k.
• Here p(i,k) should not be equal to zero.
Important points related to Entropy:
1. The entropy is 0 when the dataset is completely homogeneous, meaning that each instance
belongs to the same class. It is the lowest entropy indicating no uncertainty in the dataset
sample.
2. When the dataset is equally divided between multiple classes, the entropy is at its maximum
value. Therefore, entropy is highest when the distribution of class labels is even, indicating
maximum uncertainty in the dataset sample.
3. Entropy is used to evaluate the quality of a split. The goal of entropy is to select the attribute
that minimizes the entropy of the resulting subsets, by splitting the dataset into more
homogeneous subsets with respect to the class labels.
4. The highest information gain attribute is chosen as the splitting criterion (i.e., the reduction in
entropy after splitting on that attribute), and the process is repeated recursively to build the
decision tree.
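The first two points can be checked directly from the entropy definition. The following worked equations are a short illustration; the symbol H for entropy and the count K of classes are used here as assumptions of notation.

```latex
\begin{align*}
\text{All instances in one class } (p = 1): \quad
  & H = -1 \cdot \log_2 1 = 0 \\
K \text{ equally likely classes } (p_k = 1/K): \quad
  & H = -\sum_{k=1}^{K} \frac{1}{K} \log_2 \frac{1}{K} = \log_2 K \\
\text{Two classes split } 50/50: \quad
  & H = \log_2 2 = 1
\end{align*}
```

So a homogeneous node has zero entropy, and the maximum entropy grows with the number of equally represented classes.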
Gini Impurity or index:
Gini Impurity is a score that evaluates how accurate a split is among the classified groups.
The score lies between 0 and 1, where 0 means all observations belong to one class, and values
approaching 1 mean the elements are spread randomly across many classes (for two classes the
maximum is 0.5). We therefore want the Gini index score to be as low as possible. Gini Index
is the evaluation metric we shall use to evaluate our Decision Tree Model.

$$\text{Gini} = 1 - \sum_{i} p_i^2$$

Here,
• p_i is the proportion of elements in the set that belong to the ith category.
Information Gain:
Information gain measures the reduction in entropy or variance that results from splitting a
dataset on a specific attribute. It is used in decision tree algorithms to determine the
usefulness of a feature by partitioning the dataset into more homogeneous subsets with respect
to the class labels or target variable. The higher the information gain, the more valuable the
feature is in predicting the target variable.
The information gain of an attribute A, with respect to a dataset S, is calculated as follows:

$$\text{Gain}(S, A) = H(S) - \sum_{v \in \text{Values}(A)} \frac{|S_v|}{|S|} H(S_v)$$

where
• A is the specific attribute on which the dataset is split
• H(S) is the entropy of the dataset sample S
• |S_v| is the number of instances in S that have the value v for attribute A, and |S| is the
total number of instances in S
• H(S_v) is the entropy of the subset S_v
The attribute that maximizes information gain is chosen as the splitting criterion for
building the decision tree.
Information gain is used in both classification and regression decision trees. In
classification, entropy is used as the measure of impurity, while in regression, variance is
used as the measure of impurity. The information gain calculation remains the same in both
cases, except that variance replaces entropy in the formula for regression.
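For the regression case, the same calculation with variance in place of entropy can be sketched as follows. The helper names and the six target values are hypothetical and serve only to illustrate the arithmetic.

```python
def variance(values):
    """Population variance of a list of numeric target values."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

def variance_reduction(parent, subsets):
    """Variance of the parent minus the size-weighted average variance of the subsets."""
    total = len(parent)
    weighted = sum(len(s) / total * variance(s) for s in subsets)
    return variance(parent) - weighted

# Hypothetical regression targets that fall into two clear groups
parent = [10.0, 12.0, 11.0, 30.0, 32.0, 31.0]
left, right = parent[:3], parent[3:]

print(variance(parent))                          # about 100.67
print(variance_reduction(parent, [left, right])) # about 100.0
```

Splitting the targets into the low group and the high group removes almost all of the variance, so this split would score very highly.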
How does the Decision Tree algorithm Work?
The decision tree operates by analyzing the data set to predict its classification. It commences
from the tree’s root node, where the algorithm views the value of the root attribute compared to
the attribute of the record in the actual data set. Based on the comparison, it proceeds to follow
the branch and move to the next node.
The algorithm repeats this action for every subsequent node by comparing its attribute values
with those of the sub-nodes and continuing the process further. It repeats until it reaches the leaf
node of the tree. The complete mechanism can be better explained through the algorithm given
below.
• Step-1: Begin the tree with the root node, say S, which contains the complete dataset.
• Step-2: Find the best attribute in the dataset using an Attribute Selection Measure (ASM).
• Step-3: Divide S into subsets that contain the possible values of the best attribute.
• Step-4: Generate the decision tree node, which contains the best attribute.
• Step-5: Recursively make new decision trees using the subsets of the dataset created in
Step-3. Continue this process until a stage is reached where the nodes cannot be classified
further; such a final node is called a leaf node.
What are appropriate problems for Decision tree learning?
Although a variety of decision tree learning methods have been developed with somewhat
differing capabilities and requirements, decision tree learning is generally best suited to
problems with the following characteristics:
1. Instances are represented by attribute-value pairs:
In the world of decision tree learning, we commonly use attribute-value pairs to represent
instances. An instance is defined by a predetermined group of attributes, such as temperature,
and its corresponding value, such as hot. Ideally, we want each attribute to have a finite set of
distinct values, like hot, mild, or cold. This makes it easy to construct decision trees. However,
more advanced versions of the algorithm can accommodate attributes with continuous numerical
values, such as representing temperature with a numerical scale.
2. The target function has discrete output values:
The target function has distinct output values. The decision tree method is ordinarily used
to classify Boolean examples, such as yes or no. Decision tree approaches extend readily to
learning functions with more than two possible output values. A more substantial extension
allows learning target functions with numeric outputs, although the use of decision trees in
this setting is comparatively rare.
3. Disjunctive descriptions may be required:
Decision trees naturally represent disjunctive expressions.
4. The training data may contain errors:
Decision tree learning methods are robust to errors, both errors in the classification of the
training examples and errors in the attribute values that describe these examples.
5. The training data may contain missing attribute values:
In some cases, the training data may have missing attribute values. Decision tree methods can
still be used even when some training examples have unknown values. For instance, the humidity
of the day may be known for only some of the training examples.
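As an illustration of characteristics 1 to 3, the sketch below encodes instances as attribute-value pairs and writes a small tree as a disjunction of conjunctions. The weather attributes and the rule follow the well-known PlayTennis textbook example; they are assumptions for illustration, not data from these notes.

```python
# Each instance is a set of attribute-value pairs (hypothetical weather data).
day1 = {"outlook": "sunny", "humidity": "normal", "wind": "strong"}
day2 = {"outlook": "rain", "humidity": "high", "wind": "strong"}
day3 = {"outlook": "overcast", "humidity": "high", "wind": "weak"}


def play_tennis(instance):
    """Each root-to-leaf path ending in 'yes' is a conjunction of tests;
    the whole tree is the disjunction (OR) of those paths."""
    return (
        (instance["outlook"] == "sunny" and instance["humidity"] == "normal")
        or instance["outlook"] == "overcast"
        or (instance["outlook"] == "rain" and instance["wind"] == "weak")
    )


# Boolean target: True means "yes", False means "no".
predictions = [play_tennis(d) for d in (day1, day2, day3)]
print(predictions)  # [True, False, True]
```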
Practical issues in learning decision trees include:
• Determining how deeply to grow the decision tree,
• Handling continuous attributes,
• Choosing an appropriate attribute selection measure,
• Handling training data with missing attribute values,
• Handling attributes with differing costs, and
• Improving computational efficiency.
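One common way to handle a continuous attribute is to turn it into a Boolean test against a threshold. The sketch below, under the assumption that good thresholds lie midway between adjacent sorted values whose class labels differ, lists those candidates. The function name and the temperature data are hypothetical.

```python
def candidate_thresholds(values, labels):
    """Return midpoints between adjacent sorted values whose labels differ."""
    pairs = sorted(zip(values, labels))
    thresholds = []
    for (v1, y1), (v2, y2) in zip(pairs, pairs[1:]):
        if y1 != y2 and v1 != v2:
            thresholds.append((v1 + v2) / 2)
    return thresholds


# Hypothetical temperature readings and the matching class labels.
temperature = [40, 48, 60, 72, 80, 90]
play = ["no", "no", "yes", "yes", "yes", "no"]
thresholds = candidate_thresholds(temperature, play)
print(thresholds)  # [54.0, 85.0]
```

Each candidate can then be scored with the chosen attribute selection measure, exactly like a discrete attribute.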

To build the decision tree, the CART (Classification and Regression Tree) algorithm is used. It
works by selecting the best split at each node based on metrics such as Gini impurity or
information gain. Here are the basic steps of the CART algorithm:
1. The root node of the tree represents the complete training dataset.
2. Determine the impurity of the data based on each feature present in the dataset. Impurity can
be measured using metrics like the Gini index or entropy for classification and Mean squared
error, Mean Absolute Error, friedman_mse, or Half Poisson deviance for regression.
3. Select the feature that results in the highest information gain or impurity reduction
when splitting the data.
4. For each possible value of the selected feature, split the dataset into two subsets (left and
right), one where the feature takes on that value, and another where it does not. The split
should be designed to create subsets that are as pure as possible with respect to the target
variable.
5. Based on the target variable, determine the impurity of each resulting subset.
6. For each subset, repeat steps 2–5 iteratively until a stopping condition is met. For example,
the stopping condition could be a maximum tree depth, a minimum number of samples
required to make a split or a minimum impurity threshold.
7. Assign the majority class label for classification tasks or the mean value for regression tasks
for each terminal node (leaf node) in the tree.
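The steps above can be sketched in plain Python. This is a minimal illustration, not a reference implementation: it assumes numeric features split by a threshold (`<=` goes left) rather than the equality test described in step 4, uses Gini impurity, and stops on pure nodes or a maximum depth. All function names and the toy data are hypothetical.

```python
from collections import Counter


def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def best_split(rows, labels):
    """Try every (feature, threshold) pair; keep the lowest weighted impurity."""
    best = None  # (weighted_impurity, feature_index, threshold)
    for f in range(len(rows[0])):
        for t in sorted({r[f] for r in rows}):
            left = [y for r, y in zip(rows, labels) if r[f] <= t]
            right = [y for r, y in zip(rows, labels) if r[f] > t]
            if not left or not right:
                continue  # a split must produce two non-empty subsets
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if best is None or score < best[0]:
                best = (score, f, t)
    return best


def build_tree(rows, labels, depth=0, max_depth=3):
    """Recursively split until the node is pure or max_depth is reached."""
    split = best_split(rows, labels)
    if gini(labels) == 0.0 or depth >= max_depth or split is None:
        return Counter(labels).most_common(1)[0][0]  # leaf: majority class
    _, f, t = split
    left = [i for i, r in enumerate(rows) if r[f] <= t]
    right = [i for i, r in enumerate(rows) if r[f] > t]
    return {
        "feature": f,
        "threshold": t,
        "left": build_tree([rows[i] for i in left], [labels[i] for i in left],
                           depth + 1, max_depth),
        "right": build_tree([rows[i] for i in right], [labels[i] for i in right],
                            depth + 1, max_depth),
    }


def predict(tree, row):
    """Follow the splits from the root until a leaf label is reached."""
    while isinstance(tree, dict):
        branch = "left" if row[tree["feature"]] <= tree["threshold"] else "right"
        tree = tree[branch]
    return tree


# Hypothetical toy data: [height_cm, weight_kg] -> class label.
X = [[150, 50], [160, 55], [170, 80], [180, 85]]
y = ["A", "A", "B", "B"]
tree = build_tree(X, y)
print(predict(tree, [155, 52]), predict(tree, [175, 82]))
```

For a regression tree, the same structure would apply with an error measure such as mean squared error in place of `gini` and the mean target value at each leaf.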
Classification and Regression Tree algorithm for Classification
Let the data available at node m be Q_m, containing n_m samples, and let t_m be the threshold for
node m. The classification and regression tree algorithm for classification can then be written as:

$$G(Q_m, t_m) = \frac{n_m^{left}}{n_m} H\big(Q_m^{left}(t_m)\big) + \frac{n_m^{right}}{n_m} H\big(Q_m^{right}(t_m)\big)$$

Here,
• H is the impurity measure of the left and right subsets at node m. It can be entropy or
Gini impurity.
• n_m^{left} and n_m^{right} are the numbers of instances in the left and right subsets at node m.
To select the parameter, we can write:

$$t_m^{*} = \arg\min_{t_m} G(Q_m, t_m)$$
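As a worked illustration with made-up numbers (not taken from these notes), suppose node m holds n_m = 10 samples, 5 of each class. A candidate threshold sends 4 samples, all of one class, to the left and the remaining 6 samples, 1 of that class and 5 of the other, to the right. Taking H to be Gini impurity and scoring the split as the sample-weighted average of the two child impurities gives:

$$
\begin{aligned}
H(Q_m) &= 1 - (0.5^2 + 0.5^2) = 0.5 \\
H(Q_m^{left}) &= 1 - (1^2 + 0^2) = 0 \\
H(Q_m^{right}) &= 1 - \left(\left(\tfrac{1}{6}\right)^2 + \left(\tfrac{5}{6}\right)^2\right) = \tfrac{10}{36} \approx 0.278 \\
G &= \tfrac{4}{10}\cdot 0 + \tfrac{6}{10}\cdot \tfrac{10}{36} = \tfrac{1}{6} \approx 0.167
\end{aligned}
$$

The split lowers the impurity from 0.5 to about 0.167, a reduction of about 0.333. The threshold chosen for the node is the candidate with the smallest weighted impurity.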
Computational Complexity of Machine Learning Models - II


The previous discussion, Computational Complexity of Machine Learning Models - I, introduced
computational complexity, its different types, some examples, and a cheat sheet.
This discussion looks at the computational complexities of different ML models.
Assumptions:
n = number of training examples, m = number of features, n' = number of support vectors,
k = number of neighbors, k' = number of trees, c = number of classes
• Linear Regression
o Train Time Complexity = O(n*m^2 + m^3)
o Test Time Complexity = O(m)
o Space Complexity = O(m)
• Logistic Regression
o Train Time Complexity = O(n*m)
o Test Time Complexity = O(m)
o Space Complexity = O(m)
• K Nearest Neighbors
o Train Time Complexity = O(k*n*m)
o Test Time Complexity = O(n*m)
o Space Complexity = O(n*m)
• SVM
o Train Time Complexity = O(n^2)
o Test Time Complexity = O(n'*m)
o Space Complexity = O(n*m)
• Decision Tree
o Train Time Complexity = O(n*log(n)*m)
o Test Time Complexity = O(m)
o Space Complexity = O(depth of tree)
• Random Forest
o Train Time Complexity = O(k'*n*log(n)*m)
o Test Time Complexity = O(m*k')
o Space Complexity = O(k'*depth of tree)
• Naive Bayes
o Train Time Complexity = O(n*m)
o Test Time Complexity = O(m)
o Run-time Complexity = O(c*m)
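To make one entry concrete, the sketch below counts the per-feature operations in a brute-force K Nearest Neighbors prediction, which is where the O(n*m) test-time figure comes from. It is an illustrative sketch with hypothetical names and data; it assumes squared Euclidean distance and no index structure such as a k-d tree.

```python
from collections import Counter


def knn_predict(train_X, train_y, query, k):
    """Brute-force k-NN: compare the query against every stored example."""
    ops = 0  # counts per-feature distance computations
    distances = []
    for x, label in zip(train_X, train_y):   # n training examples
        d = 0.0
        for a, b in zip(x, query):           # m features per example
            d += (a - b) ** 2
            ops += 1
        distances.append((d, label))
    nearest = sorted(distances)[:k]          # k closest examples
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0], ops


# Hypothetical data: n = 6 examples, m = 2 features.
train_X = [[1, 1], [1, 2], [2, 1], [8, 8], [9, 8], [8, 9]]
train_y = ["a", "a", "a", "b", "b", "b"]
label, ops = knn_predict(train_X, train_y, [2, 2], k=3)
print(label, ops)  # a 12
```

The counter equals n*m for every query. The sort used here to pick the k nearest examples adds its own cost on top of the distance computations.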
