Supervised Learning
Regression and
Classification
Algorithms
Module 2
Syllabus
• Regression: Introduction to Regression-
Regression Models- Linear Regression,
Polynomial Regression-. Decision Trees:
Introduction to Decision Trees-Tree
Construction, Splitting Criteria, and Pruning-
Handling Missing Values and Categorical
Features- Gini Index-ID3-CART.
Regression
• Regression is a statistical technique used in
machine learning to model the relationship
between a dependent (target) variable and one
or more independent (predictor) variables.
• It allows us to predict or explain the
dependent variable based on the information
from the independent variables.
• The general goal of regression analysis is to
identify the strength and nature of the
relationships between the variables.
Linear Regression
• Linear regression, as the name implies, is commonly used to
estimate the linear relationship between independent
variables - (x₁, x₂, …, xₙ) and dependent variables- (y).
• You would use linear regression when your dependent variable
is a continuous variable (value ranging between [-∞,+∞]).
• For example, predicting prices of houses, cars and stocks.
Linear Regression
• Formula:
• The equation for a simple linear regression is:
• Example:
• Let's say you want to predict the salary of an
employee based on their years of experience.
You collect data from different employees and
plot the relationship between the years of
experience (x) and salary (y).
• Linear regression will help you find the best-
fitting line that describes this relationship.
Univariate linear regression
• Univariate linear regression focuses on determining relationship
between one independent (explanatory variable) variable
and one dependent variable.
• For instance, estimating a person’s weight (y) given his/her
height (x).
• Independent variable: Input feature. (x₁, x₂, …, xₙ)
• Dependent variable: The variable you are trying to predict.
Sometimes also known as target variable. (y)
Example
• Here, we are interested in finding the relationship between a
person’s height and his/her weight.
• Intuitively, we would know that these two variables are positively
correlated.
• In other words, taller people tends to be heavier than those that
are shorter.
• For this example, height is our independent variable and weight
is our dependent variable. To calculate the relationship between
the two variables, we need some data.
• If we have one input variable, we need to draw
a line in the XY plane; if we have two input
variables, we need to construct a plane in the
three-dimensional coordinate system. More than
two input variables are extremely tough to
imagine.
Example
• In the graph above, the y-axis is
the price (K$) of some houses, and
the x-axis is the house's size
(feet²).
• We want to draw a straight line on this graph
(such as the image below) so that when we know
the size of a new home not in this dataset, we
can predict its price by using that line.
• For example, we can predict that the price of a
3000 feet² house is approximately 500 K$, as
seen in the graph below.
• It is also obvious that we can fit a straight linear line to describe
the relationship between these two variables.
where m is the slope and c is the y-
intercept of the line.
It is common to see these two parameters
to be denoted as c=θ₀ and m=θ₁.
• In linear regression, we use a training set to come up with an algorithm
that creates a function “h” that maps x to y.
• The function that we are trying to develop looks like this:
• For a univariate linear regression task, the hypothesis is of the form:
• A hypothesis is a potential model for a machine learning system
Regression Model
Representation
• The algorithm finds the values for 𝜃₀ and 𝜃₁ that best fit the
inputs and outputs given to the algorithm. This is
called univariate linear regression because the 𝜃 parameters
only go up to 1.
• The function that multivariate linear regression produces looks
like this:
Numerical on Univariate (Simple) Linear
Regression
Problem
A company records the following data:
Advertising Spend (x, in $1000) Sales (y, in $1000)
1 2
2 2.8
3 3.6
4 4.5
5 5.1
Fit a line: y=β0+β1x
Data
We model exam score y from hours studied x.
Observation Hours studied x Score y(%)
1 2 50
2 3 60
3 5 65
4 7 80
5 9 95
Problem: Predict daily ice cream sales y(units sold)
from daily temperature x (degrees Celsius).
Day Temperature x (°C) Sales y (units)
1 10 100
2 12 120
3 14 130
4 16 150
5 18 160
6 20 180
Ridge Regression vs Lasso Regression
• Ridge and Lasso Regression are two popular techniques in machine
learning used for regularizing linear models to avoid overfitting and
improve predictive performance. Both methods add a penalty term to
the model’s cost function to constrain the coefficients, but they differ
in how they apply this penalty.
• Ridge Regression, also known as L2 regularization, adds the squared
magnitude of the coefficients as a penalty.
• On the other hand, Lasso Regression, or L1 regularization, introduces
a penalty based on the absolute value of the coefficients.
Characteristic Ridge Regression Lasso Regression
Applies L2 regularization, adding a penalty Applies L1 regularization, adding a
Regularization
term proportional to the square of the penalty term proportional to the absolute
Type
coefficients value of the coefficients.
Does not perform feature selection. All Performs automatic feature selection. Less
Feature predictors are retained, although their important predictors are completely
Selection coefficients are reduced in size to excluded by setting their coefficients to
minimize overfitting zero.
Best suited for situations where all Ideal when you suspect that only a subset
predictors are potentially relevant, and of predictors is important, and the model
When to use
the goal is to reduce overfitting rather should focus on those while ignoring the
than eliminate features irrelevant ones.
Produces a model that is simpler,
Produces a model that includes all
retaining only the most significant
Output model features, but their coefficients are
features and ignoring the rest by setting
smaller in magnitude to prevent overfitting
their coefficients to zero.
Reduces the magnitude of coefficients, Shrinks some coefficients to exactly zero,
Impact on shrinking them towards zero, but does not effectively removing their influence from
Prediction set any coefficients exactly to zero. All the model. This leads to a simpler model
predictors remain in the model with fewer features
Generally faster as it doesn’t involve May be slower due to the feature selection
Computation
feature selection process
Use when you have many predictors, all Use when you believe only some predictors
contributing to the outcome (e.g., are truly important (e.g., genetic studies
Example Use Case
predicting house prices where all features where only a few genes out of thousands
like size, location, etc., matter) are relevant).
When to Use Ridge Regression?
Ridge Regression is most suitable when all predictors are expected to contribute to the
outcome and none should be excluded from the model. It reduces overfitting by shrinking
the coefficients, ensuring they don’t become too large, while still keeping all the predictors in
the model.
For example, when predicting house prices, features like size, number of bedrooms, location,
and year built are all likely relevant. Ridge Regression ensures these features remain in the
model but with reduced influence to create a balanced and robust prediction.
When to Use Lasso Regression?
Lasso Regression is ideal when you suspect that only a few predictors are truly important,
and the rest may add noise or redundancy. It performs automatic feature selection by
shrinking the coefficients of less important predictors to zero, effectively removing them
from the model.
For example, in genetic research, where thousands of genes are analyzed for their effect on a
disease, Lasso Regression helps by identifying only the most impactful genes and ignoring the
irrelevant ones, leading to a simpler and more interpretable model.
Least square method
• The least square method is the process of finding the best-fitting
curve or line of best fit for a set of data points by reducing the sum
of the squares of the offsets (residual part) of the points from the
curve.
• During the process of finding the relation between two variables, the
trend of outcomes are estimated quantitatively. This process is
termed as regression analysis.
• The least square method is the process of finding the best-fitting
curve or line of best fit for a set of data points by reducing the sum
of the squares of the offsets (residual part) of the points from the
curve.
• It is quite obvious that the fitting of curves for a particular data set
are not always unique.
• Thus, it is required to find a curve having a minimal deviation from
all the measured data points. This is known as the best-fitting curve
and is found by using the least-squares method.
Least Square Method Definition
• The least-squares method is a crucial statistical method that is
practiced to find a regression line or a best-fit line for the given
pattern.
• This method is described by an equation with specific parameters.
The method of least squares is generously used in evaluation and
regression.
• The method of least squares actually defines the solution for the
minimization of the sum of squares of deviations or the errors in the
result of each equation.
• Find the formula for sum of squares of errors, which help to find the
variation in observed data.
Least Square Method Formula
• The least-square method states that the curve that best fits a given set of
observations, is said to be a curve having a minimum sum of the squared
residuals (or deviations or errors) from the given data points.
• Let us assume that the given points of data are (x1, y1), (x2, y2), (x3, y3), …, (xn,
yn) in which all x’s are independent variables, while all y’s are dependent ones.
Also, suppose that f(x) is the fitting curve and d represents error or deviation
from each given point.
• Now, we can write:
• d1 = y1 − f(x1)
• d2 = y2 − f(x2)
• d3 = y3 − f(x3)
• …..
• dn = yn – f(xn)
• The least-squares explain that the curve that best fits is represented
by the property that the sum of squares of all the deviations from
given values must be minimum, i.e:
Least Square Method Graph
• In linear regression, the line of best fit is a
straight line as shown in the following
diagram:
• The given data points are to be minimized by the method of reducing
residuals or offsets of each point from the line.
• The given data points are to be minimized by the method of
reducing residuals or offsets of each point from the line.
• The vertical offsets are generally used in surface, polynomial and
hyperplane problems, while perpendicular offsets are utilized in
common practice.
• Sum = Minimum Quantity
• Suppose when we have to determine the equation of line of best fit for
the given data, then we first use the following formula.
• The equation of least square line is given by Y = a + bX
• Normal equation for ‘a’:
• ∑Y = na + b∑X
• Normal equation for ‘b’:
• ∑XY = a∑X + b∑X2
• Solving these two normal equations we can get the required trend line
equation.
• Thus, we can get the line of best fit with formula y = a + bX
Example
• The Least Squares Model for a set of data (x1, y1), (x2, y2), (x3, y3), …,
(xn, yn) passes through the point (xa, ya), where xa is the average of
the xi‘s and ya is the average of the yi‘s.
• The below example explains how to find the equation of a straight
line or a least square line using the least square method.
• Consider the following data points
• Mean of xi values = (8 + 3 + 2 + 10 + 11 + 3 + 6 + 5 + 6 + 8)/10 =
62/10 = 6.2
• Mean of yi values = (4 + 12 + 1 + 12 + 9 + 4 + 9 + 6 + 1 + 14)/10 =
72/10 = 7.2
• Straight line equation is y = a + bx.
• The normal equations are
• ∑y = an + b∑x
• ∑xy = a∑x + b∑x2
• Substituting these values in the normal equations,
• 10a + 62b = 72….(1)
• 62a + 468b = 503….(2)
• (1) × 62 – (2) × 10,
• 620a + 3844b – (620a + 4680b) = 4464 – 5030
• -836b = -566
• b = 566/836
• b = 283/418
• b = 0.677
• Substituting b = 0.677 in equation (1),
• 10a + 62(0.677) = 72
• 10a + 41.974 = 72
• 10a = 72 – 41.974
• 10a = 30.026
• a = 30.026/10
• a = 3.0026
• Therefore, the equation becomes,
• y = a + bx
• y = 3.0026 + 0.677x
•
• Now, we can find the sum of squares of deviations from the obtained
values as:
• d1 = [4 – (3.0026 + 0.677*8)] = (-4.4186)
• d2 = [12 – (3.0026 + 0.677*3)] = (6.9664)
• d3 = [1 – (3.0026 + 0.677*2)] = (-3.3566)
• d4 = [12 – (3.0026 + 0.677*10)] = (2.2274)
• d5 = [9 – (3.0026 + 0.677*11)] =(-1.4496)
• d6 = [4 – (3.0026 + 0.677*3)] = (-1.0336)
• d7 = [9 – (3.0026 + 0.677*6)] = (1.9354)
• d8 = [6 – (3.0026 + 0.677*5)] = (-0.3876)
• d9 = [1 – (3.0026 + 0.677*6)] = (-6.0646)
• d10 = [14 – (3.0026 + 0.677*8)] = (5.5814)
• ∑d2 = (-4.4186)2 + (6.9664)2 + (-3.3566)2 + (2.2274)2 + (-1.4496)2 + (-
1.0336)2 + (1.9354)2 + (-0.3876)2 + (-6.0646)2 + (5.5814)2 = 159.27990
Multivariate Linear Regression
• Multivariate linear regression resembles simple linear
regression except that in multivariate linear regression, multiple
independent variables contribute to the dependent variables
and so multiple coefficients are used in the computation.
• It is used to derive a mathematical relationship amongst
multiple random variables. It explains how many multiple
independent variables are associated with one dependent
variable.
• The details of the multiple independent variables are used to
make an accurate prediction of the influence they have on the
outcome variable.
• Multivariate linear regression model generates a relationship in
a linear form (a form of a straight line) with the best
approximation of each data point.
• The equation of the Multivariate linear regression model is:
Regression cost Function:
• Regression models deal with predicting a continuous value
for example salary of an employee, price of a car, loan
prediction, etc.
• A cost function used in the regression problem is called
“Regression Cost Function”. They are calculated on the
distance-based error as follows:
• Error = y-y’
• Where,
• Y – Actual Input
• Y’ – Predicted output
1 Mean Error (ME)
• In this cost function, the error for each training data is calculated and
then the mean value of all these errors is derived.
• Calculating the mean of the errors is the simplest and most intuitive
way possible.
• The errors can be both negative and positive. So they can cancel each
other out during summation giving zero mean error for the model.
• Thus this is not a recommended cost function but it does lay the
foundation for other cost functions of regression models.
Mean Error (ME) – “Am I systematically too high or too low?”
Think of this as checking whether your model is biased.
If ME is positive, on average your predictions are too low (you tend to under-predict).
If ME is negative, on average your predictions are too high (you tend to over-predict).
If ME is close to zero, the over- and under-estimates cancel out.
Important: A small ME does not mean you are accurate. Big positive and negative
mistakes can cancel each other.
2 Mean Squared Error (MSE)
• This improves the drawback we encountered in Mean Error above. Here a square of the
difference between the actual and predicted value is calculated to avoid any possibility
of negative error.
• It is measured as the average of the sum of squared differences between predictions and
actual observations.
• MSE cost function
• MSE = (sum of squared errors)/n
• It is also known as L2 loss.
• In MSE, since each error is squared, it helps to penalize even small deviations in
prediction when compared to MAE.
• But if our dataset has outliers that contribute to larger prediction errors, then squaring
this error further will magnify the error many times more and lead to higher MSE error.
• Hence we can say that it is less robust to outliers.
Mean Squared Error (MSE) – “How much do big mistakes hurt me?”
This is like MAE but it squares each error before averaging.
Squaring makes large errors count much more.
An error twice as big counts four times as much.
Good if you want a metric that strongly punishes occasional very large
mistakes.
Its unit is the square of the original unit (for house prices it would be dollars
squared), so it’s not as easy to read directly.
3 Mean Absolute Error (MAE)
• This cost function also addresses the
shortcoming of mean error differently. Here an
absolute difference between the actual and
predicted value is calculated to avoid any
possibility of negative error.
• So in this cost function, MAE is measured as
the average of the sum of absolute differences
between predictions and actual observations.
• MAE Cost function
• MAE = (sum of absolute errors)/n
• It is also known as L1 Loss.
• It is robust to outliers thus it will give better results even when our
dataset has noise or outliers.
Mean Absolute Error (MAE) – “How far off am I, on average?”
This is the typical size of the mistakes, ignoring whether you were high or low.
Always positive, because we take the absolute value.
Easy to read: “On average my prediction is about 12 thousand dollars away from the
real price.”
Use MAE when you want a simple, human-readable number for “average miss”.
• 4. Root mean squared error (RMSE)
• It is the square root of the mean of
the square of all of the errors. Root
Mean Square Error (RMSE) measures the
error between two data sets. In other
words, it compares an observed or
known value and a predicted value.
• The lower the RMSE, the better a given model is
able to “fit” a dataset.
• The formula to find the root mean square error,
often abbreviated RMSE, is as follows:
• RMSE =
Root Mean Squared Error (RMSE) – “Like MSE, but back in the original
units”
This is simply the square root of MSE.
Has the same units as your data, so you can say:
“Typical prediction error is about 12.6 thousand dollars.”
Keeps the “big mistakes matter more” property of MSE, but easier to
interpret.
ME – checks direction of bias.
MAE – checks average size of mistakes.
MSE – checks size but heavily punishes big mistakes.
RMSE – same as MSE but in the same units as the data so it’s
easier to explain.
House Actual y Predicted y^
1 200 210
2 250 240
3 300 310
4 280 260
5 260 270
House Error ei
1 200 − 210 = −10
2 250 − 240 = 10
3 300 − 310 = −10
4 280 − 260 = 20
5 260 − 270 = −10
These are in $1000s. Negative means the model predicted too high, positive means predicted too low.
Problem
A model predicts the monthly electricity bill (in dollars) for 4 households.
Household Actual bill y Predicted bill y^
1 120 110
2 150 160
3 130 140
4 170 155
Regression Coefficients
• The aim of linear regression is to find the regression coefficients that
produce the best-fitted line.
• The regression coefficients in linear regression help in predicting the
value of an unknown variable using a known variable.
• What are Regression Coefficients?
• Regression coefficients can be defined as estimates of some
unknown parameters to describe the relationship between a
predictor variable and the corresponding response.
• Linear regression is used to quantify how a unit change in an
independent variable causes an effect in the dependent variable by
determining the equation of the best-fitted straight line. This process
is known as regression analysis.
Formula for Regression Coefficients
• The goal of linear regression is to find the equation of the
straight line that best describes the relationship between two or
more variables.
• Suppose the equation of the best-fitted line is given by Y = aX +
b then, the regression coefficients formula is given as follows:
• here, n refers to the number of data points in the given data
sets.
How to Find Regression Coefficients?
• Before determining the regression coefficients to find the best-fitted
line, it is necessary to check whether the variables follow a linear
relationship or not.
• This can be done by using the correlation coefficient and interpreting
the corresponding value. Given below are the steps to find the
regression coefficients for regression analysis.
• To find the coefficient of X use the formula
• To find the constant term the formula is b
• Now input the regression coefficients in the equation Y = aX + b.
• A scatter plot can also be made so as to visually depict the
regression line as shown below.
Why do we use Regression Analysis?
• Regression estimates the relationship between the target and
the independent variable.
• It is used to find the trends in data.
• It helps to predict real/continuous values.
• By performing the regression, we can confidently determine
the most important factor, the least important factor, and
how each factor is affecting the other factors.
Polynomial Regression
• Polynomial Regression is a regression algorithm that models the
relationship between a dependent(y) and independent variable(x) as
nth degree polynomial. The Polynomial Regression equation is given
below:
y= b0+b1x1+ b2x12+ b2x13+...... bnx1n
• The dataset used in Polynomial regression for training is of non-
linear nature.
• It makes use of a linear regression model to fit the complicated and
non-linear functions and datasets.
• In Polynomial regression, the original features are converted
into Polynomial features of required degree (2,3,..,n) and then
modeled using a linear model."
Need for Polynomial Regression:
• If we apply a linear model on a linear dataset, then
it provides us a good result as we have seen in
Simple Linear Regression, but if we apply the same
model without any modification on a non-linear
dataset, then it will produce a drastic output. Due
to which loss function will increase, the error rate
will be high, and accuracy will be decreased.
• So for such cases, where data points are
arranged in a non-linear fashion, we need the
Polynomial Regression model. We can understand
it in a better way using the below comparison
diagram of the linear dataset and non-linear dataset.
Example
• There is a Human Resource company, which
is going to hire a new candidate. The
candidate has told his previous salary 160K
per annum, and the HR have to check whether
he is telling the truth or bluff.
• So to identify this, they only have a dataset of
his previous company in which the salaries of
the top 10 positions are mentioned with their
levels.
• By checking the dataset available, we have
found that there is a non-linear relationship
between the Position levels and the salaries.
Our goal is to build a Bluffing detector
regression model, so HR can hire an honest
candidate.
Steps for Polynomial Regression:
• Data Pre-processing
• Build a Linear Regression model and fit it to the dataset
• Build a Polynomial Regression model and fit it to the dataset
• Visualize the result for Linear Regression and Polynomial
Regression model.
• Predicting the output.
Example Code
• #Extracting Independent and depende
# importing libraries nt Variable
import numpy as nm • x= data_set.iloc[:, 1:2].values
import [Link] as mtp • y= data_set.iloc[:, 2].values
import pandas as pd
#Fitting the Linear Regression to the dat
#importing datasets aset
data_set= pd.read_csv('Position_Sala from sklearn.linear_model import Linea
rRegression
[Link]') lin_regs= LinearRegression()
lin_regs.fit(x,y)
#Fitting the Polynomial regression to th
e dataset
from [Link] import Poly
nomialFeatures
poly_regs= PolynomialFeatures(degree
= 2)
x_poly= poly_regs.fit_transform(x)
lin_reg_2 =LinearRegression()
lin_reg_2.fit(x_poly, y)
#Visulaizing the result for Polynomia
l Regression
[Link](x,y,color="blue")
[Link](x, lin_reg_2.predict(poly_re
gs.fit_transform(x)), color="red")
[Link]("Bluff detection model(Poly
nomial Regression)")
[Link]("Position Levels")
[Link]("Salary")
[Link]()
#Visulaizing the result for Linear
Regression model
[Link](x,y,color="blue")
[Link](x,lin_regs.predict(x), colo
r="red")
[Link]("Bluff detection model(Li
near Regression)")
[Link]("Position Levels")
[Link]("Salary")
[Link]()
Decision Trees
Introduction
• Classification is a two-step process, learning step and prediction step, in
machine learning.
• In the learning step, the model is developed based on given training data. In
the prediction step, the model is used to predict the response for given data.
• Decision Tree is one of the easiest and popular classification algorithms to
understand and interpret.
• The decision tree algorithm can be used for solving regression and
classification problems
Decision Tree(CART)
• The goal of using a Decision Tree is to create a training model that can use to
predict the class or value of the target variable by learning simple decision
rules inferred from training data.
• In Decision Trees, for predicting a class label for a record we start from
the root of the tree.
• We compare the values of the root attribute with the record’s attribute. On the
basis of comparison, we follow the branch corresponding to that value and
jump to the next node.
Terminologies
• Important Terminology related to Decision Trees
1. Root Node: It represents the entire population or sample and this further gets divided into two or
more homogeneous sets.
2. Splitting: It is a process of dividing a node into two or more sub-nodes.
[Link] Node: When a sub-node splits into further sub-nodes, then it is called the decision node.
[Link] / Terminal Node: Nodes do not split is called Leaf or Terminal node.
[Link]: When we remove sub-nodes of a decision node, this process is called pruning.
• By removing error-prone components, the classifier’s performance may be
improved
[Link] / Sub-Tree: A subsection of the entire tree is called branch or sub-tree.
[Link] and Child Node: A node, which is divided into sub-nodes is called a parent node of sub-
nodes whereas sub-nodes are the child of a parent node.
Decision Tree
• Decision trees classify the examples by sorting them down the tree from the
root to some leaf/terminal node, with the leaf/terminal node providing the
classification of the example.
• Each node in the tree acts as a test case for some attribute, and each edge
descending from the node corresponds to the possible answers to the test
case. This process is recursive in nature and is repeated for every subtree
rooted at the new node.
• The primary challenge in the decision tree implementation is to identify which
attributes do we need to consider as the root node and each level.
• Handling this is to know as the attributes selection. We have different
attributes selection measures to identify the attribute which can be
considered as the root node at each level.
• Decision trees use multiple algorithms to decide to split a node into two or
more sub-nodes.
• The creation of sub-nodes increases the homogeneity of resultant sub-nodes.
In other words, we can say that the purity of the node increases with respect to
the target variable. The decision tree splits the nodes on all available variables
and then selects the split which results in most homogeneous sub-nodes.
• Example
CART Model Representation
• The representation for the CART model is a binary
tree.
• Each root node represents a single input variable
(x) and a split point on that variable.
• The leaf nodes of the tree contain an output
variable (y) which is used to make a prediction.
• Given a dataset with two inputs (x) of height in
centimeters and weight in kilograms the output of
sex as male or female, below is a crude example of
a binary decision tree (completely fictitious for
demonstration purposes only).
• For example, given the input of [height = 160 cm, weight = 65
kg], we would traverse the above tree as follows:
• Height > 180 cm: No
• Weight > 80 kg: No
• Therefore: Female
• Given a new input, the tree is traversed by evaluating the
specific input started at the root node of the tree.
• A learned binary tree is actually a partitioning of the input space.
You can think of each input variable as a dimension on a p-
dimensional space. The decision tree split this up into
rectangles (when p=2 input variables) or some kind of hyper-
rectangles with more inputs.
• New data is filtered through the tree and lands in one of the
rectangles and the output value for that rectangle is the
prediction made by the model. This gives you some feeling for
the type of decisions that a CART model is capable of making,
• For example, given the input of [height = 160 cm, weight = 65
kg], we would traverse the above tree as follows:
Learn a CART Model From Data
• Creating a CART model involves selecting input
variables and split points on those variables
until a suitable tree is constructed.
• CART (Classification and Regression Trees) → uses Gini
Index(Classification) as metric.
• If all the data belong to a single class, then it can be called
pure.
• Its Degree will be always between 0 and 1. If 0, means all
data belongs to the single class/variable. If 1, the data
belong to the different class/field.
GINI INDEX
• Gini Index is the metric for classification
task in CART.
• Gini Index(Attribute=value) =
• Gini Index(Attribute)= σ𝑣=𝑣𝑎𝑙𝑢𝑒𝑠 𝑃𝑣 𝑋 𝐺𝐼(𝑣)
Data set
There are 14 instances of golf playing decisions based on outlook, temperature, humidity
and wind factors.
• Gini index
• Gini index is a metric for classification tasks in CART. It stores
sum of squared probabilities of each class. We can formulate it
as illustrated below.
• Gini = 1 – Σ (Pi)2 , for i=1 to number of classes
• Outlook
• Outlook is a nominal feature. It can be sunny, overcast or rain. I
will summarize the final decisions for outlook feature.
• Gini(Outlook=Rain) = 1 – (3/5)2 – (2/5)2 = 1 – 0.36 – 0.16 = 0.48
• Then, we will calculate weighted sum of gini indexes for outlook
feature.
• Gini(Outlook) = (5/14) x 0.48 + (4/14) x 0 + (5/14) x 0.48 =
0.171 + 0 + 0.171 = 0.342
• Temperature
• Similarly, temperature is a nominal feature and it could have 3
different values: Cool, Hot and Mild. Let’s summarize decisions
for temperature feature.
• Gini(Temp=Hot) = 1 – (2/4)2 – (2/4)2 = 0.5
• Gini(Temp=Cool) = 1 – (3/4)2 – (1/4)2 = 1 – 0.5625 – 0.0625 =
0.375
• Gini(Temp=Mild) = 1 – (4/6)2 – (2/6)2 = 1 – 0.444 – 0.111 =
0.445
• We’ll calculate weighted sum of gini index for temperature
feature
• Gini(Temp) = (4/14) x 0.5 + (4/14) x 0.375 + (6/14) x 0.445 =
0.142 + 0.107 + 0.190 = 0.439
• Humidity
• Humidity is a binary class feature. It can be high or normal.
• Gini(Humidity=High) = 1 – (3/7)2 – (4/7)2 = 1 – 0.183 – 0.326 =
0.489
• Gini(Humidity=Normal) = 1 – (6/7)2 – (1/7)2 = 1 – 0.734 – 0.02 =
0.244
• Weighted sum for humidity feature will be calculated next
• Gini(Humidity) = (7/14) x 0.489 + (7/14) x 0.244 = 0.367
• Wind
• Wind is a binary class similar to humidity. It can be weak and
strong.
• Gini(Wind=Weak) = 1 – (6/8)2 – (2/8)2 = 1 – 0.5625 – 0.062 =
0.375
• Gini(Wind=Strong) = 1 – (3/6)2 – (3/6)2 = 1 – 0.25 – 0.25 = 0.5
• Gini(Wind) = (8/14) x 0.375 + (6/14) x 0.5 = 0.428
• Time to decide
• We’ve calculated gini index values for each feature. The winner
will be outlook feature because its cost is the lowest.
• Focus on the sub dataset for sunny outlook. We need to find the gini
index scores for temperature, humidity and wind features respectively.
Gini(Outlook=Sunny and Temp.=Hot) = 1 – (0/2)2 – (2/2)2 = 0
Gini(Outlook=Sunny and Temp.=Cool) = 1 – (1/1)2 – (0/1)2 = 0
Gini(Outlook=Sunny and Temp.=Mild) = 1 – (1/2)2 – (1/2)2 = 1 – 0.25 – 0.25 = 0.5
Gini(Outlook=Sunny and Temp.) = (2/5)x0 + (1/5)x0 + (2/5)x0.5 = 0.2
Entropy
• Entropy is a measure of disorder or impurity in the given dataset.
• In the decision tree, messy data are split based on values of the feature
vector associated with each data point.
• With each split, the data becomes more homogenous which will
decrease the entropy.
• The higher the entropy, the harder it is to draw any conclusion. When
the tree finally reaches the terminal or leaf node maximum purity is
added.
• For a dataset that has C classes and the probability of randomly
choosing data from class, i is Pi. Then entropy E(S) can be
mathematically represented as
• If we have a dataset of 10 observations belonging to two classes YES and NO. If 6
observations belong to the class, YES, and 4 observations belong to class NO, then
entropy can be written as below.
• Pyes is the probability of choosing Yes and Pno is the
probability of choosing a No. Here Pyes is 6/10 and Pno is 4/10.
• If all the 10 observations belong to 1 class then entropy will be
equal to zero. Which implies the node is a pure node.
• If both classes YES and NO have an equal number of observations, then
entropy will be equal to 1.
Information Gain
• The Information Gain measures the expected reduction in
entropy.
• Entropy measures impurity in the data and information gain
measures reduction in impurity in the data.
• The feature which has minimum impurity will be considered as
the root node.
• Information gain is used to decide which feature to split on at
each step in building the tree.
• Information gain of a parent node can be calculated as the
entropy of the parent node is the subtracted entropy of the
weighted average of the child node.
• As per the above example, the dataset has 10 observations
belonging to two classes YES and NO. Where 6 observations
belong to the class, YES, and 4 observations belong to class
NO.
• Red color has 3 Yes outcome and 3 No outcome whereas
yellow has 3 Yes outcome and 1 No outcome.
• E(S), we have already calculated and it is approximately equal
to 0.971
•
• For a dataset having many features, the information gain of each feature is
calculated. The feature having maximum information gain will be the most
important feature which will be the root node for the decision tree.
Decision Tree (ID3 Algorithm)
• ID3 or Iterative Dichotomiser3 Algorithm is used in machine learning for
building decision trees from a given dataset. It was developed
in 1986 by Ross Quinlan.
• It is a greedy algorithm that builds a decision tree by recursively
partitioning the data set into smaller and smaller subsets until all data points
in each subset belong to the same class.
• It employs a top-down approach, recursively selecting features to split the
dataset based on information gain.
• used for both classification and regression tasks.ID3 deals primarily with
categorical properties, which means that it can efficiently handle objects
with a discrete set of values. This property is consistent with its suitability
for problems where the input features are categorical rather than continuous
Steps of the ID3 Algorithm
[Link] entropy for the overall the dataset using class
distribution.
[Link] each feature.
2. Calculate Entropy for Categorical Values.
3. Assess information gain for each unique categorical value of the
feature.
[Link] the feature that generates highest information gain.
[Link] apply all above steps to build the decision tree
structure.
def ID3(D, A):
if D is pure or A is empty:
return a leaf node with the majority class in D
else:
A_best = argmax(InformationGain(D, A))
root = Node(A_best)
for v in values(A_best):
D_v = subset(D, A_best, v)
child = ID3(D_v, A - {A_best})
root.add_child(v, child)
return root
Applications of ID3
[Link] detection: ID3 can be used to develop models that can detect
fraudulent transactions or activities.
[Link] diagnosis: ID3 can be used to develop models that can diagnose
diseases or medical conditions.
[Link] segmentation: ID3 can be used to segment customers into
different groups based on their demographics, purchase history, or other
factors.
[Link] assessment: ID3 can be used to assess risk in a variety of different
areas, such as insurance, finance, and healthcare.
[Link] systems: ID3 can be used to develop recommendation
systems that can recommend products, services, or content to users based on
their past behavior or preferences.
Example:
• Build a decision tree using
ID3 algorithm for the given
training data in the table
(Buy Computer data), and
predict the class of the
following new
example: age<=30,
income=medium,
student=yes, credit-
rating=fair
• First, check which attribute provides the highest Information Gain in order to split the training set
based on that attribute. We need to calculate the expected information to classify the set and the
entropy of each attribute.
• The mutual information of the two classes,
• Entropy(S)= E(9,5)= -9/14 log2(9/14) – 5/14 log2(5/14)=0.94
• Now Consider the Age attribute
• For Age, we have three values age<=30 (2 yes and 3 no), age31..40 (4 yes and 0 no), and age>40 (3 yes
and 2 no)
• Entropy(age) = 5/14 (-2/5 log2(2/5)-3/5log2(3/5)) + 4/14 (0) + 5/14 (-3/5log2(3/5)-2/5log2(2/5))
• = 5/14(0.9709) + 0 + 5/14(0.9709) = 0.6935
• Gain(age) = 0.94 – 0.6935 = 0.2465
• Next, consider Income Attribute
• For Income, we have three values incomehigh (2 yes and 2 no), incomemedium (4 yes and 2 no), and
incomelow (3 yes 1 no)
• Entropy(income) = 4/14(-2/4log2(2/4)-2/4log2(2/4)) + 6/14 (-
4/6log2(4/6)-2/6log2(2/6)) + 4/14 (-3/4log2(3/4)-1/4log2(1/4))
• = 4/14 (1) + 6/14 (0.918) + 4/14 (0.811)
• = 0.285714 + 0.393428 + 0.231714 = 0.9108
• Gain(income) = 0.94 – 0.9108 = 0.0292
• Next, consider Student Attribute
• For Student, we have two values studentyes (6 yes and 1 no)
and studentno (3 yes 4 no)
Entropy(student) = 7/14(-6/7log2(6/7)-1/7log2(1/7)) + 7/14(-3/7log2(3/7)-4/7log2(4/7)
= 7/14(0.5916) + 7/14(0.9852)
= 0.2958 + 0.4926 = 0.7884
Gain (student) = 0.94 – 0.7884 = 0.1516
Finally, consider Credit_Rating Attribute
For Credit_Rating we have two values credit_ratingfair (6 yes and 2 no) and
credit_ratingexcellent (3 yes 3 no)
Entropy(credit_rating) = 8/14(-6/8log2(6/8)-2/8log2(2/8)) + 6/14(-3/6log2(3/6)-
3/6log2(3/6))
= 8/14(0.8112) + 6/14(1)
= 0.4635 + 0.4285 = 0.8920
• Gain(credit_rating) = 0.94 – 0.8920 = 0.479
• Since Age has the highest Information Gain we start splitting
the dataset using the age attribute.
Decision Tree after step 1
Decision Tree after step 1_1
• Since all records under the branch age31..40 are all of the
class, Yes, we can replace the leaf with Class=Yes
• Now build the decision tree for the left subtree
• The same process of splitting has to happen for the two
remaining branches.
• For branch age<=30 we still have attributes income, student,
and credit_rating. Which one should be used to split the
partition?
• The mutual information is E(Sage<=30)= E(2,3)= -2/5 log2(2/5) –
3/5 log2(3/5)=0.97
• For Income, we have three values incomehigh (0 yes and 2 no),
incomemedium (1 yes and 1 no) and incomelow (1 yes and 0 no)
• Entropy(income) = 2/5(0) + 2/5 (-1/2log2(1/2)-1/2log2(1/2)) + 1/5
(0) = 2/5 (1) = 0.4
• Gain(income) = 0.97 – 0.4 = 0.57
• For Student, we have two values studentyes (2
yes and 0 no) and studentno (0 yes 3 no)
• Entropy(student) = 2/5(0) + 3/5(0) = 0
• Gain (student) = 0.97 – 0 = 0.97
• We can then safely split on attribute student
without checking the other attributes since the
information gain is maximized.
Decision Tree after step 2
Decision Tree after step 2_2
• Since these two new branches are from distinct classes, we
make them into leaf nodes with their respective class as label:
Right sub-branch
• Now build the decision tree for right left subtree
• The mutual information is Entropy(Sage>40)= I(3,2)= -3/5 log2(3/5) – 2/5 log2(2/5)=0.97
• For Income, we have two values incomemedium (2 yes and 1 no) and incomelow (1 yes and
1 no)
• Entropy(income) = 3/5(-2/3log2(2/3)-1/3log2(1/3)) + 2/5 (-1/2log2(1/2)-1/2log2(1/2))
• = 3/5(0.9182)+2/5 (1) = 0.55+0. 4= 0.95
• Gain(income) = 0.97 – 0.95 = 0.02
• For Student, we have two values studentyes (2 yes and 1 no) and studentno (1 yes and 1
no)
• Entropy(student) = 3/5(-2/3log2(2/3)-1/3log2(1/3)) + 2/5(-1/2log2(1/2)-1/2log2(1/2)) = 0.95
• Gain (student) = 0.97 – 0.95 = 0.02
• For Credit_Rating, we have two values credit_ratingfair (3 yes and 0 no) and
credit_ratingexcellent (0 yes and 2 no)
• Entropy(credit_rating) = 0
• Gain(credit_rating) = 0.97 – 0 = 0.97
• We then split based on credit_rating. These splits give partitions
each with records from the same class. We just need to make
these into leaf nodes with their class label attached:
• New example: age<=30, income=medium, student=yes, credit-
rating=fair
• Follow branch(age<=30) then student=yes we predict
Class=yes
• Buys_computer = yes
Day Outlook Temp Humidity Wind Play Tennis
D1 Sunny Hot High Weak No
D2 Sunny Hot High Strong No
D3 Overcast Hot High Weak Yes
D4 Rain Mild High Weak Yes
D5 Rain Cool Normal Weak Yes
D6 Rain Cool Normal Strong No
D7 Overcast Cool Normal Strong Yes
D8 Sunny Mild High Weak No
D9 Sunny Cool Normal Weak Yes
D10 Rain Mild Normal Weak Yes
D11 Sunny Mild Normal Strong Yes
D12 Overcast Mild High Strong Yes
D13 Overcast Hot Normal Weak Yes
D14 Rain Mild High Strong No