Machine Learning 24MCA31 Module3

Module 3 of the Machine Learning course focuses on regression algorithms, which are a subset of supervised machine learning techniques used to predict output values based on input features. It covers various types of regression analysis techniques, including linear and logistic regression, explaining their applications, benefits, and differences between simple and multiple linear regression. The module emphasizes the importance of understanding the relationship between dependent and independent variables, as well as the impact of outliers on regression models.

Uploaded by

Tanmay Rana

Machine Learning-24MCA31 MODULE - 3

MACHINE LEARNING

MODULE – 3
Machine Learning Algorithms - II

3.1 REGRESSION
Introduction:
 Regression algorithms fall under the family of Supervised Machine Learning
algorithms which is a subset of machine learning algorithms.
 One of the main features of supervised learning algorithms is that they model
dependencies and relationships between the target output and input features to
predict the value for new data.
 Regression algorithms predict the output values based on input features from
the data fed in the system.
 The algorithm builds a model from the features of the training data and uses the
model to predict the value for new data.
 Regression analysis is a form of predictive modelling technique which
investigates the relationship between a dependent (target) and independent
variable (s) (predictor).
 It involves determining the best fit line, which is the line positioned so that
the overall distance between the line and the data points is minimized. The line
need not pass through every data point.
 Linear regression and logistic regression are two types of regression
analysis techniques that are used to solve the regression problem using machine
learning.
 They are the most prominent techniques of regression. But, there are many
types of regression analysis techniques available in machine learning to make
predictions, and their usage varies according to the nature of the data involved.
 These techniques are mostly driven by three metrics (number of independent
variables, type of dependent variables and shape of regression line).

Fig 3.1 Regression Types

Dr. V. Asha, Professor, MCA Dept, NHCE Page 1



Types of Regression Analysis Techniques


There are many types of regression analysis techniques, and the choice of method
depends on a number of factors. These factors include the type of target variable,
the shape of the regression line, and the number of independent variables.
The different regression techniques are:
1. Linear Regression
2. Logistic Regression
3. Polynomial Regression
4. Stepwise Regression
5. Ridge Regression
6. Bayesian Linear Regression
7. Lasso Regression
8. Decision Tree Regression
9. Random Forest Regression
10. Support Vector Machines

1. Linear Regression:
How would linear regression be described in layman’s terms?
Linear regression can be thought of as the answer to the question "How can I use X to
predict Y?", where X is some information that you have and Y is some information that
you want to know.
Let's look at a concrete example. You might be wondering how much you can sell your
house for. You have information about your house, for instance, the number of
bedrooms is 2. This is your X. You want to know how much your property could be
worth on the market. This is Y, the price in $ that you can sell your house for.
Linear regression creates an equation in which you input your given numbers (X) and it
outputs the target variable that you want to find out (Y). We obtain the equation by
training it on pairs of (X, Y) values. We would use a dataset containing historic records
of house purchases in the form of ("number of bedrooms", "selling price").

We would then visualize the data points on a scatter plot to see if there are any
trends. A scatter plot is a two-dimensional plot, with each data point representing a
house. On the x-axis, we would have values for "Number of bedrooms", while on the
y-axis, we would have the "Selling price" for the same houses:


Fig 3.2 Plotting of points


Looking at the scatter plot, there appears to be a trend: the more bedrooms a
house has, the higher its selling price (which is not surprising). Now, let's
say that we trained a linear regression model to get an equation in the form:
Selling price = $77,143 * (Number of bedrooms) - $74,286
The equation acts as a predictor. If you input the number of bedrooms, you get the
predicted value for the price at which the house is sold. For the specific example
above: Your selling price = $77,143 * 2 bedrooms - $74,286 = $80,000
In other words, you could sell your 2-bedroom house for approximately $80,000. But
linear regression does more than just that. The same equation gives the predicted
price for houses with a different number of bedrooms, because linear regression
tries to find a straight line that best fits the data. Linear regression is not
limited to real-estate problems: it can also be applied to a variety of business
use cases.
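
The prediction step in the house example can be sketched in a few lines of Python. This is only an illustration: the function name predict_selling_price is hypothetical, and the slope and intercept are simply the values quoted in the equation above.

```python
def predict_selling_price(bedrooms: float) -> float:
    """Return the predicted selling price in dollars for a bedroom count."""
    slope = 77_143       # dollars added per extra bedroom (from the text)
    intercept = -74_286  # dollars (from the text)
    return slope * bedrooms + intercept


# Evaluate the fitted line for a few bedroom counts.
for bedrooms in (2, 3, 4):
    print(bedrooms, predict_selling_price(bedrooms))
```

For 2 bedrooms this reproduces the $80,000 figure worked out above.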

Linear Regression:
 Linear regression is a supervised machine-learning regression algorithm.
 Linear regression is a statistical model and the simplest, most basic regression
technique widely known in machine learning.
 It is fast and easy to model, learn and evaluate, and it is useful when the target
relationship is not complex or when little data is available.
 Linear regression is a linear model that establishes a linear relationship between
one or more input variables (x) (independent variables) and the single output
variable (y) (dependent variable).
 There must be a linear relationship between the independent and dependent
variables. Here y can be calculated from a linear combination of the input
variables (x). For a single input the line is y = mx + c, and the best fit line is
determined by varying the values of the slope m and the intercept c.
 The prediction error is the difference between the observed value and the
predicted value.
 The values of m and c are selected in such a way that they give the minimum
prediction error.


 It is important to note that a simple linear regression model is susceptible
(liable to be influenced) to outliers. Therefore, the data should be checked for
outliers before the model is fitted.
 Because linear regression is very sensitive to outliers, they can severely distort
the regression line and eventually the forecasted values.
 Outlier: An outlier is a data point that deviates significantly from the rest of
the dataset. Outliers are often abnormal observations that skew the data
distribution, and they can arise from measurement or execution errors,
inconsistent data entry, or erroneous observations.

Fig 3.3 Outlier


 When there is a single input variable (x), the method is called a simple linear
regression.
 When there are multiple input variables, the procedure is referred to as multiple
linear regression.
 Here we predict a target variable Y based on the input variable X.
 A linear relationship should exist between target variable and predictor and so
comes the name Linear Regression.
 The linear regression model consists of a predictor variable and a dependent
variable related linearly to each other.
 In this technique, the dependent variable is continuous, independent variable(s)
can be continuous or discrete, and nature of regression line is linear.
 The below-given equation is used to denote the linear regression model:
y=mx+c+e, where
 y is the output variable. It is also called the target variable in machine learning,
or the dependent variable in statistical modeling. It represents the continuous
value that we are trying to predict.
 m is the slope of the line or it is the regression coefficient. In classical
statistics, it is the equivalent of the slope on the best-fit straight line that is
produced after the linear regression model has been fitted.
 c is an intercept or bias term, and e represents the error in the model.
 x is the input variable. In machine learning, x is referred to as
the feature, while in statistics, it is called the independent variable. It
represents the information given to us at any given time.


The goal of the regression analysis (modeling) is to find the values for the unknown
parameters of the equation; that is, to find the values for the weights m and c. This
equation can be used to predict the value of the target variable based on the given
predictor variable(s). Consider predicting the salary of an employee based on his/her
age. We can easily identify that there seems to be a correlation between an employee's
age and salary (the greater the age, the higher the salary). The hypothesis of linear
regression is Y = b0 + b1X, where Y represents salary, X is the employee's age, and b0
and b1 are the coefficients of the equation (b0 plays the role of the intercept c and
b1 the role of the slope m). So in order to predict Y (salary) given X (age), we need
to know the values of b0 and b1 (the model's coefficients).

Fig 3.4 Graph of Linear Regression


In the figure, the red points are the data points and the blue line is the predicted line
for the training data. To get the predicted value, these data points are projected on to
the line. When the data involves more than one independent variable, the model is
called multiple linear regression. The difference between simple linear regression
and multiple linear regression is that multiple linear regression has more than one
independent variable, whereas simple linear regression has only one.

How to obtain the best fit line (values of b0 and b1)?


 This task can be accomplished with the Least Squares Method, the most common
method used for fitting a regression line. It calculates the best-fit line for
the observed data by minimizing the sum of the squares of the vertical deviations
from each data point to the line. Because the deviations are squared before they are
added, positive and negative values do not cancel out. The sloped straight line
representing the linear relationship that best fits the given data is called the
regression line, or the best fit line.
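
For reference, the least squares criterion for the hypothesis Y = b0 + b1X can be written out as below. This is the standard textbook result for simple linear regression rather than something specific to these notes; the index i running over the n data points and the bars denoting sample means are notational assumptions.

```latex
\begin{aligned}
\text{SSE}(b_0, b_1) &= \sum_{i=1}^{n} \bigl(y_i - (b_0 + b_1 x_i)\bigr)^2 \\
b_1 &= \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2} \\
b_0 &= \bar{y} - b_1 \bar{x}
\end{aligned}
```

The first line is the quantity being minimized; the second and third lines give the slope and intercept at which it is smallest.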

Fig 3.5 Best fit line


Applications of Linear Regression


1. Predict future prices/costs: If your business is buying items or services (e.g. raw
materials expenses, stock prices, labor costs, etc.), you can use linear regression to
predict what the prices of these items are going to be in the future.
2. Relationship between rash driving and number of road accidents by a driver is best
studied through regression.
3. Predict future revenue: Linear regression can be used to model the advertising
data, discover the relation between advertising data and revenue, and predict how
much revenue the business will generate depending on how much you spend on ads in
a given month.
4. Compare performance: You just launched a new product line, but it's unclear
whether it is attracting more customers than your existing ones. Use linear
regression to determine how your new product compares to the ones that you
already have.
5. Other popular applications of the linear regression algorithm include financial
portfolio prediction, salary forecasting, real estate predictions and estimating
arrival times (ETAs) in traffic.
6. The technique is used for forecasting, time series modeling and data analysis. It is
mainly used to determine the strength of predictors, to forecast trends and to study
cause-and-effect relationships.

Benefits of Linear Regression for Deployment in Production


When it comes to production data science settings, linear regression is a popular
choice due to its many benefits:
1. Ease of use: The model is simple to implement computationally. It does not require a
lot of engineering overhead, either before launch or during maintenance.
2. Interpretability: Unlike deep learning models (neural networks), linear
regression is straightforward to interpret. This positions the algorithm ahead of
black-box models, which do not explain which input variable causes the output
variable to change.
3. Scalability: The algorithm is not computationally heavy, which means that linear
regression suits use cases where scaling is expected. It scales well with
increases in data volume (big data) and data velocity.
4. Performs well in online settings: Because of the ease of computation, linear
regression can be used in online settings.

These specific features explain why linear regression is one of the best models for
making predictions using machine learning.


Types of Linear Regression


Based on the number of independent variables, there are two types of linear
regression: Simple Linear Regression and Multiple Linear Regression.

Simple linear regression vs. Multiple linear regression


Both simple and multiple linear regressions assume that there is a linear relationship
between the input variable(s) and the output target variable.
The main difference is the number of independent variables that they take as inputs.
Simple linear regression just takes a single feature, while multiple linear regression
takes multiple x values. In multiple linear regression, the dependent variable depends
on more than one independent variable. For multiple linear regression, the formula can
be written with n input variables as: Y = c + m1x1 + m2x2 + m3x3 + …. + mnxn
Here,
 Y is the dependent variable.
 x1, x2, …., xn are the independent variables (xi is the i-th feature).
 c is the intercept.
 m1, …, mn are the regression coefficients.
 mj (1<=j<=n) is the slope or weight that specifies the factor by which xj has an
impact on Y.
The simple linear regression model can be represented graphically as a best-fit line
between the data points, while the multiple linear regression model can be represented
as a plane (with two independent variables) or a hyperplane (with more than two).

Fig 3.6 Simple Linear Regression vs Multiple Linear Regression

Despite their differences, both the simple and multiple regression models are linear
models - they adopt the form of a linear equation.
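
As a rough sketch of the multiple linear regression formula Y = c + m1x1 + ... + mnxn, the prediction is the intercept plus a dot product of weights and features. The weights, intercept and feature values below are made-up numbers chosen only for illustration, and the function name predict is hypothetical.

```python
import numpy as np


def predict(x, weights, intercept):
    """Evaluate Y = c + m1*x1 + ... + mn*xn for one data point."""
    return intercept + np.dot(weights, x)


# Hypothetical model with three features (values are illustrative only).
m = np.array([2.0, -1.0, 0.5])  # regression coefficients m1..m3
c = 4.0                         # intercept
x = np.array([1.0, 3.0, 2.0])   # one data point x1..x3

y = predict(x, m, c)  # 4 + (2*1) + (-1*3) + (0.5*2)
print(y)
```

With a single feature the same function reduces to the simple linear regression line y = mx + c.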

2. Logistic Regression:
Logistic regression is one of the types of regression analysis technique, which gets
used when the dependent variable is discrete. Example: 0 or 1, true or false, etc. This
means the target variable can have only two values, and a sigmoid curve denotes the
relation between the target variable and the independent variable.


The logit function is used in logistic regression to measure the relationship between
the target variable and the independent variables.
Logistic regression is used to find the probability of event=Success and event=Failure.
We should use logistic regression when the dependent variable is binary (0/1, True/False,
Yes/No) in nature. Here the predicted value of Y is a probability p that ranges
from 0 to 1, and it can be represented by the following equations.
Odds = p/(1-p) = probability of the event occurring / probability of the event not occurring
ln(odds) = ln(p/(1-p))
logit(p) = ln(p/(1-p)) = b0+b1X1+b2X2+b3X3….+bkXk
where p is the probability of presence of the characteristic of interest. The reason
for using the log in the equation is that we are working here with a binomial
distribution (dependent variable), so we need to choose a link function which is best
suited for this distribution, and that is the logit function.
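
The relationship between the logit and the probability p can be checked numerically. The sketch below assumes a single input variable and made-up coefficients b0 and b1; the function names are hypothetical. Inverting logit(p) = b0 + b1X gives the sigmoid function, which maps any real number back to a probability between 0 and 1.

```python
import math


def logit(p: float) -> float:
    """Log-odds of a probability p, with 0 < p < 1."""
    return math.log(p / (1 - p))


def sigmoid(z: float) -> float:
    """Inverse of the logit: maps a real number to a probability."""
    return 1 / (1 + math.exp(-z))


# Hypothetical coefficients for one input variable (illustrative only).
b0, b1 = -3.0, 1.5


def predict_probability(x: float) -> float:
    """p such that logit(p) = b0 + b1 * x."""
    return sigmoid(b0 + b1 * x)


print(predict_probability(2.0))  # here b0 + b1*x is 0, so the odds are even
```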

Fig 3.7 Sigmoid Function

Logistic regression is a suitable choice when the dataset is large and the values of the
target variable occur with almost equal frequency. Also, there should be no
multicollinearity, which means that the independent variables in the dataset should
not be strongly correlated with each other.
Important Points:

Fig 3.8 Logistic Regression


 Logistic regression is widely used for classification problems.
 Logistic regression doesn't require a linear relationship between the dependent and
independent variables. It can handle various types of relationships because it
applies a non-linear log transformation to the predicted odds ratio.
 To avoid overfitting and underfitting, we should include all significant variables.
A good approach to ensure this practice is to use a stepwise method to estimate
the logistic regression.
 It requires large sample sizes because maximum likelihood estimates are less
powerful at low sample sizes than ordinary least squares.
 The independent variables should not be correlated with each other, i.e. there
should be no multicollinearity.
 If the values of the dependent variable are ordinal, it is called ordinal logistic
regression.
 If the dependent variable is multi-class, it is known as multinomial logistic
regression.

3. Polynomial Regression
 Polynomial Regression is another type of regression analysis technique in machine
learning, which is the same as Multiple Linear Regression with a little modification.
In Polynomial Regression, the relationship between the independent and dependent
variables, that is X and Y, is modelled as an n-th degree polynomial, i.e., the power
of the independent variable is more than 1.
 The linear model Y = a + bX is transformed into something like Y = a + bX + cX^2.
 It is still a linear model in its coefficients, but the curve is now quadratic rather
than a line. If we increase the degree to a very high value, the curve becomes
overfitted as it learns the noise in the data as well. In this regression technique,
the best fit line is not a straight line. It is rather a curve that fits the data points.
 It is a linear model as an estimator. The least squares method is used in
Polynomial Regression also. While trying to reduce the Mean Squared Error to a
minimum and to get the best fit curve, the model can be prone to overfitting.
 It is represented by the equation: Y = b0 + b1x + b2x^2 + ... + bnx^n

Fig 3.9 Polynomial Regression
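
The remark that polynomial regression is still a linear model can be made concrete: the powers of x are treated as separate input columns and the coefficients are found by ordinary least squares. The sketch below uses NumPy's np.linalg.lstsq on noise-free, made-up data generated from Y = 1 + 2x + 3x^2, so the data and variable names are assumptions for illustration only.

```python
import numpy as np

# Made-up, noise-free data generated from Y = 1 + 2x + 3x^2.
x = np.arange(5, dtype=float)
y = 1 + 2 * x + 3 * x**2

# Design matrix with columns [1, x, x^2]: the model is linear in b0, b1, b2.
X = np.column_stack([np.ones_like(x), x, x**2])

# Ordinary least squares fit of the three coefficients.
coeffs, *_ = np.linalg.lstsq(X, y, rcond=None)
print(coeffs)  # approximately b0=1, b1=2, b2=3
```

Raising the degree only adds more columns to X, which is why a very high degree gives the model enough freedom to fit noise.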


Important Points:
 While there might be a temptation to fit a higher degree polynomial to get lower
error, this can result in over-fitting. Always plot the relationships to see the fit
and focus on making sure that the curve fits the nature of the problem.
 Especially look out for the curve towards the ends and see whether those shapes and
trends make sense. Higher-degree polynomials can end up producing weird results on
extrapolation.
 Here is an example of how plotting can help:

Fig 3.10 Underfitting, Just right, Overfitting

4. Stepwise Regression
 Standard stepwise regression does two things: it adds and removes predictors as
needed at each step.
 Forward selection starts with the most significant predictor in the model and adds
a variable at each step. It keeps adding variables, reviewing the performance each
time, and stops when adding a variable no longer gives a worthwhile improvement.
 Backward elimination starts with all predictors in the model and removes the least
significant variable at each step. It removes one variable at a time
until no further variable can be deleted without considerable loss.
 Bidirectional elimination is a blend of the above two approaches.
 The aim of this modeling technique is to maximize the prediction power with the
minimum number of predictor variables. It is one of the methods for handling high
dimensionality of a data set.

5. Ridge Regression:
 This is another type of regression in machine learning which is usually used when
there is a high correlation between the independent variables. In the case of
multicollinear data, the least squares estimates are unbiased, but their variances
are large, so the estimates can be far from the true values. Therefore, a small
amount of bias is introduced into the equation of Ridge Regression to reduce this
variance. This is a powerful regression method where the model is less susceptible
to overfitting.
 Below is the equation used to denote the Ridge Regression, where the
introduction of λ (lambda) addresses the problem of multicollinearity:
β = (X^{T}X + λ*I)^{-1}X^{T}y
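
The ridge equation above translates almost directly into NumPy. The sketch below is a minimal illustration, assuming X holds one row per sample and one column per feature; the function name ridge_coefficients and the nearly collinear toy data are made up. It solves the linear system instead of forming the inverse explicitly, which gives the same β.

```python
import numpy as np


def ridge_coefficients(X, y, lam):
    """Solve (X^T X + lam * I) beta = X^T y for beta."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n_features), X.T @ y)


# Made-up data whose two columns are almost identical (highly collinear).
X = np.array([[1.0, 1.0],
              [2.0, 2.001],
              [3.0, 2.999],
              [4.0, 4.0]])
y = np.array([2.0, 4.0, 6.0, 8.0])

beta_ridge = ridge_coefficients(X, y, lam=1.0)
print(beta_ridge)
```

Adding λ to the diagonal keeps the matrix well conditioned even when the columns of X are nearly collinear.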


6. Bayesian Linear Regression


 Bayesian Regression is one of the types of regression in machine learning that
uses Bayes' theorem to find the values of the regression coefficients. In this
method, the posterior distribution of the coefficients is determined instead of
finding the least-squares estimate. Bayesian Linear Regression resembles both Linear
Regression and Ridge Regression but is more stable than simple Linear
Regression.

7. Lasso Regression:
 Lasso Regression is a type of regression in machine learning that performs
regularization along with feature selection. It penalizes the absolute size of the
regression coefficients. As a result, some coefficient values are driven exactly to
zero, which does not happen in the case of Ridge Regression.
 Due to this, feature selection is built into Lasso Regression, which allows
selecting a set of features from the dataset to build the model. In the case of
Lasso Regression, only the required features are used, and the coefficients of the
other ones are made zero. This helps in avoiding overfitting in the model. In case
the independent variables are highly collinear, Lasso regression picks only one
variable and shrinks the coefficients of the other variables to zero.
 Application: Lasso regression algorithms have been widely used in financial
networks and economics. In finance, lasso is applied in forecasting
probabilities of default, and Lasso-based forecasting models are used in assessing
enterprise-wide risk frameworks. It is also used in stress test platforms
to analyze multiple stress scenarios.
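
Why Lasso produces exact zeros while Ridge does not can be seen in a simplified special case that is not taken from these notes: when the feature columns are orthonormal, and the objectives are written as (1/2)||y - Xb||^2 + λ||b||_1 for Lasso and (1/2)||y - Xb||^2 + (λ/2)||b||^2 for Ridge, each regularized coefficient is a simple function of the ordinary least squares coefficient. The function names and the coefficient values below are hypothetical.

```python
import numpy as np


def lasso_shrink(b_ols, lam):
    """Soft-thresholding: move each coefficient towards zero by lam, clipping at zero."""
    return np.sign(b_ols) * np.maximum(np.abs(b_ols) - lam, 0.0)


def ridge_shrink(b_ols, lam):
    """Proportional shrinkage: every coefficient is scaled by 1 / (1 + lam)."""
    return b_ols / (1.0 + lam)


# Hypothetical least squares coefficients (illustrative only).
b_ols = np.array([3.0, -0.4, 0.2, -2.0])

print(lasso_shrink(b_ols, 0.5))  # small coefficients become exactly zero
print(ridge_shrink(b_ols, 0.5))  # all coefficients shrink but stay non-zero
```

Outside this orthonormal special case there is no closed form for Lasso and an iterative solver is needed, but the qualitative behaviour is the same.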

8. Decision Tree Regression


 Decision trees can be used for classification as well as regression. In decision
trees, at each level, we need to identify the splitting attribute. In the case of
regression, the ID3 algorithm can be used to identify the splitting node
by reducing the standard deviation (in classification, information gain is used).
 A decision tree is built by partitioning the data into subsets containing instances
with similar values (homogeneous). Standard deviation is used to calculate the
homogeneity of a numerical sample. If the numerical sample is completely
homogeneous, its standard deviation is zero.
 The steps for finding the splitting node are briefly described below:
 Calculate the standard deviation of the target variable.


 Split the dataset on different attributes and calculate the standard deviation
for each branch (standard deviation for target and predictor). This value is
subtracted from the standard deviation before the split. The result is the
standard deviation reduction. SDR(T,X) = S(T) – S(T,X)
 The attribute with the largest standard deviation reduction is chosen as the
splitting node.
 The dataset is divided based on the values of the selected attribute. This
process is run recursively on the non-leaf branches until all data is processed.
 To avoid overfitting, the Coefficient of Variation (CV) is used to decide when
to stop branching. Finally, the average of each branch is assigned to the related
leaf node (in regression the mean is taken, whereas in classification the mode of
the leaf node is taken).
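
The standard deviation reduction step can be sketched as follows. Two assumptions are made here that the notes do not spell out: S(T,X) is taken as the average of the branch standard deviations weighted by branch size, and the population standard deviation is used. The function name and the tiny dataset are hypothetical.

```python
import statistics
from collections import defaultdict


def standard_deviation_reduction(targets, attribute_values):
    """SDR(T, X) = S(T) - S(T, X) for one candidate splitting attribute."""
    total = len(targets)
    s_before = statistics.pstdev(targets)

    # Group the target values by the value of the candidate attribute.
    branches = defaultdict(list)
    for value, target in zip(attribute_values, targets):
        branches[value].append(target)

    # Weighted average of the standard deviation inside each branch.
    s_after = sum(len(b) / total * statistics.pstdev(b) for b in branches.values())
    return s_before - s_after


# Made-up target values and two candidate attributes.
targets = [10, 12, 30, 32]
attr_good = ["a", "a", "b", "b"]  # separates low targets from high targets
attr_poor = ["a", "b", "a", "b"]  # mixes low and high targets in each branch

print(standard_deviation_reduction(targets, attr_good))
print(standard_deviation_reduction(targets, attr_poor))
```

The attribute with the larger reduction (attr_good here) would be chosen as the splitting node.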

9. Random Forest Regression


 Random forest is an ensemble approach where we take into account the
predictions of several decision regression trees.
1. Select K random data points from the training set.
2. Identify n, where n is the number of decision tree regressors to be created.
Repeat step 1 and build a regression tree on the selected points until n trees
have been created.
3. The average of each branch is assigned to the leaf node in each decision tree.
4. To predict the output for a new data point, the average of the predictions of all
decision trees is taken.
 Random Forest reduces overfitting (common in decision trees) by creating
random subsets of the features and building smaller trees using these subsets.

10. Support Vector Machines:


 Support Vector Machine (SVM) is another powerful algorithm with strong
theoretical foundations based on the Vapnik-Chervonenkis theory, as described in the
Oracle documentation. This supervised machine learning algorithm has strong
regularization and can be leveraged for both classification and regression
challenges. SVMs are characterized by the usage of kernels, the sparseness of the
solution and the capacity control gained by acting on the margin, or on the number
of support vectors, etc. The capacity of the system is controlled by parameters
that do not depend on the dimensionality of the feature space. Since the SVM
algorithm operates natively on numeric attributes, it uses a z-score normalization
on numeric attributes. In regression, Support Vector Machine algorithms use an
epsilon-insensitive (margin of tolerance) loss function.


Fig 3.11 Bayesian Linear Regression

 Application: Support vector machine regression algorithm has found several
applications in the oil and gas industry, classification of images and text and
hypertext categorization.
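
The epsilon-insensitive loss used in support vector regression can be written in one line: errors smaller than the tolerance epsilon cost nothing, and larger errors are penalized only by the amount that exceeds epsilon. The function name and the default epsilon of 0.1 below are assumptions for illustration.

```python
def epsilon_insensitive_loss(y_true: float, y_pred: float, epsilon: float = 0.1) -> float:
    """Loss is zero inside the tolerance band, linear outside it."""
    return max(0.0, abs(y_true - y_pred) - epsilon)


print(epsilon_insensitive_loss(3.0, 3.05))  # inside the band: no penalty
print(epsilon_insensitive_loss(3.0, 3.5))   # outside the band: penalized
```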

PROBLEMS:
Problem 1
Derive the linear regression equations for the following to find the value of y.
a. The Best fit line goes through (0,40) and (10,35) and x = 30
b. The Best fit line goes through (0,5) and (1,6.5) and x = 2.5

a. The best fit line goes through (0,40) and (10,35) and x = 30
Step 1: Find the slope.
This line goes through (0,40) and (10,35), so the slope is (35-40)/(10-0) = -1/2
Step 2: Find the y-intercept.
We can see that the line passes through (0,40), so the y-intercept is 40.
Step 3: Write the equation in the form y = mx + b.
So the equation is y = -0.5x + 40
If x = 30, then
y = −0.5x + 40 = (−0.5)(30) + 40 = −15 + 40 = 25
So the linear regression equation is y = -0.5x + 40 and y = 25.

b. The best fit line goes through (0,5) and (1,6.5) and x = 2.5
Step 1: Find the slope.
This line goes through (0,5) and (1,6.5), so the slope is (6.5-5)/(1-0) = 1.5
Step 2: Find the y-intercept.
We can see that the line passes through (0,5), so the y-intercept is 5.
Step 3: Write the equation in the form y = mx + b.
The equation that best describes the model is y = 1.5x + 5
If x = 2.5, then
y = 1.5x + 5 = (1.5)(2.5) + 5 = 3.75 + 5 = 8.75
So the linear regression equation is y = 1.5x + 5 and y = 8.75.
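
Both parts of Problem 1 follow the same three steps, which can be checked with a short Python sketch. The helper names line_through and predict are hypothetical.

```python
def line_through(p1, p2):
    """Return (slope, intercept) of the line through two points."""
    (x1, y1), (x2, y2) = p1, p2
    slope = (y2 - y1) / (x2 - x1)
    intercept = y1 - slope * x1
    return slope, intercept


def predict(slope, intercept, x):
    """Evaluate y = slope * x + intercept."""
    return slope * x + intercept


# Part a: line through (0, 40) and (10, 35), evaluated at x = 30.
m_a, b_a = line_through((0, 40), (10, 35))
print(m_a, b_a, predict(m_a, b_a, 30))

# Part b: line through (0, 5) and (1, 6.5), evaluated at x = 2.5.
m_b, b_b = line_through((0, 5), (1, 6.5))
print(m_b, b_b, predict(m_b, b_b, 2.5))
```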


Problem 2
Resort A charges $55.50 per night, plus a one-time surcharge of $100. Resort
B charges $65.50 per night, plus a one-time surcharge of $50.
A) State the dependent variable and write a Regression Equation.
B) State the independent variable.
C) After how many days is the cost the same?
D) Examine the total cost at that point?

Simple Linear Regression is often used to examine the relationship between dependent
and independent variable. X is independent variable and y is dependent variable. Simple
linear regression can be written in the form of y = m(x) + b. Here m represents the
value of the slope and b represents the value of the y-intercept.

A) In this scenario the dependent variable is equal to the total cost of the stay at a
resort. The regression equation associated with the cost at resort A can be written as
follows, based on the form given above:
y = m(x) + b
Total cost of the stay in a resort = (The cost per one night stay at the resort)(The
number of days stayed at a resort) + The value of the one-time surcharge associated
with a resort. Y is the dependent variable.

B) The independent variable in this scenario is the number of nights stayed at a resort.

C) The regression equation representing resort A can be written as follows:
Total cost of the stay in resort A = ($55.50)(The number of days stayed at a resort) + $100
The regression equation representing resort B can be written as follows:
Total cost of the stay in resort B = ($65.50)(The number of days stayed at a resort) + $50
To find the point at which the cost is the same for both resorts, set the right-hand
sides of the two equations equal to one another. To keep the algebra simple, we
write y for the dependent variable and x for the independent variable:
(55.50)(x) + 100 = (65.50)(x) + 50
(55.50)(x) + 50 = (65.50)(x)
50 = 10(x)
5 = x
As the value of x is equal to 5, the cost to stay at both resorts is the same after
5 days.


D) To find the total cost after 5 days, simply select either the regression formula for
resort A or resort B and solve for the value of y. In this example, we will use the
equation from resort A:
y = (55.50)(x) + 100 = (55.50)(5) + 100 = 277.50 + 100 = 377.50
The total cost for 5 nights in both resorts is equal to $377.50.
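
Parts C and D can be verified with a small Python sketch. The function names total_cost and break_even_nights are hypothetical; the rates and surcharges are the ones given in the problem statement.

```python
def total_cost(rate_per_night, surcharge, nights):
    """y = m * x + b, with m the nightly rate and b the one-time surcharge."""
    return rate_per_night * nights + surcharge


def break_even_nights(rate_a, surcharge_a, rate_b, surcharge_b):
    """Solve rate_a * x + surcharge_a = rate_b * x + surcharge_b for x."""
    return (surcharge_a - surcharge_b) / (rate_b - rate_a)


nights = break_even_nights(55.50, 100, 65.50, 50)
print(nights)                             # number of nights with equal cost
print(total_cost(55.50, 100, nights))     # cost at resort A
print(total_cost(65.50, 50, nights))      # cost at resort B
```

Substituting into resort B's equation gives the same total as resort A's, which confirms the break-even point.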

Problem 3:
The data set given below has the ratings of 2 movies by 6 people.
Person   Xi = Rating for the movie       Yi = Rating for the movie
         Bahubali-I by the i-th person   Bahubali-II by the i-th person
1        4                               3
2        2                               4
3        3                               2
4        5                               5
5        1                               3
6        3                               1
i) Find the values of b0 & b1 and the regression equation w.r.t. the linear regression
model which best fits the given data.
ii) Find the regression line that fits best for the given sample data.
iii) If a new person rates Bahubali-I as 3, then predict the rating of the same person
for "Bahubali-II".
Sample Number   x     y     x-x̄   y-ȳ   (x-x̄)^2   (x-x̄)(y-ȳ)
1               4     3     1     0     1         0
2               2     4     -1    1     1         -1
3               3     2     0     -1    0         0
4               5     5     2     2     4         4
5               1     3     -2    0     4         0
6               3     1     0     -2    0         0
∑               18    18                10        3
Mean            x̄=3   ȳ=3

i) Find the values of b0 & b1 w.r.t. the linear regression model which best fits the
given data.
Y = b0 + b1.x
b0 is the intercept, b1 is the slope.
b1 = ∑(x-x̄)(y-ȳ) / ∑(x-x̄)^2 = 3/10 = 0.3
Y = b0 + b1.x (for x and Y, put the mean values x̄ and ȳ, which are 3 and 3)
3 = b0 + (0.3)(3)
b0 = 3 - 0.9
b0 = 2.1, and b1 = 0.3


Y = b0 + b1.x; Y = 2.1 + 0.3x (Regression Equation)
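
The hand calculation for Problem 3 can be reproduced in Python. The sketch below uses the ratings from the table and the same least squares formulas for b1 and b0; the function name fit_line is hypothetical.

```python
def fit_line(xs, ys):
    """Least squares estimates (b0, b1) for Y = b0 + b1 * x."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    # Sum of products of deviations and sum of squared x deviations.
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    b1 = sxy / sxx
    b0 = y_mean - b1 * x_mean
    return b0, b1


x = [4, 2, 3, 5, 1, 3]  # ratings for Bahubali-I
y = [3, 4, 2, 5, 3, 1]  # ratings for Bahubali-II

b0, b1 = fit_line(x, y)
print(round(b0, 2), round(b1, 2))  # intercept and slope
print(round(b0 + b1 * 3, 2))       # predicted Bahubali-II rating when x = 3
```

The rounded values agree with b0 = 2.1 and b1 = 0.3 obtained above.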


ii) Find the regression line that fits best for the given sample data.
The regression line does not have to pass through all the points; it is placed so that
the total squared vertical distance of the points from the line is minimum. The
regression line, i.e., the line of best fit, always passes through the point of mean
values (x̄, ȳ) and through the y-intercept (0, b0). Here the line passes through (0, 2.1)
and (3, 3), and it is called the line of best fit.

Fig: Line of Best Fit (X-Axis vs Y-Axis)

iii) If a new person rates Bahubali-I as 3, predict the rating of the same person for
"Bahubali-II".
Y = b0 + b1.x
Y = 2.1 + (0.3)(3) = 3
If a new person rates Bahubali-I as 3, that person is predicted to rate Bahubali-II as 3 as well.
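
The hand calculation in Problem 3 can be reproduced with a short Python sketch. The function name fit_simple_linear_regression is chosen here for illustration; the data are the six ratings from the table, and the formulas are the same slope and intercept formulas used above.

```python
# Ratings from Problem 3: x = Bahubali-I, y = Bahubali-II
x = [4, 2, 3, 5, 1, 3]
y = [3, 4, 2, 5, 3, 1]

def fit_simple_linear_regression(xs, ys):
    """Return (b0, b1) for the least-squares line y = b0 + b1 * x."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    # Numerator and denominator of the slope formula
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(xs, ys))
    sxx = sum((xi - x_mean) ** 2 for xi in xs)
    b1 = sxy / sxx
    b0 = y_mean - b1 * x_mean   # the line passes through the means
    return b0, b1

b0, b1 = fit_simple_linear_regression(x, y)
predicted = b0 + b1 * 3         # new person rates Bahubali-I as 3

print(round(b0, 2), round(b1, 2), round(predicted, 2))
```

The rounded values agree with the worked solution: b0 = 2.1, b1 = 0.3 and a predicted rating of 3.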

3.2 SUPPORT VECTOR MACHINE:


Support Vector Machine, abbreviated as SVM, is a powerful supervised algorithm that works
best on small but complex datasets. It can be used for both regression and classification
tasks, but it generally works best in classification problems. In SVM, we try to find the
hyperplane that best separates the two classes. Both SVM and logistic regression try to
find the best hyperplane; the main difference is that logistic regression is a
probabilistic approach, whereas SVM is based on a geometric, margin-maximising approach.
SVM chooses the hyperplane with the maximum margin, that is, the maximum distance between
the hyperplane and the nearest points of the two classes.
Depending on the number of features, we can choose either logistic regression or SVM. It
is usually advisable to first use logistic regression and see how it performs; if it fails
to give good accuracy, you can go for SVM without any kernel. Logistic regression and SVM
without any kernel have similar performance but, depending on the features, one may be
more efficient than the other.

Types of Support Vector Machine (SVM) Algorithms


 Linear SVM: Linear SVM is used when the data is linearly separable, which means that
the data points can be classified into 2 classes by using a single straight line (if 2D).
 Non-Linear SVM: When the data is not linearly separable, i.e., the data points cannot be
separated into 2 classes by using a straight line (if 2D), we use Non-Linear SVM, which
relies on advanced techniques like the kernel trick to classify them. In most real-world
applications we do not find linearly separable data points, hence we use the kernel
trick to solve them.
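
To give a feel for the kernel trick mentioned above, the following sketch compares two ways of computing the same quantity. The degree-2 polynomial kernel and the feature map phi are standard textbook choices picked here as an assumption; the notes do not name a specific kernel. The point is that the kernel value computed in the original 2D space equals a dot product in a higher-dimensional space, without ever building that space explicitly.

```python
import math

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def phi(p):
    """Explicit degree-2 feature map for a 2-D point (x1, x2)."""
    x1, x2 = p
    return (x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2)

def poly_kernel(p, q):
    """Degree-2 polynomial kernel, computed directly in the original 2-D space."""
    return dot(p, q) ** 2

p, q = (1.0, 2.0), (3.0, 0.5)
explicit = dot(phi(p), phi(q))   # dot product in the 3-D feature space
via_kernel = poly_kernel(p, q)   # same value without mapping the points

print(explicit, via_kernel)
```

Data that cannot be separated by a straight line in 2D may become separable by a plane in the mapped space, and the kernel lets SVM work there at the cost of a 2D computation.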

The main terms used are :


 Support Vectors: These are the points that are closest to the hyperplane. The
separating line is defined with the help of these data points.
 Margin: It is the distance between the hyperplane and the observations closest to
the hyperplane (support vectors). In SVM, a large margin is considered a good margin.
There are two types of margins: hard margin and soft margin.

Fig 3.12 Margins in SVM

How Does Support Vector Machine Work?


SVM is defined in terms of the support vectors only: the margin is determined by the
points which are closest to the hyperplane (support vectors), whereas in logistic
regression the classifier is defined over all the points. Hence SVM enjoys some natural
speed-ups. The working of SVM is illustrated using an example. Suppose we have a dataset
that has two classes (green and blue). We want to classify a new data point as either
blue or green, as shown in Fig 3.13a.
To classify these points, we can have many decision boundaries, but the question is
which is the best and how do we find it? NOTE: Since we are plotting the data points
in a 2-dimensional graph, we call this decision boundary a straight line, but if we have
more dimensions, we call this decision boundary a "hyperplane", as shown in Fig 3.13b.

The best hyperplane is the one that has the maximum distance from both the classes, and
finding it is the main aim of SVM. This is done by considering different hyperplanes
which classify the labels correctly and then choosing the one which is farthest from the
data points, i.e., the one which has the maximum margin, as shown in Fig 3.13c.

Fig 3.13a Points to classify    Fig 3.13b Use of straight line    Fig 3.13c Use of hyperplanes
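
The idea of choosing the boundary with the maximum margin can be sketched as follows. The points and the three candidate lines are hypothetical and hand-picked for illustration; a real SVM solves an optimisation problem rather than comparing a few candidates.

```python
import math

# Hypothetical 2-D training points: label -1 (green) and +1 (blue).
points = [((1.0, 1.0), -1), ((2.0, 1.5), -1), ((1.5, 2.0), -1),
          ((4.0, 4.0), +1), ((5.0, 4.5), +1), ((4.5, 5.0), +1)]

# Candidate decision boundaries w1*x1 + w2*x2 + b = 0, chosen by hand.
candidates = {
    "A": ((1.0, 0.0), -3.0),    # vertical line x1 = 3
    "B": ((1.0, 1.0), -5.75),   # diagonal line x1 + x2 = 5.75
    "C": ((0.0, 1.0), -2.5),    # horizontal line x2 = 2.5
}

def margin(w, b, data):
    """Smallest signed distance of any point to the line (negative if a point is misclassified)."""
    norm = math.hypot(w[0], w[1])
    return min(label * (w[0] * x1 + w[1] * x2 + b) / norm
               for (x1, x2), label in data)

margins = {name: margin(w, b, points) for name, (w, b) in candidates.items()}
best = max(margins, key=margins.get)   # boundary farthest from its nearest points

print(margins, best)
```

All three lines separate the two classes, but line B leaves the largest gap to its nearest points, so it is the one a maximum-margin classifier would prefer among these candidates.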

Advantages of SVM
 SVM works better when the data is linearly separable
 It is more effective in high dimensions
 With the help of the kernel trick, we can solve many complex, non-linear problems
 SVM is relatively insensitive to outliers
 Can help us with image classification

Uses of SVM :
 SVMs are used in applications like handwriting recognition, intrusion detection, face
detection, email classification, gene classification, and web page classification. This
is one of the reasons we use SVMs in machine learning. SVM can handle both classification
and regression on linear and non-linear data.

Disadvantages of SVM
 Choosing a good kernel is not easy
 It doesn't show good results on a big dataset
 The SVM hyperparameters are cost (C) and gamma. It is not easy to fine-tune
these hyperparameters, and it is hard to visualize their impact

3.3 ENSEMBLE LEARNING


Ensemble learning is a supervised learning technique used in machine learning to
improve overall performance by combining the predictions from multiple models into one
predictive model. It is one of the most powerful machine learning techniques: it uses the
combined output of two or more models (weak learners) to solve a particular computational
intelligence problem. E.g., a Random Forest algorithm is an ensemble of various decision
trees combined. Ensembling is nothing but the technique of combining several individual
predictive models to come up with the final predictive model. Ensemble learning is
primarily used to improve model performance in classification and regression problems.
In simple words, we can summarise ensemble learning as follows: "An ensembled model is a
machine learning model that combines the predictions from two or more models."

Fig 3.14 Overview of Ensemble Technique

Here we have M1 to Mn individual models which are giving predictions from Pred 1 to
Pred n respectively. And we combine them to come up with a single model, and this is
where the art of ensembling comes into the picture.

Principle of ensemble
The basic principle of ensemble learning is to combine a number of weak learners into a
strong learner. There are two main types of ensemble methods: bagging and boosting. In
bagging (bootstrap aggregating), multiple models are trained on different random subsets
of the training data, drawn with replacement. In boosting, the models are trained
sequentially so that each one learns from the mistakes of the preceding one.

Benefits of ensemble learning


By leveraging the diverse strengths of different models, ensemble learning aims to
mitigate errors, enhance performance, and increase the overall robustness of
predictions, leading to improved results across various tasks in machine learning and
data analysis.

Major deficiency of any ensemble method


A serious deficiency of current ensemble methods is the lack of comprehensibility, i.e.,
the knowledge learned by ensembles is not understandable to the user. Improving the
comprehensibility of ensembles [15] is an important yet largely understudied direction.

Advantage of ensemble learning


In learning models, noise, variance, and bias are the major sources of error. The
ensemble methods in machine learning help minimize these error-causing factors,
thereby ensuring the accuracy and stability of machine learning (ML) algorithms.

Disadvantages of Ensemble learning


 Ensembling is less interpretable, the output of the ensembled model is hard to
predict and explain.
 The art of ensembling is hard to learn and any wrong selection can lead to lower
predictive accuracy than an individual model.
 Ensembling is expensive in terms of both time and space.

When to use ensemble learning?


 You can employ ensemble learning techniques when you want to improve the
performance of machine learning models, for example, to increase the accuracy
of classification models or to reduce the mean absolute error of regression
models. Ensembling also results in a more stable model.
 When your model is overfitting on the training set, you can also employ
ensemble learning methods to create a more complex, combined model. The models in
the ensemble would then improve performance on the dataset by combining their
predictions.

Simple Ensemble Techniques:


1. Voting
2. Averaging
3. Weighted Averaging

1. Voting:
The voting method is generally used for classification problems. In this technique,
multiple models are used to make predictions for each data point. The prediction by
each model is considered as a 'vote'. The prediction which we get from the majority
of the models is used as the final prediction.

For example, when you ask 5 of your colleagues to rate your movie (out of 5), assume
three of them rated it as 4, while two of them rated it as 5. Since the majority gave a
rating of 4, the final rating will be taken as 4. You can consider this as taking the
mode of all the predictions.
The result of max voting would be something like this:
Colleague 1   Colleague 2   Colleague 3   Colleague 4   Colleague 5   Final rating
5             4             5             4             4             4

Voting can be used for classification or regression. In the case of regression, this
involves calculating the average of the predictions from the models. In the case of
classification, the predictions for each label are summed and the label with the
majority vote is predicted.

 Regression Voting Ensemble: Predictions are the average of contributing models.


 Classification Voting Ensemble: Predictions are the majority vote of contributing
models.

Ensemble voting offers several benefits:


 Improved Accuracy: It leverages the collective knowledge and diversity of multiple
models to make more accurate predictions than any individual model alone. It can
reduce bias and variance and achieve better overall performance.
 Robustness: This technique can be more robust to noisy or incorrect predictions
from individual models. It reduces the impact of individual model errors or biases
by considering multiple viewpoints.
 Model Combination: It allows for the combination of different types of models,
each with its strengths and weaknesses. By aggregating their predictions, ensemble
voting can take advantage of the complementary aspects of different models.
 Interpretability: It can provide insights into the relative importance and
agreement among different models. It allows for analyzing the patterns and
consistency of predictions across the ensemble, aiding in model interpretability.

Voting Classifier supports two types of voting.


Hard Voting: In hard voting, the predicted output class is the class with the highest
number of votes, i.e., the class which was predicted most often by the individual
classifiers.
Example 1: Suppose three classifiers predicted the output classes (A, A, B), so here the
majority predicted A as output. Hence A will be the final prediction.
Example 2: When 5 people are asked to rate a movie (out of 5), we'll assume three
of them rated it as 4 while two of them gave it a 5. Since the majority gave a rating of
4, the final rating will be taken as 4. You can consider this as taking the mode of all the
predictions.
Example 3:
Consider the car analogy: if a car model has to be chosen between, let's say, Model A and
Model B, and multiple people are recommending Model A over Model B, I'll go with Model A.

Soft Voting: In soft voting, the output class is the prediction based on the average of
the probabilities given to that class. Suppose, given some input to three models, the
prediction probabilities for class A = (0.30, 0.47, 0.53) and B = (0.20, 0.32, 0.40). So
the average for class A is 0.4333 and for B is 0.3067, and the winner is clearly class A
because it had the highest probability averaged over the classifiers.
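
Hard and soft voting as described above can be sketched in plain Python. The function names hard_vote and soft_vote are illustrative only; the inputs are taken from the examples in this section.

```python
from collections import Counter

def hard_vote(predicted_labels):
    """Return the label predicted by the most models (the mode)."""
    return Counter(predicted_labels).most_common(1)[0][0]

def soft_vote(class_probabilities):
    """Average each class's probability over the models; return (winner, averages)."""
    averages = {label: sum(probs) / len(probs)
                for label, probs in class_probabilities.items()}
    winner = max(averages, key=averages.get)
    return winner, averages

# Hard voting: Example 1 (classes) and Example 2 (movie ratings)
print(hard_vote(["A", "A", "B"]))
print(hard_vote([5, 4, 5, 4, 4]))

# Soft voting: probabilities given by three models for classes A and B
winner, averages = soft_vote({"A": [0.30, 0.47, 0.53], "B": [0.20, 0.32, 0.40]})
print(winner, {label: round(value, 4) for label, value in averages.items()})
```

Hard voting only looks at the predicted labels, while soft voting uses how confident each model is, so the two methods can disagree when a minority of models is very confident.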

2. Averaging:
It is mainly used for regression problems. In the simple averaging method, we take the
average of the predictions from all the models for every instance of the test dataset
and use it as the final prediction. Averaging can be used for making predictions in
regression problems or while calculating probabilities for classification problems. This
method often reduces overfitting and creates a smoother regression model. The method
consists of building multiple models independently and returning the average of the
predictions of all the models. In general, the combined output is better than an
individual output because variance is reduced.
Example 1: In the below case, the averaging method would take the average of all the
values, i.e., (5+4+5+4+4)/5 = 4.4
Colleague 1   Colleague 2   Colleague 3   Colleague 4   Colleague 5   Final rating
5             4             5             4             4             4.4

Example 2: M1, M2, and M3 are three different models which give us individual
predictions. To make the final prediction, the average of the individual predictions is
taken.

3. Weighted Averaging
This is an extension of the averaging method. All models are assigned different
weights defining the importance of each model for prediction. For instance, if two of
your colleagues are critics, while the others have no prior experience in this field,
then the answers by these two colleagues are given more importance as compared to the
other people. The prediction of each model is multiplied by its weight and the products
are then summed (the weights add up to 1).
The result is calculated as [(5*0.23) + (4*0.23) + (5*0.18) + (4*0.18) + (4*0.18)] = 4.41.
          Colleague 1   Colleague 2   Colleague 3   Colleague 4   Colleague 5   Final rating
Weight    0.23          0.23          0.18          0.18          0.18
Rating    5             4             5             4             4             4.41
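
The two averaging examples can be verified with a small sketch. The function names are illustrative; the ratings and weights are those from the tables above. Dividing by the total weight is an added safeguard, assumed here so that the function also works when the weights do not add up to exactly 1.

```python
ratings = [5, 4, 5, 4, 4]
weights = [0.23, 0.23, 0.18, 0.18, 0.18]

def simple_average(predictions):
    """Plain mean of the model predictions."""
    return sum(predictions) / len(predictions)

def weighted_average(predictions, weights):
    """Weighted sum of the predictions, normalised by the total weight."""
    total_weight = sum(weights)
    return sum(p * w for p, w in zip(predictions, weights)) / total_weight

print(round(simple_average(ratings), 2))
print(round(weighted_average(ratings, weights), 2))
```

With equal weights the weighted average reduces to the simple average, so simple averaging is the special case in which every model is trusted equally.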

Rank Averaging: This is used when the weights are not known.

Each model is ranked by its performance. Rank 1 is given to the worst-performing model,
rank 2 to the next best model, and rank 3 to the best of the three models. The weights
are then derived from these ranks: the ranks are summed up, and each rank is divided by
that total value.

For example, Model 2 has rank 1, so we divide 1 by the sum 1+2+3 = 6, and the weight for
Model 2 comes to 1/6 ≈ 0.17. Similarly, the weights for the other models are calculated
(2/6 ≈ 0.33 for Model 1 and 3/6 = 0.5 for Model 3), and the predictions of each model are
multiplied by its weight. So, all the predicted values of Model 1 get multiplied by 0.33,
and those of M2 and M3 are multiplied by 0.17 and 0.5 respectively. Finally, all of these
weighted values are summed up to give the final prediction.
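
A minimal sketch of rank averaging is given below. The validation scores and the predictions are hypothetical numbers invented for illustration; only the ranking rule (rank 1 for the worst model, weights equal to rank divided by the sum of ranks) comes from the description above.

```python
# Hypothetical validation scores (higher is better) and predictions for one data point.
scores = {"M1": 0.80, "M2": 0.70, "M3": 0.90}
predictions = {"M1": 10.0, "M2": 12.0, "M3": 9.0}

# Rank 1 goes to the worst model, the highest rank to the best model.
ordered = sorted(scores, key=scores.get)
ranks = {name: position for position, name in enumerate(ordered, start=1)}

# Each weight is the model's rank divided by the sum of all ranks.
rank_total = sum(ranks.values())
weights = {name: rank / rank_total for name, rank in ranks.items()}

# Final prediction: sum of the weighted individual predictions.
final_prediction = sum(weights[name] * predictions[name] for name in predictions)

print(ranks, {name: round(w, 2) for name, w in weights.items()}, round(final_prediction, 2))
```

With these scores, M2 is the worst model and receives weight 1/6, M1 receives 2/6 and M3 receives 3/6, matching the weights in the example above.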

Advanced Ensemble techniques


There are three main advanced ensemble learning methods in machine learning. These are
as follows:
 Bagging
 Boosting
 Stacking
1. Bagging
Bagging (bootstrap aggregating) is an ensemble method that involves training multiple
models independently on random subsets of the data, and aggregating their predictions
through voting or averaging. It is a method of ensemble modeling, which is primarily
used to solve supervised machine learning problems. It is generally completed in two
steps as follows:
 Bootstrapping: It is a random sampling method that is used to derive samples
from the data with replacement. In this method, first, random data samples are
fed to the primary model, and then a base learning algorithm is run on the
samples to complete the learning process.

 Aggregation: This is a step that involves the process of combining the output of
all base models and, based on their output, predicting an aggregate result with
greater accuracy and reduced variance.

Fig 3.15 Bagging

In detail, each model is trained on a random subset of the data sampled with
replacement, meaning that the individual data points can be chosen more than once.
This random subset is known as a bootstrap sample. By training models on different
bootstraps, bagging reduces the variance of the individual models. It also avoids
overfitting by exposing the constituent models to different parts of the dataset. It
makes the model more robust and accurate, especially in cases where the individual
models are prone to high variability.
The predictions from all the sampled models are then combined through simple averaging
(or voting) to make the overall prediction. This way, the aggregated model incorporates
the strengths of the individual ones and cancels out their errors.
Let's break it down step by step:
Original training dataset: [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
Resampled training set 1: [2, 3, 3, 5, 6, 1, 8, 10, 9, 1]
Resampled training set 2: [1, 1, 5, 6, 3, 8, 9, 10, 2, 7]
Resampled training set 3: [1, 5, 8, 9, 2, 10, 9, 7, 5, 4]
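
Resampled training sets like the ones above can be generated with the standard random module. This is a sketch only: the function name bootstrap_sample and the fixed seed are choices made here for illustration, and the exact samples drawn depend on the seed.

```python
import random

def bootstrap_sample(data, rng):
    """Draw len(data) items from data uniformly at random, with replacement."""
    return [rng.choice(data) for _ in data]

rng = random.Random(42)   # fixed seed so that repeated runs give the same samples
original = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
resamples = [bootstrap_sample(original, rng) for _ in range(3)]

for i, sample in enumerate(resamples, start=1):
    print(f"Resampled training set {i}: {sample}")
```

Because sampling is done with replacement, some values typically appear more than once in a resampled set while others are left out.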

Example: In the Random Forest method, predictions from multiple decision trees are
ensembled in parallel. In regression problems, we use the average of these predictions
to get the final output, whereas in classification problems, the mode (the majority
vote) is selected as the predicted class.

Implementation Steps of Bagging

Fig 3.16 Steps of Bagging


 Step 1: Multiple subsets with an equal number of tuples are created from the
original data set, selecting observations with replacement.
 Step 2: A base model is created on each of these subsets.
 Step 3: Each model is learnt in parallel on its own training set, independently of
the others.
 Step 4: The final predictions are determined by combining the predictions from
all the models.
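
The four steps above can be put together in a small regression sketch. Everything specific here is an assumption made for illustration: the base model is a simple 1-nearest-neighbour predictor (chosen because it has high variance), the data are made up, and the models are fitted one after another rather than truly in parallel.

```python
import random
from statistics import mean

def bootstrap_sample(rows, rng):
    """Step 1: a subset of the same size, drawn with replacement."""
    return [rng.choice(rows) for _ in rows]

def fit_nearest_neighbour(rows):
    """Step 2: base model that predicts the y of the training row whose x is closest."""
    def predict(x):
        return min(rows, key=lambda row: abs(row[0] - x))[1]
    return predict

def bagging_fit(rows, n_models, seed=0):
    rng = random.Random(seed)
    # Step 3: each model is fitted on its own subset, independently of the others.
    return [fit_nearest_neighbour(bootstrap_sample(rows, rng)) for _ in range(n_models)]

def bagging_predict(models, x):
    # Step 4: combine the individual predictions by averaging (regression).
    return mean(model(x) for model in models)

# Hypothetical noisy data roughly following y = 2x.
data = [(1, 2.1), (2, 3.9), (3, 6.2), (4, 7.8), (5, 10.1), (6, 12.2)]
models = bagging_fit(data, n_models=25)
prediction = bagging_predict(models, 3.5)

print(round(prediction, 2))
```

For a classification problem, the averaging in Step 4 would be replaced by a majority vote over the predicted classes.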

Advantages of Bagging:
1. Reduces overfitting: It can reduce the chance of an overfit model, resulting in
improved model accuracy on unseen data.
2. Decreases model variance: Multiple models trained on different subsets of data
average out their predictions, leading to lower variance than a single model.
3. Improves stability: Changes in the training dataset have less impact on bagged
models, making the overall model more stable.
4. Handles high variability: Especially effective for algorithms like decision trees,
which tend to have high variance.
5. Parallelizable computation: Each model in the ensemble can be trained
independently, allowing for parallel processing and efficient use of computational
resources.
6. Easy to understand and implement: The concept behind bagging is
straightforward and can be implemented without complex modifications to the
learning algorithm.
7. Good with noisy data: The averaging process helps in reducing the noise in the
final prediction.
8. Handles imbalanced data: Bagging can help in scenarios where the dataset is
imbalanced, improving the performance of the model in such situations.

Disadvantages of Bagging:
1. Less flexible: As a method, bagging works particularly well with algorithms that
are less stable. Algorithms that are more stable, or that suffer from high amounts of
bias, do not gain as much benefit, as there is less variation among the models trained
on the resampled datasets. As noted in the Hands-On Guide to Machine Learning, bagging
a linear regression model will effectively just return the original predictions for a
large enough b (the number of bootstrap samples).
2. Loss of interpretability: It is difficult to draw precise business insights from
bagging because of the averaging involved across predictions. While the output is more
precise than any individual data point, a more accurate or complete dataset may yield
greater precision within a single classification or regression model.
3. Expensive for computation: Bagging slows down and grows more intensive as the
number of iterations increases. Accordingly, it is not well suited for real-time
applications. Clustered systems or a large number of processing cores are ideal for
quickly creating bagged ensembles on massive data sets.

Applications of Bagging Classifier


Bagging Classifier can be applied in various real-world tasks:
1. Fraud Detection: Bagging Classifier can be used to detect fraudulent transactions
by aggregating predictions from multiple fraud detection models.
2. Spam filtering: Bagging classifier can be used to filter spam emails by aggregating
predictions from multiple spam filters trained on different subsets of the spam
emails.
3. Credit scoring: Bagging classifier can be used to improve the accuracy of credit
scoring models by combining the predictions of multiple models trained on
different subsets of the credit data.
4. Image Classification: Bagging classifier can be used to improve the accuracy of
image classification tasks by combining the predictions of multiple classifiers
trained on different subsets of the training images.
5. Natural language processing: In NLP tasks, the bagging classifier can combine
predictions from multiple language models to achieve better text classification
results.

Random Forest Algorithm


Random Forest is a popular machine learning algorithm that belongs to the supervised
learning technique. It can be used for both Classification and Regression problems in
ML. It is based on the concept of ensemble learning, which is a process of combining
multiple classifiers to solve a complex problem and to improve the performance of the
model.
As the name suggests, "Random Forest is a classifier that contains a number of
decision trees on various subsets of the given dataset and takes the average to
improve the predictive accuracy of that dataset." Instead of relying on one decision
tree, the random forest takes the prediction from each tree and, based on the majority
vote of the predictions, predicts the final output.
A greater number of trees in the forest generally leads to higher accuracy and helps to
prevent the problem of overfitting.
The below diagram explains the working of the Random Forest algorithm:

Fig 3.17 Random Forest

Assumptions for Random Forest


Since the random forest combines multiple trees to predict the class of the dataset, it
is possible that some decision trees may predict the correct output, while others may
not. But together, all the trees predict the correct output. Therefore, below are two
assumptions for a better Random forest classifier:
 There should be some actual values in the feature variable of the dataset so
that the classifier can predict accurate results rather than a guessed result.
 The predictions from each tree must have very low correlations.

Why use Random Forest?


 It takes less training time as compared to other algorithms.
 It predicts output with high accuracy, even for the large dataset it runs
efficiently.
 It can also maintain accuracy when a large proportion of data is missing.

How does Random Forest algorithm work?


Random Forest works in two phases: the first is to create the random forest by combining
N decision trees, and the second is to make predictions with each tree created in the
first phase. The working process can be explained in the below steps and diagram:
Step-1: Select K random data points from the training set.
Step-2: Build the decision tree associated with the selected data points (subset).
Step-3: Choose the number N of decision trees that you want to build.
Step-4: Repeat Steps 1 and 2 until N trees have been built.
Step-5: For new data points, find the predictions of each decision tree, and assign the
new data points to the category that wins the majority of votes.
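
The steps above can be sketched with a deliberately simplified forest. Several things are assumptions made for illustration: each "tree" is only a one-split decision stump, each stump looks at a single randomly chosen feature, and the small fruit dataset (weight in grams, diameter in cm) is invented. A real Random Forest grows much deeper trees.

```python
import random
from collections import Counter

def majority(labels):
    return Counter(labels).most_common(1)[0][0]

def fit_stump(rows, feature):
    """One-split tree on one feature: pick the threshold with the fewest training errors."""
    best = None
    for threshold in sorted({features[feature] for features, _ in rows}):
        left = [label for features, label in rows if features[feature] <= threshold]
        right = [label for features, label in rows if features[feature] > threshold]
        left_label = majority(left)
        right_label = majority(right) if right else left_label
        errors = (sum(lab != left_label for lab in left)
                  + sum(lab != right_label for lab in right))
        if best is None or errors < best[0]:
            best = (errors, threshold, left_label, right_label)
    _, threshold, left_label, right_label = best
    return lambda features: left_label if features[feature] <= threshold else right_label

def fit_forest(rows, n_trees, k, seed=0):
    rng = random.Random(seed)
    n_features = len(rows[0][0])
    forest = []
    for _ in range(n_trees):                           # Steps 3-4: repeat for N trees
        subset = [rng.choice(rows) for _ in range(k)]  # Step 1: K random data points
        feature = rng.randrange(n_features)            # random feature for this tree
        forest.append(fit_stump(subset, feature))      # Step 2: build a tree on the subset
    return forest

def predict_forest(forest, features):
    return majority(tree(features) for tree in forest)  # Step 5: majority vote

rows = [((150, 7.0), "apple"), ((170, 7.5), "apple"), ((160, 7.2), "apple"),
        ((10, 2.0), "cherry"), ((12, 2.2), "cherry"), ((9, 1.8), "cherry")]
forest = fit_forest(rows, n_trees=15, k=len(rows))
print(predict_forest(forest, (155, 7.1)))
```

Individual stumps trained on an unlucky subset may be wrong, but the majority vote over many trees is far more reliable, which is the point of Step 5.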

Example: Suppose there is a dataset that contains multiple fruit images. So, this dataset is
given to the Random forest classifier. The dataset is divided into subsets and given to each
decision tree. During the training phase, each decision tree produces a prediction result, and
when a new data point occurs, then based on the majority of results, the Random Forest
classifier predicts the final decision. Consider the below image:

Fig 3.18 Example for Random Forest

Applications of Random Forest


There are mainly four sectors where Random Forest is mostly used:
1. Banking: Banking sector mostly uses this algorithm for the identification of loan risk.
2. Medicine: With the help of this algorithm, disease trends and risks of the
disease can be identified.
3. Land Use: We can identify the areas of similar land use by this algorithm.
4. Marketing: Marketing trends can be identified using this algorithm.

Advantages of Random Forest


 Random Forest is capable of performing both Classification and Regression
tasks.
 It is capable of handling large datasets with high dimensionality.
 It enhances the accuracy of the model and prevents the overfitting issue.

Disadvantages of Random Forest


 Although random forest can be used for both classification and regression
tasks, it is less suitable for regression tasks.

2. Boosting
Boosting is an ensemble method that enables each member to learn from the preceding
member's mistakes and make better predictions for the future. Unlike the bagging
method, in boosting, all base (weak) learners are arranged in a sequential format so
that they can learn from the mistakes of their preceding learner. In this way, the weak
learners are combined into a strong learner, giving a better predictive model with
significantly improved performance. In other words, boosting attempts to build a strong
classifier from a number of weak classifiers by building the weak models in series.
Firstly, a model is built from the training data. Then a second model is built which
tries to correct the errors present in the first model. This procedure is continued, and
models are added, until either the complete training data set is predicted correctly or
the maximum number of models has been added.

Fig. 3.19 Strong learner vs weak learner


Algorithm:
1. Initialise the dataset and assign equal weight to each data point.
2. Provide this as input to the model and identify the wrongly classified data points.
3. Increase the weight of the wrongly classified data points and decrease the
weights of correctly classified data points. And then normalize the weights of all
data points.
4. if (got required results)
Goto step 5
else
Goto step 2
5. End
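
Steps 1 to 3 of the algorithm can be illustrated with a short sketch of one re-weighting round. The scaling factor of 2 and the list of misclassified points are hypothetical values chosen here; specific boosting algorithms such as AdaBoost derive the factor from the error of the current model.

```python
def reweight(weights, is_wrong, factor=2.0):
    """Step 3: scale up weights of misclassified points, scale down the rest, then normalise."""
    updated = [w * factor if wrong else w / factor
               for w, wrong in zip(weights, is_wrong)]
    total = sum(updated)
    return [w / total for w in updated]

n_points = 5
weights = [1 / n_points] * n_points            # Step 1: equal weight for each data point
is_wrong = [False, True, False, False, True]   # Step 2: hypothetical misclassified points
weights = reweight(weights, is_wrong)

print([round(w, 4) for w in weights])
```

After normalisation the weights again add up to 1, and the two misclassified points carry more weight than the others, so the next model concentrates on them.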

Fig 3.20 Boosting

How does the Boosting Algorithm Work?


The basic principle behind the working of the boosting algorithm is to generate
multiple weak learners and combine their predictions to form one strong rule. These
weak rules are generated by applying base Machine Learning algorithms on different
distributions of the data set. These algorithms generate a weak rule in each iteration.
After multiple iterations, the weak learners are combined to form a strong learner
that will predict a more accurate outcome.
Step 1: The base algorithm reads the data and assigns equal weight to each sample
observation.
Step 2: False predictions made by the base learner are identified. In the next
iteration, the data is passed to the next base learner with a higher weightage on these
incorrect predictions.
Step 3: Repeat step 2 until the algorithm can correctly classify the output.
For example, suppose the following conditions are used to recognize whether an email is
spam or not:
 If an email contains lots of sources, that means it is spam.
 If an email contains only one image file, then it is spam.
 If an email contains the message "You won a lottery of $xxxxx," it is spam.
 If an email contains some known source, then it is not spam.
 If it contains the official domain like [Link], etc., it is not spam.
The rules mentioned above are not powerful enough on their own to recognize spam; hence
these rules are called weak learners. To convert the weak learners into a strong learner,
combine the predictions of the weak learners using one of the following methods:
1. Using the average or weighted average.
2. Taking the prediction that has the higher vote.
Consider the 5 rules mentioned above; there are 3 votes for spam and 2 votes for not
spam. Since spam has the higher vote, we consider the email to be spam.

Types of Boosting
Boosting methods are focused on iteratively combining weak learners to build a strong
learner that can predict more accurate outcomes. A weak learner classifies data only
slightly better than random guessing. This approach can give robust results on
prediction problems and, for some tasks, can outperform neural networks and support
vector machines. Boosting algorithms can differ in how they create and aggregate weak
learners during the sequential process. Three popular types of boosting methods include:

1. Adaptive boosting or AdaBoost: This method operates iteratively, identifying
misclassified data points and adjusting their weights to minimize the training error.
The model continues to optimize sequentially until it yields the strongest predictor.
AdaBoost is implemented by combining several weak learners into a single strong
learner. The weak learners in AdaBoost take into account a single input feature and
draw out a single-split decision tree called a decision stump. Each observation is
weighted equally while drawing out the first decision stump.
The results from the first decision stump are analyzed, and if any observations are
wrongly classified, they are assigned higher weights. A new decision stump is drawn
by considering the higher-weight observations as more significant. Again, if any
observations are misclassified, they are given a higher weight, and this process
continues until all the observations fall into the right class.
AdaBoost can be used for both classification and regression-based problems. However,
it is more commonly used for classification purposes.
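
The AdaBoost procedure described above can be sketched from scratch with decision stumps. This is an illustrative sketch under stated assumptions, not a library implementation: labels are coded as +1 and -1, the one-feature dataset is hypothetical, and the stump weight alpha = 0.5 * ln((1 - error) / error) is the usual textbook formula, which these notes do not spell out.

```python
import math

def stump_predict(stump, x):
    feature, threshold, polarity = stump
    return polarity if x[feature] <= threshold else -polarity

def best_stump(X, y, weights):
    """Return the stump with the lowest weighted error, together with that error."""
    best, best_error = None, float("inf")
    for feature in range(len(X[0])):
        for threshold in sorted({x[feature] for x in X}):
            for polarity in (1, -1):
                stump = (feature, threshold, polarity)
                error = sum(w for x, label, w in zip(X, y, weights)
                            if stump_predict(stump, x) != label)
                if error < best_error:
                    best, best_error = stump, error
    return best, best_error

def adaboost_fit(X, y, n_rounds):
    n = len(X)
    weights = [1 / n] * n                            # equal weights for the first stump
    ensemble = []
    for _ in range(n_rounds):
        stump, error = best_stump(X, y, weights)
        error = min(max(error, 1e-10), 1 - 1e-10)    # guard against log(0)
        alpha = 0.5 * math.log((1 - error) / error)  # influence of this stump
        ensemble.append((alpha, stump))
        # Misclassified points get higher weights, the rest lower; then normalise.
        weights = [w * math.exp(-alpha * label * stump_predict(stump, x))
                   for x, label, w in zip(X, y, weights)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble

def adaboost_predict(ensemble, x):
    score = sum(alpha * stump_predict(stump, x) for alpha, stump in ensemble)
    return 1 if score >= 0 else -1

# Hypothetical data that no single stump can classify correctly.
X = [(1,), (2,), (3,), (4,), (5,), (6,)]
y = [1, 1, -1, -1, 1, 1]
ensemble = adaboost_fit(X, y, n_rounds=3)
print([adaboost_predict(ensemble, x) for x in X])
```

On this data the best single stump misclassifies two of the six points, while the weighted combination of three stumps classifies all six correctly.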

2. Gradient Boosting: Gradient Boosting is also based on sequential ensemble learning.
Here the base learners are generated sequentially so that the present base learner is
always more effective than the previous one, i.e., the overall model improves
sequentially with each iteration.
The difference in this boosting type is that the weights for misclassified outcomes are
not incremented. Instead, the Gradient Boosting method tries to optimize the loss
function of the previous learner by adding a new weak learner that reduces the loss.
The main idea here is to overcome the errors in the previous learner's predictions. This
boosting has three main components:
 Loss function: The choice of the loss function depends on the type of problem. The
advantage of gradient boosting is that there is no need for a new boosting
algorithm for each loss function.
 Weak learner: In gradient boosting, decision trees are used as the weak learners.
Regression trees are used because they output real values, which can be added
together to correct the predictions. As in the AdaBoost algorithm, small trees with
a single split, i.e., decision stumps, can be used. Larger trees with more levels
can also be used.

 Additive Model: Trees are added one at a time in this model. Existing trees
remain the same. During the addition of trees, gradient descent is used to
minimize the loss function.
Like AdaBoost, Gradient Boosting can also be used for classification and regression
problems.
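
The three components above can be seen in a small regression sketch. The choices here are assumptions made for illustration: the loss is the squared error (so the negative gradient is simply the residual), each weak learner is a one-split regression stump, the learning_rate parameter is a commonly used shrinkage factor that these notes do not mention, and the data are made up.

```python
def fit_regression_stump(xs, residuals):
    """Weak learner: the single split on x that minimises the squared error of the residuals."""
    best = None
    for threshold in sorted(set(xs))[:-1]:   # skip the largest x so both sides are non-empty
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_value = sum(left) / len(left)
        right_value = sum(right) / len(right)
        sse = (sum((r - left_value) ** 2 for r in left)
               + sum((r - right_value) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_value, right_value)
    _, threshold, left_value, right_value = best
    return lambda x: left_value if x <= threshold else right_value

def gradient_boost_fit(xs, ys, n_trees=20, learning_rate=0.5):
    base = sum(ys) / len(ys)                 # initial prediction: the mean of y
    predictions = [base] * len(ys)
    trees = []
    for _ in range(n_trees):
        # Loss function: for squared error, the negative gradient is y - prediction.
        residuals = [y - p for y, p in zip(ys, predictions)]
        tree = fit_regression_stump(xs, residuals)
        trees.append(tree)                   # additive model: existing trees stay unchanged
        predictions = [p + learning_rate * tree(x) for x, p in zip(xs, predictions)]
    return lambda x: base + learning_rate * sum(tree(x) for tree in trees)

def mse(model, xs, ys):
    return sum((y - model(x)) ** 2 for x, y in zip(xs, ys)) / len(ys)

# Hypothetical training data.
xs = [1, 2, 3, 4, 5, 6]
ys = [1.2, 1.9, 3.2, 3.8, 5.1, 6.0]
baseline = gradient_boost_fit(xs, ys, n_trees=0)   # predicts the mean only
boosted = gradient_boost_fit(xs, ys, n_trees=20)
print(round(mse(baseline, xs, ys), 4), round(mse(boosted, xs, ys), 4))
```

Each added stump is fitted to what the current model still gets wrong, so the training error falls as trees are added.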

3. Extreme gradient boosting or XGBoost: XGBoost is an advanced gradient boosting
method. XGBoost, developed by Tianqi Chen, is maintained under the Distributed Machine
Learning Community (DMLC). The main aim of this algorithm is to increase the speed and
efficiency of computation. Standard gradient boosting implementations compute their
output slowly because they analyze the data set sequentially. XGBoost is therefore used
to boost, or extremely boost, the model's performance; it is designed to focus on
computational speed and model efficiency. The main features provided by XGBoost are:
 Parallel Processing: XGBoost parallelizes tree construction, using multiple CPU
cores while training.
 Cross-Validation: XGBoost enables users to run cross-validation of the
boosting process at each iteration, making it easy to find the optimum
number of boosting iterations in one run.
 Cache Optimization: It provides cache optimization of the algorithms for
higher execution speed.
 Distributed Computing: For training large models, XGBoost supports distributed
computing.
Advantages of Boosting
 Improved Accuracy – Boosting can improve the accuracy of the final model by
combining several weak models, averaging their outputs for regression or voting
over them for classification.
 Robustness to Overfitting – By reweighting the inputs that are classified
wrongly, boosting can in some cases reduce the risk of overfitting.
 Better handling of imbalanced data – Boosting can handle imbalanced data by
focusing more on the data points that are misclassified.
 Better Interpretability – Boosting can aid the interpretability of the model
by breaking the model's decision process into multiple simpler steps.

Disadvantages of Boosting Algorithms
Boosting algorithms also have some disadvantages:
 Boosting algorithms are vulnerable to outliers.
 It is difficult to use boosting algorithms for real-time applications.
 They are computationally expensive for large datasets.

Benefits of Boosting
 Ease of Implementation: Boosting can be used with several hyper-parameter
tuning options to improve fitting. Little data preprocessing is typically required,
and some boosting implementations, such as XGBoost, have built-in routines to handle
missing data. In Python, the ensemble module of the scikit-learn library makes it
easy to implement popular boosting methods such as AdaBoost and gradient boosting,
while XGBoost is available as a separate library.
 Reduction of bias: Boosting algorithms combine multiple weak learners in a
sequential method, iteratively improving upon observations. This approach can help
to reduce high bias, commonly seen in shallow decision trees and logistic regression
models.
 Computational Efficiency: Since boosting algorithms select only the features that
increase their predictive power during training, they can help reduce dimensionality
and increase computational efficiency.

Challenges of Boosting:
 Overfitting: There's some dispute in the research around whether or not boosting
can help reduce overfitting or make it worse. We include it under challenges
because in the instances that it does occur, predictions cannot be generalized to
new datasets.
 Intense computation: Sequential training in boosting is hard to scale up. Since each
estimator is built on its predecessors, boosting models can be computationally
expensive, although XGBoost seeks to address scalability issues in other boosting
methods. Boosting algorithms can be slower to train when compared to bagging, as a
large number of parameters can also influence the model's behavior.
 Vulnerability to outlier data: Boosting models are vulnerable to outliers or data
values that are different from the rest of the dataset. Because each model
attempts to correct the faults of its predecessor, outliers can skew results
significantly.
 Real-time implementation: You might find it challenging to use boosting for real-
time implementation because the algorithm is more complex than other processes.
Boosting methods have high adaptability, so you can use various model parameters
that immediately affect the model's performance.

Applications of Boosting
Boosting algorithms are well suited for artificial intelligence projects across a broad
range of industries, including:
 Healthcare: Boosting is used to lower errors in medical data predictions, such as
predicting cardiovascular risk factors and cancer patient survival rates. Ensemble
methods significantly improve the accuracy in identifying patients who could
benefit from preventive treatment of cardiovascular disease while avoiding
unnecessary treatment of others. Applying boosting to multiple genomics platforms
can improve the prediction of cancer survival time.
 IT: Gradient boosted regression trees are used in search engines for page
rankings, while the Viola-Jones boosting algorithm is used for image retrieval. As
noted by Cornell, boosted classifiers allow the computations to be stopped sooner
when it's clear which direction a prediction is headed. A search engine can stop
evaluating lower-ranked pages, while image scanners will only consider images
containing the desired object.
 Finance: Boosting is used with deep learning models to automate critical tasks,
including fraud detection, pricing analysis, and more. For example, boosting methods
in credit card fraud detection and financial product pricing analysis improve the
accuracy of analyzing massive data sets to minimize financial losses.

Bagging Vs. Boosting
Bagging:
 models are trained independently, in parallel, on different random subsets of the
data; their predictions are combined by simple averaging (or voting)
 reduces variance
 can be used with unstable models like decision trees
 simpler to run in parallel
 used with weak learners that show high variance and low bias
Boosting:
 models are trained sequentially, with each model learning from the errors of the
previous one; models are weighted based on their accuracy
 reduces bias
 often more powerful and accurate
 works better for stable models like linear regression
 used with weak learners that show low variance and high bias
In practice, it helps to test both on a new problem to see which performs better.

Fig 3.21 Bagging vs Boosting

Bagging | Boosting
------- | --------
The simplest way of combining predictions that belong to the same type. | A way of combining predictions that belong to different types.
Its main aim is to decrease variance, not bias. | Its main aim is to decrease bias, not variance.
Each model receives equal weight. | Models are weighted according to their performance.
Each model is built independently. | Each model is built dependently on the previous ones.
Training data subsets are drawn from the whole training dataset by random row sampling with replacement. | Each new subset emphasizes the elements that were misclassified by preceding models.
It tries to solve the overfitting problem. | It tries to reduce the bias.
If the classifier is unstable (high variance), then apply bagging. | If the classifier is stable and simple (high bias), then apply boosting.
The base classifiers are trained in parallel. | The base classifiers are trained sequentially.
Example: the random forest model uses bagging. | Example: AdaBoost uses the boosting technique.

Similarities between Bagging and Boosting
1. Both are ensemble techniques that obtain N learners from 1 learner.
2. Both generate several training data sets through random sampling.
3. Both make the final decision by averaging the N learners (or by taking the
majority of them, i.e., majority voting).
4. Both are good at reducing variance and offer better stability.

3. Stacking
Stacking is one of the popular ensemble modeling techniques in machine learning.
Several weak learners are trained in parallel, and their predictions are combined by a
meta learner to produce better predictions. In stacking, an algorithm takes the outputs
of sub-models as input and attempts to learn how to best combine the input predictions
to make a better output prediction. Stacking is also known as stacked generalization
and is an extended form of the Model Averaging Ensemble technique, in which all
sub-models participate according to their performance weights to build a new model
with better predictions. This new model is stacked on top of the others, which is why
the technique is named stacking.

Architecture of Stacking
The architecture of the stacking model is designed in such a way that it consists of
two or more base (learner) models and a meta-model that combines the predictions of
the base models. These base models are called level 0 models, and the meta-model is
known as the level 1 model. So, the Stacking ensemble method includes original
(training) data, primary level models, primary level prediction, secondary level
model, and final prediction. The basic architecture of stacking is shown in the
figure below.
 Original data: This data is divided into n folds; each fold serves in turn as
held-out (test) data while the remaining folds serve as training data.
 Base models: These models are also referred to as level-0 models. These models
use training data and provide compiled predictions (level-0) as an output.
 Level-0 Predictions: Each base model is trained on some training data and
provides different predictions, which are known as level-0 predictions.
 Meta Model: The architecture of the stacking model consists of one meta-model,
which helps to best combine the predictions of the base models. The meta-model is
also known as the level-1 model.
 Level-1 Prediction: The meta-model learns how to best combine the predictions of
the base models. Data not used to train the base models are fed to the base models
and predictions are made. These predictions, along with the expected outputs,
provide the input and output pairs of the training dataset used to fit the
meta-model.

Fig 3.22 Architecture of Stacking

Steps to implement Stacking models: (Working of Stacking)
1. We split the training data into K folds, just as in K-fold cross-validation.
2. A base model is fitted on K-1 folds and predictions are made for the Kth fold.
3. We repeat this for each fold, so that every training point gets a prediction from
a model that was not trained on it.
4. The base model is then fitted on the whole training data set so that it can make
predictions on the test set.
5. We repeat the last 3 steps for the other base models.
6. Predictions from the train set are used as features for the second level model.
7. Second level model is used to make a prediction on the test set.
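
The steps above can be sketched in plain Python for a one-dimensional regression problem. This is only an illustration under stated assumptions: the two base models (a least-squares line and a 1-nearest-neighbour regressor), the one-parameter meta-model that blends them, the toy data and all function names are hypothetical choices, not part of the original notes.

```python
def fit_linear(xs, ys):
    """Base model 1: least-squares line; returns a predict function."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return lambda x: slope * (x - mx) + my


def fit_nearest(xs, ys):
    """Base model 2: 1-nearest-neighbour regressor; returns a predict function."""
    pairs = list(zip(xs, ys))
    return lambda x: min(pairs, key=lambda p: abs(p[0] - x))[1]


def out_of_fold(fit, xs, ys, k):
    """Steps 1-3: predict each training point with a model that never saw it."""
    preds = [None] * len(xs)
    for fold in range(k):
        train = [i for i in range(len(xs)) if i % k != fold]
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        for i in range(fold, len(xs), k):
            preds[i] = model(xs[i])
    return preds


def fit_meta(p1, p2, ys):
    """Level-1 model: y ~ w*p1 + (1-w)*p2, with w found by least squares."""
    num = sum((y - b) * (a - b) for a, b, y in zip(p1, p2, ys))
    den = sum((a - b) ** 2 for a, b in zip(p1, p2))
    return num / den if den else 0.5


# Toy training data (assumed): a noisy line.
xs = [float(i) for i in range(12)]
ys = [0.5 * x + (1.0 if i % 2 else -1.0) for i, x in enumerate(xs)]

oof_lin = out_of_fold(fit_linear, xs, ys, k=3)   # level-0 predictions
oof_nn = out_of_fold(fit_nearest, xs, ys, k=3)
w = fit_meta(oof_lin, oof_nn, ys)                # step 6: fit the meta-model

lin, nn = fit_linear(xs, ys), fit_nearest(xs, ys)  # step 4: refit on all data


def stacked(x):
    """Step 7: final prediction from the second-level model."""
    return w * lin(x) + (1 - w) * nn(x)


print("meta weight on the linear model:", round(w, 3))
print("stacked prediction at x = 12.5:", round(stacked(12.5), 3))
```

The key design point is that the meta-model is fitted only on out-of-fold predictions, so it never sees predictions that a base model made on its own training data.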

3.2 Clustering
Clustering is the task of dividing data points into a number of groups such that data
points in the same group are more similar to each other than to those in other groups.
In simple words, the aim is to segregate groups with similar traits and assign them to
clusters. Points in the same group should be as similar as possible, and points in
different groups as dissimilar as possible. Given a collection of objects, we put the
objects into groups based on similarity.
A cluster is a group of objects that belong to the same class. In other words, similar
objects are grouped in one cluster and dissimilar objects are grouped in other
clusters.

Classification Vs. Clustering:
Classification and clustering are two methods of pattern identification used in machine
learning. Although both techniques have certain similarities, the difference lies in the
fact that classification assigns objects to predefined classes, while clustering
identifies similarities between objects and groups them according to the
characteristics they have in common and which differentiate them from other groups of
objects. These groups are known as "clusters".

Clustering: In the field of machine learning, clustering falls under unsupervised
learning; that is, for this type of algorithm we only have one set of unlabelled input
data, about which we must obtain information without previously knowing what the
output will be. Clustering is used in projects for companies that want to find common
aspects among their customers in order to form groups and target products or services.
Thus, if a significant percentage of customers have certain aspects in common (age,
type of family, etc.), the company can justify a particular campaign, service or
product.

Applications of Clustering:
 Market Segmentation - Clustering helps marketers improve their customer base
and work on the target areas. It helps group people based on their similarity in
many ways related to the product under consideration.
 Clustering helps identify groups of houses on the basis of their value, type and
geographical location.
 Clustering is used to study earthquakes. Based on the areas hit by earthquakes in
a region, clustering can help analyse the next probable location where an earthquake
can occur.
 In the field of biology, it can be used to derive plant and animal taxonomies,
categorize genes with similar functionalities and gain insight into structures
inherent to populations.
 Clustering also helps in classifying documents on the web for information discovery.
 Clustering is also used in outlier detection applications such as detection of credit
card fraud.
 As a data mining function, cluster analysis serves as a tool to gain insight into the
distribution of data to observe characteristics of each cluster.
 Let's understand this with an example. Suppose you are the head of a rental store
and wish to understand the preferences of your customers to scale up your business.
Is it possible for you to look at the details of each customer and devise a unique
business strategy for each one of them? Definitely not. What you can do is cluster
all of your customers into, say, 10 groups based on their purchasing habits and use a
separate strategy for the customers in each of these 10 groups. This is what we
call clustering.
 Search Result Grouping - Example: Netflix - A well-known application of clustering
algorithms is the Netflix recommendation system. Although the company is quite
discreet about its algorithms, it is reported that there are about 2,000 clusters or
communities that have common audiovisual tastes. Cluster 290 is the one that
includes people who like the titles "Lost", "Black Mirror" and "Groundhog Day".
Netflix uses these clusters to refine its knowledge of the tastes of viewers and
thus make better decisions in the creation of new original series.
 Example: Fraud Detection/Anomaly Detection - Classification is commonly used in
the financial sector. In the era of online transactions, where the use of cash has
decreased markedly, it is necessary to determine whether movements made
through cards are safe. Entities can classify transactions as correct or fraudulent
using historical data on customer behavior to detect fraud very accurately.
 Clustering analysis is broadly used in many applications such as market research,
pattern recognition, data analysis, image processing, social network analysis,
recommendation engines, medical imaging and image segmentation.

Classification : On the other hand, classification belongs to supervised learning, which
means that we know the input data (labeled in this case) and we know the possible
outputs of the algorithm. There is binary classification, which responds to problems
with two categorical answers (such as "yes" and "no"), and multi-class
classification, for problems with more than two classes, responding to more
open answers such as "great", "regular" and "insufficient".

Fig 3.23 Classification & Clustering

Applications of Classification:
 Classification is used in many fields, such as biology, the Dewey decimal
classification for books, and the detection of spam in e-mails.
 Classification is used when you need to know users or customers to decide which
products or campaigns will be launched in the future. For example, Bismart
developed a project for the insurance industry in which the client needed to
classify customers according to accident claims, so that each policy could be
classified according to the number of claims predicted. Thus, the company can
choose the customers with the lowest number of claims.

Types of Clustering Algorithms/Methods
The task of clustering is subjective, and every methodology follows a different set of
rules for defining the 'similarity' among data points. In fact, more than 100
clustering algorithms are known, but only a few of them are used widely. Let's look
at them in detail:
 Partitioning Method
 Centroid Method
 Distribution based Method
 Density-based Method
 Constraint-based Method
 Connectivity Method
 Hierarchical Method

1. Partitioning Method :
Suppose we are given a database of 'n' objects and the partitioning method constructs
'k' partitions of the data. Each partition represents a cluster, and k ≤ n. This means
that it classifies the data into k groups, which satisfy the following requirements −
 Each group contains at least one object.
 Each object must belong to exactly one group.
For a given number of partitions (say k), the partitioning method creates an initial
partitioning. It then uses an iterative relocation technique to improve the
partitioning by moving objects from one group to another.

2. Centroid Method:

Fig 3.24 Example of Centroid-based Clustering

These are iterative clustering algorithms in which the notion of similarity is derived
from the closeness of a data point to the centroid of a cluster. Centroid-based
clustering organizes the data into non-hierarchical clusters. K-Means is the most
widely used centroid-based clustering algorithm. Centroid-based algorithms are
efficient but sensitive to initial conditions and outliers. In these models, the number
of clusters required at the end has to be specified beforehand, which makes it
important to have prior knowledge of the dataset. These models run iteratively to find
a local optimum.

3. Distribution-based Clustering Method

Fig 3.25 Example of Distribution-based Clustering

These clustering models are based on the notion of how probable it is that all data
points in a cluster belong to the same distribution. This clustering approach assumes
that the data is composed of distributions, such as Gaussian (normal) distributions.
In Fig 3.25, the distribution-based algorithm clusters data into three
Gaussian distributions. As distance from the distribution's center increases, the
probability that a point belongs to the distribution decreases. The bands show that
decrease in probability. When you do not know the type of distribution in your data,
you should use a different algorithm. These models often suffer from overfitting. A
popular example of these models is the Expectation-Maximization algorithm, which uses
multivariate normal distributions.

4. Density-based Clustering Method

Fig 3.26 Example of Density-based Clustering


This method is based on the notion of density. The basic idea is to continue growing
a given cluster as long as the density in the neighborhood exceeds some threshold,
i.e., for each data point within a given cluster, the neighborhood of a given radius
has to contain at least a minimum number of points. These models search the data space
for areas of varying density of data points. They isolate the different
density regions and assign the data points within each region to the same cluster.
These algorithms have difficulty with data of varying densities and high dimensions.
Further, by design, these algorithms do not assign outliers to clusters.
Popular examples of density models are DBSCAN and OPTICS.
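
A simplified sketch of the DBSCAN idea is given below to show how the radius and minimum-points threshold work together. The parameter names `eps` and `min_pts`, the toy points and the use of Euclidean distance are assumptions made for illustration; this is a teaching sketch, not an optimized implementation.

```python
import math


def dbscan(points, eps, min_pts):
    """Label each point with a cluster id (0, 1, ...) or -1 for noise/outlier."""
    labels = [None] * len(points)  # None means "not visited yet"

    def neighbours(i):
        # All points (including i itself) within radius eps of point i.
        return [j for j, q in enumerate(points)
                if math.dist(points[i], q) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        nb = neighbours(i)
        if len(nb) < min_pts:
            labels[i] = -1          # not dense enough: mark as noise for now
            continue
        cluster += 1                # point i is a core point: start a cluster
        labels[i] = cluster
        queue = [j for j in nb if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # former noise becomes a border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nb_j = neighbours(j)
            if len(nb_j) >= min_pts:  # j is also a core point: keep growing
                queue.extend(nb_j)
    return labels


# Two dense groups and one isolated point (toy data).
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (5, 5)]
labels = dbscan(points, eps=1.5, min_pts=3)
print(labels)  # [0, 0, 0, 0, 1, 1, 1, 1, -1]
```

The isolated point (5, 5) keeps the label -1, which illustrates that outliers are not assigned to any cluster.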

5. Constraint-based Method:
In this method, the clustering is performed by the incorporation of user or application-
oriented constraints. A constraint refers to the user expectation or the properties of
desired clustering results. Constraints provide us with an interactive way of
communication with the clustering process. Constraints can be specified by the user or
the application requirement.

6. Connectivity Method: As the name suggests, these models are based on the notion
that data points closer together in data space exhibit more similarity to each other
than data points lying farther away. These models can follow two approaches. In the
first approach, they start by classifying all data points into separate clusters and
then aggregate them as the distance decreases. In the second approach, all data points
are classified as a single cluster, which is then partitioned as the distance
increases. The choice of distance function is subjective. These models are very easy to
interpret but lack scalability for handling big datasets. Examples of these models are
the hierarchical clustering algorithm and its variants.

7. Hierarchical Methods:
Hierarchical clustering, as the name suggests, is an algorithm that builds a hierarchy
of clusters. In its most common (bottom-up) form, the algorithm starts with every data
point assigned to a cluster of its own. Then the two nearest clusters are merged into
the same cluster. The algorithm terminates when there is only a single cluster left.
This method creates a hierarchical decomposition of the given set of data objects. The
results of hierarchical clustering can be shown using a dendrogram.

Fig 3.27 Hierarchical Clustering

Fig.3.18 Hierarchical Clustering

The dendrogram can be interpreted as follows:
At the bottom, we start with 25 data points, each assigned to a separate cluster. The
two closest clusters are then merged repeatedly until we have just one cluster at the
top. The height in the dendrogram at which two clusters are merged represents the
distance between those two clusters in the data space. The number of clusters that
best depicts the different groups can be chosen by observing the dendrogram. The best
choice is the number of vertical lines in the dendrogram cut by a horizontal line that
can traverse the maximum distance vertically without intersecting a cluster. In the
above example, the best choice for the number of clusters is 4, as the red horizontal
line in the dendrogram covers the maximum vertical distance AB.
There are two approaches of Hierarchical Clustering –
 Agglomerative Approach
 Divisive Approach

Agglomerative Approach :
An agglomerative hierarchical clustering method is the most common type of
hierarchical clustering and is also known as a bottom-up approach. Bottom-up
algorithms treat each data point as a single cluster at the outset and then
successively merge (or agglomerate) pairs of clusters until all the objects
are in a single cluster or certain termination conditions are satisfied. The single
cluster becomes the hierarchy's root. For the merging step, the method finds the two
clusters that are closest to each other (according to some similarity measure) and
combines the two to form one cluster. Because two clusters are merged per iteration,
where each cluster contains at least one object, an agglomerative method requires at
most n iterations.

Fig 3.19 Hierarchical agglomerative Clustering

Steps of Agglomerative Clustering:
 Initially, each data point is a cluster of its own.
 Take the two nearest clusters and join them to form one single cluster.
 Repeat step 2 until you obtain the desired number of clusters.

Working of Agglomerative Hierarchical Clustering :
We begin by treating each data point as a single cluster, i.e., if there are X data
points in our dataset then we have X clusters. We then select a distance metric that
measures the distance between two clusters. As an example, we will use average
linkage, which defines the distance between two clusters to be the average distance
between data points in the first cluster and data points in the second cluster.
On each iteration, we combine two clusters into one. The two clusters to be combined
are selected as those with the smallest average linkage. That is, according to our
selected distance metric, these two clusters have the smallest distance between
each other and therefore are the most similar and should be combined.
This merging step is repeated until we reach the root of the tree, i.e., we only have
one cluster which contains all data points. In this way we can select how many
clusters we want in the end, simply by choosing when to stop combining the clusters,
i.e., when we stop building the tree.
Hierarchical clustering does not require us to specify the number of clusters in
advance, and we can even select which number of clusters looks best since we are
building a tree. Additionally, the algorithm is often less sensitive to the choice of
distance metric than other clustering algorithms, for which the choice of distance
metric is critical. A particularly good use case of hierarchical clustering
methods is when the underlying data has a hierarchical structure and you want to
recover the hierarchy; other clustering algorithms can't do this. These advantages of
hierarchical clustering come at the cost of lower efficiency, as it has a time
complexity of O(n³), unlike the linear complexity of K-Means and GMM.
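
The merging procedure with average linkage can be sketched as follows. This is a deliberately naive version that recomputes all pairwise linkages on every iteration; the function names, the stopping rule based on a target number of clusters and the toy points are illustrative assumptions.

```python
import math


def average_linkage(c1, c2):
    """Average Euclidean distance between points of c1 and points of c2."""
    total = sum(math.dist(p, q) for p in c1 for q in c2)
    return total / (len(c1) * len(c2))


def agglomerative(points, n_clusters):
    """Merge the two closest clusters until only n_clusters remain."""
    clusters = [[p] for p in points]  # every point starts as its own cluster
    while len(clusters) > n_clusters:
        pairs = ((a, b) for a in range(len(clusters))
                 for b in range(a + 1, len(clusters)))
        i, j = min(pairs, key=lambda ab: average_linkage(clusters[ab[0]],
                                                         clusters[ab[1]]))
        clusters[i] = clusters[i] + clusters[j]  # merge cluster j into i
        del clusters[j]
    return clusters


points = [(1.0, 1.0), (1.5, 1.0), (5.0, 5.0), (5.0, 5.5), (9.0, 1.0)]
result = agglomerative(points, n_clusters=3)
print(result)
```

Calling `agglomerative(points, n_clusters=1)` continues merging up to the root of the tree, which corresponds to building the full hierarchy.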

Divisive Approach:
A divisive hierarchical clustering method employs a top-down strategy. It starts by
placing all objects in one cluster, which is the hierarchy's root. It then
divides the root cluster into several smaller subclusters, and recursively partitions
those clusters into smaller ones. The partitioning process continues until each cluster
at the lowest level is coherent enough, either containing only one object or
containing objects that are sufficiently similar to each other. In either
agglomerative or divisive hierarchical clustering, a user can specify the desired
number of clusters as a termination condition.
Steps of Divisive Clustering:
 Initially, all points in the dataset belong to one single cluster.
 Partition the cluster into the two least similar clusters.
 Proceed recursively to form new clusters until the desired number of clusters is
obtained.

Fig 3.20 Hierarchical divisive Clustering

Distance Measure : A distance measure determines the similarity between two
elements and influences the shape of the clusters.
K-Means clustering supports various kinds of distance measures, such as:
 Euclidean distance measure
 Squared Euclidean distance measure
 Manhattan distance measure
 Cosine distance measure
1. Euclidean Distance Measure
The most common case is determining the distance between two points. If we have
a point P and a point Q, the Euclidean distance is the length of the ordinary straight
line between them. It is the distance between the two points in Euclidean space.
For two points P = (x1, y1) and Q = (x2, y2), the formula is:
d(P, Q) = √((x2 – x1)² + (y2 – y1)²)

Fig 3.21 Graph to represent Euclidean distance

2. Squared Euclidean Distance Measure
This is identical to the Euclidean distance measurement but does not take the
square root at the end. For P = (x1, y1) and Q = (x2, y2), the formula is:
d²(P, Q) = (x2 – x1)² + (y2 – y1)²

3. Manhattan Distance Measure
The Manhattan distance is the simple sum of the horizontal and vertical
components, i.e., the distance between two points measured along axes at right
angles.
Note that we take the absolute value so that negative values don't come
into play. For P = (x1, y1) and Q = (x2, y2), the formula is:
d(P, Q) = |x2 – x1| + |y2 – y1|

Fig 3.22 Graph to represent Manhattan distance

4. Cosine Distance Measure
In this case, we take the angle θ between the two vectors formed by joining the
points to the origin. For vectors P and Q, the formula is:
cos θ = (P · Q) / (‖P‖ ‖Q‖), and the cosine distance is 1 – cos θ

Fig 3.23 Graph to represent Cosine distance
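
The four distance measures described above can be sketched in a few lines of Python. The function names and the sample points P and Q are assumptions chosen for illustration; cosine distance is taken here as one minus the cosine of the angle between the vectors, which is a common convention.

```python
import math


def euclidean(p, q):
    """Straight-line distance between p and q."""
    return math.sqrt(sum((b - a) ** 2 for a, b in zip(p, q)))


def squared_euclidean(p, q):
    """Euclidean distance without the final square root."""
    return sum((b - a) ** 2 for a, b in zip(p, q))


def manhattan(p, q):
    """Sum of absolute differences along each axis."""
    return sum(abs(b - a) for a, b in zip(p, q))


def cosine_distance(p, q):
    """1 - cos(theta), where theta is the angle between vectors p and q."""
    dot = sum(a * b for a, b in zip(p, q))
    norm_p = math.sqrt(sum(a * a for a in p))
    norm_q = math.sqrt(sum(b * b for b in q))
    return 1 - dot / (norm_p * norm_q)


P, Q = (1, 2), (4, 6)
print(euclidean(P, Q))          # 5.0
print(squared_euclidean(P, Q))  # 25
print(manhattan(P, Q))          # 7
print(cosine_distance(P, Q))    # small value: the vectors point in similar directions
```

Note that cosine distance depends only on direction: two vectors pointing the same way have a distance of (approximately) zero regardless of their lengths.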

K Means Clustering:
K-Means is probably the most well-known clustering algorithm. It is an iterative
clustering algorithm that improves the clustering in each iteration until it converges
to a local optimum.
 K-Means clustering is an unsupervised iterative clustering technique.
 It partitions the given data set into k predefined distinct clusters.
 A cluster is defined as a collection of data points exhibiting certain similarities.
It partitions the data set such that-
 Each data point belongs to the cluster with the nearest mean.
 Data points belonging to one cluster have a high degree of similarity.
 Data points belonging to different clusters have a high degree of dissimilarity.

Fig 3.24 K-Means Clustering

K-Means Clustering Algorithm involves the following steps-
Step-01: Choose the number of clusters K.
Step-02: Randomly select any K data points as the initial cluster centers.
Preferably, select cluster centers that are as far as possible from each
other.
Step-03: Calculate the distance between each data point and each cluster center.
The distance may be calculated either by using given distance function or
by using euclidean distance formula.
Step-04: Assign each data point to some cluster. A data point is assigned to that
cluster whose center is nearest to that data point.
Step-05: Re-compute the center of newly formed clusters. The center of a cluster
is computed by taking mean of all the data points contained in that
cluster.
Step-06: Keep repeating the procedure from Step-03 to Step-05 until any of the
following stopping criteria is met-
 Center of newly formed clusters do not change
 Data points remain present in the same cluster
 Maximum number of iterations are reached
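
Steps 01 to 06 can be sketched as a short Python function. This is a minimal illustration that assumes Euclidean distance and takes the initial centers as an argument (in practice they could be drawn with `random.sample(points, k)`); the function name, the `max_iter` default and the toy data are hypothetical.

```python
import math


def kmeans(points, centers, max_iter=100):
    """Return (centers, clusters) after running K-Means from the given centers."""
    centers = list(centers)
    clusters = []
    for _ in range(max_iter):                      # stop: maximum iterations
        # Step-03 and Step-04: assign each point to its nearest center.
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)),
                          key=lambda c: math.dist(p, centers[c]))
            clusters[nearest].append(p)
        # Step-05: recompute each center as the mean of its cluster.
        new_centers = [
            tuple(sum(coord) / len(cl) for coord in zip(*cl)) if cl else centers[i]
            for i, cl in enumerate(clusters)
        ]
        if new_centers == centers:                 # stop: centers do not change
            break
        centers = new_centers
    return centers, clusters


points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
centers, clusters = kmeans(points, centers=[(1, 1), (9, 8)])
print(centers)
print(clusters)
```

An empty cluster keeps its previous center in this sketch; other strategies, such as re-seeding the center, are equally possible.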

Advantages of K-Means Clustering Algorithm -
1. It is pretty fast, as all we are doing is computing the distances between points
and group centers, which requires very few computations. It is relatively
efficient, with time complexity O(nkt), where
n = number of instances, k = number of clusters, t = number of iterations
2. It often terminates at a local optimum. Techniques such as Simulated Annealing or
Genetic Algorithms may be used to find the global optimum.

Disadvantages of K-Means Clustering Algorithm -
1. It requires the number of clusters (k) to be specified in advance.
2. It cannot handle noisy data and outliers well.
3. It is not suitable for identifying clusters with non-convex shapes.
4. K-means also starts with a random choice of cluster centers and therefore it
may yield different clustering results on different runs of the algorithm. Thus,
the results may not be repeatable and may lack consistency.

PROBLEMS BASED ON K-MEANS CLUSTERING ALGORITHM


Problem 1
Cluster the following eight points (with (x, y) representing locations) into three
clusters: A1(2, 10), A2(2, 5), A3(8, 4), A4(5, 8), A5(7, 5), A6(6, 4), A7(1, 2), A8(4, 9)
Initial cluster centers are: A1(2, 10), A4(5, 8) and A7(1, 2).
Use K-Means Algorithm to find the three cluster centers and the points belonging to
three different clusters.

The distance function between two points a = (x1, y1) and b = (x2, y2) is defined as
(this is the Manhattan distance)-
Ρ(a, b) = |x2 – x1| + |y2 – y1|
Iteration-01:
Calculating Distance between A1(2, 10) and C1(2, 10) -
Ρ(A1, C1) = |x2 – x1| + |y2 – y1| = |2 – 2| + |10 – 10| = 0
Calculating Distance between A1(2, 10) and C2(5, 8) -
Ρ(A1, C2) = |x2 – x1| + |y2 – y1| = |5 – 2| + |8 – 10| = 3+2 = 5
Calculating Distance between A1(2, 10) and C3(1, 2) -
Ρ(A1, C3) = |x2 – x1| + |y2 – y1| = |1 – 2| + |2 – 10| = 1+8 = 9


In a similar manner, we calculate the distance of the other points from each of the
centers of the three clusters. Next, we draw a table showing all the results. Using the
table, we decide which point belongs to which cluster. A given point belongs to the
cluster whose center is nearest to it.

Table 3.1 Problem1-Kmeans

From here, new clusters are-


Cluster-01: First cluster contains points- A1(2, 10)
Cluster-02: Second cluster contains points- A3(8, 4), A4(5, 8), A5(7, 5), A6(6, 4),
A8(4, 9)
Cluster-03: Third cluster contains points- A2(2, 5), A7(1, 2)

Now, we re-compute the new cluster centers. The new cluster center is computed by
taking the mean of all the points contained in that cluster.
For Cluster-01: We have only one point A1(2, 10) in Cluster-01.
So, the cluster center remains the same.
For Cluster-02:
Center of Cluster-02 = ((8 + 5 + 7 + 6 + 4)/5, (4 + 8 + 5 + 4 + 9)/5) = (6, 6)
For Cluster-03:
Center of Cluster-03 = ((2 + 1)/2, (5 + 2)/2) = (1.5, 3.5)
This completes Iteration-01.
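
Iteration-01 of Problem 1 can be checked with a short script. The sketch below uses the eight points, the three initial centers, and the distance function given in the problem statement; the helper name manhattan and the dictionary layout are choices made only for this illustration. It prints the distance of every point to every center, followed by the recomputed centers.

```python
points = {"A1": (2, 10), "A2": (2, 5), "A3": (8, 4), "A4": (5, 8),
          "A5": (7, 5), "A6": (6, 4), "A7": (1, 2), "A8": (4, 9)}
centers = {"C1": (2, 10), "C2": (5, 8), "C3": (1, 2)}

def manhattan(a, b):
    """Distance function from the problem: |x2 - x1| + |y2 - y1|."""
    return abs(b[0] - a[0]) + abs(b[1] - a[1])

# Assign every point to the cluster whose center is nearest
clusters = {name: [] for name in centers}
for label, p in points.items():
    dists = {c: manhattan(p, xy) for c, xy in centers.items()}
    nearest = min(dists, key=dists.get)
    clusters[nearest].append(label)
    print(label, dists, "->", nearest)

# Recompute each center as the mean of the points assigned to it
new_centers = {
    c: tuple(sum(points[m][i] for m in members) / len(members) for i in range(2))
    for c, members in clusters.items()
}
print(new_centers)  # {'C1': (2.0, 10.0), 'C2': (6.0, 6.0), 'C3': (1.5, 3.5)}
```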


Problem-2
Cluster the following five points (with (x, y) representing locations) into two clusters:
A(2, 2), B(3, 2), C(1, 1), D(3, 1), E(1.5, 0.5). Initial cluster centers are: A(2, 2), C(1, 1).
Use K-Means Algorithm to find the two cluster centers and the points belonging to two
different clusters. Use Euclidean distance measure.

Calculating Distance between A(2, 2) and C1(2, 2)-
Ρ(A, C1) = sqrt [ (x2 – x1)^2 + (y2 – y1)^2 ] = sqrt [ (2 – 2)^2 + (2 – 2)^2 ] = sqrt [ 0 + 0 ] = 0
Calculating Distance between A(2, 2) and C2(1, 1)-
Ρ(A, C2) = sqrt [ (x2 – x1)^2 + (y2 – y1)^2 ] = sqrt [ (1 – 2)^2 + (1 – 2)^2 ] = sqrt [ 1 + 1 ]
= sqrt [ 2 ] = 1.41

In a similar manner, we calculate the distance of the other points from each of the
centers of the two clusters. Next, we draw a table showing all the results. Using the
table, we decide which point belongs to which cluster. A given point belongs to the
cluster whose center is nearest to it.

Table 3.2 Problem2-Kmeans

From here, new clusters are-


Cluster-01: First cluster contains points- A(2, 2), B(3, 2), D(3, 1)
Cluster-02: Second cluster contains points- C(1, 1), E(1.5, 0.5)

Now, we re-compute the new cluster centers.

The new cluster center is computed by taking the mean of all the points contained in
that cluster.

For Cluster-01:
Center of Cluster-01 = ((2 + 3 + 3)/3, (2 + 2 + 1)/3) = (2.67, 1.67)
For Cluster-02:
Center of Cluster-02 = ((1 + 1.5)/2, (1 + 0.5)/2) = (1.25, 0.75)
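
The first iteration of Problem 2 can be verified in the same way, this time with Euclidean distance. The sketch below also runs the assignment step once more with the new centers to see whether any point changes cluster. The helper names assign and recompute are assumptions made for this illustration.

```python
import math

points = {"A": (2, 2), "B": (3, 2), "C": (1, 1), "D": (3, 1), "E": (1.5, 0.5)}

def assign(centers):
    """Index of the nearest center (Euclidean distance) for every point."""
    return {name: min(range(len(centers)), key=lambda j: math.dist(p, centers[j]))
            for name, p in points.items()}

def recompute(labels, k):
    """Mean of the points assigned to each of the k clusters."""
    out = []
    for j in range(k):
        members = [points[n] for n, lab in labels.items() if lab == j]
        out.append(tuple(sum(c) / len(members) for c in zip(*members)))
    return out

centers = [(2, 2), (1, 1)]           # initial centers A and C
labels_1 = assign(centers)           # Iteration 1 assignment
centers_1 = recompute(labels_1, 2)   # Iteration 1 centers
labels_2 = assign(centers_1)         # Iteration 2 assignment

print([tuple(round(v, 2) for v in c) for c in centers_1])  # [(2.67, 1.67), (1.25, 0.75)]
print(labels_1 == labels_2)                                 # True
```

Since the second assignment is identical to the first, the centers would not move either, so the stopping criteria are met after the second iteration.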


Problem 3:
Divide the given sample data into two clusters using the K-Means Algorithm. Use
Euclidean Distance. Initial cluster centers are C1(185, 72) and C2(170, 56).
Height (H) Weight (W)
185 72
170 56
168 60
179 68
182 72
188 77
180 71
180 70
183 84
180 88

Difference between K Means and Hierarchical Clustering:


• Hierarchical clustering can't handle big data well but K Means clustering can.
This is because the time complexity of K Means is linear, i.e. O(n), while that of
hierarchical clustering is quadratic, i.e. O(n^2).
• In K Means clustering, since we start with a random choice of cluster centers, the
results produced by running the algorithm multiple times might differ, whereas
results are reproducible in Hierarchical clustering.
• K Means is found to work well when the shape of the clusters is hyperspherical
(like a circle in 2D or a sphere in 3D).
• K Means clustering requires prior knowledge of K, i.e. the number of clusters you
want to divide your data into. In hierarchical clustering, you can stop at whatever
number of clusters you find appropriate by interpreting the dendrogram.

K-Mode Clustering :
Clustering is an unsupervised learning method whose task is to divide the population or
data points into a number of groups, such that data points in a group are more similar
to other data points in the same group and dissimilar to the data points in other groups.
It is basically a collection of objects based on similarity and dissimilarity between them.

KModes vs KMeans
K-Means clustering works efficiently only for numerical datasets and fails to find
patterns in categorical datasets. KModes clustering is an unsupervised Machine
Learning algorithm that groups similar data points into clusters based on their
categorical attributes. Unlike traditional clustering algorithms that use distance
metrics, KModes works by identifying the modes, or most frequent values, within each
cluster to determine its centroid. This is in contrast to the better-known k-means
algorithm, which clusters numerical data based on Euclidean distance.
KMeans uses mathematical measures (distance) to cluster continuous data. The smaller
the distance, the more similar our data points are. Centroids are updated by means.
For categorical data points, however, such a distance cannot be calculated
meaningfully, so we use the KModes algorithm. It uses the dissimilarities (total
mismatches) between the data points. The fewer the dissimilarities, the more similar
our data points are. It uses modes instead of means.

Many real-world datasets contain categorical attributes. For example, if we are
analysing social media data, we have categorical data like gender, profession and so
on. KModes is well suited for clustering categorical data such as customer
demographics, market segments, or survey responses. It is a widely used algorithm for
grouping categorical data because it is easy to implement and handles large amounts
of data efficiently. It defines clusters based on the number of matching categories
between data points.

Algorithm of K-Mode:
Input: Data objects X, Number of clusters K.
Step 1. Pick K observations at random and use them as leaders/clusters.
Step 2. Calculate the simple matching dissimilarity measure between each of the K
initial cluster modes and each data object, and assign each observation to its closest
cluster. For two objects X = (x1, ..., xm) and Y = (y1, ..., ym) with m categorical
attributes, the dissimilarity measure is the number of mismatched attributes:
d(X, Y) = δ(x1, y1) + δ(x2, y2) + ... + δ(xm, ym),
where δ(xj, yj) = 0 if xj = yj and δ(xj, yj) = 1 otherwise.
Step 3. For each data object, find the minimum dissimilarity value, i.e. identify the
cluster mode nearest to it.
Step 4: Assign the data objects to the nearest cluster modes.
Step 5: Update the modes by the frequency-based method on the newly formed clusters.
Step 6: Recalculate the dissimilarity between the data objects and the updated modes.
Step 7: Repeat Steps 3 to 6 until there are no changes in the cluster membership of
the data objects.
Output: Clustered data objects
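
The steps above can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions: the function names mismatches and k_modes are invented for this example, the six rows are made-up sample data (not the dataset used in the problem that follows), the initial modes are passed in explicitly rather than picked at random, and ties are resolved by taking the first candidate instead of a random one.

```python
from collections import Counter

def mismatches(a, b):
    """Simple matching dissimilarity: number of attributes that differ."""
    return sum(x != y for x, y in zip(a, b))

def k_modes(rows, init_modes, max_iter=100):
    """Cluster categorical rows; returns (labels, modes)."""
    modes = [tuple(m) for m in init_modes]
    labels = None
    for _ in range(max_iter):
        # Assign each row to the mode with the fewest mismatches
        new_labels = [
            min(range(len(modes)), key=lambda j: mismatches(r, modes[j]))
            for r in rows
        ]
        if new_labels == labels:  # no change in cluster membership
            break
        labels = new_labels
        # Update each mode attribute-wise with the most frequent value
        for j in range(len(modes)):
            members = [r for r, lab in zip(rows, labels) if lab == j]
            if members:
                modes[j] = tuple(Counter(col).most_common(1)[0][0]
                                 for col in zip(*members))
    return labels, modes

# Hypothetical sample rows: (hair color, eye color, skin color)
rows = [
    ("blonde", "amber", "fair"),
    ("blonde", "amber", "fair"),
    ("black", "brown", "dark"),
    ("black", "brown", "brown"),
    ("blonde", "gray", "fair"),
    ("black", "hazel", "dark"),
]
labels, modes = k_modes(rows, init_modes=[rows[0], rows[2]])
print(labels)  # [0, 0, 1, 1, 0, 1]
print(modes)   # [('blonde', 'amber', 'fair'), ('black', 'brown', 'dark')]
```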


Problem 1: Imagine we have a dataset that has information about the hair color, eye
color, and skin color of persons. We aim to group them based on the available
information. Hair color, eye color, and skin color are all categorical variables. The
dataset is shown below.

Step 1: Pick K observations at random and use them as leaders/clusters. Here P1, P7,
P8 are chosen as leaders/clusters.

Step 2: Calculate the dissimilarities (number of mismatches) and assign each
observation to its closest cluster. Compare each leader with each observation,
attribute by attribute. Matching values give 0, differing values give 1.


Comparing leader/Cluster P1 to the observation P1 gives 0 dissimilarities.

Comparing leader/cluster P1 to the observation P2 gives 3(1+1+1) dissimilarities.


Likewise, calculate all the dissimilarities and put them in a matrix as shown below, and
assign the observations to their closest cluster (the cluster that has the least dissimilarity).

After step 2, the observations P1, P2, P5 are assigned to cluster 1; P3, P7 are assigned
to Cluster 2; and P4, P6, P8 are assigned to cluster 3.

Note: If all the clusters have the same dissimilarity with an observation, assign to any
cluster randomly. In our case, the observation P2 has 3 dissimilarities with all the
leaders. So it was randomly assigned to Cluster 1.

Step 3: Define new modes for the clusters


Mode is simply the most observed value. Mark the observations according to the
cluster they belong to. Observations of Cluster 1 are marked in Yellow, Cluster 2 are
marked in Brick red, and Cluster 3 are marked in Purple.


Considering one cluster at a time, for each feature, look for the Mode and update the
new leaders.
Explanation: Cluster 1 observations (P1, P2, P5) have brunette as the most observed
hair color, amber as the most observed eye color, and fair as the most observed skin
color.
Note: If several values occur equally often, take the mode randomly. In
our case, the observations of Cluster 2 (P3, P7) have one occurrence each of brown and
fair skin color. Brown is randomly chosen as the mode.
Below are our new leaders after the update.

Iteration – 2: Repeat steps 2–4 (Iteration 2 is shown for understanding)


After obtaining the new leaders, again calculate the dissimilarities between the
observations and the newly obtained leaders.

Comparing Cluster 1 to the observation P1 gives 1 dissimilarity.

Comparing Cluster 1 to the observation P2 gives 2 dissimilarities.


Likewise, calculate all the dissimilarities and put them in a matrix. Assign each
observation to its closest cluster.

The observations P1, P2, P5 are assigned to Cluster 1; P3, P7 are assigned to Cluster 2;
and P4, P6, P8 are assigned to Cluster 3.
We stop here as we see there is no change in the assignment of observations.

DBSCAN: Density-Based Spatial Clustering of Applications with Noise (DBSCAN)

It groups 'densely grouped' data points into a single cluster. It can identify clusters in
large spatial datasets by looking at the local density of the data points. The most
notable feature of DBSCAN clustering is that it is robust to outliers. It also does
not require the number of clusters to be specified beforehand, unlike K-Means, where
we have to specify the number of centroids.

DBSCAN requires only two parameters: epsilon and minPoints. Epsilon is the radius of
the circle to be created around each data point to check the density, and minPoints is
the minimum number of data points required inside that circle for that data point to be
classified as a Core point. In higher dimensions the circle becomes a
hypersphere, epsilon becomes the radius of that hypersphere, and minPoints is the
minimum number of data points required inside that hypersphere. Here, we have some
data points represented in grey. Let's see how DBSCAN clusters these data
points.

DBSCAN creates a circle of epsilon radius around every data point and classifies the
points into Core points, Border points, and Noise. A data point is a Core point if the
circle around it contains at least 'minPoints' points. If the circle contains fewer than
minPoints points but the data point lies within the epsilon radius of a Core point, it
is classified as a Border point. A data point that is neither a Core point nor a Border
point is treated as Noise.
The above figure shows a cluster created by DBSCAN with minPoints = 3. Here, we
draw a circle of equal radius epsilon around every data point. These two parameters
help in creating spatial clusters.


In this figure, all the data points with at least 3 points in the circle, including
itself, are Core points, represented in red. The data points with fewer than 3 but
more than 1 point in the circle, including itself, are Border points, represented in
yellow. Finally, data points with no point other than itself inside the circle are
Noise, represented in purple.
DBSCAN typically uses Euclidean distance to measure the distance between data points.
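
The Core / Border / Noise labelling can be sketched as below. This is an illustrative sketch, not a library API: the function name classify_points and the five sample points are assumptions. It uses the usual DBSCAN definitions, in which neighborhood counts include the point itself, a Border point is a non-core point within epsilon of some Core point, and everything else is Noise.

```python
import math

def classify_points(points, eps, min_pts):
    """Label each point 'core', 'border', or 'noise'."""
    # Indices of all points within eps of each point (the point itself included)
    neighbors = [
        [j for j, q in enumerate(points) if math.dist(p, q) <= eps]
        for p in points
    ]
    is_core = [len(n) >= min_pts for n in neighbors]
    kinds = []
    for i in range(len(points)):
        if is_core[i]:
            kinds.append("core")
        elif any(is_core[j] for j in neighbors[i]):
            kinds.append("border")  # not dense itself, but next to a core point
        else:
            kinds.append("noise")
    return kinds

# Illustrative points (not taken from the figure)
sample = [(0, 0), (0, 1), (1, 0), (2, 0), (10, 10)]
kinds = classify_points(sample, eps=1.1, min_pts=3)
print(kinds)  # ['core', 'border', 'core', 'border', 'noise']
```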

Fig 3.25 K-Means Clustering

Parameter Selection in DBSCAN Clustering


DBSCAN is very sensitive to the values of epsilon and minPoints. Therefore, it is very
important to understand how to select the values of epsilon and minPoints. A slight
variation in these values can significantly change the results produced by the DBSCAN
algorithm. The value of minPoints should be at least one greater than the number of
dimensions of the dataset, i.e., minPoints>=Dimensions+1.

Fig 3.26 DBScan Clustering

It does not make sense to take minPoints as 1 because it will result in each point being
a separate cluster. Therefore, it must be at least 3. Generally, it is taken as twice the
number of dimensions, but domain knowledge also influences its value. The value of
epsilon can be decided from the K-distance graph. The point of maximum curvature
(elbow) in this graph indicates the value of epsilon. If the value of epsilon chosen is
too small, a higher number of clusters will be created and more data points will be
taken as noise. If it is too big, various small clusters will merge into one big cluster,
and we will lose details. Here, both X and Y are density-reachable from O; therefore,
we can say that X is density-connected to Y.
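
The values plotted in a K-distance graph can be computed as in the sketch below: for every point, take the distance to its k-th nearest neighbor, then sort these distances. The function name k_distances and the sample points are assumptions made for this illustration. Plotting the returned list and reading off the elbow is left to a plotting tool.

```python
import math

def k_distances(points, k):
    """Sorted distance from each point to its k-th nearest other point."""
    result = []
    for i, p in enumerate(points):
        # Distances to every other point, nearest first
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        result.append(dists[k - 1])
    return sorted(result)

# Illustrative data: four close points and one far-away point
sample = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]
curve = k_distances(sample, k=2)
print(curve)  # four values of 1.0 followed by one much larger value
```

In this toy data the sharp jump at the last value plays the role of the elbow, which suggests an epsilon slightly above 1.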


Why DBSCAN?
Partitioning methods (K-means, PAM clustering) and hierarchical clustering work for
finding spherical-shaped clusters or convex clusters. In other words, they are suitable
only for compact and well-separated clusters. K-Means and Hierarchical Clustering
both fail in creating clusters of arbitrary shapes, and they are not able to form
clusters based on density. Moreover, they are also severely affected by the presence of
noise and outliers in the data. That's why we need DBSCAN clustering.
Real-life data may contain irregularities: clusters can be of arbitrary shape, and the
data may contain noise. Given a dataset containing non-convex clusters and outliers,
the k-means algorithm has difficulty identifying these clusters with arbitrary
shapes. DBSCAN is able to cluster such data points correctly, and it also identifies
noise in the dataset.

Steps Used In DBSCAN Algorithm


1. For every point, find all the neighbor points within eps, and identify the core
points as those with at least MinPts neighbors.
2. For each core point that is not already assigned to a cluster, create a new cluster.
3. Recursively find all its density-connected points and assign them to the same
cluster as the core point.
4. Two points a and b are said to be density-connected if there exists a point c which
has a sufficient number of points in its neighborhood and both a and b are within
the eps distance of c. This is a chaining process: if b is a neighbor of c, c is a
neighbor of d, d is a neighbor of e, and e is a neighbor of a (with the intermediate
points being core points), then b is density-connected to a.
5. Iterate through the remaining unvisited points in the dataset. Those points that do
not belong to any cluster are noise.
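
The steps above can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not a production implementation: the names region_query and dbscan and the sample points are invented for this example, neighborhood counts include the point itself, Euclidean distance is used, and the recursive expansion of step 3 is written as an explicit work list. Noise points receive the label -1.

```python
import math

def region_query(points, i, eps):
    """Indices of all points within eps of points[i] (including i itself)."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

def dbscan(points, eps, min_pts):
    """Return one cluster label per point; -1 marks noise."""
    NOISE = -1
    labels = [None] * len(points)  # None means "not visited yet"
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = NOISE  # may later be re-labelled as a border point
            continue
        cluster += 1  # i is a core point: start a new cluster
        labels[i] = cluster
        work = [j for j in neighbors if j != i]
        while work:  # expand through density-connected points
            j = work.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:  # j is also a core point
                work.extend(j_neighbors)
    return labels

# Illustrative data: two dense groups and one isolated point
sample = [(1, 1), (1, 2), (2, 1), (2, 2), (8, 8), (8, 9), (9, 8), (5, 5)]
labels = dbscan(sample, eps=1.5, min_pts=3)
print(labels)  # [0, 0, 0, 0, 1, 1, 1, -1]
```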

Difference between DBSCAN and K-Means.


DBSCAN | K-Means
In DBSCAN we need not specify the number of clusters. | K-Means is very sensitive to the number of clusters, so it needs to be specified.
Clusters formed in DBSCAN can be of any arbitrary shape. | Clusters formed in K-Means are spherical or convex in shape.
DBSCAN can work well with datasets having noise and outliers. | K-Means does not work well with outliers. Outliers can skew the clusters in K-Means to a very large extent.
In DBSCAN two parameters are required for training the model. | In K-Means only one parameter is required for training the model.


Module 3 - Question Bank

Sl. No. Questions (Level, Marks)

1. Explain linear regression model in detail. (L2, 5)
2. Derive the linear regression equations for the following to find the value of y. (L4, 10)
a. The best fit line goes through (0, 40) and (10, 35) and x = 30
b. The best fit line goes through (0, 5) and (1, 6.5) and x = 2.5
3. Resort A charges $55.50 per night, plus a one-time surcharge of $100. Resort B
charges $65.50 per night, plus a one-time surcharge of $50. (L4, 10)
A) State the dependent variable and write a Regression Equation.
B) State the independent variable.
C) After how many days is the cost the same?
D) Examine the total cost at that point.
4. Given three classifiers with predictions (Class A, Class A, Class B), what would be
the final prediction using hard voting? (L3, 5)
5. Given three classifiers with predictions (Class A, Class A, Class B), what would be
the final prediction using hard voting? (L3, 5)
6. How does Boosting address bias in a model? Explain in detail. (L2, 10)
7. Compare Agglomerative approach and Divisive approach in Hierarchical clustering
with neat diagram. (L2, 10)
8. What is K-Means Clustering? Discuss the steps for K-Means clustering algorithm. (L2, 10)
9. Cluster the following eight points (with (x, y) representing locations) into three
clusters: A1(2, 10), A2(2, 5), A3(8, 4), A4(5, 8), A5(7, 5), A6(6, 4), A7(1, 2),
A8(4, 9). Initial cluster centers are: A1(2, 10), A4(5, 8) and A7(1, 2). Use K-Means
Algorithm to find the three cluster centers and the points belonging to three
different clusters. (L3, 5)
10. Cluster the following five points (with (x, y) representing locations) into two
clusters: A(2, 2), B(3, 2), C(1, 1), D(3, 1), E(1.5, 0.5). Initial cluster centers are:
A(2, 2), C(1, 1). Use K-Means Algorithm to find the two cluster centers and the
points belonging to two different clusters. Use Euclidean distance measure. (L3, 5)


11. Divide the given sample data in two clusters using K-Means Algorithm. Use
Euclidean Distance. Initial cluster centers are C1(185, 72) and C2(170, 56). (L3, 10)
Height (H) Weight (W)
185 72
170 56
168 60
179 68
182 72
188 77
180 71
180 70
183 84
180 88
12. Why use Random forest? How does Random Forest algorithm work? Explain with an
example. (L2, 10)
