Machine Learning 24MCA31 Module3
Machine Learning 24MCA31 Module3
MACHINE LEARNING
MODULE – 3
Machine Learning Algorithms - II
3.1 REGRESSION
Introduction:
Regression algorithms fall under the family of Supervised Machine Learning
algorithms which is a subset of machine learning algorithms.
One of the main features of supervised learning algorithms is that they model
dependencies and relationships between the target output and input features to
predict the value for new data.
Regression algorithms predict the output values based on input features from
the data fed in the system.
The algorithm builds a model on the features of training data and using the
model to predict the value for new data.
Regression analysis is a form of predictive modelling technique which
investigates the relationship between a dependent (target) and independent
variable (s) (predictor).
It involves determining the best fit line, which is a line that passes through all
the data points in such a way that distance of the line from each data point is
minimized.
Linear regression and logistic regression are two types of regression
analysis techniques that are used to solve the regression problem using machine
learning.
They are the most prominent techniques of regression. But, there are many
types of regression analysis techniques available in machine learning to make
predictions, and their usage varies according to the nature of the data involved.
These techniques are mostly driven by three metrics (number of independent
variables, type of dependent variables and shape of regression line).
1. Linear Regression:
How would linear regression be described in layman’s terms?
Linear regression can be thought of as the answer to the question ―How can I use X to
predict Y?‖, where X is some information that you have and Y is some information.
Let‘s look at a concrete example. You might be wondering how much you can sell your
house for. You have information about your house, for instance, the number of
bedrooms is 2 - this is your X. And you want to know how much your estate could be
worth on the market. This is Y - the price in $ that you can sell your house for.
Linear regression creates an equation in which you input your given numbers (X) and it
outputs the target variable that you want to find out (Y). We obtain the equation by
training it on pairs of (X, Y) values. We would use a dataset containing historic records
of house purchases in the form of (―number of bedrooms‖, ―selling price‖):
We would then visualize the data points on a scatter plot to see if there are any
trends. A scatter plot is a two-dimensional plot, with each data point representing a
house. On the x-axis, we would have values for ―Number of bedrooms‖, while on the y-
axis, we would have the ―Selling price‖ for the same houses:
Linear Regression:
Linear regression is a supervised machine-learning regression algorithm.
Linear regression is the statistical model, simplest and basic regression
technique widely known in machine learning.
It is fast and easy to model and useful when the target relationship is not
complex or enough data is not available and easy to learn and evaluate.
Linear regression is a linear model that establishes a linear relationship between
one or more input variables (x) (independent variable) and the single output
variable (y) (dependent variable).
There must be linear relationship between independent and dependent variables
Here the y can be calculated from a linear combination of the input variables
(x). The best fit line is determined by varying the values of m and c.
The predictor error is the difference between the observed values and the
predicted value.
The values of m and c get selected in such a way that it gives the minimum
predictor error.
The goal of the regression analysis (modeling) is to find the values for the unknown
parameters of the equation; that is, to find the values for the weights m and c. This
equation can be used to predict the value of target variable based on given
predictor variable(s). Consider predicting the salary of an employee based on his/her
age. We can easily identify that there seems to be a correlation between employee‘s
age and salary (more the age more is the salary). The hypothesis of linear regression is
Y = b0+b1X. Y represents salary, X is employee‘s age and a and b are the coefficients
of the equation. So in order to predict Y (salary) given X (age), we need to know the
values of a and b (the model‘s coefficients).
These specific features explain why linear regression is one of the best models for
making predictions using machine learning.
Despite their differences, both the simple and multiple regression models are linear
models - they adopt the form of a linear equation.
2. Logistic Regression:
Logistic regression is one of the types of regression analysis technique, which gets
used when the dependent variable is discrete. Example: 0 or 1, true or false, etc. This
means the target variable can have only two values, and a sigmoid curve denotes the
relation between the target variable and the independent variable.
Logit function is used in Logistic Regression to measure the relationship between the
target variable and independent variables. Below is the equation that denotes the
logistic regression.
Logistic regression is used to find the probability of event=Success and event=Failure.
We should use logistic regression when the dependent variable is binary (0/ 1, True/
False, Yes/ No) in nature. Here the value of Y ranges from 0 to 1 and it can be
represented by following equation.
Odds = p/(1-p) = probability of event occurrence / probability of no. event occurrence
ln(odds) = ln(p/(1-p))
logit(p) = ln(p/(1-p)) = b0+b1X1+b2X2+b3X3….+bkXk
where p is the probability of occurrence of the feature.
Above, p is the probability of presence of the characteristic of interest. The reason
for using log in the equation is since we are working here with a binomial distribution
(dependent variable), we need to choose a link function which is best suited for
this distribution. And, it is logit function.
For selecting logistic regression, the size of data is large with the almost equal
occurrence of values to come in target variables. Also, there should be no
multicollinearity, which means that there should be no correlation between independent
variables in the dataset.
Important Points:
3. Polynomial Regression
Polynomial Regression is another type of regression analysis techniques in machine
learning, which is the same as Multiple Linear Regression with a little modification.
In Polynomial Regression, the relationship between independent and dependent
variables, that is X and Y, is denoted by the n-th degree, i.e., the power of
independent variable is more than 1.
The linear model Y = a+bX is transformed into something like Y = a + bX + cX2.
It is still a linear model but the curve is now quadratic rather than a line. If we
increase the degree to a very high value, the curve becomes overfitted as it
learns the noise in the data as well. In this regression technique, the best fit line
is not a straight line. It is rather a curve that fits into the data points.
It is a linear model as an estimator. Least Mean Squared Method is used in
Polynomial Regression also. While trying to reduce the Mean Squared Error to a
minimum and to get the best fit line, the model can be prone to overfitting.
It is represented by the equation: Y=b0+b1x1+b2x22+........bn xnn
Important Points:
While there might be a temptation to fit a higher degree polynomial to get lower
error, this can result in over-fitting. Always plot the relationships to see the fit
and focus on making sure that the curve fits the nature of the problem.
Especially look out for curve towards the ends and see whether those shapes and
trends make sense. Higher polynomials can end up producing wierd results on
extrapolation.
Here is an example of how plotting can help:
4. Stepwise Regression
Standard stepwise regression does two things. It adds and removes predictors as
needed for each step.
Forward selection starts with most significant predictor in the model and adds
variable for each step. It continuously adds the variables in order to review the
performance and stops when no improvement is needed up to an extent
Backward elimination starts with all predictors in the model and removes the least
significant variable for each step. It includes the removal of variables at a time
until no extra variables would be deleted without considerable loss.
And bidirectional elimination is the blend of the above two approaches.
The aim of this modeling technique is to maximize the prediction power with
minimum number of predictor variables. It is one of the method to handle higher
dimensionality of data set.
5. Ridge Regression:
This is another type of regression in machine learning which is usually used when
there is a high correlation between the independent variables. This is because, in
the case of multi collinear data, the least square estimates give unbiased values.
But, in case the collinearity is very high, there can be some bias value. Therefore,
a bias matrix is introduced in the equation of Ridge Regression. This is a powerful
regression method where the model is less susceptible to overfitting.
Below is the equation used to denote the Ridge Regression, where the
introduction of λ (lambda) solves the problem of multicollinearity:
β = (X^{T}X + λ*I)^{-1}X^{T}y
7. Lasso Regression:
Lasso Regression in machine learning that performs regularization along with
feature selection. It prohibits the absolute size of the regression coefficient. As
a result, the coefficient value gets nearer to zero, which does not happen in the
case of Ridge Regression.
Due to this, feature selection gets used in Lasso Regression, which allows
selecting a set of features from the dataset to build the model. In the case of
Lasso Regression, only the required features are used, and the other ones are
made zero. This helps in avoiding the overfitting in the model. In case the
independent variables are highly collinear, then Lasso regression picks only one
variable and makes other variables to shrink to zero.
Application: Lasso regression algorithms have been widely used in financial
networks and economics. In finance, its application is seen in forecasting
probabilities of default and Lasso-based forecasting models are used in assessing
enterprise wide risk framework. It is also used to perform stress test platforms
to analyze multiple stress scenarios.
Split the dataset on different attributes and calculate the standard deviation
for each branch (standard deviation for target and predictor). This value is
subtracted from the standard deviation before the split. The result is the
standard deviation reduction. SDR(T,X) = S(T) – S(T,X)
The attribute with the largest standard deviation reduction is chosen as the
splitting node.
The dataset is divided based on the values of the selected attribute. This
process is run recursively on the non-leaf branches until all data is processed.
To avoid overfitting, the Coefficient of Deviation (CV) is used which decides when
to stop branching. Finally, the average of each branch is assigned to the related
leaf node (in regression mean is taken whereas in classification mode of leaf
nodes is taken).
PROBLEMS:
Problem 1
Derive the linear regression equations for the following to find the value of y.
a. The Best fit line goes through (0,40) and (10,35) and x = 30
b. The Best fit line goes through (0,5) and (1,6.5) and x = 2.5
[Link] Best fit line goes through (0,40) and (10,35) and x = 30
Step 1: Find the slope.
This line goes through (0,40) and (10,35), so the slope is 35-40/10-0 = -1/2
Step 2: Find the y-intercept.
We can see that the line passes through (0,40), so the y-intercept is 40.
Step 3: Write the equation in y=mx+by,
So the equation is y=-0.5x+40
If x = 30, then
Y = −0.5x+40 = (−0.5)(30)+40 = −15+40 =25
So the Linear Regression equation is y=-0.5x+40 and y = 25.
b. The Best fit line goes through (0,5) and (1,6.5) and x = 2.5
Step 1: Find the slope.
This line goes through (0,5) and (1,6.5), so the slope is 6.5-5/1-0 = 1.5
Step 2: Find the y-intercept.
We can see that the line passes through (0,5) so the y-intercept is 5.
Step 3: Write the equation in y=mx+by,
The equation that best describes the model is y=1.5x+5
If x = 2.5, then
Y = 1.5x+5 = (1.5)(2.5)+5 =3.75+5 =8.75
So the Linear Regression equation is y=1.5x+5y and y = 8.75.
Problem 2
Resort A charges $55.50 per night, plus a one-time surcharge of $100 a night. Resort
B charges $65.50 per night, plus a one-time surcharge of $50.
A) State the dependent variable and write a Regression Equation.
B) State the independent variable.
C) After how many days is the cost the same?
D) Examine the total cost at that point?
Simple Linear Regression is often used to examine the relationship between dependent
and independent variable. X is independent variable and y is dependent variable. Simple
linear regression can be written in the form of y = m(x) + b. Here m represents the
value of the slope and b represents the value of the y-intercept.
A) In this scenario the dependent variable is equal to the total cost of the stay at a
resort. The regression equation associated with the cost at resort A can be written as
follows, based on the form given above:
y = m(x) + b
Total cost of the stay in a resort = (The cost per one night stay at the resort)(The
number of days stayed at a resort) + The value of the one-time surcharge associated
with a resort. Y is the dependent variable.
B) The independent variable in this scenario is the number of nights stayed at a resort.
D) To find the total cost after 5 days, simply select either the regression formula for
resort A or resort B and solve for the value of y. In this example, we will use the
equation from resort A:
y = (55.50)(x) + 100 = (55.50)(5) + 100 = 277.50 + 100 = 377.50
The total cost for 5 nights in both resorts is equal to $377.50.
Problem 3:
The data set given below has the rating of 2 movies by 6 people.
Person Xi=Rating for a movie Yi=Rating for a movie
th
Bahubali-I by i person Bahubali-II by ith person
1 4 3
2 2 4
3 3 2
4 5 5
5 1 3
6 3 1
i) Find the values of b0& b1 and Regression Equation w.r.t. Linear regression model
which best fits given data.
ii) Find regression line that fits best for given sample data.
iii) If new person rates the Bahubali-I as 3 then predict the rating of same person
for ―Bahubali-II‖
Sample Number x y x-𝑥 y-𝑦 (x-𝑥)2 (x-𝑥).(y-𝑦)
1 4 3 1 0 1 0
2 2 4 -1 1 1 -1
3 3 2 0 -1 0 0
4 5 5 2 2 4 4
5 1 3 -2 0 4 0
6 3 1 0 -2 0 0
∑ 18 18 10 3
Mean 𝑥=3 𝑦 =3
i) Find the values of b0 & b1 w.r.t. Linear regression model which best fits given
data.
𝑌 = b0 + b1.x
b0 is the intercept, b1 is slope.
∑(x−𝑥)(y−𝑦) 3
b1 = ∑(x−𝑥)2
= 10
= 0.3
𝑌 = b0 + b1.x (for x and y, put 𝑥 and 𝑦 value which is 3 and 3)
3 = b0 + (0.3)(3)
b0 = 3-0.9
b0 = 2.1, and b1 = 0.3
2
1
0 Line of Best Fit
0 1 2 3 4
X -Axis
iii) If new person rates the Bahubali-I as 3 then predict the rating of same person for
―Bahubali-II‖.
𝑌 = b0 + b1.x
𝑌 = 2.1 + (0.3).3; 𝑌 =3
New person rates Bahubali-I as 3, he is going to rate Bahubali-II as also 3.
The best hyperplane is that plane that has the maximum distance from both the
classes, and this is the main aim of SVM. This is done by finding different hyperplanes
which classify the labels in the best way then it will choose the one which is farthest
from the data points or the one which has a maximum margin shown in Fig 3.13c.
Fig 3.13a Points to classify Fig 3.13b Use of Straight line Fig 3.13c Use of hyperplances
Advantages of SVM
SVM works better when the data is Linear
It is more effective in high dimensions
With the help of the kernel trick, we can solve any complex problem
SVM is not sensitive to outliers
Can help us with Image classification
Uses of SVM :
SVMs are used in applications like handwriting recognition, intrusion detection, face
detection, email classification, gene classification, and in web pages. This is one of
the reasons we use SVMs in machine learning. It can handle both classification and
regression on linear and non-linear data.
Disadvantages of SVM
Choosing a good kernel is not easy
It doesn‘t show good results on a big dataset
The SVM hyperparameters are Cost -C and gamma. It is not that easy to fine-tune
these hyper-parameters. It is hard to visualize their impact
Here we have M1 to Mn individual models which are giving predictions from Pred 1 to
Pred n respectively. And we combine them to come up with a single model, and this is
where the art of ensembling comes into the picture.
Principle of ensemble
The basic principle of ensemble learning is to combine a number of weak learners into
strong learners. There are two main types of ensemble methods: Bagging (Bootstrap
Aggregating): In bagging, multiple models are trained on different random subsets of
the training data with replacement.
1. Voting:
The voting method is generally used for classification problems. In this technique,
multiple models are used to make predictions for each data point. The predictions by
each model are considered as a ‗vote‘. The predictions which we get from the majority
of the models are used as the final prediction.
For example, when you ask 5 of your colleagues to rate your movie (out of 5); assume
three of them rated it as 4, while two of them rated it a 5. Since the majority gave a
rating of 4, the final rating will be taken as 4. You can consider this as taking the
mode of all the predictions.
The result of max voting would be something like this:
Colleague 1 Colleague 2 Colleague 3 Colleague 4 Colleague 5 Final rating
5 4 5 4 4 4
It can be used for classification or regression. In the case of regression, this involves
calculating the average of the predictions from the models. In the case of
classification, the predictions for each label are summed and the label with the
majority vote is predicted.
Soft Voting: In soft voting, the output class is the prediction based on the average of
probability given to that class. Suppose given some input to three models, the
prediction probability for class A = (0.30, 0.47, 0.53) and B = (0.20, 0.32, 0.40). So the
average for class A is 0.4333 and B is 0.3067, the winner is clearly class A because it
had the highest probability averaged by each classifier.
2. Averaging:
It is mainly used for regression problems. In simple averaging method, for every
instance of test dataset, the average predictions are calculated. In this method, we
take an average of predictions from all the models and use it to make the final
prediction. Averaging can be used for making predictions in regression problems or
while calculating probabilities for classification problems. This method often reduces
overfit and creates a smoother regression model. The method consists of building
multiple models independently and returning the average of the prediction of all the
models. In general, the combined output is better than an individual output because
variance is reduced.
Example 1: In the below case, the averaging method would take the average of all the
values. i.e. (5+4+5+4+4)/5 = 4.4
Colleague 1 Colleague 2 Colleague 3 Colleague 4 Colleague 5 Final rating
5 4 5 4 4 4.4
Example 2: M1, M2, and M3 are three different models which are giving us individual
predictions. To make the final prediction, average is taken for individual models.
3. Weighted Averaging
This is an extension of the averaging method. All models are assigned different
weights defining the importance of each model for prediction. For instance, if two of
your colleagues are critics, while others have no prior experience in this field, then the
answers by these two friends are given more importance as compared to the other
people. The prediction of each model is multiplied by the weight and then their average
is calculated.
The result is calculated as [(5*0.23) + (4*0.23) + (5*0.18) + (4*0.18) + (4*0.18)] = 4.41.
Colleague 1 Colleague 2 Colleague 3 Colleague 4 Colleague 5 Final rating
Weight 0.23 0.23 0.18 0.18 0.18
Rating 5 4 5 4 4 4.41
For example- In the case of Model 2, we‘ll divide 1 by the sum of 1+2+3 = 6. So the
weight for Model 2 comes down to 1/6 = 0.16. Similarly, the weights for each of these
models are calculated and then multiply those weights by individual models. So, all the
predicted values of Model 1 get multiplied by 0.33. Similarly, M2 and M3 are multiplied
by 0.16 and 0.5 respectively. And then we have to sum up all of these values-
Aggregation: This is a step that involves the process of combining the output of
all base models and, based on their output, predicting an aggregate result with
greater accuracy and reduced variance.
In detail, each model is trained on a random subset of the data sampled with
replacement, meaning that the individual data points can be chosen more than once.
This random subset is known as a bootstrap sample. By training models on different
bootstraps, bagging reduces the variance of the individual models. It also avoids
overfitting by exposing the constituent models to different parts of the dataset. It
makes the model more robust and accurate, especially in cases where the individual
models are prone to high variability.
The predictions from all the sampled models are then combined through a simple
averaging to make the overall prediction. This way, the aggregated model incorporates
the strengths of the individual ones and cancels out their errors.
Let's break it down step by step:
Original training dataset: [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
Resampled training set 1: [2, 3, 3, 5, 6, 1, 8, 10, 9, 1]
Resampled training set 2: [1, 1, 5, 6, 3, 8, 9, 10, 2, 7]
Resampled training set 3: [1, 5, 8, 9, 2, 10, 9, 7, 5, 4]
Example: In the Random Forest method, predictions from multiple decision trees are
ensembled parallelly. Further, in regression problems, we use an average of these
predictions to get the final output, whereas, in classification problems, the model is
selected as the predicted class.
Advantages of Bagging:
1. Reduces overfitting: It can reduce the chance of an overfit model, resulting in
improved model accuracy on unseen data.
2. Decreases model variance: Multiple models trained on different subsets of data
average out their predictions, leading to lower variance than a single model.
3. Improves stability: Changes in the training dataset have less impact on bagged
models, making the overall model more stable.
4. Handles high variability: Especially effective for algorithms like decision trees,
which tend to have high variance.
5. Parallelizable computation: Each model in the ensemble can be trained
independently, allowing for parallel processing and efficient use of computational
resources.
6. Easy to understand and implement: The concept behind bagging is
straightforward and can be implemented without complex modifications to the
learning algorithm.
7. Good with noisy data: The averaging process helps in reducing the noise in the
final prediction.
8. Handles imbalanced data: Bagging can help in scenarios where the dataset is
imbalanced, improving the performance of the model in such situations.
Disadvantages of Bagging:
1. Flexible less: As a method, Bagging works particularly correctly with algorithms
that are much less solid. One which can be more stable or a problem with high
amounts of bias does now not provide an awful lot of gain as there is less variation
in the dataset of the version. As noted within the hands-On guide for machine
learning, "the bagging is a linear regression version will efficaciously just return the
original predictions for huge enough b."
2. Loss of interpretability: The Bagging slows down and grows extra in depth because
of the quantity of iterations growth. accordingly, it is no longer adequately suitable
for actual-time applications. Clustered structures or large processing cores are
perfect for quickly growing bagged ensembles on massive look-at units.
3. Expensive for computation: The Bagging is tough to draw unique business insights
via Bagging because of the averaging concerned throughout predictions. While the
output is more precise than any person's information point, a more accurate or
whole dataset may yield greater precision within a single classification or
regression model.
Example: Suppose there is a dataset that contains multiple fruit images. So, this dataset is
given to the Random forest classifier. The dataset is divided into subsets and given to each
decision tree. During the training phase, each decision tree produces a prediction result, and
when a new data point occurs, then based on the majority of results, the Random Forest
classifier predicts the final decision. Consider the below image:
2. Boosting
Boosting is an ensemble method that enables each member to learn from the preceding
member's mistakes and make better predictions for the future. Unlike the bagging
method, in boosting, all base learners (weak) are arranged in a sequential format so
that they can learn from the mistakes of their preceding learner. Hence, in this way,
all weak learners get turned into strong learners and make a better predictive model
with significantly improved performance. Boosting is an ensemble modeling technique
that attempts to build a strong classifier from the number of weak classifiers. It is
done by building a model by using weak models in series. Firstly, a model is built from
the training data. Then the second model is built which tries to correct the errors
present in the first model. This procedure is continued and models are added until
either the complete training data set is predicted correctly or the maximum number
of models are added.
Types of Boosting
Boosting methods are focused on iteratively combining weak learners to build a strong
learner that can predict more accurate outcomes. A weak learner classifies data
slightly better than random guessing. This approach can provide robust prediction
problem results, outperform neural networks, and support vector machines for tasks.
Boosting algorithms can differ in how they create and aggregate weak learners during
the sequential process. Three popular types of boosting methods include:
Additive Model: Trees are added one at a time in this model. Existing trees
remain the same. During the addition of trees, gradient descent is used to
minimize the loss function.
Like AdaBoost, Gradient Boosting can also be used for classification and regression
problems.
Benefits of Boosting
Ease of Implementation: Boosting can be used with several hyper-parameter
tuning options to improve fitting. No data preprocessing is required, and boosting
algorithms have built-in routines to handle missing data. In Python, the sci-kit-learn
library of ensemble methods makes it easy to implement the popular boosting
methods, including AdaBoost, XGBoost, etc.
Reduction of bias: Boosting algorithms combine multiple weak learners in a
sequential method, iteratively improving upon observations. This approach can help
to reduce high bias, commonly seen in shallow decision trees and logistic regression
models.
Computational Efficiency: Since boosting algorithms have special features that
increase their predictive power during training, it can help reduce dimensionality
and increase computational efficiency.
Challenges of Boosting:
Overfitting: There's some dispute in the research around whether or not boosting
can help reduce overfitting or make it worse. We include it under challenges
because in the instances that it does occur, predictions cannot be generalized to
new datasets.
Intense computation: Sequential training in boosting is hard to scale up. Since each
estimator is built on its predecessors, boosting models can be computationally
expensive, although XGBoost seeks to address scalability issues in other boosting
methods. Boosting algorithms can be slower to train when compared to bagging, as a
large number of parameters can also influence the model's behavior.
Vulnerability to outlier data: Boosting models are vulnerable to outliers or data
values that are different from the rest of the dataset. Because each model
attempts to correct the faults of its predecessor, outliers can skew results
significantly.
Real-time implementation: You might find it challenging to use boosting for real-
time implementation because the algorithm is more complex than other processes.
Boosting methods have high adaptability, so you can use various model parameters
that immediately affect the model's performance.
Applications of Boosting
Boosting algorithms are well suited for artificial intelligence projects across a broad
range of industries, including:
Healthcare: Boosting is used to lower errors in medical data predictions, such as
predicting cardiovascular risk factors and cancer patient survival rates. Ensemble
methods significantly improve the accuracy in identifying patients who could
Bagging Boosting
The most effective manner of mixing A manner of mixing predictions that
predictions that belong to the same type. belong to different sorts.
The main task of it is decrease the The main task of it is decrease the bias
variance but not bias. but not variance.
Here each of the model is different Here each of the model is same weight.
weight.
Each of the model is built here Each of the model is built here
independently. dependently.
This training records subsets are decided Each new subset consists of the factors
on using row sampling with alternative and that were misclassified through
random sampling techniques from the preceding models.
whole training dataset.
It is trying to solve by over fitting It is trying to solve by reducing the
problem. bias.
If the classifier is volatile (excessive If the classifier is stable and easy
variance), then apply bagging. (excessive bias) the practice boosting.
In the bagging base, the classifier is In the boosting base, the classifier is
works parallelly. works sequentially.
Example is random forest model by using Example is AdaBoost using the boosting
bagging. technique.
3. Stacking
Stacking is one of the popular ensemble modeling techniques in machine learning.
Various weak learners are ensembled in a parallel manner in such a way that by
combining them with Meta learners, we can predict better predictions for the future.
This ensemble technique works by applying input of combined multiple weak learners'
predictions and Meta learners so that a better output prediction model can be
achieved. In stacking, an algorithm takes the outputs of sub-models as input and
attempts to learn how to best combine the input predictions to make a better output
prediction. Stacking is also known as a stacked generalization and is an extended form
of the Model Averaging Ensemble technique in which all sub-models equally participate
as per their performance weights and build a new model with better predictions. This
new model is stacked up on top of the others; this is the reason why it is named
stacking.
Architecture of Stacking
The architecture of the stacking model is designed in such as way that it consists of
two or more base/learner's models and a meta-model that combines the predictions of
the base models. These base models are called level 0 models, and the meta-model is
known as the level 1 model. So, the Stacking ensemble method includes original
(training) data, primary level models, primary level prediction, secondary level
model, and final prediction. The basic architecture of stacking can be represented as
shown below the image.
Original data: This data is divided into n-folds and is also considered test data or
training data.
Base models: These models are also referred to as level-0 models. These models
use training data and provide compiled predictions (level-0) as an output.
Level-0 Predictions: Each base model is triggered on some training data and
provides different predictions, which are known as level-0 predictions.
Meta Model: The architecture of the stacking model consists of one meta-model,
which helps to best combine the predictions of the base models. The meta-model is
also known as the level-1 model.
Level-1 Prediction: The meta-model learns how to best combine the predictions of
the base models and is trained on different predictions made by individual base
models, i.e., data not used to train the base models are fed to the meta-model,
predictions are made, and these predictions, along with the expected outputs,
provide the input and output pairs of the training dataset used to fit the meta-
model.
3.2 Clustering
Clustering is the task of dividing data points into a number of groups such that data
points in the same groups are more similar to other data points in the same group than
those in other groups. In simple words, the aim is to segregate groups with similar
traits and assign them into clusters. Clustering is dividing data points into homogeneous
classes or clusters. Points in the same group are as similar as possible. Points in
different group are as dissimilar as possible. When a collection of objects is given, we
put objects into group based on similarity.
Cluster is a group of objects that belongs to the same class. In other words, similar
objects are grouped in one cluster and dissimilar objects are grouped in another
cluster.
Applications of Clustering:
Market Segmentation - Clustering helps marketers improve their customer base
and work on the target areas. It helps group people based on their similarity in
many ways related to the product under consideration.
Clustering helps in identification of group of houses on the basis of their value,
type and geographical locations.
Clustering is used to study earth-quake. Based on the areas hit by an earthquake in
a region, clustering can help analyse the next probable location where earthquake
can occur.
In the field of biology, it can be used to derive plant and animal taxonomies,
categorize genes with similar functionalities and gain insight into structures
inherent to populations.
Clustering also helps in classifying documents on the web for information discovery.
Clustering is also used in outlier detection applications such as detection of credit
card fraud.
As a data mining function, cluster analysis serves as a tool to gain insight into the
distribution of data to observe characteristics of each cluster.
Let‘s understand this with an example. Suppose, you are the head of a rental store
and wish to understand preferences of your costumers to scale up your business. Is
it possible for you to look at details of each costumer and devise a unique business
strategy for each one of them? Definitely not. But, what you can do is to cluster all
of your costumers into say 10 groups based on their purchasing habits and use a
separate strategy for costumers in each of these 10 groups. And this is what we
call clustering.
Search Result Grouping - Examples: Netflix - A well-known application of clustering
algorithms are Netflix recommendation systems. Although the company is quite
discreet with its algorithms, it is confirmed that there are about 2,000 clusters or
communities that have common audiovisual tastes. Cluster 290 is the one that
includes people who like the series "Lost", "Black Mirror" and "Groundhog Day".
Netflix uses these clusters to refine its knowledge of the tastes of viewers and
thus make better decisions in the creation of new original series.
Example : Fraud Detection/Anomaly Detection - Classification is commonly used in
the financial sector. In the era of online transactions where the use of cash has
decreased markedly, it is necessary to determine whether movements made
through cards are safe. Entities can classify transactions as correct or fraudulent
using historical data on customer behavior to detect fraud very accurately.
Clustering analysis is broadly used in many applications such as market research,
pattern recognition, data analysis, and image processing, Social network analysis,
Recommendation engines, Medical imaging, Image segmentation.
Applications of Classification:
Classification is used in many fields, such as biology or in the Dewey decimal
classification for books, in the detection of spam in e-mails...
Classification is used when you need to know users or customers to decide which
products or campaigns will be launched in the future. For example, at Bismart we
developed a project for the insurance industry in which the client needed to
classify customers according to accident claims, so that the policy could be
classified according to the number of claims predicted. Thus, the company can
choose the costumers with the lowest number of claims.
1. Partitioning Method :
Suppose we are given a database of ‗n‘ objects and the partitioning method constructs
‗k‘ partition of data. Each partition will represent a cluster and k ≤ n. It means that it
will classify the data into k groups, which satisfy the following requirements −
Each group contains at least one object.
Each object must belong to exactly one group.
For a given number of partitions (say k), the partitioning method will create an initial
partitioning. Then it uses the iterative relocation technique to improve the partitioning
by moving objects from one group to other.
2. Centroid Method:
These are iterative clustering algorithms in which the notion of similarity is derived by
the closeness of a data point to the centroid of the clusters. Centroid-based clustering
organizes the data into non-hierarchical clusters. K-means is the most widely-used and
popular centroid-based clustering algorithm K-Means clustering algorithm is a popular
algorithm that falls into this category. Centroid-based algorithms are efficient but
sensitive to initial conditions and outliers. In these models, the no. of clusters required
at the end have to be mentioned beforehand, which makes it important to have prior
knowledge of the dataset. These models run iteratively to find the local optima.
These clustering models are based on the notion of how probable is it that all data
points in the cluster belong to the same distribution This clustering approach assumes
data is composed of distributions, such as Gaussian distributions or Normal
Distribution. In Figure, the distribution-based algorithm clusters data into three
Gaussian distributions. As distance from the distribution's center increases, the
probability that a point belongs to the distribution decreases. The bands show that
decrease in probability. When you do not know the type of distribution in your data,
you should use a different algorithm. These models often suffer from overfitting. A
popular example of these models is Expectation-maximization algorithm which uses
multivariate normal distributions.
5. Constraint-based Method:
In this method, the clustering is performed by the incorporation of user or application-
oriented constraints. A constraint refers to the user expectation or the properties of
desired clustering results. Constraints provide us with an interactive way of
communication with the clustering process. Constraints can be specified by the user or
the application requirement.
6. Connectivity Method: As the name suggests, these models are based on the notion
that the data points closer in data space exhibit more similarity to each other than the
data points lying farther away. These models can follow two approaches. In the first
approach, they start with classifying all data points into separate clusters & then
aggregating them as the distance decreases. In the second approach, all data points
are classified as a single cluster and then partitioned as the distance increases. Also,
the choice of distance function is subjective. These models are very easy to interpret
but lacks scalability for handling big datasets. Examples of these models are
hierarchical clustering algorithm and its variants.
7. Hierarchical Methods:
Hierarchical clustering, as the name suggests is an algorithm that builds hierarchy of
clusters. This algorithm starts with all the data points assigned to a cluster of their
own. Then two nearest clusters are merged into the same cluster. In the end, this
algorithm terminates when there is only a single cluster left. This method creates a
hierarchical decomposition of the given set of data objects. The results of hierarchical
clustering can be shown using dendrogram.
in the dendrogram at which two clusters are merged represents the distance
between two clusters in the data space. The decision of the no. of clusters that can
best depict different groups can be chosen by observing the dendrogram. The best
choice of the no. of clusters is the no. of vertical lines in the dendrogram cut by a
horizontal line that can transverse the maximum distance vertically without
intersecting a cluster. In the above example, the best choice of no. of clusters will be
4 as the red horizontal line in the dendrogram below covers maximum vertical
distance AB.
There are two approaches of Hierarchical Clustering –
Agglomerative Approach
Divisive Approach
Agglomerative Approach :
An agglomerative hierarchical clustering method is the most common type of
hierarchical clustering and is also known as a bottom-up approach. Bottom-up
algorithms treat each data point as a single cluster at the outset and then
successively merge (or agglomerate) pairs of clusters until all clusters have been
merged into a single larger cluster that contains all data points until all the objects
are in a single cluster or certain termination conditions are satisfied. The single
cluster becomes the hierarchy‘s root. For the merging step, it finds the two clusters
that are closest to each other (according to some similarity measure), and combines
the two to form one cluster. Because two clusters are merged per iteration, where
each cluster contains at least one object, an agglomerative method requires at most n
iterations.
Divisive Approach:
A divisive hierarchical clustering method employs a top-down strategy. It starts by
placing all objects or data sets in one cluster, which is the hierarchy‘s root. It then
divides the root cluster into several smaller subclusters, and recursively partitions
those clusters into smaller ones. The partitioning process continues until each cluster
at the lowest level is coherent enough—either containing only one object, or the
objects within a cluster are sufficiently similar to each other. In either
agglomerative or divisive hierarchical clustering, a user can specify the desired
number of clusters as a termination condition.
Steps of Divisive Clustering:
Initially, all points in the dataset belong to one single cluster.
Partition the cluster into two least similar cluster
Proceed recursively to form new clusters until the desired number of clusters is
obtained.
K Means Clustering:
K-Means is probably the most well-known clustering algorithm. K means is an iterative
clustering algorithm that aims to find local maxima in each iteration.
K-Means clustering is an unsupervised iterative clustering technique.
It partitions the given data set into k predefined distinct clusters.
A cluster is defined as a collection of data points exhibiting certain similarities.
It partitions the data set such that-
Each data point belongs to a cluster with the nearest mean.
Data points belonging to one cluster have high degree of similarity.
Data points belonging to different clusters have high degree of dissimilarity.
The distance function between two points a = (x1, y1) and b = (x2, y2) is defined as-
Ρ(a, b) = |x2 – x1| + |y2 – y1|
Iteration-01:
Calculating Distance between A1(2, 10) and C1(2, 10) -
Ρ(A1, C1) = |x2 – x1| + |y2 – y1| = |2 – 2| + |10 – 10| = 0
Calculating Distance between A1(2, 10) and C2(5, 8) -
Ρ(A1, C2) = |x2 – x1| + |y2 – y1| = |5 – 2| + |8 – 10| = 3+2 = 5
Calculating Distance between A1(2, 10) and C3(1, 2) -
Ρ(A1, C3) = |x2 – x1| + |y2 – y1| = |1 – 2| + |2 – 10| = 1+8 = 9
In the similar manner, we calculate the distance of other points from each of the
center of the three clusters. Next, we draw a table showing all the results. Using the
table, we decide which point belongs to which cluster. The given point belongs to that
cluster whose center is nearest to it.
Now, we re-compute the new cluster. The new cluster center is computed by taking
mean of all the points contained in that cluster.
For Cluster-01: We have only one point A1(2, 10) in Cluster-01.
So, cluster center remains the same.
For Cluster-02:
Center of Cluster-02 = ((8 + 5 + 7 + 6 + 4)/5, (4 + 8 + 5 + 4 + 9)/5) = (6, 6)
For Cluster-03:
Center of Cluster-03 = ((2 + 1)/2, (5 + 2)/2) = (1.5, 3.5)
This is completion of Iteration-01.
Problem-2
Cluster the following five points (with (x, y) representing locations) into two clusters:
A(2, 2), B(3, 2), C(1, 1), D(3, 1), E(1.5, 0.5). Initial cluster centers are: A(2, 2), C(1, 1).
Use K-Means Algorithm to find the two cluster centers and the points belonging to two
different clusters. Use Euclidean distance measure.
In the similar manner, we calculate the distance of other points from each of the
center of the two clusters. Next, we draw a table showing all the results. Using the
table, we decide which point belongs to which cluster. The given point belongs to that
cluster whose center is nearest to it.
For Cluster-01:
Center of Cluster-01 = ((2 + 3 + 3)/3, (2 + 2 + 1)/3) = (2.67, 1.67)
For Cluster-02:
Center of Cluster-02 = ((1 + 1.5)/2, (1 + 0.5)/2) = (1.25, 0.75)
Problem 3:
Divide the given sample data in two clusters using K-Means Algorithm. Use Euclidean
Distance. Initial cluster centers are C1(185,72) and C2(170,56).
Height (H) Weight (W)
185 72
170 56
168 60
179 68
182 72
188 77
180 71
180 70
183 84
180 88
K-Mode Clustering :
Clustering is an unsupervised learning method whose task is to divide the population or
data points into a number of groups, such that data points in a group are more similar
to other data points in the same group and dissimilar to the data points in other groups.
It is basically a collection of objects based on similarity and dissimilarity between them.
KModes vs KMeans
K-means clustering works efficiently only for numerical dataset. K-Means Clustering
fails to find patterns in the categorical dataset. KModes clustering is one of the
unsupervised Machine Learning algorithms that groups similar data points into clusters
based on their categorical attributes. Unlike traditional clustering algorithms that use
distance metrics, KModes works by identifying the modes or most frequent values
within each cluster to determine its centroid. This is in contrast to the more well-
known k-means algorithm, which clusters numerical data based on Euclidean distance.
KMeans uses mathematical measures (distance) to cluster continuous data. The lesser
the distance, the more similar our data points are. Centroids are updated by Means.
But for categorical data points, we cannot calculate the distance. So we go for KModes
algorithm. It uses the dissimilarities(total mismatches) between the data points. The
lesser the dissimilarities the more similar our data points are. It uses Modes instead of
means.
Most of the real world datasets are in categorical form. For example, if we are working
on analysing the social media, we have categorical data like gender (male or female),
profession and so on. KModes is ideal for clustering categorical data such as customer
demographics, market segments, or survey responses. So to deal with all this
categorical data or cluster the categorical variables we use K Modes Clustering. It is
widely used algorithm for grouping the categorical data because it is easy to implement
and efficiently handles large amount of data. It defines clusters based on the number
of matching categories between data points.
Algorithm of K-Mode:
Input: Data objects X, Number of clusters K.
Step 1. Pick K observations at random and use them as leaders/clusters.
Step 2. Calculate the simple matching dissimilarity measure between the each K initial
cluster modes and each data objects using the following equation and assign each
observation to its closest cluster. The dissimilarity measure is defined as follows:
Step 3. Find the minimum mode values in each data object i.e. finding the objects
nearest to the initial cluster modes.
Step 4: Assign the data objects to the nearest cluster centroid modes.
Step 5: Update the modes by frequency-based method on newly formed clusters.
Step 6: Recalculate the similarity between the data objects and the updated modes.
Step 7: Repeat the step 3 and step 4 until no changes in the cluster ship of data
objects.
Output: Clustered data objects
Problem 1: Imagine we have a dataset that has the information about hair color, eye
color, and skin color of persons. We aim to group them based on the available
information. Hair color, eye color, and skin color are all categorical variables. Below is
how the dataset looks like.
Step 1: Pick K observations at random and use them as leaders/clusters. Here P1, P7,
P8 are chosen as leaders/clusters.
After step 2, the observations P1, P2, P5 are assigned to cluster 1; P3, P7 are assigned
to Cluster 2; and P4, P6, P8 are assigned to cluster 3.
Note: If all the clusters have the same dissimilarity with an observation, assign to any
cluster randomly. In our case, the observation P2 has 3 dissimilarities with all the
leaders. So it was randomly assigned to Cluster 1.
Considering one cluster at a time, for each feature, look for the Mode and update the
new leaders.
Explanation: Cluster 1 observations(P1, P2, P5) has brunette as the most observed hair
color, amber as the most observed eye color, and fair as the most observed skin color.
Note: If you observe the same occurrence of values, take the mode randomly. In
our case, the observations of Cluster 3(P3, P7) have one occurrence of brown, fair
skin color. Brown is randomly chosen as the mode.
Below are our new leaders after the update.
Likewise, calculate all the dissimilarities and put them in a matrix. Assign each
observation to its closest cluster.
The observations P1, P2, P5 are assigned to Cluster 1; P3, P7 are assigned to Cluster 2;
and P4, P6, P8 are assigned to Cluster 3.
We stop here as we see there is no change in the assignment of observations.
DBSCAN requires only two parameters: epsilon and minPoints. Epsilon is the radius of
the circle to be created around each data point to check the density and minPoints is
the minimum number of data points required inside that circle for that data point to be
classified as a Core point. In higher dimensions the circle becomes
hypersphere, epsilon becomes the radius of that hypersphere, and minPoints is the
minimum number of data points required inside that hypersphere. Here, we have some
data points represented by grey color. Let‘s see how DBSCAN clusters these data
points.
DBSCAN creates a circle of epsilon radius around every data point and classifies them
into Core point, Border point, and Noise. A data point is a Core point if the circle
around it contains at least ‗minPoints’ number of points. If the number of points is less
than minPoints, then it is classified as Border Point, and if there are no other data
points around any data point within epsilon radius, then it treated as Noise.
The above figure shows us a cluster created by DBCAN with minPoints = 3. Here, we
draw a circle of equal radius epsilon around every data point. These two parameters
help in creating spatial clusters.
All the data points with at least 3 points in the circle including itself are considered
as Core points represented by red color. All the data points with less than 3 but
greater than 1 point in the circle including itself are considered as Border points. They
are represented by yellow color. Finally, data points with no point other than itself
present inside the circle are considered as Noise represented by the purple color.
For locating data points in space, DBSCAN uses Euclidean distance.
It does not make sense to take minPoints as 1 because it will result in each point being
a separate cluster. Therefore, it must be at least 3. Generally, it is twice the
dimensions. But domain knowledge also decides its value. The value of epsilon can be
decided from the K-distance graph. The point of maximum curvature (elbow) in this
graph tells us about the value of epsilon. If the value of epsilon chosen is too small then
a higher number of clusters will be created, and more data points will be taken as noise.
Whereas, if chosen too big then various small clusters will merge into a big cluster, and
we will lose details. Here, both X and Y are density-reachable from O, therefore, we
can say that X is density-connected from Y.
Why DBSCAN?
Partitioning methods (K-means, PAM clustering) and hierarchical clustering work for
finding spherical-shaped clusters or convex clusters. In other words, they are suitable
only for compact and well-separated clusters. K-Means and Hierarchical Clustering
both fail in creating clusters of arbitrary shapes. They are not able to form clusters
based on varying densities. That‘s why we need DBSCAN clustering. Moreover, they are
also severely affected by the presence of noise and outliers in the data.
Real-life data may contain irregularities, like: Clusters can be of arbitrary shape, Data
may contain noise. Given a dataset containing non-convex shape clusters and outliers,
the k-means algorithm has difficulties in identifying these clusters with arbitrary
shapes. DBSCAN is not just able to cluster the data points correctly, but it also
perfectly detects noise in the dataset.
11. Divide the given sample data in two clusters using K-Means
Algorithm. Use Euclidean Distance. Initial cluster centers are
C1(185,72) and C2(170,56).
Height (H) Weight (W)
185 72
170 56
168 60
179 68 L3 10
182 72
188 77
180 71
180 70
183 84
180 88
12. Why use Random forest? How does Random Forest algorithm
L2 10
work? Explain with an example.