
Supervised Learning Techniques Overview

Social network analysis

Uploaded by

Varshh
Copyright
© All Rights Reserved

SOCIAL NETWORK AND ANALYSIS

UNIT - I

Supervised Learning – Decision Tree – Naïve Bayesian Text Classification – Support Vector Machines – Ensemble of Classifiers – Unsupervised Learning – K-means Clustering – Hierarchical Clustering – Partially Supervised Learning – Markov Models – Probability-Based Clustering – Vector Space Model

1. SUPERVISED LEARNING

 In supervised learning, a model is trained on a labelled dataset, i.e. a dataset in which every input example is paired with its correct output. The algorithm learns a mapping from inputs to the correct outputs. Both the training and the validation datasets are labelled.

 Let’s understand it with the help of an example.

 Example: Suppose you have to build an image classifier that differentiates between cats and dogs. If you feed labelled images of dogs and cats to the algorithm, it learns to tell the two apart from these labelled images. When it is given a new dog or cat image that it has never seen before, it uses the learned model to predict whether the image shows a dog or a cat. This is how supervised learning works; this particular task is image classification.

Supervised Learning

There are two main categories of supervised learning that are mentioned below:

 Classification

 Regression

Classification

Classification deals with predicting categorical target variables, which represent discrete classes or labels. For instance, classifying emails as spam or not spam, or predicting whether a patient has a high risk of heart disease. Classification algorithms learn to map the input features to one of the predefined classes.

Here are some classification algorithms:

 Logistic Regression

 Support Vector Machine

 Random Forest

 Decision Tree

 K-Nearest Neighbors (KNN)

 Naive Bayes

Regression

Regression, on the other hand, deals with predicting continuous target variables, which represent numerical values. For example, predicting the price of a house based on its size, location, and amenities, or forecasting the sales of a product. Regression algorithms learn to map the input features to a continuous numerical value.

Here are some regression algorithms:

 Linear Regression

 Polynomial Regression

 Ridge Regression

 Lasso Regression

 Decision tree

 Random Forest

Advantages of Supervised Machine Learning

 Supervised learning models can achieve high accuracy because they are trained on labelled data.

 The decision-making process of many supervised models (for example, decision trees) is interpretable.

 Pre-trained supervised models can often be reused, which saves time and resources compared with developing new models from scratch.

Disadvantages of Supervised Machine Learning

 It may struggle with unseen or unexpected patterns that are not present in the training data.

 It can be time-consuming and costly because it relies on labelled data.

 It may generalize poorly to new data.

Applications of Supervised Learning

Supervised learning is used in a wide variety of applications, including:

 Image classification: Identify objects, faces, and other features in images.

 Natural language processing: Extract information from text, such as sentiment, entities, and relationships.

 Speech recognition: Convert spoken language into text.

 Recommendation systems: Make personalized recommendations to users.

 Predictive analytics: Predict outcomes, such as sales, customer churn, and stock prices.

 Medical diagnosis: Detect diseases and other medical conditions.

 Fraud detection: Identify fraudulent transactions.

 Autonomous vehicles: Recognize and respond to objects in the environment.

 Email spam detection: Classify emails as spam or not spam.

 Quality control in manufacturing: Inspect products for defects.

 Credit scoring: Assess the risk of a borrower defaulting on a loan.

 Gaming: Recognize characters, analyze player behavior, and create NPCs.

 Customer support: Automate customer support tasks.

 Weather forecasting: Make predictions for temperature, precipitation, and other meteorological parameters.

2. DECISION TREE

INTRODUCTION

Decision Tree is a supervised learning technique that can be used for both classification and regression problems, but it is mostly preferred for solving classification problems.

 It is a tree-structured classifier.
 Internal nodes represent the features of a dataset.
 Branches represent the decision rules.
 Each leaf node represents the outcome.
 In a decision tree, there are two types of nodes:
 1. Decision Node: used to make a decision; it has multiple branches.
 2. Leaf Node: the output of those decisions; it does not contain any further branches.
 The decisions or tests are performed on the basis of the features of the given dataset.
 It is a graphical representation for getting all the possible solutions to a problem/decision based on given conditions.
 It is called a decision tree because, similar to a tree, it starts with the root node, which expands on further branches and constructs a tree-like structure.
 In order to build a tree, we can use the CART algorithm, which stands for Classification and Regression Tree algorithm.
 A decision tree simply asks a question and, based on the answer (Yes/No), further splits the tree into subtrees.

DECISION TREE TERMINOLOGIES

 Root Node: The root node is where the decision tree starts. It represents the entire dataset, which is further divided into two or more homogeneous sets.
 Leaf Node: Leaf nodes are the final output nodes; the tree cannot be split further after a leaf node.
 Splitting: Splitting is the process of dividing a decision node/root node into sub-nodes according to the given conditions.
 Branch/Sub Tree: A subsection of the tree formed by splitting a node.
 Pruning: Pruning is the process of removing unwanted branches from the tree.
 Parent/Child node: A node that is split into sub-nodes is called the parent node of those sub-nodes, and the sub-nodes are called its child nodes. The root node is the only node without a parent.

How does the Decision Tree algorithm work?

 Step-1: Begin the tree with the root node, say S, which contains the complete dataset.
 Step-2: Find the best attribute in the dataset using an Attribute Selection Measure (ASM).
 Step-3: Divide S into subsets that contain the possible values of the best attribute.
 Step-4: Generate the decision tree node that contains the best attribute.
 Step-5: Recursively make new decision trees using the subsets of the dataset created in Step-3. Continue this process until a stage is reached where the nodes cannot be classified further; such a final node is called a leaf node.

Decision Tree: Attribute Selection Measures

There are two popular attribute selection measures:

o Information Gain
o Gini Index

✔ Information Gain:

 Information gain is the measurement of the change in entropy after a dataset is segmented on an attribute.
 It calculates how much information a feature provides us about a class.
 According to the value of information gain, we split the node and build the decision tree.
 A decision tree algorithm always tries to maximize the value of information gain, and the node/attribute having the highest information gain is split first.
 $\text{Information Gain} = \text{Entropy}(S) - \sum_{v} \frac{|S_v|}{|S|} \, \text{Entropy}(S_v)$, i.e. the entropy of $S$ minus the weighted average entropy of the subsets $S_v$ produced by splitting on the attribute.
 Entropy: Entropy is a metric that measures the impurity of a set of samples. It specifies the randomness in the data.
 Entropy can be calculated as:
 $\text{Entropy}(S) = -P(\text{yes}) \log_2 P(\text{yes}) - P(\text{no}) \log_2 P(\text{no})$
 where $S$ = the set of samples, $P(\text{yes})$ = probability of yes,
 $P(\text{no})$ = probability of no.
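
The entropy and information gain formulas can be checked on a small dataset. The following is a minimal sketch, not a full decision tree implementation; the toy rows, the attribute names `student` and `income`, and the helper names are assumptions made up for illustration.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def information_gain(rows, labels, attribute):
    """Entropy(S) minus the weighted average entropy of the subsets
    obtained by splitting the rows on `attribute`."""
    total = len(labels)
    subsets = {}
    for row, label in zip(rows, labels):
        subsets.setdefault(row[attribute], []).append(label)
    weighted = sum(len(s) / total * entropy(s) for s in subsets.values())
    return entropy(labels) - weighted

# Hypothetical toy data: does a customer buy the product?
rows = [
    {"student": "yes", "income": "high"},
    {"student": "yes", "income": "low"},
    {"student": "no", "income": "high"},
    {"student": "no", "income": "low"},
]
labels = ["yes", "yes", "no", "no"]

gain_student = information_gain(rows, labels, "student")
gain_income = information_gain(rows, labels, "income")

# The attribute with the highest information gain is split first.
best_attribute = max(["student", "income"],
                     key=lambda a: information_gain(rows, labels, a))
print(gain_student, gain_income, best_attribute)
```

Here `student` separates the two classes perfectly, so its gain equals the full entropy of the dataset, while `income` leaves both subsets as mixed as before and has zero gain.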

Decision Tree: Attribute Selection Measures

 Gini Index
 The Gini index is a measure of impurity (or purity) used while creating a decision tree in the CART (Classification and Regression Tree) algorithm.
 An attribute with a low Gini index should be preferred over one with a high Gini index.
 The CART algorithm uses the Gini index to create splits, and it creates only binary splits.
 The Gini index can be calculated using the formula below:
 $\text{Gini Index} = 1 - \sum_{j} P_j^2$, where $P_j$ is the proportion of samples that belong to class $j$.
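
As a rough illustration of how the Gini index ranks candidate binary splits, the sketch below computes it for a node and for a split. The function names and label lists are assumptions chosen for the example, and the weighted average over the two children is the usual way CART scores a split.

```python
from collections import Counter

def gini(labels):
    """Gini index 1 - sum_j P_j^2 for a list of class labels."""
    total = len(labels)
    return 1.0 - sum((c / total) ** 2 for c in Counter(labels).values())

def gini_of_split(left, right):
    """Size-weighted Gini index of the two children of a binary split."""
    total = len(left) + len(right)
    return len(left) / total * gini(left) + len(right) / total * gini(right)

pure = gini(["yes", "yes", "yes", "yes"])   # a pure node
mixed = gini(["yes", "yes", "no", "no"])    # an evenly mixed node

# Two hypothetical candidate splits of the mixed node.
split_a = gini_of_split(["yes", "yes"], ["no", "no"])
split_b = gini_of_split(["yes", "no"], ["yes", "no"])
print(pure, mixed, split_a, split_b)
```

The split with the lower value (`split_a`) would be preferred, which matches the rule that a low Gini index is better.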

3. NAÏVE BAYESIAN

Naive Bayes and Text Classification

Definition

Naive Bayes is a probabilistic machine learning algorithm used for classification tasks. It
is based on Bayes' Theorem with a strong (naive) assumption that the features (predictors)
are independent of each other given the class label.

In text classification, Naive Bayes is used to predict the category of a given text document
(e.g., spam or not spam) based on the words it contains.

📐 Bayes’ Theorem Formula

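Bayes' theorem relates the probability of a class $C$ given the features $X$ to quantities that can be estimated from training data:

$P(C \mid X) = \frac{P(X \mid C) \, P(C)}{P(X)}$

Assume that:

 $X = (x_1, x_2, \dots, x_n)$ are the features
 All features are conditionally independent given the class $C$

Then:

$P(C \mid X) \propto P(C) \cdot \prod_{i=1}^{n} P(x_i \mid C)$

$P(X)$ can be ignored in most cases because it is the same for all classes (it is used only for normalization).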

⚙️Naive Bayes Algorithm (for Text Classification)

Step-by-step:

1. Collect Training Data: Labeled text documents (e.g., emails marked as spam or not spam).
2. Preprocess Text:
o Tokenize (split text into words)
o Remove stopwords (e.g., "the", "is")
o Stem or lemmatize (reduce words to root form)
3. Convert to Numerical Features:
o Bag of Words or TF-IDF
4. Calculate Probabilities:
o Calculate prior probability for each class:

$P(C_i) = \frac{\text{documents in class } C_i}{\text{total documents}}$

o For each word $w$, calculate the likelihood:

$P(w \mid C_i) = \frac{\text{count}(w \text{ in class } C_i) + 1}{\text{total words in class } C_i + V}$

(Laplace Smoothing: add 1 to avoid zero probability; $V$ = vocabulary size)

5. Classify New Text:
o For each class, compute:

$P(C_i) \cdot \prod_{k=1}^{n} P(w_k \mid C_i)$

o Choose the class with the highest score.
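
The steps above can be sketched in a few lines of plain Python. This is only an illustrative sketch under several assumptions: the documents are already tokenized, the four training documents and the function names are invented for the example, and scores are summed as logarithms (a common choice to avoid numerical underflow) instead of multiplying the raw probabilities.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(documents):
    """documents: list of (list_of_tokens, class_label) pairs."""
    class_doc_counts = Counter(label for _, label in documents)
    word_counts = defaultdict(Counter)   # per-class word frequencies
    vocabulary = set()
    for tokens, label in documents:
        word_counts[label].update(tokens)
        vocabulary.update(tokens)
    return class_doc_counts, word_counts, vocabulary

def classify(tokens, model):
    class_doc_counts, word_counts, vocabulary = model
    total_docs = sum(class_doc_counts.values())
    v = len(vocabulary)
    scores = {}
    for label, n_docs in class_doc_counts.items():
        total_words = sum(word_counts[label].values())
        # log prior + sum of log likelihoods with Laplace smoothing
        score = math.log(n_docs / total_docs)
        for w in tokens:
            score += math.log((word_counts[label][w] + 1) / (total_words + v))
        scores[label] = score
    return max(scores, key=scores.get), scores

# Hypothetical, already preprocessed training documents.
training = [
    (["free", "win", "prize"], "spam"),
    (["win", "cash", "free"], "spam"),
    (["meeting", "schedule", "project"], "not spam"),
    (["project", "report", "meeting"], "not spam"),
]
model = train_naive_bayes(training)
predicted, scores = classify(["free", "cash", "prize"], model)
print(predicted)
```

Because the logarithm is monotonic, the class with the highest log score is also the class with the highest product of probabilities.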

Example: Spam Detection

Let's say we want to classify an email as Spam or Not Spam using the words in the email.

We have a small dataset:

Email   Word: "Free"   Word: "Win"   Spam/Not Spam
1       Yes            Yes           Spam
2       Yes            No            Spam
3       No             Yes           Not Spam
4       No             No            Not Spam

Step 1: Calculate Prior Probabilities

P(Spam) = 2/4 = 0.5, P(Not Spam) = 2/4 = 0.5

Step 2: Calculate Likelihoods

(How often each word appears in each class)

For Spam (2 emails):

 P("Free" = Yes | Spam) = 2/2 = 1.0


 P("Win" = Yes | Spam) = 1/2 = 0.5

For Not Spam (2 emails):

 P("Free" = Yes | Not Spam) = 0/2 = 0


 P("Win" = Yes | Not Spam) = 1/2 = 0.5

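The estimate P("Free" = Yes | Not Spam) = 0 would force the whole product for Not Spam to zero, so we apply Laplace smoothing to it (add 1 to the count and 2, the number of possible values Yes/No, to the total):

P("Free" = Yes | Not Spam) = (0 + 1) / (2 + 2) = 0.25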

Step 3: Predict New Email

New email: contains "Free" and "Win" → [Free=Yes, Win=Yes]

For Spam:

P(Spam∣Free, Win)∝P(Spam)⋅P(Free∣Spam)⋅P(Win∣Spam)=0.5⋅1.0⋅0.5=0.25

For Not Spam:

P(Not Spam∣Free, Win)∝0.5⋅0.25⋅0.5=0.0625

Prediction:

Since 0.25 > 0.0625, the email is predicted as Spam
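
The arithmetic of this example can be reproduced with a short script. The sketch below is an assumption-laden illustration: it encodes the four emails from the table, computes the scores exactly as in the text (smoothing only the zero estimate), and, for comparison, also computes them with Laplace smoothing applied to every likelihood. The function names are made up for the example.

```python
# The four emails from the table: (features, label).
emails = [
    ({"Free": True, "Win": True}, "Spam"),
    ({"Free": True, "Win": False}, "Spam"),
    ({"Free": False, "Win": True}, "Not Spam"),
    ({"Free": False, "Win": False}, "Not Spam"),
]

def likelihood(word, label, smooth):
    """P(word = Yes | label), optionally with Laplace smoothing."""
    in_class = [features for features, y in emails if y == label]
    hits = sum(1 for features in in_class if features[word])
    if smooth:
        # +1 in the numerator, +2 (values Yes/No) in the denominator
        return (hits + 1) / (len(in_class) + 2)
    return hits / len(in_class)

def score(label, smooth_all):
    """Prior times the likelihoods of Free=Yes and Win=Yes."""
    result = sum(1 for _, y in emails if y == label) / len(emails)
    for word in ("Free", "Win"):
        p = likelihood(word, label, smooth_all)
        if p == 0:  # as in the text: smooth only a zero estimate
            p = likelihood(word, label, True)
        result *= p
    return result

as_in_text = {c: score(c, False) for c in ("Spam", "Not Spam")}
fully_smoothed = {c: score(c, True) for c in ("Spam", "Not Spam")}
print(as_in_text)       # scores as computed in the worked example
print(fully_smoothed)   # scores when every likelihood is smoothed
```

Smoothing every likelihood changes the Spam score (0.5 × 0.75 × 0.5) but not the predicted class.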

Advantages

 Simple and fast to train and predict.


 Works well with high-dimensional data, like text.
 Requires less training data.
 Performs well in multi-class problems.
 Robust to irrelevant features.

Limitations

 Strong independence assumption between features.


 If a word in test data was not seen in training, it can give zero probability (handled
by Laplace smoothing).
 Doesn’t handle contextual or sequential relationships (like in deep NLP).
 Assumes all features contribute equally, which may not always be true.

Applications of Naive Bayes in Text Classification

1. Spam Filtering (Spam vs. Not Spam)


2. Sentiment Analysis (Positive, Negative, Neutral)
3. Document Categorization (News: Politics, Sports, Entertainment)
4. Language Detection (Detect whether a text is in English, Spanish, etc.)
5. Email Classification (Work vs. Personal)
6. Fake News Detection

4. SUPPORT VECTOR MACHINE

Support Vector Machine (SVM) is a supervised machine learning algorithm used for both classification and regression. Although it can handle regression problems as well, it is best suited for classification.
The main objective of the SVM algorithm is to find the optimal hyperplane in an N-dimensional space that separates the data points of different classes in the feature space. The hyperplane is chosen so that the margin between the closest points of different classes is as large as possible. The dimension of the hyperplane depends on the number of features. If the number of input features is two, the hyperplane is just a line. If the number of input features is three, the hyperplane becomes a 2-D plane. It becomes difficult to imagine when the number of features exceeds three.
Let’s consider two independent variables x1, x2, and one dependent variable which is either a blue circle or a red circle.

From the figure above it is very clear that there are multiple lines (our hyperplane here is a line because we are considering only two input features x1, x2) that segregate our data points, i.e. classify them into red and blue circles.

Support Vector Machine Terminology:

1. Hyperplane: The hyperplane is the decision boundary that separates the data points of different classes in a feature space. In the case of linear classification, it is given by a linear equation, i.e. w·x + b = 0.
2. Support Vectors: Support vectors are the data points closest to the hyperplane; they play a critical role in determining the hyperplane and the margin.
3. Margin: The margin is the distance between the hyperplane and the support vectors. The main objective of the support vector machine algorithm is to maximize the margin. A wider margin usually indicates better generalization.
4. Kernel: A kernel is a mathematical function used in SVM to map the original input data points into a high-dimensional feature space, so that a separating hyperplane can be found even if the data points are not linearly separable in the original input space. Some of the common kernel functions are linear, polynomial, radial basis function (RBF), and sigmoid.
5. Hard Margin: The maximum-margin hyperplane, or hard-margin hyperplane, is a hyperplane that separates the data points of different categories without any misclassifications.
6. Soft Margin: When the data is not perfectly separable or contains outliers, SVM permits a soft-margin technique. The soft-margin formulation introduces a slack variable for each data point, which relaxes the strict margin requirement and permits certain misclassifications or violations. It finds a compromise between increasing the margin and reducing violations.
7. C: The regularisation parameter C balances margin maximisation against misclassification penalties. It sets the penalty for points that fall inside the margin or are misclassified. A greater value of C imposes a stricter penalty, which results in a smaller margin and perhaps fewer misclassifications on the training data.
8. Hinge Loss: A typical loss function in SVMs is the hinge loss. It penalizes incorrect classifications and margin violations. The SVM objective function is frequently formed by combining it with the regularisation term.
9. Dual Problem: SVM can also be solved through the dual of its optimisation problem, which requires finding the Lagrange multipliers associated with the support vectors. The dual formulation enables the use of the kernel trick and more efficient computation.
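
To make the terms hyperplane, hinge loss, and C more concrete, here is a small sketch that evaluates a given linear boundary. It does not train an SVM; the weight vector, bias, data points, and function names are hypothetical, and the objective shown is the common soft-margin form 0.5·||w||² + C·Σ hinge loss.

```python
def decision_value(w, b, x):
    """Signed score w.x + b; its sign gives the predicted class."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def hinge_loss(w, b, x, y):
    """max(0, 1 - y * (w.x + b)) for a label y in {-1, +1}."""
    return max(0.0, 1.0 - y * decision_value(w, b, x))

def soft_margin_objective(w, b, data, C):
    """Regularisation term plus C times the summed hinge losses."""
    regulariser = 0.5 * sum(wi * wi for wi in w)
    return regulariser + C * sum(hinge_loss(w, b, x, y) for x, y in data)

# Hypothetical hyperplane x1 - x2 = 0 and four labelled points.
w, b = [1.0, -1.0], 0.0
data = [
    ([2.0, 0.0], 1),    # correct side, outside the margin
    ([0.0, 2.0], -1),   # correct side, outside the margin
    ([0.5, 0.0], 1),    # correct side, but inside the margin
    ([1.0, 0.0], -1),   # misclassified
]
losses = [hinge_loss(w, b, x, y) for x, y in data]
print(losses)
print(soft_margin_objective(w, b, data, C=1.0))
print(soft_margin_objective(w, b, data, C=10.0))
```

Points outside the margin on the correct side contribute no loss, a margin violation contributes a small loss, and a misclassified point contributes a loss above 1. Raising C makes the same violations count more heavily in the objective.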

Types of Support Vector Machine


Based on the nature of the decision boundary, Support Vector Machines (SVM) can be divided into two main types:
 Linear SVM: Linear SVMs use a linear decision boundary to separate the data points of different classes. They are well suited to data that is linearly separable, meaning that a single straight line (in 2D) or a hyperplane (in higher dimensions) can entirely divide the data points into their respective classes. The decision boundary is the hyperplane that maximizes the margin between the classes.
 Non-Linear SVM: Non-linear SVMs are used to classify data that cannot be separated into two classes by a straight line (in the case of 2D). By using kernel functions, they can handle data that is not linearly separable. These kernel functions transform the original input data into a higher-dimensional feature space where the data points can be linearly separated. A linear SVM in this transformed space corresponds to a nonlinear decision boundary in the original space.

Popular kernel functions in SVM

The SVM kernel is a function that takes a low-dimensional input space and transforms it into a higher-dimensional space, i.e. it converts non-separable problems into separable problems. It is mostly useful in non-linear separation problems. Simply put, the kernel performs a complex transformation of the data and then finds a way to separate the data based on the labels or outputs defined.
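
Three of the commonly used kernels can be written as plain functions of two input vectors. This is a sketch for illustration only; the parameter names `degree`, `coef0`, and `gamma` and their default values are assumptions, and the two sample vectors are made up.

```python
import math

def linear_kernel(x, z):
    """K(x, z) = x . z"""
    return sum(a * b for a, b in zip(x, z))

def polynomial_kernel(x, z, degree=2, coef0=1.0):
    """K(x, z) = (x . z + coef0) ** degree"""
    return (linear_kernel(x, z) + coef0) ** degree

def rbf_kernel(x, z, gamma=0.5):
    """K(x, z) = exp(-gamma * ||x - z||^2)"""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * squared_distance)

x, z = [1.0, 2.0], [3.0, 0.0]
print(linear_kernel(x, z))
print(polynomial_kernel(x, z))
print(rbf_kernel(x, z), rbf_kernel(x, x))
```

Each kernel returns a similarity score for a pair of points. The RBF kernel equals 1 for identical points and decays towards 0 as the points move apart, which is what lets an SVM draw curved boundaries in the original space.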
Advantages of SVM
 Effective in high-dimensional spaces.
 It is memory efficient because the decision function uses only a subset of the training points, called support vectors.
 Different kernel functions can be specified for the decision function, and it is possible to specify custom kernels.

5. ENSEMBLE CLASSIFIER

Ensemble learning is a method in which many small models are used instead of just one. Each of these models may not be very strong on its own, but when we put their results together, we get a better and more accurate answer.

Ensemble classifiers are models that combine the predictive power of several models to produce a model more powerful than the individual ones. A group of classifiers is learned and the final prediction is selected using a voting mechanism.

Ensemble methods combine multiple learning algorithms to create a stronger predictor than any individual algorithm alone. The fundamental principle is that a group of weak learners can come together to form a strong learner.

Why do we need ensemble classifiers?

Ensemble models (classifiers) can solve many problems and have several advantages over single models:
 Prediction accuracy is increased.
 The accuracy of the final model can be higher even if individual base models fail to classify accurately.
 They can be parallelized, which enables efficient resource management.

 They can improve on the errors produced by previous models and generate efficient models.

Single Classifier:

Input → [Classifier] → Prediction

Ensemble:

Input → [Classifier 1] → Vote 1

→ [Classifier 2] → Vote 2 → [Combiner] → Final Prediction

→ [Classifier 3] → Vote 3

There are three main types of ensemble methods:
1. Bagging (Bootstrap Aggregating):
Models are trained independently on different random subsets of the training data. Their
results are then combined—usually by averaging (for regression) or voting (for
classification). This helps reduce variance and prevents overfitting.
2. Boosting:
Models are trained one after another. Each new model focuses on fixing the errors made
by the previous ones. The final prediction is a weighted combination of all models,
which helps reduce bias and improve accuracy.
3. Stacking (Stacked Generalization):
Multiple different models (often of different types) are trained, and their predictions are
used as inputs to a final model, called a meta-model. The meta-model learns how to best
combine the predictions of the base models, aiming for better performance than any
individual model.

Types of Ensemble Methods

Bagging (Bootstrap Aggregating)

Bagging is a technique that involves creating multiple versions of a model and combining their outputs to improve overall performance.
In bagging, several base models are trained on different subsets of the training data, and their predictions are then aggregated to make the final decision. The subsets of the data are created using bootstrapping, a statistical technique where samples are drawn with replacement, meaning some data points can appear more than once in a subset.
The final prediction from the ensemble is typically made by either:
 Averaging the predictions (for regression problems), or
 Majority voting (for classification problems).
This approach helps to reduce variance, especially with models that are prone to overfitting, such as decision trees.

Concept: Train multiple models on different bootstrap samples of the training data.

Bagging Algorithm:

1. Create k bootstrap samples from training data
2. Train a classifier on each sample
3. Combine predictions (voting/averaging)
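
The three steps can be sketched with a very simple base classifier. In this illustrative sketch the base model is a one-threshold "decision stump" on a single numeric feature, the dataset is a made-up toy set, and the names `train_stump`, `bagging_fit`, and `bagging_predict` are assumptions, not part of any library.

```python
import random
from collections import Counter

def train_stump(sample):
    """Fit a one-threshold classifier on (x, label) pairs by trying
    every observed x as the threshold and keeping the best one."""
    best = None
    for threshold, _ in sample:
        left = [y for x, y in sample if x < threshold]
        right = [y for x, y in sample if x >= threshold]
        right_label = Counter(right).most_common(1)[0][0]
        left_label = Counter(left).most_common(1)[0][0] if left else right_label
        errors = (sum(y != left_label for y in left)
                  + sum(y != right_label for y in right))
        if best is None or errors < best[0]:
            best = (errors, threshold, left_label, right_label)
    _, threshold, left_label, right_label = best

    def predict(x):
        return left_label if x < threshold else right_label
    return predict

def bagging_fit(data, k, seed=0):
    """Steps 1 and 2: draw k bootstrap samples and train one model on each."""
    rng = random.Random(seed)
    models = []
    for _ in range(k):
        sample = [rng.choice(data) for _ in data]  # n draws with replacement
        models.append(train_stump(sample))
    return models

def bagging_predict(models, x):
    """Step 3: combine the individual predictions by majority vote."""
    votes = Counter(model(x) for model in models)
    return votes.most_common(1)[0][0]

# Hypothetical toy data: class 1 when x >= 5, else class 0.
data = [(x, int(x >= 5)) for x in range(10)]
models = bagging_fit(data, k=25)
print(bagging_predict(models, 0), bagging_predict(models, 9))
```

Each stump sees a slightly different sample and may place its threshold differently; the vote smooths out those differences, which is the variance reduction bagging aims for.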

Algorithm Explanation
Bagging can be used for both regression and classification tasks. Here is an overview of the bagging classifier algorithm:
 Bootstrap Sampling: Creates ‘N’ new training sets by randomly sampling rows from the original training data with replacement, so a row can appear several times in one sample and not at all in another. This step ensures that the base models are trained on diverse subsets of the data.
 Base Model Training: For each bootstrapped sample, we train a base model independently on that subset of data. Because the models do not depend on each other, they can be trained in parallel to reduce training time. Bagging usually uses the same learning algorithm for every base model, although different ML models can also be used as base learners to add variety.
 Prediction Aggregation: To make a prediction on test data, combine the predictions of all base models. For classification tasks this can be majority voting or weighted majority voting, while for regression it involves averaging the predictions.
 Out-of-Bag (OOB) Evaluation: Some samples are excluded from the training subset of a particular base model during bootstrapping. These “out-of-bag” samples can be used to estimate the model’s performance without the need for cross-validation.
 Final Prediction: After aggregating the predictions from all the base models, bagging produces a final prediction for each instance.

Original Dataset (n samples)

Bootstrap Sampling (with replacement)

↓ ↓ ↓

Sample 1 Sample 2 Sample 3

↓ ↓ ↓

Model 1 Model 2 Model 3

↓ ↓ ↓

Vote 1 Vote 2 Vote 3

Final Prediction

(Majority Vote)

Advantages:

 Reduces variance
 Parallel training possible
 Works well with high-variance models

Common Algorithms Using Bagging

1. Random Forest
 Random forest is an ensemble method based on decision trees. Multiple decision trees
are trained using different bootstrapped samples of the data.
 In addition to bagging, Random Forest also introduces randomness by selecting a
random subset of features at each node, further reducing variance and overfitting.
2. Bagged Decision Trees
 In Bagged Decision Trees, multiple decision trees are trained using bootstrapped
samples of the data.
 Each tree is trained independently, and the final prediction is made by majority voting (for classification) or by averaging the predictions of all the trees in the ensemble (for regression).

Boosting

Boosting is an ensemble learning technique that sequentially combines multiple weak classifiers to create a strong classifier. A first model is trained on the training data and evaluated. The next model is then built to correct the errors made by the first model. This procedure is continued, and models are added until either the complete training dataset is predicted correctly or a predefined number of iterations is reached.

Concept: Sequential training where each model focuses on correcting errors of previous
models.

Algorithm:

1. Train initial classifier


2. Identify misclassified samples
3. Increase weights of misclassified samples
4. Train next classifier on reweighted data
5. Repeat until desired number of classifiers

Algorithm explanation:

 Initialize Weights: Begin with a single weak learner and assign equal weights to all training examples.
 Train Weak Learner: Train the weak learner on this weighted dataset.
 Sequential Learning: Boosting works by training models sequentially, where each model focuses on correcting the errors of its predecessor. Boosting typically uses a single type of weak learner, such as decision trees.
 Weight Adjustment: Boosting assigns weights to training data points. Misclassified examples receive higher weights in the next iteration so that the next models pay more attention to them.

Diagram:

Training Data

[Classifier 1] → Errors identified

Reweight samples (↑ error weights)

[Classifier 2] → Errors identified

Reweight samples (↑ error weights)

[Classifier 3]

Weighted Combination → Final Prediction

In the diagram above, we start with the training data. Initially we train a weak learner w1 (decision tree Tree 1) on this data. In the next step we consider the errors produced by the first model and train a second weak learner w2 (decision tree Tree 2) that concentrates on those errors. This process is continued for n steps, until we reach a satisfactory accuracy or the maximum number of steps n.

Advantages:

 Reduces bias
 Can convert weak learners to strong learners
 Often achieves high accuracy

Examples: AdaBoost, Gradient Boosting, XGBoost

2.3 Stacking (Stacked Generalization)

Stacking is an ensemble learning technique in which a final model, known as the “stacked model”, combines the predictions from multiple base models. The goal is to create a stronger model by using different models and combining them.

Concept: Use a meta-classifier to learn how to best combine predictions from multiple base
classifiers.

Architecture:

Level 0 (Base Classifiers):

Input → [Classifier 1] → Prediction 1

→ [Classifier 2] → Prediction 2

→ [Classifier 3] → Prediction 3

Level 1 (Meta-Classifier):

[Pred 1, Pred 2, Pred 3] → [Meta-Classifier] → Final Prediction

Architecture of Stacking
Stacking architecture is like a team of models working together in two layers to improve
prediction accuracy. Each layer has a specific job and the process is designed to make the
final result more accurate than any single model alone. It has two parts:
1. Base Models (Level-0)
These are the first models that directly learn from the original training data. You can think
of them as the “helpers” that try to make predictions in their own way.
 Base models can be Decision Tree, Logistic Regression, Random Forest, etc.
 Each model is trained separately using the same training data.
2. Meta-Model (Level-1)
This is the final model that learns from the outputs of the base models instead of the raw data. Its job is to combine the base models’ predictions in a smart way to make the final prediction.
 A simple Linear Regression or Logistic Regression can act as a meta-model.
 It looks at the outputs of the base models and finds patterns in how they make mistakes
or agree.

Process:

1. Train base classifiers on training data


2. Use base classifier predictions as features for meta-classifier
3. Train meta-classifier to make final prediction

Steps to Implement Stacking


 Start with training data: We begin with the usual training data that contains both input features and the target output.
 Train base models: The base models are trained on this training data. Each model tries to make predictions based on what it learns.
 Generate predictions: After training, the base models make predictions on data they were not trained on, called validation data or out-of-fold data. These predictions are collected.
 Train meta-model: The meta-model is trained using the predictions from the base models as new features. The target output stays the same, and the meta-model learns how to combine the base model predictions.
 Final prediction: At test time, the base models make predictions on new, unseen data. These predictions are passed to the meta-model, which then gives the final prediction.

3. Combination Methods

3.1 Voting Methods

Hard Voting (Majority Vote):

In hard voting, the predicted output class is the class that receives the most votes, i.e., the class predicted by the largest number of classifiers.

Problem: 3-class classification (A, B, C)

Number of classifiers: 5

Classifier 1: Class A

Classifier 2: Class A

Classifier 3: Class B

Classifier 4: Class A

Classifier 5: Class C

Vote Count:

Class A: 3 votes ← Winner

Class B: 1 vote

Class C: 1 vote

Final Prediction: Class A
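
The vote count above can be reproduced with a few lines of Python. This is a minimal sketch; the function name `hard_vote` is an assumption, and ties are resolved by whichever class `Counter` happens to list first.

```python
from collections import Counter

def hard_vote(predictions):
    """Return the class with the most votes and the full vote count."""
    counts = Counter(predictions)
    winner, _ = counts.most_common(1)[0]
    return winner, dict(counts)

# Predictions of the five classifiers from the example.
winner, counts = hard_vote(["A", "A", "B", "A", "C"])
print(winner, counts)
```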

Soft Voting (Probability Averaging):

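In soft voting, each classifier outputs a probability for every class. The probabilities are averaged across the classifiers, and the class with the highest average probability is the final prediction.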

Problem: 3-class classification (A, B, C)

Number of classifiers: 3

Classifier 1 probabilities: [0.7, 0.2, 0.1] (A, B, C)

Classifier 2 probabilities: [0.3, 0.4, 0.3] (A, B, C)

Classifier 3 probabilities: [0.5, 0.3, 0.2] (A, B, C)

Average probabilities:

Class A: (0.7 + 0.3 + 0.5) / 3 = 0.50 ← Highest

Class B: (0.2 + 0.4 + 0.3) / 3 = 0.30

Class C: (0.1 + 0.3 + 0.2) / 3 = 0.20

Final Prediction: Class A (probability = 0.50)
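
The same averaging can be written as a short function. This is a sketch with an assumed function name `soft_vote`; the probability rows are the ones from the example.

```python
def soft_vote(probabilities, classes):
    """Average the per-class probabilities of all classifiers and
    return the class with the highest average."""
    n = len(probabilities)
    averages = [sum(p[i] for p in probabilities) / n
                for i in range(len(classes))]
    best = max(range(len(classes)), key=lambda i: averages[i])
    return classes[best], averages

label, averages = soft_vote(
    [[0.7, 0.2, 0.1],
     [0.3, 0.4, 0.3],
     [0.5, 0.3, 0.2]],
    ["A", "B", "C"],
)
print(label, [round(a, 2) for a in averages])
```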

3.2 Weighted Voting

Assign different weights to classifiers based on their performance:

Final Prediction = Σ(wi × pi) / Σ(wi)

where wi = weight of classifier i, pi = prediction of classifier i
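
The weighted formula can be applied to class probabilities in the same way as soft voting. In the sketch below the classifier weights are hypothetical (for instance, they could come from validation accuracy), and the function name `weighted_vote` is an assumption.

```python
def weighted_vote(probabilities, weights, classes):
    """Compute sum(w_i * p_i) / sum(w_i) per class and pick the best class."""
    total_weight = sum(weights)
    scores = [
        sum(w * p[i] for w, p in zip(weights, probabilities)) / total_weight
        for i in range(len(classes))
    ]
    best = max(range(len(classes)), key=lambda i: scores[i])
    return classes[best], scores

classes = ["A", "B", "C"]
probs = [[0.7, 0.2, 0.1],
         [0.3, 0.4, 0.3],
         [0.5, 0.3, 0.2]]

label_equal, _ = weighted_vote(probs, [1, 1, 1], classes)        # plain average
label_skewed, scores = weighted_vote(probs, [1, 8, 1], classes)  # trust classifier 2
print(label_equal, label_skewed)
```

With equal weights the result is the same as soft voting. Giving the second classifier a much larger weight lets its preferred class (B) win.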

4. Popular Ensemble Algorithms

4.1 Random Forest

Key Features:

 Bagging + Random feature selection

 Each tree trained on bootstrap sample
 Random subset of features at each split

Diagram:

Training Data

Bootstrap Samples + Random Features

↓ ↓ ↓

Tree 1 Tree 2 Tree 3 ... Tree n

↓ ↓ ↓ ↓

Majority Vote

Final Prediction

Advantages:

 Handles overfitting well


 Works with missing values
 Provides feature importance
 Fast training and prediction

4.2 AdaBoost (Adaptive Boosting)

Algorithm Steps:

1. Initialize equal weights for all samples


2. Train weak classifier
3. Calculate error rate and classifier weight
4. Update sample weights (increase for misclassified)
5. Normalize weights
6. Repeat steps 2-5

Weight Update Formula:

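$\alpha_t = \frac{1}{2} \ln\left(\frac{1 - \varepsilon_t}{\varepsilon_t}\right)$

$w_i^{(t+1)} = w_i^{(t)} \exp\left(-\alpha_t \, y_i \, h_t(x_i)\right)$

where $\varepsilon_t$ is the weighted error rate of the weak classifier $h_t$, $\alpha_t$ is its classifier weight, and the label $y_i$ and the prediction $h_t(x_i)$ both take values in $\{-1, +1\}$.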
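
One round of the update can be traced numerically. The sketch below assumes labels and predictions in {-1, +1} and an error rate strictly between 0 and 1; the five samples and the function name `adaboost_round` are invented for the example.

```python
import math

def adaboost_round(weights, labels, predictions):
    """One AdaBoost update: classifier weight alpha, then new
    normalized sample weights. Assumes 0 < error < 1."""
    error = sum(w for w, y, h in zip(weights, labels, predictions) if y != h)
    alpha = 0.5 * math.log((1 - error) / error)
    updated = [w * math.exp(-alpha * y * h)
               for w, y, h in zip(weights, labels, predictions)]
    total = sum(updated)
    return alpha, [w / total for w in updated]

labels      = [+1, +1, -1, -1, +1]
predictions = [+1, +1, -1, -1, -1]   # only the last sample is misclassified
weights = [0.2] * 5                  # equal initial weights

alpha, new_weights = adaboost_round(weights, labels, predictions)
print(round(alpha, 3), [round(w, 3) for w in new_weights])
```

After normalization the single misclassified sample carries half of the total weight, so the next weak classifier is pushed to get it right.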

4.3 Gradient Boosting

Concept: Fit new models to residual errors of previous models.

Process:

$F_0(x)$ = initial prediction

For $m = 1$ to $M$:

1. Compute residuals: $r_i = y_i - F_{m-1}(x_i)$

2. Fit weak learner $h_m(x)$ to the residuals

3. Update: $F_m(x) = F_{m-1}(x) + \alpha_m \, h_m(x)$
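
The loop above can be sketched for a one-dimensional regression problem. Several things here are assumptions made for illustration: the weak learner is a regression stump, the step size is a single constant `learning_rate` used in place of a per-round multiplier, the initial prediction is the mean of the targets, and the data points are invented.

```python
def fit_stump(xs, residuals):
    """Regression stump: choose the threshold with the lowest squared
    error and predict the mean residual on each side of it."""
    best = None
    for t in sorted(set(xs))[1:]:   # skip the minimum so the left side is non-empty
        left = [r for x, r in zip(xs, residuals) if x < t]
        right = [r for x, r in zip(xs, residuals) if x >= t]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = (sum((r - lm) ** 2 for r in left)
               + sum((r - rm) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, t, lm, rm)
    _, t, lm, rm = best

    def predict(x):
        return lm if x < t else rm
    return predict

def gradient_boost(xs, ys, n_rounds, learning_rate):
    f0 = sum(ys) / len(ys)   # F0: constant initial prediction
    learners = []

    def predict(x):
        return f0 + learning_rate * sum(h(x) for h in learners)

    for _ in range(n_rounds):
        residuals = [y - predict(x) for x, y in zip(xs, ys)]   # step 1
        learners.append(fit_stump(xs, residuals))              # steps 2 and 3
    return predict

def mse(model, xs, ys):
    return sum((y - model(x)) ** 2 for x, y in zip(xs, ys)) / len(xs)

# Hypothetical toy data with a jump between x = 3 and x = 4.
xs = [1, 2, 3, 4, 5, 6]
ys = [1.0, 1.2, 0.9, 3.1, 2.9, 3.0]

constant = gradient_boost(xs, ys, 0, 0.5)
one_round = gradient_boost(xs, ys, 1, 0.5)
twenty_rounds = gradient_boost(xs, ys, 20, 0.5)
print(mse(constant, xs, ys), mse(one_round, xs, ys), mse(twenty_rounds, xs, ys))
```

Each added stump is fitted to what the current model still gets wrong, so the training error shrinks as rounds are added.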

6. UNSUPERVISED LEARNING

Unsupervised learning is a type of machine learning technique in which an algorithm discovers patterns and relationships using unlabeled data. Unlike supervised learning, unsupervised learning doesn’t involve providing the algorithm with labeled target outputs. The primary goal of unsupervised learning is often to discover hidden patterns, similarities, or clusters within the data, which can then be used for various purposes, such as data exploration, visualization, dimensionality reduction, and more.

Unsupervised Learning

Let’s understand it with the help of an example.

Example: Consider a dataset that contains information about the purchases customers made from a shop. Through clustering, the algorithm can group customers with similar purchasing behavior, which reveals potential customer segments without predefined labels. This type of information can help businesses target customers as well as identify outliers.

There are two main categories of unsupervised learning that are mentioned below:

 Clustering

 Association

Clustering

Clustering is the process of grouping data points into clusters based on their similarity. This technique is useful for identifying patterns and relationships in data without the need for labeled examples.

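Here are some clustering algorithms:

 K-Means Clustering algorithm

 Mean-shift algorithm

 DBSCAN Algorithm

Principal Component Analysis and Independent Component Analysis are often listed alongside these, but they are dimensionality reduction and signal separation techniques rather than clustering algorithms.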

Association

Association rule learning is a technique for discovering relationships between items in a dataset. It identifies rules indicating that the presence of one item implies the presence of another item with a specific probability.

Here are some association rule learning algorithms:

 Apriori Algorithm

 Eclat

 FP-growth Algorithm

Advantages of Unsupervised Machine Learning

 It helps to discover hidden patterns and various relationships between the data.

 Used for tasks such as customer segmentation, anomaly detection, and data exploration.

 It does not require labeled data and reduces the effort of data labeling.

 It has techniques such as autoencoders and dimensionality reduction that can be used to extract meaningful features from raw data.

Disadvantages of Unsupervised Machine Learning

 Without using labels, it may be difficult to assess the quality of the model’s output.

 Clusters may be hard to interpret and may not correspond to meaningful groups.

Applications of Unsupervised Learning

Here are some common applications of unsupervised learning:

 Clustering: Group similar data points into clusters.

 Anomaly detection: Identify outliers or anomalies in data.

 Dimensionality reduction: Reduce the dimensionality of data while preserving its

essential information.

 Recommendation systems: Suggest products, movies, or content to users based on

their historical behavior or preferences.

 Topic modeling: Discover latent topics within a collection of documents.

 Density estimation: Estimate the probability density function of data.

 Image and video compression: Reduce the amount of storage required for multimedia

content.

7.K MEANS CLUSTERING

✔ K-Means Clustering is an unsupervised learning algorithm that is used to solve clustering problems in machine learning and data science.
✔ It groups an unlabeled dataset into different clusters.
✔ Here K defines the number of pre-defined clusters that need to be created in the process: if K=2, there will be two clusters, for K=3 there will be three clusters, and so on.
✔ It is a centroid-based algorithm, where each cluster is associated with a centroid.
✔ The main aim of this algorithm is to minimize the sum of squared distances between the data points and the centroids of their corresponding clusters.
✔ The k-means clustering algorithm mainly performs two tasks:
o Determines the best positions for the K center points or centroids by an iterative process.
o Assigns each data point to its closest k-center.
✔ The data points which are nearest to a particular k-center form a cluster.

How does the K-Means Algorithm Work?

✔ Step-1: Select the number K to decide the number of clusters.
✔ Step-2: Select K random points as the initial centroids. (They need not be points from the input dataset.)
✔ Step-3: Assign each data point to its closest centroid, which will form the predefined K clusters.
✔ Step-4: Calculate the mean of the points in each cluster (the position that minimizes the within-cluster variance) and place the new centroid of that cluster there.
✔ Step-5: Repeat the third step, which means reassigning each data point to the new closest centroid.
✔ Step-6: If any reassignment occurs, then go to Step-4; else go to FINISH.
✔ Step-7: The model is ready.
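
The steps above can be sketched as follows. This is a minimal illustration, assuming squared Euclidean distance, initial centroids sampled from the data points, and a seeded random generator for repeatability; the function name `kmeans` and the sample points are invented for the example.

```python
import random

def kmeans(points, k, seed=0, max_iter=100):
    rng = random.Random(seed)
    centroids = rng.sample(points, k)        # Step 2: pick K starting centroids
    assignment = None
    for _ in range(max_iter):
        # Step 3: assign each point to its closest centroid.
        new_assignment = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            for p in points
        ]
        if new_assignment == assignment:     # Step 6: no reassignment, so finish
            break
        assignment = new_assignment
        # Step 4: move each centroid to the mean of its assigned points.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:                      # keep the old centroid if a cluster is empty
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return centroids, assignment

points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids, labels = kmeans(points, k=2)
```

On this toy data the three points near (1, 1.5) and the three points near (8.5, 8.5) should end up in different clusters; which cluster gets index 0 depends on the random start.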

8.HIERARCHICAL CLUSTERING

Hierarchical clustering groups similar data points together based on their similarity, creating a hierarchy or tree-like structure. The key idea is to begin with each data point as its own separate cluster and then progressively merge clusters (or, in the top-down variant, to begin with one cluster and progressively split it) based on similarity.
Let's understand this with the help of an example.
Imagine you have four fruits with different weights: an apple (100g), a banana (120g), a cherry (50g) and a grape (30g). Hierarchical clustering starts by treating each fruit as its own group.
 It then merges the closest groups based on their weights.
 First the cherry and grape are grouped together because their weights differ by only 20g.
 Next the apple and banana are grouped together; their weights also differ by 20g.
Finally all the fruits are merged into one large group, showing how hierarchical clustering progressively combines the most similar data points.

Dendrogram
A dendrogram is like a family tree for clusters. It shows how individual data points or groups of data merge together. The bottom shows each data point as its own group, and as you move up, similar groups are combined. The lower the merge point, the more similar the groups are. It helps you see how things are grouped step by step. The working of the dendrogram can be explained using the diagram below:

Dendrogram
In the above image, on the left side there are five points labeled P, Q, R, S and T. These represent individual data points that are being clustered. On the right side there is a dendrogram which shows how these points are grouped together step by step.
 At the bottom of the dendrogram the points P, Q, R, S and T are all separate.
 As you move up, the closest points are merged into a single group.
 The lines connecting the points show how they are progressively merged based on similarity.
 The height at which they are connected shows how similar the points are to each other; the lower the connection, the more similar they are.
Types of Hierarchical Clustering
Now we understand the basics of hierarchical clustering. There are two main types of
hierarchical clustering.
1. Agglomerative Clustering
2. Divisive clustering

Hierarchical Agglomerative Clustering
It is also known as the bottom-up approach or hierarchical agglomerative clustering (HAC). Unlike flat clustering, hierarchical clustering organises the data into a nested structure, and the algorithm does not require us to prespecify the number of clusters. Bottom-up algorithms treat each data point as a singleton cluster at the outset and then successively agglomerate pairs of clusters until all clusters have been merged into a single cluster that contains all the data.

Hierarchical Agglomerative Clustering

Workflow for Hierarchical Agglomerative clustering
1. Start with individual points: Each data point is its own cluster. For example if you have
5 data points you start with 5 clusters each containing just one data point.
2. Calculate distances between clusters: Calculate the distance between every pair of
clusters. Initially since each cluster has one point this is the distance between the two data
points.
3. Merge the closest clusters: Identify the two clusters with the smallest distance and merge
them into a single cluster.
4. Update distance matrix: After merging you now have one less cluster. Recalculate the
distances between the new cluster and the remaining clusters.
5. Repeat steps 3 and 4: Keep merging the closest clusters and updating the distance matrix
until you have only one cluster left.
6. Create a dendrogram: As the process continues you can visualize the merging of
clusters using a tree-like diagram called a dendrogram. It shows the hierarchy of how
clusters are merged.
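
The workflow above can be sketched on 1-D data. This is a minimal illustration, assuming single linkage (closest pair) as the cluster distance and absolute difference as the point distance; it uses the four fruit weights from the earlier example, and the function name `agglomerative` is made up. Two pairs tie at a distance of 20, so which of them merges first depends only on tie-breaking.

```python
def agglomerative(values, names):
    """Single-linkage agglomerative clustering of 1-D values.

    Returns the merge history as (sorted member names, merge distance) tuples,
    which is the information a dendrogram draws.
    """
    clusters = [[i] for i in range(len(values))]   # step 1: one cluster per point
    history = []
    while len(clusters) > 1:
        best = None
        # Steps 2-3: find the pair of clusters with the smallest distance.
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(abs(values[i] - values[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merged = clusters[a] + clusters[b]
        history.append((sorted(names[i] for i in merged), d))
        # Step 4: replace the two clusters by their union; distances to the
        # new cluster are recomputed on the next pass of the loop.
        clusters = [c for idx, c in enumerate(clusters) if idx not in (a, b)] + [merged]
    return history

names = ["apple", "banana", "cherry", "grape"]
weights = [100, 120, 50, 30]
history = agglomerative(weights, names)
```

The history contains three merges: the two pairs at distance 20, then the final merge of both pairs at distance 50 (apple to cherry).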

Hierarchical Divisive clustering


It is also known as the top-down approach. This algorithm also does not require us to prespecify the number of clusters. Top-down clustering requires a method for splitting a cluster; it starts from a single cluster that contains the whole data and proceeds by splitting clusters recursively until all data points have been split into singleton clusters.
Workflow for Hierarchical Divisive clustering:
1. Start with all data points in one cluster: Treat the entire dataset as a single large cluster.
2. Split the cluster: Divide the cluster into two smaller clusters. The division is typically done by finding the two most dissimilar points in the cluster and using them to separate the data into two parts.
3. Repeat the process: For each of the new clusters, repeat the splitting process:
1. Choose the cluster with the most dissimilar points.
2. Split it again into two smaller clusters.
4. Stop: Continue this process until every data point is its own cluster, or until a stopping condition (such as a predefined number of clusters) is met.

Types of Linkages in Hierarchical Clustering
Hierarchical clustering is used to group similar data points and organise data in a tree-like structure. A key part of this process is the linkage, which defines how the distance between clusters is calculated before they are merged or divided. Different types of linkage measure this distance differently.

1. Single Linkage
For two clusters R and S, single linkage returns the minimum distance between a point in R and a point in S. This method tends to create long, chain-like clusters: because only the closest pair matters, it is sensitive to noise and outliers and can connect clusters based on a very small number of close points.
$L(R, S) = \min_{i \in R,\, j \in S} D(i, j)$
where
 D(i, j): Distance function between points i and j.

Single Linkage

2. Complete Linkage
For two clusters R and S, complete linkage returns the maximum distance between a point in R and a point in S. It tends to create compact, roughly spherical clusters because a merge is judged by the farthest pair of points, which keeps each cluster's diameter small; the same property makes it sensitive to outliers.
$L(R, S) = \max_{i \in R,\, j \in S} D(i, j)$
where
 D(i, j): Distance function between points i and j.

Complete Linkage

3. Average Linkage
It returns the average distance between all pairs of points from two clusters. This method maintains a balance between single and complete linkage by considering all pairs of points, not just the closest or farthest pair. It usually results in clusters that are moderately compact.
$L(R, S) = \frac{1}{n_R \, n_S} \sum_{i \in R} \sum_{j \in S} D(i, j)$
where
 nR : Number of data-points in R
 nS : Number of data-points in S
 D(i, j): Distance function between points i and j.

Average Linkage

4. Ward's Linkage
It calculates the distance between two clusters by looking at how much the total spread (within-cluster variance) increases when the clusters are combined. This method creates compact, well-separated clusters by making sure that data within each cluster is as similar as possible.
In its standard form, the merge cost is
$L(R, S) = \frac{n_R \, n_S}{n_R + n_S} \, D(\bar{R}, \bar{S})^2$
where
 nR and nS are the sizes of clusters R and S
 $D(\bar{R}, \bar{S})$ is the Euclidean distance between the centroids (mean points) of R and S.

Ward Linkage

5. Centroid Linkage
It calculates the distance between two clusters as the distance between their central points, i.e. the average of all points in each cluster. This method works well when clusters are round or evenly shaped, but it may not be the best for irregularly shaped clusters.
$L(R, S) = D(\bar{R}, \bar{S})$
where
 $\bar{R}$ and $\bar{S}$ are the centroids (mean points) of clusters R and S
 $D(\bar{R}, \bar{S})$ is the distance between the centroids of clusters R and S.

Centroid Linkage

Each linkage method has its own advantages, and we can choose among them based on our needs and the type of data we have.
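
To compare the linkage criteria side by side, here is a small sketch that evaluates each one on the same pair of clusters. It assumes Euclidean distance for D and uses the standard size-weighted squared centroid distance for Ward's criterion; the function names and the two example clusters are made up for illustration.

```python
import math

def dist(p, q):
    return math.dist(p, q)          # Euclidean distance (Python 3.8+)

def single_linkage(R, S):
    return min(dist(i, j) for i in R for j in S)

def complete_linkage(R, S):
    return max(dist(i, j) for i in R for j in S)

def average_linkage(R, S):
    return sum(dist(i, j) for i in R for j in S) / (len(R) * len(S))

def centroid(C):
    # Mean point of a cluster, coordinate by coordinate.
    return tuple(sum(col) / len(C) for col in zip(*C))

def centroid_linkage(R, S):
    return dist(centroid(R), centroid(S))

def ward_linkage(R, S):
    # Increase in within-cluster sum of squares caused by merging R and S.
    n_r, n_s = len(R), len(S)
    return n_r * n_s / (n_r + n_s) * dist(centroid(R), centroid(S)) ** 2

R = [(0.0, 0.0), (0.0, 2.0)]
S = [(3.0, 0.0), (5.0, 0.0)]
results = {
    "single": single_linkage(R, S),
    "complete": complete_linkage(R, S),
    "average": average_linkage(R, S),
    "centroid": centroid_linkage(R, S),
    "ward": ward_linkage(R, S),
}
```

For any pair of clusters the average linkage value lies between the single and complete linkage values, which is one way to see why it behaves as a compromise between the two.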

9.PARTIALLY SUPERVISED LEARNING

Partially supervised learning (PSL) is a machine learning paradigm that utilizes both labelled and unlabelled data during model training. It falls between supervised learning, which requires completely labelled datasets, and unsupervised learning, which uses no labels. This approach is especially powerful in real-world situations where labelled data is scarce or expensive to obtain, but large volumes of raw, unlabelled data are available.

Key Characteristics:

 Hybrid approach: Combines labelled and unlabelled data


 Cost-effective: Reduces the need for extensive manual labelling
 Practical relevance: Mirrors real-world data availability scenarios

 Performance enhancement: Often outperforms purely supervised methods with
limited labelled data

Need for Partially Supervised Learning

In many practical domains such as medical imaging, natural language processing, and web
mining, obtaining labelled data is time-consuming, expensive, or requires domain expertise.
However, unlabelled data—such as raw text, images, or logs—is often abundant. Partially
supervised learning helps address the gap by making use of this vast unlabelled data,
significantly reducing the labelling cost while still maintaining good model accuracy.

How It Works

In partially supervised learning, a small amount of labelled data is used to guide the learning
process, while the model simultaneously learns from a large amount of unlabelled data. The
core idea is to allow the model to generalize from the labelled data and apply what it learns to
structure or classify the unlabelled data.

Types of Partially Supervised Learning

1. Semi-Supervised Learning

 Definition: Uses a small amount of labelled data combined with a large amount of
unlabelled data
 Assumption: Unlabelled data contains valuable information about the underlying data
distribution
 Goal: Improve learning accuracy beyond what's achievable with labelled data alone

2. Weakly Supervised Learning

 Definition: Learning from imprecise, incomplete, or noisy labels
 Types of weak supervision:
o Incomplete supervision: Only a subset of the training data is labelled
o Inexact supervision: Training data has coarse-grained labels
o Inaccurate supervision: Labels may contain errors

3. Active Learning

 Definition: Iteratively selects the most informative unlabelled examples for manual
labelling
 Goal: Minimize labelling effort while maximizing performance
 Query strategies: Uncertainty sampling, diversity sampling, expected model change

Semi-Supervised Learning Techniques

1. Self-Training (Self-Labeling)

Process:

1. Train initial model on labelled data


 Start with a small supervised dataset (e.g., 1000 labeled examples)
 Train your model using standard supervised learning
 This gives you a "teacher" model with basic performance

2. Use model to predict labels for unlabelled data


 Apply the trained model to a large pool of unlabeled data
 Generate predictions with confidence scores (probability score or percentage)
 Example: Model predicts "cat" with 95% confidence, "dog" with 60%
confidence

3. Add high-confidence predictions to training set


 Set a confidence threshold (e.g., 90%)
 Select predictions above this threshold as "pseudo-labels"
 Add these newly labelled examples to your training set
 This is where careful threshold selection is crucial

4. Retrain model on expanded dataset


 Combine original labeled data + pseudo-labeled data
 Train a new model on this larger dataset
 The model now has more training examples to learn from

5. Repeat until convergence


 Continue the cycle: predict → select → retrain

 Stop when performance plateaus or no new high-confidence examples are found
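
The self-training loop above can be sketched with a deliberately simple model. This is an illustrative sketch only: it assumes 1-D features, a nearest-mean classifier, and a confidence score obtained from a softmax over negative distances. All function names and data values are hypothetical.

```python
import math

def train(data):
    """Nearest-mean classifier on 1-D features: returns {label: mean}."""
    sums = {}
    for x, y in data:
        s, n = sums.get(y, (0.0, 0))
        sums[y] = (s + x, n + 1)
    return {y: s / n for y, (s, n) in sums.items()}

def predict_proba(model, x):
    """Softmax over negative distances to class means -> (label, confidence)."""
    scores = {y: math.exp(-abs(x - m)) for y, m in model.items()}
    total = sum(scores.values())
    label = max(scores, key=scores.get)
    return label, scores[label] / total

def self_train(labelled, unlabelled, threshold=0.9, max_rounds=10):
    labelled, pool = list(labelled), list(unlabelled)
    model = train(labelled)                        # step 1: initial model
    for _ in range(max_rounds):
        confident, rest = [], []
        for x in pool:                             # step 2: predict with confidence
            label, conf = predict_proba(model, x)
            (confident if conf >= threshold else rest).append((x, label))
        if not confident:                          # step 5: no new confident examples
            break
        labelled += confident                      # step 3: add pseudo-labels
        pool = [x for x, _ in rest]
        model = train(labelled)                    # step 4: retrain
    return model, labelled, pool

labelled = [(0.0, "cat"), (10.0, "dog")]
unlabelled = [0.5, 1.0, 9.0, 9.5, 5.1]
model, expanded, remaining = self_train(labelled, unlabelled)
```

The point at 5.1 sits almost midway between the two classes, so its confidence stays below the threshold and it is never pseudo-labelled; this illustrates why the threshold choice matters.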

Applications

Medical Imaging

 Radiology: Training deep learning models for medical image analysis without requiring large quantities of labelled training data (source: “Self-supervised learning for medical image classification: a systematic review and implementation guidelines”, npj Digital Medicine)
 Diagnostic Imaging: Analysis of X-rays, MRIs and CT scans with limited expert annotations
 Medical Image Segmentation: Utilizing large amounts of unlabelled data in conjunction with labelled data to train higher-performing segmentation models

2. Co-Training

Requirements:

 Two different views (feature sets or representations) of the same data

 Views should be conditionally independent given the class

 Each view should be sufficient for classification

Process:

1. Train separate classifiers on each view using labelled data

2. Each classifier labels unlabelled examples for the other

3. Add high-confidence predictions to training sets

4. Retrain classifiers

5. Iterate until convergence

Applications:

 Web page classification (text content + hyperlinks)

How Co-Training Works Here:

1. Train classifier A on View 1 (text content) using a small labeled set.


2. Train classifier B on View 2 (hyperlink context).
3. Let classifier A label some confident examples in unlabeled data and pass them to B.
4. Let classifier B do the same for A.
5. Repeat the process iteratively — both classifiers improve over time using mutual
teaching.

3. Graph-Based Methods

Core Idea: Construct a graph where nodes represent data points and edges represent
similarity

Label Propagation Algorithm:

1. Create similarity graph from all data points

 Transform your dataset into a graph structure where:


 Nodes = Data points (both labeled and unlabeled)
 Edges = Connections between similar data points
 Weights = How similar two data points are

2. Initialize labelled nodes with known labels

3. Propagate labels through graph based on similarity

4. Unlabelled nodes receive labels based on weighted neighbors
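
The label propagation steps above can be sketched as follows. This is a simplified illustration, assuming a fully connected graph with Gaussian similarity weights, clamped labels for the known nodes, and a fixed number of iterations; the function name `label_propagation`, the `sigma` parameter and the data are hypothetical.

```python
import math

def label_propagation(points, labels, sigma=1.0, n_iter=50):
    """labels[i] is a class name for labelled nodes and None for unlabelled ones."""
    classes = sorted({l for l in labels if l is not None})
    n = len(points)
    # Step 1: similarity graph; edge weight decays with squared distance.
    w = [[math.exp(-math.dist(points[i], points[j]) ** 2 / (2 * sigma ** 2))
          if i != j else 0.0
          for j in range(n)] for i in range(n)]
    # Step 2: one-hot scores for labelled nodes, uniform scores for the rest.
    scores = [[1.0 if l == c else 0.0 for c in classes] if l is not None
              else [1.0 / len(classes)] * len(classes)
              for l in labels]
    for _ in range(n_iter):
        new_scores = []
        for i in range(n):
            if labels[i] is not None:          # known labels stay fixed
                new_scores.append(scores[i])
                continue
            # Steps 3-4: weighted average of the neighbours' class scores.
            total = sum(w[i])
            new_scores.append([sum(w[i][j] * scores[j][k] for j in range(n)) / total
                               for k in range(len(classes))])
        scores = new_scores
    return [classes[max(range(len(classes)), key=s.__getitem__)] for s in scores]

points = [(0.0,), (1.0,), (2.0,), (8.0,), (9.0,), (10.0,)]
labels = ["A", None, None, None, None, "B"]
predicted = label_propagation(points, labels)
```

Only the two end points carry labels, yet each unlabelled point receives the label of the group it is closest to, because labels flow mainly along high-weight edges.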

Advantages:

 Captures local data structure

 Theoretically well-founded

 Handles multi-class problems naturally

4. Generative Models

Approach: Model the joint distribution P(X,Y) of features and labels

Expectation-Maximization (EM):

1. E-step: Estimate missing labels for unlabeled data

2. M-step: Update model parameters based on all data

3. Iterate until convergence

Examples:

 Gaussian Mixture Models

 Naive Bayes with EM

 Hidden Markov Models

Advantages

 Cost-effective: Reduces the need for large-scale manual labelling.


 Improves generalization: Leverages the underlying structure in unlabelled data for
better model accuracy.
 Scalable: Works well with large datasets where labelling all examples would be
infeasible.
 Applicable across domains: Can be used in various fields with minimal domain-
specific adjustments.

10. MARKOV MODELS

Markov Models and Hidden Markov Models (HMMs)


✔ Markov Models and Hidden Markov Models are both probabilistic models used for modeling sequential data, such as time series or sequences of observations.
✔ Markov Models:
✔ A Markov Model is a stochastic model of a system whose next state is assumed to depend only on its current state.
✔ This property is known as the Markov property.
✔ That is, the future state of the system depends only on its current state and not on the sequence of events that preceded it.

✔ Key Concepts:
✔ States: A set of possible conditions or situations of the system.
✔ Transitions: Probabilities of moving from one state to another. These probabilities
are often represented in a transition matrix.
✔ Markov Property: The probability of transitioning to any particular state depends
solely on the current state and time elapsed, regardless of how the system arrived at its
current state.
✔ Markov Models Applications:
✔ Weather Prediction: Modeling weather conditions where tomorrow's weather
depends only on today's weather.
✔ Game Theory: Analyzing strategic interactions where the outcome of a decision
depends on the current state.
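
The weather application can be made concrete with a two-state chain. This is a minimal sketch; the transition probabilities in `P` are invented for illustration and the function names are hypothetical.

```python
# Hypothetical transition probabilities: P[today][tomorrow].
P = {
    "sunny": {"sunny": 0.8, "rainy": 0.2},
    "rainy": {"sunny": 0.4, "rainy": 0.6},
}

def step(dist, P):
    """Advance a probability distribution over states by one day."""
    return {s2: sum(dist[s1] * P[s1][s2] for s1 in P) for s2 in P}

def forecast(start, P, days):
    """Distribution over states after `days` steps, starting from a known state."""
    dist = {s: 1.0 if s == start else 0.0 for s in P}
    for _ in range(days):
        dist = step(dist, P)
    return dist

tomorrow = forecast("sunny", P, 1)
in_two_days = forecast("sunny", P, 2)    # sunny: 0.8*0.8 + 0.2*0.4 = 0.72
long_run = forecast("sunny", P, 200)
```

Each step uses only the current distribution and the transition matrix, which is the Markov property in code. After many steps the forecast settles near a fixed distribution that no longer depends on the starting state.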

✔ Hidden Markov Models (HMMs):


✔ A Hidden Markov Model is an extension of the basic Markov Model.

✔ In an HMM, the states are not directly observable, but the system emits observable
symbols or observations.
✔ The idea is that each state has a probability distribution over possible observations,
and transitions between states have associated probabilities.
✔ Key Concepts:
✔ States: Hidden states that are not directly observable.
✔ Observations: Observable symbols or data points emitted by each hidden state.
✔ Transition Probabilities: Probabilities of transitioning from one hidden state to
another.
✔ Emission Probabilities: Probabilities of emitting a particular observation given a hidden state.
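
To show how transition and emission probabilities combine, the sketch below computes the probability of an observation sequence with the forward algorithm, a standard HMM procedure that the text above does not describe. The states, observations and all probability values are invented for illustration.

```python
# Hypothetical HMM: hidden weather states emit observable activities.
states = ["rainy", "sunny"]
start_p = {"rainy": 0.6, "sunny": 0.4}
trans_p = {"rainy": {"rainy": 0.7, "sunny": 0.3},
           "sunny": {"rainy": 0.4, "sunny": 0.6}}
emit_p = {"rainy": {"walk": 0.1, "shop": 0.4, "clean": 0.5},
          "sunny": {"walk": 0.6, "shop": 0.3, "clean": 0.1}}

def forward(observations):
    """Return P(observations), summed over all hidden state paths."""
    # alpha[s] = probability of the observations so far, ending in state s.
    alpha = {s: start_p[s] * emit_p[s][observations[0]] for s in states}
    for obs in observations[1:]:
        alpha = {s: emit_p[s][obs] * sum(alpha[p] * trans_p[p][s] for p in states)
                 for s in states}
    return sum(alpha.values())

likelihood = forward(["walk", "shop", "clean"])
```

The hidden states never appear in the input; only the observations do. The algorithm accounts for every possible hidden path without enumerating them one by one.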

✔ Hidden Markov Models (HMMs): Applications:

✔ Speech Recognition: Modeling phonemes as hidden states and observed acoustic signals as emitted symbols.
✔ Part-of-Speech Tagging: Identifying the grammatical category (noun, verb, etc.) of each word in a sentence.
✔ Bioinformatics: Analyzing DNA sequences, where hidden states represent biological processes and observed symbols represent data from experiments.


11.PROBABILITY BASED CLUSTERING

Probability-based clustering (also called soft clustering) is a technique where each data
point is not assigned to a single cluster, but instead is given a probability of belonging to
each cluster.

This approach is more flexible than hard clustering (like K-Means), especially when clusters
overlap or the data is not clearly separated.

🧠 Key Ideas

1. Each cluster is modelled as a probability distribution


Usually a Gaussian (Normal) distribution: bell-shaped curve with a center (mean)
and spread (variance).

2. Each data point has a degree of membership (“how much” the point belongs to a cluster, expressed as a probability) for each cluster.
For example, a data point might belong to:

o Cluster A: 70%

o Cluster B: 20%

o Cluster C: 10%

3. The clustering is done using an algorithm like Gaussian Mixture Model (GMM)
with the Expectation-Maximization (EM) method.

How It Works (Using GMM + EM)

1. Initialize: Start with random guesses for cluster parameters (mean, variance, etc.)

2. E-step (Expectation):
For each data point, compute the probability that it belongs to each cluster, given the current cluster parameters.

3. M-step (Maximization):
Update each cluster's parameters (weighting every data point by these probabilities) to maximize the likelihood of the data.

4. Repeat until convergence.
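
The GMM + EM loop above can be sketched in one dimension. This is an illustrative sketch under simplifying assumptions: fixed starting parameters instead of random guesses, a fixed number of iterations instead of a convergence test, and a small floor on the variance to avoid division by zero. The function names and data are made up.

```python
import math

def gaussian(x, mean, var):
    """Density of a 1-D normal distribution."""
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def em_gmm(xs, means, variances, weights, n_iter=50):
    k = len(means)
    for _ in range(n_iter):
        # E-step: probability that each point belongs to each cluster.
        resp = []
        for x in xs:
            p = [weights[c] * gaussian(x, means[c], variances[c]) for c in range(k)]
            total = sum(p)
            resp.append([v / total for v in p])
        # M-step: re-estimate parameters from the soft assignments.
        for c in range(k):
            n_c = sum(r[c] for r in resp)
            means[c] = sum(r[c] * x for r, x in zip(resp, xs)) / n_c
            variances[c] = max(
                sum(r[c] * (x - means[c]) ** 2 for r, x in zip(resp, xs)) / n_c,
                1e-6)                          # variance floor (an assumption)
            weights[c] = n_c / len(xs)
    return means, variances, weights, resp

xs = [1.0, 1.2, 0.8, 1.1, 5.0, 5.2, 4.8, 5.1]
means, variances, weights, resp = em_gmm(xs, [0.0, 6.0], [1.0, 1.0], [0.5, 0.5])
```

Each row of `resp` holds the membership probabilities of one data point and sums to 1. On this well-separated data the two means settle near 1.025 and 5.025.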

Advantages

 Allows overlapping clusters

 Better for data with uncertainty or noise

 More flexible and accurate in complex distributions

Limitations

 Computationally more intensive than k-means

 Requires assumptions about data distribution (e.g., Gaussian)

 Needs careful parameter initialization

12.VECTOR SPACE MODEL

 The Vector Space Model (VSM) is a way of representing documents through the words that they contain
 It is a standard technique in Information Retrieval
 The VSM allows decisions to be made about which documents are similar to each other and to keyword queries

How it works:

 Each document is broken down into a word frequency table


 The tables are called vectors and can be stored as arrays
 A vocabulary is built from all the words in all documents in the system
 Each document is represented as a vector based against the vocabulary

Example

 Document A: “A dog and a cat.”

Word    Frequency
a       2
dog     1
and     1
cat     1

 Document B: “A frog.”

Word    Frequency
a       1
frog    1

Example, continued

 The vocabulary contains all words used: a, dog, and, cat, frog
 The vocabulary needs to be sorted: a, and, cat, dog, frog
 Against the sorted vocabulary, Document A = (2,1,1,1,0) and Document B = (1,0,0,0,1)

Queries

Queries can be represented as vectors in the same way as documents:
– Dog = (0,0,0,1,0)
– Frog = (0,0,0,0,1)
– Dog and frog = (0,1,0,1,1)

Similarity measures

o There are many different ways to measure how similar two documents are, or how
similar a document is to a query
o The cosine measure is a very common similarity measure
o Using a similarity measure, a set of documents can be compared to a query and the
most similar document returned

The cosine measure

o For two vectors d and d’ the cosine similarity between d and d’ is given by: $\cos(d, d') = \frac{d \cdot d'}{|d| \, |d'|}$
o Here d · d’ (also written d X d’) is the dot product of d and d’, calculated by multiplying corresponding frequencies together and summing the results
o The cosine measure calculates the cosine of the angle between the vectors in a high-dimensional virtual space

Example

 Let d = (2,1,1,1,0) and d’ = (0,0,0,1,0)

– dXd’ = 2X0 + 1X0 + 1X0 + 1X1 + 0X0 = 1

– |d| = √(2² + 1² + 1² + 1² + 0²) = √7 = 2.646

– |d’| = √(0² + 0² + 0² + 1² + 0²) = √1 = 1

– Similarity = 1/(1 X 2.646) = 0.378

 Let d = (1,0,0,0,1) and d’ = (0,0,0,1,0)

– dXd’ = 0, so Similarity = 0
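
The vector construction and the cosine measure can be reproduced in a few lines. This is a minimal sketch, assuming lower-casing and removal of full stops as the only text clean-up; the function names `vectorize` and `cosine_similarity` are chosen for illustration.

```python
import math

def tokenize(text):
    return text.lower().replace(".", "").split()

def vectorize(text, vocabulary):
    """Word-frequency vector of a text against a sorted vocabulary."""
    words = tokenize(text)
    return [words.count(w) for w in vocabulary]

def cosine_similarity(d1, d2):
    dot = sum(a * b for a, b in zip(d1, d2))
    norm1 = math.sqrt(sum(a * a for a in d1))
    norm2 = math.sqrt(sum(b * b for b in d2))
    if norm1 == 0 or norm2 == 0:
        return 0.0                      # convention for an all-zero vector
    return dot / (norm1 * norm2)

docs = {"A": "A dog and a cat.", "B": "A frog."}
vocabulary = sorted({w for text in docs.values() for w in tokenize(text)})
vectors = {name: vectorize(text, vocabulary) for name, text in docs.items()}
query = vectorize("dog", vocabulary)

sim_a = cosine_similarity(vectors["A"], query)   # 1 / sqrt(7), about 0.378
sim_b = cosine_similarity(vectors["B"], query)   # 0.0: no shared words

# Rank documents by decreasing similarity to the query.
ranking = sorted(docs, key=lambda name: cosine_similarity(vectors[name], query),
                 reverse=True)
```

Document A contains the query word and Document B does not, so A ranks first for the query "dog".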

Ranking documents

 A user enters a query


 The query is compared to all documents using a similarity measure
 The user is shown the documents in decreasing order of similarity to the query term

Vocabulary

Stopword lists

o Commonly occurring words are unlikely to give useful information and may be removed from the vocabulary to speed processing
o Stopword lists contain frequent words to be excluded
o Stopword lists need to be used carefully, e.g. for a query such as “to be or not to be”, which is made up almost entirely of very common words

Term weighting

 Not all words are equally useful


 A word is most likely to be highly relevant to document A if it is:
– Infrequent in other documents
– Frequent in document A
 The cosine measure needs to be modified to reflect this

Normalised term frequency (tf)

A normalised measure of the importance of a word to a document is its frequency, divided by the maximum frequency of any term in the document

o This is known as the tf factor.
o Document A: raw frequency vector: (2,1,1,1,0), tf vector: (1, 0.5, 0.5, 0.5, 0)
o This stops large documents from scoring higher

Inverse document frequency (idf)

o A calculation designed to make rare words more important than common words
o The idf of word i is given by $\mathrm{idf}_i = \log \frac{N}{n_i}$
o Where N is the number of documents and ni is the number that contain word i

tf-idf

o The tf-idf weighting scheme is to multiply each word in each document by its tf factor
and idf factor
o Different schemes are usually used for query vectors
o Different variants of tf-idf are also used
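
The tf, idf and tf-idf definitions above can be combined into a short sketch. It assumes the natural logarithm for idf (the base is not fixed by the text) and uses the raw frequency vectors of the two example documents; the function names are chosen for illustration.

```python
import math

def tf(raw):
    """Normalised term frequency: each count divided by the largest count."""
    m = max(raw)
    return [c / m if m else 0.0 for c in raw]

def idf(doc_vectors):
    """idf_i = log(N / n_i) for each vocabulary position i."""
    n_docs = len(doc_vectors)
    result = []
    for i in range(len(doc_vectors[0])):
        n_i = sum(1 for v in doc_vectors if v[i] > 0)   # documents containing word i
        result.append(math.log(n_docs / n_i) if n_i else 0.0)
    return result

def tf_idf(doc_vectors):
    weights = idf(doc_vectors)
    return [[t * w for t, w in zip(tf(v), weights)] for v in doc_vectors]

# Vocabulary order: a, and, cat, dog, frog
docs = [[2, 1, 1, 1, 0],    # "A dog and a cat."
        [1, 0, 0, 0, 1]]    # "A frog."
tf_a = tf(docs[0])
weighted = tf_idf(docs)
```

The word "a" occurs in both documents, so its idf is log(2/2) = 0 and it drops out of both weighted vectors, while words unique to one document keep a positive weight.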
