Supervised Learning Techniques Overview
Supervised Learning Techniques Overview
UNIT - |
[Link] LEARNING
Learning algorithms learn to map points between inputs and correct outputs. It has both
Example: Consider a scenario where you have to build an image classifier to differentiate
between cats and dogs. If you feed the datasets of dogs and cats labelled images to the
algorithm, the machine will learn to classify between a dog or a cat from these labeled
images. When we input new dog or cat images that it has never seen before, it will use the
learned algorithms and predict whether it is a dog or a cat. This is how supervised
Supervised Learning
There are two main categories of supervised learning that are mentioned below:
Classification
Regression
1
Classification
Classification deals with predicting categorical target variables, which represent discrete
classes
or labels. For instance, classifying emails as spam or not spam, or predicting whether a
patient has
a high risk of heart disease. Classification algorithms learn to map the input features to one of
the
predefined classes.
Logistic Regression
Random Forest
Decision Tree
Naive Bayes
Regression
Regression, on the other hand, deals with predicting continuous target variables, which
represent
numerical values. For example, predicting the price of a house based on its size, location, and
amenities, or forecasting the sales of a product. Regression algorithms learn to map the input
Linear Regression
Polynomial Regression
Ridge Regression
2
Lasso Regression
Decision tree
Random Forest
Supervised Learning models can have high accuracy as they are trained on labelled
data.
It can often be used in pre-trained models which saves time and resources when
It has limitations in knowing patterns and may struggle with unseen or unexpected
Predictive analytics: Predict outcomes, such as sales, customer churn, and stock
prices.
3
Fraud detection: Identify fraudulent transactions.
meteorological parameters.
[Link] TREE
INTRODUCTION
Decision Tree is a Supervised learning technique that can be used for both Classification and
Regression problems, but mostly it is preferred for solving Classification problems.
It is a tree-structured classifier
Internal nodes represent the features of a dataset
Branches represent the decision rules
Each leaf node represents the outcome.
In a Decision tree, there are two nodes
[Link] Decision Node : are used to make any decision and have multiple branches
2. Leaf Node: are the output of those decisions and do not contain any further
branches.
The decisions or the test are performed on the basis of features of the given dataset.
It is a graphical representation for getting all the possible solutions to a
problem/decision based on given conditions.
It is called a decision tree because, similar to a tree, it starts with the root node, which
expands on further branches and constructs a tree-like structure
It is called a decision tree because, similar to a tree, it starts with the root node, which
expands on further branches and constructs a tree-like structure.
4
In order to build a tree, we use the CART algorithm, which stands for Classification
and Regression Tree algorithm.
A decision tree simply asks a question, and based on the answer (Yes/No), it further
split the tree into subtrees.
Root Node: Root node is from where the decision tree starts. It represents the entire
dataset, which further gets divided into two or more homogeneous sets.
Leaf Node: Leaf nodes are the final output node, and the tree cannot be segregated
further after getting a leaf node.
Splitting: Splitting is the process of dividing the decision node/root node into sub-
nodes according to the given conditions.
Branch/Sub Tree: A tree formed by splitting the tree.
Pruning: Pruning is the process of removing the unwanted branches from the tree.
Parent/Child node: The root node of the tree is called the parent node, and other
nodes are called the child nodes
Step-1: Begin the tree with the root node, says S, which contains the complete
dataset.
Step-2: Find the best attribute in the dataset using Attribute Selection Measure
(ASM).
Step-3: Divide the S into subsets that contains possible values for the best
attributes.
Step-4: Generate the decision tree node, which contains the best attribute.
Step-5: Recursively make new decision trees using the subsets of the dataset
created in step -3. Continue this process until a stage is reached where you
cannot further classify the nodes and called the final node as a leaf node.
5
There are two types :
o Information Gain
o Gini Index
✔ Information Gain:
Gini Index
Gini index is a measure of impurity or purity used while creating a decision
tree in the CART(Classification and Regression Tree) algorithm.
An attribute with the low Gini index should be preferred as compared to the
high Gini index.
It only creates binary splits, and the CART algorithm uses the Gini index to
create binary splits.
Gini index can be calculated using the below formula:
Gini Index= 1- ∑ jPj2
6
[Link]ÏVE BAYESIANNaive Bayes and Text Classification
Definition
Naive Bayes is a probabilistic machine learning algorithm used for classification tasks. It
is based on Bayes' Theorem with a strong (naive) assumption that the features (predictors)
are independent of each other given the class label.
In text classification, Naive Bayes is used to predict the category of a given text document
(e.g., spam or not spam) based on the words it contains.
Then:
P(C∣X)∝P(C)⋅i=1∏nP(xi∣C)
You ignore P(X) in most cases because it’s constant for all classes (used only for
normalization).
Step-by-step:
1. Collect Training Data: Labeled text documents (e.g., emails marked as spam or not
spam).
2. Preprocess Text:
o Tokenize (split text into words)
o Remove stopwords (e.g., "the", "is")
7
o Stem or lemmatize (reduce words to root form)
3. Convert to Numerical Features:
o Bag of Words or TF-IDF
4. Calculate Probabilities:
o Calculate prior probability for each class:
Let's say we want to classify an email as Spam or Not Spam using the words in the email.
8
Email Word: "Free" Word: "Win" Spam/Not Spam
4 No No Not Spam
P(Spam)=42=0.5,P(Not Spam)=42=0.5
P("Free"=Yes∣Not Spam)=2+20+1=0.25
For Spam:
P(Spam∣Free, Win)∝P(Spam)⋅P(Free∣Spam)⋅P(Win∣Spam)=0.5⋅1.0⋅0.5=0.25
9
Prediction:
Advantages
Limitations
10
Support Vector Machine (SVM) is a supervised machine learning algorithm used for
both classification and regression. Though we say regression problems as well it’s best
suited for classification.
The main objective of the SVM algorithm is to find the optimal hyperplane in an N-
dimensional space that can separate the data points in different classes in the feature space.
The hyperplane tries that the margin between the closest points of different classes should
be as maximum as possible. The dimension of the hyperplane depends upon the number of
features. If the number of input features is two, then the hyperplane is just a line. If the
number of input features is three, then the hyperplane becomes a 2-D plane. It becomes
difficult to imagine when the number of features exceeds three.
Let’s consider two independent variables x 1, x2, and one dependent variable which is either
a blue circle or a red circle.
From the figure above it’s very clear that there are multiple lines (our hyperplane here is a
line because we are considering only two input features x 1, x2) that segregate our data
points or do a classification between red and blue circles.
1. Hyperplane: Hyperplane is the decision boundary that is used to separate the data points
of different classes in a feature space. In the case of linear classifications, it will be a
linear equation i.e. wx+b = 0.
2. Support Vectors: Support vectors are the closest data points to the hyperplane, which
makes a critical role in deciding the hyperplane and margin.
3. Margin: Margin is the distance between the support vector and hyperplane. The main
objective of the support vector machine algorithm is to maximize the margin. The
wider margin indicates better classification performance.
4. Kernel: Kernel is the mathematical function, which is used in SVM to map the original
input data points into high-dimensional feature spaces, so, that the hyperplane can be
easily found out even if the data points are not linearly separable in the original input
11
space. Some of the common kernel functions are linear, polynomial, radial basis
function(RBF), and sigmoid.
5. Hard Margin: The maximum-margin hyperplane or the hard margin hyperplane is a
hyperplane that properly separates the data points of different categories without any
misclassifications.
6. Soft Margin: When the data is not perfectly separable or contains outliers, SVM permits
a soft margin technique. Each data point has a slack variable introduced by the soft-
margin SVM formulation, which softens the strict margin requirement and permits
certain misclassifications or violations. It discovers a compromise between increasing
the margin and reducing violations.
7. C: Margin maximisation and misclassification fines are balanced by the regularisation
parameter C in SVM. The penalty for going over the margin or misclassifying data
items is decided by it. A stricter penalty is imposed with a greater value of C, which
results in a smaller margin and perhaps fewer misclassifications.
8. Hinge Loss: A typical loss function in SVMs is hinge loss. It punishes incorrect
classifications or margin violations. The objective function in SVM is frequently
formed by combining it with the regularisation term.
9. Dual Problem: A dual Problem of the optimisation problem that requires locating the
Lagrange multipliers related to the support vectors can be used to solve SVM. The dual
formulation enables the use of kernel tricks and more effective computing.
12
The SVM kernel is a function that takes low-dimensional input space and
transforms it into higher-dimensional space, ie it converts nonseparable problems to
separable problems. It is mostly useful in non-linear separation problems. Simply put
the kernel, does some extremely complex data transformations and then finds out the
process to separate the data based on the labels or outputs defined.
Advantages of SVM
Effective in high-dimensional cases.
Its memory is efficient as it uses a subset of training points in the decision function
called support vectors.
Different kernel functions can be specified for the decision functions and its possible to
specify custom kernels.
[Link] CLASSIFIER
Ensemble learning is a method many small models are used instead of just one. Each
of these models may not be very strong on its own, but when we put their results
together, we get a better and more accurate answer
Ensemble Classifiers are class models that combine the predictive power of several models to
generate more powerful models than individual ones. A group of classifiers is learned and the
final is selected using the voting mechanism.
Ensemble methods combine multiple learning algorithms to create a stronger predictor than
any individual algorithm alone. The fundamental principle is that a group of weak learners
can come together to form a strong learner.
Ensemble models(classifiers) can solve many problems and have several advantages over
other singular methods. They are
Prediction accuracy is increased
The accuracy of the final model is higher even if the basis models fail to classify
accurately.
They can be paralyzed and enables efficient resource management.
Can improve on the errors produced by previous models and generate efficient
models.
13
Single Classifier:
Ensemble:
→ [Classifier 3] → Vote 3
14
There are three main types of ensemble methods:
1. Bagging (Bootstrap Aggregating):
Models are trained independently on different random subsets of the training data. Their
results are then combined—usually by averaging (for regression) or voting (for
classification). This helps reduce variance and prevents overfitting.
2. Boosting:
Models are trained one after another. Each new model focuses on fixing the errors made
by the previous ones. The final prediction is a weighted combination of all models,
which helps reduce bias and improve accuracy.
3. Stacking (Stacked Generalization):
Multiple different models (often of different types) are trained, and their predictions are
used as inputs to a final model, called a meta-model. The meta-model learns how to best
combine the predictions of the base models, aiming for better performance than any
individual model.
Bagging is a technique that involves creating multiple versions of a model and combining
their outputs to improve overall performance.
In bagging several base models are trained on different subsets of the training data, then
aggregate their predictions to make the final decision. The subsets of the data are created
15
using bootstrapping, a statistical technique where samples are drawn with replacement,
meaning some data points can appear more than once in a subset.
The final prediction from the ensemble is typically made by either:
Averaging the predictions (for regression problems), or
Majority voting (for classification problems).
This approach helps to reduce variance, especially with models that are prone to overfitting,
such as decision trees.
Concept: Train multiple models on different bootstrap samples of the training data.
Bagging Algorithm:
16
1. Create k bootstrap samples from training data
2. Train a classifier on each sample
3. Combine predictions (voting/averaging)
Algorithm Explanation
Bagging classifier can be used for both regression and classification tasks. Here is an
overview of Bagging classifier algorithm:
Bootstrap Sampling: Divides the original training data into ‘N’ subsets and randomly
selects a subset with replacement in some rows from other subsets. This step ensures
that the base models are trained on diverse subsets of the data and there is no class
imbalance.
Base Model Training: For each bootstrapped sample we train a base model
independently on that subset of data. These weak models are trained in parallel to
increase computational efficiency and reduce time consumption. We can use different
base learners i.e. different ML models as base learners to bring variety and robustness.
Prediction Aggregation: To make a prediction on testing data combine the predictions
of all base models. For classification tasks it can include majority voting or weighted
majority while for regression it involves averaging the predictions.
Out-of-Bag (OOB) Evaluation: Some samples are excluded from the training subset of
particular base models during the bootstrapping method. These “out-of-bag” samples
can be used to estimate the model’s performance without the need for cross-validation.
Final Prediction: After aggregating the predictions from all the base models, Bagging
produces a final prediction for each instance.
↓ ↓ ↓
↓ ↓ ↓
17
Model 1 Model 2 Model 3
↓ ↓ ↓
Final Prediction
(Majority Vote)
Advantages:
Reduces variance
Parallel training possible
Works well with high-variance models
1. Random Forest
Random forest is an ensemble method based on decision trees. Multiple decision trees
are trained using different bootstrapped samples of the data.
In addition to bagging, Random Forest also introduces randomness by selecting a
random subset of features at each node, further reducing variance and overfitting.
2. Bagged Decision Trees
In Bagged Decision Trees, multiple decision trees are trained using bootstrapped
samples of the data.
Each tree is trained independently and the final prediction is made by averaging the
predictions of all the trees in the ensemble.
18
Boosting
Concept: Sequential training where each model focuses on correcting errors of previous
models.
Algorithm:
Algorithm explanation:
19
Initialize Model Weights: Begin with a single weak learner and assign equal weights to
all training examples.
Train Weak Learner: Train weak learners on these dataset.
Sequential Learning: Boosting works by training models sequentially where each
model focuses on correcting the errors of its predecessor. Boosting typically uses a
single type of weak learner like decision trees.
Weight Adjustment: Boosting assigns weights to training datapoints. Misclassified
examples receive higher weights in the next iteration so that next models pay more
attention to them
Diagram:
Training Data
[Classifier 3]
20
In the diagram above we have taken a dataset with train and test data. Initially we train a
weak learner w1(Decision Tree Tree1)on this data. In the next step we consider the errors
produced by the first model and try to improve upon the errors produced in the test by the
second weak learner w2 which is trained on the errors produced by the previous model. Weak
learner w2 is a trained decision tree Tree 2. This process is continued till n steps , till we
reach an optimal accuracy or maximum number of steps n.
Advantages:
Reduces bias
Can convert weak learners to strong learners
Often achieves high accuracy
21
2.3 Stacking (Stacked Generalization)
Stacking is a ensemble learning technique where the final model known as the “stacked
model" combines the predictions from multiple base models. The goal is to create a
stronger model by using different models and combining them.
Concept: Use a meta-classifier to learn how to best combine predictions from multiple base
classifiers.
Architecture:
→ [Classifier 2] → Prediction 2
→ [Classifier 3] → Prediction 3
22
Level 1 (Meta-Classifier):
Architecture of Stacking
Stacking architecture is like a team of models working together in two layers to improve
prediction accuracy. Each layer has a specific job and the process is designed to make the
final result more accurate than any single model alone. It has two parts:
1. Base Models (Level-0)
These are the first models that directly learn from the original training data. You can think
of them as the “helpers” that try to make predictions in their own way.
Base models can be Decision Tree, Logistic Regression, Random Forest, etc.
Each model is trained separately using the same training data.
2. Meta-Model (Level-1)
This is the final model that learns from the output of the base models instead of the raw
data. Its job is to combine the base models predictions in a smart way to make the final
prediction.
A simple Linear Regression or Logistic Regression can act as a meta-model.
It looks at the outputs of the base models and finds patterns in how they make mistakes
or agree.
23
Process:
24
3. Combination Methods
In hard voting, the predicted output class is a class with the highest majority of votes, i.e.,
the class with the highest probability of being predicted by each classifier.
Number of classifiers: 5
Classifier 1: Class A
Classifier 2: Class A
Classifier 3: Class B
Classifier 4: Class A
Classifier 5: Class C
Vote Count:
Class B: 1 vote
Class C: 1 vote
25
In this, the average probabilities of the classes determine which one will be the final
prediction
Number of classifiers: 3
Average probabilities:
Key Features:
26
Each tree trained on bootstrap sample
Random subset of features at each split
Diagram:
Training Data
↓ ↓ ↓
↓ ↓ ↓ ↓
Majority Vote
Final Prediction
Advantages:
Algorithm Steps:
27
αt = 0.5 × ln((1 - εt) / εt)
Process:
For m = 1 to M:
[Link] LEARNING
Unsupervised Learning
Example: Consider that you have a dataset that contains information about the purchases you
made from the shop. Through clustering, the algorithm can group the same purchasing
behavior
among you and other customers, which reveals potential customers without predefined labels.
This type of information can help businesses get target customers as well as identify outliers.
28
There are two main categories of unsupervised learning that are mentioned below:
Clustering
Association
Clustering
Clustering is the process of grouping data points into clusters based on their similarity. This
technique is useful for identifying patterns and relationships in data without the need for
labeled
examples.
Mean-shift algorithm
DBSCAN Algorithm
Association
Apriori Algorithm
Eclat
FP-growth Algorithm
It helps to discover hidden patterns and various relationships between the data.
Used for tasks such as customer segmentation, anomaly detection, and data
29
exploration.
It does not require labeled data and reduces the effort of data labeling.
Without using labels, it may be difficult to predict the quality of the model’s output.
Cluster Interpretability may not be clear and may not have meaningful interpretations.
It has techniques such as autoencoders and dimensionality reduction that can be used
essential information.
Image and video compression: Reduce the amount of storage required for multimedia
content.
30
✔ Here K defines the number of pre-defined clusters that need to be created in the
process, as if K=2, there will be two clusters, and for K=3, there will be three clusters,
and so on.
✔ t is a centroid-based algorithm, where each cluster is associated with a centroid.
✔ The main aim of this algorithm is to minimize the sum of distances between the
data point and their corresponding clusters.
✔ The k-means clustering algorithm mainly performs two tasks:
✔ Determines the best value for K center points or centroids by an iterative process.
✔ Assigns each data point to its closest k-center.
✔ Those data points which are near to the particular k-center, create a cluster.
[Link] CLUSTERING
Hierarchical clustering is used to group similar data points together based on their similarity
creating a hierarchy or tree-like structure. The key idea is to begin with each data point as
its own separate cluster and then progressively merge or split them based on their similarity.
Lets understand this with the help of an example
Imagine you have four fruits with different weights: an apple (100g), a banana (120g), a
cherry (50g) and a grape (30g). Hierarchical clustering starts by treating each fruit as its
own group.
31
It then merges the closest groups based on their weights.
First the cherry and grape are grouped together because they are the lightest.
Next the apple and banana are grouped together.
Finally all the fruits are merged into one large group, showing how hierarchical clustering
progressively combines the most similar data points.
Dendrogram
A dendrogram is like a family tree for clusters. It shows how individual data points or
groups of data merge together. The bottom shows each data point as its own group, and as
you move up, similar groups are combined. The lower the merge point, the more similar the
groups are. It helps you see how things are grouped step by step. The working of the
dendrogram can be explained using the below diagram:
32
Dendrogra
m
In the above image on the left side there are five points labeled P, Q, R, S and T. These
represent individual data points that are being clustered. On the right side there’s
a dendrogram which show how these points are grouped together step by step.
At the bottom of the dendrogram the points P, Q, R, S and T are all separate.
As you move up, the closest points are merged into a single group.
The lines connecting the points show how they are progressively merged based on
similarity.
The height at which they are connected shows how similar the points are to each other;
the shorter the line the more similar they are
Types of Hierarchical Clustering
Now we understand the basics of hierarchical clustering. There are two main types of
hierarchical clustering.
1. Agglomerative Clustering
2. Divisive clustering
33
Hierarchical Agglomerative Clustering
It is also known as the bottom-up approach or hierarchical agglomerative clustering
(HAC). Unlike flat clustering hierarchical clustering provides a structured way to group data.
This clustering algorithm does not require us to prespecify the number of clusters. Bottom-up
algorithms treat each data as a singleton cluster at the outset and then successively
agglomerate pairs of clusters until all clusters have been merged into a single cluster that
contains all data.
Hierarchi
cal Agglomerative Clustering
34
Workflow for Hierarchical Agglomerative clustering
1. Start with individual points: Each data point is its own cluster. For example if you have
5 data points you start with 5 clusters each containing just one data point.
2. Calculate distances between clusters: Calculate the distance between every pair of
clusters. Initially since each cluster has one point this is the distance between the two data
points.
3. Merge the closest clusters: Identify the two clusters with the smallest distance and merge
them into a single cluster.
4. Update distance matrix: After merging you now have one less cluster. Recalculate the
distances between the new cluster and the remaining clusters.
5. Repeat steps 3 and 4: Keep merging the closest clusters and updating the distance matrix
until you have only one cluster left.
6. Create a dendrogram: As the process continues you can visualize the merging of
clusters using a tree-like diagram called a dendrogram. It shows the hierarchy of how
clusters are merged.
35
Types of Linkages in Hierarchical Clustering
Hierarchical clustering is used to group similar data points and organise data in a tree-like
structure. Key part of this process is linkage which calculates the distance between clusters
before they are merged or divided. Different types of linkage is used measure this distance
differently.
36
1. Single Linkage
For two clusters R and S the single linkage returns the minimum distance between two
points. This method creates long, chain-like clusters because it is sensitive to outliers and
can connect clusters based on a very small number of close points.
L(R,S)=min(D(i,j)),iϵR,jϵS
where
D(i, j): Distance function between points i and j.
Single Linkage
2. Complete Linkage
For two clusters R and S the complete linkage returns the maximum distance between
two points. It tends to create compact and spherical clusters because it is more sensitive to
outliers and tries to make sure that the clusters are not too far.
L(R,S)=max(D(i,j)),i∈R,j∈S
where
D(i, j): Distance function between points i and j.
37
Complete Linkage
3. Average Linkage
It returns the average distance between all pairs of points from two clusters. This method
maintain a balance between single and complete linkage by considering all pairs of points
not just the closest or farthest point. It usually results in clusters that are moderately
compact.
where
nR : Number of data-points in R
nS : Number of data-points in S
Average
Linkage
38
4. Ward's Linkage
It calculates the distance between two clusters by looking at total spread or variance
increase when the clusters are combined. This method creates compact, well-separated
clusters by making sure that data within each cluster is as similar as possible.
D(i,j),i∈R,j∈S
where
nR and nS are the sizes of clusters R and S
D(i, j) is the distance between points i∈R and j∈S.
Ward Linkage
5. Centroid Linkage
It calculates the distance between two clusters based on the distance between their central
points i.e the average of all points in the cluster. This method works well when clusters
are round or evenly shaped but it may not be the best for irregularly shaped clusters.
L(R,S)=D(Rˉ,Sˉ)where
RˉRˉ and SˉSˉ are the centroids (mean points) of clusters R and S
D(Rˉ,Sˉ) is the distance between the centroids of clusters R and S.
39
Centroid Linkage
Each linkage method has its own advantages and we can use them based on our needs and
type of data we have
Partially supervised learning is a machine learning paradigm that utilizes both labelled and
unlabelled data during model training. It falls between supervised learning, which requires
completely labelled datasets, and unsupervised learning, which uses no labels. This approach
is especially powerful in real-world situations where labelled data is scarce or expensive, but
large volumes of raw, unlabelled data are available.
PSL is a machine learning paradigm that addresses the common real-world scenario where
labelled data is scarce or expensive to obtain, while unlabelled data is abundant. It bridges the
gap between supervised learning (which requires fully labelled datasets) and unsupervised
learning (which uses no labels).
Key Characteristics:
40
Performance enhancement: Often outperforms purely supervised methods with
limited labelled data
In many practical domains such as medical imaging, natural language processing, and web
mining, obtaining labelled data is time-consuming, expensive, or requires domain expertise.
However, unlabelled data—such as raw text, images, or logs—is often abundant. Partially
supervised learning helps address the gap by making use of this vast unlabelled data,
significantly reducing the labelling cost while still maintaining good model accuracy.
How It Works
In partially supervised learning, a small amount of labelled data is used to guide the learning
process, while the model simultaneously learns from a large amount of unlabelled data. The
core idea is to allow the model to generalize from the labelled data and apply what it learns to
structure or classify the unlabelled data.
[Link]-Supervised Learning
Definition: Uses a small amount of labelled data combined with a large amount of
unlabelled data
Assumption: Unlabelled data contains valuable information about the underlying data
distribution
Goal: Improve learning accuracy beyond what's achievable with labelled data alone
41
3. Active Learning
Definition: Iteratively selects the most informative unlabelled examples for manual
labelling
Goal: Minimize labelling effort while maximizing performance
Query strategies: Uncertainty sampling, diversity sampling, expected model change
1. Self-Training (Self-Labeling)
Process:
42
Stop when performance plateaus or no new high-confidence examples are found
Applications
Medical Imaging
Radiology: Training deep learning models for medical image analysis without
requiring large quantities of labelled training data Self-supervised learning for
medical image classification: a systematic review and implementation guidelines | npj
Digital Medicine
Diagnostic Imaging: X-rays, MRIs, CT scans analysis with limited expert
annotations
Medical Image Segmentation: Utilizing large amounts of unlabelled data in
conjunction with labelled data to train higher-performing segmentation model
[Link]-Training
Requirements:
Process:
4. Retrain classifiers
Applications:
43
Web page classification (text content + hyperlinks)
3. Graph-Based Methods
Core Idea: Construct a graph where nodes represent data points and edges represent
similarity
44
4. Unlabelled nodes receive labels based on weighted neighbors
Advantages:
Theoretically well-founded
4. Generative Models
Expectation-Maximization (EM):
Examples:
Advantages
45
10. MARVOK MODELS
✔ Key Concepts:
✔ States: A set of possible conditions or situations of the system.
✔ Transitions: Probabilities of moving from one state to another. These probabilities
are often represented in a transition matrix.
✔ Markov Property: The probability of transitioning to any particular state depends
solely on the current state and time elapsed, regardless of how the system arrived at its
current state.
✔ Markov Models Applications:
✔ Weather Prediction: Modeling weather conditions where tomorrow's weather
depends only on today's weather.
✔ Game Theory: Analyzing strategic interactions where the outcome of a decision
depends on the current state.
46
✔ In an HMM, the states are not directly observable, but the system emits observable
symbols or observations.
✔ The idea is that each state has a probability distribution over possible observations,
and transitions between states have associated probabilities.
✔ Key Concepts:
✔ States: Hidden states that are not directly observable.
✔ Observations: Observable symbols or data points emitted by each hidden state.
✔ Transition Probabilities: Probabilities of transitioning from one hidden state to
another.
✔ Emission Probabilities: Probabilities of emitting a particular observation given a
hidden state..
Probability-based clustering (also called soft clustering) is a technique where each data
point is not assigned to a single cluster, but instead is given a probability of belonging to
each cluster.
47
This approach is more flexible than hard clustering (like K-Means), especially when clusters
overlap or the data is not clearly separated.
🧠 Key Ideas
o Cluster A: 70%
o Cluster B: 20%
o Cluster C: 10%
3. The clustering is done using an algorithm like Gaussian Mixture Model (GMM)
with the Expectation-Maximization (EM) method.
1. Initialize: Start with random guesses for cluster parameters (mean, variance, etc.)
2. E-step (Expectation):
For each data point, compute the probability (likelihood) it belongs to each cluster.
3. M-step (Maximization):
Update each cluster's parameters to maximize the likelihood of the data given these
probabilities.
Advantages
48
More flexible and accurate in complex distributions
Limitations
The Vector Space Model (VSM) is a way of representing documents through the words that
they contain It is a standard technique in Information Retrieval The VSM allows
decisions to be made about which documents are similar to each other and to keyword
queries
How it works:
Example
Document A–
a 2 dog 1
“A frog.” frog 1 1
Example, continued
The vocabulary contains all words used– a, dog, and, cat, frog The vocabulary
needs to be sorted– a, and, cat, dog, frog
Queries
49
Queries can be represented as vectors in the same way as documents:– Dog = (0,0,0,1,0)–
Frog = ( – Dog and frog = ( ) )
Similarity measures
o There are many different ways to measure how similar two documents are, or how
similar a document is to a query
o The cosine measure is a very common similarity measure
o Using a similarity measure, a set of documents can be compared to a query and the
most similar document returned
o For two vectors d and d’ the cosine similarity between d and d’ is given by: d d d ' d
'
o Here d X d’ is the vector product of d and d’, calculated by multiplying corresponding
frequencies together
o The cosine measure calculates the angle between the vectors in a high-dimensional
virtual space
Example
Ranking documents
50
Vocabulary
Stopword lists
o – Commonly occurring words are unlikely to give useful information and may be
removed from the vocabulary to speed processing– Stopword lists contain frequent
words to be excluded– Stopword lists need to be used carefully • E.g. “to be or not
to be
Term weighting
o A calculation designed to make rare words more important than common words
o The idf of word i is given by idf log i n N i
o Where N is the number of documents and ni is the number that contain word i
tf-idf
o The tf-idf weighting scheme is to multiply each word in each document by its tf factor
and idf factor
o Different schemes are usually used for query vectors
o Different variants of tf-idf are also used
51