0% found this document useful (0 votes)

4 views63 pages

Odule Lustering Nalysis

Module 4 covers clustering analysis, including its definition, applications, and desired features. It discusses various clustering methods such as k-means, hierarchical clustering, and density-based clustering, along with distance metrics like Euclidean and Manhattan distances. The module also highlights the advantages and disadvantages of these methods, particularly focusing on the k-means algorithm and hierarchical clustering.

Uploaded by

veeba

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

4 views63 pages

Odule Lustering Nalysis

Uploaded by

veeba

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

MODULE 4: CLUSTERING ANALYSIS

OUTLINE
 Introduction to clustering
 Applications of Cluster Analysis
 Desired Features of Clustering
 Distance Metrics
 Major Clustering Methods/Algorithms
 Partitioning Clustering: k-means clustering
 Hierarchical Clustering Algorithms (HCA):
 Agglomerative clustering
 Divisive clustering
 Density-based clustering
 DBSCAN algorithm.
 Implementation of Clustering methods
INTRODUCTION TO CLUSTER ANALYSIS
 Clustering is defined as grouping a set of similar objects into
classes or clusters.
 In other words, during cluster analysis, the data is grouped into
classes or clusters, so that records within a cluster (intra-cluster)
have high similarity with one another but have high
dissimilarities in comparison to objects in other clusters (inter-
cluster).
CLUSTER ANALYSIS
APPLICATIONS OF CLUSTER ANALYSIS
 Marketing: It helps companies group customers with similar
behavior to improve targeted marketing and promotions.
 Land use: is used for identifying areas of similar land (agricultural,
forest, industrial, residential etc.).
 Insurance: is helpful for Insurance companies group customers
based on claim patterns.
 City-planning: It helps to group houses based on type, location,
and value (Luxury villas, middle class apartments, low cost housing
etc.).
 Earthquake studies: helps to identify areas where earthquakes
happen frequently.
 Biology studies: helps to group plants, animals, or genes with
similar features.
 Web discovery: groups similar online documents or websites.
 Fraud detection: helps find unusual behavior (outliers).
DESIRED FEATURES OF CLUSTERING
 Scalability: should work well for both small and large amounts of
data.
 Ability to handle different types of attributes: Should work with
numerical, categories and binary data.
 Independent of data input order: should not dependent on the
ordering of input data.
 Identification of clusters with different shapes: should be capable of
identifying clusters of any shape.
 Ability to handle noisy data: should work properly even if the data
has errors, missing values, or unwanted information
 High performance: It should work fast and not use too much
memory or processing power.
 Interpretability: results should be easy to understand and useful for
decision-making.
 Ability to stop and resume: For very large data, the process should
allow pausing and continuing later.
 Minimal user guidance: should not require too many instructions or
expert supervision to work properly.
REAL TIME APPLICATIONS OF CLUSTER
ANALYSIS:
 It is widely used in image processing, data analysis and
pattern recognition.
 It helps marketers to find the distinct groups in their customer
base and they can characterize their customer groups by using
purchasing patterns.
 It can be used in the field of biology, by deriving animal and
plant taxonomies and identifying genes with the same
capabilities.
 It also helps in information discovery by classifying documents
on the web
DISTANCE METRICS
• A distance metric is a function d(x, y) that specifies the distance
between elements of a set as a non-negative real number.

• Two elements are equal under a particular metric if the distance

between them is zero.

• Distance functions present a method to measure the closeness of

two elements.

• Here, elements can be matrices, vectors or arbitrary objects and

do not necessarily need to be numbered.
Euclidean distance
 Euclidean distance is mainly used to calculate distances.
 The distance between two points in the plane with coordinates
(x, y) and (a, b) according to the Euclidean distance formula is
given by:

For example, the (Euclidean) distance between points

(-2, 2) and (2, -1) is calculated as
Euclidean dist((-2, 2), (2, - 1)) = √(-2 - (2))2 + (2 - (-1))2
= √(-4)2 + (3)2
= √16 + 9
= √25
=5
Distance Metrics

Data to calculate Euclidean distances among three persons

The calculation for the distance between person 1 and 2 is:???

The calculation for the distance between person 1 and 3is:???
The calculation for the distance between person 2 and 3 is:???
Distance Metrics
Manhattan distance
 Manhattan distance is also called L1-distance. It is defined as
the sum of the lengths of the projections of the line segment
between the two points on the coordinate axes.
 For example, the distance between two points in the plane with
coordinates (x, y) and (a, b) according to the Manhattan distance
formula, is given by:

Manhattan dist((x, y), (a, b)) = | x – a | + | y – b|

Distance Metrics
MAJOR CLUSTERING
METHODS/ALGORITHMS
 Clustering methods/algorithms can be categorized into five
categories which are given as follows:
• Partitioning method: It constructs random partitions and then
iteratively refines them by some criterion.
• Hierarchical method: It creates a hierarchical decomposition
of the set of data (or objects) using some criterion.
• Density-based method: It is based on connectivity and density
functions.
• Grid based method: It is based on a multiple-level granularity
structure.
• Model based method: A model is considered for each of the
clusters and the idea is to identify the best fit for that model.
Cluster
Analysis

Hierarchical Density based

Partitioning Grid based Model based
Methods Methods

Divisive
K-Means Agglomerative
1. PARTITIONING METHOD
 Clustering is the task of splitting a group of data or dataset into a
small number of clusters.

 For example, the items in a grocery store are grouped into

different categories (butter, milk, and cheese are clustered in dairy
products).

 This is a qualitative kind of partitioning.

 A quantitative approach on the other hand, measures certain
features of the products such as the percentage of milk and i.e.,
products having a high percentage of milk would be clustered
together.

 The k-means clustering is an important technique which falls

under partitioning clustering.
K-MEANS CLUSTERING ALGORITHM
 It is an unsupervised learning algorithm that is used to solve the
clustering problems in ML or data science.
 It groups the unlabeled dataset into different clusters.
 Here K defines the number of pre-defined clusters that need to be
created in the process, as if K=2, there will be two clusters, and
for K=3, there will be three clusters, and so on.
 It is an iterative algorithm that divides the unlabeled dataset
into k different clusters in such a way that each dataset belongs
only one group that has similar properties.
 It is a centroid-based algorithm, where each cluster is associated
with a centroid.
 The main aim of this algorithm is to minimize the sum of
distances between the data point and their corresponding
clusters.
STEPS OF K-MEANS CLUSTERING
ALGORITHM

1. Select k points at random as centroids/cluster centers.

2. Assign data points to the closest cluster based on
Euclidean distance
3. Calculate centroid of all points within the cluster
4. Repeat iteratively till convergence. (Same points are
assigned to the clusters in consecutive iterations)
k-means clustering

Flow chart for K-means Clustering

k-means clustering

It is a centroid-based algorithm, where each cluster is associated with a

centroid. The main aim of this algorithm is to minimize the sum of
distances between the data point and their corresponding clusters.
k-means clustering
k-means clustering
k-means clustering
Elbow method to Choose the value of K
• The performance of the K-means clustering algorithm depends upon
highly efficient clusters that it forms.
• But choosing the optimal number of clusters is a big task.
• The Elbow method is one of the most popular ways to find the
optimal number of clusters.
K-MEANS CLUSTERING
Step 1: take the mean value(random)
Step 2: Find nearest number of mean and place it in the
cluster
Step 3:Repeat Step 1 and 2 till you get same mean.

K={2,3,4,10,11,12,20,25,30}

K=2

Take 2 mean values(random)

m1=4 m2=12
K1={2,3,4} k2={10,11,12,20,25,30}
m1=3 m2=18
K1={2,3,4,10}k2={11,12,20,25,30}

m1=4.75 m2=19.6
K1={2,3,4,10,11,12} k2={20,25,30}

m1=7 m2=25
K1={2,3,4,10,11,12} k2={20,25,30}

m1=7 m2=25
Thus, we are getting same mean, stop…..
K means Clustering
Example-2
Step-1

Database after Initialization

K means Clustering Example-2

Step-1

Database after First Iteration

K means Clustering Example-2

Database after Second Iteration

K means Clustering Example-2

The clusters obtained are: {1,2} and {3,4,5,6,7}

At this iteration, there is no further change in
the clusters. Thus, the algorithm comes to a
halt
here and final result consists of 2 clusters, {1,2}
and {3,4,5,6,7} respectively.

Database after third Iteration

ADVANTAGES
 It is very easy to understand and implement.
 It works quickly, especially for large datasets.
 It performs very well when data is numeric.
 It can handle large amounts of data efficiently.
 It requires less memory and computational power compared to
many other clustering methods.
 Widely Used in Real time Applications (image segmentation,
market segmentation, and pattern recognition).
 Dense clusters are formed with K-means as compared to
Hierarchical clustering.
DISADVANTAGES
 It is difficult to predict the number of clusters (value of k).
 Output is strongly impacted by initial inputs like number of
clusters (value of k)
 Order of data will have strong impact on the final output.
 It is very sensitive to rescaling. If we will rescale our data by
means of normalization or standardization, then the output
will completely change.
 It is not good in doing clustering job if the clusters have a
complicated geometric shape.
 Works Only with Numerical Data (cannot handle categorical
data).
 Not Suitable for Non-Spherical Clusters
APPLICATIONS OF K-MEANS CLUSTERING
ALGORITHM
 Market segmentation: Grouping similar products based on
customer buying patterns.
 Document Clustering:Organizing similar documents or
articles into groups
 Customer segmentation: Grouping customers based on
purchasing behavior for targeted marketing
 Image segmentation: Dividing an image into different
regions based on color or intensity.
 Pattern Recognition: Used in speech recognition and
handwriting recognition systems.
 Data Compression: Reducing data size by grouping similar
data points together
ISSUES WITH K MEANS CLUSTERING
 The results of the k-means algorithm intensively depends on the initial
guesses of the seed records.
 It is sensitive to outliers. Thus, it may give poor results if an outlier is
selected as a starting seed record.
 It works on the basis of Euclidean distance and the mean(s) of the
attribute values of intra-cluster objects. Thus, it is only suitable for
continuous data.
 It does not take into account the size of the clusters.
 Clusters may be either large or small.
• It does not deal with overlapping clusters.
• It does not work well with clusters of different sizes and density.
HIERARCHICAL CLUSTERING
Hierarchical Clustering
 it is also known as hierarchical cluster analysis, is an
algorithm that groups similar objects into groups called
clusters.
 The endpoint is a set of clusters, where each cluster is
distinct from each other cluster, and the objects within each
cluster are broadly similar to each other.
TYPES OF HIERARCHICAL CLUSTERING
1. Agglomerative clustering uses a bottom-up approach,
wherein each data point starts in its own cluster.
 These clusters are then joined greedily, by taking the two most
similar clusters together and merging them.
2. Divisive clustering uses a top-down approach, wherein all
data points start in the same cluster.
 For each cluster, you further divide it down to two clusters
until you hit the desired number of clusters.
HIERARCHICAL CLUSTERING
DENDROGRAM IN HIERARCHICAL
CLUSTERING

 The dendrogram is a tree-like structure that is mainly used

to store each step as a memory that the HC algorithm
performs.
 In the dendrogram plot, the Y-axis shows the Euclidean
distances between the data points, and the x-axis shows all
the data points of the given dataset.
HIERARCHICAL CLUSTERING

Role of linkage metrics

•It is required to compute the distance between clusters by using some
metrics before any clustering is performed. These metrics are called linkage
metrics.
•It will be beneficial to identify the proximity matrix consisting of the
distance between each point using a distance function.

Types of linkage metrics

1. Single linkage
In this , the shortest distance
between two points in each cluster is
considered as the distance between
two clusters.
HIERARCHICAL CLUSTERING
2. Complete linkage
In this, the longest distance
between two points in each cluster
is considered as the distance
between two clusters.

3. Average linkage
In this, the average distance
between each point in one
cluster to every point in the
other cluster is considered as
the distance between two
clusters.
 Let’s solve the problem by hand using both the types of
agglomerative hierarchical clustering :
 Single Linkage : In this, we merge in each step the two
clusters, whose two closest members have the smallest distance.

Using single linkage two clusters are formed :

Cluster 1 : (7,10)
Cluster 2 : (20,28,35)
2. Complete Linkage : In this, we merge in the members of the
clusters in each step, which provide the smallest maximum
pairwise distance.

Using complete linkage two clusters are formed :

Cluster 1 : (7,10,20)
Cluster 2 : (28,35)
ADVANTAGES OF HIERARCHICAL
CLUSTERING

 The ability to handle clusters of different sizes and densities.

 The ability to handle missing data and noisy data.
 No need to specify the number of clusters in advance
 It creates a tree-like structure called a Dendrogram, which
helps visualize relationships among data points.
 Easy to understand and interpret
 Works well for small datasets
 Different distance measures such as Euclidean Distance or
Manhattan Distance can be used.
 Once clusters are formed, the algorithm does not need
repeated adjustments like some partitioning algorithms.
DRAWBACKS OF HIERARCHICAL
CLUSTERING
 High computational cost: requires large computation time
and memory, so it is not suitable for very large datasets.
 Sensitive to noise and outliers
 Irreversible merging or splitting: Once clusters are merged
or split during the process, they cannot be changed later.
 Difficult to determine the optimal number of clusters
 Poor scalability: The algorithm becomes inefficient as the
dataset size increases compared to methods like K-Means
Clustering.
 Choice of distance measure affects results: Different
measures such as Euclidean Distance or Manhattan Distance
may produce different clustering outcomes.
 Storage complexity: It requires storing a distance matrix,
which consumes large memory for big datasets.
APPLICATIONS OF HIERARCHICAL
CLUSTERING
 Biology and Genetics: Used to classify plants, animals, and
genes based on similarities. It is widely applied in
Bioinformatics for gene expression analysis.
 Document Classification: Helps group similar documents or
web pages for better organization and information retrieval.
 Market Research and Customer Segmentation:
Businesses use it to group customers based on purchasing
behavior and preferences.
 Image Processing: Used to group similar pixels or image
features for object recognition and pattern detection.
 Social Network Analysis: Helps identify communities or
groups of people with similar interactions.
 Medical Research: Used to group patients with similar
symptoms or diseases for better diagnosis and treatment
analysis.
 Data Mining and Pattern Recognition Helps discover
hidden patterns and relationships in large datasets.
DENSITY-BASED SPATIAL CLUSTERING OF
APPLICATIONS WITH NOISE (DBSCAN)
 It is based on this intuitive notion of “clusters” and “noise”.
 The key idea is that for each point of a cluster, the
neighborhood of a given radius has to contain at least a
minimum number of points.

Why DBSCAN?
 Partitioning methods and hierarchical clustering work for
finding spherical-shaped clusters.
 Moreover, they are also severely affected by the presence of
noise and outliers in the data.
PARAMETERS REQUIRED FOR DBSCAN
ALGORITHM
 eps: It defines the neighborhood around a data point i.e. if the
distance between two points is lower or equal to ‘eps’ then they are
considered neighbors.
 MinPts: Minimum number of neighbors (data points) within eps
radius.
 In this algorithm, we have 3 types of data points.
1. Core Point: A point is a core point if it has more than MinPts
points within eps.
2. Border Point: A point which has fewer than MinPts within eps
but it is in the neighborhood of a core point.
3. Noise or outlier: A point which is not a core point or border point
STEPS USED IN DBSCAN ALGORITHM
1. Find all the neighbor points within eps and identify the core
points or visited with more than MinPts neighbors.
2. For each core point if it is not already assigned to a cluster,
create a new cluster.
3. Find recursively all its density-connected points and assign
them to the same cluster as the core point.

 Iterate through the remaining unvisited points in the dataset.

Those points that do not belong to any cluster are noise.
ADVANTAGES
 It can find clusters with any shape.
 No need to specify number of clusters: Unlike K-Means
Clustering, it automatically determines clusters based on
density.
 Detects noise and outliers: It can identify noisy data points
that do not belong to any cluster.
 Works well with spatial data: It is effective for datasets where
clusters are formed based on density distribution
 Less sensitive to initialization: The algorithm does not
depend heavily on initial starting points.
DISADVANTAGES
 Difficulty in choosing parameters: Selecting appropriate
values for Epsilon (ε) and MinPts can be challenging.
 Poor performance with varying densities: it struggles
when clusters have different densities.
 High computational cost for large datasets: It may become
slow for very large datasets.
 Sensitive to distance measurement: The clustering result
depends on the distance metric used.
APPLICATIONS
 Spatial data analysis: Used in geographic and
environmental studies to detect clusters of locations.
 Image processing: Helps identify objects and regions in
images.
 Anomaly detection: Used to detect unusual patterns or
outliers in datasets.
 Market analysis: Helps group customers with similar
purchasing patterns.
 Social network analysis: Used to detect communities or
groups within large networks.
 Medical data analysis: Helps cluster patients with similar
symptoms or disease patterns.
DBSCAN VS K- MEANS CLUSTERING
ALGORITHMS

DBSCAN K-Means

• We need not specify the number • It is very sensitive to number of

of clusters. clusters so it need to specified

• Clusters formed can be of any • Clusters formed are spherical

arbitrary shape. or convex in shape

• Can work well with datasets • Does not work well with outliers
having noise and outliers data.

• Two parameters are required for • only one parameter is required is

training the Model for training the model
Given the points A(3, 7), B(4, 6), C(5, 5), D(6, 4), E(7,
3), F(6, 2), G(7, 2) and H(8, 4), Find the core points and
outliers using DBSCAN. Take Eps = 2.5 and MinPts =
3
 Given, Epsilon(Eps) = 2.5
Minimum Points(MinPts) = 3
Let’s represent the given data points in tabular form:
Step 1: To find the core points, outliers and clusters by using DBSCAN we
need to first calculate the distance among all pairs of given data point. Let us
use Euclidean distance measure for distance calculation.
The final distance matrix becomes as shown below:

Step 2: Now, finding all the data points that lie in the Eps-neighborhood of
each data points. That is, put all the points in the neighborhood set of each
data point whose distance is <=2.5.

(A) = {B}; — — — — — — -→ because distance of B is <= 2.5 with A

(B) = {A, C}; — — — — — → because distance of A and C is <= 2.5 with B
(C) = {B, D}; — — — — —→ because distance of B and D is <=2.5 with C
(D) = {C, E, F, G, H}; — → because distance of C, E, F,G and H is <=2.5 with D
(E) = {D, F, G, H}; — — → because distance of D, F, G and H is <=2.5 with E
(F) = {D, E, G}; — — — — → because distance of D, E and G is <=2.5 with F
(G) = {D, E, F, H}; — — -→ because distance of D, E, F and H is <=2.5 with G
(H) = {D, E, G}; — — — — → because distance of D, E and G is <=2.5 with H
Here, data points A, B and C have neighbors <= MinPts (i.e. 3) so can’t be
considered as core points. Since they belong to the neighborhood of other
data points, hence there exist no outliers in the given set of data points.

Data points D, E, F, G and H have neighbors >= MinPts (i.e. 3) and hence
are the core data points.

Point Status
CORE NOISE BORDER

A √
B √
C √
D √
E √
F √
G √
H √
EXAMPLE 2
P1 0.00
Eps=1.9
P2 1.41 0.00
MinPts=4
P3 2.83 1.41 0.00

P4 4.24 2.83 1.41 0.00

P5 5.66 4.24 2.83 1.41 0.00

P6 5.83 4.47 3.16 2.00 1.41 0.00

P7 6.40 5.00 3.61 2.24 1.00 1.00 0.00

P8 5.83 4.47 3.16 2.00 1.41 2.83 2.24 0.00

P9 4.00 3.16 2.83 3.16 4.00 3.16 4.12 5.10 0.00

P10 1.41 2.00 3.16 4.47 5.83 5.66 6.40 6.32 3.16 0.00

P11 2.00 1.41 2.00 3.16 4.47 4.24 5.00 5.10 2.00 1.41 0.00

P12 3.16 2.83 3.16 4.00 5.10 4.47 5.39 6.00 1.41 2.00 1.41 0.00

P1 P2 P3 P4 P5 P6 P7 P8 P9 P10 P11 P12

P1:P2,P10 P2:P1, P3, P11 P3:P2, P4 P4:P3, P5

PS:P4,P6,P7,P8 P6:P5, P7 P7:P5, P6 P8:P5

P9:P12 P10:P1, P11 P11: P2, P10,P12 P12:P9, P11

Point Status

Noise Border
MinPts=4 P1
Core
P2
Noise Border
P3
Noise Border
P4
க
Core
P5
12
Noise Border
P6
3 Noise Border
P7
2 Noise Border
P8

1 Noise
4
P9
1 2
Noise Border
P10
Core
P11
Noise Border
P12

P1:P2,P10 P2:P1, P3, P11 P3:P2, P4 P4:P3, P5

PS:P4, P6,P7,P8 P6:P5, P7 P7:P5, P6 P8:PS

P9:P12 P10:P1, P11 P11: P2,P10, P12 P12:P9, P11

Point X Y

3 7
P1
4 6
P2
5 5
P3 7

6 4
P4 6

7 3
P5 5
6 2
P6
7 2
P7 3

8 4
P8
2
3 3
P9
1
2 6
P10 1 2 4 5

3 5
P11
2 4
P12
Eps=1.9 MinPts=4
P1 0.00
Eps=1.9
P2 1.41 0.00
MinPts=4
P3 2.83 1.41 0.00

P4 4.24 2.83 1.41 0.00

P5 5.66 4.24 2.83 1.41 0.00

P6 5.83 4.47 3.16 2.00 1.41 0.00

P7 6.40 5.00 3.61 2.24 1.00 1.00 0.00

P8 5.83 4.47 3.16 2.00 1.41 2.83 2.24 0.00

P9 4.00 3.16 2.83 3.16 4.00 3.16 4.12 5.10 0.00

P10 1.41 2.00 3.16 4.47 5.83 5.66 6.40 6.32 3.16 0.00

P11 2.00 1.41 2.00 3.16 4.47 4.24 5.00 5.10 2.00 1.41 0.00

P12 3.16 2.83 3.16 4.00 5.10 4.47 5.39 6.00 1.41 2.00 1.41 0.00

P1 P2 P3 P4 P5 P6 P7 P8 P9 P10 P11 P12

P1:P2,P10 P2:P1, P3, P11 P3:P2, P4 P4:P3, P5

PS:P4,P6,P7,P8 P6:P5, P7 P7:P5, P6 P8:P5

P9:P12 P10:P1, P11 P11: P2, P10,P12 P12:P9, P11

Point Status

Noise Border
MinPts=4 P1
Core
P2
Noise Border
P3
Noise Border
P4
க
Core
P5
12
Noise Border
P6
3 Noise Border
P7
2 Noise Border
P8

1 Noise
4
P9
1 2
Noise Border
P10
Core
P11
Noise Border
P12

P1:P2,P10 P2:P1, P3, P11 P3:P2, P4 P4:P3, P5

PS:P4, P6,P7,P8 P6:P5, P7 P7:P5, P6 P8:PS

P9:P12 P10:P1, P11 P11: P2,P10, P12 P12:P9, P11

Cluster Analysis Techniques Explained
No ratings yet
Cluster Analysis Techniques Explained
19 pages
Grid-Based Clustering Overview
No ratings yet
Grid-Based Clustering Overview
19 pages
Clustering Techniques in Machine Learning
No ratings yet
Clustering Techniques in Machine Learning
125 pages
Clustering Techniques and K-Means Guide
No ratings yet
Clustering Techniques and K-Means Guide
23 pages
Unsupervised Learning: Clustering Techniques
No ratings yet
Unsupervised Learning: Clustering Techniques
82 pages
Understanding Clustering Algorithms
No ratings yet
Understanding Clustering Algorithms
74 pages
Clustering Techniques in Machine Learning
No ratings yet
Clustering Techniques in Machine Learning
38 pages
Unit 5 Materials
No ratings yet
Unit 5 Materials
65 pages
Cluster Analysis Methods Explained
No ratings yet
Cluster Analysis Methods Explained
45 pages
Unsupervised Learning in Machine Learning
No ratings yet
Unsupervised Learning in Machine Learning
31 pages
Clustering Techniques in Machine Learning
No ratings yet
Clustering Techniques in Machine Learning
40 pages
Clustering Techniques in Machine Learning
No ratings yet
Clustering Techniques in Machine Learning
40 pages
Understanding Cluster Analysis Techniques
No ratings yet
Understanding Cluster Analysis Techniques
115 pages
AI Data Mining: Clustering Techniques
No ratings yet
AI Data Mining: Clustering Techniques
77 pages
Understanding Clustering Techniques
No ratings yet
Understanding Clustering Techniques
23 pages
Unsupervised Learning: Clustering Methods
No ratings yet
Unsupervised Learning: Clustering Methods
28 pages
Cluster Analysis: Concepts & Methods
No ratings yet
Cluster Analysis: Concepts & Methods
98 pages
Unsupervised Learning: Clustering Methods
No ratings yet
Unsupervised Learning: Clustering Methods
19 pages
Unit IV Clustering
No ratings yet
Unit IV Clustering
116 pages
Understanding Cluster Analysis Techniques
No ratings yet
Understanding Cluster Analysis Techniques
90 pages
Clustering Techniques in Python
No ratings yet
Clustering Techniques in Python
31 pages
Understanding Clustering Techniques
No ratings yet
Understanding Clustering Techniques
49 pages
Clustering Techniques in Data Mining
No ratings yet
Clustering Techniques in Data Mining
84 pages
Clustering and Time-Series Forecasting Techniques
No ratings yet
Clustering and Time-Series Forecasting Techniques
23 pages
Clustering Techniques for Data Insights
No ratings yet
Clustering Techniques for Data Insights
8 pages
Clustering
No ratings yet
Clustering
45 pages
MAL Unit 5
No ratings yet
MAL Unit 5
17 pages
Unit 4 Introduction To Clustering
No ratings yet
Unit 4 Introduction To Clustering
112 pages
Clustering Techniques and Algorithms
No ratings yet
Clustering Techniques and Algorithms
48 pages
Understanding Clustering in Machine Learning
No ratings yet
Understanding Clustering in Machine Learning
29 pages
Clustering Techniques - 10-12 - Shared
No ratings yet
Clustering Techniques - 10-12 - Shared
50 pages
Understanding Clustering Techniques in Data Mining
No ratings yet
Understanding Clustering Techniques in Data Mining
18 pages
Understanding Cluster Analysis Techniques
No ratings yet
Understanding Cluster Analysis Techniques
34 pages
ML Module 5
No ratings yet
ML Module 5
60 pages
Understanding Cluster Analysis Techniques
No ratings yet
Understanding Cluster Analysis Techniques
85 pages
K-means & Hierarchical Clustering Explained
No ratings yet
K-means & Hierarchical Clustering Explained
13 pages
Unsupervised Learning Techniques Overview
No ratings yet
Unsupervised Learning Techniques Overview
36 pages
Understanding Cluster Analysis in Data Mining
No ratings yet
Understanding Cluster Analysis in Data Mining
80 pages
Understanding Cluster Analysis Techniques
No ratings yet
Understanding Cluster Analysis Techniques
26 pages
Understanding Cluster Analysis Techniques
No ratings yet
Understanding Cluster Analysis Techniques
7 pages
Cluster Analysis Methods Explained
No ratings yet
Cluster Analysis Methods Explained
31 pages
Cluster Analysis Techniques in Marketing
No ratings yet
Cluster Analysis Techniques in Marketing
25 pages
Understanding Clustering in Data Mining
No ratings yet
Understanding Clustering in Data Mining
25 pages
Machine Learning Clustering Techniques
No ratings yet
Machine Learning Clustering Techniques
66 pages
Unsupervised Learning and Clustering Techniques
No ratings yet
Unsupervised Learning and Clustering Techniques
84 pages
Understanding Clustering Techniques
No ratings yet
Understanding Clustering Techniques
28 pages
Clustering Techniques in Data Mining
No ratings yet
Clustering Techniques in Data Mining
76 pages
Unit IV For Data Mining
No ratings yet
Unit IV For Data Mining
20 pages
Clustering Techniques in Data Mining
No ratings yet
Clustering Techniques in Data Mining
26 pages
Data Science CH5
No ratings yet
Data Science CH5
14 pages
Module 4 - Clustering and Outlier Detection - Google Slides
No ratings yet
Module 4 - Clustering and Outlier Detection - Google Slides
30 pages
3,4,5 Module AIML 4th ME
No ratings yet
3,4,5 Module AIML 4th ME
126 pages
Understanding Cluster Analysis Techniques
No ratings yet
Understanding Cluster Analysis Techniques
42 pages
Understanding Cluster Analysis Techniques
No ratings yet
Understanding Cluster Analysis Techniques
10 pages
FALLSEM2025 26 VL ISWE209L 00100 TH 2025-11-14 Module 7 Clsutering Theory and Problem
No ratings yet
FALLSEM2025 26 VL ISWE209L 00100 TH 2025-11-14 Module 7 Clsutering Theory and Problem
40 pages
DWDM Unit-Iv
No ratings yet
DWDM Unit-Iv
18 pages
Cluster Analysis Techniques Overview
No ratings yet
Cluster Analysis Techniques Overview
78 pages
Cluster Analysis: Methods and Concepts
No ratings yet
Cluster Analysis: Methods and Concepts
38 pages
Understanding Cluster Analysis Techniques
No ratings yet
Understanding Cluster Analysis Techniques
60 pages
Overview of Deep Learning and AI
No ratings yet
Overview of Deep Learning and AI
11 pages
AI's Impact on Financial Services
No ratings yet
AI's Impact on Financial Services
9 pages
K-Nearest Neighbors for Diabetes Prediction
No ratings yet
K-Nearest Neighbors for Diabetes Prediction
4 pages
Machine Learning for Stroke Prediction
No ratings yet
Machine Learning for Stroke Prediction
6 pages
AI and Humanity Exam Paper - Mahindra University
No ratings yet
AI and Humanity Exam Paper - Mahindra University
2 pages
AI-Driven UX Design: Balancing Ethics and Engagement
No ratings yet
AI-Driven UX Design: Balancing Ethics and Engagement
1 page
Sentiment Analysis of Financial Tweets
No ratings yet
Sentiment Analysis of Financial Tweets
15 pages
Securing Embeddings and Vector Stores
No ratings yet
Securing Embeddings and Vector Stores
52 pages
Trust in AI: Challenges and Future Directions
No ratings yet
Trust in AI: Challenges and Future Directions
11 pages
Detecting AI vs Human Text Analysis
No ratings yet
Detecting AI vs Human Text Analysis
2 pages
Artificial Intelligence in Educational Management For Enhanced Administrative Effectiveness in Rivers State Universities
No ratings yet
Artificial Intelligence in Educational Management For Enhanced Administrative Effectiveness in Rivers State Universities
24 pages
Overview of Natural Language Processing
No ratings yet
Overview of Natural Language Processing
29 pages
AI Monthly Test: Key Concepts and Metrics
No ratings yet
AI Monthly Test: Key Concepts and Metrics
3 pages
Deep Learning for Crop Disease Detection
No ratings yet
Deep Learning for Crop Disease Detection
9 pages
Enhancing Life Insurance Experience with AI
No ratings yet
Enhancing Life Insurance Experience with AI
37 pages
Real Time Network Traffic Analysis Using Artificial Intelligence Machine Learning and Deep Learning A Review of Methods Tools and Applications
No ratings yet
Real Time Network Traffic Analysis Using Artificial Intelligence Machine Learning and Deep Learning A Review of Methods Tools and Applications
8 pages
Entropy 23 00759
No ratings yet
Entropy 23 00759
17 pages
When Unsupervised Domain Adaptation Meets One-Class Anomaly Detection: Addressing The Two-Fold Unsupervised Curse by Leveraging Anomaly Scarcity
No ratings yet
When Unsupervised Domain Adaptation Meets One-Class Anomaly Detection: Addressing The Two-Fold Unsupervised Curse by Leveraging Anomaly Scarcity
10 pages
NLP Assignment: CBOW and Word Embeddings
No ratings yet
NLP Assignment: CBOW and Word Embeddings
3 pages
Big Data: Transforming Modern Management
No ratings yet
Big Data: Transforming Modern Management
49 pages
Deep Learning Neuron Concepts Workbook
No ratings yet
Deep Learning Neuron Concepts Workbook
9 pages
Cognitive Superiority: Information To Power Dean S. Hartley Iii Ebook Mobile-Friendly Edition
100% (1)
Cognitive Superiority: Information To Power Dean S. Hartley Iii Ebook Mobile-Friendly Edition
69 pages
Classification Techniques in Data Mining
No ratings yet
Classification Techniques in Data Mining
37 pages
Intelligent Traffic Management System
No ratings yet
Intelligent Traffic Management System
15 pages
Mathematical Foundations of Deep Learning
No ratings yet
Mathematical Foundations of Deep Learning
713 pages
Legal NLP: Enhancements and Challenges
No ratings yet
Legal NLP: Enhancements and Challenges
3 pages
Machine Learning Exam Questions December 2024
No ratings yet
Machine Learning Exam Questions December 2024
2 pages
Pramod Thota: AI & Backend Engineering Expert
No ratings yet
Pramod Thota: AI & Backend Engineering Expert
4 pages
Introduction to Data Science Concepts
No ratings yet
Introduction to Data Science Concepts
53 pages
Karthikeyan Subramanian's Resume
No ratings yet
Karthikeyan Subramanian's Resume
1 page

Odule Lustering Nalysis

Uploaded by

Odule Lustering Nalysis

Uploaded by

MODULE 4: CLUSTERING ANALYSIS

• Two elements are equal under a particular metric if the distance

• Distance functions present a method to measure the closeness of

• Here, elements can be matrices, vectors or arbitrary objects and

For example, the (Euclidean) distance between points

Data to calculate Euclidean distances among three persons

The calculation for the distance between person 1 and 2 is:???

Manhattan dist((x, y), (a, b)) = | x – a | + | y – b|

Hierarchical Density based

 For example, the items in a grocery store are grouped into

 This is a qualitative kind of partitioning.

 The k-means clustering is an important technique which falls

1. Select k points at random as centroids/cluster centers.

Flow chart for K-means Clustering

It is a centroid-based algorithm, where each cluster is associated with a

Take 2 mean values(random)

Database after Initialization

Database after First Iteration

Database after Second Iteration

The clusters obtained are: {1,2} and {3,4,5,6,7}

Database after third Iteration

 The dendrogram is a tree-like structure that is mainly used

Role of linkage metrics

Types of linkage metrics

Using single linkage two clusters are formed :

Using complete linkage two clusters are formed :

 The ability to handle clusters of different sizes and densities.

 Iterate through the remaining unvisited points in the dataset.

• We need not specify the number • It is very sensitive to number of

• Clusters formed can be of any • Clusters formed are spherical

• Two parameters are required for • only one parameter is required is

(A) = {B}; — — — — — — -→ because distance of B is <= 2.5 with A

P4 4.24 2.83 1.41 0.00

P5 5.66 4.24 2.83 1.41 0.00

P6 5.83 4.47 3.16 2.00 1.41 0.00

P7 6.40 5.00 3.61 2.24 1.00 1.00 0.00

P8 5.83 4.47 3.16 2.00 1.41 2.83 2.24 0.00

P9 4.00 3.16 2.83 3.16 4.00 3.16 4.12 5.10 0.00

P1 P2 P3 P4 P5 P6 P7 P8 P9 P10 P11 P12

P1:P2,P10 P2:P1, P3, P11 P3:P2, P4 P4:P3, P5

PS:P4,P6,P7,P8 P6:P5, P7 P7:P5, P6 P8:P5

P9:P12 P10:P1, P11 P11: P2, P10,P12 P12:P9, P11

P1:P2,P10 P2:P1, P3, P11 P3:P2, P4 P4:P3, P5

PS:P4, P6,P7,P8 P6:P5, P7 P7:P5, P6 P8:PS

P9:P12 P10:P1, P11 P11: P2,P10, P12 P12:P9, P11

P4 4.24 2.83 1.41 0.00

P5 5.66 4.24 2.83 1.41 0.00

P6 5.83 4.47 3.16 2.00 1.41 0.00

P7 6.40 5.00 3.61 2.24 1.00 1.00 0.00

P8 5.83 4.47 3.16 2.00 1.41 2.83 2.24 0.00

P9 4.00 3.16 2.83 3.16 4.00 3.16 4.12 5.10 0.00

P1 P2 P3 P4 P5 P6 P7 P8 P9 P10 P11 P12

P1:P2,P10 P2:P1, P3, P11 P3:P2, P4 P4:P3, P5

PS:P4,P6,P7,P8 P6:P5, P7 P7:P5, P6 P8:P5

P9:P12 P10:P1, P11 P11: P2, P10,P12 P12:P9, P11

P1:P2,P10 P2:P1, P3, P11 P3:P2, P4 P4:P3, P5

PS:P4, P6,P7,P8 P6:P5, P7 P7:P5, P6 P8:PS

P9:P12 P10:P1, P11 P11: P2,P10, P12 P12:P9, P11

You might also like