Understanding Clustering in Machine Learning
Understanding Clustering in Machine Learning
702AI0C012
Clustering
Unit-6
1
Contents
Hierarchical Methods:
Agglomerate versus Divisive Hierarchical Clustering
Partitioning Methods
k-Means Clustering,
Density-Based Clustering
DBSCAN.
Dimensionality reduction
Principle Component Analysis.
2
Clustering
• Clustering in machine learning is a type of unsupervised learning
where the goal is to group similar data points together based on
certain features or characteristics.
• Gathering the Data: The first step involves collecting a large dataset of
unlabeled data points. Think of it as gathering all the data points, each
representing an individual element like a customer, a gene, or a document.
• Pattern Recognition: The algorithm then analyzes the data, searching for patterns
and relationships between the data points based on their features.
• Group Formation: Based on these similarities, the algorithm groups data points
into clusters.
• Model Building: The algorithm builds a model that represents the discovered
clusters and their relationships.
5
Clustering Example: Marketing
Identify and characterize customer associations for
marketing
Get customer base and divide them into 3 well defined
groups to decide marketing strategies
Features of customers are
1. Age in years
2. Engagement with the page (Range is 0 to 7) in
days/week
6
Clustering Example: Marketing
• Customer details
Form clusters
8
Application
• Clustering has a wide range of applications across various fields. Here are some of
the most common ones:
• Anomaly detection: Clustering can be used to identify data points that fall
outside of the expected patterns. This is helpful in fraud detection, system health
monitoring, and other applications where detecting outliers is important. 10
Hierarchical Clustering
• Hierarchical clustering is another unsupervised machine learning
algorithm, which is used to group the unlabeled datasets into a
cluster and is also known as hierarchical cluster analysis or HCA.
11
• Divisive Clustering- Divisive clustering is known as the top-down
approach. We take a large cluster and start dividing it into two, three,
four, or more clusters.
12
Working of Hierarchical Clustering
• Let's consider that we have a few points on a 2D plane with x-y coordinates.
• Here, each data point is a cluster of its own. We want to determine a way to
compute the distance between each of these points. For this, we try to find the
1013
shortest distance between any two data points to form a cluster.
• Once we find those with the least distance between them, we start
grouping them and forming clusters of multiple points.
14
• As a result, we have three groups: P1-P2, P3-P4, and P5-P6. Similarly,
we have three dendrograms, as shown below:
15
• In the next step, we bring two groups together. Now the two groups
P3-P4 and P5-P6 are all under one dendrogram because they're closer
together than the P1-P2 group. This is as shown below:
16
• We finish when we’re left with one cluster and finally bring everything
together.
17
Distance Measure
• Distance measure determines the similarity between two elements
and it influences the shape of the clusters.
• There are various ways to calculate the distance between two clusters, and these
ways decide the rule for clustering. These measures are called Linkage methods. 18
• Some of the popular linkage
methods are:
20
Agglomerative Clustering
• Agglomerate clustering begins with each element as a separate
cluster and merges them into larger clusters.
• Let's assume that we have six data points in an Euclidean space.
We're dealing with X-Y dimensions in such a case.
• There are three key questions need to be answered:
• How do we represent a cluster that has more than one point?
• How do we determine the nearness of clusters?
• When do we stop combining clusters?
2021
• How do we represent a cluster of more than one point?
• We will use centroids, which is the average of its points.
• Let’s first take points (1,2) and (2,1), and we’ll group them because
they're close. For these points, we compute a point in the middle and
mark it as (1.5,1.5). It’s the centroid of those two points.
22
• Next, we measure the other group of points by taking (4,1) and (5,0).
We set up a centroid of those two points as (4.5,0.5).
23
• Once we have the centroid of the two groups, we see that the next
closest point to a centroid (1.5, 1.5) is (0,0) and group them,
computing a new centroid based on those three points. The new
centroid will be (1,1).
24
• We do the same with the last point (5,3), and it computes into the
first group. You can see that the dendrogram on the right is growing.
• Now each of these points is connected. We group them, and finally,
we get a centroid of that group, too, at (4.7,1.3).
•
25
• Finally, we combine the two groups by their centroids and end up
with one large group that has its centroid. Usually, we don't compute
the last centroid; we just put them all together.
26
• when do we stop combining clusters? There are many approaches
you can take:
• Pick several clusters(k) upfront
• We decide the number of clusters (say, the first six or seven) required in the
beginning, and we finish when we reach the value K
• Stop when the next merge would create a cluster with low cohesion
• We keep clustering until the next merge of clusters creates a bad cluster/low
cohesion setup.
• Diameter of a cluster
• Diameter is the maximum distance between any pair of points in the cluster.
We finish when the diameter of a new cluster exceeds the threshold.
27
Divisive Clustering
• The divisive clustering approach begins with a whole set composed of
all the data points and divides it into smaller clusters. This can be
done using a monothetic divisive method.
• We consider a space with six points in it as we did before.
28
• We name each point in the cluster as A, B, C, D, E and F.
• Here, we obtain all possible splits into two clusters, as shown below.
29
• We can see the hierarchical dendrogram coming down as we start
splitting everything apart. It continues to divide until every data point
has its node or until we get to K (if we have set a K value).
3031
Exercise
• Find the clusters using a single link technique. Use Euclidean distance
and draw the dendrogram.
Sample No. X Y
P1 0.40 0.53
P2 0.22 0.38
P3 0.35 0.32
P4 0.26 0.19
P5 0.08 0.41
P6 0.45 0.30
32
• Step 1: Compute the distance matrix by:
33
• Step 2: Merging the two closest members of the two clusters and
finding the minimum element in the distance matrix.
• Here the minimum value is 0.10 and hence we combine P3 and P6 (as
0.10 came in the P6 row and P3 column).
• Now, form clusters of elements corresponding to the minimum value
and update the distance matrix. To update the distance matrix:
min ((P3,P6), P1) = min ((P3,P1), (P6,P1)) = min (0.22,0.24) = 0.22
min ((P3,P6), P2) = min ((P3,P2), (P6,P2)) = min (0.14,0.24) = 0.14
min ((P3,P6), P4) = min ((P3,P4), (P6,P4)) = min (0.13,0.22) = 0.13
min ((P3,P6), P5) = min ((P3,P5), (P6,P5)) = min (0.28,0.39) = 0.28
34
• The updated distance matrix:
• Now we will repeat the same process. Merge the two closest members of
the two clusters and find the minimum element in the distance matrix.
• The minimum value is 0.13 and hence we combine P3, P6 and P4. Now,
form the clusters of elements corresponding to the minimum values and
update the distance matrix.
35
min (((P3,P6) P4), P1) = min (((P3,P6), P1), (P4,P1)) = min (0.22,0.37) = 0.22
min (((P3,P6), P4), P2) = min (((P3,P6), P2), (P4,P2)) = min (0.14,0.19) = 0.14
min (((P3,P6), P4), P5) = min (((P3,P6), P5), (P4,P5)) = min (0.28,0.23) = 0.23
• The updated distance matrix:
36
• Now the minimum value is 0.14 and hence we combine P2 and P5.
Now, form a cluster of elements corresponding to the minimum value
and update the distance matrix.
min ((P2,P5), P1) = min ((P2,P1), (P5,P1)) = min (0.23, 0.34) = 0.23
min ((P2,P5), (P3,P6,P4)) = min ((P3,P6,P4), (P3,P6,P4)) = min (0.14. 0.23) = 0.14
• The updated distance matrix:
37
• Now the minimum value is 0.14 and hence we combine P2,P5 and
P3,P6,P4. Now, form a cluster of elements corresponding to the
minimum value and update the distance matrix.
min ((P2,P5,P3,P6,P4), P1) = min ((P2,P5), P1), ((P3,P6,P4), P1)) =
min (0.23, 0.22) = 0.22
• The updated distance matrix:
38
39
• For the given set of points, identify clusters using the complete link
agglomerative clustering.
Sample No X Y
P1 1 1
P2 1.5 1.5
P3 5 5
P4 3 4
P5 4 4
P6 3 3.5
40
• The distance matrix is :
4041
max (d(P4,P6), P1) = max (d(P4,P1), d(P6,P1)) = max (3.6, 3.2) = 3.6
max (d(P4,P6), P2) = max (d(P4,P2), d(P6,P2)) = max (2.92, 2.5) = 2.92
max (d(P4,P6), P3) = max (d(P4,P3), d(P6,P3)) = max (2.24, 2.5) = 2.5
max (d(P4,P6), P5) = max (d(P4,P5), d(P6,P5)) = max (1.0, 1.12) = 1.12
• The updated distance matrix:
42
• Now the minimum value is 0.71 and hence we combine P1 and P2.
max (d(P1, P2), P3) = max (d(P1, P3), d(P2, P3)) = max (5.66, 4.95) = 5.66
max (d(P1,P2), (P4,P6)) = max (d(P1, P4, P6), d(P2, P4, P6)) = max (3.6, 2.92) =
3.6
max (d(P1,P2), P5) = max (d(P1, P5), d(P2, P5)) = max (4.24, 3.53) = 4.24
• The updated distance matrix:
43
• Now the minimum value is 1.12 and hence we combine P4, P6 and
P5.
max (d(P4,P6,P5), (P1,P2)) = max (d(P4,P6,P1,P2), d(P5,P1,P2)) = max (3.6, 4.24)
= 4.24
max (d(P4,P6,P5), P3) = max (d(P4,P6,P3), d(P5, P3)) = max (2.5, 1.41) = 2.5
• Updated distance matrix is:
44
• Now the minimum value is 2.5 and hence combine P4,P6,P5 and P3.
max (d(P4,P6,P5,P3), (P1,P2)) = max (d(P4,P6,P5,P1,P2), d(P3,P1,P2)) =
max (3.6, 5.66) = 5.66
• Updated distance matrix is:
45
46
Partitioning Method
• This clustering method classifies the information into multiple groups
based on the characteristics and similarity of the data.
• It is for the data analysts to specify the number of clusters that have
to be generated for the clustering methods.
• In the partitioning method when database(D) contains multiple(N)
objects then the partitioning method constructs user-specified(K)
partitions of the data in which each partition represents a cluster and
a particular region.
• Most popular algorithm under the partitioning method is K-Means
clustering.
48
k-Means Clustering
• The resulting similarity among the data objects inside the group (intra-cluster)
is high but the similarity of data objects with the data objects from outside the
cluster is low (inter-cluster).
• The similarity of the cluster is determined with respect to the centroid of the
cluster. It is a type of square error algorithm.
49
• Method:
• Randomly assign K objects from the dataset(D) as cluster centroids(C)
• (Re) Assign each object to a cluster which is most similar based upon
centroids.
• Update Cluster centroids, i.e., recalculate the centroid of each cluster with the
updated values.
• Repeat Step 2 and 3 until no change occurs.
50
Example Point Coordinates
A1 (2,10)
A2 (2,6)
A6 (1,2)
• We need to make 3 A7 (5,10)
clusters. A8 (4,9)
A9 (10,12)
A10 (7,5)
A11 (9,11)
A12 (4,6)
A13 (3,10)
A14 (3,8)
A15 (6,11)
5051
• First, we will randomly choose 3 centroids from the given data. Let us
consider A2 (2,6), A7 (5,10), and A15 (6,11) as the centroids of the
initial clusters. Hence, we will consider that
• Centroid 1=(2,6) is associated with cluster 1.
• Centroid 2=(5,10) is associated with cluster 2.
• Centroid 3=(6,11) is associated with cluster 3.
• Now we will find the Euclidean distance between each point and the
centroids. Based on the minimum distance of each point from the
centroids, we will assign the points to a cluster.
52
Distance from Centroid 1 Distance from Centroid 2 Distance from Centroid 3
Point Assigned Cluster
(2,6) (5,10) (6,11)
54
Distance from Centroid 1 Distance from centroid 2 (4, Distance from centroid 3 (9,
Point Assigned Cluster
(3.833, 5.167) 9.6) 11.25)
55
• Now, we have completed the second iteration of the k-means clustering
algorithm and assigned each point to an updated cluster.
• Now, we will calculate the new centroid for each cluster for the third iteration.
• In cluster 1, we have 5 points i.e. A2 (2,6), A5 (6,4), A6 (1,2), A10 (7,5), and A12 (4,6). To
calculate the new centroid for cluster 1, we will find the mean of the x and y coordinates of
each point in the cluster. Hence, the new centroid for cluster 1 is (4, 4.6).
• In cluster 2, we have 7 points i.e. A1 (2,10), A4 (6,9), A7 (5,10) , A8 (4,9), A13 (3,10), A14
(3,8), and A15 (6,11). Hence, the new centroid for cluster 2 is (4.143, 9.571)
• In cluster 3, we have 3 points i.e. A3 (11,11), A9 (10,12), and A11 (9,11). Hence, the new
centroid for cluster 3 is (10, 11.333).
• At this point, we have calculated new centroids for each cluster. Now, we will
calculate the distance of each data point from the new centroids. Then, we will
assign the points to clusters based on their distance from the centroids.
56
Distance from Centroid 1 (4, Distance from centroid Distance from centroid 3 (10,
Point Assigned Cluster
4.6) 2 (4.143, 9.571) 11.333)
X Y
2 4
2 6
5 6
4 7
8 3
6 6
5 2
5 7
6 3
4 4
59
Density-Based Clustering Algorithms
60
Density-Based Clustering Algorithms
• Sometimes data points are grouped/clustered in arbitrary shapes or
include outliers
• Density-based clustering algorithms are efficient at finding high-
density regions and outliers
• Sometimes it is required to detect outliers for some task, e.g.
anomaly detection
• One method is DBSCAN
• Density-Based Spatial Clustering of Applications with Noise (DBSCAN)
is a base algorithm for density-based clustering.
• It can discover clusters of different shapes and sizes from a large
amount of data, which contains noise and outliers.
• Based on the fact that a point belongs to a cluster if it is close to
many points from that cluster
• Two key parameters are
• eps: The distance that specifies the neighborhoods. Two
points are considered to be neighbors if the distance
between them are less than or equal to eps.
• minPts: Minimum number of data points to define a cluster 63
(in neighborhood)
• These parameters can be understood if we explore two concepts
called Density Reachability and Density Connectivity.
• Reachability in terms of density establishes a point to be reachable from another if
it lies within a particular distance (eps) from it.
63
• There are three types of points after the DBSCAN clustering is
complete:
65
• For DBSCAN, the parameters ε and minPts are needed.
• minPts:
• As a rule of thumb, a minimum minPts can be derived from the number of
dimensions D in the data set, as minPts ≥ D + 1.
• The low value minPts = 1 does not make sense, as then every point on its own
will already be a cluster. With minPts ≤ 2, the result will be the same as of
hierarchical clustering with the single link metric.
• Therefore, minPts must be chosen at least 3. However, larger values are
usually better for data sets with noise and will yield more significant clusters.
As a rule of thumb, minPts = 2*dim can be used.
66
• ε:
• The value for ε can then be chosen by using a k-distance graph, plotting the
distance to the k = minPts-1 nearest neighbour ordered from the largest to
the smallest value.
• Good values of ε are where this plot shows an “elbow”: if ε is chosen much too
small, a large part of the data will not be clustered; whereas for a too-high value
of ε, clusters will merge and the majority of objects will be in the same cluster.
67
• K-means and Hierarchical Clustering both fail in creating clusters of
arbitrary shapes. They are not able to form clusters based on varying
densities. That’s why we need DBSCAN clustering.
68
69
70
• Use DBSCAN with ε = 2.0 and MinPts = 2 to determine:
• Core points
• Border points
• Noise points
• Final clusters
7071
• Step 1: Compute the Distance Matrix, we use the Euclidean distance
formula to compute pairwise distances:
72
Step 2: Identify Core Points (MinPts = 2)
A core point has at least MinPts = 2 points (including itself) within radius ε = 2.0.
• Q1: Neighbors within ε → Q2 (only 1 neighbor)
• Not core
• Q2: Neighbors within ε → Q1, Q3
• 2 neighbors → Core Point
• Q3: Neighbors within ε → Q2
• Only 1 neighbor → Not core
• Q4: Neighbors within ε → None
• Not core
• Q5: Neighbors within ε → None
• Not core
• So, Q2 is the only core point.
73
Step 3: Identify Border and Noise Points
• Q1: Not core but within ε of Q2 → Border Point
• Q3: Not core but within ε of Q2 → Border Point
• Q4: No neighbors within ε → Noise Point
• Q5: No neighbors within ε → Noise Point
Step 4: Form Final Clusters
• Q1, Q2, and Q3 belong to Cluster 1 (as Q2 is a core point, and Q1, Q3
are reachable).
• Q4 and Q5 are noise points (not part of any cluster). 74
75
76
77
78
Example
• We choose eps = 0.6 and MinPts =4
79
• Given the points A(3, 7), B(4, 6), C(5, 5), D(6, 4), E(7, 3), F(6, 2), G(7, 2)
and H(8, 4), Find the core points and outliers using DBSCAN. Take Eps
= 2.5 and MinPts = 3.
80
Dimensionality Reduction
• Dimensionality reduction is the process of reducing the number of
features (or dimensions) in a dataset while retaining as much
information as possible.
• This can be done for a variety of reasons, such as to reduce the
complexity of a model, to improve the performance of a learning
algorithm, or to make it easier to visualize the data.
• There are several techniques for dimensionality reduction, including
principal component analysis (PCA), singular value decomposition
(SVD), and linear discriminant analysis (LDA).
• Each technique uses a different method to project the data onto a
lower-dimensional space while preserving important information.
8081
• Some benefits of applying the dimensionality reduction technique to the
given dataset are given below:
• By reducing the dimensions of the features, the space required to store the dataset
also gets reduced.
• Less computation training time is required for reduced dimensions of features.
• Reduced dimensions of features of the dataset help in visualizing the data quickly.
• It removes the redundant features (if present) by taking care of multicollinearity.
• There are also some disadvantages of applying the dimensionality
reduction, which are given below:
• Some data may be lost due to dimensionality reduction.
• In the PCA dimensionality reduction technique, sometimes the principal components
required to consider are unknown.
82
Principle Component Analysis
• Principal Component Analysis (PCA) is a powerful technique used in data
analysis, particularly for reducing the dimensionality of datasets while
preserving crucial information.
• It does this by transforming the original variables into a set of new,
uncorrelated variables called principal components.
• Steps in PCA:
• Standardize the range of continuous initial variables
• Compute the covariance matrix to identify correlations
• Compute the eigenvectors and eigenvalues of the covariance matrix to identify the
principal components
• Create a feature vector to decide which principal components to keep
• Recast the data along the principal component axes
83
• There are as many principal components
as there are variables in the data.
• Principal components are constructed in
such a manner that the first principal
component accounts for the largest
possible variance in the data set.
• The second principal component is
calculated in the same way, with the
condition that it is uncorrelated with
(i.e., perpendicular to) the first principal
component and that it accounts for the
next highest variance.
84
STEP 1: STANDARDIZATION
• this step aims to standardize the range of the continuous initial variables so that each
one of them contributes equally to the analysis.
• It is critical to perform standardization before PCA, because it is quite sensitive regarding
the variances of the initial variables.
• If there are large differences between the ranges of initial variables, those variables with
larger ranges will dominate over those with small ranges (for example, a variable that
ranges between 0 and 100 will dominate over a variable that ranges between 0 and 1),
which will lead to biased results. So, transforming the data to comparable scales can
prevent this problem.
• Mathematically, this can be done by subtracting the mean and dividing by the standard
deviation for each value of each variable.
• Once the standardization is done, all the variables will be transformed to the same scale.
85
STEP 2: COVARIANCE MATRIX COMPUTATION
• this step aims to understand how the variables of the input dataset
vary from the mean with respect to each other, or in other words, to
see if there is any relationship between them.
• Because sometimes, variables are highly correlated in such a way that
they contain redundant information. So, to identify these
correlations, we compute the covariance matrix.
• The covariance matrix is a p × p symmetric matrix (where p is the
number of dimensions) that has as entries the covariances associated
with all possible pairs of the initial variables. For example, for a 3-
dimensional data set with 3 variables x, y, and z, the covariance
matrix is a 3×3 data matrix of this from:
86
87
STEP 3: COMPUTE THE EIGENVECTORS AND
EIGENVALUES OF THE COVARIANCE MATRIX TO
IDENTIFY THE PRINCIPAL COMPONENTS
• Eigenvectors and eigenvalues are the linear algebra concepts that we need
to compute from the covariance matrix to determine the principal
components of the data.
• Their number is equal to the number of dimensions of the data.
• It is eigenvectors and eigenvalues that are behind all the magic of principal
components because the eigenvectors of the Covariance matrix are the
directions of the axes where there is the most variance (most information)
and that we call Principal Components.
• Eigenvalues are simply the coefficients attached to eigenvectors, which give
the amount of variance carried in each Principal Component.
• By ranking your eigenvectors in order of their eigenvalues, highest to
lowest, you get the principal components in order of significance.
88
STEP 4: CREATE A FEATURE VECTOR
• The feature vector is simply a matrix that has as columns the
eigenvectors of the components that we decide to keep.
• This makes it the first step towards dimensionality reduction because
if we choose to keep only p eigenvectors (components) out of n, the
final data set will have only p dimensions.
89
STEP 5: RECAST THE DATA ALONG THE
PRINCIPAL COMPONENTS AXES
• In the previous steps, apart from standardization, you do not make
any changes to the data, you just select the principal components and
form the feature vector, but the input data set remains always in
terms of the original axes (i.e, in terms of the initial variables).
• In this step, the aim is to use the feature vector formed using the
eigenvectors of the covariance matrix, to reorient the data from the
original axes to the ones represented by the principal components
(hence the name Principal Components Analysis). This can be done by
multiplying the transpose of the original data set by the transpose of
the feature vector.
90