0% found this document useful (0 votes)
20 views88 pages

Understanding Clustering in Machine Learning

The document provides an overview of clustering in machine learning, focusing on unsupervised learning techniques used to group similar data points without predefined labels. It discusses various clustering methods, including hierarchical (agglomerative and divisive), k-means, and density-based clustering, along with their applications in fields like marketing, recommendation systems, and medical imaging. Additionally, it covers distance measures and linkage methods essential for determining cluster similarity.

Uploaded by

Anushka
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
20 views88 pages

Understanding Clustering in Machine Learning

The document provides an overview of clustering in machine learning, focusing on unsupervised learning techniques used to group similar data points without predefined labels. It discusses various clustering methods, including hierarchical (agglomerative and divisive), k-means, and density-based clustering, along with their applications in fields like marketing, recommendation systems, and medical imaging. Additionally, it covers distance measures and linkage methods essential for determining cluster similarity.

Uploaded by

Anushka
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd

Machine Learning

702AI0C012

Clustering
Unit-6
1
Contents
Hierarchical Methods:
Agglomerate versus Divisive Hierarchical Clustering
Partitioning Methods
k-Means Clustering,
Density-Based Clustering
DBSCAN.
Dimensionality reduction
Principle Component Analysis.

2
Clustering
• Clustering in machine learning is a type of unsupervised learning
where the goal is to group similar data points together based on
certain features or characteristics.

• The objective is to discover inherent structures or patterns within the


data without predefined categories or labels.

• In other words, clustering algorithms partition the data into groups or


clusters such that data points within the same cluster are more
similar to each other than they are to those in other clusters.

• Data points of separate groups shows dissimilarity in characteristics 3


• Key points about clustering:
• Unlabeled Data: Clustering is performed on datasets without labelled output,
meaning that the algorithm doesn't have prior information about which group
a data point belongs to.
• Similarity Measure: Clustering relies on a similarity or distance measure to
determine how alike or dissimilar data points are. Common distance
measures include Euclidean distance, Manhattan distance, and cosine
similarity.
• Cluster Centers: Clustering algorithms often define cluster centres or
centroids, and data points are assigned to the cluster whose centroid is
closest to them.
4
• How clustering works:

• Gathering the Data: The first step involves collecting a large dataset of
unlabeled data points. Think of it as gathering all the data points, each
representing an individual element like a customer, a gene, or a document.

• Pattern Recognition: The algorithm then analyzes the data, searching for patterns
and relationships between the data points based on their features.

• Group Formation: Based on these similarities, the algorithm groups data points
into clusters.

• Model Building: The algorithm builds a model that represents the discovered
clusters and their relationships.

5
Clustering Example: Marketing
Identify and characterize customer associations for
marketing
Get customer base and divide them into 3 well defined
groups to decide marketing strategies
Features of customers are
1. Age in years
2. Engagement with the page (Range is 0 to 7) in
days/week

6
Clustering Example: Marketing

• Customer details

Group customers based on age or based on engagement

Not easy to define clusters.


Clustering Example: Marketing

Form clusters

8
Application
• Clustering has a wide range of applications across various fields. Here are some of
the most common ones:

• Market segmentation: Clustering helps identify distinct groups of customers


based on their purchase history, preferences, or behavior. This allows businesses
to tailor their marketing campaigns and promotions to specific customer
segments for better targeting and increased effectiveness.

• Recommendation engines: Clustering is used to group users with similar tastes or


preferences. This allows recommender systems to suggest products, movies,
music, or other items that a user might be interested

• Social network analysis: By clustering users in social networks, we can identify


communities with shared interests. This helps understand social structures, find
9
influential users, and develop targeted advertising strategies.
• Search result grouping: Search engines use clustering to group similar web pages
together. This improves the quality and relevance of search results by presenting
the user with a more organized set of information.

• Medical imaging analysis: Clustering is applied to classify medical images, such as


X-rays or MRIs, into normal or abnormal categories. This can aid in early disease
detection and diagnosis.

• Image segmentation: Clustering algorithms can be used to segment images into


different objects or regions. This is useful in applications like self-driving cars,
where segmenting objects like pedestrians and vehicles is crucial.

• Anomaly detection: Clustering can be used to identify data points that fall
outside of the expected patterns. This is helpful in fraud detection, system health
monitoring, and other applications where detecting outliers is important. 10
Hierarchical Clustering
• Hierarchical clustering is another unsupervised machine learning
algorithm, which is used to group the unlabeled datasets into a
cluster and is also known as hierarchical cluster analysis or HCA.

• In this algorithm, we develop the hierarchy of clusters in the form of a


tree, and this tree-shaped structure is known as the dendrogram.

• Hierarchical clustering is divided into:


• Agglomerative
• Divisive

11
• Divisive Clustering- Divisive clustering is known as the top-down
approach. We take a large cluster and start dividing it into two, three,
four, or more clusters.

• Agglomerative Clustering- Agglomerative clustering is known as a


bottom-up approach. Consider it as bringing things together.

12
Working of Hierarchical Clustering
• Let's consider that we have a few points on a 2D plane with x-y coordinates.

• Here, each data point is a cluster of its own. We want to determine a way to
compute the distance between each of these points. For this, we try to find the
1013
shortest distance between any two data points to form a cluster.
• Once we find those with the least distance between them, we start
grouping them and forming clusters of multiple points.

• This is represented in a tree-like structure called a dendrogram.

14
• As a result, we have three groups: P1-P2, P3-P4, and P5-P6. Similarly,
we have three dendrograms, as shown below:

15
• In the next step, we bring two groups together. Now the two groups
P3-P4 and P5-P6 are all under one dendrogram because they're closer
together than the P1-P2 group. This is as shown below:

16
• We finish when we’re left with one cluster and finally bring everything
together.

17
Distance Measure
• Distance measure determines the similarity between two elements
and it influences the shape of the clusters.

• Distance measures include:


• Euclidean distance measure
• Squared Euclidean distance measure
• Manhattan distance measure
• Cosine distance measure
• The closest distance between the two clusters is crucial for hierarchical clustering.

• There are various ways to calculate the distance between two clusters, and these
ways decide the rule for clustering. These measures are called Linkage methods. 18
• Some of the popular linkage
methods are:

• Single Linkage: It is the Shortest


Distance between the closest
points of the clusters.
• Complete Linkage: It is the farthest
distance between the two points of two
different clusters. It is one of the popular
19

linkage methods as it forms tighter


clusters than single-linkage.
• Average Linkage: It is the linkage method in
which the distance between each pair of
datapoints is added up and then divided by
the total number of datapoints to calculate
the average distance between two clusters

• Centroid Linkage: It is the linkage method


in which the distance between the
centroid of the clusters is calculated.

20
Agglomerative Clustering
• Agglomerate clustering begins with each element as a separate
cluster and merges them into larger clusters.
• Let's assume that we have six data points in an Euclidean space.
We're dealing with X-Y dimensions in such a case.
• There are three key questions need to be answered:
• How do we represent a cluster that has more than one point?
• How do we determine the nearness of clusters?
• When do we stop combining clusters?

2021
• How do we represent a cluster of more than one point?
• We will use centroids, which is the average of its points.
• Let’s first take points (1,2) and (2,1), and we’ll group them because
they're close. For these points, we compute a point in the middle and
mark it as (1.5,1.5). It’s the centroid of those two points.

22
• Next, we measure the other group of points by taking (4,1) and (5,0).
We set up a centroid of those two points as (4.5,0.5).

23
• Once we have the centroid of the two groups, we see that the next
closest point to a centroid (1.5, 1.5) is (0,0) and group them,
computing a new centroid based on those three points. The new
centroid will be (1,1).

24
• We do the same with the last point (5,3), and it computes into the
first group. You can see that the dendrogram on the right is growing.
• Now each of these points is connected. We group them, and finally,
we get a centroid of that group, too, at (4.7,1.3).

25
• Finally, we combine the two groups by their centroids and end up
with one large group that has its centroid. Usually, we don't compute
the last centroid; we just put them all together.

26
• when do we stop combining clusters? There are many approaches
you can take:
• Pick several clusters(k) upfront
• We decide the number of clusters (say, the first six or seven) required in the
beginning, and we finish when we reach the value K
• Stop when the next merge would create a cluster with low cohesion
• We keep clustering until the next merge of clusters creates a bad cluster/low
cohesion setup.
• Diameter of a cluster
• Diameter is the maximum distance between any pair of points in the cluster.
We finish when the diameter of a new cluster exceeds the threshold.

27
Divisive Clustering
• The divisive clustering approach begins with a whole set composed of
all the data points and divides it into smaller clusters. This can be
done using a monothetic divisive method.
• We consider a space with six points in it as we did before.

28
• We name each point in the cluster as A, B, C, D, E and F.
• Here, we obtain all possible splits into two clusters, as shown below.

29
• We can see the hierarchical dendrogram coming down as we start
splitting everything apart. It continues to divide until every data point
has its node or until we get to K (if we have set a K value).
3031
Exercise
• Find the clusters using a single link technique. Use Euclidean distance
and draw the dendrogram.
Sample No. X Y

P1 0.40 0.53

P2 0.22 0.38

P3 0.35 0.32

P4 0.26 0.19

P5 0.08 0.41

P6 0.45 0.30

32
• Step 1: Compute the distance matrix by:

• The distance matrix will be :

33
• Step 2: Merging the two closest members of the two clusters and
finding the minimum element in the distance matrix.
• Here the minimum value is 0.10 and hence we combine P3 and P6 (as
0.10 came in the P6 row and P3 column).
• Now, form clusters of elements corresponding to the minimum value
and update the distance matrix. To update the distance matrix:
min ((P3,P6), P1) = min ((P3,P1), (P6,P1)) = min (0.22,0.24) = 0.22
min ((P3,P6), P2) = min ((P3,P2), (P6,P2)) = min (0.14,0.24) = 0.14
min ((P3,P6), P4) = min ((P3,P4), (P6,P4)) = min (0.13,0.22) = 0.13
min ((P3,P6), P5) = min ((P3,P5), (P6,P5)) = min (0.28,0.39) = 0.28

34
• The updated distance matrix:

• Now we will repeat the same process. Merge the two closest members of
the two clusters and find the minimum element in the distance matrix.
• The minimum value is 0.13 and hence we combine P3, P6 and P4. Now,
form the clusters of elements corresponding to the minimum values and
update the distance matrix.

35
min (((P3,P6) P4), P1) = min (((P3,P6), P1), (P4,P1)) = min (0.22,0.37) = 0.22
min (((P3,P6), P4), P2) = min (((P3,P6), P2), (P4,P2)) = min (0.14,0.19) = 0.14
min (((P3,P6), P4), P5) = min (((P3,P6), P5), (P4,P5)) = min (0.28,0.23) = 0.23
• The updated distance matrix:

36
• Now the minimum value is 0.14 and hence we combine P2 and P5.
Now, form a cluster of elements corresponding to the minimum value
and update the distance matrix.
min ((P2,P5), P1) = min ((P2,P1), (P5,P1)) = min (0.23, 0.34) = 0.23
min ((P2,P5), (P3,P6,P4)) = min ((P3,P6,P4), (P3,P6,P4)) = min (0.14. 0.23) = 0.14
• The updated distance matrix:

37
• Now the minimum value is 0.14 and hence we combine P2,P5 and
P3,P6,P4. Now, form a cluster of elements corresponding to the
minimum value and update the distance matrix.
min ((P2,P5,P3,P6,P4), P1) = min ((P2,P5), P1), ((P3,P6,P4), P1)) =
min (0.23, 0.22) = 0.22
• The updated distance matrix:

38
39
• For the given set of points, identify clusters using the complete link
agglomerative clustering.
Sample No X Y

P1 1 1

P2 1.5 1.5

P3 5 5

P4 3 4

P5 4 4

P6 3 3.5

40
• The distance matrix is :

• The minimum value is 0.5 and hence we combine P4 and P6.

4041
max (d(P4,P6), P1) = max (d(P4,P1), d(P6,P1)) = max (3.6, 3.2) = 3.6
max (d(P4,P6), P2) = max (d(P4,P2), d(P6,P2)) = max (2.92, 2.5) = 2.92
max (d(P4,P6), P3) = max (d(P4,P3), d(P6,P3)) = max (2.24, 2.5) = 2.5
max (d(P4,P6), P5) = max (d(P4,P5), d(P6,P5)) = max (1.0, 1.12) = 1.12
• The updated distance matrix:

42
• Now the minimum value is 0.71 and hence we combine P1 and P2.
max (d(P1, P2), P3) = max (d(P1, P3), d(P2, P3)) = max (5.66, 4.95) = 5.66
max (d(P1,P2), (P4,P6)) = max (d(P1, P4, P6), d(P2, P4, P6)) = max (3.6, 2.92) =
3.6
max (d(P1,P2), P5) = max (d(P1, P5), d(P2, P5)) = max (4.24, 3.53) = 4.24
• The updated distance matrix:

43
• Now the minimum value is 1.12 and hence we combine P4, P6 and
P5.
max (d(P4,P6,P5), (P1,P2)) = max (d(P4,P6,P1,P2), d(P5,P1,P2)) = max (3.6, 4.24)
= 4.24
max (d(P4,P6,P5), P3) = max (d(P4,P6,P3), d(P5, P3)) = max (2.5, 1.41) = 2.5
• Updated distance matrix is:

44
• Now the minimum value is 2.5 and hence combine P4,P6,P5 and P3.
max (d(P4,P6,P5,P3), (P1,P2)) = max (d(P4,P6,P5,P1,P2), d(P3,P1,P2)) =
max (3.6, 5.66) = 5.66
• Updated distance matrix is:

45
46
Partitioning Method
• This clustering method classifies the information into multiple groups
based on the characteristics and similarity of the data.
• It is for the data analysts to specify the number of clusters that have
to be generated for the clustering methods.
• In the partitioning method when database(D) contains multiple(N)
objects then the partitioning method constructs user-specified(K)
partitions of the data in which each partition represents a cluster and
a particular region.
• Most popular algorithm under the partitioning method is K-Means
clustering.
48
k-Means Clustering

• K-Means (A centroid-based Technique): The K means algorithm takes the input


parameter K from the user and partitions the dataset containing N objects into
K clusters

• The resulting similarity among the data objects inside the group (intra-cluster)
is high but the similarity of data objects with the data objects from outside the
cluster is low (inter-cluster).

• The similarity of the cluster is determined with respect to the centroid of the
cluster. It is a type of square error algorithm.

49
• Method:
• Randomly assign K objects from the dataset(D) as cluster centroids(C)
• (Re) Assign each object to a cluster which is most similar based upon
centroids.
• Update Cluster centroids, i.e., recalculate the centroid of each cluster with the
updated values.
• Repeat Step 2 and 3 until no change occurs.

50
Example Point Coordinates

A1 (2,10)

A2 (2,6)

• You are given 15 points in A3 (11,11)

the Cartesian coordinate A4 (6,9)

system as follows. A5 (6,4)

A6 (1,2)
• We need to make 3 A7 (5,10)

clusters. A8 (4,9)

A9 (10,12)

A10 (7,5)

A11 (9,11)

A12 (4,6)

A13 (3,10)

A14 (3,8)

A15 (6,11)
5051
• First, we will randomly choose 3 centroids from the given data. Let us
consider A2 (2,6), A7 (5,10), and A15 (6,11) as the centroids of the
initial clusters. Hence, we will consider that
• Centroid 1=(2,6) is associated with cluster 1.
• Centroid 2=(5,10) is associated with cluster 2.
• Centroid 3=(6,11) is associated with cluster 3.
• Now we will find the Euclidean distance between each point and the
centroids. Based on the minimum distance of each point from the
centroids, we will assign the points to a cluster.

52
Distance from Centroid 1 Distance from Centroid 2 Distance from Centroid 3
Point Assigned Cluster
(2,6) (5,10) (6,11)

A1 (2,10) 4 3 4.123106 Cluster 2

A2 (2,6) 0 5 6.403124 Cluster 1

A3 (11,11) 10.29563 6.082763 5 Cluster 3

A4 (6,9) 5 1.414214 2 Cluster 2

A5 (6,4) 4.472136 6.082763 7 Cluster 1

A6 (1,2) 4.123106 8.944272 10.29563 Cluster 1

A7 (5,10) 5 0 1.414214 Cluster 2

A8 (4,9) 3.605551 1.414214 2.828427 Cluster 2

A9 (10,12) 10 5.385165 4.123106 Cluster 3

A10 (7,5) 5.09902 5.385165 6.082763 Cluster 1

A11 (9,11) 8.602325 4.123106 3 Cluster 3

A12 (4,6) 2 4.123106 5.385165 Cluster 1

A13 (3,10) 4.123106 2 3.162278 Cluster 2

A14 (3,8) 2.236068 2.828427 4.242641 Cluster 1

A15 (6,11) 6.403124 P


1i.y4u1s4h21
Kumar
4 Soni 0 Cluster 3 53
• Now, we will calculate the new centroid for each cluster.
• In cluster 1, we have 6 points i.e. A2 (2,6), A5 (6,4), A6 (1,2), A10 (7,5), A12 (4,6), A14
(3,8). To calculate the new centroid for cluster 1, we will find the mean of the x and y
coordinates of each point in the cluster. Hence, the new centroid for cluster 1 is
(3.833, 5.167).
• In cluster 2, we have 5 points i.e. A1 (2,10), A4 (6,9), A7 (5,10) , A8 (4,9), and A13
(3,10). Hence, the new centroid for cluster 2 is (4, 9.6)
• In cluster 3, we have 4 points i.e. A3 (11,11), A9 (10,12), A11 (9,11), and A15 (6,11).
Hence, the new centroid for cluster 3 is (9, 11.25).
• Now that we have calculated new centroids for each cluster, we will
calculate the distance of each data point from the new centroids. Then, we
will assign the points to clusters based on their distance from the centroids.

54
Distance from Centroid 1 Distance from centroid 2 (4, Distance from centroid 3 (9,
Point Assigned Cluster
(3.833, 5.167) 9.6) 11.25)

A1 (2,10) 5.169 2.040 7.111 Cluster 2

A2 (2,6) 2.013 4.118 8.750 Cluster 1

A3 (11,11) 9.241 7.139 2.016 Cluster 3

A4 (6,9) 4.403 2.088 3.750 Cluster 2

A5 (6,4) 2.461 5.946 7.846 Cluster 1

A6 (1,2) 4.249 8.171 12.230 Cluster 1

A7 (5,10) 4.972 1.077 4.191 Cluster 2

A8 (4,9) 3.837 0.600 5.483 Cluster 2

A9 (10,12) 9.204 6.462 1.250 Cluster 3

A10 (7,5) 3.171 5.492 6.562 Cluster 1

A11 (9,11) 7.792 5.192 0.250 Cluster 3

A12 (4,6) 0.850 3.600 7.250 Cluster 1

A13 (3,10) 4.904 1.077 6.129 Cluster 2

A14 (3,8) 2.953 1.887 6.824 Cluster 2

A15 (6,11) 6.223 2.441 3.010 Cluster 2

55
• Now, we have completed the second iteration of the k-means clustering
algorithm and assigned each point to an updated cluster.
• Now, we will calculate the new centroid for each cluster for the third iteration.
• In cluster 1, we have 5 points i.e. A2 (2,6), A5 (6,4), A6 (1,2), A10 (7,5), and A12 (4,6). To
calculate the new centroid for cluster 1, we will find the mean of the x and y coordinates of
each point in the cluster. Hence, the new centroid for cluster 1 is (4, 4.6).
• In cluster 2, we have 7 points i.e. A1 (2,10), A4 (6,9), A7 (5,10) , A8 (4,9), A13 (3,10), A14
(3,8), and A15 (6,11). Hence, the new centroid for cluster 2 is (4.143, 9.571)
• In cluster 3, we have 3 points i.e. A3 (11,11), A9 (10,12), and A11 (9,11). Hence, the new
centroid for cluster 3 is (10, 11.333).
• At this point, we have calculated new centroids for each cluster. Now, we will
calculate the distance of each data point from the new centroids. Then, we will
assign the points to clusters based on their distance from the centroids.

56
Distance from Centroid 1 (4, Distance from centroid Distance from centroid 3 (10,
Point Assigned Cluster
4.6) 2 (4.143, 9.571) 11.333)

A1 (2,10) 5.758 2.186 8.110 Cluster 2

A2 (2,6) 2.441 4.165 9.615 Cluster 1

A3 (11,11) 9.485 7.004 1.054 Cluster 3

A4 (6,9) 4.833 1.943 4.631 Cluster 2

A5 (6,4) 2.088 5.872 8.353 Cluster 1

A6 (1,2) 3.970 8.197 12.966 Cluster 1

A7 (5,10) 5.492 0.958 5.175 Cluster 2

A8 (4,9) 4.400 0.589 6.438 Cluster 2

A9 (10,12) 9.527 6.341 0.667 Cluster 3

A10 (7,5) 3.027 5.390 7.008 Cluster 1

A11 (9,11) 8.122 5.063 1.054 Cluster 3

A12 (4,6) 1.400 3.574 8.028 Cluster 1

A13 (3,10) 5.492 1.221 7.126 Cluster 2

A14 (3,8) 3.544 1.943 7.753 Cluster 2

A15 (6,11) 6.705 2.343 4.014 Cluster 2 57


• Now, we have completed the third iteration of the k-means clustering algorithm
and assigned each point into an updated cluster. In the above table, you can
observe that the point that is closest to the new centroid of a given cluster is
assigned to the cluster.
• Now, we will calculate the new centroid for each cluster for the third iteration.
• In cluster 1, we have 5 points i.e. A2 (2,6), A5 (6,4), A6 (1,2), A10 (7,5), and A12 (4,6). To
calculate the new centroid for cluster 1, we will find the mean of the x and y coordinates of
each point in the cluster. Hence, the new centroid for cluster 1 is (4, 4.6).
• In cluster 2, we have 7 points i.e. A1 (2,10), A4 (6,9), A7 (5,10) , A8 (4,9), A13 (3,10), A14
(3,8), and A15 (6,11). Hence, the new centroid for cluster 2 is (4.143, 9.571)
• In cluster 3, we have 3 points i.e. A3 (11,11), A9 (10,12), and A11 (9,11). Hence, the new
centroid for cluster 3 is (10, 11.333).
• Here, you can observe that no point has changed its cluster compared to the
previous iteration. Due to this, the centroid also remains constant. Therefore, we
will say that the clusters have been stabilized. Hence, the clusters obtained after
the third iteration are the final clusters made from the given dataset.
58
K=3

X Y
2 4
2 6
5 6
4 7
8 3
6 6
5 2
5 7
6 3
4 4

59
Density-Based Clustering Algorithms

• Partition-based and hierarchical clustering techniques are highly


efficient with normal shaped clusters
• For arbitrary shaped clusters or detecting outliers, density-based
techniques are more efficient

60
Density-Based Clustering Algorithms
• Sometimes data points are grouped/clustered in arbitrary shapes or
include outliers
• Density-based clustering algorithms are efficient at finding high-
density regions and outliers
• Sometimes it is required to detect outliers for some task, e.g.
anomaly detection
• One method is DBSCAN
• Density-Based Spatial Clustering of Applications with Noise (DBSCAN)
is a base algorithm for density-based clustering.
• It can discover clusters of different shapes and sizes from a large
amount of data, which contains noise and outliers.
• Based on the fact that a point belongs to a cluster if it is close to
many points from that cluster
• Two key parameters are
• eps: The distance that specifies the neighborhoods. Two
points are considered to be neighbors if the distance
between them are less than or equal to eps.
• minPts: Minimum number of data points to define a cluster 63
(in neighborhood)
• These parameters can be understood if we explore two concepts
called Density Reachability and Density Connectivity.
• Reachability in terms of density establishes a point to be reachable from another if
it lies within a particular distance (eps) from it.

• Connectivity, on the other hand, involves a transitivity-based chaining-approach


to determine whether points are located in a particular cluster.

• For example, p and q points could be connected if p->r->s->t->q, where a->b


means b is in the neighbourhood of a.

63
• There are three types of points after the DBSCAN clustering is
complete:

• Core — which have a sufficient number of neighbors within a


specified radius (eps). For each point in the dataset, count the
number of points within its eps neighborhood. If the count meets or
exceeds MinPts, mark the point as a core point.
• Border — This is a point that has at least one Core point at a distance eps.
• Noise/Outlier — This is a point that is neither a Core nor a Border. 64
• Algorithmic steps for DBSCAN clustering
• The algorithm proceeds by arbitrarily picking up a point in the dataset (until
all points have been visited).
• If there are at least ‘minPoint’ points within a radius of ‘ε (eps)’ to the point
then we consider all these points to be part of the same cluster.
• The clusters are then expanded by recursively repeating the neighbourhood
calculation for each neighbouring point

65
• For DBSCAN, the parameters ε and minPts are needed.

• minPts:
• As a rule of thumb, a minimum minPts can be derived from the number of
dimensions D in the data set, as minPts ≥ D + 1.
• The low value minPts = 1 does not make sense, as then every point on its own
will already be a cluster. With minPts ≤ 2, the result will be the same as of
hierarchical clustering with the single link metric.
• Therefore, minPts must be chosen at least 3. However, larger values are
usually better for data sets with noise and will yield more significant clusters.
As a rule of thumb, minPts = 2*dim can be used.
66
• ε:
• The value for ε can then be chosen by using a k-distance graph, plotting the
distance to the k = minPts-1 nearest neighbour ordered from the largest to
the smallest value.
• Good values of ε are where this plot shows an “elbow”: if ε is chosen much too
small, a large part of the data will not be clustered; whereas for a too-high value
of ε, clusters will merge and the majority of objects will be in the same cluster.

• In general, small values of ε are preferable, and as a rule of thumb, only a


small fraction of points should be within this distance of each other.

67
• K-means and Hierarchical Clustering both fail in creating clusters of
arbitrary shapes. They are not able to form clusters based on varying
densities. That’s why we need DBSCAN clustering.

68
69
70
• Use DBSCAN with ε = 2.0 and MinPts = 2 to determine:
• Core points
• Border points
• Noise points
• Final clusters

7071
• Step 1: Compute the Distance Matrix, we use the Euclidean distance
formula to compute pairwise distances:

72
Step 2: Identify Core Points (MinPts = 2)
A core point has at least MinPts = 2 points (including itself) within radius ε = 2.0.
• Q1: Neighbors within ε → Q2 (only 1 neighbor)
• Not core
• Q2: Neighbors within ε → Q1, Q3
• 2 neighbors → Core Point
• Q3: Neighbors within ε → Q2
• Only 1 neighbor → Not core
• Q4: Neighbors within ε → None
• Not core
• Q5: Neighbors within ε → None
• Not core
• So, Q2 is the only core point.

73
Step 3: Identify Border and Noise Points
• Q1: Not core but within ε of Q2 → Border Point
• Q3: Not core but within ε of Q2 → Border Point
• Q4: No neighbors within ε → Noise Point
• Q5: No neighbors within ε → Noise Point
Step 4: Form Final Clusters
• Q1, Q2, and Q3 belong to Cluster 1 (as Q2 is a core point, and Q1, Q3
are reachable).
• Q4 and Q5 are noise points (not part of any cluster). 74
75
76
77
78
Example
• We choose eps = 0.6 and MinPts =4

79
• Given the points A(3, 7), B(4, 6), C(5, 5), D(6, 4), E(7, 3), F(6, 2), G(7, 2)
and H(8, 4), Find the core points and outliers using DBSCAN. Take Eps
= 2.5 and MinPts = 3.

80
Dimensionality Reduction
• Dimensionality reduction is the process of reducing the number of
features (or dimensions) in a dataset while retaining as much
information as possible.
• This can be done for a variety of reasons, such as to reduce the
complexity of a model, to improve the performance of a learning
algorithm, or to make it easier to visualize the data.
• There are several techniques for dimensionality reduction, including
principal component analysis (PCA), singular value decomposition
(SVD), and linear discriminant analysis (LDA).
• Each technique uses a different method to project the data onto a
lower-dimensional space while preserving important information.
8081
• Some benefits of applying the dimensionality reduction technique to the
given dataset are given below:
• By reducing the dimensions of the features, the space required to store the dataset
also gets reduced.
• Less computation training time is required for reduced dimensions of features.
• Reduced dimensions of features of the dataset help in visualizing the data quickly.
• It removes the redundant features (if present) by taking care of multicollinearity.
• There are also some disadvantages of applying the dimensionality
reduction, which are given below:
• Some data may be lost due to dimensionality reduction.
• In the PCA dimensionality reduction technique, sometimes the principal components
required to consider are unknown.

82
Principle Component Analysis
• Principal Component Analysis (PCA) is a powerful technique used in data
analysis, particularly for reducing the dimensionality of datasets while
preserving crucial information.
• It does this by transforming the original variables into a set of new,
uncorrelated variables called principal components.
• Steps in PCA:
• Standardize the range of continuous initial variables
• Compute the covariance matrix to identify correlations
• Compute the eigenvectors and eigenvalues of the covariance matrix to identify the
principal components
• Create a feature vector to decide which principal components to keep
• Recast the data along the principal component axes

83
• There are as many principal components
as there are variables in the data.
• Principal components are constructed in
such a manner that the first principal
component accounts for the largest
possible variance in the data set.
• The second principal component is
calculated in the same way, with the
condition that it is uncorrelated with
(i.e., perpendicular to) the first principal
component and that it accounts for the
next highest variance.

84
STEP 1: STANDARDIZATION
• this step aims to standardize the range of the continuous initial variables so that each
one of them contributes equally to the analysis.
• It is critical to perform standardization before PCA, because it is quite sensitive regarding
the variances of the initial variables.
• If there are large differences between the ranges of initial variables, those variables with
larger ranges will dominate over those with small ranges (for example, a variable that
ranges between 0 and 100 will dominate over a variable that ranges between 0 and 1),
which will lead to biased results. So, transforming the data to comparable scales can
prevent this problem.
• Mathematically, this can be done by subtracting the mean and dividing by the standard
deviation for each value of each variable.

• Once the standardization is done, all the variables will be transformed to the same scale.

85
STEP 2: COVARIANCE MATRIX COMPUTATION
• this step aims to understand how the variables of the input dataset
vary from the mean with respect to each other, or in other words, to
see if there is any relationship between them.
• Because sometimes, variables are highly correlated in such a way that
they contain redundant information. So, to identify these
correlations, we compute the covariance matrix.
• The covariance matrix is a p × p symmetric matrix (where p is the
number of dimensions) that has as entries the covariances associated
with all possible pairs of the initial variables. For example, for a 3-
dimensional data set with 3 variables x, y, and z, the covariance
matrix is a 3×3 data matrix of this from:

86
87
STEP 3: COMPUTE THE EIGENVECTORS AND
EIGENVALUES OF THE COVARIANCE MATRIX TO
IDENTIFY THE PRINCIPAL COMPONENTS
• Eigenvectors and eigenvalues are the linear algebra concepts that we need
to compute from the covariance matrix to determine the principal
components of the data.
• Their number is equal to the number of dimensions of the data.
• It is eigenvectors and eigenvalues that are behind all the magic of principal
components because the eigenvectors of the Covariance matrix are the
directions of the axes where there is the most variance (most information)
and that we call Principal Components.
• Eigenvalues are simply the coefficients attached to eigenvectors, which give
the amount of variance carried in each Principal Component.
• By ranking your eigenvectors in order of their eigenvalues, highest to
lowest, you get the principal components in order of significance.

88
STEP 4: CREATE A FEATURE VECTOR
• The feature vector is simply a matrix that has as columns the
eigenvectors of the components that we decide to keep.
• This makes it the first step towards dimensionality reduction because
if we choose to keep only p eigenvectors (components) out of n, the
final data set will have only p dimensions.

89
STEP 5: RECAST THE DATA ALONG THE
PRINCIPAL COMPONENTS AXES
• In the previous steps, apart from standardization, you do not make
any changes to the data, you just select the principal components and
form the feature vector, but the input data set remains always in
terms of the original axes (i.e, in terms of the initial variables).
• In this step, the aim is to use the feature vector formed using the
eigenvectors of the covariance matrix, to reorient the data from the
original axes to the ones represented by the principal components
(hence the name Principal Components Analysis). This can be done by
multiplying the transpose of the original data set by the transpose of
the feature vector.

90

You might also like