0% found this document useful (0 votes)
18 views9 pages

Clustering Techniques in Machine Learning

Clustering is an unsupervised machine learning technique that groups unlabelled datasets into clusters based on similarities in data points. It is commonly used for tasks such as market segmentation and anomaly detection, with various methods including K-Means, Hierarchical, and Density-Based clustering. The K-Means algorithm specifically partitions data into K predefined clusters by iteratively assigning data points to the nearest centroid and recalculating centroids until convergence.

Uploaded by

suman.struc
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
18 views9 pages

Clustering Techniques in Machine Learning

Clustering is an unsupervised machine learning technique that groups unlabelled datasets into clusters based on similarities in data points. It is commonly used for tasks such as market segmentation and anomaly detection, with various methods including K-Means, Hierarchical, and Density-Based clustering. The K-Means algorithm specifically partitions data into K predefined clusters by iteratively assigning data points to the nearest centroid and recalculating centroids until convergence.

Uploaded by

suman.struc
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd

Clustering in Machine Learning

Clustering or cluster analysis is a machine learning technique, which groups the unlabelled dataset. It
can be defined as "A way of grouping the data points into different clusters, consisting of
similar data points. The objects with the possible similarities remain in a group that has less or
no similarities with another group."

It does it by finding some similar patterns in the unlabelled dataset such as shape, size, color,
behavior, etc., and divides them as per the presence and absence of those similar patterns.

It is an unsupervised learning method, hence no supervision is provided to the algorithm, and it deals
with the unlabeled dataset.

After applying this clustering technique, each cluster or group is provided with a cluster-ID. ML
system can use this id to simplify the processing of large and complex datasets.

The clustering technique is commonly used for statistical data analysis.

Note: Clustering is somewhere similar to the classification algorithm , but the difference is the type of
dataset that we are using. In classification, we work with the labeled data set, whereas in clustering, we
work with the unlabelled dataset.

Example: Let's understand the clustering technique with the real-world example of Mall: When we
visit any shopping mall, we can observe that the things with similar usage are grouped together. Such
as the t-shirts are grouped in one section, and trousers are at other sections, similarly, at vegetable
sections, apples, bananas, Mangoes, etc., are grouped in separate sections, so that we can easily find
out the things. The clustering technique also works in the same way. Other examples of clustering
are grouping documents according to the topic.

The clustering technique can be widely used in various tasks. Some most common uses of this
technique are:

o Market Segmentation
o Statistical data analysis
o Social network analysis
o Image segmentation
o Anomaly detection, etc.

The below diagram explains the working of the clustering algorithm. We can see the different fruits
are divided into several groups with similar properties.
Types of Clustering Methods
The clustering methods are broadly divided into Hard clustering (datapoint belongs to only one
group) and Soft Clustering (data points can belong to another group also). But there are also other
various approaches of Clustering exist. Below are the main clustering methods used in Machine
learning:

1. Partitioning Clustering
2. Density-Based Clustering
3. Distribution Model-Based Clustering
4. Hierarchical Clustering
5. Fuzzy Clustering

Basically, there are two types of hierarchical cluster analysis strategies –


1. Agglomerative Clustering: Also known as bottom-up approach or hierarchical
agglomerative clustering (HAC). A structure that is more informative than the unstructured set
of clusters returned by flat clustering. This clustering algorithm does not require us to
prespecify the number of clusters. Bottom-up algorithms treat each data as a singleton cluster
at the outset and then successively agglomerates pairs of clusters until all clusters have been
merged into a single cluster that contains all data.
Algorithm :
given a dataset (d1, d2, d3, ....dN) of size N
# compute the distance matrix
for i=1 to N:
# as the distance matrix is symmetric about
# the primary diagonal so we compute only lower
# part of the primary diagonal
for j=1 to i:
dis_mat[i][j] = distance[di, dj]
each data point is a singleton clusterrepeat
merge the two cluster having minimum distance
update the distance matrixuntil only a single cluster remains

Python implementation of the above algorithm using the scikit-learn library:

 Python3
from [Link] import AgglomerativeClustering
import numpy as np

# randomly chosen dataset


X = [Link]([[1, 2], [1, 4], [1, 0],
[4, 2], [4, 4], [4, 0]])

# here we need to mention the number of clusters


# otherwise the result will be a single cluster
# containing all the data
clustering = AgglomerativeClustering(n_clusters = 2).fit(X)

# print the class labels


print(clustering.labels_)

Output :
[1, 1, 1, 0, 0, 0]
2. Divisive clustering: Also known as a top-down approach. This algorithm also does not
require to prespecify the number of clusters. Top-down clustering requires a method for
splitting a cluster that contains the whole data and proceeds by splitting clusters recursively
until individual data have been split into singleton clusters.
Algorithm :
given a dataset (d1, d2, d3, ....dN) of size N
at the top we have all data in one cluster
the cluster is split using a flat clustering method eg. K-Means etcrepeat
choose the best cluster among all the clusters to split
split that cluster by the flat clustering algorithmuntil each data is in its own singleton cluster

Hierarchical Agglomerative vs Divisive clustering –


a. Divisive clustering is more complex as compared to agglomerative clustering, as in the
case of divisive clustering we need a flat clustering method as “subroutine” to split each
cluster until we have each data having its own singleton cluster.
b. Divisive clustering is more efficient if we do not generate a complete hierarchy all the way
down to individual data leaves. The time complexity of a naive agglomerative clustering
is O(n3) because we exhaustively scan the N x N matrix dist_mat for the lowest distance in
each of N-1 iterations. Using priority queue data structure we can reduce this complexity
to O(n2logn). By using some more optimizations it can be brought down to O(n2).
Whereas for divisive clustering given a fixed number of top levels, using an efficient flat
algorithm like K-Means, divisive algorithms are linear in the number of patterns and
clusters.
c. A divisive algorithm is also more accurate. Agglomerative clustering makes decisions by
considering the local patterns or neighbor points without initially taking into account the
global distribution of data. These early decisions cannot be undone. whereas divisive
clustering takes into consideration the global distribution of data when making top-level
partitioning decisions.
K-Means Clustering Algorithm
K-Means Clustering is an unsupervised learning algorithm that is used to solve the clustering
problems in machine learning or data science. In this topic, we will learn what is K-means clustering
algorithm, how the algorithm works, along with the Python implementation of k-means clustering.

What is K-Means Algorithm?


K-Means Clustering is an Unsupervised Learning algorithm, which groups the unlabeled dataset into
different clusters. Here K defines the number of pre-defined clusters that need to be created in the
process, as if K=2, there will be two clusters, and for K=3, there will be three clusters, and so on.

It is an iterative algorithm that divides the unlabeled dataset into k different


clusters in such a way that each dataset belongs only one group that has similar
properties.

It allows us to cluster the data into different groups and a convenient way to discover the categories
of groups in the unlabeled dataset on its own without the need for any training.

It is a centroid-based algorithm, where each cluster is associated with a centroid. The main aim of
this algorithm is to minimize the sum of distances between the data point and their corresponding
clusters.

The algorithm takes the unlabeled dataset as input, divides the dataset into k-number of clusters, and
repeats the process until it does not find the best clusters. The value of k should be predetermined in
this algorithm.

The k-means clustering algorithm mainly performs two tasks:

o Determines the best value for K center points or centroids by an iterative process.
o Assigns each data point to its closest k-center. Those data points which are near to the
particular k-center, create a cluster.

Hence each cluster has datapoints with some commonalities, and it is away from other clusters.

The below diagram explains the working of the K-means Clustering Algorithm:
How does the K-Means Algorithm Work?
The working of the K-Means algorithm is explained in the below steps:

Step-1: Select the number K to decide the number of clusters.

Step-2: Select random K points or centroids. (It can be other from the input dataset).

Step-3: Assign each data point to their closest centroid, which will form the predefined K clusters.

Step-4: Calculate the variance and place a new centroid of each cluster.

Step-5: Repeat the third steps, which means reassign each datapoint to the new closest centroid of
each cluster.

Step-6: If any reassignment occurs, then go to step-4 else go to FINISH.

Step-7: The model is ready.

Let's understand the above steps by considering the visual plots:

Suppose we have two variables M1 and M2. The x-y axis scatter plot of these two variables is given
below:

o Let's take number k of clusters, i.e., K=2, to identify the dataset and to put them into different
clusters. It means here we will try to group these datasets into two different clusters.
o We need to choose some random k points or centroid to form the cluster. These points can be
either the points from the dataset or any other point. So, here we are selecting the below two
points as k points, which are not the part of our dataset. Consider the below image:

o Now we will assign each data point of the scatter plot to its closest K-point or centroid. We
will compute it by applying some mathematics that we have studied to calculate the distance
between two points. So, we will draw a median between both the centroids. Consider the
below image:

From the above image, it is clear that points left side of the line is near to the K1 or blue centroid,
and points to the right of the line are close to the yellow centroid. Let's color them as blue and yellow
for clear visualization.

o As we need to find the closest cluster, so we will repeat the process by choosing a new
centroid. To choose the new centroids, we will compute the center of gravity of these
centroids, and will find new centroids as below:
o Next, we will reassign each datapoint to the new centroid. For this, we will repeat the same
process of finding a median line. The median will be like below image:

From the above image, we can see, one yellow point is on the left side of the line, and two blue
points are right to the line. So, these three points will be assigned to new centroids.

As reassignment has taken place, so we will again go to the step-4, which is finding new centroids or
K-points.

We will repeat the process by finding the center of gravity of centroids, so the new centroids will be
as shown in the below image:
o As we got the new centroids so again will draw the median line and reassign the data points.
So, the image will be:

o We can see in the above image; there are no dissimilar data points on either side of the line,
which means our model is formed. Consider the below image:

As our model is ready, so we can now remove the assumed centroids, and the two final clusters will
be as shown in the below image:

Common questions

Powered by AI

Agglomerative clustering, known as the bottom-up approach, starts with each data point as a singleton cluster and merges them based on similarity until a single cluster is formed. This method does not require specifying the number of clusters but is computationally intensive with a naive complexity of O(n^3), which can be optimized to O(n^2logn). Divisive clustering, the top-down approach, starts with one cluster containing all data, which is recursively split using a flat clustering method until individual data points remain. Although more complex due to the need for a flat clustering method, divisive clustering considers global data distribution, potentially leading to more accurate clustering .

The choice between hard and soft clustering can impact clustering tasks depending on the nature of the data and the specific goals. Hard clustering assigns each data point exclusively to one cluster, which is suitable when clear separations exist. Soft clustering allows data points to belong to multiple clusters, making it ideal for overlapping data or when membership uncertainty exists, such as in gene expression analysis or customer profiling where behaviors vary across categories. The choice affects the granularity and interpretability of the resultant clusters significantly .

The K-Means Clustering Algorithm may be unsuitable for datasets with non-convex shapes, varied cluster sizes, or noise, as it is based on Euclidean distance, which assumes spherical clusters. It struggles with outliers due to sensitivity in centroid calculation, and its requirement to predefine K limits its flexibility. Alternatives like DBSCAN, which is density-based and handles arbitrary shapes, or Hierarchical Clustering, allowing more complex structures, can be more adept for such datasets .

The initial choice of centroids in the K-Means Clustering algorithm greatly influences the final clusters because it dictates the algorithm's convergence path. Poorly chosen initial centroids can lead to suboptimal clusters or increased convergence time. Strategies such as the K-Means++ initialization improve results by spreading out initial centroids based on data distribution, reducing the likelihood of poor clustering outcomes by ensuring that initial centroids are further apart .

The K-Means Clustering Algorithm requires the number of clusters (K) to be specified prior to processing, which is a limitation. Choosing the correct number of clusters is crucial because it directly influences the model's performance. If the K value is too low, it may cluster dissimilar data points together, while if too high, it may separate similar data too much. The challenge lies in determining the optimal K value without prior knowledge of the data's natural clustering, which often requires methods like the elbow method or domain expertise .

Advantages of hierarchical clustering include its ability to create a dendrogram representing nested clusters, offering intuitive visualization and flexible data partitioning without prior knowledge of the number of clusters. It handles noise better than K-means and can provide insights into the hierarchical relationships within data. However, it is computationally intensive with high time complexity, especially agglomerative clustering, making it less suitable for large datasets. The inability to revisit previous decisions once made can also lead to inaccuracies in certain data distributions .

Distance metrics are pivotal in the clustering process as they define the algorithm's mechanism for assessing similarity between data points. Euclidean distance is commonly used due to its simplicity but assumes spherical clusters, which might not be ideal for all datasets. Alternative metrics like Manhattan, Mahalanobis, or Cosine distance can yield different cluster shapes or separations, impacting cluster compactness and separation drastically. The choice of distance metric influences convergence behavior and eventual accuracy of the clustering outcome, necessitating careful consideration based on the specific dataset characteristics .

Market segmentation involves dividing a broad market into subsets of consumers with common needs or characteristics, closely aligning with clustering in machine learning, which groups similar data points. Clustering algorithms, such as K-Means or hierarchical methods, can automate market segmentation by analyzing customer data to identify distinct groups. This enables businesses to target specific segments with tailored marketing strategies, optimizing resource allocation and enhancing customer satisfaction .

In agglomerative clustering, a distance matrix is calculated and utilized to determine the proximity between all pairs of data points. This matrix guides the merging process of clusters by identifying the closest pairs. The distance matrix is symmetric, allowing only the computation of its lower half, which reduces computational waste. The need to update and scan this matrix iteratively in each clustering step contributes significantly to the algorithm's initial O(n^3) complexity, which through optimizations like using a priority queue, can be reduced to O(n^2logn).

Implementing agglomerative clustering with a priority queue optimizes time complexity by efficiently managing pairwise distance evaluations. The priority queue allows quick access to the smallest current pair of clusters for merging. By limiting the number of operations required to update and find minimum distances across recursive iterations, it reduces the complexity from O(n^3) to O(n^2logn), significantly enhancing performance for larger datasets .

You might also like