0% found this document useful (0 votes)
12 views10 pages

Understanding Clustering in Data Mining

Clustering in data mining is the process of grouping similar objects into clusters to understand data distribution and structure. It involves various methods such as partitioning, hierarchical, density-based, grid-based, model-based, and constraint-based methods, each with its own advantages and challenges. Key requirements for effective clustering include scalability, ability to handle different data types, and interpretability of results.

Uploaded by

gmeghanareddy27
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
12 views10 pages

Understanding Clustering in Data Mining

Clustering in data mining is the process of grouping similar objects into clusters to understand data distribution and structure. It involves various methods such as partitioning, hierarchical, density-based, grid-based, model-based, and constraint-based methods, each with its own advantages and challenges. Key requirements for effective clustering include scalability, ability to handle different data types, and interpretability of results.

Uploaded by

gmeghanareddy27
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd

What is clustering in Data Mining?

o Clustering is the method of converting a group of abstract objects into classes of similar
objects.

o Clustering is a method of partitioning a set of data or objects into a set of significant


subclasses called clusters.

o It helps users to understand the structure or natural grouping in a data set and used either
as a stand-alone instrument to get a better insight into data distribution or as a pre-processing
step for other algorithms

What is a Cluster?

o A cluster is a subset of similar objects

o A subset of objects such that the distance between any of the two objects in the cluster is
less than the distance between any object in the cluster and any object that is not located
inside it.

o A connected region of a multidimensional space with a comparatively high density of objects

Requirements of Clustering in Data Mining


Clustering analysis has been an evolving problem in data mining due to its variety of
applications. The advent of various data clustering tools in the last few years and their
comprehensive use in a broad range of applications, including image processing,
computational biology, mobile communication, medicine, and economics, must contribute to
the popularity of these algorithms. The main issue with the data clustering algorithms is that it
cannot be standardized. The advanced algorithm may give the best results with one type of
data set, but it may fail or perform poorly with other kinds of data set. Although many efforts
have been made to standardize the algorithms that can perform well in all situations, no
significant achievement has been achieved so far. Many clustering tools have been proposed
so far. However, each algorithm has its advantages or disadvantages and cant work on all real
situations.

1. Scalability: Scalability in clustering implies that as we boost the amount of data objects,
the time to perform clustering should approximately scale to the complexity order of
the algorithm. For example, if we perform K- means clustering, we know it is O(n),
where n is the number of objects in the data. If we raise the number of data objects 10
folds, then the time taken to cluster them should also approximately increase 10 times.
It means there should be a linear relationship. If that is not the case, then there is some
error with our implementation process.
2. Ability to deal with different kinds of attributes − Algorithms should be capable to be
applied on any kind of data such as interval-based (numerical) data, categorical, and
binary data.
3. Discovery of clusters with attribute shape − The clustering algorithm should be
capable of detecting clusters of arbitrary shape. They should not be bounded to only
distance measures that tend to find spherical cluster of small sizes.
4. High dimensionality − The clustering algorithm should not only be able to handle
low-dimensional data but also the high dimensional space.
5. Ability to deal with noisy data − Databases contain noisy, missing or erroneous data.
Some algorithms are sensitive to such data and may lead to poor quality clusters.
6. Interpretability − The clustering results should be interpretable, comprehensible

Clustering Methods
Clustering methods can be classified into the following categories –
• Partitioning Method
• Hierarchical Method
• Density-based Method
• Grid-Based Method
• Model-Based Method
• Constraint-based Method
Partitioning Method
Suppose we are given a database of ‘n’ objects and the partitioning method constructs
‘k’ partition of data. Each partition will represent a cluster and k ≤ n. It means that it will
classify the data into k groups, which satisfy the following requirements − Each group
contains at least one object. Each object must belong to exactly one group. Points to
remember − For a given number of partitions (say k), the partitioning method will create
an initial partitioning. Then it uses the iterative relocation technique to improve the
partitioning by moving objects from one group to other.
Hierarchical Methods
This method creates a hierarchical decomposition of the given set of data objects. We
can classify hierarchical methods on the basis of how the hierarchical decomposition
is formed. There are two approaches here –
• Agglomerative Approach
• Divisive Approach
Agglomerative Approach This approach is also known as the bottom-up approach.
In this, we start with each object forming a separate group. It keeps on merging the
objects or groups that are close to one another. It keep on doing so until all of the
groups are merged into one or until the termination condition holds.
Divisive Approach This approach is also known as the top-down approach. In this,
we start with all of the objects in the same cluster. In the continuous iteration, a cluster
is split up into smaller clusters. It is down until each object in one cluster or the
termination condition holds. This method is rigid, i.e., once a merging or splitting is
done, it can never be undone. Approaches to Improve Quality of Hierarchical
Clustering Here are the two approaches that are used to improve the quality of
hierarchical clustering Perform careful analysis of object linkages at each hierarchical
partitioning. Integrate hierarchical agglomeration by first using a hierarchical
agglomerative algorithm to group objects into micro-clusters, and then performing
macro-clustering on the micro-clusters.
Density-based Method
• Most partitioning methods cluster objects based on the distance between
objects. Such methods can find only spherical-shaped clusters and encounter
difficulty at discovering clusters of arbitrary shapes.
• Other clustering methods have been developed based on the notion of density.
Their general idea is to continue growing the given cluster as long as the
density in the neighborhood exceeds some threshold; that is, for each data
point within a given cluster, the neighborhood of a given radius has to contain
at least a minimum number of points. Such a method can be used to filter out
noise (outliers)and discover clusters of arbitrary shape.
• DBSCAN and its extension, OPTICS, are typical density-based methods that
growclusters according to a density-based connectivity analysis. DENCLUE is
a methodthat clusters objects based on the analysis of the value distributions
of density functions
Grid-based Method
• Grid-based methods quantize the object space into a finite number of cells that
form a grid structure.
• All of the clustering operations are performed on the grid structure i.e., on the
quantized space. The main advantage of this approach is its fast processing
time, which is typically independent of the number of data objects and
dependent only on the number of cells in each dimension in the quantized
space.
• STING is a typical example of a grid-based method. Wave Cluster applies
wavelet transformation for clustering analysis and is both grid-based and
density-based.
Model-based methods
• Model-based methods hypothesize a model for each of the clusters and find
the best fit of the data to the given model.
• A model-based algorithm may locate clusters by constructing a density function
that reflects the spatial distribution of the data points.
• It also leads to a way of automatically determining the number of clusters based
on standard statistics, taking ―noise‖ or outliers into account and thus yielding
robust clustering methods.
Constraint-based Method
Model-based methods hypothesize a model for each of the clusters and find the best
fit of the data to the given model.
A model-based algorithm may locate clusters by constructing a density function that
reflects the spatial distribution of the data points.
It also leads to a way of automatically determining the number of clusters based on
standard statistics, taking ―noise‖ or outliers into account and thus yielding robust
clustering methods.

.
Partitioning approach:
Construct various partitions and then evaluate them by some criterion, e.g., minimizing
the sum of square errors
Typical methods:
1.k-means Clustering
2.k-medoids Clustering
1. K-Means Clustering Algorithm

K-Means clustering is an unsupervised iterative clustering technique.


• It partitions the given data set into k predefined distinct clusters.
• A cluster is defined as a collection of data points exhibiting certain similarities

It partitions the data set such that-


• Each data point belongs to a cluster with the nearest mean.
• Data points belonging to one cluster have high degree of similarity.
• Data points belonging to different clusters have high degree of dissimilarity.
K-Means Clustering Algorithm
K-Means Clustering Algorithm involves the following steps
Step-01: • Choose the number of clusters K.
Step-02: • Randomly select any K data points as cluster centres.
• Select cluster centres in such a way that they are as farther as possible from each
other.
Step-03:
Calculate the distance between each data point and each cluster centre.
• The distance may be calculated either by using given distance function or by using
Euclidean distance formula.
Step-04: • Assign each data point to some cluster.
• A data point is assigned to that cluster whose centre is nearest to that data point.
Step-05: • Re-compute the centre of newly formed clusters.
• The centre of a cluster is computed by taking mean of all the data points contained
in that cluster.
Step-06: Keep repeating the procedure from Step-03 to Step-05 until any of the
following stopping criteria is met-
• Centre of newly formed clusters do not change
• Data points remain present in the same cluster

Maximum number of iterations are reached

Advantages

The following are some advantages of K-Means clustering algorithms −

• It is very easy to understand and implement.


• If we have large number of variables then, K-means would be faster than Hierarchical
clustering.
• On re-computation of centroids, an instance can change the cluster.
• Tighter clusters are formed with K-means as compared to Hierarchical clustering.

Disadvantages

The following are some disadvantages of K-Means clustering algorithms −


• It is a bit difficult to predict the number of clusters i.e. the value of k.
• Output is strongly impacted by initial inputs like number of clusters (value of k).
• Order of data will have strong impact on the final output.
• It is very sensitive to rescaling. If we will rescale our data by means of normalization
or standardization, then the output will completely change. Final output.
• It is not good in doing clustering job if the clusters have a complicated geometric shape.

K-Means Additional Issues

There are various issues of the K-Means Algorithm which are as follows −

• Handling Empty Clusters − The first issue with the basic K-means algorithm given
prior is that null clusters can be acquired if no points are allocated to a cluster during
the assignment phase. If this occurs, then a method is needed to choose a
replacement centroid, because the squared error will be larger than necessary.
• One method is to select the point that is farthest away from some recent centroid. If
this removes the point that currently contributes some total squared error. Another
method is to select the replacement centroid from the cluster that has the largest SSE.
This will generally divide the cluster and decrease the complete SSE of the clustering.
If there are multiple null clusters, then this process can be repeated multiple times.
• Outliers − When the squared error method is used, outliers can unduly tend to the
clusters that are discovered. In specific, when outliers are present, the resulting cluster
centroids (prototypes) cannot be as representative as they can be, and thus, the SSE
will be higher as well.
• It is beneficial to find outliers and remove them beforehand. It is essential to appreciate
that there are specific clustering applications for which outliers should not be removed.
When clustering is used for data compression, each point should be clustered, and in
some cases, including financial analysis, probable outliers, e.g.,unusually profitable
users, can be the interesting points.
• Reducing the SSE with Postprocessing − The method to reduce the SSE is to find
more clusters, i.e., to need a larger K. In such cases, it is likely to improve the SSE,
but don't require to increase the number of clusters. This is possible because Kmeans
generally converge to a local minimum.
• Various methods are used to "fix-up" the resulting clusters to make a clustering that
has lower SSE. The method is to target on individual clusters because the complete
SSE is easily the total of the SSE contributed by every cluster. It can change the total
SSE by implementing several operations on the clusters, including splitting or merging
clusters.
• One method is to use an alternate cluster splitting and merging procedure. During a
splitting procedure, clusters are divided, while during a merging procedure, clusters
are combined. In this method, it is accessible to withdrawal local SSE minima and
create a clustering solution with the seized number of clusters. The following are some
methods used in the splitting and merging phases which are as follows −

Bisecting k-means Algorithm


Bisecting k-means is a hybrid approach between Divisive Hierarchical Clustering (top
down clustering) and K-means Clustering. Instead of partitioning the data set into K
clusters in each iteration, bisecting k-means algorithm splits one cluster into two sub
clusters at each bisecting step (by using k-means) until k clusters are obtained.
Divisive Hierarchical Clustering

Divisive Clustering starts at the top level with a single cluster and divides it towards the
bottom level. In order to decide which clusters should be split, a measure of
dissimilarity between sets of observations is required. This is achieved by using a
measure of distance between pairs of observations.
How Bisecting K-means Work
1. Set K to define the number of cluster
2. Set all data as a single cluster
3. Use K-means with K=2 to split the cluster

4. Measure the distance for each intra cluster.


- Sum of square Distance

5. Select the cluster that have the largest distance and split to 2 cluster using K-means.
6. Repeat step 3–5 until the number of leaf cluster = K.

Final cluster is cluster [C,D,E,F]

Advantage of Using B-Kmeans over Kmeans


1. Bisecting k-means is more efficient when K is large.
For the kmeans algorithm, the computation involves every data point of the data set
and k centroids. On the other hand, in each Bisecting step of Bisecting k-means, only
the data points of one cluster and two centroids are involved in the computation. Thus,
the computation time is reduced.
2. Bisecting k-means produce clusters of similar sizes, while k-means is known to
produce clusters of widely different sizes.

You might also like