CLUSTERING
ClusterAnalysis
Types of Data in Cluster Analysis
A Categorization of Major Clustering
Methods
Cluster Analysis
Cluster is a collection of data objects:
Similar to one another within the same cluster,
Dissimilar to the objects in other clusters.
Cluster analysis is an analysis used to finding similarities
between data according to the characteristics found in the
data and grouping similar data objects into clusters.
Cluster analysis is used to form groups or clusters of similar
records based on several measures made on these records.
Cluster is Unsupervised learning, i.e., no predefined classes.
This data has been applied in many areas, including
astronomy, archaeology, medicine, chemistry, education,
psychology, linguistics and sociology.
Application of Clustering
Use of Clustering in Medical Image
Database
Pattern Recognition
Spatial Data Analysis
Detect spatial clusters or for other
spatial mining tasks
Image Processing
Economic Science
WWW
Requirements of Clustering in Data Mining
The following are typical requirements of
clustering in data mining.
Scalability.
Ability to deal with different types of attributes.
Discovery of clusters with arbitrary shape.
Ability to deal with noisy data.
Minimal requirements for domain knowledge to
determine input parameters.
Insensitivity to the order of input records.
High dimensionality.
Constraint-based clustering.
Interpretability and usability.
TYPES OF DATA IN CLUSTER ANALYSIS
Types of data in cluster analysis are:
Interval-scaled variables. Example: salary, height.
Binary variables. Example: gender (Male/Female), has_cancer
(True/False)
Nominal (categorical) variables. Example: religion (Christian,
Muslim, Buddhist, Hindu, etc.)
Ordinal variables. Example: military-rank (soldier, sergeant,
captain, etc.)
Ratio-scaled variables. Example: Population growth (1, 10, 100,
1000, ...)
Variables of mixed types.
CATEGORISATION OF MAJOR CLUSTERING
METHODS
We can divide the clustering methods
into two main groups:
Hierarchical methods
Partitioning methods
Additional three main categories:
Density-based methods,
Model-based clustering
Grid-based methods.
Partitioning Methods
Definition: Given a database of n objects. A partition
method constructs k partitions of the data, where
each partition represents a cluster and k<=n.
It classifies the data into k groups, which satisfy the
following requirements,
1. Each group must contain at least one object.
2. Each object must belong to exactly one group.
Partitioning Algorithms:
K-means
K- medoids
K- means Partitioning
Each cluster is represented by the mean value of the objects
in the cluster. Hence, it is known as Centroid-Based
technique.
Working method:
First, it randomly selects k of the objects, each of which
initially represents a cluster mean.
For each of the remaining objects, an object is assigned
to the cluster to which it is most similar, based on the
distance between the object and the cluster mean.
Then, it compute the new mean for each cluster.
The above process will be continue until the criterion
function converges.
K-Means Clustering
k-means algorithm is implemented as below:
Input: Number of clusters – k, database of n objects
Output: Set of k clusters that minimize the squared error
Choose k objects as the initial cluster centers Repeat
(Re)assign each object to the cluster to which the object is most
similar based on the mean value of the objects in the cluster
Update the cluster means
Until no change
9
K-Means Clustering
1
0
K-Means Clustering Method
10 10
10
9 9
9
8 8
8
7 7
7
6 6
6
5 5
5
4 4
4
Assign 3
Update 3
3
2 each 2 the 2
1
objects
1
cluster 1
0 0
0
0 1 2 3 4 5 6 7 8 9 to 0 1
10
2 3 4 5 6 7 8 9 means 0 1 2 3 4 5 6 7 8 9
10
10
most
similar reassign reassign
center 10 10
K=2 9 9
8 8
Arbitrarily choose K 7 7
6 6
object as initial 5 5
cluster center 4 Update 4
3
the 3
2 2
1 cluster 1
0
0 1 2 3 4 5 6 7 8 9
means 0
0 1 2 3 4 5 6 7 8 9
10 10
9
K-Means Method
Strength:Relatively efficient: O(tkn), where n is # objects, k
is # clusters, and t is # iterations. Normally, k, t << n.
Comment: Often terminates at a local optimum. The global
optimum may be found using techniques such as: deterministic
annealing and genetic algorithms
Weakness
Applicable only when mean is defined – Categorical data
Need to specify k, the number of clusters, in advance
Unable to handle noisy data and outliers
Not suitable to discover clusters with non-convex shapes
1
2
Variations of the K-Means
Method
A few variants of the k-means which differ in
Selection of the initial k means
Dissimilarity calculations
Strategies to calculate cluster
means categorical data: k-modes
Handling
Replacing means of clusters with modes
Using new dissimilarity measures to deal with categorical objects
A mixture of categorical and numerical data: k-prototype method
Expectation Maximization
Assigns objects to clusters based on the probability of membership
Scalability of k-means
Compressible, Discardable, To be maintained in main memory
Clustering Features
1
3
Problem of the K-Means
Method
The k-means algorithm is sensitive to outliers
Since an object with an extremely large value may substantially distort the
distribution of the data.
K-Medoids: Instead of taking the mean value of the object in a cluster
as a reference point, medoids can be used, which is the most
centrally located object in a cluster.
10 10
9 9
8 8
7 7
6 6
5 5
4 4
3 3
2 2
1 1
0 0
0 1 2 3 4 5 6 7 8 9 10 0 1 2 3 4 5 6 7 8 9
10
1
4
K-Medoids Clustering
Method
PAM (Partitioning Around Medoids)
starts from an initial set of medoids and iteratively replaces one of the
medoids by one of the non-medoids if it improves the total distance of
the resulting clustering
All pairs are analyzed for replacement
PAM works effectively for small data sets, but does not scale well for
large data sets
CLARA
CLARANS
1
5
K-
Medoids
Input: k, and database of n objects
Output: A set of k clusters
Method:
Arbitrarily choose k objects as initial medoids
Repeat
Assign each remaining object to cluster with nearest medoid
Randomly select a non-medoid oarndom
Compute cost S of swapping oj with oarndom
If S < 0 swap to form new set of k medoids
Until no change
Working Principle: Minimize sum of the dissimilarities between each object and its
corresponding reference point. That is, an absolute-error criterion is used
E k |po |
j 1 pCj j
1
6
K-
medoids
Case 1: p currently belongs to medoid oj. If oj is replaced by orandomas a medoid and p is
closest to one of oi where i < > j then p is reassigned to oi.
Case 2: p currently belongs to medoid oj. If oj is replaced by orandomas a medoid and p is
closest to orandomthen p is reassigned to orandom.
Case 3: p currently belongs to medoid oi (i< >j) If oj is replaced by oarndomas a medoid
and p is still closest to oi assignment does not change
Case 4: p currently belongs to medoid oi (i < > j). If oj is replaced by orandomas a medoid
and p is closest to orandomthen p is reassigned to orandom.
1
7
K-
medoids
After reassignment difference in squared
error E is calculated. Total cost of
swapping – Sum of costs incurred by all
non-medoid objects
If total cost is negative, o is replaced with o as
j random
E will be reduced
1
8
K-medoids
Algorithm
1
9
Hierarchical Methods
Agglomerative hierarchical clustering:
Each object initially represents a
cluster of its own. Then clusters are
successively merged until the desired
cluster structure is obtained.
Divisive hierarchical clustering:
All objects initially belong to one
cluster. Then the cluster is divided into sub-
clusters, which are successively divided
into their own sub-clusters. This process
continues until the desired cluster structure
Hierarchical Methods
The hierarchical clustering methods could be further
divided according to the manner that the similarity
measure is calculated:
Single-link clustering (also called the
connectedness, the minimum method or the nearest
neighbour method)
Complete-link clustering (also called the diameter,
the maximum method or the furthest neighbour
method)
Average-link clustering (also called minimum
variance method)
Density – Based Clustering Method
This method is used to discover clusters with
arbitrary [Link] are three methods,
DBSCAN: grows clusters according to a
density-based connectivity analysis.
OPTICS: produce cluster obtained from a
wide range of parameter settings.
DENCLUE: clusters objects based on a set of
density distribution functions.
Grid Based Clustering
It uses multiresolution grid data structure.
It quantizes the object space into a finite number of
cells, that form a grid structure.
In that grid structure, all clustering operations are
performed.
Approaches of Grid-based clustering:
STING: which explores statistical information stored in
the grid cells.
WAVE CLUSTER: wavelet transform method used to
cluster objects.
CLIQUE: represents a grid & density based approach
for clustering in high-dimensional data space.
STING: STatistical Information Grid
In this technique, the spatial area is divided
into rectangular cells.
Each cell at high level is partitioned to form
a number of cells at next lower level.
Statistical information regarding the
attributes in each grid cell is precomputed
and stored.
These statistical parameters are useful for
query processing.
Wave Cluster
It is a multiresolution clustering algorithm.
It first summarizes the data on to the data
space.
Then it transform the original feature space,
to find dense region by using wavelet
transformation.